# Predicting Car Prices with K-Nearest Neighbors

This project stems from the 'Data Scientist' path on [Dataquest](https://www.dataquest.io/). The data used can be found [here](https://archive.ics.uci.edu/ml/datasets/automobile). 

Date of completion: 05/29/2020

In [1]:
from google.colab import files
uploaded = files.upload()

Saving imports-85.data to imports-85 (2).data


In [0]:
import io

import pandas as pd 

import numpy as np

cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

cars = pd.read_csv((io.BytesIO(uploaded['imports-85.data'])), names=cols)

In [3]:
cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


After inspecting each column, we keep the following continuous numeric columns, since k-nearest neighbors relies on numeric distances between rows: 

'normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price'
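
One way to double-check which columns are numeric is to try parsing each one and see whether every non-missing value converts. The sketch below runs on a small made-up frame; the `sample` data and the `numeric_like` helper are hypothetical names used only for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking the raw file: '?' marks a missing value (assumed data).
sample = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "horsepower": ["102", "?", "115"],
    "wheel-base": [99.8, 101.2, 99.4],
})

def numeric_like(df, missing_marker="?"):
    """Return the columns whose non-missing values all parse as numbers."""
    keep = []
    for col in df.columns:
        # Ignore the missing-value marker before trying to parse.
        values = df[col][df[col] != missing_marker]
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().all():
            keep.append(col)
    return keep

print(numeric_like(sample))  # ['horsepower', 'wheel-base']
```

A check like this only says which columns can be parsed as numbers; deciding which of them are meaningful features is still a judgment call.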

In [4]:
num_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

numeric_cars = cars[num_cols]

numeric_cars.head()


Unnamed: 0,normalized-losses,wheel-base,length,width,height,curb-weight,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,?,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,13495
1,?,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,16500
2,?,94.5,171.2,65.5,52.4,2823,2.68,3.47,9.0,154,5000,19,26,16500
3,164,99.8,176.6,66.2,54.3,2337,3.19,3.4,10.0,102,5500,24,30,13950
4,164,99.4,176.6,66.4,54.3,2824,3.19,3.4,8.0,115,5500,18,22,17450


# 1. Data Cleaning

In [5]:
numeric_cars = numeric_cars.replace('?',np.nan)

numeric_cars.head()

Unnamed: 0,normalized-losses,wheel-base,length,width,height,curb-weight,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,13495
1,,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,16500
2,,94.5,171.2,65.5,52.4,2823,2.68,3.47,9.0,154,5000,19,26,16500
3,164.0,99.8,176.6,66.2,54.3,2337,3.19,3.4,10.0,102,5500,24,30,13950
4,164.0,99.4,176.6,66.4,54.3,2824,3.19,3.4,8.0,115,5500,18,22,17450


In [6]:
numeric_cars = numeric_cars.astype(float)

numeric_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   normalized-losses  164 non-null    float64
 1   wheel-base         205 non-null    float64
 2   length             205 non-null    float64
 3   width              205 non-null    float64
 4   height             205 non-null    float64
 5   curb-weight        205 non-null    float64
 6   bore               201 non-null    float64
 7   stroke             201 non-null    float64
 8   compression-rate   205 non-null    float64
 9   horsepower         203 non-null    float64
 10  peak-rpm           203 non-null    float64
 11  city-mpg           205 non-null    float64
 12  highway-mpg        205 non-null    float64
 13  price              201 non-null    float64
dtypes: float64(14)
memory usage: 22.5 KB


In [7]:
numeric_cars.isnull().sum()

normalized-losses    41
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

Since 'price' is our target column, we drop every row where it is missing.

In [8]:
numeric_cars = numeric_cars.dropna(subset=['price'])

numeric_cars.isnull().sum()

normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64

In [9]:
numeric_cars = numeric_cars.fillna(numeric_cars.mean())

numeric_cars.isnull().sum()

normalized-losses    0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64

In [0]:
price_col = numeric_cars['price']

numeric_cars = (numeric_cars - numeric_cars.min()) / (numeric_cars.max() - numeric_cars.min())

numeric_cars['price'] = price_col
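
The cell above applies min-max normalization, (x - min) / (max - min), so that every feature lies between 0 and 1, and then restores the unscaled 'price' column. Because k-nearest neighbors compares rows by distance, features with large raw ranges would otherwise dominate. A minimal sketch on made-up values (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up values, chosen so the scaled result is easy to verify by hand.
toy = pd.DataFrame({
    "horsepower": [50.0, 100.0, 150.0],
    "price": [5000.0, 10000.0, 20000.0],
})

# Scale only the features; the target keeps its original units.
features = toy.drop(columns="price")
scaled = (features - features.min()) / (features.max() - features.min())
scaled["price"] = toy["price"]

print(scaled)
```

Here 'horsepower' becomes 0.0, 0.5 and 1.0, while 'price' stays in its original units so the RMSE remains interpretable.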

# 2. Univariate Model

In [0]:
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error

def knn_train_test(train_col, target_col, df):
  knn = KNeighborsRegressor()
  np.random.seed(1)

  # Randomize order of rows in dataframe
  shuffled_index = np.random.permutation(df.index)
  rand_df = df.reindex(shuffled_index)

  # Setting the cut for train and test sets
  last_train_row = int(len(df) / 2)

  # First half as training set
  train_df = rand_df.iloc[:last_train_row]
  test_df = rand_df.iloc[last_train_row:]

  # Fit a knn model using a default k value
  knn.fit(train_df[[train_col]], train_df[target_col])

  # Make a prediction using the model
  prediction = knn.predict(test_df[[train_col]])

  # Return the root-mean-squared error
  rmse = mean_squared_error(test_df[target_col], prediction) ** (1/2)
  return rmse

In [12]:
rmse_results = {}

train_cols = numeric_cars.columns.drop('price')

# For each column except 'price', calculate the RMSE value and store it in the dictionary
for col in train_cols:
  rmse_value = knn_train_test(col, 'price', numeric_cars)
  rmse_results[col] = rmse_value

# Create a Series from the dictionary and sort it so the lowest RMSE comes first
rmse_results_series = pd.Series(rmse_results)
best_features = rmse_results_series.sort_values()
best_features

horsepower           4037.037713
curb-weight          4401.118255
highway-mpg          4630.026799
width                4704.482590
city-mpg             4766.422505
length               5427.200961
wheel-base           5461.553998
compression-rate     6610.812153
bore                 6780.627785
normalized-losses    7330.197653
peak-rpm             7697.459696
stroke               8006.529545
height               8144.441043
dtype: float64

# 3. Multivariate Model

In this next step we'll take four measurements: the RMSE when training on the 2, 3, 4 and 5 best-performing columns from the univariate results.

In [13]:
def knn_train_test_multi(target_col, df):

  np.random.seed(1)
    
  # Randomize order of rows in data frame.
  shuffled_index = np.random.permutation(df.index)
  rand_df = df.reindex(shuffled_index)

  # Divide number of rows in half and round.
  last_train_row = int(len(rand_df) / 2)
    
  # Select the first half and set as training set.
  # Select the second half and set as test set.
  train_df = rand_df.iloc[0:last_train_row]
  test_df = rand_df.iloc[last_train_row:]
  
  multi_rmse_results = {}
  
  for i in range(1,5):

    if i == 1:
      train_col = ['horsepower','curb-weight']
      knn = KNeighborsRegressor()
      knn.fit(train_df[train_col], train_df[target_col])

      # Make predictions using model.
      predicted_labels = knn.predict(test_df[train_col])

      # Calculate and return RMSE.
      mse = mean_squared_error(test_df[target_col], predicted_labels)
      rmse = np.sqrt(mse)
      multi_rmse_results['Root-Mean-Square-Error with the 2 best features'] = rmse

    elif i == 2:
      train_col = ['horsepower','curb-weight','highway-mpg']
      knn = KNeighborsRegressor()
      knn.fit(train_df[train_col], train_df[target_col])

      # Make predictions using model.
      predicted_labels = knn.predict(test_df[train_col])

      # Calculate and return RMSE.
      mse = mean_squared_error(test_df[target_col], predicted_labels)
      rmse = np.sqrt(mse)
      multi_rmse_results['Root-Mean-Square-Error with the 3 best features'] = rmse

    elif i == 3:
      train_col = ['horsepower','curb-weight','highway-mpg','width']
      knn = KNeighborsRegressor()
      knn.fit(train_df[train_col], train_df[target_col])

      # Make predictions using model.
      predicted_labels = knn.predict(test_df[train_col])

      # Calculate and return RMSE.
      mse = mean_squared_error(test_df[target_col], predicted_labels)
      rmse = np.sqrt(mse)
      multi_rmse_results['Root-Mean-Square-Error with the 4 best features'] = rmse

    elif i == 4:
      train_col = ['horsepower','curb-weight','highway-mpg','width','city-mpg']
      knn = KNeighborsRegressor()
      knn.fit(train_df[train_col], train_df[target_col])

      # Make predictions using model.
      predicted_labels = knn.predict(test_df[train_col])

      # Calculate and return RMSE.
      mse = mean_squared_error(test_df[target_col], predicted_labels)
      rmse = np.sqrt(mse)
      multi_rmse_results['Root-Mean-Square-Error with the 5 best features'] = rmse

  return multi_rmse_results   

knn_train_test_multi('price', numeric_cars)

{'Root-Mean-Square-Error with the 2 best features': 3257.849049435976,
 'Root-Mean-Square-Error with the 3 best features': 3365.9110004529675,
 'Root-Mean-Square-Error with the 4 best features': 3358.6915801682458,
 'Root-Mean-Square-Error with the 5 best features': 3341.6024539726504}

As the results show, the smallest errors came from the 2 best features (RMSE ≈ 3258) and the 5 best features (RMSE ≈ 3342), namely:

['horsepower','curb-weight'] and ['horsepower','curb-weight','highway-mpg','width','city-mpg']
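
The comparison across feature subsets can also be written as a loop over a ranked feature list, which avoids repeating the fit/predict code for each case. This is only a sketch: the `rmse_for_features` helper, the synthetic `demo` data and the `ranked` ordering are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

def rmse_for_features(train_df, test_df, features, target_col, k=5):
    """Fit a k-NN regressor on `features` and return the test RMSE."""
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(train_df[features], train_df[target_col])
    predictions = knn.predict(test_df[features])
    return np.sqrt(mean_squared_error(test_df[target_col], predictions))

# Synthetic stand-in data (assumed), already scaled to [0, 1).
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.random((60, 3)), columns=["a", "b", "c"])
demo["price"] = 1000 * demo["a"] + 500 * demo["b"]
train_df, test_df = demo.iloc[:30], demo.iloc[30:]

# Assumed ranking, best feature first.
ranked = ["a", "b", "c"]

# RMSE for the top 2, top 3, ... features.
results = {
    n: rmse_for_features(train_df, test_df, ranked[:n], "price")
    for n in range(2, len(ranked) + 1)
}
print(results)
```

With the real data, `ranked` could be taken from the sorted univariate results, for example `list(best_features.index)`.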

# 4. Hyperparameter Tuning

In this final step we'll vary the number of neighbors (k) from 1 to 24 for each feature set to see whether this reduces the error.

In [14]:
def knn_train_test_multi_k(target_col, df):

  np.random.seed(1)
    
  # Randomize order of rows in data frame.
  shuffled_index = np.random.permutation(df.index)
  rand_df = df.reindex(shuffled_index)

  # Divide number of rows in half and round.
  last_train_row = int(len(rand_df) / 2)
    
  # Select the first half and set as training set.
  # Select the second half and set as test set.
  train_df = rand_df.iloc[0:last_train_row]
  test_df = rand_df.iloc[last_train_row:]
  
  multi_k_rmse_results = {}
  
  for i in range(1,5):

    if i == 1:
      train_col = ['horsepower','curb-weight']
      for n in range(1,25):
        knn = KNeighborsRegressor(n_neighbors=n)
        knn.fit(train_df[train_col], train_df[target_col])

        # Make predictions using model
        predicted_labels = knn.predict(test_df[train_col])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        multi_k_rmse_results['Root-Mean-Square-Error with the 2 best features and {} Neighbors: '.format(n)] = rmse

    elif i == 2:
      train_col = ['horsepower','curb-weight','highway-mpg']
      for n in range(1,25):
        knn = KNeighborsRegressor(n_neighbors=n)
        knn.fit(train_df[train_col], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_col])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        multi_k_rmse_results['Root-Mean-Square-Error with the 3 best features and {} Neighbors: '.format(n)] = rmse

    elif i == 3:
      train_col = ['horsepower','curb-weight','highway-mpg','width']
      for n in range(1,25):
        knn = KNeighborsRegressor(n_neighbors=n)
        knn.fit(train_df[train_col], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_col])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        multi_k_rmse_results['Root-Mean-Square-Error with the 4 best features and {} Neighbors: '.format(n)] = rmse

    elif i == 4:
      train_col = ['horsepower','curb-weight','highway-mpg','width','city-mpg']
      for n in range(1,25):
        knn = KNeighborsRegressor(n_neighbors=n)
        knn.fit(train_df[train_col], train_df[target_col])

        # Make predictions using model.
        predicted_labels = knn.predict(test_df[train_col])

        # Calculate and return RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        multi_k_rmse_results['Root-Mean-Square-Error with the 5 best features and {} Neighbors: '.format(n)] = rmse

  result_series = pd.Series(multi_k_rmse_results)
  result_series_final = result_series.sort_values()
  return result_series_final.head()


results = knn_train_test_multi_k('price', numeric_cars)

results

Root-Mean-Square-Error with the 2 best features and 2 Neighbors:     2700.747235
Root-Mean-Square-Error with the 2 best features and 1 Neighbors:     2790.107143
Root-Mean-Square-Error with the 2 best features and 3 Neighbors:     3003.748806
Root-Mean-Square-Error with the 2 best features and 4 Neighbors:     3106.605626
Root-Mean-Square-Error with the 2 best features and 5 Neighbors:     3257.849049
dtype: float64
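
A grid like this can also be collected into a small table keyed by the number of features and by k, so that no two configurations can share a label. The sketch below uses synthetic data; the `demo` frame, the `ranked` list and the column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumed), already scaled to [0, 1).
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.random((80, 3)), columns=["a", "b", "c"])
demo["price"] = 1000 * demo["a"] + 500 * demo["b"]
train_df, test_df = demo.iloc[:40], demo.iloc[40:]

ranked = ["a", "b", "c"]  # assumed ranking, best feature first

rows = []
for n_features in range(2, len(ranked) + 1):
    cols = ranked[:n_features]
    for k in range(1, 25):
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(train_df[cols], train_df["price"])
        pred = knn.predict(test_df[cols])
        rmse = np.sqrt(mean_squared_error(test_df["price"], pred))
        rows.append({"n_features": n_features, "k": k, "rmse": rmse})

# One row per (n_features, k) pair, lowest error first.
grid = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
print(grid.head())
```

Storing `n_features` and `k` as separate columns makes it straightforward to filter or plot the error against k for each feature set.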

# Conclusion

Based on the results above, the best configuration found so far for predicting car prices is:

The 'horsepower' and 'curb-weight' features with 2 neighbors.

Its root-mean-squared error is about 2,700.



In [15]:
round(sum(numeric_cars['price']) / len(numeric_cars['price']))

13207

With a mean price of about 13,200, our error represents roughly 20% of the average car value. 
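
The ratio can be checked directly from the two numbers reported above (the variable names are chosen here for illustration):

```python
# Values copied from the notebook outputs above.
rmse = 2700.747235   # best RMSE found (2 features, 2 neighbors)
mean_price = 13207   # rounded mean of the 'price' column

relative_error = rmse / mean_price
print(f"{relative_error:.1%}")  # 20.4%
```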

If necessary, the next steps for reducing this gap and improving the algorithm include: 

1) Add data cleaning to our function

2) Perform k-fold cross-validation 
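
As a rough idea of what the cross-validation step could look like, the sketch below uses scikit-learn's `KFold` and `cross_val_score` on synthetic data. The `demo` frame, the choice of 5 folds and the 2-neighbor model are assumptions for illustration; the notebook itself only uses a single 50/50 split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumed), with features already in [0, 1).
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.random((100, 2)), columns=["horsepower", "curb-weight"])
demo["price"] = 20000 * demo["horsepower"] + 10000 * demo["curb-weight"]

# Shuffle once, then split into 5 folds; each fold serves as the test set once.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=2)

# scikit-learn reports negated MSE, so flip the sign before taking the root.
mses = -cross_val_score(
    knn,
    demo[["horsepower", "curb-weight"]],
    demo["price"],
    scoring="neg_mean_squared_error",
    cv=kf,
)
rmses = np.sqrt(mses)
print(rmses.mean(), rmses.std())
```

Averaging the RMSE over several folds gives an estimate that depends less on one particular train/test split than the single holdout used so far.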