# Predicting Car Prices

In this notebook, we will try to predict a car's market price from its attributes. The data set describes various cars through technical characteristics such as the motor's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more.

Let's start by importing the libraries we are going to use in the rest of the notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

pd.options.display.max_columns = 99

The data set is described in the [UCI Machine Learning Repository documentation](https://archive.ics.uci.edu/ml/datasets/automobile). The raw file has no header row, so we will specify the column names from the documentation when reading the data.

In [2]:
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv('imports-85.data', names=cols)

Let's have a look at the first few rows of the data.

In [3]:
cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


Now, let's get more information about the data.

In [4]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    205 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         205 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression-rate     205 non-null float64
horsepower           205 non-nul

In the rest of the notebook, we will try to build a machine learning model using only the numeric features. 
For this purpose, we will select only the columns with continuous values from the [documentation](https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names).

In [5]:
continuous_values_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
numeric_cars = cars[continuous_values_cols]

In [6]:
numeric_cars.head(5)

Unnamed: 0,normalized-losses,wheel-base,length,width,height,curb-weight,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,?,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,13495
1,?,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,16500
2,?,94.5,171.2,65.5,52.4,2823,2.68,3.47,9.0,154,5000,19,26,16500
3,164,99.8,176.6,66.2,54.3,2337,3.19,3.4,10.0,102,5500,24,30,13950
4,164,99.4,176.6,66.4,54.3,2824,3.19,3.4,8.0,115,5500,18,22,17450


## Data Cleaning

From the data, we can notice that the `normalized-losses` column contains some missing values represented using "?". Let's replace these values and look for the presence of missing values in other numeric columns.

In [7]:
numeric_cars = numeric_cars.replace('?', np.nan)
numeric_cars.head()

Unnamed: 0,normalized-losses,wheel-base,length,width,height,curb-weight,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,13495
1,,88.6,168.8,64.1,48.8,2548,3.47,2.68,9.0,111,5000,21,27,16500
2,,94.5,171.2,65.5,52.4,2823,2.68,3.47,9.0,154,5000,19,26,16500
3,164.0,99.8,176.6,66.2,54.3,2337,3.19,3.4,10.0,102,5500,24,30,13950
4,164.0,99.4,176.6,66.4,54.3,2824,3.19,3.4,8.0,115,5500,18,22,17450


We will convert all columns to `float` and count the missing values in each column.

In [8]:
numeric_cars = numeric_cars.astype('float')
numeric_cars.isnull().sum()

normalized-losses    41
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
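
As a side note, the replacement and the conversion can be done in one step per column. The sketch below is only an illustration and not what this notebook runs: it uses `pd.to_numeric` with `errors='coerce'` on a hand-built series (the name `raw` is an assumption) that holds the first five `normalized-losses` values as strings.

```python
import pandas as pd

# First five normalized-losses values, still stored as strings (hand-copied sample)
raw = pd.Series(['?', '?', '?', '164', '164'], name='normalized-losses')

# errors='coerce' turns anything that cannot be parsed as a number into NaN
as_float = pd.to_numeric(raw, errors='coerce')

print(as_float.dtype)           # float64
print(as_float.isnull().sum())  # 3
```

This approach also catches non-numeric markers other than "?", which may or may not be desirable depending on the data.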

We notice that some columns have missing values. Since `price` is the column we want to predict, let's first remove any rows with a missing `price` before deciding how to deal with the others.

In [9]:
numeric_cars = numeric_cars.dropna(subset=['price'])
numeric_cars.isnull().sum()

normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64

We will fill the remaining missing values with the mean of their respective columns.

In [10]:
numeric_cars = numeric_cars.fillna(numeric_cars.mean())
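
To make the behaviour of `fillna` with column means concrete, here is a minimal sketch on a toy frame. The frame `toy` is hypothetical and holds only the first five values of two columns from the table shown earlier.

```python
import numpy as np
import pandas as pd

# First five rows of two columns, hand-copied from the earlier output
toy = pd.DataFrame({
    'normalized-losses': [np.nan, np.nan, np.nan, 164.0, 164.0],
    'horsepower': [111.0, 111.0, 154.0, 102.0, 115.0],
})

# DataFrame.mean() skips NaN by default, so each mean uses observed values only
col_means = toy.mean()

# Passing a Series to fillna fills each column with the value at the matching label
toy_filled = toy.fillna(col_means)
print(toy_filled)
```

Here the three missing `normalized-losses` entries become 164.0, the mean of the two observed values, while `horsepower` is left unchanged.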

Let's confirm that there are no more missing values before proceeding.

In [11]:
numeric_cars.isnull().sum()

normalized-losses    0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64

Next, we will split the data set into a training set (80%) and a test set (20%). The features are all columns except the last one, `price`, which is the target.

In [12]:
X = numeric_cars.iloc[:, :-1].values
y = numeric_cars.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
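
As a quick check of what the 80/20 split produces, the sketch below applies the same call to placeholder arrays. The arrays `X_demo` and `y_demo` are assumptions that only mimic the shape of the data: 201 rows (205 cars minus the 4 without a price) and 13 feature columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the notebook's shape: 201 rows, 13 features
X_demo = np.zeros((201, 13))
y_demo = np.zeros(201)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# The test size is rounded up: ceil(0.2 * 201) = 41, leaving 160 rows for training
print(X_tr.shape, X_te.shape)  # (160, 13) (41, 13)
```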

Now, we will scale all feature columns. The scaler is fitted on the training set only and then applied to the test set.

In [13]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
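
The difference between `fit_transform` and `transform` can be seen on a tiny example. This is a sketch with hand-picked values: the arrays `train_demo` and `test_demo` are hypothetical and reuse a few `curb-weight` values from the first rows of the data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few curb-weight values from the first rows, split by hand for illustration
train_demo = np.array([[2548.0], [2823.0], [2337.0], [2824.0]])
test_demo = np.array([[2548.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean and std from the training rows
test_scaled = scaler.transform(test_demo)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))  # approximately 0 and 1
print(test_scaled)
```

Because the test rows never influence the learned mean and standard deviation, the test set stays unseen until evaluation.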

Now, we will build a 10-Fold cross-validator.

In [14]:
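# random_state only applies when shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here
cv = KFold(n_splits=10)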
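
To see what a 10-fold split looks like without shuffling, here is a small sketch. The array `X_demo` is a placeholder with 160 rows, which is the training-set size we would expect from an 80/20 split of 201 cars (an assumption, not a value printed by the notebook).

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder with 160 rows, the assumed size of the training set
X_demo = np.arange(160).reshape(-1, 1)

kf = KFold(n_splits=10)  # no shuffling: folds are consecutive blocks of rows
folds = list(kf.split(X_demo))

fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)          # [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
print(folds[0][1][:5])     # first validation fold starts at row 0

# A seed only matters when the rows are shuffled before splitting
kf_shuffled = KFold(n_splits=10, shuffle=True, random_state=0)
```

Each model is then trained 10 times, each time on 9 folds and validated on the remaining one.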

## Multiple Linear Regressor

We will start by fitting a multiple linear regressor.

In [15]:
lr = LinearRegression()

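Now, we will fit the model and compute its cross-validated root mean squared error (RMSE). We use `GridSearchCV` with an empty parameter grid so that the model is evaluated with the same 10-fold cross-validator as the other models.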

In [16]:
lr_grid = GridSearchCV(estimator = lr,
                       param_grid = {},
                       scoring = 'neg_mean_squared_error',
                       cv = cv)
lr_grid = lr_grid.fit(X_train, y_train)

# best_score_ is the negative mean squared error averaged over the folds
neg_mse = lr_grid.best_score_
# Printing the cross-validated rmse
print(np.sqrt(-neg_mse))

3919.9748648595337
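
The printed value is the square root of the average fold MSE, which is not the same as the average of the per-fold RMSE values. The sketch below illustrates the difference with `cross_val_score` on synthetic data generated by `make_regression`; the data and the names `X_demo`, `y_demo` and `cv_demo` are assumptions and have nothing to do with the car data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; not the car data set
X_demo, y_demo = make_regression(n_samples=160, n_features=13,
                                 noise=10.0, random_state=0)
cv_demo = KFold(n_splits=10)

# One negative MSE per fold; scikit-learn negates errors so that higher is better
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=cv_demo)

rmse_of_mean = np.sqrt(-neg_mse.mean())  # same formula as np.sqrt(-best_score_)
mean_of_rmse = np.sqrt(-neg_mse).mean()  # average of the per-fold RMSE values
print(rmse_of_mean, mean_of_rmse)
```

The first quantity is never smaller than the second, so the two should not be compared with each other directly.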


In [17]:
y_pred = lr_grid.predict(X_test)
print('RMSE:',np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 4461.218255180123


## k-nearest neighbor

Now, we will fit a k-nearest neighbors regressor with different numbers of neighbors *k* (from 1 to 9).

In [18]:
knn = KNeighborsRegressor()

In [19]:
parameters = {'n_neighbors': np.arange(1,10)}
knn_grid = GridSearchCV(estimator = knn,
                       param_grid = parameters,
                       scoring = 'neg_mean_squared_error',
                       cv = cv)
knn_grid = knn_grid.fit(X_train, y_train)

# best_score_ is the negative mean squared error of the best candidate
best_neg_mse = knn_grid.best_score_
best_parameters = knn_grid.best_params_
# Printing the smallest rmse
print(np.sqrt(-best_neg_mse))
# Printing the best parameters
print(best_parameters)

3196.432492590372
{'n_neighbors': 2}
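
Beyond the best candidate, the fitted grid search keeps the score of every *k* in its `cv_results_` attribute. The following sketch shows how one might tabulate them; it runs on synthetic data from `make_regression` (an assumption, not the car data), so the numbers it prints are unrelated to the results above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data; not the car data set
X_demo, y_demo = make_regression(n_samples=160, n_features=13,
                                 noise=10.0, random_state=0)

grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid={'n_neighbors': np.arange(1, 10)},
                    scoring='neg_mean_squared_error',
                    cv=KFold(n_splits=10))
grid.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate value of k
results = pd.DataFrame({
    'k': [int(k) for k in grid.cv_results_['param_n_neighbors']],
    'cv_rmse': np.sqrt(-grid.cv_results_['mean_test_score']),
})
print(results.sort_values('cv_rmse'))
```

Looking at the whole table shows how sharply the error changes around the selected *k*, which a single best score cannot tell us.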


In [20]:
y_pred = knn_grid.predict(X_test)
print('RMSE:',np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 4343.746642743044


We can see that k-nearest neighbors with *k=2* performs better than the multiple linear regression in terms of RMSE, both in cross-validation (about 3196 vs. 3920) and on the test set (about 4344 vs. 4461).

## Random forest

Let's try a random forest model to see if we can improve the results.

In [21]:
rf = RandomForestRegressor(random_state=0)

In [22]:
n_estimators = [10, 50, 100]
max_depth_values = [5, 6, 7, 8, 9]
max_features_values = [4, 5, 6, 7]
tree_params = {'n_estimators' : n_estimators,
               'max_depth': max_depth_values,
               'max_features': max_features_values}
rf_grid = GridSearchCV(estimator=rf, 
                       param_grid=tree_params,
                       scoring='neg_mean_squared_error', 
                       n_jobs=-1, 
                       cv=cv, 
                       verbose=1)
rf_grid.fit(X_train, y_train)

# best_score_ is the negative mean squared error of the best candidate
best_neg_mse = rf_grid.best_score_
best_parameters = rf_grid.best_params_
# Printing the smallest rmse
print(np.sqrt(-best_neg_mse))
# Printing the best parameters
print(best_parameters)

Fitting 10 folds for each of 60 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    9.3s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   19.0s


2655.06565735125
{'max_depth': 8, 'max_features': 4, 'n_estimators': 100}


[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:   25.4s finished


The best parameters for the random forest are:
- Maximum depth = 8
- Maximum features = 4
- Number of estimators = 100
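
The "60 candidates" and "600 fits" reported in the log follow directly from the size of the grid. A minimal sketch of that arithmetic, using `ParameterGrid` to enumerate the same combinations:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
tree_params = {'n_estimators': [10, 50, 100],
               'max_depth': [5, 6, 7, 8, 9],
               'max_features': [4, 5, 6, 7]}

n_candidates = len(ParameterGrid(tree_params))  # 3 * 5 * 4 combinations
n_fits = n_candidates * 10                      # one fit per candidate per fold
print(n_candidates, n_fits)  # 60 600
```

Adding one more value to any of the three lists increases the number of fits multiplicatively, which is worth keeping in mind before widening the search.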

In [23]:
y_pred = rf_grid.predict(X_test)
print('RMSE:',np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 2916.4429394485383


We can see that the predictions have improved: the random forest has the lowest RMSE both in cross-validation (about 2655) and on the test set (about 2916). Its test RMSE is also closer to its cross-validation RMSE than for the multiple linear regression and the k-nearest neighbors, which suggests that it overfits less. However, using *100* estimators to construct the forest is expensive, especially if the data set is large.
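
To compare the three models side by side, the sketch below computes the gap between the test RMSE and the cross-validation RMSE. The values are copied by hand (rounded to two decimals) from the outputs printed in this notebook; the dictionary names are ours.

```python
# RMSE values copied from the outputs printed in this notebook
cv_rmse = {'linear regression': 3919.97,
           'k-nearest neighbors': 3196.43,
           'random forest': 2655.07}
test_rmse = {'linear regression': 4461.22,
             'k-nearest neighbors': 4343.75,
             'random forest': 2916.44}

gaps = {}
for name in cv_rmse:
    gaps[name] = test_rmse[name] - cv_rmse[name]
    ratio = test_rmse[name] / cv_rmse[name]
    print(f'{name}: gap = {gaps[name]:.0f}, test / CV = {ratio:.2f}')
```

The gaps are roughly 541, 1147 and 261 respectively. Since the test set contains only about 40 cars, these differences should be read as indicative rather than conclusive.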