# Car Price Prediction and Interpretation  
## - The two main goals of this project are:
* Predicting the price of a car according to a certain set of features
* Comparing LinearRegression, RidgeRegression and LassoRegression performances

## - Data Description
The dataset covers many car types and models along with their prices. It also contains many features of each car, such as its dimensions, engine size and horsepower, which we use to predict the car price.
## - Data exploration 
I started by loading the data into a Jupyter notebook using pandas, then dropped the columns that are irrelevant to the target feature. Next, I counted the features of type object (containing strings) and one-hot encoded them so they could be included in the regression models. Some of them were nominal and some were ordinal.
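The notebook one-hot encodes every string column, but the nominal/ordinal distinction mentioned above could also be handled per column. The following is a minimal sketch on a toy frame, not the notebook's method: the rows are illustrative and the `door_map` mapping is an assumption.

```python
import pandas as pd

# Toy frame reusing two column names from the dataset; the rows are illustrative.
toy = pd.DataFrame({
    "doornumber": ["two", "four", "four", "two"],
    "carbody": ["convertible", "hatchback", "sedan", "sedan"],
})

# Ordinal feature: the categories have a natural order, so map them to integers.
door_map = {"two": 2, "four": 4}  # assumed mapping, not taken from the notebook
toy["doornumber_num"] = toy["doornumber"].map(door_map)

# Nominal feature: no natural order, so expand it into indicator columns.
body_dummies = pd.get_dummies(toy["carbody"], prefix="carbody")

encoded = pd.concat([toy[["doornumber_num"]], body_dummies], axis=1)
print(encoded)
```

Mapping an ordinal feature to a single numeric column keeps its order and adds one column instead of several.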
###  Loading the Data

In [88]:
import numpy as np
import pandas as pd

data = pd.read_csv("C:/Users/HP/Desktop/IBM machine learning/2- Supervised Machine Learning Regression/Final Project/CarPrice_Assignment.csv")

data = data.drop("car_ID", axis=1) # remove the car ID column since it duplicates the index
data = data.drop("CarName", axis=1) # removing the CarName column since it's irrelevant to the car price
data.head()

Unnamed: 0,symboling,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,3,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,3,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,1,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,2,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,2,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


### Checking columns types

In [89]:
# Count how many columns there are of each dtype
data.dtypes.value_counts()

object     9
float64    8
int64      7
dtype: int64

### One-hot encoding the categorical variables

In [90]:
# Select the object (string) columns
# np.object was a deprecated alias (NumPy 1.20); the builtin object is equivalent
mask = data.dtypes == object
categorical_cols = data.columns[mask]


In [91]:
# Determine how many extra columns would be created
num_ohc_cols = (data[categorical_cols]
                .apply(lambda x: x.nunique())
                .sort_values(ascending=False))


# No need to encode if there is only one value
small_num_ohc_cols = num_ohc_cols.loc[num_ohc_cols>1]

# Number of one-hot columns is one less than the number of categories
small_num_ohc_cols -= 1

# This is 29 extra columns, assuming the original ones are dropped.
small_num_ohc_cols.sum()

29

In [92]:
from sklearn.preprocessing import OneHotEncoder

# Copy of the data
data_ohc = data.copy()

# The encoder
ohc = OneHotEncoder()

for col in num_ohc_cols.index:
    
    # One hot encode the data--this returns a sparse array
    new_dat = ohc.fit_transform(data_ohc[[col]])
    
    # Remove the original column from the dataframe
    data_ohc = data_ohc.drop(col, axis=1)

    # Get the names of all unique values in the column so we can identify them later
    cats = ohc.categories_

    # Create columns names for each OHE by value
    new_cols = ['_'.join([col, cat]) for cat in cats[0]]

    # Create the new dataframe
    new_df = pd.DataFrame(new_dat.toarray(),
                          columns=new_cols)

    # Append the new data to the dataframe
    data_ohc = pd.concat([data_ohc, new_df], axis=1)

In [93]:
# Column difference is as calculated above
data_ohc.shape[1] - data.shape[1]

29

In [94]:
print(data.shape[1])

# Remove the string columns from the dataframe
data = data.drop(num_ohc_cols.index, axis=1)

print(data.shape[1])

24
15


### Training and test splits

In [108]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
# Standard-scale all features before splitting (the scaler therefore also sees the test rows)
s = StandardScaler()

y_col = 'price'
# Split the data that is one-hot encoded
feature_cols = [x for x in data_ohc.columns if x != y_col]
X_data_ohc = data_ohc[feature_cols]
X_data_ohc = s.fit_transform(X_data_ohc)
y_data_ohc = data_ohc[y_col]

X_train_ohc, X_test_ohc, y_train_ohc, y_test_ohc = train_test_split(X_data_ohc, y_data_ohc, 
                                                    test_size=0.3, random_state=42)
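The cell above fits the scaler on all rows before splitting. A common alternative is to put the scaler inside a `Pipeline` so it is fitted on the training rows only. The sketch below uses synthetic data from `make_regression` as a stand-in for the car data, so its score says nothing about the results in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data (assumption for illustration only).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# Split first, so the test rows never influence the scaler statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_regression", LinearRegression()),
])
model.fit(X_train, y_train)  # the scaler is fitted on X_train only

test_r2 = model.score(X_test, y_test)  # R^2 on the held-out rows
print(f"test R^2: {test_r2:.3f}")
```

The same pipeline can be passed to `GridSearchCV`, in which case the scaler is refitted inside each cross-validation fold.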


### Using Grid Search CV to find the best parameters.

#### - Needed Libraries

In [109]:
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#### 1- Linear Regression

In [121]:
# without PolynomialFeatures transformation
lr_error_df = list()
lr = LinearRegression()
lr = lr.fit(X_train_ohc,y_train_ohc )
y_pred_LR = lr.predict(X_test_ohc)

lr_error_df.append(pd.Series({'Accuracy': r2_score(y_test_ohc, y_pred_LR)},
                           name='no PolynomialFeatures'))

# using grid search CV to apply the polynomial features transformation
estimator = Pipeline([("polynomial_features", PolynomialFeatures()),
        ("linear_regression", LinearRegression())])

params = {
    'polynomial_features__degree': [1, 2, 3]
}
kf = KFold(shuffle=True, random_state=72019, n_splits=3)

grid = GridSearchCV(estimator, params, cv=kf)
grid.fit(X_train_ohc,y_train_ohc)
grid_y_pred_LRP = grid.predict(X_test_ohc)
lr_error_df.append(pd.Series({'Accuracy': r2_score(y_test_ohc,grid_y_pred_LRP)},
                           name='PolynomialFeatures'))
lr_error_df = pd.concat(lr_error_df, axis=1)
lr_error_df

Unnamed: 0,no PolynomialFeatures,PolynomialFeatures
Accuracy,0.883029,0.883029


As we can see above, the polynomial transformation did not change the accuracy of the linear regression model, which suggests that the grid search selected degree 1.
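One way to check why the two scores match is to look at which degree the search selected and how each candidate scored in cross-validation. The sketch below repeats the same pipeline and `KFold` settings on synthetic data (an assumption for illustration), so the printed values are not the notebook's.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the training data (assumption for illustration only).
X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)

estimator = Pipeline([("polynomial_features", PolynomialFeatures()),
                      ("linear_regression", LinearRegression())])
params = {"polynomial_features__degree": [1, 2, 3]}
kf = KFold(shuffle=True, random_state=72019, n_splits=3)

grid = GridSearchCV(estimator, params, cv=kf)
grid.fit(X, y)

# The degree that won the search, and the mean CV score of every candidate.
best_degree = grid.best_params_["polynomial_features__degree"]
mean_scores = grid.cv_results_["mean_test_score"]
print(grid.best_params_)
print(mean_scores)
```

If `best_degree` is 1, the pipeline only adds a constant column to the features, so its predictions match plain `LinearRegression`.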
#### 2- Ridge Regression (regularization regression)

In [128]:
# without polynomial transformation
RG_error_df = list()
estimator_R = Pipeline([("ridge_regression", Ridge())])

params_R = {
    'ridge_regression__alpha': np.geomspace(0.1, 10, 30)
}

grid_R = GridSearchCV(estimator_R, params_R, cv=kf)
grid_R.fit(X_train_ohc,y_train_ohc)
grid_y_pred_R = grid_R.predict(X_test_ohc)
RG_error_df.append(pd.Series({'Accuracy': r2_score(y_test_ohc,grid_y_pred_R)},
                           name='No PolynomialFeatures'))

# with polynomial transformation
estimator_RP = Pipeline([("polynomial_features", PolynomialFeatures()),
        ("ridge_regression", Ridge())])

params_RP = {
    'polynomial_features__degree': [1, 2, 3],
    'ridge_regression__alpha': np.geomspace(0.1, 10, 30)
}
grid_RP = GridSearchCV(estimator_RP, params_RP, cv=kf)
grid_RP.fit(X_train_ohc,y_train_ohc)
grid_y_pred_RP = grid_RP.predict(X_test_ohc)
RG_error_df.append(pd.Series({'Accuracy': r2_score(y_test_ohc,grid_y_pred_RP)},
                           name='PolynomialFeatures'))


RG_error_df = pd.concat(RG_error_df, axis=1)
RG_error_df

Unnamed: 0,No PolynomialFeatures,PolynomialFeatures
Accuracy,0.863438,0.863438


In [129]:
print(r2_score(y_test_ohc,grid_y_pred_RP))
print(r2_score(y_test_ohc,grid_y_pred_R))

0.8634379859751133
0.8634379859751131


As we can see above, the two scores differ only in the 16th decimal place, which is floating-point noise, so the difference is insignificant.
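The comparison can be made explicit with a tolerance-based check. This short sketch reuses the two scores printed above; the variable names are illustrative.

```python
import numpy as np

# The two R^2 values printed in the previous cell.
r2_with_poly = 0.8634379859751133
r2_without_poly = 0.8634379859751131

# Absolute difference between the scores and double-precision machine epsilon.
diff = abs(r2_with_poly - r2_without_poly)
print(diff)
print(np.finfo(float).eps)

# np.isclose treats values within a small tolerance as equal.
same_score = bool(np.isclose(r2_with_poly, r2_without_poly))
print(same_score)
```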

#### 3- Lasso Regression (regularization regression)

In [119]:
# without polynomial transformation
L_error_df = list()
estimator_L = Pipeline([("Lasso_regression", Lasso())])

params_L = {
    'Lasso_regression__alpha': np.geomspace(4, 20, 30)
}

grid_L = GridSearchCV(estimator_L, params_L, cv=kf)
grid_L.fit(X_train_ohc,y_train_ohc)
grid_y_pred_L = grid_L.predict(X_test_ohc)
L_error_df.append(pd.Series({'Accuracy': r2_score(y_test_ohc,grid_y_pred_L)},
                           name='No PolynomialFeatures'))

# with polynomial transformation
estimator_LP = Pipeline([("polynomial_features", PolynomialFeatures()),
        ("Lasso_regression", Lasso())])

params_LP = {
    'polynomial_features__degree': [1, 2, 3],
    'Lasso_regression__alpha': np.geomspace(4, 20, 30)
}
grid_LP = GridSearchCV(estimator_LP, params_LP, cv=kf)
grid_LP.fit(X_train_ohc,y_train_ohc)
grid_y_pred_LP = grid_LP.predict(X_test_ohc)
L_error_df.append(pd.Series({'Accuracy': r2_score(y_test_ohc,grid_y_pred_LP)},
                           name='PolynomialFeatures'))


L_error_df = pd.concat(L_error_df, axis=1)
L_error_df

Unnamed: 0,No PolynomialFeatures,PolynomialFeatures
Accuracy,0.881807,0.926257
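If scikit-learn emits convergence warnings during a search like this one, a common remedy is to raise `max_iter` on `Lasso`. The sketch below counts the warnings for a small and a large iteration cap. It runs on synthetic data, and the `count_convergence_warnings` helper and the scaling step after the polynomial expansion are assumptions for illustration, not part of the notebook.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the training data (assumption for illustration only).
X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=1)

def count_convergence_warnings(max_iter):
    """Fit a polynomial Lasso pipeline and count its ConvergenceWarnings."""
    model = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=2, include_bias=False)),
        ("scaler", StandardScaler()),  # assumed extra step, not in the notebook
        ("Lasso_regression", Lasso(alpha=4.0, max_iter=max_iter)),
    ])
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)

few = count_convergence_warnings(max_iter=5)         # very small iteration cap
many = count_convergence_warnings(max_iter=100_000)  # generous iteration cap
print(few, many)
```

In a grid search the same parameter can be set once when the estimator is constructed, for example `Lasso(max_iter=100_000)` inside the pipeline.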


### Comparing the results of the three different models

In [126]:
error_df = list()
error_df.append(pd.Series({'LinearRegression': r2_score(y_test_ohc,y_pred_LR),
                          'RidgeRegression': r2_score(y_test_ohc,grid_y_pred_R),
                          'LassoRegression': r2_score(y_test_ohc,grid_y_pred_L)},
                           name='No PolynomialFeatures'))
# second column: the grid-searched pipelines that include PolynomialFeatures
error_df.append(pd.Series({'LinearRegression': r2_score(y_test_ohc,grid_y_pred_LRP),
                          'RidgeRegression': r2_score(y_test_ohc,grid_y_pred_RP),
                          'LassoRegression': r2_score(y_test_ohc,grid_y_pred_LP)},
                           name='PolynomialFeatures'))

error_df = pd.concat(error_df, axis=1)
error_df

Unnamed: 0,No PolynomialFeatures,PolynomialFeatures
LinearRegression,0.883029,0.883029
RidgeRegression,0.862456,0.862456
LassoRegression,0.881807,0.926257


## Recommendation
Since my goal was the highest accuracy (measured as R² on the test set), I expected Ridge regression to score highest, because its penalty shrinks coefficients without removing any features, which usually trades interpretability for lower error. Surprisingly, the Lasso regression model scored higher, even though Lasso is usually preferred for model and feature interpretability since it zeros out more features than the Ridge regression model. This is most likely because the Lasso model was tuned better than the Ridge model.
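The interpretability point can be illustrated by counting the coefficients that each model sets exactly to zero. This is a minimal sketch on synthetic data where only 5 of 30 features are informative. The data and the `alpha` values are assumptions, not the tuned values from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 30 features, only 5 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=4.0).fit(X, y)
ridge = Ridge(alpha=4.0).fit(X, y)

# Lasso's L1 penalty can set coefficients exactly to zero;
# Ridge's L2 penalty only shrinks them toward zero.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("zeroed by Lasso:", lasso_zeros)
print("zeroed by Ridge:", ridge_zeros)
```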

## Key Findings
In LinearRegression and Ridge regression, applying the polynomial transformation had no measurable impact on the accuracy of the model. Applying it to the Lasso regression model, on the other hand, raised the accuracy from 0.882 to 0.926. Surprisingly, Ridge regression scored the lowest accuracy although it was expected to score the highest.

## Suggestions
I believe it is important to investigate further why the Ridge regression model scored the lowest accuracy. We should try tuning the model in different ways, apply different scaling, and check the results again to see whether anything changes.
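One possible direction is to search a much wider range of `alpha` values and check whether the selected value sits on the edge of the grid, which would indicate that the grid should be extended. The sketch below runs on synthetic data. The range `1e-3` to `1e3` and the `on_edge` check are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption for illustration only).
X, y = make_regression(n_samples=150, n_features=20, noise=15.0, random_state=3)

# A wider, log-spaced grid than the 0.1 to 10 range used earlier.
alphas = np.geomspace(1e-3, 1e3, 30)

estimator = Pipeline([("scaler", StandardScaler()),
                      ("ridge_regression", Ridge())])
kf = KFold(shuffle=True, random_state=72019, n_splits=3)
grid = GridSearchCV(estimator, {"ridge_regression__alpha": alphas}, cv=kf)
grid.fit(X, y)

best_alpha = grid.best_params_["ridge_regression__alpha"]

# If the best alpha is the smallest or largest candidate, extend the grid.
on_edge = bool(np.isclose(best_alpha, alphas[0]) or
               np.isclose(best_alpha, alphas[-1]))
print(best_alpha, on_edge)
```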