# Used Car Price Prediction

## 1) Problem statement.

* This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
* The goal is to predict the price of a car from its input features.
* Prediction results can be used to give new sellers a price suggestion based on market conditions.

## 2) Data Collection.
* The dataset was collected by scraping the CarDekho website.
* The data consists of 13 columns and 15411 rows.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df=pd.read_csv('cardekho_imputated.csv',index_col=[0])
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


## Data Cleaning
### Handling Missing values

* Handling Missing values 
* Handling Duplicates
* Check data type
* Understand the dataset
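
The duplicate and data-type checks from the list above can be sketched as follows. The tiny `demo` frame is made up for illustration and only borrows a few column names from the dataset; the same calls would apply to `df`.

```python
import pandas as pd

# Hypothetical three-row frame; the second and third rows are identical.
demo = pd.DataFrame({
    "model": ["Alto", "i20", "i20"],
    "vehicle_age": [9, 11, 11],
    "selling_price": [120000, 215000, 215000],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = int(demo.duplicated().sum())

# Drop the repeats and renumber the index.
deduped = demo.drop_duplicates().reset_index(drop=True)

# One dtype per column; numeric columns should not show up as strings.
dtypes = demo.dtypes

print(n_duplicates)
print(deduped.shape)
print(dtypes)
```

Whether duplicates should actually be dropped depends on the data: two identical listings may be genuine separate sales, so this is a check to inspect rather than a step to apply blindly.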

In [3]:
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [4]:
### car_name, brand and model carry overlapping information; model is the most useful for predicting price, so keep it and drop the other two
df.drop(columns=['car_name','brand'],inplace=True)

In [5]:
df.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [6]:
df['model'].value_counts()

model
i20             906
Swift Dzire     890
Swift           781
Alto            778
City            757
               ... 
Altroz            1
C                 1
Ghost             1
Quattroporte      1
Gurkha            1
Name: count, Length: 120, dtype: int64

In [7]:
df['transmission_type'].value_counts()

transmission_type
Manual       12225
Automatic     3186
Name: count, dtype: int64

In [8]:
### Get numerical features
num_features=[feature for feature in df.columns if df[feature].dtype!='O'] ### dtype other than 'O' (object) means the column is numeric
print("Numerical Features: ",len(num_features))
### Get categorical features
cat_features=[feature for feature in df.columns if df[feature].dtype=='O'] ### dtype 'O' (object) means the column holds strings
print("Categorical Features: ",len(cat_features))
### Discrete features
dis_features=[feature for feature in num_features if len(df[feature].unique())<=25] ### numeric columns with at most 25 distinct values
print("Discrete Features: ",len(dis_features))
### Continuous features
con_features=[feature for feature in num_features if feature not in dis_features] ### the remaining numeric columns
print("Continuous Features: ",len(con_features))

Numerical Features:  7
Categorical Features:  4
Discrete Features:  2
Continuous Features:  5


In [9]:
### Split into independent features (x) and the dependent target (y)
x=df.drop(['selling_price'],axis=1)
y=df['selling_price']

In [10]:
x.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [11]:
#### Train test split before encoding and normalisation
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
x_train.shape,x_test.shape

((12328, 10), (3083, 10))

In [12]:
### Label-encode the model name
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
### Fit on the full dataset: a model name that appears only in the test split would otherwise make transform() fail on an unseen label
le.fit(x['model'])
x_train['model']=le.transform(x_train['model'])
x_test['model']=le.transform(x_test['model'])
x_train.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
14238,108,7,70252,Dealer,Diesel,Automatic,11.2,2400,215.0,5
1731,91,2,10000,Individual,Petrol,Manual,23.84,1199,84.0,5
13218,17,2,6000,Dealer,Diesel,Automatic,19.0,1950,241.3,5
403,25,7,63000,Dealer,Petrol,Manual,17.8,1497,117.3,5
13550,117,10,80292,Dealer,Petrol,Manual,20.36,1197,78.9,5
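
Fitting the encoder on the full dataset lets the test split influence preprocessing. One possible alternative, shown here only as a sketch and not what this notebook does, is scikit-learn's `OrdinalEncoder` fitted on the training split alone, with unseen model names mapped to a sentinel value. The `train` and `test` frames below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical splits: "Ghost" appears only in the test rows.
train = pd.DataFrame({"model": ["Alto", "i20", "Swift", "Alto"]})
test = pd.DataFrame({"model": ["i20", "Ghost"]})

# Categories not seen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# OrdinalEncoder expects 2D input, hence the double brackets.
train_codes = encoder.fit_transform(train[["model"]]).ravel()
test_codes = encoder.transform(test[["model"]]).ravel()

print(train_codes)
print(test_codes)
```

The trade-off is that every unseen model collapses into one code, so rare models that appear only in the test split lose their identity.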


In [13]:
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer
### Get numerical features
numeric_features=[feature for feature in x_train.columns if  x_train[feature].dtype!='O'] 
print("Numerical Features: ",len(numeric_features))
one_hot_columns=['seller_type','fuel_type','transmission_type']
scaler=StandardScaler()
ohe=OneHotEncoder(drop='first')
preprocessor=ColumnTransformer(
    [
        ('OHE',ohe,one_hot_columns),
        ('scaler',scaler,numeric_features)
    ],remainder='passthrough'
)


Numerical Features:  7


In [14]:
x_train=preprocessor.fit_transform(x_train)

In [15]:
x_test=preprocessor.transform(x_test)

In [16]:
pd.DataFrame(x_train)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.255968,0.323969,0.349100,-2.050819,1.756765,2.681685,-0.403824
1,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.789199,-1.337798,-1.069394,0.985661,-0.547081,-0.382744,-0.403824
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.242618,-1.337798,-1.163564,-0.177042,0.893542,3.296910,-0.403824
3,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.022962,0.323969,0.178369,-0.465315,0.024564,0.396229,-0.403824
4,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.503081,1.321030,0.585469,0.149668,-0.550917,-0.502047,-0.403824
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12323,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.556193,0.323969,1.702310,0.248161,-0.453086,-0.270460,2.070500
12324,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.899027,1.653383,0.084198,-0.876105,0.218310,0.066393,-0.403824
12325,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.036312,0.323969,-0.833967,0.185702,-0.932654,-0.779483,-0.403824
12326,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.530538,-1.337798,-0.951680,-0.273133,-0.550917,-0.432805,-0.403824


#### Model training and model selection

In [17]:
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error,root_mean_squared_error

In [18]:
#### Function for evaluating a model
def evaluate(true,pred):
    mse=mean_squared_error(true,pred)
    r2=r2_score(true,pred)
    mae=mean_absolute_error(true,pred)
    rmse=root_mean_squared_error(true,pred)
    
    return mae,mse,rmse,r2
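
To make the four metrics concrete, here is a small worked example that computes them by hand with NumPy. The `true` and `pred` arrays are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical targets and predictions.
true = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = true - pred                      # [-10, 10, -30, 40]

mae = np.mean(np.abs(errors))             # (10 + 10 + 30 + 40) / 4 = 22.5
mse = np.mean(errors ** 2)                # (100 + 100 + 900 + 1600) / 4 = 675
rmse = np.sqrt(mse)                       # square root of MSE, about 25.98

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)              # 2700
ss_tot = np.sum((true - true.mean()) ** 2)  # 50000
r2 = 1 - ss_res / ss_tot                  # 0.946

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same unit as the target (rupees for `selling_price`), while MSE is in squared units, which is why the MSE values in the reports below look so large.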

In [19]:
models={
    'Ridge':Ridge(),
    'Lasso':Lasso(),
    'LinearRegression':LinearRegression(),
    'KNeighborsRegressor':KNeighborsRegressor(),
    'DecisionTree':DecisionTreeRegressor(),
    'RandomForest':RandomForestRegressor(),
    'AdaBoostRegressor':AdaBoostRegressor(),
    'GradientBoostRegressor':GradientBoostingRegressor()
}

for name,model in models.items():
    model.fit(x_train,y_train)
    
    ## Predictions on the training and test sets
    y_train_pred=model.predict(x_train)
    y_test_pred=model.predict(x_test)
    
    train_mae,train_mse,train_rmse,train_r2=evaluate(y_train,y_train_pred)
    test_mae,test_mse,test_rmse,test_r2=evaluate(y_test,y_test_pred)
    
    print(f"=========================={name}================================")
    print('-----------------Train EValuation---------------')
    print('MAE: ',train_mae)
    print('MSE: ',train_mse)
    print('RMSE: ',train_rmse)
    print('R2: ',train_r2)
    print('-----------------Test EValuation---------------')
    print('MAE: ',test_mae)
    print('MSE: ',test_mse)
    print('RMSE: ',test_rmse)
    print('R2: ',test_r2)

-----------------Train EValuation---------------
MAE:  268060.0140124581
MSE:  306756818582.05334
RMSE:  553856.3158275378
R2:  0.6217710708807319
-----------------Test EValuation---------------
MAE:  279557.4540451885
MSE:  252540889637.0159
RMSE:  502534.46611851
R2:  0.664523115689461
-----------------Train EValuation---------------
MAE:  268099.2286254692
MSE:  306756104247.9568
RMSE:  553855.6709540463
R2:  0.6217719516494844
-----------------Test EValuation---------------
MAE:  279614.7567457252
MSE:  252549201754.48557
RMSE:  502542.7362468644
R2:  0.664512073821054
-----------------Train EValuation---------------
MAE:  268101.6070829935
MSE:  306756099359.7596
RMSE:  553855.6665411663
R2:  0.6217719576765959
-----------------Test EValuation---------------
MAE:  279618.5794158427
MSE:  252550062888.5656
RMSE:  502543.5930230985
R2:  0.6645109298852004
-----------------Train EValuation---------------
MAE:  92496.04153147306
MSE:  103356587611.94029
RMSE:  321491.19367712125
R2:  
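
Long printed reports like the one above are hard to compare across models. As a sketch of one alternative, the metrics can be collected into a DataFrame and sorted by test R2. The synthetic data from `make_regression` and the two candidate models are assumptions chosen to keep the example fast; they are not the notebook's data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, roughly linear regression problem (illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    test_pred = model.predict(X_te)
    rows.append({
        "model": name,
        "train_r2": r2_score(y_tr, model.predict(X_tr)),
        "test_r2": r2_score(y_te, test_pred),
        "test_mae": mean_absolute_error(y_te, test_pred),
    })

# Best test R2 first.
report = (
    pd.DataFrame(rows)
    .sort_values("test_r2", ascending=False)
    .reset_index(drop=True)
)
print(report)
```

Keeping train and test scores side by side also makes overfitting easy to spot: a model whose train R2 is far above its test R2 has memorised the training data.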

In [28]:
### KNN and RandomForest perform well in the report above, so tune their hyperparameters
# Initialize parameter grids for hyperparameter tuning
knn_params = {"n_neighbors": [2, 3, 10, 20, 40, 50]}
rf_params = {"max_depth": [5, 8, 15, None, 10],
             "max_features": [5, 7, 1.0, 8],  # 1.0 uses all features (the former "auto" setting for regressors)
             "min_samples_split": [2, 8, 15, 20],
             "n_estimators": [100, 200, 500, 1000]}
gard_params={
    'loss':['squared_error','huber','absolute_error'],
    'criterion':['friedman_mse','squared_error'],
    'min_samples_split':[2,8,15,20],
    'n_estimators':[100,200,500],
    'max_depth':[5,8,10,15,None],
    'learning_rate':[0.1,0.01,0.02,0.03]
}


In [29]:
randomcv_models=[
    ('GB',GradientBoostingRegressor(),gard_params)
]

In [None]:
### The gradient boosting search is slow; it may take around 40 minutes

from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')
model_params={}
for name,estimator,params in randomcv_models:
    ### named `search` to avoid shadowing the standard-library `random` module
    search=RandomizedSearchCV(estimator=estimator,param_distributions=params,n_iter=100,cv=3,verbose=3,n_jobs=-1,scoring='r2')
    search.fit(x_train,y_train)
    
    model_params[name]=search.best_params_



Fitting 3 folds for each of 100 candidates, totalling 300 fits
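
For readers who want to see what the search returns without waiting for the full run, here is a minimal sketch on synthetic data. It tunes `KNeighborsRegressor` over a shortened list of `n_neighbors` values; the data, the shortened list, and the `random_state` values are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic regression problem (illustration only).
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

search = RandomizedSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions={"n_neighbors": [2, 3, 10, 20]},
    n_iter=4,        # equals the number of candidates, so each value is tried once
    cv=3,            # 3-fold cross-validation per candidate
    scoring="r2",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_     # dict with the winning setting
best_score = search.best_score_       # mean cross-validated R2 of that setting
best_model = search.best_estimator_   # refitted on all of X, y with best_params

print(best_params, round(best_score, 3))
```

Because `refit=True` is the default, `best_estimator_` can be used for prediction directly, without copying the parameters into a new model by hand.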


In [27]:
for model_name in model_params:
    print(f"{model_name} : {model_params[model_name]}")

RF : {'n_estimators': 200, 'min_samples_split': 2, 'max_features': 7, 'max_depth': 15}


In [None]:
models={
    'KNeighborsRegressor':KNeighborsRegressor(n_neighbors=10,n_jobs=-1),
    'RandomForest':RandomForestRegressor(n_estimators=1000,min_samples_split=2,max_features=5,max_depth=None,n_jobs=-1),
    ### learning_rate: the tuned value is not recorded here, so the scikit-learn default (0.1) applies
    'GRBoost':GradientBoostingRegressor(n_estimators=200,loss='huber',criterion='squared_error',min_samples_split=8,max_depth=10)
}

for name,model in models.items():
    model.fit(x_train,y_train)
    
    ## Predictions on the training and test sets
    y_train_pred=model.predict(x_train)
    y_test_pred=model.predict(x_test)
    
    train_mae,train_mse,train_rmse,train_r2=evaluate(y_train,y_train_pred)
    test_mae,test_mse,test_rmse,test_r2=evaluate(y_test,y_test_pred)
    
    print(f"=========================={name}================================")
    print('-----------------Train EValuation---------------')
    print('MAE: ',train_mae)
    print('MSE: ',train_mse)
    print('RMSE: ',train_rmse)
    print('R2: ',train_r2)
    print('-----------------Test EValuation---------------')
    print('MAE: ',test_mae)
    print('MSE: ',test_mse)
    print('RMSE: ',test_rmse)
    print('R2: ',test_r2)

-----------------Train EValuation---------------
MAE:  104563.31116158339
MSE:  132849858686.7294
RMSE:  364485.74551925814
R2:  0.8361970892220374
-----------------Test EValuation---------------
MAE:  118507.04670775219
MSE:  70247826401.43529
RMSE:  265043.0651826893
R2:  0.9066823516595116
-----------------Train EValuation---------------
MAE:  38894.866390768184
MSE:  15480155255.028006
RMSE:  124419.27204025912
R2:  0.9809130810139006
-----------------Test EValuation---------------
MAE:  97562.93502007551
MSE:  42278275253.50516
RMSE:  205616.81656300672
R2:  0.9438372769001667
-----------------Train EValuation---------------
MAE:  444001.9129655787
MSE:  289608497472.10596
RMSE:  538152.856976627
R2:  0.6429148262488749
-----------------Test EValuation---------------
MAE:  462598.99486367014
MSE:  360609737484.99457
RMSE:  600507.8996024902
R2:  0.5209637878547513
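
The problem statement mentions suggesting a price to new sellers. A minimal sketch of that last step is shown below: preprocessing and model are bundled in a scikit-learn `Pipeline`, so a new listing can be priced with one `predict` call. The six rows, the reduced column set, and the small forest are made up for illustration and are not the tuned model from this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows using a subset of the dataset's columns.
cars = pd.DataFrame({
    "fuel_type": ["Petrol", "Petrol", "Diesel", "Petrol", "Diesel", "Petrol"],
    "transmission_type": ["Manual", "Manual", "Manual", "Automatic", "Automatic", "Manual"],
    "vehicle_age": [9, 5, 6, 3, 2, 11],
    "km_driven": [120000, 20000, 30000, 15000, 10000, 60000],
    "selling_price": [120000, 550000, 570000, 800000, 1200000, 215000],
})
features = cars.drop(columns="selling_price")
target = cars["selling_price"]

price_pipeline = Pipeline([
    ("prep", ColumnTransformer([
        # Unknown categories in a new listing are encoded as all zeros.
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["fuel_type", "transmission_type"]),
        ("scaler", StandardScaler(), ["vehicle_age", "km_driven"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])
price_pipeline.fit(features, target)

# A hypothetical new listing with the same feature columns.
new_listing = pd.DataFrame([{
    "fuel_type": "Petrol",
    "transmission_type": "Manual",
    "vehicle_age": 4,
    "km_driven": 25000,
}])
suggested_price = float(price_pipeline.predict(new_listing)[0])
print(round(suggested_price))
```

Bundling the steps this way means the scaler and encoder are fitted only on the data passed to `fit`, and the same transformations are applied automatically at prediction time.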
