# Used Car Price Prediction

## 1) Problem statement.

* This dataset comprises used cars sold on cardekho.com in India, along with important features of those cars.
* The goal is to predict the price of a car from its input features.
* The predictions can be used to suggest a price to new sellers based on market conditions.

## 2) Data Collection.
* The dataset was collected by scraping the CarDekho website.
* The data consists of 13 columns and 15411 rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
df = pd.read_csv('cardekho_imputated.csv',index_col=[0])

In [3]:
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


## Data Cleaning
1. Handle missing values
2. Handle duplicate values
3. Check data types
4. Understand the data

In [4]:
## Check whether the dataset has any null (NaN) values
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64
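
The cleaning checklist also lists duplicate handling and data-type checks, which the cells here do not show. A minimal sketch of both checks, run on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the CarDekho data):

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "model": ["Alto", "Alto", "i20", "City"],
    "vehicle_age": [9, 9, 11, 4],
    "km_driven": [120000, 120000, 60000, 35000],
    "selling_price": [120000, 120000, 215000, 700000],
})

# Count fully identical rows (every occurrence after the first is flagged).
n_duplicates = int(toy.duplicated().sum())

# Drop the repeated rows and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Inspect the data type of each column.
dtypes = toy.dtypes

print(n_duplicates)
print(deduped.shape)
print(dtypes)
```

Whether duplicates should actually be dropped depends on the data: two listings of the same model, age and mileage at the same price may be genuine separate sales.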

In [5]:
## Drop the columns that are not needed for modelling
df.drop(['car_name','brand'],axis=1,inplace=True)

In [6]:
df.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [7]:
df.size,len(df),len(df.columns)

(169521, 15411, 11)

In [8]:
df.shape,df.shape[0]

((15411, 11), 15411)

In [9]:
len(df['model'].unique())

120

In [10]:
df['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [11]:
df.columns

Index(['model', 'vehicle_age', 'km_driven', 'seller_type', 'fuel_type',
       'transmission_type', 'mileage', 'engine', 'max_power', 'seats',
       'selling_price'],
      dtype='object')

In [12]:
## Group the features by type
num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
print('Number of numerical features:', len(num_features))
cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']
print('Number of categorical features:', len(cat_features))
## Numerical features with at most 25 unique values are treated as discrete
discrete_features = [feature for feature in num_features if len(df[feature].unique()) <= 25]
print('Number of discrete features:', len(discrete_features))
continuous_features = [feature for feature in num_features if feature not in discrete_features]
print('Number of continuous features:', len(continuous_features))

Number of numerical features: 7
Number of categorical features: 4
Number of discrete features: 2
Number of continuous features: 5
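
The same grouping can be sketched with `select_dtypes` and `nunique`. The frame `toy` and the `threshold` of 3 are hypothetical, chosen so that the small example has one discrete and one continuous column; the notebook itself uses 25.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "model": ["Alto", "i20", "City", "Alto"],
    "fuel_type": ["Petrol", "Petrol", "Diesel", "CNG"],
    "seats": [5, 5, 5, 7],
    "km_driven": [120000, 60000, 35000, 81000],
})

# Split columns into numerical and non-numerical by dtype.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Hypothetical cut-off: numerical columns with few distinct values count as discrete.
threshold = 3
discrete_cols = [c for c in num_cols if toy[c].nunique() <= threshold]
continuous_cols = [c for c in num_cols if c not in discrete_cols]

print(num_cols, cat_cols, discrete_cols, continuous_cols)
```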


In [14]:
## Independent and Dependent features 
from sklearn.model_selection import train_test_split
X = df.drop(['selling_price'],axis=1)
y = df['selling_price']

In [15]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


## Feature Encoding and Scaling
#### One-hot encoding for non-ordinal columns with few unique values
One-hot encoding converts a categorical feature into binary indicator columns, one per category, so that it can be passed to ML algorithms in numerical form. The `model` column has 120 unique values, so it is label-encoded instead.

In [16]:
len(df['model'].unique())

120

In [18]:
df['model'].value_counts()

model
i20             906
Swift Dzire     890
Swift           781
Alto            778
City            757
               ... 
Altroz            1
C                 1
Ghost             1
Quattroporte      1
Gurkha            1
Name: count, Length: 120, dtype: int64

In [19]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['model'] = le.fit_transform(X['model'])

In [20]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5
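
`LabelEncoder` is designed for target labels and raises an error on categories it did not see during fitting. One possible alternative for a feature column is `OrdinalEncoder` with an explicit code for unknown values. The sketch below uses made-up frames `train` and `new`, and `-1` as an assumed placeholder code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: "Nexon" appears only at prediction time.
train = pd.DataFrame({"model": ["Alto", "i20", "City", "Alto"]})
new = pd.DataFrame({"model": ["City", "Nexon"]})

# Unknown categories are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train)
new_codes = enc.transform(new)

print(enc.categories_)
print(train_codes.ravel())
print(new_codes.ravel())
```

Either encoder assigns codes by sorted category order, which imposes an arbitrary ordering on car models; tree-based models tolerate this better than linear ones.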


In [21]:
len(X['seller_type'].unique()),len(X['fuel_type'].unique()),len(X['transmission_type'].unique())

(3, 5, 2)

In [22]:
## These columns have very few categories, so one-hot encoding is applied to them

In [23]:
## Create a column transformer with two transformers: one-hot encoding and standard scaling
## Note: num_features includes the label-encoded 'model' column, so it is scaled as well
num_features = X.select_dtypes(exclude='object').columns
onehot_columns = ['seller_type','fuel_type','transmission_type']
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer
num_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder",oh_transformer,onehot_columns),
        ("StandardScaler",num_transformer,num_features),
    ],remainder='passthrough'
)

In [24]:
X = preprocessor.fit_transform(X)
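
Note that the preprocessor here is fitted on the full feature matrix before the train/test split, so the scaler's mean and variance also reflect rows that later land in the test set. A possible alternative, sketched on synthetic data (the frame `toy`, the `price` formula and the step names are assumptions), is to split first and let a `Pipeline` fit the preprocessing on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the car data (values are generated, not real listings).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "fuel_type": rng.choice(["Petrol", "Diesel", "CNG"], size=n),
    "vehicle_age": rng.integers(1, 15, size=n),
    "km_driven": rng.integers(5_000, 150_000, size=n),
})
price = (900_000 - 40_000 * toy["vehicle_age"] - 2 * toy["km_driven"]
         + rng.normal(0, 10_000, size=n))

# Split BEFORE any fitting so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    toy, price, test_size=0.2, random_state=42
)

preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="first"), ["fuel_type"]),
    ("scale", StandardScaler(), ["vehicle_age", "km_driven"]),
])

# The pipeline fits the preprocessor on X_train only, then reuses it on X_test.
pipe = Pipeline([("prep", preprocessor), ("model", LinearRegression())])
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)

# The scaler's mean comes from the training rows alone.
scaler_mean_age = pipe.named_steps["prep"].named_transformers_["scale"].mean_[0]
print(round(test_r2, 3), scaler_mean_age)
```

A pipeline like this can also be passed directly to `RandomizedSearchCV`, so that each cross-validation fold refits the preprocessing on its own training portion.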

In [25]:
X

array([[ 1.        ,  0.        ,  0.        , ..., -1.32425883,
        -1.26335238, -0.40302241],
       [ 1.        ,  0.        ,  0.        , ..., -0.55471774,
        -0.43257082, -0.40302241],
       [ 1.        ,  0.        ,  0.        , ..., -0.55471774,
        -0.47911321, -0.40302241],
       ...,
       [ 0.        ,  0.        ,  1.        , ...,  0.02291783,
         0.06822523, -0.40302241],
       [ 0.        ,  0.        ,  1.        , ...,  1.32979434,
         0.91715831,  2.07344426],
       [ 0.        ,  0.        ,  0.        , ...,  0.02099878,
         0.39588361, -0.40302241]], shape=(15411, 14))

In [26]:
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,1.247335,-0.000276,-1.324259,-1.263352,-0.403022
1,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.225693,-0.343933,-0.690016,-0.192071,-0.554718,-0.432571,-0.403022
2,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.536377,1.647309,0.084924,-0.647583,-0.554718,-0.479113,-0.403022
3,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,-0.360667,0.292211,-0.936610,-0.779312,-0.403022
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.666211,-0.012060,-0.496281,0.735736,0.022918,-0.046502,-0.403022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15406,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.508844,0.983562,-0.869744,0.026096,-0.767733,-0.757204,-0.403022
15407,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.556082,-1.339555,-0.728763,-0.527711,-0.216964,-0.220803,2.073444
15408,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.407551,-0.012060,0.220539,0.344954,0.022918,0.068225,-0.403022
15409,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.426247,-0.343933,72.541850,-0.887326,1.329794,0.917158,2.073444


In [27]:
## Split the data into training (80%) and test (20%) sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((12328, 14), (3083, 14), (12328,), (3083,))

In [28]:
X_train

array([[ 0.        ,  0.        ,  1.        , ...,  1.75390551,
         2.66249771, -0.40302241],
       [ 1.        ,  0.        ,  0.        , ..., -0.55087963,
        -0.38602844, -0.40302241],
       [ 0.        ,  0.        ,  1.        , ...,  0.89033072,
         3.27453006, -0.40302241],
       ...,
       [ 1.        ,  0.        ,  0.        , ..., -0.9366097 ,
        -0.78070786, -0.40302241],
       [ 0.        ,  0.        ,  0.        , ..., -0.55471774,
        -0.43582879, -0.40302241],
       [ 1.        ,  0.        ,  0.        , ..., -0.04616815,
         0.06194201, -0.40302241]], shape=(12328, 14))

## Model Training and Model Selection

In [29]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [30]:
## Helper that returns MAE, MSE, RMSE and the R2 score for a set of predictions
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true,predicted)
    mse = mean_squared_error(true,predicted)
    rmse = np.sqrt(mse)
    r2Score = r2_score(true,predicted)
    return mae, mse , rmse, r2Score
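
As a quick check of what these four metrics mean, here is a worked example on three made-up values (the arrays `true` and `predicted` are assumptions for illustration). The errors are 10, 10 and 30, so MAE is 50/3, MSE is 1100/3, RMSE is its square root, and R2 is 1 - 1100/20000.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_model(true, predicted):
    """Return MAE, MSE, RMSE and R2 for a set of predictions."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predicted)
    return mae, mse, rmse, r2


# Made-up values chosen so the arithmetic is easy to follow by hand.
true = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

mae, mse, rmse, r2 = evaluate_model(true, predicted)
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
# MAE=16.6667 MSE=366.6667 RMSE=19.1485 R2=0.9450
```

MAE and RMSE are in the same unit as the target (rupees for `selling_price`), while MSE is in squared units, which is why its values in the outputs are so large.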

In [31]:
## Train each model with default hyperparameters and compare train/test performance
models = {
    "Linear Regression" : LinearRegression(),
    "Lasso" : Lasso(),
    "Ridge" : Ridge(),
    "K-Neighbors Regressor" : KNeighborsRegressor(),
    "Decision Tree" : DecisionTreeRegressor(),
    "Random Forest Regressor" : RandomForestRegressor(),
    "AdaBoostRegressor" : AdaBoostRegressor(),
    "Gradient boost regressor": GradientBoostingRegressor(),
}

for name, model in models.items():
    model.fit(X_train,y_train) ## train the model on the training split

    ## predict on the training data first, then on the test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    ## evaluate both sets of predictions
    model_train_mae, model_train_mse, model_train_rmse, model_train_r2score = evaluate_model(y_train,y_train_pred)
    model_test_mae, model_test_mse, model_test_rmse, model_test_r2score = evaluate_model(y_test,y_test_pred)

    print(name)

    print('model performance for training dataset')
    print('-mean squared Error : {:.4f}'.format(model_train_mse))
    print('-mean absolute Error : {:.4f}'.format(model_train_mae))
    print('-Root mean squared Error : {:.4f}'.format(model_train_rmse))
    print('-r2 score : {:.4f}'.format(model_train_r2score))

    print('----------------------------------')

    print('model performance for testing dataset')
    print('-mean squared Error : {:.4f}'.format(model_test_mse))
    print('-mean absolute Error : {:.4f}'.format(model_test_mae))
    print('-Root mean squared Error : {:.4f}'.format(model_test_rmse))
    print('-r2 score : {:.4f}'.format(model_test_r2score))
    print('='*35)
    print('\n')

Linear Regression
model performance for training dataset
-mean squared Error : 306756099359.7596
-mean absolute Error : 268101.6071
-Root mean squared Error : 553855.6665
-r2 score : 0.6218
----------------------------------
model performance for testing dataset
-mean squared Error : 252550062888.5655
-mean absolute Error : 279618.5794
-Root mean squared Error : 502543.5930
-r2 score : 0.6645


Lasso
model performance for training dataset
-mean squared Error : 306756104248.3742
-mean absolute Error : 268099.2226
-Root mean squared Error : 553855.6710
-r2 score : 0.6218
----------------------------------
model performance for testing dataset
-mean squared Error : 252549134806.7813
-mean absolute Error : 279614.7461
-Root mean squared Error : 502542.6696
-r2 score : 0.6645


Ridge
model performance for training dataset
-mean squared Error : 306756818740.9266
-mean absolute Error : 268059.8015
-Root mean squared Error : 553856.3160
-r2 score : 0.6218
----------------------------------
mod

In [56]:
## KNN and Random Forest perform well here, so they are candidates for hyperparameter tuning.
## The cells below tune Random Forest and Gradient Boosting (the KNN grid is left commented out).
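
When many models are compared, the printed blocks are easier to scan as a table. A sketch under assumptions: synthetic data from `make_regression` stands in for the car data, and the `results` frame and its column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem used only to make the sketch runnable.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# Collect one row of test metrics per model.
rows = []
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "model": name,
        "test_rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "test_r2": r2_score(y_test, pred),
    })

# Best model first.
results = (
    pd.DataFrame(rows)
    .sort_values("test_r2", ascending=False)
    .reset_index(drop=True)
)
print(results)
```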

In [32]:
## Parameter grids for hyperparameter tuning
# knn_params = {"n_neighbors" : [2,3,4,10,20,30,40,50,100]}
rf_params = {
    "max_depth":[5,8,15,None,10],
    ## 'auto' was removed from max_features in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
    "max_features":[5,7,1.0,8],
    "min_samples_split":[2,8,15,20],
    "n_estimators":[100,200,500,1000],
}
# ada_params = {
#     'n_estimators' : [50,60,75,80,90,100,120,150,180],
#     'loss' : ['linear','square','exponential']
# }
gradient_params={
    "loss":['squared_error','absolute_error','huber','quantile'],
    ## 'mse' was removed in scikit-learn 1.2; 'squared_error' is the same criterion
    'criterion':['friedman_mse','squared_error'],
    'min_samples_split':[2,8,15,20],
    'n_estimators':[100,200,500,1000],
    'learning_rate':[0.1,0.01,0.02,0.03],
    'max_depth':[5,8,15,None,10]
}

In [33]:
## Models and parameter grids for hyperparameter tuning
randomcv_models = [
    # ("KNN",KNeighborsRegressor(),knn_params),
    ("RF",RandomForestRegressor(),rf_params),
    ("GBR",GradientBoostingRegressor(),gradient_params)
]

In [34]:
## Hyperparameter tuning with randomized search (100 sampled candidates, 5-fold CV)
from sklearn.model_selection import RandomizedSearchCV
model_param = {}
for name,model,params in randomcv_models:
    random_search = RandomizedSearchCV(estimator=model,param_distributions=params,n_iter=100,cv=5,verbose=2,n_jobs=-1)
    random_search.fit(X_train,y_train)
    model_param[name] = random_search.best_params_

for model_name in model_param:
    print(f"-------------best params for {model_name}-----------------")
    print(model_param[model_name])

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Fitting 5 folds for each of 100 candidates, totalling 500 fits
-------------best params for RF-----------------
{'n_estimators': 500, 'min_samples_split': 2, 'max_features': 7, 'max_depth': 15}
-------------best params for GBR-----------------
{'n_estimators': 1000, 'min_samples_split': 20, 'max_depth': 10, 'loss': 'huber', 'learning_rate': 0.03, 'criterion': 'friedman_mse'}
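
The search above does not set `random_state`, so the sampled candidates and the best parameters can differ between runs. A small sketch of a seeded search on synthetic data (the reduced grid, `n_iter=4` and `cv=3` are assumptions to keep it fast) that also reuses `best_estimator_` instead of copying parameters by hand:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data used only to make the sketch runnable.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Deliberately tiny grid; the notebook's grids are much larger.
param_distributions = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 8],
    "n_estimators": [10, 20],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    cv=3,
    scoring="r2",
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X_train, y_train)

# With the default refit=True, the best candidate is already refitted on X_train.
best_model = search.best_estimator_
test_r2 = best_model.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```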


In [35]:
## Retrain the models with the best parameters found by the search
models = {
    # "K-Neighbors Regressor" : KNeighborsRegressor(n_neighbors=2,n_jobs=-1),
    "Random Forest Regressor" : RandomForestRegressor(n_estimators=500,min_samples_split=2,max_features=7,max_depth=15),
    "Gradient Regressor" : GradientBoostingRegressor(n_estimators=1000,min_samples_split=20,max_depth=10,loss='huber',learning_rate=0.03,criterion='friedman_mse')
}

for name, model in models.items():
    model.fit(X_train,y_train) ## train the model on the training split

    ## predict on the training data first, then on the test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    ## evaluate both sets of predictions
    model_train_mae, model_train_mse, model_train_rmse, model_train_r2score = evaluate_model(y_train,y_train_pred)
    model_test_mae, model_test_mse, model_test_rmse, model_test_r2score = evaluate_model(y_test,y_test_pred)

    print(name)

    print('model performance for training dataset')
    print('-mean squared Error : {:.4f}'.format(model_train_mse))
    print('-mean absolute Error : {:.4f}'.format(model_train_mae))
    print('-Root mean squared Error : {:.4f}'.format(model_train_rmse))
    print('-r2 score : {:.4f}'.format(model_train_r2score))

    print('----------------------------------')

    print('model performance for testing dataset')
    print('-mean squared Error : {:.4f}'.format(model_test_mse))
    print('-mean absolute Error : {:.4f}'.format(model_test_mae))
    print('-Root mean squared Error : {:.4f}'.format(model_test_rmse))
    print('-r2 score : {:.4f}'.format(model_test_r2score))
    print('='*35)
    print('\n')

Random Forest Regressor
model performance for training dataset
-mean squared Error : 19646499949.7676
-mean absolute Error : 54544.2872
-Root mean squared Error : 140165.9729
-r2 score : 0.9758
----------------------------------
model performance for testing dataset
-mean squared Error : 45106414460.1889
-mean absolute Error : 97530.0141
-Root mean squared Error : 212382.7075
-r2 score : 0.9401


Gradient Regressor
model performance for training dataset
-mean squared Error : 5992210308.8528
-mean absolute Error : 41554.3619
-Root mean squared Error : 77409.3684
-r2 score : 0.9926
----------------------------------
model performance for testing dataset
-mean squared Error : 49306381805.4508
-mean absolute Error : 95907.0260
-Root mean squared Error : 222050.4037
-r2 score : 0.9345
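
Both tuned models score higher on the training set than on the test set, with the larger gap for gradient boosting (0.9926 vs 0.9345). One way to estimate such a gap without relying on a single split is cross-validation with training scores returned. The sketch below uses synthetic data and a small model as stand-ins; none of its numbers relate to the car dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Synthetic, noisy regression problem used only to make the sketch runnable.
X, y = make_regression(n_samples=200, n_features=6, noise=20.0, random_state=0)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)

# return_train_score=True reports R2 on each fold's training part as well.
scores = cross_validate(model, X, y, cv=5, scoring="r2", return_train_score=True)

train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()
gap = train_r2 - test_r2
print(f"train R2={train_r2:.3f}  validation R2={test_r2:.3f}  gap={gap:.3f}")
```

A large gap suggests that a shallower `max_depth` or fewer estimators may generalise about as well with less overfitting.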


