## Used Car Price Prediction

## Problem Statement
The used car market in India is a dynamic and ever-changing landscape. Prices can fluctuate wildly based on a variety of factors including the make and model of the car, its mileage, its condition and the current market conditions. As a result, it can be difficult for sellers to accurately price their cars.

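This dataset contains listings of used cars: the car's model, age, kilometres driven, seller type, fuel type, transmission type, mileage, engine size, maximum power, number of seats and selling price.

The goal is to predict the selling price of a used car using several machine learning techniques.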

In [261]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings

warnings.filterwarnings('ignore')

In [262]:
df = pd.read_csv('cardekho_dataset.csv', index_col=[0])

In [263]:
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


## Data cleaning

### Handling Missing Values

- Understand the dataset
- Handle missing values and duplicates
- Check data types

In [264]:
# check null values
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64
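
The checklist above also mentions duplicates and data types, which the null check does not cover. The following is a minimal sketch of those two checks on a small made-up frame; `toy` and its values are illustrative stand-ins for `df`, not rows taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the notebook's `df` (values are illustrative).
toy = pd.DataFrame({
    "model": ["Alto", "Grand", "Alto", "i20"],
    "vehicle_age": [9, 5, 9, 11],
    "selling_price": [120000, 550000, 120000, 215000],
})

# Count fully identical rows, then drop them while keeping the first occurrence.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

# Inspect the column types: non-numeric columns are candidates for encoding.
dtypes = toy.dtypes

print(f"duplicate rows: {n_duplicates}, rows after dropping: {len(deduped)}")
print(dtypes)
```

On the real data the same calls would be applied to `df` directly.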

In [265]:
# Remove columns that are not used as features
df.drop(columns=['car_name', 'brand'], inplace=True)

In [266]:
df.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [267]:
df['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [268]:
#numerical features
numeric_features = [feature for feature in df.columns if df[feature].dtype!='O']
print(f'Number of numerical features : {len(numeric_features)}')

Number of numerical features : 7


In [269]:
#categorical features
categorical_features = [feature for feature in df.columns if df[feature].dtype =='O']
print(f'Number of categorical features : {len(categorical_features)}')

Number of categorical features : 4


In [270]:
# Discrete features
discrete_features = [feature for feature in numeric_features if len(df[feature].unique())<=25]
print(f'Number of discrete features : {len(discrete_features)}')

Number of discrete features : 2


In [271]:
#continuous features
continuous_features = [feature for feature in numeric_features if feature not in discrete_features]
print(f'Number of continuous features : {len(continuous_features)}')

Number of continuous features : 5
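
The four cells above split the columns by type and then by number of distinct values. As a compact sketch of the same rule, the example below applies it to a made-up frame; the threshold is lowered from the notebook's 25 to 2 only because the toy data is tiny, and `is_numeric_dtype` is used here instead of the `dtype != 'O'` test.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame (illustrative values, not taken from the dataset).
toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "Petrol", "CNG"],
    "seats": [5, 5, 7, 5],
    "km_driven": [120000, 20000, 60000, 37000],
})

# The notebook uses 25 distinct values as the cut-off; 2 suits this toy data.
DISCRETE_MAX_UNIQUE = 2

numeric = [c for c in toy.columns if is_numeric_dtype(toy[c])]
categorical = [c for c in toy.columns if c not in numeric]
# A numeric column with few distinct values is treated as discrete.
discrete = [c for c in numeric if toy[c].nunique() <= DISCRETE_MAX_UNIQUE]
continuous = [c for c in numeric if c not in discrete]

print(numeric, categorical, discrete, continuous)
```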


In [272]:
# Dependent and independent features
x=df.drop(['selling_price'], axis=1)
y=df['selling_price']

In [273]:
y.head(50)

0      120000
1      550000
2      215000
3      226000
4      570000
5      350000
6      315000
7      410000
8     1050000
12     511000
14     425000
15     750000
16    3250000
17     650000
18     627000
19    1425000
20     425000
21     605000
22     600000
23     575000
25     425000
26     230000
27    1225000
29     750000
30     350000
31     600000
32     390000
33     145000
34     700000
35    1150000
36     340000
37     465000
38     125000
39     600000
40     380000
44     850000
45     598000
46     700000
50     410000
53     590000
54     850000
56     550000
58     625000
59     600000
60     445000
61     530000
63     410000
64     350000
66     520000
67     750000
Name: selling_price, dtype: int64

In [274]:
x.isnull().sum()

model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
dtype: int64

In [275]:
x.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


### Feature scaling and encoding

In [276]:
len(df['model'].value_counts())

120

In [277]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# 'model' has 120 distinct values, so encode it as integer codes rather than one-hot columns
le = LabelEncoder()
x['model'] = le.fit_transform(x['model'])

In [278]:
x.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5
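
`LabelEncoder` is designed for target labels and raises an error on a value it did not see during fitting. One possible alternative for a feature column is `OrdinalEncoder`, which assigns the same sorted integer codes but can map unseen values to a sentinel. The sketch below uses made-up frames (`train`, `new`) and an assumed sentinel of `-1`; it is not what the notebook runs.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: "Nexon" does not occur in the rows used for fitting.
train = pd.DataFrame({"model": ["Alto", "Grand", "i20", "Alto"]})
new = pd.DataFrame({"model": ["Grand", "Nexon"]})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Categories are sorted, so Alto -> 0, Grand -> 1, i20 -> 2.
train_codes = encoder.fit_transform(train[["model"]]).ravel()
new_codes = encoder.transform(new[["model"]]).ravel()

print(train_codes)
print(new_codes)
```

Either encoder imposes an arbitrary ordering on the model names, which tree-based models tolerate better than linear ones.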


In [279]:
len(df['seller_type'].value_counts()),len(df['transmission_type'].value_counts()),len(df['fuel_type'].value_counts())

(3, 2, 5)

In [281]:
# create column transformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


num_features = x.select_dtypes(exclude='object').columns
onehot_columns = ['seller_type','transmission_type','fuel_type']

numeric_transformer=StandardScaler()
oh_transformer=OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("oneHotEncoder",oh_transformer,onehot_columns),
        ("standardScaler",numeric_transformer,num_features)
    ],remainder='passthrough')

In [282]:
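from sklearn.model_selection import train_test_split

# Split the raw features so the rows can be inspected before preprocessing
# (parameters assumed to mirror the split performed after preprocessing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)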


In [283]:
x_train

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
14238,108,7,70252,Dealer,Diesel,Automatic,11.20,2400,215.00,5
1731,91,2,10000,Individual,Petrol,Manual,23.84,1199,84.00,5
13218,17,2,6000,Dealer,Diesel,Automatic,19.00,1950,241.30,5
403,25,7,63000,Dealer,Petrol,Manual,17.80,1497,117.30,5
13550,117,10,80292,Dealer,Petrol,Manual,20.36,1197,78.90,5
...,...,...,...,...,...,...,...,...,...,...
6581,42,7,127731,Dealer,Diesel,Manual,20.77,1248,88.80,7
17029,95,11,59000,Dealer,Petrol,Manual,16.09,1598,103.20,5
6839,100,7,20000,Individual,Petrol,Manual,20.51,998,67.04,5
1104,118,2,15000,Dealer,Petrol,Manual,18.60,1197,81.86,5


In [284]:
y_train

14238    1825000
1731      515000
13218    7500000
403       435000
13550     200000
          ...   
6581      665000
17029     249000
6839      250000
1104      620000
9295      960000
Name: selling_price, Length: 12328, dtype: int64

In [285]:
x=preprocessor.fit_transform(x)
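
Fitting the preprocessor on all of `x` lets the scaler see the test rows before the split, which can make test scores slightly optimistic. A minimal sketch of the split-first order follows, using made-up data and the hypothetical names `toy_x`, `toy_y` and `pre`; it illustrates the ordering and does not reproduce the notebook's results.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the notebook's features and target (illustrative values only).
toy_x = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel"] * 5,
    "km_driven": [120000, 20000, 60000, 37000, 30000,
                  45000, 80000, 15000, 52000, 99000],
})
toy_y = pd.Series([120000, 550000, 215000, 226000, 570000,
                   350000, 315000, 410000, 1050000, 511000])

# 1. Split the raw data first.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)

# 2. Fit the preprocessing on the training rows only.
pre = ColumnTransformer(
    [
        ("oneHotEncoder", OneHotEncoder(drop="first"), ["fuel_type"]),
        ("standardScaler", StandardScaler(), ["km_driven"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
x_tr_t = pre.fit_transform(x_tr)

# 3. Reuse the fitted statistics on the test rows.
x_te_t = pre.transform(x_te)

print(x_tr_t.shape, x_te_t.shape)
```

Wrapping the preprocessor and the estimator in a `sklearn.pipeline.Pipeline` achieves the same ordering automatically, including inside cross-validation.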

In [287]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
x_train.shape,x_test.shape

((12328, 14), (3083, 14))
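
The 14 columns in the shapes above follow from the encoder settings: `drop='first'` removes one indicator per categorical feature, and the seven numeric columns (including the label-encoded `model`) are scaled in place. The arithmetic can be checked with the category counts reported earlier:

```python
# Category counts reported by the notebook for the one-hot encoded columns.
category_counts = {"seller_type": 3, "transmission_type": 2, "fuel_type": 5}

# drop='first' removes one indicator column per categorical feature.
onehot_column_count = sum(n - 1 for n in category_counts.values())

# Numeric columns passed to the scaler: label-encoded model plus six numeric features.
numeric_columns = ["model", "vehicle_age", "km_driven", "mileage",
                   "engine", "max_power", "seats"]

total_columns = onehot_column_count + len(numeric_columns)
print(onehot_column_count, len(numeric_columns), total_columns)  # 7 7 14
```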

## Model Training

In [311]:
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
# XGBRFRegressor is XGBoost's random-forest variant; XGBRegressor is the gradient-boosted one
from xgboost import XGBRFRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [289]:
# Evaluate predictions and return (MAE, RMSE, R2)
def evaluation(true, predicted):
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae,rmse,r2_square
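
To make the three metrics concrete, the sketch below computes them by hand on four made-up values and checks the results against scikit-learn. The arrays `true` and `predicted` are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; the errors are -10, 10, -20, 20.
true = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 320.0, 380.0])

errors = true - predicted

# MAE: (10 + 10 + 20 + 20) / 4 = 15
mae = np.mean(np.abs(errors))

# RMSE: sqrt((100 + 100 + 400 + 400) / 4) = sqrt(250)
rmse = np.sqrt(np.mean(errors ** 2))

# R2: 1 - SS_res / SS_tot = 1 - 1000 / 50000 = 0.98
r2 = 1 - np.sum(errors ** 2) / np.sum((true - true.mean()) ** 2)

# The hand-computed values agree with scikit-learn's implementations.
sk_mae = mean_absolute_error(true, predicted)
sk_rmse = np.sqrt(mean_squared_error(true, predicted))
sk_r2 = r2_score(true, predicted)

print(mae, rmse, r2)
```

MAE and RMSE are in the units of the target (here, the selling price), while R2 is unitless, so only R2 is directly comparable across datasets.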

In [312]:
# Model training
models= {
    'Linear Regression' : LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'K-Neighbour Regression':KNeighborsRegressor(),
    'Random Forest Regressor':RandomForestRegressor(),
    'Decision Tree Regression':DecisionTreeRegressor(),
    'Adaboost Regressor': AdaBoostRegressor(),
    'Gradient boost regressor':GradientBoostingRegressor(),
    'XGBoost RF regressor':XGBRFRegressor()
}

for name, model in models.items():
    model.fit(x_train, y_train)

    # Predict on both splits
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Evaluate on the training and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluation(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluation(y_test, y_test_pred)

    print(name)

    print('model performance for training dataset')
    print('- Root mean squared error: {:.3f}'.format(model_train_rmse))
    print('- Mean absolute error: {:.3f}'.format(model_train_mae))
    print('- R2 score: {: .3f}'.format(model_train_r2))

    print('===============================')

    print('model performance for test dataset')
    print('- Root mean squared error: {:.3f}'.format(model_test_rmse))
    print('- Mean absolute error: {:.3f}'.format(model_test_mae))
    print('- R2 score: {:.3f}'.format(model_test_r2))

    print('_'*35)
    print('\n')

    

Linear Regression
model performance for training dataset
- Root mean squared error: 553855.667
- Mean absolute error: 268101.607
- R2 score:  0.622
model performance for test dataset
- Root mean squared error: 502543.593
- Mean absolute error: 279618.579
- R2 score: 0.665
___________________________________


Ridge
model performance for training dataset
- Root mean squared error: 553856.316
- Mean absolute error: 268059.801
- R2 score:  0.622
model performance for test dataset
- Root mean squared error: 502533.823
- Mean absolute error: 279557.217
- R2 score: 0.665
___________________________________


Lasso
model performance for training dataset
- Root mean squared error: 553855.671
- Mean absolute error: 268099.223
- R2 score:  0.622
model performance for test dataset
- Root mean squared error: 502542.670
- Mean absolute error: 279614.746
- R2 score: 0.665
___________________________________


K-Neighbour Regression
model performance for training dataset
- Root mean squared error: 32

In [307]:
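## Hyperparameter tuning
# max_features=1.0 uses all features, which is what 'auto' meant for regressors
# before that option was removed from recent scikit-learn releases
rf_params = {'max_depth':[5,8,15,None,10],
             'max_features':[5,7,1.0,8],
             'min_samples_split':[2,9,16,22],
             'n_estimators':[100,200,500,800,1000]}

# 'mse' was an alias of 'squared_error' and is no longer accepted by recent scikit-learn releases
gradient_params={'loss':['squared_error','huber','absolute_error'],
                 'criterion':['squared_error','friedman_mse'],
                 'min_samples_split':[2,10,15,20],
                 'n_estimators':[100,200,500,1000],
                 'max_depth':[None,5,8,10,15],
                 'learning_rate':[0.1,0.01,0.02,0.03]
                 }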


In [308]:
# Models list for hyperparameter tuning
randomcv_models = [('RF', RandomForestRegressor(), rf_params),
                   ('GradientBoost', GradientBoostingRegressor(), gradient_params)
                   ]

In [309]:
from sklearn.model_selection import RandomizedSearchCV

model_param = {}
for name, model, params in randomcv_models:
    # named `search` to avoid shadowing the standard-library `random` module
    search = RandomizedSearchCV(estimator=model,
                                param_distributions=params,
                                n_iter=100,
                                cv=3,
                                verbose=2,
                                n_jobs=-1)
    search.fit(x_train, y_train)
    model_param[name] = search.best_params_

for model_name in model_param:
    print(f'============= Best params for {model_name} ==============')
    print(model_param[model_name])

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Fitting 3 folds for each of 100 candidates, totalling 300 fits
{'n_estimators': 1000, 'min_samples_split': 2, 'max_features': 8, 'max_depth': None}
{'n_estimators': 1000, 'min_samples_split': 2, 'max_depth': 8, 'loss': 'huber', 'learning_rate': 0.03, 'criterion': 'squared_error'}
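
The search above samples candidates at random, so repeated runs can return different best parameters. As a hedged sketch on synthetic data, the example below fixes `random_state` for repeatability and reuses `best_estimator_`, which is already refit on the full training split, instead of copying the parameters by hand. The data, grid sizes and variable names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the preprocessed car features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# A deliberately small grid so the example runs in a few seconds.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 9],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    cv=3,
    scoring="r2",
    random_state=0,    # makes the sampled candidates repeatable
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is trained on all of X_tr.
best_model = search.best_estimator_
test_r2 = best_model.score(X_te, y_te)

print(search.best_params_)
print(f"test R2: {test_r2:.3f}")
```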


In [310]:
# Model training
models_hyper = {
    'Random Forest Regressor': RandomForestRegressor(n_estimators=1000, min_samples_split=2, max_features=8, max_depth=None),
    'Gradient Boosting Regressor': GradientBoostingRegressor(n_estimators=1000, loss='huber', min_samples_split=2, max_depth=8, learning_rate=0.03, criterion='squared_error')
}

for name, model in models_hyper.items():
    model.fit(x_train, y_train)

    # Predict on both splits
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Evaluate on the training and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluation(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluation(y_test, y_test_pred)

    print(name)

    print('model performance for training dataset')
    print('- Root mean squared error: {:.3f}'.format(model_train_rmse))
    print('- Mean absolute error: {:.3f}'.format(model_train_mae))
    print('- R2 score: {: .3f}'.format(model_train_r2))

    print('===============================')

    print('model performance for test dataset')
    print('- Root mean squared error: {:.3f}'.format(model_test_rmse))
    print('- Mean absolute error: {:.3f}'.format(model_test_mae))
    print('- R2 score: {:.3f}'.format(model_test_r2))

    print('_'*35)
    print('\n')

Random Forest Regressor
model performance for training dataset
- Root mean squared error: 126744.199
- Mean absolute error: 38976.273
- R2 score:  0.980
model performance for test dataset
- Root mean squared error: 215253.262
- Mean absolute error: 98623.835
- R2 score: 0.938
___________________________________


Gradient Boosting Regressor
model performance for training dataset
- Root mean squared error: 68028.971
- Mean absolute error: 43031.144
- R2 score:  0.994
model performance for test dataset
- Root mean squared error: 241929.016
- Mean absolute error: 97501.091
- R2 score: 0.922
___________________________________
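
Both tuned models score higher on the training set than on the test set. The short sketch below computes that gap from the R2 values printed above; a larger gap suggests more overfitting. The dictionary simply restates the reported numbers.

```python
# R2 scores copied from the tuned-model output above.
reported_r2 = {
    "Random Forest Regressor": {"train": 0.980, "test": 0.938},
    "Gradient Boosting Regressor": {"train": 0.994, "test": 0.922},
}

# Train-minus-test gap per model, rounded to the precision of the report.
gaps = {name: round(s["train"] - s["test"], 3) for name, s in reported_r2.items()}

# Model with the highest test R2.
best_on_test = max(reported_r2, key=lambda name: reported_r2[name]["test"])

for name, gap in gaps.items():
    print(f"{name}: train-test R2 gap = {gap:.3f}")
print(f"best test R2: {best_on_test}")
```

On these numbers the random forest has both the smaller gap (0.042 versus 0.072) and the higher test R2, although the gradient boosting model has a slightly lower test MAE.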


