### Used Car Price Prediction

1. Problem Statement:

    * This dataset consists of used cars sold on cardekho.com in India, along with important features of these cars.
    * The goal is to predict the price of a car from its input features.
    * The predictions can be used to suggest a price to a new seller based on market conditions.

2. Data Collection:

    * The dataset was collected by scraping the Cardekho website.
    * The data consists of 13 columns and 15411 rows.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import warnings
from plotly import express as px
from matplotlib import pyplot as plt

warnings.filterwarnings("ignore")

%matplotlib inline

In [2]:
df = pd.read_csv(r'./cardekho_imputated.csv',index_col=[0])
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [32]:
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [3]:
df.drop(columns=['brand','car_name'],inplace=True)

In [17]:
df.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [18]:
df['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [4]:
def print_feat_info(feat_name, count):
    print(f"{feat_name} features: {count}")

## Get all features
### Numeric features
num_feats=[feat for feat in df.columns if df[feat].dtype != 'object']
print_feat_info("Numeric", len(num_feats))

### Categorical features
cat_feats=[feat for feat in df.columns if df[feat].dtype == 'object']
print_feat_info("Categorical", len(cat_feats))

### Discrete features (numeric with at most 25 unique values)
dsc_feats=[feat for feat in num_feats if len(df[feat].unique()) <= 25]
print_feat_info("Discrete", len(dsc_feats))

### Continuous features (the remaining numeric features)
con_feats=[feat for feat in num_feats if feat not in dsc_feats]
print_feat_info("Continuous", len(con_feats))

Numeric features: 7
Categorical features: 4
Discrete features: 2
Continuous features: 5
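
The same four-way split can also be written with pandas' `select_dtypes` and `nunique`. The following is a minimal sketch on a toy frame; the helper name `split_feature_types`, the toy values and the threshold used in the call are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "model": ["Alto", "i20", "Alto", "City"],
    "fuel_type": ["Petrol", "Petrol", "Diesel", "Petrol"],
    "seats": [5, 5, 7, 5],
    "km_driven": [120000, 20000, 60000, 37000],
})

def split_feature_types(frame, discrete_threshold=25):
    """Return (numeric, categorical, discrete, continuous) column-name lists."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    # A numeric column counts as discrete when it has few distinct values.
    discrete = [c for c in numeric if frame[c].nunique() <= discrete_threshold]
    continuous = [c for c in numeric if c not in discrete]
    return numeric, categorical, discrete, continuous

# A threshold of 2 is used here only because the toy frame has four rows.
numeric, categorical, discrete, continuous = split_feature_types(toy, discrete_threshold=2)
```

The threshold is a judgment call: the notebook uses 25 distinct values as the cut-off between discrete and continuous features.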


In [5]:
## Divide data into Independent and Dependent features

from sklearn.model_selection import train_test_split

X=df.drop("selling_price",axis=1)
y=df["selling_price"]

In [27]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


#### **Feature Encoding and Scaling**

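##### **One-hot encoding for non-ordinal columns with few unique values**

* One-hot encoding converts a categorical variable into binary indicator columns, one per category, which gives ML algorithms a numeric form they can use for prediction.
* The `model` column has many unique values, so it is label-encoded instead; `seller_type`, `fuel_type` and `transmission_type` are one-hot encoded in the column transformer further down.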

In [6]:
from sklearn.preprocessing import LabelEncoder

lencoder = LabelEncoder()

X['model']=lencoder.fit_transform(X['model'])

In [7]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5
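
`LabelEncoder` is intended for target labels and raises an error when it meets a value it did not see during fitting. If new listings may contain unseen car models, one possible alternative is `OrdinalEncoder` with an explicit code for unknown values. This is only a sketch: the toy data and the `-1` sentinel are assumptions, not something the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "Nexon" appears only in the new listings (illustrative values).
train = pd.DataFrame({"model": ["Alto", "Grand", "i20", "Alto"]})
new = pd.DataFrame({"model": ["i20", "Nexon"]})

# Unseen categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["model"]])
new_codes = encoder.transform(new[["model"]])

print(train_codes.ravel())
print(new_codes.ravel())
```

Either encoder assigns integer codes in sorted order, so the codes carry no real ordering between car models; tree-based models tolerate this better than linear ones.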


In [8]:
len(X['seller_type'].unique()),len(X['fuel_type'].unique()),len(X['transmission_type'].unique())

(3, 5, 2)

In [41]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15411 entries, 0 to 19543
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   model              15411 non-null  int64  
 1   vehicle_age        15411 non-null  int64  
 2   km_driven          15411 non-null  int64  
 3   seller_type        15411 non-null  object 
 4   fuel_type          15411 non-null  object 
 5   transmission_type  15411 non-null  object 
 6   mileage            15411 non-null  float64
 7   engine             15411 non-null  int64  
 8   max_power          15411 non-null  float64
 9   seats              15411 non-null  int64  
dtypes: float64(2), int64(5), object(3)
memory usage: 1.3+ MB


In [10]:
## Create a ColumnTransformer with 2 transformers: one-hot encoding and standard scaling
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

num_transformer=StandardScaler()
oh_encoder=OneHotEncoder(drop='first')

oh_columns=['seller_type','fuel_type','transmission_type']
num_columns=X.select_dtypes(exclude='object').columns

columnTransformer = ColumnTransformer(
    [
        ("OneHotEncoder",oh_encoder,oh_columns),
        ('StandardScaler',num_transformer,num_columns)
    ],remainder='passthrough'
)

In [11]:
X=columnTransformer.fit_transform(X)

In [12]:
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,1.247335,-0.000276,-1.324259,-1.263352,-0.403022
1,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.225693,-0.343933,-0.690016,-0.192071,-0.554718,-0.432571,-0.403022
2,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.536377,1.647309,0.084924,-0.647583,-0.554718,-0.479113,-0.403022
3,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,-0.360667,0.292211,-0.936610,-0.779312,-0.403022
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.666211,-0.012060,-0.496281,0.735736,0.022918,-0.046502,-0.403022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15406,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.508844,0.983562,-0.869744,0.026096,-0.767733,-0.757204,-0.403022
15407,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.556082,-1.339555,-0.728763,-0.527711,-0.216964,-0.220803,2.073444
15408,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.407551,-0.012060,0.220539,0.344954,0.022918,0.068225,-0.403022
15409,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.426247,-0.343933,72.541850,-0.887326,1.329794,0.917158,2.073444


In [13]:
## Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
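
The cells above call `columnTransformer.fit_transform(X)` on the full dataset before splitting, so the scaling statistics are computed with the test rows included. A common way to avoid this is to split first and wrap the preprocessing and the model in a `Pipeline`. The sketch below shows the pattern on synthetic data; the column subset, the made-up price formula and the choice of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the car data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "fuel_type": rng.choice(["Petrol", "Diesel"], size=n),
    "vehicle_age": rng.integers(1, 15, size=n),
    "km_driven": rng.integers(5_000, 150_000, size=n),
})
toy_y = 900_000 - 40_000 * toy_X["vehicle_age"] - 2 * toy_X["km_driven"]

# Split BEFORE fitting any transformer.
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="first"), ["fuel_type"]),
    ("scale", StandardScaler(), ["vehicle_age", "km_driven"]),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# Encoder categories and scaler statistics are learned from training rows only.
pipeline.fit(X_tr, y_tr)
test_r2 = pipeline.score(X_te, y_te)
```

With this layout, `pipeline.predict` applies the already-fitted transformers to new rows, so the same object can be reused for scoring unseen listings.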

In [35]:
def print_metrics(model_name, true, predicted):
    """Print MAE, RMSE and R2 for one set of predictions."""
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    print(model_name)
    print("-"*18)
    print(f"Mean Absolute Error - {mean_absolute_error(true, predicted)}")
    print(f"Root Mean Squared Error - {np.sqrt(mean_squared_error(true, predicted))}")
    print(f"R2 Score - {r2_score(true, predicted)}")


def train_models(models_object):
    """Fit each model and print its training-set metrics, then its test-set metrics."""
    # Iterate over the dict that was passed in, not the global `models`
    for name, model in models_object.items():
        model.fit(X_train,y_train)
        # Training-set metrics
        y_train_pred=model.predict(X_train)
        print_metrics(name,y_train,y_train_pred)
        print("-"*18)
        # Test-set metrics
        y_test_pred=model.predict(X_test)
        print_metrics(name,y_test,y_test_pred)
        print("="*36)


from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "Linear Regression":LinearRegression(),
    "Lasso Regression":Lasso(),
    "Ridge Regression":Ridge(),
    "K-Neighbors Regression":KNeighborsRegressor(),
    "Decision Tree Regression":DecisionTreeRegressor(),
    "Random Forest Regression":RandomForestRegressor()
}

train_models(models)
    

Linear Regression
------------------
Mean Absolute Error - 268101.6070829937
Root Mean Squared Error - 553855.6665411663
R2 Score - 0.6217719576765959
------------------
Linear Regression
------------------
Mean Absolute Error - 279618.57941584283
Root Mean Squared Error - 502543.5930230985
R2 Score - 0.6645109298852004
Lasso Regression
------------------
Mean Absolute Error - 268099.22264981153
Root Mean Squared Error - 553855.6709544231
R2 Score - 0.6217719516489696
------------------
Lasso Regression
------------------
Mean Absolute Error - 279614.7461034126
Root Mean Squared Error - 502542.66963789385
R2 Score - 0.6645121627547996
Ridge Regression
------------------
Mean Absolute Error - 268059.8014688311
Root Mean Squared Error - 553856.3159709624
R2 Score - 0.6217710706848424
------------------
Ridge Regression
------------------
Mean Absolute Error - 279557.2168930274
Root Mean Squared Error - 502533.8229890289
R2 Score - 0.6645239743566809
K-Neighbors Regression
---------------
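
For reference, the three numbers reported by `print_metrics` can be reproduced with plain NumPy. The sketch below assumes the standard definitions of MAE, RMSE and R2; the helper name `regression_metrics` and the three sample prices are illustrative.

```python
import numpy as np

def regression_metrics(true, predicted):
    """Return (MAE, RMSE, R2) computed from their textbook definitions."""
    true = np.asarray(true, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = true - predicted
    mae = np.mean(np.abs(errors))            # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))     # penalises large errors more
    ss_res = np.sum(errors ** 2)             # residual sum of squares
    ss_tot = np.sum((true - true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics([100, 200, 300], [110, 190, 330])
print(mae, rmse, r2)
```

MAE and RMSE are in the units of `selling_price`, while R2 is unitless, which is why only R2 is directly comparable across differently scaled targets.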

In [36]:
### Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV

kn_params = {
    "n_neighbors":[2,3,5,10,20,30]
}

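rf_params = {
    "n_estimators":[100,200,500,1000],
    "max_depth":[5,7,8,None,10,15],
    "min_samples_split":[2,8,15,20],
    # "auto" was removed for RandomForestRegressor in scikit-learn 1.3; 1.0 (all features) is its equivalent
    "max_features":[5,7,1.0,8]
}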


models = [
    ("KN",KNeighborsRegressor(),kn_params),
    ("RF",RandomForestRegressor(),rf_params)
]

best_params = {}

for name, model, params in models:
    # Named `search` to avoid shadowing the standard-library `random` module
    search=RandomizedSearchCV(model,param_distributions=params,cv=3,verbose=2,n_jobs=-1)
    search.fit(X_train,y_train)

    best_params[name]=search.best_params_

for key,value in best_params.items():
    print("------------")
    print(f"Best values for {key} are {value}")


Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 10 candidates, totalling 30 fits
------------
Best values for KN are {'n_neighbors': 5}
------------
Best values for RF are {'n_estimators': 200, 'min_samples_split': 2, 'max_features': 8, 'max_depth': 10}
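
`RandomizedSearchCV` samples `n_iter` parameter combinations (10 by default), which is why the log above reports 10 candidates for the random forest but only 6 for K-neighbors, whose grid has just six values. Because the sampling is random, results can differ between runs unless `random_state` is fixed. The sketch below shows these options on synthetic data; the dataset, `n_iter=4` and the scoring choice are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for the car data.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    KNeighborsRegressor(),
    param_distributions={"n_neighbors": [2, 3, 5, 10, 20, 30]},
    n_iter=4,                               # number of sampled candidates
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=42,                        # makes the sampling repeatable
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
tuned_model = search.best_estimator_  # already refit on all of X_demo
```

Using `best_estimator_` directly avoids copying the best parameters by hand into a new model definition.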


In [37]:
models = {
    "KNeighbors Regressor":KNeighborsRegressor(n_neighbors=5,n_jobs=-1),
    "Random Forest Regression":RandomForestRegressor(n_estimators=200,min_samples_split=2,max_features=8,max_depth=10)
}

train_models(models)

KNeighbors Regressor
------------------
Mean Absolute Error - 91425.47047371836
Root Mean Squared Error - 325873.008771139
R2 Score - 0.8690645337531115
------------------
KNeighbors Regressor
------------------
Mean Absolute Error - 112526.34609146934
Root Mean Squared Error - 253024.39510875393
R2 Score - 0.9149536488136147
Random Forest Regression
------------------
Mean Absolute Error - 84040.06390639755
Root Mean Squared Error - 173760.48733741313
R2 Score - 0.9627726116296974
------------------
Random Forest Regression
------------------
Mean Absolute Error - 106366.81442259534
Root Mean Squared Error - 228862.57306361088
R2 Score - 0.9304206371468214
