## Car Price Prediction

The problem at hand is to model the selling price of used cars from the features given in the datasets. The client will use the model to predict the price of a car of their choice. Your mission as a data scientist, should you choose to accept it, is to maximize their chance of getting the car while making sure that they don't overpay.


![79e15-sell-your-old-car%20%281%29.jpg](attachment:79e15-sell-your-old-car%20%281%29.jpg)

Let's get to it.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
from datetime import date

## A Look At The Data

In [2]:
model_df = pd.read_csv('/kaggle/input/used-car-price-data/model_data.csv')
car_df = pd.read_csv('/kaggle/input/used-car-price-data/car_data.csv')

In [3]:
car_df.head()

| | Model | Selling Price | Kilometers Driven | Year | Owner | Fuel Type | Transmission | Insurance | Car Condition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MarutiWagonR1.0LXI | 312165 | 82238 | 2014 | First Owner | Petrol + CNG | MANUAL | Expired | 4.2 |
| 1 | ToyotaEtiosLiva | 313799 | 30558 | 2013 | First Owner | Petrol | MANUAL | 12-09-2021 | 4.4 |
| 2 | MarutiAlto800 | 295999 | 22164 | 2018 | First Owner | Petrol | MANUAL | 01-12-2020 | 4.8 |
| 3 | MarutiSwift | 435199 | 30535 | 2013 | First Owner | Diesel | MANUAL | Comp | 4.3 |
| 4 | MarutiWagonR1.0 | 289099 | 15738 | 2013 | First Owner | Petrol | MANUAL | 11-08-2021 | 4.3 |


In [4]:
model_df.head()

| | Model | Current Price |
|---|---|---|
| 0 | HyundaiElitei20Sportz(O)1.4CRDi | Rs.7.69 Lakh |
| 1 | MarutiErtigaZXISMARTHYBRID | Rs.9.27 Lakh |
| 2 | MarutiVitaraBrezzaLDI | Rs.7,62,742 |
| 3 | FordEcosport1.5TITANIUMTIVCT | Rs.7.64 Lakh |
| 4 | HyundaiVernaFLUIDIC1.4CRDI | Rs.9,99,900 |


Both dataframes share a Model column, so you can merge them on it to get a single dataframe that holds each car's details together with its current price.

In [7]:
df = pd.merge(car_df, model_df, on='Model')
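
An inner merge silently drops cars whose model has no match, so it can be worth checking how many rows are affected. The sketch below uses small made-up frames (the values and the `UnknownModel` row are hypothetical) and the `indicator` and `validate` options of `pd.merge`; `validate='many_to_one'` raises an error if the price table lists a model more than once, which may or may not hold for the real `model_df`.

```python
import pandas as pd

# Toy stand-ins for car_df and model_df (hypothetical values)
cars = pd.DataFrame({
    'Model': ['MarutiSwift', 'MarutiAlto800', 'UnknownModel'],
    'Selling Price': [435199, 295999, 100000],
})
models = pd.DataFrame({
    'Model': ['MarutiSwift', 'MarutiAlto800'],
    'Current Price': ['Rs.5.49 Lakh', 'Rs.3.25 Lakh'],
})

# A left merge keeps every car; the _merge column records where each row came from
merged = pd.merge(cars, models, on='Model', how='left',
                  indicator=True, validate='many_to_one')

# Rows flagged 'left_only' are cars without a matching model entry
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['Model', 'Selling Price']])
```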

You can look at the two datasets together now.

In [8]:
df.describe(include='all')

| | Model | Selling Price | Kilometers Driven | Year | Owner | Fuel Type | Transmission | Insurance | Car Condition | Current Price |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2237 | 2237.0 | 2237.0 | 2237.0 | 2237 | 2237 | 2237 | 2223 | 2237.0 | 2235 |
| unique | 434 | | | | 3 | 4 | 58 | 473 | | 325 |
| top | MarutiSwift | | | | First Owner | Petrol | MANUAL | Expired | | Rs.5.49 Lakh |
| freq | 118 | | | | 1707 | 1384 | 1909 | 248 | | 131 |
| mean | | 418443.1 | 61928.605275 | 2013.763523 | | | | | 4.370854 | |
| std | | 228051.6 | 42260.955917 | 2.874686 | | | | | 0.28899 | |
| min | | 75299.0 | 913.0 | 2006.0 | | | | | 3.0 | |
| 25% | | 272099.0 | 32137.0 | 2012.0 | | | | | 4.2 | |
| 50% | | 355799.0 | 55430.0 | 2014.0 | | | | | 4.3 | |
| 75% | | 503299.0 | 83427.0 | 2016.0 | | | | | 4.6 | |


Check for null values, shown as the percentage of rows per column.

In [9]:
df.isna().mean()*100

Model                0.000000
Selling Price        0.000000
Kilometers Driven    0.000000
Year                 0.000000
Owner                0.000000
Fuel Type            0.000000
Transmission         0.000000
Insurance            0.625838
Car Condition        0.000000
Current Price        0.089405
dtype: float64

We have very few null values, and only in the Insurance and Current Price columns.

In [10]:
fig = px.histogram(df, 'Selling Price')
fig.show()

There are a few records with a selling price of 0. Maybe you can get those cars for free? We all know that is not possible, so we can go ahead and remove these records.

In [11]:
df = df[df['Selling Price'] != 0]

The Current Price column is in a very raw format. Some records have an `Rs.` prefix and a `Lakh` suffix (1 lakh = 100,000), while others are plain numbers with comma separators. You need to convert the column to integers so that you can analyze it and compare it with the Selling Price column.

In [12]:
def format_price(price):
    # Normalise strings such as 'Rs.7.69 Lakh' or 'Rs.7,62,742' to a plain digit string.
    # Missing values pass through as the string 'nan' and are handled in the next cell.
    price = str(price).replace('Rs.', '').replace(',', '').strip()
    if price.endswith('Lakh'):
        # 1 lakh = 100,000 rupees; round() guards against float noise
        lakhs = float(price.replace('Lakh', '').strip())
        price = str(int(round(lakhs * 100000)))
    return price
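
As a quick sanity check of the arithmetic, here is a small standalone sketch of the same conversion applied to a few sample strings. The helper name `parse_price` and the last two sample strings are made up for illustration; it uses `Decimal` to sidestep floating-point rounding and is not the function used in this notebook.

```python
from decimal import Decimal

def parse_price(text):
    """Hypothetical helper: convert 'Rs.7.69 Lakh' or 'Rs.7,62,742' to rupees."""
    cleaned = text.replace('Rs.', '').replace(',', '').strip()
    if cleaned.endswith('Lakh'):
        # 1 lakh = 100,000 rupees
        lakhs = Decimal(cleaned[:-len('Lakh')].strip())
        return int(lakhs * 100000)
    return int(cleaned)

samples = ['Rs.7.69 Lakh', 'Rs.7,62,742', 'Rs.5.5 Lakh', 'Rs.9 Lakh']
parsed = [parse_price(s) for s in samples]
print(parsed)
```

Values with one, two, or no decimal places all need a different number of padding zeros, which is why multiplying by 100,000 is less error-prone than string padding.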

In [13]:
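df['Current Price'] = df['Current Price'].apply(format_price)
# Zero out only the missing prices; assigning to df[mask] would overwrite every column of those rows
df.loc[df['Current Price'] == 'nan', 'Current Price'] = 0
df['Current Price'] = df['Current Price'].astype(int)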

In [14]:
fig = px.histogram(df, 'Current Price')
fig.show()

Whoa! Some cars have a Current Price of 0. It looks like the earlier null count was deceptive: as you already know, no one is giving away cars for free. You will have to handle these records.
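
One way missing values can slip past a later null check is string conversion: once a column has been passed through `str`, a `NaN` becomes the text `'nan'` and `isna()` no longer flags it. The sketch below shows this on a made-up series and uses `pd.to_numeric` with `errors='coerce'` to turn such entries back into real missing values; whether this is the only source of the zeros above is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical price column with one missing entry
prices = pd.Series(['769000', np.nan, '762742'])

as_text = prices.apply(str)            # NaN turns into the string 'nan'
hidden = int(as_text.isna().sum())     # the missing value is no longer detected

# Coercing to numbers restores a real NaN that isna() can see again
numeric = pd.to_numeric(as_text, errors='coerce')
recovered = int(numeric.isna().sum())

print(hidden, recovered)
```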

In [15]:
year = date.today().year
df['Age'] = year - df['Year']
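
Because `date.today()` changes over time, the Age feature shifts every year the notebook is re-run. A minimal sketch of a reproducible alternative is to pin the reference year; the constant `REFERENCE_YEAR = 2021` is an assumption, chosen because it matches the ages shown in the outputs further down (a 2014 car has Age 7).

```python
import pandas as pd

REFERENCE_YEAR = 2021  # assumption: the year in which the data was collected

# Hypothetical registration years
years = pd.Series([2014, 2015, 2017])
age = REFERENCE_YEAR - years
print(age.tolist())
```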

In [16]:
df = df[df['Selling Price'] != 0]
fig = px.scatter(x=df['Age'], y=df['Kilometers Driven'])
fig.show()

In [17]:
fig = px.scatter(x=df['Kilometers Driven'], y=df['Selling Price'])
fig.show()

In [18]:
fig = px.scatter(x=df['Age'], y=df['Selling Price'])
fig.show()

In [19]:
fig = px.scatter(df['Selling Price'], color=df['Owner'])
fig.show()

In [20]:
df['Transmission'].unique()

array(['MANUAL', 'MH12', 'TS07', 'KA01', 'MH05', 'DL5C', 'DL9C', 'MH04',
       'TS08', 'AUTOMATIC', 'UP14', 'UP32', 'HR03', 'MH01', 'DL2C',
       'KA05', 'KA50', 'KA53', 'DL12', 'DL11', 'GJ27', 'TN12', 'TN02',
       'HR26', 'MH03', 'MH47', 'TS09', 'RJ14', 'TN06', 'MH43', 'DL4C',
       'KA02', 'MH02', 'RJ45', 'DL3C', 'TN22', 'KA04', 'MH46', 'KA51',
       'PB91', 'DL8C', 'GJ18', 'HR51', 'DL10', 'HR29', 'KA03', 'DL14',
       'GJ05', 'GJ01', 'PB10', 'MH14', 'UP78', 'GJ06', 'Ch01', 'HR05',
       'HR12', 'DL1C', 'PB11'], dtype=object)

The Transmission column has many values other than MANUAL or AUTOMATIC (they appear to be vehicle registration codes that ended up in the wrong column). You need to handle these, and one way to do it is to replace all of them with the mode (the most frequent value), which is MANUAL in this case.

In [21]:
def clean_transmission(trans):
    
    if 'MANUAL' != trans and 'AUTOMATIC' != trans:
        trans = "MANUAL"
    
    return trans

df['Transmission'] = df['Transmission'].apply(clean_transmission)
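
The same replacement can be written without a row-wise `apply`. The sketch below, on a made-up series, keeps the valid labels and fills everything else with the mode computed from the valid entries; the names `valid` and `cleaned` are illustrative only.

```python
import pandas as pd

# Hypothetical column mixing valid labels with stray codes
trans = pd.Series(['MANUAL', 'MH12', 'AUTOMATIC', 'DL5C', 'MANUAL'])
valid = ['MANUAL', 'AUTOMATIC']

# Most frequent value among the valid labels only
mode = trans[trans.isin(valid)].mode()[0]

# where() keeps entries that satisfy the condition and replaces the rest
cleaned = trans.where(trans.isin(valid), mode)
print(cleaned.tolist())
```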

The Insurance column had some null values. Let's replace them with the most frequent value.

In [22]:
from sklearn.impute import SimpleImputer

mode_imputer = SimpleImputer(strategy="most_frequent")
# fit_transform returns a 2-D array; ravel() flattens it back to a single column
df['Insurance'] = mode_imputer.fit_transform(df['Insurance'].values.reshape(-1, 1)).ravel()

Let's derive a feature from the insurance column, to indicate whether the insurance has expired or not.

In [23]:
df['Insurance_Expired'] = 0
df.loc[df['Insurance'] == 'Expired', 'Insurance_Expired'] = 1

In [24]:
fig = px.scatter(df['Selling Price'], color=df['Insurance_Expired'])
fig.show()

In [25]:
fig = px.scatter(x=df['Current Price'], y=df['Selling Price'], color=df['Car Condition'])
fig.show()

## Cleaning Data

The Current Price column had some records with a value of 0. For these records, we can approximate the current price by adding the typical difference between current price and selling price to the record's selling price.

You can also add another column to tell the model that the current price was originally missing, so that it takes the current price of such records with a pinch of salt. A missing current price might mean that the car is too old and no longer sold in the market.

In [26]:
df['No_Current_Price'] = 0
df.loc[df['Current Price'] == 0, 'No_Current_Price'] = 1

In [27]:
df['diff'] = df['Current Price'] - df['Selling Price']

Since the prices are a bit skewed, the median is a better measure of central tendency than the mean.

In [28]:
med_diff = df[df['diff'] > 0]['diff'].median()

In [30]:
def set_current_price(row):
    # Re-estimate the current price when it is missing (0) or lower than the selling price
    if row['Current Price'] == 0 or row['diff'] < 0:
        return row['Selling Price'] + med_diff
    return row['Current Price']

df['Current Price'] = df.apply(set_current_price, axis=1)
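
To make the imputation concrete, here is a worked example on four made-up cars, written with a boolean mask instead of `apply`. The numbers are hypothetical; the logic mirrors the steps above (median of the positive differences, then fill rows whose current price is 0 or below the selling price).

```python
import pandas as pd

# Hypothetical records: the second has no current price, the third is priced below its selling price
toy = pd.DataFrame({
    'Selling Price': [300000, 250000, 400000, 200000],
    'Current Price': [465000.0, 0.0, 380000.0, 350000.0],
})
toy['diff'] = toy['Current Price'] - toy['Selling Price']

# Median of the positive differences only: median(165000, 150000)
med = toy.loc[toy['diff'] > 0, 'diff'].median()

# Rows that need a re-estimated current price
needs_fix = (toy['Current Price'] == 0) | (toy['diff'] < 0)
toy.loc[needs_fix, 'Current Price'] = toy.loc[needs_fix, 'Selling Price'] + med
print(toy[['Selling Price', 'Current Price']])
```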

In [31]:
df.head()

| | Model | Selling Price | Kilometers Driven | Year | Owner | Fuel Type | Transmission | Insurance | Car Condition | Current Price | Age | Insurance_Expired | No_Current_Price | diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MarutiWagonR1.0LXI | 312165 | 82238 | 2014 | First Owner | Petrol + CNG | MANUAL | Expired | 4.2 | 465000.0 | 7 | 1 | 0 | 152835 |
| 1 | MarutiWagonR1.0LXI | 242499 | 88514 | 2015 | Second Owner | Petrol + CNG | MANUAL | 26-07-2021 | 4.4 | 465000.0 | 6 | 0 | 0 | 222501 |
| 2 | MarutiWagonR1.0LXI | 381699 | 29735 | 2017 | Second Owner | Petrol + CNG | MANUAL | 18-09-2021 | 4.3 | 465000.0 | 4 | 0 | 0 | 83301 |
| 3 | MarutiWagonR1.0LXI | 181999 | 153709 | 2013 | First Owner | Petrol + CNG | MANUAL | 25-05-2021 | 4.1 | 465000.0 | 8 | 0 | 0 | 283001 |
| 4 | MarutiWagonR1.0LXI | 239499 | 88691 | 2012 | Second Owner | Petrol + CNG | MANUAL | 15-10-2021 | 4.4 | 465000.0 | 9 | 0 | 0 | 225501 |


You can now drop the Model, Year, Insurance and diff columns.

In [32]:
df.drop(['Model', 'Year', 'Insurance', 'diff'], axis=1, inplace=True)
df.head()

| | Selling Price | Kilometers Driven | Owner | Fuel Type | Transmission | Car Condition | Current Price | Age | Insurance_Expired | No_Current_Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 312165 | 82238 | First Owner | Petrol + CNG | MANUAL | 4.2 | 465000.0 | 7 | 1 | 0 |
| 1 | 242499 | 88514 | Second Owner | Petrol + CNG | MANUAL | 4.4 | 465000.0 | 6 | 0 | 0 |
| 2 | 381699 | 29735 | Second Owner | Petrol + CNG | MANUAL | 4.3 | 465000.0 | 4 | 0 | 0 |
| 3 | 181999 | 153709 | First Owner | Petrol + CNG | MANUAL | 4.1 | 465000.0 | 8 | 0 | 0 |
| 4 | 239499 | 88691 | Second Owner | Petrol + CNG | MANUAL | 4.4 | 465000.0 | 9 | 0 | 0 |


Separating the target from the features.

In [33]:
X = df.drop(['Selling Price'], axis=1)
y = df.loc[:, 'Selling Price']

Creating a train/test split.

In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
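
Without a fixed seed, every run shuffles the rows differently, so the scores reported below will vary between runs. A minimal sketch on made-up arrays shows how `random_state` makes the split repeatable; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed produce the same partition
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
```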

## Preprocessing

In [35]:
num_attribs = ['Kilometers Driven', 'Car Condition', 'Current Price', 'Age']
cat_attribs = ['Owner', 'Fuel Type', 'Transmission']

In [36]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Note: scikit-learn 1.2 and later rename the `sparse` argument to `sparse_output`
preprocessing = ColumnTransformer([
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(drop='first', sparse=False), cat_attribs)
], remainder="passthrough")

X_train = preprocessing.fit_transform(X_train)

We can extract the column names so that we can later check which features affect the target the most.

In [37]:
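col_names = []
for name, transformer, cols in preprocessing.transformers_[:-1]:
    # Newer scikit-learn releases expose get_feature_names_out; older ones get_feature_names
    if hasattr(transformer, 'get_feature_names_out'):
        cols = transformer.get_feature_names_out(cols)
    elif hasattr(transformer, 'get_feature_names'):
        cols = transformer.get_feature_names(cols)
    col_names += list(cols)

# Passthrough ("remainder") columns keep their original order and come last
col_names += [c for c in X.columns if c not in num_attribs + cat_attribs]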

In [38]:
sc_y = StandardScaler()
# ravel() flattens the scaled target to 1-D, the shape scikit-learn estimators expect
y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1)).ravel()

## Training Different Models

In [39]:
from sklearn.model_selection import cross_val_score

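def evaluate_model(model, X, y):
    # For regressors, cross_val_score defaults to the R^2 score,
    # so the "Accuracy" printed below is the mean R^2 over 10 folds.
    accuracies = cross_val_score(estimator=model, X=X, y=y, cv=10)
    print(model.__class__.__name__)
    print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
    print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

    # Fit on the full training set so the returned model is ready to predict
    model.fit(X, y)
    return model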
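
Since the target is continuous, the score that `cross_val_score` reports by default is R², not a classification accuracy. The sketch below, on synthetic data (the coefficients and noise level are made up), confirms that the default matches `scoring='r2'` and shows how to request an error metric instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (hypothetical coefficients)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Default scoring for a regressor is its .score() method, i.e. R^2
r2_default = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
r2_explicit = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')

# Error metrics are returned negated so that higher is always better
rmse = -cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                        scoring='neg_root_mean_squared_error')
print(r2_default.mean(), rmse.mean())
```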

In [40]:
from sklearn.linear_model import LinearRegression

lin_reg = evaluate_model(LinearRegression(), X_train, y_train)

LinearRegression
Accuracy: 64.60 %
Standard Deviation: 7.83 %


In [41]:
from sklearn.svm import SVR

svr = evaluate_model(SVR(), X_train, y_train)


SVR
Accuracy: 80.88 %
Standard Deviation: 5.27 %


In [42]:
from sklearn.ensemble import RandomForestRegressor

rf = evaluate_model(RandomForestRegressor(n_estimators = 100, random_state = 0), X_train, y_train)


RandomForestRegressor
Accuracy: 85.53 %
Standard Deviation: 5.77 %


In [43]:
from sklearn.ensemble import AdaBoostRegressor

ada_boost = evaluate_model(AdaBoostRegressor(random_state=0, n_estimators=100), X_train, y_train)


AdaBoostRegressor
Accuracy: 77.64 %
Standard Deviation: 4.17 %



In [44]:
import xgboost

xg_boost = evaluate_model(xgboost.XGBRegressor(), X_train, y_train)

XGBRegressor
Accuracy: 85.87 %
Standard Deviation: 5.24 %


In [47]:
from sklearn.model_selection import GridSearchCV

param_test = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6],
 'colsample_bytree':[i/100.0 for i in range(75,90,5)]
}

grid_search_1 = GridSearchCV(estimator = xgboost.XGBRegressor(learning_rate =0.1,
                                                              n_estimators=10,
                                                              max_depth=5,
                                                              min_child_weight=1,
                                                              gamma=0.2,
                                                              subsample=0.85,
                                                              colsample_bytree=0.8),
                             param_grid = param_test,
                             scoring='neg_mean_squared_error',
                             n_jobs=-1,
                             cv=5)

grid_search_1.fit(X_train,y_train)
grid_search_1.best_params_, grid_search_1.best_score_

({'colsample_bytree': 0.85, 'max_depth': 6, 'min_child_weight': 6},
 -0.3238767306949564)

Note: Hyperparameter tuning can be an iterative process in which you first do a coarse search over parameter values and then a finer search around the values found in the previous step. In the interest of your time, I have only included the last step here.
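
The coarse-then-fine idea can be sketched as follows. To keep the example light it swaps in a `Ridge` regressor with a single `alpha` parameter on synthetic data instead of XGBoost; the grids and data are made up, and only the two-stage pattern carries over.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ rng.normal(size=5) + rng.normal(scale=0.5, size=120)

# Stage 1: coarse search over orders of magnitude
coarse = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1, 10, 100]},
                      scoring='neg_mean_squared_error', cv=5)
coarse.fit(X_demo, y_demo)
best = coarse.best_params_['alpha']

# Stage 2: finer search around the coarse winner (the winner itself stays in the grid)
fine_values = sorted({float(best), *np.linspace(best / 3, best * 3, 7).tolist()})
fine = GridSearchCV(Ridge(), {'alpha': fine_values},
                    scoring='neg_mean_squared_error', cv=5)
fine.fit(X_demo, y_demo)
print(best, fine.best_params_['alpha'])
```

Keeping the coarse winner in the fine grid means the second stage cannot score worse than the first on the same folds.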

In [None]:
xg_boost = evaluate_model(
    xgboost.XGBRegressor(learning_rate=0.1,
                         n_estimators=100,
                         max_depth=6,
                         min_child_weight=6,
                         gamma=0.2,
                         subsample=0.85,
                         colsample_bytree=0.85),
    X_train, y_train)

In [None]:
importance_df = pd.DataFrame(xg_boost.feature_importances_)
importance_df['Features'] = col_names
importance_df = importance_df.rename(columns={0 : 'Average Importance'})

fig = px.bar(importance_df, x='Features', y='Average Importance')
fig.show()
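
The bar chart is easier to read when the features are ordered by importance. A minimal sketch with made-up importance values (not the model's actual output) shows the sorting step that could be applied before plotting.

```python
import pandas as pd

# Hypothetical importances, for illustration only
features = ['Kilometers Driven', 'Current Price', 'Age']
importances = [0.05, 0.40, 0.10]

# Sort so that the most influential feature comes first
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked)
```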

In [None]:
X_test = preprocessing.transform(X_test)
y_test = sc_y.transform(y_test.values.reshape(-1, 1))

y_pred = xg_boost.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, y_pred)
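
Because the target was standardized, this error is in scaled units and hard to relate to rupees. The sketch below, on made-up prices, shows how `inverse_transform` on the target scaler maps predictions back to the original scale before computing a root mean squared error; the constant offset of 0.1 stands in for a model's prediction error.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical selling prices in rupees
prices = np.array([200000.0, 300000.0, 400000.0, 500000.0])

scaler = StandardScaler()
scaled = scaler.fit_transform(prices.reshape(-1, 1)).ravel()

# Pretend predictions that are off by 0.1 in scaled units
pred_scaled = scaled + 0.1

# Map predictions back to rupees, then measure the error on the original scale
pred_rupees = scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
rmse = float(np.sqrt(np.mean((pred_rupees - prices) ** 2)))
print(round(rmse, 2))
```

An error of 0.1 in scaled units corresponds to 0.1 standard deviations of the price, which is the figure a client can actually interpret.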

Here, fuel type, age, current price and car condition are the features that contribute the most to determining the selling price, as we would have expected.

***I hope you found this notebook interesting. I would love to read your feedback below.***