<div>  
<h1><center style="background-color:#0093AF; color:white;"><strong>Used Car Price Prediction in India 🚗</strong></center></h1>
</div>

![](http://cdn.dribbble.com/users/1239720/screenshots/3506944/car_mg.gif)

<div class="alert alert-warning" color=black>
<p>I've always wondered what drives the price of a car. Superficially, we know that the car's brand and its features matter, but what is the real crux that decides the cost of a particular car?
<br>
A car dealer may emphasize the car's features to convince you to buy it. The real problem arises when you have to sell the car or buy a used one. How do you judge the price of the car?
<br><br>
Now let's delve into the factors that govern the pricing!
</p>
</div>

<div>  
<h3><center style="background-color:#0093AF; color:white;"><strong>Importing Libraries</strong></center></h3>
</div>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import seaborn as sns
import plotly.express as px

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('/kaggle/input/used-cars-dataset-from-cardekhocom/cardekho_updated.csv')

print("Data frame has {} rows and {} columns".format(df.shape[0], df.shape[1]))
df.head()

<div>  
<h2><center style="background-color:#0093AF; color:white;"><strong>Data Cleaning and Preprocessing</strong></center></h2>
</div>

> ### Dropping null values

In [None]:
df.isnull().sum()

In [None]:
# Dropping rows that have a null in any column except the new-car price

df.dropna(subset=['full_name', 'selling_price', 'year', 'seller_type','km_driven', 'fuel_type', 'transmission_type', 'mileage', 'engine','max_power', 'seats'],how='any',axis=0, inplace=True)
df = df.rename(columns={"new-price":"new_price"})
df.shape

> ### Creating column **"vehicle_age"** from **"year"**

In [None]:
dataset_year=2021
df['vehicle_age'] = dataset_year - df['year']
df.drop(['year'],axis=1, inplace=True)

df.head(1)

> ### Creating column **"brand"** & **"model"** from **"full_name"**

In [None]:
# Creating brand
df["full_name"] = df["full_name"].str.replace(" New ", " ")
df['brand']=df.full_name.str.split(' ').str.get(0)
df.loc[(df.brand == 'Land'),'brand']='Land Rover'

# Creating model
# Two tokens for "Swift Dzire"-style names, otherwise the first token after the brand
df['model']=df['full_name'].apply(lambda x: ' '.join(x.split(' ')[1:3]) if 'Dzire' in x else x.split(' ')[1])

In [None]:
# Renaming car models

df.loc[(df.model == 'Wagon'),'model'] = 'Wagon R'
df.loc[(df.model == 'E'),'model'] = 'E Verito'
df.loc[(df.model == 'Land'),'model'] = 'Land Cruiser'

In [None]:
# Dropping "full_name"

df.drop('full_name',axis = 1, inplace=True)

In [None]:
# Creating column "car_name"

df['car_name'] = df["brand"] +" "+ df["model"]
df_unique= pd.DataFrame(df['car_name'].value_counts())
df.head(1)
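
The brand/model split above can be sketched on a few toy names. The listing strings below are hypothetical and only illustrate the idea; the vectorised `where` call is one possible alternative to the row-wise `apply`, not the notebook's own method.

```python
import pandas as pd

# Hypothetical listing names; the real dataset's strings may differ.
names = pd.Series([
    "Maruti Swift Dzire VDI",
    "Hyundai i20 Sportz",
    "Maruti Wagon R LXI",
])

parts = names.str.split(" ")
brand = parts.str.get(0)

# Use two tokens when the second token alone is ambiguous
# ("Swift Dzire", "Wagon R"); otherwise the first token after the brand.
two_word = names.str.contains("Dzire|Wagon R", regex=True)
model = parts.str.get(1).where(~two_word, parts.str[1:3].str.join(" "))

toy = pd.DataFrame({"brand": brand, "model": model})
print(toy)
```

Handling the two-word models in the same step avoids a separate renaming pass, at the cost of keeping a list of such models up to date.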

> ### Converting **"new_price"** into **"min_price"** & **"max_price"**

In [None]:
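# Note: lstrip() strips a set of characters, not a prefix; it stops at the first
# character outside that set (expected to be the first digit of the price)
df['new_price1']=df['new_price'].str.lstrip('New Car (On-Road Price) : Rs.')
df.new_price1 = df.new_price1.str.replace('[*,]', '', regex=True)

df[['new_price1','unit']] = df.new_price1.str.split(" ",expand=True)

df[['min_cost_price','max_cost_price']] = df.new_price1.str.split("-",expand=True)
df.min_cost_price = df.min_cost_price.str.replace('[A-Za-z]', '', regex=True)
df.max_cost_price = df.max_cost_price.str.replace('[A-Za-z]', '', regex=True)

# "new_price1" and "unit" are still needed for the unit conversion; they are dropped afterwards
df.head(1)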

In [None]:
# Changing datatype into float

df['max_cost_price'] = df['max_cost_price'].astype('float64', errors = 'raise')
df['min_cost_price'] = df['min_cost_price'].astype('float64', errors = 'raise')

In [None]:
# Converting cost price to appropriate units

df.loc[df.unit == "Lakh", 'min_cost_price'] = df['min_cost_price']*100000.0
df.loc[df.unit == "Lakh", 'max_cost_price'] = df['max_cost_price']*100000.0

df.loc[df.unit == "Cr", 'min_cost_price'] = df['min_cost_price']*10000000.0
df.loc[df.unit == "Cr", 'max_cost_price'] = df['max_cost_price']*10000000.0

df.drop(['unit','new_price1'],axis=1, inplace=True)
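
The parsing and unit conversion above can also be sketched with a single regular expression. This is a minimal sketch, assuming the raw strings look like `New Car (On-Road Price) : Rs.7.11-7.48 Lakh*`; that format is an assumption and should be checked against the actual `new_price` column. The column names `low`, `high` and `unit` are hypothetical.

```python
import pandas as pd

# Assumed format of the raw strings; verify against the real column.
raw = pd.Series([
    "New Car (On-Road Price) : Rs.7.11-7.48 Lakh*",
    "New Car (On-Road Price) : Rs.1.20 Cr*",
    None,
])

# Capture the lower bound, an optional upper bound, and the unit.
pattern = r"Rs\.\s*(?P<low>[\d.,]+)(?:\s*-\s*(?P<high>[\d.,]+))?\s*(?P<unit>Lakh|Cr)"
parsed = raw.str.extract(pattern)

# 1 Lakh = 100,000 rupees; 1 Cr (crore) = 10,000,000 rupees.
multiplier = parsed["unit"].map({"Lakh": 1e5, "Cr": 1e7})
for col in ("low", "high"):
    cleaned = parsed[col].str.replace(",", "", regex=False)
    parsed[col] = pd.to_numeric(cleaned) * multiplier

print(parsed)
```

Matching the `Rs.` prefix explicitly avoids relying on `lstrip`, which removes any leading characters from a set rather than a fixed prefix.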

In [None]:
# Filling missing "max_cost_price" with "min_cost_price", then dropping every row
# where the two prices are equal (this includes the rows that were just filled)

df['max_cost_price'] = df['max_cost_price'].fillna(df['min_cost_price'])
df.drop(df[(df['max_cost_price'])==(df['min_cost_price'])].index, inplace=True)

In [None]:
#Filling missing cost price of cars with the mean of their respective car models

df['min_cost_price'] = df['min_cost_price'].fillna(df.groupby(['car_name'])['min_cost_price'].transform('mean'))
df['max_cost_price'] = df['max_cost_price'].fillna(df.groupby(['car_name'])['max_cost_price'].transform('mean'))
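
The group-wise fill above relies on `transform('mean')` returning one value per row. A minimal sketch on made-up data shows the mechanics:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "car_name": ["A", "A", "A", "B", "B"],
    "min_cost_price": [4.0, np.nan, 6.0, np.nan, 10.0],
})

# transform('mean') is aligned to the original index, so the result
# can be passed straight to fillna.
group_mean = toy.groupby("car_name")["min_cost_price"].transform("mean")
toy["min_cost_price"] = toy["min_cost_price"].fillna(group_mean)

print(toy)
```

A group whose prices are all missing has no mean to fall back on, so its rows stay `NaN` and are only removed by a later `dropna`.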

> ### Converting **"selling_price"** to appropriate units

In [None]:
df.selling_price = df.selling_price.str.replace('[*,]', '', regex=True)
df[['selling_price','unit']] = df.selling_price.str.split(expand=True)
df['selling_price'] = df['selling_price'].astype('float64', errors = 'raise')

df.head(1)

In [None]:
df.loc[df.unit == "Lakh", 'selling_price'] = df['selling_price']*100000.0
df.loc[df.unit == "Cr", 'selling_price'] = df['selling_price']*10000000.0


df=df.drop(['unit','new_price'],axis=1)

df.head()

> ### Removing unwanted non-numeric data from columns

In [None]:
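rep_cols = ["mileage","km_driven","engine","max_power","seats"]
# Strip everything except digits and the decimal point
df[rep_cols] = df[rep_cols].replace(r'[^\d.]+', '', regex=True)
# Cells left empty become '0' (exact match on the empty string); zero rows are dropped later
df[rep_cols] = df[rep_cols].replace('', '0')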
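
An alternative to mapping unparseable cells to `'0'` is to extract the first number and let everything else become `NaN`. The sketch below assumes raw values such as `18.9 kmpl` and `1197 CC`; those strings are hypothetical examples, and this is a different technique from the replace-based cleaning above.

```python
import pandas as pd

# Hypothetical raw values; units in the real data may differ.
toy = pd.DataFrame({
    "mileage": ["18.9 kmpl", "23.4 kmpl", "null"],
    "engine": ["1197 CC", "CC", "1498 CC"],
})

# Pull out the first number in each cell; cells without one become NaN.
for col in toy.columns:
    toy[col] = pd.to_numeric(
        toy[col].str.extract(r"(\d+(?:\.\d+)?)", expand=False),
        errors="coerce",
    )

print(toy)
```

With this approach missing measurements are handled by `dropna` instead of a separate zero-value filter.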

In [None]:
# Dropping null values
df.dropna(how='any',axis=0, inplace=True)

# Changing datatype to float
df= df.astype({'km_driven': 'float64', 'mileage': 'float64', 'engine': 'float64', 'max_power': 'float64', 'seats': 'float64','min_cost_price': 'float64','max_cost_price': 'float64'})
print(df.dtypes)

In [None]:
# Reordering columns

col_order=['car_name','brand','model','min_cost_price','max_cost_price','vehicle_age','km_driven','seller_type','fuel_type','transmission_type','mileage','engine','max_power','seats','selling_price']
df=df[col_order]
df.head(1)

<div>  
<h2><center style="background-color:#0093AF; color:white;"><strong>Removing Outliers</strong></center></h2>
</div>

In [None]:
df.describe()

In [None]:
# Dropping zero valued cells

df.drop(df[df['seats'] == 0].index, inplace = True)
df.drop(df[df['mileage'] == 0].index, inplace = True)
df.drop(df[df['km_driven'] == 0].index, inplace = True)
df.drop(df[df['vehicle_age'] == 0].index, inplace = True)
df.drop(df[df['max_power'] == 0].index, inplace = True)

In [None]:
df.info()

In [None]:
# Dropping out of boundary values

df.drop(df[(df['vehicle_age'] > 20) ].index, inplace = True)
df.drop(df[df['km_driven'] >300000 ].index, inplace = True)

In [None]:
# Removing the outliers using Interquartile Range for all columns

def removeOutliers(data, col):
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1

    print("IQR value for column %s is: %s" % (col, IQR))

    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    # Inclusive bounds, so a column with IQR == 0 does not wipe out every row
    return data.loc[data[col].between(lower_range, upper_range)]

out_columns = ['km_driven','vehicle_age','mileage','engine','max_power','seats','selling_price','min_cost_price','max_cost_price']

# Filtering cumulatively so that every column's outliers are removed,
# not only those of the last column in the list
filtered_data = df
for i in out_columns:
    filtered_data = removeOutliers(filtered_data, i)

# Assigning filtered data back to our original variable
df = filtered_data.copy()
print("Shape of data after outlier removal is: ", df.shape)
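
The 1.5 × IQR rule can be checked by hand on a small example. This is a minimal sketch on toy data; `iqr_mask` is a hypothetical helper, and it combines per-column masks computed on the full data so that a row must pass every column's check.

```python
import pandas as pd

def iqr_mask(series, k=1.5):
    """Boolean mask that is True for values inside the k*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Toy data with one obvious outlier in each column.
toy = pd.DataFrame({
    "km_driven": [10, 12, 11, 13, 12, 500],
    "selling_price": [5, 6, 5, 900, 6, 5],
})

# Combine the per-column masks so a row must pass every check.
keep = pd.Series(True, index=toy.index)
for col in toy.columns:
    keep &= iqr_mask(toy[col])

print(toy[keep])
```

For `km_driven` the quartiles are 11.25 and 12.75, giving fences at 9 and 15, so only the value 500 is rejected.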

<div>  
<h2><center style="background-color:#0093AF; color:white;"><strong>Final Preprocessing</strong></center></h2>
</div>

> ### Converting **"min_cost_price"** and **"max_cost_price"** to **"avg_cost_price"** using mean

In [None]:
df['avg_cost_price']=(df['min_cost_price']+df['max_cost_price'])/2

In [None]:
df=df.drop(['min_cost_price','max_cost_price'], axis=1)

In [None]:
df['avg_cost_price']=df['avg_cost_price']/100000
df['selling_price']=df['selling_price']/100000

In [None]:
df.head()

<div>  
<h2><center style="background-color:#0093AF; color:white;"><strong>Exploratory Data Analysis</strong></center></h2>
</div>

> ### CarName vs CostPrice

In [None]:
top_sell = df.sort_values(by='avg_cost_price', ascending=False)

fig, ax = plt.subplots(figsize=(20,30))

# plotting columns on the same axes
sns.barplot(x=top_sell.avg_cost_price, y=top_sell.car_name, color='violet', ax=ax)
sns.barplot(x=top_sell.selling_price, y=top_sell.car_name, color='orange', ax=ax)

# renaming the axes
ax.set(xlabel="Avg CostPrice & SellingPrice", ylabel="Car Name")

# displaying the plot
plt.show()

In [None]:
# Dropping Hyundai Aura
df.drop(df[df['car_name']=='Hyundai Aura'].index, axis=0, inplace=True)

> ### SellerType vs SellingPrice

In [None]:
figure = plt.figure(figsize=(8,10))
sns.boxplot(x='seller_type',y='selling_price', data=df, palette="Set2")

> ### Count of Seller Types

In [None]:
figure = plt.figure(figsize=(8,10))
sns.countplot(x='seller_type', data=df, palette="Set2")

> ### Count of Fuel Types

In [None]:
figure = plt.figure(figsize=(8,10))
sns.countplot(x='fuel_type', data=df, palette="Set2")

> ### SellingPrice vs VehicleAge

In [None]:
plt.figure(figsize=(20,10))
sns.lineplot(x='vehicle_age',y='selling_price',data=df)
plt.ticklabel_format(style='plain')

> ### SellingPrice vs VehicleAge vs KilometersDriven

In [None]:
# Plotly creates its own figure, so no matplotlib figure is needed here
fig = px.scatter_3d(df, x='vehicle_age', y='km_driven', z='selling_price', color='brand')
fig.show()

<div>  
<h2><center style="background-color:#0093AF; color:white;"><strong>Model Creation</strong></center></h2>
</div>

In [None]:
vehicles=df.copy()
vehicles=vehicles.drop(['car_name'], axis=1)
vehicles.head()

In [None]:
# numeric_only=True skips the categorical columns (required on pandas >= 2.0)
sns.heatmap(vehicles.corr(numeric_only=True), annot=True, cmap="RdBu")
plt.show()

In [None]:
numeric = vehicles.select_dtypes(include=['number'])
numeric = numeric.drop(['selling_price'],axis=1)
numy=vehicles['selling_price']

In [None]:
vehicles1=vehicles.copy()

In [None]:
vehicles1.head()

In [None]:
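# dtype=float keeps the dummy columns numeric (newer pandas returns bool by default, which statsmodels OLS rejects)
vehicles1=pd.get_dummies(vehicles1,columns=['fuel_type','transmission_type','seller_type','brand','model'],drop_first=True,dtype=float)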
vehicles1.head()
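
The effect of `drop_first=True` is easiest to see on a small frame. A minimal sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "km_driven": [30000, 52000, 41000, 18000],
})

# drop_first=True drops the first category in sorted order ("CNG" here),
# which then acts as the baseline for the remaining indicator columns.
encoded = pd.get_dummies(toy, columns=["fuel_type"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for the linear models fitted later.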

In [None]:
from sklearn.model_selection import train_test_split

X=vehicles1.drop(columns=['selling_price'])
y=vehicles1['selling_price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
print("x train: ",X_train.shape)
print("x test: ",X_test.shape)
print("y train: ",y_train.shape)
print("y test: ",y_test.shape)

In [None]:
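from statsmodels.api import OLS

# Note: OLS does not add an intercept on its own; this fit is through the origin
# unless a constant column is added (e.g. with statsmodels.api.add_constant)
model= OLS(y_train, X_train).fit()
print(model.summary())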

In [None]:
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn import metrics

CV = []
R2_train = []
R2_test = []

def car_pred_model(model,model_name):
    # Training model
    model.fit(X_train,y_train)
            
    # R2 score of train set
    y_pred_train = model.predict(X_train)
    R2_train_model = r2_score(y_train,y_pred_train)
    R2_train.append(round(R2_train_model,2))
    
    # R2 score of test set
    y_pred_test = model.predict(X_test)
    R2_test_model = r2_score(y_test,y_pred_test)
    R2_test.append(round(R2_test_model,2))
    
    # R2 mean of train set using Cross validation
    cross_val = cross_val_score(model ,X_train ,y_train ,cv=3)
    cv_mean = cross_val.mean()
    CV.append(round(cv_mean,2))
    
    # MAE
    mae = metrics.mean_absolute_error(y_test,y_pred_test)
    
    # MSE
    mse = metrics.mean_squared_error(y_test,y_pred_test)
    
    
    # Printing results
    print("Train R2-score :",round(R2_train_model,2))
    print("Test R2-score :",round(R2_test_model,2))
    print("Train CV scores :",cross_val)
    print("Train CV mean :",round(cv_mean,2))
    print("MAE :", round(mae,5))
    print("MSE :", round(mse,5))
    
    # Plotting Graphs 
    # Residual Plot of train data
    fig, ax = plt.subplots(1,2,figsize = (10,4))
    ax[0].set_title('Residual Plot of Train samples')
    # kdeplot replaces the deprecated distplot(hist=False)
    sns.kdeplot(x=y_train-y_pred_train, ax = ax[0])
    ax[0].set_xlabel('y_train - y_pred_train')
    
    # y_test vs y_pred_test scatter plot
    ax[1].set_title('y_test vs y_pred_test')
    sns.regplot(x=y_test, y=y_pred_test, robust=True, ci=None, ax=ax[1])
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test')
    
    plt.show()
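
The three metrics reported by the helper can be reproduced by hand. A minimal sketch with made-up numbers, using only NumPy:

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "R2:", r2)
```

Here the residuals are 0.5, -0.5, 0 and 1, so MAE is 0.5, MSE is 0.375 and R² is 1 - 1.5/20 = 0.925.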

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
car_pred_model(lr,"Linear_regressor.pkl")

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Creating Ridge model object
rg = Ridge()
# range of alpha 
alpha = np.logspace(-3,3,num=14)

# Creating RandomizedSearchCV to find the best estimator of hyperparameter
rg_rs = RandomizedSearchCV(estimator = rg, param_distributions = dict(alpha=alpha))

car_pred_model(rg_rs,"ridge.pkl")
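
With its default `n_iter=10`, `RandomizedSearchCV` tries only 10 of the 14 candidate alphas, and without a seed the sampled subset changes from run to run. The sketch below uses synthetic data purely to make it runnable; the coefficients and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=120)

alpha = np.logspace(-3, 3, num=14)
search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": alpha},
    n_iter=10,         # scikit-learn's default: 10 of the 14 alphas are tried
    random_state=101,  # fixed seed so the sampled alphas are reproducible
)
search.fit(X, y)

print("alphas tried:", len(search.cv_results_["params"]))
print("best alpha:", search.best_params_["alpha"])
```

For a grid this small, `GridSearchCV` would evaluate all 14 values at little extra cost.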

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

ls = Lasso()
alpha = np.logspace(-3,3,num=14) # range for alpha

ls_rs = RandomizedSearchCV(estimator = ls, param_distributions = dict(alpha=alpha))

car_pred_model(ls_rs,"lasso.pkl")

In [None]:
Technique = ["LinearRegression","Ridge","Lasso"]
results=pd.DataFrame({'Model': Technique,'R Squared(Train)': R2_train,'R Squared(Test)': R2_test,'CV score mean(Train)': CV})
display(results)