# Car Price Prediction
Predicting used car prices is useful for buyers, sellers, and dealerships. This notebook uses machine learning to estimate the price of a car from features such as brand, year, kilometers driven, fuel type, seller type, transmission type, and number of previous owners. By learning from historical listings, the model can help users make informed decisions in the used car market.

In [1]:
import warnings
warnings.filterwarnings('ignore')
# Import Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Import Models
from sklearn.linear_model import LinearRegression,Lasso,Ridge,ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from xgboost import XGBRegressor
import xgboost as xgb

from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

In [2]:
# Load Data File
df = pd.read_csv("CAR DETAILS FROM CAR DEKHO.csv")


In [3]:
df.shape

(4340, 8)

In [4]:
df

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner


In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4340 entries, 0 to 4339
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           4340 non-null   object
 1   year           4340 non-null   int64 
 2   selling_price  4340 non-null   int64 
 3   km_driven      4340 non-null   int64 
 4   fuel           4340 non-null   object
 5   seller_type    4340 non-null   object
 6   transmission   4340 non-null   object
 7   owner          4340 non-null   object
dtypes: int64(3), object(5)
memory usage: 271.4+ KB


In [7]:
df.isnull().sum()

name             0
year             0
selling_price    0
km_driven        0
fuel             0
seller_type      0
transmission     0
owner            0
dtype: int64

The dataset contains no null values.

In [8]:
print("No of duplicate Rows = ",df.duplicated().sum())

No of duplicate Rows =  763


There are 763 duplicate rows in the data, so we remove them.

In [9]:
df.drop_duplicates(inplace=True)
print("No of duplicate Rows = ",df.duplicated().sum())
df.shape

No of duplicate Rows =  0


(3577, 8)

No duplicate rows remain in the dataset.
The dataset now has 3577 rows, which means we have 3577 distinct car listings.

In [10]:
df

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner


The duplicate rows have been removed, but the index still runs up to 4339 because dropping rows does not renumber the index. Reset the index next.

In [11]:
df.reset_index(inplace=True)


In [12]:
df

Unnamed: 0,index,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...,...
3572,4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
3573,4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
3574,4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
3575,4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner


In [13]:
df.shape

(3577, 9)

In [14]:
df.drop(columns=['index'],inplace=True)
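The reset-then-drop sequence in the cells above can also be written as one step. The following is a minimal sketch on a toy frame (the `toy` data is illustrative, not taken from the car dataset): passing `drop=True` to `reset_index` discards the old index instead of adding it as a new `index` column.

```python
import pandas as pd

# Toy frame standing in for the car data (values are illustrative only).
toy = pd.DataFrame({
    "name": ["A", "B", "A", "C"],
    "year": [2007, 2012, 2007, 2016],
})

# Drop exact duplicate rows, then renumber 0..n-1 without keeping the old index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(deduped.index.tolist())
print(deduped.shape)
```

With this form there is no extra `index` column to remove afterwards.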

In [15]:

df


Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...
3572,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
3573,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
3574,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
3575,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner


The 'name' column has a large number of distinct values. For modelling, we extract only the first word of each name, which is the name of the manufacturer, so that the number of unique values in this column is confined to a small number.

In [16]:
df['brand'] = df['name'].str.extract(r'^\s*(\w+)', expand=False)
print(df['brand'])

0        Maruti
1        Maruti
2       Hyundai
3        Datsun
4         Honda
         ...   
3572    Hyundai
3573    Hyundai
3574     Maruti
3575    Hyundai
3576    Renault
Name: brand, Length: 3577, dtype: object


In [17]:
df['brand'].unique()

array(['Maruti', 'Hyundai', 'Datsun', 'Honda', 'Tata', 'Chevrolet',
       'Toyota', 'Jaguar', 'Mercedes', 'Audi', 'Skoda', 'Jeep', 'BMW',
       'Mahindra', 'Ford', 'Nissan', 'Renault', 'Fiat', 'Volkswagen',
       'Volvo', 'Mitsubishi', 'Land', 'Daewoo', 'MG', 'Force', 'Isuzu',
       'OpelCorsa', 'Ambassador', 'Kia'], dtype=object)
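Because the pattern keeps only the first word, a make written as two words is cut short: the value 'Land' in the list above is presumably a truncated 'Land Rover'. A minimal sketch of one way to keep such makes whole, assuming a hand-maintained list of multi-word makes. The list `MULTI_WORD_BRANDS`, the helper `extract_brand`, and the sample names are hypothetical and not part of the original notebook.

```python
import re

import pandas as pd

# Hypothetical list of makes that span more than one word.
MULTI_WORD_BRANDS = ["Land Rover"]

def extract_brand(name: str) -> str:
    """Return the make from a listing name, keeping known multi-word makes whole."""
    cleaned = name.strip()
    for brand in MULTI_WORD_BRANDS:
        if cleaned.startswith(brand):
            return brand
    # Fall back to the first word, as in the original regular expression.
    return re.match(r"\w+", cleaned).group(0)

# Illustrative listing names.
names = pd.Series(["Maruti 800 AC", "Land Rover Discovery Sport", " Hyundai Verna 1.6 SX"])
brands = names.apply(extract_brand)
print(brands.tolist())
```

Whether this matters for the model is an open question, since the first word alone already identifies each make uniquely here.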

In [18]:
# Count the distinct brands
print(df['brand'].nunique())

29


We have data of cars from 29 different brands.

In [19]:
brand_names = ['Maruti', 'Hyundai', 'Datsun', 'Honda', 'Tata', 'Chevrolet',
       'Toyota', 'Jaguar', 'Mercedes', 'Audi', 'Skoda', 'Jeep', 'BMW',
       'Mahindra', 'Ford', 'Nissan', 'Renault', 'Fiat', 'Volkswagen',
       'Volvo', 'Mitsubishi', 'Land', 'Daewoo', 'MG', 'Force', 'Isuzu',
       'OpelCorsa', 'Ambassador', 'Kia']
# Assign codes 1..29 in the order listed. Assigning the result back to the
# column avoids relying on chained in-place modification.
df['brand'] = df['brand'].map(dict(zip(brand_names, range(1, len(brand_names) + 1))))
df['brand'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
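The codes above follow the order in which each brand first appears in the column, so they do not have to be typed out by hand. A minimal sketch on a short illustrative series, assuming 1-based codes are wanted as in the cell above; the names `brand_codes` and `code_to_brand` are illustrative.

```python
import pandas as pd

# Short illustrative series of brand names.
brands = pd.Series(["Maruti", "Maruti", "Hyundai", "Datsun", "Hyundai"])

# factorize assigns 0-based codes in order of first appearance;
# adding 1 reproduces the 1-based numbering used above.
codes, uniques = pd.factorize(brands)
brand_codes = pd.Series(codes + 1, index=brands.index)

# Keep the mapping so codes can be translated back to names later.
code_to_brand = dict(enumerate(uniques, start=1))

print(brand_codes.tolist())
print(code_to_brand)
```

Keeping `code_to_brand` around makes it possible to report predictions by brand name rather than by number.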

Find the unique values in the 'fuel' column.

In [20]:
df['fuel'].unique()


array(['Petrol', 'Diesel', 'CNG', 'LPG', 'Electric'], dtype=object)

There are 5 fuel types in the data.
Convert them to integers.

In [21]:
# Map each fuel type to an integer code and assign the result back to the column
df['fuel'] = df['fuel'].map({'Petrol': 1, 'Diesel': 2, 'CNG': 3, 'LPG': 4, 'Electric': 5})
df['fuel'].unique()

array([1, 2, 3, 4, 5])

In [22]:
df['seller_type'].unique()

array(['Individual', 'Dealer', 'Trustmark Dealer'], dtype=object)

In [23]:
# Map each seller type to an integer code and assign the result back to the column
df['seller_type'] = df['seller_type'].map({'Individual': 1, 'Dealer': 2, 'Trustmark Dealer': 3})
df['seller_type'].unique()

array([1, 2, 3])

In [24]:
df['transmission'].unique()

array(['Manual', 'Automatic'], dtype=object)

In [25]:
# Map each transmission type to an integer code and assign the result back to the column
df['transmission'] = df['transmission'].map({'Manual': 1, 'Automatic': 2})
df['transmission'].unique()

array([1, 2])

In [26]:
df['owner'].unique()

array(['First Owner', 'Second Owner', 'Fourth & Above Owner',
       'Third Owner', 'Test Drive Car'], dtype=object)

In [27]:
# Map each ownership category to an integer code and assign the result back to the column
df['owner'] = df['owner'].map({'First Owner': 1, 'Second Owner': 2, 'Fourth & Above Owner': 4,
                               'Third Owner': 3, 'Test Drive Car': 5})
df['owner'].unique()

array([1, 2, 4, 3, 5])
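Integer codes such as Petrol = 1 and Electric = 5 give the categories an order and a spacing that linear models read as a magnitude, even though fuel type, seller type and brand have no natural ranking. One common alternative is one-hot encoding. This is a sketch on a toy frame (the `toy` data is illustrative), not the encoding used in the rest of this notebook.

```python
import pandas as pd

# Toy frame with one nominal column and one numeric column.
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol"],
    "km_driven": [70000, 100000, 46000, 50000],
})

# One indicator column per fuel type instead of a single 1..k integer code.
encoded = pd.get_dummies(toy, columns=["fuel"], dtype=int)

print(encoded.columns.tolist())
print(encoded["fuel_Petrol"].tolist())
```

Tree-based models are usually less sensitive to this choice than linear ones, so the effect would need to be checked on the actual data.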

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3577 entries, 0 to 3576
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           3577 non-null   object
 1   year           3577 non-null   int64 
 2   selling_price  3577 non-null   int64 
 3   km_driven      3577 non-null   int64 
 4   fuel           3577 non-null   int64 
 5   seller_type    3577 non-null   int64 
 6   transmission   3577 non-null   int64 
 7   owner          3577 non-null   int64 
 8   brand          3577 non-null   int64 
dtypes: int64(8), object(1)
memory usage: 251.6+ KB


We have 9 columns. The first column, 'name', is no longer needed because the new 'brand' column was derived from it and will be used for modelling instead. All other columns now have an integer data type, as shown above, so they can be used in a regression model.
The 'selling_price' column is our output (label) column and the other seven are input (feature) columns.
Let's separate the inputs and the output.

In [29]:
X=df.drop(columns=['name','selling_price'])
X.shape

(3577, 7)

In [30]:
Y=df['selling_price'].values
Y=Y.reshape(-1,1)
print(Y)


[[ 60000]
 [135000]
 [600000]
 ...
 [110000]
 [865000]
 [225000]]


In [31]:
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2)

In [32]:
x_train.shape


(2861, 7)

In [33]:
x_test.shape

(716, 7)
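The split above is random, so the scores reported below would change from run to run. A minimal sketch of making the split repeatable, assuming a fixed seed is acceptable. The seed value 42 and the `X_demo` and `y_demo` arrays are illustrative, and the numbers in this notebook were not produced with this seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic arrays standing in for X and Y.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The same random_state yields the same partition on every run.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(np.array_equal(a_test, b_test))
print(a_train.shape, a_test.shape)
```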

## Model Designing
### 1. Linear Regression Model

In [34]:
LR_model=LinearRegression()
LR_model.fit(x_train,y_train)

In [35]:
predict=LR_model.predict(x_test)


In [36]:
# calculate R-squared

print ("R2 Score value: {:.4f}".format(r2_score(y_test, predict)))

R2 Score value: 0.3900


In [37]:
mse = mean_squared_error(y_test, predict)
rmse = np.sqrt(mse)
print("RMSE value: {:.4f}".format(rmse))

RMSE value: 415240.8139
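To make the two metrics concrete: RMSE is the square root of the mean squared residual, in the same units as the price, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A small sketch that computes both by hand and checks them against scikit-learn; the `y_hat` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true prices and made-up predictions.
y_true = np.array([60000.0, 135000.0, 600000.0, 250000.0])
y_hat = np.array([80000.0, 120000.0, 500000.0, 300000.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Read this way, an R2 of 0.39 means the linear model explains roughly 39% of the variance in selling price on the test set.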


In [38]:
error_mean_square=[]
error_mean_absolute=[]

In [39]:
# scikit-learn metrics take the true values first, then the predictions
error_mean_square.append(int(mean_squared_error(y_test, predict)))
error_mean_absolute.append(int(mean_absolute_error(y_test, predict)))

In [40]:
error_mean_absolute

[211059]

In [41]:
y_predict = pd.DataFrame(predict, columns = ['Predicted Output'])
y_predict.head()

Unnamed: 0,Predicted Output
0,508904.880664
1,479222.44119
2,513136.263019
3,451821.371914
4,52407.58197


Now fit several other regression models and compare the results to find which one performs best on this problem.

In [42]:
lnrr=LinearRegression()
ridge=Ridge(alpha=1.0)
lasso=Lasso(alpha=1.0)
elastic=ElasticNet(alpha=1.0, l1_ratio=0.5)
dtr=DecisionTreeRegressor()
rfr=RandomForestRegressor()
gbr=GradientBoostingRegressor()
xgr=XGBRegressor()


In [43]:
clfs = {
    'Linear Regression': lnrr,
    'Ridge Regression': ridge,
    'Lasso Regression': lasso,
    'Elastic Net Regression': elastic,
    'Decision Tree Regression': dtr,
    'Random Forest': rfr,
    'Gradient Boosting': gbr,
    'XGB Regression': xgr
}

In [44]:
def train_classifier(model, X_train, y_train, X_test, y_test):
    """Fit a regression model and return its R2 and RMSE on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    R2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    return R2, rmse

In [45]:
R2_scores = []
rmse_scores = []
# Use a separate loop variable so the clfs dictionary is not overwritten
for name, model in clfs.items():
    current_R2, current_rmse = train_classifier(model, x_train, y_train, x_test, y_test)
    print()
    print("For: ", name)
    print("R2_score : ", current_R2)
    print("rmse_score : ", current_rmse)
    
    R2_scores.append(current_R2)
    rmse_scores.append(current_rmse)


For:  Linear Regression
R2_score :  0.3899597527209556
rmse_score :  415240.813862979

For:  Ridge Regression
R2_score :  0.38990335321294034
rmse_score :  415260.00836403563

For:  Lasso Regression
R2_score :  0.38995874663925956
rmse_score :  415241.15627154295

For:  Elastic Net Regression
R2_score :  0.2490701198505335
rmse_score :  460702.40793602937

For:  Decision Tree Regression
R2_score :  0.5685772210460425
rmse_score :  349198.46448378335

For:  Random Forest
R2_score :  0.7513630030983
rmse_score :  265096.29934806837

For:  Gradient Boosting
R2_score :  0.7550587040555157
rmse_score :  263118.7486248171

For:  XGB Regression
R2_score :  0.7681554555892944
rmse_score :  255987.7597073735


The results above show that RandomForestRegressor, GradientBoostingRegressor and XGBRegressor give the best results for our problem, with R2 scores of about 0.75 to 0.77, while the linear models stay below 0.40.
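The printed scores are easier to compare side by side in a table. A minimal sketch, assuming the scores are copied from the output above (rounded) rather than recomputed; the name `results` is illustrative.

```python
import pandas as pd

# Scores copied from the printed output above, rounded for readability.
results = pd.DataFrame({
    "model": [
        "Linear Regression", "Ridge Regression", "Lasso Regression",
        "Elastic Net Regression", "Decision Tree Regression",
        "Random Forest", "Gradient Boosting", "XGB Regression",
    ],
    "R2": [0.3900, 0.3899, 0.3900, 0.2491, 0.5686, 0.7514, 0.7551, 0.7682],
    "RMSE": [415240.81, 415260.01, 415241.16, 460702.41,
             349198.46, 265096.30, 263118.75, 255987.76],
})

# Rank from best to worst by R2.
results = results.sort_values("R2", ascending=False).reset_index(drop=True)
print(results)
```

In the notebook itself the same table could be built from the `R2_scores` and `rmse_scores` lists collected in the loop.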


### Hyperparameter Tuning: Random Forest

In [47]:
rfr=RandomForestRegressor()
# parameters for Random Forest
n_estimators=[50,100,150,200,300]
max_depth=[30,40,50]
# 1.0 (use all features) replaces "auto", which was removed in scikit-learn 1.3
max_features=[1.0,"sqrt"]
bootstrap =[True,False]

param_grid = dict(n_estimators=n_estimators,max_depth=max_depth, max_features=max_features, bootstrap=bootstrap)

grid = GridSearchCV(estimator=rfr, param_grid=param_grid, cv = 3, n_jobs=-1)

grid_result = grid.fit(x_train, y_train)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.606618 using {'bootstrap': True, 'max_depth': 50, 'max_features': 'sqrt', 'n_estimators': 200}
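Note that `best_score_` is the mean cross-validated R2 over the training folds, so the 0.6066 above is not directly comparable to the test-set R2 values reported earlier. A sketch of scoring the tuned model on the held-out test set, using synthetic data and a deliberately tiny grid so that it runs quickly; all data, names and grid values here are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the car features and prices.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Small illustrative grid; cv=3 matches the setting used above.
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
)
search.fit(X_tr, y_tr)

# Mean cross-validated R2 of the best parameter combination (training data only).
cv_r2 = search.best_score_

# R2 of the refitted best model on data it has never seen.
test_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))

print(search.best_params_)
print(cv_r2, test_r2)
```

In this notebook the equivalent step would be to call `grid_result.best_estimator_.predict(x_test)` and compare the resulting R2 with the untuned Random Forest.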


### References
1. https://www.kaggle.com/datasets/hellbuoy/car-price-prediction
2. https://github.com/Apaulgithub/oibsip_taskno3
3. https://www.kaggle.com/datasets/bhavikjikadara/car-price-prediction-dataset
4. https://github.com/suhasmaddali/Car-Prices-Prediction
5. https://www.youtube.com/watch?v=MFKSPGo_MLw


### My GitHub link to this project
https://github.com/M-Zeeshan-Anjum/Car-Price-prediction