# Used Car Price Prediction

The Used Car Price Prediction project aims to develop a machine learning model that accurately predicts the selling prices of used cars based on various features such as make, model, year, mileage,Owner_Type and other relevant attributes. This project leverages data preprocessing, feature engineering, and model training techniques to create a robust predictive model.

## 1. Data Collection

In [1]:
import pandas as pd
df=pd.read_csv(r'D:\Downloads\used-car-data - train-data.csv')
df.set_index('Name',inplace=True)
df

Unnamed: 0_level_0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.50
Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.50
Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.00
Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74
...,...,...,...,...,...,...,...,...,...,...,...,...
Maruti Swift VDI,Delhi,2014,27365,Diesel,Manual,First,28.4 kmpl,1248 CC,74 bhp,5.0,7.88 Lakh,4.75
Hyundai Xcent 1.1 CRDi S,Jaipur,2015,100000,Diesel,Manual,First,24.4 kmpl,1120 CC,71 bhp,5.0,,4.00
Mahindra Xylo D4 BSIV,Jaipur,2012,55000,Diesel,Manual,Second,14.0 kmpl,2498 CC,112 bhp,8.0,,2.90
Maruti Wagon R VXI,Kolkata,2013,46000,Petrol,Manual,First,18.9 kmpl,998 CC,67.1 bhp,5.0,,2.65


## 2. Data Preprocessing and Exploratory Data Analysis

In [4]:
df.duplicated().sum()

1

In [2]:
df['Owner_Type'].value_counts()
first_owner=df[df['Owner_Type']=="First"].shape[0]
print('OwnerType \n First -',first_owner)

OwnerType 
 First - 4929


In [5]:
df['Kilometers_Driven'].agg('min'),df['Kilometers_Driven'].agg('max')

df['Year'].value_counts().sort_index()

df_null=df.isnull().sum()
df_null

Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 36
Power                  36
Seats                  42
New_Price            5195
Price                   0
dtype: int64

In [6]:
# Print the percentage of null values in each column.
# Drop those columns where the null value is greater than 60%.
df_per=df_null/df.shape[0]*100
print(df_per)

Location              0.000000
Year                  0.000000
Kilometers_Driven     0.000000
Fuel_Type             0.000000
Transmission          0.000000
Owner_Type            0.000000
Mileage               0.033228
Engine                0.598106
Power                 0.598106
Seats                 0.697790
New_Price            86.310018
Price                 0.000000
dtype: float64


In [7]:
df2=df.loc[:,df_per<60]
df2.head()

Unnamed: 0_level_0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75
Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5
Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5
Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0
Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74


In [8]:
car = df2['Location'].value_counts().reset_index()

car.columns = ['Location', 'Number_of_Cars']

car.set_index('Location', inplace=True)

car

Unnamed: 0_level_0,Number_of_Cars
Location,Unnamed: 1_level_1
Mumbai,790
Hyderabad,742
Kochi,651
Coimbatore,636
Pune,622
Delhi,554
Kolkata,535
Chennai,494
Jaipur,413
Bangalore,358


In [9]:
#Take only the numerical value of mileage, engine, and power columns.

df2['Mileage'] = pd.to_numeric(df2['Mileage'].str.extract('(\d+)')[0], errors='coerce')
df2['Engine'] = pd.to_numeric(df2['Engine'].str.extract('(\d+)')[0], errors='coerce', downcast='integer')
df2['Power'] = pd.to_numeric(df2['Power'].str.extract('(\d+)')[0], errors='coerce')

df2

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Mileage'] = pd.to_numeric(df2['Mileage'].str.extract('(\d+)')[0], errors='coerce')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Engine'] = pd.to_numeric(df2['Engine'].str.extract('(\d+)')[0], errors='coerce', downcast='integer')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Power'

Unnamed: 0_level_0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.0,998.0,58.0,5.0,1.75
Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.0,1582.0,126.0,5.0,12.50
Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.0,1199.0,88.0,5.0,4.50
Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.0,1248.0,88.0,7.0,6.00
Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.0,1968.0,140.0,5.0,17.74
...,...,...,...,...,...,...,...,...,...,...,...
Maruti Swift VDI,Delhi,2014,27365,Diesel,Manual,First,28.0,1248.0,74.0,5.0,4.75
Hyundai Xcent 1.1 CRDi S,Jaipur,2015,100000,Diesel,Manual,First,24.0,1120.0,71.0,5.0,4.00
Mahindra Xylo D4 BSIV,Jaipur,2012,55000,Diesel,Manual,Second,14.0,2498.0,112.0,8.0,2.90
Maruti Wagon R VXI,Kolkata,2013,46000,Petrol,Manual,First,18.0,998.0,67.0,5.0,2.65


In [10]:
from sklearn.impute import SimpleImputer
num_vars=df2.select_dtypes(include='number').columns
num_vars=['Mileage', 'Engine', 'Power', 'Seats',]
si=SimpleImputer(strategy='mean')
si.fit(df2[num_vars])

In [11]:
df2.fillna(df2[num_vars].mean(),inplace=True)
df2.isnull().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2.fillna(df2[num_vars].mean(),inplace=True)


Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
Price                0
dtype: int64

In [12]:
df2.describe()

Unnamed: 0,Year,Kilometers_Driven,Mileage,Engine,Power,Seats,Price
count,6019.0,6019.0,6019.0,6019.0,6019.0,6019.0,6019.0
mean,2013.358199,58738.38,17.710487,1621.27645,112.938223,5.278735,9.479468
std,3.269742,91268.84,4.578434,599.553865,53.272765,0.806012,11.187917
min,1998.0,171.0,0.0,72.0,34.0,0.0,0.44
25%,2011.0,34000.0,15.0,1198.0,78.0,5.0,3.5
50%,2014.0,53000.0,18.0,1493.0,98.0,5.0,5.64
75%,2016.0,73000.0,21.0,1969.0,138.0,5.0,9.95
max,2019.0,6500000.0,33.0,5998.0,560.0,10.0,160.0


In [13]:
df2.isnull().sum()

Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
Price                0
dtype: int64

In [14]:
df2.head()

Unnamed: 0_level_0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.0,998.0,58.0,5.0,1.75
Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.0,1582.0,126.0,5.0,12.5
Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.0,1199.0,88.0,5.0,4.5
Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.0,1248.0,88.0,7.0,6.0
Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.0,1968.0,140.0,5.0,17.74


In [15]:
df2.reset_index(drop=True, inplace=True)
df2.head()

Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Mumbai,2010,72000,CNG,Manual,First,26.0,998.0,58.0,5.0,1.75
1,Pune,2015,41000,Diesel,Manual,First,19.0,1582.0,126.0,5.0,12.5
2,Chennai,2011,46000,Petrol,Manual,First,18.0,1199.0,88.0,5.0,4.5
3,Chennai,2012,87000,Diesel,Manual,First,20.0,1248.0,88.0,7.0,6.0
4,Coimbatore,2013,40670,Diesel,Automatic,Second,15.0,1968.0,140.0,5.0,17.74


In [16]:
cat_ohe=pd.get_dummies(df2[['Transmission', 'Owner_Type', 'Fuel_Type']], drop_first=True, dtype='int')
cat_ohe.head()

Unnamed: 0,Transmission_Manual,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third,Fuel_Type_Diesel,Fuel_Type_Electric,Fuel_Type_LPG,Fuel_Type_Petrol
0,1,0,0,0,0,0,0,0
1,1,0,0,0,1,0,0,0
2,1,0,0,0,0,0,0,1
3,1,0,0,0,1,0,0,0
4,0,0,1,0,1,0,0,0


In [17]:
num_vars=df2.select_dtypes(include='number')
df2 = pd.concat([num_vars, cat_ohe], axis=1)
df2.head()

Unnamed: 0,Year,Kilometers_Driven,Mileage,Engine,Power,Seats,Price,Transmission_Manual,Owner_Type_Fourth & Above,Owner_Type_Second,Owner_Type_Third,Fuel_Type_Diesel,Fuel_Type_Electric,Fuel_Type_LPG,Fuel_Type_Petrol
0,2010,72000,26.0,998.0,58.0,5.0,1.75,1,0,0,0,0,0,0,0
1,2015,41000,19.0,1582.0,126.0,5.0,12.5,1,0,0,0,1,0,0,0
2,2011,46000,18.0,1199.0,88.0,5.0,4.5,1,0,0,0,0,0,0,1
3,2012,87000,20.0,1248.0,88.0,7.0,6.0,1,0,0,0,1,0,0,0
4,2013,40670,15.0,1968.0,140.0,5.0,17.74,0,0,1,0,1,0,0,0


## 3. Train-Test-Split

In [18]:
from sklearn.model_selection import train_test_split
X=df2.iloc[:,:-1]
y=df2.iloc[:,-1]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape,X_test.shape,y_train.shape

((4815, 14), (1204, 14), (4815,))

## 4. Feature Scaling

In [19]:
from sklearn.preprocessing import StandardScaler
num_vars=df2.select_dtypes(include='number')
sc=StandardScaler()
X_train_sc=sc.fit_transform(X_train[num_vars.iloc[:,:-1].columns])
X_test_sc=sc.fit_transform(X_test[num_vars.iloc[:,:-1].columns])

## 5. Model Selection and Training

In [20]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X_train_sc,y_train)

In [21]:
y_pred=lr.predict(X_test_sc)
y_pred

array([-0.00780646,  0.01793245,  0.01054277, ...,  0.96710106,
        1.01484692,  0.0214755 ])

In [22]:
y_test

2868    0
5924    0
3764    0
4144    0
2780    1
       ..
5926    1
4216    1
1351    1
4603    0
5668    0
Name: Fuel_Type_Petrol, Length: 1204, dtype: int32

## 6. Model Evaluation

In [23]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.956649540560491