## USED CAR PRICE PREDICTION
### 1.0 Data Understanding

#### Background
This analysis aims to identify the features that affect used car prices and to use them to predict price. The data comes from Kaggle (https://www.kaggle.com/datasets/avikasliwal/used-cars-price-prediction/discussion/358691). The predictor variables are Name, Location, Year, Kilometers Driven, Fuel Type, Transmission, Owner Type, Mileage, Engine, Power and Seats; the target variable is Price.

#### Problem Statement

Demand for used cars is growing because they are cheaper than new cars.
Buyers are also paying closer attention to the features of a used car, and this analysis aims to help them identify the factors to consider when buying one.

#### Objectives
Research questions:

1. What is the relationship between car price and the predictor variables?
2. Which combination of features provides the most accurate prediction of car prices?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import re
import warnings
%matplotlib inline

In [2]:
used_cars=pd.read_csv(r"C:\Users\user\Desktop\car_price_prediction\Used_cars_Price_Prediction\tr_data.csv")
used_cars.head()

,Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
1,1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5
2,2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5
3,3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.0
4,4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74


In [3]:
# Remove the duplicate index column "Unnamed: 0" and drop "New_Price", since "Price" is the price column used in this analysis
used_cars.drop(columns=["Unnamed: 0", "New_Price"], inplace=True)

used_cars.head()

,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74
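An `Unnamed: 0` column usually appears when a CSV was saved together with its index. As a minimal sketch, using a small inline CSV in place of the real file (the sample rows and the reduced column set are illustrative assumptions), the saved index can be read as the index directly instead of being dropped afterwards:

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV: the first, unnamed column is a saved index.
csv_text = """,Name,Year,New_Price,Price
0,Maruti Wagon R LXI CNG,2010,,1.75
1,Hyundai Creta 1.6 CRDi SX Option,2015,,12.5
2,Honda Jazz V,2011,8.61 Lakh,4.5
"""

# Default read: the saved index is loaded as a data column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index rather than as data.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

# errors="ignore" avoids a KeyError if the cell is re-run after the column is gone.
clean = clean.drop(columns=["New_Price"], errors="ignore")

print(list(raw.columns))
print(list(clean.columns))
```

Either approach gives the same columns; `index_col=0` simply avoids carrying the redundant column into the DataFrame in the first place.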


### 2.0 Data Pre-Processing

In [4]:
## Price is given in lakh rupees. A lakh is a unit in the Indian numbering system equal to 100,000
## For example, 1.75 lakh rupees == 175,000 INR
## Prices are converted from lakh rupees to Indian rupees and then to Kenyan shillings for easier interpretation
## 1 INR == 1.82 Ksh (as of 8 Nov 2023)
## So 1 lakh rupees == 182,000 Ksh, and the column is multiplied by 182,000

used_cars["Price"] = (used_cars["Price"] * 182000).astype(int)

used_cars.head()

,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,318500
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,2275000
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,819000
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,1092000
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,3228679
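In the output above, 17.74 lakh becomes 3,228,679 although 17.74 × 182,000 is exactly 3,228,680. This is consistent with `astype(int)` truncating a floating-point product that lands just below the whole number. A minimal sketch of the same conversion with rounding before the cast, using the five prices shown in the first `head()` output (the constant names are illustrative):

```python
import pandas as pd

LAKH = 100_000       # 1 lakh = 100,000
INR_TO_KSH = 1.82    # rate quoted in the notebook (8 Nov 2023)

# Prices in lakh rupees, taken from the first five rows shown earlier.
prices_lakh = pd.Series([1.75, 12.5, 4.5, 6.0, 17.74], name="Price")

# Round to the nearest shilling before casting, instead of truncating.
prices_ksh = (prices_lakh * LAKH * INR_TO_KSH).round().astype("int64")

print(prices_ksh.tolist())
```

Using `int64` explicitly also keeps the dtype the same across platforms; plain `int` produced `int32` in the `info()` output of this notebook.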


In [5]:
used_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               6019 non-null   object 
 1   Location           6019 non-null   object 
 2   Year               6019 non-null   int64  
 3   Kilometers_Driven  6019 non-null   int64  
 4   Fuel_Type          6019 non-null   object 
 5   Transmission       6019 non-null   object 
 6   Owner_Type         6019 non-null   object 
 7   Mileage            6017 non-null   object 
 8   Engine             5983 non-null   object 
 9   Power              5983 non-null   object 
 10  Seats              5977 non-null   float64
 11  Price              6019 non-null   int32  
dtypes: float64(1), int32(1), int64(2), object(8)
memory usage: 540.9+ KB


In [6]:
# Checking the number of rows and columns I am working with
used_cars.shape

(6019, 12)

In [7]:
# Checking for missing values
used_cars.isna().sum()

Name                  0
Location              0
Year                  0
Kilometers_Driven     0
Fuel_Type             0
Transmission          0
Owner_Type            0
Mileage               2
Engine               36
Power                36
Seats                42
Price                 0
dtype: int64

Null values:

Mileage               2
Engine               36
Power                36
Seats                42
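Raw counts are easier to judge next to the share of rows they represent; here the largest gap, 42 of 6,019 rows in Seats, is roughly 0.7%. A small sketch of a missing-value summary with counts and percentages, run on an illustrative toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for used_cars (values are illustrative).
df = pd.DataFrame({
    "Mileage": ["26.6 km/kg", None, "18.2 kmpl", "20.77 kmpl"],
    "Seats": [5.0, 5.0, np.nan, np.nan],
    "Price": [1.75, 12.5, 4.5, 6.0],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, largest first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```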




In [8]:


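# Seats takes only a few discrete values, and Engine and Power are still stored as text
# (e.g. "998 CC", "58.16 bhp"), so for these three columns the null values are filled with the mode.
# Mileage is a continuous variable, so its missing values are replaced with the mean.
# The column is stored as text (e.g. "19.67 kmpl"), so the mean is computed from the numeric part;
# writing it back with the "kmpl" unit is an assumption that keeps the column format consistent.

mileage_mean = used_cars["Mileage"].str.split().str[0].astype(float).mean()
used_cars["Mileage"] = used_cars["Mileage"].fillna(f"{mileage_mean:.2f} kmpl")
used_cars["Engine"] = used_cars["Engine"].fillna(used_cars["Engine"].mode()[0])
used_cars["Power"] = used_cars["Power"].fillna(used_cars["Power"].mode()[0])
used_cars["Seats"] = used_cars["Seats"].fillna(used_cars["Seats"].mode()[0])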
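Because Mileage, Engine and Power carry their units inside the string, one alternative is to convert them to numbers first and impute afterwards. The sketch below is one possible way to do that, not the notebook's own method. The helper `leading_number`, the toy rows, and the `"null bhp"` placeholder are all hypothetical and serve only as illustration:

```python
import pandas as pd

# Toy frame standing in for used_cars; "null bhp" is a hypothetical
# non-numeric placeholder used to show how such text is handled.
df = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl", None, "20.77 kmpl"],
    "Engine": ["998 CC", None, "1248 CC", "1248 CC"],
    "Power": ["88.76 bhp", "126.2 bhp", "null bhp", "88.76 bhp"],
})

def leading_number(col: pd.Series) -> pd.Series:
    """Return the leading number of strings like '19.67 kmpl'; anything else becomes NaN."""
    extracted = col.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")

# Strip the units so the columns become float64.
for name in ["Mileage", "Engine", "Power"]:
    df[name] = leading_number(df[name])

# Impute after conversion: mean for Mileage, mode for Engine and Power.
df["Mileage"] = df["Mileage"].fillna(df["Mileage"].mean())
df["Engine"] = df["Engine"].fillna(df["Engine"].mode()[0])
df["Power"] = df["Power"].fillna(df["Power"].mode()[0])

print(df)
```

Converting first means the mean is computed on real numbers, and the columns then appear in `describe()` alongside Year and Seats. Note that this drops the distinction between `km/kg` and `kmpl`, which may matter for CNG cars.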

In [9]:
used_cars.isna().sum()

Name                 0
Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
Price                0
dtype: int64

In [10]:
# There are no null values left; next, check for duplicate rows

used_cars.duplicated().value_counts()

False    6019
dtype: int64

In [11]:
# There are no duplicated rows
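For reference, a short sketch of how duplicates could be counted, inspected and removed if any were present. The toy frame is illustrative and deliberately contains one repeated row:

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are illustrative).
df = pd.DataFrame({
    "Name": ["Honda Jazz V", "Honda Jazz V", "Maruti Ertiga VDI"],
    "Year": [2011, 2011, 2012],
    "Price": [4.5, 4.5, 6.0],
})

# Number of rows that repeat an earlier row.
n_dup = int(df.duplicated().sum())

# keep=False marks every member of a duplicate group, which helps inspection.
dups = df[df.duplicated(keep=False)]

# Drop the repeats and rebuild a clean 0..n-1 index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dup, len(dups), len(deduped))
```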

In [12]:
## Descriptive statistics of the numeric columns (Year, Kilometers_Driven, Seats and Price)

used_cars.describe()

,Year,Kilometers_Driven,Seats,Price
count,6019.0,6019.0,6019.0,6019.0
mean,2013.358199,58738.38,5.27679,1725263.0
std,3.269742,91268.84,0.806346,2036201.0
min,1998.0,171.0,0.0,80080.0
25%,2011.0,34000.0,5.0,637000.0
50%,2014.0,53000.0,5.0,1026480.0
75%,2016.0,73000.0,5.0,1810899.0
max,2019.0,6500000.0,10.0,29120000.0
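Two values in this summary look implausible: a minimum of 0 seats, and a maximum of 6,500,000 km against a 75th percentile of 73,000 km. The sketch below shows one common way to flag such rows, the 1.5 × IQR rule, on a toy frame. The first five rows mirror the `head()` output, the sixth row is an illustrative combination of the two extremes, and the helper `iqr_outliers` is hypothetical:

```python
import pandas as pd

# First five rows follow the head() output; the last row is illustrative.
df = pd.DataFrame({
    "Kilometers_Driven": [72000, 41000, 46000, 87000, 40670, 6500000],
    "Seats": [5.0, 5.0, 5.0, 7.0, 5.0, 0.0],
})

def iqr_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)

# Flag extreme odometer readings and physically impossible seat counts.
km_flag = iqr_outliers(df["Kilometers_Driven"])
zero_seats = df["Seats"] == 0

print(df[km_flag | zero_seats])
```

Whether flagged rows are dropped, capped or corrected is a modelling decision; the rule only identifies candidates for review.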


### 3.0 Exploratory Data Analysis

#### 3.1 Univariate Analysis