# Car Price Prediction using Machine Learning with Python

Problem Statement:
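The goal is to predict the selling price of a used car from features such as its year, present price, kilometres driven, fuel type, seller type, transmission, and number of previous owners.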


#### Mounting Google Drive to import the required data

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd gdrive/My Drive/Colab Notebooks/Predictive-Analytics

Mounted at /content/gdrive
/content/gdrive/My Drive/Colab Notebooks/Predictive-Analytics


#### Importing the dependencies for this project

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn import metrics

#### Data pre-processing

In [7]:
# Reading the data from the CSV file into a pandas DataFrame
cardata = pd.read_csv("UsedCarData.csv")

In [8]:
# Inspecting the imported data
cardata.head()

|   | Car_Name | Year | Selling_Price | Present_Price | Driven_kms | Fuel_Type | Selling_type | Transmission | Owner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ritz | 2014 | 3.35 | 5.59 | 27000 | Petrol | Dealer | Manual | 0 |
| 1 | sx4 | 2013 | 4.75 | 9.54 | 43000 | Diesel | Dealer | Manual | 0 |
| 2 | ciaz | 2017 | 7.25 | 9.85 | 6900 | Petrol | Dealer | Manual | 0 |
| 3 | wagon r | 2011 | 2.85 | 4.15 | 5200 | Petrol | Dealer | Manual | 0 |
| 4 | swift | 2014 | 4.6 | 6.87 | 42450 | Diesel | Dealer | Manual | 0 |


In [10]:
# Checking the shape of the dataset: (number of rows, number of columns)
cardata.shape

(301, 9)

In [12]:
# Getting the information about the dataset
cardata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Driven_kms     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Selling_type   301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


Every column has 301 non-null entries, matching the 301 rows, so no values are missing from the dataset.
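The same conclusion can be checked directly instead of being read off the `info()` output. The following is a minimal sketch, assuming a small stand-in DataFrame named `sample` (a hypothetical name; its rows are copied from the `head()` output above, not loaded from the full file):

```python
import pandas as pd

# Stand-in for cardata: the first three rows shown by cardata.head()
sample = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz"],
    "Year": [2014, 2013, 2017],
    "Selling_Price": [3.35, 4.75, 7.25],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
})

# isnull() flags missing cells; sum() counts them per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# True only when no column contains a missing value
no_missing = bool((missing_per_column == 0).all())
print(no_missing)
```

On the real data, the equivalent call would be `cardata.isnull().sum()`.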

In [14]:
# Checking the distribution of the categorical data
print(cardata.Fuel_Type.value_counts())
print("-------"*2)
print(cardata.Selling_type.value_counts())
print("-------"*2)
print(cardata.Transmission.value_counts())

Petrol    239
Diesel     60
CNG         2
Name: Fuel_Type, dtype: int64
--------------
Dealer        195
Individual    106
Name: Selling_type, dtype: int64
--------------
Manual       261
Automatic     40
Name: Transmission, dtype: int64


The regression models used here work on numbers and cannot take text values as input, so we convert the categorical columns into numerical data through encoding.

In [15]:
# Encoding the categorical data as integer codes
cardata.replace({'Fuel_Type':{'Petrol': 0,'Diesel':1,'CNG':2}}, inplace = True)
cardata.replace({'Selling_type':{'Dealer': 0,'Individual':1}}, inplace = True)
cardata.replace({'Transmission':{'Manual': 0,'Automatic':1}}, inplace = True)
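
One way to double-check an encoding like this is to apply the codes with `Series.map`, which leaves `NaN` wherever a label is missing from the mapping, and then confirm that nothing was left unmapped. This is a sketch of an alternative to the `replace()` calls, not the notebook's own method; `sample`, `encodings`, and `encoded` are hypothetical names, and the rows are made-up stand-ins for the three categorical columns:

```python
import pandas as pd

# Stand-in for the three categorical columns of cardata
sample = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Selling_type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# Same label-to-integer codes as the replace() calls in the notebook
encodings = {
    "Fuel_Type": {"Petrol": 0, "Diesel": 1, "CNG": 2},
    "Selling_type": {"Dealer": 0, "Individual": 1},
    "Transmission": {"Manual": 0, "Automatic": 1},
}

encoded = sample.copy()
for column, mapping in encodings.items():
    # map() yields NaN for any label that is not a key in the mapping
    encoded[column] = sample[column].map(mapping)

# Fail loudly if any category was left unmapped
assert not encoded.isnull().any().any()
print(encoded)
```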

In [16]:
cardata.head()

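|   | Car_Name | Year | Selling_Price | Present_Price | Driven_kms | Fuel_Type | Selling_type | Transmission | Owner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ritz | 2014 | 3.35 | 5.59 | 27000 | 0 | 0 | 0 | 0 |
| 1 | sx4 | 2013 | 4.75 | 9.54 | 43000 | 1 | 0 | 0 | 0 |
| 2 | ciaz | 2017 | 7.25 | 9.85 | 6900 | 0 | 0 | 0 | 0 |
| 3 | wagon r | 2011 | 2.85 | 4.15 | 5200 | 0 | 0 | 0 | 0 |
| 4 | swift | 2014 | 4.6 | 6.87 | 42450 | 1 | 0 | 0 | 0 |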


#### Train & Test Data

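We define two variables, X and Y: X holds the independent variables (the features) and Y holds the target variable (the value we are trying to predict). Car_Name is left out of X along with the target, since it is a text column that was not encoded.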

In [17]:
X = cardata.drop(['Car_Name', 'Selling_Price'], axis = 1)
Y = cardata["Selling_Price"]

In [18]:
X.head()

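|   | Year | Present_Price | Driven_kms | Fuel_Type | Selling_type | Transmission | Owner |
|---|---|---|---|---|---|---|---|
| 0 | 2014 | 5.59 | 27000 | 0 | 0 | 0 | 0 |
| 1 | 2013 | 9.54 | 43000 | 1 | 0 | 0 | 0 |
| 2 | 2017 | 9.85 | 6900 | 0 | 0 | 0 | 0 |
| 3 | 2011 | 4.15 | 5200 | 0 | 0 | 0 | 0 |
| 4 | 2014 | 6.87 | 42450 | 1 | 0 | 0 | 0 |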


In [19]:
Y.head()

0    3.35
1    4.75
2    7.25
3    2.85
4    4.60
Name: Selling_Price, dtype: float64

We now split the data into training and test sets.

Because the dataset is small (301 rows), we keep the test set small as well: 10% of the rows.


In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = 2)
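
To see what a 10% split of 301 rows looks like, the sketch below repeats the call on synthetic arrays of the same shape as X and Y (301 rows, 7 feature columns). The names `X_demo`, `Y_demo`, `X_tr`, `X_te`, `Y_tr`, and `Y_te` are hypothetical, and the values are placeholders rather than car data. scikit-learn rounds the test size up, so the split comes out as 31 test rows and 270 training rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as X (301 rows, 7 features)
n_rows = 301
X_demo = np.arange(n_rows * 7).reshape(n_rows, 7)
Y_demo = np.arange(n_rows, dtype=float)

# Same arguments as the notebook's call
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=2
)

# Shapes of the training and test feature matrices
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.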

#### Model Training

In [22]:
# Linear Regression Model
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train, Y_train)




#### Model Evaluation

In [None]:
# Predicting selling prices for the test set
Y_predict = linear_regression_model.predict(X_test)
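
The predictions can then be compared with the true test values using the `metrics` module imported earlier. The following is a minimal sketch on synthetic data (a noisy linear relationship, not the car dataset), so the numbers it prints say nothing about the model's performance on car prices. All names ending in `_demo`, `_tr`, `_te`, or `_pred` are hypothetical:

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a noisy linear relationship (NOT the car dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(301, 3))
Y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=301)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=2
)

# Fit on the training split, predict on the held-out split
model = LinearRegression().fit(X_tr, Y_tr)
Y_pred = model.predict(X_te)

# R squared: share of variance explained; MAE: average absolute error
r2 = metrics.r2_score(Y_te, Y_pred)
mae = metrics.mean_absolute_error(Y_te, Y_pred)
print(f"R squared: {r2:.3f}")
print(f"Mean absolute error: {mae:.3f}")
```

For this notebook, the corresponding calls would be `metrics.r2_score(Y_test, Y_predict)` and `metrics.mean_absolute_error(Y_test, Y_predict)`.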