
Import the base libraries

In [None]:
import pandas as pd
import numpy as np

Read CSV file

In [None]:
df = pd.read_csv('/content/CarResale.csv')

Inspect the first five rows

In [None]:
df.head()

,year,km_driven,fuel,seller_type,transmission,owner,seats,mileage,engine_size,brake_horsepower,selling_price
0,2014,145500,Diesel,Individual,Manual,First Owner,5,23.4,1248,74.0,4500.0
1,2014,120000,Diesel,Individual,Manual,Second Owner,5,21.14,1498,103.52,3700.0
2,2006,140000,Petrol,Individual,Manual,Third Owner,5,17.7,1497,78.0,1580.0
3,2010,127000,Diesel,Individual,Manual,First Owner,5,23.0,1396,90.0,2250.0
4,2007,120000,Petrol,Individual,Manual,First Owner,5,16.1,1298,88.2,1300.0


Check for null values

In [None]:
df.isnull().sum()

year                0
km_driven           0
fuel                0
seller_type         0
transmission        0
owner               0
seats               0
mileage             0
engine_size         0
brake_horsepower    0
selling_price       0
dtype: int64

Check for duplicate values

In [None]:
df.duplicated().sum()

1205

In [None]:
df = df.drop_duplicates()

In [None]:
df.duplicated().sum()

0
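
As a small illustration of the deduplication step, the sketch below uses a made-up four-row frame (the name `toy` and its values are assumptions, not rows from CarResale.csv). It shows that `drop_duplicates` keeps the first occurrence of each row and leaves gaps in the index, which is why the index can run past the number of remaining entries. Resetting the index is optional and is shown only for completeness.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "year": [2014, 2014, 2006, 2014],
    "km_driven": [100, 100, 200, 100],
})

# drop_duplicates keeps the first occurrence and preserves the original labels.
deduped = toy.drop_duplicates()
print(deduped.index.tolist())

# Optionally renumber the rows so the index is contiguous again.
reindexed = deduped.reset_index(drop=True)
print(reindexed.index.tolist())
```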

Get the basic information about the data

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6595 entries, 0 to 7797
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   year              6595 non-null   int64  
 1   km_driven         6595 non-null   int64  
 2   fuel              6595 non-null   object 
 3   seller_type       6595 non-null   object 
 4   transmission      6595 non-null   object 
 5   owner             6595 non-null   object 
 6   seats             6595 non-null   int64  
 7   mileage           6595 non-null   float64
 8   engine_size       6595 non-null   int64  
 9   brake_horsepower  6595 non-null   float64
 10  selling_price     6595 non-null   float64
dtypes: float64(3), int64(4), object(4)
memory usage: 618.3+ KB


The output above shows that four columns (fuel, seller_type, transmission and owner) have the object dtype, so they must be encoded as numbers before modelling.

In [None]:
df['fuel'].value_counts()

Diesel    3644
Petrol    2951
Name: fuel, dtype: int64

In [None]:
df['seller_type'].value_counts()

Individual    5907
Dealer         688
Name: seller_type, dtype: int64

In [None]:
df['transmission'].value_counts()

Manual       6025
Automatic     570
Name: transmission, dtype: int64

In [None]:
df['owner'].value_counts()

First Owner             4099
Second Owner            1854
Third Owner              485
Fourth & Above Owner     152
Test Drive Car             5
Name: owner, dtype: int64

Encode each class of the four categorical variables as an integer

In [None]:
df['fuel'] = df['fuel'].map({'Diesel':1, 'Petrol':2})
df['seller_type'] = df['seller_type'].map({'Individual':1, 'Dealer':2})
df['transmission'] = df['transmission'].map({'Manual':1, 'Automatic':2})
df['owner'] = df['owner'].map({'Test Drive Car':1, 'Fourth & Above Owner':2, 'Third Owner': 3, 'Second Owner':4, 'First Owner':5})
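
One thing worth checking after a dictionary-based encoding is whether every category was covered, because `Series.map` returns NaN for any value that is missing from the dictionary. The sketch below uses a made-up frame with a hypothetical extra category (`CNG`, which does not appear in the value counts above) to show how such gaps could be detected.

```python
import pandas as pd

# Hypothetical toy data; 'CNG' stands in for an unexpected category.
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Diesel"]})

fuel_codes = {"Diesel": 1, "Petrol": 2}
encoded = toy["fuel"].map(fuel_codes)

# Values absent from the dictionary become NaN, so list them explicitly.
unmapped = toy.loc[encoded.isna(), "fuel"].unique().tolist()
print("Unmapped categories:", unmapped)
```

Note also that integer codes imply an ordering and equal spacing between classes. That is harmless for two-class columns, but for owner it is a modelling assumption rather than a fact about the data.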

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6595 entries, 0 to 7797
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   year              6595 non-null   int64  
 1   km_driven         6595 non-null   int64  
 2   fuel              6595 non-null   int64  
 3   seller_type       6595 non-null   int64  
 4   transmission      6595 non-null   int64  
 5   owner             6595 non-null   int64  
 6   seats             6595 non-null   int64  
 7   mileage           6595 non-null   float64
 8   engine_size       6595 non-null   int64  
 9   brake_horsepower  6595 non-null   float64
 10  selling_price     6595 non-null   float64
dtypes: float64(3), int64(8)
memory usage: 618.3 KB


Get the basic descriptive statistics

In [None]:
df.describe()

,year,km_driven,fuel,seller_type,transmission,owner,seats,mileage,engine_size,brake_horsepower,selling_price
count,6595.0,6595.0,6595.0,6595.0,6595.0,6595.0,6595.0,6595.0,6595.0,6595.0,6595.0
mean,2013.615315,72846.240182,1.44746,1.104321,1.086429,4.499621,5.441547,19.47847,1435.516907,88.134959,5295.153518
std,3.902533,48622.79325,0.49727,0.3057,0.281018,0.737935,0.987656,3.908825,493.952888,31.738834,5256.336888
min,1994.0,1000.0,1.0,1.0,1.0,1.0,4.0,9.0,624.0,34.2,299.99
25%,2011.0,37579.5,1.0,1.0,1.0,4.0,5.0,16.8,1197.0,68.0,2500.0
50%,2014.0,68203.0,1.0,1.0,1.0,5.0,5.0,19.4,1248.0,81.86,4250.0
75%,2017.0,100000.0,2.0,1.0,1.0,5.0,5.0,22.345,1498.0,100.0,6500.0
max,2020.0,577414.0,2.0,2.0,2.0,5.0,14.0,42.0,3604.0,400.0,100000.0


Separate dependent and independent variables

In [None]:
X = df.drop(columns = 'selling_price') # Set of independent variables
y = df['selling_price'] # Dependent variable

Split into train and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(6595, 10)
(5276, 10)
(1319, 10)
(5276,)
(1319,)
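
The split above is random, so the rows in each set (and therefore the scores) change from run to run. A minimal sketch of a reproducible split follows, assuming a fixed `random_state`. The synthetic arrays `X_demo` and `y_demo` and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# The same random_state yields the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print("Identical test sets:", np.array_equal(split_a[1], split_b[1]))
```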


Build the model

To build the model, import the estimator (LinearRegression in this case), create an instance, and train it with the fit method

In [None]:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(X_train, y_train)

After training the model, get predictions for the test features

In [None]:
y_pred = LR.predict(X_test)
y_pred

array([6985.80644607, 3460.46707979, 5725.08173162, ..., 9910.35876753,
       -306.4205076 , 4836.39887936])
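
The prediction array above contains a negative value (-306.42), which is not a meaningful price. This can happen because a linear model is unbounded. The sketch below, on synthetic data with a known relationship (the columns `a` and `b` are hypothetical), shows one way to inspect the fitted coefficients and count negative predictions. It is an illustration only, not a fix for the model above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known, noise-free relationship: y = 3*a - 2*b + 5.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + 5

model = LinearRegression().fit(X_demo, y_demo)

# Pair each fitted coefficient with its column name.
coefs = pd.Series(model.coef_, index=X_demo.columns)
print(coefs)
print("Intercept:", model.intercept_)

# Nothing constrains the output range, so predictions can fall below zero.
preds = model.predict(X_demo)
print("Negative predictions:", int((preds < 0).sum()))
```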

To evaluate the model, compare the true test values with the predicted values.

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

print("The r2_score is:", round(r2_score(y_test, y_pred),2))
print("The mean squared error is:", round(mean_squared_error(y_test, y_pred),2))
print("The mean absolute error is:", round(mean_absolute_error(y_test, y_pred),2))

The r2_score is: 0.57
The mean squared error is: 15852700.5
The mean absolute error is: 1863.19


In [None]:
from math import sqrt
print("The root mean squared error is:", round(sqrt(mean_squared_error(y_test, y_pred)),2))

The root mean squared error is: 3981.54
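
To make the four metrics concrete, the sketch below computes them by hand with NumPy on a small made-up example (the arrays `y_true` and `y_hat` are illustrative, not taken from the dataset). It follows the standard definitions: MAE is the mean absolute error, MSE the mean squared error, RMSE its square root, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Small made-up example, not taken from the dataset.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the target's units

# R2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

Because RMSE is in the same units as selling_price, it is usually easier to interpret than MSE.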


Repeat the experiment with a 75/25 train/test split

In [None]:
# train_test_split is already imported above; only the test fraction changes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

In [None]:
LR = LinearRegression()
LR.fit(X_train, y_train)

In [None]:
y_pred = LR.predict(X_test)
print("The r2_score is:", round(r2_score(y_test, y_pred),2))
print("The mean squared error is:", round(mean_squared_error(y_test, y_pred),2))
print("The mean absolute error is:", round(mean_absolute_error(y_test, y_pred),2))
print("The root mean squared error is:", round(sqrt(mean_squared_error(y_test, y_pred)),2))

The r2_score is: 0.6
The mean squared error is: 9657826.74
The mean absolute error is: 1714.51
The root mean squared error is: 3107.7
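
The two runs differ in both the test fraction and the random rows that landed in each set, so the change in score cannot be attributed to the split size alone. One common way to get a more stable estimate is k-fold cross-validation. The sketch below shows the idea on synthetic data; the arrays, the five folds, and the seed are assumptions for illustration, and it was not run on the car data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Shuffle once with a fixed seed, then score the model on each of 5 folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 2))
print("Mean R2: %.2f (std %.2f)" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of how much of a difference such as 0.57 versus 0.6 could be due to the split alone.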
