## Introduction (Car Price Prediction)

The objective is to predict the price, `Amount (Million Naira)`, at which the company should sell a car, using the other available columns (Location, Maker, Model, Year, Colour, Type, Distance).

## Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

## Loading the dataset

In [2]:
Test = pd.read_csv('Test.csv')
Train = pd.read_csv('Train.csv')

Sub = pd.read_csv('SampleSubmission.csv')

In [3]:
Test.head()

Unnamed: 0,VehicleID,Location,Maker,Model,Year,Colour,Type,Distance
0,VHL18518,Abuja,BMW,323i,2008,White,Foreign Used,30524.0
1,VHL17149,Lagos,Toyota,Camry,2013,White,Foreign Used,
2,VHL10927,Lagos,Toyota,Highlander Limited V6,2005,Gold,Foreign Used,
3,VHL12909,Lagos,Toyota,Camry,2011,Gray,Foreign Used,166839.0
4,VHL12348,Lagos,Lexus,ES 350 FWD,2013,Red,Foreign Used,88862.0


In [4]:
Train.head()

Unnamed: 0,VehicleID,Location,Maker,Model,Year,Colour,Amount (Million Naira),Type,Distance
0,VHL12546,Abuja,Honda,Accord Coupe EX V-6,2011,Silver,2.2,Nigerian Used,
1,VHL18827,Ibadan,Hyundai,Sonata,2012,Silver,3.5,Nigerian Used,125000.0
2,VHL19499,Lagos,Lexus,RX 350,2010,Red,9.2,Foreign Used,110852.0
3,VHL17991,Abuja,Mercedes-Benz,GLE-Class,2017,Blue,22.8,Foreign Used,30000.0
4,VHL12170,Ibadan,Toyota,Highlander,2002,Red,2.6,Nigerian Used,125206.0


In [5]:
Sub.head()

Unnamed: 0,VehicleID,Amount (Million Naira)
0,VHL18518,1.0
1,VHL17149,1.0
2,VHL10927,1.0
3,VHL12909,1.0
4,VHL12348,1.0


In [6]:
Test.shape,Train.shape

((2061, 8), (7205, 9))

In [7]:
Test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   VehicleID  2061 non-null   object 
 1   Location   2061 non-null   object 
 2   Maker      2061 non-null   object 
 3   Model      2061 non-null   object 
 4   Year       2059 non-null   object 
 5   Colour     2061 non-null   object 
 6   Type       2007 non-null   object 
 7   Distance   1385 non-null   float64
dtypes: float64(1), object(7)
memory usage: 128.9+ KB


In [8]:
Train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7205 entries, 0 to 7204
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   VehicleID               7205 non-null   object 
 1   Location                7205 non-null   object 
 2   Maker                   7205 non-null   object 
 3   Model                   7205 non-null   object 
 4   Year                    7184 non-null   object 
 5   Colour                  7205 non-null   object 
 6   Amount (Million Naira)  7188 non-null   float64
 7   Type                    7008 non-null   object 
 8   Distance                4845 non-null   object 
dtypes: float64(1), object(8)
memory usage: 506.7+ KB
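
The `info()` output shows `Year` stored as `object` in both frames and `Distance` stored as `object` in `Train`, which suggests that some entries are not plain numbers. A minimal sketch of one way to coerce such columns, assuming stray strings or thousands separators are the cause (the sample values below are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample; the real columns may contain different stray values.
sample = pd.DataFrame({
    "Year": ["2011", "2012", "unknown", None],
    "Distance": ["125,000", "110852", None, "30000"],
})

# Strip thousands separators, then coerce; unparseable entries become NaN.
sample["Distance"] = pd.to_numeric(
    sample["Distance"].str.replace(",", "", regex=False), errors="coerce"
)
sample["Year"] = pd.to_numeric(sample["Year"], errors="coerce")

print(sample.dtypes)
print(sample)
```

Keeping these columns numeric preserves their ordering, which a linear model can use directly.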


In [9]:
Test.describe()

Unnamed: 0,Distance
count,1385.0
mean,103800.668592
std,105986.234512
min,1.0
25%,52352.0
50%,82000.0
75%,120398.0
max,985216.0


In [10]:
Train.describe()

Unnamed: 0,Amount (Million Naira)
count,7188.0
mean,11.847999
std,25.318922
min,0.45
25%,3.5
50%,5.65
75%,11.6625
max,456.0


## Clean the dataset

In [11]:
Test.columns

Index(['VehicleID', 'Location', 'Maker', 'Model', 'Year', 'Colour', 'Type',
       'Distance'],
      dtype='object')

In [12]:
Train.columns

Index(['VehicleID', 'Location', 'Maker', 'Model', 'Year', 'Colour',
       'Amount (Million Naira)', 'Type', 'Distance'],
      dtype='object')

In [13]:
Test.isnull().sum()

VehicleID      0
Location       0
Maker          0
Model          0
Year           2
Colour         0
Type          54
Distance     676
dtype: int64

In [14]:
Train.isnull().sum()

VehicleID                    0
Location                     0
Maker                        0
Model                        0
Year                        21
Colour                       0
Amount (Million Naira)      17
Type                       197
Distance                  2360
dtype: int64

In [15]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

In [17]:
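cat_col = ['Location', 'Maker', 'Model', 'Year', 'Colour', 'Type', 'Distance']

# Caution: the encoder is refit on each frame, so the same category can receive
# a different integer in Train and Test (Toyota is 52 in Train but 37 in Test).
# Year and Distance are numeric in nature but are encoded as labels here too.
for i in cat_col:
    Train[i] = le.fit_transform(Train[i])
    Test[i] = le.fit_transform(Test[i])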
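
Because the loop above fits a fresh encoding on each frame, the integer codes in `Train` and `Test` are not guaranteed to agree. One possible alternative, sketched here on invented miniature frames (`train_demo` and `test_demo` are hypothetical stand-ins, not the real data), is to fit scikit-learn's `OrdinalEncoder` on the training frame only and reuse it on the test frame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature frames standing in for Train and Test.
train_demo = pd.DataFrame({"Maker": ["Toyota", "Honda", "Lexus"],
                           "Location": ["Lagos", "Abuja", "Ibadan"]})
test_demo = pd.DataFrame({"Maker": ["Lexus", "BMW"],
                          "Location": ["Abuja", "Lagos"]})

cols = ["Maker", "Location"]

# Categories unseen during fit are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training frame only, then apply the same mapping to the test frame.
train_enc = pd.DataFrame(enc.fit_transform(train_demo[cols]), columns=cols)
test_enc = pd.DataFrame(enc.transform(test_demo[cols]), columns=cols)

print(train_enc)
print(test_enc)
```

With this approach a given maker or location maps to the same code in both frames, and makers that appear only in the test data are flagged with `-1`.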

In [18]:
Train.head()

Unnamed: 0,VehicleID,Location,Maker,Model,Year,Colour,Amount (Million Naira),Type,Distance
0,VHL12546,0,17,117,21,16,2.2,2,3144
1,VHL18827,1,19,1049,22,16,3.5,2,466
2,VHL19499,2,29,908,20,15,9.2,1,235
3,VHL17991,0,34,508,27,2,22.8,1,1446
4,VHL12170,1,52,569,12,15,2.6,2,470


In [19]:
Test.head()

Unnamed: 0,VehicleID,Location,Maker,Model,Year,Colour,Type,Distance
0,VHL18518,0,2,8,14,16,1,131
1,VHL17149,2,37,123,19,16,1,1020
2,VHL10927,2,37,272,11,7,1,1020
3,VHL12909,2,37,123,17,8,1,838
4,VHL12348,2,20,192,19,12,1,528


In [21]:
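# After label encoding, missing values in the encoded columns already carry
# their own integer code, so only the target still contains NaN at this point.
# Filling the target with 1 treats those 17 rows as 1-million-Naira cars.
Train_Null = ['Year', 'Amount (Million Naira)', 'Type', 'Distance']

for i in Train_Null:
    Train[i] = Train[i].fillna(1)

Test_Null = ['Year', 'Type', 'Distance']

for i in Test_Null:
    Test[i] = Test[i].fillna(5555)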
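
The constants `1` and `5555` used above are arbitrary placeholders. A minimal sketch of a more conventional alternative, assuming the imputation runs before any encoding and on an invented miniature frame (`demo` is hypothetical): drop rows that lack a target, fill numeric gaps with the median, and fill categorical gaps with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for Train.
demo = pd.DataFrame({
    "Distance": [125000.0, np.nan, 30000.0, 110852.0],
    "Type": ["Foreign Used", "Foreign Used", None, "Nigerian Used"],
    "Amount (Million Naira)": [3.5, 9.2, 22.8, np.nan],
})

# Rows without a target cannot supervise the model, so drop them.
demo = demo.dropna(subset=["Amount (Million Naira)"]).copy()

# Median for the numeric feature, most frequent value for the categorical one.
demo["Distance"] = demo["Distance"].fillna(demo["Distance"].median())
demo["Type"] = demo["Type"].fillna(demo["Type"].mode()[0])

print(demo)
```

Whether median and mode are the right fill values for this dataset is a judgment call; the point of the sketch is only that the fill value is derived from the data rather than hard-coded.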

In [22]:
Train.isnull().sum()

VehicleID                 0
Location                  0
Maker                     0
Model                     0
Year                      0
Colour                    0
Amount (Million Naira)    0
Type                      0
Distance                  0
dtype: int64

In [23]:
Test.isnull().sum()

VehicleID    0
Location     0
Maker        0
Model        0
Year         0
Colour       0
Type         0
Distance     0
dtype: int64

In [24]:
Train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7205 entries, 0 to 7204
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   VehicleID               7205 non-null   object 
 1   Location                7205 non-null   int64  
 2   Maker                   7205 non-null   int64  
 3   Model                   7205 non-null   int64  
 4   Year                    7205 non-null   int64  
 5   Colour                  7205 non-null   int32  
 6   Amount (Million Naira)  7205 non-null   float64
 7   Type                    7205 non-null   int32  
 8   Distance                7205 non-null   int32  
dtypes: float64(1), int32(3), int64(4), object(1)
memory usage: 422.3+ KB


In [25]:
Test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2061 entries, 0 to 2060
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   VehicleID  2061 non-null   object
 1   Location   2061 non-null   int64 
 2   Maker      2061 non-null   int64 
 3   Model      2061 non-null   int64 
 4   Year       2061 non-null   int64 
 5   Colour     2061 non-null   int32 
 6   Type       2061 non-null   int32 
 7   Distance   2061 non-null   int64 
dtypes: int32(2), int64(5), object(1)
memory usage: 112.8+ KB


In [26]:
Train.drop(['VehicleID'], axis=1, inplace=True)

In [27]:
Test.drop(['VehicleID'], axis=1, inplace=True)

## Split the data and fit a baseline model

In [29]:
y = Train["Amount (Million Naira)"]
X = Train.drop("Amount (Million Naira)", axis=1)

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
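
The split above is unseeded, so each run produces a different partition and a slightly different error. A small sketch of how `random_state` makes the split repeatable, using invented arrays (`X_demo`, `y_demo`) and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# random_state=42 is an arbitrary choice; any fixed integer works.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Both calls return identical partitions.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```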

In [31]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

In [32]:
reg.fit(X_train,y_train)

LinearRegression()

In [33]:
y_pred = reg.predict(X_test)  # predict on the held-out split

In [34]:
from sklearn.metrics import mean_squared_error

# Common ways to evaluate a regression model (only RMSE is computed here):
# - R Square/Adjusted R Square
# - Mean Square Error (MSE)/Root Mean Square Error (RMSE)
# - Mean Absolute Error (MAE)
print("Root Mean Squared Error is", np.sqrt(mean_squared_error(y_test, y_pred)))

Root Mean Squared Error is 25.928133513377002
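
The comments in the cell above list several metrics, of which only RMSE is printed. As an illustration of how the others could be computed, here is a sketch on invented values (`y_true`, `y_hat`, and `n_features` are hypothetical and unrelated to the model above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_hat)

# Adjusted R^2 penalises R^2 for the number of predictors used.
n_samples = len(y_true)
n_features = 1  # assumed number of predictors in this toy example
adj_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
print("Adjusted R^2:", adj_r2)
```

On the notebook's own data, the same calls would take `y_test` and `y_pred`, with `n_features` set to the number of columns in `X_test`.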


## Main Predictions

In [35]:
predictions = reg.predict(Test)

In [36]:
Sub.head()

Unnamed: 0,VehicleID,Amount (Million Naira)
0,VHL18518,1.0
1,VHL17149,1.0
2,VHL10927,1.0
3,VHL12909,1.0
4,VHL12348,1.0


In [37]:
Sub['Amount (Million Naira)'] = predictions  # replace the 1.0 placeholders with the model output
Sub.to_csv('Submit.csv', index=False)

## Other machine learning regression models

In [38]:
from sklearn.svm import SVR

In [40]:
model = SVR()
model.fit(X_train,y_train)

SVR()

In [41]:
prediction = model.predict(X_test)

In [42]:
print('Error is', np.sqrt(mean_squared_error(y_test, prediction)))  # score the SVR output, not the linear model's y_pred
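
Support vector regression with the default RBF kernel is sensitive to the scale of its inputs, so it is usually paired with feature standardisation. A minimal sketch on synthetic data (the arrays and the seed are invented, and this is one possible setup rather than the notebook's method):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on a large scale, loosely mimicking a mileage column.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 200000, size=(200, 2))
y_demo = 1e-4 * X_demo[:, 0] + rng.normal(0, 0.5, size=200)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The scaler is fit on the training split only, then applied before SVR.
svr = make_pipeline(StandardScaler(), SVR())
svr.fit(Xtr, ytr)

svr_pred = svr.predict(Xte)
svr_rmse = np.sqrt(mean_squared_error(yte, svr_pred))
print("SVR RMSE:", svr_rmse)
```

Wrapping the scaler and the model in one pipeline keeps the scaling statistics from leaking out of the training split.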
