## Toyota Used Car Machine Learning Price Prediction

In this notebook, we build a machine learning model end to end on the used car data for one particular brand, Toyota, taken from this dataset. The objective of the model is to predict the price of a used car from its available features.
    
Steps for implementing a Machine Learning Model:

1. Import required libraries and packages
    - Read Dataset
    - Examine Feature and Target Variables
2. Splitting Dataset
    - Splitting Feature and Target variable
    - Splitting Train and Test set
3. Training Model
    - Perform Linear Regression
4. Prediction
    - Prediction Error
        
We will be using the following libraries:

- pandas for data handling
- numpy for numerical operations
- specific modules from sklearn for model-related operations as required
- matplotlib.pyplot and seaborn for data visualization

### Import required libraries and packages

In [42]:
# Import Libraries for Data handling
import pandas as pd

# Import Libraries for Numerical Operations
import numpy as np

# Data Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

### Read the CSV file into a pandas DataFrame

In [43]:
df = pd.read_csv('toyota.csv')

print(type(df))
display(df.head())
display(df.tail())

<class 'pandas.core.frame.DataFrame'>


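|   | model | year | price | transmission | mileage | fuelType | mpg | engineSize |
|---|-------|------|-------|--------------|---------|----------|-----|------------|
| 0 | GT86 | 2016 | 16000 | Manual | 24089 | Petrol | 36.2 | 2.0 |
| 1 | GT86 | 2017 | 15995 | Manual | 18615 | Petrol | 36.2 | 2.0 |
| 2 | GT86 | 2015 | 13998 | Manual | 27469 | Petrol | 36.2 | 2.0 |
| 3 | GT86 | 2017 | 18998 | Manual | 14736 | Petrol | 36.2 | 2.0 |
| 4 | GT86 | 2017 | 17498 | Manual | 36284 | Petrol | 36.2 | 2.0 |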


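|      | model | year | price | transmission | mileage | fuelType | mpg | engineSize |
|------|-------|------|-------|--------------|---------|----------|-----|------------|
| 6733 | IQ | 2011 | 5500 | Automatic | 30000 | Petrol | 58.9 | 1.0 |
| 6734 | Urban Cruiser | 2011 | 4985 | Manual | 36154 | Petrol | 50.4 | 1.3 |
| 6735 | Urban Cruiser | 2012 | 4995 | Manual | 46000 | Diesel | 57.6 | 1.4 |
| 6736 | Urban Cruiser | 2011 | 3995 | Manual | 60700 | Petrol | 50.4 | 1.3 |
| 6737 | Urban Cruiser | 2011 | 4495 | Manual | 45128 | Petrol | 50.4 | 1.3 |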


#### Feature variables:

model: the Toyota model name

year: the year the car was made

transmission: the type of transmission the car has

mileage: the number of miles the vehicle has been driven

fuelType: the energy source of the vehicle

mpg: the number of miles the vehicle can travel per gallon of fuel

engineSize: the engine size in litres, i.e. the volume of fuel and air that can be pushed through the car's cylinders

#### Target Variable:

price: selling price of the car

### Check for data types and any missing values

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6738 entries, 0 to 6737
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         6738 non-null   object 
 1   year          6738 non-null   int64  
 2   price         6738 non-null   int64  
 3   transmission  6738 non-null   object 
 4   mileage       6738 non-null   int64  
 5   fuelType      6738 non-null   object 
 6   mpg           6738 non-null   float64
 7   engineSize    6738 non-null   float64
dtypes: float64(2), int64(3), object(3)
memory usage: 421.2+ KB


In [45]:
print(df.shape)

print("number of rows = ", df.shape[0])
print("number of columns = ", df.shape[1])

(6738, 8)
number of rows =  6738
number of columns =  8


## Splitting Dataset

### Separating the features and target variable

In [46]:
# Create feature and target lists
features = ['mileage', 'year', 'mpg', 'engineSize']
target = ['price']

# Create feature and target dataframes

X = df[features]
y = df[target]

# display the dataframe shapes
print("Shape of X: ", X.shape)
print("Shape of y: ", y.shape)

Shape of X:  (6738, 4)
Shape of y:  (6738, 1)


### Split the Data into Train Set and Test Set

In [47]:
# Split the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 9)

# Display Split Data shapes
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (5390, 4)
y_train shape:  (5390, 1)
X_test shape:  (1348, 4)
y_test shape:  (1348, 1)
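
The shapes above follow from `test_size = 0.2`. As a rough check, and assuming scikit-learn rounds a fractional test size up to a whole number of rows and gives the remainder to the training set, the arithmetic can be sketched as follows (the variable names are illustrative only):

```python
import math

n_samples = 6738   # rows in the dataset, from df.shape
test_size = 0.2    # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows,
# and the training set receives whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("test rows: ", n_test)
print("train rows:", n_train)
```

Fixing `random_state` only controls which rows land in each set; the set sizes depend on `test_size` alone.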


## Training Model

### Perform Linear Regression

In [48]:
# Create a model instance
model = LinearRegression()

# Fit data to model
model = model.fit(X_train, y_train)
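
`LinearRegression` fits an ordinary least squares model. As a hedged illustration of what such a fit computes, the sketch below solves a small least-squares problem with `numpy.linalg.lstsq` on synthetic, noise-free data. The data, the variable names, and the choice of solver are assumptions made for illustration, not a description of scikit-learn's internals.

```python
import numpy as np

# Synthetic data (hypothetical): y = 5 + 3*x1 - 2*x2, with no noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
true_coef = np.array([3.0, -2.0])
true_intercept = 5.0
y_demo = X_demo @ true_coef + true_intercept

# Prepend a column of ones so the intercept is estimated with the coefficients
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: minimises the sum of squared residuals ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
fitted_intercept, fitted_coef = beta[0], beta[1:]

print("intercept:   ", fitted_intercept)
print("coefficients:", fitted_coef)
```

Because the synthetic target has no noise, the recovered intercept and coefficients match the values used to generate it, up to floating-point error.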

### Linear Regression Coefficients and Intercept

In [49]:
coefficients = pd.DataFrame({'features':X.columns, 'coefficients':np.squeeze(model.coef_)})
coefficients = coefficients.sort_values(by='coefficients')
display(coefficients)

|   | features | coefficients |
|---|----------|--------------|
| 0 | mileage | -0.07774 |
| 2 | mpg | 25.249202 |
| 1 | year | 814.592829 |
| 3 | engineSize | 11410.163334 |


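A positive coefficient indicates that, with the other features held constant, the target variable increases as that feature increases.

A negative coefficient indicates that, with the other features held constant, the target variable decreases as that feature increases. Each coefficient is expressed per unit of its own feature, so the magnitudes are not directly comparable across features.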

In [50]:
c = model.intercept_
print(c)

[-1646929.7530487]
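
Together, the coefficients and the intercept define the fitted equation. As a worked example, the sketch below plugs the first row shown by `df.head()` (a 2016 GT86) into that equation by hand. It uses the rounded values printed above, so the result may differ slightly from what `model.predict` returns, and the dictionary names are illustrative only.

```python
# Values copied from the coefficient table and intercept printed above (rounded)
intercept = -1646929.7530487
coef_by_feature = {
    "mileage": -0.07774,
    "year": 814.592829,
    "mpg": 25.249202,
    "engineSize": 11410.163334,
}

# First row of the dataset as displayed by df.head()
car = {"mileage": 24089, "year": 2016, "mpg": 36.2, "engineSize": 2.0}

# price = intercept + sum(coefficient * feature value)
predicted_price = intercept + sum(
    coef_by_feature[name] * car[name] for name in coef_by_feature
)

print(round(predicted_price, 2))
```

With these rounded values the estimate comes out at roughly 17,150, against a listed price of 16,000 for that car. The large negative intercept offsets the `year` term, which is about 814.59 multiplied by a value near 2016.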


## Prediction

In [51]:
# Predict from the test set features
y_pred = model.predict(X_test)

display(y_pred) 

array([[13396.17921659],
       [ 8482.86548298],
       [10401.38376765],
       ...,
       [17890.73620753],
       [14476.04780522],
       [15804.72835936]])

In [52]:
df_predict = pd.DataFrame()

df_predict['original_price'] = y_test['price']
df_predict['predicted_price'] = y_pred
df_predict['observation'] = np.arange(0, y_test.shape[0] , 1)

display(df_predict.head())

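|      | original_price | predicted_price | observation |
|------|----------------|-----------------|-------------|
| 1725 | 10992 | 13396.179217 | 0 |
| 3877 | 7750 | 8482.865483 | 1 |
| 2879 | 10500 | 10401.383768 | 2 |
| 5699 | 13450 | 9936.530483 | 3 |
| 6542 | 5999 | 5305.315304 | 4 |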


### Prediction Error

In [53]:
#Prediction error
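# root mean squared error, taken as the square root of the MSE
# (the squared=False argument is deprecated in newer scikit-learn releases)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))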
print("root mean squared error = ", RMSE)

# mean absolute error
MAE = mean_absolute_error(y_test, y_pred)
print("mean absolute error = ", MAE)

# mean absolute percentage error
MAPE = mean_absolute_percentage_error(y_test, y_pred)
print("mean absolute percentage error = ", MAPE)

# mean squared error
MSE = mean_squared_error(y_test, y_pred)
print("mean squared error = ", MSE)

# coefficient of determination
r_squared = r2_score(y_test, y_pred) 
print("coefficient of determination = ", r_squared)

root mean squared error =  3211.5757800903634
mean absolute error =  2197.6195841654558
mean absolute percentage error =  0.1900716097138662
mean squared error =  10314218.991263026
coefficient of determination =  0.7598820768101426
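
To make the metric definitions concrete, the sketch below recomputes them by hand for the five test rows shown in `df_predict.head()`. Because it uses only five rows, its numbers will not match the full test-set results above. The variable names are illustrative and not part of scikit-learn.

```python
import math

# Five rows copied from df_predict.head()
actual = [10992, 7750, 10500, 13450, 5999]
predicted = [13396.179217, 8482.865483, 10401.383768, 9936.530483, 5305.315304]

n = len(actual)
errors = [p - a for a, p in zip(actual, predicted)]

# Mean squared error and its square root
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)

# Mean absolute error, in the same units as price
mae = sum(abs(e) for e in errors) / n

# Mean absolute percentage error, as a fraction (0.19 means 19%)
mape = sum(abs(e) / abs(a) for a, e in zip(actual, errors)) / n

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared_demo = 1 - ss_res / ss_tot

print("MSE: ", mse)
print("RMSE:", rmse)
print("MAE: ", mae)
print("MAPE:", mape)
print("R^2: ", r_squared_demo)
```

RMSE squares the errors before averaging, so it penalises large misses more heavily than MAE does, which is why it is the larger of the two in the results above.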


#### Comment:

From the results above, it can be inferred that the ML model at this point is only moderately accurate at predicting the price. Its coefficient of determination of about 0.76 means that roughly 24% of the variance in test-set prices is left unexplained. The other Key Performance Indicators (KPIs) paint a similar picture: the mean absolute error is about 2,198 and the mean absolute percentage error is about 19%. Incorporating the categorical variables (model, transmission, fuelType) could be a possible way to improve the result.
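
One common way to bring categorical columns into a linear model is one-hot encoding. The sketch below applies `pandas.get_dummies` to a handful of rows copied from the tables above. It illustrates the encoding step only, under the assumption that one-hot encoding is the approach taken; `sample`, `categorical`, and `encoded` are hypothetical names.

```python
import pandas as pd

# A few rows copied from the df.head() and df.tail() output
sample = pd.DataFrame({
    "model": ["GT86", "GT86", "IQ", "Urban Cruiser", "Urban Cruiser"],
    "year": [2016, 2017, 2011, 2011, 2012],
    "price": [16000, 15995, 5500, 4985, 4995],
    "transmission": ["Manual", "Manual", "Automatic", "Manual", "Manual"],
    "mileage": [24089, 18615, 30000, 36154, 46000],
    "fuelType": ["Petrol", "Petrol", "Petrol", "Petrol", "Diesel"],
    "mpg": [36.2, 36.2, 58.9, 50.4, 57.6],
    "engineSize": [2.0, 2.0, 1.0, 1.3, 1.4],
})

categorical = ["model", "transmission", "fuelType"]

# drop_first=True drops one level per column so the indicator columns
# are not perfectly collinear with the intercept
encoded = pd.get_dummies(sample, columns=categorical, drop_first=True)

print(encoded.columns.tolist())
```

The encoded columns (everything except `price`) could then be used as the feature matrix in place of the four numeric features alone.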

### End of Homework 1 (Module 4)