# Car Sales Price Prediction Notebook

> In this notebook, we build a model that predicts the sale price of used cars from several factors.



## Importing the libraries

Importing the Data Science and Machine Learning libraries

* Pandas as **`pd`**
* Numpy as **`np`**
* Matplotlib as **`plt`**

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Import and analyze the data

In [2]:
# Read the csv
df = pd.read_csv("car-sales.csv")

In [3]:
df.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [4]:
# Find the number of cars sold in the dataset
len(df)

1000

In [5]:
# Find info about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           951 non-null    object 
 1   Colour         950 non-null    object 
 2   Odometer (KM)  950 non-null    float64
 3   Doors          950 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 39.2+ KB


In [6]:
# Describe the numerical attributes of the dataset
df.describe()

Unnamed: 0,Odometer (KM),Doors,Price
count,950.0,950.0,950.0
mean,131253.237895,4.011579,16042.814737
std,69094.857187,0.382539,8581.695036
min,10148.0,3.0,2796.0
25%,70391.25,4.0,9529.25
50%,131821.0,4.0,14297.0
75%,192668.5,4.0,20806.25
max,249860.0,5.0,52458.0


In [7]:
# Find the different values in the "Make" column
df.Make.value_counts()

Toyota    379
Honda     292
Nissan    183
BMW        97
Name: Make, dtype: int64

In [8]:
# Find the different values in the "Colour" column
df.Colour.value_counts()

White    390
Blue     302
Black     95
Red       88
Green     75
Name: Colour, dtype: int64

## Preprocess the data

In [9]:
# Find if there is any column that has null values
df.isnull().any()

Make             True
Colour           True
Odometer (KM)    True
Doors            True
Price            True
dtype: bool

In [10]:
# Deal with the null values
df = df.dropna()
# Alternatively: df.fillna(0, inplace=True)

In [11]:
# Check whether the null values are dealt with accordingly
df.isnull().any()

Make             False
Colour           False
Odometer (KM)    False
Doors            False
Price            False
dtype: bool
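
The two options mentioned in the cell above behave quite differently. As a minimal sketch, assuming a made-up three-row frame (the values below are illustrative, not taken from `car-sales.csv`), `dropna()` discards every row that has at least one missing value, while `fillna(0)` keeps all rows and writes `0` into the gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each of two rows
toy = pd.DataFrame({
    "Odometer (KM)": [35431.0, np.nan, 84714.0],
    "Price": [15323.0, 19943.0, np.nan],
})

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: keep every row and replace missing values with 0
filled = toy.fillna(0)

print(len(dropped), len(filled))
print(filled)
```

Dropping rows shrinks the dataset, which is likely why the notebook goes from 1000 rows to 773. Filling with `0` keeps the rows, but a price or odometer reading of `0` is probably not a realistic value, so it may distort a model trained on it.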

The `Make` and `Colour` columns contain categorical data rather than numerical data. To convert the categorical data to numerical data, we use `pd.get_dummies()`, which creates one 0/1 indicator column per category.
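
To make the effect of one-hot encoding concrete, here is a small sketch on a hypothetical three-row frame. The `columns=["Make"]` and `dtype=int` arguments are additions for this illustration (the notebook calls `pd.get_dummies(df)` with defaults); they pin down which column is encoded and keep the indicators as integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Doors": [4, 5, 4],
})

# Replace "Make" with one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["Make"], dtype=int)

print(encoded)
```

Each distinct value of `Make` becomes its own column (`Make_BMW`, `Make_Honda`), and the numeric `Doors` column passes through unchanged.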

In [12]:
# Turn the Make and Colour columns to numbers
df = pd.get_dummies(df)

In [13]:
# Check the current state of the df
df.head()

Unnamed: 0,Odometer (KM),Doors,Price,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,35431.0,4.0,15323.0,0,1,0,0,0,0,0,0,1
1,192714.0,5.0,19943.0,1,0,0,0,0,1,0,0,0
2,84714.0,4.0,28343.0,0,1,0,0,0,0,0,0,1
3,154365.0,4.0,13434.0,0,0,0,1,0,0,0,0,1
4,181577.0,3.0,14043.0,0,0,1,0,0,1,0,0,0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 773 entries, 0 to 999
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Odometer (KM)  773 non-null    float64
 1   Doors          773 non-null    float64
 2   Price          773 non-null    float64
 3   Make_BMW       773 non-null    uint8  
 4   Make_Honda     773 non-null    uint8  
 5   Make_Nissan    773 non-null    uint8  
 6   Make_Toyota    773 non-null    uint8  
 7   Colour_Black   773 non-null    uint8  
 8   Colour_Blue    773 non-null    uint8  
 9   Colour_Green   773 non-null    uint8  
 10  Colour_Red     773 non-null    uint8  
 11  Colour_White   773 non-null    uint8  
dtypes: float64(3), uint8(9)
memory usage: 31.0 KB


Prepare the data for the ML model

In [15]:
X = df.drop("Price", axis=1)
y = df.Price
len(X), len(y)

(773, 773)

In [16]:
X.head()

Unnamed: 0,Odometer (KM),Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,35431.0,4.0,0,1,0,0,0,0,0,0,1
1,192714.0,5.0,1,0,0,0,0,1,0,0,0
2,84714.0,4.0,0,1,0,0,0,0,0,0,1
3,154365.0,4.0,0,0,0,1,0,0,0,0,1
4,181577.0,3.0,0,0,1,0,0,1,0,0,0


In [17]:
y.head()

0    15323.0
1    19943.0
2    28343.0
3    13434.0
4    14043.0
Name: Price, dtype: float64

In [18]:
# Import the train_test_split function from sklearn library
from sklearn.model_selection import train_test_split

# Create the training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [19]:
# Let's check the length of the training and testing datasets
len(X_train), len(y_train), len(X_test), len(y_test)

(618, 618, 155, 155)
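
Because `train_test_split` shuffles the rows randomly, each run of the notebook produces a different split, and therefore different scores further down. A minimal sketch of how to make the split repeatable, assuming synthetic arrays and an arbitrary seed of `42` (neither comes from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 10 samples with 2 features each
features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

def split(seed):
    """Return X_train, X_test, y_train, y_test for a fixed random seed."""
    return train_test_split(features, targets, test_size=0.2, random_state=seed)

first = split(42)
second = split(42)

# With the same seed, both calls produce identical partitions
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
print(len(first[0]), len(first[1]))
```

Passing `random_state` to the call in the notebook would have the same effect there; the particular seed value does not matter as long as it stays fixed.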

## Prepare Machine Learning Model

As our problem is a regression problem, we are going to use the `RandomForestRegressor()` model.



In [20]:
# Import the model from the sklearn library
from sklearn.ensemble import RandomForestRegressor

# Assign the model to a variable
model = RandomForestRegressor()

# Train the model on the training dataset
model.fit(X_train, y_train)

# Check the score of the model on the test set (for regressors, score() returns R^2)
model.score(X_test, y_test)

0.41293987160426415

In [21]:
# Make predictions on unseen data
predictions = model.predict(X_test)

# from sklearn library, import mean_absolute_error
from sklearn.metrics import mean_absolute_error

# Check the mean_absolute_error of the model's predictions
mean_absolute_error(y_test, predictions)

5720.515677419355

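A mean absolute error of about **`5720`** means that, on average, the model's predictions are about **`5720`** away from the price the car actually sold for. Individual predictions can be closer or further off than that.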
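
To see what the metric measures, the mean absolute error can be computed by hand: take the absolute difference between each actual and predicted price, then average those differences. The sketch below uses three made-up prices and predictions (not output from the model above):

```python
import numpy as np

# Hypothetical actual sale prices and model predictions
actual = np.array([15000.0, 20000.0, 10000.0])
predicted = np.array([14000.0, 23000.0, 10500.0])

# Absolute error of each individual prediction
abs_errors = np.abs(actual - predicted)

# Mean absolute error: the average of the individual absolute errors
mae = abs_errors.mean()

print(abs_errors)
print(mae)
```

Here the individual errors are 1000, 3000 and 500, while the mean is 1500, which shows that the metric summarizes the typical error rather than the error of every single guess.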