# Car Sales Exercise Notebook

> This notebook contains the solution for the Car Sales Exercise.

In this notebook, we build a model that predicts the sale price of a used car from several of its features.

## Import the libraries
Import the data science and machine learning libraries:

* Pandas as pd
* Numpy as np
* Matplotlib (`matplotlib.pyplot`) as plt
* Scikit-learn

In [1]:
# Import libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn is not imported here; we will import its functions individually when we need them

## Importing the data

Since the dataset is a CSV file, we use the `pd.read_csv()` function to read it. <br>
And since the file is hosted on GitHub, we can pass its raw URL directly.

In [2]:
# Read the csv file from github
df = pd.read_csv("https://raw.githubusercontent.com/Sayed-Husain/Introduction-to-Machine-Learning-Workshop/main/Data/Car%20Sales.csv")

# To confirm that the file was read, let's view the first 5 rows
df.head()

Unnamed: 0,Odometer (KM),Doors,Price,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,35431,4,15323,0,1,0,0,0,0,0,0,1
1,192714,5,19943,1,0,0,0,0,1,0,0,0
2,84714,4,28343,0,1,0,0,0,0,0,0,1
3,154365,4,13434,0,0,0,1,0,0,0,0,1
4,181577,3,14043,0,0,1,0,0,1,0,0,0
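
A column named `Unnamed: 0` usually appears when a DataFrame's index was written into the CSV and then read back as an ordinary column. Assuming that is what happened here (the notebook does not say), passing `index_col=0` to `pd.read_csv()` would keep it out of the feature columns. The sketch below uses a small in-memory CSV as a stand-in, not the real dataset:

```python
import io
import pandas as pd

# A tiny stand-in CSV whose first column has an empty header,
# mimicking a file that was saved with the DataFrame index included.
csv_text = ",Odometer (KM),Doors,Price\n0,35431,4,15323\n1,192714,5,19943\n"

# Default read: the unnamed first column becomes a regular column
default_df = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Reading the file this way would change the columns in `X`, so the scores reported later in this notebook would likely differ.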


In [3]:
# Check the size of the dataset
len(df)

1000

## Prepare our data

**Steps:**
1. Create X, y variables
2. Create training and testing datasets


### What are the X and y variables in this dataset?
As we are trying to predict the price of a car from its features, the y variable is going to be the `Price` column, and the X variable is going to be the entire dataset excluding the `Price` column.

In [4]:
# Create the X and y variable
y = df["Price"]
X = df.drop("Price", axis=1) # axis=1 drops a column; axis=0 would drop a row

In [5]:
# To confirm that the y is what we expect
y.head()

0    15323
1    19943
2    28343
3    13434
4    14043
Name: Price, dtype: int64

In [6]:
# To confirm that the X is what we expect
X.head()

Unnamed: 0,Odometer (KM),Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,35431,4,0,1,0,0,0,0,0,0,1
1,192714,5,1,0,0,0,0,1,0,0,0
2,84714,4,0,1,0,0,0,0,0,0,1
3,154365,4,0,0,0,1,0,0,0,0,1
4,181577,3,0,0,1,0,0,1,0,0,0


In [7]:
# Import the train_test_split function from sklearn library
from sklearn.model_selection import train_test_split

# Create the training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [8]:
# Let's check the length of the training and testing datasets
len(X_train), len(y_train), len(X_test), len(y_test)

(800, 800, 200, 200)
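
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split, and the scores later in the notebook will vary with it. A minimal sketch of making the split repeatable with the `random_state` parameter, using synthetic stand-in data (the names `X_demo` and `y_demo` and the value `42` are illustrative, not part of the original exercise):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (1000 rows, like the dataset above)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Odometer (KM)": rng.integers(10_000, 250_000, size=1000),
    "Doors": rng.integers(3, 6, size=1000),
})
y_demo = pd.Series(rng.integers(5_000, 50_000, size=1000), name="Price")

# Fixing random_state makes the shuffle, and therefore the split, repeatable
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Element 1 of the returned list is the test portion of X
same_test_rows = split_a[1].index.equals(split_b[1].index)
print(len(split_a[0]), len(split_a[1]), same_test_rows)
```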

## Make the model

Since predicting a price is a regression problem, we are going to use the `RandomForestRegressor()` model.

In [9]:
# Import the model from the sklearn library 
from sklearn.ensemble import RandomForestRegressor

# Assign the model to a variable
model = RandomForestRegressor()

# Train the model on the training dataset
model.fit(X_train, y_train)

# Let's check the score of the model (for regressors, score() returns R^2)
model.score(X_test, y_test)

0.22528981074940613
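
For scikit-learn regressors, `score()` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A value of about 0.23 means the model accounts for roughly 23% of the variance in the test prices. The sketch below computes R^2 by hand on made-up numbers and compares it with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```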

In [10]:
# Make predictions on unseen data
predictions = model.predict(X_test)

# from sklearn library, import mean_absolute_error
from sklearn.metrics import mean_absolute_error

# Check the mean_absolute_error of the model's predictions
mean_absolute_error(y_test, predictions)

6548.59635
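
The mean absolute error is expressed in the same units as `Price`, so the predictions are off by about 6,549 on average. One way to judge whether that is large is to compare it with a baseline that always predicts the mean training price. A minimal sketch with made-up prices, using scikit-learn's `DummyRegressor` (this baseline is an addition for illustration, not part of the original exercise):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only
y_train_demo = np.array([10_000, 14_000, 18_000, 22_000])
y_test_demo = np.array([12_000, 20_000])

# DummyRegressor ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
baseline_preds = baseline.predict(np.zeros((len(y_test_demo), 1)))

baseline_mae = mean_absolute_error(y_test_demo, baseline_preds)
print(baseline_mae)
```

If a real model's error is not clearly below such a baseline, the features are carrying little usable information about the target.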

## The model is performing poorly, why?

With an R^2 of about 0.23 and a mean absolute error of about 6,549, the model's performance is poor, but why is that?
There could be many reasons, including but not limited to:
* **The dataset is poor** - The dataset only contains the make, colour, number of doors, and odometer reading of each car, and a used car's price definitely depends on many more factors than those.
* **Overfitting** - The model might be fitting the training dataset too closely, resulting in predictions that generalize poorly to unseen data.
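
A common way to check for overfitting is to compare the model's score on the training data with its score on the test data: a large gap suggests the model has memorized the training set. The sketch below shows the check on synthetic data with a weak signal and heavy noise (the data and the names `X_demo`, `y_demo`, and `forest` are illustrative assumptions, not the car sales dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: a weak signal buried in a lot of noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=3.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Compare R^2 on the data the model was trained on vs. held-out data
train_r2 = forest.score(X_tr, y_tr)
test_r2 = forest.score(X_te, y_te)
print(f"train R^2: {train_r2:.2f}, test R^2: {test_r2:.2f}")
```

Running the same comparison in this notebook with `model.score(X_train, y_train)` and `model.score(X_test, y_test)` would show whether overfitting is part of the problem here.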