***
<b> Author:</b> Raghavendra Tapas
    
<b> Updated on:</b> May 2021
    
<b> Context:</b> Predicting Car Price (General Workflow of a Machine Learning Project)

Feel free to reach out to me on __[Twitter](https://twitter.com/raghutapas12)__ for any corrections or additional updates!
***

## Dependencies or Libraries used

Import NumPy, pandas, and Matplotlib.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing the dataset(s)

In [2]:
car_sales = pd.read_csv("data/car-sales.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [3]:
car_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Make           1000 non-null   object
 1   Colour         1000 non-null   object
 2   Odometer (KM)  1000 non-null   int64 
 3   Doors          1000 non-null   int64 
 4   Price          1000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 39.2+ KB


In [4]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [5]:
len(car_sales)

1000

# Data Pre-Processing

1. Split the data into features and labels (usually `x` and `y`).
2. Fill (impute) or discard missing values and outliers.
3. Convert non-numerical values to numerical values (feature encoding).

In [6]:
# Create Feature Matrix (x) by excluding price column
x = car_sales.drop("Price", axis = 1)

# Create Labels (y)
y = car_sales["Price"]

# Transforming and Cleaning the Data

In [7]:
x.head(3)

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4


In [8]:
y.head(3)

0    15323
1    19943
2    28343
Name: Price, dtype: int64

In [9]:
# Import the scikit-learn model and the train/test split helper
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

In [11]:
# Check if there are any missing values

car_sales.isna().sum()


Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

***
No missing values were found. If there were any, we would need to fill them with placeholder values or drop the affected rows.

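<b>Option 1:</b> Fill missing data with <u>pandas</u>.
<code>
    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which may not update car_sales in recent pandas versions
    car_sales["Make"] = car_sales["Make"].fillna("Missing")
    car_sales["Colour"] = car_sales["Colour"].fillna("Missing")
    car_sales["Doors"] = car_sales["Doors"].fillna(4)
    # Drop rows that still contain missing values (such as a missing Price)
    car_sales.dropna(inplace=True)
</code>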
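As a rough illustration of Option 1, the sketch below applies the same idea to a tiny made-up DataFrame. The rows, the `toy` name, and the choice to fill `Odometer (KM)` with its mean are assumptions for demonstration only; they are not part of the car sales dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in every column
toy = pd.DataFrame({
    "Make": ["Honda", None, "Toyota", "BMW"],
    "Colour": ["White", "Blue", None, "Red"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 11179.0],
    "Doors": [4.0, np.nan, 4.0, 5.0],
    "Price": [15323.0, 19943.0, 13434.0, np.nan],
})

# Fill each feature column with its own placeholder in a single call
filled = toy.fillna({
    "Make": "Missing",
    "Colour": "Missing",
    "Doors": 4,
    "Odometer (KM)": toy["Odometer (KM)"].mean(),
})

# A row without a label cannot be used for training, so drop it
filled = filled.dropna(subset=["Price"])

print(filled)
print(filled.isna().sum())
```

Passing a dictionary to `fillna` keeps all fill rules in one place and avoids modifying a column selection in place.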

<b>Option 2:</b> Fill missing data with <u>scikit-learn</u>.
<code>
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    
    # Remove rows with missing Price values first, then re-create x and y
    # so that features and labels stay aligned
    car_sales.dropna(subset=["Price"], inplace = True)
    x = car_sales.drop("Price", axis = 1)
    y = car_sales["Price"]
    
    # Fill categorical values with "missing", Doors with 4 and other numerical values with the mean
    cat_imputer = SimpleImputer(strategy = "constant", fill_value = "missing")
    door_imputer = SimpleImputer(strategy = "constant", fill_value = 4)
    num_imputer = SimpleImputer(strategy = "mean")
    
    # Define Columns
    cat_features = ["Make", "Colour"]
    door_features = ["Doors"]
    num_features = ["Odometer (KM)"]
    
    # Create an imputer (something that fills missing data)
    imputer = ColumnTransformer([
        ("cat_imputer", cat_imputer, cat_features),
        ("door_imputer", door_imputer, door_features),
        ("num_imputer", num_imputer, num_features)
    ])
    
    # Transform the data (the result is a NumPy array whose columns follow the transformer order)
    filled_x = imputer.fit_transform(x)
</code>

***
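To see what the scikit-learn imputer from Option 2 returns, here is a minimal sketch on a made-up three-row feature table. The `toy_x` data is hypothetical. Note that the output columns follow the order of the transformers (`Make`, `Colour`, `Doors`, `Odometer (KM)`), not the original column order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy features with missing entries
toy_x = pd.DataFrame({
    "Make": ["Honda", np.nan, "Toyota"],
    "Colour": ["White", "Blue", np.nan],
    "Odometer (KM)": [35431.0, 192714.0, np.nan],
    "Doors": [4.0, np.nan, 4.0],
})

imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make", "Colour"]),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

# fit_transform returns a plain array, so the column names are restored by hand
filled = imputer.fit_transform(toy_x)
filled_x = pd.DataFrame(filled, columns=["Make", "Colour", "Doors", "Odometer (KM)"])

print(filled_x)
```

Wrapping the result back into a DataFrame makes it easier to check that every column was filled as intended.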

In [12]:
# Convert the categorical data into numbers.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder = "passthrough")

transformed_x = transformer.fit_transform(x)

In [13]:
pd.DataFrame(transformed_x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


# Choose the Model

<b>Model:</b>

To choose the right model, refer to the __[Scikit Machine Learning Maps](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)__.

A Random Forest Regressor is used here because this is a regression problem with several input variables: we try to predict the fifth variable (`Price`) from the other four.


<b>Split into training and test sets</b>
    
* `x_train` - The independent variables (features) used to train the model. Because `test_size = 0.4` is specified, 60% of the observations are used to train/fit the model and the remaining 40% are used to test it.

* `x_test` - The remaining 40% of the features. They are not used during training; the model makes predictions on them so that its performance can be evaluated.

* `y_train` - The dependent variable (label) the model learns to predict, here the `Price` of each car in `x_train`. It must be supplied when training/fitting the model.

* `y_test` - The actual `Price` values for the test data. They are compared with the model's predictions to evaluate the model.
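The 60/40 split described above can be checked on a small example. This is a sketch with made-up arrays (`features`, `targets`) and an explicit `random_state`, which is an alternative to seeding NumPy globally.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 features each
features = np.arange(20).reshape(10, 2)
targets = np.arange(10) * 1000

x_train, x_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.4, random_state=1
)

# 60% of the rows go to training, 40% to testing
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

The rows are shuffled before splitting, and each row of `x_train` stays paired with the matching entry of `y_train`.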


In [14]:
# Split the data, then fit the model
np.random.seed(1)
x_train, x_test, y_train, y_test = train_test_split(transformed_x, y, test_size = 0.4)

model = RandomForestRegressor()
model.fit(x_train, y_train)

RandomForestRegressor()

In [15]:
model.score(x_test, y_test)

0.24879325203937042

In [16]:
# Conclusion: No firm conclusion can be drawn yet. With a test score (R^2) of about 0.25, the model is not strong enough. We need to look into other evaluation metrics.
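
For a regressor, `model.score` returns the coefficient of determination (R^2). As a starting point for looking at evaluation metrics, the sketch below computes R^2 by hand and compares it with `r2_score`, alongside the mean absolute error and mean squared error. The `y_true` and `y_pred` values are made up for illustration and are not the model's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

print(r2_manual, r2, mae, mse)
```

An R^2 of 1.0 means perfect predictions, while a value near 0 means the model does little better than always predicting the mean of the target. MAE and MSE are expressed in the units of the target (and its square), which can make them easier to interpret for prices.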