In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 1.2) Dealing with missing values in data

1) Fill them with some value (also known as imputation)

2) Remove the samples with missing data altogether

Both approaches are valid.
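As a quick illustration of the two options, here is a minimal sketch on a made-up DataFrame. The name `toy_df` and its values are assumptions for demonstration only, not the car sales data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the car sales dataset)
toy_df = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Doors": [4.0, 4.0, np.nan, 3.0],
})

# Option 1: imputation, fill each column with a chosen value
filled_df = toy_df.fillna({"Make": "missing", "Doors": 4.0})

# Option 2: remove every row that has at least one missing value
dropped_df = toy_df.dropna()

# Imputation keeps all rows, dropping keeps only the complete ones
print(len(filled_df), len(dropped_df))
```

Filling keeps every sample at the cost of introducing made-up values, while dropping keeps only real values at the cost of fewer samples.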

In [2]:
car_df = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_df.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [3]:
# How many missing values are there in each column
car_df.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [4]:
from sklearn.model_selection import train_test_split

# Split into X & y and train/test set
X = car_df.drop("Price", axis=1)
y = car_df["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [5]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 4), (200, 4), (800,), (200,))

In [6]:
# We can't turn the categories into numbers yet because of the presence of missing values

### Option 1: - Filling missing data with Pandas

In [7]:
# Fill the columns one at a time, following the column order

# String columns must be filled with strings
# Numerical columns can be filled with the mean, median, mode or another number
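The rule in the comments above can be sketched as follows. The toy DataFrame and the `fill_values` dictionary are assumptions for illustration; which statistic suits which column is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the car sales dataset)
toy_df = pd.DataFrame({
    "Colour": ["White", "Blue", None, "White"],
    "Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
})

# One fill value per column, chosen to match the column's type
fill_values = {
    "Colour": "missing",                              # string column -> string
    "Odometer (KM)": toy_df["Odometer (KM)"].mean(),  # continuous number -> mean
    "Doors": toy_df["Doors"].mode()[0],               # discrete number -> most frequent value
}

toy_filled = toy_df.fillna(fill_values)
print(toy_filled)
```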

In [8]:
# Find out how many doors are there of each number
car_df["Doors"].value_counts()

4.0    811
5.0     75
3.0     64
Name: Doors, dtype: int64

In [9]:
# Filling the missing values (cells)
# Assign the result back to the column: calling fillna(inplace=True) on a
# selected column is deprecated in recent pandas and may not update car_df

# Fill the "Make" column
car_df["Make"] = car_df["Make"].fillna("missing")

# Fill the "Colour" column
car_df["Colour"] = car_df["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column
car_df["Odometer (KM)"] = car_df["Odometer (KM)"].fillna(car_df["Odometer (KM)"].mean())

# Fill the "Doors" column
car_df["Doors"] = car_df["Doors"].fillna(4) # 4 is the most common number of doors

In [10]:
# Check our dataframe for missing values again
car_df.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [11]:
# Remove rows with missing data
# missing price values in this case
car_df.dropna(inplace=True)

In [12]:
car_df.isna().sum() # Perfect - no missing values in dataframe

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [13]:
# How many rows remain
len(car_df)

950

In [14]:
# Split into X & y and train/test set
X = car_df.drop("Price", axis=1)
y = car_df["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [15]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((760, 4), (190, 4), (760,), (190,))

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer 

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([
    ("one_hot",one_hot,categorical_features)],
    remainder="passthrough")

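# Note: car_df still includes "Price", so it passes through as the last column below
# For modelling, transform the features only (X) instead
transformed_X = transformer.fit_transform(car_df)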
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [17]:
# -------------------------------------------------------

### Option 2: - Filling missing data and transforming categorical data with Scikit-Learn

SimpleImputer() transforms data by filling missing values with a given strategy.

The main takeaways:

* Split your data first (into train/test), always keep your training & test data separate

* Fill/transform the training and test sets separately (this goes for filling data with pandas as well)

* Don't use data from the future (test set) to fill data from the past (training set)
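The takeaways above also apply when filling with pandas. A minimal sketch, assuming a made-up single-column dataset (`toy_X`, `toy_y` and `train_mean` are hypothetical names): the fill value is computed from the training rows only and then reused for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not the car sales dataset)
toy_X = pd.DataFrame({"Odometer (KM)": [10000.0, 20000.0, np.nan, 40000.0, 50000.0,
                                        np.nan, 70000.0, 80000.0, 90000.0, 100000.0]})
toy_y = pd.Series(range(10))

# 1. Split first
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

# 2. Learn the fill value from the training set only
train_mean = toy_X_train["Odometer (KM)"].mean()

# 3. Fill both sets with the value learned from the training set
toy_X_train = toy_X_train.fillna({"Odometer (KM)": train_mean})
toy_X_test = toy_X_test.fillna({"Odometer (KM)": train_mean})
```

Computing the mean on the full dataset instead would let information from the test rows leak into the training data.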

The video shows filling and transforming the entire dataset (X). Although the techniques are correct, it's best to fill and transform the training and test sets separately.

Hence, the code below has been corrected to do so.

In [2]:
car_df2 = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_df2.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [3]:
car_df2.isna().sum() # See the number of missing values in each column

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [4]:
# Drop the rows with no labels (the rows that have no Price value)
car_df2.dropna(subset=["Price"],inplace=True)
car_df2.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

Next, the data is split into train and test sets before any missing values are filled or any transformations take place.

In [20]:
from sklearn.model_selection import train_test_split

# Split into X & y
X = car_df2.drop("Price", axis=1)
y = car_df2["Price"]

# Split data into train and test
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [21]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((760, 4), (190, 4), (760,), (190,))

In [22]:
# Check missing values in X (covers both the train and test rows)
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

Let's fill the missing values. 
We'll fill the training and test values separately to ensure training data stays with the training data and test data stays with the test data.

Note: We use fit_transform() on the training data and transform() on the testing data. In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). Then we take those same patterns and fill the test set (transform only).
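To make the difference between fit_transform() and transform() concrete, here is a small sketch on made-up arrays (`toy_train`, `toy_test` and `toy_imputer` are hypothetical names). The mean is learned from the training array and reused for the test array, even though the test array's own mean would be very different.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays with one numerical column each
toy_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
toy_test = np.array([[np.nan], [100.0]])

toy_imputer = SimpleImputer(strategy="mean")

# fit_transform: learn the mean of the training column (3.0), then fill the training data
toy_train_filled = toy_imputer.fit_transform(toy_train)

# transform: reuse the mean learned from the training data to fill the test data
toy_test_filled = toy_imputer.transform(toy_test)

# The learned fill value for each column is stored on the fitted imputer
print(toy_imputer.statistics_)
```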

In [23]:
# Fill missing values with Scikit-Learn

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing" and numerical values with mean
# Imputation --> Find the missing values and fill them
# Define some imputers

# strategy="constant" fills every missing value in the given columns with fill_value
# (here the string "missing" or the number 4); strategy="mean" fills with the column mean

cat_imputer = SimpleImputer(strategy="constant",fill_value="missing")
door_imputer = SimpleImputer(strategy="constant",fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_features = ["Make","Colour"] # category column
door_features = ["Doors"] # a category column
num_features = ["Odometer (KM)"] # numerical column

# Create an imputer (something that fills missing data)
# ColumnTransformer takes a list of tuples, one tuple per transformation to apply
imputer = ColumnTransformer([
    # name, imputer to use, features on which to use the imputer
    ("cat_imputer",cat_imputer,cat_features),
    ("door_imputer",door_imputer,door_features),
    ("num_imputer",num_imputer,num_features)
])

# Transform data
# Fill train and test values separately
filled_X_train = imputer.fit_transform(X_train) # fit_transform learns the fill values from the training set and fills it
filled_X_test = imputer.transform(X_test) # transform fills the test set with the values learned from the training set

# Check filled X_train
filled_X_train

array([['Honda', 'White', 4.0, 71934.0],
       ['Toyota', 'Red', 4.0, 162665.0],
       ['Honda', 'White', 4.0, 42844.0],
       ...,
       ['Toyota', 'White', 4.0, 196225.0],
       ['Honda', 'Blue', 4.0, 133117.0],
       ['Honda', 'missing', 4.0, 150582.0]], dtype=object)

Now that we've filled our missing values, let's check how many are still missing from each set.

In [24]:
# Get our transformed data arrays back into DataFrames
car_sales_filled_train = pd.DataFrame(filled_X_train, 
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_test = pd.DataFrame(filled_X_test, 
                                     columns=["Make", "Colour", "Doors", "Odometer (KM)"])



In [25]:
# Check missing data in training set
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [26]:
# Check missing data in test set
car_sales_filled_test.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

### Now that there are no missing values, convert the data into numbers (numerical data)

We'll do this using one-hot encoding, again keeping our training and test data separate.
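One thing to watch for when encoding the two sets separately is a category that appears only in the test data. The sketch below uses made-up data and assumes `handle_unknown="ignore"` is an acceptable choice; the notebook's own encoder uses the default setting, which raises an error in that situation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "Green" never appears in the training set
toy_train = pd.DataFrame({"Colour": ["White", "Blue", "White"]})
toy_test = pd.DataFrame({"Colour": ["Blue", "Green"]})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising an error
toy_encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from the training set, then encode it
toy_train_encoded = toy_encoder.fit_transform(toy_train).toarray()

# Encode the test set with the categories learned from the training set
toy_test_encoded = toy_encoder.transform(toy_test).toarray()

print(toy_encoder.categories_)
print(toy_test_encoded)
```

Both encoded arrays end up with the same number of columns, which is what the model needs.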

In [27]:
# Import OneHotEncoder class from sklearn

# Turn the categories (Make and Colour) into numbers, as well as Door
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer 

# Now let's one hot encode the features

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([
    ("one_hot",one_hot,categorical_features)],
    remainder="passthrough")

# Encode train and test values separately
transformed_X_train = transformer.fit_transform(car_sales_filled_train) # fit and transform the training data
transformed_X_test = transformer.transform(car_sales_filled_test) # transform the test data

In [28]:
# Check transformed and filled X_train
transformed_X_train.toarray()

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 7.19340e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.62665e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 4.28440e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.96225e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.33117e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.50582e+05]])

In [29]:
# ------------------------------------------------------

### Fit a model

Now that we've filled and transformed our data, ensuring the training and test sets have been kept separate, let's fit a model to the training set and evaluate it on the test set.
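For a regressor, the score() method returns the coefficient of determination (R^2) of the predictions. The sketch below checks this on synthetic data; all `toy_` names and the data itself are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
toy_X = rng.random((200, 2))
toy_y = 3 * toy_X[:, 0] + rng.normal(scale=0.1, size=200)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

# Fit on the training set only
toy_model = RandomForestRegressor(n_estimators=20, random_state=0)
toy_model.fit(toy_X_train, toy_y_train)

# Evaluate on the test set: score() and r2_score() give the same number
toy_score = toy_model.score(toy_X_test, toy_y_test)
toy_r2 = r2_score(toy_y_test, toy_model.predict(toy_X_test))
print(toy_score, toy_r2)
```

A score of 1.0 means perfect predictions, and a score near 0 means the model does little better than always predicting the mean of the target.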

In [31]:
# Now, we've got our data as numbers and filled (no missing values)
# The process of filling missing values --> imputation
# The process of converting non-numerical values to numerical --> feature engineering or feature encoding

# Lets fit a model

In [30]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

# Setup model
model = RandomForestRegressor()

# Make sure to use transformed (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.21229043336119102

Fitting the model on the data involves passing it the data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

### Use the model to make a prediction

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once the model instance is trained, you can use the predict() method to predict a target value given a set of features. In other words, use the model along with some unlabelled data to predict the label.

Note: the data you predict on has to have the same number of features, prepared in the same way, as the data you trained on.
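A minimal sketch of predict() on synthetic data follows. The `toy_` names and the data are assumptions for illustration; the point is that the new samples have the same number of columns as the training data, while the number of rows can differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic training data: 50 samples, 3 features
rng = np.random.default_rng(42)
toy_X_train = rng.random((50, 3))
toy_y_train = toy_X_train.sum(axis=1)

toy_model = RandomForestRegressor(n_estimators=10, random_state=42)
toy_model.fit(toy_X_train, toy_y_train)

# New, unlabelled samples: any number of rows, but the same 3 features
toy_new = rng.random((5, 3))
toy_preds = toy_model.predict(toy_new)

# One prediction per new sample
print(toy_preds.shape)
```

Passing an array with a different number of columns to predict() raises an error, which is why the test set must go through the same imputer and encoder as the training set.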

In [None]:
# -------------------------------------------------------