In [72]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### We will learn how to put it all together later

For now, we will dive into separate sections and focus on them individually.

## Note

We will learn many methods (ways/options) to transform data, clean data, fill in missing values, split data into train and test sets, and so on.

Here, we will use one method (the most useful one) among the many available to us for each process.

# 1) Getting the data ready

Main steps to take:

1. Splitting the data into features (usually X) and labels (usually y)
2. Filling (also called imputation) or disregarding missing values
3. Converting non-numerical values to numerical values (also called feature encoding)
4. Making sure all of your numerical data is on the same scale (feature scaling, by either normalization or standardization)

## 1.1) Splitting the data into training, validation and test sets

## 1.2) Dealing with missing values in data

1. Fill them with some value (also known as imputation), using fillna(), OR
2. Remove the rows with missing data, using dropna()

Both methods are valid.

Missing values are labeled as NaN in the dataset.
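As a minimal sketch of the two options, here is how they might look on a small made-up DataFrame. The `toy_df` name and its values are illustrative assumptions, not rows from the car sales file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the car sales CSV
toy_df = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Doors": [4.0, 4.0, np.nan, 5.0],
})

# Option 1: imputation, filling each column with a value of the right type
filled = toy_df.fillna({"Make": "missing", "Doors": 4.0})

# Option 2: drop every row that contains at least one missing value
dropped = toy_df.dropna()

print(filled.isna().sum().sum())  # total number of missing values left
print(len(dropped))               # number of rows left after dropping
```

Filling keeps every row at the cost of introducing made-up values, while dropping keeps only real values at the cost of losing rows.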

In [73]:
car_df = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_df.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In a DataFrame, axis=1 is the column axis and axis=0 is the row axis.
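A short sketch of what the axis argument changes, using `drop()` on a made-up two-row DataFrame (the `toy_df` name and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"Make": ["Honda", "BMW"], "Price": [15323.0, 19943.0]})

without_price = toy_df.drop("Price", axis=1)  # axis=1: drop a column by its label
without_first = toy_df.drop(0, axis=0)        # axis=0: drop a row by its index label

print(list(without_price.columns))
print(list(without_first.index))
```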

Important principle: in ML, never evaluate or test your models on data they have learned from. That's why we split the data.

* Build an ML model that trains on the training data and predicts on the test data

In [74]:
# Check datatype of the columns
car_df.dtypes

# If the Price column were of type object, convert it in two steps:
# 1) object --> string, then 2) string --> int
# (see the matplotlib notes in my notebook)

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object
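If a price column did load as `object` (for example because of currency symbols), the conversion might look like the sketch below. The `raw_prices` values and their `$4,000.00` format are assumptions for illustration; in the file used here, `Price` already loads as `float64`.

```python
import pandas as pd

# Hypothetical prices stored as strings
raw_prices = pd.Series(["$4,000.00", "$5,000.00", "$7,500.00"])

# Strip the currency symbol and thousands separators, then convert to a number
clean_prices = raw_prices.str.replace(r"[$,]", "", regex=True).astype(float)

print(clean_prices.dtype)
```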

In [75]:
# Check how many rows are in the dataframe
len(car_df)

1000

In [76]:
# Check how many missing values there are in each column
car_df.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

### IMPORTANT POINTS TO REMEMBER

* Split your data first (into train/validation/test sets), and always keep your training, validation and test data separate
* Fill/transform the training, validation and test sets separately
* Don't use data from the future (test set) to fill data from the past (training set)

In [77]:
# Drop the rows with no labels, i.e. rows where the target value is missing

# e.g. let's drop rows with missing Price values
car_df.dropna(subset=["Price"], inplace=True)

In [78]:
# Check for missing values
car_df.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

### Split data into training, validation and test sets

The data is split into train, validation and test sets before any missing values are filled or any transformations are applied.

In [79]:
# Split into X & y (DataFrame variables)

# X --> all columns except Price
# y --> only the Price column

X = car_df.drop("Price",axis=1)
y = car_df["Price"]

In [80]:
# random_state -> controls the shuffling applied to the data before the split
# test_size -> proportion of the dataset to include in the test split

# train_test_split returns 4 values: X_train, X_test, y_train, y_test

In [81]:
# Split data into train, validation and test set

# e.g. predefined proportions such as (75, 15, 10)

from sklearn.model_selection import train_test_split

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1 - train_ratio)
# test_size = 1 - 0.75 = 0.25, so the training set is 75% of the entire dataset

# The remaining 25% is held in X_test / y_test for now

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))
# test_size = 0.10 / (0.10 + 0.15) = 0.4, and 0.4 x 25% = 10%
# The test set is now 10% of the initial dataset
# The validation set is now 15% of the initial dataset

In [82]:
X_train, X_val, X_test

(       Make Colour  Odometer (KM)  Doors
 941  Toyota    Red       166046.0    4.0
 482  Nissan  White        51004.0    4.0
 28    Honda  White        56687.0    4.0
 227     BMW   Blue        79301.0    5.0
 203  Toyota   Blue        99761.0    4.0
 ..      ...    ...            ...    ...
 202   Honda   Blue        84719.0    4.0
 737  Toyota   Blue       223875.0    4.0
 109   Honda   Blue       219217.0    4.0
 331  Toyota  White       112292.0    4.0
 425  Toyota   Blue        42480.0    4.0
 
 [712 rows x 4 columns],
        Make Colour  Odometer (KM)  Doors
 810  Nissan  Green       229929.0    4.0
 759  Nissan    Red       113987.0    4.0
 184  Nissan  Green        32754.0    4.0
 838  Nissan  Green       235589.0    4.0
 597  Nissan  White        20315.0    4.0
 ..      ...    ...            ...    ...
 285  Toyota  Green        44436.0    4.0
 509  Nissan  White        26634.0    4.0
 962   Honda   Blue        96308.0    4.0
 972   Honda  White            NaN    4.0
 442  N

In [83]:
len(X_train),len(X_val), len(X_test)

(712, 142, 96)
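To check the arithmetic behind these sizes, here is a minimal sketch that repeats the two-step split on stand-in data with 950 rows (the number left after dropping rows without a price). The `X_demo`/`y_demo` arrays and the `random_state=42` argument are assumptions added for reproducibility; the original call does not set `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 950 rows (values are arbitrary)
X_demo = np.arange(950).reshape(-1, 1)
y_demo = np.arange(950)

train_ratio, validation_ratio, test_ratio = 0.75, 0.15, 0.10

# First split: 75% train, 25% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=1 - train_ratio, random_state=42)

# Second split: 40% of the held-out 25% becomes the test set (10% overall)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold,
    test_size=test_ratio / (test_ratio + validation_ratio),
    random_state=42)

print(len(X_tr), len(X_va), len(X_te))
```

The sizes are not exactly 75/15/10 percent of 950 because each test portion has to be a whole number of rows.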

In [84]:
# Check for missing values in train, validation and test set - How many are there?

In [85]:
X_train.isna().sum()

Make             39
Colour           34
Odometer (KM)    34
Doors            42
dtype: int64

In [86]:
X_val.isna().sum()

Make              6
Colour            8
Odometer (KM)    10
Doors             2
dtype: int64

In [87]:
X_test.isna().sum()

Make             2
Colour           4
Odometer (KM)    4
Doors            3
dtype: int64

Let's fill the missing values. We'll fill the training, validation and test values separately to ensure training data stays with the training data and test data stays with the test data.

### Option 2: - Filling missing data with Scikit-Learn and transforming categorical data with Scikit-Learn

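scikit-learn provides a class called SimpleImputer.

SimpleImputer transforms data by filling missing values according to a given strategy.

Keep track of the column order: the ColumnTransformer used below outputs columns in the order its transformers are listed, not in the original DataFrame order.

* String columns must be filled with strings
* Numerical columns can be filled with the mean, median, mode or another number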

In [88]:
# Find out how many doors are there of each number
X_train["Doors"].value_counts()

4.0    569
5.0     53
3.0     48
Name: Doors, dtype: int64

inplace=True modifies the original DataFrame directly (no need to assign the result to a new variable).

Note: We use fit_transform() on the training data and transform() on any held-out data, such as the test set. In essence, we learn the fill values from the training set and fill it via imputation (fit, then transform). Then we take those same fill values and use them to fill the held-out data (transform only).

### Methods to know

* fit(X) - Learn the parameters from the data (for an imputer, the fill value of each column, such as its mean).

* fit_transform(X) - Fit to the data, then transform it.

* transform(X) - Apply the parameters learned during fit to X (for an imputer, fill its missing values).
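The difference between these methods can be sketched with a mean imputer on a single made-up column. The `train_col` and `test_col` arrays are assumptions for illustration; `statistics_` is the attribute where SimpleImputer stores the fill values learned during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single numerical column, already split
train_col = np.array([[10.0], [20.0], [np.nan], [30.0]])
test_col = np.array([[np.nan], [100.0]])

mean_imputer = SimpleImputer(strategy="mean")

# fit_transform: learn the training mean, then fill the training set with it
train_filled = mean_imputer.fit_transform(train_col)

# transform: reuse the training mean on unseen data, without refitting
test_filled = mean_imputer.transform(test_col)

print(mean_imputer.statistics_)  # the fill value learned during fit
print(test_filled[0, 0])         # the test gap is filled with the training mean
```

Calling fit_transform() on the test column instead would fill the gap with the test set's own mean, which is exactly the kind of leak the note above warns against.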

In [89]:
# Fill missing values with Scikit-Learn

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing" and numerical values with the mean
# Imputation --> find the missing values and fill them
# Define some imputers

# strategy="constant" --> replace every missing value in the selected columns
# with the same fixed value given by fill_value (here the string "missing" or the number 4)

cat_imputer = SimpleImputer(strategy="constant",fill_value="missing")
door_imputer = SimpleImputer(strategy="constant",fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_features = ["Make","Colour"] # category column
door_features = ["Doors"] # a category column
num_features = ["Odometer (KM)"] # numerical column

# Create an imputer (something that fills missing data)
# by passing in all the different transformations to apply

# ColumnTransformer takes a list of tuples, one tuple per transformer
imputer = ColumnTransformer([
    # name, imputer to use, features on which to use the imputer
    ("cat_imputer",cat_imputer,cat_features),
    ("door_imputer",door_imputer,door_features),
    ("num_imputer",num_imputer,num_features)
])

# Transform data

# Fill train, validation and test values separately
filled_X_train = imputer.fit_transform(X_train) # fit_transform learns the fill values from the training set and fills it in one step
filled_X_val = imputer.transform(X_val) # transform reuses the fill values learned from the training set to fill the validation set
filled_X_test = imputer.transform(X_test) # transform reuses the fill values learned from the training set to fill the test set


In [90]:
# Check filled X_train
filled_X_train

array([['Toyota', 'Red', 4.0, 166046.0],
       ['Nissan', 'White', 4.0, 51004.0],
       ['Honda', 'White', 4.0, 56687.0],
       ...,
       ['Honda', 'Blue', 4.0, 219217.0],
       ['Toyota', 'White', 4.0, 112292.0],
       ['Toyota', 'Blue', 4.0, 42480.0]], dtype=object)

Now that we've filled our missing values, let's check that none remain in each set.

In [91]:
# Turn the filled training, validation and test arrays back into DataFrames

car_sales_filled_train = pd.DataFrame(filled_X_train, 
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])


car_sales_filled_val = pd.DataFrame(filled_X_val, 
                                     columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_test = pd.DataFrame(filled_X_test, 
                                     columns=["Make", "Colour", "Doors", "Odometer (KM)"])

In [92]:
# Check missing data in training set
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [93]:
# Check missing data in test set
car_sales_filled_test.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [94]:
# Check missing data in validation set
car_sales_filled_val.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [106]:
# Check how many rows remain in each set
len(car_sales_filled_train), len(car_sales_filled_val), len(car_sales_filled_test)

(712, 142, 96)

### Now that there are no missing values

### Convert the data into numbers (numerical data)

We do this using one-hot encoding, since an ML model cannot deal with strings, only numerical data.

Again, we keep our training and test data separate.
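As a minimal sketch of what one-hot encoding produces, consider a single made-up colour column. The `train_colours` and `test_colours` DataFrames are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical colour column, already split
train_colours = pd.DataFrame({"Colour": ["Red", "Blue", "White", "Blue"]})
test_colours = pd.DataFrame({"Colour": ["White", "Red"]})

encoder = OneHotEncoder()

# Learn the categories from the training data only
train_encoded = encoder.fit_transform(train_colours).toarray()

# Encode the test data using those same categories and column order
test_encoded = encoder.transform(test_colours).toarray()

print(encoder.categories_)  # one sorted list of categories per input column
print(test_encoded)         # one column per category, a single 1 per row
```

Each category becomes its own 0/1 column, so the three colours here turn into three columns.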

In [95]:
# Turn the categories (Make and Colour) into numbers, as well as Doors

# remainder="passthrough" --> keep all columns not listed in the transformer unchanged instead of dropping them

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Now let's one hot encode the features
categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([
    ("one_hot",one_hot,categorical_features)],
    remainder="passthrough")

# Encode train, validation and test values separately
transformed_X_train = transformer.fit_transform(car_sales_filled_train) # fit and transform the training data
transformed_X_val = transformer.transform(car_sales_filled_val) # transform the validation data with the encoder fitted on the training data
transformed_X_test = transformer.transform(car_sales_filled_test) # transform the test data

In [96]:
# Check transformed and filled X_train
transformed_X_train.toarray()

array([[0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.66046e+05],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 5.10040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 5.66870e+04],
       ...,
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.19217e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.12292e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 4.24800e+04]])

In [97]:
# Check transformed and filled X_val
transformed_X_val.toarray()

In [98]:
# Check transformed and filled X_test
transformed_X_test.toarray()

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.38609e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 4.86840e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 1.63322e+05],
       ...,
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.77880e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.97616e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.36392e+05]])

### Feature Scaling (check it out later if needed)
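For reference, a minimal sketch of the two scaling approaches named earlier (normalization and standardization), using scikit-learn's MinMaxScaler and StandardScaler on made-up odometer readings. The `train_km` and `test_km` arrays are assumptions for illustration; as with imputation, the scalers are fitted on the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical odometer readings, already split
train_km = np.array([[10_000.0], [20_000.0], [30_000.0]])
test_km = np.array([[40_000.0]])

# Normalization: rescale so the training values span 0 to 1
min_max = MinMaxScaler()
train_norm = min_max.fit_transform(train_km)
test_norm = min_max.transform(test_km)  # may fall outside 0 to 1

# Standardization: subtract the training mean, divide by the training std
standard = StandardScaler()
train_std = standard.fit_transform(train_km)
test_std = standard.transform(test_km)

print(train_norm.ravel())
print(train_std.ravel())
```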

### In conclusion

The process of filling missing values --> imputation

The process of converting non-numerical values to numerical --> feature encoding (a form of feature engineering)

### Fit a model

Now that we've filled and transformed our data, ensuring the training and test sets have been kept separate, let's fit a model to the training set and evaluate it on the test set.

In [105]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

# Setup model
model = RandomForestRegressor()

# Make sure to use transformed (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.11996671670109127

Fitting the model on the data involves passing it the data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

### Use the model to make a prediction

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once your model instance is trained, you can use the predict() method to predict a target value given a set of features. In other words, use the model, along with some unlabelled data, to predict the label.

Note: the data you predict on has to be in the same shape (the same features, in the same order) as the data you trained on.
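A minimal sketch of predict() on made-up numeric data follows. The `X_demo`, `y_demo` and `X_new` arrays, the `model_demo` name, and the `n_estimators=10` setting are assumptions chosen to keep the example small, not values from the car sales workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Hypothetical numeric data: 100 samples with 3 features each
X_demo = rng.random((100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

model_demo = RandomForestRegressor(n_estimators=10, random_state=42)
model_demo.fit(X_demo, y_demo)

# New data must have the same number of columns as the training data
X_new = rng.random((5, 3))
preds = model_demo.predict(X_new)

print(preds.shape)  # one prediction per row of X_new
```

Passing an array with a different number of columns, such as one of shape (5, 2), would make predict() raise a ValueError.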