## A Beginner's Guide to Machine Learning Modeling: Tutorial with Python and Scikit-Learn **(Part2)**

In the previous post we went (pretty quickly) through the basic capabilities and concepts of Machine Learning and **Scikit-Learn**. In this notebook we will break down the steps from **Part1** more thoroughly.

### 1. Getting the data ready

Data doesn't always come ready to use with a Scikit-Learn machine learning model.

Four of the main steps you'll often have to take are:

* Splitting the data into features (usually `X`) and labels (usually `y`).
* Splitting the data into training and testing sets (and possibly a validation set).
* Filling (also called imputing) or disregarding missing values.
* Converting non-numerical values to numerical values (also called encoding).

In [1]:
import pandas as pd
import numpy as np
import sklearn

In [2]:
heart_disease = pd.read_csv('heart-disease.csv')

In [3]:
# View the first five rows of the data
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [4]:
# Splitting the data into features (X) and labels (y)
X = heart_disease.drop('target', axis=1)
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


Let's check out the labels.

In [5]:
y = heart_disease['target']
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

Now let's split our data into training and test sets. We'll use an 80/20 split (80% of samples for training and 20% of samples for testing).

In [6]:
# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20) # you can change the test size
# Check the shapes of different data splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
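
The step list earlier also mentions a possible validation set. Here's a minimal sketch of one way to get one: call `train_test_split` twice. The toy arrays and the `_demo` variable names are assumptions made up for illustration (they aren't part of the heart disease data), and `random_state=42` is only there so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for real features and labels (assumption for illustration)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# First hold out 20% of the samples as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Then take 25% of the remaining 80% as a validation set (20% of the total)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Shapes of the training, validation and test features
print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```

This gives a 60/20/20 split. Other ratios work the same way, you'd only change the two `test_size` values.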

### 1.1 Make sure it's all numerical

Computers love numbers.
One thing you'll often have to make sure of is that your datasets are in numerical form.

This is the case even for datasets that contain 'non-numerical features' that you may want to include in a model.
                     
For example, if we were working with a car sales dataset, how might we turn features such as `Make` and `Colour` into numbers?

Let's figure it out.

First, we'll import the `car-sales-extended.csv` dataset.

In [7]:
# Import data
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


We can check the types with `.dtypes`

In [8]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

Note the `Make` and `Colour` features are of `dtype=object` (they're strings) whereas the rest of the columns are of `dtype=int64`.

If we want to use the `Make` and `Colour` features in our model, we'll need to figure out how to turn them into numerical form.

In [9]:
# Split into X & y and train/test
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

Now let's try and build a model on our `car_sales` data.

In [10]:
# Try to predict with random forest on price column (doesn't work)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

Seems this does not work. Looks like we'll have to convert the non-numerical features into numbers first.

The process of turning categorical features into numbers is often referred to as **encoding**.

Scikit-Learn has a great in-depth guide on [Encoding Categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).

But let's look at one of the most straightforward ways to turn categorical features into numbers, [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

In machine learning, [one-hot encoding](https://en.wikipedia.org/wiki/One-hot#Machine_learning_and_statistics) gives a value of `1` to the target value and a value of `0` to the other values.
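
To see what that means on a tiny scale, here's a hand-rolled sketch using only NumPy. It is not how Scikit-Learn implements it internally, just an illustration of the idea, and the `colours` array is made up for the example.

```python
import numpy as np

# A made-up categorical column (assumption for illustration)
colours = np.array(["White", "Blue", "White", "Red"])

# The unique categories, sorted alphabetically: Blue, Red, White
categories = np.unique(colours)

# Compare every sample against every category: 1 where they match, 0 elsewhere
one_hot_demo = (colours[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot_demo)
```

Each row has exactly one `1`, sitting in the column of that sample's category. The first sample (`White`) becomes `[0, 0, 1]`.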

We can use the following steps to one-hot encode our dataset:

1. Import `sklearn.preprocessing.OneHotEncoder` to one-hot encode our features and `sklearn.compose.ColumnTransformer` to target the specific columns of our DataFrame to transform.
2. Define the categorical features we'd like to transform.
3. Create an instance of the `OneHotEncoder`.
4. Create an instance of `ColumnTransformer` and feed it the transforms we'd like to make.
5. Fit the instance of the `ColumnTransformer` to our data and transform it with the `fit_transform(X)` method.

    **Note:** In Scikit-Learn, the term 'transformer' is often used to refer to something that ***transforms*** data.

In [11]:
# 1.  Import OneHotEncoder and ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 2. Define the categorical features to transform
categorical_features = ["Make", "Colour", "Doors"]

# 3. Create an instance of OneHotEncoder
one_hot = OneHotEncoder()

# 4. Create an instance of ColumnTransformer
transformer = ColumnTransformer([("onehot", #name
                                one_hot,    #transformer
                                categorical_features)], # columns to transform
                                remainder='passthrough') # what to do with the rest of the columns? ("passthrough" = leave unchanged)
# 5. Turn the categorical_features into numbers (this returns a NumPy array or sparse matrix, not a DataFrame)
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

Looks like `OneHotEncoder` and `ColumnTransformer` have turned all of our data samples into numbers.

Let's check out the first transformed sample.

In [12]:
# View the first transformed sample
transformed_X[0]

array([0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
       1.0000e+00, 0.0000e+00, 3.5431e+04])

And what were these values originally?

In [13]:
# View original first  sample 
X.iloc[0]

Make             Honda
Colour           White
Odometer (KM)    35431
Doors                4
Name: 0, dtype: object

### 1.1.1 Numerically encoding data with pandas
Another way we can numerically encode data is directly with pandas.

We can use the `pandas.get_dummies()` (or `pd.get_dummies()` for short) function and pass it our target columns.

In return, we'll get a one-hot encoded version of our target columns.

Let's remind ourselves of what our DataFrame looks like.

In [14]:
# View head of original DataFrame
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


Perfect, now let's use `pd.get_dummies()` to turn our categorical variables into one-hot encoded variables.

In [15]:
# One-hot encode categorical variables
categorical_variables = ["Make", "Colour", "Doors"]
dummies = pd.get_dummies(data=car_sales[categorical_variables])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True


Perfect!

Notice how there's a new column for each categorical option (e.g. `Make_BMW`, `Make_Honda`, etc.).

But also notice how it left the `Doors` column as it was?

This is because `Doors` is already numeric, so for `pd.get_dummies()` to work on it, we can change it to type `object`.

By default, `pd.get_dummies()` also turns all of the values to bools (`True` or `False`).

We can get the returned values as `0` or `1` by setting `dtype=float`.


In [16]:
# Have to convert doors to object for dummies to work on it ...
car_sales["Doors"] = car_sales["Doors"].astype(object)
dummies = pd.get_dummies(data=car_sales[["Make", "Colour", "Doors"]],
                        dtype=float)
dummies

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


Nice!

We've now turned our data into fully numeric form using Scikit-Learn and pandas.
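
As a side note, converting a numeric column to `object` isn't the only way to get pandas to one-hot encode it. A small sketch, using a made-up DataFrame (`cars_demo` is an assumption for illustration): passing the column names through the `columns` parameter of `pd.get_dummies()` encodes those columns whatever their dtype.

```python
import pandas as pd

# Made-up sample with one string column and one numeric column (assumption)
cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Doors": [4, 5, 3],
})

# `columns` names the columns to encode, even if they are numeric
dummies_demo = pd.get_dummies(cars_demo, columns=["Make", "Doors"], dtype=float)
print(dummies_demo.columns.tolist())
```

Any column not listed in `columns` is left in the result unchanged.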

You might be wondering...

**Should you use Scikit-Learn or pandas for turning data into numerical form?**

And the answer is either.

But as a rule of thumb:

* If you're performing **quick data analysis and running small modeling experiments**, use `pandas` as it's generally quite fast to get up and running.
* If you're performing a **larger scale modeling experiment** or would like to put your **data processing steps into a production pipeline**, I'd recommend leaning towards Scikit-Learn, specifically a [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) (chaining together multiple estimators/modeling steps).
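
To give a feel for what a pipeline looks like, here's a rough sketch that chains a one-hot encoding step and a model into a single object. The tiny `cars_demo` DataFrame, the step names, and the settings (`handle_unknown="ignore"`, `n_estimators=10`, `random_state=42`) are all assumptions chosen for illustration, not recommendations.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up dataset shaped like the car sales data (assumption)
cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Nissan", "Honda", "BMW"],
    "Colour": ["White", "Blue", "White", "Blue", "Red", "Black"],
    "Odometer (KM)": [35431, 192714, 84714, 154365, 181577, 66604],
    "Doors": [4, 5, 4, 3, 4, 5],
    "Price": [15323, 19943, 28343, 13434, 14043, 31570],
})
X_demo = cars_demo.drop("Price", axis=1)
y_demo = cars_demo["Price"]

# Step 1: one-hot encode the categorical columns, pass the rest through
preprocessor = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
)

# Step 2: chain the preprocessing and the model together
pipeline_demo = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])

# Fitting the pipeline fits the encoder and the model in one call
pipeline_demo.fit(X_demo, y_demo)

# Predicting runs the raw DataFrame through the same preprocessing first
preds_demo = pipeline_demo.predict(X_demo.head(2))
print(preds_demo)
```

The nice part is that the raw DataFrame goes in at both fit and predict time, so the encoding step can't be accidentally skipped or applied differently.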

Since we've turned our data into numerical form, how about we try and fit our model again?

Let's recreate a train/test split except this time we'll use `transformed_X` instead of `X`.

In [17]:
np.random.seed(42)

# Create train and test splits with transformed_X
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)

# Create the model instance
model = RandomForestRegressor()

# Fit the model on the numerical data (this errored before since our data wasn't fully numeric)
model.fit(X_train, y_train)

# Score the model (returns the R^2 metric by default, also called the coefficient of determination, higher is better)
model.score(X_test, y_test)

0.3235867221569877
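
If you're curious what that R^2 number actually measures, here's a small worked sketch. It computes R^2 by hand as one minus the ratio of the model's squared error to the squared error of always predicting the mean, then compares the result with `sklearn.metrics.r2_score`. The `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions (assumption for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# Squared error of the predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Squared error of always predicting the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean.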

### 1.2 What if there were missing values in the data?

Holes in the data means holes in the patterns your machine learning model can learn.

Many machine learning models don't work well or produce errors when they're used on datasets with missing values.

A missing value can appear as a blank, as a `NaN` or something similar.

There are two main options when dealing with missing values:

1. **Fill them with some given or calculated value (imputation)** - For example, you might fill missing values of a numerical column with the mean of all other values. The practice of calculating or figuring out how to fill missing values in a dataset is called **imputing**. For a great resource on imputing missing values, I'd recommend referring to the [Scikit-Learn user guide](https://scikit-learn.org/stable/modules/impute.html).
2. **Remove them** - If a row or sample has missing values, you may opt to remove it from your dataset completely. However, this potentially results in using less data to build your model.

   **Note:** Dealing with missing values differs from problem to problem, meaning there's no 100% best way to fill missing values across datasets and problem types. It will often take careful experimentation and practice to figure out the best way to deal with missing values in your own datasets.
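
Here's a quick sketch of the two options side by side on a tiny made-up DataFrame (the `demo` data is an assumption for illustration): filling a missing value with the column mean versus dropping the row that contains it.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing odometer reading (assumption)
demo = pd.DataFrame({
    "Odometer (KM)": [10000.0, np.nan, 30000.0, 20000.0],
    "Price": [5000.0, 7000.0, 4000.0, 6000.0],
})

# Option 1: fill the missing value with the mean of the column
filled = demo.fillna({"Odometer (KM)": demo["Odometer (KM)"].mean()})

# Option 2: remove any row containing a missing value
dropped = demo.dropna()

print(len(filled), len(dropped))
```

Filling keeps all four samples (the gap becomes the mean), while dropping leaves only three.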

To practice dealing with missing values, let's import a version of the `car_sales` dataset with several missing values.

In [18]:
# Import car sales dataframe with missing values 
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


If your dataset is large, it's likely you aren't going to go through it sample by sample to find the missing values.

Luckily, pandas has a method called `pd.DataFrame.isna()` which is able to detect missing values.

Let's try it on our DataFrame.

In [19]:
# Get the sum of all missing values
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Seems there are about 50 or so missing values per column.

How about we try to split the data into features and labels, convert the categorical data to numbers, split the data into training and test sets, and then fit a model on it (just like we did before)?

In [20]:
# Create Features 
X_missing = car_sales_missing.drop("Price", axis=1)
print(f"Number of missing values:\n{X_missing.isna().sum()}")

Number of missing values:
Make             49
Colour           50
Odometer (KM)    50
Doors            50
dtype: int64


In [21]:
# Create Labels
y_missing = car_sales_missing['Price']
print(f"Number of missing values:\n{y_missing.isna().sum()}")

Number of missing values:
50


Now we convert the categorical columns into one-hot encodings (just as before).

In [22]:
# Let's convert the categorical columns to one-hot encodings (code copied from above)
# Turn the categories (Make, Colour and Doors) into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([('onehot',
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough',
                                sparse_threshold=0) # 0 = always return a dense array rather than a sparse matrix
transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.48360e+05]])

Finally, let's split the missing data samples into train and test sets and then try to fit and score a model on them.

In [23]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                   y_missing, 
                                                   test_size=0.2)

# Fit and score the model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: Input y contains NaN.

Looks like the model we're trying to use doesn't work with missing values.

When we try to fit it on a dataset with missing values, Scikit-Learn raises an error. Here it's `ValueError: Input y contains NaN.` because the labels contain missing values. Depending on your Scikit-Learn version, missing values in the features can produce a similar error, such as `ValueError: Input contains NaN. RandomForestRegressor does not accept missing values encoded as NaN natively...`

Looks like if we want to use `RandomForestRegressor`, we'll have to fill or remove the missing values.

   **Note:** Scikit-Learn does have a [list of models which can handle NaNs or missing values directly](https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values), such as `sklearn.ensemble.HistGradientBoostingClassifier` or `sklearn.ensemble.HistGradientBoostingRegressor`.


Let's see what values are missing again.

In [24]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

### 1.2.1 Fill missing data with pandas
Let's see how we might fill missing values with pandas.

For categorical values, one of the simplest ways is to fill the missing fields with the string 'missing'.

We could do this for the `Make` and `Colour` features.

As for the `Doors` feature, we could use "missing" or we could fill it with the most common option, `4`.

With the `Odometer (KM)` feature, we can use the mean value of all the other values in the column.

And finally, for those samples which are missing a `Price` value, we can remove them (since `Price` is the target value, removing these samples probably causes less harm than imputing; however, you could design an experiment to test this).

   **Note:** The practice of filling missing data with given or calculated values is called **imputation**. And it's important to remember there's no perfect way to fill missing data (unless it's with data that should've actually been there in the first place). The methods we're using are only one of many. The techniques you use will depend heavily on your dataset. A good place to start would be searching for "data imputation techniques".

Let's start with the `Make` column.

We can try the pandas method `fillna(value='missing', inplace=True)` to fill all the missing values with the string "missing".

In [25]:
# Fill the missing values in the Make column
car_sales_missing["Make"].fillna(value='missing', inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  car_sales_missing["Make"].fillna(value='missing', inplace=True)


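pandas warns that calling `fillna()` with `inplace=True` on a single selected column won't keep working in future versions, so let's assign the result back to the column instead.

In [26]:
# Fill the missing values in the Make column (assign the result back instead of using inplace=True)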
car_sales_missing['Make'] = car_sales_missing['Make'].fillna(value='missing')

And we can do the same with the `Colour` column.

In [27]:
# Fill the Colour column 
car_sales_missing['Colour'] = car_sales_missing['Colour'].fillna('missing')
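
Another pattern, the one the pandas warning itself hints at, is to call `fillna()` on the whole DataFrame with a dictionary mapping column names to fill values. Here's a small sketch on a made-up DataFrame (`demo` is an assumption for illustration) that fills two columns in one call.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in two string columns and one numeric column (assumption)
demo = pd.DataFrame({
    "Make": ["Honda", np.nan, "BMW"],
    "Colour": [np.nan, "Blue", "White"],
    "Doors": [4.0, np.nan, 5.0],
})

# Fill several columns at once; columns not in the dictionary are left alone
demo = demo.fillna({"Make": "missing", "Colour": "missing"})
print(demo.isna().sum())
```

Only the columns named in the dictionary are filled, so `Doors` still has its missing value here.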

How many missing values do we have now?

In [28]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Nice! Now let's fill the `Doors` column with `4` (the most common value). For this column, that's the same as filling it with the **median** or **mode**.

In [29]:
# Find the most common value of the Doors column 
car_sales_missing['Doors'].value_counts()

Doors
4.0    811
5.0     75
3.0     64
Name: count, dtype: int64

In [30]:
# Fill the Doors column with the most common value
car_sales_missing['Doors'] = car_sales_missing['Doors'].fillna(value=4)

Next, we'll fill the `Odometer (KM)` column with its own mean value.

In [32]:
# Fill the Odometer (KM) column
car_sales_missing['Odometer (KM)'] = car_sales_missing['Odometer (KM)'].fillna(car_sales_missing['Odometer (KM)'].mean())

How many missing values do we have now?

In [34]:
# Check the number of missing values
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

Perfect!

Finally, we can remove the rows which are missing the target value `Price`.

**Note:** Another option would be to impute the `Price` value with the mean or median or some other calculated value (such as by using similar cars to estimate the price). However, to keep things simple and prevent introducing too many fake labels to the data, we'll remove the samples missing a `Price` value.

In [35]:
# Remove rows with the missing Price labels
car_sales_missing.dropna(inplace=True)

There should be no missing values!

In [37]:
# Check the number of missing values
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

Since we removed samples missing a `Price` value, there are now fewer samples overall but none of them have missing values.

In [39]:
# Check the number of total samples (previously was 1000)
len(car_sales_missing)

950

Can we fit the model now?

Let's try!

First we'll create the features and labels.

Then we'll convert categorical variables into numbers via one-hot encoding.

Then we'll split the data into training and test sets just like before. 

Finally, we'll try to fit a `RandomForestRegressor()` model to the newly filled data.

In [40]:
# Create features
X_missing = car_sales_missing.drop("Price", axis=1)
print(f"Number of missing X values: \n {X_missing.isna().sum()}")

# Create labels
y_missing = car_sales_missing["Price"]
print(f"Number of missing y values: \n {y_missing.isna().sum()}")

Number of missing X values: 
 Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64
Number of missing y values: 
 0


In [41]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer(transformers=[("one_hot",
                                               one_hot,
                                               categorical_features )],
                                               remainder='passthrough',
                                               sparse_threshold=0) # 0 = always return a dense array rather than a sparse matrix

transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [42]:
# Split data into training and test sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                    y_missing,
                                                    test_size=0.2)
# Fit and score the model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.22011714008302485

Perfect!

Looks like filling missing values with pandas worked!

Our model can be fit to the data without issues.

### 1.2.2 Filling missing data and transforming categorical data with Scikit-Learn

Now that we've filled the missing columns using pandas functions, you might be thinking, "Why pandas? I thought this was a Scikit-Learn introduction?".

Not to worry, Scikit-Learn provides a class called `sklearn.impute.SimpleImputer()` which allows us to do a similar thing.

`SimpleImputer()` transforms data by filling missing values with a given `strategy` parameter.

And we can use it to fill missing values in our DataFrame as above.

At the moment, our dataframe has no missing values.

In [43]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

Let's reimport it so it has missing values and we can fill them with Scikit-Learn.

In [45]:
# Reimport the DataFrame
car_sales_missing = pd.read_csv('car-sales-extended-missing-data.csv')
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

To begin, we'll remove the rows which are missing a `Price` value.

In [46]:
# Drop the rows with na in the Price column
car_sales_missing.dropna(subset=['Price'], inplace=True)


Now there are no missing values in the `Price` column.

In [47]:
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

Since we don't have to fill any `Price` values, let's split our data into features (`X`) and labels (`y`).

We'll also split the data into training and test sets.

In [48]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test 
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

**Note:** We've split the data into train & test sets here so we can fill missing values on them separately. This is best practice as the test set is supposed to emulate data the model has never seen before. For categorical variables, it's generally okay to fill values across the whole dataset. However, for numerical variables, you should **only fill values on the test set that have been computed from the training set.**

Training and Test sets created!

Let's now set up a few instances of `SimpleImputer()` to fill various missing values.

We'll use the following strategies and fill values:

* For the categorical columns (`Make`, `Colour`), `strategy="constant", fill_value="missing"` (fill the missing samples with a consistent value of `missing`).
* For the `Doors` column, `strategy="constant", fill_value=4` (fill the missing samples with a consistent value of `4`).
* For the numerical column `Odometer (KM)`, `strategy="mean"` (fill the missing samples with the mean of the target column).

**Note:** There are more `strategy` and fill options in the `SimpleImputer()` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).
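
As a taste of those other options, here's a small sketch of two more strategies, `"most_frequent"` and `"median"`, on tiny made-up arrays (the `_demo` arrays are assumptions for illustration). We won't use these below, it's only to show how the `strategy` parameter changes the fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up door counts with one missing value (assumption)
doors_demo = np.array([[4.0], [4.0], [3.0], [np.nan], [5.0]])

# "most_frequent" fills with the most common value in the column
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(doors_demo)

# Made-up odometer readings with one missing value (assumption)
odometer_demo = np.array([[10.0], [20.0], [np.nan], [90.0]])

# "median" fills with the middle value, which is less sensitive to outliers than the mean
median_imputer = SimpleImputer(strategy="median")
filled_median = median_imputer.fit_transform(odometer_demo)

# The learned fill values are stored in the statistics_ attribute
print(mode_imputer.statistics_, median_imputer.statistics_)
```

Here the missing door count becomes `4.0` and the missing odometer reading becomes `20.0`.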

In [49]:
from sklearn.impute import SimpleImputer

# Create categorical variable imputer 
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')

# Create Door column imputer 
door_imputer = SimpleImputer(strategy='constant', fill_value=4)

# Create Odometer (KM) column imputer 
num_imputer = SimpleImputer(strategy='mean')

Imputers created!

Now let's define which columns we'd like to impute on.

We'll need these shortly (I'll explain in the next cell).

In [50]:
# Define different column features 

categorical_features = ["Make", "Colour"]
door_features = ["Doors"]
numerical_features = ["Odometer (KM)"]

Columns defined!

Now we can use the `sklearn.compose.ColumnTransformer` class from Scikit-Learn, in a similar way to what we did before to get our data to be all numerical values.

`ColumnTransformer()` takes as input a list of tuples in the form `(name_of_transformer, transformer_to_use, columns_to_transform)` specifying which columns to transform and how to transform them.

For example:

```
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, categorical_features),
])

```

In this case, the elements of the tuple are:

* `name_of_transformer` = `"cat_imputer"`
* `transformer_to_use` = `cat_imputer` (the instance of `SimpleImputer()` we defined above)
* `columns_to_transform` = `categorical_features` (the list of categorical features we defined above).

Let's expand upon this by extending the example.

In [51]:
from sklearn.compose import ColumnTransformer

# Create a series of column transforms to perform 
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, categorical_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, numerical_features)
])

Nice!

The next step is to fit our `ColumnTransformer()` instance (`imputer`) to the training data and then use it to transform both the training and testing data.

In other words we want to:

1. Learn the imputation values from the training set.
2. Fill the missing values in the training set with the values learned in 1.
3. Fill the missing values in the testing set with the values learned in 1.


Why this way?

In our case, we're not calculating many values (only the mean of the `Odometer (KM)` column). However, remember that the test set should always remain unseen data.

So **when filling values in the test set, they should only be filled with values calculated or imputed from the training set.**

We can achieve steps 1 & 2 simultaneously with the `ColumnTransformer.fit_transform()` method (`fit` = find the values to fill, `transform` = fill them).

And then we can perform step 3 with the `ColumnTransformer.transform()` method (we only want to transform the test set, not learn different values to fill).
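
To make the difference between `fit_transform()` and `transform()` concrete, here's a tiny sketch with a single `SimpleImputer` on made-up arrays (`train_demo` and `test_demo` are assumptions for illustration). The mean is learned from the training values only and then reused on the test values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test columns, each with a missing value (assumption)
train_demo = np.array([[1.0], [3.0], [np.nan]])
test_demo = np.array([[np.nan], [10.0]])

mean_imputer = SimpleImputer(strategy="mean")

# Steps 1 & 2: learn the mean from the training data (2.0) and fill the training gap
filled_train = mean_imputer.fit_transform(train_demo)

# Step 3: fill the test gap with the training mean, without learning anything new
filled_test = mean_imputer.transform(test_demo)

print(mean_imputer.statistics_)
print(filled_test)
```

The missing test value is filled with `2.0` (the training mean), not with `10.0` (which is what the test set's own mean would have been).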

In [53]:
# Find values to fill and transform training data
filled_X_train = imputer.fit_transform(X_train)

# Fill values in the test set with the values learned from the training set
filled_X_test = imputer.transform(X_test)

# Check filled X_train
filled_X_train

array([['Honda', 'White', 4.0, 71934.0],
       ['Toyota', 'Red', 4.0, 162665.0],
       ['Honda', 'White', 4.0, 42844.0],
       ...,
       ['Toyota', 'White', 4.0, 196225.0],
       ['Honda', 'Blue', 4.0, 133117.0],
       ['Honda', 'missing', 4.0, 150582.0]], dtype=object)

Perfect!

Let's turn our `filled_X_train` and `filled_X_test` arrays into DataFrames to inspect their missing values.

In [54]:
# Get our transformed data arrays back into DataFrames
filled_X_train_df = pd.DataFrame(filled_X_train,
                                columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])

filled_X_test_df = pd.DataFrame(filled_X_test,
                                columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])

# Check missing data in training set
filled_X_train_df.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64
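A note on the column names used above: `ColumnTransformer` returns its output columns in the order the transformers are listed, not in the order of the original DataFrame. The sketch below shows one way to avoid typing the names by hand, assuming the same three-imputer layout. The DataFrame `toy` and the lists `cat_cols`, `door_cols` and `num_cols` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data, deliberately in a different column order to the output
toy = pd.DataFrame({
    "Odometer (KM)": [1000.0, np.nan, 3000.0],
    "Make": pd.Series(["Honda", np.nan, "Toyota"], dtype=object),
    "Doors": [4.0, np.nan, 4.0],
})

cat_cols = ["Make"]
door_cols = ["Doors"]
num_cols = ["Odometer (KM)"]

toy_imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), cat_cols),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), door_cols),
    ("num_imputer", SimpleImputer(strategy="mean"), num_cols),
])

filled = toy_imputer.fit_transform(toy)

# Output columns follow the transformer order: categorical, doors, numerical
output_columns = cat_cols + door_cols + num_cols
filled_df = pd.DataFrame(filled, columns=output_columns)
print(filled_df.isna().sum())
```

Building `output_columns` from the same lists passed to the transformer keeps the names and the data in sync if the lists change later.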

And is there any missing data in the test set?

In [55]:
# Check missing data in the testing set 
filled_X_test_df.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

What about the original?

In [56]:
# Check the original... it still has missing values
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

Perfect!

No more missing values in our filled training and test sets (the original `car_sales_missing` DataFrame is left unchanged).

But is our data all numerical?

In [58]:
filled_X_train_df.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,71934.0
1,Toyota,Red,4.0,162665.0
2,Honda,White,4.0,42844.0
3,Honda,White,4.0,195829.0
4,Honda,Blue,4.0,219217.0


Ahh... looks like our `Make` and `Colour` columns are still strings.

Let's one-hot encode them along with the `Doors` column to make sure they're all numerical, just as we did previously.

In [59]:
# Now let's one-hot encode the features with the same code as before
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                                 remainder='passthrough',
                                 sparse_threshold=0)

# Encode train and test values separately (fit on the training set only)
transformed_X_train = transformer.fit_transform(filled_X_train_df)
transformed_X_test = transformer.transform(filled_X_test_df)

# Check transformed and filled X_train
transformed_X_train

array([[0.0, 1.0, 0.0, ..., 1.0, 0.0, 71934.0],
       [0.0, 0.0, 0.0, ..., 1.0, 0.0, 162665.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 42844.0],
       ...,
       [0.0, 0.0, 0.0, ..., 1.0, 0.0, 196225.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 133117.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 150582.0]], dtype=object)

Nice!

Now our data is:

1. All numerical
2. No missing values
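The array printed above has `dtype=object`, so it can be worth double-checking both points before fitting a model. Here is a minimal sketch of one way to do that, using a small made-up array (`toy_transformed` is hypothetical) in place of `transformed_X_train`.

```python
import numpy as np

# Hypothetical array mimicking the one-hot encoded output above (dtype=object)
toy_transformed = np.array([[0.0, 1.0, 71934.0],
                            [1.0, 0.0, 162665.0]], dtype=object)

# Casting to float raises a ValueError if a non-numeric value (e.g. a string) remains
toy_numeric = toy_transformed.astype(float)

# np.isnan only works on numeric arrays, so run it after the cast
has_missing = np.isnan(toy_numeric).any()

print(toy_numeric.dtype, has_missing)  # float64 False
```

If the cast succeeds and `has_missing` is `False`, the data is all numerical with no missing values.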

Let's try and fit a model!

In [61]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

# Make sure to use the transformed data (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.21229043336119102

You might have noticed this result is slightly different to before.

This is because we've created our training and testing sets differently.

We split the data into training and test sets ***before*** filling the missing values.

Previously, we did the reverse: we filled the missing values ***before*** splitting the data into training and test sets.

Filling before splitting can leak information between the training and test sets, because the fill values are calculated from all of the data, including the test samples.

Remember, one of the most important concepts in machine learning is making sure your model doesn't see ***any*** testing data before evaluation.
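One way to make this separation harder to get wrong is to bundle the filling, encoding and model steps into a single Scikit-Learn `Pipeline`, so that calling `fit()` only ever learns from the training data. The following is a rough sketch of that idea, not the approach used in this notebook. The data in `toy` is made up, and `handle_unknown="ignore"`, `n_estimators=20` and the step names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data loosely shaped like the car sales data
rng = np.random.default_rng(42)
n = 40
toy = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "Toyota", "BMW"], size=n), dtype=object),
    "Odometer (KM)": rng.integers(10_000, 200_000, size=n).astype(float),
})
toy["Price"] = 30_000 - 0.1 * toy["Odometer (KM)"]

# Knock out some values to simulate missing data
toy.loc[::7, "Make"] = np.nan
toy.loc[::5, "Odometer (KM)"] = np.nan

X = toy.drop("Price", axis=1)
y = toy["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Categorical columns: fill with "missing", then one-hot encode
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("cat", cat_pipeline, ["Make"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# fit() learns fill values and categories from the training data only
model.fit(X_train, y_train)

# predict() applies those learned values to the test data
predictions = model.predict(X_test)
print(predictions.shape)
```

With this layout there is no separate `fit_transform()` call to accidentally run on the full dataset.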

We'll keep practicing, but for now, some of the main takeaways are:

* Keep your training and test sets separate.
* Most datasets you come across won't be in a form that's ready to use immediately with machine learning models. Some may take more preparation than others.
* For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with to numbers. This process is often referred to as feature engineering or feature encoding.
* Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as **data imputation**.

We got through a lot, and I want to thank you for sticking with me until the end of Part 2. You now know substantially more about Scikit-Learn than you did before. In Part 3 we will dive deeper into picking the right models, fitting your models, making predictions, and evaluating them.

Special thanks to Daniel and Andrei from Zero to Mastery, as always, for starting me on my Machine Learning and Data Science path.