# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

## What is Scikit-Learn?

[Scikit-Learn](https://scikit-learn.org/stable/index.html), also referred to as `sklearn`, is an open-source, commercially usable Python machine learning library. Built on NumPy, SciPy, and Matplotlib, it provides simple, efficient tools that are accessible to everybody and reusable in various contexts.

![Scikit-Learn is used for modeling in machine learning projects.](./img/sklearn_6-step_ml_framework_tools_scikit-learn_highlight.png)

## What we're going to cover

This notebook focuses on the main use cases of the Scikit-Learn library. More specifically, we shall go through the typical workflow of a Scikit-Learn project step by step, improving our knowledge of Scikit-Learn as we go.

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problems
3. Fitting the model/algorithm and using it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Saving and loading a trained model
7. Putting it all together

**Note:** All of the steps in this notebook shall focus on **supervised learning** (having data and labels).

## 0. An end-to-end Scikit-Learn workflow

Scikit-Learn is a vast library containing a large variety of tools that can be used in many different contexts. As such, it helps to start with a typical end-to-end Scikit-Learn workflow and look at the most common use cases of the library.

[This notebook](./scikit-learn_workflow.ipynb) demonstrates one such typical Scikit-Learn workflow.

From there, we shall take a much closer look at each step in the process and build on the knowledge we gained using Scikit-Learn.

![Diagram of the Scikit-Learn workflow](img/sklearn_workflow.png)

## 1. Getting the data ready

### Standard imports

The very first step when working with a machine learning project is to import the necessary libraries and packages you'll be working with.

For this project, we shall keep using the usual NumPy, Pandas, and Matplotlib packages, so let's go ahead and import those right away:

In [1]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Now that we've imported the usual packages, let's get some data to work with.

Let's take a look at some heart disease data:

In [2]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


The three main things we have to do are:

- Split the data into features (usually called `X`) and labels (usually called `y`)
- Convert non-numeric values into numeric values (also called *feature encoding*)
- Fill (aka *impute*) or disregard missing values in the data

### Splitting the data

In [3]:
X = heart_disease.drop("target", axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [4]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

Now that we've split the data into features and labels, we also have to split them further into *training* and *test sets* that we can use to train and validate the machine learning models we're going to be making.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

Excellent! We now have training and test sets for our heart disease data.

When we look at the shapes of the training and test sets, notice we get tuples corresponding to the dimensions of the data. The second number in each `X` tuple means that `X` has `13` columns of data, while `y` is one-dimensional (a single column of labels). But what about the `242` and `61`?

Let's take a look at the shape of our original data:

In [6]:
X.shape, y.shape

((303, 13), (303,))

In [7]:
len(heart_disease)

303

It looks like the original heart disease data has `303` rows.

When we did the `train_test_split()` call, notice we had a `test_size` parameter set to `0.2`.

This means 20% of the data rows should be allocated to the test set. Let's verify:

In [8]:
round(303 * 0.2)

61

In [9]:
303 - 61

242

Looks like it checks out. 242 rows were allocated for the training set while 61 rows were allocated for the test set.
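As a quick check of that arithmetic, here is a minimal sketch on stand-in data with the same shape as the heart disease features. The arrays `X_demo` and `y_demo` and the `random_state=42` seed are assumptions made for illustration and are not part of the original cells.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the heart disease features (303 rows, 13 columns)
X_demo = np.zeros((303, 13))
y_demo = np.zeros(303)

# random_state is optional; fixing it makes the shuffle reproducible between runs
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)  # (242, 13) (61, 13)
print(y_tr.shape, y_te.shape)  # (242,) (61,)
```

With a fractional `test_size`, Scikit-Learn rounds the test-set size up (60.6 becomes 61) and gives the remaining rows to the training set.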

### Converting data into numeric values

Luckily, the heart disease data provided to us is already in numeric form on all columns.

However, most other datasets might not have all their data in numerical form.

It is important to convert the non-numeric data into numerical form first, because most machine learning models only work on numeric inputs.

Let's take a look at another example dataset and see how we can convert these kinds of data.

In [10]:
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [11]:
len(car_sales)

1000

In [12]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

As you can see, the `Make` and `Colour` columns have non-numeric data.

Let's see what happens if we try and build a model on the dataset without first converting those columns.

In [13]:
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [14]:
# Try to predict with random forest on price column (doesn't work)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

Oh dear, looks like Scikit-Learn throws an error when we try and build a model this time.

The error message gives us a hint as to what went wrong:
> ValueError: could not convert string to float: 'Toyota'

This means the `RandomForestRegressor()` model only accepts numeric inputs. Otherwise, it throws an error when it tries to deal with non-numeric data.

Now, let's see how to get around that error by converting the non-numeric data into a numeric format that the model can work with.

In [15]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

The `categorical_features` list contains the columns that have non-numeric or *categorical* data.

Now, we're trying to convert the `Make` and `Colour` columns into a numeric format, but why is `Doors` also included in the list?

Let's take a look at the `Doors` column first:

In [16]:
car_sales["Doors"].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

The `Doors` column indeed contains numerical data, but at the same time it also has very few distinct values with little variance.

Thus, it can also make sense to treat those discrete values as different *categories* that the data can fall into.

In [17]:
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

Now that's quite a lot to take in. Let's try and convert it into a Pandas `DataFrame` so we can have an easier time reading the data:

In [18]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


Now, you might say: "Hold on, this doesn't look the same as the original data!", but let's take a step back and look at what's actually happened here.

For reference, here's the dataset we're working with:

In [19]:
X

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3
...,...,...,...,...
995,Toyota,Black,35820,4
996,Nissan,White,155144,3
997,Nissan,Blue,66604,4
998,Honda,White,215883,4


Hmmm... It looks like we do have the same values in the `Odometer (KM)` column of the `transformed_X` dataset.

Still, what's the deal with all the other columns?

The new columns are a result of the `OneHotEncoder` transforming the categorical data into multiple categories.

It does this by generating additional columns to represent each distinct category for the data.

For example, `OneHotEncoder` generated four columns to represent each of the car makes (`BMW`, `Honda`, `Nissan`, and `Toyota`). Each of those columns will then have a value of `0.0` if the car doesn't have the corresponding car make, or `1.0` if the car does have the corresponding car make. Note that only one column out of those four will have a `1.0` value, and the rest should have a value of `0.0`.

`OneHotEncoder` also does the same thing for the other `categorical_features` in the dataset.
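To make the encoding easier to follow, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its three car makes are assumptions for illustration only; the real dataset has more rows and columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A tiny, made-up column of car makes
toy = pd.DataFrame({"Make": ["Honda", "BMW", "Toyota", "Honda"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for display
encoded = encoder.fit_transform(toy[["Make"]]).toarray()

# The learned categories are sorted: BMW, Honda, Toyota
print(encoder.categories_[0])
# One row per car, one column per category, exactly one 1.0 in each row
print(encoded)
```

The column order of the encoded output follows `encoder.categories_`, which is a convenient way to work out which generated column corresponds to which category.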

#### Another way to encode categorical data

We can also use the `get_dummies()` function from the Pandas library to convert categorical data into numeric form.

Let's take a look at the original `car_sales` dataset:

In [20]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [21]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1


We see that the `Make` and `Colour` columns were split into different categories, but the `Doors` column was still left intact.

Again, since the `Doors` column is already numeric, we can leave it as-is, but if we want to treat it as categorical data, we can convert it into an `object` type so that the `get_dummies()` function splits it into categories:

In [22]:
car_sales["Doors"] = car_sales["Doors"].astype(object)
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,0,1,0,0,0,0,0,0,1,0,1,0
1,1,0,0,0,0,1,0,0,0,0,0,1
2,0,1,0,0,0,0,0,0,1,0,1,0
3,0,0,0,1,0,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,1,1,0,0,0,0,0,1,0
996,0,0,1,0,0,0,0,0,1,1,0,0
997,0,0,1,0,0,1,0,0,0,0,1,0
998,0,1,0,0,0,0,0,0,1,0,1,0


Great! Now that we've successfully transformed the data into numeric form, we can try and fit a model on our transformed dataset.

In [23]:
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.31695249778476753

Fantastic! Looks like we've successfully fitted a model to the data without getting an error this time, all thanks to transforming our data into numeric form.

### Dealing with missing values

There are two ways to deal with missing data.

- Fill in the missing parts with a predetermined value. This approach is also known as **_imputation_**.
- Remove the rows containing missing data altogether. Note that this results in having less data to work with.

**Note:** How to deal with missing values depends on the problem at hand, and there's often no single best way to do it.
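The two approaches can be compared side by side on a small example. This is only a sketch on made-up values; the `toy` DataFrame and the choice of the column mean and the `"Missing"` placeholder are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A tiny, made-up table with one missing value in each column
toy = pd.DataFrame({
    "Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0],
    "Colour": ["Red", "Blue", None, "White"],
})

# Approach 1: imputation (fill numeric gaps with the column mean, text gaps with a placeholder)
imputed = toy.copy()
imputed["Odometer (KM)"] = imputed["Odometer (KM)"].fillna(imputed["Odometer (KM)"].mean())
imputed["Colour"] = imputed["Colour"].fillna("Missing")

# Approach 2: drop every row that has at least one missing value
dropped = toy.dropna()

print(imputed)
print(len(toy), "rows before dropping,", len(dropped), "rows after")
```

Imputation keeps all four rows but introduces made-up values, while dropping keeps only real values at the cost of half the rows in this toy case.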

In [24]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


We can see some missing (`NaN`) values in this dataset, such as the `Make` in row `996`.

One way to quickly check exactly how much missing data is in your dataset is to use the `isna()` method from Pandas:

In [25]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

As you can see, we have quite a lot of missing fields in our dataset.

What happens if we try and fit a model on a dataset with missing data?

In [26]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [27]:
# Transform categorical data into numeric form
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

# Try and fit a model
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

We get another error this time:
> ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

This means Scikit-Learn also needs the data to contain no missing values.

Let's see how we can deal with these missing data before we try and fit a model on this dataset.

#### Option 1: Use Pandas to deal with missing data

We can use Pandas to fill in the missing values of the dataset.

For numerical values we can simply use the mean of the existing data in the column, and for categorical data we can use some other predetermined value instead.

In [28]:
# Fill each column and assign the result back to the DataFrame
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("Missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("Missing")
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,Missing,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [29]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

Notice that the `Price` column still has missing values.

Since `Price` is the value we're trying to predict, it might be better to just remove the rows with no `Price` value for now:

In [30]:
car_sales_missing.dropna(inplace=True)

In [31]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

We now have no missing values in our data, but it came at a cost of having to remove a number of rows that had no `Price` value.

In [32]:
len(car_sales_missing)

950

Now let's try fitting a model to this dataset:

In [33]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [34]:
# Transform categorical data into numeric form
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

# Try and fit a model
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.25296341556528734

#### Option 2: Use Scikit-Learn to deal with missing data

We can also use Scikit-Learn to deal with missing data.

Let's use the `car_sales_missing` dataset once again, this time using Scikit-Learn to fill in or remove rows with missing data.

In [35]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Once again, we can drop the rows with the missing `Price` values for now.

In [36]:
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

Let's also split the remaining data into training and test sets:

In [37]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now, let's fill in the missing data using Scikit-Learn.

Note that it is best practice to use the following Scikit-Learn functions to fill and transform missing data separately on the training and test sets.

Why is that? Because performing imputation on the whole un-split dataset causes "information leakage", which is when information contained in the test set is "leaked" into the training set. The result is a biased estimator with an overly optimistic test error. The test set should be set aside at the beginning of any machine learning project and only be touched when evaluating the model.

Here are some guidelines to keep in mind when handling missing data:
- Split your data first (into train/test), always keep your training & test data separate
- Fill/transform the training set and test sets separately (this goes for filling data with pandas as well)
- Don't use data from the future (test set) to fill data from the past (training set)
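The guidelines above can be sketched on two tiny made-up arrays. The names `train` and `test` and their values are assumptions for illustration; the point is that the fill value is learned from the training data only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with missing values in both splits
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [100.0]])

mean_imputer = SimpleImputer(strategy="mean")

# fit_transform learns the mean from the training split and fills it in
train_filled = mean_imputer.fit_transform(train)
# transform reuses the training mean; the 100.0 in the test split has no influence
test_filled = mean_imputer.transform(test)

print(mean_imputer.statistics_)  # mean of the observed training values: [2.]
print(test_filled)
```

Had the imputer been fitted on both splits together, the large test value would have pulled the fill value upwards, which is exactly the leakage the guidelines warn about.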

In [38]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Use SimpleImputer to fill in missing values
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

Now that we have an imputer set up, it's time to use it to actually fill in the missing values in the dataset.

> **Note:** We use `fit_transform()` on the training data and `transform()` on the testing data. In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). Then we take those same patterns and fill the test set (transform only).

In [39]:
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)

filled_X_train

array([['BMW', 'White', 5.0, 152410.0],
       ['Nissan', 'Green', 4.0, 87701.0],
       ['Nissan', 'White', 4.0, 51004.0],
       ...,
       ['Honda', 'White', 4.0, 40134.0],
       ['Nissan', 'Black', 4.0, 125251.0],
       ['Missing', 'White', 4.0, 109384.0]], dtype=object)

Let's view the result as a Pandas `DataFrame` to get a better look at what just happened:

In [40]:
car_sales_filled_train = pd.DataFrame(filled_X_train,
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_test = pd.DataFrame(filled_X_test,
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_train

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,BMW,White,5.0,152410.0
1,Nissan,Green,4.0,87701.0
2,Nissan,White,4.0,51004.0
3,Honda,Blue,4.0,30120.0
4,Toyota,Blue,4.0,132327.821823
...,...,...,...,...
755,Honda,White,4.0,193179.0
756,Honda,Blue,4.0,196507.0
757,Honda,White,4.0,40134.0
758,Nissan,Black,4.0,125251.0


In [41]:
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

Wonderful! We've just used Scikit-Learn to fill in the missing values.

Now, there's just one more step before we can fit a model to this dataset.

Let's revisit a previous topic and convert the categorical features into a numeric form.

In [42]:
# One-hot encode the features with the same code as before 
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

# Fill train and test values separately
transformed_X_train = transformer.fit_transform(car_sales_filled_train)
transformed_X_test = transformer.transform(car_sales_filled_test)

# Check transformed and filled X_train
transformed_X_train

<760x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3040 stored elements in Compressed Sparse Row format>

In [43]:
pd.DataFrame(transformed_X_train.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,152410.000000
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,87701.000000
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,51004.000000
3,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,30120.000000
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,132327.821823
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,193179.000000
756,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,196507.000000
757,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,40134.000000
758,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,125251.000000


Now that we've done all the necessary steps to get our data ready, let's see if we can fit a machine learning model this time.

In [44]:
np.random.seed(69)
model = RandomForestRegressor()
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.2603174739059788

Great work! We've successfully prepared our dataset by splitting into training and test sets, handling missing data, and converting categorical features into a numerical form. That means we are now able to use these steps to transform any other dataset in order to better fit machine learning models.

If this looks confusing, don't worry, we've covered a lot of ground very quickly. We'll revisit these strategies in a future section in a way that makes a lot more sense.

For now, the key takeaways to remember are:

- Most datasets you come across won't be ready to use with machine learning models straight away, and some will take more preparation than others.
- For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as **feature engineering** or **feature encoding**.
- Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as **data imputation**.
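As a compact recap, the preparation steps from this section (imputing, encoding, then fitting) can be chained with Scikit-Learn's `Pipeline`. The following is a sketch under stated assumptions: the `toy_cars` data is synthetic, the column choices and the price formula are invented for illustration, and the score it prints says nothing about the real car sales data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car sales data (all values are made up)
rng = np.random.default_rng(0)
n_rows = 200
toy_cars = pd.DataFrame({
    "Make": pd.Series(rng.choice(["BMW", "Honda", "Nissan", "Toyota"], size=n_rows), dtype=object),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n_rows).astype(float),
})
toy_cars["Price"] = 30_000 - 0.08 * toy_cars["Odometer (KM)"] + rng.normal(0, 1_000, size=n_rows)

# Blank out some feature values to simulate missing data
toy_cars.loc[toy_cars.index % 10 == 0, "Make"] = np.nan
toy_cars.loc[toy_cars.index % 10 == 5, "Odometer (KM)"] = np.nan

# Split first, so every later step only learns from the training rows
X_toy = toy_cars.drop("Price", axis=1)
y_toy = toy_cars["Price"]
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Categorical column: fill gaps with a placeholder, then one-hot encode
categorical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric column: fill gaps with the training mean
preprocessor = ColumnTransformer([
    ("cat", categorical_steps, ["Make"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=0)),
])

# fit() runs fit_transform on each preparation step, score() only runs transform
pipeline.fit(X_train, y_train)
toy_score = pipeline.score(X_test, y_test)
print(f"R^2 on the synthetic test set: {toy_score:.2f}")
```

Bundling the steps this way means the imputers and the encoder are fitted on the training split only, which matches the leakage guidelines given earlier.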

## 2. Choosing the right estimator/algorithm for our problems

In order to get the right predictions for our data, it is important that we choose the right machine learning model that can best fit our data.

So far we've been using the `RandomForestRegressor` model provided by Scikit-Learn. But how did we know that it's the right model to use for our dataset? When does it make sense to use that particular model, and when does it make more sense to choose another model to fit?

First, we must ask ourselves what kind of machine learning problem we are trying to solve.

Here are some common types of machine learning problems:
- **Classification** - predicting whether a sample is one thing or another (e.g. whether a patient has heart disease)
   > Sometimes you'll see `clf` (short for classifier) used as a classification estimator instance's variable name.
- **Regression** - predicting a number (e.g. the selling price of a car)
- **Clustering** - discovering groups in data (e.g. identifying different customer segments)
- **Dimensionality Reduction** - reducing the number of features in a given dataset (e.g. keeping only the most important ones)

Scikit-Learn also provides [a handy cheat-sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to help us identify the machine learning problem, as well as some options for machine learning models to use on our given datasets:

![Scikit-Learn algorithm cheat-sheet](./img/sklearn_ml_map.png)

> **Note:** Scikit-Learn uses *estimator* as another term for machine learning *model* or *algorithm*.
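Whatever the problem type, Scikit-Learn estimators share the same `fit()`, `predict()`, and `score()` interface. The sketch below illustrates this on synthetic data generated with `make_classification` and `make_regression`; the datasets, sizes, and seeds are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic classification problem: the labels are 0 or 1
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)  # for classifiers, score() is mean accuracy

# Synthetic regression problem: the target is a continuous number
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=0)

reg = RandomForestRegressor(random_state=0)
reg.fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)  # for regressors, score() is the R^2 coefficient

print(f"Classifier accuracy: {clf_accuracy:.2f}")
print(f"Regressor R^2: {reg_r2:.2f}")
```

Because `score()` means different things for the two problem types, the two numbers are not directly comparable with each other.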

### Picking a machine learning model for a regression problem

Let's start with a regression problem. We'll use the [Boston housing dataset](https://scikit-learn.org/stable/datasets/index.html#boston-dataset) built into Scikit-Learn's `datasets` module.

In [19]:
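# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run
from sklearn.datasets import load_boston
boston = load_boston() # returns a large dictionary-like object

The `boston` dataset is loaded as a dictionary-like object, so let's turn it into a Pandas `DataFrame` to make it easier to work with.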

In [21]:
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])
boston_df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,22.0


That's quite a lot of data. Let's see exactly how many samples we are working with.

In [48]:
len(boston_df)

506

Great! Now let's try using that cheat-sheet and pick a machine learning model for the dataset.

![Using the Scikit-Learn cheatsheet to choose a model for the Boston housing dataset](./img/sklearn-ml-map-cheatsheet-boston-housing-ridge.png)

Following the flowchart we see that Scikit-Learn suggests using a [`Ridge`](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression) regression model. Let's try it out:

In [49]:
from sklearn.linear_model import Ridge

np.random.seed(69)

X = boston_df.drop("target", axis=1)
y = boston_df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Ridge()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.677606548622028

Not bad. But let's see if we can still get an even better result.

What happens if the `Ridge` regressor does not work, or does not produce as good a score as we wanted?

Let's check back on the cheat-sheet to see if we can proceed with an alternative step:

![Trying out the Ensemble regressors in case the Ridge regressor did not achieve the result we wanted](./img/sklearn-ml-map-cheatsheet-boston-housing-ensemble.png)

Following the diagram, we see that the next step would be to try out [`EnsembleRegressors`](https://scikit-learn.org/stable/modules/ensemble.html).

One of the most common and useful ensemble methods is the [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), known for its fast training and prediction times and adaptability to different problems.

We've actually used the `RandomForestRegressor` model earlier in this notebook, and here we see why we chose this particular model in those earlier datasets.

The basic premise of the Random Forest is to combine a number of different decision trees, each built with some randomness so that it differs from the others, and to make a prediction on a sample by averaging the results of the individual trees.
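The averaging step can be checked directly on a fitted `RandomForestRegressor`, whose individual trees are exposed through the `estimators_` attribute. This is a sketch on synthetic data; the dataset from `make_regression`, the ten trees, and the seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression dataset
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Ask every individual decision tree for its prediction on the first three samples
samples = X_demo[:3]
tree_predictions = np.array([tree.predict(samples) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own prediction
manual_average = tree_predictions.mean(axis=0)
averages_match = np.allclose(manual_average, forest.predict(samples))
print(tree_predictions.shape)  # (10, 3): ten trees, three samples
print(averages_match)          # True
```

The individual trees usually disagree with one another on any given sample; it is the average that tends to be more stable than a single tree.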

An in-depth discussion of the Random Forest algorithm is beyond the scope of this notebook, but if you're interested in learning more, [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen is a great read.

Let's try using the [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) on the Boston dataset:

In [50]:
np.random.seed(69)

X = boston_df.drop("target", axis=1)
y = boston_df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8577277397858931

Wonderful! We see that we get a much better score using the `RandomForestRegressor`.
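The same side-by-side comparison can be run without the Boston data, which is unavailable in scikit-learn 1.2 and later. The sketch below swaps in a synthetic, non-linear dataset from `make_friedman1`; that dataset, its size, and the seeds are assumptions for illustration, so the scores it prints will differ from the ones above.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data with a non-linear relationship between features and target
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Fit both estimators on the same split and collect their R^2 scores
scores = {}
for estimator in (Ridge(), RandomForestRegressor(random_state=0)):
    estimator.fit(X_tr, y_tr)
    scores[type(estimator).__name__] = estimator.score(X_te, y_te)

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

Using the same train/test split for both estimators keeps the comparison fair; which estimator comes out ahead depends on the data.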

### Picking a machine learning model for a classification problem

Now let's try another type of machine learning problem. This time let's work through the steps of picking a model for a classification problem.

The heart disease dataset from earlier contains data for exactly that kind of problem, so let's take another look at that here:

In [51]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease

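| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |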


Again, let's check how many samples we're dealing with:

In [52]:
len(heart_disease)

303

Now let's work through the cheat-sheet yet again to try and find the right model for our heart disease dataset.

![Using the Scikit-Learn cheat-sheet to find the right model for the heart disease dataset](./img/sklearn-ml-map-cheatsheet-heart-disease-linear-svc.png)

Following the diagram, we see that we are recommended to use the [`LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) model. `LinearSVC` stands for Linear Support Vector Classifier.

Let's try it on our data:

In [55]:
from sklearn.svm import LinearSVC

np.random.seed(69)

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LinearSVC(max_iter=1000)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)



0.8360655737704918

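Looks good, but we also get a warning alongside the score. With `LinearSVC`, this is typically a `ConvergenceWarning` telling us that the solver failed to converge within the allowed number of iterations.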

We can try and tweak some settings to try and fix those issues, but let's get to that another time.
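For the curious, here is a minimal sketch of one common tweak, assuming the warning is about the solver not converging (an assumption, since the warning text isn't reproduced above). It scales the features with `StandardScaler` and raises `max_iter`, using synthetic data from `make_classification` rather than the heart disease dataset. The names `X_demo`, `y_demo`, and `scaled_svc` are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data with 13 features, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scale the features first, then fit the linear SVC with a larger iteration budget
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svc.fit(X_tr, y_tr)

svc_score = scaled_svc.score(X_te, y_te)
print(svc_score)
```

Whether this removes the warning on a given dataset depends on the data, so treat it as a starting point rather than a guaranteed fix.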

For now, let's proceed along the cheat-sheet and choose another classifier model.

![Picking another classifier model from the cheat-sheet](./img/sklearn-ml-map-cheatsheet-heart-disease-ensemble.png)

We see that we get to the `EnsembleMethods` again. Except this time, we're going to be using ensemble *classifiers* instead of *regressors*.

The Random Forest model has another variant that is used for classification problems. Let's try using it on the heart disease data:

In [56]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(69)

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8524590163934426

Good job! We see that we got no warnings this time using the `RandomForestClassifier`, and we also got a better score to boot.

Again, we can tweak some settings or *hyperparameters* in the models above to try and improve their results, and we'll take a look at that more closely in later sections.

To wrap things up for this section, here are some quick guidelines for choosing machine learning models:
- If you have structured data (tables or dataframes), use ensemble methods, such as a Random Forest.
- If you have unstructured data (text, images, audio, things not in tables), use deep learning or transfer learning.

For this notebook, we're focused on structured data, which is why the Random Forest has been our model of choice.

If you'd like to learn more about the Random Forest and why it's the workhorse of machine learning, check out these resources:

- [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
- [Random Forests in Python](http://blog.yhat.com/posts/random-forests-in-python.html) by yhat
- [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen

The beautiful part about using Scikit-Learn is that its API allows us to use different models with pretty much the exact same workflows. Indeed, we see that our code has stayed pretty much the same across the different machine learning models used in this section. A big part of being a machine learning engineer or data scientist is experimenting - you might want to try out some of the other models on the cheat-sheet and see how you go. The more you can reduce the time between experiments, the better.
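As a rough illustration of that shared workflow, the sketch below fits and scores three different classifiers with the same two method calls. The synthetic data from `make_classification` and the particular choice of models are assumptions for demonstration, not recommendations from the cheat-sheet:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Different algorithms, same interface
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                     # identical call for every model
    results[name] = model.score(X_te, y_te)   # mean accuracy on the test split

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Swapping in another estimator only requires adding one entry to the dictionary, which keeps the time between experiments short.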

## 3. Fitting the model/algorithm and using it to make predictions on our data

Now that we have chosen a model for our data, it's time to have the model learn about our data so it can be used to make predictions.

We've actually encountered some examples of this in the earlier sections, when we were fitting the model to the sample datasets:

In [5]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(69)

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8524590163934426

Now, what is the `fit()` method actually doing when we call it on our data?

When we call the `fit()` method, the machine learning algorithm attempts to find patterns between `X` and `y`.

Let's take a look at our `X` and `y` data:

In [7]:
X_train.head()

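| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 |
| 117 | 56 | 1 | 3 | 120 | 193 | 0 | 0 | 162 | 0 | 1.9 | 1 | 0 | 3 |
| 124 | 39 | 0 | 2 | 94 | 199 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 |
| 237 | 60 | 1 | 0 | 140 | 293 | 0 | 0 | 170 | 0 | 1.2 | 1 | 2 | 3 |
| 122 | 41 | 0 | 2 | 112 | 268 | 0 | 0 | 172 | 1 | 0.0 | 2 | 0 | 2 |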


In [8]:
y_train.head()

77     1
117    1
124    1
237    0
122    1
Name: target, dtype: int64

When we call `fit(X, y)`, the model takes a look at all the examples in `X` (features) and sees what the corresponding `y` (label) is.

Each different model looks at the data differently, but for now you can imagine it being similar to how people notice patterns in data over time.

You'd look at the feature variables in `X`, such as `age`, `sex` and `chol` (cholesterol), and see which values led to which labels in `y`: `1` for heart disease, `0` for no heart disease.

This concept, regardless of the problem, is similar throughout all of machine learning.

**During training (finding patterns in data):**

A machine learning algorithm looks at a dataset, finds patterns, tries to use those patterns to predict something and corrects itself as best it can with the available data and labels. It stores these patterns for later use.

**During testing or in production (using learned patterns):**

A machine learning algorithm uses the patterns it has previously learned from a dataset to make predictions on unseen data.
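The two phases can be sketched in code. The example below is a minimal illustration on synthetic data (`make_classification`; all variable names are hypothetical). It shows that an unfitted model cannot predict, that `fit()` stores learned state in attributes ending with an underscore, and that `predict()` then uses that state on samples the model has not seen:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError

# Synthetic data: 150 samples for training, 50 held back as "unseen" data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_seen, y_seen = X_demo[:150], y_demo[:150]
X_unseen = X_demo[150:]

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Before training there are no learned patterns, so predicting raises an error
try:
    demo_clf.predict(X_unseen)
    failed_before_fit = False
except NotFittedError:
    failed_before_fit = True

# Training: fit() learns from the data and stores the result on the model
demo_clf.fit(X_seen, y_seen)
print(demo_clf.classes_)           # labels seen during training
print(len(demo_clf.estimators_))   # number of fitted decision trees

# Testing / production: the stored patterns are applied to unseen samples
unseen_preds = demo_clf.predict(X_unseen)
print(unseen_preds[:10])
```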

### Making predictions using a machine learning model

After fitting the model to the data (aka training the machine learning model), we'll want to use it to make predictions.

Scikit-Learn offers several ways to make predictions, the most common of which are `predict()` and `predict_proba()`.

Let's see how they work:

In [9]:
clf.predict(X_test)

array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0], dtype=int64)

Given data in the form of `X`, the `predict()` function returns labels in the form of `y`.

It's standard practice to save these predictions to a variable named something like `y_preds` for later comparison to `y_test` or `y_true` (usually same as `y_test` just another name):

In [10]:
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

0.8524590163934426

Now, where have we seen that kind of result before?

That's right, `accuracy_score()` and `clf.score()` make the same comparison for us:

In [11]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8524590163934426

In [13]:
clf.score(X_test, y_test)

0.8524590163934426

So far we've been dealing with the `predict()` function. The `predict()` function returns values in the same format as the labels in `y`.

If you would instead like to get the *probabilities* of getting a label based on the given data, you can instead use `predict_proba()`:

In [14]:
clf.predict_proba(X_test.head())

array([[0.9 , 0.1 ],
       [0.99, 0.01],
       [0.73, 0.27],
       [0.93, 0.07],
       [0.01, 0.99]])

In contrast, here's what `predict()` would've returned:

In [15]:
clf.predict(X_test.head())

array([0, 0, 0, 0, 1], dtype=int64)

As you can see, `predict_proba()` returns an array of values for each row of the input rather than just one single number per row like `predict()` does.

The values returned by `predict_proba()` are actually the *probabilities* of getting each label (in our case either `0` or `1`) given an input sample.

In [17]:
clf.predict_proba(X_test[:1])

array([[0.9, 0.1]])

This particular output means that for the first sample in `X_test` (selected with `X_test[:1]`), the model predicts the label `0` with a probability of `0.9`.

Conversely, this also means the sample would have a label `1` with probability score of `0.1`.

Given these two probabilities, which one do you think would we get if we used `predict()`?

In [18]:
clf.predict(X_test[:1])

array([0], dtype=int64)

We get a predicted label `0`.

Because our problem is a classification task, we can simply take the label with the highest probability when we want a single output label.

With our dataset only having two labels to choose from, predicting a label with 0.5 probability every time would be the same as a coin toss (guessing). Therefore, once the prediction probability of a sample passes 0.5 for a certain label, it's assigned that label.
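The relationship between the two methods can be checked directly. The sketch below uses synthetic data (`make_classification`, with hypothetical variable names) and verifies that each row of probabilities sums to 1 and that picking the most probable class for every row reproduces what `predict()` returns for a `RandomForestClassifier`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_clf.fit(X_tr, y_tr)

# One row per sample, one column per class (columns follow demo_clf.classes_)
probs = demo_clf.predict_proba(X_te)

# Pick the class with the highest probability in each row
labels_from_probs = demo_clf.classes_[np.argmax(probs, axis=1)]

print(np.allclose(probs.sum(axis=1), 1.0))
print(np.array_equal(labels_from_probs, demo_clf.predict(X_te)))
```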

### Predicting values for regression problems

`predict()` can also be used for regression problems:

In [22]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(69)

X = boston_df.drop("target", axis=1)
y = boston_df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)

# Make predictions
y_preds = model.predict(X_test)

For regression problems, we can evaluate our predictions using `mean_absolute_error()`:

In [23]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

2.1642941176470587
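To see what this number represents, here is a small sketch that computes the mean absolute error by hand, as the average of the absolute differences between true and predicted values, and compares it with Scikit-Learn's result. The four target and prediction values are made up for illustration and are not taken from the Boston dataset:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions, chosen only for illustration
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_demo = np.array([25.0, 20.6, 30.7, 35.4])

# Mean absolute error: average of |true - predicted| over all samples
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(manual_mae, sklearn_mae)
print(np.isclose(manual_mae, sklearn_mae))
```

Read this way, the score above says that the model's predictions are off by roughly 2.16 on average, in the same units as the target column.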