# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

## What is Scikit-Learn?

[Scikit-Learn](https://scikit-learn.org/stable/index.html), also referred to as `sklearn`, is an open-source, commercially usable Python machine learning library. Built on NumPy, SciPy, and Matplotlib, it provides simple, efficient tools that are accessible to everybody and reusable in various contexts.

![Scikit-Learn is used for modeling in machine learning projects.](./img/sklearn_6-step_ml_framework_tools_scikit-learn_highlight.png)

## What we're going to cover

This notebook focuses on the main use cases of the Scikit-Learn library. More specifically, we'll go through the typical workflow of a Scikit-Learn project step by step, building on our knowledge of Scikit-Learn as we go.

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problems
3. Fitting the model/algorithm and using it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Saving and loading a trained model
7. Putting it all together

**Note:** All of the steps in this notebook shall focus on **supervised learning** (having data and labels).

## 0. An end-to-end Scikit-Learn workflow

Scikit-Learn is a vast library containing a large variety of tools that can be used in many different contexts. As such, it helps to start with a typical end-to-end Scikit-Learn workflow and look at the most common use cases of the library.

[This notebook](./scikit-learn_workflow.ipynb) demonstrates one such typical Scikit-Learn workflow.

From there, we shall take a much closer look at each step in the process and build on the knowledge we gained using Scikit-Learn.

![Diagram of the Scikit-Learn workflow](img/sklearn_workflow.png)

## 1. Getting the data ready

### Standard imports

The very first step when working with a machine learning project is to import the necessary libraries and packages you'll be working with.

For this project, we shall keep using the usual NumPy, Pandas, and Matplotlib packages, so let's go ahead and import those right away:

In [1]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Now that we've imported the usual packages, let's get some data to work with.

Let's take a look at some heart disease data:

In [2]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


The three main things we have to do are:

- Split the data into features (usually called `X`) and labels (usually called `y`)
- Convert non-numeric values into numeric values (also called *feature encoding*)
- Fill (aka *impute*) or disregard missing values in the data

### Splitting the data

In [3]:
X = heart_disease.drop("target", axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [4]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

Now that we've split the data into features and labels, we also have to split them further into *training* and *test sets* that we can use to train and validate the machine learning models we're going to be making.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

Excellent! We now have training and test sets for our heart disease data.

When we look at the shapes of the training and test sets, notice we get tuples corresponding to the dimensions of the data. The second number in the `X` tuples means that `X` has `13` columns of data, while `y` is one-dimensional, so its shape contains a single number. But what about the `242` and `61`?

Let's take a look at the shape of our original data:

In [6]:
X.shape, y.shape

((303, 13), (303,))

In [7]:
len(heart_disease)

303

It looks like the original heart disease data has `303` rows.

When we did the `train_test_split()` call, notice we had a `test_size` parameter set to `0.2`.

This means 20% of the data rows should be allocated to the test set. Let's verify:

In [8]:
round(303 * 0.2)

61

In [9]:
303 - 61

242

Looks like it checks out. 242 rows were allocated for the training set while 61 rows were allocated for the test set.
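
The same check can be repeated without the CSV file. The sketch below is only an illustration: `X_demo` and `y_demo` are made-up arrays with the same shape as the heart disease features and labels, and `random_state=42` is an arbitrary choice that makes the shuffle repeatable between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same shape as the heart disease X and y (assumed data)
X_demo = np.arange(303 * 13).reshape(303, 13)
y_demo = np.arange(303)

# random_state fixes the shuffle so the same rows land in each set on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

split_shapes = (X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
print(split_shapes)
```

Every row ends up in exactly one of the two sets, so the training and test row counts always add up to the original `303`.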

### Converting data into numeric values

Luckily, the heart disease data provided to us is already in numeric form on all columns.

However, most other datasets might not have all their data in numerical form.

It is important to convert the non-numeric data into numerical form first, because most machine learning models only work on numeric inputs.

Let's take a look at another example dataset and see how we can convert these kinds of data.

In [10]:
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [11]:
len(car_sales)

1000

In [12]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

As you can see, the `Make` and `Colour` columns have non-numeric data.

Let's see what happens if we try and build a model on the dataset without first converting those columns.

In [13]:
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [14]:
# Try to predict with random forest on price column (doesn't work)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

Oh dear, looks like Scikit-Learn throws an error when we try and build a model this time.

The error message gives us a hint as to what went wrong:
> ValueError: could not convert string to float: 'Toyota'

This means the `RandomForestRegressor()` model only accepts numeric inputs. Otherwise, it throws an error when it tries to deal with non-numeric data.
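
One way to spot such columns before fitting is to ask Pandas which columns are not numeric. This is a small sketch on a made-up sample (`cars_demo` is hypothetical and only borrows the column names of `car_sales`):

```python
import pandas as pd

# Small made-up sample using the same column names as car_sales
cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Odometer (KM)": [35431, 192714, 154365],
    "Doors": [4, 5, 4],
})

# Columns whose dtype is not numeric need encoding before a model can use them
non_numeric_cols = cars_demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```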

Now, let's see how to get around that error by converting the non-numeric data into a numeric format that the model can work with.

In [15]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

The `categorical_features` list contains the columns that have non-numeric or *categorical* data.

Now, we're trying to convert the `Make` and `Colour` columns into a numeric format, but why is `Doors` also included in the list?

Let's take a look at the `Doors` column first:

In [16]:
car_sales["Doors"].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

The `Doors` column indeed contains numerical data, but at the same time it also has very few distinct values with little variance.

Thus, it can also make sense to treat those discrete values as different *categories* that the data can fall into.

In [17]:
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

Now that's quite a lot to take in. Let's try and convert it into a Pandas `DataFrame` so we can have an easier time reading the data:

In [18]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


Now, you might say: "Hold on, this doesn't look the same as the original data!", but let's take a step back and look at what's actually happened here.

For reference, here's the dataset we're working with:

In [20]:
X

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3
...,...,...,...,...
995,Toyota,Black,35820,4
996,Nissan,White,155144,3
997,Nissan,Blue,66604,4
998,Honda,White,215883,4


Hmmm... It looks like we do have the same values in the `Odometer (KM)` column in the `transformed_X` dataset.

Still, what's the deal with all the other columns?

The new columns are a result of the `OneHotEncoder` transforming the categorical data into multiple categories.

It does this by generating additional columns to represent each distinct category for the data.

For example, `OneHotEncoder` generated four columns to represent each of the car makes (`BMW`, `Honda`, `Nissan`, and `Toyota`). Each of those columns will then have a value of `0.0` if the car doesn't have the corresponding car make, or `1.0` if the car does have the corresponding car make. Note that only one column out of those four will have a `1.0` value, and the rest should have a value of `0.0`.

`OneHotEncoder` also does the same thing for the other `categorical_features` in the dataset.

#### Another way to encode categorical data

We can also use the `get_dummies()` function from the Pandas library to convert categorical data into numeric form.

Let's take a look at the original `car_sales` dataset:

In [21]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [22]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1


We see that the `Make` and `Colour` columns were split into different categories, but the `Doors` column was still left intact.

Again, since the `Doors` column is already numeric, we can leave it as-is, but if we want to treat it as categorical data, we can convert it into an `object` type so that the `get_dummies()` function splits it into categories:

In [23]:
car_sales["Doors"] = car_sales["Doors"].astype(object)
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,0,1,0,0,0,0,0,0,1,0,1,0
1,1,0,0,0,0,1,0,0,0,0,0,1
2,0,1,0,0,0,0,0,0,1,0,1,0
3,0,0,0,1,0,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,1,1,0,0,0,0,0,1,0
996,0,0,1,0,0,0,0,0,1,1,0,0
997,0,0,1,0,0,1,0,0,0,0,1,0
998,0,1,0,0,0,0,0,0,1,0,1,0


Great! Now that we've successfully transformed the data into numeric form, we can try and fit a model on our transformed dataset.

In [25]:
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.31695249778476753

Fantastic! Looks like we've successfully fitted a model to the data without getting an error this time, all thanks to transforming our data into numeric form.

### Dealing with missing values

There are two ways to deal with missing data.

- Fill in the missing parts with a predetermined value. This approach is also known as **_imputation_**.
- Remove the rows containing missing data altogether. Note that this results in having less data to work with.

**Note:** How best to deal with missing values depends on the problem, and there's often no single best way to do it.

In [26]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


We can see that some values in this dataset are missing (`NaN`).

One way to quickly check exactly how much missing data is in your dataset is to use the `isna()` method from Pandas:

In [27]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

As you can see, we have quite a lot of missing fields in our dataset.

What happens if we try and fit a model on a dataset with missing data?

In [28]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [30]:
# Transform categorical data into numeric form
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

# Try and fit a model
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

We get another error this time:
> ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

This means Scikit-Learn also needs the data to contain no missing values.

Let's see how we can deal with this missing data before we try and fit a model on this dataset.

#### Option 1: Use Pandas to deal with missing data

We can use Pandas to fill in the missing values of the dataset.

For numerical values we can simply use the mean of the existing data in the column, and for categorical data we can use some other predetermined value instead.

In [31]:
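# Assign the filled columns back to the DataFrame
# (calling fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas versions)
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("Missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("Missing")
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)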
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,Missing,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [34]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

Notice that the `Price` column still has missing values.

Since `Price` is the value we're trying to predict, it might be better to just remove the rows with no `Price` value for now:

In [35]:
car_sales_missing.dropna(inplace=True)

In [36]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

We now have no missing values in our data, but it came at a cost of having to remove a number of rows that had no `Price` value.

In [37]:
len(car_sales_missing)

950

Now let's try fitting a model to this dataset:

In [38]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [39]:
# Transform categorical data into numeric form
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

# Try and fit a model
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.25296341556528734

#### Option 2: Use Scikit-Learn to deal with missing data

We can also use Scikit-Learn to deal with missing data.

Let's use the `car_sales_missing` dataset once again, this time using Scikit-Learn to fill in or remove rows with missing data.

In [40]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

Once again, we can drop the rows with the missing `Price` values for now.

In [41]:
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

Let's also split the remaining data into training and test sets:

In [42]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test
np.random.seed(69)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now, let's fill in the missing data using Scikit-Learn.

Note that it is best practice to use the following Scikit-Learn functions to fill and transform missing data separately on the training and test sets.

Why is that? Performing imputation on the whole un-split dataset causes "information leakage", which is when information contained in the test set is "leaked" into the training set. The result is a biased estimator with an optimistic test error. The test set should be set aside at the beginning of any machine learning project and only be touched when validating the model.

Here are some guidelines to keep in mind when handling missing data:
- Split your data first (into train/test), always keep your training & test data separate
- Fill/transform the training set and test sets separately (this goes for filling data with pandas as well)
- Don't use data from the future (test set) to fill data from the past (training set)
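
The guidelines above can be sketched with Pandas alone. The values below are made up for illustration (`train_demo` and `test_demo` are hypothetical). The point is that the fill value is computed from the training rows only and then reused for the test rows.

```python
import numpy as np
import pandas as pd

# Made-up odometer readings, already split into train and test
train_demo = pd.DataFrame({"Odometer (KM)": [20000.0, np.nan, 40000.0, 60000.0]})
test_demo = pd.DataFrame({"Odometer (KM)": [np.nan, 90000.0]})

# Learn the fill value from the training set only (NaN is skipped by mean())
train_mean = train_demo["Odometer (KM)"].mean()

# Apply the same training-set mean to both sets
train_filled = train_demo.fillna({"Odometer (KM)": train_mean})
test_filled = test_demo.fillna({"Odometer (KM)": train_mean})

print(train_mean)
print(test_filled["Odometer (KM)"].tolist())
```

Computing the mean over the combined rows instead would let the test value of `90000.0` influence how the training set is filled, which is the leakage described above.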

In [43]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Use SimpleImputer to fill in missing values
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

Now that we have an imputer set up, it's time to use it to actually fill in the missing values in the dataset.

> **Note:** We use `fit_transform()` on the training data and `transform()` on the testing data. In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). Then we take those same patterns and fill the test set (transform only).
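
To see what "learning the patterns" means for a mean imputer, here is a minimal sketch on made-up numbers (`train_km`, `test_km`, and `mean_imputer` are hypothetical names). After fitting, `SimpleImputer` stores the learned fill values in its `statistics_` attribute, and `transform()` reuses them.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with missing values
train_km = np.array([[10000.0], [np.nan], [30000.0], [50000.0]])
test_km = np.array([[np.nan], [80000.0]])

mean_imputer = SimpleImputer(strategy="mean")
filled_train = mean_imputer.fit_transform(train_km)  # learns the training mean, then fills
filled_test = mean_imputer.transform(test_km)        # fills with the mean learned above

print(mean_imputer.statistics_)
print(filled_test.ravel())
```

The missing test value is filled with the training mean, not with a mean that includes the test rows.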

In [44]:
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)

filled_X_train

array([['BMW', 'White', 5.0, 152410.0],
       ['Nissan', 'Green', 4.0, 87701.0],
       ['Nissan', 'White', 4.0, 51004.0],
       ...,
       ['Honda', 'White', 4.0, 40134.0],
       ['Nissan', 'Black', 4.0, 125251.0],
       ['Missing', 'White', 4.0, 109384.0]], dtype=object)

Let's view the result as a Pandas `DataFrame` to get a better look at what just happened:

In [45]:
car_sales_filled_train = pd.DataFrame(filled_X_train,
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_test = pd.DataFrame(filled_X_test,
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_train

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,BMW,White,5.0,152410.0
1,Nissan,Green,4.0,87701.0
2,Nissan,White,4.0,51004.0
3,Honda,Blue,4.0,30120.0
4,Toyota,Blue,4.0,132327.821823
...,...,...,...,...
755,Honda,White,4.0,193179.0
756,Honda,Blue,4.0,196507.0
757,Honda,White,4.0,40134.0
758,Nissan,Black,4.0,125251.0


In [46]:
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

Wonderful! We've just used Scikit-Learn to fill in the missing values.

Now, there's just one more step before we can fit a model to this dataset.

Let's revisit a previous topic and convert the categorical features into a numeric form.

In [48]:
# One-hot encode the features with the same code as before 
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

# Fit the encoder on the training set, then apply it to the test set
transformed_X_train = transformer.fit_transform(car_sales_filled_train)
transformed_X_test = transformer.transform(car_sales_filled_test)

# Check transformed and filled X_train
transformed_X_train

<760x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3040 stored elements in Compressed Sparse Row format>

In [51]:
pd.DataFrame(transformed_X_train.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,152410.000000
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,87701.000000
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,51004.000000
3,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,30120.000000
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,132327.821823
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,193179.000000
756,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,196507.000000
757,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,40134.000000
758,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,125251.000000
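
One thing to watch when fitting the encoder on the training set only: with its default settings, `OneHotEncoder` raises an error if the test set contains a category it never saw during fitting. A possible workaround, sketched here on made-up data (`train_makes`, `test_makes`, and `tolerant_encoder` are hypothetical names), is the `handle_unknown="ignore"` option, which encodes unseen categories as a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "Toyota" never appears in the training set
train_makes = np.array([["BMW"], ["Honda"], ["Nissan"]])
test_makes = np.array([["Honda"], ["Toyota"]])

tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train_makes)

# The unseen category becomes an all-zero row instead of raising an error
encoded_test = tolerant_encoder.transform(test_makes).toarray()
print(encoded_test)
```

Whether an all-zero row is an acceptable representation depends on the problem, so treat this as one option rather than a default.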


Now that we've done all the necessary steps to get our data ready, let's see if we can fit a machine learning model this time.

In [52]:
np.random.seed(69)
model = RandomForestRegressor()
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.2603174739059788

Great work! We've successfully prepared our dataset by splitting into training and test sets, handling missing data, and converting categorical features into a numerical form. That means we are now able to use these steps to transform any other dataset in order to better fit machine learning models.

If this looks confusing, don't worry, we've covered a lot of ground very quickly. We'll revisit these strategies in a future section in a way that makes a lot more sense.

For now, the key takeaways to remember are:

- Most datasets you come across won't be ready to use with machine learning models straight away, and some take more preparation than others.
- For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as **feature engineering** or **feature encoding**.
- Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as **data imputation**.