# Intro To Scikit-learn

Normal workflow:

1. Get the data ready
2. Pick a model that suits the problem
3. Fit the model to the data and make a prediction
4. Evaluate the model
5. Improve through experimentation
6. Save and reload the trained model

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


## 0. Quick Preview of Entire Workflow!

### 1. Get the Data Ready

In [6]:
# Import data and see what it looks like
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [7]:
# The model will predict whether `target` is 1 or 0, so split the data into features and labels
# Create X (features matrix)
X = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

### 2. Pick a model and hyperparameters that fit the problem

In [10]:
# Import classification model from sklearn
from sklearn.ensemble import RandomForestClassifier # Can classify data (1 or 0)
clf = RandomForestClassifier() # clf = classifier 

## Keep default hyperparameters
clf.get_params() # see model's hyperparameters

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 3. Fit the model to the data


In [11]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

# Split X (features) and y (labels) into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # test_size is the fraction of rows that go into the test set (0.2 = 20%)

In [12]:
# Tell sklearn to fit to the training data
clf.fit(X_train, y_train)

RandomForestClassifier()

In [14]:
# Make predictions on the test set
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1], dtype=int64)

### 4. Evaluate the Model

In [15]:
# How well has model predicted data on the training set?
clf.score(X_train, y_train) # 1.0 makes sense because it was trained on this data

1.0

In [17]:
# How well has the model predicted data on the test set?
clf.score(X_test, y_test) # usually lower than the training score because the model has not seen this data

0.8524590163934426

In [18]:
# Use other metrics to evaluate the model's predictions
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Classification Report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.86      0.75      0.80        24
           1       0.85      0.92      0.88        37

    accuracy                           0.85        61
   macro avg       0.85      0.83      0.84        61
weighted avg       0.85      0.85      0.85        61



In [19]:
# Confusion Matrix
confusion_matrix(y_test, y_preds)

array([[18,  6],
       [ 3, 34]], dtype=int64)
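
The figures in the classification report can be recovered by hand from this confusion matrix. Here is a minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are illustrative only:

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, columns = predicted class
cm = np.array([[18, 6],
               [3, 34]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of everything predicted as 1, how much really was 1
recall_1 = tp / (tp + fn)      # of everything that really was 1, how much was found
precision_0 = tn / (tn + fn)   # same idea, treating 0 as the class of interest
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / cm.sum()

print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Rounded to two decimals, these should line up with the precision, recall, and accuracy columns printed by `classification_report` above.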

In [20]:
# Accuracy Score
accuracy_score(y_test, y_preds)

0.8524590163934426

### 5. Improve the Model

Models can often be improved by experimenting with different hyperparameter values.

In [23]:
# Try different values of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model Accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")

Trying model with 10 estimators...
Model Accuracy on test set: 80.33%

Trying model with 20 estimators...
Model Accuracy on test set: 90.16%

Trying model with 30 estimators...
Model Accuracy on test set: 85.25%

Trying model with 40 estimators...
Model Accuracy on test set: 85.25%

Trying model with 50 estimators...
Model Accuracy on test set: 88.52%

Trying model with 60 estimators...
Model Accuracy on test set: 88.52%

Trying model with 70 estimators...
Model Accuracy on test set: 86.89%

Trying model with 80 estimators...
Model Accuracy on test set: 85.25%

Trying model with 90 estimators...
Model Accuracy on test set: 85.25%
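
With only 61 rows in the test set, one prediction changes the accuracy by about 1.6 percentage points, so the differences above may be mostly noise. One way to get a steadier comparison is cross-validation. The sketch below uses `cross_val_score` on a synthetic stand-in dataset from `make_classification` (an assumption, so the block runs without the CSV); the numbers it prints say nothing about the heart disease data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the heart-disease features (303 rows, 13 columns)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

results = {}
for n in (10, 50, 100):
    clf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: five train/validation splits, five accuracy scores
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
    results[n] = scores.mean()
    print(f"{n:>3} estimators: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing the mean across folds, alongside the spread, gives a better sense of whether a change in `n_estimators` actually helps.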



### 6. Save and Load Models
Use Python's built-in `pickle` module.

In [26]:
import pickle

### Save the Model

In [27]:
# pickle.dump() saves the model
# Pass in the fitted sklearn model object and an open file handle
# "wb" means write binary; the `with` block closes the file when it finishes
with open("models/heart-disease-random-forest.pkl", "wb") as f:
    pickle.dump(clf, f)

### Load a Model

In [28]:
# "rb" means read binary
with open("models/heart-disease-random-forest.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.8524590163934426
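
A quick way to check that saving and loading worked is to confirm the reloaded model makes the same predictions as the original. This is a minimal sketch on synthetic data, writing to a temporary directory; the file name and variable names are hypothetical:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset and model, just to have something to save
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo-random-forest.pkl"  # hypothetical file name

    # Save: the `with` block closes the file even if pickling fails
    with open(model_path, "wb") as f:
        pickle.dump(model_demo, f)

    # Load the model back from disk
    with open(model_path, "rb") as f:
        reloaded = pickle.load(f)

# The reloaded model should predict exactly what the original predicts
same_predictions = np.array_equal(model_demo.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can run arbitrary code.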

---

# In-Depth Sections

---

## 1. Getting the Data Ready
---

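For the steps below, it's important to **fit data transformations on the training set only and then apply them to the training and test sets separately**, so that no information from the test set leaks into training.

Main things to do: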
- Split the data into features and labels (usually `X` and `y`)
- Split data into training and test sets (maybe 20% of data in test set)
- Filling (also called **imputing**) or disregarding missing values
- Converting non-numerical values to numerical values (AKA **feature encoding**)
- (Optional) Reduce the amount of data you use in the model
  - Also called **Dimensionality** reduction or column reduction
  - This can help save compute time and money by getting rid of data that might not be useful
- (Optional) **Feature Scaling** to make sure all the numerical data is on the same scale, which helps many ML models find patterns. There are two main ways of doing this:
  - **Normalization (min-max scaling)**: Rescales all numerical values to be between 0 and 1. Scikit-learn provides this in the [MinMaxScaler class](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
  - **Standardization**: Subtracts the mean from each feature and then divides by the standard deviation, so each feature ends up with zero mean and unit variance. Scikit-learn provides this in the [StandardScaler class](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
  - Feature Scaling usually isn't required for your target variable
  - Feature scaling is not usually required with tree-based models (random forest) since they can handle varying features
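
As a small illustration of the two scaling methods, the sketch below fits each scaler on a tiny made-up training column and then reuses it on a test value. The arrays and variable names are assumptions chosen so the arithmetic is easy to follow:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature column
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[4.0]])

# Fit on the training data only, so the test data does not influence the scaling
min_max = MinMaxScaler().fit(X_train_demo)
standard = StandardScaler().fit(X_train_demo)

train_min_max = min_max.transform(X_train_demo)    # training values land in [0, 1]
test_min_max = min_max.transform(X_test_demo)      # unseen values can fall outside [0, 1]
train_standard = standard.transform(X_train_demo)  # zero mean, unit standard deviation

print(train_min_max.ravel(), test_min_max.ravel())
print(train_standard.ravel())
```

Because the scaler only learned the training range (1 to 3), the test value 4 is mapped above 1, which is expected behavior rather than a bug.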

In [30]:
# Still have this imported
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### 1.1 - Split into Features and Labels

In [31]:
# Features do not include `target` column, which is whether or not the person has heart disease
X = heart_disease.drop("target", axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [32]:
# Label is the 'target' column
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [33]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Check shapes of data sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
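
`train_test_split` shuffles the rows randomly, so the split changes on every run unless a seed is fixed. The sketch below shows two optional parameters, `random_state` (reproducible split) and `stratify` (keep the class balance similar in both sets), on made-up data with the same number of rows as `heart_disease`. The class counts are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 303 rows and 13 columns, values are made up
X_demo = np.arange(303 * 13).reshape(303, 13)
y_demo = np.array([0] * 138 + [1] * 165)

# Same random_state -> same split every time; stratify keeps the 0/1 ratio similar
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(X_train_demo.shape, X_test_demo.shape)
print(y_demo.mean(), y_train_demo.mean(), y_test_demo.mean())
```

The test set size is rounded up (20% of 303 is 60.6, so 61 rows), which matches the shapes shown above.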

### 1.2 - Convert All Data to Be Numeric



In [35]:
# Need data that has non-numeric data so we can practice changing it
car_sales = pd.read_csv("data/car-sales-extended.csv")
len(car_sales), car_sales.head()

(1000,
      Make Colour  Odometer (KM)  Doors  Price
 0   Honda  White          35431      4  15323
 1     BMW   Blue         192714      5  19943
 2   Honda  White          84714      4  28343
 3  Toyota  White         154365      4  13434
 4  Nissan   Blue         181577      3  14043)

In [36]:
# Check the dtypes: not every column is numeric
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [64]:
# Turn the non-numeric categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Separate into X (features) and y (labels)
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# List the categorical features that need to be transformed
# Note: "Doors" is treated as categorical because cars fall into groups by number of doors
categorical_features = ["Make", "Colour", "Doors"]

# Set up OneHotEncoder (see explanation below)
one_hot = OneHotEncoder()

# Set up the transformer
# Apply the one_hot encoder we created (named "one_hot") to the categorical_features
# "passthrough" leaves the remaining (already numeric) columns untouched
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

# Fit the transformer and transform the data
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

#### One Hot Encoder

One-hot encoding takes a single column of categorical data and creates a new column for each unique category. Each row gets a 1 in the column that matches its category and a 0 in all the others. Here's an example graphic:

![](../images/one-hot-encoder.png)

In [65]:
# See transformed data in Pandas DataFrame
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


#### Another Way to Convert with Pandas

By default, `pd.get_dummies()` only encodes string-like (object or category) columns, so categorical data that is already stored as numbers (like Doors in our example) passes through unchanged. Pandas does name the new columns more helpfully, though, so you can see what one-hot encoding is doing.

In [40]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
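
Notice that `Doors` was left as a single numeric column in the output above. If numeric categories should be encoded too, one option is the `columns` parameter of `pd.get_dummies()`, sketched here on a small made-up DataFrame:

```python
import pandas as pd

# Made-up rows in the same shape as the car sales features
cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 3],
})

# Default: only the string columns are encoded, "Doors" stays numeric
default_dummies = pd.get_dummies(cars_demo)

# Listing the columns explicitly encodes "Doors" as well
forced_dummies = pd.get_dummies(cars_demo, columns=["Make", "Colour", "Doors"])

print(list(default_dummies.columns))
print(list(forced_dummies.columns))
```

Unlike `OneHotEncoder`, `get_dummies()` does not remember the categories it saw, so new data with different categories can produce different columns.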


In [66]:
# Try to fit the model with converted data
np.random.seed(24)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# Build and train model to predict car price
from sklearn.ensemble import RandomForestRegressor # Predicts a continuous number (regression)
model = RandomForestRegressor()

model.fit(X_train, y_train)
model.score(X_test, y_test)


0.2364711420373471

### 1.3 - Handle Missing Values in Data

Two main ways to deal with missing values:
1. Fill them with some other value (**imputation**)
2. Remove the samples with missing data

In [47]:
# Get some data that has missing values
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [48]:
# Pandas represents missing values as `NaN`, so we can count how many exist in each column
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

#### 1.3.1 - Fill Missing Data With Values (Imputation)
Pandas can do this with `fillna()`.

In [49]:
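# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update the original DataFrame in newer versions of pandas

# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the mean of that column
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column (assumption that the average car has 4 doors)
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)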

# Check dataframe again
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64
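
Filling with the mean of the whole column uses information from rows that will later end up in the test set. To keep the training and test sets separate, the fill value can be learned from the training rows only and then reused. Here is a minimal sketch with scikit-learn's `SimpleImputer` on made-up odometer readings:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up odometer readings with missing values
odometer_train = np.array([[10_000.0], [np.nan], [30_000.0]])
odometer_test = np.array([[np.nan], [50_000.0]])

mean_imputer = SimpleImputer(strategy="mean")
train_filled = mean_imputer.fit_transform(odometer_train)  # learns the mean from the training rows
test_filled = mean_imputer.transform(odometer_test)        # reuses the training mean, no refit

print(mean_imputer.statistics_)
print(test_filled.ravel())
```

The missing test value is filled with 20,000 (the training mean), not with anything computed from the test rows.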

#### 1.3.2 - Remove Rows With Missing Data

In [50]:
# Remove rows with NaN values
# At this point there are only Price values missing. Since that's the target label, we don't
# necessarily want to try to fill it with other data
car_sales_missing.dropna(inplace=True)
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [57]:
# Resplit the data now that NaN values are gone
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Turn the non-numeric categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# List the categorical features that need to be transformed
# Note: "Doors" is treated as categorical because cars fall into groups by number of doors
categorical_features = ["Make", "Colour", "Doors"]

# Set up OneHotEncoder (explained in section 1.2)
one_hot = OneHotEncoder()

# Set up the transformer
# Apply the one_hot encoder we created (named "one_hot") to the categorical_features
# "passthrough" leaves the remaining (already numeric) columns untouched
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

# Fit the transformer and transform the data
# Caution: passing the whole DataFrame (rather than X) keeps "Price" in the output,
# as the last column below shows; pass X to keep the label out of the features
transformed_X = transformer.fit_transform(car_sales_missing)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

#### 1.3.3 - Filling Missing Values With Only Scikit-learn

In [68]:
# Reimport data so we have the missing values
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.head(), car_sales_missing.isna().sum()

(     Make Colour  Odometer (KM)  Doors    Price
 0   Honda  White        35431.0    4.0  15323.0
 1     BMW   Blue       192714.0    5.0  19943.0
 2   Honda  White        84714.0    4.0  28343.0
 3  Toyota  White       154365.0    4.0  13434.0
 4  Nissan   Blue       181577.0    3.0  14043.0,
 Make             49
 Colour           50
 Odometer (KM)    50
 Doors            50
 Price            50
 dtype: int64)

In [70]:
# Remove rows that are missing "Price" since that's our target label and we don't want to guess at that
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [71]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [72]:
# Fill missing values with Scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with the mean
# Create an imputer with the "constant" strategy (always the same fill value) and the value to fill with
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")

# The "Doors" column is categorical but stored as numbers, so it gets its own imputer
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# Create an imputer to fill the missing numerical values with the mean
num_imputer = SimpleImputer(strategy="mean")

# Define column names
categorical_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer to fill the missing values
# ColumnTransformer takes 3-tuples of imputer name, imputer object, column names
imputer = ColumnTransformer([
    ("categorical_imputer", categorical_imputer, categorical_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data (the output columns follow the order of the transformers above)
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [73]:
# Check that all values were filled by putting it into a Pandas DataFrame
car_sales_filled = pd.DataFrame(filled_X, columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled.head(), car_sales_filled.isna().sum()

(     Make Colour Doors Odometer (KM)
 0   Honda  White   4.0       35431.0
 1     BMW   Blue   5.0      192714.0
 2   Honda  White   4.0       84714.0
 3  Toyota  White   4.0      154365.0
 4  Nissan   Blue   3.0      181577.0,
 Make             0
 Colour           0
 Doors            0
 Odometer (KM)    0
 dtype: int64)

In [74]:
# Turn the non-numeric categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# List the categorical features that need to be transformed
# Note: "Doors" is treated as categorical because cars fall into groups by number of doors
categorical_features = ["Make", "Colour", "Doors"]

# Set up OneHotEncoder (explained in section 1.2)
one_hot = OneHotEncoder()

# Set up the transformer
# Apply the one_hot encoder we created (named "one_hot") to the categorical_features
# "passthrough" leaves the remaining (already numeric) columns untouched
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

# Fit the transformer and transform the data
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>
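
This time the result is a sparse matrix rather than a NumPy array. A likely explanation is `ColumnTransformer`'s `sparse_threshold` parameter (default 0.3): when any transformer returns sparse output, the combined result stays sparse if its density is below the threshold. Here 3800 stored values out of 950 x 15 is about 0.27, while the earlier 13-column result would have been about 0.31. The sketch below shows how to control this on a small made-up DataFrame; the `encode` helper is hypothetical:

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: one categorical column and one numeric column
cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Odometer (KM)": [35431, 192714, 84714],
})

def encode(sparse_threshold):
    """Hypothetical helper: one-hot encode "Make" with a given sparse_threshold."""
    ct = ColumnTransformer(
        [("one_hot", OneHotEncoder(), ["Make"])],
        remainder="passthrough",
        sparse_threshold=sparse_threshold,
    )
    return ct.fit_transform(cars_demo)

always_dense = encode(0)       # 0 means: always return a dense array
maybe_sparse = encode(1.0)     # sparse whenever the density is below 1.0
dense_view = maybe_sparse.toarray()  # convert a sparse result to a NumPy array to inspect it

print(sparse.issparse(always_dense), sparse.issparse(maybe_sparse))
print(dense_view)
```

Scikit-learn estimators such as `RandomForestRegressor` accept sparse input, so converting is only needed for viewing the data.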

In [75]:
# Now we have the data as all numbers and with filled values
# Let's fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.21990196728583944
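
The cells above fill and encode the whole dataset before splitting it. One way to fit the transformations on the training set only is to chain them with the model in a scikit-learn `Pipeline`. The sketch below is one possible arrangement, not the notebook's method; it runs on randomly generated stand-in data, so the score it prints is meaningless, and `handle_unknown="ignore"` is an added assumption to cope with categories that appear only in the test set:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Randomly generated stand-in for the car sales data (values are made up)
rng = np.random.default_rng(42)
n_rows = 200
cars_demo = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], n_rows),
    "Colour": rng.choice(["White", "Blue", "Red"], n_rows),
    "Odometer (KM)": rng.integers(10_000, 250_000, n_rows).astype(float),
    "Doors": rng.choice([3.0, 4.0, 5.0], n_rows),
    "Price": rng.integers(4_000, 30_000, n_rows).astype(float),
})
# Knock out some feature values to imitate missing data
cars_demo.loc[rng.choice(n_rows, 20, replace=False), "Make"] = np.nan
cars_demo.loc[rng.choice(n_rows, 20, replace=False), "Odometer (KM)"] = np.nan

X_demo = cars_demo.drop("Price", axis=1)
y_demo = cars_demo["Price"]
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Impute, then one-hot encode, the text columns
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# "Doors" is numeric but treated as a category, so it gets its own fill value
door_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("categorical", categorical_steps, ["Make", "Colour"]),
    ("doors", door_steps, ["Doors"]),
    ("numeric", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
model_demo = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# fit() learns fill values and categories from the training rows only;
# score() applies those same learned transformations to the test rows
model_demo.fit(X_train_demo, y_train_demo)
test_score = model_demo.score(X_test_demo, y_test_demo)
print(test_score)
```

Bundling the steps this way also means the whole pipeline, preprocessing included, can be saved and reloaded as a single object.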