# A Simple Scikit-Learn (sklearn) Classification Workflow

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn package.

What we're going to cover:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problems
3. Fitting the model/algorithm and using it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Saving and loading a trained model
7. Putting it all together

<img src="img/sklearn_workflow.png"/>

## 0. An end-to-end Scikit-Learn workflow

### Import the essential packages

In [1]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## 1. Getting the data ready

In [2]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease

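| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |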


We then divide the data into a *features matrix* called `X`, and a target column `y`. As the name implies, the features matrix contains the *features* of the dataset, or the columns that we're going to use to try and predict the target column.

In [3]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
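To make the split concrete, here is a small sketch that retypes the first five rows of the table above into memory (so it runs without the CSV file) and applies the same two lines. The names `csv_text`, `sample`, `X_sample`, and `y_sample` are made up for this example.

```python
import io

import pandas as pd

# First five rows of the table shown above, retyped so the example
# does not depend on ../data/heart-disease.csv being available.
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
"""
sample = pd.read_csv(io.StringIO(csv_text))

X_sample = sample.drop("target", axis=1)  # every column except the target
y_sample = sample["target"]               # the column we want to predict

print(X_sample.shape)  # (5, 13)
print(y_sample.shape)  # (5,)
```

`drop()` returns a new DataFrame, so the original data still contains the `target` column afterwards.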

## 2. Choosing the right estimator/algorithm for our problems

Now that we have the features matrix and the target we need to predict, it's time to choose an algorithm to predict that target. Choosing well means picking a model that suits the problem, and then tuning its hyperparameters so that it gives the best predictions.

For now, let's just use the `RandomForestClassifier` model. The `RandomForestClassifier` model is one of many classification models provided by Scikit-Learn. We'll explore the finer details about different kinds of models in the upcoming sections. Let's also use the default hyperparameters for now.

For future reference, you can also use the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to help you choose the right model to use for your own dataset.

<img src="img/sklearn_ml_map.png" width=500/>

In [4]:
# Import the model
from sklearn.ensemble import RandomForestClassifier
# Instantiate the model
clf = RandomForestClassifier()
# Show the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
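The defaults are only a starting point. As a rough sketch, hyperparameters can be passed when the model is created or changed later with `set_params()`. The values below (`n_estimators=50`, `max_depth=5`, `n_estimators=200`) and the name `clf_custom` are arbitrary picks for illustration, not recommendations. Also keep in mind that the exact list of parameters and their defaults depends on the installed Scikit-Learn version, so your own `get_params()` output may differ from the one above.

```python
from sklearn.ensemble import RandomForestClassifier

# Override two defaults at construction time (values chosen arbitrarily).
clf_custom = RandomForestClassifier(n_estimators=50, max_depth=5)
params = clf_custom.get_params()
print(params["n_estimators"], params["max_depth"])  # 50 5

# Hyperparameters can also be changed on an existing, not yet fitted, estimator.
clf_custom.set_params(n_estimators=200)
print(clf_custom.get_params()["n_estimators"])  # 200
```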

## 3. Fitting the model/algorithm and using it to make predictions on our data

Now that we have a model we can use, it's time to fit the model to the training data.

For that, we will need to split our dataset into training and test sets, so the model can later be evaluated on data it was not fitted on.

Fortunately, Scikit-Learn provides a handy tool to help us split our dataset. Let's take a look:

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Great! We now have a proper split of training and test data.
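Because the split is random, every run of the cell above produces a different train/test partition. A minimal sketch of a reproducible split, using made-up toy arrays (`X_demo`, `y_demo`) rather than the heart disease data: `random_state` fixes the shuffle, and `stratify` keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 2 features, balanced classes.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)
print(np.bincount(y_te))       # [10 10]
```

Neither argument is required for the workflow in this notebook; they mainly help when you want to compare results across runs.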

Now, it's time to fit the model to the training set:

In [6]:
clf.fit(X_train, y_train);

OK, we've fit the model to the training data. Let's see how well it does by making some predictions on the test data:

In [7]:
y_preds = clf.predict(X_test)
y_preds

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0], dtype=int64)

In [8]:
y_test

106    1
273    0
214    0
178    0
60     1
      ..
82     1
249    0
264    0
219    0
242    0
Name: target, Length: 61, dtype: int64

We've now got a bunch of `0`s and `1`s, and our predictions *look like* the test set target.

But how do we check if the predictions line up with the test set targets or not?
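One direct way to answer this is to compare the two sequences element by element. The sketch below uses only the first five values visible in each output above. `predict()` returns its results in the same row order as `X_test`, so the positions line up with `y_test`. The names `preds_head` and `actual_head` are made up for this example.

```python
import numpy as np

# First five values copied from the two outputs above.
preds_head = np.array([0, 1, 0, 0, 1])   # start of y_preds
actual_head = np.array([1, 0, 0, 0, 1])  # start of y_test (rows 106, 273, 214, 178, 60)

matches = preds_head == actual_head  # True wherever the prediction was right
print(matches)         # [False False  True  True  True]
print(matches.mean())  # 0.6
```

The mean of the boolean array is the fraction of correct predictions, which is exactly what accuracy measures. Five rows are far too few to judge the model, of course.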

## 4. Evaluating the model

Scikit-Learn also gives us a handy tool to evaluate how well the model fits our dataset: the `score()` method.

In [9]:
clf.score(X_train, y_train)

1.0

Oh, wow. A perfect score of `1.0`. How nice!

But all we did was score the model on the same training set it was fitted on. A random forest can largely memorize its training data, so a perfect (or near-perfect) training score says little about how the model handles data it has not seen.

Let's now try again using the test set this time:

In [10]:
clf.score(X_test, y_test)

0.7868852459016393

Not bad! We can see that the model gives a pretty good mean accuracy.
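The gap between training and test accuracy is easy to reproduce. Here is a minimal sketch on synthetic data from `make_classification` (an assumption made so the example runs on its own; it is not the heart disease dataset, and the sizes and seeds are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 rows, 13 features.
X_syn, y_syn = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

train_score = model.score(X_tr, y_tr)  # scored on rows the model has already seen
test_score = model.score(X_te, y_te)   # scored on held-out rows

print(f"train accuracy: {train_score:.3f}")
print(f"test accuracy:  {test_score:.3f}")
```

The training accuracy is typically at or very near `1.0`, while the test accuracy is the more honest estimate of how the model will do on new data.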

Since we're dealing with a classification model, let's also explore some other ways to evaluate our classification model.

Again, Scikit-Learn gives us a variety of tools for these kinds of things!

In [11]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

First, let's try using the `classification_report` tool:

In [12]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.81      0.78      0.79        32
           1       0.77      0.79      0.78        29

    accuracy                           0.79        61
   macro avg       0.79      0.79      0.79        61
weighted avg       0.79      0.79      0.79        61



Hoo boy, that's quite a bit to take in!

Let's try and interpret what this classification report says:

In a nutshell, the classification report compares the actual targets from the test set with the model's predictions and summarizes the result per class. *Precision* is the fraction of predictions of a class that were correct, *recall* is the fraction of actual members of a class that the model found, *f1-score* is the harmonic mean of the two, and *support* is the number of test samples in each class.

Next, let's try using the confusion matrix:

In [13]:
confusion_matrix(y_test, y_preds)

array([[25,  7],
       [ 6, 23]], dtype=int64)
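The confusion matrix and the classification report describe the same predictions, so the report's numbers can be recomputed from the matrix by hand. The sketch below copies the matrix from the output above and relies on Scikit-Learn's convention that rows are actual classes and columns are predicted classes. The variable names are made up for this example.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = np.array([[25, 7],
               [6, 23]])

correct = np.diag(cm)                 # correctly predicted samples per class
precision = correct / cm.sum(axis=0)  # divide by column totals (all predictions of a class)
recall = correct / cm.sum(axis=1)     # divide by row totals (all actual members of a class)
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()   # 48 correct out of 61

print(np.round(precision, 2))  # [0.81 0.77]
print(np.round(recall, 2))     # [0.78 0.79]
print(np.round(f1, 2))         # [0.79 0.78]
print(round(accuracy, 2))      # 0.79
```

These values match the per-class rows and the `accuracy` row of the classification report printed earlier.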

Let's also try the accuracy score:

In [14]:
accuracy_score(y_test, y_preds)

0.7868852459016393

Notice that we got the same result as the `clf.score()` earlier. Pretty good stuff!
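The match is not a coincidence: for classifiers, `score()` reports the mean accuracy of the model's predictions on the data you pass in. A small sketch on synthetic data (an assumption so that the example is self-contained; sizes and seeds are arbitrary) shows the two routes giving the same number:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the heart disease dataset.
X_syn, y_syn = make_classification(n_samples=200, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=1)
model = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_tr, y_tr)

via_score = model.score(X_te, y_te)                     # predicts internally, then scores
via_metric = accuracy_score(y_te, model.predict(X_te))  # predict first, score separately

print(via_score == via_metric)  # True
```

Keeping the predictions around, as in the second route, is handy when you also want to feed them to other metrics.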

## 5. Improving the model

Now, what can we do to try and get better scores from our model?

This is where we tune the hyperparameters to get different predictions.

For starters, let's try tweaking the number of trees in the forest, `n_estimators`:

In [15]:
np.random.seed(42)

for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%\n")

Trying model with 10 estimators...
Model accuracy on test set: 78.69%

Trying model with 20 estimators...
Model accuracy on test set: 81.97%

Trying model with 30 estimators...
Model accuracy on test set: 81.97%

Trying model with 40 estimators...
Model accuracy on test set: 78.69%

Trying model with 50 estimators...
Model accuracy on test set: 80.33%

Trying model with 60 estimators...
Model accuracy on test set: 85.25%

Trying model with 70 estimators...
Model accuracy on test set: 85.25%

Trying model with 80 estimators...
Model accuracy on test set: 88.52%

Trying model with 90 estimators...
Model accuracy on test set: 83.61%



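Trying out different `n_estimators` values from 10 to 90, we get the best test accuracy on this particular split with 80 estimators! Keep in mind that the test set has only 61 rows, so the gap between neighboring results is just a handful of predictions, and a different split or seed may favor a different value.

That's just one of the many hyperparameters we can tune to try and improve our machine learning models. Let's explore these further as we gain more knowledge about Scikit-Learn!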
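The loop above picks the winner by looking at the test set, which makes the reported test accuracy a little optimistic. A common alternative, sketched here as a swapped-in technique rather than what this notebook does, is to compare candidates with cross-validation on the training data and touch the test set only once at the end. The data is synthetic (an assumption), and the candidate values `[10, 50, 90]` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the heart disease dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Try each candidate with 5-fold cross-validation on the training set only.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50, 90]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("best n_estimators:", search.best_params_["n_estimators"])
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
# By default the best candidate is refit on the whole training set,
# so the search object can be scored on the untouched test set.
print(f"test accuracy of refit model: {search.score(X_te, y_te):.3f}")
```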

## 6. Saving and loading a trained model

Once we find a model that fits our dataset well, it's a good idea to save it so we can just load it back later.

Here, we shall use the `pickle` library to save and load our models:

In [16]:
import pickle

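To save a model, we simply `pickle.dump()` it to a file that we open for writing in binary mode:

In [17]:
# The with-block closes the file for us once the model is written
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

Great! We now have a file containing the model we trained last (in this case, the one with 90 estimators, since the loop above stops at `n_estimators=90`).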

To load the saved model, simply call `pickle.load()`:

In [18]:
# Open the file in binary read mode and close it again after loading
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

Let's verify that we have loaded the correct model. Its test score should match the 83.61% we saw for the 90-estimator model:

In [19]:
loaded_model.score(X_test, y_test)

0.8360655737704918
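Comparing scores is a quick check; a stricter one is to confirm that the restored model makes exactly the same predictions as the original. The sketch below is self-contained: it trains a small model on synthetic data (an assumption) and writes to a temporary directory, with the hypothetical file name `demo_model.pkl`, so it leaves nothing behind.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, just to have something to save.
X_syn, y_syn = make_classification(n_samples=100, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:   # save
        pickle.dump(model, f)
    with open(path, "rb") as f:   # load
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print(same)  # True
```

Two practical cautions: only unpickle files you trust, since loading a pickle can execute arbitrary code, and load models with the same Scikit-Learn version that saved them, because pickles are not guaranteed to work across versions.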

## 7. Putting it all together

We've just gone through the typical workflow when dealing with Scikit-Learn projects. We went through the step-by-step process of loading a dataset, choosing a machine learning model to predict target values for the dataset, fitting the model and making actual predictions, evaluating how well the model predicted the data (compared to the actual target results), improving the model by tuning different hyperparameters, and finally saving and loading models for future use.
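As a compact recap, the steps above can be strung together in one place. This is only a sketch under stated assumptions: `run_workflow` is a hypothetical helper, the data is synthetic instead of the heart disease CSV, and the model is pickled to bytes in memory rather than to a file.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_workflow(X, y, n_estimators=100, seed=42):
    """Split, fit, evaluate, then save and reload a model (hypothetical helper)."""
    # 1. Get the data ready
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # 2 and 3. Choose the estimator and fit it
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(X_train, y_train)
    # 4. Evaluate on the held-out test set
    test_accuracy = accuracy_score(y_test, clf.predict(X_test))
    # 6. Save and load (in memory here, to keep the sketch free of files)
    reloaded = pickle.loads(pickle.dumps(clf))
    reloaded_accuracy = reloaded.score(X_test, y_test)
    return test_accuracy, reloaded_accuracy


# Synthetic stand-in data, not the heart disease dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=13, random_state=42)
test_accuracy, reloaded_accuracy = run_workflow(X_syn, y_syn)
print(f"test accuracy: {test_accuracy:.3f}")
print(f"reloaded model accuracy: {reloaded_accuracy:.3f}")
```

Step 5, improving the model, would amount to calling the helper with different `n_estimators` values and comparing the results.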

This has been a lot to cover, but it's still just a small taste of the full capabilities Scikit-Learn offers. In the next lectures, we shall dive deeper into the workflow we went through and try to understand the deeper concepts of machine learning using Scikit-Learn. Stay tuned!