## Introduction to Scikit-Learn

Let's revisit the 6-step Machine Learning framework from earlier.

<img src="../01_sample_project/6-step-ml-framework.png" style="background-color: white">

Here are the tools that we can use:

<img src="./images/ml101-6-step-ml-framework-tools.png" style="background-color: white">

Now we are going to start writing Machine Learning code.

To do so, we will use Scikit-Learn.

### What is Scikit-Learn (sklearn)?

* It is a Python Machine Learning (ML) library
* Given data, it helps us build ML models that:
    * Learn patterns within that data
    * Make predictions using those patterns
* It also provides tools to evaluate how good or bad those predictions are

### Why do we use Scikit-Learn?

* Built on NumPy, SciPy and Matplotlib (and Python)
* Has many built-in ML models
* Methods to evaluate your ML models
* Very well-designed API
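
The API point above mostly comes down to three methods that Scikit-Learn estimators share: `fit`, `predict` and `score`. The sketch below drives two different model types through the same calls. The synthetic dataset from `make_classification` and the names `X_demo`, `y_demo` are assumptions for illustration only, not part of the course data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 5 features, binary labels (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two different model types, used through the same three methods
for model in (RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)):
    model.fit(X_demo, y_demo)            # learn patterns from the data
    preds = model.predict(X_demo[:3])    # predict labels for the first three samples
    acc = model.score(X_demo, y_demo)    # mean accuracy for classifiers
    print(type(model).__name__, preds, round(acc, 3))
```

Because the method names stay the same, swapping one model for another usually means changing a single line.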

Here is what we are going to cover:

<img src="./images/sklearn-workflow-title.png" style="background-color: white">

Summary of Topics:

0. An end-to-end Scikit-Learn workflow
1. Getting data ready (to be used with Machine Learning models)
2. Choosing a machine learning model
3. Fitting a model to data (learning patterns) and making predictions with a model (using patterns)
4. Evaluating model predictions
5. Improving model predictions
6. Saving and loading models
7. Putting it all together!

In [8]:
import numpy as np

# One way to show the complete documentation of any function (source: https://stackoverflow.com/questions/63200181/show-complete-documentation-in-vscode)
# np.random.randint?  # put the ? at the end

## 0. An end-to-end Scikit-Learn Workflow

## 1. Getting the data ready

In [11]:
import pandas as pd

heart_disease = pd.read_csv("./data/heart-disease.csv")
heart_disease

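| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |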


In [15]:
# X is the features matrix: every column except the target
X = heart_disease.drop("target", axis=1)

# y is the labels vector: the target column the model learns to predict
y = heart_disease["target"]
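
To make the features/labels split concrete, here is a minimal sketch on a tiny made-up DataFrame. The frame `toy` and its values are hypothetical; only the `drop` and column-selection pattern mirrors the cell above.

```python
import pandas as pd

# Tiny made-up frame with the same layout idea: feature columns plus a "target" column
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 1, 0],
})

X_toy = toy.drop("target", axis=1)   # DataFrame of features, shape (4, 2)
y_toy = toy["target"]                # Series of labels, shape (4,)

print(X_toy.shape, y_toy.shape)      # (4, 2) (4,)
print("target" in X_toy.columns)     # False: the label is not among the features
print("target" in toy.columns)       # True: drop() returned a new frame, the original is unchanged
```

Note that `drop` does not modify the original DataFrame unless asked to, so `heart_disease` still contains the `target` column afterwards.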

In [16]:
X

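| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 |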


In [17]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

## 2. Choosing a Machine Learning model and its hyperparameters

In [37]:
# Let's try Random Forest, which is one type of classification model
# It learns patterns in the data, then classifies each sample (a row) as one class or the other

# First, import the class
from sklearn.ensemble import RandomForestClassifier

# Then, instantiate it as clf (short for classifier)
clf = RandomForestClassifier()

# We'll keep the default hyperparameters for now
clf.get_params() # see which hyperparameters the model is currently using

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
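
The defaults listed above can be overridden either when the model is created or afterwards. The following is a minimal sketch; the particular values (20, 5, 50, 42) and the name `clf_small` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Option 1: pass hyperparameters when instantiating the model
clf_small = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42)

# Option 2: change them later on the existing estimator
clf_small.set_params(n_estimators=50)

params = clf_small.get_params()
print(params["n_estimators"], params["max_depth"], params["random_state"])  # 50 5 42
```

Hyperparameters changed with `set_params` only take effect the next time `fit` is called.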

## 3. Fitting the model to the training data

In [24]:
# Before training, we split the data into a training set and a testing set
from sklearn.model_selection import train_test_split

# Split X and y into training (X_train, y_train) and testing (X_test, y_test) data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% of the data will be used for training, 20% for testing

# Fit the random forest classifier so that it finds patterns in the training data
clf.fit(X_train, y_train);
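
As a sanity check on what `test_size=0.2` means for 303 rows, the sketch below splits placeholder data of the same size. The arrays `X_fake` and `y_fake` are made up, and `random_state` and `stratify` are optional arguments that the cell above does not use; they are shown here as one way to make the split reproducible and class-balanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the heart disease set (303) and 13 features
X_fake = np.zeros((303, 13))
y_fake = np.array([0, 1] * 151 + [1])   # 303 made-up binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_fake, y_fake,
    test_size=0.2,
    random_state=42,    # fixes the shuffle so the split is reproducible
    stratify=y_fake,    # keeps the class ratio similar in both parts
)

print(X_tr.shape, X_te.shape)   # (242, 13) (61, 13)
```

The test set gets 61 rows because 20% of 303 is 60.6, which is rounded up; the remaining 242 rows go to training.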

### Now, our model is fit to the training data! Let's make a prediction!

In [25]:
# Let's try to make a prediction!
import numpy as np

# The call below fails: predict() expects a 2-D array of shape (n_samples, n_features)
# with the same 13 feature columns the model was trained on, not a 1-D array of 4 values
# y_label = clf.predict(np.array([0, 2, 3, 4]))  # raises ValueError

# To fix that, we get our model to predict on the testing data, which has the same columns as the training data
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0], dtype=int64)
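
To predict on a single new sample rather than the whole test set, the input still has to be two-dimensional: one row, one column per feature. Here is a minimal sketch on made-up data with 4 features (the heart disease model would need 13); `X_small`, `y_small` and `one_sample` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: 50 samples with 4 features each (illustration only)
rng = np.random.default_rng(0)
X_small = rng.random((50, 4))
y_small = (X_small[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_small, y_small)

one_sample = np.array([0.1, 0.2, 0.3, 0.4])   # shape (4,): a 1-D array, which predict() rejects

try:
    model.predict(one_sample)
except ValueError as err:
    print("1-D input failed:", type(err).__name__)

# Reshape to (1, 4): one row, four feature columns
single_pred = model.predict(one_sample.reshape(1, -1))
print(single_pred.shape)   # (1,)
```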

In [47]:
np.array(y_test)

array([1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0], dtype=int64)

## 4. Evaluate the model on the training and testing data

In [27]:
# Using training data
clf.score(X_train, y_train)

1.0

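From the above, the model has found the patterns in the training data so well that it got a 100% mean accuracy score. This is not surprising, because the model has already seen these samples and their labels. The score on the unseen testing data tells us more about how well it generalizes.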

In [28]:
# Using testing data
clf.score(X_test, y_test)

0.8524590163934426

There are some more metrics that we can use:

In [29]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [32]:
# print() renders the report as a table instead of a raw string full of \n characters
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.86      0.75      0.80        24
           1       0.85      0.92      0.88        37

    accuracy                           0.85        61
   macro avg       0.85      0.83      0.84        61
weighted avg       0.85      0.85      0.85        61

These classification metrics compare the true test labels (y_test) with the predicted labels (y_preds).

In [33]:
# Let's try another metric
confusion_matrix(y_test, y_preds)

array([[18,  6],
       [ 3, 34]], dtype=int64)

In [34]:
accuracy_score(y_test, y_preds)

0.8524590163934426
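
The numbers in the classification report can be recomputed by hand from the confusion matrix, which helps show how the metrics relate. The sketch below uses only the four counts from the matrix above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = np.array([[18, 6],
               [3, 34]])

tn, fp, fn, tp = cm.ravel()   # layout for binary labels 0 and 1

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the samples predicted as 1, how many were truly 1
recall_1 = tp / (tp + fn)      # of the true 1 samples, how many were found
precision_0 = tn / (tn + fn)   # same idea, treating 0 as the class of interest
recall_0 = tn / (tn + fp)

print(f"accuracy: {accuracy:.4f}")                                      # accuracy: 0.8525
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")    # class 1: precision=0.85 recall=0.92
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")    # class 0: precision=0.86 recall=0.75
```

The accuracy, (18 + 34) / 61, matches both `clf.score(X_test, y_test)` and `accuracy_score(y_test, y_preds)`.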

## 5. Improve the model predictions

In [42]:
# Try different values of n_estimators and compare the accuracy scores
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)
    print(f"Model accuracy on test data set: {clf.score(X_test, y_test) * 100:.2f}%\n")

Trying model with 10 estimators...
Model accuracy on test data set: 78.69%

Trying model with 20 estimators...
Model accuracy on test data set: 83.61%

Trying model with 30 estimators...
Model accuracy on test data set: 86.89%

Trying model with 40 estimators...
Model accuracy on test data set: 85.25%

Trying model with 50 estimators...
Model accuracy on test data set: 85.25%

Trying model with 60 estimators...
Model accuracy on test data set: 88.52%

Trying model with 70 estimators...
Model accuracy on test data set: 85.25%

Trying model with 80 estimators...
Model accuracy on test data set: 81.97%

Trying model with 90 estimators...
Model accuracy on test data set: 85.25%



### From the above, we can see that the highest test accuracy (88.52%) came from setting the hyperparameter n_estimators to 60
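
Rather than reading the best score off the printed output, the scores can be collected in a dictionary and the best setting picked programmatically. This is a minimal sketch on synthetic data (an assumption; in the notebook you would use the existing `X_train`, `X_test`, `y_train`, `y_test`), and the names `scores` and `best_n` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustration only)
X_syn, y_syn = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

scores = {}
for n in range(10, 100, 10):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_tr, y_tr)
    scores[n] = model.score(X_te, y_te)   # test accuracy for this setting

best_n = max(scores, key=scores.get)      # first n_estimators value with the highest test accuracy
print(f"best n_estimators: {best_n} ({scores[best_n] * 100:.2f}%)")
```

Keep in mind that with a test set of only 61 samples, a difference of a few percentage points between settings corresponds to just two or three predictions, so the ranking can change from one split to another.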

## 6. Saving and loading trained models

In [43]:
# First, we want to save the model
# This example uses the pickle library
import pickle

# The with block closes the file once the model has been written
with open("random_forest_model_1.pkl", "wb") as f: # wb = write binary
    pickle.dump(clf, f)

In [45]:
# Then, let's try to load the saved model
with open("random_forest_model_1.pkl", "rb") as f: # rb = read binary
    loaded_model = pickle.load(f)

loaded_model.score(X_test, y_test) # the score should match the last model that we trained (90 estimators)

0.8524590163934426
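
Another common option for saving models is `joblib`, which is installed alongside Scikit-Learn. The sketch below is an alternative to the pickle cells above, not a replacement for them; the synthetic data, the temporary directory and the file name `random_forest_model_demo.joblib` are all assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (illustration only)
X_syn, y_syn = make_classification(n_samples=100, n_features=13, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest_model_demo.joblib")  # hypothetical file name
    joblib.dump(model, model_path)       # save the fitted model to disk
    restored = joblib.load(model_path)   # load it back into memory

# The restored model should behave exactly like the original
same_predictions = (restored.predict(X_syn) == model.predict(X_syn)).all()
print(same_predictions)   # True
```

As with pickle, only load model files from sources you trust, since loading can execute arbitrary code.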

## 7. Putting it all together!