# Introduction to Scikit-Learn
> Scikit-Learn is also called sklearn. 

This notebook records lesson 8.1 of Daniel's ML course.
He provides a summary of chapter 8, which is listed below:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together


In this lesson, we focus only on the first section, "An end-to-end Scikit-Learn workflow". It is a brief overview of the whole of chapter 8, so the content can be a bit overwhelming. There is no need to fully understand every point here. Its main purpose is to give students a preview of what they are going to learn later, so that they can better understand and use sklearn.

Daniel, the instructor of this course, walks through a simplified version of each step in a common sklearn workflow:

1. Get the data ready
2. Choose the right model and hyperparameters
3. Fit the model to the training data
4. Evaluate the model on the training data and test data
5. Improve a model
6. Save a model and load it

The last item in the chapter outline above, "Putting it all together", is missing from this walkthrough for unknown reasons.



## 1. Get the data ready

The first thing we need to do is not to import the data but to import some common libraries, such as pandas, numpy, matplotlib, and sklearn. For now, let's import just the first two.

In [5]:
import pandas as pd
import numpy as np

You may ask: what the hell are pandas and numpy?

Let me explain...

"Pandas" is a software library written for the Python programming language for data manipulation and analysis. More specifically, it offers data structures and operations for manipulating numerical tables and time series.

"Numpy" is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Hooo, that's a lot to take in, but we may be ok for now.
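
To make the two descriptions a little more concrete, here is a minimal sketch. The column names mimic the heart disease data used below, but the values are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# A NumPy array: a typed, n-dimensional grid of numbers.
ages = np.array([63, 37, 41])
mean_age = ages.mean()          # maths applied to the whole array at once

# A pandas DataFrame: a labelled table built on top of such arrays.
# These values are illustrative only, not the real dataset.
df = pd.DataFrame({"age": ages, "chol": [233, 250, 204]})
older = df[df["age"] > 40]      # keep only the rows where age is above 40

print(mean_age)
print(older)
```

Roughly speaking, numpy handles the number crunching and pandas adds row and column labels on top, which is why the two are almost always imported together.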

After we import pandas and numpy, let's import the **heart disease** data from GitHub:

In [6]:
heart_disease = pd.read_csv("https://raw.githubusercontent.com/CongLiu-CN/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


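Why do we need to split the data into X & y?

Scikit-Learn estimators learn a mapping from features to a label, and they take the two as separate arguments, as in `fit(X, y)`. So `X` holds the feature columns (everything except `target`) and `y` holds the `target` column we want to predict.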

In [7]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

## 2. Choose the right model and hyperparameters

In [8]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [9]:
# We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
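
If we did want to change a hyperparameter instead of keeping the defaults, a minimal sketch could look like this. The name `clf_custom` and the values 50, 42 and 200 are arbitrary choices for illustration, not settings from the course:

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are passed as keyword arguments when the estimator is created.
# 50 and 42 are arbitrary illustrative values.
clf_custom = RandomForestClassifier(n_estimators=50, random_state=42)

# get_params() returns a dict, so individual settings can be read by name.
params = clf_custom.get_params()
print(params["n_estimators"], params["random_state"])

# set_params() changes a setting on an existing estimator
# (the model has to be fitted again for the change to take effect).
clf_custom.set_params(n_estimators=200)
print(clf_custom.get_params()["n_estimators"])
```

The exact keys returned by `get_params()` depend on the installed scikit-learn version, so the dictionary printed above may look slightly different on another machine.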

## 3. Fit the model to the training data

What does `train_test_split` do?

According to the scikit-learn documentation, it splits arrays or **matrices** (the plural of **matrix**) into random train and test subsets: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Below, `test_size=0.2` holds out 20% of the rows for testing.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
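
To get a feel for what a random split means, here is a hand-rolled sketch using only the standard library. It is not scikit-learn's implementation: `simple_split` is a hypothetical helper, rounding the test share up is a choice made for this sketch, and the figure of 303 rows is an assumption about the dataset size (it is not stated in this notebook, although it is consistent with the 61 test rows shown further down).

```python
import math
import random

def simple_split(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) drawn from shuffled row indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle so the split is random
    n_test = math.ceil(test_size * n_samples)   # round the test share up
    return indices[n_test:], indices[:n_test]

# 303 rows is an assumed dataset size, used here for illustration only.
train_idx, test_idx = simple_split(303, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two subsets, so the model is later scored on rows it never saw during fitting.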

In [11]:
clf.fit(X_train, y_train);

Make a prediction:

In [12]:
y_preds = clf.predict(X_test)
y_preds

array([0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0], dtype=int64)

In [14]:
y_test

256    0
24     1
266    0
294    0
131    1
      ..
14     1
146    1
275    0
25     1
168    0
Name: target, Length: 61, dtype: int64

## 4. Evaluate the model on the training data and test data

In [15]:
clf.score(X_train, y_train)

1.0

In [16]:
clf.score(X_test, y_test)

0.9016393442622951

In [17]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [19]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.92      0.86      0.89        28
           1       0.89      0.94      0.91        33

    accuracy                           0.90        61
   macro avg       0.90      0.90      0.90        61
weighted avg       0.90      0.90      0.90        61



In [20]:
confusion_matrix(y_test, y_preds)

array([[24,  4],
       [ 2, 31]], dtype=int64)

In [21]:
accuracy_score(y_test, y_preds)

0.9016393442622951
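
The numbers in the classification report and the accuracy score can all be recomputed by hand from the confusion matrix printed above. The sketch below does that with plain Python. It assumes scikit-learn's convention that rows are the true classes and columns are the predicted classes; `prf` is a hypothetical helper name.

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
cm = [[24, 4],
      [2, 31]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total   # correct predictions / all predictions

def prf(cls):
    """Precision, recall and F1 for class index `cls` (0 or 1)."""
    tp = cm[cls][cls]
    predicted = cm[0][cls] + cm[1][cls]    # column sum: everything predicted as cls
    actual = sum(cm[cls])                  # row sum: everything that truly is cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(f"accuracy: {accuracy:.4f}")
for cls in (0, 1):
    p, r, f = prf(cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The row sums (28 and 33) are the `support` column of the report, and 55 correct predictions out of 61 gives the 0.9016 accuracy.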

## 5. Improve a model
Try different values of `n_estimators`:

In [23]:
np.random.seed(2020)

In [24]:
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test)*100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 78.69%

Trying model with 20 estimators...
Model accuracy on test set: 90.16%

Trying model with 30 estimators...
Model accuracy on test set: 85.25%

Trying model with 40 estimators...
Model accuracy on test set: 88.52%

Trying model with 50 estimators...
Model accuracy on test set: 90.16%

Trying model with 60 estimators...
Model accuracy on test set: 90.16%

Trying model with 70 estimators...
Model accuracy on test set: 86.89%

Trying model with 80 estimators...
Model accuracy on test set: 85.25%

Trying model with 90 estimators...
Model accuracy on test set: 91.80%
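
The accuracies above jump up and down rather than rising steadily. One way to read them is to turn each percentage back into a count of correct predictions. The sketch below does this using the test set size of 61 taken from the `y_test` output earlier:

```python
n_test = 61  # size of the test set, from the y_test output earlier

# Accuracies printed by the loop above, keyed by n_estimators.
accuracies = {10: 78.69, 20: 90.16, 30: 85.25, 40: 88.52, 50: 90.16,
              60: 90.16, 70: 86.89, 80: 85.25, 90: 91.80}

# Convert each percentage back into a number of correct predictions.
correct = {n: round(acc / 100 * n_test) for n, acc in accuracies.items()}
print(correct)

# How many percentage points of accuracy a single test sample is worth.
print(f"{100 / n_test:.2f}")
```

With such a small test set, each sample is worth about 1.64 percentage points, so most of these models differ by only a handful of predictions. This suggests, but does not prove, that part of the variation is noise rather than a real effect of `n_estimators`.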



## 6. Save a model and load it

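What is **pickle**?

`pickle` is a module in Python's standard library that serializes Python objects to a byte stream and restores them later, which lets us save a trained model to disk. For an introduction, see:

https://www.pitt.edu/~naraehan/python3/pickling.html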

In [25]:
import pickle

In [26]:
with open("random_forest_model_1.pkl", "wb") as f:  # the with-block closes the file for us
    pickle.dump(clf, f)

In [27]:
with open("random_forest_model_1.pkl", "rb") as f:  # the with-block closes the file for us
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)  # should match the last accuracy printed in step 5

0.9180327868852459
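
The same save-and-load round trip can be tried without a trained model. In this minimal sketch a plain dict stands in for the model, and the file name `example.pkl` and the temporary directory are illustrative choices. Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import os
import pickle
import tempfile

# Any picklable Python object works; this dict is a stand-in for a trained model.
obj = {"name": "random_forest_model_1", "n_estimators": 90}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")

    # "wb" means write bytes; the with-block closes the file automatically.
    with open(path, "wb") as f:
        pickle.dump(obj, f)

    # "rb" means read bytes; load() rebuilds an equal but separate object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == obj, restored is obj)
```

The restored object has the same contents as the original but is a new object in memory, which is why the loaded model above gives the same score as the one that was saved.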