## Introduction to Scikit-Learn

Let's revisit the 6-step Machine Learning framework from earlier.

<img src="../01_sample_project/6-step-ml-framework.png" style="background-color: white">

Here are the tools that we can use:

<img src="./images/ml101-6-step-ml-framework-tools.png" style="background-color: white">

We are now going to start writing Machine Learning code.

To do so, we are going to be using Scikit-Learn.

### What is Scikit-Learn (sklearn)?

* It is a Python Machine Learning (ML) library
* Given data, it helps us build ML models that:
    * learn patterns within that data
    * then use those patterns to make predictions
* It also provides tools to evaluate how good or bad those predictions are

### Why do we use Scikit-Learn?

* Built on NumPy, SciPy, and Matplotlib (and Python)
* Has many built-in ML models
* Methods to evaluate your ML models
* Very well-designed API
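
One way to see what the consistent API means in practice: different models are driven through the same `fit`, `predict`, and `score` methods. The following is a minimal sketch that assumes synthetic data from `make_classification` (not the dataset used later in this notebook), and the `X_demo`/`y_demo` names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only (an assumption, not from this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two very different models, used through the same three methods
scores = {}
for estimator in (LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=20, random_state=0)):
    estimator.fit(X_demo, y_demo)        # learn patterns
    preds = estimator.predict(X_demo)    # use the patterns to predict labels
    scores[type(estimator).__name__] = estimator.score(X_demo, y_demo)  # mean accuracy

print(scores)
```

Swapping one model for another usually only changes the line that creates the estimator.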

Here is what we are going to cover:

<img src="./images/sklearn-workflow-title.png" style="background-color: white">

Summary of Topics:

0. An end-to-end Scikit-Learn workflow
1. Getting data ready (to be used with Machine Learning models)
2. Choosing a machine learning model
3. Fitting a model to data (learning patterns) and making predictions with a model (using patterns)
4. Evaluating model predictions
5. Improving model predictions
6. Saving and loading models
7. Putting it all together!

In [1]:
import numpy as np

# One way to show the complete documentation of any function in a notebook
# (source: https://stackoverflow.com/questions/63200181/show-complete-documentation-in-vscode)
# np.random.randint? # put the ? at the end

## 0. An end-to-end Scikit-Learn Workflow

### 1. Getting the data ready

In [2]:
import pandas as pd

heart_disease = pd.read_csv("./data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# X is the features matrix: every column except the target column
X = heart_disease.drop("target", axis=1)

# y is the labels (the target column) that the ML model will learn to predict
y = heart_disease["target"]

In [4]:
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [5]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

### 2. Choosing a Machine Learning model and the Hyperparameters

In [6]:
# Let's try the Random Forest ML model which is one type of classification learning model
# It is capable of learning patterns in data and then classifying whether a sample aka "a row" is one type or the other

# first we import the library
from sklearn.ensemble import RandomForestClassifier

# then, we create an instance of that class and name it clf (short for classifier)
clf = RandomForestClassifier()

# we'll keep the default hyperparameters for now
clf.get_params() # see what parameters the model is currently using

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
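
The defaults listed above can be overridden. As a small sketch (the variable name `demo_clf` and the chosen values are arbitrary, for illustration only), hyperparameters can be passed when the model is created or changed later with `set_params`:

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters can be set when the model is created...
demo_clf = RandomForestClassifier(n_estimators=50, max_depth=5)

# ...or changed afterwards with set_params()
demo_clf.set_params(min_samples_leaf=2)

params = demo_clf.get_params()
print(params["n_estimators"], params["max_depth"], params["min_samples_leaf"])  # 50 5 2
```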

### 3. Fitting the model to the training data

In [7]:
# first we split the data so the model can be trained on one part and tested on the other
from sklearn.model_selection import train_test_split

# split X and y into training (X_train, y_train) and testing (X_test, y_test) data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% of the data will be used for training, 20% will be used for testing

# fit the random forest classifier so that it finds patterns in the training data
clf.fit(X_train, y_train);

### Now, our model is fit to the training data! Let's make a prediction!

In [8]:
# let's try to make a prediction!
import numpy as np

# y_label = clf.predict(np.array([0, 2, 3, 4])) # predict expects an array-like input
# the line above fails because the model can only predict on data with the same shape as the
# training data: a 2D array with one row per sample and one column per feature (13 here)

# so to fix that, we get our model to predict using the testing data, which has the right shape
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1], dtype=int64)
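
To make the shape requirement concrete, here is a small sketch on random stand-in data (the `X_demo`, `y_demo`, and `demo_clf` names and the data itself are assumptions for illustration). It shows the failing call being rejected with a `ValueError`, and a single sample being accepted once it is reshaped to `(1, n_features)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.random((50, 13))          # 50 samples, 13 features
y_demo = rng.integers(0, 2, size=50)   # random 0/1 labels, for illustration only

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# A 1D array with the wrong number of values is rejected
try:
    demo_clf.predict(np.array([0, 2, 3, 4]))
    error_raised = False
except ValueError as err:
    error_raised = True
    print(f"ValueError: {err}")

# A single sample must be passed as a 2D array of shape (1, n_features)
single_sample = X_demo[0].reshape(1, -1)
single_pred = demo_clf.predict(single_sample)
print(single_sample.shape, single_pred.shape)
```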

In [9]:
np.array(y_test)

array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1], dtype=int64)

### 4. Evaluate the model on the training and testing data

In [10]:
# Using training data
clf.score(X_train, y_train)

1.0

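From the above, the model has fit the training data so closely that it scores a mean accuracy of 100% on it. A perfect score on data the model has already seen says little about new data, so the score on the testing data is the one to watch.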

In [11]:
# Using testing data
clf.score(X_test, y_test)

0.7868852459016393

There are some more metrics that we can use:

In [12]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [13]:
print(classification_report(y_test, y_preds)) # print() renders the line breaks in the report

              precision    recall  f1-score   support

           0       0.69      0.78      0.73        23
           1       0.86      0.79      0.82        38

    accuracy                           0.79        61
   macro avg       0.77      0.79      0.78        61
weighted avg       0.79      0.79      0.79        61

This report shows several classification metrics that compare the test labels (`y_test`) with the predicted labels (`y_preds`).

In [14]:
# let's try another metric
confusion_matrix(y_test, y_preds)

array([[18,  5],
       [ 8, 30]], dtype=int64)

In [15]:
accuracy_score(y_test, y_preds)

0.7868852459016393
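
The accuracy, precision, and recall figures can be recomputed by hand from the confusion matrix shown above. This is a worked sketch of that arithmetic; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are the true labels (0, 1), columns are the predicted labels (0, 1)
cm = np.array([[18, 5],
               [8, 30]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of the samples predicted as 1, how many really are 1
recall_1 = tp / (tp + fn)         # of the samples that really are 1, how many were found

print(f"accuracy={accuracy:.4f}, precision={precision_1:.2f}, recall={recall_1:.2f}")
```

The values agree with `accuracy_score` and with the row for class 1 in the classification report.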

### 5. Improve the model predictions

In [16]:
# Try different values of n_estimators and compare the accuracy scores
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)
    print(f"Model accuracy on test data set: {clf.score(X_test, y_test) * 100:.2f}%\n")

Trying model with 10 estimators...
Model accuracy on test data set: 77.05%

Trying model with 20 estimators...
Model accuracy on test data set: 83.61%

Trying model with 30 estimators...
Model accuracy on test data set: 78.69%

Trying model with 40 estimators...
Model accuracy on test data set: 77.05%

Trying model with 50 estimators...
Model accuracy on test data set: 78.69%

Trying model with 60 estimators...
Model accuracy on test data set: 78.69%

Trying model with 70 estimators...
Model accuracy on test data set: 81.97%

Trying model with 80 estimators...
Model accuracy on test data set: 81.97%

Trying model with 90 estimators...
Model accuracy on test data set: 81.97%



### From the above, we can see that the highest accuracy score (83.61%) came from setting the hyperparameter n_estimators to 20
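
Scores from a single train/test split move around from run to run, so a gap of a few points between two settings may just be noise. One common way to get a steadier comparison is cross-validation. The sketch below assumes synthetic stand-in data from `make_classification` and made-up names (`X_demo`, `cv_results`); it is an illustration, not the method used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in your own X and y)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

cv_results = {}
for n in (10, 50, 90):
    model_cv = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: five different train/test splits instead of one
    fold_scores = cross_val_score(model_cv, X_demo, y_demo, cv=5)
    cv_results[n] = fold_scores.mean()
    print(f"{n} estimators: mean accuracy {cv_results[n] * 100:.2f}%")

# Setting with the best average score across the folds
best_n = max(cv_results, key=cv_results.get)
print(f"Best n_estimators by cross-validation: {best_n}")
```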

### 6. Saving and loading trained models

In [17]:
# First we want to save the model
# example here is using the pickle library
import pickle

# the with block closes the file once the model has been written
with open("random_forest_model_1.pkl", "wb") as f: # wb = write binary
    pickle.dump(clf, f)

In [18]:
# Then, let's try to load the saved model
with open("random_forest_model_1.pkl", "rb") as f: # rb = read binary
    loaded_model = pickle.load(f)

loaded_model.score(X_test, y_test) # the score should match the last model that we tested (90 estimators)

0.819672131147541
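
The `joblib` library (installed alongside Scikit-Learn) is a common alternative to `pickle` for saving models. A minimal sketch, assuming random stand-in data and a hypothetical file name `demo_model.joblib` written to a temporary folder:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.random((40, 4))           # stand-in features, for illustration only
y_demo = rng.integers(0, 2, size=40)   # stand-in 0/1 labels
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save into a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(demo_clf, model_path)
    loaded_demo = joblib.load(model_path)

# The loaded model should make the same predictions as the original
same_predictions = np.array_equal(demo_clf.predict(X_demo), loaded_demo.predict(X_demo))
print(same_predictions)
```

As with `pickle`, only load model files from sources you trust.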

### 7. Putting it all together!

In [20]:
# how to check the version of sklearn
import sklearn

sklearn.show_versions()


System:
    python: 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
executable: c:\Users\ahmad\anaconda3\python.exe
   machine: Windows-10-10.0.22631-SP0

Python dependencies:
      sklearn: 1.2.2
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.11.4
       Cython: None
       pandas: 2.1.4
   matplotlib: 3.8.0
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: C:\Users\ahmad\anaconda3\Library\bin\mkl_rt.2.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2023.1-Product
    num_threads: 14
threading_layer: intel

       filepath: C:\Users\ahmad\anaconda3\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 20


In [23]:
# Create a list for the topics that we will be covering
summary_topics = [
    "0. An end-to-end Scikit-Learn workflow",
    "1. Getting data ready (to be used with Machine Learning models)",
    "2. Choosing a machine learning model",
    "3. Fitting a model to data (learning patterns) and making predictions with a model (using patterns)",
    "4. Evaluating model predictions",
    "5. Improving model predictions",
    "6. Saving and loading models",
    "7. Putting it all together!"]

In [26]:
summary_topics

['0. An end-to-end Scikit-Learn workflow',
 '1. Getting data ready (to be used with Machine Learning models)',
 '2. Choosing a machine learning model',
 '3. Fitting a model to data (learning patterns) and making predictions with a model (using patterns)',
 '4. Evaluating model predictions',
 '5. Improving model predictions',
 '6. Saving and loading models',
 '7. Putting it all together!']

## 1. Getting the data ready to be used with Machine Learning models

Three main things that we have to do:
1. Split the data into features ("X") and labels ("y")
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)

In [27]:
# Let's start with the dataset
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### 1. Split the data into features ("X") and labels ("y")

In [28]:
X = heart_disease.drop("target", axis=1) # 0=rows, 1=columns
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [29]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [30]:
# Now, split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [31]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
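
`train_test_split` shuffles the rows randomly, so each run produces a different split unless a seed is fixed. The sketch below uses random stand-in data with the same shape as the heart-disease features (an assumption for illustration) and shows two optional parameters: `random_state` for a repeatable split and `stratify` to keep the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.random((303, 13))          # same shape as the heart-disease features
y_demo = rng.integers(0, 2, size=303)   # random 0/1 labels, for illustration only

# The same random_state gives the same split every time
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
same_split = np.array_equal(X_test_a, split_b[1])

print(X_train_a.shape, X_test_a.shape, same_split)
```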

We have done step 1! Time for the next step!

### 1.1 Make sure the data is all numerical (extra step)

Let's look at another example and make a prediction.

In [34]:
# Let's use the car sales dataset
car_sales = pd.read_csv("./data/car-sales-extended.csv")
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [35]:
len(car_sales)

1000

In [38]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [42]:
# Let's split the data into features (X) and labels (y)
X = car_sales.drop("Price", axis=1)

y = car_sales["Price"]

In [43]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [44]:
y.head()

0    15323
1    19943
2    28343
3    13434
4    14043
Name: Price, dtype: int64

In [46]:
# Now, let's split the data into training and testing sets
from sklearn.model_selection import train_test_split

# train_test_split returns the splits in this order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [47]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 4), (200, 4), (800,), (200,))

In [49]:
# Choose another ML model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [52]:
# model.fit(X_train, y_train)
# model.score(X_test, y_test)

If we run the code above, we will get the error `ValueError: could not convert string to float: "Nissan"`.

Our machine learning model cannot deal with strings.

This is the reason why we need to make sure that our data is all numerical.
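
The error can be reproduced on a tiny table. This sketch reuses a few rows from the car sales output above; the names `demo_cars`, `fit_failed`, and `numeric_model` are made up for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# A few rows copied from the car sales data shown earlier
demo_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan", "Toyota"],
    "Odometer (KM)": [35431, 192714, 181577, 154365],
    "Price": [15323, 19943, 14043, 13434],
})

# Fitting on a text column fails
try:
    RandomForestRegressor(n_estimators=10).fit(
        demo_cars[["Make", "Odometer (KM)"]], demo_cars["Price"]
    )
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print(f"ValueError: {err}")

# Fitting on numeric columns only is accepted
numeric_model = RandomForestRegressor(n_estimators=10).fit(
    demo_cars[["Odometer (KM)"]], demo_cars["Price"]
)
```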

Let's see how we deal with this situation with sklearn.

In [53]:
# turn categories into numbers

# First, import these libraries
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [54]:
# then define the features to turn into numbers
categorical_features = ["Make", "Colour", "Doors"]

In [59]:
# Doors are also considered categorical in our case e.g. there are 856 cars under the 4 doors category
car_sales["Doors"].value_counts()

Doors
4    856
5     79
3     65
Name: count, dtype: int64

In [60]:
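one_hot = OneHotEncoder()
# ColumnTransformer accepts a list of (name, transformer, columns) tuples;
# remainder="passthrough" keeps the remaining columns (here Odometer (KM)) unchanged
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough")

transformed_X = transformer.fit_transform(X) # a version of X with the categorical columns turned into numbers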
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [63]:
# let's put the transformed_X data into a dataframe so we could visualize the data better
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [64]:
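# pandas offers a similar encoding: get_dummies creates indicator (True/False) columns for the text columns
# note that Doors stays a single numeric column here because its dtype is int64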
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True
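
If `Doors` should be treated as a category with `get_dummies` as well, the columns to encode can be named explicitly. A short sketch on three rows copied from the car sales data (`demo_cars` and `demo_dummies` are illustrative names; `dtype=int` is an optional choice that gives 0/1 instead of True/False):

```python
import pandas as pd

demo_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Colour": ["White", "Blue", "Blue"],
    "Doors": [4, 5, 3],
})

# Naming the columns explicitly makes pandas encode integer columns such as Doors too
demo_dummies = pd.get_dummies(demo_cars, columns=["Make", "Colour", "Doors"], dtype=int)
print(demo_dummies.columns.tolist())
```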


In [66]:
# Now that the data is in numbers, let's refit our model

np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model.fit(X_train, y_train)

In [67]:
model.score(X_test, y_test)

0.3235867221569877

The ML model is now able to make predictions (even though the score is low) because our data is all numerical!

### 2. Filling (also called imputing) or disregarding missing values

## Data Science Quick Tip: Clean, Transform, Reduce

* Clean Data: Remove rows or columns that are empty or have missing values (e.g. Null), or fill in the gaps, for example with the average price

* Transform Data: Convert the data into a form that a computer can understand, e.g. change the data type to numbers or convert colours into numbers

* Reduce Data: The more data we have, the more CPU and energy we need to run our computations, so remove the columns that we don't need (called dimensionality or column reduction)
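
The three ideas can be sketched with pandas on a tiny made-up table. All values and names below (`messy`, `cleaned`, `transformed`, `reduced`, the `Notes` column) are hypothetical and only illustrate one possible way to apply each step.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps (hypothetical values, for illustration only)
messy = pd.DataFrame({
    "Make": ["Honda", "BMW", None, "Toyota"],
    "Colour": ["White", "Blue", "Red", "White"],
    "Price": [15000.0, np.nan, 9000.0, 12000.0],
    "Notes": ["", "", "", ""],  # a column we decide we do not need
})

# Clean: drop rows with no Make, then fill missing prices with the column mean
cleaned = messy.dropna(subset=["Make"]).copy()
cleaned["Price"] = cleaned["Price"].fillna(cleaned["Price"].mean())

# Transform: turn the text columns into 0/1 indicator columns
transformed = pd.get_dummies(cleaned, columns=["Make", "Colour"], dtype=int)

# Reduce: remove a column that carries no useful information
reduced = transformed.drop(columns=["Notes"])
print(reduced)
```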