# 1. Scikit-Learn

### sklearn is a Python machine learning library.<br>It can be used to create models that find patterns in data and make predictions based on that data.<br>It also provides tools for evaluating those predictions.

### Why sklearn?
- It is built on NumPy and SciPy, and works with pandas DataFrames (and of course Python)
- Has many built-in machine learning models
- Has methods to evaluate the machine learning model
- Has a well-designed, consistent API

# 1.1. Scikit-Learn Workflow

### In general, the workflow for building a prediction model is as follows:

1. Get the data ready using NumPy and pandas, and split it into training and testing sets.
2. Pick a machine learning model based on the problem to be solved.
3. Fit the model to the training data and make predictions on the testing data.
4. Evaluate the model's predictions using predefined metrics.
5. Improve the results through experimentation.
6. Save and reload the trained model.

# 1.2. Useful Tips

### To display the sklearn version:

In [16]:
import sklearn


sklearn.show_versions()


System:
    python: 3.8.10 (default, Mar 15 2022, 12:22:08)  [GCC 9.4.0]
executable: /home/mmpsudani/Documents/programming/notebook/venv/bin/python
   machine: Linux-5.13.0-44-generic-x86_64-with-glibc2.29

Python dependencies:
      sklearn: 1.1.1
          pip: 20.0.2
   setuptools: 44.0.0
        numpy: 1.22.4
        scipy: 1.8.1
       Cython: None
       pandas: 1.4.2
   matplotlib: 3.5.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/mmpsudani/Documents/programming/notebook/venv/lib/python3.8/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/mmpsudani/Documents/programming/notebook/venv/lib/python3.8/site-packages/scikit_learn.libs/libgomp-a

### To control how the Jupyter notebook displays warnings:

In [19]:
# It is recommended to resolve warnings rather than ignore them
import warnings


# Valid actions: "error", "ignore", "always", "default", "module", "once".
# Use "ignore" to hide warnings; "default" shows each warning once per location.
warnings.filterwarnings("default")
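
When only one noisy call needs silencing, a global filter may be more than is needed. The sketch below uses the standard-library `warnings.catch_warnings()` context manager to limit the filter to a single block. `noisy_function` is a hypothetical stand-in for whatever call emits the warning.

```python
import warnings


def noisy_function():
    # Hypothetical stand-in for any call that emits a warning
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Filters changed inside the block are restored when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_function()

print(result)
```

Outside the `with` block, the previous filter settings apply again, so other warnings remain visible.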

### Standard Imports

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# 1.3. An end-to-end Scikit-Learn workflow

In [21]:
# 1.1. Getting the data and preprocessing it
df = pd.read_csv("data/heart_disease.csv")
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [2]:
# 1.2. Splitting the data into X (features matrix) and y (labels)
y = df["target"]

# X = df.drop("target", axis=1)
features = df.columns[:-1]
X = df[features]

In [3]:
# 1.3. Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
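
In the call above, `train_size=0.7` keeps 70% of the rows for training and `random_state=0` makes the shuffle reproducible. For classification problems, `train_test_split` also accepts a `stratify` argument that keeps the class proportions similar in both sets. A minimal sketch on synthetic data (the arrays below are made up for illustration, not the heart-disease data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0, stratify=y_demo
)

# Compare the split sizes and the share of positive labels in each set
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```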

In [4]:
# 2.1 Choose a Machine Learning model based on the problem
from sklearn.ensemble import RandomForestClassifier

In [5]:
# 2.2. Create an ML model
model = RandomForestClassifier(n_estimators=100, random_state=0)

In [6]:
# 2.3. Display the model's hyperparameters
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 0,
 'verbose': 0,
 'warm_start': False}

In [7]:
# 3.1. Fit the model with the training data
model.fit(X_train, y_train)

In [8]:
# 4.1. Make a prediction based on the testing data with the trained model
prediction = model.predict(X_test)
prediction

array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0])

In [9]:
# 5.1. Evaluate the model's prediction using score
print(f"Score: {model.score(X_test, y_test)}")

Score: 0.8351648351648352


In [10]:
# 5.2. Evaluate the model's prediction using the MAE and MSE metrics
# Note: these are regression metrics; on 0/1 labels both equal the error rate (1 - accuracy)
from sklearn.metrics import mean_absolute_error, mean_squared_error


mae = mean_absolute_error(y_test, prediction)
mse = mean_squared_error(y_test, prediction)
print(f"MAE: {mae}")
print(f"MSE: {mse}")

MAE: 0.16483516483516483
MSE: 0.16483516483516483
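
MAE and MSE come out identical here because every wrong 0/1 prediction has an absolute error of exactly 1, and 1 squared is still 1. Both therefore reduce to the fraction of wrong predictions, which is 1 minus the accuracy (0.1648 = 1 - 0.8352 above). A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Toy labels (not the heart-disease data): 2 of 8 predictions are wrong
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

toy_mae = mean_absolute_error(y_true, y_pred)
toy_mse = mean_squared_error(y_true, y_pred)
toy_acc = accuracy_score(y_true, y_pred)

print(toy_mae, toy_mse, 1 - toy_acc)
```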


In [11]:
# 5.3. Evaluate the model's prediction using other metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


print(classification_report(y_test, prediction))
print("*" * 18)
print(confusion_matrix(y_test, prediction))
print("*" * 18)
print(accuracy_score(y_test, prediction))

              precision    recall  f1-score   support

           0       0.89      0.75      0.81        44
           1       0.80      0.91      0.85        47

    accuracy                           0.84        91
   macro avg       0.84      0.83      0.83        91
weighted avg       0.84      0.84      0.83        91

******************
[[33 11]
 [ 4 43]]
******************
0.8351648351648352
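
The figures in the classification report can be recomputed from the confusion matrix. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, so for binary labels `ravel()` yields TN, FP, FN, TP in that order. A short worked example using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[33, 11],
               [4, 43]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # 43 / 54
recall_1 = tp / (tp + fn)         # 43 / 47
accuracy = (tp + tn) / cm.sum()   # 76 / 91

print(f"precision (class 1): {precision_1:.2f}")
print(f"recall (class 1):    {recall_1:.2f}")
print(f"accuracy:            {accuracy:.2f}")
```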


In [12]:
# 6.1. Improve the model's performance by adjusting the hyperparameters
# In this example, n_estimators
# Note: selecting on the test-set score makes that score optimistic

max_score = [0, 0]
best_model = None

for i in range(10, 100):
    candidate = RandomForestClassifier(n_estimators=i, random_state=0)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_test, y_test)

    if score > max_score[1]:
        max_score = [i, score]
        best_model = candidate

# Keep the best model (not the last one fitted) for the saving step
model = best_model
print(f"model's max score with n_estimators={max_score[0]}: {max_score[1]}")

model's max score with n_estimators=50: 0.8461538461538461
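
Because the loop above picks `n_estimators` by looking at the test-set score, the reported maximum tends to be optimistic. One common alternative is to tune with cross-validation on the training data only and touch the test set once at the end. The sketch below uses `GridSearchCV` on synthetic data from `make_classification`; the candidate values and `cv=3` are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart-disease data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

# Each candidate is scored by 3-fold cross-validation on the training set
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50, 100]},
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# The test set is used only once, for the final estimate
test_score = search.score(X_te, y_te)
print(test_score)
```

After fitting, `search.best_estimator_` holds the model refitted on the whole training set with the best parameters.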


In [13]:
# 7.1. Save the model
import pickle as pkl


# The with-block closes the file once the model has been written
with open("data/RFC_1.pkl", "wb") as f:
    pkl.dump(model, f)

In [14]:
# 7.2. Load the model
# The with-block closes the file once the model has been read
with open("data/RFC_1.pkl", "rb") as f:
    loaded_model = pkl.load(f)

In [15]:
loaded_model
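
`joblib`, which is installed alongside scikit-learn, offers a similar `dump`/`load` pair that is often used for models holding large NumPy arrays. The sketch below trains a tiny model on synthetic data, writes it to a temporary directory (an assumption made so the example does not touch `data/`), and checks that the reloaded model predicts the same labels.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic model so the example is self-contained
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rfc_demo.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

As with pickle, only load model files from sources you trust, since loading can execute arbitrary code.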

# 1.4. Getting the data ready

### The workflow of getting the data ready is as follows:
1. Clean the data
2. Transform the data
3. Reduce the data

### To get the data ready we must:
1. Import the data using pandas as a DataFrame, or create the data using NumPy
2. Fill (impute) or drop the cells that have NaN values
3. Convert non-numerical values into numerical values (also known as feature encoding)
4. Split the data into X (features) and y (labels)

## 1.4.1. Import the data

In [87]:
# Importing the data from a csv file
car_sales = pd.read_csv("data/car-sales-extended-missing-data.csv")

In [88]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [89]:
car_sales.describe()

Unnamed: 0,Odometer (KM),Doors,Price
count,950.0,950.0,950.0
mean,131253.237895,4.011579,16042.814737
std,69094.857187,0.382539,8581.695036
min,10148.0,3.0,2796.0
25%,70391.25,4.0,9529.25
50%,131821.0,4.0,14297.0
75%,192668.5,4.0,20806.25
max,249860.0,5.0,52458.0


In [90]:
# Display the number of NaN cells in each column
car_sales.isnull().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [91]:
car_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           951 non-null    object 
 1   Colour         950 non-null    object 
 2   Odometer (KM)  950 non-null    float64
 3   Doors          950 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 39.2+ KB


## 1.4.2. Fill / Drop the missing values

In [92]:
# Fill the NaN cells in the Odometer column with the mean of that column
# (mean() skips NaN by default, and assigning back avoids chained inplace updates)
car_sales["Odometer (KM)"] = car_sales["Odometer (KM)"].fillna(car_sales["Odometer (KM)"].mean())

In [93]:
# Fill the data using SimpleImputer from sklearn; by default (strategy="mean")
# it fills the NaN cells with the mean of each column
from sklearn.impute import SimpleImputer


my_imputer = SimpleImputer()
car_sales[["Doors", "Price"]] = pd.DataFrame(my_imputer.fit_transform(car_sales[["Doors", "Price"]]))

In [94]:
car_sales.isnull().sum()

Make             49
Colour           50
Odometer (KM)     0
Doors             0
Price             0
dtype: int64

### There are two ways to deal with missing cells with non-numeric values:
1. Drop the rows with the missing values
2. Fill the missing cells with the most common value of that column

In [95]:
# Fill the missing values with the most common value of each column
# (mode() returns the most frequent values; [0] takes the first one)
car_sales["Make"] = car_sales["Make"].fillna(car_sales["Make"].mode()[0])
car_sales["Colour"] = car_sales["Colour"].fillna(car_sales["Colour"].mode()[0])

In [96]:
car_sales.isnull().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64
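
Both options listed above can be sketched on a small made-up frame: `dropna()` removes the rows with missing values, and `SimpleImputer(strategy="most_frequent")` fills each column with its most common value. The `toy` DataFrame below is illustrative only and is not the car-sales file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing categorical values
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", np.nan, "Honda", "Toyota"],
    "Colour": ["White", np.nan, "Blue", "White", "White"],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each column with its most frequent value
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(len(dropped))
print(filled)
```

Dropping is simpler but discards data; filling keeps every row at the cost of introducing guessed values.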

## 1.4.3. Converting non-numeric data into numeric values

### There are two ways to convert the non-numeric data in a DataFrame:
1. Use an encoder from sklearn
2. Use get_dummies from pandas

In [99]:
# Using the sklearn one-hot encoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


categorical_features = ["Make", "Colour", "Doors"]
my_encoder = OneHotEncoder()
transformer = ColumnTransformer([("my_encoder", my_encoder, categorical_features)], remainder="passthrough")

car_sales_transformed = pd.DataFrame(transformer.fit_transform(car_sales))
car_sales_transformed.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,35431.0,15323.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0,19943.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,84714.0,28343.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,154365.0,13434.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,181577.0,14043.0


In [100]:
# Using pandas get_dummies (by default it encodes only the object columns, so Doors stays numeric)
car_sales_dummies = pd.get_dummies(car_sales)
car_sales_dummies.head()

Unnamed: 0,Odometer (KM),Doors,Price,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,35431.0,4.0,15323.0,0,1,0,0,0,0,0,0,1
1,192714.0,5.0,19943.0,1,0,0,0,0,1,0,0,0
2,84714.0,4.0,28343.0,0,1,0,0,0,0,0,0,1
3,154365.0,4.0,13434.0,0,0,0,1,0,0,0,0,1
4,181577.0,3.0,14043.0,0,0,1,0,0,1,0,0,0


## 1.4.4. Splitting the data into X and y

In [102]:
# Column 14 is the passthrough Price column (the label); columns 0-13 are the features
y = car_sales_transformed[14]
X = car_sales_transformed.drop(14, axis=1)
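
Selecting the label by integer position works, but it depends on the column order produced by the transformer. One possible alternative, sketched below under several assumptions, is to split by column name first and then let a `Pipeline` fit the imputing and encoding steps on the training rows only. The `cars` frame is synthetic, and the step names and the choice of `RandomForestRegressor` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car-sales data
rng = np.random.default_rng(0)
n = 200
cars = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n),
    "Odometer (KM)": rng.uniform(10_000, 250_000, size=n),
    "Price": rng.uniform(3_000, 50_000, size=n),
})
cars.loc[::10, "Odometer (KM)"] = np.nan  # inject some missing values

# Split by column name, then into training and testing sets
X_cars = cars.drop("Price", axis=1)
y_cars = cars["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_cars, y_cars, train_size=0.7, random_state=0)

# Preprocessing is fitted on the training rows only when pipe.fit is called
preprocess = ColumnTransformer([
    ("make", OneHotEncoder(handle_unknown="ignore"), ["Make"]),
    ("odometer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

pipe.fit(X_tr, y_tr)
predictions = pipe.predict(X_te)
print(predictions.shape)
```

Fitting the imputer inside the pipeline means the fill values are computed from the training set alone, so no information from the test rows leaks into preprocessing.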

# 1.5. Choosing the ML model