# Introduction to Scikit Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful scikit learn library.

Flow of the work:

0. An end-to-end Scikit Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problem
3. Fitting the model/algorithm/estimator and using it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Saving and loading a trained model
7. Putting it all together

In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 0. An End-to-end Scikit Learn Workflow

In [3]:
# 1. Get the data ready 
import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [9]:
# Create X (Features Matrix)
X = heart_disease.drop("target", axis = 1)

# Create Y (labels)
y = heart_disease["target"]

In [None]:
# 2. Choose the right model and hyperparameters

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
 
#  We will keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [10]:
# 3. Fit the model to the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
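
`train_test_split` shuffles the rows before splitting, so every run produces a different split unless a seed is fixed. The sketch below uses a small synthetic array (not the heart-disease data) and an arbitrarily chosen `random_state=42` to show that a fixed seed gives the same split each time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features (not the heart-disease data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Same random_state -> same shuffle -> identical splits
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[0].shape, split_a[1].shape)  # train and test feature shapes
```

Fixing the seed is mainly useful when results need to be compared across runs; the value 42 itself has no special meaning.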

In [12]:
clf.fit(X_train,y_train);

In [15]:
X_train

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
219,48,1,0,130,256,1,0,150,1,0.0,2,2,3
31,65,1,0,120,177,0,1,140,0,0.4,2,0,3
33,54,1,2,125,273,0,0,152,0,0.5,0,1,2
253,67,1,0,100,299,0,0,125,1,0.9,1,2,2
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
148,44,1,2,120,226,0,1,169,0,0.0,2,0,2
13,64,1,3,110,211,0,0,144,1,1.8,1,0,2
291,58,1,0,114,318,0,2,140,0,4.4,0,3,1
262,53,1,0,123,282,0,1,95,1,2.0,1,2,3


In [16]:
X_test

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
88,54,0,2,110,214,0,1,158,0,1.6,1,0,2
186,60,1,0,130,253,0,1,144,1,1.4,2,1,3
52,62,1,2,130,231,0,1,146,0,1.8,1,3,3
185,44,1,0,112,290,0,0,153,0,0.0,2,1,2
135,49,0,0,130,269,0,1,163,0,0.0,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
211,61,1,0,120,260,0,1,140,1,3.6,1,1,3
243,57,1,0,152,274,0,1,88,1,1.2,1,1,3
279,61,1,0,138,166,0,0,125,1,3.6,1,1,2


In [18]:
y_preds = clf.predict(X_test)
y_preds

array([1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

In [19]:
y_test

88     1
186    0
52     1
185    0
135    1
      ..
179    0
211    0
243    0
279    0
240    0
Name: target, Length: 61, dtype: int64

In [21]:
# 4. Evaluate the model on the training data and the test data

clf.score(X_train, y_train)  # score() returns the mean accuracy: a float between 0.0 and 1.0
# accuracy = correct predictions / total predictions

1.0

In [22]:
clf.score(X_test, y_test)

0.8852459016393442

In [23]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.85      0.94      0.89        31
           1       0.93      0.83      0.88        30

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.88        61



In [24]:
confusion_matrix(y_test, y_preds)

array([[29,  2],
       [ 5, 25]])

In [25]:
accuracy_score(y_test,y_preds)

0.8852459016393442
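
The numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the output above and assumes scikit-learn's layout, where rows are the true classes and columns are the predicted classes.

```python
# Confusion matrix copied from the output above.
# Rows: true class (0, 1). Columns: predicted class (0, 1).
cm = [[29, 2],
      [5, 25]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total     # correct predictions / total predictions
precision_1 = tp / (tp + fp)     # of everything predicted 1, how much was 1
recall_1 = tp / (tp + fn)        # of everything truly 1, how much was found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy:          {accuracy:.4f}")
print(f"class 0 precision: {precision_0:.2f}, recall: {recall_0:.2f}")
print(f"class 1 precision: {precision_1:.2f}, recall: {recall_1:.2f}")
```

The results agree with `accuracy_score` and with the precision and recall columns of the report.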

### What is n_estimators?
Think of a Random Forest like a jury in a courtroom.

`n_estimators` is simply the number of people in that jury.

1. The Single Tree vs. The Forest
* If you have 1 tree, it's like asking one person for their opinion. They might be biased or miss a small detail.

* If you have n_estimators=100, you are asking 100 people.

2. Why more trees help

Each "person" (tree) in the forest looks at the data a little differently, because each tree is trained on a random sample of the rows and considers a random subset of the features at each split. When it’s time to make a decision:

* Every single tree "votes" for a result.

* The forest looks at all the votes.

* The majority wins.

Because the forest combines the votes of many trees, the mistakes of one tree tend to be outvoted by the others. This makes the model more reliable and "stable" than a single tree.

3. Practical Rules of Thumb
* 10 trees: Fast, but maybe a bit "shaky" (less accurate).

* 100 trees: This is the default. It’s usually a strong starting point for most projects.

* 1,000 trees: Often slightly more accurate, but training and prediction take more time and memory.
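
The jury idea can be sketched in a few lines of plain Python. The votes below are made up for illustration, and `majority_vote` is a hypothetical helper, not part of scikit-learn. Note that scikit-learn's `RandomForestClassifier` averages each tree's predicted class probabilities rather than counting hard votes, so this toy captures the intuition, not the exact mechanism.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical votes from a 5-tree forest for three samples
tree_votes = [
    [1, 1, 0, 1, 0],  # sample A: three of five trees say 1
    [0, 0, 0, 1, 0],  # sample B: four of five trees say 0
    [1, 0, 1, 1, 1],  # sample C: four of five trees say 1
]

forest_predictions = [majority_vote(votes) for votes in tree_votes]
print(forest_predictions)  # [1, 0, 1]
```

In sample A two trees are "wrong" if the true label is 1, yet the forest still answers 1, which is the sense in which individual mistakes get outvoted.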

In [28]:
# 5. Improve a model
# Try different values of n_estimators

np.random.seed(42)

for i in range(10,100,10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators = i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 86.89%

Trying model with 20 estimators...
Model accuracy on test set: 85.25%

Trying model with 30 estimators...
Model accuracy on test set: 86.89%

Trying model with 40 estimators...
Model accuracy on test set: 86.89%

Trying model with 50 estimators...
Model accuracy on test set: 86.89%

Trying model with 60 estimators...
Model accuracy on test set: 91.80%

Trying model with 70 estimators...
Model accuracy on test set: 85.25%

Trying model with 80 estimators...
Model accuracy on test set: 86.89%

Trying model with 90 estimators...
Model accuracy on test set: 90.16%
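
The scores above move between 85.25% and 91.80% without a clear trend. With only 61 test samples, a single prediction changes the accuracy by about 1.64 percentage points (1/61), so small differences are hard to interpret. One common way to get a steadier comparison is cross-validation. The sketch below is an illustration only: it uses synthetic data from `make_classification` (sized 303 x 13 to resemble the heart-disease data) and arbitrarily chosen seeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart-disease data (same shape, different values)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

results = {}
for n in (10, 50, 100):
    clf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: five train/test splits, five accuracy scores
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
    results[n] = scores.mean()
    print(f"{n:>3} estimators: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over several splits makes each setting's score depend less on which rows happened to land in the test set.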



In [29]:
# 6. Save a model and load it

import pickle

# Use a context manager so the file is closed once the model is written
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [31]:
# Reload the model we saved

with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.9016393442622951
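
A quick way to check that a save/load round trip preserved the model is to compare predictions before and after. The sketch below is self-contained: it trains a small forest on synthetic data and writes to a temporary directory, and the file name `demo_model.pkl` is a placeholder. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset and model, used only for this round-trip check
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # placeholder file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should make exactly the same predictions
predictions_match = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(predictions_match)
```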

## 1. Getting the data ready to be used with machine learning

Three main things we have to do:
   1. Split the data into features and labels (usually `X` and `y`)
   2. Fill (also called imputing) or discard missing values
   3. Convert non-numerical values to numerical values (also called feature encoding)

In [34]:
heart_disease.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [36]:
X = heart_disease.drop("target", axis  = 1)
X.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [37]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [39]:
# Split the data into training and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [42]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

# .shape reports the dimensions of each piece after the split:
# (rows, columns) for the feature sets and (rows,) for the label Series.

((242, 13), (61, 13), (242,), (61,))

In [43]:
X.shape # this is the data before splitting, original data.

(303, 13)

In [46]:
# test_size=0.2 reserves 20% of these samples for the test set

len(heart_disease)  # total samples = 303

303

In [47]:
X.shape[0] * 0.8

242.4

In [48]:
# 303 - 242 = 61
# 242 + 61 = 303
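
The split sizes follow from simple arithmetic. 20% of 303 is 60.6, which is not a whole number of rows. The sketch below assumes scikit-learn's behaviour when only a float `test_size` is given: the test count is rounded up and the training set receives the remainder.

```python
import math

n_samples = 303  # len(heart_disease)
test_size = 0.2

# Assumed rounding rule: round the test count up, give the rest to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)  # 61 242
```

This matches the shapes reported above: 242 training rows and 61 test rows.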

### 1.1 Make sure all the data is numerical

In [49]:
car_sales = pd.read_csv("data/scikit-learn-data/car-sales-extended.csv")
car_sales.head()

,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [50]:
len(car_sales)

1000

In [51]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [52]:
# Split into X/y

X = car_sales.drop("Price", axis = 1)
y = car_sales["Price"]

# Split into training and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
# Build machine learning model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test) 


# This code raises a ValueError: the model cannot convert string columns such as "Make" and "Colour" to floats
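
The failure can be reproduced without the CSV file. The sketch below rebuilds the five rows shown by `car_sales.head()` by hand and catches the exception so the message can be inspected. The names `toy_cars`, `X_toy`, and `y_toy` are introduced here for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# The five rows shown by car_sales.head(), typed in by hand
toy_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"],
    "Colour": ["White", "Blue", "White", "White", "Blue"],
    "Odometer (KM)": [35431, 192714, 84714, 154365, 181577],
    "Doors": [4, 5, 4, 4, 3],
    "Price": [15323, 19943, 28343, 13434, 14043],
})

X_toy = toy_cars.drop("Price", axis=1)
y_toy = toy_cars["Price"]

error_message = None
try:
    # Fitting on raw string columns fails before any tree is built
    RandomForestRegressor(n_estimators=5).fit(X_toy, y_toy)
except ValueError as err:
    error_message = str(err)

print(error_message)
```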

In [58]:
# To fix this error we need to preprocess the data

# Turn the categories into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X


array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]], shape=(1000, 13))