
In [22]:
#Standard imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sklearn
print(f"Scikit learn version: {sklearn.__version__}")

Scikit learn version: 1.5.2


##An end-to-end Scikit-Learn workflow

* Getting data ready (split into features and labels, prepare train and test sets)
* Choosing a model for our problem
* Fit the model to the data and use it to make a prediction
* Evaluate the model
* Experiment to improve
* Save a model for someone else to use
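The last step in the list, saving a model, could look something like the sketch below. It is only an illustration under a few assumptions: the data is synthetic (generated with `make_classification`, not the heart disease data), the model is deliberately small, and the file name `random_forest_model.pkl` is hypothetical. It uses Python's built-in `pickle` module; `joblib` is a common alternative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)

# Fit a small model
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save the fitted model to disk (hypothetical file name, written to the temp directory)
model_path = os.path.join(tempfile.gettempdir(), "random_forest_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that it makes the same predictions
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

same_predictions = np.array_equal(model.predict(X_demo), loaded_model.predict(X_demo))
print(f"Loaded model reproduces predictions: {same_predictions}")

# Clean up the demo file
os.remove(model_path)
```

Only unpickle files from sources you trust, and load a model with the same scikit-learn version it was saved with.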

##Random Forest Classifier Workflow for Classifying Heart Disease

###1. Get the data ready

In [23]:
import pandas as pd

In [24]:
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [25]:
#Create X (all the feature columns)
X = heart_disease.drop("target",axis=1)

#Create y (the target column)
y = heart_disease["target"]

#Check the head of DataFrame
X.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2


In [26]:
#Check the head and value counts
y.head(), y.value_counts()

(0    1
 1    1
 2    1
 3    1
 4    1
 Name: target, dtype: int64,
 target
 1    165
 0    138
 Name: count, dtype: int64)

In [27]:
#Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
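The split above is random, so the rows that land in each set change from run to run. One way to make it repeatable is sketched below, using the `random_state` and `stratify` parameters of `train_test_split`. The data here is a synthetic stand-in with the same dimensions as the heart disease data (303 rows, 13 features), and the seed values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same dimensions as the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# random_state fixes the shuffle; stratify keeps class proportions similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(f"Positive class share in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
print(f"Splits identical: {np.array_equal(X_te, X_te2)}")
```

Stratifying is mostly useful when the classes are imbalanced; with a roughly even target it changes little, but it does no harm.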

###2. Choose the model and hyperparameters

In [28]:
#This is a classification problem, so use RandomForestClassifier (an ML algorithm for classification)
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [29]:
#Current hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
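The values shown by `get_params()` are the defaults. As a small sketch, hyperparameters can be set when the model is created or changed later with `set_params()`. The specific values below (150, 5, 2) are arbitrary and only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters can be passed when the model is created...
clf_demo = RandomForestClassifier(n_estimators=150, max_depth=5)

# ...or changed afterwards with set_params (takes effect the next time fit is called)
clf_demo.set_params(min_samples_leaf=2)

params = clf_demo.get_params()
print(params["n_estimators"], params["max_depth"], params["min_samples_leaf"])
```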

###3. Fit the model to the data and use it to make a prediction

In [30]:
clf.fit(X=X_train, y=y_train)

In [31]:
#To predict labels, the input data must have the same columns as X_train
X_test.head()

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
163   38    1   2       138   175    0        1      173      0      0.0      2   4     2
33    54    1   2       125   273    0        0      152      0      0.5      0   1     2
15    50    0   2       120   219    0        1      158      0      1.6      1   0     2
49    53    0   0       138   234    0        0      160      0      0.0      2   0     2
57    45    1   0       115   260    0        0      185      0      0.0      2   0     2


In [32]:
#Use the model to make a prediction on test data
y_preds = clf.predict(X=X_test)
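`predict()` returns a single label per row. If you also want to know how confident the model is, `predict_proba()` returns a probability for each class. The sketch below shows how the two relate for a random forest. It uses synthetic stand-in data and its own variable names (`clf_demo`, `X_te`), which are assumptions and not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

clf_demo = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# predict returns one label per row
labels = clf_demo.predict(X_te)

# predict_proba returns one row per sample and one column per class,
# with columns ordered as in clf_demo.classes_
probs = clf_demo.predict_proba(X_te)

print(labels[:5])
print(probs[:5])

# The predicted label is the class with the highest probability
labels_from_probs = clf_demo.classes_[np.argmax(probs, axis=1)]
print(f"Labels match argmax of probabilities: {np.array_equal(labels, labels_from_probs)}")
```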

###4. Evaluate the model

In [33]:
#Evaluate the model on training set
train_acc = clf.score(X=X_train, y=y_train)
print(f"The model's accuracy on the training dataset: {train_acc*100}%")

The model's accuracy on the training dataset: 100.0%


In [34]:
#Evaluate the model on test dataset
test_acc = clf.score(X=X_test, y=y_test)
print(f"The model's accuracy on the testing dataset is: {test_acc*100:.2f}%")

The model's accuracy on the testing dataset is: 82.89%


All of the following classification metrics come from the sklearn.metrics module:

* classification_report(y_true, y_pred) - Builds a text report showing the main classification metrics, such as precision, recall and F1-score.
* confusion_matrix(y_true, y_pred) - Creates a confusion matrix to compare predictions to truth labels.
* accuracy_score(y_true, y_pred) - Computes the accuracy score (the default metric returned by a classifier's score method).

In [35]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#Create a classification report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.85      0.71      0.77        31
           1       0.82      0.91      0.86        45

    accuracy                           0.83        76
   macro avg       0.83      0.81      0.82        76
weighted avg       0.83      0.83      0.83        76



In [36]:
#Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[22,  9],
       [ 4, 41]])

In [37]:
#Compute the accuracy score
accuracy_score(y_test, y_preds)

0.8289473684210527
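To see where the numbers in the classification report come from, the metrics can be worked out by hand from the confusion matrix above. In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels. The sketch below recomputes accuracy plus the precision, recall and F1-score for class 1, using plain Python and the matrix values copied from the output.

```python
# Confusion matrix from the output above: rows are true labels, columns are predicted labels
conf_mat = [[22, 9],
            [4, 41]]

tn, fp = conf_mat[0]  # true class 0: predicted 0, predicted 1
fn, tp = conf_mat[1]  # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)  # of all rows predicted as 1, the share that are truly 1
recall_1 = tp / (tp + fn)     # of all rows that are truly 1, the share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Accuracy: {accuracy:.4f}")
print(f"Class 1 precision: {precision_1:.2f}, recall: {recall_1:.2f}, F1: {f1_1:.2f}")
```

The results agree with the class 1 row of the classification report (0.82, 0.91, 0.86) and with `accuracy_score`.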

###5. Experiment to improve

Improving a model can be approached in two ways:

1. From a model perspective.
2. From a data perspective.

In [40]:
#Try different numbers of estimators
np.random.seed(42)
for i in range(100, 200, 10):
  print(f"Trying model with {i} estimators...")
  model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
  print(f"Model accuracy on test set: {model.score(X_test, y_test)* 100:.2f}%")
  print("")

Trying model with 100 estimators...
Model accuracy on test set: 82.89%

Trying model with 110 estimators...
Model accuracy on test set: 81.58%

Trying model with 120 estimators...
Model accuracy on test set: 81.58%

Trying model with 130 estimators...
Model accuracy on test set: 82.89%

Trying model with 140 estimators...
Model accuracy on test set: 81.58%

Trying model with 150 estimators...
Model accuracy on test set: 85.53%

Trying model with 160 estimators...
Model accuracy on test set: 82.89%

Trying model with 170 estimators...
Model accuracy on test set: 82.89%

Trying model with 180 estimators...
Model accuracy on test set: 82.89%

Trying model with 190 estimators...
Model accuracy on test set: 82.89%
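The loop above relies on `np.random.seed(42)`, which sets NumPy's global random state once for the whole cell. An alternative, sketched below, is to pass `random_state` to each model so that every fit is reproducible on its own. The data is a synthetic stand-in and the loop covers a shorter range to keep the sketch quick; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

results = {}
for n in range(100, 200, 50):  # tries 100 and 150 estimators
    # random_state makes each model reproducible regardless of the global NumPy seed
    model = RandomForestClassifier(n_estimators=n, random_state=42).fit(X_tr, y_tr)
    results[n] = model.score(X_te, y_te)
    print(f"{n} estimators: {results[n]*100:.2f}%")

# Refitting with the same settings gives the same score
repeat = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
print(f"Reproducible: {repeat.score(X_te, y_te) == results[100]}")
```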



In [42]:
from sklearn.model_selection import cross_val_score

#Repeat the experiment with cross-validation
np.random.seed(42)
for i in range(100, 200, 10):
  print(f"Trying model with {i} estimators.....")
  model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)

  #Measure the model score on a single train/test split
  model_score = model.score(X_test, y_test)
  print(f"Model accuracy on single test set split: {model_score*100:.2f}%")

  #Measure the mean cross-validation score across 5 different train and test splits
  cross_val_mean = np.mean(cross_val_score(model, X, y, cv=5))
  print(f"5-fold cross-validation score: {cross_val_mean*100 :.2f}%")

  print("")



Trying model with 100 estimators.....
Model accuracy on single test set split: 82.89%
5-fold cross-validation score: 82.15%

Trying model with 110 estimators.....
Model accuracy on single test set split: 84.21%
5-fold cross-validation score: 81.17%

Trying model with 120 estimators.....
Model accuracy on single test set split: 81.58%
5-fold cross-validation score: 83.16%

Trying model with 130 estimators.....
Model accuracy on single test set split: 80.26%
5-fold cross-validation score: 83.14%

Trying model with 140 estimators.....
Model accuracy on single test set split: 84.21%
5-fold cross-validation score: 82.48%

Trying model with 150 estimators.....
Model accuracy on single test set split: 81.58%
5-fold cross-validation score: 80.17%

Trying model with 160 estimators.....
Model accuracy on single test set split: 82.89%
5-fold cross-validation score: 80.83%

Trying model with 170 estimators.....
Model accuracy on single test set split: 82.89%
5-fold cross-validation score: 81.83%
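`cross_val_score` returns one score per fold, and the loop above reports only their mean. Looking at the individual scores gives a feel for how much the result depends on the split. The sketch below uses synthetic stand-in data (an assumption). It also shows that the model does not need to be fitted beforehand, because `cross_val_score` clones the estimator and fits the clone on each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# An unfitted model is enough: cross_val_score clones it and fits the clone on each fold
model = RandomForestClassifier(n_estimators=100, random_state=42)
fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(f"Per-fold accuracy: {np.round(fold_scores, 3)}")
print(f"Mean: {fold_scores.mean()*100:.2f}%, standard deviation: {fold_scores.std()*100:.2f}%")
```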



In [43]:
#Another way to do it: GridSearchCV
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

#Define the parameters to search over in a dictionary
#(keys can be any of the target model's hyperparameters)
param_grid = {'n_estimators': list(range(100, 200, 10))}

#Setup the grid search
grid = GridSearchCV(estimator=RandomForestClassifier(),
                    param_grid=param_grid,
                    cv=5,
                    verbose=1)

#Fit the grid search to the data
grid.fit(X, y)

#Find the best parameters
print(f"The best parameter values are: {grid.best_params_}")
print(f"With a score of: {grid.best_score_*100:.2f}%")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
The best parameter values are: {'n_estimators': 120}
With a score of: 82.82%
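The grid search above is fitted on all of `X` and `y`, so no data is left over to check the chosen model on unseen rows. One possible variation, sketched below, searches on the training split only and then scores the best model on the test split. The data is synthetic, and the grid values and `cv=3` are small illustrative assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Small illustrative grid (values are assumptions): 2 x 2 = 4 candidates
param_grid = {"n_estimators": [50, 100],
              "max_depth": [3, None]}

grid_demo = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                         param_grid=param_grid,
                         cv=3)

# Search on the training data only, so the test set stays unseen
grid_demo.fit(X_tr, y_tr)

print(f"Best parameters: {grid_demo.best_params_}")
print(f"Best cross-validation score: {grid_demo.best_score_*100:.2f}%")

# With the default refit=True, best_estimator_ is refitted on all of X_tr
test_score = grid_demo.best_estimator_.score(X_te, y_te)
print(f"Accuracy on held-out test set: {test_score*100:.2f}%")
```

Searching over more than one hyperparameter multiplies the number of candidates, so the number of fits grows quickly with the grid size.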
