<a href="https://colab.research.google.com/github/Aleem2/Mastering-ML-using-sklearn/blob/main/Sklearn_template_file.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TL;DR Template for Complete ML Process using SKLearn

This template captures the complete ML process in one notebook and serves as a starting point for getting started with scikit-learn. In particular, it shows how to add cross-validation to the workflow within a single template file. To reuse it, copy the notebook, load your own dataset, and swap in a different estimator.

Hint: How to select an estimator.
https://scikit-learn.org/stable/getting_started.html
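
The kind of swap described above might look like the following sketch. The dataset (load_wine) and the estimator (KNeighborsClassifier with n_neighbors=5) are assumptions chosen purely for illustration; they are not part of the original template.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# assumption: load_wine stands in for "your own dataset"
X_new, y_new = load_wine(return_X_y=True)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_new, y_new, random_state=0)

# assumption: KNeighborsClassifier stands in for "a different estimator"
new_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
new_pipe.fit(Xn_train, yn_train)

new_acc = new_pipe.score(Xn_test, yn_test)  # mean accuracy on the held-out split
print("Test accuracy with the swapped dataset and estimator = " + str(new_acc))
```

Only the data-loading line and the last pipeline step change; the split, fit and scoring calls stay the same.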




In [49]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV # GridSearchCV is to tune the hyper parameters.
import numpy as np # numpy for data wrangling
import pandas as pd # pandas for data wrangling
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the whole pipeline
pipe.fit(X_train, y_train)
# we can now use it like any other estimator
acc_score = accuracy_score(y_test, pipe.predict(X_test))  # argument order is (y_true, y_pred)

result = cross_validate(pipe, X_train, y_train)  # defaults to 5-fold CV

In [50]:
print("\n\n Cross validation scores based on 5-fold CV validation = " + str(result['test_score']))
print("\n\n y_test values = " + str(y_test))
print("\n\n Predicted values based on X_test data = " + str(pipe.predict(X_test)))



 Cross validation scores based on 5-fold CV validation = [0.95652174 0.91304348 0.95454545 0.95454545 0.95454545]


 y_test values = [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]


 Predicted values based on X_test data = [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [67]:
print("Accuracy of the model based on test data = " + str(acc_score)) # accuracy of test data

Accuracy of the model based on test data = 0.9736842105263158
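
The per-fold scores printed earlier can be condensed into a mean and a spread, which is often easier to compare across models. The sketch below is one possible way to do that; the variable name cv_result and the use of return_train_score=True are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# cv=5 is spelled out here; it matches the default used in the template
cv_result = cross_validate(pipe, X_train, y_train, cv=5, return_train_score=True)

print("Keys returned by cross_validate = " + str(sorted(cv_result.keys())))
print("Mean CV test score = " + str(np.mean(cv_result['test_score'])))
print("Std of CV test scores = " + str(np.std(cv_result['test_score'])))
print("Mean CV train score = " + str(np.mean(cv_result['train_score'])))
```

A large gap between the train and test means would hint at overfitting; a large standard deviation would suggest the score depends heavily on the split.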


# Model selection: choosing estimators and their parameters
scikit-learn provides tools that help select appropriate hyperparameters, such as the regularization value C and the solver. The available hyperparameters differ from model to model, so a systematic approach is needed. The code below broadly follows four steps:

1.  List the hyperparameters using the get_params() method.
2.  Refer to the scikit-learn documentation to identify the relevant parameters and their valid ranges.
3.  Use GridSearchCV to find the best values based on the training data.
4.  Use the cross_validate() function to confirm the chosen configuration and finalize the ML process.

In [52]:
# So what parameters should I be considering?
pipe.get_params()

{'memory': None,
 'steps': [('standardscaler', StandardScaler()),
  ('logisticregression', LogisticRegression())],
 'verbose': False,
 'standardscaler': StandardScaler(),
 'logisticregression': LogisticRegression(),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'logisticregression__C': 1.0,
 'logisticregression__class_weight': None,
 'logisticregression__dual': False,
 'logisticregression__fit_intercept': True,
 'logisticregression__intercept_scaling': 1,
 'logisticregression__l1_ratio': None,
 'logisticregression__max_iter': 100,
 'logisticregression__multi_class': 'auto',
 'logisticregression__n_jobs': None,
 'logisticregression__penalty': 'l2',
 'logisticregression__random_state': None,
 'logisticregression__solver': 'lbfgs',
 'logisticregression__tol': 0.0001,
 'logisticregression__verbose': 0,
 'logisticregression__warm_start': False}
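
The names in this listing follow the pattern step name, double underscore, parameter name. As a small sketch of how those names can be used, the snippet below filters the LogisticRegression entries and changes one of them with set_params(); the value C=0.5 is arbitrary and only serves the illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# nested parameters follow the pattern <step name>__<parameter name>
nested = sorted(name for name in pipe.get_params() if "__" in name)
lr_params = [name for name in nested if name.startswith("logisticregression__")]
print("LogisticRegression parameters in the pipeline = " + str(lr_params))

# the same names can be passed to set_params to change a value in place
# (C=0.5 is an arbitrary illustrative value)
pipe.set_params(logisticregression__C=0.5)
print("C after set_params = " + str(pipe.get_params()["logisticregression__C"]))
```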

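To optimize the hyperparameters, the code below uses GridSearchCV. Note that the search runs on a bare LogisticRegression estimator, without the StandardScaler step, so the parameter names in the grid carry no step prefix.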

In [59]:
# grid search allows for hyperparameter optimisation
# Candidate parameters to optimise (names and defaults as reported by get_params()):
#'logisticregression__solver': 'lbfgs'
#'logisticregression__penalty': 'l2'
#'logisticregression__max_iter': 100
#'logisticregression__C': 1.0

# note: this rebinds `pipe` to a bare LogisticRegression (no StandardScaler step)
pipe = LogisticRegression()
Cs = np.linspace(0.1, 0.9, 10)
print("\n \n Cs values being fed in the search = " + str(Cs))
clf = GridSearchCV(
    estimator=pipe,
    param_grid=dict(C=Cs, solver=['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'], max_iter=[2000]),
)
clf.fit(X_train, y_train)
print("\n\n Verifying the accuracy of the final model = " + str(clf.score(X_test, y_test)))
print("\n\n Best estimator final values = " + str(clf.best_estimator_))


 
 Cs values being fed in the search = [0.1        0.18888889 0.27777778 0.36666667 0.45555556 0.54444444
 0.63333333 0.72222222 0.81111111 0.9       ]


 Verifying the accuracy of the final model = 0.9736842105263158


 Best estimator final values = LogisticRegression(C=0.8111111111111111, max_iter=2000, solver='saga')
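
The search above tunes a bare LogisticRegression. A possible variant, sketched below, searches over the scaled pipeline instead, in which case the grid keys need the step prefix reported by get_params(). The reduced solver list and the names scaled_pipe and pipe_search are assumptions for this example; the results it prints are not the ones recorded in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaled_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# parameter names carry the step prefix reported by get_params()
param_grid = {
    "logisticregression__C": np.linspace(0.1, 0.9, 10),
    "logisticregression__solver": ["lbfgs", "saga"],  # reduced solver list, an assumption
}
pipe_search = GridSearchCV(estimator=scaled_pipe, param_grid=param_grid)  # 5-fold CV by default
pipe_search.fit(X_train, y_train)

print("Best parameters = " + str(pipe_search.best_params_))
print("Best mean CV score = " + str(pipe_search.best_score_))
print("Test accuracy of the refit pipeline = " + str(pipe_search.score(X_test, y_test)))
```

Searching over the pipeline means the scaler is refit inside every CV fold, so the validation folds never influence the scaling statistics.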


Going through the whole process produced the following findings. The choice of the regularization value C and of max_iter are interrelated: if max_iter is too low, the solver does not converge. Some solvers also performed better than others. Next, using the selected hyperparameters, the model is refit and cross-validated on the training data to complete the training process.
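
One way to observe the effect of max_iter on convergence is to capture the ConvergenceWarning that scikit-learn emits when a solver stops early. The sketch below does this for two values of max_iter; the helper name converges, the lbfgs solver and the value max_iter=5 are assumptions picked for the demonstration.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

def converges(max_iter):
    """Return True if fitting finishes without a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        LogisticRegression(solver="lbfgs", C=0.8111, max_iter=max_iter).fit(X, y)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)

# hypothetical values: a deliberately small budget and the one used in the search
for n in (5, 2000):
    print("max_iter = " + str(n) + ", converged = " + str(converges(n)))
```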

In [66]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga',C=0.8111,max_iter=2000)
)
# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the whole pipeline
pipe.fit(X_train, y_train)
# we can now use it like any other estimator
acc_score = accuracy_score(y_test, pipe.predict(X_test))  # argument order is (y_true, y_pred)

result = cross_validate(pipe, X_train, y_train)  # defaults to 5-fold CV

print("\n\n Model Accuracy = " + str(acc_score)) # accuracy of model on the test data
print("\n\n Cross validation scores based on 5-fold CV validation = " + str(result['test_score']))
print("\n\n y_test values = " + str(y_test))
print("\n\n Predicted values based on X_test data = " + str(pipe.predict(X_test)))



 Model Accuracy = 0.9736842105263158


 y_test values = [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]


 Predicted values based on X_test data = [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


# Observations


1.  The solvers lbfgs, sag, newton-cg and saga gave exactly the same scores, while the remaining solvers in the grid (liblinear and newton-cholesky) underperformed.
2.  The final accuracy on the test set in classifying the iris species was 0.9737, using logistic regression with the saga solver. The regularization value C was 0.811 and max_iter was 2000.
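
A per-solver comparison like the one in the first observation can be read from the cv_results_ attribute of a fitted GridSearchCV. The sketch below shows one possible way to tabulate it with pandas. It uses a scaled pipeline and a reduced solver list to keep the run short, both of which are assumptions, so its numbers need not match the ones reported above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# assumption: scaled pipeline and a reduced solver list to keep the run short
search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), LogisticRegression(C=0.8111, max_iter=2000)),
    param_grid={"logisticregression__solver": ["lbfgs", "newton-cg", "sag", "saga"]},
)
search.fit(X_train, y_train)

# cv_results_ holds one row per parameter combination
table = pd.DataFrame(search.cv_results_)[
    ["param_logisticregression__solver", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(table.sort_values("rank_test_score").to_string(index=False))
```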

