### Experiment 4: Build a classification model by using different machine learning algorithms.

## Software Used: Google Colaboratory

---
# **PyCaret for Classification**
---
- PyCaret bundles many machine learning algorithms behind a single interface.
- Only three lines of code are required to compare 20 ML models.
- PyCaret is available for:
    - Classification
    - Regression
    - Clustering

---

### **Self Learning Resource**
1. Tutorial on Pycaret <a href="https://pycaret.readthedocs.io/en/latest/tutorials.html"> Click Here</a> 

2. Documentation on Pycaret-Classification: <a href="https://pycaret.org/Classification/"> Click Here </a>

---

### **In this experiment we will learn:**

- Getting Data: How to import data from the PyCaret repository
- Setting up Environment: How to set up an experiment in PyCaret and get started with building classification models
- Create Model: How to create a model, perform cross validation and evaluate classification metrics
- Tune Model: How to automatically tune the hyperparameters of a classification model
- Plot Model: How to analyze model performance using various plots
- Finalize Model: How to finalize the best model at the end of the experiment
- Predict Model: How to make predictions on new / unseen data
- Save / Load Model: How to save / load a model for future use
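Cross validation, mentioned in the list above, scores a model on several train/validation splits and averages the results. The sketch below is a plain-Python illustration of k-fold splitting, not PyCaret's own code. The function name `k_fold_indices` is hypothetical, and the sketch neither shuffles nor stratifies, which a library splitter may do. The 768 rows match the size of the diabetes dataset listed in the dataset index of this experiment.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices) pairs for k folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        validation_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, validation_idx))
        start += size
    return folds


# Illustrative choice: 10 folds over 768 rows.
folds = k_fold_indices(768, 10)
for fold_number, (train_idx, validation_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(validation_idx))
```

Each row lands in exactly one validation fold, so every observation is used for validation once and for training in the remaining folds.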

---



#### **(a) Install PyCaret**

In [None]:
!pip install pycaret &> /dev/null
print("PyCaret installed successfully!!")

PyCaret installed successfully!!


#### **(b) Check the PyCaret version**

In [None]:
from pycaret.utils import version
version()

'2.3.1'

---
# **1. Classification: Basics**
---

### **1.1 Loading a Dataset from PyCaret**

In [None]:
from pycaret.datasets import get_data

# No output

---
### **1.2 Get the list of datasets available in pycaret (55)**
---

In [None]:
# Internet connection is required
dataSets = get_data('index')

Unnamed: 0,Dataset,Data Types,Default Task,Target Variable 1,Target Variable 2,# Instances,# Attributes,Missing Values
0,anomaly,Multivariate,Anomaly Detection,,,1000,10,N
1,france,Multivariate,Association Rule Mining,InvoiceNo,Description,8557,8,N
2,germany,Multivariate,Association Rule Mining,InvoiceNo,Description,9495,8,N
3,bank,Multivariate,Classification (Binary),deposit,,45211,17,N
4,blood,Multivariate,Classification (Binary),Class,,748,5,N
5,cancer,Multivariate,Classification (Binary),Class,,683,10,N
6,credit,Multivariate,Classification (Binary),default,,24000,24,N
7,diabetes,Multivariate,Classification (Binary),Class variable,,768,9,N
8,electrical_grid,Multivariate,Classification (Binary),stabf,,10000,14,N
9,employee,Multivariate,Classification (Binary),left,,14999,10,N


---
### **1.3 Get diabetes dataset**
---

In [None]:
diabetesDataSet = get_data("diabetes")    # Row 7 in the dataset index above
# This is a binary classification dataset: the "Class variable" column takes two values (0 and 1).

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Alternatively, read the data from a file:

In [None]:
# import pandas as pd
# diabetesDataSet = pd.read_csv("myFile.csv")

---
### **1.4 Parameter setting for all classification models**
- Train/Test division
- Sampling
- Normalization
- Transformation
- PCA (Dimensionality Reduction)
- Handling of Outliers
- Feature Selection
---

In [None]:
from pycaret.classification import *
s = setup(data=diabetesDataSet, target='Class variable', silent=True)

---
### **1.5 Run and compare the Model Performance**
---

In [None]:
cm = compare_models()
# Explore more parameters

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.758,0.7945,0.5889,0.7104,0.6415,0.4614,0.4678,0.506
lr,Logistic Regression,0.7505,0.797,0.5324,0.7206,0.6083,0.4317,0.4447,0.47
ridge,Ridge Classifier,0.7486,0.0,0.5174,0.7255,0.5993,0.4243,0.4398,0.013
lda,Linear Discriminant Analysis,0.7486,0.794,0.5174,0.7241,0.5986,0.4239,0.4393,0.016
gbc,Gradient Boosting Classifier,0.7376,0.8136,0.5584,0.6812,0.6079,0.4145,0.4229,0.121
et,Extra Trees Classifier,0.7376,0.7846,0.5234,0.6975,0.5904,0.4049,0.4181,0.459
ada,Ada Boost Classifier,0.7282,0.7929,0.5982,0.6434,0.6168,0.407,0.4101,0.104
lightgbm,Light Gradient Boosting Machine,0.7209,0.7852,0.5889,0.6445,0.6073,0.3929,0.3995,0.093
knn,K Neighbors Classifier,0.6907,0.7115,0.5166,0.592,0.5486,0.3158,0.3192,0.114
dt,Decision Tree Classifier,0.6742,0.6473,0.5476,0.5647,0.5533,0.2974,0.2994,0.016


---
### **1.6 Three lines of code for model comparison for "Cancer" dataset**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.classification import *

cancerDataSet = get_data("cancer")
s = setup(data = cancerDataSet, target='Class', silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
svm,SVM - Linear Kernel,0.9707,0.0,0.9574,0.9601,0.9577,0.9353,0.9365,0.015
rf,Random Forest Classifier,0.9707,0.996,0.9632,0.9552,0.9572,0.935,0.9372,0.477
lr,Logistic Regression,0.9665,0.9967,0.9574,0.949,0.9515,0.926,0.9278,0.275
ada,Ada Boost Classifier,0.9664,0.9969,0.9445,0.961,0.9496,0.9245,0.9278,0.107
nb,Naive Bayes,0.9644,0.9815,0.9629,0.9387,0.9492,0.9218,0.9236,0.016
et,Extra Trees Classifier,0.9644,0.9954,0.9449,0.9546,0.9477,0.9208,0.9229,0.464
lightgbm,Light Gradient Boosting Machine,0.9624,0.9949,0.9515,0.9445,0.9464,0.9175,0.9193,0.089
ridge,Ridge Classifier,0.9602,0.0,0.9265,0.9575,0.9402,0.9104,0.9123,0.016
gbc,Gradient Boosting Classifier,0.9561,0.994,0.9331,0.9441,0.9359,0.9026,0.9055,0.119
lda,Linear Discriminant Analysis,0.9539,0.9915,0.9143,0.9505,0.9305,0.8961,0.8982,0.024


---
### **1.7 Three lines of code for model comparison for "Heart Disease" dataset**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.classification import *

heartDiseaseDataSet = get_data("heart_disease")
s = setup(data = heartDiseaseDataSet, target='Disease', silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.8418,0.8963,0.7982,0.8394,0.8124,0.676,0.6833,0.464
lr,Logistic Regression,0.8412,0.905,0.8107,0.8245,0.812,0.6751,0.6818,0.215
et,Extra Trees Classifier,0.8409,0.8911,0.7821,0.8478,0.8067,0.6719,0.6812,0.46
lightgbm,Light Gradient Boosting Machine,0.8351,0.8951,0.8089,0.812,0.8059,0.6637,0.6691,0.028
ridge,Ridge Classifier,0.8307,0.0,0.7964,0.8152,0.7995,0.6531,0.66,0.014
lda,Linear Discriminant Analysis,0.8307,0.8948,0.7964,0.8152,0.7995,0.6531,0.66,0.015
gbc,Gradient Boosting Classifier,0.8249,0.8739,0.7571,0.8153,0.7826,0.6363,0.6403,0.082
nb,Naive Bayes,0.798,0.883,0.8964,0.7261,0.7924,0.604,0.6331,0.015
ada,Ada Boost Classifier,0.776,0.8347,0.7054,0.7554,0.7238,0.5365,0.5435,0.092
dt,Decision Tree Classifier,0.7664,0.7632,0.7446,0.7061,0.7227,0.521,0.5244,0.016


---
# **2. Classification: Advanced - 1**
---

### **2.1 Model Performance using data "Normalization"**

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.
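As a rough illustration of what z-score normalization does to a single column, the sketch below rescales values to mean 0 and standard deviation 1 in plain Python. It is not PyCaret's internal code. The helper name `zscore` is hypothetical, and the glucose values are the five rows shown in the diabetes preview earlier in this experiment. The population standard deviation is used, which is the convention scikit-learn's `StandardScaler` follows.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Plasma glucose values from the first five rows of the diabetes preview.
glucose = [148, 85, 183, 89, 137]
scaled = zscore(glucose)
print([round(v, 3) for v in scaled])
```

The relative ordering and spacing of the values are preserved; only the scale changes, which is why models sensitive to feature magnitude (such as KNN or logistic regression) tend to benefit most.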

In [None]:
# normalize_method options in PyCaret: 'zscore' (default), 'minmax', 'maxabs', 'robust'
s = setup(data=diabetesDataSet, target='Class variable', normalize = True, normalize_method = 'zscore', silent=True)
cm = compare_models()


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7689,0.8156,0.531,0.7122,0.5968,0.4428,0.4586,0.271
ridge,Ridge Classifier,0.7652,0.0,0.5255,0.7138,0.5908,0.4342,0.4524,0.014
gbc,Gradient Boosting Classifier,0.7579,0.8225,0.5647,0.6559,0.6029,0.4318,0.4366,0.12
lda,Linear Discriminant Analysis,0.7559,0.8123,0.5255,0.6909,0.5814,0.4164,0.4332,0.017
lightgbm,Light Gradient Boosting Machine,0.7523,0.8127,0.5869,0.6379,0.6074,0.4282,0.4315,0.094
rf,Random Forest Classifier,0.7468,0.8042,0.4977,0.6563,0.5524,0.3863,0.399,0.511
knn,K Neighbors Classifier,0.745,0.7424,0.5147,0.6508,0.5615,0.3899,0.4003,0.119
ada,Ada Boost Classifier,0.7393,0.7861,0.5709,0.6191,0.5868,0.3992,0.4043,0.106
et,Extra Trees Classifier,0.7281,0.7889,0.4248,0.6438,0.5067,0.3313,0.347,0.462
svm,SVM - Linear Kernel,0.7057,0.0,0.5209,0.5871,0.5333,0.3252,0.3379,0.015


---
### **2.2 Model Performance using "Feature Selection"**
---

Feature selection is a core concept in machine learning: the features used to train a model strongly influence the performance it can achieve. The goal is to find the subset of features that lets you build a useful model of the phenomenon under study. The `feature_selection_threshold` parameter sets the threshold used for feature selection (including newly created polynomial features); a higher value results in a larger feature space. It is recommended to run multiple trials with different values of `feature_selection_threshold`.

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', feature_selection = True, feature_selection_threshold = 0.6, silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.782,0.8187,0.5807,0.7553,0.646,0.4932,0.5098,0.209
lda,Linear Discriminant Analysis,0.7819,0.8142,0.5863,0.7503,0.647,0.4939,0.51,0.016
ridge,Ridge Classifier,0.78,0.0,0.5807,0.7486,0.6437,0.4893,0.505,0.014
rf,Random Forest Classifier,0.769,0.8294,0.5754,0.7304,0.6306,0.4672,0.4837,0.513
lightgbm,Light Gradient Boosting Machine,0.7596,0.8111,0.5915,0.6914,0.6289,0.4536,0.4635,0.046
gbc,Gradient Boosting Classifier,0.756,0.8196,0.5696,0.6849,0.6147,0.4393,0.4481,0.119
ada,Ada Boost Classifier,0.7503,0.8136,0.5921,0.6722,0.6198,0.4365,0.4461,0.103
knn,K Neighbors Classifier,0.7375,0.7533,0.5418,0.6467,0.5828,0.3947,0.4021,0.12
et,Extra Trees Classifier,0.7374,0.8036,0.4886,0.6833,0.5597,0.3804,0.3977,0.462
nb,Naive Bayes,0.6982,0.7491,0.3029,0.6462,0.4035,0.2353,0.2706,0.015


---
### **2.3 Model Performance using "Outlier Removal"**
---

Sometimes a dataset contains extreme values that fall outside the expected range and are unlike the rest of the data. These are called outliers, and model performance can often be improved by understanding and even removing them. The `outliers_threshold` parameter sets the proportion of rows treated as outliers; `0.05` is the default value.
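To make the idea of a removal threshold concrete, the sketch below drops the fraction of values that lie farthest from the median. This is a deliberately simple stand-in and is not the detection method PyCaret applies internally. The function name `drop_extreme_fraction` and the sample values are made up for illustration.

```python
from statistics import median


def drop_extreme_fraction(values, threshold=0.05):
    """Drop the `threshold` fraction of values farthest from the median."""
    n_drop = int(len(values) * threshold)
    if n_drop == 0:
        return list(values)
    center = median(values)
    # Rank positions by distance from the median, farthest first.
    ranked = sorted(range(len(values)),
                    key=lambda i: abs(values[i] - center),
                    reverse=True)
    to_drop = set(ranked[:n_drop])
    return [v for i, v in enumerate(values) if i not in to_drop]


# Hypothetical column: 19 ordinary readings plus one extreme value.
readings = [30 + i for i in range(19)] + [500]
cleaned = drop_extreme_fraction(readings, threshold=0.05)
print(len(readings), len(cleaned), max(cleaned))
```

With 20 values and a threshold of 0.05, exactly one value is removed, so the threshold directly controls how much of the training data is discarded.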

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.7745,0.8287,0.5647,0.7058,0.624,0.4663,0.4746,0.498
lr,Logistic Regression,0.7686,0.8272,0.5588,0.6927,0.6133,0.4522,0.461,0.205
gbc,Gradient Boosting Classifier,0.7686,0.8209,0.6,0.6842,0.6315,0.4651,0.4732,0.119
lda,Linear Discriminant Analysis,0.7627,0.8275,0.5412,0.6822,0.6001,0.4357,0.4438,0.016
ridge,Ridge Classifier,0.7608,0.0,0.5235,0.687,0.5897,0.4265,0.4372,0.014
ada,Ada Boost Classifier,0.7529,0.7839,0.5882,0.6512,0.6106,0.4318,0.4386,0.104
et,Extra Trees Classifier,0.7529,0.8123,0.5176,0.6758,0.5785,0.4097,0.4214,0.46
knn,K Neighbors Classifier,0.7314,0.7798,0.5647,0.6045,0.5776,0.383,0.3875,0.115
lightgbm,Light Gradient Boosting Machine,0.7275,0.8071,0.5412,0.6075,0.5652,0.3696,0.3759,0.046
nb,Naive Bayes,0.7157,0.7654,0.6941,0.5638,0.6185,0.3969,0.4063,0.015


---
### **2.4 Model Performance using "Transformation"**
---

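In PyCaret, setting `transformation = True` applies a power transform that reshapes each numeric feature so that its distribution is closer to Gaussian (normal). This differs from normalization, which only rescales values: a transformation changes the shape of the distribution. The `transformation_method` parameter selects the transform; `'yeo-johnson'` is used here.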
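The `yeo-johnson` method named in the `setup` call for this section is a power transform. The sketch below applies the standard Yeo-Johnson formula for one fixed exponent in plain Python. The value `lmbda = 0.5` is an arbitrary assumption chosen for illustration; a library would normally estimate the exponent from the data. The insulin values are the five rows shown in the diabetes preview earlier in this experiment.

```python
from math import log


def yeo_johnson(y, lmbda):
    """Apply the Yeo-Johnson power transform to one value for a fixed lambda."""
    if y >= 0:
        if lmbda == 0:
            return log(y + 1)
        return ((y + 1) ** lmbda - 1) / lmbda
    # Negative inputs use the mirrored form with exponent (2 - lambda).
    if lmbda == 2:
        return -log(-y + 1)
    return -(((-y + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# 2-hour serum insulin values from the first five rows of the diabetes preview.
insulin = [0, 0, 0, 94, 168]
transformed = [yeo_johnson(v, 0.5) for v in insulin]
print([round(v, 3) for v in transformed])
```

An exponent below 1 compresses large values more than small ones, which pulls in a long right tail. With `lmbda = 1` the transform leaves values unchanged.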

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', transformation = True, transformation_method = 'yeo-johnson', silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7619,0.8259,0.5995,0.7401,0.6498,0.4741,0.4881,0.025
lda,Linear Discriminant Analysis,0.7526,0.8226,0.5895,0.7253,0.6359,0.4537,0.4684,0.016
gbc,Gradient Boosting Classifier,0.7507,0.8057,0.5987,0.7053,0.6427,0.4533,0.4606,0.117
ridge,Ridge Classifier,0.7471,0.0,0.5795,0.7171,0.6266,0.4407,0.4554,0.015
lightgbm,Light Gradient Boosting Machine,0.7451,0.8001,0.6089,0.6791,0.6375,0.443,0.4478,0.049
ada,Ada Boost Classifier,0.7358,0.8092,0.6245,0.6678,0.6372,0.4311,0.4379,0.104
svm,SVM - Linear Kernel,0.7283,0.0,0.5845,0.6499,0.6063,0.4026,0.4104,0.018
rf,Random Forest Classifier,0.7283,0.8112,0.5384,0.6701,0.5932,0.3939,0.4015,0.504
knn,K Neighbors Classifier,0.7281,0.7626,0.5134,0.678,0.5816,0.3868,0.3965,0.116
dt,Decision Tree Classifier,0.7209,0.7002,0.6192,0.6142,0.6112,0.3961,0.3988,0.016


---
### **2.5 Model Performance using "PCA"**
---

An important machine learning method for dimensionality reduction is called Principal Component Analysis. It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.
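The projection described above can be sketched in plain Python for the first principal component only. The sketch centers the data, builds the covariance matrix, and uses power iteration to find the direction of greatest variance. It is not PyCaret's implementation, and the function name `first_principal_component` is hypothetical. The two columns are glucose and BMI from the five rows of the diabetes preview; in practice the features are usually standardized first so that no single unit dominates.

```python
from statistics import mean, variance


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) for the first principal component."""
    n = len(rows)
    n_features = len(rows[0])
    means = [mean(col) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(n_features)]
           for i in range(n_features)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * n_features
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n_features))
               for i in range(n_features)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    scores = [sum(c * v for c, v in zip(row, vec)) for row in centered]
    return vec, scores


glucose = [148, 85, 183, 89, 137]
bmi = [33.6, 26.6, 23.3, 28.1, 43.1]
component, scores = first_principal_component(list(zip(glucose, bmi)))
print([round(c, 4) for c in component])
print(round(variance(scores), 2))
```

The variance of the projected scores is at least as large as the variance of any single original column, which is the sense in which the first component keeps the most information in one dimension.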

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', pca = True, pca_method = 'linear', silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ridge,Ridge Classifier,0.7673,0.0,0.5137,0.7351,0.5997,0.4443,0.4614,0.013
lda,Linear Discriminant Analysis,0.7654,0.8171,0.5243,0.7262,0.6034,0.4441,0.4595,0.016
lr,Logistic Regression,0.7636,0.8162,0.5243,0.7239,0.6021,0.4409,0.4566,0.021
rf,Random Forest Classifier,0.7597,0.8129,0.5725,0.6909,0.6206,0.4479,0.456,0.513
dt,Decision Tree Classifier,0.7561,0.7287,0.6415,0.6548,0.6425,0.4587,0.463,0.016
et,Extra Trees Classifier,0.7559,0.8187,0.5617,0.6811,0.6131,0.4376,0.4438,0.464
ada,Ada Boost Classifier,0.7523,0.792,0.5833,0.6672,0.6185,0.437,0.4422,0.102
qda,Quadratic Discriminant Analysis,0.7522,0.8112,0.5287,0.6902,0.5888,0.4182,0.4319,0.015
nb,Naive Bayes,0.7466,0.8048,0.5453,0.6697,0.5915,0.4127,0.4235,0.015
gbc,Gradient Boosting Classifier,0.7373,0.8035,0.5728,0.6396,0.5993,0.4058,0.4112,0.117


---
### **2.6 Model Performance using "Outlier Removal" + "Normalization"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, normalize = True, normalize_method = 'zscore', silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7686,0.8175,0.5755,0.7145,0.6307,0.4672,0.4771,0.022
ridge,Ridge Classifier,0.7667,0.0,0.5699,0.7098,0.6272,0.4621,0.4711,0.014
lda,Linear Discriminant Analysis,0.7647,0.8136,0.5699,0.7038,0.6249,0.4582,0.4664,0.016
rf,Random Forest Classifier,0.7569,0.8171,0.5588,0.6944,0.615,0.4414,0.4496,0.503
gbc,Gradient Boosting Classifier,0.751,0.8164,0.5928,0.6672,0.6224,0.4391,0.4443,0.116
ada,Ada Boost Classifier,0.7451,0.8023,0.5876,0.6514,0.6128,0.4254,0.4301,0.103
lightgbm,Light Gradient Boosting Machine,0.7451,0.804,0.6258,0.6429,0.6304,0.437,0.44,0.048
knn,K Neighbors Classifier,0.7333,0.7699,0.5474,0.6505,0.5876,0.394,0.402,0.116
et,Extra Trees Classifier,0.7333,0.8044,0.5186,0.6466,0.5734,0.3839,0.3898,0.463
svm,SVM - Linear Kernel,0.698,0.0,0.4686,0.589,0.5061,0.2997,0.3094,0.014


---
### **2.7 Model Performance using "Outlier Removal" + "Normalization" + "Transformation"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, normalize = True, normalize_method = 'zscore', transformation = True, transformation_method = 'yeo-johnson', silent=True)
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.7471,0.7826,0.5471,0.662,0.5884,0.4093,0.42,0.104
lr,Logistic Regression,0.7431,0.8059,0.5,0.6535,0.5585,0.3843,0.3954,0.022
gbc,Gradient Boosting Classifier,0.7431,0.7908,0.5294,0.6412,0.577,0.3957,0.4012,0.117
lda,Linear Discriminant Analysis,0.7353,0.7998,0.4765,0.6406,0.5397,0.3617,0.3729,0.016
rf,Random Forest Classifier,0.7333,0.7757,0.4588,0.6517,0.5314,0.3541,0.3685,0.511
ridge,Ridge Classifier,0.7314,0.0,0.4588,0.6377,0.5266,0.3483,0.3609,0.013
et,Extra Trees Classifier,0.7235,0.746,0.3882,0.6389,0.4782,0.3077,0.3268,0.461
knn,K Neighbors Classifier,0.7176,0.7126,0.4235,0.6116,0.4942,0.3098,0.3223,0.119
nb,Naive Bayes,0.7078,0.7247,0.4588,0.568,0.5039,0.3031,0.3073,0.015
lightgbm,Light Gradient Boosting Machine,0.7,0.7566,0.5059,0.5527,0.5216,0.3064,0.3107,0.05


**Task:** Use different machine learning models to classify a dataset and use classification based Model Evaluation Parameters for performance analysis.
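As a starting point for the performance analysis in this task, the sketch below computes several of the metrics that appear in the comparison tables (Accuracy, Recall, Prec., F1, Kappa, MCC) from the four counts of a binary confusion matrix. The function name `binary_metrics` and the example counts are made up for illustration. AUC is left out because it needs predicted scores rather than counts.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    # Matthews correlation coefficient.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Accuracy": accuracy, "Recall": recall, "Prec.": precision,
            "F1": f1, "Kappa": kappa, "MCC": mcc}


# Hypothetical counts for a 200-row validation set.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing Accuracy with Kappa and MCC is useful on imbalanced data, where a high accuracy can hide weak recall on the minority class.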