<a href="https://colab.research.google.com/github/psrana/Machine-Learning-using-PyCaret/blob/main/02_PyCaret_for_Classification_with_Results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **PyCaret for Classification**
---
- PyCaret bundles many machine learning algorithms behind a single low-code interface.
- Only three lines of code are required to compare 20 ML models.
- PyCaret provides modules for tasks including:
    - Classification
    - Regression
    - Clustering

---

### **Self-learning resources**
1. Tutorials on PyCaret: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html" target="_blank"> Click Here</a>** 

2. Documentation on PyCaret classification: **<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html" target="_blank"> Click Here </a>**

---

### **In this tutorial we will learn:**

- Getting Data
- Setting up Environment
- Create Model
- Tune Model
- Plot Model
- Finalize Model
- Predict Model
- Save / Load Model
---



### **(a) Install PyCaret**

In [1]:
# The output redirect assumes a Unix-like shell such as the one in Colab
!pip install pycaret &> /dev/null
print("PyCaret installed successfully!!")

PyCaret installed successfully!!


The syntax of the command is incorrect.


### **(b) Get the PyCaret version**

In [2]:
from pycaret.utils import version
version()

'2.3.6'

---
# **1. Classification: Basics**
---
### **1.1 Get the list of datasets available in PyCaret (Total Datasets = 55)**




In [3]:
from pycaret.datasets import get_data
dataSets = get_data('index')

| | Dataset | Data Types | Default Task | Target Variable 1 | Target Variable 2 | # Instances | # Attributes | Missing Values |
|---|---|---|---|---|---|---|---|---|
| 0 | anomaly | Multivariate | Anomaly Detection | | | 1000 | 10 | N |
| 1 | france | Multivariate | Association Rule Mining | InvoiceNo | Description | 8557 | 8 | N |
| 2 | germany | Multivariate | Association Rule Mining | InvoiceNo | Description | 9495 | 8 | N |
| 3 | bank | Multivariate | Classification (Binary) | deposit | | 45211 | 17 | N |
| 4 | blood | Multivariate | Classification (Binary) | Class | | 748 | 5 | N |
| 5 | cancer | Multivariate | Classification (Binary) | Class | | 683 | 10 | N |
| 6 | credit | Multivariate | Classification (Binary) | default | | 24000 | 24 | N |
| 7 | diabetes | Multivariate | Classification (Binary) | Class variable | | 768 | 9 | N |
| 8 | electrical_grid | Multivariate | Classification (Binary) | stabf | | 10000 | 14 | N |
| 9 | employee | Multivariate | Classification (Binary) | left | | 14999 | 10 | N |


---
### **1.2 Get the "diabetes" dataset (Step-I)**
---

In [4]:
diabetesDataSet = get_data("diabetes")    # row 7 of the dataset index listed in 1.1

| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |


---
### **1.3 Parameter setting for all models (Step-II)**
---

In [5]:
from pycaret.classification import *
s = setup(data=diabetesDataSet, target='Class variable', silent=True)

# Other setup() parameters worth exploring:
# train_size = 0.7            (fraction of rows used for training)
# data_split_shuffle = False  (split without shuffling the rows)

| | Description | Value |
|---|---|---|
| 0 | session_id | 4997 |
| 1 | Target | Class variable |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (768, 9) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 7 |
| 7 | Categorical Features | 1 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |
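
The `train_size` and `data_split_shuffle` parameters mentioned in the cell above control how the rows are divided into training and hold-out sets. The following is a minimal pure-Python sketch of that idea, not PyCaret's own code. The helper name `split_rows` and its `seed` argument are assumptions made for illustration; a fixed seed plays a role similar to the `session_id` shown in the setup summary.

```python
import random

def split_rows(rows, train_size=0.7, shuffle=True, seed=None):
    """Illustrative train/hold-out split (hypothetical helper, not PyCaret's)."""
    rows = list(rows)                      # copy so the caller's data is untouched
    if shuffle:
        random.Random(seed).shuffle(rows)  # a fixed seed makes the split repeatable
    cut = int(len(rows) * train_size)      # number of rows kept for training
    return rows[:cut], rows[cut:]

# 768 row indices, matching the size of the diabetes dataset
train, test = split_rows(range(768), train_size=0.7, shuffle=False)
print(len(train), len(test))
```

With `shuffle=False` the first 70% of rows go to training in their original order, which matters when the rows are sorted by time or by class.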


---
### **1.4 Run all models (Step-III)**
---

In [6]:
cm = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.7655 | 0.8143 | 0.5743 | 0.6948 | 0.6228 | 0.4568 | 0.4647 | 1.717 |
| rf | Random Forest Classifier | 0.7618 | 0.8077 | 0.5781 | 0.6857 | 0.6189 | 0.4495 | 0.4582 | 0.219 |
| et | Extra Trees Classifier | 0.7581 | 0.8062 | 0.4968 | 0.7147 | 0.5802 | 0.42 | 0.4362 | 0.143 |
| gbc | Gradient Boosting Classifier | 0.7579 | 0.8062 | 0.5953 | 0.6733 | 0.6296 | 0.4511 | 0.4548 | 1.527 |
| lda | Linear Discriminant Analysis | 0.7561 | 0.8102 | 0.5474 | 0.6837 | 0.6029 | 0.4319 | 0.4402 | 0.02 |
| ridge | Ridge Classifier | 0.7524 | 0.0 | 0.5365 | 0.6796 | 0.5943 | 0.4217 | 0.4306 | 0.012 |
| lightgbm | Light Gradient Boosting Machine | 0.7394 | 0.7823 | 0.5895 | 0.6393 | 0.6087 | 0.4147 | 0.419 | 0.102 |
| ada | Ada Boost Classifier | 0.7225 | 0.772 | 0.502 | 0.6289 | 0.5541 | 0.3569 | 0.3642 | 0.072 |
| xgboost | Extreme Gradient Boosting | 0.7208 | 0.7812 | 0.5465 | 0.6132 | 0.5749 | 0.3687 | 0.3723 | 0.254 |
| knn | K Neighbors Classifier | 0.7115 | 0.728 | 0.488 | 0.6193 | 0.5355 | 0.3326 | 0.3438 | 0.025 |
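
To make the columns of the comparison table easier to read, the sketch below computes Accuracy, Recall, Precision, F1, Kappa and MCC from the four cells of a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the example counts are hypothetical; they are not taken from the diabetes results. The 0.0 AUC listed for the Ridge Classifier most likely reflects that this model exposes no probability estimates, not a genuine score.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)                 # share of actual positives found
    precision = tp / (tp + fp)              # share of predicted positives that are right
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would produce
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Accuracy": accuracy, "Recall": recall, "Prec.": precision,
            "F1": f1, "Kappa": kappa, "MCC": mcc}

# Hypothetical counts chosen only to illustrate the formulas
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch has no guards against empty classes, so it assumes every row and column total of the confusion matrix is non-zero.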


---
### **1.5 "Three lines of code" for model comparison on the "Diabetes" dataset**
---



In [7]:
from pycaret.datasets import get_data
from pycaret.classification import *

diabetesDataSet = get_data("diabetes")
setup(data=diabetesDataSet, target='Class variable', silent=True)
cm = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.7727 | 0.8372 | 0.5737 | 0.742 | 0.6402 | 0.4789 | 0.4918 | 0.238 |
| lda | Linear Discriminant Analysis | 0.7709 | 0.8242 | 0.5737 | 0.7364 | 0.6382 | 0.4751 | 0.4874 | 0.013 |
| ridge | Ridge Classifier | 0.769 | 0.0 | 0.5579 | 0.7391 | 0.6295 | 0.4672 | 0.481 | 0.008 |
| lightgbm | Light Gradient Boosting Machine | 0.7575 | 0.8042 | 0.6211 | 0.666 | 0.6395 | 0.4582 | 0.4612 | 0.084 |
| rf | Random Forest Classifier | 0.7558 | 0.8181 | 0.5737 | 0.6902 | 0.6202 | 0.4441 | 0.4521 | 0.184 |
| gbc | Gradient Boosting Classifier | 0.7558 | 0.8325 | 0.5947 | 0.6827 | 0.6254 | 0.4478 | 0.4582 | 0.08 |
| xgboost | Extreme Gradient Boosting | 0.7521 | 0.8047 | 0.6421 | 0.6589 | 0.6448 | 0.4555 | 0.4604 | 0.181 |
| knn | K Neighbors Classifier | 0.7392 | 0.7458 | 0.5895 | 0.6431 | 0.6106 | 0.4167 | 0.4205 | 0.036 |
| ada | Ada Boost Classifier | 0.7352 | 0.7802 | 0.5579 | 0.6443 | 0.5947 | 0.4009 | 0.4049 | 0.091 |
| et | Extra Trees Classifier | 0.7149 | 0.7917 | 0.5158 | 0.6276 | 0.5571 | 0.3519 | 0.3607 | 0.167 |


---
### **1.6 "Three lines of code" for model comparison on the "Cancer" dataset**
---



In [8]:
from pycaret.datasets import get_data
from pycaret.classification import *

cancerDataSet = get_data("cancer")
setup(data = cancerDataSet, target='Class', silent=True)
cm = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.9624 | 0.996 | 0.9474 | 0.9519 | 0.9474 | 0.9182 | 0.9208 | 0.018 |
| et | Extra Trees Classifier | 0.9623 | 0.9948 | 0.9474 | 0.9533 | 0.9469 | 0.9178 | 0.9217 | 0.142 |
| rf | Random Forest Classifier | 0.9602 | 0.9932 | 0.9415 | 0.9533 | 0.9437 | 0.913 | 0.9172 | 0.144 |
| nb | Naive Bayes | 0.9583 | 0.971 | 0.9654 | 0.9264 | 0.944 | 0.9109 | 0.9131 | 0.009 |
| ridge | Ridge Classifier | 0.9561 | 0.0 | 0.9186 | 0.9623 | 0.9364 | 0.9031 | 0.9074 | 0.009 |
| gbc | Gradient Boosting Classifier | 0.954 | 0.9913 | 0.9301 | 0.9475 | 0.9353 | 0.8997 | 0.9036 | 0.066 |
| lightgbm | Light Gradient Boosting Machine | 0.954 | 0.9935 | 0.9242 | 0.953 | 0.9338 | 0.8987 | 0.904 | 0.117 |
| lda | Linear Discriminant Analysis | 0.952 | 0.9867 | 0.9069 | 0.9623 | 0.9295 | 0.8933 | 0.8986 | 0.017 |
| ada | Ada Boost Classifier | 0.9499 | 0.9893 | 0.9245 | 0.9398 | 0.9295 | 0.8907 | 0.8937 | 0.069 |
| knn | K Neighbors Classifier | 0.9476 | 0.9783 | 0.8889 | 0.9673 | 0.9227 | 0.8834 | 0.889 | 0.036 |


---
# **2. Classification: Working with a user dataset**
---
### **2.1 Download the "diabetes" dataset to the local system**
---


In [9]:
diabetesDataSet.to_csv("diabetesDataSet.csv", index=False)

from google.colab import files
files.download('diabetesDataSet.csv')

ModuleNotFoundError: No module named 'google.colab'
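
The `ModuleNotFoundError` above appears when the cell runs outside Colab, because the `google.colab` package exists only in that environment. One possible guard is sketched below; the helper names `running_in_colab` and `offer_download` are hypothetical and not part of PyCaret or Colab. Outside Colab the CSV written by `to_csv` is already on the local disk, so no download step is needed.

```python
def running_in_colab():
    """Return True only when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (available only inside Colab)
        return True
    except ImportError:      # ModuleNotFoundError is a subclass of ImportError
        return False

def offer_download(path):
    """Trigger a browser download in Colab; otherwise just report the local path."""
    if running_in_colab():
        from google.colab import files
        files.download(path)
        return "downloaded via Colab"
    return f"saved locally as {path}"

print(offer_download("diabetesDataSet.csv"))
```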

---
### **2.2 Upload a "user file" from the user system**
---

In [None]:
from google.colab import files
files.upload()

---
### **2.3 "Read" the uploaded file**
---

In [None]:
import pandas as pd
# Adjust the file name to match the uploaded file (a " (1)" suffix marks a duplicate name)
myDataSet = pd.read_csv('diabetesDataSet (1).csv')
myDataSet.head()

---
### **2.4 "Compare" the model performance**
---

In [None]:
from pycaret.classification import *

setup(data = myDataSet, target='Class variable', silent=True)
cm = compare_models()

---
### **2.5 "Three lines of code" for model comparison on a "user dataset"**

##### Use this when working in **"Anaconda/Jupyter notebook"** on a local machine
---

In [None]:
from pycaret.classification import *
import pandas as pd

# Uncomment and replace the file name and target column with your own
#myDataSet = pd.read_csv("myData.csv")
#s = setup(data = myDataSet, target='cancer', silent=True)
#cm = compare_models()

---
# **3. Classification: Apply "Data Preprocessing"**
---

### **3.1 Model performance using "Normalization"**

In [None]:
setup(data=diabetesDataSet, target='Class variable', normalize = True, normalize_method = 'zscore', silent=True)
cm = compare_models()

# normalize_method options: 'zscore', 'minmax', 'maxabs', 'robust'
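
As a rough illustration of what the `zscore` and `minmax` options do to a single column, the sketch below applies the two textbook formulas in pure Python. It is not PyCaret's implementation, and the function names are assumptions. The sample values are the first five plasma glucose readings from the table in section 1.2.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def minmax(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

glucose = [148, 85, 183, 89, 137]   # first five rows of the diabetes dataset

print([round(z, 3) for z in zscore(glucose)])
print([round(m, 3) for m in minmax(glucose)])
```

Both functions assume the column is not constant; a constant column would cause a division by zero.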

---
### **3.2 Model performance using "Feature Selection"**
---

In [None]:
setup(data=diabetesDataSet, target='Class variable', feature_selection = True, feature_selection_method = 'classic', feature_selection_threshold = 0.2, silent=True)
cm = compare_models()

# feature_selection_method options: 'classic', 'boruta'

---
### **3.3 Model performance using "Outlier Removal"**
---

In [None]:
setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, silent=True)
cm = compare_models()
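
The `outliers_threshold = 0.05` setting above asks for roughly 5% of the training rows to be treated as outliers. The sketch below shows that proportion-based idea on a single column by dropping the values farthest from the mean. This is a deliberately simplified univariate stand-in, not the detector PyCaret uses; the function name `drop_extremes` and the sample values are hypothetical.

```python
from statistics import mean

def drop_extremes(values, threshold=0.05):
    """Drop the given fraction of values that lie farthest from the mean."""
    mu = mean(values)
    n_drop = int(len(values) * threshold)   # how many values to discard
    # Rank positions by distance from the mean, farthest first
    ranked = sorted(range(len(values)),
                    key=lambda i: abs(values[i] - mu),
                    reverse=True)
    dropped = set(ranked[:n_drop])
    return [v for i, v in enumerate(values) if i not in dropped]

# Hypothetical column: 19 ordinary readings plus one extreme value
readings = list(range(20, 39)) + [400]
cleaned = drop_extremes(readings, threshold=0.05)
print(len(readings), len(cleaned), max(cleaned))
```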

---
### **3.4 Model performance using "Transformation"**
---

In [None]:
setup(data=diabetesDataSet, target='Class variable', transformation = True, transformation_method = 'yeo-johnson', silent=True)
cm = compare_models()
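
The `yeo-johnson` method named above is a power transform that also accepts zero and negative values. The sketch below implements the standard Yeo-Johnson formula for one value and a given exponent `lam`. In practice the exponent is usually estimated from the data for each feature; that estimation step is omitted here, and the function name is an assumption.

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value x for a fixed exponent lam."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)                   # limit case for lam = 0
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)                     # limit case for lam = 2
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)

# lam = 1 leaves values unchanged; smaller lam compresses large positive values
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(yeo_johnson(v, lam), 3) for v in (0, 94, 168)])
```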

---
### **3.5 Model performance using "PCA"**
---

In [None]:
setup(data=diabetesDataSet, target='Class variable', pca = True, pca_method = 'linear', silent=True)
cm = compare_models()

---
### **3.6 Model performance using "Outlier Removal" + "Normalization"**
---

In [None]:
setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, 
      normalize = True, normalize_method = 'zscore', silent=True)
cm = compare_models()

---
### **3.7 Model performance using "Outlier Removal" +  "Normalization" + "Transformation"**
---

In [None]:
setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, 
      normalize = True, normalize_method = 'zscore', 
      transformation = True, transformation_method = 'yeo-johnson', silent=True)
cm = compare_models()

---
### **3.8 Explore more parameters of "setup()" in PyCaret**
---
- Explore setup() parameters in **Step 1.3**
- **<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html" target="_blank"> Click Here</a>** for more

---
# **4. Classification: More Operations**
---
### **4.1 Build a single model - "RandomForest"**

In [None]:
from pycaret.datasets import get_data
from pycaret.classification import *

diabetesDataSet = get_data("diabetes")
setup(data=diabetesDataSet, target='Class variable', silent=True)

rfModel = create_model('rf')
# create_model() accepts further arguments, e.g. fold (number of cross-validation folds)

---
### **4.2 Other available classification models**
---
-	'ada' -	Ada Boost Classifier
-	'dt' -	Decision Tree Classifier
-	'et' -	Extra Trees Classifier
-	'gbc' -	Gradient Boosting Classifier
-	'knn' -	K Neighbors Classifier
-	'lightgbm' -	Light Gradient Boosting Machine
-	'lda' -	Linear Discriminant Analysis
-	'lr' -	Logistic Regression
-	'nb' -	Naive Bayes
-	'qda' -	Quadratic Discriminant Analysis
-	'rf' -	Random Forest Classifier
-	'ridge' -	Ridge Classifier
-	'svm' -	SVM - Linear Kernel

---
### **4.3 Explore more parameters of "create_model()" in PyCaret**
---

**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.create_model" target="_blank"> Click Here</a>** 

---
### **4.4 Make prediction on the "new unseen dataset"**
---
#### **Get the "new unseen dataset"**



In [None]:
# Select top 10 rows from diabetes dataset
newDataSet = get_data("diabetes").iloc[:10]

#### **Make prediction on "new unseen dataset"**

In [None]:
newPredictions = predict_model(rfModel, data = newDataSet)
newPredictions

---
### **4.5 "Save" and "Download" the prediction result**
---

In [None]:
newPredictions.to_csv("NewPredictions.csv", index=False)

from google.colab import files
files.download('NewPredictions.csv')

---
### **4.6 "Save" the trained model** 
---

In [None]:
sm = save_model(rfModel, 'rfModelFile')

---
### **4.7 Download the "trained model file" to user local system** 
---

In [None]:
from google.colab import files
files.download('rfModelFile.pkl')

---
### **4.8  "Upload the trained model" --> "Load the model"  --> "Make the prediction" on "new unseen dataset"** 
---
### **4.8.1 Upload the  "Trained Model"**


In [None]:
from google.colab import files
files.upload()

---
### **4.8.2 Load the "Model"**
---

In [None]:
# Pass the file name without ".pkl"; adjust it to match the uploaded file
rfModel = load_model('rfModelFile (1)')

---
### **4.8.3 Make the prediction on "new unseen dataset"**
---

In [None]:
newPredictions = predict_model(rfModel, data = newDataSet)
newPredictions

---
# **5. Plot the trained model**
---
**The following plots can be generated for a trained model:**
*   Area Under the Curve         - 'auc'
*   Discrimination Threshold     - 'threshold'
*   Precision Recall Curve       - 'pr'
*   Confusion Matrix             - 'confusion_matrix'
*   Class Prediction Error       - 'error'
*   Classification Report        - 'class_report'
*   Decision Boundary            - 'boundary'
*   Recursive Feat. Selection    - 'rfe'
*   Learning Curve               - 'learning'
*   Manifold Learning            - 'manifold'
*   Calibration Curve            - 'calibration'
*   Validation Curve             - 'vc'
*   Dimension Learning           - 'dimension'
*   Feature Importance           - 'feature'
*   Model Hyperparameter         - 'parameter'

---
### **5.1 Create RandomForest model or any other model**
---

In [None]:
rfModel = create_model('rf')

---
### **5.2 Create "Confusion Matrix"**
---

In [None]:
plot_model(rfModel, plot='confusion_matrix')

---
### **5.3 Plot the "learning curve"**
---

In [None]:
plot_model(rfModel, plot='learning')

---
### **5.4 Plot the "ROC curve" with its AUC (Area Under the Curve)**
---

In [None]:
plot_model(rfModel, plot='auc')
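
The AUC value summarised by this plot can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition by comparing every positive/negative pair. The function name `roc_auc` and the labels and scores are hypothetical; plotting libraries typically compute the same quantity by integrating the ROC curve.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities for five rows
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of rows, which is fine for a small illustration but slow on large datasets.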

---
### **5.5 Plot the "Decision Boundary"**
---

In [None]:
plot_model(rfModel, plot='boundary')

---
### **5.6 Get the model "parameters"**
---

In [None]:
plot_model(rfModel, plot='parameter')

---
### **5.7 Explore more parameters of "plot_model()" in PyCaret**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.plot_model" target="_blank"> Click Here </a>**

---
# **6. Feature Importance**
---
### **6.1 Feature Importance using "Random Forest"**


In [None]:
rfModel = create_model('rf', verbose=False)
plot_model(rfModel, plot='feature')

---
### **6.2 Feature Importance using "Extra Trees Classifier"**
---

In [None]:
etModel = create_model('et', verbose=False)
plot_model(etModel, plot='feature')

---
### **6.3 Feature Importance using "Decision Tree"**
---

In [None]:
dtModel = create_model('dt', verbose=False)
plot_model(dtModel, plot='feature')

---
# **7. Tune/Optimize the model performance**
---
### **7.1 Train "Decision Tree" with default parameters**


In [None]:
dtModel = create_model('dt')

#### **Get the "parameters" of Decision Tree**

In [None]:
plot_model(dtModel, plot='parameter')

---
### **7.2 Tune "Decision Tree" model**
---

In [None]:
dtModelTuned = tune_model(dtModel, n_iter=50)
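
The `n_iter=50` argument above sets how many candidate hyperparameter combinations are tried. The sketch below shows a generic random search with that budget, as a way to picture the process; it is not PyCaret's tuning code. The names `random_search` and `toy_score`, the search space, and the scoring function are all hypothetical, and the toy score stands in for a cross-validated metric.

```python
import random

def random_search(score_fn, space, n_iter=50, seed=0):
    """Sample n_iter random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Stand-in for a cross-validated score; peaks at depth 6 and leaf size 3."""
    return (1.0
            - 0.05 * abs(params["max_depth"] - 6)
            - 0.02 * abs(params["min_samples_leaf"] - 3))

space = {"max_depth": list(range(1, 17)), "min_samples_leaf": [1, 2, 3, 4, 5]}
best_params, best_score = random_search(toy_score, space, n_iter=50)
print(best_params, round(best_score, 3))
```

A larger `n_iter` explores more of the space at the cost of more training runs, which is the trade-off behind the value chosen in the cell above.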

#### **Get the "tuned parameters" of Decision Tree**

In [None]:
plot_model(dtModelTuned, plot='parameter')

---
### **7.3 Explore more parameters of "tune_model()" in PyCaret**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.tune_model" target="_blank"> Click Here </a>**

---
# **8. AutoML - Advanced Machine Learning**
---

- Select the n best models:
  - Ensembling, Stacking, Bagging, Blending
  - Auto-tune the best n models

**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.automl" target="_blank">Click Here</a>**
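
Of the techniques listed above, blending is the simplest to picture: several models score the same rows and their outputs are combined. The sketch below averages predicted probabilities ("soft voting") and applies a cut-off. It illustrates the concept only and is not PyCaret's implementation; the function name `blend_soft` and the probabilities are hypothetical.

```python
def blend_soft(prob_lists, threshold=0.5):
    """Average per-row probabilities across models, then apply a cut-off."""
    n_models = len(prob_lists)
    # zip(*prob_lists) groups the predictions for each row across models
    averaged = [sum(row) / n_models for row in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels

# Hypothetical positive-class probabilities from three models on four rows
lr_probs = [0.9, 0.2, 0.6, 0.4]
rf_probs = [0.8, 0.4, 0.4, 0.3]
gbc_probs = [0.7, 0.1, 0.7, 0.6]

averaged, blended = blend_soft([lr_probs, rf_probs, gbc_probs])
print([round(p, 3) for p in averaged], blended)
```

Averaging tends to smooth out the individual models' disagreements, as on the third and fourth rows here, where the models split but the blend gives one answer.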


---
# **9. Deploy the model on AWS / Azure**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.deploy_model" target="_blank">Click Here</a>**