---
# **PyCaret for Classification**
---
- PyCaret bundles many machine learning algorithms behind a single interface.
- Only three lines of code are required to compare about 20 ML models.
- PyCaret is available for:
    - Classification
    - Regression
    - Clustering

---

### **Self-learning resources**
1. PyCaret tutorials: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html">Click Here</a>**

2. PyCaret classification documentation: **<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html">Click Here</a>**

---

### **In this tutorial we will learn:**

- Getting Data
- Setting up Environment
- Create Model
- Tune Model
- Plot Model
- Finalize Model
- Predict Model
- Save / Load Model
---



### **(a) Install PyCaret**

In [None]:
!pip install pycaret &> /dev/null
print("PyCaret installed successfully!")

### **(b) Get the PyCaret version**

In [3]:
from pycaret.utils import version
version()


---
# **1. Classification: Basics**
---
### **1.1 Get the list of datasets available in pycaret (55)**




In [None]:
from pycaret.datasets import get_data
dataSets = get_data('index')

---
### **1.2 Get the "diabetes" dataset**
---

In [None]:
diabetesDataSet = get_data("diabetes")    # SN is 7
# This is a binary classification dataset:
# the "Class variable" column takes exactly two values.

---
### **1.3 Download the "diabetes" dataset to local system** 
---

In [None]:
diabetesDataSet.to_csv("diabetesDataSet.csv")
from google.colab import files
#files.download('diabetesDataSet.csv')            # Uncomment this line       

---
### **1.4 "Parameter setting"  for all classification models**
##### **Train/test split and data pre-processing** {Sampling, Normalization, Transformation, PCA, Handling of Outliers, Feature Selection}
---

In [None]:
from pycaret.classification import *
s = setup(data=diabetesDataSet, target='Class variable', train_size=0.7, silent=True)
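
To make `train_size=0.7` concrete, here is a minimal sketch of a shuffled 70/30 split using only the standard library. The helper name `train_test_split_rows` and the seed are assumptions made for illustration; PyCaret handles the split internally and may also stratify by the target, which this sketch does not do.

```python
import random

def train_test_split_rows(rows, train_size=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = round(len(shuffled) * train_size)  # number of training rows
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 data rows
train_rows, test_rows = train_test_split_rows(rows, train_size=0.7)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two parts, so the model is evaluated on rows it never saw during training.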

---
### **1.5 "Run and Compare" the model performance**
---

In [None]:
cm = compare_models()

---
### **1.6 "Three lines of code" for model comparison on the "Cancer" dataset**
---



In [8]:
from pycaret.datasets import get_data
from pycaret.classification import *

cancerDataSet = get_data("cancer")
s = setup(data = cancerDataSet, target='Class', silent=True)
cm = compare_models()

| ID | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.9728 | 0.9962 | 0.9702 | 0.9566 | 0.9621 | 0.9409 | 0.9424 | 0.024 |
| rf | Random Forest Classifier | 0.9685 | 0.994 | 0.9643 | 0.9507 | 0.9562 | 0.9317 | 0.9332 | 0.464 |
| et | Extra Trees Classifier | 0.9685 | 0.9935 | 0.9643 | 0.9507 | 0.9562 | 0.9317 | 0.9332 | 0.466 |
| ridge | Ridge Classifier | 0.9664 | 0.0 | 0.9526 | 0.9558 | 0.952 | 0.9263 | 0.9287 | 0.015 |
| lightgbm | Light Gradient Boosting Machine | 0.9664 | 0.9933 | 0.9522 | 0.9547 | 0.9521 | 0.9262 | 0.9278 | 0.045 |
| svm | SVM - Linear Kernel | 0.9623 | 0.0 | 0.964 | 0.9349 | 0.9479 | 0.9184 | 0.9202 | 0.019 |
| lda | Linear Discriminant Analysis | 0.9623 | 0.9899 | 0.9408 | 0.955 | 0.9454 | 0.9167 | 0.9193 | 0.024 |
| nb | Naive Bayes | 0.9602 | 0.9792 | 0.9529 | 0.9392 | 0.9448 | 0.9137 | 0.9152 | 0.016 |
| ada | Ada Boost Classifier | 0.958 | 0.9901 | 0.9283 | 0.9535 | 0.9388 | 0.9069 | 0.9091 | 0.112 |
| knn | K Neighbors Classifier | 0.9559 | 0.9868 | 0.9048 | 0.9683 | 0.9343 | 0.9013 | 0.9036 | 0.117 |
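
The Accuracy, Recall, Prec., F1, Kappa, and MCC columns of the comparison output can all be derived from the four cells of a binary confusion matrix. The sketch below shows the standard formulas in plain Python; the function name `binary_metrics` and the counts passed to it are made up for illustration and are not taken from the run above.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts.

    Assumes no denominator is zero (both classes occur and are predicted).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }

# Hypothetical counts, chosen only to exercise the formulas.
m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```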


---
### **1.7 "Three lines of code" for model comparison on the "Heart Disease" dataset**
---



In [9]:
from pycaret.datasets import get_data
from pycaret.classification import *

heartDiseaseDataSet = get_data("heart_disease")
s = setup(data = heartDiseaseDataSet, target='Disease', silent=True)
cm = compare_models()

---
# **2. Classification: working with user dataset**
---
### **2.1 Uploading "user file" from user system**

In [None]:
from google.colab import files
files.upload()                     # Opens a file picker; choose your CSV file

---
### **2.2 "Read" the uploaded file**
---

In [None]:
import pandas as pd
myDataSet = pd.read_csv('BC7-LitCovid-Train.csv')        # Replace the file name with the file uploaded in the previous step
myDataSet.head()                                             # Show the first five rows

---
### **2.3 "Compare" the model performance**
---

In [None]:
from pycaret.classification import *

#s = setup(data = myDataSet, target='Cancer', silent=True)               # Uncomment this line
#cm = compare_models()                                                 # Uncomment this line

---
### **2.4 "Three lines of code" for model comparison on a "user dataset"**

##### Use this when working in an **"Anaconda/Jupyter notebook"** on a local machine
---

In [None]:
from pycaret.classification import *
import pandas as pd

#myDataSet = pd.read_csv("myData.csv")                          # Uncomment this line
#s = setup(data = myDataSet, target='cancer', silent=True)      # Uncomment this line
#cm = compare_models()                                          # Uncomment this line

---
# **3. Classification: Apply "Data Preprocessing"**
---

### **3.1 Model performance using "Normalization"**

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', normalize = True, normalize_method = 'zscore', silent=True)
cm = compare_models()

# normalize_method options: 'zscore', 'minmax', 'maxabs', 'robust'
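
As a rough illustration of what the `zscore` and `minmax` options do to a single numeric column, here is a standard-library sketch. The function names and the sample values are assumptions; PyCaret applies its own scalers inside `setup()`, so treat this as the underlying arithmetic only.

```python
import statistics

def zscore(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def minmax(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

glucose = [85.0, 120.0, 150.0, 99.0, 183.0]  # made-up sample values
z = zscore(glucose)
mm = minmax(glucose)
print([round(v, 3) for v in z])
print([round(v, 3) for v in mm])
```

Scaling matters most for distance-based or gradient-based models such as KNN, SVM, and logistic regression; tree-based models are largely unaffected by it.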

---
### **3.2 Model performance using "Feature Selection"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', feature_selection = True, feature_selection_threshold = 0.9, silent=True)
cm = compare_models()

---
### **3.3 Model performance using "Outlier Removal"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, silent=True)
cm = compare_models()

---
### **3.4 Model performance using "Transformation"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', transformation = True, transformation_method = 'yeo-johnson', silent=True)
cm = compare_models()
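
For reference, the Yeo-Johnson transform named in `transformation_method` is a power transform that also accepts zero and negative values. The sketch below applies it with a fixed, hand-picked `lmbda`; that is an assumption made for illustration, since in practice the parameter is normally estimated from the data rather than set by hand.

```python
import math

def yeo_johnson(y, lmbda):
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if y >= 0:
        if lmbda == 0:
            return math.log1p(y)
        return ((y + 1) ** lmbda - 1) / lmbda
    # Negative inputs use the mirrored form with exponent (2 - lambda).
    if lmbda == 2:
        return -math.log1p(-y)
    return -(((1 - y) ** (2 - lmbda)) - 1) / (2 - lmbda)

skewed = [0.0, 1.0, 3.0, 10.0, 100.0]  # made-up right-skewed values
transformed = [yeo_johnson(v, lmbda=0.0) for v in skewed]
print([round(v, 3) for v in transformed])
```

With `lmbda=1` the transform leaves values unchanged, while smaller values of `lmbda` compress the long right tail of a skewed feature.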

---
### **3.5 Model performance using "PCA"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', pca = True, pca_method = 'linear', silent=True)
cm = compare_models()

---
### **3.6 Model performance using "Outlier Removal" + "Normalization"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, normalize = True, normalize_method = 'zscore', silent=True)
cm = compare_models()

---
### **3.7 Model performance using "Outlier Removal" +  "Normalization" + "Transformation"**
---

In [None]:
s = setup(data=diabetesDataSet, target='Class variable', remove_outliers = True, outliers_threshold = 0.05, normalize = True, normalize_method = 'zscore', transformation = True, transformation_method = 'yeo-johnson', silent=True)
cm = compare_models()

---
### **3.8 Explore more parameters of "setup()" on pycaret**
---
- Explore setup() parameters in **Step 1.4**
- **<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html"> Click Here</a>** for more

---
# **4. Classification: More Operations**
---
### **4.1 Build a single model - "RandomForest"**

In [None]:
from pycaret.datasets import get_data
from pycaret.classification import *

diabetesDataSet = get_data("diabetes")
s = setup(data=diabetesDataSet, target='Class variable', silent=True)

rfModel = create_model('rf')
# Explore more parameters
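
`create_model()` reports its scores per cross-validation fold (10 folds by default). The sketch below shows how row indices can be partitioned into k folds; the helper `k_fold_indices` is hypothetical, and it is a plain, unshuffled, unstratified split, so it is simpler than what PyCaret does for classification.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop

folds = list(k_fold_indices(25, k=10))
for i, (train, validation) in enumerate(folds):
    print(f"fold {i}: {len(train)} train rows, validation rows {validation}")
```

Each row is used for validation exactly once, and the per-fold scores are then averaged into the mean row of the results grid.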

---
### **4.2 Other available classification models**
---
-	'ada' -	Ada Boost Classifier
-	'dt' -	Decision Tree Classifier
-	'et' -	Extra Trees Classifier
-	'gbc' -	Gradient Boosting Classifier
-	'knn' -	K Neighbors Classifier
-	'lightgbm' -	Light Gradient Boosting Machine
-	'lda' -	Linear Discriminant Analysis
-	'lr' -	Logistic Regression
-	'nb' -	Naive Bayes
-	'qda' -	Quadratic Discriminant Analysis
-	'rf' -	Random Forest Classifier
-	'ridge' -	Ridge Classifier
-	'svm' -	SVM - Linear Kernel

---
### **4.3 Explore more parameters of "create_model()" on pycaret**
---

**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html"> Click Here</a>** 

---
### **4.4 Make prediction on the "new unseen dataset"**
---
#### **Get the "new unseen dataset"**



In [None]:
# Select the first 10 rows of the diabetes dataset as a stand-in for new, unseen data
newDataSet = get_data("diabetes").iloc[:10]

#### **Make prediction on "new unseen dataset"**

In [None]:
newPredictions = predict_model(rfModel, data = newDataSet)
newPredictions

---
### **4.5 "Save" the prediction results to csv** 
---

In [None]:
newPredictions.to_csv("NewPredictions.csv")
print("Result saved in NewPredictions.csv")

---
### **4.6 Download the "result file" to user local system** 
---

In [None]:
from google.colab import files
#files.download('NewPredictions.csv')      # Uncomment this line

---
### **4.7 "Save" the trained model** 
---

In [None]:
sm = save_model(rfModel, 'rfModelFile')

---
### **4.8 Download the "trained model file" to user local system** 
---

In [None]:
from google.colab import files
#files.download('rfModelFile.pkl')           # Uncomment this line

---
### **4.9  "Upload the trained model" --> "Load the model"  --> "Make the prediction" on "new unseen dataset"** 
---
### **4.9.1 Upload the  "Trained Model"**


In [None]:
from google.colab import files
#files.upload()                    # Uncomment this line

---
### **4.9.2 Load the "Model"**
---

In [None]:
#rfModel = load_model('rfModelFile (1)')        # Uncomment this line

---
### **4.9.3 Make the prediction on "new unseen dataset"**
---

In [None]:
newPredictions = predict_model(rfModel, data = newDataSet)
newPredictions

---
# **5. Plot the trained model**
---
**The following plots are available for a trained model:**
*   Area Under the Curve         - 'auc'
*   Discrimination Threshold     - 'threshold'
*   Precision Recall Curve       - 'pr'
*   Confusion Matrix             - 'confusion_matrix'
*   Class Prediction Error       - 'error'
*   Classification Report        - 'class_report'
*   Decision Boundary            - 'boundary'
*   Recursive Feat. Selection    - 'rfe'
*   Learning Curve               - 'learning'
*   Manifold Learning            - 'manifold'
*   Calibration Curve            - 'calibration'
*   Validation Curve             - 'vc'
*   Dimension Learning           - 'dimension'
*   Feature Importance           - 'feature'
*   Model Hyperparameter         - 'parameter'

---
### **5.1 Create RandomForest model or any other model**
---

In [None]:
rfModel = create_model('rf')

---
### **5.2 Create "Confusion Matrix"**
---

In [None]:
plot_model(rfModel, plot='confusion_matrix')
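
The confusion matrix plot is a 2x2 count of (actual, predicted) label pairs. A minimal sketch of that counting step follows, using made-up labels rather than the diabetes predictions; `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count each (actual, predicted) label pair."""
    return Counter(zip(y_true, y_pred))

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
tn, fp = counts[(0, 0)], counts[(0, 1)]
fn, tp = counts[(1, 0)], counts[(1, 1)]

print("            pred 0  pred 1")
print(f"actual 0    {tn:6d}  {fp:6d}")
print(f"actual 1    {fn:6d}  {tp:6d}")
```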

---
### **5.3 Plot the "learning curve"**
---

In [None]:
plot_model(rfModel, plot='learning')

---
### **5.4 Plot the "AUC Curve" (Area Under the Curve)**
---

In [None]:
plot_model(rfModel, plot='auc')
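
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition by comparing every positive/negative pair; the labels and scores are made up, and this brute-force form is meant for understanding, not for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities
auc = roc_auc(y_true, scores)
print(auc)
```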

---
### **5.5 Plot the "Decision Boundary"**
---

In [None]:
plot_model(rfModel, plot='boundary')

---
### **5.6 Get the model "parameters"**
---

In [None]:
plot_model(rfModel, plot='parameter')

---
### **5.7 Explore more parameters of "plot_model()" on pycaret**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html"> Click Here </a>**

---
# **6. Feature Importance**
---
### **6.1 Feature Importance using "Random Forest"**


In [None]:
rfModel = create_model('rf', verbose=False)
plot_model(rfModel, plot='feature')

---
### **6.2 Feature Importance using "Extra Trees Classifier"**
---

In [None]:
etModel = create_model('et', verbose=False)
plot_model(etModel, plot='feature')

---
### **6.3 Feature Importance using "Decision Tree"**
---

In [None]:
dtModel = create_model('dt', verbose=False)
plot_model(dtModel, plot='feature')

---
# **7. Tune/Optimize the model performance**
---
### **7.1 Train "Decision Tree" with default parameters**


In [None]:
dtModel = create_model('dt')

#### **Get the "parameters" of Decision Tree**

In [None]:
plot_model(dtModel, plot='parameter')

---
### **7.2 Tune "Decision Tree" model**
---

In [None]:
dtModelTuned = tune_model(dtModel, n_iter=10)
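
The `n_iter=10` argument sets how many hyperparameter combinations are tried. Assuming the default behaviour is a random search over a predefined grid, the idea can be sketched in plain Python as below. The search space, the `toy_score` objective, and the function names are all hypothetical; in real tuning the score would come from cross-validating the model with each candidate setting.

```python
import random

def random_search(param_space, score_fn, n_iter=10, seed=0):
    """Try n_iter random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Sample one value per hyperparameter.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for a decision tree.
param_space = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}

def toy_score(params):
    """Made-up objective that peaks at max_depth=6, min_samples_leaf=2."""
    return -abs(params["max_depth"] - 6) - abs(params["min_samples_leaf"] - 2)

best_params, best_score = random_search(param_space, toy_score, n_iter=10)
print(best_params, best_score)
```

Raising `n_iter` explores more of the grid at the cost of a longer run, which is the same trade-off as in `tune_model()`.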

#### **Get the "tuned parameters" of Decision Tree**

In [None]:
plot_model(dtModelTuned, plot='parameter')

---
### **7.3 Explore more parameters of "tune_model()" on pycaret**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html"> Click Here </a>**

---
# **8. Deploy the model on AWS**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html">Click Here</a>**
