<a href="https://colab.research.google.com/github/Rahul-Chahar/02-01-2022/blob/main/Low_Code_%26_Auto_ML_with_PyCaret_Advanced_Exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 - Objective


The objective of this notebook will be to tackle the same classification problem from the Kickstarter notebook using the power of PyCaret. We will be exploring more advanced features and capabilities of PyCaret along the way:

-  __Getting Data:__ Learn how to import default datasets from PyCaret repository

-  __Custom Environment Setup:__ Learn how to set up a custom experiment in PyCaret with advanced data transformations!

-  __Compare Models:__ Learn how to compare multiple machine learning models for the given classification task based on model evaluation metrics

-  __Create Model:__ Learn how to create specific classification models, perform stratified cross validation and evaluate classification metrics

-  __Tune Model:__ Learn how to automatically tune the hyper-parameters of classification models in different ways

-  __Ensemble Model:__ Learn how to automatically ensemble classification models in different ways

-  __Plot Model:__ Learn how to analyze model performance using various diagnostic plots

-  __Interpret Model:__ Learn how to interpret and explain classification models in different ways using XAI

-  __Finalize Model:__ Learn how to finalize the best model at the end of the experiment

-  __Predict Model:__ Learn how to make predictions on new / unseen data 



# 2 - Install PyCaret

The first step to get started is to install `pycaret`.

Run all the cells below to install PyCaret itself along with the extra dependencies required by some of its advanced capabilities.



In [None]:
!pip install explainerdashboard
!pip install optuna
!pip install pyngrok

In [None]:
!pip install pycaret

In [None]:
!pip install -U jinja2==3.0.3 # https://github.com/pycaret/pycaret/issues/2591 to bypass import errors later on

## Restart the Kernel now and then proceed by running the following cells as usual

# Low-Code & Auto-ML with PyCaret - Advanced

Welcome to this hands-on workshop session where we will learn about leveraging the very popular low-code and auto-ml library PyCaret!

![](https://i.imgur.com/cWzC62x.png)


The focus of this notebook is to continue from where we left off in the kickstarter notebook and dive into more complex features in PyCaret! 

## 2.1 - Enable Interactive Visuals

If you are using Google Colab, please run the following to enable interactive visuals

In [None]:
from pycaret.utils import enable_colab
enable_colab()

# 3 - Binary Classification

The objective in this notebook will be to solve a predictive machine learning classification problem. To be more specific, it is going to be binary classification.

Binary classification is a supervised machine learning technique where the key objective is to predict a response variable given a set of independent variables (features). The response variable is categorical, having two discrete class labels, such as 1/0, Yes/No, Positive/Negative, Default/Not-Default and so on. 

A few real world use cases for classification are listed below:

- Fraud detection models to detect if a transaction is fraudulent or not fraudulent
- A "pass or fail" test method or quality control in factories, i.e. deciding if a specification has or has not been met – a go/no-go classification.
- Sentiment Analysis -> Positive or Negative

# 4 - PyCaret Classification Module

PyCaret's `classification` module (`pycaret.classification`) is a supervised machine learning module which is used for training, tuning, evaluating and deploying classification models. 

The PyCaret `classification` module can be used for Binary or Multi-class classification problems. It has over 18 algorithms and 14 plots to analyze the performance of models. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's classification module has it all.

Do check out [`pycaret.classification`'s documentation](https://pycaret.gitbook.io/docs/get-started/quickstart#classification) and [full-fledged APIs](https://pycaret.readthedocs.io/en/latest/api/classification.html) as needed!

# 5 - Getting the Data

We will be using a popular open-source dataset, called the "Adult" dataset also known as "Census Income" dataset.

Key Objective: Predict whether income exceeds $50K/yr based on census data

You can download the data from the original source [found here](https://archive.ics.uci.edu/ml/datasets/adult) and load it using `pandas`, or you can use PyCaret's data repository to load the data using the `get_data()` function (this requires an internet connection).

The PyCaret version of the dataset is slightly more processed and is a subset.

## 5.1 - Data Retrieval

In [None]:
from pycaret.datasets import get_data
dataset = get_data('income')

In [None]:
dataset.info()

In [None]:
dataset.shape

## 5.2 - Split Data into Train-Test Datasets

In order to demonstrate the `predict_model()` function on unseen data, a holdout sample of 15% of the records is withheld from the original dataset to be used for predictions.

This will be your true unseen test dataset to be used at the end once all training is complete as a simulation of live real data.

In [None]:
# create train - test datasets
data_train = dataset.sample(frac=0.85, random_state=42)
data_test = dataset.drop(data_train.index)

# reset row numbers / indices
data_train.reset_index(inplace=True, drop=True)
data_test.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data_train.shape))
print('Unseen Data For Predictions: ' + str(data_test.shape))

## 5.3 - Understanding the Data

Let's try to understand our dataset now in terms of the given attributes.

We use a modified version of the UCI [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult).

This dataset contains census data and details about various aspects of people and their income.

There are 32561 samples and 14 features. 

Brief descriptions of each column are as follows:

- __age__: continuous; age of the person

- __workclass__: categorical; working class of the person;
Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

- __education__: categorical; educational qualification of the person;
Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

- __education-num__: discrete numeric; educational qualification of the person as an encoded value

- __marital-status__: categorical; marital status of the person; 
Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

- __occupation__: categorical; occupation of the person;
Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

- __relationship__: categorical; relationship information;
Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

- __race__: categorical; race information;
White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

- __sex__: categorical; gender of the person; 
Female, Male.

- __capital-gain__: continuous; overall capital gain 

- __capital-loss__: continuous; overall capital loss

- __hours-per-week__: continuous; working hours per week

- __native-country__: categorical; native country of residence;
United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

- **`income >50K`**: whether the income of the person is more than $50K (1 = yes, 0 = no). This is the target column.

The original dataset and data dictionary can be [found here](https://archive.ics.uci.edu/ml/datasets/adult).

In [None]:
data_train.head()

# 6 - PyCaret Environment Setup

The `setup()` function initializes the environment in `pycaret` and creates the transformation pipeline to prepare the data for modeling and deployment. 



In [None]:
from pycaret.classification import *

In [None]:
experiment = setup(data=data_train, target='income >50K', session_id=42) 

Once the setup has been successfully executed, it prints the information grid, which contains several important pieces of information.

Most of the information is related to the pre-processing pipeline which is constructed when `setup()` is executed. 

We are not doing any extensive pre-processing to start with, however a few important things to note at this stage include:

- **session_id :**  A pseudo-random number distributed as a seed in all functions for later reproducibility. If no `session_id` is passed, a random number is automatically generated and distributed to all functions. In this experiment, the `session_id` is set to `42` for later reproducibility.

- **Target Type :**  Binary or Multiclass. The Target type is automatically detected and shown. There is no difference in how the experiment is performed for Binary or Multiclass problems. All functionalities are identical.

- **Label Encoded :**  When the Target variable is of type string (i.e. 'Yes' or 'No') instead of 1 or 0, it automatically encodes the label into 1 and 0 and displays the mapping (0 : No, 1 : Yes) for reference. In this experiment no label encoding is required since the target variable is of type numeric.

- **Original Data :**  Displays the original shape of the dataset. In this experiment (27677, 14) means 27,677 samples and 14 features including the target column. 

- **Missing Values :**  When there are missing values in the original data this will show as True. For this experiment there are several missing values in the dataset. 

- **Numeric Features :**  The number of features inferred as numeric. In this dataset, 4 features are inferred as numeric. 

- **Categorical Features :**  The number of features inferred as categorical. In this dataset, 9 features are inferred as categorical.

- **Transformed Train Set :**  Displays the shape of the transformed training set. Notice that the original shape of (27677, 14) is transformed into (19373, 104) for the transformed train set. The number of features has increased to 104 due to categorical encoding.

- **Transformed Test Set :**  Displays the shape of the transformed test/hold-out set. There are 8304 samples in the test/hold-out set. This split is based on the default value of 70/30, which can be changed using the `train_size` parameter in `setup()`. If you choose the best model based on its scores on this subset, it effectively acts as a validation set.

Notice how a few tasks that are imperative to perform modeling are automatically handled such as missing value imputation, categorical encoding etc. 

Most of the parameters in `setup()` are optional and used for customizing the pre-processing pipeline.
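
The row counts quoted in the grid can be checked by hand. The following is a small arithmetic sketch, assuming the hold-out share is rounded up the way scikit-learn's `train_test_split` treats a fractional `test_size`; the variable names are illustrative only.

```python
import math

n_rows = 32561                       # samples in the full 'income' dataset
n_model = round(n_rows * 0.85)       # rows kept by dataset.sample(frac=0.85)
n_unseen = n_rows - n_model          # rows set aside in data_test

test_fraction = 1 - 0.7              # complement of the default train_size
n_holdout = math.ceil(n_model * test_fraction)  # assumed: hold-out share rounded up
n_train = n_model - n_holdout

print("modeling rows:", n_model, "| unseen rows:", n_unseen)
print("train rows:", n_train, "| hold-out rows:", n_holdout)
```

The train and hold-out counts should agree with the 19373 and 8304 rows mentioned in the grid description.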



# 7 - Comparing all Models

Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case).

This function trains all models in the model library and scores them using stratified cross validation for metric evaluation. The output prints a score grid that shows average Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC across the folds (10 by default) along with training times.

Here we use 3-fold cross validation and sort the results by the F1-score metric.

In [None]:
best_model = compare_models(fold=3, sort='F1')

Looks like boosting models have taken up the leaderboard above!

Next, we will focus on the top model and apply different data transformation techniques to see if its performance improves.
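
Since the leaderboard is sorted by F1, it may help to recall how that metric combines precision and recall. The snippet below is a minimal, dependency-free sketch for binary labels; `f1_score_binary` is a hypothetical helper written for illustration and is not part of PyCaret.

```python
def f1_score_binary(y_true, y_pred):
    """Compute the F1-score of the positive class (label 1) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 actual positives, 2 of them found, plus 1 false alarm
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(round(f1_score_binary(toy_true, toy_pred), 4))
```

Because F1 ignores true negatives, it is usually a more informative summary than accuracy when the positive class is the minority, as it is for incomes above $50K.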

# 8 - Profiling your Dataset

We can use `pandas_profiling` to generate a data profile report of our dataset. It gives us an idea of the major issues in the data, which we can then try to fix in the next section.
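
One possible way to build such a report is sketched below. It assumes `pandas_profiling` is importable in your environment and uses its `ProfileReport` class; `build_profile_report` is a hypothetical helper name, and the import sits inside the function so the cell can be defined without triggering the import.

```python
def build_profile_report(df, title="Income Dataset Profile"):
    """Return a pandas_profiling report for the given DataFrame."""
    # Imported lazily; assumes the pandas_profiling package is installed
    from pandas_profiling import ProfileReport

    # minimal=True skips the most expensive computations on larger datasets
    return ProfileReport(df, title=title, minimal=True)
```

In a notebook, `build_profile_report(data_train).to_notebook_iframe()` would render the report inline.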

# 9 - Advanced Data Transformations

In this section we will set up another classification experiment, but we will apply some additional transformations to the dataset:

- **imputation_type :**  Instead of PyCaret's default imputation (mean for numeric features, a constant for categorical features), we use iterative, ML-based imputation with a LightGBM model.

- **remove_multicollinearity :**  We set this to `True` to remove features that are highly correlated with other features.

- **multicollinearity_threshold :**  We set this to 0.9 so that features whose correlation with another feature exceeds 0.9 are removed.

- **fix_imbalance :**  We set this to `True` because our classes are imbalanced. Internally this uses the SMOTE oversampling technique to generate synthetic samples for the minority class.

Here we rebuild the top 5 models from the previous comparison using the new data transformations.
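
The experiment described above could be sketched as follows. The keyword names are taken from the bullets; the helper name `run_advanced_experiment` and its `model_ids` argument are hypothetical, and the fold count and sort metric simply repeat the choices from section 7.

```python
# Keyword arguments for the second experiment, mirroring the bullets above
ADVANCED_SETUP_KWARGS = {
    "imputation_type": "iterative",       # iterative, model-based imputation
    "remove_multicollinearity": True,     # drop highly correlated features
    "multicollinearity_threshold": 0.9,   # correlation cut-off for dropping
    "fix_imbalance": True,                # oversample the minority class
    "session_id": 42,                     # same seed as the first experiment
}


def run_advanced_experiment(train_df, model_ids, target="income >50K"):
    """Set up the experiment and compare only the models named in model_ids."""
    # Imported lazily; requires PyCaret to be installed
    from pycaret.classification import compare_models, setup

    setup(data=train_df, target=target, **ADVANCED_SETUP_KWARGS)
    return compare_models(include=model_ids, fold=3, sort="F1",
                          n_select=len(model_ids))
```

Pass the five model IDs from your own leaderboard as `model_ids`, for example `run_advanced_experiment(data_train, top_five_ids)`, where `top_five_ids` is a list you fill in.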

# 10 - Create ML Models

`create_model` is one of the most important functions in PyCaret and is often the starting point or foundation behind most of the PyCaret functionalities. 

As the name suggests this function trains and evaluates a model using cross validation that can be set with `fold` parameter. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold. 

For the rest of this tutorial, we will work with our top-performing model, selected by the highest F1-score:

```
- Light Gradient Boosting Machine ('lightgbm')
```

There are 18 classifiers available in the model library of PyCaret. To see the list of all classifiers, either check the docstring or use the `models()` function.

## 10.1 - Create Light Gradient Boosting Model
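
A minimal sketch of this step, assuming `setup()` has already been run in the current session. The helper name `create_lightgbm` and the choice of 3 folds are assumptions made here to match the earlier comparison.

```python
def create_lightgbm(fold=3):
    """Train and cross-validate a LightGBM classifier in the active experiment."""
    # Imported lazily; requires PyCaret and a completed setup() call
    from pycaret.classification import create_model

    # 'lightgbm' is the model ID shown in the list above
    return create_model("lightgbm", fold=fold)
```

Calling `lgbm = create_lightgbm()` prints the per-fold score grid and returns the trained model object.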

# 11 - Tune ML Models

When a model is created using the `create_model()` function it uses the default hyperparameters to train the model. 

In order to tune hyperparameters, the [`tune_model()`](https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.tune_model) function is used. 

This function automatically tunes the hyperparameters of a model using `Randomized Search` on a pre-defined search space. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by fold for the best model. 

To use a custom search grid, pass the `custom_grid` parameter to the `tune_model()` function.

## 11.1 Tune Light Gradient Boosting Model with Randomized Search

Uses Randomized Search method to tune the model on F1-score
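
A hedged sketch of randomized-search tuning follows. The helper name `tune_with_random_search` and the iteration count are assumptions; the optional `custom_grid` argument is forwarded unchanged to `tune_model()`.

```python
# Settings for a randomized search that optimizes the F1-score
RANDOM_SEARCH_KWARGS = {
    "optimize": "F1",
    "n_iter": 10,                        # number of sampled configurations
    "search_library": "scikit-learn",
    "search_algorithm": "random",
}


def tune_with_random_search(model, custom_grid=None):
    """Tune a trained model with randomized search; custom_grid is optional."""
    # Imported lazily; requires PyCaret and a completed setup() call
    from pycaret.classification import tune_model

    return tune_model(model, custom_grid=custom_grid, **RANDOM_SEARCH_KWARGS)
```

With a LightGBM model, an illustrative grid might look like `{'num_leaves': [15, 31, 63], 'learning_rate': [0.01, 0.05, 0.1]}`; the values are examples only, not recommendations.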

## 11.2 Tune Light Gradient Boosting Model with Bayesian Search

Uses Bayesian Search method to tune the model on F1-score using Optuna

The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization (SMBO) approach. SMBO methods sequentially construct models to approximate the performance of hyperparameters based on historical measurements, and then subsequently choose new hyperparameters to test based on this model.
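
Switching to Bayesian search mainly means changing the search library and algorithm passed to `tune_model()`. The sketch below assumes `optuna` is installed (see section 2); the helper name `tune_with_optuna` and the trial count of 20 are illustrative choices.

```python
# Settings for a TPE-based search with Optuna that optimizes the F1-score
OPTUNA_SEARCH_KWARGS = {
    "optimize": "F1",
    "n_iter": 20,                  # number of trials; more trials explore further
    "search_library": "optuna",
    "search_algorithm": "tpe",     # Tree-structured Parzen Estimator
}


def tune_with_optuna(model):
    """Tune a trained model with Optuna's TPE sampler."""
    # Imported lazily; requires PyCaret, optuna and a completed setup() call
    from pycaret.classification import tune_model

    return tune_model(model, **OPTUNA_SEARCH_KWARGS)
```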

It looks like tuning didn't yield any significant improvement. That could also be because we ran very few trials.

# 12 - Ensembling ML Models

This functionality helps in ensembling a given model in different ways

## 12.1 - Create a simple decision tree model

## 12.2 - Ensemble model by Bagging

Trains multiple models independently in parallel and combines their predictions to build one big model
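
One way to try this is to train the simple decision tree from section 12.1 and wrap it with `ensemble_model()`. The helper name `bag_decision_tree` and its default values are assumptions for illustration.

```python
def bag_decision_tree(n_estimators=10, fold=3):
    """Train a decision tree, then bag several copies of it."""
    # Imported lazily; requires PyCaret and a completed setup() call
    from pycaret.classification import create_model, ensemble_model

    tree = create_model("dt", fold=fold)              # single decision tree
    bagged = ensemble_model(tree, method="Bagging",   # bootstrap-aggregated trees
                            n_estimators=n_estimators, fold=fold)
    return tree, bagged
```

Comparing the two score grids shows whether averaging over several trees helps compared with the single tree.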

## 12.3 - Ensemble model by Boosting

Trains multiple models sequentially where one model tries to learn from the mistakes of the previous model, and combines their predictions to build one big model
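
The same `ensemble_model()` function can be asked for boosting instead. In this sketch, `boost_model` is a hypothetical helper and `base_model` is any trained estimator, such as the decision tree from section 12.1.

```python
def boost_model(base_model, n_estimators=10, fold=3):
    """Build a boosted ensemble on top of an already trained base model."""
    # Imported lazily; requires PyCaret and a completed setup() call
    from pycaret.classification import ensemble_model

    return ensemble_model(base_model, method="Boosting",
                          n_estimators=n_estimators, fold=fold)
```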

## 12.4 - Ensemble model by Blending

This function trains a Soft Voting / Hard Voting Majority Rule classifier for select models passed in the `estimator_list` parameter.
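
To make the difference between the two voting rules concrete, here is a small dependency-free sketch for a single sample. The function names and the probabilities are made up for illustration and are not PyCaret code.

```python
def hard_vote(probabilities, threshold=0.5):
    """Majority vote over each model's predicted label."""
    votes = [1 if p >= threshold else 0 for p in probabilities]
    return 1 if sum(votes) * 2 > len(votes) else 0


def soft_vote(probabilities, threshold=0.5):
    """Threshold the average of the predicted probabilities."""
    mean_probability = sum(probabilities) / len(probabilities)
    return 1 if mean_probability >= threshold else 0


# Positive-class probabilities from three hypothetical models for one person
sample_probabilities = [0.45, 0.40, 0.95]
print("hard vote:", hard_vote(sample_probabilities))
print("soft vote:", soft_vote(sample_probabilities))
```

Two of the three models vote for the negative class, so the hard vote is negative, while the one very confident model pulls the average above 0.5, so the soft vote is positive. In PyCaret, the rule is selected through the `method` argument of `blend_models()`.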

In [None]:
# train individual models to blend


In [None]:
# blend individual models based on soft labels i.e predicted probabilities


In [None]:
# blend individual models based on hard labels i.e predicted labels


## 12.5 - Ensemble model by Stacking

This function trains a meta-model over select estimators passed in the `estimator_list` parameter. This means the predictions of the initial models are fed as inputs into the meta-model, which makes the final predictions.
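
A possible stacking setup is sketched below. The base model IDs, the logistic regression meta-model and the helper name `stack_with_meta_model` are all assumptions chosen for illustration.

```python
def stack_with_meta_model(base_ids=("dt", "knn", "nb"), meta_id="lr", fold=3):
    """Train base models and stack them under a meta-model."""
    # Imported lazily; requires PyCaret and a completed setup() call
    from pycaret.classification import create_model, stack_models

    # verbose=False hides the per-model score grids while the base models train
    base_models = [create_model(model_id, fold=fold, verbose=False)
                   for model_id in base_ids]
    meta_model = create_model(meta_id, fold=fold, verbose=False)

    # The meta-model learns from the predictions of the base models
    return stack_models(estimator_list=base_models, meta_model=meta_model,
                        fold=fold, optimize="F1")
```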

# 13 - Plot ML Model Evaluation Diagnostics

Before model finalization, the `plot_model()` function can be used to analyze and evaluate the model performance across different aspects such as AUC, confusion_matrix, decision boundary etc. 

This function takes a trained model object and returns a plot based on the test / hold-out set. 

There are many different plots available, please see the `plot_model()` docstring for the list of available plots.

## 13.1 - Confusion Matrix

## 13.2 - Feature Importance

## 13.3 - ROC AUC Curve

## 13.4 - Classification Report

*Another* way to analyze the performance of models is to use the `evaluate_model()` function which displays a user interface for all of the available plots for a given model. It internally uses the `plot_model()` function. 
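
The four plots from this section could be produced along these lines. The mapping from section titles to `plot` identifiers and the helper name `draw_diagnostics` are assumptions; the `plot_model()` docstring lists all identifiers.

```python
# Section title -> plot identifier passed to plot_model()
DIAGNOSTIC_PLOTS = {
    "13.1 Confusion Matrix": "confusion_matrix",
    "13.2 Feature Importance": "feature",
    "13.3 ROC AUC Curve": "auc",
    "13.4 Classification Report": "class_report",
}


def draw_diagnostics(model):
    """Draw each diagnostic plot for a trained model on the hold-out set."""
    # Imported lazily; requires PyCaret and a completed setup() call
    from pycaret.classification import plot_model

    for title, plot_id in DIAGNOSTIC_PLOTS.items():
        print(title)
        plot_model(model, plot=plot_id)
```

For interactive exploration, `evaluate_model(model)` offers the same plots behind a widget.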

# 14 - Interpret your ML Models with XAI

## 14.1 - SHAP Summary Plot

Shows effects of each feature on model predictions based on SHAP values for the entire test dataset. 

Positive values have a positive influence on model (pushes it to predict the positive class) and negative values have a negative influence on the model (pushes it to predict the negative class)

## 14.2 - SHAP Partial Dependence Plots

Shows effects of specific features on model predictions based on SHAP values for the entire test dataset. 

Positive values have a positive influence on model (pushes it to predict the positive class) and negative values have a negative influence on the model (pushes it to predict the negative class)

## 14.3 - SHAP Reasoning Plot

Shows effects of specific features on model predictions based on SHAP values for a specific row of the test dataset.

Positive values have a positive influence on model (pushes it to predict the positive class) and negative values have a negative influence on the model (pushes it to predict the negative class)
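
The three SHAP views in this section map onto the `plot` argument of `interpret_model()`. The sketch below assumes a tree-based model such as LightGBM and that the `shap` package is installed; the `age` feature and row index 0 are arbitrary examples, and `explain` is a hypothetical helper.

```python
# Arguments for each SHAP view described in sections 14.1 to 14.3
SHAP_PLOTS = {
    "summary": {"plot": "summary"},                            # 14.1
    "dependence": {"plot": "correlation", "feature": "age"},   # 14.2
    "reason": {"plot": "reason", "observation": 0},            # 14.3
}


def explain(model, which="summary"):
    """Draw one of the SHAP views for a trained tree-based model."""
    # Imported lazily; requires PyCaret, shap and a completed setup() call
    from pycaret.classification import interpret_model

    return interpret_model(model, **SHAP_PLOTS[which])
```

For example, `explain(lgbm, "reason")` would explain the prediction for the first row of the hold-out set, given a trained model named `lgbm`.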

## 14.4 - XAI Explainer Dashboard

This generates an interactive dashboard for a trained model consisting of evaluation metrics and SHAP based explanation artifacts
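
Recent PyCaret 2.x releases expose a `dashboard()` function built on `explainerdashboard`. The sketch below assumes that function is available in your installed version; the helper name `launch_dashboard` is hypothetical, and port 8050 is chosen to match the tunnel mentioned in this section.

```python
def launch_dashboard(model, port=8050):
    """Start the explainer dashboard for a trained model on the given port."""
    # Imported lazily; assumes a PyCaret version that provides dashboard()
    from pycaret.classification import dashboard

    # run_kwargs is forwarded to the dashboard's run method (assumed behaviour)
    return dashboard(model, display_format="dash", run_kwargs={"port": port})
```

The call keeps running while the dashboard is being served, so any tunnel should be opened before calling it.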

## We need to open a tunnel since we can't open a webpage inside colab
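
A minimal sketch of the tunnel setup using `pyngrok`, installed in section 2. The helper name `open_dashboard_tunnel` is hypothetical, the auth token is optional, and the scheme of the returned URL may depend on your ngrok version.

```python
def open_dashboard_tunnel(port=8050, auth_token=None):
    """Open an ngrok tunnel to a local port and return its public URL."""
    # Imported lazily; requires the pyngrok package
    from pyngrok import ngrok

    ngrok.kill()                           # terminate tunnels left open earlier
    if auth_token:                         # optional token from the ngrok dashboard
        ngrok.set_auth_token(auth_token)
    tunnel = ngrok.connect(str(port))      # forward a public URL to localhost:<port>
    return tunnel.public_url
```

Printing the returned URL gives the address at which the dashboard on port 8050 can be opened from outside Colab.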

In [None]:


# Terminate open tunnels if exist

# Setting the authtoken (optional)
# Get your authtoken from https://dashboard.ngrok.com/auth


# Open an HTTPs tunnel on port 8050 for http://localhost:8050


# 15 - Finalize Model for Deployment

Model finalization is the last step in the experiment. 

A normal machine learning workflow in PyCaret starts with `setup()`, followed by comparing all models using `compare_models()` and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling, stacking etc. (more on advanced techniques in the next tutorial!).

This workflow will eventually lead you to the best model for use in making predictions on new and unseen data. 


The `finalize_model()` function fits the model onto the complete dataset **including** the test/hold-out sample (30% in this case). The purpose of this function is to **train the model on the complete dataset** before it is deployed in production.

In [None]:
final_lgbm = finalize_model(lgbm)

In [None]:
#Final Light Gradient Boosting Model to be used for deployment
final_lgbm

# 16 - Predict on unseen / new datasets

The `predict_model()` function is also used to predict on any new / unseen datasets. 

Unlike the hold-out predictions made inside the experiment, this time we pass `data_test` through the `data` parameter. `data_test` is the variable created at the beginning of the tutorial and contains 15% of the original dataset, which was never exposed to PyCaret (see section 5.2 for an explanation).

In [None]:
new_predictions = predict_model(final_lgbm, data=data_test)

In [None]:
new_predictions.head()

The `Label` and `Score` columns are added onto the `data_test` set. 

Label is the prediction and score is the probability of the prediction. 

Notice that predicted results are concatenated to the original dataset while all the data transformations are automatically performed in the background. 


In [None]:
from sklearn.metrics import classification_report

print(classification_report(new_predictions['income >50K'], new_predictions['Label']))