A.S. Lundervold, 15.11.2023

[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HVL-ML/DAT158/blob/master/notebooks/DAT158-3.1-PyCaret.ipynb)  &nbsp; [![kaggle](https://camo.githubusercontent.com/a08ca511178e691ace596a95d334f73cf4ce06e83a5c4a5169b8bb68cac27bef/68747470733a2f2f6b6167676c652e636f6d2f7374617469632f696d616765732f6f70656e2d696e2d6b6167676c652e737667)](https://www.kaggle.com/alexanderlundervold/2023-dat158-3-1-pycaret-ipynb)

> **NB**: If you want to run this notebook on your own computer, you must install PyCaret. I recommend creating a new conda environment and running `pip install "pycaret[full]"`.

# A quick PyCaret tutorial

> PyCaret is an open-source, low-code machine learning library in Python designed to automate and streamline machine learning workflows. It serves as an end-to-end machine learning and model management solution, significantly enhancing productivity and reducing the time needed for experimentation in machine learning projects.

> This will be a concise tutorial on PyCaret. Consult the documentation at https://pycaret.gitbook.io/docs/ for more. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
# Check whether the notebook is running on Google Colaboratory or on Kaggle,
# as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    import google.colab  # only importable inside Colab
    colab = True
except ImportError:
    colab = False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [None]:
if (colab or kaggle):
    %pip install "pycaret[full]"

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=RuntimeWarning)

# Getting data

PyCaret comes with functionality for loading many standard benchmark datasets:

In [None]:
import pycaret

In [None]:
from pycaret.datasets import get_data

In [None]:
_ = get_data('index')

In [None]:
dataset = 'cancer'

In [None]:
data = get_data(dataset)

In [None]:
target = 'Class'

> **Your turn!** Later, you should try out some of the other datasets listed above. You can also try to load your own data.

# Inspect the data

We'll use one of the smaller data sets to reduce computational time.  But, of course, you're welcome to try some of the other data sets listed above!

In [None]:
data[target].value_counts()

In [None]:
data.info()

In [None]:
data.describe()

# Explore the data

After downloading data and looking at its structure, one should start a more thorough exploration. We've seen how this can be done earlier in the course.

Here's a convenient package that can perform some of the common exploration steps automatically:

In [None]:
from ydata_profiling import ProfileReport  # formerly named pandas_profiling

In [None]:
ProfileReport(data)

# Prepare the data and set up an experiment

In [None]:
from pycaret.classification import *

In [None]:
experiment = setup(data=data, target=target, normalize=True, 
                   normalize_method='robust', 
                   log_experiment=True, experiment_name='exp1', 
                   session_id=42)

> **Your turn!** Explore the various options PyCaret provides for setting up experiments using `setup`. 
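
With `normalize=True` and `normalize_method='robust'`, the features are rescaled using statistics that are insensitive to outliers. The sketch below is not PyCaret's internal code. It only illustrates the usual definition of robust scaling (subtract the median, divide by the interquartile range), which is also the default behavior of scikit-learn's `RobustScaler`. The helper name `robust_scale` and the toy numbers are made up for illustration.

```python
import numpy as np

def robust_scale(x):
    """Center on the median and scale by the interquartile range (IQR)."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - median) / (q3 - q1)

# A toy feature with one extreme outlier
feature = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
scaled = robust_scale(feature)
print(scaled)
```

The four typical values end up between -1 and 0.5, while the outlier stays far away without distorting the scale of the rest.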

# Train some baseline models

In [None]:
models()

In [None]:
top_models = compare_models(n_select=5, sort='Accuracy', exclude=['ridge'])
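
Conceptually, `compare_models` cross-validates a collection of models and ranks them by the chosen metric. Here is a rough sketch of that idea using scikit-learn directly. It assumes a stand-in data set bundled with scikit-learn and three arbitrarily chosen candidates, so the numbers won't match the PyCaret leaderboard above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set bundled with scikit-learn (not PyCaret's 'cancer' data)
X, y = load_breast_cancer(return_X_y=True)

# Arbitrary candidates, chosen only for illustration
candidates = {
    "lr": make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(RobustScaler(), KNeighborsClassifier()),
    "dt": DecisionTreeClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank the candidates by mean cross-validated accuracy, best first
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, acc in leaderboard:
    print(f"{name}: {acc:.3f}")
```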

The results from this experiment are saved as a dataframe in our log (later in the notebook, we'll explore the results of our experiments using MLflow): 

In [None]:
log_df = get_logs()

In [None]:
#log_df

In [None]:
def get_sorted_logs():
    log_df = get_logs()
    return log_df.sort_values(by='metrics.Accuracy', ascending=False)

In [None]:
get_sorted_logs()

# Hyperparameter tuning

We've found some candidate models: 

In [None]:
top_models

We'll want to tune their hyperparameters to try to improve their performance:

In [None]:
%%time
tuned_models = [tune_model(m, optimize='Accuracy', n_iter=600, fold=5, choose_better=True) 
                for m in top_models]

**Note:** We've used the default parameter grids set by PyCaret. However, it's often a good idea to consider more carefully which parameters to search over, as the right choice depends on both the model and the data. You can supply your own grid through the `custom_grid` argument of `tune_model`. See the PyCaret source code for the default parameter grids.

The logs have now been updated. Calling `get_sorted_logs()` again shows the tuned models alongside the baselines.

These scores can be compared to those obtained with the default parameters. Note that `tune_model` uses `RandomizedSearchCV` from scikit-learn as its default search strategy. It tries only `n_iter` randomly sampled combinations from the parameter grid, so it is not guaranteed to find the best one. You can change both the search algorithm and the search library (for example, to `scikit-optimize`).
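
To make the random-search behavior concrete, here is a minimal sketch using `RandomizedSearchCV` directly. The data set, the model, and the search space are assumptions chosen for illustration. They are not PyCaret's default grids.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical search space, chosen only for illustration
param_distributions = {
    "max_depth": randint(1, 11),         # integers 1..10
    "min_samples_leaf": randint(1, 21),  # integers 1..20
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,  # 10 random draws out of 200 possible combinations
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

# Only n_iter candidates were evaluated, so the true optimum may be missed
print(len(search.cv_results_["params"]))
print(search.best_params_, round(search.best_score_, 3))
```

Increasing `n_iter` covers more of the search space at the cost of longer run time.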

# Ensembling

As we've seen, it's often possible to combine models in a way that outperforms each of the single models. Again, there are multiple ways of doing this. A simple way, as you know, is to use "voting ensembles." In PyCaret, we can use `blend_models` to construct voting ensembles. 

Let's try ensembling some of the best models found so far:

In [None]:
tuned_models

In [None]:
n_models = 4
best_models = tuned_models[:n_models]

In [None]:
best_models

In [None]:
voting_hard = blend_models(best_models, method='hard', optimize='Accuracy')
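
As a reminder of what "hard" voting means, and how it differs from "soft" voting, here is a small NumPy sketch. The probabilities are made-up numbers for three hypothetical models. The code illustrates the voting rules only and does not use PyCaret.

```python
import numpy as np

# Hypothetical predicted probabilities for class 1 from three models
# (rows: models, columns: samples); made-up numbers for illustration
proba = np.array([
    [0.9, 0.4, 0.45],
    [0.6, 0.4, 0.40],
    [0.2, 0.9, 0.95],
])

# Hard voting: each model casts one vote (its predicted label), majority wins
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) > proba.shape[0] / 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print("hard:", hard)
print("soft:", soft)
```

For the last two samples the two rules disagree: two of the three models vote for class 0, but the third is confident enough to pull the average probability above 0.5.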

As we've seen earlier, one can also train a so-called "blender" on top of the predictions from a set of models and, in that way, make use of more complicated patterns than in a voting ensemble:

In [None]:
blender = stack_models(estimator_list=best_models, optimize='Accuracy')
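
For comparison, here is a minimal sketch of stacking written directly with scikit-learn's `StackingClassifier`. The base models, the logistic-regression blender, and the data set are assumptions made for illustration. They are not the models selected above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Arbitrary base models, chosen only for illustration
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("knn", make_pipeline(RobustScaler(), KNeighborsClassifier())),
]

# The final estimator (the "blender") is trained on out-of-fold
# predictions from the base models
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked accuracy: {stack_accuracy:.3f}")
```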

# Inspect the results of the experiment

In [None]:
get_sorted_logs()

We can have a look at our results using MLflow (https://mlflow.org/):

In [None]:
# Starts a local MLflow server; this cell keeps running until you interrupt it
!mlflow ui

# Evaluate the results

We'll pick the best model trained so far and evaluate it. 

In [None]:
get_sorted_logs().iloc[0]['artifact_uri']

In [None]:
# Find the location of the best model ([7:] strips the leading "file://" from the URI)
best_model_fn = f"{get_sorted_logs().iloc[0]['artifact_uri'][7:]}/model/model"

In [None]:
best_model_fn

In [None]:
best_model = load_model(best_model_fn)
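
Slicing off the first seven characters works because the logged `artifact_uri` starts with `file://`. A slightly more defensive variant, sketched below with only the standard library, parses the URI instead. The helper name `artifact_uri_to_path` and the example URI are hypothetical, and Windows-style file URIs would need extra handling.

```python
from urllib.parse import unquote, urlparse

def artifact_uri_to_path(uri):
    """Return the local path for a file:// URI; leave other values untouched."""
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        return unquote(parsed.path)
    return uri

# Hypothetical URI in the style used for local runs
example_uri = "file:///tmp/mlruns/1/abc123/artifacts"
example_path = artifact_uri_to_path(example_uri)
print(example_path)
```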

In [None]:
best_model

## Confusion matrix

In [None]:
plot_model(best_model, 'confusion_matrix')

## Classification report

In [None]:
plot_model(best_model, 'class_report')

## Errors

Here's a plot of the errors made by the model:

In [None]:
plot_model(best_model, 'error')

## Precision versus recall

Recall that there is typically a tradeoff between precision and recall: raising the decision threshold usually increases precision at the cost of recall, and lowering it does the opposite. We can visualize how the metrics vary with the threshold:

In [None]:
plot_model(best_model, 'threshold')

If you want to change this threshold (e.g., if false positives are worse than false negatives in your specific case), you can use the function `optimize_threshold`.
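
To see the tradeoff in numbers, here is a small self-contained sketch that computes precision and recall at a few thresholds. The labels and scores are made up for illustration, and `precision_recall_at` is a hypothetical helper, not a PyCaret function.

```python
import numpy as np

def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities for eight samples
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]

for t in (0.3, 0.5, 0.75):
    p, r = precision_recall_at(y_true, y_score, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, precision rises from about 0.67 to 1.0 as the threshold goes from 0.3 to 0.75, while recall drops from 1.0 to 0.5.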

## Feature importance

Which features does the model lean on the most?

In [None]:
plot_model(best_model, 'feature')

Note that this can vary between different models.

In [None]:
print(best_models[1])

In [None]:
plot_model(best_models[1], 'feature')
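
One reason the rankings differ is that model families define "importance" differently. The sketch below compares absolute coefficients of a linear model with impurity-based importances of a random forest. The data set and both models are assumptions chosen for illustration, not the models trained above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

bunch = load_breast_cancer()
X, y, names = bunch.data, bunch.target, bunch.feature_names

linear = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Linear model: absolute coefficient size (on scaled features);
# forest: impurity-based importance
linear_scores = np.abs(linear[-1].coef_[0])
forest_scores = forest.feature_importances_

top_linear = [names[i] for i in np.argsort(linear_scores)[::-1][:5]]
top_forest = [names[i] for i in np.argsort(forest_scores)[::-1][:5]]
print("Logistic regression:", top_linear)
print("Random forest:      ", top_forest)
```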

## ExplainerDashboard

In [None]:
dashboard(best_models[0])

# Use the model on new data

In [None]:
# Predict on the test data we put aside earlier
y_pred = predict_model(best_model)

In [None]:
y_pred.head()

# Export the pipeline

When you are done constructing, training, evaluating, and interpreting the models, it's time to deploy them. First, you'll want to export the model together with the entire pre-processing pipeline to, for example, the hard drive, memory, or a cloud provider.

When you start making predictions on entirely new data (in other words, after you've completed the first stage of the model building), then you can use `predict_model` on this data. The data will then be preprocessed according to the pipeline and passed through the model.

Remark: until now, we've put aside some data for testing. Once you're done constructing the model, this (often valuable) labeled data should also be used for training, so that the final model is trained on _all_ the available labeled data.

This can be achieved by using `finalize_model`:

In [None]:
final_model = finalize_model(best_model)

Then we can save the model:

In [None]:
save_model(final_model,'saved_model')
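
The key point is that pre-processing and model are stored as a single object, so new data can be passed in raw. Here is a minimal sketch of that idea with a plain scikit-learn pipeline and `joblib`. The pipeline, the data set, and the file name are assumptions for illustration. This is not the file PyCaret wrote above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = load_breast_cancer(return_X_y=True)

# Pre-processing and model live in one object, trained on all labeled data
pipeline = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_sketch.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# "New" data is passed in raw: the restored pipeline applies the scaling itself
new_rows = X[:5]
same_predictions = np.array_equal(
    pipeline.predict(new_rows), restored.predict(new_rows)
)
print(same_predictions)
```

In PyCaret, the corresponding steps are `load_model('saved_model')` followed by `predict_model(loaded_model, data=new_data)`, where `loaded_model` and `new_data` stand for the loaded pipeline and a dataframe of unseen rows.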

# Deploy

PyCaret has built-in functionality for deployment to AWS, GCP, and Azure: https://pycaret.gitbook.io/docs/get-started/functions/deploy. You are, of course, free to deploy anywhere else.

In [None]:
#?deploy_model

We can also create a simple POST API:

In [None]:
# Pass a trained model from the experiment (here the first of the tuned models)
create_api(best_models[0], 'test')

In [None]:
# %load test.py

In [None]:
!python test.py