**Table of contents**<a id='toc0_'></a>    
- [AutoML - PyCaret Quickstart](#toc1_)    
- [Regression](#toc2_)    
  - [Setting up an experiment](#toc2_1_)    
      - [Preprocessing](#toc2_1_1_1_)    
  - [Modelling](#toc2_2_)    
  - [Model diagnostics](#toc2_3_)    
  - [Model tracking](#toc2_4_)    
  - [What comes next?](#toc2_5_)    
- [Extra: Homework](#toc3_)    
- [Resources](#toc4_)    


# <a id='toc1_'></a>[AutoML - PyCaret Quickstart](#toc0_)

In [1]:
#!pip install pycaret
#!pip install markupsafe==2.0.1
#!pip install numpy==1.20
#!pip install mlflow

In [2]:
import numpy as np
import pandas as pd
import time
from sklearn.impute import KNNImputer

# <a id='toc2_'></a>[Regression](#toc0_)

In [3]:
from pycaret.regression import *

RuntimeError: ('Pycaret only supports python 3.9, 3.10, 3.11. Your actual Python version: ', sys.version_info(major=3, minor=12, micro=6, releaselevel='final', serial=0), 'Please DOWNGRADE your Python version.')

In [None]:
housing_data = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/california_housing_census.csv')
housing_data.shape

In [None]:
# Work on a random sample of 500 rows so the experiments run faster
housing = housing_data.sample(500, random_state=10)
print(housing.shape)

## <a id='toc2_1_'></a>[Setting up an experiment](#toc0_)

In [None]:
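# Set up the experiment: scale the features, fix the random seed, and log everything to MLflow
exp = setup(
    housing,
    target='median_house_value',
    normalize=True,           # scale numeric features (see `normalize_method`)
    # fold=5,                 # uncomment to use 5-fold instead of the default 10-fold CV
    session_id=10,            # random seed, for reproducibility
    log_experiment='mlflow',  # track runs with MLflow
    log_plots=True,           # save diagnostic plots with each run
    log_data=True             # save the train/test sets with each run
)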

All of the `setup` parameters are described in the full documentation: https://pycaret.readthedocs.io/en/stable/api/regression.html

#### <a id='toc2_1_1_1_'></a>[Preprocessing](#toc0_)

The cool thing about pycaret is that you can personalize a gazillion things off the bat. A couple of things that I found useful to work with:

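- `ordinal_features` - which features to encode ordinally, passed as a dict that maps each feature to its categories in ascending order, e.g. `{'size': ['low', 'medium', 'high']}`. By default, pycaret uses One-Hot Encoding for all categorical features.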

- `max_encoding_ohe` - the cut-off for One-Hot Encoding (OHE). By default, a categorical feature with more than 25 categories is target-encoded instead. For a regression problem, target encoding replaces each category with the mean of the target in that group (computed on the training set only!). For a classification problem, it uses the probability of the outcome in each group.

- `ignore_features` - which features to ignore during modelling.

- `keep_features` - which features to never remove during preprocessing. **Attention:** The model might still use other features; this option only guarantees that the listed features can't be removed.

- `numeric_imputation`, `categorical_imputation` - to select how to fill in NaNs.

- `remove_multicollinearity`, `multicollinearity_threshold` - uses Pearson (linear) correlation to remove highly correlated variables.

- `remove_outliers`, `outliers_method` - this one doesn't remove outliers with the [typical methods](https://archive.is/ruhhQ) but instead uses outlier detection algorithms such as `IsolationForest`, `EllipticEnvelope`, and `LocalOutlierFactor`.

- `normalize`, `normalize_method` - how to normalize/scale the data
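
To make the target-encoding idea from the `max_encoding_ohe` bullet concrete, here is a minimal sketch of mean target encoding done by hand with pandas. It is not pycaret's internal implementation (as far as I know, pycaret delegates to an encoder from the `category_encoders` package, which also applies smoothing). The toy data and the column names `ocean_proximity` and `price` are made up for illustration.

```python
import pandas as pd

# Toy data; the column names and values are hypothetical.
train = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "ISLAND"],
    "price": [100.0, 200.0, 300.0, 500.0, 900.0],
})
test = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN"]})

# Learn the mean target per category from the training set only,
# so no information from the test set leaks into the encoding.
category_means = train.groupby("ocean_proximity")["price"].mean()
global_mean = train["price"].mean()

# Apply the learned mapping; categories never seen in training
# fall back to the overall training mean.
train_encoded = train["ocean_proximity"].map(category_means)
test_encoded = test["ocean_proximity"].map(category_means).fillna(global_mean)

print(test_encoded.tolist())
```

The important detail is that `category_means` is computed from `train` alone and only then applied to `test`, which is the "only in the training set" caveat in practice.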

Other cool stuff:  

- `polynomial_features`, `polynomial_degree` - whether to create new features using the polynomials of existing features.

- `transformation`, `transformation_method` - whether to apply a transformation to the features. In pycaret we have Yeo-Johnson (a power transform that, unlike a plain log-transform, also handles zero and negative values) and the quantile transform (which I have never used, but it maps the current distribution onto a desired distribution). Transformations that are easier to interpret are the sqrt and log transforms. There are separate options to transform the target, i.e. `transform_target` and `transform_target_method`.

- `pca`, `pca_method`, `pca_components` - whether to apply PCA for feature extraction & compression.
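
For the curious, this is roughly what the Yeo-Johnson transform does to a single value. The sketch below only implements the textbook formula for a fixed `lmbda`; in practice the parameter is estimated from the data, and pycaret handles that for you. The function name `yeo_johnson` and the sample values are my own, not part of pycaret.

```python
import math


def yeo_johnson(y: float, lmbda: float) -> float:
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if y >= 0:
        if abs(lmbda) < 1e-12:        # lambda == 0: plain log(1 + y)
            return math.log1p(y)
        return ((y + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:      # lambda == 2: mirrored log for negatives
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


# Illustrative values, including negatives that a plain log could not handle.
values = [-3.0, -1.0, 0.0, 1.0, 10.0, 100.0]
for lam in (0.0, 0.5, 1.0):
    print(lam, [round(yeo_johnson(v, lam), 3) for v in values])
```

With `lmbda = 1` the transform leaves the values unchanged, and with `lmbda = 0` it behaves like `log(1 + y)` for non-negative values, which is why it is often described as a log-transform that copes with negative numbers.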

**Note:** While pycaret is great for doing blanket transformations over numerical/categorical variables, it starts to break down when you need a higher level of granularity, e.g. when you want to remove outliers/apply transformations only in certain columns. However, it's a great library to help you start off a project, i.e. get something up-and-running quickly.
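
When you do need that granularity, one option is to handle the special columns yourself before calling `setup`. The sketch below is an assumption-laden example, not a pycaret feature: it runs scikit-learn's `IsolationForest` on a single column of a synthetic DataFrame and drops the flagged rows. The column names mimic the housing data but the values are randomly generated, and `columns_to_check` is a name I made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(10)

# Hypothetical data: two well-behaved columns, one row with an extreme income.
df = pd.DataFrame({
    "median_income": rng.normal(4.0, 1.0, size=200),
    "housing_median_age": rng.normal(30.0, 5.0, size=200),
})
df.loc[0, "median_income"] = 1000.0

# Fit the detector on the chosen column(s) only, not on the whole frame.
columns_to_check = ["median_income"]
detector = IsolationForest(contamination=0.01, random_state=10)
labels = detector.fit_predict(df[columns_to_check])  # 1 = inlier, -1 = outlier

# Keep only the rows the detector considers inliers.
cleaned = df[labels == 1]
print(f"Dropped {len(df) - len(cleaned)} of {len(df)} rows")
```

The `cleaned` DataFrame can then be passed to `setup` with `remove_outliers` left off, so the other columns are untouched.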

In [None]:
# You can access directly X_train, X_test, etc.
exp.X_test  

In [None]:
exp.X_test_transformed

In [None]:
exp.get_config('seed')

## <a id='toc2_2_'></a>[Modelling](#toc0_)

PyCaret is great not only because it lets you set a crazy amount of preprocessing options, but also because it makes comparing model performance crazy easy:

In [None]:
# This might take a while to run so feel free to include only 1 model!
best_model = exp.compare_models(include=['lr', 'rf', 'xgboost', 'svm', 'knn'])

# Models included:
# lr = linear regression
# rf = random forest
# xgboost = extreme gradient boosting
# svm = support vector machine
# knn = k-nearest neighbours

In [None]:
best_model

Modelling options when setting up experiments:  

- `train_size` - to decide on the size of the training set 

- `fold_strategy`, `fold` - cross-validation strategy and number of folds  

- `fold_shuffle` - whether to shuffle the rows before splitting them into folds (the seed itself comes from `session_id`)
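
To see what these options mean in terms of rows, here is a small sketch that uses scikit-learn directly rather than pycaret. It assumes a 70/30 train/test split (which I believe is pycaret's default `train_size`) and 5 folds; the array `X` is just a stand-in for the 500 sampled rows.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(500).reshape(-1, 1)  # stand-in for the 500 sampled rows

# train_size: hold out a test set first.
X_train, X_test = train_test_split(X, train_size=0.7, random_state=10)

# fold / fold_shuffle: split the training set into 5 folds.
# Without shuffling, folds are contiguous blocks of rows;
# with shuffling, rows are assigned to folds at random.
ordered = KFold(n_splits=5, shuffle=False)
shuffled = KFold(n_splits=5, shuffle=True, random_state=10)

for name, cv in [("ordered", ordered), ("shuffled", shuffled)]:
    first_train_idx, first_val_idx = next(iter(cv.split(X_train)))
    print(name, len(first_train_idx), len(first_val_idx), first_val_idx[:5])
```

Each model is trained on four folds and validated on the fifth, five times over, and the test set stays untouched until the very end.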

In [None]:
# The default is 10-fold cross-validation; we ask for 5 folds here because the dataset is small
lr_model = exp.create_model('lr', fold=5)

## <a id='toc2_3_'></a>[Model diagnostics](#toc0_)

On top of being able to create, optimize, and save models, pycaret also lets you run more complex model diagnostics with the help of the `yellowbrick` library ([super useful for ML visualizations](https://www.scikit-yb.org/en/latest/)):

In [None]:
exp.plot_model(best_model, plot='residuals')

In [None]:
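# Feature importance is only available for models that expose coefficients or feature importances
exp.plot_model(best_model, plot='feature')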

## <a id='toc2_4_'></a>[Model tracking](#toc0_)

Lastly, now that pycaret has given us so much capability, how are we going to keep track of all the things we've tested? Easy: pycaret integrates with one of the most popular open-source ML-tracking tools, [MLflow](https://mlflow.org/docs/latest/index.html), so all you need to do is set these `setup` parameters to `True` to begin saving your results:

- `log_experiment` - if set to `True`, defaults to MLflow, but if you want to be explicit you can also say `'mlflow'`  

- `log_data` - whether to save the train/test sets as `.csv` files  

- `log_plots` - whether to save the typical diagnostic plots

To view the results from the experiments, run this command in your terminal, in the same directory as this notebook: `mlflow ui`.

The terminal then prints a local address, typically `http://localhost:5000`, which you can open in your browser!

## <a id='toc2_5_'></a>[What comes next?](#toc0_)

After evaluating all these models and picking the one that suits us best, it's time to get it ready for production!

Since we want the model to see as much data as possible, we can now finally re-train it on the full dataset, i.e. the training and test sets combined:

In [None]:
# Stack the preprocessed train and test sets back into one full dataset
X = pd.concat([exp.X_train_transformed, exp.X_test_transformed], axis=0)
y = pd.concat([exp.y_train_transformed, exp.y_test_transformed], axis=0)

In [None]:
best_model.fit(X, y)

In [None]:
import pickle

# Save the re-trained model; the context manager closes the file for us
with open('best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

In [None]:
with open('best_model.pkl', 'rb') as f:
    best_model = pickle.load(f)

In [None]:
best_model
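
One caveat with the approach above: the pickled model was fitted on already transformed features, so it expects new data to be preprocessed in exactly the same way. PyCaret has `finalize_model` and `save_model` for this purpose. The sketch below shows the underlying idea with a plain scikit-learn `Pipeline` instead, so that it runs without pycaret. The data is synthetic and the file name `best_pipeline.pkl` is made up.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic raw features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Scaling and the estimator live in one object, fitted on all available data.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)

# Persist and reload the whole pipeline (hypothetical file name).
model_path = Path(tempfile.gettempdir()) / "best_pipeline.pkl"
with open(model_path, "wb") as f:
    pickle.dump(pipeline, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline accepts raw, unscaled rows directly.
new_rows = rng.normal(loc=50.0, scale=10.0, size=(5, 3))
print(restored.predict(new_rows))
```

Because the scaler travels with the model, whoever loads the file does not need to know how the features were preprocessed.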

# <a id='toc3_'></a>[Extra: Homework](#toc0_)

Now that we've explored the pycaret module for regression, how about you also explore the [classification module](https://pycaret.readthedocs.io/en/stable/api/classification.html)?  

Also... for those of you interested in more advanced topics, pycaret also has modules for:  
- time series
- anomaly detection, e.g. fraud prediction
- clustering 

You can find some tutorials on all modules here: https://pycaret.gitbook.io/docs/get-started/tutorials

# <a id='toc4_'></a>[Resources](#toc0_)

- https://towardsdatascience.com/how-to-use-pycaret-the-library-for-lazy-data-scientists-91343f960bd2