# <font color='#eb3483'> PyCaret </font>

Last week we learned a great toolkit of machine learning models and how to build, train, and test them in isolation. But what if we don't want to try out each model independently? PyCaret (the Python ~~rip-off~~ version of caret, R's premier machine learning interface) provides an awesome interface to quickly test a gaggle of machine learning algorithms with only a handful of lines of code. In this notebook we'll check it out and show you some helpful functionality.

Start by installing the package using pip if you haven't already:
`pip install pycaret`

### <font color='#eb3483'> Credit Dataset </font>
For this tutorial we'll be using the credit dataset from [UCI](https://archive.ics.uci.edu/ml/index.php) (a really great resource for machine learning datasets). PyCaret, like sklearn and seaborn, has some great built-in functionality for getting data.

In [None]:
from pycaret.datasets import get_data
credit_raw = get_data('credit')

Our dataset has 23 features about a loan applicant, and whether or not they defaulted on the loan (default variable). Our goal is going to be to predict whether or not someone will default. We're going to hold a little bit of the data back for testing our models later.

In [None]:
credit = credit_raw.sample(frac=0.95, random_state=123)
test_data = credit_raw.drop(credit.index).reset_index(drop=True)
credit.reset_index(drop=True, inplace=True)
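
If you're curious what that split is doing, here's a minimal pure-Python sketch of the same idea: randomly keep 95% of the rows and set the rest aside. The `holdout_split` helper and the toy `rows` list are made up for illustration; the notebook itself does this with pandas' `sample` and `drop`.

```python
import random

def holdout_split(rows, frac=0.95, seed=123):
    """Return (kept, held_out): a random `frac` share of rows and the remainder."""
    rng = random.Random(seed)                      # fixed seed for a repeatable split
    n_keep = round(len(rows) * frac)
    kept_idx = set(rng.sample(range(len(rows)), n_keep))
    kept = [row for i, row in enumerate(rows) if i in kept_idx]
    held_out = [row for i, row in enumerate(rows) if i not in kept_idx]
    return kept, held_out

rows = list(range(100))          # stand-in for 100 loan applicants
kept, held_out = holdout_split(rows)
print(len(kept), len(held_out))  # 95 5
```

Every row lands in exactly one of the two lists, which is what lets us treat the held-out rows as genuinely unseen data later on.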

### <font color='#eb3483'> PyCaret Environment </font>
To start working with our data in PyCaret we need to import a module based on the type of problem we're solving (in this case classification, but check online for other options), and set up a PyCaret environment using the `setup` function. This will check the data type of our columns and do some important pre-processing steps. PyCaret is built to be used in Jupyter or Google Colab, so it'll have some interactive steps for you to check what's going on and confirm it has interpreted the data correctly.

In [None]:
# import the classification module
from pycaret import classification
# the environment could also be set up through the module namespace:
# classification_setup = classification.setup(data=credit, target='default')

In [None]:
from pycaret.classification import *
exp_clf101 = setup(data = credit, target = 'default', session_id=123)

Whew, there's a lot going on here! It's not super important that we understand every step for this tutorial, but read through the list and you'll see that PyCaret can automatically take care of a lot of things we've talked about over the past couple of weeks, including imputing missing data, normalizing features, and breaking our data into a train and test set. We haven't checked most of the options, but these are awesome steps to keep in mind when you want to use PyCaret to customize your data cleaning pipeline.
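
To make two of those steps concrete, here's a rough sketch of mean imputation and z-score normalization in plain Python. The helper names and the toy `ages` column are invented for illustration, and this is not how PyCaret implements them internally; check the `setup` docs for which steps are switched on by default.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [20, None, 40, 60]     # toy column with one missing value
filled = impute_mean(ages)    # the missing entry becomes the mean of 20, 40, 60
scaled = zscore(filled)       # centered on 0 with standard deviation 1
print(filled)
print(scaled)
```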

### <font color='#eb3483'> Comparing Models </font>
Where PyCaret shines is quickly comparing a multitude of machine learning models. The `compare_models` function runs through all the models in the module you're using (i.e. classification, in our case) and outputs some high-level metrics to help you see what types of models seem to work best on your data. This'll take a while (as it should - we're training a bunch of models)!

In [None]:
compare_models()

You'll notice that PyCaret has even saved us the trouble of scanning the table to see which model is best - it's sorted by accuracy (and if you change the `sort` parameter you can get it to sort by other metrics).
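
Sorting a leaderboard by a different metric is the same idea as sorting any table. The sketch below uses made-up model names and scores (not real results from this dataset) and a hypothetical `leaderboard` helper to show how the top model can change with the metric; in PyCaret itself you'd pass something like `sort='AUC'` to `compare_models`.

```python
# Made-up scores for illustration only
results = [
    {"model": "model_a", "Accuracy": 0.82, "AUC": 0.71},
    {"model": "model_b", "Accuracy": 0.80, "AUC": 0.78},
    {"model": "model_c", "Accuracy": 0.78, "AUC": 0.75},
]

def leaderboard(results, sort="Accuracy"):
    """Return the rows ordered from best to worst on the chosen metric."""
    return sorted(results, key=lambda row: row[sort], reverse=True)

by_accuracy = [row["model"] for row in leaderboard(results)]
by_auc = [row["model"] for row in leaderboard(results, sort="AUC")]
print(by_accuracy)  # ['model_a', 'model_b', 'model_c']
print(by_auc)       # ['model_b', 'model_c', 'model_a']
```

Which metric you sort by should follow from the problem: for an imbalanced target like loan defaults, accuracy alone can be misleading.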

### <font color='#eb3483'> Training A Model </font>
`compare_models` is an amazing tool that gives you an overview of which algorithms perform best, but it's really just a leaderboard: depending on your PyCaret version it may not return a trained model at all. It also doesn't do hyperparameter tuning for each model, so once you have an idea of which models are worth trying it's important to actually train one on its own. Luckily PyCaret has a convenient interface for training models.

Based on the AUC metric it looks like extreme gradient boosting is the way to go for this dataset. Let's train a model using it. We can see what the abbreviated string for each model type is [here](https://pycaret.org/create-model/).

In [None]:
#All we have to do to train a model is use create_model and specify we want XGBoost (aka Extreme Gradient Boosting)
xgb = create_model('xgboost')

We can even tune hyperparameters for a model using the `tune_model` function. It works by trying out values from a simple grid of possible hyperparameters and choosing the combination with the best accuracy, but it provides a lot of flexibility for specifying exactly how you want to tune it (check out the help docs for all the options!). Let's train a tuned decision tree model.

In [None]:
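# tune_model in current PyCaret releases takes a trained model object, so create the tree first
tuned_dt = tune_model(create_model('dt'))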

Wow, notice how much higher our decision tree performance is after tuning versus when we just used `compare_models`! A nice reminder that hyperparameter tuning is super important for getting rockstar machine learning results.
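
The idea behind grid-style tuning can be sketched in a few lines of plain Python: score every combination and keep the best. `grid_search` and `toy_accuracy` are hypothetical stand-ins (a real tuner would score each combination with cross-validation), and PyCaret's own search may only sample some combinations rather than trying them all.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Try every combination in `grid` and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_accuracy(max_depth, min_samples_leaf):
    """Made-up stand-in for the cross-validated accuracy of a decision tree."""
    return 0.80 - 0.01 * abs(max_depth - 4) - 0.005 * abs(min_samples_leaf - 5)

grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]}
best_params, best_score = grid_search(toy_accuracy, grid)
print(best_params)  # {'max_depth': 4, 'min_samples_leaf': 5}
```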

### <font color='#eb3483'> Ensemble Methods </font>
So far we've talked about training and testing models in isolation, but oftentimes the best solution is to use a few different algorithms and combine their predictions, a process called *ensembling*. This is a massive topic so we won't dive deep into it in this notebook, but feel free to check out some more information [here](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/?utm_source=blog&utm_medium=pycaret-machine-learning-model-seconds).

PyCaret lets us combine models into an ensemble using a few different methods, but let's use the blending method and give it the two classifiers we've already trained.

In [None]:
blender = blend_models(estimator_list=[xgb, tuned_dt])

Unfortunately it doesn't look like ensembling has led to a massive increase in performance on this problem with the models we included, but in general it's a great tool to try out when you're trying to eke out a few extra percentage points.
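
One common way to blend classifiers is soft voting: average each model's predicted probability and threshold the result. Here's a minimal sketch with made-up probabilities; the `soft_vote` helper is hypothetical, and `blend_models` may use a different voting scheme depending on its `method` argument and your PyCaret version.

```python
def soft_vote(probas_per_model, threshold=0.5):
    """Average per-model default probabilities row by row, then threshold."""
    averaged = [sum(row) / len(row) for row in zip(*probas_per_model)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels

# Made-up predicted probabilities of default for four applicants
xgb_proba = [0.9, 0.2, 0.6, 0.4]
dt_proba = [0.7, 0.4, 0.3, 0.8]

averaged, labels = soft_vote([xgb_proba, dt_proba])
print(labels)  # [1, 0, 0, 1]
```

Notice the third and fourth applicants: the two models disagree, and the more confident one wins the vote.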

### <font color='#eb3483'> Plotting Performance </font>
We all know data visualization is important - and man oh man does PyCaret have some great tools for visualizing our model's performance (looks like a greatest hits of our class from this week). I won't explain every plot (there are over 15 options) - but let's take a peek at a couple of familiar plots.

In [None]:
plot_model(tuned_dt, plot = 'auc')
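
As a reminder of what that AUC number means, here's a small sketch that computes it straight from the definition: the fraction of (defaulter, non-defaulter) pairs where the defaulter gets the higher score. The `roc_auc` helper and the toy labels and scores are illustrative only.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]            # toy labels: 1 = defaulted
y_score = [0.1, 0.4, 0.35, 0.8]  # toy predicted probabilities
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```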

In [None]:
plot_model(tuned_dt, plot = 'confusion_matrix')
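
The confusion matrix is just a count of how predictions line up with the truth. A minimal sketch, assuming binary 0/1 labels and the common layout of true classes as rows and predicted classes as columns (the `confusion_matrix` helper and the toy labels are made up for illustration):

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1   # row = true class, column = predicted class
    return matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2, 1], [1, 2]]
```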

In [None]:
plot_model(tuned_dt, plot='feature')

Instead of plotting each curve independently, PyCaret even provides an interface for selecting what we want to see (we're barely even coding anymore!).

In [None]:
evaluate_model(tuned_dt)

### <font color='#eb3483'> Prediction </font>
Obviously if we're training a machine learning model, we'll eventually want to use it to make some predictions! To make predictions on unseen data, you can use the `predict_model` function. 

In [None]:
predict_model(tuned_dt, data=test_data)

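Notice that the output is the data with two new columns, a label and a score, representing our predictions (depending on your PyCaret version they're named `Label` and `Score` or `prediction_label` and `prediction_score`).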
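
Since `test_data` still contains the true `default` column, one quick sanity check is to compare it against the predicted label. Below is a toy sketch of that comparison using plain dictionaries; the rows are invented, `accuracy` is a hypothetical helper, and the predicted-label column name is an assumption you should adjust to whatever your PyCaret version outputs.

```python
def accuracy(rows, truth="default", predicted="Label"):
    """Share of rows where the predicted label matches the true one."""
    hits = sum(row[truth] == row[predicted] for row in rows)
    return hits / len(rows)

# Invented rows standing in for the output of predict_model
predictions = [
    {"default": 0, "Label": 0},
    {"default": 1, "Label": 1},
    {"default": 1, "Label": 0},
    {"default": 0, "Label": 0},
]
held_out_accuracy = accuracy(predictions)
print(held_out_accuracy)  # 0.75
```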

### <font color='#eb3483'> Next Steps </font>
PyCaret has a lot of functionality, and this notebook just scratches the surface. For more information make sure to check out their website and the fantastic tutorials they have [here](https://pycaret.org/guide/).