## AutoGluon

In this notebook, we use __AutoGluon__ to predict the __Outcome Type__ field of our review dataset.


[AutoGluon](https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html) implements many of the best practices that we have discussed in this class, and more. In particular, it sets itself apart from other AutoML solutions through automated feature engineering that handles text data and missing values without hand-coded preprocessing (see their [paper](https://arxiv.org/abs/2003.06505) for details). It is too new to be included in an existing SageMaker kernel, so let's install it.

1. <a href="#1">Set up AutoGluon</a>
2. <a href="#2">Read the dataset</a>
3. <a href="#3">Train a classifier with AutoGluon</a>
4. <a href="#4">Model evaluation</a>
5. <a href="#5">Clean up model artifacts</a>

__Austin Animal Center Dataset__:

In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). 

To work with a single table, we joined the intake and outcome tables on the "Animal ID" column and created a single __review.csv__ file. To keep the dataset simple, we excluded animals with multiple entries to the facility. If you want to see the original datasets and the merged data with multiple entries, they are available in the data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv and Austin_Animal_Center_Intakes_Outcomes.csv.
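
The script that built the merged file is not part of this notebook, so the following is only a minimal sketch of the join described above, assuming pandas. The two small tables and their values are made up for illustration, and `single_visit` is a hypothetical helper; only the "Animal ID" join key and the "drop animals with multiple entries" rule come from the text.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome tables (values are illustrative only).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Stray"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Outcome Type": ["Adoption", "Return to Owner", "Adoption", "Transfer"],
})


def single_visit(df, key="Animal ID"):
    """Keep only animals that appear exactly once in the table (hypothetical helper)."""
    # keep=False flags every row of a repeated key, not just the later copies.
    return df[~df.duplicated(subset=key, keep=False)]


# Inner join on the shared key, after dropping animals with multiple entries.
merged = single_visit(intakes).merge(single_visit(outcomes), on="Animal ID", how="inner")
print(merged)
```

Animal "A2" appears twice in both toy tables, so it is dropped before the join and only "A1" and "A3" remain in `merged`.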

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome (0 = not placed, 1 = placed). This is the field to predict.
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Location where the pet was found before it entered the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of the pet when it entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of the pet when it entered the center
- __Breed__ - Breed of pet 
- __Color__ - Color of pet 
- __Age upon Intake Days__ - Age of the pet when it entered the center (days)
- __Age upon Outcome Days__ - Age of the pet at outcome (days)

## 1. <a name="1">Set up AutoGluon</a>
(<a href="#0">Go to top</a>)

In [1]:
!pip install --upgrade pip
!pip install --upgrade mxnet autogluon

import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (20.1.1)
Requirement already up-to-date: mxnet in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (1.6.0)
Requirement already up-to-date: autogluon in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (0.0.11)
Collecting scikit-learn<0.23,>=0.22.0
  Using cached scikit_learn-0.22.2.post1-cp36-cp36m-manylinux1_x86_64.whl (7.1 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.23.1
    Uninstalling scikit-learn-0.23.1:
      Successfully uninstalled scikit-learn-0.23.1
Successfully installed scikit-learn-0.22.2.post1


## 2. <a name="2">Read the dataset</a>
(<a href="#0">Go to top</a>)

Let's read the dataset into a pandas DataFrame and split it into training and test sets (AutoGluon handles validation itself).

In [2]:
import pandas as pd

df = pd.read_csv('../data/review/review_dataset.csv')

In [3]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.1, shuffle=True, random_state=23)

## 3. <a name="3">Train a classifier with AutoGluon</a>
(<a href="#0">Go to top</a>)

We can run AutoGluon with a short snippet.

In [4]:
from autogluon import TabularPrediction as task

k = 10000  # use a subset of the training data for a quick demo
# k = train_data.shape[0]  # grab the whole dataset

predictor = task.fit(train_data=train_data.head(k), label='Outcome Type')

No output_directory specified. Models will be saved in: AutogluonModels/ag-20200711_030909/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200711_030909/
Train Data Rows:    10000
Train Data Columns: 13
Preprocessing data ...
Here are the 2 unique label values in your data:  [1.0, 0.0]
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Train Data Class Count: 2
Feature Generator processed 10000 data points with 230 features
Original Features:
	object features: 10
	int features: 2
Generated Features:
	int features: 219
All Features:
	object features: 10
	int features: 221
	Data preprocessing and feature engineering runtime = 9.91s ...
AutoGluon w

## 4. <a name="4">Model evaluation</a>
(<a href="#0">Go to top</a>)

In [5]:
y_pred = predictor.predict(test_data.head(k))
predictor.evaluate_predictions(y_true=test_data['Outcome Type'].head(k), y_pred=y_pred, auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.8496177610220965
Evaluations on test data:
{
    "accuracy": 0.8496177610220965,
    "accuracy_score": 0.8496177610220965,
    "balanced_accuracy_score": 0.8431987847586772,
    "matthews_corrcoef": 0.6924709919442359,
    "f1_score": 0.8496177610220965
}
Detailed (per-class) classification report:
{
    "0.0": {
        "precision": 0.8476018566271274,
        "recall": 0.7954985479186835,
        "f1-score": 0.8207240948813982,
        "support": 4132
    },
    "1.0": {
        "precision": 0.8509962969493916,
        "recall": 0.8908990215986708,
        "f1-score": 0.8704906204906204,
        "support": 5417
    },
    "accuracy": 0.8496177610220965,
    "macro avg": {
        "precision": 0.8492990767882596,
        "recall": 0.8431987847586772,
        "f1-score": 0.8456073576860093,
        "support": 9549
    },
    "weighted avg": {
        "precision": 0.8495274701181428,
        "recall": 0.8496177610220965,
        "f1-score": 0.8489558

OrderedDict([('accuracy', 0.8496177610220965),
             ('accuracy_score', 0.8496177610220965),
             ('balanced_accuracy_score', 0.8431987847586772),
             ('matthews_corrcoef', 0.6924709919442359),
             ('f1_score', 0.8496177610220965),
             ('classification_report',
              {'0.0': {'precision': 0.8476018566271274,
                'recall': 0.7954985479186835,
                'f1-score': 0.8207240948813982,
                'support': 4132},
               '1.0': {'precision': 0.8509962969493916,
                'recall': 0.8908990215986708,
                'f1-score': 0.8704906204906204,
                'support': 5417},
               'accuracy': 0.8496177610220965,
               'macro avg': {'precision': 0.8492990767882596,
                'recall': 0.8431987847586772,
                'f1-score': 0.8456073576860093,
                'support': 9549},
               'weighted avg': {'precision': 0.8495274701181428,
                'recall': 
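
To make the summary metrics above easier to interpret, here is a small sketch that rebuilds the confusion matrix from the per-class report and recomputes accuracy, balanced accuracy, the Matthews correlation coefficient and the per-class F1 scores using the standard textbook formulas. The counts are not printed by AutoGluon; they are derived here as `recall * support` for each class, which is an assumption of this sketch.

```python
import math

# Per-class support and recall, copied from the classification report above.
support = {0: 4132, 1: 5417}
recall = {0: 0.7954985479186835, 1: 0.8908990215986708}

# Derived counts (an assumption of this sketch): correct predictions = recall * support.
# round() is needed because the report lists ratios rather than raw counts.
tn = round(recall[0] * support[0])  # class 0 predicted as 0
tp = round(recall[1] * support[1])  # class 1 predicted as 1
fp = support[0] - tn                # class 0 predicted as 1
fn = support[1] - tp                # class 1 predicted as 0

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
# Balanced accuracy is the mean of the two per-class recalls.
balanced_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2
# Matthews correlation coefficient for the binary case.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)


def f1(precision, recall_value):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall_value / (precision + recall_value)


f1_class1 = f1(tp / (tp + fp), tp / (tp + fn))
f1_class0 = f1(tn / (tn + fn), tn / (tn + fp))

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.4f} balanced_accuracy={balanced_accuracy:.4f} mcc={mcc:.4f}")
print(f"f1 (class 0)={f1_class0:.4f} f1 (class 1)={f1_class1:.4f}")
```

Note that the reported `f1_score` is identical to the accuracy. That is what a micro-averaged F1 gives in single-label classification; that AutoGluon averages this way is an inference from the numbers, not something the output states. The per-class F1 values in the report are the ones to look at when comparing how well each class is predicted.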

## 5. <a name="5">Clean up model artifacts</a>
(<a href="#0">Go to top</a>)

In [6]:
!rm -r AutogluonModels
!rm -r catboost_info
!rm -r dask-worker-space

rm: cannot remove ‘catboost_info’: No such file or directory
rm: cannot remove ‘dask-worker-space’: No such file or directory
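
The `rm` errors above only mean that two of the directories were never created in this run. If you prefer a cleanup cell that stays quiet in that case, one option is to do it from Python. This is a minimal sketch using only the standard library; `remove_dirs` is a hypothetical helper name, and the directory names are the ones from the cell above.

```python
import shutil
from pathlib import Path

# Directory names taken from the cleanup cell above.
ARTIFACT_DIRS = ["AutogluonModels", "catboost_info", "dask-worker-space"]


def remove_dirs(names, base="."):
    """Delete each named directory under `base` if it exists; return the names removed."""
    removed = []
    for name in names:
        path = Path(base) / name
        if path.is_dir():  # skip directories that were never created
            shutil.rmtree(path)
            removed.append(name)
    return removed


print("removed:", remove_dirs(ARTIFACT_DIRS))
```

Returning the list of removed names makes it easy to see which artifacts a given run actually produced.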
