## AutoGluon

In this notebook, we use __AutoGluon__ to predict the __Outcome Type__ field of our review dataset.

1. <a href="#1">Set up AutoGluon</a>
2. <a href="#2">Read the dataset</a>
3. <a href="#3">Train a classifier with AutoGluon</a>
4. <a href="#4">Model evaluation</a>
5. <a href="#5">Clean up model artifacts</a>

__Austin Animal Center Dataset__:

In this exercise, we work with pet adoption data from the __Austin Animal Center__. Two datasets cover the intake and outcome of animals: the intake data is available [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and the outcome data is available [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238).

To work with a single table, we joined the intake and outcome tables on the "Animal ID" column and created a single __review.csv__ file. To keep the dataset simple, we excluded animals with multiple entries to the facility. If you want to see the original datasets and the merged data that includes multiple entries, they are available under the `DATA/review` folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv and Austin_Animal_Center_Intakes_Outcomes.csv.
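The join described above can be sketched with pandas. This is only an illustration on made-up rows, not the script that produced the review file: the toy values and the helper name `drop_repeat_visitors` are assumptions, and only the "Animal ID" key comes from the text.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome tables (all values are made up).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Stray"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Outcome Type": ["Adoption", "Return to Owner", "Adoption", "Transfer"],
})


def drop_repeat_visitors(table, key="Animal ID"):
    """Keep only animals that appear exactly once in the table (hypothetical helper)."""
    return table[~table.duplicated(subset=key, keep=False)]


single_intakes = drop_repeat_visitors(intakes)
single_outcomes = drop_repeat_visitors(outcomes)

# Inner join: keep animals that are present in both tables.
merged = single_intakes.merge(single_outcomes, on="Animal ID", how="inner")
print(merged)
```

Passing `keep=False` to `duplicated` marks every row of a repeated ID, so animals with several visits are removed entirely rather than reduced to their first visit.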

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome (0 = not placed, 1 = placed). This is the field to predict.
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Location where the pet was found before it entered the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of the pet when it entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of the pet when it entered the center
- __Breed__ - Breed of pet
- __Color__ - Color of pet
- __Age upon Intake Days__ - Age of the pet when it entered the center (days)
- __Age upon Outcome Days__ - Age of pet at outcome (days)

## 1. <a name="1">Set up AutoGluon</a>
(<a href="#0">Go to top</a>)

[AutoGluon](https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html) implements many of the best practices that we have discussed in this class, and more! In particular, it sets itself apart from other AutoML solutions with automated feature engineering that can handle text data and missing values without any hand-coded solutions (see their [paper](https://arxiv.org/abs/2003.06505) for details). It is too new to ship with an existing SageMaker kernel, so let's install it.

In [16]:
# !pip install --upgrade pip
# !pip install --upgrade mxnet autogluon

In [2]:
import sys
import warnings

# Silence warnings only when no warning options were set explicitly (e.g. python -W ...)
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    warnings.filterwarnings("ignore", category=DeprecationWarning)

## 2. <a name="2">Read the dataset</a>
(<a href="#0">Go to top</a>)

Let's read the dataset into a pandas DataFrame and split it into training and test sets (AutoGluon handles the validation split itself).

In [3]:
import pandas as pd

df = pd.read_csv('../../DATA/review/review_dataset.csv')

In [4]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.1, shuffle=True, random_state=23)
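The split above is a plain random split. If the label classes are imbalanced, one option is a stratified split, which keeps the label ratio the same in both parts. The sketch below shows this on a made-up toy frame; the `toy` data and variable names are assumptions, and stratification is a suggestion here, not something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced binary label: 80 ones and 20 zeros (made up).
toy = pd.DataFrame({
    "feature": range(100),
    "Outcome Type": [1] * 80 + [0] * 20,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.1,
    shuffle=True,
    random_state=23,
    stratify=toy["Outcome Type"],  # preserve the label ratio in both parts
)

# Share of positive labels in each part.
print(toy_train["Outcome Type"].mean(), toy_test["Outcome Type"].mean())
```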

## 3. <a name="3">Train a classifier with AutoGluon</a>
(<a href="#0">Go to top</a>)

We can run AutoGluon with a short snippet.

In [12]:
from autogluon.tabular import TabularPredictor

k = 1000  # use a small sample for a quick demo
# k = train_data.shape[0]  # use the whole training set

predictor = TabularPredictor(label='Outcome Type', eval_metric='f1').fit(train_data=train_data.sample(k))

No path specified. Models will be saved in: "AutogluonModels/ag-20210630_211347/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20210630_211347/"
AutoGluon Version:  0.2.0
Train Data Rows:    1000
Train Data Columns: 12
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [1.0, 0.0]
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    58531.39 MB
	Train Data (Original)  Memory Usage: 0.69 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtyp



	0.839	 = Validation f1 score
	1.32s	 = Training runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetMXNet ...
	0.8106	 = Validation f1 score
	15.23s	 = Training runtime
	0.24s	 = Validation runtime
Fitting model: LightGBMLarge ...
	0.8353	 = Validation f1 score
	1.3s	 = Training runtime
	0.02s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	0.8755	 = Validation f1 score
	1.71s	 = Training runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 33.99s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20210630_211347/")
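In the log above, `WeightedEnsemble_L2` reaches a higher validation F1 than the individual models listed. The general idea behind a weighted ensemble can be sketched as a weighted average of each model's predicted probabilities. This is a rough illustration only, not AutoGluon's implementation: the function name, probabilities, and weights below are all made up.

```python
def weighted_average(prob_lists, weights):
    """Blend per-model positive-class probabilities using normalized weights."""
    total = sum(weights)
    n_rows = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_rows)
    ]


# Hypothetical positive-class probabilities from three base models for four rows.
model_a = [0.9, 0.4, 0.2, 0.7]
model_b = [0.8, 0.6, 0.1, 0.4]
model_c = [0.7, 0.5, 0.3, 0.6]

blended = weighted_average([model_a, model_b, model_c], weights=[0.5, 0.3, 0.2])

# Turn blended probabilities into class labels with a 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in blended]
print(blended, labels)
```

On the second row, the models disagree (0.4, 0.6, 0.5), and the blend follows the model with the largest weight.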


## 4. <a name="4">Model evaluation</a>
(<a href="#0">Go to top</a>)

In [13]:
y_pred = predictor.predict(test_data)
predictor.evaluate_predictions(y_true=test_data['Outcome Type'], y_pred=y_pred, auxiliary_metrics=True)

Evaluation: f1 on test data: 0.8520463600144874
Evaluations on test data:
{
    "f1": 0.8520463600144874,
    "accuracy": 0.8288826055084302,
    "balanced_accuracy": 0.8227127195032097,
    "mcc": 0.650000066092912,
    "precision": 0.8361471476808245,
    "recall": 0.8685619346501754
}


{'f1': 0.8520463600144874,
 'accuracy': 0.8288826055084302,
 'balanced_accuracy': 0.8227127195032097,
 'mcc': 0.650000066092912,
 'precision': 0.8361471476808245,
 'recall': 0.8685619346501754}
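To make the reported metrics more concrete, here is a minimal sketch of how precision, recall, and F1 follow from true positives, false positives, and false negatives. The helper `binary_metrics` and the toy labels are assumptions for illustration. The last lines check that the reported F1 is the harmonic mean of the reported precision and recall.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (made up) to exercise the function.
y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 1, 1, 0, 1]
toy_precision, toy_recall, toy_f1 = binary_metrics(y_true_toy, y_pred_toy)
print(toy_precision, toy_recall, toy_f1)

# Consistency check using the precision and recall reported in the output above.
reported_precision = 0.8361471476808245
reported_recall = 0.8685619346501754
f1_from_report = (
    2 * reported_precision * reported_recall / (reported_precision + reported_recall)
)
print(f1_from_report)
```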

In [14]:
predictor.leaderboard(silent=True)


| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.875536 | 0.268644 | 7.285246 | 0.001821 | 1.713055 | 2 | True | 14 |
| 1 | CatBoost | 0.865801 | 0.021074 | 0.825659 | 0.021074 | 0.825659 | 1 | True | 7 |
| 2 | ExtraTreesEntr | 0.859574 | 0.113189 | 0.959247 | 0.113189 | 0.959247 | 1 | True | 9 |
| 3 | LightGBMXT | 0.855895 | 0.019708 | 1.735614 | 0.019708 | 1.735614 | 1 | True | 3 |
| 4 | RandomForestGini | 0.852321 | 0.112877 | 0.858639 | 0.112877 | 0.858639 | 1 | True | 5 |
| 5 | RandomForestEntr | 0.852321 | 0.112895 | 0.856853 | 0.112895 | 0.856853 | 1 | True | 6 |
| 6 | ExtraTreesGini | 0.848739 | 0.112864 | 0.959635 | 0.112864 | 0.959635 | 1 | True | 8 |
| 7 | LightGBM | 0.840708 | 0.019666 | 2.930432 | 0.019666 | 2.930432 | 1 | True | 4 |
| 8 | XGBoost | 0.838983 | 0.012448 | 1.319662 | 0.012448 | 1.319662 | 1 | True | 11 |
| 9 | LightGBMLarge | 0.835341 | 0.019737 | 1.304277 | 0.019737 | 1.304277 | 1 | True | 13 |


## 5. <a name="5">Clean up model artifacts</a>
(<a href="#0">Go to top</a>)

In [15]:
!rm -r AutogluonModels
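The shell command above assumes a Unix-like environment. A portable alternative, sketched here with the standard library, is `shutil.rmtree`. The helper name `remove_model_dir` is hypothetical, and the demo runs on a throwaway temporary directory rather than on real model artifacts.

```python
import shutil
import tempfile
from pathlib import Path


def remove_model_dir(path):
    """Delete a directory tree if it exists; return True when something was removed."""
    path = Path(path)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a temporary directory that mimics the AutoGluon output layout.
demo_root = Path(tempfile.mkdtemp())
demo_models = demo_root / "AutogluonModels" / "ag-demo"
demo_models.mkdir(parents=True)

removed = remove_model_dir(demo_root / "AutogluonModels")
removed_again = remove_model_dir(demo_root / "AutogluonModels")  # nothing left to delete
print(removed, removed_again)

# Remove the temporary root itself.
shutil.rmtree(demo_root)
```

In the notebook, the equivalent call would be `remove_model_dir("AutogluonModels")`. Checking that the directory exists first means that rerunning the cell does not raise an error.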

