## Final Project Day 4 Solution: AutoGluon TabularPrediction for a Classification Problem

Let's finally see how __AutoGluon TabularPrediction__ works to predict the __isPositive__ field of our final project dataset.

1. Set up the AutoGluon environment
2. Use AutoGluon TabularPrediction
    * Find more details on the AutoGluon TabularPrediction here: https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html
3. AutoGluon TabularPrediction performance analysis

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Binary sentiment label of the review (1 = positive, 0 = negative)
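
The `time` and `log_votes` fields are encoded values rather than raw ones. As a minimal sketch, assuming the logarithm is the natural log (the schema does not state the base) and using hypothetical helper names that are not part of the dataset tooling, the two encodings can be produced and decoded like this:

```python
import math
from datetime import datetime, timezone


def to_log_votes(votes):
    """Encode a raw vote count as log(1 + votes); natural log is an assumption."""
    return math.log1p(votes)


def from_log_votes(log_votes):
    """Invert the encoding to recover the (integer) raw vote count."""
    return round(math.expm1(log_votes))


def to_utc_datetime(unix_timestamp):
    """Convert a UNIX timestamp (seconds since 1970-01-01 UTC) to a datetime."""
    return datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)


# Hypothetical example values, not taken from the dataset
example_log_votes = to_log_votes(9)
example_votes = from_log_votes(example_log_votes)
example_date = to_utc_datetime(0)
```

Using `log1p` keeps reviews with zero votes well defined, since `log(1 + 0) = 0`.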

### 1. Set up the AutoGluon environment

In [5]:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: CC-BY-SA-4.0

# Set up the AutoGluon environment
# WARNING: this might take a couple of minutes the first time around!
!pip install --upgrade pip
!pip install --upgrade mxnet autogluon

import warnings
warnings.filterwarnings('ignore')

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (20.0.2)
Requirement already up-to-date: mxnet in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (1.6.0)
Requirement already up-to-date: autogluon in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (0.0.5)




### 2. Use AutoGluon TabularPrediction

Load the raw, unpreprocessed training and test datasets to train a classifier with __AutoGluon TabularPrediction__.

* Find more details on the AutoGluon TabularPrediction here: https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html

In [6]:
from autogluon import TabularPrediction as task

train_data = task.Dataset(file_path='../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')
test_data = task.Dataset(file_path='../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

# Train a classifier with AutoGluon TabularPrediction
predictor = task.fit(
    train_data=train_data.head(1000),  # For speed, train on only the first 1,000 rows
    label='isPositive',
    # Also for speed, restrict training to these model types, each with its default settings
    hyperparameters={'NN': {}, 'GBM': {}, 'CAT': {}, 'RF': {}, 'XT': {}},
    # Let AutoGluon select num_bagging_folds and stack_ensemble_levels based on data properties.
    # Note: this is off by default and can increase training time by up to 20x,
    # usually in exchange for better predictive performance.
    auto_stack=True,
)

# Evaluate the performance of the AutoGluon TabularPrediction classifier
performance = predictor.evaluate(test_data)


Loaded data from: ../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv | Columns = 6 / 6 | Rows = 70000 -> 70000
Loaded data from: ../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv | Columns = 6 / 6 | Rows = 8000 -> 8000
No output_directory specified. Models will be saved in: AutogluonModels/ag-20200222_015043/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200222_015043/
Train Data Rows:    1000
Train Data Columns: 6
Preprocessing data ...
Here are the first 10 unique label values in your data:  [1. 0.]
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Feature Generator processed 1000 data points with 441 features
Original Features:
	object features: 2
	bool features: 1
	int features: 1
	float feature

Predictive performance on given dataset: accuracy = 0.8185
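
The fit call above trains on `train_data.head(1000)`, which takes the first 1,000 of the 70,000 training rows. If the CSV happens to be ordered (for example by time or by label), those first rows may not be representative. Here is a minimal sketch of one alternative: drawing a reproducible random subset of row positions with the standard library. The helper name `sample_row_positions` and the seed are illustrative assumptions, not part of the notebook or of AutoGluon.

```python
import random


def sample_row_positions(n_rows, n_sample, seed=0):
    """Return sorted, unique row positions for a reproducible random subset."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    n_sample = min(n_sample, n_rows)  # never ask for more rows than exist
    return sorted(rng.sample(range(n_rows), n_sample))


# 70,000 is the training row count reported in the load log above
positions = sample_row_positions(n_rows=70000, n_sample=1000, seed=0)
```

Assuming `task.Dataset` supports pandas-style positional indexing (an assumption here), `train_data.iloc[positions]` could be passed in place of `train_data.head(1000)`.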


In [7]:
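# True labels from the test set
y_test = test_data['isPositive']

# Predict on the test set, then compare the predictions against the true labels.
# auxiliary_metrics=True reports additional metrics and a per-class report besides accuracy.
y_pred = predictor.predict(test_data)
performance = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)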


Evaluation: accuracy on test data: 0.818500
Evaluations on test data:
{
    "accuracy": 0.8185,
    "accuracy_score": 0.8185,
    "balanced_accuracy_score": 0.7993510465703875,
    "matthews_corrcoef": 0.609088403672891,
    "f1_score": 0.8184999999999999
}
Detailed (per-class) classification report:
{
    "0.0": {
        "precision": 0.7812051649928264,
        "recall": 0.7211920529801324,
        "f1-score": 0.75,
        "support": 3020
    },
    "1.0": {
        "precision": 0.8384497313891021,
        "recall": 0.8775100401606426,
        "f1-score": 0.857535321821036,
        "support": 4980
    },
    "accuracy": 0.8185,
    "macro avg": {
        "precision": 0.8098274481909642,
        "recall": 0.7993510465703875,
        "f1-score": 0.803767660910518,
        "support": 8000
    },
    "weighted avg": {
        "precision": 0.8168399075745081,
        "recall": 0.8185,
        "f1-score": 0.8169407378335949,
        "support": 8000
    }
}
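
The figures in the report above are all functions of the same four confusion-matrix counts. As a rough sketch (using only the per-class `support` and `recall` values printed above, and treating class 1 as the positive class), the counts can be recovered and the headline metrics recomputed by hand. The majority-class baseline at the end is an added point of comparison, not something the notebook reports.

```python
import math

# Per-class support and recall, copied from the classification report above
support = {0: 3020, 1: 4980}
recall = {0: 0.7211920529801324, 1: 0.8775100401606426}

# Correctly classified rows per class = support * recall (rounded to undo float error)
tn = round(support[0] * recall[0])  # negative reviews predicted negative
tp = round(support[1] * recall[1])  # positive reviews predicted positive
fp = support[0] - tn                # negative reviews predicted positive
fn = support[1] - tp                # positive reviews predicted negative
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
balanced_accuracy = (recall[0] + recall[1]) / 2  # mean of per-class recall
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)
f1_pos = 2 * tp / (2 * tp + fp + fn)
f1_neg = 2 * tn / (2 * tn + fn + fp)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Accuracy of always predicting the more frequent class (added for comparison)
majority_baseline = max(support.values()) / total

print(f"accuracy={accuracy:.4f}  balanced={balanced_accuracy:.4f}  mcc={mcc:.4f}")
print(f"majority-class baseline={majority_baseline:.4f}")
```

The recomputed values agree with the report. Note also that the top-level `f1_score` of 0.8185 equals the accuracy rather than either per-class F1, which is what a micro-averaged F1 would give; the output itself does not state the averaging mode.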


In [8]:
# Deleting notebook artifacts
! rm -rf AutogluonModels
! rm -rf catboost_info
! rm -rf dask-worker-space