## Final Project Day 4 Solution: AutoGluon TabularPrediction for a Classification Problem

Let's finally see how __AutoGluon TabularPrediction__ works to predict the __isPositive__ field of our final project dataset.

* We provide two pieces of code to read your training and test datasets.
* Using the notebooks from the class, implement the model, then train and test it on the corresponding datasets.
* Documentation for __AutoGluon TabularPrediction__ is available here: https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html

*Note: No need to incorporate the data preprocessing from previous days - __AutoGluon TabularPrediction__ handles all that!*

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Binary label for the review: 1 if positive, 0 otherwise (the prediction target)
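
As a quick illustration of two of the schema fields, the sketch below recomputes a `log_votes` value and decodes a `time` value using only the standard library. It assumes that `log` in the schema means the natural logarithm and that `time` is in seconds since the UNIX epoch; neither is stated explicitly above. The `example` row and both helper names are hypothetical.

```python
import math
from datetime import datetime, timezone


def log_votes(votes: int) -> float:
    """Return log(1 + votes); the natural log is an assumption."""
    return math.log1p(votes)


def review_datetime(unix_time: int) -> datetime:
    """Convert a UNIX timestamp (assumed to be in seconds) to a UTC datetime."""
    return datetime.fromtimestamp(unix_time, tz=timezone.utc)


# Hypothetical values, not taken from the project dataset
example = {"votes": 9, "time": 1_500_000_000}

print(log_votes(example["votes"]))                   # log(10), about 2.3026
print(review_datetime(example["time"]).isoformat())  # ISO 8601 date in UTC
```

Using `log1p` keeps reviews with zero votes well defined, since log(1 + 0) = 0.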

### 1. Set up the AutoGluon environment

In [1]:
!pip install --upgrade pip
!pip install --upgrade mxnet autogluon

import warnings
warnings.filterwarnings('ignore')

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (20.0.2)
Requirement already up-to-date: mxnet in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (1.5.1.post0)
Requirement already up-to-date: autogluon in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (0.0.5)
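
To confirm from Python which versions ended up installed, something like the following sketch may help. It relies on `importlib.metadata`, which requires Python 3.8 or later (the environment in the log above is Python 3.6, where this module is not available), and `installed_version` is a hypothetical helper name.

```python
from importlib import metadata  # standard library in Python 3.8+
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("mxnet", "autogluon"):
    print(name, installed_version(name))
```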


### 2. Use AutoGluon TabularPrediction to train a classifier

Load the raw (unpreprocessed) training and test datasets and train a classifier with __AutoGluon TabularPrediction__:

In [2]:
from autogluon import TabularPrediction as task

train_data = task.Dataset(file_path='../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')
test_data = task.Dataset(file_path='../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

# For speed, grab a small subset of the dataset
train_data = train_data.head(1000)

# Also for speed, change the default hyperparameters
# hyp = {'NN': {'num_epochs': 500}, 'GBM': {'num_boost_round': 10000}, 'CAT': {'iterations': 10000}, 'RF': {'n_estimators': 300}, 'XT': {'n_estimators': 300}, 'KNN': {}, 'custom': ['GBM']}
hyp = {'NN':{}, 'GBM':{}, 'CAT':{}, 'RF':{}, 'XT':{}}

# auto_stack=True lets AutoGluon choose num_bagging_folds and stack_ensemble_levels
# based on data properties. This can improve accuracy but may increase training time
# by up to 20x; set auto_stack = False for faster training.
auto_stack = True

predictor = task.fit(train_data=train_data, label='isPositive',
                     auto_stack=auto_stack, hyperparameters=hyp)
performance = predictor.evaluate(test_data)

Loaded data from: ../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv | Columns = 6 / 6 | Rows = 70000 -> 70000
Loaded data from: ../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv | Columns = 6 / 6 | Rows = 8000 -> 8000
No output_directory specified. Models will be saved in: AutogluonModels/ag-20200212_022149/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200212_022149/
Train Data Rows:    1000
Train Data Columns: 6
Preprocessing data ...
Here are the first 10 unique label values in your data:  [1. 0.]
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Feature Generator processed 1000 data points with 441 features
Original Features:
	object features: 2
	bool features: 1
	int features: 1
	float feature

Predictive performance on given dataset: accuracy = 0.8205
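
The accuracy reported above is the fraction of test rows whose predicted label matches the true label; on the 8000 test rows, 0.8205 corresponds to 6564 correct predictions. A minimal sketch of that computation follows, using small hypothetical label lists rather than the project data (`accuracy` is an illustrative helper, not an AutoGluon function).

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions match
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 0.8
```

Accuracy alone can be misleading when the classes are imbalanced, so it may be worth checking the share of positive labels in the test set before reading too much into a single number.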


In [9]:
# Deleting notebook artifacts
! rm -rf AutogluonModels
! rm -rf catboost_info
! rm -rf dask-worker-space
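
The shell commands above assume a Unix-like environment. A portable alternative, sketched here with the standard library, removes the same three directories if they exist; `remove_artifacts` and `ARTIFACT_DIRS` are hypothetical names introduced for illustration.

```python
import shutil
from pathlib import Path

# Directory names taken from the cleanup cell above
ARTIFACT_DIRS = ("AutogluonModels", "catboost_info", "dask-worker-space")


def remove_artifacts(base_dir=".", names=ARTIFACT_DIRS):
    """Delete each named directory under base_dir if present; return the names removed."""
    removed = []
    for name in names:
        path = Path(base_dir) / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    print(remove_artifacts())
```

Note that deleting `AutogluonModels` also deletes the saved predictor, so run this only once the trained models are no longer needed.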