## AutoGluon TabularPrediction for a Regression Problem 

In this notebook, we use __AutoGluon TabularPrediction__ to predict the __log_votes__ field of our review dataset.

1. Set up the AutoGluon environment
2. Use AutoGluon TabularPrediction
    * Find more details on the __AutoGluon TabularPrediction__ here: https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html
3. AutoGluon TabularPrediction performance analysis

With a single __fit()__ call, __AutoGluon TabularPrediction__ can produce an accurate model that predicts the values in the __log_votes__ column of our data table from the values in the remaining columns.

__AutoGluon__ with tabular data works for both classification and regression problems. Moreover, we do not need to specify the kind of problem, as this is automatically inferred from the data, and the appropriate performance metric is reported (by default, RMSE for regression and accuracy for classification).
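
To make the idea of inferring the problem type concrete, here is a minimal sketch of such a heuristic. It is only an illustration: the function name `infer_problem_type` and its rules are assumptions made for this example, and AutoGluon's actual logic considers more cases (for instance, integer labels with many distinct values).

```python
def infer_problem_type(labels):
    """Guess the problem type from label values (simplified, hypothetical heuristic)."""
    values = [v for v in labels if v is not None]
    # Labels with a fractional part cannot be class ids, so treat them as regression targets.
    has_fractional = any(isinstance(v, float) and not v.is_integer() for v in values)
    if has_fractional:
        return "regression"
    # Otherwise decide between binary and multiclass by counting distinct labels.
    n_unique = len(set(values))
    return "binary" if n_unique == 2 else "multiclass"


# The log_votes values are non-integer floats, so this sketch reports regression.
print(infer_problem_type([0.0, 1.38629436, 1.09861229]))
print(infer_problem_type([True, False, True]))
```

If the inferred type is wrong, the training log further down notes that a `problem_type` argument can be passed to `fit()` instead.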

__AutoGluon__ also automatically decides which variables should be represented as integers, which variables should be represented as categorical objects, and handles common issues like missing data and rescaling feature values.

Rather than fitting a single model, __AutoGluon__ trains multiple models and ensembles them to improve predictive performance. Each type of model has its own hyperparameters, which the user would traditionally have to specify. __AutoGluon__ automates this process, including cross-validation, so there is no need to provide separate validation data.
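
The reason no separate validation set is needed can be sketched with plain k-fold splitting: every row is held out exactly once, so each model is always scored on data it did not see during training. The helper `kfold_indices` below is a hypothetical illustration written for this notebook, not AutoGluon's implementation (which, among other things, shuffles the rows first).

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, val_idx) pairs; each row lands in exactly one validation fold."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


# 1000 training rows and 10 folds, matching the training run later in this notebook.
folds = list(kfold_indices(n_rows=1000, n_folds=10))
for train_idx, val_idx in folds[:2]:
    print(len(train_idx), "training rows,", len(val_idx), "validation rows")
```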

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __rating:__ Rating of the review
* __log_votes:__ Log-transformed vote count, log(1 + votes)
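
As a small worked example of the __log_votes__ transform, the sketch below converts between raw vote counts and log(1 + votes) using only the standard library. The helper names `votes_to_log` and `log_to_votes` are made up for this illustration.

```python
import math


def votes_to_log(votes):
    """Forward transform: log(1 + votes)."""
    return math.log1p(votes)


def log_to_votes(log_votes):
    """Inverse transform: exp(log_votes) - 1."""
    return math.expm1(log_votes)


# A review with 3 votes is stored as log(4); a review with no votes is stored as 0.
print(votes_to_log(3))
print(votes_to_log(0))
# Predictions can be mapped back to an approximate vote count with the inverse.
print(round(log_to_votes(1.38629436)))
```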


### 1. Set up the AutoGluon environment

In [1]:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: CC-BY-SA-4.0

# Set up the AutoGluon environment
# WARNING: this might take a couple of minutes the first time around!
!pip install --upgrade pip
!pip install --upgrade mxnet autogluon

import warnings
warnings.filterwarnings('ignore')


Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (20.0.2)
Requirement already up-to-date: mxnet in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (1.6.0)
Requirement already up-to-date: autogluon in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (0.0.5)


### 2. Use AutoGluon TabularPrediction

#### 2.1 Reading and getting the dataset in AutoGluon TabularPrediction friendly format

We first use the __pandas__ library to read our raw, unpreprocessed __review_dataset__ and split it into training and test datasets for modeling with __AutoGluon TabularPrediction__:

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('../../DATA/NLP/EMBK-NLP-REVIEW-DATA-CSV.csv')

X_train, X_test, y_train, y_test = train_test_split(df.drop("log_votes", axis =1), df["log_votes"],
                                                  test_size=0.10,  # 10% test, 90% training
                                                  shuffle=True # Shuffle the whole dataset
                                                 )

pd.concat([X_train, y_train], axis = 1).to_csv('review_dataset_AG_training.csv', index=False)
pd.concat([X_test, y_test], axis = 1).to_csv('review_dataset_AG_test.csv', index=False)
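
The split above is random, so the two CSV files differ from run to run unless a seed is fixed. As a rough, dependency-free sketch of what a seeded 90/10 shuffle split does, consider the hypothetical helper `split_indices` below. It is not scikit-learn's implementation, and the row count of 55000 is taken from the load messages printed later in this notebook.

```python
import random


def split_indices(n_rows, test_size=0.10, seed=0):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # a fixed seed makes the split reproducible
    n_test = round(n_rows * test_size)
    return idx[n_test:], idx[:n_test]


train_idx, test_idx = split_indices(55000)
print(len(train_idx), len(test_idx))
```

With scikit-learn, the same reproducibility is obtained by passing `random_state` to `train_test_split`.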


#### 2.2 Use AutoGluon TabularPrediction to train and evaluate a regressor 

Load the raw unpreprocessed training and test datasets to train a regressor with __AutoGluon TabularPrediction__.

* Find more details on the AutoGluon TabularPrediction here: https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html

In [None]:
from autogluon import TabularPrediction as task

train_data = task.Dataset(file_path='review_dataset_AG_training.csv')
test_data = task.Dataset(file_path='review_dataset_AG_test.csv')

# Train a regressor with AutoGluon TabularPrediction
predictor = task.fit(train_data = train_data.head(1000), # For speed, grab a small subset of the dataset
                     label = 'log_votes', 
                     eval_metric = 'mean_squared_error', # report MSE; AutoGluon's default metric for regression is RMSE
                     hyperparameters = {'NN':{}, 'GBM':{}, 'CAT':{}, 'RF':{}, 'XT':{}}, # Also for speed, change the default hyperparameters
                     auto_stack = True # Let AutoGluon choose num_bagging_folds and stack_ensemble_levels from the data; this usually improves accuracy but can increase training time by up to 20x (the default is False)
                    )

# Evaluate the performance of the AutoGluon TabularPrediction regressor
performance = predictor.evaluate(test_data)


Loaded data from: review_dataset_AG_training.csv | Columns = 6 / 6 | Rows = 49500 -> 49500
Loaded data from: review_dataset_AG_test.csv | Columns = 6 / 6 | Rows = 5500 -> 5500
No output_directory specified. Models will be saved in: AutogluonModels/ag-20200222_010208/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200222_010208/
Train Data Rows:    1000
Train Data Columns: 6
Preprocessing data ...
Here are the first 10 unique label values in your data:  [0.         1.38629436 1.09861229 2.39789527 1.79175947 2.30258509
 3.36729583 1.60943791 2.94443898 3.09104245]
AutoGluon infers your prediction problem is: regression  (because dtype of label-column == float and label-values can't be converted to int)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Feature Generator processed 1000 data points with 491 

Predictive performance on given dataset: mean_squared_error = 0.6127495554620974


### 3. AutoGluon TabularPrediction performance analysis

Let's now examine the performance of our trained model in more detail with __predictor.evaluate_predictions()__:

In [None]:
import pandas as pd 
y_test = test_data['log_votes']

y_pred = predictor.predict(test_data)
performance = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


Evaluation: mean_squared_error on test data: 0.612750
Evaluations on test data:
{
    "mean_squared_error": 0.6127495554620974,
    "mean_absolute_error": 0.5433151720814781,
    "explained_variance_score": 0.33883902399920174,
    "r2_score": 0.3379975279407289,
    "pearson_correlation": 0.585042190297682,
    "median_absolute_error": 0.3566230097804107
}
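
To show what some of these numbers mean, the sketch below recomputes a few of the metrics by hand on a tiny made-up example, and converts the reported test MSE into an RMSE, which is in the same units as __log_votes__. The function `regression_metrics` and the toy values are assumptions for illustration only; they are not part of AutoGluon.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "r2_score": 1.0 - ss_res / ss_tot,
    }


# Toy data, not taken from the review dataset.
toy = regression_metrics([0.0, 1.0, 2.0, 3.0], [0.5, 1.0, 1.5, 3.0])
print(toy)

# RMSE corresponding to the test MSE reported above (roughly 0.78).
reported_rmse = math.sqrt(0.6127495554620974)
print(reported_rmse)
```

An r2_score of about 0.34 means the model explains roughly a third of the variance in __log_votes__ on the test set.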


Besides inference, the predictor object returned by __fit()__ can also be used to view a summary of what happened during fit:

In [None]:
results = predictor.fit_summary()


*** Summary of fit() ***
Number of models trained: 12
Types of models trained: 
{'WeightedEnsembleModel', 'StackerEnsembleModel'}
Validation performance of individual models: {'RandomForestRegressorMSE_STACKER_l0': -0.7656234494471702, 'ExtraTreesRegressorMSE_STACKER_l0': -0.8063705521687946, 'LightGBMRegressor_STACKER_l0': -0.7533709461365505, 'CatboostRegressor_STACKER_l0': -0.7083271472251684, 'NeuralNetRegressor_STACKER_l0': -0.9645390056495208, 'weighted_ensemble_k0_l1': -0.7064099322447895, 'RandomForestRegressorMSE_STACKER_l1': -0.7835650760576615, 'ExtraTreesRegressorMSE_STACKER_l1': -0.798081293003763, 'LightGBMRegressor_STACKER_l1': -0.7992987246969606, 'CatboostRegressor_STACKER_l1': -0.7294328660768603, 'NeuralNetRegressor_STACKER_l1': -0.9461263666210408, 'weighted_ensemble_k0_l2': -0.7280060365858667}
Best model (based on validation performance): weighted_ensemble_k0_l1
Hyperparameter-tuning used: False
Bagging used: True  (with 10 folds)
Stack-ensembling used: True  (wit
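
The validation scores in this summary are negative even though the metric is a squared error. A plausible reading, treated here as an assumption, is that the scores are reported as negated MSE so that higher is always better; this is consistent with the best model being the one whose score is closest to zero. The sketch below copies the scores from the output above and ranks the models under that reading.

```python
# Validation scores copied from the fit summary above (assumed to be negated MSE).
validation_scores = {
    "RandomForestRegressorMSE_STACKER_l0": -0.7656234494471702,
    "ExtraTreesRegressorMSE_STACKER_l0": -0.8063705521687946,
    "LightGBMRegressor_STACKER_l0": -0.7533709461365505,
    "CatboostRegressor_STACKER_l0": -0.7083271472251684,
    "NeuralNetRegressor_STACKER_l0": -0.9645390056495208,
    "weighted_ensemble_k0_l1": -0.7064099322447895,
    "RandomForestRegressorMSE_STACKER_l1": -0.7835650760576615,
    "ExtraTreesRegressorMSE_STACKER_l1": -0.798081293003763,
    "LightGBMRegressor_STACKER_l1": -0.7992987246969606,
    "CatboostRegressor_STACKER_l1": -0.7294328660768603,
    "NeuralNetRegressor_STACKER_l1": -0.9461263666210408,
    "weighted_ensemble_k0_l2": -0.7280060365858667,
}

# Highest score wins when scores follow the higher-is-better convention.
best_model = max(validation_scores, key=validation_scores.get)

# Flip the sign to recover validation MSE and sort from best to worst.
validation_mse = {name: -score for name, score in validation_scores.items()}
ranking = sorted(validation_mse.items(), key=lambda item: item[1])

print("Best model:", best_model)
for name, mse in ranking:
    print(f"{mse:.4f}  {name}")
```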

From this summary, we can see that __AutoGluon__ trained several different types of models, as well as ensembles of the best-performing ones. The summary also lists the models that were trained during fit and how well each performed on the held-out validation data. We can also view the properties that __AutoGluon__ automatically inferred about our prediction task, along with more details on feature preprocessing:

In [None]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon categorized the features as: ", predictor.feature_types)


AutoGluon infers problem type is:  regression
AutoGluon categorized the features as:  {'nlp': ['reviewText', 'summary'], 'vectorizers': ['__nlp__.10', '__nlp__.able', '__nlp__.able to', '__nlp__.about', '__nlp__.actually', '__nlp__.add', '__nlp__.after', '__nlp__.again', '__nlp__.all', '__nlp__.all of', '__nlp__.all the', '__nlp__.almost', '__nlp__.already', '__nlp__.also', '__nlp__.always', '__nlp__.am', '__nlp__.amazon', '__nlp__.an', '__nlp__.and', '__nlp__.and have', '__nlp__.and it', '__nlp__.and the', '__nlp__.another', '__nlp__.any', '__nlp__.anything', '__nlp__.are', '__nlp__.around', '__nlp__.as', '__nlp__.as well', '__nlp__.at', '__nlp__.at all', '__nlp__.at least', '__nlp__.at the', '__nlp__.available', '__nlp__.away', '__nlp__.back', '__nlp__.back to', '__nlp__.bad', '__nlp__.be', '__nlp__.because', '__nlp__.been', '__nlp__.been using', '__nlp__.before', '__nlp__.being', '__nlp__.best', '__nlp__.better', '__nlp__.bit', '__nlp__.both', '__nlp__.bought', '__nlp__.business', '

In [None]:
# Deleting notebook artifacts
! rm review_dataset_AG_training.csv
! rm review_dataset_AG_test.csv
! rm -rf AutogluonModels
! rm -rf catboost_info
! rm -rf dask-worker-space