## AutoGluon TabularPrediction for a Regression Problem 

In this notebook, we use __AutoGluon TabularPrediction__ on our regression problem to predict the __log_votes__ field of our review dataset.

* Find more details on AutoGluon TabularPrediction here: https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html

With a single __fit()__ call, __AutoGluon TabularPrediction__ can produce a highly accurate model that predicts the values in the __log_votes__ column of our data table from the values in the remaining columns.

__AutoGluon__ with tabular data works for both classification and regression problems. Moreover, we do not need to specify the kind of problem, as this is automatically inferred from the data, and the appropriate performance metric is reported (by default, RMSE for regression and accuracy for classification).
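
As a rough illustration of this inference step, the sketch below guesses a problem type from the label values and looks up the default metric mentioned above. The function `infer_problem_type`, the `DEFAULT_METRIC` table, and the rule itself are hypothetical simplifications written for this notebook; they are not AutoGluon's actual implementation.

```python
# Hypothetical, simplified heuristic; AutoGluon's real logic is more involved.
DEFAULT_METRIC = {
    "regression": "root_mean_squared_error",
    "binary": "accuracy",
    "multiclass": "accuracy",
}


def infer_problem_type(labels):
    """Guess the problem type from an iterable of label values."""
    unique_values = set(labels)
    # Non-integer floats (such as log_votes = 1.79...) suggest a continuous target.
    if any(isinstance(v, float) and not v.is_integer() for v in unique_values):
        return "regression"
    # Exactly two distinct labels suggest binary classification.
    if len(unique_values) == 2:
        return "binary"
    return "multiclass"


log_votes_sample = [0.0, 1.79175947, 2.39789527, 3.21887582]
problem_type = infer_problem_type(log_votes_sample)
print(problem_type, DEFAULT_METRIC[problem_type])
```

If the guess is wrong, the training log shown later in this notebook suggests passing `problem_type` to `fit()` explicitly.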

__AutoGluon__ also automatically decides which variables to represent as integers and which to represent as categorical objects, and it handles common issues such as missing data and feature rescaling.

Rather than just a single model, __AutoGluon__ trains multiple models and ensembles them together to improve predictive performance. Each type of model has various hyperparameters that the user would traditionally have to specify. __AutoGluon__ automates this process, including cross-validation, so there is no need to specify separate validation data.
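
The idea of cross-validated bagging can be sketched in plain Python. This is a minimal illustration under simplifying assumptions (one feature, a least-squares line as the only model type, made-up data); the helper names `fit_line`, `kfold_bagging`, and `bagged_predict` are hypothetical and do not correspond to AutoGluon functions.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx


def kfold_bagging(xs, ys, k=5, seed=0):
    """Train one model per fold and collect out-of-fold predictions."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    models, oof = [], [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        models.append(model)
        # Each row is predicted only by the model that never saw it.
        for i in held_out:
            oof[i] = model[0] * xs[i] + model[1]
    return models, oof


def bagged_predict(models, x):
    """Average the predictions of all fold models."""
    return mean(slope * x + intercept for slope, intercept in models)


xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]  # toy target, made up for illustration
models, oof = kfold_bagging(xs, ys, k=5)
print(len(models), bagged_predict(models, 10.0))
```

Because every row receives a prediction from a model that was not trained on it, the out-of-fold predictions can play the role of a validation set, which is why no separate validation data has to be supplied.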

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __rating:__ Rating of the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
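
The `log_votes` transform and its inverse can be sketched with the standard library. The helper names below are illustrative only, and the natural logarithm is assumed; that assumption is consistent with label values such as 1.79175947 = log(6) that appear in the training output later in this notebook.

```python
import math


def votes_to_log_votes(votes):
    """log(1 + votes), computed accurately even for small counts."""
    return math.log1p(votes)


def log_votes_to_votes(log_votes):
    """Inverse transform: exp(log_votes) - 1."""
    return math.expm1(log_votes)


print(votes_to_log_votes(0))           # zero votes map to zero
print(votes_to_log_votes(5))           # log(6)
print(log_votes_to_votes(2.39789527))  # close to 10 votes
```

Predictions made on the log scale can be mapped back to an approximate vote count with the inverse transform.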


### 1. Set up the AutoGluon environment

In [1]:
!pip install --upgrade pip
!pip install --upgrade mxnet autogluon

import warnings
warnings.filterwarnings('ignore')


Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (20.0.2)
Requirement already up-to-date: mxnet in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (1.5.1.post0)
Requirement already up-to-date: autogluon in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (0.0.5)


### 2. AutoGluon TabularPrediction on raw unprocessed datasets

#### 2.1 Reading and getting the dataset in AutoGluon TabularPrediction friendly format

We first use the __pandas__ library to read our raw, unprocessed __review_dataset__ and split it into training and test datasets for modeling with __AutoGluon TabularPrediction__:

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('../../DATA/NLP/EMBK-NLP-REVIEW-DATA-CSV.csv')

X_train, X_test, y_train, y_test = train_test_split(df.drop("log_votes", axis=1), df["log_votes"],
                                                  test_size=0.10,  # 10% test, 90% training
                                                  shuffle=True  # Shuffle the whole dataset
                                                 )

pd.concat([X_train, y_train], axis = 1).to_csv('review_dataset_AG_training.csv', index=False)
pd.concat([X_test, y_test], axis = 1).to_csv('review_dataset_AG_test.csv', index=False)


#### 2.2 Use AutoGluon TabularPrediction to train a regressor to predict the log_votes field

Load the raw, unprocessed training and test datasets and train a regressor with __AutoGluon TabularPrediction__:

In [3]:
from autogluon import TabularPrediction as task

train_data = task.Dataset(file_path='review_dataset_AG_training.csv')
test_data = task.Dataset(file_path='review_dataset_AG_test.csv')

# For speed, grab a small subset of the dataset
train_data = train_data.head(1000)

# Also for speed, change the default hyperparameters
# hyp = {'NN': {'num_epochs': 500}, 'GBM': {'num_boost_round': 10000}, 'CAT': {'iterations': 10000}, 'RF': {'n_estimators': 300}, 'XT': {'n_estimators': 300}, 'KNN': {}, 'custom': ['GBM']}
hyp = {'NN':{}, 'GBM':{}, 'CAT':{}, 'RF':{}, 'XT':{}}

# auto_stack=True lets AutoGluon choose num_bagging_folds and stack_ensemble_levels based on data properties.
# This usually improves accuracy but can increase training time by up to 20x; set it to False to train faster.
auto_stack = True

predictor = task.fit(train_data = train_data, label = 'log_votes', 
                     eval_metric = 'r2', auto_stack = auto_stack, hyperparameters = hyp)
y_pred = predictor.predict(test_data)  # Predictions on the held-out test set

Loaded data from: review_dataset_AG_training.csv | Columns = 6 / 6 | Rows = 49500 -> 49500
Loaded data from: review_dataset_AG_test.csv | Columns = 6 / 6 | Rows = 5500 -> 5500
No output_directory specified. Models will be saved in: AutogluonModels/ag-20200212_023241/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200212_023241/
Train Data Rows:    1000
Train Data Columns: 6
Preprocessing data ...
Here are the first 10 unique label values in your data:  [0.         1.79175947 2.39789527 3.21887582 2.30258509 2.56494936
 2.89037176 1.09861229 4.18965474 2.63905733]
AutoGluon infers your prediction problem is: regression  (because dtype of label-column == float and label-values can't be converted to int)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Feature Generator processed 1000 data points with 534 features
Original Features:
	object featur

#### 2.3 Evaluate performance with TabularPrediction

Let's now use our trained models to make predictions on the test dataset, and then evaluate performance:

In [4]:
y_pred = predictor.predict(test_data)
# print("Predictions:  ", y_pred)

y_test = test_data['log_votes']
performance = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


Evaluation: r2 on test data: 0.360751
Evaluations on test data:
{
    "r2": 0.3607512139913043,
    "mean_absolute_error": 0.5380760051322063,
    "explained_variance_score": 0.36089642865659965,
    "r2_score": 0.3607512139913043,
    "pearson_correlation": 0.6023670360692323,
    "mean_squared_error": 0.6144549193733427,
    "median_absolute_error": 0.3202014741712549
}
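
To make these numbers easier to interpret, the following sketch computes a few of the reported metrics from their textbook definitions in plain Python. It is an illustration only, not the code AutoGluon runs internally, and the toy values are made up.

```python
import math


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration.
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.1, 1.9, 3.2, 3.8]

mse = mean_squared_error(y_true_toy, y_pred_toy)
print("MSE :", mse)
print("RMSE:", math.sqrt(mse))
print("MAE :", mean_absolute_error(y_true_toy, y_pred_toy))
print("R2  :", r2_score(y_true_toy, y_pred_toy))
```

Applied to the results above, the reported `mean_squared_error` of about 0.614 corresponds to an RMSE of about 0.78 on the `log_votes` scale, and an `r2` of about 0.36 means the model explains roughly 36% of the variance in `log_votes` on the test set.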


Besides inference, the predictor object returned by __fit()__ can also be used to view a summary of what happened during fit.

In [5]:
results = predictor.fit_summary()


*** Summary of fit() ***
Number of models trained: 12
Types of models trained: 
{'WeightedEnsembleModel', 'StackerEnsembleModel'}
Validation performance of individual models: {'RandomForestRegressorMSE_STACKER_l0': 0.3803287967865676, 'ExtraTreesRegressorMSE_STACKER_l0': 0.3441682556111162, 'LightGBMRegressor_STACKER_l0': 0.3749271266131676, 'CatboostRegressor_STACKER_l0': 0.3970470855490632, 'NeuralNetRegressor_STACKER_l0': 0.12756396725219166, 'weighted_ensemble_k0_l1': 0.405317400521376, 'RandomForestRegressorMSE_STACKER_l1': 0.3615558995393131, 'ExtraTreesRegressorMSE_STACKER_l1': 0.3504722707483374, 'LightGBMRegressor_STACKER_l1': 0.33409715227542347, 'CatboostRegressor_STACKER_l1': 0.3849396903314025, 'NeuralNetRegressor_STACKER_l1': 0.14246165414008327, 'weighted_ensemble_k0_l2': 0.3866349523288962}
Best model (based on validation performance): weighted_ensemble_k0_l1
Hyperparameter-tuning used: False
Bagging used: True  (with 10 folds)
Stack-ensembling used: True  (with 1 level

From this summary, we can see that __AutoGluon__ trained many different types of models as well as an ensemble of the best-performing models. The summary also lists the models that were trained during fit and how well each one performed on the held-out validation data. We can also view the properties that __AutoGluon__ automatically inferred about our prediction task, along with more details on feature preprocessing:

In [6]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon categorized the features as: ", predictor.feature_types)


AutoGluon infers problem type is:  regression
AutoGluon categorized the features as:  {'nlp': ['reviewText', 'summary'], 'vectorizers': ['__nlp__.10', '__nlp__.30', '__nlp__.able', '__nlp__.able to', '__nlp__.about', '__nlp__.access', '__nlp__.account', '__nlp__.actually', '__nlp__.add', '__nlp__.after', '__nlp__.again', '__nlp__.all', '__nlp__.all of', '__nlp__.all the', '__nlp__.almost', '__nlp__.already', '__nlp__.also', '__nlp__.although', '__nlp__.always', '__nlp__.am', '__nlp__.amazon', '__nlp__.an', '__nlp__.and', '__nlp__.and easy', '__nlp__.and have', '__nlp__.and it', '__nlp__.and the', '__nlp__.and then', '__nlp__.and was', '__nlp__.another', '__nlp__.any', '__nlp__.anyone', '__nlp__.are', '__nlp__.around', '__nlp__.as', '__nlp__.at', '__nlp__.at all', '__nlp__.at the', '__nlp__.available', '__nlp__.back', '__nlp__.back to', '__nlp__.bad', '__nlp__.be', '__nlp__.because', '__nlp__.been', '__nlp__.been using', '__nlp__.before', '__nlp__.being', '__nlp__.best', '__nlp__.better
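
Feature names such as `__nlp__.able to` and `__nlp__.all of` suggest that the text columns were turned into word n-gram count features. The sketch below shows the general idea in plain Python. The `ngram_counts` helper, the whitespace tokenization, and the limit of two words per n-gram are assumptions made for illustration; the vectorizer AutoGluon actually uses, and its settings, are not shown in this notebook.

```python
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count word n-grams (1..max_n) and name them like the features above."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts["__nlp__." + " ".join(tokens[i:i + n])] += 1
    return counts


# Made-up review text for illustration.
features = ngram_counts("I was able to add all of the apps")
print(features["__nlp__.able to"], features["__nlp__.all of"])
```

A real vectorizer would typically also restrict the vocabulary to the most frequent n-grams across the whole training set, which would explain why only common words and phrases appear in the feature list.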

In [7]:
# Deleting notebook artifacts
! rm review_dataset_AG_training.csv
! rm review_dataset_AG_test.csv
! rm -rf AutogluonModels
! rm -rf catboost_info
! rm -rf dask-worker-space