# <a name="0">Automated Machine Learning (AutoML) with AutoGluon - Demo</a>


[__AutoGluon__](https://auto.gluon.ai/stable/index.html#) automates machine learning tasks and enables you to easily achieve strong predictive performance in your applications with just a few lines of code. 

This notebook shows how to use AutoGluon Tabular to solve a __multiclass classification task__. The metric we use to evaluate the performance of the model is accuracy.

1. <a href="#1">Business Problem and ML Problem Description</a>
2. <a href="#2">Installing AutoGluon</a>
3. <a href="#3">Loading the Data</a>
4. <a href="#4">Sampling Data</a>
5. <a href="#5">Model Training with AutoGluon (smaller train dataset)</a>
6. <a href="#6">AutoGluon Training Results</a>
7. <a href="#7">Model Prediction with AutoGluon</a>
8. <a href="#8">Re-Train (with full train data) and predict again</a>
9. <a href="#9">Before You Go (clean up model artifacts)</a>


## 1. <a name="1">Business Problem and ML Problem Description</a>
(<a href="#0">Go to top</a>)

__Business Problem:__ Products from the Amazon Product Catalog cannot be listed for sale because they are missing a required piece of information: the Unit Of Measure (count, volume, weight).

__ML Problem Description:__ Predict the Unit Of Measure (count, volume, weight) Identification (UOMI) for a product from the Amazon Product Catalog. This is a __multiclass classification__ task (3 distinct classes: count, volume, weight). The dataset for this ML problem has 33 feature columns and 1 label column. Below are some examples of the features included in the dataset:


| Feature | Description |
| :---        |    :----  |
| marketplace_id | Marketplace ID.|
| product_type   | Type of product.  |
| item_name | Short item description. |
| product_description   | Long item description.  |
| bullet_point | Bullet point item description. |
| brand   | Brand name.  |
| manufacturer | Manufacturer name. |
| ...   | ...  |
| list_price_value_with_tax   | Price of item including tax.  |
| imgID | ID for image of product. |
| ID   | Product identifier.  |
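
The accuracy metric named in the introduction is the share of predictions that match the true label. A minimal sketch of that calculation follows; the helper name `accuracy` and the toy labels are made up for illustration and are not taken from the UOMI dataset.

```python
from typing import Sequence


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Return the fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (illustrative only): 4 of the 5 predictions are correct
toy_true = [0, 1, 2, 2, 1]
toy_pred = [0, 2, 2, 2, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
print(f"Accuracy: {toy_accuracy:.2f}")
```

For a 3-class problem such as this one, accuracy treats all classes equally, so it is most informative when the classes are reasonably balanced.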

___
## 2. <a name="2">Installing AutoGluon</a>
(<a href="#0">Go to top</a>)

In [1]:
%%capture
!pip install -q autogluon

Now we load the libraries needed to work with our tabular dataset.

In [2]:
# Load in AutoGluon
from autogluon.tabular import TabularPredictor, TabularDataset

# Load in libraries
import pandas as pd

___
## 3. <a name="3">Loading the Data</a>
(<a href="#0">Go to top</a>)

Let's load the datasets and look at a few data samples.

In [3]:
# Load train and test data splits from csv files into TabularDatasets
train = TabularDataset("../../data/uomi-train.csv")
test = TabularDataset("../../data/uomi-test.csv")

In [4]:
# Print size of train set
print(f"Size of training set: {len(train)}")

# Show the first rows of train data
train.head(2)

Size of training set: 28305


|  | ID | marketplace_id | label | product_type | item_name | product_description | bullet_point | brand | manufacturer | part_number | ... | item_dimensions_height | item_dimensions_width | item_dimensions_length | normalized_item_weight | normalized_item_package_weight | list_price_currency | list_price_value | list_price_value_with_tax | imgID | ID_0 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1633 | 1 | 1 | GROCERY | JELL-O Play Ocean Build + Eat Kit, 6 oz Box |  | One 6 oz. JELL-O Play Ocean Build + Eat Kit | Jell-O Play | Jell-o | 4300008150.0 | ... | 2.625 | 6.625 | 8.5 | 0.023438 | 0.500449 | USD | 3.99 |  | 51sislDjTYL | 9cd726a519754b6bad27be39bc95cac6 |
| 1 | 18103 | 1 | 2 | GROCERY | Crystal Light Pure Variety Pack includes- Rasp... | With no artificial sweeteners, flavors or pres... | Customer Will Receive 6 Boxes Total - 1 Raspbe... | Crystal Light | Crystal Light |  | ... |  |  |  |  | 0.599657 |  |  |  | 41MsGCednqL | 44a997b7ff9f4d2ebd1615ac5f3861ff |


___
## 4. <a name="4">Sampling Data</a>
(<a href="#0">Go to top</a>)

It is good practice to grab a small sample dataset to quickly run AutoGluon before using the full dataset.

In [5]:
# Take a sample of 1000 datapoints for a quick test
train_sample_small = train.sample(n=1000, random_state=1)
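
Before training on a random sample, it can help to check that the sample's class mix resembles the full dataset. The sketch below is one way to do that with the standard library; the helper names `class_shares` and `max_share_gap` and the toy label lists are hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def class_shares(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the fraction of rows that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: count / total for label, count in counts.items()}


def max_share_gap(full: Dict[Hashable, float], sample: Dict[Hashable, float]) -> float:
    """Largest absolute difference in class share between full data and sample."""
    all_labels = set(full) | set(sample)
    return max(abs(full.get(k, 0.0) - sample.get(k, 0.0)) for k in all_labels)


# Toy labels (illustrative only, not the UOMI data)
toy_full = [0] * 10 + [1] * 30 + [2] * 60
toy_sample = [0] * 1 + [1] * 3 + [2] * 6

full_shares = class_shares(toy_full)
sample_shares = class_shares(toy_sample)
gap = max_share_gap(full_shares, sample_shares)
print(full_shares, sample_shares, gap)

# In the notebook, the same helpers could be applied to the label column, e.g.
# class_shares(train["label"]) and class_shares(train_sample_small["label"]).
```

A large gap would suggest that scores obtained on the sample may not carry over to the full dataset.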

___
## 5. <a name="5">Model Training with AutoGluon</a>
(<a href="#0">Go to top</a>)


We can train a model using AutoGluon with only a single line of code. All we need to do is tell AutoGluon which column of the dataset we want to predict and which dataset to train on.

For fast experimentation, we use only the small sample from our train dataset, containing 1000 data points.

__NOTE__: Training on this smaller dataset might still take approx. 3-4 minutes!

__AUTOGLUON FIT OUTPUT__: Running `.fit` to train a model with AutoGluon prints a detailed log of the training process (see the output of the cell below). The log covers:
- the path where trained models are stored. If no path is specified, AutoGluon falls back to a default, timestamped directory, which is fine for this demo.
- the presets specified for training. If no presets are specified, AutoGluon issues a warning and a recommendation. Training without presets works, although the recommended presets are expected to improve performance.
- system info including software and OS versions, hardware specs
- size of training data
- type of problem as inferred by AutoGluon
- data preprocessing operations
- details of all ML algorithms that are being trained, including training and validation metrics
- model training completion, total runtime, and best model

Pay attention to the output below to identify all pieces of information.

In [6]:
# We specify only the train data; AutoGluon sets aside validation data automatically
first_predictor = TabularPredictor(label="label").fit(
    train_data=train_sample_small
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240106_002857"
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20240106_002857"
AutoGluon Version:  1.0.0
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Sep 6 21:15:41 UTC 2023
CPU Count:          4
Memory Avail:       12.58 GB / 15.32 GB (82.1%)
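
The warning in the log above suggests passing a preset. A hedged sketch of how that could look is shown below: `build_fit_kwargs` and `train_with_presets` are hypothetical helpers, the output path `AutogluonModels/uomi-sample` is an assumed name, and the preset list is limited to the four names printed in the log (AutoGluon may accept others). The AutoGluon import is deferred so the argument check can run on its own.

```python
from typing import Any, Dict, Optional

# Preset names exactly as listed in the AutoGluon 1.0.0 log above
RECOMMENDED_PRESETS = ("best_quality", "high_quality", "good_quality", "medium_quality")


def build_fit_kwargs(presets: str = "medium_quality",
                     time_limit_s: Optional[int] = None) -> Dict[str, Any]:
    """Validate and assemble keyword arguments for TabularPredictor.fit."""
    if presets not in RECOMMENDED_PRESETS:
        raise ValueError(f"unknown preset {presets!r}; expected one of {RECOMMENDED_PRESETS}")
    kwargs: Dict[str, Any] = {"presets": presets}
    if time_limit_s is not None:
        if time_limit_s <= 0:
            raise ValueError("time_limit_s must be positive")
        kwargs["time_limit"] = time_limit_s
    return kwargs


def train_with_presets(train_data, label: str = "label",
                       path: str = "AutogluonModels/uomi-sample",  # assumed path
                       **fit_kwargs):
    """Fit a predictor with an explicit output path, metric, and presets."""
    # Imported here so build_fit_kwargs works even without AutoGluon installed
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, path=path, eval_metric="accuracy")
    return predictor.fit(train_data=train_data, **fit_kwargs)


fit_kwargs = build_fit_kwargs("medium_quality", time_limit_s=4 * 60)
print(fit_kwargs)

# In the notebook (not run here):
# predictor = train_with_presets(train_sample_small, **fit_kwargs)
```

Setting an explicit path and preset should silence both notices at the top of the log; whether a heavier preset is worth the extra training time depends on the use case.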


___
## 6. <a name="6">AutoGluon Training Results</a>
(<a href="#0">Go to top</a>)

Now let's take a look at all the training information AutoGluon provides via its __`leaderboard` method__.

In [7]:
# Call AutoGluon's leaderboard on the trained predictor to output details of all created models 
first_predictor.leaderboard(silent=True)

|  | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | WeightedEnsemble_L2 | 0.825 | accuracy | 0.209271 | 49.0118 | 0.001021 | 1.151814 | 2 | True | 14 |
| 1 | CatBoost | 0.82 | accuracy | 0.070311 | 46.333023 | 0.070311 | 46.333023 | 1 | True | 8 |
| 2 | LightGBM | 0.78 | accuracy | 0.015724 | 6.151672 | 0.015724 | 6.151672 | 1 | True | 5 |
| 3 | XGBoost | 0.765 | accuracy | 0.018853 | 5.2957 | 0.018853 | 5.2957 | 1 | True | 11 |
| 4 | LightGBMLarge | 0.76 | accuracy | 0.033105 | 10.858981 | 0.033105 | 10.858981 | 1 | True | 13 |
| 5 | LightGBMXT | 0.755 | accuracy | 0.04841 | 6.27676 | 0.04841 | 6.27676 | 1 | True | 4 |
| 6 | NeuralNetTorch | 0.75 | accuracy | 0.077085 | 4.193133 | 0.077085 | 4.193133 | 1 | True | 12 |
| 7 | NeuralNetFastAI | 0.745 | accuracy | 0.050142 | 7.587986 | 0.050142 | 7.587986 | 1 | True | 3 |
| 8 | ExtraTreesGini | 0.745 | accuracy | 0.117585 | 1.386432 | 0.117585 | 1.386432 | 1 | True | 9 |
| 9 | RandomForestGini | 0.735 | accuracy | 0.100514 | 2.002491 | 0.100514 | 2.002491 | 1 | True | 6 |
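
The leaderboard lists both validation accuracy (`score_val`) and validation inference time (`pred_time_val`), so it can be used to trade one against the other. The sketch below copies three columns from the leaderboard above and picks the most accurate model under a latency budget; the helper name `best_model_within_latency` and the budget values are assumptions for illustration.

```python
from typing import Dict, List, Optional

# (model, score_val, pred_time_val) copied from the leaderboard above
leaderboard_rows: List[Dict] = [
    {"model": "WeightedEnsemble_L2", "score_val": 0.825, "pred_time_val": 0.209271},
    {"model": "CatBoost", "score_val": 0.82, "pred_time_val": 0.070311},
    {"model": "LightGBM", "score_val": 0.78, "pred_time_val": 0.015724},
    {"model": "XGBoost", "score_val": 0.765, "pred_time_val": 0.018853},
    {"model": "LightGBMLarge", "score_val": 0.76, "pred_time_val": 0.033105},
    {"model": "LightGBMXT", "score_val": 0.755, "pred_time_val": 0.04841},
    {"model": "NeuralNetTorch", "score_val": 0.75, "pred_time_val": 0.077085},
    {"model": "NeuralNetFastAI", "score_val": 0.745, "pred_time_val": 0.050142},
    {"model": "ExtraTreesGini", "score_val": 0.745, "pred_time_val": 0.117585},
    {"model": "RandomForestGini", "score_val": 0.735, "pred_time_val": 0.100514},
]


def best_model_within_latency(rows: List[Dict], max_pred_time_s: float) -> Optional[Dict]:
    """Return the highest-scoring row whose pred_time_val fits the budget, or None."""
    candidates = [row for row in rows if row["pred_time_val"] <= max_pred_time_s]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["score_val"])


fast_choice = best_model_within_latency(leaderboard_rows, max_pred_time_s=0.05)
any_choice = best_model_within_latency(leaderboard_rows, max_pred_time_s=1.0)
print(fast_choice["model"], any_choice["model"])

# In the notebook, the rows could come straight from the predictor, e.g.
# rows = first_predictor.leaderboard(silent=True).to_dict("records")
# and a specific model can be used via first_predictor.predict(test, model="LightGBM").
```

With these numbers, a 0.05 s budget selects `LightGBM`, while a loose budget selects the ensemble. Note that `pred_time_val` is measured on the validation split, so it is a relative indicator rather than a production latency figure.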


___
## 7. <a name="7">Model Prediction with AutoGluon</a>
(<a href="#0">Go to top</a>)

Now that we have trained a model on the train data (which had labels to learn from), let's use the fitted model to predict the labels for the test dataset.

In [8]:
# Call the predict method on the trained predictor to run inference and get predictions on the test dataset
prediction = first_predictor.predict(test)

# Print a few test predictions
print(f"Predictions for the first 20 data points in the test dataset: {prediction.values[0:20]}")

Predictions for the first 20 data points in the test dataset: [2 2 2 2 2 1 2 1 2 2 1 0 2 2 2 1 0 2 1 2]


___
## 8. <a name="8">Re-Train (with full train data) and predict again</a>
(<a href="#0">Go to top</a>)

To improve performance, repeat the process using the full dataset and use the AutoGluon leaderboard to check whether the score improves.

__AUTOGLUON FIT OUTPUT__: check that the output refers to the full dataset this time.

In [9]:
# Retrain the model using all training data 
# We let AutoGluon handle the train/validation split directly
# NOTE: We cap the training time to 10 minutes!
second_predictor = TabularPredictor(label="label").fit(
    train_data=train,
    time_limit=60 * 10,
)

# Use the trained model to make predictions on the test dataset
prediction = second_predictor.predict(test)

# Print a few test predictions
print(f"Predictions for the first 20 data points in the test dataset: {prediction.values[0:20]}")

No path specified. Models will be saved in: "AutogluonModels/ag-20240106_003056"
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20240106_003056"
AutoGluon Version:  1.0.0
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Sep 6 21:15:41 UTC 2023
CPU Count:          4
Memory Avail:       11.25 GB /

Predictions for the first 20 data points in the test dataset: [2 2 2 2 2 1 2 1 2 2 1 0 2 2 2 1 2 2 1 2]
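
To see what retraining changed, the two printed prediction lists can be compared element by element. The sketch below copies the first 20 predictions from both runs as shown in this notebook; the helper name `disagreement_indices` is hypothetical.

```python
from typing import List, Sequence

# First 20 test predictions, copied from the two outputs in this notebook
first_run = [2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 0, 2, 2, 2, 1, 0, 2, 1, 2]
second_run = [2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 0, 2, 2, 2, 1, 2, 2, 1, 2]


def disagreement_indices(a: Sequence, b: Sequence) -> List[int]:
    """Return the positions at which two prediction sequences differ."""
    if len(a) != len(b):
        raise ValueError("both sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


changed = disagreement_indices(first_run, second_run)
agreement_rate = 1 - len(changed) / len(first_run)
print(f"Changed positions: {changed}, agreement: {agreement_rate:.0%}")
```

On these 20 rows the two predictors disagree only at position 16. Without true test labels this does not say which predictor is right, only where they differ.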


In [10]:
# Call AutoGluon's leaderboard on the trained predictor to output details of all created models 
second_predictor.leaderboard(silent=True)

|  | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | WeightedEnsemble_L2 | 0.8724 | accuracy | 2.899198 | 396.879091 | 0.001395 | 0.731445 | 2 | True | 7 |
| 1 | LightGBMXT | 0.8692 | accuracy | 0.465028 | 47.875665 | 0.465028 | 47.875665 | 1 | True | 2 |
| 2 | LightGBM | 0.8684 | accuracy | 0.641491 | 55.776563 | 0.641491 | 55.776563 | 1 | True | 3 |
| 3 | CatBoost | 0.8536 | accuracy | 0.867544 | 155.687522 | 0.867544 | 155.687522 | 1 | True | 6 |
| 4 | RandomForestGini | 0.828 | accuracy | 0.471428 | 72.739019 | 0.471428 | 72.739019 | 1 | True | 4 |
| 5 | RandomForestEntr | 0.8256 | accuracy | 0.452311 | 64.068877 | 0.452311 | 64.068877 | 1 | True | 5 |
| 6 | NeuralNetFastAI | 0.822 | accuracy | 0.091722 | 68.727036 | 0.091722 | 68.727036 | 1 | True | 1 |
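
The effect of training on the full dataset can be quantified from the two `WeightedEnsemble_L2` scores reported in this notebook (0.825 on the 1000-row sample, 0.8724 on the full train set). The sketch below computes the gain in a few common forms; the helper name `compare_scores` is hypothetical. One caveat, stated as an assumption: since no separate validation data is passed, each run is presumably scored on its own held-out split, so the comparison is indicative rather than exact.

```python
from typing import Dict


def compare_scores(baseline: float, candidate: float) -> Dict[str, float]:
    """Summarize the change between two accuracy scores in [0, 1]."""
    if not (0.0 < baseline < 1.0 and 0.0 <= candidate <= 1.0):
        raise ValueError("baseline must be in (0, 1) and candidate in [0, 1]")
    absolute_gain = candidate - baseline
    return {
        "absolute_gain": absolute_gain,
        # Gain relative to the baseline accuracy
        "relative_gain": absolute_gain / baseline,
        # Share of the baseline's errors that the candidate removes
        "error_reduction": absolute_gain / (1.0 - baseline),
    }


# Best validation scores taken from the two leaderboards in this notebook
summary = compare_scores(baseline=0.825, candidate=0.8724)
print(
    f"absolute: {summary['absolute_gain']:.4f}, "
    f"relative: {summary['relative_gain']:.1%}, "
    f"error reduction: {summary['error_reduction']:.1%}"
)
```

Here the absolute gain is about 4.7 accuracy points, which corresponds to removing roughly 27% of the baseline's validation errors.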


___
## 9. <a name="9">Before You Go</a>
(<a href="#0">Go to top</a>)

After you are done with this demo, clean up the model artifacts by uncommenting and executing the cell below.

__It is good practice to clean up when you are done, so that the disk does not fill up.__

In [11]:
# !rm -r AutogluonModels
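
As an alternative to the shell command, the same cleanup can be sketched in Python with a small guard against deleting the wrong directory. The helper name `remove_model_dir` and the name check are assumptions added for safety; the demo below runs against a temporary directory so it never touches real artifacts.

```python
import shutil
import tempfile
from pathlib import Path


def remove_model_dir(path, required_name: str = "AutogluonModels") -> bool:
    """Delete the model directory; return True if something was removed."""
    target = Path(path)
    # Guard: refuse to delete anything that is not named like the model folder
    if target.name != required_name:
        raise ValueError(f"refusing to delete {target}: expected a '{required_name}' directory")
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "AutogluonModels"
    (demo_root / "ag-demo").mkdir(parents=True)
    removed_first = remove_model_dir(demo_root)   # directory exists, gets deleted
    removed_second = remove_model_dir(demo_root)  # already gone, nothing to do

print(removed_first, removed_second)

# In the notebook (not run here): remove_model_dir("AutogluonModels")
```

Deleting the directory also removes the trained predictors, so any model worth keeping should be copied elsewhere first.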