<p style="padding: 10px; border: 1px solid black;">
<img src="./utils/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>
    
# <a name="0">MLU Workshop: AutoGluon Training</a>
   
This notebook demonstrates the simplest way to use AutoGluon for tabular data. AutoGluon automates several tasks in ML model development and builds highly accurate models. In this notebook, you will test AutoGluon on a dataset of products from Amazon's retail catalogue.
Due to the large volume of items sold on Amazon, it is challenging to search for and identify multiple listings of similar products. The goal is to use AutoGluon to predict whether two products are similar, using their respective features.
    
> This is a __binary classification__ task. The label column indicates whether the two products in a given pair are similar. <br>

__Jupyter notebooks environment__:

* Jupyter notebooks allow creating and sharing documents that contain both code and rich text cells. If you are not familiar with Jupyter notebooks, read more [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
* This is a quick-start demo to bring you up to speed on coding and experimenting with machine learning. Move through the notebook __from top to bottom__. 
* Run each code cell to see its output. To run a cell, click within the cell and press __Shift+Enter__, or click __Run__ from the top of the page menu. 
* A `[*]` symbol next to the cell indicates the code is still running. A `[#]` symbol, where # is an integer, indicates it is finished.
* Beware, __some code cells might take longer to run__, sometimes 5-10 minutes (depending on the task, installing packages and libraries, training models, etc.)

Let's start by loading some libraries and packages!    

 <a href="#Part-I---Training-models-using-AutoGluon">Part I - Training models using AutoGluon</a>

 <a href="#Part-II---Leaderboard-Submission">Part II - Leaderboard Submission</a>

## <a id="Part-I---Training-models-using-AutoGluon">Part I - Training models using AutoGluon</a>

Let's solve the problem of identifying similar products using __AutoGluon__.

- Part I - 1. <a href="#Importing-AutoGluon"> Importing AutoGluon </a>
- Part I - 2. <a href="#Getting-the-Data">Getting the Data</a>
- Part I - 3. <a href="#Model-Training-with-AutoGluon">Model Training with AutoGluon</a>
- Part I - 4. <a href="#AutoGluon-Training-Results">AutoGluon Training Results</a>
- Part I - 5. <a href="#Model-Prediction">Model Prediction with AutoGluon</a>

(<a href="#0">Go to top</a>)

### <font color='orange'>Please make sure to run the cell below to import all the required libraries! </font> 

In [1]:
# # Install AutoGluon
# !pip install -q autogluon
import pandas as pd

### <a id="Importing-AutoGluon">Importing AutoGluon</a>

Now we load the libraries needed to work with our Tabular dataset.

(<a href="#0">Go to top</a>)

In [2]:
# Importing the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset

### <a id="Getting-the-Data">Getting the Data</a>

Let's get the data for our business problem.

(<a href="#0">Go to top</a>)

In [3]:
# Load the training dataset
df_train = TabularDataset(data="../data/training.csv")

# Load the test dataset
df_test = TabularDataset(data="../data/mlu-leaderboard-test.csv")

In [4]:
df_train.head()

Unnamed: 0,list_price_value_1,product_type_1,item_name_1,product_description_1,bullet_point_1,brand_1,manufacturer_1,part_number_1,model_number_1,size_1,...,item_dimensions_width_2,item_dimensions_length_2,item_dimensions_height_2,list_price_currency_1,list_price_value_with_tax_1,list_price_currency_2,list_price_value_with_tax_2,imgID_1,imgID_2,ID
0,,OFFICE_PRODUCTS,Charlyn Woodruff - CW Designs Monogram - Squar...,,,,3dRose LLC,,,,...,,,,,,,,41y91fNgZqL,510nboKsU5L,7dd5ed12f418440c9aa46813732ff7a3
1,,HOBBIES,Propeller Guard Protector for DJI Mavic 2 Pro/...,Easy installation;Installing and removing the ...,Propeller Guard Protector for DJI Mavic 2 Pro/...,XSD MODEL,PGYTECH,681385399035.0,,,...,,,,,,,,41of5Aiv0jL,41of5Aiv0jL,7cbdc2cff2d44b15852302f15ac718dc
2,,WIRELESS_ACCESSORY,Multi USB Charging Cable,0,0,MuchCORD,MuchCORD,,,4 Feet,...,,,,,,,,41UwfSHCuEL,41UwfSHCuEL,fc50670a76ee4e25bf8fc56ce604877d
3,,KITCHEN,Creative Converting Touch of Color 20 Count Pl...,,10. 25-Inch premium plastic banquet plates in ...,Creative Converting,Creative Converting,28313131.0,28313131.0,One Size,...,,,,,,,,41gzBI7mLXL,41gzBI7mLXL,f11b496a223149aeb8c48d364940e64a
4,,VIDEO_DVD,20th Century Fox Studio Classics (The Blue Max...,,,,,,,,...,,,,,,,,51b6p9J1AOL,51Ol7lW9VbL,afcc880457644c0d99fb3695ad062e47
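
The column names in the output above suggest that each row describes a pair of products, with features of the first product suffixed `_1` and features of the second suffixed `_2`. The following is a minimal sketch of how the columns could be grouped by suffix to check that structure. The helper `split_by_product` is hypothetical (not part of AutoGluon or pandas), and the column list is only the subset visible in the truncated output.

```python
# Column names copied from the head() output above.
# This is only a subset: the columns hidden behind "..." are not included.
columns = [
    "list_price_value_1", "product_type_1", "item_name_1", "brand_1",
    "item_dimensions_width_2", "list_price_currency_1", "list_price_currency_2",
    "imgID_1", "imgID_2", "ID",
]


def split_by_product(column_names):
    """Group column names by their trailing _1 / _2 suffix (hypothetical helper)."""
    groups = {"product_1": [], "product_2": [], "shared": []}
    for name in column_names:
        if name.endswith("_1"):
            groups["product_1"].append(name[:-2])  # strip the suffix
        elif name.endswith("_2"):
            groups["product_2"].append(name[:-2])
        else:
            groups["shared"].append(name)  # e.g. the row identifier
    return groups


groups = split_by_product(columns)

# Base feature names that appear for both products in this subset
in_both = sorted(set(groups["product_1"]) & set(groups["product_2"]))
print(in_both)
print(groups["shared"])
```

In the notebook, the same helper could be called as `split_by_product(df_train.columns)` to cover all columns.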


### <a id="Model-Training-with-AutoGluon">Model Training with AutoGluon</a>

We can train a model using AutoGluon with only a single line of code.  All we need to do is to tell it which column from the dataset we are trying to predict, and what the dataset is.

__Optional:__ You may set a __time limit__ for AutoGluon to perform all the tasks related to ML model development. More time allows AutoGluon to try out more techniques to improve performance.

(<a href="#0">Go to top</a>)

In [5]:
# Train a model with AutoGluon on the train dataset

# Set the path to save models
save_path = "AutogluonModels/Intro/"

# Set the training time to 20 minutes here, to achieve good results
predictor = TabularPredictor(label="label", path=save_path).fit(train_data=df_train, time_limit=20*60)

Beginning AutoGluon training ... Time limit = 1200s
AutoGluon will save models to "AutogluonModels/Intro/"
AutoGluon Version:  0.3.1
Train Data Rows:    18893
Train Data Columns: 63
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
NumExpr defaulting to 4 threads.
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    61855.75 MB
	Train Data (Original)  Memory Usage: 80.55 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fi



	0.7032	 = Validation score   (accuracy)
	309.37s	 = Training   runtime
	1.14s	 = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 42.96s of the 42.3s of remaining time.
	Ran out of time, early stopping on iteration 198. Best iteration is:
	[196]	train_set's binary_error: 0.0298183	valid_set's binary_error: 0.262434
	0.7376	 = Validation score   (accuracy)
	49.95s	 = Training   runtime
	0.38s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the -17.27s of remaining time.
	0.7693	 = Validation score   (accuracy)
	1.24s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 1220.13s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/Intro/")
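
The log above reports a total runtime of 1220.13s against a time limit of 1200s, so the limit appears to act as a budget that AutoGluon aims for rather than a hard cutoff. As a sketch of how the time budget could be varied between a quick trial run and a longer run, the hypothetical helper below builds the keyword arguments for `fit()`. `build_fit_kwargs` and `train` are illustrative names, not AutoGluon functions; `presets` is an optional `fit()` argument, and which preset names are accepted may depend on the installed AutoGluon version.

```python
def build_fit_kwargs(minutes, presets=None):
    """Return keyword arguments for TabularPredictor.fit() (hypothetical helper)."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    kwargs = {"time_limit": int(minutes * 60)}  # fit() takes the limit in seconds
    if presets is not None:
        kwargs["presets"] = presets
    return kwargs


def train(df_train, label="label", save_path="AutogluonModels/Intro/", **fit_kwargs):
    """Train a predictor the same way as the cell above, with configurable fit arguments."""
    # Imported here so build_fit_kwargs can be used without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(label=label, path=save_path).fit(
        train_data=df_train, **fit_kwargs
    )


quick = build_fit_kwargs(5)   # short run to check that everything works
full = build_fit_kwargs(20)   # the 20-minute budget used in this notebook
print(quick, full)
```

A call such as `train(df_train, **quick)` would then run a shorter experiment before committing to the full budget.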


### <a id="AutoGluon-Training-Results">AutoGluon Training Results</a>
Now let's take a look at all the information AutoGluon provides via its __leaderboard function__. <br/> 

__NOTE__: Don't confuse this with the MLU Leaderboard. The MLU Leaderboard is where you will make submissions with the predictions from your trained models; the AutoGluon leaderboard function is a summary of all models that AutoGluon trained.

(<a href="#0">Go to top</a>)

In [6]:
predictor.leaderboard(silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.769312,5.17753,794.675259,0.00405,1.241518,2,True,13
1,RandomForestEntr,0.745503,0.466994,92.132452,0.466994,92.132452,1,True,6
2,CatBoost,0.745503,1.036961,136.983842,1.036961,136.983842,1,True,7
3,LightGBMXT,0.740741,0.382354,30.080697,0.382354,30.080697,1,True,3
4,RandomForestGini,0.740741,0.544838,90.212166,0.544838,90.212166,1,True,5
5,LightGBM,0.739153,0.459839,82.301038,0.459839,82.301038,1,True,4
6,LightGBMLarge,0.737566,0.379415,49.950664,0.379415,49.950664,1,True,12
7,XGBoost,0.731746,0.591545,57.927339,0.591545,57.927339,1,True,10
8,ExtraTreesEntr,0.726984,0.577737,127.770066,0.577737,127.770066,1,True,9
9,ExtraTreesGini,0.725397,0.522792,122.392076,0.522792,122.392076,1,True,8
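
To make the leaderboard easier to read, the sketch below works through the numbers shown above in plain Python: it finds the best individual model, measures how much validation accuracy the weighted ensemble adds on top of it, and picks the fastest-training model that is still within one accuracy point of the best. The values are copied from the output above; the one-point tolerance is an arbitrary choice for illustration.

```python
# (model, score_val, fit_time, stack_level) copied from the leaderboard output above
leaderboard = [
    ("WeightedEnsemble_L2", 0.769312, 794.675259, 2),
    ("RandomForestEntr", 0.745503, 92.132452, 1),
    ("CatBoost", 0.745503, 136.983842, 1),
    ("LightGBMXT", 0.740741, 30.080697, 1),
    ("RandomForestGini", 0.740741, 90.212166, 1),
    ("LightGBM", 0.739153, 82.301038, 1),
    ("LightGBMLarge", 0.737566, 49.950664, 1),
    ("XGBoost", 0.731746, 57.927339, 1),
    ("ExtraTreesEntr", 0.726984, 127.770066, 1),
    ("ExtraTreesGini", 0.725397, 122.392076, 1),
]

# Individual (non-ensemble) models sit at stack level 1
base_models = [row for row in leaderboard if row[3] == 1]
ensemble = next(row for row in leaderboard if row[3] == 2)

# Best base model by validation score; ties are broken by the shorter fit time
best_base = max(base_models, key=lambda row: (row[1], -row[2]))

# Validation accuracy gained by ensembling over the best single model
ensemble_gain = ensemble[1] - best_base[1]

# Fastest-training base model within 0.01 accuracy of the best base model
tolerance = 0.01  # arbitrary threshold for this illustration
close_enough = [row for row in base_models if best_base[1] - row[1] <= tolerance]
fastest_close = min(close_enough, key=lambda row: row[2])

print(f"Best base model: {best_base[0]} ({best_base[1]:.4f})")
print(f"Ensemble gain:   {ensemble_gain:.4f}")
print(f"Fastest within {tolerance}: {fastest_close[0]} ({fastest_close[2]:.1f}s)")
```

Because `predictor.leaderboard(silent=True)` returns a DataFrame, the same comparison could also be done directly with pandas filtering and sorting in the notebook.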


### <a id="Model-Prediction">Model Prediction with AutoGluon</a>
#### Now that your model is trained, let's use it to predict whether product pairs are similar!

We should always run a final model performance assessment using data that was unseen by the model (the test data). Because test data is not used during training, it gives an unbiased estimate of how the model performs on new data. In our case, we will use the test data to make predictions and submit those to the MLU Leaderboard in the next step.

(<a href="#0">Go to top</a>)
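
If you want an accuracy estimate on unseen data before submitting, one option is to hold out part of the labeled training rows yourself. The sketch below is a minimal, dependency-free illustration of that idea: `holdout_split` and `accuracy` are hypothetical helpers, the 20% holdout fraction is an assumption, and the row count is taken from the training log (18893 rows). For the estimate to be meaningful, the predictor would have to be trained only on the rows in `train_idx`.

```python
import random


def holdout_split(n_rows, holdout_fraction=0.2, seed=0):
    """Return (train_indices, holdout_indices) as a reproducible random split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_rows * holdout_fraction)
    return sorted(indices[n_holdout:]), sorted(indices[:n_holdout])


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train_idx, holdout_idx = holdout_split(18893)
print(len(train_idx), len(holdout_idx))

# In the notebook, df_train.iloc[train_idx] would be used for fitting and
# df_train.iloc[holdout_idx] for predicting; the labels below are made up.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```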

In [7]:
# Run this cell
df_test.head()

Unnamed: 0,list_price_value_1,product_type_1,item_name_1,product_description_1,bullet_point_1,brand_1,manufacturer_1,part_number_1,model_number_1,size_1,...,item_dimensions_width_2,item_dimensions_length_2,item_dimensions_height_2,list_price_currency_1,list_price_value_with_tax_1,list_price_currency_2,list_price_value_with_tax_2,imgID_1,imgID_2,ID
0,,HEADPHONES,"Bluetooth Headphones, ESTAVEL Wireless Sports ...",The model number is HT-BT03<br><br><b>[Waterpr...,[SUPER QUALITY&AMAZING BASS]This sport wireles...,ESTAVEL,ESTAVEL,HT,,Black,...,,,,,,,,41rS6EmkzTL,41wxxq7168L,d612d10afd8242c892ad2c697cc64cd8
1,,SOUND_AND_RECORDING_EQUIPMENT,Rane TTM57 MKII Club/DJ Mixer W/Serato DJ+Dual...,The TTM57mkII stays true to its original desig...,Rane TTM57 MKII Club/DJ Mixer W/Serato DJ+Dire...,Rane,Rane,TTM57 MKII+BRLRMXBP1+RED WAVE+ATM510,TTM57 MKII+BRLRMXBP1+RED WAVE+ATM510,,...,,,,,,,,510Tc3DmEBL,51qWaOS8ytL,faf5a521ca994cc889d6db29ecdb82e0
2,,STICKER_DECAL,Cartoon Sticker '' Sesame Street Elmo Cartoon '',-Gently peel off the sticker by nail tip.\n-Cl...,"Decorate photo albums, notebooks,moblie, car,m...",Hometown,Sticker,,,,...,,,,,,,,51rWGz3k8OL,51rWGz3k8OL,0a873fd0823b4e6da423aedfb3119de9
3,,AUTO_PART,XENON HALOGEN FOG LIGHTS For 05-10 NISSAN NAVA...,This auction includes a complete fog lamp kit ...,(2) 55 Watt 4100K Xenon Halogen Lamps,BlingLights,,bl-300w-d4-navara,,,...,,,,,,,,41SuFxTWs2L,41SuFxTWs2L,cadf42deef854462b1f9adcb2ab8ce2b
4,,PERSONAL_CARE_APPLIANCE,"Little Sunny 7.6Inch Premium Dildo, Classical ...",<BR>Product Description: <BR>All dildo cock fr...,"Made of 100% pure liquid silicone, safe ,odorl...",Little Sunny,Little Sunny,AAC570717022,,,...,,,,,,,,41DUNzQGZqL,41DUNzQGZqL,278dfd58a7c8412aa7e2b3c5bc21438c


In [8]:
# Get predictions
predictions = predictor.predict(df_test)
predictions.head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64
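
The first five predictions above are all `0`, which on its own does not say whether the model ever predicts the positive class. A quick check is to count how often each class appears across all predictions. The sketch below shows the idea with a hypothetical helper, `class_shares`, and made-up example values; in the notebook, `class_shares(predictions)` could be called on the actual predictions.

```python
from collections import Counter


def class_shares(labels):
    """Return the share of each predicted class, ordered by class value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}


# Made-up predictions for illustration only (not the notebook's actual output)
example_predictions = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0]
print(class_shares(example_predictions))
```

If one class has a share of 1.0, the model is predicting a single class for every pair, which is worth investigating before submitting.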

## <a id="Part-II---Leaderboard-Submission">Part II - Leaderboard Submission</a>


#### Now you are ready for your first submission to our MLU Leaderboard!

(<a href="#0">Go to top</a>)

In [9]:
# Run this cell

# Define an empty DataFrame with the column headers ID & label
df_submission = pd.DataFrame(columns=["ID", "label"])
# Fill the ID column from the test set IDs
df_submission["ID"] = df_test["ID"].tolist()
# Fill the label column from the predictions (as a list, so values are assigned by position)
df_submission["label"] = predictions.tolist()
# Save the CSV file for the Leaderboard submission
df_submission.to_csv(
    "./../data/predictions/Prediction_to_Leaderboard.csv", index=False
)

In [10]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./../data/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_submission["ID"]).sum(),
)

Double-check submission file against the original test file
Differences between project result IDs and sample submission IDs: 0
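
The check above compares IDs only. As a sketch of a slightly broader sanity check, the hypothetical helper below also verifies the column headers, looks for duplicate IDs, and confirms that every label is `0` or `1`. The rules it enforces are assumptions based on the submission file built in this notebook, not documented Leaderboard requirements. It uses only the standard library and works on CSV text, so the example runs without the data files.

```python
import csv
import io


def validate_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission CSV (empty list if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["ID", "label"]:
        return [f"unexpected columns: {reader.fieldnames}"]

    rows = list(reader)
    ids = [row["ID"] for row in rows]
    problems = []
    if ids != list(expected_ids):
        problems.append("IDs differ from the test file (values or order)")
    if len(set(ids)) != len(ids):
        problems.append("duplicate IDs")
    bad_labels = [row["label"] for row in rows if row["label"] not in {"0", "1"}]
    if bad_labels:
        problems.append(f"{len(bad_labels)} labels are not 0 or 1")
    return problems


# The first two IDs from the test set preview shown earlier
test_ids = ["d612d10afd8242c892ad2c697cc64cd8", "faf5a521ca994cc889d6db29ecdb82e0"]

good = "ID,label\nd612d10afd8242c892ad2c697cc64cd8,0\nfaf5a521ca994cc889d6db29ecdb82e0,0\n"
missing_label = "ID,label\nd612d10afd8242c892ad2c697cc64cd8,0\nfaf5a521ca994cc889d6db29ecdb82e0,\n"

print(validate_submission(good, test_ids))
print(validate_submission(missing_label, test_ids))
```

In the notebook, the saved file's contents could be read with `open(...).read()` and checked against `df_test["ID"].tolist()`.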
