<center><img src="images/logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# ML through Application
## Module 1, Lab 4: Refining Models by Using AutoGluon

By the end of this lab, you should be able to create a model by using [AutoGluon](https://auto.gluon.ai/stable/index.html#).

You will learn how to do the following: 

- Identify the best model that AutoGluon outputs.
- Use your model to create predictions.

---

You will explore a dataset that contains information about books. The goal is to predict book prices by using features about the books.

__Business problem:__ Books from a large database with several features cannot be listed for sale because one critical piece of information is missing: the price. 

__ML problem description:__ Predict book prices by using book features, such as genre, release date, ratings, and number of reviews.

This is a regression task (the training dataset has a book price column to use for labels).

----

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you can practice your coding skills.</p> |

## Index

- [Importing AutoGluon](#Importing-AutoGluon)
- [Getting the data](#Getting-the-data)
- [Model training with AutoGluon](#Model-training-with-AutoGluon)
- [AutoGluon training results](#AutoGluon-training-results)
- [Model prediction with AutoGluon](#Model-prediction-with-AutoGluon)

---
## Importing AutoGluon

Install and load the libraries that are needed to work with the tabular dataset.

In [1]:
%%capture
# Install libraries
!pip install -U -q -r requirements.txt

In [2]:
# Import libraries and utility functions
%load_ext autoreload
import pandas as pd
# Import the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset

## Getting the data

Next, load the dataset into a Pandas DataFrame and preview the first rows of data.

__Note:__ You will use the [Amazon Product Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews) dataset. For more information about this dataset, see the following resources:

- Ruining He and Julian McAuley. "Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering." Proceedings of the 25th International Conference on World Wide Web, Geneva, Switzerland, April 2016. https://doi.org/10.1145/2872427.2883037.

- Julian McAuley, Christopher Targett, Qinfeng Shi, Anton van den Hengel. "Image-Based Recommendations on Styles and Substitutes." Proceedings of the 38th International Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) Conference on Research and Development in Information Retrieval, Santiago, Chile, August 2015. https://doi.org/10.1145/2766462.2767755.

In [3]:
df_train = TabularDataset(data="data/train.csv")
df_test = TabularDataset(data="data/test.csv")

In [4]:
df_train.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,Price,asin,details,descriptionstring
0,[],"Books"" />",[],Joan M. Lexau,"1,683,587 in Books (",['0590457292'],Books,5.48,B001D4OHQA,"{'Publisher:': 'Scholastic (1974)', 'Language:...","Staining on cover, minimal wear and creasing. ..."
1,"['Books', 'Education & Teaching', 'Schools & T...",The Core Knowledge Sequence Content and Skill ...,"['0325008957', '1138188492', '1890517208', '14...",Core Knowledge Foundation,"974,014 in Books (","['0385316402', '1890517208', '1933486058', '19...",Books,21.4,B0071QRBFS,"{'Paperback:': '400 pages', 'Publisher:': 'Cor...",A double volume with two &quot;front covers.&q...
2,[],Stranger In The Woods,[],Leah Fried,"17,588,750 in Books (",[],Books,17.0,965906523X,"{'Hardcover:': '202 pages', 'Publisher:': 'Fel...",Stranger in the woods is a dramatic tale of co...
3,[],"Hansel and Gretel : A Fairy Opera, Vocal Score",[],"Adelheid ; Bache, Constance ; Humperdinck, E. ...","3,680,123 in Books (",['0793506603'],Books,10.95,B0011ZV86I,"{'Publisher:': 'G. Schirmer, Inc. (1957)', 'AS...","Complete vocal score, words and music."
4,"['Books', 'History', 'Asia']",Genghis Khan - Conqueror Of The World,[],Leo De Hartog,"5,083,249 in Books (",[],Books,3.5,B001LIQC7A,"{'Hardcover:': '230 pages', 'Publisher:': 'Bar...",a great biography of Ghengis Khan


## Model training with AutoGluon

Now create a subset of the training data and use it to train a model by using AutoGluon.

Remember that you only need to provide the dataset and tell AutoGluon which column from the dataset you are trying to predict.

In [5]:
# Sample 1,000 rows from the training data
subsample_size = 1000  # Sample a subset of data for faster demo
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Print the first rows
df_train_smaller.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,Price,asin,details,descriptionstring
398,[],Every Last One (Audiobook CD),"['1491546336', '1600244041', '1524754668', '14...",Visit Amazon's -Anna Quindlen- Page,"6,392,575 in Books (","['0812985907', '0525509879', '0812976185', '08...",Books,23.84,B003SFS8F8,{'Publisher:': 'Unabridged edition; Unabridged...,The latest novel from Pulitzer Prize-winner An...
3833,[],"Books"" />","['0441810764', '0312863551', '0441094996', '04...",Robert A Heinlein,"4,893,400 in Books (","['0441810764', '0312863551', '0671577808', '04...",Books,6.74,B001R2GZA4,"{'Publisher:': 'SIGNET BOOKS (1900)', 'ASIN:':...",Classic science fiction novel.
4836,"['Books', 'Reference']",Review Notes and Study Guide to Conrad's Vict...,[],Ken Sobol,"2,286,014 in Books (",[],Books,8.07,B000QCDE5A,"{'Paperback:': '142 pages', 'Publisher:': 'Mon...",A CRITICAL GUIDE BY MONARCH NOTES.
4572,[],Simon's Cat va al veterinario,[],Simon Tofield,"7,769,270 in Books (",[],Books,15.18,8416261865,"{'Publisher:': 'Duomo Ediciones (October 1, 20...",Brand New. Ship worldwide
636,"['Books', 'Arts &amp; Photography', 'Decorativ...",Taisho Kimono: Speaking of Past and Present,['4756246354'],Visit Amazon's Jan Dees Page,"2,053,979 in Books (",[],Books,51.75,8857200116,"{'Hardcover:': '292 pages', 'Publisher:': 'Ski...","A unique collection of 130 kimonos for women, ..."


### Training a model with the small sample

AutoGluon uses certain defaults. For example, AutoGluon uses `root_mean_squared_error` as an evaluation metric for regression problems. For more information, see [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) in the sklearn documentation.
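
For reference, RMSE is conventionally defined as follows. The symbols are introduced here for illustration only and are not part of the lab code: $y_i$ is the actual price of book $i$, $\hat{y}_i$ is the predicted price, and $n$ is the number of books that are scored.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the squared differences are averaged before the square root is taken, the result is expressed in the same units as the price.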

__Note:__ Training on this smaller dataset might take approximately 3–4 minutes.

In [6]:
# Run this cell

smaller_predictor = TabularPredictor(label="Price").fit(train_data=df_train_smaller)

No path specified. Models will be saved in: "AutogluonModels/ag-20230928_022939/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230928_022939/"
AutoGluon Version:  0.8.0
Python Version:     3.10.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Apr 24 23:34:06 UTC 2023
Disk Space Avail:   19.81 GB / 20.96 GB (94.5%)
Train Data Rows:    1000
Train Data Columns: 10
Label Column: Price
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (2326.87, 0.0, 39.77738, 123.6481)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Avail

Now the data is loaded, and a model has been trained.

## AutoGluon training results

Now you will look at the information that AutoGluon provides through its `leaderboard` function. The `leaderboard` function is a summary of all models that AutoGluon trained.

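**Note:** AutoGluon reports every metric so that a higher value is better. For an error metric such as root mean squared error (RMSE), it flips the sign, so the `score_val` column shows negative values. The sign is used only for ranking the models; the size of the error is the absolute value of the score.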

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center; margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style="text-align: center; margin: auto;">To look more closely at the output of the AutoGluon <code>leaderboard</code> function, run the following cell.</p>
    <br>
</div>

In [7]:
# Run this cell to see the model leaderboard
smaller_predictor.leaderboard(silent=True)

| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | WeightedEnsemble_L2 | -61.381004 | 0.15583 | 25.781846 | 0.000629 | 0.395424 | 2 | True | 12 |
| 1 | NeuralNetTorch | -62.322474 | 0.084373 | 4.050249 | 0.084373 | 4.050249 | 1 | True | 10 |
| 2 | CatBoost | -62.479631 | 0.030436 | 2.792857 | 0.030436 | 2.792857 | 1 | True | 6 |
| 3 | LightGBMXT | -62.542155 | 0.013362 | 4.153607 | 0.013362 | 4.153607 | 1 | True | 3 |
| 4 | LightGBM | -62.629336 | 0.009289 | 0.743461 | 0.009289 | 0.743461 | 1 | True | 4 |
| 5 | LightGBMLarge | -62.802749 | 0.0144 | 2.943632 | 0.0144 | 2.943632 | 1 | True | 11 |
| 6 | NeuralNetFastAI | -65.590649 | 0.012399 | 12.524662 | 0.012399 | 12.524662 | 1 | True | 8 |
| 7 | XGBoost | -67.072119 | 0.01463 | 1.865047 | 0.01463 | 1.865047 | 1 | True | 9 |
| 8 | ExtraTreesMSE | -76.667466 | 0.088215 | 4.0779 | 0.088215 | 4.0779 | 1 | True | 7 |
| 9 | RandomForestMSE | -100.91315 | 0.085791 | 8.770626 | 0.085791 | 8.770626 | 1 | True | 5 |


### Interpreting the RMSE value

The root mean squared error (RMSE) that is used here is easy to interpret because it is expressed in the same units as the label. Because you are predicting prices, the absolute value of the __score\_val__ column in the leaderboard output gives you an idea of how far off the predictions typically are. For example, if score\_val = -0.24, the RMSE is 0.24, so the typical error for book price predictions is about 24 cents.
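
RMSE is not a plain average of the errors, because squaring gives large misses more weight. The following minimal sketch compares RMSE with the mean absolute error (MAE) on made-up prices that are not taken from the lab dataset. The helper names `rmse` and `mae` are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices in dollars; the last book is badly mispredicted
actual_prices = [10.0, 12.0, 15.0, 100.0]
predicted_prices = [11.0, 11.0, 16.0, 60.0]

mae_value = mae(actual_prices, predicted_prices)
rmse_value = rmse(actual_prices, predicted_prices)

print(f"MAE:  {mae_value:.2f}")
print(f"RMSE: {rmse_value:.2f}")
```

In this sketch, the single large miss pulls RMSE well above MAE. In the same way, a few expensive books with poor predictions can dominate the score.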

<div style="border: 4px solid coral; text-align: center; margin: auto;"> 
    <h3><i>Try it yourself!</i></h3>
    <p style="text-align:center; margin:auto;"><img src="images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Which model is the best?<br>
    Choose the model that you think is the best, and justify your choice with data in the following cell.</p>
    <br>
</div>


**Challenge answer**

The WeightedEnsemble_L2 model is the best for this case because it has the least negative "score_val" (-61.38), which corresponds to the lowest validation RMSE among the models trained. The next best model, NeuralNetTorch, has a "score_val" of -62.32.
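
The same choice can be made programmatically. The following sketch copies three columns from the leaderboard output above into a plain Python list, so that it runs without AutoGluon, and then picks the model with the highest `score_val`. The variable names are illustrative. In the notebook itself, you would more likely work with the table that `leaderboard` returns.

```python
# (model, score_val, pred_time_val) copied from the leaderboard output above
leaderboard_rows = [
    ("WeightedEnsemble_L2", -61.381004, 0.15583),
    ("NeuralNetTorch", -62.322474, 0.084373),
    ("CatBoost", -62.479631, 0.030436),
    ("LightGBMXT", -62.542155, 0.013362),
    ("LightGBM", -62.629336, 0.009289),
    ("LightGBMLarge", -62.802749, 0.0144),
    ("NeuralNetFastAI", -65.590649, 0.012399),
    ("XGBoost", -67.072119, 0.01463),
    ("ExtraTreesMSE", -76.667466, 0.088215),
    ("RandomForestMSE", -100.91315, 0.085791),
]

# Scores are "higher is better", so the best model has the maximum score_val
best_model = max(leaderboard_rows, key=lambda row: row[1])

# Flip the sign to recover the RMSE on the validation data
best_rmse = -best_model[1]

# The model with the shortest validation prediction time, for comparison
fastest_model = min(leaderboard_rows, key=lambda row: row[2])

print(f"Best model:    {best_model[0]} (validation RMSE {best_rmse:.2f})")
print(f"Fastest model: {fastest_model[0]} (score_val {fastest_model[1]:.2f})")
```

The comparison with the fastest model shows a possible trade-off: the ensemble has the lowest error, but a single model such as LightGBM predicts faster with only a slightly worse score.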


## Model prediction with AutoGluon

Now that your model is trained, you can use it to predict prices.

You should always run a final model performance assessment by using data that the model didn't see during training (the test data). Because test data is not used during training, it gives a more realistic estimate of how the model performs on new books. You will use the test data to make predictions in the next step.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">To show the first rows of the test dataset, which you will use to predict prices, run the following cell.
        </p>
    <br>
</div>

In [8]:
# Run this cell

df_test.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,Price,asin,details,descriptionstring
0,"['Books', 'Cookbooks, Food & Wine']",Sanjeev Kapoor`s Traditional Indian Cuisines Punjabi,[],Visit Amazon's Sanjeev Kapoor Page,"4,203,444 in Books (","['1909487465', '8179916286']",Books,10.48,8179913112,"{'Paperback:': '104 pages', 'Publisher:': 'Popular Prakashan (January 1, 2007)', 'Language:': 'English', 'ISBN-10:': '8179913112', 'ISBN-13:': '978-8179913116', '\n Package Dimensions: \n ': '6.3 x 4.2 x 0.4 inches', 'Shipping Weight:': '5 ounces ('}","): Sanjeev kapoor is a celebrity par excellence in the field. Author of several Best Sellers,Anchor since 1993 of one of the best Cookery Shows,he hardly needs any introduction.He works with many reputed Hotel Chains as consultant.He is recipient of a host of awards such as the Best Executive Chef of India,the Mercury Gold Award at Geneva by IFCA.Mercury is latest addition."
1,[],"Christopher Radko: The first decade, 1986-1995 1st edition by Radko, Christopher published by C. Radko for Starad, Inc Hardcover",[],aa,"2,006,465 in Books (","['0609604767', '0740725114', '0977909905', '0609604759', 'B07HQVBFFF', 'B004TJZDKA', 'B079WWFM6G', '1493022148', 'B01NB1KWK8', 'B00ADYQ9L2']",Books,315.2,B0091PA87K,"{'Publisher:': 'C. Radko for Starad, Inc; 25125th edition (1994)', 'ASIN:': 'B0091PA87K', '\n Package Dimensions: \n ': '12 x 10 x 1 inches'}",Detailed pictorial look at the first 10 years of the Chrisopher Radko ornament line.
2,"['Books', 'Reference', 'Words, Language & Grammar']",Navaho Stories in Basic Vocabulary (A Dolch Basic Book),"['B0006AV7D8', 'B0007E0QTY']",Edward W. Dolch,"6,550,510 in Books (",[],Books,13.98,B000VF4TDS,"{'Hardcover:': '165 pages', 'Publisher:': 'The Garrard Press Publishers; 1st edition (1957)', 'Language:': 'English', 'ASIN:': 'B000VF4TDS', '\n Package Dimensions: \n ': '8.1 x 5.9 x 0.7 inches', 'Shipping Weight:': '13.8 ounces'}","Dust jacket notes about the Dolch Basic Vocabulary Books: ""The Basic Vocabulary Books have been written to fill the need for easy to read, interesting stories that encourage independent reading onbeginner level. Children are delighted with the wealth of new and fascinating true stories and tales of folklore. The books are easy to read for the stories are written with the very first words children learn by sight, the Dolch 220 Basic Sight Words and 95 Commonest Nouns. The repetition of these easy sight words builds a sound reading vocabulary and children gain confidence in their reading abi..."
3,[],The Cultural Monuments of Tibet's Outer Provinces: Amdo. Volume 1: The Qinghai Part of Amdo,"['9747534908', '9744800496', '9744800615']",Andreas Gruschke,"4,368,852 in Books (",[],Books,99.95,9747534592,"{'Paperback:': '284 pages', 'Publisher:': 'White Lotus Co., Ltd; 1st edition (January 1, 2001)', 'Language:': 'English', 'ISBN-10:': '9747534592', 'ISBN-13:': '978-9747534597', '\n Product Dimensions: \n ': '8.3 x 1.2 x 11.4 inches', 'Shipping Weight:': '2.2 pounds ('}","This book presents the fascinating world of northeast Tibet?s historical and cultural monuments. The author's original studies reveal that Tibetan culture is thriving. Tibetans have rebuilt their economy and revitalized their traditional way of life. East Tibet has not until now been thoroughly researched although it comprises about two-thirds of the Tibetan Plateau. This book provides comprehensive information on unknown sites in Amdo. The first volume starts with the famous Kumbum Monastery. Next, the major lamaseries of Tsongkha and the Yellow River bend are described with a historical ..."
4,[],The Danger by Dick Francis,"['0425204391', '042520846X', '0425237753', '0425199932', '0425196739', '042520393X', '0425194973', '0425198006', '0425206955', '0425235408', '0425276244', '0425199835', '0425206300', '0425194981', '0425201910', '0425234630', '0425205258', '042519745X', '0425197050', '0425203549', '0425217566', '0425233316', '0425202887', '0425233189', '042519681X', '0425196747', '0425268543', '0425198774', '0425207188', '0399574719', '0425199169', '0425201481', '042527893X', '0449221113', '0425208850', '0425250385', '0425228975', '042519938X', '0425261352', '0449221164', '0425235904', '0399574735', '042519...",,"8,401,025 in Books (","['042520846X', '0425204391', '0425194973', '0425196739', '042520393X', '0425234630', '0425198006', '0425199835', '0425201910', '0425199932', '0399127070', '0425206300', '0425233316', '0425206955', '0425233189', '0425203549', '042519681X', '0425194981', '0330253077', '0425208850', '0425205258', '0718113934', '0425237753', '0399143025', '0425210766', '0425235408', '0425197050', '0425211037', '0425206777', '0425276244', '0525536760', '006149223X', '0425222713', '0425199169', '0515116173', '0425268543', '0425207188', '042519499X', '0425217566', '0425235904', '0425202887', '0425222705', '042520...",Books,12.75,B004HMQY3Y,"{'Publisher:': 'by Dick Francis (July 12, 2009)', 'ASIN:': 'B004HMQY3Y'}","Will be shipped from US. Used books may not include companion materials, may have some shelf wear, may contain highlighting/notes, may not include CDs or access codes. 100% money back guarantee."


<div style="border: 4px solid coral; text-align: center; margin: auto;"> 
    <h3><i>Try it yourself!</i></h3>
    <p style="text-align:center; margin:auto;"><img src="images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Use this test dataset as input to the model that you just trained. Use the model to predict book prices. Use the following cell to run the appropriate code.<br><br>
    <b>Tip:</b> For information about the <code>predict</code> function, see <a href="https://auto.gluon.ai/0.6.2/api/autogluon.predictor.html">AutoGluon Predictors</a> in the AutoGluon documentation.</p>
    <br>
</div>


In [18]:
############### CODE HERE ###############

# Load the trained model
predictor = TabularPredictor.load("AutogluonModels/ag-20230928_022939/")

# Make predictions on the test dataset
testPricePredictions = predictor.predict(df_test)

# See the first few rows of the predicted book prices
testPricePredictions.head()

############## END OF CODE ##############

0    25.674156
1    46.761768
2    25.561752
3    46.388287
4    34.460209
Name: Price, dtype: float32
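
To see how these predictions compare with the known prices, the following sketch computes the RMSE by hand for only the five test rows shown above. The prices and predictions are copied from the two outputs, and the variable names are illustrative. Five rows are far too few for a reliable estimate. To score the full test set, the `evaluate` method of `TabularPredictor` accepts a labeled dataset; it is mentioned here as a pointer and is not part of the original lab code.

```python
import math

# Actual prices from the df_test.head() output and the matching predictions
actual_prices = [10.48, 315.2, 13.98, 99.95, 12.75]
predicted_prices = [25.674156, 46.761768, 25.561752, 46.388287, 34.460209]

# Signed error for each row: predicted price minus actual price
errors = [p - a for a, p in zip(actual_prices, predicted_prices)]

# RMSE over the five rows only
rmse_five_rows = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

for actual, predicted, error in zip(actual_prices, predicted_prices, errors):
    print(f"actual {actual:8.2f}  predicted {predicted:8.2f}  error {error:9.2f}")

print(f"RMSE on these five rows: {rmse_five_rows:.2f}")
```

The largest error comes from the book priced at 315.20, which the model predicts at about 46.76. Because RMSE squares each error, this one row accounts for most of the result.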

----
## Conclusion

You have now created a model by using AutoGluon, seen how to identify the best model version, and made predictions by using the model.

## Next lab
In the next lab, you will explore some of the advanced features of AutoGluon to refine your model.