<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem;
"><img src="https://cdn-prod.mlu.aws.dev/static/amazon_apollo_django_setup_staging/da021f332105bfea6edc2b02f78330ab1e750dfb01896a80b9676a49743759a4/img/mlu_logo.png" class="logo" alt="MLU Logo"></div>
    
    
# Automate ML Tasks with AutoGluon

This hands-on notebook will let you practice the concepts you have learned in this course so far. You will explore a database of books (books of different genres, from thousands of authors). The goal is to predict book prices using book features.

__Business Problem:__ Books from a large database (different genres, thousands of authors, etc.) cannot be listed for sale because they are missing one critical piece of information: the price.

__ML Problem Description:__ Predict book prices using book features, such as genre, release date, ratings, and number of reviews. This is a __regression__ task (there is a book price column in the training dataset that you can use to train your model). <br>

----

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
        To generate book price predictions, you will be presented with exercises throughout the notebook whenever you see the MLU robot. As we are not trying to measure your coding skills, you will also find solutions throughout the notebook. <br><br> No matter how experienced and skilled you are with coding, you will be able to submit a solution!
    </span>
</div>

----

The notebook consists of 2 parts; please work top to bottom and don't skip sections as this could lead to error messages due to missing code.

### <a href="#1">Part I - Getting Started with AutoGluon (REQUIRED)</a>
In the first part of the notebook you are going to learn how [__AutoGluon__](https://auto.gluon.ai/stable/index.html#) can solve the book price prediction problem.<br/>

You will build a simple and quick base model and then implement iterations of this model to improve it. To measure how well you are doing (and to see how the model improves) you have to submit your model's predictions to the [__Book Prices Prediction MLU Leaderboard__](https://mlu.corp.amazon.com/contests/redirect/7).

MLU Leaderboard will assess your prediction performance against other participants. Your submission to the leaderboard also __counts towards your course completion__. You can make as many submissions as you like.

We ask you to make 2 submissions in Part I: a simple prediction trained on a smaller dataset (for a quick first submission) in Part I - 6, and another prediction trained on the full dataset (to submit an improved result) in Part I - 7.

- Part I - 1. <a href="#p1-1">Importing AutoGluon</a>
- Part I - 2. <a href="#p1-2">Getting the Data</a>
- Part I - 3. <a href="#p1-3">Model Training with AutoGluon (small training dataset)</a>
- Part I - 4. <a href="#p1-4">AutoGluon Training Results</a>
- Part I - 5. <a href="#p1-5">Label Prediction with AutoGluon</a>
- Part I - 6. <a href="#p1-6">First MLU Leaderboard Submission (with small training data)</a>
- Part I - 7. <a href="#p1-7">Second MLU Leaderboard Submission (with full training data)</a>

### <a href="#2">Part II - Advanced AutoGluon (OPTIONAL)</a>
In the second part of the notebook you will find some advanced features of AutoGluon. You're welcome to use the insights you can gain from Part II to make an optional 3rd submission. However, a quick word of warning - AutoGluon is very powerful in its base form so you might not see much additional model improvement on MLU Leaderboard.

- Part II - 1. <a href="#p2-1">Explainability: Feature Importance</a>
- Part II - 2. <a href="#p2-2">Data Preprocessing: Cleaning & Missing Values</a>
- Part II - 3. <a href="#p2-3">Final (optional) MLU Leaderboard Submission (with full engineered data)</a>
- Part II - 4. <a href="#p2-4">Before You Go (clean up model artifacts)</a>

----
<br/>

## <a name="1">Part I - Getting Started with AutoGluon</a>
Let's solve the book price prediction problem using __AutoGluon__ in this Jupyter notebook.

__Jupyter notebooks environment__:

* Jupyter notebooks allow creating and sharing documents that contain both code and rich text cells. If you are not familiar with Jupyter notebooks, read more [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
* Run each grey code cell to see its output. To run a cell, click within the cell and press __Shift+Enter__, or click __Run__ from the top of the page menu. 
* A `[*]` symbol next to the cell indicates the code is still running. A `[#]` symbol, where # is an integer, indicates it is finished.
* Beware, __some code cells might take longer to run__, sometimes 5-20 minutes (depending on the task, installing packages and libraries, training models, etc.)

Let's start by loading some libraries and packages!

### <font color='#DF2A5D'>Please make sure to run the below cells!</font> 

In [None]:
%%capture
!pip install -q autogluon

In [None]:
# Import utility functions that provide answers to challenges
%load_ext autoreload
%aimport course_utils

# Import pandas library to work with dataframes
import pandas as pd

### <a name="p1-1">Part I - 1. Importing AutoGluon</a>


Now you load the libraries needed to work with your tabular dataset.

In [None]:
# Importing the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset

### <a name="p1-2">Part I - 2. Getting the Data</a>

Let's get the data for your business problem.
<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
        Run the cell below to load the train and test data. Then continue and take a look at the first samples of your train dataset to explore the data.
    </span>
</div>
    

In [None]:
# Load training and test data
df_train = TabularDataset(data="../data/training.csv")
df_test = TabularDataset(data="../data/mlu-leaderboard-test.csv")

In [None]:
# Explore first few rows of training data
df_train.head()

### <a name="p1-3">Part I - 3. Model Training with AutoGluon (small training dataset)</a>

You can train multiple models (a predictor) using AutoGluon with only a single line of code.  All you need to do is to tell it which column from the dataset you are trying to predict, and what the dataset is.


### Sampling data
For this first training run, you are going to randomly sample 1000 books from your training dataset to make training faster.

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to prepare the dataset. Here you are randomly selecting 1000 rows of your training data; AutoGluon does the rest of the magic for us and takes care of holding out validation data during training.
    </span>
</div>
    

<br>

__NOTE__: The `random_state` parameter below makes the sampling reproducible, so you get the same rows each time you run the code.
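To make the effect of `random_state` concrete, here is a minimal sketch on a toy dataframe. The `toy` frame and its values are made up for illustration and are not part of the course data. Sampling twice with the same seed should select the same rows:

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate reproducible sampling
toy = pd.DataFrame({"ID": range(10), "Price": [i * 1.5 for i in range(10)]})

# Two independent calls with the same seed
sample_a = toy.sample(n=4, random_state=0)
sample_b = toy.sample(n=4, random_state=0)

# The selected row labels are identical because the seed is identical
same_rows = sample_a.index.tolist() == sample_b.index.tolist()
print("Same rows selected:", same_rows)
```

Without a fixed `random_state`, each run would generally draw a different subset, which makes results harder to compare between runs.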

In [None]:
# Sampling 1000
subsample_size = 1000  # subsample subset of data for faster demo
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Printing the first rows
df_train_smaller.head()

### Training a model with 1000 samples
It's time to train the AutoGluon predictor with the small data sample you created above.

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Use the smaller dataset with 1000 samples to reduce the time required for training. Training on this smaller dataset might still take approx. 3-4 minutes!
    </span>
</div>



__NOTE__: AutoGluon uses certain defaults; generally these are good, but there is one exception: `eval_metric`. By default, AutoGluon uses `'root_mean_squared_error'` as the evaluation metric for regression problems. However, MLU Leaderboard uses the `'mean_squared_error'` metric to measure submission quality, so you need to explicitly pass this metric to AutoGluon. For more information on these options, see sklearn [metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).
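As a quick illustration of how the two metrics relate, the sketch below computes both by hand for three made-up prices (the `actual` and `predicted` values are hypothetical and not taken from the book dataset):

```python
import math

# Hypothetical true and predicted prices
actual = [200.0, 350.0, 500.0]
predicted = [210.0, 330.0, 530.0]

# Mean squared error: average of the squared differences
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

# Root mean squared error: square root of the MSE, in the same unit as the price
rmse = math.sqrt(mse)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Because the square root is monotonic, a model that is better on MSE is also better on RMSE; the two metrics differ in the reported numbers, not in how they rank models.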

In [None]:
# Run this cell to fit (train) models with AutoGluon
smaller_predictor = TabularPredictor(
    label="Price", eval_metric="mean_squared_error", path="smaller_predictor"
).fit(train_data=df_train_smaller)

You now have a trained predictor (called `smaller_predictor`) that you can use to make predictions. AutoGluon also saves the predictor locally on your instance, so if you want to use it again later you don't have to retrain it. Because the cell above sets `path="smaller_predictor"`, you can load it with: `predictor = TabularPredictor.load("smaller_predictor")`.

### <a name="p1-4">Part I - 4. AutoGluon Training Results</a>
Now let's take a look at all the information AutoGluon provides via its __leaderboard function__. <br/> 

__NOTE__: Don't confuse this with the MLU Leaderboard. The MLU Leaderboard is where you will make submissions with the predictions from your trained models; the AutoGluon leaderboard function is a summary of all models that AutoGluon trained.

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to take a closer look at AutoGluon's leaderboard output. You will notice there are many different columns, not just a model score. Think about why there might be other columns, such as <code>fit_time</code> for example. Can you decide which model is the best model based on the information in the AutoGluon leaderboard? 
    </span>
</div>

In [None]:
# Display AutoGluon leaderboard

smaller_predictor.leaderboard(silent=True)

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("AutoGluon_leaderboard")

### <a name="p1-5">Part I - 5. Label Prediction with AutoGluon</a>
#### Now that your predictor is trained, let's use it to predict book prices for the test data!

You should always run a final performance assessment using data that was unseen by the predictor (the test data). Because test data is not used during training, it gives an unbiased performance assessment. In this case, you will use the test data to make predictions and submit those to MLU Leaderboard in the next step.

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to show the first few rows of the test dataset that you need to predict book prices for (each row corresponds to a book). You should notice that it looks just like the training dataset - with one exception; there is no column for price! That makes sense as this is what you are trying to predict with AutoGluon.
    </span>
</div>

In [None]:
# Take a look at the test dataset

df_test.head()

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Use the predictor you just trained to predict book prices for each row (book) in the test dataset. To create predictions, use the <code>predict</code> method.
    </span>
</div>



__TIP:__ Look at the excerpt from the AutoGluon documentation below to see how to create predictions using `TabularPredictor.predict(data, model, *, decision_threshold)`. Make sure to replace `TabularPredictor` with the name of the trained predictor (for example `smaller_predictor`).

```
TabularPredictor.predict(data, model, *, decision_threshold)

"""
    Parameters:
    ----------
    data (`TabularDataset` or `pd.DataFrame`): The data to make predictions for (e.g. df_test).

    model (str): The name of the model to get predictions from (default uses the highest scoring model on the validation set).

    as_pandas (True/False): Whether to return the output as a `pd.Series` (True) or `np.ndarray` (False).

    decision_threshold (float): The decision threshold used to convert prediction probabilities to predictions (relevant for binary classification).

    
    Returns:
    -------
    Array of predictions, one corresponding to each row in given dataset. 
"""
```

In [None]:
############## CODE HERE ####################



############## END OF CODE ####################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("predict_prices_small_training")

### <a name="p1-6">Part I - 6. First MLU Leaderboard Submission (with small training data)</a>
#### Now you are ready for your first submission to MLU Leaderboard!

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to save your prediction file in the format expected by MLU Leaderboard.
    </span>
</div>

__NOTE__: For this cell to work you __need to complete Part I - 5__. Otherwise, you will not have the `price_prediction` variable that is needed for the submission file, and running the cell below __will raise an error__. If you got stuck in the previous section, have a look at the solution with `course_utils.answer_html("predict_prices_small_training")`.

In [None]:
# Run this cell

# Define empty dataset with column headers ID & Price
df_submission = pd.DataFrame(columns=["ID", "Price"])
# Creating ID column from ID list
df_submission["ID"] = df_test["ID"].tolist()
# Creating label column from price prediction list
df_submission["Price"] = price_prediction
# saving your csv file for Leaderboard submission
df_submission.to_csv("./../data/predictions/Prediction_to_Leaderboard.csv", index=False)

#### Let's do a quick check to see if the file is ok!

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard. If the difference is zero you are good to go!
    </span>
</div>
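The check in the next cell compares the two ID columns element by element and counts the positions where they differ. A minimal sketch of that idea, using made-up IDs (the values below are hypothetical and not from the test file):

```python
import pandas as pd

# Hypothetical ID columns
expected_ids = pd.Series([101, 102, 103, 104])
matching_ids = pd.Series([101, 102, 103, 104])
swapped_ids = pd.Series([101, 103, 102, 104])  # two IDs in the wrong position

# `!=` yields a boolean Series; summing it counts the True values
n_diff_matching = int((expected_ids != matching_ids).sum())
n_diff_swapped = int((expected_ids != swapped_ids).sum())

print("Differences (matching):", n_diff_matching)
print("Differences (swapped): ", n_diff_swapped)
```

A result of 0 means every ID sits in the same position as in the test file. Note that pandas compares Series by index label, so both columns need the same length and index for the comparison to work.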



In [None]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./../data/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_submission["ID"]).sum(),
)

#### Downloading the prediction file and submitting

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
1. Download the file you just saved to your local machine. <br/>
2. Go <a href="https://mlu.corp.amazon.com/contests/redirect/7">here</a> to submit your file. <br/>
3. Follow the instructions on the MLU Leaderboard submission page.
    </span>
</div>


__NOTE__: Navigate to the parent folder containing the lab notebooks, <code>AutoML-Course</code>. You can now find your submission file in subfolder <code>data > predictions</code> as a csv file and download it to your local machine for submission to the MLU Leaderboard.

<img src="./../images/download_solution.png" alt="screenshot solution folder" width="400" style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 50%;"/>

### <a name="p1-7">Part I - 7. Second MLU Leaderboard Submission (with full training data)</a>

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Now that you made your first submission using the small training sample from your dataset, repeat the process using the full dataset. Create predictions for the test set, and submit again to see if your score gets better.
    </span>
</div>


__NOTE__: If you don't know how to write the code for this, uncomment the challenge answer; copy and paste it in the section below. It should take around 12-15 minutes to run this training with your CPU. Just in case, use the `time_limit` parameter (in seconds) to limit the run time to 20 minutes.



In [None]:
############## CODE HERE ####################



############## END OF CODE ####################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("predict_prices_full_training")

### Get the second submission for MLU Leaderboard ready

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to save your prediction file in the format expected by MLU Leaderboard.
    </span>
</div>


In [None]:
# Use full predictor to create predictions for the test data
full_prediction = full_predictor.predict(df_test)

# Define empty dataset with column headers ID & Price
df_full_submission = pd.DataFrame(columns=["ID", "Price"])

# Creating ID column from ID list
df_full_submission["ID"] = df_test["ID"].tolist()

# Creating label column from price prediction list
df_full_submission["Price"] = full_prediction

# saving your csv file for Leaderboard submission
df_full_submission.to_csv(
    "./../data/predictions/Full_Prediction_to_Leaderboard.csv", index=False
)

#### Let's do a quick check to see if the file is ok!

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard. If the difference is zero you are good to go!
    </span>
</div>


In [None]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./../data/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_full_submission["ID"]).sum(),
)

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Submit again to MLU leaderboard to improve your score. For the submission use the same link as before <a href="https://mlu.corp.amazon.com/contests/redirect/7">here</a>.
    </span>
</div>


___
## <a name="2">Part II - Advanced AutoGluon (OPTIONAL)</a>

Now that you have made your Leaderboard submissions, let's practice using some advanced features of AutoGluon. <br/>
- Part II - 1. <a href="#p2-1">Explainability: Feature Importance</a>
- Part II - 2. <a href="#p2-2">Data Preprocessing: Cleaning & Missing Values</a>
- Part II - 3. <a href="#p2-3">Final (optional) MLU Leaderboard Submission (with full engineered data)</a>
- Part II - 4. <a href="#p2-4">Before You Go (clean up model artifacts)</a>

### <a name="p2-1">Part II (optional) - 1. Explainability</a>

There are growing business needs and legislative regulations that require explanations of why a model made a certain decision.<br/>
To better understand your trained predictor, you can estimate the overall importance of each feature.

#### Feature Importance
A feature's importance score represents the performance drop that results when the model makes predictions on a perturbed copy of the dataset in which this feature's values have been randomly shuffled across rows. A feature score of 0.01 would indicate that the predictive performance dropped by 0.01 when the feature was randomly shuffled. The higher a feature's score, the more important it is to the model's performance. If a feature has a negative score, the feature is likely harmful to the final model, and a model trained without that feature would be expected to achieve better predictive performance.
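The shuffling idea described above can be sketched in a few lines of plain Python. This is a simplified, single-shuffle illustration and not AutoGluon's implementation; the `toy_model`, the data, and the helper names are all hypothetical:

```python
import random


def neg_mse(y_true, y_pred):
    # Negated mean squared error, so that a higher score means a better model
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_model(row):
    # Hypothetical "trained model" that only looks at the first feature
    return 2.0 * row[0]


def permutation_importance(rows, targets, feature_idx, rng):
    # Score on the untouched data
    baseline = neg_mse(targets, [toy_model(r) for r in rows])
    # Shuffle one feature column across rows, leaving the others as they are
    shuffled_col = [r[feature_idx] for r in rows]
    rng.shuffle(shuffled_col)
    perturbed = [list(r) for r in rows]
    for row, value in zip(perturbed, shuffled_col):
        row[feature_idx] = value
    shuffled_score = neg_mse(targets, [toy_model(r) for r in perturbed])
    # Importance = how much the score dropped after shuffling
    return baseline - shuffled_score


rng = random.Random(0)
rows = [[float(i), rng.random()] for i in range(20)]  # feature 0 is useful, feature 1 is noise
targets = [2.0 * r[0] for r in rows]

importance_useful = permutation_importance(rows, targets, 0, rng)
importance_noise = permutation_importance(rows, targets, 1, rng)
print("Useful feature:", importance_useful)
print("Noise feature: ", importance_noise)
```

Shuffling the feature the model relies on hurts the score, so its importance is positive. Shuffling the feature the model ignores changes nothing, so its importance is zero.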



<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the code below to see the output of the AutoGluon feature importance function for the first model that was trained with only 1000 samples.
    </span>
</div>

In [None]:
# Run the code below to see feature importance
smaller_predictor.feature_importance(df_train_smaller)

### <a name="p2-2">Part II (optional) - 2. Data Preprocessing</a>

With AutoGluon you don't have to worry about which model to choose as AutoGluon automatically tries all applicable models; this means you can focus on the data itself. 
In the book price dataset, there are a few columns which are clearly very poorly encoded, most importantly the ```Edition``` column. <br/>

### Data Cleaning

For this section, use the small dataset __df_train_smaller__ to make everything run a bit faster. In the code cell below you will find some helper functions that you can use for data cleaning.


In [None]:
# Run this cell

# Import needed libraries
import re
import pandas as pd


# Function to extract the leading number from a string (e.g. "4.0 out of 5" -> 4.0)
def first_num(in_val):
    num_string = in_val.split(" ")[0]
    digits = re.sub(r"[^0-9\.]", "", num_string)
    return float(digits)


# Function to extract the year from a string
def year_get(in_val):
    m = re.compile(r"\d{4}").findall(in_val)
    # print(in_val, m)
    if len(m) > 0:
        return int(m[0])
    else:
        return None


# Function to extract an abbreviated month name from a string
def month_get(in_val):
    m = re.compile(r"Jan |Feb |Mar |Apr |May |Jun |Jul |Aug |Sep |Oct |Nov |Dec ").findall(in_val)
    # print(in_val, m)
    if len(m) > 0:
        return m[0]
    else:
        return "None"


# To drop features and save the new dataframe, you can use <name_of_df>.drop([<features_to_drop>], axis=1, inplace=True)

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Using the helper functions above, try to write code that cleans up the columns and potentially create new columns too. You can find a suggested sequence of steps below.
    </span>
</div>
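Before applying the helper functions to whole columns, it can help to see what they return for single strings. The sketch below repeats the three helpers so that it runs on its own; the sample strings are hypothetical, and the real column values may be formatted differently:

```python
import re


# Copies of the notebook's helper functions, repeated so this sketch is self-contained
def first_num(in_val):
    num_string = in_val.split(" ")[0]
    digits = re.sub(r"[^0-9\.]", "", num_string)
    return float(digits)


def year_get(in_val):
    m = re.findall(r"\d{4}", in_val)
    return int(m[0]) if m else None


def month_get(in_val):
    m = re.findall(r"Jan |Feb |Mar |Apr |May |Jun |Jul |Aug |Sep |Oct |Nov |Dec ", in_val)
    return m[0] if m else "None"


# Hypothetical raw values
sample_review = "4.0 out of 5 stars"
sample_rating = "1,234 customer reviews"
sample_edition = "Paperback,- 10 Mar 2016"
edition_without_date = "Hardcover,- Import"

review_score = first_num(sample_review)          # leading number of the string
rating_count = first_num(sample_rating)          # thousands separator is removed
edition_year = year_get(sample_edition)          # first 4-digit group
edition_month = month_get(sample_edition)        # keeps the trailing space of the match
missing_year = year_get(edition_without_date)    # no year found
missing_month = month_get(edition_without_date)  # no month found
binding = sample_edition.split(",")[0]           # text before the first comma

print(review_score, rating_count, edition_year, repr(edition_month), binding)
print(missing_year, repr(missing_month))
```

Two details are worth keeping in mind: `month_get` returns the match including its trailing space (for example `"Mar "`), and when nothing is found `year_get` returns `None` while `month_get` returns the string `"None"`.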

To perform data cleaning, follow the steps below:
1. Create a copy of the dataframe (to compare results later); call it ```train_data_feateng```.
2. Split the column ```Edition``` into three new ones: ```hard-paper```, ```year``` and ```month```.
3. Create two numerical features based on the features ```Reviews``` and ```Ratings```, named ```Reviews-n``` and ```Ratings-n``` respectively.
4. Drop the old columns from the dataset: ```Edition```,  ```Reviews``` and ```Ratings```. 

Please try to solve the challenge before uncommenting the answer.

In [None]:
############## CODE HERE ####################



############## END OF CODE ####################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("feature_engineering")

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Now take a look at the resulting dataset with the new features to inspect if everything worked as expected.    </span>
</div>


In [None]:
# Run this cell

train_data_feateng.head(2)

### Identifying Missing values
By doing the feature engineering above you introduced a new potential problem; you might now have some missing data.

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Try to identify the features that may have missing values and how many are missing.   
    </span>
</div>


Please try to solve the challenge before uncommenting the answer below. 

In [None]:
############## CODE HERE ####################



############## END OF CODE ####################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("missing_values")

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Let's train the model again with these new manually created features.
    </span>
</div>



In [None]:
############## CODE HERE ####################



############## END OF CODE ####################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("feature_eng_predictor")

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Compare the AutoGluon leaderboard for the new <code>feateng_predictor</code> to the AutoGluon leaderboard for the <code>smaller_predictor</code> from Part I in the cells below. <br><br>    
Are there any significant differences?
    </span>
</div>

In [None]:
############## FIRST CODE HERE ####################



############## END OF CODE ########################################

In [None]:
############## SECOND CODE HERE ####################



############## END OF CODE #########################################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("compare_leaderboards")

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; float: left; margin: 15px;">
<ol>
    <li>Run the AutoGluon <code>feature_importance</code> function for the original smaller dataset in the first cell below.</li>
<li>Run the <code>feature_importance</code> function again for the modified dataset in the second cell below.</li>
<li>Compare the results: Are there any significant differences?</li>
</ol>
    </span>
</div>

In [None]:
############## CODE HERE FOR THE ORIGINAL DATASET FEATURE IMPORTANCE ##################



############## END OF CODE ############################################################

In [None]:
############## CODE HERE FOR THE FEATURE ENGINEERED DATASET FEATURE IMPORTANCE #####################



############## END OF CODE #########################################################################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("feature_importances")

### <a name="p2-3">Part II (optional) - 3. MLU Leaderboard Submission (with full engineered data)</a>
Let's create the full engineered dataset to train a final AutoGluon model & let's also allocate more time to really get the best results.

__NOTE__: As there are few columns in this dataset, you can't necessarily expect additional performance improvement.

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Now it is time to train your model with the feature engineered dataset. For this iteration you can use a time limit of 20 min.
    </span>
</div>

In [None]:
full_feateng = df_train.copy()

# Clean up the features by applying the helper functions
full_feateng["Reviews-n"] = full_feateng["Reviews"].apply(first_num)
full_feateng["Ratings-n"] = full_feateng["Ratings"].apply(first_num)
full_feateng["hard-paper"] = full_feateng["Edition"].apply(lambda x: x.split(",")[0])
full_feateng["year"] = full_feateng["Edition"].apply(year_get)
full_feateng["month"] = full_feateng["Edition"].apply(month_get)

# Drop the original feature columns - they are no longer needed
full_feateng.drop(["Edition", "Ratings", "Reviews"], axis=1, inplace=True)

In [None]:
# Train the new predictor
enhanced_predictor = TabularPredictor(
    label="Price", eval_metric="mean_squared_error"
).fit(train_data=full_feateng, time_limit=20 * 60)

### Time to make your final (and optional) submission to MLU Leaderboard

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Now make a final prediction and submit this to MLU leaderboard.<br> Keep in mind that you used an engineered version of the dataset for training the AutoGluon predictor and that you need to apply the same transformation to the test data before you can call <code>.predict()</code>:
    </span>
</div>
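One way to confirm that train and test data went through the same transformation is to compare their column names. The sketch below is only an illustration: `feature_mismatch` is a hypothetical helper (not part of AutoGluon or `course_utils`), and the toy frames contain made-up values:

```python
import pandas as pd


def feature_mismatch(train_df, test_df, label):
    """Return (features missing from test, unexpected columns in test)."""
    train_features = set(train_df.columns) - {label}
    test_features = set(test_df.columns)
    missing = sorted(train_features - test_features)
    extra = sorted(test_features - train_features)
    return missing, extra


# Toy frames: the training data was transformed, the test data was not
train_toy = pd.DataFrame({"ID": [1], "year": [2016], "Reviews-n": [4.0], "Price": [250.0]})
test_toy = pd.DataFrame({"ID": [2], "Reviews": ["4.5 out of 5 stars"]})

missing_in_test, extra_in_test = feature_mismatch(train_toy, test_toy, label="Price")
print("Missing in test:", missing_in_test)
print("Unexpected in test:", extra_in_test)
```

Once the test data has been transformed in the same way as `full_feateng`, a call such as `feature_mismatch(full_feateng, test_data_feateng, label="Price")` should return two empty lists.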


In [None]:
# Create a copy of the test data
test_data_feateng = df_test.copy()

# Apply feature extraction and transformations to test data too
test_data_feateng["Reviews-n"] = test_data_feateng["Reviews"].apply(first_num)
test_data_feateng["Ratings-n"] = test_data_feateng["Ratings"].apply(first_num)
test_data_feateng["hard-paper"] = test_data_feateng["Edition"].apply(
    lambda x: x.split(",")[0]
)
test_data_feateng["year"] = test_data_feateng["Edition"].apply(year_get)
test_data_feateng["month"] = test_data_feateng["Edition"].apply(month_get)

# Drop the original feature columns - they are no longer needed
test_data_feateng.drop(["Edition", "Ratings", "Reviews"], axis=1, inplace=True)

Add the code below to create predictions and the output file.

In [None]:
############## CODE HERE ####################



############## END OF CODE ####################

In [None]:
## CHALLENGE ANSWER
# course_utils.answer_html("final_submission")

#### Let's do a quick check to see if the file is ok!

<div style="border: 5px solid white; display: flex; align-items: center; justify-content: center; background-color:#330066;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="./../images/MLU_robot.png" alt="MLU robot" width="100"/>
    <span style="color:white; font-weight: bold; padding-left: 20px; float: left; margin: 15px;">
Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard. If the difference is zero you are good to go!
    </span>
</div>


In [None]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./../data/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_enhanced_submission["ID"]).sum(),
)

## Congrats!
In the next module, __Code Walkthrough and Advanced AutoGluon__, we are going to walk through your solutions and also show a notebook that implements an __end-to-end__ solution.

### <a name="p2-4">Part II (optional) - 4. Before You Go</a>

After you are done with this hands-on notebook, you can clean up all model artifacts by uncommenting and executing the cell below.

__It's always a good practice to clean up everything when you are done.__

In [None]:
# !rm -r AutogluonModels smaller_predictor