![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 2</a>

## Final Project:  Tree-based Models

Build, train, and tune a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), using [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) or [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for hyperparameter tuning, to predict the __label__ field (substitute or not substitute) of the Amazon product substitute dataset.

### Final Project Problem: Product Substitute Prediction

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the tasks in this notebook and submit your model's predictions as a CSV file to the leaderboard: __https://mlu.corp.amazon.com/contests/redirect/35__

1. <a href="#1">Read the datasets</a> (Given) 
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Train - Validation - Test Datasets</a>
    * <a href="#24">Data Processing with Pipeline</a>
3. <a href="#3">Train (and Tune) a Classifier on the Training Dataset</a> (Implement)
4. <a href="#4">Make Predictions on the Test Dataset</a> (Implement)
5. <a href="#5">Write the Test Predictions to a CSV file</a> (Given)


__Datasets and Files:__


* __training.csv__: Training data with product pair features and corresponding labels:
> - `ID:` ID of the record
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).
> - `key_asin ...:` Key product ASIN features 
> - `cand_asin ...:` Candidate product ASIN features 


* __public_test_features.csv__: Test data with product pair features __without__ labels:
> - `ID:` ID of the record
> - `key_asin ...:` Key product ASIN features 
> - `cand_asin ...:` Candidate product ASIN features 


* __metadata-dataset.xlsx__: Provides detailed information about all key_ and cand_ columns in the training and test sets. Try to select some useful features to include in the model, as not all of them are suitable. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


* __Sample submission file:__ See the sample file `sample-submission.csv` in the `data/final_project` folder.


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)

First, we read the __training__ and __test__ datasets into dataframes using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data.

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
training_data = pd.read_csv('../../data/final_project/training.csv')
test_data = pd.read_csv('../../data/final_project/public_test_features.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)

The shape of the training dataset is: (36803, 228)
The shape of the test dataset is: (15774, 227)


## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>)

### 2.1 <a name="21">Exploratory Data Analysis</a>
(<a href="#2">Go to Data Processing</a>)

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [2]:
# Implement here

training_data.head()

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,34016,0,1,1,B00YCZ6IKA,kitchen,Kitchen,NICLW,,base_product,...,0.529104,pounds,5.11811,,18-Apr-13,14-Oct-17,N,13-Oct-17,2867,
1,3581,0,1,1,B00U25WT7A,office_product,Office Product,,,base_product,...,0.1,pounds,4.5,,19-May-16,21-Mar-18,N,20-Mar-18,65,
2,36025,1,1,1,B011BZ3GXU,consumer_electronics,Electronics,,,base_product,...,0.654773,pounds,3.937008,,10-Dec-15,16-Feb-18,N,15-Feb-18,1532,
3,42061,1,1,1,B0089XDG3I,pc,Personal Computers,,,base_product,...,3.549442,pounds,10.314961,,19-Oct-12,15-Feb-18,N,14-Feb-18,13964,
4,14628,1,1,1,B014UTSBZW,miscellaneous,Misc.,ZUKC7,1.0,base_product,...,0.396832,pounds,5.19685,,26-Jul-12,9-Mar-18,N,9-Mar-18,1253,


In [3]:
# Implement here

test_data.head()

Unnamed: 0,ID,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,key_classification_description,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,35057,1,1,B0096M8VR2,pc,Personal Computers,,1.0,base_product,Base Product,...,0.925932,pounds,5.826772,,10-Apr-13,5-Jul-16,N,4-Jul-16,699,
1,41573,1,1,B00EAQJCWW,kitchen,Kitchen,BUNN9,2.0,base_product,Base Product,...,,,,,17-Mar-16,17-Mar-16,N,17-Mar-16,2,
2,44029,1,1,B013P93YOQ,toy,Toy,,,base_product,Base Product,...,,,,,23-Dec-15,2-Dec-17,N,2-Dec-17,17,
3,6462,1,1,B00SKJPKGW,wireless_phone_accessory,Wireless Phone Accessory,PIQ22,1.0,base_product,Base Product,...,6.25,pounds,9.7,,22-Jan-15,18-Jan-17,N,18-Jan-17,25351,
4,17533,1,1,B001DCEKXM,sports,Sports,SUUNR,1.0,base_product,Base Product,...,0.176368,pounds,3.228346,,4-Jan-11,16-Nov-17,N,16-Nov-17,7424,


In [4]:
# Implement more EDA here
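
As a starting point, here is a minimal sketch of a few further EDA checks. It runs on a small made-up frame (`toy_df` and its values are hypothetical, not taken from the dataset) so that it is self-contained; in the notebook the same calls would be applied to `training_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for training_data
toy_df = pd.DataFrame({
    "label": [0, 1, 1, 0, 1],
    "key_item_weight": [0.5, np.nan, 1.2, 3.4, np.nan],
    "key_classification_code": ["base_product", "base_product", None,
                                "base_product", "base_product"],
})

# Column types and non-null counts
toy_df.info()

# Summary statistics of the numerical columns
print(toy_df.describe())

# Number of missing values per column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Class balance of the target
label_counts = toy_df["label"].value_counts()
print(label_counts)
```

The missing-value counts and the class balance are worth checking early, since they inform the imputation strategy and the choice of evaluation metric.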


### 2.2 <a name="22">Select features to build the model</a>
(<a href="#2">Go to Data Processing</a>)

Previously, we recommended using only a few of the numerical features for both key_ and cand_ ASINs: __item_package_quantity__, __item_height__, __item_width__, __item_length__, __item_weight__, __pkg_height__, __pkg_width__, __pkg_length__, __pkg_weight__.

We learned how to use __categorical data__ today. Let's select some categorical variables to add to the model, such as __classification_code__, __has_ean__, __has_online_play__.

We also discussed text vectorization, so you can include text fields such as __item_name__ in the model. Feel free to explore other fields from the metadata-dataset.xlsx file.

__Note: Be careful about the missing text values when you are cleaning and stemming your text. Refer to the class exercise: MLA-TAB-DAY2-TREE-MODELS-NB.ipynb.__
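
One way to guard against missing text values is sketched below on a made-up series (`toy_text` is hypothetical). It assumes a plain bag-of-words vectorizer and skips stemming; the class exercise may use a different cleaning routine. Missing entries are replaced with empty strings first, because `CountVectorizer` does not accept `NaN` documents.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical item names, one of them missing
toy_text = pd.Series(["USB Cable 2m", np.nan, "usb charger"], name="key_item_name")

# Replace missing values with empty strings, then lowercase
clean_text = toy_text.fillna("").str.lower()

# Binary bag-of-words: 1 if a token occurs in the item name, else 0
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(clean_text)

print(sorted(vectorizer.vocabulary_))
print(bow.toarray())
```

The row that was originally missing becomes an all-zero vector, which a tree-based model can handle like any other input.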

In [5]:
# Grab model features/inputs and target/output
numerical_features = ["key_item_package_quantity", 
                      "key_item_height", "key_item_width", "key_item_length", "key_item_weight", 
                      "key_pkg_height", "key_pkg_width", "key_pkg_length", "key_pkg_weight",
                      "cand_item_package_quantity", 
                      "cand_item_height", "cand_item_width", "cand_item_length", "cand_item_weight", 
                      "cand_pkg_height", "cand_pkg_width", "cand_pkg_length", "cand_pkg_weight"]

categorical_features = ["key_classification_code", "key_has_ean", "key_has_online_play", 
                      "cand_classification_code", "cand_has_ean", "cand_has_online_play"]

text_features = ["key_item_name", "cand_item_name"]

model_features = numerical_features + categorical_features + text_features

model_target = 'label'

### 2.3 <a name="23">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

We plan to use [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) or [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) from sklearn. Both perform k-fold cross-validation by default, so we don't need a separate validation set unless we would like to evaluate the model before submitting to the leaderboard.
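
If you do want a held-out validation set for a final check, a stratified split is one option. The sketch below uses a made-up frame (`toy_df` is hypothetical); the 80/20 ratio and the random seed are arbitrary choices, not values prescribed by the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a balanced binary label
toy_df = pd.DataFrame({
    "key_item_weight": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Hold out 20% of the rows, keeping the label proportions in both parts
train_df, val_df = train_test_split(
    toy_df,
    test_size=0.2,
    shuffle=True,
    stratify=toy_df["label"],
    random_state=23,
)

print("Train shape:", train_df.shape)
print("Validation shape:", val_df.shape)
```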

### 2.4 <a name="24">Data processing with Pipeline and ColumnTransformer</a>
(<a href="#2">Go to Data Processing</a>)

Use a ColumnTransformer to combine the preprocessing steps for the numerical, categorical, and text features, so that the transformations learned on the training data are applied consistently to the training, validation, and test datasets.

In [6]:
# Implement here
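
A minimal sketch of such a preprocessor follows, assuming one column of each type in a made-up frame (`toy_train` and the step names are hypothetical). The imputation strategies and the vectorizer settings are illustrative choices, and the real project would pass the full feature lists defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training frame with missing values in every column
toy_train = pd.DataFrame({
    "key_item_weight": [0.5, np.nan, 1.2, 3.4],
    "key_classification_code": ["base_product", np.nan,
                                "variation_parent", "base_product"],
    "key_item_name": ["usb cable", np.nan, "usb charger", "steel kettle"],
})

# Text vectorizers cannot handle NaN, so fill missing text beforehand
toy_train["key_item_name"] = toy_train["key_item_name"].fillna("")

# Numerical features: replace missing values with the column mean
numerical_processor = Pipeline([
    ("num_imputer", SimpleImputer(strategy="mean")),
])

# Categorical features: fill missing values, then one-hot encode
categorical_processor = Pipeline([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Text features: binary bag-of-words with a capped vocabulary
text_processor = Pipeline([
    ("text_vectorizer", CountVectorizer(binary=True, max_features=50)),
])

# A text column is passed as a plain string so the vectorizer gets 1-D input
data_processor = ColumnTransformer([
    ("numerical_processing", numerical_processor, ["key_item_weight"]),
    ("categorical_processing", categorical_processor, ["key_classification_code"]),
    ("text_processing", text_processor, "key_item_name"),
])

processed = data_processor.fit_transform(toy_train)
print("Processed shape:", processed.shape)
```

Each text column needs its own entry in the ColumnTransformer, because `CountVectorizer` works on a single column at a time.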



## 3. <a name="3">Train (and Tune) a Classifier</a> (Implement)
(<a href="#0">Go to top</a>)

Train and tune a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), using [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) or [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for hyperparameter tuning.

In [7]:
# Implement here
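
The sketch below shows the general shape of a randomized search over a pipeline. It uses synthetic data from `make_classification` in place of the project data, and the step name `rf`, the candidate values, and the number of iterations are all assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the processed training data
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# In the project, the data processor would come before the classifier step
pipeline = Pipeline([
    ("rf", RandomForestClassifier(random_state=0)),
])

# Parameter names are prefixed with the pipeline step name and "__"
param_distributions = {
    "rf__n_estimators": [10, 25, 50],
    "rf__max_depth": [3, 5, 10, None],
    "rf__min_samples_leaf": [1, 5, 10],
}

# Sample 5 combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_toy, y_toy)
```

Placing the preprocessing inside the pipeline means it is refit on each training fold, which avoids leaking information from the held-out fold.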



When the hyperparameter search is over, print the optimal combination of hyperparameters and the corresponding validation metric. Also, grab the best estimator, equipped with the optimal combination of hyperparameters, for further use.

In [8]:
# Implement here
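
A fitted search object exposes these results through the attributes `best_params_`, `best_score_`, and `best_estimator_`. The sketch below demonstrates them on a small grid search over synthetic data; the variable names (`grid_search`, `best_classifier`) and the grid values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=1)

# Exhaustively try all 4 combinations with 3-fold cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=1),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X_toy, y_toy)

# Best combination and its mean cross-validated score
print("Best hyperparameters:", grid_search.best_params_)
print("Best mean cross-validation accuracy:", grid_search.best_score_)

# With the default refit=True, this estimator is retrained on all of the data
best_classifier = grid_search.best_estimator_
```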



## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained classifier to predict the labels on the test set. The test accuracy will be displayed upon a valid submission to the leaderboard.

In [9]:
# Implement here

# Get test data to test the classifier
# ! test data should come from public_test_features.csv !
# ...

# Use the trained model to make predictions on the test dataset
# test_predictions = ...
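
The prediction step itself is a single `predict` call on the same feature columns used for training. The sketch below illustrates this with made-up frames (`toy_train`, `toy_test`, and `toy_features` are hypothetical); in the notebook, the best estimator from the search would be applied to `test_data[model_features]`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature list and toy data
toy_features = ["key_item_weight", "cand_item_weight"]

toy_train = pd.DataFrame({
    "key_item_weight": [0.5, 0.6, 3.0, 3.2, 0.4, 2.9],
    "cand_item_weight": [0.5, 0.7, 0.2, 0.3, 0.4, 0.1],
    "label": [1, 1, 0, 0, 1, 0],
})

# Test data has the same feature columns but no label column
toy_test = pd.DataFrame({
    "key_item_weight": [0.55, 3.1, 0.45],
    "cand_item_weight": [0.6, 0.2, 0.5],
})

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(toy_train[toy_features], toy_train["label"])

# One predicted label (0 or 1) per test row, in the original row order
toy_test_predictions = classifier.predict(toy_test[toy_features])
print(toy_test_predictions)
```

Because the predictions keep the row order of the test frame, they can be paired directly with the `ID` column when writing the submission file.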


## 5. <a name="5">Write the test predictions to a CSV file</a> (Given)
(<a href="#0">Go to top</a>)

Use the following code to write the test predictions to a CSV file. Download the CSV file from the SageMaker instance to your local machine, and upload it to __https://mlu.corp.amazon.com/contests/redirect/35__

In [None]:
import pandas as pd

result_df = pd.DataFrame(columns=["ID", "label"])
result_df["ID"] = test_data["ID"].tolist()
result_df["label"] = test_predictions

result_df.to_csv("../../data/final_project/project_day2_result.csv", index=False)

In [None]:
print('Double-check submission file against the sample-submission.csv')
sample_submission_df = pd.read_csv('../../data/final_project/sample-submission.csv')
print('Differences between project_day2_result IDs and sample submission IDs:',(sample_submission_df['ID'] != result_df['ID']).sum())