![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 3</a>

## Final Project:  Neural Networks and AutoML

You have two options today: build a neural network with PyTorch, or use [AutoGluon](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) to predict the __label__ field (substitute or not substitute) of the Amazon product substitute dataset. 
__Note: If you use AutoGluon, you don't need to follow the steps in this notebook. Just refer to the AutoGluon notebook from class or [this tutorial](https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-quickstart.html)__.

### Final Project Problem: Product Substitute Prediction

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the tasks in this notebook and submit your network's predictions as a CSV file to the leaderboard: __https://mlu.corp.amazon.com/contests/redirect/35__

1. <a href="#1">Read the datasets</a> (Given) 
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Train - Validation Datasets</a>
    * <a href="#24">Data Processing with Pipeline and ColumnTransformer</a>
3. <a href="#3">Network Training and Validation on the Training Dataset</a> (Implement)
4. <a href="#4">Make Predictions on the Test Dataset</a> (Implement)
5. <a href="#5">Write the Test Predictions to a CSV file</a> (Given)


__Datasets and Files:__


* __training.csv__: Training data with product pair features and corresponding labels:
> - `ID:` ID of the record
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).
> - `key_asin ...:` Key product ASIN features 
> - `cand_asin ...:` Candidate product ASIN features 


* __public_test_features.csv__: Test data with product pair features __without__ labels:
> - `ID:` ID of the record
> - `key_asin ...:` Key product ASIN features 
> - `cand_asin ...:` Candidate product ASIN features 


* __metadata-dataset.xlsx__: Provides detailed information about all key_ and cand_ columns in the training and test sets. Try to select some useful features to include in the model, as not all of them are suitable. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


* __Sample submission file:__ See sample-submission.csv in the data/final_project folder.


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)

The final project files are the training and test data files, a sample submission file, and a metadata file.

We read the __training__ and __test__ datasets into dataframes using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html), a library for reading and manipulating tabular data.

In [12]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
training_data = pd.read_csv('../../data/final_project/training.csv')
test_data = pd.read_csv('../../data/final_project/public_test_features.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)

The shape of the training dataset is: (36803, 228)
The shape of the test dataset is: (15774, 227)


## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>)

### 2.1 <a name="21">Exploratory Data Analysis</a>
(<a href="#2">Go to Data Processing</a>)

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [13]:
# Implement here

training_data.head()

| | ID | label | key_Region Id | key_MarketPlace Id | key_ASIN | key_Binding Code | key_binding_description | key_brand_code | key_case_pack_quantity | key_classification_code | ... | cand_pkg_weight | cand_pkg_weight_uom | cand_pkg_width | cand_release_date_embargo_level | cand_dw_creation_date | cand_dw_last_updated | cand_is_deleted | cand_last_updated | cand_version | cand_external_testing_certification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34016 | 0 | 1 | 1 | B00YCZ6IKA | kitchen | Kitchen | NICLW | | base_product | ... | 0.529104 | pounds | 5.11811 | | 18-Apr-13 | 14-Oct-17 | N | 13-Oct-17 | 2867 | |
| 1 | 3581 | 0 | 1 | 1 | B00U25WT7A | office_product | Office Product | | | base_product | ... | 0.1 | pounds | 4.5 | | 19-May-16 | 21-Mar-18 | N | 20-Mar-18 | 65 | |
| 2 | 36025 | 1 | 1 | 1 | B011BZ3GXU | consumer_electronics | Electronics | | | base_product | ... | 0.654773 | pounds | 3.937008 | | 10-Dec-15 | 16-Feb-18 | N | 15-Feb-18 | 1532 | |
| 3 | 42061 | 1 | 1 | 1 | B0089XDG3I | pc | Personal Computers | | | base_product | ... | 3.549442 | pounds | 10.314961 | | 19-Oct-12 | 15-Feb-18 | N | 14-Feb-18 | 13964 | |
| 4 | 14628 | 1 | 1 | 1 | B014UTSBZW | miscellaneous | Misc. | ZUKC7 | 1.0 | base_product | ... | 0.396832 | pounds | 5.19685 | | 26-Jul-12 | 9-Mar-18 | N | 9-Mar-18 | 1253 | |


In [14]:
# Implement here

test_data.head()

| | ID | key_Region Id | key_MarketPlace Id | key_ASIN | key_Binding Code | key_binding_description | key_brand_code | key_case_pack_quantity | key_classification_code | key_classification_description | ... | cand_pkg_weight | cand_pkg_weight_uom | cand_pkg_width | cand_release_date_embargo_level | cand_dw_creation_date | cand_dw_last_updated | cand_is_deleted | cand_last_updated | cand_version | cand_external_testing_certification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35057 | 1 | 1 | B0096M8VR2 | pc | Personal Computers | | 1.0 | base_product | Base Product | ... | 0.925932 | pounds | 5.826772 | | 10-Apr-13 | 5-Jul-16 | N | 4-Jul-16 | 699 | |
| 1 | 41573 | 1 | 1 | B00EAQJCWW | kitchen | Kitchen | BUNN9 | 2.0 | base_product | Base Product | ... | | | | | 17-Mar-16 | 17-Mar-16 | N | 17-Mar-16 | 2 | |
| 2 | 44029 | 1 | 1 | B013P93YOQ | toy | Toy | | | base_product | Base Product | ... | | | | | 23-Dec-15 | 2-Dec-17 | N | 2-Dec-17 | 17 | |
| 3 | 6462 | 1 | 1 | B00SKJPKGW | wireless_phone_accessory | Wireless Phone Accessory | PIQ22 | 1.0 | base_product | Base Product | ... | 6.25 | pounds | 9.7 | | 22-Jan-15 | 18-Jan-17 | N | 18-Jan-17 | 25351 | |
| 4 | 17533 | 1 | 1 | B001DCEKXM | sports | Sports | SUUNR | 1.0 | base_product | Base Product | ... | 0.176368 | pounds | 3.228346 | | 4-Jan-11 | 16-Nov-17 | N | 16-Nov-17 | 7424 | |


In [15]:
# Implement more EDA here
# Print info about the training dataset
print("Training Dataset Info:")
training_data.info()

print("\nTraining Dataset Summary Statistics:")
print(training_data.describe())

print("\nMissing Values in Training Dataset:")
print(training_data.isnull().sum())

# Print info about the test dataset
print("\nTest Dataset Info:")
test_data.info()

print("\nTest Dataset Summary Statistics:")
print(test_data.describe())

print("\nMissing Values in Test Dataset:")
print(test_data.isnull().sum())

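# Check unique values in a few categorical columns.
# These names match the categorical_features list from Section 2.2, which is
# not defined yet when the notebook runs top to bottom, so list them here.
eda_categorical_columns = ["key_classification_code", "key_has_ean", "key_has_online_play",
                           "cand_classification_code", "cand_has_ean", "cand_has_online_play"]
for col in eda_categorical_columns:
    print(f"\nUnique values in {col}:")
    print(training_data[col].value_counts())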


Training Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36803 entries, 0 to 36802
Columns: 228 entries, ID to cand_external_testing_certification
dtypes: float64(90), int64(12), object(126)
memory usage: 64.0+ MB

Training Dataset Summary Statistics:
                 ID         label  key_Region Id  key_MarketPlace Id  \
count  36803.000000  36803.000000        36803.0             36803.0   
mean   26341.493438      0.505095            1.0                 1.0   
std    15159.339391      0.499981            0.0                 0.0   
min        1.000000      0.000000            1.0                 1.0   
25%    13250.500000      0.000000            1.0                 1.0   
50%    26318.000000      1.000000            1.0                 1.0   
75%    39455.500000      1.000000            1.0                 1.0   
max    52576.000000      1.000000            1.0                 1.0   

       key_case_pack_quantity  key_country_of_origin  key_discontinued_date  \
coun

#### Dataset features

Let's now print the features of our dataset.

In [16]:
# Implement here

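# Check sample text in the text columns.
# text_features is only defined in Section 2.2, so name the columns directly here.
for col in ["key_item_name", "cand_item_name"]: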
    print(f"\nSample text in {col}:")
    print(training_data[col].sample(5))


Sample text in key_item_name:
26433    Dual Output Portable Charger, Oripow Spark A6 ...
19451    VALTERRA PRODUCTS, INC. B8202X 2x200' BACKWASH...
17251    InstaMagic Hair Straightener Brush with LED Di...
25450    Grandma Lucys Freeze-Dried Grain-Free Pet Foo...
18666    Underground Toys Star Wars Home Kitchen Storag...
Name: key_item_name, dtype: object

Sample text in cand_item_name:
14862                                 Mighty Wheel Tractor
28176    Twin Sized SoundAsleep Dream Series Air Mattre...
36648                      Lilax Girls' Racerback Tank Top
17588    Pokemon Platinum Rising Rivals #111 Snorlax LV...
3011     GOTD 10PC Plastic Nail Art Soak Off Cap Clip U...
Name: cand_item_name, dtype: object


### 2.2 <a name="22">Select features to build the model</a> (Suggested)
(<a href="#2">Go to Data Processing</a>)

Previously, we recommended using only a few of the numerical features for both key_ and cand_ ASINs: __item_package_quantity__, __item_height__, __item_width__, __item_length__, __item_weight__, __pkg_height__, __pkg_width__, __pkg_length__, __pkg_weight__.

We learned how to use __categorical data__ today. Let's select some categorical variables to add to the model, such as __classification_code__, __has_ean__, __has_online_play__.

We also discussed text vectorization, so you can include text fields such as __item_name__ in the model. Feel free to explore other fields from the metadata-dataset.xlsx file. Keep in mind that sklearn expects a separate transformer for each text field.

__Note: Be careful about the missing text values when you are cleaning and stemming your text. Refer to the class exercise: MLA-TAB-DAY2-TREE-MODELS-NB.ipynb.__
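
As a rough illustration of the two points above (one transformer per text field, and missing text values), here is a minimal sketch on a toy dataframe. The toy rows, the choice of `CountVectorizer`, and its `binary` and `max_features` settings are assumptions for illustration only, not the class solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the project data; the real columns come from training.csv
toy = pd.DataFrame({
    "key_item_name": ["USB charger 2-pack", None, "Steel water bottle"],
    "cand_item_name": ["USB wall charger", "Kids tank top", None],
})

# Replace missing names with empty strings so the vectorizers only see str values
text_cols = ["key_item_name", "cand_item_name"]
toy[text_cols] = toy[text_cols].fillna("")

# One vectorizer per text column. Passing the column name as a plain string
# (not a list) hands the vectorizer a 1-D column of documents.
text_preprocessor = ColumnTransformer([
    ("key_text", CountVectorizer(binary=True, max_features=50), "key_item_name"),
    ("cand_text", CountVectorizer(binary=True, max_features=50), "cand_item_name"),
])

text_matrix = text_preprocessor.fit_transform(toy)
print("Vectorized text shape:", text_matrix.shape)
```

In the project, the same two entries could be appended to the numerical and categorical entries of a larger ColumnTransformer, with the missing values filled before fitting.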

__Creating Better Features (Optional Extra Hint):__ Here is an extra hint if you want to reduce the number of features. Because the problem compares two products, it is natural to express features as differences between the candidate and key products. You can create a new set of features from the difference between cand_ and key_ columns, such as diff_item_height, diff_pkg_weight, and diff_classification_code. Take the absolute difference for numerical features, and create a binary same-or-not feature for categorical features. 
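
The difference features described in this hint could be built along these lines. This is a sketch on a toy dataframe: the helper name `add_difference_features`, the toy values, and the choice to treat a missing category as "different" are assumptions, not part of the assignment.

```python
import pandas as pd

# Toy rows using the key_/cand_ naming of the project columns (values are made up)
toy = pd.DataFrame({
    "key_item_height": [2.0, 5.0, None],
    "cand_item_height": [2.5, 1.0, 3.0],
    "key_classification_code": ["base_product", "base_product", None],
    "cand_classification_code": ["base_product", "variation_parent", None],
})

def add_difference_features(df, numerical_bases, categorical_bases):
    """Return a copy of df with diff_<name> columns (hypothetical helper)."""
    out = df.copy()
    for base in numerical_bases:
        # Absolute difference; stays NaN when either side is missing
        out[f"diff_{base}"] = (out[f"cand_{base}"] - out[f"key_{base}"]).abs()
    for base in categorical_bases:
        key_col, cand_col = out[f"key_{base}"], out[f"cand_{base}"]
        # "Same" requires both values to be present and equal
        same = (key_col == cand_col) & key_col.notna() & cand_col.notna()
        # 1 = different (or missing), 0 = same
        out[f"diff_{base}"] = (~same).astype(int)
    return out

toy_diff = add_difference_features(toy, ["item_height"], ["classification_code"])
print(toy_diff[["diff_item_height", "diff_classification_code"]])
```

The numerical diff columns still contain missing values, so they would go through the same imputation step as the other numerical features.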

In [17]:
# Grab model features/inputs and target/output
numerical_features = ["key_item_package_quantity", 
                      "key_item_height", "key_item_width", "key_item_length", "key_item_weight", 
                      "key_pkg_height", "key_pkg_width", "key_pkg_length", "key_pkg_weight",
                      "cand_item_package_quantity", 
                      "cand_item_height", "cand_item_width", "cand_item_length", "cand_item_weight", 
                      "cand_pkg_height", "cand_pkg_width", "cand_pkg_length", "cand_pkg_weight"]

categorical_features = ["key_classification_code", "key_has_ean", "key_has_online_play", 
                      "cand_classification_code", "cand_has_ean", "cand_has_online_play"]

text_features = ["key_item_name", "cand_item_name"]

model_features = numerical_features + categorical_features + text_features

model_target = 'label'

### 2.3 <a name="23">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

To monitor network training, use sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function on the training set, to get a validation subset.

In [18]:
# Implement here
from sklearn.model_selection import train_test_split

# Split the training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    training_data[model_features], 
    training_data[model_target], 
    test_size=0.2, 
    random_state=42
)

print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)


Training set shape: (29442, 26)
Validation set shape: (7361, 26)


### 2.4 <a name="24">Data processing with Pipeline and ColumnTransformer</a>
(<a href="#2">Go to Data Processing</a>)

Use a single ColumnTransformer to process the data for training, validation, and test, so that the transformations fitted on the training data are applied identically to all three datasets.

In [19]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Numerical features processing
num_processor = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical features processing
cat_processor = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine processors (excluding text features)
preprocessor = ColumnTransformer([
    ('num', num_processor, numerical_features),
    ('cat', cat_processor, categorical_features)
], remainder='drop')

# Fit the preprocessor and transform the data
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(test_data[model_features])

print("Processed training set shape:", X_train_processed.shape)
print("Processed validation set shape:", X_val_processed.shape)
print("Processed test set shape:", X_test_processed.shape)

Processed training set shape: (29442, 32)
Processed validation set shape: (7361, 32)
Processed test set shape: (15774, 32)
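
One thing to watch if you extend the preprocessor: ColumnTransformer can return a SciPy sparse matrix instead of a NumPy array when the stacked output is sparse enough (this is governed by its `sparse_threshold` parameter), for example after adding text vectorizers or high-cardinality one-hot columns. A sparse matrix cannot be passed straight to a PyTorch tensor constructor. The helper below is a small sketch for that case; the name `to_dense_float32` is hypothetical.

```python
import numpy as np
from scipy import sparse

def to_dense_float32(matrix):
    """Return a dense float32 NumPy array from a dense or sparse matrix (hypothetical helper)."""
    if sparse.issparse(matrix):
        # Densify; fine for a few hundred columns, costly for very wide text matrices
        matrix = matrix.toarray()
    return np.asarray(matrix, dtype=np.float32)

# Works the same on both kinds of input
dense_in = np.array([[1.0, 0.0], [0.0, 2.0]])
sparse_in = sparse.csr_matrix(dense_in)

dense_out = to_dense_float32(dense_in)
sparse_out = to_dense_float32(sparse_in)
print(dense_out.dtype, sparse_out.shape)
```

Applying such a helper to the processed train, validation, and test matrices before building tensors keeps the later cells unchanged whichever form the preprocessor returns.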


## 3. <a name="3">Network Training and Validation</a> (Implement)
(<a href="#0">Go to top</a>)

Train and validate a neural network with PyTorch.

In [23]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Convert processed data to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train_processed)
y_train_tensor = torch.LongTensor(y_train.values)
X_val_tensor = torch.FloatTensor(X_val_processed)
y_val_tensor = torch.LongTensor(y_val.values)

# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Define the neural network
class SubstituteNet(nn.Module):
    def __init__(self, input_size):
        super(SubstituteNet, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 2)
        self.dropout = nn.Dropout(0.3)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout(x)
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

# Initialize the model, loss function, and optimizer
model = SubstituteNet(X_train_tensor.shape[1])
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 30
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    # Validation
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_val_tensor)
        val_loss = criterion(val_outputs, y_val_tensor)
        _, predicted = torch.max(val_outputs.data, 1)
        val_accuracy = (predicted == y_val_tensor).sum().item() / len(y_val_tensor)
    
    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {total_loss/len(train_loader):.4f}, Val Loss: {val_loss.item():.4f}, Val Accuracy: {val_accuracy:.4f}")

print("Training completed!")

Epoch [1/30], Train Loss: 0.6812, Val Loss: 0.6756, Val Accuracy: 0.5753
Epoch [2/30], Train Loss: 0.6723, Val Loss: 0.6698, Val Accuracy: 0.5846
Epoch [3/30], Train Loss: 0.6678, Val Loss: 0.6699, Val Accuracy: 0.5851
Epoch [4/30], Train Loss: 0.6641, Val Loss: 0.6645, Val Accuracy: 0.5937
Epoch [5/30], Train Loss: 0.6616, Val Loss: 0.6651, Val Accuracy: 0.5923
Epoch [6/30], Train Loss: 0.6600, Val Loss: 0.6633, Val Accuracy: 0.5986
Epoch [7/30], Train Loss: 0.6575, Val Loss: 0.6603, Val Accuracy: 0.5992
Epoch [8/30], Train Loss: 0.6566, Val Loss: 0.6573, Val Accuracy: 0.6035
Epoch [9/30], Train Loss: 0.6542, Val Loss: 0.6593, Val Accuracy: 0.6018
Epoch [10/30], Train Loss: 0.6546, Val Loss: 0.6576, Val Accuracy: 0.6040
Epoch [11/30], Train Loss: 0.6529, Val Loss: 0.6605, Val Accuracy: 0.5973
Epoch [12/30], Train Loss: 0.6520, Val Loss: 0.6601, Val Accuracy: 0.6025
Epoch [13/30], Train Loss: 0.6509, Val Loss: 0.6611, Val Accuracy: 0.6037
Epoch [14/30], Train Loss: 0.6504, Val Loss: 0.
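
In the log above, validation loss flattens around epochs 8 to 10 while training loss keeps falling, so later epochs may add little. One common response is early stopping: remember the best validation loss and stop after a fixed number of epochs without improvement. The sketch below replays the validation losses printed above through a hypothetical `EarlyStopping` helper; the class name and the `patience` value are assumptions, not part of the assignment.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop (hypothetical helper)."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Validation losses copied from epochs 1 to 13 of the log above
val_losses = [0.6756, 0.6698, 0.6699, 0.6645, 0.6651, 0.6633, 0.6603,
              0.6573, 0.6593, 0.6576, 0.6605, 0.6601, 0.6611]

stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}; best epoch {stopper.best_epoch} "
      f"with val loss {stopper.best_loss:.4f}")
```

Inside the real training loop, the call to `step` would go right after the validation block, and you would typically also save a copy of `model.state_dict()` whenever the best loss improves.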

## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained network to predict the labels on the test set. Test accuracy will be displayed after a valid submission to the leaderboard.

In [24]:
# Get test data to test the classifier
# Note: test data should come from public_test_features.csv, which we've already loaded as test_data

# Convert processed test data to PyTorch tensor
X_test_tensor = torch.FloatTensor(X_test_processed)

# Make predictions
model.eval()
with torch.no_grad():
    test_outputs = model(X_test_tensor)
    _, test_predictions = torch.max(test_outputs.data, 1)

# Convert predictions to numpy array
test_predictions = test_predictions.cpu().numpy()

print("Predictions made on test dataset.")
print("Shape of predictions:", test_predictions.shape)

# If you need probabilities instead of class predictions
test_probabilities = torch.softmax(test_outputs, dim=1).cpu().numpy()
print("Shape of probability predictions:", test_probabilities.shape)

Predictions made on test dataset.
Shape of predictions: (15774,)
Shape of probability predictions: (15774, 2)


## 5. <a name="5">Write the test predictions to a CSV file</a> (Given)
(<a href="#0">Go to top</a>)

Use the following code to write the test predictions to a CSV file. Download the CSV file from the SageMaker instance to your local machine, then upload it to __https://mlu.corp.amazon.com/contests/redirect/35__

In [25]:
import pandas as pd

result_df = pd.DataFrame(columns=["ID", "label"])
result_df["ID"] = test_data["ID"].tolist()
result_df["label"] = test_predictions

result_df.to_csv("../../data/final_project/project_day3_result.csv", index=False)

In [26]:
print('Double-check submission file against sample-submission.csv')
sample_submission_df = pd.read_csv('../../data/final_project/sample-submission.csv')
print('Differences between project_day3_result IDs and sample submission IDs:',(sample_submission_df['ID'] != result_df['ID']).sum())

Double-check submission file against sample-submission.csv
Differences between project_day3_result IDs and sample submission IDs: 0
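
The ID comparison above assumes both files have the same number of rows. A slightly broader check could also look at column names and label values. The sketch below assumes the sample file has exactly the columns `ID` and `label`; the helper name `check_submission` and the toy frames are hypothetical.

```python
import pandas as pd

def check_submission(result_df, sample_df):
    """Return a list of problems; an empty list means the file looks consistent."""
    problems = []
    if list(result_df.columns) != list(sample_df.columns):
        problems.append("column names or order differ from the sample file")
    if len(result_df) != len(sample_df):
        problems.append(f"row count {len(result_df)} != sample row count {len(sample_df)}")
    elif not (result_df["ID"].to_numpy() == sample_df["ID"].to_numpy()).all():
        problems.append("IDs are not in the same order as the sample file")
    if not result_df["label"].isin([0, 1]).all():
        problems.append("labels must be 0 or 1")
    return problems

# Toy frames standing in for sample-submission.csv and result_df
sample = pd.DataFrame({"ID": [35057, 41573, 44029], "label": [0, 0, 0]})
good = pd.DataFrame({"ID": [35057, 41573, 44029], "label": [1, 0, 1]})
bad = pd.DataFrame({"ID": [41573, 35057, 44029], "label": [1, 2, 0]})

good_problems = check_submission(good, sample)
bad_problems = check_submission(bad, sample)
print(good_problems)
print(bad_problems)
```

With the notebook's own variables, the call would be `check_submission(result_df, sample_submission_df)`.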
