# <a name="0">Machine Learning Lab</a>

Build a classifier to predict the __label__ field (substitute or not substitute) of the product substitute dataset.

### Final Project Problem: Product Substitute Prediction

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the tasks in this notebook and submit your notebook via Colab.

1. <a href="#1">Read the datasets</a> (Given)
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Train - Validation Datasets</a>
    * <a href="#24">Data Processing with Pipeline</a>
3. <a href="#3">Train (and Tune) a Classifier on the Training Dataset</a> (Implement)
4. <a href="#4">Make Predictions on the Test Dataset</a> (Implement)


__Datasets and Files:__


* __training.csv__: Training data with product pair features and corresponding labels:
> - `ID:` ID of the record
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __public_test_features.csv__: Test data with product pair features __without__ labels:
> - `ID:` ID of the record
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __metadata-dataset.xlsx__: Provides detailed information about all key_ and cand_ columns in the training and test sets. Try to select some useful features to include in the model, as not all of them are suitable. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)
<br>
<a href="https://propensity-labs-screening.s3.amazonaws.com/machine_learning/ml_data.zip">Download Dataset</a>

Then, we read the __training__ and __test__ datasets into dataframes.

In [1]:
import pandas as pd

# Read datasets
train_data = pd.read_csv("training.csv")
test_data = pd.read_csv("public_test_features.csv")
metadata = pd.read_excel("metadata-dataset.xlsx")



## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>)

### 2.1 <a name="21">Exploratory Data Analysis</a>

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [2]:
# Implement EDA here
train_data.head()


Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,34016,0,1,1,B00YCZ6IKA,kitchen,Kitchen,NICLW,,base_product,...,0.529104,pounds,5.11811,,18-Apr-13,14-Oct-17,N,13-Oct-17,2867,
1,3581,0,1,1,B00U25WT7A,office_product,Office Product,,,base_product,...,0.1,pounds,4.5,,19-May-16,21-Mar-18,N,20-Mar-18,65,
2,36025,1,1,1,B011BZ3GXU,consumer_electronics,Electronics,,,base_product,...,0.654773,pounds,3.937008,,10-Dec-15,16-Feb-18,N,15-Feb-18,1532,
3,42061,1,1,1,B0089XDG3I,pc,Personal Computers,,,base_product,...,3.549442,pounds,10.314961,,19-Oct-12,15-Feb-18,N,14-Feb-18,13964,
4,14628,1,1,1,B014UTSBZW,miscellaneous,Misc.,ZUKC7,1.0,base_product,...,0.396832,pounds,5.19685,,26-Jul-12,9-Mar-18,N,9-Mar-18,1253,


In [3]:
test_data.head()

Unnamed: 0,ID,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,key_classification_description,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,35057,1,1,B0096M8VR2,pc,Personal Computers,,1.0,base_product,Base Product,...,0.925932,pounds,5.826772,,10-Apr-13,5-Jul-16,N,4-Jul-16,699,
1,41573,1,1,B00EAQJCWW,kitchen,Kitchen,BUNN9,2.0,base_product,Base Product,...,,,,,17-Mar-16,17-Mar-16,N,17-Mar-16,2,
2,44029,1,1,B013P93YOQ,toy,Toy,,,base_product,Base Product,...,,,,,23-Dec-15,2-Dec-17,N,2-Dec-17,17,
3,6462,1,1,B00SKJPKGW,wireless_phone_accessory,Wireless Phone Accessory,PIQ22,1.0,base_product,Base Product,...,6.25,pounds,9.7,,22-Jan-15,18-Jan-17,N,18-Jan-17,25351,
4,17533,1,1,B001DCEKXM,sports,Sports,SUUNR,1.0,base_product,Base Product,...,0.176368,pounds,3.228346,,4-Jan-11,16-Nov-17,N,16-Nov-17,7424,


In [4]:
metadata.head()

Unnamed: 0,Column Name,Data Type,Description
0,REGION_ID,"NUMBER(2,0)",DW specific locale identifier. Referenced in d...
1,MARKETPLACE_ID,NUMBER,Unique identifier for a marketplace. A replace...
2,ASIN,CHAR(10),Amazon Standard Item Number sometimes also kno...
3,BINDING,VARCHAR2(96),This former books term is used across all prod...
4,BINDING_DESCRIPTION,VARCHAR2(100),Text description of the above binding column. ...


In [5]:
train_data.shape

(36803, 228)

In [6]:
test_data.shape

(15774, 227)

In [7]:
metadata.shape

(113, 3)

In [8]:
train_data.isnull().sum()

ID                                         0
label                                      0
key_Region Id                              0
key_MarketPlace Id                         0
key_ASIN                                   0
                                       ...  
cand_dw_last_updated                       0
cand_is_deleted                            0
cand_last_updated                          0
cand_version                               0
cand_external_testing_certification    36226
Length: 228, dtype: int64

In [9]:
test_data.isnull().sum()

ID                                         0
key_Region Id                              0
key_MarketPlace Id                         0
key_ASIN                                   0
key_Binding Code                        2017
                                       ...  
cand_dw_last_updated                       0
cand_is_deleted                            0
cand_last_updated                          0
cand_version                               0
cand_external_testing_certification    15515
Length: 227, dtype: int64
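
Raw null counts are hard to compare across columns, so a possible next step is to look at the fraction of missing values per column and at the class balance. The sketch below uses a small toy frame (`toy` is a hypothetical stand-in, not the real `train_data`) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the real frame has 228 columns.
toy = pd.DataFrame({
    "label": [0, 1, 1, 0],
    "key_item_height": [1.0, np.nan, 2.5, np.nan],
    "cand_item_height": [np.nan, np.nan, np.nan, 3.0],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Share of each class in the label column.
class_share = toy["label"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same two expressions applied to `train_data` would show which columns are mostly empty and whether the two labels are roughly balanced.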

### 2.2 <a name="22">Select features to build the model</a>

For a quick start, we recommend using only a few of the numerical features for both key_ and cand_ ASINs: __item_package_quantity__, __item_height__, __item_width__, __item_length__, __item_weight__, __pkg_height__, __pkg_width__, __pkg_length__, __pkg_weight__. Feel free to explore other fields from the metadata-dataset.xlsx file.


In [10]:
# Implement here
selected_features = ['key_item_package_quantity', 'cand_item_package_quantity', 'key_item_height', 'cand_item_height', 'key_item_width', 'cand_item_width', 'key_item_length', 'cand_item_length',
                     'key_item_weight', 'cand_item_weight', 'key_pkg_height', 'cand_pkg_height', 'key_pkg_width', 'cand_pkg_width', 'key_pkg_length', 'cand_pkg_length', 'key_pkg_weight', 'cand_pkg_weight']
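
Because every recommended measurement appears once with the `key_` prefix and once with the `cand_` prefix, the same list can also be generated instead of typed out. This is only a sketch; `base_features` and `generated_features` are names introduced here for illustration.

```python
# Base measurements recommended in the text above.
base_features = [
    "item_package_quantity",
    "item_height", "item_width", "item_length", "item_weight",
    "pkg_height", "pkg_width", "pkg_length", "pkg_weight",
]

# Pair every base name with both product prefixes (key first, then cand).
generated_features = [
    f"{prefix}_{name}" for name in base_features for prefix in ("key", "cand")
]
print(len(generated_features))  # 18
```

Adding another field from the metadata file then only requires appending its base name to `base_features`.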


### 2.3 <a name="23">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

We already have training and test datasets, but the test dataset has no labels; predicting them is the goal of the project.

To produce a validation set for evaluating model performance before submitting, split the training dataset into train and validation sets. The validation data you get here will be used later in section 3 to tune your classifier.

In [11]:
# Implement here
from sklearn.model_selection import train_test_split

# Split train_data into train and validation sets
train, validation = train_test_split(train_data, test_size=0.2, random_state=42)

# Separate features and labels
X_train = train[selected_features]
y_train = train['label']
X_val = validation[selected_features]
y_val = validation['label']
X_test = test_data[selected_features]
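
One optional refinement is a stratified split, which keeps the label proportions the same in the train and validation parts. The sketch below demonstrates this on a toy frame (`toy`, `toy_train`, and `toy_val` are hypothetical names); whether it changes anything noticeable on the real data depends on how balanced the labels are.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for train_data: 70% label 0, 30% label 1.
toy = pd.DataFrame({"feature": range(100), "label": [0] * 70 + [1] * 30})

# stratify keeps the label proportions (nearly) identical in both parts.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
print(toy_train["label"].mean(), toy_val["label"].mean())
```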


In [12]:
X_train.isnull().sum()

key_item_package_quantity      2405
cand_item_package_quantity     4290
key_item_height                8327
cand_item_height              11922
key_item_width                 8327
cand_item_width               11922
key_item_length                8327
cand_item_length              11922
key_item_weight               11442
cand_item_weight              14075
key_pkg_height                 2795
cand_pkg_height                5914
key_pkg_width                  2795
cand_pkg_width                 5914
key_pkg_length                 2795
cand_pkg_length                5914
key_pkg_weight                 2927
cand_pkg_weight                6137
dtype: int64

In [13]:
X_test.isnull().sum()

key_item_package_quantity     1237
cand_item_package_quantity    2313
key_item_height               4345
cand_item_height              6369
key_item_width                4345
cand_item_width               6369
key_item_length               4345
cand_item_length              6369
key_item_weight               6036
cand_item_weight              7455
key_pkg_height                1450
cand_pkg_height               3240
key_pkg_width                 1450
cand_pkg_width                3240
key_pkg_length                1450
cand_pkg_length               3240
key_pkg_weight                1532
cand_pkg_weight               3334
dtype: int64

### 2.4 <a name="24">Data processing with Pipeline</a>

Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values, scale the numerical features, and finally train the classifier on the imputed and scaled dataset.


In [14]:
# Implement here

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define a pipeline with an imputer and the random forest classifier
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with median
    ('classifier', RandomForestClassifier())
])
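
The pipeline above imputes but does not scale. Tree-based models such as random forests are generally insensitive to feature scaling, but if a scaling step is wanted (for example, to swap in a different classifier later), a variant could look like the sketch below. The name `scaled_pipeline`, the toy arrays, and the small `n_estimators` value are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute, then standardize, then classify.
scaled_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Toy data with missing values, only to show that the pipeline runs end to end.
X_toy = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0], [4.0, 8.0]])
y_toy = np.array([0, 0, 1, 1])
scaled_pipeline.fit(X_toy, y_toy)
print(scaled_pipeline.predict(X_toy))
```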


## 3. <a name="3">Train (and Tune) a Classifier</a> (Implement)
(<a href="#0">Go to top</a>)

Train and tune the classifier

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Define parameter grid for hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Perform Grid Search CV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# best_estimator_ was already refit on all of X_train with the best parameters
# (GridSearchCV uses refit=True by default), so no extra fit call is needed
best_pipeline = grid_search.best_estimator_

# Predict the labels for the validation set
predictions = best_pipeline.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, predictions)
print("Accuracy:", accuracy)

Best Parameters: {'classifier__max_depth': 20, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 200}
Accuracy: 0.6516777611737535
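
The grid above has 3 x 3 x 3 x 3 = 81 combinations, so with `cv=5` the search fits 405 models before the final refit. If that is too slow, one alternative is `RandomizedSearchCV`, which samples a fixed number of combinations from the same grid. The sketch below runs on synthetic data with smaller settings (`toy_pipeline`, `toy_grid`, `n_iter=5`, and `cv=3` are illustrative assumptions, not tuned choices).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, only so that the example is runnable on its own.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Same parameter names as the grid above, with smaller forests for speed.
toy_grid = {
    "classifier__n_estimators": [10, 20, 30],
    "classifier__max_depth": [None, 10, 20],
    "classifier__min_samples_split": [2, 5, 10],
    "classifier__min_samples_leaf": [1, 2, 4],
}

# Try only 5 randomly sampled combinations instead of all 81.
random_search = RandomizedSearchCV(
    toy_pipeline, toy_grid, n_iter=5, cv=3, scoring="accuracy", random_state=0
)
random_search.fit(X_toy, y_toy)
print(random_search.best_params_)
```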


In [18]:
from sklearn.metrics import classification_report

# Predict the labels for the validation set
predictions = best_pipeline.predict(X_val)

# Calculate classification report
report = classification_report(y_val, predictions)

# Print classification report
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.66      0.60      0.63      3637
           1       0.64      0.70      0.67      3724

    accuracy                           0.65      7361
   macro avg       0.65      0.65      0.65      7361
weighted avg       0.65      0.65      0.65      7361
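
As a quick consistency check on the report above, the F1 score of each class is the harmonic mean of its precision and recall, and the overall accuracy is the support-weighted average of the per-class recalls. The sketch below recomputes both from the rounded values in the table, so small rounding differences are possible; the helper `f1` is introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall copied from the report above.
f1_class0 = f1(0.66, 0.60)
f1_class1 = f1(0.64, 0.70)
print(round(f1_class0, 2), round(f1_class1, 2))  # 0.63 0.67

# Accuracy as the support-weighted average of per-class recall.
accuracy_estimate = (0.60 * 3637 + 0.70 * 3724) / (3637 + 3724)
print(round(accuracy_estimate, 2))  # 0.65
```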



## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained classifier to predict the labels on the test set. Test accuracy will be displayed upon a valid submission to the leaderboard.

In [17]:
# Implement here
X_test = test_data[selected_features]

# Make predictions
predictions = best_pipeline.predict(X_test)

# Create a DataFrame for predictions
predictions_df = pd.DataFrame({'pair_id': test_data['ID'], 'label': predictions})

# Save predictions to a CSV file
predictions_df.to_csv('predictions.csv', index=False)
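
Before submitting, it can help to sanity-check the prediction frame. The sketch below is one possible check, not a required step: `check_submission` is a hypothetical helper, and it assumes the `pair_id` and `label` column names used in the cell above and that labels must be 0 or 1.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a prediction frame (empty if none)."""
    problems = []
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test set")
    elif list(submission["pair_id"]) != list(expected_ids):
        problems.append("IDs differ from the test set or are out of order")
    # Any value outside {0, 1}, including NaN, is reported here.
    if set(submission["label"].unique()) - {0, 1}:
        problems.append("labels other than 0/1 present")
    return problems

# Toy example: one valid frame and one with an unexpected label value.
ids = [35057, 41573, 44029]
good = pd.DataFrame({"pair_id": ids, "label": [0, 1, 1]})
bad = pd.DataFrame({"pair_id": ids, "label": [0, 2, 1]})
print(check_submission(good, ids))
print(check_submission(bad, ids))
```

With the real data, the call would be `check_submission(predictions_df, test_data['ID'])`.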