# <a name="0">Machine Learning Lab</a>

Build a classifier to predict the __label__ field (substitute or not substitute) of the product substitute dataset.

### Final Project Problem: Product Substitute Prediction

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the tasks in this notebook and submit your notebook via Colab.

1. <a href="#1">Read the datasets</a> (Given)
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Train - Validation Datasets</a>
    * <a href="#24">Data Processing with Pipeline</a>
3. <a href="#3">Train (and Tune) a Classifier on the Training Dataset</a> (Implement)
4. <a href="#4">Make Predictions on the Test Dataset</a> (Implement)


__Datasets and Files:__


* __training.csv__: Training data with product pair features and corresponding labels:
> - `ID:` ID of the record
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __public_test_features.csv__: Test data with product pair features __without__ labels:
> - `ID:` ID of the record
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __metadata-dataset.xlsx__: Provides detailed information about all key_ and cand_ columns in the training and test sets. Try to select some useful features to include in the model, as not all of them are suitable. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)
<br>
<a href="https://propensity-labs-screening.s3.amazonaws.com/machine_learning/ml_data.zip">Download Dataset</a>

Then, we read the __training__ and __test__ datasets into dataframes.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

train_data = pd.read_csv("/content/training.csv")
test_data = pd.read_csv("/content/public_test_features.csv")



## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>)

### 2.1 <a name="21">Exploratory Data Analysis</a>

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [2]:
# Implement EDA here
train_data.shape

(36803, 228)

In [3]:
train_data.isna().sum()

ID                                         0
label                                      0
key_Region Id                              0
key_MarketPlace Id                         0
key_ASIN                                   0
                                       ...  
cand_dw_last_updated                       0
cand_is_deleted                            0
cand_last_updated                          0
cand_version                               0
cand_external_testing_certification    36226
Length: 228, dtype: int64
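
Raw null counts are hard to compare across 228 columns. As a minimal sketch, assuming a small synthetic frame stands in for `train_data` (the values below are made up for illustration), the columns can be ranked by their fraction of missing values, and the class balance of `label` can be checked at the same time:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_data; the values are illustrative only.
demo = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "label": [0, 0, 1, 1],
    "key_case_pack_quantity": [np.nan, 1.0, np.nan, 12.0],
    "cand_external_testing_certification": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, most incomplete first
missing_fraction = demo.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Share of each class in the target column
label_balance = demo["label"].value_counts(normalize=True)
print(label_balance)
```

On the real data, replacing `demo` with `train_data` should give the same two summaries; a strongly skewed `label_balance` would suggest looking beyond plain accuracy.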

In [7]:
nan_columns = train_data.columns[train_data.isnull().all()].tolist()
nan_columns

['key_country_of_origin',
 'key_discontinued_date',
 'key_manufacturer_sku',
 'key_monthly_recurring_charge',
 'key_number_of_licenses',
 'key_number_of_pages',
 'key_number_of_points',
 'key_preferred_vendor',
 'key_publisher',
 'key_recall_external_identifier',
 'key_recall_notice_expiration_date',
 'key_recall_notice_publication_date',
 'key_recall_notice_receive_date',
 'cand_country_of_origin',
 'cand_discontinued_date',
 'cand_esrb_descriptors',
 'cand_manufacturer_sku',
 'cand_number_of_points',
 'cand_preferred_vendor',
 'cand_publisher',
 'cand_recall_external_identifier',
 'cand_recall_notice_expiration_date',
 'cand_recall_notice_publication_date',
 'cand_recall_notice_receive_date']

In [8]:
train_data=train_data.drop(columns=nan_columns)

In [12]:


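# Variance of each numeric column (numeric_only skips the text columns)
variances = train_data.var(numeric_only=True)

# Columns whose variance falls below this threshold are treated as near-constant
threshold = 0.1
low_variance_columns = variances[variances < threshold].index.tolist()

# Note: the result is stored in df; the following cells continue from train_data
df = train_data.drop(columns=low_variance_columns)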


In [14]:
train_data

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,34016,0,1,1,B00YCZ6IKA,kitchen,Kitchen,NICLW,,base_product,...,0.529104,pounds,5.118110,,18-Apr-13,14-Oct-17,N,13-Oct-17,2867,
1,3581,0,1,1,B00U25WT7A,office_product,Office Product,,,base_product,...,0.100000,pounds,4.500000,,19-May-16,21-Mar-18,N,20-Mar-18,65,
2,36025,1,1,1,B011BZ3GXU,consumer_electronics,Electronics,,,base_product,...,0.654773,pounds,3.937008,,10-Dec-15,16-Feb-18,N,15-Feb-18,1532,
3,42061,1,1,1,B0089XDG3I,pc,Personal Computers,,,base_product,...,3.549442,pounds,10.314961,,19-Oct-12,15-Feb-18,N,14-Feb-18,13964,
4,14628,1,1,1,B014UTSBZW,miscellaneous,Misc.,ZUKC7,1.0,base_product,...,0.396832,pounds,5.196850,,26-Jul-12,9-Mar-18,N,9-Mar-18,1253,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36798,9631,0,1,1,B0002ABA8E,consumer_electronics,Electronics,HEWL4,10.0,base_product,...,0.260000,pounds,5.100000,,9-Sep-16,21-Mar-18,N,20-Mar-18,60,
36799,16965,1,1,1,B000H46XQE,kitchen,Kitchen,CUIJ9,2.0,base_product,...,7.900000,pounds,12.500000,,6-Apr-13,30-May-17,N,29-May-17,298,
36800,50014,1,1,1,B01HFRC7UQ,miscellaneous,Misc.,,,base_product,...,7.000000,pounds,,,2-Nov-16,17-Jun-17,N,17-Jun-17,13,
36801,42674,1,1,1,B001T0HHDS,health_and_beauty,Health and Beauty,O3S14,12.0,base_product,...,3.000000,pounds,11.700000,,4-Jan-11,15-Nov-17,N,14-Nov-17,618058,


In [15]:
# Encode yes/no flags (e.g. cand_is_deleted) as 1/0 across the whole frame
mapping = {'y': 1, 'Y': 1, 'n': 0, 'N': 0}
train_data = train_data.replace(mapping)

In [16]:
train_data

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,34016,0,1,1,B00YCZ6IKA,kitchen,Kitchen,NICLW,,base_product,...,0.529104,pounds,5.118110,,18-Apr-13,14-Oct-17,0,13-Oct-17,2867,
1,3581,0,1,1,B00U25WT7A,office_product,Office Product,,,base_product,...,0.100000,pounds,4.500000,,19-May-16,21-Mar-18,0,20-Mar-18,65,
2,36025,1,1,1,B011BZ3GXU,consumer_electronics,Electronics,,,base_product,...,0.654773,pounds,3.937008,,10-Dec-15,16-Feb-18,0,15-Feb-18,1532,
3,42061,1,1,1,B0089XDG3I,pc,Personal Computers,,,base_product,...,3.549442,pounds,10.314961,,19-Oct-12,15-Feb-18,0,14-Feb-18,13964,
4,14628,1,1,1,B014UTSBZW,miscellaneous,Misc.,ZUKC7,1.0,base_product,...,0.396832,pounds,5.196850,,26-Jul-12,9-Mar-18,0,9-Mar-18,1253,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36798,9631,0,1,1,B0002ABA8E,consumer_electronics,Electronics,HEWL4,10.0,base_product,...,0.260000,pounds,5.100000,,9-Sep-16,21-Mar-18,0,20-Mar-18,60,
36799,16965,1,1,1,B000H46XQE,kitchen,Kitchen,CUIJ9,2.0,base_product,...,7.900000,pounds,12.500000,,6-Apr-13,30-May-17,0,29-May-17,298,
36800,50014,1,1,1,B01HFRC7UQ,miscellaneous,Misc.,,,base_product,...,7.000000,pounds,,,2-Nov-16,17-Jun-17,0,17-Jun-17,13,
36801,42674,1,1,1,B001T0HHDS,health_and_beauty,Health and Beauty,O3S14,12.0,base_product,...,3.000000,pounds,11.700000,,4-Jan-11,15-Nov-17,0,14-Nov-17,618058,


In [17]:
# Fill missing numeric values with the column mean (text columns are left as-is)
df_mean = train_data.fillna(train_data.mean(numeric_only=True))


In [18]:
df_mean

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,34016,0,1,1,B00YCZ6IKA,kitchen,Kitchen,NICLW,9.828326,base_product,...,0.529104,pounds,5.118110,,18-Apr-13,14-Oct-17,0,13-Oct-17,2867,
1,3581,0,1,1,B00U25WT7A,office_product,Office Product,,9.828326,base_product,...,0.100000,pounds,4.500000,,19-May-16,21-Mar-18,0,20-Mar-18,65,
2,36025,1,1,1,B011BZ3GXU,consumer_electronics,Electronics,,9.828326,base_product,...,0.654773,pounds,3.937008,,10-Dec-15,16-Feb-18,0,15-Feb-18,1532,
3,42061,1,1,1,B0089XDG3I,pc,Personal Computers,,9.828326,base_product,...,3.549442,pounds,10.314961,,19-Oct-12,15-Feb-18,0,14-Feb-18,13964,
4,14628,1,1,1,B014UTSBZW,miscellaneous,Misc.,ZUKC7,1.000000,base_product,...,0.396832,pounds,5.196850,,26-Jul-12,9-Mar-18,0,9-Mar-18,1253,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36798,9631,0,1,1,B0002ABA8E,consumer_electronics,Electronics,HEWL4,10.000000,base_product,...,0.260000,pounds,5.100000,,9-Sep-16,21-Mar-18,0,20-Mar-18,60,
36799,16965,1,1,1,B000H46XQE,kitchen,Kitchen,CUIJ9,2.000000,base_product,...,7.900000,pounds,12.500000,,6-Apr-13,30-May-17,0,29-May-17,298,
36800,50014,1,1,1,B01HFRC7UQ,miscellaneous,Misc.,,9.828326,base_product,...,7.000000,pounds,7.670509,,2-Nov-16,17-Jun-17,0,17-Jun-17,13,
36801,42674,1,1,1,B001T0HHDS,health_and_beauty,Health and Beauty,O3S14,12.000000,base_product,...,3.000000,pounds,11.700000,,4-Jan-11,15-Nov-17,0,14-Nov-17,618058,


In [19]:
# Keep only numeric columns for the baseline model
non_numeric_columns = df_mean.select_dtypes(exclude=['number']).columns.tolist()
df_numeric = df_mean.drop(columns=non_numeric_columns)
df_numeric

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_case_pack_quantity,key_ean,key_excluded_direct_browse_node_id,key_fedas_id,key_fma_qualified_price_max,key_Product Group Code,...,cand_unit_count,cand_upc,cand_variation_theme_id,cand_video_game_region,cand_pkg_height,cand_pkg_length,cand_pkg_weight,cand_pkg_width,cand_is_deleted,cand_version
0,34016,0,1,1,9.828326,3.228122e+10,1.931567e+09,100954.0,111.96,201,...,46.670352,8.854010e+11,8.000000,1.428571,1.574803,18.110236,0.529104,5.118110,0,2867
1,3581,0,1,1,9.828326,7.846730e+11,7.053480e+08,100954.0,15.71,229,...,46.670352,6.207103e+11,36.932808,1.428571,0.300000,6.750000,0.100000,4.500000,0,65
2,36025,1,1,1,9.828326,7.123230e+11,1.931567e+09,100954.0,43.37,107,...,46.670352,8.480610e+11,2.000000,1.428571,2.007874,5.236220,0.654773,3.937008,0,1532
3,42061,1,1,1,9.828326,5.570000e+12,1.931567e+09,100954.0,648.63,147,...,46.670352,6.676490e+11,36.932808,1.428571,2.401575,20.590551,3.549442,10.314961,0,13964
4,14628,1,1,1,1.000000,6.134230e+11,2.975436e+09,100954.0,23.85,199,...,1.000000,5.274219e+10,36.932808,1.428571,1.102362,7.874016,0.396832,5.196850,0,1253
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36798,9631,0,1,1,10.000000,8.039830e+11,2.659770e+08,100954.0,47.61,229,...,46.670352,7.013050e+11,36.932808,1.428571,2.000000,6.400000,0.260000,5.100000,0,60
36799,16965,1,1,1,2.000000,8.852250e+11,1.931567e+09,100954.0,123.71,79,...,46.670352,8.627905e+10,36.932808,1.428571,9.500000,12.800000,7.900000,12.500000,0,298
36800,50014,1,1,1,9.828326,8.409790e+11,1.657930e+08,100954.0,78.96,75,...,46.670352,8.409790e+11,36.932808,1.428571,3.497030,12.478694,7.000000,7.670509,0,13
36801,42674,1,1,1,12.000000,4.339631e+10,1.657960e+08,100954.0,132.98,510,...,1.000000,8.833500e+11,36.932808,1.428571,4.000000,12.500000,3.000000,11.700000,0,618058


In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Baseline: every numeric column (including ID) is used as a feature
X = df_numeric.drop(columns=['label'])
y = df_numeric['label']

# Hold out 20% of the labelled rows to estimate accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to the held-out split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)


Accuracy: 0.55060453742698


### 2.2 <a name="22">Select features to build the model</a>

For a quick start, we recommend using only a few of the numerical features for both key_ and cand_ ASINs: __item_package_quantity__, __item_height__, __item_width__, __item_length__, __item_weight__, __pkg_height__, __pkg_width__, __pkg_length__, __pkg_weight__. Feel free to explore other fields from the metadata-dataset.xlsx file.


In [None]:
# Implement here
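
One possible way to pick the suggested fields is sketched below. It assumes the columns are named by prefixing each field with `key_` or `cand_` (as in `cand_pkg_weight`, which appears in the outputs above); the helper `select_model_features` and the synthetic `demo` frame are hypothetical and only serve to make the example runnable on its own.

```python
import numpy as np
import pandas as pd

# Suggested numerical fields from the text; the key_/cand_ prefixed column
# names are an assumption about how they appear in training.csv.
BASE_FEATURES = [
    "item_package_quantity", "item_height", "item_width", "item_length",
    "item_weight", "pkg_height", "pkg_width", "pkg_length", "pkg_weight",
]


def select_model_features(df: pd.DataFrame) -> list:
    """Return the key_/cand_ versions of BASE_FEATURES that exist in df."""
    wanted = [
        f"{prefix}_{name}"
        for prefix in ("key", "cand")
        for name in BASE_FEATURES
    ]
    return [col for col in wanted if col in df.columns]


# Tiny synthetic stand-in for train_data so the sketch runs on its own.
demo = pd.DataFrame({
    "ID": [1, 2, 3],
    "label": [0, 1, 1],
    "key_pkg_weight": [0.5, np.nan, 3.0],
    "cand_pkg_weight": [0.4, 0.1, np.nan],
    "key_ASIN": ["B00YCZ6IKA", "B00U25WT7A", "B011BZ3GXU"],
})

model_features = select_model_features(demo)
print(model_features)
```

Filtering against `df.columns` means a misspelled or absent field is skipped silently, so it is worth comparing the length of `model_features` with the number of fields expected.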


### 2.3 <a name="23">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

We already have training and test datasets; however, the test dataset is missing the labels, and the goal of the project is to predict them.

To produce a validation set for evaluating model performance before submitting, split the training dataset into train and validation sets. The validation data you get here will be used later in section 3 to tune your classifier.

In [None]:
# Implement here
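
A minimal sketch of such a split, using scikit-learn's `train_test_split` on a synthetic stand-in for the training data. The 80/20 ratio, the `stratify` option, and the two feature columns are assumptions chosen for illustration, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled training data (illustrative values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "key_pkg_weight": rng.random(100),
    "cand_pkg_weight": rng.random(100),
    "label": [0, 1] * 50,
})

X = demo.drop(columns=["label"])
y = demo["label"]

# stratify=y keeps the class proportions the same in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_val.shape)
```

Fixing `random_state` makes the split reproducible, so validation scores from different runs remain comparable.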


### 2.4 <a name="24">Data processing with Pipeline</a>

Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values and scale the numerical features, and finally train the classifier on the imputed and scaled dataset.


In [None]:
# Implement here
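
The steps described above could be chained roughly as follows. This is a sketch under assumptions: mean imputation, standard scaling, and logistic regression are one reasonable combination rather than the required one, and the synthetic `X_demo`/`y_demo` data exists only so the example runs without the CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute -> scale -> classify, fitted as a single estimator
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Synthetic stand-in for the selected features (illustrative values).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "key_pkg_weight": rng.normal(size=200),
    "cand_pkg_weight": rng.normal(size=200),
})
y_demo = (X_demo["key_pkg_weight"] > X_demo["cand_pkg_weight"]).astype(int)

# Knock out every tenth value in the first column to exercise the imputer
X_demo.iloc[::10, 0] = np.nan

pipeline.fit(X_demo, y_demo)
demo_predictions = pipeline.predict(X_demo)
print("Training accuracy:", pipeline.score(X_demo, y_demo))
```

Because the imputer and scaler live inside the pipeline, their statistics are learned only from the data passed to `fit`, which avoids leaking validation or test information into preprocessing.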


## 3. <a name="3">Train (and Tune) a Classifier</a> (Implement)
(<a href="#0">Go to top</a>)

Train the classifier on the training split and tune it using the validation data from section 2.3.

In [None]:
# Implement here
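
One simple way to tune against a validation set is to fit one model per candidate setting and keep the setting with the best validation accuracy. The sketch below assumes logistic regression with its regularization strength `C` as the only tuned parameter; the candidate values, the helper `build_pipeline`, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(C: float) -> Pipeline:
    """Impute, scale, and classify with the given regularization strength."""
    return Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(C=C, max_iter=1000)),
    ])


# Synthetic stand-in for the selected features and labels.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "key_pkg_weight": rng.normal(size=300),
    "cand_pkg_weight": rng.normal(size=300),
})
y = (X["key_pkg_weight"] + X["cand_pkg_weight"] > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit one model per candidate value and score it on the validation split
candidate_C = [0.01, 0.1, 1.0, 10.0]
val_scores = {}
for C in candidate_C:
    clf = build_pipeline(C).fit(X_train, y_train)
    val_scores[C] = accuracy_score(y_val, clf.predict(X_val))

best_C = max(val_scores, key=val_scores.get)
print("Validation accuracy per C:", val_scores)
print("Best C:", best_C)
```

The same loop extends to other estimators or parameters; with larger search spaces, scikit-learn's `GridSearchCV` offers a cross-validated alternative.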


## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained classifier to predict the labels on the test set. Test accuracy will be displayed upon a valid submission to the leaderboard.

In [None]:
# Implement here

# Get test data to test the classifier
# ! test data should come from public_test_features.csv !
# ...

# Use the trained model to make predictions on the test dataset
# test_predictions = ...
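
A sketch of the prediction step, under stated assumptions: the model is an impute-scale-classify pipeline, the test features use the same column names as the training features, and the submission consists of an `ID` column plus a predicted `label` column. The required submission format is not given in this notebook, so that layout is a guess; the synthetic frames stand in for `training.csv` and `public_test_features.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_columns = ["key_pkg_weight", "cand_pkg_weight"]

# Synthetic stand-ins for the training and test files (illustrative values).
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({
    "ID": np.arange(100),
    "key_pkg_weight": rng.normal(size=100),
    "cand_pkg_weight": rng.normal(size=100),
})
train_demo["label"] = (
    train_demo["key_pkg_weight"] > train_demo["cand_pkg_weight"]
).astype(int)

test_demo = pd.DataFrame({
    "ID": [1001, 1002, 1003],
    "key_pkg_weight": [0.2, np.nan, -1.5],
    "cand_pkg_weight": [0.1, 0.7, np.nan],
})

model = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(train_demo[feature_columns], train_demo["label"])

# Apply the same feature columns, in the same order, to the test data
test_predictions = model.predict(test_demo[feature_columns])

# Assumed submission layout: one row per test ID with its predicted label
submission = pd.DataFrame({"ID": test_demo["ID"], "label": test_predictions})
print(submission)
```

Selecting `feature_columns` explicitly on both frames guards against the test file having a different column order from the training file.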
