<a href="https://colab.research.google.com/github/Krishnaugale353/KrishnaUgale_Assignment_KariniAI/blob/main/Machine_Learning_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="0">Machine Learning Lab</a>

Build a classifier to predict the __label__ field (substitute or not substitute) of the product substitute dataset.

### Final Project Problem: Product Substitute Prediction

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the tasks in this notebook and submit your notebook via Colab.

1. <a href="#1">Read the datasets</a> (Given)
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Train - Validation - Test Datasets</a>
    * <a href="#24">Data Processing with Pipeline</a>
3. <a href="#3">Train (and Tune) a Classifier on the Training Dataset</a> (Implement)
4. <a href="#3">Make Predictions on the Test Dataset</a> (Implement)


__Datasets and Files:__


* __training.csv__: Training data with product pair features and corresponding labels:
> - `ID:` ID of the record
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __public_test_features.csv__: Test data with product pair features __without__ labels:
> - `ID:` ID of the record
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __metadata-dataset.xlsx__: Provides detailed information about all key_ and cand_ columns in the training and test sets. Try to select some useful features to include in the model, as not all of them are suitable. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)
<br>
<a href="https://propensity-labs-screening.s3.amazonaws.com/machine_learning/ml_data.zip">Download Dataset</a>

Then, we read the __training__ and __test__ datasets into dataframes.

In [2]:
import pandas as pd
import numpy as np

In [3]:
data_train=pd.read_csv("/content/drive/MyDrive/ml_data/ml_data/training.csv")
data_test=pd.read_csv("/content/drive/MyDrive/ml_data/ml_data/public_test_features.csv")
metadata=pd.read_excel("/content/drive/MyDrive/ml_data/ml_data/metadata-dataset.xlsx")



## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>)

### 2.1 <a name="21">Exploratory Data Analysis</a>

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [4]:
data_train.head()

Unnamed: 0,ID,label,key_Region Id,key_MarketPlace Id,key_ASIN,key_Binding Code,key_binding_description,key_brand_code,key_case_pack_quantity,key_classification_code,...,cand_pkg_weight,cand_pkg_weight_uom,cand_pkg_width,cand_release_date_embargo_level,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated,cand_version,cand_external_testing_certification
0,34016,0,1,1,B00YCZ6IKA,kitchen,Kitchen,NICLW,,base_product,...,0.529104,pounds,5.11811,,18-Apr-13,14-Oct-17,N,13-Oct-17,2867,
1,3581,0,1,1,B00U25WT7A,office_product,Office Product,,,base_product,...,0.1,pounds,4.5,,19-May-16,21-Mar-18,N,20-Mar-18,65,
2,36025,1,1,1,B011BZ3GXU,consumer_electronics,Electronics,,,base_product,...,0.654773,pounds,3.937008,,10-Dec-15,16-Feb-18,N,15-Feb-18,1532,
3,42061,1,1,1,B0089XDG3I,pc,Personal Computers,,,base_product,...,3.549442,pounds,10.314961,,19-Oct-12,15-Feb-18,N,14-Feb-18,13964,
4,14628,1,1,1,B014UTSBZW,miscellaneous,Misc.,ZUKC7,1.0,base_product,...,0.396832,pounds,5.19685,,26-Jul-12,9-Mar-18,N,9-Mar-18,1253,


In [5]:
# Implement EDA here
data_train.columns

Index(['ID', 'label', 'key_Region Id', 'key_MarketPlace Id', 'key_ASIN',
       'key_Binding Code', 'key_binding_description', 'key_brand_code',
       'key_case_pack_quantity', 'key_classification_code',
       ...
       'cand_pkg_weight', 'cand_pkg_weight_uom', 'cand_pkg_width',
       'cand_release_date_embargo_level', 'cand_dw_creation_date',
       'cand_dw_last_updated', 'cand_is_deleted', 'cand_last_updated',
       'cand_version', 'cand_external_testing_certification'],
      dtype='object', length=228)

In [6]:
data_train.shape

(36803, 228)

In [7]:
data_test.shape

(15774, 227)

In [8]:
data_train["label"].value_counts()

1    18589
0    18214
Name: label, dtype: int64
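The counts above (18589 vs. 18214) show that the two classes are roughly balanced, at about 50.5% vs. 49.5%. A minimal sketch of reading the same information as fractions, using a made-up toy series in place of `data_train["label"]`:

```python
import pandas as pd

# Toy labels standing in for data_train["label"]; the values are made up.
labels = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="label")

# normalize=True returns the share of each class instead of raw counts.
class_share = labels.value_counts(normalize=True)
print(class_share)
```

With balanced classes, plain accuracy is a reasonable first metric. With a strong imbalance it would be less informative.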

In [9]:
data_train.isna().sum()

ID                                         0
label                                      0
key_Region Id                              0
key_MarketPlace Id                         0
key_ASIN                                   0
                                       ...  
cand_dw_last_updated                       0
cand_is_deleted                            0
cand_last_updated                          0
cand_version                               0
cand_external_testing_certification    36226
Length: 228, dtype: int64

### 2.2 <a name="22">Select features to build the model</a>

For a quick start, we recommend using only a few of the numerical features for both key_ and cand_ ASINs: __item_package_quantity__, __item_height__, __item_width__, __item_length__, __item_weight__, __pkg_height__, __pkg_width__, __pkg_length__, __pkg_weight__. Feel free to explore other fields from the metadata-dataset.xlsx file.
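As a sketch, the recommended quick-start features can be expanded into full column names for both products. The helper `paired_feature_names` and the variable names are hypothetical and not part of the assignment.

```python
# Base names recommended in the text above.
BASE_FEATURES = [
    "item_package_quantity", "item_height", "item_width", "item_length",
    "item_weight", "pkg_height", "pkg_width", "pkg_length", "pkg_weight",
]

def paired_feature_names(base_names, prefixes=("key_", "cand_")):
    """Return every base name once per prefix, key_ columns first."""
    return [prefix + name for prefix in prefixes for name in base_names]

model_features = paired_feature_names(BASE_FEATURES)
print(len(model_features))
```

The resulting list could then be used to index the dataframe, for example `data_train[model_features]`, assuming all of these columns are present.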


In [10]:
# k holds the percentage of null values in each column
k=(data_train.isnull().sum()/data_train.shape[0])*100
type(k)

In [11]:
# columns_with_null holds the columns in which at least 50% of the values are null
columns_with_null = k[k >= 50]
print(columns_with_null)


key_brand_code                          56.095970
key_case_pack_quantity                  55.144961
key_color_map                           67.787952
key_country_of_origin                  100.000000
key_cpsia_cautionary_statement          75.298209
                                          ...    
cand_video_game_region_description      99.961960
cand_wireless_provider                  99.584273
cand_wireless_provider_code             99.584273
cand_release_date_embargo_level         99.703828
cand_external_testing_certification     98.432193
Length: 129, dtype: float64
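The same percentage can be computed a little more directly, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up toy frame (the column names are placeholders):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "key_a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "key_b": [1.0, 2.0, np.nan, np.nan],      # 50% missing
    "cand_a": [1.0, 2.0, 3.0, np.nan],        # 25% missing
})

# isna() gives a boolean frame; its column-wise mean is the null fraction.
null_pct = toy.isna().mean() * 100

# Same threshold as in the notebook: keep columns with >= 50% nulls.
mostly_null = null_pct[null_pct >= 50].index.tolist()
print(mostly_null)
```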


In [12]:
columns_with_null = k[k >= 50].index.to_list()  # column names only, as a plain list
print(columns_with_null)
print(len(columns_with_null))

['key_brand_code', 'key_case_pack_quantity', 'key_color_map', 'key_country_of_origin', 'key_cpsia_cautionary_statement', 'key_customer_return_method', 'key_customer_return_policy', 'key_delivery_option', 'key_discontinued_date', 'key_esrb_age_rating', 'key_esrb_descriptors', 'key_excluded_direct_browse_node_id', 'key_fedas_id', 'key_fma_override', 'key_inner_package_type', 'key_is_adult_product', 'key_is_certified_organic', 'key_is_phone_upgradeable', 'key_is_super_saver_shipping_excl', 'key_isbn', 'key_item_display_diameter', 'key_item_display_height', 'key_item_display_length', 'key_item_display_length_uom', 'key_item_display_volume', 'key_item_display_volume_uom', 'key_item_display_weight', 'key_item_display_weight_uom', 'key_item_display_width', 'key_manufacturer_sku', 'key_manufacturer_vendor_code', 'key_max_weight_recommendation', 'key_mfg_series_number', 'key_min_weight_recommendation', 'key_monthly_recurring_charge', 'key_number_of_items', 'key_number_of_licenses', 'key_number_

In [13]:
key1=[]   # key_ counterparts of sparse cand_ columns
cand1=[]  # cand_ counterparts of sparse key_ columns
other1=[] # sparse columns with neither prefix
for i in columns_with_null:
  if i.startswith("key"):
    cand1.append(i.replace("key","cand",1))
  elif i.startswith("cand"):
    key1.append(i.replace("cand","key",1))
  else:
    other1.append(i)
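The loop above mirrors each sparse column to its counterpart on the other product, so that a feature is dropped for both key and candidate or for neither. A compact sketch of the same idea using a set; the function name `with_counterparts` and the sample list are hypothetical:

```python
def with_counterparts(columns):
    """Return the sorted union of `columns` and their key_/cand_ mirrors."""
    result = set(columns)
    for name in columns:
        if name.startswith("key_"):
            result.add("cand_" + name[len("key_"):])
        elif name.startswith("cand_"):
            result.add("key_" + name[len("cand_"):])
    return sorted(result)

# Made-up sample: one column is sparse on both sides, one only on cand_.
sparse = ["key_brand_code", "cand_color_map", "cand_brand_code"]
to_drop = with_counterparts(sparse)
print(to_drop)
```

Using a set removes duplicates as they are added, so no separate de-duplication step is needed.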

In [14]:
print(key1)
print()
print(cand1)

['key_brand_code', 'key_case_pack_quantity', 'key_color_map', 'key_country_of_origin', 'key_cpsia_cautionary_statement', 'key_customer_return_method', 'key_customer_return_policy', 'key_delivery_option', 'key_discontinued_date', 'key_esrb_age_rating', 'key_esrb_descriptors', 'key_excluded_direct_browse_node_id', 'key_fedas_id', 'key_fma_override', 'key_inner_package_type', 'key_is_adult_product', 'key_is_certified_organic', 'key_is_phone_upgradeable', 'key_is_super_saver_shipping_excl', 'key_isbn', 'key_item_display_diameter', 'key_item_display_height', 'key_item_display_length', 'key_item_display_length_uom', 'key_item_display_volume', 'key_item_display_volume_uom', 'key_item_display_weight', 'key_item_display_weight_uom', 'key_item_display_width', 'key_manufacturer_sku', 'key_manufacturer_vendor_code', 'key_max_weight_recommendation', 'key_mfg_series_number', 'key_min_weight_recommendation', 'key_monthly_recurring_charge', 'key_number_of_items', 'key_number_of_licenses', 'key_number_

In [15]:
print(len(cand1))
print(len(key1))

64
65


In [16]:
print(len(columns_with_null))
type(columns_with_null)

129


list

In [17]:
columns_with_null.extend(key1)
columns_with_null.extend(cand1)
print(columns_with_null)
print(len(columns_with_null))


['key_brand_code', 'key_case_pack_quantity', 'key_color_map', 'key_country_of_origin', 'key_cpsia_cautionary_statement', 'key_customer_return_method', 'key_customer_return_policy', 'key_delivery_option', 'key_discontinued_date', 'key_esrb_age_rating', 'key_esrb_descriptors', 'key_excluded_direct_browse_node_id', 'key_fedas_id', 'key_fma_override', 'key_inner_package_type', 'key_is_adult_product', 'key_is_certified_organic', 'key_is_phone_upgradeable', 'key_is_super_saver_shipping_excl', 'key_isbn', 'key_item_display_diameter', 'key_item_display_height', 'key_item_display_length', 'key_item_display_length_uom', 'key_item_display_volume', 'key_item_display_volume_uom', 'key_item_display_weight', 'key_item_display_weight_uom', 'key_item_display_width', 'key_manufacturer_sku', 'key_manufacturer_vendor_code', 'key_max_weight_recommendation', 'key_mfg_series_number', 'key_min_weight_recommendation', 'key_monthly_recurring_charge', 'key_number_of_items', 'key_number_of_licenses', 'key_number_

In [18]:
columns_drop=list(set(columns_with_null))
print(len(columns_drop))

130


In [19]:
train2=data_train.drop(columns_drop,axis=1)

In [20]:
train2.shape

(36803, 98)

In [21]:
train2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36803 entries, 0 to 36802
Data columns (total 98 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   ID                                 36803 non-null  int64  
 1   label                              36803 non-null  int64  
 2   key_Region Id                      36803 non-null  int64  
 3   key_MarketPlace Id                 36803 non-null  int64  
 4   key_ASIN                           36803 non-null  object 
 5   key_Binding Code                   32208 non-null  object 
 6   key_binding_description            32208 non-null  object 
 7   key_classification_code            36803 non-null  object 
 8   key_classification_description     36803 non-null  object 
 9   key_creation_date                  36803 non-null  object 
 10  key_currency_code                  36803 non-null  object 
 11  key_ean                            33389 non-null  flo

In [22]:
#t stores columns having null values
t=train2.isna().sum()[train2.isna().sum()!=0]
print(len(t))
t

42


key_Binding Code                    4595
key_binding_description             4595
key_ean                             3414
key_fma_qualified_price_max         2429
key_item_height                    10350
key_item_length                    10350
key_item_package_quantity           3044
key_item_weight                    14168
key_item_width                     10350
key_manufacturer_name               2157
key_model_number                    9896
key_publisher_studio_label          2157
key_upc                             6232
key_pkg_dimensional_uom             3445
key_pkg_height                      3445
key_pkg_length                      3445
key_pkg_weight                      3613
key_pkg_weight_uom                  3613
key_pkg_width                       3445
cand_Binding Code                   5689
cand_binding_description            5689
cand_classification_code               1
cand_classification_description        1
cand_ean                            6407
cand_fma_qualifi

In [23]:
# keep only rows that have at least 92 non-null values (out of 98 columns)
train3=train2.dropna(thresh=92)
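In `dropna`, `thresh` is the minimum number of non-null values a row needs in order to be kept. With 98 columns, `thresh=92` therefore keeps rows with at most 6 missing values. A small sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, 3.0],
})

# Keep rows with at least 2 non-null values; the last row has only one.
kept = toy.dropna(thresh=2)
print(kept.index.tolist())
```

Note that the original index labels are preserved, which is why the filtered frame in the notebook no longer has a contiguous `RangeIndex`.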

In [24]:
m=train3.isna().sum()[train3.isna().sum()!=0]
m

key_Binding Code                 614
key_binding_description          614
key_ean                          410
key_fma_qualified_price_max      464
key_item_height                 1744
key_item_length                 1744
key_item_package_quantity        588
key_item_weight                 5138
key_item_width                  1744
key_manufacturer_name            138
key_model_number                2445
key_publisher_studio_label       138
key_upc                         1677
key_pkg_dimensional_uom           29
key_pkg_height                    29
key_pkg_length                    29
key_pkg_weight                    52
key_pkg_weight_uom                52
key_pkg_width                     29
cand_Binding Code                797
cand_binding_description         797
cand_ean                         819
cand_fma_qualified_price_max    1467
cand_item_height                2656
cand_item_length                2656
cand_item_package_quantity       771
cand_item_weight                5982
c

In [25]:
train3.shape

(20621, 98)

In [26]:
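# per-column drop in null counts after removing sparse rows
# (NaN means the column has no nulls left in train3, so it is missing from m)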
t-m

cand_Binding Code                   4892.0
cand_binding_description            4892.0
cand_classification_code               NaN
cand_classification_description        NaN
cand_ean                            5588.0
cand_fma_qualified_price_max        6621.0
cand_item_classification_id            NaN
cand_item_height                   12282.0
cand_item_length                   12282.0
cand_item_name                         NaN
cand_item_package_quantity          4606.0
cand_item_weight                   11630.0
cand_item_width                    12282.0
cand_manufacturer_name              2187.0
cand_model_number                  10369.0
cand_pkg_dimensional_uom            7362.0
cand_pkg_height                     7362.0
cand_pkg_length                     7362.0
cand_pkg_weight                     7571.0
cand_pkg_weight_uom                 7571.0
cand_pkg_width                      7362.0
cand_publisher_studio_label         2187.0
cand_upc                            7363.0
key_Binding
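The `NaN` entries in this output come from index alignment: a column that no longer has any nulls is absent from `m`, so the subtraction has nothing to pair it with. A sketch of one way to get a number instead, using made-up counts:

```python
import pandas as pd

# Made-up null counts before and after dropping rows.
before = pd.Series({"col_a": 10, "col_b": 1, "col_c": 7})
after = pd.Series({"col_a": 4, "col_c": 7})   # col_b has no nulls left

plain = before - after                      # col_b becomes NaN
filled = before.sub(after, fill_value=0)    # a missing entry counts as 0
print(filled)
```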

In [27]:
train3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20621 entries, 0 to 36801
Data columns (total 98 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   ID                                 20621 non-null  int64  
 1   label                              20621 non-null  int64  
 2   key_Region Id                      20621 non-null  int64  
 3   key_MarketPlace Id                 20621 non-null  int64  
 4   key_ASIN                           20621 non-null  object 
 5   key_Binding Code                   20007 non-null  object 
 6   key_binding_description            20007 non-null  object 
 7   key_classification_code            20621 non-null  object 
 8   key_classification_description     20621 non-null  object 
 9   key_creation_date                  20621 non-null  object 
 10  key_currency_code                  20621 non-null  object 
 11  key_ean                            20211 non-null  flo

In [28]:
object_columns = train3.select_dtypes(include=['object']).columns.tolist()
object_columns

['key_ASIN',
 'key_Binding Code',
 'key_binding_description',
 'key_classification_code',
 'key_classification_description',
 'key_creation_date',
 'key_currency_code',
 'key_Product Group Description',
 'key_has_ean',
 'key_has_online_play',
 'key_has_platform',
 'key_has_recommended_browse_nodes',
 'key_has_upc',
 'key_is_advantage',
 'key_is_conveyable',
 'key_is_discontinued',
 'key_is_manufacture_on_demand',
 'key_Is Sortable',
 'key_item_name',
 'key_language_code',
 'key_manufacturer_name',
 'key_model_number',
 'key_product_type',
 'key_publisher_studio_label',
 'key_pkg_dimensional_uom',
 'key_pkg_weight_uom',
 'key_dw_creation_date',
 'key_dw_last_updated',
 'key_is_deleted',
 'key_last_updated',
 'cand_ASIN',
 'cand_Binding Code',
 'cand_binding_description',
 'cand_classification_code',
 'cand_classification_description',
 'cand_creation_date',
 'cand_currency_code',
 'cand_Product Group Description',
 'cand_has_ean',
 'cand_has_online_play',
 'cand_has_platform',
 'cand_ha

In [29]:
num_columns = train3.select_dtypes(include=['int','float']).columns.tolist()
num_columns

['ID',
 'label',
 'key_Region Id',
 'key_MarketPlace Id',
 'key_ean',
 'key_fma_qualified_price_max',
 'key_Product Group Code',
 'key_item_classification_id',
 'key_item_height',
 'key_item_length',
 'key_item_package_quantity',
 'key_item_weight',
 'key_item_width',
 'key_product_type_id',
 'key_upc',
 'key_pkg_height',
 'key_pkg_length',
 'key_pkg_weight',
 'key_pkg_width',
 'key_version',
 'cand_Region Id',
 'cand_MarketPlace Id',
 'cand_ean',
 'cand_fma_qualified_price_max',
 'cand_Product Group Code',
 'cand_item_classification_id',
 'cand_item_height',
 'cand_item_length',
 'cand_item_package_quantity',
 'cand_item_weight',
 'cand_item_width',
 'cand_product_type_id',
 'cand_upc',
 'cand_pkg_height',
 'cand_pkg_length',
 'cand_pkg_weight',
 'cand_pkg_width',
 'cand_version']
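As an alternative to listing `'int'` and `'float'`, `select_dtypes` also accepts `include="number"`, which selects every numeric dtype regardless of width. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": np.array([1, 2], dtype="int64"),
    "b": np.array([0.5, 1.5], dtype="float64"),
    "c": np.array([3, 4], dtype="int32"),
    "d": ["x", "y"],
})

# "number" covers all integer and float dtypes; object columns are excluded.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```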

In [30]:
num_int_columns = train3.select_dtypes(include=['int']).columns.tolist()
num_int_columns

['ID',
 'label',
 'key_Region Id',
 'key_MarketPlace Id',
 'key_Product Group Code',
 'key_product_type_id',
 'key_version',
 'cand_Region Id',
 'cand_MarketPlace Id',
 'cand_Product Group Code',
 'cand_product_type_id',
 'cand_version']

In [31]:
num_float_columns = train3.select_dtypes(include=['float']).columns.tolist()
num_float_columns

['key_ean',
 'key_fma_qualified_price_max',
 'key_item_classification_id',
 'key_item_height',
 'key_item_length',
 'key_item_package_quantity',
 'key_item_weight',
 'key_item_width',
 'key_upc',
 'key_pkg_height',
 'key_pkg_length',
 'key_pkg_weight',
 'key_pkg_width',
 'cand_ean',
 'cand_fma_qualified_price_max',
 'cand_item_classification_id',
 'cand_item_height',
 'cand_item_length',
 'cand_item_package_quantity',
 'cand_item_weight',
 'cand_item_width',
 'cand_upc',
 'cand_pkg_height',
 'cand_pkg_length',
 'cand_pkg_weight',
 'cand_pkg_width']

In [32]:
pd.set_option('display.max_columns', None)

In [33]:
train3[object_columns].head()

Unnamed: 0,key_ASIN,key_Binding Code,key_binding_description,key_classification_code,key_classification_description,key_creation_date,key_currency_code,key_Product Group Description,key_has_ean,key_has_online_play,key_has_platform,key_has_recommended_browse_nodes,key_has_upc,key_is_advantage,key_is_conveyable,key_is_discontinued,key_is_manufacture_on_demand,key_Is Sortable,key_item_name,key_language_code,key_manufacturer_name,key_model_number,key_product_type,key_publisher_studio_label,key_pkg_dimensional_uom,key_pkg_weight_uom,key_dw_creation_date,key_dw_last_updated,key_is_deleted,key_last_updated,cand_ASIN,cand_Binding Code,cand_binding_description,cand_classification_code,cand_classification_description,cand_creation_date,cand_currency_code,cand_Product Group Description,cand_has_ean,cand_has_online_play,cand_has_platform,cand_has_recommended_browse_nodes,cand_has_upc,cand_is_advantage,cand_is_conveyable,cand_is_discontinued,cand_is_manufacture_on_demand,cand_Is Sortable,cand_item_name,cand_language_code,cand_manufacturer_name,cand_model_number,cand_product_type,cand_publisher_studio_label,cand_pkg_dimensional_uom,cand_pkg_weight_uom,cand_dw_creation_date,cand_dw_last_updated,cand_is_deleted,cand_last_updated
0,B00YCZ6IKA,kitchen,Kitchen,base_product,Base Product,27-May-15,USD,gl_home,Y,N,N,N,Y,N,Y,N,N,N,Nickelodeon Teenage Mutant Ninja Turtles You B...,en_US,"Jay Franco and Sons, Inc.",JF22451BBCD,HOME,"Jay Franco and Sons, Inc.",inches,pounds,28-May-15,30-Sep-17,N,29-Sep-17,B00CEEU86G,home_improvement,Tools & Home Improvement,base_product,Base Product,17-Apr-13,USD,gl_home_improvement,Y,N,N,Y,Y,N,Y,N,N,N,Roommates Rmk2249Gm Teenage Mutant Ninja Turtl...,en_US,RoomMates,RMK2249GM,BUILDING_MATERIAL,RoomMates,inches,pounds,18-Apr-13,14-Oct-17,N,13-Oct-17
1,B00U25WT7A,office_product,Office Product,base_product,Base Product,27-Feb-15,USD,gl_office_product,Y,N,N,Y,Y,N,Y,N,N,Y,BLOCKIT RFID Protector Sleeves - Made in the U...,en_US,Blockit Security Products LLC,CARS6-TP,OFFICE_PRODUCTS,Blockit Security Products LLC,inches,pounds,27-Feb-15,30-Jan-18,N,29-Jan-18,B01FUA9HP8,office_product,Office Product,base_product,Base Product,18-May-16,USD,gl_wireless,Y,N,N,N,N,N,Y,N,N,Y,RFID Blocking Sleeves (10 Credit Card & 2 Pass...,en_US,01 Digitals,01DRFIDBS,WIRELESS_ACCESSORY,01 Digitals,inches,pounds,19-May-16,21-Mar-18,N,20-Mar-18
2,B011BZ3GXU,consumer_electronics,Electronics,base_product,Base Product,10-Jul-15,USD,gl_wireless,Y,N,N,N,Y,N,Y,N,N,Y,"Dual Output Portable Charger, Oripow Spark A6 ...",en_US,Oripow,22SK6A6,PHONE_ACCESSORY,Oripow,inches,pounds,11-Jul-15,24-Mar-17,N,23-Mar-17,B0194WDVHI,consumer_electronics,Electronics,base_product,Base Product,9-Dec-15,USD,gl_wireless,Y,N,N,Y,Y,N,Y,N,N,Y,"Anker PowerCore 10000, One of the Smallest and...",en_US,Anker,A1263G12,CONSUMER_ELECTRONICS,Anker,inches,pounds,10-Dec-15,16-Feb-18,N,15-Feb-18
4,B014UTSBZW,miscellaneous,Misc.,base_product,Base Product,3-Sep-15,USD,gl_pet_products,Y,N,N,N,Y,N,Y,N,N,Y,"Zuke's Genuine Jerky Dog Treats, Beef and Carr...",en_US,Zuke's,25065,PET_SUPPLIES,Zuke's,inches,pounds,4-Sep-15,10-Mar-18,N,9-Mar-18,B008OV929C,miscellaneous,Misc.,base_product,Base Product,25-Jul-12,USD,gl_pet_products,Y,N,N,Y,Y,N,Y,N,N,Y,"Hill's Science Diet Beef Jerky Dog Treats, Jer...",en_US,Hill's Pet Nutrition,1876,PET_SUPPLIES,Hill's Pet Nutrition,inches,pounds,26-Jul-12,9-Mar-18,N,9-Mar-18
5,B01C5TFLSE,consumer_electronics,Electronics,base_product,Base Product,24-Feb-16,USD,gl_home_entertainment,Y,N,N,N,Y,N,N,N,N,N,Samsung UN55KS8000 55-Inch 4K Ultra HD Smart L...,en_US,Samsung,UN55KS8000FXZA,TELEVISION,Samsung,inches,pounds,25-Feb-16,21-Mar-18,N,20-Mar-18,B012E97GJC,consumer_electronics,Electronics,base_product,Base Product,23-Jul-15,USD,gl_home_entertainment,Y,N,N,N,Y,N,N,N,N,N,"LG Electronics 55"" LED TV (55SL5B-B)",en_US,LG,55SL5B-B,TELEVISION,LG,inches,pounds,25-Jul-15,12-Sep-17,N,12-Sep-17


In [34]:
dates=["key_creation_date","key_dw_creation_date","key_dw_last_updated","key_last_updated","cand_creation_date","cand_dw_creation_date","cand_dw_last_updated","cand_last_updated"]


In [35]:
train3.drop(dates,axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3.drop(dates,axis=1,inplace=True)
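This warning (and the similar ones in later cells) most likely appears because the frame being modified was derived from another frame by filtering. One common way to avoid it, shown here as a sketch on made-up data, is to take an explicit `.copy()` after filtering and to assign results back instead of using `inplace=True`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [1.0, 2.0, np.nan, np.nan],
    "created": ["a", "b", "c", "d"],
})

# .copy() makes the filtered frame independent of `toy`.
subset = toy.dropna(thresh=2).copy()

# Assign results back rather than modifying in place.
subset = subset.drop(columns=["created"])
subset["x"] = subset["x"].fillna(subset["x"].mean())
print(subset)
```

The original frame is left untouched, so later cells can still refer to it safely.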


In [36]:
train3[num_columns].nunique()

ID                              20621
label                               2
key_Region Id                       1
key_MarketPlace Id                  1
key_ean                           962
key_fma_qualified_price_max      1026
key_Product Group Code             39
key_item_classification_id          1
key_item_height                   418
key_item_length                   488
key_item_package_quantity          15
key_item_weight                   433
key_item_width                    443
key_product_type_id               128
key_upc                           886
key_pkg_height                    338
key_pkg_length                    495
key_pkg_weight                    558
key_pkg_width                     437
key_version                       936
cand_Region Id                      1
cand_MarketPlace Id                 1
cand_ean                        10296
cand_fma_qualified_price_max     8896
cand_Product Group Code            50
cand_item_classification_id         3
cand_item_he

In [37]:
t3=["key_Product Group Code","key_item_package_quantity","cand_Product Group Code","cand_item_classification_id","cand_item_package_quantity"]
for i in t3:
  print(i)
  print(train3[i].unique())
  print("#"*50)
  print()


key_Product Group Code
[201 229 107 199 504 421  21 147 121  79  23  60 200 468 469  75 196 263
  65  86 510 198 194 241 328 267 309 364 236 325 467 465 293 193 570 451
 470 425  14]
##################################################

key_item_package_quantity
[  1.   6.  nan  12.  36.   2.   4.   3.  10.  21.  60. 500. 100.   8.
  24.   5.]
##################################################

cand_Product Group Code
[ 60 107 199 504 421 201 229  21  23  79 469 121 422 200 468 196 263 194
  14  75 147  65  86 400 510 267 241 309 328 325 193 467 198 470 293 251
 570 197 425 236 265 261 485  63 540 465 364 541 451 494]
##################################################

cand_item_classification_id
[ 1.  4. 15.]
##################################################

cand_item_package_quantity
[1.000e+00 2.000e+01 3.600e+01       nan 6.000e+01 1.200e+01 4.000e+00
 3.000e+00 2.000e+00 1.000e+01 7.000e+00 6.000e+00 1.000e+02 3.000e+01
 5.000e+00 3.800e+01 2.400e+01 1.500e+02 5.000e+03 5.000e+01 

In [38]:
train3["key_item_package_quantity"].value_counts()

1.0      19441
2.0        137
4.0         86
10.0        76
3.0         61
6.0         56
12.0        55
100.0       30
36.0        27
60.0        19
21.0        16
24.0        14
500.0        7
8.0          5
5.0          3
Name: key_item_package_quantity, dtype: int64

In [39]:
for i in num_int_columns:
  if i not in t3:
    train3[i]=train3[i].fillna(int(train3[i].mean()))
train3[num_int_columns].isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3[i]=train3[i].fillna(int(train3[i].mean()))


ID                         0
label                      0
key_Region Id              0
key_MarketPlace Id         0
key_Product Group Code     0
key_product_type_id        0
key_version                0
cand_Region Id             0
cand_MarketPlace Id        0
cand_Product Group Code    0
cand_product_type_id       0
cand_version               0
dtype: int64

In [40]:
for i in num_float_columns:
  if i not in t3:
    train3[i]=train3[i].fillna(train3[i].mean())
train3[num_float_columns].isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3[i]=train3[i].fillna(train3[i].mean())


key_ean                           0
key_fma_qualified_price_max       0
key_item_classification_id        0
key_item_height                   0
key_item_length                   0
key_item_package_quantity       588
key_item_weight                   0
key_item_width                    0
key_upc                           0
key_pkg_height                    0
key_pkg_length                    0
key_pkg_weight                    0
key_pkg_width                     0
cand_ean                          0
cand_fma_qualified_price_max      0
cand_item_classification_id       0
cand_item_height                  0
cand_item_length                  0
cand_item_package_quantity      771
cand_item_weight                  0
cand_item_width                   0
cand_upc                          0
cand_pkg_height                   0
cand_pkg_length                   0
cand_pkg_weight                   0
cand_pkg_width                    0
dtype: int64
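The cells above fill missing numerical values with column means computed on the training data. The test set will need the same treatment. A sketch, under the assumption that the training means are reused for the test rows so that both sets are transformed identically (the toy frames are made up):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "key_item_height": [1.0, 3.0, np.nan],
    "key_item_width": [2.0, np.nan, 4.0],
})
test = pd.DataFrame({
    "key_item_height": [np.nan, 5.0],
    "key_item_width": [np.nan, np.nan],
})

# Column means from the training data only; NaN is skipped by default.
train_means = train.mean()

# fillna with a Series fills each column with its matching entry.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
print(test_filled)
```

Computing the statistics on the training data alone keeps information from the test set out of the preprocessing step.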

In [41]:
train3["cand_item_package_quantity"].unique()

array([1.000e+00, 2.000e+01, 3.600e+01,       nan, 6.000e+01, 1.200e+01,
       4.000e+00, 3.000e+00, 2.000e+00, 1.000e+01, 7.000e+00, 6.000e+00,
       1.000e+02, 3.000e+01, 5.000e+00, 3.800e+01, 2.400e+01, 1.500e+02,
       5.000e+03, 5.000e+01, 9.999e+03, 7.600e+01, 1.800e+02, 3.600e+02,
       4.000e+01, 1.440e+02, 4.800e+01, 1.500e+01, 8.000e+00, 2.000e+02,
       1.600e+01, 4.500e+01, 1.400e+03, 2.500e+01, 1.800e+01, 7.200e+01,
       8.000e+01, 9.000e+00, 7.200e+02, 2.700e+01, 2.100e+01, 3.000e+02,
       1.200e+03, 1.200e+02, 1.080e+02, 3.000e+03, 1.300e+01, 6.600e+01,
       6.000e+02, 2.400e+02, 9.000e+01, 1.900e+01, 1.000e+03, 4.300e+01,
       4.000e+02, 1.100e+02, 1.250e+02, 8.640e+02, 5.800e+01])

In [42]:
train3["cand_item_package_quantity"]=train3["cand_item_package_quantity"].fillna(train3["cand_item_package_quantity"].mean())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3["cand_item_package_quantity"]=train3["cand_item_package_quantity"].fillna(train3["cand_item_package_quantity"].mean())


In [43]:
train3["key_item_package_quantity"]=train3["key_item_package_quantity"].fillna(1.0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3["key_item_package_quantity"]=train3["key_item_package_quantity"].fillna(1.0)


In [44]:
from scipy import stats

In [45]:
object_columns = train3.select_dtypes(include=['object']).columns.tolist()
object_columns

['key_ASIN',
 'key_Binding Code',
 'key_binding_description',
 'key_classification_code',
 'key_classification_description',
 'key_currency_code',
 'key_Product Group Description',
 'key_has_ean',
 'key_has_online_play',
 'key_has_platform',
 'key_has_recommended_browse_nodes',
 'key_has_upc',
 'key_is_advantage',
 'key_is_conveyable',
 'key_is_discontinued',
 'key_is_manufacture_on_demand',
 'key_Is Sortable',
 'key_item_name',
 'key_language_code',
 'key_manufacturer_name',
 'key_model_number',
 'key_product_type',
 'key_publisher_studio_label',
 'key_pkg_dimensional_uom',
 'key_pkg_weight_uom',
 'key_is_deleted',
 'cand_ASIN',
 'cand_Binding Code',
 'cand_binding_description',
 'cand_classification_code',
 'cand_classification_description',
 'cand_currency_code',
 'cand_Product Group Description',
 'cand_has_ean',
 'cand_has_online_play',
 'cand_has_platform',
 'cand_has_recommended_browse_nodes',
 'cand_has_upc',
 'cand_is_advantage',
 'cand_is_conveyable',
 'cand_is_discontinued',

In [46]:
for i in object_columns:
    train3[i]=train3[i].fillna(train3[i].mode()[0])
train3[object_columns].isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3[i]=train3[i].fillna(train3[i].mode()[0])

key_ASIN                             0
key_Binding Code                     0
key_binding_description              0
key_classification_code              0
key_classification_description       0
key_currency_code                    0
key_Product Group Description        0
key_has_ean                          0
key_has_online_play                  0
key_has_platform                     0
key_has_recommended_browse_nodes     0
key_has_upc                          0
key_is_advantage                     0
key_is_conveyable                    0
key_is_discontinued                  0
key_is_manufacture_on_demand         0
key_Is Sortable                      0
key_item_name                        0
key_language_code                    0
key_manufacturer_name                0
key_model_number                     0
key_product_type                     0
key_publisher_studio_label           0
key_pkg_dimensional_uom              0
key_pkg_weight_uom                   0
key_is_deleted           
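The `SettingWithCopyWarning` shown above usually appears when the frame being modified was itself derived from another frame by filtering or slicing. The following is a minimal sketch of one way to avoid it, assuming `train3` was produced that way. The toy frame `raw` and its values are illustrative, not the project data.

```python
import pandas as pd

# Toy stand-in for the filtered training frame (hypothetical data).
raw = pd.DataFrame({
    "key_has_ean": ["Y", None, "Y", "N"],
    "cand_has_upc": ["N", "N", None, "Y"],
    "label": [1, 0, 1, 0],
})

# An explicit copy makes the result independent of `raw`,
# so later assignments do not trigger SettingWithCopyWarning.
subset = raw[raw["label"].notna()].copy()

# Categorical columns to impute (everything except the label in this toy frame).
obj_cols = [c for c in subset.columns if c != "label"]

# Most frequent value per column; mode() can return several rows on ties, so take the first.
modes = subset[obj_cols].mode().iloc[0]

# fillna accepts a Series keyed by column name and fills each column with its own value.
subset[obj_cols] = subset[obj_cols].fillna(modes)

print(subset[obj_cols].isna().sum())
```

Filling all columns in one call also replaces the per-column loop, which may be easier to read when many columns are involved.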

In [48]:
uqv=train3[num_columns].nunique()[train3[num_columns].nunique()==1] #numerical columns having only 1 value
uniq_val=uqv.index.to_list()
uniq_val

['key_Region Id',
 'key_MarketPlace Id',
 'key_item_classification_id',
 'cand_Region Id',
 'cand_MarketPlace Id']

In [49]:
train3.drop(uniq_val,axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3.drop(uniq_val,axis=1,inplace=True)


In [50]:
uqv1=train3[object_columns].nunique()[train3[object_columns].nunique()==1] #categorical columns having only 1 value
uq_val_obj=uqv1.index.to_list()
uq_val_obj

['key_classification_code',
 'key_classification_description',
 'key_currency_code',
 'key_has_online_play',
 'key_is_advantage',
 'key_is_discontinued',
 'key_is_manufacture_on_demand',
 'key_language_code',
 'key_pkg_dimensional_uom',
 'key_pkg_weight_uom',
 'cand_currency_code',
 'cand_has_online_play',
 'cand_is_discontinued',
 'cand_is_manufacture_on_demand',
 'cand_language_code',
 'cand_pkg_dimensional_uom',
 'cand_pkg_weight_uom']

In [51]:
train3.drop(uq_val_obj,axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train3.drop(uq_val_obj,axis=1,inplace=True)


In [52]:
train3.shape

(20621, 68)
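The constant-column lists above are computed on the training frame. A sketch follows, under the assumption that the test data should receive exactly the same column drops as the training data so that both end up with the same feature set. The helper name `constant_columns` and the toy frames are hypothetical.

```python
import pandas as pd

def constant_columns(df):
    """Return names of columns that hold a single distinct non-null value."""
    counts = df.nunique()  # NaN is ignored by default
    return counts[counts == 1].index.tolist()

train_toy = pd.DataFrame({
    "key_Region Id": [1, 1, 1],
    "key_currency_code": ["USD", "USD", "USD"],
    "key_item_height": [1.0, 2.5, 3.0],
})
test_toy = pd.DataFrame({
    "key_Region Id": [1, 1],
    "key_currency_code": ["USD", "USD"],
    "key_item_height": [4.0, 4.0],  # constant in this sample, but not in training
})

# Decide what to drop from the training data only, then apply the same list to both frames.
to_drop = constant_columns(train_toy)
train_reduced = train_toy.drop(columns=to_drop)
test_reduced = test_toy.drop(columns=to_drop)

print(to_drop)
print(list(test_reduced.columns))
```

Recomputing the list on the test frame instead could drop a column the model expects, as `key_item_height` would be here.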

In [53]:
object_columns = train3.select_dtypes(include=['object']).columns.tolist()
object_columns

['key_ASIN',
 'key_Binding Code',
 'key_binding_description',
 'key_Product Group Description',
 'key_has_ean',
 'key_has_platform',
 'key_has_recommended_browse_nodes',
 'key_has_upc',
 'key_is_conveyable',
 'key_Is Sortable',
 'key_item_name',
 'key_manufacturer_name',
 'key_model_number',
 'key_product_type',
 'key_publisher_studio_label',
 'key_is_deleted',
 'cand_ASIN',
 'cand_Binding Code',
 'cand_binding_description',
 'cand_classification_code',
 'cand_classification_description',
 'cand_Product Group Description',
 'cand_has_ean',
 'cand_has_platform',
 'cand_has_recommended_browse_nodes',
 'cand_has_upc',
 'cand_is_advantage',
 'cand_is_conveyable',
 'cand_Is Sortable',
 'cand_item_name',
 'cand_manufacturer_name',
 'cand_model_number',
 'cand_product_type',
 'cand_publisher_studio_label',
 'cand_is_deleted']

In [54]:
train3[object_columns].nunique()


key_ASIN                              1129
key_Binding Code                        26
key_binding_description                 26
key_Product Group Description           39
key_has_ean                              2
key_has_platform                         2
key_has_recommended_browse_nodes         2
key_has_upc                              2
key_is_conveyable                        2
key_Is Sortable                          2
key_item_name                         1129
key_manufacturer_name                  953
key_model_number                       939
key_product_type                       128
key_publisher_studio_label             953
key_is_deleted                           2
cand_ASIN                            20155
cand_Binding Code                       52
cand_binding_description                52
cand_classification_code                 3
cand_classification_description          3
cand_Product Group Description          50
cand_has_ean                             2
cand_has_pl

In [55]:
train3[object_columns].head()

Unnamed: 0,key_ASIN,key_Binding Code,key_binding_description,key_Product Group Description,key_has_ean,key_has_platform,key_has_recommended_browse_nodes,key_has_upc,key_is_conveyable,key_Is Sortable,key_item_name,key_manufacturer_name,key_model_number,key_product_type,key_publisher_studio_label,key_is_deleted,cand_ASIN,cand_Binding Code,cand_binding_description,cand_classification_code,cand_classification_description,cand_Product Group Description,cand_has_ean,cand_has_platform,cand_has_recommended_browse_nodes,cand_has_upc,cand_is_advantage,cand_is_conveyable,cand_Is Sortable,cand_item_name,cand_manufacturer_name,cand_model_number,cand_product_type,cand_publisher_studio_label,cand_is_deleted
0,B00YCZ6IKA,kitchen,Kitchen,gl_home,Y,N,N,Y,Y,N,Nickelodeon Teenage Mutant Ninja Turtles You B...,"Jay Franco and Sons, Inc.",JF22451BBCD,HOME,"Jay Franco and Sons, Inc.",N,B00CEEU86G,home_improvement,Tools & Home Improvement,base_product,Base Product,gl_home_improvement,Y,N,Y,Y,N,Y,N,Roommates Rmk2249Gm Teenage Mutant Ninja Turtl...,RoomMates,RMK2249GM,BUILDING_MATERIAL,RoomMates,N
1,B00U25WT7A,office_product,Office Product,gl_office_product,Y,N,Y,Y,Y,Y,BLOCKIT RFID Protector Sleeves - Made in the U...,Blockit Security Products LLC,CARS6-TP,OFFICE_PRODUCTS,Blockit Security Products LLC,N,B01FUA9HP8,office_product,Office Product,base_product,Base Product,gl_wireless,Y,N,N,N,N,Y,Y,RFID Blocking Sleeves (10 Credit Card & 2 Pass...,01 Digitals,01DRFIDBS,WIRELESS_ACCESSORY,01 Digitals,N
2,B011BZ3GXU,consumer_electronics,Electronics,gl_wireless,Y,N,N,Y,Y,Y,"Dual Output Portable Charger, Oripow Spark A6 ...",Oripow,22SK6A6,PHONE_ACCESSORY,Oripow,N,B0194WDVHI,consumer_electronics,Electronics,base_product,Base Product,gl_wireless,Y,N,Y,Y,N,Y,Y,"Anker PowerCore 10000, One of the Smallest and...",Anker,A1263G12,CONSUMER_ELECTRONICS,Anker,N
4,B014UTSBZW,miscellaneous,Misc.,gl_pet_products,Y,N,N,Y,Y,Y,"Zuke's Genuine Jerky Dog Treats, Beef and Carr...",Zuke's,25065,PET_SUPPLIES,Zuke's,N,B008OV929C,miscellaneous,Misc.,base_product,Base Product,gl_pet_products,Y,N,Y,Y,N,Y,Y,"Hill's Science Diet Beef Jerky Dog Treats, Jer...",Hill's Pet Nutrition,1876,PET_SUPPLIES,Hill's Pet Nutrition,N
5,B01C5TFLSE,consumer_electronics,Electronics,gl_home_entertainment,Y,N,N,Y,N,N,Samsung UN55KS8000 55-Inch 4K Ultra HD Smart L...,Samsung,UN55KS8000FXZA,TELEVISION,Samsung,N,B012E97GJC,consumer_electronics,Electronics,base_product,Base Product,gl_home_entertainment,Y,N,N,Y,N,N,N,"LG Electronics 55"" LED TV (55SL5B-B)",LG,55SL5B-B,TELEVISION,LG,N


In [56]:
train3[["key_binding_description","cand_binding_description","label"]][train3["label"]==1]

Unnamed: 0,key_binding_description,cand_binding_description,label
2,Electronics,Electronics,1
4,Misc.,Misc.,1
5,Electronics,Electronics,1
7,Electronics,Electronics,1
8,Kitchen,Kitchen,1
...,...,...,...
36776,Health and Beauty,Health and Beauty,1
36791,Kitchen,Kitchen,1
36793,Wireless Phone,Wireless Phone,1
36799,Kitchen,Kitchen,1


In [57]:
tdrop=["key_ASIN","cand_ASIN","key_binding_description","cand_binding_description","key_Binding Code","cand_Binding Code","key_manufacturer_name","cand_manufacturer_name","key_model_number","cand_model_number","key_product_type","cand_product_type","key_publisher_studio_label","cand_publisher_studio_label"]
train4=train3.drop(tdrop,axis=1)

In [58]:
train4["newkeyname"]=train4["key_item_name"]+" "+ train4["key_Product Group Description"]

In [59]:
train4["newcandname"]=train4["cand_item_name"]+" "+train4["cand_Product Group Description"]

In [60]:
train4.drop(["key_item_name","key_Product Group Description","cand_item_name","cand_Product Group Description"],axis=1,inplace=True)

In [61]:
train4[["newkeyname","newcandname"]].head()

Unnamed: 0,newkeyname,newcandname
0,Nickelodeon Teenage Mutant Ninja Turtles You B...,Roommates Rmk2249Gm Teenage Mutant Ninja Turtl...
1,BLOCKIT RFID Protector Sleeves - Made in the U...,RFID Blocking Sleeves (10 Credit Card & 2 Pass...
2,"Dual Output Portable Charger, Oripow Spark A6 ...","Anker PowerCore 10000, One of the Smallest and..."
4,"Zuke's Genuine Jerky Dog Treats, Beef and Carr...","Hill's Science Diet Beef Jerky Dog Treats, Jer..."
5,Samsung UN55KS8000 55-Inch 4K Ultra HD Smart L...,"LG Electronics 55"" LED TV (55SL5B-B) gl_home_e..."


In [62]:
object_columns4 = train4.select_dtypes(include=['object']).columns.tolist()

In [63]:
for i in object_columns4:
  print(i)
  print(train4[i].unique())
  print("#"*50)
  print()

key_has_ean
['Y' 'N']
##################################################

key_has_platform
['N' 'Y']
##################################################

key_has_recommended_browse_nodes
['N' 'Y']
##################################################

key_has_upc
['Y' 'N']
##################################################

key_is_conveyable
['Y' 'N']
##################################################

key_Is Sortable
['N' 'Y']
##################################################

key_is_deleted
['N' 'Y']
##################################################

cand_classification_code
['base_product' 'variation_parent' 'product_bundle']
##################################################

cand_classification_description
['Base Product' 'Variation Parent' 'Product Bundle']
##################################################

cand_has_ean
['Y' 'N']
##################################################

cand_has_platform
['N' 'Y']
##################################################

cand_has_recommended_

In [64]:
train4[object_columns4].columns

Index(['key_has_ean', 'key_has_platform', 'key_has_recommended_browse_nodes',
       'key_has_upc', 'key_is_conveyable', 'key_Is Sortable', 'key_is_deleted',
       'cand_classification_code', 'cand_classification_description',
       'cand_has_ean', 'cand_has_platform',
       'cand_has_recommended_browse_nodes', 'cand_has_upc',
       'cand_is_advantage', 'cand_is_conveyable', 'cand_Is Sortable',
       'cand_is_deleted', 'newkeyname', 'newcandname'],
      dtype='object')

In [65]:
train4.drop("cand_classification_description",axis=1,inplace=True)

In [66]:
train5=train4.copy()

In [67]:
from sklearn.preprocessing import LabelEncoder

categorical_features = ['key_has_ean', 'key_has_platform', 'key_has_recommended_browse_nodes',
       'key_has_upc', 'key_is_conveyable', 'key_Is Sortable', 'key_is_deleted',
       'cand_classification_code',
       'cand_has_ean', 'cand_has_platform',
       'cand_has_recommended_browse_nodes', 'cand_has_upc',
       'cand_is_advantage', 'cand_is_conveyable', 'cand_Is Sortable',
       'cand_is_deleted']
# Create label encoder object
label_encoders = {}
# Iterate over each categorical feature
for feature in categorical_features:
    # Initialize LabelEncoder for the feature
    label_encoders[feature] = LabelEncoder()
    # Fit LabelEncoder on the feature and transform the data
    train5[feature] = label_encoders[feature].fit_transform(train5[feature])




In [68]:
object_columns4 = train4.select_dtypes(include=['object']).columns.tolist()
train5[object_columns4].head()

Unnamed: 0,key_has_ean,key_has_platform,key_has_recommended_browse_nodes,key_has_upc,key_is_conveyable,key_Is Sortable,key_is_deleted,cand_classification_code,cand_has_ean,cand_has_platform,cand_has_recommended_browse_nodes,cand_has_upc,cand_is_advantage,cand_is_conveyable,cand_Is Sortable,cand_is_deleted,newkeyname,newcandname
0,1,0,0,1,1,0,0,0,1,0,1,1,0,1,0,0,Nickelodeon Teenage Mutant Ninja Turtles You B...,Roommates Rmk2249Gm Teenage Mutant Ninja Turtl...
1,1,0,1,1,1,1,0,0,1,0,0,0,0,1,1,0,BLOCKIT RFID Protector Sleeves - Made in the U...,RFID Blocking Sleeves (10 Credit Card & 2 Pass...
2,1,0,0,1,1,1,0,0,1,0,1,1,0,1,1,0,"Dual Output Portable Charger, Oripow Spark A6 ...","Anker PowerCore 10000, One of the Smallest and..."
4,1,0,0,1,1,1,0,0,1,0,1,1,0,1,1,0,"Zuke's Genuine Jerky Dog Treats, Beef and Carr...","Hill's Science Diet Beef Jerky Dog Treats, Jer..."
5,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,Samsung UN55KS8000 55-Inch 4K Ultra HD Smart L...,"LG Electronics 55"" LED TV (55SL5B-B) gl_home_e..."
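The per-column `LabelEncoder` loop above works on the training frame, but the same mapping has to be applied to the test data later. As an alternative sketch (not the approach used in this notebook), scikit-learn's `OrdinalEncoder` can encode several columns at once and can map categories unseen during fitting to a sentinel value. The toy frames and the value `some_new_code` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_toy = pd.DataFrame({
    "key_has_ean": ["Y", "N", "Y"],
    "cand_classification_code": ["base_product", "variation_parent", "product_bundle"],
})
test_toy = pd.DataFrame({
    "key_has_ean": ["N", "Y"],
    "cand_classification_code": ["base_product", "some_new_code"],  # unseen category
})

# Categories are sorted per column, so the integer codes match what LabelEncoder would assign.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train_toy)

# Reuse the fitted encoder on test data; unseen categories become -1 instead of raising.
test_enc = encoder.transform(test_toy)

print(train_enc)
print(test_enc)
```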


In [69]:
train6=train5.copy()

In [70]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Referred to Google for syntax

# Define a function to calculate cosine similarity
def calculate_cosine_similarity(text1, text2):
    # Initialize a TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer()

    # Fit and transform the text data to obtain the TF-IDF matrix
    tfidf_matrix = tfidf_vectorizer.fit_transform([text1, text2])

    # Calculate the cosine similarity between the two TF-IDF vectors
    cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

    return cosine_sim

# Apply the function to calculate cosine similarity for each pair of texts
def calculate_cosine_similarity_for_df(df, text1_col, text2_col):
    similarities = []
    for index, row in df.iterrows():
        text1 = row[text1_col]
        text2 = row[text2_col]
        similarity = calculate_cosine_similarity(text1, text2)
        similarities.append(similarity)
    return similarities

# Add the cosine similarity values to the DataFrame
train6['cosine_similarity'] = calculate_cosine_similarity_for_df(train6, 'newkeyname', 'newcandname')



In [71]:
train6["cosine_similarity"]

0        0.155929
1        0.172323
2        0.163357
4        0.395534
5        0.272060
           ...   
36796    0.201493
36797    0.072404
36798    0.175786
36799    0.380873
36801    0.225765
Name: cosine_similarity, Length: 20621, dtype: float64
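The row-by-row loop above fits a new `TfidfVectorizer` for every pair, which can be slow on 20,000+ rows. A possible vectorized variant is sketched below. Note that it is a different technique: a single vectorizer is fitted on all names, so the IDF weights are corpus-wide and the resulting values will generally differ from the per-pair scores shown above. The function name `rowwise_tfidf_cosine` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def rowwise_tfidf_cosine(df, col_a, col_b):
    """Cosine similarity between two text columns, one value per row."""
    vectorizer = TfidfVectorizer()
    # Fit once on all texts from both columns so both share one vocabulary.
    vectorizer.fit(pd.concat([df[col_a], df[col_b]]))
    a = vectorizer.transform(df[col_a])
    b = vectorizer.transform(df[col_b])
    # TfidfVectorizer L2-normalizes rows by default, so the row-wise
    # dot product equals the cosine similarity.
    sims = a.multiply(b).sum(axis=1)
    return np.asarray(sims).ravel()

toy = pd.DataFrame({
    "newkeyname": ["red cotton shirt", "usb charger cable"],
    "newcandname": ["red cotton shirt", "wooden dining table"],
})
sims = rowwise_tfidf_cosine(toy, "newkeyname", "newcandname")
print(sims)
```

Identical names score close to 1 and names with no shared tokens score 0, which is the same qualitative behavior as the per-pair function.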

In [72]:
train_final=train6.drop(["newkeyname","newcandname"],axis=1)

In [73]:
train_final.head()

Unnamed: 0,ID,label,key_ean,key_fma_qualified_price_max,key_Product Group Code,key_has_ean,key_has_platform,key_has_recommended_browse_nodes,key_has_upc,key_is_conveyable,key_Is Sortable,key_item_height,key_item_length,key_item_package_quantity,key_item_weight,key_item_width,key_product_type_id,key_upc,key_pkg_height,key_pkg_length,key_pkg_weight,key_pkg_width,key_is_deleted,key_version,cand_classification_code,cand_ean,cand_fma_qualified_price_max,cand_Product Group Code,cand_has_ean,cand_has_platform,cand_has_recommended_browse_nodes,cand_has_upc,cand_is_advantage,cand_is_conveyable,cand_Is Sortable,cand_item_classification_id,cand_item_height,cand_item_length,cand_item_package_quantity,cand_item_weight,cand_item_width,cand_product_type_id,cand_upc,cand_pkg_height,cand_pkg_length,cand_pkg_weight,cand_pkg_width,cand_is_deleted,cand_version,cosine_similarity
0,34016,0,32281220000.0,111.96,201,1,0,0,1,1,0,1.0,86.0,1.0,6.0,66.0,953,32281220000.0,10.0,20.0,6.3,15.0,0,272,0,885401000000.0,35.7,60,1,0,1,1,0,1,0,1.0,0.0,40.0,1.0,0.53,18.0,27713,885401000000.0,1.574803,18.110236,0.529104,5.11811,0,2867,0.155929
1,3581,0,784673000000.0,15.71,229,1,0,1,1,1,1,2.0,2.5,6.0,7.766398,0.1,3523,784673000000.0,0.2,4.8,0.022046,4.0,0,227,0,6142540000000.0,19.41,107,1,0,0,0,0,1,1,1.0,0.3,6.75,1.0,0.110231,4.5,3609,612922900000.0,0.3,6.75,0.1,4.5,0,65,0.172323
2,36025,1,712323000000.0,43.37,107,1,0,0,1,1,1,0.83,5.94,1.0,0.789375,2.24,648,712323000000.0,2.1,7.2,1.05,4.6,0,61,0,848061000000.0,44.41,107,1,0,1,1,0,1,1,1.0,0.86614,2.3622,1.0,0.396832,3.62204,3521,848061000000.0,2.007874,5.23622,0.654773,3.937008,0,1532,0.163357
4,14628,1,613423000000.0,23.85,199,1,0,0,1,1,1,9.33,2.75,1.0,0.438,7.5,937,613423000000.0,0.2,9.2,0.25,7.5,0,1671,0,52742190000.0,14.73,199,1,0,1,1,0,1,1,1.0,8.5,11.75,1.0,6.909666,9.875,937,52742190000.0,1.102362,7.874016,0.396832,5.19685,0,1253,0.395534
5,12882,1,653342000000.0,1496.73,504,1,0,0,1,0,0,30.5,48.2,1.0,39.2,9.2,659,653342000000.0,6.5,51.6,51.2,31.5,0,12014,0,719192000000.0,105.820569,504,1,0,0,1,0,0,0,1.0,30.5,48.7,1.0,46.52,11.5,659,719192000000.0,6.7,52.4,47.4,31.8,0,3189,0.27206


In [74]:
train_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20621 entries, 0 to 36801
Data columns (total 50 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   ID                                 20621 non-null  int64  
 1   label                              20621 non-null  int64  
 2   key_ean                            20621 non-null  float64
 3   key_fma_qualified_price_max        20621 non-null  float64
 4   key_Product Group Code             20621 non-null  int64  
 5   key_has_ean                        20621 non-null  int64  
 6   key_has_platform                   20621 non-null  int64  
 7   key_has_recommended_browse_nodes   20621 non-null  int64  
 8   key_has_upc                        20621 non-null  int64  
 9   key_is_conveyable                  20621 non-null  int64  
 10  key_Is Sortable                    20621 non-null  int64  
 11  key_item_height                    20621 non-null  flo

In [75]:
y=train_final["label"]
x=train_final.drop(["label","ID"],axis=1)

In [76]:
print(x.shape)
y.shape

(20621, 48)


(20621,)

### 2.3 <a name="23">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

We already have training and test datasets; however, the test dataset is missing the labels. The goal of the project is to predict these labels.

To produce a validation set for evaluating model performance before submitting, split the training dataset into train and validation sets. The validation data you get here will be used later in section 3 to tune your classifier.
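A minimal sketch of such a split on toy data is shown here. It assumes that a reproducible split (`random_state`) with the class ratio preserved in both parts (`stratify`) is wanted; the toy features and the seed are illustrative choices, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels (hypothetical data): 40 negatives, 60 positives.
rng = np.random.default_rng(0)
x_toy = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y_toy = pd.Series([0] * 40 + [1] * 60, name="label")

# stratify keeps the 40/60 class ratio in both parts; random_state makes the split repeatable.
x_tr, x_val, y_tr, y_val = train_test_split(
    x_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

print(x_tr.shape, x_val.shape)
print(y_val.mean())
```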

In [77]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20)

In [78]:
train_final.to_csv("/content/drive/MyDrive/ml_data/train_final.csv")

In [79]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()

In [80]:
rf.fit(x_train,y_train)

In [81]:
y_pred=rf.predict(x_test)

In [86]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
cm=confusion_matrix(y_test,y_pred)
print("Confusion Matrix")
print(cm)

Confusion Matrix
[[1310  691]
 [ 554 1570]]


In [87]:
cr=classification_report(y_test,y_pred)

In [88]:
print("classification report")
print(cr)

classification report
              precision    recall  f1-score   support

           0       0.70      0.65      0.68      2001
           1       0.69      0.74      0.72      2124

    accuracy                           0.70      4125
   macro avg       0.70      0.70      0.70      4125
weighted avg       0.70      0.70      0.70      4125



In [89]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6981818181818182
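The figures in the classification report can be recomputed directly from the confusion matrix printed above, which is a useful sanity check. The sketch below uses only the four counts from that output and follows scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[1310, 691],
               [554, 1570]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions over all predictions
precision_1 = tp / (tp + fp)           # of pairs predicted as substitutes, how many are
recall_1 = tp / (tp + fn)              # of true substitutes, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values agree with the report: accuracy about 0.70, and 0.69, 0.74, and 0.72 for precision, recall, and F1 of class 1.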


## 3. <a name="3">Train (and Tune) a Classifier</a> (Implement)
(<a href="#0">Go to top</a>)

Train and tune the classifier
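One common way to tune a random forest is a cross-validated grid search. The sketch below runs on synthetic data from `make_classification`; the parameter grid, the fold count, and the seeds are illustrative assumptions rather than recommended settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed feature matrix (hypothetical data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Small illustrative grid; a real search would likely cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# The best estimator is refit on all of X_tr; score it on the held-out validation set.
val_accuracy = search.score(X_val, y_val)
print(search.best_params_)
print(val_accuracy)
```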

In [90]:
from joblib import dump, load
dump(rf, 'random_forest_model.joblib')
loaded_model = load('random_forest_model.joblib')


In [None]:
# Implement here



## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained classifier to predict the labels on the test set. Test accuracy will be displayed after a valid submission to the leaderboard.

In [None]:
# Implement here

# Get test data to test the classifier
# ! test data should come from public_test_features.csv !
# ...

# Use the trained model to make predictions on the test dataset
# test_predictions = ...


In [141]:
k2=(data_test.isnull().sum()/data_test.shape[0])*100  # percentage of missing values per column
columns_with_null2 = k2[k2 >= 50].index.to_list()  # columns with at least 50% missing values
print(columns_with_null2)
print(len(columns_with_null2))


['key_brand_code', 'key_case_pack_quantity', 'key_color_map', 'key_country_of_origin', 'key_cpsia_cautionary_statement', 'key_customer_return_method', 'key_customer_return_policy', 'key_delivery_option', 'key_discontinued_date', 'key_esrb_age_rating', 'key_esrb_descriptors', 'key_excluded_direct_browse_node_id', 'key_fedas_id', 'key_fma_override', 'key_inner_package_type', 'key_is_adult_product', 'key_is_certified_organic', 'key_is_phone_upgradeable', 'key_is_super_saver_shipping_excl', 'key_isbn', 'key_item_display_diameter', 'key_item_display_height', 'key_item_display_length', 'key_item_display_length_uom', 'key_item_display_volume', 'key_item_display_volume_uom', 'key_item_display_weight', 'key_item_display_weight_uom', 'key_item_display_width', 'key_manufacturer_sku', 'key_manufacturer_vendor_code', 'key_max_weight_recommendation', 'key_mfg_series_number', 'key_min_weight_recommendation', 'key_monthly_recurring_charge', 'key_number_of_items', 'key_number_of_licenses', 'key_number_

In [142]:
key2=[]
cand2=[]
other2=[]
for i in columns_with_null2:
  if i.startswith("key"):
    cand2.append(i.replace("key","cand",1))
  elif i.startswith("cand"):
    key2.append(i.replace("cand","key",1))
  else:
    other2.append(i)
columns_with_null2.extend(key2)
columns_with_null2.extend(cand2)
print(columns_with_null2)
print(len(columns_with_null2))



['key_brand_code', 'key_case_pack_quantity', 'key_color_map', 'key_country_of_origin', 'key_cpsia_cautionary_statement', 'key_customer_return_method', 'key_customer_return_policy', 'key_delivery_option', 'key_discontinued_date', 'key_esrb_age_rating', 'key_esrb_descriptors', 'key_excluded_direct_browse_node_id', 'key_fedas_id', 'key_fma_override', 'key_inner_package_type', 'key_is_adult_product', 'key_is_certified_organic', 'key_is_phone_upgradeable', 'key_is_super_saver_shipping_excl', 'key_isbn', 'key_item_display_diameter', 'key_item_display_height', 'key_item_display_length', 'key_item_display_length_uom', 'key_item_display_volume', 'key_item_display_volume_uom', 'key_item_display_weight', 'key_item_display_weight_uom', 'key_item_display_width', 'key_manufacturer_sku', 'key_manufacturer_vendor_code', 'key_max_weight_recommendation', 'key_mfg_series_number', 'key_min_weight_recommendation', 'key_monthly_recurring_charge', 'key_number_of_items', 'key_number_of_licenses', 'key_number_

In [143]:
columns_drop2=list(set(columns_with_null2))

print(len(columns_drop2))
test2=data_test.drop(columns_drop2,axis=1)

130


In [163]:
print(columns_drop2)

['cand_publication_month', 'key_customer_return_method', 'key_item_display_width', 'key_ordering_channel', 'cand_number_of_points', 'cand_publication_date', 'cand_country_of_origin', 'cand_cpsia_cautionary_statement', 'cand_external_testing_certification', 'key_program_member', 'cand_customer_return_policy', 'cand_item_display_length_uom', 'cand_recall_notice_publication_date', 'cand_mfg_series_number', 'cand_esrb_age_rating', 'key_is_certified_organic', 'key_color_map', 'key_target_gender', 'key_publication_day', 'key_country_of_origin', 'key_mfg_series_number', 'cand_excluded_direct_browse_node_id', 'key_manufacturer_sku', 'key_item_display_length', 'key_delivery_option', 'cand_wireless_provider_code', 'key_number_of_items', 'cand_number_of_items', 'cand_is_certified_organic', 'cand_fma_override', 'cand_publication_day', 'cand_release_date_embargo_level', 'key_customer_return_policy', 'key_preferred_vendor', 'cand_program_member', 'cand_is_super_saver_shipping_excl', 'cand_color_map'

In [144]:
test2.dropna(thresh=92,inplace=True)


In [145]:
test2.shape

(8156, 97)

In [146]:
object_columns2 = test2.select_dtypes(include=['object']).columns.tolist()
num_columns2 = test2.select_dtypes(include=['int','float']).columns.tolist()
num_int_columns2 = test2.select_dtypes(include=['int']).columns.tolist()
num_float_columns2 = test2.select_dtypes(include=['float']).columns.tolist()

In [147]:
dates=["key_creation_date","key_dw_creation_date","key_dw_last_updated","key_last_updated","cand_creation_date","cand_dw_creation_date","cand_dw_last_updated","cand_last_updated"]
test2.drop(dates,axis=1,inplace=True)

In [148]:
test2[num_columns2].nunique()

ID                              8156
key_Region Id                      1
key_MarketPlace Id                 1
key_ean                          917
key_fma_qualified_price_max      975
key_Product Group Code            39
key_item_classification_id         1
key_item_height                  416
key_item_length                  483
key_item_package_quantity         14
key_item_weight                  428
key_item_width                   438
key_product_type_id              123
key_upc                          852
key_pkg_height                   333
key_pkg_length                   486
key_pkg_weight                   539
key_pkg_width                    428
key_version                      916
cand_Region Id                     1
cand_MarketPlace Id                1
cand_ean                        5230
cand_fma_qualified_price_max    5026
cand_Product Group Code           45
cand_item_classification_id        2
cand_item_height                1443
cand_item_length                1612
c

In [149]:
ts3=["key_Product Group Code","key_item_package_quantity","cand_Product Group Code","cand_item_classification_id","cand_item_package_quantity"]


In [150]:
for i in num_int_columns2:
  if i not in ts3:
    test2[i]=test2[i].fillna(int(test2[i].mean()))
for i in num_float_columns2:
  if i not in ts3:
    test2[i]=test2[i].fillna(test2[i].mean())
test2["cand_item_package_quantity"]=test2["cand_item_package_quantity"].fillna(test2["cand_item_package_quantity"].mean())


In [151]:
test2["key_item_package_quantity"].value_counts()

1.0      7778
2.0        45
10.0       33
4.0        29
6.0        29
3.0        18
12.0       16
60.0       14
36.0        8
100.0       7
24.0        4
5.0         3
21.0        3
500.0       2
Name: key_item_package_quantity, dtype: int64

In [152]:
test2["key_item_package_quantity"]=test2["key_item_package_quantity"].fillna(1.0)
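The cells above fill missing test values with statistics computed on the test set itself. A frequently used alternative, sketched here as an assumption rather than as the notebook's method, is to learn the fill values on the training data and reuse them on the test data, so both sets are imputed identically. The toy frames are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frames (hypothetical data).
train_num = pd.DataFrame({
    "key_item_height": [1.0, 3.0, np.nan],
    "key_pkg_weight": [2.0, np.nan, 4.0],
})
test_num = pd.DataFrame({
    "key_item_height": [np.nan, 10.0],
    "key_pkg_weight": [np.nan, np.nan],
})

# Learn column means from the training data only.
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_num)

# Apply the training means to the test data; transform returns a NumPy array.
test_filled = pd.DataFrame(
    imputer.transform(test_num),
    columns=test_num.columns,
    index=test_num.index,
)
print(test_filled)
```

This also covers the case where a test column is entirely missing, which a test-set mean cannot fill.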

In [153]:
object_columns2 = test2.select_dtypes(include=['object']).columns.tolist()


In [154]:
for i in object_columns2:
    test2[i]=test2[i].fillna(test2[i].mode()[0])
test2[object_columns2].isna().sum()

key_ASIN                             0
key_Binding Code                     0
key_binding_description              0
key_classification_code              0
key_classification_description       0
key_currency_code                    0
key_Product Group Description        0
key_has_ean                          0
key_has_online_play                  0
key_has_platform                     0
key_has_recommended_browse_nodes     0
key_has_upc                          0
key_is_advantage                     0
key_is_conveyable                    0
key_is_discontinued                  0
key_is_manufacture_on_demand         0
key_Is Sortable                      0
key_item_name                        0
key_language_code                    0
key_manufacturer_name                0
key_model_number                     0
key_product_type                     0
key_publisher_studio_label           0
key_pkg_dimensional_uom              0
key_pkg_weight_uom                   0
key_is_deleted           

In [155]:
uqv2=test2[num_columns2].nunique()[test2[num_columns2].nunique()==1]
uniq_val2=uqv2.index.to_list()
test2.drop(uniq_val2,axis=1,inplace=True)


In [156]:
uqv3=test2[object_columns2].nunique()[test2[object_columns2].nunique()==1]
uq_val_obj2=uqv3.index.to_list()
test2.drop(uq_val_obj2,axis=1,inplace=True)

In [157]:
test2.shape

(8156, 67)

In [158]:
test2.drop(tdrop,axis=1,inplace=True)
test2["newkeyname"]=test2["key_item_name"]+" "+ test2["key_Product Group Description"]
test2["newcandname"]=test2["cand_item_name"]+" "+test2["cand_Product Group Description"]
test2.drop(["key_item_name","key_Product Group Description","cand_item_name","cand_Product Group Description"],axis=1,inplace=True)

In [159]:
from sklearn.preprocessing import LabelEncoder

categorical_features = ['key_has_ean', 'key_has_platform', 'key_has_recommended_browse_nodes',
       'key_has_upc', 'key_is_conveyable', 'key_Is Sortable', 'key_is_deleted',
       'cand_classification_code',
       'cand_has_ean', 'cand_has_platform',
       'cand_has_recommended_browse_nodes', 'cand_has_upc',
       'cand_is_advantage', 'cand_is_conveyable', 'cand_Is Sortable',
       'cand_is_deleted']
# Reuse the encoders fitted on the training data (In [67]) so that test
# categories map to the same integers the model was trained on.
# Note: transform() raises a ValueError if the test data contains a
# category that was not seen during training.
for feature in categorical_features:
    test2[feature] = label_encoders[feature].transform(test2[feature])




In [160]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Define a function to calculate cosine similarity
def calculate_cosine_similarity(text1, text2):
    # Initialize a TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer()

    # Fit and transform the text data to obtain the TF-IDF matrix
    tfidf_matrix = tfidf_vectorizer.fit_transform([text1, text2])

    # Calculate the cosine similarity between the two TF-IDF vectors
    cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

    return cosine_sim

# Apply the function to calculate cosine similarity for each pair of texts
def calculate_cosine_similarity_for_df(df, text1_col, text2_col):
    similarities = []
    for index, row in df.iterrows():
        text1 = row[text1_col]
        text2 = row[text2_col]
        similarity = calculate_cosine_similarity(text1, text2)
        similarities.append(similarity)
    return similarities

# Add the cosine similarity values to the DataFrame
test2['cosine_similarity'] = calculate_cosine_similarity_for_df(test2, 'newkeyname', 'newcandname')



In [161]:
test3=test2.drop(["newkeyname","newcandname","ID","cand_classification_description"],axis=1)
test3.isna().sum()

key_ean                              0
key_fma_qualified_price_max          0
key_Product Group Code               0
key_has_ean                          0
key_has_platform                     0
key_has_recommended_browse_nodes     0
key_has_upc                          0
key_is_conveyable                    0
key_Is Sortable                      0
key_item_height                      0
key_item_length                      0
key_item_package_quantity            0
key_item_weight                      0
key_item_width                       0
key_product_type_id                  0
key_upc                              0
key_pkg_height                       0
key_pkg_length                       0
key_pkg_weight                       0
key_pkg_width                        0
key_is_deleted                       0
key_version                          0
cand_classification_code             0
cand_ean                             0
cand_fma_qualified_price_max         0
cand_Product Group Code  

In [162]:
new_predictions = loaded_model.predict(test3)



In [164]:
print(num_int_columns2)
print()
print(num_float_columns2)
print()
print(object_columns2)
print()
print(uniq_val2)
print()
print(uq_val_obj2)


['ID', 'key_Region Id', 'key_MarketPlace Id', 'key_Product Group Code', 'key_product_type_id', 'key_version', 'cand_Region Id', 'cand_MarketPlace Id', 'cand_Product Group Code', 'cand_product_type_id', 'cand_version']

['key_ean', 'key_fma_qualified_price_max', 'key_item_classification_id', 'key_item_height', 'key_item_length', 'key_item_package_quantity', 'key_item_weight', 'key_item_width', 'key_upc', 'key_pkg_height', 'key_pkg_length', 'key_pkg_weight', 'key_pkg_width', 'cand_ean', 'cand_fma_qualified_price_max', 'cand_item_classification_id', 'cand_item_height', 'cand_item_length', 'cand_item_package_quantity', 'cand_item_weight', 'cand_item_width', 'cand_upc', 'cand_pkg_height', 'cand_pkg_length', 'cand_pkg_weight', 'cand_pkg_width']

['key_ASIN', 'key_Binding Code', 'key_binding_description', 'key_classification_code', 'key_classification_description', 'key_currency_code', 'key_Product Group Description', 'key_has_ean', 'key_has_online_play', 'key_has_platform', 'key_has_recommen

In [165]:
def inference(test2):

  columns_drop2=['cand_publication_month', 'key_customer_return_method', 'key_item_display_width', 'key_ordering_channel', 'cand_number_of_points', 'cand_publication_date', 'cand_country_of_origin', 'cand_cpsia_cautionary_statement', 'cand_external_testing_certification', 'key_program_member', 'cand_customer_return_policy', 'cand_item_display_length_uom', 'cand_recall_notice_publication_date', 'cand_mfg_series_number', 'cand_esrb_age_rating', 'key_is_certified_organic', 'key_color_map', 'key_target_gender', 'key_publication_day', 'key_country_of_origin', 'key_mfg_series_number', 'cand_excluded_direct_browse_node_id', 'key_manufacturer_sku', 'key_item_display_length', 'key_delivery_option', 'cand_wireless_provider_code', 'key_number_of_items', 'cand_number_of_items', 'cand_is_certified_organic', 'cand_fma_override', 'cand_publication_day', 'cand_release_date_embargo_level', 'key_customer_return_policy', 'key_preferred_vendor', 'cand_program_member', 'cand_is_super_saver_shipping_excl', 'cand_color_map', 'key_recall_notice_expiration_date', 'cand_item_display_volume_uom', 'cand_isbn', 'key_number_of_licenses', 'cand_recall_description', 'key_esrb_descriptors', 'cand_item_display_height', 'cand_monthly_recurring_charge', 'cand_esrb_descriptors', 'key_max_weight_recommendation', 'cand_wireless_provider', 'key_is_adult_product', 'key_release_date_embargo_level', 'key_recall_description', 'cand_publisher', 'key_number_of_points', 'cand_variation_theme_id', 'key_discontinued_date', 'key_item_display_volume', 'key_inner_package_type', 'key_case_pack_quantity', 'cand_fedas_id', 'key_brand_code', 'cand_manufacturer_vendor_code', 'key_is_phone_upgradeable', 'cand_publication_year', 'cand_brand_code', 'key_is_super_saver_shipping_excl', 'cand_manufacturer_sku', 'key_item_display_weight_uom', 'cand_is_adult_product', 'cand_max_weight_recommendation', 'key_unit_count', 'cand_preferred_vendor', 'cand_video_game_region', 'cand_unit_count', 'cand_item_display_diameter', 
'key_monthly_recurring_charge', 'key_variation_theme_id', 'key_item_display_weight', 'cand_target_gender', 'key_item_display_height', 'key_wireless_provider', 'cand_product_sample_received_day', 'cand_item_display_weight', 'key_excluded_direct_browse_node_id', 'key_item_display_length_uom', 'key_publication_year', 'key_publication_month', 'cand_is_phone_upgradeable', 'key_number_of_pages', 'key_variation_theme_description', 'key_publisher_code', 'key_program_member_code', 'key_video_game_region_description', 'key_manufacturer_vendor_code', 'key_fedas_id', 'cand_video_game_region_description', 'cand_recall_notice_receive_date', 'cand_case_pack_quantity', 'cand_inner_package_type', 'cand_recall_notice_expiration_date', 'cand_customer_return_method', 'cand_variation_theme_description', 'key_external_testing_certification', 'cand_discontinued_date', 'key_product_sample_received_day', 'cand_ordering_channel', 'key_recall_notice_publication_date', 'key_item_display_diameter', 'key_publication_date', 'cand_item_display_width', 'key_min_weight_recommendation', 'key_recall_external_identifier', 'key_cpsia_cautionary_statement', 'cand_delivery_option', 'key_fma_override', 'cand_min_weight_recommendation', 'key_wireless_provider_code', 'key_esrb_age_rating', 'key_item_display_volume_uom', 'cand_item_display_weight_uom', 'cand_publisher_code', 'key_publisher', 'cand_item_display_volume', 'cand_recall_external_identifier', 'key_video_game_region', 'key_isbn', 'key_recall_notice_receive_date', 'cand_number_of_pages', 'cand_number_of_licenses', 'cand_program_member_code', 'cand_item_display_length']

  test2.drop(columns_drop2,axis=1,inplace=True)
  threshold = 92

  # Reject the whole batch if any row has too few populated fields
  for index, row in test2.iterrows():
      # Count non-null values in the row
      non_null_count = row.count()

      # Stop early: the function returns a message instead of predictions
      if non_null_count < threshold:
          return "not enough data"
  dates=["key_creation_date","key_dw_creation_date","key_dw_last_updated","key_last_updated","cand_creation_date","cand_dw_creation_date","cand_dw_last_updated","cand_last_updated"]
  test2.drop(dates,axis=1,inplace=True)
  ts3=["key_Product Group Code","key_item_package_quantity","cand_Product Group Code","cand_item_classification_id","cand_item_package_quantity"]
  num_int_columns2 = ['ID', 'key_Region Id', 'key_MarketPlace Id', 'key_Product Group Code', 'key_product_type_id', 'key_version', 'cand_Region Id', 'cand_MarketPlace Id', 'cand_Product Group Code', 'cand_product_type_id', 'cand_version']

  num_float_columns2 = ['key_ean', 'key_fma_qualified_price_max', 'key_item_classification_id', 'key_item_height', 'key_item_length', 'key_item_package_quantity', 'key_item_weight', 'key_item_width', 'key_upc', 'key_pkg_height', 'key_pkg_length', 'key_pkg_weight', 'key_pkg_width', 'cand_ean', 'cand_fma_qualified_price_max', 'cand_item_classification_id', 'cand_item_height', 'cand_item_length', 'cand_item_package_quantity', 'cand_item_weight', 'cand_item_width', 'cand_upc', 'cand_pkg_height', 'cand_pkg_length', 'cand_pkg_weight', 'cand_pkg_width']

  for i in num_int_columns2:
    if i not in ts3:
      test2[i]=test2[i].fillna(int(test2[i].mean()))
  for i in num_float_columns2:
    if i not in ts3:
      test2[i]=test2[i].fillna(test2[i].mean())
  test2["cand_item_package_quantity"]=test2["cand_item_package_quantity"].fillna(test2["cand_item_package_quantity"].mean())
  test2["key_item_package_quantity"]=test2["key_item_package_quantity"].fillna(1.0)
  object_columns2 = ['key_ASIN', 'key_Binding Code', 'key_binding_description', 'key_classification_code', 'key_classification_description', 'key_currency_code', 'key_Product Group Description', 'key_has_ean', 'key_has_online_play', 'key_has_platform', 'key_has_recommended_browse_nodes', 'key_has_upc', 'key_is_advantage', 'key_is_conveyable', 'key_is_discontinued', 'key_is_manufacture_on_demand', 'key_Is Sortable', 'key_item_name', 'key_language_code', 'key_manufacturer_name', 'key_model_number', 'key_product_type', 'key_publisher_studio_label', 'key_pkg_dimensional_uom', 'key_pkg_weight_uom', 'key_is_deleted', 'cand_ASIN', 'cand_Binding Code', 'cand_binding_description', 'cand_classification_code', 'cand_classification_description', 'cand_currency_code', 'cand_Product Group Description', 'cand_has_ean', 'cand_has_online_play', 'cand_has_platform', 'cand_has_recommended_browse_nodes', 'cand_has_upc', 'cand_is_advantage', 'cand_is_conveyable', 'cand_is_discontinued', 'cand_is_manufacture_on_demand', 'cand_Is Sortable', 'cand_item_name', 'cand_language_code', 'cand_manufacturer_name', 'cand_model_number', 'cand_product_type', 'cand_publisher_studio_label', 'cand_pkg_dimensional_uom', 'cand_pkg_weight_uom', 'cand_is_deleted']


  for i in object_columns2:
      test2[i]=test2[i].fillna(test2[i].mode()[0])

  uniq_val2=['key_Region Id', 'key_MarketPlace Id', 'key_item_classification_id', 'cand_Region Id', 'cand_MarketPlace Id']

  test2.drop(uniq_val2,axis=1,inplace=True)


  uq_val_obj2=['key_classification_code', 'key_classification_description', 'key_currency_code', 'key_has_online_play', 'key_is_advantage', 'key_is_discontinued', 'key_is_manufacture_on_demand', 'key_language_code', 'key_pkg_dimensional_uom', 'key_pkg_weight_uom', 'cand_currency_code', 'cand_has_online_play', 'cand_is_discontinued', 'cand_is_manufacture_on_demand', 'cand_language_code', 'cand_pkg_dimensional_uom', 'cand_pkg_weight_uom']

  test2.drop(uq_val_obj2,axis=1,inplace=True)
  # tdrop is not defined in this function; it must already exist as a global from an earlier cell
  test2.drop(tdrop,axis=1,inplace=True)
  test2["newkeyname"]=test2["key_item_name"]+" "+ test2["key_Product Group Description"]
  test2["newcandname"]=test2["cand_item_name"]+" "+test2["cand_Product Group Description"]
  test2.drop(["key_item_name","key_Product Group Description","cand_item_name","cand_Product Group Description"],axis=1,inplace=True)
  categorical_features = ['key_has_ean', 'key_has_platform', 'key_has_recommended_browse_nodes',
        'key_has_upc', 'key_is_conveyable', 'key_Is Sortable', 'key_is_deleted',
        'cand_classification_code',
        'cand_has_ean', 'cand_has_platform',
        'cand_has_recommended_browse_nodes', 'cand_has_upc',
        'cand_is_advantage', 'cand_is_conveyable', 'cand_Is Sortable',
        'cand_is_deleted']
  # Keep one LabelEncoder per categorical feature
  label_encoders = {}
  for feature in categorical_features:
      label_encoders[feature] = LabelEncoder()
      # NOTE: fitting on the inference data derives the codes from this batch alone,
      # so they may differ from the codes the model saw during training
      test2[feature] = label_encoders[feature].fit_transform(test2[feature])
  test2['cosine_similarity'] = calculate_cosine_similarity_for_df(test2, 'newkeyname', 'newcandname')
  test3=test2.drop(["newkeyname","newcandname","ID","cand_classification_description"],axis=1)
  new_predictions = loaded_model.predict(test3)
  return new_predictions
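
The `inference` function fits a fresh `LabelEncoder` on each categorical column of the incoming data. `LabelEncoder` assigns integer codes by the sorted order of the values it sees, so a batch that lacks some category can receive different codes than the training data did. The sketch below imitates that behavior with plain dictionaries to show the difference. The helpers `fit_codes` and `apply_codes` and the sample values are hypothetical, and mapping unseen values to `-1` is one possible convention, not something the notebook does.

```python
def fit_codes(values):
    """Map each distinct value to an integer code in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def apply_codes(values, mapping, unknown=-1):
    """Encode values with an existing mapping; unseen values get `unknown`."""
    return [mapping.get(v, unknown) for v in values]


train_values = ["N", "Y", "N", "Y"]  # hypothetical training column
batch_values = ["Y", "Y"]            # hypothetical inference batch without "N"

train_mapping = fit_codes(train_values)

# Refitting on the batch: "Y" is the only value, so it becomes code 0
refit_codes = apply_codes(batch_values, fit_codes(batch_values))

# Reusing the training mapping: "Y" keeps the code it had during training
reused_codes = apply_codes(batch_values, train_mapping)

print("refit on batch: ", refit_codes)
print("reused mapping: ", reused_codes)
```

One way to get the second behavior in the notebook would be to save the encoders fitted during training alongside the model and call `transform` on them here, assuming those fitted encoders are still available.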




