## This experiment was performed using LightFM, a very popular recommender library that can take in different data modalities such as text, image, graphical, etc. Please check out the official documentation at the link mentioned below.  

The objective of this experiment is to find the hyperparameters that give the highest precision@12 for the problem at hand.
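As a reminder of what the metric measures, here is a minimal sketch of precision@k for a single customer. The function name `precision_at_k_single` and the toy item ids are made up for illustration; the notebook itself relies on LightFM's `precision_at_k`, which computes the score for every customer before the mean is taken.

```python
def precision_at_k_single(recommended, relevant, k=12):
    """Fraction of the top-k recommended items that the customer actually bought."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Toy example: 4 recommendations, 2 of which are relevant
recommended = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}
print(precision_at_k_single(recommended, relevant, k=4))  # 0.5
```

With k=12 and only a handful of purchases per customer in the test set, even a good model will show small absolute values for this metric.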

Please refer to the following notebooks, which show the different experiments based on LightFM:  
1. LightFM with only customer-article interactions:  
Link: https://www.kaggle.com/rickykonwar/h-m-lightfm-nofeatures  

2. LightFM with customer-article interactions + 5 categorical article features (product type name, product group name, graphical appearance name, colour group name, department name)  
Link: https://www.kaggle.com/code/rickykonwar/h-m-lightfm-5articlefeatures 

3. LightFM with customer-article interactions + 5 categorical article features (product type name, product group name, graphical appearance name, colour group name, department name) + article description embeddings  
Link: https://www.kaggle.com/rickykonwar/h-m-lightfm-2articlefeatures  

Link to the LightFM documentation:  
making.lyst.com/lightfm/docs/home.html  

This notebook incorporates hyperparameter tuning for the problem statement.

Hope you like this notebook; please feel free to upvote it.

## Importing Required Libraries

In [1]:
# Importing Libraries
import sys, os
import re
import tqdm
import time
import pickle
import random
import itertools

import pandas as pd
import numpy as np
import scipy.sparse as sparse
%matplotlib inline
import matplotlib.pyplot as plt

# lightfm 
from lightfm import LightFM # model
from lightfm.evaluation import precision_at_k
from lightfm.cross_validation import random_train_test_split

# multiprocessing for inferencing
from multiprocessing import Pool

In [2]:
data_path = r'../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv'
customer_data_path = r'../input/h-and-m-personalized-fashion-recommendations/customers.csv'
article_data_path = r'../input/h-and-m-personalized-fashion-recommendations/articles.csv'
submission_data_path = r'../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv'

In [3]:
# Data Extraction
def create_data(datapath, data_type=None):
    if data_type is None:
        df = pd.read_csv(datapath)
    elif data_type == 'transaction':
        df = pd.read_csv(datapath, dtype={'article_id': str}, parse_dates=['t_dat'])
    elif data_type == 'article':
        df = pd.read_csv(datapath, dtype={'article_id': str})
    return df

In [4]:
%%time

# Load all sales data (transactions spanning 2018 to 2020)
# Also, article_id is read as a string column; otherwise the
# leading zeros of its values would be dropped
transactions_data=create_data(data_path, data_type='transaction')
print(transactions_data.shape)

# # Unique Attributes
print(str(len(transactions_data['t_dat'].drop_duplicates())) + "-total No of unique transactions dates in data sheet")
print(str(len(transactions_data['customer_id'].drop_duplicates())) + "-total No of unique customers ids in data sheet")
print(str(len(transactions_data['article_id'].drop_duplicates())) + "-total No of unique article ids courses names in data sheet")
print(str(len(transactions_data['sales_channel_id'].drop_duplicates())) + "-total No of unique sales channels in data sheet")

(31788324, 5)
734-total No of unique transactions dates in data sheet
1362281-total No of unique customers ids in data sheet
104547-total No of unique article ids courses names in data sheet
2-total No of unique sales channels in data sheet
CPU times: user 1min, sys: 5.87 s, total: 1min 6s
Wall time: 1min 31s
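The leading-zero issue mentioned in the comment above can be reproduced on a tiny in-memory CSV. This is only an illustration; the two article ids and prices are toy values, not a sample drawn from the real file.

```python
import io

import pandas as pd

csv_text = "article_id,price\n0108775015,0.05\n0663713001,0.03\n"

# Default parsing infers an integer column and loses the leading zero
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the id exactly as written in the file
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

print(as_int["article_id"].tolist())
print(as_str["article_id"].tolist())
```

Keeping the ids as strings matters later on, because the same ids are used as dictionary keys when the index mappings are built.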


In [5]:
transactions_data.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [6]:
transactions_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31788324 entries, 0 to 31788323
Data columns (total 5 columns):
 #   Column            Dtype         
---  ------            -----         
 0   t_dat             datetime64[ns]
 1   customer_id       object        
 2   article_id        object        
 3   price             float64       
 4   sales_channel_id  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 1.2+ GB


In [7]:
%%time

# Load all Customers
customer_data=create_data(customer_data_path)
print(customer_data.shape)

print(str(len(customer_data['customer_id'].drop_duplicates())) + "-total No of unique customers ids in customer data sheet")

(1371980, 7)
1371980-total No of unique customers ids in customer data sheet
CPU times: user 3.98 s, sys: 431 ms, total: 4.42 s
Wall time: 6.12 s


In [8]:
customer_data.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [9]:
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   customer_id             1371980 non-null  object 
 1   FN                      476930 non-null   float64
 2   Active                  464404 non-null   float64
 3   club_member_status      1365918 non-null  object 
 4   fashion_news_frequency  1355971 non-null  object 
 5   age                     1356119 non-null  float64
 6   postal_code             1371980 non-null  object 
dtypes: float64(3), object(4)
memory usage: 73.3+ MB


In [10]:
%%time

# Load all Articles
article_data=create_data(article_data_path, data_type='article')
print(article_data.shape)

print(str(len(article_data['article_id'].drop_duplicates())) + "-total No of unique article ids in article data sheet")

(105542, 25)
105542-total No of unique article ids in article data sheet
CPU times: user 820 ms, sys: 60.5 ms, total: 881 ms
Wall time: 1.17 s


In [11]:
article_data.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [12]:
article_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105542 non-null  object
 1   product_code                  105542 non-null  int64 
 2   prod_name                     105542 non-null  object
 3   product_type_no               105542 non-null  int64 
 4   product_type_name             105542 non-null  object
 5   product_group_name            105542 non-null  object
 6   graphical_appearance_no       105542 non-null  int64 
 7   graphical_appearance_name     105542 non-null  object
 8   colour_group_code             105542 non-null  int64 
 9   colour_group_name             105542 non-null  object
 10  perceived_colour_value_id     105542 non-null  int64 
 11  perceived_colour_value_name   105542 non-null  object
 12  perceived_colour_master_id    105542 non-null  int64 
 13 

## Capturing Seasonal Effect by Limiting the transaction date
Only transactions after 2020-08-21 are kept, based on this notebook: https://www.kaggle.com/tomooinubushi/folk-of-time-is-our-best-friend/notebook

In [13]:
transactions_data = transactions_data[transactions_data['t_dat'] > '2020-08-21']
transactions_data.shape

(1190911, 5)
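The cutoff above is hard-coded. As a sketch of an alternative, the cutoff can be derived from the most recent date in the data. The 31-day window and the toy dates below are assumptions chosen for illustration, not values taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "t_dat": pd.to_datetime(
            ["2020-08-01", "2020-08-25", "2020-09-10", "2020-09-22"]
        ),
        "article_id": ["0000000001", "0000000002", "0000000003", "0000000004"],
    }
)

# Hypothetical window length; tune it to the seasonality you want to capture
window = pd.Timedelta(days=31)

# Keep only the rows that fall inside the window ending at the latest date
cutoff = toy["t_dat"].max() - window
recent = toy[toy["t_dat"] > cutoff]
print(cutoff.date(), len(recent))
```

Deriving the cutoff this way keeps the filter valid if the notebook is rerun on a dataset with a different end date.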

## Aggregating Customers and Articles irrespective of transaction dates

In [14]:
transactions_data = transactions_data.groupby(['customer_id','article_id']).agg({'price':'sum','t_dat':'count'}).reset_index()
transactions_data = transactions_data[['customer_id','article_id','price','t_dat']]
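To make the effect of this aggregation concrete, here is the same `groupby` applied to a toy frame (the ids, prices, and dates are made up). Note that after aggregating, `price` holds the total amount spent per customer-article pair and `t_dat` holds the number of transactions, not a date.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "customer_id": ["c1", "c1", "c1", "c2"],
        "article_id": ["a1", "a1", "a2", "a1"],
        "price": [0.02, 0.03, 0.05, 0.01],
        "t_dat": pd.to_datetime(
            ["2020-09-01", "2020-09-03", "2020-09-03", "2020-09-05"]
        ),
    }
)

# One row per (customer, article): summed price and transaction count
agg = (
    toy.groupby(["customer_id", "article_id"])
    .agg({"price": "sum", "t_dat": "count"})
    .reset_index()
)
print(agg)
```

Customer `c1` bought article `a1` twice, so that pair collapses into a single row with a count of 2.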

## Merging transaction data with articles group name data

In [15]:
# Combine article's product group name with transaction's data
merged_transactions_data = pd.merge(left=transactions_data, right=article_data[['article_id','product_group_name']], how='left', on='article_id')
merged_transactions_data.shape

(1051730, 5)
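A left merge keeps the row count unchanged only if each `article_id` appears once on the right-hand side. The sketch below, on toy frames, uses the `validate` argument of `pd.merge` to make that assumption explicit; the frames and values are illustrative only.

```python
import pandas as pd

left = pd.DataFrame(
    {"article_id": ["a1", "a2", "a1"], "price": [0.05, 0.03, 0.02]}
)
right = pd.DataFrame(
    {
        "article_id": ["a1", "a2"],
        "product_group_name": ["Accessories", "Underwear"],
    }
)

# validate="many_to_one" raises MergeError if article_id is duplicated in `right`
merged = pd.merge(left, right, how="left", on="article_id", validate="many_to_one")
print(merged.shape)
```

If the check passes, the merged frame has exactly as many rows as the left frame, which is the behaviour the interaction counts later depend on.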

In [16]:
merged_transactions_data.head()

Unnamed: 0,customer_id,article_id,price,t_dat,product_group_name
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,568601043,0.050831,1,Garment Upper body
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,794321007,0.061,1,Garment Upper body
2,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,448509014,0.042356,1,Garment Lower body
3,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,719530003,0.033881,1,Garment Lower body
4,00009d946eec3ea54add5ba56d5210ea898def4b46c685...,516859008,0.013542,1,Accessories


In [17]:
merged_transactions_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1051730 entries, 0 to 1051729
Data columns (total 5 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   customer_id         1051730 non-null  object 
 1   article_id          1051730 non-null  object 
 2   price               1051730 non-null  float64
 3   t_dat               1051730 non-null  int64  
 4   product_group_name  1051730 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 48.1+ MB


## Generating user and article index mapping dictionaries

In [18]:
def get_customers_list():
    # Sorted array of unique customer ids
    return np.sort(customer_data['customer_id'].unique())

def get_articles_list():
    # Array of unique article ids, in order of appearance
    item_list = article_data['article_id'].unique()
    return item_list

def get_feature_list(feature_list=['product_group_name']):
    final_feature_df=pd.DataFrame()
    
    # Creating a list of features
    for feature_name in feature_list:
        intermediate_df = article_data[feature_name].copy()
        final_feature_df = pd.concat([final_feature_df, intermediate_df], ignore_index=True)
        
    final_feature_df = final_feature_df.drop_duplicates().reset_index(drop=True)
                
    return final_feature_df[0].unique()

def id_mappings(customers_list, articles_list, feature_list):
    """
    
    Create forward and reverse index mappings for customer ids, article ids, and feature names
    
    """
    customer_to_index_mapping = {}
    index_to_customer_mapping = {}
    for customer_index, customer_id in enumerate(customers_list):
        customer_to_index_mapping[customer_id] = customer_index
        index_to_customer_mapping[customer_index] = customer_id
        
    article_to_index_mapping = {}
    index_to_article_mapping = {}
    for article_index, article_id in enumerate(articles_list):
        article_to_index_mapping[article_id] = article_index
        index_to_article_mapping[article_index] = article_id
    
    feature_to_index_mapping = {}
    index_to_feature_mapping = {}
    for feature_index, feature_id in enumerate(feature_list):
        feature_to_index_mapping[feature_id] = feature_index
        index_to_feature_mapping[feature_index] = feature_id
        
    return customer_to_index_mapping, index_to_customer_mapping, \
           article_to_index_mapping, index_to_article_mapping, \
           feature_to_index_mapping, index_to_feature_mapping
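The three loops in `id_mappings` follow the same pattern, so the idea can be sketched with one small helper. The name `build_index_maps` is hypothetical and not part of the notebook; the sketch assumes the ids are unique, as they are after `.unique()`.

```python
def build_index_maps(ids):
    """Return (id -> index, index -> id) dictionaries for a sequence of unique ids."""
    to_index = {value: idx for idx, value in enumerate(ids)}
    to_value = {idx: value for value, idx in to_index.items()}
    return to_index, to_value


article_to_idx, idx_to_article = build_index_maps(["0108775015", "0108775044"])
print(article_to_idx["0108775044"], idx_to_article[0])
```

The forward mapping is what the interaction matrices need; the reverse mapping is useful when turning predicted indices back into article ids.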

In [19]:
customers = get_customers_list()
articles = get_articles_list()
features = get_feature_list(feature_list=['product_type_name','product_group_name','graphical_appearance_name','colour_group_name','department_name'])

In [20]:
customers

array(['00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657',
       '0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa',
       '000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318',
       ...,
       'ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1778d0116cffd259264',
       'ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38b2236865d949d4df6a',
       'ffffd9ac14e89946416d80e791d064701994755c3ab686a1eaf3458c36f52241'],
      dtype=object)

In [21]:
articles

array(['0108775015', '0108775044', '0108775051', ..., '0956217002',
       '0957375001', '0959461001'], dtype=object)

In [22]:
features

array(['Vest top', 'Bra', 'Underwear Tights', 'Socks', 'Leggings/Tights',
       'Sweater', 'Top', 'Trousers', 'Hair clip', 'Umbrella',
       'Pyjama jumpsuit/playsuit', 'Bodysuit', 'Hair string', 'Unknown',
       'Hoodie', 'Sleep Bag', 'Hair/alice band', 'Belt', 'Boots',
       'Bikini top', 'Swimwear bottom', 'Underwear bottom', 'Swimsuit',
       'Skirt', 'T-shirt', 'Dress', 'Hat/beanie', 'Kids Underwear top',
       'Shorts', 'Shirt', 'Cap/peaked', 'Pyjama set', 'Sneakers',
       'Sunglasses', 'Cardigan', 'Gloves', 'Earring', 'Bag', 'Blazer',
       'Other shoe', 'Jumpsuit/Playsuit', 'Sandals', 'Jacket', 'Costumes',
       'Robe', 'Scarf', 'Coat', 'Other accessories', 'Polo shirt',
       'Slippers', 'Night gown', 'Alice band', 'Straw hat', 'Hat/brim',
       'Tailored Waistcoat', 'Necklace', 'Ballerinas', 'Tie',
       'Pyjama bottom', 'Felt hat', 'Bracelet', 'Blouse',
       'Outdoor overall', 'Watch', 'Underwear body', 'Beanie', 'Giftbox',
       'Sleeping sack', 'Dungarees',

In [23]:
# Generate the mappings; the matrices passed to LightFM are addressed by integer indices, not raw ids
customer_to_index_mapping, index_to_customer_mapping, \
article_to_index_mapping, index_to_article_mapping, \
feature_to_index_mapping, index_to_feature_mapping = id_mappings(customers, articles, features)

## Generate Customer Article Interaction Matrix

In [24]:
def get_customer_article_interaction(customer_article_amt_df, agg_col_name='price'):
    # Rename the aggregated columns to descriptive names.
    # rename() returns a new dataframe, so the caller's slice is never written to
    # (the earlier no-op self-assignments only triggered SettingWithCopyWarning)
    customer_article_amt_df = customer_article_amt_df.rename(columns = {'price':'total_amount_spent', 't_dat': 'total_no_of_transactions'})

    # Replace the aggregated column with its category codes (ordinal ranks starting at 0)
    if agg_col_name == 'price':
        customer_article_amt_df['total_amount_spent'] = customer_article_amt_df['total_amount_spent'].astype('category')
        customer_article_amt_df['total_amount_spent'] = customer_article_amt_df['total_amount_spent'].cat.codes
    elif agg_col_name == 't_dat':
        customer_article_amt_df['total_no_of_transactions'] = customer_article_amt_df['total_no_of_transactions'].astype('category')
        customer_article_amt_df['total_no_of_transactions'] = customer_article_amt_df['total_no_of_transactions'].cat.codes

    return customer_article_amt_df

def get_interaction_matrix(df, df_column_as_row, df_column_as_col, 
                        df_column_as_value, row_indexing_map, col_indexing_map):
    
    row = df[df_column_as_row].apply(lambda x: row_indexing_map[x]).values
    col = df[df_column_as_col].apply(lambda x: col_indexing_map[x]).values
    value = df[df_column_as_value].values
    
    return sparse.coo_matrix((value, (row, col)), shape = (len(row_indexing_map), len(col_indexing_map)))

def get_article_feature_interaction(article_df, feature_dict={'product_group_name':1}):
    # Copy the selected columns so later assignments do not touch a slice of article_df
    article_feature_df = article_df[['article_id']+list(feature_dict.keys())].copy()
    
    # initiate the final feature df
    article_feature_final_df = pd.DataFrame()
    
    # allocate features into a single column "feature"
    for feature_name in tqdm.tqdm(feature_dict.keys(), desc='Concatenating feature names'):
        intermediate_feature_df = article_feature_df[['article_id', feature_name]].rename(columns={feature_name: 'feature'})
        # weight assigned to this feature column
        intermediate_feature_df['feature_count'] = feature_dict[feature_name]
        
        article_feature_final_df = pd.concat([article_feature_final_df, intermediate_feature_df], ignore_index=True)
        
        del intermediate_feature_df
        
    # grouping for summing over feature count
    article_feature_final_df = article_feature_final_df.groupby(['article_id', 'feature'], as_index=False)['feature_count'].sum()
    
    return article_feature_final_df
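To see what `get_interaction_matrix` produces, here is a toy version with three customers and two articles. The ids and values are made up, and `Series.map` is used in place of `apply` with a lambda, which should be equivalent for dictionary lookups when every id is present in the mapping.

```python
import pandas as pd
import scipy.sparse as sparse

row_map = {"c1": 0, "c2": 1, "c3": 2}
col_map = {"a1": 0, "a2": 1}

toy = pd.DataFrame(
    {"customer_id": ["c1", "c3"], "article_id": ["a2", "a1"], "value": [5, 7]}
)

# Translate raw ids into matrix coordinates
rows = toy["customer_id"].map(row_map).to_numpy()
cols = toy["article_id"].map(col_map).to_numpy()

# The shape comes from the mappings, so customers without purchases get empty rows
matrix = sparse.coo_matrix(
    (toy["value"].to_numpy(), (rows, cols)),
    shape=(len(row_map), len(col_map)),
)
print(matrix.toarray())
```

Customer `c2` has no transactions but still occupies a row, which is why the real matrix has one row per customer in `customers.csv` rather than one per customer seen in the filtered transactions.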

### Customer Article Interaction based on Amount Spent

In [25]:
# Create customer and article interaction dataframe based on total amount spent
customer_to_article_amt = get_customer_article_interaction(customer_article_amt_df = merged_transactions_data[['customer_id','article_id','price']])
print(customer_to_article_amt.shape)  

(1051730, 3)


In [26]:
customer_to_article_amt

Unnamed: 0,customer_id,article_id,total_amount_spent
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043,4205
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007,4791
2,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,0448509014,3443
3,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,0719530003,2792
4,00009d946eec3ea54add5ba56d5210ea898def4b46c685...,0516859008,939
...,...,...,...
1051725,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0804992033,1971
1051726,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,0689365050,657
1051727,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,0762846027,1971
1051728,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,0794819001,994


In [27]:
# Generate the customer-article interaction matrix (values: amount-spent codes)
customer_to_article_interaction_amt = get_interaction_matrix(customer_to_article_amt, "customer_id", "article_id", "total_amount_spent", \
                                                            customer_to_index_mapping, article_to_index_mapping)

In [28]:
customer_to_article_interaction_amt

<1371980x105542 sparse matrix of type '<class 'numpy.int16'>'
	with 1051730 stored elements in COOrdinate format>

### Customer Article Interaction based on Transaction Counts

In [29]:
# Create customer and article interaction dataframe based on total number of transactions made
customer_to_article_tdat = get_customer_article_interaction(customer_article_amt_df = merged_transactions_data[['customer_id','article_id','t_dat']],
                                                            agg_col_name='t_dat')
print(customer_to_article_tdat.shape)     

(1051730, 3)


In [30]:
customer_to_article_tdat.head()

Unnamed: 0,customer_id,article_id,total_no_of_transactions
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,568601043,0
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,794321007,0
2,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,448509014,0
3,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d...,719530003,0
4,00009d946eec3ea54add5ba56d5210ea898def4b46c685...,516859008,0
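The zeros in the table above are a consequence of `cat.codes`: codes are ordinal ranks that start at 0, so the smallest observed count (a single transaction) is encoded as 0. The toy series below illustrates this. The `+ 1` shift at the end is one possible adjustment, shown as an assumption rather than something the notebook does.

```python
import pandas as pd

# Toy transaction counts per customer-article pair
counts = pd.Series([1, 1, 3, 2, 1])

# Category codes rank the distinct values: 1 -> 0, 2 -> 1, 3 -> 2
codes = counts.astype("category").cat.codes
print(codes.tolist())  # [0, 0, 2, 1, 0]

# Hypothetical adjustment: shift codes so that every stored value is positive
shifted = codes + 1
print(shifted.tolist())  # [1, 1, 3, 2, 1]
```

A value of 0 stored explicitly in a sparse matrix is easy to confuse with "no interaction", so it may be worth checking how LightFM treats non-positive interaction values before relying on these codes.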


In [31]:
# Generate the customer-article interaction matrix (values: transaction-count codes)
customer_to_article_interaction_tdat = get_interaction_matrix(customer_to_article_tdat, "customer_id", "article_id", "total_no_of_transactions", \
                                                            customer_to_index_mapping, article_to_index_mapping)

In [32]:
customer_to_article_interaction_tdat

<1371980x105542 sparse matrix of type '<class 'numpy.int8'>'
	with 1051730 stored elements in COOrdinate format>

### Article Feature Interaction

In [33]:
# Create article and feature interaction dataframe
article_to_feature = get_article_feature_interaction(article_df = article_data, 
                                                    feature_dict={'product_type_name':1,
                                                              'product_group_name':1,
                                                              'graphical_appearance_name':1,
                                                              'colour_group_name':1,
                                                              'department_name':1})
print(article_to_feature.shape)

Concatenating feature names: 100%|██████████| 5/5 [00:00<00:00, 73.97it/s]


(515343, 3)


In [34]:
article_to_feature

Unnamed: 0,article_id,feature,feature_count
0,0108775015,Black,1
1,0108775015,Garment Upper body,1
2,0108775015,Jersey Basic,1
3,0108775015,Solid,1
4,0108775015,Vest top,1
...,...,...,...
515338,0959461001,Dress,1
515339,0959461001,Garment Full body,1
515340,0959461001,Jersey,1
515341,0959461001,Off White,1


In [35]:
article_to_feature.feature.value_counts()

Solid                              49747
Garment Upper body                 42741
Black                              22670
Garment Lower body                 19812
All over pattern                   17165
                                   ...  
Kids Boy License                       1
Woven bottoms inactive from S.7        1
Shirt Extended inactive from s1        1
Bra extender                           1
Clothing mist                          1
Name: feature, Length: 458, dtype: int64

In [36]:
article_to_feature.feature_count.unique()

array([1, 2])
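Every feature was given a weight of 1, so a `feature_count` of 2 needs an explanation. Five columns times 105,542 articles would give 527,710 article-feature pairs, but only 515,343 remain after grouping, so some pairs were merged. One plausible cause (an assumption, not verified against the data here) is that the same label appears in more than one of the five columns for a single article, and the `groupby(...).sum()` adds the weights together. A toy reproduction with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "article_id": ["a1", "a2"],
        "product_type_name": ["Unknown", "Dress"],
        "product_group_name": ["Unknown", "Garment Full body"],
    }
)

# Stack the two feature columns into one long "feature" column, weight 1 each
long_df = pd.concat(
    [
        toy[["article_id", col]].rename(columns={col: "feature"}).assign(feature_count=1)
        for col in ["product_type_name", "product_group_name"]
    ],
    ignore_index=True,
)

# Identical (article, feature) pairs collapse and their weights add up
summed = long_df.groupby(["article_id", "feature"], as_index=False)["feature_count"].sum()
print(summed)
```

Article `a1` carries the label "Unknown" in both columns, so it ends up with a single row of weight 2. Prefixing each label with its column name would keep such features separate, if that is preferred.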

In [37]:
# Generate article_to_feature interaction
article_to_feature_interaction = get_interaction_matrix(article_to_feature, "article_id", "feature", "feature_count", \
                                                       article_to_index_mapping, feature_to_index_mapping)

In [38]:
article_to_feature_interaction

<105542x458 sparse matrix of type '<class 'numpy.int64'>'
	with 515343 stored elements in COOrdinate format>
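To the best of my understanding of LightFM, when an `item_features` matrix is supplied, each item is represented only by the features in that matrix; the per-item identity features are used automatically only when no feature matrix is given. The sketch below shows, on a toy matrix, how an identity block could be stacked in front of the metadata features if per-article embeddings are also wanted. This is an optional variation, not what this notebook does, and the toy values are made up.

```python
import numpy as np
import scipy.sparse as sparse

n_items = 4

# Toy metadata matrix: 4 items x 2 metadata features
metadata = sparse.coo_matrix(
    np.array([[1, 0], [1, 0], [0, 1], [0, 2]], dtype=np.float32)
)

# One indicator feature per item, followed by the shared metadata features
identity = sparse.identity(n_items, format="csr", dtype=np.float32)
combined = sparse.hstack([identity, metadata], format="csr")
print(combined.shape)  # (4, 6)
```

With this layout the first `n_items` columns are item indicators, so any feature-index mapping would need to be shifted by `n_items` accordingly.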

## Hyperparameter Tuning using Random Search

### Sampling Hyperparameters Function

In [39]:
def sample_hyperparameters():
    while True:
        yield {
            "no_components": np.random.randint(16, 64),
            "learning_schedule": np.random.choice(["adagrad", "adadelta"]),
            "loss": np.random.choice(["bpr", "warp", "warp-kos"]),
            "learning_rate": np.random.exponential(0.05),
            "item_alpha": np.random.exponential(1e-8),
            "user_alpha": np.random.exponential(1e-8),
            "max_sampled": np.random.randint(5, 15),
            "num_epochs": np.random.randint(5, 50),
        }
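The sampler draws from NumPy's global random state, so two runs of the notebook will generally try different configurations. A minimal sketch of a reproducible variant is shown below; the name `sample_hyperparameters_seeded` and the `seed` argument are assumptions for illustration, while the ranges mirror the ones used in this notebook.

```python
import itertools

import numpy as np


def sample_hyperparameters_seeded(seed=0):
    """Yield hyperparameter dictionaries from a private, seeded generator."""
    rng = np.random.default_rng(seed)
    while True:
        yield {
            "no_components": int(rng.integers(16, 64)),
            "learning_schedule": str(rng.choice(["adagrad", "adadelta"])),
            "loss": str(rng.choice(["bpr", "warp", "warp-kos"])),
            "learning_rate": float(rng.exponential(0.05)),
            "item_alpha": float(rng.exponential(1e-8)),
            "user_alpha": float(rng.exponential(1e-8)),
            "max_sampled": int(rng.integers(5, 15)),
            "num_epochs": int(rng.integers(5, 50)),
        }


first_run = list(itertools.islice(sample_hyperparameters_seeded(42), 3))
second_run = list(itertools.islice(sample_hyperparameters_seeded(42), 3))
print(first_run == second_run)  # True
```

Casting to plain Python types also keeps the resulting dictionaries free of NumPy scalars, which makes them easier to print or serialize.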

### Perform Random Search

The train and test interactions are passed to the function, along with the optional item features, the number of random samples to draw, and the number of threads to use for model training.  

For each sample, the function yields the precision@12 score, the set of hyperparameters, and the fitted model.

In [40]:
def random_search(train_interactions, test_interactions, item_features=None, num_samples=20, num_threads=4):
    for hyperparams in itertools.islice(sample_hyperparameters(), num_samples):
        num_epochs = hyperparams.pop("num_epochs")

        model = LightFM(**hyperparams)
        model.fit(train_interactions, item_features=item_features, epochs=num_epochs, num_threads=num_threads)

        score = precision_at_k(
                            model, 
                            test_interactions=test_interactions, 
                            train_interactions=train_interactions, 
                            item_features = item_features,
                            k=12, 
                            num_threads=num_threads
                            ).mean()
        
        print(score)

        hyperparams["num_epochs"] = num_epochs

        yield (score, hyperparams, model)

### Initializing the Storage Dictionary

In [41]:
optimized_dict={}

### Splitting the primary dataset into train and test sets based on amount spent

In [42]:
sparse_customer_article_train, sparse_customer_article_test = random_train_test_split(customer_to_article_interaction_amt, test_percentage=0.2, random_state=42)

In [43]:
sparse_customer_article_train

<1371980x105542 sparse matrix of type '<class 'numpy.int16'>'
	with 841384 stored elements in COOrdinate format>

In [44]:
sparse_customer_article_test

<1371980x105542 sparse matrix of type '<class 'numpy.int16'>'
	with 210346 stored elements in COOrdinate format>
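The stored-element counts above (841,384 and 210,346) add up to the 1,051,730 interactions in the full matrix, i.e. an 80/20 split of the individual interactions. The sketch below mimics that idea on a toy matrix by shuffling the stored entries and cutting them at the 80% mark. It is an illustration of the concept, not LightFM's own implementation, and `toy_random_split` is a hypothetical name.

```python
import numpy as np
import scipy.sparse as sparse


def toy_random_split(interactions, test_percentage=0.2, seed=42):
    """Split the stored entries of a sparse matrix into train and test matrices."""
    coo = interactions.tocoo()
    rng = np.random.default_rng(seed)
    order = rng.permutation(coo.nnz)
    cutoff = int((1.0 - test_percentage) * coo.nnz)

    def subset(idx):
        # Both parts keep the full shape so row/column indices stay aligned
        return sparse.coo_matrix(
            (coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape
        )

    return subset(order[:cutoff]), subset(order[cutoff:])


toy = sparse.random(50, 20, density=0.1, format="coo", random_state=0)
train, test = toy_random_split(toy)
print(train.nnz, test.nnz, train.shape == test.shape)
```

Because the split is over interactions rather than customers, a customer can appear in both parts, and some customers in the test part may have no training interactions at all.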

In [45]:
(score, hyperparams, model) = max(random_search(train_interactions = sparse_customer_article_train, 
                                                test_interactions = sparse_customer_article_test, 
                                                item_features = article_to_feature_interaction,
                                                num_threads = 4), key=lambda x: x[0])

0.000350401
0.00076225295
0.0012355559
0.00088319357
0.0005955509
0.0007687902
0.0006458884
0.000507297
0.0006668079
0.000494876
0.00097014016
0.00058116886
0.00044911474
0.0010080566
0.00064327347
0.00043669375
0.00065307954
0.00041250564
0.00031967557
0.0005216792


In [46]:
print("Best score {} at {}".format(score, hyperparams))

Best score 0.0012355558574199677 at {'no_components': 41, 'learning_schedule': 'adagrad', 'loss': 'warp', 'learning_rate': 0.0640607682015081, 'item_alpha': 3.687859740593897e-08, 'user_alpha': 5.944386269743704e-09, 'max_sampled': 9, 'num_epochs': 48}
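Taking `max(...)` over the generator keeps only the best trial and discards the rest. If the full set of trials is of interest, the tuples can be collected into a table instead. The sketch below uses made-up scores and parameters in place of real `random_search` output, and `collect_trials` is a hypothetical helper.

```python
import pandas as pd


def collect_trials(trials):
    """Turn (score, hyperparams, model) tuples into a table sorted by score."""
    rows = [{"score": float(score), **params} for score, params, _model in trials]
    return (
        pd.DataFrame(rows)
        .sort_values("score", ascending=False)
        .reset_index(drop=True)
    )


# Stand-in for the tuples yielded by random_search (values are illustrative only)
fake_trials = [
    (0.0004, {"loss": "bpr", "no_components": 20}, None),
    (0.0012, {"loss": "warp", "no_components": 41}, None),
    (0.0009, {"loss": "warp-kos", "no_components": 33}, None),
]
table = collect_trials(fake_trials)
print(table)
```

The helper drops the model objects on purpose, since holding every fitted model in memory can be expensive; the best configuration can always be refitted afterwards.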


In [47]:
optimized_dict['Amount_Spent'] = {'score': score, 
                                  'params': hyperparams}

### Splitting the primary dataset into train and test sets based on transaction count

In [48]:
sparse_customer_article_train, sparse_customer_article_test = random_train_test_split(customer_to_article_interaction_tdat, test_percentage=0.2, random_state=42)

In [49]:
sparse_customer_article_train

<1371980x105542 sparse matrix of type '<class 'numpy.int8'>'
	with 841384 stored elements in COOrdinate format>

In [50]:
sparse_customer_article_test

<1371980x105542 sparse matrix of type '<class 'numpy.int8'>'
	with 210346 stored elements in COOrdinate format>

In [51]:
(score, hyperparams, model) = max(random_search(train_interactions = sparse_customer_article_train, 
                                                test_interactions = sparse_customer_article_test, 
                                                item_features = article_to_feature_interaction,
                                                num_threads = 4), key=lambda x: x[0])

0.00039877725
0.00033078901
0.00054325233
0.0004719954
0.001127036
0.00112769
0.00028372023
0.00028306647
0.00080343813
5.8835987e-05
0.00050860445
0.0010263612
0.00033471145
0.00027914406
0.00010459731
8.6946515e-05
0.0003190218
0.00033601886
0.0011623377
0.00015035867


In [52]:
print("Best score {} at {}".format(score, hyperparams))

Best score 0.001162337721325457 at {'no_components': 29, 'learning_schedule': 'adadelta', 'loss': 'warp-kos', 'learning_rate': 0.04500246271404722, 'item_alpha': 8.996016099729502e-09, 'user_alpha': 6.279693363984694e-09, 'max_sampled': 13, 'num_epochs': 48}


In [53]:
optimized_dict['Transaction_Counts'] = {'score': score, 
                                       'params': hyperparams}

In [54]:
print(optimized_dict)

{'Amount_Spent': {'score': 0.0012355559, 'params': {'no_components': 41, 'learning_schedule': 'adagrad', 'loss': 'warp', 'learning_rate': 0.0640607682015081, 'item_alpha': 3.687859740593897e-08, 'user_alpha': 5.944386269743704e-09, 'max_sampled': 9, 'num_epochs': 48}}, 'Transaction_Counts': {'score': 0.0011623377, 'params': {'no_components': 29, 'learning_schedule': 'adadelta', 'loss': 'warp-kos', 'learning_rate': 0.04500246271404722, 'item_alpha': 8.996016099729502e-09, 'user_alpha': 6.279693363984694e-09, 'max_sampled': 13, 'num_epochs': 48}}}


## Saving the Optimized Params

In [55]:
with open('optimized_dict.pkl', 'wb') as f:
    pickle.dump(optimized_dict, f)
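For completeness, here is a sketch of how the saved dictionary could be read back and prepared for refitting. It writes a small stand-in dictionary to a temporary directory so that it runs on its own; the values are abbreviated and illustrative. As in `random_search`, `num_epochs` is separated from the other parameters because it is an argument of `fit`, not of the `LightFM` constructor.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary saved above (abbreviated, illustrative values)
optimized = {
    "Amount_Spent": {
        "score": 0.0012,
        "params": {"no_components": 41, "loss": "warp", "num_epochs": 48},
    }
}

path = os.path.join(tempfile.mkdtemp(), "optimized_dict.pkl")
with open(path, "wb") as f:
    pickle.dump(optimized, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# Copy the params so the loaded dictionary is left untouched
params = dict(restored["Amount_Spent"]["params"])
num_epochs = params.pop("num_epochs")
print(params, num_epochs)
```

The remaining `params` could then be passed as `LightFM(**params)` and the model fitted with `epochs=num_epochs`, mirroring the calls inside `random_search`.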