# Introduction
This challenge is about creating an fasion clothing   recommendation system. The data contains information about 
the customers, records of customers’ purchases, and information about the products including their images and 
text description. 

## Research Question
The Aim of this project is to predict the next 12 purchases for each customer based on their past purchases and the product metadata.

## Method 

I will be sampling the data into users who have more than 30 pruchases (Active users), and people who have 30 pruchases or less (Cold users). Thhis is becuase some neural networks will be computationally expensive for users who don't have much data. These algorithms use cotent-base filtering which works only if there is a sufficient amout of data.

For Active users I will be using:
1. GRU: an RNN that uses the structural embedding of the product data.
2. GRUF: an RNN that uses the visual embedding of the product data.

For Cold users, I will be using:
1. NPU: an RNN desgned for the cold user problem.
2. ALS: a matrix factorization technique, which counts as a colaborative filtering model.
3. Trending products: using the trending products that week. This will be very useful for attracting cold users.



## Downloading the data

In [None]:
! pip install kaggle

In [None]:
! mkdir ~/.kaggle

In [2]:
# Configuring the Kaggle API
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [3]:
!mkdir /content/data

In [None]:
! kaggle competitions download -c h-and-m-personalized-fashion-recommendations

In [None]:
! unzip /content/h-and-m-personalized-fashion-recommendations.zip -d data

In [7]:
! mkdir /content/data/image_data

In [8]:
# Moving the images from their subfolders to one main folder
! find /content/data/images -type f -exec mv --backup=numbered -t /content/data/image_data/ {} +

# Libraries

In [7]:
import os; os.environ['OPENBLAS_NUM_THREADS']='1'
import cv2
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelBinarizer
from keras.applications.xception import Xception,preprocess_input
import tensorflow as tf
import tensorflow_hub as hub
from keras.preprocessing import image
from keras.layers import Input
from keras.backend import reshape
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import pickle
import math

In [8]:
from keras.preprocessing.image import load_img 
from keras.preprocessing.image import img_to_array 
from keras.applications.vgg16 import preprocess_input 

# CNN 
from keras.applications.vgg16 import VGG16 
from keras.models import Model

# clustering and dimension reduction
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pickle
import math



# Image embedding
##  Using CNN to extract the image features

Here, I want to extract the image embeddings as extra information for the Deep models. I will be using a pre-trained CNN, specifically VGG16, due to it's predictive capabilities.

In [9]:
dir = '/content/data/image_data'

In [10]:
os.chdir(dir)

In [11]:
products = []

# creates a ScandirIterator aliased as files
with os.scandir(dir) as files:
  # loops through each file in the directory
    for file in files:
        if file.name.endswith('.jpg'):
          # adds only the image files to the flowers list
            products.append(file.name)

In [12]:
len(products)

105100

In [13]:
# load the model first and pass as an argument
model = VGG16()
model = Model(inputs = model.inputs, outputs = model.layers[-2].output)

def extract_features(file, model):
    # load the image as a 224x224 px array
    img = load_img(file, target_size=(224,224))
    # convert from 'PIL.Image.Image' to numpy array
    img = np.array(img) 
    # reshape the data for the model reshape(num_of_samples, dim 1, dim 2, channels)
    reshaped_img = img.reshape(1,224,224,3) 
    # prepare image for model
    imgx = preprocess_input(reshaped_img)
    imgx = np.asarray(imgx)
    # get the feature vector
    features = model.predict(imgx, use_multiprocessing=True)
    return features


Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5


In [None]:
# ectracting the features
image_features = []
articles = []
for product in products:
    articles.append(product)
    image_features.append(extract_features(product,model))

In [None]:
np.save('/content/image_features.csv',image_features)

## Using PCA to reduce the feature space

The output of the neural network is a vector of 4096 values. This will be computationally expensive for our model. Therefore, I will be using PCA here to reduce the feature space to 128 values.

In [None]:
image_features = np.load('/content/drive/MyDrive/H&M/image_features.csv.npy')

In [None]:
image_features = image_features.reshape(-1,4096)

In [None]:
image_features.shape

(105100, 4096)

In [None]:
# Reducing the feature space from 4096 vectors to 128 trainable parameters
pca = PCA(n_components=128, random_state=22)
pca.fit(image_features)
x = pca.transform(image_features)

In [None]:
features = pd.DataFrame(x)

In [None]:
features.shape

(105100, 128)

# Data Preperation

The data has to be made in specific format called atomic files. I will be making one for customers, products, and the interactions between customers and products.




## Creating an image entity embedding atomic file

In [None]:
features['ent_emb:float_seq'] = features[features.columns].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

In [None]:
features = features['ent_emb:float_seq']
features.head()

0    -23.545088 -9.362087 2.1191468 7.269294 -3.904...
1    1.3210576 2.457583 7.941802 10.924126 -1.87812...
2    -8.405226 -1.1159095 -2.2450662 11.526868 -6.5...
3    -32.10634 4.0234857 -1.3979577 -17.651117 11.4...
4    -11.009327 -4.833187 5.2368045 16.553185 1.694...
Name: ent_emb:float_seq, dtype: object

In [None]:
articles = pd.DataFrame(products, columns= {'article'})
image_embedding = pd.concat([articles, features], axis=1)
image_embedding.head()

Unnamed: 0,article,ent_emb:float_seq
0,0656401005.jpg,-23.545088 -9.362087 2.1191468 7.269294 -3.904...
1,0661505002.jpg,1.3210576 2.457583 7.941802 10.924126 -1.87812...
2,0403844008.jpg,-8.405226 -1.1159095 -2.2450662 11.526868 -6.5...
3,0887457001.jpg,-32.10634 4.0234857 -1.3979577 -17.651117 11.4...
4,0423539001.jpg,-11.009327 -4.833187 5.2368045 16.553185 1.694...


In [None]:
image_embedding.rename(columns={'article':'ent_id:token'}, inplace=True)

In [None]:
image_emb1 = image_embedding
image_emb1['ent_id:token'] = image_emb1['ent_id:token'].str[:10]
image_emb1.head()

Unnamed: 0,ent_id:token,ent_emb:float_seq
0,656401005,-23.545088 -9.362087 2.1191468 7.269294 -3.904...
1,661505002,1.3210576 2.457583 7.941802 10.924126 -1.87812...
2,403844008,-8.405226 -1.1159095 -2.2450662 11.526868 -6.5...
3,887457001,-32.10634 4.0234857 -1.3979577 -17.651117 11.4...
4,423539001,-11.009327 -4.833187 5.2368045 16.553185 1.694...


In [None]:
image_emb1 = image_emb1.rename(columns ={'ent_id:token':'iid:token' , 'ent_emb:float_seq':'image_emb:float_seq'})

In [None]:
image_emb1.head()

Unnamed: 0,iid:token,image_emb:float_seq
0,656401005,-23.545088 -9.362087 2.1191468 7.269294 -3.904...
1,661505002,1.3210576 2.457583 7.941802 10.924126 -1.87812...
2,403844008,-8.405226 -1.1159095 -2.2450662 11.526868 -6.5...
3,887457001,-32.10634 4.0234857 -1.3979577 -17.651117 11.4...
4,423539001,-11.009327 -4.833187 5.2368045 16.553185 1.694...


In [None]:
! mkdir /content/recbox_data

In [None]:
image_emb1.to_csv('/content/recbox_data/recbox_data.image', index=False, sep='\t')

## Creating the item atomic file

In [13]:
df = pd.read_csv(r'/content/data/articles.csv')
df['article_id'] = df['article_id'].astype('int32')
df['article_id'] = '0' + df['article_id'].astype('str')

In [14]:
for col in df.columns:
    print(col)
    print(len(pd.unique(df[col])))

article_id
105542
product_code
47224
prod_name
45875
product_type_no
132
product_type_name
131
product_group_name
19
graphical_appearance_no
30
graphical_appearance_name
30
colour_group_code
50
colour_group_name
50
perceived_colour_value_id
8
perceived_colour_value_name
8
perceived_colour_master_id
20
perceived_colour_master_name
20
department_no
299
department_name
250
index_code
10
index_name
10
index_group_no
5
index_group_name
5
section_no
57
section_name
56
garment_group_no
21
garment_group_name
21
detail_desc
43405


we see so many couple of columns are [category_text, encoded_value]. So in order to avoid Multicollinearity, we will keep only one columns in each couple
We can see below couple of columns in item features, and we will keep one of them:

1. use product_type_no - skip product_type_name
2. use graphical_appearance_no - skip graphical_appearance_name
3. use colour_group_code - skip colour_group_name
4. use perceived_colour_value_id - skip perceived_colour_value_name
5. use perceived_colour_master_id - skip perceived_colour_master_name
6. use index_code - skip index_name
7. use index_group_no - skip index_group_name
8. use section_no - skip section_name
9. use garment_group_no - skip garment_group_name
10. use product_code, skip product_name
11. use department_no, skip department_name

In [15]:
df = df.drop(columns = ['product_type_name',
                        'graphical_appearance_name',
                        'colour_group_name',
                        'perceived_colour_value_name',
                        'perceived_colour_master_name',
                        'index_name',
                        'index_group_name',
                        'section_name', 
                        'garment_group_name', 'prod_name', 'department_name', 'detail_desc'])
df.head()

Unnamed: 0,article_id,product_code,product_type_no,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no
0,108775015,108775,253,Garment Upper body,1010016,9,4,5,1676,A,1,16,1002
1,108775044,108775,253,Garment Upper body,1010016,10,3,9,1676,A,1,16,1002
2,108775051,108775,253,Garment Upper body,1010017,11,1,9,1676,A,1,16,1002
3,110065001,110065,306,Underwear,1010016,9,4,5,1339,B,1,61,1017
4,110065002,110065,306,Underwear,1010016,10,3,9,1339,B,1,61,1017


I would also like to add the price in the product embedding. Since poducts have different prices deepending on the sale channel (whether it was bought online or in store), I will take the average of these prices and add them to the product features.

In [16]:
transaction = pd.read_csv('/content/data/transactions_train.csv', dtype={'article_id': 'str'})

In [17]:
transaction['article_id'] = transaction['article_id'].astype('int32')
transaction['article_id'] = '0' + transaction['article_id'].astype('str')

In [18]:
transaction.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [19]:
transaction = transaction[['article_id','price']]

In [20]:
transaction.groupby('article_id').mean()

Unnamed: 0_level_0,price
article_id,Unnamed: 1_level_1
0108775015,0.008142
0108775044,0.008114
0108775051,0.004980
0110065001,0.020219
0110065002,0.018205
...,...
0952267001,0.014982
0952938001,0.048006
0953450001,0.016836
0953763001,0.021908


In [21]:
# creating a dictionary that includes the products and their prices
prices = transaction.set_index('article_id').to_dict()['price']

In [22]:
df['price'] = df['article_id'].map(prices)

In [23]:
df['price'] = df['price'].astype('float')
df.head()

Unnamed: 0,article_id,product_code,product_type_no,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_code,index_group_no,section_no,garment_group_no,price
0,108775015,108775,253,Garment Upper body,1010016,9,4,5,1676,A,1,16,1002,0.002068
1,108775044,108775,253,Garment Upper body,1010016,10,3,9,1676,A,1,16,1002,0.008458
2,108775051,108775,253,Garment Upper body,1010017,11,1,9,1676,A,1,16,1002,0.008458
3,110065001,110065,306,Underwear,1010016,9,4,5,1339,B,1,61,1017,0.006763
4,110065002,110065,306,Underwear,1010016,10,3,9,1339,B,1,61,1017,0.005186


Here I would like to add the pretrained image embeddings to the product

In [None]:
embeddings = image_emb1.set_index('iid:token').to_dict()['image_emb:float_seq']

In [None]:
df['image_emb'] = df['article_id'].map(embeddings)

In [None]:
# Here, I'm renaming the columns 
temp = df.rename(
    columns={'article_id': 'item_id:token',
             'product_code': 'product_code:token',
             'product_type_no': 'product_type_no:token',
             'product_group_name': 'product_group_name:token_seq',
             'graphical_appearance_no': 'graphical_appearance_no:token', 
             'colour_group_code': 'colour_group_code:token',
             'perceived_colour_value_id': 'perceived_colour_value_id:token', 
             'perceived_colour_master_id': 'perceived_colour_master_id:token',
             'department_no': 'department_no:token', 
             'index_code': 'index_code:token',
             'index_group_no': 'index_group_no:token',
             'section_no': 'section_no:token', 
             'garment_group_no': 'garment_group_no:token',
             'price': 'price:float',
             'image_emb':'image_emb:float_seq'})
temp.head()

Unnamed: 0,item_id:token,product_code:token,product_type_no:token,product_group_name:token_seq,graphical_appearance_no:token,colour_group_code:token,perceived_colour_value_id:token,perceived_colour_master_id:token,department_no:token,index_code:token,index_group_no:token,section_no:token,garment_group_no:token,image_emb:float_seq
0,108775015,108775,253,Garment Upper body,1010016,9,4,5,1676,A,1,16,1002,-26.877287 -13.949906 -5.2173343 12.229804 -10...
1,108775044,108775,253,Garment Upper body,1010016,10,3,9,1676,A,1,16,1002,-26.07998 2.4806035 4.882296 1.9872117 11.8234...
2,108775051,108775,253,Garment Upper body,1010017,11,1,9,1676,A,1,16,1002,-22.196695 -0.17726707 -3.2106757 -0.44437423 ...
3,110065001,110065,306,Underwear,1010016,9,4,5,1339,B,1,61,1017,20.760696 -7.5740004 20.06105 -19.380611 -6.99...
4,110065002,110065,306,Underwear,1010016,10,3,9,1339,B,1,61,1017,-38.813778 -15.250411 -4.0560465 2.8414378 -9....


In [24]:
temp = df.rename(
    columns={'article_id': 'item_id:token',
             'product_code': 'product_code:token',
             'product_type_no': 'product_type_no:token',
             'product_group_name': 'product_group_name:token_seq',
             'graphical_appearance_no': 'graphical_appearance_no:token', 
             'colour_group_code': 'colour_group_code:token',
             'perceived_colour_value_id': 'perceived_colour_value_id:token', 
             'perceived_colour_master_id': 'perceived_colour_master_id:token',
             'department_no': 'department_no:token', 
             'index_code': 'index_code:token',
             'index_group_no': 'index_group_no:token',
             'section_no': 'section_no:token', 
             'garment_group_no': 'garment_group_no:token',
             'price': 'price:float',
             'image_emb':'image_emb:float_seq'})

In [26]:
!mkdir /content/recbox_data
!mkdir /content/NPE

In [None]:
temp.to_csv(r'/content/recbox_data/recbox_data.item', index=False, sep='\t')

In [27]:
temp.to_csv(r'/content/NPE/NPE.item', index=False, sep='\t')

## Creating the user-item interaction atomic file

In [14]:
df = pd.read_csv(r'/content/data/transactions_train.csv', )
#Tricks for reducing the memory
df['article_id'] = df['article_id'].astype('int32')
df['article_id'] = '0' + df['article_id'].astype('str')
df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [15]:
df['t_dat'] = pd.to_datetime(df['t_dat'], format="%Y-%m-%d")

In [16]:
#converting datetime to a timestamp numpy array by values and cast to int64 - output is in ns, so it needs to be divide by 10 ** 9
df['timestamp'] = df.t_dat.values.astype(np.int64) // 10 ** 9
df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,timestamp
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,1537401600
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,1537401600
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2,1537401600
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2,1537401600
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2,1537401600


I will only be looking at interactions from Aug 2019 or after for the GRU model, since fasion habits is highly variant and changes drastically over time, I won't be looking at the fasion habits and purchases that are older than a year. However, for the NPE model, I need to include all the data because some of the cold users purchased all their items before 2020. This means that the model won't be able to predict for a user that had no purchase history.




In [17]:
#I will only be looking at interactions from 2020 or after for the GRU model
temp_GRU = df[df['timestamp'] > 1577859277][['customer_id', 'article_id', 'timestamp']].rename(
    columns={'customer_id': 'user_id:token', 'article_id': 'item_id:token', 'timestamp': 'timestamp:float'})
temp_GRU.head()

Unnamed: 0,user_id:token,item_id:token,timestamp:float
20820952,00025f8226be50dcab09402a2cacd520a99e112fe01fdd...,797565002,1577923200
20820953,00025f8226be50dcab09402a2cacd520a99e112fe01fdd...,797565001,1577923200
20820954,00067622de3151a7219b4ed9922def50b51601fbe41418...,801865004,1577923200
20820955,0010f56acce349e6e82bfef13ee39232a8bc0db0801ca4...,578752001,1577923200
20820956,0010f56acce349e6e82bfef13ee39232a8bc0db0801ca4...,578752001,1577923200


In [18]:
temp_NPE = df[['customer_id', 'article_id', 'timestamp']]
temp_NPE = temp_NPE.rename(columns={'customer_id': 'user_id:token', 'article_id': 'item_id:token', 'timestamp': 'timestamp:float'})
temp_NPE.head()

Unnamed: 0,user_id:token,item_id:token,timestamp:float
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,1537401600
1,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,1537401600
2,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,1537401600
3,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,1537401600
4,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,1537401600


In [19]:
len(temp_GRU), len(temp_NPE)

(10967372, 31788324)

In [20]:
temp_GRU.to_csv('/content/recbox_data/recbox_data.inter', index=False, sep='\t')

In [34]:
temp_NPE.to_csv('/content/NPE/NPE.inter', index=False, sep='\t')

## Creating the customer atomic file

In [35]:
df = pd.read_csv('/content/data/customers.csv')

In [None]:
df.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [37]:
# Converting the NA values to zero for both FN and Active Columns
df['FN'] = df.apply(
    lambda x: 0 if x['FN'] != 1 else 1, axis=1)
df['Active'] = df.apply(
    lambda x: 0 if x['Active'] != 1 else 1, axis=1)

In [38]:
df.isna().sum()

customer_id                   0
FN                            0
Active                        0
club_member_status         6062
fashion_news_frequency    16009
age                       15861
postal_code                   0
dtype: int64

For those with missing age information, I will be replacing those values with the total average age for all users

In [39]:
df['age'] =df['age'].fillna(pd.Series.mean(df['age']) , axis = 0)
#df['Age']= df.apply(lambda x: pd.Series.mean(df['age']) if df['Age'] == pd.nan , axis =1)

In [40]:
df['club_member_status'] =df['club_member_status'].fillna('None', axis=0)
df['fashion_news_frequency'] = df['fashion_news_frequency'].fillna('None', axis = 0) 

In [41]:
df.isna().sum()

customer_id               0
FN                        0
Active                    0
club_member_status        0
fashion_news_frequency    0
age                       0
postal_code               0
dtype: int64

In [36]:
temp = df.rename(
    columns={ 'customer_id': 'user_id:token', 'FN' : 'FN:token' , 'Active':'Active:token', 'age':'age:token',
             'fashion_news_frequency': 'fasion_news_frequency:token' , 'postal_code': 'postal_code:token',
            'club_member_status' : 'club_member_status:token_seq'})
temp.head()

Unnamed: 0,user_id:token,FN:token,Active:token,club_member_status:token_seq,fasion_news_frequency:token,age:token,postal_code:token
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [42]:
temp = df.rename(
    columns={ 'customer_id': 'user_id:token', 'FN' : 'FN:token' , 'Active':'Active:token', 'age':'age:token',
             'fashion_news_frequency': 'fasion_news_frequency:token' , 'postal_code': 'postal_code:token',
            'club_member_status' : 'club_member_status:token_seq'})
temp.head()

Unnamed: 0,user_id:token,FN:token,Active:token,club_member_status:token_seq,fasion_news_frequency:token,age:token,postal_code:token
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0,0,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0,0,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0,0,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0,0,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1,1,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [43]:
temp.to_csv('/content/recbox_data/recbox_data.user', index=False, sep='\t')

# Active Users
These models are to predict the products for people with 30+ pruchases.

## GRU4Rec

In [1]:
import logging
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.sequential_recommender import GRU4Rec, npe , GRU4RecKG, GRU4RecF
from recbole.model.knowledge_aware_recommender import CKE
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger
from recbole.utils.case_study import full_sort_topk

First, I will be configuring all the important parameters in the model. 

In [32]:
parameter_dict = {
    'data_path': '/content',
    #'use_gpu':'True',
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'TIME_FIELD': 'timestamp',
    'seq_len':{'image_emb': 150},

    ## This will make the model consider users who made 30 or more interactions 
    'user_inter_num_interval': "[30,inf)",

    ## This will make the model only consider the items that have 
    # been bought more than 30 times or more (the popular items)
    'item_inter_num_interval': "[30,inf)",

    #'additional_feat_suffix': ['image'],

    ## Columns to load
    'load_col': {'inter': ['user_id', 'item_id', 'timestamp'],
                 'item': ['item_id', 'product_code', 'product_type_no', 'product_group_name', 'graphical_appearance_no',
                          'colour_group_code', 'perceived_colour_value_id', 'perceived_colour_master_id',
                          'department_no', 'index_code', 'index_group_no', 'section_no', 'garment_group_no', 
                          'age', 'postal_code']
                 },
                  
                  
    # Columns used directly by the model (has to be selected manually )
    'selected_features': ['product_code', 'product_type_no', 'product_group_name', 'graphical_appearance_no',
                          'colour_group_code', 'perceived_colour_value_id', 'perceived_colour_master_id',
                          'department_no', 'index_code', 'index_group_no', 'section_no', 'garment_group_no', 
                          'age', 'postal_code'],
    'neg_sampling': None,
    #'embedding_size': 128,
    #'pooling_mode': 'max',
    'epochs': 70,
    'topk': 12,
    'loss_type': 'CE',
    'eval_args': {
        'split': {'RS': [9, 0, 1]},
        'group_by': 'user',
        'order': 'TO',
        'mode': 'full'}
}

config = Config(model='GRU4Rec', dataset='recbox_data', config_dict=parameter_dict)

# init randseed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config) 
logger = getLogger()
# Create handlers
c_handler = logging.StreamHandler()
c_handler.setLevel(logging.INFO)
logger.addHandler(c_handler)

# write config info into log
logger.info(config)

07 May 07:44    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = /content/recbox_data
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 70
train_batch_size = 2048
learner = adam
learning_rate = 0.001
neg_sampling = None
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}
repeatable = True
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [12]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator = 	
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
s

This is for creating the readable dataset for the model. It combines the atomic files that I've created earlier, along with the parameter configuration that I set.

In [25]:
dataset = create_dataset(config)
logger.info(dataset)

07 May 07:37    INFO  recbox_data
The number of users: 79292
Average actions of users: 51.502490824936
The number of items: 20677
Average actions of items: 197.5084155542658
The number of inters: 4083684
The sparsity of the dataset: 99.75092208107817%
Remain Fields: ['user_id', 'item_id', 'timestamp']
recbox_data
The number of users: 79292
Average actions of users: 51.502490824936
The number of items: 20677
Average actions of items: 197.5084155542658
The number of inters: 4083684
The sparsity of the dataset: 99.75092208107817%
Remain Fields: ['user_id', 'item_id', 'timestamp']
recbox_data
The number of users: 79292
Average actions of users: 51.502490824936
The number of items: 20677
Average actions of items: 197.5084155542658
The number of inters: 4083684
The sparsity of the dataset: 99.75092208107817%
Remain Fields: ['user_id', 'item_id', 'timestamp']
recbox_data
The number of users: 79292
Average actions of users: 51.502490824936
The number of items: 20677
Average actions of items: 1

This is for creating the train, test, and validation data. Since this is a timeseries model, it will be based on the timestamp.

In [26]:
# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

07 May 07:38    INFO  [Training]: train_batch_size = [2048] negative sampling: [None]
[Training]: train_batch_size = [2048] negative sampling: [None]
[Training]: train_batch_size = [2048] negative sampling: [None]
[Training]: train_batch_size = [2048] negative sampling: [None]
07 May 07:38    INFO  [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}]
[Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}]
[Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}]
[Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}]


Training the model over 70 epochs. The output is too long so I will be supressing it.

In [None]:
model = GRU4Rec(config, train_data.dataset).to(config['device'])
logger.info(model)
# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data)

Using the model to predict the results.

In [28]:
external_user_ids = dataset.id2token(
    dataset.uid_field, list(range(dataset.user_num)))[1:]#fist element in array is 'PAD'(default of Recbole) ->remove it 

In [29]:
topk_items = []
for internal_user_id in list(range(dataset.user_num))[1:]:
    topk_scores, topk_iid_list = full_sort_topk([internal_user_id], model, test_data, k=12 ,device=config['device'])
    last_topk_iid_list = topk_iid_list[-1]
    external_item_list = dataset.id2token(dataset.iid_field, last_topk_iid_list.cpu()).tolist()
    topk_items.append(external_item_list)
print(len(topk_items))

79291


In [30]:
external_item_str = [' '.join(x) for x in topk_items]
result_GRU = pd.DataFrame(external_user_ids, columns=['customer_id'])
result_GRU['prediction'] = external_item_str
result_GRU.head()

Unnamed: 0,customer_id,prediction
0,001ddeb8fb74fec5693116da83b488e05ee9a9e179f3fd...,0706016001 0730683050 0866731001 0448509014 08...
1,002cf96a2e882620182534ca963f8b43f9aafa2b668c09...,0896169002 0915529003 0915526001 0751471043 08...
2,00357b192b81fc83261a45be87f5f3d59112db7d117513...,0915529003 0915526001 0884319005 0891591003 08...
3,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,0730683050 0852584001 0866731001 0912204001 09...
4,0071a08839ee650528afe22a7bff79334a09eed3edf67b...,0751471043 0685816002 0905365001 0751471001 05...


## GRU4RecF

In [31]:
parameter_dict = {
    'data_path': '/content',
    #'use_gpu':'True',
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'TIME_FIELD': 'timestamp',
    'seq_len':{'image_emb': 150},
    # using two layers instead of one
    'num_layers': 2,
    ## This will make the model consider users who made 30 or more interactions 
    'user_inter_num_interval': "[30,inf)",

    ## This will make the model only consider the items that have 
    # been bought more than 30 times or more (the popular items)
    'item_inter_num_interval': "[30,inf)",

    #'additional_feat_suffix': ['image'],

    ## Columns to load
    'load_col': {'inter': ['user_id', 'item_id', 'timestamp'],
                 'item': ['item_id', 'image_emb']
                 },
                  
    #'preload_weight': {'iid':'image_emb'},
    #'alias_of_item_id': ['iid'],
                  
    # Columns used directly by the model (has to be selected manually )
    'selected_features': ['image_emb'],
    #['product_code', 'product_type_no', 'product_group_name', 'graphical_appearance_no',
                          #'colour_group_code', 'perceived_colour_value_id', 'perceived_colour_master_id',
                          #'department_no', 'index_code', 'index_group_no', 'section_no', 'garment_group_no', 
                          #'age', 'postal_code','image_emb'],
    'neg_sampling': "{'uniform': 1}",
    'embedding_size': 128,
    'pooling_mode': 'max',
    'epochs': 70,
    'topk': 12,
    'loss_type': 'BPR',
    'eval_args': {
        'split': {'RS': [9, 0, 1]},
        'group_by': 'user',
        'order': 'TO',
        'mode': 'full'}
}

config = Config(model='GRU4RecF', dataset='recbox_data', config_dict=parameter_dict)

# init randseed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config) 
logger = getLogger()
# Create handlers
c_handler = logging.StreamHandler()
c_handler.setLevel(logging.INFO)
logger.addHandler(c_handler)

# write config info into log
logger.info(config)

07 May 07:44    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = /content/recbox_data
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 70
train_batch_size = 2048
learner = adam
learning_rate = 0.001
neg_sampling = {'uniform': 1}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}
repeatable = True
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [12]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator = 	
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = t

In [None]:
dataset = create_dataset(config)
logger.info(dataset)

06 May 07:08    INFO  recbox_data
The number of users: 79292
Average actions of users: 51.502490824936
The number of items: 20677
Average actions of items: 197.5084155542658
The number of inters: 4083684
The sparsity of the dataset: 99.75092208107817%
Remain Fields: ['user_id', 'item_id', 'timestamp', 'image_emb']
recbox_data
The number of users: 79292
Average actions of users: 51.502490824936
The number of items: 20677
Average actions of items: 197.5084155542658
The number of inters: 4083684
The sparsity of the dataset: 99.75092208107817%
Remain Fields: ['user_id', 'item_id', 'timestamp', 'image_emb']


In [None]:
# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

06 May 07:09    INFO  [Training]: train_batch_size = [2048] negative sampling: [{'uniform': 1}]
[Training]: train_batch_size = [2048] negative sampling: [{'uniform': 1}]
06 May 07:09    INFO  [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}]
[Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'RS': [9, 0, 1]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}]


In [None]:
model = GRU4RecF(config, train_data.dataset).to(config['device'])
logger.info(model)
# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data)

In [None]:
external_user_ids = dataset.id2token(
    dataset.uid_field, list(range(dataset.user_num)))[1:]#fist element in array is 'PAD'(default of Recbole) ->remove it 

In [None]:
topk_items = []
for internal_user_id in list(range(dataset.user_num))[1:]:
    topk_scores, topk_iid_list = full_sort_topk([internal_user_id], model, test_data, k=12 ,device=config['device'])
    last_topk_iid_list = topk_iid_list[-1]
    external_item_list = dataset.id2token(dataset.iid_field, last_topk_iid_list.cpu()).tolist()
    topk_items.append(external_item_list)
print(len(topk_items))

In [33]:
external_item_str = [' '.join(x) for x in topk_items]
result_GRUF = pd.DataFrame(external_user_ids, columns=['customer_id'])
result_GRUF['prediction'] = external_item_str
result_GRUF.head()

Unnamed: 0,customer_id,prediction
0,001ddeb8fb74fec5693116da83b488e05ee9a9e179f3fd...,0706016001 0730683050 0866731001 0448509014 08...
1,002cf96a2e882620182534ca963f8b43f9aafa2b668c09...,0896169002 0915529003 0915526001 0751471043 08...
2,00357b192b81fc83261a45be87f5f3d59112db7d117513...,0915529003 0915526001 0884319005 0891591003 08...
3,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,0730683050 0852584001 0866731001 0912204001 09...
4,0071a08839ee650528afe22a7bff79334a09eed3edf67b...,0751471043 0685816002 0905365001 0751471001 05...


# Cold Users
These models will be used for people with 30 or less purchases.
## NPE

In [None]:
# NPE

parameter_dict = {
    'data_path': '/content',
    'use_gpu':'True',
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'TIME_FIELD': 'timestamp',
    'field_seperator': "\t",

    ## This will make the model consider users who made between 1 and 30 interactions 
    'user_inter_num_interval': "[3,25)",

    ## This will make the model only consider the items that have been bought more than 50 times or more (the popular items)
    'item_inter_num_interval': "[40,inf)",


    ## Columns to load
    'load_col': {'inter': ['user_id', 'item_id', 'timestamp'],
                 'item': ['item_id', 'product_code', 'product_type_no', 'product_group_name', 'graphical_appearance_no',
                      'colour_group_code', 'perceived_colour_value_id', 'perceived_colour_master_id',
                      'department_no', 'index_code', 'index_group_no', 'section_no', 'garment_group_no' ]
                 },

                  
    # Columns used directly by the model (has to be selected manually )
    'selected_features': ['product_code', 'product_type_no', 'product_group_name', 'graphical_appearance_no',
                          'colour_group_code', 'perceived_colour_value_id', 'perceived_colour_master_id',
                          'department_no', 'index_code', 'index_group_no', 'section_no', 'garment_group_no'],
    'neg_sampling':  None, #"{'uniform': 1}",
    
    'epochs': 1,
    'topk': 12,
    'loss_type': 'CE',
    'eval_args': {
        'split': {'RS': [9, 0, 1]},
        'group_by': 'user',
        'order': 'TO',
        'mode': 'full'}
}

config = Config(model='NPE', dataset='NPE', config_dict=parameter_dict)

# init randseed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config) 
logger = getLogger()
# Create handlers
c_handler = logging.StreamHandler()
c_handler.setLevel(logging.INFO)
logger.addHandler(c_handler)

# write config info into log
logger.info(config)

In [3]:
dataset = create_dataset(config)
logger.info(dataset)

07 May 03:32    INFO  NPE
The number of users: 697137
Average actions of users: 9.158620412659797
The number of items: 32599
Average actions of items: 195.86489968709736
The number of inters: 6384804
The sparsity of the dataset: 99.9719052508507%
Remain Fields: ['user_id', 'item_id', 'timestamp', 'product_code', 'product_type_no', 'product_group_name', 'graphical_appearance_no', 'colour_group_code', 'perceived_colour_value_id', 'perceived_colour_master_id', 'department_no', 'index_code', 'index_group_no', 'section_no', 'garment_group_no']
NPE
The number of users: 697137
Average actions of users: 9.158620412659797
The number of items: 32599
Average actions of items: 195.86489968709736
The number of inters: 6384804
The sparsity of the dataset: 99.9719052508507%
Remain Fields: ['user_id', 'item_id', 'timestamp', 'product_code', 'product_type_no', 'product_group_name', 'graphical_appearance_no', 'colour_group_code', 'perceived_colour_value_id', 'perceived_colour_master_id', 'department_no'

In [None]:
# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

In [None]:
# model loading and initialization
model = npe.NPE(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data)

### Predicting the results using the model

In [6]:
from recbole.utils.case_study import full_sort_topk
external_user_ids = dataset.id2token(
    dataset.uid_field, list(range(dataset.user_num)))[1:]#fist element in array is 'PAD'(default of Recbole) ->remove it 

In [13]:
topk_items = []
for internal_user_id in list(range(dataset.user_num))[1:]:
    _, topk_iid_list = full_sort_topk([internal_user_id], model, test_data, k=12, device=config['device'])
    last_topk_iid_list = topk_iid_list[-1]
    external_item_list = dataset.id2token(dataset.iid_field, last_topk_iid_list.cpu()).tolist()
    topk_items.append(external_item_list)
print(len(topk_items))

697136


In [14]:
external_item_str = [' '.join(x) for x in topk_items]
result_NPE = pd.DataFrame(external_user_ids, columns=['customer_id'])
result_NPE['prediction'] = external_item_str
result_NPE.head()

Unnamed: 0,customer_id,prediction
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0751471001 0706016001 0783346018 0720125001 03...
1,00401a367c5ac085cb9d4b77c56f3edcabf25153615db9...,0399256001 0562245018 0399223001 0399256018 05...
2,00402f4463c8dc1b3ee54abfdea280e96cd87320449eca...,0568601006 0507909001 0706016001 0399256001 05...
3,0045c79125b4dc958579f902b49eacd8598f9eeaa12205...,0610776002 0372860001 0568601006 0608776002 05...
4,00708c3da4d07706d4cad77c6aecc1b1ce33d21d73022c...,0720125001 0751471001 0372860002 0759871002 07...


## ALS model

ALS (Alternating Least Squares) is one of the most used ML models for recommender systems. It's a matrix factorization method. Basically, ALS factorizes the interaction matrix (user x items) into two smaller matrices, one for item embeddings and one for user embeddings. These new matrices are built in a manner such that the multiplication of a user and an item gives (approximately) it's interaction score. This build embeddings for items and for users that live in the same vector space, allowing the implementation of recommendations as simple cosine distances between users and items. This is, the 12 items we recommend for a given user are the 12 items with their embedding vectors closer to the user embedding vector.



In [None]:
!pip install --upgrade implicit

In [27]:
import implicit
from scipy.sparse import coo_matrix
from implicit.evaluation import mean_average_precision_at_k

In [25]:
base_path = '/content/data/'
csv_train = f'{base_path}transactions_train.csv'
csv_sub = f'{base_path}sample_submission.csv'
csv_users = f'{base_path}customers.csv'
csv_items = f'{base_path}articles.csv'
df = pd.read_csv(csv_train, dtype={'article_id': str}, parse_dates=['t_dat'])
df_sub = pd.read_csv(csv_sub)
dfu = pd.read_csv(csv_users)
dfi = pd.read_csv(csv_items, dtype={'article_id': str})

In [18]:
# Trying with less data:
# https://www.kaggle.com/tomooinubushi/folk-of-time-is-our-best-friend/notebook
df = df[df['t_dat'] > '2020-08-21']
df.shape

(1190911, 5)

In [19]:
# For validation this means 3 weeks of training and 1 week for validation
# For submission, it means 4 weeks of training
df['t_dat'].max()

Timestamp('2020-09-22 00:00:00')

### Data Manipulation

I will be assigning auto incrementing ids for both product id and customer id for the matrices. Remember I'm only doing this on a subset of data, then I will go training over the whole data.

In [26]:
ALL_USERS = dfu['customer_id'].unique().tolist()
ALL_ITEMS = dfi['article_id'].unique().tolist()

user_ids = dict(list(enumerate(ALL_USERS)))
item_ids = dict(list(enumerate(ALL_ITEMS)))

user_map = {u: uidx for uidx, u in user_ids.items()}
item_map = {i: iidx for iidx, i in item_ids.items()}

df['user_id'] = df['customer_id'].map(user_map)
df['item_id'] = df['article_id'].map(item_map)

del dfu, dfi

In [28]:
row = df['user_id'].values
col = df['item_id'].values
data = np.ones(df.shape[0])
coo_train = coo_matrix((data, (row, col)), shape=(len(ALL_USERS), len(ALL_ITEMS)))
coo_train

<1371980x105542 sparse matrix of type '<class 'numpy.float64'>'
	with 31788324 stored elements in COOrdinate format>

### Model Training

In [29]:
%%time
model = implicit.als.AlternatingLeastSquares(factors=10, iterations=2)
model.fit(coo_train)

  0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 4.49 s, sys: 184 ms, total: 4.68 s
Wall time: 5.67 s


### Validation

In [30]:
def to_user_item_coo(df):
    """ Turn a dataframe with transactions into a COO sparse items x users matrix"""
    row = df['user_id'].values
    col = df['item_id'].values
    data = np.ones(df.shape[0])
    coo = coo_matrix((data, (row, col)), shape=(len(ALL_USERS), len(ALL_ITEMS)))
    return coo


def split_data(df, validation_days=7):
    """ Split a pandas dataframe into training and validation data, using <<validation_days>>
    """
    validation_cut = df['t_dat'].max() - pd.Timedelta(validation_days)

    df_train = df[df['t_dat'] < validation_cut]
    df_val = df[df['t_dat'] >= validation_cut]
    return df_train, df_val

def get_val_matrices(df, validation_days=7):
    """ Split into training and validation and create various matrices
        
        Returns a dictionary with the following keys:
            coo_train: training data in COO sparse format and as (users x items)
            csr_train: training data in CSR sparse format and as (users x items)
            csr_val:  validation data in CSR sparse format and as (users x items)
    
    """
    df_train, df_val = split_data(df, validation_days=validation_days)
    coo_train = to_user_item_coo(df_train)
    coo_val = to_user_item_coo(df_val)

    csr_train = coo_train.tocsr()
    csr_val = coo_val.tocsr()
    
    return {'coo_train': coo_train,
            'csr_train': csr_train,
            'csr_val': csr_val
          }


def validate(matrices, factors=200, iterations=20, regularization=0.01, show_progress=True):
    """ Train an ALS model with <<factors>> (embeddings dimension) 
    for <<iterations>> over matrices and validate with MAP@12
    """
    coo_train, csr_train, csr_val = matrices['coo_train'], matrices['csr_train'], matrices['csr_val']
    
    model = implicit.als.AlternatingLeastSquares(factors=factors, 
                                                 iterations=iterations, 
                                                 regularization=regularization, 
                                                 random_state=42)
    model.fit(coo_train, show_progress=show_progress)
    
    # The MAPK by implicit doesn't allow to calculate allowing repeated items, which is the case.
    # TODO: change MAP@12 to a library that allows repeated items in prediction
    map12 = mean_average_precision_at_k(model, csr_train, csr_val, K=12, show_progress=show_progress, num_threads=4)
    print(f"Factors: {factors:>3} - Iterations: {iterations:>2} - Regularization: {regularization:4.3f} ==> MAP@12: {map12:6.5f}")
    return map12

In [31]:
matrices = get_val_matrices(df)

In [32]:
%%time
best_map12 = 0
for factors in [40, 50, 60, 100, 200, 500, 1000]:
    for iterations in [3, 12, 14, 15, 20]:
        for regularization in [0.01]:
            map12 = validate(matrices, factors, iterations, regularization, show_progress=False)
            if map12 > best_map12:
                best_map12 = map12
                best_params = {'factors': factors, 'iterations': iterations, 'regularization': regularization}
                print(f"Best MAP@12 found. Updating: {best_params}")

Factors:  40 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00283
Best MAP@12 found. Updating: {'factors': 40, 'iterations': 3, 'regularization': 0.01}
Factors:  40 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00384
Best MAP@12 found. Updating: {'factors': 40, 'iterations': 12, 'regularization': 0.01}
Factors:  40 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00378
Factors:  40 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00375
Factors:  40 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00378
Factors:  50 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00350
Factors:  50 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00370
Factors:  50 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00370
Factors:  50 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00373
Factors:  50 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00369
Factors:  60 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00312
Factors:  60 

In [33]:
del matrices

### Training over the full dataset

In [34]:
coo_train = to_user_item_coo(df)
csr_train = coo_train.tocsr()

In [35]:
def train(coo_train, factors=200, iterations=15, regularization=0.01, show_progress=True):
    model = implicit.als.AlternatingLeastSquares(factors=factors, 
                                                 iterations=iterations, 
                                                 regularization=regularization, 
                                                 random_state=42)
    model.fit(coo_train, show_progress=show_progress)
    return model

In [36]:
best_params

{'factors': 40, 'iterations': 12, 'regularization': 0.01}

In [37]:
model = train(coo_train, **best_params)

  0%|          | 0/12 [00:00<?, ?it/s]

In [38]:
# Submission
def submit(model, csr_train, submission_name="submissions.csv"):
    preds = []
    batch_size = 2000
    to_generate = np.arange(len(ALL_USERS))
    for startidx in range(0, len(to_generate), batch_size):
        batch = to_generate[startidx : startidx + batch_size]
        ids, scores = model.recommend(batch, csr_train[batch], N=12, filter_already_liked_items=False)
        for i, userid in enumerate(batch):
            customer_id = user_ids[userid]
            user_items = ids[i]
            article_ids = [item_ids[item_id] for item_id in user_items]
            preds.append((customer_id, ' '.join(article_ids)))

    df_preds = pd.DataFrame(preds, columns=['customer_id', 'prediction'])
    df_preds.to_csv(submission_name, index=False)
    
    display(df_preds.head())
    print(df_preds.shape)
    
    return df_preds

In [39]:
%%time
ALS_result = submit(model, csr_train);

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601006 0568597006 0568601007 0507909001 05...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0351484002 0673677002 0759871002 0599580055 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0351484002 0723529001 0699080001 0699081001 06...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0720125001 0484398001 0730683001 0564786001 04...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0720125001 0599580020 0751471001 0673396002 06...


(1371980, 2)
CPU times: user 41.7 s, sys: 667 ms, total: 42.4 s
Wall time: 42.2 s


## Trending Products

In [40]:
from math import sqrt
from pathlib import Path
from tqdm import tqdm
tqdm.pandas()

In [41]:
data_path = Path('/content/data')
N = 12

In [42]:
df = pd.read_csv(data_path / 'transactions_train.csv',
                 usecols = ['t_dat', 'customer_id', 'article_id'],
                 dtype={'article_id': str})

df['t_dat'] = pd.to_datetime(df['t_dat'])
last_ts = df['t_dat'].max()

In [None]:
# Adding the last day of billing week
df['ldbw'] = df['t_dat'].progress_apply(lambda d: last_ts - (last_ts - d).floor('7D'))

100%|█████████▉| 31787686/31788324 [1:02:44<00:00, 10893.00it/s]

In [None]:
#Count the number of transactions per week
weekly_sales = df.drop('customer_id', axis=1).groupby(['ldbw', 'article_id']).count()
weekly_sales = weekly_sales.rename(columns={'t_dat': 'count'})

In [None]:
df = df.join(weekly_sales, on=['ldbw', 'article_id'])

Now, let's assume that the target week sales will be similar to the last week of the training data

In [None]:
weekly_sales = weekly_sales.reset_index().set_index('article_id')
last_day = last_ts.strftime('%Y-%m-%d')

df = df.join(
    weekly_sales.loc[weekly_sales['ldbw']==last_day, ['count']],
    on='article_id', rsuffix="_targ")

df['count_targ'].fillna(0, inplace=True)
del weekly_sales

Calculating the sales rates according to chagnes in product popularity

In [None]:
df['quotient'] = df['count_targ'] / df['count']

In [None]:
# Using the popular products
target_sales = df.drop('customer_id', axis=1).groupby('article_id')['quotient'].sum()
general_pred = target_sales.nlargest(N).index.tolist()
del target_sales

Filling the purchase dictionary

In [None]:
purchase_dict = {}

for i in tqdm(df.index):
    cust_id = df.at[i, 'customer_id']
    art_id = df.at[i, 'article_id']
    t_dat = df.at[i, 't_dat']

    if cust_id not in purchase_dict:
        purchase_dict[cust_id] = {}

    if art_id not in purchase_dict[cust_id]:
        purchase_dict[cust_id][art_id] = 0
    
    x = max(1, (last_ts - t_dat).days)

    a, b, c, d = 2.5e4, 1.5e5, 2e-1, 1e3
    y = a / np.sqrt(x) + b * np.exp(-c*x) - d

    value = df.at[i, 'quotient'] * max(0, y)
    purchase_dict[cust_id][art_id] += value

Submitting the results

In [None]:
sub = pd.read_csv(data_path / 'sample_submission.csv')

pred_list = []
for cust_id in tqdm(sub['customer_id']):
    if cust_id in purchase_dict:
        series = pd.Series(purchase_dict[cust_id])
        series = series[series > 0]
        l = series.nlargest(N).index.tolist()
        if len(l) < N:
            l = l + general_pred[:(N-len(l))]
    else:
        l = general_pred
    pred_list.append(' '.join(l))

sub['prediction'] = pred_list
result_trending= sub
#sub.to_csv('submission.csv', index=None)

# Combining results
## Cold users

In [None]:
sub0 = ALS_result.sort_values('customer_id').reset_index(drop=True)
sub1 = result_NPE('customer_id').reset_index(drop=True)
sub2 = result_trending.sort_values('customer_id').reset_index(drop=True)
sub0.shape, sub1.shape, sub2.shape

((1371980, 2), (1371980, 2), (133300, 2))

In [None]:
sub0.columns = ['customer_id', 'prediction0']
sub0['prediction1'] = sub1['prediction']
sub0['prediction2'] = sub2['prediction']
del sub1, sub2
print(sub0.head())

                                         customer_id  \
0  00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...   
1  0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...   
2  000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...   
3  00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...   
4  00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...   

                                         prediction0  \
0  0568601043 0568601006 0656719005 0745232001 07...   
1  0826211002 0800436010 0739590027 0723529001 08...   
2  0794321007 0852643001 0852643003 0858883002 07...   
3  0448509014 0573085028 0751471001 0706016001 06...   
4  0730683050 0791587015 0896152002 0818320001 09...   

                                         prediction1  \
0  0568601043 0924243001 0924243002 0918522001 07...   
1  0924243001 0924243002 0918522001 0751471001 04...   
2  0794321007 0924243001 0924243002 0918522001 07...   
3  0924243001 0924243002 0918522001 0751471001 04...   
4  0924243001 0924243002 0918522001 0751471001

This is a custom ensembling function that ensembles the data and updates the weights accordingly.

In [None]:
def cust_blend(dt, W = [1,1,1]):
    #Global ensemble weights
    #W = [1.15,0.95,0.85]
    
    #Create a list of all model predictions
    REC = []
    REC.append(dt['prediction0'].split())
    REC.append(dt['prediction1'].split())
    REC.append(dt['prediction2'].split())
    
    #Create a dictionary of items recommended. 
    #Assign a weight according the order of appearance and multiply by global weights
    res = {}
    for M in range(len(REC)):
        for n, v in enumerate(REC[M]):
            if v in res:
                res[v] += (W[M]/(n+1))
            else:
                res[v] = (W[M]/(n+1))
    
    # Sort dictionary by item weights
    res = list(dict(sorted(res.items(), key=lambda item: -item[1])).keys())
    
    # Return the top 12 itens only
    return ' '.join(res[:12])

sub0['prediction'] = sub0.apply(cust_blend, W = [1.05,1.00,0.95], axis=1)
sub0.head()

Unnamed: 0,customer_id,prediction0,prediction1,prediction2,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0656719005 0745232001 07...,0568601043 0924243001 0924243002 0918522001 07...,0568601043 0924243001 0924243002 0918522001 07...,0568601043 0568601006 0924243001 0656719005 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0800436010 0739590027 0723529001 08...,0924243001 0924243002 0918522001 0751471001 04...,0924243001 0924243002 0918522001 0751471001 04...,0826211002 0924243001 0800436010 0924243002 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0852643003 0858883002 07...,0794321007 0924243001 0924243002 0918522001 07...,0794321007 0924243001 0924243002 0918522001 07...,0794321007 0852643001 0924243001 0852643003 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0573085028 0751471001 0706016001 06...,0924243001 0924243002 0918522001 0751471001 04...,0924243001 0924243002 0918522001 0751471001 04...,0448509014 0924243001 0751471001 0573085028 09...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0791587015 0896152002 0818320001 09...,0924243001 0924243002 0918522001 0751471001 04...,0924243001 0924243002 0918522001 0751471001 04...,0730683050 0924243001 0791587015 0924243002 08...


In [None]:
del sub0['prediction0']
del sub0['prediction1']
del sub0['prediction2']
#sub0.to_csv(f'submission.csv', index=False)

In [None]:
sub0.head()

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0924243001 0656719005 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0924243001 0800436010 0924243002 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0924243001 0852643003 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0924243001 0751471001 0573085028 09...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0924243001 0791587015 0924243002 08...


In [38]:
result_Cold = sub0

## Active users

In [40]:
sub0 = result_GRU.sort_values('customer_id').reset_index(drop=True)
sub1 = result_GRUF.sort_values('customer_id').reset_index(drop=True)
sub0.shape, sub1.shape

((79291, 2), (79291, 2))

In [41]:
sub0.columns = ['customer_id', 'prediction0']
sub0['prediction1'] = sub1['prediction']
del sub1
print(sub0.head())

                                         customer_id  \
0  00009d946eec3ea54add5ba56d5210ea898def4b46c685...   
1  0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...   
2  0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...   
3  0002d2ef78ec29c03a951963a2694be963c06d7123b972...   
4  000493dd9fc463df1acc2081450c9e75ef8e87d5dd17ed...   

                                         prediction0  \
0  0797892001 0685816002 0573937001 0889652001 06...   
1  0896169002 0912204001 0852584001 0905518001 07...   
2  0896169002 0915526001 0918836001 0915529003 08...   
3  0599580038 0599580052 0590928001 0688537004 05...   
4  0228257001 0915529003 0685813001 0898713001 08...   

                                         prediction1  
0  0797892001 0685816002 0573937001 0889652001 06...  
1  0896169002 0912204001 0852584001 0905518001 07...  
2  0896169002 0915526001 0918836001 0915529003 08...  
3  0599580038 0599580052 0590928001 0688537004 05...  
4  0228257001 0915529003 0685813001 0898713001 08..

In [42]:
def cust_blend(dt, W = [1,1]):
    #Global ensemble weights
    #W = [1.15,0.95,0.85]
    
    #Create a list of all model predictions
    REC = []
    REC.append(dt['prediction0'].split())
    REC.append(dt['prediction1'].split())

    #Create a dictionary of items recommended. 
    #Assign a weight according the order of appearance and multiply by global weights
    res = {}
    for M in range(len(REC)):
        for n, v in enumerate(REC[M]):
            if v in res:
                res[v] += (W[M]/(n+1))
            else:
                res[v] = (W[M]/(n+1))
    
    # Sort dictionary by item weights
    res = list(dict(sorted(res.items(), key=lambda item: -item[1])).keys())
    
    # Return the top 12 itens only
    return ' '.join(res[:12])

sub0['prediction'] = sub0.apply(cust_blend, W = [1.05,1.00,0.95], axis=1)
sub0.head()

Unnamed: 0,customer_id,prediction0,prediction1,prediction
0,00009d946eec3ea54add5ba56d5210ea898def4b46c685...,0797892001 0685816002 0573937001 0889652001 06...,0797892001 0685816002 0573937001 0889652001 06...,0797892001 0685816002 0573937001 0889652001 06...
1,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,0896169002 0912204001 0852584001 0905518001 07...,0896169002 0912204001 0852584001 0905518001 07...,0896169002 0912204001 0852584001 0905518001 07...
2,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0896169002 0915526001 0918836001 0915529003 08...,0896169002 0915526001 0918836001 0915529003 08...,0896169002 0915526001 0918836001 0915529003 08...
3,0002d2ef78ec29c03a951963a2694be963c06d7123b972...,0599580038 0599580052 0590928001 0688537004 05...,0599580038 0599580052 0590928001 0688537004 05...,0599580038 0599580052 0590928001 0688537004 05...
4,000493dd9fc463df1acc2081450c9e75ef8e87d5dd17ed...,0228257001 0915529003 0685813001 0898713001 08...,0228257001 0915529003 0685813001 0898713001 08...,0228257001 0915529003 0685813001 0898713001 08...


In [43]:
del sub0['prediction0']
del sub0['prediction1']
sub0.head()

Unnamed: 0,customer_id,prediction
0,00009d946eec3ea54add5ba56d5210ea898def4b46c685...,0797892001 0685816002 0573937001 0889652001 06...
1,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,0896169002 0912204001 0852584001 0905518001 07...
2,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0896169002 0915526001 0918836001 0915529003 08...
3,0002d2ef78ec29c03a951963a2694be963c06d7123b972...,0599580038 0599580052 0590928001 0688537004 05...
4,000493dd9fc463df1acc2081450c9e75ef8e87d5dd17ed...,0228257001 0915529003 0685813001 0898713001 08...


In [44]:
result_active = sub0

## Combing the results of both sampled data

In [None]:
submit_df = pd.merge(result_active, result_Cold, on='customer_id', how='outer')
submit_df.head()

Unnamed: 0,customer_id,prediction_x,prediction_y
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0924243001 0656719005 09...,
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0924243001 0800436010 0924243002 07...,
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0924243001 0852643003 09...,
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0924243001 0751471001 0573085028 09...,
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0924243001 0791587015 0924243002 08...,


In [None]:
submit_df = submit_df.fillna(-1)
submit_df['prediction'] = submit_df.apply(
    lambda x: x['prediction_y'] if x['prediction_y'] != -1 else x['prediction_x'], axis=1)
submit_df.head()

Unnamed: 0,customer_id,prediction_x,prediction_y,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0924243001 0656719005 09...,-1,0568601043 0568601006 0924243001 0656719005 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0924243001 0800436010 0924243002 07...,-1,0826211002 0924243001 0800436010 0924243002 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0924243001 0852643003 09...,-1,0794321007 0852643001 0924243001 0852643003 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0924243001 0751471001 0573085028 09...,-1,0448509014 0924243001 0751471001 0573085028 09...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0924243001 0791587015 0924243002 08...,-1,0730683050 0924243001 0791587015 0924243002 08...


In [None]:
submit_df = submit_df.drop(columns=['prediction_y', 'prediction_x'])
submit_df.head()

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0924243001 0656719005 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0924243001 0800436010 0924243002 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0924243001 0852643003 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0924243001 0751471001 0573085028 09...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0924243001 0791587015 0924243002 08...


In [None]:
submit_df.to_csv('/content/kaggle/submit_df.csv')

In [None]:
len(submit_df)

1371980

In [None]:
!apt-get install texlive-xetex texlive-fonts-recommended texlive-generic-recommended

In [47]:
!jupyter nbconvert --to PDF "/content/H&M.ipynb"

[NbConvertApp] Converting notebook /content/H&M.ipynb to PDF
[NbConvertApp] Writing 243268 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', './notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', './notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 195548 bytes to /content/H&M.pdf
