## Contents
* [Load Sets](#load_sets)
* [Handling NaN Values](#NaN_Values)
* [Article Similarity with LSH](#ASwLSH)
* [Creating the Final Datasets](#CFD)
* [Saving the dataframes for the model](#SDF)


<a id="load_sets"></a>
## Load Sets

In [1]:
# mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!unzip -q "/content/drive/MyDrive/Data/H&M_project/h-and-m-personalized-fashion-recommendations.zip" # unzipping the data

In [3]:
import pandas as pd

# loading the datasets with pandas
customers = pd.read_csv('/content/customers.csv')
articles = pd.read_csv('/content/articles.csv')
sample_sub = pd.read_csv('/content/sample_submission.csv')
transactions_train = pd.read_csv('/content/transactions_train.csv')

In [4]:
len(transactions_train)

31788324

<a id="NaN_Values"></a>
## Handling NaN Values

In [5]:
customers.isna().sum()

customer_id                    0
FN                        895050
Active                    907576
club_member_status          6062
fashion_news_frequency     16009
age                        15861
postal_code                    0
dtype: int64

In [6]:
articles.isna().sum()

article_id                        0
product_code                      0
prod_name                         0
product_type_no                   0
product_type_name                 0
product_group_name                0
graphical_appearance_no           0
graphical_appearance_name         0
colour_group_code                 0
colour_group_name                 0
perceived_colour_value_id         0
perceived_colour_value_name       0
perceived_colour_master_id        0
perceived_colour_master_name      0
department_no                     0
department_name                   0
index_code                        0
index_name                        0
index_group_no                    0
index_group_name                  0
section_no                        0
section_name                      0
garment_group_no                  0
garment_group_name                0
detail_desc                     416
dtype: int64

In [7]:
transactions_train.isna().sum()

t_dat               0
customer_id         0
article_id          0
price               0
sales_channel_id    0
dtype: int64

In the transactions_train dataframe, the sales_channel_id column encodes where the purchase was made: 2 is online and 1 is in store.

The customers dataframe has the most NaN values. The articles dataframe has only 416 NaN values, all in the detail_desc column; these can be replaced with an empty string ''.

In [8]:
# replacing the NaN values in detail_desc with an empty string
articles['detail_desc'] = articles['detail_desc'].fillna('')

Now the articles dataframe has no NaN values.

In [9]:
articles.isna().sum().sum()

0

For the customers dataframe, FN indicates whether a customer receives the Fashion News newsletter, and Active indicates whether the customer is active for communication. The problem is that there are too many NaN values in these two columns. Specifically, FN: $ \frac{895050}{1371980} \times 100 \approx 65\% $ and Active: $ \frac{907576}{1371980} \times 100 \approx 66\% $. These columns will therefore be dropped.
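The same percentages can be read off programmatically. The following is a minimal sketch on a made-up toy dataframe; the name `toy_customers` and the 0.5 cut-off are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customers dataframe (values are made up for illustration).
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "FN": [1.0, np.nan, np.nan, np.nan],
    "age": [31.0, 45.0, np.nan, 27.0],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy_customers.isna().mean()

# Hypothetical cut-off: drop every column with more than half of its values missing.
max_missing_share = 0.5
cols_to_drop = missing_share[missing_share > max_missing_share].index.tolist()
toy_reduced = toy_customers.drop(columns=cols_to_drop)

print(missing_share)
print(cols_to_drop)
```

Computing the share per column makes the drop decision reproducible if the data changes.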


In [10]:
# dropping the FN and Active columns
customers = customers.drop(['FN', 'Active'], axis=1)

In [11]:
customers.isna().sum()

customer_id                   0
club_member_status         6062
fashion_news_frequency    16009
age                       15861
postal_code                   0
dtype: int64

First the categorical columns will be factorized, and then the MICE (Multiple Imputation by Chained Equations) method will be applied to impute the NaN values.
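As a small illustration of what factorization does with missing values, the sketch below uses a made-up series (`status` is a hypothetical name). `pd.factorize` assigns the sentinel code -1 to NaN, which has to be turned back into NaN before imputing.

```python
import numpy as np
import pandas as pd

# Toy column with a missing value (made-up data).
status = pd.Series(["ACTIVE", "PRE-CREATE", np.nan, "ACTIVE"])

# factorize returns integer codes plus the unique labels; NaN gets the sentinel -1.
codes, labels = pd.factorize(status)

# Restore NaN so an imputer sees the value as missing rather than as a category.
codes_with_nan = pd.Series(codes, dtype=float).replace(-1, np.nan)

print(list(codes))
print(list(labels))
print(codes_with_nan.tolist())
```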

In [12]:
import numpy as np

# taking a copy of the customers dataset, excluding the id
df_mice = customers.drop(['customer_id'], axis=1).copy()

df_mice['club_member_status'], cms_labels = pd.factorize(df_mice['club_member_status']) # factorizing all values of club_membership_status column
df_mice['club_member_status'] = df_mice['club_member_status'].replace(-1, np.nan) # factorization gave the nan values the value -1, so here it is replaced back to nan

df_mice['fashion_news_frequency'], fnf_labels = pd.factorize(df_mice['fashion_news_frequency']) # factorizing all values of fashion_news_frequency column
df_mice['fashion_news_frequency'] = df_mice['fashion_news_frequency'].replace(-1, np.nan) # factorization gave the nan values the value -1, so here it is replaced back to nan

df_mice['postal_code'], pc_labels = pd.factorize(df_mice['postal_code']) # factorizing all values of postal_code column

df_mice.head()

| | club_member_status | fashion_news_frequency | age | postal_code |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 49.0 | 0 |
| 1 | 0.0 | 0.0 | 25.0 | 1 |
| 2 | 0.0 | 0.0 | 24.0 | 2 |
| 3 | 0.0 | 0.0 | 54.0 | 3 |
| 4 | 0.0 | 1.0 | 52.0 | 4 |


Now the missing data can be imputed using MICE.

In [13]:

# Imputing with MICE
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import linear_model

# Define MICE Imputer and fill missing values
mice_imputer = IterativeImputer(estimator=linear_model.BayesianRidge(), n_nearest_features=None, imputation_order='ascending')

df_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(df_mice), columns=df_mice.columns)

df_mice_imputed.isna().sum()

club_member_status        0
fashion_news_frequency    0
age                       0
postal_code               0
dtype: int64

Now df_mice_imputed will be reverted to its original form by mapping the integer codes back to their labels.
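One detail to watch when mapping codes back to labels: a regression-based imputer returns floats, which may be fractional or fall outside the range of valid codes. The sketch below, on made-up values, rounds to the nearest code and clips into the valid range before indexing; the label set and `imputed_codes` are assumptions for illustration.

```python
import pandas as pd

# Labels in the form returned by pd.factorize (made-up example set).
labels = pd.Index(["ACTIVE", "PRE-CREATE", "LEFT CLUB"])

# Hypothetical imputed codes: fractional, and possibly outside [0, len(labels) - 1].
imputed_codes = pd.Series([0.0, 1.6, -0.3, 2.4])

# Round to the nearest code, then clip into the valid index range.
valid_codes = imputed_codes.round().clip(0, len(labels) - 1).astype(int)

# Positional indexing into the label index restores the original categories.
restored = labels[valid_codes.to_numpy()]
print(list(restored))
```

A plain integer cast would truncate toward zero instead of rounding, and a negative code would silently index from the end of the label list.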


In [14]:
# imputed codes are floats, so they are rounded to the nearest code and clipped to the valid label range before mapping back
for col, labels in [('club_member_status', cms_labels), ('fashion_news_frequency', fnf_labels), ('postal_code', pc_labels)]:
  codes = df_mice_imputed[col].round().clip(0, len(labels) - 1).to_numpy(dtype=int)
  df_mice_imputed[col] = pd.Index(labels)[codes]

df_mice_imputed.isna().sum()

club_member_status        0
fashion_news_frequency    0
age                       0
postal_code               0
dtype: int64

In [15]:
df_mice_imputed.insert(0, 'customer_id', customers['customer_id'].to_numpy())
df_mice_imputed.head()

| | customer_id | club_member_status | fashion_news_frequency | age | postal_code |
|---|---|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | ACTIVE | NONE | 49.0 | 52043ee2162cf5aa7ee79974281641c6f11a68d276429a... |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | ACTIVE | NONE | 25.0 | 2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93... |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | ACTIVE | NONE | 24.0 | 64f17e6a330a85798e4998f62d0930d14db8db1c054af6... |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | ACTIVE | NONE | 54.0 | 5d36574f52495e81f019b680c843c443bd343d5ca5b1c2... |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | ACTIVE | Regularly | 52.0 | 25fa5ddee9aac01b35208d01736e57942317d756b32ddd... |


In [16]:
customers = df_mice_imputed.copy()
customers.head()

| | customer_id | club_member_status | fashion_news_frequency | age | postal_code |
|---|---|---|---|---|---|
| 0 | 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3... | ACTIVE | NONE | 49.0 | 52043ee2162cf5aa7ee79974281641c6f11a68d276429a... |
| 1 | 0000423b00ade91418cceaf3b26c6af3dd342b51fd051e... | ACTIVE | NONE | 25.0 | 2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93... |
| 2 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | ACTIVE | NONE | 24.0 | 64f17e6a330a85798e4998f62d0930d14db8db1c054af6... |
| 3 | 00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2... | ACTIVE | NONE | 54.0 | 5d36574f52495e81f019b680c843c443bd343d5ca5b1c2... |
| 4 | 00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f... | ACTIVE | Regularly | 52.0 | 25fa5ddee9aac01b35208d01736e57942317d756b32ddd... |


In [17]:
customers.isna().sum()

customer_id               0
club_member_status        0
fashion_news_frequency    0
age                       0
postal_code               0
dtype: int64

<a id="ASwLSH"></a>
## Article Similarity and Clustering using LSH

### Downloading datasketch

In [18]:
# installing the datasketch library for applying lsh
!pip install datasketch

Collecting datasketch
  Downloading datasketch-1.5.9-py3-none-any.whl (76 kB)
Installing collected packages: datasketch
Successfully installed datasketch-1.5.9


### Importing necessary libraries

In [19]:
# importing libraries to process text
import nltk
import re
from tqdm import tqdm

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

### Preprocessing the text

In [20]:
# Creating some functions to help with processing our data
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def remove_stopwords(text):
  stop_words = set(stopwords.words('english')) # remove duplicates of stop words
  text_tokens = nltk.word_tokenize(text) # tokenize the text
  filtered_text = [word for word in text_tokens if word not in stop_words] # remove stop words
  text = " ".join(filtered_text) # rejoin text
  return text

def lemmatize(text):

  tokens = nltk.word_tokenize(text) # tokenize the text
  lemmatizer = WordNetLemmatizer() # create the lemmatizer object
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens] # lemmatize

  # Join the lemmatized tokens back into a string
  lem_text = " ".join(lemmatized_tokens)

  return lem_text

In [21]:
def process(desc_list):

  # processing the descriptions
  processed_descriptions = []

  # cleaning every description in the list
  for desc in desc_list:
    desc = desc.lower() # lowercasing the description
    desc = re.sub(r'[^\w\s]', '', desc) # removing special characters, if there are any
    desc = re.sub(r'\d+', '', desc) # removing the numbers
    desc = remove_stopwords(desc) # removing stop words
    desc = lemmatize(desc) # lemmatizing

    processed_descriptions.append(desc)

  return processed_descriptions

item_desc = articles['detail_desc'] # store all descriptions of items
articles['processed_desc'] = process(item_desc) # add processed descriptions to dataframe
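To see what the regex steps of the pipeline do in isolation, here is a minimal sketch that leaves out the NLTK stop-word removal and lemmatization. The helper name `basic_clean` and the sample sentence are hypothetical, not part of the notebook.

```python
import re

def basic_clean(text: str) -> str:
    """Lowercase, strip punctuation and digits, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation and other special characters
    text = re.sub(r"\d+", "", text)      # drop digits
    return " ".join(text.split())        # collapse the gaps left by removed tokens

print(basic_clean("Jersey top, 95% cotton & 5% elastane."))
```

Collapsing whitespace at the end is an addition of this sketch; in the notebook the later tokenize-and-rejoin steps have a similar effect.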

### Applying LSH with words as shingles

In [22]:
from datasketch import MinHashLSHForest, MinHash
import time


# the set of descriptions for lsh
descriptions = articles['processed_desc'].unique()

# Start the timer
start_time = time.time()

# defining the number of permutations for the MinHash algorithm
num_perm = 128

# creating MinHash objects for each document
minhashes = []
for desc in descriptions:

  # creating a MinHash object with the specified number of permutations
  m = MinHash(num_perm=num_perm)

  # iterating over each word in the document
  for word in desc.split():

    # updating the MinHash object with the encoded word
    m.update(word.encode('utf-8'))

  # adding the MinHash object to the list
  minhashes.append(m)

# creating an LSH forest and adding the MinHashes to it
forest = MinHashLSHForest(num_perm=num_perm)

for i, m in enumerate(minhashes):
  # adding each MinHash object to the LSH forest, associating it with an index
  forest.add(i, m)

# indexing the forest to build the hash tables
forest.index()


# Calculate the time taken to build the LSH forest
lsh_time = time.time() - start_time

print(f'LSH time : {lsh_time}')

LSH time : 67.44848895072937
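MinHash signatures approximate the Jaccard similarity between the word sets of two descriptions. For intuition, the exact value can be computed directly for a single pair. The sketch below is plain Python; the helper name `jaccard` and the two sample descriptions are made up for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty descriptions
    return len(set_a & set_b) / len(union)

desc_a = "soft jersey top with short sleeve"
desc_b = "soft jersey top with long sleeve"

# 5 shared words out of 7 distinct words overall.
print(jaccard(desc_a, desc_b))
```

Computing this exactly for every pair of the roughly 42,000 unique descriptions would be quadratic, which is the cost the LSH forest avoids.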


### Nearest neighbor clustering with LSH results

In [23]:
n_neighbors = 1000

# Start the timer
start_time = time.time()


# going through every description
# getting the minhash of the description
# finding the similar n nearest descriptions and appending them to the list
neighbors = [forest.query(minhashes[idx], n_neighbors)
             for idx in tqdm(range(len(descriptions)), desc='Processing Neighbors')]


# going through every description
# iterating through the neighbors and adding them to the cluster
clusters = []
visited = set()
for idx in tqdm(range(len(descriptions)), desc='Creating Clusters'):
    if idx not in visited:
        cluster = [idx]
        visited.add(idx)

        for neighbor in neighbors[idx]:
            if neighbor not in visited:
                cluster.append(neighbor)
                visited.add(neighbor)

        clusters.append(cluster)

# Calculate the time taken to query neighbors and build the clusters
cl_time = time.time() - start_time
print()
print(f'Cluster Creation Time : {cl_time}')

Processing Neighbors: 100%|██████████| 42657/42657 [02:37<00:00, 270.22it/s]
Creating Clusters: 100%|██████████| 42657/42657 [00:00<00:00, 495193.62it/s]

Cluster Creation Time : 157.9632694721222





In [24]:
print(f'{len(clusters)} Clusters have been created')

304 Clusters have been created
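The greedy assignment used above can be traced on a tiny example. The sketch below repeats the same logic on a hand-made neighbor list; the function name `greedy_clusters` and the toy data are assumptions for illustration.

```python
def greedy_clusters(neighbors):
    """Assign each item to the first cluster that reaches it, in index order."""
    clusters, visited = [], set()
    for idx, neigh in enumerate(neighbors):
        if idx in visited:
            continue  # already claimed by an earlier cluster
        cluster = [idx]
        visited.add(idx)
        for n in neigh:
            if n not in visited:
                cluster.append(n)
                visited.add(n)
        clusters.append(cluster)
    return clusters

# Item 1 is a neighbor of both 0 and 2; item 3 has no neighbors besides itself.
toy_neighbors = [[0, 1], [1, 0, 2], [2, 1], [3]]
print(greedy_clusters(toy_neighbors))
```

Item 2 ends up alone even though it is close to item 1, because item 1 was already claimed by the cluster seeded at item 0. The result therefore depends on the order in which the descriptions are visited.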


<a id="CFD"></a>
## Creating the Final Datasets

### Adding the 'class_label' to the articles dataframe

In [25]:
article_id_clusters = []

# iterating over all the clusters
for cluster in clusters:

  # extracting the descriptions of the cluster
  cluster_descs = descriptions[cluster]

  # taking the dataframe that contains these descriptions
  cluster_articles = articles[articles['processed_desc'].isin(cluster_descs)]

  # taking the id of those articles and adding it to article_id_clusters
  article_id_clusters.append(cluster_articles['article_id'])


In [26]:
# creating a column to add to articles as cluster classes
cluster_classes = pd.DataFrame([i for i in range(len(articles))])

class_label = 0
for cluster in article_id_clusters:

  # finding all the articles that belong to the cluster
  cl = articles[articles['article_id'].isin(cluster)]

  # assigning them their class label
  cluster_classes.iloc[cl.index, 0] = class_label

  class_label += 1


In [27]:
# adding the class label column to the articles dataframe
articles['class_label'] = cluster_classes.to_numpy()

In [28]:
# saving articles to use in the model notebook
articles.to_pickle('articles.pkl', compression='gzip')

### Adding class label to transactions_train dataframe

In [29]:
# adding the cluster label of the article to the transactions_train dataset
# creating a dictionary mapping article IDs to class labels
article_class_map = dict(zip(articles['article_id'], articles['class_label']))

# mapping the article IDs to class labels in transactions_train
transactions_train['item_class'] = transactions_train['article_id'].map(article_class_map)
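A quick way to sanity-check a dictionary mapping like this is to count the rows that received no label. The sketch below uses made-up toy frames (`toy_articles`, `toy_transactions`); article 999 is deliberately absent from the article table.

```python
import pandas as pd

toy_articles = pd.DataFrame({"article_id": [101, 102, 103], "class_label": [0, 0, 1]})
toy_transactions = pd.DataFrame({"article_id": [103, 101, 999]})

# Series.map looks every id up in the dictionary; ids without an entry become NaN.
class_map = dict(zip(toy_articles["article_id"], toy_articles["class_label"]))
toy_transactions["item_class"] = toy_transactions["article_id"].map(class_map)

n_unmapped = int(toy_transactions["item_class"].isna().sum())
print(toy_transactions)
print(n_unmapped)
```

Note that a single unmapped id turns the whole column into floats, since NaN cannot be stored in an integer column.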


### Splitting into train, validation and test sets

In [30]:
from datetime import datetime, timedelta
data = transactions_train.copy()

# Converting the string column to datetime
data['t_dat'] = pd.to_datetime(data['t_dat'])

# Setting the date column as the index
data.set_index('t_dat', inplace=True)

In [31]:
# choosing how many days the validation and test sets will span
val_size = 30
test_size = 7

# splitting the data

# Getting the first and last date (min/max do not rely on the rows being sorted by date)
first_date = data.index.min()
last_date = data.index.max()
# calculating the test start date
test_start_date = last_date - timedelta(days=test_size)

# calculating the val start date
val_start_date = test_start_date - timedelta(days=val_size)

# get the transactions data for train, validation and test sets
# half-open intervals keep a boundary day from landing in two sets
train_trans = data.loc[data.index < val_start_date]
val_trans = data.loc[(data.index >= val_start_date) & (data.index < test_start_date)]
test_trans = data.loc[data.index >= test_start_date]
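When splitting by date, it is worth checking that no day ends up in two sets. Label slices on a DatetimeIndex include both endpoints, whereas boolean masks can express half-open intervals. The sketch below shows the difference on ten made-up days; all names and window sizes are hypothetical.

```python
from datetime import timedelta

import pandas as pd

# Ten consecutive days, one row per day (made-up data).
toy = pd.DataFrame(
    {"price": range(10)},
    index=pd.date_range("2020-09-13", periods=10, freq="D"),
)

last_day = toy.index.max()
test_start = last_day - timedelta(days=2)
val_start = test_start - timedelta(days=3)

# Half-open intervals: every row lands in exactly one set.
train_part = toy[toy.index < val_start]
val_part = toy[(toy.index >= val_start) & (toy.index < test_start)]
test_part = toy[toy.index >= test_start]

# Label slicing includes both endpoints, so test_start would also appear here.
inclusive_val = toy.loc[val_start:test_start]

print(len(train_part), len(val_part), len(test_part), len(inclusive_val))
```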


### Combining customers with the transactions grouped by customer id for the train, val and test sets

In [32]:
# this function will help in creating the data for each transactions set
def combine(transactions_data):

  # grouping by customer
  grouped_cust = transactions_data.groupby('customer_id').agg(lambda x: x.tolist())

  # dropping duplicates of customer_id
  customers_unique = customers.drop_duplicates(subset=['customer_id'])

  # combining customers with the cluster classes label lists from what they bought
  customers_fin = pd.merge(customers_unique, grouped_cust, on='customer_id', how='outer')

  return customers_fin

train = combine(train_trans)
val = combine(val_trans)
test = combine(test_trans)

#### Dropping the NaN rows created by the merge

In [33]:
train.dropna(inplace=True)
val.dropna(inplace=True)
test.dropna(inplace=True)
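The group-then-merge pattern can be illustrated on toy data. In the sketch below (all frames and names are made up), an outer merge followed by `dropna` keeps the same customers as an inner merge, assuming the customer table itself contains no NaN values.

```python
import pandas as pd

toy_customers = pd.DataFrame({"customer_id": ["a", "b", "c"], "age": [30.0, 41.0, 25.0]})
toy_trans = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "item_class": [3, 7, 3],
})

# One row per customer, with the purchased classes collected into a list.
purchases = toy_trans.groupby("customer_id").agg(list).reset_index()

# Outer merge keeps customer "c" with NaN purchases; dropna then removes that row.
outer = pd.merge(toy_customers, purchases, on="customer_id", how="outer").dropna()

# An inner merge reaches the same customers in one step.
inner = pd.merge(toy_customers, purchases, on="customer_id", how="inner")

print(outer)
print(inner)
```

Keeping the outer merge is still useful when the customers without purchases need to be counted before they are dropped.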

### Combining customers with the whole of transactions_train

This version will be used after the model has been trained and tested, in order to exploit all the available data.

In [34]:
train_all = combine(transactions_train)

In [35]:
train_all.isna().sum()

customer_id                  0
club_member_status           0
fashion_news_frequency       0
age                          0
postal_code                  0
t_dat                     9699
article_id                9699
price                     9699
sales_channel_id          9699
item_class                9699
dtype: int64

From the output above, 9699 customers have no rows in transactions_train, i.e. they did not buy anything.
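The same count can be cross-checked without a merge by testing which customer ids never occur in the transactions. A minimal sketch on made-up frames (`toy_customers`, `toy_trans` are hypothetical names):

```python
import pandas as pd

toy_customers = pd.DataFrame({"customer_id": ["a", "b", "c"]})
toy_trans = pd.DataFrame({"customer_id": ["a", "a", "c"], "price": [0.1, 0.2, 0.3]})

# Customers whose id never appears in the transactions table.
buyers = set(toy_trans["customer_id"])
no_purchase = toy_customers[~toy_customers["customer_id"].isin(buyers)]

print(len(no_purchase))
```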

<a id="SDF"></a>
## Saving the dataframes for the model

In [36]:
train.to_pickle('train.pkl', compression='gzip')
val.to_pickle('val.pkl', compression='gzip')
test.to_pickle('test.pkl', compression='gzip')
train_all.to_pickle('train_all.pkl', compression='gzip')
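When these files are loaded in the model notebook, the compression has to be passed again, because pandas infers compression from the file extension and `.pkl` does not indicate gzip. The sketch below round-trips a made-up frame through a temporary directory; the file name `toy.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"customer_id": ["a", "b"], "item_class": [[3, 7], [3]]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path, compression="gzip")
    # Reading back with the matching compression argument.
    restored = pd.read_pickle(path, compression="gzip")

# List-valued cells survive the round trip, unlike with a CSV export.
round_trip_ok = restored["item_class"].tolist() == toy["item_class"].tolist()
print(round_trip_ok)
```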