<a href="https://colab.research.google.com/github/Abelbrown/h-m-kaggle-project/blob/main/ALS_H%26M_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:


!pip install --upgrade implicit
     


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Summary of Alternating Least Squares (ALS) ##

The Alternating Least Squares (ALS) algorithm approximates the user-article matrix by the product of two lower-dimensional matrices: one describes the users, the other the articles offered for sale (the principle of matrix factorization). The two factors are chosen so as to minimize a cost function made up of the squared reconstruction error plus a regularization term. Both matrices share the same set of latent factors, which form the columns of one matrix and the rows of the other. These latent factors are a projection of users and articles into a common lower-dimensional space. ALS builds on linear regression: it iterates over two steps, and at each step it holds one matrix fixed and solves a regularized least-squares problem for the other. Because each step is an exact least-squares minimization (as in OLS, from which ALS is derived), the cost function decreases or at worst stays the same from one step to the next.
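The alternating scheme described above can be illustrated with a minimal NumPy sketch. It is a simplification: it uses a small dense matrix and a plain squared-error loss with L2 regularization, whereas the `implicit` library used in this notebook works on sparse data and weights observations by confidence. The names `als_step`, `loss`, `R`, `U`, `V` and `lam` are illustrative and do not come from the library.

```python
import numpy as np

def als_step(R, fixed, lam):
    """Solve for one factor matrix X while the other (`fixed`) is held constant.

    Minimizes ||R - X @ fixed.T||^2 + lam * ||X||^2, which has the closed-form
    ridge-regression solution X = R @ fixed @ inv(fixed.T @ fixed + lam * I).
    """
    k = fixed.shape[1]
    gram = fixed.T @ fixed + lam * np.eye(k)
    # gram is symmetric, so solving gram @ X.T = fixed.T @ R.T gives X
    return np.linalg.solve(gram, fixed.T @ R.T).T

def loss(R, U, V, lam):
    """Squared reconstruction error plus L2 regularization on both factors."""
    return np.sum((R - U @ V.T) ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

rng = np.random.default_rng(0)
R = (rng.random((6, 5)) > 0.6).astype(float)  # toy user-article matrix (6 users, 5 articles)
k, lam = 2, 0.1                               # number of latent factors, regularization
U = rng.normal(size=(6, k))                   # user factors
V = rng.normal(size=(5, k))                   # article factors

losses = [loss(R, U, V, lam)]
for _ in range(10):
    U = als_step(R, V, lam)    # fix V, solve for U
    V = als_step(R.T, U, lam)  # fix U, solve for V
    losses.append(loss(R, U, V, lam))
```

Printing `losses` shows a sequence that never increases, which is the guarantee mentioned in the summary.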

The model therefore learns to factorize our initial matrix into two representations (a user matrix and an article matrix), which it then uses to predict the articles a customer is most likely to purchase next. In this case, we have to predict 12 H&M articles per customer: these are the 12 articles whose latent vectors score highest (largest dot product) against the customer's latent vector.
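To make the notion of "closeness" concrete: in a factorization model, the score of an article for a customer is the dot product of their latent vectors, and the recommendations are the highest-scoring articles. The sketch below uses made-up factor values; `top_n`, `item_factors` and `user_vec` are illustrative names, not the library's internals.

```python
import numpy as np

def top_n(user_vec, item_factors, n):
    """Return the indices of the n items with the highest dot-product score."""
    scores = item_factors @ user_vec
    # Sort indices by descending score and keep the first n
    return np.argsort(scores)[::-1][:n]

# Four hypothetical articles described by two latent factors
item_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.7, 0.7],
                         [-1.0, 0.0]])
# One hypothetical customer
user_vec = np.array([0.9, 0.1])

best = top_n(user_vec, item_factors, n=2)  # scores are [0.9, 0.1, 0.7, -0.9]
```

Here `best` contains articles 0 and 2, the two whose vectors point most in the customer's direction.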





In [None]:
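# Module imports
import os
# implicit recommends limiting OpenBLAS to a single thread to avoid a slowdown from nested multithreading
os.environ['OPENBLAS_NUM_THREADS'] = '1'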
import numpy as np
import pandas as pd
import implicit
from scipy.sparse import coo_matrix
from implicit.evaluation import mean_average_precision_at_k
     




## Load files

In [None]:
# Mount Google Drive first so that the CSV files below can be read
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

In [None]:
# Load the competition files; article_id is read as a string to keep its leading zero
df_cust = pd.read_csv('/content/drive/MyDrive/customers.csv')
df_cust_sample = df_cust.sample(frac=0.05)
df_art = pd.read_csv('/content/drive/MyDrive/articles.csv', dtype={'article_id': str})
df_art_sample = df_art.sample(frac=0.05)
df_trans = pd.read_csv('/content/drive/MyDrive/transactions_train.csv', dtype={'article_id': str}, parse_dates=['t_dat'])
df_trans_sample = df_trans.sample(frac=0.05)
df_sub = pd.read_csv('/content/drive/MyDrive/sample_submission.csv')


## Preparation of the data for the creation of two matrices (U and V)




In [None]:
# Option 1: keep only the transactions dated after August 21, 2020, to reduce processing time
# Option 2 (not used here): work on the 5% random sample df_trans_sample created in the cell above
df_trans_bis = df_trans[df_trans['t_dat'] > '2020-08-21']
df_trans_bis.head()


Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
30597413,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0913688003,0.033881,2
30597414,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0913688003,0.033881,2
30597415,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0923460001,0.042356,2
30597416,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0934380001,0.050831,2
30597417,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0913688001,0.033881,2


In [None]:
#Creating a list of user ids and item ids from customer and item files
list_cust_id = df_cust['customer_id'].unique().tolist()
list_art_id = df_art['article_id'].unique().tolist()

In [None]:
# Create two dictionaries mapping an integer index to each customer id and to each article id
cust_ids = dict(enumerate(list_cust_id))  # enumerate yields (index, value) pairs
art_ids = dict(enumerate(list_art_id))
# Display the first entries
print(list(cust_ids.items())[:6])
print(list(art_ids.items())[:6])

[(0, '00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657'), (1, '0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa'), (2, '000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318'), (3, '00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2c5feb1ca5dff07c43e'), (4, '00006413d8573cd20ed7128e53b7b13819fe5cfc2d801fe7fc0f26dd8d65a85a'), (5, '000064249685c11552da43ef22a5030f35a147f723d5b02ddd9fd22452b1f5a6')]
[(0, '0108775015'), (1, '0108775044'), (2, '0108775051'), (3, '0110065001'), (4, '0110065002'), (5, '0110065011')]


In [None]:
# Add the customer indices and article indices to the df_trans_bis table
## Invert the keys and values of the dictionaries created in the previous cell (id -> index)
cust_map = {u: uidx for uidx, u in cust_ids.items()}
art_map = {a: aidx for aidx, a in art_ids.items()}
## Replace each customer_id and article_id by its index, stored in the new columns 'cust_id' and 'art_id'.
## assign() returns a new DataFrame, which avoids the pandas SettingWithCopyWarning raised when writing to a filtered slice.
df_trans_bis = df_trans_bis.assign(
    cust_id=df_trans_bis['customer_id'].map(cust_map),
    art_id=df_trans_bis['article_id'].map(art_map),
)
df_trans_bis


Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,cust_id,art_id
30597413,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0913688003,0.033881,2,38,103595
30597414,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0913688003,0.033881,2,38,103595
30597415,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0923460001,0.042356,2,38,104483
30597416,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0934380001,0.050831,2,38,105214
30597417,2020-08-22,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0913688001,0.033881,2,38,103593
...,...,...,...,...,...,...,...
31788319,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,0929511001,0.059305,2,1371691,104961
31788320,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,0891322004,0.042356,2,1371691,100629
31788321,2020-09-22,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,0918325001,0.043203,1,1371721,104053
31788322,2020-09-22,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,0833459002,0.006763,1,1371747,88521


## Creating a COO main matrix ##



In [None]:
# Toy example of a COO (coordinate-format sparse) matrix, defined by three parallel arrays: row index, column index and value
## Example taken from: https://scipy-lectures.org/advanced/scipy_sparse/coo_matrix.html
row = np.array([0, 3, 1])
col = np.array([0, 3, 1])
data = np.array([2, 1, 1])
coo_matrix((data, (row, col)), shape=(4, 4)).todense()

matrix([[2, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]])

In [None]:
# Creation of our sparse COO matrix for training, following the pattern shown in the cell above (each transaction contributes a value of 1)
row = df_trans_bis['cust_id'].values
col = df_trans_bis['art_id'].values
data = np.ones(df_trans_bis.shape[0])
coo_train = coo_matrix((data, (row, col)), shape=(len(list_cust_id), len(list_art_id)))
coo_train

<1371980x105542 sparse matrix of type '<class 'numpy.float64'>'
	with 1190911 stored elements in COOrdinate format>
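
A customer can buy the same article several times, so the same (row, column) pair can appear more than once in the input arrays. SciPy keeps such duplicates as separate stored entries in COO format and sums them when the matrix is converted to CSR, so the resulting values act as purchase counts. A toy check, with made-up indices:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Customer 0 buys article 2 twice; customer 1 buys article 0 once
row = np.array([0, 0, 1])
col = np.array([2, 2, 0])
data = np.ones(3)

coo = coo_matrix((data, (row, col)), shape=(2, 3))
csr = coo.tocsr()

stored_in_coo = coo.nnz  # duplicates are still stored separately
stored_in_csr = csr.nnz  # duplicates have been merged
count = csr[0, 2]        # summed value for the repeated pair
```

With these inputs the COO matrix stores 3 entries, the CSR matrix stores 2, and `count` equals 2.0.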

In [None]:
# Function to create a matrix COO Customer - Articles
def coo_cust_art(df):
  """
  Transform a df into a COO sparse matrix with clients in row and items in column

  Parameter
  -------
  df: set of data as a dataframe

  Return
  -----
  A COO sparse matrix 
  """
  row = df['cust_id'].values
  col = df['art_id'].values
  data = np.ones(df.shape[0]) 
  coo = coo_matrix((data, (row, col)), shape=(len(list_cust_id), len(list_art_id)))
  return coo

## Dataset Split

In [None]:
# Split the dataset into a training set and a validation set (the last 7 days, matching the one-week horizon imposed by H&M)
date_separ_train_val = df_trans_bis['t_dat'].max() - pd.Timedelta(days = 7)
## Rows strictly before the cut-off date go to training; rows on or after it go to validation.
## Both the cut-off date and the last date are included, so the validation set spans 8 calendar dates.
df_trans_bis_train = df_trans_bis[df_trans_bis['t_dat'] < date_separ_train_val]
df_trans_bis_val = df_trans_bis[df_trans_bis['t_dat'] >= date_separ_train_val]

     


In [None]:
# Check that the dataset is split as intended (the date ranges must not overlap)
print("The last date in the training set is:", df_trans_bis_train['t_dat'].max())
print("The first date in the validation set is:", df_trans_bis_val['t_dat'].min(), "and the last date is:", df_trans_bis_val['t_dat'].max())

The last date in the training set is: 2020-09-14 00:00:00
The first date in the validation set is: 2020-09-15 00:00:00 and the last date is: 2020-09-22 00:00:00


In [None]:
# Function automating the train/validation split carried out above
def split_dataset(df, valid_period=7):
  """
  Split the dataframe into training and validation datasets, with a validation period of `valid_period` days

  Parameters
  ---------
  df: set of data as a dataframe
  valid_period: length of the validation period in days (7 by default)

  Returns
  -------
  Dataframe for training & Dataframe for validation

  """
  ## Compute the cut-off date so that the validation period covers the last `valid_period` days
  ## (days= is required: pd.Timedelta(7) on its own would mean 7 nanoseconds)
  date_separ_train_val = df['t_dat'].max() - pd.Timedelta(days=valid_period)
  ## Rows strictly before the cut-off date go to training, the others to validation
  dataframe_train = df[df['t_dat'] < date_separ_train_val]
  dataframe_val = df[df['t_dat'] >= date_separ_train_val]
  return dataframe_train, dataframe_val
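
A point worth checking when building a cut-off date: `pd.Timedelta` interprets a bare integer as nanoseconds, so the unit has to be stated explicitly to obtain a period in days. The sketch below shows the difference and applies a 7-day split to a toy frame; the dates and the names `toy`, `cutoff`, `train` and `val` are made up for illustration.

```python
import pandas as pd

# Toy transaction dates (hypothetical)
dates = pd.to_datetime(["2020-09-10", "2020-09-15", "2020-09-16", "2020-09-22"])
toy = pd.DataFrame({"t_dat": dates})

as_nanoseconds = pd.Timedelta(7)  # bare integer: 7 nanoseconds
as_days = pd.Timedelta(days=7)    # explicit unit: 7 days

# Same splitting rule as in the notebook: strictly before the cut-off is training
cutoff = toy["t_dat"].max() - as_days
train = toy[toy["t_dat"] < cutoff]
val = toy[toy["t_dat"] >= cutoff]
```

With the explicit unit, `cutoff` is 2020-09-15, `train` holds one row and `val` holds three. With the bare integer, the cut-off would fall 7 nanoseconds before the last timestamp and the validation set would contain only the final day.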

## Transformation of our partitioned dataset (training and validation) into COO and CSR matrices

In [None]:
# Transformation of our set data train and set data valid into coo matrices using our function
coo_df_train = coo_cust_art(df_trans_bis_train)
coo_df_val = coo_cust_art(df_trans_bis_val)
# Conversion of our coo train and coo val matrices into csr matrices in order to perform a param grid search
csr_df_train = coo_df_train.tocsr()
csr_df_val = coo_df_val.tocsr()


# Function to return COO and CSR matrices for set training and validation
def matrices_coo_csr(df):
  """
  Transform dataframes into COO and CSR matrices for the period of training and validation

  Parameter:
  --------
  df: set of data as a dataframe

  Return:
  ------
  The COO and CSR matrices for the train and validation period
  """
  dataframe_train, dataframe_val = split_dataset(df, valid_period=7)
  coo_df_train = coo_cust_art(dataframe_train)
  csr_df_train = coo_df_train.tocsr()

  coo_df_val = coo_cust_art(dataframe_val)
  csr_df_val = coo_df_val.tocsr()

  return coo_df_train, csr_df_train, csr_df_val

## Searching for the best parameters

In [None]:
## Grid search over the ALS hyperparameters: each model is fitted on the training CSR matrix and scored by MAP@12 on the validation matrix
from sklearn.model_selection import ParameterGrid
grid = ParameterGrid({
    'factors': [50, 60, 100, 200, 500],
    'regularization': [0.01],
    'iterations': [5, 10, 15]
})

best_map_ = 0
for params in grid:
    model = implicit.als.AlternatingLeastSquares(**params)
    model.fit(csr_df_train, show_progress=False)
    map_ = implicit.evaluation.mean_average_precision_at_k(model, train_user_items=csr_df_train,
                                                           test_user_items=csr_df_val, K=12, num_threads=0)

    # Keep track of the best score so far and report the parameters that produced it
    if map_ > best_map_:
        best_map_ = map_
        best_params = params
        print(f"The parameters of the best map12 are: {best_params}")

### Please note that the output is incomplete: the cell crashed the session (not enough RAM) before the grid search finished.
### We therefore decided to use the best parameters shown during the last test.

The parameters of the best map12 are: {'factors': 50, 'iterations': 5, 'regularization': 0.01}

The parameters of the best map12 are: {'factors': 50, 'iterations': 10, 'regularization': 0.01}

The parameters of the best map12 are: {'factors': 100, 'iterations': 5, 'regularization': 0.01}

The parameters of the best map12 are: {'factors': 100, 'iterations': 10, 'regularization': 0.01}

## Training with the best parameters

In [None]:
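# Retrain on the full filtered dataset (coo_train, built from df_trans_bis) with the best parameters found above
model_als = implicit.als.AlternatingLeastSquares(factors=100, iterations=10, regularization=0.01)
model_als.fit(coo_train.tocsr())  # implicit expects a CSR user-item matrix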




## Preparing the Kaggle submission file

In [None]:
coo_train



<1371980x105542 sparse matrix of type '<class 'numpy.float64'>'
	with 1190911 stored elements in COOrdinate format>

In [None]:
# Convert our COO matrix into a CSR matrix, the format used by model.recommend
csr_train = coo_train.tocsr()

# Function for the submission
def submit(model, csr_train, submission_name="submissions.csv"):
    """Recommend 12 articles for every customer and write them to a Kaggle submission CSV."""
    preds = []
    batch_size = 2000  # customers are processed in batches of 2000
    to_generate = np.arange(len(list_cust_id))
    for startidx in range(0, len(to_generate), batch_size):
        batch = to_generate[startidx : startidx + batch_size]
        # ids holds, for each customer of the batch, the indices of the 12 recommended articles
        ids, scores = model.recommend(batch, csr_train[batch], N=12, filter_already_liked_items=False)
        for i, userid in enumerate(batch):
            customer_id = cust_ids[userid]
            user_items = ids[i]
            # Map the article indices back to the original article_id strings
            article_ids = [art_ids[item_id] for item_id in user_items]
            preds.append((customer_id, ' '.join(article_ids)))

    df_preds = pd.DataFrame(preds, columns=['customer_id', 'prediction'])
    df_preds.to_csv(submission_name, index=False)

    display(df_preds.head())
    print(df_preds.shape)

    return df_preds

df_preds = submit(model_als, csr_train)

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0762846031 0568601006 0568601044 0568597006 05...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0112679048 0111609001 0111593001 0111586001 01...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0805000001 0794321011 0804992014 0804992017 07...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0112679048 0111609001 0111593001 0111586001 01...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0112679048 0111609001 0111593001 0111586001 01...


(1371980, 2)


## Kaggle Score

The Kaggle score obtained is 0.0092. Note that the model was trained on only part of the data (the transactions dated after August 21, 2020).
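
For reference, the metric evaluated with K=12 in the grid search is the mean average precision at 12 (MAP@12). The pure-Python sketch below shows how such a score can be computed for a handful of customers. It is an illustration under stated assumptions, not the official scoring code: the function names are hypothetical, and skipping customers without purchases is an assumption about how the mean is taken.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: sum of precision@i at each rank i where a hit
    occurs, divided by min(number of purchased articles, k)."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for i, item in enumerate(predicted[:k], start=1):
        # Count each correct article only once, even if it is predicted twice
        if item in actual_set and item not in seen:
            hits += 1
            score += hits / i
        seen.add(item)
    return score / min(len(actual_set), k)

def map_at_k(actuals, predictions, k=12):
    """Mean of AP@k over customers with at least one purchase (assumption)."""
    scored = [average_precision_at_k(a, p, k)
              for a, p in zip(actuals, predictions) if a]
    return sum(scored) / len(scored) if scored else 0.0

# Two hypothetical customers with made-up article ids
actuals = [["a", "b"], ["c"]]
predictions = [["a", "x", "b"], ["y", "z"]]

ap_first = average_precision_at_k(actuals[0], predictions[0])  # (1/1 + 2/3) / 2
overall = map_at_k(actuals, predictions)                       # (5/6 + 0) / 2
```

The first customer gets hits at ranks 1 and 3, the second gets none, so a few early hits weigh far more than many late ones.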