<a href="https://colab.research.google.com/github/Eidellin/BERT4Rec-for-Spotify-Recommendation/blob/master/COMP9727_BERT4Rec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **COMP9727 Recommender Systems** - BERT4Rec (Others are removed.)
## Project Implementation - Song Recommender

@Authors:
- Miguel Ilagan
- Fanglue Liu
- Jaehwi Park
- Chance Xu

### Part 1. Problem
Music is a popular form of media that has found purpose in many areas such as entertainment, recreation and communication. Advances in technology have driven demand for digital music streaming platforms such as ‘Spotify’, ‘Apple Music’ and ‘Amazon Music’, which give consumers easy access to their favourite music. However, this has introduced a new problem: there are now so many songs to choose from that it is hard for users to find and discover new music that aligns with their interests.

This problem motivates the development of a song recommender system that offers personalised music recommendations based on each user's interests. The recommender system is intended for all users, from the casual listener to the music connoisseur. The implementation presents music recommendations through both a web and a mobile user interface.

### Part 2. Dataset

#### 2.1 Spotify tracks dataset
The recommender system utilises song data extracted from Spotify and published on Kaggle at https://www.kaggle.com/datasets/amitanshjoshi/spotify-1million-tracks/data. This dataset was chosen for its abundance of data and because it translates song information into audio features that can be used for content-based and collaborative recommendation. In the full project the data is stored in the `tracks_data` dataframe.  
Note that the BERT4Rec model in this notebook does not require this dataset, so it is not loaded here.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import pandas as pd
import numpy as np

# The tracks dataset is not needed for BERT4Rec, so it is not loaded.

#### 2.2 Spotify playlist dataset
The recommender system utilises song playlist data extracted from Spotify and published on Kaggle at https://www.kaggle.com/datasets/andrewmvd/spotify-playlists?select=spotify_dataset.csv. This dataset was chosen for its abundance of data and is used to build the collaborative aspect of the recommendation system. A visualisation of the data is shown below, and the data is stored in the `playlist_data` dataframe.

In [None]:
playlist_data = pd.read_csv('/content/gdrive/MyDrive/Colab_Notebooks/Spotify_playlists.csv', on_bad_lines='skip')
playlist_data.columns = ['user_id', 'artist_name', 'track_name', 'playlist_name']
playlist_data

Unnamed: 0,user_id,artist_name,track_name,playlist_name
0,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,HARD ROCK 2010
1,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello & The Attractions,"(What's So Funny 'Bout) Peace, Love And Unders...",HARD ROCK 2010
2,9cc0cfd4d7d7885102480dd99e7a90d6,Tiffany Page,7 Years Too Late,HARD ROCK 2010
3,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello & The Attractions,Accidents Will Happen,HARD ROCK 2010
4,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,Alison,HARD ROCK 2010
...,...,...,...,...
12891675,2302bf9c64dc63d88a750215ed187f2c,Mötley Crüe,Wild Side,iPhone
12891676,2302bf9c64dc63d88a750215ed187f2c,John Lennon,Woman,iPhone
12891677,2302bf9c64dc63d88a750215ed187f2c,Tom Petty,You Don't Know How It Feels,iPhone
12891678,2302bf9c64dc63d88a750215ed187f2c,Tom Petty,You Wreck Me,iPhone


A basic exploratory data analysis (EDA) is performed on this dataset to gain a better understanding of the data.
- Each sample within the dataset is described by four features, namely `User id`, `Artist name`, `Track name` and `Playlist name`. These features are stored within a list called `playlist_data_features`.

In [None]:
playlist_data_features = list(playlist_data.columns)
playlist_data_features

- The dataset contains `15918` unique users, `289820` unique artists and `2032043` unique song track names. These values are contained within the `num_unique_users_alt`, `num_unique_artists_alt` and `num_unique_track_names_alt` variables respectively.

In [None]:
# The number of unique users, artists and track names within the playlist dataset
num_unique_users_alt = playlist_data['user_id'].nunique()
num_unique_artists_alt = playlist_data['artist_name'].nunique()
num_unique_track_names_alt = playlist_data['track_name'].nunique()
print(f'Number of unique users: {num_unique_users_alt}\nNumber of unique artists: {num_unique_artists_alt}\nNumber of unique track_names: {num_unique_track_names_alt}')

#### 2.3 Strengths and weaknesses of the datasets
- The datasets are large, so depending on the recommendation method employed, generating recommendations for users can carry a significant computational cost.
- The `tracks_data` and `playlist_data` datasets do not contain exactly the same songs. This causes problems when computing the TF-IDF vectors for songs in the `tracks_data` dataset, since `words` for certain songs in the `playlist_data` dataset will not be present in the resulting TF-IDF vector. This is remedied by considering only the songs common to both datasets, as described in **Part 3.1.3**. This also reduces the amount of data, making computation faster.
- Both datasets are extracted from `Spotify` so they have the same presentation of information, in particular `artist_name` and `track_name`. This makes it simple to merge the two datasets with minimal preprocessing required.
- Data in the `tracks_data` dataset describe songs in terms of numerical audio features as described above, making it suitable for content-based recommendation. Similarly, data in the `playlist_data` dataset have a large collection of user playlists making it suitable for collaborative-based recommendation. A combination of these datasets thus opens opportunity for a hybrid content/collaborative recommender.
- Songs in the datasets are already classified by their genre, so there is no need for the development of any machine learning classification models.

### Part 3. Methods

#### 3.1 Preprocessing
##### 3.1.1 Feature selection
- `track_id` feature from `tracks_data` is removed since it offers no valuable information in describing the particular song.
- `playlist_name` feature from `playlist_data` is removed to simplify the generation of user profiles in **Part 3.2**.

In [None]:
# `tracks_data` is not loaded in this BERT4Rec-only notebook, so only `playlist_data` is processed.
playlist_data = playlist_data.drop(columns=['playlist_name'])

##### 3.1.2 Handling missing data
- Samples with missing data are removed from both `tracks_data` and `playlist_data` datasets since given the nature of the song recommendation task, these values cannot necessarily be interpolated from other songs within the same dataset (as each song is unique). Consequently, imputation methods for missing numerical values are also not considered since this would lead to misleading representation of that particular song.
- Entries with missing data are removed using the `dropna` method available in the `pandas` library.
- The indexes of the resulting dataframes are reset using the `reset_index` method available in the `pandas` library. This keeps the numbering of the samples consecutive.

In [None]:
# `tracks_data` is not loaded in this notebook, so only `playlist_data` is cleaned.
playlist_data.dropna(inplace=True)

playlist_data.reset_index(drop=True, inplace=True)

###### 3.1.2.1 Copy `playlist_data` as `playlists_df` for BERT4Rec
The preprocessing up to this point is shared with BERT4Rec; the steps that follow are specific to it.

In [None]:
playlists_df = playlist_data.copy() # Work on a copy: each of the project's four models needs its own preprocessing.

#### 3.5 Preprocessing for BERT4Rec
##### 3.5.1 Create a `track_id` for each track
Rows with an identical track name and artist name are treated as the same track. BERT4Rec only needs a `track_id` per track, and these ids can simply be the integers from 0 to the number of unique tracks minus 1.

`tracks` is a dataframe of the unique (`artist_name`, `track_name`) pairs. Numbering its rows from 0 to the number of unique tracks minus 1 gives the `track_id` column.

In [None]:
# Remove repeated (user, artist, track) rows
playlists_df.drop_duplicates(inplace=True)

# One row per unique (artist_name, track_name) pair
tracks = playlists_df.drop(columns=['user_id']).drop_duplicates().reset_index(drop=True)
# Assign ids 0 .. len(tracks) - 1
tracks['track_id'] = range(len(tracks))
tracks.head()

Unnamed: 0,artist_name,track_name,track_id
0,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,0
1,Elvis Costello & The Attractions,"(What's So Funny 'Bout) Peace, Love And Unders...",1
2,Tiffany Page,7 Years Too Late,2
3,Elvis Costello & The Attractions,Accidents Will Happen,3
4,Elvis Costello,Alison,4


In [None]:
playlists_df = playlists_df.merge(tracks, on=['artist_name', 'track_name'], how='left')
playlists_df.head()

Unnamed: 0,user_id,artist_name,track_name,track_id
0,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,0
1,ec6a9abc7a818b0c00788add9ec69c58,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,0
2,7cae243a6e617bbac43848e587cf0177,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,0
3,6850dd8323fec9eecb29ce17bb967f2c,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,0
4,0098b965803a4c10723f8e216f9e0904,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,0


##### 3.5.2 Preprocessing suggested by `BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer` (Fei Sun et al.)


>4 EXPERIMENTS
>
> 4.1 Dataset
>
> For dataset preprocessing, we follow the common practice in [22, 40, 49]. For all datasets, we convert all numeric ratings or the presence of a review to implicit feedback of 1 (i.e., the user interacted with the item). After that, we group the interaction records by users and build the interaction sequence for each user by sorting these interaction records according to the timestamps. To ensure the quality of the dataset, following the common practice [12, 22, 40, 49], we keep users with at least five feedbacks. The statistics of the processed datasets are summarized in Table 1.

The playlist dataset has no rating or timestamp column, so every (user, track) row is already an implicit interaction, and we assume that the row order of the original dataset reflects time order. The first two suggested steps are therefore unnecessary; we only need to group the track ids into one sequence per user.

In [None]:
# group tracks by user
user_history = playlists_df.groupby('user_id')['track_id'].apply(list).reset_index()
user_history.rename(columns={'track_id': 'sequence'}, inplace=True)

Then keep only the users with at least five tracks in their history sequence.

In [None]:
user_history = user_history[user_history['sequence'].map(len) >= 5]
user_history.head()

Unnamed: 0,user_id,sequence
0,00055176fea33f6e027cd3302289378b,"[858, 1279, 1362, 1510, 1519, 1525, 1536, 1538..."
1,0007f3dd09c91198371454c608d47f22,"[205, 1084, 1090, 1212, 1319, 1493, 1496, 1499..."
2,000b0f32b5739f052b9d40fcc5c41079,"[2204, 4275, 5270, 6405, 6940, 8023, 8136, 948..."
3,000c11a16c89aa4b14b328080f5954ee,"[516, 696, 741, 769, 869, 911, 1164, 1279, 131..."
4,00123e0f544dee3ab006aa7f1e5725a7,"[446, 486, 673, 687, 807, 844, 909, 939, 940, ..."


The final result is a dataframe with one track-id sequence per user.
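
As a quick illustration of the two steps above, the following sketch applies the same grouping and length filter to a tiny, made-up interaction table. The user ids, track ids and the names `toy` and `toy_history` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy interactions (hypothetical data): user "a" has 5 tracks, user "b" only 2
toy = pd.DataFrame({
    "user_id": ["a", "b", "a", "a", "b", "a", "a"],
    "track_id": [3, 7, 1, 4, 2, 9, 5],
})

# Group track ids into one list per user; row order within each user is preserved
toy_history = toy.groupby("user_id")["track_id"].apply(list).reset_index()
toy_history = toy_history.rename(columns={"track_id": "sequence"})

# Keep only users with at least five interactions
toy_history = toy_history[toy_history["sequence"].map(len) >= 5]

print(toy_history)
```

Only user `a` survives the filter, and the tracks appear in the sequence in the same order as the rows of the input table.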

Save the results to CSV files.

In [None]:
import os
os.makedirs('./data', exist_ok=True)  # make sure the output directory exists
tracks.to_csv('./data/tracks.csv', index=False)
user_history.to_csv('./data/user_history.csv', index=False)

### Part 4. Experiments


#### 4.3 BERT4Rec Model
The BERT4Rec model is described in `BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer` (Fei Sun et al.). In brief:

1. Embedding Layer
- embedding(track id) + embedding(index)  

2. Transformer Layer
- multi-head self-attention
- position-wise feed-forward network  

3. Output Layer  
After $L$ layers that hierarchically exchange information across all positions in the previous layer, we get the final output $H^L$ for all items of the input sequence. Assuming that we mask the item $v_t$ at time step $t$, we then predict the masked item $v_t$ based on $h_t^L$, as shown in Figure 1b of the paper. Specifically, we apply a two-layer feed-forward network with GELU activation in between to produce an output distribution over target items:
$$
P(v) = \text{softmax}(\text{GELU}(h_t^L W^P + b^P) E^\top + b^O)
$$
where $W^P$ is the learnable projection matrix, $b^P$ and $b^O$ are bias terms. $E \in \mathbb{R}^{|V| \times d}$ is the embedding matrix for the item set $V$.
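
To make the formula concrete, here is a minimal NumPy sketch that evaluates it for one masked position. All sizes and the randomly drawn weights are toy assumptions; the helper names `gelu` and `softmax` are defined here for illustration only.

```python
import math
import numpy as np

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
num_items, d = 6, 4                      # |V| and hidden size (toy values)

E = rng.normal(size=(num_items, d))      # shared item embedding matrix E
W_P = rng.normal(size=(d, d))            # projection matrix W^P
b_P = np.zeros(d)                        # bias b^P
b_O = np.zeros(num_items)                # bias b^O
h_t = rng.normal(size=d)                 # final hidden state h_t^L at the masked position

# P(v) = softmax(GELU(h_t W^P + b^P) E^T + b^O)
p_v = softmax(gelu(h_t @ W_P + b_P) @ E.T + b_O)

print(p_v.shape, p_v.sum())
```

Note that the formula reuses the item embedding matrix $E$ in the output layer, whereas the `BERT4Rec` class in this notebook uses a separate `nn.Linear` output layer.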


In [None]:
! pip install labml_nn

Collecting labml_nn
  Downloading labml_nn-0.4.136-py3-none-any.whl.metadata (13 kB)
Collecting labml==0.4.168 (from labml_nn)
  Downloading labml-0.4.168-py3-none-any.whl.metadata (7.5 kB)
Collecting labml-helpers==0.4.89 (from labml_nn)
  Downloading labml_helpers-0.4.89-py3-none-any.whl.metadata (1.4 kB)
Collecting einops (from labml_nn)
  Downloading einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Collecting fairscale (from labml_nn)
  Downloading fairscale-0.4.13.tar.gz (266 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting gitpython (from labml==0.4.168->labml_nn)
  Downloading GitPython-3.1.43-py3-none-any.whl.metadata (13 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from labml_nn.transformers.feed_forward import FeedForward

class EmbeddingLayer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, max_len, emb_dropout):
        super(EmbeddingLayer, self).__init__()
        self.item_embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.position_embedding = nn.Embedding(max_len, embedding_dim)
        self.embedding_dropout = nn.Dropout(emb_dropout)

    def forward(self, input_ids):
        positions = torch.arange(0, input_ids.size(1)).unsqueeze(0).to(input_ids.device)
        item_embeddings = self.item_embedding(input_ids)
        position_embeddings = self.position_embedding(positions)
        embeddings = item_embeddings + position_embeddings
        return self.embedding_dropout(embeddings)

class TransformerLayer(nn.Module):
    def __init__(self, embedding_dim, num_heads, d_ff, attn_dropout=0.1, fnn_dropout=0.1):
        super(TransformerLayer, self).__init__()
        self.attention = nn.MultiheadAttention(embedding_dim, num_heads, dropout=attn_dropout, batch_first=True)
        self.feed_forward = FeedForward(embedding_dim, d_ff, fnn_dropout)

    def forward(self, x, mask=None):
        attn_output, _ = self.attention(x, x, x, attn_mask=mask, need_weights=False)
        x = x + attn_output
        x = F.layer_norm(x, x.size()[1:])
        ff_output = self.feed_forward(x)
        x = x + ff_output
        x = F.layer_norm(x, x.size()[1:])
        return x

class BERT4Rec(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, max_len, num_layers, num_heads, d_ff, emb_dropout=0.1, attn_dropout=0.1, fnn_dropout=0.1):
        super(BERT4Rec, self).__init__()
        self.embedding = EmbeddingLayer(num_embeddings, embedding_dim, max_len, emb_dropout)
        self.transformer_layers = nn.ModuleList(
            [TransformerLayer(embedding_dim, num_heads, d_ff, attn_dropout, fnn_dropout) for _ in range(num_layers)]
        )
        self.linear = nn.Linear(embedding_dim, embedding_dim)
        self.output_layer = nn.Linear(embedding_dim, num_embeddings)

    def forward(self, input_ids, mask=None):
        x = self.embedding(input_ids)
        for transformer_layer in self.transformer_layers:
            x = transformer_layer(x, mask)
        x = F.gelu(self.linear(x))
        logits = self.output_layer(x)
        return logits
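
One possible variation, shown only as a sketch: `F.layer_norm(x, x.size()[1:])` above normalises jointly over the sequence and feature dimensions without learnable parameters, whereas standard transformer layers normalise each position over its features. The block below uses `nn.LayerNorm(embedding_dim)` and passes a `key_padding_mask` so that padded positions (id `0`) are not attended to. The class name `TransformerLayerPerToken` and the plain `nn.Sequential` feed-forward are assumptions for illustration, not the notebook's code.

```python
import torch
import torch.nn as nn

class TransformerLayerPerToken(nn.Module):
    """Hypothetical variant: per-position LayerNorm and padding-aware attention."""

    def __init__(self, embedding_dim, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(embedding_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embedding_dim, d_ff), nn.GELU(), nn.Dropout(dropout), nn.Linear(d_ff, embedding_dim)
        )
        self.norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x, key_padding_mask=None):
        # key_padding_mask: bool tensor (batch, seq_len), True where the input is padding
        attn_output, _ = self.attention(x, x, x, key_padding_mask=key_padding_mask, need_weights=False)
        x = self.norm1(x + attn_output)
        x = self.norm2(x + self.feed_forward(x))
        return x

input_ids = torch.tensor([[0, 0, 5, 3, 7], [2, 4, 6, 1, 7]])   # 0 = padding (left-padded)
x = torch.randn(2, 5, 16)                                      # stand-in for the embedding output
layer = TransformerLayerPerToken(embedding_dim=16, num_heads=2, d_ff=16)
out = layer(x, key_padding_mask=(input_ids == 0))
print(out.shape)
```

The output keeps the input shape `(batch, seq_len, embedding_dim)`, so a layer like this could in principle be swapped into the stack, but whether it changes the results reported here has not been tested.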

##### 4.3.1 Create Dataset and Dataloader

In [None]:
import torch
import random
from torch.utils.data import Dataset, DataLoader

class MaskedDataset(Dataset):
    def __init__(self, interaction_sequences, num_tracks, max_len, masked_lm_prob=0.15, masked_prob=1, masked_rand_prob=0.5):
        self.interaction_sequences = interaction_sequences
        self.num_tracks = num_tracks
        self.max_len = max_len

        self.masked_lm_prob = masked_lm_prob
        self.masked_prob = masked_prob
        self.masked_rand_prob = masked_rand_prob

        self.masked_token = num_tracks + 1  # mask token ID

    def __len__(self):
        return len(self.interaction_sequences)

    def __getitem__(self, idx):
        # Copy (and truncate) the stored sequence so that masking never mutates the dataset
        input_seq = list(self.interaction_sequences[idx][:self.max_len])

        # Mask input sequence
        input_seq, target = self.mask_input_sequence(input_seq)

        # Pad sequences
        input_seq = self.pad_sequence(input_seq)
        target = self.pad_sequence(target)

        return torch.tensor(input_seq, dtype=torch.long), torch.tensor(target, dtype=torch.long)

    def mask_input_sequence(self, input_seq):
        target = input_seq.copy()

        for i in range(len(input_seq) - 1):
            if random.random() < self.masked_lm_prob:
                roll = random.random()
                # 80%: replace with mask token
                if roll < 0.8:
                    input_seq[i] = self.masked_token
                # 10%: replace with a random track id (valid ids are 1..num_tracks)
                elif roll < 0.9:
                    input_seq[i] = random.randint(1, self.num_tracks)
                # remaining 10%: keep the original token

        input_seq[-1] = self.masked_token  # mask the last token

        return input_seq, target

    def pad_sequence(self, sequence):
        padding_length = self.max_len - len(sequence)
        padded_sequence = [0] * padding_length + sequence  # Use 0 for padding
        return padded_sequence
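
For comparison, the Cloze objective in the BERT4Rec paper computes the loss only at masked positions. A minimal sketch of that labelling scheme follows, assuming `0` is the padding id that the loss ignores. The function name `cloze_mask` and its arguments are hypothetical and not part of the class above.

```python
import random

def cloze_mask(sequence, num_tracks, mask_token, mask_prob=0.15, rng=random):
    """Return (masked_input, labels); labels are 0 wherever no prediction is required."""
    masked_input = list(sequence)          # copy, so the caller's list is never mutated
    labels = [0] * len(sequence)           # 0 = position ignored by the loss

    for i, item in enumerate(sequence):
        if rng.random() < mask_prob:
            labels[i] = item               # the model must recover the original item here
            roll = rng.random()
            if roll < 0.8:
                masked_input[i] = mask_token                    # 80%: [MASK]
            elif roll < 0.9:
                masked_input[i] = rng.randint(1, num_tracks)    # 10%: random track id
            # remaining 10%: keep the original item

    return masked_input, labels

original = [4, 8, 15, 16, 23, 42]
masked, labels = cloze_mask(original, num_tracks=50, mask_token=51, rng=random.Random(0))
print(masked)
print(labels)
```

With labels built this way, a loss with `ignore_index=0` would score only the selected positions, and no shifting between outputs and targets would be needed.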

Convert the `sequence` column of the `user_history` **dataframe** into a **list** of sequences.

In [None]:
sequences = user_history['sequence'].tolist()

###### 4.3.1.1 Hyperparameter Setup and Converting Sequences to Usable Sequences
The hyperparameters were chosen to fit within the available hardware resources.

In [None]:
max_len = 50 # Maximum sequence length
masked_lm_prob = 0.15 # Probability of masking a token
masked_prob = 1 # Stored by MaskedDataset but not used in its masking logic
masked_rand_prob = 0.5 # Intended probability of using a random token instead of [MASK]; stored by MaskedDataset but not used
batch_size = 32 # Batch size

embedding_dim = 16 # hidden size
num_layers = 2 # number of transformer layers
num_heads = 2 # number of attention heads
d_ff = 16 # feed-forward hidden size
emb_dropout = 0.1 # embedding dropout
attn_dropout = 0.1 # attention dropout
fnn_dropout = 0.1 # feed-forward dropout
learning_rate = 0.001 # learning rate
# num_epochs = 10 # number of training epochs

Truncate each sequence to at most `max_len` items.

In [None]:
for i, sequence in enumerate(sequences):
    if len(sequence) > max_len:
        sequences[i] = sequence[:max_len]

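Re-encode the track ids into a contiguous range so that every id fits inside the embedding table; out-of-range ids would otherwise cause CUDA and index errors.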

In [None]:
from sklearn.preprocessing import LabelEncoder

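# Collect every track id that survives truncation
all_track_ids = list(set(track_id for sequence in sequences for track_id in sequence))

# Map the surviving track ids to the contiguous range 0 .. vocab_size - 1
label_encoder = LabelEncoder()
label_encoder.fit(all_track_ids)

encoded_sequences = [label_encoder.transform(sequence).tolist() for sequence in sequences]

vocab_size = len(label_encoder.classes_)

# 0 is reserved for padding, so the track encoded as 0 is moved to id `vocab_size`
encoded_sequences = [[vocab_size if track == 0 else track for track in sequence] for sequence in encoded_sequences]

# ids: 0 = padding, 1 .. vocab_size = tracks, vocab_size + 1 = [MASK]
num_embeddings = vocab_size + 2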

Save the LabelEncoder for later use when decoding predictions.  
- `encoded sequences` -> `track id sequences`
- `track id sequences` -> `track name and artist name` by `tracks.csv`

In [None]:
import pickle

with open('./data/label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)
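
The two decoding steps listed above can be sketched as follows. Because the track encoded as `0` was moved to `vocab_size` to free `0` for padding, that swap has to be undone before the label encoder's mapping is inverted. The sketch uses a sorted list in place of the fitted `LabelEncoder` and a small dictionary in place of `tracks.csv`; all ids and the placeholder track names are made up for illustration.

```python
# Stand-in for label_encoder.classes_: sorted original track ids (hypothetical values)
classes = [12, 40, 77, 103]
vocab_size = len(classes)
mask_token = vocab_size + 1

# Stand-in for tracks.csv: track_id -> (artist_name, track_name), placeholder strings
tracks_lookup = {12: ("Artist A", "Song A"), 40: ("Artist B", "Song B"),
                 77: ("Artist C", "Song C"), 103: ("Artist D", "Song D")}

def decode(encoded_sequence):
    """Map model-side ids back to (artist_name, track_name), skipping padding and [MASK]."""
    decoded = []
    for token in encoded_sequence:
        if token == 0 or token == mask_token:
            continue                                   # padding or mask: no track behind it
        label = 0 if token == vocab_size else token    # undo the 0 -> vocab_size swap
        track_id = classes[label]                      # plays the role of inverse_transform
        decoded.append(tracks_lookup[track_id])
    return decoded

print(decode([0, 0, 4, 1, 3, 5]))
```

In the toy input, id `4` equals `vocab_size` and therefore decodes to the first class, while `0` and `5` are padding and the mask token.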

###### 4.3.1.2 Generate Dataset and Dataloader

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset = MaskedDataset(encoded_sequences, vocab_size, max_len, masked_lm_prob, masked_prob, masked_rand_prob)
dataloader = DataLoader(dataset, batch_size, shuffle=True)

Split the dataset into training and validation sets with an 8:2 ratio, and build a loader for each.

In [None]:
from torch.utils.data import random_split

# Assume 'dataset' is your dataset object
# Split the dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

##### 4.3.2 Test the Dataset, Dataloader and Model to Catch Errors

In [None]:
for input_ids, target in dataloader:
    input_ids = input_ids.to(device)
    target = target.to(device)
    print(f"Input shape: {input_ids.size()}")
    print(f"Target shape: {target.size()}")
    emb = EmbeddingLayer(num_embeddings, embedding_dim, max_len, emb_dropout).to(device)
    emb_x = emb(input_ids)
    print(f"Embedding shape: {emb_x.size()}")
    trn = TransformerLayer(embedding_dim, num_heads, d_ff, attn_dropout, fnn_dropout).to(device)
    trn_x = trn(emb_x)
    print(f"Transformer shape: {trn_x.size()}")
    brt = BERT4Rec(num_embeddings, embedding_dim, max_len, num_layers, num_heads, d_ff, emb_dropout=0.1, attn_dropout=0.1, fnn_dropout=0.1).to(device)
    brt_x = brt(input_ids)
    print(f"BERT4Rec shape: {brt_x.size()}")
    break

Input shape: torch.Size([32, 50])
Target shape: torch.Size([32, 50])
Embedding shape: torch.Size([32, 50, 16])
Transformer shape: torch.Size([32, 50, 16])
BERT4Rec shape: torch.Size([32, 50, 111037])


##### 4.3.3 Initialize the model, loss function, and optimizer
The optimizer is Adam and the criterion is cross-entropy loss.

In [None]:
import torch.optim as optim

model = BERT4Rec(num_embeddings, embedding_dim, max_len, num_layers, num_heads, d_ff, emb_dropout, attn_dropout, fnn_dropout).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Ignore the padding index

##### 4.3.4 Actual Training
The target is an average training loss below 0.5, so training is repeated until the loss reaches that value.

In [None]:
from tqdm import tqdm

epoch = 0
avg_loss = 100

# Initialize GradScaler
scaler = torch.GradScaler(device="cuda")

# Training loop
model.train()
while avg_loss > 0.5:
    total_loss = 0
    epoch += 1
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
        input_ids, targets = batch
        input_ids, targets = input_ids.to(device), targets.to(device)

        optimizer.zero_grad()

        with torch.autocast(device_type='cuda'):
            outputs = model(input_ids)

            # Shift the targets to align with the outputs
            outputs = outputs[:, :-1].contiguous().view(-1, outputs.size(-1))
            targets = targets[:, 1:].contiguous().view(-1)

            loss = criterion(outputs, targets)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item()

        # Free up GPU memory
        torch.cuda.empty_cache()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch}, Loss: {avg_loss:.4f}")

Training started from a loss of 12.5722.  
Training Epoch 1/10: 100%|██████████| 383/383 [01:03<00:00,  6.00it/s]  
Epoch 1/10, Loss: 12.5722  
Training Epoch 2/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 2/10, Loss: 10.9363  
Training Epoch 3/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 3/10, Loss: 9.5578  
Training Epoch 4/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 4/10, Loss: 8.5510  
Training Epoch 5/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 5/10, Loss: 7.6713  
Training Epoch 6/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 6/10, Loss: 6.8761       
Training Epoch 7/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 7/10, Loss: 6.1736  
Training Epoch 8/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 8/10, Loss: 5.5573  
Training Epoch 9/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 9/10, Loss: 5.0106  
Training Epoch 10/10: 100%|██████████| 383/383 [01:03<00:00,  5.99it/s]  
Epoch 10/10, Loss: 4.5395  
...  
Training Epoch 21: 100%|██████████| 383/383 [01:08<00:00,  5.56it/s]  
Epoch 21, Loss: 2.0249  
Training Epoch 22: 100%|██████████| 383/383 [01:06<00:00,  5.77it/s]  
Epoch 22, Loss: 1.9509  
Training Epoch 23: 100%|██████████| 383/383 [01:07<00:00,  5.67it/s]   
Epoch 23, Loss: 1.8874  
Training Epoch 24: 100%|██████████| 383/383 [01:09<00:00,  5.54it/s]  
Epoch 24, Loss: 1.8366  
Training Epoch 25: 100%|██████████| 383/383 [01:09<00:00,  5.55it/s]  
Epoch 25, Loss: 1.7878  
Training Epoch 26: 100%|██████████| 383/383 [01:12<00:00,  5.30it/s]  
Epoch 26, Loss: 1.7406  
Training Epoch 27: 100%|██████████| 383/383 [01:04<00:00,  5.90it/s]  
Epoch 27, Loss: 1.7010  
Training Epoch 28: 100%|██████████| 383/383 [01:11<00:00,  5.38it/s]  
Epoch 28, Loss: 1.6577  
Training Epoch 29: 100%|██████████| 383/383 [01:08<00:00,  5.59it/s]  
Epoch 29, Loss: 1.6272  
Training Epoch 30: 100%|██████████| 383/383 [01:08<00:00,  5.57it/s]  
Epoch 30, Loss: 1.6060  
Training Epoch 31: 100%|██████████| 383/383 [01:06<00:00,  5.76it/s]  
Epoch 31, Loss: 1.5950  
Training Epoch 32: 100%|██████████| 383/383 [01:11<00:00,  5.38it/s]  
Epoch 32, Loss: 1.5787   
Training Epoch 33: 100%|██████████| 383/383 [01:10<00:00,  5.43it/s]  
Epoch 33, Loss: 1.5710   
  
Additional epochs appear unlikely to reduce the loss much further.
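
Since the loss flattens out well above the 0.5 target, a loop conditioned only on `avg_loss > 0.5` may run for a very long time. One common alternative, sketched here under the assumption that it suits this setup, is to stop once the loss has not improved by a minimum amount over the last few epochs. The function name `should_stop` and the `patience` and `min_delta` values are illustrative, not settings used in this project.

```python
def should_stop(loss_history, patience=3, min_delta=0.01):
    """True if the last `patience` epochs did not beat the best earlier loss by at least `min_delta`."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    recent_best = min(loss_history[-patience:])
    return best_before - recent_best < min_delta

# Losses reported above for epochs 29-33
losses = [1.6272, 1.6060, 1.5950, 1.5787, 1.5710]
print(should_stop(losses, patience=3, min_delta=0.01))
print(should_stop(losses, patience=3, min_delta=0.05))
```

The choice of `min_delta` decides the outcome here: the last three epochs improved the loss by about 0.035 in total, which counts as progress for a threshold of 0.01 but not for 0.05.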

### Part 5. Evaluation


#### 5.3 BERT4Rec Evaluation
Hit Ratio@K (HR@K) - Hit Ratio@K measures the proportion of times the true item is among the top K recommendations. It is a binary indicator for each test case, which is then averaged over all test cases.

$$
HR@K = \frac{1}{N} \sum_{i=1}^N \text{hit}_i
$$
where $\text{hit}_i$ is 1 if the true item is in the top K recommendations for the $i$-th user, otherwise 0.

Normalized Discounted Cumulative Gain@K (NDCG@K) - NDCG@K measures the ranking quality of the top K recommendations by considering the positions of the true item. It is a normalized form of DCG which takes into account the position of the correct item in the recommendation list.

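- $ \text{DCG@K} = \sum_{i=1}^K \frac{\text{rel}_i}{\log_2(i+1)} $, where $\text{rel}_i$ is 1 if the item at position $i$ is the true item, otherwise 0  
- $ \text{IDCG@K} = \sum_{i=1}^{\min(K, |\text{REL}|)} \frac{1}{\log_2(i+1)} $, where $|\text{REL}|$ is the number of relevant items; each test case here has a single true item, so $\text{IDCG@K} = 1$  
- $ \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}} $  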

Mean Reciprocal Rank (MRR) - MRR measures the average of the reciprocal ranks of the true items across all test users. The reciprocal rank is the inverse of the rank of the first relevant item.

- $ MRR = \frac{1}{N} \sum_{i=1}^N \frac{1}{\text{rank}_i} $

Note: if a sequence is shorter than the maximum length, the custom dataset class (`MaskedDataset`) pads it at the front with `0`. We therefore skip the padding index `0` in the evaluation.
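
A small worked example of the three metrics, using made-up 1-indexed ranks for three test cases and $K = 10$. With a single true item per test case the ideal DCG is 1, so NDCG@K reduces to $1 / \log_2(\text{rank} + 1)$ for hits and 0 otherwise. The ranks are illustrative only.

```python
import math

ranks = [1, 3, 12]   # hypothetical 1-indexed ranks of the true item in three test cases
K = 10

hr = sum(r <= K for r in ranks) / len(ranks)
ndcg_k = sum(1 / math.log2(r + 1) if r <= K else 0.0 for r in ranks) / len(ranks)
mrr_value = sum(1 / r for r in ranks) / len(ranks)

print(f"HR@{K} = {hr:.4f}")        # 2 of 3 test cases are hits
print(f"NDCG@{K} = {ndcg_k:.4f}")  # (1 + 1/2 + 0) / 3
print(f"MRR = {mrr_value:.4f}")    # (1 + 1/3 + 1/12) / 3
```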

Helper Functions

In [None]:
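# In all helpers, `rank` is the 0-indexed position of the true item in the sorted predictions.

def hit_ratio(rank, k):
    # 1 if the true item is within the top k, otherwise 0
    return int(rank < k)

def ndcg(rank, k):
    # Single true item, so IDCG = 1 and NDCG = 1 / log2(position + 1) with a 1-indexed position
    if rank < k:
        return 1 / np.log2(rank + 2)
    return 0

def mrr(rank):
    # Reciprocal of the 1-indexed rank
    return 1 / (rank + 1)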

In [None]:
model.eval()
hit_ratios_1 = []
hit_ratios_5 = []
hit_ratios_10 = []
ndcgs_5 = []
ndcgs_10 = []
mrrs = []

with torch.no_grad():
    for batch in tqdm(val_loader, desc="Evaluating"):
        input_ids, targets = batch
        input_ids, targets = input_ids.to(device), targets.to(device)

        with torch.autocast(device_type='cuda', dtype=torch.float16):
            outputs = model(input_ids)

        # Use local names so the global `batch_size` and `vocab_size` are not overwritten
        n_rows, seq_len, _ = outputs.size()

        for i in range(n_rows):
            for j in range(seq_len):
                true_item = targets[i, j].item()

                if true_item == 0:
                    continue

                logits = outputs[i, j]
                sorted_indices = torch.argsort(-logits)
                rank = (sorted_indices == true_item).nonzero(as_tuple=True)[0].item()

                hit_ratios_1.append(hit_ratio(rank, 1))
                hit_ratios_5.append(hit_ratio(rank, 5))
                hit_ratios_10.append(hit_ratio(rank, 10))
                ndcgs_5.append(ndcg(rank, 5))
                ndcgs_10.append(ndcg(rank, 10))
                mrrs.append(mrr(rank))

    print(f"Hit Ratio@1: {np.mean(hit_ratios_1):.4f}")
    print(f"Hit Ratio@5: {np.mean(hit_ratios_5):.4f}")
    print(f"Hit Ratio@10: {np.mean(hit_ratios_10):.4f}")
    print(f"NDCG@5: {np.mean(ndcgs_5):.4f}")
    print(f"NDCG@10: {np.mean(ndcgs_10):.4f}")
    print(f"MRR: {np.mean(mrrs):.4f}")

After epoch 33 -  
Evaluating: 100%|██████████| 96/96 [00:53<00:00,  1.79it/s]  
Hit Ratio@1: 0.4717  
Hit Ratio@5: 0.4720  
Hit Ratio@10: 0.4724  
NDCG@5: 0.4719  
NDCG@10: 0.4720  
MRR: 0.4720  
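
The evaluation loop sorts the full vocabulary for every position. A lighter way to obtain the same 0-indexed rank, sketched here with NumPy on random toy scores, is to count how many items score strictly higher than the true item, skipping padding targets (`0`) as described in the note above. The array shapes and values are assumptions for illustration; ties are resolved in favour of the true item, which can differ from `argsort` when scores are exactly equal.

```python
import numpy as np

rng = np.random.default_rng(0)
num_positions, num_items = 5, 8
scores = rng.normal(size=(num_positions, num_items))   # toy logits, one row per position
targets = np.array([3, 0, 7, 1, 0])                    # 0 marks padding and is skipped

valid = targets != 0
true_scores = scores[valid, targets[valid]]            # score of the true item per valid row

# 0-indexed rank = number of items with a strictly higher score
ranks = (scores[valid] > true_scores[:, None]).sum(axis=1)

hit_at_5 = (ranks < 5).mean()
print(ranks, hit_at_5)
```

The same counting trick carries over to tensors, which would avoid both the per-position sort and the Python-level double loop.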