<a href="https://colab.research.google.com/github/F-Yousefi/RecSys-BST-Pytorch/blob/main/Behavior_Sequence_Transformer_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Movie Recommendation System
###**_Behavior Sequence Transformer- Pytorch_**
---------------

Recommendation systems play an essential role in our lives, yet they receive comparatively little academic attention. In this project, I will build a movie recommendation model based on the MovieLens dataset. The idea is inspired by the paper `Behavior Sequence Transformer`, which can be found through this link. In that paper, the authors explain why this architecture performs better than the recommender systems that came before it. It uses a feature that almost all earlier RecSys ignored: the behavior sequence of each user over a period of time. For example, in our case, when a user has never watched a horror movie, "The Silence of the Lambs" is definitely not the movie that a good RecSys should recommend to him/her. Additionally, a user's taste might shift to other genres over time. For example, after watching hundreds of drama movies, he/she may decide to watch comedies. In this case, previous generations of RecSys might fail to pick up the pattern, but a sequence-aware model can.

##Dataset
-------
####Download Dataset
Our first step is to download and extract the MovieLens dataset. It is free and is provided in several sizes for different purposes.

In [22]:
#download the dataset:
from torchvision.datasets.utils import download_and_extract_archive

url = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"
download_root = 'downloads'  # where the zip archive is saved
extract_root = './'          # where the archive is unpacked (creates ./ml-1m)
download_and_extract_archive(url, download_root, extract_root)

!rm -r downloads

Downloading http://files.grouplens.org/datasets/movielens/ml-1m.zip to downloads/ml-1m.zip


100%|██████████| 5917549/5917549 [00:00<00:00, 20457178.76it/s]


Extracting downloads/ml-1m.zip to ./


####Read the dataset which is in .dat format.
-----------
In the next code cell, we use the pandas library to read the three `.dat` files that make up the dataset:
```["movies.dat","ratings.dat","users.dat"]```

In [2]:
# There are three files. ratings/ users/ movies.
from pathlib import Path
import pandas as pd


def read_dat(path, columns):
  try:
    df = pd.read_csv(
      path,
      sep="::",
      names=columns,
      engine='python',
      )
  except UnicodeDecodeError:
    # Some files are not valid UTF-8 (accented titles), so retry with Latin-1.
    df = pd.read_csv(
      path,
      sep="::",
      names=columns,
      engine='python',
      encoding="ISO-8859-1"
      )
  return df


data_dir = Path("ml-1m")
files_list = ["movies.dat","ratings.dat","users.dat"]
files_columns = [["movie_id", "title", "genres"],
 ["user_id", "movie_id", "rating", "unix_timestamp"],
  ["user_id", "sex", "age_group", "occupation", "zip_code"]]

# Read each file with its own column names.
movies_org, ratings_org, users_org = \
 [read_dat(data_dir / name, columns=columns) for name, columns in zip(files_list, files_columns)]
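
To make the file layout concrete, here is a minimal plain-Python sketch of what reading one `::`-separated record amounts to. The helper `parse_dat_line` and the sample record are illustrative assumptions, not part of the notebook; pandas does the same splitting for every line and additionally infers column types. Because the separator is longer than one character, pandas needs `engine='python'` to handle it.

```python
def parse_dat_line(line, columns):
    """Split one '::'-separated record and pair each field with its column name."""
    fields = line.rstrip("\n").split("::")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


# Illustrative record in the ratings.dat layout (values are for demonstration only).
record = parse_dat_line(
    "1::1193::5::978300760\n",
    ["user_id", "movie_id", "rating", "unix_timestamp"],
)
print(record)
# {'user_id': '1', 'movie_id': '1193', 'rating': '5', 'unix_timestamp': '978300760'}
```

Note that every field is still a string at this point; the type conversion is left to pandas in the actual cell.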

In [3]:
movies, ratings, users = movies_org.copy(), ratings_org.copy(), users_org.copy()

####Resizing the dataset
--------
In the next five cells, we shrink the dataset. First we keep only the movies
that have more than 100 ratings, then we drop the movies and users that no
longer appear in the remaining ratings, so that the three tables stay consistent.

In [4]:
ratings_per_movie = ratings.groupby("movie_id").size()
# Despite the name, the comparison is strict: a movie needs more than 100 ratings to be kept.
movie_ids_with_atleast_100_views = ratings_per_movie[ratings_per_movie > 100].index

ratings = ratings[ratings.movie_id.isin(movie_ids_with_atleast_100_views)].reset_index(drop=True)


In [5]:
print(f"All users are present in ratings :\
            {len(users.user_id.unique()) == len(ratings.user_id.unique())}")
print(f"All movies are voted in ratings :\
            {len(movies.movie_id.unique()) == len(ratings.movie_id.unique())}")

All users are present in ratings :            True
All movies are voted in ratings :            False


In [6]:
movies = movies[movies.movie_id.isin(ratings.movie_id)].reset_index(drop=True)
print(f"Movies before dropping unseen movies:{movies_org.shape}")
print(f"Movies after dropping unseen movies:{movies.shape}")

Movies before dropping unseen movies:(3883, 3)
Movies after dropping unseen movies:(2006, 3)


In [7]:
users = users[users.user_id.isin(ratings.user_id)].reset_index(drop=True)
print(f"users before dropping inactive users:{users_org.shape}")
print(f"users after dropping inactive users:{users.shape}")

users before dropping inactive users:(6040, 5)
users after dropping inactive users:(6040, 5)


In [8]:
print(f"All users are present in ratings :\
            {len(users.user_id.unique()) == len(ratings.user_id.unique())}")
print(f"All movies are voted in ratings :\
            {len(movies.movie_id.unique()) == len(ratings.movie_id.unique())}")

All users are present in ratings :            True
All movies are voted in ratings :            True


####Categories
In this cell, the user ids, movie ids and other categorical features are
replaced with integer category codes. For example, the users' gender column takes the values M (male) and F (female). We need to map it to a number, in this case 0 for female and 1 for male.

In [9]:
def to_categorical(column):
  cat_obj = pd.Categorical(column)
  return pd.Series(cat_obj.codes)

ratings[["user_id", "movie_id", "unix_timestamp"]] = ratings[["user_id", "movie_id", "unix_timestamp"]].apply(to_categorical, axis=0)
ratings_truth = pd.concat([ratings.sort_values(by="movie_id"),ratings_org.sort_values(by="movie_id")], axis= 1)

movies[["movie_id"]] = movies[["movie_id"]].apply(to_categorical, axis=0,)
movie_truth = pd.concat([movies,movies_org], axis= 1)

users = users.apply(to_categorical, axis=0)
users_truth = pd.concat([users,users_org], axis= 1)
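
As a rough illustration of the encoding above, the following plain-Python sketch mimics the idea behind `pd.Categorical(...).codes`: each distinct value is replaced by its rank among the sorted distinct values. The helper `to_codes` is hypothetical and only meant to show why F becomes 0 and M becomes 1, and why filtered movie ids end up as a gap-free range.

```python
def to_codes(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    categories = sorted(set(values))
    lookup = {value: code for code, value in enumerate(categories)}
    return [lookup[value] for value in values], categories


codes, categories = to_codes(["M", "F", "F", "M"])
print(codes)       # [1, 0, 0, 1]
print(categories)  # ['F', 'M']

# Raw ids may have gaps after filtering; the codes are always 0..n-1.
movie_codes, _ = to_codes([1, 3, 3, 260, 1])
print(movie_codes)  # [0, 1, 1, 2, 0]
```

Gap-free codes matter later because an embedding table is indexed by position, so the largest code must be smaller than the number of rows in the table.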

In [10]:
users_ratings = pd.merge(left=ratings, right= users, on="user_id")
users_ratings = users_ratings.drop(columns = ['zip_code'])

In [11]:
max_ratings = users_ratings.rating.max()
min_ratings = users_ratings.rating.min()
# Min-max scale the ratings into [0, 1]; the bounds are kept to map predictions back.
users_ratings.rating = (users_ratings.rating - min_ratings) / (max_ratings - min_ratings)
users_ratings.describe()

| |user_id|movie_id|rating|unix_timestamp|sex|age_group|occupation|
|----|----|----|----|----|----|----|----|
|count|940925.0|940925.0|940925.0|940925.0|940925.0|940925.0|940925.0|
|mean|3025.972677|951.6194|0.652918|214886.919582|0.75509|2.491514|8.04491|
|std|1729.447761|563.519628|0.276035|122218.992504|0.430034|1.354066|6.510204|
|min|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|1512.0|505.0|0.5|111067.0|1.0|2.0|2.0|
|50%|3072.0|888.0|0.75|219242.0|1.0|2.0|7.0|
|75%|4479.0|1439.0|0.75|315215.0|1.0|3.0|14.0|
|max|6039.0|2005.0|1.0|441727.0|1.0|6.0|20.0|
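
The rescaling in the cell above can be checked by hand. The sketch below assumes the ratings are whole stars from 1 to 5, as in MovieLens 1M; the helper names are illustrative and not part of the notebook. It also shows the inverse mapping, which is what one would use to turn a predicted score back into stars.

```python
def min_max_scale(value, low, high):
    """Map a value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)


def min_max_unscale(scaled, low, high):
    """Inverse of min_max_scale."""
    return scaled * (high - low) + low


stars = [1, 2, 3, 4, 5]
scaled = [min_max_scale(s, 1, 5) for s in stars]
print(scaled)    # [0.0, 0.25, 0.5, 0.75, 1.0]

restored = [min_max_unscale(x, 1, 5) for x in scaled]
print(restored)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

This matches the summary table: a median of 0.75 corresponds to a 4-star rating.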


###Time Series
Since our architecture depends heavily on the order of each user's choices, we sort all the records by timestamp before grouping them per user.

In [12]:
user_groups = users_ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")
dataset = pd.DataFrame(data = {
        "user_id": list(user_groups.groups.keys()),
        "sex" : list(user_groups.sex.unique().explode()),
        "age_group" : list(user_groups.age_group.unique().explode()),
        "occupation" : list(user_groups.occupation.unique().explode()),
        "movie_ids": list(user_groups.movie_id.apply(list)),
        "ratings": list(user_groups.rating.apply(list)),
        "timestamps": list(user_groups.unix_timestamp.apply(list)),
    })


len(dataset.loc[0]["movie_ids"])

52
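
The sort-then-group step can be pictured with a few made-up rows. This is a plain-Python sketch of the same idea (sort by timestamp first, then collect each user's movies in that order); the data and variable names are invented for illustration.

```python
from collections import defaultdict

# (user_id, movie_id, unix_timestamp) rows in arbitrary order; made-up data.
rows = [
    (0, 10, 300),
    (1, 11, 120),
    (0, 12, 100),
    (1, 10, 90),
    (0, 11, 200),
]

histories = defaultdict(list)
# Sorting by timestamp first guarantees each user's list is chronological.
for user_id, movie_id, _ in sorted(rows, key=lambda row: row[2]):
    histories[user_id].append(movie_id)

print(histories[0])  # [12, 11, 10]
print(histories[1])  # [10, 11]
```

If the rows were grouped without sorting, user 0 would get `[10, 12, 11]`, and the model would be trained on an order that never happened.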

###Genres
------------
"Genre" is a multi-label feature in this case. For example, "Toy Story" carries several genre tags, including Animation and Comedy. As a consequence, we need to generate one 0/1 label per genre for every movie. For example:

|Movie title|Animation| Comedy| Drama| Action |....|
|----|----|----|----|----|----|
|Toy Story|1|1|0|0|...|

In [13]:
genres = [
    "Action",
    "Adventure",
    "Animation",
    "Children's",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]

for genre in genres:
    movies[genre] = movies["genres"].apply(
        lambda values: int(genre in values.split("|"))
    )
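
For a single movie, the loop above boils down to the following sketch. The shortened `GENRES` list and the helper `multi_hot` are assumptions made for illustration. Splitting on `|` before testing membership is what prevents partial matches between genre names.

```python
GENRES = ["Action", "Animation", "Comedy", "Drama"]  # shortened list for illustration


def multi_hot(genre_string, genres=GENRES):
    """Return a 0/1 vector with a 1 for every genre tag present in the string."""
    tags = set(genre_string.split("|"))
    return [int(genre in tags) for genre in genres]


print(multi_hot("Animation|Comedy"))  # [0, 1, 1, 0]
print(multi_hot("Drama"))             # [0, 0, 0, 1]
```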

###Parameters
We store some information here that we will need later, when we build the model.

In [14]:
USER_FEATUERS = ["user_id","sex", "age_group", "occupation"]
FEATURES_VOCABULARY = {
    "movie_id": movies.movie_id.tolist(),
    "user_id" : users.user_id.tolist(),
    "sex" : users.sex.unique().tolist(),
    "occupation": users.occupation.unique().tolist(),
    "age_group": users.age_group.unique().tolist(),
    "genres": movies[genres].to_numpy()
}
PARAMETERS ={
    "sequence_length": 4,
    "step": 2,
}

###Behavior Sequences
In the next cell, we define how long a sequence of a user's choices the model should consider. With `sequence_length` set to 4, each sample holds three watched movies followed by one target movie. In other words, the model learns how well a candidate movie fits after these three movies. For example:
1. First movie: Toy Story
2. Second movie: Cinderella
3. Third movie: Mulan
4. Target movie: Snow White and the Seven Dwarfs 100%


In [15]:
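def create_sequences(values, sequence, step):
  """Slide a window of length `sequence` over `values`, moving `step` items at a time."""
  start_idx = 0
  sec_list = []
  while True:
    end_idx = start_idx + sequence
    if end_idx > len(values):
      # The window ran past the end: close with the last `sequence` items.
      # This can repeat the previous window when the lengths line up exactly.
      sec_list.append(values[-sequence:])
      break
    sec_list.append(values[start_idx:end_idx])
    start_idx += step
  return sec_list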


dataset["movie_ids"] = dataset["movie_ids"].apply(
    lambda values: create_sequences(
        values,
        PARAMETERS["sequence_length"],
        PARAMETERS["step"] ))

dataset["ratings"] = dataset["ratings"].apply(
    lambda values: create_sequences(
        values,
        PARAMETERS["sequence_length"],
        PARAMETERS["step"] ))

dataset = dataset.drop(columns = ["timestamps"])

dataset = dataset.explode(column=["ratings", "movie_ids"]).reset_index(drop=True)
dataset.head()

| |user_id|sex|age_group|occupation|movie_ids|ratings|
|----|----|----|----|----|----|----|
|0|0|0|0|10|[1661, 649, 863, 495]|[0.75, 1.0, 0.75, 1.0]|
|1|0|0|0|10|[863, 495, 1201, 903]|[0.75, 1.0, 0.5, 1.0]|
|2|0|0|0|10|[1201, 903, 1759, 593]|[0.5, 1.0, 0.75, 0.75]|
|3|0|0|0|10|[1759, 593, 1462, 363]|[0.75, 0.75, 1.0, 0.5]|
|4|0|0|0|10|[1462, 363, 140, 581]|[1.0, 0.5, 0.75, 1.0]|
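
The windowing is easier to follow on a tiny list. The sketch below re-implements the same logic under the name `sliding_windows` (a hypothetical stand-in for `create_sequences`) and runs it with a window of 4 and a step of 2, the values used in `PARAMETERS`.

```python
def sliding_windows(values, size, step):
    """Windows of length `size`, advancing by `step`, closed by the last `size` items."""
    windows = []
    start = 0
    while True:
        end = start + size
        if end > len(values):
            # Ran past the end: finish with the tail of the list.
            windows.append(values[-size:])
            break
        windows.append(values[start:end])
        start += step
    return windows


print(sliding_windows(list(range(7)), size=4, step=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [3, 4, 5, 6]]

print(sliding_windows(list(range(6)), size=4, step=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [2, 3, 4, 5]]
```

The second call shows a side effect worth knowing about: when the history length lines up exactly with the step, the closing window repeats the previous one. A history shorter than the window would also yield a short window, so this approach assumes every user has at least `size` ratings.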


###Dataset In Pytorch
The DataFrame we have prepared so far cannot be fed to a PyTorch Lightning model directly, so we wrap it in a `torch.utils.data` dataset.

In [16]:
from torch.utils import data
import torch
import numpy as np

class MovieDataset(data.Dataset):
  def __init__(self, dataset):
    self.len = len(dataset)
    self.user_id = torch.tensor(dataset.user_id.values, dtype=int)
    self.sex = torch.tensor(dataset.sex.values, dtype=int)
    self.occupation = torch.tensor(dataset.occupation.values, dtype=int)
    self.age_group = torch.tensor(dataset.age_group.values, dtype=int)
    self.movie_ids = torch.tensor(dataset.movie_ids.tolist(), dtype=int)
    self.ratings = torch.tensor(dataset.ratings.tolist(), dtype=torch.float32)

  def __len__(self):
    return self.len


  def __getitem__(self, idx):
    if isinstance(idx, slice):
      raise TypeError("MovieDataset does not support slicing; index one item at a time.")
    sequence_movie_ids = self.movie_ids[idx][:-1]
    target_movie_id = self.movie_ids[idx][-1:]
    sequence_ratings = self.ratings[idx][:-1]
    target_rating = self.ratings[idx][-1:]


    return self.user_id[idx], self.sex[idx], self.age_group[idx],\
           self.occupation[idx], sequence_movie_ids,\
           target_movie_id, sequence_ratings, target_rating

random_selection = np.random.rand(len(dataset.index)) <= 0.85
train_data = dataset[random_selection]
test_data = dataset[~random_selection]

train_data = MovieDataset(train_data)
test_data = MovieDataset(test_data)
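
What `__getitem__` does to one window can be sketched with plain lists. The helper `split_window` is hypothetical, and the sample values are taken from the first row of the table shown earlier. Slicing with `[-1:]` rather than `[-1]` keeps the target as a one-element sequence, which mirrors the shape the dataset returns.

```python
def split_window(movie_ids, ratings):
    """Split one window into the input history and the target to predict."""
    return {
        "sequence_movie_ids": movie_ids[:-1],
        "target_movie_id": movie_ids[-1:],
        "sequence_ratings": ratings[:-1],
        "target_rating": ratings[-1:],
    }


sample = split_window([1661, 649, 863, 495], [0.75, 1.0, 0.75, 1.0])
print(sample["sequence_movie_ids"])  # [1661, 649, 863]
print(sample["target_movie_id"])     # [495]
print(sample["sequence_ratings"])    # [0.75, 1.0, 0.75]
print(sample["target_rating"])       # [1.0]
```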


## Architecture
The architecture of this model is hard to explain in a few words, so I decided to include two references instead, along with a diagram that represents the architecture of the model.
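
One detail of the model code that follows is easier to read with the numbers spelled out: every embedding is sized as the integer square root of its vocabulary. The sketch below assumes the vocabulary sizes that can be read off the earlier outputs (6040 users, 2 sexes, 7 age groups, 21 occupations, 2006 movies); if the filtering changes, these numbers change too.

```python
import math

# Vocabulary sizes read off the earlier outputs (assumed, see the lead-in).
vocab_sizes = {"user_id": 6040, "sex": 2, "age_group": 7, "occupation": 21}
movie_vocab = 2006
sequence_length = 4


def embedding_dim(vocab_size):
    """Sizing rule used in the model: integer square root of the vocabulary size."""
    return int(math.sqrt(vocab_size))


user_dims = {name: embedding_dim(size) for name, size in vocab_sizes.items()}
movie_dim = embedding_dim(movie_vocab)

print(user_dims)                # {'user_id': 77, 'sex': 1, 'age_group': 2, 'occupation': 4}
print(sum(user_dims.values()))  # 84
print(movie_dim)                # 44
print(sum(user_dims.values()) + sequence_length * movie_dim)  # 260
```

Under these assumptions, the user-feature embeddings add up to 84, which is where the constant `84` in the first fully connected layer comes from, and the movie embedding width of 44 is divisible by the 4 attention heads.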

In [None]:
!pip install lightning

In [26]:
import lightning as ltorch
from torch import nn
import torch
import math
import torchmetrics

def other_features():
  embedding_list = []
  for feature in USER_FEATUERS:
    num_embeddings=len(FEATURES_VOCABULARY[feature])
    embedding_dim= int(math.sqrt(len(FEATURES_VOCABULARY[feature])))
    emb = nn.Embedding(
        num_embeddings=num_embeddings,
        embedding_dim= embedding_dim
        )
    embedding_list.append(emb)
  return embedding_list

class MovieLens(ltorch.LightningModule):

  def __init__(self):
    super().__init__()
    #Params
    len_movie_voc = len(FEATURES_VOCABULARY["movie_id"])
    self.len_movie_embedded = int(math.sqrt(len_movie_voc))
    dropout_rate = 0.15
    num_heads = 4
    #END

    #OTHER FEATURES
    for idx,feature in enumerate(USER_FEATUERS):
      num_embeddings=len(FEATURES_VOCABULARY[feature])
      embedding_dim= int(math.sqrt(len(FEATURES_VOCABULARY[feature])))
      emb = nn.Embedding(
          num_embeddings=num_embeddings,
          embedding_dim= embedding_dim
          )
      setattr(self, f'emb_{idx}', emb)
    #END

    self.sequence_movie_embedding = nn.Embedding(
                                        num_embeddings=len_movie_voc,
                                        embedding_dim=self.len_movie_embedded)

    self.genres_embedding = nn.Embedding(
                        num_embeddings=FEATURES_VOCABULARY["genres"].shape[0],
                        embedding_dim= FEATURES_VOCABULARY["genres"].shape[-1])
    self.genres_embedding.weight = torch.nn.Parameter( #all weights are initialized ...
                torch.from_numpy(FEATURES_VOCABULARY["genres"].astype(float)))
    self.genres_embedding.weight.requires_grad=False #trainable = False


    in_features_num = FEATURES_VOCABULARY["genres"].shape[-1] + self.len_movie_embedded

    self.movies_sequence_dense =nn.Sequential(
            nn.Linear(in_features=in_features_num * (PARAMETERS["sequence_length"] -1),
                      out_features=256),
            nn.BatchNorm1d(num_features=256),
            nn.ReLU(),
            nn.Linear(in_features=256,
                      out_features=128),
            nn.BatchNorm1d(num_features=128),
            nn.ReLU(),
            nn.Linear(in_features=128,
                      out_features=(PARAMETERS["sequence_length"] -1) * self.len_movie_embedded),
            nn.BatchNorm1d(num_features=(PARAMETERS["sequence_length"] -1) * self.len_movie_embedded)
    )



    # Fixed sinusoidal position table for the watched-movie sequence.
    position = torch.arange(PARAMETERS["sequence_length"] -1).unsqueeze(1)
    # Note: the exponent is divided by a constant 60 here, not by the embedding size.
    div_term = torch.exp(
        torch.arange(0,self.len_movie_embedded, 2) * (-math.log(10000.0) / 60))

    positional_embedding = torch.zeros(
        PARAMETERS["sequence_length"] -1,
        self.len_movie_embedded)

    positional_embedding[:, 0::2] = torch.sin(position * div_term)
    positional_embedding[:, 1::2] = torch.cos(position * div_term)
    # A buffer follows the module to whichever device the trainer selects,
    # so no manual .to("cuda") is needed.
    self.register_buffer("positional_embedding", positional_embedding)
    self.target_movie_embedding = nn.Embedding(
                                      num_embeddings=len_movie_voc,
                                      embedding_dim= self.len_movie_embedded)

    self.movies_target_dense = nn.Sequential(
            nn.Linear(in_features=in_features_num,
                                      out_features=256),
            nn.BatchNorm1d(num_features=1),
            nn.ReLU(),
            nn.Linear(in_features=256,
                                      out_features=128),
            nn.BatchNorm1d(num_features=1),
            nn.ReLU(),
            nn.Linear(in_features=128,
                                      out_features=self.len_movie_embedded),
            nn.BatchNorm1d(num_features=1)
    )



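    self.multi_head =nn.MultiheadAttention(
                            embed_dim= self.len_movie_embedded,
                            num_heads= num_heads,
                            dropout=dropout_rate,
                            # Inputs are (batch, sequence, embedding); without this flag
                            # the first axis would be treated as the sequence axis.
                            batch_first=True)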

    self.dropout_first = nn.Dropout(dropout_rate)

    self.norm_transformer_first = nn.BatchNorm1d(PARAMETERS["sequence_length"])

    self.dense_transformer = nn.Linear(in_features=self.len_movie_embedded * PARAMETERS["sequence_length"],
                             out_features=self.len_movie_embedded * PARAMETERS["sequence_length"])

    self.norm_transformer_second = nn.BatchNorm1d(PARAMETERS["sequence_length"])

    self.dropout_second = nn.Dropout(dropout_rate)

    self.fully_connected = nn.Sequential(
        nn.Linear(in_features=84 + (PARAMETERS["sequence_length"] * self.len_movie_embedded),
                  out_features=256),
        nn.BatchNorm1d(num_features=256),
        nn.LeakyReLU(),
        nn.Dropout(dropout_rate),
        nn.Linear(in_features=256,out_features=128),
        nn.BatchNorm1d(num_features=128),
        nn.LeakyReLU(),
        nn.Dropout(dropout_rate),
        nn.Linear(in_features=128,out_features=1),
        )
    self.loss_function = nn.functional.mse_loss

  def other_features_encoder(self, batch):
    concat_other_features = []
    for idx,input in enumerate(batch):
      concat_other_features.append(getattr(self, f'emb_{idx}')(input))
    other_features = torch.cat(concat_other_features, dim=-1)
    return other_features

  def seq_mov_emb(self, batch):
    sme = self.sequence_movie_embedding(batch)
    ge = self.genres_embedding(batch)
    cat_sme = torch.cat([sme,ge],dim = -1).float()
    out = self.dense_trainer(
         self.movies_sequence_dense,
         cat_sme,
         (batch.shape[0],PARAMETERS["sequence_length"] -1, self.len_movie_embedded))

    return nn.functional.relu(out)

  def tar_mov_emb(self, batch):
    tme = self.target_movie_embedding(batch)
    ge = self.genres_embedding(batch)
    cat_sme = torch.cat([tme,ge],dim = -1).float()
    out = self.movies_target_dense(cat_sme)
    return nn.functional.relu(out)

  def transformer(self, ts):
    mh, _ = self.multi_head(ts,ts,ts)
    drmh = self.dropout_first(mh)
    add = ts + drmh
    norm_f = self.norm_transformer_first(add)
    lcr = nn.functional.leaky_relu(norm_f)
    dst = self.dense_trainer(self.dense_transformer,lcr)
    drds = self.dropout_second(dst)
    add_nd = norm_f + drds
    norm_s = self.norm_transformer_second(add_nd)
    flat = torch.flatten(norm_s, start_dim=1)
    return flat


  def fully_connected_model(self, input):
    return self.fully_connected(input)

  def dense_trainer(self, layer, input, output_shape= None):
    input_shape = input.shape
    input = torch.flatten(input, start_dim=1)
    output = layer(input)

    if output_shape == None : output = torch.reshape(output, input_shape)
    else: output = torch.reshape(output, output_shape)

    return output


  def training_step(self, batch, batch_idx=0):
    self.train()
    other_features = self.other_features_encoder(batch[:4])
    sme = self.seq_mov_emb(batch[4])
    tme = self.tar_mov_emb(batch[5])
    sme = self.positional_embedding + sme
    sr = batch[6].unsqueeze(dim = -1)
    mul_tme_sr = torch.mul(sr, sme)
    transformer_features = torch.cat([tme,mul_tme_sr], dim=-2)
    flat = self.transformer(transformer_features)
    cat_tr_oth = torch.cat([other_features,flat],dim = -1)
    y_pred = self.fully_connected_model(cat_tr_oth)
    y = batch[7]
    loss = self.loss_function(y_pred, y)
    mae = nn.functional.l1_loss(y_pred, y)
    mse = nn.functional.mse_loss(y_pred, y)
    self.log_dict({"Train Loss": loss, "Train l1 loss":mae,"Train mse loss":mse  },on_epoch=True, prog_bar=True, enable_graph=True )
    return loss

  def validation_step(self, batch, batch_idx=0):
    self.eval()
    other_features = self.other_features_encoder(batch[:4])
    sme = self.seq_mov_emb(batch[4])
    tme = self.tar_mov_emb(batch[5])
    sme = self.positional_embedding + sme
    sr = batch[6].unsqueeze(dim = -1)
    mul_tme_sr = torch.mul(sr, sme)
    transformer_features = torch.cat([tme,mul_tme_sr], dim=-2)
    flat = self.transformer(transformer_features)
    cat_tr_oth = torch.cat([other_features,flat],dim = -1)
    y_pred = self.fully_connected_model(cat_tr_oth)
    y = batch[7]
    loss = self.loss_function(y_pred, y)
    self.log_dict({"Validation Loss": loss},on_epoch=True, prog_bar=True,on_step = False, enable_graph=True)

  def predict_step(self, batch, batch_idx=0):
    self.eval()
    other_features = self.other_features_encoder(batch[:4])
    sme = self.seq_mov_emb(batch[4])
    tme = self.tar_mov_emb(batch[5])
    sme = self.positional_embedding + sme
    sr = batch[6].unsqueeze(dim = -1)
    mul_tme_sr = torch.mul(sr, sme)
    transformer_features = torch.cat([tme,mul_tme_sr], dim= 1)

    flat = self.transformer(transformer_features)
    cat_tr_oth = torch.cat([other_features,flat],dim = -1)
    y_pred = self.fully_connected_model(cat_tr_oth)
    return torch.cat([batch[4], batch[5], y_pred], dim=1)

  def configure_optimizers(self):
    optimizer = torch.optim.Adagrad(self.parameters(), lr=1e-3, weight_decay=5e-6)
    return optimizer





model = MovieLens()
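
The positional table built in `__init__` follows the sinusoidal pattern: even columns hold a sine, odd columns a cosine, with wavelengths that grow along the embedding axis. The sketch below is a plain-Python version of the textbook formulation, which divides the exponent by the embedding width `dim`; the notebook's code uses a fixed divisor of 60 at that spot, so the actual values differ. The function name `sinusoidal_table` is an assumption made for illustration.

```python
import math


def sinusoidal_table(num_positions, dim, base=10000.0):
    """Return a num_positions x dim table of sinusoidal position encodings."""
    table = [[0.0] * dim for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, dim, 2):
            angle = pos / (base ** (i / dim))
            table[pos][i] = math.sin(angle)
            if i + 1 < dim:
                table[pos][i + 1] = math.cos(angle)
    return table


table = sinusoidal_table(num_positions=3, dim=4)
print(table[0])  # [0.0, 1.0, 0.0, 1.0]
print(len(table), len(table[0]))  # 3 4
```

Position 0 always encodes to alternating 0 and 1 because sin(0) = 0 and cos(0) = 1. Later positions differ from each other in every column, which is what lets the attention layer tell the first watched movie from the third.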

In [27]:
from lightning.pytorch.callbacks import RichProgressBar
from lightning.pytorch.callbacks.progress.rich_progress import RichProgressBarTheme
from lightning.pytorch.callbacks.early_stopping import EarlyStopping


# create your own theme!
progress_bar = RichProgressBar(
    theme=RichProgressBarTheme(
        description="green_yellow",
        progress_bar="green1",
        progress_bar_finished="green1",
        progress_bar_pulse="#6206E0",
        batch_progress="green_yellow",
        time="grey82",
        processing_speed="grey82",
        metrics="#8756d6",
        metrics_text_delimiter="\n",
        metrics_format=".3f",

    )
)
trainer = ltorch.Trainer(max_epochs=100, accelerator="auto",
                         devices="auto", strategy="auto",
                         callbacks=[EarlyStopping(
                             monitor="Validation Loss", mode="min"),
                                    progress_bar])

trainer.fit(model=model,
            train_dataloaders=torch.utils.data.DataLoader(
                train_data,batch_size=2048,shuffle=True, num_workers=10),
            val_dataloaders=torch.utils.data.DataLoader(
                test_data,batch_size=2048,shuffle=False, num_workers=10))

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

In [37]:
from torch.utils import data
import torch
import numpy as np


class MoviePrediction(data.Dataset):
  def __init__(self, dataset, user_id):

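    user_id_df = dataset.iloc[dataset.groupby("user_id").groups[user_id]].copy()

    # Collect the watched movies before trimming the windows, so the final
    # movie of each window is not offered back as a recommendation.
    movies_watched_by_user = user_id_df["movie_ids"].tolist()
    self.movies_watched_by_user = set(sum(movies_watched_by_user, []))

    # Keep the first sequence_length - 1 entries of each window as model input.
    user_id_df["movie_ids"] = user_id_df["movie_ids"].apply(lambda x: x[:-1])
    user_id_df["ratings"] = user_id_df["ratings"].apply(lambda x: x[:-1])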

    movies_not_watched = \
      movies[~movies["movie_id"].isin(self.movies_watched_by_user)]["movie_id"].values

    user_id_df["target_movie_id"] = [list(movies_not_watched)] * len(user_id_df)
    dataset = user_id_df.explode(column = "target_movie_id")
    self.len = len(dataset)
    self.user_id = torch.tensor(dataset.user_id.values, dtype=int)
    self.sex = torch.tensor(dataset.sex.values, dtype=int)
    self.occupation = torch.tensor(dataset.occupation.values, dtype=int)
    self.age_group = torch.tensor(dataset.age_group.values, dtype=int)
    self.movie_ids = torch.tensor(dataset.movie_ids.tolist(), dtype=int)
    self.ratings = torch.tensor(dataset.ratings.tolist(), dtype=torch.float32)
    self.target_movie_id = torch.tensor(dataset.target_movie_id.values.tolist(), dtype=int).unsqueeze(dim=1)

  def __len__(self):
    return self.len


  def __getitem__(self, idx):
    sequence_movie_ids = self.movie_ids[idx]
    sequence_ratings = self.ratings[idx]


    return self.user_id[idx], self.sex[idx], self.age_group[idx],\
           self.occupation[idx], sequence_movie_ids,\
           self.target_movie_id[idx], sequence_ratings

predict_data = MoviePrediction(dataset,100)

prediction = trainer.predict(model, torch.utils.data.DataLoader(predict_data,batch_size=512,shuffle=False, num_workers=2))
prediction = torch.cat(prediction , dim=0)
prediction = prediction.numpy()
prediction_df = pd.DataFrame({key:prediction[:,idx] for idx,key in enumerate([f"movie_{i}" for i in range(1,PARAMETERS["sequence_length"])]+["movie_id","movie_score"])})


INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

In [38]:
pred_movies = prediction_df.copy()
pred_movies["movie_id"] = pred_movies["movie_id"].astype(int)
pred_movies = pd.merge(left=pred_movies, right=movies, on="movie_id")
movie_groups = pred_movies.groupby("movie_id")

pred_movies = pd.DataFrame({
    "movie_title": movie_groups.title.unique().explode(),
    "movie_score": (movie_groups.movie_score.max()*100).astype(int),
}).sort_values(by="movie_score", ascending=False).reset_index(drop=True)
users_movies = movies.where(movies.movie_id.isin(predict_data.movies_watched_by_user)).dropna()
users_movies= users_movies[["title"]].head(10).reset_index(drop=True)
pd.concat([pred_movies.head(10),users_movies], axis=1)

| |movie_title|movie_score|title|
|----|----|----|----|
|0|Shawshank Redemption, The (1994)|97|Get Shorty (1995)|
|1|One Flew Over the Cuckoo's Nest (1975)|96|Bad Boys (1995)|
|2|Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)|95|Die Hard: With a Vengeance (1995)|
|3|Schindler's List (1993)|95|Waterworld (1995)|
|4|On the Waterfront (1954)|95|Star Wars: Episode IV - A New Hope (1977)|
|5|Dr. Strangelove or: How I Learned to Stop Worr...|95|Outbreak (1995)|
|6|Usual Suspects, The (1995)|95|Star Trek: Generations (1994)|
|7|Life Is Beautiful (La Vita è bella) (1997)|94|I Love Trouble (1994)|
|8|North by Northwest (1959)|94|Maverick (1994)|
|9|Rear Window (1954)|94|River Wild, The (1994)|
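
The ranking step in the last cell can be summarised as: every candidate movie is scored once per history window, the best score per movie is kept, and the movies are sorted by that score. Here is a plain-Python sketch of that aggregation with invented scores; the variable names are illustrative only.

```python
# (window_index, candidate_movie_id, predicted_score); made-up values.
predictions = [
    (0, 5, 0.61),
    (0, 7, 0.90),
    (0, 9, 0.70),
    (1, 5, 0.95),
    (1, 7, 0.40),
    (1, 9, 0.72),
]

# Keep the highest score each candidate received across all windows.
best_score = {}
for _, movie_id, score in predictions:
    best_score[movie_id] = max(score, best_score.get(movie_id, float("-inf")))

# Sort candidates from best to worst and keep the top two.
top_two = sorted(best_score.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)  # [(5, 0.95), (7, 0.9)]
```

Taking the maximum favours a movie that fits at least one stretch of the user's history very well; averaging over windows would be an alternative design that rewards consistent fit instead.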
