Content recommendation is a fun and often challenging space for Data Science and Machine Learning. In this notebook, I predict the next movie that someone is likely to watch based on the past 10 movies they've watched, without using any information about genre, ratings/sentiment, etc. (the order in which a user rated movies stands in for the order in which they watched them). This mimics the setting many products face, in which users simply use the product without rating each experience or providing tags about their usage and expectations.

This uses the [MovieLens dataset](https://grouplens.org/datasets/movielens/) {F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872}.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load datasets
movies = pd.read_csv("ml-32m/movies.csv")
ratings = pd.read_csv("ml-32m/ratings.csv")
# Join to get the movie title, not using genre or tags
ratings = ratings.merge(movies[['movieId', 'title']], how='left', on='movieId')
# Subsample ratings for compute size, pulling all ratings for userId's 1->25k
ratings = ratings[ratings['userId'].between(1, 25_000)]
# Encode movieId so it starts from 0 to avoid embedding errors later
movie_encoder = LabelEncoder()
all_movie_ids = ratings['movieId'].unique()
movie_encoder.fit(all_movie_ids)
ratings['movieId'] = movie_encoder.transform(ratings['movieId'])
ratings.sort_values(by='movieId')

Unnamed: 0,userId,movieId,rating,timestamp,title
1801800,11442,0,2.5,1189816084,Toy Story (1995)
416639,2747,0,4.0,1543947635,Toy Story (1995)
1334654,8537,0,5.0,833975160,Toy Story (1995)
501510,3246,0,3.0,1436358506,Toy Story (1995)
8750,60,0,3.0,1441153002,Toy Story (1995)
...,...,...,...,...,...
765965,5029,45731,3.0,1697098749,Totally Killer (2023)
2970353,18699,45732,0.5,1696892007,Pet Sematary: Bloodlines (2023)
519912,3367,45733,0.5,1696725919,Space Wars: Quest for the Deepstar (2023)
2164518,13783,45734,2.5,1696850629,Something to Remind Me (2002)


In [18]:
# Let's see what the most popular movies were
ratings.groupby(['movieId', 'title']).count().sort_values(by=['userId'], ascending=False).head(25)

movieId,title,userId,rating,timestamp
314,"Shawshank Redemption, The (1994)",12882,12882,12882
351,Forrest Gump (1994),12453,12453,12453
292,Pulp Fiction (1994),12231,12231,12231
2460,"Matrix, The (1999)",11603,11603,11603
584,"Silence of the Lambs, The (1991)",11286,11286,11286
257,Star Wars: Episode IV - A New Hope (1977),10663,10663,10663
2846,Fight Club (1999),9559,9559,9559
474,Jurassic Park (1993),9434,9434,9434
521,Schindler's List (1993),9153,9153,9153
4840,"Lord of the Rings: The Fellowship of the Ring, The (2001)",9078,9078,9078


Below I define train and test datasets. The testing data has 2 batches:
1) users not in the training data at all
2) users in the training data, but only their last few movies rated (with those last few excluded from the train set to prevent data leakage)

This lets me test how the next-movie prediction works both on unseen users and on users who continue to watch movies beyond what's currently in the dataset.

In [9]:
# Step 1: Split out some users entirely for out-of-user test
unique_users = ratings['userId'].unique()
out_users = pd.Series(unique_users).sample(frac=0.1, random_state=42)
out_user_ratings = ratings[ratings['userId'].isin(out_users)]

# Remaining users for in-user split
in_users_ratings = ratings[~ratings['userId'].isin(out_users)]

# Step 2: From in-users, hold out last few ratings per user for in-user test
def hold_out_last_n(df, n=3, sequence_length=10):
    df = df.sort_values(by=['userId', 'timestamp'])
    test_rows = []
    train_rows = []
    for user_id, group in df.groupby('userId'):
        # For simplicity, only include users with more than n + sequence_length rated movies,
        # so that at least one row is left for training after the tail is held out.
        # There are other ways of managing this, e.g. padding the initial watches with
        # null/0 values, but no need for that in this exploratory setting.
        if len(group) <= n + sequence_length:
            continue
        else:
            test_rows.append(group.iloc[-n - sequence_length:])
            train_rows.append(group.iloc[:-n - sequence_length])
    return pd.concat(train_rows), pd.concat(test_rows)

in_user_train, in_user_test = hold_out_last_n(in_users_ratings, n=5)
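
The held-out tail is `n + sequence_length` rows long rather than just `n` because each of the last `n` targets needs a full 10-movie context in front of it. Here is a minimal pure-Python sketch of that idea for a single user. The helper name `split_tail` and the toy movie IDs are made up for illustration and are not part of the notebook.

```python
def split_tail(history, n=3, sequence_length=10):
    """Split one user's chronological history into (train, test) lists.

    Hypothetical single-user analogue of hold_out_last_n.
    Returns None when the history is too short to leave any training rows.
    """
    cut = n + sequence_length
    if len(history) <= cut:
        return None
    return history[:-cut], history[-cut:]


# Toy history: 20 made-up movie IDs in watch order.
history = list(range(100, 120))
train, test = split_tail(history, n=3, sequence_length=10)

# Sliding a 10-movie window over the test tail yields exactly n targets,
# and they are the user's last n movies.
targets = [test[i + 10] for i in range(len(test) - 10)]
print(len(train), len(test), targets)  # 7 13 [117, 118, 119]
```

None of the three targets appears in `train`, which is the leakage guarantee the split is meant to provide.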

Now I define the datasets and dataloaders that will pull the sequence of 10 past movies a user rated (independent/X data) and the next movie they rate (dependent/Y data).

In [10]:
import torch
from torch.utils.data import Dataset, DataLoader

# Treat movie predictions as a sequential process, grouped on the user.
class MovieRatingSequenceDataset(Dataset):
    def __init__(self, df, sequence_length=10):
        self.sequence_length = sequence_length
        self.samples = []

        # Sort ratings by user and timestamp
        df = df.sort_values(by=['userId', 'timestamp'])

        # Group by user and build sequences
        for user_id, group in df.groupby('userId'):
            movies = group['movieId'].tolist()

            if len(movies) <= sequence_length:
                continue

            for i in range(len(movies) - sequence_length):
                seq_movies = movies[i:i+sequence_length]
                target_movie = movies[i+sequence_length]

                self.samples.append((seq_movies, target_movie))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        seq_movies, target_movie = self.samples[idx]
        return {
            'movie_ids': torch.tensor(seq_movies, dtype=torch.long),
            'target_movie': torch.tensor(target_movie, dtype=torch.long)
        }
    
train_dataset = MovieRatingSequenceDataset(in_user_train)
train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True)

in_user_test_dataset = MovieRatingSequenceDataset(in_user_test)
in_user_test_dataloader = DataLoader(in_user_test_dataset, batch_size=128, shuffle=False)

out_user_test_dataset = MovieRatingSequenceDataset(out_user_ratings)
out_user_test_dataloader = DataLoader(out_user_test_dataset, batch_size=128, shuffle=False)
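
To make the windowing concrete, here is a small sketch of the same sliding-window logic on a toy list, using a window of 3 instead of 10 so the output stays readable. The function name `make_windows` and the IDs are illustrative only.

```python
def make_windows(movies, sequence_length=10):
    """Return (context, target) pairs from one chronological list of movie IDs."""
    return [
        (movies[i:i + sequence_length], movies[i + sequence_length])
        for i in range(len(movies) - sequence_length)
    ]


pairs = make_windows([5, 8, 2, 9, 4], sequence_length=3)
for context, target in pairs:
    print(context, "->", target)
# [5, 8, 2] -> 9
# [8, 2, 9] -> 4
```

A user with m rated movies contributes m - sequence_length samples, so heavy raters carry more weight in training than light ones.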

This model is a transformer-based encoder that learns patterns in the sequence of 10 movies watched. Each movie ID is embedded and passed through a linear layer, then summed with learned positional embeddings (sinusoidal positional embeddings would likely work well, too) and processed by a TransformerEncoder to model sequential and relational dependencies. The encoder output at the last position of the sequence is passed through a final linear layer to produce one logit per movie; the highest-scoring movie is the predicted next movie.

In [11]:
import torch.nn as nn
import torch.nn.functional as F

class MovieRatingTransformer(nn.Module):
    def __init__(self, num_movies, embedding_dim=64, nhead=4, num_layers=2, sequence_length=10):
        super().__init__()
        self.movie_embedding = nn.Embedding(num_movies, embedding_dim)
        self.input_dim = embedding_dim
        self.linear_in = nn.Linear(self.input_dim, embedding_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(embedding_dim, num_movies)
        self.pos_embedding = nn.Parameter(torch.randn(1, sequence_length, embedding_dim))

    def forward(self, movie_ids):
        movie_embeds = self.movie_embedding(movie_ids)
        x = self.linear_in(movie_embeds)
        x = x + self.pos_embedding[:, :x.size(1)]  # add positional encoding
        out = self.transformer(x)
        last_token = out[:, -1, :]
        pred = self.fc(last_token)
        return pred
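
For reference, the sinusoidal alternative mentioned above can be sketched as a fixed lookup table. This is a plain-Python illustration of the standard sine/cosine formulation, not something the notebook trains with; the function name `sinusoidal_table` is an assumption of this sketch.

```python
import math


def sinusoidal_table(sequence_length=10, embedding_dim=64):
    """Build a [sequence_length][embedding_dim] table of fixed positional encodings.

    Even columns hold sin(pos / 10000^(i/d)) and the following odd column holds
    the matching cos, where i is the even column index. Assumes embedding_dim is even.
    """
    table = []
    for pos in range(sequence_length):
        row = []
        for i in range(0, embedding_dim, 2):
            angle = pos / (10000 ** (i / embedding_dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


table = sinusoidal_table()
print(len(table), len(table[0]))  # 10 64
print(table[0][:4])               # [0.0, 1.0, 0.0, 1.0]
```

To try it in the model, one option would be to wrap the table with `torch.tensor(table).unsqueeze(0)` and store it via `register_buffer` in place of the `nn.Parameter`, so that it moves with the model to the GPU but is not updated by the optimizer.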

Below is the training and evaluation code. The model is trained with standard backpropagation, using a classification loss (CrossEntropyLoss) over all movies.
It is evaluated on the average loss and on top-1 accuracy, i.e., whether the model's highest-scoring movie is the movie the user actually rated next. As a baseline, the same accuracy is computed for always predicting the most popular movie in the training set.

In [None]:
import torch.optim as optim
from tqdm import tqdm

num_movies = ratings['movieId'].max() + 1
print(num_movies)

model = MovieRatingTransformer(num_movies)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model.to(device)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def evaluate(model, dataloader, most_popular_movie):
    """Calculate the loss and accuracy of the predictions"""
    model.eval()
    total_loss = 0
    correct = 0
    baseline_correct = 0
    total = 0

    with torch.no_grad():
        for batch in dataloader:
            movie_ids = batch['movie_ids'].to(device)
            target = batch['target_movie'].to(device)

            output = model(movie_ids)
            loss = criterion(output, target)
            total_loss += loss.item() * movie_ids.size(0)

            # Predictions
            preds = output.argmax(dim=1)
            correct += (preds == target).sum().item()
            baseline_correct += (most_popular_movie == target).sum().item()
            total += target.size(0)

    avg_loss = total_loss / len(dataloader.dataset)
    model_acc = correct / total
    baseline_acc = baseline_correct / total
    return avg_loss, model_acc, baseline_acc


# Training loop
epochs = 20
model.train()

most_popular_movie = in_user_train['movieId'].value_counts().idxmax()
most_popular_movie = torch.tensor(most_popular_movie).to(device)

for epoch in range(epochs):
    total_loss = 0
    model.train()
    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch+1}"):
        movie_ids = batch['movie_ids'].to(device)
        target = batch['target_movie'].to(device)

        optimizer.zero_grad()
        output = model(movie_ids)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * movie_ids.size(0)

    train_loss = total_loss / len(train_dataloader.dataset)
    in_loss, in_acc, in_base_acc = evaluate(model, in_user_test_dataloader, most_popular_movie)
    out_loss, out_acc, out_base_acc = evaluate(model, out_user_test_dataloader, most_popular_movie)
    train_loss, train_acc, train_base_acc = evaluate(model, train_dataloader, most_popular_movie)

    print(f"Epoch {epoch+1}: Train Loss: {train_loss:.4f}")
    print(f"  In-User:  Loss: {in_loss:.4f} | Acc: {in_acc:.4f} | Baseline Acc: {in_base_acc:.4f}")
    print(f"  Out-User: Loss: {out_loss:.4f} | Acc: {out_acc:.4f} | Baseline Acc: {out_base_acc:.4f}")
    print(f"  Train: Loss: {train_loss:.4f} | Acc: {train_acc:.4f} | Baseline Acc: {train_base_acc:.4f}")

45736
cuda


Epoch 1: 100%|██████████| 23729/23729 [06:48<00:00, 58.10it/s]


Epoch 1: Train Loss: 7.1932
  In-User:  Loss: 7.0835 | Acc: 0.0126 | Baseline Acc: 0.0034
  Out-User: Loss: 7.1756 | Acc: 0.0119 | Baseline Acc: 0.0022
  Train: Loss: 7.1932 | Acc: 0.0113 | Baseline Acc: 0.0020


Epoch 2: 100%|██████████| 23729/23729 [06:47<00:00, 58.27it/s]


Epoch 2: Train Loss: 6.9638
  In-User:  Loss: 6.8931 | Acc: 0.0171 | Baseline Acc: 0.0034
  Out-User: Loss: 6.9739 | Acc: 0.0163 | Baseline Acc: 0.0022
  Train: Loss: 6.9638 | Acc: 0.0156 | Baseline Acc: 0.0020


Epoch 3: 100%|██████████| 23729/23729 [06:17<00:00, 62.88it/s]


Epoch 3: Train Loss: 6.8386
  In-User:  Loss: 6.7880 | Acc: 0.0194 | Baseline Acc: 0.0034
  Out-User: Loss: 6.8706 | Acc: 0.0190 | Baseline Acc: 0.0022
  Train: Loss: 6.8386 | Acc: 0.0184 | Baseline Acc: 0.0020


Epoch 4: 100%|██████████| 23729/23729 [06:39<00:00, 59.36it/s]


Epoch 4: Train Loss: 6.7639
  In-User:  Loss: 6.7388 | Acc: 0.0217 | Baseline Acc: 0.0034
  Out-User: Loss: 6.8141 | Acc: 0.0211 | Baseline Acc: 0.0022
  Train: Loss: 6.7639 | Acc: 0.0208 | Baseline Acc: 0.0020


Epoch 5: 100%|██████████| 23729/23729 [06:37<00:00, 59.71it/s]


Epoch 5: Train Loss: 6.6987
  In-User:  Loss: 6.6867 | Acc: 0.0229 | Baseline Acc: 0.0034
  Out-User: Loss: 6.7696 | Acc: 0.0223 | Baseline Acc: 0.0022
  Train: Loss: 6.6987 | Acc: 0.0221 | Baseline Acc: 0.0020


Epoch 6: 100%|██████████| 23729/23729 [06:35<00:00, 59.92it/s]


Epoch 6: Train Loss: 6.6483
  In-User:  Loss: 6.6557 | Acc: 0.0260 | Baseline Acc: 0.0034
  Out-User: Loss: 6.7324 | Acc: 0.0245 | Baseline Acc: 0.0022
  Train: Loss: 6.6483 | Acc: 0.0239 | Baseline Acc: 0.0020


Epoch 7: 100%|██████████| 23729/23729 [06:30<00:00, 60.81it/s]


Epoch 7: Train Loss: 6.6178
  In-User:  Loss: 6.6444 | Acc: 0.0267 | Baseline Acc: 0.0034
  Out-User: Loss: 6.7208 | Acc: 0.0250 | Baseline Acc: 0.0022
  Train: Loss: 6.6178 | Acc: 0.0247 | Baseline Acc: 0.0020


Epoch 8: 100%|██████████| 23729/23729 [06:39<00:00, 59.34it/s]


Epoch 8: Train Loss: 6.5851
  In-User:  Loss: 6.6258 | Acc: 0.0279 | Baseline Acc: 0.0034
  Out-User: Loss: 6.7001 | Acc: 0.0260 | Baseline Acc: 0.0022
  Train: Loss: 6.5851 | Acc: 0.0258 | Baseline Acc: 0.0020


Epoch 9: 100%|██████████| 23729/23729 [06:37<00:00, 59.74it/s]


Epoch 9: Train Loss: 6.5569
  In-User:  Loss: 6.6074 | Acc: 0.0286 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6826 | Acc: 0.0265 | Baseline Acc: 0.0022
  Train: Loss: 6.5569 | Acc: 0.0263 | Baseline Acc: 0.0020


Epoch 10: 100%|██████████| 23729/23729 [06:35<00:00, 60.07it/s]


Epoch 10: Train Loss: 6.5375
  In-User:  Loss: 6.6017 | Acc: 0.0288 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6735 | Acc: 0.0272 | Baseline Acc: 0.0022
  Train: Loss: 6.5375 | Acc: 0.0269 | Baseline Acc: 0.0020


Epoch 11: 100%|██████████| 23729/23729 [06:43<00:00, 58.84it/s]


Epoch 11: Train Loss: 6.5198
  In-User:  Loss: 6.6005 | Acc: 0.0295 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6677 | Acc: 0.0277 | Baseline Acc: 0.0022
  Train: Loss: 6.5198 | Acc: 0.0276 | Baseline Acc: 0.0020


Epoch 12: 100%|██████████| 23729/23729 [06:43<00:00, 58.82it/s]


Epoch 12: Train Loss: 6.4931
  In-User:  Loss: 6.5812 | Acc: 0.0301 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6500 | Acc: 0.0285 | Baseline Acc: 0.0022
  Train: Loss: 6.4931 | Acc: 0.0286 | Baseline Acc: 0.0020


Epoch 13: 100%|██████████| 23729/23729 [06:40<00:00, 59.18it/s]


Epoch 13: Train Loss: 6.4790
  In-User:  Loss: 6.5795 | Acc: 0.0312 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6435 | Acc: 0.0287 | Baseline Acc: 0.0022
  Train: Loss: 6.4790 | Acc: 0.0289 | Baseline Acc: 0.0020


Epoch 14: 100%|██████████| 23729/23729 [06:56<00:00, 56.93it/s]


Epoch 14: Train Loss: 6.4635
  In-User:  Loss: 6.5708 | Acc: 0.0300 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6373 | Acc: 0.0289 | Baseline Acc: 0.0022
  Train: Loss: 6.4635 | Acc: 0.0291 | Baseline Acc: 0.0020


Epoch 15: 100%|██████████| 23729/23729 [06:42<00:00, 58.96it/s]


Epoch 15: Train Loss: 6.4546
  In-User:  Loss: 6.5605 | Acc: 0.0318 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6329 | Acc: 0.0292 | Baseline Acc: 0.0022
  Train: Loss: 6.4546 | Acc: 0.0297 | Baseline Acc: 0.0020


Epoch 16: 100%|██████████| 23729/23729 [06:41<00:00, 59.04it/s]


Epoch 16: Train Loss: 6.4366
  In-User:  Loss: 6.5664 | Acc: 0.0315 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6229 | Acc: 0.0289 | Baseline Acc: 0.0022
  Train: Loss: 6.4366 | Acc: 0.0297 | Baseline Acc: 0.0020


Epoch 17: 100%|██████████| 23729/23729 [06:41<00:00, 59.03it/s]


Epoch 17: Train Loss: 6.4228
  In-User:  Loss: 6.5519 | Acc: 0.0321 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6176 | Acc: 0.0298 | Baseline Acc: 0.0022
  Train: Loss: 6.4228 | Acc: 0.0302 | Baseline Acc: 0.0020


Epoch 18: 100%|██████████| 23729/23729 [06:48<00:00, 58.08it/s]


Epoch 18: Train Loss: 6.4080
  In-User:  Loss: 6.5465 | Acc: 0.0313 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6130 | Acc: 0.0303 | Baseline Acc: 0.0022
  Train: Loss: 6.4080 | Acc: 0.0305 | Baseline Acc: 0.0020


Epoch 19: 100%|██████████| 23729/23729 [06:37<00:00, 59.73it/s]


Epoch 19: Train Loss: 6.4010
  In-User:  Loss: 6.5393 | Acc: 0.0325 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6116 | Acc: 0.0301 | Baseline Acc: 0.0022
  Train: Loss: 6.4010 | Acc: 0.0310 | Baseline Acc: 0.0020


Epoch 20: 100%|██████████| 23729/23729 [06:26<00:00, 61.44it/s]


Epoch 20: Train Loss: 6.3947
  In-User:  Loss: 6.5559 | Acc: 0.0321 | Baseline Acc: 0.0034
  Out-User: Loss: 6.6124 | Acc: 0.0303 | Baseline Acc: 0.0022
  Train: Loss: 6.3947 | Acc: 0.0309 | Baseline Acc: 0.0020
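
For scale, the logged cross-entropy values can be compared with what a model that guesses uniformly over all 45,736 movies would score. The short calculation below is only a reference point and uses numbers copied from the output above.

```python
import math

num_movies = 45736  # printed by the training cell

# Cross-entropy of a model that spreads probability evenly over all movies.
uniform_loss = math.log(num_movies)

# Final out-user loss, copied from the epoch 20 log line.
out_user_loss = 6.6124

# exp(loss) is the perplexity: the size of a uniform candidate set
# that would produce the same average loss.
perplexity = math.exp(out_user_loss)

print(f"uniform-guess loss:  {uniform_loss:.2f}")
print(f"out-user perplexity: {perplexity:.0f}")
```

The uniform-guess loss comes out at roughly 10.7, so a loss near 6.6 means the model is, on average, about as uncertain as if it were choosing among several hundred movies rather than the full catalogue.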


After 20 epochs, the model performs about equally well on the train, in-user, and out-user datasets (3.09%, 3.21%, and 3.03% accuracy, respectively), suggesting it has not overfit the data.
Furthermore, it has clearly found some patterns of relevance: simply predicting the most popular movie (The Shawshank Redemption) as the next movie rated yields accuracies of only 0.34% (in-user) and 0.22% (out-user) on the test sets, while the model reaches 3.21% and 3.03% respectively, roughly a 9- to 14-fold increase in next-movie accuracy!
This shows that a model that only knows what a user has done, without knowing anything about what the thing being done actually is, can predict what the user will do next at a rate far better than both random chance and the popularity baseline.
I'm not looking into it here, but it's quite likely that the model's 2nd, 3rd, 4th, 5th, etc. most likely movies are better than random chance as well.
This shows the utility and validity of content prediction (and recommendation, which is really the same thing) using only a sequence of user interactions. This method doesn't need internal resources to tag/annotate content, nor does it need users to take time rating what they've done or defining what they expect: a huge time save and likely a significant benefit to user experience as well!
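
One way to check the claim about the 2nd through 5th most likely movies would be a hit rate at k: the fraction of samples whose true next movie appears among the model's k highest-scoring movies. The sketch below shows the metric in plain Python on made-up scores; `hit_rate_at_k` is a hypothetical helper, and I have not run it on this model.

```python
def hit_rate_at_k(scores, targets, k=5):
    """Fraction of rows whose target index is among the k highest-scoring indices."""
    hits = 0
    for row, target in zip(scores, targets):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Made-up scores for 3 samples over a 4-movie catalogue.
scores = [
    [0.1, 2.0, 0.3, 1.5],  # ranking: 1, 3, 2, 0
    [0.9, 0.2, 0.8, 0.1],  # ranking: 0, 2, 1, 3
    [0.0, 0.1, 0.2, 3.0],  # ranking: 3, 2, 1, 0
]
targets = [3, 1, 0]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(scores, targets, k=k))
```

Inside `evaluate`, the same top-k indices could be obtained on the GPU with `output.topk(k, dim=1).indices` and compared against `target.unsqueeze(1)`.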