# Movie Model Training Script

The script below shows the data preprocessing and training steps we used to develop our machine learning model. Our main challenge in this project was gaining hands-on experience with machine learning, a field in which we initially had minimal expertise. We recognize that the implementation is rudimentary and somewhat naive, particularly the prediction approach, which may not be the most effective strategy. Nevertheless, we are proud to have working models, even though their predictive accuracy leaves room for improvement. Note also that we relied extensively on ChatGPT for the model training procedures, given our limited prior knowledge.


Note: the script below uses only a small part of our dataset (10,000 movies), since training on the full dataset and saving the resulting model would have taken too much time and storage on our local machines. The script therefore serves as a demonstration that you can run and test on your own device.
We trained the model on the whole dataset on a different machine with the same code (adjusted only for the dataset size). We uploaded that model to Hugging Face and fetched it from there later in the project.

### Data loading and Preprocessing

In the following two code cells we load and preprocess our data. The goal is to train a model that predicts similar movies based on genre, which can then be used in the recommendation system. To that end, each movie is paired with a randomly chosen movie of the same genre, and these pairs form the training dataset.
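The pairing idea can be illustrated on a toy catalogue. This is a minimal sketch under stated assumptions, not the notebook's own cell: the names `toy_genres` and `build_genre_pairs` are hypothetical, the list position is taken to be the movie ID, and movies whose genre occurs only once are skipped because they have no possible partner.

```python
import random

# Toy catalogue: list position = movie ID, value = genre (hypothetical data).
toy_genres = ["Action", "Drama", "Action", "Comedy", "Drama", "Action"]

def build_genre_pairs(genres_by_id, seed=0):
    """Pair every movie with a random, different movie of the same genre."""
    rng = random.Random(seed)

    # Group movie IDs by genre.
    ids_by_genre = {}
    for movie_id, genre in enumerate(genres_by_id):
        ids_by_genre.setdefault(genre, []).append(movie_id)

    pairs = []
    for movie_id, genre in enumerate(genres_by_id):
        # Candidate partners: same genre, excluding the movie itself.
        candidates = [other for other in ids_by_genre[genre] if other != movie_id]
        if not candidates:
            continue  # a genre with a single movie yields no pair
        pairs.append((movie_id, rng.choice(candidates)))
    return pairs

pairs = build_genre_pairs(toy_genres)
print(pairs)
```

Each tuple is a (preferred, similar) pair. The two components correspond to the two ID lists that `MoviePairDataset` expects.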

In [66]:
from torch.utils.data import Dataset, DataLoader
import os
import pandas as pd
import numpy as np
import torch

class MoviePairDataset(Dataset):
    def __init__(self, preferred_movie_ids, similar_movie_ids):
        """
        Args:
            preferred_movie_ids (list of int): The IDs of the preferred movies.
            similar_movie_ids (list of int): The IDs of movies similar to the preferred ones.
        """
        assert len(preferred_movie_ids) == len(similar_movie_ids), "The lists must have the same length."
        self.preferred_movie_ids = preferred_movie_ids
        self.similar_movie_ids = similar_movie_ids
        
    def __len__(self):
        return len(self.preferred_movie_ids)
        
    def __getitem__(self, idx):
        preferred_id = torch.tensor(self.preferred_movie_ids[idx], dtype=torch.long)
        similar_id = torch.tensor(self.similar_movie_ids[idx], dtype=torch.long)
        return preferred_id, similar_id

In [67]:
# size of the dataset
num_movie_ids = 10000 

# path to the dataset folder
data_folder = "../data/"

# list all CSV files in the data directory
csv_files = [f for f in os.listdir(data_folder) if f.endswith('.csv')]

# load all CSV files into a single DataFrame
all_movies_df = pd.concat(
    (pd.read_csv(f"{data_folder}{file}") for file in csv_files),
    ignore_index=True
)[:num_movie_ids]

# thresholds for rating/year similarity (defined here, but not used by the genre-based pairing below)
RATING_DIFF_THRESHOLD = 1.0  # movies within this rating difference are considered similar
YEAR_DIFF_THRESHOLD = 5  # movies within this range of years are considered similar

# initialize lists to store the IDs of the preferred and similar movies
preferred_movie_ids = []
similar_movie_ids = []

import random

# group the row indices of the movies by genre: genre -> list of row indices
genres = {}
for row_idx, genre in enumerate(all_movies_df['genre']):
    genres.setdefault(genre, []).append(row_idx)
print('Dict by genre created')

# create pairs: each movie + a random, different movie of the same genre
for row_idx, genre in enumerate(all_movies_df['genre']):
    candidates = genres.get(genre, [])  # row indices of all movies of the same genre
    if len(candidates) < 2:
        continue  # no other movie shares this genre, so no pair can be formed
    similar_idx = row_idx
    while similar_idx == row_idx:
        similar_idx = random.choice(candidates)  # resample until we get a different movie
    preferred_movie_ids.append(row_idx)  # add the preferred movie to the list of preferred movies
    similar_movie_ids.append(similar_idx)  # add the random movie to the list of similar movies

# wrap the training pairs in a Dataset and a DataLoader
dataset = MoviePairDataset(preferred_movie_ids=preferred_movie_ids, similar_movie_ids=similar_movie_ids)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

# evaluation data: identical to the training pairs (a 90%/10% split would be the proper setup, but it performs worse, so the training pairs are reused for this demo)
eval_dataset = MoviePairDataset(preferred_movie_ids=preferred_movie_ids, similar_movie_ids=similar_movie_ids)
eval_dataloader = DataLoader(eval_dataset, batch_size=8, shuffle=True, num_workers=0)

Dict by genre created
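The comment in the cell above mentions a 90%/10% split that the demo does not perform. A minimal sketch of such a split is shown below. The helper `split_pairs` and the toy ID lists are hypothetical; the notebook itself reuses the training pairs for evaluation.

```python
import random

def split_pairs(preferred_ids, similar_ids, eval_fraction=0.1, seed=0):
    """Shuffle the (preferred, similar) pairs and split them into train and eval parts."""
    pairs = list(zip(preferred_ids, similar_ids))
    random.Random(seed).shuffle(pairs)
    n_eval = int(len(pairs) * eval_fraction)
    eval_pairs, train_pairs = pairs[:n_eval], pairs[n_eval:]
    return train_pairs, eval_pairs

# Toy stand-ins for preferred_movie_ids / similar_movie_ids.
toy_preferred = list(range(20))
toy_similar = [(i + 1) % 20 for i in range(20)]

train_pairs, eval_pairs = split_pairs(toy_preferred, toy_similar)
train_preferred, train_similar = map(list, zip(*train_pairs))
eval_preferred, eval_similar = map(list, zip(*eval_pairs))
print(len(train_pairs), len(eval_pairs))  # 18 2
```

The resulting lists could be passed to two separate `MoviePairDataset` instances. Because each target is a randomly drawn same-genre movie, held-out pairs are hard to predict exactly, which is consistent with the remark that a real split performs worse.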


### Neural Network Model Definition

The following code defines a PyTorch neural network named MovieIDPredictor. It embeds the input movie ID, passes the embedding through a Transformer encoder, and maps the result to one score (logit) per movie ID.

In [68]:
import torch
import torch.nn as nn
import torch.optim as optim

class MovieIDPredictor(nn.Module):
    def __init__(self, num_movie_ids, movie_id_embedding_dim=64, transformer_heads=8, transformer_layers=1, transformer_dim=64):
        super(MovieIDPredictor, self).__init__()
        self.movie_id_embedding_dim = movie_id_embedding_dim # Dimension of the movie ID embeddings
        self.transformer_dim = transformer_dim # Dimension of the transformer output

        # the embeddings are fed directly into the transformer, so both dimensions must match
        assert movie_id_embedding_dim == transformer_dim, "movie_id_embedding_dim must equal transformer_dim"
        self.movie_id_embedding = nn.Embedding(num_movie_ids, self.movie_id_embedding_dim)

        # transformer layer (batch_first=True so that inputs are read as (batch, sequence, feature))
        self.transformer = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=self.transformer_dim, nhead=transformer_heads, batch_first=True), num_layers=transformer_layers)
        
        # output layer: one logit per movie ID
        self.fc_out = nn.Linear(self.transformer_dim, num_movie_ids)
        
    def forward(self, movie_id):
        movie_id_emb = self.movie_id_embedding(movie_id).view(-1, 1, self.movie_id_embedding_dim) # embedding layer
        x = self.transformer(movie_id_emb) # transformer
        x = x.view(-1, self.transformer_dim) # flatten the output
        output = self.fc_out(x) # output layer
        return output
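To make the tensor shapes in `forward` concrete, here is a small self-contained sketch with toy sizes (50 IDs, 16 dimensions; both are assumptions chosen for illustration). It sets `batch_first=True`, which is an assumption of this sketch: PyTorch's default is `batch_first=False`, in which case a tensor of shape (batch, 1, dim) is interpreted as (sequence, batch, dim) and the movies of one batch attend to each other. With `batch_first=True` every movie is its own length-1 sequence, so its logits do not depend on the rest of the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_ids, dim = 50, 16  # toy sizes, not the notebook's 10000 / 64

embedding = nn.Embedding(num_ids, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=32, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)
fc_out = nn.Linear(dim, num_ids)
encoder.eval()  # disable dropout so that outputs are deterministic

def forward(ids):
    x = embedding(ids).unsqueeze(1)  # (batch, 1, dim): one length-1 sequence per movie
    x = encoder(x).squeeze(1)        # (batch, dim)
    return fc_out(x)                 # (batch, num_ids): one logit per movie ID

with torch.no_grad():
    batch_logits = forward(torch.tensor([3, 7, 11, 42]))
    single_logits = forward(torch.tensor([7]))

logits_shape = tuple(batch_logits.shape)
# The logits for ID 7 are the same whether it is scored alone or inside a batch.
same = torch.allclose(batch_logits[1], single_logits[0], atol=1e-4)
print(logits_shape, same)
```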


### Training Loop

The following code is for the setup and execution of a training loop for the neural network model designed to predict similar movie IDs from preferred movie IDs. The training loop leverages supervised learning as it uses a dataset that contains both the inputs and the corresponding target outputs. 

In [69]:
# set the device to be used for training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# model initialization
model = MovieIDPredictor(num_movie_ids=num_movie_ids, movie_id_embedding_dim=64, transformer_heads=8, transformer_layers=1, transformer_dim=64)

model.to(device) # move the model to the device

# loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0005)

# training loop setup
num_epochs = 30

lowest_val_loss = float('inf') # to track the lowest validation loss
best_model_state = None  # to save the best model state

for epoch in range(num_epochs):
    model.train()  # set the model to training mode
    train_loss = 0.0
    # training step
    for preferred_id, similar_id in dataloader:
        preferred_id, similar_id = preferred_id.to(device), similar_id.to(device)
        optimizer.zero_grad()
        outputs = model(preferred_id)
        loss = criterion(outputs, similar_id)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    avg_train_loss = train_loss / len(dataloader) # average training loss

    # evaluation step
    model.eval()  # set the model to evaluation mode
    val_loss = 0.0
    with torch.no_grad():
        for preferred_id, similar_id in eval_dataloader:
            preferred_id, similar_id = preferred_id.to(device), similar_id.to(device)
            outputs = model(preferred_id)
            loss = criterion(outputs, similar_id)
            val_loss += loss.item()

    avg_val_loss = val_loss / len(eval_dataloader) # average validation loss

    print(f'Epoch {epoch+1}, Training Loss: {avg_train_loss:.4f}, Validation Loss: {avg_val_loss:.4f}')

    # check if the current validation loss is the lowest
    if avg_val_loss < lowest_val_loss:
        print(f'Validation loss decreased ({lowest_val_loss:.4f} --> {avg_val_loss:.4f}). Saving model ...')
        lowest_val_loss = avg_val_loss
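        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}  # copy, since state_dict() returns references to the live parameters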

best_model_state_cpu = {k: v.cpu() for k, v in best_model_state.items()} # move the best model state to the CPU
torch.save(best_model_state_cpu, 'movie_predictor_model.pth') # save the best model state to a file



Epoch 1, Training Loss: 9.4150, Validation Loss: 8.9467
Validation loss decreased (inf --> 8.9467). Saving model ...
Epoch 2, Training Loss: 8.6678, Validation Loss: 8.0072
Validation loss decreased (8.9467 --> 8.0072). Saving model ...
Epoch 3, Training Loss: 7.9888, Validation Loss: 7.0574
Validation loss decreased (8.0072 --> 7.0574). Saving model ...
Epoch 4, Training Loss: 7.0418, Validation Loss: 5.7033
Validation loss decreased (7.0574 --> 5.7033). Saving model ...
Epoch 5, Training Loss: 5.7061, Validation Loss: 3.9427
Validation loss decreased (5.7033 --> 3.9427). Saving model ...
Epoch 6, Training Loss: 4.0778, Validation Loss: 2.0817
Validation loss decreased (3.9427 --> 2.0817). Saving model ...
Epoch 7, Training Loss: 2.4160, Validation Loss: 0.7742
Validation loss decreased (2.0817 --> 0.7742). Saving model ...
Epoch 8, Training Loss: 1.0816, Validation Loss: 0.2092
Validation loss decreased (0.7742 --> 0.2092). Saving model ...
Epoch 9, Training Loss: 0.3822, Validation 
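As a rough sanity check on the logged losses: a classifier over 10,000 movie IDs that has not learned anything yet should have a cross-entropy close to ln(10000), which is about where epoch 1 starts (9.4150). The short calculation below spells this out; the variable names are illustrative only.

```python
import math

num_classes = 10000  # num_movie_ids in the notebook

# With uniform scores, softmax assigns probability 1/num_classes to every movie ID,
# so the cross-entropy of any target is -ln(1/num_classes) = ln(num_classes).
uniform_probability = 1.0 / num_classes
chance_level_loss = -math.log(uniform_probability)
print(f"{chance_level_loss:.4f}")  # 9.2103
```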

### Load Model Weights

Load the saved model weights so that the model can be validated in the next step.

In [70]:
# load the model
loaded_model = MovieIDPredictor(num_movie_ids=num_movie_ids)

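# load the state dictionary (the training cell writes 'movie_predictor_model.pth' to the working directory; this path assumes the file was moved to ../models/)
loaded_model.load_state_dict(torch.load('../models/movie_predictor_model.pth')) 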

# set the model to evaluation mode
loaded_model.eval()

MovieIDPredictor(
  (movie_id_embedding): Embedding(10000, 64)
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
        )
        (linear1): Linear(in_features=64, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=64, bias=True)
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (fc_out): Linear(in_features=64, out_features=10000, bias=True)
)

### Validation

In [71]:
correct = 0
total = 0
with torch.no_grad():  # no need to track gradients during evaluation
    for preferred_id, similar_id in eval_dataloader:  # evaluation pairs (identical to the training pairs in this demo)
        outputs = loaded_model(preferred_id)
        _, predicted = torch.max(outputs, 1)  # index of the highest logit = predicted movie ID
        total += similar_id.size(0)
        correct += (predicted == similar_id).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy of the model on the evaluation dataset: {accuracy:.2f}%')


Accuracy of the model on the evaluation dataset: 100.00%
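The accuracy above is measured on the same pairs the model was trained on, so it mainly shows that the model has memorized them. Since the stated goal is genre similarity, one complementary check (not part of the original notebook) is the share of recommended movies that have the same genre as the query movie. The sketch below uses a hypothetical helper `genre_hit_rate` and toy data.

```python
def genre_hit_rate(recommendations, genre_of):
    """Fraction of recommended movies that share the query movie's genre."""
    hits = 0
    total = 0
    for query_id, recommended_ids in recommendations.items():
        for rec_id in recommended_ids:
            total += 1
            hits += genre_of[rec_id] == genre_of[query_id]
    return hits / total if total else 0.0

# Hypothetical toy data: movie ID -> genre, and query ID -> recommended IDs.
toy_genre_of = {0: "Action", 1: "Action", 2: "Drama", 3: "Drama", 4: "Action"}
toy_recommendations = {0: [1, 4], 2: [3, 0]}

rate = genre_hit_rate(toy_recommendations, toy_genre_of)
print(rate)  # 0.75
```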


### Example of a Movie Recommendation

The following code shows an example movie recommendation produced by the model trained above.

In [73]:
import torch

def predict(movie_id, n):
    """
    Predicts the top n recommended movie_ids for a given movie_id.

    Args:
    movie_id (int): The movie ID for which recommendations are to be made.
    n (int): The number of recommendations to return.

    Returns:
    list: A list of the top n recommended movie IDs.
    """
    # convert movie_id to a tensor and add a batch dimension (batch size = 1)
    movie_id_tensor = torch.tensor([movie_id], dtype=torch.long)
    
    # ensure the model is in evaluation mode
    loaded_model.eval()
    
    with torch.no_grad():  # Inference doesn't require gradient calculation
        # get model output for the given movie_id
        outputs = loaded_model(movie_id_tensor)
        
        # take the n + 1 highest-scoring IDs so that n remain even if the input movie is among them
        _, recommended_ids = torch.topk(outputs, n + 1, dim=1)
        
        # convert to a list and drop the input movie_id if it was recommended
        recommended_ids = [rid for rid in recommended_ids[0].tolist() if rid != movie_id]

    return recommended_ids[:n]

movie_id = 20  # example movie ID
n = 5  # number of recommendations
recommended_movie_ids = predict(movie_id, n) # get recommendations
print(f"Recommended Movie IDs for Movie ID {movie_id}: {recommended_movie_ids}")


Recommended Movie IDs for Movie ID 20: [5350, 7299, 7324, 8470, 3605]


### Upload Model to Hugging Face

After training the model on the whole dataset, we uploaded it to Hugging Face, since the model file would be too large to push to GitHub. The following cells log in to Hugging Face and create the model repository.

In [None]:
from huggingface_hub import create_repo
from huggingface_hub import notebook_login

notebook_login()

In [None]:
create_repo("movie_match_model", private=False)
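The cells above only log in and create the repository. Uploading the weights and fetching them later could look roughly like the sketch below, which uses `HfApi.upload_file` and `hf_hub_download` from `huggingface_hub`. The `your-username` part of the repository ID is a placeholder, and the helper functions `upload_model` and `download_model` are hypothetical names, not part of the original project.

```python
from huggingface_hub import HfApi, hf_hub_download

# Placeholder: replace "your-username" with the account that owns the repository.
REPO_ID = "your-username/movie_match_model"
MODEL_FILENAME = "movie_predictor_model.pth"

def upload_model(local_path=MODEL_FILENAME, repo_id=REPO_ID):
    """Upload the saved state dict to the model repository on the Hub."""
    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=MODEL_FILENAME,
        repo_id=repo_id,
        repo_type="model",
    )

def download_model(repo_id=REPO_ID):
    """Download the state dict from the Hub and return its local cache path."""
    return hf_hub_download(repo_id=repo_id, filename=MODEL_FILENAME)
```

The path returned by `download_model()` can then be passed to `torch.load` and `load_state_dict`, as in the "Load Model Weights" cell.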