# Recommendation System - Movies

I am going to build a recommendation system for movies using the MovieLens dataset (the `ml-latest-small` release). The dataset contains about 100,000 ratings and 3,600 tag applications applied to roughly 9,000 movies by about 600 users. It was generated on September 26, 2018. Users were selected at random for inclusion, and all selected users had rated at least 20 movies. Each user is represented by an id, and no other information is provided.

## Credits
Citation:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872<br>
Source of knowledge and ideas: https://www.youtube.com/watch?v=G4MBc40rQ2k&t=315s

In [28]:
import pandas as pd

movies_df = pd.read_csv('./data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('./data/ml-latest-small/ratings.csv')


In [3]:
print(f"The dimensions of the movies dataframe are: {movies_df.shape}")
print(f"The dimensions of the ratings dataframe are: {ratings_df.shape}")

The dimensions of the movies dataframe are: (9742, 3)
The dimensions of the ratings dataframe are: (100836, 4)


In [4]:
# Let's take a look at the movies dataframe
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
# Let's take a look at the ratings dataframe
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Comment
As we can see, the same userId appears in several rows of the ratings dataframe. This is expected, because each row holds one rating of one movie by one user, and every user rated at least 20 movies. <br>
## Movies Dataframe
The movies dataframe has 3 columns: movieId, title and genres. The movieId is a unique identifier for each movie. The title is the name of the movie. The genres column contains a list of genres separated by a pipe (|). <br>
## Ratings Dataframe
The ratings dataframe has 4 columns: userId, movieId, rating and timestamp. The userId is a unique identifier for each user. The movieId is a unique identifier for each movie. The rating is the score given by the user to the movie. The timestamp is the time when the user rated the movie. <br>
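
The timestamp column holds plain integers. A minimal sketch of turning one into a readable date, assuming the values are seconds since midnight UTC on January 1, 1970 (the usual Unix convention; the notebook itself does not state this), could look like this:

```python
from datetime import datetime, timezone

# Timestamp copied from the first row of the ratings dataframe shown above.
timestamp = 964982703

# Assumption: the value counts seconds since 1970-01-01 00:00:00 UTC.
rated_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(rated_at.isoformat())
```

Under that assumption, the first rating in the preview was given in July 2000.
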


## Theoretical Background
### Collaborative Filtering
Collaborative filtering relies on user behavior and preferences. It makes recommendations based on the idea that if two users have agreed on many items in the past, they are likely to agree on future items as well. There are two types: <br>
- User-based collaborative filtering: Recommends items that similar users have liked.<br>
- Item-based collaborative filtering: Recommends items that are similar to the ones the user has liked or interacted with in the past.<br>
Example: If user A and user B both liked movies X and Y, and user A liked movie Z, the system might recommend movie Z to user B.
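
The example above can be sketched with a tiny, made-up rating matrix. The users, movies and rating values below are hypothetical, and cosine similarity with a similarity-weighted average is only one of several reasonable choices for user-based collaborative filtering:

```python
import numpy as np

# Hypothetical toy ratings: rows are users A, B, C; columns are movies X, Y, Z.
# 0 marks a movie the user has not rated.
ratings = np.array([
    [5.0, 4.0, 5.0],  # user A
    [5.0, 4.0, 0.0],  # user B (has not rated Z)
    [1.0, 5.0, 2.0],  # user C
])

def cosine_similarity(a, b):
    """Cosine similarity computed only over movies both users rated."""
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    return float(a[both] @ b[both] / (np.linalg.norm(a[both]) * np.linalg.norm(b[both])))

target_user, target_item = 1, 2  # predict user B's rating for movie Z

# Neighbours are the other users who have rated the target movie.
others = [u for u in range(len(ratings)) if u != target_user and ratings[u, target_item] > 0]
weights = np.array([cosine_similarity(ratings[target_user], ratings[u]) for u in others])
neighbour_ratings = np.array([ratings[u, target_item] for u in others])

# Similarity-weighted average of the neighbours' ratings for the target movie.
predicted = float(weights @ neighbour_ratings / weights.sum())
print(f"Similarity of B to A: {weights[0]:.3f}, to C: {weights[1]:.3f}")
print(f"Predicted rating of user B for movie Z: {predicted:.2f}")
```

Because user B agrees with user A on X and Y much more than with user C, the prediction for Z is pulled towards user A's rating.
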
### Content-Based Filtering
Content-based filtering recommends items based on the attributes of the items themselves and the user's past behavior. It focuses on the features of items (such as genre, author, keywords, etc.) and suggests items similar to those the user has shown interest in. <br>
Example: If a user likes action movies, the system might recommend more action movies based on their genre, director, or cast.
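
As a rough illustration, content-based filtering on genres alone can be sketched with the five movies from the preview of the movies dataframe. The Jaccard similarity and the helper names `jaccard` and `most_similar` are assumptions made for this sketch and are not used elsewhere in the notebook:

```python
# Genre strings copied from the movies dataframe preview above.
movies = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
    "Waiting to Exhale (1995)": "Comedy|Drama|Romance",
    "Father of the Bride Part II (1995)": "Comedy",
}

# Split each pipe-separated string into a set of genres.
genre_sets = {title: set(genres.split("|")) for title, genres in movies.items()}

def jaccard(a, b):
    """Share of genres two movies have in common (intersection over union)."""
    return len(a & b) / len(a | b)

def most_similar(liked_title):
    """Rank all other movies by genre overlap with the liked movie."""
    liked = genre_sets[liked_title]
    scores = [(jaccard(liked, g), t) for t, g in genre_sets.items() if t != liked_title]
    return sorted(scores, reverse=True)

ranking = most_similar("Toy Story (1995)")
for score, title in ranking:
    print(f"{score:.2f}  {title}")
```

No ratings from other users are involved here: the ranking depends only on the genres of the movie the user liked.
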
### Key Difference:
- Collaborative filtering leverages user preferences and interactions.
- Content-based filtering uses item features and user history to make recommendations.
### What are we going to do with our dataset?
We will primarily use collaborative filtering, specifically matrix factorization.
## Matrix Factorization
Matrix factorization is a technique often used in collaborative filtering methods for recommendation systems, where the goal is to predict missing entries in a matrix, typically a user-item interaction matrix (e.g., user ratings of movies, products, etc.). The idea is to break down the matrix into two lower-dimensional matrices, called factors or embeddings, that capture the latent features underlying the interactions between users and items.
### In the context of recommendation systems:
- The matrix: This is typically a large, sparse matrix, where rows represent users, columns represent items (e.g., movies), and the entries are ratings (or interactions) provided by users to items.
- The goal: The objective of matrix factorization is to approximate the original matrix by decomposing it into two smaller matrices that can be multiplied together to predict the missing ratings.
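
To make the idea concrete, here is a minimal NumPy sketch on a made-up 4x5 rating matrix. The matrix `R`, the factor matrices `U` and `V`, the number of factors `k` and the learning rate are all hypothetical, and the sketch uses plain gradient descent rather than the PyTorch model trained later in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy rating matrix (4 users x 5 movies); np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0, np.nan],
    [4.0, np.nan, 5.0, 1.0, 2.0],
    [np.nan, 1.0, 2.0, 5.0, 4.0],
    [1.0, 2.0, np.nan, 4.0, 5.0],
])
observed = ~np.isnan(R)
R_filled = np.where(observed, R, 0.0)

k = 2  # number of latent factors
U = rng.uniform(0.0, 0.5, size=(R.shape[0], k))  # user factors, one row per user
V = rng.uniform(0.0, 0.5, size=(R.shape[1], k))  # item factors, one row per movie

learning_rate = 0.005
losses = []
for step in range(5000):
    # Error only on observed entries; missing ratings contribute nothing.
    error = (U @ V.T - R_filled) * observed
    losses.append(float((error ** 2).sum() / observed.sum()))
    # Gradients of the summed squared error with respect to both factor matrices.
    grad_U = 2 * error @ V
    grad_V = 2 * error.T @ U
    U -= learning_rate * grad_U
    V -= learning_rate * grad_V

# The product of the two small matrices is dense, so it also fills the missing entries.
prediction = U @ V.T
print(f"MSE on observed ratings: first step {losses[0]:.3f}, last step {losses[-1]:.3f}")
print(np.round(prediction, 2))
```

The two factor matrices have shapes (4, 2) and (5, 2), yet their product has a value for every user-movie pair, including the ones that were missing in `R`.
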

In [25]:
# Let's provide some more analysis
movie_names = movies_df.set_index('movieId')['title'].to_dict()
print(f"movie_name_ 1: {movie_names[1]}")
number_of_users = len(ratings_df['userId'].unique())
number_of_movies = len(ratings_df['movieId'].unique())

print(f"Number of users: {number_of_users}")
print(f"Number of movies: {number_of_movies}")
print(f"The full rating matrix would be {number_of_users}x{number_of_movies}\n")

print(f"Number of ratings: {len(ratings_df)}")
print(f"So, {len(ratings_df) / (number_of_users * number_of_movies) * 100:.2f}% of the rating matrix is filled\n")

movie_name_ 1: Toy Story (1995)
Number of users: 610
Number of movies: 9724
The full rating matrix would be 610x9724

Number of ratings: 100836
So, 1.70% of the rating matrix is filled



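We have a pretty sparse matrix: fewer than 2% of its entries are known<br>
In addition, the full matrix has one entry per (user, movie) pair, so its size grows with the product of the number of users and the number of movies<br>
Matrix factorization avoids storing the full matrix: it keeps only two much smaller factor matrices and uses them to predict the missing entries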
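
A quick back-of-the-envelope comparison, using the counts printed above and 8 latent factors (the value passed to the model later in this notebook), shows how much smaller the factorized representation is. The variable names are only for this sketch:

```python
# Counts taken from the output above.
num_users, num_movies, num_ratings = 610, 9724, 100836
latent_factors = 8  # same value as used for the model later in this notebook

# A dense rating matrix needs one entry per (user, movie) pair.
full_matrix_entries = num_users * num_movies

# The factorization stores one vector of length `latent_factors` per user and per movie.
factor_parameters = (num_users + num_movies) * latent_factors

print(f"Entries in the full matrix: {full_matrix_entries:,}")
print(f"Parameters in the two factor matrices: {factor_parameters:,}")
print(f"Ratio: {full_matrix_entries / factor_parameters:.1f}x")
```
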

In [7]:
import torch as th
import numpy as np


class CollaborativeFiltering(th.nn.Module):
    def __init__(self, num_users, num_items, latent_factors=20):
        super().__init__()
        # Define user-specific latent factors (embeddings)
        self.user_embeddings = th.nn.Embedding(num_users, latent_factors)
        # Define item-specific latent factors (embeddings)
        self.item_embeddings = th.nn.Embedding(num_items, latent_factors)
        
        # Initialize the embeddings with small random values (uniform on [0, 0.05))
        self.user_embeddings.weight.data.uniform_(0, 0.05)
        self.item_embeddings.weight.data.uniform_(0, 0.05)
        
    def forward(self, interaction_data):
        user_indices, item_indices = interaction_data[:, 0], interaction_data[:, 1]
        user_factors = self.user_embeddings(user_indices)
        item_factors = self.item_embeddings(item_indices)
        return (user_factors * item_factors).sum(1)
    
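    def predict(self, user_idx, item_idx):
        # Build the index tensor on the same device as the model weights,
        # so that predict also works after the model was moved to the GPU
        device = self.user_embeddings.weight.device
        interaction_data = th.tensor([[user_idx, item_idx]], dtype=th.long, device=device)
        return self.forward(interaction_data)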

In [8]:
from torch.utils.data.dataset import Dataset

class RatingDataset(Dataset):
    def __init__(self):
        self.ratings_data = ratings_df.copy()
        
        # Create user and item index mappings
        unique_users = ratings_df.userId.unique()
        unique_items = ratings_df.movieId.unique()
        
        self.user_to_idx = {user: idx for idx, user in enumerate(unique_users)}
        self.item_to_idx = {item: idx for idx, item in enumerate(unique_items)}
        
        # Map user and item IDs to continuous indexes
        self.ratings_data.userId = ratings_df.userId.apply(lambda x: self.user_to_idx[x])
        self.ratings_data.movieId = ratings_df.movieId.apply(lambda x: self.item_to_idx[x])
        
        # Extract features and target
        self.features = self.ratings_data.drop(['rating', 'timestamp'], axis=1).values
        self.target = self.ratings_data['rating'].values
        self.features = th.tensor(self.features)
        self.target = th.tensor(self.target, dtype=th.float32)
        
    def __getitem__(self, index):
        return self.features[index], self.target[index]
     
    def __len__(self):
        return len(self.ratings_data)
    

In [9]:
from torch.utils.data.dataloader import DataLoader
# Initialize model with new params
num_epochs = 128
device = th.device('cuda' if th.cuda.is_available() else 'cpu')
model = CollaborativeFiltering(number_of_users, number_of_movies, latent_factors=8).to(device)

# Loss function and optimizer
loss_function = th.nn.MSELoss()
optimizer = th.optim.Adam(model.parameters(), lr=1e-3)

# Create Dataset and DataLoader
dataset = RatingDataset()
data_loader = DataLoader(dataset, batch_size=128, shuffle=True)

In [10]:
from tqdm.notebook import tqdm
# Training loop
for epoch in tqdm(range(num_epochs)):
    epoch_losses = []
    for user_item_data, ratings in data_loader:
        user_item_data, ratings = user_item_data.to(device), ratings.to(device)
        
        optimizer.zero_grad()
        predicted_ratings = model(user_item_data)
        loss = loss_function(predicted_ratings, ratings)
        epoch_losses.append(loss.item())
        
        loss.backward()
        optimizer.step()
        
    avg_epoch_loss = np.mean(epoch_losses)
    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {avg_epoch_loss:.4f}')


Epoch 1/128, Loss: 11.0570
Epoch 2/128, Loss: 4.7467
Epoch 3/128, Loss: 2.4778
Epoch 4/128, Loss: 1.7223
Epoch 5/128, Loss: 1.3459
Epoch 6/128, Loss: 1.1286
Epoch 7/128, Loss: 0.9916
Epoch 8/128, Loss: 0.9003
Epoch 9/128, Loss: 0.8370
Epoch 10/128, Loss: 0.7919
Epoch 11/128, Loss: 0.7590
Epoch 12/128, Loss: 0.7347
Epoch 13/128, Loss: 0.7159
Epoch 14/128, Loss: 0.7015
Epoch 15/128, Loss: 0.6902
Epoch 16/128, Loss: 0.6816
Epoch 17/128, Loss: 0.6749
Epoch 18/128, Loss: 0.6696
Epoch 19/128, Loss: 0.6654
Epoch 20/128, Loss: 0.6630
Epoch 21/128, Loss: 0.6604
Epoch 22/128, Loss: 0.6588
Epoch 23/128, Loss: 0.6577
Epoch 24/128, Loss: 0.6565
Epoch 25/128, Loss: 0.6559
Epoch 26/128, Loss: 0.6550
Epoch 27/128, Loss: 0.6543
Epoch 28/128, Loss: 0.6532
Epoch 29/128, Loss: 0.6522
Epoch 30/128, Loss: 0.6510
Epoch 31/128, Loss: 0.6492
Epoch 32/128, Loss: 0.6475
Epoch 33/128, Loss: 0.6450
Epoch 34/128, Loss: 0.6423
Epoch 35/128, Loss: 0.6389
Epoch 36/128, Loss: 0.6346
Epoch 37/128, Loss: 0.6297
Epoch 38/

In [11]:
# Extract the learned embeddings
user_embedding_matrix = model.user_embeddings.weight.data.cpu().numpy()
item_embedding_matrix = model.item_embeddings.weight.data.cpu().numpy()

In [12]:
# Perform clustering on the learned item embeddings
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(item_embedding_matrix)

[WinError 2] The system cannot find the file specified
  File "C:\Users\falan\miniconda3\Lib\site-packages\joblib\externals\loky\backend\context.py", line 257, in _count_physical_cores
    cpu_info = subprocess.run(
               ^^^^^^^^^^^^^^^
  File "C:\Users\falan\miniconda3\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\falan\miniconda3\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\falan\miniconda3\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


In [27]:
# Create mappings: idx2movieid and movieid2name
idx2movieid = {idx: movie_id for idx, movie_id in enumerate(ratings_df['movieId'].unique())}
movieid2name = movies_df.set_index('movieId')['title'].to_dict()  # Mapping from movieId to movie titles
rating_counts = ratings_df['movieId'].value_counts()  # Number of ratings per movieId, computed once

# Display the top movies in each cluster
for cluster_idx in range(kmeans.n_clusters):
    movie_indices = np.where(kmeans.labels_ == cluster_idx)[0]  # Indices of movies in the current cluster
    movie_details = []

    # Collect details of movies in the current cluster
    for idx in movie_indices:
        movie_id = idx2movieid.get(idx)
        if movie_id in movieid2name:  # Check if the movie has a title in `movieid2name`
            movie_name = movieid2name[movie_id]
            rating_count = rating_counts.loc[movie_id]  # Number of ratings
            movie_details.append((movie_name, rating_count))
    
    # If the cluster contains movies, display the results
    if movie_details:
        print(f"Cluster {cluster_idx}:")
        # Sort movies by the number of ratings and display the top 10
        for movie in sorted(movie_details, key=lambda x: x[1], reverse=True)[:10]:
            print(f"\t{movie[0]}")  # Display the movie title
    else:
        print(f"Cluster {cluster_idx} is empty or has no recognizable movies.")


Cluster 0:
	Shrek (2001)
	Finding Nemo (2003)
	Stargate (1994)
	Incredibles, The (2004)
	Spider-Man (2002)
	Up (2009)
	WALL·E (2008)
	Harry Potter and the Chamber of Secrets (2002)
	Iron Man (2008)
	Harry Potter and the Prisoner of Azkaban (2004)
Cluster 1:
	Coneheads (1993)
	Judge Dredd (1995)
	Batman & Robin (1997)
	River Wild, The (1994)
	Godzilla (1998)
	Desperately Seeking Susan (1985)
	Super Mario Bros. (1993)
	Timecop (1994)
	Fantastic Four: Rise of the Silver Surfer (2007)
	Speed 2: Cruise Control (1997)
Cluster 2:
	Forrest Gump (1994)
	Shawshank Redemption, The (1994)
	Silence of the Lambs, The (1991)
	Matrix, The (1999)
	Jurassic Park (1993)
	Braveheart (1995)
	Terminator 2: Judgment Day (1991)
	Toy Story (1995)
	Seven (a.k.a. Se7en) (1995)
	Apollo 13 (1995)
Cluster 3:
	Ace Ventura: Pet Detective (1994)
	Ace Ventura: When Nature Calls (1995)
	Demolition Man (1993)
	Starship Troopers (1997)
	Face/Off (1997)
	Robin Hood: Men in Tights (1993)
	Desperado (1995)
	Pleasantville (19

## Summary
We can see that the movies in each cluster tend to share a genre or have something else in common. For instance, Cluster 0 seems to be a group of family-friendly animated and fantasy movies<br>
Furthermore, the training loss decreases as the number of epochs increases<br>
The model works as expected, and I find it very useful when looking for something interesting to watch in the evening.<br>
I hope you enjoyed this notebook and found it interesting<br>