# Recommander System - Date Night Movies

## Instructions:
- To train the model, you will need to press the "Run All" button.
- Once the model is trained, just change the ids of the 2 users in the last cell and run it to get a list of their top movies recommandation.

Imports

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from implicit.als import AlternatingLeastSquares
import scipy.sparse as sparse
import numpy as np

## Loading Datas

For this part, I chose to only use the MovieLens dataset as i had severe memory problems on my computer. I dropped the Timestamp column for this dataset as i judged it not useful for our purpose. This dataset includes 1 million ratings from 6,000 users on 4,000 movies.

In [4]:
# Load datas
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

# Drop the timestamp column as it's not needed
ratings = ratings.drop(columns='timestamp')

# Train-test split
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

## Feature Engineering

I created a user-item interaction matrix in a sparse format without using the pivot method as this method was too heavy on my computer and would simply not work. This approach is memory efficient and suitable for large datasets.

In [5]:
# Create user-item interaction matrix
user_item_sparse = sparse.coo_matrix(
    (train_data['rating'], (train_data['userId'], train_data['movieId']))
)

# For memory efficiency
user_item_sparse = user_item_sparse.tocsr()

## Model Development

I used the ALS algorithm from the implicit library for collaborative filtering. ALS is well-suited for implicit feedback datasets and scales well with large datasets.

In [6]:
# Initialize the ALS
als_model = AlternatingLeastSquares(factors=20, regularization=0.1, iterations=15)

# Train the ALS
als_model.fit(user_item_sparse.T)

  check_blas_config()


  0%|          | 0/15 [00:00<?, ?it/s]

## Recommandation Algorithm

I developed a function to recommend movies for a couple by combining their preferences. The function averages the preferences of the two users to compute a combined score for each movie.

In [15]:
def recommend_movies_for_couple(user1_id, user2_id, model=als_model, user_item_sparse=user_item_sparse, num_recommendations=10):
    user_factors = model.user_factors
    item_factors = model.item_factors
    
    if user1_id >= user_factors.shape[0] or user2_id >= user_factors.shape[0]:
        return []
    
    user1_vector = user_factors[user1_id]
    user2_vector = user_factors[user2_id]
    
    # Combine the preferences of both users
    combined_vector = (user1_vector + user2_vector) / 2
    
    # Calculate scores for all movies
    scores = item_factors.dot(combined_vector)
    
    # Get top movie recommendations
    movie_indices = np.argsort(scores)[::-1][:num_recommendations]
    
    # Map indices to movie IDs
    movie_ids = np.array(user_item_sparse.indices[movie_indices])
    
    # Map indices to movie titles
    result = []
    for movie_id in movie_ids:
        result.append(movies[movies['movieId'] == movie_id]['title'].iloc[0])
    
    return result

Change user1_id and user2_id to the ids of the two users of the couple to get their top recommandations.

In [16]:
user1_id = 1
user2_id = 2

# Get top movie recommendations for the couple
print("Top movie recommendations for the couple:", recommend_movies_for_couple(user1_id, user2_id))

Top movie recommendations for the couple: ['Blink (1994)', 'Secret of NIMH, The (1982)', 'Airport 1975 (1974)', 'Graduate, The (1967)', 'Road to Perdition (2002)', 'American Psycho (2000)', 'Baby Driver (2017)', 'Down to Earth (2001)', 'Star Wars: Episode IV - A New Hope (1977)', 'Kill Bill: Vol. 1 (2003)']


## Evaluation

To evaluate my model, I am going to calculate the RMSE

In [20]:
# Function to compute RMSE in a memory-efficient way
def compute_rmse(model, test_data, num_users, batch_size=1000):
    user_factors = model.user_factors
    item_factors = model.item_factors
    
    rmse_sum = 0
    count = 0
    
    for start in range(0, num_users, batch_size):
        end = min(start + batch_size, num_users)
        user_batch = range(start, end)
        
        # Get true ratings for users in the batch
        true_ratings = []
        predicted_ratings = []
        
        for user_id in user_batch:
            user_ratings = test_data[test_data['userId'] == user_id]
            for _, row in user_ratings.iterrows():
                movie_id = row['movieId']
                true_rating = row['rating']
                
                # Predict rating
                if movie_id < item_factors.shape[0] and user_id < user_factors.shape[0]:
                    predicted_rating = np.dot(user_factors[user_id], item_factors[0])
                    true_ratings.append(true_rating)
                    predicted_ratings.append(predicted_rating)
        
        if true_ratings:
            rmse_sum += mean_squared_error(true_ratings, predicted_ratings, squared=False) * len(true_ratings)
            count += len(true_ratings)
    
    rmse = rmse_sum / count
    return rmse

# Compute RMSE
num_users = train_data['userId'].nunique()
rmse = compute_rmse(als_model, test_data, num_users)
print("RMSE:", rmse)



RMSE: 3.690893840152147




After the calculations, I obtained a RMSE of 3.69 which isn't great, this probably stems from the fact the i didn't use all of the features from the dataset because of memory constraints.