# Project REMA 2024 - Abel ANDRY


## Welcome to my project for the REMA 2024 course. All explanations and source code are included directly in this notebook.

# Introduction
This project aimed to develop a movie recommender system specifically tailored for couples, addressing the challenge of finding films that satisfy the preferences of two individuals simultaneously. We utilized collaborative filtering techniques and incorporated data from both MovieLens and IMDb datasets to create a system that balances individual tastes with overall movie quality.

Import necessary libraries for data manipulation (pandas, numpy), machine learning tasks (sklearn), and regex (re).

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
import re

## Loading the data

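The MovieLens 1M files were a bit tricky to load: they use '::' as a field separator, which requires the Python parsing engine, and we read them with ISO-8859-1 encoding. The IMDb files are plain tab-separated and loaded without trouble.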

In [2]:
movies_ml = pd.read_csv('/kaggle/input/movielens-1m-dataset/movies.dat', 
                        sep='::', 
                        engine='python', 
                        names=['movieId', 'title', 'genres'],
                        encoding='ISO-8859-1')

ratings_ml = pd.read_csv('/kaggle/input/movielens-1m-dataset/ratings.dat', 
                         sep='::', 
                         engine='python', 
                         names=['userId', 'movieId', 'rating', 'timestamp'],
                         encoding='ISO-8859-1')

movies_imdb = pd.read_csv('/kaggle/input/imdbdatasets/title_basics.tsv', sep='\t')
ratings_imdb = pd.read_csv('/kaggle/input/imdbdatasets/title_ratings.tsv', sep='\t')



Here we define helper functions that strip the release year from a movie title and extract it separately, so that the two datasets can be linked on title and year.

In [3]:
def clean_title(title):
    if pd.isna(title) or not isinstance(title, str):
        return ''
    return re.sub(r'\s*\(\d{4}\)\s*$', '', str(title)).strip()

def extract_year(title):
    if pd.isna(title) or not isinstance(title, str):
        return None
    match = re.search(r'\((\d{4})\)\s*$', str(title))  # tolerate trailing whitespace, like clean_title
    return int(match.group(1)) if match else None
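
To make the behaviour of the two helpers concrete, here is a small self-contained sketch. The helpers mirror the ones in the cell above (repeated only so the snippet runs on its own), and the sample titles are illustrative rather than taken from the merged data.

```python
import re
import pandas as pd

# Helpers mirroring the notebook cell above, repeated so this snippet is standalone.
def clean_title(title):
    if pd.isna(title) or not isinstance(title, str):
        return ''
    return re.sub(r'\s*\(\d{4}\)\s*$', '', title).strip()

def extract_year(title):
    if pd.isna(title) or not isinstance(title, str):
        return None
    match = re.search(r'\((\d{4})\)\s*$', title)
    return int(match.group(1)) if match else None

# Illustrative titles: two regular ones, one with a trailing article,
# one without a year, and a missing value.
sample_titles = ['Toy Story (1995)', 'Jaws (1975)', 'Godfather, The (1972)',
                 'No Year Here', None]
for t in sample_titles:
    print(repr(t), '->', repr(clean_title(t)), extract_year(t))
```

One thing this makes visible: a MovieLens-style title such as 'Godfather, The (1972)' is cleaned to 'Godfather, The', which would not equal an IMDb primaryTitle of 'The Godfather'. Titles in that form are therefore likely to be dropped by the inner merge in the next step.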

## Data preprocessing
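Here we preprocess the data: we clean the titles (using the functions defined in the previous step), merge the datasets, handle duplicates, create a user-item matrix, normalize the data, and split it into training and test sets. Note that the scaled matrix and the train/test split are not used by the collaborative filtering code further down, which works on the raw user-item matrix.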

In [4]:
movies_ml['clean_title'] = movies_ml['title'].apply(clean_title)
movies_ml['year'] = movies_ml['title'].apply(extract_year)
movies_imdb['clean_title'] = movies_imdb['primaryTitle'].apply(clean_title)
movies_imdb['startYear'] = pd.to_numeric(movies_imdb['startYear'], errors='coerce')

merged_movies = pd.merge(movies_ml, movies_imdb,
                         left_on=['clean_title', 'year'],
                         right_on=['clean_title', 'startYear'],
                         how='inner')

merged_movies = pd.merge(merged_movies, ratings_imdb, left_on='tconst', right_on='tconst', how='left')

user_ratings = ratings_ml.merge(merged_movies[['movieId', 'tconst', 'genres_x', 'genres_y', 'averageRating', 'numVotes']], on='movieId')

user_ratings = user_ratings.sort_values('timestamp').drop_duplicates(subset=['userId', 'movieId'], keep='last')

user_item_matrix = user_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)


scaler = StandardScaler()
user_item_matrix_scaled = scaler.fit_transform(user_item_matrix)

X_train, X_test = train_test_split(user_item_matrix_scaled, test_size=0.2, random_state=42)
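
As an illustration of the deduplication and pivot steps above, here is a minimal sketch on a toy ratings log. All values are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Toy ratings log (hypothetical values); user 1 rated movie 10 twice.
toy_ratings = pd.DataFrame({
    'userId':    [1, 1, 1, 2, 2],
    'movieId':   [10, 20, 10, 10, 30],
    'rating':    [3, 5, 4, 2, 5],
    'timestamp': [100, 110, 120, 105, 115],
})

# Keep only the most recent rating per (user, movie) pair.
deduped = (toy_ratings.sort_values('timestamp')
           .drop_duplicates(subset=['userId', 'movieId'], keep='last'))

# Rows are users, columns are movies; movies a user has not rated become 0.
toy_matrix = deduped.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print(toy_matrix)
```

The deduplication matters because pivot raises an error if the same (userId, movieId) pair appears twice. Also note the consequence of fillna(0): an unrated movie is indistinguishable from a rating of 0 in everything computed from this matrix afterwards.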

# Collaborative Filtering Functions:

## calculate_user_similarity

### This function calculates the similarity between all pairs of users based on their movie ratings.

Input: 

user_item_matrix (a matrix where rows are users, columns are movies, and values are ratings)

Process:

- It uses cosine_similarity from sklearn to compute the similarity between each pair of users
- Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space


Output: A DataFrame where both rows and columns are users, and each cell contains the similarity score between those two users
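
To illustrate what cosine similarity computes, here is a small sketch on three hypothetical users. It compares a manual calculation with sklearn's cosine_similarity; the rating values are made up for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical users rating the same four movies (0 = not rated).
toy = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [0, 1, 5, 4],
])

# Manual cosine similarity between the first two users:
# dot product divided by the product of the vector norms.
a, b = toy[0], toy[1]
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Full pairwise similarity matrix, as used by calculate_user_similarity.
sim = cosine_similarity(toy)
print(round(float(manual), 4), round(float(sim[0, 1]), 4))
print(round(float(sim[0, 2]), 4))
```

The first two users rate the same movies highly, so their similarity is 41/42 (about 0.976). The third user prefers different movies, and the similarity to the first user drops to 8/42 (about 0.190).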


## user_based_collaborative_filtering

### This function predicts ratings for a specific user based on the ratings of similar users.

Inputs:

- user_item_matrix: The matrix of user ratings
- user_similarity_df: The similarity matrix produced by calculate_user_similarity
- user_id: The ID of the user we're predicting for
- k: The number of similar users to consider


Process:

- Get similarity scores for the target user
- Identify the top k most similar users (excluding the user themselves)
- Get the ratings of these similar users
- Calculate weighted average ratings:
  - Weights are the similarity scores
  - For each movie, multiply each similar user's rating by their similarity score
  - Sum these weighted ratings and divide by the sum of similarity scores




Output: A Series with predicted ratings for all movies for the target user

The key idea here is that users who have similar taste (as measured by cosine similarity of their rating vectors) are likely to rate new movies similarly. By weighting the ratings of similar users, we can predict how the target user might rate movies they haven't seen yet.
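
The weighted average described above can be sketched on a tiny example. The neighbour names, ratings, and similarity scores are hypothetical and chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical ratings of the two nearest neighbours for three movies (0 = not rated).
neighbour_ratings = pd.DataFrame(
    {'movie_a': [5.0, 3.0], 'movie_b': [2.0, 4.0], 'movie_c': [0.0, 5.0]},
    index=['neighbour_1', 'neighbour_2'],
)

# Hypothetical cosine similarities between the target user and each neighbour.
similarities = pd.Series({'neighbour_1': 0.8, 'neighbour_2': 0.2})

# Similarity-weighted average per movie:
# sum(similarity * rating) / sum(similarity).
predicted = neighbour_ratings.T.dot(similarities) / similarities.sum()
print(predicted)
```

For movie_a the prediction is (0.8 * 5 + 0.2 * 3) / 1.0 = 4.6. For movie_c the result is only 1.0 even though neighbour_2 gave it 5 stars, because neighbour_1's missing rating is stored as 0 and counts as a very low rating in the average.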

In [5]:
def calculate_user_similarity(user_item_matrix):
    return pd.DataFrame(
        cosine_similarity(user_item_matrix),
        index=user_item_matrix.index,
        columns=user_item_matrix.index
    )

def user_based_collaborative_filtering(user_item_matrix, user_similarity_df, user_id, k=5):
    user_similarities = user_similarity_df[user_id]
    
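    # Drop the target user explicitly instead of assuming they sort first (ties are possible)
    similar_users = user_similarities.drop(user_id).sort_values(ascending=False).index[:k]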
    
    similar_users_ratings = user_item_matrix.loc[similar_users]
    
    user_sim_scores = user_similarities[similar_users]
    weighted_ratings = similar_users_ratings.T.dot(user_sim_scores)
    weighted_avg_ratings = weighted_ratings / user_sim_scores.sum()
    
    return pd.Series(weighted_avg_ratings, index=user_item_matrix.columns)

# Recommender System:

## recommend_for_couple

### This function recommends a movie for a couple based on their individual preferences.

Inputs: 
- IDs of two users
- the user-item matrix
- user similarity matrix
- merged movie data
- k (number of similar users to consider)

Using collaborative filtering, we first predict individual ratings for each user. These individual predictions are then averaged to obtain the couple's predicted rating for each movie. We build a DataFrame of movie IDs and the couple's predicted ratings and merge it with additional movie information, such as IMDb ratings. To determine the final recommendation, we compute a combined score that weights the couple's predicted rating at 70% and the IMDb average rating at 30% (a missing IMDb rating counts as 0). The movie with the highest combined score is returned along with its predicted rating.
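
Here is a worked sketch of the scoring rule on three hypothetical candidate movies. The titles and numbers are invented; only the 70/30 formula and the averageRating column name come from the notebook.

```python
import pandas as pd

# Hypothetical per-user predictions (5-star scale) and IMDb averages (10-point scale).
candidates = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'user1_pred': [4.5, 3.0, 4.0],
    'user2_pred': [2.5, 4.0, 4.0],
    'averageRating': [6.0, 8.5, 7.0],
})

# Couple prediction = mean of the two individual predictions.
candidates['predicted_rating'] = (candidates['user1_pred'] + candidates['user2_pred']) / 2

# Combined score = 70% couple prediction + 30% IMDb average (missing IMDb rating -> 0).
candidates['combined_score'] = (0.7 * candidates['predicted_rating']
                                + 0.3 * candidates['averageRating'].fillna(0))

best = candidates.loc[candidates['combined_score'].idxmax()]
print(best['title'], round(float(best['combined_score']), 2))
```

In this toy case Movie C has the highest couple prediction (4.0), yet Movie B wins because of its IMDb rating. IMDb ratings are on a 10-point scale while the predictions are on the 5-star scale, so the IMDb term carries more weight than the nominal 30% suggests. Rescaling IMDb ratings to 5 points before combining would be one way to make the weights comparable; that is a possible variation, not what the notebook does.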

## evaluate_recommendations

### This function evaluates the recommendation system's performance.

To assess the system's performance, we used two common regression metrics:

- Root Mean Square Error (RMSE)
- Mean Absolute Error (MAE)

These metrics measure the average deviation between predicted and actual ratings.
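
For reference, both metrics can be computed in a few lines. This is a minimal sketch on hypothetical predicted and actual ratings, not the notebook's evaluation loop.

```python
import numpy as np

# Hypothetical predicted and actual ratings for four recommendations.
predicted = np.array([4.0, 3.5, 2.0, 5.0])
actual = np.array([5.0, 3.0, 4.0, 5.0])

errors = predicted - actual

# RMSE: square root of the mean squared error; penalises large errors more.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors; every error counts proportionally.
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
```

Here the single 2-star miss pushes the RMSE (about 1.15) above the MAE (0.875), which is the usual pattern when a few predictions are far off.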

### These functions work together to provide and evaluate movie recommendations for couples. The recommend_for_couple function combines individual user preferences to suggest a movie, while evaluate_recommendations pairs consecutive users as couples and measures how closely the predicted rating matches the average of their actual ratings.

In [6]:
def recommend_for_couple(user1_id, user2_id, user_item_matrix, user_similarity_df, merged_movies_df, k=5):

    user1_predictions = user_based_collaborative_filtering(user_item_matrix, user_similarity_df, user1_id, k)
    user2_predictions = user_based_collaborative_filtering(user_item_matrix, user_similarity_df, user2_id, k)
    
    couple_predictions = (user1_predictions + user2_predictions) / 2
    
    predicted_df = pd.DataFrame({
        'movieId': couple_predictions.index,
        'predicted_rating': couple_predictions.values
    })
    
    combined_df = pd.merge(
        predicted_df,
        merged_movies_df[['movieId', 'averageRating', 'title', 'genres_x', 'genres_y', 'numVotes']],
        on='movieId',
        how='inner'
    )
    
    combined_df['combined_score'] = 0.7 * combined_df['predicted_rating'] + 0.3 * combined_df['averageRating'].fillna(0)
    
    best_movie = combined_df.loc[combined_df['combined_score'].idxmax()]
    
    return best_movie, best_movie['predicted_rating']

def evaluate_recommendations(user_item_matrix, user_similarity_df, merged_movies_df):
    mse = 0
    mae = 0
    count = 0
    
    for i in range(0, len(user_item_matrix), 2):
        if i+1 < len(user_item_matrix):
            user1_id = user_item_matrix.index[i]
            user2_id = user_item_matrix.index[i+1]
            
            try:
                recommended_movie, predicted_rating = recommend_for_couple(user1_id, user2_id, user_item_matrix, user_similarity_df, merged_movies_df)
                
                movie_id = recommended_movie['movieId']
                if movie_id in user_item_matrix.columns:
                    actual_rating1 = user_item_matrix.loc[user1_id, movie_id]
                    actual_rating2 = user_item_matrix.loc[user2_id, movie_id]
                    actual_rating = (actual_rating1 + actual_rating2) / 2
                    
                    mse += (predicted_rating - actual_rating) ** 2
                    mae += abs(predicted_rating - actual_rating)
                    count += 1
            except KeyError:
                continue
    
    rmse = np.sqrt(mse / count) if count > 0 else np.inf
    mae = mae / count if count > 0 else np.inf
    
    return rmse, mae

## Collaborative Filtering Approach
Our recommender system uses user-based collaborative filtering. Key components include:

- User Similarity Calculation: We computed cosine similarity between user rating vectors to identify users with similar tastes.
- Rating Prediction: For each user, we predicted ratings based on the weighted average of ratings from similar users.
- Couple Recommendations: We combined individual user predictions and incorporated IMDb ratings to generate recommendations suitable for couples.

# Evaluation of the model

In [7]:
user_similarity_df = calculate_user_similarity(user_item_matrix)

rmse, mae = evaluate_recommendations(user_item_matrix, user_similarity_df, merged_movies)
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")

RMSE: 1.3539
MAE: 1.0020


As we can see, the metrics are far from perfect, but I think they are still acceptable given that ratings are on a 5-star scale: an MAE of about 1.0 means the predicted rating is off by roughly one star on average. There is room for improvement, and with more time on this project we might be able to lower the RMSE and MAE.

## Example recommendation
Change the user IDs as you like.

In [8]:
user1_id, user2_id = 87, 94 
recommended_movie, predicted_rating = recommend_for_couple(user1_id, user2_id, user_item_matrix, user_similarity_df, merged_movies)
print(f"Recommended movie for users {user1_id} and {user2_id}:")
print(f"Title: {recommended_movie['title']}")
print(f"MovieLens Genres: {recommended_movie['genres_x']}")
print(f"IMDb Genres: {recommended_movie['genres_y']}")
print(f"IMDb Rating: {recommended_movie['averageRating']}")
print(f"Predicted rating for the couple: {predicted_rating:.2f}/5")

Recommended movie for users 87 and 94:
Title: Jaws (1975)
MovieLens Genres: Action|Horror
IMDb Genres: Adventure,Thriller
IMDb Rating: 8.1
Predicted rating for the couple: 3.97/5


# Discussion about what we have done

## Strengths

- Integration of Multiple Data Sources: By combining MovieLens and IMDb data, our system leverages both user ratings and broader movie information.
- Couple-Focused Approach: The system explicitly considers the preferences of two users, addressing a real-world scenario often overlooked in recommender systems.
- Balance of Preferences and Quality: Our method combines predicted ratings with IMDb scores, aiming to recommend movies that are both personally appealing and critically acclaimed.

## Limitations

- Accuracy: While our RMSE and MAE scores are reasonable, there's room for improvement. Future iterations could explore matrix factorization techniques or deep learning approaches to enhance prediction accuracy.
- Evaluation Metrics: Our current metrics focus solely on rating prediction accuracy. Incorporating ranking-based metrics (e.g., Precision@k, Recall@k) and diversity measures could provide a more comprehensive evaluation of the system's performance.
- Cold Start Problem: The current system may struggle with new users or movies. Implementing content-based features could help address this limitation.
- Scalability: As the dataset grows, computational efficiency may become a concern. Exploring more efficient similarity computation methods or moving to a model-based approach could improve scalability.
- User Interface: Developing a user-friendly interface for couples to interact with the system would be a valuable next step for practical application.
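
As a pointer for the ranking-based metrics mentioned above, here is a minimal sketch of Precision@k and Recall@k. The function name precision_recall_at_k and the sample data are hypothetical. It also assumes a ranked list of recommendations, which the current recommend_for_couple does not return (it returns a single movie); sorting the candidates by combined score would be one way to obtain such a list.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute Precision@k and Recall@k.

    recommended: ranked list of movie ids (best first).
    relevant: set of movie ids the couple actually liked.
    """
    top_k = recommended[:k]
    hits = sum(1 for movie_id in top_k if movie_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and liked movies.
ranked = [10, 20, 30, 40, 50]
liked = {20, 50, 60}

p, r = precision_recall_at_k(ranked, liked, k=3)
print(f"Precision@3: {p:.3f}, Recall@3: {r:.3f}")
```

With k=3, only movie 20 is both recommended and liked, so precision and recall are both 1/3 in this example.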

# Conclusion

Overall, I did my best on this project. The results are not brilliant, but they are acceptable. I struggled at the beginning, especially with deciding how to approach the project given how open-ended it was.

My film suggestion tool for couples is based on collaborative filtering and takes into account the tastes of two people at the same time. The results so far are promising, but there is definitely room for improvement and growth. This project sets the stage for more advanced couple-focused recommendation tools, which could potentially be used for more than just movies, in other situations where couples make decisions together.

## Sources used for this project
(in order of importance)
- This course: https://github.com/oscar-defelice/Recommender-Systems-Course/blob/main/src/01.Introduction.ipynb
- Stack Overflow
- https://proclusacademy.com/blog/explainer/regression-metrics-you-must-know/
- ChatGPT (especially for debugging, presentation, and project setup)
- https://realpython.com/build-recommendation-engine-collaborative-filtering/