In [5]:
# Install LightFM and import the libraries used for the analysis

!pip install lightfm
import pandas as pd
import numpy as np
from lightfm.evaluation import precision_at_k, auc_score




In this lab, we will apply concepts learned about recommender systems to a movie recommendation task.

The dataset comes from Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

Your tasks in this lab include two steps: 1) implement a collaborative-filtering recommender system using the dataset and report the performance of your approach, and 2) discuss the relationship between item/user popularity and recommendation performance.

You may consider using LightFM (https://github.com/lyst/lightfm) to support your task.



Step 1: Collaborative Filtering for recommendation

In [6]:
# put your code for implementing Step 1 in this code block



#movies = pd.read_csv('movies_metadata.csv')
ratings = pd.read_csv('ratings_small.csv')

from lightfm import LightFM
from lightfm.data import Dataset

# Map raw user and movie IDs to the internal indices LightFM uses.
dataset = Dataset()
dataset.fit(ratings['userId'], ratings['movieId'])

# Build the implicit-feedback matrix: every rated (user, movie) pair counts as
# one positive interaction, so the rating value itself is not used.
interactions, weights = dataset.build_interactions(
    zip(ratings['userId'], ratings['movieId']))


In [7]:
model = LightFM(loss='warp')
model.fit(interactions, epochs=30, num_threads=2)


<lightfm.lightfm.LightFM at 0x7ef5a2b8a200>

In [8]:

precision = precision_at_k(model, interactions, k=5).mean()
print(f'Precision@5: {precision}')


Precision@5: 0.5964232683181763


In [9]:
auc = auc_score(model, interactions).mean()
print(f'AUC Score: {auc}')

AUC Score: 0.9635312557220459
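
The two scores above are computed on the same interactions the model was fitted on. A held-out split gives a more realistic estimate. The sketch below shows the idea in plain Python on toy pairs; `holdout_split` and `toy_pairs` are illustrative names, not part of the notebook. LightFM also provides `random_train_test_split` in `lightfm.cross_validation`, and `precision_at_k` and `auc_score` accept a `train_interactions` argument so that already-seen items are excluded from the ranking.

```python
import random

def holdout_split(pairs, test_fraction=0.2, seed=42):
    """Split (user, item) pairs into disjoint train and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy interactions standing in for the (userId, movieId) pairs in `ratings`.
toy_pairs = [(u, m) for u in range(1, 6) for m in range(10, 14)]
train_pairs, test_pairs = holdout_split(toy_pairs)
print(len(train_pairs), len(test_pairs))
```

The model would then be fitted on the training pairs only and scored on the test pairs.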


Step 2: Explore whether performance on highly popular users and items (those with many records) is generally higher than on niche users and items.



In [10]:
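# code for step 2

# Number of ratings per movie and per user (both indexed by sorted ID).
item_popularity = ratings.groupby('movieId').size()
user_activity = ratings.groupby('userId').size()

# Correlate those counts with the column and row sums of the interaction matrix.
# Note: the matrix follows the Dataset's internal ID mapping, which may not be
# in sorted-ID order.
item_popularity_corr = np.corrcoef(item_popularity, interactions.sum(axis=0))[0, 1]
user_activity_corr = np.corrcoef(user_activity, interactions.sum(axis=1).T)[0, 1]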

print(f'Item Popularity Correlation: {item_popularity_corr}')
print(f'User Activity Correlation: {user_activity_corr}')


Item Popularity Correlation: 0.22634770716123465
User Activity Correlation: 0.9999999999999999
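
Both sides of each correlation count the same thing (ratings per movie, or ratings per user), so the result depends mainly on whether the two vectors are in the same order. The toy sketch below illustrates this, assuming that LightFM's `Dataset` numbers items in the order it first sees them; `pearson` and `toy_pairs` are illustrative names, not part of the notebook.

```python
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Toy (userId, movieId) pairs, sorted by user like a ratings file.
toy_pairs = [(1, 30), (1, 10), (2, 30), (2, 20), (3, 30), (3, 10), (4, 30)]
counts = Counter(movie for _, movie in toy_pairs)

# groupby-style order: sorted movie IDs (10, 20, 30).
by_sorted_id = [counts[m] for m in sorted(counts)]
# First-appearance order (30, 10, 20), as an ID mapping built row by row would give.
first_seen = list(dict.fromkeys(m for _, m in toy_pairs))
by_first_seen = [counts[m] for m in first_seen]

same_order = pearson(by_sorted_id, by_sorted_id)
mixed_order = pearson(by_sorted_id, by_first_seen)
print(same_order, mixed_order)
```

With matching order the correlation is 1; with mismatched order it can take almost any value, even though the underlying counts are identical.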


# put your findings here


### Evaluation Metrics

1. **Precision@5**:
   - The Precision@5 score is `0.5964`.
   - On average, about 59.64% of each user's top 5 recommendations are movies that user has rated. Because the model is scored on the same interactions it was fitted on, this measures how well it reproduces known ratings, and is likely an optimistic estimate of performance on unseen data.

2. **AUC Score**:
   - The AUC (Area Under the ROC Curve) score is `0.9635`.
   - A randomly chosen movie that a user rated is ranked above a randomly chosen movie that user did not rate about 96% of the time. The score is close to 1, but it is also computed on the training interactions, so the same caveat applies.
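
To make the two metrics concrete, here is a minimal single-user sketch in plain Python. The scores and the relevant set are toy values, and `toy_precision_at_k` and `toy_auc` are illustrative helpers, not LightFM's implementations.

```python
def toy_precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scored items that are in `relevant`."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(item in relevant for item in top_k) / k

def toy_auc(scores, relevant):
    """Share of (relevant, irrelevant) pairs where the relevant item scores higher."""
    pos = [s for item, s in scores.items() if item in relevant]
    neg = [s for item, s in scores.items() if item not in relevant]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy model scores for six movies and the three this user actually rated.
toy_scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.4, "E": 0.3, "F": 0.1}
toy_relevant = {"A", "C", "E"}

p5 = toy_precision_at_k(toy_scores, toy_relevant, k=5)  # 3 of the top 5 are relevant
auc_value = toy_auc(toy_scores, toy_relevant)           # 6 of 9 pairs ordered correctly
print(p5, auc_value)
```

The notebook's reported values are the averages of such per-user scores over all users.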

### Popularity Metrics Correlation

1. **Item Popularity Correlation**:
   - The correlation between item popularity and the interaction-matrix column sums is `0.2263`.
   - Both quantities count the ratings each movie received, so a correctly aligned comparison would give a correlation of about 1. The low value most likely reflects a mismatch in ordering: `groupby('movieId')` sorts by movie ID, while the matrix columns follow the internal item mapping built by `Dataset`. It therefore says little about how popularity relates to recommendation quality.

2. **User Activity Correlation**:
   - The correlation between user activity and the interaction-matrix row sums is `1.0`.
   - This is expected, because both sides count the ratings each user provided and the user order happens to match. It confirms that the interaction matrix was built correctly, but it does not show whether the model performs better for active users. Answering the Step 2 question requires comparing a per-user or per-item performance metric, such as Precision@5, across popularity groups.
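
One way to relate popularity to performance is to split users into less active and more active groups and compare the average of a per-user metric. The sketch below uses hypothetical values only; `mean_by_activity`, `toy_activity`, and `toy_precision` are illustrative names, and none of the numbers are results from this notebook. In the notebook, the per-user values could come from `precision_at_k(model, interactions, k=5)` before `.mean()` is applied, matched to `user_activity` in the same user order.

```python
from statistics import mean, median

def mean_by_activity(activity, per_user_metric):
    """Average a per-user metric for users below vs. at or above median activity."""
    cut = median(activity.values())
    low = [per_user_metric[u] for u, n in activity.items() if n < cut]
    high = [per_user_metric[u] for u, n in activity.items() if n >= cut]
    return mean(low), mean(high)

# Hypothetical values for illustration only; not results from this notebook.
toy_activity = {"u1": 5, "u2": 8, "u3": 40, "u4": 120}      # ratings per user
toy_precision = {"u1": 0.2, "u2": 0.4, "u3": 0.6, "u4": 0.8}  # per-user Precision@5

niche_mean, active_mean = mean_by_activity(toy_activity, toy_precision)
print(niche_mean, active_mean)
```

The same grouping can be applied to items by bucketing movies on their rating counts.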

