# Practice PS06: Recommendations engines (interactions-based)

Author: <font color="blue">Alan Le Roux </font>

E-mail: <font color="blue">alan.leroux01@estudiant.upf.edu</font>

Date: <font color="blue">01/11/2025</font>

# 1. The Movies dataset

## 1.1. Load the input files

In [55]:
# LEAVE THIS CODE AS-IS
# But feel free to add imports in an extra cell if needed

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from math import *
import random
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [56]:
# LEAVE THIS CODE AS-IS

FILENAME_MOVIES = "ml32m-movies-2000s.csv.gz"
FILENAME_RATINGS = "ml32m-ratings-2000s.csv.gz"
FILENAME_TAGS = "ml32m-tags-2000s.csv.gz"

In [57]:
# LEAVE THIS CODE AS-IS

# Load movies
movies = pd.read_csv(FILENAME_MOVIES, 
                    compression='gzip',
                    sep=',', 
                    engine='python', 
                    encoding='utf-8',
                    names=['movie_id', 'title', 'genres'])

# Remove header row from this file
movies.drop(index=0, inplace=True)

# Make sure the movie id is numeric
movies["movie_id"] = pd.to_numeric(movies["movie_id"])
display(movies.head(5))

Unnamed: 0,movie_id,title,genres
1,2769,"Yards, The (2000)",Crime|Drama
2,3177,Next Friday (2000),Comedy
3,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
4,3225,Down to You (2000),Comedy|Romance
5,3228,Wirey Spindell (2000),Comedy


In [58]:
# LEAVE THIS CODE AS-IS

# Load ratings
ratings_raw = pd.read_csv(FILENAME_RATINGS, 
                    sep=',', 
                    compression='gzip',
                    encoding='utf-8',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

Unnamed: 0,user_id,movie_id,rating
0,4,223,4.0
1,4,1210,3.0
2,4,1272,4.0
3,4,1327,3.0
4,4,1513,2.0


## 1.2. Merge the data into a single dataframe

In [59]:

ratings = pd.merge(ratings_raw,movies,how ='inner',on = 'movie_id')
display(ratings.head(5))

Unnamed: 0,user_id,movie_id,rating,title,genres
0,33,3285,4.5,"Beach, The (2000)",Adventure|Drama
1,1209,3285,4.0,"Beach, The (2000)",Adventure|Drama
2,1402,3285,2.0,"Beach, The (2000)",Adventure|Drama
3,1411,3285,3.0,"Beach, The (2000)",Adventure|Drama
4,1766,3285,2.5,"Beach, The (2000)",Adventure|Drama
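
The effect of `how='inner'` can be checked on a tiny example. The sketch below uses a few rows copied from the outputs above (the toy dataframes themselves are hypothetical stand-ins for `ratings_raw` and `movies`): ratings whose `movie_id` is not in the movies table are dropped by the merge.

```python
import pandas as pd

# Toy stand-in for ratings_raw (rows copied from the outputs above)
toy_ratings = pd.DataFrame({
    "user_id": [4, 4, 33],
    "movie_id": [223, 1210, 3285],
    "rating": [4.0, 3.0, 4.5],
})

# Toy stand-in for movies: only movie 3285 is present
toy_movies = pd.DataFrame({
    "movie_id": [3285],
    "title": ["Beach, The (2000)"],
    "genres": ["Adventure|Drama"],
})

# how="inner" keeps only the ratings whose movie_id also appears in toy_movies
toy_merged = pd.merge(toy_ratings, toy_movies, how="inner", on="movie_id")
print(toy_merged)
print(f"Rows before: {len(toy_ratings)}, rows after: {len(toy_merged)}")
```

This is consistent with the merged output starting at movie 3285 rather than at the first rows of `ratings_raw`.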


In [60]:
def find_movies(keyword, movies):
    # Count the matches so we can report when nothing is found
    count = 0
    # Iterate over the rows of the dataframe
    for _, row in movies.iterrows():
        title = row['title']
        # Case-sensitive substring match (tip given in the statement)
        if keyword in title:
            print(f"movie_id: {row['movie_id']}, title: {title}")
            count += 1
    if count == 0:
        print("No movies found with the given keyword.")

In [61]:
# LEAVE AS-IS

# For testing, this should print 6 movies
find_movies("Final Destination", movies)

movie_id: 3409, title: Final Destination (2000)
movie_id: 6058, title: Final Destination 2 (2003)
movie_id: 43679, title: Final Destination 3 (2006)
movie_id: 71252, title: Final Destination, The (Final Destination 4) (Final Destination in 3-D, The) (2009)
movie_id: 85278, title: City of Your Final Destination, The (2009)
movie_id: 88932, title: Final Destination 5 (2011)
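
As a possible alternative to the row-by-row loop, the same case-sensitive search could be written with a vectorized string match. This is only a sketch: `find_movies_vectorized` and the toy dataframe are hypothetical names introduced here, and the function returns the matches instead of printing them.

```python
import pandas as pd

def find_movies_vectorized(keyword, movies):
    """Return the rows of `movies` whose title contains `keyword` (case-sensitive)."""
    # regex=False treats the keyword as plain text, like the `in` operator;
    # na=False counts missing titles as non-matches
    mask = movies["title"].str.contains(keyword, regex=False, na=False)
    return movies.loc[mask, ["movie_id", "title"]]

# Toy data with titles taken from the outputs above
toy_movies = pd.DataFrame({
    "movie_id": [3409, 6058, 3285],
    "title": [
        "Final Destination (2000)",
        "Final Destination 2 (2003)",
        "Beach, The (2000)",
    ],
})

matches = find_movies_vectorized("Final Destination", toy_movies)
print(matches)
```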


In [62]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [63]:
# LEAVE AS-IS

# For testing, should print "Final Destination 5 (2011)"
print(get_title(88932, movies))

Final Destination 5 (2011)


## 1.3. Count unique registers

In [64]:
num_users_rating = ratings['user_id'].unique()
num_movies_rated = ratings['movie_id'].unique()
num_movies = movies['movie_id'].unique()
print(f"Number of users who have rated a movie: {len(num_users_rating)}")
print(f"Number of movies rated by users: {len(num_movies_rated)}")
print(f"Total number of movies : {len(num_movies)}")


Number of users who have rated a movie: 16348
Number of movies rated by users: 2878
Total number of movies : 51444


# 2. Item-based Collaborative Filtering

## 2.1. Data pre-processing

In [65]:
# We create the new dataframe without the genres column using the drop function 
rated_movies = ratings.drop(columns=['genres'])
display(rated_movies.head(10))

Unnamed: 0,user_id,movie_id,rating,title
0,33,3285,4.5,"Beach, The (2000)"
1,1209,3285,4.0,"Beach, The (2000)"
2,1402,3285,2.0,"Beach, The (2000)"
3,1411,3285,3.0,"Beach, The (2000)"
4,1766,3285,2.5,"Beach, The (2000)"
5,1844,3285,3.0,"Beach, The (2000)"
6,1860,3285,2.5,"Beach, The (2000)"
7,2051,3285,1.0,"Beach, The (2000)"
8,2112,3285,4.5,"Beach, The (2000)"
9,2421,3285,2.5,"Beach, The (2000)"


In [66]:
# First, build ratings_summary from rated_movies using the function given in the statement:
# group the rows by movie_id, keep the first title of each group,
# and copy that title into a new dataframe called ratings_summary.

ratings_summary = rated_movies.groupby('movie_id').first()[['title']].copy()

# Compute the mean and count of ratings for each movie_id.
# The median is added as an extra column because the next step needs it.
ratings_mean = rated_movies.groupby('movie_id')['rating'].mean()
ratings_count = rated_movies.groupby('movie_id')['rating'].count()
ratings_median = rated_movies.groupby('movie_id')['rating'].median()

# And we add it to the dataframe
ratings_summary['ratings_mean'] = ratings_mean.round(2)
ratings_summary['ratings_count'] = ratings_count
ratings_summary['ratings_median'] = ratings_median.round(2)


#Here we move movie_id from index to column
ratings_summary = ratings_summary.reset_index()

# We display the 10 first rows of the new dataframe
display(ratings_summary.head(10))



Unnamed: 0,movie_id,title,ratings_mean,ratings_count,ratings_median
0,2769,"Yards, The (2000)",3.0,73,3.0
1,3177,Next Friday (2000),2.92,161,3.0
2,3190,Supernova (2000),2.34,131,2.5
3,3225,Down to You (2000),2.68,109,3.0
4,3228,Wirey Spindell (2000),1.67,3,2.0
5,3239,Isn't She Great? (2000),2.29,29,2.0
6,3273,Scream 3 (2000),2.44,832,2.5
7,3275,"Boondock Saints, The (2000)",3.88,1406,4.0
8,3276,Gun Shy (2000),3.02,31,3.0
9,3279,Knockout (2000),1.0,1,1.0


In [67]:
# Keep only the movies with at least 1000 ratings,
# sort them by ratings_mean in descending order (functions given in the statement),
# and display the first 10 rows.
top_10_movies = ratings_summary[ratings_summary.ratings_count >= 1000].sort_values(by='ratings_mean', ascending=False).head(10)
print("Top 10 movies by average rating (with at least 1000 ratings):")
display(top_10_movies)

# Same as above, but sorted by ratings_median
top_10_median = ratings_summary[ratings_summary.ratings_count >= 1000].sort_values(by='ratings_median', ascending=False).head(10)
print("Top 10 movies by median rating (with at least 1000 ratings):")
display(top_10_median[['movie_id', 'title', 'ratings_median', 'ratings_count']])





Top 10 movies by average rating (with at least 1000 ratings):


Unnamed: 0,movie_id,title,ratings_mean,ratings_count,ratings_median
736,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.22,3607,4.5
2469,44555,"Lives of Others, The (Das leben der Anderen) (...",4.21,1321,4.5
882,6016,City of God (Cidade de Deus) (2002),4.19,2848,4.5
259,4226,Memento (2000),4.16,5804,4.0
2692,48516,"Departed, The (2006)",4.14,3819,4.0
2718,48780,"Prestige, The (2006)",4.1,3610,4.0
1255,7153,"Lord of the Rings: The Return of the King, The...",4.09,7365,4.5
1934,31658,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.09,1703,4.0
498,4993,"Lord of the Rings: The Fellowship of the Ring,...",4.09,7935,4.5
849,5952,"Lord of the Rings: The Two Towers, The (2002)",4.07,7447,4.0


Top 10 movies by median rating (with at least 1000 ratings):


Unnamed: 0,movie_id,title,ratings_median,ratings_count
736,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.5,3607
498,4993,"Lord of the Rings: The Fellowship of the Ring,...",4.5,7935
2469,44555,"Lives of Others, The (Das leben der Anderen) (...",4.5,1321
882,6016,City of God (Cidade de Deus) (2002),4.5,2848
1255,7153,"Lord of the Rings: The Return of the King, The...",4.5,7365
7,3275,"Boondock Saints, The (2000)",4.0,1406
1285,7254,The Butterfly Effect (2004),4.0,2239
1215,6947,Master and Commander: The Far Side of the Worl...,4.0,1056
1233,7090,Hero (Ying xiong) (2002),4.0,1077
1245,7143,"Last Samurai, The (2003)",4.0,2241


<font size="+1" color="red">Repeat this, but this time consider movies receiving at least 3 ratings, and having a median of 4.5 or above.</font>

In [68]:
#I define another dataframe with the constraints given in the statement
new_movies = ratings_summary[(ratings_summary.ratings_count >= 3) & (ratings_summary.ratings_median >= 4.5)]
#And again I do the same as the previous cells but with this new dataframe

# Top 10 by mean rating
top10_mean = new_movies.sort_values(by='ratings_mean', ascending=False).head(10)
print("Top 10 movies by average rating (min 3 ratings, median >= 4.5):")
display(top10_mean)

# Top 10 by median rating
top10_median = new_movies.sort_values(by='ratings_median', ascending=False).head(10)
print("Top 10 movies by median rating (min 3 ratings, median >= 4.5):")
display(top10_median)

Top 10 movies by average rating (min 3 ratings, median >= 4.5):


Unnamed: 0,movie_id,title,ratings_mean,ratings_count,ratings_median
1711,27550,Hell House (2001),4.5,3,4.5
1958,31900,Travellers and Magicians (2003),4.3,5,4.5
248,4165,"Me You Them (Eu, Tu, Eles) (2000)",4.25,6,4.75
1739,27646,Soldier's Girl (2003),4.25,6,4.5
1731,27627,Oasis (2002),4.25,6,4.5
1393,7767,"Best of Youth, The (La meglio gioventù) (2003)",4.25,54,4.5
1668,27423,"O Auto da Compadecida (Dog's Will, A) (2000)",4.24,35,4.5
736,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.22,3607,4.5
2469,44555,"Lives of Others, The (Das leben der Anderen) (...",4.21,1321,4.5
1682,27469,Millennium Mambo (2001),4.21,7,4.5


Top 10 movies by median rating (min 3 ratings, median >= 4.5):


Unnamed: 0,movie_id,title,ratings_mean,ratings_count,ratings_median
248,4165,"Me You Them (Eu, Tu, Eles) (2000)",4.25,6,4.75
2143,34314,Funny Ha Ha (2002),3.88,4,4.75
276,4244,"Day I Became a Woman, The (Roozi khe zan shoda...",4.1,5,4.5
2469,44555,"Lives of Others, The (Das leben der Anderen) (...",4.21,1321,4.5
2056,33363,Unconscious (Inconscientes) (2004),3.5,5,4.5
2050,33270,"Taste of Tea, The (Cha no aji) (2004)",3.93,7,4.5
1958,31900,Travellers and Magicians (2003),4.3,5,4.5
1902,31148,Day of the Wacko (Dzien swira) (2002),3.61,18,4.5
1739,27646,Soldier's Girl (2003),4.25,6,4.5
1731,27627,Oasis (2002),4.25,6,4.5


In the first list (top 10 by average rating with at least 1000 ratings), we recognize popular, widely acclaimed movies, for example Christopher Nolan films such as Memento (2000) and The Prestige (2006).
When the minimum rating count drops to 3, little-known movies appear, such as Hell House (2001) with only 3 ratings or Travellers and Magicians (2003) with only 5. These rankings are less reliable because each average rests on the opinions of a handful of users.
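
The effect of the minimum-count threshold can be illustrated on a toy example. The sketch below uses made-up ratings and a hypothetical helper `top_by_mean`; it is not the notebook's data, only a small demonstration of how a movie with a single perfect rating tops the list until a minimum count is required.

```python
import pandas as pd

# Made-up ratings: movie 1 has four ratings, movie 2 has two, movie 3 has one
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2, 2, 3],
    "rating":   [4.0, 4.5, 4.0, 3.5, 5.0, 4.5, 5.0],
})

# Mean, count and median per movie in a single aggregation
summary = toy.groupby("movie_id")["rating"].agg(
    ratings_mean="mean", ratings_count="count", ratings_median="median"
).reset_index()

def top_by_mean(summary, min_count, n=10):
    """Top-n movies by mean rating among those with at least min_count ratings."""
    eligible = summary[summary["ratings_count"] >= min_count]
    return eligible.sort_values("ratings_mean", ascending=False).head(n)

print(top_by_mean(summary, min_count=1))  # the single-rating movie ranks first
print(top_by_mean(summary, min_count=3))  # only the well-rated movie remains
```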

## 2.2. Compute the user-movie matrix

In [69]:
user_movie = rated_movies.pivot_table(index='user_id', columns='movie_id', values='rating')

user_movie.head(5)

user_id \ movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,51187,51194,51255,51312,51314,51317,51402,51412,51418,51433
33,,,,,,,,,,,...,,,4.0,,,,,,,
63,,,,,,,,,,,...,,,,,,,,,,
94,,,,,,,,,,,...,,,4.5,,,,,,,
95,,,,,,,,,,,...,,,,,,,,,,
131,,,,0.5,,,,,,,...,,,,,,,,,,


Most entries are NaN because the dataset contains thousands of movies and each user has rated only a small fraction of them.
This is called sparsity: most cells of the matrix are empty, so only a small share of them carry actual information.
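
Sparsity can be quantified as the fraction of empty cells. The following is a minimal sketch on made-up ratings (the toy dataframe is hypothetical); the same three lines could be applied to `user_movie`.

```python
import pandas as pd

# Made-up ratings: 3 users, 3 movies, 4 ratings in total
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 20, 30],
    "rating":   [4.0, 3.5, 5.0, 2.0],
})
toy_user_movie = toy_ratings.pivot_table(
    index="user_id", columns="movie_id", values="rating"
)

n_cells = toy_user_movie.shape[0] * toy_user_movie.shape[1]
n_filled = int(toy_user_movie.notna().sum().sum())
sparsity = 1 - n_filled / n_cells  # share of cells without a rating
print(f"{n_filled} of {n_cells} cells are filled; sparsity = {sparsity:.2%}")
```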

## 2.3. Explore some correlations in the user-movie matrix

In [70]:

# Follow the steps given in the statement:
# 1. look up the movie_id of each of the three titles,
# 2. build the series s1, s2, s3 with the users' ratings for each movie,
# 3. concatenate the series into a dataframe called ratings3,
# 4. drop the rows with NaN values and display the first 10 rows.

id_pivot = movies.loc[movies['title'] == 'Finding Nemo (2003)', 'movie_id'].values[0]
id_m1 = movies.loc[movies['title'] == 'Animatrix, The (2003)', 'movie_id'].values[0]
id_m2 = movies.loc[movies['title'] == 'Hey Arnold! The Movie (2002)', 'movie_id'].values[0]

s1 = user_movie[id_pivot].dropna()
s2 = user_movie[id_m1].dropna()
s3 = user_movie[id_m2].dropna()

ratings3 = pd.concat([s1, s2, s3], axis=1)
ratings3.columns = ['Finding Nemo (2003)', 'Animatrix, The (2003)', 'Hey Arnold! The Movie (2002)']
ratings3.dropna(inplace=True)
display(ratings3.head(10))

user_id,Finding Nemo (2003),"Animatrix, The (2003)",Hey Arnold! The Movie (2002)
9867,3.5,2.5,2.5
26686,4.0,3.0,3.0
95370,5.0,5.0,5.0
181756,3.5,4.0,3.0


ratings3 is a small dataframe in which each column holds the ratings for one of the three selected movies and each row corresponds to a user_id. It keeps only the users who have rated all three movies, which here leaves just four users.

In [71]:
#The statement gives already the function to compute the Pearson correlation between two series
#we compute the pairwise correlations between the three movies and print the results

corr_fn_anim = ratings3['Finding Nemo (2003)'].corr(ratings3['Animatrix, The (2003)'])
corr_fn_hey = ratings3['Finding Nemo (2003)'].corr(ratings3['Hey Arnold! The Movie (2002)'])
corr_anim_hey = ratings3['Animatrix, The (2003)'].corr(ratings3['Hey Arnold! The Movie (2002)'])

print(f"Similarity between 'Finding Nemo (2003)' and 'Animatrix, The (2003)': {corr_fn_anim:.2f}")
print(f"Similarity between 'Finding Nemo (2003)' and 'Hey Arnold! The Movie (2002)': {corr_fn_hey:.2f}")
print(f"Similarity between 'Animatrix, The (2003)' and 'Hey Arnold! The Movie (2002)': {corr_anim_hey:.2f}")

Similarity between 'Finding Nemo (2003)' and 'Animatrix, The (2003)': 0.74
Similarity between 'Finding Nemo (2003)' and 'Hey Arnold! The Movie (2002)': 0.96
Similarity between 'Animatrix, The (2003)' and 'Hey Arnold! The Movie (2002)': 0.90


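The three movies are highly correlated with each other, perhaps because all three are animated. The highest correlation (0.96) is between Finding Nemo and Hey Arnold! The Movie, possibly because they come from the same country. However, these coefficients are computed from only the four users who rated all three movies, so they should be interpreted with caution.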
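
The three coefficients can be reproduced by hand from the four users displayed in the ratings3 output. The sketch below re-types those four rows and applies the Pearson formula directly; `pearson` is a hypothetical helper written for illustration, and it should agree with pandas' `Series.corr`.

```python
import numpy as np
import pandas as pd

# The four users shown in the ratings3 output above
ratings3_shown = pd.DataFrame(
    {
        "Finding Nemo (2003)": [3.5, 4.0, 5.0, 3.5],
        "Animatrix, The (2003)": [2.5, 3.0, 5.0, 4.0],
        "Hey Arnold! The Movie (2002)": [2.5, 3.0, 5.0, 3.0],
    },
    index=pd.Index([9867, 26686, 95370, 181756], name="user_id"),
)

def pearson(x, y):
    """Pearson correlation: co-deviation divided by the product of the deviations' norms."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

nemo = ratings3_shown["Finding Nemo (2003)"]
animatrix = ratings3_shown["Animatrix, The (2003)"]
arnold = ratings3_shown["Hey Arnold! The Movie (2002)"]

r_nemo_animatrix = pearson(nemo, animatrix)
r_nemo_arnold = pearson(nemo, arnold)
r_animatrix_arnold = pearson(animatrix, arnold)

print(f"{r_nemo_animatrix:.2f} {r_nemo_arnold:.2f} {r_animatrix_arnold:.2f}")
```

With only four data points, changing a single rating can move these values noticeably, which is why the correlations are fragile.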

In [72]:
# Extract the ratings of the pivot movie and rename the column to "rating" (code given in the statement)
df = pd.DataFrame(user_movie[id_pivot].dropna()).rename(columns={id_pivot: "rating"})
# Compute the Pearson correlation between the pivot movie's ratings and every other movie (also given in the statement)
similarity_series = user_movie.corrwith(df["rating"])
# Drop the NaN values from the series
similarity_series.dropna(inplace=True)
# Store the result in a new dataframe with two columns: movie_id and corr_with_pivot
similarity_to_pivot = pd.DataFrame({'movie_id': similarity_series.index, 'corr_with_pivot': similarity_series.values})

display(similarity_to_pivot.head(10))


Unnamed: 0,movie_id,corr_with_pivot
0,2769,0.064023
1,3177,0.284792
2,3190,0.228716
3,3225,0.106047
4,3239,-0.347351
5,3273,0.190426
6,3275,0.13094
7,3276,-0.107572
8,3285,0.032926
9,3286,0.243133


In [73]:
# Merge similarity_to_pivot and ratings_summary on movie_id,
# keep the movies with at least 1000 ratings and a median rating of at least 4.0,
# keep only the columns movie_id, corr_with_pivot, title, ratings_mean and ratings_count,
# sort by corr_with_pivot in descending order,
# and display the first 20 rows.

corr_with_pivot = pd.merge(similarity_to_pivot, ratings_summary, on='movie_id')
corr_with_pivot = corr_with_pivot[(corr_with_pivot.ratings_count >= 1000) & (corr_with_pivot.ratings_median >= 4.0)]

corr_with_pivot = corr_with_pivot[['movie_id', 'corr_with_pivot', 'title', 'ratings_mean', 'ratings_count']]
corr_with_pivot = corr_with_pivot.sort_values('corr_with_pivot', ascending=False)

display(corr_with_pivot.head(20))

Unnamed: 0,movie_id,corr_with_pivot,title,ratings_mean,ratings_count
916,6377,1.0,Finding Nemo (2003),3.83,5036
427,4886,0.656471,"Monsters, Inc. (2001)",3.84,4985
1426,8961,0.556336,"Incredibles, The (2004)",3.82,4575
2475,50872,0.549734,Ratatouille (2007),3.83,3095
284,4306,0.511667,Shrek (2001),3.76,6013
178,4016,0.419808,"Emperor's New Groove, The (2000)",3.66,1230
943,6539,0.417772,Pirates of the Caribbean: The Curse of the Bla...,3.8,5288
2025,40815,0.409237,Harry Potter and the Goblet of Fire (2005),3.77,2807
1288,8368,0.401876,Harry Potter and the Prisoner of Azkaban (2004),3.8,3421
433,4896,0.39905,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.67,3787


This dataframe lists the movies most correlated with "Finding Nemo", in descending order. The movies with the highest correlation are family-friendly animated films such as Monsters, Inc., The Incredibles, Ratatouille and Shrek. That makes sense: users who like one popular animated family movie tend to like others.
If you raise the ratings_count threshold to a much larger value, the list becomes shorter and is dominated by well-known, widely rated films; the correlations are more reliable but the list is less diverse.
If you lower the threshold, many niche movies are included: the list grows, but many correlations become noisy because they rely on very few shared raters.

## 2.4. Implement the item-based recommendations

In [74]:
item_similarity = user_movie.corr()

display(item_similarity.head(10))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,51187,51194,51255,51312,51314,51317,51402,51412,51418,51433
2769,1.0,-0.406295,-0.049976,0.119232,,,0.261728,-0.166238,-0.754337,,...,,,0.359602,,,,,-0.274301,,
3177,-0.406295,1.0,0.680619,0.174519,,,-0.010883,0.321711,0.09759,,...,-1.0,,0.223872,,,,,0.707963,,
3190,-0.049976,0.680619,1.0,0.358953,,-0.582209,0.114346,0.215633,-0.304842,,...,,,-0.047137,,,,,0.593796,,
3225,0.119232,0.174519,0.358953,1.0,,0.5,0.318758,0.473451,0.044622,,...,,,-0.398374,,,,,0.963143,,
3228,,,,,1.0,,1.0,,,,...,,,,,,,,,,
3239,,,-0.582209,0.5,,1.0,0.591579,,0.5,,...,,,-1.0,,,,,,,
3273,0.261728,-0.010883,0.114346,0.318758,1.0,0.591579,1.0,0.176796,-0.168983,,...,1.0,,0.170269,,,,,0.206301,-0.567178,
3275,-0.166238,0.321711,0.215633,0.473451,,,0.176796,1.0,0.495821,,...,0.298818,,0.233542,-1.0,,,,0.191663,0.342029,
3276,-0.754337,0.09759,-0.304842,0.044622,,0.5,-0.168983,0.495821,1.0,,...,,,0.892531,,,,,0.486664,,
3279,,,,,,,,,,,...,,,,,,,,,,


In [75]:

# Same similarity matrix, but requiring at least 100 common raters per pair of movies (min_periods)
item_similarity_min_ratings = user_movie.corr(min_periods=100)
display(item_similarity_min_ratings.head(10))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,51187,51194,51255,51312,51314,51317,51402,51412,51418,51433
2769,,,,,,,,,,,...,,,,,,,,,,
3177,,1.0,,,,,,,,,...,,,,,,,,,,
3190,,,1.0,,,,,,,,...,,,,,,,,,,
3225,,,,1.0,,,,,,,...,,,,,,,,,,
3228,,,,,,,,,,,...,,,,,,,,,,
3239,,,,,,,,,,,...,,,,,,,,,,
3273,,,,,,,1.0,0.176796,,,...,,,0.170269,,,,,,,
3275,,,,,,,0.176796,1.0,,,...,,,0.233542,,,,,0.191663,,
3276,,,,,,,,,,,...,,,,,,,,,,
3279,,,,,,,,,,,...,,,,,,,,,,
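
The difference between the two matrices comes from `min_periods`, the minimum number of users who must have rated both movies for a correlation to be reported. The sketch below uses made-up ratings for three hypothetical movies `a`, `b` and `c`: with only two common raters, `a` and `c` get a perfect but meaningless correlation, which disappears once a minimum overlap is required.

```python
import numpy as np
import pandas as pd

# Made-up user-movie matrix: rows are users, columns are movies
toy_user_movie = pd.DataFrame(
    {
        "a": [5.0, 4.0, 3.0, 1.0, 2.0],
        "b": [4.5, 4.0, 2.5, 1.5, np.nan],
        "c": [np.nan, np.nan, np.nan, 2.0, 4.0],
    }
)

# No minimum overlap: a and c share only two raters, so their correlation is exactly +/-1
sim_all = toy_user_movie.corr()

# Require at least 3 common raters per pair; pairs below the threshold become NaN
sim_min3 = toy_user_movie.corr(min_periods=3)

print(sim_all)
print(sim_min3)
```

This matches the outputs above: the unfiltered matrix contains many entries equal to 1.0 or -1.0, while in the filtered matrix even the diagonal entry of movie 2769 is NaN, since that movie has only 73 ratings.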


In [76]:
# Leave this code as-is

# Gets the rating a user_id has given to a movie_id
def get_rating(user_movie, user_id, movie_id):
    return user_movie[movie_id][user_id]

# Gets a list of rated movies for a user_id
def get_rated_movies(user_movie, user_id):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)
    
# Print rated movies
def print_rated_movies(user_movie, movies, user_id):
    for movie_id in get_rated_movies(user_movie, user_id):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_movie, user_id, movie_id), get_title(movie_id, movies)))


In [98]:
# Movie ids of the three superhero movies and the three drama movies
super_movie_1 = 5349   
super_movie_2 = 3793   
super_movie_3 = 8961   

drama_movie_1 = 3408   
drama_movie_2 = 5995   
drama_movie_3 = 4995 
# Find the users in user_movie who gave each of the three superhero movies
# a rating higher than 4.5, and keep their user ids (the index)
super_users = user_movie[
    (user_movie[super_movie_1] > 4.5)
    & (user_movie[super_movie_2] > 4.5)
    & (user_movie[super_movie_3] > 4.5)
].index
# Print how many users satisfy the condition and choose one of them at random
print("Number of superhero-type users:", len(super_users))
user_id_super = random.choice(super_users.tolist())
print("Chosen superhero fan user_id:", user_id_super)

# Same for the drama movies, with the extra condition that the user
# has not rated the first superhero movie (super_movie_1)
drama_users = user_movie[((user_movie[drama_movie_1] > 4.5) & (user_movie[drama_movie_2] > 4.5) &(user_movie[drama_movie_3] > 4.5) &
                   (user_movie[super_movie_1].isnull()))].index
print("Number of drama-type users:", len(drama_users))
user_id_drama = random.choice(drama_users.tolist())
print("Chosen drama fan user_id:", user_id_drama)

Number of superhero-type users: 26
Chosen superhero fan user_id: 86976
Number of drama-type users: 5
Chosen drama fan user_id: 15826
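
The selection relies on how comparisons treat missing ratings: `NaN > 4.5` evaluates to `False`, so users who have not rated a movie are excluded automatically. The sketch below shows this on a made-up matrix with two hypothetical movie ids (101 and 102).

```python
import numpy as np
import pandas as pd

# Made-up user-movie matrix: NaN means "not rated"
toy_user_movie = pd.DataFrame(
    {
        101: [5.0, 5.0, np.nan, 4.5],
        102: [5.0, np.nan, 5.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Both ratings must be strictly above 4.5; NaN comparisons are False
mask = (toy_user_movie[101] > 4.5) & (toy_user_movie[102] > 4.5)
fans = toy_user_movie[mask].index.tolist()

# Users who have not rated movie 101 at all
not_rated_101 = toy_user_movie.index[toy_user_movie[101].isnull()].tolist()

print("Fans of both movies:", fans)
print("Users without a rating for 101:", not_rated_101)
```

Since the ratings shown in this notebook move in half-point steps, a strict `> 4.5` effectively selects ratings of 5.0; user 4 in the toy data, with a 4.5, is therefore not selected.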


In [99]:
# LEAVE THIS CODE AS-IS
# We use this to check that the user ids you selected are correct

assert get_rating(user_movie, user_id_super, super_movie_1) > 4.5
assert get_rating(user_movie, user_id_super, super_movie_2) > 4.5
assert get_rating(user_movie, user_id_super, super_movie_3) > 4.5

assert get_rating(user_movie, user_id_drama, drama_movie_1) > 4.5
assert get_rating(user_movie, user_id_drama, drama_movie_2) > 4.5
assert get_rating(user_movie, user_id_drama, drama_movie_3) > 4.5


In [100]:
user_id_super = 86976
user_id_drama = 15826

In [101]:
# LEAVE AS-IS (TESTING CODE)

print_rated_movies(user_movie, movies, user_id_super)

3578 5.0 Gladiator (2000) 
5816 5.0 Harry Potter and the Chamber of Secrets (2002) 
27706 5.0 Lemony Snicket's A Series of Unfortunate Events (2004) 
45499 5.0 X-Men: The Last Stand (2006) 
6365 5.0 Matrix Reloaded, The (2003) 
6377 5.0 Finding Nemo (2003) 
6537 5.0 Terminator 3: Rise of the Machines (2003) 
6541 5.0 League of Extraordinary Gentlemen, The (a.k.a. LXG) (2003) 
6548 5.0 Bad Boys II (2003) 
40815 5.0 Harry Potter and the Goblet of Fire (2005) 
7153 5.0 Lord of the Rings: The Return of the King, The (2003) 
40339 5.0 Chicken Little (2005) 
7458 5.0 Troy (2004) 
36529 5.0 Lord of War (2005) 
34150 5.0 Fantastic Four (2005) 
34048 5.0 War of the Worlds (2005) 
8644 5.0 I, Robot (2004) 
8665 5.0 Bourne Supremacy, The (2004) 
33794 5.0 Batman Begins (2005) 
33679 5.0 Mr. & Mrs. Smith (2005) 
8961 5.0 Incredibles, The (2004) 
5952 5.0 Lord of the Rings: The Two Towers, The (2002) 
6156 5.0 Shanghai Knights (2003) 
45517 5.0 Cars (2006) 
4896 5.0 Harry Potter and the Sorcerer's 

In [102]:
# LEAVE AS-IS (TESTING CODE)

print_rated_movies(user_movie, movies, user_id_drama)

5377 5.0 About a Boy (2002) 
4033 5.0 Thirteen Days (2000) 
4995 5.0 Beautiful Mind, A (2001) 
5218 5.0 Ice Age (2002) 
6287 5.0 Anger Management (2003) 
3408 5.0 Erin Brockovich (2000) 
6377 5.0 Finding Nemo (2003) 
4304 5.0 Startup.com (2001) 
5957 5.0 Two Weeks Notice (2002) 
4246 5.0 Bridget Jones's Diary (2001) 
5445 5.0 Minority Report (2002) 
6873 5.0 Intolerable Cruelty (2003) 
3751 5.0 Chicken Run (2000) 
3578 5.0 Gladiator (2000) 
5995 5.0 Pianist, The (2002) 
4993 4.5 Lord of the Rings: The Fellowship of the Ring, The (2001) 
5418 4.5 Bourne Identity, The (2002) 
5608 4.5 Das Experiment (Experiment, The) (2001) 
5299 4.5 My Big Fat Greek Wedding (2002) 
5945 4.5 About Schmidt (2002) 
6155 4.5 How to Lose a Guy in 10 Days (2003) 
4963 4.5 Ocean's Eleven (2001) 
4886 4.5 Monsters, Inc. (2001) 
5943 4.5 Maid in Manhattan (2002) 
6373 4.5 Bruce Almighty (2003) 
4254 4.5 Crocodile Dundee in Los Angeles (2001) 
6378 4.5 Italian Job, The (2003) 
3793 4.5 X-Men (2000) 
6942 4.5 Love

In [103]:
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    
    # Create an empty series
    movies_relevance = pd.Series(dtype=float)
    
    # Iterate through the movies the user has rated
    for rated_movie in user_movie.loc[user_id].dropna().index:
        
        # Obtain the rating given
        rating_given = user_movie.loc[user_id, rated_movie]
      
        
        # Obtain the vector containing the similarities of watched_movie
        # with all other movies in item_similarity_matrix
        similarities = item_similarity_matrix[rated_movie]
        
        # Multiply this vector by the given rating
        weighted_similarities = similarities * rating_given
        
        # Append these terms to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df
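
The score computed by this function is, for each movie m, the sum over the movies r rated by the user of similarity(m, r) × rating(r), with missing similarities contributing nothing. As a sketch under that reading, the same totals can be obtained without an explicit loop; the similarity values and ratings below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 item similarity matrix (NaN = not enough common raters)
toy_similarity = pd.DataFrame(
    [[1.0, 0.5, np.nan],
     [0.5, 1.0, 0.2],
     [np.nan, 0.2, 1.0]],
    index=[10, 20, 30],
    columns=[10, 20, 30],
)

# One user's ratings: movies 10 and 20 are rated, movie 30 is not
toy_user_ratings = pd.Series({10: 4.0, 20: 2.0})

# relevance(m) = sum over rated movies r of similarity(m, r) * rating(r)
relevance = (
    toy_similarity[toy_user_ratings.index]   # keep the columns of the rated movies
    .fillna(0.0)                             # a missing similarity contributes nothing
    .mul(toy_user_ratings, axis="columns")   # weight each column by its rating
    .sum(axis="columns")                     # add up the contributions per movie
)
print(relevance)
```

Here movie 10 scores 1.0 × 4.0 + 0.5 × 2.0 = 5.0, while the unrated movie 30 scores 0.2 × 2.0 = 0.4. Note that already-rated movies receive a large contribution from their similarity of 1.0 with themselves.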

In [107]:
# Compute the relevance of every movie for the superhero fan
super_relevance_df = get_movies_relevance(user_id_super, user_movie, item_similarity_min_ratings)
# The result only has the columns relevance and movie_id, so merge it with movies
# (on the shared movie_id column) to add the title and genres
super_df_relevance = super_relevance_df.merge(movies, on='movie_id')
# Sort by relevance in descending order
super_df_relevance = super_df_relevance.sort_values(by='relevance', ascending=False)
# Display the top 10 results
display(super_df_relevance.head(10))

# Do the same for the drama fan

drama_relevance_df = get_movies_relevance(user_id_drama, user_movie, item_similarity_min_ratings)
drama_df_relevance = drama_relevance_df.merge(movies, on='movie_id')
drama_df_relevance = drama_df_relevance.sort_values(by='relevance', ascending=False)
display(drama_df_relevance.head(10))


Unnamed: 0,relevance,movie_id,title,genres
2132,136.78535,34150,Fantastic Four (2005),Action|Adventure|Sci-Fi
2533,132.897192,45499,X-Men: The Last Stand (2006),Action|Sci-Fi|Thriller
2603,132.737307,46972,Night at the Museum (2006),Action|Comedy|Fantasy|IMAX
2450,130.789697,44022,Ice Age 2: The Meltdown (2006),Adventure|Animation|Children|Comedy
1938,129.81023,31685,Hitch (2005),Comedy|Romance
1051,129.466319,6548,Bad Boys II (2003),Action|Comedy|Crime|Thriller
660,129.302387,5459,Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (...,Action|Comedy|Sci-Fi
297,128.042811,4270,"Mummy Returns, The (2001)",Action|Adventure|Comedy|Thriller
1592,127.400155,8972,National Treasure (2004),Action|Adventure|Drama|Mystery|Thriller
2552,127.317753,45722,Pirates of the Caribbean: Dead Man's Chest (2006),Action|Adventure|Fantasy


Unnamed: 0,relevance,movie_id,title,genres
660,77.605686,5459,Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (...,Action|Comedy|Sci-Fi
1011,76.073218,6378,"Italian Job, The (2003)",Action|Crime
1006,75.276698,6373,Bruce Almighty (2003),Comedy|Drama|Fantasy|Romance
803,73.774304,5816,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy
563,73.147109,5218,Ice Age (2002),Adventure|Animation|Children|Comedy
73,72.655465,3624,Shanghai Noon (2000),Action|Adventure|Comedy|Western
1938,72.590435,31685,Hitch (2005),Comedy|Romance
194,71.403724,4018,What Women Want (2000),Comedy|Romance
2533,70.826093,45499,X-Men: The Last Stand (2006),Action|Sci-Fi|Thriller
297,70.789855,4270,"Mummy Returns, The (2001)",Action|Adventure|Comedy|Thriller


For the superhero user, the recommendations look accurate: the movies with the highest relevance are superhero movies (Fantastic Four, X-Men: The Last Stand), and most of the others belong to closely related genres such as action and adventure.

For the drama user, the list is more diverse: it still contains action films such as The Italian Job and many comedies such as Hitch or What Women Want.
Overall, the recommendation list matches the superhero user's taste more closely than the drama user's. Note that both lists still include movies the user has already rated, for example Fantastic Four (2005) for the superhero user and Ice Age (2002) for the drama user.

In [None]:

# Given a user id, the user_movie dataframe and an item similarity matrix, return a dataframe
# with the relevance of each movie for that user, excluding the movies the user has already watched

def get_recommended_movies(user_id, user_movie, item_similarity_matrix):
    # Compute the relevance of every movie with the function defined previously
    movies_relevance_df = get_movies_relevance(user_id, user_movie, item_similarity_matrix)
    # Index by movie_id to make it easy to remove the watched movies
    movies_relevance_df = movies_relevance_df.set_index('movie_id')
    # Movies the user has already watched (a rated movie is assumed to have been watched)
    watched_movies = get_rated_movies(user_movie, user_id)
    # Drop these movies from the relevance dataframe
    movies_relevance_df = movies_relevance_df.drop(watched_movies, errors='ignore')
    # Return the dataframe with the recommended movies
    return movies_relevance_df

In [None]:
# Call the function defined above for both users and display the top 20 recommended movies for each
recommended_super = get_recommended_movies(user_id_super, user_movie, item_similarity_min_ratings)
# The result only contains the relevance, indexed by movie_id, so merge it with movies
# (on movie_id) to add the title and genres
recommended_super = recommended_super.merge(movies, on='movie_id')
# Sort by relevance in descending order
recommended_super = recommended_super.sort_values(by='relevance', ascending=False)
print("Top 20 recommended movies for the superhero fan:")
display(recommended_super.head(20))
# Do the same for the drama fan
recommended_drama = get_recommended_movies(user_id_drama, user_movie, item_similarity_min_ratings)
recommended_drama = recommended_drama.merge(movies, on='movie_id')
recommended_drama = recommended_drama.sort_values(by='relevance', ascending=False)
print("Top 20 recommended movies for the drama fan:")
display(recommended_drama.head(20))


Top 20 recommended movies for the superhero fan:


Unnamed: 0,movie_id,relevance,title,genres
2391,44022,130.789697,Ice Age 2: The Meltdown (2006),Adventure|Animation|Children|Comedy
1889,31685,129.81023,Hitch (2005),Comedy|Romance
1544,8972,127.400155,National Treasure (2004),Action|Adventure|Drama|Mystery|Thriller
2334,42738,126.119871,Underworld: Evolution (2006),Action|Fantasy|Horror
1027,6564,123.955597,Lara Croft Tomb Raider: The Cradle of Life (2003),Action|Adventure|Comedy|Romance|Thriller
2030,33646,122.871487,"Longest Yard, The (2005)",Comedy|Drama
987,6383,122.62146,"2 Fast 2 Furious (Fast and the Furious 2, The)...",Action|Crime|Thriller
1337,7454,122.603201,Van Helsing (2004),Action|Adventure|Fantasy|Horror
982,6378,122.302529,"Italian Job, The (2003)",Action|Crime
1044,6595,119.750485,S.W.A.T. (2003),Action|Thriller


Top 20 recommended movies for the drama fan:


Unnamed: 0,movie_id,relevance,title,genres
630,5459,77.605686,Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (...,Action|Comedy|Sci-Fi
69,3624,72.655465,Shanghai Noon (2000),Action|Adventure|Comedy|Western
1885,31685,72.590435,Hitch (2005),Comedy|Romance
185,4018,71.403724,What Women Want (2000),Comedy|Romance
2480,45499,70.826093,X-Men: The Last Stand (2006),Action|Sci-Fi|Thriller
285,4270,70.789855,"Mummy Returns, The (2001)",Action|Adventure|Comedy|Thriller
1426,8644,70.774255,"I, Robot (2004)",Action|Adventure|Sci-Fi|Thriller
1011,6564,70.771269,Lara Croft Tomb Raider: The Cradle of Life (2003),Action|Adventure|Comedy|Romance|Thriller
352,4701,70.621144,Rush Hour 2 (2001),Action|Comedy
192,4025,69.929018,Miss Congeniality (2000),Comedy|Crime


Superhero fan: around 65–75% of the top recommendations appear relevant (many of the top-20 items are action/adventure). There are actual superhero movies such as Hulk or Ghost Rider, but most recommendations simply share a genre with them.

Drama fan: around 30–45% appear relevant (the top results contain many comedies and romances and few clear drama matches). The list is more varied, with a lot of comedy and action as well.

After removing the movies that the user has already watched, the relevance scores of the remaining movies are unchanged (for example, Hitch (2005) scores 129.81 for the superhero fan in both lists), because the filter only drops rows and does not recompute anything. The ranking is therefore the previous one with the already-rated movies removed.
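
The filtering step can be illustrated in isolation. This is a minimal sketch with made-up relevance values and movie ids: dropping the watched movies removes rows but leaves the scores of the remaining movies untouched, and `errors="ignore"` skips watched ids that are absent from the relevance table.

```python
import pandas as pd

# Made-up relevance scores, indexed by movie_id
toy_relevance = pd.DataFrame(
    {"relevance": [5.0, 4.0, 0.4]},
    index=pd.Index([10, 20, 30], name="movie_id"),
)

# Movies the user has already rated; 99 is not in the index and is simply skipped
watched = [10, 20, 99]

toy_recommended = toy_relevance.drop(watched, errors="ignore")
print(toy_recommended)
```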

<font size="+2" color="#003300">I hereby declare that I completed this practice myself, that my answers were not written by an AI-enabled code assistant, and that except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>