# Movie Recommendations HW

**Name:**  

**Collaboration Policy:** Homeworks will be done individually: each student must hand in their own answers. Use of partial or entire solutions obtained from others or online is strictly prohibited.

**Late Policy:** Late submissions incur a penalty of 2\% for each passing hour. 

**Submission format:** Successfully complete the MovieLens recommender as described in this Jupyter notebook. Submit a `.py` and an `.ipynb` file for this notebook. You can go to `File -> Download as ->` to download a `.py` version of the notebook. 

**Only submit one `.ipynb` file and one `.py` file.** The `.ipynb` file should have answers to all the questions. Do *not* zip any files for submission. 

**Download the dataset from here:** https://grouplens.org/datasets/movielens/1m/

In [178]:
# Import all the required libraries
import numpy as np
import pandas as pd
import random

## Reading the Data
Now that we have downloaded the files from the link above and placed them in the same directory as this Jupyter Notebook, we can load each table into Pandas. The files use `::` as the field separator, so we pass `sep='::'` to `read_csv`. Execute the provided code below.

In [36]:
# Read the dataset from the three files into ratings_data, movies_data and user_data
#NOTE: if you are getting a decode error, add "encoding='ISO-8859-1'" as an additional argument
#      to the read_csv function
column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"]
ratings_data  = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, engine='python')
column_list_movies = ["MovieID","Title","Genres"]
movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, engine='python', encoding = 'latin-1')
column_list_users = ["UserID","Gender","Age","Occupation","Zixp-code"]
user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, engine='python')

`ratings_data`, `movies_data`, and `user_data` correspond to the data loaded into Pandas from `ratings.dat`, `movies.dat`, and `users.dat`, respectively.

## Data analysis

We now have all our data in Pandas - however, it's as three separate datasets! To make some more sense out of the data we have, we can use the Pandas `merge` function to combine our component data-frames. Run the following code:

In [37]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)
data

Unnamed: 0,UserID,MovieID,Ratings,Timestamp,Gender,Age,Occupation,Zixp-code,Title,Genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...,...,...,...,...
1000204,5949,2198,5,958846401,M,18,17,47901,Modulations (1998),Documentary
1000205,5675,2703,3,976029116,M,35,14,30030,Broken Vessels (1998),Drama
1000206,5780,2845,1,958153068,M,18,17,92886,White Boys (1999),Drama
1000207,5851,3607,5,957756608,F,18,20,55410,One Little Indian (1973),Comedy|Drama|Western
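
To illustrate how the nested `merge` call lines the tables up, here is a minimal sketch on made-up data. The three small frames and their values are hypothetical stand-ins for the real tables; only the column names match.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
ratings = pd.DataFrame({"UserID": [1, 2, 2], "MovieID": [10, 10, 20], "Ratings": [5, 3, 4]})
users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})
movies = pd.DataFrame({"MovieID": [10, 20], "Title": ["A (1990)", "B (1995)"]})

# With no `on=` argument, merge joins on the columns the two frames share:
# UserID for the inner merge, MovieID for the outer one.
merged = pd.merge(pd.merge(ratings, users), movies)
print(merged)
```

Each rating row picks up the matching user and movie attributes, so the merged frame has one row per rating.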


Next, we can create a pivot table to match the ratings with a given movie title. Using `data.pivot_table`, we can aggregate (using the average/`mean` function) the reviews and find the average rating for each movie. We can save this pivot table into the `mean_ratings` variable. 

In [38]:
mean_ratings=data.pivot_table('Ratings','Title',aggfunc='mean')
mean_ratings

Unnamed: 0_level_0,Ratings
Title,Unnamed: 1_level_1
"$1,000,000 Duck (1971)",3.027027
'Night Mother (1986),3.371429
'Til There Was You (1997),2.692308
"'burbs, The (1989)",2.910891
...And Justice for All (1979),3.713568
...,...
"Zed & Two Noughts, A (1985)",3.413793
Zero Effect (1998),3.750831
Zero Kelvin (Kjærlighetens kjøtere) (1995),3.500000
Zeus and Roxanne (1997),2.521739


Now, we can take the `mean_ratings` and sort it by the value of the rating itself. Using this and the `head` function, we can display the top 15 movies by average rating.

In [39]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],aggfunc='mean')
top_15_mean_ratings = mean_ratings.sort_values(by = 'Ratings',ascending = False).head(15)
top_15_mean_ratings

Unnamed: 0_level_0,Ratings
Title,Unnamed: 1_level_1
Ulysses (Ulisse) (1954),5.0
Lured (1947),5.0
Follow the Bitch (1998),5.0
Bittersweet Motel (2000),5.0
Song of Freedom (1936),5.0
One Little Indian (1973),5.0
Smashing Time (1967),5.0
Schlafes Bruder (Brother of Sleep) (1995),5.0
"Gate of Heavenly Peace, The (1995)",5.0
"Baby, The (1973)",5.0


Let's adjust our original `mean_ratings` pivot table to account for differences in gender between reviewers. The code is similar to before, except that we now provide an additional `columns` parameter, which separates the average ratings for women (`F`) and men (`M`).

In [40]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
mean_ratings

Gender,F,M
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
"$1,000,000 Duck (1971)",3.375000,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
...,...,...
"Zed & Two Noughts, A (1985)",3.500000,3.380952
Zero Effect (1998),3.864407,3.723140
Zero Kelvin (Kjærlighetens kjøtere) (1995),,3.500000
Zeus and Roxanne (1997),2.777778,2.357143


We can now sort the ratings as before, but by the `F` and `M` gendered rating columns instead of by `Ratings`. Print the top-rated movies according to female and male reviewers.

In [41]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)

mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
print(top_female_ratings.head(15))

top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
print(top_male_ratings.head(15))

Gender                                               F         M
Title                                                           
Clean Slate (Coup de Torchon) (1981)               5.0  3.857143
Ballad of Narayama, The (Narayama Bushiko) (1958)  5.0  3.428571
Raw Deal (1948)                                    5.0  3.307692
Bittersweet Motel (2000)                           5.0       NaN
Skipped Parts (2000)                               5.0  4.000000
Lamerica (1994)                                    5.0  4.666667
Gambler, The (A Játékos) (1997)                    5.0  3.166667
Brother, Can You Spare a Dime? (1975)              5.0  3.642857
Ayn Rand: A Sense of Life (1997)                   5.0  4.000000
24 7: Twenty Four Seven (1997)                     5.0  3.750000
Twice Upon a Yesterday (1998)                      5.0  3.222222
Woman of Paris, A (1923)                           5.0  2.428571
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                5.0  4.750000
Gate of Heavenly Peace, T

In [42]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

Gender,F,M,diff
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"James Dean Story, The (1957)",4.0,1.0,-3.0
Country Life (1994),5.0,2.0,-3.0
"Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)",4.0,1.0,-3.0
Babyfever (1994),3.666667,1.0,-2.666667
"Woman of Paris, A (1923)",5.0,2.428571,-2.571429
Cobra (1925),4.0,1.5,-2.5
"Other Side of Sunday, The (Søndagsengler) (1996)",5.0,2.928571,-2.071429
"To Have, or Not (1995)",4.0,2.0,-2.0
For the Moment (1994),5.0,3.0,-2.0
Phat Beach (1996),3.0,1.0,-2.0


Let's try grouping the data-frame instead, to see how different titles compare in terms of the number of ratings. Group by `Title` and then take the top 10 items by number of reviews. This shows the most frequently reviewed titles.

In [None]:
ratings_by_title=data.groupby('Title').size()
ratings_by_title.sort_values(ascending=False).head(10)

Similarly, we can filter our grouped data-frame to get all titles with a certain number of reviews. Filter the dataset to get all movie titles such that the number of reviews is >= 2500.
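
One possible way to express this filter is sketched below on toy data. The frame `toy` and the threshold `min_reviews` are hypothetical; on the real data the threshold would be 2500.

```python
import pandas as pd

# Hypothetical toy data; the real `data` frame has about one million rows
toy = pd.DataFrame({"Title": ["A", "A", "A", "B", "C", "C"]})
min_reviews = 2  # stand-in for the 2500 threshold used in the text

ratings_by_title = toy.groupby("Title").size()
# Boolean-mask the Series of counts, then keep the index (the titles)
active_titles = ratings_by_title.index[ratings_by_title >= min_reviews]
print(list(active_titles))
```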

## Question 1

Create a ratings matrix using Numpy. This matrix allows us to see the ratings for a given movie and user ID. The element at location $[i,j]$ is a rating given by user $i$ for movie $j$. Print the **shape** of the matrix produced.  

Additionally, choose 3 users that have rated the movie with MovieID "**1377**" (Batman Returns). Print these ratings; they will be used later for comparison.


**Notes:**
- Do *not* use `pivot_table`.
- A ratings matrix is *not* the same as `ratings_data` from above.
- The ratings of movie with MovieID $i$ are stored in the ($i$-1)th column (index starts from 0)  
- Not every user has rated every movie. Missing entries should be set to 0 for now.
- If you're stuck, you might want to look into `np.zeros` and how to use it to create a matrix of the desired shape.
- Every review lies between 1 and 5, and thus fits within a `uint8` datatype, which you can specify to numpy.

In [309]:
# Create the matrix
# Size the matrix by the largest IDs so that MovieID i maps to column i-1,
# even when some MovieIDs are never used.
n_users = data['UserID'].max()
n_movies = data['MovieID'].max()

# Kept as float (not uint8) so that fractional means can be stored later
ratings_matrix = np.zeros((n_users, n_movies))

# Fill in every rating at once with integer-array indexing.
# IDs are 1-based while array indices are 0-based, hence the -1.
# This also avoids overwriting the `user_data` DataFrame loaded earlier.
rows = data['UserID'].to_numpy() - 1
cols = data['MovieID'].to_numpy() - 1
ratings_matrix[rows, cols] = data['Ratings'].to_numpy()

In [310]:
# Print the shape

print("Shape of the Ratings Matrix:", ratings_matrix.shape)

Shape of the Ratings Matrix: (6040, 3952)


In [313]:
# Store and print ratings for Batman Returns

movie_title = "Batman Returns (1992)"

subset = data[data['Title'] == movie_title]

if not subset.empty:
    movie_id = subset.iloc[0]['MovieID']

#Choose 3 users
random_numbers = [random.randint(1, 6040) for _ in range(3)]
print(random_numbers)

[3359, 5392, 306]
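
Note that `random.randint(1, 6040)` draws from all user IDs, so a drawn user may not have rated the movie at all. A minimal sketch of restricting the draw to users who did rate it follows; the matrix `toy_ratings`, the column `movie_col`, and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Hypothetical 5-user x 4-movie ratings matrix; 0 marks a missing rating
toy_ratings = np.array([
    [5, 0, 3, 0],
    [0, 0, 4, 1],
    [2, 0, 5, 0],
    [0, 3, 0, 0],
    [4, 0, 1, 2],
])
movie_col = 2  # column index, i.e. MovieID - 1

# Row indices of users with a non-zero rating for this movie
raters = np.flatnonzero(toy_ratings[:, movie_col])
chosen = rng.choice(raters, size=3, replace=False)

for row in chosen:
    # row + 1 converts the 0-based row index back to a UserID
    print(f"UserID {row + 1}: rating {toy_ratings[row, movie_col]}")
```

Keeping the chosen values as row indices, and converting to a UserID only for display, avoids off-by-one mistakes when the same users are looked up again later.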


## Question 2

Normalize the ratings matrix (created in **Question 1**) using Z-score normalization. While we can't use `sklearn`'s `StandardScaler` for this step, we can do the statistical calculations ourselves to normalize the data.

Before you start:
- Your first step should be to get the average of every *column* of the ratings matrix (we want an average by title, not by user!).
- Make sure that the mean is calculated considering only non-zero elements. If there is a movie which is rated only by 10 users, we get its mean rating using (sum of the 10 ratings)/10 and **NOT** (sum of 10 ratings)/(total number of users)
- All of the missing values in the dataset should be replaced with the average rating for the given movie. This is a complex topic, but for our case replacing empty values with the mean will make it so that the absence of a rating doesn't affect the overall average, and it provides an "expected value" which is useful for computing correlations and recommendations in later steps.
- In our matrix, 0 represents a missing rating.
- Next, we want to subtract the average from the original ratings thus allowing us to get a mean of 0 in every *column*. It may be very close but not exactly zero because of the limited precision `float`s allow.
- Lastly, divide this by the standard deviation of the *column*.

- Not every MovieID is used, leading to zero columns. This will cause a divide by zero error when normalizing the matrix. Simply replace any NaN values in your normalized matrix with 0.
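
The steps above can be sketched in vectorized form on a toy matrix. This is only an illustration under assumed data: the matrix `R` is hypothetical, and its last column plays the role of an unused MovieID.

```python
import numpy as np

# Hypothetical 4-user x 3-movie matrix; 0 marks a missing rating.
# The last column is an unused MovieID (no ratings at all).
R = np.array([
    [5.0, 0.0, 0.0],
    [3.0, 2.0, 0.0],
    [0.0, 4.0, 0.0],
    [1.0, 0.0, 0.0],
])

rated = R > 0
counts = rated.sum(axis=0)
# Column means over rated entries only; unused columns become NaN
col_means = np.divide(R.sum(axis=0), counts,
                      out=np.full(R.shape[1], np.nan), where=counts > 0)

filled = np.where(rated, R, col_means)   # impute missing entries with the mean
centered = filled - col_means            # column means are now ~0
with np.errstate(divide="ignore", invalid="ignore"):
    Z = centered / filled.std(axis=0)    # unused columns give NaN here
Z = np.nan_to_num(Z)                     # replace those NaNs with 0
print(Z.round(3))
```

Imputed entries end up exactly at 0 after centering, which is why a missing rating carries no signal in the normalized matrix.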

In [312]:
non_zero_ratings_matrix = np.where(ratings_matrix > 0, ratings_matrix, np.nan)

#Get the average of every column of the ratings matrix 
mean_ratings_by_movie = []
for movie_ratings in non_zero_ratings_matrix.T:
    if not np.isnan(movie_ratings).all():
        mean_rating = np.nanmean(movie_ratings)
        mean_ratings_by_movie.append(mean_rating)
    else:
        mean_ratings_by_movie.append(np.nan)
        
mean_ratings_by_movie = np.array(mean_ratings_by_movie)

#Replace the missing values
ratings_matrix_replace = np.copy(ratings_matrix)
zero_indices = ratings_matrix_replace == 0
ratings_matrix_replace[zero_indices] = np.tile(mean_ratings_by_movie, (ratings_matrix.shape[0], 1))[zero_indices]

#Subtract the average
ratings_matrix_subtract = ratings_matrix_replace - mean_ratings_by_movie
std_devs = np.std(ratings_matrix_replace, axis=0)

#Divide by standard deviation
ratings_matrix_normalized = ratings_matrix_subtract/std_devs

#Replace NaN values
ratings_matrix_normalized = np.nan_to_num(ratings_matrix_normalized)



## Question 3

We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix from the previous question. Perform the process using numpy, and along the way print the shapes of the $U$, $S$, and $V$ matrices you calculated.

In [270]:
# Compute the SVD of the normalised matrix

U, S, VT = np.linalg.svd(ratings_matrix_normalized)
Sigma = np.diag(S)

In [271]:
# Print the shapes

print("Shape of U matrix:", U.shape)
print("Shape of Sigma matrix:", S.shape)
print("Shape of VT matrix:", VT.shape)

Shape of U matrix: (6040, 6040)
Shape of Sigma matrix: (3952,)
Shape of VT matrix: (3952, 3952)
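
`np.linalg.svd` returns the singular values as a 1-D array, which is why the second shape above is `(3952,)` rather than a square matrix. As a side note, the sketch below compares the default output with the reduced form obtained via `full_matrices=False`; the small random matrix `A` is a hypothetical stand-in for the normalized ratings matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # hypothetical small stand-in matrix

# Default (full_matrices=True): U is 6x6, S is a length-4 vector, VT is 4x4
U_full, S_full, VT_full = np.linalg.svd(A)

# Reduced form: U is 6x4, which is all that a rank-k reconstruction needs
U_red, S_red, VT_red = np.linalg.svd(A, full_matrices=False)

print(U_full.shape, S_full.shape, VT_full.shape)
print(U_red.shape, S_red.shape, VT_red.shape)

# The reduced factors still reproduce A up to floating-point error
print(np.allclose(U_red @ np.diag(S_red) @ VT_red, A))
```

For a tall matrix such as the ratings matrix, the reduced form may save memory, since the extra columns of the full $U$ never enter the product.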


## Question 4

Reconstruct four rank-$k$ rating matrices $R_k$, where $R_k = U_kS_kV_k^T$, for $k \in \{100, 1000, 2000, 3000\}$. Using each $R_k$, make predictions for the 3 users selected in Question 1 for the movie with ID 1377 (Batman Returns). Compare the original ratings with the predicted ratings.
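
As a rough illustration of what truncating to rank $k$ does, the sketch below reconstructs a small random matrix at several ranks and reports the reconstruction error. The matrix `A` and the helper `rank_k` are hypothetical names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))  # hypothetical stand-in for the normalized matrix
U, S, VT = np.linalg.svd(A, full_matrices=False)

def rank_k(U, S, VT, k):
    """Return U_k S_k V_k^T; scaling columns by S avoids building a diagonal matrix."""
    return (U[:, :k] * S[:k]) @ VT[:k, :]

for k in (1, 3, 5):
    err = np.linalg.norm(A - rank_k(U, S, VT, k))  # Frobenius norm of the residual
    print(f"k={k}: Frobenius error {err:.4f}")
```

The error shrinks as $k$ grows and vanishes (up to floating-point precision) once $k$ reaches the rank of the matrix, so a full-rank reconstruction simply returns the input.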

In [314]:
ranks = [100, 1000, 2000, 3000, np.linalg.matrix_rank(ratings_matrix_normalized)]

print("Original Ratings:")
print(ratings_matrix_normalized[random_numbers[0] , movie_id - 1])
print(ratings_matrix_normalized[random_numbers[1] , movie_id - 1])
print(ratings_matrix_normalized[random_numbers[2] , movie_id - 1])

print("Predicted Ratings:")
for k in ranks:
    print(f"Rank {k}:")
    predictions = U[:,:k]@Sigma[:k,:k]@VT[:k,:]
    print(predictions[random_numbers[0] , movie_id - 1])
    print(predictions[random_numbers[1] , movie_id - 1])
    print(predictions[random_numbers[2] , movie_id - 1])

Original Ratings:
0.0
0.0
0.0
Predicted Ratings:
Rank 100:
0.011556717366372638
-0.49734101142688614
-0.14349646010474817
Rank 1000:
0.3607441460031015
0.6244195975299992
-0.38500797927274566
Rank 2000:
0.15267017120043594
-0.06708749419597887
-0.04355128696819266
Rank 3000:
0.09780270251282296
-0.035437482670307656
-0.023405248014827258
Rank 3558:
4.0783348920214735e-15
7.873909857458727e-15
-4.961309141293668e-16


## Question 5

### Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space, so its value lies in $cosine(x,y) \in [-1,1]$. A value of $1$ (angle of 0, parallel) means the two items point in the same direction and are maximally similar, $0$ (perpendicular) means there is no similarity, and $-1$ means they point in opposite directions. For vectors with only non-negative entries the range narrows to $[0,1]$, but the normalized ratings used here contain negative values.

$$ cosine(x,y) = \frac{x^T y}{||x|| ||y||}  $$

**Based on the rank-1000 reconstructed rating matrix $R_{1000}$ and the cosine similarity,** find the movies that are most similar to a given movie. Write a function `top_movie_similarity` which sorts the data by similarity to the movie with ID `movie_id` and returns the top $n$ items, and a second function `print_similar_movies` which prints the titles of those similar movies. Return the top 5 movies for the movie with ID `1377` (*Batman Returns*).

Note: While finding the cosine similarity, there are a few empty columns which will have a magnitude of **zero** resulting in NaN values. These should be replaced by 0, otherwise these columns will show most similarity with the given movie. 
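
A minimal sketch of column-wise cosine similarity with the zero-norm case handled explicitly is shown below. The function name `cosine_to_column` and the matrix `M` are hypothetical; column 3 of `M` plays the role of an empty movie column.

```python
import numpy as np

def cosine_to_column(M, j):
    """Cosine similarity between column j of M and every column of M."""
    target = M[:, j]
    norms = np.linalg.norm(M, axis=0) * np.linalg.norm(target)
    dots = M.T @ target
    # Zero-norm columns would give 0/0; report them as similarity 0 instead
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Hypothetical 3-user x 4-movie matrix:
# column 1 is a scaled copy of column 0, column 2 points the opposite way,
# and column 3 is an empty (all-zero) movie.
M = np.array([
    [1.0, 2.0, -1.0, 0.0],
    [0.0, 0.0,  0.0, 0.0],
    [1.0, 2.0, -1.0, 0.0],
])
sims = cosine_to_column(M, 0)
print(sims)

# Most similar first, dropping the query column itself
order = [int(i) for i in np.argsort(sims)[::-1] if i != 0]
print(order)
```

Filtering out the query index explicitly, rather than slicing off the first sorted entry, stays correct even when another column ties with the query at similarity 1.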

In [303]:
def cosine_similarity(matrix, movie_id):
    target_movie_vector = matrix[movie_id]
    denominators = (np.linalg.norm(matrix, axis=1) * np.linalg.norm(target_movie_vector))
    epsilon = 1e-9  # Small positive value
    denominators[denominators < epsilon] = epsilon
    cos_sim = np.dot(matrix, target_movie_vector) / denominators
    return cos_sim
    
# Sort the movies based on cosine similarity
def top_movie_similarity(data, movie_id, top_n = 5):
    sim_scores = cosine_similarity(data.T, movie_id)
    top_indices = np.argsort(sim_scores)[::-1][1:top_n+1]  
    return top_indices
    
def print_movies(top_indices):
    top_indices = top_indices + 1
    similar_movie_titles = data['Title'].loc[data['MovieID'].isin(top_indices)].unique()
    for i, title in enumerate(similar_movie_titles, 1):
        print(f"Similar Movie {i}: {title}")
        print(data[data['Title'] == movie_title]['Genres'].values[0])
        print()

# Print the top 5 movies for Batman Returns
movie_id = 1377
ratings_matrix_rank_1000 = U[:,:1000]@Sigma[:1000,:1000]@VT[:1000,:]
top_indices = top_movie_similarity(ratings_matrix_rank_1000, movie_id - 1)
print('Most Similar movies: ')
print_movies(top_indices)

Most Similar movies: 
Similar Movie 1: Batman Forever (1995)
Action|Adventure|Comedy|Crime

Similar Movie 2: Batman (1989)
Action|Adventure|Comedy|Crime

Similar Movie 3: Back to the Future Part II (1989)
Action|Adventure|Comedy|Crime

Similar Movie 4: Dick Tracy (1990)
Action|Adventure|Comedy|Crime

Similar Movie 5: Batman & Robin (1997)
Action|Adventure|Comedy|Crime



## Question 6

### Movie Recommendations
Using the same process from Question 5, write `top_user_similarity` which sorts data by its similarity to a user with ID `user_id` and returns the top result. Then find the MovieIDs of the movies that this similar user has rated most highly, but that `user_id` has not yet seen. Find at least 5 movie recommendations for the user with ID `5954` and print their titles.

Hint: To check your results, find the genres of the movies that the user likes and compare with the genres of the recommended movies.
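
One subtlety in this step is that sorting a filtered array yields positions within the filtered array, not movie columns, so the positions have to be mapped back. The sketch below shows one way to do that on toy data; the function `recommend` and the matrix `toy` are hypothetical.

```python
import numpy as np

def recommend(ratings, user_row, similar_row, n=5):
    """Column indices of the similar user's top-rated movies that user_row has not rated."""
    # Columns the target user has not rated but the similar user has
    unseen = np.flatnonzero((ratings[user_row] == 0) & (ratings[similar_row] > 0))
    # Sort those columns by the similar user's rating, highest first,
    # indexing back through `unseen` to recover real column indices
    ranked = unseen[np.argsort(ratings[similar_row, unseen], kind="stable")[::-1]]
    return ranked[:n]

# Hypothetical 2-user x 6-movie matrix; 0 marks a missing rating
toy = np.array([
    [5, 0, 0, 4, 0, 0],   # target user
    [4, 5, 2, 5, 0, 3],   # most similar user
])
cols = recommend(toy, user_row=0, similar_row=1, n=2)
print(cols + 1)  # convert column indices back to MovieIDs
```

Both arguments here are 0-based row indices, so a UserID should be decremented exactly once before being passed in.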

In [315]:
#Sort users based on cosine similarity

def top_user_similarity(data, user_id, top_n=1):
    sim_scores = cosine_similarity(data, user_id)
    top_indices = np.argsort(sim_scores)[::-1][1:top_n+1]  # Exclude the target user itself
    return top_indices

def recommend_movies(user_id, most_similar_user_id, data, num_recommendations=5):
    similar_user_ratings = ratings_matrix[most_similar_user_id]
    user_ratings = ratings_matrix[user_id]
    unseen_movies = similar_user_ratings[user_ratings == 0]

    recommendations = np.argsort(unseen_movies)[::-1][:num_recommendations]
    return recommendations

user_id = 5954
most_similar_users = top_user_similarity(ratings_matrix_rank_1000, user_id - 1, top_n=1)
most_similar_user_id = most_similar_users[0]
recommended_movies = recommend_movies(user_id - 1, most_similar_user_id - 1, ratings_matrix)
print('Top 5 recommended movies: ')
print_movies(recommended_movies)
print(f'Top 5 favourite movies of user {user_id}: ')
print_movies(np.argsort(ratings_matrix[user_id - 1])[::-1][:5])

Top 5 recommended movies: 
Similar Movie 1: American Werewolf in Paris, An (1997)
Action|Adventure|Comedy|Crime

Similar Movie 2: Minus Man, The (1999)
Action|Adventure|Comedy|Crime

Similar Movie 3: Love Stinks (1999)
Action|Adventure|Comedy|Crime

Similar Movie 4: Rich and Strange (1932)
Action|Adventure|Comedy|Crime

Similar Movie 5: Tinseltown (1998)
Action|Adventure|Comedy|Crime

Top 5 favourite movies of user 5954: 
Similar Movie 1: Toy Story (1995)
Action|Adventure|Comedy|Crime

Similar Movie 2: Field of Dreams (1989)
Action|Adventure|Comedy|Crime

Similar Movie 3: Man Who Would Be King, The (1975)
Action|Adventure|Comedy|Crime

Similar Movie 4: Hidden, The (1987)
Action|Adventure|Comedy|Crime

Similar Movie 5: My Life as a Dog (Mitt liv som hund) (1985)
Action|Adventure|Comedy|Crime

