# Collaborative Filtering on Movie Dataset
In this Notebook, we will implement an algorithm that performs collaborative filtering on a dataset. We will test our algorithm on a small synthetic (artificial) dataset before we use it to recommend items from a larger dataset - the [MovieLens dataset](https://grouplens.org/datasets/movielens/100k/).

Our Notebooks in CSMODEL are designed to be guided learning activities. To use them, simply go through the cells from top to bottom, following the directions along the way. If you find any unclear parts or mistakes in the Notebooks, email your instructor.

## Instructions
* Read each cell and implement the TODOs sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Answer all the markdown/text cells with 'Question #' on them. The answer must strictly consume one line only.
* You are expected to search how some functions work on the Internet or via the docs.
* The notebooks will undergo a 'Restart and Run All' command, so make sure that your code is working properly.
* You are expected to understand the dataset loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

## Import
Import **numpy**, **pandas**, and **matplotlib**.

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Synthetic Dataset
Before we use a more complicated dataset, we will first demonstrate collaborative filtering using a synthetic (artificial) dataset drawn from a random sample. Suppose that the values in the synthetic dataset represent the ratings, on a scale of 1 to 5, that people gave to different movies. Each row represents a movie, while each column represents a person. The synthetic dataset contains 6 different movies rated by 8 different people. A value of `0` means that the person has not rated that movie yet.

In [20]:
np.random.seed(19)
data = np.random.choice(6, (6, 8))
print(data)

[[5 5 2 0 3 4 2 2]
 [0 2 5 1 2 2 4 0]
 [1 3 1 4 5 5 4 1]
 [1 2 5 3 1 0 4 5]
 [5 0 1 4 4 1 4 3]
 [1 5 4 1 3 5 1 3]]


Convert the dataset from a `numpy` array to a `pandas` `DataFrame`.

In [21]:
rows = ['Movie ' + str(x) for x in range(data.shape[0])]
columns = ['User ' + str(x) for x in range(data.shape[1])]
syn_df = pd.DataFrame(data, index=rows, columns=columns)
syn_df

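| | User 0 | User 1 | User 2 | User 3 | User 4 | User 5 | User 6 | User 7 |
|---|---|---|---|---|---|---|---|---|
| Movie 0 | 5 | 5 | 2 | 0 | 3 | 4 | 2 | 2 |
| Movie 1 | 0 | 2 | 5 | 1 | 2 | 2 | 4 | 0 |
| Movie 2 | 1 | 3 | 1 | 4 | 5 | 5 | 4 | 1 |
| Movie 3 | 1 | 2 | 5 | 3 | 1 | 0 | 4 | 5 |
| Movie 4 | 5 | 0 | 1 | 4 | 4 | 1 | 4 | 3 |
| Movie 5 | 1 | 5 | 4 | 1 | 3 | 5 | 1 | 3 |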


Since a value of `0` means that the person has not rated that movie yet, let us replace `0` with the value `NaN`. This is useful because cells with the value `NaN` can then be excluded from the computation.

In [22]:
syn_df = syn_df.replace(0, np.nan)
syn_df

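| | User 0 | User 1 | User 2 | User 3 | User 4 | User 5 | User 6 | User 7 |
|---|---|---|---|---|---|---|---|---|
| Movie 0 | 5.0 | 5.0 | 2 | NaN | 3 | 4.0 | 2 | 2.0 |
| Movie 1 | NaN | 2.0 | 5 | 1.0 | 2 | 2.0 | 4 | NaN |
| Movie 2 | 1.0 | 3.0 | 1 | 4.0 | 5 | 5.0 | 4 | 1.0 |
| Movie 3 | 1.0 | 2.0 | 5 | 3.0 | 1 | NaN | 4 | 5.0 |
| Movie 4 | 5.0 | NaN | 1 | 4.0 | 4 | 1.0 | 4 | 3.0 |
| Movie 5 | 1.0 | 5.0 | 4 | 1.0 | 3 | 5.0 | 1 | 3.0 |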


## Filtering the Synthetic Dataset
Open the `collaborative_filtering.py` file. Some of the functions in the `CollaborativeFiltering` class are not yet implemented. We will implement the missing parts of this class.

Import the `CollaborativeFiltering` class.

In [26]:
from collaborative_filtering import CollaborativeFiltering

Instantiate a `CollaborativeFiltering` object with `k` equal to `2`. The parameter `k` indicates the number of similar items that we need to consider in giving similar recommendations.

Assign the object to variable `cfilter`.

In [27]:
# Write your code here
cfilter = CollaborativeFiltering(k=2)

Open the `collaborative_filtering.py` file and complete the `get_row_mean()` function. If the parameter `data` is a `DataFrame`, the function will return a `Series` containing the mean of each row in the `DataFrame`. If the parameter `data` is a `Series`, the function will return an `np.float64` which is the mean of the `Series`. This function should ignore blank ratings, which are represented as `NaN`. Inline comments should help you in completing the contents of the function.
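To make the expected behavior concrete, here is a minimal standalone sketch of a NaN-aware row mean. The name `row_mean_sketch` is hypothetical and is not part of the `CollaborativeFiltering` class. It assumes that the built-in `mean()` of pandas, which skips `NaN` by default, is an acceptable way to ignore blank ratings.

```python
import numpy as np
import pandas as pd


def row_mean_sketch(data):
    """Return per-row means for a DataFrame, or one mean for a Series."""
    if isinstance(data, pd.DataFrame):
        # axis=1 averages across the columns (users) of each row (movie)
        return data.mean(axis=1, skipna=True)
    # A Series is treated as a single row of ratings
    return data.mean(skipna=True)


# First two rows of the synthetic dataset, with unrated cells as NaN
ratings = pd.DataFrame(
    [[5, 5, 2, np.nan, 3, 4, 2, 2],
     [np.nan, 2, 5, 1, 2, 2, 4, np.nan]],
    index=['Movie 0', 'Movie 1'],
    columns=['User ' + str(x) for x in range(8)],
)

row_means = row_mean_sketch(ratings)                     # Series, one value per movie
movie_0_mean = row_mean_sketch(ratings.loc['Movie 0'])   # single float
print(row_means.round(2))
print('{:.2f}'.format(movie_0_mean))
```

Because `NaN` cells are skipped, movie `0` is averaged over 7 ratings and movie `1` over 6 ratings, not over all 8 users.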

Get the row mean for movie `0` by calling the function `get_row_mean()` and assign the return value to variable `mean_0`.

In [28]:
# Write your code here
mean_0 = cfilter.get_row_mean(syn_df).iloc[0]

In [29]:
print('{:.2f}'.format(mean_0))

3.29


**Question #1:** What is the average rating of the movie `0`? Limit to 2 decimal places.

Answer: 3.29

Get the row mean for all movies by calling the function `get_row_mean()` and assign the return value to variable `mean`.

In [30]:
# Write your code here
mean = cfilter.get_row_mean(syn_df)

In [31]:
print(mean.round(2))

Movie 0    3.29
Movie 1    2.67
Movie 2    3.00
Movie 3    3.00
Movie 4    3.14
Movie 5    2.88
dtype: float64


**Question #2:** What is the average rating of the movie `3`? Limit to 2 decimal places.

Answer: 3.00

Open the `collaborative_filtering.py` file and complete the `normalize_data()` function. This function normalizes the dataset by subtracting each movie's row mean from every user rating of that movie. Inline comments should help you in completing the contents of the function.
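The row-wise subtraction can be sketched as follows. The name `normalize_sketch` is hypothetical, and the sketch assumes that `data` is a `DataFrame` and `mean` is a `Series` indexed by the same row labels. It is one possible approach, not necessarily the one the inline comments in the class describe.

```python
import numpy as np
import pandas as pd


def normalize_sketch(data, mean):
    """Subtract each row's mean from every rating in that row."""
    # axis=0 aligns `mean` with the row labels, so each movie uses its own mean
    return data.sub(mean, axis=0)


# Two rows of the synthetic dataset, with unrated cells as NaN
ratings = pd.DataFrame(
    [[5, 5, 2, np.nan, 3, 4, 2, 2],
     [1, 3, 1, 4, 5, 5, 4, 1]],
    index=['Movie 0', 'Movie 2'],
    columns=['User ' + str(x) for x in range(8)],
)

centered = normalize_sketch(ratings, ratings.mean(axis=1))
print(centered.round(2))
```

After normalization, a positive value means the user rated the movie above that movie's average, a negative value means below average, and `NaN` cells stay `NaN`.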

Normalize the ratings of all movies by calling the function `normalize_data()` and assign the return value to variable `normalized_df`.

In [32]:
# Write your code here
normalized_df = cfilter.normalize_data(syn_df, mean)

In [33]:
normalized_df.round(2)

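| | User 0 | User 1 | User 2 | User 3 | User 4 | User 5 | User 6 | User 7 |
|---|---|---|---|---|---|---|---|---|
| Movie 0 | 1.71 | 1.71 | -1.29 | NaN | -0.29 | 0.71 | -1.29 | -1.29 |
| Movie 1 | NaN | -0.67 | 2.33 | -1.67 | -0.67 | -0.67 | 1.33 | NaN |
| Movie 2 | -2.0 | 0.0 | -2.0 | 1.0 | 2.0 | 2.0 | 1.0 | -2.0 |
| Movie 3 | -2.0 | -1.0 | 2.0 | 0.0 | -2.0 | NaN | 1.0 | 2.0 |
| Movie 4 | 1.86 | NaN | -2.14 | 0.86 | 0.86 | -2.14 | 0.86 | -0.14 |
| Movie 5 | -1.88 | 2.12 | 1.12 | -1.88 | 0.12 | 2.12 | -1.88 | 0.12 |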


**Question #3:** Did user `0` like movie `0`? Yes or No?

Answer: Yes

**Question #4:** Did user `1` like movie `2`? Yes or No?

Answer: No

Open the `collaborative_filtering.py` file and complete the `get_cosine_similarity()` function. This function computes and returns the cosine similarity between two vectors of the same shape. The cosine similarity, $S_c$, between two vectors $A$ and $B$ is computed as:
$$S_c(A, B)=\dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$

This function has 2 parameters - `vector1` and `vector2`. You may pass these combinations of data types to this function:
- a `Series` and a `Series` - the function returns a single similarity based on these two vectors. The data type of the result is `np.float64`.
- a `DataFrame` and a `Series` - the function returns a `Series` of similarities between a single vector (represented as a `Series`) and each vector in a set of vectors (the rows of a `DataFrame`). If the shape of the `DataFrame` is (3, 2), the shape of the `Series` should be (2,) to enable broadcasting. This operation will result in a `Series` of shape (3,).
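The two call patterns above can be sketched in a standalone function. The name `cosine_similarity_sketch` is hypothetical, and the `NaN` handling is an assumption: every sum skips `NaN` cells independently. That choice reproduces the similarities reported further down in this notebook (for example `-0.49` for movies `1` and `2`), but the class may be implemented differently.

```python
import numpy as np
import pandas as pd


def cosine_similarity_sketch(vector1, vector2):
    """Cosine similarity of a Series, or of each DataFrame row, against a Series."""
    norm2 = np.sqrt((vector2 ** 2).sum())
    if isinstance(vector1, pd.DataFrame):
        # Multiply every row by vector2 (aligned on column labels), then sum per row
        dot = vector1.mul(vector2, axis=1).sum(axis=1)
        norm1 = np.sqrt((vector1 ** 2).sum(axis=1))
    else:
        dot = (vector1 * vector2).sum()
        norm1 = np.sqrt((vector1 ** 2).sum())
    return dot / (norm1 * norm2)


# The synthetic dataset printed earlier, hard-coded so the sketch is self-contained
raw = np.array([[5, 5, 2, 0, 3, 4, 2, 2],
                [0, 2, 5, 1, 2, 2, 4, 0],
                [1, 3, 1, 4, 5, 5, 4, 1],
                [1, 2, 5, 3, 1, 0, 4, 5],
                [5, 0, 1, 4, 4, 1, 4, 3],
                [1, 5, 4, 1, 3, 5, 1, 3]], dtype=float)
df = pd.DataFrame(raw,
                  index=['Movie ' + str(x) for x in range(6)],
                  columns=['User ' + str(x) for x in range(8)]).replace(0, np.nan)
normalized = df.sub(df.mean(axis=1), axis=0)

# Series and Series: a single similarity value
sim_1_2 = cosine_similarity_sketch(normalized.loc['Movie 1'], normalized.loc['Movie 2'])

# DataFrame and Series: one similarity per row of the DataFrame
others = normalized.drop('Movie 4')
sim_4 = cosine_similarity_sketch(others, normalized.loc['Movie 4'])

print(round(sim_1_2, 2))
print(sim_4.round(2))
```

With the `DataFrame` passed first, the result is indexed by movie (one value per row), which is the shape described in the second bullet.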

Implement the `get_cosine_similarity()` function. Inline comments should help you in completing the contents of the function.

Get the cosine similarity between movie `2` and itself by calling the function `get_cosine_similarity()` and assign the return value to variable `sim_2_2`.

In [34]:
# Write your code here
m2_vector = normalized_df.loc['Movie 2']

sim_2_2 = cfilter.get_cosine_similarity(m2_vector, m2_vector)

In [35]:
print('Movie 2:', [round(x, 2) for x in normalized_df.iloc[2, :]])
print('Movie 2:', [round(x, 2) for x in normalized_df.iloc[2, :]])
print('Cosine similarity:', '{:.2f}'.format(sim_2_2), '\n')

Movie 2: [-2.0, 0.0, -2.0, 1.0, 2.0, 2.0, 1.0, -2.0]
Movie 2: [-2.0, 0.0, -2.0, 1.0, 2.0, 2.0, 1.0, -2.0]
Cosine similarity: 1.00 



**Question #5:** What is the cosine similarity between movie `2` and itself? Limit to 2 decimal places.

Answer: 1.00

Get the cosine similarity between movie `1` and movie `2` by calling the function `get_cosine_similarity()` and assign the return value to variable `sim_1_2`.

In [38]:
# Write your code here
m1_vector = normalized_df.loc['Movie 1']
sim_1_2 = cfilter.get_cosine_similarity(m2_vector, m1_vector)

In [39]:
print('Movie 1:', [round(x, 2) for x in normalized_df.iloc[1, :]])
print('Movie 2:', [round(x, 2) for x in normalized_df.iloc[2, :]])
print('Cosine similarity:', '{:.2f}'.format(sim_1_2), '\n') # -0.49

Movie 1: [nan, -0.67, 2.33, -1.67, -0.67, -0.67, 1.33, nan]
Movie 2: [-2.0, 0.0, -2.0, 1.0, 2.0, 2.0, 1.0, -2.0]
Cosine similarity: -0.49 



**Question #6:** What is the cosine similarity between movie `1` and movie `2`? Limit to 2 decimal places.

Answer: -0.49

Print the normalized score for movies `0` to `5`.

In [40]:
print('Movie 0:', [round(x, 2) for x in normalized_df.iloc[0, :]])
print('Movie 1:', [round(x, 2) for x in normalized_df.iloc[1, :]])
print('Movie 2:', [round(x, 2) for x in normalized_df.iloc[2, :]])
print('Movie 3:', [round(x, 2) for x in normalized_df.iloc[3, :]])
print('Movie 4:', [round(x, 2) for x in normalized_df.iloc[4, :]])
print('Movie 5:', [round(x, 2) for x in normalized_df.iloc[5, :]])

Movie 0: [1.71, 1.71, -1.29, nan, -0.29, 0.71, -1.29, -1.29]
Movie 1: [nan, -0.67, 2.33, -1.67, -0.67, -0.67, 1.33, nan]
Movie 2: [-2.0, 0.0, -2.0, 1.0, 2.0, 2.0, 1.0, -2.0]
Movie 3: [-2.0, -1.0, 2.0, 0.0, -2.0, nan, 1.0, 2.0]
Movie 4: [1.86, nan, -2.14, 0.86, 0.86, -2.14, 0.86, -0.14]
Movie 5: [-1.88, 2.12, 1.12, -1.88, 0.12, 2.12, -1.88, 0.12]


Suppose we want to get the cosine similarity between a set of vectors and another vector. Let's call the `get_cosine_similarity()` function and compute their cosine similarity. 

Get the cosine similarity between movie `4` and all other movies (i.e., movies `0`, `1`, `2`, `3`, and `5`) by calling the function `get_cosine_similarity()` and assign the return value to variable `sim_4`. 

The function should only be called once in the next code block. Do not call the `get_cosine_similarity()` function multiple times. Make sure that the `get_cosine_similarity()` function receives a `Series` and a `DataFrame`. 

In [58]:
# Write your code here

# other solution using iloc
m4_vector = normalized_df.iloc[4]  # Get Movie 4 vector as a Series
other_movies = normalized_df.iloc[[0, 1, 2, 3, 5]]  # Get other movies as a DataFrame

# m4_vector = normalized_df.loc["Movie 4"]  # Get Movie 4 vector as a Series
# other_movies = normalized_df.loc[["Movie 0", "Movie 1", "Movie 2", "Movie 3", "Movie 5"]]  # Get other movies as a DataFrame

sim_4 = cfilter.get_cosine_similarity(m4_vector, other_movies)

In [59]:
print('\nCosine similarities:\n' + str(sim_4.round(2)))


Cosine similarities:
User 0   -0.53
User 1    0.00
User 2   -0.30
User 3   -0.21
User 4   -0.06
User 5   -0.75
User 6    0.01
User 7    0.01
dtype: float64


In [60]:
# this is just for confirmation

# between movie 4 and 1
m1_vector = normalized_df.loc["Movie 1"]
sim_4_1 = cfilter.get_cosine_similarity(m4_vector, m1_vector)
print(f"movie 4 and 1: {sim_4_1}")

# between movie 4 and 3
m3_vector = normalized_df.loc["Movie 3"]
sim_4_3 = cfilter.get_cosine_similarity(m4_vector, m3_vector)
print(f"movie 4 and 3: {sim_4_3}")

movie 4 and 1: -0.3412849781756846
movie 4 and 3: -0.5590852462516898


**Question #7:** What is the cosine similarity between movie `4` and movie `1`? Limit to 2 decimal places.

Answer: -0.34

**Question #8:** What is the cosine similarity between movie `4` and movie `3`? Limit to 2 decimal places.

Answer: -0.56

Open the `collaborative_filtering.py` file and complete the `get_k_similar()` function. This function returns two values - the indices of the top `k` items in the dataset that are most similar to the vector, and a `Series` containing their similarity values to the vector. This function has 2 parameters - `data` and `vector`. We find the top `k` items from the `DataFrame` `data` which are most similar to the `Series` `vector`. Since we are talking about vectors, we will measure similarity using the cosine similarity, which we have implemented in the `get_cosine_similarity()` function. Inline comments should help you in completing the contents of the function.
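A minimal sketch of the top-`k` selection is shown below. The name `k_similar_sketch` is hypothetical, `k` is passed as an argument here instead of being read from the object, and the cosine similarity is computed inline with `NaN` cells skipped (an assumption about how blanks are handled).

```python
import numpy as np
import pandas as pd


def k_similar_sketch(data, vector, k):
    """Return (index labels, similarities) of the k rows most similar to vector."""
    dot = data.mul(vector, axis=1).sum(axis=1)
    norms = np.sqrt((data ** 2).sum(axis=1)) * np.sqrt((vector ** 2).sum())
    similarities = dot / norms
    # Highest similarity first, then keep only the first k rows
    top = similarities.sort_values(ascending=False).head(k)
    return top.index, top


# The synthetic dataset printed earlier, hard-coded so the sketch is self-contained
raw = np.array([[5, 5, 2, 0, 3, 4, 2, 2],
                [0, 2, 5, 1, 2, 2, 4, 0],
                [1, 3, 1, 4, 5, 5, 4, 1],
                [1, 2, 5, 3, 1, 0, 4, 5],
                [5, 0, 1, 4, 4, 1, 4, 3],
                [1, 5, 4, 1, 3, 5, 1, 3]], dtype=float)
df = pd.DataFrame(raw,
                  index=['Movie ' + str(x) for x in range(6)],
                  columns=['User ' + str(x) for x in range(8)]).replace(0, np.nan)
normalized = df.sub(df.mean(axis=1), axis=0)

# Exclude the query movie itself, otherwise it would always rank first
candidates = normalized.drop('Movie 1')
top_indices, top_values = k_similar_sketch(candidates, normalized.loc['Movie 1'], k=2)
print(top_values.round(2))
```

Dropping the query row before the call is a design choice of this sketch: a vector always has a similarity of `1.00` with itself, so leaving it in would waste one of the `k` slots.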

Get the similar movies to movie `1` by calling the function `get_k_similar()` and assign the return value to variable `similar_movies`.

In [67]:
# Write your code here
other_movies_df = normalized_df.iloc[[0, 2, 3, 4, 5]]  # Get other movies as a DataFrame
similar_movies = cfilter.get_k_similar(other_movies_df, m1_vector)

In [68]:
print(similar_movies[1].round(2))

Movie 3    0.56
Movie 5    0.02
dtype: float64


**Question #9:** Give the top 2 movies that are most similar to movie `1`.

Answer: Movie 3 and Movie 5. I did not include movie 1 since it is itself.

Open the `collaborative_filtering.py` file and complete the `get_rating()` function. This function computes and returns an extrapolated value for a missing rating. This function has 3 parameters - `data`, `index`, and `column`. The parameter `data` is the dataset represented as a `DataFrame`. The parameters `index` and `column` represent the row and column in the dataset, respectively, of the missing rating that we want to extrapolate.

This function gets the top `k` items most similar to the item in row `index`, then infers the missing rating for the user in column `column`.

The rating of user `x` to item `i`, represented as $r_{xi}$, given the set `N` of items most similar to item `i`, is computed as:

$$r_{xi}=\dfrac{\sum_{j \in N} s_{ij} r_{xj}}{\sum_{j \in N} s_{ij}}$$

where $s_{ij}$ is the similarity between items `i` and `j`, and $r_{xj}$ is the rating of user `x` to item `j`.

Implement the `get_rating()` function. Inline comments should help you in completing the contents of the function.
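As a rough sketch, the prediction can be read as a similarity-weighted average of the ratings that the same user gave to the `k` most similar movies. The name `predict_rating_sketch` is hypothetical, and the details are assumptions: similarities are computed on mean-centered rows with `NaN` cells skipped, the weighted average uses the raw (non-normalized) ratings, and the user is assumed to have rated every neighbor. Under these assumptions the sketch reproduces the values printed in this notebook, such as `2.76` for movie `0` and user `3`.

```python
import numpy as np
import pandas as pd


def predict_rating_sketch(data, index, column, k):
    """Similarity-weighted average of one user's ratings of the k nearest rows."""
    normalized = data.sub(data.mean(axis=1), axis=0)
    target = normalized.loc[index]
    others = normalized.drop(index)
    # Cosine similarity of every other row against the target row
    dot = others.mul(target, axis=1).sum(axis=1)
    norms = np.sqrt((others ** 2).sum(axis=1)) * np.sqrt((target ** 2).sum())
    top = (dot / norms).nlargest(k)
    # Raw ratings that the user in `column` gave to the k neighbors
    neighbor_ratings = data.loc[top.index, column]
    return (top * neighbor_ratings).sum() / top.sum()


# The synthetic dataset printed earlier, hard-coded so the sketch is self-contained
raw = np.array([[5, 5, 2, 0, 3, 4, 2, 2],
                [0, 2, 5, 1, 2, 2, 4, 0],
                [1, 3, 1, 4, 5, 5, 4, 1],
                [1, 2, 5, 3, 1, 0, 4, 5],
                [5, 0, 1, 4, 4, 1, 4, 3],
                [1, 5, 4, 1, 3, 5, 1, 3]], dtype=float)
df = pd.DataFrame(raw,
                  index=['Movie ' + str(x) for x in range(6)],
                  columns=['User ' + str(x) for x in range(8)]).replace(0, np.nan)

pred_m0_u3 = predict_rating_sketch(df, 'Movie 0', 'User 3', k=2)
pred_m1_u1 = predict_rating_sketch(df, 'Movie 1', 'User 1', k=2)
print(round(pred_m0_u3, 2))
print(round(pred_m1_u1, 2))
```

For movie `0`, the two nearest movies are `4` and `5`, which user `3` rated 4 and 1. Movie `4` is more similar, so the prediction is pulled toward 4.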

In the synthetic dataset, user `1` has not yet rated movie `4`. Infer the rating of user `1` to movie `4` by calling the function `get_rating()` and assign the return value to variable `rating_1_4`. 

In [75]:
# Write your code here
rating_1_4 = cfilter.get_rating(syn_df, "Movie 4", "User 1")

rating_1_1 = cfilter.get_rating(syn_df, "Movie 1", "User 1")

In [76]:
print(round(rating_1_4, 2))
print(round(rating_1_1, 2))

5.0
2.11


**Question #10:** What is the predicted rating of user `1` to movie `1`? Limit to 2 decimal places.

Answer: 2.11

In the synthetic dataset, user `3` has not yet rated movie `0`. Infer the rating of user `3` to movie `0` by calling the function `get_rating()` and assign the return value to variable `rating_3_0`. 

In [77]:
# Write your code here
rating_3_0 = cfilter.get_rating(syn_df, "Movie 0", "User 3")

In [78]:
print(round(rating_3_0, 2))

2.76


**Question #11:** What is the predicted rating of user `3` to movie `0`? Limit to 2 decimal places.

Answer: 2.76

## MovieLens Dataset
For this notebook, we will work on the MovieLens dataset. This dataset contains 1682 movies rated by 943 users on a scale of 1 to 5, for a total of 100k ratings. We have already pre-processed the dataset and stored it as a csv file, where each row represents a movie and each column represents a user. The value in row `x` and column `y` is the rating of user `y` to movie `x`. A rating of 0 means that the user has not rated the movie yet.

The dataset is provided to you as a `.csv` file. `.csv` means comma-separated values. You can open the file in a text editor such as Notepad to see exactly how it is formatted.

Let's read the dataset.

In [81]:
ml_df = pd.read_csv('ml-100k.csv', header=None)
ml_df.head()

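| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 933 | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 4 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 4 | ... | 2 | 3 | 4 | 0 | 4 | 0 | 0 | 5 | 0 | 0 |
| 1 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 4 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |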


Let's read the file `u.txt`, which contains details about the movies in the dataset. Each line is a pipe-separated (`|`) list of:

movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |

The last 19 fields are the genres: a `1` indicates the movie is of that genre and a `0` indicates it is not. A movie can be in several genres at once. From this file, we will get the movie titles to use as the index of our `DataFrame`.

In [82]:
indices = []
with open('u.txt', 'r', encoding='ISO-8859-1') as f:
    for line in f:
        # The movie title is the second pipe-separated field on each line
        indices.append(line.split('|')[1])
ml_df.index = indices
# One column per user, labeled 'User 0' to 'User 942'
ml_df.columns = ['User ' + str(x) for x in range(ml_df.shape[1])]

Since a value of `0` means that the person has not rated that movie yet, let us replace `0` with the value `NaN`. This is useful because cells with the value `NaN` can then be excluded from the computation.

In [84]:
# Write your code here
ml_df = ml_df.replace(0, np.nan)
ml_df.head()

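| | User 0 | User 1 | User 2 | User 3 | User 4 | User 5 | User 6 | User 7 | User 8 | User 9 | ... | User 933 | User 934 | User 935 | User 936 | User 937 | User 938 | User 939 | User 940 | User 941 | User 942 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Toy Story (1995) | 5.0 | 4.0 | NaN | NaN | 4.0 | 4.0 | NaN | NaN | NaN | 4.0 | ... | 2.0 | 3.0 | 4.0 | NaN | 4.0 | NaN | NaN | 5.0 | NaN | NaN |
| GoldenEye (1995) | 3.0 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | ... | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 |
| Four Rooms (1995) | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Get Shorty (1995) | 3.0 | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | 4.0 | ... | 5.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN |
| Copycat (1995) | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |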


Instantiate a `CollaborativeFiltering` object with `k` equal to `3`. The parameter `k` indicates the number of similar items that we need to consider in giving similar recommendations.

In [85]:
# Write your code here
cfilter_new = CollaborativeFiltering(k=3)

Get similar movies to `Lion King, The (1994)`. Assign the return value to variable `similar_movies`. 

In [88]:
# Write your code here
lion_king_vector = ml_df.loc["Lion King, The (1994)"]
ml_df_without_lion_king = ml_df.drop("Lion King, The (1994)")
similar_movies = cfilter_new.get_k_similar(ml_df_without_lion_king, lion_king_vector)

In [89]:
print(similar_movies[1].round(2))

Aladdin (1992)                          0.38
Beauty and the Beast (1991)             0.36
Robin Hood: Prince of Thieves (1991)    0.29
dtype: float64


**Question #12:** Give the top 3 movies that are most similar to `Lion King, The (1994)`.

Answer: Aladdin (1992) = 0.38, Beauty and the Beast (1991) = 0.36, and Robin Hood: Prince of Thieves (1991) = 0.29.

Get similar movies to `Amityville Curse, The (1990)`. Assign the return value to variable `similar_movies`. 

In [90]:
# Write your code here
amityville_vector = ml_df.loc["Amityville Curse, The (1990)"]
ml_df_without_amity = ml_df.drop("Amityville Curse, The (1990)")
similar_movies = cfilter_new.get_k_similar(ml_df_without_amity, amityville_vector)

In [91]:
print(similar_movies[1].round(2))

Amityville 3-D (1983)    0.95
Bad Moon (1996)          0.33
Fog, The (1980)          0.32
dtype: float64


**Question #13:** Give the top 3 movies that are most similar to `Amityville Curse, The (1990)`.

Answer: Amityville 3-D (1983) = 0.95, Bad Moon (1996) = 0.33, and Fog, The (1980) = 0.32.

Get similar movies to `Star Trek: The Wrath of Khan (1982)`. Assign the return value to variable `similar_movies`. 

In [92]:
# Write your code here
star_trek_vector = ml_df.loc["Star Trek: The Wrath of Khan (1982)"]
ml_df_without_star_trek = ml_df.drop("Star Trek: The Wrath of Khan (1982)")
similar_movies = cfilter_new.get_k_similar(ml_df_without_star_trek, star_trek_vector)

In [93]:
print(similar_movies[1].round(2))

Star Trek IV: The Voyage Home (1986)             0.40
Star Trek III: The Search for Spock (1984)       0.38
Star Trek VI: The Undiscovered Country (1991)    0.33
dtype: float64


**Question #14:** Give the top 3 movies that are most similar to `Star Trek: The Wrath of Khan (1982)`.

Answer: Star Trek IV: The Voyage Home (1986) = 0.40, Star Trek III: The Search for Spock (1984) = 0.38, and Star Trek VI: The Undiscovered Country (1991) = 0.33.