## Collaborative Filtering on Movie Dataset
In this Notebook, we will implement an algorithm that performs collaborative filtering on a dataset. We will first test our algorithm on a small synthetic (artificial) dataset, then use it to recommend items from a larger dataset: the [MovieLens dataset](https://grouplens.org/datasets/movielens/100k/).

Our Notebooks in CSMODEL are designed to be guided learning activities. To use them, simply go through the cells from top to bottom, following the directions along the way. If you find any unclear parts or mistakes in the Notebooks, email me at thomas.tiam-lee@dlsu.edu.ph

## Import
Import **numpy**, **pandas**, and **matplotlib**.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2

## Synthetic Dataset
Before we use a more complicated dataset, we will first demonstrate collaborative filtering using synthetic (artificial) data drawn from a random sample. Suppose that the values in the synthetic dataset represent the ratings, on a scale of 1 to 5, that people gave to different movies. Each row represents a movie, while each column represents a person. The synthetic dataset contains 6 different movies rated by 8 different people. A value of `0` means that the person has not rated that movie yet.

In [2]:
np.random.seed(1)
data = np.random.choice(6, (6, 8))
print(data)

[[5 3 4 0 1 3 5 0]
 [0 1 4 5 4 1 2 4]
 [5 2 4 3 4 2 4 5]
 [2 4 1 1 0 5 1 1]
 [5 1 1 0 4 1 0 0]
 [5 3 2 1 0 3 5 1]]


Convert the data type of the dataset from `numpy` arrays to `pandas` `DataFrame`.

In [3]:
rows = ['Movie ' + str(x) for x in range(data.shape[0])]
columns = ['User ' + str(x) for x in range(data.shape[1])]
syn_df = pd.DataFrame(data, index=rows, columns=columns)
print(syn_df)

         User 0  User 1  User 2  User 3  User 4  User 5  User 6  User 7
Movie 0       5       3       4       0       1       3       5       0
Movie 1       0       1       4       5       4       1       2       4
Movie 2       5       2       4       3       4       2       4       5
Movie 3       2       4       1       1       0       5       1       1
Movie 4       5       1       1       0       4       1       0       0
Movie 5       5       3       2       1       0       3       5       1


Since a value of `0` means that the person has not rated that movie yet, let us replace `0` with the value `NaN`. This is useful because it lets us exclude the cells with value `NaN` from the computation.

In [4]:
syn_df = syn_df.replace(0, np.nan)
print(syn_df)

         User 0  User 1  User 2  User 3  User 4  User 5  User 6  User 7
Movie 0     5.0       3       4     NaN     1.0       3     5.0     NaN
Movie 1     NaN       1       4     5.0     4.0       1     2.0     4.0
Movie 2     5.0       2       4     3.0     4.0       2     4.0     5.0
Movie 3     2.0       4       1     1.0     NaN       5     1.0     1.0
Movie 4     5.0       1       1     NaN     4.0       1     NaN     NaN
Movie 5     5.0       3       2     1.0     NaN       3     5.0     1.0


## Filtering the Synthetic Dataset
Open the `collaborative_filtering.py` file. Some of the functions in the `CollaborativeFiltering` class are not yet implemented. We will implement the missing parts of this class.

Import the `CollaborativeFiltering` class.

In [5]:
from collaborative_filtering import CollaborativeFiltering

Instantiate a `CollaborativeFiltering` object with `k` equal to `2`. The parameter `k` indicates the number of similar items that we need to consider in giving similar recommendations.

In [6]:
cfilter = CollaborativeFiltering(2)

Open the `collaborative_filtering.py` file and complete the `get_row_mean()` function. If the parameter `data` is a `DataFrame`, the function returns a `Series` containing the mean of each row in the `DataFrame`. If the parameter `data` is a `Series`, the function returns an `np.float64`, which is the mean of the `Series`. This function should ignore blank ratings, which are represented as `NaN`.
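
As a rough illustration of the behavior described above, here is a minimal sketch of a NaN-aware row mean. The name `row_mean_sketch` is hypothetical and this is only one possible approach, not necessarily the one expected in `collaborative_filtering.py`. It relies on the fact that pandas skips `NaN` by default when computing a mean.

```python
import numpy as np
import pandas as pd


def row_mean_sketch(data):
    """Mean of each row (DataFrame) or of the whole vector (Series), skipping NaN."""
    if isinstance(data, pd.DataFrame):
        # axis=1 averages across the columns of each row; NaN is skipped by default.
        return data.mean(axis=1)
    return data.mean()


# Two rows copied from the synthetic dataset (NaN = not rated yet).
ratings = pd.DataFrame(
    [[5, 3, 4, np.nan, 1, 3, 5, np.nan],
     [5, 1, 1, np.nan, 4, 1, np.nan, np.nan]],
    index=['Movie 0', 'Movie 4'],
)
print(row_mean_sketch(ratings.iloc[0]))  # averages only the 6 rated cells
print(row_mean_sketch(ratings))
```

Had the unrated cells been left as `0`, they would have pulled each average down, which is why the zeros were replaced with `NaN` earlier.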

Implement the `get_row_mean()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [7]:
mean = cfilter.get_row_mean(syn_df.iloc[0])
print(mean)

3.5


**Question:** What is the average rating of the movie `0`? 
- 3.5

In [8]:
mean = cfilter.get_row_mean(syn_df)
print(mean)

Movie 0    3.500000
Movie 1    3.000000
Movie 2    3.625000
Movie 3    2.142857
Movie 4    2.400000
Movie 5    2.857143
dtype: float64


**Question:** What is the average rating of the movie `2`? 
- 3.625000

**Question:** What is the average rating of the movie `4`? 
- 2.400000

Open the `collaborative_filtering.py` file and complete the `normalize_data()` function. This function normalizes the dataset by subtracting each movie's mean rating (the row mean) from every user rating in that row.
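
The row-wise subtraction described above can be sketched as follows. The name `normalize_sketch` is hypothetical, and this is a minimal illustration under the assumption that the row means are passed in as a `Series` indexed by the same row labels as the data.

```python
import numpy as np
import pandas as pd


def normalize_sketch(data, row_means):
    """Subtract each row's mean from every value in that row."""
    # axis=0 aligns row_means with the row labels, so each row gets its own mean.
    return data.sub(row_means, axis=0)


# Two rows copied from the synthetic dataset (NaN = not rated yet).
ratings = pd.DataFrame(
    [[5, 3, 4, np.nan, 1, 3, 5, np.nan],
     [np.nan, 1, 4, 5, 4, 1, 2, 4]],
    index=['Movie 0', 'Movie 1'],
    columns=['User ' + str(x) for x in range(8)],
)
normalized = normalize_sketch(ratings, ratings.mean(axis=1))
print(normalized)
```

Cells that were `NaN` stay `NaN` after the subtraction, so unrated movies remain excluded from later computations.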

Implement the `normalize_data()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [9]:
normalized_df = cfilter.normalize_data(syn_df, mean)
print(normalized_df)

           User 0    User 1    User 2    User 3  User 4    User 5    User 6  \
Movie 0  1.500000 -0.500000  0.500000       NaN  -2.500 -0.500000  1.500000   
Movie 1       NaN -2.000000  1.000000  2.000000   1.000 -2.000000 -1.000000   
Movie 2  1.375000 -1.625000  0.375000 -0.625000   0.375 -1.625000  0.375000   
Movie 3 -0.142857  1.857143 -1.142857 -1.142857     NaN  2.857143 -1.142857   
Movie 4  2.600000 -1.400000 -1.400000       NaN   1.600 -1.400000       NaN   
Movie 5  2.142857  0.142857 -0.857143 -1.857143     NaN  0.142857  2.142857   

           User 7  
Movie 0       NaN  
Movie 1  1.000000  
Movie 2  1.375000  
Movie 3 -1.142857  
Movie 4       NaN  
Movie 5 -1.857143  


**Question:** What is the normalized rating of user `0` to movie `0`? 
- 1.500000

**Question:** What is the normalized rating of user `2` to movie `4`? 
- -1.400000

**Question:** What is the normalized rating of user `4` to movie `1`? 
-  1.000

**Question:** What is the normalized rating of user `6` to movie `1`? 
- -1.000000

Open the `collaborative_filtering.py` file and complete the `get_cosine_similarity()` function. This function computes and returns the cosine similarity between two vectors of the same shape. The cosine similarity, $S_c$, between two vectors $A$ and $B$ is computed as:
$$S_c(A, B)=\dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$

This function has 2 parameters - `vector1` and `vector2`. You may pass these combinations of data types to this function:
- a `Series` and a `Series` - the function returns a single similarity based on these two vectors. The data type of the result is `np.float64`.
- a `DataFrame` and a `Series` - the function returns a `Series` of similarities between a single vector (represented as a `Series`) and a set of vectors (represented as a `DataFrame`). If the shape of the `DataFrame` is (3, 2), the shape of the `Series` should be (2,) to enable broadcasting. This operation will result in a `Series` of shape (3,).
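
To make the formula and the broadcasting case concrete, here is a minimal NumPy sketch. The name `cosine_similarity_sketch` is hypothetical, and the NaN handling is an assumption: missing entries are skipped in the dot product and in each vector's own norm. Unlike the function described above, this sketch returns plain NumPy values rather than a `Series`.

```python
import numpy as np


def cosine_similarity_sketch(a, b):
    """Cosine similarity between a (one vector or a stack of rows) and vector b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # nansum treats NaN as 0, so unrated cells drop out of every sum.
    dot = np.nansum(a * b, axis=-1)
    norm_a = np.sqrt(np.nansum(a ** 2, axis=-1))
    norm_b = np.sqrt(np.nansum(b ** 2, axis=-1))
    return dot / (norm_a * norm_b)


# Mean-centered ratings of Movie 1 and Movie 2 from the synthetic dataset.
movie1 = np.array([np.nan, 1, 4, 5, 4, 1, 2, 4]) - 3.0
movie2 = np.array([5, 2, 4, 3, 4, 2, 4, 5]) - 3.625

sim_1_1 = cosine_similarity_sketch(movie1, movie1)
sim_1_2 = cosine_similarity_sketch(movie1, movie2)
# A (2, 8) stack against an (8,) vector broadcasts to one similarity per row.
sim_batch = cosine_similarity_sketch(np.vstack([movie1, movie2]), movie1)
print(sim_1_1, sim_1_2, sim_batch)
```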

Implement the `get_cosine_similarity()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

Suppose we want to get the cosine similarity between two vectors. Let's call the `get_cosine_similarity()` function and compute their cosine similarity.

In [10]:
sim_1_1 = cfilter.get_cosine_similarity(normalized_df.iloc[1, :], normalized_df.iloc[1, :])
print('Movie 1:', [round(x, 2) for x in normalized_df.iloc[1, :]])
print('Movie 1:', [round(x, 2) for x in normalized_df.iloc[1, :]])
print('Cosine similarity:', sim_1_1, '\n')

sim_1_2 = cfilter.get_cosine_similarity(normalized_df.iloc[1, :], normalized_df.iloc[2, :])
print('Movie 1:', [round(x, 2) for x in normalized_df.iloc[1, :]])
print('Movie 2:', [round(x, 2) for x in normalized_df.iloc[2, :]])
print('Cosine similarity:', sim_1_2, '\n')

Movie 1: [nan, -2.0, 1.0, 2.0, 1.0, -2.0, -1.0, 1.0]
Movie 1: [nan, -2.0, 1.0, 2.0, 1.0, -2.0, -1.0, 1.0]
Cosine similarity: 1.0 

Movie 1: [nan, -2.0, 1.0, 2.0, 1.0, -2.0, -1.0, 1.0]
Movie 2: [1.38, -1.62, 0.38, -0.62, 0.38, -1.62, 0.38, 1.38]
Cosine similarity: 0.556890098923011 



**Question:** What is the cosine similarity between movie `1` and movie `1`?
- 1

**Question:** What is the cosine similarity between movie `1` and movie `2`?
- 0.556890098923011 

Suppose we want to get the cosine similarity between a set of vectors and another vector. Let's call the `get_cosine_similarity()` function and compute their cosine similarity.

In [11]:
sim_0 = cfilter.get_cosine_similarity(normalized_df.iloc[0, :], normalized_df.iloc[1:, :])
print('Movie 0:', [round(x, 2) for x in normalized_df.iloc[0, :]])
print('Movie 1:', [round(x, 2) for x in normalized_df.iloc[1, :]])
print('Movie 2:', [round(x, 2) for x in normalized_df.iloc[2, :]])
print('Movie 3:', [round(x, 2) for x in normalized_df.iloc[3, :]])
print('Movie 4:', [round(x, 2) for x in normalized_df.iloc[4, :]])
print('Movie 5:', [round(x, 2) for x in normalized_df.iloc[5, :]])
print('\nCosine similarities:\n' + str(sim_0))

Movie 0: [1.5, -0.5, 0.5, nan, -2.5, -0.5, 1.5, nan]
Movie 1: [nan, -2.0, 1.0, 2.0, 1.0, -2.0, -1.0, 1.0]
Movie 2: [1.38, -1.62, 0.38, -0.62, 0.38, -1.62, 0.38, 1.38]
Movie 3: [-0.14, 1.86, -1.14, -1.14, nan, 2.86, -1.14, -1.14]
Movie 4: [2.6, -1.4, -1.4, nan, 1.6, -1.4, nan, nan]
Movie 5: [2.14, 0.14, -0.86, -1.86, nan, 0.14, 2.14, -1.86]

Cosine similarities:
0   -0.110581
1    0.328436
2   -0.348851
3    0.045382
4    0.420673
dtype: float64


**Question:** What is the cosine similarity between movie `0` and movie `1`?
- -0.110581

**Question:** What is the cosine similarity between movie `0` and movie `3`?
- -0.348851

**Question:** What is the cosine similarity between movie `0` and movie `5`?
-  0.420673

Open the `collaborative_filtering.py` file and complete the `get_k_similar()` function. This function returns two values - the indices of the top `k` items in the dataset that are most similar to the vector, and a `Series` containing their similarity values to the vector. This function has 2 parameters - `data` and `vector`. We find the top `k` items from the `DataFrame` `data` that are most similar to the `Series` `vector`. Since we are working with vectors, we will measure similarity using the cosine similarity, which we have implemented in the `get_cosine_similarity()` function.
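
A minimal sketch of the top-`k` selection described above is shown here. The name `k_similar_sketch` and the explicit `k` argument are hypothetical (the class stores `k` at instantiation), and the sketch assumes that each row is centered on its own mean before the cosine similarity is computed, with `NaN` entries skipped.

```python
import numpy as np
import pandas as pd


def k_similar_sketch(data, vector, k):
    """Return positional indices and similarities of the k rows most similar to vector."""
    # Center every row (and the query vector) on its own mean, ignoring NaN.
    rows = data.sub(data.mean(axis=1), axis=0).to_numpy(dtype=float)
    vec = (vector - vector.mean()).to_numpy(dtype=float)
    dot = np.nansum(rows * vec, axis=1)
    norms = np.sqrt(np.nansum(rows ** 2, axis=1)) * np.sqrt(np.nansum(vec ** 2))
    similarities = pd.Series(dot / norms)  # index = position of each row in data
    top = similarities.nlargest(k)         # sorted from most to least similar
    return top.index.to_numpy(), top


# The synthetic dataset, with 0 (not rated) replaced by NaN.
syn = pd.DataFrame(
    [[5, 3, 4, 0, 1, 3, 5, 0],
     [0, 1, 4, 5, 4, 1, 2, 4],
     [5, 2, 4, 3, 4, 2, 4, 5],
     [2, 4, 1, 1, 0, 5, 1, 1],
     [5, 1, 1, 0, 4, 1, 0, 0],
     [5, 3, 2, 1, 0, 3, 5, 1]],
    dtype=float,
).replace(0, np.nan)

idx, sims = k_similar_sketch(syn.iloc[1:, :], syn.iloc[0, :], k=2)
print(idx)
print(sims)
```

Note that the returned indices are positions within the `DataFrame` that was passed in, not within the full dataset, so position `0` here refers to movie `1`.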

Implement the `get_k_similar()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

Suppose we want to get the `k` similar movies to movie `0`. Let's call the `get_k_similar()` function, with `k=2` which we have set during instantiation. 

In [12]:
movie0 = syn_df.iloc[0, :]
other_movies = syn_df.iloc[1:, :]
similar_movies = cfilter.get_k_similar(other_movies, movie0)
print(similar_movies[1])

4    0.420673
1    0.328436
dtype: float64


**Question:** Give the top 2 movies that are most similar to movie `0`.
- Movie 2 and 5

Open the `collaborative_filtering.py` file and complete the `get_rating()` function. This function computes and returns an extrapolated value for a missing rating. This function has 3 parameters - `data`, `index`, and `column`. The parameter `data` is the dataset represented as a `DataFrame`. The parameters `index` and `column` represent the row and column in the dataset, respectively, of the missing rating that we want to extrapolate.

This function gets the top `k` items most similar to the item in row `index`, then infers the missing rating for the user in column `column`.

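The predicted rating of user `x` for item `i`, represented as $r_{xi}$, given the set `N` of items most similar to item `i`, is computed as:

$$r_{xi}=\dfrac{\sum_{j \in N} s_{ij} r_{xj}}{\sum_{j \in N} s_{ij}}$$

where $s_{ij}$ is the cosine similarity between items `i` and `j`, and $r_{xj}$ is the rating of user `x` for item `j`.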
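
In words, the prediction is a similarity-weighted average of the ratings that the user gave to the `k` similar items. The sketch below illustrates only that arithmetic. The name `weighted_rating_sketch` and the numbers are made up for illustration, and the sketch assumes the user has rated every similar item (it does not handle `NaN`).

```python
import numpy as np


def weighted_rating_sketch(similarities, ratings):
    """Similarity-weighted average of the user's ratings for the similar items."""
    s = np.asarray(similarities, dtype=float)
    r = np.asarray(ratings, dtype=float)
    return np.sum(s * r) / np.sum(s)


# Hypothetical example: two similar items with similarities 0.5 and 0.25,
# which the user rated 4 and 1 respectively.
predicted = weighted_rating_sketch([0.5, 0.25], [4, 1])
print(predicted)  # (0.5*4 + 0.25*1) / (0.5 + 0.25) = 3.0
```

The more similar item contributes more to the prediction, so the result lies closer to 4 than a plain average of the two ratings would.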

Implement the `get_rating()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In the synthetic dataset, user `3` has not yet rated movie `0`. Let's use the `get_rating()` function to infer the rating of user `3` to movie `0` using similar movies.

In [13]:
rating_0_3 = cfilter.get_rating(syn_df, 0, 3)
print(round(rating_0_3, 2))

2.0


**Question:** What is the predicted rating of user `3` to movie `0`?
- 2.0

In the synthetic dataset, user `0` has not yet rated movie `1`. Let's use the `get_rating()` function to infer the rating of user `0` to movie `1` using similar movies.

In [14]:
rating_1_0 = cfilter.get_rating(syn_df, 1, 0)
print(round(rating_1_0, 2))

5.0


**Question:** What is the predicted rating of user `0` to movie `1`?
- 5

## MovieLens Dataset
For this notebook, we will work on the MovieLens dataset. This dataset contains 1682 movies rated by 943 users on a scale of 1 to 5. There are a total of 100k ratings. We have already pre-processed the dataset and stored it as a csv file, where each row represents a movie and each column represents a user. The value in row `x` and column `y` is the rating of user `y` for movie `x`. A rating of 0 means that the user has not rated the movie yet.

The dataset is provided to you as a `.csv` file. `.csv` means comma-separated values. You can open the file in Notepad to see exactly how it is formatted.

Let's read the dataset.

In [15]:
ml_df = pd.read_csv('ml-100k.csv', header=None)

Let's read the file `u.item`, which contains details about the movies in the dataset. Each line is a list of fields separated by the pipe character (`|`):

movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western

The last 19 fields are the genres: a `1` indicates that the movie is of that genre and a `0` indicates that it is not. A movie can belong to several genres at once.

From this file, we will take the movie titles to use as the index of our `DataFrame`.

In [16]:
# Use each movie title (the second '|'-separated field of u.item) as a row label.
indices = []
with open('u.item', 'r', encoding='ISO-8859-1') as f:
    for line in f:
        indices.append(line.split('|')[1])
ml_df.index = indices
ml_df.columns = ['User ' + str(x) for x in range(ml_df.shape[1])]
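
As an alternative to reading the file line by line, the titles could also be extracted with `pd.read_csv`. The sketch below is only an illustration: the helper name `read_titles_sketch` is hypothetical, and the two sample lines are abbreviated stand-ins written to a temporary file, not the real contents of `u.item`.

```python
import os
import tempfile

import pandas as pd


def read_titles_sketch(path):
    """Return the movie titles (second '|'-separated field) as a list."""
    items = pd.read_csv(path, sep='|', header=None, usecols=[1],
                        encoding='ISO-8859-1')
    # With header=None, the kept column is labeled by its position, 1.
    return items[1].tolist()


# Abbreviated, made-up sample in the same pipe-separated layout.
sample = (
    "1|Toy Story (1995)|01-Jan-1995||url|0|0\n"
    "2|GoldenEye (1995)|01-Jan-1995||url|0|1\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'u.item.sample')
    with open(sample_path, 'w', encoding='ISO-8859-1') as f:
        f.write(sample)
    titles = read_titles_sketch(sample_path)
print(titles)
```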

Since a value of `0` means that the person has not rated that movie yet, let us replace `0` with the value `NaN`. This is useful because it lets us exclude the cells with value `NaN` from the computation.

In [17]:
ml_df = ml_df.replace(0, np.nan)
ml_df

Unnamed: 0,User 0,User 1,User 2,User 3,User 4,User 5,User 6,User 7,User 8,User 9,...,User 933,User 934,User 935,User 936,User 937,User 938,User 939,User 940,User 941,User 942
Toy Story (1995),5.0,4.0,,,4.0,4.0,,,,4.0,...,2.0,3.0,4.0,,4.0,,,5.0,,
GoldenEye (1995),3.0,,,,3.0,,,,,,...,4.0,,,,,,,,,5.0
Four Rooms (1995),4.0,,,,,,,,,,...,,,4.0,,,,,,,
Get Shorty (1995),3.0,,,,,,5.0,,,4.0,...,5.0,,,,,,2.0,,,
Copycat (1995),3.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mat' i syn (1997),,,,,,,,,,,...,,,,,,,,,,
B. Monkey (1998),,,,,,,,,,,...,,,,,,,,,,
Sliding Doors (1998),,,,,,,,,,,...,,,,,,,,,,
You So Crazy (1994),,,,,,,,,,,...,,,,,,,,,,


Instantiate a `CollaborativeFiltering` object with `k` equal to `5`. The parameter `k` indicates the number of similar items that we need to consider in giving similar recommendations.

In [18]:
cfilter = CollaborativeFiltering(5)

Suppose that we want to get similar movies to Toy Story (1995). Let's use the `get_k_similar()` function.

In [19]:
toy_story = ml_df.iloc[0, :]
other_movies = ml_df.iloc[1:, :]

similar_movies = cfilter.get_k_similar(other_movies, toy_story)
print(similar_movies[1])


586    0.246966
93     0.234939
926    0.212976
424    0.212905
69     0.212355
dtype: float64


In [20]:
for i in similar_movies[0]:
    print(other_movies.iloc[i].name)

Beauty and the Beast (1991)
Aladdin (1992)
Craft, The (1996)
Transformers: The Movie, The (1986)
Lion King, The (1994)


**Question:** Give the top 5 movies that are most similar to Toy Story (1995).
- Beauty and the Beast (1991)
- Aladdin (1992)
- Craft, The (1996)
- Transformers: The Movie, The (1986)
- Lion King, The (1994)

Suppose that we want to get similar movies to Batman Forever (1995). Let's use the `get_k_similar()` function.

In [21]:
batman_forever = ml_df.iloc[28, :]
other_movies = pd.concat([ml_df.iloc[:28, :], ml_df.iloc[29:, :]])

similar_movies = cfilter.get_k_similar(other_movies, batman_forever)
print(similar_movies[1])

136    0.327483
36     0.324005
252    0.305765
766    0.271320
930    0.261004
dtype: float64


In [22]:
for i in similar_movies[0]:
    print(other_movies.iloc[i].name)

D3: The Mighty Ducks (1996)
Net, The (1995)
Batman & Robin (1997)
Casper (1995)
First Kid (1996)


**Question:** Give the top 5 movies that are most similar to Batman Forever (1995).
- D3: The Mighty Ducks (1996)
- Net, The (1995)
- Batman & Robin (1997)
- Casper (1995)
- First Kid (1996)

Suppose that we want to get similar movies to Aladdin (1992). Let's use the `get_k_similar()` function.

In [23]:
aladdin = ml_df.iloc[94, :]
other_movies = pd.concat([ml_df.iloc[:94, :], ml_df.iloc[95:, :]])

similar_movies = cfilter.get_k_similar(other_movies, aladdin)
print(similar_movies[1])

70     0.383341
944    0.313559
586    0.287629
416    0.273185
100    0.271581
dtype: float64


In [24]:
for i in similar_movies[0]:
    print(other_movies.iloc[i].name)

Lion King, The (1994)
Fox and the Hound, The (1981)
Beauty and the Beast (1991)
Cinderella (1950)
Aristocats, The (1970)


**Question:** Give the top 5 movies that are most similar to Aladdin (1992).
- Lion King, The (1994)
- Fox and the Hound, The (1981)
- Beauty and the Beast (1991)
- Cinderella (1950)
- Aristocats, The (1970)