# Practice PS06: Recommendations engines (interactions-based)

For this assignment we will build and apply item-based and model-based collaborative filtering recommenders for movies.

Author: <font color="blue">Luca Franceschi</font>

E-mail: <font color="blue">luca.franceschi01@estudiant.upf.edu</font>

Date: <font color="blue">30/10/2024</font>

# 1. The Movies dataset

We will use the same dataset as in ps05, the 25M version of [MovieLens DataSet](https://grouplens.org/datasets/movielens/) released in late 2019. We will use a sub-set containing only movies released in the 2000s, and only 10% of the users and all of their ratings.

* **MOVIES** are described in `movies-2000s.csv` in the following format: `movieId,title,genres`.
* **RATINGS** are contained in `ratings-2000s.csv` in the following format: `userId,movieId,rating`
* **TAGS** are contained in `tags-2000s.csv` in the following format: `userId,movieId,tag,timestamp`

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 1.1. Load the input files

In [1]:
# Leave this code as-is

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import *
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# Leave this code as-is
DIR = 'movielens-25M-filtered/'
FILENAME_MOVIES = DIR + "movies-2000s.csv"
FILENAME_RATINGS = DIR + "ratings-2000s.csv"
FILENAME_TAGS = DIR + "tags-2000s.csv"

In [3]:
# Leave this code as-is

movies = pd.read_csv(
    FILENAME_MOVIES,
    sep=',',
    engine='python',
    encoding='latin-1',
    names=['movie_id', 'title', 'genres'],
)
# movies.set_index('movie_id', inplace=True)
display(movies.head(5))

ratings_raw = pd.read_csv(
    FILENAME_RATINGS,
    sep=',',
    encoding='latin-1',
    engine='python',
    names=['user_id', 'movie_id', 'rating'],
)
display(ratings_raw.head(5))

tags_raw = pd.read_csv(
    FILENAME_TAGS,
    sep=',',
    encoding='latin-1',
    engine='python',
    names=['user_id', 'movie_id', 'tag', 'timestamp'],
)
display(tags_raw.head(5))

Unnamed: 0,movie_id,title,genres
0,2769,"Yards, The (2000)",Crime|Drama
1,3177,Next Friday (2000),Comedy
2,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
3,3225,Down to You (2000),Comedy|Romance
4,3228,Wirey Spindell (2000),Comedy


Unnamed: 0,user_id,movie_id,rating
0,4,1,3.0
1,4,260,3.5
2,4,296,4.0
3,4,541,4.5
4,4,589,4.0


Unnamed: 0,user_id,movie_id,tag,timestamp
0,4,44665,unreliable narrators,1573943619
1,68,3481,music,1472113217
2,91,3481,based on a book,1399117141
3,91,3481,break-up,1399117159
4,91,3481,Catherine Zeta-Jones,1399117136


## 1.2. Merge the data into a single dataframe

Join the data into a single dataframe that should contain columns: user_id, movie_id, rating, timestamp, title, genres.
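
As a reminder of how `merge` behaves, here is a minimal sketch on two made-up dataframes. The names `toy_ratings`, `toy_movies`, and `toy_merged` are hypothetical and not part of the assignment data. An inner join keeps only the ratings whose `movie_id` appears in both tables.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 30],
    'rating': [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    'movie_id': [10, 20],
    'title': ['Movie A (2001)', 'Movie B (2002)'],
    'genres': ['Drama', 'Comedy'],
})

# Inner join on movie_id: the rating for movie 30 is dropped because
# movie 30 does not appear in toy_movies
toy_merged = pd.merge(toy_ratings, toy_movies, how='inner', on='movie_id')
print(toy_merged)
```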

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code from the previous practice that joined these three dataframes using "merge" into a single dataframe named "ratings". Print the first 5 rows of the resulting dataframe, which should contain columns "user_id", "movie_id", "rating", "title", and "genres".</font>

In [4]:
# ONLY MERGE TWO DATAFRAMES
ratings = pd.merge(ratings_raw, movies, how='inner', on='movie_id')
# ratings = pd.merge(ratings, tags_raw, how='inner', on=['user_id', 'movie_id']).drop('tag', axis=1)
display(ratings.head(5))

Unnamed: 0,user_id,movie_id,rating,title,genres
0,4,3624,2.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western
1,4,3751,3.5,Chicken Run (2000),Animation|Children|Comedy
2,4,3793,1.5,X-Men (2000),Action|Adventure|Sci-Fi
3,4,3827,3.0,Space Cowboys (2000),Action|Adventure|Comedy|Sci-Fi
4,4,4308,3.5,Moulin Rouge (2001),Drama|Musical|Romance


<font size="+1" color="red">Replace this cell with your code from the previous practice for "find_movies" that list movies containing a keyword</font>

In [5]:
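def find_movies(query_str: str, df: pd.DataFrame) -> pd.DataFrame:
    # Keep rows whose title contains the literal query string (case-sensitive);
    # regex=False so that characters such as "(" are matched as plain text
    mask = df['title'].str.contains(query_str, regex=False)
    return df.loc[mask, ['movie_id', 'title']]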

In [6]:
# LEAVE AS-IS

# For testing, this should print 9 movies
display(find_movies("Spider-Man", movies))

Unnamed: 0,movie_id,title
632,5349,Spider-Man (2002)
1523,8636,Spider-Man 2 (2004)
3114,52722,Spider-Man 3 (2007)
4986,76709,Spider-Man: The Ultimate Villain Showdown (2002)
7100,95510,"Amazing Spider-Man, The (2012)"
9153,110553,The Amazing Spider-Man 2 (2014)
10915,122926,Untitled Spider-Man Reboot (2017)
29305,195159,Spider-Man: Into the Spider-Verse (2018)
31393,201773,Spider-Man: Far from Home (2019)


The following function, which you can leave as-is, prints the title of a movie given its movie_id.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [7]:
# LEAVE AS-IS


def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [8]:
# def get_title(movie_id, movies):
#     return movies.loc[movie_id].title

In [9]:
# LEAVE AS-IS

# For testing, should print "Spider-Man 2 (2004)"
print(get_title(8636, movies))

Spider-Man 2 (2004)


## 1.3. Count unique records

Count the number of unique users and unique movies in the `ratings` variable. Use [unique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html). Print also the total number of movies in the `movies` variable. Your code should print:

```
Number of users who have rated a movie : 12676
Number of movies that have been rated  : 2049
Total number of movies                 : 33168
```

Note that ratings are heavily concentrated on a few popular movies.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your own code to indicate the number of unique users and unique movies in the "ratings" variable.</font>

In [10]:
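# Double quotes outside, single quotes inside: nesting the same quote type
# in an f-string is a syntax error before Python 3.12
print(f"Number of users who have rated a movie: {ratings['user_id'].nunique()}")
print(f"Number of movies that have been rated: {ratings['movie_id'].nunique():>6}")
print(f"Total number of movies: {movies['movie_id'].nunique():>21}")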

Number of users who have rated a movie: 12676
Number of movies that have been rated:   2049
Total number of movies:                 33168


# 2. Item-based Collaborative Filtering

The two main types of interactions-based recommender systems, also known as *collaborative filtering* algorithms, are:

1. **User-based Collaborative Filtering**: To recommend items for user A, we first look at other users B1, B2, ..., Bk with a similar behavior to A, and aggregate their preferences. For instance, if all Bi like a movie that A has not watched, it would be a good candidate to be recommended. 


2. **Item-based Collaborative Filtering**: To recommend items for user A, we first look at all the items I1, I2, ..., Ik that user A has consumed, and find items that elicit similar ratings from other users. For instance, an item that is rated positively by the same users who rate the Ii items positively, and negatively by the same users who rate the Ii items negatively, would be a good candidate to be recommended.

In both cases, a similarity matrix needs to be built. For user-based, the **user-similarity matrix** will hold a **similarity (or distance) metric** computed for every pair of users. For item-based, the **item-similarity matrix** will measure the similarity between every pair of items.

As we already know, there are several metrics for measuring the "similarity" of two items. Some of the most used are Jaccard, Cosine, and Pearson. Jaccard similarity is the number of users who have rated both items A and B divided by the number of users who have rated either A or B (very useful when there is no numeric rating but just a boolean value, such as a product being bought), whereas Pearson and Cosine similarities measure the similarity between two rating vectors.
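
The three metrics can be compared on a tiny example. This is only a sketch under stated assumptions: the rating vectors `item_a` and `item_b` are made up, a value of 0 is used to mean "not rated", and Cosine and Pearson are computed only over users who rated both items (other conventions exist).

```python
import numpy as np

# Hypothetical ratings given by five users to two items; 0 means "not rated"
item_a = np.array([5.0, 3.0, 0.0, 4.0, 2.0])
item_b = np.array([4.0, 0.0, 0.0, 5.0, 1.0])

# Jaccard: users who rated both items / users who rated at least one of them
rated_a, rated_b = item_a > 0, item_b > 0
jaccard = (rated_a & rated_b).sum() / (rated_a | rated_b).sum()

# Restrict to users who rated both items (an assumption of this sketch)
both = rated_a & rated_b
a, b = item_a[both], item_b[both]

# Cosine: angle between the two rating vectors
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pearson: cosine of the mean-centered vectors
pearson = np.corrcoef(a, b)[0, 1]

print(f"Jaccard={jaccard:.2f}  Cosine={cosine:.2f}  Pearson={pearson:.2f}")
```

Pearson subtracts each item's mean before comparing, so it is less affected than Cosine by items that simply receive high ratings from everyone.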

For the purpose of this assignment, we will use **Pearson Similarity** and we will implement **Item-based Collaborative Filtering**.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 2.1. Data pre-processing

Firstly, create a new dataframe called "rated_movies" that is simply the "ratings" dataset with column genres removed using the [Drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to generate "rated_movies" and print the first ten rows. This should have columns user_id, movie_id, rating, title</font>

In [11]:
rated_movies = ratings.drop('genres', axis=1)
display(rated_movies.head(10))

Unnamed: 0,user_id,movie_id,rating,title
0,4,3624,2.5,Shanghai Noon (2000)
1,4,3751,3.5,Chicken Run (2000)
2,4,3793,1.5,X-Men (2000)
3,4,3827,3.0,Space Cowboys (2000)
4,4,4308,3.5,Moulin Rouge (2001)
5,4,4816,4.0,Zoolander (2001)
6,4,4886,3.5,"Monsters, Inc. (2001)"
7,4,4963,4.5,Ocean's Eleven (2001)
8,4,4974,4.0,Not Another Teen Movie (2001)
9,4,4993,4.5,"Lord of the Rings: The Fellowship of the Ring,..."


Now, using the `rated_movies` dataframe, create a new dataframe named `ratings_summary` containing the following columns:

* movie_id
* title
* ratings_mean (average rating)
* ratings_count (number of people who have rated this movie)

You can use the following operations:

* Initialize `ratings_summary` to be only the movie_id and title of all movies in `rated_movies`
   * To group dataframe `df` by column `a` and keep only one unique row per value of `a`, use: `df.groupby('a').first()`
* Compute two series: `ratings_mean` and `ratings_count`:
   * To obtain a series with the average of column `a` for each distinct value of column `b` in dataframe `df`, use: `df.groupby('b')['a'].mean()`
   * To obtain a series with the count of column `a` for each distinct value of column `b` in dataframe `df`, use: `df.groupby('b')['a'].count()`
* Add these series to the `ratings_summary`
   * To add a series `s` with column name `a` to dataframe `df`, use: `df['a'] = s`
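
The operations listed above can be sketched on a small made-up dataframe. The names `toy` and `toy_summary` are hypothetical stand-ins for `rated_movies` and `ratings_summary`.

```python
import pandas as pd

# Hypothetical toy data standing in for rated_movies
toy = pd.DataFrame({
    'movie_id': [10, 10, 20, 20, 20],
    'title': ['Movie A', 'Movie A', 'Movie B', 'Movie B', 'Movie B'],
    'rating': [4.0, 5.0, 2.0, 3.0, 4.0],
})

# One row per movie_id, keeping its title
toy_summary = toy[['movie_id', 'title']].groupby('movie_id').first()

# Per-movie mean and count; both series are indexed by movie_id,
# so the assignment aligns them with the rows of toy_summary
toy_summary['ratings_mean'] = toy.groupby('movie_id')['rating'].mean()
toy_summary['ratings_count'] = toy.groupby('movie_id')['rating'].count()

print(toy_summary)
```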
    
<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to generate "ratings_summary" and print the first 10 rows.</font>

In [12]:
# Initialize ratings_summary
ratings_summary = (
    pd.DataFrame({'movie_id': ratings['movie_id'], 'title': ratings['title']})
    .groupby('movie_id')
    .first()
)

# Compute and assign mean and count series
grouping = ratings.groupby('movie_id')['rating']
ratings_summary['ratings_mean'] = grouping.mean()
ratings_summary['ratings_count'] = grouping.count()

display(ratings_summary.head(10))

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2769,"Yards, The (2000)",3.122549,102
3177,Next Friday (2000),2.824,125
3190,Supernova (2000),2.395683,139
3225,Down to You (2000),2.577273,110
3228,Wirey Spindell (2000),2.5,2
3239,Isn't She Great? (2000),1.947368,19
3273,Scream 3 (2000),2.444664,759
3275,"Boondock Saints, The (2000)",3.870682,1071
3276,Gun Shy (2000),3.33871,31
3279,Knockout (2000),2.0,2


To select from dataframe A those having column C larger or equal to N, you can do `A[A.C >= N]`.

To sort dataframe A by decreasing values of column C, you can do `A.sort_values(by='C', ascending=False)`.
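
A minimal sketch combining both operations, on a made-up summary table (`toy_summary` and its values are hypothetical). Note that "at least N" corresponds to `>=`, not `>`.

```python
import pandas as pd

# Hypothetical summary table, only for illustration
toy_summary = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'ratings_mean': [4.8, 3.9, 4.1],
    'ratings_count': [2, 150, 100],
})

# Keep movies with at least 100 ratings, then sort by decreasing mean rating
popular = toy_summary[toy_summary.ratings_count >= 100]
top = popular.sort_values(by='ratings_mean', ascending=False)

print(top)
```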

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to print the top 5 highest rated movies, considering only movies receiving at least 100 ratings.</font>

In [13]:
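# Note: `> 100` keeps movies with more than 100 ratings;
# `>= 100` would also include movies with exactly 100
display(
    ratings_summary[ratings_summary['ratings_count'] > 100]
    .sort_values('ratings_mean', ascending=False)
    .head(5)
)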

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.215216,2458
6016,City of God (Cidade de Deus) (2002),4.186592,2133
4226,Memento (2000),4.158512,4476
7156,Fog of War: Eleven Lessons from the Life of Ro...,4.112013,308
4973,"Amelie (Fabuleux destin d'Amélie Poulain, Le)...",4.097234,3687


<font size="+1" color="red">Repeat this, but this time consider movies receiving at least 3 ratings.</font>

In [14]:
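# Note: `> 3` keeps movies with more than 3 ratings;
# `>= 3` would also include movies with exactly 3
display(
    ratings_summary[ratings_summary['ratings_count'] > 3]
    .sort_values('ratings_mean', ascending=False)
    .head(5)
)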

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5082,"Rumor of Angels, A (2000)",4.666667,6
31954,Beautiful City (Shah-re ziba) (2004),4.4,5
5224,Promises (2001),4.388889,18
6672,War Photographer (2001),4.229167,24
5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.215216,2458


<font size="+1" color="red">Replace this cell with a brief commentary, in your own words, on what happens when the number of ratings is set to a small value.</font>

As in Lab 5 (remember `SuperBabies: Baby Geniuses 2 (2004)`), the best (and worst) movies by mean rating tend to have very few ratings. This happens because when the sample size is really small, statistics like the mean are unreliable. If these top-ranked movies had more ratings, they would probably be lower in the ranking.

## 2.2. Compute the user-movie matrix

Before calculating the **similarity matrix**, we create a table where columns are movies and rows are users, and each movie-user cell contains the score of that user for that movie.

We will use the [pivot_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) function of Pandas, which receives a dataframe plus the variable that will make the rows, the variable that will make the columns, and the variable that will make the cells, and transforms it into a matrix of the specified rows, columns, and cells.

For instance, if you have a dataframe D containing:

```
U V W
1 a 3.0
1 b 2.0
2 a 1.0
2 c 4.0
```

Calling `D.pivot_table(index='U', columns='V', values='W')` will create the following:

```
V  a   b   c
U
1 3.0 2.0 NaN
2 1.0 NaN 4.0
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to generate a "user_movie" matrix by calling "pivot_table" on "rated_movies". Print the first 5 rows. It might take about one minute to compute, depending on your computer.</font>

In [15]:
display(rated_movies.head(5))

Unnamed: 0,user_id,movie_id,rating,title
0,4,3624,2.5,Shanghai Noon (2000)
1,4,3751,3.5,Chicken Run (2000)
2,4,3793,1.5,X-Men (2000)
3,4,3827,3.0,Space Cowboys (2000)
4,4,4308,3.5,Moulin Rouge (2001)


In [16]:
user_movie = rated_movies.pivot_table(
    index='user_id', columns='movie_id', values='rating'
)
display(user_movie.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
33,,,,,,,,,,,...,,,,,,,,,,
62,,,,,,,,4.5,,,...,,,,,,,,,,3.5
63,,,,,,,,,,,...,,,,,,,,,,
95,,,,,,,,3.5,,,...,,,,,,,,,,


<font size="+1" color="red">Replace this cell with a brief commentary indicating why you think the "user_movie" matrix has so many "NaN" values. What do we call this characteristic of user ratings in recommender systems?</font>

As explained in the theory classes, the user-movie matrix is very sparse: each user has rated only a small fraction of all movies. Across a large number of users, this leaves only a few filled entries in a matrix that is otherwise almost entirely NaN. In recommender systems this characteristic is called sparsity.
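
One way to quantify this is the fraction of empty cells. The sketch below uses a made-up matrix (`toy_user_movie` is hypothetical); the same two lines could be applied to `user_movie`.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 4-movie matrix with only three ratings
toy_user_movie = pd.DataFrame(
    [
        [4.0, np.nan, np.nan, np.nan],
        [np.nan, 3.5, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 5.0],
    ],
    index=[1, 2, 3],
    columns=[10, 20, 30, 40],
)

# Fraction of cells without a rating
n_cells = toy_user_movie.size
n_rated = int(toy_user_movie.notna().sum().sum())
sparsity = 1 - n_rated / n_cells

print(f"{n_rated} ratings in {n_cells} cells, sparsity = {sparsity:.2f}")
```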

## 2.3. Explore some correlations in the user-movie matrix

Now let us explore whether correlations in this matrix make sense.

1. Locate the movie_id for the following three movies:
  * [Lord of the Rings: The Fellowship of the Ring (2001)](https://en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring) -- name this id_pivot
  * [Finding Nemo (2003)](https://en.wikipedia.org/wiki/Finding_Nemo) -- name this id_m1
  * [Talk to Her (Hable con Ella) (2002)](https://en.wikipedia.org/wiki/Talk_to_Her) -- name this id_m2
2. Obtain the ratings for each of these movies: `user_movie[movie_id].dropna()`. You will obtain a column, containing a series of ratings for each movie.
3. Consolidate these three series into a single dataframe: `ratings3 = pd.concat([s1, s2, s3], axis=1)`
4. Drop from `ratings3` all rows containing a *NaN*. This will keep only the users that have rated all the 3 movies.
5. Display the first 10 rows from this table.
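
The steps above can be sketched on a made-up matrix with three movies and four users (all names and values below are hypothetical). Only the users who rated all three movies survive the final `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix: columns are movie ids, rows are users
toy_user_movie = pd.DataFrame(
    {
        10: [4.0, np.nan, 3.0, 5.0],
        20: [3.5, 2.0, np.nan, 4.5],
        30: [np.nan, 1.0, 2.0, 4.0],
    },
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)

# One series of ratings per movie, without the missing values
s1 = toy_user_movie[10].dropna()
s2 = toy_user_movie[20].dropna()
s3 = toy_user_movie[30].dropna()

# Side-by-side concatenation aligns the series on user_id;
# users missing from any series get NaN and are then dropped
toy_ratings3 = pd.concat([s1, s2, s3], axis=1).dropna(how='any')
print(toy_ratings3)
```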

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute and display the first 10 rows of the "ratings3" table as described above.</font>

In [17]:
id_pivot = int(
    find_movies(
        'Lord of the Rings: The Fellowship of the Ring, The (2001)', movies
    ).movie_id.iloc[0]
)
id_m1 = int(find_movies('Finding Nemo (2003)', movies).movie_id.iloc[0])
id_m2 = int(find_movies('Talk to Her (Hable con Ella) (2002)', movies).movie_id.iloc[0])

ratings3 = pd.concat(
    [
        user_movie[id_pivot].dropna(),
        user_movie[id_m1].dropna(),
        user_movie[id_m2].dropna(),
    ],
    axis=1,
)
ratings3 = ratings3.dropna(how='any')
display(ratings3.head(10))

Unnamed: 0_level_0,4993,6377,5878
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
859,3.0,4.0,5.0
1229,4.0,4.0,4.5
1281,3.0,2.5,3.0
1722,5.0,4.5,4.0
2004,4.5,3.0,3.5
4590,4.0,4.0,2.0
5052,2.0,4.0,4.0
5144,5.0,5.0,5.0
6497,3.5,3.5,3.5
8369,3.0,4.0,4.5


To compute Pearson correlation, we use the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html) method.

To compute the correlation between two columns `a`, `b` in dataframe `df`, we use: `df[a].corr(df[b])`.
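
As a sanity check, the value returned by `corr` can be compared with the Pearson formula written out by hand. This is a sketch on made-up data (`toy`, `r_pandas`, and `r_manual` are hypothetical names); rows where either column is missing are excluded from the computation.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for two movies; the last user did not rate movie "a"
toy = pd.DataFrame({
    'a': [5.0, 4.0, 2.0, np.nan],
    'b': [4.0, 5.0, 1.0, 3.0],
})

# Pearson correlation as computed by pandas
r_pandas = toy['a'].corr(toy['b'])

# The same value by hand: covariance of the mean-centered columns divided
# by the product of their norms, using only rows where both are present
both = toy.dropna()
da = both['a'] - both['a'].mean()
db = both['b'] - both['b'].mean()
r_manual = (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

print(f"pandas: {r_pandas:.4f}, manual: {r_manual:.4f}")
```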

Compute the correlations between all pairs of columns of the `ratings3` table. You should display:

```
Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': 0.38
Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': 0.16
Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': 0.20
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute all correlations between these three movies, as described above.</font>

In [18]:
for i, j in enumerate(ratings3.columns.to_list()):
    for k in ratings3.columns.to_list()[i + 1 :]:
        print(
            'Similarity between \'{}\' and \'{}\': {:.2f}'.format(
                ratings_summary.loc[j, 'title'],
                ratings_summary.loc[k, 'title'],
                ratings3[j].corr(ratings3[k]),
            )
        )

Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': 0.38
Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': 0.16
Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': 0.20


<font size="+1" color="red">Replace this cell with a brief commentary on the correlations you find.</font>

There is a positive correlation between all pairs, although none is very pronounced. The strongest is between `Lord of the Rings` and `Finding Nemo` (0.38): a person who rated one positively is somewhat more likely to rate the other similarly. When the correlation is low, as between `Lord of the Rings` and `Talk to Her` (0.16) or between `Finding Nemo` and `Talk to Her` (0.20), the chance that the enjoyment will be similar is smaller, so we should do further analysis before drawing any conclusion.

Now let us take the first movie selected above, the one with movie_id `id_pivot`.

Select the column corresponding to this movie in `user_movie` and compute its correlation with all other columns in `user_movie`. This can be done with [corrwith](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corrwith.html).

To extract the ratings for a movie into a dataframe containing a single column "rating", you can use:

```
df = pd.DataFrame(user_movie[id_movie].dropna()).rename(columns={id_movie: "rating"})
```

To compute the correlation between two single-column dataframes `df1` and `df2`, you can use:

```
corr = df1.corrwith(df2).iloc[0]
```
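
`corrwith` also accepts a single column (a Series), in which case every column of the dataframe is correlated with it in one call. The sketch below uses a made-up matrix; `toy_user_movie` and `toy_similarity` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix; movie 40 shares only one rater with movie 10
toy_user_movie = pd.DataFrame({
    10: [5.0, 4.0, 2.0, 1.0, np.nan],
    20: [4.0, 5.0, 1.0, 2.0, 3.0],
    30: [1.0, 2.0, 4.0, 5.0, np.nan],
    40: [np.nan, np.nan, np.nan, 3.0, 4.0],
})

# Correlate every column with the pivot column (movie 10)
sim = toy_user_movie.corrwith(toy_user_movie[10])

# Columns without enough users in common yield NaN and are dropped
toy_similarity = (
    pd.DataFrame(sim).rename(columns={0: 'corr_with_pivot'}).dropna()
)
print(toy_similarity)
```

Pairs with too few ratings in common may also trigger NumPy runtime warnings, which are harmless here because the resulting NaN values are dropped.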

Store the result in a new dataframe named `similarity_to_pivot` containing two columns: `movie_id` and `corr_with_pivot`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create a "similarity_to_pivot" series that contains the computed correlations, dropping the NaNs in the series.</font>

In [29]:
# similarity_to_pivot = pd.DataFrame(
#     user_movie.apply(lambda x: user_movie[id_pivot].corr(x)).transpose()
# ).rename(columns={0: 'corr_with_pivot'})

# similarity_to_pivot = (
#     pd.DataFrame(user_movie.corrwith(user_movie[id_pivot]))
#     .rename(columns={0: 'corr_with_pivot'})
#     .dropna()
# )


def get_similarity(movie_id: int, user_movie: pd.DataFrame):
    return (
        pd.DataFrame(user_movie.corrwith(user_movie[movie_id]))
        .rename(columns={0: 'corr_with_pivot'})
        .dropna()
    )


similarity_to_pivot = get_similarity(id_pivot, user_movie)

display(similarity_to_pivot.head(10))



Unnamed: 0_level_0,corr_with_pivot
movie_id,Unnamed: 1_level_1
2769,-0.127515
3177,0.093221
3190,0.041206
3225,0.1266
3239,0.338378
3273,0.166968
3275,0.182484
3276,0.134264
3285,0.075311
3286,0.242781


Next, create a dataframe `corr_with_pivot` by using `similarity_to_pivot` and `ratings_summary`. This dataframe should have the following columns:

* movie_id
* corr_with_pivot - the correlation between movies movie_id and id_pivot
* title
* ratings_mean
* ratings_count

Keep only rows in which *ratings_count* > 500, i.e., popular movies. To filter a dataframe `df` and keep only rows having column `c` larger than `x`, use `df[df[c] > x]`.

Display the top 10 rows with the largest correlation. To select the largest `n` rows from dataframe `df` according to column `c`, use `df.sort_values(c, ascending=False).head(n)`. 
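
A minimal sketch of these three steps on made-up data. All names and values are hypothetical, and `join` is used here as one possible way to combine two dataframes that share the `movie_id` index; `pd.merge` works as well.

```python
import pandas as pd

idx = pd.Index([10, 20, 30], name='movie_id')

# Hypothetical correlations with the pivot movie (movie 10)
toy_similarity = pd.DataFrame({'corr_with_pivot': [1.0, 0.9, 0.4]}, index=idx)

# Hypothetical summary table
toy_summary = pd.DataFrame(
    {
        'title': ['Movie A', 'Movie B', 'Movie C'],
        'ratings_mean': [4.1, 3.2, 3.8],
        'ratings_count': [900, 50, 700],
    },
    index=idx,
)

# Combine on the shared movie_id index
toy_corr = toy_similarity.join(toy_summary, how='inner')

# Keep popular movies only, then take the most correlated ones
toy_corr = toy_corr[toy_corr['ratings_count'] > 500]
top = toy_corr.sort_values('corr_with_pivot', ascending=False).head(10)
print(top)
```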

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create a "corr_with_pivot" dataframe as specified above, and to print the 20 movies (rated 500 times or more) with the highest correlation with the selected movie.</font>

In [30]:
display(ratings_summary)

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2769,"Yards, The (2000)",3.122549,102
3177,Next Friday (2000),2.824000,125
3190,Supernova (2000),2.395683,139
3225,Down to You (2000),2.577273,110
3228,Wirey Spindell (2000),2.500000,2
...,...,...,...
33154,Enron: The Smartest Guys in the Room (2005),3.893293,164
33158,xXx: State of the Union (2005),2.265625,128
33162,Kingdom of Heaven (2005),3.417234,441
33164,House of Wax (2005),2.308943,123


In [31]:
corr_with_pivot = pd.merge(
    similarity_to_pivot,
    ratings_summary[ratings_summary['ratings_count'] > 500],
    how='inner',
    on='movie_id',
)
display(corr_with_pivot.sort_values(by='corr_with_pivot', ascending=False).head(20))

Unnamed: 0_level_0,corr_with_pivot,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
4993,1.0,"Lord of the Rings: The Fellowship of the Ring,...",4.09253,5944
5952,0.892103,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449
7153,0.892073,"Lord of the Rings: The Return of the King, The...",4.08396,5449
6539,0.377599,Pirates of the Caribbean: The Curse of the Bla...,3.779241,3950
8368,0.340934,Harry Potter and the Prisoner of Azkaban (2004),3.809971,2397
3578,0.337667,Gladiator (2000),3.95105,4811
3793,0.329686,X-Men (2000),3.556436,3535
4896,0.31918,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.678509,2843
3624,0.307471,Shanghai Noon (2000),3.297443,1017
31658,0.303898,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.064417,1141


<font size="+1" color="red">Replace this cell with a brief commentary about the movies you see on this list. What happens if you set the condition on *ratings_count* to a much larger value? What happens if you set it to a much smaller value?</font>

TODO: comment

## 2.4. Implement the item-based recommendations

Now that we believe that this type of correlation sort of makes sense, let us implement the item-based recommender. We need all correlations between columns in `user_movie`.

To compute all correlations between columns in a dataframe, use [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html). This function receives a matrix with *r* rows and *c* columns, and returns a square matrix of *c x c* containing all pair-wise correlations between columns.
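
On a made-up matrix with 4 users and 3 movies, the result is a symmetric 3 x 3 matrix with ones on the diagonal. The names `toy` and `sim` are hypothetical.

```python
import pandas as pd

# Hypothetical user-movie matrix: 4 users (rows) x 3 movies (columns)
toy = pd.DataFrame({
    10: [5.0, 4.0, 2.0, 1.0],
    20: [4.0, 5.0, 1.0, 2.0],
    30: [1.0, 2.0, 4.0, 5.0],
})

# Pairwise Pearson correlation between columns
sim = toy.corr()
print(sim)
```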

**This process may take a few minutes.** Print the first 5 rows of the resulting matrix when done.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie. Store this in "item_similarity", and print the first 10 rows.</font>

In [32]:
item_similarity = user_movie.corr()
display(item_similarity.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,0.115068,0.033721,-0.232268,,-0.5,0.197011,0.199514,0.250873,,...,0.37998,0.87831,,,,0.248126,0.180609,-0.08557,-0.408248,0.105671
3177,0.115068,1.0,0.30382,0.559533,,,0.331191,0.167918,1.0,,...,0.546119,0.735767,-1.0,,,-0.221382,0.317475,0.014735,0.661989,0.185654
3190,0.033721,0.30382,1.0,0.636361,,-0.014315,0.146042,0.394293,-0.290397,,...,0.246183,0.632026,,,,0.378181,0.170926,0.022444,-0.07336,-0.054114
3225,-0.232268,0.559533,0.636361,1.0,,0.578414,0.347716,0.263671,-0.250313,,...,-0.300376,0.318377,,,,0.480173,0.750306,0.536828,0.753141,0.098748
3228,,,,,1.0,,,,,,...,,,,,,,,,,


Similarities between movies that do not have many ratings in common are unreliable. Fortunately, the `corr` method includes a parameter `min_periods` that establishes a minimum number of elements in common that two columns must have to compute the correlation.

Re-generate item_similarity setting min_periods to 100.
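
The effect of `min_periods` can be seen on a small made-up matrix (hypothetical names and values, with a threshold of 4 instead of 100). Pairs of movies with fewer users in common than the threshold become NaN, including the diagonal entry of a movie that has too few ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix: movie 30 has only three ratings
toy = pd.DataFrame({
    10: [5.0, 4.0, 2.0, 1.0, 3.0],
    20: [4.0, 5.0, 1.0, 2.0, np.nan],
    30: [1.0, np.nan, np.nan, 5.0, 2.0],
})

loose = toy.corr()                 # any overlap is accepted
strict = toy.corr(min_periods=4)   # at least 4 users in common required

print(strict)
```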

This process will also take a few minutes. Print the first 5 rows of the resulting matrix when done.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie, but considering only movies having at least 100 ratings in common. Store this in "item_similarity_min_ratings"</font>

In [33]:
item_similarity_min_ratings = user_movie.corr(min_periods=100)
display(item_similarity_min_ratings.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,,,,,,,,,,...,,,,,,,,,,
3177,,1.0,,,,,,,,,...,,,,,,,,,,
3190,,,1.0,,,,,,,,...,,,,,,,,,,
3225,,,,1.0,,,,,,,...,,,,,,,,,,
3228,,,,,,,,,,,...,,,,,,,,,,


We will need to test our function so let us select a couple of interesting users.

Our first user, `user_id_super` will be someone who has given the following 3 films a rating higher than 4.5:

* movie_id=5349: *Spider-Man (2002)*
* movie_id=3793: *X-Men (2000)*
* movie_id=6534: *Hulk (2003)* 	

Our second user, `user_id_drama` will be someone who has given the following 3 films a rating higher than 4.5:

* movie_id=6870: *Mystic River (2003)*
* movie_id=5995: *Pianist, The (2002)*
* movie_id=3555: *U-571 (2000)*

To filter a dataframe by multiple conditions you can use, e.g., `df[(a > 1) & (b > 2)]`. 
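
A short sketch of this kind of filter on a made-up user-movie matrix (all names and values are hypothetical). Each condition needs its own parentheses because `&` binds more tightly than `>`, and a comparison against NaN evaluates to False, so users who did not rate one of the movies are excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix
toy_user_movie = pd.DataFrame(
    {
        10: [5.0, 4.0, 5.0],
        20: [5.0, 5.0, np.nan],
        30: [5.0, 5.0, 5.0],
    },
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Users who rated all three movies above 4.5
fans = toy_user_movie[
    (toy_user_movie[10] > 4.5)
    & (toy_user_movie[20] > 4.5)
    & (toy_user_movie[30] > 4.5)
]

# The index holds the user ids; take the first match
first_fan = fans.index[0]
print(first_fan)
```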

**Important**: these particular users have watched lots of movies, so we cannot tell for sure they have only these interests.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to find the userids of two example users: user_id_super (the one who liked the three superhero movies), and user_id_drama (the one who liked the three dramas)</font>

In [34]:
user_id_super = user_movie[
    (user_movie[5349] > 4.5) & (user_movie[3793] > 4.5) & (user_movie[6534] > 4.5)
].first_valid_index()
user_id_drama = user_movie[
    (user_movie[6870] > 4.5) & (user_movie[5995] > 4.5) & (user_movie[3555] > 4.5)
].first_valid_index()

We will need some auxiliary functions that are provided below. You can leave as-is.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [35]:
# Leave this code as-is


# Gets a list of watched movies for a user_id
def get_watched_movies(user_id, user_movie):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)


# Gets the rating a user_id has given to a movie_id
def get_rating(user_id, movie_id, user_movie):
    return user_movie[movie_id][user_id]


# Print watched movies
def print_watched_movies(user_id, user_movie, movies):
    for movie_id in get_watched_movies(user_id, user_movie):
        print(
            "%d %.1f %s "
            % (
                movie_id,
                get_rating(user_id, movie_id, user_movie),
                get_title(movie_id, movies),
            )
        )

In [36]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_super, user_movie, movies)

3578 5.0 Gladiator (2000) 
3598 5.0 Hamlet (2000) 
3751 5.0 Chicken Run (2000) 
3753 5.0 Patriot, The (2000) 
3785 5.0 Scary Movie (2000) 
3755 5.0 Perfect Storm, The (2000) 
3624 5.0 Shanghai Noon (2000) 
3623 5.0 Mission: Impossible II (2000) 
5349 5.0 Spider-Man (2002) 
4701 5.0 Rush Hour 2 (2001) 
4270 5.0 Mummy Returns, The (2001) 
4306 5.0 Shrek (2001) 
4388 5.0 Scary Movie 2 (2001) 
4370 5.0 A.I. Artificial Intelligence (2001) 
3988 5.0 How the Grinch Stole Christmas (a.k.a. The Grinch) (2000) 
3863 5.0 Cell, The (2000) 
3827 5.0 Space Cowboys (2000) 
3793 5.0 X-Men (2000) 
7153 5.0 Lord of the Rings: The Return of the King, The (2003) 
7324 5.0 Hidalgo (2004) 
7454 5.0 Van Helsing (2004) 
8636 5.0 Spider-Man 2 (2004) 
8533 5.0 Notebook, The (2004) 
6761 5.0 Tibet: Cry of the Snow Lion (2002) 
6946 5.0 Looney Tunes: Back in Action (2003) 
8368 5.0 Harry Potter and the Prisoner of Azkaban (2004) 
8622 5.0 Fahrenheit 9/11 (2004) 
30816 5.0 Phantom of the Opera, The (2004) 
5389 5.

In [37]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_drama, user_movie, movies)

3555 5.0 U-571 (2000) 
4034 5.0 Traffic (2000) 
4014 5.0 Chocolat (2000) 
3967 5.0 Billy Elliot (2000) 
5669 5.0 Bowling for Columbine (2002) 
5991 5.0 Chicago (2002) 
5995 5.0 Pianist, The (2002) 
4995 5.0 Beautiful Mind, A (2001) 
6870 5.0 Mystic River (2003) 
7147 5.0 Big Fish (2003) 
8622 5.0 Fahrenheit 9/11 (2004) 
8464 5.0 Super Size Me (2004) 
30707 5.0 Million Dollar Baby (2004) 
5015 4.5 Monster's Ball (2001) 
5989 4.5 Catch Me If You Can (2002) 
6953 4.5 21 Grams (2003) 
3510 4.5 Frequency (2000) 
5464 4.5 Road to Perdition (2002) 
5010 4.0 Black Hawk Down (2001) 
5299 4.0 My Big Fat Greek Wedding (2002) 
4308 4.0 Moulin Rouge (2001) 
4022 4.0 Cast Away (2000) 
3897 4.0 Almost Famous (2000) 
3755 4.0 Perfect Storm, The (2000) 
3948 3.5 Meet the Parents (2000) 
4246 3.5 Bridget Jones's Diary (2001) 
4447 3.5 Legally Blonde (2001) 
4975 3.5 Vanilla Sky (2001) 
4019 3.5 Finding Forrester (2000) 
5377 3.5 About a Boy (2002) 
5349 3.0 Spider-Man (2002) 
6281 3.0 Phone Booth (2002)

For every user, we define the importance of a new movie (one the user has not rated) as the sum of the similarities between that movie and all the movies the user has already rated.

To refine this, we will compute a weighted sum in which each similarity is weighted by the rating the user gave to the corresponding rated movie.

For instance, suppose a user has rated movies as follows:

```
movie_id rating
1        2.0
2        3.0
3        NaN
4        NaN
```

And that movie similarities are as follows (values with a "." do not matter in this example):

```
movie_id   1   2   3   4
1         ...............
2         ...............
3         0.1 0.2 NaN ...
4         0.9 0.8 ... NaN
```

The importance of movie 3 to this user will be:

```
2.0 * 0.1 + 3.0 * 0.2 = 0.8
```

While the importance of movie 4 to this user will be:

```
2.0 * 0.9 + 3.0 * 0.8 = 4.2
```

As we can see, we are favoring movies that are highly similar to many movies that the user has rated highly.
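The toy example above can be checked with a few lines of pandas. This is only a sketch: the names `toy_ratings`, `toy_similarity` and `toy_relevance` are made up for illustration, and the similarities shown as "." in the example are set to 0.0 as placeholders, which is an assumption that does not affect movies 3 and 4.

```python
import numpy as np
import pandas as pd

# Ratings from the example; NaN marks movies the user has not rated
toy_ratings = pd.Series({1: 2.0, 2: 3.0, 3: np.nan, 4: np.nan})

# Similarities from the example; the "." entries are filled with 0.0 here
# purely as placeholders (assumption), they do not affect movies 3 and 4
toy_similarity = pd.DataFrame(
    [
        [np.nan, 0.0, 0.1, 0.9],
        [0.0, np.nan, 0.2, 0.8],
        [0.1, 0.2, np.nan, 0.0],
        [0.9, 0.8, 0.0, np.nan],
    ],
    index=[1, 2, 3, 4],
    columns=[1, 2, 3, 4],
)

# Keep only the movies the user has rated
rated = toy_ratings.dropna()

# For each movie (row), sum rating * similarity over the rated movies (columns)
toy_relevance = toy_similarity[rated.index].mul(rated, axis=1).sum(axis=1)

# Restrict to the movies the user has not rated yet (movies 3 and 4)
print(toy_relevance[toy_ratings.isna()])
```

Up to floating-point rounding, the weighted sums are `2.0 * 0.1 + 3.0 * 0.2` for movie 3 and `2.0 * 0.9 + 3.0 * 0.8` for movie 4.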

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Create a function `get_movies_relevance` that returns a dataframe with columns `movie_id` and `relevance`. You can use the following template:

```python
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    
    # Create an empty series
    movies_relevance = ...
    
    # Iterate through the movies the user has watched
    for watched_movie in ...
        
        # Obtain the rating given
        rating_given = ...
        
        # Obtain the vector containing the similarities of watched_movie
        # with all other movies in item_similarity_matrix
        similarities = ...
        
        # Multiply this vector by the given rating
        weighted_similarities = ...
        
        # Append these terms to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df

```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code for "get_movies_relevance"</font>

In [38]:
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):

    # Create an empty series
    movies_relevance = pd.Series(dtype=float)

    # Iterate through the movies the user has watched
    for watched_movie in get_watched_movies(user_id, user_movie):

        # Obtain the rating given
        rating_given = get_rating(user_id, watched_movie, user_movie)

        # Obtain the vector containing the similarities of watched_movie
        # with all other movies in item_similarity_matrix
        similarities = item_similarity_matrix[watched_movie]

        # Multiply this vector by the given rating
        weighted_similarities = rating_given * similarities

        # Append these terms to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])

    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()

    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index

    return movies_relevance_df
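As a possible alternative to the loop, the same weighted sum can be written as a single matrix-vector product. The sketch below is not part of the assignment template: the function name `get_movies_relevance_vectorized` and the toy inputs are hypothetical, and it assumes that `item_similarity_matrix` is a square dataframe whose index and columns are both movie ids.

```python
import numpy as np
import pandas as pd


def get_movies_relevance_vectorized(user_id, user_movie, item_similarity_matrix):
    # Ratings given by the user, indexed by movie_id (unrated movies dropped)
    ratings = user_movie.loc[user_id].dropna()

    # Keep only rated movies that appear in the similarity matrix
    ratings = ratings[ratings.index.isin(item_similarity_matrix.columns)]

    # Missing similarities contribute nothing to the sum
    similarities = item_similarity_matrix[ratings.index].fillna(0.0)

    # Row i of the result is sum_j similarity[i, j] * rating[j]
    relevance = similarities.dot(ratings)

    # Same output layout as the loop-based version
    movies_relevance_df = relevance.to_frame('relevance')
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    return movies_relevance_df


# Toy data (hypothetical, for illustration only): one user, four movies
toy_user_movie = pd.DataFrame(
    [[2.0, 3.0, np.nan, np.nan]], index=[42], columns=[1, 2, 3, 4]
)
toy_item_similarity = pd.DataFrame(
    [
        [np.nan, 0.0, 0.1, 0.9],
        [0.0, np.nan, 0.2, 0.8],
        [0.1, 0.2, np.nan, 0.0],
        [0.9, 0.8, 0.0, np.nan],
    ],
    index=[1, 2, 3, 4],
    columns=[1, 2, 3, 4],
)
toy_result = get_movies_relevance_vectorized(42, toy_user_movie, toy_item_similarity)
print(toy_result)
```

On large matrices this avoids one `pd.concat` per watched movie, at the cost of being a little less explicit about each step.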

Apply `get_movies_relevance` to the two users we have selected, `user_id_super` and `user_id_drama`.

The result will contain only `movie_id` and `relevance`, so you will have to merge it with the `movies` dataframe on the `movie_id` attribute.

Sort the results by descending relevance and print the top 10 for each case.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to obtain the 10 most relevant movies for the users user_id_super (who likes superhero movies) and user_id_drama (who likes dramas)</font>

In [43]:
display(
    pd.merge(
        get_movies_relevance(user_id_super, user_movie, item_similarity_min_ratings),
        movies,
        how='inner',
        on='movie_id',
    )
    .sort_values('relevance', ascending=False)
    .head(10)
)

Unnamed: 0,relevance,movie_id,title,genres
1531,189.170085,8644,"I, Robot (2004)",Action|Adventure|Sci-Fi|Thriller
674,181.63812,5459,Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (...,Action|Comedy|Sci-Fi
86,176.650945,3753,"Patriot, The (2000)",Action|Drama|War
1472,172.899804,8361,"Day After Tomorrow, The (2004)",Action|Adventure|Drama|Sci-Fi|Thriller
313,172.700877,4310,Pearl Harbor (2001),Action|Drama|Romance|War
300,172.301301,4270,"Mummy Returns, The (2001)",Action|Adventure|Comedy|Thriller
328,169.123776,4367,Lara Croft: Tomb Raider (2001),Action|Adventure
1033,168.960164,6373,Bruce Almighty (2003),Comedy|Drama|Fantasy|Romance
1652,168.783883,8972,National Treasure (2004),Action|Adventure|Drama|Mystery|Thriller
1026,166.866641,6365,"Matrix Reloaded, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX


In [44]:
display(
    pd.merge(
        get_movies_relevance(user_id_drama, user_movie, item_similarity_min_ratings),
        movies,
        how='inner',
        on='movie_id',
    )
    .sort_values('relevance', ascending=False)
    .head(10)
)

Unnamed: 0,relevance,movie_id,title,genres
1638,65.46137,8958,Ray (2004),Drama
197,63.007635,4019,Finding Forrester (2000),Drama
1090,61.354376,6565,Seabiscuit (2003),Drama
510,61.21305,4995,"Beautiful Mind, A (2001)",Drama|Romance
517,61.209632,5014,I Am Sam (2001),Drama
1287,60.751048,7143,"Last Samurai, The (2003)",Action|Adventure|Drama|War
1531,60.700299,8644,"I, Robot (2004)",Action|Adventure|Sci-Fi|Thriller
1211,60.611768,6870,Mystic River (2003),Crime|Drama|Mystery
1363,59.820898,7325,Starsky & Hutch (2004),Action|Comedy|Crime|Thriller
1490,59.438079,8464,Super Size Me (2004),Comedy|Documentary|Drama


<font size="+1" color="red">Replace this cell with a brief commentary on the movies you see on these lists. How many of them look relevant for the intended users? Feel free to use IMDB or Wikipedia to get info on these movies.</font>

<font size="-1" color="gray">All those trivial facts you learned about 1980s and 1990s pop culture were supposed to be useful one day; that day has arrived :-)</font>

TODO: COMMENT

Finally, you only need to remove the movies the user has watched. To do so:

* Obtain the dataframe of relevant movies with `get_movies_relevance`
* Set this dataframe index to 'movie_id'
* Obtain the list of movie_ids of watched movies with `get_watched_movies`
* Drop from the relevant movies dataframe the watched movies

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code implementing "get_recommended_movies"</font>

In [45]:
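def get_recommended_movies(user_id, user_movie, item_similarity_matrix):
    # Relevance of every movie for this user, indexed by movie_id
    relevant_movies = get_movies_relevance(user_id, user_movie, item_similarity_matrix)
    relevant_movies.set_index('movie_id', inplace=True)

    # Movies the user has already rated
    movie_ids_list = get_watched_movies(user_id, user_movie)

    # Drop watched movies; ignore any that are absent from the relevance index
    relevant_movies.drop(movie_ids_list, axis=0, inplace=True, errors='ignore')
    return relevant_movies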
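The merge, sort, and head steps used to display results in this notebook could be wrapped in a small helper. This is only a sketch: `top_n_with_titles` and the toy dataframes are hypothetical names and values, and the helper assumes its first argument has a `relevance` column and a `movie_id` column or index level.

```python
import pandas as pd


def top_n_with_titles(relevance_df, movies, n=10):
    # Attach title and genres, then keep the n rows with the highest relevance
    merged = pd.merge(relevance_df, movies, how='inner', on='movie_id')
    return merged.sort_values('relevance', ascending=False).head(n)


# Toy inputs (hypothetical values, for illustration only)
toy_relevance_df = pd.DataFrame(
    {'relevance': [0.8, 4.2, 1.5]},
    index=pd.Index([3, 4, 5], name='movie_id'),
)
toy_movies = pd.DataFrame(
    {
        'movie_id': [3, 4, 5],
        'title': ['Movie C', 'Movie D', 'Movie E'],
        'genres': ['Drama', 'Action', 'Comedy'],
    }
)

toy_top = top_n_with_titles(toy_relevance_df, toy_movies, n=2)
print(toy_top)
```

With the notebook's own objects, a call might look like `top_n_with_titles(get_recommended_movies(user_id_super, user_movie, item_similarity_min_ratings), movies)`.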

<font size="+1" color="red">Replace this cell with your code to obtain the 10 most recommended movies for the users user_id_super and user_id_drama</font>

In [46]:
display(
    pd.merge(
        get_recommended_movies(user_id_super, user_movie, item_similarity_min_ratings),
        movies,
        how='inner',
        on='movie_id',
    )
    .sort_values('relevance', ascending=False)
    .head(10)
)

Unnamed: 0,movie_id,relevance,title,genres
928,6365,166.866641,"Matrix Reloaded, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX
165,4018,165.338077,What Women Want (2000),Comedy|Romance
171,4025,163.032765,Miss Congeniality (2000),Comedy|Crime
625,5507,161.080324,xXx (2002),Action|Crime|Thriller
938,6378,155.293219,"Italian Job, The (2003)",Action|Crime
1857,31685,154.993274,Hitch (2005),Comedy|Romance
132,3948,150.570934,Meet the Parents (2000),Comedy
288,4369,148.949754,"Fast and the Furious, The (2001)",Action|Crime|Thriller
1135,6934,148.394158,"Matrix Revolutions, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX
438,4963,148.251901,Ocean's Eleven (2001),Crime|Thriller


In [47]:
display(
    pd.merge(
        get_recommended_movies(user_id_drama, user_movie, item_similarity_min_ratings),
        movies,
        how='inner',
        on='movie_id',
    )
    .sort_values('relevance', ascending=False)
    .head(10)
)

Unnamed: 0,movie_id,relevance,title,genres
1585,8958,65.46137,Ray (2004),Drama
1050,6565,61.354376,Seabiscuit (2003),Drama
491,5014,61.209632,I Am Sam (2001),Drama
1317,7325,59.820898,Starsky & Hutch (2004),Action|Comedy|Crime|Thriller
1248,7149,59.294621,Something's Gotta Give (2003),Comedy|Drama|Romance
333,4448,58.968024,"Score, The (2001)",Action|Drama
1365,7445,58.192646,Man on Fire (2004),Action|Crime|Drama|Mystery|Thriller
543,5152,58.004447,We Were Soldiers (2002),Action|Drama|War
82,3753,57.920754,"Patriot, The (2000)",Action|Drama|War
242,4223,57.482846,Enemy at the Gates (2001),Drama|War


<font size="+1" color="red">Replace this cell with a brief commentary on these recommendations. Do you think they are relevant? Why or why not? After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?</font>

TODO: COMMENT

# DELIVER (individually)

Remember to read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/data-mining-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook

## Extra points available

For more learning and extra points, use the [surprise](http://surpriselib.com/) library to generate recommendations for the same two users. Display the generated recommendations and comment on them.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: surprise library</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>