# Recommender Systems

<small><i>Updated January 2022</i></small>

<div class="alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
<h2>Outline</h2>
<ol>
    <li>What is a recommender system?</li>
    <li>How to build a recommender system? </li>
    <li>How to evaluate its success?</li>
</ol>
</div>

## Steps to build a recommender system:
<ol>
    <li>Data collection and understanding</li>
    <li>Data filtering/cleaning</li>
    <li>Learning<br>
        <span style="font-size:smaller">E.g., using item/user similarity function</span></li>
    <li>Evaluation</li>
</ol>

## Types of Recommenders
<div class="alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
    <h3>Non-Personalized filtering</h3><br/>
    Based on general information about the items, without using any data about the user who receives the recommendation. 
</div>
<div class="alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
    <h3>Content-based filtering</h3><br/>
    Based on items' descriptions and a profile of user’s preferences. They recommend items which are similar to the ones that the user likes.
    <br>We usually need to compute the <b>similarity between items</b> based on their description.
        <img src=https://miro.medium.com/max/1334/1*jVG54DFcmaWeJPuJxbGH3w.png width=400>
        <center><small>By Nafeea Afshin at <a href=https://nafeea3000.medium.com/recommender-systems-c8db209dd0d3>medium.com</a></small></center>
</div>
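
As a minimal sketch of the item-to-item comparison described above, the snippet below ranks items by the cosine similarity of their descriptions. The movie names, the three-genre encoding, and the helper names `cosine` and `most_similar_items` are hypothetical and only serve as an illustration; a real system would derive the feature vectors from actual item metadata.

```python
import math

# Hypothetical item descriptions: 1 if the movie belongs to the genre, else 0.
# Genre order: [action, sci-fi, romance]
item_features = {
    "Movie A": [1, 1, 0],
    "Movie B": [1, 1, 0],
    "Movie C": [0, 0, 1],
}

def cosine(v1, v2):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def most_similar_items(liked, features):
    """Rank every other item by its similarity to the liked item."""
    scores = {name: cosine(features[liked], vec)
              for name, vec in features.items() if name != liked}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A user who liked "Movie A" would be recommended "Movie B" first.
ranking = most_similar_items("Movie A", item_features)
print(ranking)
```
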
<div class="alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
        <h3>Collaborative filtering</h3><br/>
        Based on users’ behavior, activities and preferences. Requires a <b>community of users</b>.
        <br><b>Hypothesis: Similar users tend to like similar items.</b>
        <br>
        <br>There are two main types: 
    <ul>
        <li><b>User-based CF</b>: given a user U, find a set of users S which are similar to user U. Then, use the ratings from users in S to make predictions for the user U.</li><br>
        <li><b>Item-based CF</b>: build an item-item similarity matrix (measure similarity between all pairs of items). Then, use this matrix to find items similar to the ones already liked by the user.</li>
        </ul>
    Similarity in both cases can be defined in terms of similar ratings.
        <img src=https://miro.medium.com/max/4800/1*Mlt-6jMs0JOHSm9zOTqTDw.jpeg width=700>
        <small><center>By Sammit at <a href=https://blog.clairvoyantsoft.com/mlmuse-introduction-to-recommendation-systems-part-i-99dc523b05dc>medium.com</a></center></small>
</div>

### Summary: 
<table style="width:95%;border:1px solid black;">
  <tr >
  <td style="width:20%"></td>
  <td>Pros</td>
  <td>Cons</td>
  </tr>
  <tr>
  <td>Content-Based</td>
  <td>No community required; comparison between items possible from the beginning; good explainability</td>
  <td>Content description needed; no surprises (recommendations rarely stray from what the user already likes)</td>
  </tr>
  <tr>
  <td>Collaborative filtering</td>
  <td>Well-understood, works well in several domains; easy to implement</td>
  <td>Requires a community of users; sparsity problems; difficult to explain suggestions; cold-start problem (for new users and items)</td>
  </tr>

 </table>


<div class="alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
        <h3>Hybrid solutions</h3><br/>
        Hybrid approaches build models that combine content-based and collaborative filtering recommenders.
</div>




<hr/>

# Hands on
## A User-Based Collaborative Filtering RecSys for Movielens

Given a user (Marta) and an item that she has not seen, the goal is to estimate her rating for the item. In the most basic situation, the data that we are going to use looks like this:
<table style="width:60%">
  <tr>
    <td></td>
    <td>Superman</td> 
    <td>Star Wars 1</td>
    <td>Matrix</td>
    <td>Spiderman</td>
    
  </tr>
  <tr>
    <td>User1</td>
    <td>3.5</td> 
    <td>4</td>
    <td>5</td>
    <td>5</td>
  </tr>
  <tr>
    <td>User2</td>
    <td>3</td> 
    <td><font color="red"><b>¿?</b></font></td>
    <td>4.5</td>
    <td>3</td>
  </tr>
  <tr>
    <td>User3</td>
    <td>3.5</td> 
    <td>5</td>
    <td>3.5</td>
    <td>2</td>
  </tr>
  <tr>
    <td>User 4 (Marta)</td>
    <td>3</td> 
    <td>3.5</td>
    <td>4.5</td>
    <td><font color="red"><b>¿?</b></font></td>
  </tr>
</table>


We will again use the MovieLens dataset, which you should have downloaded to complete the first notebook.

Let us first load the libraries that we are going to need:

In [23]:
%autosave 150
%matplotlib inline
import pandas as pd
import numpy as np
import math
import copy
import matplotlib.pylab as plt

Autosaving every 150 seconds


And, next, the dataset:

In [25]:
# The dataset is composed of 3 main files

# The users file 
u_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols)

# The movies (items) file
m_cols = ['movie_id', 'title', 'release_date']
# It contains additional columns indicating, among other things, the movies' genres.
# Let's only load the first three columns:
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(3), encoding='latin-1')

# The ratings file 
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols)


# We merge all three dataframes into a single dataset
data = pd.merge(pd.merge(ratings, users), movies)
# and keep only the columns that we are going to use
data = data[['user_id', 'rating', 'movie_id', 'title']]

We will use a subset of just 100 users, the ones with the largest number of ratings. We keep 20% of them for evaluation purposes, and learn with the remaining 80%:

In [10]:
np.random.seed(7) # for replicability

# We keep only data regarding the 100 users with the largest number of ratings
user_id_most_raters = data.groupby('user_id').size().sort_values(ascending=False).head(100).keys()
data = data[data['user_id'].isin(user_id_most_raters)].copy()
print('Dataset size:', data.shape)
print('Users:', data.user_id.nunique())
print('Films:', data.movie_id.nunique())

Dataset size: (33389, 4)
Users: 100
Films: 1615


As you might infer from the previous CF description, computing the similarity between users (or between items, in item-based CF) is critical in CF methods. <br>

## Defining a similarity between users

In order to define a similarity metric between two users, it makes sense to look first at the movies that they have both seen/rated:

In [11]:
# dataframe with the data from a first user
data_user_1 = data[data.user_id == data.user_id.unique()[0]]

# dataframe with the data from a second user
data_user_2 = data[data.user_id == data.user_id.unique()[1]]

# merge works as an inner join
common_movies = pd.merge(data_user_1, data_user_2, on='movie_id')
print("\nNumber of movies seen by both users:", common_movies.shape[0])

print("Ratings from user 1")
print(common_movies[['title_x','rating_x']].head(5))
print("Ratings from user 2")
print(common_movies[['title_y','rating_y']].head(5))


Number of movies seen by both users: 98
Ratings from user 1
                               title_x  rating_x
0                         Kolya (1996)         4
1  Truth About Cats & Dogs, The (1996)         3
2                 Birdcage, The (1996)         3
3          English Patient, The (1996)         3
4                 Marvin's Room (1996)         3
Ratings from user 2
                               title_y  rating_y
0                         Kolya (1996)         1
1  Truth About Cats & Dogs, The (1996)         3
2                 Birdcage, The (1996)         5
3          English Patient, The (1996)         1
4                 Marvin's Room (1996)         2


The basic idea when measuring similarity between two users, <i>a</i> and <i>b</i>, is to first identify the set of items that both users have rated (<i>P</i>), and then to apply a (dis)similarity function to their ratings of these commonly rated movies.

We can use different (dis)similarity functions. For example,
<ul>
    <li>Euclidean distance    <br>
    $$dist(a,b) = \sqrt{\sum_{p \in P}{(r_{a,p} - r_{b,p})^2}}$$
which needs to be transformed to work as a similarity measure: $sim(a,b) = (1+dist(a,b))^{-1}$.
    </li>
    <li>Pearson correlation    <br>
    $$sim(a,b) = \frac{\sum_{p\in P} (r_{a,p}-\bar{r_a})(r_{b,p}-\bar{r_b})}{\sqrt{\sum_{p \in P}(r_{a,p}-\bar{r_a})^2}\sqrt{\sum_{p \in P}(r_{b,p}-\bar{r_b})^2}}$$
    </li>
    <br>
    <li>Cosine similarity    <br>
    $$ sim(a,b) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|}
    =\frac{\sum_{p\in P} r_{a,p} r_{b,p}}{\sqrt{\sum_{p\in P} r^2_{a,p}}\sqrt{\sum_{p\in P} r^2_{b,p}}}$$
    </li>
    <br>
</ul>

  
<br>
where: 

* $a$ and $b$ are users
* $P$ is the set of items that both users $a$ and $b$ have ever rated.
* $r_{a,p}$ is the rating of item $p$ by user $a$
* $\bar{r_a}$ is the mean rating given by user $a$
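
To make the three formulas concrete, here is a small pure-Python sketch that evaluates them on the toy ratings table from the beginning of this section, for User1 and User 4 (Marta) over the three movies both have rated. The function names are illustrative, and the Pearson version assumes that the means $\bar{r_a}$ and $\bar{r_b}$ are taken over the common items $P$ only.

```python
import math

# Ratings of User1 and User 4 (Marta) for the movies both have rated in the
# toy table: Superman, Star Wars 1, Matrix.
user1 = [3.5, 4.0, 5.0]
marta = [3.0, 3.5, 4.5]

def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance) over the common items."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def pearson_similarity(a, b):
    """Pearson correlation; assumption: means computed over the common items."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = (math.sqrt(sum((x - mean_a) ** 2 for x in a))
           * math.sqrt(sum((y - mean_b) ** 2 for y in b)))
    return num / den if den else 0.0

def cosine_similarity(a, b):
    """Cosine of the angle between the two rating vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

print("Euclidean:", euclidean_similarity(user1, marta))
print("Pearson:  ", pearson_similarity(user1, marta))
print("Cosine:   ", cosine_similarity(user1, marta))
```

Marta rates every common movie exactly half a point lower than User1, so the Pearson correlation comes out as (numerically) 1, while the Euclidean-based similarity is only about 0.54.
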


So, let us implement this:

In [26]:
from scipy.stats import pearsonr
from scipy.spatial.distance import euclidean

# euclidean distance based similarity using scipy's euclidean definition
def euclideanSimilarity(v1, v2):
    return 1.0 / (1.0 + euclidean(v1,v2))

# wrapper for pearson correlation similarity which uses scipy's definition
def pearsonSimilarity(v1, v2):
    res = pearsonr(v1, v2)[0]
    if math.isnan(res) or res < 0:
        res = 0
    return res

# Returns a similarity score for two users
def similarityFunction(myData, user1, user2, similarity=euclideanSimilarity):
    # Get movies rated by user1
    movies_user1 = myData[myData['user_id'] == user1]
    # Get movies rated by user2
    movies_user2 = myData[myData['user_id'] == user2]
    
    # Find commonly rated films
    rep=pd.merge(movies_user1, movies_user2, on='movie_id')    

    return similarity(rep['rating_x'], rep['rating_y'])


print("Euclidean distance-based similarity:", similarityFunction(data, 
                                                                 data.user_id.unique()[10], 
                                                                 data.user_id.unique()[11]))

print("Pearson correlation-based similarity:", similarityFunction(data, 
                                                                  data.user_id.unique()[10], 
                                                                  data.user_id.unique()[11], 
                                                                  similarity=pearsonSimilarity))

Euclidean distance-based similarity: 0.07263669959755374
Pearson correlation-based similarity: 0.06317948789567646


<div class="alert alert-success">
Question #1.-<br>
<span style="color:black">Implement the cosine similarity
</span></div>

In [27]:
# Question 1
from scipy import spatial
from numpy import dot
from numpy.linalg import norm
# cosine similarity
def cosineSimilarity(v1, v2):
    # Following the equation
    cos1 = dot(v1,v2)/(norm(v1)*norm(v2))
    # Using library spatial from scipy
    cos2 = 1.0 - spatial.distance.cosine(v1,v2)
    print(f'Following Cosine equation we obtain: {cos1} \n and using library we obtain:{cos2} ')
    return cos1, cos2
    
print("Cosine similarity:", similarityFunction(data,
                                               data.user_id.unique()[10], 
                                               data.user_id.unique()[11], 
                                               similarity=cosineSimilarity))

Following Cosine equation we obtain: 0.9406954990098857 
 and using library we obtain:0.9406954990098856 
Cosine similarity: (0.9406954990098857, 0.9406954990098856)


### Issues to take into account
<ul>
<li>Pearson correlation is usually preferred over Euclidean distance, since it is invariant to the offset and scale of each user's ratings: it measures whether two users' ratings rise and fall together, not how close their absolute values are.</li>
<li>Cosine similarity is usually preferred when our data is binary/unary; i.e., [like vs. dislike] or [buy vs. not-buy].</li>
<li>In general, these definitions of similarity suffer when two users have very few items in common.</li>
</ul>
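
The first and last points can be illustrated with a short, self-contained sketch. The ratings below are made up for the example, and the helper names are hypothetical. With only two common items, the Pearson correlation can only be +1, -1 or undefined, whatever the actual ratings are, so it says very little about the users' real agreement.

```python
import math

def pearson(a, b):
    """Pearson correlation over the common items (0.0 if undefined)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = (math.sqrt(sum((x - mean_a) ** 2 for x in a))
           * math.sqrt(sum((y - mean_b) ** 2 for y in b)))
    return num / den if den else 0.0

def euclidean_sim(a, b):
    """1 / (1 + Euclidean distance) over the common items."""
    return 1.0 / (1.0 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

# Made-up ratings of two users who share only two movies.
generous = [5.0, 4.0]   # rates everything high
strict = [2.0, 1.0]     # same preference order, three points lower
opposite = [1.0, 4.0]   # reversed preference order

print(pearson(generous, strict))        # close to +1
print(pearson(generous, opposite))      # close to -1
print(euclidean_sim(generous, strict))  # low, despite the identical ordering
```
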

<div class="alert alert-success">
Question #2.-<br>
<span style="color:black">Modify the previous similarity function so that it only measures the similarity between two users if the number of movies that both users have seen is equal to or larger than a given threshold (parameter `minCommonItems`).
    <br />
    Otherwise, return 0.
</span></div>

In [43]:
def cosineSimilarity1(v1, v2):
    # Following the equation
    cos = dot(v1,v2)/(norm(v1)*norm(v2))
    return cos

# Question 2

# Returns a similarity score for two users
def similarityFunction2(myData, user1, user2, similarity=euclideanSimilarity, minCommonItems=10):
    # Get movies rated by user 1
    movies_user1 = myData[myData['user_id'] == user1]
    # Get movies rated by user 2
    movies_user2 = myData[myData['user_id'] == user2]

    # Find commonly rated films (merge works as an inner join on movie_id)
    rep = pd.merge(movies_user1, movies_user2, on='movie_id')

    # Measure the similarity only if both users share at least minCommonItems movies
    if len(rep) >= minCommonItems:
        return similarity(rep['rating_x'], rep['rating_y'])
    else:
        return 0
    
    
print("Euclidean similarity:", similarityFunction2(data,
                                               data.user_id.unique()[10], 
                                               data.user_id.unique()[11], 
                                               similarity=euclideanSimilarity))    
print("Pearson similarity:", similarityFunction2(data,
                                               data.user_id.unique()[10], 
                                               data.user_id.unique()[11], 
                                               similarity=pearsonSimilarity))  
print("Cosine similarity:", similarityFunction2(data,
                                               data.user_id.unique()[10], 
                                               data.user_id.unique()[11], 
                                               similarity=cosineSimilarity1))  

Euclidean similarity: 0.07263669959755374
Pearson similarity: 0.06317948789567646
Cosine similarity: 0.9406954990098857


Now, we already have a similarity function between users. But many others could be used instead.

<div class="alert alert-success">
Question #3.-<br>
<span style="color:black">Create a new similarity function that deals with small $P$ sets differently: we are going to weigh the similarity (previous definition) by the relative size of $P$ over $100$ items:
$$sim\_alt(a,b) = sim(a,b) * \frac{min(100,|P_{ab}|)}{100} $$
where $|P_{ab}|$ is the number of common items for users $a$ and $b$.
</span></div>

In [53]:
# Question 3

# Returns a weighted similarity score for two users
def weightedSimilarityFunction(myData, user1, user2, similarity=euclideanSimilarity):
    #Get movies rated by user 1
    movies_user1 = myData[myData['user_id'] == user1]
    #Get movies rated by user 2
    movies_user2 = myData[myData['user_id'] == user2]
    
    # Weigh the similarity by the relative size of P over 100 items
    rep = pd.merge(movies_user1, movies_user2, on='movie_id')
    P = len(rep)  # Number of common movies
    WeightSimilarity = min(100, P) / 100.0
    
    sim = similarity(rep['rating_x'], rep['rating_y'])* WeightSimilarity

    return sim
    
    
print("Euclidean distance-based similarity:", weightedSimilarityFunction(data, 
                                                                         data.user_id.unique()[10], 
                                                                         data.user_id.unique()[11]))

print("Pearson correlation-based similarity:", weightedSimilarityFunction(data, 
                                                                          data.user_id.unique()[10], 
                                                                          data.user_id.unique()[11], 
                                                                          similarity=pearsonSimilarity))

print("Cosine similarity:", weightedSimilarityFunction(data,
                                                       data.user_id.unique()[10], 
                                                       data.user_id.unique()[11], 
                                                       similarity=cosineSimilarity1))

Euclidean distance-based similarity: 0.05738299268206746
Pearson correlation-based similarity: 0.0499117954375844
Cosine similarity: 0.7431494442178097
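
As a quick sanity check on the printed values, assuming they were produced for the same pair of users as the unweighted scores shown earlier, each weighted score divided by its unweighted counterpart should give the weight $\min(100,|P_{ab}|)/100$:

$$\frac{0.0574}{0.0726} \approx \frac{0.0499}{0.0632} \approx \frac{0.7431}{0.9407} \approx 0.79 = \frac{\min(100,|P_{ab}|)}{100}$$

All three ratios agree, which is consistent with these two users having 79 movies in common.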


Up to this point, we have defined how to measure similarity between users. Now, how can we use this information to make recommendations?

## How do we generate recommendations from others' ratings?

We might reformulate this problem as: can we *estimate* the rating that a user $a$ would give to an item $p$?
Then, we would need to recommend to the user those items with the highest estimated rating.

We can estimate the rating $\hat{r}_{a,p}$ given by user $a$ to item $p$ as a weighted average of other users' ratings for item $p$ weighted by their similarity with user $a$:

$$\hat{r}_{a,p}= \frac{\sum_{b \in N}{sim(a,b) \cdot r_{b,p}}}{\sum_{b \in N}{sim(a,b)}}$$

where $N$ is the set of users that have rated item $p$.
<br><br>

### Example:
<br>
<table style="width:100%">
  <tr>
    <td>User</td>
    <td>$sim(a,b)$</td> 
    <td>$r_{b,p_1}$ for item $p_1$</td>
    <td>$r_{b,p_2}$ for item $p_2$</td>
    <td>$sim(a,b)\cdot r_{b,p_1}$</td>
    <td>$sim(a,b)\cdot r_{b,p_2}$</td>
    
  </tr>
  <tr>
    <td>b1</td>
    <td>0.99</td> 
    <td>3</td>
    <td>2.5</td>
    <td>2.97</td>
    <td>2.48</td>
    
  </tr>
  <tr>
    <td>b2</td>
    <td>0.38</td> 
    <td>3</td>
    <td>3</td>
    <td>1.14</td>
    <td>1.14</td>
  </tr>
  <tr>
    <td>b3</td>
    <td>0.89</td>
    <td>4.5</td>
    <td> - </td>
    <td>4.0</td>
    <td> - </td>
  </tr>
  <tr>
    <td>b4</td>
    <td>0.92</td>
    <td>3</td>
    <td>3</td>
    <td>2.76</td>
    <td>2.76</td>
  </tr>
  <tr style="border-top:1px solid black">
    <td>$\sum_{b \in N}{sim(a,b) \cdot r_{b,p}}$</td>
    <td></td> 
    <td></td>
    <td></td>
    <td>10.87</td>
    <td>6.38</td>
  </tr>
  <tr>
    <td>$\sum_{b \in N}{sim(a,b)}$</td>
    <td></td> 
    <td></td>
    <td></td>
    <td>3.18</td>
    <td>2.29</td>
  </tr>
  <tr>
  <td>$pred(a,p)$</td>
    <td></td> 
    <td></td>
    <td></td>
    <td>3.42</td>
    <td>2.78</td>
  </tr>
</table>
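
The table can be reproduced with a few lines of plain Python. This is only a sketch of the weighted-average formula above; the dictionary layout and the name `weighted_average` are choices made for this example.

```python
# Similarities sim(a, b) and ratings r_{b,p} copied from the example table.
# None marks a user who has not rated the item.
neighbours = {
    "b1": {"sim": 0.99, "p1": 3.0, "p2": 2.5},
    "b2": {"sim": 0.38, "p1": 3.0, "p2": 3.0},
    "b3": {"sim": 0.89, "p1": 4.5, "p2": None},
    "b4": {"sim": 0.92, "p1": 3.0, "p2": 3.0},
}

def weighted_average(neighbours, item):
    """Similarity-weighted mean of the neighbours' ratings for one item."""
    num = den = 0.0
    for row in neighbours.values():
        rating = row[item]
        if rating is None:   # only users that rated the item belong to N
            continue
        num += row["sim"] * rating
        den += row["sim"]
    return num / den if den else None

pred_p1 = weighted_average(neighbours, "p1")
pred_p2 = weighted_average(neighbours, "p2")
print(round(pred_p1, 2), round(pred_p2, 2))
```

Note that user b3 does not contribute to the denominator for $p_2$, which is why the two columns are divided by different sums (3.18 and 2.29).
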

Let's put these ideas into practice. We want to estimate the rating that user `u` would give to movie `m`:

In [84]:
u = 7
m = 300

# Let's find other users' ratings for movie m
ratings_for_m = data[data['movie_id'] == m]

sim_with_u = {}
users_rated_m = list(ratings_for_m['user_id'])

# Let's reduce the size of the dataset: we only are interested in all the ratings for movies also rated by u
mrbu = data[data['user_id'] == u][['movie_id']] # movies rated by u
data_u = pd.merge(data, mrbu, on='movie_id')      # all the ratings for movies rated by u

num = 0
den = 0
for ous in users_rated_m:
    # rating given by the other user to movie m (a single value)
    rating_ous = float(ratings_for_m.loc[ratings_for_m['user_id'] == ous, 'rating'].iloc[0])
    if ous == u: 
        print("Warning: user", u, "has already rated movie", m, "with value:", rating_ous)
        continue 

    sim = similarityFunction(data_u, u, ous) # calculate similarity
    num += sim * rating_ous
    den += sim

print("The estimated rating for user",u,"and movie",m,"is:", num/den)

The estimated rating for user 7 and movie 300 is: 3.693789893715973


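Try changing the value of `u` (the cell above uses `u = 7`) and see what happens. If the chosen user has already rated movie `m`, the loop prints a warning and leaves that rating out of the estimate.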

But, we need to do all this in a general way considering any possible pair (user,movie) and thus measuring similarity between all pairs of users.

Let's build a class that learns from a dataset (creates a similarity matrix for users) and then provides rating estimations for any given pair (user,movie).

<div class="alert alert-success">
Question #4.-<br>
<span style="color:black">Implement the previous calculations in a general recommender system class. To do so, complete the following 5 TODOs.
</span></div>

In [171]:
# Question 4
class CollaborativeFiltering:
    """ Collaborative filtering using a custom sim(u,u'). """
    
    def __init__(self, similarity=similarityFunction):
        """ Constructor """
        self.sim_metric = similarity 
        self.df = None
        self.sim = None # similarity matrix for users (symmetric, stored as a dict of dicts)

    def getSimilarityMatrix(self):
        return copy.deepcopy(self.sim)

    def setSimilarityMatrix(self, sim):
        self.sim = sim
        
    def fit(self, myData):
        """ Prepare data structures for estimation. Compute a similarity matrix among users """
        self.df = myData
        if self.sim is None:
            allUsers = list(self.df['user_id'].unique())
            self.sim = {key: {} for key in allUsers}

            for p1id in np.arange(len(allUsers)-1):
                user1 = allUsers[p1id]
                # TODO 4.1: the 'movie_id' of all movies rated by p1
                mrbp1 = self.df[self.df['user_id'] == user1][['movie_id']]
                data_p1 = pd.merge(self.df, mrbp1, on='movie_id')          # all the ratings for movies rated by p1
                for p2id in np.arange(p1id+1, len(allUsers)):
                    user2 = allUsers[p2id]
                    # TODO 4.2: similarity between both users
                    sim = self.sim_metric(data_p1, user1, user2)
                    self.sim[user1][user2] = sim
                    self.sim[user2][user1] = sim
                
    def predict(self, user_id, movie_id):
        """ Estimate the rating that 'user_id' would give to 'movie_id' """
        rating_num = 0.0
        rating_den = 0.0
        # TODO 4.3: user_id is known if it appears in the similarity matrix
        user_exists_in_mat = user_id in self.sim
        df_ratings_for_movie = self.df[self.df['movie_id'] == movie_id] # all the ratings for movie_id
        if user_exists_in_mat: 
            # iterate over all the users that have rated movie_id, with their rating
            for other_user, rating in zip(df_ratings_for_movie['user_id'], df_ratings_for_movie['rating']):
                if user_id == other_user: 
                    print("Warning: user", user_id, "has already rated movie", movie_id, "with value:", float(rating))
                    continue 
                
                sim = self.sim[user_id].get(other_user, 0.0) # 0 if this pair was never compared
                # TODO 4.4: contribution of the current other user to the numerator
                rating_num += sim * float(rating)
                # TODO 4.5: contribution of the current other user to the denominator
                rating_den += sim

        if rating_den == 0: # if we couldn't make a regular estimation:
            if df_ratings_for_movie.rating.mean() > 0:
                # return the unweighted mean movie rating if there are ratings available for movie_id
                return df_ratings_for_movie.rating.mean()
            elif user_exists_in_mat:
                # or return the mean user rating if there is no previous rating for that movie
                return self.df.rating[self.df['user_id']==user_id].mean()
            else:
                # or return a constant value (mid-scale rating) if no information at all is available
                return 3

        return rating_num/rating_den

Now, let's test this CF class!

First of all, we learn the similarity matrix (this might take a while!):

In [172]:
my_recsys = CollaborativeFiltering()
my_recsys.fit(data)

And, now, we can estimate the rating that user 'user_id' would give to movie 'movie_id':

In [173]:
u=1
m=300
est_rating = my_recsys.predict(user_id=u, movie_id=m) # Estimate the rating that user 'u' would give to movie 'm'
print("The estimated rating for user",u,"and movie",m,"is:", est_rating)



## Can the previous predictive function be improved?

### 1) Normalization: Predictions scaled to the user domain

Users tend to rate differently: some users' average is high, others' is low. We can try to adapt our prediction to the user's mean:<br>

$$\hat{r}^N_{a,p} = \bar{r_a} + \frac{\sum_{b \in N}{sim(a,b)\cdot (r_{b,p}-\bar{r_b})}}{\sum_{b \in N}{sim(a,b)}}$$


where $\bar{r_b}$ is the mean rating of user $b$.<br>


This prediction function was used in the original Netflix system.

<br><br>
### Example:
Prediction for user $a$ with mean rating $\bar{r_a} = 3.5$
<table style="width:100%">
  <tr>
    <td>User</td>
    <td>$sim(a,b)$</td> 
    <td>Mean rating: $\bar{r_b}$</td>
    <td>$r_{b,p_1}$ for item $p_1$</td>
    <td>$sim(a,b)*(r_{b,p_1}-\bar{r_b})$</td>

    
  </tr>
  <tr>
    <td>b1</td>
    <td>0.99</td> 
    <td>4.3</td> 
    <td>3</td>
    <td>-1.29</td>

    
  </tr>
  <tr>
    <td>b2</td>
    <td>0.38</td> 
    <td>2.73</td> 
    <td>3</td>
    <td>0.10</td>

  </tr>
  <tr>
    <td>b3</td>
    <td>0.89</td>
    <td>3.12</td>  
    <td>4.5</td>
    <td>1.23</td>

  </tr>
  <tr>
    <td>b4</td>
    <td>0.92</td>
    <td>3.98</td>  
    <td>3</td>
    <td>-0.90</td>

  </tr>
  <tr style="border-top:1px solid black">
    <td>$\sum_{b \in N}{sim(a,b)\cdot (r_{b,p}-\bar{r_b})}$</td>
    <td></td> 
    <td></td>
    <td></td>
    <td>-0.86</td>

  </tr>
  <tr>
    <td>$\sum_{b \in N}{sim(a,b)}$</td>
    <td></td> 
    <td></td>
    <td></td>
    <td>3.18</td>

  </tr>
  <tr>
  <td>$pred(a,p)$</td>
    <td></td> 
    <td></td>
    <td></td>
    <td>3.23</td>

  </tr>
</table>
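
The same example can be checked numerically. The sketch below plugs the table values into the normalized prediction formula; the tuple layout and the name `normalized_prediction` are assumptions made for this illustration.

```python
# Mean rating of the target user a, as stated in the example.
r_a_mean = 3.5

# One tuple per neighbour b: (sim(a,b), mean rating of b, rating of b for p1),
# copied from the example table.
neighbours = [
    (0.99, 4.30, 3.0),  # b1
    (0.38, 2.73, 3.0),  # b2
    (0.89, 3.12, 4.5),  # b3
    (0.92, 3.98, 3.0),  # b4
]

def normalized_prediction(user_mean, neighbours):
    """User mean plus the similarity-weighted mean deviation of the neighbours."""
    num = sum(sim * (rating - mean_b) for sim, mean_b, rating in neighbours)
    den = sum(sim for sim, _, _ in neighbours)
    if den == 0:
        return user_mean  # no usable neighbour: fall back to the user's mean
    return user_mean + num / den

pred = normalized_prediction(r_a_mean, neighbours)
print(round(pred, 2))
```

Three of the four neighbours gave $p_1$ a rating of 3, which looks neutral in absolute terms. For b1 and b4, however, a 3 is well below their own mean, so the prediction for user $a$ ends up below $\bar{r_a}$.
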


<div class="alert alert-success">
Question #5.-<br>
<span style="color:black">Create a new CF RecSys that uses normalization in the predictions. Name it `NormalizedCollaborativeFiltering`<br/>
    
    Tip: Copy the structure of the previous class, `CollaborativeFiltering`. Use an attribute `mean_ratings` to store the mean value given by each user, and estimate it during fitting. Consider the mean ratings as explained above during the prediction.
</span></div>

In [181]:
# Question 5
class NormalizedCollaborativeFiltering:
    """ Collaborative filtering using a custom sim(u,u') normalized by user's mean rating. """
    
    def __init__(self, similarity=similarityFunction):
        """ Constructor """
        self.sim_metric = similarity
        self.df = None
        self.sim = None # similarity matrix for users (symmetric, stored as a dict of dicts)
        self.mean_ratings = None # users' mean ratings

    def getSimilarityMatrix(self):
        return copy.deepcopy(self.sim)

    def setSimilarityMatrix(self, sim):
        self.sim = sim

    def fit(self, myData):
        """ Prepare data structures for estimation. """
        self.df = myData
        if self.sim is None:
            self.createSimMatrix()
        if self.mean_ratings is None:
            self.calculateMeanRatings()
        
    def createSimMatrix(self):
        """ Compute a similarity matrix among users """
        allUsers = list(self.df['user_id'].unique())
        self.sim = {key: {} for key in allUsers}

        for p1id in np.arange(len(allUsers)-1):
            user1 = allUsers[p1id]
            mrbp1 = self.df[self.df['user_id'] == user1][['movie_id']] # movies rated by p1
            data_p1 = pd.merge(self.df, mrbp1, on='movie_id')          # all the ratings for movies rated by p1
            for p2id in np.arange(p1id+1, len(allUsers)):
                user2 = allUsers[p2id]
                sim = self.sim_metric(data_p1, user1, user2)
                self.sim[user1][user2] = sim
                self.sim[user2][user1] = sim

    def calculateMeanRatings(self):
        """ Store the mean rating given by each user """
        self.mean_ratings = self.df.groupby('user_id')['rating'].mean().to_dict()

    def predict(self, user_id, movie_id):
        """ Estimate the rating that 'user_id' would give to 'movie_id' """
        rating_num = 0.0
        rating_den = 0.0
        # user_id is known if it is in the similarity matrix and has a mean rating
        user_exists_in_mat = user_id in self.sim and user_id in self.mean_ratings
        df_ratings_for_movie = self.df[self.df['movie_id'] == movie_id] # all the ratings for movie_id
        if user_exists_in_mat: 
            # iterate over all the users that have rated movie_id, with their rating
            for other_user, rating in zip(df_ratings_for_movie['user_id'], df_ratings_for_movie['rating']):
                if user_id == other_user: 
                    print("Warning: user", user_id, "has already rated movie", movie_id, "with value:", float(rating))
                    continue 

                sim = self.sim[user_id].get(other_user, 0.0) # 0 if this pair was never compared
                # numerator: deviation of the other user's rating from their own mean
                rating_num += sim * (float(rating) - self.mean_ratings[other_user])
                rating_den += sim

        if rating_den == 0: # if we couldn't make a regular estimation:
            if df_ratings_for_movie.rating.mean() > 0:
                # return the unweighted mean movie rating if there are ratings available for movie_id
                return df_ratings_for_movie.rating.mean()
            elif user_exists_in_mat:
                # or return the mean user rating if there is no previous rating for that movie
                return self.mean_ratings[user_id]
            else:
                # or return a constant value (mid-scale rating) if no information at all is available
                return 3

        # scale the weighted mean deviation back to the domain of user_id
        return self.mean_ratings[user_id] + rating_num/rating_den


Let's test this new CF class:

In [183]:
my_norm_recsys = NormalizedCollaborativeFiltering()
my_norm_recsys.setSimilarityMatrix(my_recsys.getSimilarityMatrix()) # to save time, let's reuse the sim matrix
my_norm_recsys.fit(data) # thus, this only will calculate the mean rating of the users

And, now, we can estimate the rating that user `u` would give to movie `m` with this new function:

In [None]:
u=1
m=300
est_rating = my_norm_recsys.predict(user_id=u, movie_id=m) # Estimate the rating that user 'u' would give to movie 'm'
print("The estimated rating for user",u,"and movie",m,"is:", est_rating)


### 2) Not all the neighbor ratings might be equal
Agreement on commonly liked items is not as important as agreement on controversial items. We could weigh user similarity according to the rating variance.

### 3) Value of number of co-rated items
Reduce the similarity between users when the number of co-rated items is low or discard those users with a small number of co-rated items.

### 4) Case amplification
Increase the weight of those users that are very similar to each other (similarity close to 1).
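
One common way to formalize case amplification in the CF literature is to raise each similarity to a power $\rho > 1$ while keeping its sign: $sim'(a,b) = sim(a,b) \cdot |sim(a,b)|^{\rho-1}$. The sketch below assumes this formulation; the function name `amplify` and the default value $\rho = 2.5$ are illustrative choices, not something fixed by this notebook.

```python
def amplify(sim, rho=2.5):
    """Case amplification: sim * |sim|^(rho - 1), which preserves the sign of sim."""
    return sim * abs(sim) ** (rho - 1)

# Similarities close to 1 are barely changed, while lower ones shrink strongly,
# so the most similar users dominate the weighted average even more.
for s in (1.0, 0.9, 0.5, 0.1):
    print(s, "->", round(amplify(s), 4))
```

With these values, a neighbour with similarity 0.9 initially weighs 1.8 times as much as one with similarity 0.5; after amplification the ratio grows to more than 4.
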

### 5) Neighborhood selection
Only a subset of similar users are used to make recommendations. Dissimilar users are discarded.
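
A minimal sketch of the selection step is shown below, assuming the similarity matrix is stored as a nested dictionary `{user: {other_user: similarity}}`. The function name `n_most_similar_users`, the optional `candidates` argument, and the toy matrix are hypothetical and only illustrate the idea.

```python
def n_most_similar_users(sim_matrix, user_id, n=10, candidates=None):
    """Return the n (user, similarity) pairs most similar to user_id.

    If candidates is given, only those users are considered (for example,
    the users that have rated the movie we want to predict).
    """
    neighbours = sim_matrix.get(user_id, {})
    if candidates is not None:
        neighbours = {u: s for u, s in neighbours.items() if u in candidates}
    ranked = sorted(neighbours.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

# Toy symmetric similarity matrix for four users.
toy_sim = {
    1: {2: 0.9, 3: 0.1, 4: 0.5},
    2: {1: 0.9, 3: 0.3, 4: 0.2},
    3: {1: 0.1, 2: 0.3, 4: 0.7},
    4: {1: 0.5, 2: 0.2, 3: 0.7},
}

top2 = n_most_similar_users(toy_sim, 1, n=2)
top_among_raters = n_most_similar_users(toy_sim, 1, n=2, candidates={3, 4})
print(top2)
print(top_among_raters)
```

Restricting the candidates to the users who rated the target movie before taking the top `n` avoids ending up with a neighborhood in which nobody has rated that movie.
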

<div class="alert alert-success">
Question #6.-<br>
<span style="color:black">Create a new RecSys that estimates the score of movie `movie_id` as given by user `user_id` only using the subset of the `N` most similar users to `user_id`. 

    Tip: Copy the structure of the first class, `CollaborativeFiltering`. Use an attribute `N` to store the no. of most similar users to consider. Create a function that returns the N most similar users for a given `user_id`. Consider only these most similar users to make the prediction.
</span></div>

In [None]:
import operator
    
class NNCollaborativeFiltering:
    """ Collaborative filtering using a custom sim(u,u') and considering only the most similar N users """
    
    def __init__(self, N=10, similarity=similarityFunction):
        """ Constructor """
        self.sim_metric = similarity
        self.N = N
        self.df = None
        self.sim = None # similarity matrix for users (symmetric, stored as a dict of dicts)

    def getSimilarityMatrix(self):
        return copy.deepcopy(self.sim)

    def setSimilarityMatrix(self, sim):
        self.sim = sim

    #### TODO: work here. Copy from previous implementation, first; add new functionalities later on.

    def get_N_most_similar_users(self, user_id):
        print("get_N_most_similar_users: not implemented yet")


Let's test this new CF class:

In [None]:
my_nn_recsys = NNCollaborativeFiltering()
my_nn_recsys.setSimilarityMatrix(my_recsys.getSimilarityMatrix()) # to save time, let's reuse the sim matrix
my_nn_recsys.fit(data) # thus, this only will set the dataframe

And, now, we can estimate the rating that user `u` would give to movie `m` with this new function:

In [None]:
u=1
m=300
est_rating = my_nn_recsys.predict(user_id=u, movie_id=m) # Estimate the rating that user 'u' would give to movie 'm'
print("The estimated rating for user",u,"and movie",m,"is:", est_rating)