

# $\color{green}{\text{ 10.4. Neighborhood-Based Collaborative Filtering }}$

-------



### $\color{green}{\text{Agenda : }}$


 - What is collaborative filtering?
 - Cosine Similarity
 - User-based filtering
 - Creating recommendations
 - Adding new users

------

* ##   What is Collaborative Filtering?

In the context of recommendation engines, for example, it is the process of **inferring the taste of one user from what is known about the tastes of other users**.

**Types of collaborative filtering:**

* _neighbourhood-based_ (also, _memory-based_):
    * **user-based**: looks for similarities in ratings between the target user and other users; _user-item matrix_
    * **item-based**: looks for similarities between items, based on the ratings users have given them; _item-item matrix_
* _model-based_:
    * e.g. NMF

-----

* ## What do we mean by *neighbourhood*?

How do we measure the **similarity** or **proximity** of two objects? There are many measures (see the references below); today we focus on:

### Cosine Similarity

- Measures the **angle** between two vectors: **orientation**, not magnitude.
- Values range between `-1` and `+1`:
    - `-1`: vectors point in opposite directions.
    - `+1`: vectors point in the same direction.
    - `0`: vectors are perpendicular.

$$\cos(\mathbf{x}, \mathbf{y}) = \dfrac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \lVert \mathbf{y} \rVert} = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

Numerator: dot product of the vectors

Denominator: Euclidean norms (=lengths) of the vectors multiplied
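
As a quick worked example (the two vectors are made up for illustration), take $\mathbf{x} = (3, 4)$ and $\mathbf{y} = (4, 3)$:

$$\cos(\mathbf{x}, \mathbf{y}) = \dfrac{3 \cdot 4 + 4 \cdot 3}{\sqrt{3^2 + 4^2}\,\sqrt{4^2 + 3^2}} = \dfrac{24}{5 \cdot 5} = 0.96$$

Both vectors have length 5 and point in nearly the same direction, so the similarity is close to `+1`.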

An additional option is the **adjusted** or **centred** cosine similarity, which subtracts each user's average rating from all of their ratings, so that every user's ratings have mean 0.
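
A minimal sketch of the centred variant, assuming a small made-up ratings table in which `NaN` marks an unrated movie. The names `R` and `centred_cos_sim` and the three users are hypothetical, and filling the remaining `NaN`s with 0 after centring is one possible choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies, NaN = not rated.
R = pd.DataFrame(
    {
        "Movie A": [5.0, 3.0, 1.0],
        "Movie B": [4.0, 2.0, np.nan],
        "Movie C": [np.nan, 1.0, 5.0],
    },
    index=["ann", "bob", "cat"],
)


def centred_cos_sim(ratings, user_a, user_b):
    """Cosine similarity after subtracting each user's own mean rating."""
    # mean(axis=1) skips NaN, so only rated movies count towards a user's mean
    centred = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0)
    a = centred.loc[user_a].to_numpy()
    b = centred.loc[user_b].to_numpy()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # a user who gave every movie the same rating has a zero vector after centring
    return float(a @ b / denom) if denom > 0 else 0.0


sim_ann_bob = centred_cos_sim(R, "ann", "bob")  # same preference order
sim_ann_cat = centred_cos_sim(R, "ann", "cat")  # opposite preference order
```

Here `ann` rates generously and `bob` rates strictly, but both prefer Movie A to Movie B. After centring, that shared preference gives a positive similarity, while `ann` and `cat` come out negative.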

### User-vectors

We can think of the rows of our ratings matrix as vectors, one for each user. The cosine similarity between two of these user vectors then measures how "close" or "far apart" the two users' tastes are.

-----

## Example with your movie ratings

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

We have to deal with the `NaN`s. Remember we have some options available:

- global average
- 0
- average of each user
- average of the movie
- ...

For cosine similarity we fill the `NaN`s with zero, because imputed means would make users look similar even when they have rated no movies in common.
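
The effect can be sketched with two hypothetical users who have rated no movie in common. The table `R` and its values are invented for illustration; `fillna` is the standard pandas method.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix: rows are users, columns are movies, NaN = not rated.
R = pd.DataFrame(
    {"Movie A": [5.0, np.nan], "Movie B": [np.nan, 4.0], "Movie C": [3.0, np.nan]},
    index=["ann", "bob"],
)

# Option 1: unrated -> 0
R_zero = R.fillna(0)

# Option 2: unrated -> the user's own average rating
R_user_mean = R.apply(lambda row: row.fillna(row.mean()), axis=1)


def cos_sim(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


sim_zero = cos_sim(R_zero.loc["ann"].to_numpy(), R_zero.loc["bob"].to_numpy())
sim_mean = cos_sim(R_user_mean.loc["ann"].to_numpy(), R_user_mean.loc["bob"].to_numpy())
```

With zeros, the two users have similarity 0 because they share no rated movie. With user means, the imputed values alone push the similarity close to 1.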

In [None]:
# impute the nan with 0:


### Calculate Cosine-similarities between users

- Numerator: dot products of the vectors
- Denominator: Euclidean norm of the vectors multiplied

In [None]:
def cos_sim(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    num = np.dot(vec1, vec2)
    denom = np.sqrt(np.dot(vec1, vec1)) * np.sqrt(np.dot(vec2, vec2))
    return num / denom

We can take two user vectors and compare them:
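
As a minimal sketch, assuming three made-up users who rated the same four movies (0 marks an unrated movie). The vectors `user_a`, `user_b` and `user_c` are hypothetical; `cos_sim` repeats the function above so that the block runs on its own.

```python
import numpy as np


def cos_sim(vec1, vec2):
    """Cosine similarity between two 1-D vectors."""
    num = np.dot(vec1, vec2)
    denom = np.sqrt(np.dot(vec1, vec1)) * np.sqrt(np.dot(vec2, vec2))
    return num / denom


# Hypothetical ratings for the same four movies; 0 = not rated.
user_a = np.array([5, 4, 0, 1])
user_b = np.array([4, 5, 0, 2])  # similar taste to user_a
user_c = np.array([1, 0, 5, 4])  # quite different taste

sim_ab = cos_sim(user_a, user_b)
sim_ac = cos_sim(user_a, user_c)
```

`sim_ab` comes out much higher than `sim_ac`, which matches the intuition that `user_a` and `user_b` like the same movies.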

In [None]:
# define two user vectors



In [None]:
# calculate cosine similarity



-----

### User-based filtering 
"*Users similar to you also enjoyed X and Y*"

#### Create a table with pairwise user/user cosine-similarities
* We can write a function that takes the ratings matrix as an argument and, in a nested loop, computes the cosine similarity for every pair of users.
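
One way such a function could look, as a sketch. The name `user_similarity_table` and the toy matrix `R` are assumptions for illustration, and the sketch assumes every user has rated at least one movie (an all-zero row would lead to a division by zero).

```python
import numpy as np
import pandas as pd


def cos_sim(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


def user_similarity_table(ratings):
    """Pairwise cosine similarities between the rows (users) of `ratings`."""
    users = ratings.index
    table = pd.DataFrame(np.zeros((len(users), len(users))), index=users, columns=users)
    for user_a in users:
        for user_b in users:
            table.loc[user_a, user_b] = cos_sim(
                ratings.loc[user_a].to_numpy(), ratings.loc[user_b].to_numpy()
            )
    return table


# Hypothetical ratings: rows are users, columns are movies, 0 = not rated.
R = pd.DataFrame(
    [[5, 4, 0, 1], [4, 5, 0, 2], [1, 0, 5, 4]],
    index=["ann", "bob", "cat"],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)
similarity_table = user_similarity_table(R)
```

The table is symmetric with ones on the diagonal, so the loop does roughly twice the necessary work. That is acceptable for a handful of users but gets slow for large matrices.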

In [None]:
# We can display the table as a heatmap:


### One-liner from  `sklearn`

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Returns numpy array:


In [None]:
# We can turn this into a dataframe:
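
A minimal sketch of both steps, assuming a small made-up ratings matrix `R` with users as rows. The DataFrame is named `cosine_similarity_table` to match the heatmap cell that follows.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are movies, 0 = not rated.
R = pd.DataFrame(
    [[5, 4, 0, 1], [4, 5, 0, 2], [1, 0, 5, 4]],
    index=["ann", "bob", "cat"],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# cosine_similarity compares every row with every other row
# and returns a numpy array of shape (n_users, n_users)
sim_array = cosine_similarity(R)

# wrap the array in a DataFrame so rows and columns are labelled by user
cosine_similarity_table = pd.DataFrame(sim_array, index=R.index, columns=R.index)
```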


In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(
    cosine_similarity(R),
    annot=True,
    linewidths=3,
    cmap=sns.cubehelix_palette(start=2, rot=0, dark=0, light=.95, reverse=True, as_cmap=True),
    xticklabels=cosine_similarity_table.index,
    yticklabels=cosine_similarity_table.index,
);

------

## Coffee break 15min! ☕

-----

### How do we create recommendations based on this result?

**Algorithm**

For a chosen active user:
- create a list of the movies the active user has not seen
- get the nearest neighbours (the most similar users)
- for each unseen movie:
    - get the neighbours who have rated/seen the movie
    - predict the rating as the (weighted) average of those neighbours' ratings
- collect all unseen movies, sorted by predicted rating, and return the top X as recommendations

(This can all be done by writing for-loops and comparing values.)
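
The first two steps could be sketched like this, assuming a made-up ratings matrix in which 0 marks an unrated movie. The data, the choice of `k = 2` and the numpy-based similarity computation are assumptions for illustration; the variable names `R_t`, `active_user`, `unseen_movies` and `neighbours` follow the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies, 0 = not rated.
R = pd.DataFrame(
    [[5, 4, 0, 1, 0], [4, 5, 3, 2, 0], [1, 0, 5, 4, 2], [5, 5, 0, 0, 4]],
    index=["ann", "bob", "cat", "dan"],
    columns=["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
)

# user/user cosine similarities: normalise each row, then take dot products
values = R.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
cosine_similarity_table = pd.DataFrame(unit @ unit.T, index=R.index, columns=R.index)

R_t = R.T  # transposed: movies as rows, users as columns
active_user = "ann"

# movies the active user has not rated yet
unseen_movies = list(R_t.loc[R_t[active_user] == 0].index)

# the k most similar users, excluding the active user
k = 2
neighbours = list(
    cosine_similarity_table[active_user]
    .drop(active_user)
    .sort_values(ascending=False)
    .index[:k]
)
```

With these toy ratings, `ann` has not seen Movie C and Movie E, and her two nearest neighbours are `bob` and `dan`.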

In [None]:
# use the transposed version of R


In [None]:
# choose an active user


In [None]:
# create a list of unseen movies for this user


In [None]:
# create a list of the top 3 most similar users (nearest neighbours)


In [None]:
# create the recommendation (predicted rating for each unseen movie)
predicted_ratings_movies = []

for movie in unseen_movies:

    # users who have rated this movie (unrated entries were filled with 0)
    people_who_have_seen_the_movie = list(R_t.columns[R_t.loc[movie] > 0])

    num = 0
    den = 0
    for user in neighbours:
        # only neighbours who have seen the movie contribute
        if user in people_who_have_seen_the_movie:
            rating = R_t.loc[movie, user]
            similarity = cosine_similarity_table.loc[active_user, user]

            # similarity-weighted average:
            # sum(ratings * similarities) / sum(similarities)
            # (the unweighted alternative is sum(ratings) / number of users)
            num = num + rating * similarity
            den = den + similarity

    # skip movies that none of the neighbours has rated (avoids dividing by zero)
    if den > 0:
        predicted_ratings = num / den
        predicted_ratings_movies.append([predicted_ratings, movie])
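
The resulting list can then be turned into a ranked table. A minimal sketch, assuming `predicted_ratings_movies` holds `[predicted rating, movie]` pairs as in the loop above; the three entries and the column names are invented for illustration.

```python
import pandas as pd

# Hypothetical output of the prediction loop: [predicted rating, movie] pairs.
predicted_ratings_movies = [[3.2, "Movie C"], [4.5, "Movie E"], [2.1, "Movie F"]]

pred = (
    pd.DataFrame(predicted_ratings_movies, columns=["predicted_rating", "movie"])
    .sort_values("predicted_rating", ascending=False)
    .reset_index(drop=True)
)

# the top 2 movies are the recommendations
top_n = pred.head(2)
```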

In [None]:
# create df pred


### What happens if a new user joins?

In [None]:
# initialize new user 


In [None]:
# fill in the new user's ratings; None = not rated yet
ratings = {'Forrest Gump': None,
           'Shawshank Redemption': None,
           'Matrix': None,
           'Star Wars: Episode IV': None,
           'Pulp Fiction': None,
           'Lord of The Rings Trilogy': None}

In [None]:
# new user dataframe
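
One possible way to handle a new user, sketched under the assumption that the existing ratings live in a DataFrame `R` with users as rows. The existing users, all rating values and the label `new_user` are made up. The new user's ratings are aligned to the columns of `R`, missing movies are filled with 0, and the similarities to all existing users are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical existing ratings: rows are users, columns are movies, 0 = not rated.
R = pd.DataFrame(
    [[5, 4, 0, 1], [1, 0, 5, 4]],
    index=["ann", "cat"],
    columns=["Forrest Gump", "Matrix", "Pulp Fiction", "Star Wars: Episode IV"],
)

# Hypothetical ratings of the new user (only two movies rated so far).
new_ratings = {"Forrest Gump": 5, "Matrix": 5}

# one-row DataFrame with the same columns as R; unrated movies become 0
new_user = (
    pd.DataFrame(new_ratings, index=["new_user"])
    .reindex(columns=R.columns)
    .fillna(0)
)

# append the new user to the ratings matrix
R_new = pd.concat([R, new_user])


def cos_sim(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


# similarity of the new user to every existing user
new_vector = new_user.loc["new_user"].to_numpy()
similarities = R.apply(lambda row: cos_sim(row.to_numpy(), new_vector), axis=1)
most_similar = similarities.idxmax()
```

A new user who has not rated anything yet has an all-zero vector, so the cosine similarity is undefined. This is the cold start problem listed below.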



#### Advantages of Neighbourhood-Based Approaches

* Fast
* Works for huge datasets
* No domain knowledge necessary

#### Disadvantages

* Hard to include data other than ratings
* Sparsity
* Cold start problem

### Another good measure of similarity

Also check out the adjusted cosine similarity, which normalizes each user's ratings and so accounts for users who rate consistently low ("grumpy") or consistently high ("happy").

------

### Bonus: Item based filtering 
"*Because you watched X, you may also enjoy Y*"

Item-based filtering can be done in a similar way - see [course material](https://spiced.space/costmary-function/ds-course/chapters/project_movie_recommender/neighborhood_based_cf.html)
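
A minimal sketch of the idea, assuming a small made-up ratings matrix with users as rows. Transposing it turns every movie into a vector of user ratings, and the same cosine similarity then compares movies instead of users. The data and the names `item_similarity_table` and `similar_movies` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies, 0 = not rated.
R = pd.DataFrame(
    [[5, 4, 0, 1], [4, 5, 0, 2], [1, 0, 5, 4]],
    index=["ann", "bob", "cat"],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# transpose: one row per movie, one column per user
item_vectors = R.T.to_numpy(dtype=float)

# item/item cosine similarities: normalise each row, then take dot products
unit = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
item_similarity_table = pd.DataFrame(unit @ unit.T, index=R.columns, columns=R.columns)

# "Because you watched Movie A ...": rank the other movies by similarity
watched = "Movie A"
similar_movies = (
    item_similarity_table[watched].drop(watched).sort_values(ascending=False)
)
```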

------

## Next steps:

#### Neighborhood based recommender function

- Collect different example queries for "typical" users (e.g. a horror movie buff, a Disney fan) and try out the algorithm
- Set the number of neighbors to a very high or low number. What happens to the recommendations?
- Implement a recommender function that recommends movies to a new user based on the NearestNeighbor model!


#### ⭐ Further Experiments:
- Try out *other distance metrics*: Convert the rating matrix into a *boolean matrix* (rated:1 vs not-rated:0) and use the *Jaccard similarity*.
- Use the adjusted cosine similarity. Does it make a difference?
- Try out another method for calculating the score: use a *weighted* (weights = distances) sum or average. 
- Find similar *movies*! Use the method to find and recommend similar movies! Hint: Run the model on the *transposed* user item rating matrix.
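
As a starting point for the first experiment, here is a sketch of the Jaccard similarity on a boolean rated/not-rated matrix. The ratings and the helper `jaccard_sim` are made up for illustration. Note that `scipy.spatial.distance.jaccard` returns a dissimilarity (1 minus this value), not a similarity.

```python
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies, 0 = not rated.
R = pd.DataFrame(
    [[5, 4, 0, 1], [4, 5, 0, 2], [1, 0, 5, 4]],
    index=["ann", "bob", "cat"],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# boolean matrix: True where a user has rated the movie
rated = R > 0


def jaccard_sim(a, b):
    """Size of the intersection divided by the size of the union."""
    union = (a | b).sum()
    return (a & b).sum() / union if union > 0 else 0.0


sim_ann_bob = jaccard_sim(rated.loc["ann"], rated.loc["bob"])
sim_ann_cat = jaccard_sim(rated.loc["ann"], rated.loc["cat"])
```

`ann` and `bob` have rated exactly the same movies, so their Jaccard similarity is 1 regardless of the ratings themselves. The measure only looks at which movies were rated, not how.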

## References / Further reading:

- [A guide to distance measures from `SciPy`](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html)
- [NMF and cosine similarity](https://colab.research.google.com/github/ML-Challenge/week4-unsupervised-learning/blob/master/L4.Discovering%20interpretable%20features.ipynb#scrollTo=hxYA_Dasi1tp) notebook on Google Colab