# Neighborhood-Based Collaborative Filtering

Imagine you have to build a recommender. What approach could you take for recommending movies to your users?

* Majority vote (potentially weighting critics and regular users differently)
* Based on genres (e.g. ask users for their favorite genre and primarily recommend movies from that genre)
* Based on user categories / demographics
* Don't recommend the highest scoring result, but allow for serendipity/discovery

### What is collaborative filtering?

In the context of recommendation engines, for example, collaborative filtering uses the (explicit or implicit) tastes of other users (hence _collaborative_) to infer the taste of the target user. This is in contrast to _content-based filtering_, e.g. recommending movies based on someone's favorite genre, gender, age, or location.

### Types of collaborative filtering

* _Neighborhood-based_ (also, memory-based):
    * **user-based**: looks for similarities in ratings between target user and other users; _user-item matrix_
    * **item-based**: looks for similarities in items target user has rated compared to items other users have rated; _item-item matrix_
* _Model-based_:
    * e.g. NMF, LDA, SVD
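
As a rough illustration of the two neighborhood-based variants, the sketch below compares rows (users) and columns (movies) of a tiny ratings matrix. The matrix values and the helper name `cosine_matrix` are made up for this example, and cosine similarity is just one possible choice of measure.

```python
import numpy as np

# Hypothetical toy ratings: rows = users, columns = movies (values invented)
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
])

def cosine_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / norms          # scale every row to unit length
    return unit @ unit.T      # dot products of unit vectors = cosines

user_sim = cosine_matrix(ratings)    # user-based: compare users (rows)
item_sim = cosine_matrix(ratings.T)  # item-based: compare movies (columns)
```

In this toy data the first two users rate alike, so `user_sim[0, 1]` comes out larger than `user_sim[0, 2]`.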

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
df = pd.read_csv('movie_ratings_vv.csv', index_col=1, header=1)

In [None]:
df.drop(['Unnamed: 0', 'Unnamed: 12', 'knows ...'], axis=1, inplace=True)

In [None]:
df.drop(['avg', 'votes'], inplace=True)

In [None]:
df = df[df.index.notnull()]

In [None]:
df = df.T

In [None]:
sns.heatmap(df.fillna(0))

### How do we measure similarity?

* Euclidean distance: $$ \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2} $$
* Manhattan distance: $$ |x_1-x_2|+|y_1-y_2| $$
* Minkowski distance: $$ \sqrt[\lambda]{|x_1-x_2|^\lambda + |y_1-y_2|^\lambda} $$
* Jaccard similarity (on sets): $$ \frac{|A \cap B|}{|A \cup B|} $$
* Cosine similarity
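
The first four measures in the list above can be sketched in a few lines of NumPy. The function names are chosen for this example only. Euclidean and Manhattan distance are written as the special cases $\lambda = 2$ and $\lambda = 1$ of the Minkowski distance, and returning 0 for two empty sets in `jaccard` is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def minkowski(a, b, lam):
    """Minkowski distance of order lam between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** lam) ** (1.0 / lam)

def euclidean(a, b):
    return minkowski(a, b, 2)  # lam = 2

def manhattan(a, b):
    return minkowski(a, b, 1)  # lam = 1

def jaccard(A, B):
    """Jaccard similarity of two collections, treated as sets."""
    A, B = set(A), set(B)
    if not A and not B:
        return 0.0  # assumption: two empty sets count as "no overlap"
    return len(A & B) / len(A | B)

d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
j = jaccard({"Alien", "Heat", "Up"}, {"Heat", "Up", "Jaws"})
```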

#### Cosine Similarity

A normalized dot product of two vectors. Geometrically, it is the cosine of the angle between the two vectors.

$$ \cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \lVert Y \rVert} = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2}\sqrt{\sum y_i^2}} $$
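
A small worked example may help; the two vectors are made up for illustration. Take $X = (1, 2, 2)$ and $Y = (2, 1, 2)$:

$$ X \cdot Y = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8, \qquad \lVert X \rVert = \sqrt{1 + 4 + 4} = 3, \qquad \lVert Y \rVert = \sqrt{4 + 1 + 4} = 3 $$

$$ \cos(X, Y) = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.89 $$

This corresponds to an angle of roughly 27 degrees, so the two vectors point in similar directions.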

In [None]:
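# Implement cosine similarity
def cosim(X, Y):
    """Cosine similarity of two rating vectors; missing ratings (NaN) are skipped in each sum."""
    num = np.nansum(X * Y)  # dot product over items both users rated; like np.dot(X, Y) without NaNs
    denom = np.sqrt(np.nansum(X * X) * np.nansum(Y * Y))  # product of the two vector norms
    return num / denom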

In [None]:
cosim(df['Marija'], df['Stefan'])

In [None]:
cosim_table = []
for user1 in df.columns:
    row = []
    for user2 in df.columns:
        row.append(cosim(df[user1], df[user2]))
    cosim_table.append(row)

In [None]:
cosim_df = pd.DataFrame(cosim_table, columns=df.columns, index=df.columns).round(2)

In [None]:
sns.heatmap(cosim_df)

### Making predictions

What approach would you take now to recommending movies to people?

* Look for a data-twin (or 10, or N) and get recommendations from them 
    * Use average of those users: sum(ratings)/N
    * Use weighted average of their ratings: sum(similarity*rating)/sum(similarity)
* Extend this to all users and use weighted average
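
To make the difference between the two averages concrete, here is a small sketch for a single movie. The similarities and ratings are invented numbers, and `top_n` is a hypothetical name for the N most similar users.

```python
import numpy as np

# Hypothetical neighbours of the target user, all of whom rated the same movie
similarities = np.array([0.9, 0.5, 0.1])  # similarity to the target user
ratings = np.array([5.0, 3.0, 1.0])       # their ratings for the movie

# Plain average: sum(ratings) / N
plain_average = ratings.sum() / len(ratings)

# Weighted average: sum(similarity * rating) / sum(similarity)
weighted_average = (similarities * ratings).sum() / similarities.sum()

# Restrict to the N = 2 most similar users before averaging
top_n = np.argsort(similarities)[::-1][:2]
top_n_weighted = (similarities[top_n] * ratings[top_n]).sum() / similarities[top_n].sum()
```

Because the most similar user gave the highest rating, the weighted average ends up above the plain average here.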

How do we go about making predictions?
* Select only movies your target user hasn't seen
* For those movies, calculate the similarity-weighted average of the ratings of other users
* Rank the movies by predicted rating and pick the top 1/3/N
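
The steps above can be sketched as one self-contained function on a toy table. The names `cosine`, `recommend`, and `toy`, as well as the users and ratings, are hypothetical. As in this notebook's ratings table, movies are rows, users are columns, and NaN means "not rated".

```python
import numpy as np
import pandas as pd

def cosine(x, y):
    """Cosine similarity; missing ratings (NaN) are skipped in each sum."""
    return np.nansum(x * y) / np.sqrt(np.nansum(x * x) * np.nansum(y * y))

def recommend(ratings, target, top_n=3):
    """Rank the movies `target` has not rated by similarity-weighted rating."""
    others = [user for user in ratings.columns if user != target]
    sims = pd.Series({u: cosine(ratings[target], ratings[u]) for u in others})
    unseen = ratings.index[ratings[target].isna()]  # step 1: unseen movies
    scores = {}
    for movie in unseen:
        rated_by = ratings.loc[movie, others].dropna()  # users who rated it
        weights = sims[rated_by.index]
        if weights.sum() > 0:  # skip movies without any usable neighbour
            # step 2: similarity-weighted average of the neighbours' ratings
            scores[movie] = (weights * rated_by).sum() / weights.sum()
    # step 3: best predictions first, keep the top N
    return pd.Series(scores, dtype=float).sort_values(ascending=False).head(top_n)

# Hypothetical toy data
toy = pd.DataFrame(
    {"Ann": [5, 4, np.nan, np.nan],
     "Bob": [5, 5, 4, 1],
     "Cid": [1, 2, 2, 5]},
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)
recs = recommend(toy, "Ann", top_n=2)
```

Ann's ratings resemble Bob's more than Cid's, so Bob's preference for "Movie C" dominates and it is ranked ahead of "Movie D".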

In [None]:
target_user = 'Marija'

In [None]:
unseen_movies = df[df[target_user].isna()].index

In [None]:
predicted_ratings = []
for movie in unseen_movies:
    # users who have rated this movie
    other_users = df.columns[df.loc[movie].notna()]
    numerator = 0
    denominator = 0
    for user in other_users:
        similarity = cosim(df[target_user], df[user])
        numerator += similarity * df.loc[movie, user]
        denominator += similarity
    predicted_ratings.append((movie, numerator / denominator))

In [None]:
predicted_ratings

In [None]:
# highest predicted rating first
sorted(predicted_ratings, key=lambda x: x[1], reverse=True)