# Collaborative Filtering

---

## Two types
* Memory/neighborhood-based 
* Model-based

---

### Model-based, e.g. NMF, SVD

In [None]:
from sklearn.decomposition import NMF
...

---

### Memory/neighborhood-based 
* Cosine Similarity
* Euclidean Distance
* Jaccard Similarity
* More to try: nearest-neighbor and clustering methods, e.g. KNN, k-means

---

## Cosine Similarity

* Basically just the normalized dot product!
* Numerator = dot product of the two vectors
* Denominator = product of the vectors' Euclidean norms

$\cos(X,Y)=\frac{X \cdot Y}{||X|| \, ||Y||}$

---

#### A bit of intuition

* Dot product can be written as:

$DP = X_1 Y_1 + X_2 Y_2 + X_3 Y_3 + \dots + X_n Y_n$

* OR:

$DP = ||X|| \, ||Y|| \cos(\theta)$

* Therefore

$\cos(\theta) = \frac{DP}{||X|| \, ||Y||}$
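
The two forms of the dot product can be checked numerically. The sketch below uses two made-up vectors (`x` and `y` are illustrative, not taken from the ratings data) and recovers the cosine of the angle between them from the element-wise sum and the norms.

```python
import numpy as np

# Two small, made-up vectors (hypothetical data for illustration)
x = np.array([3.0, 4.0])
y = np.array([4.0, 3.0])

# Dot product as the sum of element-wise products
dp = float(np.sum(x * y))  # 3*4 + 4*3 = 24

# Euclidean norms of both vectors
norm_x = float(np.linalg.norm(x))  # sqrt(9 + 16) = 5
norm_y = float(np.linalg.norm(y))  # sqrt(16 + 9) = 5

# Rearranged geometric form: cos(theta) = DP / (||X|| * ||Y||)
cos_theta = dp / (norm_x * norm_y)  # 24 / 25 = 0.96

# The angle itself, in degrees
theta_deg = float(np.degrees(np.arccos(cos_theta)))
```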

---

### Imports and Data

In [None]:
import pandas as pd
import numpy as np
import tqdm
import seaborn as sns
from sqlalchemy import create_engine
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.metrics import jaccard_score

HOST = 'localhost'
USERNAME = 'postgres'
PASSWORD = 'postgres'
PORT = '5432'
DATABASE = 'movies'

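# SQLAlchemy 1.4+ only accepts the 'postgresql://' scheme, not 'postgres://'
conn_string = f'postgresql://{USERNAME}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}'
engine = create_engine(conn_string)

# Read the whole 'ratings' table through the engine created above
df = pd.read_sql('ratings', engine)
# User-movie rating matrix: one row per user, one column per movie, NaN where unrated
umr = df.pivot_table(values='rating', index='userId', columns='movieId')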

### Manually
* We need to fill the missing ratings (NaNs) with 0 before we can calculate cosim by hand

In [None]:
umr.fillna(0,inplace=True)

In [None]:
def cosim(vec1, vec2):
    num = np.dot(vec1, vec2)
    denom = np.sqrt(np.dot(vec1,vec1) * np.dot(vec2,vec2))
    return num / denom

In [None]:
data = []
for i, row1 in tqdm.tqdm(umr.iterrows()):
    row = []
    for j, row2 in umr.iterrows():
        c = cosim(row1, row2)
        row.append(c)
    data.append(row)

### Now we can create our own cosim matrix

In [None]:
cs = pd.DataFrame(data, index=umr.index, columns=umr.index).round(2)

#### And use it to find k closest users

In [None]:
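def pick_closest_existing_users(user_id, k):
    # Similarities of user_id to all other users; cs is indexed by userId, so look up by label
    sims = cs.loc[user_id].drop(user_id)
    # userIds of the k most similar users, most similar first
    return sims.nlargest(k).index.tolist()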

### Automatically
* All in one line of code!

In [None]:
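# Named user_sim so it does not overwrite the cosim() function defined above
user_sim = pd.DataFrame(cosine_similarity(umr), index=umr.index, columns=umr.index)
sns.heatmap(user_sim)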

---

### Advantages of Cosim

* fast
* works for huge datasets
* works with other types of features (genres, demography)
* item based or user based
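
The last point can be illustrated with a small sketch: the same routine gives user-user similarities on the rating matrix and item-item similarities on its transpose. The `cosine_matrix` helper and the toy `ratings` array are assumptions made for this example, not part of the notebook above. The vectorized form also hints at why the computation is fast.

```python
import numpy as np

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    # Replace zero norms by 1 to avoid dividing by zero for all-zero rows
    unit = m / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Hypothetical ratings: 4 users (rows) x 3 movies (columns), 0 = not rated
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
])

user_sim = cosine_matrix(ratings)    # user based: shape (4, 4)
item_sim = cosine_matrix(ratings.T)  # item based: shape (3, 3)
```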

### Disadvantages of Cosim
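* With NaNs filled by 0, a missing rating counts as the lowest possible rating, i.e. as a strong dislike
* Most data is typically missing!
* SOLUTION: center the data by subtracting each user's mean rating before filling the NaNs, so a missing rating counts as neutral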
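
A minimal sketch of mean-centering, assuming a small made-up `ratings` array in which NaN marks an unrated movie. The variable names are illustrative and this is one possible way to do it, not the only one.

```python
import numpy as np

# Hypothetical user-by-movie ratings; NaN marks a movie the user has not rated
ratings = np.array([
    [5.0, 4.0, np.nan],
    [1.0, np.nan, 2.0],
])

# Each user's mean over the movies they actually rated
user_means = np.nanmean(ratings, axis=1, keepdims=True)

# Subtract the user's mean, then fill missing entries with 0,
# so an unrated movie counts as "average" rather than "worst possible"
centered = np.nan_to_num(ratings - user_means, nan=0.0)
```

After centering, above-average ratings are positive, below-average ratings are negative, and unrated movies sit at 0, so they no longer pull the similarity in either direction.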

---

### Challenge:
* Handle a new user
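
One possible approach to the challenge, sketched under assumptions: compare the new user's few ratings to every existing user, then score unseen movies by a similarity-weighted average of the existing ratings. The toy data, the `cosine` helper and the weighting scheme are all illustrative choices, not a prescribed solution.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D arrays; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical ratings of three existing users for four movies (0 = not rated)
existing = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 5.0, 0.0],
    [1.0, 0.0, 2.0, 5.0],
])

# A new user who has rated only the first two movies so far
new_user = np.array([5.0, 5.0, 0.0, 0.0])

# Similarity of the new user to every existing user
sims = np.array([cosine(new_user, row) for row in existing])

# Similarity-weighted average rating per movie, counting only users who rated it
rated = existing > 0
weights = sims[:, None] * rated
scores = (weights * existing).sum(axis=0) / np.maximum(weights.sum(axis=0), 1e-12)

# Recommend among the movies the new user has not rated yet
unseen = np.flatnonzero(new_user == 0)
best_movie = int(unseen[np.argmax(scores[unseen])])
```

Here `best_movie` is a column position in the toy array. With the real data it would have to be mapped back to a `movieId`.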

---

### Euclidean Distance

In [None]:
euclid = pd.DataFrame(euclidean_distances(umr))
sns.heatmap(euclid)
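
For comparison with the manual cosim above, Euclidean distance can also be computed by hand. The `euclid` function and the two short vectors below are illustrative assumptions for this sketch.

```python
import numpy as np

def euclid(vec1, vec2):
    """Euclidean distance: square root of the summed squared differences."""
    diff = np.asarray(vec1, dtype=float) - np.asarray(vec2, dtype=float)
    return float(np.sqrt(np.dot(diff, diff)))

# Two made-up rating vectors
u = [5.0, 1.0, 0.0]
v = [2.0, 5.0, 0.0]

d = euclid(u, v)  # sqrt(3**2 + 4**2) = 5.0
```

Unlike cosine similarity, this is a distance: 0 means identical vectors and larger values mean less similar users.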

---

### Jaccard Similarity / Tanimoto Coefficient
* Size of the intersection divided by the size of the union of the sample sets

$ \frac{| A \cap B |}{|A \cup B|}$

* Pros: Quick and easy calculation
* Cons: Ignores the rating values; only whether a movie was rated counts

In [None]:
umr.fillna(0, inplace=True)
# Binarize to rated (1) / not rated (0), so the score is |A ∩ B| / |A ∪ B| over rated movies
jaccard_score((umr.iloc[0] > 0).astype(int), (umr.iloc[1] > 0).astype(int))
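
The set definition can also be written directly with Python sets. This is a minimal sketch: the `jaccard` helper and the two sets of movieIds are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical sets of movieIds rated by two users
user_a = {1, 2, 3, 4}
user_b = {3, 4, 5}

similarity = jaccard(user_a, user_b)  # |{3, 4}| / |{1, 2, 3, 4, 5}| = 2 / 5
```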

---

### Others to try:
* Nearest-neighbor and clustering methods: KNN, k-means, etc.