<a href="https://colab.research.google.com/github/Ang3lino/recomenderSys/blob/master/collaborativeFiltering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:

import numpy as np
import pandas as pd

import os
import pickle

from sortedcontainers import SortedList
from collections import Counter, defaultdict
from tqdm import tqdm  # progress bar that gives feedback on long-running loops

In [0]:

!pip install tqdm



# User-Based Neighborhood Models
In this approach, user-based neighborhoods are defined to identify users similar to the target user, so that the ratings the target user would give to an item can be predicted from theirs.

Consider the $m \times n$ ratings matrix $R = [r_{ui}]$, with $m$ users and $n$ items. Let $I_u$ be the set of indices of the items that user $u$ has rated (the observed entries of row $u$). The set $I_u \cap I_v$ then contains the items rated by both users $u$ and $v$.
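
As a minimal sketch of these definitions, the snippet below builds $I_u$ and $I_u \cap I_v$ for two users. The dictionary `toy_ratings` and the helper `rated_items` are hypothetical names with made-up data, not part of the notebook.

```python
# Toy ratings keyed by (user, item); the values are illustrative only.
toy_ratings = {
    (0, 0): 5.0, (0, 1): 3.0, (0, 3): 4.0,
    (1, 1): 2.0, (1, 2): 4.0, (1, 3): 5.0,
}

def rated_items(ratings: dict, u: int) -> set:
    """Return I_u, the set of item indices rated by user u."""
    return {item for (user, item) in ratings if user == u}

I_0 = rated_items(toy_ratings, 0)  # items rated by user 0
I_1 = rated_items(toy_ratings, 1)  # items rated by user 1
common = I_0 & I_1                 # items rated by both users
print(sorted(common))              # [1, 3]
```

Only the items in `common` contribute to the similarity between the two users.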

In [0]:
from google.colab import drive  
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
def load_object(fname: str, user_count: int, item_count: int) -> defaultdict:
    # The files carry a .json extension but are read as pickled objects.
    fdir = 'drive/My Drive/petroleo/movielens-20m-dataset'
    fname = f'{fname}_{user_count}_{item_count}.json'
    fpath = os.path.join(fdir, 'shrinked', fname)
    with open(fpath, 'rb') as fp:
        object_ = pickle.load(fp)
    return object_

def defaultdict_set(defdict):
    return {k: set(v) for k, v in defdict.items()}


user_count = 4096
item_count = 512
user2item = load_object('user2item', user_count, item_count)
item2user = load_object('item2user', user_count, item_count)
user_item2rating = load_object('user_item2rating', user_count, item_count)

user2item = defaultdict_set(user2item)
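
The three loaded objects are used below as lookup tables: user to rated items, item to the users who rated it, and (user, item) to rating. The following sketch shows one way such structures could be built from raw triples. The shapes are inferred from how the objects are used later, and the `toy_` names and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples, for illustration only.
triples = [(0, 10, 4.0), (0, 11, 3.0), (1, 10, 5.0), (1, 12, 2.0), (2, 11, 1.0)]

toy_user2item = defaultdict(list)  # user -> items the user rated
toy_item2user = defaultdict(list)  # item -> users who rated the item
toy_user_item2rating = {}          # (user, item) -> rating

for user, item, rating in triples:
    toy_user2item[user].append(item)
    toy_item2user[item].append(user)
    toy_user_item2rating[(user, item)] = rating

# Converting the item lists to sets makes intersections between users cheap.
toy_user2item_sets = {u: set(items) for u, items in toy_user2item.items()}
print(toy_user2item_sets[0] & toy_user2item_sets[1])  # {10}
```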

# Pearson
One measure of the similarity $Sim(u, v)$ between the rating vectors of users $u$ and $v$ is the Pearson correlation coefficient. The first step is to compute the mean rating $\mu_u$ of each user $u$:

$$ \mu_u = \frac{\sum_{k\in I_u}r_{uk}}{|I_u|} $$

The Pearson correlation is then defined as

$$ Sim(u, v) = Pearson(u, v) = \frac{\sum_{k \in I_u \cap I_v}(r_{uk} - \mu_u)(r_{vk} - \mu_v)}
{\sqrt{\sum_{k \in I_u \cap I_v}(r_{uk} - \mu_u)^2 }  \sqrt{ \sum_{k \in I_u \cap I_v}(r_{vk} - \mu_v)^2 } 
} $$
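
A small worked example may help check the formula by hand. This is a pure-Python sketch with made-up ratings; `toy_mean` and `toy_pearson` are hypothetical helpers, and returning 0 when the denominator vanishes is an assumption of this sketch, not something the formula prescribes.

```python
import math

def toy_mean(ratings: dict) -> float:
    """Mean rating mu_u over all items the user has rated (I_u)."""
    return sum(ratings.values()) / len(ratings)

def toy_pearson(r_u: dict, r_v: dict) -> float:
    """Pearson similarity, summing over the co-rated items only."""
    mu_u, mu_v = toy_mean(r_u), toy_mean(r_v)
    common = r_u.keys() & r_v.keys()  # I_u ∩ I_v
    num = sum((r_u[k] - mu_u) * (r_v[k] - mu_v) for k in common)
    den_u = math.sqrt(sum((r_u[k] - mu_u) ** 2 for k in common))
    den_v = math.sqrt(sum((r_v[k] - mu_v) ** 2 for k in common))
    den = den_u * den_v
    # Assumption: treat an undefined correlation (zero denominator) as 0.
    return num / den if den > 0 else 0.0

r_u = {0: 5.0, 1: 3.0, 3: 4.0}  # mu_u = 4
r_v = {1: 1.0, 2: 3.0, 3: 5.0}  # mu_v = 3
sim_uv = toy_pearson(r_u, r_v)
print(sim_uv)
```

Here the co-rated items are 1 and 3, the centered vectors are $(-1, 0)$ and $(-2, 2)$, and the similarity is $2 / (1 \cdot \sqrt{8}) = 1/\sqrt{2}$.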



In [0]:
def user_means(user_item2rating: dict, user2item: defaultdict) -> dict:
    means = dict() 
    for u, items in user2item.items():
        means[u] = np.mean([user_item2rating[(u, i)] for i in items])
    return means

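def sim_cos(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity; on mean-centered ratings this is the Pearson correlation.
    num = u.dot(v)
    den = np.sqrt(u.dot(u)) * np.sqrt(v.dot(v))
    # No co-rated items or a zero-norm vector: the similarity is undefined, use 0.
    return num / den if den > 0 else 0.0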
    

def sim_pearson(u: int, v: int, means: dict, user2item: defaultdict, user_item2rating: dict):
    intersection = user2item[u] & user2item[v]
    a = np.array([user_item2rating[u, k] for k in intersection]) - means[u]
    b = np.array([user_item2rating[v, k] for k in intersection]) - means[v]
    return sim_cos(a, b)



In [0]:
means = user_means(user_item2rating, user2item)
ans = sim_pearson(1, 5, means, user2item, user_item2rating)
print(ans)

0.11996713555118083


# Prediction
Let $P_u(j)$ be the set of the $k$ users closest to user $u$ who have rated item $j$. As a heuristic improvement, users with very low or negative correlation with user $u$ are filtered out of $P_u(j)$. The prediction function is then:

$$\hat{r}_{uj} = \mu_u + \frac{\sum_{v\in P_u(j)}Sim(u,v)(r_{vj}-\mu_v)}{\sum_{v\in P_u(j)}|Sim(u, v)|} $$
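
To make the formula concrete, here is a minimal sketch that evaluates it for two made-up neighbors. The function `toy_predict` and all numbers are hypothetical, and falling back to $\mu_u$ when there are no usable neighbors is an assumption of this sketch.

```python
def toy_predict(mu_u: float, neighbors: list) -> float:
    """Evaluate the prediction formula.

    neighbors holds one (sim_uv, r_vj, mu_v) tuple per user v in P_u(j).
    """
    num = sum(sim * (r_vj - mu_v) for sim, r_vj, mu_v in neighbors)
    den = sum(abs(sim) for sim, _, _ in neighbors)
    # Assumption: with no usable neighbors, predict the user's own mean.
    return mu_u + num / den if den > 0 else mu_u

# Two illustrative neighbors: (similarity, rating of item j, neighbor's mean).
toy_neighbors = [(0.8, 5.0, 4.0), (0.4, 2.0, 3.0)]
r_hat = toy_predict(4.0, toy_neighbors)
print(r_hat)
```

The numerator is $0.8 \cdot 1 + 0.4 \cdot (-1) = 0.4$ and the denominator is $1.2$, so the prediction is $4 + 0.4/1.2 \approx 4.33$.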

In [0]:
def euclidean_distance(u, v):
    return np.sqrt(np.sum((u - v) ** 2))

def neighbors(x_vec, y_cls, p_i, k=3):
    dist_class = [(euclidean_distance(x_i, p_i), y_cls[i]) for i, x_i in enumerate(x_vec)]
    sorted_distance = sorted(dist_class, key=lambda x: x[0])[:k]
    return [y_i for (_, y_i) in sorted_distance]

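def closest_users(u: int, j: int, item2user, user_item2rating):
    # Picks the k users whose rating of item j is numerically closest to u's own
    # rating of j. This requires r_uj to be known already and does not use Sim(u, v),
    # so it differs from the P_u(j) defined above; u itself is not excluded.
    users = item2user[j]
    ratings = [user_item2rating[user, j] for user in users]
    closests = neighbors(ratings, users, user_item2rating[(u, j)])
    return closests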


In [45]:
u = 1
j = 2
closest_users(u, j, item2user, user_item2rating)

[405.0, 691.0, 1005.0]
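
For comparison, the definition of $P_u(j)$ in the text ranks candidates by $Sim(u, v)$ rather than by rating distance. A minimal sketch of that selection follows; `top_k_by_similarity`, its `min_sim` threshold, and the toy similarities are hypothetical and are not what the notebook runs.

```python
import heapq

def top_k_by_similarity(u, candidates, sim, k=3, min_sim=0.0):
    """Return up to k candidates most similar to u.

    sim is any callable sim(u, v) -> float. Candidates whose similarity is
    not above min_sim are dropped, and u itself is never returned.
    """
    scored = [(sim(u, v), v) for v in candidates if v != u]
    kept = [(s, v) for s, v in scored if s > min_sim]
    return [v for _, v in heapq.nlargest(k, kept)]

# Made-up similarities between user 1 and four other users.
toy_sims = {(1, 2): 0.9, (1, 3): -0.2, (1, 4): 0.5, (1, 5): 0.1}
result = top_k_by_similarity(1, [1, 2, 3, 4, 5], lambda a, b: toy_sims[(a, b)], k=2)
print(result)  # [2, 4]
```

In the notebook's setting, `candidates` would be `item2user[j]` and `sim` could wrap `sim_pearson` with `means`, `user2item` and `user_item2rating` bound. Unlike the rating-distance selection, this does not need the true rating of user $u$ for item $j$.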

In [0]:
def predict_rating(u: int, j: int, means: dict, user2item, item2user, user_item2rating):
    k_closest = closest_users(u, j, item2user, user_item2rating)
    similars = np.array([sim_pearson(u, v, means, user2item, user_item2rating) for v in k_closest])
    s_vj = np.array([user_item2rating[(v, j)] - means[v] for v in k_closest])  # mean-centered ratings
    num = np.sum(similars * s_vj)
    den = np.sum(np.abs(similars))
    # With no usable neighbor similarities, fall back to the user's mean rating.
    return means[u] + num / den if den > 0 else means[u]

In [47]:
r_pred = predict_rating(u, j, means, user2item, item2user, user_item2rating)
print(r_pred)
print(user_item2rating[(u, j)])

4.629527360969084
4.0


In [0]:
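def predict(user2item, item2user, user_item2rating, batch_size: int) -> dict:
    # Predicts every known (user, item) pair, user by user, and stops after the
    # first user that brings the number of processed pairs above batch_size.
    # Note: `means` is read from the module-level scope, so it must exist already.
    predictions = dict()
    count = 0
    for u, items in user2item.items():
        for i in items:
            predictions[(u, i)] = predict_rating(u, i, means, user2item, item2user, user_item2rating)
            count += 1
        if count > batch_size: break
    return predictions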

def mse(y_true, y_pred):
    # Mean squared error: the average of the squared differences.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

In [63]:
%time preds = predict(user2item, item2user, user_item2rating, 100)

CPU times: user 6.51 s, sys: 7 ms, total: 6.52 s
Wall time: 6.54 s


In [64]:
debug = True
acc = 0
for u, i in preds.keys():
    x = preds[u, i] 
    y = user_item2rating[u, i]
    if debug: print(x, y)
    acc += (x - y) ** 2 
print(np.sqrt(acc))  # square root of the summed squared errors (not divided by the count)

4.085048612653084 4.0
3.1179143597964023 3.0
2.991131300344831 3.0
3.1492853891721664 3.0
3.9911313003448314 4.0
4.946395968567048 5.0
2.150096136419229 2.0
2.1863032393366044 2.0
2.1388294198290376 2.0
3.094144789288019 3.0
2.1335539832734645 2.0
3.933888167657611 4.0
4.933888167657611 5.0
2.2522257995102173 2.0
5.031052034256797 5.0
2.259580725633232 2.0
4.0439845791225215 4.0
3.932085949435386 4.0
4.071729890093563 4.0
4.121903358745572 4.0
3.098664487084926 3.0
4.046114366429273 4.0
3.9911313003448314 4.0
2.288201286983692 2.0
4.991131300344831 5.0
3.101056786269148 3.0
3.2647869200427664 3.0
3.2647869200427664 3.0
4.001273636137011 4.0
3.060306265253589 3.0
3.9911313003448314 4.0
3.2647869200427664 3.0
4.133353339857471 4.0
5.226686990482238 5.0
3.067921850630901 3.0
2.991131300344831 3.0
4.067921850630901 4.0
3.176484215668627 3.0
3.932085949435386 4.0
5.041109067765025 5.0
3.05967528183247 3.0
3.9421942792253546 4.0
3.921804649644094 4.0
2.9562120238202088 3.0
3.117085030304906 
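
The last number printed above is the square root of the summed squared errors, which grows with the number of predictions. Metrics that average over the predictions are easier to compare across batch sizes. The sketch below computes MSE, RMSE and MAE in pure Python; `toy_metrics` and its inputs are hypothetical and do not reproduce the run above.

```python
import math

def toy_metrics(y_true, y_pred) -> dict:
    """Return MSE, RMSE and MAE for paired true and predicted ratings."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

# Made-up true and predicted ratings.
metrics = toy_metrics([4.0, 3.0, 3.0], [4.5, 2.5, 3.0])
print(metrics)
```

With the notebook's objects, the same function could be applied to `[user_item2rating[k] for k in preds]` and `[preds[k] for k in preds]`.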