<a href="https://colab.research.google.com/github/Ang3lino/recomenderSys/blob/master/matrixFactorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:

import numpy as np
import pandas as pd

import os
import random
import pickle

from sortedcontainers import SortedList
from collections import Counter, defaultdict
from tqdm import tqdm  # progress-bar module that gives feedback on long-running procedures

In [3]:
!pip install tqdm --upgrade
tqdm.pandas()

Requirement already up-to-date: tqdm in /usr/local/lib/python3.6/dist-packages (4.43.0)


In [4]:
from google.colab import drive  
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
def load_object(fname: str, user_count: int, item_count: int) -> defaultdict:
    '''Load the pickled object saved for the given user and item counts.'''
    fdir = 'drive/My Drive/petroleo/movielens-20m-dataset'
    fname = f'{fname}_{user_count}_{item_count}.json'  # read with pickle despite the .json suffix
    fpath = os.path.join(fdir, 'shrinked', fname)
    with open(fpath, 'rb') as fp:
        object_ = pickle.load(fp)
    return object_

def defaultdict_set(defdict):
    return {k: set(v) for k, v in defdict.items()}


user_count = 4096
item_count = 512
user2item = load_object('user2item', user_count, item_count)
item2user = load_object('item2user', user_count, item_count)
user_item2rating = load_object('user_item2rating', user_count, item_count)



# Matrix factorization
To reduce storage space and speed up the algorithm, we apply matrix factorization. The goal is to find two matrices whose product is the best approximation of the rating matrix $R$. That is,

$$R \approx \hat R = WU^T$$

Assume $R$ has $m$ users and $n$ items, where $W$ has dimension $m\times k$ and $U$ has dimension $n \times k$. We also define the loss function, summed over the observed ratings $r_{ij}$:

$$ J = \sum_{i, j} (r_{ij} - \hat r_{ij})^2 = \sum_{i,j} (r_{ij} - w_i^T u_j)^2 $$ 
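
To make the shapes concrete, here is a minimal sketch of the factorization. The sizes `m` and `n` reuse `user_count` and `item_count` from the cell above; the rank `k = 10` and the random factors are assumptions chosen only for illustration, not values used in this notebook.

```python
import numpy as np

# m and n mirror user_count and item_count; k = 10 is an assumed rank.
m, n, k = 4096, 512, 10

rng = np.random.default_rng(0)
W = rng.standard_normal((m, k))  # one k-dimensional row w_i per user
U = rng.standard_normal((n, k))  # one k-dimensional row u_j per item

R_hat = W @ U.T                  # predicted ratings, shape (m, n)

dense_entries = m * n            # numbers needed to store R densely
factor_entries = (m + n) * k     # numbers needed to store W and U
print(R_hat.shape, dense_entries, factor_entries)
# prints: (4096, 512) 2097152 46080
```

Each entry of `R_hat` is the dot product $w_i^T u_j$, so under these assumed sizes only the two factors need to be kept, about 46 thousand numbers instead of roughly 2 million.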

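As usual, we seek to minimize $J$. Taking the partial derivatives with respect to $w_i$ and $u_j$ and setting them to zero gives the updates below, where $\psi_i$ is the set of items rated by user $i$ and $\Omega_j$ is the set of users who rated item $j$.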

$$ w_{i} = (\sum_{j\in\psi_i}u_ju_j^T)^{-1} \sum_{j\in\psi_i}r_{ij}u_j$$
$$ u_{j} = (\sum_{i\in\Omega_j}w_iw_i^T)^{-1} \sum_{i\in\Omega_j}r_{ij}w_i$$
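
The update for $w_i$ can be checked numerically on a toy case. The sketch below is only an illustration: the item factors, the rated items `psi_i`, and the ratings `r_i` are made-up values, and `k = 3` is an assumed latent dimension. It builds the two sums from the formula, solves the linear system instead of inverting the matrix, and then evaluates the gradient of $J$ with respect to $w_i$ at the result.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                                    # assumed latent dimension
U = rng.standard_normal((5, k))          # toy item factors for 5 items
psi_i = [0, 2, 3, 4]                     # hypothetical items rated by user i
r_i = {0: 4.0, 2: 3.5, 3: 5.0, 4: 1.0}   # hypothetical ratings r_ij

# Matrix on the left: sum of outer products u_j u_j^T over the rated items
A = sum(np.outer(U[j], U[j]) for j in psi_i)
# Vector on the right: sum of r_ij * u_j over the rated items
b = sum(r_i[j] * U[j] for j in psi_i)

# Solving A w = b is equivalent to applying the inverse, without forming it
w_i = np.linalg.solve(A, b)

# Gradient of J with respect to w_i, evaluated at the solution
grad = -2 * sum((r_i[j] - w_i.dot(U[j])) * U[j] for j in psi_i)
print(np.abs(grad).max())  # expected to be close to zero
```

Note that the matrix `A` is only invertible when the user has rated at least $k$ items with linearly independent factors; the formulas above include no regularization term, so a user with fewer ratings would make the system singular.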

In [0]:
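def loss_function(ratings: dict, u, w):
    ''' Sum of squared errors over the observed ratings; ratings[(i, j)] -> rating '''
    return sum((r - w[i].dot(u[j])) ** 2 for (i, j), r in ratings.items())

def solve_system(X: np.ndarray, Y: np.ndarray, neighbors: dict, rating):
    ''' Update the rows of X in place while Y stays fixed.
    neighbors[a] -> indices b with an observed rating; rating(a, b) -> its value. '''
    k = X.shape[1]
    for a, bs in neighbors.items():
        # Reset the accumulators for every row being solved
        matrix = np.zeros((k, k))
        vector = np.zeros(k)
        for b in bs:
            matrix += np.outer(Y[b], Y[b])
            vector += rating(a, b) * Y[b]
        # Singular if row a has fewer than k linearly independent neighbors
        X[a] = np.linalg.solve(matrix, vector)

def fit(R: dict, I: iter, J: iter, epochs: int, k: int):
    m = max(I) + 1
    n = max(J) + 1
    W = np.random.randn(m, k)
    U = np.random.randn(n, k)
    # psi[i]: items rated by user i; omega[j]: users who rated item j
    psi, omega = defaultdict(list), defaultdict(list)
    for (i, j) in R:
        psi[i].append(j)
        omega[j].append(i)
    for epoch in tqdm(range(epochs)):
        solve_system(W, U, psi, lambda i, j: R[(i, j)])    # update every w_i
        solve_system(U, W, omega, lambda j, i: R[(i, j)])  # update every u_j
    return W, U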


In [8]:
a = np.array([[1,2],[3,4]])
np.outer(a,a)

array([[ 1,  2,  3,  4],
       [ 2,  4,  6,  8],
       [ 3,  6,  9, 12],
       [ 4,  8, 12, 16]])

In [9]:
b = np.array([1,2,3,4])
np.outer(b,b)

array([[ 1,  2,  3,  4],
       [ 2,  4,  6,  8],
       [ 3,  6,  9, 12],
       [ 4,  8, 12, 16]])