## Implementation

In [2]:
import numpy as np
import pandas as pd

The data are assumed to be organized as follows, in a matrix $R$:

User_ID | Item_1 | $\dots$ | Item_n

Each cell holds the rating that user User_m gave to the item in that column. If the user has not rated the item, a default value of 0 is used.
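As a minimal sketch of this layout, assume the raw ratings arrive in long format with the hypothetical columns `user_id`, `item` and `rating` (these names are illustrative and do not come from the original data). The matrix $R$ could then be built with a pandas pivot:

```python
import pandas as pd

# Hypothetical long-format ratings; the column names are illustrative assumptions.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item": ["Item_1", "Item_2", "Item_1", "Item_3"],
    "rating": [5, 3, 4, 2],
})

# One row per user, one column per item; unrated cells become 0.
R = ratings.pivot(index="user_id", columns="item", values="rating").fillna(0)
```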

The first step is to factor these data into two matrices, $M_1$ and $M_2$: one describes the users and their characteristics, the other describes the items and their characteristics.

Both matrices are initialized with random values.

After initializing these two matrices, we build the matrix $\hat{R}$, which has the layout

User_ID | Item_1 | Item_2 | $\dots$ | Item_n

Each cell is filled with a predicted rating, obtained from the following product:

$$ \hat{R} = M_1 \times M_2^T $$
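The shapes involved can be sketched as follows. The sizes and the seed are arbitrary choices for illustration, not values from the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_users, n_items, k = 4, 6, 3   # illustrative sizes, not from the dataset

M1 = rng.normal(loc=0.0, scale=1.0 / k, size=(n_users, k))  # user factors
M2 = rng.normal(loc=0.0, scale=1.0 / k, size=(n_items, k))  # item factors

R_hat = M1 @ M2.T  # shape (n_users, n_items)

# Each predicted rating is the dot product of one user row and one item row.
single = M1[1] @ M2[4]
```

Row `i` of `M1` and row `j` of `M2` are the embeddings of user `i` and item `j`, so `R_hat[i, j]` is their dot product.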

So we have a matrix $\hat{R}$ built by multiplying random values, and therefore its entries are themselves arbitrary. The question is:

How is this useful for a recommender system?

At first the results will indeed be poor, since only random values have been used. To improve them, we use an optimization algorithm such as SGD to adjust the values of $M_1$ and $M_2$ so that their product better matches the ratings matrix $R$.
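One common way to write the objective behind such updates is the squared error on the rated cells plus an L2 penalty. The symbols $W$ (0/1 mask of rated cells), $\lambda$ (regularization factor), $\eta$ (learning rate), $\mu$ (momentum coefficient) and $V_k$ (velocity) are introduced here for illustration only:

$$ L = \lVert W \odot (R - M_1 M_2^T) \rVert_F^2 + \frac{\lambda}{2} \left( \lVert M_1 \rVert_F^2 + \lVert M_2 \rVert_F^2 \right) $$

Writing $E = W \odot (R - M_1 M_2^T)$, the gradients are

$$ \frac{\partial L}{\partial M_1} = -2 E M_2 + \lambda M_1, \qquad \frac{\partial L}{\partial M_2} = -2 E^T M_1 + \lambda M_2 $$

and a gradient step with momentum, for $k \in \{1, 2\}$, reads

$$ V_k \leftarrow \mu V_k - \eta \frac{\partial L}{\partial M_k}, \qquad M_k \leftarrow M_k + V_k $$

The class in the next cell additionally clips each gradient entry to $[-1, 1]$ and clips $\hat{R}$ to the rating range, so its updates follow these formulas only approximately.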

At the end of this process we have a recommendation matrix $\hat{R}$ whose values are consistent with the observed data. The complete `MatrixFactorization` class is shown below.

In [3]:
class MatrixFactorization():
    def __init__(self, number_caracteristics, max_rating=5, min_rating=1, learning_rate=0.01, max_iter=500, regularization=0.01, momentum=0.9):
        
        self.number_caracteristics = number_caracteristics  # size of each embedding
        self.max_rating = max_rating
        self.min_rating = min_rating
        self.learning_rate = learning_rate  # SGD learning rate
        self.max_iter = max_iter
        self.regularization = regularization  # regularization factor
        self.momentum = momentum  # momentum coefficient (not present in the first version)

    def criar_matrizes(self, Matriz_R):
        # create the user and item matrices used to build r_hat
        self.matriz_m1 = np.random.normal(loc=0, scale=1/self.number_caracteristics, size=(Matriz_R.shape[0], self.number_caracteristics))
        self.matriz_m2 = np.random.normal(loc=0, scale=1/self.number_caracteristics, size=(Matriz_R.shape[1], self.number_caracteristics))
        self.v_m1 = np.zeros_like(self.matriz_m1)
        self.v_m2 = np.zeros_like(self.matriz_m2)
    
    def criar_matriz_R_hat(self):
        self.r_hat = np.dot(self.matriz_m1, self.matriz_m2.T)
    
    def calcular_erro(self, R):
        # RMSE over every cell of R, including the zeros that mark unrated items
        erro = (R - self.r_hat)
        mse = np.mean(erro ** 2)
        return np.sqrt(mse)

    def treinar(self, Matriz_R, mostrar_erro=False):
        array_R = np.array(Matriz_R)
        
        # mask out the entries that the user has not rated
        # idea taken from https://github.com/wangyuhsin/matrix-factorization/blob/main/README.md
        mask = array_R > 0

        for iteration in range(self.max_iter):

            # error on rated entries only
            erro = (array_R - self.r_hat) * mask

            # compute the gradients, then update the momentum velocities
            # unclipped version, kept for reference:
            #grad_m1 = -2 * (erro @ self.matriz_m2) + self.regularization * self.matriz_m1
            #grad_m2 = -2 * (erro.T @ self.matriz_m1) + self.regularization * self.matriz_m2
            # gradients are clipped to [-1, 1] because the unclipped version overflowed
            grad_m1 = np.clip(-2 * (erro @ self.matriz_m2) + self.regularization * self.matriz_m1, -1, 1)
            grad_m2 = np.clip(-2 * (erro.T @ self.matriz_m1) + self.regularization * self.matriz_m2, -1, 1)

            self.v_m1 = self.momentum * self.v_m1 - self.learning_rate * grad_m1
            self.v_m2 = self.momentum * self.v_m2 - self.learning_rate * grad_m2
            # update m1 and m2
            self.matriz_m1 += self.v_m1
            self.matriz_m2 += self.v_m2

            # recompute r_hat after the full iteration, clipped to the rating range
            self.r_hat = np.clip(np.dot(self.matriz_m1, self.matriz_m2.T), self.min_rating, self.max_rating)

            if mostrar_erro:
                if iteration % 10 == 0:
                    erro_atual = self.calcular_erro(array_R)
                    print(f"Iteração: {iteration + 1}, Erro: {erro_atual}")

        return self.r_hat

    def fit(self, Matriz_R, mostrar_erro=False):
        self.criar_matrizes(Matriz_R)
        self.criar_matriz_R_hat()  
        return self.treinar(Matriz_R, mostrar_erro)


## Tests

As stated at the beginning, the data are assumed to be organized in a specific layout, so I used the pivoted MovieLens 100K dataset.

For interested readers, the dataset cleaning is available at

https://github.com/Lucasaraga0/k_means_recommendation-system/blob/main/notebooks_and_functions/datasets_cleaning.ipynb

From that same repository I use the `movies_pivot_1.csv` dataset.

In [4]:
data = pd.read_csv("movies_pivot_1.csv")
data.fillna(0,inplace= True)
data.drop(columns= ['User Average Rating'], inplace= True)
data

Unnamed: 0,Toy Story (1995),GoldenEye (1995),Four Rooms (1995),Get Shorty (1995),Copycat (1995),Shanghai Triad (Yao a yao yao dao waipo qiao) (1995),Twelve Monkeys (1995),Babe (1995),Dead Man Walking (1995),Richard III (1995),...,Mirage (1995),Mamma Roma (1962),"Sunchaser, The (1996)","War at Home, The (1996)",Sweet Nothing (1995),Mat' i syn (1997),B. Monkey (1998),Sliding Doors (1998).1,You So Crazy (1994),Scream of Stone (Schrei aus Stein) (1991)
0,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
939,0.0,0.0,0.0,2.0,0.0,0.0,4.0,5.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
940,5.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
MatrizF = MatrixFactorization(number_caracteristics= 40, max_iter= 300, regularization= 0.01, learning_rate= 0.001, momentum= 0.9 )
MatrizF.fit(Matriz_R= data,mostrar_erro= True)


Iteração: 1, Erro: 1.1917854175374085
Iteração: 11, Erro: 1.1917854175374085
Iteração: 21, Erro: 1.1917854175374085
Iteração: 31, Erro: 1.2355065146349535
Iteração: 41, Erro: 2.331964123832541
Iteração: 51, Erro: 3.2966067414748315
Iteração: 61, Erro: 2.733407038050792
Iteração: 71, Erro: 3.0507791368214745
Iteração: 81, Erro: 2.9335712630851374
Iteração: 91, Erro: 2.9444567970398166
Iteração: 101, Erro: 2.938257742279391
Iteração: 111, Erro: 2.91058221737619
Iteração: 121, Erro: 2.905047830239172
Iteração: 131, Erro: 2.8966075903950603
Iteração: 141, Erro: 2.902776542952841
Iteração: 151, Erro: 2.9145091381778516
Iteração: 161, Erro: 2.9279360591287764
Iteração: 171, Erro: 2.941173352570431
Iteração: 181, Erro: 2.9527854894409145
Iteração: 191, Erro: 2.9631929991188133
Iteração: 201, Erro: 2.9732304414936577
Iteração: 211, Erro: 2.982474991803103
Iteração: 221, Erro: 2.9909403151316822
Iteração: 231, Erro: 2.9981542371130536
Iteração: 241, Erro: 3.004070550314351
Iteração: 251, Erro: 

array([[5.        , 3.05921968, 4.01250652, ..., 1.5962633 , 3.12601474,
        3.55632809],
       [3.89292424, 5.        , 1.        , ..., 1.35290434, 3.34982528,
        3.03704838],
       [1.82391531, 1.        , 1.83735522, ..., 1.46273147, 1.68304772,
        2.26463535],
       ...,
       [4.99282275, 4.36483174, 3.95856378, ..., 2.00793597, 2.94593448,
        3.81809121],
       [4.39586561, 1.96130279, 5.        , ..., 1.72777761, 2.96214609,
        3.42216198],
       [3.6071972 , 4.5346334 , 5.        , ..., 1.40845932, 2.65683179,
        3.32610548]])
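The error printed during training comes from `calcular_erro`, which averages over every cell of $R$, including the zeros that stand for missing ratings. Since predictions are clipped to at least `min_rating`, those cells always contribute a nonzero error. This may be part of why the reported value rises even though the updates only use rated cells. A minimal sketch of an RMSE restricted to rated cells follows, where `masked_rmse` is a hypothetical helper that is not part of the original class:

```python
import numpy as np

def masked_rmse(R, R_hat):
    """RMSE restricted to cells that hold a rating (R > 0)."""
    R = np.asarray(R, dtype=float)
    mask = R > 0
    diff = (R - R_hat)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: two rated cells, two unrated cells stored as 0.
R = np.array([[5.0, 0.0], [0.0, 3.0]])
R_hat = np.array([[4.0, 2.5], [1.0, 3.0]])

full = float(np.sqrt(np.mean((R - R_hat) ** 2)))  # counts the unrated zeros
observed = masked_rmse(R, R_hat)                  # rated cells only
```

In this toy case `full` is larger than `observed`, because the predictions for unrated cells are compared against 0.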

In [23]:
matriz_final = MatrizF.r_hat 
np.save("matriz_r_hat",matriz_final)
np.save("matriz_m1", MatrizF.matriz_m1)
np.save("matriz_m2", MatrizF.matriz_m2)

In [29]:
matriz_final[0]
data_final = pd.DataFrame(matriz_final, columns= data.columns)
data_final.to_csv("matriz_final_com_recomendacoes.csv")
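One possible way to turn the saved matrix into recommendations is to rank, for a given user, the items that user has not rated by their predicted rating. The sketch below uses toy frames standing in for `data` and `data_final`. The helper `top_n_unrated` and its parameters are hypothetical and not part of the original notebook:

```python
import pandas as pd

def top_n_unrated(ratings, predictions, user, n=3):
    """Return the n highest-predicted items that the user has not rated."""
    rated = ratings.loc[user] > 0
    candidates = predictions.loc[user][~rated]
    return candidates.sort_values(ascending=False).head(n)

# Toy data standing in for `data` (observed) and `data_final` (predicted).
cols = ["A", "B", "C", "D"]
ratings = pd.DataFrame([[5, 0, 0, 3], [0, 4, 0, 0]], columns=cols)
predictions = pd.DataFrame([[4.8, 2.1, 4.4, 3.2], [3.0, 4.1, 1.5, 3.9]], columns=cols)

recs = top_n_unrated(ratings, predictions, user=0, n=2)
```

Filtering out already-rated items keeps the list focused on items the user has not seen, which is usually what a recommendation is meant to surface.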