# Matrix Factorization

(Based on the notebook https://github.com/MoMkhani/MovieLens-Matrix-Factorization/tree/main)

We will use Singular Value Decomposition (SVD), a matrix factorization method, to recommend movies to users.

Keep this code for future reference in case you need it.

# Import Libraries

In [47]:
import numpy as np
import polars as pl
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.sparse.linalg import svds

# Loading the data

In [50]:
# Load the movie ratings dataset we looked at earlier.
# Remember to adjust the path if it does not work for you.
rating = pd.read_csv('../data/raw/movies/ratings.csv')
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [51]:
print(f"Recall that there are {rating.shape[0]} rows and {rating.shape[1]} columns: {list(rating.columns)}")

Recall that there are 100836 rows and 4 columns: ['userId', 'movieId', 'rating', 'timestamp']


In [53]:
# Load the movies
movie = pd.read_csv('../data/raw/movies/movies.csv')
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [54]:
print(f"Recall that there are {movie.shape[0]} movies described by {movie.shape[1]} attributes: {list(movie.columns)}")

Recall that there are 9742 movies described by 3 attributes: ['movieId', 'title', 'genres']


In [56]:
df = pd.merge(rating, movie, on='movieId')
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


In [57]:
print(f"So we have {df.shape[0]} ratings. The columns are: {list(df.columns)}")

So we have 100836 ratings. The columns are: ['userId', 'movieId', 'rating', 'timestamp', 'title', 'genres']


In [59]:
# Check for null values
df.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
dtype: int64

# Matrix Factorization

Since the previous notebook already covered a deeper analysis of the dataset, we go straight to generating recommendations with Singular Value Decomposition (SVD) as the matrix factorization method.
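
As a small illustration of what a truncated SVD does, here is a minimal sketch on a made-up 6x5 matrix (the values, the seed, and `k = 2` are assumptions chosen for the example, not part of the MovieLens data). It keeps only `k` singular values, rebuilds a rank-`k` approximation, and compares the approximation error with the singular values that were discarded.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy 6x5 "ratings" matrix (made-up values, for illustration only).
rng = np.random.default_rng(0)
toy = rng.integers(0, 6, size=(6, 5)).astype(float)

k = 2  # number of latent factors to keep; svds requires k < min(toy.shape)
U, s, Vt = svds(toy, k=k)

# Rebuild the rank-k approximation from the three factors.
approx = U @ np.diag(s) @ Vt

# Reference: all singular values from the dense decomposition, largest first.
full_s = np.linalg.svd(toy, compute_uv=False)

print(U.shape, s.shape, Vt.shape)        # (6, 2) (2,) (2, 5)
print(np.linalg.norm(toy - approx))      # Frobenius error of the rank-2 fit
print(np.sqrt(np.sum(full_s[k:] ** 2)))  # size of the discarded singular values
```

The last two printed numbers should agree up to numerical tolerance: the error of the best rank-`k` approximation is exactly what the dropped singular values account for.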

In [63]:
# First, build the matrix of ratings given by each user to each movie.
# Most (user, movie) pairs have no rating, so the matrix is very sparse (filled with 0 here).

matrix = rating.pivot(columns='movieId', index='userId', values='rating').fillna(0)
matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
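
To make the sparsity concrete: with 100836 ratings spread over 610 x 9724 cells, only about 1.7% of the matrix is filled. The sketch below shows the same pivot on a made-up four-row table (`toy_ratings` and its values are assumptions for illustration) and measures the fraction of empty cells before they are replaced by 0.

```python
import pandas as pd

# Toy long-format ratings (made-up values) with the same column names as ratings.csv.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})

# Before fillna, (user, movie) pairs without a rating are NaN.
wide = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Fraction of cells with no rating.
sparsity = wide.isna().to_numpy().mean()

print(wide.shape)  # (3, 3)
print(sparsity)    # 5 of the 9 cells are empty
```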


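Next we extract the underlying NumPy array, compute the mean of each user's row, and center each row by subtracting that mean. Note that `np.mean` runs over the whole row, so unrated movies (stored as 0) are included in the mean.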
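
Because the zeros are counted, each user's row mean is much smaller than their typical rating. The following minimal sketch, on a made-up 2x4 matrix, contrasts that mean with an assumed alternative that averages only the rated cells. The alternative is shown for comparison only; it is not what this notebook does.

```python
import numpy as np

# Toy matrix: 2 users x 4 movies, where 0 marks "not rated" (made-up values).
toy = np.array([
    [4.0, 0.0, 5.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])

# Row mean over every cell, zeros included (what np.mean(..., axis=1) computes).
mean_all = toy.mean(axis=1)                       # 2.25 and 0.5

# Assumed alternative: mean over the rated cells only.
rated = toy > 0
mean_rated = toy.sum(axis=1) / rated.sum(axis=1)  # 4.5 and 2.0

# Center only the rated cells and leave unrated cells at 0.
centered_rated = np.where(rated, toy - mean_rated[:, None], 0.0)

print(mean_all)
print(mean_rated)
print(centered_rated)
```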

In [64]:
mtrx = matrix.to_numpy()
ratings_mean = np.mean(mtrx, axis = 1)
normalized_mtrx = mtrx - ratings_mean.reshape(-1, 1)
normalized_mtrx

array([[ 3.89582476, -0.10417524,  3.89582476, ..., -0.10417524,
        -0.10417524, -0.10417524],
       [-0.01177499, -0.01177499, -0.01177499, ..., -0.01177499,
        -0.01177499, -0.01177499],
       [-0.00976964, -0.00976964, -0.00976964, ..., -0.00976964,
        -0.00976964, -0.00976964],
       ...,
       [ 2.23215755,  1.73215755,  1.73215755, ..., -0.26784245,
        -0.26784245, -0.26784245],
       [ 2.98755656, -0.01244344, -0.01244344, ..., -0.01244344,
        -0.01244344, -0.01244344],
       [ 4.50611888, -0.49388112, -0.49388112, ..., -0.49388112,
        -0.49388112, -0.49388112]], shape=(610, 9724))

In [72]:
# We use the svds function from scipy.sparse.linalg (imported above).
# k is the number of singular values (latent dimensions) to keep.

U, sigma, Vt = svds(normalized_mtrx, k = 50)
# svds returns the singular values as a 1-D array; turn it into a diagonal matrix
sigma = np.diag(sigma)
# Reconstruct the predicted ratings for every user and add the user means back
all_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_predicted_ratings, columns = matrix.columns)
preds_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,2.167328,0.402751,0.840184,-0.076281,-0.551337,2.504091,-0.890114,-0.026443,0.196974,1.593259,...,-0.023453,-0.019967,-0.026939,-0.026939,-0.023453,-0.026939,-0.023453,-0.023453,-0.023453,-0.058732
1,0.211459,0.006658,0.033455,0.017419,0.18343,-0.062473,0.083037,0.024158,0.04933,-0.15253,...,0.019498,0.016777,0.022219,0.022219,0.019498,0.022219,0.019498,0.019498,0.019498,0.032281
2,0.003588,0.030518,0.046393,0.008176,-0.006247,0.107328,-0.012416,0.003779,0.007297,-0.059362,...,0.005909,0.006209,0.00561,0.00561,0.005909,0.00561,0.005909,0.005909,0.005909,0.008004
3,2.051549,-0.387104,-0.252199,0.087562,0.130465,0.27021,0.477835,0.040313,0.025858,-0.017365,...,0.004836,0.004172,0.0055,0.0055,0.004836,0.0055,0.004836,0.004836,0.004836,-0.023311
4,1.344738,0.778511,0.065749,0.111744,0.273144,0.584426,0.25493,0.128788,-0.085541,1.023455,...,-0.008042,-0.007419,-0.008664,-0.008664,-0.008042,-0.008664,-0.008042,-0.008042,-0.008042,-0.010127
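
The predictions table is indexed by position (0, 1, 2, ...), while `userId` starts at 1. One possible variation, sketched below on made-up stand-ins (`toy_matrix` and `toy_predictions` are hypothetical names), is to reuse the pivot's index so that rows can be looked up by `userId` directly. This is not a drop-in change: code that reads rows by position with `iloc` would need to be adapted.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up values) for `matrix` and `all_predicted_ratings`.
toy_matrix = pd.DataFrame(
    [[4.0, 0.0], [0.0, 3.0]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20], name="movieId"),
)
toy_predictions = np.array([[3.8, 1.2], [0.9, 2.7]])

# Reuse the pivot's index so rows are labelled by userId, not by position.
toy_preds_df = pd.DataFrame(
    toy_predictions, index=toy_matrix.index, columns=toy_matrix.columns
)

# Label-based lookup: predicted rating of user 2 for movie 10.
print(toy_preds_df.loc[2, 10])  # 0.9
```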


# Recommendations

In [76]:
def recommend_movies(preds_df, userId, movie, ratings_df, num_recommendations=5):
    '''Recommend the top movies for a given user.

    Args:
    preds_df: DataFrame of predicted ratings (one row per user, one column per movieId).
    userId: ID of the user to recommend for.
    movie: DataFrame of movies.
    ratings_df: DataFrame of ratings.
    num_recommendations: number of recommendations to return.

    Return:
    user_rated: movies the user has already rated, sorted by rating.
    recommendations: the top recommended movies the user has not rated yet.

    '''
    # userId starts at 1, while row positions start at 0
    user_row_number = userId-1
    # Sort this user's predicted ratings from highest to lowest.
    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)
    # Movies the user has already rated
    user_data = ratings_df[ratings_df.userId == (userId)]

    user_rated = (user_data.merge(movie, how = 'left', left_on = 'movieId', right_on = 'movieId').
                  sort_values(['rating'], ascending=False)
                 )

    recommendations = (movie[~movie['movieId'].isin(user_rated['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
               rename(columns = {user_row_number: 'Predictions'}).
               sort_values('Predictions', ascending = False).
               iloc[:num_recommendations, :-1]
                      )

    return user_rated, recommendations
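
The core of the function above is "drop what the user has already rated, then keep the highest predicted scores". A minimal sketch of just that step, using a hypothetical helper `top_n_unseen` and made-up scores, could look like this:

```python
import pandas as pd

def top_n_unseen(user_predictions, seen_movie_ids, n=5):
    """Hypothetical helper: return the n highest-scored movieIds not in seen_movie_ids."""
    # Remove already-rated movies; ignore ids that are not in the predictions.
    unseen = user_predictions.drop(labels=list(seen_movie_ids), errors="ignore")
    # Keep the n largest predicted scores, highest first.
    return unseen.nlargest(n)

# Toy predicted scores for one user, indexed by movieId (made-up values).
scores = pd.Series({10: 3.9, 20: 4.7, 30: 1.1, 40: 4.2})
already_seen = {20}

print(top_n_unseen(scores, already_seen, n=2).index.tolist())  # [40, 10]
```

Movie 20 has the highest score but is excluded because the user has already rated it, so the top two recommendations are movies 40 and 10.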

In [80]:
# Get the movies the user has already rated and the recommendations
already_rated, predictions = recommend_movies(preds_df, 2, movie, rating, 10)

# Movies already rated
already_rated.head(10)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
9,2,60756,5.0,1445714980,Step Brothers (2008),Comedy
22,2,106782,5.0,1445714966,"Wolf of Wall Street, The (2013)",Comedy|Crime|Drama
28,2,131724,5.0,1445714851,The Jinx: The Life and Deaths of Robert Durst ...,Documentary
18,2,89774,5.0,1445715189,Warrior (2011),Drama
16,2,80906,5.0,1445715172,Inside Job (2010),Documentary
27,2,122882,5.0,1445715272,Mad Max: Fury Road (2015),Action|Adventure|Sci-Fi|Thriller
8,2,58559,4.5,1445715141,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
15,2,80489,4.5,1445715340,"Town, The (2010)",Crime|Drama|Thriller
10,2,68157,4.5,1445715154,Inglourious Basterds (2009),Action|Drama|War
2,2,1704,4.5,1445715228,Good Will Hunting (1997),Drama|Romance


In [81]:
# Recommendations
predictions

Unnamed: 0,movieId,title,genres
2223,2959,Fight Club (1999),Action|Crime|Drama|Thriller
1936,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
7398,80463,"Social Network, The (2010)",Drama
312,356,Forrest Gump (1994),Comedy|Drama|Romance|War
8850,134130,The Martian (2015),Adventure|Drama|Sci-Fi
508,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
6323,48780,"Prestige, The (2006)",Drama|Mystery|Sci-Fi|Thriller
3634,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy
7032,69122,"Hangover, The (2009)",Comedy|Crime
4795,7153,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy
