# Collaborative Filtering

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

We load the user data.

In [3]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv('data/ml-100k/u.user', sep='|', names=u_cols,
 encoding='latin-1')

display(users.shape, users.head())

(943, 5)

,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


We load the item data (the products, which here are movies).

In [4]:
i_cols = ['movie_id', 'title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')

display(movies.shape, movies.head())

(1682, 24)

,movie_id,title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


We drop all movie information except the movie ID and the title.

In [6]:
movies = movies[['movie_id', 'title']]

We load the ratings that link users to movies (100K records).

In [7]:
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings = pd.read_csv('data/ml-100k/u.data', sep='\t', names=r_cols,
 encoding='latin-1')

display(ratings.shape, ratings.head())

(100000, 4)

,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


We drop the `timestamp` column.

In [8]:
ratings = ratings.drop('timestamp', axis=1)

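We split the data into ***train*** and ***test*** sets, stratifying by `user_id` so that each user's ratings are divided between the two sets in roughly the same 75/25 proportion.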

In [9]:
from sklearn.model_selection import train_test_split

X = ratings.copy()
y = ratings['user_id']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify=y, random_state=42)

We import an evaluation metric. Here we use the mean squared error and take its square root to obtain the root mean squared error (*RMSE*).

In [10]:
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
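
As a quick sanity check of the metric, here is a small worked example. It assumes four made-up ratings and a constant prediction of 3; the values are illustrative and not taken from the dataset.

```python
import numpy as np

# Hypothetical true ratings and a constant prediction of 3 (illustrative values only).
y_true = np.array([2, 4, 2, 5])
y_pred = np.full(len(y_true), 3.0)

# RMSE = sqrt(mean((y_true - y_pred)^2))
squared_errors = (y_true - y_pred) ** 2      # element-wise squared errors: 1, 1, 1, 4
rmse_value = np.sqrt(squared_errors.mean())  # sqrt(7 / 4)

print(rmse_value)
```

Because the errors are squared before averaging, the single error of 2 contributes as much as four errors of 1.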

We define the baseline model so that it always returns 3.

In [11]:
def baseline(user_id, movie_id):
    return 3.0

We write the function that computes the *RMSE* score on the *test* set.

1. Build a list of user-movie tuples from the test data.
1. Predict the rating for each user-movie tuple.
1. Extract the actual ratings given by the users in the *test* data.
1. Compute the *RMSE*.

In [12]:
def score(cf_model):
    
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    
    y_true = np.array(X_test['rating'])
    
    return rmse(y_true, y_pred)

In [13]:
score(baseline)

1.2488234462885457
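
The scoring steps can also be sketched on a toy test set. This variant takes the test frame as an argument instead of reading the global `X_test`; the name `score_on`, the toy rows, and `constant_three` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def score_on(test_df, cf_model):
    """RMSE of cf_model over the (user_id, movie_id, rating) rows of test_df."""
    pairs = zip(test_df['user_id'], test_df['movie_id'])
    y_pred = np.array([cf_model(user, movie) for user, movie in pairs], dtype=float)
    y_true = test_df['rating'].to_numpy(dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy test set (made-up rows, not from MovieLens).
toy_test = pd.DataFrame({
    'user_id':  [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating':   [3, 5, 1],
})

def constant_three(user_id, movie_id):
    # Same idea as the baseline: ignore the inputs and predict 3.
    return 3.0

toy_rmse = score_on(toy_test, constant_three)
print(toy_rmse)
```

Passing the test frame explicitly makes the function easier to reuse with a different split, at the cost of one extra argument.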

## User-Based Collaborative Filtering

### *Ratings* Matrix

We build the *ratings* matrix with the `pivot_table` function.

In [14]:
r_matrix = X_train.pivot_table(values='rating', index='user_id', columns='movie_id')

r_matrix.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1671,1672,1673,1674,1676,1677,1679,1680,1681,1682
user_id,,,,,,,,,,,,,,,,,,,,,
1,5.0,3.0,4.0,,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,
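
To make the reshaping concrete, here is a minimal sketch of `pivot_table` on a toy long-format frame. The rows are made up; only the column names match the notebook.

```python
import pandas as pd

# Toy ratings in long format (made-up values).
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [5, 3, 4, 2],
})

# One row per user, one column per movie; user-movie pairs without a rating become NaN.
toy_matrix = toy.pivot_table(values='rating', index='user_id', columns='movie_id')

print(toy_matrix)
```

Most cells of the real matrix are `NaN` for the same reason: each user has rated only a small fraction of the movies.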


### Mean

User-based collaborative filtering using the mean *ratings*.

1. Check whether `movie_id` exists in `r_matrix`.
1. Compute the mean of all ratings given to the movie.
1. If there is no information, default to a rating of 3.

In [15]:
def cf_user_mean(user_id, movie_id):
    
    if movie_id in r_matrix:
        mean_rating = r_matrix[movie_id].mean()
    
    else:
        mean_rating = 3.0
    
    return mean_rating

We compute the *RMSE* for the *Mean* model.

In [16]:
score(cf_user_mean)

1.0300824802393536
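
Two details of the mean predictor are easy to miss: `in` applied to a *DataFrame* tests the column labels, and `mean()` skips `NaN` entries. A minimal sketch on a made-up matrix follows; `toy_matrix` and `toy_movie_mean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix: rows are users, columns are movie IDs (made-up values).
toy_matrix = pd.DataFrame(
    {10: [5.0, 4.0, np.nan], 20: [3.0, np.nan, np.nan]},
    index=pd.Index([1, 2, 3], name='user_id'),
)

def toy_movie_mean(movie_id, default=3.0):
    # `in` on a DataFrame tests the column labels, i.e. the movie IDs.
    if movie_id in toy_matrix:
        return toy_matrix[movie_id].mean()  # NaN entries are skipped
    return default

known = toy_movie_mean(10)    # mean of 5.0 and 4.0
unknown = toy_movie_mean(99)  # movie absent from the matrix, so the default applies
print(known, unknown)
```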

### Weighted Mean

We create a dummy *ratings* matrix with all null values imputed to 0.

In [17]:
r_matrix_dummy = r_matrix.copy().fillna(0)

We import the `cosine_similarity` function from the ***pairwise*** submodule and compute the cosine similarity matrix from the dummy *ratings* matrix.

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(r_matrix_dummy, r_matrix_dummy)

We convert it to a Pandas *DataFrame*.

In [19]:
cosine_sim = pd.DataFrame(cosine_sim, index=r_matrix.index, columns=r_matrix.index)

cosine_sim.head(10)

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id,,,,,,,,,,,,,,,,,,,,,
1,1.0,0.108361,0.046638,0.029577,0.245753,0.335853,0.344724,0.191582,0.057149,0.251979,...,0.257073,0.069412,0.231643,0.108093,0.176842,0.104799,0.232472,0.051528,0.129555,0.256333
2,0.108361,1.0,0.057613,0.130237,0.054918,0.190552,0.079399,0.076146,0.167992,0.147376,...,0.136993,0.252887,0.255454,0.285193,0.232751,0.149088,0.102807,0.062386,0.109143,0.107686
3,0.046638,0.057613,1.0,0.139805,0.0,0.032485,0.043869,0.080968,0.022263,0.059925,...,0.027402,0.0,0.17506,0.010343,0.105635,0.019052,0.127099,0.023917,0.060392,0.0
4,0.029577,0.130237,0.139805,1.0,0.0,0.04519,0.088586,0.199526,0.135013,0.026919,...,0.055392,0.049773,0.076549,0.139382,0.113886,0.0,0.130343,0.077357,0.15789,0.063911
5,0.245753,0.054918,0.0,0.0,1.0,0.176443,0.28186,0.132205,0.03879,0.1342,...,0.183969,0.019305,0.073714,0.041807,0.081088,0.029743,0.188392,0.068342,0.055557,0.207259
6,0.335853,0.190552,0.032485,0.04519,0.176443,1.0,0.394725,0.143385,0.125126,0.372679,...,0.328643,0.070809,0.135806,0.17167,0.125446,0.086464,0.230566,0.095478,0.197307,0.185268
7,0.344724,0.079399,0.043869,0.088586,0.28186,0.394725,1.0,0.215861,0.121224,0.378723,...,0.339853,0.110866,0.096055,0.10469,0.126108,0.075012,0.270071,0.020036,0.236086,0.266571
8,0.191582,0.076146,0.080968,0.199526,0.132205,0.143385,0.215861,1.0,0.116173,0.169088,...,0.150048,0.064242,0.118297,0.053969,0.168057,0.095736,0.164157,0.076269,0.089871,0.210995
9,0.057149,0.167992,0.022263,0.135013,0.03879,0.125126,0.121224,0.116173,1.0,0.152694,...,0.082819,0.0644,0.127051,0.069251,0.095673,0.0,0.131458,0.106763,0.089297,0.089583
10,0.251979,0.147376,0.059925,0.026919,0.1342,0.372679,0.378723,0.169088,0.152694,1.0,...,0.279849,0.087828,0.131888,0.111841,0.094423,0.080883,0.255758,0.063461,0.169309,0.181031
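
Each entry of this matrix is the cosine of the angle between two users' zero-filled rating vectors. A minimal sketch of that computation for a single pair, using made-up vectors:

```python
import numpy as np

# Two hypothetical users' ratings over four movies; 0 marks "not rated".
u = np.array([5.0, 3.0, 0.0, 0.0])
v = np.array([4.0, 0.0, 0.0, 2.0])

# cos(u, v) = (u . v) / (||u|| * ||v||)
cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
cos_uu = np.dot(u, u) / (np.linalg.norm(u) ** 2)  # a vector compared with itself

print(cos_uv, cos_uu)
```

Because unrated movies are stored as 0, only co-rated movies contribute to the numerator, while every rated movie contributes to the norms. Users with little overlap therefore get low similarity, and the diagonal of the matrix is 1.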


User-based collaborative filtering using weighted mean *ratings*.

1. Check whether `movie_id` exists in `r_matrix`.
1. Get the similarity scores between the user in question and all users.
1. Get the users' *ratings* for the movie in question.
1. Extract the indices that contain ***NaN*** in the `m_ratings` series.
1. Drop the ***NaN*** values from the `m_ratings` series.
1. Drop the corresponding cosine scores from the `sim_scores` series.
1. Compute the final weighted mean.
1. Default to a rating of 3 in the absence of information.

In [20]:
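def cf_user_wmean(user_id, movie_id):
    # Movies absent from the training matrix fall back to the default rating.
    if movie_id not in r_matrix:
        return 3.0

    # Similarity of the target user to every user, and every user's rating of the movie.
    sim_scores = cosine_sim[user_id]
    m_ratings = r_matrix[movie_id]

    # Keep only the users who actually rated the movie.
    idx = m_ratings[m_ratings.isnull()].index
    m_ratings = m_ratings.dropna()
    sim_scores = sim_scores.drop(idx)

    # Similarity-weighted mean; this is NaN when the remaining similarities sum to 0.
    return np.dot(sim_scores, m_ratings) / sim_scores.sum()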

In [21]:
def score(cf_model):
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    y_true = np.array(X_test['rating'])
    print(y_pred, y_true)

    # Diagnostic: locate NaN entries in the predictions and in the true ratings.
    list_pred = list(np.argwhere(np.isnan(y_pred)))
    list_true = list(np.argwhere(np.isnan(y_true)))

    # Only the first NaN position of each array is kept.
    if len(list_pred) > 0:
        list_pred = list(list_pred[0])
    if len(list_true) > 0:
        list_true = list(list_true[0])

    list_res = list_pred + list_true
    print(list_res)
    return rmse(y_true, y_pred)

In [22]:
score(cf_user_wmean)

[3.13546052 3.95395952 2.40621525 ... 4.19306903 3.18297794 4.15145525] [2 4 2 ... 5 4 4]
[20488]


ValueError: Input contains NaN.
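
A likely source of the `NaN` is the weighted-mean formula itself: when every user who rated the movie has similarity 0 with the target user, `sim_scores.sum()` is 0 and the division becomes `0 / 0`. The following is a minimal sketch of a guarded weighted mean, not the notebook's own fix; the helper name `weighted_mean` and the fallback of 3.0 are assumptions made for illustration.

```python
import numpy as np

def weighted_mean(sim_scores, ratings, default=3.0):
    """Similarity-weighted mean of ratings, with a fallback when all weights are zero."""
    sim_scores = np.asarray(sim_scores, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    total = sim_scores.sum()
    if total == 0:
        # 0 / 0 would produce NaN, so return the default instead.
        return default
    return float(np.dot(sim_scores, ratings) / total)

normal = weighted_mean([0.5, 0.25, 0.25], [4, 2, 5])  # (2 + 0.5 + 1.25) / 1
degenerate = weighted_mean([0.0, 0.0], [4, 5])        # all similarities are zero
print(normal, degenerate)
```

With a guard of this kind inside the predictor, the scoring function would not need to patch the predictions afterwards, although the resulting *RMSE* could differ slightly from the value reported in this notebook.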

In [23]:
def score(cf_model):
    id_pairs = zip(X_test['user_id'], X_test['movie_id'])
    y_pred = np.array([cf_model(user, movie) for (user, movie) in id_pairs])
    # Overwrite NaN predictions with 0.0 so that the RMSE can be computed.
    y_pred[np.isnan(y_pred)] = 0.0
    y_true = np.array(X_test['rating'])
    print(y_pred, y_true)
    # Diagnostic: after the replacement above, no NaN positions should remain.
    list_pred = list(np.argwhere(np.isnan(y_pred)))
    list_true = list(np.argwhere(np.isnan(y_true)))
    if len(list_pred) > 0:
        list_pred = list(list_pred[0])
    if len(list_true) > 0:
        list_true = list(list_true[0])
    list_res = list_pred + list_true
    print(list_res)
    return rmse(y_true, y_pred)

In [24]:
score(cf_user_wmean)

[3.13546052 3.95395952 2.40621525 ... 4.19306903 3.18297794 4.15145525] [2 4 2 ... 5 4 4]
[]


1.023662431714556

We merge the original users *DataFrame* with the *train* set.

In [25]:
merged_df = pd.merge(X_train, users)

merged_df.head()

,user_id,movie_id,rating,age,sex,occupation,zip_code
0,862,177,4,25,M,executive,13820
1,862,416,3,25,M,executive,13820
2,862,1093,5,25,M,executive,13820
3,862,168,4,25,M,executive,13820
4,862,568,3,25,M,executive,13820


We compute the mean *rating* of each movie by gender.

In [26]:
gender_mean = merged_df[['movie_id', 'sex', 'rating']].groupby(['movie_id', 'sex'])['rating'].mean()
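
The result of this `groupby` is a *Series* with a two-level index (movie, then sex). A minimal sketch on made-up rows shows how one level can be selected and how membership is tested; the toy frame and variable names are assumptions.

```python
import pandas as pd

# Made-up ratings with the rater's sex attached.
toy = pd.DataFrame({
    'movie_id': [10, 10, 10, 20],
    'sex':      ['F', 'M', 'M', 'F'],
    'rating':   [4, 2, 3, 5],
})

toy_gender_mean = toy.groupby(['movie_id', 'sex'])['rating'].mean()

# Selecting a movie on the first index level leaves a Series indexed by sex.
by_sex_10 = toy_gender_mean.loc[10]
by_sex_20 = toy_gender_mean.loc[20]

print(by_sex_10)
print('M' in by_sex_20)  # `in` on a Series tests its index labels
```

Only combinations that actually occur in the data get an entry, which is why the predictor has to check whether the user's gender is present before reading the mean.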

We set the index of the users *DataFrame* to `user_id`.

In [27]:
users = users.set_index('user_id')

Gender-based collaborative filtering using mean *ratings*.

1. Check whether `movie_id` exists in `r_matrix`.
1. Identify the user's gender.
1. Check whether that gender has rated the movie.
1. Compute the mean *rating* given to the movie by that gender.
1. Default to a rating of 3 in the absence of information.

In [29]:
def cf_gender(user_id, movie_id):
    
    if movie_id in r_matrix:
        
        gender = users.loc[user_id]['sex']

        if gender in gender_mean[movie_id]:
            
            gender_rating = gender_mean[movie_id][gender]
        
        else:
            gender_rating = 3.0
    
    else:
        
        gender_rating = 3.0
    
    return gender_rating

In [30]:
score(cf_gender)

[3.22727273 3.85714286 2.46428571 ... 4.31683168 3.11320755 4.19642857] [2 4 2 ... 5 4 4]
[]


1.0392906999935203

We compute the mean *rating* by gender and occupation.

In [31]:
gen_occ_mean = merged_df[['sex', 'rating', 'movie_id', 'occupation']].pivot_table(
    values='rating', index='movie_id', columns=['occupation', 'sex'], aggfunc='mean')

gen_occ_mean.head()

occupation,administrator,administrator,artist,artist,doctor,educator,educator,engineer,engineer,entertainment,...,salesman,salesman,scientist,scientist,student,student,technician,technician,writer,writer
sex,F,M,F,M,M,F,M,F,M,F,...,F,M,F,M,F,M,F,M,F,M
movie_id,,,,,,,,,,,,,,,,,,,,,
1,3.9375,3.75,5.0,3.4,3.666667,3.25,3.884615,4.0,4.083333,4.0,...,,4.0,3.5,4.0,4.043478,3.796296,4.0,3.75,4.0,3.0
2,3.0,3.666667,,,,4.0,3.5,,3.066667,,...,,,,3.0,2.666667,3.277778,,2.714286,,2.333333
3,3.5,4.0,,,,,2.0,,3.777778,,...,,,,,3.0,3.391304,,4.25,,1.0
4,3.666667,3.6,,4.666667,3.0,2.5,3.8,4.0,3.65,,...,4.0,4.0,,3.4,3.25,3.777778,,3.333333,4.25,3.25
5,4.0,2.333333,,,,4.0,2.333333,,3.5,,...,,,,4.0,4.333333,3.111111,,3.333333,4.0,2.0


Collaborative filtering based on gender and occupation using mean *ratings*.

1. Check whether `movie_id` exists in `gen_occ_mean`.
1. Identify the user.
1. Identify the gender and the occupation.
1. Check whether that occupation has rated the movie.
1. Check whether that gender has rated the movie.
1. Extract the required *rating*.
1. Default to 3 if the *rating* is *NaN*.
1. Return the default rating if any of the checks fails.

In [32]:
def cf_gen_occ(user_id, movie_id):
    
    if movie_id in gen_occ_mean.index:
        
        user = users.loc[user_id]
        
        gender = user['sex']
        occ = user['occupation']
        
        if occ in gen_occ_mean.loc[movie_id]:
            
            if gender in gen_occ_mean.loc[movie_id][occ]:
                
                rating = gen_occ_mean.loc[movie_id][occ][gender]
                
                if np.isnan(rating):
                    rating = 3.0
                
                return rating
              
    return 3.0
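
The nested membership checks can also be expressed as a single dictionary lookup. The following is a sketch of that alternative on made-up rows, not the notebook's implementation; `group_means` and `toy_gen_occ` are hypothetical names.

```python
import pandas as pd

# Made-up ratings with the rater's sex and occupation attached.
toy = pd.DataFrame({
    'movie_id':   [10, 10, 10, 20],
    'sex':        ['F', 'M', 'M', 'F'],
    'occupation': ['writer', 'writer', 'writer', 'student'],
    'rating':     [4, 2, 3, 5],
})

# Mean rating per (movie, occupation, sex); only combinations that occur get a key.
group_means = toy.groupby(['movie_id', 'occupation', 'sex'])['rating'].mean().to_dict()

def toy_gen_occ(movie_id, occupation, sex, default=3.0):
    # A missing combination falls back to the default rating.
    return group_means.get((movie_id, occupation, sex), default)

seen = toy_gen_occ(10, 'writer', 'M')     # mean of 2 and 3
unseen = toy_gen_occ(20, 'student', 'M')  # no such group in the toy data
print(seen, unseen)
```

Since the dictionary holds only observed groups, there are no `NaN` cells to test for, which removes the need for the `np.isnan` check.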

In [33]:
score(cf_gen_occ)

[3.         3.5        1.         ... 4.55555556 3.5        4.32142857] [2 4 2 ... 5 4 4]
[]


1.1419651376788005

## Model-Based Approaches

We import the required classes and functions from the Surprise library (distributed on PyPI as `scikit-surprise`).

* The `Reader` object helps parse the file or *DataFrame* containing the *ratings*.
* We create the dataset that will be used to build the filter.
* We define the `kNN` algorithm object.
* We evaluate the model.

In [36]:
from surprise import Reader, Dataset, KNNBasic
from surprise.model_selection import cross_validate

reader = Reader()

data = Dataset.load_from_df(ratings, reader)

knn = KNNBasic()

cross_validate(knn, data, measures=['RMSE'], cv=5, verbose=True)

ModuleNotFoundError: No module named 'surprise'

We import the algorithm that performs singular value decomposition (SVD), apply it to our data, and evaluate it.

In [None]:
from surprise import SVD

svd = SVD()

cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)
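
To illustrate the idea behind a low-rank factorization, here is a sketch using NumPy's exact SVD on a small, fully observed toy matrix. It is not what Surprise's `SVD` class does internally: Surprise learns its factors by stochastic gradient descent over the observed ratings only, whereas this sketch decomposes a complete matrix. The matrix values and the choice of `k` are made up.

```python
import numpy as np

# Small, fully observed toy matrix (made-up ratings); rows are users, columns are movies.
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # number of latent factors to keep (arbitrary choice for the sketch)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation equals the norm of the discarded singular values.
error = np.linalg.norm(R - R_k)
print(np.round(R_k, 2))
print(error)
```

The first two users have similar tastes, so two factors already reproduce the matrix closely; the same intuition motivates latent-factor models for sparse rating data.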