<h1 align=center><font size = 5>CONTENT-BASED FILTERING</font></h1>

Recommender systems are collections of algorithms that suggest items to users based on information about those users. They commonly appear in online stores, movie databases, and job search sites. In this lab, we explore content-based recommender systems and implement a simple version of one using Python and the pandas library.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
movies_df = pd.read_csv('../Datasets/moviedataset/ml-latest/movies.csv', sep = ',')
ratings_df = pd.read_csv('../Datasets/moviedataset/ml-latest/ratings.csv', sep = ',')

In [3]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


#### Data Preprocessing <br>

In [5]:
#Match the parentheses as well, so that years appearing inside a title are not picked up
movies_df['Year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand = False)
#Strip the parentheses
movies_df['Year'] = movies_df.Year.str.extract(r'(\d\d\d\d)', expand = False)
#Remove the year from the 'title' column (regex = True makes the pattern matching explicit)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex = True)
#Trim any leftover whitespace
movies_df['title'] = movies_df.title.str.strip()



In [6]:
movies_df.head()

Unnamed: 0,movieId,title,genres,Year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
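
To see why the pattern includes the parentheses, here is a minimal sketch on a few hand-picked titles. The sample titles are illustrative assumptions and are not taken from the output above. A bare four-digit pattern matches a number inside the title first, while the parenthesized pattern only matches the release year.

```python
import pandas as pd

# Illustrative titles (assumed for this sketch); two of them contain a
# four-digit number in the title itself.
titles = pd.Series([
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",
    "Blade Runner 2049 (2017)",
])

# A bare four-digit pattern grabs the first number it finds.
naive_year = titles.str.extract(r"(\d{4})", expand=False)

# Requiring the surrounding parentheses picks up only the release year.
year = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(YYYY)" suffix and trim the remaining whitespace.
clean_title = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(naive_year.tolist())   # ['1995', '2001', '2049']
print(year.tolist())         # ['1995', '1968', '2017']
print(clean_title.tolist())  # ['Toy Story', '2001: A Space Odyssey', 'Blade Runner 2049']
```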


Next, we split the values in the **genres** column into a **list of genres** to simplify later use. This can be done by applying Python's string split function to that column.

In [7]:
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,Year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


#### Applying One-Hot Encoding to the genres Column

In [8]:
#Copy the movies dataframe into a new one, since movies_df itself will not need the genre information
moviesWithGenres_df = movies_df.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 in the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
#Fill the NaN values with 0 to show that a movie does not have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head(10)

Unnamed: 0,movieId,title,genres,Year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,6,Heat,"[Action, Crime, Thriller]",1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,7,Sabrina,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,8,Tom and Huck,"[Adventure, Children]",1995,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,9,Sudden Death,[Action],1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,10,GoldenEye,"[Action, Adventure, Thriller]",1995,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
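
The row-by-row loop above can be slow on a table of this size. As a possible alternative, the sketch below builds the same kind of indicator columns with `Series.str.get_dummies`. It runs on a three-movie toy frame (`movies` is a hypothetical stand-in for `movies_df`), and it differs from the loop in two ways: the flags are integers rather than floats, and the genre columns come out in alphabetical order.

```python
import pandas as pd

# Toy stand-in for movies_df after the genres column has been split into lists.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
    ],
})

# Join each list back into a "|"-separated string, then let pandas
# create one 0/1 column per distinct genre.
genre_flags = movies["genres"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the original frame.
movies_with_genres = pd.concat([movies, genre_flags], axis=1)
print(movies_with_genres.drop(columns=["genres"]))
```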


In [9]:
#Drop the genres column from movies_df, since we will not use it there
movies_df.drop('genres', axis = 1, inplace = True)

### Next, we work with ratings_df

In [10]:
ratings_df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496
5,2,112552,5.0,1436165496
6,2,112556,4.0,1436165499
7,3,356,4.0,920587155
8,3,2394,4.0,920586920
9,3,2431,5.0,920586945


In [11]:
#Drop the timestamp column, since we will not use it
ratings_df.drop('timestamp', axis = 1, inplace = True)
ratings_df.columns

Index(['userId', 'movieId', 'rating'], dtype='object')

<a id="ref3"></a>

# Content-Based Recommendation System

In [12]:
UserInput = [
            {'title': 'Toy Story', 'rating':2.5},
            {'title': 'Jumanji', 'rating':3.0},
            {'title': 'Grumpier Old Men', 'rating':5.0},
            {'title': 'Waiting to Exhale', 'rating':3.5},
            {'title': 'Father of the Bride Part II', 'rating':4.0},
            {'title': 'Heat', 'rating':5.0},
            {'title': 'Sabrina', 'rating':4.0},
            {'title': 'Tom and Huck', 'rating':4.0}
            ]
input_movies = pd.DataFrame(UserInput)
input_movies

Unnamed: 0,title,rating
0,Toy Story,2.5
1,Jumanji,3.0
2,Grumpier Old Men,5.0
3,Waiting to Exhale,3.5
4,Father of the Bride Part II,4.0
5,Heat,5.0
6,Sabrina,4.0
7,Tom and Huck,4.0


With the input complete, let's extract the movie IDs from the movies dataframe and add them to the input.

In [13]:
#Filter the movies by title
inputId = movies_df[movies_df['title'].isin(input_movies['title'].tolist())]
#Then merge to get the movieId. The merge is implicitly done on title,
#so different movies that share a title (e.g. Heat, Sabrina) all receive the same rating.
input_movies = pd.merge(inputId, input_movies)
#Drop information we will not use from the input dataframe
input_movies.drop('Year', axis = 1, inplace = True)
input_movies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,2.5
1,2,Jumanji,3.0
2,3,Grumpier Old Men,5.0
3,4,Waiting to Exhale,3.5
4,5,Father of the Bride Part II,4.0
5,6,Heat,5.0
6,73608,Heat,5.0
7,131274,Heat,5.0
8,7,Sabrina,4.0
9,915,Sabrina,4.0
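
The output above contains several rows for Heat and Sabrina because the merge matches on title alone. One way to avoid this, sketched below under the assumption that the user also supplies a release year (the `Year` field in `user_input` is hypothetical and is not part of the original input), is to merge on both title and year. The movie IDs and years in the toy frame are the ones shown in this notebook's outputs.

```python
import pandas as pd

# Toy slice of the catalogue: two movies called "Heat", two called "Sabrina".
movies = pd.DataFrame({
    "movieId": [6, 73608, 7, 915],
    "title": ["Heat", "Heat", "Sabrina", "Sabrina"],
    "Year": ["1995", "1972", "1995", "1954"],
})

# Hypothetical user input that also carries the release year.
user_input = pd.DataFrame({
    "title": ["Heat", "Sabrina"],
    "Year": ["1995", "1995"],
    "rating": [5.0, 4.0],
})

# Merging on title only matches every movie that shares the title.
by_title = pd.merge(movies, user_input[["title", "rating"]], on="title")

# Merging on title and year keeps one row per rated movie.
by_title_year = pd.merge(movies, user_input, on=["title", "Year"])

print(len(by_title))                      # 4
print(by_title_year["movieId"].tolist())  # [6, 7]
```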


In [14]:
#Select the input movies from the dataframe that holds the genre columns
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(input_movies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,Year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,6,Heat,"[Action, Crime, Thriller]",1995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,7,Sabrina,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,8,Tom and Huck,"[Adventure, Children]",1995,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
898,915,Sabrina,"[Comedy, Romance]",1954,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14752,73608,Heat,"[Comedy, Drama, Romance]",1972,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We only need the genre table itself, so let's tidy up by resetting the index and dropping the movieId, title and genres columns. The Year column is still present at this point and is dropped in a later step.

In [15]:
#Reset the index to avoid problems later on
userMovies = userMovies.reset_index(drop=True)
#Drop unnecessary columns to save memory and avoid conflicts
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres'])
userGenreTable



Unnamed: 0,Year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1995,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1954,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1972,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we are ready to start learning the user's preferences!

To do so, we turn each genre into a weight. We multiply each movie's rating by that movie's row in the user's genre table and then sum the resulting table by column. This operation is a matrix-vector product: each genre column is dotted with the rating vector. We compute it with the pandas `dot` method.
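
As a small worked example of this product, the sketch below uses three movies and three genres. The variable names and the toy values are assumptions for illustration; only the operation itself mirrors the notebook.

```python
import pandas as pd

# One row per rated movie, one 0/1 column per genre (toy values).
genre_table = pd.DataFrame({
    "Adventure": [1, 1, 0],
    "Comedy":    [1, 0, 1],
    "Romance":   [0, 0, 1],
})

# One rating per movie, in the same row order as genre_table.
ratings = pd.Series([2.5, 3.0, 5.0])

# Transposing gives one row per genre; the dot product then adds up
# the ratings of all movies that carry that genre.
profile = genre_table.T.dot(ratings)

print(profile.to_dict())  # {'Adventure': 5.5, 'Comedy': 7.5, 'Romance': 5.0}
```

Adventure, for instance, receives 2.5 + 3.0 = 5.5 because only the first two movies carry that genre.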

In [16]:
input_movies['rating']

0     2.5
1     3.0
2     5.0
3     3.5
4     4.0
5     5.0
6     5.0
7     5.0
8     4.0
9     4.0
10    4.0
Name: rating, dtype: float64

In [17]:
#Cast the ratings to integers; note that this truncates 2.5 to 2 and 3.5 to 3
input_movies['rating'] = input_movies['rating'].astype('int')

In [18]:
#Dot product to obtain the weights
userProfile = userGenreTable.transpose().dot(input_movies['rating'])
#The user profile
userProfile

Year                  1995199519951995199519951995199519951995199519...
Adventure                                                          10.0
Animation                                                           2.0
Children                                                           10.0
Comedy                                                             27.0
Fantasy                                                             5.0
Romance                                                            21.0
Drama                                                              11.0
Action                                                              9.0
Crime                                                               5.0
Thriller                                                            9.0
Horror                                                              0.0
Mystery                                                             0.0
Sci-Fi                                                          
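
The `Year` entry in the output above looks like year strings glued together, which suggests that the column still held text when the product was computed (in Python, multiplying a string by an integer repeats it). A possible variation, sketched here on toy data with hypothetical names, is to drop `Year` before taking the dot product. The table is then purely numeric, so the ratings can stay as floats.

```python
import pandas as pd

# Plain Python: a string times an integer is the string repeated.
repeated = "1995" * 2
print(repeated)  # 19951995

# Toy genre table that still carries a text Year column.
table = pd.DataFrame({
    "Year":    ["1995", "1954"],
    "Comedy":  [1.0, 1.0],
    "Romance": [0.0, 1.0],
})
ratings = pd.Series([2.5, 4.0])

# Drop the non-numeric column first, then compute the genre weights
# with the original (non-truncated) ratings.
numeric_table = table.drop(columns=["Year"])
profile = numeric_table.T.dot(ratings)

print(profile.to_dict())  # {'Comedy': 6.5, 'Romance': 4.0}
```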

We now have a weight for each of the user's preferences. This is known as the User Profile. Using it, we can recommend movies that match the user's preferences.

In [19]:
#Now let's get the genres of every movie from the original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
#And drop the unnecessary information
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres'])
genreTable.head()



movieId,Year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
genreTable.shape

(34208, 21)

In [21]:
genreTable

movieId,Year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151697,1967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
151701,2010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
151703,2009,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
151709,2015,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
genreTable.drop('Year',axis = 1, inplace= True)

In [23]:
userProfile.drop('Year', axis = 0, inplace= True)

In [24]:
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

movieId
1    0.495413
2    0.229358
3    0.440367
4    0.541284
5    0.247706
dtype: float64
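
To make the score concrete: for each movie, the weights of the genres it carries are added up and divided by the sum of all weights, so every score falls between 0 and 1. The sketch below works through two hypothetical movies (IDs 101 and 102 are made up) using three of the weights from the user profile above, and checks that a plain dot product gives the same numbers.

```python
import pandas as pd

# Three genre weights taken from the user profile shown earlier.
user_profile = pd.Series({"Adventure": 10.0, "Comedy": 27.0, "Romance": 21.0})

# Two made-up movies with 0/1 genre flags, indexed by movieId.
genre_table = pd.DataFrame(
    {"Adventure": [1, 0], "Comedy": [1, 1], "Romance": [0, 1]},
    index=pd.Index([101, 102], name="movieId"),
)

# Same expression as in the notebook: weight each flag, sum per movie,
# and normalise by the total weight.
scores = (genre_table * user_profile).sum(axis=1) / user_profile.sum()

# Equivalent formulation as a matrix-vector product.
scores_dot = genre_table.dot(user_profile) / user_profile.sum()

print(scores.round(4).to_dict())  # {101: 0.6379, 102: 0.8276}
```

Movie 101 scores (10 + 27) / 58 and movie 102 scores (27 + 21) / 58, so the comedy-romance title ranks higher for this profile.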

In [25]:
#Sort the recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Take a look at the values
recommendationTable_df.head()

movieId
4956     0.798165
83266    0.798165
26093    0.788991
27344    0.779817
76153    0.770642
dtype: float64

Now let's look at the recommendation table!


In [26]:
#The final recommendation table
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,Year
376,380,True Lies,1994
1824,1907,Mulan,1998
4625,4719,Osmosis Jones,2001
4861,4956,"Stunt Man, The",1980
8605,26093,"Wonderful World of the Brothers Grimm, The",1962
8710,26236,"White Sun of the Desert, The (Beloe solntse pu...",1970
9296,27344,Revolutionary Girl Utena: Adolescence of Utena...,1999
9697,31367,"Chase, The",1994
10298,34435,Sholay,1975
10704,42015,Casanova,2005
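
Note that filtering with `isin` returns the movies in their original dataframe order, so the ranking by score is not visible in the table above. A minimal sketch of one way to keep the ranking and show the score next to each title follows. The frame, the scores, and the variable names are toy assumptions.

```python
import pandas as pd

# Toy catalogue and toy scores indexed by movieId.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
})
scores = pd.Series(
    [0.2, 0.9, 0.5, 0.7],
    index=pd.Index([1, 2, 3, 4], name="movieId"),
    name="score",
)

# Take the best-scoring movies, turn the Series into a frame with
# movieId and score columns, and merge the titles onto it. The inner
# merge keeps the order of the left frame, i.e. the ranking.
top = scores.nlargest(3).reset_index().merge(movies, on="movieId")

print(top["title"].tolist())  # ['B', 'D', 'C']
```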
