# CONTENT-BASED RECOMMENDER SYSTEM

We start by importing the required libraries and downloading a few resources with nltk.

In [10]:
import pandas as pd 
import numpy as np
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

from itertools import combinations

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ronaldo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ronaldo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The data comes from the MovieLens dataset, slightly modified: the original did not include plot summaries (the ```argumento``` column), which have been added.

In [11]:
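path = r".\archivos\MovieLens_con_argumento.csv"
# Later cells refer to this DataFrame as pelis_movielens; movies is kept as an alias.
pelis_movielens = pd.read_csv(path, sep=',')
movies = pelis_movielens
movies.head()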

Unnamed: 0,movieId,title,genres,argumento
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,A cowboy doll is profoundly threatened and jea...
1,2,Jumanji (1995),Adventure|Children|Fantasy,When two kids find and play a magical board ga...
2,3,Grumpier Old Men (1995),Comedy|Romance,John and Max resolve to save their beloved bai...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"Based on Terry McMillan's novel, this film fol..."
4,5,Father of the Bride Part II (1995),Comedy,George Banks must deal not only with the pregn...


Each film has a ```movieId``` that does not match the DataFrame index, so we build two dictionaries that map between the ```index``` and the ```movieId```.

In [12]:
# Lookup tables between movieId and index, to save lines of code later
dicc_index_movieid = movies['movieId'].to_dict()
dicc_movieid_index = {value: key for key, value in dicc_index_movieid.items()}
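
As a minimal sketch of what these two lookup tables contain, assume a hypothetical three-row frame (the `toy` DataFrame and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical frame: the positional index is 0, 1, 2 but the movieId values skip numbers.
toy = pd.DataFrame({"movieId": [1, 2, 5], "title": ["A", "B", "C"]})

# index -> movieId, built the same way as dicc_index_movieid
index_to_movieid = toy["movieId"].to_dict()

# movieId -> index, the inverse mapping
movieid_to_index = {value: key for key, value in index_to_movieid.items()}

print(index_to_movieid)
print(movieid_to_index)
```

The inverse mapping is only well defined because every ```movieId``` is assumed to be unique.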

To build a content-based recommender system we need to extract information from each item (film). Here we use each film's plot summary and its genres to find other similar films.

The technique used is **Bag-Of-Words**: each text is converted into a vector that represents it, and **TF-IDF** is used to weight the relevance of each word within each plot summary in the DataFrame.
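
The two steps can be sketched on a toy corpus. The three sentences below are invented for illustration; the calls are Scikit-Learn's ```CountVectorizer``` and ```TfidfTransformer``` with their default settings, which is an assumption and not necessarily the configuration used later in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented mini-corpus standing in for the plot summaries.
docs = [
    "a toy cowboy meets a toy astronaut",
    "two kids play a magical board game",
    "a cowboy rides into town",
]

# Bag-of-Words: one row per document, one column per word, values are raw counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# TF-IDF: reweight the counts so that words appearing in many documents weigh less.
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.shape)
print(np.linalg.norm(tfidf.toarray(), axis=1))  # each row has unit length by default
```

With the default settings each TF-IDF row is L2-normalised, so every document vector has length 1.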

The main problems with this technique are:

- Very frequent words that add no meaning to the text (connectors, articles, etc.)

- Difficulty comparing texts that differ greatly in length

For the first problem we use NLTK's ```stopwords``` corpus, which lists the most common words that add no meaning to the analysed text, and we also remove punctuation. As an additional measure, words are reduced to their stems with ```SnowballStemmer``` so that different forms of the same word compare as equal.

The texts to analyse are of similar length, so we can ignore the second problem.
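
As a side illustration (not part of the original analysis), the sketch below shows that cosine similarity depends only on the direction of a vector, not on its magnitude. The vectors are invented; `cosine` is a hypothetical helper written with plain Numpy. This softens the length problem when a longer text merely repeats the same words, although it does not help when a longer text introduces many additional words.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented word-count vectors.
short_text = np.array([1.0, 2.0, 0.0])
longer_text = 3 * short_text          # same words, each repeated three times
other_text = np.array([0.0, 1.0, 4.0])

print(cosine(short_text, longer_text))
print(cosine(short_text, other_text), cosine(longer_text, other_text))
```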

To build the vector we use Scikit-Learn's ```CountVectorizer```, which automates the word counting and accepts a preprocessing function as an argument, that is, a function applied to every text in the DataFrame to prepare the data. To make the vocabulary more meaningful we also set the ```min_df``` parameter, the minimum number of distinct plot summaries a word must appear in to be taken into account.
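
The effect of ```min_df``` can be seen on a toy corpus. The three sentences are invented, and ```min_df=2``` is chosen only to keep the example small:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus: "bank" and "robber" each appear in two documents, the rest in one.
docs = ["bank robber escapes", "bank heist fails", "robber returns"]

# Keep only words that appear in at least 2 different documents.
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(docs)

kept = sorted(vectorizer.vocabulary_)
print(kept)  # ['bank', 'robber']
```

Words that occur in a single document cannot contribute to the similarity between two different documents, so dropping them shrinks the vocabulary without losing shared signal.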


In [15]:
def preprocess(argument):

    # Lowercase the whole text and remove numeric values
    s = argument.lower()
    s = re.sub(r"\d+", "", s)

    # Remove punctuation and words that carry no meaning for the analysis
    excluded = set(stopwords.words('english')) | set(string.punctuation)
    words = word_tokenize(s)
    words = [x for x in words if x not in excluded]

    # Reduce each word to its stem
    stemmer = SnowballStemmer('english')
    roots = [stemmer.stem(word) for word in words]

    # Join the list of stems back into a string for CountVectorizer
    s = ' '.join(roots)

    return s

In [16]:
contador_argumentos = CountVectorizer(preprocessor=preprocess, min_df=5)

argumentos_bag_of_words = (contador_argumentos
                           .fit_transform(movies['argumento'])
                           .toarray())

For easier inspection and manipulation, the generated *array* is converted into a *DataFrame* whose columns are the words in alphabetical order.

In [18]:
columnas_argumentos = [tup[0] for tup in
                       sorted(contador_argumentos.vocabulary_.items(),
                              key=lambda x: x[1])]

argumentos_bag_of_words_df = pd.DataFrame(
    argumentos_bag_of_words,
    columns = columnas_argumentos,
    index = pelis_movielens['title']
)

argumentos_bag_of_words_df.head()

title,abandon,abduct,abil,abl,aboard,aborigin,abort,abroad,abrupt,absenc,...,youngster,youth,yuppi,zealand,zero,zeus,zombi,zone,zoo,zoologist
Toy Story (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Jumanji (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Grumpier Old Men (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Waiting to Exhale (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Father of the Bride Part II (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As mentioned earlier, the genres are also included in the Bag-Of-Words. To do so we build every combination of the film's genres taken two at a time, as well as each genre on its own.

The genres come as a string such as ```Animation|Children|Comedy|Fantasy```, so it has to be split before the combinations can be generated.

In [19]:
from itertools import combinations

def tokenizador_generos(string_generos):
    generos_separados = string_generos.split('|')
    resultado = []
    for tamaño in [1,2]:
        combs = ['Género - ' + '|'.join(sorted(tupla))
                 for tupla in combinations(generos_separados, r=tamaño)]
        resultado = resultado + combs
    return sorted(resultado)

In [20]:
tokenizador_generos('Animation|Children|Comedy|Fantasy')

['Género - Animation',
 'Género - Animation|Children',
 'Género - Animation|Comedy',
 'Género - Animation|Fantasy',
 'Género - Children',
 'Género - Children|Comedy',
 'Género - Children|Fantasy',
 'Género - Comedy',
 'Género - Comedy|Fantasy',
 'Género - Fantasy']

Now we do the counting. In this case the text needs no preprocessing; we only need the genre combinations to be counted as if they were words.

In [23]:
contador_generos = CountVectorizer(tokenizer=tokenizador_generos,
                                   token_pattern=None,
                                   lowercase=False)

# fit_transform learns the vocabulary and builds the count matrix in one step
generos_bag_of_words = contador_generos.fit_transform(pelis_movielens['genres']).toarray()

columnas_generos = [tup[0] for tup in
                    sorted(contador_generos.vocabulary_.items(),
                           key=lambda x: x[1])]

generos_bag_of_words_df = pd.DataFrame(
    generos_bag_of_words,
    columns = columnas_generos,
    index = pelis_movielens['title']
)

generos_bag_of_words_df

title,Género - (no genres listed),Género - Action,Género - Action|Adventure,Género - Action|Animation,Género - Action|Children,Género - Action|Comedy,Género - Action|Crime,Género - Action|Documentary,Género - Action|Drama,Género - Action|Fantasy,...,Género - Sci-Fi,Género - Sci-Fi|Thriller,Género - Sci-Fi|War,Género - Sci-Fi|Western,Género - Thriller,Género - Thriller|War,Género - Thriller|Western,Género - War,Género - War|Western,Género - Western
Toy Story (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Jumanji (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Grumpier Old Men (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Waiting to Exhale (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Father of the Bride Part II (1995),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Black Butler: Book of the Atlantic (2017),0,1,0,1,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
No Game No Life: Zero (2017),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Flint (2017),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bungo Stray Dogs: Dead Apple (2018),0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
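
The genre counting step can be seen in isolation with a small sketch. The tokenizer `split_genres` and the three genre strings are hypothetical and deliberately simplified (single genres only, no pairs), so this is not the notebook's ```tokenizador_generos```; it only shows how ```CountVectorizer``` treats whatever tokens a custom tokenizer returns as if they were words.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_genres(genre_string):
    """Hypothetical simplified tokenizer: one token per genre."""
    return genre_string.split("|")

# token_pattern=None and lowercase=False mirror the settings used in the cell above.
vec = CountVectorizer(tokenizer=split_genres, token_pattern=None, lowercase=False)
X = vec.fit_transform(["Comedy|Romance", "Comedy", "Action|Comedy"]).toarray()

# Column names ordered by their column position in X.
cols = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(cols)
print(X)
```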


In [10]:
bag_of_words_ambos = np.hstack((argumentos_bag_of_words, generos_bag_of_words))

bag_of_words_ambos_df = pd.DataFrame(bag_of_words_ambos,
                                     columns=columnas_argumentos+columnas_generos,
                                     index=pelis_movielens["title"])

In [11]:
tf_idf = TfidfTransformer()
tf_idf_pelis = tf_idf.fit_transform(bag_of_words_ambos_df).toarray()

tf_idf_pelis_df = pd.DataFrame(tf_idf_pelis,
                               columns=columnas_argumentos+columnas_generos,
                               index=pelis_movielens["title"])

tf_idf_pelis_df[3349:3355]

title,abandon,abduct,abil,abl,aboard,aborigin,abort,abroad,abrupt,absenc,...,Género - Sci-Fi,Género - Sci-Fi|Thriller,Género - Sci-Fi|War,Género - Sci-Fi|Western,Género - Thriller,Género - Thriller|War,Género - Thriller|Western,Género - War,Género - War|Western,Género - Western
Short Circuit 2 (1988),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.172138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Short Circuit (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.16808,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Vanishing, The (Spoorloos) (1988)",0.0,0.311558,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.120505,0.0,0.0,0.0,0.0,0.0
"Tetsuo, the Ironman (Tetsuo) (1988)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.12123,0.164459,0.0,0.0,0.097004,0.0,0.0,0.0,0.0,0.0
They Live (1988),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.098617,0.133783,0.0,0.0,0.07891,0.0,0.0,0.0,0.0,0.0
Tucker: The Man and His Dream (1988),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
cosine_sims = cosine_similarity(tf_idf_pelis_df)
matrix_similaridades_df = pd.DataFrame(cosine_sims,
                                       index=pelis_movielens["title"],
                                       columns=pelis_movielens["title"])

In [13]:
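# Mask each film's similarity with itself so it never ranks as its own neighbour.
# The outputs below show NaN in cosine_sims as well, so this relies on the
# DataFrame sharing memory with that array.
np.fill_diagonal(matrix_similaridades_df.values, np.nan)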

We can now look up the films most similar to any film in our list. We simply go to the row for the film of interest and pick the K films with the highest cosine similarity to it.

To speed up the calculations, we can use Numpy to find the column order that sorts the values of each row from highest to lowest:
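
A direct version of that row lookup could look like the sketch below. The 4×4 similarity matrix and the titles "A" to "D" are invented, and using pandas' ```nlargest``` is one possible approach, not the one the notebook follows.

```python
import numpy as np
import pandas as pd

titles = ["A", "B", "C", "D"]

# Invented symmetric similarity matrix with the diagonal masked as NaN.
sims = pd.DataFrame(
    [[np.nan, 0.9, 0.1, 0.4],
     [0.9, np.nan, 0.3, 0.2],
     [0.1, 0.3, np.nan, 0.7],
     [0.4, 0.2, 0.7, np.nan]],
    index=titles, columns=titles,
)

# Top 2 neighbours of film "A": take its row and keep the two largest values.
top2 = sims.loc["A"].nlargest(2)
print(top2)
```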

In [14]:
orden_pelis_cosine_sim_por_fila = np.argsort(-cosine_sims, axis=1)
orden_pelis_cosine_sim_por_fila

array([[2998, 2353, 7351, ..., 3826, 7020,    0],
       [6062, 9627, 8026, ..., 5353, 9731,    1],
       [8904,  755, 1513, ..., 1169, 1024,    2],
       ...,
       [ 100, 6685, 2753, ..., 4354, 9731, 9729],
       [7037, 3406, 8446, ..., 3861, 9731, 9730],
       [8664, 8625, 9685, ..., 5506, 5515, 9731]], dtype=int64)

So, for the first row (the first film, Toy Story), the most similar film is the one in column 2998, the second most similar is the one in column 2353, and so on (we will look up which films these are later).

We also generate another Numpy matrix with the cosine similarities of each row sorted from highest to lowest:
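
The same trick on an invented 3×3 matrix makes the ordering easier to follow. Negating the values turns Numpy's ascending sort into a descending one, and NaN entries (the masked diagonal) are placed last:

```python
import numpy as np

# Invented similarity matrix with the diagonal masked as NaN.
sims = np.array([[np.nan, 0.2, 0.9],
                 [0.2, np.nan, 0.5],
                 [0.9, 0.5, np.nan]])

# For each row, the column indices ordered from most to least similar.
order = np.argsort(-sims, axis=1)
print(order)
```

In the first row the largest value (0.9) is in column 2, then 0.2 in column 1, and the NaN in column 0 comes last, which is why each film's own index appears at the end of its row in the output above.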

In [15]:
cosine_sims_ordenadas = -np.sort(-cosine_sims, axis=1)
cosine_sims_ordenadas

array([[0.40397622, 0.39966961, 0.3888024 , ..., 0.        , 0.        ,
               nan],
       [0.43105048, 0.41952491, 0.32348353, ..., 0.        , 0.        ,
               nan],
       [0.34351053, 0.27375868, 0.26749533, ..., 0.        , 0.        ,
               nan],
       ...,
       [0.23809936, 0.23686512, 0.23300361, ..., 0.        , 0.        ,
               nan],
       [0.23335249, 0.21625889, 0.20622123, ..., 0.        , 0.        ,
               nan],
       [0.252336  , 0.252336  , 0.23716675, ..., 0.        , 0.        ,
               nan]])
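
A small check on invented data shows how the two matrices relate: gathering the original values in the order given by ```np.argsort``` yields the same result as sorting the values directly. ```np.take_along_axis``` is used here only for the check and is not part of the notebook.

```python
import numpy as np

# Invented similarity matrix with the diagonal masked as NaN.
sims = np.array([[np.nan, 0.2, 0.9],
                 [0.2, np.nan, 0.5],
                 [0.9, 0.5, np.nan]])

order = np.argsort(-sims, axis=1)          # column indices, most similar first
sorted_vals = -np.sort(-sims, axis=1)      # similarity values, highest first

# Reading sims in the order given by `order` reproduces sorted_vals.
gathered = np.take_along_axis(sims, order, axis=1)
print(np.allclose(sorted_vals, gathered, equal_nan=True))
```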

With these two matrices (```orden_pelis_cosine_sim_por_fila``` and ```cosine_sims_ordenadas```) we can now easily work out which films are most similar to any given one, and what those similarity values are.

To make this easier, we create a function that takes a movieId and a number k as arguments and returns the top k films most similar to it:

In [16]:
def top_k_similares(movieId, k):
    """
    Return a DataFrame with the top k
    films most similar to the one
    given in the movieId argument.
    """
    
    # Convert the movieId to the row number
    # it has in our matrices
    # orden_pelis_cosine_sim_por_fila and
    # cosine_sims_ordenadas:
    fila_cosine_sims = dicc_movieid_index[movieId]
    
    # Select that row in both matrices:
    lista_ordenada_peliculas_sim = orden_pelis_cosine_sim_por_fila[fila_cosine_sims]
    lista_ordenada_similaridades = cosine_sims_ordenadas[fila_cosine_sims]
    
    # Keep the top k:
    top_k = lista_ordenada_peliculas_sim[:k]
    cosine_sims_de_las_top_k = lista_ordenada_similaridades[:k]
    
    # Build the output: take the original
    # dataset, keep only the top k films,
    # and add a column with their
    # similarities:
    top_k_df = pelis_movielens.iloc[top_k].copy()
    top_k_df["similaridad"] = cosine_sims_de_las_top_k
    
    return top_k_df

In [17]:
top_k_similares(1, k=10)

Unnamed: 0,movieId,title,genres,argumento,similaridad
2998,4016,"Emperor's New Groove, The (2000)",Adventure|Animation|Children|Comedy|Fantasy,Emperor Kuzco is turned into a llama by his ex...,0.403976
2353,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,"When Woody is stolen by a toy collector, Buzz ...",0.39967
7351,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,The toys are mistakenly delivered to a day-car...,0.388802
5084,8015,"Phantom Tollbooth, The (1970)",Adventure|Animation|Children|Fantasy,Milo is a boy who is bored with life. One day ...,0.386209
8921,136016,The Good Dinosaur (2015),Adventure|Animation|Children|Comedy|Fantasy,In a world where dinosaurs and humans live sid...,0.38369
6445,51939,TMNT (Teenage Mutant Ninja Turtles) (2007),Action|Adventure|Animation|Children|Comedy|Fan...,When the world is threatened by an ancient evi...,0.383128
8800,130520,Home (2015),Adventure|Animation|Children|Comedy|Fantasy|Sc...,An alien on the run from his own people makes ...,0.371658
8214,103755,Turbo (2013),Adventure|Animation|Children|Comedy|Fantasy,A freak accident might just help an everyday g...,0.366717
1704,2294,Antz (1998),Adventure|Animation|Children|Comedy|Fantasy,A rather neurotic ant tries to break from his ...,0.365829
9422,166461,Moana (2016),Adventure|Animation|Children|Comedy|Fantasy,"In Ancient Polynesia, when a terrible curse in...",0.364677


In [18]:
top_k_similares(2, k=10)

Unnamed: 0,movieId,title,genres,argumento,similaridad
6062,40851,Zathura (2005),Action|Adventure|Children|Fantasy,Two young brothers are drawn into an intergala...,0.43105
9627,179401,Jumanji: Welcome to the Jungle (2017),Action|Adventure|Children,Four teenagers are sucked into a magical video...,0.419525
8026,98122,Indie Game: The Movie (2012),Documentary,A documentary that follows the journeys of ind...,0.323484
8710,125974,Halloweentown High (2004),Adventure|Children|Comedy|Fantasy,A girl in a magical world bets her family's ma...,0.279246
7929,95654,Geri's Game (1997),Animation|Children,Geri sets up a chess game to play his greatest...,0.258976
4568,6790,Avalon (2001),Drama|Fantasy|Sci-Fi,"In a dystopian world, a woman spends her time ...",0.242856
765,1009,Escape to Witch Mountain (1975),Adventure|Children|Fantasy,Two mysterious orphan children have extraordin...,0.24093
3572,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy,An orphaned boy enrolls in a school of wizardr...,0.240308
1616,2162,"NeverEnding Story II: The Next Chapter, The (1...",Adventure|Children|Fantasy,A young boy with a distant father enters a wor...,0.235532
4444,6566,Spy Kids 3-D: Game Over (2003),Action|Adventure|Children,Carmen's caught in a virtual reality game desi...,0.231111


In [26]:
pelis_movielens.head(10)

Unnamed: 0,movieId,title,genres,argumento
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,A cowboy doll is profoundly threatened and jea...
1,2,Jumanji (1995),Adventure|Children|Fantasy,When two kids find and play a magical board ga...
2,3,Grumpier Old Men (1995),Comedy|Romance,John and Max resolve to save their beloved bai...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"Based on Terry McMillan's novel, this film fol..."
4,5,Father of the Bride Part II (1995),Comedy,George Banks must deal not only with the pregn...
5,6,Heat (1995),Action|Crime|Thriller,A group of professional bank robbers start to ...
6,7,Sabrina (1995),Comedy|Romance,An ugly duckling having undergone a remarkable...
7,8,Tom and Huck (1995),Adventure|Children,Two best friends witness a murder and embark o...
8,9,Sudden Death (1995),Action,A former fireman takes on a group of terrorist...
9,10,GoldenEye (1995),Action|Adventure|Thriller,Years after a friend and fellow 00 agent is ki...


In [20]:
top_k_similares(6, k=10)

Unnamed: 0,movieId,title,genres,argumento,similaridad
7630,87785,Takers (2010),Action|Crime|Thriller,A group of bank robbers find their multi-milli...,0.335613
4959,7562,Dobermann (1997),Action|Crime,Dobermann is the world's most ruthless bank ro...,0.31724
5756,31101,Stander (2003),Action|Crime|Drama,"The life and career of Andre Stander, a South ...",0.313011
3220,4351,Point Break (1991),Action|Crime|Thriller,An F.B.I. Agent goes undercover to catch a gan...,0.308473
6260,47254,Chaos (2005),Action|Crime|Drama|Thriller,"Two cops, a rookie and a grizzled vet, pursue ...",0.288462
4710,7031,"Real McCoy, The (1993)",Action|Crime|Drama|Thriller,"A woman is released from prison, an expert ban...",0.283013
9555,173619,Fugitives (1986),Comedy|Crime,Jean is taken hostage at a bank by a foolish b...,0.261155
1438,1963,Take the Money and Run (1969),Comedy|Crime,"The life and times of Virgil Starkwell, inept ...",0.256322
6152,44199,Inside Man (2006),Crime|Drama|Thriller,"A police detective, a bank robber, and a high-...",0.252717
3561,4879,High Heels and Low Lifes (2001),Action|Comedy|Crime|Drama,A nurse eavesdrops with a friend on a cell pho...,0.243051
