# Planteamiento del proyecto

Tenemos un conjunto de datos que contiene sinopsis de casi 35 mil películas escrapeadas de la Wikipedia. Queremos que,dado el título de una película, te recomiende otras basadas en su sinopsis.

El dataset a utilizar es el que aparece en la carpeta: [Wiki movie](./wiki_movie_plots_deduped.csv)

Los objetivos a cumplir son los siguientes:

1. **Preprocesamiento de datos**
2. **Representación vectorial del texto**
3. **Medición de la similitud**
4. **Desarrollo del sistema de recomendación**
5. **Evaluación del sistema de recomendación**

## 1. **Preprocesamiento de datos:**
   - Limpiar y preparar el conjunto de datos que contiene las sinopsis de las películas.
   - Realizar la tokenización y eliminación de palabras irrelevantes (stopwords).
   - Normalizar el texto (por ejemplo, convertirlo a minúsculas, eliminar caracteres especiales, etc.).
   - Posible lematización o stemming para reducir las palabras a su raíz.

In [3]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

#Cargar datos
data_films = pd.read_csv('wiki_movie_plots_deduped.csv')
data_films.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Light_of_the_Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Presidents,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination.\r\nIn the second shot, which runs just over eight seconds long, an assassin kneels feet of Lady Justice."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_the_Grizzly_King","Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading ""His Photographer"" and ""His Press Agent"" respectively, follow him into the shot; the photographer sets up his camera. ""Teddy"" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. ""Teddy"" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. ""Teddy"" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs."
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Beanstalk_(1902_film),"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."


Como en este ejercicios queremos hacer recomendaciones de películas en base a la sinopsis, vamos a quitar todas las variablees excepto '**Title**' y '**Plot**'.

In [4]:
#Eliminamos las columnas que no nos interesan
data_films = data_films.drop(columns=['Release Year', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page'])
data_films.head()

Unnamed: 0,Title,Plot
0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"
1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better."
2,The Martyred Presidents,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination.\r\nIn the second shot, which runs just over eight seconds long, an assassin kneels feet of Lady Justice."
3,"Terrible Teddy, the Grizzly King","Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading ""His Photographer"" and ""His Press Agent"" respectively, follow him into the shot; the photographer sets up his camera. ""Teddy"" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. ""Teddy"" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. ""Teddy"" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs."
4,Jack and the Beanstalk,"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."


In [13]:
import spacy 
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")
import os

In [14]:
# Tokenización, stopwords y normalización
def preprocess(text: str) -> list:
    '''
    En esta función preprocess(text), se utiliza Spacy para analizar el texto y generar un objeto doc, 
    que representa el texto analizado por Spacy. Luego, se itera sobre cada token en doc, y 
    se seleccionan solo aquellos tokens que no son stopwords (token.is_stop) y que son alfabéticos (token.is_alpha). 
    Finalmente, se extrae el lema de cada token (token.lemma_) y se almacenan en una lista tokens, 
    que se devuelve como el resultado de la función. Estos tokens son las palabras individuales del texto tokenizado.
    Args:
    text (str): El texto a tokenizar.
    Returns:
    list: Una lista de tokens del texto tokenizado.
    '''
    
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return tokens

if not os.path.exists('movies_clean.csv'):
    data_films['clean_plot'] = data_films['Plot'].apply(preprocess)
    data_films.to_csv('movies_clean.csv', index=False)
else:
    print("El archivo ya existe.")
    
data_films_clean = pd.read_csv("movies_clean.csv")
data_films_clean.head()

Unnamed: 0,Title,Plot,clean_plot
0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]","['bartender', 'work', 'saloon', 'serve', 'drink', 'customer', 'fill', 'stereotypically', 'irish', 'man', 'bucket', 'beer', 'Carrie', 'Nation', 'follower', 'burst', 'inside', 'assault', 'irish', 'man', 'pull', 'hat', 'eye', 'dump', 'beer', 'head', 'group', 'begin', 'wreck', 'bar', 'smash', 'fixture', 'mirror', 'break', 'cash', 'register', 'bartender', 'spray', 'seltzer', 'water', 'Nation', 'face', 'group', 'policeman', 'appear', 'order', 'everybody']"
1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better.","['moon', 'paint', 'smile', 'face', 'hang', 'park', 'night', 'young', 'couple', 'walk', 'past', 'fence', 'learn', 'railing', 'look', 'moon', 'smile', 'embrace', 'moon', 'smile', 'get', 'big', 'sit', 'bench', 'tree', 'moon', 'view', 'block', 'cause', 'frown', 'scene', 'man', 'fan', 'woman', 'hat', 'moon', 'leave', 'sky', 'perch', 'shoulder', 'well']"
2,The Martyred Presidents,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination.\r\nIn the second shot, which runs just over eight seconds long, an assassin kneels feet of Lady Justice.","['film', 'minute', 'long', 'compose', 'shot', 'girl', 'sit', 'base', 'altar', 'tomb', 'face', 'hide', 'camera', 'center', 'altar', 'view', 'portal', 'display', 'portrait', 'president', 'Abraham', 'Lincoln', 'James', 'Garfield', 'William', 'McKinley', 'victim', 'assassination', 'second', 'shot', 'run', 'second', 'long', 'assassin', 'kneel', 'foot', 'Lady', 'Justice']"
3,"Terrible Teddy, the Grizzly King","Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading ""His Photographer"" and ""His Press Agent"" respectively, follow him into the shot; the photographer sets up his camera. ""Teddy"" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. ""Teddy"" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. ""Teddy"" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs.","['last', 'second', 'consist', 'shot', 'shot', 'set', 'wood', 'winter', 'actor', 'represent', 'vice', 'president', 'Theodore', 'Roosevelt', 'enthusiastically', 'hurry', 'hillside', 'tree', 'foreground', 'fall', 'right', 'cock', 'rifle', 'man', 'bear', 'sign', 'read', 'photographer', 'Press', 'Agent', 'respectively', 'follow', 'shot', 'photographer', 'set', 'camera', 'Teddy', 'aim', 'rifle', 'upward', 'tree', 'fell', 'appear', 'common', 'house', 'cat', 'proceed', 'stab', 'Teddy', 'hold', 'prize', 'aloft', 'press', 'agent', 'take', 'note', 'second', 'shot', 'take', 'slightly', 'different', 'wood', 'path', 'Teddy', 'ride', 'path', 'horse', 'camera', 'left', 'shot', 'follow', 'closely', 'press', 'agent', 'photographer', 'dutifully', 'hold', 'sign']"
4,Jack and the Beanstalk,"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince.","['early', 'know', 'adaptation', 'classic', 'fairytale', 'film', 'show', 'Jack', 'trade', 'cow', 'bean', 'mother', 'force', 'drop', 'yard', 'beig', 'force', 'upstairs', 'sleep', 'Jack', 'visit', 'fairy', 'show', 'glimpse', 'await', 'ascend', 'bean', 'stalk', 'version', 'Jack', 'son', 'deposed', 'king', 'Jack', 'wake', 'find', 'beanstalk', 'grow', 'climb', 'enter', 'giant', 'home', 'giant', 'find', 'Jack', 'narrowly', 'escape', 'giant', 'chase', 'Jack', 'bean', 'stalk', 'Jack', 'able', 'cut', 'giant', 'safety', 'fall', 'kill', 'Jack', 'celebrate', 'fairy', 'reveal', 'Jack', 'return', 'home', 'prince']"


## 2. **Representación vectorial del texto:**
   - Explorar y seleccionar técnicas de representación de texto para convertir las sinopsis en vectores numéricos. Esto puede incluir enfoques como TF-IDF, Word2Vec, FastText o técnicas más avanzadas como BERT o modelos de lenguaje preentrenados (transformers).
   - Realizar pruebas y comparaciones de diferentes técnicas de representación para identificar la que mejor captura la similitud semántica entre las sinopsis.

In [15]:
import gensim.downloader
from gensim.models import Word2Vec

#Entrenamiento del modelo
model_w2v = Word2Vec(sentences=data_films['clean_plot'], vector_size=100, window=5, min_count=1, workers=4)

In [16]:
import numpy as np

# Representación de películas
def get_movie_vector(tokens: list) -> list:
    '''
    Esta función toma una lista de tokens (palabras) como entrada y devuelve 
    una representación vectorial de la película basada en las palabras de la sinopsis.

    Args:
    tokens (list): lista de tokens (palabras) de la sinopsis de la película.
    Return:
    vector (list): representación vectorial de la película.
    '''
    vectors = [model_w2v.wv[token] for token in tokens if token in model_w2v.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(100)  # or any default vector if no tokens are found

data_films_clean['vector'] = data_films_clean['clean_plot'].apply(get_movie_vector)

## 3. **Medición de la similitud:**
   - Implementar una métrica de similitud entre las sinopsis. Esto puede incluir el uso de distancia coseno, distancia Euclidiana, o métricas específicas para representaciones embebidas como la distancia entre embeddings de palabras o sentencias.
   - Evaluar y ajustar las métricas de similitud para maximizar la precisión de las recomendaciones.

In [28]:
from sklearn.metrics.pairwise import cosine_similarity

# Cálculo de similitud
def get_cosine_similarity(movie_vector: list, query_vector: list) -> float:
    '''
    Esta función calcula la similitud entre la película y el resto de películas.
    
    Args:
    movie_vector (list): Vector de características de la película.
    qery_vector (list): Vector de características de la consulta.
    Return:
    float: Similitud entre la película y la consulta.
    '''
    return cosine_similarity([movie_vector], [query_vector])[0][0]

## 4. **Desarrollo del sistema de recomendación:**
   - Implementar el algoritmo de recomendación basado en la similitud de las sinopsis de las películas.
   - Configurar el sistema para que, dado el título de una película, busque su sinopsis y encuentre las sinopsis más similares para recomendar películas afines.
   - Determinar el número óptimo de recomendaciones a mostrar y establecer los criterios para ordenarlas.

In [29]:
# Recomendación
def recommend_similar_movies(movie_title, top_n=10):
    query_vector = data_films_clean[data_films_clean['Title'] == movie_title]['vector'].values[0]
    data_films_clean['cosine_similarity'] = data_films_clean['vector'].apply(lambda x: get_cosine_similarity(x, query_vector))
    similar_movies = data_films_clean.sort_values(by='similarity', ascending=False)['Title'].head(top_n+1)[1:]
    return similar_movies


## 5. **Evaluación del sistema de recomendación:**
   - Hacer algunas comprobaciones manuales para ver que tal funciona el modelo.

In [30]:
# Ejemplo de uso
movie_title = "The Matrix"
recommendations = recommend_similar_movies(movie_title)
print("Recomendaciones para", movie_title, ":")
print(recommendations)

Recomendaciones para The Matrix :
12995                       George of the Jungle
21506    Night at the Museum: Secret of the Tomb
16905    Night at the Museum: Secret of the Tomb
16146                         X-Men: First Class
21133                         X-Men: First Class
14711                                 Madagascar
12020                                 Leprechaun
15475                Madagascar: Escape 2 Africa
1324                        Island of Lost Souls
22090                  Resident Evil: Apocalypse
Name: Title, dtype: object


## **Conclusión**

El sistema de recomendación de películas basado en sinopsis implementado en este proyecto ofrece una herramienta eficaz para sugerir películas similares, facilitando la exploración de contenidos relacionados. La combinación de técnicas de procesamiento de lenguaje natural y modelos de representación vectorial permite proporcionar recomendaciones que se alinean con la sinopsis de entrada.

#### **Mejoras Posibles:**

1. **Optimizar el Preprocesamiento**:
   - **Eliminación de Ruido**: Refinar la lista de stopwords y técnicas de lematización para mejorar la precisión de la representación textual.

2. **Explorar Otras Métricas de Similitud**:
   - **Métricas Alternativas**: Investigar y probar otras métricas de similitud como la **distancia de Manhattan** o la **correlación de Pearson** para evaluar cuál ofrece mejores resultados en el contexto de recomendaciones de películas.

