# Project outline

We have a dataset with the plot summaries of almost 35 thousand movies scraped from Wikipedia. The goal is a system that, given the title of a movie, recommends other movies based on their plot summaries.

The dataset used is the one in this folder: [Wiki movie](./wiki_movie_plots_deduped.csv)

The objectives are the following:

1. **Data preprocessing**
2. **Vector representation of the text**
3. **Similarity measurement**
4. **Development of the recommender system**
5. **Evaluation of the recommender system**

## 1. **Data preprocessing:**
   - Clean and prepare the dataset that contains the movie plot summaries.
   - Tokenize the text and remove irrelevant words (stopwords).
   - Normalize the text (for example, convert it to lowercase, remove special characters, etc.).
   - Optionally apply lemmatization or stemming to reduce words to their base form.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

# Load the data
data_films = pd.read_csv('wiki_movie_plots_deduped.csv')
data_films.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Light_of_the_Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Presidents,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination.\r\nIn the second shot, which runs just over eight seconds long, an assassin kneels feet of Lady Justice."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_the_Grizzly_King","Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading ""His Photographer"" and ""His Press Agent"" respectively, follow him into the shot; the photographer sets up his camera. ""Teddy"" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. ""Teddy"" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. ""Teddy"" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs."
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Beanstalk_(1902_film),"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."


Since in this exercise we want to recommend movies based on their plot summary, we drop every column except '**Title**' and '**Plot**'.

In [2]:
# Drop the columns we are not interested in
data_films = data_films.drop(columns=['Release Year', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page'])
data_films_clean = data_films.copy()
data_films_clean.head()

Unnamed: 0,Title,Plot
0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"
1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better."
2,The Martyred Presidents,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination.\r\nIn the second shot, which runs just over eight seconds long, an assassin kneels feet of Lady Justice."
3,"Terrible Teddy, the Grizzly King","Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading ""His Photographer"" and ""His Press Agent"" respectively, follow him into the shot; the photographer sets up his camera. ""Teddy"" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. ""Teddy"" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. ""Teddy"" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs."
4,Jack and the Beanstalk,"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."


In [3]:
import spacy 
from bs4 import BeautifulSoup
import re

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

In [4]:
# Preprocess the text
def preprocess_text(nlp, text: str, custom_stopwords=None) -> list:
    """Strip HTML, normalize, tokenize, lemmatize and remove stopwords."""
    
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Convert to lowercase
    text = text.lower()
    
    # Expand contractions (e.g. "don't" -> "do not")
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'t", " not", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    
    # Tokenize and lemmatize
    doc = nlp(text)
    
    # Keep lemmas of tokens that are alphabetic, not stopwords, and longer than 2 characters
    tokens = [
        token.lemma_ for token in doc 
        if token.is_alpha and not token.is_stop and len(token) > 2
    ]
    
    # Remove custom stopwords, if any were given
    if custom_stopwords:
        tokens = [token for token in tokens if token not in custom_stopwords]
    
    return tokens

data_films_clean['clean_plot'] = data_films_clean['Plot'].apply(lambda plot: preprocess_text(nlp, plot))
data_films_clean.head()



Unnamed: 0,Title,Plot,clean_plot
0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]","[bartender, work, saloon, serve, drink, customer, fill, stereotypically, irish, man, bucket, beer, carrie, nation, follower, burst, inside, assault, irish, man, pull, hat, eye, dump, beer, head, group, begin, wreck, bar, smash, fixture, mirror, break, cash, register, bartender, spray, seltzer, water, nation, face, group, policeman, appear, order, everybody]"
1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better.","[moon, paint, smile, face, hang, park, night, young, couple, walk, past, fence, learn, railing, look, moon, smile, embrace, moon, smile, get, big, sit, bench, tree, moon, view, block, cause, frown, scene, man, fan, woman, hat, moon, leave, sky, perch, shoulder, well]"
2,The Martyred Presidents,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination.\r\nIn the second shot, which runs just over eight seconds long, an assassin kneels feet of Lady Justice.","[film, minute, long, compose, shot, girl, sit, base, altar, tomb, face, hide, camera, center, altar, view, portal, display, portrait, president, abraham, lincoln, james, garfield, william, mckinley, victim, assassination, second, shot, run, second, long, assassin, kneel, foot, lady, justice]"
3,"Terrible Teddy, the Grizzly King","Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading ""His Photographer"" and ""His Press Agent"" respectively, follow him into the shot; the photographer sets up his camera. ""Teddy"" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. ""Teddy"" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. ""Teddy"" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs.","[last, second, consist, shot, shot, set, wood, winter, actor, represent, vice, president, theodore, roosevelt, enthusiastically, hurry, hillside, tree, foreground, fall, right, cock, rifle, man, bear, sign, read, photographer, press, agent, respectively, follow, shot, photographer, set, camera, teddy, aim, rifle, upward, tree, fell, appear, common, house, cat, proceed, stab, teddy, hold, prize, aloft, press, agent, take, note, second, shot, take, slightly, different, wood, path, teddy, ride, path, horse, camera, left, shot, follow, closely, press, agent, photographer, dutifully, hold, sign]"
4,Jack and the Beanstalk,"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince.","[early, know, adaptation, classic, fairytale, film, show, jack, trade, cow, bean, mother, force, drop, yard, beig, force, upstairs, sleep, jack, visit, fairy, show, glimpse, await, ascend, bean, stalk, version, jack, son, deposed, king, jack, wake, find, beanstalk, grow, climb, enter, giant, home, giant, find, jack, narrowly, escape, giant, chase, jack, bean, stalk, jack, able, cut, giant, safety, fall, kill, jack, celebrates, fairy, reveal, jack, return, home, prince]"
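
One caveat of the regex-based contraction step in `preprocess_text` is that the `'s` rule cannot tell a contraction ("it's") from a possessive ("man's"), and `'d` may stand for "had" as well as "would". The sketch below isolates that step in a hypothetical helper, `expand_contractions` (not part of the notebook), so the effect can be inspected on short strings. It only matches the straight apostrophe, as the original rules do.

```python
import re

# Same substitutions, in the same order, as in `preprocess_text`
CONTRACTION_RULES = [
    (r"n't", " not"), (r"'re", " are"), (r"'s", " is"), (r"'d", " would"),
    (r"'ll", " will"), (r"'t", " not"), (r"'ve", " have"), (r"'m", " am"),
]

def expand_contractions(text: str) -> str:
    """Hypothetical helper that isolates the contraction step for inspection."""
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(expand_contractions("they don't leave"))  # they do not leave
print(expand_contractions("the man's bucket"))  # the man is bucket
```

In the sample rows above the extra "is" does not show up in `clean_plot` (row 0 keeps `man, bucket`), presumably because it is filtered out as a stopword, so the practical impact on this pipeline is likely small.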


## 2. **Vector representation of the text:**
   - Explore and select text representation techniques to turn the plot summaries into numeric vectors. Options include TF-IDF, Word2Vec, FastText, or more advanced techniques such as BERT and other pretrained language models (transformers).
   - Test and compare different representation techniques to identify the one that best captures the semantic similarity between plot summaries.
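
One of the simpler options in the list above is TF-IDF. The following is a minimal sketch using only the standard library and the plain weighting `tf * log(N / df)`; the function name `tfidf_vectors` and the toy corpus are assumptions made for illustration. Libraries such as scikit-learn use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical toy corpus, already tokenized like `clean_plot`
toy_docs = [
    ["jack", "giant", "bean", "giant"],
    ["moon", "smile", "couple"],
    ["jack", "fairy", "prince"],
]
toy_tfidf = tfidf_vectors(toy_docs)
print(toy_tfidf[0])
```

In the first toy document, "giant" gets a higher weight than "jack": it appears twice and in only one document, while "jack" also appears in the third one.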

In [6]:
from gensim.models import Word2Vec

# Train the model on the tokenized plots
model_w2v = Word2Vec(sentences=data_films_clean['clean_plot'], vector_size=100, window=5, min_count=1, workers=4)

In [8]:
import numpy as np

# Vectorize the plot summary
def vectorize_plot(model, tokens: list) -> np.ndarray:
    """Turn a list of tokens into a single vector by averaging their Word2Vec embeddings."""
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

data_films_clean['vector'] = data_films_clean['clean_plot'].apply(lambda tokens: vectorize_plot(model_w2v, tokens))
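
The averaging step in `vectorize_plot` can be illustrated in isolation. The sketch below replaces `model.wv` with a hypothetical dictionary of 3-dimensional embeddings (`toy_embeddings`, invented for this example) and shows two behaviours: unknown tokens are skipped, and a plot with no known tokens falls back to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, standing in for `model.wv`
toy_embeddings = {
    "jack": np.array([1.0, 0.0, 2.0]),
    "giant": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(embeddings: dict, tokens: list, dim: int = 3) -> np.ndarray:
    """Average the vectors of known tokens; return zeros if none is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

pooled = mean_pool(toy_embeddings, ["jack", "giant", "beanstalk"])
empty = mean_pool(toy_embeddings, ["beanstalk"])
print(pooled)  # [2. 1. 1.]
print(empty)   # [0. 0. 0.]
```

Since the model above is trained with `min_count=1`, every token seen during training is in the vocabulary, so in this notebook the zero-vector fallback should only apply to plots whose token list is empty. Averaging also discards word order: the same tokens in a different order give the same vector.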

## 3. **Similarity measurement:**
   - Implement a similarity metric between plot summaries. Options include cosine similarity, Euclidean distance, or metrics specific to embedded representations, such as the distance between word or sentence embeddings.
   - Evaluate and tune the similarity metrics to maximize the precision of the recommendations.
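
To make the difference between two of the metrics above concrete, the sketch below compares cosine similarity and Euclidean distance on two toy vectors that point in the same direction but differ in length. The function names and vectors are assumptions for illustration, written with the standard library only.

```python
import math

def cosine_similarity_1d(a, b) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same direction, different magnitude
u = [1.0, 2.0]
v = [2.0, 4.0]
print(round(cosine_similarity_1d(u, v), 6))  # 1.0
print(round(euclidean_distance(u, v), 6))    # 2.236068
```

Cosine similarity treats the two vectors as identical in direction, while Euclidean distance still separates them because of their different magnitudes.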

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Compute the cosine similarity between two vectors."""
    return cosine_similarity([a], [b])[0][0]

## 4. **Development of the recommender system:**
   - Implement the recommendation algorithm based on the similarity between movie plot summaries.
   - Set up the system so that, given the title of a movie, it looks up its plot summary and finds the most similar ones in order to recommend related movies.
   - Decide how many recommendations to show and the criteria used to rank them.

In [11]:
# Recommend similar movies
def recommend_movies(df: pd.DataFrame, movie_title: str, top_n: int = 10) -> pd.Series:
    """Recommend the movies whose plot summary is most similar to that of `movie_title`."""
    if movie_title not in df['Title'].values:
        raise ValueError(f"The movie '{movie_title}' is not in the dataset.")
    
    # Compute the similarity against the first movie with that title
    # (this adds or overwrites the 'cosine_similarity' column of df)
    query_vector = df.loc[df['Title'] == movie_title, 'vector'].values[0]
    df['cosine_similarity'] = df['vector'].apply(lambda vec: cosine_sim(vec, query_vector))
    
    # Sort by similarity and skip the top row, which is normally the query movie itself
    return df.sort_values(by='cosine_similarity', ascending=False)['Title'].head(top_n + 1).iloc[1:]
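
`recommend_movies` calls `cosine_sim` once per row, which can be slow with tens of thousands of movies. A possible alternative, sketched here under the assumption that all plot vectors are stacked into one matrix, computes every similarity with a single matrix product and excludes the query row by index rather than by position. The function `top_similar` and the toy matrix are illustrative and not part of the notebook.

```python
import numpy as np

def top_similar(matrix: np.ndarray, query_idx: int, top_n: int = 10) -> list:
    """Return (row_index, cosine_similarity) pairs, most similar first, excluding the query row."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # zero vectors get similarity 0 instead of NaN
    unit = matrix / norms[:, None]
    sims = unit @ unit[query_idx]
    # Stable sort keeps the original row order among ties
    order = np.argsort(-sims, kind="stable")
    order = order[order != query_idx][:top_n]
    return [(int(i), float(sims[i])) for i in order]

# Toy matrix: one row per "movie"
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.0, 0.0]])
result = top_similar(toy, query_idx=0, top_n=3)
print(result)
```

With the notebook's data, the matrix could be built with something like `np.vstack(data_films_clean['vector'].tolist())`, and the returned row positions mapped back to titles with `iloc`.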


## 5. **Evaluation of the recommender system:**
   - Run a few manual checks to see how well the model works.

In [12]:
# Usage example
movie_title = "The Matrix"
recommendations = recommend_movies(data_films_clean, movie_title)
print("Recommendations for", movie_title, ":")
print(recommendations)

Recommendations for The Matrix :
14313          The Matrix Reloaded
14314       The Matrix Revolutions
21133           X-Men: First Class
16146           X-Men: First Class
33397                    Appleseed
16983      Avengers: Age of Ultron
12305                   The Shadow
12550                    Screamers
16178    Avengers, TheThe Avengers
33162                       Gunhed
Name: Title, dtype: object
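
A manual check like the one above could be complemented with a rough automatic proxy. The sketch below assumes, purely for illustration, that a recommendation counts as a hit when its genre label equals that of the query movie; the original CSV has a `Genre` column (dropped in cell 2) that might serve this purpose, although many of its values are `unknown`. The function `genre_agreement` and the labels used are hypothetical, not results from the dataset.

```python
def genre_agreement(query_genre: str, recommended_genres: list) -> float:
    """Fraction of recommendations whose genre label equals the query's."""
    if not recommended_genres:
        return 0.0
    hits = sum(1 for genre in recommended_genres if genre == query_genre)
    return hits / len(recommended_genres)

# Hypothetical labels, not taken from the dataset
score = genre_agreement(
    "science fiction",
    ["science fiction", "science fiction", "superhero", "action"],
)
print(score)  # 0.5
```

Genre agreement is only a coarse signal: two movies with similar plots can carry different genre labels, so a low score does not necessarily mean a bad recommendation.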


## **Conclusion**

The plot-based movie recommender implemented in this project suggests movies with similar plot summaries, which makes it easier to explore related content. It combines natural language processing (spaCy preprocessing) with a vector representation (averaged Word2Vec embeddings) and cosine similarity. The evaluation so far is a single manual check: for "The Matrix", the top two recommendations are its two sequels, and "X-Men: First Class" appears twice, which suggests that the dataset contains repeated entries for some titles.

#### **Possible improvements:**

1. **Optimize the preprocessing**:
   - **Noise removal**: Refine the stopword list and the lemmatization step to improve the quality of the text representation.

2. **Explore other similarity metrics**:
   - **Alternative metrics**: Try other metrics such as the **Manhattan distance** or the **Pearson correlation** and check which one gives better results for movie recommendations.
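
The two alternative metrics mentioned above can be sketched with the standard library as follows. The function names and toy vectors are assumptions for illustration. Pearson correlation is equivalent to the cosine similarity of the mean-centered vectors, so it may behave similarly to the metric already used.

```python
import math

def manhattan_distance(a, b) -> float:
    """Sum of absolute coordinate differences (lower means more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_correlation(a, b) -> float:
    """Pearson correlation between two vectors; 0.0 if either is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if dev_a == 0 or dev_b == 0:
        return 0.0
    return cov / (dev_a * dev_b)

# Toy vectors standing in for two plot embeddings
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]
print(manhattan_distance(p, q))             # 6.0
print(round(pearson_correlation(p, q), 6))  # 1.0
```

Manhattan distance is a distance rather than a similarity, so recommendations would have to be sorted in ascending order when using it.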

