**BIMONTHLY EXAM**

**Name:** Aarón Yumancela



**Instructions:**

In this exam, students must design and implement a basic information retrieval system
using the Rotten Tomatoes movies and critic reviews dataset available on Kaggle.
The goal is to answer queries about the subject matter of the movies and their
characteristics.

In [45]:
# Extract the .zip archive

import zipfile

with zipfile.ZipFile("examen.zip", "r") as z:
    z.extractall("descomprimido")


In [46]:
# Read the CSV files

import pandas as pd

movies = pd.read_csv("descomprimido/rotten_tomatoes_movies.csv")
reviews = pd.read_csv("descomprimido/rotten_tomatoes_critic_reviews.csv")

movies.head()


Unnamed: 0,rotten_tomatoes_link,movie_title,movie_info,critics_consensus,content_rating,genres,directors,authors,actors,original_release_date,...,production_company,tomatometer_status,tomatometer_rating,tomatometer_count,audience_status,audience_rating,audience_count,tomatometer_top_critics_count,tomatometer_fresh_critics_count,tomatometer_rotten_critics_count
0,m/0814255,Percy Jackson & the Olympians: The Lightning T...,"Always trouble-prone, the life of teenager Per...",Though it may seem like just another Harry Pot...,PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,"Craig Titley, Chris Columbus, Rick Riordan","Logan Lerman, Brandon T. Jackson, Alexandra Da...",2010-02-12,...,20th Century Fox,Rotten,49.0,149.0,Spilled,53.0,254421.0,43,73,76
1,m/0878835,Please Give,Kate (Catherine Keener) and her husband Alex (...,Nicole Holofcener's newest might seem slight i...,R,Comedy,Nicole Holofcener,Nicole Holofcener,"Catherine Keener, Amanda Peet, Oliver Platt, R...",2010-04-30,...,Sony Pictures Classics,Certified-Fresh,87.0,142.0,Upright,64.0,11574.0,44,123,19
2,m/10,10,"A successful, middle-aged Hollywood songwriter...",Blake Edwards' bawdy comedy may not score a pe...,R,"Comedy, Romance",Blake Edwards,Blake Edwards,"Dudley Moore, Bo Derek, Julie Andrews, Robert ...",1979-10-05,...,Waner Bros.,Fresh,67.0,24.0,Spilled,53.0,14684.0,2,16,8
3,m/1000013-12_angry_men,12 Angry Men (Twelve Angry Men),Following the closing arguments in a murder tr...,Sidney Lumet's feature debut is a superbly wri...,NR,"Classics, Drama",Sidney Lumet,Reginald Rose,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",1957-04-13,...,Criterion Collection,Certified-Fresh,100.0,54.0,Upright,97.0,105386.0,6,54,0
4,m/1000079-20000_leagues_under_the_sea,"20,000 Leagues Under The Sea","In 1866, Professor Pierre M. Aronnax (Paul Luk...","One of Disney's finest live-action adventures,...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,Earl Felton,"James Mason, Kirk Douglas, Paul Lukas, Peter L...",1954-01-01,...,Disney,Fresh,89.0,27.0,Upright,74.0,68918.0,5,24,3


**Data preprocessing**

In [47]:
# STOPWORDS
stopwords = {
    "the","a","an","of","in","on","at","and","or","to","from","that","this",
    "is","was","were","be","been","as","for","with","by","it","they","them",
    "but","so","their","its","are","do","did","does","because","about",
    "el","la","los","las","un","una","unos","unas","de","del","y","o","que",
    "en","con","por","para","como","su","sus","es","son","ser","fue","fueron",
    "se","lo","mi","mis","tu","tus"
}

# REMOVE ACCENTS
def quitar_acentos(p):
    tabla = {"á":"a","é":"e","í":"i","ó":"o","ú":"u","ü":"u","ñ":"n"}
    nueva = ""
    for c in p:
        nueva += tabla.get(c, c)
    return nueva

# STRIP LEADING AND TRAILING PUNCTUATION
def limpiar_palabra(p):
    while p and p[0] in ".,!?;:\"()[]{}":
        p = p[1:]
    while p and p[-1] in ".,!?;:\"()[]{}":
        p = p[:-1]
    return p

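# BASIC STEMMER
# Strips the first suffix in the list that matches. Because "s" comes before
# "es", "amos", "emos" and "imos", those four entries are never reached.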
def stem_basico(p):
    terminaciones = ["ing","ed","ly","ment","s","es","amos","emos","imos","ando","iendo"]
    for t in terminaciones:
        if p.endswith(t) and len(p) > len(t) + 2:
            return p[:-len(t)]
    return p

# TOKENIZER
def tokenizar(texto):
    tokens = []
    palabra = ""
    for c in texto:
        if c.isalnum() or c in ".-":
            palabra += c
        else:
            if palabra:
                tokens.append(palabra)
                palabra = ""
    if palabra:
        tokens.append(palabra)
    return tokens

# FULL PREPROCESSING PIPELINE
def preprocesar(texto):
    texto = texto.lower()
    texto = quitar_acentos(texto)
    tokens = tokenizar(texto)

    resultado = []
    for t in tokens:
        t = limpiar_palabra(t)
        t = t.replace(".", "")
        if not t:
            continue
        if t in stopwords:
            continue
        t = stem_basico(t)
        resultado.append(t)

    return resultado
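To make the stemmer's behavior concrete, the snippet below applies a copy of `stem_basico` to a few sample words. It is a minimal sketch: the function body is copied from the cell above so the snippet runs on its own, and the word list `ejemplos` is chosen here purely for illustration.

```python
# Copy of the notebook's stem_basico so this snippet is self-contained.
def stem_basico(p):
    terminaciones = ["ing", "ed", "ly", "ment", "s", "es",
                     "amos", "emos", "imos", "ando", "iendo"]
    for t in terminaciones:
        # Strip the first matching suffix, keeping at least 3 characters.
        if p.endswith(t) and len(p) > len(t) + 2:
            return p[:-len(t)]
    return p

# Illustrative sample words (an assumption, not taken from the dataset).
ejemplos = ["lightning", "complicated", "always", "movies", "hablamos"]
stems = {w: stem_basico(w) for w in ejemplos}

for palabra, raiz in stems.items():
    print(f"{palabra} -> {raiz}")
```

The stems are not always dictionary words ("lightn", "alway"), which is acceptable for retrieval as long as queries go through the same function. Note that "hablamos" loses only its final "s": the generic "s" rule fires before the Spanish "amos" rule is tried.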


In [48]:
# Build a single text field combining title, genres, synopsis, and critics consensus

def unir_texto_movie(row):
    partes = []
    if isinstance(row.get("movie_title"), str):
        partes.append(row["movie_title"])
    if isinstance(row.get("genres"), str):
        partes.append(row["genres"])
    if isinstance(row.get("movie_info"), str):
        partes.append(row["movie_info"])
    if isinstance(row.get("critics_consensus"), str):
        partes.append(row["critics_consensus"])
    return " ".join(partes)


movies["texto"] = movies.apply(unir_texto_movie, axis=1)
movies["texto"].head()


Unnamed: 0,texto
0,Percy Jackson & the Olympians: The Lightning T...
1,Please Give Comedy Kate (Catherine Keener) and...
2,"10 Comedy, Romance A successful, middle-aged H..."
3,"12 Angry Men (Twelve Angry Men) Classics, Dram..."
4,"20,000 Leagues Under The Sea Action & Adventur..."


In [49]:
movies["tokens"] = movies["texto"].apply(preprocesar)


In [50]:
# Print the first 25 tokens of the first 5 preprocessed movies

for i in range(5):
    print("PELÍCULA", i, movies["movie_title"].iloc[i])
    print(movies["tokens"].iloc[i][:25])
    print()


PELÍCULA 0 Percy Jackson & the Olympians: The Lightning Thief
['percy', 'jackson', 'olympian', 'lightn', 'thief', 'action', 'adventure', 'comedy', 'drama', 'science', 'fiction', 'fantasy', 'alway', 'trouble-prone', 'life', 'teenager', 'percy', 'jackson', 'logan', 'lerman', 'get', 'lot', 'more', 'complicat', 'when']

PELÍCULA 1 Please Give
['please', 'give', 'comedy', 'kate', 'catherine', 'keener', 'her', 'husband', 'alex', 'oliver', 'platt', 'wealthy', 'new', 'yorker', 'who', 'prowl', 'estate', 'sale', 'make', 'tidy', 'profit', 'resell', 'item', 'bought', 'cheap']

PELÍCULA 2 10
['10', 'comedy', 'romance', 'successful', 'middle-ag', 'hollywood', 'songwriter', 'fall', 'hopeless', 'love', 'woman', 'his', 'dream', 'even', 'follow', 'girl', 'her', 'new', 'husband', 'mexican', 'honeymoon', 'resort', 'while', 'his', 'behavior']

PELÍCULA 3 12 Angry Men (Twelve Angry Men)
['12', 'angry', 'men', 'twelve', 'angry', 'men', 'classic', 'drama', 'follow', 'clos', 'argument', 'murder', 'trial', '12'

In [51]:
# BM25 implemented from scratch (TF, DF, IDF, and document-length normalization)

import math

class BM25:
    def __init__(self, documentos, preprocesador, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.preprocesar = preprocesador

        self.docs = [self.preprocesar(doc) for doc in documentos]
        self.N = len(self.docs)

        self.doc_len = []
        self.avgdl = 0
        self.tf = []
        self.df = {}
        self.idf = {}

        self._indexar()

    def _indexar(self):
        total_len = 0

        for tokens in self.docs:
            dl = len(tokens)
            self.doc_len.append(dl)
            total_len += dl

            tf_doc = {}
            for t in tokens:
                tf_doc[t] = tf_doc.get(t, 0) + 1
            self.tf.append(tf_doc)

            for t in tf_doc:
                self.df[t] = self.df.get(t, 0) + 1

        self.avgdl = total_len / self.N

        for t, df_t in self.df.items():
            self.idf[t] = math.log((self.N - df_t + 0.5) / (df_t + 0.5) + 1)

    def score(self, consulta):
        q_tokens = self.preprocesar(consulta)
        scores = []

        for i, tf_doc in enumerate(self.tf):
            score = 0
            dl = self.doc_len[i]

            for t in q_tokens:
                if t not in tf_doc:
                    continue

                tf_td = tf_doc[t]
                idf_t = self.idf.get(t, 0)

                numer = tf_td * (self.k1 + 1)
                denom = tf_td + self.k1 * (1 - self.b + self.b * dl / self.avgdl)

                score += idf_t * (numer / denom)

            scores.append((i, score))

        return sorted(scores, key=lambda x: x[1], reverse=True)
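For reference, the scoring rule computed by the `score` method can be written out as follows. This is read directly off the code above; the symbol names are chosen here for illustration: $f(t,d)$ corresponds to `tf_td`, $|d|$ to `dl`, $n_t$ to `df_t`, and $N$ to `self.N`.

$$
\text{score}(d, q) = \sum_{t \in q,\; t \in d} \text{idf}(t)\,
\frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1\left(1 - b + b\,\dfrac{|d|}{\text{avgdl}}\right)},
\qquad
\text{idf}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)
$$

The sum runs over the query tokens that occur in the document, and a token repeated in the query is counted each time it appears. The defaults in the constructor are $k_1 = 1.2$ and $b = 0.75$. The $+1$ inside the logarithm keeps the IDF positive even for terms that appear in more than half of the documents.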


In [52]:
# Build the BM25 model from the combined raw text; the class applies preprocesar to each document internally

bm25 = BM25(movies["texto"], preprocesar)
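The effect of the two normalization components in the model can be checked on made-up numbers. The sketch below isolates the contribution of a single query term using the same expression as the `BM25` class; the helper name `bm25_termino` and every input value are assumptions for illustration, not figures from the dataset.

```python
import math

def bm25_termino(tf, dl, avgdl, n_docs, df, k1=1.2, b=0.75):
    """Contribution of one query term to one document's BM25 score."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norma = 1 - b + b * dl / avgdl  # length normalization factor
    return idf * tf * (k1 + 1) / (tf + k1 * norma)

# Same term frequency, different document lengths (hypothetical values).
corto = bm25_termino(tf=1, dl=20, avgdl=40, n_docs=1000, df=50)
largo = bm25_termino(tf=1, dl=80, avgdl=40, n_docs=1000, df=50)

# Same document, rare term versus common term (hypothetical values).
raro = bm25_termino(tf=1, dl=40, avgdl=40, n_docs=1000, df=5)
comun = bm25_termino(tf=1, dl=40, avgdl=40, n_docs=1000, df=500)

print("short vs long document:", corto > largo)
print("rare vs common term:", raro > comun)
```

A single match counts for more in a short document than in a long one, and a rare term counts for more than a common one. Setting `b=0` switches off the length normalization entirely.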


In [53]:
# Helper function to search for movies and print the top results

def buscar_peliculas(query, top=5):
    resultados = bm25.score(query)[:top]

    print(f"\nConsulta: '{query}'\n")

    for doc_id, score in resultados:
        titulo = movies["movie_title"].iloc[doc_id]
        texto = movies["texto"].iloc[doc_id]

        print(f" {titulo} — Score={score:.4f}")
        print(texto[:400], "...\n")


**Printing the results**

In [54]:
# Use the function to query the movies

buscar_peliculas("space travel", top=10)



Consulta: 'space travel'

 Infini — Score=11.2908
Infini Drama, Science Fiction & Fantasy Members of a search-and-rescue team uncover a global threat after traveling to a mining colony in outer space. ...

 The Jetsons — Score=9.0421
The Jetsons Action & Adventure, Animation, Comedy, Kids & Family, Science Fiction & Fantasy Based on the popular cartoon series, this animated movie, set in a time filled with family-sized spacecrafts and intergalactic travel, begins with hardworking family man George Jetson (George O'Hanlon) ecstatic when his cranky boss, Mr. Spacely (Mel Blanc), gives him a promotion that relocates him, his wife ...

 Serenity — Score=8.8929
Serenity Action & Adventure, Science Fiction & Fantasy In this continuation of the television series "Firefly," a group of rebels travels the outskirts of space aboard their ship, Serenity, outside the reach of the Alliance, a sinister regime that controls most of the universe. After the crew takes in Simon (Sean Maher) and his psyc

In [55]:
buscar_peliculas("family movie", top=10)



Consulta: 'family movie'

 Teen Beach Movie — Score=7.0799
Teen Beach Movie Kids & Family, Musical & Performing Arts, Television Two young surfers (Ross Lynch, Maia Mitchell) find romance when they magically become part of a movie musical. ...

 The Even Stevens Movie — Score=6.7675
The Even Stevens Movie Comedy, Kids & Family, Television Members (Shia LaBeouf, Nick Spano, Tom Virtue) of a family unwittingly appear on a reality-television show after the producer sends them to an island for a vacation. ...

 Grown Up Movie Star — Score=6.5826
Grown Up Movie Star Comedy, Drama A 13-year-old (Tatiana Maslany) gives in to sexual temptation after her mother abandons the family. ...

 The Kid — Score=6.5826
The Kid Comedy, Kids & Family, Science Fiction & Fantasy Critics find The Kid to be too sweet and the movie's message to be annoyingly simplistic. ...

 Ways To Live Forever — Score=6.5826
Ways To Live Forever Drama, Kids & Family Sam needs to learn about UFOs, horror movies and airships

**Analysis of Results**

The results for the query "space travel" are coherent: the top-ranked movies involve space travel or stories set in space, according to their descriptions. Films such as Infini, Serenity, or Gattaca show that BM25 is matching the query terms "space" and "travel" in the text, including inflected forms such as "traveling" and "travels", which the stemmer reduces to "travel". For the query "family movie", the system mostly retrieves titles classified under the Kids & Family genre or with plots aimed at family audiences, which shows that the genre and synopsis fields both contribute to the search. Overall, the results correspond reasonably well to each query, although the ranking depends on the exact wording because the model relies only on textual matches. Even so, the behavior is adequate for a basic information retrieval system.

**Improvements**

*   Weight the genre: Giving more weight to the genres field could improve precision for thematic queries such as "family movie".
*   Remove uninformative terms: Dropping words such as "movie" or "film" would avoid matches that add no real relevance.
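Both improvements can be sketched in a few lines. This is one possible approach under stated assumptions, not a tuned implementation: `unir_texto_ponderado`, `peso_genero`, `stopwords_dominio`, and `filtrar_tokens` are hypothetical names introduced here. Genre weighting is approximated by repeating the genres field so that its terms get a higher term frequency.

```python
# Hypothetical variant of unir_texto_movie: repeats the genres field
# peso_genero times (an assumed, untuned weight) to boost genre terms.
def unir_texto_ponderado(row, peso_genero=3):
    partes = []
    for campo in ("movie_title", "genres", "movie_info", "critics_consensus"):
        valor = row.get(campo)
        if not isinstance(valor, str):
            continue  # skip NaN or missing fields
        repeticiones = peso_genero if campo == "genres" else 1
        partes.extend([valor] * repeticiones)
    return " ".join(partes)

# Assumed domain stopwords. Plural forms are listed too because the
# notebook checks stopwords before stemming.
stopwords_dominio = {"movie", "movies", "film", "films"}

def filtrar_tokens(tokens):
    """Drop domain stopwords from a list of tokens."""
    return [t for t in tokens if t not in stopwords_dominio]

# Works with a plain dict as well as a pandas row, since both offer .get().
fila = {"movie_title": "The Kid",
        "genres": "Comedy, Kids & Family",
        "movie_info": float("nan")}
texto = unir_texto_ponderado(fila)
print(texto)
print(filtrar_tokens(["family", "movie"]))
```

Repetition also lengthens the document, and BM25 saturates term frequency through `k1`, so the gain from a larger `peso_genero` flattens quickly. Scoring each field separately and combining the scores with weights would be a cleaner alternative.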




