# Bimonthly Exam – Design of a Basic Information Retrieval System

**Name:** Felipe Quirola

* **Instructions:**

In this exam, students must design and implement a basic information retrieval system
using the Rotten Tomatoes movies and critic reviews dataset available on
Kaggle. The goal is to answer queries about the subject matter of the movies and their
characteristics.


# Library Imports

In [51]:
import pandas as pd
import numpy as np
import string
from collections import defaultdict
import math
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Data Loading
* We use a random sample of 5,000 movies (`random_state=42`)

In [52]:
nltk.download('punkt')
nltk.download('stopwords')

movies_path = "/kaggle/input/clapper-massive-rotten-tomatoes-movies-and-reviews/rotten_tomatoes_movies.csv"
reviews_path = "/kaggle/input/clapper-massive-rotten-tomatoes-movies-and-reviews/rotten_tomatoes_movie_reviews.csv"

movies_df = pd.read_csv(movies_path).sample(5000, random_state=42).reset_index(drop=True)
reviews_df = pd.read_csv(reviews_path)

movies_df.head()

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,id,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix
0,tim,Tim,46.0,,,,,2006-04-04,108.0,Drama,English (Australia),Michael Pate,"Colleen McCullough,Michael Pate",,,
1,mas_que_a_nada_en_el_mundo,More Than Anything in the World,50.0,,,,,2009-05-12,90.0,"Drama, Fantasy",Spanish,Andrés León Becker,,,,
2,roadkill,Roadkill,79.0,,,,,2004-07-13,80.0,Musical,English,Bruce McDonald,,,,
3,two_men_in_town_2015,Two Men in Town,26.0,47.0,R,['Language'],2015-03-06,2015-05-12,116.0,Drama,English,Rachid Bouchareb,"Rachid Bouchareb,Olivier Lorelle,Yasmina Khadra",,Cohen Media Group,
4,sprinter,Sprinter,92.0,73.0,,,2019-04-24,2019-10-22,114.0,Drama,English,Storm Saulter,Storm Saulter,,FilmRise,


# Building the Per-Movie Corpus

In [53]:
MOVIE_ID_COL = "id"
MOVIE_TITLE_COL = "title"
MOVIE_INFO_COL = "ratingContents"
REVIEW_TEXT_COL = "reviewText"

reviews_agg = (
    reviews_df
    .groupby(MOVIE_ID_COL)[REVIEW_TEXT_COL]
    .apply(lambda x: " ".join(x.dropna().astype(str)))
    .reset_index(name="all_reviews")
)

movies_merged = movies_df.merge(reviews_agg, on=MOVIE_ID_COL, how="left")
movies_merged["all_reviews"] = movies_merged["all_reviews"].fillna("")
movies_merged[MOVIE_INFO_COL] = movies_merged[MOVIE_INFO_COL].fillna("")

movies_merged["corpus_text"] = (
    movies_merged[MOVIE_TITLE_COL].astype(str) + " " +
    movies_merged[MOVIE_INFO_COL].astype(str) + " " +
    movies_merged["all_reviews"].astype(str)
)

movies_merged[["title", "genre", "corpus_text"]].head()

Unnamed: 0,title,genre,corpus_text
0,Tim,Drama,Tim Tim comes close to reproducing the plot o...
1,More Than Anything in the World,"Drama, Fantasy",More Than Anything in the World
2,Roadkill,Musical,Roadkill Roadkill was the first film to flush...
3,Two Men in Town,Drama,Two Men in Town ['Language'] Two Men in Town i...
4,Sprinter,Drama,"Sprinter Likable enough, but after breaking o..."


# Preprocessing

In [54]:
STOPWORDS = set(stopwords.words('english'))
STEMMER = SnowballStemmer('english')
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords, and stem."""
    if not isinstance(text, str):
        return []
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    tokens = word_tokenize(text)
    clean_tokens = []
    for tok in tokens:
        # Keep only alphabetic tokens that are not English stopwords
        if tok.isalpha() and tok not in STOPWORDS:
            clean_tokens.append(STEMMER.stem(tok))
    return clean_tokens

# Preprocessing the Corpus

In [55]:
def preprocess_corpus(df, text_col="corpus_text"):
    df = df.dropna(subset=[text_col]).reset_index(drop=True)
    df["tokens"] = df[text_col].astype(str).apply(preprocess_text)
    df["processed_text"] = df["tokens"].apply(lambda x: " ".join(x))
    return df

movies_corpus = preprocess_corpus(movies_merged, "corpus_text")
movies_corpus[["title", "genre", "processed_text"]].head()

Unnamed: 0,title,genre,processed_text
0,Tim,Drama,tim tim come close reproduc plot doubt negativ...
1,More Than Anything in the World,"Drama, Fantasy",anyth world
2,Roadkill,Musical,roadkil roadkil first film flush institut ment...
3,Two Men in Town,Drama,two men town languag two men town dusti sun bl...
4,Sprinter,Drama,sprinter likabl enough break block pictur get ...


# BM25

In [56]:
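def build_bm25(tokens_series):
    """Build the inverted index and the statistics that BM25 needs."""
    N = len(tokens_series)
    df_term = defaultdict(int)          # term -> number of documents containing it
    inverted_index = defaultdict(list)  # term -> list of (doc_id, term frequency)
    doc_len = []
    for doc_id, tokens in enumerate(tokens_series):
        doc_len.append(len(tokens))
        term_freq = defaultdict(int)
        for t in tokens:
            term_freq[t] += 1
        for term, tf in term_freq.items():
            df_term[term] += 1
            inverted_index[term].append((doc_id, tf))
    # Average document length, used for length normalization
    avgdl = sum(doc_len) / N
    idf = {}
    for term, df_t in df_term.items():
        # The +1 inside the log keeps the IDF positive for very common terms
        idf[term] = math.log((N - df_t + 0.5) / (df_t + 0.5) + 1)
    return inverted_index, idf, doc_len, avgdl, N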

inverted_index, idf_bm25, doc_len, avgdl, N_docs = build_bm25(movies_corpus["tokens"])
len(inverted_index), N_docs, avgdl

(32599, 5000, 125.2022)
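
The IDF weight computed in `build_bm25` can be written out as follows, where $N$ is the number of documents and $n_t$ is the number of documents containing term $t$ (these symbol names are chosen here for illustration). The two worked values assume hypothetical document frequencies of 50 and 2,500, together with the $N = 5000$ reported above.

```latex
\begin{align*}
\mathrm{idf}(t) &= \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right) \\
n_t = 50:\quad \mathrm{idf}(t) &= \ln\!\left(\frac{4950.5}{50.5} + 1\right) \approx 4.60 \\
n_t = 2500:\quad \mathrm{idf}(t) &= \ln\!\left(\frac{2500.5}{2500.5} + 1\right) = \ln 2 \approx 0.69
\end{align*}
```

A rare term therefore contributes several times more to the score than a term that appears in half of the corpus.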

In [57]:
def preprocess_query(query):
    return preprocess_text(query)

def search_bm25(query, inverted_index, idf, doc_len, avgdl, df_docs, top_k=10, k1=1.5, b=0.75):
    query_terms = preprocess_query(query)
    if not query_terms:
        return pd.DataFrame()
    scores = defaultdict(float)
    for term in query_terms:
        if term not in inverted_index:
            continue
        for doc_id, tf in inverted_index[term]:
            dl = doc_len[doc_id]
            score = idf.get(term, 0.0) * (tf*(k1+1)) / (tf + k1*(1 - b + b*dl/avgdl))
            scores[doc_id] += score
    if not scores:
        return pd.DataFrame()
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    # doc_id is a position produced by enumerate(), so select rows by position
    results = df_docs.iloc[[d for d, _ in ranked]][["title", "genre", "corpus_text"]].copy()
    results["score"] = [s for _,s in ranked]
    results = results.reset_index(drop=True)
    return results
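
To make the scoring step easier to check by hand, here is a minimal sketch that applies the same BM25 formula to a tiny corpus. The corpus `toy_docs` and the helper `toy_bm25_scores` are hypothetical and written only for illustration. They are not part of the notebook's pipeline, and the tokens are assumed to be preprocessed already.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of already-tokenized documents (not from the dataset).
toy_docs = [
    ["space", "travel", "rocket", "space"],
    ["travel", "road", "trip", "comedy", "travel", "friends"],
    ["family", "drama", "home"],
]

def toy_bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document with the BM25 formula used in search_bm25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = defaultdict(float)
    for doc_id, tokens in enumerate(docs):
        tf = Counter(tokens)
        for term in query_terms:
            if tf[term] == 0:
                continue  # documents without the term receive no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            scores[doc_id] += idf * tf[term] * (k1 + 1) / norm
    return dict(scores)

toy_scores = toy_bm25_scores(["space", "travel"], toy_docs)
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_ranking)
```

In this toy case the first document ranks above the second because it matches both query terms and "space" is the rarer of the two. The third document matches neither term, so it never enters the score table.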

# Queries

In [58]:
q1 = "space travel movie"
res1 = search_bm25(q1, inverted_index, idf_bm25, doc_len, avgdl, movies_corpus, 10)
res1[["title", "genre", "score"]]

Unnamed: 0,title,genre,score
0,Zero,"Romance, Comedy",10.652336
1,"I Travel Because I Have to, I Come Back Becaus...",Drama,10.535118
2,Teenagers From Outer Space,Sci-fi,9.859191
3,Spark: A Space Tail,"Kids & family, Comedy, Adventure, Action, Sci-...",9.683079
4,Explorers,"Kids & family, Sci-fi, Drama",9.187602
5,Last Exit: Space,Documentary,9.002396
6,They Crawl Beneath,"Horror, Mystery & thriller, Sci-fi",8.366032
7,A Space Program,Documentary,8.127221
8,Traveller,Drama,8.098265
9,Gulliver's Travels,"Kids & family, Fantasy",8.046027


In [59]:
q2 = "family movie"
res2 = search_bm25(q2, inverted_index, idf_bm25, doc_len, avgdl, movies_corpus, 10)
res2[["title", "genre", "score"]]

Unnamed: 0,title,genre,score
0,It Runs in the Family,"Comedy, Drama",8.602301
1,"Inheritance, Italian Style","Comedy, Drama",7.923303
2,Family Plot,Mystery & thriller,7.794017
3,Lionheart,Comedy,7.712151
4,The Boss Baby: Family Business,"Kids & family, Comedy, Adventure, Animation",7.679375
5,Bliss!,"Drama, Adventure",7.54748
6,Two Summers,Drama,7.374991
7,The Talent Given Us,"Comedy, Drama",7.37145
8,Marley & Me,"Comedy, Drama",7.141081
9,Eleven Days in May,"Documentary, War, History",7.126599


# Analysis of Results

Overall, the system does find movies related to the search topics. Because it uses BM25, the top-ranked results are the documents in which the query terms appear most often, adjusted for document length and for how rare each term is in the corpus.

* “space travel movie”
Most of the movies returned were related to space. However, some titles matched only the word “travel” without being about space travel, such as Traveller and Gulliver's Travels.

* “family movie”
The system retrieved several movies about families, although some of them dealt more with family conflicts, which was not what we were really looking for.

In both examples the system comes fairly close to what the user wants to find. This suggests that the inverted index and BM25 are working well at detecting the most relevant terms in the corpus.

As a future improvement, additional information could be incorporated, such as each movie's genre, in order to discard movies that are not relevant, as happened in the “family movie” search.
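
The genre-based improvement could be sketched as a post-filter on the ranked results. The helper `filter_by_genre` is hypothetical, and the rows are copied from the “family movie” output above. Plain dictionaries are used so the example runs without the dataset. The same condition could presumably be applied to the `genre` column of the DataFrame returned by `search_bm25`.

```python
# Rows copied from the "family movie" results shown earlier.
sample_results = [
    {"title": "It Runs in the Family", "genre": "Comedy, Drama", "score": 8.602301},
    {"title": "Family Plot", "genre": "Mystery & thriller", "score": 7.794017},
    {"title": "The Boss Baby: Family Business",
     "genre": "Kids & family, Comedy, Adventure, Animation", "score": 7.679375},
]

def filter_by_genre(rows, wanted_genre):
    """Keep only rows whose comma-separated genre field includes wanted_genre."""
    wanted = wanted_genre.strip().lower()
    kept = []
    for row in rows:
        # Split the genre string into individual, normalized genre labels.
        genres = [g.strip().lower() for g in str(row.get("genre", "")).split(",")]
        if wanted in genres:
            kept.append(row)
    return kept

family_only = filter_by_genre(sample_results, "Kids & family")
print([r["title"] for r in family_only])
```

A hard filter like this removes every result that lacks the genre label. A softer alternative would be to add a bonus to the BM25 score instead of discarding rows.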