# Exercise 1: Introduction to Information Retrieval

## Objective of the exercise
- Understand the problem of **searching for information** in text collections.
- Understand why an **inverted index** is needed in information retrieval.
- Program a first manual solution and then optimize it with an index.
- Evaluate the improvement in search times when suitable data structures are used.

## Part 1: Linear search over documents

### Activity
1. You will be given a dataset of movie reviews.
2. Write a function that:
   - Reads all the documents.
   - Searches for a word entered by the user.
   - Shows which documents contain the word.

In [1]:
# Data structures used: lists, dictionaries, and sets
import pandas as pd

In [3]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [46]:
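def lineal_search(column, query):
    # Scan every review and keep those that contain `query` as a literal, case-sensitive substring.
    mask = column.str.contains(query, na=False, regex=False)
    return column[mask]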

In [47]:
lineal_search(df["review"], "hello")

17       This movie made it into one of my top 10 most ...
33       One of the most significant quotes from the en...
169      The Lives of the Saints starts off with an atm...
214      holy Sh*t this was god awful. i sat in the the...
1485     hello, looking for a movie for the family to w...
                               ...                        
47161    Three Stooges - Have Rocket, Will Travel - 195...
48641    Released at a time when Duvivier was going aga...
49310    I claim no matter how hard I seek I'll never f...
49856    Man, this gets a lot of good reviews in the re...
49960    John Carpenter's career is over if this sad ex...
Name: review, Length: 87, dtype: object
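
Note that `str.contains` matches substrings, so a query such as "hello" also matches a review that contains "Othello". If whole-word, case-insensitive matches are wanted, one option is to tokenize each document and compare tokens. The following is a minimal sketch on a hypothetical three-document list; `docs`, `tokenize`, and `linear_word_search` are illustrative names that are not part of the notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def linear_word_search(docs, query):
    """Return the positions of the documents that contain `query` as a whole word."""
    query = query.lower()
    return [i for i, doc in enumerate(docs) if query in tokenize(doc)]

# Hypothetical toy corpus, for illustration only.
docs = [
    "Hello, this movie was great!",
    "Othello is a tragedy by Shakespeare.",
    "She said hello twice.",
]

print(linear_word_search(docs, "hello"))  # [0, 2]: "Othello" is not a match
```

The cost is still linear in the size of the collection, since every document is tokenized on every query.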

## Part 2: Building an inverted index

### Activity
1. Write a program that:
   - Iterates over all the documents.
   - Builds an **inverted index**, that is, a dictionary where:
     - Each keyword points to a list of the documents in which it appears.

2. Write a new search function that:
   - Queries the index directly to find the relevant documents.
   - Is much faster than linear search.

In [15]:
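# First version with sets (costly)

def dictionary(df):
    final_set = set()
    for row in df.itertuples():
        words = row.review.split()
        # The union builds a new set on every iteration, which is what makes this version costly.
        final_set = set(words) | final_set

    # dict(final_set) expects key-value pairs, so build the keys with dict.fromkeys instead.
    return dict.fromkeys(final_set)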

In [23]:
# Second version with a dictionary (cheaper than the quadratic set-union version: a single pass over all words)
def dictionary(df, column):
    word_dict = {}
    for index, row in df.iterrows():
        words = row[column].lower().split()
        for word in words:
            if word not in word_dict:
                word_dict[word] = None
    return word_dict

In [48]:
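# Third version: a dictionary that maps each word to the indices of the documents containing it
# Note: an index is appended once per occurrence, so a document that repeats a word is listed more than once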
def dictionary(df, column):
    word_dict = {}
    for idx, row in df.iterrows():
        words = row[column].split()
        for word in words:
            if word in word_dict:
                word_dict[word].append(idx)
            else:
                word_dict[word] = [idx]
    return word_dict
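
Because the third version appends a document's index once per occurrence of a word, a review that repeats the word appears several times in the search results. One possible variant, sketched here under the assumption that each document should be listed at most once per word and that matching should ignore case, collects document ids in sets. `build_inverted_index` and the toy `docs` list are hypothetical names used only for this illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            postings[word].add(doc_id)  # a set ignores repeated occurrences
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical toy corpus, for illustration only.
docs = [
    "wait wait do not go",
    "I cannot wait",
    "nothing here",
]
index = build_inverted_index(docs)
print(index["wait"])              # [0, 1]: document 0 is listed once
print(index.get("missing", []))   # []: unknown words yield no documents
```

Splitting on whitespace keeps punctuation attached to words ("wait," and "wait" are different keys), which is the same behavior as the notebook's `split()` call.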

In [52]:
def inverted_index_search(dataframe, dictionary, query):
    # Look up the posting list directly; a word missing from the index yields no rows.
    return dataframe.loc[dictionary.get(query, [])]

In [26]:
all_words = dictionary(df, "review")


In [53]:
# Search for the word "wait"

inverted_index_search(df, all_words, "wait")

Unnamed: 0,review,sentiment
83,"""Fate"" leads Walter Sparrow to come in possess...",negative
98,"This IS the worst movie I have ever seen, as w...",negative
98,"This IS the worst movie I have ever seen, as w...",negative
98,"This IS the worst movie I have ever seen, as w...",negative
98,"This IS the worst movie I have ever seen, as w...",negative
...,...,...
49964,I saw this last week during Bruce Campbell's b...,positive
49983,"I loved it, having been a fan of the original ...",positive
49993,Robert Colomb has two full-time jobs. He's kno...,negative
49993,Robert Colomb has two full-time jobs. He's kno...,negative


## Part 3: Evaluating search times
### Activity

1. Search for several words using:
      -  A small corpus.
      -  A large corpus.
2. Measure the execution time:
      -  For linear search.
      -  For search using the inverted index.
      -  Plot the results or present them in a comparison table.

In [34]:
# df_large = pd.read_csv('rotten_tomatoes_movie_reviews.csv.zip', compression='zip')
df_large = pd.read_csv("/kaggle/input/rotten-tomatoes-movies-and-critic-reviews-dataset/rotten_tomatoes_critic_reviews.csv")
df_large

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,,False,FILMINK (Australia),Fresh,,2010-02-09,With a top-notch cast and dazzling special eff...
3,m/0814255,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5,2010-02-09,Whether audiences will get behind The Lightnin...
4,m/0814255,Ethan Alter,True,Hollywood Reporter,Rotten,,2010-02-10,What's really lacking in The Lightning Thief i...
...,...,...,...,...,...,...,...,...
1130012,m/zulu_dawn,Chuck O'Leary,False,Fantastica Daily,Rotten,2/5,2005-11-02,
1130013,m/zulu_dawn,Ken Hanke,False,"Mountain Xpress (Asheville, NC)",Fresh,3.5/5,2007-03-07,"Seen today, it's not only a startling indictme..."
1130014,m/zulu_dawn,Dennis Schwartz,False,Dennis Schwartz Movie Reviews,Fresh,B+,2010-09-16,A rousing visual spectacle that's a prequel of...
1130015,m/zulu_dawn,Christopher Lloyd,False,Sarasota Herald-Tribune,Rotten,3.5/5,2011-02-28,"A simple two-act story: Prelude to war, and th..."


In [41]:
df_large["review_content"] = df_large["review_content"].astype(str)
all_tomato_words = dictionary(df_large, "review_content")

In [55]:
# Now search for the word "tomatoes"
inverted_index_search(df_large, all_tomato_words, "tomatoes")

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
122035,m/all_the_money_in_the_world_2017,Harvey S. Karten,False,Shockya.com,Fresh,A-,2017-12-19,The highlight is Christopher Plummer's perform...
219969,m/callas_forever,Steve Rhodes,False,Internet Reviews,Rotten,0.5/4,2004-11-17,The worst picture of the year ... If there had...
232519,m/chances_are,Desson Thomson,True,Washington Post,Rotten,,2000-01-01,"O'Neal's performance, on the other hand, could..."
351915,m/food_inc,Cynthia Fuchs,False,PopMatters,Fresh,,2009-06-24,"As Food, Inc. shows, these pretty, red, geneti..."
367864,m/gattaca,Roger Ebert,True,Chicago Sun-Times,Fresh,3.5/4,2000-01-01,At a time when we read about cloned sheep and ...
459182,m/invasion,Brian Holcomb,False,Beyond Hollywood,Rotten,,2008-04-19,This has to be the lamest alien invasion movie...
665022,m/power_of_one,Janet Maslin,True,New York Times,Rotten,1.5/5,2004-08-30,The film's facile treatment of racial issues m...
665023,m/power_of_one,Janet Maslin,True,New York Times,Rotten,1.5/5,2004-08-30,The film's facile treatment of racial issues m...
843965,m/sound_and_fury,Harvey S. Karten,False,Compuserve,Fresh,9/10,2001-02-08,When a documentary makes a spectator want to j...
843978,m/sound_and_fury,Harvey S. Karten,False,Compuserve,Fresh,9/10,2001-02-08,When a documentary makes a spectator want to j...


# Measuring times

Linear search:

In [56]:
import time

In [59]:
inicio = time.time()

lineal_search(df_large["review_content"], "tomatoes")

final = time.time()

tiempo_ejecucion_ms = (final - inicio) * 1000
print(f"Execution time: {tiempo_ejecucion_ms:.2f} milliseconds")

Execution time: 579.45 milliseconds


Inverted index search:

In [60]:
inicio = time.time()

inverted_index_search(df_large, all_tomato_words, "tomatoes")

final = time.time()

tiempo_ejecucion_ms = (final - inicio) * 1000
print(f"Execution time: {tiempo_ejecucion_ms:.2f} milliseconds")

Execution time: 1.22 milliseconds
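
A single `time.time()` measurement can be noisy. For the comparison table requested in the activity, one option is to repeat each search several times and keep the fastest run. The sketch below uses a synthetic corpus and the standard-library `time.perf_counter`; the corpus sizes, the vocabulary, and the helper names (`make_corpus`, `linear_search`, `build_index`, `best_time_ms`) are assumptions chosen for illustration, and the timings will differ from machine to machine.

```python
import random
import time

def make_corpus(n_docs, vocab, words_per_doc=50, seed=0):
    """Create a synthetic corpus of random words (illustrative data only)."""
    rng = random.Random(seed)
    return [" ".join(rng.choices(vocab, k=words_per_doc)) for _ in range(n_docs)]

def linear_search(docs, query):
    """Scan every document and return the ids of those containing the word."""
    return [i for i, doc in enumerate(docs) if query in doc.split()]

def build_index(docs):
    """Build a word -> list of document ids mapping (one entry per document)."""
    index = {}
    for i, doc in enumerate(docs):
        for word in set(doc.split()):
            index.setdefault(word, []).append(i)
    return index

def best_time_ms(func, repeats=5):
    """Run func several times and return the fastest run in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

vocab = [f"w{i}" for i in range(2000)]
query = "w42"
results = []
for name, n_docs in [("small", 1_000), ("large", 20_000)]:
    docs = make_corpus(n_docs, vocab)
    index = build_index(docs)
    t_linear = best_time_ms(lambda: linear_search(docs, query))
    t_index = best_time_ms(lambda: index.get(query, []))
    results.append((name, n_docs, t_linear, t_index))

# Comparison table: one row per corpus size.
print(f"{'corpus':<8}{'docs':>8}{'linear (ms)':>14}{'index (ms)':>14}")
for name, n_docs, t_linear, t_index in results:
    print(f"{name:<8}{n_docs:>8}{t_linear:>14.3f}{t_index:>14.3f}")
```

The index lookup time excludes the one-time cost of building the index, which is the same convention used in the measurements above.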
