# Exercise 1: Introduction to Information Retrieval

## Objective of the exercise
- Understand the problem of **searching for information** in text collections.
- Understand why information retrieval needs an **inverted index**.
- Program a first manual solution and then optimize it with an index.
- Measure how much search time improves when suitable data structures are used.

## Part 1: Linear search over documents

### Activity
1. You will be given a dataset of movie reviews.
2. Write a function that:
   - Reads all the documents.
   - Searches for a word entered by the user.
   - Shows which documents contain the word.

In [1]:
import pandas as pd

In [4]:
path = 'data/IMDB Dataset.csv'
df = pd.read_csv(path)
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [48]:
def buscar(docs, query):
    # Returns the documents that contain the query (case-sensitive substring match)
    return docs[docs.apply(lambda x: query in str(x))]

In [29]:
buscar(df['review'],'wonderful')

1        A wonderful little production. <br /><br />The...
2        I thought this was a wonderful way to spend ti...
29       'War movie' is a Hollywood genre that has been...
41       This movie is based on the book, "A Many Splen...
59       I just watched The Dresser this evening, havin...
                               ...                        
49852    Russ and Valerie are having discussions about ...
49921    Antonio Margheriti's "Danza Macabra" aka. "Cas...
49935    "Nurse Betty" is the kind of movie you can't d...
49938    I made a big mistake going to see this film. T...
49941    Why did the histories of Mary and Rhoda have t...
Name: review, Length: 3083, dtype: object
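
The `query in str(x)` test above is a case-sensitive substring match, so a query such as `wonderful` also hits longer words like `wonderfully` and misses `Wonderful`. As a minimal sketch of a stricter linear scan, the hypothetical helper `buscar_palabra` below matches whole words and ignores case. It works on a plain list of strings, and the three sample sentences are made up for illustration; they are not taken from the dataset.

```python
import re

def buscar_palabra(docs, query):
    """Linear scan: return the positions of the documents that contain
    `query` as a whole word, ignoring case."""
    # re.escape guards against regex metacharacters in the user's query
    patron = re.compile(r'\b' + re.escape(query) + r'\b', flags=re.IGNORECASE)
    return [i for i, texto in enumerate(docs) if patron.search(str(texto))]

# Toy corpus (made up for illustration)
docs_demo = [
    "A Wonderful little production.",
    "It ended wonderfully, I suppose.",
    "Not wonderful at all.",
]

print(buscar_palabra(docs_demo, "wonderful"))  # [0, 2]
```

The cost is still one pass over every document per query, which is the behavior the inverted index in Part 2 is meant to avoid.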

## Part 2: Building an inverted index

### Activity
1. Write a program that:
   - Iterates over all the documents.
   - Builds an **inverted index**, that is, a dictionary where:
     - Each keyword maps to the list of documents in which it appears.

2. Write a new search function that:
   - Queries the index directly to find the relevant documents.
   - Avoids scanning every document, so it should be much faster than the linear search.
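
To make the target structure concrete, here is a small sketch of an inverted index over three made-up documents. The names `docs_demo` and `indice_demo` are illustrative only; the notebook's own implementation follows in the next cell.

```python
from collections import defaultdict

# Toy corpus: document id -> text (made up for illustration)
docs_demo = {
    0: "great movie great cast",
    1: "boring movie",
    2: "great soundtrack",
}

indice_demo = defaultdict(list)
for doc_id, texto in docs_demo.items():
    # set() ensures each document is listed at most once per word
    for palabra in set(texto.split()):
        indice_demo[palabra].append(doc_id)

# Convert to a plain dict so that lookups never create empty entries
indice_demo = dict(indice_demo)

print(indice_demo["great"])          # [0, 2]
print(indice_demo["movie"])          # [0, 1]
print(indice_demo.get("absent", []))  # []
```

A lookup is a single dictionary access, so its cost does not grow with the number of documents that do not contain the word.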

In [None]:
from collections import defaultdict
import re
def crear_indice_invertido(docs):
    # Returns a dictionary holding the inverted index; applies basic normalization
    # (lowercasing and removing every character that is not a-z or whitespace)
    indice_invertido = defaultdict(list)
    for idx, fila in docs.items():
        texto = str(fila).lower()
        texto_normalizado = re.sub(r'[^a-z\s]', '', texto)
        palabras = set(texto_normalizado.split())
        for palabra in palabras:
            indice_invertido[palabra].append(idx)
    return dict(sorted(indice_invertido.items()))

In [25]:
indice_invertido = crear_indice_invertido(df['review'])
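
The activity asks for a search function that consults the index. One possible shape, shown here as a sketch, is a hypothetical helper `buscar_en_indice` that normalizes the query the same way `crear_indice_invertido` normalizes the documents and returns an empty list instead of raising `KeyError` for unknown words. The small `indice_demo` dictionary and its document ids are made up so the block runs on its own.

```python
import re

def buscar_en_indice(indice, query):
    """Return the posting list for a single-word query, or [] if absent."""
    # Mirror the index normalization: lowercase, keep only a-z and whitespace
    termino = re.sub(r'[^a-z\s]', '', str(query).lower()).strip()
    return indice.get(termino, [])

# Toy index with made-up document ids
indice_demo = {"useful": [3, 7], "wonderful": [1, 2, 5]}

print(buscar_en_indice(indice_demo, "Useful!"))  # [3, 7]
print(buscar_en_indice(indice_demo, "missing"))  # []
```

With the real index, the returned ids could be passed to `df['review'].loc[...]` exactly as the next cell does with `indice_invertido['useful']`.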

In [42]:
import time
tiempo_inicial = time.time()
print(df['review'].loc[indice_invertido['useful']])
tiempo_final = time.time()
print(f"Tiempo de busqueda indice invertido: {tiempo_final - tiempo_inicial} segundos")

tiempo_inicial_linear = time.time()

print(buscar(df['review'], 'useful'))
tiempo_final_linear = time.time()
print(f"Tiempo de busqueda lineal: {tiempo_final_linear - tiempo_inicial_linear} segundos")

96       My guess would be this was originally going to...
290      I saw the movie "Hoot" and then I immediately ...
590      I was -Unlike most of the reviewers- not born ...
1171     A SHIRLEY TEMPLE Short Subject.<br /><br />It ...
1291     Since there have been so many reviews of this ...
                               ...                        
48240    Solo starts as a team of US soldiers go into S...
48500    The End Of Suburbia (TEOS) is a very useful fi...
48508    I have read the whole 'A wrinkle in time' book...
48938    i've heard a lot about the inventive camera-wo...
49993    Robert Colomb has two full-time jobs. He's kno...
Name: review, Length: 174, dtype: object
Tiempo de busqueda indice invertido: 0.0019996166229248047 segundos
96       My guess would be this was originally going to...
290      I saw the movie "Hoot" and then I immediately ...
590      I was -Unlike most of the reviewers- not born ...
1171     A SHIRLEY TEMPLE Short Subject.<br /><br />It ...
1291  

## Part 3: Evaluating search times
### Activity

1. Search for several words using:
   - A small corpus.
   - A large corpus.
2. Measure the execution time:
   - For the linear search.
   - For the search that uses the inverted index.
3. Plot the results or present them in a comparison table.
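
A possible way to organize these measurements is sketched below. It uses `time.perf_counter()`, which is better suited to short intervals than `time.time()`, repeats each measurement and keeps the best run, and prints a small comparison table. The corpus is synthetic and the names `medir`, `lineal`, and `con_indice` are hypothetical; the actual timings will differ on the real datasets.

```python
import time
from collections import defaultdict

def medir(funcion, repeticiones=5):
    """Best wall-clock time (seconds) of funcion() over several runs."""
    tiempos = []
    for _ in range(repeticiones):
        inicio = time.perf_counter()
        funcion()
        tiempos.append(time.perf_counter() - inicio)
    return min(tiempos)

# Synthetic corpus: every 50th review contains the word "useful"
corpus = [
    f"review number {i} was {'useful' if i % 50 == 0 else 'fine'}"
    for i in range(20_000)
]

# Build the inverted index once, outside the timed section
indice = defaultdict(list)
for i, texto in enumerate(corpus):
    for palabra in set(texto.split()):
        indice[palabra].append(i)

def lineal(palabra):
    # Scans every document on every query
    return [i for i, texto in enumerate(corpus) if palabra in texto.split()]

def con_indice(palabra):
    # Single dictionary lookup
    return indice.get(palabra, [])

print(f"{'query':<10}{'linear (s)':>14}{'index (s)':>14}")
for palabra in ["useful", "fine", "absent"]:
    t_lineal = medir(lambda: lineal(palabra))
    t_indice = medir(lambda: con_indice(palabra))
    print(f"{palabra:<10}{t_lineal:>14.6f}{t_indice:>14.6f}")
```

Keeping `print` calls out of the timed functions matters here: rendering a large result can take longer than the index lookup itself and would blur the comparison.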

In [44]:
df_large = pd.read_csv('data/' + 'rotten_tomatoes_movie_reviews.csv')
df_large

Unnamed: 0,id,reviewId,creationDate,criticName,isTopCritic,originalScore,reviewState,publicatioName,reviewText,scoreSentiment,reviewUrl
0,beavers,1145982,2003-05-23,Ivan M. Lincoln,False,3.5/4,fresh,Deseret News (Salt Lake City),Timed to be just long enough for most youngste...,POSITIVE,http://www.deseretnews.com/article/700003233/B...
1,blood_mask,1636744,2007-06-02,The Foywonder,False,1/5,rotten,Dread Central,It doesn't matter if a movie costs 300 million...,NEGATIVE,http://www.dreadcentral.com/index.php?name=Rev...
2,city_hunter_shinjuku_private_eyes,2590987,2019-05-28,Reuben Baron,False,,fresh,CBR,The choreography is so precise and lifelike at...,POSITIVE,https://www.cbr.com/city-hunter-shinjuku-priva...
3,city_hunter_shinjuku_private_eyes,2558908,2019-02-14,Matt Schley,False,2.5/5,rotten,Japan Times,The film's out-of-touch attempts at humor may ...,NEGATIVE,https://www.japantimes.co.jp/culture/2019/02/0...
4,dangerous_men_2015,2504681,2018-08-29,Pat Padua,False,,fresh,DCist,Its clumsy determination is endearing and some...,POSITIVE,http://dcist.com/2015/11/out_of_frame_dangerou...
...,...,...,...,...,...,...,...,...,...,...,...
1444958,thor_love_and_thunder,102706151,2022-07-05,Christie Cronan,False,7/10,fresh,Raising Whasians,Solid but not totally sold&#44; Thor&#58; Ragn...,POSITIVE,https://raisingwhasians.com/thor-love-and-thun...
1444959,thor_love_and_thunder,102706150,2022-07-05,Ian Sandwell,False,4/5,fresh,Digital Spy,Thor&#58; Love and Thunder is the most enterta...,POSITIVE,https://www.digitalspy.com/movies/a40496050/th...
1444960,thor_love_and_thunder,102706149,2022-07-05,Lauren LaMagna,False,8/10,fresh,Next Best Picture,&quot;Thor&#58; Love and Thunder&quot; is a st...,POSITIVE,https://www.nextbestpicture.com/thor-love-and-...
1444961,thor_love_and_thunder,102706148,2022-07-05,Jake Cole,True,1/4,rotten,Slant Magazine,Across Taika Waititi&#8217;s film&#44; a war a...,NEGATIVE,https://www.slantmagazine.com/film/thor-love-a...


In [45]:
indice_invertido = crear_indice_invertido(df_large['reviewText'])

In [49]:
tiempo_inicial = time.time()
print(df_large['reviewText'].loc[indice_invertido['useful']])
tiempo_final = time.time()
print(f"Tiempo de busqueda indice invertido: {tiempo_final - tiempo_inicial} segundos")

tiempo_inicial_linear = time.time()

print(buscar(df_large['reviewText'],'useful'))
tiempo_final_linear = time.time()
print(f"Tiempo de busqueda lineal: {tiempo_final_linear - tiempo_inicial_linear} segundos")

7743       Even though it isn't the most well-written mov...
9925           Gimmicky but making a valid and useful point.
15689      Provides plenty of moving case studies...[but]...
15730      Finding North is a useful, engaging and enragi...
17795      No doubt, the outtakes on the DVD release will...
                                 ...                        
1437453    Parents will smirk at the cameos from King Tut...
1437998    As a viewer, I feel privileged to be able to p...
1442860    I guess I can be most useful to this weekend's...
1442897    King's attempt at "racial balance" patheticall...
1443164    A really clever spin on the coming-of-age come...
Name: reviewText, Length: 470, dtype: object
Tiempo de busqueda indice invertido: 0.001997232437133789 segundos
7743       Even though it isn't the most well-written mov...
9925           Gimmicky but making a valid and useful point.
15689      Provides plenty of moving case studies...[but]...
15730      Finding North is a usef

In [51]:
common_review_phrases = [
    "a must see",
    "highly recommended",
    "worth watching",
    "great performance",
    "well written",
    "amazing story",
    "would watch again",
    "not what I expected",
    "could have been better",
    "loved the movie"
]
tiempo_inicial = time.time()
for phrase in common_review_phrases:
    print(f"Searching for phrase: '{phrase}'")
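    # Note: only the first word of the phrase is looked up in the index, so this
    # returns every document containing that word, not the exact phrase.
    df_large['reviewText'].loc[indice_invertido.get(phrase.split()[0], [])]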
tiempo_final = time.time()
print(f"Tiempo de busqueda indice invertido: {tiempo_final - tiempo_inicial} segundos")

tiempo_inicial_linear = time.time()
for phrase in common_review_phrases:
    print(f"Searching for phrase: '{phrase}'")
    buscar(df_large['reviewText'], phrase)
tiempo_final_linear = time.time()
print(f"Tiempo de busqueda lineal: {tiempo_final_linear - tiempo_inicial_linear} segundos")

Searching for phrase: 'a must see'
Searching for phrase: 'highly recommended'
Searching for phrase: 'worth watching'
Searching for phrase: 'great performance'
Searching for phrase: 'well written'
Searching for phrase: 'amazing story'
Searching for phrase: 'would watch again'
Searching for phrase: 'not what I expected'
Searching for phrase: 'could have been better'
Searching for phrase: 'loved the movie'
Tiempo de busqueda indice invertido: 0.12078356742858887 segundos
Searching for phrase: 'a must see'
Searching for phrase: 'highly recommended'
Searching for phrase: 'worth watching'
Searching for phrase: 'great performance'
Searching for phrase: 'well written'
Searching for phrase: 'amazing story'
Searching for phrase: 'would watch again'
Searching for phrase: 'not what I expected'
Searching for phrase: 'could have been better'
Searching for phrase: 'loved the movie'
Tiempo de busqueda lineal: 2.882619619369507 segundos
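
Note that the two loops above do not answer the same question: the index lookup uses only the first word of each phrase, while `buscar` tests for the whole phrase as a substring. One way to make the index side comparable, sketched here under the assumption that the index was built with the same normalization as `crear_indice_invertido`, is to intersect the posting lists of every word in the phrase and then verify the phrase only on the surviving candidates. The helper names `normalizar` and `buscar_frase` and the toy corpus are hypothetical.

```python
import re
from collections import defaultdict

def normalizar(texto):
    # Same basic normalization as the index: lowercase, keep a-z and whitespace
    return re.sub(r'[^a-z\s]', '', str(texto).lower())

def buscar_frase(docs, indice, frase):
    """Intersect the posting lists of all words in the phrase, then check
    each candidate document for the exact (normalized) phrase."""
    palabras = normalizar(frase).split()
    if not palabras:
        return []
    candidatos = set(indice.get(palabras[0], []))
    for palabra in palabras[1:]:
        candidatos &= set(indice.get(palabra, []))
    # Pad with spaces so the phrase only matches on word boundaries
    objetivo = " " + " ".join(palabras) + " "
    return sorted(
        i for i in candidatos
        if objetivo in " " + " ".join(normalizar(docs[i]).split()) + " "
    )

# Toy corpus (made up for illustration)
docs_demo = [
    "Highly recommended, a must see!",
    "Recommended? Not highly.",
    "A must for fans; see it twice.",
]
indice_demo = defaultdict(list)
for i, texto in enumerate(docs_demo):
    for palabra in set(normalizar(texto).split()):
        indice_demo[palabra].append(i)

print(buscar_frase(docs_demo, indice_demo, "a must see"))          # [0]
print(buscar_frase(docs_demo, indice_demo, "highly recommended"))  # [0]
```

The verification step still reads text, but only for documents that contain every word of the phrase, which is usually a small fraction of the corpus.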
