# Exercise 1: Introduction to Information Retrieval

## Objectives
- Understand the problem of **searching for information** in text collections.
- Understand why information retrieval needs an **inverted index**.
- Program a first manual solution and then optimize it with an index.
- Evaluate how much search times improve when suitable data structures are used.

## Part 1: Linear search over documents

### Activity
1. You will be given a dataset of movie reviews.
2. Write a function that:
   - Reads all the documents.
   - Searches for a word entered by the user.
   - Shows which documents contain the word.
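The three steps above can be sketched without pandas as a plain loop over a list of strings. This is only a minimal illustration: the function name `linear_search_docs` and the toy corpus are hypothetical and not part of the dataset.

```python
def linear_search_docs(documents, word):
    """Return the positions of the documents that contain `word` as a substring."""
    matches = []
    for doc_id, text in enumerate(documents):  # every document is scanned once
        if word in text:
            matches.append(doc_id)
    return matches


# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "a good movie with a weak ending",
    "bad plot and bad acting",
    "not bad, actually quite good",
]

toy_hits = linear_search_docs(toy_docs, "good")
print(toy_hits)  # [0, 2]
```

The cost grows with the total amount of text, because each query reads every document again.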

In [4]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/clapper-massive-rotten-tomatoes-movies-and-reviews/rotten_tomatoes_movies.csv
/kaggle/input/clapper-massive-rotten-tomatoes-movies-and-reviews/rotten_tomatoes_movie_reviews.csv
/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


In [5]:
#path = '../data/'
#df = pd.read_csv(path + 'IMDB Dataset.csv.zip', compression='zip')

df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [6]:
def hasgood(str_):
    # Return True if the string contains the substring 'good', otherwise False
    return 'good' in str_

In [7]:
reviewHasGood = df['review'].apply(hasgood)  # Storing the result in reviewHasGood is optional; it keeps the next cells simpler
reviewHasGood

0        False
1        False
2        False
3        False
4         True
         ...  
49995     True
49996    False
49997     True
49998    False
49999     True
Name: review, Length: 50000, dtype: bool

In [8]:
reviewHasGood.value_counts()

review
False    31229
True     18771
Name: count, dtype: int64
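Note that `'good' in str_` is a substring test, so the count above may also include reviews where "good" only appears inside a longer word. If whole-word matching is wanted, one possible variant (the helper `has_word` is a hypothetical name, not part of the exercise) uses a regular expression with word boundaries:

```python
import re


def has_word(text, word):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sample = "Goodness, what a goodbye scene."
substring_hit = "good" in sample      # True: matches inside "goodbye"
word_hit = has_word(sample, "good")   # False: no standalone word "good"
print(substring_hit, word_hit)
```

In the notebook it could be applied the same way as `hasgood`, for example `df['review'].apply(lambda x: has_word(x, 'good'))`.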

In [9]:
def hasquery(query, text):
    # Return True if the query string occurs in text, otherwise False
    return query in text

In [30]:
query = 'not bad'
reviewHasAnyQuery = df['review'].apply(lambda x: hasquery(query, x))
reviewHasAnyQuery

0        False
1        False
2        False
3        False
4        False
         ...  
49995    False
49996    False
49997    False
49998    False
49999    False
Name: review, Length: 50000, dtype: bool

In [11]:

reviewHasAnyQuery.value_counts()

review
False    49752
True       248
Name: count, dtype: int64

In [12]:
df[reviewHasAnyQuery]

Unnamed: 0,review,sentiment
14,This a fantastic movie of three prisoners who ...,positive
67,I really like Salman Kahn so I was really disa...,negative
240,This movie was absolutely pathetic. A pitiful ...,negative
420,This movie starts with the main character lyin...,negative
691,"First, I realize that a ""1"" rating is supposed...",negative
...,...,...
48844,Not a terrible film -- my 10 year old boy love...,negative
48851,What can you say about this movie? It was not ...,negative
48912,"I absolutely love this show, but I saw the sec...",positive
49843,"I'm shocked that all the ""hated it"" ratings ar...",negative


In [13]:

df['review'].loc[14]

"This a fantastic movie of three prisoners who become famous. One of the actors is george clooney and I'm not a fan but this roll is not bad. Another good thing about the movie is the soundtrack (The man of constant sorrow). I recommand this movie to everybody. Greetings Bart"

## Part 2: Building an inverted index

### Activity
1. Write a program that:
   - Iterates over all the documents.
   - Builds an **inverted index**, that is, a dictionary where:
     - Each keyword points to the list of documents in which it appears.

2. Write a new search function that:
   - Queries the index directly to find the relevant documents.
   - Is much faster than the linear search.
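Before working with the real reviews, the idea can be checked on a tiny corpus. The sketch below is one possible way to build the dictionary, using `collections.defaultdict`; the function name `build_inverted_index` and the three toy documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each whitespace-separated word to the list of document positions containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for word in set(text.split()):  # set() so each document is listed once per word
            index[word].append(doc_id)
    return index


# Hypothetical toy corpus
toy_docs = ["a good movie", "a bad movie", "good acting bad plot"]
toy_index = build_inverted_index(toy_docs)

print(toy_index["good"])   # [0, 2]
print(toy_index["movie"])  # [0, 1]
```

A lookup is then a single dictionary access, so its cost does not depend on how many documents were indexed. Using `toy_index.get(word, [])` for queries avoids creating empty entries for unknown words.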

In [17]:
inverted_index = {}

# Process only part of the dataset
subset = df.head(5000)

for i, review in enumerate(subset['review']):
    words = review.split()
    for word in set(words):  # set() avoids listing the same document twice for a word
        inverted_index.setdefault(word, []).append(i)
print("Size of the inverted index:", len(inverted_index))


Size of the inverted index: 101552
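Because `split()` only cuts on whitespace, tokens keep their case and any attached punctuation, so variants of the same word become separate keys. This probably inflates the number of entries reported above. A possible normalization step, shown here as a sketch with a hypothetical `tokenize` helper, lowercases the text and keeps only letters, digits and apostrophes:

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "Good, good... GOOD!"
raw_tokens = sample.split()      # ['Good,', 'good...', 'GOOD!']
norm_tokens = tokenize(sample)   # ['good', 'good', 'good']

print(len(set(raw_tokens)), len(set(norm_tokens)))  # 3 1
```

Replacing `review.split()` with `tokenize(review)` in the loop would be one way to use it; queries would then need the same normalization.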


In [18]:
def search_in_index(query, index, df):
    if query in index:
        doc_ids = index[query]
        print(f"The word '{query}' appears in {len(doc_ids)} documents.")
        return df.iloc[doc_ids]
    else:
        print(f"The word '{query}' does not appear in any document.")
        return pd.DataFrame()

In [19]:
resultados = search_in_index('good', inverted_index, subset)
resultados.head()

The word 'good' appears in 1562 documents.


Unnamed: 0,review,sentiment
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
12,So im not a big fan of Boll's work but then ag...,negative
14,This a fantastic movie of three prisoners who ...,positive
20,After the success of Die Hard and it's sequels...,positive
21,I had the terrible misfortune of having to vie...,negative
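The index above answers single-word queries. A query such as 'not bad' from Part 1 is not a key of the index, but the documents containing every word of the query can be obtained by intersecting the posting lists. The sketch below assumes an index shaped like `inverted_index`; the function name `search_all_words` and the toy index are hypothetical.

```python
def search_all_words(query, index):
    """Return the sorted ids of the documents that contain every word of `query`."""
    words = query.split()
    if not words:
        return []
    postings = [set(index.get(word, [])) for word in words]
    return sorted(set.intersection(*postings))


# Hypothetical toy index: word -> list of document positions
toy_index = {"not": [0, 2, 3], "bad": [1, 2, 3], "good": [0]}

both = search_all_words("not bad", toy_index)
print(both)  # [2, 3]
```

This only guarantees that both words occur somewhere in the document, not that they are adjacent; checking the exact phrase would still require inspecting the candidate documents or storing word positions in the index.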


## Part 3: Evaluating search times
### Activity

1. Search for several words using:
      -  A small corpus.
      -  A large corpus.
2. Measure the execution time:
      -  For linear search.
      -  For search using the inverted index.
3. Plot the results or present them in a comparison table.
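A single timing can be noisy, so one option is to repeat each measurement and keep the best run. The sketch below uses the standard-library `timeit` module on a synthetic corpus; the corpus and the helper names `linear` and `indexed` are assumptions for illustration and are unrelated to the review datasets.

```python
import timeit

# Hypothetical synthetic corpus: every third document contains the word "good"
docs = [f"review number {i} was {'good' if i % 3 == 0 else 'bad'}" for i in range(20000)]

# Build the inverted index once, outside the timed region
index = {}
for doc_id, text in enumerate(docs):
    for word in set(text.split()):
        index.setdefault(word, []).append(doc_id)


def linear(word):
    """Scan every document and collect those containing `word` as a token."""
    return [i for i, text in enumerate(docs) if word in text.split()]


def indexed(word):
    """Look the word up directly in the inverted index."""
    return index.get(word, [])


# Repeat each measurement five times and keep the fastest run
t_linear = min(timeit.repeat(lambda: linear("good"), number=1, repeat=5))
t_index = min(timeit.repeat(lambda: indexed("good"), number=1, repeat=5))
print(f"linear: {t_linear:.6f} s, index: {t_index:.6f} s")
```

Both functions return the same document ids here, which makes the comparison fair: they answer the same question and differ only in how the answer is found.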

In [20]:
import time

In [31]:
df_small = df.head(5000)
df_large = pd.read_csv("/kaggle/input/clapper-massive-rotten-tomatoes-movies-and-reviews/rotten_tomatoes_movie_reviews.csv")
df_large

Unnamed: 0,id,reviewId,creationDate,criticName,isTopCritic,originalScore,reviewState,publicatioName,reviewText,scoreSentiment,reviewUrl
0,beavers,1145982,2003-05-23,Ivan M. Lincoln,False,3.5/4,fresh,Deseret News (Salt Lake City),Timed to be just long enough for most youngste...,POSITIVE,http://www.deseretnews.com/article/700003233/B...
1,blood_mask,1636744,2007-06-02,The Foywonder,False,1/5,rotten,Dread Central,It doesn't matter if a movie costs 300 million...,NEGATIVE,http://www.dreadcentral.com/index.php?name=Rev...
2,city_hunter_shinjuku_private_eyes,2590987,2019-05-28,Reuben Baron,False,,fresh,CBR,The choreography is so precise and lifelike at...,POSITIVE,https://www.cbr.com/city-hunter-shinjuku-priva...
3,city_hunter_shinjuku_private_eyes,2558908,2019-02-14,Matt Schley,False,2.5/5,rotten,Japan Times,The film's out-of-touch attempts at humor may ...,NEGATIVE,https://www.japantimes.co.jp/culture/2019/02/0...
4,dangerous_men_2015,2504681,2018-08-29,Pat Padua,False,,fresh,DCist,Its clumsy determination is endearing and some...,POSITIVE,http://dcist.com/2015/11/out_of_frame_dangerou...
...,...,...,...,...,...,...,...,...,...,...,...
1444958,thor_love_and_thunder,102706151,2022-07-05,Christie Cronan,False,7/10,fresh,Raising Whasians,Solid but not totally sold&#44; Thor&#58; Ragn...,POSITIVE,https://raisingwhasians.com/thor-love-and-thun...
1444959,thor_love_and_thunder,102706150,2022-07-05,Ian Sandwell,False,4/5,fresh,Digital Spy,Thor&#58; Love and Thunder is the most enterta...,POSITIVE,https://www.digitalspy.com/movies/a40496050/th...
1444960,thor_love_and_thunder,102706149,2022-07-05,Lauren LaMagna,False,8/10,fresh,Next Best Picture,&quot;Thor&#58; Love and Thunder&quot; is a st...,POSITIVE,https://www.nextbestpicture.com/thor-love-and-...
1444961,thor_love_and_thunder,102706148,2022-07-05,Jake Cole,True,1/4,rotten,Slant Magazine,Across Taika Waititi&#8217;s film&#44; a war a...,NEGATIVE,https://www.slantmagazine.com/film/thor-love-a...


In [32]:
def linear_search(df, query, col='review'):
    # Case-insensitive match; query is treated as a regular expression (pandas default)
    # and missing values count as no match
    return df[df[col].str.contains(query, case=False, na=False)]

# Small corpus
start_small = time.time()
linear_small = linear_search(df_small, 'good')
time_small = time.time() - start_small

# Large corpus
start_large = time.time()
linear_large = linear_search(df_large, 'good', col='reviewText')
time_large = time.time() - start_large


In [33]:
# Inverted index for the small corpus
inverted_index_small = {}
for i, review in enumerate(df_small['review']):
    for word in set(review.split()):
        inverted_index_small.setdefault(word, []).append(i)

# Search function using the index
def search_in_index(query, index, df, col='review'):
    if query in index:
        return df.iloc[index[query]]
    else:
        return pd.DataFrame()

# Search and timing
start_inv_small = time.time()
inv_result_small = search_in_index('good', inverted_index_small, df_small)
time_inv_small = time.time() - start_inv_small

In [34]:
resultados = pd.DataFrame({
    'Method': ['Linear search', 'Inverted index'],
    'Small corpus (seconds)': [time_small, time_inv_small],
    'Large corpus (seconds)': [time_large, 'Not measured (too large)']
})
print(resultados)

           Method  Small corpus (seconds)   Large corpus (seconds)
0   Linear search                0.052666                 2.307402
1  Inverted index                0.000800 Not measured (too large)
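The two small-corpus timings can be summarized as a single speedup factor. The values below are copied from the table above; the variable names match the ones used in the timing cells.

```python
# Timings in seconds for the small corpus, copied from the results table
time_small = 0.052666       # linear search
time_inv_small = 0.000800   # inverted index lookup

speedup = time_small / time_inv_small
print(f"Speedup on the small corpus: {speedup:.1f}x")  # Speedup on the small corpus: 65.8x
```

This figure covers only the lookup: the time spent building `inverted_index_small` falls outside the timed region, and the two methods do not match exactly the same documents (case-insensitive substring versus exact token), so it should be read as an approximate comparison.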
