# Workshop 05: Indexing Tools

**Objective:**

This exercise reviews the fundamental concepts of classic indexing in information retrieval systems. You will implement an inverted index by hand and then explore tools such as *Whoosh* and *Elasticsearch* to build and query indexes.

### Part 1: Building an Inverted Index by Hand

**1. Load the data in Python:**

Use pandas to load and explore the dataset.


In [8]:
# Libraries
import pandas as pd
import re
from collections import defaultdict

In [2]:
# Load the dataset
data = pd.read_csv("../data/wiki_movie_plots_deduped.csv")
data = data[['Title', 'Plot']].dropna()  # Keep only rows where both fields are non-null
print(data.head())

                              Title  \
0            Kansas Saloon Smashers   
1     Love by the Light of the Moon   
2           The Martyred Presidents   
3  Terrible Teddy, the Grizzly King   
4            Jack and the Beanstalk   

                                                Plot  
0  A bartender is working at a saloon, serving dr...  
1  The moon, painted with a smiling face hangs ov...  
2  The film, just over a minute long, is composed...  
3  Lasting just 61 seconds and consisting of two ...  
4  The earliest known adaptation of the classic f...  


**2. Build an inverted index:**

Apply basic text normalization, then build the inverted index.

In [5]:
def normalize_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

data['Normalized_Plot'] = data['Plot'].apply(normalize_text)
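
To make the effect of the normalization step concrete, here is a minimal sketch that applies the same two operations to made-up sample strings (they are not taken from the dataset). It also shows one side effect worth keeping in mind: deleting punctuation outright glues together the parts of contractions and hyphenated words.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop every character that is neither a word character nor whitespace."""
    text = text.lower()
    return re.sub(r'[^\w\s]', '', text)

# Made-up sample strings, used only for illustration
print(normalize_text("The Cyborg's plan: escape, quickly!"))
# the cyborgs plan escape quickly

print(normalize_text("well-known"))
# wellknown
```

If that merging is undesirable, one alternative is to replace punctuation with a space instead of an empty string, at the cost of producing fragments such as a stray `s`.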

In [None]:
inverted_index = defaultdict(list)

for idx, row in data.iterrows():
    tokens = row['Normalized_Plot'].split()
    for token in set(tokens):  # Avoid duplicate postings within a single document
        inverted_index[token].append(row['Title'])

print(dict(list(inverted_index.items())[:10]))  # Show the first 10 entries
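
The structure is easier to inspect on a tiny corpus. The sketch below builds the same kind of index over three hypothetical, already-normalized documents. As an assumption that differs from the cell above, it stores document IDs rather than titles in each posting list, which keeps films that share a title apart.

```python
from collections import defaultdict

# Hypothetical three-document corpus, already normalized
docs = {
    0: "a cyborg hunts a rogue robot",
    1: "a robot learns to love",
    2: "the cyborg and the detective",
}

toy_index = defaultdict(list)
for doc_id, text in docs.items():
    for token in set(text.split()):  # one posting per document, even if the token repeats
        toy_index[token].append(doc_id)

print(toy_index["cyborg"])  # [0, 2]
print(toy_index["robot"])   # [0, 1]
```

Each key is a term and each value is its posting list, the documents in which that term occurs.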



**3. Query the index:**

Implement a function that looks up a keyword.

In [7]:
def search_inverted_index(query, index):
    query = normalize_text(query)
    return index.get(query, [])

print(search_inverted_index("cyborg", inverted_index))

['The Colossus of New York', 'Cyborg 2087', 'Spacehunter: Adventures in the Forbidden Zone', 'Superman III', 'Warrior of the Lost World', 'The Terminator', 'RoboCop', 'Cyborg', 'Moontrap', 'Nemesis', 'Cyborg 3: The Recycler', 'Cyborg Cop II', 'Space Truckers', 'Future War', 'Leprechaun 4: In Space', 'Virus', 'Jason X', 'Treasure Planet', 'Godzilla: Final Wars', 'Star Wars: Episode III – Revenge of the Sith', 'Tekken', 'Terminator Salvation', 'Justice League: The Flashpoint Paradox', 'Superman: Unbound', 'RoboCop', 'Hardcore Henry', 'Logan', 'Fortress', 'The Machine', 'Kill Command', 'April and the Extraordinary World', 'Sixty Million Dollar Man', 'Kung Fu Cyborg', 'Future X-Cops', 'Godzilla vs. Megalon', 'Kamen Rider V3', 'Kamen Rider V3 vs. the Destron Monsters', 'Terror of Mechagodzilla', 'JAKQ Dengeki Tai', 'JAKQ Dengeki Tai vs. Goranger', 'Sun Vulcan Movie', '964 Pinocchio', 'Zeiram', 'Tetsuo II: Body Hammer', 'Ghost in the Shell', 'Godzilla Against Mechagodzilla', 'Godzilla: Tokyo
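
The lookup above treats the whole query as a single key, so a query with more than one word finds nothing unless that exact string happens to be a token. A possible extension, sketched here on a hypothetical toy index, splits the query into terms and intersects their posting lists (boolean AND). The function name `search_all_terms` is an illustration, not part of the workshop code.

```python
import re

def normalize_text(text):
    """Same normalization as in the workshop: lowercase and strip punctuation."""
    return re.sub(r'[^\w\s]', '', text.lower())

def search_all_terms(query, index):
    """Return the documents that contain every term of the query (boolean AND)."""
    terms = normalize_text(query).split()
    if not terms:
        return []
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical toy index, term -> list of titles
toy_index = {
    "cyborg": ["Film A", "Film C"],
    "police": ["Film B", "Film C"],
}

print(search_all_terms("Cyborg police", toy_index))  # ['Film C']
print(search_all_terms("cyborg dragon", toy_index))  # []
```

Using `index.get` rather than `index[term]` matters with a `defaultdict`, because subscripting would insert an empty entry for every unknown term.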

### Part 2: Using Whoosh for Indexing and Retrieval

In [9]:
!pip install whoosh

Collecting whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl.metadata (3.1 kB)
Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)
Installing collected packages: whoosh
Successfully installed whoosh-2.7.4





**1. Set up the index with Whoosh:**

Define a schema and create the index.

In [10]:
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in
import os

schema = Schema(Title=TEXT(stored=True), Plot=TEXT(stored=True))

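os.makedirs("index", exist_ok=True)  # Create the index directory if it is missing
ix = create_in("index", schema)  # Starts a fresh index, clearing any existing one in that directory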

**2. Add documents to the index:**

Add the titles and plots.

In [11]:
from whoosh.writing import AsyncWriter

writer = AsyncWriter(ix)
for _, row in data.iterrows():
    writer.add_document(Title=row['Title'], Plot=row['Plot'])
writer.commit()

**3. Run queries:**

Search for a keyword.

In [12]:
from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("Plot", ix.schema).parse("cyborg")
    results = searcher.search(query)
    for result in results:
        print(result['Title'])

Future War
Space Truckers
JAKQ Dengeki Tai
JAKQ Dengeki Tai vs. Goranger
Kung Fu Cyborg
Future X-Cops
Kamen Rider V3
Kamen Rider V3 vs. the Destron Monsters
Cyborg She
Cyborg She


### Part 3: Using Elasticsearch for Indexing and Retrieval

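**1. Start Elasticsearch with Docker:**

Configure and start a container. The cell below only installs the Python client; the notebook does not include the command that starts the container.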

In [2]:
!pip install elasticsearch

Collecting elasticsearch
  Downloading elasticsearch-8.17.0-py3-none-any.whl.metadata (8.8 kB)
Collecting elastic-transport<9,>=8.15.1 (from elasticsearch)
  Downloading elastic_transport-8.17.0-py3-none-any.whl.metadata (3.6 kB)
Collecting urllib3<3,>=1.26.2 (from elastic-transport<9,>=8.15.1->elasticsearch)
  Downloading urllib3-2.2.3-py3-none-any.whl.metadata (6.5 kB)
Collecting certifi (from elastic-transport<9,>=8.15.1->elasticsearch)
  Downloading certifi-2024.12.14-py3-none-any.whl.metadata (2.3 kB)
Downloading elasticsearch-8.17.0-py3-none-any.whl (571 kB)




**2. Configure the client in Python:**

Use the elasticsearch library.


In [16]:
from elasticsearch import Elasticsearch

# Connect to the Elasticsearch service without basic authentication
es = Elasticsearch(
    ['https://your-cluster-endpoint:443']  # Placeholder endpoint; include the port
)

# Check that the connection succeeds
if es.ping():
    print("Connection successful")
else:
    print("Could not connect to Elasticsearch")

Could not connect to Elasticsearch


In [15]:
import elasticsearch
print(elasticsearch.__version__)

(8, 17, 0)


**3. Index documents:**

Insert the movies into the index.

In [None]:
for _, row in data.iterrows():
    doc = {'Title': row['Title'], 'Plot': row['Plot']}
    es.index(index='movies', document=doc)  # The 8.x client names this parameter `document`
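
Indexing one document per request becomes slow on a dataset of this size. A common alternative is the `bulk` helper from `elasticsearch.helpers`, which consumes an iterable of actions. The sketch below is one possible way to wire it up: `generate_actions` and `bulk_index` are hypothetical helper names, and the sample rows are made up. The client import is placed inside `bulk_index` only so that the action generator can be tried without a running cluster.

```python
def generate_actions(rows, index_name="movies"):
    """Yield one bulk action per (title, plot) pair."""
    for title, plot in rows:
        yield {"_index": index_name, "_source": {"Title": title, "Plot": plot}}

def bulk_index(es, rows, index_name="movies"):
    """Send all rows through the bulk helper; `es` is a connected Elasticsearch client."""
    from elasticsearch.helpers import bulk  # imported here so generate_actions works on its own
    return bulk(es, generate_actions(rows, index_name))

# Made-up rows, used only to show the shape of an action
sample_rows = [("Film A", "A cyborg wakes up."), ("Film B", "A quiet farm.")]
actions = list(generate_actions(sample_rows))
print(actions[0])
```

With a working connection, the workshop data could be sent with `bulk_index(es, zip(data['Title'], data['Plot']))`.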

**4. Run queries:**

Search for a keyword.

In [None]:
query = {
    "query": {
        "match": {
            "Plot": "cyborg"
        }
    }
}
response = es.search(index='movies', body=query)
for hit in response['hits']['hits']:
    print(hit['_source']['Title'])