# Workshop: Using Cloud tools for Information Retrieval

## Objective:
Learn how to use two powerful vector databases, ChromaDB and Pinecone, for performing similarity searches with text embeddings. Vector databases are essential tools in the field of Information Retrieval (IR) and are widely used in various applications such as search engines, recommendation systems, and natural language processing (NLP).
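To make "similarity search" concrete before introducing any library, the following is a minimal brute-force sketch in plain Python. The three-dimensional vectors, their labels, and the helper names `cosine_similarity` and `top_k` are illustrative assumptions; a vector database performs the same kind of ranking, but over high-dimensional embeddings and with an index that avoids comparing against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (made up for this example).
toy_vectors = {
    "fruity_red": [0.9, 0.1, 0.0],
    "crisp_white": [0.1, 0.9, 0.1],
    "sweet_dessert": [0.7, 0.2, 0.6],
}

ranked = top_k([1.0, 0.0, 0.1], toy_vectors, k=2)
print([doc_id for doc_id, _ in ranked])
```

With these toy values the query points mostly along the first axis, so `fruity_red` ranks first and `sweet_dessert` second.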


In [None]:
!pip install chromadb

### 1. Import libraries

In [None]:
import chromadb
from chromadb.utils import embedding_functions
import pandas as pd
from chromadb.config import Settings
from transformers import BertTokenizer, BertModel
import torch

### 2. Initialize the database

Initialize and configure the ChromaDB database:

In [None]:
# Initialize the ChromaDB client (in-memory with the default settings)
client = chromadb.Client(Settings())

# Create the collection
collection = client.create_collection(name="mi_coleccion_vinos")

### 3. Load the data

In [None]:
wine_df = pd.read_csv('winemag-data-130k-v2.csv')
wine_df

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43899,43899,US,This is an outlier in the Brian Carter lineup ...,Stone Tree Vineyard 1,92,48.0,Washington,Wahluke Slope,Columbia Valley,Paul Gregutt,@paulgwine,Brian Carter Cellars 2008 Stone Tree Vineyard ...,Cabernet Sauvignon,Brian Carter Cellars
43900,43900,Italy,From the Serralunga d'Alba area of Barolo prod...,,92,,Piedmont,Barolo,,,,Cantine Gemma 2007 Barolo,Nebbiolo,Cantine Gemma
43901,43901,US,Multiple vineyards contribute to this standout...,Récolte Grand Cru,93,125.0,Oregon,Dundee Hills,Willamette Valley,Paul Gregutt,@paulgwine,Domaine Serene 2014 Récolte Grand Cru Chardonn...,Chardonnay,Domaine Serene
43902,43902,New Zealand,This wine shows a fine balance between sweetne...,Bannockburn,93,40.0,Central Otago,,,Joe Czerwinski,@JoeCz,Felton Road 2015 Bannockburn Riesling (Central...,Riesling,Felton Road


### 4. Generate embeddings

Load both the tokenizer and the pretrained BERT model `bert-base-uncased` from Hugging Face. The tokenizer converts text into tokens, and the model generates embeddings from those tokens.
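As a rough illustration of what "converting text into tokens" means, the sketch below applies greedy longest-match-first subword splitting, the strategy WordPiece tokenizers such as BERT's use. The tiny vocabulary `toy_vocab` and the helper names are assumptions made for this example; the real `bert-base-uncased` vocabulary has about 30,000 entries, and the real tokenizer also handles punctuation and other normalization that is omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"fruit", "##y", "acid", "##ity", "wine", "ripe"}

print(tokenize("Ripe fruity wine acidity", toy_vocab))
```

The `[CLS]` token added at the start is the position whose embedding is used later in this workshop to represent the whole text.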

In [None]:
# Initialize the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')




### 5. Function to generate an embedding

This function takes a text as input, tokenizes it, runs it through the BERT model, extracts the embedding of the first token ([CLS]), and returns that embedding as a list. This converts a text into a numerical representation that captures its contextual meaning, which is useful for a variety of natural language processing (NLP) tasks.

In [None]:
# Function to generate embeddings using BERT
def generar_embedding(texto):
    # Tokenize the text and convert it to token IDs (truncated to BERT's 512-token limit)
    inputs = tokenizer(texto, return_tensors='pt', truncation=True, padding=True, max_length=512)

    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**inputs)

    # Take the last hidden layer's vector for the first token, [CLS]
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()

    return embeddings[0].tolist()  # Return as a plain list
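To show what the slice `last_hidden_state[:, 0, :]` selects, here is a small sketch on nested Python lists standing in for the model output. The numbers and the helper names `cls_embedding` and `mean_embedding` are made up for illustration. It also includes mean pooling over non-padding tokens, a commonly used alternative to the [CLS] vector; it is shown only for comparison and is not what this workshop uses.

```python
# Toy stand-in for last_hidden_state with shape (batch=1, seq_len=3, hidden=4).
# A real bert-base-uncased output has hidden=768.
last_hidden_state = [
    [
        [0.5, -1.0, 2.0, 0.0],  # position 0: [CLS]
        [1.5, 0.0, 1.0, 4.0],   # position 1: a word piece
        [1.0, 1.0, 0.0, 2.0],   # position 2: [SEP]
    ]
]
attention_mask = [[1, 1, 1]]  # 1 = real token, 0 = padding


def cls_embedding(hidden, batch_index=0):
    """Nested-list equivalent of last_hidden_state[batch_index, 0, :]."""
    return list(hidden[batch_index][0])


def mean_embedding(hidden, mask, batch_index=0):
    """Average the token vectors whose attention mask is 1 (padding is skipped)."""
    rows = [vec for vec, keep in zip(hidden[batch_index], mask[batch_index]) if keep]
    return [sum(column) / len(rows) for column in zip(*rows)]


print(cls_embedding(last_hidden_state))
print(mean_embedding(last_hidden_state, attention_mask))
```

Whichever pooling is chosen, documents and queries need to be embedded the same way so that their distances are comparable.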

### 6. Index the documents

Index the documents with ChromaDB:

In [None]:
# Index the documents in ChromaDB
for idx, row in wine_df.iterrows():
    # Generate the embedding for the wine description in the current row
    embedding = generar_embedding(row['description'])
    # Add the document to the ChromaDB collection
    collection.add(ids=[str(row['Unnamed: 0'])], embeddings=[embedding], metadatas=[{"description": row['description']}])
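The loop above calls `collection.add` once per row. Because `collection.add` accepts lists of ids, embeddings, and metadatas, one possible variation is to group rows into batches so that fewer calls are made. The sketch below assumes the embeddings have already been computed; the helper `index_in_batches` and the `RecordingCollection` stand-in are hypothetical names used only so the example runs without a database.

```python
def index_in_batches(collection, ids, embeddings, metadatas, batch_size=1000):
    """Call collection.add once per batch and return the number of calls made."""
    calls = 0
    for start in range(0, len(ids), batch_size):
        stop = start + batch_size
        collection.add(
            ids=ids[start:stop],
            embeddings=embeddings[start:stop],
            metadatas=metadatas[start:stop],
        )
        calls += 1
    return calls


class RecordingCollection:
    """Hypothetical stand-in for a ChromaDB collection; it only records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def add(self, ids, embeddings, metadatas):
        self.batch_sizes.append(len(ids))


demo = RecordingCollection()
n_rows = 5
calls = index_in_batches(
    demo,
    ids=[str(i) for i in range(n_rows)],
    embeddings=[[float(i)] for i in range(n_rows)],
    metadatas=[{"description": f"wine {i}"} for i in range(n_rows)],
    batch_size=2,
)
print(calls, demo.batch_sizes)
```

In the notebook, the real `collection` object would take the place of `demo`, and a suitable `batch_size` would depend on available memory and any limits of the installed ChromaDB version.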

### 7. Query the database

In [21]:
# Text of the query
consulta = "This is a festive wine, with soft, ripe fruit and acidity"
# Generate the embedding for the query
query_embedding = generar_embedding(consulta)

### Run the query
We will display the 15 nearest results.
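"Nearest" here is defined by a distance between embeddings. Assuming the collection was created with ChromaDB's default settings, as it was above, the reported value should be a squared Euclidean (L2) distance, so smaller numbers mean closer matches. The sketch below computes that quantity on made-up vectors; the function name `squared_l2` and the data are illustrative only.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up 3-dimensional vectors for illustration.
query = [1.0, 2.0, 3.0]
candidates = {
    "a": [1.0, 2.0, 3.0],  # identical to the query
    "b": [2.0, 2.0, 1.0],
    "c": [4.0, 6.0, 3.0],
}

# Sort ascending: the smallest distance is the nearest neighbour.
ranked_ids = sorted(candidates, key=lambda key: squared_l2(query, candidates[key]))
print([(key, squared_l2(query, candidates[key])) for key in ranked_ids])
```

An identical vector has distance 0, and the ranking is ascending rather than descending, which is the opposite of how a similarity score would be sorted.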

In [22]:
# Run the query (the single query embedding is wrapped in a list)
resultados = collection.query(query_embeddings=[query_embedding], n_results=15)

# Check and display the results; the "Score" printed below is a distance, so lower means closer
if 'ids' in resultados and 'distances' in resultados and 'metadatas' in resultados:
    ids = resultados['ids'][0]
    distances = resultados['distances'][0]
    metadatas = resultados['metadatas'][0]

    for doc_id, score, metadata in zip(ids, distances, metadatas):
        descripcion = metadata['description']
        print(f"ID: {doc_id}, Score: {score}\nDescription: {descripcion}\n")
else:
    print("No results found.")


ID: 42, Score: 6.150557518005371
Description: This is a festive wine, with soft, ripe fruit and acidity, plus a red berry flavor.

ID: 12499, Score: 6.150557518005371
Description: This is a festive wine, with soft, ripe fruit and acidity, plus a red berry flavor.

ID: 10563, Score: 14.070354461669922
Description: This is an off-dry wine, soft with strawberry flavors that are balanced by crisp acidity. It is light and fruity—a fine wine to drink as an apéritif.

ID: 14023, Score: 14.391206741333008
Description: This is a fresh and fruity wine, full of red fruits that are laced with raisins and prunes. The acidity cuts through the richness balancing the natural sweetness. It has an attractive crisp aftertaste.

ID: 22064, Score: 14.480831146240234
Description: This is a soft wine with attractive ripe strawberry fruits and a soft texture. There is balancing acidity giving an edge of freshness to the otherwise gentle fruity texture.

ID: 28281, Score: 14.896913528442383
Description: This i