# Workshop: Using Cloud tools for Information Retrieval

## Objective:
Learn how to use two powerful vector databases, ChromaDB and Pinecone, for performing similarity searches with text embeddings. Vector databases are essential tools in the field of Information Retrieval (IR) and are widely used in various applications such as search engines, recommendation systems, and natural language processing (NLP).
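
To make "similarity search" concrete: each text is mapped to a vector, and two texts are compared through a measure on their vectors, such as cosine similarity. The following is a minimal sketch in plain Python. The three toy vectors and the function name `cosine_similarity` are illustrative assumptions, not part of the workshop data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values for illustration only).
dry_red = [1.0, 0.0, 1.0]
dry_white = [1.0, 1.0, 0.0]
sweet_dessert = [-1.0, 0.0, -1.0]

sim_same = cosine_similarity(dry_red, dry_red)            # close to 1: identical direction
sim_related = cosine_similarity(dry_red, dry_white)       # close to 0.5: partly aligned
sim_opposite = cosine_similarity(dry_red, sweet_dessert)  # close to -1: opposite direction
```

A vector database stores many such vectors and returns the ones nearest to a query vector without the caller having to compare them one by one.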

## Step 1: Import libraries

In [1]:
import chromadb
from chromadb.config import Settings
import torch
from transformers import AutoTokenizer, AutoModel, BertTokenizer, BertModel
from pinecone import Pinecone, ServerlessSpec
import pandas as pd
import gensim.downloader as api
import numpy as np

## Step 2: Initialize the database


In [2]:
client = chromadb.Client(Settings())  # initialize the ChromaDB client
collection = client.create_collection(name="coleccion_vinos")  # create the collection

## Step 3: Load the data

In [3]:
wine_df = pd.read_csv('winemag-data-130k-v2.csv')
wine_df

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


## Step 4: Generate embeddings
A text is tokenized, the tokens are passed through the BERT model to obtain contextual embeddings, and finally the embedding of the [CLS] token is extracted and returned as a list of numbers.

In [4]:
# initialize the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# function that generates an embedding with BERT
def generar_embedding(texto):
    # tokenize the text and convert it to token ids (truncated to BERT's 512-token limit)
    inputs = tokenizer(texto, return_tensors='pt', truncation=True, padding=True, max_length=512)
    # run the model without tracking gradients, since this is inference only
    with torch.no_grad():
        outputs = model(**inputs)
    # take the last-layer hidden state of the first token, [CLS], as the text embedding
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()

    return embeddings[0].tolist()
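
The function above keeps only the vector at position 0, the `[CLS]` token. A common alternative is to average the vectors of all real (non-padding) tokens, known as mean pooling. The sketch below contrasts the two on a toy matrix in plain Python. Here `hidden` and `mask` are hypothetical stand-ins for `outputs.last_hidden_state[0]` and `inputs['attention_mask'][0]`, and both helper names are made up for illustration. Whether mean pooling retrieves better on this dataset is not something the workshop measures.

```python
def cls_pooling(hidden):
    """Return the vector of the first token ([CLS])."""
    return list(hidden[0])

def mean_pooling(hidden, mask):
    """Average token vectors, skipping positions where mask is 0 (padding)."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy "last hidden state": 4 token positions, hidden size 2 (hypothetical values).
hidden = [
    [1.0, 2.0],  # [CLS]
    [3.0, 4.0],  # a word piece
    [5.0, 6.0],  # [SEP]
    [9.0, 9.0],  # [PAD], must not influence the average
]
mask = [1, 1, 1, 0]

cls_vec = cls_pooling(hidden)          # [1.0, 2.0]
mean_vec = mean_pooling(hidden, mask)  # [3.0, 4.0]
```

For a single input text, as in `generar_embedding`, no padding is added, so the mask only matters when several texts are tokenized together.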



## Step 5: Index the documents

In [5]:
for idx, row in wine_df.head(10000).iterrows():  # only the first 10,000 rows, due to limited compute resources
    embedding = generar_embedding(row['description'])  # embed the wine description in the current row
    # add the document to the ChromaDB collection, keeping the description as metadata
    collection.add(ids=[str(row['Unnamed: 0'])],
                   embeddings=[embedding],
                   metadatas=[{"description": row['description']}])
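
The loop above calls `collection.add` once per row. Since `add` accepts lists of ids, embeddings, and metadatas, one option is to group rows into batches and make fewer calls. The sketch below shows only the grouping logic in plain Python. `batched` and `FakeCollection` are hypothetical helpers that stand in for the real collection so the example runs without ChromaDB, and the batch size of 3 is arbitrary.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

class FakeCollection:
    """Stand-in that records call sizes; mirrors only the keyword names used above."""
    def __init__(self):
        self.calls = []

    def add(self, ids, embeddings, metadatas):
        self.calls.append(len(ids))

# Hypothetical pre-computed rows: (id, embedding, description).
rows = [(str(i), [float(i), 0.0], f"wine {i}") for i in range(7)]

fake = FakeCollection()
for batch in batched(rows, 3):
    fake.add(
        ids=[r[0] for r in batch],
        embeddings=[r[1] for r in batch],
        metadatas=[{"description": r[2]} for r in batch],
    )

batch_sizes = fake.calls  # 7 rows in batches of 3 give calls of size 3, 3, and 1
```

In this workshop the BERT forward pass probably dominates the runtime, so the gain from batching the inserts alone may be small.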

## Step 6: Run the query

In [6]:
query = "The wine is completely dry and mature"
query_embedding = generar_embedding(query)  # generate the embedding for the query
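
Conceptually, the query step compares the query embedding against the stored embeddings and keeps the closest ones. ChromaDB relies on an approximate index (HNSW) rather than an exhaustive scan, and its default distance is squared Euclidean (L2). The sketch below is an exact brute-force version in plain Python, meant only to show what is being computed. The stored vectors, the ids, and the `top_k` helper are hypothetical.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, stored, k):
    """Return the k (id, distance) pairs closest to `query`, nearest first."""
    scored = ((doc_id, squared_l2(query, vec)) for doc_id, vec in stored.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Hypothetical 2-dimensional embeddings keyed by document id.
stored = {
    "a": [0.0, 0.0],
    "b": [1.0, 2.0],
    "c": [3.0, 4.0],
}
query_vec = [1.0, 0.0]

nearest = top_k(query_vec, stored, k=2)  # [("a", 1.0), ("b", 4.0)]
```

Because these are distances rather than similarities, smaller values indicate closer matches.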

### Show the closest results

In [7]:
resultados = collection.query(query_embeddings=[query_embedding], n_results=10)  # one query vector, 10 nearest neighbors

# check and display the results
if 'ids' in resultados and 'distances' in resultados and 'metadatas' in resultados:
    ids = resultados['ids'][0]
    distances = resultados['distances'][0]  # distances, not similarities: lower means closer
    metadatas = resultados['metadatas'][0]

    for doc_id, score, metadata in zip(ids, distances, metadatas):
        descripcion = metadata['description']
        print(f"ID: {doc_id}, Score: {score}\nDescription: {descripcion}\n")
else:
    print("No results found.")

ID: 8274, Score: 30.083202362060547
Description: Finely perfumed, with berry fruits that freshen the more severe tannins. The wine has concentrated structure, smokiness as well as fresh acidity. The finish is dry and firm.

ID: 3094, Score: 32.154998779296875
Description: Classic, soft, easy and broad Pinot Blanc. There is an attractive buttery character to go with the peach and lime flavors. The wine is ripe, with pure fruit and a final tang of acidity.

ID: 110, Score: 32.335235595703125
Description: Produced from cru vines at the base of Mount Brouilly, the wine has structure as well as ripe black-plum fruits. It is generous and its fruit is well balanced by acidity and solid tannins. The wines is ready to drink.

ID: 1918, Score: 33.486602783203125
Description: This solid, structured wine has black currant fruit, balanced acidity and firm tannins. The wine has a solid texture, a core of dryness and the potential to age 3–4 years.

ID: 7310, Score: 33.64719009399414
Description: Age