# Workshop: Using Cloud tools for Information Retrieval

## Objective:
Learn how to use two powerful vector databases, ChromaDB and Pinecone, for performing similarity searches with text embeddings. Vector databases are essential tools in the field of Information Retrieval (IR) and are widely used in various applications such as search engines, recommendation systems, and natural language processing (NLP).

In [1]:
import chromadb
import torch
from transformers import AutoTokenizer, AutoModel

In [2]:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="d3d788b6-7f0a-4422-ac69-0163ff358bf9")

In [3]:
pc.create_index(
    name="jueves",
    dimension=300, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) 
)

In [4]:
index = pc.Index("jueves300")

In [5]:
import pandas as pd

wine_df = pd.read_csv("../week10/data/winemag-data_first150k.csv")
wine_df

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...,...
150925,150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset


In [6]:
from gensim.models import KeyedVectors

model_path = '../week10/data/GoogleNews-vectors-negative300.bin.gz'

# Cargar el modelo Word2Vec preentrenado
word2vec_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

In [7]:
corpus = wine_df[['Unnamed: 0','description']][:30]
corpus

Unnamed: 0.1,Unnamed: 0,description
0,0,This tremendous 100% varietal wine hails from ...
1,1,"Ripe aromas of fig, blackberry and cassis are ..."
2,2,Mac Watson honors the memory of a wine once ma...
3,3,"This spent 20 months in 30% new French oak, an..."
4,4,"This is the top wine from La Bégude, named aft..."
5,5,"Deep, dense and pure from the opening bell, th..."
6,6,Slightly gritty black-fruit aromas include a s...
7,7,Lush cedary black-fruit aromas are luxe and of...
8,8,This re-named vineyard was formerly bottled as...
9,9,The producer sources from two blocks of the vi...


In [8]:
import numpy as np

def generate_word2vec_embeddings(texts):
    embeddings = []
    for text in texts:
        tokens = text.lower().split()
        word_vectors = [word2vec_model[word] for word in tokens if word in word2vec_model]
        if word_vectors:
            embeddings.append(np.mean(word_vectors, axis=0))
        else:
            embeddings.append(np.zeros(word2vec_model.vector_size))
    return np.array(embeddings)

word2vec_embeddings = generate_word2vec_embeddings(corpus['description'])
print("Word2Vec Embeddings:", word2vec_embeddings)
print("Word2Vec Shape:", word2vec_embeddings.shape)

Word2Vec Embeddings: [[ 1.97866447e-02  3.41472141e-02 -8.84628296e-03 ... -1.57333612e-02
   6.62658662e-02 -2.78472900e-02]
 [ 1.68609619e-03 -1.24740601e-03 -6.54935837e-04 ... -4.45375443e-02
   6.40835762e-02  3.22151184e-02]
 [-1.75819397e-02  6.40892386e-02  2.40856409e-02 ... -4.09250259e-02
   9.11022425e-02  1.76935196e-02]
 ...
 [ 2.24064998e-02  5.79735003e-02 -1.06399122e-04 ... -2.36428753e-02
   6.22100830e-02 -2.15536579e-02]
 [ 7.06324074e-03  3.76312993e-02  2.09319014e-02 ... -3.86342034e-02
   8.60798284e-02  3.15088741e-02]
 [ 2.73168087e-02  3.64656448e-02  5.79528809e-02 ... -7.14282990e-02
   1.07534885e-01  2.65576839e-02]]
Word2Vec Shape: (30, 300)


In [9]:
word2vec_embeddings

array([[ 1.97866447e-02,  3.41472141e-02, -8.84628296e-03, ...,
        -1.57333612e-02,  6.62658662e-02, -2.78472900e-02],
       [ 1.68609619e-03, -1.24740601e-03, -6.54935837e-04, ...,
        -4.45375443e-02,  6.40835762e-02,  3.22151184e-02],
       [-1.75819397e-02,  6.40892386e-02,  2.40856409e-02, ...,
        -4.09250259e-02,  9.11022425e-02,  1.76935196e-02],
       ...,
       [ 2.24064998e-02,  5.79735003e-02, -1.06399122e-04, ...,
        -2.36428753e-02,  6.22100830e-02, -2.15536579e-02],
       [ 7.06324074e-03,  3.76312993e-02,  2.09319014e-02, ...,
        -3.86342034e-02,  8.60798284e-02,  3.15088741e-02],
       [ 2.73168087e-02,  3.64656448e-02,  5.79528809e-02, ...,
        -7.14282990e-02,  1.07534885e-01,  2.65576839e-02]], dtype=float32)

In [10]:
x = {'id': '0', 'values': word2vec_embeddings[0]}
x

{'id': '0',
 'values': array([ 1.97866447e-02,  3.41472141e-02, -8.84628296e-03,  5.19332886e-02,
        -1.63661949e-02, -3.75488289e-02,  5.24800308e-02, -1.33699030e-01,
         4.18946855e-02,  1.35601804e-01, -4.52171341e-02, -1.09356686e-01,
        -1.28356935e-02,  1.36230467e-02, -6.44167438e-02,  8.89822021e-02,
         2.00988762e-02,  1.46444708e-01, -1.91184990e-02, -6.88751191e-02,
        -2.08301544e-02,  1.02580644e-01,  2.07199100e-02, -3.14007774e-02,
         3.41799743e-02, -5.87153435e-02, -4.22256477e-02,  8.33202377e-02,
        -5.45539847e-03,  2.48138420e-02, -1.26869204e-02, -1.06964111e-02,
        -8.11920129e-03,  5.03300652e-02, -1.49918552e-02, -7.29885092e-03,
        -1.43737793e-02, -1.14369199e-01,  3.42922211e-02,  8.48118812e-02,
         9.50866714e-02, -7.82394409e-02,  2.16217041e-02, -2.50357632e-02,
        -1.56270023e-02, -6.56021088e-02, -3.80557999e-02,  1.42280580e-02,
         1.22375488e-02,  1.51565550e-02, -4.68969345e-04,  1.9854

In [11]:
vectors = []
for i in range(30):
    x = {'id': str(i), 'values': word2vec_embeddings[i]}
    vectors.append(x)

In [12]:
vectors = [{'id': str(i), 'values': word2vec_embeddings[i]} for i in range(30)]

In [13]:
y = [1, 'a']
y

[1, 'a']

In [14]:
vectors

[{'id': '0',
  'values': array([ 1.97866447e-02,  3.41472141e-02, -8.84628296e-03,  5.19332886e-02,
         -1.63661949e-02, -3.75488289e-02,  5.24800308e-02, -1.33699030e-01,
          4.18946855e-02,  1.35601804e-01, -4.52171341e-02, -1.09356686e-01,
         -1.28356935e-02,  1.36230467e-02, -6.44167438e-02,  8.89822021e-02,
          2.00988762e-02,  1.46444708e-01, -1.91184990e-02, -6.88751191e-02,
         -2.08301544e-02,  1.02580644e-01,  2.07199100e-02, -3.14007774e-02,
          3.41799743e-02, -5.87153435e-02, -4.22256477e-02,  8.33202377e-02,
         -5.45539847e-03,  2.48138420e-02, -1.26869204e-02, -1.06964111e-02,
         -8.11920129e-03,  5.03300652e-02, -1.49918552e-02, -7.29885092e-03,
         -1.43737793e-02, -1.14369199e-01,  3.42922211e-02,  8.48118812e-02,
          9.50866714e-02, -7.82394409e-02,  2.16217041e-02, -2.50357632e-02,
         -1.56270023e-02, -6.56021088e-02, -3.80557999e-02,  1.42280580e-02,
          1.22375488e-02,  1.51565550e-02, -4.6896934

In [15]:
index.upsert(vectors=vectors, namespace='vectors')

{'upserted_count': 30}

In [16]:
print(index.describe_index_stats())

{'dimension': 300,
 'index_fullness': 0.0,
 'namespaces': {'vectors': {'vector_count': 30}},
 'total_vector_count': 30}


In [17]:
query_str = 'coffee smell'

query_vector = generate_word2vec_embeddings([query_str])

query_vector

array([[ 0.01367188, -0.10717773, -0.20788574,  0.30711365, -0.10205078,
        -0.00811768, -0.1159668 , -0.02148438, -0.00925446,  0.3671875 ,
        -0.10205078, -0.2368164 ,  0.05566406,  0.07617188, -0.01531982,
         0.19726562, -0.08215332,  0.14453125,  0.22705078, -0.18115234,
        -0.20507812,  0.07391357,  0.17333984, -0.24658203, -0.14978027,
         0.00488281, -0.02490234,  0.18237305, -0.05865479,  0.0579834 ,
        -0.03411865, -0.17160034, -0.10400391, -0.14428711, -0.17773438,
        -0.02542114,  0.01763916, -0.17547607, -0.08508301,  0.06848145,
        -0.16625977, -0.27077103,  0.0925293 ,  0.06634521, -0.16479492,
        -0.07995605, -0.32470703,  0.09127808, -0.27775574,  0.2524414 ,
        -0.13500977, -0.01123047,  0.00634766, -0.07592773, -0.00756836,
         0.04724121,  0.06738281,  0.2019043 ,  0.08813477, -0.24609375,
        -0.07141113,  0.17640495, -0.00439453,  0.17382812,  0.07470703,
        -0.18554688, -0.07110596,  0.03588867,  0.0

In [18]:
index.query(
    namespace="vectors",
    vector=query_vector.tolist(),
    top_k=3,
    include_values=False
)

{'matches': [{'id': '21', 'score': 0.534405589, 'values': []},
             {'id': '3', 'score': 0.533128679, 'values': []},
             {'id': '1', 'score': 0.523467481, 'values': []}],
 'namespace': 'vectors',
 'usage': {'read_units': 5}}

In [19]:
wine_df[wine_df['Unnamed: 0'] == 21]['description']

21    Alluring, complex and powerful aromas of grill...
Name: description, dtype: object

# CHROMADB

In [20]:
import chromadb

In [21]:
chroma_client = chromadb.Client()

In [22]:
collection = chroma_client.create_collection(name="Wine_colecction")


In [23]:
documents = corpus['description'].tolist()
ids = corpus['Unnamed: 0'].astype(str).tolist()

In [24]:
collection.add(
    documents=documents,
    ids=ids
)

In [25]:
results = collection.query(
    query_texts=["Tcoffee smell"], # Chroma will embed this for you
    n_results=5 # how many results to return
)
print(results)


{'ids': [['7', '28', '22', '24', '21']], 'distances': [[1.3850152492523193, 1.3960453271865845, 1.4567155838012695, 1.4612610340118408, 1.4670910835266113]], 'metadatas': [[None, None, None, None, None]], 'embeddings': None, 'documents': [['Lush cedary black-fruit aromas are luxe and offer notes of marzipan and vanilla. This bruiser is massive and tannic on the palate, but still lush and friendly. Chocolate is a key flavor, while baked berry and cassis flavors are hardly wallflowers. On the finish, this is tannic and deep as a sea trench. Drink this saturated black-colored Toro through 2023.', 'Cranberry, baked rhubarb, anise and crushed slate aromas show on the nose of this always-excellent bottling from the Franscioni family. The palate abounds with energy, driving black raspberry fruit accented with earthy loam, fennel, oregano and juniper through a compellingly round midpalate.', 'Tarry blackberry and cheesy oak aromas are appropriate for a wine of this size and magnitude. In the m

In [26]:
wine_df[wine_df['Unnamed: 0'] == 7]['description']

7    Lush cedary black-fruit aromas are luxe and of...
Name: description, dtype: object