# Preparing documents/texts and representing them as embeddings, then indexing and storing them in a vector database

References:

[1] https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/pinecone/README.md

[2] https://github.com/pinecone-io/examples/blob/master/learn/generation/openai/gpt-4-langchain-docs.ipynb

[3] https://docs.pinecone.io/docs/quickstart

[4] https://github.com/Azure-Samples/Azure-OpenAI-Docs-Samples/blob/main/Samples/Tutorials/Embeddings/embedding_billsum.ipynb

This notebook takes the examples from references [2] and [4] and applies them to a dataset of climate change tweets:

* Read the dataset and preprocess it.
* Clean the text and create embeddings (vector representations) with the selected model: OpenAI text-embedding-ada-002.
* Run a text search over the dataframe by selecting the most similar vectors.
* Create the index in the selected vector database, Pinecone, and add the embeddings to it.


### Selected vector database: Pinecone

- **Setup**: install and import libraries, and load the environment variables for connecting to the Pinecone vector database and the OpenAI embedding model. [2][3][4]
- **Read the dataset and create embeddings**: read the dataset, augment the text, and create embeddings with the OpenAI model.
- **Vector database: Pinecone**
    - Configure and create the client for connecting to the Pinecone VDB.
    - Create the index.
    - Load vectors and metadata into the index.


In [None]:
#pip install openai python-dotenv pinecone-client numpy pandas tiktoken tqdm datasets huggingface_hub

## Import libraries

In [3]:

import os
import pandas as pd
import re
import numpy as np
import tiktoken
from openai import AzureOpenAI
from dotenv import load_dotenv
from tqdm.auto import tqdm
from time import sleep
from pinecone import Pinecone
from huggingface_hub import login
from datasets import Dataset

## Configure environment variables and paths

In [4]:
load_dotenv()
OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"] 
EMBEDDING_DEPLOYMENT = os.environ["AZURE_EMBEDDING_DEPLOYMENT"] 
OPENAI_API_VERSION =os.environ["OPENAI_API_VERSION"] 

api_key = os.environ.get('PINECONE_API_KEY')
environment = os.environ.get('PINECONE_ENVIRONMENT')
use_serverless = os.environ.get("USE_SERVERLESS", "False").lower() == "true"
access_token_hf = os.environ.get('HF_TOKEN')
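The `USE_SERVERLESS` flag above is parsed by comparing the lowercased value with `"true"`. A slightly more tolerant variant is sketched below; it also accepts values such as `1` or `yes`. `env_flag` and the `DEMO_FLAG` variables are hypothetical names used only for illustration, not part of the notebook.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        # variable not set: fall back to the default
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# demo variable, set here only to show the behaviour
os.environ["DEMO_FLAG"] = "True"
print(env_flag("DEMO_FLAG"))
print(env_flag("DEMO_FLAG_MISSING"))
```

With this helper, the line above could be written as `use_serverless = env_flag("USE_SERVERLESS")`.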

In [3]:
src_path = os.getcwd()
data_path = '../data'
filename = 'climateTwitterData.csv'
out_data_path = os.path.join(data_path, 'out', 'batch')

## Read the input dataset

In [4]:
df=pd.read_csv(os.path.join(src_path,data_path,filename))
df.head()


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,author_id,text,retweets,permalink,date,formatted_date,favorites,mentions,hashtags,geo,urls,search_hashtags,location,sentiment1,sentiment2
0,0,0,1.21181e+18,7.59e+17,"2020 is the year we #votethemout, the year we ...",15,https://twitter.com/Sphiamia/status/1211807074...,2019-12-31 00:31:35+00:00,Tue Dec 31 00:31:35 +0000 2019,46,,#votethemout #climatestrike #rebelforlife,,,#climatestrike,"California, USA",negative,negative
1,1,1,1.21067e+18,22195470.0,Winter has not stopped this group of dedicated...,9,https://twitter.com/StephDujarric/status/12106...,2019-12-27 20:56:21+00:00,Fri Dec 27 20:56:21 +0000 2019,35,,#climatefriday #climatestrike #ClimateAction,,,#climatestrike,"California, USA",positive,positive
2,2,2,1.21059e+18,1.07e+18,WEEK 55 of #ClimateStrike at the @UN. Next wee...,545,https://twitter.com/AlexandriaV2005/status/121...,2019-12-27 15:50:22+00:00,Fri Dec 27 15:50:22 +0000 2019,3283,@UN @Fridays4future,#ClimateStrike,,,#climatestrike,"California, USA",positive,positive
3,3,3,1.21026e+18,1339821000.0,"A year of resistance, as youth protests shape...",1,https://twitter.com/EnergyHouseVA/status/12102...,2019-12-26 17:53:26+00:00,Thu Dec 26 17:53:26 +0000 2019,2,,#greta #gretathunberg #climatechange #fridaysf...,,https://www.channelnewsasia.com/news/commentar...,#climatestrike,"California, USA",positive,positive
4,4,4,1.20964e+18,1339821000.0,HAPPY HOLIDAYS #greta #gretathunberg #climate...,1,https://twitter.com/EnergyHouseVA/status/12096...,2019-12-25 00:56:37+00:00,Wed Dec 25 00:56:37 +0000 2019,4,,#greta #gretathunberg #climatechange #fridaysf...,,"http://www.energyhouse.us,http://www.pacenowfo...",#climatestrike,"California, USA",positive,positive


In [5]:
df['search_hashtags'].value_counts()

search_hashtags
#climatestrike       18355
#climatechange       16190
#climateaction        6378
#sustainability       5790
#climatecrisis        4982
#environment          4703
#greennewdeal         4589
#globalwarming        4152
#fridaysforfuture     3038
#actonclimate         1895
#savetheplanet        1434
#bushfires             899
Name: count, dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72405 entries, 0 to 72404
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0.1     72405 non-null  int64  
 1   Unnamed: 0       72405 non-null  int64  
 2   id               72405 non-null  float64
 3   author_id        72405 non-null  float64
 4   text             72405 non-null  object 
 5   retweets         72405 non-null  int64  
 6   permalink        72405 non-null  object 
 7   date             72405 non-null  object 
 8   formatted_date   72405 non-null  object 
 9   favorites        72405 non-null  int64  
 10  mentions         27554 non-null  object 
 11  hashtags         72402 non-null  object 
 12  geo              0 non-null      float64
 13  urls             33349 non-null  object 
 14  search_hashtags  72405 non-null  object 
 15  location         72405 non-null  object 
 16  sentiment1       30000 non-null  object 
 17  sentiment2  

## Preprocess the dataset to create embeddings

### Preprocess the dataframe

In [45]:
# select the columns of interest for text mining
df_procesado = df[['text', 'date', 'hashtags', 'search_hashtags', 'location', 'sentiment1']]

# keep only rows with at least 4 non-null values among the 6 selected columns
df_procesado = df_procesado.dropna(thresh=4, axis=0)

# convert the "date" column to a yyyy-mm-dd string without the time
df_procesado.loc[:, 'date'] = pd.to_datetime(df_procesado['date']).dt.strftime('%Y-%m-%d')

# drop rows duplicated on the text, date and location fields
df_procesado = df_procesado.drop_duplicates(subset=['text', 'date', 'location'])

# reset the row index
df_procesado.reset_index(drop=True, inplace=True)

df_procesado.head()

Unnamed: 0,text,date,hashtags,search_hashtags,location,sentiment1
0,"2020 is the year we #votethemout, the year we ...",2019-12-31,#votethemout #climatestrike #rebelforlife,#climatestrike,"California, USA",negative
1,Winter has not stopped this group of dedicated...,2019-12-27,#climatefriday #climatestrike #ClimateAction,#climatestrike,"California, USA",positive
2,WEEK 55 of #ClimateStrike at the @UN. Next wee...,2019-12-27,#ClimateStrike,#climatestrike,"California, USA",positive
3,"A year of resistance, as youth protests shape...",2019-12-26,#greta #gretathunberg #climatechange #fridaysf...,#climatestrike,"California, USA",positive
4,HAPPY HOLIDAYS #greta #gretathunberg #climate...,2019-12-25,#greta #gretathunberg #climatechange #fridaysf...,#climatestrike,"California, USA",positive
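The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-null values a row needs in order to be kept, not the number of nulls that triggers removal. A small sketch on made-up data (the `demo` frame is illustrative only):

```python
import numpy as np
import pandas as pd

# toy frame: row 0 has 3 non-null values, row 1 has 2, row 2 has 1
demo = pd.DataFrame({
    "a": [1, 1, 1],
    "b": [2, np.nan, np.nan],
    "c": [3, 3, np.nan],
})

# thresh=2 keeps rows that have at least 2 non-null values
kept = demo.dropna(thresh=2, axis=0)
print(kept.index.tolist())  # [0, 1]
```

With the six columns selected in the notebook, `thresh=4` therefore drops only rows with three or more missing values.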


In [46]:
df_procesado.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53399 entries, 0 to 53398
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   text             53399 non-null  object
 1   date             53399 non-null  object
 2   hashtags         53397 non-null  object
 3   search_hashtags  53399 non-null  object
 4   location         53399 non-null  object
 5   sentiment1       21795 non-null  object
dtypes: object(6)
memory usage: 2.4+ MB


In [47]:
# fill NaN values in the 'hashtags' and 'sentiment1' columns
df_procesado.loc[:,'hashtags'] = df_procesado['hashtags'].fillna('')

df_procesado.loc[:,'sentiment1'] = df_procesado['sentiment1'].fillna('')

# join the columns into a single augmented text
df_procesado.loc[:,'aumented_text'] = df_procesado['text'] + '. date: ' + df_procesado['date'] + '. location: ' + df_procesado['location'] + '. sentiment: ' + df_procesado['sentiment1']

# create an id for the embedding of each row's text
df_procesado = df_procesado.reset_index(names="id")

df_procesado.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53399 entries, 0 to 53398
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               53399 non-null  int64 
 1   text             53399 non-null  object
 2   date             53399 non-null  object
 3   hashtags         53399 non-null  object
 4   search_hashtags  53399 non-null  object
 5   location         53399 non-null  object
 6   sentiment1       53399 non-null  object
 7   aumented_text    53399 non-null  object
dtypes: int64(1), object(7)
memory usage: 3.3+ MB


### Clean the text in the aumented_text column

In [48]:
def limpiar_texto(texto: str) -> str:
    # lowercase the text and collapse whitespace runs (including line breaks) into single spaces
    texto = texto.lower()
    texto = re.sub(r'\s+', ' ', texto).strip()
    return texto

df_procesado['clean_text'] = df_procesado["aumented_text"].apply(limpiar_texto)

According to the documentation of the text-embedding-ada-002 embedding model, the input must not exceed 8192 tokens; longer texts have to be split before embedding. For this dataset that is not necessary, since no text comes close to that limit.
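For datasets where some texts do exceed the limit, one possible approach is to split the token sequence into consecutive chunks and embed each chunk separately. The sketch below works on a plain list of token ids so that it runs without any tokenizer; `split_tokens` is a hypothetical helper, not something the notebook or the model documentation prescribes.

```python
def split_tokens(tokens: list[int], max_tokens: int) -> list[list[int]]:
    """Split a token-id sequence into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# toy example: 10 token ids split into chunks of at most 4
chunks = split_tokens(list(range(10)), max_tokens=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With tiktoken, `tokenizer.encode(text)` would supply the token list and `tokenizer.decode(chunk)` would turn each chunk back into text before it is sent to the embedding model.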

In [49]:
tokenizer = tiktoken.get_encoding("cl100k_base")
# count the number of tokens in each text
df_procesado['n_tokens_text'] = df_procesado["aumented_text"].apply(lambda x: len(tokenizer.encode(x)))
df_procesado['n_tokens_clean'] = df_procesado["clean_text"].apply(lambda x: len(tokenizer.encode(x)))


In [50]:
df_procesado[['n_tokens_text','n_tokens_clean']].describe()

Unnamed: 0,n_tokens_text,n_tokens_clean
count,53399.0,53399.0
mean,67.865915,67.610405
std,25.057471,24.884816
min,21.0,21.0
25%,49.0,49.0
50%,67.0,67.0
75%,85.0,84.0
max,455.0,408.0


In [51]:
def crear_metadata(row):
    metadata = {
        "search_hashtags": row["search_hashtags"],
        "date": row["date"],
        "location": row["location"],
        "hashtags": row["hashtags"],
        "sentiment1": row["sentiment1"],
        "text": row["clean_text"]
    }
    return metadata
df_procesado['metadata'] = df_procesado.apply(lambda row: crear_metadata(row), axis=1)

In [52]:
df_final_procesado = df_procesado[['id','clean_text','metadata']].copy()

In [53]:
df_final_procesado.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53399 entries, 0 to 53398
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          53399 non-null  int64 
 1   clean_text  53399 non-null  object
 2   metadata    53399 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB


## Create embeddings with OpenAI's pre-trained embedding model

### Create the client for connecting to the Azure OpenAI embedding model

In [54]:
client = AzureOpenAI(
  api_key = OPENAI_API_KEY,  
  api_version = OPENAI_API_VERSION,
  azure_endpoint = OPENAI_ENDPOINT
)
def generar_embedding(texto:str, modelo:str=EMBEDDING_DEPLOYMENT)->list[float]:
    """Create the embedding of a single text using the given model deployment."""
    return client.embeddings.create(input = [texto], model=modelo).data[0].embedding

def generar_embeddings_por_lote(textos:list, modelo:str=EMBEDDING_DEPLOYMENT)->list:
    """Create embeddings for a list of texts using the given model deployment; returns the response's data items."""
    return client.embeddings.create(input = textos, model=modelo).data
    

### Function to create embeddings for the dataframe texts in batches

In [56]:
def procesar_df_por_lotes(df:pd.DataFrame,tamaño_lote:int= 200, guardar:bool = True)->pd.DataFrame:
    """Process the dataframe in batches of tamaño_lote rows and return ids, texts, embeddings and metadata."""

    # split the DataFrame into batches and process each one
    lotes = []
    lote_inicial = 0
    lote_actual = 0
    for i in tqdm(range(0, len(df), tamaño_lote)):
        # end index of the batch
        i_end = min(len(df), i+tamaño_lote)
        meta_batch = df[i:i_end]
        # list of ids
        ids_batch = meta_batch['id'].to_list()
        # list of texts to embed
        textos = meta_batch['clean_text'].to_list()
        # create the embeddings
        try:
            res = generar_embeddings_por_lote(textos, modelo=EMBEDDING_DEPLOYMENT)
        except Exception as e:
            done = False
            print(f"Retrying request, error {e}")
            while not done:
                sleep(5)
                try:
                    res = generar_embeddings_por_lote(textos, modelo=EMBEDDING_DEPLOYMENT)
                    done = True
                except Exception as retry_error:
                    # report the error of the latest attempt, not the first one
                    print(f"Retrying request, error {retry_error}")
        embeds = [record.embedding for record in res]
        metadata_batch = meta_batch['metadata'].to_list()

        embeded_data = list(zip(ids_batch, textos ,embeds, metadata_batch))

        # build a DataFrame from the combined data
        df_embeded = pd.DataFrame(embeded_data, columns=['id','text','embeddings','metadata'])

        lotes.append(df_embeded)
        lote_actual = lote_actual + 1
        if (lote_actual % 20 == 0) and guardar:
            
            # save a temporary checkpoint file every 20 batches (4000 rows with the default batch size)
            file_name = f'climateTwitterEmbedData_{i_end}.csv'
            file_path = os.path.join(src_path, data_path, 'out', 'batch', file_name)
            lista_temp= lotes[lote_inicial:lote_actual]
            print(f"lote_actual->{lote_actual}")
            print(f"lote_inicial->{lote_inicial}")
            print(f"lista_temporal_guardada->{len(lista_temp)}")


            df_embeded_accum = pd.concat(lista_temp, ignore_index=True)
            df_embeded_accum.to_csv(file_path, sep=";",index=False)
            lote_inicial = lote_actual
            
  
    # concatenate the results into a single DataFrame
    df_out = pd.concat(lotes, ignore_index=True)
    return df_out
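The retry logic inside the function above waits a fixed 5 seconds and retries without limit. One alternative, shown here only as a sketch, is a bounded retry with exponential backoff. `call_with_retries` and `flaky` are hypothetical names; the `sleep` parameter exists so the demo can run without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                      initial_wait: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception and doubling the wait each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    wait = initial_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as error:
            if attempt == max_attempts:
                # out of attempts: let the last error propagate
                raise
            print(f"Attempt {attempt} failed ({error}); retrying in {wait:.1f}s")
            sleep(wait)
            wait *= 2


# demo: a function that fails twice and then succeeds
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, calls["n"])
```

In the notebook, the wrapped call could be `lambda: generar_embeddings_por_lote(textos)`; the attempt count and wait times are assumptions to tune against the service's rate limits.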

### Create the dataframe with ids, vectors and metadata to load into the vector database

In [57]:
df_embeded = procesar_df_por_lotes(df_final_procesado,tamaño_lote= 200)

  0%|          | 0/267 [00:00<?, ?it/s]

  7%|▋         | 19/267 [18:21<4:12:15, 61.03s/it]

lote_actual->20
lote_inicial->0
lista_temporal_guardada->20


 15%|█▍        | 39/267 [38:42<3:51:47, 61.00s/it]

lote_actual->40
lote_inicial->20
lista_temporal_guardada->20


 22%|██▏       | 59/267 [59:02<3:31:37, 61.05s/it]

lote_actual->60
lote_inicial->40
lista_temporal_guardada->20


 30%|██▉       | 79/267 [1:19:23<3:11:02, 60.97s/it]

lote_actual->80
lote_inicial->60
lista_temporal_guardada->20


 37%|███▋      | 99/267 [1:39:45<2:49:59, 60.71s/it]

lote_actual->100
lote_inicial->80
lista_temporal_guardada->20


 45%|████▍     | 119/267 [2:00:05<2:30:36, 61.06s/it]

lote_actual->120
lote_inicial->100
lista_temporal_guardada->20


 52%|█████▏    | 139/267 [2:20:29<2:10:18, 61.08s/it]

lote_actual->140
lote_inicial->120
lista_temporal_guardada->20


 60%|█████▉    | 159/267 [2:40:51<1:49:51, 61.04s/it]

lote_actual->160
lote_inicial->140
lista_temporal_guardada->20


 67%|██████▋   | 179/267 [3:01:15<1:29:35, 61.09s/it]

lote_actual->180
lote_inicial->160
lista_temporal_guardada->20


 75%|███████▍  | 199/267 [3:21:37<1:09:07, 60.99s/it]

lote_actual->200
lote_inicial->180
lista_temporal_guardada->20


 82%|████████▏ | 219/267 [3:42:00<48:52, 61.09s/it]  

lote_actual->220
lote_inicial->200
lista_temporal_guardada->20


 90%|████████▉ | 239/267 [4:02:24<28:38, 61.38s/it]

lote_actual->240
lote_inicial->220
lista_temporal_guardada->20


 97%|█████████▋| 259/267 [4:22:45<08:05, 60.69s/it]

lote_actual->260
lote_inicial->240
lista_temporal_guardada->20


100%|██████████| 267/267 [4:30:56<00:00, 60.89s/it]


In [58]:
df_embeded

Unnamed: 0,id,text,embeddings,metadata
0,0,"2020 is the year we #votethemout, the year we ...","[-0.0329987034201622, -0.04772832244634628, -0...","{'search_hashtags': '#climatestrike', 'date': ..."
1,1,winter has not stopped this group of dedicated...,"[-0.022893592715263367, -0.0438067689538002, -...","{'search_hashtags': '#climatestrike', 'date': ..."
2,2,week 55 of #climatestrike at the @un. next wee...,"[-0.02628657966852188, -0.038114212453365326, ...","{'search_hashtags': '#climatestrike', 'date': ..."
3,3,"a year of resistance, as youth protests shaped...","[-0.013956073671579361, -0.04684501141309738, ...","{'search_hashtags': '#climatestrike', 'date': ..."
4,4,happy holidays #greta #gretathunberg #climatec...,"[-0.023563934490084648, -0.03334088623523712, ...","{'search_hashtags': '#climatestrike', 'date': ..."
...,...,...,...,...
53394,53394,#endplasticwaste #savetheplanet can we just st...,"[-0.02566179819405079, -0.012251610867679119, ...","{'search_hashtags': '#savetheplanet', 'date': ..."
53395,53395,always feared this. #recycling #savetheplanet ...,"[-0.010699223726987839, -0.015921304002404213,...","{'search_hashtags': '#savetheplanet', 'date': ..."
53396,53396,no more straws at lbm... only if you ask for i...,"[-0.03115925006568432, -0.022100692614912987, ...","{'search_hashtags': '#savetheplanet', 'date': ..."
53397,53397,my #trumps may not believe in #climatechange b...,"[-0.045478325337171555, -0.015456773340702057,...","{'search_hashtags': '#savetheplanet', 'date': ..."


In [59]:
# save in pickle format
file_name = "climateTwitterEmbedData.pkl"
file_path = os.path.join(src_path, data_path, 'out', file_name)
df_embeded.to_pickle(file_path)
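The pickle file keeps the embedding lists and metadata dicts as Python objects. The intermediate checkpoints, by contrast, are written with `to_csv`, which stores those columns as their string representations. A minimal sketch of reading such a checkpoint back, assuming the same `;` separator and column names; the in-memory buffer and toy rows stand in for a real checkpoint file:

```python
import ast
import io

import pandas as pd

# toy frame with the same columns as the checkpoint files
original = pd.DataFrame({
    "id": [0, 1],
    "text": ["first text", "second text"],
    "embeddings": [[0.1, -0.2], [0.3, 0.4]],
    "metadata": [{"date": "2019-12-31"}, {"date": "2019-12-27"}],
})

# write to an in-memory CSV, as the checkpoint code does with a file path
buffer = io.StringIO()
original.to_csv(buffer, sep=";", index=False)
buffer.seek(0)

# parse the stringified list and dict columns back into Python objects
restored = pd.read_csv(
    buffer,
    sep=";",
    converters={"embeddings": ast.literal_eval, "metadata": ast.literal_eval},
)
print(type(restored.loc[0, "embeddings"]), restored.loc[0, "metadata"])
```

Without the converters, the `embeddings` column would come back as plain strings and could not be used for similarity computations.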

In [60]:
# load the dataframe into a Hugging Face dataset
login(token = access_token_hf, add_to_git_credential=False,write_permission= True)

dataset = Dataset.from_pandas(df_embeded)
dataset.push_to_hub("AndresR2909/climate_twitter_text_embeddings")

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\felip\.cache\huggingface\token
Login successful


Creating parquet from Arrow format: 100%|██████████| 27/27 [00:04<00:00,  6.40ba/s]
Creating parquet from Arrow format: 100%|██████████| 27/27 [00:02<00:00,  9.22ba/s]
Uploading the dataset shards: 100%|██████████| 2/2 [02:15<00:00, 67.95s/it]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


CommitInfo(commit_url='https://huggingface.co/datasets/AndresR2909/climate_twitter_text_embeddings/commit/f72264763ef619c587ab6dff286ad0ffbb5441a0', commit_message='Upload dataset', commit_description='', oid='f72264763ef619c587ab6dff286ad0ffbb5441a0', pr_url=None, pr_revision=None, pr_num=None)

![image.png](hf_dataset.png)

## Local document search test on the dataframe, without a vector database

In [61]:
def similaridad_coseno(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def busqueda_documentos_dataframe(df: pd.DataFrame, user_query:str, top_n:int=4)->pd.DataFrame:
    """Search the dataframe for the texts most similar to the query."""
    embedding = generar_embedding(
        user_query,
        modelo=EMBEDDING_DEPLOYMENT  # on Azure this is the deployment name; use the same model as for the documents
    )
    df["similarities"] = df.embeddings.apply(lambda x: similaridad_coseno(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    display(res)
    return res


res = busqueda_documentos_dataframe(df_embeded, "why winter has not stopped in California in 2019-12", top_n=5)

Unnamed: 0,id,text,embeddings,metadata,similarities
6365,6365,it appears like winter has been cancelled for ...,"[-0.02680104970932007, -0.031904712319374084, ...","{'search_hashtags': '#globalwarming', 'date': ...",0.848808
51997,51997,it appears like winter has been cancelled for ...,"[-0.027197187766432762, -0.029225224629044533,...","{'search_hashtags': '#globalwarming', 'date': ...",0.831487
1,1,winter has not stopped this group of dedicated...,"[-0.022893592715263367, -0.0438067689538002, -...","{'search_hashtags': '#climatestrike', 'date': ...",0.830322
32921,32921,winter is not coming #globalwarming . date: 20...,"[-0.017101731151342392, -0.023731127381324768,...","{'search_hashtags': '#globalwarming', 'date': ...",0.824444
1121,1121,@usatodayweather it not normally this warm out...,"[-0.0009998481255024672, -0.020653579384088516...","{'search_hashtags': '#climatestrike', 'date': ...",0.823834


In [65]:
res.iloc[0]["metadata"]

{'search_hashtags': '#globalwarming',
 'date': '2019-12-30',
 'location': 'California, USA',
 'hashtags': '#GlobalWarming',
 'sentiment1': 'positive',
 'text': 'it appears like winter has been cancelled for maryland #globalwarming. date: 2019-12-30. location: california, usa. sentiment: positive'}

In [67]:
res.iloc[2]["metadata"]

{'search_hashtags': '#climatestrike',
 'date': '2019-12-27',
 'location': 'California, USA',
 'hashtags': '#climatefriday #climatestrike #ClimateAction',
 'sentiment1': 'positive',
 'text': 'winter has not stopped this group of dedicated climate activists. they are an example to follow. #climatefriday #climatestrike #climateaction. date: 2019-12-27. location: california, usa. sentiment: positive'}
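The search above computes the cosine similarity row by row with `apply`. For larger dataframes the same ranking can be computed in a single matrix operation. The sketch below uses made-up 2-dimensional vectors; `top_n_cosine` is a hypothetical helper name.

```python
import numpy as np


def top_n_cosine(matrix: np.ndarray, query: np.ndarray, top_n: int = 4):
    """Return indices and scores of the top_n rows most similar to query."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # cosine similarity of every row against the query in one pass
    scores = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # sort by descending score and keep the first top_n positions
    order = np.argsort(-scores)[:top_n]
    return order, scores[order]


vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, sims = top_n_cosine(vectors, np.array([1.0, 0.0]), top_n=2)
print(idx, sims)
```

For the notebook's data, the matrix could be built with `np.array(df_embeded["embeddings"].to_list())` and the returned positions passed to `df_embeded.iloc[...]`.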

## Load the embeddings into the Pinecone vector database

### Start the connection to the vector database


#### Choose the index type: serverless or pod-based
To decide which pod to use, see the documentation: https://docs.pinecone.io/guides/indexes/configure-pod-based-indexes#changing-pod-sizes

In [5]:
# initialize connection to pinecone (get API key at app.pinecone.io)
# configure client
pc = Pinecone(api_key=api_key)

In [6]:

from pinecone import ServerlessSpec, PodSpec
import time

if use_serverless:
    spec = ServerlessSpec(cloud='aws', region='us-west-2')
else:
    spec = PodSpec(environment=environment,pod_type="s1.x1")

### Create an index

Create an index in the Pinecone VDB.

In [7]:
index_name = 'climate-twitter-data'

if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

# we create a new index
pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric='cosine', #'dotproduct'
        spec=spec
    )

# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
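The `dimension` passed to `create_index` has to match the length of the vectors that will be upserted. A quick sanity check on the dataframe before creating the index might look like the sketch below; `embedding_dimension` is a hypothetical helper and the `demo` frame uses toy 3-dimensional vectors.

```python
import pandas as pd


def embedding_dimension(df: pd.DataFrame, column: str = "embeddings") -> int:
    """Return the common length of all vectors in the column, or raise."""
    dims = df[column].map(len).unique()
    if len(dims) != 1:
        raise ValueError(f"expected one embedding dimension, found {sorted(dims)}")
    return int(dims[0])


demo = pd.DataFrame({"embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]})
print(embedding_dimension(demo))  # 3
```

For this notebook, `embedding_dimension(df_embeded)` would be expected to return 1536, the dimensionality of text-embedding-ada-002.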

In [8]:
# confirm that the index was created
pc.list_indexes()

{'indexes': [{'dimension': 1536,
              'host': 'climate-twitter-data-s4apt1d.svc.gcp-starter.pinecone.io',
              'metric': 'cosine',
              'name': 'climate-twitter-data',
              'spec': {'pod': {'environment': 'gcp-starter',
                               'pod_type': 'starter',
                               'pods': 1,
                               'replicas': 1,
                               'shards': 1}},
              'status': {'ready': True, 'state': 'Ready'}}]}

### Connect to the previously created VDB index

In [9]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

### Load the vectors into the index created in the vector database

In [15]:
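# rename 'embeddings' to the 'values' column Pinecone expects and cast the ids to strings
df_upsert = df_embeded.rename(columns={'embeddings': 'values'}).assign(id=lambda d: d['id'].astype(str))
index.upsert_from_dataframe(df_upsert[['id','values','metadata']], batch_size=200)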

sending upsert requests:   0%|          | 0/53399 [00:00<?, ?it/s]

sending upsert requests: 100%|██████████| 53399/53399 [13:12<00:00, 67.34it/s]


{'upserted_count': 53399}
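As an alternative to `upsert_from_dataframe`, the records can be built and sent in explicit batches, which gives more control over retries and logging. The sketch below only prepares the batches and does not connect to Pinecone; `iter_upsert_batches` is a hypothetical helper and the `demo` frame is toy data with the same columns as `df_embeded`.

```python
from typing import Iterator

import pandas as pd


def iter_upsert_batches(df: pd.DataFrame, batch_size: int = 200) -> Iterator[list[dict]]:
    """Yield lists of {'id', 'values', 'metadata'} records, batch_size at a time."""
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        yield [
            # ids are cast to strings, as Pinecone vector ids are strings
            {"id": str(row.id), "values": list(row.embeddings), "metadata": row.metadata}
            for row in chunk.itertuples(index=False)
        ]


demo = pd.DataFrame({
    "id": range(5),
    "embeddings": [[0.1 * i, 0.2 * i] for i in range(5)],
    "metadata": [{"text": f"tweet {i}"} for i in range(5)],
})
batches = list(iter_upsert_batches(demo, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch could then be passed to `index.upsert(vectors=batch)`.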

In [16]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.53399,
 'namespaces': {'': {'vector_count': 53399}},
 'total_vector_count': 53399}

![image.png](pinecone_vdb.png)