# Pinecone RAG: upserting data

copyright 2025, Denis Rothman

# Pinecone vector store

copyright 2024, Denis Rothman

This notebook upserts *classical RAG data* to a Pinecone index. The data can then be retrieved to provide instructions to a generative AI model.

The notebook contains the following sections:
* **Setting up the environment** with a file downloading script, OpenAI, and Pinecone.
* **Processing the Data: loading and chunking** to load the instruction scenarios and chunk them.
* **Embedding the chunked data**
* **The Pinecone index** to create or connect to a Pinecone index and upsert the chunked and embedded data (text data)






# Setting up the environment

This notebook was developed in Google Colab. Colab includes many pre-installed libraries and sets `/content/` as the default directory, meaning you can access files directly by their filename if you wish (e.g., `filename` instead of needing to specify `/content/filename`). This differs from local environments, where you'll often need to install libraries or specify full file paths.
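
To run the notebook in both Colab and a local environment, the file path can be resolved at runtime. The following is a minimal sketch, not part of the original notebook: `resolve_data_path` and its `colab_root` parameter are hypothetical names, and the sketch assumes that the presence of a `/content` directory indicates Colab.

```python
from pathlib import Path

def resolve_data_path(filename, colab_root="/content"):
    """Return a path to `filename` under /content if it exists, else under the cwd."""
    root = Path(colab_root)
    # Assumption: a /content directory means we are running in Colab.
    base = root if root.is_dir() else Path.cwd()
    return base / filename

print(resolve_data_path("data01.txt"))
```

With this helper, `file_path = resolve_data_path("data01.txt")` can be used in place of a hard-coded `/content/...` path.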

## File downloading script

`grequests.py` contains a script that downloads files from the repository.

In [None]:
!curl -L https://raw.githubusercontent.com/Denis2054/Building-Business-Ready-Generative-AI-Systems/master/commons/grequests.py --output grequests.py

## OpenAI

In [None]:
from grequests import download
download("commons","requirements01.py")
download("commons","openai_setup.py")
download("commons","openai_api.py")

Downloaded 'requirements01.py' successfully.
Downloaded 'openai_setup.py' successfully.
Downloaded 'openai_api.py' successfully.
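
The `download(directory, filename)` calls above fetch files from the repository. The sketch below shows how such a call could map to a raw GitHub URL. It is an assumption based on the `curl` command used earlier; the actual `grequests.download` implementation may build its URL differently, and `build_raw_url` is a hypothetical name.

```python
# Assumption: the base URL mirrors the curl command used to fetch grequests.py.
BASE_URL = ("https://raw.githubusercontent.com/Denis2054/"
            "Building-Business-Ready-Generative-AI-Systems/master")

def build_raw_url(directory, filename, base_url=BASE_URL):
    """Join the base URL, a repository directory, and a filename."""
    return f"{base_url.rstrip('/')}/{directory.strip('/')}/{filename}"

print(build_raw_url("commons", "requirements01.py"))
```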


### Installing OpenAI

In [None]:
# Run the setup script to install and import dependencies
%run requirements01

Uninstalling 'openai'...
Installing 'openai' version 1.57.1...
'openai' version 1.57.1 is installed.


#### Initializing the OpenAI API key



In [None]:
google_secrets=True #activates Google secrets in Google Colab
if google_secrets==True:
  import openai_setup
  openai_setup.initialize_openai_api()

OpenAI API key initialized successfully.


In [None]:
if google_secrets==False: # Uncomment the code and choose any method you wish to initialize the API_KEY
  import os
  #API_KEY=[YOUR API_KEY]
  #os.environ['OPENAI_API_KEY'] = API_KEY
  #openai.api_key = os.getenv("OPENAI_API_KEY")
  #print("OpenAI API key initialized successfully.")
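
When Google secrets are not used, one option is to read the key from an environment variable and fail early if it is missing. This is a minimal sketch under that assumption; `require_env` and `DEMO_API_KEY` are hypothetical names used only for illustration.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Export it before running the notebook.")
    return value

# Placeholder value for the demo; never hard-code a real key in a notebook.
os.environ.setdefault("DEMO_API_KEY", "placeholder-value")
print("Key found:", bool(require_env("DEMO_API_KEY")))
```

The same helper could be called with `"OPENAI_API_KEY"` or `"PINECONE_API_KEY"` before the corresponding client is created.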

#### Importing the API call function

In [None]:
# Import the function from the custom OpenAI API file
import openai_api
from openai_api import make_openai_api_call

## Installing Pinecone

In [None]:
download("commons","requirements02.py")

Downloaded 'requirements02.py' successfully.


In [None]:
# Run the setup script to install and import dependencies
%run requirements02

Uninstalling 'pinecone-client'...
Installing 'pinecone-client' version 5.0.1...
'pinecone-client' version 5.0.1 is installed.


### Initializing the Pinecone API key

In [None]:
download("commons","pinecone_setup.py")

Downloaded 'pinecone_setup.py' successfully.


In [None]:
if google_secrets==True:
  import pinecone_setup
  pinecone_setup.initialize_pinecone_api()

PINECONE_API_KEY initialized successfully.


In [None]:
if google_secrets==False: # Uncomment the code and choose any method you wish to initialize the Pinecone API key
  import os
  #PINECONE_API_KEY=[YOUR PINECONE_API_KEY]
  #os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY
  #print("PINECONE_API_KEY initialized successfully.")

# Processing data

In [None]:
download("Chapter03","data01.txt")

Downloaded 'data01.txt' successfully.


In [None]:
# Path to the downloaded text file
file_path = '/content/data01.txt'

## DATA

In [None]:
try:
    # Read the whole file into a single string
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
except FileNotFoundError:
    text = "Error: File not found. Please check the file path."
print(text)

The CTO was explaing that a business-ready generative AI system (GenAISys) offers functionality similar to ChatGPT-like platforms. It combines generative AI models, RAG, memory retention, and a wide range of ML and non-AI functions managed by an AI controller. The controller orchestrates tasks dynamically rather than following the same set of instructions for each task.
GenAISys relies on a generative AI model such as GPT-4o or any advanced LLM. The CTO said that we saw that getting access to the API is insufficient. Contextual awareness and memory retention are critical components of a GenAISys. Although they are seamlessly available in ChatGPT-like platforms, we have to build them into our systems.
We defined memoryless, short-term, long-term memory, and cross-topic memory. For the hybrid travel marketing campaign, we will distinguish semantic memory(facts) from episodic memory(personal events in time, for example). The CTO said that the we will need to use episodic memories of past 

## Chunking the dataset

In [None]:
# Import libraries
from openai import OpenAI

# Initialize OpenAI Client
client = OpenAI()

# Function to chunk text using GPT-4o
def chunk_text_with_gpt4o(text):
    # Prepare the messages for GPT-4o
    messages = [
        {"role": "system", "content": "You are an assistant skilled at splitting long texts into meaningful, semantically coherent chunks of 50-100 words each."},
        {"role": "user", "content": f"Split the following text into meaningful chunks:\n\n{text}"}
    ]

    # Make the GPT-4o API call
    response = client.chat.completions.create(
        model="gpt-4o",  # GPT-4o model
        messages=messages,
        temperature=0.2,  # Low randomness for consistent chunks
        max_tokens=1024  # Sufficient tokens for the chunked response
    )

    # Extract and clean the response
    chunked_text = response.choices[0].message.content
    chunks = chunked_text.split("\n\n")  # Assume GPT-4o separates chunks with double newlines

    return chunks

# Chunk the text
chunks = chunk_text_with_gpt4o(text)

# Display the chunks
print("Chunks:")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1}:")
    print(chunk)


Chunks:

Chunk 1:
The CTO was explaining that a business-ready generative AI system (GenAISys) offers functionality similar to ChatGPT-like platforms. It combines generative AI models, RAG, memory retention, and a wide range of ML and non-AI functions managed by an AI controller. The controller orchestrates tasks dynamically rather than following the same set of instructions for each task.

Chunk 2:
GenAISys relies on a generative AI model such as GPT-4o or any advanced LLM. The CTO said that getting access to the API is insufficient. Contextual awareness and memory retention are critical components of a GenAISys. Although they are seamlessly available in ChatGPT-like platforms, we have to build them into our systems.

Chunk 3:
We defined memoryless, short-term, long-term memory, and cross-topic memory. For the hybrid travel marketing campaign, we will distinguish semantic memory (facts) from episodic memory (personal events in time, for example). The CTO said that we will need to use 

In [None]:
# Print the length and content of the first 3 chunks
for i in range(3):
    print(len(chunks[i]))
    print(chunks[i])

374
The CTO was explaining that a business-ready generative AI system (GenAISys) offers functionality similar to ChatGPT-like platforms. It combines generative AI models, RAG, memory retention, and a wide range of ML and non-AI functions managed by an AI controller. The controller orchestrates tasks dynamically rather than following the same set of instructions for each task.
324
GenAISys relies on a generative AI model such as GPT-4o or any advanced LLM. The CTO said that getting access to the API is insufficient. Contextual awareness and memory retention are critical components of a GenAISys. Although they are seamlessly available in ChatGPT-like platforms, we have to build them into our systems.
359
We defined memoryless, short-term, long-term memory, and cross-topic memory. For the hybrid travel marketing campaign, we will distinguish semantic memory (facts) from episodic memory (personal events in time, for example). The CTO said that we will need to use episodic memories of past 

In [None]:
# Each block separated by a blank line is treated as a separate chunk
print(f"Total number of chunks: {len(chunks)}")

Total number of chunks: 9
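
The chunking step assumes that GPT-4o separates chunks with double newlines and keeps each chunk within 50-100 words. Neither is guaranteed, so a small post-processing check can help. The following is a minimal sketch; `clean_chunks` is a hypothetical helper that is not part of the original notebook.

```python
def clean_chunks(raw_response, min_words=50, max_words=100):
    """Split a model response on blank lines, drop empty parts, and
    report each chunk's word count and whether it is within range."""
    chunks_out = [part.strip() for part in raw_response.split("\n\n") if part.strip()]
    report = []
    for chunk in chunks_out:
        n_words = len(chunk.split())
        report.append((n_words, min_words <= n_words <= max_words))
    return chunks_out, report

# Toy input with extra blank lines; small word limits keep the demo short.
sample = "first chunk of text\n\n\n\nsecond chunk\n\n"
chunks_demo, report_demo = clean_chunks(sample, min_words=3, max_words=10)
print(chunks_demo)
print(report_demo)
```

Chunks flagged as out of range could be re-chunked or reviewed before they are embedded.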


## Embedding

**IMPORTANT NOTE**: OpenAI continually upgrades its models, including the embedding models. As such, this section is updated when necessary for performance optimization.

## Initializing the embedding model


In [None]:
import openai
import time

embedding_model="text-embedding-3-small"
#embedding_model="text-embedding-3-large"
#embedding_model="text-embedding-ada-002"

# Initialize the OpenAI client
client = openai.OpenAI()

def get_embedding(texts, model="text-embedding-3-small"):
    texts = [text.replace("\n", " ") for text in texts]  # Clean input texts
    response = client.embeddings.create(input=texts, model=model)  # API call for batch
    embeddings = [res.embedding for res in response.data]  # Extract embeddings
    return embeddings


## Embedding the chunks

The `embed_chunks` function takes the following parameters:
* `chunks` (list): List of text chunks to be embedded.
* `embedding_model` (str): Model to be used for embedding.
* `batch_size` (int): Number of chunks to process per batch.
* `pause_time` (int): Time to wait between batches (in seconds).

In [None]:
def embed_chunks(chunks, embedding_model="text-embedding-3-small", batch_size=1000, pause_time=3):
    start_time = time.time()  # Start timing the operation
    embeddings = []  # Initialize an empty list to store the embeddings
    counter = 1  # Batch counter

    # Process chunks in batches
    for i in range(0, len(chunks), batch_size):
        chunk_batch = chunks[i:i + batch_size]  # Select a batch of chunks

        # Get the embeddings for the current batch
        current_embeddings = get_embedding(chunk_batch, model=embedding_model)

        # Append the embeddings to the final list
        embeddings.extend(current_embeddings)

        # Print batch progress and pause
        print(f"Batch {counter} embedded.")
        counter += 1
        time.sleep(pause_time)  # Optional: adjust or remove this depending on rate limits

    # Print total response time
    response_time = time.time() - start_time
    print(f"Total Response Time: {response_time:.2f} seconds")

    return embeddings

embeddings = embed_chunks(chunks)

Batch 1 embedded.
Total Response Time: 3.83 seconds
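
The batching logic in `embed_chunks` slices the list in steps of `batch_size`. The sketch below isolates that slicing so it can be checked without calling the API; `batched` is a hypothetical helper name.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Nine items with a batch size of 4 produce batches of 4, 4, and 1.
demo_items = [f"chunk {i}" for i in range(9)]
print([len(batch) for batch in batched(demo_items, 4)])
```

With the default `batch_size=1000`, the nine chunks in this notebook fit in a single batch, which matches the single "Batch 1 embedded." line in the output.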


In [None]:
print("First embedding:", embeddings[0])

First embedding: [-0.011681988835334778, 0.007010514847934246, 0.04123054817318916, -0.007631615735590458, 0.00541811715811491, -0.0523575097322464, -0.006607459858059883, 0.02761918120086193, 0.020602058619260788, -0.02772490121424198, 0.0014181260485202074, -0.05756418779492378, -0.014192823320627213, -0.028649944812059402, -0.040279075503349304, -0.043080639094114304, -0.004602095577865839, -0.047970157116651535, 0.014708205126225948, -0.010492646135389805, 0.03055289387702942, 0.03639388829469681, -0.015818258747458458, 0.033513035625219345, 0.004816838074475527, -0.019016269594430923, 0.027698470279574394, 0.049740955233573914, -0.026297690346837044, 0.0001448479015380144, 0.04501001536846161, -0.017589056864380836, -0.012316305190324783, -0.01725868508219719, 0.018196944147348404, 0.015844687819480896, 0.03412092104554176, 0.018276233226060867, 0.017681561410427094, 0.039221879094839096, -0.008510407991707325, -0.01719260960817337, 0.014681775122880936, 0.018857689574360847, 0.00

Checking the output: the number of embeddings should match the number of chunks.

In [None]:
# Check the lengths of the chunks and embeddings
num_chunks = len(chunks)
print(f"Number of chunks: {num_chunks}")
print(f"Number of embeddings: {len(embeddings)}")

Number of chunks: 9
Number of embeddings: 9
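
Beyond comparing counts, it can be useful to confirm that every embedding has the dimension the index expects (1536 in this notebook). This is a minimal sketch; `validate_embeddings` is a hypothetical helper, and the toy vectors below use a dimension of 3 only to keep the example short.

```python
def validate_embeddings(chunk_list, embedding_list, expected_dim=1536):
    """Raise ValueError if counts or vector dimensions do not line up."""
    if len(chunk_list) != len(embedding_list):
        raise ValueError(
            f"{len(chunk_list)} chunks but {len(embedding_list)} embeddings"
        )
    for position, vector in enumerate(embedding_list):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {position} has dimension {len(vector)}, "
                f"expected {expected_dim}"
            )
    return True

# Toy data: two chunks, two 3-dimensional vectors.
print(validate_embeddings(["a", "b"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim=3))
```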


# The Pinecone index

In [None]:
import os
from pinecone import Pinecone, ServerlessSpec

# Retrieve the API key from environment variables
api_key = os.environ.get('PINECONE_API_KEY')
if not api_key:
    raise ValueError("PINECONE_API_KEY is not set in the environment!")

# Initialize the Pinecone client
pc = Pinecone(api_key=api_key)

In [None]:
from pinecone import ServerlessSpec

index_name = 'genai-v1'
namespace="data01"
cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

In [None]:
import time
# Check whether the index already exists (it shouldn't on the first run)
if index_name not in pc.list_indexes().names():
    # Create the index if it does not exist
    pc.create_index(
        index_name,
        dimension=1536,  # must match the embedding model (1536 for text-embedding-3-small)
        metric='cosine',
        spec=spec
    )
    # Wait until the index reports that it is ready
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# Connect to the index
index = pc.Index(index_name)
# View index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'genaisys': {'vector_count': 3}},
 'total_vector_count': 3}
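
The index is created with `metric='cosine'`, so records are ranked by the cosine of the angle between the query vector and each stored vector. The following plain-Python sketch illustrates the formula only; it is not how Pinecone computes similarity internally, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0, regardless of their magnitudes.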

# Upserting

In Pinecone, each record within a vector index comprises several key components:

- **ID**: A unique identifier for the record, which can be any string value. This ID is essential for operations like fetching or deleting records by their identifier.

- **Values**: An array of numbers representing the dense vector embedding associated with the record. These values capture the essential features of the data point in a high-dimensional space.

- **Sparse Values (Optional)**: When utilizing hybrid search capabilities, records can include sparse vector values in addition to dense ones. This allows for more nuanced similarity searches by combining dense and sparse representations. Hybrid search combines keyword-based and semantic search techniques to enhance result relevance and accuracy.    

- **Metadata (Optional)**: Additional information associated with the record, stored as key-value pairs. Metadata is useful for filtering search results or adding context to the data.

These components collectively define a record in Pinecone's vector index, enabling efficient similarity searches and data retrieval based on vector embeddings.
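
The components listed above can be assembled into a plain dictionary before upserting. This is a minimal sketch: `make_record` is a hypothetical helper, and the `sparse_values` layout (parallel `indices` and `values` lists) is an assumption based on my understanding of Pinecone's record format, so check it against the current documentation before relying on it.

```python
def make_record(record_id, values, metadata=None, sparse_values=None):
    """Build a record dictionary with an ID, dense values, and optional fields."""
    record = {"id": str(record_id), "values": list(values)}
    if sparse_values is not None:
        # Assumed layout: parallel lists of indices and their weights.
        record["sparse_values"] = {
            "indices": list(sparse_values["indices"]),
            "values": list(sparse_values["values"]),
        }
    if metadata:
        record["metadata"] = dict(metadata)
    return record

# Toy record with a 3-dimensional vector and the chunk text as metadata.
print(make_record(1, [0.1, 0.2, 0.3], metadata={"text": "example chunk"}))
```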

In [None]:
import pinecone
import time
import sys

start_time = time.time()  # Start timing before the request

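# Function to calculate how many items fit in one batch.
# Note: sys.getsizeof is shallow (it does not count the floats inside a list),
# so this is a rough estimate rather than the exact payload size.
def get_batch_size(data, limit=4000000):  # limit set to 4MB to be safe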
    total_size = 0
    batch_size = 0
    for item in data:
        item_size = sum([sys.getsizeof(v) for v in item.values()])
        if total_size + item_size > limit:
            break
        total_size += item_size
        batch_size += 1
    return batch_size

# Upsert function with namespace
def upsert_to_pinecone(batch, batch_size, namespace="data01"):
    """
    Upserts a batch of data to Pinecone under a specified namespace.
    """
    try:
        index.upsert(vectors=batch, namespace=namespace)
        print(f"Upserted {batch_size} vectors to namespace '{namespace}'.")
    except Exception as e:
        print(f"Error during upsert: {e}")

# Function to upsert data in batches
def batch_upsert(data):
    total = len(data)
    i = 0
    while i < total:
        batch_size = get_batch_size(data[i:])
        batch = data[i:i + batch_size]
        if batch:
            upsert_to_pinecone(batch, batch_size, namespace=namespace)
            i += batch_size
            print(f"Upserted {i}/{total} items...")  # Display current progress
        else:
            break
    print("Upsert complete.")

# Generate IDs for each data item
ids = [str(i) for i in range(1, len(chunks) + 1)]

# Prepare data for upsert
data_for_upsert = [
    {"id": record_id, "values": emb, "metadata": {"text": chunk}}
    for record_id, chunk, emb in zip(ids, chunks, embeddings)
]

# Upsert data in batches
batch_upsert(data_for_upsert)

response_time = time.time() - start_time  # Measure response time
print(f"Upsertion response time: {response_time:.2f} seconds")  # Print response time


Upserted 9 vectors to namespace 'data01'.
Upserted 9/9 items...
Upsert complete.
Upsertion response time: 1.11 seconds
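
The batch sizing above relies on `sys.getsizeof`, which measures only the outer Python object and not the floats inside each vector. An alternative is to estimate each record's size from its JSON serialization. The sketch below is not the notebook's method: `json_size_bytes` and `split_by_payload` are hypothetical helpers, and the 4,000,000-byte default simply reuses the limit chosen in the code above.

```python
import json

def json_size_bytes(record):
    """Approximate payload size of one record as UTF-8 encoded JSON."""
    return len(json.dumps(record).encode("utf-8"))

def split_by_payload(records, limit_bytes=4_000_000):
    """Yield lists of records whose combined JSON size stays within the limit."""
    batch, batch_bytes = [], 0
    for record in records:
        size = json_size_bytes(record)
        if size > limit_bytes:
            raise ValueError(f"record {record.get('id')} alone exceeds the limit")
        if batch and batch_bytes + size > limit_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch

# Toy data: five equally sized records and a limit that fits two per batch.
demo_records = [{"id": str(i), "values": [0.1] * 10} for i in range(5)]
demo_limit = 2 * json_size_bytes(demo_records[0])
print([len(b) for b in split_by_payload(demo_records, demo_limit)])
```

Unlike the loop above, which stops silently when the next item does not fit, this version raises an error when a single record is larger than the limit.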


In [None]:
print("Index stats")
print(index.describe_index_stats())

Index stats
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'genaisys': {'vector_count': 3}},
 'total_vector_count': 3}
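
To check a specific namespace, the vector count can be read from the stats. The sketch below works on a plain dictionary shaped like the printed output above; converting the actual response object to a dictionary is left as an assumption, and `namespace_vector_count` is a hypothetical helper name.

```python
def namespace_vector_count(stats, namespace):
    """Return the vector count for `namespace`, or 0 if it is not listed."""
    namespaces = stats.get("namespaces", {})
    return namespaces.get(namespace, {}).get("vector_count", 0)

# Dictionary copied from the printed stats above.
stats_demo = {
    "dimension": 1536,
    "index_fullness": 0.0,
    "namespaces": {"genaisys": {"vector_count": 3}},
    "total_vector_count": 3,
}
print(namespace_vector_count(stats_demo, "genaisys"))
print(namespace_vector_count(stats_demo, "data01"))
```

A namespace that was just upserted may not appear in the stats right away, so a count of 0 immediately after an upsert does not necessarily mean the upsert failed.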
