# Creating a Pinecone Vector DB and Uploading Data

To use this guide, you will need a [Pinecone API key](https://www.pinecone.io/). The steps for creating a Pinecone database and uploading a few selected PDF files to it are based on the [official examples](https://github.com/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb) provided by Pinecone. All API keys are read from environment variables. The code was tested on a PDF of the Nvidia Wikipedia page, so some cells show output from that run.


In [44]:
import os
from uuid import uuid4

import fitz
import pinecone
import tiktoken
from datasets import Dataset
from tqdm.auto import tqdm

Define a helper function to parse PDF files. You can adapt it to read other text formats.

In [45]:
def load_data_from_pdfs(path):
    """Extract the text of every PDF in `path` into id/text/url lists."""
    local_urls = []
    local_articles = []
    for x in tqdm(os.listdir(path)):
        if x.endswith(".pdf"):
            print(x)
            # os.path.join works whether or not `path` ends with a separator
            pdf_path = os.path.join(path, x)
            local_urls.append(pdf_path)
            doc = fitz.open(pdf_path)
            # concatenate the text of all pages in the document
            text = ""
            for page in doc:
                text += page.get_text()
            local_articles.append(text)
    data_local = {
        "id": list(range(len(local_urls))),
        "text": local_articles,
        "url": local_urls,
    }
    return data_local
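
The same idea carries over to other formats. As a minimal sketch, assuming a folder of UTF-8 `.txt` files, a hypothetical `load_data_from_txts` helper (not part of the original notebook) could build the same `id`/`text`/`url` structure using only the standard library:

```python
import os


def load_data_from_txts(path):
    """Hypothetical plain-text counterpart to the PDF loader."""
    urls = []
    articles = []
    # sorted() keeps the ids stable across runs
    for name in sorted(os.listdir(path)):
        if name.endswith(".txt"):
            full_path = os.path.join(path, name)
            # assumption: the files are UTF-8 encoded
            with open(full_path, encoding="utf-8") as f:
                articles.append(f.read())
            urls.append(full_path)
    return {"id": list(range(len(urls))), "text": articles, "url": urls}
```

The returned dictionary has the same keys as the PDF version, so it can be passed to `Dataset.from_dict` in the same way.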

Create a dataset in Hugging Face format:

In [46]:
data = load_data_from_pdfs("kb/")
our_dataset = Dataset.from_dict(data)
print(our_dataset)

  0%|          | 0/4 [00:00<?, ?it/s]

nvidia.pdf
Dataset({
    features: ['id', 'text', 'url'],
    num_rows: 1
})


You can save the dataset to disk in Hugging Face dataset format to avoid processing the PDFs again.

In [47]:
our_dataset.save_to_disk("kb")

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

Every record contains *a lot* of text. Our first task is therefore to find a good preprocessing method for splitting these articles into more concise chunks, which will later be embedded and stored in our Pinecone vector database.

In [48]:
# gpt-4 uses the cl100k_base encoding, so we load that encoding directly
tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

Now we use LangChain's `RecursiveCharacterTextSplitter` to split our text into chunks of a specified maximum length, measured with the `tiktoken_len` function defined above. Keep in mind that the processing strategy used to populate the database must be the same as the one used when querying it.

In [49]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
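
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on the listed separators and merges the pieces); the function name `sliding_window_chunks` is hypothetical, and whitespace-separated words stand in for tiktoken tokens.

```python
def sliding_window_chunks(tokens, chunk_size=400, chunk_overlap=20):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Stand-in "tokens": 1000 dummy words instead of real tiktoken tokens.
words = [f"w{i}" for i in range(1000)]
word_chunks = sliding_window_chunks(words)
print([len(c) for c in word_chunks])  # prints [400, 400, 240]
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last 20 tokens of one chunk reappear at the start of the next. The overlap keeps some shared context on both sides of a chunk boundary.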

Let's test it out!

In [50]:
chunks = text_splitter.split_text(our_dataset[0]['text'])[:3]
chunks

['10/3/23, 11:26 AM\nNvidia - Wikipedia\nhttps://en.wikipedia.org/wiki/Nvidia\n1/25\nNvidia Corporation\nHeadquarters at Santa Clara in 2023\nTrade name\nNVIDIA\nType\nPublic\nTraded as\nNasdaq: NVDA (https://w\nww.nasdaq.com/market-a\nctivity/stocks/nvda)\nNasdaq-100 component\nS&P 100 component\nS&P 500 component\nIndustry\nComputer hardware\nComputer software\nCloud computing\nSemiconductors\nArtificial intelligence\nGPUs\nGraphics cards\nConsumer electronics\nVideo games\nFounded\nApril 5, 1993 in\nSunnyvale, California,\nU.S.\nFounders\nJensen Huang\nCurtis Priem\nChris Malachowsky\nNvidia\nNvidia Corporation[note 1][note 2] (/ɛnˈvɪdiə/ en-VID-ee-ə)\nis \nan \nAmerican \nmultinational \ntechnology \ncompany\nincorporated in Delaware and based in Santa Clara,\nCalifornia.[2] It is a software and fabless company which\ndesigns graphics processing units (GPUs), application\nprogramming interface (APIs) for data science and high-\nperformance computing as well as system on a chip unit

Let's check the token lengths of these chunks:

In [51]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1]), tiktoken_len(chunks[2])

(383, 387, 377)

Using the `text_splitter`, we get chunks that all stay under the 400-token limit. We'll use this functionality during the indexing process later. Now let's take a look at embedding.

## Creating Embeddings

Building embeddings using LangChain's OpenAI embedding support is fairly straightforward. We first need to load our OpenAI API key, which the next cell reads from the `OPENAI_API_KEY` environment variable:

In [52]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

*(Note that OpenAI is a paid service and so running the remainder of this notebook may incur some small cost)*

After loading the API key, we can initialize our `text-embedding-ada-002` embedding model like so:

In [53]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Now we embed some example text from the data we just parsed.

In [54]:
# embed_documents expects a list of strings; passing a single string
# would embed it one character at a time
res = embed.embed_documents(chunks)
len(res), len(res[0])

(3, 1536)

From this we get 1536-dimensional embeddings. Now we move on to initializing our Pinecone vector database.
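
The index created later in this guide uses `metric='cosine'`, so it helps to see what that metric computes. The following is a minimal pure-Python sketch; the function name `cosine_similarity` is a hypothetical helper for illustration, not part of the Pinecone or LangChain APIs, and short toy vectors stand in for the 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors in place of real embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why the metric suits comparing embeddings of texts with different sizes.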

## Vector Database

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [55]:
index_name = 'nemoguardrailsindex'
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
# find ENV (cloud region) next to API key in console
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'gcp-starter'

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)
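
Since `os.getenv` silently returns `None` for a missing variable, a forgotten key may only surface later as an authentication error. One option is to fail early; the sketch below uses a hypothetical `require_env` helper that is not part of the original notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the key lookup could be written as `PINECONE_API_KEY = require_env('PINECONE_API_KEY')`, so a missing key stops the notebook at the cell where it is read.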


If this is a new index, it takes a few minutes to create, so the following code might return nothing at first.

In [56]:
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )
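
Because index creation is not instantaneous, it can be convenient to poll until the index shows up instead of rerunning cells by hand. The sketch below is a generic polling helper; `wait_until` and its `timeout` and `interval` parameters are hypothetical names chosen for illustration.

```python
import time


def wait_until(is_ready, timeout=300.0, interval=5.0):
    """Poll is_ready() until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In this notebook the check could be `wait_until(lambda: index_name in pinecone.list_indexes())`, which reuses the same listing call as the cell above.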

Verify that the index was created, or that the existing index is still there. On the free version of Pinecone, indexes that are not being used are purged on a regular basis.

In [57]:
# use a separate loop variable so that index_name is not overwritten
for name in pinecone.list_indexes():
    print(name)

nemoguardrailsindex


Then we connect to the selected index:

In [58]:
index = pinecone.GRPCIndex(index_name)

In [59]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}

If this is a new Pinecone index, we expect to see a `total_vector_count` of `0`, as we haven't added any vectors yet. If it's a previously existing index, it should have a non-zero value.

## Indexing

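We can perform the indexing task using the LangChain vector store object, but it is much faster to do it directly via the Pinecone Python client. We accumulate chunks and upsert them once at least `batch_limit` chunks have been collected. `batch_limit` is set to `10` below; a larger value such as `100` reduces the number of requests.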

In [63]:
batch_limit = 10

texts = []
metadatas = []

for i, record in enumerate(tqdm(our_dataset)):
    # first get metadata fields for this record
    metadata = {
        'id': str(record['id']),
        'source': record['url']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))

  0%|          | 0/1 [00:00<?, ?it/s]
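
The accumulate-and-flush pattern in the loop above can be isolated into a small generator. This is a sketch with a hypothetical `iter_batches` helper. Unlike the loop above, which appends all chunks of a record before checking the limit and can therefore exceed `batch_limit`, it yields batches of exactly `batch_limit` items plus one final partial batch.

```python
def iter_batches(items, batch_limit):
    """Yield lists of up to batch_limit items, including a final partial batch."""
    if batch_limit < 1:
        raise ValueError("batch_limit must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_limit:
            yield batch
            batch = []
    # flush whatever is left after the loop, like the trailing `if` above
    if batch:
        yield batch


# 25 dummy items in batches of 10
print([len(b) for b in iter_batches(range(25), 10)])  # prints [10, 10, 5]
```

The final `if batch:` plays the same role as the `if len(texts) > 0:` block after the main loop: without it, the last few chunks would never be upserted.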

We've now indexed everything. It might take a minute for the new vectors to show up in the index statistics. We can check the number of vectors in our index like so:

In [64]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.00061,
 'namespaces': {'': {'vector_count': 61}},
 'total_vector_count': 61}

That is it for now. You have created a Pinecone vector database, initialized it, and uploaded data of your choice to it. You can now head over to NeMo Guardrails and interact with the database.