[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/generative-qa/openai-ml-qa/00-build-index.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/generation/generative-qa/openai-ml-qa/00-build-index.ipynb)

# Index Init

We use this notebook to create embeddings with OpenAI and push the embeddings and metadata to Pinecone. Required installs for this notebook are:

In [1]:
!pip install -qU openai pinecone-client datasets

## Data Preparation

We start by downloading the dataset from Hugging Face *Datasets*:

In [2]:
from datasets import load_dataset

data = load_dataset('jamescalam/ml-qa', split='train')
data

Dataset({
    features: ['docs', 'category', 'thread', 'href', 'question', 'context', 'marked'],
    num_rows: 6165
})

In [5]:
data[100]

{'docs': 'huggingface',
 'category': 'Beginners',
 'thread': 'Training stops when I try Fine-Tune XLSR-Wav2Vec2 for low-resource ASR',
 'href': 'https://discuss.huggingface.co/t/training-stops-when-i-try-fine-tune-xlsr-wav2vec2-for-low-resource-asr/8981',
 'question': 'Hi,\nI’m learning Wav2Vec2 according the blog link:\n  \n\n      huggingface.co\n  \n\n  \n    \n\nFine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers 1\n\n\n\n  \n\n  \n    \n    \n  \n\n  \n\n\nAnd I download the ipynb file and try run it locally.\nFine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_🤗_Transformers.ipynb\nAll looks file but when I run trainer.train(), it seems stop after a while, and it generate some log files under the folder wav2vec2-large-xlsr-turkish-demo, I send the screen shot to you as following:\n\n2021-08-05 17-05-36 的屏幕截图1063×410 35 KB\n\nI don’t know how to open the file events.out.tfevents.1628152300.tq-sy.129248.2, what’s the problem and how can I debug of it? please help.\nThanks a lo

When storing the original plaintext (and other metadata) of our data, we can either store them in Pinecone as indexed or non-indexed metadata — or elsewhere.

Storing everything in Pinecone keeps the system simpler, because we then query a single location. However, there are limits on metadata size: 5KB for *indexed* metadata and 10.24KB for *non-indexed* metadata ([see here for more info](https://docs.pinecone.io/docs/limits#:~:text=Max%20metadata%20size%20per%20vector,key%20from%20the%20metadata%20payload.)).

First, let's check whether we can fit our data within either of these two limits.

In [31]:
from sys import getsizeof
import json

limit = 0

for record in data:
    size = getsizeof(json.dumps(record))
    if size > 10_240:
        limit += 1

print(f"Over 10.24KB: {round((limit/len(data)*100),2)}%")

Over 10.24KB: 1.33%
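
One caveat about the measurement above: `sys.getsizeof` reports the in-memory size of the Python `str` object, which includes interpreter overhead, rather than the length of the serialized JSON. How Pinecone itself measures metadata size is not stated here, so the following is only a sketch. It assumes the UTF-8 byte length of the JSON payload is a closer proxy. The `metadata_bytes` helper and the sample record are hypothetical.

```python
import json
from sys import getsizeof


def metadata_bytes(record: dict) -> int:
    """Size of the record in bytes once serialized to JSON and UTF-8 encoded."""
    return len(json.dumps(record).encode("utf-8"))


# hypothetical record, used only to compare the two measurements
record = {"docs": "huggingface", "question": "How do I fine-tune a model?"}
payload = json.dumps(record)

in_memory = getsizeof(payload)   # size of the Python str object (CPython adds overhead)
encoded = metadata_bytes(record) # bytes of the serialized JSON

print(in_memory, encoded)
```

On CPython the `getsizeof` figure is the larger of the two, so the check above errs on the side of flagging slightly more records rather than fewer.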


A small number of our records do exceed this limit, so we must either store some of the fields elsewhere, truncate the data, or drop these items.

As we have just 1.33% of our records exceeding the limit, we will go with the simpler approach of dropping these excessively long samples.

In [33]:
data = data.filter(
    # keep only the records whose serialized size is within the limit
    lambda x: getsizeof(json.dumps(x)) <= 10_240
)
data

  0%|          | 0/7 [00:00<?, ?ba/s]

Dataset({
    features: ['docs', 'category', 'thread', 'href', 'question', 'context', 'marked'],
    num_rows: 6083
})

For now, let's move on to preparing the text data and building our embeddings.

## Text Prep and Embeddings

To store as much information as possible in each record, it may make sense to format each record as something like:

```
Thread title: <thread>

Question asked: <question>

Given answer: <context>
```

We will build this format for each record and store the results in a new `text` list.

In [34]:
text = [
    f"Thread title: {x['thread']}\n\n"+
    f"Question asked: {x['question']}\n\n"+
    f"Given answer: {x['context']}" for x in data
]
text[100]

'Thread title: Wav2Vec2ForCTC and Wav2Vec2Tokenizer\n\nQuestion asked: Having installed transformers and trying:\nimport transformers\nimport librosa\nimport soundfile as sf\nimport torch\nfrom transformers import Wav2Vec2ForCTC\nfrom transformers import Wav2Vec2Tokenizer\n#load model and tokenizer\ntokenizer = Wav2Vec2Tokenizer.from_pretrained(“facebook/wav2vec2-base-960h”)\nmodel = Wav2Vec2ForCTC.from_pretrained(“facebook/wav2vec2-base-960h”)\nI get:\nImportError                               Traceback (most recent call last)\n in \n3 import soundfile as sf\n4 import torch\n----> 5 from transformers import Wav2Vec2ForCTC\n6 from transformers import Wav2Vec2Tokenizer\n7\nImportError: cannot import name ‘Wav2Vec2ForCTC’ from ‘transformers’ (c:\\python\\python37\\lib\\site-packages\\transformers_init_.py)\nHow I install/get Wav2Vec2ForCTC and Wav2Vec2Tokenizer ???\n\nGiven answer: This probably means you don’t have the latest version. You should check your version of Transformers with\n

The text isn't always going to be perfect, but we'll see that the embedding model doesn't have any issues with this. Now let's initialize the embedding model and begin building the embeddings.

### Embedding with OpenAI

We begin by initializing the embedding model. For this we need [OpenAI API keys](https://beta.openai.com/signup).
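
To keep the key out of the notebook itself, one option is to read it from an environment variable. This is a minimal sketch, not part of the original notebook: the `read_api_key` helper is hypothetical, and it assumes the variable `OPENAI_API_KEY` was exported before the notebook was started.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `env_var`, raising if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# hypothetical usage, replacing the hardcoded string:
# openai.api_key = read_api_key()
```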

In [9]:
import openai

# optional: organization ID, found under organization > settings in the
# top-right dropdown on the OpenAI website
#openai.organization = "OPENAI_ORG_KEY"
# API key, also available from the top-right dropdown on the OpenAI website
openai.api_key = "OPENAI_API_KEY"

openai.Engine.list()  # check we have authenticated

<OpenAIObject list at 0x7f55d5538b30> JSON: {
  "data": [
    {
      "created": null,
      "id": "babbage",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "ada",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-davinci-002",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "davinci",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "babbage-code-search-code",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-similarity-babbage-001",
      "object": "engine",
      "owner": "openai-dev",
    

The `openai.Engine.list()` function should return a list of models that we can use. One of those is `text-embedding-ada-002`, which we will use to create embeddings like so:

In [10]:
model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=model
)

In the response `res` we will find a JSON-like object containing our new embeddings within the `'data'` field.

In [14]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside `'data'` we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains `1536` dimensions (the output dimensionality of the `text-embedding-ada-002` model).

In [15]:
len(res['data'])

2

In [18]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

We will apply this same embedding logic when indexing all of our data in the Pinecone vector database soon.
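
The index in this notebook is created with `metric='cosine'`, so it may help to see what that score measures. The following is a plain-Python sketch of cosine similarity. The `cosine_similarity` helper and the toy 3-dimensional vectors are illustrative stand-ins for the 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# toy vectors standing in for real embeddings
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 0.0]
score = cosine_similarity(v1, v2)
print(score)
```

With the response from above, the same function could be applied as `cosine_similarity(res['data'][0]['embedding'], res['data'][1]['embedding'])`.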

## Building a Pinecone Index

We need a vector index to store the vector embeddings and enable a fast and scalable search through them. For this we use the Pinecone vector database.

To use this we need a [free Pinecone API key](https://app.pinecone.io).

Once ready, we initialize our index like so:

In [24]:
import pinecone

index_name = 'openai-ml-qa'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="PINECONE_API_KEY",
    environment="us-east1-gcp"  # may be different, check at app.pinecone.io
)
# check if 'openai-ml-qa' already exists (only create index if not)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        # we may want to filter by docs (e.g. pytorch vs tensorflow)
        metadata_config={'indexed': ['docs']}
    )
# connect to index
index = pinecone.Index(index_name)

We set `metadata_config={'indexed': ['docs']}` so that only the `docs` field is indexed for filtering; every other metadata field is stored as *non-indexed* metadata. This keeps the index smaller and more efficient.
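
With `docs` indexed, later queries can be restricted to particular documentation sources. The sketch below builds a filter dictionary using Pinecone's `$in` metadata operator. The `docs_filter` helper is hypothetical, and the commented `index.query` call assumes a query vector `xq` produced by the same embedding model.

```python
def docs_filter(sources):
    """Build a metadata filter matching any of the given `docs` values."""
    sources = list(sources)
    if not sources:
        raise ValueError("provide at least one docs value")
    return {"docs": {"$in": sources}}


flt = docs_filter(["pytorch", "tensorflow"])
print(flt)

# hypothetical usage once the index is populated (xq is an assumed query vector):
# index.query(vector=xq, top_k=3, include_metadata=True, filter=flt)
```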

Now we begin populating the index.

When adding records to Pinecone we need three items in a tuple format:

```
(id, vector, metadata)
```

All IDs must be unique, our vectors will be built by OpenAI, and the metadata is a dictionary of the information for each record (`'href'`, `'question'`, etc).

We will create our vector embeddings and add the records to Pinecone in batches of `128`. This avoids pushing too much data into a single API request.

In [35]:
from tqdm.auto import tqdm  # this is our progress bar

batch_size = 128  # process everything in batches of 128
for i in tqdm(range(0, len(text), batch_size)):
    # set end position of batch
    i_end = min(i+batch_size, len(text))
    # get batch of metadata, text, and IDs
    meta_batch = [data[x] for x in range(i,i_end)]
    text_batch = text[i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    # create embeddings
    res = openai.Embedding.create(input=text_batch, engine=model)
    embeds = [record['embedding'] for record in res['data']]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/48 [00:00<?, ?it/s]
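
The batch boundaries used in the loop can be checked without calling either API. This is a minimal sketch with a hypothetical `batch_ranges` helper that mirrors the `i` / `i_end` arithmetic; `6083` is the record count after filtering.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(6083, 128))
print(len(ranges))   # 48
print(ranges[-1])    # (6016, 6083)
```

The final batch is shorter than the rest (67 records), which is why the loop clamps `i_end` with `min`.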

We can check that everything has been upserted with `index.describe_index_stats()`:

In [36]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 6083}},
 'total_vector_count': 6083}

We have `6083` vectors (and their respective metadata) added to the index as expected.

With that, our index has been built and we can move on to the next stage: querying.