[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/openai/semantic_search_openai.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/openai/semantic_search_openai.ipynb)

# Semantic Search with Pinecone and OpenAI

In this guide you will learn how to use the OpenAI Embedding API to generate language embeddings, and then index those embeddings in the Pinecone vector database for fast and scalable vector search.

This is a powerful and common combination for building semantic search, question-answering, threat-detection, and other applications that rely on NLP and search over a large corpus of text data.

The basic workflow looks like this:

**Embed and index**

* Use the OpenAI Embedding API to generate vector embeddings of your documents (or any text data).
* Upload those vector embeddings into Pinecone, which can store and index millions/billions of these vector embeddings, and search through them at ultra-low latencies.

**Search**

* Pass your query text or document through the OpenAI Embedding API again.
* Take the resulting vector embedding and send it as a query to Pinecone.
* Get back semantically similar documents, even if they don't share any keywords with the query.

![Architecture overview](https://files.readme.io/6a3ea5a-pinecone-openai-overview.png)
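To make the search step concrete, here is a minimal offline sketch. It is not how Pinecone works internally: the hand-written 3-dimensional vectors stand in for real embeddings, and `cosine_similarity`, `brute_force_search`, and `toy_index` are hypothetical names used only for illustration. The "index" is a plain dictionary scanned by brute force.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(index, query_vector, top_k=2):
    """Return the top_k (id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(vector, query_vector))
        for doc_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hand-written vectors standing in for real document embeddings.
toy_index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.6, 0.8, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}

results = brute_force_search(toy_index, query_vector=[1.0, 0.05, 0.0], top_k=2)
for doc_id, score in results:
    print(f"{score:.2f}: {doc_id}")
```

A vector database replaces the linear scan with an index structure so that the same kind of lookup stays fast over very large collections.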

Let's get started...

## Setup

We first need to set up our environment and retrieve API keys for OpenAI and Pinecone. Let's start with the environment: we need HuggingFace *Datasets* for our data, and the OpenAI and Pinecone clients:

In [1]:
!pip install -qU pinecone-client openai datasets


### Creating Embeddings

Then we initialize our connection to OpenAI Embeddings *and* Pinecone vector DB. Sign up for an API key over at [OpenAI](https://beta.openai.com/signup) and [Pinecone](https://app.pinecone.io).

In [6]:
import openai
import os

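# this configuration targets an Azure OpenAI resource; replace the endpoint with your own
openai.api_type = "azure"
openai.api_base = "https://maap-001-openai.openai.azure.com/"
openai.api_version = "2022-12-01"
# read the key from the environment if set; otherwise fall back to the placeholder
openai.api_key = os.environ.get("OPENAI_API_KEY", "OpenAI_API_KEY")
# get API key from top-right dropdown on OpenAI website

# note: without parentheses this only references the method and makes no request;
# call openai.Engine.list() to actually check that we have authenticated
openai.Engine.list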

<bound method ListableAPIResource.list of <class 'openai.api_resources.engine.Engine'>>
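Hard-coding keys in a notebook makes them easy to leak. One possible alternative, sketched below, is to read them from environment variables. The helper `require_env` and the variable name `EXAMPLE_API_KEY` are hypothetical and only serve the demonstration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demo value only; in practice the variable is set outside the notebook.
os.environ.setdefault("EXAMPLE_API_KEY", "not-a-real-key")

api_key = require_env("EXAMPLE_API_KEY")
print(f"loaded a key of length {len(api_key)}")
```

In the notebook, the result could then be assigned to `openai.api_key` or passed as `api_key` to `pinecone.init`.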

We can now create embeddings with the OpenAI `text-embedding-ada-002` model like so:

In [8]:
def get_embedding(text):
    # note: this function assumes you have already set your OpenAI key
    result = openai.Embedding.create(
        engine='text-embedding-ada-002',
        input=text
    )
    return result["data"][0]["embedding"]

In [9]:
def get_embeddings(input_texts):  
    results = []  
    for input_text in input_texts:  
        res = get_embedding(input_text)  
        results.append(res)  
    return results
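Because `get_embeddings` makes one network request per text, a single transient failure would interrupt the whole loop. A minimal sketch of one way to soften that is a retry wrapper with exponential backoff. Everything here (`with_retries`, `flaky_embedding`, the delays) is a hypothetical illustration, and the fake function stands in for a real API call.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Wrap fn so failed calls are retried with exponentially growing delays."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * (2 ** attempt))
    return wrapper

# Demo with a fake embedding function that fails twice, then succeeds.
calls = {"n": 0}

def flaky_embedding(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return [0.0, 1.0]

# The sleep function is replaced so the demo does not actually wait.
safe_embedding = with_retries(flaky_embedding, sleep=lambda seconds: None)
vector = safe_embedding("hello")
print(vector, calls["n"])
```

In real use you would likely catch only the client's transient error types rather than every `Exception`.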

In [16]:
inputs = [  
    "Sample document text goes here",  
    "there will be several phrases in each batch",  
]

res = get_embeddings(inputs)
len(res)

# print(f"vector 0: {len(res[0])}\nvector 1: {len(res[1])}")

2

In [None]:
# get_embeddings already returns a plain list of vectors, so no extraction is needed
embeds = res
len(embeds)

Next, we initialize our index to store vector embeddings with Pinecone.

In [17]:
embeds = res

len(embeds[0])

1536

In [18]:
import pinecone

index_name = 'semantic-search-openai'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="PINECONE_API_KEY",  # placeholder: never commit a real key
    environment="asia-southeast1-gcp"  # find next to api key in console
)
# check if the index already exists (only create index if not)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=len(embeds[0]))
# connect to index
index = pinecone.Index(index_name)
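The index is created with a fixed dimension (1536 here), so every vector sent to it needs that same length. A small guard like the one below can catch a mismatch before any request is made. The helper `check_dimensions` is a hypothetical name, and the sample vectors are dummy values rather than real embeddings.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )
    return True

# Dummy vectors with the same length as the embeddings in this notebook.
sample = [[0.1] * 1536, [0.2] * 1536]
print(check_dimensions(sample, expected_dim=1536))
```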

## Populating the Index

Now we will take the first 32 questions from the TREC dataset.

In [19]:
from datasets import load_dataset

# load the first 32 rows of the TREC dataset
trec = load_dataset('trec', split='train[:32]')
trec

Dataset({
    features: ['text', 'coarse_label', 'fine_label'],
    num_rows: 32
})

In [20]:
trec[0]

{'text': 'How did serfdom develop in and then leave Russia ?',
 'coarse_label': 2,
 'fine_label': 26}

Then we create a vector embedding for each phrase using OpenAI, and `upsert` the ID, vector embedding, and original text for each phrase to Pinecone.

In [None]:
from tqdm.auto import tqdm

batch_size = 32  # process everything in batches of 32
for i in tqdm(range(0, len(trec['text']), batch_size)):
    # set end position of batch
    i_end = min(i+batch_size, len(trec['text']))

    # get batch of lines and IDs (the row positions serve as unique IDs)
    lines_batch = trec['text'][i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]

    # create embeddings
    embeds = get_embeddings(lines_batch)

    # prep metadata and upsert batch
    meta = [{'text': line} for line in lines_batch]
    to_upsert = zip(ids_batch, embeds, meta)

    # upsert to Pinecone
    index.upsert(vectors=list(to_upsert))
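The batching logic in the loop above can be exercised without any API calls. The sketch below separates it into two hypothetical helpers, `iter_batches` and `build_records`, and uses a fake embedding function in place of `get_embeddings`, so the `(id, vector, metadata)` tuples can be inspected before anything is upserted.

```python
def iter_batches(items, batch_size):
    """Yield (start_offset, batch) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def build_records(start, texts, vectors):
    """Pair each text with a string ID, its vector, and a metadata dict."""
    ids = [str(start + offset) for offset in range(len(texts))]
    metadata = [{"text": text} for text in texts]
    return list(zip(ids, vectors, metadata))

def fake_embed(batch):
    """Stand-in for get_embeddings: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]

texts = ["q0", "q1", "q2", "q3", "q4"]

all_records = []
for start, batch in iter_batches(texts, batch_size=2):
    all_records.extend(build_records(start, batch, fake_embed(batch)))

print(all_records[-1])
```

The final batch is shorter than `batch_size`, which is the case the `i_end` calculation in the notebook handles.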

---

## Querying

With our data indexed, we're now ready to move on to performing searches. This follows a similar process to indexing. We start with a text `query` that we would like to use to find similar sentences. As before, we encode this with the same OpenAI `text-embedding-ada-002` model to create a *query vector* `xq`. We then use `xq` to query the Pinecone index.

In [25]:
query = "Why do heavier objects travel downhill faster?"

xq = get_embedding(query)

Now query...

In [26]:
res = index.query([xq], top_k=5, include_metadata=True)
res

{'matches': [{'id': '11',
              'metadata': {'text': 'Why do heavier objects travel downhill '
                                   'faster ?'},
              'score': 0.989782274,
              'values': []},
             {'id': '27',
              'metadata': {'text': 'What is the highest waterfall in the '
                                   'United States ?'},
              'score': 0.735625267,
              'values': []},
             {'id': '23',
              'metadata': {'text': "What 's the Olympic motto ?"},
              'score': 0.730186522,
              'values': []},
             {'id': '0',
              'metadata': {'text': 'How did serfdom develop in and then leave '
                                   'Russia ?'},
              'score': 0.726234138,
              'values': []},
             {'id': '14',
              'metadata': {'text': 'What is considered the costliest disaster '
                                   'the insurance industry has ever faced ?'},
  

The response from Pinecone includes our original text in the `metadata` field. Let's print out the `top_k` most similar questions and their respective similarity scores.

In [27]:
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.99: Why do heavier objects travel downhill faster ?
0.74: What is the highest waterfall in the United States ?
0.73: What 's the Olympic motto ?
0.73: How did serfdom develop in and then leave Russia ?
0.73: What is considered the costliest disaster the insurance industry has ever faced ?
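Only the first result is a real match here; the rest sit around 0.73. One simple way to separate them, sketched below, is a score cutoff. The helper `filter_matches` and the 0.8 threshold are assumptions for illustration, not values recommended by Pinecone or OpenAI, and a suitable cutoff depends on your data. The sample matches reuse scores and texts from the query output above.

```python
def filter_matches(matches, min_score):
    """Keep only matches whose similarity score is at least min_score."""
    return [match for match in matches if match["score"] >= min_score]

# Scores and texts copied from the first query's output.
matches = [
    {"id": "11", "score": 0.989782274,
     "metadata": {"text": "Why do heavier objects travel downhill faster ?"}},
    {"id": "27", "score": 0.735625267,
     "metadata": {"text": "What is the highest waterfall in the United States ?"}},
    {"id": "23", "score": 0.730186522,
     "metadata": {"text": "What 's the Olympic motto ?"}},
]

strong = filter_matches(matches, min_score=0.8)  # 0.8 is an arbitrary cutoff
for match in strong:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
```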


Looks good. Let's make it harder and replace *"downhill"* with the incorrect term *"uphill"*.

In [28]:
query = "Why do heavier objects travel uphill faster?"

# create the query embedding
xq = get_embedding(query)

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.96: Why do heavier objects travel downhill faster ?
0.74: What 's the Olympic motto ?
0.73: What is the highest waterfall in the United States ?
0.72: What is considered the costliest disaster the insurance industry has ever faced ?
0.72: What fowl grabs the spotlight after the Chinese Year of the Monkey ?


And again...

In [29]:
query = "What will happen when heavier objects travel downwards?"

# create the query embedding
xq = get_embedding(query)

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.92: Why do heavier objects travel downhill faster ?
0.74: What is the highest waterfall in the United States ?
0.73: What is considered the costliest disaster the insurance industry has ever faced ?
0.73: How did serfdom develop in and then leave Russia ?
0.73: What 's the Olympic motto ?


Looks great. In each case the top result is the question closest in meaning to the query, even when the wording differs, while the unrelated questions score noticeably lower (around 0.73 versus 0.92 or higher).

Once we're finished with the index we delete it to save resources.

In [None]:
pinecone.delete_index(index_name)

---