[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/openai/semantic_search_openai.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/openai/semantic_search_openai.ipynb)

# Semantic Search with Pinecone and OpenAI

In this guide you will learn how to use the OpenAI Embedding API to generate language embeddings, and then index those embeddings in the Pinecone vector database for fast and scalable vector search.

This is a powerful and common combination for building semantic search, question-answering, threat-detection, and other applications that rely on NLP and search over a large corpus of text data.

The basic workflow looks like this:

**Embed and index**

* Use the OpenAI Embedding API to generate vector embeddings of your documents (or any text data).
* Upload those vector embeddings into Pinecone, which can store and index millions or billions of them and search through them at low latency.

**Search**

* Pass your query text or document through the OpenAI Embedding API again.
* Take the resulting vector embedding and send it as a query to Pinecone.
* Get back semantically similar documents, even if they don't share any keywords with the query.
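
The two phases above can be sketched end to end without any external service. This is only a toy illustration: `toy_embed` is a word-count stand-in for the OpenAI Embedding API (it matches shared words, not meaning), `ToyIndex` is a hypothetical in-memory stand-in for a Pinecone index, and the three documents are made up for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding API: a sparse word-count vector (not semantic)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical in-memory index holding (vector, metadata) per ID."""

    def __init__(self):
        self._items = {}

    def upsert(self, vectors):
        # Insert or overwrite each (id, vector, metadata) tuple.
        for id_, vec, meta in vectors:
            self._items[id_] = (vec, meta)

    def query(self, vector, top_k=5):
        # Score every stored vector and return the top_k best matches.
        scored = [(cosine(vector, vec), id_, meta)
                  for id_, (vec, meta) in self._items.items()]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:top_k]


# Embed and index (made-up documents).
docs = [
    "Main landing gear sliding tube inspection",
    "Fuselage skin repair inspection",
    "Wing upper and lower cover inspection",
]
index = ToyIndex()
index.upsert([(str(i), toy_embed(d), {"text": d}) for i, d in enumerate(docs)])

# Search: embed the query the same way, then query the index.
hits = index.query(toy_embed("fuselage skin damage"), top_k=2)
for score, id_, meta in hits:
    print(f"{score:.2f}: {meta['text']}")
```

A real embedding model replaces `toy_embed`, which is what lets the search return related documents even when no words overlap.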

![Architecture overview](https://files.readme.io/6a3ea5a-pinecone-openai-overview.png)

Let's get started...

## Setup

We first need to set up our environment and retrieve API keys for OpenAI and Pinecone. Let's start with the environment: we need HuggingFace *Datasets* for our data, and the OpenAI and Pinecone clients:

In [2]:
!pip install -qU pinecone-client openai datasets


### Creating Embeddings

Then we initialize our connection to OpenAI Embeddings *and* Pinecone vector DB. Sign up for an API key over at [OpenAI](https://beta.openai.com/signup) and [Pinecone](https://app.pinecone.io).
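
API keys are secrets, so one common approach is to keep them out of the notebook and read them from environment variables. The helper below is a minimal sketch of that idea; `require_env` is a hypothetical name, and the variable names in the comments are assumptions rather than something this notebook defines.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# Assumed variable names; adjust them to your own setup.
# openai_key = require_env("OPENAI_API_KEY")
# pinecone_key = require_env("PINECONE_API_KEY")
```

Failing early with a named variable makes a missing key easier to diagnose than an authentication error from the API later on.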

In [3]:
import os
import openai

# read the key from the environment instead of hard-coding it in the notebook
# (get an API key from the top-right dropdown on the OpenAI website)
openai.api_key = os.environ["OPENAI_API_KEY"]

openai.Engine.list()  # check we have authenticated

<OpenAIObject list at 0x7f3c284215e0> JSON: {
  "data": [
    {
      "created": null,
      "id": "babbage",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "davinci",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-davinci-edit-001",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "babbage-code-search-code",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-similarity-babbage-001",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "code-davinci-edit-001",
      "object": "engine",
      

We can now create embeddings with OpenAI's Ada embedding model, `text-embedding-ada-002`, like so:

In [None]:
#!pip install pandas 


In [None]:
# import pandas as pd

# df = pd.read_csv('/content/A320_A340_A350_corpus.csv')

In [None]:
#df['filename'] = df['filename'].str.replace(r'\W ', " ", regex=True)

In [None]:
#df['filename'] = df['filename'].str.replace(r'\.pdf', ' ', regex=True)

In [None]:
#df['filename']

0       A320 Landing Gear  Main Landing Gear Sliding T...
1       A320 Landing Gear  Main Landing Gear Sliding T...
2       A320 Landing Gear  Main Landing Gear Sliding T...
3       A320 Landing Gear  Main Landing Gear Sliding T...
4       A320 Landing Gear  Main Landing Gear Sliding T...
                              ...                        
3002    A350-Wings  Wing Upper and Lower Covers  Inspe...
3003    A350-Wings  Wing Upper and Lower Covers  Inspe...
3004    A350-Wings  Wing Upper and Lower Covers  Inspe...
3005    A350-Wings  Wing Upper and Lower Covers  Inspe...
3006    A350-Wings  Wing Upper and Lower Covers  Inspe...
Name: filename, Length: 3007, dtype: object

In [None]:
#df['file_content']= df['filename']+ '.' + df['file_content']

In [None]:
#df['file_content']

0       A320 Landing Gear  Main Landing Gear Sliding T...
1       A320 Landing Gear  Main Landing Gear Sliding T...
2       A320 Landing Gear  Main Landing Gear Sliding T...
3       A320 Landing Gear  Main Landing Gear Sliding T...
4       A320 Landing Gear  Main Landing Gear Sliding T...
                              ...                        
3002    A350-Wings  Wing Upper and Lower Covers  Inspe...
3003    A350-Wings  Wing Upper and Lower Covers  Inspe...
3004    A350-Wings  Wing Upper and Lower Covers  Inspe...
3005    A350-Wings  Wing Upper and Lower Covers  Inspe...
3006    A350-Wings  Wing Upper and Lower Covers  Inspe...
Name: file_content, Length: 3007, dtype: object

In [None]:
#from datasets import Dataset

In [None]:
#trec = Dataset.from_pandas(df)


In [None]:
#trec[:10]

{'unique_id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'module': [None, None, None, None, None, None, None, None, None, None],
 'filename': ['A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EASA_AD_2021-0175_1 ',
  'A320 Landing Gear  Main Landing Gear Sliding Tubes  Inspection-EA

In [4]:
MODEL = "text-embedding-ada-002"

Next, we initialize our index to store vector embeddings with Pinecone.

In [5]:
import os
import pinecone

index_name = 'semantic-search-openai'

# initialize connection to pinecone (get API key at app.pinecone.io);
# the key is read from the environment rather than hard-coded
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="northamerica-northeast1-gcp"  # find next to api key in console
)
# only create the index if it does not already exist;
# 1536 is the embedding size of text-embedding-ada-002
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
# connect to index
index = pinecone.Index(index_name)


## Populating the Index




Then we create a vector embedding for each phrase using OpenAI, and `upsert` the ID, vector embedding, and original text for each phrase to Pinecone.

In [None]:
# from tqdm.auto import tqdm

# batch_size = 32  # process everything in batches of 32
# for i in tqdm(range(0, len(trec['file_content']), batch_size)):
#     # set end position of batch
#     i_end = min(i+batch_size, len(trec['file_content']))
#     # get batch of lines and IDs
#     lines_batch = trec['file_content'][i:i_end]
#     ids_batch = [str(n) for n in range(i, i_end)]
#     # create embeddings
#     res = openai.Embedding.create(input=lines_batch, engine=MODEL)
#     embeds = [record['embedding'] for record in res['data']]
#     # prep metadata and upsert batch
#     meta = [{'file_content': line} for line in lines_batch]
#     to_upsert = zip(ids_batch, embeds, meta)
#     # upsert to Pinecone
#     index.upsert(vectors=list(to_upsert))

  0%|          | 0/94 [00:00<?, ?it/s]
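
The batching logic in the cell above can be checked in isolation. The sketch below mirrors its structure but swaps the OpenAI call for `fake_embed`, a placeholder that returns dummy vectors, and uses made-up passages; `make_batches` is a hypothetical helper name. The list length of 3007 matches the corpus shown earlier.

```python
def make_batches(texts, embed, batch_size=32):
    """Yield one list of (id, embedding, metadata) tuples per batch."""
    for start in range(0, len(texts), batch_size):
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size, len(texts))
        batch = texts[start:end]
        ids = [str(n) for n in range(start, end)]
        embeds = embed(batch)
        meta = [{"file_content": line} for line in batch]
        yield list(zip(ids, embeds, meta))


def fake_embed(batch):
    """Placeholder for the embedding call: one dummy 3-dimensional vector per text."""
    return [[float(len(t)), 0.0, 1.0] for t in batch]


texts = [f"passage {n}" for n in range(3007)]  # made-up passages, corpus-sized
batches = list(make_batches(texts, fake_embed))
print(len(batches), len(batches[-1]))
```

This prints `94 31`: 93 full batches of 32 plus a final batch of 31, which agrees with the 94 steps in the progress bar above.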

---

## Querying

With our data indexed, we're ready to run searches. This follows a similar process to indexing. We start with a text `query` that we would like to use to find similar passages. As before, we encode it with the same OpenAI embedding model (`text-embedding-ada-002`, stored in `MODEL`) to create a *query vector* `xq`. We then use `xq` to query the Pinecone index.

In [6]:
query = "A320 damage?"

xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

Now query...

In [7]:
res = index.query([xq], top_k=5, include_metadata=True)
res

{'matches': [{'id': '553',
              'metadata': {'file_content': 'A320-Fuselage  Fuselage Skin '
                                           'Repairs  '
                                           'Inspection-EASA_AD_2015-0036R4_1 '
                                           '.: 2015 -0036R 4 TE.CAP.0011 0-010 '
                                           'European Union Aviation Safety '
                                           'Agency. All rights reserved. '
                                           'ISO9001 Certified.'},
              'score': 0.85820812,
              'values': []},
             {'id': '561',
              'metadata': {'file_content': 'A320-Fuselage  Fuselage Skin '
                                           'Repairs  '
                                           'Inspection-EASA_AD_2015-0036R4_1 '
                                           '.All rights reserved. ISO9001 '
                                           'Certified. Proprietary document.'},
         

The response from Pinecone includes our original text in the `metadata` field. Let's print the `top_k` most similar passages and their similarity scores.

In [8]:
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['file_content']}")

0.86: A320-Fuselage  Fuselage Skin Repairs  Inspection-EASA_AD_2015-0036R4_1 .: 2015 -0036R 4 TE.CAP.0011 0-010 European Union Aviation Safety Agency. All rights reserved. ISO9001 Certified.
0.86: A320-Fuselage  Fuselage Skin Repairs  Inspection-EASA_AD_2015-0036R4_1 .All rights reserved. ISO9001 Certified. Proprietary document.
0.86: A320-Fuselage  Fuselage Skin Repairs  Inspection-EASA_AD_2015-0036R4_1 .ISO9001 Certified. Proprietary document. Copies are not controlled.
0.86: A320-Fuselage  Fuselage Skin Repairs  Inspection-EASA_AD_2015-0036R4_1 .ISO9001 Certified. Proprietary document. Copies are not controlled.
0.86: A320-Fuselage  Fuselage Skin Repairs  Inspection-EASA_AD_2015-0036R4_1 .EASA AD No. : 2015 -0036R 4 TE.CAP.0011 0-010 European Union Aviation Safety Agency. All rights reserved.
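
Two of the five results above carry identical text, so it can help to drop duplicates and low-scoring matches before showing them. The following is a minimal sketch that works on a plain dict shaped like the response printed above; `unique_matches` and the `min_score` threshold are hypothetical, and the sample data is abbreviated for illustration.

```python
def unique_matches(response, min_score=0.0):
    """Keep the highest-scoring match for each distinct text at or above min_score."""
    seen = set()
    kept = []
    # Sort by score, best first, so the first copy of a text is the best one.
    for match in sorted(response["matches"], key=lambda m: m["score"], reverse=True):
        text = match["metadata"]["file_content"]
        if match["score"] >= min_score and text not in seen:
            seen.add(text)
            kept.append(match)
    return kept


# Abbreviated, made-up sample in the same shape as the query response.
sample = {"matches": [
    {"id": "a", "score": 0.86,
     "metadata": {"file_content": "ISO9001 Certified. Proprietary document."}},
    {"id": "b", "score": 0.86,
     "metadata": {"file_content": "ISO9001 Certified. Proprietary document."}},
    {"id": "c", "score": 0.70,
     "metadata": {"file_content": "Fuselage skin repairs"}},
]}
for match in unique_matches(sample, min_score=0.75):
    print(match["id"], match["score"])
```

A suitable threshold depends on the corpus and the embedding model, so the value used here is only a placeholder.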


Looks good. Let's make it harder with a query that uses the imprecise term *"recession"* where the indexed text says *"depression"*.

In [None]:
query = "What was the cause of the major recession in the early 20th century?"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.88: Why did the world enter a global depression in 1929 ?
0.83: When was `` the Great Depression '' ?
0.81: What crop failure caused the Irish Famine ?
0.80: When did World War I start ?
0.80: What were popular songs and types of songs in the 1920s ?


And again...

In [None]:
query = "Why was there a long-term economic downturn in the early 20th century?"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.90: Why did the world enter a global depression in 1929 ?
0.84: When was `` the Great Depression '' ?
0.80: When did World War I start ?
0.80: What crop failure caused the Irish Famine ?
0.80: When did the Dow first reach ?


Looks great. Our semantic search pipeline captures the meaning of each query and returns the most semantically similar of the indexed questions.

Once we're finished with the index we delete it to save resources.

In [None]:
#pinecone.delete_index(index_name)

---