# Semantic search using OpenSearch and OpenAI

This notebook walks you through an example of using [Aiven for OpenSearch](https://go.aiven.io/openai-opensearch-os) as a vector database for OpenAI embeddings and of running semantic search against them.

## Prerequisites

Before you begin, make sure you have completed the prerequisites described in the [README](./README.md):
- You have an [Aiven account](./README.md#setup-your-aiven-account)
- You have created your [OpenSearch service](./README.md#create-an-opensearch-service)
- You have an OpenAI account
- You have created AND SAVED an OpenAI API key
- You have set up your Python environment for this notebook

## Adding our Environment Variables
To avoid leaking API keys, we will store them in a `.env` file that is excluded from version control.

**Make a copy of `.env_sample`**

In [None]:
! cp .env_sample .env

## Add our OpenAI API key

Open `.env` and replace `<YOUR_OPENAI_API_KEY>` with the key that you saved from OpenAI.

## Add our OpenSearch Service URI

Verify the Aiven for OpenSearch service is in the `RUNNING` state.

![OpenSearch service in the running state](./assets/opensearch-running-state.png)

Select the running service and copy the **Service URI**.

![Copy the OpenSearch Service URI](assets/copy-opensearch-service-uri.png)

Add the OpenSearch Service URI to the `.env` file you created above, replacing `<OPENSEARCH_SERVICE_URI>`.

## Load our environment variables

In [None]:
import os # to access our variables
from dotenv import load_dotenv


load_dotenv()
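
As a quick sanity check, the sketch below reports which of the expected variables are unset or still hold the placeholder from `.env_sample`. The helper name `missing_variables` is hypothetical, and the variable names are assumed to be the two used later in this notebook (`OPENAI_API_KEY` and `OPENSEARCH_SERVICE_URI`).

```python
import os

REQUIRED_VARIABLES = ("OPENAI_API_KEY", "OPENSEARCH_SERVICE_URI")


def missing_variables(environ, names=REQUIRED_VARIABLES):
    """Return the names that are unset, empty, or still an unedited <PLACEHOLDER>."""
    missing = []
    for name in names:
        value = (environ.get(name) or "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


# Check the variables loaded by load_dotenv() above.
print("Missing variables:", missing_variables(os.environ))
```

An empty list means both values were found and neither looks like a placeholder.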

## Connect to our OpenSearch Service

In [None]:
from opensearchpy import OpenSearch

connection_string = os.getenv("OPENSEARCH_SERVICE_URI")

# Create the client with SSL/TLS enabled and a 100-second timeout.
client = OpenSearch(connection_string, use_ssl=True, timeout=100)

## Download the dataset
To save us from having to recalculate embeddings on a huge dataset, we are using a pre-calculated OpenAI embeddings dataset covering Wikipedia articles. We can download the file and unzip it with:

In [None]:
import wget
import zipfile

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")

Let's load the file in a dataframe and check the content with:

In [None]:
import pandas as pd

wikipedia_dataframe = pd.read_csv("data/vector_database_wikipedia_articles_embedded.csv")

wikipedia_dataframe.head()

The file contains:
* `id` a unique Wikipedia article identifier
* `url` the Wikipedia article URL
* `title` the title of the Wikipedia page
* `text` the text of the article
* `title_vector` and `content_vector` the embeddings calculated on the title and content of the Wikipedia article, respectively
* `vector_id` the id of the vector
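
Because the file is a CSV, each embedding is stored as a string rather than as a list of numbers, which is why the vectors are decoded with `json.loads` before they are indexed. Below is a minimal sketch of that decoding step, using a made-up three-element string in place of a real embedding. The helper name `parse_vector` is hypothetical.

```python
import json


def parse_vector(raw, expected_dimension=None):
    """Decode a stringified embedding into a list of floats, optionally checking its length."""
    vector = [float(x) for x in json.loads(raw)]
    if expected_dimension is not None and len(vector) != expected_dimension:
        raise ValueError(f"expected {expected_dimension} values, got {len(vector)}")
    return vector


# Made-up example value; real rows should match the 1536 dimensions used in the mapping.
sample = "[0.25, -0.5, 1.0]"
print(parse_vector(sample, expected_dimension=3))
```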

We can create an OpenSearch mapping optimized for this information with:

In [None]:
index_settings = {
    "index": {
        "knn": True,
        "knn.algo_param.ef_search": 100
    }
}

index_mapping = {
    "properties": {
        "title_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "content_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "text": {"type": "text"},
        "title": {"type": "text"},
        "url": {"type": "keyword"},
        "vector_id": {"type": "long"}
    }
}

## Create an index in Aiven for OpenSearch

This is where we will store our data.

In [None]:
index_name = "openai_wikipedia_index"
client.indices.create(index=index_name, body={"settings": index_settings, "mappings":index_mapping})

## Index data into OpenSearch

Now it's time to parse the pandas dataframe and index the data into OpenSearch using the Bulk API. The following function turns a set of dataframe rows into bulk index actions:

In [None]:
import json  # needed to decode the stringified vectors

def dataframe_to_bulk_actions(df):
    for _, row in df.iterrows():
        yield {
            "_index": index_name,
            "_id": row['id'],
            "_source": {
                'url' : row["url"],
                'title' : row["title"],
                'text' : row["text"],
                'title_vector' : json.loads(row["title_vector"]),
                'content_vector' : json.loads(row["content_vector"]),
                'vector_id' : row["vector_id"]
            }
        }

We don't want to index the whole dataset at once, since it's far too large, so we'll load it in batches of `200` rows.

In [None]:
from opensearchpy import helpers
import json

start = 0
end = len(wikipedia_dataframe)
batch_size = 200
for batch_start in range(start, end, batch_size):
    batch_end = min(batch_start + batch_size, end)
    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]
    actions = dataframe_to_bulk_actions(batch_dataframe)
    helpers.bulk(client, actions)
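
The batching loop above depends only on the number of rows and the batch size. The sketch below isolates that slicing logic in a hypothetical helper, `batch_ranges`, so you can check the boundaries (including the shorter final batch) without touching OpenSearch. The row count of 450 is an illustrative number, not the size of the dataset.

```python
def batch_ranges(total_rows, batch_size):
    """Yield (start, end) row positions covering total_rows in steps of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for batch_start in range(0, total_rows, batch_size):
        yield batch_start, min(batch_start + batch_size, total_rows)


# Small illustrative numbers: 450 rows in batches of 200.
for start_row, end_row in batch_ranges(450, 200):
    print(f"rows {start_row} to {end_row - 1}")
```

Each `(start, end)` pair can be passed directly to `iloc[start:end]`, because the end position is exclusive.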

## Verify that our index is populated in the Aiven Console

In the Aiven Console, select **Indexes** in the sidebar and verify that the index contains documents. There should be more than 20,000 documents.

![OpenSearch Indexes in the Aiven Console](assets/opensearch-indexes.png)
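
You can also check the count from the notebook itself. The sketch below assumes the `client` and `index_name` defined earlier; the helper name `count_documents` is hypothetical. It refreshes the index first so that recently indexed documents are included in the count.

```python
def count_documents(client, index_name):
    """Refresh the index so recent writes are visible, then return its document count."""
    client.indices.refresh(index=index_name)
    return client.count(index=index_name)["count"]


# Usage with the client created earlier in this notebook:
# print(count_documents(client, index_name))
```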

Once all the documents are indexed, let's try a query to retrieve the documents containing `Pizza`:

In [None]:
res = client.search(index=index_name, body={
    "_source": {
        "excludes": ["title_vector", "content_vector"]
    },
    "query": {
        "match": {
            "text": {
                "query": "Pizza"
            }
        }
    }
})

print(res["hits"]["hits"][0]["_source"]["text"])

## Encode questions with OpenAI

To perform a semantic search, we need to encode the question with the same embedding model that was used to encode the documents at index time. In this example, that is the `text-embedding-ada-002` model.

In [None]:
from openai import OpenAI

# Define model
EMBEDDING_MODEL = "text-embedding-ada-002"

# Define the Client
openaiclient = OpenAI(
    # This is the default and can be omitted
    api_key=os.getenv("OPENAI_API_KEY"),
)
# Define question
question = 'is Pineapple a good ingredient for Pizza?'

# Create embedding
question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)

## Run semantic search queries with OpenSearch

With the above embedding calculated, we can now run semantic searches against the OpenSearch index. We use `knn` as the query type and search the `content_vector` field.

After running the block below, we should see content semantically similar to the question. Expect documents based on Pineapples, Pizza, Hawaii, Italy, etc.

In [None]:
opensearch_response = client.search(
    index=index_name,
    body={
        "size": 15,
        "query": {
            "knn": {
                "content_vector": {
                    "vector": question_embedding.data[0].embedding,
                    "k": 3
                }
            }
        }
    }
)

for result in opensearch_response["hits"]["hits"]:
  print("Id:" + str(result['_id']))
  print("Score: " + str(result["_score"]))
  print("Title: " + str(result["_source"]["title"]))
  print("Text: " + result["_source"]["text"][0:100])
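
To help interpret the `Score` values: as far as the OpenSearch k-NN documentation describes it, the `l2` space type used in the mapping converts the squared Euclidean distance `d` between the query and a document vector into a score of `1 / (1 + d)`, so closer vectors score higher. The sketch below reproduces that formula on toy vectors. The function name `l2_score` is hypothetical, and the formula should be treated as an assumption to verify against the documentation for your OpenSearch version.

```python
def l2_score(query_vector, document_vector):
    """Score two vectors as 1 / (1 + squared Euclidean distance)."""
    if len(query_vector) != len(document_vector):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((q - d) ** 2 for q, d in zip(query_vector, document_vector))
    return 1.0 / (1.0 + squared_distance)


# Toy 2-dimensional vectors; the real embeddings have 1536 dimensions.
print(l2_score([1.0, 0.0], [1.0, 0.0]))  # identical vectors: squared distance 0
print(l2_score([1.0, 0.0], [0.0, 1.0]))  # squared distance 2
```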


## Use OpenAI Chat Completions API to generate a reply

Now let's use the OpenAI Chat Completions API to generate a reply based on the information retrieved.

In [None]:
# Retrieve the text of the first result in the above search response
top_hit_summary = opensearch_response['hits']['hits'][0]['_source']['text']

# Craft a reply
openai_response = openaiclient.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        # Keep the spaces around the question so the pieces do not run together
        {"role": "user", "content": "Answer the following question: "
            + question
            + " by using the following text: "
            + top_hit_summary
        }
    ]
)

choices = openai_response.choices
print(f"Our top hit is \n {top_hit_summary}")
for choice in choices:
    print("------------------------------------------------------------")
    print(choice.message.content)
    print("------------------------------------------------------------")
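
The cell above passes only the first hit to the model. One possible variation, sketched below, builds the prompt from several hits and truncates each one to keep the prompt short. The helper name `build_prompt` and the `max_hits` and `max_chars` defaults are illustrative assumptions, not part of the OpenAI or OpenSearch APIs. The function expects hits shaped like the items in `opensearch_response["hits"]["hits"]`, and the example hits are toy data.

```python
def build_prompt(question, hits, max_hits=3, max_chars=1000):
    """Combine a question with truncated text from the top search hits."""
    passages = []
    for hit in hits[:max_hits]:
        source = hit["_source"]
        passages.append(f"Title: {source['title']}\n{source['text'][:max_chars]}")
    context = "\n\n".join(passages)
    return (
        f"Answer the following question: {question}\n\n"
        f"Use only the following text:\n\n{context}"
    )


# Toy hits with the same shape as the OpenSearch response items.
example_hits = [
    {"_source": {"title": "Pizza", "text": "Pizza is a dish of Italian origin."}},
    {"_source": {"title": "Pineapple", "text": "The pineapple is a tropical plant."}},
]
print(build_prompt("Is pineapple a good ingredient for pizza?", example_hits))
```

The returned string can be used as the `content` of the user message in place of the concatenation shown earlier.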


## Conclusion

OpenSearch is a powerful tool providing both text and vector search capabilities. Used alongside the OpenAI APIs, it allows you to build personalized AI applications that augment their context with the results of semantic search.

You can try Aiven for OpenSearch, or any of the other Open Source tools, in the Aiven platform free trial by [signing up](https://go.aiven.io/openai-opensearch-signup).