# Using AwaDB as a vector database for question answering tasks

This notebook demonstrates how to utilize AwaDB as a vector database for storing embeddings obtained from OpenAI. It then illustrates how to employ GPT and embedding-based search for question answering tasks.

The end-to-end workflow has five steps:

1. Text Preprocessing
2. Embedding
3. Vector Store
4. Similarity Search
5. Question Answering

First, the original document is split into text chunks suitable for embedding. The chunks are then processed with an embedding model, such as OpenAI Embedding, and stored in the database. At query time, the question is matched against the stored chunks by similarity search, and the retrieved chunks are combined with the question so that the model can give a better answer.

## Install libraries
This sample requires the `openai` and `awadb` packages.

You can install them with `pip install awadb openai`. The later cells also import `wget` and `langchain`, which can be installed the same way.

In [1]:
# Import necessary libraries

try:
    import openai
    import awadb
except ImportError as exc:
    raise ImportError(
        "Could not import the required libraries. "
        "Please install them with `pip install awadb openai`"
    ) from exc

You also need to set your OpenAI API key as an environment variable beforehand. For more information, see [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).

In [2]:
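import os
import wget

# Use .get() so a missing key fails the assertion instead of raising KeyError
assert os.environ.get("OPENAI_API_KEY"), "Set the OPENAI_API_KEY environment variable first"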

## Load Dataset

Next, we download the text file used in this example and create a loader for it.

In [3]:
# URL of the raw text file (plain text, not precomputed embeddings)
data_url = "https://raw.githubusercontent.com/awa-ai/awadb/main/tests/state_of_the_union.txt"
file_path = "state_of_the_union.txt"

# Download the file only if it is not already present locally
if not os.path.exists(file_path):
    wget.download(data_url, file_path)
    print("\nFile downloaded successfully.")
else:
    print("File already exists in the local file system.")

# Create a loader for the data file
from langchain.document_loaders import TextLoader
loader = TextLoader(file_path)

File already exists in the local file system.


### Split the text
Next, we preprocess the text. We split the text data into chunks with a maximum size of 200 characters and an overlap of 20 characters between neighboring chunks.

The choice of these two hyperparameters depends on the average sentence length of your document. As a rule of thumb, each chunk should carry one complete unit of meaning and no more than one.
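
To make the roles of the two hyperparameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification for illustration, not the algorithm LangChain uses, and the function name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

A larger overlap repeats more context at chunk boundaries, at the cost of storing more chunks.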

In [4]:
# Load the file into LangChain documents
data = loader.load()
print(f'documents:{len(data)}')

# Initialize the text splitter
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Split the document
split_docs = text_splitter.split_documents(data)
print("split_docs size:",len(split_docs))

Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 232, which is longer than the specified 200
Created a chunk of size 242, which is longer than the specified 200
Created a chunk of size 219, which is longer than the specified 200
Created a chunk of size 304, which is longer than the specified 200
Created a chunk of size 205, which is longer than the specified 200
Created a chunk of size 332, which is longer than the specified 200
Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 203, which is longer than the specified 200
Created a chunk of size 281, which is longer than the specified 200
Created a chunk of size 201, which is longer than the specified 200
Created a chunk of size 250, which is longer than the specified 200
Created a chunk of size 325, which is longer than the specified 200
Created a chunk of size 242, which is longer than the specified 200


documents:1
split_docs size: 255
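
The warnings above most likely appear because a separator-based splitter packs whole pieces (for example paragraphs) into chunks rather than cutting at an exact character count, so a single piece longer than the limit is kept intact. The sketch below illustrates that behavior under simplifying assumptions: it ignores overlap, and `merge_pieces` is a hypothetical helper, not LangChain's implementation.

```python
from typing import List


def merge_pieces(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size.

    A piece that is longer than chunk_size on its own is emitted whole.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = piece  # start a new chunk, even if the piece is too long
    if current:
        chunks.append(current)
    return chunks


# Four made-up paragraphs; the third one is longer than the limit of 20
paragraphs = ["a" * 8, "b" * 8, "c" * 30, "d" * 5]
demo = merge_pieces("\n\n".join(paragraphs), chunk_size=20)
oversized = [len(chunk) for chunk in demo if len(chunk) > 20]
print([len(chunk) for chunk in demo])  # [18, 30, 5]
print(oversized)                       # [30]
```

If oversized chunks are a problem, a smaller separator (such as a single newline) or a larger `chunk_size` reduces how often they occur.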


In [5]:
from typing import Set

# Extract the raw text of each chunk to store in AwaDB
texts = [text.page_content for text in split_docs]

awadb_client = awadb.Client()
awadb_client.Create("testdb1")

# Add the split texts to the database
awadb_client.AddTexts("embedding_text", "testdb1", texts=texts)

# Field names to omit from the search results
not_include_fields: Set[str] = {"text_embedding"}

### Set the question

Use `awadb_client.Search` to retrieve the chunks most similar to the question.

In [6]:
# Set the question
query = "What measures does the speaker ask Congress to pass to reduce gun violence?"
# Similarity search results
similar_docs = awadb_client.Search(query=query, topn=3, not_include_fields=not_include_fields)

#print(similar_docs)
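
Conceptually, a similarity search ranks stored texts by how close their vectors are to the query vector. The following sketch shows the idea with cosine similarity over a toy bag-of-words "embedding". It does not reflect how AwaDB works internally, and `toy_embed`, `cosine`, and `top_n` are hypothetical names introduced for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in for a real embedding model: a word-count vector."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query: str, texts: List[str], n: int = 3) -> List[Tuple[float, str]]:
    """Return the n texts with the highest cosine similarity to the query."""
    query_vector = toy_embed(query)
    scored = [(cosine(query_vector, toy_embed(text)), text) for text in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


corpus = [
    "pass universal background checks",
    "lower the cost of prescription drugs",
    "ban assault weapons and high-capacity magazines",
]
results = top_n("background checks for weapons", corpus, n=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step follows the same pattern.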

## Create Prompt
Next, we build a prompt from the question and the similarity search results.

In [7]:
# Create prompt
system_prompt = "You are a person who answers questions for people based on specified information\n"

# Concatenate the text of the top 3 search results, one per line
similar_prompt = ""
for i in range(3):
    similar_prompt += similar_docs[0]['ResultItems'][i]['embedding_text'] + "\n"

#similar_prompt = similar_docs[0].page_content + "\n" + similar_docs[1].page_content + "\n" + similar_docs[2].page_content + "\n"
question_prompt = f"Here is the question: {query}\nPlease provide an answer only related to the question and do not include any information more than that.\n"
prompt = system_prompt + "Here is some information given to you:\n" + similar_prompt + question_prompt

print(prompt)

You are a person who answers questions for people based on specified information
Here is some information given to you:
Ban assault weapons and high-capacity magazines. 

Repeal the liability shield that makes gun manufacturers the only industry in America that can’t be sued.
I ask Democrats and Republicans alike: Pass my budget and keep our neighborhoods safe.
And I ask Congress to pass proven measures to reduce gun violence. Pass universal background checks. Why should anyone on a terrorist list be able to purchase a weapon?
Here is the question: What measures does the speaker ask Congress to pass to reduce gun violence?
Please provide an answer only related to the question and do not include any information more than that.
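
The prompt assembly above can be wrapped in a small function so that it also works when the search returns fewer than three results. This is a sketch under the assumption that the retrieved texts have already been extracted into a plain list of strings; `build_prompt` and `max_contexts` are hypothetical names.

```python
from typing import List


def build_prompt(
    system_prompt: str,
    contexts: List[str],
    question: str,
    max_contexts: int = 3,
) -> str:
    """Assemble the prompt from instructions, retrieved texts, and the question."""
    # Slicing never raises, even when fewer than max_contexts texts are available
    selected = contexts[:max_contexts]
    context_block = "".join(text + "\n" for text in selected)
    question_block = (
        f"Here is the question: {question}\n"
        "Please provide an answer only related to the question "
        "and do not include any information more than that.\n"
    )
    return (
        system_prompt
        + "Here is some information given to you:\n"
        + context_block
        + question_block
    )


demo_prompt = build_prompt(
    "You are a person who answers questions for people based on specified information\n",
    ["Pass universal background checks."],  # only one retrieved text in this demo
    "What measures does the speaker ask Congress to pass?",
)
print(demo_prompt)
```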



In [9]:
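# Create a response from gpt-3.5-turbo
# (this uses the legacy `openai.ChatCompletion` interface from openai<1.0)
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.7,
    messages=[
        # The instructions are already part of `prompt`, so the system message is empty
        {"role": "system", "content": ""},
        {"role": "user", "content": prompt},
    ],
    max_tokens=40,  # caps the length of the answer
)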

print(response['choices'][0]['message']['content'].replace(' .', '.').strip())

The speaker asks Congress to pass measures to reduce gun violence, including universal background checks and preventing individuals on a terrorist list from purchasing weapons.
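
Because `max_tokens` is set to 40, longer answers may be cut off. One way to detect this is to inspect the `finish_reason` field of the returned choice, which is `"length"` when the token limit was reached. The sketch below assumes the response can be read like a nested dictionary, as in the cell above; `extract_answer` is a hypothetical helper and the two sample responses are made up for illustration.

```python
from typing import Any, Dict


def extract_answer(response: Dict[str, Any]) -> str:
    """Return the answer text and flag it if the token limit cut it short."""
    choice = response["choices"][0]
    answer = choice["message"]["content"].strip()
    if choice.get("finish_reason") == "length":
        answer += " [truncated: max_tokens reached]"
    return answer


# Made-up responses that mimic the structure used above
fake_truncated = {
    "choices": [
        {
            "message": {"role": "assistant", "content": " The speaker asks Congress to pass"},
            "finish_reason": "length",
        }
    ]
}
fake_complete = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Pass universal background checks."},
            "finish_reason": "stop",
        }
    ]
}

print(extract_answer(fake_truncated))
print(extract_answer(fake_complete))
```

If answers are often truncated, raising `max_tokens` is the simplest remedy.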
