# Azure OpenAI Embeddings

## Use Case
There are two main use cases this playbook will cover: document search and document “zone”. 
This playbook walks through an example of querying a knowledge base to find the most relevant document (document search), and then shows how, once a document is isolated, to find the most relevant section within it (document zone).

### Document Search
Enterprise document search brings a seamless search experience to finding and retrieving relevant documentation, data, and knowledge stored in various formats across databases within an organization. It lets your team quickly find resources across the organization through a natural-language query, with results presented in a single view.
Document search is especially important for enterprises that deal with significant amounts of documentation, such as law firms, large businesses, and public sector entities. It saves time, makes existing knowledge easier to use, and provides a seamless user experience.

### Document “Zone”
For many enterprises, documents can span tens of information-dense pages. Once a document search has been performed and a document has been isolated based on a query, it is essential to zone in on the right page or section of the document to gather the relevant information or pass it through to a summarization tool.
Isolating specific pages or sections within a long document to answer a user query yields a more succinct response, reduces the compute spent summarizing or extracting the entire document, and saves time and resources for the person querying the document.


## Imports

In [1]:
from dotenv import load_dotenv
load_dotenv('../../src/.env') 

True

In [3]:
import openai
import re
import requests
import sys
from num2words import num2words
import os
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity
from transformers import GPT2TokenizerFast

import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output

API_KEY = os.getenv("AZURE_OPENAI_KEY")  # SET YOUR OWN API KEY HERE
RESOURCE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")  # SET A LINK TO YOUR RESOURCE ENDPOINT

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-06-01-preview"

TEXT_SEARCH_EMBEDDING_ENGINE = 'text-search-curie-doc-001'
COMPLETIONS_MODEL = "text-davinci-002"



## Document Search

### Use Case Overview
This section will go over how to use Azure OpenAI embeddings for the document search use case. Given a knowledge base of documents and a user query, the goal is to isolate the search result that can answer the question posed in the query.


### Dataset
The first dataset we will look at is the BillSum dataset. BillSum is the first dataset for summarization of US Congressional and California state bills. For illustration purposes, we will look at the US bills solely. The corpus consists of bills from the 103rd-115th (1993-2018) sessions of Congress. The data was split into 18,949 train bills and 3,269 test bills.  The BillSum corpus focuses on mid-length legislation from 5,000 to 20,000 characters in length.
More information on the dataset and download instructions can be found at https://github.com/FiscalNote/BillSum.

### Schema
-	bill_id: an identifier for the bill
-	text: US bill text
-	summary: human written bill summary
-	title: bill title
-	text_len: character length of the bill
-	sum_len: character length of the bill summary 


In [38]:
# datasets can be found under the data/ directory
df = pd.read_csv(os.path.join(os.getcwd(), 'data', 'bill_sum_data.csv'))
df_bills = df[['text', 'summary', 'title']]
df_bills.count()

text       20
summary    20
title      20
dtype: int64

In [39]:
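# Perform light data cleaning (removing redundant whitespace and cleaning up punctuation)
# s is the input text; sep_token is unused and kept only for compatibility
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.replace("\n", "")
    s = s.strip()
    
    return s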

df_bills['text'] = df_bills["text"].apply(lambda x : normalize_text(x))

In [40]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# remove bills that are too long for the token limitation
df_bills['n_tokens'] = df_bills["text"].apply(lambda x: len(tokenizer.encode(x)))
df_bills = df_bills[df_bills.n_tokens<2000]
len(df_bills)

Token indices sequence length is longer than the specified maximum sequence length for this model (1480 > 1024). Running this sequence through the model will result in indexing errors


12

Before the search, we will embed the text documents and save the corresponding embeddings. We embed each document using a ‘doc’ model (i.e. text-search-curie-doc-001).
These embeddings can be stored locally or in an Azure DB.
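
As a minimal sketch of the "stored locally" option, the snippet below writes embeddings to a JSON Lines file and reads them back. The file layout and the helper names `save_embeddings` and `load_embeddings` are assumptions made for illustration and are not part of this notebook; a production system would more likely use a database or a dedicated vector store.

```python
import json

def save_embeddings(path, records):
    """Write one JSON object per line: {"id": ..., "embedding": [...]}.

    `records` is an iterable of (id, embedding) pairs (hypothetical layout).
    """
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, embedding in records:
            f.write(json.dumps({"id": doc_id, "embedding": list(embedding)}) + "\n")

def load_embeddings(path):
    """Read the file back into a dict mapping id -> embedding (list of floats)."""
    store = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            store[row["id"]] = row["embedding"]
    return store
```

Storing one record per line keeps the file appendable, so new documents can be embedded and added without rewriting the embeddings that already exist.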

In [41]:
df_bills.head(4)

Unnamed: 0,text,summary,title,n_tokens
0,SECTION 1. SHORT TITLE. This Act may be cited ...,National Science Education Tax Incentive for B...,To amend the Internal Revenue Code of 1986 to ...,1480
1,SECTION 1. SHORT TITLE. This Act may be cited ...,Small Business Expansion and Hiring Act of 201...,To amend the Internal Revenue Code of 1986 to ...,1152
2,SECTION 1. RELEASE OF DOCUMENTS CAPTURED IN IR...,Requires the Director of National Intelligence...,A bill to require the Director of National Int...,930
4,SECTION 1. SHORT TITLE. This Act may be cited ...,Military Call-up Relief Act - Amends the Inter...,A bill to amend the Internal Revenue Code of 1...,1048


In [42]:
df_bills['curie_search'] = df_bills["text"].apply(lambda x : get_embedding(x, engine = TEXT_SEARCH_EMBEDDING_ENGINE))

In [43]:
df_bills

Unnamed: 0,text,summary,title,n_tokens,curie_search
0,SECTION 1. SHORT TITLE. This Act may be cited ...,National Science Education Tax Incentive for B...,To amend the Internal Revenue Code of 1986 to ...,1480,"[-0.019770914688706398, 0.011169900186359882, ..."
1,SECTION 1. SHORT TITLE. This Act may be cited ...,Small Business Expansion and Hiring Act of 201...,To amend the Internal Revenue Code of 1986 to ...,1152,"[-0.007850012741982937, 0.01001765951514244, 0..."
2,SECTION 1. RELEASE OF DOCUMENTS CAPTURED IN IR...,Requires the Director of National Intelligence...,A bill to require the Director of National Int...,930,"[0.00012103027984267101, 0.011845593340694904,..."
4,SECTION 1. SHORT TITLE. This Act may be cited ...,Military Call-up Relief Act - Amends the Inter...,A bill to amend the Internal Revenue Code of 1...,1048,"[-0.005481021944433451, 0.00856819562613964, -..."
5,SECTION 1. RELIQUIDATION OF CERTAIN ENTRIES PR...,Requires the Customs Service to reliquidate ce...,To provide for reliquidation of entries premat...,1846,"[-0.008310390636324883, -0.004660653416067362,..."
6,SECTION 1. SHORT TITLE. This Act may be cited ...,Service Dogs for Veterans Act of 2009 - Direct...,A bill to require the Secretary of Veterans Af...,872,"[-0.017687108367681503, 0.011164870113134384, ..."
9,SECTION 1. SHORT TITLE. This Act may be cited ...,Taxpayer's Right to View Act of 1993 - Amends ...,Taxpayer's Right to View Act of 1993,946,"[0.0021867561154067516, -0.004219848196953535,..."
12,SECTION 1. FINDINGS. The Congress finds the fo...,Amends the Marine Mammal Protection Act of 197...,To amend the Marine Mammal Protection Act of 1...,1223,"[-0.015813011676073074, 0.009919906966388226, ..."
14,SECTION 1. SHORT TITLE. This Act may be cited ...,Education and Training for Health Act of 2017 ...,Education and Training for Health Act of 2017,1596,"[-0.0150684155523777, 0.005073960404843092, 0...."
16,SECTION 1. SHORT TITLE. This Act may be cited ...,Andrew Prior Act or Andrew's Law - Amends the ...,Andrew's Law,608,"[-0.011593054980039597, 0.022752899676561356, ..."


At the time of search (live compute), we will embed the search query using the corresponding ‘query’ model (text-search-curie-query-001). Next, we find the closest embeddings in the database, ranked by cosine similarity.
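
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells this out in plain Python and ranks a few toy vectors. It is meant to illustrate the computation, not to replace the `cosine_similarity` helper imported above; the names `cosine` and `rank_by_similarity` and the two-dimensional vectors are hypothetical.

```python
import math

def cosine(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, top_n=3):
    # Score every document against the query, highest similarity first.
    scored = [(doc_id, cosine(vec, query_vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy 2-dimensional "embeddings"; real embeddings have thousands of dimensions.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_by_similarity([1.0, 0.1], docs, top_n=2)
```

Because the score depends only on the angle between the vectors, a long document and a short query can still score highly if they point in a similar direction.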

In [44]:
# search the knowledge base for the documents most similar to the user query
def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        engine="text-search-curie-query-001"
    )
    df["similarities"] = df.curie_search.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res


res = search_docs(df_bills, "can i get information on cable company tax revenue", top_n=4)

Unnamed: 0,text,summary,title,n_tokens,curie_search,similarities
9,SECTION 1. SHORT TITLE. This Act may be cited ...,Taxpayer's Right to View Act of 1993 - Amends ...,Taxpayer's Right to View Act of 1993,946,"[0.0021867561154067516, -0.004219848196953535,...",0.36327
0,SECTION 1. SHORT TITLE. This Act may be cited ...,National Science Education Tax Incentive for B...,To amend the Internal Revenue Code of 1986 to ...,1480,"[-0.019770914688706398, 0.011169900186359882, ...",0.314105
1,SECTION 1. SHORT TITLE. This Act may be cited ...,Small Business Expansion and Hiring Act of 201...,To amend the Internal Revenue Code of 1986 to ...,1152,"[-0.007850012741982937, 0.01001765951514244, 0...",0.297908
18,SECTION 1. SHORT TITLE. This Act may be cited ...,This measure has not been amended since it was...,Veterans Entrepreneurship Act of 2015,1404,"[-0.020315825939178467, 0.0011716989101842046,...",0.295586


In [45]:
#Showing top result from document search based on user query against the entire knowledge base

res["summary"][9]

"Taxpayer's Right to View Act of 1993 - Amends the Communications Act of 1934 to prohibit a cable operator from assessing separate charges for any video programming of a sporting, theatrical, or other entertainment event if that event is performed at a facility constructed, renovated, or maintained with tax revenues or by an organization that receives public financial support. Authorizes the Federal Communications Commission and local franchising authorities to make determinations concerning the applicability of such prohibition. Sets forth conditions under which a facility is considered to have been constructed, maintained, or renovated with tax revenues. Considers events performed by nonprofit or public organizations that receive tax subsidies to be subject to this Act if the event is sponsored by, or includes the participation of a team that is part of, a tax exempt organization."

Using this approach, you can use embeddings as a search mechanism across documents in a knowledge base. The user can then take the top search result and use it for the downstream task that prompted their initial query.

## Document "Zone" 

### Use Case Overview
This section will go over the document zone use case. It assumes that document search has already been used to narrow down to one document given the user query. The goal is to show how, given a document and a user query, one can find the relevant zones of the document to answer the question or extract the text for further processing such as summarization.

### Dataset
The dataset used for this section is the CNN/Daily Mail dataset. It is a dataset mainly used for text summarization and question answering tasks. In all, the corpus has 286,817 training pairs, 13,368 validation pairs and 11,487 test pairs, as defined by their scripts. The source documents in the training set have 766 words spanning 29.74 sentences on average, while the summaries consist of 53 words and 3.72 sentences. More information on the dataset can be found at https://paperswithcode.com/dataset/cnn-daily-mail-1.

### Schema
The schema of the dataset is outlined below.
-	id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
-	article: a string containing the body of the news article
-	highlights: a string containing the highlight of the article as written by the article author


In [46]:
dataset = pd.read_csv(os.path.join(os.getcwd(), 'data', 'cnn_dailymail_data.csv'))

In [47]:
dataset.count()

id            11490
article       11490
highlights    11490
dtype: int64

### Document Segmentation

It is very common to have documents that can span tens of pages. 

Due to the token limitation, we cannot pass the entire document into an Azure OpenAI model. Long documents must be chunked into logical segments based on the document structure (paragraphs, pages, etc.), and each chunk is embedded individually. Each chunk embedding can then be compared against the query embedding to determine which chunk to “zone in” on for information retrieval or summarization in the next step.

For this dataset, the text is human-readable prose because the documents are news articles. Therefore, by grouping the text into chunks of 10 sentences, we can manually create paragraph breaks. This is demonstrated with the splitter function below.

We will take the longest article from the dataset and "zone" in on it for this example.

In [48]:
# Sort the dataset by article length in descending order, then reindex so the new order is based on text size

s = dataset.article.str.len().sort_values(ascending=False).index
dataset_desc = dataset.reindex(s)

In [49]:
# Grab the longest article in the dataset; this is the document we "zone" in on to perform the search within the document
dataset_sample = dataset_desc.head(1)

In [50]:
# Splits text on ". " (a period followed by a space) and combines n sentences per chunk.
# Note: the separator is dropped, so the joined sentences lose their trailing periods.
def splitter(n, s):
    pieces = s.split(". ")
    list_out = [" ".join(pieces[i:i+n]) for i in range(0, len(pieces), n)]
    return list_out
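
Note that `splitter` drops the period at each split point. If sentence-ending punctuation should be kept in the chunks, one possible variant splits with a regular-expression lookbehind instead. This is a sketch rather than the method used in this notebook: `splitter_keep_punct` is a hypothetical name, and the simple pattern will still mis-split abbreviations such as "U.S. Senate".

```python
import re

def splitter_keep_punct(n, s):
    # Split at whitespace that follows '.', '!' or '?'; the lookbehind
    # leaves the punctuation attached to the preceding sentence.
    sentences = [p for p in re.split(r"(?<=[.!?])\s+", s.strip()) if p]
    # Group n sentences per chunk, as splitter does.
    return [" ".join(sentences[i:i + n]) for i in range(0, len(sentences), n)]
```

Keeping the punctuation makes the chunks easier to read when they are later pasted into a prompt.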

In [51]:
# Segmenting document by chunking every 10 sentences

df_cols = ["og_row", "id", "chunk"]
df_chunked = pd.DataFrame(columns=df_cols)
for idx, row in dataset_sample.iterrows():
    df_temp = pd.DataFrame(columns=df_cols)
    for elem in splitter(10,row["article"]):
        df_temp.loc[len(df_temp.index)] = [idx, row["id"], elem]
    df_chunked = pd.concat([df_chunked, df_temp])  # DataFrame.append was removed in pandas 2.0

df_chunked

Unnamed: 0,og_row,id,chunk
0,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,Hillary Clinton's newborn presidential campaig...
1,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,In one case a foundation run by the chairman o...
2,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,Canadian businessman Ian Telfer was only shown...
3,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,Chelsea Clinton defended her family philanthro...
4,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,The foundation listed the donors on its public...
5,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,The Clinton Foundation has become lucrative fo...
6,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,Bill Clinton has made a total of $26 million i...
7,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,In 2005 he flew to Kazakhstan with a Canadian ...


In [52]:
df_chunked['chunk'] = df_chunked["chunk"].apply(lambda x : normalize_text(x))

Now, we embed each chunk of the news article using a ‘doc’ embedding model (i.e. text-search-curie-doc-001).

In [53]:
df_chunked['curie_search'] = df_chunked['chunk'].apply(lambda x : get_embedding(x, engine = TEXT_SEARCH_EMBEDDING_ENGINE))

Now, as in the previous section, we embed the user query using the associated “query” model (text-search-curie-query-001). We compare the user query embedding to the embedding of each chunk of the article to find the chunk that is most similar to the query based on cosine similarity and can provide the answer.

In this example, we are looking for information specific to one zone of the article. The query is “how much money did Bill Clinton make from speaking gigs”.


In [54]:
# search through the document for a text segment most similar to the query
# display top n most similar chunks based on cosine similarity
def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        engine="text-search-curie-query-001"
    )
    print(len(embedding))
    df["similarities"] = df.curie_search.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res


res = search_docs(df_chunked, "how much money did Bill Clinton make from speaking gigs", top_n=2)

4096


Unnamed: 0,og_row,id,chunk,curie_search,similarities
6,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,Bill Clinton has made a total of $26 million i...,"[-0.007038620300590992, 0.010500761680305004, ...",0.44956
5,4675,0e1d533b46c74279f036efc06f0a8d6e4b0a420f,The Clinton Foundation has become lucrative fo...,"[-0.01163650956004858, 0.000741734984330833, -...",0.408853


### Next Steps

After “zoning in” on a chunk of text using embeddings, the selected text can be extracted and used to create a dynamic prompt that can be passed to an Azure OpenAI Completion endpoint for summarization or classification tasks. Because the original text was chunked to fit within the token limitation, the “zoned in” text is ready to be passed directly to any downstream task. The end-to-end summarization use case is outlined in the End-to-End design above.
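
A minimal sketch of such a dynamic prompt is shown below. The prompt wording and the helper name `build_summary_prompt` are assumptions made for illustration; the resulting string could be passed as `prompt` to the Completion call used later in this notebook.

```python
def build_summary_prompt(chunk, user_query):
    # Combine the "zoned in" chunk and the user's query into one prompt string.
    return (
        "Summarize the text below, focusing on the question.\n\n"
        f"Text:\n{chunk.strip()}\n\n"
        f"Question: {user_query.strip()}\n"
        "Summary:"
    )
```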

## Enhancing Prompt Engineering using Embeddings

GPT models acquire a lot of general knowledge during training, but our use cases often focus on a specific topic area that GPT isn't specialized in. In this section we will go over how to use embeddings to improve results with few-shot prompt engineering methods. Using embeddings, we can inject domain-specific information into a prompt so that GPT can more successfully answer the question at hand.

Let's walk through an example.

The base GPT-3 models are extremely knowledgeable, but may not know the ins and outs of the 2015 Australian Fashion Report.

In [55]:
prompt = "Which clothing brands did the 2015 Australian Fashion Report expose for ongoing exploitation of overseas workers?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    engine=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'The 2015 Australian Fashion Report exposed many clothing brands for ongoing exploitation of overseas workers, including Forever 21, H&M, Zara, and Gap.'

In fact, the clothing brands that were exposed were Lowes, Industrie, Best & Less and the Just Group. GPT-3 needs assistance here to avoid hallucinations. We would rather the model tell us it does not know than hallucinate. This is essential if we are to trust the responses provided by the model.

Let's try adding a statement to our prompt that explicitly asks the model to avoid hallucinations.

In [56]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: Which clothing brands did the 2015 Australian Fashion Report expose for ongoing exploitation of overseas workers?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    engine=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

We can help the model answer correctly by providing contextual information in the prompt. When the required context is short, it fits into the prompt within the token limitation.

Let's update the prompt with the contextual information and explicitly tell the model to refer to the provided text when answering the question.

In [57]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
As Australian Fashion Week comes to a close, a new damning report has named and shamed some of the worst clothing brands sold in Australia and their companies, for the ongoing exploitation of their overseas workers. Lowes, Industrie, Best & Less and the Just Group - which includes Just Jeans, Portmans and Dotti - were identified as some of the worst performing companies by The 2015 Australian Fashion Report. Amongst the best performers were Etiko, Audrey Blue, Cotton On, H&M and Zara. The report assessed the labour rights management systems of 59 companies and 219 brands operating in Australia. The 2015 Australian Fashion Report has named and shamed some of the worst Aussie clothing brands and companies for their ongoing exploitation of overseas workers . Amongst the best performers were Etiko, Audrey Blue, Cotton On, H&M and Zara . It found that only two of the companies could prove they were paying a full living wage to the workers in two of the three production stages of their clothing. None of the 59 companies could prove the workers at their raw material suppliers were paid a living wage. Unlike a country's legally set minimum wage, a living wage ensures that an employee has enough money to cover the necessities - like food, water, electricity and shelter - and still has a little left over for themselves and their dependants. In some countries like Bangladesh, where the minimum wage is as little as US$68 a month and a living wage is US$104, the difference can be made by paying each worker just an additional 30c per t-shirt

Q: Which clothing brands did the 2015 Australian Fashion Report expose for ongoing exploitation of overseas workers?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    engine=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'Lowes, Industrie, Best & Less, and the Just Group were exposed.'

As we can see, this approach of adding extra information into the prompt works well. However, the contextual information must be small enough to fit within the token limitation.

This raises a question: when we have a large body of information that cannot all fit, how do we choose what to put into the prompt?

The remainder of this section will go over how to use embeddings to select the most relevant context out of a large body of text, which is then used to augment a few-shot prompt. This method answers the initial question in two steps:

1. Retrieving the information relevant to the query using the **Embeddings API**.
2. Appending the relevant context to the few shot prompt using the **Completions API**.

Let's dive into an example to see how this works.

### Example 

We will use the CNN/Daily Mail dataset once again for this example.

The steps for this approach are as follows:

1. Preprocess the knowledge base by splitting it into chunks and creating an embedding vector for each chunk.
2. On receiving a query, embed the query in the same vector space as the context chunks from Step 1.
3. Find the context chunks that are most similar to the query.
4. Append the most relevant context chunk to the few shot prompt, and submit the question to GPT-3 with the Completion endpoint.
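
The four steps above can be sketched end to end as follows. To keep the example self-contained, a toy bag-of-words vector stands in for the Embeddings API, so `toy_embed`, `retrieve` and `build_prompt` are hypothetical helpers, the two chunks are made-up sample text, and the similarity scores are not comparable to real embedding scores.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Stand-in for an embedding call (assumption): word counts as a sparse vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(index, query, top_n=1):
    q = toy_embed(query)  # Step 2: embed the query in the same space
    # Step 3: rank the pre-embedded chunks by similarity to the query
    ranked = sorted(index, key=lambda item: cosine(item[1], q), reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

def build_prompt(context, question):
    # Step 4: append the retrieved context to the prompt
    return (
        'Answer using the provided text; if the answer is not in the text, say "I don\'t know"\n\n'
        f"Context:\n{context}\n\nQ: {question}\nA:"
    )

chunks = [
    "Lowes and Industrie were named in the fashion report.",
    "The football match ended in a draw.",
]
# Step 1: embed each chunk once, ahead of time
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

question = "Which brands were named in the fashion report?"
best = retrieve(index, question)[0]
prompt = build_prompt(best, question)
```

The chunk embeddings are computed once in Step 1 and reused for every query; only the query itself has to be embedded at search time.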

#### Step 1: Preprocess the knowledge base and create context chunks embeddings

In [58]:
df = pd.read_csv(os.path.join(os.getcwd(), 'data', 'cnn_dailymail_data.csv'))
df = df[["article", "highlights"]]
df = df.head(50) # for the sake of the example, we only take the first 50 articles as our knowledge base to reduce compute time

df.head()

Unnamed: 0,article,highlights
0,Ever noticed how plane seats appear to be gett...,Experts question if packed out planes are put...
1,A drunk teenage boy had to be rescued by secur...,Drunk teenage boy climbed into lion enclosure ...
2,Dougie Freedman is on the verge of agreeing a ...,Nottingham Forest are close to extending Dougi...
3,Liverpool target Neto is also wanted by PSG an...,Fiorentina goalkeeper Neto has been linked wit...
4,Bruce Jenner will break his silence in a two-h...,"Tell-all interview with the reality TV star, 6..."


In [59]:
# Splits text on ". " (a period followed by a space) and combines n sentences per chunk.
# Note: the separator is dropped, so the joined sentences lose their trailing periods.
def splitter(n, s):
    pieces = s.split(". ")
    list_out = [" ".join(pieces[i:i+n]) for i in range(0, len(pieces), n)]
    return list_out

In the next few code blocks, we will take the individual articles and break them into chunks of 10 sentences at a time. Then we will embed those individual chunks.

This will result in context chunks and their corresponding vector embedding.

In [60]:
# Segmenting document by chunking for every 10 sentences
# Resulting dataframe will have the og_row or the original index from the knowledge base, so we can refer back to the full article text if needed

df_cols = ["og_row", "chunk"]
df_chunked = pd.DataFrame(columns=df_cols)
for idx, row in df.iterrows():
    df_temp = pd.DataFrame(columns=df_cols)
    for elem in splitter(10,row["article"]):
        df_temp.loc[len(df_temp.index)] = [idx, elem]
    df_chunked = pd.concat([df_chunked, df_temp])  # DataFrame.append was removed in pandas 2.0

df_chunked.reset_index(drop=True, inplace=True)
df_chunked

Unnamed: 0,og_row,chunk
0,0,Ever noticed how plane seats appear to be gett...
1,0,"British Airways has a seat pitch of 31 inches,..."
2,1,A drunk teenage boy had to be rescued by secur...
3,2,Dougie Freedman is on the verge of agreeing a ...
4,3,Liverpool target Neto is also wanted by PSG an...
...,...,...
150,47,"'It was natural, and I bowled quickly, consist..."
151,48,Tom Lineham scored two interception tries in a...
152,48,Lineham had a good chance to lay the platform ...
153,49,For years medical experts have warned about th...


In [61]:
df_chunked['curie_search'] = df_chunked['chunk'].apply(lambda x : get_embedding(x, engine = 'text-search-curie-doc-001'))

In [64]:
df_chunked

Unnamed: 0,og_row,chunk,curie_search
0,0,Ever noticed how plane seats appear to be gett...,"[-0.0074607389979064465, -0.004383711144328117..."
1,0,"British Airways has a seat pitch of 31 inches,...","[-0.0004175635112915188, 0.0086124949157238, 0..."
2,1,A drunk teenage boy had to be rescued by secur...,"[-0.013947161845862865, -0.011407176032662392,..."
3,2,Dougie Freedman is on the verge of agreeing a ...,"[-0.006194248795509338, 0.0047354078851640224,..."
4,3,Liverpool target Neto is also wanted by PSG an...,"[-0.007845471613109112, 0.0026365553494542837,..."
...,...,...,...
150,47,"'It was natural, and I bowled quickly, consist...","[-0.009540674276649952, 0.0028725622687488794,..."
151,48,Tom Lineham scored two interception tries in a...,"[-0.021203014999628067, 0.018296150490641594, ..."
152,48,Lineham had a good chance to lay the platform ...,"[-0.031285665929317474, 0.009455944411456585, ..."
153,49,For years medical experts have warned about th...,"[-0.019099948927760124, -0.006712321657687426,..."


#### Step 2: On receiving a query, embed the query in the same vector space as the context chunks 

In [65]:
input_query = "Which clothing brands did the 2015 Australian Fashion Report expose for ongoing exploitation of overseas workers?"

We will take the input query and embed it in the same vector space as the context chunks. We will use the corresponding query model ("text-search-curie-query-001").

#### Step 3: Find the context chunks that are most similar to the query.

The code sample below combines step 2 and step 3. We will embed the input query and then find the top 3 context chunks that are most similar to the input query.

In [66]:
# search through the document for a text segment most similar to the query
# display the top n most similar chunks based on cosine similarity
def search_docs(df, user_query, n=3, pprint=True):
    embedding = get_embedding(
        user_query,
        engine="text-search-curie-query-001"
    )
    print(len(embedding))
    df["similarities"] = df.curie_search.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(n)
    )
    if pprint:
        display(res)
    return res


res = search_docs(df_chunked, input_query, n=3)

4096


Unnamed: 0,og_row,chunk,curie_search,similarities
46,15,"As Australian Fashion Week comes to a close, a...","[-0.006753021385520697, -0.015173792839050293,...",0.494718
47,15,"Lowes, Industrie, Best & Less and the Just Gro...","[-0.005220436491072178, 0.0013642740668728948,...",0.431739
48,15,The report comes almost two years after over 1...,"[-0.010583952069282532, -0.02157820761203766, ...",0.429292


Let's take a look at the resulting top context chunks that were found through embeddings:

In [67]:
for idx, row in res.iterrows():
    print(row["chunk"])
    print("\n")

As Australian Fashion Week comes to a close, a new damning report has named and shamed some of the worst clothing brands sold in Australia and their companies, for the ongoing exploitation of their overseas workers Lowes, Industrie, Best & Less and the Just Group - which includes Just Jeans, Portmans and Dotti - were identified as some of the worst performing companies by The 2015 Australian Fashion Report Amongst the best performers were Etiko, Audrey Blue, Cotton On, H&M and Zara The report assessed the labour rights management systems of 59 companies and 219 brands operating in Australia The 2015 Australian Fashion Report has named and shamed some of the worst Aussie clothing brands and companies for their ongoing exploitation of overseas workers  Amongst the best performers were Etiko, Audrey Blue, Cotton On, H&M and Zara  It found that only two of the companies could prove they were paying a full living wage to the workers in two of the three production stages of their clothing No

As we can see, the Embeddings API found the most relevant chunks, which we can use to enhance our prompt engineering efforts.
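
If more than one retrieved chunk is to be used, the combined context still has to fit within the token limit. The sketch below greedily adds ranked chunks until a budget is reached. It is an illustration under stated assumptions: `pack_context` is a hypothetical helper, and the default whitespace word count is only a rough stand-in for a real tokenizer.

```python
def pack_context(ranked_chunks, max_tokens, count_tokens=lambda t: len(t.split())):
    """Join chunks (most relevant first) without exceeding max_tokens.

    count_tokens defaults to a whitespace word count, which is only an
    approximation of the model's real token count (assumption).
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)
```

In this notebook, a closer estimate could be obtained by passing `count_tokens=lambda t: len(tokenizer.encode(t))`, using the GPT-2 tokenizer loaded earlier.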

#### Step 4: Append the most relevant context chunk to the few shot prompt, and submit the question to GPT-3 with the Completion endpoint.

In [68]:
context_chunk = res.iloc[0]["chunk"]  # selecting the top context chunk

In [69]:
#Designing the prompt to avoid hallucinations, inject the context chunk found using embeddings, and answer the input_query

prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:\n""" + context_chunk + """

Q: """ + input_query + """ 
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    engine=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'Lowes, Industrie, Best & Less, and the Just Group were identified as some of the worst performing companies by The 2015 Australian Fashion Report.'

As a result, by combining the Embeddings and Completion APIs we can create powerful few-shot question answering models that can answer questions over a large knowledge base without fine-tuning. The prompt also instructs the model to answer truthfully and to avoid hallucinating when the answer isn't clear.

## Takeaway

Overall, embeddings are extremely useful for many different use cases such as text search and text similarity.
We find that embeddings perform well for document search and ranking given a query. Additionally, embeddings can aid in pinpointing a specific region in a long document that can answer a user query specific to the document.


## Future Work

Transformer-based models tend to have a token limitation on their inputs, and Azure OpenAI GPT-3 models are no exception. Currently the token limitation for Azure OpenAI is ~4000 tokens for the text-davinci-002 model. Therefore, long documents require segmentation to be fed into any Azure OpenAI model. Currently, we segment the document based on common features such as sentence boundary detection or, if applicable, HTML parsing.
However, text documents without structure can be challenging. Helpful future work in this area would be detecting natural paragraph breaks in text through topic changes and other natural features. This would ensure the chunks being passed into the model are focused on a single topic and don’t interrupt the flow of the text.


## References

1. Information on BillSUM dataset: https://github.com/FiscalNote/BillSum
2. Information on CNN/Daily Mail dataset: https://paperswithcode.com/dataset/cnn-daily-mail-1
3. Information on Azure OpenAI Embedding offerings: https://docs.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#embeddings-models