## Question answering based on the OpenAI cookbook tutorial

OpenAI provides a thorough tutorial on building a question-answering function with embeddings (https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb).
One of the solutions we explored follows this tutorial.

The steps, adapted from the tutorial, are:
- Prepare the laws data
- Create embeddings for the data
- Create an embedding for the query
- Find the most relevant text sections (by calculating the distance between the query embedding and the text embeddings)
- Send a prompt containing the question and the most relevant sections
- Receive the answer


## Creating the embeddings file

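An embedding is a vector of numbers that represents a piece of text. Texts with similar meaning have embeddings that lie close together, which makes it possible to search the laws by meaning rather than by exact wording.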

### Overview of the data

The original dataset contains:
- the type (abbreviated title) of the law; for example, VVS stands for "Vabariigi Valitsuse seadus", which translates to "Government of the Republic Act"
- the paragraph number
- the text of the paragraph
- the link to the paragraph

In [1]:
# imports
from fastparquet import write
from tables import *
import openai  # for generating embeddings
import pandas as pd  # for DataFrames to store article sections and embeddings
import re  
import tiktoken  # for counting tokens
import numpy as np
from dotenv import dotenv_values
from vector_database import save_index
from vector_database import load_index
from vector_database import strings_ranked_by_relatedness_vector
from answer_rater import rank_answer

In [2]:
# To use the OpenAI API, create a .env file with the content: OPENAI_API_KEY = "your-api-key"
config = dotenv_values(".env")["OPENAI_API_KEY"]
openai.organization = "org-3O7bHGD9SwjHVDuUCNCGACC3"
openai.api_key = config
GPT_MODEL = "gpt-3.5-turbo"  # used to answer questions and to select the tokenizer for counting tokens
EMBEDDING_MODEL = "text-embedding-ada-002"

In [3]:
df = pd.read_csv("legal_acts_estonia.csv", names=['type', 'nr','text','link'])

In [4]:
df.head(1)

| | type | nr | text | link |
|---|---|---|---|---|
| 0 | VVS | para1 | § 1.\nVabariigi Valitsuse pädevus\n(1) Vabarii... | https://www.riigiteataja.ee/akt/VVS#para1 |


### Cleaning and splitting the data

The tutorial suggests that long sections (over 1600 tokens) be split into smaller sections. Splitting allows the query to be matched against smaller, more specific sections, which can then be added to the prompt without exceeding the token limit.

In [5]:
def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(string))
    return num_tokens

When splitting the paragraphs, it is important to keep each paragraph's context, which is why the title of the paragraph is separated from the text and later added to each of its subsections.

In [6]:
df["title"] = df["text"].str.split("\n").str[1]
df["text"] = df["text"].str.split("\n").apply(lambda x: ','.join(x[2:]))
df["text"] = df["text"].str.replace('\n','')
# each section must stay under 1600 tokens, so split every paragraph
# into its numbered sections: (1) ..., (2) ...
df['split_text'] = df['text'].str.split(r'\(\d+\)')
df = df.explode("split_text")
df = df[df["split_text"]!= ""]
df["nr"] = df["nr"].str.replace('para','')
df['token_count'] = df['split_text'].apply(num_tokens_from_string)

In [7]:
df.head(1)

| | type | nr | text | link | title | split_text | token_count |
|---|---|---|---|---|---|---|---|
| 0 | VVS | 1 | (1) Vabariigi Valitsus teostab täidesaatvat ri... | https://www.riigiteataja.ee/akt/VVS#para1 | Vabariigi Valitsuse pädevus | Vabariigi Valitsus teostab täidesaatvat riigi... | 39 |
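
To make the splitting step easier to follow, here is a minimal sketch that applies the same idea to a single made-up paragraph using only the standard `re` module. The sample text and the variable names are illustrative assumptions, not part of the dataset; the layout (marker line, title line, numbered sections) mirrors the `text` column shown earlier.

```python
import re

# Hypothetical raw paragraph: line 0 is the paragraph marker, line 1 the title,
# and the remaining lines are the numbered sections.
raw = "§ 1.\nExample title\n(1) First section.\n(2) Second section."

lines = raw.split("\n")
title = lines[1]               # keep the title so every piece retains its context
body = ",".join(lines[2:])     # same join as in the notebook cell

# Split at section markers such as (1), (2) and drop the empty leading piece.
sections = [s for s in re.split(r"\(\d+\)", body) if s != ""]

# Attach the title to every section (strip the joining commas and spaces).
chunks = [f"{title}: {s.strip(' ,')}" for s in sections]
print(chunks)
```

Each resulting chunk is short enough to embed on its own while still naming the paragraph it came from.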


Next, we separate out the sections that still have more than 1600 tokens and clean them further. The code below splits these longer sections at subsection markers such as (3.1).

In [8]:
over_length = df[df["token_count"] > 1600].copy()  # copy so that new columns can be added safely
# remove the long strings from the main dataframe for now
df = df[df["token_count"] <= 1600]  # <= so that rows with exactly 1600 tokens are not dropped
over_length['split_text_2'] = over_length['split_text'].str.split(r'\(\d+\.\d+\)')
over_length = over_length.explode("split_text_2")
over_length['token_count'] = over_length['split_text_2'].apply(num_tokens_from_string)
# pieces that still have too many tokens are most often lists of definitions, which can be split by list enumeration
# the pieces that are now short enough can be added back to the original dataframe
over_length_merge = over_length[over_length["token_count"] <= 1600]
over_length_merge = over_length_merge.drop(columns=["split_text"]).rename(columns={"split_text_2":"split_text"})
df = pd.concat([df,over_length_merge])

The following code handles any pieces that still have more than 1600 tokens. It splits them at list enumeration markers, such as 1) ..., 2) ... .

In [9]:
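over_length = over_length[over_length["token_count"]>1600].copy()
# the text before the first list item introduces the list, so it is kept as an extra title
list_item = r',\d+(?:\.\d+)*\)'  # matches ",1)", ",2.1)", ...
over_length["title_2"] = over_length["split_text_2"].str.split(list_item).str[0]
over_length["split_text_3"] = over_length["split_text_2"].str.split(list_item)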
over_length = over_length.explode("split_text_3")
over_length = over_length[over_length["title_2"]!=over_length["split_text_3"]]
over_length['token_count'] = over_length['split_text_3'].apply(num_tokens_from_string)
over_length["title"] = over_length["title"]+ '. ' + over_length["title_2"]
over_length = over_length.drop(columns=["split_text","split_text_2","title_2"]).rename(columns={"split_text_3":"split_text"})
df = pd.concat([df,over_length])
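
The three passes above (sections, subsections, list items) amount to a cascade: try a coarse split first and fall back to finer splits only for pieces that are still too long. The sketch below shows that idea in isolation. It is a simplified illustration, not the notebook's method: `split_to_budget` and `count_words` are hypothetical names, a whitespace word count stands in for the tiktoken token count, and the list introduction is not carried over into the title as the notebook does.

```python
import re
from typing import Callable, List

# Patterns tried in order: "(1)" sections, "(3.1)" subsections, ",1)" list items.
PATTERNS = [r"\(\d+\)", r"\(\d+\.\d+\)", r",\d+(?:\.\d+)*\)"]


def count_words(text: str) -> int:
    """Stand-in for a token counter (assumption: the notebook uses tiktoken)."""
    return len(text.split())


def split_to_budget(
    text: str,
    budget: int,
    count: Callable[[str], int] = count_words,
    level: int = 0,
) -> List[str]:
    """Split text with successively finer patterns until each piece fits the budget."""
    # Stop when the piece fits or when there are no finer patterns left.
    if count(text) <= budget or level >= len(PATTERNS):
        return [text.strip()] if text.strip() else []
    pieces: List[str] = []
    for part in re.split(PATTERNS[level], text):
        pieces.extend(split_to_budget(part, budget, count, level + 1))
    return pieces


example = "(1) a b c (2) d e f g h,1) i j,2) k l"
print(split_to_budget(example, budget=3))
```

A piece that is still over budget after the last pattern is returned as it is, so the caller has to decide what to do with such leftovers.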

In [10]:
df.fillna('', inplace=True) #some laws do not have the type
df["concatenated"] = "Seadus "+ df["type"]+" paragrahv "+ df["nr"]+". Pealkiri: "+ df["title"]+ " Sisu: "+ df["split_text"]
df["concatenated"] = df["concatenated"].str.replace(r'\s+', ' ',regex=True).str.rstrip(",")

In [11]:
df["concatenated"].iloc[0]

'Seadus VVS paragrahv 1. Pealkiri: Vabariigi Valitsuse pädevus Sisu: Vabariigi Valitsus teostab täidesaatvat riigivõimu Eesti Vabariigi põhiseaduse ja seaduste alusel.'

### Calculating the embeddings

In [12]:
laws = np.array(df["concatenated"])

In [13]:
laws[0]

'Seadus VVS paragrahv 1. Pealkiri: Vabariigi Valitsuse pädevus Sisu: Vabariigi Valitsus teostab täidesaatvat riigivõimu Eesti Vabariigi põhiseaduse ja seaduste alusel.'

In [14]:
# calculate embeddings

# WARNING: do not run unless new embeddings are needed (this cell costs about 2 USD)

EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 1000  

law_strings = laws.tolist()

embeddings = []
for batch_start in range(0, len(law_strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = law_strings[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response["data"]):
        assert i == be["index"]  # double check embeddings are in same order as input
    batch_embeddings = [e["embedding"] for e in response["data"]]
    embeddings.extend(batch_embeddings)

result = pd.DataFrame({"text": law_strings, "embedding": embeddings})

Batch 0 to 999
Batch 1000 to 1999
Batch 2000 to 2999
Batch 3000 to 3999
Batch 4000 to 4999
Batch 5000 to 5999
Batch 6000 to 6999
Batch 7000 to 7999
Batch 8000 to 8999
Batch 9000 to 9999
Batch 10000 to 10999
Batch 11000 to 11999
Batch 12000 to 12999
Batch 13000 to 13999
Batch 14000 to 14999
Batch 15000 to 15999
Batch 16000 to 16999
Batch 17000 to 17999
Batch 18000 to 18999
Batch 19000 to 19999
Batch 20000 to 20999
Batch 21000 to 21999
Batch 22000 to 22999
Batch 23000 to 23999
Batch 24000 to 24999
Batch 25000 to 25999
Batch 26000 to 26999
Batch 27000 to 27999
Batch 28000 to 28999
Batch 29000 to 29999
Batch 30000 to 30999
Batch 31000 to 31999
Batch 32000 to 32999
Batch 33000 to 33999
Batch 34000 to 34999
Batch 35000 to 35999
Batch 36000 to 36999
Batch 37000 to 37999
Batch 38000 to 38999
Batch 39000 to 39999
Batch 40000 to 40999
Batch 41000 to 41999
Batch 42000 to 42999
Batch 43000 to 43999
Batch 44000 to 44999
Batch 45000 to 45999
Batch 46000 to 46999
Batch 47000 to 47999
Batch 48000 to 4
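
The batching logic can be tried out without spending API credits by separating it from the API call. The following sketch assumes a hypothetical `embed_fn` callable that takes a list of strings and returns one vector per string; `fake_embed` is a made-up stand-in for the real embedding request and produces meaningless vectors.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 1000,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, keeping the output in input order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_fn(batch)
        # Guard against a response that does not cover the whole batch.
        if len(vectors) != len(batch):
            raise ValueError(f"expected {len(batch)} vectors, got {len(vectors)}")
        embeddings.extend(vectors)
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the API call: a toy 2-dimensional vector per text."""
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

Swapping `fake_embed` for a function that wraps the real embedding request keeps the loop itself unchanged.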

In [15]:
# note: loading this file back is very slow
result.to_hdf(r'embeddedfile_all.h5', key='stage', mode='w')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['text', 'embedding'], dtype='object')]

  result.to_hdf(r'embeddedfile_all.h5', key='stage', mode='w')


In [16]:
index = save_index(embeddings)

## Using the embeddings

In [17]:
reread = pd.read_hdf('./embeddedfile_all.h5')


In [18]:
index = load_index()

In [19]:
df = reread

Relatedness is calculated here from the cosine of the angle between the embeddings. Cosine similarity ranges from -1 to 1, where 1 means that the vectors point in the same direction and -1 that they point in opposite directions. Cosine distance is defined as 1 minus the cosine similarity, so it ranges from 0 to 2, where 0 means perfect similarity and 2 means the vectors are opposite. The tutorial computes relatedness as 1 minus the cosine distance, which is the cosine similarity itself, so in the output below higher values mean more closely related texts.

In [23]:
strings, relatednesses = strings_ranked_by_relatedness_vector("Lapsendamine",EMBEDDING_MODEL,index,df,openai)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.872


'Seadus TsMS paragrahv 113. Pealkiri: Lapsendamine Sisu: Lapsendamist käsitlev avaldus esitatakse lapsendatava elukoha järgi. Kui lapsendataval ei ole Eestis elukohta, esitatakse avaldus Harju Maakohtusse.'

relatedness=0.871


'Seadus TsMS paragrahv 113. Pealkiri: Lapsendamine Sisu: Lapsendamisasja võib lahendada Eesti kohus, kui lapsendaja, üks lapsendavatest abikaasadest või laps on Eesti Vabariigi kodanik või kui lapsendaja, ühe lapsendava abikaasa või lapse elukoht on Eestis.'

relatedness=0.868


'Seadus TsMS paragrahv 564. Pealkiri: Lapsendamise avaldus Sisu: Avaldaja märgib avalduses oma sünniaasta, -kuu ja -päeva, samuti asjaolud, mis kinnitavad, et ta on suuteline last kasvatama, tema eest hoolitsema ja teda ülal pidama.'

relatedness=0.865


'Seadus PKS paragrahv 158. Pealkiri: Lapsendamise ettevalmistamine Sisu: Kui Sotsiaalkindlustusamet seda nõuab, läbib lapsendada sooviv isik lapsendamisele eelnevalt asjakohase koolitusprogrammi.'

relatedness=0.865


'Seadus PKS paragrahv 147. Pealkiri: Lapsendamise lubatavus Sisu: Lapsendada on lubatud, kui see on lapse huvides vajalik ning on alust arvata, et lapsendaja ja lapse vahel tekib vanema ja lapse suhe. Lapsendajat valides arvestatakse tema isikuomadusi, suhteid lapsendatavaga, varalist seisundit ja võimet täita lapsendamissuhtest tulenevaid kohustusi, samuti võimaluse korral lapse vanemate eeldatavat tahet. Otsustamisel arvestatakse võimaluse korral ka lapse üleskasvatamise järjepidevuse vajadust ning tema rahvuslikku, usulist, kultuurilist ja keelelist päritolu.'
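
For readers who want to see how such relatedness scores can arise, here is a brute-force sketch of cosine-based ranking in plain Python. It is only an illustration under assumptions: the `vector_database` helper used in the cell above is not shown in this notebook and may work differently (for example with an approximate index), and `rank_by_relatedness`, the toy texts, and the 2-dimensional vectors are all made up.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors; ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_relatedness(
    query_vec: Sequence[float],
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_n texts with their scores, most related first."""
    scored = [(cosine_similarity(query_vec, v), t) for t, v in zip(texts, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(text, score) for score, text in scored[:top_n]]


# Toy data: real embeddings have many more dimensions.
texts = ["adoption", "taxes", "opposite"]
vectors = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
ranked = rank_by_relatedness([1.0, 0.0], texts, vectors, top_n=2)
for text, score in ranked:
    print(f"{score:.3f} {text}")
```

Comparing the query against every stored vector is fine for a toy example, but with tens of thousands of sections a dedicated vector index is the more practical choice.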

In [24]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness_vector(query,EMBEDDING_MODEL, index,df,openai)
    introduction = 'Use the following part of the law to answer the subsequent question. Try to find the best answer. ' \
    'Formulate the answer including the law name and paragraph number. '\
    'The answer should be a coherent sentence. If the answer cannot be found in the laws, write '\
    '"Ei leidnud seadustest küsimusele vastust, proovige küsimus ümber sõnastada"'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nSeaduse lõik:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "Sa vastad Eesti seaduste andmebaasi küsimustele."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
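
The core of `query_message` is a greedy packing loop: sections are appended in order of relatedness until the next one would push the prompt over the token budget. The sketch below isolates that loop so it can be run without the API or tiktoken. `pack_sections` is a hypothetical name, and a whitespace word count is assumed as a stand-in for the real token count.

```python
from typing import Callable, Sequence


def pack_sections(
    introduction: str,
    question: str,
    sections: Sequence[str],
    budget: int,
    count: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Append sections in the given order until the budget would be exceeded."""
    message = introduction
    for section in sections:
        candidate = f'\n\nSection:\n"""\n{section}\n"""'
        # Stop at the first section that does not fit; later ones are less related.
        if count(message + candidate + question) > budget:
            break
        message += candidate
    return message + question


prompt = pack_sections(
    introduction="Use the text.",
    question="\n\nQuestion: why?",
    sections=["one two three", "four five six seven", "eight"],
    budget=12,
)
print(prompt)
```

In `ask`, the default budget of `4096 - 500` presumably leaves part of the model's context window free for the answer.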

In [25]:
question = "Kuidas astub minister ametisse?"
answer = ask(question)
answer

'Seaduse VVS paragrahv 6 kohaselt astub Vabariigi Valitsus või minister ametisse ametivande andmisega Riigikogu ees.'

In [26]:
rank_answer(question,answer,openai)

'Hinnang vastusele: I would rate this answer a 10 as it directly answers the question and provides a specific legal reference to support the answer.'