## Overview of solution

My tech stack is LlamaIndex, BM25, OpenAI (gpt-4o-mini throughout).   

I focused mostly on the search engine, since the LLM can only answer as well as the data it is given. I therefore implemented hybrid search, which combines semantic matching (embeddings) with keyword matching (BM25). The user question is also rewritten before the search, so that its wording does not hurt retrieval.    
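
To make the keyword half of hybrid search concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the code used in this repo (which relies on LlamaIndex); the function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.5`, `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25 (toy tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "the receiving party shall keep confidential information secret",
    "this agreement is governed by the laws of new york",
    "payment is due within thirty days",
]
scores = bm25_scores("governing laws", docs)
```

Only the second document scores above zero, and only because of the exact token "laws": "governing" does not match "governed". Gaps like this are the reason for pairing keyword search with embeddings.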

#### The steps are:

1. Chunking  
1.1. Text is chunked by sentences

2. Search engine  
2.1. User question is rewritten (by GPT-4o-mini) to remove superfluous words -> improving search  
2.2. Hybrid search is used, with equal weight on BM25 (keyword search) and vector search (text embeddings). The number of search hits is capped so that the prompt stays within the token limit.

3. Answering  
3.1. GPT-4o-mini is fed the search results and instructed to reason first and then answer. There is no strict limitation on output format at this stage; _the focus is to reason/think before answering_.  
3.2. A second call to GPT-4o-mini reformats the answer into the requested output format.  
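
Steps 1 and 2 above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the repo's implementation: the regex sentence splitter, min-max normalization, the word-count token estimate, and all function names are hypothetical, and the scores are made up.

```python
import re


def split_sentences(text):
    """Naive sentence chunker: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def min_max(scores):
    """Rescale scores to [0, 1] so that both retrievers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(bm25, vector, weight=0.5):
    """Return chunk indices sorted by the weighted sum of both score lists."""
    fused = [
        weight * k + (1 - weight) * v
        for k, v in zip(min_max(bm25), min_max(vector))
    ]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


def take_within_budget(chunks, ranked, max_tokens):
    """Keep the top-ranked chunks until a rough token budget is used up."""
    kept, used = [], 0
    for i in ranked:
        cost = len(chunks[i].split())  # crude proxy: one token per word
        if used + cost > max_tokens:
            break
        kept.append(chunks[i])
        used += cost
    return kept


text = (
    "This Agreement is governed by New York law. "
    "Payment is due in 30 days. "
    "Each party keeps information confidential."
)
chunks = split_sentences(text)
bm25 = [2.0, 0.0, 1.0]    # made-up keyword scores, one per chunk
vector = [0.9, 0.1, 0.8]  # made-up embedding similarities, one per chunk
ranked = hybrid_rank(bm25, vector)
hits = take_within_budget(chunks, ranked, max_tokens=16)
```

With `weight=0.5` both retrievers contribute equally, matching step 2.2. The budget check stops adding hits as soon as the next chunk would exceed the limit.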

## How to test the solution

1. Set up environment (I used Python 3.9)
2. Run through the example below
3. Look through explanation of solution


#### 1.0. Set the OPENAI_API_KEY env variable and install requirements.txt

These commands do not always work inside Jupyter notebooks; if they fail, set the variable and install the requirements from your terminal instead.

In [5]:
# `!export` runs in a throwaway subshell, so the variable would never reach the notebook kernel; `%env` sets it in the kernel itself.
# Alternatively, export the key in your terminal before launching Jupyter.
%env OPENAI_API_KEY=your_api_key_here

In [5]:
!pip install -r requirements.txt



#### 1.1. Go to Line 3 in "search_engines.py" and set your OpenAI API key there as well
This may not be necessary for you, but for me the environment variable set in Jupyter was not read correctly.
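
If you want to check whether the key is visible to Python before hard-coding it, a small helper along these lines may help. It is only a sketch: `get_openai_key` and its `fallback` parameter are hypothetical and not part of this repo.

```python
import os


def get_openai_key(fallback=None):
    """Return the API key from the environment, or the fallback if it is unset."""
    key = os.environ.get("OPENAI_API_KEY") or fallback
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in the terminal before "
            "starting Jupyter, or pass a fallback value."
        )
    return key
```

Calling `get_openai_key()` in a notebook cell fails loudly if the kernel cannot see the variable, which is easier to debug than an authentication error later on.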

#### 2.0. Import code

In [1]:
# We first import the three classes needed
from backend import ColumnDataObj, AnswererObj
from search_engines import DocSearch




#### 2.1 First we initialize the knowledge base/search engine with the universe of files we are interested in

Embeddings are built from the text (using OpenAI's text-embedding-3-small), and the BM25 search engine is also initialized, since we are using hybrid search.

In [2]:
search_engine = DocSearch('./data') # Insert all text into the search engine, we can filter it later
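
For intuition about the embedding half of the search, the sketch below ranks chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk labels are made up for illustration; the real embeddings come from text-embedding-3-small, have far more dimensions, and are handled by the search engine internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy "embeddings": each chunk label maps to a made-up vector.
chunk_vectors = {
    "confidentiality clause": [0.9, 0.1, 0.0],
    "governing law clause": [0.1, 0.9, 0.1],
    "payment terms": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the user question

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine(query_vector, chunk_vectors[name]),
    reverse=True,
)
```

Here the confidentiality chunk ranks first because its vector points in nearly the same direction as the query vector, regardless of shared keywords.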

#### 2.2 We define the questions, output formats & files to look through

We essentially define the "column" and "row" of one cell.

In [3]:

one_column = ColumnDataObj("Are there confidentiality agreements?", "Boolean") # Question & output format, here called a "column"

# Files to look through, one "row"
file_list = [
    "AzulSa_20170303_F-1A_EX-10.3_9943903_EX-10.3_Maintenance Agreement1.txt", 
    "AzulSa_20170303_F-1A_EX-10.3_9943903_EX-10.3_Maintenance Agreement2.txt"
    ]
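
One way to picture the "column" and "row" idea is as two small data classes. This is a hypothetical sketch only: `ColumnSketch`, `CellSketch`, and `ALLOWED_FORMATS` are illustrative names, the actual `ColumnDataObj` in `backend` may be structured differently, and the set of formats is assumed from the three that appear in this notebook.

```python
from dataclasses import dataclass
from typing import List

# Assumption: only the formats used in this notebook's examples.
ALLOWED_FORMATS = {"Boolean", "Text", "Date"}


@dataclass(frozen=True)
class ColumnSketch:
    """A question plus the format its answer should be returned in."""
    question: str
    output_format: str

    def __post_init__(self):
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"Unsupported output format: {self.output_format}")


@dataclass(frozen=True)
class CellSketch:
    """One cell: a column (question/format) applied to one row (a set of files)."""
    column: ColumnSketch
    files: List[str]


column = ColumnSketch("Are there confidentiality agreements?", "Boolean")
cell = CellSketch(column, ["contract_part_1.txt", "contract_part_2.txt"])
```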


#### 2.3 Call the "answer" endpoint, using AnswererObj

I left some print statements to make it clearer what is happening under the hood :)

In [4]:
answer_endpoint = AnswererObj() # Object which handles all answering is instantiated

# The "answer" function requires the search engine, list of files to filter by & question/output format info
bot_answer = answer_endpoint.answer(search_engine, file_list, one_column)

print("FINAL ANSWER: ", bot_answer)

search_query:  confidentiality agreements existence
Number of search hits: 7
Unformatted answer:  Thoughts:

- The provided sources contain references to "Confidential Information" and stipulations regarding its treatment.
- There are specific obligations for the "Receiving Party" regarding the handling of Confidential Information, indicating that there are confidentiality agreements in place.
- Confidential Information must only be disclosed to individuals who need to know it for the purpose of the agreement.
- The sources emphasize that the Receiving Party is responsible for ensuring its employees adhere to confidentiality obligations.
- Disclosure to third parties requires specific conditions, including prior approval from the Disclosing Party and a signed acknowledgment of confidentiality from those third parties.
- There are provisions regarding legal disclosures, which highlight that confidentiality obligations may not apply in certain legal circumstances.
- The Confidential Info
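
The "Unformatted answer" printed above is the output of the first, free-form reasoning call; a second call then converts it into the requested format. The two prompts could be assembled roughly as below. The wording and the function names `build_reasoning_prompt` and `build_format_prompt` are hypothetical; the actual prompts in this repo may differ.

```python
def build_reasoning_prompt(question, hits):
    """First call: ask the model to think through the sources before answering."""
    sources = "\n\n".join(
        f"[Source {i + 1}]\n{hit}" for i, hit in enumerate(hits)
    )
    return (
        "Use only the sources below. Write your thoughts as bullet points, "
        "then give an answer.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )


def build_format_prompt(unformatted_answer, output_format):
    """Second call: turn the free-form answer into the requested format."""
    return (
        f"Rewrite the answer below as a single {output_format} value "
        "and nothing else.\n\n"
        f"Answer:\n{unformatted_answer}"
    )


first_prompt = build_reasoning_prompt(
    "Are there confidentiality agreements?",
    ["The Receiving Party shall keep Confidential Information secret."],
)
second_prompt = build_format_prompt("Thoughts: ... Answer: yes", "Boolean")
```

Separating the two calls lets the first prompt stay free of formatting constraints, so the model can reason at length, while the second prompt only has to handle the conversion.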

#### Some more examples

In [5]:
one_column = ColumnDataObj("What is the governing law of the contract?", "Text")

file_list = [
    "AzulSa_20170303_F-1A_EX-10.3_9943903_EX-10.3_Maintenance Agreement1.txt", 
    "AzulSa_20170303_F-1A_EX-10.3_9943903_EX-10.3_Maintenance Agreement2.txt"
    ]

print("FINAL ANSWER: ", answer_endpoint.answer(search_engine, file_list, one_column))

search_query:  governing law of contract
Number of search hits: 11
Unformatted answer:  Sure! Here’s how I’ll approach the question:

- The question asks about the "governing law" of the contract, which refers to the legal framework under which the contract will be interpreted and enforced.
- I need to specifically look for any clauses in the provided text that mention governing law.
- In the sources, I see multiple references to various clauses indicating that the agreement has certain legal stipulations regarding its execution and interpretation.
- I note that "Clause 24" is mentioned, and it indicates there's a specific section that addresses governing law and arbitration.

After assessing the excerpts more closely, particularly focusing on results that mention the governing law directly:

- In the 24.1 Governing Law section, it’s stated that the agreement shall be governed by and construed in accordance with the laws of the State of New York, U.S.A.
- This confirmation comes from a

In [6]:
one_column = ColumnDataObj("What is the date when the agreement went into force?", "Date")

file_list = [
    "PhasebioPharmaceuticalsInc_20200330_10-K_EX-10.21_12086810_EX-10.21_Development Agreement.txt"
    ]

print("FINAL ANSWER: ", answer_endpoint.answer(search_engine, file_list, one_column))

search_query:  agreement effective date
Number of search hits: 12
Unformatted answer:  - The term "Effective Date" is mentioned several times in the provided text. It appears to be a critical point that marks the start of the agreement's terms.
- The "Effective Date" is defined in the preamble of the agreement, but I don't have the exact text of the preamble available in the sources.
- There's a section that mentions the term of the agreement begins on the Effective Date and expires based on certain conditions.
- The text hints at an important date, which is connected to various actions and obligations that are contingent upon the Effective Date.
- There is also a specific mention of the date "January 9, 2020," which appears on the signature page for both parties, likely indicating when the agreement was executed.
- The language suggests that once the agreement was executed, that date (January 9, 2020) could be considered the Effective Date unless stated otherwise in the preamble.

Now