# Question 1

## 1. Index
1. Index
2. Problem definition
3. Solution strategy
4. Design choices
5. Implementation approach
6. Implementation
7. Discussion

## 2. Problem definition

- Reshaped holds six discovery meetings weekly and prioritizes gaining a deep understanding of potential clients beforehand.  
- A key part of this preparation is reviewing annual reports, which is time-consuming because the reports are long and inconsistently formatted.  
- To streamline this process, we aim to leverage a **Large Language Model (LLM)**.  
- The LLM should be capable of **answering questions based on the content of the annual reports**.


## 3. Solution strategy

- Language models often lack up-to-date or document-specific knowledge.  
  - As a result, they cannot reliably analyze arbitrary documents on their own.
- To address this limitation, a language model can be combined with an external information source.  
  - This approach is known as **Retrieval Augmented Generation (RAG)**.
- In RAG, the model retrieves real documents to **"ground"** its answers in factual content.  
- This enables more **accurate, reliable, and current responses**, especially for niche or rapidly changing topics.  
- RAG is ideal for use cases like **chatbots, search assistants,** and **enterprise knowledge tools**.

<br/><br/>

A typical RAG process consists of the following steps:


![RAG process flowchart](docs/img/flowchart_rag.png)

## 4. Design choices

Every step in the RAG process described above involves a design choice. I go over the main choices and highlight possible alternatives.

### 4.1 Document Loading

- Many libraries are capable of extracting text from a PDF file (e.g., `PyPDF2`, `Tika`, LangChain's `PyPDFLoader`).  
- The main decision is not which library to use, but what **form** the output should take:  
  - Flat text  
  - Markdown  
- In many cases, LLMs perform better when the input is **structured using Markdown**.  
- However, extracting precise and consistent Markdown formatting is difficult.  
- The results of the **splitting** step are highly sensitive to slight perturbations in Markdown.  
- For this reason, we chose a more robust approach: loading documents as **flat text**.
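
To illustrate what "flat text" means in practice, here is a minimal sketch that joins per-page text into one normalized string. It assumes the pages have already been extracted as plain strings (for example by a PDF loader); `flatten_pages` and the sample pages are hypothetical and not part of this project's `utils.py`.

```python
import re


def flatten_pages(pages: list[str]) -> str:
    """Join per-page text into one flat string with normalized whitespace."""
    text = "\n".join(pages)
    # Re-join words hyphenated across a line break ("fis-\ncal" -> "fiscal").
    # Caveat: this also merges genuine compounds that happen to break there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


# Hypothetical page contents, for illustration only
pages = [
    "Annual  Report\n\n\nRevenue grew in fis-\ncal year 2023.",
    "Risk   factors\nfollow.",
]
flat = flatten_pages(pages)
print(flat)
```

Flat text loses headings and tables, but it is insensitive to the formatting noise that makes Markdown extraction fragile.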


### 4.2 Splitting

- Text is split before being stored in a **vector database**  
  - This ensures that entries are small and focused enough for **accurate search and efficient retrieval**.

- I evaluated several splitters from LangChain:  
  - `CharacterTextSplitter`  
  - `RecursiveCharacterTextSplitter`  
  - `MarkdownHeaderTextSplitter`

- `MarkdownHeaderTextSplitter` is compelling because it preserves header metadata, but it is highly sensitive to changes in Markdown formatting.

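- `RecursiveCharacterTextSplitter` is more flexible than `CharacterTextSplitter`: it tries a list of separators in order (paragraphs, lines, words, characters) instead of splitting on a single separator.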

- Key parameters (LangChain measures both in characters by default):  
  - `chunk_size`  
  - `chunk_overlap`

- Based on average paragraph length, I chose:  
  - `chunk_size = 500`  
  - `chunk_overlap = 100`
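
The effect of the two parameters can be sketched with a plain sliding window. This is a simplified stand-in, not the behavior of `RecursiveCharacterTextSplitter`, which prefers to cut at separators; `split_with_overlap` is a hypothetical helper that cuts at fixed character offsets.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1,200-character text, for illustration only
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.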


### 4.3 Storage

- After splitting and embedding the text, vectors are stored in a vector database:  
  - This enables fast similarity search during retrieval.

- Common vector stores include:  
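  - **FAISS**  
    - Open source, fast for local projects  
    - A library rather than a database server, so persistence and scaling out need extra work  
  - **Chroma**  
    - Open source, fast for local projects  
    - Runs embedded in the application by default, so scaling beyond one machine needs extra work  
  - **Pinecone**  
    - Managed cloud service with paid tiers, very scalable  
    - Requires an account and API key, so setup is more involved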

- For small projects (like this one), there's very little difference between FAISS and Chroma:  
  - Chroma is slightly easier to set up and persist.

- **Embeddings** determine how text is converted into vector representations and placed in the vector space:  
  - Hundreds of embedding options are available.  
  - For ease of use, I worked with **OpenAI embeddings**.
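
The idea behind embedding and similarity search can be sketched without any external service. The example below is a toy under stated assumptions: `embed` is a bag-of-words count vector standing in for a real embedding model, and `InMemoryVectorStore` is a hypothetical stand-in for Chroma. Neither reflects how OpenAI embeddings or Chroma work internally.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': map each lowercase word to its count."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Stores (text, vector) pairs and ranks them by similarity to a query."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def add(self, texts: list[str]) -> None:
        for text in texts:
            self._entries.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


# Made-up chunk texts, for illustration only
store = InMemoryVectorStore()
store.add([
    "cloud revenue grew strongly",
    "cybersecurity risks remain significant",
    "the board declared a dividend",
])
top = store.search("what are the cybersecurity risks", k=1)
print(top)
```

A real embedding model places texts with similar meaning close together even when they share no words, which this word-count toy cannot do.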


### 4.4 Retrieval

- Retrieval involves searching a vector database to find the most relevant text chunks:  
  - Based on similarity to the query  

- Common retrieval methods include:  
  - **K-nearest neighbors (KNN):** Find the K chunks most similar to the query  
  - **Filtered (self-query) search:** Combine KNN with metadata filters  
  - **Maximal Marginal Relevance (MMR):** Select chunks that are relevant to the query while penalizing similarity to chunks already selected, which increases the diversity of the results

- Characteristics of chunks:  
  - No relevant metadata is present  
  - Much of the text is relatively similar

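- Because there is no metadata to filter on and many chunks are near-duplicates, I use the **MMR** retrieval method:
  - The `lambda` parameter balances relevance to the query against similarity to chunks already selected
  - Set `lambda` = 0.5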

- The number of chunks retrieved depends on the size of the context window:  
  - More chunks = more information  
  - More chunks = more noise

- The context window for `GPT-3.5 Turbo` is approximately **16,000** tokens (~12,000 words):  
  - Questions 1 and 2 are broad, so the answers draw on many parts of the report  
  - Number of chunks retrieved = 30
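
The MMR selection described above can be sketched in a few lines. This is a minimal illustration of the general technique, not LangChain's or Chroma's implementation; `mmr_select`, the parameter name `lambda_mult`, and the two-dimensional vectors are assumptions chosen for readability. A value of 1.0 reduces to plain KNN, and lower values penalize redundancy more strongly.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return the indices of up to k candidates, picked greedily by MMR score."""
    relevance = [cosine(query_vec, vec) for vec in candidate_vecs]
    selected: list[int] = []
    remaining = list(range(len(candidate_vecs)))

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy = highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

balanced = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(balanced, relevance_only)
```

With many near-identical chunks in an annual report, this trade-off keeps the 30 retrieved chunks from repeating the same passage.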

N.B.
- Newer models (e.g. `GPT-4o` or `GPT-4.1`) have context windows of roughly 128K to 1M tokens
  - This could remove the need for chunking/retrieval on single documents
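
A rough budget check makes the context-window reasoning concrete. The sketch below assumes about 4 characters per token, a common rule of thumb for English text rather than a measured value, and reserves a hypothetical 1,000 tokens for the system prompt and answer. The document size of about 30,000 tokens is the figure given in section 4.5.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb for English text


def estimate_tokens(num_chars: int) -> int:
    """Estimate the token count of a text from its character count."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def fits(prompt_tokens: int, context_window: int, reserved_for_answer: int = 1_000) -> bool:
    """Check whether the prompt plus a reserved answer budget fits in the window."""
    return prompt_tokens + reserved_for_answer <= context_window


retrieved_tokens = estimate_tokens(30 * 500)  # 30 chunks of 500 characters
document_tokens = 30_000                      # whole report, per section 4.5

print(retrieved_tokens)
print(fits(retrieved_tokens, 16_000))   # retrieved chunks in a 16K window
print(fits(document_tokens, 16_000))    # whole report in a 16K window
print(fits(document_tokens, 128_000))   # whole report in a 128K window
```

Under these assumptions the 30 retrieved chunks use only a fraction of a 16K window, while the full report needs one of the larger windows.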


### 4.5 Output

- To obtain output, the question and retrieved information must be combined.

- There are various strategies for combining retrieved information, such as:  
  - **Stuffing:** All retrieved chunks are included in the prompt.  
  - **Map Reduce:** Each chunk is used separately to answer the question, and the intermediate results are then combined into a final answer.  
  - **Refine:** Generates an answer from one chunk and then iteratively refines it with additional chunks.

- We’re working with one relatively short document:  
  - 23,000 words (approximately 30,000 tokens)

- The retrieved chunks fit in a single prompt, so no advanced combination strategy is needed:  
  - Opt for **stuffing**.

- The system prompt should guide behavior by specifying:  
  - Source usage  
  - Tone/response style  
  - How to handle conflicting information  
  - Task type (e.g., Q&A vs. summarization)

- The **temperature** parameter controls the randomness of the language model's output:  
  - Lower values make responses more focused and deterministic.  
  - Higher values increase creativity and variability.

- To keep answers focused on the retrieved text and as reproducible as possible, we set the temperature to **0**.
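
The effect of temperature can be illustrated with the way it is commonly applied in sampling: the model's scores (logits) are divided by the temperature before being turned into probabilities. The sketch below is a generic illustration. Treating temperature 0 as "always pick the top token" is the usual convention, but how a hosted model handles it internally is an assumption here.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case: all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens
logits = [2.0, 1.0, 0.5]
greedy = softmax_with_temperature(logits, 0)
default = softmax_with_temperature(logits, 1.0)
creative = softmax_with_temperature(logits, 2.0)
print(greedy, default, creative)
```

Lower temperatures concentrate probability on the most likely token, which is why 0 gives the most repeatable answers.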


![System prompt for question 1](docs/img/q1_system_prompt.PNG)
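
Stuffing itself amounts to string concatenation. The sketch below shows one way to do it; `build_stuffed_prompt` and the instruction wording are hypothetical and are not the system prompt used in this project.

```python
def build_stuffed_prompt(chunks: list[str]) -> str:
    """Place every retrieved chunk, numbered, into a single system prompt."""
    context = "\n\n".join(
        f"[Chunk {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You answer questions about an annual report.\n"
        "Use only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )


# Placeholder chunk texts, for illustration only
example_chunks = ["first chunk text", "second chunk text"]
prompt = build_stuffed_prompt(example_chunks)
print(prompt)
```

Numbering the chunks makes it easier to ask the model to cite which passage an answer came from.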

### 4.6 Design overview
- Document loading = LangChain `PyPDFLoader`
- Splitting = LangChain `RecursiveCharacterTextSplitter` (default separators), `chunk_size` = 500, `chunk_overlap` = 100
- Storage = Chroma vector database (OpenAI embeddings)
- Retrieval = Maximal Marginal Relevance (90 chunks fetched, 30 retrieved, `lambda` = 0.5)
- Output = Combine question and retrieved information through a LangChain prompt template using stuffing, with temperature set to 0
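
The parameters in this overview can be gathered in one place so that inconsistent combinations are caught early. The sketch below is one possible way to do that; `RagConfig` and its field names are hypothetical and do not appear in this project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """Design parameters from the overview, with basic consistency checks."""

    chunk_size: int = 500      # characters per chunk
    chunk_overlap: int = 100   # characters shared by consecutive chunks
    fetch_k: int = 90          # candidates fetched before MMR re-ranking
    k: int = 30                # chunks passed to the LLM
    lambda_mult: float = 0.5   # MMR trade-off between relevance and diversity
    temperature: float = 0.0   # LLM sampling temperature

    def __post_init__(self) -> None:
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
        if not 0 < self.k <= self.fetch_k:
            raise ValueError("k must be positive and no larger than fetch_k")
        if not 0.0 <= self.lambda_mult <= 1.0:
            raise ValueError("lambda_mult must be between 0 and 1")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")


config = RagConfig()
print(config)
```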

## 5. Implementation approach

- The implementation of the design was developed by:  
  - Iteratively exploring design choices using **ChatGPT**  
  - Experimenting with design choices using code snippets implemented by **Cursor**  
  - Implementing the folder structure, `utils.py`, and `README` with **Cursor**


## 6. Implementation

### 6.1 Setting up the environment

In [None]:
# Import libraries
import importlib
import os
import utils
from dotenv import load_dotenv

# Reload utils to ensure changes in functions carry over from helper file
importlib.reload(utils)

# Load environment variables
load_dotenv()

# Specify question 1
QUESTION_1 = "What company is described in the document and what is their business model?"

# Specify question 2
QUESTION_2 = "What are the main risks this company is facing?"

# Specify document locations
PDF_PATH = r"Microsoft_2023_Trimmed.pdf"

# Specify vector database storage directory (os.path.join keeps the path portable across operating systems)
DB_PERSISTANCE_PATH = os.path.join("db", "annual_report_vector_db")

### 6.2 Load and split PDF

In [None]:
# Define params
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100

# Load and split
chunks = utils.load_pdf_with_recursive_splitter(PDF_PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

# Inspect result
print(f"Created {len(chunks)} chunks from the PDF")

### 6.3 Create and store vector database

In [None]:
# Create and store database
vector_db = utils.create_vector_db(chunks=chunks, persist_directory=DB_PERSISTANCE_PATH)

# Verify outcome
print("Vector database created successfully")


### 6.4 Retrieve relevant chunks

In [None]:
# Define params
NUM_CHUNKS = 30

# Retrieve relevant chunks for question 1
relevant_chunks_q1 = utils.retrieve_relevant_chunks(vector_db, query=QUESTION_1, k=NUM_CHUNKS)

# Retrieve relevant chunks for question 2
relevant_chunks_q2 = utils.retrieve_relevant_chunks(vector_db, query=QUESTION_2, k=NUM_CHUNKS)

### 6.5 Create prompt template and query LLM

In [None]:
# Define params
TEMP = 0

# Create template
prompt_q1 = utils.setup_prompt(relevant_chunks_q1)

prompt_q2 = utils.setup_prompt(relevant_chunks_q2)

# Pass prompt for question 1 to LLM
answer_q1 = utils.query_llm(system_prompt=prompt_q1, question=QUESTION_1, temperature=TEMP)

# Pass prompt for question 2 to LLM
answer_q2 = utils.query_llm(system_prompt=prompt_q2, question=QUESTION_2, temperature=TEMP)

### 6.6 Results

In [None]:
# Pair each question with its answer and print them in order
results = [(QUESTION_1, answer_q1), (QUESTION_2, answer_q2)]

for question, answer in results:
    print(question)
    print(answer)
    print("\n")

## 7. Discussion