# **Building a RAG Knowledge Assistant for Translators**

## **Project Summary**

In this project, I explain the process of building a Retrieval-Augmented Generation (RAG) system, from initial experimentation to a final, deployed application. The system acts as a knowledge assistant for professional translators, providing accurate answers from the European Commission's English Style Guide. This work explores how key component choices contribute to a more effective pipeline, focusing on three central questions:

1.  What is the impact of different **chunking strategies** on the quality of the retrieval results?
2.  Which **retrieval method** (keyword, semantic, or hybrid) yields the most relevant information?
3.  How can **prompt engineering** reduce hallucinations and improve the final answer's reliability?

The insights from these experiments resulted in a practical web application built with **Streamlit**. The entire system, which uses **Amazon Bedrock** for LLM access, was containerized with **Docker** and deployed on an **Amazon EC2** instance.

---

## **1. Introduction**

**RAG** has gained popularity in recent years, and it's easy to see why. It bridges a critical gap between large language models **(LLMs)** and traditional **semantic search**, providing a more efficient way to get accurate, up-to-date answers from documents.

Here’s the problem with **LLMs** alone: they are great at generating fluent, coherent responses, but they rely entirely on their **training data**. If you ask about something not in that data, like a new style guide update or an internal document, they can’t help. You’d need to fine-tune the model, which is expensive, time-consuming, and hard to maintain.

Even if you paste the relevant text into the prompt, you run into other issues. The **context window** limits how much you can include. And LLMs can still hallucinate, giving confident but wrong answers even when the correct information is right there in the prompt.

Traditional **semantic search** solves part of this. It works well when you need to find a document or passage based on keyword or meaning similarity. It retrieves what exists, but it **doesn’t generate natural language answers**. It gives you a list of results, and you still have to read through them to find what you need.

**RAG enhances retrieval with generation**. It first searches your knowledge base for the most relevant chunks of text. Then it passes those to the LLM to generate a clear, concise answer, grounded in actual source material.
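
To make the retrieve-then-generate flow concrete, here is a toy sketch. It is not the pipeline built in this project: simple word overlap stands in for vector search, the LLM call is left out, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names used only for illustration.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercases the text and returns its set of word tokens."""
    return set(re.findall(r'\w+', text.lower()))


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Ranks chunks by word overlap with the question (a stand-in for vector search)."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Builds a grounded prompt that numbers each retrieved chunk."""
    context = '\n'.join(f'[{i}] {c}' for i, c in enumerate(context_chunks, start=1))
    return (
        'Answer using only the numbered context below. '
        'If the answer is not there, say so.\n\n'
        f'Context:\n{context}\n\nQuestion: {question}'
    )


chunks = [
    'A full stop marks the end of a sentence.',
    'Use a hyphen in compound adjectives.',
    'Write dates as 14 February 2025.',
]
question = 'When do I use a full stop?'
retrieved = retrieve(question, chunks)
prompt = build_prompt(question, retrieved)  # this string would be sent to the LLM
```

The numbered context is what makes answers traceable: the model can be asked to cite `[1]` or `[2]`, and each number maps back to a specific chunk.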

The **pros**?  
- You don’t need to retrain the model when your documents change. Just update the vector database.  
- Answers are traceable. You can check which part of the document was used.  
- It handles long, complex documents by breaking them into searchable pieces.  
- You reduce hallucinations because the model is asked to answer from retrieved content rather than from memory.

The **cons**?  
- Chunking matters. Bad splits can break context and hurt retrieval.  
- Embeddings aren’t perfect. Sometimes the system misses relevant sections just because of phrasing differences.  
- You still need safeguards, like prompt engineering, to keep the LLM honest and on track.

In short: RAG gives you a flexible, cost-effective way to build question-answering systems over your own documents, without retraining models or trusting blind guesses.

---

## **2. Notebook Setup and Configuration**

### **2.1. The Source Document**

For the source material, I chose the **European Commission’s English Style Guide**. I specifically picked a **PDF** because PDFs, while extremely common in professional environments, are notoriously **difficult to process**. They are designed for visual consistency, not for machine readability, which can make extracting text and understanding its underlying structure a significant engineering challenge.

This particular guide adds another layer of complexity with its deep **hierarchy** of chapters, sections, and subsections. Instead of starting with a clean text file, I chose this document to tackle a realistic problem head-on: how do you build a high-quality RAG system when your source material is messy? The goal is to show that even with a difficult format, we can develop a parsing strategy that captures the document's structure to create meaningful chunks for retrieval.

### **2.2. Libraries**

This first step handles the necessary imports and sets up the main configuration variables for the project.

A few key libraries were used here:
* **`pymupdf`**: A powerful library for extracting text, images, and metadata from PDF documents. We'll use it to get the raw text content from each page.
* **`tiktoken`**: This is OpenAI's tokenizer. We use it to count tokens, which helps keep our text chunks within the input limit of the embedding model. Because the embedding model uses its own tokenizer, these counts are a close approximation rather than an exact measure.
* **`RecursiveCharacterTextSplitter`** from LangChain: A practical tool for breaking text into smaller pieces while trying to preserve natural boundaries like paragraphs and sentences.

In [1]:
# === IMPORTS AND CONFIGURATION ===
import json
import re
from typing import List, Dict
import pymupdf as fitz
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# --- Main Configuration ---
config = {
    'data_dir': '../data/',
    'pdf_file_name': 'English_Style_Guide-European_Commission.pdf'
}

# --- Path Definitions ---
PDF_PATH = f"{config['data_dir']}raw/{config['pdf_file_name']}"
FIXED_CHUNK_PATH = f"{config['data_dir']}processed/fixed_chunks.json"
SECTION_CHUNK_PATH = f"{config['data_dir']}processed/section_chunks.json"

# --- Document Metadata ---
SOURCE_DOCUMENT = config['pdf_file_name']
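
The paths above assume that the `processed/` folder already exists when the JSON files are written later on. As a hedged alternative, the sketch below builds the same paths with `pathlib` and creates the output folder if it is missing. `build_paths` is a hypothetical helper, and the temporary directory is only there to keep the example self-contained.

```python
import tempfile
from pathlib import Path
from typing import Dict


def build_paths(data_dir: Path, pdf_file_name: str) -> Dict[str, Path]:
    """Builds input/output paths and makes sure the output folder exists."""
    processed_dir = data_dir / 'processed'
    processed_dir.mkdir(parents=True, exist_ok=True)  # no error if already present
    return {
        'pdf': data_dir / 'raw' / pdf_file_name,
        'fixed': processed_dir / 'fixed_chunks.json',
        'section': processed_dir / 'section_chunks.json',
    }


# Demonstration inside a throwaway directory (stands in for '../data/').
with tempfile.TemporaryDirectory() as tmp:
    paths = build_paths(Path(tmp) / 'data', 'English_Style_Guide-European_Commission.pdf')
    processed_exists = paths['fixed'].parent.is_dir()
```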

---

## **3. PDF Processing and Chunking Strategies**

We’re now ready to extract text from the PDF and split it into chunks for retrieval. The quality of these chunks is one of the most critical factors for a RAG system's success. To answer the first key question of this project (*what is the impact of different chunking strategies on the quality of the retrieval results?*), I will implement and compare two distinct methods:

-  **Fixed-Size Chunking**: This approach splits text into chunks of roughly uniform size. It’s fast and simple, but it can ignore the document’s structure, splitting paragraphs or ideas across different chunks. I’ll use it as a baseline.
-  **Section-Based Chunking**: This is a more advanced, layout-aware strategy that uses the document's visual hierarchy (chapters, sections, etc.) to create more meaningful and contextually complete chunks.

To implement these strategies effectively, we first need to prepare the source material by cleaning the raw text.

### **3.1. Data Cleaning**

Raw text extracted from a PDF tends to be far from perfect. It often includes repetitive headers, footers, page numbers, and other digital artifacts that confuse the embedding model and hurt retrieval accuracy. Before chunking the text, it’s important to remove this noise to ensure the quality of the data we feed into the model.

The cleaning function below uses a series of regular expressions to remove specific, recurring patterns identified during the initial exploration:
* **Dates**: Removes the date that appears in the footer of the pages ("14 February 2025").
* **Page Numbers**: Strips out the page numbering format (e.g., "3/130").
* **Document Title**: Removes the "English Style Guide" text, which is part of the header.
* **Software-Specific Notes**: Deletes certain instructional footnotes related to Microsoft Word or Windows.
* **Extra Blank Lines**: Collapses multiple blank lines into a single one to keep the text formatting clean and consistent.

This function is applied in both chunking pipelines: to the full extracted text before fixed-size splitting, and to each chunk in the section-based strategy.

In [2]:
# === DATA CLEANING FUNCTION ===
def clean_text(text: str) -> str:
    """Removes recurring headers, footers, and other PDF artifacts from text."""
    text = re.sub(r'14 February 2025', '', text)
    text = re.sub(r'(\n)?\d{1,3}/\d{3}(\n)?', '', text)
    text = re.sub(r'(\n)?English Style Guide(\n)?', '', text)
    text = re.sub(r'\n\d+\s+In (Word|Windows):.*\n', '\n', text)
    text = re.sub(r'\n{2,}', '\n\n', text)
    return text.strip()
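
One thing worth watching with the page-number pattern: because it is not anchored to a line, it can also match inside body text such as a reference like "2016/679". The sketch below shows a stricter variant, assuming page numbers always sit on a line of their own. `PAGE_NUMBER_LINE` and `strip_page_numbers` are hypothetical names, and the sample text is made up for illustration.

```python
import re

# Matches a page marker such as "4/130" only when it fills an entire line.
PAGE_NUMBER_LINE = re.compile(r'^[ \t]*\d{1,3}/\d{3}[ \t]*$', re.MULTILINE)


def strip_page_numbers(text: str) -> str:
    """Removes standalone page markers without touching inline number pairs."""
    return PAGE_NUMBER_LINE.sub('', text)


sample = 'English Style Guide\n4/130\nSee Regulation (EU) 2016/679 for details.\n'

# The unanchored pattern also eats part of the inline reference.
loose = re.sub(r'(\n)?\d{1,3}/\d{3}(\n)?', '', sample)
# The anchored pattern removes only the standalone page marker.
strict = strip_page_numbers(sample)
```

Whether this matters depends on how often such references appear in the document, so it is worth checking against a few real pages before swapping patterns.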

### **3.2. Fixed-size Chunking**

The first strategy is fixed-size chunking, a baseline approach where every chunk is capped at a fixed maximum size (in this case, 256 tokens). Despite the strict cap, the method is more careful than a brute-force split. It uses LangChain's `RecursiveCharacterTextSplitter`, which doesn’t understand meaning but tries separators in priority order, preferring paragraph and sentence breaks over arbitrary cuts. The 256-token limit is the hard rule; within it, the splitter does its best to respect the text's structure.

The process is handled in a few steps:
1.  **Text Extraction**: First, the raw text is extracted from the PDF, starting at **page index 6** (the seventh physical page, since PyMuPDF counts pages from zero) to skip the cover, table of contents, and other front matter.
2.  **Cleaning**: The extracted text is passed through the `clean_text` function to remove recurring headers, footers, and other digital artifacts.
3.  **Splitting**: The `RecursiveCharacterTextSplitter` then generates the chunks based on two key parameters:
    * **`max_tokens = 256`**: This is a typical chunk size for embedding models. For this document, it also creates chunks that are roughly comparable in size to the average logical section, allowing for a fairer comparison against the section-based strategy.
    * **`overlap_tokens = 50`**: A 50-token overlap (about 20% of the chunk size) is used to help maintain context between adjacent chunks, reducing the chance that an idea is awkwardly cut off at a boundary.
4.  **Metadata Assignment**: Finally, each chunk is packaged into a dictionary with important metadata, like its `chunk_id` and the method used.

This strategy is fast and easy to implement, making it a solid baseline for comparison.

In [3]:
# === FIXED-SIZE CHUNKING FUNCTION ===
def create_fixed_size_chunks(pdf_path: str, max_tokens: int = 256, overlap_tokens: int = 50) -> List[Dict]:
    """Creates fixed-size text chunks as a baseline chunking strategy."""
    doc = fitz.open(pdf_path)
    # Extract text from main content pages, skipping front matter.
    full_text = ''.join(page.get_text() for page in doc.pages(start=6))
    doc.close()
    
    cleaned_text = clean_text(full_text)
    
    tokenizer = tiktoken.get_encoding('cl100k_base')
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_tokens,
        chunk_overlap=overlap_tokens,
        length_function=lambda x: len(tokenizer.encode(x)),
        separators=['\n\n', '\n', '. ', ' ', ''],
    )
    texts = text_splitter.split_text(cleaned_text)
    
    chunks = []
    for i, text in enumerate(texts):
        token_count = len(tokenizer.encode(text))
        chunks.append({
            'chunk_id': f'fixed_{i:04d}',
            'text': text,
            'token_count': token_count,
            'page_number': 'N/A',
            'method': 'fixed_size',
            'source_document': SOURCE_DOCUMENT
        })
    return chunks
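
To see how the maximum size and the overlap interact, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also respects separators); it treats whitespace-separated words as stand-in tokens, and `sliding_window_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], max_tokens: int, overlap: int) -> List[List[str]]:
    """Splits tokens into windows of max_tokens that share `overlap` tokens."""
    if overlap >= max_tokens:
        raise ValueError('overlap must be smaller than max_tokens')
    stride = max_tokens - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = [f'w{i}' for i in range(10)]
chunks = sliding_window_chunks(words, max_tokens=4, overlap=1)
# Windows: w0-w3, w3-w6, w6-w9 (each shares one token with its neighbour)
```

With the notebook's settings, the same arithmetic gives a stride of 256 - 50 = 206 tokens, which is why overlap increases the total number of chunks.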

### **3.3. Section-Based Chunking**

The second strategy is a more advanced, layout-aware approach designed to create semantically complete chunks. Instead of being capped at a fixed size, each chunk's boundaries are determined by the document's logical hierarchy, which we recover by identifying the headers for Parts, Chapters, and Sections.

#### **3.3.1. Investigating the PDF Structure**

Before splitting the document, we need to understand how those sections are defined by their formatting and text patterns. The script below uses `PyMuPDF` to inspect a few pages of the document. The code iterates through text "spans" and prints their font size, whether they are bold, and the text itself.

This investigative step is crucial. By looking at the output, we can spot the patterns that define the document's hierarchy. For instance, we can see that chapter titles consistently use a specific font size and style, which is different from section or subsection titles.

Here is the code I used for the inspection:

In [4]:
# === PDF STRUCTURE INSPECTION ===
def inspect_pdf_structure(pdf_path: str, pages_to_check: List[int] = [9, 10, 11, 12]):
    """
    Prints text spans and their font properties from select pages of a PDF
    to help identify patterns for headers.
    """
    doc = fitz.open(pdf_path)
    print(f"--- Inspecting pages {pages_to_check} ---")

    for page_num in pages_to_check:
        print(f"\n--- PAGE {page_num+1} ---")
        page = doc.load_page(page_num)
        blocks = page.get_text('dict')['blocks']
        for block in blocks:
            if 'lines' in block:
                for line in block['lines']:
                    for span in line['spans']:
                        # 'flags' is a bitmask; bit 4 (value 2**4 = 16) indicates bold.
                        is_bold = (span['flags'] & 16) > 0
                        # Round size for easier comparison
                        font_size = round(span['size'], 2)
                        text = span['text'].strip()
                        if text:
                            print(f"Size: {font_size}, Bold: {is_bold}, Text: '{text}'")
    doc.close()

# Run the inspection
if __name__ == '__main__':
    inspect_pdf_structure(PDF_PATH)

--- Inspecting pages [9, 10, 11, 12] ---

--- PAGE 10 ---
Size: 9.0, Bold: False, Text: 'English Style Guide'
Size: 9.0, Bold: False, Text: '4/130'
Size: 9.0, Bold: False, Text: '14 February 2025'
Size: 14.04, Bold: True, Text: '1.'
Size: 14.04, Bold: True, Text: 'General'
Size: 12.0, Bold: False, Text: '1.1.'
Size: 12.0, Bold: False, Text: 'Language usage.'
Size: 12.0, Bold: False, Text: 'The language used in English texts should be understandable'
Size: 12.0, Bold: False, Text: 'to speakers of Irish/British English (defined in the introduction to this guide as'
Size: 12.0, Bold: False, Text: 'the shared standard usage of Ireland and the United Kingdom). As a general'
Size: 12.0, Bold: False, Text: 'rule, Irish/British English should be preferred, and Americanisms that are liable'
Size: 12.0, Bold: False, Text: 'not to be understood by speakers of Irish/British English should be avoided.'
Size: 12.0, Bold: False, Text: 'However, bearing in mind that a considerable proportion of the ta
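
The `flags` value checked during the inspection is a bitmask, so it can carry several font properties at once. The sketch below decodes it using the bit layout from the PyMuPDF documentation as I understand it; `SPAN_FLAG_BITS` and `decode_span_flags` are hypothetical names.

```python
from typing import List

# Bit positions for PyMuPDF span flags (bit n has the value 2**n).
SPAN_FLAG_BITS = {
    0: 'superscript',
    1: 'italic',
    2: 'serifed',
    3: 'monospaced',
    4: 'bold',
}


def decode_span_flags(flags: int) -> List[str]:
    """Returns the names of all font properties set in a span's flags value."""
    return [name for bit, name in SPAN_FLAG_BITS.items() if flags & (1 << bit)]


serifed_bold = decode_span_flags(20)  # 20 = 4 + 16
serifed_only = decode_span_flags(4)   # the value 4 means serifed, not bold
```

This is why the bold test uses `& 16` rather than `& 4`: bold is bit 4, which has the value 16.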

Based on the inspection, I identified clear patterns for the main structural elements and codified them below.

  * **Chapters and Sections**: These headers have distinct font styles. Chapter titles are size `14.04` and bold, while main section titles are `12.00` and bold. I stored these findings in the `HEADER_PROFILES` dictionary.
  * **Parts and Subsections**: These headers could be identified by their text content. `Part I` and `Part II` follow a consistent naming convention, and subsections use a numbered format like `1.1.` or `10.2.`. I created regular expressions (`PART_REGEX` and `SUBSECTION_REGEX`) to catch these.

This combination of style and pattern matching allows for a more robust method of identifying the document's hierarchy.

In [5]:
# === HEADER PROFILES ===
# Define the font properties of structural headers in the document.
HEADER_PROFILES = {
    'chapter': {'size': 14.04, 'bold': True},
    'section': {'size': 12.00, 'bold': True},
}
# Regex patterns used for numbered or consistently worded headers.
SUBSECTION_REGEX = re.compile(r'^(\d{1,2}\.\d{1,2}\.)')
PART_REGEX = re.compile(r'^Part (I|II)')
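
Matching on exact font sizes works for this document, but floating-point sizes can vary slightly between PDFs or extraction runs. As a hedged alternative, the sketch below compares sizes with a small tolerance. `match_header_profile` and the `0.05` tolerance are assumptions for illustration, not part of the notebook's pipeline.

```python
import math
from typing import Optional

# Same profiles as above, repeated so the example runs on its own.
HEADER_PROFILES = {
    'chapter': {'size': 14.04, 'bold': True},
    'section': {'size': 12.00, 'bold': True},
}


def match_header_profile(size: float, bold: bool, tolerance: float = 0.05) -> Optional[str]:
    """Returns the first header type whose size (within tolerance) and weight match."""
    for name, profile in HEADER_PROFILES.items():
        same_size = math.isclose(size, profile['size'], abs_tol=tolerance)
        if same_size and bold == profile['bold']:
            return name
    return None


chapter_hit = match_header_profile(14.0399, True)  # slightly off, still a chapter
body_text = match_header_profile(12.0, False)      # right size, but not bold
```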

#### **3.3.2. Creating Chunks from Document Structure**

These header profiles are the foundation for the `create_section_based_chunks` function, which works like a simple **state machine**, moving through the document page by page and block by block:

1.  **State Tracking**: It keeps track of the current "state" using a `context` dictionary (e.g., `{'part': 'Part I', 'chapter': '2. Punctuation', ...}`).
2.  **Header Detection**: For each text block, it checks if the block's font style or text pattern matches one of our predefined header profiles.
3.  **Chunking on State Change**: When a new header is detected, it signals the end of the previous section. The function then saves all the text it has accumulated since the last header as a single, complete chunk. It then updates its state to the new header and begins collecting text for the next chunk.

This process results in chunks that align with the document's natural flow, keeping all the text for a given section together.
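
Before the full implementation, the three steps above can be sketched on toy data. This is a stripped-down illustration, assuming the blocks have already been labelled with a header type (or `None` for body text); `chunk_by_headers` and `LEVELS` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

LEVELS = ['part', 'chapter', 'section', 'subsection']


def chunk_by_headers(blocks: List[Tuple[Optional[str], str]]) -> List[Dict[str, str]]:
    """Groups blocks into chunks, starting a new chunk at every header."""
    context = {level: 'N/A' for level in LEVELS}
    chunks: List[Dict[str, str]] = []
    buffer: List[str] = []

    def flush() -> None:
        # Save the accumulated text together with a snapshot of the current state.
        if buffer:
            chunks.append({'text': '\n'.join(buffer), **context})
            buffer.clear()

    for header_type, text in blocks:
        if header_type is not None:
            flush()  # a new header closes the previous chunk
            context[header_type] = text
            # Entering a higher level resets every level below it.
            for lower in LEVELS[LEVELS.index(header_type) + 1:]:
                context[lower] = 'N/A'
        buffer.append(text)
    flush()  # do not lose the final chunk
    return chunks


blocks = [
    ('chapter', '2. Punctuation'),
    ('subsection', '2.1.'),
    (None, 'Punctuation must follow English conventions.'),
    ('subsection', '2.2.'),
    (None, 'A full stop marks the end of a sentence.'),
]
chunks = chunk_by_headers(blocks)
```

Note that the header text itself is kept at the start of each chunk, which mirrors the full function below.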

In [6]:
# === SECTION-BASED CHUNKING FUNCTION ===
def create_section_based_chunks(pdf_path: str) -> List[Dict]:
    """Creates semantic chunks using a layout-aware state machine."""
    doc = fitz.open(pdf_path)
    
    # Pre-scan the document to map physical page indices to the logical numbers in the footer.
    page_map = {}
    for page in doc:
        match = re.search(r'\n(\d{1,3})/\d{3}\n', page.get_text())
        if match:
            page_map[page.number] = int(match.group(1))

    chunks = []
    context = {'part': 'N/A', 'chapter': 'N/A', 'section': 'N/A', 'subsection': 'N/A'}
    current_text = ''
    start_page = 0

    def save_chunk():
        nonlocal current_text
        if current_text.strip():
            chunks.append({'text': current_text, 'page_number': page_map.get(start_page, start_page + 1), **context})
        current_text = ''

    # Process pages containing the main, structured content.
    for page_num in range(6, doc.page_count):
        if page_num == 6:
            context.update({'part': 'N/A', 'chapter': 'Introduction', 'section': 'N/A', 'subsection': 'N/A'})
            start_page = page_num

        page = doc.load_page(page_num)
        blocks = page.get_text('dict', sort=True)['blocks']
        
        for block in blocks:
            if 'lines' not in block: continue
            
            block_text_content = ' '.join(s['text'] for l in block['lines'] for s in l['spans']).strip()
            if not block_text_content or 'English Style Guide' in block_text_content or re.match(r'^\d{1,3}/\d{3}$', block_text_content):
                continue

            span = block['lines'][0]['spans'][0]
            style = {'size': round(span['size'], 2), 'bold': (span['flags'] & 16) > 0}

            header_type = None
            if PART_REGEX.match(block_text_content): header_type = 'part'
            elif style == HEADER_PROFILES['chapter']: header_type = 'chapter'
            elif style == HEADER_PROFILES['section']: header_type = 'section'
            elif SUBSECTION_REGEX.match(block_text_content): header_type = 'subsection'
            
            if header_type:
                save_chunk()
                start_page = page_num
                
                header_text = block_text_content.replace('\n', ' ').strip()
                if header_type == 'part': context.update({'part': header_text, 'chapter': 'N/A', 'section': 'N/A', 'subsection': 'N/A'})
                elif header_type == 'chapter': context.update({'chapter': header_text, 'section': 'N/A', 'subsection': 'N/A'})
                elif header_type == 'section': context.update({'section': header_text, 'subsection': 'N/A'})
                elif header_type == 'subsection':
                    match = SUBSECTION_REGEX.match(header_text)
                    context.update({'subsection': match.group(0) if match else 'N/A'})
            
            current_text += block_text_content + '\n'
    
    save_chunk()
    doc.close()

    # Finalize all created chunks with metadata.
    tokenizer = tiktoken.get_encoding('cl100k_base')
    final_chunks = []
    for i, chunk in enumerate(chunks):
        cleaned_text = clean_text(chunk['text'])
        if not cleaned_text: continue
    
        token_count = len(tokenizer.encode(cleaned_text))
    
        final_chunks.append({
            'chunk_id': f'section_based_{i:04d}',
            'text': cleaned_text,
            'token_count': token_count,
            'page_number': chunk['page_number'],
            'method': 'section_based',
            'part': chunk['part'],
            'chapter': chunk['chapter'],
            'section': chunk['section'],
            'subsection': chunk['subsection'],
            'source_document': SOURCE_DOCUMENT
        })
    return final_chunks

#### **3.3.3. Handling Oversized Chunks**

A challenge with this method is that some logical sections can be very long, exceeding the token limit for our embedding model. The `split_oversized_chunks` function solves this problem:

1.  **Identify Large Chunks**: It iterates through the section-based chunks and checks their token count against a `max_tokens` limit (e.g., 512).
2.  **Split if Necessary**: If a chunk is too large, it's split into smaller pieces using the same `RecursiveCharacterTextSplitter` from our baseline. Chunks that are already within the limit are left untouched.
3.  **Preserve Metadata**: When a large chunk is split, its original metadata (Part, Chapter, Section, etc.) is copied to all the new sub-chunks. This ensures that even the smaller pieces retain their full contextual information from the document's hierarchy.

This two-step process gives us the best of both worlds: chunks that are semantically grouped by section, but also guaranteed to be a manageable size.

In [7]:
# === OVERSIZED CHUNK SPLITTING FUNCTION ===
def split_oversized_chunks(chunks: List[Dict], max_tokens: int = 512, overlap: int = 50) -> List[Dict]:
    """Splits chunks that exceed the token limit while preserving metadata."""
    tokenizer = tiktoken.get_encoding('cl100k_base')
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_tokens,
        chunk_overlap=overlap,
        length_function=lambda x: len(tokenizer.encode(x)),
        separators=['\n\n', '\n', '. ', ' ', '']
    )
    
    final_split_chunks = []
    for chunk in chunks:
        if chunk.get('text') and len(chunk['text']) > overlap:
            if chunk['token_count'] <= max_tokens:
                chunk['is_split'] = False
                final_split_chunks.append(chunk)
            else:
                sub_texts = splitter.split_text(chunk['text'])
                for i, sub_text in enumerate(sub_texts):
                    sub_chunk = chunk.copy()
                    cleaned_sub_text = clean_text(sub_text)
                    sub_token_count = len(tokenizer.encode(cleaned_sub_text))
                
                    sub_chunk.update({
                        'chunk_id': f"{chunk['chunk_id']}_part{i:02d}",
                        'text': cleaned_sub_text,
                        'token_count': sub_token_count,
                        'method': 'section_based_split',
                        'is_split': True
                    })
                    final_split_chunks.append(sub_chunk)
        elif chunk.get('text'):
            chunk['is_split'] = False
            final_split_chunks.append(chunk)

    return final_split_chunks
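
Since the whole point of this step is the size guarantee, a quick post-check on the output can confirm that it holds. The sketch below is a minimal example; `find_oversized` is a hypothetical helper and the sample chunks are made up.

```python
from typing import Dict, List


def find_oversized(chunks: List[Dict], max_tokens: int = 512) -> List[str]:
    """Returns the IDs of chunks whose stored token_count exceeds max_tokens."""
    return [c['chunk_id'] for c in chunks if c['token_count'] > max_tokens]


# Made-up chunks for illustration only.
sample_chunks = [
    {'chunk_id': 'section_based_0001', 'token_count': 105},
    {'chunk_id': 'section_based_0002_part00', 'token_count': 512},
    {'chunk_id': 'section_based_0002_part01', 'token_count': 530},
]
oversized = find_oversized(sample_chunks)
```

An empty result means every chunk fits; anything else points to the exact chunk IDs that need a closer look.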

### **3.4. Executing the Chunking Pipeline**

With all the functions defined, the final cell of the pipeline runs both chunking strategies:
1.  **Fixed-size**: It calls `create_fixed_size_chunks` to generate the chunks for our baseline strategy and saves them to `fixed_chunks.json`.
2.  **Section-based**: It runs the more advanced pipeline by calling `create_section_based_chunks` and then passing those results to `split_oversized_chunks`. The final, optimized chunks are saved to `section_chunks.json`.

The result is two separate JSON files, each containing a clean, structured list of chunks ready for the next stage of the project: embedding and indexing.

In [8]:
# === SCRIPT EXECUTION ===
if __name__ == '__main__':
    # --- Process and save fixed-size chunks (Baseline) ---
    print('--- Creating Baseline Fixed-Size Chunks ---')
    fixed_chunks = create_fixed_size_chunks(PDF_PATH)
    with open(FIXED_CHUNK_PATH, 'w', encoding='utf-8') as f:
        json.dump(fixed_chunks, f, indent=4, ensure_ascii=False)
    print(f'✅ Created {len(fixed_chunks)} fixed-size chunks -> {FIXED_CHUNK_PATH}\n')
    
    # --- Process and save section-based chunks (Optimized) ---
    print('--- Creating Final Section-Based Chunks ---')
    section_chunks_raw = create_section_based_chunks(PDF_PATH)
    final_section_chunks = split_oversized_chunks(section_chunks_raw)
    
    with open(SECTION_CHUNK_PATH, 'w', encoding='utf-8') as f:
        json.dump(final_section_chunks, f, indent=4, ensure_ascii=False)
        
    print(f'✅ Created {len(final_section_chunks)} final section-based chunks -> {SECTION_CHUNK_PATH}\n')

--- Creating Baseline Fixed-Size Chunks ---
✅ Created 361 fixed-size chunks -> ../data/processed/fixed_chunks.json

--- Creating Final Section-Based Chunks ---
✅ Created 593 final section-based chunks -> ../data/processed/section_chunks.json
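
Chunk counts alone don't say much about how the two strategies differ, so a quick look at the token distribution can help. The sketch below is one way to do that with the standard library; `summarize_token_counts` is a hypothetical helper and the toy values are not results from this project.

```python
import statistics
from typing import Dict, List


def summarize_token_counts(chunks: List[Dict]) -> Dict[str, float]:
    """Summarizes chunk sizes; assumes a non-empty list of chunk dictionaries."""
    counts = [c['token_count'] for c in chunks]
    return {
        'chunks': len(counts),
        'min': min(counts),
        'median': statistics.median(counts),
        'max': max(counts),
    }


# Toy values for illustration only.
toy_chunks = [{'token_count': n} for n in (40, 105, 256, 300)]
summary = summarize_token_counts(toy_chunks)
```

Running it on `fixed_chunks` and `final_section_chunks` side by side would show whether the two sets really are comparable in size.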



#### **3.4.1. Sanity Check**

After generating hundreds of chunks, it's good practice to perform a quick sanity check. The next cell does a simple "spot check" on the output of the more complex, section-based strategy.

It prints a few sample chunks from specific subsections. This allows for a quick visual confirmation that the text content and, more importantly, the hierarchical metadata (Part, Chapter, Section, etc.) were captured and assigned correctly.
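
A visual check like this can also be backed by a small automated one. The sketch below assumes that an unsplit chunk tagged with a subsection should start with that subsection's number; `find_subsection_mismatches` is a hypothetical helper and the sample chunks are made up.

```python
from typing import Dict, List


def find_subsection_mismatches(chunks: List[Dict]) -> List[str]:
    """Returns IDs of unsplit chunks whose text does not start with their subsection number."""
    mismatches = []
    for chunk in chunks:
        subsection = chunk.get('subsection', 'N/A')
        # Split chunks are skipped: only their first part starts with the number.
        if subsection == 'N/A' or chunk.get('is_split'):
            continue
        if not chunk['text'].startswith(subsection):
            mismatches.append(chunk['chunk_id'])
    return mismatches


# Made-up chunks for illustration only.
sample_chunks = [
    {'chunk_id': 'a', 'subsection': '2.1.', 'text': '2.1.   The punctuation in an English text', 'is_split': False},
    {'chunk_id': 'b', 'subsection': '2.2.', 'text': 'A full stop marks the end', 'is_split': False},
    {'chunk_id': 'c', 'subsection': 'N/A', 'text': 'Introduction', 'is_split': False},
]
mismatches = find_subsection_mismatches(sample_chunks)
```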

In [9]:
# === VERIFICATION ===
if __name__ == '__main__':
    print('--- VERIFICATION: SAMPLE OUTPUT ---')
    # Verify the output by printing a few key chunks.
    target_subsections = ['2.1.', '2.2.', '2.3.']
    found_count = 0
    for chunk in final_section_chunks:
        if chunk.get('subsection') in target_subsections and found_count < len(target_subsections):
            print(json.dumps(chunk, indent=4))
            found_count += 1

--- VERIFICATION: SAMPLE OUTPUT ---
{
    "chunk_id": "section_based_0006",
    "text": "2.1.   The punctuation in an English text must follow the rules and conventions for  English, which often differ from those applying to other languages. Note in  particular that:\n\u2666   punctuation marks in English are always \u2013 apart from dashes (see  2.17 ) and\nellipsis points (see  2.3 ) \u2013 closed up to the preceding word, letter or number;\n\u2666   stops (. ? ! : ;) are always followed by only a single (not a double) space.",
    "token_count": 105,
    "page_number": 10,
    "method": "section_based",
    "part": "Part I",
    "chapter": "2.   Punctuation",
    "section": "N/A",
    "subsection": "2.1.",
    "source_document": "English_Style_Guide-European_Commission.pdf",
    "is_split": false
}
{
    "chunk_id": "section_based_0008",
    "text": "2.2.   A full stop marks the end of a sentence. All footnotes end with a full stop. Do  not use a full stop at the end of a heading.\n

---

## **4. Next Steps**

This notebook has successfully processed the raw PDF source document and generated two distinct sets of structured text chunks: a simple baseline using a fixed-size strategy, and a more advanced one using a layout-aware, section-based strategy.

The output is two clean JSON files (`fixed_chunks.json` and `section_chunks.json`), which are the foundation for the next stage of our RAG pipeline.

In the next notebook, `02_embedding_and_vectordb_setup.ipynb`, we will:
1.  Load these processed chunks.
2.  Use the `BAAI/bge-large-en-v1.5` model to convert each chunk's text into a vector embedding.
3.  Set up a Weaviate vector database to store these embeddings for efficient retrieval.