## **1. Introduction**

This notebook builds the final **Retrieval-Augmented Generation (RAG)** pipeline, using the insights gained from our retrieval experiments in Notebook 03. It combines the optimized retriever (section-based chunks with hybrid search) with a large language model (LLM) to generate natural language answers based solely on the retrieved context from the European Commission's English Style Guide. We will also evaluate the impact of prompt engineering on reducing hallucinations.

---

## **2. Initializing the RAG System**

The first step is to set up the environment by importing the necessary libraries and defining the core configurations for our RAG pipeline. 

### **2.1. Configuration Details**

* **LLM Configuration**: This section specifies how we'll access the large language model (LLM) for the generation step.
    * **Amazon Bedrock** was chosen as the LLM provider because it's a **fully managed service**. This simplifies deployment and scaling, eliminating the need to manage infrastructure. Bedrock also offers easy access to a **variety of foundation models** from different providers through a single API, making it flexible for future experimentation.
    * The model selected was **Anthropic's Claude 3 Sonnet** (`CLAUDE_MODEL_ID`). Sonnet offers a good balance between **strong performance** (especially in reasoning and following instructions, which is essential for RAG) and **cost-effectiveness**. Its 200K-token context window comfortably holds several retrieved chunks plus the prompt, and it is well-suited for tasks like summarizing retrieved information and generating accurate answers based on provided text.
* **RAG Project Configuration**: This section defines the parameters specific to our RAG setup, namely the Weaviate collection name (`COLLECTION_NAME`), the embedding model used (`EMBEDDING_MODEL_NAME`), and the path for saving results (`RESULTS_FILE_PATH`).

In [1]:
# === IMPORTS AND CONFIGURATION ===
import weaviate
import weaviate.classes.query as wq
from sentence_transformers import SentenceTransformer
import torch
import boto3
from botocore.exceptions import ClientError
import json
import os
import re
import time
from datetime import datetime

# --- LLM Configuration ---
AWS_REGION = 'us-east-1'
CLAUDE_MODEL_ID = 'anthropic.claude-3-sonnet-20240229-v1:0'

# --- RAG Project Configuration ---
COLLECTION_NAME = 'StyleGuide'
EMBEDDING_MODEL_NAME = 'BAAI/bge-large-en-v1.5'
RESULTS_FILE_PATH = '../results/rag_evaluation_results.json'

### **2.2. Connecting to Bedrock and Weaviate**

Next, we load the **BGE** embedding model into memory and connect to our two external services: **Weaviate** for retrieval and **Amazon Bedrock** for generation.

In [2]:
# === INITIALIZE MODELS AND CONNECTIONS ===
print('--- Initializing Models and Connecting to Services ---')

if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'
embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)

bedrock_runtime = boto3.client(service_name='bedrock-runtime', region_name=AWS_REGION)
print(f'✅ Connected to Amazon Bedrock in region {AWS_REGION}')

client_weaviate = weaviate.connect_to_local()
collection = client_weaviate.collections.get(COLLECTION_NAME)
print(f'✅ Connected to Weaviate collection \'{COLLECTION_NAME}\'.')

--- Initializing Models and Connecting to Services ---
✅ Connected to Amazon Bedrock in region us-east-1
✅ Connected to Weaviate collection 'StyleGuide'.


### **2.3. Exponential Backoff for Bedrock Calls**

When interacting with cloud-based LLMs like those on Amazon Bedrock, it's common to encounter temporary rate limits. To handle this, the function below implements an **exponential backoff mechanism**. If a request fails due to throttling (`ThrottlingException`), it waits for a progressively longer period before retrying: with a base delay of 2 seconds and a maximum of 5 attempts, the waits are 2, 4, 8, and 16 seconds. Any other error is re-raised immediately. This makes our pipeline more resilient to transient rate limiting.

In [3]:
# === BEDROCK INVOCATION WITH EXPONENTIAL BACKOFF ===
def invoke_bedrock_with_backoff(body: str, model_id: str) -> dict:
    """
    Invokes a Bedrock model with an exponential backoff retry mechanism.
    Handles ThrottlingException errors by retrying the API call.
    """
    max_retries = 5
    base_delay = 2  # seconds
    
    for i in range(max_retries):
        try:
            response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
            return response
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                if i < max_retries - 1:
                    delay = base_delay * (2 ** i)
                    print(f'ThrottlingException caught. Retrying in {delay} seconds...')
                    time.sleep(delay)
                else:
                    print('Max retries reached for ThrottlingException. Aborting.')
                    raise e
            else:
                raise e
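
The retry schedule can be checked in isolation, without calling Bedrock. The sketch below reproduces the delay arithmetic used above and adds an optional "full jitter" variant, which is not part of this notebook and is shown only as a common refinement. The helper names `backoff_delays` and `jittered_delay` are hypothetical.

```python
import random

def backoff_delays(max_retries: int = 5, base_delay: float = 2.0) -> list[float]:
    """Waits between attempts: there is one fewer wait than there are attempts."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries - 1)]

def jittered_delay(delay: float, rng: random.Random) -> float:
    """Full-jitter variant (an assumption, not in the notebook): wait a random time in [0, delay]."""
    return rng.uniform(0, delay)

schedule = backoff_delays()
print(schedule)       # [2.0, 4.0, 8.0, 16.0]
print(sum(schedule))  # 30.0 seconds of waiting in the worst case
```

In the worst case, a single throttled call therefore blocks for 30 seconds before the exception is re-raised.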

### **2.4. Defining the Retrieval Function**

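This function defines the **retriever component** of our RAG pipeline. Based on our findings in the previous notebook, it runs the optimized **hybrid search strategy (`alpha=0.7`)** over the **`section-based` chunks** in Weaviate. In Weaviate, `alpha` balances the two halves of a hybrid query: `1.0` is pure vector search and `0.0` is pure keyword (BM25) search, so `0.7` leans toward semantic similarity while still rewarding exact term matches. Given a user's question, the function fetches the 5 most relevant text chunks from the style guide and joins them into a single context string for the LLM.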

In [4]:
# === RETRIEVER FUNCTION ===
def get_retrieved_context(question: str, limit: int = 5) -> str:
    """Retrieves context from Weaviate using the optimized hybrid search strategy."""
    question_vector = embedding_model.encode(question).tolist()
    
    response = collection.query.hybrid(
        query=question,
        vector=question_vector,
        alpha=0.7,
        limit=limit,
        filters=wq.Filter.by_property('method').like('section_based*'),
        return_properties=['text']
    )
    
    if not response.objects:
        return 'No relevant context was found in the style guide.'

    context = "\n\n---\n\n".join([obj.properties['text'] for obj in response.objects])
    return context
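
To make the role of `alpha` concrete, here is a simplified sketch of how a hybrid score can blend a vector score with a keyword score. It is an illustration only, not Weaviate's actual fusion implementation: the min-max normalization, the function names (`min_max_normalize`, `blend_scores`), and the sample scores are all assumptions made for this example.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and keyword scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def blend_scores(vector_scores: dict[str, float],
                 keyword_scores: dict[str, float],
                 alpha: float = 0.7) -> dict[str, float]:
    """Weighted blend: alpha=1.0 is vector-only, alpha=0.0 is keyword-only."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    return {
        chunk_id: alpha * vec.get(chunk_id, 0.0) + (1 - alpha) * kw.get(chunk_id, 0.0)
        for chunk_id in set(vec) | set(kw)
    }

# Hypothetical scores for four chunks; not taken from the real collection.
vector_scores = {'chunk_a': 0.82, 'chunk_b': 0.75, 'chunk_c': 0.40}
keyword_scores = {'chunk_b': 7.1, 'chunk_c': 3.0, 'chunk_d': 5.2}

blended = blend_scores(vector_scores, keyword_scores, alpha=0.7)
ranked = sorted(blended, key=blended.get, reverse=True)
print(ranked)  # ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

In this toy data, `chunk_b` wins because it scores well on both signals, while `chunk_a` (strong on vectors only) still outranks `chunk_d` (keywords only) because `alpha=0.7` favors the vector side.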

---

## **3. Prompt Engineering for RAG**

Although the retriever provides the relevant context, the quality of the final answer heavily depends on how we instruct the LLM to use that context. This section defines two distinct **prompt templates** to compare their effectiveness, particularly in reducing hallucinations and ensuring the LLM adheres strictly to the provided information.

### **3.1. The Basic Prompt**

The first template, `BASIC_PROMPT_TEMPLATE`, is a simple, direct instruction. It tells the LLM to answer the question based only on the provided context. This serves as our **baseline** to see how the LLM performs with minimal guidance.

### **3.2. The Engineered Prompt**

The second template, `ENGINEERED_PROMPT_TEMPLATE`, incorporates several **prompt engineering best practices** designed to improve accuracy and control the LLM's output for a RAG task:

1.  **Role Definition:** It assigns a specific persona ("You are a specialist assistant for the European Commission's English Style Guide"), priming the model for the task.
2.  **Strict Guidelines:** It provides explicit, numbered rules that constrain the LLM's behavior:
      * Answer *only* from the context.
      * Quote directly when possible.
      * Provide a specific fallback response if the answer isn't in the context.
      * Explicitly forbid the use of external knowledge.
      * Demand precision about what the guide says versus doesn't say.
3.  **Clear Structure:** It uses separators (`---`) to clearly delineate the context section from the question, making it easier for the model to parse the input.

This engineered prompt aims to minimize the chances of the LLM hallucinating or deviating from the retrieved source material.

In [5]:
# === PROMPT TEMPLATES ===

# A. Basic Prompt (Baseline)
BASIC_PROMPT_TEMPLATE = '''
Answer the following question based only on the provided context.

Context:
{context}

Question:
{question}

Answer:
'''

# B. Engineered Prompt (Optimized)
ENGINEERED_PROMPT_TEMPLATE = '''
You are a specialist assistant for the European Commission's English Style Guide.

STRICT GUIDELINES:
1. Answer ONLY using information explicitly stated in the provided context.
2. Quote directly from the context using quotation marks when possible.
3. If the context lacks information to answer the question, respond: "The European Commission's English Style Guide does not address this topic in the provided sections."
4. Never use general knowledge about style guides, EU practices, or writing conventions.
5. Be precise about what the guide says vs. what it doesn't say.

Context from the Style Guide:
---
{context}
---

Question: {question}

Response:
'''
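
Both templates are filled with `str.format`, which has one behavior worth knowing when the retrieved text or the template itself contains curly braces. The short sketch below uses a hypothetical template and made-up strings to show it: braces inside substituted values are left untouched, while literal braces inside the template must be doubled.

```python
TEMPLATE = '''
Context:
---
{context}
---

Question: {question}
'''

# Braces inside the substituted *values* are safe: format() does not re-scan them.
context = 'Curly brackets {like these} are rare in running text.'
prompt = TEMPLATE.format(context=context, question='When are curly brackets used?')
print(prompt)

# Literal braces inside the *template* must be written as {{ and }}.
JSON_TEMPLATE = 'Reply only with {{"final_verdict": "..."}} for this answer: {answer}'
filled = JSON_TEMPLATE.format(answer='Use square brackets for insertions.')
print(filled)
```

This matters for any template that embeds a JSON skeleton alongside `{placeholder}` fields.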

---

## **4. The RAG Pipeline Function**

The next function, `get_rag_response`, brings all the components together to execute the complete RAG pipeline for a given question.

It handles the process in several steps:

1.  **Retrieve Context**: First, it calls the `get_retrieved_context` function (defined in Section 2.4) to fetch the relevant text chunks from Weaviate using our optimized hybrid search.
2.  **Format Prompt**: It then takes the retrieved context and the user's question and inserts them into the chosen `prompt_template` (either the basic or engineered one).
3.  **Prepare LLM Payload**: It constructs the JSON payload required by the Amazon Bedrock API for the Claude 3 Sonnet model. This includes the formatted prompt and an option to set the `temperature` parameter, which controls the creativity (randomness) of the LLM's response. We'll use this later to force more deterministic outputs for the engineered prompt.
4.  **Invoke LLM**: It calls the `invoke_bedrock_with_backoff` function (defined in Section 2.3) to send the request to the LLM, automatically handling any potential throttling issues.
5.  **Extract Answer**: Finally, it parses the JSON response from the LLM to extract the generated text answer.
6.  **Return Results**: The function returns both the final answer and the context that was used to generate it, which is crucial for our later evaluation steps.

In [6]:
# === RAG PIPELINE FUNCTION ===
def get_rag_response(question: str, prompt_template: str, temperature: float = None) -> dict:
    """
    Generates an answer using a RAG pipeline and returns the answer and context.
    Optionally sets the temperature for the LLM call to control creativity.
    """
    context = get_retrieved_context(question)
    final_prompt = prompt_template.format(context=context, question=question)
    
    payload = {
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': final_prompt}]}]
    }
    
    if temperature is not None:
        payload['temperature'] = temperature
        
    body = json.dumps(payload)
    
    response = invoke_bedrock_with_backoff(body=body, model_id=CLAUDE_MODEL_ID)
    
    response_body = json.loads(response.get('body').read())
    answer = response_body.get('content')[0].get('text')
    
    return {'answer': answer, 'context': context}
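
The payload construction can be exercised without calling Bedrock. The sketch below factors it into a hypothetical helper, `build_claude_payload`, that mirrors the fields used above, so the shape of the request body can be inspected offline.

```python
import json
from typing import Optional

def build_claude_payload(prompt: str,
                         max_tokens: int = 1024,
                         temperature: Optional[float] = None) -> str:
    """Builds the JSON request body in the same shape as get_rag_response."""
    payload = {
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': max_tokens,
        'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': prompt}]}],
    }
    # Compare against None explicitly: 0.0 is falsy, so `if temperature:` would drop it.
    if temperature is not None:
        payload['temperature'] = temperature
    return json.dumps(payload)

default_body = json.loads(build_claude_payload('Hello'))
deterministic_body = json.loads(build_claude_payload('Hello', temperature=0.0))

print('temperature' in default_body)      # False
print(deterministic_body['temperature'])  # 0.0
```

The explicit `is not None` check is the detail that lets `temperature=0.0` reach the API; when the key is omitted, the model's default temperature applies.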

---

## **5. The Hallucination Evaluation Framework**

A good answer must be **grounded** in the source material and **free of hallucinations**. Since manually checking every answer isn't scalable, this section sets up an automated evaluation framework using an **LLM as a judge**.

### **5.1. The Evaluator Prompt**

The core of this framework is the **`EVALUATOR_PROMPT_TEMPLATE`**. This prompt instructs a second LLM (Claude 3 Sonnet again, but acting as a judge) to meticulously compare the generated answer against the source context provided to the RAG pipeline.

Key instructions for the evaluator LLM include:

  * Analyzing the answer **sentence by sentence**.
  * Determining if each sentence is **fully supported** by the source context.
  * Also considering a statement "supported" if it accurately claims that information is *not present* in the context.
  * Outputting the analysis in a structured **JSON format**, including a sentence-level breakdown and a `final_verdict` (`SUPPORTED`, `PARTIALLY_SUPPORTED`, or `NOT_SUPPORTED`).

### **5.2. The Evaluation Function**

The **`evaluate_for_hallucination`** function handles this evaluation. It takes the original context and the generated answer, inserts them into the `EVALUATOR_PROMPT_TEMPLATE`, and sends this evaluation task to Bedrock (using the same robust backoff mechanism). It then parses the evaluator LLM's JSON response, providing a structured assessment of the answer's factual grounding.

### **5.3. Calculating Metrics**

Finally, the **`calculate_hallucination_metrics`** function takes the structured JSON results from multiple evaluations and calculates quantitative metrics, such as the overall accuracy rate (percentage of fully supported answers) and the hallucination rates (partial or complete). This allows us to easily compare the effectiveness of our basic vs. engineered prompts later on.

In [7]:
# === HALLUCINATION EVALUATION FRAMEWORK ===

EVALUATOR_PROMPT_TEMPLATE = '''
You are a meticulous fact-checker. Your task is to evaluate a generated answer against a provided source context.

Analyze the **Generated Answer** sentence by sentence. For each sentence, determine if the information is fully supported by the **Source Context**.
A statement is "supported" if the source context contains information that directly proves it.
**Crucially, a statement is ALSO considered "supported" if it accurately claims that information is NOT present in the context.**

Your response MUST be in the following JSON format and nothing else:
{{
  "analysis": [
    {{
      "sentence": "The first sentence of the generated answer.",
      "is_supported": boolean,
      "reasoning": "A brief explanation of why the sentence is or is not supported by the source context."
    }}
  ],
  "final_verdict": "SUPPORTED", "PARTIALLY_SUPPORTED", or "NOT_SUPPORTED"
}}

---
Source Context:
{context}
---
Generated Answer:
{answer}
---
'''

def evaluate_for_hallucination(context: str, answer: str) -> dict:
    """Uses an LLM as a judge to evaluate if an answer is grounded in the provided context."""
    evaluator_prompt = EVALUATOR_PROMPT_TEMPLATE.format(context=context, answer=answer)
    
    body = json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 2048,
        'temperature': 0.0,
        'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': evaluator_prompt}]}]
    })
    
    try:
        response = invoke_bedrock_with_backoff(body=body, model_id=CLAUDE_MODEL_ID)
        response_body = json.loads(response.get('body').read())
        evaluation_result_text = response_body.get('content')[0].get('text')
        
        # Extract the JSON object from the LLM's response, ignoring potential extra text.
        # The greedy pattern spans from the first '{' to the last '}'.
        json_match = re.search(r'\{.*\}', evaluation_result_text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group(0))
        else:
            print("Error: No valid JSON object found in the evaluator's response.")
            return {'analysis': [], 'final_verdict': 'EVALUATION_FAILED'}

    except Exception as e:
        # Covers malformed JSON, unexpected response shapes, and Bedrock errors alike.
        print(f'Error during evaluation: {e}')
        return {'analysis': [], 'final_verdict': 'EVALUATION_FAILED'}

def calculate_hallucination_metrics(evaluation_results: list) -> dict:
    """Calculates quantitative metrics from the structured evaluation results."""
    total_responses = len(evaluation_results)
    if total_responses == 0:
        return {}

    # Note: 'EVALUATION_FAILED' results match none of the verdicts below but still
    # count toward total_responses, so the three rates can sum to less than 1.
    supported = sum(1 for r in evaluation_results if r['final_verdict'] == 'SUPPORTED')
    partial = sum(1 for r in evaluation_results if r['final_verdict'] == 'PARTIALLY_SUPPORTED')
    not_supported = sum(1 for r in evaluation_results if r['final_verdict'] == 'NOT_SUPPORTED')

    return {
        'accuracy_rate': supported / total_responses,
        'partial_hallucination_rate': partial / total_responses,
        'hallucination_rate': not_supported / total_responses,
        'total_evaluated': total_responses
    }
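
One limitation of the greedy pattern `\{.*\}` is that it fails if the judge adds trailing text containing a closing brace. As a possible alternative, the sketch below uses the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows. The function name `extract_first_json_object` and the sample response are hypothetical.

```python
import json
import re
from typing import Optional

def extract_first_json_object(text: str) -> Optional[dict]:
    """Returns the first parseable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find('{')
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find('{', start + 1)
    return None

# Hypothetical judge output with chatter before and after the JSON.
sample = (
    'Here is my evaluation:\n'
    '{"analysis": [], "final_verdict": "SUPPORTED"}\n'
    'Note: a stray } can appear in trailing text.'
)

print(extract_first_json_object(sample)['final_verdict'])  # SUPPORTED

# The greedy regex grabs everything up to the last '}', which is not valid JSON here.
greedy = re.search(r'\{.*\}', sample, re.DOTALL).group(0)
try:
    json.loads(greedy)
    greedy_ok = True
except json.JSONDecodeError:
    greedy_ok = False
print(greedy_ok)  # False
```

Whether this is worth adopting depends on how often the judge strays from the requested format; with `temperature=0.0` and a strict prompt, the simpler regex may be sufficient.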

---

## **6. Defining the Test Questions**

Now we need a set of questions to evaluate our RAG pipeline. The **test set** that we'll use combines two types of questions:

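1.  **Factual Questions**: These are loaded directly from the `ground_truth.json` file we used in the previous notebook. We already know the correct answers for these, so they represent cases the system should handle well. Note that only the question text is used here: the evaluation below checks whether each answer is grounded in the retrieved context, not whether it matches the reference answer.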
2.  **Hallucination and Edge Case Questions**: These are new, carefully crafted questions designed to push the limits of the RAG system. They include:
      * Questions about **topics not covered** in the style guide (e.g., "Comic Sans font"). This tests the system's ability to correctly state when information is missing, using the fallback response defined in the engineered prompt.
      * Questions that ask for comparisons with **external knowledge** (e.g., "Chicago Manual of Style"), which the system should refuse based on the engineered prompt's rules.

This combined set of questions provides a comprehensive test bed to evaluate both the accuracy and the robustness of our RAG pipeline, especially when comparing the basic and engineered prompts. The results for each question will be logged for later analysis.

In [8]:
# === DEFINE TEST QUESTIONS ===

# Load factual questions from the ground truth file.
with open('../data/evaluation/ground_truth.json', 'r', encoding='utf-8') as f:
    ground_truth_data = json.load(f)
factual_questions = [q['question'] for q in ground_truth_data['questions']]

# Define questions to test for hallucinations and edge cases.
hallucination_and_edge_case_questions = [
    'What is the official EU-approved font face and size for all official translated documents submitted in Microsoft Word format?',
    'How do the EU\'s rules on capitalizing government bodies differ from the Chicago Manual of Style?',
    'What does the style guide say about using Comic Sans font?',
    'How should translators handle Brexit-related terminology?',
    'What are the differences between US and UK spelling in EU documents?',
]

test_questions = factual_questions + hallucination_and_edge_case_questions
results_log = []

---

## **7. Evaluation and Results**

We're now ready to execute the RAG pipeline using both our basic and engineered prompts against the full set of test questions.

The core experiment loop is executed in the cell below. For each question in our test suite:

1.  It runs the **RAG pipeline** using the **basic prompt**.
2.  It then uses our **LLM-based evaluator** to check the basic prompt's answer for factual grounding.
3.  It repeats steps 1 and 2 using the **engineered prompt** (with `temperature=0.0` for deterministic output).
4.  All results (answers, contexts, evaluations) are stored in the `results_log`.

A 5-second pause is included between questions to respect API rate limits. The output below shows the detailed, question-by-question results, including the generated answers and the structured JSON evaluation from the LLM judge.

In [9]:
# === RUN EXPERIMENT LOOP ===
print(f'--- Running experiments on {len(test_questions)} questions ---')

for i, question in enumerate(test_questions):
    print(f'\n====================== QUESTION {i+1}/{len(test_questions)} ======================')
    print(f'QUESTION: {question}')
    print(f'----------------------------------------------------------\n')
    
    # Test with Basic Prompt (Default Temperature)
    print('--- 1. ANALYSIS OF BASIC PROMPT (Default Temperature) ---')
    basic_response = get_rag_response(question, BASIC_PROMPT_TEMPLATE)
    print(f"GENERATED ANSWER:\n{basic_response['answer']}\n")
    
    basic_evaluation = evaluate_for_hallucination(basic_response['context'], basic_response['answer'])
    print('--- Hallucination Evaluation ---')
    print(json.dumps(basic_evaluation, indent=2))
    
    results_log.append({
        'question': question, 'prompt_type': 'basic',
        'response': basic_response, 'evaluation': basic_evaluation
    })
    
    print('\n' + '-'*60 + '\n')
    
    # Test with Engineered Prompt (Temperature = 0.0)
    print('--- 2. ANALYSIS OF ENGINEERED PROMPT (Temperature = 0.0) ---')
    engineered_response = get_rag_response(question, ENGINEERED_PROMPT_TEMPLATE, temperature=0.0) 
    print(f"GENERATED ANSWER:\n{engineered_response['answer']}\n")
    
    engineered_evaluation = evaluate_for_hallucination(engineered_response['context'], engineered_response['answer'])
    print('--- Hallucination Evaluation ---')
    print(json.dumps(engineered_evaluation, indent=2))
    
    results_log.append({
        'question': question, 'prompt_type': 'engineered',
        'response': engineered_response, 'evaluation': engineered_evaluation
    })
    
    # Pause to respect API rate limits
    if (i + 1) < len(test_questions):
        print('\nPausing for 5 seconds to avoid rate limiting...')
        time.sleep(5)

print('\n\n--- All experiments complete! ---')

--- Running experiments on 10 questions ---

QUESTION: When is it correct to use square brackets?
----------------------------------------------------------

--- 1. ANALYSIS OF BASIC PROMPT (Default Temperature) ---
GENERATED ANSWER:
According to the provided context, square brackets are used for the following purposes:

1. To make editorial insertions or explanations in quoted material.
Example: 'They [the members of the committee] voted in favour of the proposal.'

2. In administrative drafting, to indicate optional passages or those still open to discussion.

3. In mathematical formulae (but not in text), to enclose round brackets.
Example: 7[4 ab – (2 nm × 6 bm ) × nm ] + 7 a = 1240

4. When translating, to insert translations or explanations after names or titles left in the original language.

5. To mark errors with '[sic]' or insert missing text when directly quoting a piece of text, as mentioned in section 1.2.

So, in summary, square brackets are primarily used for editorial i [... output truncated ...]

The detailed output above provides a qualitative view of how each prompt performed. We can see specific examples, like in **Question 7** (comparing EU rules to the Chicago Manual of Style), where the **engineered prompt correctly refused** to answer because the information wasn't in the context, while the **basic prompt attempted to synthesize** an answer before admitting the comparison wasn't present. Similar refusals by the engineered prompt occurred for **Question 6** (EU font face) and **Question 8** (Comic Sans).

To get a clear, quantitative comparison, the next cell processes the `results_log` using our `calculate_hallucination_metrics` function. It calculates the overall accuracy and hallucination rates for each prompt type.

In [10]:
# === CALCULATE AND DISPLAY FINAL METRICS ===
basic_results = [r['evaluation'] for r in results_log if r['prompt_type'] == 'basic']
engineered_results = [r['evaluation'] for r in results_log if r['prompt_type'] == 'engineered']

basic_metrics = calculate_hallucination_metrics(basic_results)
engineered_metrics = calculate_hallucination_metrics(engineered_results)

print('--- Quantitative Hallucination Analysis ---')
print('='*80)
print(f"{'Metric':<35} | {'Basic Prompt':<20} | {'Engineered Prompt'}")
print('-' * 80)
print(f"{'Accuracy Rate (Fully Supported)':<35} | {basic_metrics.get('accuracy_rate', 0):<20.2%} | {engineered_metrics.get('accuracy_rate', 0):<20.2%}")
print(f"{'Partial Hallucination Rate':<35} | {basic_metrics.get('partial_hallucination_rate', 0):<20.2%} | {engineered_metrics.get('partial_hallucination_rate', 0):<20.2%}")
print(f"{'Complete Hallucination Rate':<35} | {basic_metrics.get('hallucination_rate', 0):<20.2%} | {engineered_metrics.get('hallucination_rate', 0):<20.2%}")
print('='*80)

--- Quantitative Hallucination Analysis ---
Metric                              | Basic Prompt         | Engineered Prompt
--------------------------------------------------------------------------------
Accuracy Rate (Fully Supported)     | 90.00%               | 100.00%             
Partial Hallucination Rate          | 10.00%               | 0.00%               
Complete Hallucination Rate         | 0.00%                | 0.00%               


The quantitative results favor the engineered prompt. The **engineered prompt achieved a 100% accuracy rate** (10 of 10 answers judged fully supported by the retrieved context). The **basic prompt** also performed well (90% accuracy) but showed a **10% rate of partial hallucination**, which on a 10-question test set corresponds to a single answer.

This single instance of partial hallucination occurred in **Question 7**, where the basic prompt inferred a comparison to the "Chicago Manual of Style" that was not present in the source context. The engineered prompt correctly identified that this comparison was outside the scope of the provided material and declined to answer. This case perfectly illustrates how the engineered prompt's guidelines prevent the model from generating unsupported inferences.

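This suggests that providing **specific, strict guidelines** improves the reliability and trustworthiness of the RAG system's output. Within the scope of this evaluation, the engineered prompt eliminated the partial hallucination observed with the basic prompt. Note that the engineered runs also used `temperature=0.0` while the basic runs used the default temperature, so the difference cannot be attributed to the prompt wording alone.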

As the final step in this notebook, the cell below **saves the experimental results**, including the detailed logs for each question, into a JSON file together with a timestamp and the run settings. It also closes the connection to the Weaviate client to release resources.

In [11]:
# === LOG RESULTS AND CLOSE CONNECTION ===

# Compile the full log with a timestamp and metadata.
full_log = {
    'timestamp': datetime.now().isoformat(),
    'model_id': CLAUDE_MODEL_ID,
    'retriever_settings': {'alpha': 0.7, 'limit': 5},
    'results': results_log
}

# Ensure the results directory exists and write the log file.
os.makedirs('../results', exist_ok=True)
with open(RESULTS_FILE_PATH, 'w', encoding='utf-8') as f:
    json.dump(full_log, f, indent=2, ensure_ascii=False)

print(f'✅ Full experiment log saved to: {RESULTS_FILE_PATH}')

# Close the Weaviate client connection to release resources.
client_weaviate.close()
print('✅ Weaviate client connection closed.')

✅ Full experiment log saved to: ../results/rag_evaluation_results.json
✅ Weaviate client connection closed.


## **8. Conclusion and Next Steps**

This notebook demonstrated that **prompt engineering has a measurable effect on RAG system performance**. The engineered prompt achieved 100% accuracy on the 10-question test set, with every answer judged fully grounded in the retrieved context. In contrast, the baseline prompt showed a 10% rate of partial hallucination (one answer), highlighting the importance of clear, strict instructions for LLM guidance.

It's important to acknowledge that **a production-level system would require more extensive validation**. This typically involves a larger test suite, human review or multiple LLM judges, and repeated runs to ensure statistical robustness, even when using low temperature settings for consistency.
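
To give a rough sense of how much uncertainty a 10-question test set carries, the sketch below computes a 95% Wilson score interval for each prompt's accuracy. This is an added illustration, not part of the original evaluation, and it treats each question as an independent trial, which is itself an assumption. The function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError('n must be positive')
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Counts taken from the results table: 9/10 (basic) and 10/10 (engineered).
for label, supported in [('basic', 9), ('engineered', 10)]:
    low, high = wilson_interval(supported, 10)
    print(f'{label}: {supported}/10 fully supported, 95% interval [{low:.2f}, {high:.2f}]')
```

The two intervals (roughly 0.60 to 0.98 and 0.72 to 1.00) overlap heavily, which is why a larger test suite and repeated runs are needed before drawing firm conclusions.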

### **8.1. Deployed Application** 

Based on these findings, and the optimal chunking (`section-based`) and retrieval (`hybrid search, alpha=0.7`) strategies identified earlier, I developed a **Streamlit web application**. This app provides a chat interface for querying the EU Style Guide, utilizing Weaviate for retrieval and Amazon Bedrock (Claude 3 Sonnet) for generation with the optimized RAG pipeline. The application was containerized with **Docker** and deployed on **Amazon EC2**. 

The live deployment has been discontinued to manage costs, but a demonstration is available here: [**YouTube App Demo Video**](https://youtu.be/WarXeIrMPQI)

### **8.2. Future Work**

While this project successfully built a complete RAG pipeline, the system can be scaled and enhanced with additional features. Here are some key areas for future development:

  * **Expand the Knowledge Base:** Having successfully processed a complex PDF like the EU Style Guide, the pipeline is proven capable and can be readily expanded. It's straightforward to incorporate other useful resources for translators, such as internal **brand guidelines**, structured **terminology or regulatory databases**, or **Translation Memories (TMs)** to ensure consistency.

  * **Implement Advanced Retrieval Strategies:** The current hybrid search is effective, but retrieval accuracy can be pushed even further with more sophisticated techniques.

      * **Query Transformation:** Implementing a preliminary step to rewrite and expand user queries offers benefits. **Query rewriting** could fix typos, simplify complex phrasing, or rephrase questions for clarity. **Query expansion** could generate multiple variations of a question by adding synonyms or different phrasings, then run retrieval on all of them to maximize the chance of finding the most relevant context.
      * **Advanced Filtering and Ranking:** Result selection can also be improved. **Recursive retrieval** could fetch information at multiple levels, starting with high-level summaries and then drilling down into more specific paragraphs. After the initial retrieval, a **re-ranker** (a more powerful but slower model) could be used to re-score the top results, filtering out less relevant documents to ensure only the highest-quality context is passed to the LLM.

  * **Optimize the LLM Context:** To improve efficiency and focus, implementing **context compression** is an option. This involves using techniques to remove redundant or irrelevant sentences from the retrieved chunks *before* they are sent to the LLM. This makes the context shorter and denser, which can improve response quality and reduce processing costs.

  * **Operationalize and Measure Performance:** Moving this from a proof-of-concept to a real-world tool requires operationalization. This includes implementing **optimization strategies** like caching for frequent queries and their embedding results to reduce latency and save costs. More importantly, building a **monitoring dashboard** is necessary to track not just technical performance, but key **business metrics**. This means measuring user satisfaction through ratings (e.g., setting a target of >4.0/5.0), tracking task completion rates (did the user find their answer?), and monitoring engagement through follow-up questions to build a system that continuously learns and improves.
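
As a minimal sketch of the caching idea mentioned above, the snippet below wraps an embedding function with `functools.lru_cache` so that repeated questions are embedded only once. The names `make_cached_embedder` and `fake_embed` are hypothetical, and `fake_embed` merely stands in for a real call such as `embedding_model.encode`.

```python
from functools import lru_cache
from typing import Callable, Sequence

def make_cached_embedder(embed_fn: Callable[[str], Sequence[float]], maxsize: int = 1024):
    """Wraps embed_fn with an in-memory LRU cache keyed by the query text."""
    @lru_cache(maxsize=maxsize)
    def cached(text: str) -> tuple:
        # Store a tuple so the cached value is immutable.
        return tuple(embed_fn(text))
    return cached

calls = []

def fake_embed(text: str) -> list:
    """Stand-in for a real embedding model; records each call it receives."""
    calls.append(text)
    return [float(len(text)), 0.0]

embed = make_cached_embedder(fake_embed)
first = embed('When is it correct to use square brackets?')
second = embed('When is it correct to use square brackets?')

print(len(calls))               # 1: the second call was served from the cache
print(embed.cache_info().hits)  # 1
```

An in-process cache like this is lost on restart and is not shared between workers, so a deployed application would more likely use an external store; the sketch only illustrates the principle.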

Overall, this project was a fantastic opportunity to put RAG concepts into practice, from data processing to evaluation and deployment. My next project will focus on exploring **Graph RAG**, so stay tuned!