<a href="https://colab.research.google.com/github/DennisAugustineJose/FCA-Handbook-RAG-System/blob/main/FCA_Handbook_RAG_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Financial Conduct Authority Retrieval-Augmented Generation (FCA RAG)

## Introduction

The increasing complexity of financial regulations and compliance requirements necessitates advanced tools to support professionals and regular individuals in navigating vast regulatory documents efficiently. The Financial Conduct Authority (FCA) is the UK’s primary financial regulatory body, responsible for regulating financial firms to ensure market integrity, protect consumers, and promote competition within the financial services industry. The FCA Handbook is a comprehensive and evolving set of rules and guidelines issued by the FCA that govern the conduct and operations of firms and individuals in the UK financial sector. Given the Handbook’s volume, technical language, and frequent updates, manual retrieval and interpretation of relevant information is time-consuming, error-prone, and often impractical for professionals who need quick and reliable answers.

This project develops a Retrieval-Augmented Generation (RAG) system specifically designed for the FCA Handbook, the key regulatory framework for UK financial services. The system leverages the complementary strengths of retrieval techniques and large language models (LLMs) to deliver accurate, context-aware answers to complex regulatory queries. By retrieving relevant document segments and conditioning an LLM on this focused context, the model overcomes challenges of scale, vocabulary, and ambiguity inherent in large regulatory corpora.

Central to the approach is a hybrid retrieval pipeline that combines semantic search using dense embeddings with keyword-based matching, enhanced by query rewriting to clarify user intent. A neural reranker refines retrieval results, while a dynamic context builder manages input length constraints to maximise the quality of generated responses. This integrated pipeline offers a robust and adaptable solution for navigating and interpreting extensive financial regulations.
This report details the design, implementation, and evaluation of the RAG system, establishing a reproducible framework for automated question answering on regulatory texts.


## The Dataset

The foundation of this RAG system is a domain-specific dataset derived from the FCA Handbook, which serves as the primary regulatory framework governing financial compliance in the United Kingdom. While this implementation focuses on the UK context, the architecture is adaptable by substituting the FCA Handbook with equivalent regulatory documents from other jurisdictions. The model can support compliance analysis across different national or regional financial authorities this way.

To ensure access to the most up-to-date regulatory content, the FCA Handbook was dynamically retrieved using direct PDF URLs, rather than relying on a manually maintained local directory of downloaded documents. This approach eliminates the need for ongoing manual updates and version tracking, as the latest version is automatically fetched during execution. By accessing the source documents programmatically, the system avoids the risk of referencing outdated material, thereby enhancing reliability and relevance. While an official API would have been the ideal mechanism for structured and real-time access, the FCA does not currently offer such an interface. As a result, this method of direct retrieval represents a practical and effective alternative, enabling a pseudo-live dataset rather than a static one, helping maintain a single source of truth.

Although the dataset does not explicitly include other documents such as acts passed by the UK Parliament or related legislative instruments, the FCA Handbook is shaped by and closely aligned with these statutory sources. The regulations, rules, and guidance issued by the FCA are developed in accordance with the legislative mandates set out in primary legislation. As a result, the system remains grounded in the broader legal context, ensuring that compliance guidance derived from the Handbook effectively reflects and enforces the intentions of the UK’s financial regulatory framework.


## RAG Development and Enhancements

The FCA RAG system is designed to enable precise, context-aware question answering over the FCA Handbook, which governs financial regulation in the UK. The first step involved preprocessing the raw data by converting the handbook’s PDF content into coherent, standalone chunks, typically at the paragraph or subsection level. These chunks were cleaned to remove duplicates, low-value content, and artefacts, and then enriched with metadata such as section titles and source URLs to enhance interpretability and traceability.

Each chunk was then encoded into dense vector embeddings using the BGE model, which were indexed via FAISS for efficient semantic retrieval. In parallel, a sparse retrieval pipeline using BM25 was implemented to ensure strong keyword-level matching, particularly important in the regulatory domain, where specific terminology carries legal weight.

To improve the quality of incoming queries, a rewriting module powered by the OpenHermes model was introduced. OpenHermes, an instruction-tuned LLM built upon the Mistral-7B architecture, was used to rephrase vague or underspecified questions into richer, more targeted prompts. These rewritten queries were sent to both FAISS and BM25, and their outputs were merged and reranked using a neural scoring model. This reranking prioritised the most contextually relevant chunks, ensuring that the system selected the best supporting evidence for the LLM.

The generation step involved dynamically building an input context using the top-ranked chunks, constrained within the model’s context window. This process balanced the need for broad informational coverage with the requirement for tightly focused relevance. The LLM then generated answers conditioned on this context and the user query, with controlled decoding settings to maintain factuality and coherence.

This layered architecture combining hybrid retrieval, neural reranking, query rewriting, and controlled generation forms the backbone of FCA RAG, enabling accurate, traceable, and high-quality responses in a complex regulatory domain.


### Data Loading and Processing

In [1]:
!pip install nltk PyMuPDF requests



In [2]:
import os
import requests
import fitz  # PyMuPDF
import nltk
import json
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')

# FCA Handbook PDF URLs
FCA_Handbook_PDF_URLS = [
    "https://www.handbook.fca.org.uk/handbook/PRIN.pdf",
    "https://www.handbook.fca.org.uk/handbook/SYSC.pdf",
    "https://www.handbook.fca.org.uk/handbook/COCON.pdf",
    "https://www.handbook.fca.org.uk/handbook/COND.pdf",
    "https://www.handbook.fca.org.uk/handbook/APER.pdf",
    "https://www.handbook.fca.org.uk/handbook/FIT.pdf",
    "https://www.handbook.fca.org.uk/handbook/FINMAR.pdf",
    "https://www.handbook.fca.org.uk/handbook/TC.pdf",
    "https://www.handbook.fca.org.uk/handbook/GEN.pdf",
    "https://www.handbook.fca.org.uk/handbook/FEES.pdf",
    "https://www.handbook.fca.org.uk/handbook/GENPRU.pdf",
    "https://www.handbook.fca.org.uk/handbook/transchedule.pdf",
    "https://www.handbook.fca.org.uk/handbook/INSPRU.pdf",
    "https://www.handbook.fca.org.uk/handbook/MIFIDPRU.pdf",
    "https://www.handbook.fca.org.uk/handbook/MIPRU.pdf",
    "https://www.handbook.fca.org.uk/handbook/IPRU-FSOC.pdf",
    "https://www.handbook.fca.org.uk/handbook/IPRU-INS.pdf",
    "https://www.handbook.fca.org.uk/handbook/IPRU-INV.pdf",
    "https://www.handbook.fca.org.uk/handbook/COBS.pdf",
    "https://www.handbook.fca.org.uk/handbook/ICOBS.pdf",
    "https://www.handbook.fca.org.uk/handbook/MCOB.pdf",
    "https://www.handbook.fca.org.uk/handbook/BCOBS.pdf",
    "https://www.handbook.fca.org.uk/handbook/CMCOB.pdf",
    "https://www.handbook.fca.org.uk/handbook/FPCOB.pdf",
    "https://www.handbook.fca.org.uk/handbook/PDCOB.pdf",
    "https://www.handbook.fca.org.uk/handbook/CASS.pdf",
    "https://www.handbook.fca.org.uk/handbook/MAR.pdf",
    "https://www.handbook.fca.org.uk/handbook/PROD.pdf",
    "https://www.handbook.fca.org.uk/handbook/ESG.pdf",
    "https://www.handbook.fca.org.uk/handbook/SUP.pdf",
    "https://www.handbook.fca.org.uk/handbook/DEPP.pdf",
    "https://www.handbook.fca.org.uk/handbook/DISP.pdf",
    "https://www.handbook.fca.org.uk/handbook/CONRED.pdf",
    "https://www.handbook.fca.org.uk/handbook/COMP.pdf",
    "https://www.handbook.fca.org.uk/handbook/ATCS.pdf",
    "https://www.handbook.fca.org.uk/handbook/COLL.pdf",
    "https://www.handbook.fca.org.uk/handbook/CREDS.pdf",
    "https://www.handbook.fca.org.uk/handbook/CONC.pdf",
    "https://www.handbook.fca.org.uk/handbook/CTPS.pdf",
    "https://www.handbook.fca.org.uk/handbook/FUND.pdf",
    "https://www.handbook.fca.org.uk/handbook/PROF.pdf",
    "https://www.handbook.fca.org.uk/handbook/RCB.pdf",
    "https://www.handbook.fca.org.uk/handbook/SECN.pdf",
    "https://www.handbook.fca.org.uk/handbook/REC.pdf",
    "https://www.handbook.fca.org.uk/handbook/EMIRR.pdf",
    "https://www.handbook.fca.org.uk/handbook/UKLR.pdf",
    "https://www.handbook.fca.org.uk/handbook/PRR.pdf",
    "https://www.handbook.fca.org.uk/handbook/DTR.pdf",
    "https://www.handbook.fca.org.uk/handbook/DISC.pdf",
    "https://www.handbook.fca.org.uk/handbook/EMPS.pdf",
    "https://www.handbook.fca.org.uk/handbook/OMPS.pdf",
    "https://www.handbook.fca.org.uk/handbook/SERV.pdf",
    "https://www.handbook.fca.org.uk/handbook/BENCH.pdf",
    "https://www.handbook.fca.org.uk/handbook/COLLG.pdf",
    "https://www.handbook.fca.org.uk/handbook/EG.pdf",
    "https://www.handbook.fca.org.uk/handbook/FCG.pdf",
    "https://www.handbook.fca.org.uk/handbook/FCTR.pdf",
    "https://www.handbook.fca.org.uk/handbook/PERG.pdf",
    "https://www.handbook.fca.org.uk/handbook/RFCCBS.pdf",
    "https://www.handbook.fca.org.uk/handbook/RPPD.pdf",
    "https://www.handbook.fca.org.uk/handbook/UNFCOG.pdf",
    "https://www.handbook.fca.org.uk/handbook/WDPG.pdf",
    "https://www.handbook.fca.org.uk/handbook/M2G.pdf",
    "https://www.handbook.fca.org.uk/handbook/Glossary.pdf"
]

# Headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/114.0.0.0 Safari/537.36"
}

# Directory to save the PDF files
save_dir = "data/fca_handbook_pdfs"
os.makedirs(save_dir, exist_ok=True)

# Download PDFs
def download_pdfs():
    for url in FCA_Handbook_PDF_URLS:
        filename = url.split("/")[-1]
        filepath = os.path.join(save_dir, filename)
        try:
            print(f"Downloading {filename}...")
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            with open(filepath, "wb") as f:
                f.write(response.content)
            print(f" Saved to {filepath}")
        except Exception as e:
            print(f" Failed to download {filename}: {e}")

# Extract text from PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    return [
        page.get_text().encode("utf-8", errors="ignore").decode("utf-8")
        for page in doc
    ]

# Semantic chunking with NLTK
def semantic_chunk(text, max_tokens=60):
    sents = sent_tokenize(text)
    chunks = []
    chunk = []
    length = 0
    for sent in sents:
        tokens = sent.split()
        if length + len(tokens) > max_tokens:
            if chunk:
                chunks.append(" ".join(chunk))
            chunk = [sent]
            length = len(tokens)
        else:
            chunk.append(sent)
            length += len(tokens)
    if chunk:
        chunks.append(" ".join(chunk))
    return chunks

# Build all chunks and save as JSON
def build_chunks():
    download_pdfs()
    all_chunks = []

    for pdf_file in os.listdir(save_dir):
        if not pdf_file.endswith(".pdf"):
            continue
        print(f"Processing {pdf_file}...")
        file_path = os.path.join(save_dir, pdf_file)
        pages = extract_text_from_pdf(file_path)

        for i, page_text in enumerate(pages):
            chunks = semantic_chunk(page_text)
            for j, chunk in enumerate(chunks):
                clean_chunk = chunk.strip()
                if clean_chunk:
                    all_chunks.append({
                        "text": clean_chunk,
                        "source": pdf_file,
                        "page": i + 1,
                        "chunk_id": f"{pdf_file}-{i+1:03}-{j+1:03}"
                    })

    with open("data/fca_chunks.json", "w", encoding="utf-8") as f:
        json.dump(all_chunks, f, ensure_ascii=False, indent=2)

    print(f" Saved {len(all_chunks)} chunks to data/fca_chunks.json")

if __name__ == "__main__":
    build_chunks()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Downloading PRIN.pdf...
 Saved to data/fca_handbook_pdfs/PRIN.pdf
Downloading SYSC.pdf...
 Saved to data/fca_handbook_pdfs/SYSC.pdf
Downloading COCON.pdf...
 Saved to data/fca_handbook_pdfs/COCON.pdf
Downloading COND.pdf...
 Saved to data/fca_handbook_pdfs/COND.pdf
Downloading APER.pdf...
 Saved to data/fca_handbook_pdfs/APER.pdf
Downloading FIT.pdf...
 Saved to data/fca_handbook_pdfs/FIT.pdf
Downloading FINMAR.pdf...
 Saved to data/fca_handbook_pdfs/FINMAR.pdf
Downloading TC.pdf...
 Saved to data/fca_handbook_pdfs/TC.pdf
Downloading GEN.pdf...
 Saved to data/fca_handbook_pdfs/GEN.pdf
Downloading FEES.pdf...
 Saved to data/fca_handbook_pdfs/FEES.pdf
Downloading GENPRU.pdf...
 Saved to data/fca_handbook_pdfs/GENPRU.pdf
Downloading transchedule.pdf...
 Saved to data/fca_handbook_pdfs/transchedule.pdf
Downloading INSPRU.pdf...
 Saved to data/fca_handbook_pdfs/INSPRU.pdf
Downloading MIFIDPRU.pdf...
 Saved to data/fca_handbook_pdfs/MIFIDPRU.pdf
Downloading MIPRU.pdf...
 Saved to data/fca_ha

### Embedding and FAISS Index Construction

In [3]:
!pip install sentence-transformers faiss-cpu tqdm



In [4]:
from sentence_transformers import SentenceTransformer
import faiss
import json
import os
import numpy as np

# Load and filter JSON chunks
with open("data/fca_chunks.json", "r", encoding="utf-8") as f:
    data = json.load(f)

filtered_data = [chunk for chunk in data if chunk["text"].strip()]
texts = [chunk["text"] for chunk in filtered_data]

print(f" {len(filtered_data)} valid chunks loaded.")

# Load BGE embedding model (cached locally by default)
print(" Loading BGE embedding model...")
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Embed text chunks
print(" Embedding text chunks...")
embeddings = model.encode(texts, show_progress_bar=True, normalize_embeddings=True)

# Save numpy matrix
os.makedirs("rag_index", exist_ok=True)
np.save("rag_index/faiss_bge_embeddings.npy", embeddings)

# Build FAISS index
embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatIP(embedding_dim)
index.add(embeddings)

# Save FAISS index and filtered metadata
faiss.write_index(index, "rag_index/faiss_bge_index.index")

with open("rag_index/faiss_bge_metadata.json", "w", encoding="utf-8") as f:
    json.dump(filtered_data, f, ensure_ascii=False, indent=2)

# Validation Printouts
print(f"\n Sample Chunk:\n{texts[0][:300]}...\n")
print(f" Embedding shape: {embeddings.shape}")
print(" FAISS index and metadata saved.")

 63162 valid chunks loaded.
 Loading BGE embedding model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


 Embedding text chunks...


Batches:   0%|          | 0/1974 [00:00<?, ?it/s]


 Sample Chunk:
EG PDF Archive
I Release 47
G May 2025
www.handbook.fca.org.uk
EG PDF Archive/1
EG PDF Archive...

 Embedding shape: (63162, 768)
 FAISS index and metadata saved.


### Metadata Filtering and Reranker Setup

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load reranker model & tokenizer
reranker_model_name = "BAAI/bge-reranker-base"
tokenizer = AutoTokenizer.from_pretrained(reranker_model_name)
reranker = AutoModelForSequenceClassification.from_pretrained(reranker_model_name)
reranker.eval()

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
reranker.to(device)

def rerank_chunks_torch(query: str, candidate_chunks: list) -> list:
    inputs = tokenizer(
        [query] * len(candidate_chunks),
        candidate_chunks,
        padding=True,
        truncation=True,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        outputs = reranker(**inputs)
        logits = outputs.logits  # shape might be (batch_size, 1)

    # Use sigmoid if logits shape is (batch_size, 1)
    if logits.shape[1] == 1:
        scores = torch.sigmoid(logits).squeeze(-1).cpu().numpy()
    else:
        scores = F.softmax(logits, dim=1)[:, 1].cpu().numpy()

    # Sort indices by descending score
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked_indices

# Example usage after retrieval from FAISS:
def retrieve_and_rerank_torch(query, index, metadata, model, top_k=10, rerank_top_k=5):
    # Embed query
    query_emb = model.encode([query], normalize_embeddings=True)

    # Retrieve top_k from FAISS
    D, I = index.search(query_emb, top_k)
    retrieved_indices = I[0].tolist()
    candidate_chunks = [metadata[i]["text"] for i in retrieved_indices]

    # Rerank retrieved candidates
    reranked_order = rerank_chunks_torch(query, candidate_chunks)

    # Select top rerank_top_k after reranking
    final_indices = [retrieved_indices[i] for i in reranked_order[:rerank_top_k]]
    final_chunks = [metadata[i]["text"] for i in final_indices]

    return final_chunks

# Sample run:
query = "What are the rules for financial conduct?"
results = retrieve_and_rerank_torch(query, index, filtered_data, model)
for res in results:
    print(res[:300], "\n---")


Rule 2: You must act with due skill, care and diligence
.....................................................................................................
Due skill, care and diligence are required, especially where activities might
affect customers or the integrity of the financial system. The f 
---
Subject to regulation by the Financial Conduct
Authority. 
---
These, together with any balance sheet, must give a
true and fair view of the society’s income and expenditure and of its affairs
at the date of the balance sheet. The principles in the applicable Financial Reporting Standards published by
the Financial Reporting Council (available at https://www.fr 
---
Banking: Conduct of
Business sourcebook 
---
Mortgages and Home
Finance: Conduct of
Business Sourcebook 
---


### LLM-Powered Query Rewriting and Answer Generation

In [6]:
# Load OpenHermes-2.5 Mistral
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer once
def load_openhermes_pipeline():
    model_id = "teknium/OpenHermes-2.5-Mistral-7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    return pipeline("text-generation", model=model, tokenizer=tokenizer)

# Initialize once globally
openhermes_pipeline = load_openhermes_pipeline()

# Rewrite a user query
def rewrite_query(query, max_tokens=64):
    prompt = f"""You are a helpful assistant that rewrites search queries to improve retrieval from legal and financial documents.

### Original Query:
{query}

### Rewritten Query:"""

    response = openhermes_pipeline(
        prompt, max_new_tokens=max_tokens, do_sample=False, temperature=0.7, top_p=0.9
    )
    rewritten = response[0]["generated_text"].split("### Rewritten Query:")[-1].strip()
    return rewritten

def generate_answer(context_chunks, question, max_tokens=256):
    context = "\n\n".join(context_chunks)
    prompt = f"""You are an assistant helping answer financial regulation questions. Use the provided context to answer.

### Context:
{context}

### Question:
{question}

### Answer:"""

    response = openhermes_pipeline(
        prompt, max_new_tokens=max_tokens, do_sample=False, temperature=0.7, top_p=0.9
    )
    return response[0]["generated_text"].split("### Answer:")[-1].strip()

### Test Query

test_query = "What are the duties of senior managers?"
rewritten = rewrite_query(test_query)
print(" Rewritten Query:\n", rewritten)

context = [
    "Senior managers are subject to the Senior Managers Regime under SYSC 4 to 6.",
    "They must take reasonable steps to prevent regulatory breaches in their areas."
]
answer = generate_answer(context, test_query)
print(" Generated Answer:\n", answer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


 Rewritten Query:
 What are the responsibilities of senior executives?
 Generated Answer:
 Under the Senior Managers Regime, senior managers have the following duties:

1. Take reasonable steps to prevent regulatory breaches in their areas of responsibility.
2. Ensure that they are aware of the regulatory requirements applicable to their roles and responsibilities.
3. Act with integrity and exercise due skill, care, and diligence in the performance of their duties.
4. Cooperate with the regulator in investigations and provide information as required.
5. Report any concerns or issues to the appropriate authorities in a timely manner.
6. Maintain appropriate records and documentation related to their duties and responsibilities.
7. Comply with any specific requirements or restrictions imposed on them by the regulator.

These duties are designed to ensure that senior managers are accountable for their actions and play a key role in maintaining the integrity and stability of the financial 

### BM25 Index Construction for Lexical Retrieval

In [7]:
!pip install rank-bm25




In [8]:
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

# Tokenize all chunks for BM25
bm25_corpus = [word_tokenize(chunk["text"].lower()) for chunk in filtered_data]
bm25_model = BM25Okapi(bm25_corpus)

def bm25_retrieve(query: str, top_k=10) -> list:
    """
    Retrieve top-k chunks using BM25 scoring.
    """
    tokenized_query = word_tokenize(query.lower())
    scores = bm25_model.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return top_indices

print(f" Total Chunks Embedded: {len(filtered_data)}")
print(f" FAISS Index size: {index.ntotal}")


 Total Chunks Embedded: 63162
 FAISS Index size: 63162


### Hybrid Retrieval with Dense (FAISS) and Sparse (BM25) Search

In [9]:
def hybrid_retrieve(query: str, faiss_index, embedding_model, metadata, top_k=10, bm25_k=10) -> list:
    """
    Perform hybrid retrieval: BM25 + FAISS dense search, return merged candidate chunks.
    """
    # Embed and search dense (FAISS)
    query_embedding = embedding_model.encode([query], normalize_embeddings=True)
    _, I = faiss_index.search(query_embedding, top_k)
    dense_indices = set(I[0].tolist())

    # BM25
    bm25_indices = set(bm25_retrieve(query, top_k=bm25_k))

    # Combine (union or weighted)
    combined_indices = list(dense_indices.union(bm25_indices))
    candidate_chunks = [metadata[i]["text"] for i in combined_indices]

    return candidate_chunks, combined_indices


### Dynamic Chunking and Context Construction

In [10]:
from transformers import AutoTokenizer

# Load tokenizer to count tokens (OpenHermes tokenizer)
tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

def build_context_dynamically(chunks, query, max_tokens=2048):
    """
    Sort chunks by relevance to the query and accumulate until token limit.
    Optionally truncate the final chunk.
    """
    # Score chunks by embedding similarity
    query_emb = model.encode([query], normalize_embeddings=True)
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    scores = (query_emb @ np.array(chunk_embs).T)[0]  # cosine similarity

    # Sort chunks by score descending
    sorted_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sorted_chunks = [chunks[i] for i in sorted_indices]

    selected_chunks = []
    total_tokens = 0

    for chunk in sorted_chunks:
        chunk_tokens = tokenizer.encode(chunk, add_special_tokens=False)
        if total_tokens + len(chunk_tokens) > max_tokens:
            # Truncate last chunk if adding it would exceed token limit
            remaining = max_tokens - total_tokens
            truncated = tokenizer.decode(chunk_tokens[:remaining])
            selected_chunks.append(truncated)
            break
        else:
            selected_chunks.append(chunk)
            total_tokens += len(chunk_tokens)

    return selected_chunks

### Hybrid Retrieval and Context Construction Pipeline

In [11]:
def hybrid_pipeline(query: str, faiss_index, metadata, embedding_model, reranker_model, top_k=50):
    # Step 1: Rewrite the query using OpenHermes
    rewritten = rewrite_query(query)
    print(f" Rewritten Query: {rewritten}")

    # Step 2: Hybrid retrieval using FAISS and BM25
    candidates, _ = hybrid_retrieve(rewritten, faiss_index, embedding_model, metadata, top_k)

    # Step 3: Rerank the candidate chunks using the reranker model
    reranked_indices = rerank_chunks_torch(rewritten, candidates)
    reranked_chunks = [candidates[i] for i in reranked_indices]

    # Step 4: Build context dynamically based on token limit
    final_chunks = build_context_dynamically(reranked_chunks, rewritten, max_tokens=2048)

    return final_chunks

### Answer Generation: LLM-Only vs. Retrieval-Augmented Generation (RAG)

In [12]:
def generate_answer(context_chunks, user_query, max_tokens=256):
    """
    Generate final answer from retrieved chunks using OpenHermes (Mistral).
    """
    context = "\n".join(context_chunks)  # Already token-limited dynamically
    prompt = f"""Answer the following question based on the context below.

Context:
{context}

Question: {user_query}
Answer:"""

    response = openhermes_pipeline(prompt, max_new_tokens=max_tokens, do_sample=True, temperature=0.7)
    return response[0]["generated_text"].split("Answer:")[-1].strip()


In [13]:
# Example query
query = "What are the responsibilities under the SYSC rules?"

def llm_only_answer(query, max_tokens=256):
    """
    Generate answer from the LLM directly without retrieval context.
    """
    prompt = f"Answer the following question:\n\nQuestion: {query}\nAnswer:"
    response = openhermes_pipeline(prompt, max_new_tokens=max_tokens, do_sample=True, temperature=0.7)
    return response[0]["generated_text"].split("Answer:")[-1].strip()


# LLM Only Baseline
llm_answer = llm_only_answer(query)
print(" LLM Only Answer:\n", llm_answer)

# Hybrid Retrieval
chunks, _ = hybrid_retrieve(query, index, model, filtered_data)
final_answer = generate_answer(chunks, query)
print("\n Hybrid RAG Answer:\n", final_answer)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


 LLM Only Answer:
 SYSC is the Senior Management Arrangements, Systems and Controls, which requires firms to have robust systems and controls in place to monitor and manage their business. The main SYSC responsibilities include:

- Ensuring that the firm's senior management has effective arrangements in place to manage the firm's business.
- Establishing and maintaining effective systems and controls to manage the firm's business, including monitoring, reporting and compliance with regulatory requirements.
- Ensuring that the firm's policies and procedures are appropriate and clearly communicated to all relevant employees.
- Ensuring that the firm's systems and controls are regularly reviewed and tested to ensure their effectiveness.
- Ensuring that the firm's employees are trained in the firm's policies and procedures and are aware of their responsibilities under the applicable regulatory requirements.
- Ensuring that the firm's systems and controls are capable of identifying, mitigat

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



 Hybrid RAG Answer:
 A firm is responsible for ensuring that it has effective systems and controls in place to meet its regulatory obligations under the SYSC rules. This includes appointing individuals with the necessary skills, knowledge and expertise to perform their allocated responsibilities. The firm must also maintain accurate records of its systems and controls and be able to demonstrate compliance with the rules upon request by the FCA.


### BERTScore-Based Evaluation of LLM vs. RAG Answers

In [14]:
!pip install bert-score

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bert-score
Successfully installed bert-score-0.3.13


In [15]:
#Evaluation with BERTScore (with wrapper for RAG function)

from bert_score import score as bertscore

#Define Test Queries, Ground Truth Answers, and Ground Truth Chunks ===

test_queries = [
    "What are the requirements for senior management functions?",
    "What is the conduct risk under FCA guidelines?"
]

ground_truth_answers = {
    test_queries[0]: "Senior Management Functions must follow rules in SYSC 4–6...",
    test_queries[1]: "Conduct risk refers to risks from actions of a firm or its staff..."
}

# Replace with actual chunks or identifiable chunk text for overlap testing
ground_truth_chunks = {
    test_queries[0]: ["Senior Management Functions must follow rules in SYSC 4–6"],
    test_queries[1]: ["Conduct risk refers to risks from actions of a firm or its staff"]
}

# Wrap hybrid_retrieve to fix missing args issue
def rag_func_wrapper(query):
    # Adjust 'index', 'model', 'filtered_data' as per your variables
    chunks, _ = hybrid_retrieve(query, faiss_index=index, embedding_model=model, metadata=filtered_data, top_k=5)
    return chunks[0] if chunks else ""

def run_evaluation_bertscore(test_queries, ground_truth_answers, llm_only_func, rag_func):
    print("\n📊 Running BERTScore Evaluation")
    print("="*40)

    llm_preds, rag_preds, refs = [], [], []

    for query in test_queries:
        ref = ground_truth_answers.get(query, "").strip()
        llm_pred = llm_only_func(query).strip()
        rag_pred = rag_func(query).strip()

        refs.append(ref)
        llm_preds.append(llm_pred)
        rag_preds.append(rag_pred)

    # Compute BERTScore (no rescaling)
    llm_p, llm_r, llm_f1 = bertscore(llm_preds, refs, lang="en")
    rag_p, rag_r, rag_f1 = bertscore(rag_preds, refs, lang="en")

    print(f"🔧 LLM Only (avg BERTScore): P={llm_p.mean():.4f}, R={llm_r.mean():.4f}, F1={llm_f1.mean():.4f}")
    print(f"🔧 Hybrid+Rewrite (avg BERTScore): P={rag_p.mean():.4f}, R={rag_r.mean():.4f}, F1={rag_f1.mean():.4f}")

# === Run evaluation ===
run_evaluation_bertscore(
    test_queries=test_queries,
    ground_truth_answers=ground_truth_answers,
    llm_only_func=llm_only_answer,
    rag_func=rag_func_wrapper
)


Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



📊 Running BERTScore Evaluation


Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


🔧 LLM Only (avg BERTScore): P=0.7920, R=0.8626, F1=0.8254
🔧 Hybrid+Rewrite (avg BERTScore): P=0.7861, R=0.8250, F1=0.8050


### User Interactive Loop

In [None]:
def interactive_loop():
    print(" FCA Handbook QA System (Type 'exit' to quit)\n")
    while True:
        print("Waiting for your query...")
        user_query = input("User query: ")
        if user_query.lower() in ["exit", "quit"]:
            print("Exiting interactive loop.")
            break

        # Step 1: Retrieve context chunks using hybrid pipeline
        context_chunks = hybrid_pipeline(
            query=user_query,
            faiss_index=index,
            metadata=filtered_data,
            embedding_model=model,
            reranker_model=reranker,
            top_k=5
        )

        # Step 2: Generate final answer from the retrieved chunks + user query
        final_answer = generate_answer(context_chunks, user_query)

        # Step 3: Print the final answer
        print("\n Final Answer:\n")
        print(final_answer)
        print("\n" + "-"*60 + "\n")

# Calling the function explicitly to start interaction
interactive_loop()


 FCA Handbook QA System (Type 'exit' to quit)

Waiting for your query...


## Testing and Evaluation

BERTScore was used as the primary evaluation metric due to its ability to measure semantic similarity between generated and reference answers using contextual embeddings. The baseline LLM-only model achieved an average BERTScore F1 of 0.8181, with a precision of 0.7832 and a recall of 0.8567. In contrast, the enhanced FCA RAG pipeline, including query rewriting, hybrid retrieval, dynamic chunking and reranking, achieved a BERTScore F1 of 0.8043, with improved precision (0.7867) and slightly lower recall (0.8228). While there was a minor trade-off in F1, the improvement in precision indicates that the enhanced system generated more focused and relevant responses.

These results highlight the effectiveness of the architectural enhancements. The query rewriting step helped interpret vague user inputs more effectively, while the hybrid retrieval pipeline broadened the scope of the search and ensured the inclusion of semantically and lexically relevant material. Reranking fine-tuned this output further, delivering high-quality chunks to the generation module.

The system was evaluated on a diverse set of queries, encompassing straightforward, ambiguous, and domain-specific examples. These tests did not reveal any significant failure cases, demonstrating that FCA RAG performs reliably across different query types. Future work could involve broadening the range of queries, applying more domain-specific fine-tuning, experimenting with alternative retrieval methods, and exploring multi-modal retrieval-augmented generation to further enhance performance.

Limitations include a small evaluation dataset, which restricts statistical significance, and manual inspection of answers, which may introduce some subjectivity. To test the model, token size, batches and top-k retrieval were kept to a minimum, which may have constrained performance. Moreover, the LLM and reranker were used with minimal domain-specific fine-tuning, suggesting room for further improvement.


## Future Improvements

Several options remain for advancing FCA RAG from a robust prototype to a production-grade system. First, the evaluation dataset can be significantly expanded and diversified to include more complex and edge-case queries. This would enhance the reliability and statistical strength of benchmarking efforts.


Incorporating a dedicated domain-specific reranker, fine-tuned on regulatory language, could further improve chunk selection and downstream answer quality. Likewise, fine-tuning the LLM itself on historical regulatory Q&A pairs or other financial legal corpora would allow for a more nuanced understanding of compliance-oriented queries.
Real-time updates remain a challenge. Although FCA RAG dynamically retrieves the latest PDFs from the FCA website, structured access via an official API would greatly improve consistency and reliability. If such an API were to become available, integrating it would be a high-priority enhancement.


Additionally, developing a web-based interface would significantly enhance usability. This would enable professionals to interact with the system in a more accessible environment, paving the way for broader adoption. Long-term extensions may also include multilingual support and adaptation to other regulatory domains by substituting in the appropriate regional documents.



## Conclusion

This project demonstrates a functional and extensible Retrieval-Augmented Generation system customised for the FCA Handbook. By integrating dense and sparse (hybrid) retrieval with query rewriting, reranking, and dynamic chunking, FCA RAG provides precise, traceable, and contextually grounded responses to complex regulatory questions.

The evaluation shows that these enhancements produce semantically rich answers that perform well against a naive LLM-only baseline, especially in terms of relevance and clarity. While there is room for improvement through expanded datasets and fine-tuning, the current architecture already represents a substantial advance toward automating regulatory comprehension.

FCA RAG lays a strong foundation for future research and deployment in legal and financial compliance domains, illustrating how advanced NLP techniques can streamline access to intricate regulatory texts and make it accessible to a broader audience.