RAG with Reranking and Query Decomposition for 10-K Filings

Purpose: Build a prototype Retrieval-Augmented Generation (RAG) system that answers questions using information from SEC 10‑K filings. This notebook demonstrates how adding a reranker and query decomposition can improve retrieval quality and final answers.

Example task: Answer the question "List the major changes that occurred at the company in 2023" using Tesla's 10‑K. (Note: typos in example queries can be intentional — this notebook shows how the system handles them.)

Learning outcomes:
- How to create and use a vectorstore built from SEC filings.
- How reranking (Flashrank or similar) improves retrieval precision.
- How query decomposition (sub-queries) helps gather targeted context for complex questions.
- How to assemble a RAG pipeline with LangChain components.


# How to Run this Notebook

Quick start (Colab or local):

1. Obtain an OpenAI API key: https://platform.openai.com/settings/api-keys — save it securely.
2. In Google Colab: open the Secrets panel in the left sidebar (key icon) → `+ Add new secret` → set the name to `OPEN_AI_KEY` (the name the API key setup cell reads) and paste the key as the value. Enable the secret for this notebook.
3. Locally: export the key in your shell before running the notebook:

```bash
export OPENAI_API_KEY="sk-..."
```

4. Run the notebook top→down (`Runtime -> Run all` in Colab) or execute cells in order in your environment.

Notes:
- This notebook builds on an earlier RAG exercise; repeated conceptual explanations are intentionally abbreviated here.
- If you lose an API key in the UI, generate a new key — the UI shows keys only once at creation.


# Basic Setup

This section installs required packages and prepares the runtime. If you run locally, prefer using a virtual environment (venv / conda) and install the dependencies there.



## Install Frameworks

Install the Python packages used in this notebook. The cells below use `pip` to install LangChain, FAISS, the OpenAI client, and the optional reranker.


- `langchain`, `langchain_core`, `langchain_community`, `langchain_huggingface`, `langchain_openai` — LangChain core and provider integrations.
- `faiss-cpu` — FAISS for similarity search and efficient vector indexes (CPU build).
- `openai` — OpenAI Python SDK for embeddings and LLM calls.
- `flashrank` — optional reranker used to improve result ordering.

Tip: pin versions for reproducible runs when sharing notebooks.

In [None]:
%%capture
!pip install langchain langchain_core langchain_community faiss-cpu openai langchain_openai langchain_huggingface -U

In [None]:
%%capture
# New installs: the cell magic must be the first line of the cell, so the comment goes below it.
!pip install -q -U flashrank  # for Flashrank monkeypatch


In [None]:
import importlib.metadata
print(importlib.metadata.version("flashrank"))

0.2.10


In [None]:
import langchain
print(f"langchain version: {langchain.__version__}")

langchain version: 1.0.5
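
Since pinned versions make shared notebooks reproducible, it can help to collect every installed version in one pass. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Illustrative list; adjust it to the packages you actually install.
for name, version in installed_versions(
    ["langchain", "langchain-community", "faiss-cpu", "flashrank"]
).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

The printed `name==version` lines can be pasted into a `pip install` command or a requirements file.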


## API Keys Setup

Configure API keys required by the notebook:

- `OPENAI_API_KEY` — OpenAI key for embeddings and LLM requests.
- (Optional) `HUGGINGFACE_API_KEY` / `COHERE_API_KEY` if you use Hugging Face or Cohere services.

In Colab: save keys using the Secrets manager and load them into environment variables. Locally: set environment variables in your shell or use a secrets manager.

In [None]:
import os

if 'google.colab' in str(get_ipython()):
    from google.colab import userdata
    # Set environment variables
    os.environ["OPENAI_API_KEY"] = userdata.get('OPEN_AI_KEY')
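
A missing key otherwise surfaces only at the first OpenAI call, so a small fail-fast check can save a debugging round trip. This is a sketch, not part of the original notebook; `require_env` is a hypothetical helper name.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it as a Colab secret or export it in your shell."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after the setup cell would stop the run early if the key did not load.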


## Google Drive and Local Paths

When running in Colab this section mounts Google Drive and creates paths to persist FAISS indexes and data. If you run locally, update the paths to use a local `./data` or other persistent directory instead of Colab paths.

In [None]:
from google.colab import drive
from pathlib import Path
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Set up paths
gdrive_root = Path('/content/drive/My Drive')
faiss_dir = gdrive_root/"LLM/RAG/faiss_index"
faiss_dir.mkdir(parents=True, exist_ok=True)
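
For local runs, one option is to choose the root directory at runtime instead of editing the path by hand. The sketch below assumes the same `LLM/RAG/faiss_index` layout as the cell above; `resolve_faiss_dir` and the `./data` default are illustrative names, not part of the original notebook.

```python
from pathlib import Path


def resolve_faiss_dir(drive_root="/content/drive/My Drive", local_root="./data"):
    """Pick the Drive folder when it is mounted, otherwise a local folder."""
    drive_path = Path(drive_root)
    root = drive_path if drive_path.exists() else Path(local_root)
    index_dir = root / "LLM" / "RAG" / "faiss_index"
    index_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    return index_dir
```

With this helper, `faiss_dir = resolve_faiss_dir()` would work in both environments, provided the Drive mount cell is skipped when running locally.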

## Import Libraries

Import the libraries used by the RAG pipeline: document loaders, text splitters, vectorstores, embedding providers, reranker and LangChain primitives.

Key imports are annotated inline in the code cell for clarity.

In [None]:
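# LangChain imports
from langchain_community.document_loaders import WebBaseLoader  # load filing HTML from a URL
from langchain_text_splitters import RecursiveCharacterTextSplitter  # split documents into chunks
from langchain_core.prompts import PromptTemplate  # templates for the RAG and sub-query prompts
from langchain_community.vectorstores import FAISS  # FAISS-backed vector store
from langchain_openai import ChatOpenAI  # OpenAI chat models
from langchain_core.output_parsers import StrOutputParser  # convert LLM output to a plain string
from langchain_core.runnables import RunnablePassthrough  # pass the chain input through unchanged
from langchain_core.documents import Document  # document container (page_content + metadata)
from langchain_huggingface import HuggingFaceEmbeddings  # Hugging Face embedding models
from langchain_openai import OpenAIEmbeddings  # OpenAI embedding models
from langchain_classic.retrievers import ContextualCompressionRetriever  # wrap a retriever with a compressor/reranker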

from pprint import pprint
from typing import List
import re



In [None]:
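# New imports
from flashrank import Ranker, RerankRequest
import langchain_community.document_compressors.flashrank_rerank as fr_mod
# Monkeypatch: expose flashrank's RerankRequest inside the LangChain wrapper module
# so that FlashrankRerank can find it when building rerank requests.
fr_mod.RerankRequest = RerankRequest

from langchain_community.document_compressors import FlashrankRerank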

# Configuration Dictionary

The `defaultConfig` dictionary centralizes settings for chunking, embeddings, retrieval, reranking, prompt templates and model choices. Edit these values to adapt behavior (for example, change `chunkSize` or `numRerankedDocuments`).

In [None]:
defaultConfig = {
    # Document processing settings
    "chunkSize": 500,
    "chunkOverlap": 50,
    "userAgentHeader": "YourCompany-ResearchBot/1.0 (your@email.com)",

    # embedding model
    "embeddingModel": "BAAI/bge-base-en-v1.5",

    # Vector store settings
    "numRetrievedDocuments": 12,

    # Document formatter settings
    "numSelectedDocuments": 5,

    # Reranker setting
    "rerankerType": "flashrank",
    "rerankerModel": "ms-marco-TinyBERT-L-2-v2",
    "numRerankedDocuments": 5,

    # Model settings
    "ragAnswerModel": "gpt-4o",
    "ragAnswerModelTemeprature": 0.7,

    # URLs to process
    "companyFilingUrls": [
        ("Tesla", "https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm")
    ],

    # RAG prompt template
    "ragPromptTemplate": """
    Give an answer for the `Question` using only the given `Context`. Use information relevant to the query from the entire context.
    Provide a detailed answer with thorough explanations, avoiding summaries.

    Question: {question}

    Context: {context}

    Answer:
    """,

    # Query Decomposer settings
    'queryDecomposerModel': "gpt-4o-mini",
    'queryDecomposerModelTemperature': 0.7,

    # SubQuery prompt template
    "subqueryPromptTemplate": """
    Break down the `Question` into multiple sub-queries. Use the guidelines given below to help in the task.

    1. The set of sub-queries together capture the complete information needed to answer the question.
    2. Each sub-query should ask for just one piece of information about one specific company.
    3. For each sub-query, only mention the information you're trying to get. Don't use verbs like "retrieve" or "find".
    4. Include the company name mentioned in each sub-query.
    5. Do not include any references to data sources in your sub-queries.

    Enclose the sub-query in angle brackets. For example:
    <sub-query 1>
    <sub-query 2>

    Question: {question}

    Begin:
    """,
}

In [None]:
# Shallow copy: reassigning top-level keys on `config` leaves `defaultConfig` untouched, but nested objects (such as the `companyFilingUrls` list) are still shared between the two.
config = defaultConfig.copy()
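
When experimenting with several settings, a small helper can apply overrides on a deep copy and reject misspelled keys. This is a sketch under stated assumptions: `make_config` and `DEMO_DEFAULTS` are hypothetical names, the defaults are a cut-down stand-in for `defaultConfig`, and the URLs are placeholders.

```python
import copy

# Stand-in defaults for illustration; the notebook's `defaultConfig` has more keys.
DEMO_DEFAULTS = {
    "chunkSize": 500,
    "numRetrievedDocuments": 12,
    "numRerankedDocuments": 5,
    "companyFilingUrls": [("Tesla", "https://example.com/filing.htm")],
}


def make_config(defaults, **overrides):
    """Deep-copy `defaults`, apply overrides, and reject unknown keys."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown config keys: {unknown}")
    config = copy.deepcopy(defaults)
    config.update(overrides)
    return config


experiment = make_config(DEMO_DEFAULTS, chunkSize=800, numRerankedDocuments=3)
# Mutating a nested value on the copy does not touch the defaults.
experiment["companyFilingUrls"].append(("GM", "https://example.com/gm.htm"))
print(experiment["chunkSize"], len(DEMO_DEFAULTS["companyFilingUrls"]))  # 800 1
```

Rejecting unknown keys is a design choice: a misspelled setting fails loudly instead of being silently ignored.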

# Load Vector Store

This cell loads a prebuilt FAISS vectorstore from disk. If you don't have an existing vectorstore, run the ingestion/indexing notebook first to build it from SEC filings.

Ensure the `faiss_dir` path matches where the index was saved and that the embedding function used here matches the embeddings used when creating the index.

In [None]:
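# Note: `defaultConfig` stores the model under "embeddingModel", so this lookup of
# "embeddingModelName" falls back to 'text-embedding-3-small' unless you add that key.
embedding_model_name = config.get('embeddingModelName', 'text-embedding-3-small')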

if embedding_model_name.startswith("text-embedding"):
    # Use OpenAIEmbeddings for OpenAI models
    embeddingFunction = OpenAIEmbeddings(model=embedding_model_name)
else:
    # Use HuggingFaceEmbeddings for other models (assuming they are from HuggingFace)
    embeddingFunction = HuggingFaceEmbeddings(model_name=embedding_model_name)

loaded_vectorstore = FAISS.load_local(str(faiss_dir), embeddingFunction, allow_dangerous_deserialization=True)
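
A mismatch between the query embeddings and the index usually shows up as a dimension error at search time. The check below is a hedged sketch of how to catch it earlier; `check_embedding_dimension` and `fake_embed` are hypothetical names, and the toy embedding function stands in for a real model.

```python
def check_embedding_dimension(index_dim, embed_query, probe="dimension probe"):
    """Raise if the query embedding size differs from the index dimension."""
    vector_dim = len(embed_query(probe))
    if vector_dim != index_dim:
        raise ValueError(
            f"Embedding dimension {vector_dim} does not match index dimension "
            f"{index_dim}; the index was probably built with a different model."
        )
    return vector_dim


# Toy stand-in for an embedding function that returns 4-dimensional vectors.
def fake_embed(text):
    return [0.0, 0.0, 0.0, 0.0]


print(check_embedding_dimension(4, fake_embed))  # 4
```

In this notebook the call would presumably look like `check_embedding_dimension(loaded_vectorstore.index.d, embeddingFunction.embed_query)`, where `d` is the dimension attribute of the underlying FAISS index. With OpenAI embeddings, the probe costs one API call. Matching dimensions are necessary but not sufficient: two different models can share a dimension.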

# Retriever

The retriever fetches candidate documents by semantic similarity. Optionally, a reranker reorders those candidates using a more precise model to improve the top results' relevance.

The `create_retriever_with_reranking` function builds a base FAISS retriever and conditionally wraps it with a reranking compressor (Flashrank) to refine results.

In [None]:
def create_retriever_with_reranking(vectorstore, config, use_reranking=True):
    """
    Creates a retriever with optional reranking capability.

    Args:
        vectorstore: The vector store to retrieve from
        config: Configuration dictionary
        use_reranking: Whether to use reranking (default: True)

    Returns:
        A retriever (either basic or with reranking)
    """
    # Create base retriever
    base_retriever = vectorstore.as_retriever(
        search_kwargs={"k": config.get("numRetrievedDocuments", 12)}
    )

    # If reranking is disabled, return the base retriever
    if not use_reranking:
        return base_retriever

    try:
        # Initialize the reranker
        model_name = config.get("rerankerModel", "ms-marco-TinyBERT-L-2-v2")
        top_n = int(config.get("numRerankedDocuments", 5))
        ranker_client = Ranker(model_name=model_name)
        reranker = FlashrankRerank(client=ranker_client, model=model_name, top_n=top_n)


        # Create and return the enhanced retrieval system
        return ContextualCompressionRetriever(base_retriever=base_retriever, base_compressor=reranker)
    except Exception as e:
        print(f"Error setting up reranker: {e}")
        print("Falling back to base retriever.")
        return base_retriever

In [None]:
retriever = create_retriever_with_reranking(loaded_vectorstore, config, use_reranking=True)

ms-marco-TinyBERT-L-2-v2.zip: 100%|██████████| 3.26M/3.26M [00:00<00:00, 13.8MiB/s]
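
To make the two-stage idea concrete, here is a toy retrieve-then-rerank example in plain Python. It is only an illustration: token overlap stands in for vector similarity and an ordered-bigram score stands in for the reranker model, so none of these functions reflect how FAISS or Flashrank score documents internally.

```python
def tokenize(text):
    return text.lower().split()


def first_stage(query, docs, k):
    """Cheap recall step: rank by number of shared tokens, keep the top k."""
    q = set(tokenize(query))
    ranked = sorted(docs, key=lambda d: len(q & set(tokenize(d))), reverse=True)
    return ranked[:k]


def rerank(query, candidates, top_n):
    """Costlier precision step: reward query bigrams that appear in order."""
    q_tokens = tokenize(query)
    q_bigrams = set(zip(q_tokens, q_tokens[1:]))

    def score(doc):
        d_tokens = tokenize(doc)
        return len(q_bigrams & set(zip(d_tokens, d_tokens[1:])))

    return sorted(candidates, key=score, reverse=True)[:top_n]


docs = [
    "production vehicle of locations the",
    "vehicle production locations include Texas and Shanghai",
    "annual report on energy storage",
    "locations of service centers",
]
query = "vehicle production locations"
candidates = first_stage(query, docs, k=3)  # wide net, order is rough
top = rerank(query, candidates, top_n=1)    # narrow to the best match
print(top)  # ['vehicle production locations include Texas and Shanghai']
```

The first stage ranks the scrambled sentence first because it only counts shared words; the second stage corrects the order. This mirrors the role of `numRetrievedDocuments` (wide candidate set) and `numRerankedDocuments` (short final list).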


# Query Decomposer

For complex questions, the decomposer breaks a query into focused sub-queries. Each sub-query retrieves documents that are then merged to form a richer context for the final answer. This improves coverage and reduces hallucination risk for multi-part questions.

In [None]:
def create_decomposer(config):
    prompt = PromptTemplate.from_template(config["subqueryPromptTemplate"])
    llm = ChatOpenAI(
        model=config["queryDecomposerModel"],
        temperature=config["queryDecomposerModelTemperature"]
    )
    chain = prompt | llm | StrOutputParser()

    def decompose_query(question):
        response = chain.invoke({"question": question})
        return re.findall(r'<(.*?)>', response, re.DOTALL)

    return decompose_query

In [None]:
decomposer = create_decomposer(config)

In [None]:
decomposer
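
The parsing step can be checked without calling the LLM. The sketch below applies the same angle-bracket pattern to a hand-written response and additionally trims whitespace and drops duplicates; `parse_subqueries` and the sample response are illustrative assumptions, not output from the model.

```python
import re


def parse_subqueries(response):
    """Extract <...> spans, trim whitespace, and drop empties and duplicates."""
    seen = set()
    subqueries = []
    for raw in re.findall(r"<(.*?)>", response, re.DOTALL):
        text = " ".join(raw.split())  # collapse newlines and repeated spaces
        if text and text.lower() not in seen:
            seen.add(text.lower())
            subqueries.append(text)
    return subqueries


# Hypothetical LLM response, written by hand for illustration.
sample_response = """
<Tesla vehicle production locations>
<GM vehicle production locations>
< Tesla vehicle
  production locations >
"""
print(parse_subqueries(sample_response))
# ['Tesla vehicle production locations', 'GM vehicle production locations']
```

Dropping duplicates before retrieval avoids repeated retriever calls for the same sub-query.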

# RAG Chain

RAG pipeline components:
1. **Retriever** — obtains semantically relevant documents (with optional reranking).
2. **Document Formatter** — concatenates or structures documents into a `context` for the LLM.
3. **LLM** — generates the answer conditioned on the `context` and `question`.
4. **Prompt Template** — controls how `context` and `question` are presented to the LLM.

The notebook shows how to build a flexible chain that supports both single-shot retrieval and the decomposition + aggregation flow.

In [None]:
def format_docs(docs):
    return "\n\n".join([doc.page_content for doc in docs])


# --- Prompt and LLM ---
def create_prompt(config):
    return PromptTemplate.from_template(config["ragPromptTemplate"])


def create_llm(config):
    return ChatOpenAI(
        model=config["ragAnswerModel"],
        temperature=config["ragAnswerModelTemeprature"]
    )

def build_rag_chain(retriever, format_docs_fn, prompt, llm):
    return (
        {"context": retriever | format_docs_fn, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
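
To see what the LLM actually receives, the formatting and prompt-filling steps can be reproduced with plain strings. This is a simplified sketch: the `Doc` dataclass stands in for a LangChain `Document`, `str.format` stands in for `PromptTemplate`, and the template is a shortened version of `ragPromptTemplate`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Join chunk texts with a blank line between them.
    return "\n\n".join(doc.page_content for doc in docs)


TEMPLATE = "Question: {question}\n\nContext: {context}\n\nAnswer:"

docs = [Doc("Chunk A about factories."), Doc("Chunk B about safety ratings.")]
prompt_text = TEMPLATE.format(
    question="Where are vehicles produced?", context=format_docs(docs)
)
print(prompt_text)
```

In `build_rag_chain`, the dictionary step plays the same role: the incoming question goes to the retriever and formatter to produce `context`, and is passed through unchanged as `question`.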


In [None]:
def run_pipeline_flexible(question, retriever, decomposer, config):
    prompt = create_prompt(config)
    llm = create_llm(config)

    if decomposer is not None:
        subqueries = decomposer(question)
        all_docs = []
        seen_source_ids = set()

        for subq in subqueries:
            print(f"Subquery: {subq}")
            # Use retriever.invoke() instead of retriever.get_relevant_documents()
            docs = retriever.invoke(subq)
            print(f"Retrieved {len(docs)} documents for subquery")

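            for doc in docs:
                # Deduplicate on "source_id" when present; fall back to the chunk text
                # so that documents without that metadata are not silently dropped.
                doc_key = doc.metadata.get("source_id") or doc.page_content
                if doc_key not in seen_source_ids:
                    all_docs.append(doc)
                    seen_source_ids.add(doc_key)
                    print(f"Added document: {str(doc_key)[:60]}")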

        print(f"Total unique documents: {len(all_docs)}")
        company_counts = {}
        for doc in all_docs:
            company = doc.metadata.get("company", "unknown")
            company_counts[company] = company_counts.get(company, 0) + 1
        print(f"Documents by company: {company_counts}")

        context = format_docs(all_docs)
        print(f"Context size: {len(context)} characters")

        chain = prompt | llm | StrOutputParser()
        answer = chain.invoke({"context": context, "question": question})
        return answer
    else:
        chain = build_rag_chain(retriever, format_docs, prompt, llm)
        return chain.invoke(question)
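
The merge step of the decomposition flow can be tested in isolation. The sketch below assumes that some chunks may lack a `source_id` and uses the chunk text as a fallback key, so those chunks are kept rather than discarded; `merge_unique` and the sample documents are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_unique(doc_lists):
    """Flatten per-sub-query results, keeping the first copy of each chunk."""
    seen = set()
    merged = []
    for docs in doc_lists:
        for doc in docs:
            # Prefer an explicit id; fall back to the text when it is missing.
            key = doc.metadata.get("source_id") or doc.page_content
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged


results = [
    [Doc("Fremont factory", {"source_id": "tsla-1"}), Doc("No id chunk")],
    [Doc("Fremont factory", {"source_id": "tsla-1"}), Doc("No id chunk"), Doc("Giga Texas")],
]
print([d.page_content for d in merge_unique(results)])
# ['Fremont factory', 'No id chunk', 'Giga Texas']
```

If the merged list comes back empty even though each sub-query retrieved documents, the deduplication key is the first thing to inspect.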

# Run the Pipeline to Answer Questions

Provide a `question` and call `run_pipeline_flexible(question, retriever, decomposer, config)`. When `decomposer` is not `None`, the function prints each sub-query and the number of retrieved and merged documents before returning the LLM's final answer; pass `decomposer=None` to run single-shot retrieval instead.

Tips:
- Start with simple factual questions to validate retrieval quality.
- Inspect retrieved documents when results seem off — reranker and retriever settings can be tuned.
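
For the inspection tip, a one-line-per-document preview is often enough to spot irrelevant chunks or missing metadata. This is a sketch; `preview_docs` is a hypothetical helper that works on any object with `page_content` and `metadata` attributes, and the demo document is made up.

```python
from types import SimpleNamespace


def preview_docs(docs, width=80):
    """Return one summary line per document: rank, metadata, and a text snippet."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        snippet = " ".join(doc.page_content.split())[:width]  # flatten whitespace
        lines.append(f"{rank}. {doc.metadata} | {snippet}")
    return lines


# Made-up document used only to demonstrate the output format.
demo = [
    SimpleNamespace(
        page_content="Tesla  manufactures\nvehicles in Fremont.",
        metadata={"company": "Tesla"},
    )
]
for line in preview_docs(demo):
    print(line)
# 1. {'company': 'Tesla'} | Tesla manufactures vehicles in Fremont.
```

In the notebook, something like `preview_docs(retriever.invoke("Tesla vehicle production locations"))` would show which chunks reach the prompt and which metadata keys they carry.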

In [None]:
question = "How do Tesla and GM's approaches to manufacturing and production compare, particularly for electric vehicles? Where are their vehicles produced? What are the saftey standards followed in their vehicles?"


In [None]:
pprint(question)

("How do Tesla and GM's approaches to manufacturing and production compare, "
 'particularly for electric vehicles? Where are their vehicles produced? What '
 'are the saftey standards followed in their vehicles?')


In [None]:
answer = run_pipeline_flexible(question, retriever, decomposer, config)

Subquery: Tesla manufacturing and production approach for electric vehicles
Retrieved 5 documents for subquery
Subquery: GM manufacturing and production approach for electric vehicles
Retrieved 5 documents for subquery
Subquery: Tesla vehicle production locations
Retrieved 5 documents for subquery
Subquery: GM vehicle production locations
Retrieved 5 documents for subquery
Subquery: Tesla safety standards followed in vehicles
Retrieved 5 documents for subquery
Subquery: GM safety standards followed in vehicles
Retrieved 5 documents for subquery
Total unique documents: 0
Documents by company: {}
Context size: 0 characters
