## 🧩 Relevant Segment Extraction (RSE) for Better Context | RAG100X

This notebook implements **Relevant Segment Extraction (RSE)** — a retrieval-time optimization that reconstructs the *most contextually useful* segments from your documents by selecting **contiguous runs of relevant chunks**, not just isolated ones.

Instead of naively picking the Top-k highest scoring chunks, RSE uses a reranking model (like Cohere Rerank) to assign relevance scores to all chunks, and then intelligently finds the **best spans** of text — even if some relevant chunks weren’t individually top-ranked.

The result? Better grounding, smoother flow, and improved performance for LLM answers.

---

### ✅ What You’ll Learn

- Why Top-k retrieval can lead to broken context and hallucinations  
- How RSE scores and selects the *best chunk spans*, not individual pieces  
- How segment optimization helps recover missing context that matters  
- When RSE outperforms vanilla retrieval in real-world QA tasks  

---

### 🔍 Real-world Analogy

Imagine you’re watching a movie, and someone asks:

> *"What caused the main character’s breakdown?"*

If you only watch 3 random dramatic scenes (Top-k chunks), you’ll miss the full story.

But if you instead watch a **full 5-minute clip** leading up to the breakdown — even if it includes a few “boring” moments — you’ll get the full emotional arc.

✅ **RSE gives you that full arc — not just scattered scenes.**

---

### 🔬 How RSE Works Under the Hood

Let’s say your document is split into 500 chunks. For a given query, we want to extract the *best possible segments* — ideally, groups of neighboring chunks that flow well together and are jointly relevant.

| Step               | What Happens                                                                 |
|--------------------|------------------------------------------------------------------------------|
| 1. Chunking        | Document is split into non-overlapping chunks using LangChain’s splitter     |
| 2. Reranking       | Each chunk is scored using Cohere Rerank for its relevance to the query     |
| 3. Value Mapping   | Scores are converted into “chunk values” — good chunks = +ve, bad = –ve     |
| 4. Segmentation    | A search algorithm finds **contiguous spans** with high total chunk value   |
| 5. Filtering       | Segments must fit a chunk budget, not overlap, and clear a minimum value    |
| 6. Output          | The selected segments are passed to the LLM as context                      |

🧠 Even if a few chunks inside a span are low scoring, they might **complete a relevant section** — so we include them if the segment as a whole is valuable.

---

### 🧪 Why This Works So Well

- 🧱 **Context continuity**: Relevant ideas rarely exist in isolation — they live in flow  
- 🤖 **LLMs love coherence**: Chunks from the same topic block improve grounding and reduce hallucination  
- 🎯 **Span-level optimization**: Instead of picking top chunks, RSE picks *top stories*  

---

### 🏗️ Why This Matters in Production

Imagine this chunk retrieved on its own:

> *“The final experiment used a dropout rate of 0.2 and achieved 87% accuracy.”*

It’s informative — but out of context. What was the task? Why that setup? What came before?

With RSE, we retrieve a **span** like:

1. *“Dataset and Preprocessing…”*  
2. *“Model Architecture: 3-layer BiLSTM…”*  
3. *“Final experiment used a dropout rate…”*

That full thread provides **complete reasoning**, improving both faithfulness and fluency.

**RSE ensures retrieval returns meaningful *segments*, not fragmented sentences.**

---

### 🔄 Where This Fits in RAG100X

So far in RAG100X, we’ve explored:

1. PDF-based QA  
2. CSV-based semantic search  
3. DeepLearning.ai RAG QA  
4. Chunk size & latency tuning  
5. Proposition-aware chunking  
6. Query rewriting + decomposition  
7. HyDE: Embed imagined answers  
8. HyPE: Embed imagined questions  
9. CCH: Add intelligent titles to chunks  

Now in **Day 10**, we zoom in on post-retrieval — and **optimize which chunks make it into the final context**.

> 💡 **RSE doesn't just pick relevant chunks — it picks relevant *stories*.**


## 📦 Installation & Setup

In [None]:
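# Install required packages (scipy, cohere and langchain-text-splitters are imported below)
!pip install matplotlib numpy scipy cohere langchain-text-splitters python-dotenv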
import os
import numpy as np
from typing import List
from scipy.stats import beta
import matplotlib.pyplot as plt
import cohere
from dotenv import load_dotenv

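# Load environment variables from a .env file
load_dotenv()
# Fail early with a clear message if the Cohere API key is missing
if not os.getenv("CO_API_KEY"):
    raise EnvironmentError("CO_API_KEY is not set; add it to your .env file")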

## 🧩 Define Helper Functions for Chunk Scoring with Cohere Reranker

Before we extract relevant segments, we need a way to **measure how relevant each chunk is** to a given user query.

In this section:

- We'll define helper functions to **split raw text into chunks**, **score them using Cohere's rerank API**, and **visualize the results**.
- Since we're only working with a single document, we skip vector search and send all chunks directly to the reranker.

### 📌 Functions Defined

1. `split_into_chunks`: Uses LangChain’s `RecursiveCharacterTextSplitter` to break large documents into non-overlapping chunks (`chunk_overlap=0`).
2. `transform`: Applies a Beta CDF to spread Cohere's relevance scores (which are often bunched near 0 or 1) into a smoother 0–1 range.
3. `rerank_chunks`: Calls Cohere’s `rerank` API to assign relevance scores to each chunk based on a query.
4. `plot_relevance_scores`: Visualizes the relevance scores across all chunks.

These scores will later help us **extract only the most relevant segments**, minimizing hallucinations and improving grounding.


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Function to split a document into smaller chunks using LangChain's RecursiveCharacterTextSplitter
def split_into_chunks(text: str, chunk_size: int = 800) -> List[str]:
    """
    Splits a large document into non-overlapping chunks.

    Args:
        text (str): Full text of the document.
        chunk_size (int): Size of each chunk in characters.

    Returns:
        List[str]: List of string chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=0, length_function=len
    )
    documents = text_splitter.create_documents([text])
    # Use a distinct loop variable so the `text` argument is not shadowed
    chunks = [doc.page_content for doc in documents]
    return chunks

# Beta transformation to smooth out Cohere's sharp relevance scores (usually close to 0 or 1)
def transform(x: float) -> float:
    """
    Transforms the sharp relevance scores from Cohere Rerank into smoother values.

    Args:
        x (float): Raw relevance score from the reranker.

    Returns:
        float: Smoothed score using the beta cumulative distribution function.
    """
    a, b = 0.4, 0.4  # Tunable parameters controlling smoothness
    return beta.cdf(x, a, b)

# Use Cohere Rerank API to evaluate relevance of each chunk to the input query
def rerank_chunks(query: str, chunks: List[str]):
    """
    Reranks all chunks using Cohere's rerank model and computes relevance-weighted scores.

    Args:
        query (str): Natural language query.
        chunks (List[str]): List of document chunks to score.

    Returns:
        similarity_scores (List[float]): Normalized similarity scores.
        chunk_values (List[float]): Final chunk importance scores (with decay applied).
    """
    model = "rerank-english-v3.0"
    client = cohere.Client(api_key=os.environ["CO_API_KEY"])
    decay_rate = 30  # Controls how quickly lower-ranked chunks decay in importance

    reranked_results = client.rerank(model=model, query=query, documents=chunks)
    results = reranked_results.results

    reranked_indices = [r.index for r in results]
    raw_scores = [r.relevance_score for r in results]

    similarity_scores = [0] * len(chunks)
    chunk_values = [0] * len(chunks)

    for i, index in enumerate(reranked_indices):
        abs_score = transform(raw_scores[i])
        similarity_scores[index] = abs_score
        # Apply exponential decay based on the rank
        chunk_values[index] = np.exp(-i / decay_rate) * abs_score

    return similarity_scores, chunk_values

# Visualize how relevant each chunk is to the input query
def plot_relevance_scores(chunk_values: List[float], start_index: int = None, end_index: int = None) -> None:
    """
    Plots a scatter graph showing how relevant each chunk is to the query.

    Args:
        chunk_values (List[float]): Importance scores per chunk.
        start_index (int, optional): Starting chunk index for the plot.
        end_index (int, optional): Ending chunk index for the plot.
    """
    if start_index is None:
        start_index = 0
    if end_index is None:
        end_index = len(chunk_values)

    plt.figure(figsize=(12, 5))
    plt.title("🔍 Query-to-Chunk Relevance Across Document")
    plt.xlabel("Chunk Index")
    plt.ylabel("Relevance Score")
    plt.ylim(0, 1)
    plt.scatter(range(start_index, end_index), chunk_values[start_index:end_index])
    plt.grid(True)
    plt.show()


## Loading Data

In [None]:
# File path for the input document
FILE_PATH = "data/nike_2023_annual_report.txt"

with open(FILE_PATH, 'r', encoding='utf-8') as file:
    text = file.read()

chunks = split_into_chunks(text, chunk_size=800)

print(f"Split the document into {len(chunks)} chunks")

## 📊 Visualize Chunk Relevance Across a Single Document

Now that we’ve defined the reranker and scoring helpers, let’s test them on a real query.

In this step:
- We pass a **realistic financial query** to the reranker.
- We **visualize how relevant each chunk** in the document is.
- This helps us identify which parts of the document should be used to construct a grounded response.

In [None]:
query = "Nike consolidated financial statements"
similarity_scores, chunk_values = rerank_chunks(query, chunks)
plot_relevance_scores(chunk_values)

## 📊 Interpreting the Chunk Relevance Plot

The plot above visualizes how relevant each document chunk is to the input query.

- The **x-axis** shows the chunk index — starting from 0 for the first chunk in the document.
- The **y-axis** shows the **final relevance score**, which is not just the raw similarity score from the reranker.

### 🧠 How is the Relevance Score Computed?

Each chunk is passed through a reranker (like Cohere), which gives:
- A **raw relevance score** — how semantically similar the chunk is to the query.
- A **rank** — where it stands among all chunks in terms of similarity (1st, 2nd, etc.).

Instead of relying on only one of these signals, we combine them:
- The **raw relevance score** is first passed through a **beta transformation** (the `transform` function), which spreads out scores that would otherwise bunch near 0 or 1.
- The **rank** is converted into a weight with an **exponential decay function**, `exp(-rank / decay_rate)` with `decay_rate = 30`, so top-ranked chunks keep more of their score.
- The final chunk value is the **product** of the transformed score and the decay weight.

This makes the score both **position-aware** (top results matter more) and **content-aware** (chunks still need semantic similarity).
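
The rank weighting can be sketched with the standard library alone. The scores below are made-up illustrative values rather than real reranker output, `rank_weighted_values` is a hypothetical helper name, and `decay_rate=30` mirrors the value used in `rerank_chunks`. The beta transformation is left out here to keep the example dependency-free.

```python
import math

def rank_weighted_values(scores_by_rank, decay_rate=30):
    """Multiply each score by exp(-rank / decay_rate), with rank 0 for the top result."""
    return [math.exp(-rank / decay_rate) * score
            for rank, score in enumerate(scores_by_rank)]

# Hypothetical scores, already sorted best-first as a reranker would return them
example_scores = [0.9, 0.9, 0.5]
values = rank_weighted_values(example_scores)

for rank, (score, value) in enumerate(zip(example_scores, values)):
    print(f"rank {rank}: score={score:.2f} -> value={value:.3f}")
```

Two chunks with the same score end up with slightly different values because the lower-ranked one is multiplied by a smaller decay weight.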

---

## 🔍 Zooming into a Relevant Cluster

We noticed that the similarity scores in the range of **chunk indices 320–333** were noticeably higher than others, indicating a dense cluster of relevance.

To examine this section more closely:

```python
plot_relevance_scores(chunk_values, 320, 340)
```


## 🔍 Understanding the Retrieved Document Segment

After identifying a cluster of high-relevance chunks between **chunk 320 and 340**, we inspect the actual content and find something insightful.

From our observation:
- **Chunk 323** starts the section titled **"Consolidated Statement of Income"**.
- The following chunks up to **chunk 333** contain **detailed financial statements** — exactly what our query is looking for.

This means that **all the chunks from 323 to 333 are semantically relevant** to the query:  
**"Nike consolidated financial statements"**.

---

### 🤔 Then Why Were Only Half Marked as Relevant?

Here’s what is going on:

- The reranker model evaluates each chunk **independently** and returns a relevance score between 0 and 1. In practice these scores bunch near the extremes, so each chunk effectively reads as  
  ✅ **Relevant** or ❌ **Not Relevant**.
- In the case of chunks 323–333:
  - Only about **half** scored as **relevant**.
  - The **others scored as irrelevant**, even though they’re clearly part of the same logical section.

---

### 🧠 Why Does This Happen?

The reranker doesn’t have access to **context beyond individual chunks**. So:

- Some chunks may contain **section headers or highly keyword-aligned sentences** — these get marked as relevant.
- Others may contain **numeric tables, continuation of previous content, or subtler phrasing** — these don’t match the query as explicitly, so they’re marked as irrelevant.

> In short, **chunks that are contextually important but less textually aligned** might be ignored by the reranker.

This leads to a situation where **important but low-signal chunks are sandwiched between high-signal ones**, and get missed in naive filtering.

---

### 📌 Why This Matters

If we blindly include only those chunks marked as relevant:
- We may **break the logical continuity** of the document section.
- The LLM might **miss key information**, especially in structured or tabular formats.

By instead considering **clusters of adjacent relevant chunks**, and including the chunks in between, we can:
- Provide the LLM with a **more coherent and complete context**.
- Avoid missing content that’s **crucial but subtle**.
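
A toy comparison makes the difference concrete. The scores and the `threshold` below are hypothetical values chosen for illustration, not the real reranker output for chunks 323 to 327.

```python
# Hypothetical relevance scores for a run of adjacent chunks (not real reranker output)
scores = {323: 0.95, 324: 0.10, 325: 0.90, 326: 0.15, 327: 0.92}
threshold = 0.5  # illustrative cut-off for "relevant"

# Naive filtering: keep only the chunks that clear the threshold
kept = [i for i, s in scores.items() if s >= threshold]

# Span-based selection: keep everything between the first and last relevant chunk
span = list(range(min(kept), max(kept) + 1))

print("threshold only: ", kept)
print("contiguous span:", span)
```

Threshold filtering drops chunks 324 and 326, while the span keeps them, so the LLM sees the section without gaps.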

---

### ✅ Takeaway

- Per-chunk relevance scores alone are **not enough**: proximity and continuity also matter.
- For structured documents like financial reports, always consider **nearby chunks as potential context**, even if they weren’t ranked as top matches.


In [None]:
def print_document_segment(chunks: List[str], start_index: int, end_index: int):
    """
    Print the text content of a segment of the document

    Args:
        chunks (list): List of text chunks
        start_index (int): Start index of the segment
        end_index (int): End index of the segment (not inclusive)

    Returns:
        None

    Prints:
        The text content of the specified segment of the document
    """
    for i in range(start_index, end_index):
        print(f"\nChunk {i}")
        print(chunks[i])

print_document_segment(chunks, 320, 340)

## 🧩 Why Use Clusters of Relevant Chunks?

When answering complex or structured queries, **individual chunks** often fail to provide enough context. But when we group **contiguous relevant chunks together as a cluster**, we preserve the **semantic continuity** of the original document — something large language models (LLMs) benefit from significantly.

Instead of feeding isolated chunks to the LLM, we now aim to feed **entire segments of highly relevant, adjacent chunks**.


## 🧠 How Do We Find These Clusters?

This is where the challenge lies:  
We need to **automatically detect groups of adjacent chunks** that together form a high-quality, coherent segment.

To do this efficiently, we reframe the problem as a variation of the **maximum subarray problem**, which is a well-known algorithmic problem with efficient solutions.

But how?



## 🔢 Defining Chunk Values for Optimization

We already have **relevance scores** (e.g., from a reranker) for each chunk — typically between `0` and `1`. To use them in an optimization algorithm, we tweak them slightly:

### ➕ Relevant chunks → Positive score  
### ➖ Irrelevant chunks → Negative score  

To make this transformation:
- We subtract a constant value (say, **0.2**) from each relevance score.
- This shifts **low-relevance chunks below zero**, making them penalize the segment value.
- We call this constant the `irrelevant_chunk_penalty`.

This setup ensures that the more relevant chunks a segment has — and the fewer irrelevant ones — the **higher the total score**.
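
As a minimal sketch of the idea, the snippet below shifts some made-up scores by the penalty and then runs the classic, unconstrained maximum subarray search (Kadane's algorithm). This is not the notebook's `get_best_segments`, which adds length, budget, and threshold constraints; `shift_scores` and `max_subarray` are hypothetical helper names, and the input is assumed to be non-empty.

```python
from typing import List, Tuple

def shift_scores(scores: List[float], penalty: float = 0.2) -> List[float]:
    """Subtract a constant penalty so low-relevance chunks become negative."""
    return [s - penalty for s in scores]

def max_subarray(values: List[float]) -> Tuple[int, int, float]:
    """Kadane's algorithm: return (start, end, total) with a non-inclusive end index."""
    best_start, best_end, best_total = 0, 0, float("-inf")
    current_start, current_total = 0, 0.0
    for i, v in enumerate(values):
        if current_total <= 0:
            # A non-positive running sum cannot help, so restart the span here
            current_start, current_total = i, v
        else:
            current_total += v
        if current_total > best_total:
            best_start, best_end, best_total = current_start, i + 1, current_total
    return best_start, best_end, best_total

# Hypothetical relevance scores for seven adjacent chunks
scores = [0.05, 0.9, 0.1, 0.8, 0.7, 0.0, 0.05]
values = shift_scores(scores)
start, end, total = max_subarray(values)
print(f"best span: chunks {start} to {end - 1}, total value {total:.2f}")
```

The low-scoring chunk at index 2 is kept because the relevant chunks on either side more than pay for its small negative value.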



## 🔍 What Does the Algorithm Do?

The function `get_best_segments(...)` solves this modified version of the **maximum subarray problem**, with some added constraints:

1. **Inputs:**
   - A list of **relevance values** (after applying the penalty).
   - A **max length** for a single segment.
   - A **total length budget** for all returned segments.
   - A **minimum score** required to consider a segment "good".

2. **Goal:**  
   Find non-overlapping segments of chunks that:
   - Stay within the max length.
   - Stay within the overall chunk budget.
   - Have a total score above a threshold.
   - Don’t overlap with previously selected segments.

3. **Output:**  
   A list of `(start_index, end_index)` pairs representing the best segments to include, plus their respective scores.


## ✅ Why This Works

This approach ensures:
- **High-density clusters of relevance** are selected, not just isolated peaks.
- **Adjacent useful chunks are grouped**, preserving context.
- **Irrelevant noise is penalized**, making the output cleaner.

This is especially powerful for documents like financial reports or technical documentation, where meaningful information spans multiple nearby chunks, and context is everything.


## 💡 Summary

By reframing segment selection as a **scoring and optimization problem**, we can smartly select the most useful and coherent parts of a document. This significantly boosts the **retrieval quality** and ultimately leads to more accurate, grounded, and complete answers from the LLM.


In [None]:
def get_best_segments(relevance_values: list, max_length: int, overall_max_length: int, minimum_value: float):
    """
    This function takes the chunk relevance values and then runs an optimization algorithm to find the best segments. In more technical terms, it solves a constrained version of the maximum sum subarray problem.

    Note: this is a simplified implementation intended for demonstration purposes. A more sophisticated implementation would be needed for production use and is available in the dsRAG library.

    Args:
        relevance_values (list): a list of relevance values for each chunk of a document
        max_length (int): the maximum length of a single segment (measured in number of chunks)
        overall_max_length (int): the maximum length of all segments (measured in number of chunks)
        minimum_value (float): the minimum value that a segment must have to be considered

    Returns:
        best_segments (list): a list of tuples (start, end) that represent the indices of the best segments (the end index is non-inclusive) in the document
        scores (list): a list of the scores for each of the best segments
    """
    best_segments = []
    scores = []
    total_length = 0
    while total_length < overall_max_length:
        # find the best remaining segment
        best_segment = None
        best_value = -1000
        for start in range(len(relevance_values)):
            # skip over negative value starting points
            if relevance_values[start] < 0:
                continue
            for end in range(start+1, min(start+max_length+1, len(relevance_values)+1)):
                # skip over negative value ending points
                if relevance_values[end-1] < 0:
                    continue
                # check if this segment overlaps with any of the best segments and skip if it does
                if any(start < seg_end and end > seg_start for seg_start, seg_end in best_segments):
                    continue
                # check if this segment would push us over the overall max length and skip if it would
                if total_length + end - start > overall_max_length:
                    continue
                
                # define segment value as the sum of the relevance values of its chunks
                segment_value = sum(relevance_values[start:end])
                if segment_value > best_value:
                    best_value = segment_value
                    best_segment = (start, end)
        
        # if we didn't find a valid segment then we're done
        if best_segment is None or best_value < minimum_value:
            break

        # otherwise, add the segment to the list of best segments
        best_segments.append(best_segment)
        scores.append(best_value)
        total_length += best_segment[1] - best_segment[0]
    
    return best_segments, scores

## 🧩 Segment Optimization: Selecting the Most Valuable Context

Now that we've defined how to score contiguous chunks using our relevance-based method, the next step is to **actually run the optimization** and select the most valuable text segments to pass to the LLM.



### ⚙️ Setting the Optimization Parameters

We first define a few important constraints and thresholds that guide the optimization process:

- **Irrelevant Chunk Penalty (`0.2`)**:  
  This value is subtracted from each chunk's relevance score to convert it into a "value." Chunks with low original relevance will now have negative values, while highly relevant ones remain positive. This makes it easier to find high-density regions of relevance using a simple sum.

- **Maximum Segment Length (`20` chunks)**:  
  This restricts how long a single segment can be. It ensures we don’t create excessively large spans, which could reduce precision.

- **Overall Maximum Length (`30` chunks)**:  
  This is a hard limit on the total number of chunks that can be selected across all segments. It helps control context size and prevents the model input from overflowing.

- **Minimum Value Threshold (`0.7`)**:  
  This ensures that only meaningful, high-value segments are retained. Segments with total value below this threshold are discarded.



### ➖ Converting Relevance Scores into Optimization-Friendly Values

To enable the optimizer to find high-value clusters, we convert each chunk’s relevance score into a new value by subtracting a constant penalty. This transformation ensures:
- **Positive values** indicate useful, highly relevant chunks.
- **Negative values** indicate noise or irrelevant content.

This lets us use a greedy optimization approach to find sequences of chunks with maximum total value — i.e., those that are not only relevant but tightly packed together in the document.



### 🚀 Running the Optimizer

We now run the segment selection algorithm. It scans through the adjusted chunk values and greedily selects **non-overlapping segments** that:
- Respect the `max_length` and `overall_max_length` constraints,
- And each individually score at or above the `minimum_value` threshold.

This results in a list of the most valuable segments from the document — clusters of chunks that are highly relevant **and** contextually cohesive.



### 🧾 Interpreting the Output

Once the optimizer finishes, it returns:
- The **indices** of the best segments (start and end positions of each cluster),
- The **value scores** of each segment (total relevance after applying the penalty).

This gives us a clean and interpretable way to **identify exactly which parts of the document are worth passing to the LLM** — not based on arbitrary rules, but on quantified relevance and context density.
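
One way to turn those index pairs into LLM context is sketched below. `segments_to_context` is a hypothetical helper that is not part of the original notebook, and the `---` separator between segments is an arbitrary choice.

```python
from typing import List, Tuple

def segments_to_context(chunks: List[str], segments: List[Tuple[int, int]]) -> str:
    """Join each (start, end) segment's chunks, in document order, into one context string."""
    parts = []
    for start, end in sorted(segments):
        # The end index is non-inclusive, matching the optimizer's output
        parts.append("\n".join(chunks[start:end]))
    return "\n\n---\n\n".join(parts)

# Toy data standing in for real chunks and optimizer output
toy_chunks = [f"chunk {i}" for i in range(10)]
toy_segments = [(6, 8), (1, 3)]

context = segments_to_context(toy_chunks, toy_segments)
print(context)
```

Sorting the segments restores document order, since the optimizer returns them by value rather than by position.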



### ✅ Why This Matters

Instead of feeding isolated chunks into the LLM — which may be individually relevant but lack surrounding context — we now extract **tight, coherent clusters**. These segments:
- Preserve narrative flow,
- Carry stronger semantic signals,
- And lead to significantly better downstream answer quality in RAG pipelines.

This method offers a **principled, efficient way to ground generation in the best parts of the document**.


In [None]:
# define some parameters and constraints for the optimization
irrelevant_chunk_penalty = 0.2 # empirically, something around 0.2 works well; lower values bias towards longer segments
max_length = 20
overall_max_length = 30
minimum_value = 0.7

# subtract constant threshold value from chunk relevance values
relevance_values = [v - irrelevant_chunk_penalty for v in chunk_values] 

# run the optimization
best_segments, scores = get_best_segments(relevance_values, max_length, overall_max_length, minimum_value)

# print results
print("Best segment indices")
print(best_segments)  # indices of the best segments, with the end index non-inclusive
print()
print("Best segment scores")
print(scores)
print()

## 🔎 What Happens When Only a Single Chunk is Relevant?

RSE (Relevant Segment Extraction) isn’t just for finding large, dense clusters of relevant content — it’s also smart enough to handle the opposite case.

When the answer to a query is contained in just **one or two isolated chunks**, RSE doesn’t try to forcefully build large segments. Instead, it gracefully falls back to behavior that resembles **top-k retrieval**, returning those specific high-scoring chunks on their own. 

This adaptive behavior makes RSE robust:  
- It performs well whether the information is **spread across multiple nearby chunks**  
- Or **concentrated in just a single chunk**

If you visualize the chunk relevance values for such a query, you’ll typically see isolated spikes instead of smooth clusters — and RSE will correctly identify and return those spikes.
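
The fallback can be illustrated with a small brute-force search over made-up values. `best_span` is a simplified, hypothetical stand-in for `get_best_segments` that returns only the single best span, and both score lists are invented for the example.

```python
from typing import List, Optional, Tuple

def best_span(values: List[float], max_length: int) -> Optional[Tuple[int, int]]:
    """Brute-force search for the contiguous span (end non-inclusive) with the highest total."""
    best, best_total = None, float("-inf")
    for start in range(len(values)):
        for end in range(start + 1, min(start + max_length, len(values)) + 1):
            total = sum(values[start:end])
            if total > best_total:
                best, best_total = (start, end), total
    return best

penalty = 0.2

# One isolated spike: every neighbour has a negative value after the penalty
spike_values = [s - penalty for s in [0.05, 0.0, 0.95, 0.1, 0.0]]
spike_result = best_span(spike_values, max_length=3)

# A small cluster: two strong chunks around one weak chunk
cluster_values = [s - penalty for s in [0.1, 0.9, 0.15, 0.85, 0.1]]
cluster_result = best_span(cluster_values, max_length=3)

print("isolated spike:", spike_result)
print("cluster:       ", cluster_result)
```

With a single spike, extending the span in either direction only adds negative values, so the best segment is one chunk long. With a cluster, bridging the weak middle chunk pays off.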

---

## 📊 Performance Impact (Based on Public Benchmarks)

Although we haven’t run the evaluations ourselves, existing benchmarks suggest that RSE leads to **significant improvements in answer quality** over standard top-k retrieval — even when cost and context size are kept roughly constant.

Some key takeaways from the evaluations:
- RSE consistently outperforms top-k retrieval across **diverse datasets** — including long PDFs, markdown handbooks, court opinions, and 10-K filings.
- Average quality scores improved by over **40%** in certain settings.
- When paired with techniques like **Contextual Chunk Headers (CCH)**, RSE has shown strong gains on finance-focused QA tasks as well.

These results reinforce the core value of RSE:  
By focusing not just on individual chunk scores but also on **how relevant chunks cluster together**, RSE provides the LLM with richer, more coherent context — leading to more accurate answers.

---

✅ Whether you're dealing with scattered facts or dense documents, RSE adapts intelligently to return the best possible input for your RAG system.


---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 🔍 Why Use RSE in RAG?

Standard top-k retrieval often returns scattered chunks, missing the fact that relevant information is **clustered** in documents.

**Relevant Segment Extraction (RSE)** solves this by:
- 📊 Scoring every chunk with a **reranker**, then converting the scores into positive and negative chunk values
- 🧱 Extracting **coherent high-relevance segments** instead of isolated chunks
- 🎯 Improving **context quality** for LLMs with minimal extra cost

---

## 🧠 What’s New in This Version?

This RSE implementation includes:

- 📊 **Smoothed relevance scoring** via a beta transformation and rank-based decay  
- 🧠 **Penalty-shifted chunk values**, so irrelevant chunks count against a segment  
- ⚙️ **Greedy selection of best non-overlapping spans**  
- 📦 **Lightweight, reusable logic** for RAG pipelines  

---

## 📈 Inferences & Key Takeaways

- 🚀 RSE improves grounding and relevance over top-k retrieval  
- 🧠 Helps especially on **long, dense, or structured documents**  
- 🔄 Falls back to top-k behavior when only isolated chunks are relevant  

---

## 🚀 What Could Be Added Next?

- 🤖 Swap in other rerankers, or an LLM-based scorer such as GPT-4o, alongside Cohere  
- 📊 Evaluate on datasets like KITE or FinanceBench  
- 🔧 Wrap as a LangChain retriever for modular use  

---
## 💡 Final Word

This notebook is part of my larger personal project: **RAG100X**, a challenge to build and log my journey in RAG from 0 to 100 over the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.