# 🧠 RAG-Boost: LLM-Enriched Summaries ETL Pipeline

## Overview

This notebook implements a **RAG-Boost pipeline** that enriches documents with LLM-generated summaries before chunking and embedding. The aim is to give the retriever concise, high-level text to match against, which can improve semantic search quality and retrieval precision.

### Pipeline Flow

```
Reader2Doc → Cleaner (optional) → LLM Summary → Splitter → Embeddings → Database
```

### Key Benefits

- **Executive Summaries**: Generate concise abstracts of lengthy reports and documents
- **Faster Retrieval**: Improve search performance over long or verbose documents
- **Enhanced Precision**: Better semantic matching through enriched context
- **Compliance & Playbooks**: Concise abstraction improves precision in policy documents
- **Knowledge Distillation**: Extract key insights from technical documentation
- **Context-Aware Search**: Enhanced semantic search with better context awareness

### Architecture Highlights

1. **Document-Level Enrichment**: Uses LLM to generate abstractive summaries and keywords
2. **Dual Context**: Prepends summaries to original text for enhanced semantic context
3. **Hallucination Mitigation**: Keeps original chunks in database for grounding
4. **Flexible Design**: Can be adapted to chunk-first-then-enrich for finer control

---

## Downloading Files

In [1]:
base_url = "https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/training-spark-nlp-v6-readers/tutorials/Certification_Trainings/Public/data/readers"

In [2]:
!mkdir -p all-files/word

In [3]:
!wget "{base_url}/Clinical_Notes.docx" -P all-files/word

--2025-10-23 13:53:46--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/training-spark-nlp-v6-readers/tutorials/Certification_Trainings/Public/data/readers/Clinical_Notes.docx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37332 (36K) [application/octet-stream]
Saving to: ‘all-files/word/Clinical_Notes.docx’


2025-10-23 13:53:46 (8.99 MB/s) - ‘all-files/word/Clinical_Notes.docx’ saved [37332/37332]



## 📦 Step 0: Import Dependencies and Setup

We'll import all necessary libraries from **Spark NLP** for building our ETL pipeline. This includes:

- **PySpark**: For distributed data processing
- **Spark NLP**: For NLP transformations (document reading, normalization, summarization, embeddings)
- **Reader2Doc**: For ingesting various document formats (PDF, Word, HTML, email)

In [4]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.4.4 and Spark NLP 6.2.0
setup Colab for PySpark 3.4.4 and Spark NLP 6.2.0
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 311.4/311.4 MB 1.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 743.3/743.3 kB 45.0 MB/s eta 0:00:00
  Building wheel for pyspark (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dataproc-spark-connect 0.8.3 requires pyspark[connect]~=3.5.1, but you have pyspark 3.4.4 which is incompatible.

In [5]:
# Import PySpark dependencies
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml import Pipeline

# Import Spark NLP components
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher, Finisher
from sparknlp.annotator import (
    DocumentNormalizer,
    SentenceDetector,
    T5Transformer,
    BertSentenceEmbeddings,
)

# Document reader for various formats
from sparknlp.reader.reader2doc import Reader2Doc

## 🚀 Step 1: Initialize Spark Session

Configure Spark with appropriate memory settings for processing large documents.

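**Key configurations** (the first two match the defaults that `sparknlp.start()` applies):
- `spark.driver.memory`: 16GB for handling large documents
- `spark.kryoserializer.buffer.max`: 2GB for serialization
- Spark NLP JAR path: only relevant when you build the `SparkSession` yourself instead of calling `sparknlp.start()` (adjust to your environment)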
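
If you prefer to state these settings explicitly rather than rely on defaults, a minimal sketch follows. It assumes the `memory` and `params` keyword arguments of `sparknlp.start()`; the helper names `build_spark_params` and `start_session` are hypothetical, and the values simply mirror the bullets above.

```python
def build_spark_params(kryo_buffer_max="2000M"):
    """Extra Spark settings to pass through sparknlp.start(params=...)."""
    return {"spark.kryoserializer.buffer.max": kryo_buffer_max}


def start_session(driver_memory="16G"):
    """Start a Spark NLP session with explicit memory settings."""
    # Imported lazily so the helper above can be used without Spark installed
    import sparknlp

    return sparknlp.start(memory=driver_memory, params=build_spark_params())
```

Calling `spark = start_session()` would then take the place of the bare `sparknlp.start()` call.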

In [6]:
import sparknlp

spark = sparknlp.start()
print("✅ Spark session initialized successfully")

# Create empty DataFrame for Reader2Doc initialization
empty_df = spark.createDataFrame([], "string").toDF("text")

✅ Spark session initialized successfully


## 📄 Step 2: Document Ingestion with Reader2Doc

**Reader2Doc** is a powerful component that can ingest various document formats:
- PDF files
- Microsoft Word documents (.doc, .docx)
- HTML pages
- Email formats

It extracts text content while preserving document structure and metadata.

### Configuration Options:
- **Path-based**: Read directly from file system (shown below)
- **Column-based**: Read from a DataFrame column containing document content
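
When reading a mixed directory, it can help to choose the content type per file. The sketch below is one way to do that in plain Python. Only `"application/msword"` appears in this notebook; the other values are standard MIME types that are assumed, not verified here, to be accepted by `setContentType`. The `CONTENT_TYPES` mapping and `content_type_for` helper are hypothetical names.

```python
from pathlib import Path

# Only "application/msword" is taken from this notebook; the remaining entries
# are standard MIME types assumed to be valid for Reader2Doc.setContentType.
CONTENT_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/msword",
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".htm": "text/html",
    ".eml": "message/rfc822",
}


def content_type_for(path):
    """Return the content type for a file path, or None if the extension is unknown."""
    return CONTENT_TYPES.get(Path(path).suffix.lower())


print(content_type_for("all-files/word/Clinical_Notes.docx"))
```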

In [7]:
# Configure input document path
docs_path = "all-files/word/Clinical_Notes.docx"  # Can be a single file or directory

# Initialize Reader2Doc
reader2doc = Reader2Doc() \
    .setContentType("application/msword") \
    .setContentPath(docs_path) \
    .setOutputCol("document")

# Alternative: Read from string column in DataFrame
# reader2doc = Reader2Doc().setInputCol("content").setOutputCol("document")

df_in = empty_df  # Triggers Reader2Doc to read from paths

print(f"✅ Reader2Doc configured to read from: {docs_path}")

✅ Reader2Doc configured to read from: all-files/word/Clinical_Notes.docx


## 🧹 Step 3: Text Cleaning and Normalization (Optional)

The **DocumentNormalizer** performs minimal preprocessing to clean the text:

- Removes bullet points and list markers
- Normalizes whitespace

**Note**: We keep preprocessing minimal to preserve the semantic content for LLM summarization.

## 🧹 New: `autoMode` and `presetPattern` in `DocumentNormalizer`

In [8]:
# Configure text cleaner
cleaner = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalized_document") \
    .setAutoMode("DOCUMENT_CLEAN")

print("✅ Document normalizer configured")

✅ Document normalizer configured



The `DocumentNormalizer` now supports **automatic cleaning modes** that let you easily remove unwanted characters, bullets, punctuation, and other formatting issues — **without writing custom regex rules**.

You can now use:

- **`presetPattern`** → to apply a *single* specific cleaning rule (for example `"CLEAN_BULLETS"` or `"REMOVE_PUNCTUATION"`).  
- **`autoMode`** → to apply a *group of related cleaning functions* automatically.

### ⚙️ Available Auto Modes

| Auto Mode | What it does under the hood |
|------------|------------------------------|
| **`LIGHT_CLEAN`** | Removes extra whitespace and trailing punctuation. |
| **`DOCUMENT_CLEAN`** | Cleans bullets (`•`), ordered bullets (`1.` or `a.`), dashes, and extra whitespace — great for document-style text. |
| **`SOCIAL_CLEAN`** | Removes punctuation, dashes, and extra whitespace — ideal for tweets, posts, or chat text. |
| **`HTML_CLEAN`** | Replaces Unicode symbols, removes non-ASCII characters, and decodes HTML entities like `&copy;` → `©`. |
| **`FULL_AUTO`** | Applies *all* cleaning functions together for maximum normalization. |
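
To make the table more concrete, here is a rough plain-Python approximation of what a `DOCUMENT_CLEAN`-style pass does. The regular expressions and the `document_clean` helper are assumptions written for illustration, not Spark NLP's actual patterns, and the sample text is made up.

```python
import re


def document_clean(text):
    """Rough approximation of a DOCUMENT_CLEAN pass (not Spark NLP's real patterns)."""
    cleaned_lines = []
    for line in text.splitlines():
        # Drop a leading bullet symbol, hyphen, en dash, or em dash
        line = re.sub(r"^\s*[\u2022\-\u2013\u2014]+\s*", "", line)
        # Drop a leading ordered marker such as "1." or "a."
        line = re.sub(r"^\s*(?:\d+|[A-Za-z])\.\s+", "", line)
        cleaned_lines.append(line)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", " ".join(cleaned_lines)).strip()


sample = "• Blood pressure stable\n1. Continue  medication\n- Follow up in 2 weeks"
print(document_clean(sample))
```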

## 🤖 Step 4: LLM-Based Summarization

This is the **core innovation** of the RAG-Boost pipeline. We use a **T5 Transformer** with the `summarize:` task to generate:
- Abstractive summaries of documents
- Condensed statements of their key phrases and concepts

### Why This Matters:
1. **Context Enrichment**: Summaries provide high-level context for each chunk
2. **Better Embeddings**: Enriched text creates more meaningful vector representations
3. **Improved Retrieval**: Search queries match better against summaries + content

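### Process:
1. Generate a summary with T5 (output capped at 256 tokens via `setMaxOutputLength`)
2. Finish annotations to extract clean text
3. Prepend the summary to the original document for dual-context embedding

**Note**: The cells in this notebook implement step 1 only. The `summary` column is split and embedded directly, so steps 2 and 3 must be added to obtain dual-context chunks.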
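
The prepend step can be sketched as a small pure function. This is a minimal illustration under stated assumptions: `build_enriched_text` is a hypothetical helper, the blank-line separator is an arbitrary choice, and the sample strings are made up.

```python
def build_enriched_text(summary_sentences, original_text, separator="\n\n"):
    """Prepend the joined summary to the original text.

    Falls back to the original text alone when the summary is empty.
    """
    summary = " ".join(s.strip() for s in summary_sentences if s and s.strip())
    if not summary:
        return original_text
    return f"{summary}{separator}{original_text}"


# Made-up sample strings, for illustration only
summary_sentences = ["Patient seen for follow-up.", "Medication unchanged."]
original_text = "The patient attended a routine follow-up visit and reported no new symptoms."
enriched = build_enriched_text(summary_sentences, original_text)
print(enriched)
```

In Spark, a function like this could be wrapped in a UDF over the summary and normalized text, with a `DocumentAssembler` converting the resulting string column back into a document annotation.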

In [9]:
# Configure T5 summarizer
summarizer = T5Transformer.pretrained("t5_small", "en") \
    .setTask("summarize:") \
    .setInputCols("normalized_document") \
    .setOutputCol("summary") \
    .setMaxOutputLength(256)  # Adjust based on your needs

print("✅ T5 summarizer configured")

t5_small download started this may take some time.
Approximate size to download 241.9 MB
[OK!]
✅ T5 summarizer configured


## 🔄 Step 5: Build and Execute Enrichment Pipeline

Now we combine the stages into a single pipeline and execute the transformation:

1. **Read** documents
2. **Clean** and normalize text
3. **Summarize** using LLM

Prepending summaries to the original text (the **Enrich** step) is not part of this pipeline. The output DataFrame keeps both inputs for it side by side:
- The LLM-generated summary in `summary` (high-level context)
- The original text in `document` and `normalized_document` (grounding and detail)

In [10]:
# Build the enrichment pipeline
enrich_pipe = Pipeline(stages=[
    reader2doc,
    cleaner,
    summarizer
])

# Execute pipeline
print("🔄 Running enrichment pipeline...")
enriched_df = enrich_pipe.fit(df_in).transform(df_in)

print("✅ Documents enriched with LLM summaries")
enriched_df.select("summary.result").show(1, truncate=200)

🔄 Running enrichment pipeline...
✅ Documents enriched with LLM summaries
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  result|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Clinical Note Summary is a comprehensive summary of the clinical notes., Overview of the world of information and information., this document summarizes key elements of the patient's recent visit ...|
+------------------------------------------------------------------------------------------------------------------

## ✂️ Step 6: Document Chunking

Split the summarized documents into **manageable chunks** (sentences) for embedding:

- **SentenceDetector**: Splits text into sentences using:
  - Grammar rules
  - Abbreviation handling
  - Context-aware segmentation
- **DocumentAssembler**: Only needed if you first prepend summaries to the original text as a plain string column, to convert that column back to document format

The splitter below reads the `summary` column, so each chunk is one sentence of the generated summary.

In [11]:
# Split into sentences
splitter = SentenceDetector() \
    .setInputCols("summary") \
    .setOutputCol("sentences") \
    .setUseAbbreviations(True)  # Handle abbreviations intelligently

print("✅ Document chunking configured")

✅ Document chunking configured


## 🎯 Step 7: Generate Sentence Embeddings

Convert each sentence chunk into a **dense vector representation** using BERT:

- **Model**: `sent_small_bert_L2_128` - Efficient sentence embeddings
- **Output**: Vector embeddings that capture semantic meaning
- **Benefit**: Chunks that carry summary context are intended to yield more meaningful embeddings (in this notebook, the chunks are summary sentences)

These embeddings will enable semantic search in your RAG system.
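
As an illustration of how stored embeddings support semantic search, the sketch below ranks chunks by cosine similarity to a query vector. It is plain Python with toy 3-dimensional vectors and made-up chunk texts; `cosine_similarity` and `top_k` are hypothetical helpers, and a real system would embed the query with the same model and delegate ranking to a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=3):
    """Return the k highest-scoring (score, chunk_text) pairs."""
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy data: (chunk_text, embedding) pairs
chunks = [
    ("chunk about medication", [0.9, 0.1, 0.0]),
    ("chunk about scheduling", [0.0, 0.2, 0.9]),
    ("chunk about dosage", [0.8, 0.3, 0.1]),
]
best = top_k([1.0, 0.0, 0.0], chunks, k=2)
print([text for _, text in best])
```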

In [12]:
# Configure BERT sentence embeddings
emb = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en") \
    .setInputCols("sentences") \
    .setOutputCol("sentence_embeddings")

# Build post-processing pipeline
post_pipe = Pipeline(stages=[
    splitter,      # Split into sentences
    emb            # Generate embeddings
])

# Execute pipeline
print("🔄 Generating sentence embeddings...")
df_vec = post_pipe.fit(enriched_df).transform(enriched_df)

df_vec.show()

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
🔄 Generating sentence embeddings...
+-------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+
|           fileName|            document|exception| normalized_document|             summary|           sentences| sentence_embeddings|
+-------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+
|Clinical_Notes.docx|[{document, 0, 20...|     null|[{document, 0, 20...|[{document, 0, 70...|[{document, 0, 70...|[{sentence_embedd...|
+-------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+



In [13]:
# Sanity check: verify sentence and embedding counts match
df_vec.selectExpr(
    "size(sentences) as n_sent",
    "size(sentence_embeddings) as n_emb"
).show(5)

print("✅ Sentence embeddings generated successfully")

+------+-----+
|n_sent|n_emb|
+------+-----+
|    33|   33|
+------+-----+

✅ Sentence embeddings generated successfully


## 📊 Step 8: Flatten to Chunk-Embedding Pairs

Transform the nested structure into a flat format suitable for database storage:

**Process**:
1. Extract raw float arrays from annotations
2. Zip sentences with their corresponding embeddings
3. Explode to create one row per (chunk_text, embedding) pair
4. Clean and prepare for persistence

**Result**: A DataFrame with:
- `chunk_text`: The sentence/chunk text
- `embedding`: Dense vector representation
- `pipeline`: Pipeline identifier for tracking
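
The zip-and-explode logic can be pictured in plain Python before looking at the Spark version. This is a sketch only: `flatten_pairs` is a hypothetical helper, the vectors are toy values, and the `"rag_boost_llm_only"` label is borrowed from the output file name used later in the notebook.

```python
def flatten_pairs(sentences, embeddings, pipeline_id):
    """Pair each sentence with its embedding and return one dict per chunk."""
    if len(sentences) != len(embeddings):
        raise ValueError("sentences and embeddings must have the same length")
    rows = []
    for text, vector in zip(sentences, embeddings):
        # Mirror dropna(): skip pairs with a missing text or vector
        if text is None or vector is None:
            continue
        rows.append({"chunk_text": text, "embedding": list(vector), "pipeline": pipeline_id})
    return rows


# Toy 3-dimensional vectors; the notebook's model produces longer ones
sentences = ["First summary sentence.", "Second summary sentence.", None]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
rows = flatten_pairs(sentences, embeddings, pipeline_id="rag_boost_llm_only")
print(len(rows))
```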

In [14]:
# Extract raw float arrays from embeddings
df_vec = df_vec.withColumn(
    "emb_vecs",
    F.expr("transform(sentence_embeddings, x -> x.embeddings)")
)

# Zip sentences with their embeddings
df_vec = df_vec.withColumn(
    "pairs",
    F.arrays_zip(F.col("sentences.result"), F.col("emb_vecs"))
)

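# Explode to one row per chunk and tag each row with a pipeline identifier
df_chunks = df_vec \
    .withColumn("pair", F.explode_outer("pairs")) \
    .select(
        F.col("pair.result").alias("chunk_text"),
        F.col("pair.emb_vecs").alias("embedding"),
        F.lit("rag_boost_llm_only").alias("pipeline")  # label taken from the output file name
    ) \
    .dropna(subset=["chunk_text", "embedding"])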

print("✅ Data flattened to chunk-embedding pairs")
df_chunks.select("chunk_text", "embedding").show(5, truncate=100)

✅ Data flattened to chunk-embedding pairs
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                          chunk_text|                                                                                           embedding|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                             Clinical Note Summary is a comprehensive summary of the clinical notes.|[-1.5387037, 0.8033937, 0.1073381, -2.0959268, 0.49129936, -0.09233003, 0.07712458, -0.7835938, -...|
|                                               Overview of the world of information and information.|[-0.83442736, 0.8440716, 0.23424295, -2.

## 💾 Step 9: Persist to Storage

Save the processed chunks and embeddings to **Parquet format** for efficient storage and retrieval:

**Benefits of Parquet**:
- Columnar storage for efficient querying
- Native support for complex types (arrays/vectors)
- Excellent compression
- Fast read performance

**Next Steps**: Load this data into your vector database (e.g., Pinecone, Weaviate, Milvus) for semantic search.

In [15]:
# Define output path
out_path = "datasets/rag_boost_llm_only.parquet"

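# Save to Parquet
print(f"💾 Saving results to: {out_path}")
df_chunks.write.mode("overwrite").parquet(out_path)

# Count from the saved file so the T5 and BERT stages are not recomputed
n_chunks = spark.read.parquet(out_path).count()
print(f"✅ Successfully saved {n_chunks} chunks to: {out_path}")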

💾 Saving results to: datasets/rag_boost_llm_only.parquet
✅ Successfully saved 33 chunks to: datasets/rag_boost_llm_only.parquet


## 🎓 Key Takeaways and Best Practices

### What We Accomplished:
1. ✅ Ingested a Word document with Reader2Doc (PDF, HTML, and email are also supported)
2. ✅ Generated LLM-powered abstractive summaries
3. ✅ Split the summaries into sentence-level chunks
4. ✅ Created semantic embeddings for vector search
5. ✅ Persisted results in efficient format

### Pipeline Variations:

#### Current Approach: **Document-Level Enrichment**
- Summarize entire document first
- Then split and embed
- Best for: Executive summaries, high-level context

#### Alternative: **Chunk-Level Enrichment**
- Split document first
- Enrich each chunk individually
- Best for: Fine-grained control, section-specific summaries
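
A minimal plain-Python sketch of the chunk-first variant follows. Everything here is illustrative: `split_sentences` is a naive stand-in for `SentenceDetector`, `first_sentence` is a placeholder summarizer standing in for the T5 stage, and the sample text is made up.

```python
import re


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_then_enrich(text, summarize, sentences_per_chunk=2):
    """Split first, then summarize each chunk and prepend its summary."""
    sentences = split_sentences(text)
    enriched = []
    for start in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[start:start + sentences_per_chunk])
        summary = summarize(chunk)
        enriched.append({
            "chunk_text": chunk,
            "summary": summary,
            "enriched_text": f"{summary}\n\n{chunk}",
        })
    return enriched


def first_sentence(chunk):
    """Placeholder summarizer: returns the chunk's first sentence."""
    return split_sentences(chunk)[0]


text = "Vitals were stable. No fever was noted. Labs were ordered. Follow-up is in two weeks. Diet was reviewed."
result = chunk_then_enrich(text, summarize=first_sentence)
print(len(result))
```

Because each chunk gets its own summary, the cost grows with the number of chunks rather than the number of documents, which is the main trade-off against the document-level approach.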

### Hallucination Mitigation:
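- Keep the original text alongside the summaries (`document` and `normalized_document` remain available in `enriched_df`)
- Store original chunks in the database as well for grounding; the Parquet file written in Step 9 contains summary sentences only
- Have the retrieval system return original chunks for verification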

### Production Considerations:
1. **Model Selection**: Choose T5 size based on quality/speed tradeoff
2. **Chunk Size**: Adjust sentence detection for optimal chunk granularity
3. **Embedding Model**: Select BERT variant based on domain (clinical, legal, etc.)
4. **Vector Database**: Integrate with Pinecone, Weaviate, or Milvus
5. **Monitoring**: Track summary quality, embedding coherence, retrieval metrics
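
For the retrieval metrics mentioned under monitoring, two common choices are recall@k and reciprocal rank. The sketch below is a plain-Python illustration with made-up chunk IDs; the function names are hypothetical, and it assumes you have a labeled set of relevant chunks per query.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item_id in enumerate(retrieved_ids, start=1):
        if item_id in relevant:
            return 1.0 / rank
    return 0.0


# Made-up IDs for one query
retrieved = ["c3", "c1", "c7", "c2"]
relevant = ["c1", "c2"]
print(recall_at_k(retrieved, relevant, k=2), reciprocal_rank(retrieved, relevant))
```

Averaging these values over a fixed query set before and after a pipeline change gives a simple way to compare the two versions.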

---

## 🚀 Next Steps

To use this pipeline in production:

1. **Configure your environment**: Update Spark NLP JAR path and memory settings
2. **Point to your documents**: Update `docs_path` to your document directory
3. **Tune parameters**: Adjust summary length, embedding model, chunk size
4. **Run the pipeline**: Execute all cells sequentially
5. **Load to vector DB**: Import the generated Parquet file into your vector database
6. **Build RAG app**: Use the embeddings for semantic search and retrieval

### Integration Example:

```python
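import uuid

# Load processed chunks (assumes the `spark` session from this notebook)
chunks_df = spark.read.parquet("datasets/rag_boost_llm_only.parquet")

# In-memory stand-in for a vector database; replace with your client's upsert call
vector_db = {}

# Stream rows to the driver instead of collecting the whole DataFrame at once
for row in chunks_df.toLocalIterator():
    fields = row.asDict()
    vector_db[str(uuid.uuid4())] = {
        "text": fields["chunk_text"],
        "embedding": list(fields["embedding"]),
        "metadata": {"pipeline": fields.get("pipeline")},  # None if the column is absent
    }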
```

### Resources:
- [Spark NLP Documentation](https://nlp.johnsnowlabs.com/)
- [RAG Best Practices](https://www.pinecone.io/learn/retrieval-augmented-generation/)
- [Vector Database Comparison](https://www.datastax.com/blog/vector-database-comparison)

---

**Happy Building! 🎉**