# BSSC-QA Framework: Synthetic Question-Answer Generation Pipeline

**BSSC-QA** is a modular framework for generating high-quality question-answer pairs from text documents using multi-agent orchestration and RAG (Retrieval-Augmented Generation).

## Core Components:
- **Document Processor**: Loads and cleans text documents
- **Chunker**: Splits documents into semantic chunks
- **Vector Store**: Indexes chunks for retrieval (ChromaDB + embeddings)
- **Generator Agent**: Creates questions from chunks
- **Synthesis Agent**: Generates evidence-based answers
- **Evaluator Agent**: Assesses QA quality with multi-metric scoring
- **Pipeline Orchestrator**: Coordinates the entire workflow

This demo uses four Project Gutenberg (https://www.gutenberg.org/) books to showcase the complete pipeline.

In [None]:
from pathlib import Path
import sys, os, shutil
import textwrap

PROJECT_ROOT = Path("/home/kaizu/Projects/test/BSSC_QA")  # Change to your BSSC-QA directory
if not (PROJECT_ROOT / "bssc_qa").exists():
    print("Warning: 'bssc_qa' not found under PROJECT_ROOT; falling back to the parent directory.")
    PROJECT_ROOT = PROJECT_ROOT.parent

SRC_PATH = PROJECT_ROOT / "bssc_qa" / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.append(str(SRC_PATH))

DATA_DIR = PROJECT_ROOT / "data" / "papers"
paper_paths = sorted(DATA_DIR.glob("*.txt"))[:4]

print("Selected papers:")
for path in paper_paths:
    size_kb = path.stat().st_size / 1024
    print(f"  - {path.name} ({size_kb:.1f} KB)")
print(f"Workspace root: {PROJECT_ROOT}")

Selected papers:
  - G K Chesterton___The Man Who Knew Too Much.txt (326.7 KB)
  - Herbert Spencer___Essays on Education and Kindred Subjects.txt (876.6 KB)
  - Jack London___The Faith of Men.txt (258.8 KB)
  - Rudyard Kipling___The Jungle Book.txt (272.1 KB)
Workspace root: /home/kaizu/Projects/test/BSSC_QA


### Create/Edit a new config

BSSC-QA uses a centralized JSON config with these key sections:

**LLM Providers**: Multi-provider support (Gemini, DeepSeek, Mistral, HuggingFace)
- Each agent can use a different provider
- Configurable temperature and model selection

**Vector Store**: ChromaDB with customizable embeddings
- `offline-hash`: Fast, deterministic hashing
- `sentence-transformers`: Semantic embeddings

**Chunking Strategy**: Adaptive text splitting
- Default: 512 tokens with 50-token overlap
- Auto-adjusts based on model context window

**Agent Configuration**:
- Generator: LLM provider + retry logic
- Synthesis: Context window size + evidence span limits
- Evaluator: Quality threshold + custom metrics
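
Because each agent names its own provider, a typo in one of those names only surfaces when that agent first runs. The sketch below shows one way to catch that early. It assumes the config is a plain dictionary shaped like the one in the commented cell of this section; `find_unknown_providers` is a hypothetical helper, not part of BSSC-QA.

```python
def find_unknown_providers(config: dict) -> list[str]:
    """Return 'agent -> provider' entries whose provider is not defined under llm.providers."""
    known = set(config.get("llm", {}).get("providers", {}))
    problems = []
    for agent_name, agent_cfg in config.get("agents", {}).items():
        provider = agent_cfg.get("provider")
        if provider is not None and provider not in known:
            problems.append(f"{agent_name} -> {provider}")
    return problems


# Trimmed-down example config with placeholder values only.
example_config = {
    "llm": {
        "default_provider": "gemini",
        "providers": {"gemini": {"model": "gemini-2.5-flash"}},
    },
    "agents": {
        "generator": {"provider": "gemini"},
        "synthesis": {"provider": "deepseek"},  # deliberately missing above
    },
}

unknown = find_unknown_providers(example_config)
print(unknown)
```

Running a check like this right after loading the config keeps provider mistakes out of the long-running generation step.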

*Note: The config is already loaded for this demo. Uncomment the cell below to create/modify your own.*

In [None]:
# import json

# # Default configuration
# config = {
#     "llm": {                                                                                    # LLM provider settings (You can add or remove providers)
#         "default_provider": "gemini",
#         "providers": {
#             "gemini": {                                       
#                 "api_key": "your_api_key_here",
#                 "model": "gemini-2.5-flash",
#                 "temperature": 0.7
#             },
#             "deepseek": {
#                 "api_key": "your_api_key_here",
#                 "model": "deepseek-chat",
#                 "temperature": 0.7
#             },
#             "mistral": {
#                 "api_key": "your_api_key_here",
#                 "model": "mistral-large-latest",
#                 "temperature": 0.7
#             },
#             "huggingface": {
#                 "api_key": "your_api_key_here",
#                 "model": "meta-llama/Llama-3.1-8B-Instruct",
#                 "temperature": 0.7
#             }
#         }
#     },
#     "vector_store": {                                                                                 # Vector store settings                             
#             "type": "chromadb",
#             "persist_directory": "./data/chroma_db",
#             "collection_name": "demo",
#             "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"                               # Options: offline-hash, sentence-transformers/all-MiniLM-L6-v2
#         },
#     "prompts": {
#             "path": "prompts/default_prompt.json"
#         },
#     "chunking": {
#         "chunk_size": 512,                                                                        # (~40 sentences, ~2,000 chars)
#         "chunk_overlap": 50,
#         "auto_adjust": True                                                                       # Auto-adjust chunk size based on model context window              
#     },
#     "agents": {
#         "planner": {                                                                              
#             "enabled": False,
#             "provider": "gemini"
#         },
#         "generator": {
#             "provider": "gemini",
#             "max_retries": 3
#         },
#         "synthesis": {
#             "provider": "deepseek",
#             "context_window": 3,                                                                  # Number of top relevant chunks to consider
#             "max_evidence_spans": 3                                                               # Number of evidence spans to cite in the answer
#         },
#         "evaluator": {
#             "provider": "mistral",
#             "quality_threshold": 0.75,
#             "metrics": ["relevance", "clarity", "completeness", "factuality", "diversity"]         # Evaluation metrics (Adjust as needed) 
#         }
#     },
#     "bloom_level": {
#         "enabled": False,
#         "levels": ["remember", "understand", "apply", "analyze", "evaluate", "create"]             # Bloom's taxonomy levels
#     },
#     "human_review": {
#         "enabled": False,                                                                          # Enable human review step
#         "review_threshold": 0.6
#     },
#     "export": {
#         "format": "json",
#         "include_metadata": True,
#         "output_path": "./data/output"
#     }
# }

# # Save config
# config_path = PROJECT_ROOT / 'config.json'
# with open(config_path, 'w') as f:
#     json.dump(config, indent=2, fp=f)

# print(f"✅ Configuration saved to: {config_path}")
# print("⚠️  Remember to update API keys in config.json")

## Load the config

In [3]:
from core.config import load_config

# Reload config (in case you updated API keys)
cfg = load_config(PROJECT_ROOT / 'config.json')

## Document Loading & Preprocessing

The pipeline starts by ingesting raw text files and preparing them for chunking.

**Key Operations**:
- Strip Gutenberg headers/footers and metadata
- Normalize whitespace and remove special characters
- Extract document metadata (title, author, source)
===================================================================================================================
- **Component:** `pipeline.document_loaders.load_document` + `utils.text_processing.clean_text/normalize_text`
- **Input:** Individual `.txt` docs.
- **Output:** Cleaned text plus structured metadata ready for semantic chunking.
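
The cleaning operations listed above can be approximated with a few regular expressions. This is a minimal sketch, not the code behind `clean_text` or `normalize_text`: the function names are hypothetical, and it assumes the common Project Gutenberg convention of `*** START OF ... ***` and `*** END OF ... ***` marker lines, whose exact wording varies between releases.

```python
import re

# Marker lines that usually wrap the book text in Project Gutenberg files.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*", re.IGNORECASE
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END markers, if they are present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    stop = end.start() if end and end.start() >= begin else len(raw)
    return raw[begin:stop]


def normalize_whitespace(text: str) -> str:
    """Unify line endings, collapse runs of spaces, and limit blank lines."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()


sample = (
    "Title: Demo\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***\r\n"
    "Chapter   1\r\n\r\n\r\n\r\nIt was  evening.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***\r\n"
    "License text..."
)
cleaned = normalize_whitespace(strip_gutenberg_boilerplate(sample))
print(cleaned)
```

Files without the markers pass through unchanged, so the same function can be applied to non-Gutenberg text.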


In [4]:
from pipeline.document_loaders import load_document
from utils.text_processing import clean_text, normalize_text

cleaned_documents = {}

for path in paper_paths:
    raw_doc = load_document(str(path))
    cleaned_text = normalize_text(clean_text(raw_doc["content"]))
    cleaned_documents[path.name] = {
        "metadata": raw_doc["metadata"],
        "text": cleaned_text
    }

# Stash The Jungle Book novel for downstream demos
demo_doc_name = "Rudyard Kipling___The Jungle Book.txt"
demo_doc = cleaned_documents[demo_doc_name]
print(f"Stashed demo document: {demo_doc_name} ({len(demo_doc['text']):,} characters)")
print("Preview of demo document:\n========================\n", demo_doc['text'][:420] + "...")

Stashed demo document: Rudyard Kipling___The Jungle Book.txt (271,339 characters)
Preview of demo document:
 Mowgli's Brothers

 Now Rann the Kite brings home the night
 That Mang the Bat sets free--
 The herds are shut in byre and hut
 For loosed till dawn are we.
 This is the hour of pride and power,
 Talon and tush and claw.
 Oh, hear the call!--Good hunting all
 That keep the Jungle Law!
 Night-Song in the Jungle

It was seven o'clock of a very warm evening in the Seeonee hills when
Father Wolf woke up from his day's re...


## Semantic Chunking

Documents are split into overlapping chunks to maintain context continuity.

**Parameters**:
- `chunk_size`: 512 tokens by default (~400 words)
- `chunk_overlap`: 50 tokens by default (prevents context loss at boundaries)

**Why Overlapping?** Adjacent chunks share context, enabling the retriever to capture information that spans chunk boundaries.

Each chunk receives a unique ID and inherits parent document metadata.
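
The overlap mechanism can be illustrated with a sliding window over a token list. This is a sketch under simplifying assumptions: `chunk_tokens` is a hypothetical helper, tokens are pre-split strings rather than real tokenizer output, and `pipeline.chunking.chunk_text` may split on different boundaries.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # the last window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=4` and `overlap=1`, each window starts on the last token of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.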

===================================================================================================================

- **Component:** `pipeline.chunking.chunk_text`
- **Input:** Cleaned text of the demo book, "Rudyard Kipling___The Jungle Book.txt", with chunk size and overlap read from the config (256 / 20 tokens in the run shown below)
- **Output:** Structured `Chunk` objects with ids, token estimates, and metadata for vector storage


In [5]:
from pipeline.chunking import chunk_text

chunk_size = cfg.chunking.chunk_size                                                          # Read from config; assign a number here to override
chunk_overlap = cfg.chunking.chunk_overlap

demo_chunks = chunk_text(
    demo_doc["text"],                                                   # Contains Rudyard Kipling___The Jungle Book.txt
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    metadata={"filename": demo_doc_name}
)

print(f"Input doc: {demo_doc_name}")
print(f"Requested chunk size / overlap: {chunk_size} / {chunk_overlap}")
print(f"Output: {len(demo_chunks)} chunks")

first_chunk = demo_chunks[0]                                            # First chunk of Rudyard Kipling___The Jungle Book
print("First chunk sample:")
print(f"  chunk_id: {first_chunk.chunk_id}")
print(f"  tokens: {first_chunk.tokens}")
print(f"  position: {first_chunk.position}")
print(f"  text: {first_chunk.text[:420]}...")

Input doc: Rudyard Kipling___The Jungle Book.txt
Requested chunk size / overlap: 256 / 20
Output: 288 chunks
First chunk sample:
  chunk_id: f9285c6b-1bd1-4eec-9bb3-53b8ae26e2c0
  tokens: 226
  position: 0
  text: Mowgli's Brothers

 Now Rann the Kite brings home the night
 That Mang the Bat sets free--
 The herds are shut in byre and hut
 For loosed till dawn are we.
 This is the hour of pride and power,
 Talon and tush and claw.
 Oh, hear the call!--Good hunting all
 That keep the Jungle Law!
 Night-Song in the Jungle

It was seven o'clock of a very warm evening in the Seeonee hills when
Father Wolf woke up from his day's re...


## Vector Store: ChromaDB + Embeddings

**Purpose**: Enable semantic search over document chunks.

**Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- Fast, open-source, 384-dimensional embeddings
- Trained on semantic similarity tasks

**Storage**: Persistent ChromaDB collection
- Chunks indexed by semantic vectors
- Metadata stored alongside embeddings for filtering

The retriever queries this store to find relevant context for answer generation.
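
To make the idea of vector search concrete without any model download, the sketch below implements a toy hashing embedding and a cosine-similarity search in pure Python. It only illustrates how an `offline-hash` style embedding could work; the BSSC-QA implementation is not shown in this notebook, and `hash_embed` and `search` are hypothetical names. Unlike `all-MiniLM-L6-v2`, this embedding is purely lexical and captures no semantics.

```python
import hashlib
import math


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed-size, L2-normalized bag-of-words vector via hashing."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim   # deterministic bucket per word
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity because the vectors are normalized."""
    return sum(x * y for x, y in zip(a, b))


def search(query: str, chunks: list[str], k: int = 1) -> list[tuple[float, str]]:
    """Return the k chunks with the highest similarity to the query."""
    q = hash_embed(query)
    scored = sorted(((cosine(q, hash_embed(c)), c) for c in chunks), reverse=True)
    return scored[:k]


chunks = [
    "the white seal",
    "father wolf woke up from his rest",
    "an archaeologist studies old things",
]
results = search("the white seal", chunks, k=1)
print(results)
```

A real vector store replaces the linear scan with an index and the hash with a learned model, but the query path has the same shape: embed the query, score it against stored vectors, and return the top matches.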

===================================================================================================================
- **Component:** `core.vector_store.VectorStoreManager` + `pipeline.ingestion.IngestionPipeline`
- **Input:** The four selected text files, embedded with `sentence-transformers/all-MiniLM-L6-v2` (chunk size and overlap from the config; 256 / 20 in this run)
- **Output:** Persisted Chroma collection plus an ingestion report per file

In [None]:
from core.vector_store import VectorStoreManager
from pipeline.ingestion import IngestionPipeline

vector_dir = PROJECT_ROOT / "data" / "output" / "demo_chroma"               # Directory holding the vector database
if vector_dir.exists():                                                     # Start from a clean store on every run
    shutil.rmtree(vector_dir)
vector_dir.mkdir(parents=True, exist_ok=True)

vs_manager = VectorStoreManager(                                           # Create vector store instance
    persist_directory=str(vector_dir),
    collection_name="demo_novels",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",              # Options: sentence-transformers/all-MiniLM-L6-v2, cohere, offline-hash, or any huggingface embedding model
    embedding_dimension=384                                                # all-MiniLM-L6-v2 uses 384 dimensions, offline-hash is flexible
)

RESET_COLLECTION = False                                                    # Set to True to empty the collection before re-ingestion
if RESET_COLLECTION:
    vs_manager.reset_collection()
    print("✅ Vector store reset and ready for re-ingestion.")

ingestion = IngestionPipeline(
    vector_store_manager=vs_manager,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

ingestion_summary = {}
for path in paper_paths:                                                   # Ingest all selected raw text files 
    doc_ids = ingestion.ingest_document(str(path))
    ingestion_summary[path.name] = len(doc_ids)
    print(f"Input file: {path.name}")
    print(f"Output chunk ids stored: {len(doc_ids)}")

print("Vector store now holds:", vs_manager.get_collection_count(), "chunks")
print("Ingestion summary:", ingestion_summary)



Input file: G K Chesterton___The Man Who Knew Too Much.txt
Output chunk ids stored: 348
Input file: Herbert Spencer___Essays on Education and Kindred Subjects.txt
Output chunk ids stored: 933
Input file: Jack London___The Faith of Men.txt
Output chunk ids stored: 274
Input file: Rudyard Kipling___The Jungle Book.txt
Output chunk ids stored: 288
Vector store now holds: 1843 chunks
Ingestion summary: {'G K Chesterton___The Man Who Knew Too Much.txt': 348, 'Herbert Spencer___Essays on Education and Kindred Subjects.txt': 933, 'Jack London___The Faith of Men.txt': 274, 'Rudyard Kipling___The Jungle Book.txt': 288}


### Chunk Analysis Tool
- **Component:** `tools.chunk_tool.ChunkAnalysisTool`
- **Input:** Text of the first chunk of the demo book, "Rudyard Kipling___The Jungle Book"
- **Output:** Sentence/entity stats plus suggested question types for the generator
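
A rule-based analyzer of this kind can be sketched with a few regular expressions. The version below is hypothetical: the function names, the capitalized-word entity heuristic, and the thresholds are assumptions chosen for illustration, not the rules used by `ChunkAnalysisTool`.

```python
import re


def sketch_analyze_chunk(text: str) -> dict:
    """Collect simple surface statistics from a chunk of text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    # Naive entity guess: any capitalized word, including sentence starters.
    entities = sorted(set(re.findall(r"\b[A-Z][a-z]+\b", text)))
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "entities": entities,
        "has_numbers": bool(numbers),
        "number_count": len(numbers),
    }


def sketch_suggest_question_types(analysis: dict) -> list[str]:
    """Map statistics to question types using illustrative thresholds."""
    types = ["factual"]
    if analysis["sentence_count"] >= 3:
        types.append("conceptual")
    if analysis["word_count"] >= 100:
        types.append("analytical")
    return types


sample = "Father Wolf woke up at seven. He yawned! Was Mother Wolf asleep?"
result = sketch_analyze_chunk(sample)
suggested = sketch_suggest_question_types(result)
print(result)
print(suggested)
```

A capitalization heuristic like this one would also explain why words such as "It", "Now", and "The" show up in the `entities` list of the analysis output in this section; a named-entity model would filter those out at the cost of an extra dependency.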

In [7]:
from tools.chunk_tool import ChunkAnalysisTool

chunk_analyzer = ChunkAnalysisTool()
analysis = chunk_analyzer.analyze_chunk(first_chunk.text)
suggestions = chunk_analyzer.suggest_question_types(analysis)           # Rule-based: picks from five question types using entities, sentence structure, complexity, and length

print("Input chunk preview:")
print(first_chunk.text[:220] + "...")

print("Analysis output:")
for key, value in analysis.items():
    print(f"  {key}: {value}")

print("============================="*3)
print("Suggested question types:", suggestions)

Input chunk preview:
Mowgli's Brothers

 Now Rann the Kite brings home the night
 That Mang the Bat sets free--
 The herds are shut in byre and hut
 For loosed till dawn are we.
 This is the hour of pride and power,
 Talon and tush and claw....
Analysis output:
  sentence_count: 7
  word_count: 171
  entities: ['It', 'Mother', 'Now', 'Father', 'Wolves', 'For', 'Wolf', 'Jungle', 'The', 'Bat']
  has_numbers: False
  number_count: 0
  potential_topics: 27
  question_potential: True
Suggested question types: ['factual', 'conceptual', 'analytical']


### Retrieval Tool
- **Component:** `tools.retrieval_tool.RetrievalTool`
- **Input:** Query to the vector database
- **Output:** Formatted context snippets pulled from the Chroma store
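
The formatted context has a simple, regular layout (`[Chunk N]`, `Source`, `Position`, `Content`), as the output of the cell in this section shows. A formatter producing that layout could look like the sketch below; `format_context` and the dictionary keys are assumptions, since the internals of `RetrievalTool` are not shown here.

```python
def format_context(hits: list[dict]) -> str:
    """Render retrieved chunks as numbered blocks with source and position."""
    blocks = []
    for rank, hit in enumerate(hits, start=1):
        blocks.append(
            f"[Chunk {rank}]\n"
            f"Source: {hit['source']}\n"
            f"Position: {hit['position']}\n"
            f"Content: {hit['text']}"
        )
    return "\n\n".join(blocks)


# Placeholder hits standing in for vector-store results.
hits = [
    {"source": "book_a.txt", "position": 156, "text": "First retrieved passage."},
    {"source": "book_b.txt", "position": 12, "text": "Second retrieved passage."},
]
context = format_context(hits)
print(context)
```

Keeping source and position next to each passage lets downstream agents, and human reviewers, trace an answer back to its chunk.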

In [8]:
from tools.retrieval_tool import RetrievalTool

retrieval_tool = RetrievalTool(vs_manager)
retrieval_query = "How many holluschickies were in Kotick's army?"                                  # Expected answer: about ten thousand, found in the top-1 context (1024 tokens)
retrieved_context = retrieval_tool.retrieve_context(retrieval_query, k=1)                           # Retrieve top-1 relevant chunk

print("Input query:", retrieval_query)
print("=============================")
print("Retrieved context:")
print(retrieved_context)

Input query: How many holluschickies were in Kotick's army?
Retrieved context:
[Chunk 1]
Source: Rudyard Kipling___The Jungle Book.txt
Position: 156
Content: vastoshnah, leaving the gulls to scream. There he
found that no one sympathized with him in his little attempt to discover
a quiet place for the seals. They told him that men had always driven
the holluschickie--it was part of the day's work--and that if he did not
like to see ugly things he should not have gone to the killing grounds.
But none of the other seals had seen the killing, and that made the
difference between him and his friends. Besides, Kotick was a white
seal.

"What you must do," said old Sea Catch, after he had heard his son's
adventures, "is to grow up and be a big seal like your father, and have
a nursery on the beach, and then they will leave you alone. In another
five years you ought to be able to fight for yourself." Even gentle
Matkah, his mother, said: "You will never be able to stop the killing.
Go and pla

## Initialize and Test an LLM from the Config

BSSC-QA supports multiple LLM providers through a unified interface.


Each agent can use a different provider, enabling cost/performance optimization per task.

*This cell creates the LLM client for the default provider named in the config.*

In [9]:
from core.llm_factory import create_llm
from core.config import load_config

# Reload config (in case you updated API keys)
cfg = load_config(PROJECT_ROOT / 'config.json')

# Get default provider config
provider_name = cfg.llm.default_provider
provider_cfg = cfg.llm.providers[provider_name]

# Create LLM
llm = create_llm(
    provider=provider_name,
    api_key=provider_cfg.api_key,
    model=provider_cfg.model,
    temperature=provider_cfg.temperature
)

# # Test with simple prompt
# response = llm.invoke("Say 'Hello from BSSC_QA!' and tell a pun about academic research.")
# print(f"✅ LLM ({provider_name}) Response: {response.content}\n-_-")



### Question Generator Agent
**Role**: Generate questions from document chunks using LLM prompting.

**Process**:
1. Receives a text chunk
2. Analyzes content for key concepts
3. Generates N diverse questions covering chunk topics

**Output**: List of questions with metadata (question type and rationale in this demo).

===================================================================================================================

**System Prompt:**

    
```python
    """
    You are an expert question generator. Your task is to create high-quality, 
    diverse questions from given text content.

    Guidelines:
    1. Questions should be clear, specific, and answerable from the content
    2. Vary question types: factual, conceptual, analytical
    3. Ask about key concepts, entities, and relationships
    4. Ensure questions test understanding, not just recall
    5. Each question should be complete and grammatically correct

    Output format for each question:
    {
    "question": "Your question here?",
    "type": "factual|conceptual|analytical",
    "rationale": "Why this question is valuable"
    }
    """
```

**User Prompt:**

```python
    """
    Generate {count} diverse, high-quality questions from this content:

    {chunk_text}

    Provide exactly {count} questions in the specified JSON format.
    """
```
===================================================================================================================
- **Component:** `agents.generator_agent.GeneratorAgent`
- **Input:** First chunk of text (≈200 chars shown) with a request for 2 questions
- **Output:** Structured questions (text, type, rationale) ready for synthesis
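
LLM replies often wrap the requested JSON objects in extra prose, so the reply needs tolerant parsing before it becomes structured questions. The sketch below scans the reply for JSON objects with the standard library's `json.JSONDecoder.raw_decode`. `extract_questions` is a hypothetical helper; the mapping from the prompt's `type` key to the `question_type` key read by the notebook cell is an assumption about what the agent does internally.

```python
import json


def extract_questions(response_text: str) -> list[dict]:
    """Pull every JSON object that has a 'question' key out of free-form text."""
    decoder = json.JSONDecoder()
    questions = []
    pos = 0
    while True:
        start = response_text.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(response_text, start)
        except json.JSONDecodeError:
            pos = start + 1          # not valid JSON here; keep scanning
            continue
        if isinstance(obj, dict) and "question" in obj:
            questions.append({
                "question": obj["question"],
                "question_type": obj.get("type", "factual"),
                "rationale": obj.get("rationale", ""),
            })
        pos = end
    return questions


reply = (
    "Here are the questions:\n"
    '{"question": "Q1?", "type": "factual", "rationale": "r1"}\n'
    '{"question": "Q2?", "type": "analytical", "rationale": "r2"}\n'
    "Done {not json}."
)
parsed = extract_questions(reply)
print(parsed)
```

If fewer objects come back than requested, a retry loop bounded by `max_retries` from the config is a natural place to re-prompt.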

In [10]:
from agents.generator_agent import GeneratorAgent

generator = GeneratorAgent(
    llm=llm,                                                      
    retrieval_tool=retrieval_tool,
    chunk_analysis_tool=chunk_analyzer
)

questions = generator.generate_questions(first_chunk.text, count=2)                             # Generate 2 questions from the first chunk

print("Input chunk preview:")
print(first_chunk.text[:200] + "...")

print("Generated questions:")
for q in questions:
    print(f"- ({q['question_type']}) {q['question']}")
    print(f"  rationale: {q['rationale']}")



Input chunk preview:
Mowgli's Brothers

 Now Rann the Kite brings home the night
 That Mang the Bat sets free--
 The herds are shut in byre and hut
 For loosed till dawn are we.
 This is the hour of pride and power,
 Talo...
Generated questions:
- (factual) What time of day does Father Wolf wake up from his rest in the Seeonee hills?
  rationale: This question tests basic comprehension of the temporal setting and establishes the story's opening context
- (conceptual) How does the narrator describe the relationship between Mother Wolf and her cubs in the cave?
  rationale: This question examines understanding of family dynamics and the protective maternal imagery in the text


### Answer Synthesis Agent

**Role**: Generate grounded answers using retrieved evidence.

**Workflow**:
1. Query vector store with question
2. Retrieve top-K relevant chunks (context window)
3. Generate answer citing specific evidence spans

**Key Feature**: Answer provenance tracking
- Each answer includes 1-3 evidence spans
- Spans reference source chunks for verification

*Defaults to DeepSeek for its strong context reasoning.*

===================================================================================================================

**System Prompt:**
```python
    """
    You are an expert answer synthesizer. Your task is to create accurate, 
    comprehensive answers based on provided evidence. Make sure the answers are short and concise.

    Guidelines:
    1. Base answers strictly on the evidence provided
    2. Be clear, concise, and well-structured
    3. Include relevant details and context
    4. Maintain factual accuracy
    5. Match answer complexity to question complexity

    Your answer should:
    - Directly address the question
    - Use evidence to support claims
    - Be complete but not unnecessarily verbose
    """
```

**User Prompt:**

```python
    """
    Question: {question}
    Question Type: {question_type}

    Evidence:
    {evidence_text}

    Based on the evidence above, provide a clear and accurate answer to the question.
    """
```
===================================================================================================================

- **Component:** `agents.synthesis_agent.SynthesisAgent`
- **Input:** First generated question plus vector-store evidence (k=2)
- **Output:** Answer text with captured evidence spans inside the QA record
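
The three workflow steps can be sketched end to end with stand-ins for the vector store and the LLM. Everything below is illustrative: `synthesize`, `fake_retrieve`, and `fake_llm` are hypothetical, and only the user prompt template is taken from this section.

```python
from typing import Callable

USER_PROMPT = (
    "Question: {question}\n"
    "Question Type: {question_type}\n\n"
    "Evidence:\n{evidence_text}\n\n"
    "Based on the evidence above, provide a clear and accurate answer to the question."
)


def synthesize(
    question: str,
    question_type: str,
    retrieve: Callable[[str, int], list[str]],
    llm: Callable[[str], str],
    k: int = 3,
    max_evidence_spans: int = 2,
) -> dict:
    """Retrieve top-k chunks, keep a limited number as evidence, and ask the LLM."""
    spans = retrieve(question, k)[:max_evidence_spans]
    evidence_text = "\n\n".join(
        f"[Evidence {i}]\n{span}" for i, span in enumerate(spans, start=1)
    )
    prompt = USER_PROMPT.format(
        question=question, question_type=question_type, evidence_text=evidence_text
    )
    return {"question": question, "answer": llm(prompt), "evidence_spans": spans}


def fake_retrieve(query: str, k: int) -> list[str]:
    """Stub retriever returning canned passages."""
    passages = ["It was seven o'clock of a very warm evening.", "Father Wolf woke up.", "Unrelated text."]
    return passages[:k]


def fake_llm(prompt: str) -> str:
    """Stub LLM that only reports how much evidence it was given."""
    return f"(stub answer based on {prompt.count('[Evidence ')} evidence spans)"


qa = synthesize("When does Father Wolf wake up?", "factual", fake_retrieve, fake_llm)
print(qa["answer"])
```

Storing the spans alongside the answer is what makes provenance checks possible later: the evaluator can compare the answer against exactly the text the LLM saw.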

In [11]:
from agents.synthesis_agent import SynthesisAgent

synthesizer = SynthesisAgent(
    llm=llm,
    vector_store_manager=vs_manager,
    max_evidence_spans=2
)

demo_question = questions[0]                                                # Take the first generated question
qa_pair = synthesizer.synthesize_answer(
    demo_question["question"],
    demo_question["question_type"]
)

print("Input question:", demo_question["question"])
print("Synthesized answer:", qa_pair["answer"])
print("Evidence count:", len(qa_pair["evidence_spans"]))

🧠 Synthesis prompt:
Question: What time of day does Father Wolf wake up from his rest in the Seeonee hills?
Question Type: factual

Evidence:
[Evidence 1]
Mowgli's Brothers

 Now Rann the Kite brings home the night
 That Mang the Bat sets free--
 The herds are shut in byre and hut
 For loosed till dawn are we.
 This is the hour of pride and power,
 Talon and tush and claw.
 Oh, hear the call!--Good hunting all
 That keep the Jungle Law!
 Night-Song in the Jungle

It was seven o'clock of a very warm evening in the Seeonee hills
Input question: What time of day does Father Wolf wake up from his rest in the Seeonee hills?
Synthesized answer: Father Wolf wakes up from his day's rest at seven o'clock in the evening in the Seeonee hills.

**Supporting Evidence:**  
From "Mowgli's Brothers": "It was seven o'clock of a very warm evening in the Seeonee hills when Father Wolf woke up from his day's rest."
Evidence count: 2


In [12]:
qa_pair

{'qa_id': '90f44a83-a872-438e-a9bc-7bfaf29994b3',
 'question': 'What time of day does Father Wolf wake up from his rest in the Seeonee hills?',
 'answer': 'Father Wolf wakes up from his day\'s rest at seven o\'clock in the evening in the Seeonee hills.\n\n**Supporting Evidence:**  \nFrom "Mowgli\'s Brothers": "It was seven o\'clock of a very warm evening in the Seeonee hills when Father Wolf woke up from his day\'s rest."',
 'evidence_spans': ['Mowgli\'s Brothers\n\n Now Rann the Kite brings home the night\n That Mang the Bat sets free--\n The herds are shut in byre and hut\n For loosed till dawn are we.\n This is the hour of pride and power,\n Talon and tush and claw.\n Oh, hear the call!--Good hunting all\n That keep the Jungle Law!\n Night-Song in the Jungle\n\nIt was seven o\'clock of a very warm evening in the Seeonee hills when\nFather Wolf woke up from his day\'s rest, scratched himself, yawned, and\nspread out his paws one after the other to get rid of the sleepy feeling\nin th

### Evaluator Agent

**Role**: Score QA pairs across multiple dimensions.

**Evaluation Metrics**:
- **Relevance**: Question-answer alignment
- **Clarity**: Language quality and coherence
- **Completeness**: Answer thoroughness
- **Factuality**: Grounding in evidence
- **Format**: Structural correctness

**Scoring**: 0-1 scale per metric, aggregated to overall score.

**Quality Threshold**: 0.75 (configurable)
- QA pairs below threshold are flagged

*Uses Mistral for consistent evaluation.*

===================================================================================================================

**System Prompt:**
```python

    """
    You are a quality evaluation expert. Your task is to assess QA pairs 
    across multiple dimensions and provide detailed scores.

    Evaluation Criteria:
    1. Relevance (0-1): Does the answer address the question?
    2. Clarity (0-1): Are both Q&A clear and unambiguous?
    3. Completeness (0-1): Is the answer comprehensive?
    4. Factuality (0-1): Is the answer accurate based on evidence?

    Provide scores for each criterion and identify any issues.
    """
```

**User Prompt:**

```python
    """
    Evaluate this QA pair:

    Question: {qa['question']}
    Answer: {qa['answer']}

    Evidence:
    {evidence_text[:-1]}

    Provide scores (0.0 to 1.0) for:
    - Relevance
    - Clarity
    - Completeness
    - Factuality

    Format: 
    relevance: X.X
    clarity: X.X
    completeness: X.X
    factuality: X.X
    """
```
===================================================================================================================

- **Component:** `agents.evaluator_agent.EvaluatorAgent` + `tools.validation_tool.ValidationTool`
- **Input:** QA pair emitted by the synthesizer
- **Output:** Combined rule-based + LLM-style quality scores and pass/fail signal
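
Combining the two score sources comes down to parsing the `metric: X.X` lines from the LLM reply, merging them with the rule-based scores, and comparing an aggregate against the threshold. The sketch below uses an unweighted mean, which is an assumption: it happens to reproduce the 0.97 reported for the first sample QA pair later in this notebook, but the real evaluator may weight metrics differently. `parse_scores` and `overall` are hypothetical helpers.

```python
import re

SCORE_RE = re.compile(
    r"^\s*(relevance|clarity|completeness|factuality)\s*:\s*([01](?:\.\d+)?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_scores(reply: str) -> dict[str, float]:
    """Extract 'metric: value' lines from the LLM reply, capping values at 1.0."""
    return {name.lower(): min(float(value), 1.0) for name, value in SCORE_RE.findall(reply)}


def overall(scores: dict[str, float]) -> float:
    """Unweighted mean of all scores (assumed aggregation)."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Rule-based scores as they appear in the notebook output, plus a sample LLM reply.
rule_scores = {"length": 1.0, "answer_length": 1.0, "format": 1.0}
llm_reply = "relevance: 1.0\nclarity: 1.0\ncompleteness: 0.8\nfactuality: 1.0"

scores = {**rule_scores, **parse_scores(llm_reply)}
overall_score = overall(scores)
passed = overall_score >= 0.75   # quality threshold from the default config
print(round(overall_score, 2), passed)
```

Anchoring the regular expression to whole lines keeps stray numbers in the model's commentary from being misread as scores.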

In [13]:
from tools.validation_tool import ValidationTool
from agents.evaluator_agent import EvaluatorAgent

validator = ValidationTool(quality_threshold=0.6)
evaluator = EvaluatorAgent(
    llm=llm,
    validation_tool=validator,
    quality_threshold=0.7
)

evaluation = evaluator.evaluate_qa(qa_pair)                         # Evaluate the selected synthesized QA pair

print("Input QA ID:", qa_pair["qa_id"])
print("Scores:", evaluation["scores"])
print("Overall score:", round(evaluation["overall_score"], 2))
print("Passed quality bar:", evaluation["passed"])
print("Flags:", evaluation["flags"])

Input QA ID: a6e14956-407e-455d-b39a-e3bc0927fc9d
Scores: {'length': 1.0, 'answer_length': 1.0, 'format': 1.0, 'relevance': 1.0, 'completeness': 1.0, 'clarity': 1.0, 'factuality': 1.0}
Overall score: 1.0
Passed quality bar: True
Flags: []


### Pipeline Orchestrator

The orchestrator coordinates all agents in sequence:

**Pipeline Flow**:
1. Sample N random chunks from vector store
2. **Generator**: Create M questions per chunk
3. **Synthesis**: Generate answers with evidence retrieval
4. **Evaluator**: Score each QA pair

**Output Statistics**:
- Total QA pairs generated
- Pass/fail counts (based on quality threshold)
- Aggregate metrics (diversity, difficulty distribution)

**This cell**: Generate 6 QA pairs (3 chunks × 2 questions) and display summary statistics.

===================================================================================================================

- **Component:** `pipeline.orchestrator.QAPipelineOrchestrator`
- **Input:** 3 sampled chunks from the vector store with 2 questions per chunk
- **Output:** Aggregated QA dataset plus evaluation statistics
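
The pipeline flow reduces to a nested loop over sampled chunks and generated questions, followed by a few counters. The sketch below mirrors the result keys printed by the cell in this section but uses stub agents; `run_pipeline` and the three lambdas are hypothetical and say nothing about how `QAPipelineOrchestrator` is implemented.

```python
import random


def run_pipeline(chunks, generate, synthesize, evaluate,
                 num_chunks: int, questions_per_chunk: int, seed: int = 0) -> dict:
    """Sample chunks, generate questions, synthesize answers, and score each pair."""
    rng = random.Random(seed)
    sampled = rng.sample(chunks, min(num_chunks, len(chunks)))
    qa_pairs = []
    attempted = 0
    for chunk in sampled:
        for question in generate(chunk, questions_per_chunk):
            attempted += 1
            qa = synthesize(question)
            qa.update(evaluate(qa))      # attach scores and pass/fail flag
            qa_pairs.append(qa)
    passed = sum(1 for qa in qa_pairs if qa["passed"])
    return {
        "total_chunks": len(sampled),
        "total_questions_attempted": attempted,
        "total_qa_pairs": len(qa_pairs),
        "passed_qa_pairs": passed,
        "statistics": {"pass_rate": passed / len(qa_pairs) if qa_pairs else 0.0},
        "qa_pairs": qa_pairs,
    }


# Stub agents standing in for the generator, synthesizer, and evaluator.
generate = lambda chunk, n: [f"Q{i} about {chunk}?" for i in range(n)]
synthesize = lambda question: {"question": question, "answer": "stub answer"}
evaluate = lambda qa: {"overall_score": 0.9, "passed": True}

summary = run_pipeline(["c1", "c2", "c3", "c4"], generate, synthesize, evaluate,
                       num_chunks=3, questions_per_chunk=2)
print(summary["total_qa_pairs"], summary["statistics"]["pass_rate"])
```

Passing the agents in as callables keeps the loop testable without API keys, which is useful when tuning thresholds or sampling logic.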

In [14]:
from pipeline.orchestrator import QAPipelineOrchestrator

orchestrator = QAPipelineOrchestrator(
    generator_agent=generator,
    synthesis_agent=synthesizer,
    evaluator_agent=evaluator,
    vector_store_manager=vs_manager,
    config={}
)

results = orchestrator.generate_qa_from_chunks(
    num_chunks=3,
    questions_per_chunk=2
)

print("Pipeline output summary:")
print(f"  total_chunks: {results['total_chunks']}")
print(f"  total_questions_attempted: {results['total_questions_attempted']}")
print(f"  total_qa_pairs: {results['total_qa_pairs']}")
print(f"  passed_qa_pairs: {results['passed_qa_pairs']}")
print(f"  pass_rate: {results['statistics'].get('pass_rate', 0):.2%}")
print(f"  sample QA IDs: {[qa['qa_id'] for qa in results['qa_pairs'][:2]]}")

Retrieving 3 chunks...

Processing 3 chunks...


Generating QA:   0%|          | 0/3 [00:00<?, ?it/s]

🧾 Retrieved these evidences:
[Evidence 1]
pier if we hauled out at Otter
Island instead of this crowded place," said Matkah.

"Bah! Only the holluschickie go to Otter Island. If we went there they
would say we were afraid. We must preserve appearances, my dear."

Sea Catch sunk his he
🧾 Retrieved these evidences:
[Evidence 1]
come with you to
your island--if there is such a place."

"Hear you, fat pigs of the sea. Who comes with me to the Sea Cow's
tunnel? Answer, or I shall teach you again," roared Kotick.

There was a murmur like the ripple of the tide all up and


Generating QA:  33%|███▎      | 1/3 [00:28<00:57, 28.97s/it]

🧾 Retrieved these evidences:
[Evidence 1]
d French
romances, but a good many wouldn't think about it at all. They would
just swallow the skepticism because it was skepticism. Modern
intelligence won't accept anything on authority. But it will accept
anything without authority. That's 
🧾 Retrieved these evidences:
[Evidence 1]
riors. Not that he delighted in the work, but that it was the one
thing that prevented him from going mad.

The first year he wished he was dead. The second year he cursed God. The
third year he was divided between the two emotions, and in the


Generating QA:  67%|██████▋   | 2/3 [01:03<00:32, 32.06s/it]

🧾 Retrieved these evidences:
[Evidence 1]
low voice:

"I suppose it's all right about air?"

"Oh, yes," replied the other aloud; "there's a fireplace and a
chimney in the office just by the door."

A bound and the noise of a falling chair told them that the
irrepressible rising genera
🧾 Retrieved these evidences:
[Evidence 1]
retences," he said, with a smile. "I
hardly even know what an archaeologist is, except that a rather
rusty remnant of Greek suggests that he is a man who studies old
things."

"Yes," replied Haddow, grimly. "An archaeologist is a man who
studi


Generating QA: 100%|██████████| 3/3 [01:30<00:00, 30.32s/it]

Pipeline output summary:
  total_chunks: 3
  total_questions_attempted: 6
  total_qa_pairs: 6
  passed_qa_pairs: 6
  pass_rate: 100.00%
  sample QA IDs: ['5e15c38e-7e4d-40b1-8722-7031bc19422c', 'd3bcc6f6-5f48-4812-974f-46ab1cf164a6']





## Examining Generated QA Pairs

Each QA pair includes:
- **Question**: Generated query
- **Answer**: Synthesized response with evidence
- **Evidence Spans**: Source text citations (2-3 per answer)
- **Evaluation Scores**: Per-metric breakdown
- **Overall Score**: Weighted average (0-1)
- **Pass Status**: Whether it meets quality threshold
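
The overall score and pass status can be sketched as follows. This is a minimal illustration, not the framework's actual scoring code: the function names `overall_score` and `passes` and the equal default weights are assumptions, and the 0.75 default simply mirrors the `quality_threshold` value used in the demo config.

```python
from typing import Dict, Optional


def overall_score(scores: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-metric scores, each expected in [0, 1]."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Assumption: a metric without an explicit weight gets weight 1.0.
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(value * weights.get(name, 1.0)
                       for name, value in scores.items())
    return weighted_sum / total_weight


def passes(scores: Dict[str, float],
           threshold: float = 0.75,
           weights: Optional[Dict[str, float]] = None) -> bool:
    """A QA pair passes when its overall score reaches the threshold."""
    return overall_score(scores, weights) >= threshold


sample = {"length": 1.0, "answer_length": 1.0, "format": 1.0,
          "relevance": 1.0, "completeness": 0.8, "clarity": 1.0,
          "factuality": 1.0}
print(round(overall_score(sample), 2))
print(passes(sample))
```

With equal weights, these per-metric scores average to about 0.97. The weights the framework actually applies are not shown in this notebook.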

In [15]:
# Print sample QA pairs
for qa in results['qa_pairs'][:2]:
    print("\n=============================")
    print(f"QA ID: {qa['qa_id']}")
    print(f"Question: {qa['question']}")
    print(f"Answer: {qa['answer']}")
    print(f"Evidence spans: {len(qa['evidence_spans'])}")
    print(f"Evaluation scores: {qa['scores']}")
    print(f"Overall score: {round(qa['overall_score'], 2)}")
    print(f"Passed quality bar: {qa['passed']}")


QA ID: 5e15c38e-7e4d-40b1-8722-7031bc19422c
Question: What natural features prevent ships from approaching within six miles of the beach?
Answer: According to the evidence, the natural features that prevent ships from approaching within six miles of the beach are:

- A line of bars, shoals, and rocks running northward out to sea
- These shoals would "knock a ship to splinters" if attempted to navigate

These features create a protective barrier that keeps ships at a safe distance from the coastline.
Evidence spans: 2
Evaluation scores: {'length': 1.0, 'answer_length': 1.0, 'format': 1.0, 'relevance': 1.0, 'completeness': 0.8, 'clarity': 1.0, 'factuality': 1.0}
Overall score: 0.97
Passed quality bar: True

QA ID: d3bcc6f6-5f48-4812-974f-46ab1cf164a6
Question: Why does Kotick conclude that this location is safer than Novastoshnah?
Answer: Based on the evidence provided, Kotick concludes that the location beyond Sea Cow's Tunnel is safer than Novastoshnah because:

1. **It is free from h

## Experiment: Generating Concise Answers

**Goal**: Reduce answer verbosity while maintaining factual accuracy.

**Config Changes**:
- Reduced context window (3 → 2 chunks) to limit input length
- Lowered max evidence spans (3 → 2) to reduce citation overhead
- Adjusted evaluation metrics (Conciseness) to prioritize brevity

**Prompt Strategy**: Use the short-answer prompts in `prompts/short_prompt.json` (customize them for your own data).


In [None]:
import json

# Configuration for the concise-answer experiment
config = {
    "llm": {                                                                                    # LLM provider settings (You can add or remove providers)
        "default_provider": "gemini",
        "providers": {
            "gemini": {                                       
                "api_key": "your_api_key_here",
                "model": "gemini-2.5-flash",
                "temperature": 0.2                                                              # Example: lower temperature for more focused answers
            },
            "deepseek": {
                "api_key": "your_api_key_here",
                "model": "deepseek-chat",
                "temperature": 0.2
            },
            "mistral": {
                "api_key": "your_api_key_here",
                "model": "mistral-large-latest",
                "temperature": 0.3
            },
            "huggingface": {
                "api_key": "your_api_key_here",
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "temperature": 0.5
            }
        }
    },
    "vector_store": {                                                                             # Vector store settings
        "type": "chromadb",
        "persist_directory": "./data/chroma_db",
        "collection_name": "demo",
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"                               # Options: offline-hash, sentence-transformers/all-MiniLM-L6-v2
    },
    "prompts": {
        "path": "prompts/short_prompt.json"                                                       # Change to short prompt
    },
    "chunking": {
        "chunk_size": 256,                                                                        # (~30 sentences, ~2,000 chars)
        "chunk_overlap": 20,
        "auto_adjust": True                                                                       # Auto-adjust chunk size based on model context window              
    },
    "agents": {
        "planner": {                                                                              
            "enabled": False,
            "provider": "gemini"
        },
        "generator": {
            "provider": "gemini",
            "max_retries": 3
        },
        "synthesis": {
            "provider": "deepseek",
            "context_window": 2,                                                                  # Number of top relevant chunks to consider
            "max_evidence_spans": 2                                                               # Number of evidence spans to cite in the answer (lowered from 3)
        },
        "evaluator": {
            "provider": "mistral",
            "quality_threshold": 0.75,
            "metrics": ["relevance", "clarity", "completeness", "factuality", "diversity"]         # Evaluation metrics (Adjust as needed) 
        }
    },
    "bloom_level": {
        "enabled": False,
        "levels": ["remember", "understand", "apply", "analyze", "evaluate", "create"]             # Bloom's taxonomy levels
    },
    "human_review": {
        "enabled": False,                                                                          # Enable human review step
        "review_threshold": 0.6
    },
    "export": {
        "format": "json",
        "include_metadata": True,
        "output_path": "./data/output"
    }
}

# Save config
config_path = PROJECT_ROOT / 'config.json'
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

print(f"✅ Configuration saved to: {config_path}")
print("⚠️  Remember to update API keys in config.json")

✅ Configuration saved to: /home/kaizu/Projects/test/BSSC_QA/config.json
⚠️  Remember to update API keys in config.json
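
Before running the pipeline, it can help to check that every enabled agent points at a defined provider and that no placeholder API key is left in place. The sketch below is an assumption-laden illustration: `validate_config` and `PLACEHOLDER_KEY` are hypothetical names, not part of BSSC-QA, and the checks only cover the config keys shown above.

```python
from typing import Any, Dict, List

PLACEHOLDER_KEY = "your_api_key_here"  # placeholder string used in the demo config


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    llm_cfg = config.get("llm", {})
    providers = llm_cfg.get("providers", {})
    default = llm_cfg.get("default_provider")

    if default not in providers:
        problems.append(
            f"default_provider '{default}' is not defined under llm.providers")

    for agent_name, agent_cfg in config.get("agents", {}).items():
        # Agents that are explicitly disabled are skipped.
        if agent_cfg.get("enabled", True) is False:
            continue
        provider = agent_cfg.get("provider", default)
        if provider not in providers:
            problems.append(
                f"agent '{agent_name}' uses unknown provider '{provider}'")
        elif providers[provider].get("api_key") in (None, "", PLACEHOLDER_KEY):
            problems.append(
                f"agent '{agent_name}': provider '{provider}' has no real API key")
    return problems


demo = {
    "llm": {"default_provider": "gemini",
            "providers": {"gemini": {"api_key": "your_api_key_here"}}},
    "agents": {"generator": {"provider": "gemini"},
               "planner": {"enabled": False, "provider": "gemini"}},
}
for problem in validate_config(demo):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets all issues be reported in a single pass.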


## Customize Prompt Templates

*Note: The prompts below are only examples; **prompts/short_prompt.json** contains different prompts. Edit it or create a new prompt set depending on the **dataset**, the **LLM**, and the responses the LLM returns.*


**Question Generator Prompt:** 

```python
    """
    Generate a clear, specific question from this text chunk.

    TEXT CHUNK:
    {chunk_text}

    Requirements:
    - Create ONE question that tests comprehension
    - Focus on key facts or concepts
    - Avoid yes/no questions
    - Keep question concise (1 sentence max)

    Question:
    """
```
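
A template like the one above can be filled with `str.format`. This is a minimal sketch, assuming the prompt is stored as a plain string with a `{chunk_text}` placeholder; `QUESTION_PROMPT` and `build_question_prompt` are hypothetical names, and how BSSC-QA itself loads and renders `prompts/short_prompt.json` is not shown here.

```python
import textwrap

# Assumption: the template text matches the example prompt above.
QUESTION_PROMPT = textwrap.dedent("""\
    Generate a clear, specific question from this text chunk.

    TEXT CHUNK:
    {chunk_text}

    Requirements:
    - Create ONE question that tests comprehension
    - Focus on key facts or concepts
    - Avoid yes/no questions
    - Keep question concise (1 sentence max)

    Question:
    """)


def build_question_prompt(chunk_text: str) -> str:
    """Insert the chunk into the template.

    Braces inside the chunk are safe: str.format only interprets
    placeholders in the template, not in the substituted value.
    """
    return QUESTION_PROMPT.format(chunk_text=chunk_text.strip())


print(build_question_prompt("Kotick led the seals to a new beach."))
```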

**Answer Synthesis Prompt:** 

```python
    """
    Answer the question using ONLY the provided evidence. Be concise.

    QUESTION: {question}
    Question Type: {question_type}

    EVIDENCE:
    {evidence_text}

    Requirements:
    - Answer in a few words. One or two words preferred.
    - Cite specific evidence using [Evidence N] format
    - Direct and factual

    Answer:
    """
```

**Evaluator Prompt:** 
```python
    """
    Strictly evaluate this QA pair on a 0-1 scale for each metric:

    Question: {qa['question']}
    Answer: {qa['answer']}

    Evidence:
    {evidence_text[:-1]}

    Metrics:
    1. Relevance: Does answer address the question?
    2. Clarity: Is answer clear and well-written?
    3. Completeness: Are key points covered?
    4. Factuality: Is answer grounded in evidence?
    5. Conciseness: Is answer appropriately brief and within one or two words?

    Format: 
    relevance: X.X
    clarity: X.X
    completeness: X.X
    factuality: X.X
    conciseness: X.X
    """
```
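
A response in the `metric: X.X` format requested above can be parsed line by line. The following is a sketch under stated assumptions: `parse_scores` and `METRIC_LINE` are hypothetical names, the caller supplies the metric names it expects, and values outside 0-1 are clamped rather than rejected. The framework's own parsing may differ.

```python
import re
from typing import Dict, Iterable

# One "name: number" pair per line, e.g. "relevance: 0.9"
METRIC_LINE = re.compile(r"^\s*([A-Za-z_]+)\s*:\s*([0-9]*\.?[0-9]+)\s*$")


def parse_scores(response: str, expected: Iterable[str]) -> Dict[str, float]:
    """Extract expected 'metric: value' lines and clamp values to [0, 1]."""
    wanted = {name.lower() for name in expected}
    scores: Dict[str, float] = {}
    for line in response.splitlines():
        match = METRIC_LINE.match(line)
        if not match:
            continue  # ignore any commentary the model adds
        name = match.group(1).lower()
        value = float(match.group(2))
        if name in wanted:
            scores[name] = min(1.0, max(0.0, value))
    return scores


reply = "relevance: 0.9\nClarity: 1.0\nnote: looks good\nfactuality: 1.5"
print(parse_scores(reply, ["relevance", "clarity", "completeness", "factuality"]))
```

Metrics missing from the reply are simply absent from the result, so the caller can decide whether to retry or to treat them as zero.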

## Rerun with the Updated Config

After saving the new config, comment out the previous config cell and rerun the pipeline.


*Note: Do not add the same data to the vector store again.*
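
One way to guard against re-adding data is to derive a stable ID from each chunk's content and skip IDs that are already indexed. This is only a sketch: `chunk_id` and `select_new_chunks` are hypothetical helpers, not BSSC-QA functions, and how the set of existing IDs is read from the vector store is left out. The approach assumes the store accepts caller-supplied IDs, as ChromaDB collections do.

```python
import hashlib
from typing import Dict, Iterable, Set


def chunk_id(text: str) -> str:
    """Stable ID from chunk content, ignoring differences in whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def select_new_chunks(chunks: Iterable[str],
                      existing_ids: Set[str]) -> Dict[str, str]:
    """Return {id: text} for chunks that are not yet indexed."""
    new_chunks: Dict[str, str] = {}
    for text in chunks:
        cid = chunk_id(text)
        # Skip chunks already in the store and duplicates within this batch.
        if cid not in existing_ids and cid not in new_chunks:
            new_chunks[cid] = text
    return new_chunks


already_indexed = {chunk_id("Sea Catch sunk his head.")}
batch = ["Sea Catch  sunk his head.", "Kotick roared.", "Kotick roared."]
print(list(select_new_chunks(batch, already_indexed).values()))
```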