# 05 - QA Generation: Answer with Citations and Math Verification

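This notebook implements the answer-generation pipeline: an LLM reader drafts the answer from retrieved evidence, a deterministic math verifier checks numeric claims, and a citation builder attaches precise source citations.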

**Objectives:**
- Generate answers using an LLM (hosted API or local model)
- Perform deterministic math verification for numeric answers
- Generate precise citations (section + table cell)
- Handle conflicting evidence
- Ensure faithfulness to retrieved sources

In [23]:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'  # Work around duplicate OpenMP runtime errors; set before importing faiss

import sys
import json
import re
import pickle
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any
import faiss

sys.path.append(str(Path.cwd().parent / 'src'))

from utils.config import INDICES_DIR, MODEL_DIR
from retrieval.hierarchical_retriever import HierarchicalRetriever
from retrieval.query_router import QueryRouter
from qa.answer_generator import AnswerGenerator
from qa.math_verifier import MathVerifier
from qa.citation_builder import CitationBuilder

## 1. Load Retrieval System

In [24]:
# Load indices and retrieval components from previous notebook
from sentence_transformers import SentenceTransformer

# Load indices
section_index = faiss.read_index(str(INDICES_DIR / "section_index.faiss"))
text_index = faiss.read_index(str(INDICES_DIR / "text_index.faiss"))
table_index = faiss.read_index(str(INDICES_DIR / "table_index.faiss"))

with open(INDICES_DIR / "section_data.pkl", 'rb') as f:
    section_data = pickle.load(f)
with open(INDICES_DIR / "text_data.pkl", 'rb') as f:
    text_data = pickle.load(f)
with open(INDICES_DIR / "table_data.pkl", 'rb') as f:
    table_data = pickle.load(f)
with open(INDICES_DIR / "index_config.json", 'r') as f:
    index_config = json.load(f)

# Initialize components
embedding_model = SentenceTransformer(index_config['embedding_model'])
query_router = QueryRouter()
hierarchical_retriever = HierarchicalRetriever(
    section_index, text_index, table_index,
    section_data, text_data, table_data,
    embedding_model
)

print("Retrieval system loaded")

Retrieval system loaded
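
Before querying, it can help to confirm that each index and its pickled metadata are still aligned. The sketch below assumes each `*_data` object is a sequence with one record per indexed vector, which the notebook does not confirm; `check_index_alignment` and `_FakeIndex` are hypothetical names used only for illustration. It relies only on the `ntotal` attribute that FAISS indices expose.

```python
from typing import Any, Sized


def check_index_alignment(name: str, index: Any, records: Sized) -> bool:
    """Return True when the index holds exactly one vector per metadata record."""
    n_vectors = int(index.ntotal)  # vector count stored in the index
    n_records = len(records)       # assumption: one metadata record per vector
    if n_vectors != n_records:
        print(f"{name}: {n_vectors} vectors vs {n_records} records (mismatch)")
        return False
    return True


class _FakeIndex:
    """Stand-in for a FAISS index so the sketch runs without index files."""

    def __init__(self, ntotal: int) -> None:
        self.ntotal = ntotal


aligned = check_index_alignment("demo", _FakeIndex(3), ["a", "b", "c"])
misaligned = check_index_alignment("demo", _FakeIndex(3), ["a", "b"])
```

In the notebook this would be called as, for example, `check_index_alignment("section", section_index, section_data)`.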


## 2. Initialize Answer Generation Components

In [25]:
# Initialize QA components

# Answer Generator with Ollama
USE_OLLAMA = True  # Use a local model served by Ollama
OLLAMA_MODEL = "llama3.2:1b"  # Any locally pulled model works, e.g. mistral, phi3, llama3
OLLAMA_BASE_URL = "http://localhost:11434"  # Default Ollama URL

USE_OPENAI = False  # Set to True to use OpenAI instead
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")  # Only needed if USE_OPENAI = True; read from the environment instead of hardcoding

print("Initializing answer generator with Ollama...")
print(f"Model: {OLLAMA_MODEL}")
print("Make sure Ollama is running: ollama serve")
print()

try:
    answer_generator = AnswerGenerator(
        use_ollama=USE_OLLAMA,
        ollama_model=OLLAMA_MODEL,
        ollama_base_url=OLLAMA_BASE_URL,
        use_openai=USE_OPENAI,
        openai_api_key=OPENAI_API_KEY if USE_OPENAI else None,
        max_length=512,
        temperature=0.1  # Low temperature for factual answers
    )
except Exception as e:
    print(f"Error initializing answer generator: {e}")
    print("\nIf Ollama is not running, start it with: ollama serve")
    print(f"Then pull the model: ollama pull {OLLAMA_MODEL}")
    raise

# Math Verifier
math_verifier = MathVerifier(tolerance=0.01)  # 1% tolerance
print("✓ Math verifier initialized")

# Citation Builder
citation_builder = CitationBuilder()
print("✓ Citation builder initialized")

print("\n✓ All QA components ready")

Initializing answer generator with Ollama...
Model: llama3.2:1b
Make sure Ollama is running: ollama serve

✓ Ollama client initialized (model: llama3.2:1b)
✓ Math verifier initialized
✓ Citation builder initialized

✓ All QA components ready
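
The `tolerance=0.01` setting is described as a 1% tolerance. How `MathVerifier` applies it internally is not shown here, so the following is only a sketch of one common interpretation, a relative tolerance check using the standard library; `within_tolerance` is a hypothetical helper, not part of the project code.

```python
import math


def within_tolerance(claimed: float, expected: float, rel_tol: float = 0.01) -> bool:
    """True when the claimed value is within rel_tol (1% by default) of the expected one."""
    # math.isclose compares |claimed - expected| against rel_tol * max(|claimed|, |expected|)
    return math.isclose(claimed, expected, rel_tol=rel_tol)


close_enough = within_tolerance(112.5, 112.0)  # differs by about 0.4%
too_far = within_tolerance(22.0, 220.0)        # off by an order of magnitude
```

A relative check scales with the magnitude of the figures, which matters when the same verifier handles both percentages and dollar amounts in the billions.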


## 3. Answer Generation Pipeline

In [26]:
def generate_answer_with_citations(query: str, verbose: bool = True) -> Dict[str, Any]:
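    """
    Complete QA pipeline: route → retrieve → generate → verify → cite.

    Returns a dict with the query, answer text, citations, retrieved evidence,
    routing info, math verification result (None when the query does not
    require math), and the generator's confidence if it reports one.
    """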
    if verbose:
        print(f"Query: {query}\n")
    
    # 1. Route query
    route_info = query_router.route(query)
    if verbose:
        print(f"Query type: {route_info['query_type']}")
        print(f"Table-centric: {route_info['is_table_centric']}")
        print(f"Requires math: {route_info['requires_math']}\n")
    
    # 2. Retrieve evidence
    retrieval_results = hierarchical_retriever.retrieve(
        query=query,
        route_info=route_info,
        top_k_sections=5,
        top_k_content=10,
        use_hybrid=True
    )
    
    if verbose:
        print(f"Retrieved {len(retrieval_results['content'])} evidence pieces\n")
    
    # 3. Generate answer
    answer_result = answer_generator.generate(
        query=query,
        evidence=retrieval_results['content'],
        route_info=route_info
    )
    
    answer_text = answer_result['answer']
    if verbose:
        print(f"Generated answer: {answer_text}\n")
    
    # 4. Math verification (if needed)
    verification_result = None
    if route_info['requires_math']:
        verification_result = math_verifier.verify(
            answer=answer_text,
            evidence=retrieval_results['content'],
            query=query
        )
        
        if verbose:
            print(f"Math verification: {verification_result['status']}")
            if verification_result['status'] != 'verified':
                print(f"  Issue: {verification_result['message']}")
            print()
    
    # 5. Build citations
    citations = citation_builder.build_citations(
        answer=answer_text,
        evidence=retrieval_results['content']
    )
    
    if verbose:
        print("Citations:")
        for i, citation in enumerate(citations, start=1):
            print(f"  [{i}] {citation}")
    
    # 6. Compile final result
    result = {
        'query': query,
        'answer': answer_text,
        'citations': citations,
        'evidence': retrieval_results['content'],
        'route_info': route_info,
        'verification': verification_result,
        'confidence': answer_result.get('confidence')
    }
    
    return result
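
One of the objectives is handling conflicting evidence, and the retrieved evidence can span several companies and fiscal years. The sketch below shows one simple way to surface that before trusting an answer: count evidence pieces per source and flag mixed tickers. It assumes each evidence item carries `metadata` with `ticker` and `fiscal_year` keys, as used by the formatting code later in this notebook; `summarize_sources` and `has_mixed_companies` are hypothetical helpers, not part of the pipeline.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_sources(evidence: List[Dict[str, Any]]) -> Counter:
    """Count evidence pieces per (ticker, fiscal_year) pair."""
    return Counter(
        (ev["metadata"].get("ticker"), ev["metadata"].get("fiscal_year"))
        for ev in evidence
    )


def has_mixed_companies(evidence: List[Dict[str, Any]]) -> bool:
    """True when the evidence comes from more than one ticker."""
    tickers = {ev["metadata"].get("ticker") for ev in evidence}
    tickers.discard(None)  # ignore items with no ticker recorded
    return len(tickers) > 1


# Illustrative evidence list (contents are placeholders, not real filing text)
demo_evidence = [
    {"content": "row a", "metadata": {"ticker": "CMG", "fiscal_year": 2023}},
    {"content": "row b", "metadata": {"ticker": "CMG", "fiscal_year": 2024}},
    {"content": "row c", "metadata": {"ticker": "TEL", "fiscal_year": 2024}},
]
source_counts = summarize_sources(demo_evidence)
mixed = has_mixed_companies(demo_evidence)
```

A flag like `mixed` could be added to the result dict so that answers built from more than one company's filings are reviewed rather than accepted silently.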

## 4. Test QA Pipeline: Numeric Questions

In [27]:
# Test with numeric questions requiring math
numeric_queries = [
    "Report the YoY change in R&D expense for 2022 to 2024",
    "What is the ratio of long-term debt to equity in 2023?",
    "Which operating segment contributed most to 2024 revenue growth, and by how much?"
]

print("=" * 80)
print("NUMERIC QUESTION TESTS")
print("=" * 80)

numeric_results = []
for query in numeric_queries:
    print("\n" + "=" * 80)
    result = generate_answer_with_citations(query, verbose=True)
    numeric_results.append(result)
    print("=" * 80)

NUMERIC QUESTION TESTS

Query: Report the YoY change in R&D expense for 2022 to 2024

Query type: numeric_table
Table-centric: True
Requires math: True

Retrieved 10 evidence pieces

Generated answer: Based on the evidence provided, the YoY change in R&D expense for 2022 to 2024 is as follows:

R&D expense in 2023 was $1.23B (Table T8, Row 2).
R&D expense in 2024 was $1.45B (Table T7, Row 1).

The YoY change in R&D expense from 2022 to 2024 is a $22M increase ($1.45B - $1.23B).

Math verification: verified

Citations:
  [1] CMG 2023 10-K, Unknown, Table T8, Row 2
  [2] CMG 2024 10-K, Unknown, Table T8, Row 2
  [3] TEL 2024 10-K, Unknown, Table T6, Row 1

Query: What is the ratio of long-term debt to equity in 2023?

Query type: numeric_table
Table-centric: True
Requires math: True

Retrieved 10 evidence pieces

Generated answer: Based on the provided evidence, we can calculate the ratio of long-term debt to equity in 2023 as follows:

[TABLE] HCA 2024 - Table T6, Row 1
:  2022   Ratio 

## 5. Test QA Pipeline: Narrative Questions

In [28]:
# Test with narrative/explanation questions
narrative_queries = [
    "What are the main business segments?",
    "Explain the primary risk factors mentioned in the filing",
    "What is the company's business strategy?"
]

print("=" * 80)
print("NARRATIVE QUESTION TESTS")
print("=" * 80)

narrative_results = []
for query in narrative_queries:
    print("\n" + "=" * 80)
    result = generate_answer_with_citations(query, verbose=True)
    narrative_results.append(result)
    print("=" * 80)

NARRATIVE QUESTION TESTS

Query: What are the main business segments?

Query type: narrative
Table-centric: False
Requires math: False

Retrieved 10 evidence pieces

Generated answer: Based on the provided evidence, the main business segments of TJX Companies, Inc. are:

1. Marmaxx and HomeGoods, both in the U.S.
2. TJX Canada and TJX International, including Europe and Australia
3. Sierra, acquired in 2012 and rebranded from Sierra Trading Post in 2018

These three segments operate under the following sub-segments:

* Marmaxx:
	+ TJ Maxx and Marshalls chains in the United States (collectively the largest off-price retailer in the United States with a total of 2,482 stores)
* HomeGoods:
	+ HomeGoods and Homesense chains
* Sierra:
	+ Sierra Trading Post (acquired in 2012)

Citations:
  [1] TJX 2023 10-K, Section 10
  [2] TJX 2024 10-K, Section 10
  [3] PNR 2022 10-K, Section 8

Query: Explain the primary risk factors mentioned in the filing

Query type: narrative
Table-centric: False
Re

## 6. Math Verification Deep Dive

In [29]:
# Detailed math verification example
query = "Calculate the YoY percentage change in revenue from 2023 to 2024"

print(f"Query: {query}\n")

# Retrieve evidence
route_info = query_router.route(query)
evidence = hierarchical_retriever.retrieve(
    query=query,
    route_info=route_info,
    top_k_content=10
)['content']

# Find table rows with revenue data
revenue_evidence = [
    e for e in evidence 
    if 'revenue' in e['content'].lower() and e['metadata']['content_type'] == 'table'
]

print("=== Revenue Evidence ===")
for i, ev in enumerate(revenue_evidence[:3], start=1):
    print(f"{i}. {ev['content']}")
    print(f"   Source: Table {ev['metadata']['table_id']}, Row {ev['metadata']['row_idx']}\n")

# Extract numbers from the revenue evidence
numbers = math_verifier.extract_numbers(revenue_evidence)
print("\n=== Extracted Numbers ===")
print(numbers)

# Verify a hand-written sample answer (not model output) against the evidence
answer = "Revenue increased 12.5% YoY from $100M in 2023 to $112.5M in 2024"
verification = math_verifier.verify(
    answer=answer,
    evidence=revenue_evidence,
    query=query
)

print("\n=== Verification Result ===")
print(f"Status: {verification['status']}")
print(f"Message: {verification['message']}")
if 'calculation_details' in verification:
    print(f"Details: {verification['calculation_details']}")

Query: Calculate the YoY percentage change in revenue from 2023 to 2024

=== Revenue Evidence ===
1. : • the percentage of our total revenue from various end markets,
   Source: Table T8, Row 21

2. :   Year EndedDecember 31,  Percentage change in revenue as reported  Impact ofchanges inforeigncurrency (a)  Percentage change in revenue on a constant currency basis (a) None None None
   Source: Table T8, Row 0

3. :  Year Ended December 31,   Percentage change in revenue   Impact ofchanges inforeigncurrency (a)   Percentage change in revenue on a constant currency basis (a)  None None None None None None None None
   Source: Table T8, Row 0


=== Extracted Numbers ===
[31.0, 31.0, 31.0, 2023.0, 2022.0, 2021.0]

=== Verification Result ===
Status: failed
Message: Calculations do not match evidence
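
The extracted numbers above are day-of-month values and fiscal years taken from table headers, not revenue figures, so there is nothing for the verifier to recompute. The sketch below illustrates one way to filter such tokens and then check a year-over-year claim deterministically. It is not the `MathVerifier` implementation; `extract_amounts`, `yoy_pct_change`, and `claim_matches` are hypothetical helpers, and treating any bare four-digit 19xx/20xx token as a year is a simplifying assumption.

```python
import math
import re
from typing import List, Optional

# Month-name dates such as "December 31," (day number must not run into a year)
_MONTH_DAY = re.compile(
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2}(?!\d),?"
)
_YEAR = re.compile(r"(?:19|20)\d{2}")
# Numbers with optional thousands separators and decimals, not glued to letters (e.g. "T8")
_NUMBER = re.compile(r"(?<![\w.])\d+(?:,\d{3})*(?:\.\d+)?")


def extract_amounts(text: str) -> List[float]:
    """Extract numeric values, skipping date fragments and bare years."""
    cleaned = _MONTH_DAY.sub(" ", text)
    amounts = []
    for match in _NUMBER.finditer(cleaned):
        raw = match.group()
        if _YEAR.fullmatch(raw):
            continue  # assumption: a bare 19xx/20xx token is a year, not an amount
        amounts.append(float(raw.replace(",", "")))
    return amounts


def yoy_pct_change(previous: float, current: float) -> Optional[float]:
    """Percentage change from previous to current; None when previous is zero."""
    if previous == 0:
        return None
    return (current - previous) / abs(previous) * 100.0


def claim_matches(claimed_pct: float, previous: float, current: float,
                  rel_tol: float = 0.01) -> bool:
    """Compare a claimed percentage change with the recomputed one."""
    expected = yoy_pct_change(previous, current)
    return expected is not None and math.isclose(claimed_pct, expected, rel_tol=rel_tol)


header = "Year EndedDecember 31,  Percentage change in revenue as reported"
header_amounts = extract_amounts(header)
answer_amounts = extract_amounts(
    "Revenue increased 12.5% YoY from $100M in 2023 to $112.5M in 2024"
)
claim_ok = claim_matches(12.5, previous=100.0, current=112.5)

# The same idea applies to differences: 1.45B - 1.23B, expressed in millions
difference_musd = (1.45 - 1.23) * 1000
```

With this filtering, header-only rows yield no amounts, which makes "no numeric evidence found" distinguishable from "calculation mismatch". The difference check also shows that $1.45B minus $1.23B is about $220M, so a verifier that recomputes differences would not accept a reported $22M increase.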


## 7. Citation Quality Analysis

In [30]:
# Analyze citation quality and precision
def analyze_citations(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Analyze citation quality across multiple QA results
    """
    stats = {
        'total_answers': len(results),
        'avg_citations_per_answer': [],
        'citation_types': {'text': 0, 'table': 0},
        'answers_with_table_citations': 0,
        'answers_with_cell_level_citations': 0
    }
    
    for result in results:
        citations = result['citations']
        stats['avg_citations_per_answer'].append(len(citations))
        
        has_table = False
        has_cell_level = False
        
        for citation in citations:
            if 'table' in citation.lower():
                stats['citation_types']['table'] += 1
                has_table = True
                
                if 'row' in citation.lower() and 'column' in citation.lower():
                    has_cell_level = True
            else:
                stats['citation_types']['text'] += 1
        
        if has_table:
            stats['answers_with_table_citations'] += 1
        if has_cell_level:
            stats['answers_with_cell_level_citations'] += 1
    
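    # Collapse the per-answer counts into a mean; avoid NaN when there are no results
    counts = stats['avg_citations_per_answer']
    stats['avg_citations_per_answer'] = float(np.mean(counts)) if counts else 0.0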
    
    return stats

# Analyze all results
all_results = numeric_results + narrative_results
citation_stats = analyze_citations(all_results)

print("=== Citation Quality Analysis ===")
print(f"Total answers: {citation_stats['total_answers']}")
print(f"Avg citations per answer: {citation_stats['avg_citations_per_answer']:.2f}")
print(f"Citation types: {citation_stats['citation_types']}")
print(f"Answers with table citations: {citation_stats['answers_with_table_citations']}")
print(f"Answers with cell-level citations: {citation_stats['answers_with_cell_level_citations']}")

=== Citation Quality Analysis ===
Total answers: 6
Avg citations per answer: 3.00
Citation types: {'text': 9, 'table': 9}
Answers with table citations: 3
Answers with cell-level citations: 0
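
The cell-level count is zero because the citation strings printed earlier stop at the row (for example `Table T8, Row 2`) and never include a column. Substring checks make that hard to see, so the sketch below parses a citation into fields and reports its granularity. The format is inferred from the printed citations only; `parse_citation` and `citation_granularity` are hypothetical helpers, and the `Column` field is an assumption about what a cell-level citation would contain.

```python
from typing import Dict, Optional


def parse_citation(citation: str) -> Dict[str, Optional[str]]:
    """Split a citation like 'CMG 2023 10-K, Unknown, Table T8, Row 2' into fields."""
    parts = [part.strip() for part in citation.split(",")]
    parsed: Dict[str, Optional[str]] = {
        "source": parts[0], "section": None, "table": None, "row": None, "column": None,
    }
    for part in parts[1:]:
        key, _, value = part.partition(" ")
        key = key.lower()
        if key in ("table", "row", "column") and value:
            parsed[key] = value
        else:
            parsed["section"] = part  # anything else is treated as the section label
    return parsed


def citation_granularity(citation: str) -> str:
    """Classify a citation as 'cell', 'row', 'table', or 'section' level."""
    parsed = parse_citation(citation)
    if parsed["table"] and parsed["row"] and parsed["column"]:
        return "cell"
    if parsed["table"] and parsed["row"]:
        return "row"
    if parsed["table"]:
        return "table"
    return "section"


table_level = citation_granularity("CMG 2023 10-K, Unknown, Table T8, Row 2")
text_level = citation_granularity("TJX 2023 10-K, Section 10")
parsed_example = parse_citation("CMG 2023 10-K, Unknown, Table T8, Row 2")
```

Parsing also exposes the `Unknown` section label on table citations, which is a separate gap from the missing column.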


## 8. Formatted Output with Citations

In [31]:
def format_qa_output(result: Dict[str, Any]) -> str:
    """
    Format QA result for presentation
    """
    output = []
    output.append("=" * 80)
    output.append(f"QUESTION: {result['query']}")
    output.append("=" * 80)
    output.append("\nANSWER:")
    output.append(result['answer'])
    
    if result['verification']:
        output.append("\nVERIFICATION:")
        output.append(f"  Status: {result['verification']['status']}")
        if result['verification']['status'] != 'verified':
            output.append(f"  Note: {result['verification']['message']}")
    
    output.append("\nSOURCES:")
    for i, citation in enumerate(result['citations'], start=1):
        output.append(f"  [{i}] {citation}")
    
    output.append("\nEVIDENCE:")
    for i, ev in enumerate(result['evidence'][:3], start=1):
        meta = ev['metadata']
        output.append(f"  {i}. [{meta['content_type'].upper()}] {meta['ticker']} {meta['fiscal_year']}")
        output.append(f"     {ev['content'][:150]}...")
    
    output.append("=" * 80)
    return "\n".join(output)

# Display formatted output for first result
if all_results:
    print(format_qa_output(all_results[0]))

QUESTION: Report the YoY change in R&D expense for 2022 to 2024

ANSWER:
Based on the evidence provided, the YoY change in R&D expense for 2022 to 2024 is as follows:

R&D expense in 2023 was $1.23B (Table T8, Row 2).
R&D expense in 2024 was $1.45B (Table T7, Row 1).

The YoY change in R&D expense from 2022 to 2024 is a $22M increase ($1.45B - $1.23B).

VERIFICATION:
  Status: verified

SOURCES:
  [1] CMG 2023 10-K, Unknown, Table T8, Row 2
  [2] CMG 2024 10-K, Unknown, Table T8, Row 2
  [3] TEL 2024 10-K, Unknown, Table T6, Row 1

EVIDENCE:
  1. [TABLE] CMG 2023
     : 2022  2021  change None None...
  2. [TABLE] CMG 2024
     : 2023  2022  change None None...
  3. [TABLE] TEL 2024
     ​:  2024   2023   2022  ...


## 9. Save QA Results

In [32]:
# Save QA results for evaluation
results_dir = Path.cwd().parent / 'data' / 'qa_results'
results_dir.mkdir(parents=True, exist_ok=True)

# Save as JSON
output_file = results_dir / 'qa_results.json'
with open(output_file, 'w', encoding='utf-8') as f:
    # Convert to serializable format
    serializable_results = []
    for result in all_results:
        serializable_results.append({
            'query': result['query'],
            'answer': result['answer'],
            'citations': result['citations'],
            'route_info': result['route_info'],
            'verification': result['verification'],
            'num_evidence': len(result['evidence'])
        })
    json.dump(serializable_results, f, indent=2)

print(f"Saved {len(all_results)} QA results to {output_file}")

# Save formatted text output
text_output_file = results_dir / 'qa_results.txt'
with open(text_output_file, 'w', encoding='utf-8') as f:
    for result in all_results:
        f.write(format_qa_output(result))
        f.write("\n\n")

print(f"Saved formatted output to {text_output_file}")

Saved 6 QA results to c:\Users\anand\Desktop\SEM 3\CS 582\Proj\data\qa_results\qa_results.json
Saved formatted output to c:\Users\anand\Desktop\SEM 3\CS 582\Proj\data\qa_results\qa_results.txt
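
The saved records include the `route_info` and `verification` dicts as returned by the pipeline. If those ever contain NumPy scalars or arrays (an assumption, since their contents are not shown here), `json.dump` raises a `TypeError`. A minimal sketch of a fallback converter follows; `to_jsonable` is a hypothetical helper passed through the standard `default` parameter of `json.dumps`.

```python
import json

import numpy as np


def to_jsonable(value):
    """Fallback for json.dumps: convert NumPy scalars and arrays to built-in types."""
    if isinstance(value, np.generic):
        return value.item()    # e.g. np.bool_ -> bool, np.float32 -> float
    if isinstance(value, np.ndarray):
        return value.tolist()  # nested lists of built-in types
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Illustrative record with NumPy values (not taken from the pipeline output)
record = {
    "requires_math": np.bool_(True),
    "score": np.float32(0.5),
    "ids": np.array([1, 2]),
}
encoded = json.dumps(record, default=to_jsonable)
decoded = json.loads(encoded)
```

The same hook works with `json.dump(serializable_results, f, indent=2, default=to_jsonable)`; it is only invoked for objects the encoder cannot handle on its own.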


## Next Steps

Proceed to **06_evaluation.ipynb** to evaluate the system on benchmark datasets (FinQA, DocFinQA, FinDER) and compute metrics.