# Financial Document QA & Risk Detection Pipeline

---



# Baseline Architecture & Models

## Baseline Design (MVP, Reproducible)
- **Chunking:** Simple overlapping splitter (300–500 tokens).  

### Retriever
- **Sparse:** BM25 via Elasticsearch (keyword precision for numeric/legal phrases).  
- **Dense:** sentence-transformers/all-MiniLM-L6-v2 → FAISS flat index for quick prototyping.  
- **Fusion:** Union top results from BM25 and dense retriever (deduplicate).  
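
The fusion step can be sketched as a union with deduplication. This is a minimal sketch, assuming each retriever returns a ranked list of chunk IDs. The name `fuse_results` and the round-robin interleaving are illustrative choices, not part of the baseline spec.

```python
from itertools import zip_longest

def fuse_results(bm25_ids, dense_ids, k=10):
    """Union two ranked lists of chunk IDs, dropping duplicates.

    Hypothetical helper: interleaves the two lists so that the top hit
    of each retriever survives truncation to k.
    """
    fused, seen = [], set()
    for pair in zip_longest(bm25_ids, dense_ids):
        for chunk_id in pair:
            if chunk_id is not None and chunk_id not in seen:
                seen.add(chunk_id)
                fused.append(chunk_id)
    return fused[:k]

print(fuse_results(["c3", "c1", "c7"], ["c1", "c9"], k=4))
# ['c3', 'c1', 'c9', 'c7']
```

Interleaving avoids comparing BM25 scores with embedding distances, which are on different scales.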

### Generator
- **Model:** API LLM (GPT-4/GPT-4o)  
- **Prompt:** Strict prompt includes only retrieved chunks; must cite chunk IDs and source doc metadata.  
- **Temperature:** 0.0 (minimizes sampling variance; hosted API outputs can still differ slightly between calls).  

### Audit Trail
- Save retrieved chunk IDs, prompt text, model response, tokens used.  
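
One way to persist these fields is an append-only JSON Lines log. The sketch below assumes one record per query. The field names, the `audit_record` and `append_audit_log` helpers, and the `audit_log.jsonl` path are all hypothetical.

```python
import json
import time

def audit_record(query, chunk_ids, prompt, response_text, tokens_used):
    """Build one JSON-serializable audit entry (field names are illustrative)."""
    return {
        "timestamp": time.time(),
        "query": query,
        "retrieved_chunk_ids": list(chunk_ids),
        "prompt": prompt,
        "response": response_text,
        "tokens_used": tokens_used,
    }

def append_audit_log(record, path="audit_log.jsonl"):
    """Append the record as one line of JSON so the log is easy to replay."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Storing the full prompt text, not just the chunk IDs, lets a reviewer reproduce exactly what the model saw.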

### Fishiness Detector (Baseline)
- Rule-based heuristics:
  - YoY ratio thresholds  
  - Large goodwill changes  
  - Related-party keywords  
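
The three heuristics can be sketched as a single function. This is a rough sketch, assuming structured figures (revenue, goodwill) have already been extracted for the current and prior year. The `rule_flags` name, the dict keys, and both thresholds are placeholder assumptions, not calibrated values.

```python
import re

RELATED_PARTY = re.compile(r"related[- ]part(?:y|ies)", re.IGNORECASE)

def rule_flags(current, prior, text, yoy_limit=0.5, goodwill_limit=0.25):
    """Return rule-based flags for one filing.

    `current` and `prior` are dicts with "revenue" and "goodwill" keys;
    both limits are illustrative placeholders.
    """
    flags = []
    if prior["revenue"] > 0:
        yoy = (current["revenue"] - prior["revenue"]) / prior["revenue"]
        if abs(yoy) > yoy_limit:
            flags.append(f"YoY revenue change {yoy:+.0%}")
    if prior["goodwill"] > 0:
        gw = (current["goodwill"] - prior["goodwill"]) / prior["goodwill"]
        if abs(gw) > goodwill_limit:
            flags.append(f"Goodwill change {gw:+.0%}")
    if RELATED_PARTY.search(text):
        flags.append("Related-party mention")
    return flags

print(rule_flags({"revenue": 180, "goodwill": 50},
                 {"revenue": 100, "goodwill": 50},
                 "Loans to related parties increased."))
# ['YoY revenue change +80%', 'Related-party mention']
```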

### Why Baseline?
- Fast to implement, explainable, cheap.  
- BM25 catches numeric/keyword queries that embeddings might miss.  
- MiniLM provides basic semantic coverage.  
- Good for initial human reviewers.  

### Scalability
- Elasticsearch clusters scale horizontally.  
- FAISS with IVF/PQ can scale to tens of millions of vectors.  
- LLM throughput via API is bounded by the provider's rate limits and concurrency quotas.  

---

# Candidate Models (In Depth)

## A. Retrieval Improvements
### 1. Domain-specific Dense Embeddings
- **Model:** FinBERT, a sentence-transformer fine-tuned on financial text, or API-hosted embeddings (e.g., OpenAI) evaluated on financial queries.  
- **Why:** Captures domain semantics (e.g., “reserve” vs “provision”), increases retrieval precision.  
- **Scale:** Same vector infrastructure; slightly heavier encoder amortized at ingest time.  

### 2. Hybrid Retrieval (BM25 + Dense + Keyword Expansions)
- **What:** Query both systems and merge results. Add query expansion with company-specific synonyms and accounting terms.  
- **Why:** Maximizes recall — BM25 catches exact matches, dense catches paraphrases.  
- **Scale:** Two systems to operate; fusion is O(k) in the number of candidates. Cache common queries.  
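
Query expansion can be as simple as a lookup table applied before the sparse query is issued. The sketch below assumes a hand-curated synonym table. `SYNONYMS` and `expand_query` are hypothetical names, and the entries are examples only.

```python
# Hypothetical synonym table; a real one would be curated per company or sector.
SYNONYMS = {
    "revenue": ["sales", "turnover"],
    "receivables": ["accounts receivable", "trade receivables"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Append known synonyms for any term that appears in the query."""
    terms = query.lower().split()
    extra = []
    for term in terms:
        for alt in synonyms.get(term, []):
            if alt not in terms and alt not in extra:
                extra.append(alt)
    return query if not extra else f"{query} {' '.join(extra)}"

print(expand_query("revenue growth"))
# revenue growth sales turnover
```

Expansion mainly helps the BM25 side, since the dense retriever already tolerates paraphrase.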

### 3. Cross-Encoder Reranker
- **Model:** Fine-tuned cross-encoder (BERT / FinBERT).  
- **What:** Re-ranks top N candidates (e.g., 100) from the retriever.  
- **Why:** Improves precision@k, reduces hallucination risk.  
- **Scale:** Run on GPUs, only on top N. Can distill later for latency.  
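
The reranking step has a simple shape: score each (query, candidate) pair jointly, then keep the best few. In the sketch below, `score_fn` stands in for a fine-tuned cross-encoder's forward pass. `overlap_score` is a toy scorer used only to make the example runnable, not a proposed model.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-order retriever candidates by a pairwise relevance score.

    `score_fn(query, text) -> float` is a placeholder for a cross-encoder;
    higher means more relevant.
    """
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def overlap_score(query, text):
    """Toy scorer (assumption): fraction of query words found in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

candidates = ["cash flow was strong", "revenue growth of 20%", "revenue fell"]
print(rerank("revenue growth", candidates, overlap_score, top_k=2))
# ['revenue growth of 20%', 'revenue fell']
```

Because the scorer sees the query and the chunk together, its cost grows linearly with N, which is why it runs only on the retriever's shortlist.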

---

## B. Evidence Extraction & Citation
### Span-Extractor / Token-Level QA
- **Model:** Fine-tuned transformer for extractive QA (e.g., RoBERTa/FinBERT QA).  
- **Why:** Extracts exact supporting sentences/phrases → improves citation fidelity, reduces LLM context size.  
- **Scale:** Run per candidate chunk; batch inference. Fine-tune on labeled (question → supporting span) pairs.  

---

## C. Generation Improvements
### Instruction-Tuned / LoRA-Fine-Tuned Generator
- **What:** Fine-tune smaller LLM (7B–13B) on financial QA + citation templates.  
- **Why:** Reduces cost, improves adherence to citation rules and tone.  
- **Scale:** Serve on GPU cluster; LoRA keeps resources manageable.  

### Citation-Aware Decoding Constraints
- **What:** Enforce output patterns (“Claim — [Source: DOC_ID|CHUNK_ID]”) and post-validate with span extractor.  
- **Why:** Required for audit; reduces hallucination.  
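
The post-validation half of this can be sketched with a regular expression. The sketch assumes the generator emits one claim per line ending in a `[Source: DOC_ID|CHUNK_ID]` tag. `validate_citations` and the `allowed` set of retrieved (doc, chunk) pairs are hypothetical names.

```python
import re

CITATION = re.compile(r"\[Source:\s*([^|\]]+)\|([^\]]+)\]")

def validate_citations(answer, allowed):
    """Check every non-empty line of the answer for a known citation.

    `allowed` is a set of (doc_id, chunk_id) pairs that were actually
    retrieved. Returns (line, problem) tuples; an empty list means pass.
    """
    problems = []
    for line in filter(None, (ln.strip() for ln in answer.splitlines())):
        cites = [(d.strip(), c.strip()) for d, c in CITATION.findall(line)]
        if not cites:
            problems.append((line, "missing citation"))
        elif any(cite not in allowed for cite in cites):
            problems.append((line, "cites a chunk that was not retrieved"))
    return problems
```

A non-empty result can trigger a regeneration or a hand-off to human review. Checking that the cited span really supports the claim is left to the span extractor or verifier.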

---

## D. Verification & Safety
### Verifier / Entailment Classifier
- **What:** NLI model checks if each claim is supported by cited evidence (entail / contradict / unknown).  
- **Why:** Automates early detection of hallucinations; triggers human review if low confidence.  
- **Scale:** A small transformer run once per claim; cheap enough to batch.  
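
The routing logic around the verifier might look like the sketch below. `nli_fn` is a hypothetical wrapper around whatever NLI model is chosen, `substring_nli` is a toy stand-in so the example runs, and the 0.8 confidence cutoff is a placeholder.

```python
def route_claims(claims, nli_fn, min_confidence=0.8):
    """Split claims into auto-accepted and needs-human-review lists.

    `claims` is a list of (claim, evidence) pairs. `nli_fn(evidence, claim)`
    returns (label, confidence) with label in
    {"entail", "contradict", "unknown"}.
    """
    accepted, review = [], []
    for claim, evidence in claims:
        label, confidence = nli_fn(evidence, claim)
        if label == "entail" and confidence >= min_confidence:
            accepted.append(claim)
        else:
            review.append((claim, label, confidence))
    return accepted, review

def substring_nli(evidence, claim):
    """Toy stand-in (assumption): 'entail' only if the claim appears verbatim."""
    return ("entail", 0.95) if claim in evidence else ("unknown", 0.4)

claims = [("revenue rose 20%", "revenue rose 20% YoY"),
          ("margins doubled", "revenue rose 20% YoY")]
print(route_claims(claims, substring_nli))
# (['revenue rose 20%'], [('margins doubled', 'unknown', 0.4)])
```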

---

## E. Fishiness / Accounting Risk Detector (Hybrid)
- **Signals:**
  - Rule flags: “related party”, “restatement”, “one-time gain”, ambiguous language.  
  - Supervised classifier: Paragraph embeddings → suspicious vs normal; trained on analyst-labeled data.  
  - Time-series anomalies: z-score / isolation forests on ratios (ROA, gross margin, receivables/sales), peer-relative deviations.  
- **Why:** Combines linguistic + numeric signals to reduce false positives; human-in-the-loop improves precision.  
- **Scale:** Run offline at ingestion and on-demand.  
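
The z-score signal can be sketched with the standard library alone. The sketch assumes one ratio (for example receivables/sales) observed over consecutive periods. `zscore_anomalies` and its threshold of 2.0 are illustrative, not tuned.

```python
from statistics import mean, stdev

def zscore_anomalies(series, threshold=2.0):
    """Return (index, value, z) for points far from the series mean.

    `series` holds one ratio over time; the threshold is a placeholder.
    """
    if len(series) < 3:
        return []
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []
    return [(i, x, round((x - mu) / sigma, 2))
            for i, x in enumerate(series)
            if abs(x - mu) / sigma > threshold]

receivables_to_sales = [0.10, 0.11, 0.10, 0.12, 0.11, 0.10, 0.11, 0.30]
for idx, value, z in zscore_anomalies(receivables_to_sales):
    print(f"period {idx}: ratio={value} z={z}")
```

With short histories a single outlier inflates the standard deviation and caps the attainable z-score, which is one reason to pair this with isolation forests or peer-relative comparisons.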

---

## F. Operational Acceleration
### Distillation / Approximate Reranker
- **What:** Distill cross-encoder into faster bi-encoder or train lightweight reranker for near cross-encoder accuracy.  
- **Why:** Reduces latency while keeping precision.  


In [2]:
!pip install PyMuPDF sentence-transformers faiss-cpu openai




In [3]:

# -------------------------------
# 1. Imports
# -------------------------------
import fitz  # PyMuPDF for PDF extraction
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import re
import openai







In [None]:
# -------------------------------
# 2. OpenAI API Key setup
# -------------------------------
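import os

# Prefer the OPENAI_API_KEY environment variable over a key hard-coded in the notebook.
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")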


In [None]:
# -------------------------------
# 3. Sample PDF/Text ingestion
# -------------------------------
sample_docs = {
    "10K_2024.pdf": """
    Company ABC reports a 20% increase in revenue YoY, with operating income up 15%.
    Risk factors include potential litigation and supply chain disruptions.
    """,
    "Earnings_Call_Q1.txt": """
    Management mentions that cash flow is strong, but receivables have increased.
    There is a one-time gain from asset sale.
    """
}

In [None]:
# -------------------------------
# 4. Chunking function
# -------------------------------
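def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks.

    Sizes are counted in whitespace-separated words, not model tokens;
    the small defaults suit the short demo texts.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break  # last chunk reached; avoid emitting a redundant tail chunk
        start += chunk_size - overlap
    return chunks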

doc_chunks = []
metadata = []
for doc_name, text in sample_docs.items():
    chunks = chunk_text(text)
    for i, c in enumerate(chunks):
        doc_chunks.append(c)
        metadata.append(f"{doc_name} | chunk {i+1}")

In [None]:
# -------------------------------
# 5. Embeddings
# -------------------------------
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(doc_chunks, convert_to_numpy=True)

In [None]:
# -------------------------------
# 6. FAISS Index
# -------------------------------
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)


In [None]:
# -------------------------------
# 7. Fishiness Detector
# -------------------------------
def is_fishy(text):
    flags = []
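    # Crude proxy: any "N% increase" phrase, whatever metric it describes.
    # \d+ also matches three-digit percentages, which \d{1,2} silently skipped.
    if re.search(r'\b\d+(?:\.\d+)?% increase\b', text, re.I):
        flags.append("Revenue/Income spike")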
    if re.search(r'one-time gain|extraordinary item', text, re.I):
        flags.append("One-time gain")
    if re.search(r'litigation|risk factor|uncertain', text, re.I):
        flags.append("Potential risk")
    return flags


In [None]:
# -------------------------------
# 8. Retrieval function
# -------------------------------
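def retrieve(query, k=3):
    # Never ask for more neighbours than the index holds: FAISS pads the
    # result with -1, which would silently index the last chunk.
    k = min(k, index.ntotal)
    query_emb = model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_emb, k)
    results = []
    for idx in indices[0]:
        if idx < 0:
            continue
        text = doc_chunks[idx]
        results.append({
            "text": text,
            "source": metadata[idx],
            "fishy_flags": is_fishy(text)
        })
    return results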



In [None]:
# -------------------------------
# 9. GPT Answer Generation
# -------------------------------
def generate_answer_gpt(query, retrieved_chunks):
    context = ""
    for r in retrieved_chunks:
        flags = ", ".join(r['fishy_flags']) if r['fishy_flags'] else "None"
        context += f"[Source: {r['source']}, Fishy: {flags}] {r['text']}\n"

    prompt = f"""
    You are an equity research assistant.
    Use the following context from company documents to answer the query below.
    Cite the source for each fact using the format [Source: DOC|Chunk].
    Highlight anything fishy using ⚠️ if flagged.

    Context:
    {context}

    Query: {query}

    Provide a concise answer with citations.
    """

    # openai>=1.0 removed openai.ChatCompletion; use the chat.completions interface.
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content



In [None]:
# -------------------------------
# 10. Example Usage
# -------------------------------
query = "Explain revenue growth and risks"
retrieved = retrieve(query)
answer = generate_answer_gpt(query, retrieved)

print("Query:", query)
print("\nTop retrieved chunks with citations and fishiness flags:")
for i, r in enumerate(retrieved):
    print(f"{i+1}. [{r['source']}] {r['text']}")
    if r['fishy_flags']:
        print(f"   ⚠️ Fishy Flags: {', '.join(r['fishy_flags'])}")

print("\nGenerated GPT Answer:\n")
print(answer)
