<a href="https://colab.research.google.com/github/RegNLP/GraphRAG4RegGraph/blob/main/Agent2_SectionSummarizationAgent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This multi-agent system works as a step-by-step pipeline to generate a refined, legally consistent, and linguistically clear summary of a regulatory document. Here's how the process unfolds:

Agent 1 (Document Processor Agent) reads the full document and segments it into manageable sections.

Agent 2 (Section-Specific Summarization Agent) then generates a summary for each individual section.

Agent 3 (Legal Consistency Checker Agent) compares each section with its corresponding summary to ensure all factual and legal details are accurately captured.
• If the summary passes the legal consistency check, Agent 3 calls Agent 4.
• If there are issues, the system loops back to Agent 2 to re-generate a better section summary.

Agent 3 applies three checks:

1. Factual consistency with NLI
2. Obligation coverage
3. Key entity coverage
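
As a rough illustration of the obligation coverage check, one could look for source sentences that contain deontic cue words and test whether the summary retains most of their vocabulary. This is only a sketch: the cue list, the `obligation_coverage` name, the naive sentence splitter, and the 0.5 overlap threshold are assumptions made here, not Agent 3's actual logic.

```python
import re

# Hypothetical cue words that often signal a legal obligation.
OBLIGATION_CUES = {"must", "shall", "required", "prohibited"}

def tokens(text):
    """Lowercase alphabetic tokens, ignoring punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def obligation_coverage(section, summary, min_overlap=0.5):
    """Fraction of obligation sentences whose words mostly appear in the summary."""
    summary_vocab = set(tokens(summary))
    obligations = [s for s in split_sentences(section)
                   if OBLIGATION_CUES & set(tokens(s))]
    if not obligations:
        return 1.0  # nothing to cover
    covered = 0
    for sentence in obligations:
        words = set(tokens(sentence))
        overlap = len(words & summary_vocab) / len(words)
        if overlap >= min_overlap:
            covered += 1
    return covered / len(obligations)

section = ("The taxpayer must file the election on time. "
           "The agency shall publish guidance. "
           "This part describes background.")
summary = "The taxpayer must file the election on time."
print(obligation_coverage(section, summary))  # 0.5: one of two obligations is covered
```

A score below some agreed threshold would be one possible trigger for sending the section back to Agent 2.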



Agent 4 (Quality Assessor Agent) evaluates how well the summary preserves the original content, both lexically and semantically.
• If the section summary meets the quality criteria, Agent 4 calls Agent 5.
• If not, the process goes back to Agent 2 for further improvement.

Agent 4 uses two metrics:

1. ROUGE
2. BERTScore
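
To make the lexical side of this check concrete, here is a simplified ROUGE-1 F1 computed from clipped unigram overlap. It is a sketch under stated assumptions: the `rouge1_f1` name is hypothetical, there is no stemming or stopword handling, and a real evaluation would more likely rely on a dedicated ROUGE package. BERTScore is not sketched because it needs a pretrained model.

```python
import re
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference text and a candidate summary."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the entity must file the election"
candidate = "the entity must file"
# precision = 4/4, recall = 4/6, so F1 is about 0.8
print(round(rouge1_f1(reference, candidate), 3))
```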



Finally, Agent 5 (Summary Aggregation Agent) collects all the refined section summaries to create a complete, coherent summary of the entire document.

This iterative process ensures that each section is not only summarized accurately from a legal standpoint but also presented clearly, ultimately producing a reliable and comprehensive summary of the regulatory text.

Agent 2 works through the following steps:

Step 1: Extract key sentences (BM25 + keyword extraction) → Initial summary

Step 2: Validate (Agent 3: Legal Checker, Agent 4: Linguistic Evaluator)

Step 3: If Failed:
   - Retrieve missing obligations (RAG)
   - Augment prompt to LLM for improved summary

Step 4: Validate Again

Step 5: If Failed Again:
   - Expand retrieval scope (more sections, stricter retrieval)
   - Force inclusion of obligations via prompt constraints
   - Use extractive-based sentence merging if hallucination detected
   
Step 6: If Still Failing:
   - Flag for human review OR
   - Store unresolved cases for system improvement
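
The escalation in Steps 1 to 6 can be sketched as a loop over increasingly aggressive strategies, each followed by validation. Everything named here (`summarize_with_escalation`, `first_sentences`, the strategy labels, and the toy validator) is a hypothetical placeholder for the real extractive, RAG-augmented, and constrained-prompt steps and for Agents 3 and 4.

```python
def summarize_with_escalation(section, strategies, validators):
    """Try each strategy in order; return the first summary that passes every validator.

    strategies: list of (name, callable(section, previous_summary) -> summary)
    validators: list of callable(section, summary) -> bool
    """
    summary = None
    history = []
    for name, strategy in strategies:
        summary = strategy(section, summary)
        passed = all(check(section, summary) for check in validators)
        history.append((name, passed))
        if passed:
            return {"summary": summary, "status": "accepted", "history": history}
    # Step 6: every strategy failed, so flag the case for human review.
    return {"summary": summary, "status": "needs_human_review", "history": history}

def first_sentences(n):
    """Toy strategy factory: keep the first n sentences of the section."""
    def strategy(section, _previous):
        parts = [p.strip() for p in section.split(".") if p.strip()]
        return ". ".join(parts[:n]) + "."
    return strategy

section = "Background text. The entity must register. Other remarks."
strategies = [("extractive", first_sentences(1)),
              ("rag_augmented", first_sentences(2))]
# Toy validator standing in for the legal and linguistic checks.
validators = [lambda _section, summary: "must" in summary]

result = summarize_with_escalation(section, strategies, validators)
print(result["status"], result["history"])
```

Keeping the attempt history alongside the final status gives the "store unresolved cases" branch something concrete to log.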

In [None]:
!pip install rank-bm25



In [None]:
import os
import re
import json
import nltk
import spacy
import torch
from nltk.tokenize import sent_tokenize
from rank_bm25 import BM25Okapi
import hashlib

import pandas as pd
import string
from collections import Counter
from nltk.corpus import stopwords

# Download NLTK stopwords
nltk.download("stopwords")

# Load English stopwords
stop_words = set(stopwords.words("english"))

# Load SpaCy NLP model for lemmatization
nlp = spacy.load("en_core_web_sm")

# Ensure NLTK resources are available
nltk.download("punkt")
nltk.download('punkt_tab')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    print(f"✅ GPU is available! Using device: {torch.cuda.get_device_name(0)}")
else:
    print("❌ No GPU detected. Using CPU.")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


✅ GPU is available! Using device: NVIDIA A100-SXM4-40GB


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
class Preprocessor:
    """Cleans text: removes noise, lowercases, and normalizes spaces."""
    def clean_text(self, text):
        if not isinstance(text, str):
            return ""

        text = text.lower()  # Convert to lowercase
        text = re.sub(r"\[\[page.*?\]\]", " ", text)  # Remove page markers
        text = re.sub(r"-{5,}", " ", text)  # Remove dashed lines
        text = re.sub(r"\.{5,}", ".", text)  # Replace long sequences of dots with a single dot
        text = text.replace("\n", " ")  # Replace newlines with space
        text = " ".join(text.split())  # Collapse multiple spaces

        return text

# Instantiate a global preprocessor
preprocessor = Preprocessor()

In [None]:
class KeywordExtractor:
    """
    Extracts the top 50 most frequent keywords from gold summaries in a JSON file.
    It applies preprocessing, removes numbers, punctuation, stopwords, and applies lemmatization.
    """

    def __init__(self, input_json):
        self.input_json = input_json
        self.keywords = self.extract_keywords()

    def load_json(self):
        """Loads the JSON file containing gold summaries."""
        with open(self.input_json, "r", encoding="utf-8") as f:
            return json.load(f)

    def extract_gold_summaries(self, data):
        """Extracts gold summaries from JSON."""
        summaries = []
        for doc in data:
            if "gold_summary" in doc:
                summaries.append(doc["gold_summary"])
        return summaries

    def clean_keywords(self, keywords):
        """Cleans keywords by removing numbers, punctuation, stopwords, and applying lemmatization."""
        # Remove punctuation
        keywords = ["".join(char for char in word if char not in string.punctuation) for word in keywords]

        # Remove tokens that are pure numbers or digits followed by letters (e.g. "2022", "26cfr")
        keywords = [word for word in keywords if not re.match(r"^\d+[a-zA-Z]*$", word)]

        # Remove stopwords and short words (less than 3 characters)
        keywords = [word for word in keywords if word not in stop_words and len(word) > 2]

        # Apply lemmatization
        lemmatized_keywords = [token.lemma_ for token in nlp(" ".join(keywords))]

        return lemmatized_keywords

    def extract_keywords(self):
        """Reads the JSON file, extracts and cleans the most frequent 50 keywords from gold summaries."""
        data = self.load_json()
        gold_summaries = self.extract_gold_summaries(data)

        # Tokenize words from gold summaries
        words = [word.lower() for summary in gold_summaries for word in summary.split()]

        # Clean the keywords
        cleaned_keywords = self.clean_keywords(words)

        # Count occurrences and get the top 50 most frequent keywords
        keyword_counts = Counter(cleaned_keywords)
        most_frequent_keywords = [word for word, _ in keyword_counts.most_common(50)]

        return most_frequent_keywords


In [None]:
class BM25KeywordSummarizer:
    """
    Uses BM25 to rank sentences in a document and prioritize sentences containing important keywords.
    """

    def __init__(self, top_n=5, keyword_extractor=None):
        self.top_n = top_n
        self.keywords = keyword_extractor.keywords if keyword_extractor else []
        self.lemmatized_keywords = self.lemmatize_keywords(self.keywords)
        self.token_cache = {}  # Dictionary to store tokenized sentences

    def lemmatize_keywords(self, keywords):
        """Lemmatizes the keywords for better matching."""
        return [token.lemma_ for token in nlp(" ".join(keywords))]

    def tokenize_and_cache(self, document):
        """Tokenizes and caches sentences for efficiency."""
        doc_hash = hashlib.md5(document.encode()).hexdigest()  # Unique hash for each document

        if doc_hash in self.token_cache:
            return self.token_cache[doc_hash]  # Return cached tokenized sentences

        sentences = sent_tokenize(document)  # Tokenize into sentences
        tokenized_sentences = [sentence.split() for sentence in sentences]

        self.token_cache[doc_hash] = (sentences, tokenized_sentences)  # Cache results
        return sentences, tokenized_sentences

    def bm25_ranked_sentences(self, document):
        """Ranks sentences in the document using BM25 and keyword matching."""
        document = preprocessor.clean_text(document)  # Apply text preprocessing

        sentences, tokenized_sentences = self.tokenize_and_cache(document)  # Use cached tokenization

        if not sentences:
            return []

        bm25 = BM25Okapi(tokenized_sentences)  # Initialize BM25 model

        # Use keywords as query instead of full document
        query = self.lemmatized_keywords
        scores = bm25.get_scores(query)  # Get BM25 scores

        # Rank sentences by BM25 score
        ranked_sentences = sorted(zip(sentences, scores), key=lambda x: x[1], reverse=True)
        return ranked_sentences  # Return ranked sentences with scores

    def prioritize_keywords(self, ranked_sentences):
        """Prioritizes sentences containing more keywords."""
        keyword_weighted_sentences = []

        for sentence, score in ranked_sentences:
            # Count occurrences of keywords in the sentence
            keyword_count = sum(1 for keyword in self.lemmatized_keywords if f" {keyword} " in f" {sentence.lower()} ")

            # Apply a weight factor for sentences with more keyword matches
            adjusted_score = score + (keyword_count * 0.5)

            keyword_weighted_sentences.append((sentence, adjusted_score))

        # Sort again with adjusted scores
        return sorted(keyword_weighted_sentences, key=lambda x: x[1], reverse=True)

    def summarize(self, section_text):
        """Summarizes a section using BM25 and keyword prioritization."""
        if not isinstance(section_text, str) or not section_text.strip():
            return ""

        # Get BM25 top-ranked sentences
        ranked_sentences = self.bm25_ranked_sentences(section_text)

        # Apply keyword prioritization
        prioritized_sentences = self.prioritize_keywords(ranked_sentences)

        # Select top sentences
        selected_sentences = [sent for sent, _ in prioritized_sentences[:self.top_n]]

        return " ".join(selected_sentences)
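
For readers unfamiliar with what the BM25 scoring step computes, the following pure-Python sketch scores tokenized sentences against a keyword query using the textbook Okapi BM25 formula with a smoothed IDF. It is an illustration only: `normalize` and `bm25_scores` are hypothetical names, and `rank_bm25`'s `BM25Okapi` may differ in its IDF details, so the numbers are not expected to match the library exactly.

```python
import math
import string
from collections import Counter

def normalize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w]

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each tokenized sentence in `corpus` against the `query` terms."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of sentences in which each term occurs at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

sentences = ["The credit is refundable.",
             "An entity must claim the credit.",
             "See the appendix."]
corpus = [normalize(s) for s in sentences]
scores = bm25_scores(["credit", "entity"], corpus)
print(scores)
```

Note that the class above tokenizes with a plain `sentence.split()`, which leaves punctuation attached: the last token of the second sentence would be `credit.` and would not match the keyword `credit`. Stripping punctuation, as `normalize` does here, is one possible way to avoid such misses.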

In [None]:
def generate_summaries(input_file, output_file, summarizer):
    """
    Reads a JSON file, summarizes sections using BM25 and keyword filtering, and saves results.
    """
    with open(input_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    for doc in data:
        for section in doc.get("Sections", []):
            section_text = section.get("text", "")
            section["summary"] = summarizer.summarize(section_text)
            print(f"🔹 Summary: {section['summary']}")

    # Save output JSON with summaries
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

    print(f"\n✅ Summaries saved to: {output_file}")
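
Judging from the `doc.get("Sections", [])` and `section.get("text", "")` calls above, the input appears to be a JSON list of documents, each with a `"Sections"` list of objects that carry a `"text"` field. The sketch below writes a tiny file in that assumed shape and runs the same loop with a stand-in summarizer. `FirstSentenceSummarizer`, `add_summaries`, and the sample data are hypothetical and exist only for this example.

```python
import json
import os
import tempfile

class FirstSentenceSummarizer:
    """Stand-in exposing the same summarize(text) method as the real summarizer."""
    def summarize(self, text):
        if not text.strip():
            return ""
        return text.split(". ")[0].rstrip(".") + "."

def add_summaries(data, summarizer):
    """Attach a "summary" field to every section, mirroring the loop above."""
    for doc in data:
        for section in doc.get("Sections", []):
            section["summary"] = summarizer.summarize(section.get("text", ""))
    return data

sample = [{"Sections": [{"text": "Rule one applies. Rule two follows."},
                        {"text": ""}]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    with open(path, "r", encoding="utf-8") as f:
        result = add_summaries(json.load(f), FirstSentenceSummarizer())

print(result[0]["Sections"][0]["summary"])
```

Running the pipeline on a small file like this first is a cheap way to confirm the schema before pointing it at the full dataset on Drive.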


In [None]:
if __name__ == "__main__":
    # Path for JSON with gold summaries
    INPUT_JSON = "/content/drive/MyDrive/Colab Notebooks/ExtractedSummaries/validation_RegSum_Data.json"

    # Input and output JSON files for summarization
    INPUT_FILE = "/content/drive/MyDrive/Colab Notebooks/ExtractedSummaries/segmented_test_RegSum_Data.json"
    OUTPUT_FILE = "/content/drive/MyDrive/Colab Notebooks/ExtractedSummaries/Agent2_First_Iteration_Results.json"

    # Extract keywords and initialize summarizer
    keyword_extractor = KeywordExtractor(INPUT_JSON)
    summarizer = BM25KeywordSummarizer(top_n=5, keyword_extractor=keyword_extractor)

    # Generate summaries
    generate_summaries(INPUT_FILE, OUTPUT_FILE, summarizer)

🔹 Summary: background this document contains final regulations that amend the income tax regulations (26 cfr part 1) and the procedure and administration regulations (26 cfr part 301) to implement the statutory provisions of section 6417 of the internal revenue code (code), as enacted by section 13801(a) of public law 117-169, 136 stat. 1818, 2003 (august 16, 2022), commonly known as the inflation reduction act of 2022 (ira).
🔹 Summary: if an applicable entity makes an elective payment election, the applicable entity is treated as making a payment against federal income taxes imposed by subtitle a of the code (subtitle a) for the taxable year with respect to which such credit was determined that is equal to the amount of such credit (elective payment amount). see general explanation of tax legislation enacted in the 117th congress, jcs-1-23 (december 21, 2023) at 282. thus, the final regulations refer to section 45w(d)(2). (7) the credit for advanced manufacturing production under sect