## Data Flow Between Agents

The system's data flow is coordinated by the Task Planner Agent:

1. Initial Flow: Document → Task Planner → Pre-processor
2. Information Extraction: Pre-processor → Context Bank & Task Planner
3. Knowledge Gathering: Task Planner → Knowledge Agent → Context Bank & Task Planner
4. Compliance Analysis: Task Planner → Compliance Checker (accessing Context Bank)
   - If knowledge is insufficient → Knowledge Agent (with the missing fields)
   - If knowledge is sufficient → Check compliance for each clause
5. Conditional Processing:
   - If contradictions: Compliance Checker → Clause Rewriter → Compliance Checker
   - If compliant: Compliance Checker → Task Planner
6. Summarizing Changes: Task Planner → Post-processor
7. Task Completion: Post-processor → Final Output → User
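
The numbered flow above can be read as a small directed graph of allowed handoffs. The sketch below is only an illustration of that reading: the `HANDOFFS` table and `is_valid_route` helper are hypothetical names that do not appear in the notebook, and the Context Bank is left out because it is shared storage rather than an agent.

```python
# Hypothetical handoff map mirroring the numbered flow (not part of the notebook code).
HANDOFFS = {
    "Task Planner": {"Pre-processor", "Knowledge Agent", "Compliance Checker", "Post-processor"},
    "Pre-processor": {"Task Planner"},
    "Knowledge Agent": {"Task Planner"},
    "Compliance Checker": {"Knowledge Agent", "Clause Rewriter", "Task Planner"},
    "Clause Rewriter": {"Compliance Checker"},
    "Post-processor": set(),  # terminal step: produces the final output for the user
}


def is_valid_route(route):
    """Return True if every consecutive pair of agents in `route` is an allowed handoff."""
    return all(dst in HANDOFFS.get(src, set()) for src, dst in zip(route, route[1:]))


# The path taken when every clause is compliant on the first check.
happy_path = [
    "Task Planner", "Pre-processor",
    "Task Planner", "Knowledge Agent",
    "Task Planner", "Compliance Checker",
    "Task Planner", "Post-processor",
]
print(is_valid_route(happy_path))
print(is_valid_route(["Pre-processor", "Post-processor"]))
```

A table like this makes it easy to spot routes the description does not allow, such as the Pre-processor handing off directly to the Post-processor.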


In [1]:
from langchain_ollama import ChatOllama
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.prebuilt import create_react_agent
from langgraph_swarm import create_handoff_tool, create_swarm
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv
import os
from agents.utils.ollama_client import OllamaClient
from agents.utils.api_client import APIClient
from typing import Any, Dict, Optional



In [2]:
# model = ChatOllama(model="llama3.1:latest")

load_dotenv()

google_api_key = os.getenv("GOOGLE_API_KEY")

tavily_api_key = os.getenv("TAVILY_API_KEY")
if not tavily_api_key:
    raise ValueError("TAVILY_API_KEY not found in environment variables. Please set it in your .env file.")

print("[SETUP] Tavily API Key loaded successfully.")

model = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.1,
    google_api_key=google_api_key
)
print("[SETUP] Initialized ChatGoogleGenerativeAI model with gemini-2.0-flash")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=google_api_key
)
print("[SETUP] Initialized GoogleGenerativeAIEmbeddings model")

[SETUP] Tavily API Key loaded successfully.
[SETUP] Initialized ChatGoogleGenerativeAI model with gemini-2.0-flash
[SETUP] Initialized GoogleGenerativeAIEmbeddings model
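
The setup cell validates `TAVILY_API_KEY` but passes `GOOGLE_API_KEY` through without a check. One way to treat both keys the same is a small helper; `require_env` below is a hypothetical name introduced for illustration, and the error text simply reuses the wording from the cell above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise ValueError(
            f"{name} not found in environment variables. Please set it in your .env file."
        )
    return value


# Possible usage after load_dotenv() (commented out so the sketch runs without real keys):
# google_api_key = require_env("GOOGLE_API_KEY")
# tavily_api_key = require_env("TAVILY_API_KEY")
```

Failing at setup time gives a clearer error than a later authentication failure inside a model or search call.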


In [3]:
# Initialize the LLM client to be used by tools

def _initialize_llm_client(use_ollama: bool, model_name: str) -> Any:
    """Initialize and return the appropriate LLM client based on settings."""
    try:
        if use_ollama:
            print(f"[SETUP] Initializing Ollama client with model {model_name}")
            return OllamaClient(model_name)
        else:
            print(f"[SETUP] Initializing API client with model {model_name}")
            return APIClient(model_name)
    except Exception as e:
        print(f"[ERROR] Failed to initialize LLM client: {str(e)}")
        return None

## Context Bank Initialization

Initialize a shared context bank instance that will be used by all agents to store and retrieve document information.


In [4]:
from context_bank import ContextBank

# Initialize a single shared context bank instance
context_bank = ContextBank()

# This context_bank will be passed to all agents that need to store or retrieve information
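
The `context_bank` module itself is not shown in this notebook. Judging only from the calls made in later cells (`add_document`, `add_entities`, `add_clauses`, `get_clauses`, `get_document`, `get_jurisdiction`), a minimal in-memory stand-in might look like the sketch below. `InMemoryContextBank` is a hypothetical class, and the real `ContextBank` may store and return data differently.

```python
from typing import Any, Dict, List, Optional, Tuple


class InMemoryContextBank:
    """Hypothetical stand-in inferred from how the notebook calls ContextBank."""

    def __init__(self) -> None:
        self._text: str = ""
        self._metadata: Dict[str, Any] = {}
        self._entities: List[Tuple[str, str]] = []
        self._clauses: List[Dict[str, str]] = []

    def add_document(self, text: str, metadata: Dict[str, Any]) -> None:
        self._text = text
        self._metadata = dict(metadata)

    def add_entities(self, entities: List[Tuple[str, str]]) -> None:
        self._entities.extend(entities)

    def add_clauses(self, clauses: List[Dict[str, str]]) -> None:
        self._clauses.extend(clauses)

    def get_document(self) -> Dict[str, Any]:
        # Assumption: returns the metadata dict, since callers use .get("document_type").
        return dict(self._metadata)

    def get_clauses(self) -> List[Dict[str, str]]:
        return list(self._clauses)

    def get_jurisdiction(self) -> Optional[str]:
        # Assumption: jurisdiction lives in the metadata; None when it was never set.
        return self._metadata.get("jurisdiction")


bank = InMemoryContextBank()
bank.add_document("Full contract text", {"title": "Demo Agreement", "document_type": "Employment Contract"})
bank.add_clauses([{"Text": "Section 9.5: This agreement shall be governed by...", "Category": "Governing Law"}])
print(bank.get_document()["document_type"], len(bank.get_clauses()), bank.get_jurisdiction())
```

A stand-in of this kind is also useful for testing individual tools without running the whole agent swarm.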

In [5]:
planner_agent_tools = [
    create_handoff_tool(agent_name="Pre Processor Agent", description="Transfer when pre-processing is needed, it helps to format and clean the input data."),
    create_handoff_tool(agent_name="Knowledge Agent", description="Transfer when knowledge is needed, it helps to retrieve knowledge from the web using websearcher."),
    create_handoff_tool(agent_name="Compliance Checker Agent", description="Transfer when compliance checking is needed, it helps to check legal documents for compliance with regulations."),
    create_handoff_tool(agent_name="Post Processor Agent", description="Transfer when post processing is needed, it helps to format and finalize the output."),
]

planner_agent_node = create_react_agent(
    model,
    planner_agent_tools,
    prompt="""
        You are a Task Planner Agent responsible for coordinating a multi-agent system to analyze legal documents for discrepancies and compliance. Your job is to plan and delegate tasks to specialized agents using the relevant handoff tools and track task completion.

        INPUTS:
        Problem to solve: use the user prompt
        Analyze a legal document for discrepancies and compliance issues.

        PLANNED TASKS:

        * Preprocess document
        * Extract and classify clauses
        * Retrieve relevant legal compliance knowledge from the web
        * Check clause compliance and legal discrepancies
        * Summarize issues and finalize output

        AVAILABLE AGENTS:
        * Pre Processor Agent: Responsible for pre-processing the document, extracting clauses, and classifying them. Processes the input legal document, adds all the relevant information to the context bank and returns status.
        * Knowledge Agent: Responsible for retrieving relevant legal compliance knowledge from the web. Fetches information from the web, adds all the relevant knowledge to a vector DB and returns status.
        * Compliance Checker Agent: Responsible for checking the compliance of clauses with legal regulations. Performs the compliance check, returns a list of non-compliant clauses and their details. Also returns status.
        * Post Processor Agent: Responsible for summarizing issues and finalizing the output. Formats the final output and returns a summary of the compliance check results. Also returns status.

        ACTION: 
        [IMPORTANT] Status Check - Check status of Preprocessor, Compliance Checker, and Post-Processor agents

        If Preprocessor status is not complete, trigger Preprocessor Agent.
        Once preprocessing is complete, trigger Knowledge Agent.
        After Knowledge Agent retrieves relevant knowledge, trigger Compliance Checker Agent.
        After the clause compliance check is complete, trigger Post Processor Agent for the final output and summary.

        Rationale:
        [IMPORTANT] Do not stop the workflow at any intermediate step. Always proceed till the final step, i.e., using the Post Processor Agent to generate the final summary for the user.
        [EXTREMELY CRITICAL] Each agent's task depends on the previous one; ensure no step is skipped in the workflow. The status check avoids redundant computation and ensures the workflow completes.
    """,
    name="Planner Agent"
)
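
The planner prompt leaves the sequencing decision to the LLM. For comparison, the same rule (run the first step that is not yet complete) can be written deterministically. This is a sketch only: `PIPELINE`, `next_agent`, and the shape of the `status` dictionary are hypothetical and not part of the notebook.

```python
from typing import Dict, Optional

# Order taken from the ACTION section of the planner prompt.
PIPELINE = [
    "Pre Processor Agent",
    "Knowledge Agent",
    "Compliance Checker Agent",
    "Post Processor Agent",
]


def next_agent(status: Dict[str, bool]) -> Optional[str]:
    """Return the first agent whose step is not complete, or None when all are done."""
    for agent in PIPELINE:
        if not status.get(agent, False):
            return agent
    return None


print(next_agent({}))
print(next_agent({"Pre Processor Agent": True, "Knowledge Agent": True}))
```

A helper like this could serve as a cross-check on the handoff the LLM planner chooses at each step.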

In [6]:
import PyPDF2
import re
import spacy
import uuid
import json  # Used to parse the LLM's JSON output

# Download the spaCy model if it's not already installed
!python -m spacy download en_core_web_sm

# Load once to avoid redundant loading
spacy_ner = spacy.load("en_core_web_sm")
print("[SETUP] Loaded spaCy NER model en_core_web_sm")

llm_client = _initialize_llm_client(use_ollama=False, model_name="gemini-2.0-flash")

# TODO : Update the context bank with the clauses, jurisdiction and the document type/metadata

def preprocess_document_tool_implementation(file_path: str, system_prompt: str) -> dict:
    """
    Consolidated preprocessing tool to be used as a callable function in a multi-agent system.
    Extracts text, title, named entities, and clause classifications from a PDF document.
    """
    print("\n" + "="*80)
    print(f"[PREPROCESS] Starting document preprocessing for: {file_path}")
    # print(f"[PREPROCESS] Context Bank state at start: {json.dumps(context_bank.get_all(), indent=2)}")
    print("="*80 + "\n")

    # Step 1: Extract text from PDF
    print("[PREPROCESS] Step 1: Extracting text from PDF...")
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        text = "".join(page.extract_text() for page in reader.pages)
    print(f"[PREPROCESS] Extracted {len(text)} characters of text")

    # Step 2: Title Extraction (from first 10 lines)
    # print("[PREPROCESS] Step 2: Extracting title...")
    # def extract_title(text: str) -> str:
    #     lines = text.split("\n")
    #     candidates = []
    #     for i, line in enumerate(lines[:10]):
    #         clean_line = line.strip()
    #         if not clean_line or len(clean_line) < 5:
    #             continue
    #         score = 0
    #         if re.match(r"^(CONTRACT|AGREEMENT|PETITION|NOTICE|ORDER|BILL|ACT|STATUTE)\b", clean_line, re.IGNORECASE):
    #             score += 5
    #         if re.match(r"^[A-Z\s\-]{5,}$", clean_line):
    #             score += 2
    #         if "**" in clean_line or clean_line.center(80) == clean_line:
    #             score += 1
    #         candidates.append((clean_line, score))
    #     candidates.sort(key=lambda x: x[1], reverse=True)
    #     return candidates[0][0] if candidates else "Unknown Title"

    print("[PREPROCESS] Step 2: Extracting title...")
    title_query = "what is the title of this document?" + text[:1000]
    title = llm_client.query(title_query)
    print(f"[PREPROCESS] Extracted title: {title}")

    # Step 3: Named Entity Recognition
    print("[PREPROCESS] Step 3: Performing Named Entity Recognition...")
    doc = spacy_ner(text)
    entities = []
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))
    print(f"[PREPROCESS] Extracted {len(entities)} named entities")
    print(f"[PREPROCESS] Sample entities (first 5): {entities[:5]}")

    # Step 4: Document + Clause Classification via external LLM system
    print("[PREPROCESS] Step 4: Classifying document and clauses...")
    llm_output = llm_client.query(text, system_prompt)

    if isinstance(llm_output, str):
        # Remove an enclosing Markdown code fence (``` or ```json) plus surrounding whitespace.
        # A regex is used because str.strip() removes a character set, not a prefix.
        llm_output = re.sub(r"^```(?:json)?\s*|\s*```$", "", llm_output.strip(), flags=re.IGNORECASE)
        # remove literal "\\n" sequences that came escaped
        llm_output = llm_output.replace("\\n", "")
        try:
            llm_output = json.loads(llm_output)
            print("[PREPROCESS] Successfully parsed LLM output as JSON")
        except json.JSONDecodeError as e:
            print(f"[ERROR] Failed to parse LLM output as JSON: {e}")
            print(f"[ERROR] Raw output: {llm_output[:500]}...")
            raise RuntimeError("Failed to parse LLM output as JSON:\n" + llm_output)

    document_class = llm_output.get("CLASS", "")
    clause_classes = llm_output.get("CLAUSES", [])
    
    print(f"[PREPROCESS] Document class: {document_class}")
    print(f"[PREPROCESS] Extracted {len(clause_classes)} clauses")

    # Step 5: Storing in Context Bank
    print("[PREPROCESS] Step 5: Storing in Context Bank...")
    context_bank.add_document(text, {
        "title": title,
        "document_type": document_class,
        "source_file": file_path
    })
    context_bank.add_entities(entities)
    context_bank.add_clauses(clause_classes)
    print("[PREPROCESS] Data successfully stored in Context Bank")
    
    # print("\n" + "="*80)
    # print(f"[PREPROCESS] Context Bank state after processing: {json.dumps(context_bank.get_clauses(), indent=2)}")
    # print("="*80 + "\n")

    # Final structured output
    return {
        "Document Title": title,
        "Document Class": document_class,
    }

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 3.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
[SETUP] Loaded spaCy NER model en_core_web_sm
[SETUP] Initializing API client with model gemini-2.0-flash


In [7]:
@tool
def preprocess_document_tool(file_path: str) -> dict:
    """
    Preprocess a legal document PDF: extract its text, title, named entities,
    and clause classifications, and store them in the Context Bank.
    
    Args:
        file_path (str): Path to the legal document PDF file.
        
    Returns:
        dict: A dictionary with the "Document Title" and "Document Class".
    """

    SYSTEM_PROMPT = """
        You are a Pre-processor Agent, a specialized component in the Legal Document Analysis Framework responsible for extracting critical information from legal documents and storing it in the Context Bank. Your work forms the foundation for all subsequent analysis by other agents in the system.

        Core Responsibilities:
        Your sole task is to extract and structure information from legal documents, including:
        - Classifying the document type and purpose
        - Extracting important clauses with their classifications
        - Storing all extracted information in a structured format accessible to other agents
        - Provide your output as strict JSON
        Input:
        Legal Contract Document PDF.

        Output Format:
        {
        "CLASS": "Document type classification (e.g., Legal Agreement - Employment Contract)",
        "CLAUSES": [
            {"Text": "Section 3.1: The term of this agreement shall be...", "Category": "Term Clause"},
            {"Text": "Section 7.2: All disputes shall be resolved by...", "Category": "Dispute Resolution"},
            {"Text": "Section 9.5: This agreement shall be governed by...", "Category": "Governing Law"}
        ]
        }

        [EXTREMELY CRITICAL] Ensure that the output is strictly in JSON format. Do not include any additional text or explanations. The output must be parsable as JSON.
    """
        
    # Call the implementation, which stores its results in the shared context bank
    return preprocess_document_tool_implementation(file_path, SYSTEM_PROMPT)
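
The system prompt asks for strict JSON with a `CLASS` string and a `CLAUSES` list of `Text`/`Category` objects, but the implementation only checks that the output parses. A shape check along the following lines could catch malformed responses before they reach the Context Bank. `validate_preprocessor_output` is a hypothetical helper, and the rules are inferred only from the Output Format shown in the prompt.

```python
from typing import Any, List


def validate_preprocessor_output(payload: Any) -> List[str]:
    """Return a list of problems; an empty list means the payload has the expected shape."""
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]

    problems: List[str] = []
    if not isinstance(payload.get("CLASS"), str) or not payload["CLASS"].strip():
        problems.append('"CLASS" must be a non-empty string')

    clauses = payload.get("CLAUSES")
    if not isinstance(clauses, list):
        problems.append('"CLAUSES" must be a list')
        return problems

    for i, clause in enumerate(clauses):
        if not isinstance(clause, dict):
            problems.append(f"CLAUSES[{i}] is not an object")
            continue
        for key in ("Text", "Category"):
            if not isinstance(clause.get(key), str):
                problems.append(f'CLAUSES[{i}] is missing string field "{key}"')
    return problems


good = {
    "CLASS": "Legal Agreement - Employment Contract",
    "CLAUSES": [{"Text": "Section 3.1: The term of this agreement shall be...", "Category": "Term Clause"}],
}
bad = {"CLASS": "", "CLAUSES": [{"Text": "Section 7.2: All disputes shall be resolved by..."}]}
print(validate_preprocessor_output(good))
print(validate_preprocessor_output(bad))
```

Returning a list of problems rather than raising lets the caller decide whether to retry the LLM call or stop.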

In [8]:
pre_processor_agent_tools = [
    create_handoff_tool(agent_name="Planner Agent", description="Transfer when pre-processing is completed, it helps to plan the next steps in the workflow and delegate tasks."),
    preprocess_document_tool
]

pre_processor_agent_node = create_react_agent(
    model,
    pre_processor_agent_tools,
    prompt="""
         You are the Pre-processor Agent in a Legal Document Analysis Framework. Your sole function is to extract critical information from legal document PDFs and structure it for other agents.

         Core Task:
         Using the provided tool, preprocess_document_tool, process an input legal document PDF to:
         1.   Extract Full Text:  Get the complete text content.
         2.   Classify Document:  Determine the document type (e.g., NDA, Lease, Employment Agreement).
         3.   Identify Named Entities (NER):  Extract key entities (Parties, Laws, Dates, Jurisdictions, Monetary Values, etc.).
         4.   Extract Key Clauses:  Isolate and classify significant clauses (e.g., Term, Governing Law, Confidentiality).
         5.   Store Data:  Structure all extracted information (Text, Class, NER list, Clauses list) as a JSON object in the Context Bank.

         Input:  Legal Contract Document PDF.
         Output:  JSON object with "TEXT", "CLASS", "NER", and "CLAUSES" as keys.

         Guidelines: 
         * Be accurate and comprehensive.
         * Preserve original context, especially for clauses.
         * Focus on legally significant information and obligations.
         * Note any low-confidence classifications.
         * [CRITICAL STEP] Return to the Planner Agent to update that the pre-processing is completed.
      """,
    name="Pre Processor Agent"
)

In [9]:
import requests
import uuid
import traceback
from typing import List, Dict
from bs4 import BeautifulSoup
from duckduckgo_search import DDGS
from cleantext import clean
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, PointStruct, Distance
# from langchain_ollama import OllamaEmbeddings
from langchain_core.tools import tool
from tavily import TavilyClient

# Initialize global components once
_qdrant_url = "http://localhost:6333"
_qdrant_collection = "web_content"
_qdrant_client = QdrantClient(
    url=_qdrant_url, 
    prefer_grpc=False
)

_num_results = 3
_ddgs = DDGS()

# Use the GoogleGenerativeAIEmbeddings model instead of OllamaEmbeddings
_embeddings_model = embeddings

# Optional: create collection if it doesn't exist
# Get the embedding dimension from the model by embedding a test string
test_embedding = _embeddings_model.embed_query("test")
vector_size = len(test_embedding)
print(f"[SETUP] Generated test embedding with dimension: {vector_size}")

try:
    print(f"[SETUP] Creating Qdrant collection with vector size: {vector_size}")
    if not _qdrant_client.collection_exists(_qdrant_collection):
        _qdrant_client.create_collection(
            collection_name=_qdrant_collection,
            vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
        )
    else:
        print(f"[SETUP] Collection '{_qdrant_collection}' already exists; skipping creation.")

    print(f"[SETUP] Successfully created Qdrant collection '{_qdrant_collection}'")

except Exception as e:
    print(f"[ERROR] Failed to create Qdrant collection. Encountered an exception.")
    print(f"[ERROR] {type(e).__name__}: {e}")
    traceback.print_exc()
        



@tool
def retrieve_web_knowledge_tool(query: str) -> List[Dict]:
    """
    Searches the web for a legal/policy topic,
    scrapes and cleans page content, generates embeddings with Google Gen AI,
    stores them in Qdrant, and retrieves top relevant results.
    Returns a list of summarized search results.
    """

    # Step 1: Tavily search
    print(f"[KNOWLEDGE] Searching Tavily for query: {query}")
    # results = list(_ddgs.text(query, max_results=10))

    tavily_client = TavilyClient(tavily_api_key)
    results = tavily_client.search(query, max_results=10)

    # print(f"[INFO] Tavily results: {results}")

    url_results = results.get("results", [])
    
    print(f"[KNOWLEDGE] Found {len(url_results)} results from Tavily search")


    # Step 2: Scrape + clean + embed + store
    points_to_upsert = []
    successful_urls = 0
    for i, result in enumerate(url_results):
        try:
            url = result["url"]
            title = result["title"]
            print(f"[KNOWLEDGE] Processing result {i+1}/{len(url_results)}: {title}")

            response = requests.get(url, timeout=10)
            response.raise_for_status() # Raise an exception for bad status codes
            soup = BeautifulSoup(response.content, "html.parser")
            paragraphs = soup.find_all("p")
            content = " ".join(p.get_text() for p in paragraphs)
            content = " ".join(content.split()) # Remove extra whitespace
            cleaned_content = clean(
                content,
                fix_unicode=True,
                to_ascii=True,
                lower=True,
                no_line_breaks=True,
                lang="en"
            )

            if not cleaned_content: # Skip if content is empty after cleaning
                print(f"[KNOWLEDGE] No content found or extracted for URL {url}")
                continue

            # Generate embeddings using Google Gen AI embeddings model
            generated_embeddings = _embeddings_model.embed_query(text=cleaned_content)
            point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, url))

            points_to_upsert.append(
                PointStruct( # Use PointStruct for clarity
                    id=point_id,
                    vector=generated_embeddings,
                    payload={
                        "title": title,
                        "content": cleaned_content,
                        "url": url
                    }
                )
            )
            successful_urls += 1
            print(f"[KNOWLEDGE] Successfully processed URL: {url}")
        except requests.exceptions.RequestException as e:
            print(f"[ERROR] Failed to fetch URL {url}: {e}")
        except Exception as e:
            print(f"[ERROR] Failed to process URL {url}: {e}") # Catch other potential errors

    # Upsert points in batch if any were successfully processed
    if points_to_upsert:
        print(f"[KNOWLEDGE] Upserting {len(points_to_upsert)} documents to Qdrant collection")
        try:
            _qdrant_client.upsert(
                collection_name=_qdrant_collection,
                points=points_to_upsert,
                wait=True # Optional: wait for operation to complete
            )
            print(f"[KNOWLEDGE] Successfully upserted {len(points_to_upsert)} documents to Qdrant")
        except Exception as e:
            print(f"[ERROR] Failed to upsert to Qdrant: {e}")
    else:
        print(f"[KNOWLEDGE] No documents to upsert to Qdrant")
        
    print(f"[KNOWLEDGE] Retrieved and processed {successful_urls} out of {len(url_results)} URLs")
    return [{"status": "knowledge search completed", "total_urls": len(url_results)}]
    

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


[SETUP] Generated test embedding with dimension: 768
[SETUP] Creating Qdrant collection with vector size: 768
[SETUP] Collection 'web_content' already exists; skipping creation.
[SETUP] Successfully created Qdrant collection 'web_content'
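
The knowledge tool derives each Qdrant point ID with `uuid.uuid5(uuid.NAMESPACE_URL, url)`. Because UUIDv5 is deterministic, re-running a search that returns the same URL overwrites the existing point instead of adding a duplicate. The short demonstration below uses placeholder `example.com` URLs, and `point_id_for` is a hypothetical wrapper around the same call.

```python
import uuid


def point_id_for(url: str) -> str:
    """Derive a stable point ID from a URL, as the knowledge tool does."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


first = point_id_for("https://example.com/regulations/title-29")
again = point_id_for("https://example.com/regulations/title-29")
with_slash = point_id_for("https://example.com/regulations/title-29/")

print(first == again)       # same URL, same ID
print(first == with_slash)  # a trailing slash is a different string, so a different ID
```

The IDs are only as stable as the URL strings, so normalizing URLs before hashing (for example, stripping trailing slashes) may be worth considering if duplicates appear.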


In [10]:

knowledge_agent_tools = [
    create_handoff_tool(agent_name="Planner Agent", description="Transfer to Planner Agent when knowledge has been retrieved and pass a summary back to it."),
    retrieve_web_knowledge_tool
]

knowledge_agent_node = create_react_agent(
    model,
    knowledge_agent_tools,
    prompt="""
        You are a Knowledge Retrieval Agent tasked with extracting accurate, up-to-date legal information from official U.S. government sources. You use the retrieve_web_knowledge_tool to search for legal/policy topics, scrape and clean page content, generate embeddings, store them in Qdrant, and retrieve top relevant results. 
        Your work is crucial for ensuring that the Compliance Checker Agent has access to the most current and relevant legal information.

        Use the retrieve_web_knowledge_tool to fetch up-to-date legal data from trusted government sites and perform the following functions:
        * Retrieve relevant statutes, regulations, and policies based on a given topic.
        * Ensure content is current, authoritative, and clearly summarized.
        * Avoid non-official, outdated, or speculative sources.
        * Store the retrieved knowledge in a vector database for later use by the Compliance Checker Agent.

        Search Query Format:
        site:[gov source] "[TOPIC]" AND "[FOCUS]" AND ("[KEYWORD1]" OR "[KEYWORD2]") after:[YEAR]

        Sources: congress.gov, govinfo.gov, law.cornell.edu, federalregister.gov, ecfr.gov, justice.gov, whitehouse.gov

        Output Format: Return a simple sentence with the status of the knowledge retrieval.

        Guidelines:
        * Use only listed government sources.
        * Do not fabricate or paraphrase inaccurately.
        * If no reliable info is found, say so.
        * [CRITICAL STEP] Return to the Planner Agent to update that the knowledge retrieval is completed.
    """,
    name="Knowledge Agent",
)
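
The Search Query Format in the prompt above is a fixed template, so it can also be assembled in code rather than left to the model. The sketch below follows that template literally. `build_gov_query` and the sample arguments are hypothetical, and whether the search backend honors operators such as `site:` and `after:` is an assumption the notebook does not confirm.

```python
from typing import Sequence


def build_gov_query(source: str, topic: str, focus: str, keywords: Sequence[str], year: int) -> str:
    """Fill in the template: site:[gov source] "[TOPIC]" AND "[FOCUS]" AND (...) after:[YEAR]."""
    keyword_clause = " OR ".join(f'"{keyword}"' for keyword in keywords)
    return f'site:{source} "{topic}" AND "{focus}" AND ({keyword_clause}) after:{year}'


query = build_gov_query(
    source="ecfr.gov",
    topic="overtime pay",
    focus="exempt employees",
    keywords=["FLSA", "salary threshold"],
    year=2020,
)
print(query)
# site:ecfr.gov "overtime pay" AND "exempt employees" AND ("FLSA" OR "salary threshold") after:2020
```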

In [11]:
import time
import json
from typing import List, Dict
from langchain_core.tools import tool
from agents.compliance_checker import check_legal_compliance
from context_bank import ContextBank
from langgraph.prebuilt import ToolNode
from agents.utils.websearcher import WebContentRetriever

# Query the Qdrant collection directly for knowledge stored by the Knowledge Agent
def get_knowledge_from_vector_db(query, jurisdiction, document_type: str) -> List[Dict]:
    """
    Retrieves legal knowledge from the vector database based on the query,
    jurisdiction, and document type.
    
    Args:
        query: The search query string
        jurisdiction: The legal jurisdiction (e.g., "US")
        document_type: The document type stored in the Context Bank metadata
        
    Returns:
        List of relevant knowledge items with title, content, URL, and relevance score
    """
    
    # Refine the query with jurisdiction and knowledge type for better results
    refined_query = f"{query} jurisdiction:{jurisdiction} document_type:{document_type}"
    print(f"[KNOWLEDGE RETRIEVAL] Refined query: {refined_query}")

    try:
        print("[KNOWLEDGE RETRIEVAL] Querying vector database...")
        query_vec = _embeddings_model.embed_query(text=refined_query)
        search_results = _qdrant_client.search(
            collection_name=_qdrant_collection,
            query_vector=query_vec,
            with_payload=True,
            limit=_num_results
        )  # search returns a list of ScoredPoint objects

        print(f"[KNOWLEDGE RETRIEVAL] Retrieved {len(search_results)} search results")

        results = [
            {
                "title": result.payload["title"],
                "content": result.payload["content"][:400] + "...",
                "url": result.payload["url"],
                "relevance": result.score
            }
            for result in search_results
        ]
    except Exception as e:
        print(f"[ERROR] Failed to retrieve knowledge from vector DB: {e}")
        return [] # Return empty list on search failure
    
    print(f"[KNOWLEDGE RETRIEVAL] Completed with {len(results)} knowledge items")

    return results

# Create a tool that the ComplianceCheckerAgent can use
# TODO : Figure out the parameters for the compliance check tool and update the function signature and description
@tool
def compliance_check_tool() -> List[Dict]:
    """
    Tool for checking legal document clauses for compliance issues.
    Performs the following tasks:
    Fetches all legal clauses from the context bank
    Retrieves relevant legal laws and regulations from the vector database
    Performs compliance checks on the clauses using the gathered knowledge
    Detects compliance issues: statutory, precedent-based, internal.
    Ensures internal consistency across clauses
    Identifies legal risks and their implications
    Provides structured legal reasoning and confidence scores
    
        
    Returns:
        List of non-compliant clauses with detailed analysis
    """

    # Adding a delay here to avoid hitting API rate limits
    time.sleep(5)
    
    # Get clauses from context bank
    clauses = context_bank.get_clauses()

    print("\n" + "="*80)
    print(f"[COMPLIANCE CHECKER] Starting compliance check for {len(clauses)} clauses")
    # print(f"[COMPLIANCE CHECKER] Context Bank state at start: {json.dumps(context_bank.get_all(), indent=2)}")
    print("="*80 + "\n")

    print(f"The list of clauses is: {clauses}")


    query = "Retrieve all relevant legal knowledge for compliance checking."

    jurisdiction = context_bank.get_jurisdiction()
  
    doc_meta_from_bank = context_bank.get_document()
    # Extract the document_type from the retrieved metadata
    # Provide a default value if the key is missing
    document_type = doc_meta_from_bank.get("document_type", "Unknown Document Type") 
    
    print(f"[COMPLIANCE CHECKER] Using jurisdiction: {jurisdiction}")
    print(f"[COMPLIANCE CHECKER] Using document type: {document_type}")

    # Create a knowledge retrieval adapter that mimics a knowledge agent
    # but actually uses the vector DB directly
    print("[COMPLIANCE CHECKER] Retrieving knowledge from vector DB...")
    knowledge_data = get_knowledge_from_vector_db(query, jurisdiction, document_type)
    print(f"[COMPLIANCE CHECKER] Retrieved {len(knowledge_data)} knowledge items")
    
    print("[COMPLIANCE CHECKER] Checking legal compliance...")
    results = check_legal_compliance(
        context_bank=context_bank,
        knowledge_from_vector_db=knowledge_data,
        use_ollama=False,
        # model_name="llama3.1:latest",
        model_name="gemini-2.0-flash",
        min_confidence=0.75
    )
    
    print(f"[COMPLIANCE CHECKER] Compliance check completed with {len(results)} results")
    # print("\n" + "="*80)
    # print(f"[COMPLIANCE CHECKER] Context Bank state after check: {json.dumps(context_bank.get_all(), indent=2)}")
    # print("="*80 + "\n")
    
    return results
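
The collection was created with `Distance.COSINE`, so the `relevance` value returned by `get_knowledge_from_vector_db` is a cosine similarity between the query embedding and each stored page embedding. The pure-Python sketch below shows what that ranking computes on toy two-dimensional vectors. `cosine_similarity`, `top_k`, and the sample documents are illustrative only, since Qdrant performs this search internally.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], docs: List[Dict], k: int = 3) -> List[Dict]:
    """Score every document against the query and keep the k most similar."""
    scored = [
        {"title": doc["title"], "relevance": cosine_similarity(query_vec, doc["vector"])}
        for doc in docs
    ]
    return sorted(scored, key=lambda item: item["relevance"], reverse=True)[:k]


docs = [
    {"title": "A", "vector": [1.0, 0.0]},
    {"title": "B", "vector": [0.0, 1.0]},
    {"title": "C", "vector": [1.0, 1.0]},
]
ranked = top_k([1.0, 0.0], docs, k=2)
print([item["title"] for item in ranked])
```

Note that the compliance tool always embeds the same generic query string, so the three retrieved items depend mostly on the jurisdiction and document type appended to it.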

In [12]:

compliance_checker_agent_tools = [
    create_handoff_tool(agent_name="Knowledge Agent", description="Transfer to Knowledge Agent if more knowledge is needed, it helps to retrieve knowledge from the web using websearcher."),
    create_handoff_tool(agent_name="Planner Agent", description="Transfer to Planner Agent when compliance checking is completed and all clauses are found to be compliant, it helps to plan the next steps in the workflow and delegate tasks."),
    compliance_check_tool
]

compliance_checker_agent_node = create_react_agent(
    model,
    compliance_checker_agent_tools,
    prompt="""
      You are the Compliance Checker Agent, responsible for analyzing a list of extracted legal clauses to identify contradictions, ensure statutory compliance, and assess contractual consistency under U.S. law.
      You will use the compliance_check_tool to fetch all the **legal clauses from the context bank** and **relevant laws and regulations retrieved from the vector database** to perform a compliance check for the clauses against the legal knowledge.
      Your task is to ensure that the clauses are compliant with federal, state, and city laws, and to identify any legal risks or implications associated with non-compliance.
      Your work is crucial for ensuring that the legal document is compliant with all relevant laws and regulations.

      Use the compliance_check_tool to perform the following primary functions:
      * Fetch all legal clauses from the context bank
      * Retrieve relevant legal laws and regulations from the vector database
      * Perform compliance checks on the clauses using the gathered knowledge
      * Detect compliance issues: statutory, precedent-based, internal.
      * Ensure internal consistency across clauses
      * Identify legal risks and their implications
      * Provide structured legal reasoning and confidence scores

      Output:
      1. Contradiction Report
      {
        "has_contradiction": true|false,
        "contradiction_type": "statutory|precedent|internal",
        "severity": "high|medium|low",
        "description": "...",
        "source_clause": { "id": "...", "text": "..." },
        "reference": { "type": "...", "id": "...", "text": "..." }
      }
      2. Reasoning & Analysis
      {
        "analysis_steps": ["Step 1...", "Step 2...", "Step 3..."],
        "confidence_score": 0.0–1.0,
        "supporting_references": [{ "type": "statute", "id": "...", "relevance": "..." }]
      }
      3. Legal Implications
      {
        "implications": [
          {
            "description": "...",
            "severity": "high|medium|low",
            "affected_parties": ["..."],
            "risk_areas": ["..."]
          }
        ]
      }


      [CRITICAL STEP] Decision Flow:
      If the available knowledge is insufficient, hand off to the Knowledge Agent and name the missing fields so it can retrieve information on them.
      If compliance check is complete, return to the Planner Agent for post processing where the whole process is summarized by the Post Processor Agent.

      Guidelines:
      * Use only validated legal sources
      * No fabrication or assumptions
      * Flag unclear issues and recommend human review when needed
      * Consider jurisdictional scope and maintain objectivity
      * [CRITICAL STEP] Return to the Planner Agent to update that the compliance check is completed.
    """,
    name="Compliance Checker Agent",
)
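
The Contradiction Report in the prompt above restricts `contradiction_type` and `severity` to fixed vocabularies. If the agent's output is consumed programmatically, a light check of those fields may help catch drift. The sketch below is hypothetical (`ALLOWED_VALUES` and `check_contradiction_report` are not in the notebook) and covers only the enumerated fields shown in the prompt.

```python
from typing import Any, Dict, List

# Vocabularies copied from the Contradiction Report template in the prompt.
ALLOWED_VALUES = {
    "contradiction_type": {"statutory", "precedent", "internal"},
    "severity": {"high", "medium", "low"},
}


def check_contradiction_report(report: Dict[str, Any]) -> List[str]:
    """Return a list of field-level errors; an empty list means the report passes."""
    errors: List[str] = []
    if not isinstance(report.get("has_contradiction"), bool):
        errors.append("has_contradiction must be true or false")
    for field, allowed in ALLOWED_VALUES.items():
        value = report.get(field)
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}")
    return errors


report = {
    "has_contradiction": True,
    "contradiction_type": "statutory",
    "severity": "critical",  # not in the allowed vocabulary
    "description": "...",
}
print(check_contradiction_report(report))
```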

In [13]:
# # TODO : Add the other tools for Clause Rewriter Agent -

# clause_rewriter_agent_tools = [
#     create_handoff_tool(agent_name="Compliance Checker Agent", description="Transfer to Compliance Checker Agent after a non-compliant clause has been rewritten, it helps to check the compliance of the rewritten clause."),
# ]

# clause_rewriter_agent_node = create_react_agent(
#     model,
#     clause_rewriter_agent_tools,
#     prompt="""
#     You are the Clause Rewriter Agent, tasked with revising legal clauses flagged as non-compliant, contradictory, or unclear by the Compliance Checker Agent. Your goal is to ensure legal compliance while preserving the original intent.

# Responsibilities:
# Rewrite clauses to resolve statutory, precedent-based, or internal contradictions
# Ensure clarity, enforceability, and alignment with U.S. law
# Maintain intent and context of original clause
# Signal if more legal context is required (route to Knowledge Agent)

# Input Format:
# {
#   "original_clause": {
#     "id": "clause_id",
#     "text": "original text"
#   },
#   "issue": {
#     "description": "reason for non-compliance",
#     "contradiction_type": "statutory|precedent|internal",
#     "reference": {
#       "type": "statute|precedent|clause",
#       "text": "reference text",
#       "source_link": "optional"
#     }
#   },
#   "context_info": {
#     "document_title": "Title",
#     "named_entities": [...],
#     "document_class": "e.g., NDA, Lease"
#   }
# }
# Output Format:
# {
#   "clause_id": "clause_id",
#   "rewritten_clause": "Compliant version of the clause",
#   "justification": "How it resolves the issue and aligns with legal standards"
# }
# Guidelines:
# Be concise, precise, and legally sound
# Do not fabricate or generalize
# Flag insufficient context when needed

#     """,
#     name="Clause Rewriter Agent",
# )
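
Because the Clause Rewriter cell is commented out, the Compliance Checker → Clause Rewriter → Compliance Checker loop from the data-flow description is never exercised in this notebook. A rough sketch of that loop follows, assuming two hypothetical callables: `check` returns an issue dictionary or `None` when the clause is compliant, and `rewrite` returns a revised clause. The `max_rounds` cap is also an assumption, added so the loop cannot run forever.

```python
from typing import Any, Callable, Dict, Optional

Clause = Dict[str, Any]
Issue = Dict[str, Any]


def rewrite_until_compliant(
    clause: Clause,
    check: Callable[[Clause], Optional[Issue]],
    rewrite: Callable[[Clause, Issue], Clause],
    max_rounds: int = 3,
) -> Dict[str, Any]:
    """Alternate between checking and rewriting until compliant or out of rounds."""
    current = dict(clause)
    for rounds_used in range(max_rounds):
        issue = check(current)
        if issue is None:
            return {"clause": current, "compliant": True, "rounds": rounds_used}
        current = rewrite(current, issue)
    # Re-check once more after the final rewrite.
    return {"clause": current, "compliant": check(current) is None, "rounds": max_rounds}


# Stub checker and rewriter, purely illustrative.
def stub_check(clause: Clause) -> Optional[Issue]:
    if "in perpetuity" in clause["text"]:
        return {"description": "unbounded term", "contradiction_type": "statutory"}
    return None


def stub_rewrite(clause: Clause, issue: Issue) -> Clause:
    return {**clause, "text": clause["text"].replace("in perpetuity", "for a fixed term")}


result = rewrite_until_compliant(
    {"id": "c1", "text": "This obligation applies in perpetuity."}, stub_check, stub_rewrite
)
print(result["compliant"], result["rounds"])
```

In the swarm itself, the same control flow would be expressed through handoff tools rather than direct function calls; the sketch only illustrates the termination condition.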

In [14]:
# TODO: Add the other tools for the Post-Processor Agent: Process Summarizer, Context Bank getter

post_processor_agent_tools = [
    create_handoff_tool(agent_name="Planner Agent", description="Transfer to the Planner Agent when post-processing is complete so that it can plan the next steps in the workflow and delegate tasks."),
]

post_processor_agent_node = create_react_agent(
    model,
    post_processor_agent_tools,
    prompt="""
        You are the Post-processor Agent, responsible for generating final, human-readable outputs after a legal document passes all compliance checks.

        Input:
        * Context Bank (document metadata, clause info)
        * Compliance Checker outputs (reasoning, implications)
        * History of completed tasks

        Tools: Process Summarizer

        Outputs (via Process Summarizer):
        * Contract Summary – Overview of the document
        * Changes – Highlighted clause modifications
        * Risks Averted – Legal issues resolved
        * References – Cited statutes and precedents

        Guidelines:
        * Be clear, concise, and legally accurate
        * Avoid jargon or speculation
        * Tailor for legal and business audiences
        * [CRITICAL STEP] Return the summary to the Planner Agent and report that post-processing is complete.
""",
    name="Post Processor Agent",
)
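
The four outputs named in the prompt above (Contract Summary, Changes, Risks Averted, References) could be assembled into one plain-text report once the Process Summarizer tool exists. The sketch below is an assumption about how that might look: `render_report` and `SECTION_ORDER` are hypothetical names, and the input is taken to be a dictionary mapping each section title to a list of strings.

```python
from typing import Dict, List

# Section titles in the order the Post-processor prompt lists them.
SECTION_ORDER = ["Contract Summary", "Changes", "Risks Averted", "References"]


def render_report(sections: Dict[str, List[str]]) -> str:
    """Render the post-processing sections as a plain-text report."""
    lines: List[str] = []
    for title in SECTION_ORDER:
        lines.append(f"{title}:")
        # Make empty sections explicit rather than silently omitting them.
        items = sections.get(title) or ["(none)"]
        lines.extend(f"  * {item}" for item in items)
        lines.append("")
    return "\n".join(lines).rstrip()


# Placeholder content only; real values would come from the Context Bank.
print(render_report({"Contract Summary": ["..."], "Changes": []}))
```

Keeping the section order in one constant means the report layout stays aligned with the prompt if sections are added or renamed.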

In [15]:
checkpointer = InMemorySaver()
workflow = create_swarm(
    [planner_agent_node, pre_processor_agent_node, knowledge_agent_node, compliance_checker_agent_node, post_processor_agent_node],
    default_active_agent="Planner Agent"
)
app = workflow.compile(checkpointer=checkpointer)

print("\n" + "="*80)
print("[WORKFLOW] Multi-agent swarm initialized with the following agents:")
print("  - Planner Agent")
print("  - Pre Processor Agent")
print("  - Knowledge Agent")
print("  - Compliance Checker Agent")
print("  - Post Processor Agent")
print("[WORKFLOW] Default active agent: Planner Agent")
print("="*80 + "\n")


[WORKFLOW] Multi-agent swarm initialized with the following agents:
  - Planner Agent
  - Pre Processor Agent
  - Knowledge Agent
  - Compliance Checker Agent
  - Post Processor Agent
[WORKFLOW] Default active agent: Planner Agent
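
The banner printed above is hard-coded, so it can drift from the list actually passed to `create_swarm` (the Clause Rewriter Agent, for instance, is absent from both only because its cell is commented out). One way to keep the two in sync, sketched here with a hypothetical helper `format_swarm_banner`, is to build the banner from a single list of names. In the notebook that list might be derived from the agent objects, assuming each exposes the `name` it was created with.

```python
from typing import List


def format_swarm_banner(agent_names: List[str], default_agent: str) -> str:
    """Build the startup banner from the same names used to create the swarm."""
    if default_agent not in agent_names:
        raise ValueError(f"default agent {default_agent!r} is not in the swarm")
    rule = "=" * 80
    lines = [rule, "[WORKFLOW] Multi-agent swarm initialized with the following agents:"]
    lines += [f"  - {name}" for name in agent_names]
    lines += [f"[WORKFLOW] Default active agent: {default_agent}", rule]
    return "\n".join(lines)


names = [
    "Planner Agent",
    "Pre Processor Agent",
    "Knowledge Agent",
    "Compliance Checker Agent",
    "Post Processor Agent",
]
print(format_swarm_banner(names, "Planner Agent"))
```

The `ValueError` catches a misspelled default agent name before the workflow is compiled.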



In [16]:
config = {"configurable": {"thread_id": "1"}}
turn_1 = app.invoke(
    {"messages": 
        [{
            "role": "user", 
            "content": """You are given the file path of a document that you must preprocess to extract clauses. Once the clauses are extracted, fetch all relevant knowledge related to them. Based on the collected knowledge, check these clauses for compliance. Explain the non-compliant clauses, suggest changes, and summarize the results for the user. FILE PATH OF DOCUMENT: \"./Original and Modified/modified_UsioInc_20040428_SB-2_EX-10.11_1723988_EX-10.11_Affiliate Agreement 2.pdf\"
            """
        }]
    },
    config,
)
print(turn_1)
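
Printing `turn_1` dumps the entire state, including every intermediate message. To show only the final reply, a small helper can pull the content of the last message. This is a sketch under the assumption that the state is a dictionary with a `messages` list whose items are either plain dictionaries or message objects with a `content` attribute; `last_message_text` is a hypothetical name, and the fake state below stands in for a real run.

```python
from types import SimpleNamespace
from typing import Any, Dict


def last_message_text(state: Dict[str, Any]) -> str:
    """Return the text of the final message in a swarm state, or '' if there is none."""
    messages = state.get("messages") or []
    if not messages:
        return ""
    last = messages[-1]
    if isinstance(last, dict):
        content = last.get("content", "")
    else:
        content = getattr(last, "content", "")
    # Some chat models return structured content; fall back to its string form.
    return content if isinstance(content, str) else str(content)


# Fake state for illustration; in the notebook this would be `turn_1`.
fake_state = {
    "messages": [
        {"role": "user", "content": "Check this document."},
        SimpleNamespace(content="Final summary text"),
    ]
}
print(last_message_text(fake_state))
```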



[PREPROCESS] Starting document preprocessing for: ./Original and Modified/modified_UsioInc_20040428_SB-2_EX-10.11_1723988_EX-10.11_Affiliate Agreement 2.pdf

[PREPROCESS] Step 1: Extracting text from PDF...
[PREPROCESS] Extracted 167916 characters of text
[PREPROCESS] Step 2: Extracting title...
[PREPROCESS] Extracted title: Based on the text provided, the title of the document is:

**NETWORK 1 FINANCIAL CORPORATION AFFILIATE OFFICE AGREEMENT**
[PREPROCESS] Step 3: Performing Named Entity Recognition...
[PREPROCESS] Extracted 3165 named entities
[PREPROCESS] Sample entities (first 5): [('10.11', 'CARDINAL'), ('NETWORK 1 FINANCIAL CORPORATION', 'WORK_OF_ART'), ('AGREEMENT', 'GPE'), ('NETWORK 1 FINANCIAL, INC.', 'ORG'), ('NETWORK 1', 'WORK_OF_ART')]
[PREPROCESS] Step 4: Classifying document and clauses...
[PREPROCESS] Successfully parsed LLM output as JSON
[PREPROCESS] Document class: Legal Agreement - Affiliate Office Agreement
[PREPROCESS] Extracted 6 clauses
[PREPROCESS] Step 5: Stor
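
The title logged in Step 2 above still carries the model's preamble ("Based on the text provided, ...") and Markdown bold markers. A possible cleanup step is sketched below; `clean_title` is a hypothetical helper, not part of the pre-processor shown in this notebook. It assumes the model either wraps the title in `**...**` or puts it on the last non-empty line.

```python
import re


def clean_title(raw: str) -> str:
    """Strip an LLM preamble and Markdown emphasis from an extracted title."""
    bold = re.search(r"\*\*(.+?)\*\*", raw, flags=re.DOTALL)
    if bold:
        # Collapse internal whitespace in case the title wrapped across lines.
        return " ".join(bold.group(1).split())
    non_empty = [line.strip() for line in raw.splitlines() if line.strip()]
    return non_empty[-1] if non_empty else ""


raw_title = (
    "Based on the text provided, the title of the document is:\n\n"
    "**NETWORK 1 FINANCIAL CORPORATION AFFILIATE OFFICE AGREEMENT**"
)
print(clean_title(raw_title))
# NETWORK 1 FINANCIAL CORPORATION AFFILIATE OFFICE AGREEMENT
```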

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2024-01-18/pdf/2023-28629.pdf
[KNOWLEDGE] Processing result 2/10: Federal Register/Vol. 89, No. 49/Tuesday, March 12, 2024/ ...
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2024-03-12/pdf/2024-05207.pdf
[KNOWLEDGE] Processing result 3/10: Rules and Regulations
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2021-02-23/pdf/2020-28473.pdf
[KNOWLEDGE] Processing result 4/10: Rules and Regulations
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2021-02-12/pdf/2021-01499.pdf
[KNOWLEDGE] Processing result 5/10: Rules and Regulations
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2022-01-28/pdf/2022-01607.pdf
[KNOWLEDGE] Processing result 6/10: Federal Register/Vol. 89, No. 78/Monday, April 22, 2024/ ...
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2024-04-22/pdf/2024-07496.pdf
[KNOWLEDGE] Processing result 7/10: Federal Register/Vol. 90, No. 5/Wednesday, January 8, ...
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2025-01-08/pdf/2024-31486.pdf
[KNOWLEDGE] Processing result 8/10: Federal Register/Vol. 89, No. 154/Friday, August 9, 2024/ ...
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2024-08-09/pdf/2024-17351.pdf
[KNOWLEDGE] Processing result 9/10: Public Law 95-369 95th Congress An Act
[KNOWLEDGE] Successfully processed URL: https://www.govinfo.gov/content/pkg/STATUTE-92/pdf/STATUTE-92-Pg607.pdf
[KNOWLEDGE] Processing result 10/10: Statement of Policy on Bank Merger Transactions
[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2024-09-27/pdf/2024-22189.pdf
[KNOWLEDGE] Upserting 1 documents to Qdrant collection
[KNOWLEDGE] Successfully upserted 1 documents to Qdrant
[KNOWLEDGE] Retrieved and processed 1 out of 10 URLs
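
The knowledge log above ends with only 1 of 10 URLs yielding content, which is easy to miss among the per-URL messages. A small tally over the captured log lines makes the hit rate explicit. This is an illustrative sketch: `tally_knowledge_log` is a hypothetical helper that relies only on the two message prefixes visible in the log, and the sample lines are copied from it.

```python
from typing import Iterable, Tuple


def tally_knowledge_log(lines: Iterable[str]) -> Tuple[int, int]:
    """Count (succeeded, failed) URL fetches from [KNOWLEDGE] log lines."""
    succeeded = failed = 0
    for line in lines:
        if line.startswith("[KNOWLEDGE] Successfully processed URL"):
            succeeded += 1
        elif line.startswith("[KNOWLEDGE] No content found"):
            failed += 1
    return succeeded, failed


sample_log = [
    "[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2024-08-09/pdf/2024-17351.pdf",
    "[KNOWLEDGE] Processing result 9/10: Public Law 95-369 95th Congress An Act",
    "[KNOWLEDGE] Successfully processed URL: https://www.govinfo.gov/content/pkg/STATUTE-92/pdf/STATUTE-92-Pg607.pdf",
    "[KNOWLEDGE] No content found or extracted for URL https://www.govinfo.gov/content/pkg/FR-2024-09-27/pdf/2024-22189.pdf",
]
succeeded, failed = tally_knowledge_log(sample_log)
print(f"{succeeded} succeeded, {failed} failed")
```

A low success count like the one in this run may be worth surfacing to the Planner Agent, since the compliance check that follows has only one retrieved document to draw on.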

[COMPLIANCE CHECKER] Starting compliance check for 6 clauses

The list of clauses is: [{'Text': 'THIS AGREEMENT is entered into by and between NETWORK 1 FINANCIAL, INC. ("NETWORK 1"), a Virginia Corporation with its principal place of business at 1501 Farm Credit Drive, Suite 1500, McLean, Virginia 22102-5004, and Payment Data Systems, Inc., the Affiliate Office ("AFFILIATE"), a Nevada Corporation with its principal place of business at 12500 San Pedro Suite 120 San Antonio, TX 78216.', 'Category': 'Parties and Definitions'}, {'Text': 'The term ("Term") of this Agreement shall be for one hundred eighty days (180) from the date set forth below unless Network 1 or Visa or MasterCard or Harris Bank doesn\'t ap


[KNOWLEDGE RETRIEVAL] Retrieved 3 search results
[KNOWLEDGE RETRIEVAL] Completed with 3 knowledge items
[COMPLIANCE CHECKER] Retrieved 3 knowledge items
[COMPLIANCE CHECKER] Checking legal compliance...
Starting legal compliance check with model 'gemini-2.0-flash' (use_ollama=False)
Initializing API client with model gemini-2.0-flash
Estimated jurisdiction: Virginia
Found 6 clauses and 3165 entities
Processing 3 knowledge items from vector database
Classified knowledge: 2 statutes, 1 precedents
Prepared knowledge context with 2 statutes and 1 precedents
Analyzing clause 1/6 (ID: unknown)
Clause Text: THIS AGREEMENT is entered into by and between NETWORK 1 FINANCIAL, INC. ("NETWORK 1"), a Virginia Corporation with its principal place of business at 1501 Farm Credit Drive, Suite 1500, McLean, Virginia 22102-5004, and Payment Data Systems, Inc., the Affiliate Office ("AFFILIATE"), a Nevada Corporation with its principal place of business at 12500 San Pedro Suite 120 San Antonio, TX 78216.