# Module 3: Building the Graph Retrieval Tool

Welcome to Module 3 and the start of our **Retrieval Pipeline**! In the first two modules, we successfully built a complete Ingestion Pipeline, transforming raw contract text into a fully populated Neo4j knowledge graph.

Now, it's time to build the bridge that will allow our AI agent to access this powerful new data source.

**Our Mission:**
1. Define a clear input schema for our graph queries.
2. Build a robust Python function that dynamically constructs and executes hybrid Cypher queries against our graph.
3. Wrap this function into an official LangChain "Tool".
4. Rigorously test the tool in isolation to ensure it's reliable and bug-free for Module 4.

### The "Tool": Our Agent's Bridge to the Knowledge Graph
Before we write any code, let's clarify what we're building. In our final system (Module 4), an AI Agent will take a user's question, but it won't query the database directly. Instead, it will use a specialized Tool that we create.

A key point to understand is how the tool gets its instructions.

**In This Module (Testing):** We will play the role of the "agent" and manually pass structured parameters to our tool (e.g., a dictionary like `{'contract_type': 'Supply'}`). This lets us test the tool's logic directly and confirm that it behaves as expected.

**In Module 4 (Final System):** The LLM Agent will take a natural language question (e.g., "Find me all the supply contracts") and autonomously determine and generate these structured parameters before calling the tool.

Our entire focus in this module is to build and test this "Tool" component in isolation, making sure it's robust and ready for the agent to use.

## 1. Setup and Dependencies

We'll start by installing the necessary libraries. The key additions here are `langchain` and `langchain_core` for creating our tool.

In [12]:
%pip install -qU langchain langchain-core langchain-google-genai langchain-neo4j python-dotenv

In [13]:
import os
import json
from dotenv import load_dotenv
from typing import List, Optional, Any

from langchain_core.tools import tool
from pydantic import BaseModel, Field
from langchain_neo4j import Neo4jGraph
from langchain_google_genai import GoogleGenerativeAIEmbeddings


from google.colab import drive, userdata

## 2. Configure Environment Variables

As in Module 2, we need to connect to our Google and Neo4j services. Please provide your credentials below.

In [14]:
# Define required environment variables
required_vars = ["GOOGLE_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "NEO4J_DATABASE"]

# Set environment variables with validation
missing_vars = []
for var in required_vars:
    value = userdata.get(var)
    if value:
        os.environ[var] = value
        print(f"✅ {var}: Set successfully")
    else:
        missing_vars.append(var)
        print(f"❌ {var}: Missing or empty")

# Check if all required variables are set
if missing_vars:
    print(f"\n🚨 Error: Missing required environment variables: {', '.join(missing_vars)}")
    print("Please ensure all secrets are properly configured in Colab.")
    raise ValueError(f"Missing environment variables: {missing_vars}")
else:
    print(f"\n🎉 Successfully loaded all {len(required_vars)} required environment variables!")

# Optional: Verify API key format (basic validation)
if os.environ.get("GOOGLE_API_KEY"):
    api_key = os.environ["GOOGLE_API_KEY"]
    if len(api_key) < 20:  # Basic length check
        print("⚠️  Warning: Google API key seems unusually short")
    else:
        print("✅ Google API key format looks valid")

✅ GOOGLE_API_KEY: Set successfully
✅ NEO4J_URI: Set successfully
✅ NEO4J_USERNAME: Set successfully
✅ NEO4J_PASSWORD: Set successfully
✅ NEO4J_DATABASE: Set successfully

🎉 Successfully loaded all 5 required environment variables!
✅ Google API key format looks valid
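
The cell above reads secrets from Colab's `userdata` store. Outside Colab, a similar check could run against `os.environ` (for example, after `load_dotenv()` has populated it). The following is a minimal sketch under that assumption; `find_missing_vars` and `demo_env` are hypothetical names that do not appear elsewhere in this notebook.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_VARS = [
    "GOOGLE_API_KEY", "NEO4J_URI", "NEO4J_USERNAME",
    "NEO4J_PASSWORD", "NEO4J_DATABASE",
]

def find_missing_vars(required: Iterable[str], source: Mapping[str, str]) -> List[str]:
    """Return the names in `required` that are absent or empty in `source`."""
    return [name for name in required if not source.get(name)]

# Demo with a hypothetical, partially filled mapping instead of real secrets.
demo_env = {"GOOGLE_API_KEY": "dummy-key", "NEO4J_URI": "", "NEO4J_USERNAME": "neo4j"}
missing = find_missing_vars(REQUIRED_VARS, demo_env)
print(f"Missing in demo mapping: {missing}")

# In a real session, the same helper can inspect the process environment.
missing_in_process = find_missing_vars(REQUIRED_VARS, os.environ)
print(f"Missing in os.environ: {missing_in_process}")
```

Keeping the check as a pure function over a mapping makes it easy to exercise without touching real credentials.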


## 3. Connect to Neo4j and Initialize Embeddings Model

Let's establish our connection to the Neo4j database and initialize the Google Generative AI Embeddings model, which we'll need for the vector search part of our tool.

In [15]:
try:
    graph = Neo4jGraph(
            url=os.environ["NEO4J_URI"],
            username=os.environ["NEO4J_USERNAME"],
            password=os.environ["NEO4J_PASSWORD"],
            database=os.environ["NEO4J_DATABASE"]
        )
    graph.query("RETURN 1")
    print("Successfully connected to Neo4j.")
except Exception as e:
    print(f"Failed to connect to Neo4j: {e}")

try:
    embedding_model = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
    print("Successfully initialized embeddings model.")
except Exception as e:
    print(f"Failed to initialize embeddings model: {e}")

Successfully connected to Neo4j.
Successfully initialized embeddings model.


## 4. Define the Tool's Input Schema

This is the 'instruction manual' for our agent. By defining a Pydantic `BaseModel`, we tell the LLM exactly what parameters it can use to search for contracts. The descriptions for each field are crucial, as they guide the LLM in mapping a user's natural language query to the correct parameters.

In [16]:
class ContractInput(BaseModel):
    "Input schema for the contract search tool."

    contract_type: Optional[str] = Field(None, description="The type of contract, e.g., 'Service', 'Supply', 'Reseller'.")
    parties: Optional[List[str]] = Field(None, description="List of parties involved in the contract, e.g., ['Aperture Global Logistics', 'Fonterra'].")
    summary_search: Optional[str] = Field(None, description="A semantic search query to run against the contract's summary.")
    min_effective_date: Optional[str] = Field(None, description="Earliest contract effective date in YYYY-MM-DD format.")
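
The schema above describes `min_effective_date` as a `YYYY-MM-DD` string but does not enforce that format. As a hedged sketch, a small standard-library helper could reject malformed dates before they reach Cypher's `date()` function. `parse_min_effective_date` is a hypothetical helper, not part of the notebook's code.

```python
import re
from datetime import date
from typing import Optional

# Strict YYYY-MM-DD layout: four digits, two digits, two digits.
_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def parse_min_effective_date(value: Optional[str]) -> Optional[date]:
    """Parse a YYYY-MM-DD string, or return None when no filter was given.

    Raises ValueError for a wrong layout or an impossible calendar date.
    """
    if value is None:
        return None
    if not _ISO_DATE.fullmatch(value):
        raise ValueError(f"Expected YYYY-MM-DD, got: {value!r}")
    # fromisoformat rejects impossible dates such as 2018-02-30.
    return date.fromisoformat(value)

print(parse_min_effective_date("2018-01-01"))
```

The same check could be attached to the Pydantic model as a field validator; that wiring is left out here to keep the sketch dependency-free.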

## 5. Build the Core Graph Query Function

This function is the heart of our tool. It takes the parameters defined in our schema and dynamically builds a single, powerful Cypher query. It intelligently combines graph-based filtering (for things like `parties` and `contract_type`) with vector similarity search (for `summary_search`).

In [53]:
def get_contracts(
    contract_type: Optional[str] = None,
    parties: Optional[List[str]] = None,
    summary_search: Optional[str] = None,
    min_effective_date: Optional[str] = None
):
    """
    Searches for contracts in the Neo4j database based on provided criteria.
    Dynamically builds a Cypher query to filter by metadata and perform vector search.
    """
    cypher_statement = "MATCH (c:Contract) "
    params = {}
    filters = []

    # Metadata filters
    if contract_type:
        filters.append("c.contract_type = $contract_type")
        params["contract_type"] = contract_type

    if min_effective_date:
        filters.append("c.effective_date >= date($min_effective_date)")
        params["min_effective_date"] = min_effective_date

    if parties:
        for i, party in enumerate(parties):
            party_param = f"party_{i}"
            filters.append(f"EXISTS {{ MATCH (c)<-[:PARTY_TO]-(p:Party) WHERE toLower(p.name) CONTAINS ${party_param} }}")
            params[party_param] = party.lower()

    if filters:
        cypher_statement += "WHERE " + " AND ".join(filters) + " "

    # Vector similarity scoring, applied after the metadata filters above
    if summary_search:
        embedding = embedding_model.embed_query(summary_search)
        params["embedding"] = embedding
        cypher_statement += (
            "WITH c, vector.similarity.cosine(c.embedding, $embedding) AS score "
            "WHERE score > 0.7 " # Similarity threshold
            "ORDER BY score DESC "
        )
    else:
        cypher_statement += "WITH c ORDER BY c.effective_date DESC "  # Default sort

    # Final RETURN clause to format the output
    cypher_statement += """WITH collect(c) AS nodes
    RETURN {
        total_count: size(nodes),
        example_contracts: [
            el in nodes[..5] | {
                summary: el.summary,
                contract_type: el.contract_type,
                effective_date: toString(el.effective_date),
                parties: [(el)<-[:PARTY_TO]-(p:Party) | p.name]
            }
        ]
    } AS output
    """

    # Execute the query
    print("cypher_statement ->")
    print(cypher_statement)
    print("\n")
    print("params ->")
    print(params)
    print("\n")
    result = graph.query(cypher_statement, params)
    print("******************\n")
    print("Output from Contract Search Tool ->")
    print("\n******************\n")
    return result[0]['output']
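
The filter-building half of `get_contracts` can be studied without a database. The sketch below isolates that logic in a hypothetical helper, `build_where_clause`, which mirrors the filters used above and returns the `WHERE` fragment together with its parameters. It is an illustration for experimentation, not a replacement for the function above.

```python
from typing import Any, Dict, List, Optional, Tuple

def build_where_clause(
    contract_type: Optional[str] = None,
    parties: Optional[List[str]] = None,
    min_effective_date: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return a Cypher WHERE fragment and the parameters it refers to."""
    filters: List[str] = []
    params: Dict[str, Any] = {}

    if contract_type:
        filters.append("c.contract_type = $contract_type")
        params["contract_type"] = contract_type

    if min_effective_date:
        filters.append("c.effective_date >= date($min_effective_date)")
        params["min_effective_date"] = min_effective_date

    # One EXISTS subquery per party, each with its own numbered parameter.
    for i, party in enumerate(parties or []):
        key = f"party_{i}"
        filters.append(
            f"EXISTS {{ MATCH (c)<-[:PARTY_TO]-(p:Party) "
            f"WHERE toLower(p.name) CONTAINS ${key} }}"
        )
        params[key] = party.lower()

    clause = ("WHERE " + " AND ".join(filters) + " ") if filters else ""
    return clause, params

clause, params = build_where_clause(contract_type="Supply", parties=["Fonterra"])
print(clause)
print(params)
```

Note that only parameter names (such as `$party_0`) are interpolated into the query text. The user-supplied values travel separately in `params`, which keeps them out of the Cypher string itself.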

## 6. Create the LangChain Tool

Now we wrap our `get_contracts` function into an official LangChain tool. We use the `@tool` decorator, which is the latest and simplest way to create a tool in LangChain. By default, the function’s docstring becomes the tool’s description that helps the model understand when to use it.

We pass our `ContractInput` Pydantic model to the `args_schema` to ensure the LLM knows what arguments are available.

For more details on LangChain tools, refer to the official LangChain documentation: https://docs.langchain.com/oss/python/langchain/tools

In [54]:
@tool(args_schema=ContractInput)
def contract_search_tool(
    contract_type: Optional[str] = None,
    parties: Optional[List[str]] = None,
    summary_search: Optional[str] = None,
    min_effective_date: Optional[str] = None
) -> dict:
    """Searches for contracts in the AGL contract database based on various criteria."""
    return get_contracts(contract_type, parties, summary_search, min_effective_date)

# Let's inspect our tool
print(f"Tool Name: {contract_search_tool.name}")
print(f"Tool Description: {contract_search_tool.description}")
print(f"Tool Arguments Schema: {contract_search_tool.args}")

Tool Name: contract_search_tool
Tool Description: Searches for contracts in the AGL contract database based on various criteria.
Tool Arguments Schema: {'contract_type': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': "The type of contract, e.g., 'Service', 'Supply', 'Reseller'.", 'title': 'Contract Type'}, 'parties': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'description': "List of parties involved in the contract, e.g., ['Aperture Global Logistics', 'Fonterra'].", 'title': 'Parties'}, 'summary_search': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': "A semantic search query to run against the contract's summary.", 'title': 'Summary Search'}, 'min_effective_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': 'Earliest contract effective date in YYYY-MM-DD format.', 'title': 'Min Effective Date'}}
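
To build intuition for how a decorator can derive a name, description, and argument listing like the ones printed above, here is a minimal sketch using only Python's `inspect` module. It illustrates the general idea and is not LangChain's actual implementation; `describe_callable` and `demo_search` are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional

def describe_callable(func: Callable[..., Any]) -> Dict[str, Any]:
    """Collect a name, description, and argument defaults from a function."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        # The cleaned-up docstring plays the role of the tool description.
        "description": inspect.getdoc(func) or "",
        "args": {name: p.default for name, p in signature.parameters.items()},
    }

def demo_search(
    contract_type: Optional[str] = None,
    parties: Optional[List[str]] = None,
) -> dict:
    """Searches for contracts based on various criteria."""
    return {"total_count": 0, "example_contracts": []}

spec = describe_callable(demo_search)
print(spec)
```

This is why a clear docstring and well-named parameters matter: they are the raw material from which the model-facing description is built.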


## 7. Testing the Tool in Isolation

#### Why Test the Tool Manually?

This is the most important step of Module 3. Before we let an LLM agent use our tool in Module 4, we need to be confident that the tool itself works correctly. By calling it directly from Python, we are **unit testing** our data retrieval logic. If issues arise in the next module, we can then look first at the agent's reasoning rather than at our database query code.

In [55]:
# Function to display test results.

def display_test_results(results):
    """A helper function to print the tool's output in a clean format."""
    print(f"Total Contracts Found: {results.get('total_count', 'N/A')}")
    if results.get('example_contracts'):
        print("Example Contracts:")
        for i, contract in enumerate(results['example_contracts']):
            print(f"  {i+1}. Type: {contract.get('contract_type', 'N/A')}")
            print(f"     Parties: {', '.join(contract.get('parties', []))}")
            print(f"     Effective Date: {contract.get('effective_date', 'N/A')}")
            print("-" * 20)
    else:
        print("No example contracts returned.")
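
The helper above prints results for a visual check. As a complement, a hedged sketch of an assertion-based check is shown below, so that a failing test raises an error instead of relying on someone reading the output. `check_tool_output` is a hypothetical helper, and `sample_result` is a hand-built dictionary shaped like the tool's return value (its summary text is a placeholder).

```python
from typing import Any, Dict, Optional

EXPECTED_KEYS = ("summary", "contract_type", "effective_date", "parties")

def check_tool_output(
    results: Dict[str, Any],
    expected_count: Optional[int] = None,
    max_examples: int = 5,
) -> None:
    """Raise AssertionError if `results` does not have the expected shape."""
    total = results.get("total_count")
    assert isinstance(total, int), "total_count must be an int"

    examples = results.get("example_contracts")
    assert isinstance(examples, list), "example_contracts must be a list"
    # The query returns at most `max_examples` contracts (nodes[..5]).
    assert len(examples) == min(max_examples, total), "unexpected example count"

    for contract in examples:
        for key in EXPECTED_KEYS:
            assert key in contract, f"missing key: {key}"

    if expected_count is not None:
        assert total == expected_count, f"expected {expected_count}, got {total}"

# Hand-built stand-in for a real tool result.
sample_result = {
    "total_count": 1,
    "example_contracts": [{
        "summary": "(placeholder summary)",
        "contract_type": "Service",
        "effective_date": "2019-11-01",
        "parties": ["Aperture Global Logistics", "Innovate Solutions Inc."],
    }],
}
check_tool_output(sample_result, expected_count=1)
print("Shape check passed.")
```

In the tests that follow, a call such as `check_tool_output(test_1_results, expected_count=1)` could sit next to `display_test_results`.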

#### Test 1: Simple Metadata Filter

Our first test is the simplest. We'll ask the tool to find all contracts where the `contract_type` is "Service".

**What to look for:**
- The generated **Cypher query** should have a `WHERE c.contract_type = $contract_type` clause.
- The **parameters** should show `{'contract_type': 'Service'}`.
- The **output** should return a `total_count` of 1, showing the Innovate Solutions agreement.

In [46]:
print("--- 🧪 Test 1: Simple Metadata Filter --- ")
print("Finding all 'Service' agreements...\n")

# The tool's internal print statements for Cypher/params will still run
test_1_results = contract_search_tool.invoke({"contract_type": "Service"})

print("\n--- Formatted Output ---")
display_test_results(test_1_results)

--- 🧪 Test 1: Simple Metadata Filter --- 
Finding all 'Service' agreements...

cypher_statement ->
MATCH (c:Contract) WHERE c.contract_type = $contract_type WITH c ORDER BY c.effective_date DESC WITH collect(c) AS nodes
    RETURN {
        total_count: size(nodes),
        example_contracts: [
            el in nodes[..5] | {
                summary: el.summary,
                contract_type: el.contract_type,
                effective_date: toString(el.effective_date),
                parties: [(el)<-[:PARTY_TO]-(p:Party) | p.name]
            }
        ]
    } AS output
    


params ->
{'contract_type': 'Service'}


******************

Output from Contract Search Tool ->

******************


--- Formatted Output ---
Total Contracts Found: 1
Example Contracts:
  1. Type: Service
     Parties: Aperture Global Logistics, Innovate Solutions Inc.
     Effective Date: 2019-11-01
--------------------


#### Test 2: Relationship-Based Filter

Next, we'll test the tool's ability to traverse relationships in the graph. We will ask for all contracts involving the party "Fonterra". This requires the query to find a `Party` node and then follow the `:PARTY_TO` relationship to the `Contract` node.

**What to look for:**
- The **Cypher query** should use an `EXISTS { ... }` clause to check for a relationship with a `Party` node.
- The **parameters** should contain `{'party_0': 'fonterra'}`.
- The **output** should return a `total_count` of 1 and show the Master Supply Agreement with Fonterra.

In [47]:
print("\n--- 🧪 Test 2: Relationship-Based Filter --- ")
print("Finding all contracts involving 'Fonterra'...\n")

test_2_results = contract_search_tool.invoke({"parties": ["Fonterra"]})

print("\n--- Formatted Output ---")
display_test_results(test_2_results)


--- 🧪 Test 2: Relationship-Based Filter --- 
Finding all contracts involving 'Fonterra'...

cypher_statement ->
MATCH (c:Contract) WHERE EXISTS { MATCH (c)<-[:PARTY_TO]-(p:Party) WHERE toLower(p.name) CONTAINS $party_0 } WITH c ORDER BY c.effective_date DESC WITH collect(c) AS nodes
    RETURN {
        total_count: size(nodes),
        example_contracts: [
            el in nodes[..5] | {
                summary: el.summary,
                contract_type: el.contract_type,
                effective_date: toString(el.effective_date),
                parties: [(el)<-[:PARTY_TO]-(p:Party) | p.name]
            }
        ]
    } AS output
    


params ->
{'party_0': 'fonterra'}


******************

Output from Contract Search Tool ->

******************


--- Formatted Output ---
Total Contracts Found: 1
Example Contracts:
  1. Type: Supply
     Parties: Aperture Global Logistics, Fonterra (USA) Inc.
     Effective Date: 2019-10-31
--------------------


#### Test 3: Date Filter

This test validates our ability to filter based on date properties. We'll ask for all contracts that became effective on or after January 1, 2018.

**What to look for:**
- The **Cypher query** should contain a `WHERE c.effective_date >= date($min_effective_date)` clause.
- The **parameters** should show `{'min_effective_date': '2018-01-01'}`.
- The **output** should return a `total_count` of 2, listing the agreements with Innovate Solutions and Fonterra, as both became effective after this date.

In [48]:
print("\n--- 🧪 Test 3: Date Filter --- ")
print("Finding all contracts effective on or after Jan 1, 2018...\n")

test_3_results = contract_search_tool.invoke({"min_effective_date": "2018-01-01"})

print("\n--- Formatted Output ---")
display_test_results(test_3_results)


--- 🧪 Test 3: Date Filter --- 
Finding all contracts effective on or after Jan 1, 2018...

cypher_statement ->
MATCH (c:Contract) WHERE c.effective_date >= date($min_effective_date) WITH c ORDER BY c.effective_date DESC WITH collect(c) AS nodes
    RETURN {
        total_count: size(nodes),
        example_contracts: [
            el in nodes[..5] | {
                summary: el.summary,
                contract_type: el.contract_type,
                effective_date: toString(el.effective_date),
                parties: [(el)<-[:PARTY_TO]-(p:Party) | p.name]
            }
        ]
    } AS output
    


params ->
{'min_effective_date': '2018-01-01'}


******************

Output from Contract Search Tool ->

******************


--- Formatted Output ---
Total Contracts Found: 2
Example Contracts:
  1. Type: Service
     Parties: Aperture Global Logistics, Innovate Solutions Inc.
     Effective Date: 2019-11-01
--------------------
  2. Type: Supply
     Parties: Aperture Global Logist

#### Test 4: Hybrid Search (Metadata + Vector)

This is our most advanced test, combining a metadata filter with a semantic vector search. We will look for "Supply" contracts and then search their summaries for the *concept* of "business continuity".

**What to look for:**
- The **Cypher query** should contain both a `WHERE c.contract_type = $contract_type` clause AND a `WITH c, vector.similarity.cosine(...) AS score` clause.
- The **parameters** will include both the `contract_type` and a long `embedding` vector.
- The **output** should correctly identify the Fonterra MSA, as it's the only supply contract that discusses business continuity.
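
To make the `score > 0.7` threshold more concrete, the sketch below computes plain cosine similarity in pure Python on tiny, made-up vectors. It is an illustration only: real embeddings have hundreds of dimensions, and Neo4j's built-in function may scale its score differently from the raw cosine shown here (an assumption worth checking against the Neo4j documentation), so the cutoff should be tuned against real results.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Raw cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Hypothetical 2-dimensional "embeddings" for illustration.
query = [1.0, 0.0]
close_doc = [0.9, 0.1]   # points in nearly the same direction as the query
far_doc = [0.0, 1.0]     # orthogonal to the query

THRESHOLD = 0.7
for name, vec in [("close_doc", close_doc), ("far_doc", far_doc)]:
    score = cosine_similarity(query, vec)
    print(f"{name}: score={score:.3f}, kept={score > THRESHOLD}")
```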

In [57]:
print("\n--- 🧪 Test 4: Hybrid Search (Metadata + Vector) --- ")
print("Finding 'Supply' contracts with a summary mentioning 'business continuity'...\n")

test_4_results = contract_search_tool.invoke({
    "contract_type": "Supply",
    "summary_search": "agreements about business continuity or crisis management"
})

print("\n--- Formatted Output ---")
display_test_results(test_4_results)


--- 🧪 Test 4: Hybrid Search (Metadata + Vector) --- 
Finding 'Supply' contracts with a summary mentioning 'business continuity'...

cypher_statement ->
MATCH (c:Contract) WHERE c.contract_type = $contract_type WITH c, vector.similarity.cosine(c.embedding, $embedding) AS score WHERE score > 0.7 ORDER BY score DESC WITH collect(c) AS nodes
    RETURN {
        total_count: size(nodes),
        example_contracts: [
            el in nodes[..5] | {
                summary: el.summary,
                contract_type: el.contract_type,
                effective_date: toString(el.effective_date),
                parties: [(el)<-[:PARTY_TO]-(p:Party) | p.name]
            }
        ]
    } AS output
    


params ->
{'contract_type': 'Supply', 'embedding': [0.0025018458254635334, 0.008123455569148064, 0.003764278953894973, 0.002123128389939666, -0.0007903666701167822, -0.0017894835909828544, 0.012410693801939487, 0.04778783768415451, 0.010265043005347252, 0.019873052835464478, 0.05385565385222

## Congratulations!

You have successfully completed Module 3. We have built and, most importantly, **verified** our `contract_search_tool`. It is now a tested component that we can rely on in the next module.

With this proven tool in our arsenal, we are now ready to move on to Module 4, where we will build the LangGraph agent and bring our Intelligent Contract Analyst to life!

## Appendix


### The "Why": Graph Search vs. Vector/Lexical Search

You might be asking, "Why do we need a graph for this? Can't we just use a good vector search or a keyword (lexical) search?"

That's the key lesson of this cohort. While vector search is brilliant for finding things based on *conceptual similarity* (like finding "crisis management" when you search for "business continuity"), it fundamentally fails when a question's answer depends on **explicit, structured relationships**.

Let's analyze our test cases to see where other methods fall short and why the graph shines.

---

#### **Can Test 1, 2, or 3 be answered by Vector/Lexical Search?** -> **Not reliably.**

* **Test 1 (Finding 'Service' agreements):**
    * **Lexical Search (`grep` or simple keyword):** This *might* work if the text "Service Agreement" appears clearly. But what if the contract is just titled "Agreement" and the *type* "Service" was an extracted metadata field? A simple text search wouldn't know the document's category.
    * **Vector Search:** This is worse. A vector search for "Service agreement" would find any document that *talks about* services, even if it's a Supply contract with a small service clause. It doesn't understand the document's primary, structured `contract_type`.
    * **Why Graph Search Wins:** The graph treats `contract_type` as a structured property of the `Contract` node. The query `WHERE c.contract_type = 'Service'` is precise, unambiguous, and instant. It answers based on facts, not similarity.

* **Test 2 (Finding contracts involving 'Fonterra'):**
    * **Lexical Search:** This would find any contract that *mentions* the word "Fonterra". This could include contracts that just reference Fonterra in passing, not ones where they are an actual signatory (`Party`). It's noisy and unreliable.
    * **Vector Search:** This is poorly suited to finding a specific named entity like "Fonterra". Embeddings capture overall meaning, so an exact name match is not guaranteed to rank first, and the search still cannot tell a signatory from a passing mention.
    * **Why Graph Search Wins:** Our graph has an explicit `:PARTY_TO` relationship. The query `MATCH (c:Contract)<-[:PARTY_TO]-(p:Party) WHERE toLower(p.name) CONTAINS 'fonterra'` is asking "Show me contracts that have a structural, legal connection to the entity named Fonterra." This is a relational query that the graph answers exactly, from stored relationships rather than from text similarity.

* **Test 3 (Finding contracts effective on or after '2018-01-01'):**
    * **Lexical/Vector Search:** These methods have no concept of dates as structured, comparable data types. You could search for the text "2018", but you couldn't reliably perform a "greater than or equal to" operation. What about a contract signed in December 2017 that mentions plans for 2018? Text search would find it incorrectly.
    * **Why Graph Search Wins:** Neo4j stores `effective_date` as a native `date` object. This allows us to perform precise, logical operations like `WHERE c.effective_date >= date('2018-01-01')`. The graph understands the *meaning* and structure of the date, not just the text.
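
The date argument can be made concrete with a small sketch. The two records below are hypothetical toy data: `text` stands in for raw contract text, and `effective_date` for the structured property stored on the `Contract` node.

```python
from datetime import date

# Hypothetical toy records for illustration only.
contracts = [
    {
        "id": "A",
        "text": "Signed in December 2017; outlines delivery plans for 2018.",
        "effective_date": date(2017, 12, 15),
    },
    {
        "id": "B",
        "text": "This agreement takes effect on the first of November.",
        "effective_date": date(2019, 11, 1),
    },
]

cutoff = date(2018, 1, 1)

# Keyword search: matches any text that happens to contain "2018".
keyword_hits = [c["id"] for c in contracts if "2018" in c["text"]]

# Structured comparison: uses the typed date, like the Cypher filter does.
structured_hits = [c["id"] for c in contracts if c["effective_date"] >= cutoff]

print(f"Keyword hits:    {keyword_hits}")
print(f"Structured hits: {structured_hits}")
```

The keyword search returns the contract that merely mentions 2018 and misses the one that actually qualifies, while the typed comparison does the opposite.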

---

#### **The Power of Hybrid Search (Test 4)**

Test 4 shows why the combination is so powerful.

1.  **"Find 'Supply' contracts..."**: This first part is a precise, structural filter. **Only the graph can do this reliably.** It instantly narrows our search space from 5 documents down to just the 1 that is a `Supply` contract.
2.  **"...mentioning 'business continuity'."**: This second part is a conceptual, semantic search. **This is where vector search excels.**

A system with *only* vector search would have to search all 5 documents for "business continuity", which is inefficient and could return irrelevant clauses from other contract types.

**GraphRAG shines by using the graph first to ask "WHERE should I look?" and then using vector search to ask "WHAT am I looking for within that specific area?"** This combination of precision and conceptual understanding is what allows us to build truly intelligent and reliable AI systems.
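
The "filter first, then rank" idea can be sketched in a few lines of plain Python. The documents and similarity scores below are hypothetical stand-ins: in the real tool the filter is a Cypher `WHERE` clause and the score comes from the embedding comparison.

```python
from typing import Any, Dict, List

# Hypothetical documents with precomputed similarity scores for one query.
docs: List[Dict[str, Any]] = [
    {"id": "msa", "contract_type": "Supply", "score": 0.82},
    {"id": "svc", "contract_type": "Service", "score": 0.78},
    {"id": "nda", "contract_type": "Supply", "score": 0.41},
]

def rank_by_score(candidates: List[Dict[str, Any]], threshold: float = 0.7) -> List[str]:
    """Keep candidates above the threshold, best score first."""
    kept = [d for d in candidates if d["score"] > threshold]
    return [d["id"] for d in sorted(kept, key=lambda d: d["score"], reverse=True)]

def hybrid_search(all_docs: List[Dict[str, Any]], contract_type: str) -> List[str]:
    """Structural filter first ("where to look"), then semantic ranking."""
    candidates = [d for d in all_docs if d["contract_type"] == contract_type]
    return rank_by_score(candidates)

vector_only = rank_by_score(docs)
hybrid = hybrid_search(docs, "Supply")

print(f"Vector only: {vector_only}")
print(f"Hybrid:      {hybrid}")
```

In this toy data, vector-only ranking also returns the Service contract because its text is semantically close to the query, while the hybrid version excludes it up front.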