# Hybrid Pipeline Demo: Vector Search + Knowledge Graph

This notebook demonstrates the **two core use cases** for our Legal Knowledge Graph:

1.  **Filtering**: Using the Graph to remove "noise" from Vector Search results.
2.  **Context**: Using the Graph to find hidden connections (Context Enrichment).

We will replicate a mini-version of the pipeline here.

In [None]:
!pip install gliner networkx matplotlib tqdm -q

In [None]:
import json
import networkx as nx  # the module name is lowercase; `import NetworkX` raises ModuleNotFoundError
from gliner import GLiNER

# 1. Setup - Let's build a tiny Graph from real data for the demo
# We will use a small sample of bills to keep it fast.

print("Initializing GLiNER...")
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

labels = ["Person", "Legislator", "Committee", "Government Agency", "Bill", "Date", "Topic"]

### Step 1: Build a Mini-Graph (The Reference Knowledge)
We ingest a few chunks. Imagine this is our **Master Knowledge Graph**.
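
To make the target structure concrete before running the model, here is a minimal sketch of the same chunk-to-entity graph built from hand-written entity lists. The `hand_labeled` mapping and the entity names in it are assumptions for illustration; real GLiNER output may split or label the spans differently.

```python
import networkx as nx

# Hypothetical hand-labeled entities; real GLiNER output may differ.
hand_labeled = {
    "chunk_1": [("Senator John Smith", "Legislator"),
                ("Committee on Environment", "Committee")],
    "chunk_2": [("Committee on Environment", "Committee"),
                ("Senator Smith", "Legislator")],
    "chunk_3": [("Department of Energy", "Government Agency")],
}

sketch = nx.Graph()
for chunk_id, entities in hand_labeled.items():
    sketch.add_node(chunk_id, type="Chunk")
    for name, label in entities:
        sketch.add_node(name, type=label)
        sketch.add_edge(chunk_id, name, relation="MENTIONS")

# Entities mentioned by more than one chunk are what link documents together.
shared = sorted(
    node for node, data in sketch.nodes(data=True)
    if data["type"] != "Chunk" and sketch.degree(node) > 1
)
print(shared)
```

Under these assumptions only `Committee on Environment` is shared between chunks. `Senator John Smith` and `Senator Smith` end up as two separate nodes, because nodes are keyed by the exact extracted string and name variants are not merged.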

In [None]:
# Dummy Data (simulating content from bills.json)
# We create 3 documents. 
# Doc 1 & 2 are about 'Senator Smith' and 'Energy'.
# Doc 3 is about 'Energy' but completely unrelated to Smith (Noise).

documents = [
    {
        "id": "chunk_1",
        "text": "Senator John Smith introduced the Clean Energy Act to reduce carbon emissions. He spoke before the Committee on Environment."
    },
    {
        "id": "chunk_2",
        "text": "The Committee on Environment approved the budget proposed by Senator Smith for solar panel research."
    },
    {
        "id": "chunk_3", 
        "text": "The Department of Energy announced a new initiative for nuclear fusion. This has no relation to the Senate's recent activities."
    }
]

print("Building Graph...")
G = nx.Graph()

# Map to store chunk text for retrieval later
chunk_store = {}

for doc in documents:
    chunk_id = doc['id']
    text = doc['text']
    chunk_store[chunk_id] = text
    
    # Add Chunk Node
    G.add_node(chunk_id, type="Chunk", text=text)
    
    # Extract Entities
    entities = model.predict_entities(text, labels)
    
    for entity in entities:
        name = entity['text'].strip()
        label = entity['label']
        
        # Add Entity Node
        G.add_node(name, type=label)
        # Connect Chunk -> Entity
        G.add_edge(chunk_id, name, relation="MENTIONS")

print(f"Graph built with {G.number_of_nodes()} nodes.")
# Inspect the connections for chunk_1
print("Chunk 1 connects to:", list(G.neighbors("chunk_1")))

### Step 2: Use Case 1 - Filtering (The "Wide Net" Refiner)
**Scenario**: User searches for **"What did Senator John Smith do about Energy?"**

A typical Vector Search (searching for 'Energy') might return **Chunk 3** (Nuclear Fusion) because it scores high on 'Energy', even though Senator Smith isn't involved.

We use the Graph to filter this out.
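
The filtering rule can be sketched without the model: keep a chunk only if one of its graph neighbors matches an anchor entity from the query. The graph contents, the `filter_by_anchor` helper, and the choice of anchors below are all assumptions for illustration, not the pipeline's actual implementation.

```python
import networkx as nx

# Hypothetical mini-graph; entity names are hand-written, not GLiNER output.
g = nx.Graph()
g.add_edges_from([
    ("chunk_1", "Senator John Smith"),
    ("chunk_1", "Clean Energy Act"),
    ("chunk_2", "Senator Smith"),
    ("chunk_3", "Department of Energy"),
])

def filter_by_anchor(graph, chunk_ids, anchors):
    """Keep chunks with a neighbor that contains, or is contained in, an anchor."""
    lowered = [a.lower() for a in anchors]
    kept = []
    for chunk_id in chunk_ids:
        if chunk_id not in graph:
            continue
        neighbors = [n.lower() for n in graph.neighbors(chunk_id)]
        if any(a in n or n in a for a in lowered for n in neighbors):
            kept.append(chunk_id)
    return kept

candidates = ["chunk_1", "chunk_2", "chunk_3"]
person_only = filter_by_anchor(g, candidates, ["Senator John Smith"])
with_topic = filter_by_anchor(g, candidates, ["Senator John Smith", "Energy"])
print(person_only)
print(with_topic)
```

Under these assumptions, anchoring only on the person keeps `chunk_1`, while adding the topic "Energy" as an anchor lets `chunk_3` back in through `Department of Energy`. `chunk_2` is dropped in both cases because "Senator Smith" is not a contiguous substring of "Senator John Smith", which suggests that alias handling matters as much as the filter itself.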

In [None]:
def pipeline_search(query, vector_results_ids):
    print(f"--- User Query: '{query}' ---")
    
    # 1. Extract Key Entities from Query using GLiNER
    # We want to know WHO/WHAT matters in this query.
    query_entities = model.predict_entities(query, labels)
    key_entities = [e['text'].strip() for e in query_entities]
    print(f"Detected Query Entities: {key_entities}")
    
    if not key_entities:
        print("No specific entities found in query. Returning all vector results.")
        return vector_results_ids
        
    # 2. Filter Vector Results
    filtered_results = []
    
    for chunk_id in vector_results_ids:
        # Logic: keep the chunk IF it is connected to AT LEAST ONE key entity in the graph.
        # (A close neighbor could also count, but we require a direct connection for this demo.)
        
        is_relevant = False
        
        # Get neighbors of this chunk in the graph
        if chunk_id in G:
            chunk_neighbors = list(G.neighbors(chunk_id))
            
            # Check intersection
            full_matches = set(key_entities).intersection(set(chunk_neighbors))
            
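            # Also check partial string matches (e.g. 'Smith' matches 'Senator John Smith').
            # Caveat: substring matching is loose and case-sensitive. If GLiNER tags 'Energy'
            # as a query entity, it also matches a neighbor such as 'Department of Energy',
            # which would keep chunk_3.
            partial_matches = []
            for key_ent in key_entities:
                for neighbor in chunk_neighbors:
                    if key_ent in neighbor or neighbor in key_ent:
                        partial_matches.append(neighbor)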
                        
            if full_matches or partial_matches:
                is_relevant = True
                print(f"  [KEEP] {chunk_id}: Linked to {partial_matches or full_matches}")
            else:
                print(f"  [DROP] {chunk_id}: No connection to {key_entities}")
        else:
            print(f"  [DROP] {chunk_id}: Not in graph")
            
        if is_relevant:
            filtered_results.append(chunk_id)
            
    return filtered_results

# SIMULATION
# Vector search (mock) returns ALL 3 chunks because they all mention 'Energy' or 'Smith'
mock_vector_results = ["chunk_1", "chunk_2", "chunk_3"]

user_query = "What did Senator John Smith do about Energy?"

final_chunks = pipeline_search(user_query, mock_vector_results)
print(f"\nFinal Results: {final_chunks} (Chunk 3 should be gone)")

### Step 3: Use Case 2 - Context Enrichment (The "Deepen" Logic)
**Scenario**: The user finds a chunk about the "Clean Energy Act".

The text says "The Act was introduced." It doesn't say *who* is on the committee reviewing it, or what *other* bills Smith sponsored. The Graph knows this.

We use the Graph to inject this "Missing Context".
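
Because the demo graph only has chunk-to-entity edges, linking one entity to another takes three hops: chunk, shared entity, related chunk, co-mentioned entity. The following is a minimal sketch of that traversal on a hand-built graph; the `co_mentions` helper and the entity names are assumptions for illustration.

```python
import networkx as nx

# Hypothetical bipartite graph: chunks on one side, hand-written entities on the other.
g = nx.Graph()
g.add_nodes_from(["chunk_1", "chunk_2"], type="Chunk")
g.add_edges_from([
    ("chunk_1", "Clean Energy Act"),
    ("chunk_1", "Committee on Environment"),
    ("chunk_2", "Committee on Environment"),
    ("chunk_2", "Senator Smith"),
])

def co_mentions(graph, chunk_id):
    """Return (entity, related chunk, co-mentioned entity) triples for a chunk."""
    triples = set()
    for entity in graph.neighbors(chunk_id):
        # In a bipartite chunk-entity graph, an entity's neighbors are chunks.
        for other_chunk in graph.neighbors(entity):
            if other_chunk == chunk_id:
                continue
            for other_entity in graph.neighbors(other_chunk):
                if other_entity != entity:
                    triples.add((entity, other_chunk, other_entity))
    return sorted(triples)

facts = co_mentions(g, "chunk_1")
for entity, other_chunk, other_entity in facts:
    print(f"{entity} appears with {other_entity} in {other_chunk}")
```

Here the only fact found for `chunk_1` is that `Committee on Environment` appears alongside `Senator Smith` in `chunk_2`. `Clean Energy Act` contributes nothing because no other chunk mentions it.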

In [None]:
def get_graph_context(chunk_id):
    # 1. Identify entities in the given chunk
    if chunk_id not in G:
        return []  # same type as the normal return value (a list of facts)
    
    entities_in_chunk = list(G.neighbors(chunk_id))
    
    context_facts = []
    
    # 2. Hop to 2nd-degree neighbors
    # e.g. chunk_1 -> Committee on Environment -> chunk_2
    # In this demo graph entities only link to chunks, so the second hop surfaces
    # other documents that share an entity with this chunk.
    
    for entity in entities_in_chunk:
        # Find what else this entity is connected to (other than the chunk itself)
        related_nodes = list(G.neighbors(entity))
        for related in related_nodes:
            # Skip the chunk itself
            if related.startswith("chunk_"): 
                # Optional: If it connects to another Chunk, that's a "Related Document"!
                if related != chunk_id:
                    context_facts.append(f"Related Document found: {related}")
                continue
            
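            # Create a context sentence for a direct entity-to-entity link.
            # Note: this demo only adds chunk-to-entity edges, so this line is reached
            # only if the graph is extended with entity-to-entity relations.
            context_facts.append(f"{entity} is related to {related}")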
            
    return list(set(context_facts))

# SIMULATION
print(f"--- enriching {final_chunks[0]} ---")
context = get_graph_context(final_chunks[0]) # Enrich Chunk 1

print(f"Original Text: \n   '{chunk_store[final_chunks[0]]}'\n")
print("Graph Context Added:")
for fact in context:
    print(f" + {fact}")

## Conclusion of Demo

1.  **Filtering**: We successfully dropped `chunk_3`. Even though it matched the keyword "Energy", the Graph knew it wasn't connected to the primary query entity "Senator John Smith".
2.  **Context**: Starting from `chunk_1`, the Graph links `Senator John Smith` to `Committee on Environment` (both are mentioned in that chunk) and, through the committee, to the related document `chunk_2`. This allows the LLM to answer questions like "What other activities is Smith involved in?" using the *structural* knowledge, not just the text.
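
As a rough illustration of how the enriched context might reach the LLM, the sketch below concatenates the retrieved text and the graph facts into one prompt string. The `build_prompt` helper and the prompt layout are hypothetical and not part of the pipeline above.

```python
def build_prompt(question, chunk_text, graph_facts):
    """Combine the question, retrieved text, and graph facts into one prompt."""
    fact_lines = "\n".join(f"- {fact}" for fact in graph_facts) or "- (none)"
    return (
        f"Question: {question}\n\n"
        f"Retrieved text:\n{chunk_text}\n\n"
        f"Graph context:\n{fact_lines}\n"
    )

prompt = build_prompt(
    "What other activities is Smith involved in?",
    "Senator John Smith introduced the Clean Energy Act to reduce carbon emissions.",
    ["Related Document found: chunk_2"],
)
print(prompt)
```

Keeping the graph facts in their own labeled section lets the model distinguish quoted source text from structural hints derived from the graph.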