# GraphRAG Codebase Understanding - Interactive Demo

This notebook demonstrates the GraphRAG system for code understanding.

## What you'll learn:
1. How to index a GitHub repository
2. How to query the codebase using natural language
3. How different retrieval strategies work
4. How to evaluate system performance

## Setup

In [7]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# 1) Project root (.. from notebooks/)
PROJECT_ROOT = Path.cwd().parent  # -> d:\kaggle_project\GraphRAG
SRC_DIR = PROJECT_ROOT / "src"

# 2) Add project root to sys.path (so `src.` imports work)
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

print("PROJECT_ROOT:", PROJECT_ROOT)
print("SRC_DIR:", SRC_DIR)

# 3) Explicitly load .env from project root
env_path = PROJECT_ROOT / ".env"
print("Loading .env from:", env_path)
load_dotenv(env_path)

# 4) Debug: confirm the env vars are visible without printing their values
print("GOOGLE_API_KEY set:", bool(os.getenv("GOOGLE_API_KEY")))
print("NEO4J_PASSWORD set:", bool(os.getenv("NEO4J_PASSWORD")))

PROJECT_ROOT: d:\kaggle_project\GraphRAG
SRC_DIR: d:\kaggle_project\GraphRAG\src
Loading .env from: d:\kaggle_project\GraphRAG\.env
GOOGLE_API_KEY set: True
NEO4J_PASSWORD set: True


In [None]:
import asyncio

from src.agents.orchestrator import orchestrator
from src.agents.parser_agent import parser_agent
from src.tools.neo4j_operations import neo4j_manager
from src.utils.metrics import GraphRAGEvaluator
from src.models.entities import QueryIntent, RetrievalStrategy

print("✅ Imports successful!")


ValidationError: 1 validation error for LlmAgent
tools
  Input should be a valid list [type=list_type, input_value=<bound method ParserAgent... at 0x000002341F96BE90>>, input_type=method]
    For further information visit https://errors.pydantic.dev/2.12/v/list_type
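
The `ValidationError` above says the `tools` field of `LlmAgent` received a bound method where a list was expected. One plausible cause, assuming the agent is built with something like `tools=self.some_method`, is a missing pair of brackets around the tool. The sketch below shows a small defensive helper that normalizes the argument; `as_tool_list` and `DemoParser` are hypothetical names used only for illustration and are not part of this project or of any agent library.

```python
from typing import Any, Callable, Iterable, List, Union

Tool = Callable[..., Any]


def as_tool_list(tools: Union[Tool, Iterable[Tool], None]) -> List[Tool]:
    """Normalize a single callable or an iterable of callables into a list."""
    if tools is None:
        return []
    if callable(tools):
        # A lone function or bound method: wrap it so the result is a list.
        return [tools]
    return list(tools)


class DemoParser:
    """Hypothetical stand-in for an agent class that exposes one tool method."""

    def parse_file(self, path: str) -> str:
        return f"parsed {path}"


parser = DemoParser()
tools = as_tool_list(parser.parse_file)  # bound method -> [bound method]
print(isinstance(tools, list), len(tools))
```

If the real constructor call passes the method directly, wrapping it as `tools=[self.some_method]` should satisfy the list validation; the actual fix belongs in `src/agents/parser_agent.py`, which this notebook does not show.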

In [5]:

# import sys
# sys.path.append('../src')

# import asyncio
# from agents.orchestrator import orchestrator
# from agents.parser_agent import parser_agent
# from tools.neo4j_tools import neo4j_manager
# from utils.metrics import GraphRAGEvaluator
# from models.code_entities import QueryIntent, RetrievalStrategy

# print("✅ Imports successful!")

## Part 1: Index a Repository

In [1]:
# Choose a repository to index
REPO_URL = "https://github.com/donnemartin/data-science-ipython-notebooks"
BRANCH = "master"

print(f"📦 Indexing: {REPO_URL}")
print("This will take a few minutes...\n")

📦 Indexing: https://github.com/donnemartin/data-science-ipython-notebooks
This will take a few minutes...



In [2]:
# Clear existing data (optional)
neo4j_manager.clear_database()
print("🗑️  Database cleared")

NameError: name 'neo4j_manager' is not defined

In [None]:
# Index the repository
result = await orchestrator.index_repository(
    repo_url=REPO_URL,
    branch=BRANCH
)

print("\n✅ Indexing complete!")
print(f"Files parsed: {result['parse']['parsed_files']}")
print(f"Functions found: {result['parse']['total_functions']}")
print(f"Classes found: {result['parse']['total_classes']}")
print(f"Graph nodes: {result['graph']['nodes_created']}")
print(f"Embeddings: {result['embeddings']['embeddings_created']}")
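
To give a feel for where counts such as `total_functions` and `total_classes` could come from, here is a minimal sketch that counts definitions in a Python source string with the standard `ast` module. It is an illustration under assumptions: the project's `parser_agent` is not shown in this notebook and may parse code differently, and `count_definitions` is a hypothetical helper.

```python
import ast
from typing import Dict


def count_definitions(source: str) -> Dict[str, int]:
    """Count function and class definitions in a Python source string."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    functions = sum(
        isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes
    )
    classes = sum(isinstance(n, ast.ClassDef) for n in nodes)
    return {"total_functions": functions, "total_classes": classes}


# Made-up sample source; methods count as functions in this sketch.
sample = '''
class Loader:
    def load(self, path):
        return path

async def fetch(url):
    return url

def clean(rows):
    return [r for r in rows if r]
'''

stats = count_definitions(sample)
print(stats)
```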

## Part 2: Query the Codebase

In [None]:
# Example queries
queries = [
    "What functions handle data preprocessing?",
    "Show me functions related to machine learning models",
    "Where is pandas used in the codebase?",
    "What are the main data visualization functions?"
]

In [None]:
# Run first query
query = queries[0]

print(f"❓ Query: {query}\n")

response = await orchestrator.query(
    query,
    top_k=10,
    max_depth=2
)

print(f"Intent: {response.intent.value}")
print(f"Strategy: {response.strategy.value}")
print(f"Confidence: {response.confidence:.1%}")
print(f"Retrieved nodes: {len(response.retrieved_nodes)}\n")

print("📝 ANSWER:")
print("-" * 80)
print(response.answer)
print("-" * 80)

In [None]:
# View retrieved context
print("📚 Top Retrieved Functions:\n")

for i, node in enumerate(response.retrieved_nodes[:5], 1):
    print(f"{i}. {node.name}")
    print(f"   File: {node.file_path}")
    print(f"   Score: {node.score:.3f}")
    print(f"   Source: {node.source}")
    if node.docstring:
        print(f"   Doc: {node.docstring[:100]}...")
    print()

## Part 3: Compare Retrieval Strategies

In [None]:
# Compare different strategies
test_query = "What functions handle file I/O?"

strategies = [
    RetrievalStrategy.VECTOR_ONLY,
    RetrievalStrategy.GRAPH_ONLY,
    RetrievalStrategy.HYBRID,
    RetrievalStrategy.SEMANTIC_THEN_GRAPH
]

results = {}

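for strategy in strategies:
    print(f"\n🔍 Testing {strategy.value}...")
    
    # NOTE: orchestrator.query() is called without a strategy override here, so every
    # iteration runs the orchestrator's default strategy. A real comparison requires
    # modifying the orchestrator to accept the strategy as a parameter.
    response = await orchestrator.query(test_query, top_k=10, max_depth=2)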
    
    results[strategy.value] = {
        'nodes_retrieved': len(response.retrieved_nodes),
        'confidence': response.confidence,
        'answer_length': len(response.answer)
    }
    
    print(f"   Nodes: {results[strategy.value]['nodes_retrieved']}")
    print(f"   Confidence: {results[strategy.value]['confidence']:.1%}")

In [None]:
# Visualize comparison
import pandas as pd
import plotly.express as px

df = pd.DataFrame(results).T.reset_index()
df.columns = ['Strategy', 'Nodes Retrieved', 'Confidence', 'Answer Length']

fig = px.bar(
    df, 
    x='Strategy', 
    y='Nodes Retrieved',
    title='Nodes Retrieved by Strategy',
    color='Confidence',
    color_continuous_scale='Viridis'
)

fig.show()

## Part 4: Evaluate System Performance

In [None]:
# Create evaluator
evaluator = GraphRAGEvaluator()

# Define ground truth test cases
test_cases = [
    {
        'query': 'What functions handle data cleaning?',
        'expected_nodes': ['clean_data', 'remove_nulls', 'normalize'],
        'expected_answer': 'Data cleaning functions include...'
    },
    {
        'query': 'Where is matplotlib used?',
        'expected_nodes': ['plot_graph', 'visualize_data'],
        'expected_answer': 'Matplotlib is used for visualization...'
    }
]

print("🧪 Running evaluation...\n")

In [None]:
# Run evaluation
for test in test_cases:
    response = await orchestrator.query(test['query'], top_k=10)
    
    retrieved_node_names = [node.name for node in response.retrieved_nodes]
    
    evaluator.add_evaluation(
        query=test['query'],
        expected_answer=test['expected_answer'],
        generated_answer=response.answer,
        retrieved_nodes=retrieved_node_names,
        expected_nodes=test['expected_nodes']
    )
    
    print(f"✅ Evaluated: {test['query']}")

In [None]:
# View results
evaluator.print_report()

# Save results
evaluator.save_results('../data/evaluation/results.json')
print("\n💾 Results saved to data/evaluation/results.json")
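
The metrics that `GraphRAGEvaluator` computes are not shown in this notebook. As a rough sketch of how retrieved node names can be scored against expected ones, the snippet below computes set-based precision, recall, and F1. `retrieval_scores` is a hypothetical helper, and the retrieved list is made-up sample data rather than real output.

```python
from typing import Dict, Iterable


def retrieval_scores(retrieved: Iterable[str], expected: Iterable[str]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 for retrieved vs. expected node names."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = len(retrieved_set & expected_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = retrieval_scores(
    retrieved=["clean_data", "normalize", "plot_graph", "load_csv"],
    expected=["clean_data", "remove_nulls", "normalize"],
)
print({name: round(value, 3) for name, value in scores.items()})
```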

## Part 5: Explore the Graph

In [None]:
# Query Neo4j directly
query = """
MATCH (f:Function)
RETURN f.name as name, f.file_path as file, size(f.code) as code_size
ORDER BY code_size DESC
LIMIT 10
"""

results = neo4j_manager.execute_cypher(query)

print("📊 Largest Functions:\n")
for i, r in enumerate(results, 1):
    print(f"{i}. {r['name']} ({r['file']})")
    print(f"   Code size: {r['code_size']} characters\n")

In [None]:
# Find most-called functions
query = """
MATCH (f:Function)<-[r:CALLS]-(:Function)
WITH f, count(r) as call_count
RETURN f.name as name, f.file_path as file, call_count
ORDER BY call_count DESC
LIMIT 10
"""

results = neo4j_manager.execute_cypher(query)

print("📞 Most Called Functions:\n")
for i, r in enumerate(results, 1):
    print(f"{i}. {r['name']} ({r['file']})")
    print(f"   Called by {r['call_count']} functions\n")
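
For intuition, the aggregation in the Cypher query above (count incoming `CALLS` relationships per function, then sort descending) can be mirrored in plain Python over an edge list. The edges below are invented sample data, and `most_called` is a hypothetical helper rather than part of `neo4j_manager`.

```python
from collections import Counter
from typing import List, Tuple

# Each edge is (caller, callee), like (caller)-[:CALLS]->(callee) in the graph.
calls: List[Tuple[str, str]] = [
    ("main", "load_data"),
    ("train", "load_data"),
    ("train", "fit"),
    ("evaluate", "fit"),
    ("main", "train"),
    ("report", "load_data"),
]


def most_called(edges: List[Tuple[str, str]], limit: int = 10) -> List[Tuple[str, int]]:
    """Return (callee, incoming call count) pairs, most-called first."""
    counts = Counter(callee for _caller, callee in edges)
    return counts.most_common(limit)


top = most_called(calls)
for i, (name, call_count) in enumerate(top, 1):
    print(f"{i}. {name}: called {call_count} times")
```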

## Conclusion

You've learned:
- ✅ How to index a repository into a knowledge graph
- ✅ How to query code using natural language
- ✅ Different retrieval strategies and their trade-offs
- ✅ How to evaluate system performance
- ✅ How to explore the graph database

**Next steps:**
- Try indexing your own repository
- Experiment with different query strategies
- Customize the graph schema for your needs
- Add support for more programming languages