# 🧠 GPT now understands my repo like a senior dev – here's how

This notebook is an end-to-end reproduction and reinterpretation of the **CodeRAG** framework (see [paper](https://arxiv.org/pdf/2504.10046)), adapted to run locally on your own codebase.

We want to give GPT (or any LLM) the ability to:
- Parse and **understand your entire repo**
- Retrieve and reason over related code
- Answer questions like a senior developer would — with **zero fine-tuning**

---

## 🎯 Goal of this experiment

Our objective is to build a **code-aware assistant** using a combination of:

- 🔍 **Tree-sitter** to parse and structure the codebase
- 🧠 **LLMs** to describe each function or class (aka "requirements")
- 🕸️ **Code graphs** to represent dependencies (calls, imports, etc)
- 🧭 **Agentic reasoning** to let the LLM query and retrieve context dynamically
- ⚡ **RAG (Retrieval-Augmented Generation)** to reduce hallucinations and give smarter answers

The end result is a local-first, fully transparent, and extensible RAG pipeline tailored for your own project.

---

## 🧪 Inspired by CodeRAG (What we're replicating)

From the CodeRAG paper (April 2025), we aim to recreate the following innovations:

1. **Requirement Graph**  
   A graph where each node is a *natural language description* of a function or class. Edges represent semantic similarity or parent-child relations.

2. **DS-Code Graph**  
   A code graph that encodes structural dependencies like:
   - function calls
   - class inheritance
   - file/module containment
   - semantic similarity (via embeddings)

3. **BiGraph Mapping**  
   Links between requirements and code elements — allowing retrieval of relevant code given a high-level prompt.

4. **Agentic Reasoning**  
   An LLM-driven reasoning loop that dynamically:
   - queries the graph
   - follows dependencies
   - does web search if needed
   - formats and tests generated code

---

## 🪜 Pipeline Overview (what this notebook covers)

| Step | Description |
|------|-------------|
| ✅ 1 | Parse your local repo using Tree-sitter |
| ✅ 2 | Extract all functions and classes |
| ✅ 3 | Generate descriptions for each (via LLM) |
| ✅ 4 | Build the **Requirement Graph** |
| ✅ 5 | Build the **DS-Code Graph** |
| ✅ 6 | Link both graphs into a BiGraph |
| ✅ 7 | Implement a simple **agentic loop** using ReAct |
| ✅ 8 | Let GPT answer deep questions about your code (with full context) |

---

## 🧰 Tech Stack

| Component        | Tool                            |
|------------------|---------------------------------|
| Parsing          | `tree-sitter-language-pack`     |
| Description gen. | OpenAI / DeepSeek-V2.5          |
| Graph storage    | NetworkX + GraphML (Neo4j optional) |
| Semantic sim.    | `sentence-transformers` (HuggingFace) |
| Reasoning agent  | Custom ReAct (or LangChain)     |
| Validation       | `black`, `pytest`, `mypy`       |

---

## 🔗 References & Credits

- [CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation](https://arxiv.org/pdf/2504.10046)
- [Self-RAG (Asai et al., 2023)](https://arxiv.org/pdf/2310.11511)
- [DRAGIN: Dynamic RAG for real-time needs](https://arxiv.org/pdf/2501.13742)
- [CodeRAG benchmark (June 2024)](https://arxiv.org/pdf/2406.14497)

---

👉 Let’s get started by parsing the repo with Tree-sitter...

## 📦 Install dependencies

We’ll begin by installing the `tree-sitter-language-pack` Python library, which provides precompiled Tree-sitter grammars for popular languages — including Python.

This saves us from having to manually clone grammars or compile `.so` libraries.

In [8]:
pip install tree-sitter-language-pack

Collecting tree-sitter-language-pack
  Downloading tree_sitter_language_pack-0.8.0-cp39-abi3-macosx_10_13_universal2.whl.metadata (17 kB)
Collecting tree-sitter-c-sharp>=0.23.1 (from tree-sitter-language-pack)
  Downloading tree_sitter_c_sharp-0.23.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (2.7 kB)
Collecting tree-sitter-embedded-template>=0.23.2 (from tree-sitter-language-pack)
  Downloading tree_sitter_embedded_template-0.23.2-cp39-abi3-macosx_11_0_arm64.whl.metadata (2.2 kB)
Collecting tree-sitter-yaml>=0.7.0 (from tree-sitter-language-pack)
  Downloading tree_sitter_yaml-0.7.1-cp310-abi3-macosx_11_0_arm64.whl.metadata (1.8 kB)
Downloading tree_sitter_language_pack-0.8.0-cp39-abi3-macosx_10_13_universal2.whl (28.6 MB)
Downloading tree_sitter_c_sharp-0.23.1-cp39-abi3-macosx_11_0_arm64.whl (419 kB)
Downloading t

## 🧪 Parse a test string with Tree-sitter

Let's load the Python parser and run Tree-sitter on a simple test snippet to confirm everything is working.

We also define a small `dump()` helper function to pretty-print the AST (abstract syntax tree) node structure.

This will help us verify that Tree-sitter is correctly parsing the function and its components (name, parameters, body, etc).

In [12]:
from tree_sitter_language_pack import get_parser

parser = get_parser("python")

def dump(node, indent=0):
    """Pretty-print a node and its named children, one node per line."""
    print("  " * indent + f"{node.type}: {node.text.decode('utf-8')}")
    for child in node.named_children:
        dump(child, indent + 1)

# Parse the snippet once and walk the tree from its root node
tree = parser.parse(b"def foo(): pass")
dump(tree.root_node)

module: def foo(): pass
  function_definition: def foo(): pass
    identifier: foo
    parameters: ()
    block: pass
      pass_statement: pass


## 📂 Parse all Python files in the repo

Now that Tree-sitter is working, let’s walk through the entire local repo and extract all `function_definition` and `class_definition` nodes.

For each one, we’ll collect:

- Type (`function_definition` or `class_definition`)
- Name
- Start and end line
- File path

This structured data will help us build the code graph later.

In [16]:
from tree_sitter_language_pack import get_parser
from pathlib import Path
import json

# Setup
parser = get_parser("python")
REPO_ROOT = Path(".").resolve()

# Extract function/class nodes
def extract_code_elements(source_code: str, file_path: str):
    tree = parser.parse(bytes(source_code, "utf-8"))
    root = tree.root_node
    elements = []

    def visit(node):
        if node.type in ("function_definition", "class_definition"):
            name_node = node.child_by_field_name("name")
            name = name_node.text.decode("utf-8") if name_node else "<anonymous>"
            elements.append({
                "type": node.type,
                "name": name,
                "start_line": node.start_point[0] + 1,
                "end_line": node.end_point[0] + 1,
                "file": str(file_path)
            })
        for child in node.named_children:
            visit(child)

    visit(root)
    return elements

# Walk through repo
all_elements = []
for py_file in REPO_ROOT.rglob("*.py"):
    try:
        code = py_file.read_text(encoding="utf-8")
        extracted = extract_code_elements(code, py_file.relative_to(REPO_ROOT))
        all_elements.extend(extracted)
    except Exception as e:
        print(f"⚠️ Failed to parse {py_file}: {e}")

# Save results
with open("code_elements.json", "w") as f:
    json.dump(all_elements, f, indent=2)
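
Before moving on, it can help to sanity-check the extracted records. The sketch below is one possible check, not part of the original pipeline: `summarize` is a hypothetical helper, and `sample` is a hand-written list in the same shape as the entries of `code_elements.json` (in the notebook you would pass `all_elements` instead).

```python
from collections import Counter


def summarize(elements):
    """Count elements per type and per file, and find the longest element by line span."""
    by_type = Counter(e["type"] for e in elements)
    by_file = Counter(e["file"] for e in elements)
    longest = max(elements, key=lambda e: e["end_line"] - e["start_line"])
    return by_type, by_file, longest


# Hand-written sample with the same keys that extract_code_elements() produces.
sample = [
    {"type": "function_definition", "name": "root",
     "start_line": 47, "end_line": 51, "file": "app/main.py"},
    {"type": "function_definition", "name": "health_check",
     "start_line": 54, "end_line": 58, "file": "app/main.py"},
    {"type": "class_definition", "name": "Database",
     "start_line": 14, "end_line": 1438, "file": "app/core/database.py"},
]

by_type, by_file, longest = summarize(sample)
print(dict(by_type))
print(dict(by_file))
print("longest element:", longest["name"])
```

A very long element (such as a class spanning most of a file) is worth knowing about early, because its whole body may later be sent to the LLM as context.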


## 🧾 Requirement Descriptions (Docstrings)

In this project, we assume that each function and class in the codebase already includes a properly written **docstring** that describes its purpose, inputs, and outputs.

This serves as the "requirement" we need for building the **Requirement Graph** in the next step.

> ⚠️ If the codebase lacks docstrings or uses inconsistent formatting, you would need to generate these descriptions using a language model (e.g. GPT-4 or DeepSeek) based on the raw source code.

Since our current dataset is clean and well-documented, we’ll skip this step and move directly to extracting docstrings from the parsed code.

### 🔄 Extract docstrings with Python `ast`

Tree-sitter gives us the name and line span of every definition.  
For docstrings, the built-in `ast` module is simpler: `ast.get_docstring()` returns the cleaned docstring directly.  
This cell therefore parses each file twice:

* Tree-sitter → names and start/end lines (as before)  
* `ast` → exact docstring for every function/class  

The result is `code_elements_with_docstrings.json`.

In [27]:
import ast, json, pandas as pd
from tree_sitter_language_pack import get_parser
from pathlib import Path

parser = get_parser("python")
ROOT   = Path(".").resolve()
out    = []

def ts_name_and_span(src: str):
    """Return a list of (name, start_line, end_line) tuples for every function/class via Tree-sitter."""
    tree = parser.parse(src.encode())
    root = tree.root_node
    res  = []

    def walk(node):
        if node.type in ("function_definition", "class_definition"):
            name = node.child_by_field_name("name").text.decode()
            res.append((name, node.start_point[0]+1, node.end_point[0]+1))
        for c in node.named_children:
            walk(c)
    walk(root)
    return res

for py in ROOT.rglob("*.py"):
    code = py.read_text(encoding="utf-8")
    # 1️⃣ positions with tree-sitter
    spans = ts_name_and_span(code)
    # 2️⃣ docstrings with ast
    module = ast.parse(code, filename=str(py))
    for node in ast.walk(module):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node) or ""
            name = node.name
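            # match the span on (name, start line): names such as __init__ can repeat
            # within a file, so the name alone is ambiguous; fall back to ast's own span
            start, end = next(
                ((s, e) for n, s, e in spans if n == name and s == node.lineno),
                (node.lineno, node.end_lineno),
            )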
            out.append({
                "type": "class_definition" if isinstance(node, ast.ClassDef) else "function_definition",
                "name": name,
                "docstring": doc,
                "start_line": start,
                "end_line": end,
                "file": str(py.relative_to(ROOT))
            })

with open("code_elements_with_docstrings.json", "w") as f:
    json.dump(out, f, indent=2)
print("✅ saved", len(out), "elements with docstrings")
pd.DataFrame(out).head()

✅ saved 102 elements with docstrings


|   | type | name | docstring | start_line | end_line | file |
|---|------|------|-----------|------------|----------|------|
| 0 | function_definition | root | Root endpoint to verify the API is running. | 47 | 51 | app/main.py |
| 1 | function_definition | health_check | Health check endpoint to verify the API is run... | 54 | 58 | app/main.py |
| 2 | function_definition | load_config | Load configuration from environment variables.... | 13 | 45 | app/core/config.py |
| 3 | class_definition | Database |  | 14 | 1438 | app/core/database.py |
| 4 | function_definition | __init__ | Initialize the database connection with the gi... | 15 | 32 | app/core/database.py |
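
To see what the `ast` side contributes, here is a minimal standalone sketch on an inline snippet. `SOURCE` and `docstring_records` are made up for illustration. It also shows that method names such as `__init__` can repeat within one file, which is why the start line is a useful second key when pairing records. `end_lineno` assumes Python 3.8 or newer.

```python
import ast

# Hypothetical two-class module; both classes define __init__.
SOURCE = '''
class A:
    def __init__(self):
        """Init A."""

class B:
    def __init__(self):
        """Init B."""
'''


def docstring_records(source: str):
    """Return name, docstring and line span for every function and class in `source`."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            records.append({
                "name": node.name,
                "docstring": ast.get_docstring(node) or "",
                "start_line": node.lineno,      # line of the def/class keyword
                "end_line": node.end_lineno,    # last line of the body
            })
    return records


records = sorted(docstring_records(SOURCE), key=lambda r: r["start_line"])
for r in records:
    print(r["start_line"], r["name"], repr(r["docstring"]))
```

The two `__init__` records share a name but differ in `start_line`, so `(name, start_line)` identifies each one unambiguously.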


## 📥 Load elements (docstrings included)

This cell reads **`code_elements_with_docstrings.json`** – the file we just generated –  
and loads it into a Pandas DataFrame for a quick visual check.

`elements` → Python list of dicts  
`df.head()` → first few rows so we can confirm each entry now has a `docstring`.

In [28]:
import json, pandas as pd, networkx as nx
from pathlib import Path

with open("code_elements_with_docstrings.json") as f:
    elements = json.load(f)

df = pd.DataFrame(elements)
df.head()

|   | type | name | docstring | start_line | end_line | file |
|---|------|------|-----------|------------|----------|------|
| 0 | function_definition | root | Root endpoint to verify the API is running. | 47 | 51 | app/main.py |
| 1 | function_definition | health_check | Health check endpoint to verify the API is run... | 54 | 58 | app/main.py |
| 2 | function_definition | load_config | Load configuration from environment variables.... | 13 | 45 | app/core/config.py |
| 3 | class_definition | Database |  | 14 | 1438 | app/core/database.py |
| 4 | function_definition | __init__ | Initialize the database connection with the gi... | 15 | 32 | app/core/database.py |


## 🧾 Requirement Graph – build *similar_to* edges

In this step we turn the list of elements into a **Requirement Graph (RG)**
where each node is a docstring and edges connect semantically-similar nodes.

What happens inside the code block:

1. **Install** the embedding + math libs  
   `sentence_transformers` → text-to-vector model  
   `scikit-learn` → `cosine_similarity` helper
2. **Embed** every docstring using the compact model  
   (`all-MiniLM-L6-v2`, 384-dim).
3. **Create** a NetworkX graph  
   *Node ID*: `R0`, `R1`, … – stores the original metadata.
4. **Compute pairwise cosine similarity**  
   If two requirements score **≥ 0.80** we add  
   `RG.add_edge(Ri, Rj, kind="similar_to", weight=score)`.
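5. Finally we print a summary (`RG.number_of_nodes()` and `RG.number_of_edges()`) to check
   how many nodes and *similar_to* edges were created.

> **Threshold 0.80** is empirical – adjust higher for stricter similarity,
> lower for looser matching.
> Elements with an empty docstring all embed to the same vector and therefore score 1.0
> against each other, so some *similar_to* edges may only reflect missing documentation.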

In [29]:
pip install -q sentence_transformers scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [31]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [e["docstring"] or "" for e in elements]
emb = model.encode(texts, normalize_embeddings=True)

RG = nx.Graph()

# add nodes
for idx, e in enumerate(elements):
    RG.add_node(f"R{idx}", **e)

# similarity edges
cos_matrix = cosine_similarity(emb)
THRESH = 0.8
for i in range(len(elements)):
    for j in range(i + 1, len(elements)):
        if cos_matrix[i, j] >= THRESH:
            RG.add_edge(f"R{i}", f"R{j}", kind="similar_to", weight=float(cos_matrix[i, j]))


In [32]:
print(f"RG: {RG.number_of_nodes():,} nodes  |  {RG.number_of_edges():,} similar_to edges")

RG: 102 nodes  |  127 similar_to edges


In [34]:
print("Nodes:", RG.number_of_nodes())
print("Edges:", RG.number_of_edges())
print("Node examples:", list(RG.nodes(data=True))[:3])

Nodes: 102
Edges: 127
Node examples: [('R0', {'type': 'function_definition', 'name': 'root', 'docstring': 'Root endpoint to verify the API is running.', 'start_line': 47, 'end_line': 51, 'file': 'app/main.py'}), ('R1', {'type': 'function_definition', 'name': 'health_check', 'docstring': 'Health check endpoint to verify the API is running.', 'start_line': 54, 'end_line': 58, 'file': 'app/main.py'}), ('R2', {'type': 'function_definition', 'name': 'load_config', 'docstring': 'Load configuration from environment variables.\n\nReturns:\n    A dictionary containing configuration settings', 'start_line': 13, 'end_line': 45, 'file': 'app/core/config.py'})]
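
The thresholding step is easy to check by hand on a toy case. The sketch below uses plain Python and three made-up 3-dimensional vectors in place of real 384-dimensional embeddings. The vector values are assumptions chosen for illustration, not model output.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_pairs(vectors, threshold=0.8):
    """Return (name_a, name_b, score) for every pair scoring at or above the threshold."""
    pairs = []
    for a, b in combinations(vectors, 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Made-up vectors: the first two point in nearly the same direction.
toy = {
    "root": [1.0, 0.0, 0.2],
    "health_check": [0.9, 0.1, 0.3],
    "load_config": [0.0, 1.0, 0.1],
}

pairs = similar_pairs(toy)
for a, b, score in pairs:
    print(f"{a} ~ {b}: {score:.3f}")
```

Only the `root` / `health_check` pair clears the 0.80 threshold here; raising the threshold toward 1.0 would eventually drop it too.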


### 2.2 Add *parent_child* edges (calls)

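We already linked semantically similar requirements.  
Now we connect each function's requirement to the requirements of the functions it **calls**:

* Parse every `.py` file with `ast` to extract func → func calls (callee names are resolved within the same file only).  
* When function **A** calls **B**, add `RG.add_edge(RA, RB, kind="parent_child")`.

Because `RG` is an undirected `nx.Graph`, a *parent_child* edge records that two requirements are linked by a call, not which one is the caller. If the pair already had a *similar_to* edge, its `kind` is overwritten.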

In [38]:
import ast, collections, itertools

# 1. lookup (file, name)  ->  requirement ID
name_to_rid = {
    (data["file"], data["name"]): rid
    for rid, data in RG.nodes(data=True)
}

def deepest_attr(node: ast.Attribute) -> str:
    """Return the last attribute name of a dotted call: pkg.mod.func -> func"""
    # The outermost Attribute node already holds the final segment;
    # walking down node.value would end on the first attribute ('mod') instead.
    return node.attr  # 'func'

for py in ROOT.rglob("*.py"):
    src   = py.read_text(encoding="utf-8")
    mod   = ast.parse(src, filename=str(py))
    file_ = str(py.relative_to(ROOT))

    # all function defs in this file
    defs = {n.name: n for n in ast.walk(mod)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))}

    for def_name, fn in defs.items():
        rid_caller = name_to_rid.get((file_, def_name))
        if not rid_caller:
            continue

        for call in ast.walk(fn):
            if not isinstance(call, ast.Call):
                continue

            callee_name = None
            # foo()
            if isinstance(call.func, ast.Name):
                callee_name = call.func.id
            # obj.foo()  /  pkg.mod.bar()
            elif isinstance(call.func, ast.Attribute):
                callee_name = deepest_attr(call.func)

            if callee_name:
                rid_callee = name_to_rid.get((file_, callee_name))
                if rid_callee:
                    RG.add_edge(rid_caller, rid_callee, kind="parent_child")

print(f"Parent_child edges added. Graph now has {RG.number_of_edges()} edges.")

Parent_child edges added. Graph now has 184 edges.
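
The call-extraction idea can be tried in isolation on a small snippet. In this sketch `SOURCE`, `call_name` and `calls_by_function` are illustrative names rather than code from the notebook. For a dotted call such as `db.session.commit()` it keeps the final segment (`commit`), which is the name a same-file lookup would need.

```python
import ast

# Hypothetical module with one plain call and one dotted call.
SOURCE = '''
def _get_connection():
    return object()

def save_thought(db):
    conn = _get_connection()
    db.session.commit()
    return conn
'''


def call_name(call: ast.Call):
    """Return the called name: foo() -> 'foo', a.b.foo() -> 'foo', anything else -> None."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr  # the outermost Attribute holds the final segment
    return None


def calls_by_function(source: str):
    """Map each function name to the sorted set of names it calls."""
    result = {}
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = (call_name(n) for n in ast.walk(fn) if isinstance(n, ast.Call))
            result[fn.name] = sorted({n for n in names if n})
    return result


calls = calls_by_function(SOURCE)
for caller, callees in calls.items():
    print(caller, "->", callees)
```

Names that are not defined in the repo (here `object` and `commit`) simply fail the `(file, name)` lookup later, so they never become graph edges.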


In [39]:
parents = [
    u for u, v, kind in RG.edges(data="kind")
    if kind == "parent_child"
][:10]

for rid in parents:
    data = RG.nodes[rid]
    print(f"{rid}  {data['file']}:{data['name']}")

R4  app/core/database.py:__init__
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection
R5  app/core/database.py:_get_connection


In [40]:
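# RG is undirected, so this lists every function linked to R5 by a call edge
# (callers as well as callees); the "calls" label below is therefore approximate.
for u, v, k in RG.edges(data="kind"):
    if u == 'R5' and k == 'parent_child':   # R5 is just an example; use any RID you like
        print("get_connection  ➜ calls  ➜", RG.nodes[v]['name'])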

get_connection  ➜ calls  ➜ _create_tables
get_connection  ➜ calls  ➜ save_thought
get_connection  ➜ calls  ➜ get_thought
get_connection  ➜ calls  ➜ get_thoughts
get_connection  ➜ calls  ➜ create_procedure
get_connection  ➜ calls  ➜ get_procedures
get_connection  ➜ calls  ➜ get_procedure
get_connection  ➜ calls  ➜ add_procedure_step
get_connection  ➜ calls  ➜ add_procedure_steps
get_connection  ➜ calls  ➜ delete_thought
get_connection  ➜ calls  ➜ delete_procedure
get_connection  ➜ calls  ➜ update_thought
get_connection  ➜ calls  ➜ update_procedure
get_connection  ➜ calls  ➜ update_procedure_step
get_connection  ➜ calls  ➜ create_technical_decision
get_connection  ➜ calls  ➜ get_technical_decisions
get_connection  ➜ calls  ➜ get_technical_decision
get_connection  ➜ calls  ➜ update_technical_decision
get_connection  ➜ calls  ➜ delete_technical_decision
get_connection  ➜ calls  ➜ create_experience
get_connection  ➜ calls  ➜ get_experiences
get_connection  ➜ calls  ➜ get_experience
get_conn

### 👀 Quick graph visualisation

The cell below renders the Requirement Graph with PyVis and writes an interactive
`requirement_graph.html` file you can open in your browser.  
*parent_child* edges are drawn in green and *similar_to* edges in purple.

In [43]:
!pip install -q pyvis

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [44]:
from pyvis.network import Network

net = Network(height="750px", width="100%", notebook=True, directed=False)
net.toggle_physics(True)

for n, data in RG.nodes(data=True):
    net.add_node(n, label=data["name"], title=data["docstring"][:200])

for u, v, k in RG.edges(data="kind"):
    color = "#2ca02c" if k == "parent_child" else "#9467bd"
    net.add_edge(u, v, color=color)

net.show("requirement_graph.html")

requirement_graph.html


## 3️⃣ DS-Code Graph (CG) – structure of the real code

* **Nodes**
  * **File / module**        → id = file path
  * **Function / Class**     → id = `C0`, `C1`, …

* **Edges**
  | Kind          | Added when …                                          |
  |---------------|-------------------------------------------------------|
  | `contain`     | file → func / class lives inside that file            |
  | `call`        | function A calls function B (same file for now)       |
  | `inherit`     | class A inherits from class B                         |
  | `import`      | file A imports module/file B                          |
  | `similar_to`  | _(optional)_ cosine ≥ 0.80 between **code bodies**    |

The graph is a `networkx.DiGraph` so edge direction matters  
(`caller → callee`, `file → function`, etc.).

In [45]:
import ast, networkx as nx
from pathlib import Path
from collections import defaultdict
import json

ROOT = Path(".").resolve()

# ---------- load elements ----------
with open("code_elements_with_docstrings.json") as f:
    elements = json.load(f)

# helper: (file, name) -> code-node id
cid_map = {}
CG = nx.DiGraph()

# ---------- add code nodes ----------
for idx, el in enumerate(elements):
    cid = f"C{idx}"
    cid_map[(el["file"], el["name"])] = cid
    CG.add_node(cid, **el)

# ---------- add file nodes ----------
for el in elements:
    CG.add_node(el["file"], type="module")

# ---------- contain edges ----------
for idx, el in enumerate(elements):
    CG.add_edge(el["file"], f"C{idx}", kind="contain")

# ---------- scan every file with ast ----------
for py in ROOT.rglob("*.py"):
    file_id = str(py.relative_to(ROOT))
    src     = py.read_text(encoding="utf-8")
    tree    = ast.parse(src, filename=file_id)

    # --- import edges (file -> imported module/file) ---
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for n in node.names:
                CG.add_edge(file_id, n.name, kind="import")  # crude, module string
        elif isinstance(node, ast.ImportFrom):
            mod = node.module or ""
            CG.add_edge(file_id, mod, kind="import")

    # --- call + inherit edges inside this file ---
    defs = {n.name: n for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}

    for def_name, obj in defs.items():
        caller_cid = cid_map.get((file_id, def_name))
        if not caller_cid:
            continue

        # inherit (classes only)
        if isinstance(obj, ast.ClassDef):
            for base in obj.bases:
                if isinstance(base, ast.Name):
                    parent_cid = cid_map.get((file_id, base.id))
                    if parent_cid:
                        CG.add_edge(caller_cid, parent_cid, kind="inherit")

        # call edges
        for call in ast.walk(obj):
            if isinstance(call, ast.Call):
                # simple cases: foo(), obj.foo()
                target_name = None
                if isinstance(call.func, ast.Name):
                    target_name = call.func.id
                elif isinstance(call.func, ast.Attribute):
                    target_name = call.func.attr
                if target_name:
                    callee_cid = cid_map.get((file_id, target_name))
                    if callee_cid:
                        CG.add_edge(caller_cid, callee_cid, kind="call")

print(f"CG built: {CG.number_of_nodes()} nodes  |  {CG.number_of_edges()} edges")

# preview a few edges
list(CG.edges(data="kind"))[:10]

CG built: 149 nodes  |  257 edges


[('C3', 'C6', 'call'),
 ('C3', 'C5', 'call'),
 ('C3', 'C8', 'call'),
 ('C3', 'C12', 'call'),
 ('C3', 'C22', 'call'),
 ('C3', 'C27', 'call'),
 ('C4', 'C6', 'call'),
 ('C6', 'C5', 'call'),
 ('C7', 'C5', 'call'),
 ('C8', 'C5', 'call')]
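
The `contain`, `import` and `inherit` edge kinds can be illustrated on a tiny module without building the full graph. This is a simplified sketch: `SOURCE`, `structural_edges` and the file id `app/core/database.py` are used only as an example, and nodes are identified by plain names instead of `C*` ids.

```python
import ast

# Hypothetical module with two imports and one subclass.
SOURCE = '''
import os
from app.core import config

class Base:
    pass

class Database(Base):
    pass
'''


def structural_edges(source: str, file_id: str):
    """Return (source, target, kind) triples for import, contain and inherit relations."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a, b -> one edge per imported module
            edges.extend((file_id, alias.name, "import") for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # from pkg import x -> edge to the package string (empty for "from . import x")
            edges.append((file_id, node.module or "", "import"))
        elif isinstance(node, ast.ClassDef):
            edges.append((file_id, node.name, "contain"))
            # only simple base names (class A(B)); dotted bases like pkg.B are skipped
            edges.extend((node.name, base.id, "inherit")
                         for base in node.bases if isinstance(base, ast.Name))
    return edges


edges = structural_edges(SOURCE, "app/core/database.py")
for edge in edges:
    print(edge)
```

Import targets are raw module strings, so each distinct string becomes its own graph node; that is one reason a code graph can end up with more nodes than there are extracted elements.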

### 🔎 3.1 Quick sanity-checks for the Code Graph
We’ll print basic stats and inspect a few edges per kind.

In [46]:
from collections import Counter

print("Nodes:", CG.number_of_nodes())
print("Edges:", CG.number_of_edges())

# distribution by edge kind
edge_kinds = Counter(k for _,_,k in CG.edges(data="kind"))
print("Edge types:", edge_kinds)

# example: who calls whom? (file/module nodes have no 'name', so fall back to the node id)
for u, v, k in list(CG.edges(data="kind"))[:10]:
    print(f"{CG.nodes[u].get('name', u)}  --{k}-->  {CG.nodes[v].get('name', v)}")

Nodes: 149
Edges: 257
Edge types: Counter({'contain': 102, 'import': 79, 'call': 67, 'inherit': 9})
Database  --call-->  _create_tables
Database  --call-->  _get_connection
Database  --call-->  get_thought
Database  --call-->  get_procedure
Database  --call-->  get_technical_decision
Database  --call-->  get_experience
__init__  --call-->  _create_tables
_create_tables  --call-->  _get_connection
save_thought  --call-->  _get_connection
get_thought  --call-->  _get_connection


### 4️⃣  ID map  (Requirement  ↔  Code)  –  the Bigraph glue
We now link each requirement node `R*` to its corresponding code node `C*`.

In [48]:
# 1-to-1 map (the indices match by construction)
id_map = {f"R{i}": f"C{i}" for i in range(len(elements))}

# store the link as a cross-reference attribute on both graphs
for rid, cid in id_map.items():
    RG.nodes[rid]["code_id"] = cid
    CG.nodes[cid]["req_id"]  = rid
print("Bigraph mapping added:", len(id_map), "links")

Bigraph mapping added: 102 links
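
One way to picture how this mapping gets used at retrieval time is the sketch below. It works on plain dictionaries instead of the NetworkX graphs, and the function `retrieve`, the similarity weight and the call edge are all hypothetical. It is a simplified reading of the bigraph idea, not the retrieval procedure from the paper.

```python
def retrieve(query_rid, similar, id_map, calls, top_k=3):
    """Requirement id -> ordered list of code ids to hand to the LLM as context."""
    # 1. rank the query's similar requirements by weight, strongest first
    ranked = sorted(similar.get(query_rid, {}).items(), key=lambda kv: kv[1], reverse=True)
    req_ids = [query_rid] + [rid for rid, _ in ranked[:top_k]]

    # 2. cross from the requirement side to the code side
    code_ids = [id_map[rid] for rid in req_ids]

    # 3. expand one hop along call edges so callees come along too
    for cid in list(code_ids):
        for callee in calls.get(cid, []):
            if callee not in code_ids:
                code_ids.append(callee)
    return code_ids


# Hypothetical toy data in the notebook's id scheme.
similar = {"R0": {"R1": 0.93}}                  # similar_to weights
id_map = {"R0": "C0", "R1": "C1", "R2": "C2"}   # requirement -> code
calls = {"C1": ["C2"]}                          # caller -> callees

context = retrieve("R0", similar, id_map, calls)
print(context)
```

Here the query requirement `R0` pulls in its similar requirement `R1`, both map to code nodes, and `C2` is added because `C1` calls it.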


### 💾 Persist graphs to disk

We’ll save both graphs in **GraphML** format so they can be:

* Reloaded later in NetworkX without rebuilding.
* Imported into Neo4j (for example with the APOC procedure `apoc.import.graphml`; `neo4j-admin import` expects CSV files) or opened in visual tools like Gephi.

Files created:

* `requirement_graph.graphml`
* `code_graph.graphml`

In [49]:
nx.write_graphml(RG, "requirement_graph.graphml")
nx.write_graphml(CG, "code_graph.graphml")

### 🔍 Mini-demo — “What does X call and who is similar?”

Give a function/class name, and we’ll:

1. Locate its requirement node in **RG**  
2. Display its docstring  
3. Show *similar* requirements (semantic)  
4. Show *children it calls* (parent_child edges)  
5. Print the source code for context

This proves the graphs are usable right away.

In [53]:
from pathlib import Path
import textwrap

def show_info(func_name: str, file_hint: str = None):
    """
    Quick graph-based inspection tool.

    Args
    ----
    func_name : name of the function/class to inspect
    file_hint : optional relative file path if several defs share the same name
    """
    # 1️⃣ find node id
    candidates = [
        rid for rid, data in RG.nodes(data=True)
        if data["name"] == func_name
           and (file_hint is None or data["file"] == file_hint)
    ]
    if not candidates:
        print("❌ Not found")
        return
    rid = candidates[0]
    data = RG.nodes[rid]
    print(f"### {data['name']}  ({data['file']}:{data['start_line']}-{data['end_line']})\n")
    print(textwrap.indent(data["docstring"] or "*(no docstring)*", "  "))

    # 2️⃣ similar requirements
    similars = [
        v for u, v, kind in RG.edges(rid, data="kind")
        if kind == "similar_to"
    ][:5]
    print("\n— Similar requirements:")
    for v in similars:
        print(" •", RG.nodes[v]["name"])

    # 3️⃣ child calls
    childs = [
        v for u, v, kind in RG.edges(rid, data="kind")
        if kind == "parent_child"
    ][:10]
    print("\n— Direct calls (parent_child):")
    for v in childs:
        print(" •", RG.nodes[v]["name"])

    # 4️⃣ show code
    code_path = Path(data["file"])
    code_lines = code_path.read_text(encoding="utf-8").splitlines()
    snippet = code_lines[data["start_line"]-1 : data["end_line"]]
    print("\n— Source code:")
    print(textwrap.indent("\n".join(snippet), "    "))


In [56]:
show_info("update_technical_decision")

### update_technical_decision  (app/core/database.py:1014-1099)

  Update a technical decision in the database.

  Args:
      decision_id: The ID of the technical decision to update
      data: Dictionary containing the fields to update
    
  Returns:
      The updated technical decision data, or None if not found or error

— Similar requirements:
 • update_technical_decision

— Direct calls (parent_child):
 • _get_connection
 • get_technical_decision

— Source code:
        def update_technical_decision(self, decision_id: int, data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
            """
            Update a technical decision in the database.
        
            Args:
                decision_id: The ID of the technical decision to update
                data: Dictionary containing the fields to update
            
            Returns:
                The updated technical decision data, or None if not found or error
            """
            try:
                conn = sel