# 🧠 GPT now understands my repo like a senior dev – here's how

This notebook is an end-to-end reproduction and reinterpretation of the **CodeRAG** framework (see [paper](https://arxiv.org/pdf/2504.10046)), adapted to run locally on your own codebase.

We want to give GPT (or any LLM) the ability to:
- Parse and **understand your entire repo**
- Retrieve and reason over related code
- Answer questions like a senior developer would — with **zero fine-tuning**

---

## 🎯 Goal of this experiment

Our objective is to build a **code-aware assistant** using a combination of:

- 🔍 **Tree-sitter** to parse and structure the codebase
- 🧠 **LLMs** to describe each function or class (aka "requirements")
- 🕸️ **Code graphs** to represent dependencies (calls, imports, etc)
- 🧭 **Agentic reasoning** to let the LLM query and retrieve context dynamically
- ⚡ **RAG (Retrieval-Augmented Generation)** to ground answers in retrieved code and reduce hallucinations

The end result is a local-first, fully transparent, and extensible RAG pipeline tailored for your own project.

---

## 🧪 Inspired by CodeRAG (What we're replicating)

From the CodeRAG paper (April 2025), we aim to recreate the following innovations:

1. **Requirement Graph**  
   A graph where each node is a *natural language description* of a function or class. Edges represent semantic similarity or parent-child relations.

2. **DS-Code Graph**  
   A code graph that encodes structural dependencies like:
   - function calls
   - class inheritance
   - file/module containment
   - semantic similarity (via embeddings)

3. **BiGraph Mapping**  
   Links between requirements and code elements — allowing retrieval of relevant code given a high-level prompt.

4. **Agentic Reasoning**  
   An LLM-driven reasoning loop that dynamically:
   - queries the graph
   - follows dependencies
   - does web search if needed
   - formats and tests generated code
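
To make the three graph structures above concrete, here is a minimal in-memory sketch. It is not the paper's implementation and not the Neo4j schema used later: the `BiGraph` class, its field names, the `req:`/`code:` id prefixes, and the sample functions are all hypothetical, chosen only to show how requirement nodes, code nodes, and the mapping between them could fit together.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BiGraph:
    """Toy bigraph: requirement nodes, code nodes, and links between them (hypothetical layout)."""
    requirements: dict = field(default_factory=dict)  # req_id -> natural-language description
    code: dict = field(default_factory=dict)          # code_id -> source snippet
    req_edges: dict = field(default_factory=lambda: defaultdict(set))   # req_id -> {(relation, req_id)}
    code_edges: dict = field(default_factory=lambda: defaultdict(set))  # code_id -> {(relation, code_id)}
    mapping: dict = field(default_factory=dict)       # req_id -> code_id

    def add_element(self, name, description, source):
        """Register one function/class as a linked requirement node and code node."""
        req_id, code_id = f"req:{name}", f"code:{name}"
        self.requirements[req_id] = description
        self.code[code_id] = source
        self.mapping[req_id] = code_id
        return req_id, code_id

    def supportive_code(self, req_id):
        """Return the mapped code node followed by its one-hop structural neighbours."""
        start = self.mapping[req_id]
        neighbours = sorted(dst for _, dst in self.code_edges[start])
        return [start] + neighbours


# Usage with two made-up functions.
g = BiGraph()
g.add_element("load_config", "Read settings from a YAML file.", "def load_config(path): ...")
g.add_element("main", "Entry point that loads config and runs the app.", "def main(): ...")
g.code_edges["code:main"].add(("calls", "code:load_config"))
g.req_edges["req:main"].add(("similar_to", "req:load_config"))
print(g.supportive_code("req:main"))
```

Retrieval here follows only one hop of structural edges; a fuller version would presumably also walk `req_edges` to find related requirements before mapping them to code.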

---

## 🪜 Pipeline Overview (what this notebook covers)

| Step | Description |
|------|-------------|
| ✅ 1 | Parse your local repo using Tree-sitter |
| ✅ 2 | Extract all functions and classes |
| ✅ 3 | Generate descriptions for each (via LLM) |
| ✅ 4 | Build the **Requirement Graph** |
| ✅ 5 | Build the **DS-Code Graph** |
| ✅ 6 | Link both graphs into a BiGraph |
| ✅ 7 | Implement a simple **agentic loop** using ReAct |
| ✅ 8 | Let GPT answer deep questions about your code (with full context) |
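
As a rough illustration of step 7, a ReAct-style loop alternates between an LLM reply (an action or a final answer) and a tool observation. The sketch below is an assumption about how such a loop could look, not the notebook's final agent: `react_loop`, the `Action: tool[arg]` / `Answer:` text protocol, the `lookup` tool, and the scripted stand-in for the LLM are all hypothetical.

```python
def react_loop(question, llm, tools, max_steps=5):
    """Alternate LLM replies and tool observations until the LLM emits an answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        reply = llm("\n".join(transcript))
        transcript.append(reply)
        if reply.startswith("Answer:"):
            return reply[len("Answer:"):].strip(), transcript
        if reply.startswith("Action:"):
            # Assumed action shape: "Action: tool_name[argument]"
            call = reply[len("Action:"):].strip()
            name, _, arg = call.partition("[")
            tool = tools.get(name.strip())
            observation = tool(arg.rstrip("]")) if tool else f"unknown tool {name!r}"
            transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer


def scripted_llm(replies):
    """Stand-in for a real model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


# Usage: one graph lookup, then a final answer.
tools = {"lookup": lambda name: {"main": "calls load_config"}.get(name, "not found")}
llm = scripted_llm(["Action: lookup[main]", "Answer: main calls load_config"])
answer, transcript = react_loop("What does main call?", llm, tools)
print(answer)
```

Swapping `scripted_llm` for a real chat-completion call and `lookup` for a graph query would keep the control flow unchanged, which is the main reason to keep the loop this small.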

---

## 🧰 Tech Stack

| Component        | Tool                            |
|------------------|---------------------------------|
| Parsing          | `tree-sitter-language-pack`     |
| Description gen. | OpenAI / DeepSeek-V2.5          |
| Graph storage    | Neo4j                           |
| Semantic sim.    | HuggingFace Transformers        |
| Reasoning agent  | Custom ReAct (or LangChain)     |
| Validation       | `black`, `pytest`, `mypy`       |

---

## 🔗 References & Credits

- [CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation](https://arxiv.org/pdf/2504.10046)
- [Self-RAG (Asai et al., 2023)](https://arxiv.org/pdf/2307.05068)
- [DRAGIN: Dynamic RAG for real-time needs](https://arxiv.org/pdf/2501.13742)
- [CodeRAG benchmark (June 2024)](https://arxiv.org/pdf/2406.14497)

---

👉 Let’s get started by parsing the repo with Tree-sitter...

## 📦 Install dependencies

We’ll begin by installing the `tree-sitter-language-pack` Python library, which provides precompiled Tree-sitter grammars for popular languages — including Python.

This saves us from having to manually clone grammars or compile `.so` libraries.

In [8]:
%pip install tree-sitter-language-pack

Collecting tree-sitter-language-pack
  Downloading tree_sitter_language_pack-0.8.0-cp39-abi3-macosx_10_13_universal2.whl.metadata (17 kB)
Collecting tree-sitter-c-sharp>=0.23.1 (from tree-sitter-language-pack)
  Downloading tree_sitter_c_sharp-0.23.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (2.7 kB)
Collecting tree-sitter-embedded-template>=0.23.2 (from tree-sitter-language-pack)
  Downloading tree_sitter_embedded_template-0.23.2-cp39-abi3-macosx_11_0_arm64.whl.metadata (2.2 kB)
Collecting tree-sitter-yaml>=0.7.0 (from tree-sitter-language-pack)
  Downloading tree_sitter_yaml-0.7.1-cp310-abi3-macosx_11_0_arm64.whl.metadata (1.8 kB)
Downloading tree_sitter_language_pack-0.8.0-cp39-abi3-macosx_10_13_universal2.whl (28.6 MB)
Downloading tree_sitter_c_sharp-0.23.1-cp39-abi3-macosx_11_0_arm64.whl (419 kB)
Downloading t

## 🧪 Parse a test string with Tree-sitter

Let's load the Python parser and run Tree-sitter on a simple test snippet to confirm everything is working.

We also define a small `dump()` helper function to pretty-print the AST (abstract syntax tree) node structure.

This will help us verify that Tree-sitter is correctly parsing the function and its components (name, parameters, body, etc).

In [12]:
from tree_sitter_language_pack import get_parser

# Load the bundled Python grammar and parse a one-line test snippet.
parser = get_parser("python")
tree = parser.parse(b"def foo(): pass")

def dump(node, indent=0):
    """Recursively print each named node's type and source text."""
    print("  " * indent + f"{node.type}: {node.text.decode('utf-8')}")
    for child in node.named_children:
        dump(child, indent + 1)

dump(tree.root_node)

module: def foo(): pass
  function_definition: def foo(): pass
    identifier: foo
    parameters: ()
    block: pass
      pass_statement: pass


## 📂 Parse all Python files in the repo

Now that Tree-sitter is working, let’s walk through the entire local repo and extract all `function_definition` and `class_definition` nodes.

For each one, we’ll collect:

- Type (`function_definition` or `class_definition`)
- Name
- Start and end line
- File path

This structured data will help us build the code graph later.

In [13]:
from tree_sitter_language_pack import get_parser
from pathlib import Path
import json

# Setup
parser = get_parser("python")
REPO_ROOT = Path(".").resolve()

# Extract function/class nodes
def extract_code_elements(source_code: str, file_path: str):
    tree = parser.parse(bytes(source_code, "utf-8"))
    root = tree.root_node
    elements = []

    def visit(node):
        if node.type in ("function_definition", "class_definition"):
            name_node = node.child_by_field_name("name")
            name = name_node.text.decode("utf-8") if name_node else "<anonymous>"
            elements.append({
                "type": node.type,
                "name": name,
                "start_line": node.start_point[0] + 1,
                "end_line": node.end_point[0] + 1,
                "file": str(file_path)
            })
        for child in node.named_children:
            visit(child)

    visit(root)
    return elements

# Walk through repo
all_elements = []
for py_file in REPO_ROOT.rglob("*.py"):
    try:
        code = py_file.read_text(encoding="utf-8")
        extracted = extract_code_elements(code, py_file.relative_to(REPO_ROOT))
        all_elements.extend(extracted)
    except Exception as e:
        print(f"⚠️ Failed to parse {py_file}: {e}")

# Save results
with open("code_elements.json", "w", encoding="utf-8") as f:
    json.dump(all_elements, f, indent=2)

# Preview results
import pandas as pd
pd.DataFrame(all_elements)

|     | type | name | start_line | end_line | file |
|-----|------|------|------------|----------|------|
| 0 | class_definition | Build | 9 | 14 | tree-sitter-python/setup.py |
| 1 | function_definition | run | 10 | 14 | tree-sitter-python/setup.py |
| 2 | class_definition | BdistWheel | 17 | 22 | tree-sitter-python/setup.py |
| 3 | function_definition | get_tag | 18 | 22 | tree-sitter-python/setup.py |
| 4 | class_definition | TokenTests | 17 | 80 | tree-sitter-python/examples/python2-grammar-cr... |
| ... | ... | ... | ... | ... | ... |
| 652 | class_definition | TestLanguage | 6 | 11 | tree-sitter-python/bindings/python/tests/test_... |
| 653 | function_definition | test_can_load_grammar | 7 | 11 | tree-sitter-python/bindings/python/tests/test_... |
| 654 | function_definition | _get_query | 8 | 11 | tree-sitter-python/bindings/python/tree_sitter... |
| 655 | function_definition | __getattr__ | 14 | 20 | tree-sitter-python/bindings/python/tree_sitter... |
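
The flat records above already carry enough information to recover one kind of structural edge: containment. The sketch below is one possible approach, not part of the original pipeline. It assumes the record shape produced by `extract_code_elements`, and the `containment_edges` helper and the `"contains"` relation label are hypothetical. An element is linked to the innermost element in the same file whose line range encloses it.

```python
def containment_edges(elements):
    """Link each element to the innermost class/function enclosing it in the same file."""
    edges = []
    for child in elements:
        child_span = (child["start_line"], child["end_line"])
        enclosing = [
            parent for parent in elements
            if parent is not child
            and parent["file"] == child["file"]
            and parent["start_line"] <= child["start_line"]
            and child["end_line"] <= parent["end_line"]
            and (parent["start_line"], parent["end_line"]) != child_span
        ]
        if enclosing:
            # The innermost parent is the enclosing element with the smallest span.
            parent = min(enclosing, key=lambda p: p["end_line"] - p["start_line"])
            edges.append((parent["name"], "contains", child["name"]))
    return edges


# Usage with the first four rows of the preview above.
rows = [
    {"type": "class_definition", "name": "Build", "start_line": 9, "end_line": 14,
     "file": "tree-sitter-python/setup.py"},
    {"type": "function_definition", "name": "run", "start_line": 10, "end_line": 14,
     "file": "tree-sitter-python/setup.py"},
    {"type": "class_definition", "name": "BdistWheel", "start_line": 17, "end_line": 22,
     "file": "tree-sitter-python/setup.py"},
    {"type": "function_definition", "name": "get_tag", "start_line": 18, "end_line": 22,
     "file": "tree-sitter-python/setup.py"},
]
print(containment_edges(rows))
```

This comparison is quadratic in the number of elements, which is fine for a few hundred records; for a larger repo, grouping by file first would be the obvious refinement.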


## 🧾 Requirement Descriptions (Docstrings)

In this project, we assume that each function and class in the codebase already includes a properly written **docstring** that describes its purpose, inputs, and outputs.

This serves as the "requirement" we need for building the **Requirement Graph** in the next step.

> ⚠️ If the codebase lacks docstrings or uses inconsistent formatting, you would need to generate these descriptions using a language model (e.g. GPT-4 or DeepSeek) based on the raw source code.

Since our current dataset is clean and well-documented, we’ll skip LLM-based description generation (step 3 of the pipeline) and use the existing docstrings instead, extracting them directly from the parsed code.
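
As a quick way to see what docstring extraction yields, here is a small sketch that uses Python's standard-library `ast` module rather than Tree-sitter. This is a swapped-in technique for illustration only: the `extract_docstrings` helper and the sample class are hypothetical, and unlike Tree-sitter, `ast` only handles files that parse under the running Python version (so Python 2 example files would fail).

```python
import ast


def extract_docstrings(source_code):
    """Map (name, start_line) -> docstring (or None) for every function and class."""
    docs = {}
    for node in ast.walk(ast.parse(source_code)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_docstring returns None when the definition has no docstring.
            docs[(node.name, node.lineno)] = ast.get_docstring(node)
    return docs


# Usage on a small made-up class.
sample = '''class Greeter:
    """Say hello."""

    def greet(self, name):
        """Return a greeting for name."""
        return f"Hello, {name}"

    def undocumented(self):
        return None
'''
docs = extract_docstrings(sample)
for key, doc in sorted(docs.items(), key=lambda item: item[0][1]):
    print(key, "->", doc)
```

Keying on `(name, start_line)` is meant to line up with the `name` and `start_line` fields collected earlier, so elements whose docstring comes back as `None` can be flagged for LLM-generated descriptions.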