# Code Hierarchy Node Parser

The `CodeHierarchyNodeParser` splits long code files into more manageable chunks. It builds a hierarchy in which each scope body is replaced with a short comment telling the LLM which node to look up if it wants to read that body. This is called skeletonization, and it is controlled by the `skeleton` parameter, which is `True` by default. Nodes in the hierarchy are split by scope (function, class, or method) and are linked to their parents and children so the LLM can traverse the tree.
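
To make the idea concrete, here is a minimal sketch of skeletonization using only Python's standard `ast` module. It is not how `CodeHierarchyNodeParser` is implemented (that relies on tree-sitter). The `skeletonize` helper and its return shape are assumptions made for illustration, and it only handles top-level scopes.

```python
import ast
import uuid
from typing import Dict, Tuple


def skeletonize(source: str) -> Tuple[str, Dict[str, str]]:
    """Replace each top-level function/class body with a pointer comment.

    Hypothetical helper: returns the skeleton text plus a mapping of
    node_id -> original scope text (signature and body).
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    scopes: Dict[str, str] = {}
    skeleton = []
    cursor = 0  # index of the next source line not yet copied

    for stmt in tree.body:
        if not isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # 0-based index of the first body line; never before the line after `def`/`class`.
        body_start = max(stmt.body[0].lineno - 1, stmt.lineno)
        scope_end = stmt.end_lineno  # 1-based inclusive == 0-based exclusive
        node_id = str(uuid.uuid4())
        scopes[node_id] = "\n".join(lines[stmt.lineno - 1 : scope_end])
        # Copy everything up to and including the signature...
        skeleton.extend(lines[cursor:body_start])
        # ...then stand in for the body with a pointer comment.
        skeleton.append(f"    # Code replaced for brevity. See node_id {node_id}")
        cursor = scope_end

    skeleton.extend(lines[cursor:])
    return "\n".join(skeleton), scopes


example = '''import os

def greet(name):
    message = "hello " + name
    return message

class Box:
    def __init__(self):
        self.items = []
'''
skeleton_text, scopes = skeletonize(example)
print(skeleton_text)
```

The signatures and module-level code survive in the skeleton, while each body can still be recovered from `scopes` by its id.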

## Installation and Import

First be sure to install the necessary [tree-sitter](https://tree-sitter.github.io/tree-sitter/) libraries.

In [1]:
!pip install tree-sitter tree-sitter-languages


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [19]:
from llama_hub.file.code.code_hierarchy import CodeHierarchyNodeParser
from llama_index.text_splitter import CodeSplitter
from llama_index.readers import SimpleDirectoryReader
from pathlib import Path
from pprint import pprint


In [3]:
from IPython.display import Markdown, display


def print_python(python_text):
    """This function prints python text in ipynb nicely formatted."""
    # The closing fence must sit on its own line to render correctly.
    display(Markdown("```python\n" + python_text + "\n```"))

## Prepare your Data

Choose a directory you want to scan, and glob for all the code files you want to import.

In this case I'm going to load a single file, `./code_hierarchy.py`, rather than globbing a whole directory.

In [4]:
reader = SimpleDirectoryReader(
    input_files=[Path("./code_hierarchy.py")],
    file_metadata=lambda x: {"filepath": x},
)
nodes = reader.load_data()

This should be the code hierarchy node parser itself. Let's have it parse itself!

In [5]:
print(f"Length of text: {len(nodes[0].text)}")
print_python(nodes[0].text[:1500]+"\n\n# ...")

Length of text: 33247


```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.extractors.metadata_extractors import BaseExtractor
from llama_index.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field


from tree_sitter import Node

from llama_index.callbacks.base import CallbackManager
from llama_index.callbacks.schema import CBEventType, EventPayload
from llama_index.schema import BaseNode, Document, NodeRelationship, TextNode
from llama_index.text_splitter import CodeSplitter
from llama_index.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    """
    Unfortunately some languages need special options for how to make a signature.

    For example, html element signatures should include their closing >, there is no
    easy way to include this using an always-exclusive system.

    However, using an always-inclusive system, python decorators don't work,
    as there isn't an easy to define terminator for decorators that is inclusive
    to their signature.
    """

    type: str = Field(description="The type string to match on.")
    inclusive: bool = Field(
        description=(
            "Whether to include the text of the node matched by this type or not."
        ),
    )


class _SignatureCaptureOptions(BaseModel):
    start_signature_types: Optional[List[_SignatureCaptureType]] = Field(

# ...
```

This is way too long to fit comfortably into the context of our LLM, so we will split it. We are going to use the `CodeHierarchyNodeParser` to split the nodes into more reasonable chunks.

In [6]:
split_nodes = CodeHierarchyNodeParser(
    language="python",
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters; here we set both explicitly
    code_splitter=CodeSplitter(language="python", max_chars=1000, chunk_lines=10),
).get_nodes_from_documents(nodes)
print("Number of nodes after splitting:", len(split_nodes))

Number of nodes after splitting: 87


Great! That split our data from 1 node into 87 nodes. What's the max length of any of these nodes?

In [7]:
print(f"Longest text in nodes: {max(len(n.text) for n in split_nodes)}")

Longest text in nodes: 1160


That's much shorter than before! Let's look at a sample.

In [8]:
print_python(split_nodes[0].text)

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.extractors.metadata_extractors import BaseExtractor
from llama_index.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field


from tree_sitter import Node

from llama_index.callbacks.base import CallbackManager
from llama_index.callbacks.schema import CBEventType, EventPayload
from llama_index.schema import BaseNode, Document, NodeRelationship, TextNode
from llama_index.text_splitter import CodeSplitter
from llama_index.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 9fc27450-8dd7-4459-a67b-d35266d949be


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id d79396a6-bc83-4115-a748-37173ac792c2
    # Code replaced for brevity. See node_id 1bccb6a2-cb81-4dd8-b2ca-60de94fb4311
```

Without even needing a long printout, we can see everything this module imports in the first node (which is at the module level) and some of the classes it defines.

We also see that it has put comments in place of code that was removed to make the text size more reasonable.
These can appear at the beginning or end of a chunk, or at a new scope level, like a class or function declaration.

`# Code replaced for brevity. See node_id {node_id}`
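
Because these comments follow a fixed pattern, the referenced ids can be pulled out of a chunk mechanically. The sketch below assumes the ids are standard lowercase hyphenated UUIDs, as in the output above. `NODE_ID_PATTERN` and `find_referenced_node_ids` are hypothetical names, not part of the library.

```python
import re
from typing import List

# Matches the pointer comment format shown above and captures the UUID.
NODE_ID_PATTERN = re.compile(
    r"Code replaced for brevity\. See node_id "
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def find_referenced_node_ids(text: str) -> List[str]:
    """Return every node_id referenced by a pointer comment, in order of appearance."""
    return NODE_ID_PATTERN.findall(text)


chunk = """class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 9fc27450-8dd7-4459-a67b-d35266d949be


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id d79396a6-bc83-4115-a748-37173ac792c2
"""
referenced_ids = find_referenced_node_ids(chunk)
print(referenced_ids)
```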

## Code Hierarchy

These scopes can be listed by the `CodeHierarchyNodeParser`, giving a "repo map" of sorts.
This is where the node parser gets its name: it creates a tree of scope names that can be used to search the code.
Put this in your context to give the LLM a default search hierarchy.

Instruct an LLM that uses the `CodeHierarchyKeywordQueryEngine` (shown later) as a tool to:

```
"Search the tool by any element in this list, or any uuid found in the resulting code, to get more information about that element."
```

Then append this to your context:

In [9]:
print(CodeHierarchyNodeParser.get_code_hierarchy_from_nodes(split_nodes))

- code_hierarchy
  - _SignatureCaptureType
  - _SignatureCaptureOptions
  - _ScopeMethod
  - _CommentOptions
  - _ScopeItem
  - _ChunkNodeOutput
  - CodeHierarchyNodeParser
    - class_name
    - __init__
    - _get_node_name
      - recur
    - _get_node_signature
      - find_start
      - find_end
    - _chunk_node
    - get_code_hierarchy_from_nodes
      - get_subdict
      - recur_inclusive_scope
      - dict_to_markdown
    - _parse_nodes
    - _get_indentation
    - _get_comment_text
    - _create_comment_line
    - _get_replacement_text
    - _skeletonize
    - _skeletonize_list
      - recur



## Exploration by the Programmer

So that we understand what is going on under the hood, what if we go to that node_id we found above?

In [10]:
split_nodes_by_id = {n.node_id: n for n in split_nodes}
uuid_from_text = split_nodes[0].text.splitlines()[-1].split(" ")[-1]
print("Going to print the node with UUID:", uuid_from_text)
print_python(split_nodes_by_id[uuid_from_text].text)

Going to print the node with UUID: 1bccb6a2-cb81-4dd8-b2ca-60de94fb4311


```python
# Code replaced for brevity. See node_id 1165ccf1-7954-4350-847e-8677ae49a5a0
"""
Maps language -> Node Type -> SignatureCaptureOptions

The best way for a developer to discover these is to put a breakpoint at the TIP
tag in _chunk_node, and then create a unit test for some code, and then iterate
through the code discovering the node names.
"""
    # Code replaced for brevity. See node_id 13c351a9-fa3c-4d91-8e4a-bfde2b7d4f6c
```

This is the next split in the file. Comments pointing to the node before it and the node after it are prepended and appended to its text.

We can also see the relationships on this node programmatically.

In [11]:
pprint(split_nodes_by_id[uuid_from_text].relationships)

The `NEXT` and `PREVIOUS` relationships come from the `CodeSplitter`, which is a component of the `CodeHierarchyNodeParser`. It is responsible for cutting the nodes into chunks of a certain character length. For more information about the `CodeSplitter`, read this:

[Code Splitter](https://docs.llamaindex.ai/en/latest/api/llama_index.node_parser.CodeSplitter.html)

The `PARENT` and `CHILD` relationships come from the `CodeHierarchyNodeParser`, which is responsible for creating the hierarchy of nodes. Things like classes, functions, and methods are nodes in this hierarchy.

The `SOURCE` is the original file that this node came from.

In [12]:
from llama_index.schema import NodeRelationship

node_id = uuid_from_text
if NodeRelationship.NEXT not in split_nodes_by_id[node_id].relationships:
    print("No next node found!")
else:
    next_node_relationship_info = split_nodes_by_id[node_id].relationships[
        NodeRelationship.NEXT
    ]
    next_node = split_nodes_by_id[next_node_relationship_info.node_id]
    print_python(next_node.text)

```python
# Code replaced for brevity. See node_id 1bccb6a2-cb81-4dd8-b2ca-60de94fb4311
_DEFAULT_SIGNATURE_IDENTIFIERS: Dict[str, Dict[str, _SignatureCaptureOptions]] =
    # Code replaced for brevity. See node_id 04a96f73-8399-4ec6-8db7-ae18cd18127d
```
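
The cell above follows a single `NEXT` link. Following the whole chain is the same lookup in a loop. The sketch below uses a hypothetical `StubNode` dataclass as a stand-in so it runs without any libraries. With real nodes, the assumption is that you would read the next id from `relationships[NodeRelationship.NEXT].node_id`, as the cell above does.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a parsed node; not a llama_index class."""

    node_id: str
    text: str
    next_id: Optional[str] = None


def walk_next(nodes_by_id: Dict[str, StubNode], start_id: str) -> Iterator[StubNode]:
    """Yield nodes in reading order by following NEXT links from start_id."""
    seen = set()
    current: Optional[str] = start_id
    # Stop on a missing link, an unknown id, or a cycle.
    while current is not None and current in nodes_by_id and current not in seen:
        seen.add(current)
        node = nodes_by_id[current]
        yield node
        current = node.next_id


stub_nodes = {
    "a": StubNode("a", "import os", next_id="b"),
    "b": StubNode("b", "def f():", next_id="c"),
    "c": StubNode("c", "    return 1"),
}
ordered_ids = [node.node_id for node in walk_next(stub_nodes, "a")]
print(ordered_ids)
```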

## Keyword Table and Usage by the LLM

Let's explore the use of this node parser in an index. We can use any index that allows search by keyword, which should let us search for any node by its uuid or by any scope name.

We have created a `CodeHierarchyKeywordQueryEngine`, which allows us to search for nodes by their uuid or by their scope name. Its `.query` method can be used as a simple search tool for any LLM. Given the repo map we created earlier, or the text of a split file, the LLM should be able to figure out what to search for quite naturally.
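
The idea behind searching by either key can be sketched as a plain dictionary that registers each chunk under its id and, when it has one, under its scope name. This is only an illustration of the concept, not the engine's actual implementation. `build_keyword_table` and the tuple layout are assumptions, as is the choice to let the first chunk of a scope win.

```python
from typing import Dict, List, Optional, Tuple

# (node_id, scope_name or None, text) -- a hypothetical layout for this sketch.
NodeRecord = Tuple[str, Optional[str], str]


def build_keyword_table(records: List[NodeRecord]) -> Dict[str, str]:
    """Map every node_id, and the first chunk of every named scope, to its text."""
    table: Dict[str, str] = {}
    for node_id, scope_name, text in records:
        table[node_id] = text
        if scope_name is not None:
            # Keep the first chunk for a scope; later chunks stay reachable by id.
            table.setdefault(scope_name, text)
    return table


records = [
    ("id-1", "code_hierarchy", "module text"),
    ("id-2", "_SignatureCaptureType", "class text"),
    ("id-3", "_SignatureCaptureType", "class text, part 2"),
    ("id-4", None, "unnamed chunk"),
]
table = build_keyword_table(records)
print(table["_SignatureCaptureType"])
```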

Let's create the `CodeHierarchyKeywordQueryEngine`.

In [13]:
from index import CodeHierarchyKeywordQueryEngine

idx = CodeHierarchyKeywordQueryEngine(
    nodes=split_nodes,
)

Now we can get the same code as before.

In [14]:
print_python(idx.query(split_nodes[0].node_id).response)

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.extractors.metadata_extractors import BaseExtractor
from llama_index.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field


from tree_sitter import Node

from llama_index.callbacks.base import CallbackManager
from llama_index.callbacks.schema import CBEventType, EventPayload
from llama_index.schema import BaseNode, Document, NodeRelationship, TextNode
from llama_index.text_splitter import CodeSplitter
from llama_index.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 9fc27450-8dd7-4459-a67b-d35266d949be


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id d79396a6-bc83-4115-a748-37173ac792c2
    # Code replaced for brevity. See node_id 1bccb6a2-cb81-4dd8-b2ca-60de94fb4311
```

But now we can also search for any node by its common-sense name.

For example, the class `_SignatureCaptureType` is a node in the hierarchy. We can search for it by name.

In [15]:
print_python(idx.query("_SignatureCaptureType").response)

```python
class _SignatureCaptureType(BaseModel):
    """
    Unfortunately some languages need special options for how to make a signature.

    For example, html element signatures should include their closing >, there is no
    easy way to include this using an always-exclusive system.

    However, using an always-inclusive system, python decorators don't work,
    as there isn't an easy to define terminator for decorators that is inclusive
    to their signature.
    """

    type: str = Field(description="The type string to match on.")
    inclusive: bool = Field(
        description=(
            "Whether to include the text of the node matched by this type or not."
        ),
    )
```

And by module name, in case the LLM sees something in an import statement and wants to know more about it.

In [16]:
print_python(idx.query("code_hierarchy").response)

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.extractors.metadata_extractors import BaseExtractor
from llama_index.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field


from tree_sitter import Node

from llama_index.callbacks.base import CallbackManager
from llama_index.callbacks.schema import CBEventType, EventPayload
from llama_index.schema import BaseNode, Document, NodeRelationship, TextNode
from llama_index.text_splitter import CodeSplitter
from llama_index.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 9fc27450-8dd7-4459-a67b-d35266d949be


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id d79396a6-bc83-4115-a748-37173ac792c2
    # Code replaced for brevity. See node_id 1bccb6a2-cb81-4dd8-b2ca-60de94fb4311
```

## As a Tool

To get the LangChain tool, call `as_langchain_tool` on the `CodeHierarchyKeywordQueryEngine`, and it will be ready for the LLM to use.

In [20]:
print_python(idx.as_langchain_tool().run("code_hierarchy"))

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Optional, Sequence, Tuple

from llama_index.extractors.metadata_extractors import BaseExtractor
from llama_index.node_parser.interface import NodeParser

try:
    from pydantic.v1 import BaseModel, Field
except ImportError:
    from pydantic import BaseModel, Field


from tree_sitter import Node

from llama_index.callbacks.base import CallbackManager
from llama_index.callbacks.schema import CBEventType, EventPayload
from llama_index.schema import BaseNode, Document, NodeRelationship, TextNode
from llama_index.text_splitter import CodeSplitter
from llama_index.utils import get_tqdm_iterable


class _SignatureCaptureType(BaseModel):
    # Code replaced for brevity. See node_id 9fc27450-8dd7-4459-a67b-d35266d949be


class _SignatureCaptureOptions(BaseModel):
    # Code replaced for brevity. See node_id d79396a6-bc83-4115-a748-37173ac792c2
    # Code replaced for brevity. See node_id 1bccb6a2-cb81-4dd8-b2ca-60de94fb4311
```
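
With a search tool like this, an agent would typically alternate between reading a result and searching for the ids it mentions. The sketch below imitates that loop with a plain function standing in for the tool. `explore`, `fake_search`, and the in-memory `store` are hypothetical, and a real agent would decide which ids to follow rather than visiting all of them.

```python
import re
from typing import Callable, Dict, List

POINTER = re.compile(r"See node_id ([0-9a-f-]{36})")


def explore(search: Callable[[str], str], start: str, max_calls: int = 10) -> Dict[str, str]:
    """Query `start`, then every node_id mentioned in the results, breadth-first."""
    results: Dict[str, str] = {}
    queue: List[str] = [start]
    while queue and len(results) < max_calls:
        key = queue.pop(0)
        if key in results:
            continue  # already fetched; avoids loops between chunks
        text = search(key)
        results[key] = text
        queue.extend(POINTER.findall(text))
    return results


ID_A = "9fc27450-8dd7-4459-a67b-d35266d949be"
ID_B = "d79396a6-bc83-4115-a748-37173ac792c2"
store = {
    "code_hierarchy": f"class A:\n    # Code replaced for brevity. See node_id {ID_A}",
    ID_A: f"class A:\n    x = 1\n    # Code replaced for brevity. See node_id {ID_B}",
    ID_B: f"# Code replaced for brevity. See node_id {ID_A}\n    y = 2",
}


def fake_search(key: str) -> str:
    """Stand-in for the tool's run method; returns an empty string for unknown keys."""
    return store.get(key, "")


visited = explore(fake_search, "code_hierarchy")
print(list(visited))
```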

The name and description that the LLM reads to learn how to use the tool are:

In [18]:
print("Name: "+idx.as_langchain_tool().name)
display(Markdown("Description: "+idx.as_langchain_tool().description))

Name: Code Search


Description: 
        Search the tool by any element in this list,
        or any uuid found in the code,
        to get more information about that element.

        - code_hierarchy
  - _SignatureCaptureType
  - _SignatureCaptureOptions
  - _ScopeMethod
  - _CommentOptions
  - _ScopeItem
  - _ChunkNodeOutput
  - CodeHierarchyNodeParser
    - class_name
    - __init__
    - _get_node_name
      - recur
    - _get_node_signature
      - find_start
      - find_end
    - _chunk_node
    - get_code_hierarchy_from_nodes
      - get_subdict
      - recur_inclusive_scope
      - dict_to_markdown
    - _parse_nodes
    - _get_indentation
    - _get_comment_text
    - _create_comment_line
    - _get_replacement_text
    - _skeletonize
    - _skeletonize_list
      - recur

        