# Code Hierarchy Node Parser

The `CodeHierarchyNodeParser` splits long code files into more manageable chunks. It builds a hierarchy in which each scope body is replaced by a short comment telling the LLM which node to look up if it wants to read that body. This is called skeletonization, and it is controlled by the `skeleton` parameter, which is `True` by default. Nodes in this hierarchy are split by scope (function, class, or method), and each node links to its children and parents so the LLM can traverse the tree.
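
To make the idea of skeletonization concrete, here is a minimal sketch that uses only Python's standard `ast` module. It is not how `CodeHierarchyNodeParser` is implemented (the parser relies on tree-sitter and handles nested scopes); the function name `skeletonize` and the sample source are hypothetical, and only top-level functions and classes are handled.

```python
import ast
import uuid
from typing import Dict, List, Tuple

PLACEHOLDER = "# Code replaced for brevity. See node_id {node_id}"


def _first_line(node: ast.AST) -> int:
    """1-based line where a statement starts, counting any decorators."""
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])


def skeletonize(source: str) -> Tuple[str, Dict[str, str]]:
    """Replace each top-level function/class body with a placeholder comment.

    Returns the skeleton text and a mapping of node_id -> original scope text.
    """
    lines = source.splitlines()
    skeleton: List[str] = []
    scopes: Dict[str, str] = {}
    cursor = 0  # index of the next source line that has not been copied yet

    for node in ast.parse(source).body:
        if not isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        ):
            continue
        body_start = _first_line(node.body[0]) - 1  # 0-based index
        if body_start <= node.lineno - 1:
            continue  # one-line definitions such as `def f(): pass` stay as-is
        scope_end = node.end_lineno  # 1-based inclusive == 0-based exclusive
        node_id = str(uuid.uuid4())
        scopes[node_id] = "\n".join(lines[node.lineno - 1 : scope_end])
        # Copy everything up to and including the signature, then the comment.
        skeleton.extend(lines[cursor:body_start])
        indent = " " * node.body[0].col_offset
        skeleton.append(indent + PLACEHOLDER.format(node_id=node_id))
        cursor = scope_end

    skeleton.extend(lines[cursor:])
    return "\n".join(skeleton), scopes


SOURCE = '''import os


def f(x):
    y = x + 1
    return y


class A:
    def m(self):
        return 1
'''

skeleton_text, scope_texts = skeletonize(SOURCE)
print(skeleton_text)
```

The printed skeleton keeps the import and both signatures, while each body becomes a one-line pointer to an entry in `scope_texts`.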

## Installation and Import

First, be sure to install the necessary [tree-sitter](https://tree-sitter.github.io/tree-sitter/) libraries.

`pip install tree-sitter tree-sitter-languages`

In [None]:
from llama_index.node_parser.code_hierarchy import CodeHierarchyNodeParser
from llama_index.text_splitter.code_splitter import CodeSplitter
from llama_index.readers import SimpleDirectoryReader
from pathlib import Path
from pprint import pprint

Now, choose a directory you want to scan, and glob for all the code files you want to import.

In this case I'm going to glob all "*.py" files in the `llama_index/node_parser` directory.

In [None]:
reader = SimpleDirectoryReader(
    input_files=Path("../../../../llama_index/node_parser").glob("*.py"),
    file_metadata=lambda x: {"filepath": x},
)
nodes = reader.load_data()
len(nodes)

10

Looks like we got 10 files. Let's examine one of these nodes.
We see here that the second one is 33980 characters long. That's way too long for most LLMs.

In [None]:
print(len(nodes[1].text))
pprint(nodes[1].text)

33980
('from collections import defaultdict\n'
 'from enum import Enum\n'
 'from pprint import pformat\n'
 'from typing import Any, Dict, List, Optional, Sequence, Tuple\n'
 '\n'
 'from llama_index.node_parser.extractors.metadata_extractors import '
 'MetadataExtractor\n'
 'from llama_index.node_parser.interface import NodeParser\n'
 'from llama_index.node_parser.node_utils import get_nodes_from_node\n'
 '\n'
 'try:\n'
 '    from pydantic.v1 import BaseModel, Field\n'
 'except ImportError:\n'
 '    from pydantic import BaseModel, Field\n'
 '\n'
 '\n'
 'from tree_sitter import Node\n'
 '\n'
 'from llama_index.callbacks.base import CallbackManager\n'
 'from llama_index.callbacks.schema import CBEventType, EventPayload\n'
 'from llama_index.schema import BaseNode, Document, NodeRelationship, '
 'TextNode\n'
 'from llama_index.text_splitter.code_splitter import CodeSplitter\n'
 'from llama_index.utils import get_tqdm_iterable\n'
 '\n'
 '\n'
 'class _SignatureCaptureType(BaseModel):\n'
 '  

So what are we to do? Well, let's try splitting it. We are going to use the `CodeHierarchyNodeParser` to split the nodes into more reasonable chunks.

In [None]:
split_nodes = CodeHierarchyNodeParser(
    language="python",
    # You can further parameterize the CodeSplitter to split the code
    # into "chunks" that match your context window size using the
    # chunk_lines and max_chars parameters. Here we just use the defaults.
    code_splitter=CodeSplitter(language="python"),
).get_nodes_from_documents(nodes)
len(split_nodes)

148

Great! That split our data from 10 nodes into 148 nodes. What's the max length of any of these nodes?

In [None]:
max(len(n.text) for n in split_nodes)

1664

That's much shorter than before! Let's look at a sample.

In [None]:
pprint(split_nodes[0].text)

('"""Simple node parser."""\n'
 'from typing import Callable, List, Optional, Sequence\n'
 '\n'
 'from llama_index.bridge.pydantic import Field\n'
 'from llama_index.callbacks.base import CallbackManager\n'
 'from llama_index.callbacks.schema import CBEventType, EventPayload\n'
 'from llama_index.node_parser.extractors.metadata_extractors import '
 'MetadataExtractor\n'
 'from llama_index.node_parser.interface import NodeParser\n'
 'from llama_index.node_parser.node_utils import build_nodes_from_splits\n'
 'from llama_index.schema import BaseNode, Document\n'
 'from llama_index.text_splitter.utils import split_by_sentence_tokenizer\n'
 'from llama_index.utils import get_tqdm_iterable\n'
 '\n'
 'DEFAULT_WINDOW_SIZE = 3\n'
 'DEFAULT_WINDOW_METADATA_KEY = "window"\n'
 'DEFAULT_OG_TEXT_METADATA_KEY = "original_text"\n'
 '\n'
 '\n'
 'class SentenceWindowNodeParser(NodeParser):\n'
 '    # Code replaced for brevity. See node_id '
 'aa2137f3-c798-4d55-82c9-5c3e9be1c770')


Even without a long printout, we can see everything this module imports in the first node (which is at the module level) and the single class it defines. However, instead of the class body, we now see a comment:

`# Code replaced for brevity. See node_id {node_id}`
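
Because the placeholder follows a fixed pattern, the referenced id can be pulled out with a regular expression. The sketch below is a hypothetical helper (`referenced_node_ids` is not part of the library) and assumes ids are standard 36-character UUID strings, as in the output above.

```python
import re
from typing import List

# Matches the placeholder comment and captures the UUID that follows it.
PLACEHOLDER_RE = re.compile(
    r"#\s*Code replaced for brevity\. See node_id\s+([0-9a-fA-F-]{36})"
)


def referenced_node_ids(text: str) -> List[str]:
    """Return every node_id mentioned by a placeholder comment in `text`."""
    return PLACEHOLDER_RE.findall(text)


sample = (
    "class SentenceWindowNodeParser(NodeParser):\n"
    "    # Code replaced for brevity. See node_id "
    "aa2137f3-c798-4d55-82c9-5c3e9be1c770"
)
print(referenced_node_ids(sample))
# ['aa2137f3-c798-4d55-82c9-5c3e9be1c770']
```

Unlike splitting on the last space of the last line, this also works when a chunk contains several placeholders or when the placeholder is not the final line.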

What if we go to that node_id?

Notice below that this node_id is also in its `child_nodes`.

In [None]:
split_nodes_by_id = {n.node_id: n for n in split_nodes}
uuid_from_text = split_nodes[0].text.splitlines()[-1].split(" ")[-1]
pprint(split_nodes_by_id[uuid_from_text].text)

assert uuid_from_text in (n.node_id for n in split_nodes[0].child_nodes)

('class SentenceWindowNodeParser(NodeParser):\n'
 '    # Code replaced for brevity. See node_id '
 'fe68e6de-3851-4c77-8acb-445fe45fee6c')


The body has been replaced again. This is an artefact of the `CodeSplitter`, which split the class into several chunks. This must have been a big class! But let's look at its relationships.

In [None]:
pprint(split_nodes_by_id[uuid_from_text].relationships)

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='aa2137f3-c798-4d55-82c9-5c3e9be1c770', node_type=<ObjectType.TEXT: '1'>, metadata={'language': 'python', 'inclusive_scopes': [{'name': 'SentenceWindowNodeParser', 'type': 'class_definition', 'signature': 'class SentenceWindowNodeParser(NodeParser):'}], 'filepath': '../../../../llama_index/node_parser/sentence_window.py'}, hash='4278766c3216e33889069110a938de8d8586df6be121586a7eec929470c7d131'),
 <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='fe68e6de-3851-4c77-8acb-445fe45fee6c', node_type=<ObjectType.TEXT: '1'>, metadata={'language': 'python', 'inclusive_scopes': [{'name': 'SentenceWindowNodeParser', 'type': 'class_definition', 'signature': 'class SentenceWindowNodeParser(NodeParser):'}], 'filepath': '../../../../llama_index/node_parser/sentence_window.py'}, hash='29a4be710fef2a2fe594805008b2dfeafacd7cb5f19e8b10eef65802c3251128'),
 <NodeRelationship.PARENT: '4'>: RelatedNodeInfo(node_id='1a1dd9cf-64f4-4cbf-8bf9-c6d87344

You see that this node has a `NEXT` relationship, and many children.

If we follow its `NEXT` relationship, we see things as the `CodeSplitter` sees them: the node is cut into chunks of a certain character length. For more information about the `CodeSplitter`, read this:
https://docs.sweep.dev/blogs/chunking-2m-files

As you can see, the next node was split from this one because of a big docstring!

In [None]:
from llama_index.schema import NodeRelationship

next_node_relationship_info = split_nodes_by_id[uuid_from_text].relationships[
    NodeRelationship.NEXT
]
next_node = split_nodes_by_id[next_node_relationship_info.node_id]
pprint(next_node.text)

('# Code replaced for brevity. See node_id '
 'aa2137f3-c798-4d55-82c9-5c3e9be1c770\n'
 '"""Sentence window node parser.\n'
 '\n'
 '    Splits a document into Nodes, with each node being a sentence.\n'
 '    Each node contains a window from the surrounding sentences in the '
 'metadata.\n'
 '\n'
 '    Args:\n'
 '        sentence_splitter (Optional[Callable]): splits text into sentences\n'
 '        include_metadata (bool): whether to include metadata in nodes\n'
 '        include_prev_next_rel (bool): whether to include prev/next '
 'relationships\n'
 '    """\n'
 '\n'
 '    sentence_splitter: Callable[[str], List[str]] = Field(\n'
 '        default_factory=split_by_sentence_tokenizer,\n'
 '        description="The text splitter to use when splitting documents.",\n'
 '        exclude=True,\n'
 '    )\n'
 '    window_size: int = Field(\n'
 '        default=DEFAULT_WINDOW_SIZE,\n'
 '        description="The number of sentences on each side of a sentence to '
 'capture.",\n'
 '    )\n'
 '

You can think of the difference between `NodeRelationship.CHILD`/`NodeRelationship.PARENT` and `NodeRelationship.NEXT`/`NodeRelationship.PREVIOUS` as different dimensions.

`CodeHierarchyNodeParser` creates `NodeRelationship.CHILD`/`NodeRelationship.PARENT` between code blocks based on scope hierarchies.

Nodes that are then additionally split by `CodeSplitter` based on context length get an additional `NodeRelationship.NEXT`/`NodeRelationship.PREVIOUS`, and the first node in this chain keeps the `NodeRelationship.CHILD`/`NodeRelationship.PARENT` relationships given to it by `CodeHierarchyNodeParser`.
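
The two dimensions can be pictured with a small stand-alone model. This is only a sketch using plain dataclasses, not llama_index's `TextNode` or `RelatedNodeInfo`; the `Chunk` class, the `follow_next` helper, and the ids are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    node_id: str
    text: str
    # Scope dimension (set by the hierarchy parser in the real library).
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)
    # Length dimension (set when a scope is too long and gets split).
    next: Optional[str] = None
    previous: Optional[str] = None


def follow_next(chunks: Dict[str, Chunk], first_id: str) -> List[str]:
    """Walk the NEXT chain starting at the first chunk of a scope."""
    order: List[str] = []
    current: Optional[str] = first_id
    while current is not None and current not in order:
        order.append(current)
        current = chunks[current].next
    return order


chunks = {
    "module": Chunk("module", "import os", children=["class-1"]),
    # The class was too long, so it was split into two chunks.
    "class-1": Chunk(
        "class-1", "class A:", parent="module", children=["init"], next="class-2"
    ),
    "class-2": Chunk("class-2", '    """Docstring."""', previous="class-1"),
    "init": Chunk("init", "def __init__(self): ...", parent="class-1"),
}

print(follow_next(chunks, "class-1"))  # ['class-1', 'class-2']
print(chunks["class-1"].children)      # ['init']
print(chunks["class-2"].children)      # []
```

Only the first chunk of the chain (`class-1`) carries the parent and child links; later chunks are reachable through `next`.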

Now let's explore that first node's children (`NodeRelationship.CHILD`).

In [None]:
next_node_relationship_info = split_nodes_by_id[uuid_from_text].relationships[
    NodeRelationship.CHILD
][0]
next_node = split_nodes_by_id[next_node_relationship_info.node_id]
pprint(next_node.text)

('def __init__(\n'
 '        self,\n'
 '        sentence_splitter: Optional[Callable[[str], List[str]]] = None,\n'
 '        window_size: int = DEFAULT_WINDOW_SIZE,\n'
 '        window_metadata_key: str = DEFAULT_WINDOW_METADATA_KEY,\n'
 '        original_text_metadata_key: str = DEFAULT_OG_TEXT_METADATA_KEY,\n'
 '        include_metadata: bool = True,\n'
 '        include_prev_next_rel: bool = True,\n'
 '        callback_manager: Optional[CallbackManager] = None,\n'
 '        metadata_extractor: Optional[MetadataExtractor] = None,\n'
 '    ) -> None:\n'
 '        """Init params."""\n'
 '        callback_manager = callback_manager or CallbackManager([])\n'
 '        sentence_splitter = sentence_splitter or '
 'split_by_sentence_tokenizer()\n'
 '        super().__init__(\n'
 '            sentence_splitter=sentence_splitter,\n'
 '            window_size=window_size,\n'
 '            window_metadata_key=window_metadata_key,\n'
 '            original_text_metadata_key=original_text_metadat

The first child of the class is the `__init__` method! That makes sense.

## Indices

Let's explore the use of this node parser in an index. Any index that allows search by keyword will work, since that lets us look up any node by its uuid or by any scope name.

Let's use a keyword index to facilitate this kind of operation. We have created a `CodeHierarchyKeywordTableIndex`, which lets us search for nodes by their uuid or by their scope name.
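
Conceptually, such an index is a table from keywords to node ids. The sketch below illustrates that idea with plain dictionaries; it is not the implementation of `CodeHierarchyKeywordTableIndex`. The `build_keyword_table` helper and the short ids are hypothetical, while the `inclusive_scopes` metadata shape follows the relationship printout above.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_keyword_table(nodes: List[dict]) -> Dict[str, Set[str]]:
    """Map each node_id and each scope name to the ids of matching nodes."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for node in nodes:
        node_id = node["node_id"]
        table[node_id].add(node_id)  # lookup by uuid
        for scope in node["metadata"].get("inclusive_scopes", []):
            table[scope["name"]].add(node_id)  # lookup by scope name
    return dict(table)


example_nodes = [
    {"node_id": "n0", "metadata": {"inclusive_scopes": []}},
    {
        "node_id": "n1",
        "metadata": {"inclusive_scopes": [{"name": "SentenceWindowNodeParser"}]},
    },
    {
        "node_id": "n2",
        "metadata": {
            "inclusive_scopes": [
                {"name": "SentenceWindowNodeParser"},
                {"name": "__init__"},
            ]
        },
    },
]

table = build_keyword_table(example_nodes)
print(sorted(table["SentenceWindowNodeParser"]))  # ['n1', 'n2']
print(sorted(table["n0"]))                        # ['n0']
```

Searching by a scope name returns every chunk inside that scope, which is why a query for a class name can return several nodes.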

In [None]:
from llama_index.indices.code_hierarchy import (
    CodeHierarchyKeywordTableIndex,
)

idx = CodeHierarchyKeywordTableIndex(
    nodes=split_nodes,
)
retriever = idx.as_retriever()

Now we can get the same code as before.

In [None]:
pprint(retriever.retrieve(uuid_from_text)[0].node.get_content())

('class SentenceWindowNodeParser(NodeParser):\n'
 '    # Code replaced for brevity. See node_id '
 'fe68e6de-3851-4c77-8acb-445fe45fee6c')


Now what about getting the rest of the code for this scope?

In [None]:
pprint(
    [
        n.node.get_content()
        for n in retriever.retrieve("SentenceWindowNodeParser")
    ]
)

['# Code replaced for brevity. See node_id '
 'd8a4b50c-c0d9-4338-870a-1ab058d1e3c8\n'
 '@classmethod\n'
 '    def from_defaults(\n'
 '        cls,\n'
 '        sentence_splitter: Optional[Callable[[str], List[str]]] = None,\n'
 '        window_size: int = DEFAULT_WINDOW_SIZE,\n'
 '        window_metadata_key: str = DEFAULT_WINDOW_METADATA_KEY,\n'
 '        original_text_metadata_key: str = DEFAULT_OG_TEXT_METADATA_KEY,\n'
 '        include_metadata: bool = True,\n'
 '        include_prev_next_rel: bool = True,\n'
 '        callback_manager: Optional[CallbackManager] = None,\n'
 '        metadata_extractor: Optional[MetadataExtractor] = None,\n'
 '    ) -> "SentenceWindowNodeParser":\n'
 '        # Code replaced for brevity. See node_id '
 'ecb65036-88bb-4008-806e-b9c629f7d038\n'
 '\n'
 '    def get_nodes_from_documents(\n'
 '        self,\n'
 '        documents: Sequence[Document],\n'
 '        show_progress: bool = False,\n'
 '    ) -> List[BaseNode]:\n'
 '        # Code replaced for

The only difficulty is that these are out of order. The `CodeSplitter` controls how large each of these chunks is and how much they overlap, so you can adjust its settings to reduce any confusion. However, each chunk carries the uuids of its related splits in its text, so an LLM should be able to recursively search these documents and put them in some kind of order.
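
One way to picture that recursive search is a depth-first walk over the placeholder references. The sketch below assumes a plain `dict` from node id to text; the `reading_order` function and the short ids are hypothetical. It tracks visited ids because a split chunk can point back to the chunk it came from, as the `NEXT` node shown earlier does.

```python
import re
from typing import Dict, List, Set

NODE_REF_RE = re.compile(r"#\s*Code replaced for brevity\. See node_id\s+(\S+)")


def reading_order(texts: Dict[str, str], root_id: str) -> List[str]:
    """Depth-first order of node ids reachable from `root_id` via placeholders."""
    order: List[str] = []
    seen: Set[str] = set()

    def visit(node_id: str) -> None:
        if node_id in seen or node_id not in texts:
            return  # skip back-references and unknown ids
        seen.add(node_id)
        order.append(node_id)
        for ref in NODE_REF_RE.findall(texts[node_id]):
            visit(ref)

    visit(root_id)
    return order


texts = {
    "module": "import x\nclass A:\n    # Code replaced for brevity. See node_id a1",
    "a1": "class A:\n    # Code replaced for brevity. See node_id a2",
    "a2": (
        "# Code replaced for brevity. See node_id a1\n"
        "    def m(self):\n"
        "        # Code replaced for brevity. See node_id m1"
    ),
    "m1": "def m(self):\n        return 1",
}

print(reading_order(texts, "module"))  # ['module', 'a1', 'a2', 'm1']
```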

## Code Hierarchy

This is the namesake of the node parser: it creates a tree of scope names that can be used to search the code.

In [None]:
print(CodeHierarchyNodeParser.get_code_hierarchy_from_nodes(split_nodes))

- ..
  - ..
    - ..
      - ..
        - llama_index
          - node_parser
            - sentence_window.py
              - SentenceWindowNodeParser
                - __init__
                - text_splitter
                - from_defaults
                - get_nodes_from_documents
                - build_window_nodes_from_documents
            - code_hierarchy.py
              - _SignatureCaptureType
              - _SignatureCaptureOptions
              - _ScopeMethod
              - _CommentOptions
              - _ScopeItem
              - _ChunkNodeOutput
              - CodeHierarchyNodeParser
                - class_name
                - __init__
                - _get_node_name
                  - recur
                - _get_node_signature
                  - find_start
                  - find_end
                - _chunk_node
                - get_code_hierarchy_from_nodes
                  - get_subdict
                  - recur_inclusive_scope
                  - dict_
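
A listing like the one above can be derived from each node's `filepath` and `inclusive_scopes` metadata. The following is a rough sketch of that idea using plain dictionaries, not the library's `get_code_hierarchy_from_nodes`; the helper names and the sample path `pkg/mod.py` are hypothetical.

```python
from typing import List


def build_hierarchy(nodes: List[dict]) -> dict:
    """Nest path components and scope names into a tree of dictionaries."""
    root: dict = {}
    for node in nodes:
        meta = node["metadata"]
        keys = meta["filepath"].split("/") + [
            scope["name"] for scope in meta["inclusive_scopes"]
        ]
        level = root
        for key in keys:
            level = level.setdefault(key, {})
    return root


def render(tree: dict, depth: int = 0) -> str:
    """Render the tree as an indented Markdown-style bullet list."""
    lines: List[str] = []
    for name, subtree in tree.items():
        lines.append("  " * depth + "- " + name)
        if subtree:
            lines.append(render(subtree, depth + 1))
    return "\n".join(lines)


example_nodes = [
    {"metadata": {"filepath": "pkg/mod.py", "inclusive_scopes": []}},
    {"metadata": {"filepath": "pkg/mod.py", "inclusive_scopes": [{"name": "A"}]}},
    {
        "metadata": {
            "filepath": "pkg/mod.py",
            "inclusive_scopes": [{"name": "A"}, {"name": "m"}],
        }
    },
]

print(render(build_hierarchy(example_nodes)))
```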