# Document Chunking for RAG Systems
Chunking is a critical aspect of building effective RAG systems, yet it's often where many implementations silently fail. This notebook takes a deliberate, visible approach to chunking to ensure we don't lose valuable information in the process.
## Why Chunking Matters
When building RAG systems, particularly for critical domains like medical literature, we need to ensure that:

1. Chunks preserve complete information - No content should be silently truncated or lost
2. Rich metadata is maintained - Context about document structure and hierarchy is preserved
3. Chunk sizes match embedding model constraints - We avoid exceeding token limits
4. Hierarchical context is retained - The relationship between sections remains clear

## Common Chunking Pitfalls
Silent truncation is one of the most dangerous failure modes. Consider this scenario: you chunk medical literature by H3 headings, and one section contains 1,000 tokens. If your embedding model (such as Chroma DB's default all-MiniLM-L6-v2 model with a 256-token limit) silently truncates content beyond its capacity, you lose 744 tokens of potentially critical medical information without any error or warning.
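
One way to make truncation loud rather than silent is to check every chunk against the model limit before embedding. The sketch below is a minimal illustration, not this notebook's implementation: `ChunkTooLargeError`, `assert_within_limit`, and the whitespace token counter are hypothetical stand-ins, and in practice the count should come from the embedding model's own tokenizer.

```python
from typing import Callable, Iterable, Tuple


class ChunkTooLargeError(ValueError):
    """Raised when a chunk would be truncated by the embedding model."""


def whitespace_token_count(text: str) -> int:
    # Crude stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def assert_within_limit(
    chunks: Iterable[Tuple[str, str]],
    max_tokens: int = 256,
    count: Callable[[str], int] = whitespace_token_count,
) -> None:
    """Fail loudly if any (chunk_id, text) pair exceeds max_tokens."""
    for chunk_id, text in chunks:
        n = count(text)
        if n > max_tokens:
            raise ChunkTooLargeError(
                f"{chunk_id}: {n} tokens, {n - max_tokens} would be truncated"
            )


# The scenario above: a 1,000-token section against a 256-token limit.
try:
    assert_within_limit([("section-1-2", "word " * 1000)])
except ChunkTooLargeError as err:
    message = str(err)
    print(message)  # section-1-2: 1000 tokens, 744 would be truncated
```

Raising an exception here is a design choice for a notebook workflow: it stops the run at the first oversized chunk so that it can be inspected before anything is embedded.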

Orphaned content occurs when chunking by headings. Text that exists between a parent heading and its first child (e.g., introductory paragraphs between an H1 and the first H2) can be lost if not explicitly handled. In hierarchical medical literature, this bridging content often contains crucial context.
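
The idea can be sketched in a few lines. The example below is an illustration under simplifying assumptions (ATX-style `#` headings only, no code fences in the input); `find_orphans` is a hypothetical helper, separate from the parser defined later in this notebook.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def find_orphans(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (parent_title, text) for content between a heading and its first child."""
    headings = [
        (i, len(m.group(1)), m.group(2).strip())
        for i, line in enumerate(lines)
        if (m := HEADING.match(line))
    ]
    orphans = []
    for (start, level, title), (nxt, nxt_level, _) in zip(headings, headings[1:]):
        # Only a deeper heading makes the text in between "orphaned".
        if nxt_level > level:
            text = "\n".join(lines[start + 1:nxt]).strip()
            if text:
                orphans.append((title, text))
    return orphans


doc = [
    "# Chapter One",
    "Intro paragraph that belongs to the chapter itself.",
    "## 1.1 Definition",
    "Definition text.",
    "## 1.2 Pathophysiology",
    "### 1.2.1 Type 1",
    "Details.",
]
orphans = find_orphans(doc)
print(orphans)
# [('Chapter One', 'Intro paragraph that belongs to the chapter itself.')]
```

A splitter that only emits the text under leaf headings would drop the intro paragraph in this example, because it sits under `# Chapter One` rather than under any `##` section.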

Context loss from overly short chunks happens when aggressive chunking breaks apart content that should be understood together, making it difficult for the embedding model to capture meaningful semantic relationships.  

Exceeding model capacity can occur even after proper chunking if you retrieve too many chunks during similarity search, overwhelming your LLM's context window. This is especially critical when running smaller models locally with hardware constraints.
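
A simple guard is to cap the retrieved chunks by a token budget rather than by a fixed count. The following is a sketch only: `select_within_budget`, the chunk IDs, and the 1,200-token budget are hypothetical, and a real budget would also need to reserve room for the prompt and the answer.

```python
from typing import List, Tuple


def select_within_budget(
    ranked_chunks: List[Tuple[str, int]], budget: int
) -> List[str]:
    """Keep the highest-ranked chunks whose combined token count fits the budget.

    ranked_chunks: (chunk_id, token_count) pairs, best match first.
    """
    selected: List[str] = []
    used = 0
    for chunk_id, tokens in ranked_chunks:
        if used + tokens > budget:
            break  # stop at the first chunk that would overflow the window
        selected.append(chunk_id)
        used += tokens
    return selected


# Hypothetical similarity-search results, best match first.
hits = [("sec-1-2", 784), ("sec-1-1", 71), ("sec-1-2-1", 238), ("foreword", 860)]
picked = select_within_budget(hits, budget=1200)
print(picked)  # ['sec-1-2', 'sec-1-1', 'sec-1-2-1']
```

Stopping at the first overflow keeps the ranking order intact; skipping the oversized chunk and continuing with smaller ones is an alternative, at the cost of passing lower-ranked context to the model.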


## Why Use a Notebook for Chunking?
This notebook approach provides visibility at every step. Rather than running a production pipeline that silently processes thousands of documents, we can:

- Inspect chunk sizes before embedding
- Verify that no content exceeds token limits
- Examine orphaned sections and ensure they're properly handled
- Validate that hierarchical context is preserved
- Test our chunking strategy iteratively before deploying
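
As a rough illustration of the first two checks, a size report can be computed from a list of per-chunk token counts. This is a sketch, not part of the notebook's pipeline: `summarize_chunk_sizes` is a hypothetical helper and the counts in the example are illustrative.

```python
from statistics import mean
from typing import Dict, List


def summarize_chunk_sizes(token_counts: List[int], limit: int) -> Dict[str, float]:
    """Summarize chunk sizes and count how many exceed the embedding limit."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    return {
        "chunks": len(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": round(mean(token_counts), 1),
        "over_limit": sum(1 for n in token_counts if n > limit),
    }


report = summarize_chunk_sizes([71, 238, 305, 784], limit=256)
print(report)
# {'chunks': 4, 'min': 71, 'max': 784, 'mean': 349.5, 'over_limit': 2}
```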

For medical literature and other critical knowledge sources, this careful, visible approach is essential. We must see our chunks before we trust them.
Let's begin by examining our document structure and building a chunking strategy that preserves every piece of valuable information.

## Why I Chose Custom Chunking Over LangChain Splitters

I built a custom hierarchical chunking parser instead of using LangChain's RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter, or other general-purpose splitters.


For this medical diabetes guidelines document, we needed complete content preservation, precise citations with exact section numbers and URLs, orphan section handling (content between a parent heading, e.g. an H1, and its first child heading, e.g. an H2), and an intact hierarchical structure for navigation and web integration.


**Why LangChain splitters weren't suitable:**
- Size-based splitting can break logical boundaries
- Risk of losing orphan sections
- Limited hierarchical metadata for precise citations
- A generic approach doesn't match the structural needs of this document
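
To make the first point concrete, here is a deliberately naive fixed-width splitter written in plain Python. It is not LangChain code and does not reproduce how any LangChain splitter behaves; `split_fixed` and the sample text are hypothetical and only show the general failure mode of cutting by size alone.

```python
from typing import List


def split_fixed(text: str, size: int) -> List[str]:
    """Naive fixed-width splitter: cuts every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = (
    "## 1.1 Definition\n"
    "Short definition paragraph.\n"
    "## 1.2 Pathophysiology\n"
    "Short pathophysiology paragraph.\n"
)
chunks = split_fixed(text, size=60)
for chunk in chunks:
    print(repr(chunk))

# The second heading straddles the 60-character boundary, so no single
# chunk contains it intact.
heading_intact = any("## 1.2 Pathophysiology" in chunk for chunk in chunks)
print(heading_intact)  # False
```

No text is lost in this example, but the chunk that carries the pathophysiology paragraph no longer carries a complete heading, which is exactly the metadata a citation needs.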


**When LangChain Splitters Are Appropriate**
LangChain's splitters work well when handling diverse, unstructured documents at scale, where fixed-size chunks are acceptable and document structure is less critical.


**Our Custom Solution**

Our parser respects document structure (chunks align with H1/H2/H3/H4 sections), preserves orphan sections, maintains full hierarchy with breadcrumbs, enables precise citations with section numbers and URLs, and supports web integration with generated URLs and anchors.

For this single, well-structured medical document, custom chunking ensures completeness, traceability, and accurate referencing, all essential for clinical guidelines.
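
To give a feel for what such a chunk carries, the sketch below models one record with its breadcrumb, section number, and URL. The `ChunkRecord` class, its field names, and the example URL are assumptions made for illustration; the JSON structure actually produced later in this notebook uses its own keys.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkRecord:
    """Hypothetical flat record for one chunk, stored next to its embedding."""
    chunk_id: str
    section_number: Optional[str]
    title: str
    breadcrumb: List[str]
    url: str
    token_count: int
    text: str

    def citation(self) -> str:
        """Human-readable citation built from the breadcrumb and the URL."""
        trail = " > ".join(self.breadcrumb)
        return f"{trail} ({self.url})"


record = ChunkRecord(
    chunk_id="section-1-1",
    section_number="1.1",
    title="1.1. Definition",
    breadcrumb=["CHAPTER ONE: INTRODUCTION TO DIABETES", "1.1. Definition"],
    url="/guidelines/chapter-one-introduction-to-diabetes/11-definition",  # illustrative
    token_count=71,
    text="(section text goes here)",
)
print(record.citation())
```

Keeping the breadcrumb as a list rather than a pre-joined string leaves the consumer free to render it as navigation, as a citation, or as context prepended to the chunk text.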

In [2]:
import re
from pathlib import Path
from typing import List, Dict, Optional, Any
from dataclasses import dataclass, field
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """
    Count tokens in text using tiktoken.
    
    Args:
        text: Input text string
        encoding_name: Tokenizer encoding (cl100k_base for GPT-3.5/GPT-4)
    
    Returns:
        Number of tokens
    """
    try:
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception as e:
        print(f"Warning: Token counting failed, using word count approximation: {e}")
        # Fallback: approximate 1 token ≈ 0.75 words, i.e. tokens ≈ words / 0.75
        return int(len(text.split()) / 0.75)

@dataclass
class DocumentNode:
    """Represents a node in the document tree (heading or section)"""
    level: int  # 1=H1, 2=H2, 3=H3, 4=H4, 0=section with no heading
    title: str
    content: str = ""
    tokens: int = 0
    start_line: int = 0
    end_line: int = 0
    children: List['DocumentNode'] = field(default_factory=list)
    parent: Optional['DocumentNode'] = None
    
    def add_child(self, child: 'DocumentNode'):
        """Add a child node and set its parent"""
        child.parent = self
        self.children.append(child)
    
    def get_full_path(self) -> str:
        """Get the full path from root to this node"""
        path = []
        node = self
        while node:
            if node.title:
                path.insert(0, node.title)
            node = node.parent
        return " > ".join(path)
    
    def __repr__(self) -> str:
        indent = "  " * (self.level if self.level > 0 else 0)
        prefix = f"H{self.level}" if self.level > 0 else "Section"
        # Truncate long titles to 60 characters for display
        title = f"{self.title[:60]}..." if len(self.title) > 60 else self.title
        return f"{indent}{prefix}: {title} ({self.tokens:,} tokens)"

def parse_markdown_tree(markdown_path: str) -> Dict[str, Any]:
    """
    Parse markdown file and create a hierarchical tree structure.
    
    Rules for sections:
    - H1 without H2s: Just show H1 with token count
    - H1 with H2s: Check for content between H1 and FIRST H2 only (create ---section--- if exists)
    - H2 without H3s: Just show H2 with token count
    - H2 with H3s: Check for content between H2 and FIRST H3 only (create ---section--- if exists)
    - H3 without H4s: Just show H3 with token count
    - H3 with H4s: Check for content between H3 and FIRST H4 only (create ---section--- if exists)
    - Content after the last child belongs to that last child (not a separate section)
    - No sections between siblings (e.g., between H2 and H2)
    
    Args:
        markdown_path: Path to the markdown file
        
    Returns:
        Dictionary with:
        - 'tree': Root node of the document tree
        - 'sections': List of section nodes (H1 and H2 level)
        - 'stats': Statistics about the document
    """
    markdown_path = Path(markdown_path)
    
    if not markdown_path.exists():
        raise FileNotFoundError(f"Markdown file not found: {markdown_path}")
    
    with open(markdown_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    
    # Root node to hold everything
    root = DocumentNode(level=0, title="Document Root", start_line=0, end_line=len(lines) - 1)
    
    # Track the current path in the tree
    current_path = [root]  # Stack of nodes: root -> H1 -> H2 -> H3 -> H4
    
    # Track content before first heading
    content_before_first_heading = []
    first_heading_found = False
    
    # Track sections (H1 and H2 level only)
    sections = []
    
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        
        # Check if this is a heading
        if stripped.startswith('#'):
            first_heading_found = True
            
            # Handle any content before first heading
            if content_before_first_heading and not sections:
                content_text = ''.join(content_before_first_heading)
                if content_text.strip():
                    section_node = DocumentNode(
                        level=0,
                        title="Content Before First Heading",
                        content=content_text,
                        tokens=count_tokens(content_text),
                        start_line=0,
                        end_line=i - 1
                    )
                    root.add_child(section_node)
                    sections.append(section_node)
                content_before_first_heading = []
            
            # Determine heading level
            level = 0
            for char in stripped:
                if char == '#':
                    level += 1
                else:
                    break
            
            if 1 <= level <= 4:
                heading_text = stripped[level:].strip()
                
                # Close all nodes at this level and deeper
                # (they end before this new heading)
                nodes_to_close = []
                while len(current_path) > 1 and current_path[-1].level >= level:
                    node_to_close = current_path.pop()
                    nodes_to_close.append(node_to_close)
                
                # Process nodes to close and check for sections between parent and first child
                for closed_node in reversed(nodes_to_close):
                    # Set end line first
                    closed_node.end_line = i - 1
                    
                    # ONLY check for content between parent and FIRST child
                    # Content after last child belongs to that last child (not a separate section)
                    if closed_node.children:
                        first_child = closed_node.children[0]
                        # Check if there's content between the parent heading and first child heading
                        if first_child.start_line > closed_node.start_line + 1:
                            orphan_start = closed_node.start_line + 1
                            orphan_end = first_child.start_line - 1
                            orphan_content = ''.join(lines[orphan_start:orphan_end + 1]).strip()
                            
                            if orphan_content:
                                # Create section node
                                orphan_node = DocumentNode(
                                    level=0,
                                    title="---section---",
                                    content=orphan_content,
                                    tokens=count_tokens(orphan_content),
                                    start_line=orphan_start,
                                    end_line=orphan_end
                                )
                                closed_node.children.insert(0, orphan_node)
                                orphan_node.parent = closed_node
                    
                    # Calculate tokens for the closed node (includes all content, including children)
                    # Content after last child is included in the parent's total content
                    closed_node.content = ''.join(lines[closed_node.start_line:closed_node.end_line + 1])
                    closed_node.tokens = count_tokens(closed_node.content)
                
                # Create new node
                new_node = DocumentNode(
                    level=level,
                    title=heading_text,
                    start_line=i,
                    end_line=i
                )
                
                # Add to appropriate parent
                parent = current_path[-1]
                parent.add_child(new_node)
                current_path.append(new_node)
                
                # If this is H1 or H2, add to sections list
                if level <= 2:
                    sections.append(new_node)
        
        else:
            # Regular content line
            if not first_heading_found:
                content_before_first_heading.append(line)
            # Content is automatically included when we close nodes
        
        i += 1
    
    # Close all remaining nodes and check for sections between parent and first child
    while len(current_path) > 1:
        closed_node = current_path.pop()
        
        # ONLY check for content between parent and FIRST child
        # Content after last child belongs to that last child (not a separate section)
        if closed_node.children:
            first_child = closed_node.children[0]
            if first_child.start_line > closed_node.start_line + 1:
                orphan_start = closed_node.start_line + 1
                orphan_end = first_child.start_line - 1
                orphan_content = ''.join(lines[orphan_start:orphan_end + 1]).strip()
                
                if orphan_content:
                    orphan_node = DocumentNode(
                        level=0,
                        title="---section---",
                        content=orphan_content,
                        tokens=count_tokens(orphan_content),
                        start_line=orphan_start,
                        end_line=orphan_end
                    )
                    closed_node.children.insert(0, orphan_node)
                    orphan_node.parent = closed_node
        
        # Set end line and calculate tokens
        # Content after last child is included in the parent's content
        closed_node.end_line = i - 1
        closed_node.content = ''.join(lines[closed_node.start_line:closed_node.end_line + 1])
        closed_node.tokens = count_tokens(closed_node.content)
    
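    # Handle a document with no headings at all: this buffer is emptied at the
    # first heading, so it is only still populated here if no heading was found
    # (the node title below is kept unchanged for compatibility)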
    if content_before_first_heading:
        content_text = ''.join(content_before_first_heading)
        if content_text.strip():
            section_node = DocumentNode(
                level=0,
                title="Content After Last Heading",
                content=content_text,
                tokens=count_tokens(content_text),
                start_line=len(lines) - len(content_before_first_heading),
                end_line=len(lines) - 1
            )
            root.add_child(section_node)
            sections.append(section_node)
    
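    # Calculate statistics
    # NOTE: each H1's token count already includes its H2 children, so summing over
    # `sections` (H1 and H2 together) counts nested H2 content twice
    total_tokens = sum(node.tokens for node in sections)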
    h1_count = sum(1 for node in sections if node.level == 1)
    h2_count = sum(1 for node in sections if node.level == 2)
    h3_count = sum(1 for node in _get_all_nodes(root) if node.level == 3)
    h4_count = sum(1 for node in _get_all_nodes(root) if node.level == 4)
    orphan_sections = sum(1 for node in _get_all_nodes(root) if node.title == "---section---")
    
    stats = {
        'total_sections': len(sections),
        'h1_sections': h1_count,
        'h2_sections': h2_count,
        'h3_headings': h3_count,
        'h4_headings': h4_count,
        'orphan_sections': orphan_sections,
        'total_tokens': total_tokens,
        'avg_tokens_per_section': total_tokens / len(sections) if sections else 0
    }
    
    return {
        'tree': root,
        'sections': sections,
        'stats': stats
    }

def _get_all_nodes(node: DocumentNode) -> List[DocumentNode]:
    """Recursively get all nodes in the tree"""
    nodes = [node]
    for child in node.children:
        nodes.extend(_get_all_nodes(child))
    return nodes

def print_tree(node: DocumentNode, max_depth: Optional[int] = None, current_depth: int = 0, show_tokens: bool = True):
    """
    Print the document tree structure.
    
    Args:
        node: Root node to start printing from
        max_depth: Maximum depth to print (None = all levels)
        current_depth: Current depth in recursion
        show_tokens: Whether to show token counts
    """
    if max_depth is not None and current_depth > max_depth:
        return
    
    if node.level > 0 or node.title != "Document Root":
        indent = "  " * current_depth
        prefix = f"H{node.level}" if node.level > 0 else "Section"
        token_str = f" ({node.tokens:,} tokens)" if show_tokens else ""
        title_display = node.title[:80] + "..." if len(node.title) > 80 else node.title
        print(f"{indent}{prefix}: {title_display}{token_str}")
    
    for child in node.children:
        print_tree(child, max_depth, current_depth + 1, show_tokens)

# Read the markdown file
# Build the path with pathlib so it has no backslash escapes and works on any OS
markdown_file = Path("output") / "Kenya-National-Clinical-Guidelines-for-the-Management-of-Diabetes-2nd-Editiion-2018-1_with_images_edited_hierarchy_fixed_final.md"

print(f"📄 Reading and parsing: {markdown_file}\n")

# Parse the document tree
document_data = parse_markdown_tree(markdown_file)

# Display statistics
stats = document_data['stats']
print("=" * 80)
print("DOCUMENT STATISTICS")
print("=" * 80)
print(f"Total Sections (H1 & H2): {stats['total_sections']}")
print(f"  - H1 Sections: {stats['h1_sections']}")
print(f"  - H2 Sections: {stats['h2_sections']}")
print(f"H3 Headings: {stats['h3_headings']}")
print(f"H4 Headings: {stats['h4_headings']}")
print(f"Orphan Sections (content not under headings): {stats['orphan_sections']}")
print(f"Total Tokens: {stats['total_tokens']:,}")
print(f"Average Tokens per Section: {stats['avg_tokens_per_section']:,.0f}")
print("=" * 80)
print()

# Display the document tree
print("DOCUMENT TREE STRUCTURE")
print("=" * 80)
print_tree(document_data['tree'], show_tokens=True)
print("=" * 80)


📄 Reading and parsing: output\Kenya-National-Clinical-Guidelines-for-the-Management-of-Diabetes-2nd-Editiion-2018-1_with_images_edited_hierarchy_fixed_final.md

DOCUMENT STATISTICS
Total Sections (H1 & H2): 57
  - H1 Sections: 18
  - H2 Sections: 38
H3 Headings: 31
H4 Headings: 0
Orphan Sections (content not under headings): 8
Total Tokens: 120,679
Average Tokens per Section: 2,117

DOCUMENT TREE STRUCTURE
  Section: Content Before First Heading (465 tokens)
  H1: TABLE OF CONTENT (2,077 tokens)
  H1: LIST OF FIGURES (670 tokens)
  H1: LIST OF TABLES (1,093 tokens)
  H1: ACRONYMS (480 tokens)
  H1: FOREWORD (860 tokens)
  H1: PREFACE (458 tokens)
  H1: ACKNOWLEDGEMENTS (477 tokens)
  H1: EXECUTIVE SUMMARY (862 tokens)
  H1: CHAPTER ONE: INTRODUCTION TO DIABETES (3,149 tokens)
    H2: 1.1. Definition (71 tokens)
    H2: 1.2. Pathophysiology (784 tokens)
      Section: ---section--- (305 tokens)
      H3: 1.2.1. Pathogenesis and pathophysiology of type 1 diabetes (238 tokens)
      H3
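
With per-node token counts available, the tree can be walked to flag leaf nodes that would not fit a given embedding limit. The sketch below uses a minimal `Node` stand-in rather than the notebook's `DocumentNode`, rebuilds only the part of the tree visible in the output above, and assumes a 256-token limit; `oversized_leaves` is a hypothetical helper. Note also that the counts above come from `cl100k_base`, which is an approximation for an embedding model that uses a different tokenizer.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Minimal stand-in for DocumentNode: title, tokens, and children only."""
    title: str
    tokens: int
    children: List["Node"] = field(default_factory=list)


def oversized_leaves(node: Node, limit: int) -> Iterator[Node]:
    """Yield leaf nodes whose token count exceeds the limit."""
    if not node.children:
        if node.tokens > limit:
            yield node
        return
    for child in node.children:
        yield from oversized_leaves(child, limit)


# Partial tree, copied from the visible part of the output above.
tree = Node("CHAPTER ONE: INTRODUCTION TO DIABETES", 3149, [
    Node("1.1. Definition", 71),
    Node("1.2. Pathophysiology", 784, [
        Node("---section---", 305),
        Node("1.2.1. Pathogenesis and pathophysiology of type 1 diabetes", 238),
    ]),
])
flagged = [n.title for n in oversized_leaves(tree, limit=256)]
print(flagged)  # ['---section---']
```

In this partial tree the only oversized leaf is the orphan section under 1.2, which suggests that orphan content needs the same size checks as headed sections.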

In [3]:
import json
import re
from typing import List, Dict, Optional, Any

# ============================================================================
# HELPER FUNCTIONS FOR JSON CONVERSION
# ============================================================================

def generate_slug(title: str) -> str:
    """
    Convert a title to a URL-friendly slug.
    
    Args:
        title: Title string
        
    Returns:
        URL-friendly slug (lowercase, hyphens, alphanumeric)
    """
    # Convert to lowercase
    slug = title.lower()
    
    # Remove special characters and replace spaces/punctuation with hyphens
    slug = re.sub(r'[^\w\s-]', '', slug)
    slug = re.sub(r'[-\s]+', '-', slug)
    
    # Remove leading/trailing hyphens
    slug = slug.strip('-')
    
    # Limit length
    slug = slug[:100]
    
    return slug

def extract_chapter_number(title: str) -> Optional[str]:
    """
    Extract chapter number from title (e.g., "CHAPTER ONE" -> "1", "CHAPTER TWO" -> "2").
    
    Args:
        title: Chapter title
        
    Returns:
        Chapter number as string, or None if not found
    """
    # Match patterns like "CHAPTER ONE", "CHAPTER TWO", "CHAPTER 1", etc.
    patterns = [
        r'CHAPTER\s+(ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)\s*:',
        r'CHAPTER\s+(\d+)',
        r'^CHAPTER\s+(ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)',
        r'^CHAPTER\s+(\d+)',
    ]
    
    # Number word mapping
    number_words = {
        'ONE': '1', 'TWO': '2', 'THREE': '3', 'FOUR': '4', 'FIVE': '5',
        'SIX': '6', 'SEVEN': '7', 'EIGHT': '8', 'NINE': '9', 'TEN': '10'
    }
    
    for pattern in patterns:
        match = re.search(pattern, title.upper())
        if match:
            number = match.group(1)
            # Convert word to number if needed
            return number_words.get(number, number)
    
    return None

def extract_section_number(title: str) -> Optional[str]:
    """
    Extract section number from title (e.g., "1.1. Definition" -> "1.1", "2.3.1. Subsection" -> "2.3.1").
    
    Args:
        title: Section title
        
    Returns:
        Section number as string, or None if not found
    """
    # Match patterns like "1.1", "2.3.1", "1.2.3.4", etc.
    match = re.match(r'^(\d+(?:\.\d+)*)', title.strip())
    if match:
        return match.group(1)
    
    return None

def is_front_matter(title: str) -> bool:
    """
    Determine if an H1 title is front matter (not a chapter).
    
    Args:
        title: H1 title
        
    Returns:
        True if front matter, False if chapter
    """
    title_upper = title.upper()
    
    front_matter_keywords = [
        'TABLE OF CONTENT',
        'LIST OF FIGURES',
        'LIST OF TABLES',
        'ACRONYMS',
        'FOREWORD',
        'PREFACE',
        'ACKNOWLEDGEMENTS',
        'ACKNOWLEDGMENTS',
        'EXECUTIVE SUMMARY',
        'REFERENCES',
        'APPENDICES',
        'APPENDIX',
        'CONTENT BEFORE FIRST HEADING',
        'CONTENT AFTER LAST HEADING'
    ]
    
    # Check if title contains any front matter keywords
    for keyword in front_matter_keywords:
        if keyword in title_upper:
            return True
    
    # Check if it's a chapter (starts with "CHAPTER")
    if title_upper.startswith('CHAPTER'):
        return False
    
    # Default: treat as front matter if doesn't start with CHAPTER
    return True

def build_breadcrumb(node: DocumentNode) -> List[str]:
    """
    Build breadcrumb trail from root to node.
    
    Args:
        node: DocumentNode to build breadcrumb for
        
    Returns:
        List of titles from root to node
    """
    breadcrumb = []
    current = node
    
    # Walk up the tree, collecting titles (skip root)
    while current and current.parent and current.title != "Document Root":
        if current.title != "---section---":  # Skip orphan sections in breadcrumb
            breadcrumb.insert(0, current.title)
        current = current.parent
    
    return breadcrumb

def generate_url_path(node: DocumentNode, is_chapter: bool = False, parent_slug: Optional[str] = None) -> str:
    """
    Generate full URL path for a section.
    
    Args:
        node: DocumentNode to generate URL for
        is_chapter: Whether this is a chapter (H1)
        parent_slug: Slug of parent section (for nested paths)
        
    Returns:
        Full URL path (e.g., "/guidelines/chapter-one-introduction-to-diabetes")
    """
    base_path = "/guidelines"
    
    if is_chapter:
        slug = generate_slug(node.title)
        # Avoid a doubled "chapter-" prefix: use the slug as-is if it already has one
        if slug.startswith("chapter-"):
            return f"{base_path}/{slug}"
        else:
            return f"{base_path}/chapter-{slug}"
    else:
        slug = generate_slug(node.title)
        if parent_slug:
            return f"{base_path}/{parent_slug}/{slug}"
        else:
            return f"{base_path}/{slug}"

def generate_id(node: DocumentNode, node_type: str = "section", number: Optional[str] = None, parent_id: Optional[str] = None) -> str:
    """
    Generate unique ID for a section.
    
    Args:
        node: DocumentNode
        node_type: Type prefix ("chapter", "frontmatter", "section", "subsection")
        number: Optional number (e.g., "1", "1.1", "2.3.1")
        parent_id: Optional parent ID for nested IDs
        
    Returns:
        Unique ID string
    """
    if number:
        # Use number in ID if available
        clean_number = number.replace('.', '-')
        return f"{node_type}-{clean_number}"
    else:
        # Use slug for ID
        slug = generate_slug(node.title)
        if parent_id:
            return f"{parent_id}-{slug}"
        else:
            return f"{node_type}-{slug}"

print("✓ Helper functions defined")


✓ Helper functions defined
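
A quick check of two of the helpers on titles from this document shows what the slugs and section numbers look like. The two functions are restated in condensed form so the block runs on its own; the logic is intended to match the definitions above.

```python
import re
from typing import Optional


def generate_slug(title: str) -> str:
    # Same steps as the helper above, condensed.
    slug = re.sub(r"[^\w\s-]", "", title.lower())
    slug = re.sub(r"[-\s]+", "-", slug)
    return slug.strip("-")[:100]


def extract_section_number(title: str) -> Optional[str]:
    match = re.match(r"^(\d+(?:\.\d+)*)", title.strip())
    return match.group(1) if match else None


titles = [
    "CHAPTER ONE: INTRODUCTION TO DIABETES",
    "1.1. Definition",
    "1.2.1. Pathogenesis and pathophysiology of type 1 diabetes",
]
results = {t: (generate_slug(t), extract_section_number(t)) for t in titles}
for title, (slug, number) in results.items():
    print(f"{number!s:>6} | {slug}")
```

One behavior worth knowing: the slug step removes dots, so "1.1. Definition" becomes `11-definition`. The section number itself is still recovered separately by `extract_section_number`, so citations are not affected, but two titles that differ only in where the dots fall would produce the same slug.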


In [4]:
# ============================================================================
# MAIN CONVERSION FUNCTION
# ============================================================================

def process_subsections(node: DocumentNode, parent_id: str, parent_slug: str, parent_breadcrumb: List[str]) -> tuple[List[Dict], Optional[Dict]]:
    """
    Process H3 subsections and the H4 subsections nested inside them.
    
    Args:
        node: Parent node containing subsections
        parent_id: ID of parent section
        parent_slug: Slug of parent section
        parent_breadcrumb: Breadcrumb of parent section
        
    Returns:
        Tuple of (list of subsection dictionaries, intro content dictionary or None)
    """
    subsections = []
    intro_content = None
    
    for child in node.children:
        if child.title == "---section---":
            # This is an orphan section - convert to introContent
            intro_content = {
                "content": child.content,
                "tokenCount": child.tokens,
                "startLine": child.start_line,
                "endLine": child.end_line
            }
        elif child.level == 3:  # H3
            section_number = extract_section_number(child.title)
            subsection_id = generate_id(child, "subsection", section_number, parent_id)
            subsection_slug = generate_slug(child.title)
            subsection_url = generate_url_path(child, False, parent_slug)
            subsection_breadcrumb = parent_breadcrumb + [child.title]
            
            # Process nested H4 subsections if any
            nested_subsections = []
            nested_intro = None
            for grandchild in child.children:
                if grandchild.title == "---section---":
                    nested_intro = {
                        "content": grandchild.content,
                        "tokenCount": grandchild.tokens,
                        "startLine": grandchild.start_line,
                        "endLine": grandchild.end_line
                    }
                elif grandchild.level == 4:  # H4
                    h4_number = extract_section_number(grandchild.title)
                    h4_id = generate_id(grandchild, "subsection", h4_number, subsection_id)
                    h4_slug = generate_slug(grandchild.title)
                    h4_url = generate_url_path(grandchild, False, subsection_slug)
                    h4_breadcrumb = subsection_breadcrumb + [grandchild.title]
                    
                    nested_subsections.append({
                        "id": h4_id,
                        "level": "h4",
                        "number": h4_number,
                        "title": grandchild.title,
                        "slug": h4_slug,
                        "url": h4_url,
                        "parentId": subsection_id,
                        "breadcrumb": h4_breadcrumb,
                        "tokenCount": grandchild.tokens,
                        "content": grandchild.content,
                        "startLine": grandchild.start_line,
                        "endLine": grandchild.end_line,
                        "introContent": None,
                        "keywords": [],
                        "subsections": []
                    })
            
            subsection = {
                "id": subsection_id,
                "level": "h3",
                "number": section_number,
                "title": child.title,
                "slug": subsection_slug,
                "url": subsection_url,
                "parentId": parent_id,
                "breadcrumb": subsection_breadcrumb,
                "tokenCount": child.tokens,
                "content": child.content,
                "startLine": child.start_line,
                "endLine": child.end_line,
                "introContent": nested_intro,
                "keywords": [],
                "subsections": nested_subsections if nested_subsections else None
            }
            subsections.append(subsection)
    
    return subsections, intro_content

def process_sections(chapter_node: DocumentNode, chapter_id: str, chapter_slug: str) -> tuple[List[Dict], Optional[Dict]]:
    """
    Process H2 sections within a chapter.
    
    Args:
        chapter_node: Chapter node (H1)
        chapter_id: ID of the chapter
        chapter_slug: Slug of the chapter
        
    Returns:
        Tuple of (list of section dictionaries, intro content dictionary or None)
    """
    sections = []
    intro_content = None
    
    for child in chapter_node.children:
        if child.title == "---section---":
            # This is an orphan section - convert to introContent
            intro_content = {
                "content": child.content,
                "tokenCount": child.tokens,
                "startLine": child.start_line,
                "endLine": child.end_line
            }
        elif child.level == 2:  # H2
            section_number = extract_section_number(child.title)
            section_id = generate_id(child, "section", section_number, chapter_id)
            section_slug = generate_slug(child.title)
            section_url = generate_url_path(child, False, chapter_slug)
            section_breadcrumb = build_breadcrumb(child)
            
            # Process subsections (H3/H4)
            subsections, section_intro = process_subsections(child, section_id, section_slug, section_breadcrumb)
            
            section = {
                "id": section_id,
                "level": "h2",
                "number": section_number,
                "title": child.title,
                "slug": section_slug,
                "url": section_url,
                "parentId": chapter_id,
                "breadcrumb": section_breadcrumb,
                "tokenCount": child.tokens,
                "content": child.content,
                "startLine": child.start_line,
                "endLine": child.end_line,
                "introContent": section_intro,
                "keywords": [],
                "subsections": subsections if subsections else None
            }
            sections.append(section)
    
    return sections, intro_content

def process_chapter_or_frontmatter(node: DocumentNode, is_chapter: bool) -> Dict:
    """
    Process a chapter (H1) or front matter item.
    
    Args:
        node: H1 node
        is_chapter: True if chapter, False if front matter
        
    Returns:
        Dictionary representing chapter or front matter item
    """
    if is_chapter:
        chapter_number = extract_chapter_number(node.title)
        chapter_id = generate_id(node, "chapter", chapter_number)
        chapter_slug = generate_slug(node.title)
        chapter_url = generate_url_path(node, True)
        chapter_breadcrumb = build_breadcrumb(node)
        
        # Process sections (H2/H3/H4)
        sections, intro_content = process_sections(node, chapter_id, chapter_slug)
        
        return {
            "id": chapter_id,
            "level": "h1",
            "number": chapter_number,
            "title": node.title,
            "slug": chapter_slug,
            "url": chapter_url,
            "tokenCount": node.tokens,
            "summary": None,  # Can be populated later
            "content": node.content,
            "startLine": node.start_line,
            "endLine": node.end_line,
            "introContent": intro_content,
            "sections": sections if sections else None
        }
    else:
        # Front matter item
        frontmatter_id = generate_id(node, "frontmatter")
        frontmatter_slug = generate_slug(node.title)
        frontmatter_url = generate_url_path(node, False)
        frontmatter_breadcrumb = build_breadcrumb(node)
        
        # Process sections if any (some front matter might have H2s)
        sections, intro_content = process_sections(node, frontmatter_id, frontmatter_slug)
        
        return {
            "id": frontmatter_id,
            "level": "h1",
            "number": None,
            "title": node.title,
            "slug": frontmatter_slug,
            "url": frontmatter_url,
            "tokenCount": node.tokens,
            "content": node.content,
            "startLine": node.start_line,
            "endLine": node.end_line,
            "introContent": intro_content,
            "sections": sections if sections else None
        }

def convert_tree_to_json(document_data: Dict[str, Any], document_title: str = None, document_version: str = None) -> Dict:
    """
    Convert DocumentNode tree to React/FastAPI JSON structure.
    
    Args:
        document_data: Dictionary from parse_markdown_tree() with 'tree', 'sections', 'stats'
        document_title: Document title (falls back to a hard-coded default if not provided)
        document_version: Document version (falls back to a hard-coded default if not provided)
        
    Returns:
        Structured JSON dictionary
    """
    root = document_data['tree']
    stats = document_data['stats']
    
    # Fall back to hard-coded defaults when title/version are not provided
    # (extraction from the filename is not implemented here)
    if not document_title:
        document_title = "Kenya National Clinical Guidelines for Management of Diabetes"
    if not document_version:
        # The source filename is assumed to look like "...2nd-Editiion-2018..."
        document_version = "2nd Edition 2018"
    
    # Separate root children into front matter and chapters
    front_matter = []
    chapters = []
    
    for child in root.children:
        if child.level == 0:  # Special sections like "Content Before First Heading"
            # Treat as front matter
            frontmatter_item = {
                "id": generate_id(child, "frontmatter"),
                "level": "section",
                "number": None,
                "title": child.title,
                "slug": generate_slug(child.title),
                "url": generate_url_path(child, False),
                "tokenCount": child.tokens,
                "content": child.content,
                "startLine": child.start_line,
                "endLine": child.end_line,
                "introContent": None,
                "keywords": []
            }
            front_matter.append(frontmatter_item)
        elif child.level == 1:  # H1
            if is_front_matter(child.title):
                frontmatter_item = process_chapter_or_frontmatter(child, False)
                front_matter.append(frontmatter_item)
            else:
                chapter = process_chapter_or_frontmatter(child, True)
                chapters.append(chapter)
    
    # Build final JSON structure
    result = {
        "document": {
            "title": document_title,
            "version": document_version,
            "totalSections": stats['total_sections'],
            "totalTokens": stats['total_tokens'],
            "frontMatter": front_matter,
            "chapters": chapters
        }
    }
    
    return result

print("✓ Conversion function defined")


✓ Conversion function defined
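
`convert_tree_to_json()` currently falls back to a hard-coded title and version. A minimal sketch of deriving both from a filename is shown below. The helper name `parse_title_and_version`, the example filename, and the assumed naming pattern (an ordinal, a word starting with "Edit", then a four-digit year, all hyphen-separated) are illustrative assumptions and not part of the notebook's pipeline.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def parse_title_and_version(filename: str) -> Tuple[Optional[str], Optional[str]]:
    """Best-effort split of a hyphenated filename into (title, version).

    Hypothetical helper: returns (None, None) when no edition marker is found,
    so the caller can fall back to its own defaults.
    """
    stem = Path(filename).stem
    # Assumed pattern: "<ordinal>-Edit...-<year>", e.g. "2nd-Edition-2018".
    # "Edit\w*" tolerates spelling variants of "Edition" in the filename.
    match = re.search(r"(\d+(?:st|nd|rd|th))-Edit\w*-(\d{4})", stem, flags=re.IGNORECASE)
    if not match:
        return None, None
    # Everything before the edition marker is treated as the title.
    title = stem[:match.start()].strip("-_ ").replace("-", " ")
    version = f"{match.group(1)} Edition {match.group(2)}"
    return (title or None), version


# Usage with a made-up filename
title, version = parse_title_and_version("Clinical-Guidelines-for-Diabetes-2nd-Edition-2018.pdf")
print(title)    # Clinical Guidelines for Diabetes
print(version)  # 2nd Edition 2018
```

The result could be passed as `document_title` and `document_version`, keeping the hard-coded defaults as a fallback when the helper returns `None`.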


In [5]:
# ============================================================================
# CONVERT TREE TO JSON AND EXPORT
# ============================================================================

# Convert document tree to JSON structure
json_structure = convert_tree_to_json(document_data)

# Validate that orphan sections are preserved
def count_orphan_sections_in_json(json_data: Dict) -> int:
    """Count introContent fields in JSON (these represent preserved orphan sections)"""
    count = 0
    
    def count_in_chapter(chapter):
        nonlocal count
        if chapter.get("introContent"):
            count += 1
        for section in chapter.get("sections", []) or []:
            if section.get("introContent"):
                count += 1
            for subsection in section.get("subsections", []) or []:
                if subsection.get("introContent"):
                    count += 1
                for subsubsection in subsection.get("subsections", []) or []:
                    if subsubsection.get("introContent"):
                        count += 1
    
    # Count in front matter
    for item in json_data["document"]["frontMatter"]:
        if item.get("introContent"):
            count += 1
        if item.get("sections"):
            for section in item["sections"]:
                if section.get("introContent"):
                    count += 1
    
    # Count in chapters
    for chapter in json_data["document"]["chapters"]:
        count_in_chapter(chapter)
    
    return count

# Validate orphan sections
original_orphan_count = document_data['stats']['orphan_sections']
json_orphan_count = count_orphan_sections_in_json(json_structure)

print("=" * 80)
print("JSON CONVERSION VALIDATION")
print("=" * 80)
print(f"Original orphan sections in tree: {original_orphan_count}")
print(f"Orphan sections preserved as introContent: {json_orphan_count}")
if original_orphan_count == json_orphan_count:
    print("✓ All orphan sections successfully preserved!")
else:
    print("⚠ Warning: Mismatch in orphan section count!")
print("=" * 80)
print()

# Display summary
print("JSON STRUCTURE SUMMARY")
print("=" * 80)
print(f"Document Title: {json_structure['document']['title']}")
print(f"Document Version: {json_structure['document']['version']}")
print(f"Total Sections: {json_structure['document']['totalSections']}")
print(f"Total Tokens: {json_structure['document']['totalTokens']:,}")
print(f"Front Matter Items: {len(json_structure['document']['frontMatter'])}")
print(f"Chapters: {len(json_structure['document']['chapters'])}")
print("=" * 80)
print()

# Export to JSON file
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)
json_file = output_dir / "document_structure.json"

with open(json_file, "w", encoding="utf-8") as f:
    json.dump(json_structure, f, indent=2, ensure_ascii=False)

print(f"✓ JSON structure exported to: {json_file}")
print(f"  File size: {json_file.stat().st_size / 1024:.2f} KB")


JSON CONVERSION VALIDATION
Original orphan sections in tree: 8
Orphan sections preserved as introContent: 8
✓ All orphan sections successfully preserved!

JSON STRUCTURE SUMMARY
Document Title: Kenya National Clinical Guidelines for Management of Diabetes
Document Version: 2nd Edition 2018
Total Sections: 57
Total Tokens: 120,679
Front Matter Items: 11
Chapters: 8

✓ JSON structure exported to: output\document_structure.json
  File size: 959.70 KB
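
The validation counter above walks chapters to a fixed depth and front matter only one level deep. A recursive variant that treats every node the same way is sketched below. The function names are hypothetical, and the sketch assumes only the keys produced by the conversion step (`introContent`, `sections`, `subsections`). Because it also descends into front matter subsections, its count could differ from the fixed-depth version on documents with deeply nested front matter.

```python
from typing import Any, Dict


def count_intro_content(node: Dict[str, Any]) -> int:
    """Count non-empty introContent fields in a node and all of its descendants."""
    count = 1 if node.get("introContent") else 0
    # Children live under "sections" (H1 nodes) or "subsections" (deeper nodes);
    # either key may be missing or set to None.
    for key in ("sections", "subsections"):
        for child in node.get(key) or []:
            count += count_intro_content(child)
    return count


def count_all_intro_content(json_data: Dict[str, Any]) -> int:
    """Sum introContent counts over front matter items and chapters."""
    doc = json_data["document"]
    return sum(count_intro_content(item) for item in doc["frontMatter"] + doc["chapters"])


# Usage on a small hand-built structure (not the real document)
toy = {
    "document": {
        "frontMatter": [{"introContent": "preface text", "sections": None}],
        "chapters": [
            {
                "introContent": None,
                "sections": [{"introContent": "bridging text", "subsections": None}],
            }
        ],
    }
}
print(count_all_intro_content(toy))  # 2
```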


In [6]:
# Preview a sample of the JSON structure
print("=" * 80)
print("SAMPLE JSON STRUCTURE (First Chapter)")
print("=" * 80)
if json_structure['document']['chapters']:
    first_chapter = json_structure['document']['chapters'][0]
    # Show structure without full content
    sample_chapter = {
        "id": first_chapter["id"],
        "level": first_chapter["level"],
        "number": first_chapter["number"],
        "title": first_chapter["title"],
        "slug": first_chapter["slug"],
        "url": first_chapter["url"],
        "tokenCount": first_chapter["tokenCount"],
        "hasIntroContent": first_chapter.get("introContent") is not None,
        "sectionsCount": len(first_chapter.get("sections", []) or []),
        # The trailing conditional guards the whole dict, so the inner lookups are safe
        "firstSection": {
            "id": first_chapter["sections"][0]["id"],
            "title": first_chapter["sections"][0]["title"],
            "hasIntroContent": first_chapter["sections"][0].get("introContent") is not None,
            "subsectionsCount": len(first_chapter["sections"][0].get("subsections", []) or [])
        } if first_chapter.get("sections") else None
    }
    print(json.dumps(sample_chapter, indent=2, ensure_ascii=False))
print("=" * 80)
print()
print("Sample Front Matter Item:")
if json_structure['document']['frontMatter']:
    first_fm = json_structure['document']['frontMatter'][0]
    sample_fm = {
        "id": first_fm["id"],
        "title": first_fm["title"],
        "slug": first_fm["slug"],
        "url": first_fm["url"],
        "tokenCount": first_fm["tokenCount"]
    }
    print(json.dumps(sample_fm, indent=2, ensure_ascii=False))


SAMPLE JSON STRUCTURE (First Chapter)
{
  "id": "chapter-1",
  "level": "h1",
  "number": "1",
  "title": "CHAPTER ONE: INTRODUCTION TO DIABETES",
  "slug": "chapter-one-introduction-to-diabetes",
  "url": "/guidelines/chapter-one-introduction-to-diabetes",
  "tokenCount": 3149,
  "hasIntroContent": false,
  "sectionsCount": 6,
  "firstSection": {
    "id": "section-1-1",
    "title": "1.1. Definition",
    "hasIntroContent": false,
    "subsectionsCount": 0
  }
}

Sample Front Matter Item:
{
  "id": "frontmatter-content-before-first-heading",
  "title": "Content Before First Heading",
  "slug": "content-before-first-heading",
  "url": "/guidelines/content-before-first-heading",
  "tokenCount": 465
}
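
Since every node carries a `tokenCount`, the exported structure can be checked against an embedding model's limit before anything is embedded. The sketch below lists leaf nodes (nodes without `sections` or `subsections`) whose `tokenCount` exceeds a threshold. The function names are hypothetical, the default of 256 tokens is taken from the all-MiniLM-L6-v2 limit mentioned in the introduction, and the sketch assumes a leaf's `tokenCount` reflects exactly the text that would be embedded for it.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_leaves(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield nodes that have no child sections or subsections."""
    children = (node.get("sections") or []) + (node.get("subsections") or [])
    if not children:
        yield node
        return
    for child in children:
        yield from iter_leaves(child)


def find_oversized_leaves(json_data: Dict[str, Any], max_tokens: int = 256) -> List[Tuple[str, int]]:
    """Return (id, tokenCount) for every leaf whose tokenCount exceeds max_tokens."""
    doc = json_data["document"]
    oversized = []
    for top in doc["frontMatter"] + doc["chapters"]:
        for leaf in iter_leaves(top):
            if leaf["tokenCount"] > max_tokens:
                oversized.append((leaf["id"], leaf["tokenCount"]))
    return oversized


# Usage on a small hand-built structure (ids and counts are illustrative)
toy_doc = {
    "document": {
        "frontMatter": [{"id": "frontmatter-example", "tokenCount": 465}],
        "chapters": [
            {
                "id": "chapter-1",
                "tokenCount": 900,
                "sections": [
                    {"id": "section-1-1", "tokenCount": 120, "subsections": None},
                    {"id": "section-1-2", "tokenCount": 780, "subsections": None},
                ],
            }
        ],
    }
}
for node_id, tokens in find_oversized_leaves(toy_doc):
    print(f"{node_id}: {tokens} tokens exceeds the limit")
```

Any node reported here would need further splitting before embedding; otherwise the model may silently truncate it.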
