
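# Chapter 2: Text Chunking Techniques in Practice (Examples)

This notebook walks through several chunking strategies with runnable examples so you can quickly see the strengths, weaknesses, and appropriate use cases of each.

## 📚 Learning Objectives
- Understand the concepts behind the main chunking strategies (fixed-length, sliding window, structure-based, hierarchical) and how they differ
- Run each technique on short sample texts and compare the results
- Get a feel for over-chunking and under-chunking and the trade-off with context preservation

## 📋 Notebook Outline
- 1️⃣ Environment setup: mount Google Drive in Colab and set paths
- 2️⃣ Fixed-length chunking: split into fixed-size groups of sentences or words
- 3️⃣ Sliding window: keep shared context between chunks by adjusting the overlap ratio
- 4️⃣ Structure-based: split along markdown header and code-block boundaries
- 5️⃣ Hierarchical: attach the section title to each chunk to preserve parent and child context

> ⚠️ Run the environment setup cell (1️⃣) before running any of the exercise cells.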

---
## 1️⃣ Google Colab Environment Setup

In [None]:
# ========================================
# Google Colab environment setup
# ========================================
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Set paths
BASE_DIR = "/content/drive/MyDrive/ostep_rag"
DATA_DIR = os.path.join(BASE_DIR, "data")

print("✅ Colab environment setup complete")
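
Outside Colab, the `google.colab` import in the setup cell fails. The following is a minimal sketch of a fallback, assuming a local working directory is acceptable. The helper name `resolve_base_dir` and the `./ostep_rag` default are hypothetical and not part of the original notebook.

```python
import os

def resolve_base_dir(default_local: str = "./ostep_rag") -> str:
    """Return the Drive path on Colab, or a local folder elsewhere (assumed layout)."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        # Not running on Colab: fall back to a local directory (hypothetical default)
        return os.path.abspath(default_local)
    drive.mount('/content/drive')
    return "/content/drive/MyDrive/ostep_rag"

BASE_DIR = resolve_base_dir()
DATA_DIR = os.path.join(BASE_DIR, "data")
print(f"BASE_DIR = {BASE_DIR}")
```

The chunking cells in this notebook only use inline sample strings, so they should run either way; the paths matter only when data files are loaded.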


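# Text Chunking Practice Notebook

This notebook covers four chunking strategies, **fixed-length**, **sliding window**, **structure-based**, and **hierarchical**, each in its own cell.  
Each section contains (1) an explanation, (2) a sample text, and (3) runnable code.

> Tip: Run the cells in order from top to bottom. Results are printed as a `chunks` list or in tabular form.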



---
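## 2️⃣ Fixed-length Chunking

**Concept**: Split the document **evenly** into units of a fixed size (sentences, words, or characters).  
**Pros**: Simple and fast to implement, and easy to parallelize.  
**Cons**: Context can be cut off mid-thought, and chunks may include information irrelevant to the query.

The example below splits the sample text into fixed-length chunks of 3 sentences each, then into chunks of 15 words each.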


In [10]:
import re
from typing import List

def simple_sent_split(text: str) -> List[str]:
    """
    Split text into sentences using simple rules.
    - Split on periods, question marks, exclamation marks
    - Split on line breaks
    """
    # First split by line breaks
    parts = re.split(r"\n+", text.strip())
    sentences = []
    
    for part in parts:
        # Split on punctuation marks
        split_sentences = re.split(r"(?<=[\.!\?])\s+", part.strip())
        # Filter out empty strings
        split_sentences = [s for s in split_sentences if s]
        sentences.extend(split_sentences)
    
    # Clean up: remove extra whitespace
    return [s.strip() for s in sentences if s.strip()]

def chunk_by_fixed_length(text: str, unit: str = "sentences", size: int = 3) -> List[str]:
    """
    Fixed-length chunking: Split text into equal-sized chunks.
    
    Args:
        text: Input text to chunk
        unit: "sentences", "words", or "chars"
        size: Number of units per chunk
    
    Returns:
        List of text chunks
    """
    # Step 1: Split text into basic units
    if unit == "sentences":
        items = simple_sent_split(text)
    elif unit == "words":
        # Split on whitespace to get words
        items = re.findall(r"\S+", text)
    elif unit == "chars":
        # Split into individual characters
        items = list(text)
    else:
        raise ValueError("unit must be one of: sentences, words, chars")

    # Step 2: Group items into chunks of specified size
    chunks = []
    for i in range(0, len(items), size):
        # Get a slice of items for this chunk
        chunk_items = items[i:i+size]
        
        # Join items back into text: characters are concatenated directly,
        # sentences and words are joined with a single space
        separator = "" if unit == "chars" else " "
        chunks.append(separator.join(chunk_items))
    
    return chunks

def show_chunks(chunks):
    """Display chunks in a numbered format"""
    for i, chunk in enumerate(chunks, 1):
        print(f"[{i}] {chunk}\n")

# Example: Let's chunk a simple text about machine learning
text = """
Machine learning is a subset of artificial intelligence. It focuses on algorithms that can learn from data. The goal is to make predictions or decisions without being explicitly programmed.
There are three main types of machine learning. Supervised learning uses labeled training data. Unsupervised learning finds patterns in unlabeled data. Reinforcement learning learns through trial and error.
Deep learning is a subset of machine learning. It uses neural networks with multiple layers. These networks can learn complex patterns in data. They have been very successful in image recognition and natural language processing.
"""

print("=== Fixed-length chunking by sentences (3 sentences per chunk) ===")
chunks = chunk_by_fixed_length(text, unit="sentences", size=3)
show_chunks(chunks)

print("=== Fixed-length chunking by words (15 words per chunk) ===")
chunks_words = chunk_by_fixed_length(text, unit="words", size=15)
show_chunks(chunks_words[:3])  # Show first 3 chunks only


=== Fixed-length chunking by sentences (3 sentences per chunk) ===
[1] Machine learning is a subset of artificial intelligence. It focuses on algorithms that can learn from data. The goal is to make predictions or decisions without being explicitly programmed.

[2] There are three main types of machine learning. Supervised learning uses labeled training data. Unsupervised learning finds patterns in unlabeled data.

[3] Reinforcement learning learns through trial and error. Deep learning is a subset of machine learning. It uses neural networks with multiple layers.

[4] These networks can learn complex patterns in data. They have been very successful in image recognition and natural language processing.

=== Fixed-length chunking by words (15 words per chunk) ===
[1] Machine learning is a subset of artificial intelligence. It focuses on algorithms that can learn

[2] from data. The goal is to make predictions or decisions without being explicitly programmed. There

[3] are three main ty
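
The word-based output above shows chunks ending in the middle of a sentence. As a rough way to quantify that effect, the sketch below counts how many word-based chunks do not end on sentence punctuation for a few chunk sizes. The helpers `chunk_words` and `count_broken_chunks` are illustrative names introduced here, and "ends with `.`, `!`, or `?`" is only a crude proxy for a sentence boundary.

```python
import re
from typing import List

def chunk_words(text: str, size: int) -> List[str]:
    """Group whitespace-separated tokens into chunks of `size` words."""
    words = re.findall(r"\S+", text)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def count_broken_chunks(chunks: List[str]) -> int:
    """Count chunks whose last character is not sentence-ending punctuation."""
    return sum(1 for c in chunks if not c.endswith((".", "!", "?")))

sample = (
    "Machine learning is a subset of artificial intelligence. "
    "It focuses on algorithms that can learn from data. "
    "The goal is to make predictions or decisions without being explicitly programmed."
)

# Smaller chunks are more precise but cut more sentences; larger ones keep
# sentences whole at the cost of mixing more content into one chunk.
for size in (5, 15, 40):
    chunks = chunk_words(sample, size)
    print(f"size={size:>2}: {len(chunks)} chunks, "
          f"{count_broken_chunks(chunks)} end mid-sentence")
```

Very small sizes lean toward over-chunking (many fragments with little context), while very large sizes lean toward under-chunking (few chunks that mix topics).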


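## 3️⃣ Sliding Window Chunking

**Concept**: Move a fixed-length window across the text so that consecutive chunks **overlap**.  
**Pros**: Adjacent chunks share context, which **minimizes breaks in meaning**. Helpful when a long text shifts topics.  
**Cons**: Overlapping text is embedded more than once, which **increases processing cost**.

The example uses a 20-word window with an overlap ratio of 0.1, so the window advances 18 words at a time and adjacent chunks share 2 words.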
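
To get a feel for the processing cost of overlap, the sketch below estimates how many windows are produced and how many words end up being embedded in total. It assumes the same step rule as this notebook's sliding-window cell (`step = int(window_size * (1 - overlap_ratio))`, full windows only); `window_stats` is a hypothetical helper name.

```python
def window_stats(n_words: int, window_size: int, overlap_ratio: float) -> dict:
    """Estimate window count and redundancy for a word-based sliding window."""
    step = max(1, int(window_size * (1 - overlap_ratio)))
    # Only full windows are counted, matching range(0, n - w + 1, step)
    if n_words < window_size:
        n_windows = 0
    else:
        n_windows = (n_words - window_size) // step + 1
    embedded = n_windows * window_size
    return {
        "step": step,
        "windows": n_windows,
        "embedded_words": embedded,
        # Ratio of embedded words to source words (1.0 means no duplication)
        "redundancy": embedded / n_words if n_words else 0.0,
    }

for ratio in (0.0, 0.25, 0.5):
    print(ratio, window_stats(n_words=100, window_size=20, overlap_ratio=ratio))
```

Higher overlap ratios shrink the step, so the same text yields more windows and more duplicated words to embed.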


In [21]:
import re
from typing import List

def simple_sent_split(text: str) -> List[str]:
    """
    Split text into sentences using simple rules.
    - Split on periods, question marks, exclamation marks
    - Split on line breaks
    """
    # First split by line breaks
    parts = re.split(r"\n+", text.strip())
    sentences = []
    
    for part in parts:
        # Split on punctuation marks
        split_sentences = re.split(r"(?<=[\.!\?])\s+", part.strip())
        # Filter out empty strings
        split_sentences = [s for s in split_sentences if s]
        sentences.extend(split_sentences)
    
    # Clean up: remove extra whitespace
    return [s.strip() for s in sentences if s.strip()]

def chunk_by_sliding_window(text: str, window_size: int = 10, overlap_ratio: float = 0.5) -> List[str]:
    """
    Sliding window chunking: Create overlapping chunks by sliding a window across words.
    
    Args:
        text: Input text to chunk
        window_size: Number of words in each chunk
        overlap_ratio: Ratio of overlap (0.0 = no overlap, 0.5 = 50% overlap, 0.8 = 80% overlap)
    
    Returns:
        List of overlapping text chunks
    """
    # Step 1: Split text into words
    words = re.findall(r"\S+", text)
    
    # Step 2: Calculate step size based on overlap ratio
    # step = window_size * (1 - overlap_ratio)
    step = int(window_size * (1 - overlap_ratio))
    step = max(1, step)  # Ensure step is at least 1
    
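    # Step 3: Create sliding windows
    # Note: only full windows are emitted, so words after the last full window
    # are dropped, and a text shorter than window_size yields no chunks.
    chunks = []
    for i in range(0, len(words) - window_size + 1, step):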
        # Get a window of words
        window_words = words[i:i+window_size]
        # Join them into a chunk
        chunk = " ".join(window_words)
        chunks.append(chunk)
    
    return chunks

def show_chunks(chunks):
    """Display chunks in a numbered format"""
    for i, chunk in enumerate(chunks, 1):
        print(f"[{i}] {chunk}\n")

# Example: Let's use sliding window on a text about data science
text = """
Data science combines statistics and computer science. It helps us find patterns in large datasets. The goal is to extract meaningful insights from data.
Machine learning is a key tool in data science. It can predict future outcomes based on past data. Popular algorithms include linear regression and decision trees.
Data visualization makes insights easier to understand. Charts and graphs help communicate findings. Tools like matplotlib and seaborn are commonly used.
Big data refers to datasets that are too large for traditional processing. Distributed computing frameworks like Hadoop help handle big data. Cloud platforms provide scalable storage and processing.
"""

print("=== Sliding Window Chunking (Word-based) ===")
print("Window size: 20 words, Overlap ratio: 0.1 (10% overlap)")
print("Notice how adjacent chunks share words for better context!\n")

chunks = chunk_by_sliding_window(text, window_size=20, overlap_ratio=0.1)
show_chunks(chunks)


=== Sliding Window Chunking (Word-based) ===
Window size: 20 words, Overlap ratio: 0.1 (10% overlap)
Notice how adjacent chunks share words for better context!

[1] Data science combines statistics and computer science. It helps us find patterns in large datasets. The goal is to extract

[2] to extract meaningful insights from data. Machine learning is a key tool in data science. It can predict future outcomes

[3] future outcomes based on past data. Popular algorithms include linear regression and decision trees. Data visualization makes insights easier to

[4] easier to understand. Charts and graphs help communicate findings. Tools like matplotlib and seaborn are commonly used. Big data refers

[5] data refers to datasets that are too large for traditional processing. Distributed computing frameworks like Hadoop help handle big data.
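
In the output above, the final sentence of the sample text ("Cloud platforms provide scalable storage and processing.") does not appear in any chunk, because the loop only emits full windows. One possible variant, sketched below under the assumption that a shorter final chunk is acceptable, keeps the trailing words. `sliding_window_keep_tail` is a hypothetical name, not part of the original notebook.

```python
import re
from typing import List

def sliding_window_keep_tail(text: str, window_size: int = 20,
                             overlap_ratio: float = 0.1) -> List[str]:
    """Word-based sliding window that also emits a final, possibly shorter, window."""
    words = re.findall(r"\S+", text)
    step = max(1, int(window_size * (1 - overlap_ratio)))
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + window_size]))
        # Stop once the current window already reaches the end of the text
        if i + window_size >= len(words):
            break
        i += step
    return chunks

demo = " ".join(f"w{i}" for i in range(27))
for chunk in sliding_window_keep_tail(demo, window_size=10, overlap_ratio=0.5):
    print(chunk)
```

The trade-off is that the last chunk can be much shorter than the others, which may matter if downstream steps assume roughly uniform chunk sizes.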




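## 4️⃣ Structure-based Chunking (Structure-aware)

**Concept**: Split on the document's **formal structure** (markdown headers, code blocks, HTML tags, and so on).  
**Pros**: Relatively simple to implement, since it relies directly on structure already present in the document.  
**Cons**: Of limited use for documents whose structure is hard to detect, and the same structural element can carry different meaning from one document to another.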
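
The notebook's own example works on markdown. For the HTML case mentioned above, the same idea might look like the following sketch, which starts a new chunk at every `<h1>` to `<h6>` tag using Python's standard `html.parser` module. The class `HeadingSplitter` and function `chunk_by_html_headings` are illustrative names, and the whitespace handling is deliberately simple.

```python
from html.parser import HTMLParser
from typing import List, Tuple

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HeadingSplitter(HTMLParser):
    """Collect (title, content) pairs, starting a new chunk at each heading tag."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[Tuple[str, str]] = []
        self._title = ""
        self._title_parts: List[str] = []
        self._buffer: List[str] = []
        self._in_heading = False

    def flush(self) -> None:
        # Join text pieces with spaces and collapse runs of whitespace
        content = " ".join(" ".join(self._buffer).split())
        if content:  # sections with no body text are skipped
            self.chunks.append((self._title, content))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.flush()  # close the previous section under its own title
            self._title_parts = []
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._title = " ".join(" ".join(self._title_parts).split())
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._title_parts.append(data)
        else:
            self._buffer.append(data)

def chunk_by_html_headings(html: str) -> List[Tuple[str, str]]:
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final section
    return parser.chunks

html_doc = ("<h1>Intro</h1><p>First para.</p><p>Second.</p>"
            "<h2>Details</h2><p>More text.</p>")
for title, content in chunk_by_html_headings(html_doc):
    print(f"<{title}> {content}")
```

As with markdown, this only works when the page actually uses heading tags for its sections, which is the limitation noted in the cons above.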


In [13]:
import re
from typing import List, Tuple

def chunk_by_structure_markdown(text: str) -> List[Tuple[str, str]]:
    """
    Structure-based chunking: Split text based on markdown headers and code blocks.
    
    Args:
        text: Input markdown text to chunk
    
    Returns:
        List of (title, content) tuples. Empty title if no header found.
    """
    lines = text.splitlines()
    chunks = []
    current_content = []
    current_title = ""
    in_code_block = False
    
    for line in lines:
        # Check if we're entering or exiting a code block
        if line.strip().startswith("```"):
            in_code_block = not in_code_block
            current_content.append(line)
            continue
        
        # Check if this is a header (not inside code block)
        if not in_code_block and re.match(r"^\s{0,3}#{1,6}\s", line):
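            # Save previous chunk if the buffer is non-empty.
            # Note: blank lines count as buffered content, so blank lines
            # before the first header produce an empty untitled chunk.
            if current_content: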
                content = "\n".join(current_content).strip()
                chunks.append((current_title, content))
                current_content = []
            
            # Extract title from header
            current_title = re.sub(r"^\s{0,3}#{1,6}\s*", "", line).strip()
        else:
            # Regular content line
            current_content.append(line)
    
    # Don't forget the last chunk
    if current_content:
        content = "\n".join(current_content).strip()
        chunks.append((current_title, content))
    
    return chunks

# Example: Let's chunk a markdown document about Python programming
markdown_text = """
# Python Basics

Python is a high-level programming language. It's known for its simple syntax and readability. Python supports multiple programming paradigms.

## Variables and Data Types

Variables in Python don't need explicit declaration. You can assign values directly. Common data types include integers, floats, strings, and booleans.

```python
# Example of variable assignment
name = "Alice"
age = 25
is_student = True
```

## Control Structures

Python uses indentation to define code blocks. This makes the code more readable. Common control structures include if-else statements and loops.

```python
# Example of if-else statement
if age >= 18:
    print("Adult")
else:
    print("Minor")
```

## Functions

Functions help organize code into reusable blocks. They can take parameters and return values. Python has many built-in functions.

```python
def greet(name):
    return f"Hello, {name}!"
```

# Best Practices

Good Python code follows certain conventions. Use meaningful variable names. Write clear comments. Keep functions small and focused.
"""

print("=== Structure-based Chunking ===")
print("This method splits text based on markdown headers and preserves code blocks.\n")

blocks = chunk_by_structure_markdown(markdown_text)

for i, (title, content) in enumerate(blocks, 1):
    print(f"[{i}] <{title or 'No Title'}>")
    print(content)
    print()


=== Structure-based Chunking ===
This method splits text based on markdown headers and preserves code blocks.

[1] <No Title>


[2] <Python Basics>
Python is a high-level programming language. It's known for its simple syntax and readability. Python supports multiple programming paradigms.

[3] <Variables and Data Types>
Variables in Python don't need explicit declaration. You can assign values directly. Common data types include integers, floats, strings, and booleans.

```python
# Example of variable assignment
name = "Alice"
age = 25
is_student = True
```

[4] <Control Structures>
Python uses indentation to define code blocks. This makes the code more readable. Common control structures include if-else statements and loops.

```python
# Example of if-else statement
if age >= 18:
    print("Adult")
else:
    print("Minor")
```

[5] <Functions>
Functions help organize code into reusable blocks. They can take parameters and return values. Python has many built-in functions.

```python



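## 5️⃣ Hierarchical Chunking

**Concept**: Recognize the document's logical **hierarchy** and manage parent and child content together.  
Each chunk is designed to **carry its section title as a header** for context, which suits section-based retrieval, summarization, and indexing.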
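
One way to keep more of the hierarchy is to attach the full heading path (for example `Architecture > Retrieval Component`) rather than only the nearest title. The sketch below is one possible approach and an assumption on my part, not the notebook's method: it tracks heading levels with a stack. `chunk_with_heading_path` is a hypothetical name.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # markdown code-fence marker

def chunk_with_heading_path(text: str) -> List[Tuple[str, str]]:
    """Return (heading_path, content) pairs for a markdown document."""
    path: List[Tuple[int, str]] = []  # stack of (level, title)
    chunks: List[Tuple[str, str]] = []
    buffer: List[str] = []
    in_code_block = False

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:  # skip sections that contain only blank lines
            chunks.append((" > ".join(title for _, title in path), content))
        buffer.clear()

    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_code_block = not in_code_block
            buffer.append(line)
            continue
        match = None if in_code_block else re.match(r"^\s{0,3}(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing this one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks

doc = "# A\n\nIntro text.\n\n## B\n\nSub text.\n\n# C\n\nLast."
for heading_path, content in chunk_with_heading_path(doc):
    print(f"<{heading_path}> {content}")
```

Each section's content could then be split further by sentence count, as the notebook's `leaf_sent_limit` parameter does.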


In [14]:
import re
from typing import List, Tuple

def simple_sent_split(text: str) -> List[str]:
    """
    Split text into sentences using simple rules.
    - Split on periods, question marks, exclamation marks
    - Split on line breaks
    """
    # First split by line breaks
    parts = re.split(r"\n+", text.strip())
    sentences = []
    
    for part in parts:
        # Split on punctuation marks
        split_sentences = re.split(r"(?<=[\.!\?])\s+", part.strip())
        # Filter out empty strings
        split_sentences = [s for s in split_sentences if s]
        sentences.extend(split_sentences)
    
    # Clean up: remove extra whitespace
    return [s.strip() for s in sentences if s.strip()]

def chunk_by_structure_markdown(text: str) -> List[Tuple[str, str]]:
    """
    Structure-based chunking: Split text based on markdown headers and code blocks.
    
    Args:
        text: Input markdown text to chunk
    
    Returns:
        List of (title, content) tuples. Empty title if no header found.
    """
    lines = text.splitlines()
    chunks = []
    current_content = []
    current_title = ""
    in_code_block = False
    
    for line in lines:
        # Check if we're entering or exiting a code block
        if line.strip().startswith("```"):
            in_code_block = not in_code_block
            current_content.append(line)
            continue
        
        # Check if this is a header (not inside code block)
        if not in_code_block and re.match(r"^\s{0,3}#{1,6}\s", line):
            # Save previous chunk if it has content
            if current_content:
                content = "\n".join(current_content).strip()
                chunks.append((current_title, content))
                current_content = []
            
            # Extract title from header
            current_title = re.sub(r"^\s{0,3}#{1,6}\s*", "", line).strip()
        else:
            # Regular content line
            current_content.append(line)
    
    # Don't forget the last chunk
    if current_content:
        content = "\n".join(current_content).strip()
        chunks.append((current_title, content))
    
    return chunks

def chunk_hierarchical(text: str, section_depth: int = 2, leaf_sent_limit: int = 3) -> List[Tuple[str, str]]:
    """
    Hierarchical chunking: Create chunks that preserve document structure.
    
    This method:
    1. First splits by major sections (headers)
    2. Then splits each section into smaller chunks by sentences
    3. Each chunk includes the section title for context
    
    Args:
        text: Input markdown text to chunk
        section_depth: How deep to go in the hierarchy (not used in this simple version)
        leaf_sent_limit: Maximum sentences per leaf chunk
    
    Returns:
        List of (section_title, chunk_content) tuples
    """
    # Step 1: Split by major sections using markdown headers
    sections = chunk_by_structure_markdown(text)
    
    # Step 2: Split each section into smaller chunks
    hierarchical_chunks = []
    
    for section_title, section_content in sections:
        # Split section content into sentences
        sentences = simple_sent_split(section_content)
        
        # Create chunks of sentences within this section
        for i in range(0, len(sentences), leaf_sent_limit):
            # Get a chunk of sentences
            chunk_sentences = sentences[i:i+leaf_sent_limit]
            chunk_content = " ".join(chunk_sentences)
            
            # Store with section title for context
            hierarchical_chunks.append((section_title, chunk_content))
    
    return hierarchical_chunks

# Example: Let's create hierarchical chunks from a technical document
document = """
# Introduction

Retrieval-Augmented Generation (RAG) combines search and generation. It enhances language models with external knowledge. The main advantage is improved accuracy and up-to-date information.

# Architecture

RAG systems have two main components. The retriever finds relevant documents. The generator creates responses based on retrieved content. This separation allows for better control and optimization.

## Retrieval Component

The retriever uses vector similarity search. Documents are converted to embeddings. Queries are also converted to embeddings. Similarity is measured using cosine distance.

## Generation Component

The generator is typically a large language model. It takes retrieved documents as context. The model generates responses based on this context. Fine-tuning can improve performance.

# Implementation

Building a RAG system requires several steps. First, prepare and index your documents. Second, implement the retrieval mechanism. Third, integrate with a language model.

## Document Processing

Documents need to be cleaned and chunked. Chunking strategies affect retrieval quality. Smaller chunks provide more precise matches. Larger chunks provide more context.

## Vector Database

A vector database stores document embeddings. Popular options include Pinecone and Weaviate. The database must support similarity search. Performance depends on indexing strategy.

# Best Practices

Good RAG systems follow certain principles. Use high-quality source documents. Implement proper chunking strategies. Monitor and evaluate system performance regularly.
"""

print("=== Hierarchical Chunking ===")
print("This method preserves document structure by keeping section titles with each chunk.\n")

hierarchical_chunks = chunk_hierarchical(document, section_depth=2, leaf_sent_limit=2)

for i, (section_title, chunk_content) in enumerate(hierarchical_chunks, 1):
    print(f"[{i}] <{section_title}>")
    print(chunk_content)
    print()


=== Hierarchical Chunking ===
This method preserves document structure by keeping section titles with each chunk.

[1] <Introduction>
Retrieval-Augmented Generation (RAG) combines search and generation. It enhances language models with external knowledge.

[2] <Introduction>
The main advantage is improved accuracy and up-to-date information.

[3] <Architecture>
RAG systems have two main components. The retriever finds relevant documents.

[4] <Architecture>
The generator creates responses based on retrieved content. This separation allows for better control and optimization.

[5] <Retrieval Component>
The retriever uses vector similarity search. Documents are converted to embeddings.

[6] <Retrieval Component>
Queries are also converted to embeddings. Similarity is measured using cosine distance.

[7] <Generation Component>
The generator is typically a large language model. It takes retrieved documents as context.

[8] <Generation Component>
The model generates responses based on this 