# Document Segmentation

In this guide, we demonstrate how to segment a document using structured output from an LLM. We use command-r-plus, one of Cohere's latest LLMs with a 128k context length, and test the approach on an article explaining the Transformer architecture. The same approach applies to any other domain where a long, complex document needs to be broken down into smaller chunks.

## Motivation

Sometimes we need a way to split a document into meaningful parts, each centered around a single key concept or idea. Simple length-based or rule-based text splitters are not reliable enough for this. Consider documents that contain code snippets or math equations: we don't want to split those on '\n\n', nor do we want to write extensive rules for every type of document. It turns out that LLMs with a sufficiently long context length are well suited for this task.
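
To make the problem concrete, here is a minimal sketch of what a blank-line splitter does to a snippet of code. The sample text is invented for illustration and is not taken from the article used later in this guide.

```python
# A tiny, made-up document: one sentence of prose followed by a
# code snippet that happens to contain a blank line.
sample = (
    "Self-attention scores are computed per token.\n"
    "\n"
    "def attention(q, k):\n"
    "    scores = q @ k.T\n"
    "\n"
    "    return scores\n"
)

# Naive rule: every blank line starts a new chunk.
chunks = [c for c in sample.split("\n\n") if c.strip()]

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

The function body ends up in two different chunks, so neither chunk contains the complete snippet.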

## Defining the Data Structures

First, we define a Section class that describes a single segment of the document. A StructuredDocument class then encapsulates a list of these sections.

Note that to avoid having the LLM regenerate the content of each section, we can simply number each line of the input document and ask the LLM to segment it by returning the start and end line numbers of each section.

In [1]:
from pydantic import BaseModel, Field
from typing import List

In [2]:
class Section(BaseModel):
    title: str = Field(description="main topic of this section of the document")
    start_index: int = Field(description="line number where the section begins")
    end_index: int = Field(description="line number where the section ends")


class StructuredDocument(BaseModel):
    """obtains meaningful sections, each centered around a single concept/topic"""
    sections: List[Section] = Field(description="a list of sections of the document")
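
To illustrate the shape of the output these models describe, the sketch below parses a hand-written JSON payload. It uses a plain dataclass, `SectionStandIn`, as a hypothetical dependency-free stand-in for the pydantic `Section` class, and the titles and indices are invented rather than real model output.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SectionStandIn:
    # Hypothetical stand-in mirroring the fields of Section above.
    title: str
    start_index: int
    end_index: int


# Invented payload: only titles and line numbers, no section text.
payload = """
{"sections": [
  {"title": "Introduction", "start_index": 0, "end_index": 4},
  {"title": "Self-attention", "start_index": 4, "end_index": 12}
]}
"""

sections: List[SectionStandIn] = [
    SectionStandIn(**item) for item in json.loads(payload)["sections"]
]

for s in sections:
    print(f"{s.title}: start={s.start_index}, end={s.end_index}")
```

Because each section carries only a title and two integers, the response stays small regardless of how long the document is.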

## Document Preprocessing

Preprocess the input document by prefixing each line with its zero-based line number in square brackets.

In [3]:
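def doc_with_lines(document):
    """Return the numbered document and a {line number: line text} lookup."""
    document_lines = document.split("\n")
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document_lines):
        # Prefix each line with its zero-based index, e.g. "[0] first line"
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line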
    return document_with_line_numbers, line2text
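
As a quick check, here is what the preprocessing step produces for a three-line sample. The function is repeated so the snippet runs on its own, and the sample text is made up.

```python
def doc_with_lines(document):
    # Same logic as the function above, repeated to keep this snippet standalone.
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document.split("\n")):
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line
    return document_with_line_numbers, line2text


sample = "Transformers\n\nAttention is a weighted sum."
numbered, line2text = doc_with_lines(sample)

print(numbered)
print(line2text)
```

Note that numbering starts at 0 and that blank lines get a number too, so the indices returned by the LLM map directly onto `line2text`.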

## Segmentation

Next, use a Cohere client to extract a StructuredDocument from the preprocessed document.

In [4]:
import instructor
import os

In [5]:
# Ensure COHERE_API_KEY is in api_keys.env
from dotenv import load_dotenv
load_dotenv("../api_keys.env")

True

In [6]:
import cohere
client = instructor.from_cohere(cohere.Client(api_key=os.getenv("COHERE_API_KEY")))

In [7]:

system_prompt = """
You are a world class educator working on organizing your lecture notes.
Read the document below and extract a StructuredDocument object from it where each section of the document is centered around a single concept/topic that can be taught in one lesson.
Each line of the document is marked with its line number in square brackets (e.g. [1], [2], [3], etc). Use the line numbers to indicate section start and end.
"""

def get_structured_document(document_with_line_numbers) -> StructuredDocument:
    return client.chat.completions.create(
        model="command-r-plus",
        response_model=StructuredDocument,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": document_with_line_numbers,
            },
        ],
    )  # type: ignore
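
Since the line numbers come from the model, it may be worth checking them before using them. The sketch below is one possible sanity check, not part of the original workflow: `check_sections` and `SectionStandIn` are hypothetical names, and the check assumes `end_index` is exclusive, which matches the `range(start_index, end_index)` call used to rebuild the section text in this guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SectionStandIn:
    # Hypothetical stand-in mirroring the fields of Section.
    title: str
    start_index: int
    end_index: int


def check_sections(sections: List[SectionStandIn], num_lines: int) -> List[str]:
    """Return human-readable problems; an empty list means the ranges look sane."""
    problems = []
    previous_end = 0
    for s in sorted(sections, key=lambda sec: sec.start_index):
        # Assumption: end_index is exclusive, so it may equal num_lines.
        if not (0 <= s.start_index < s.end_index <= num_lines):
            problems.append(f"{s.title!r}: range is empty or out of bounds")
        if s.start_index < previous_end:
            problems.append(f"{s.title!r}: overlaps the previous section")
        elif s.start_index > previous_end:
            problems.append(
                f"{s.title!r}: lines {previous_end}-{s.start_index - 1} are not covered"
            )
        previous_end = max(previous_end, s.end_index)
    return problems


# Invented ranges for a 12-line document: one overlap and one gap.
problems = check_sections(
    [
        SectionStandIn("Intro", 0, 4),
        SectionStandIn("Attention", 3, 8),
        SectionStandIn("Summary", 10, 12),
    ],
    num_lines=12,
)
for p in problems:
    print(p)
```

If the list of problems is non-empty, one option is to retry the request or to fall back to a simpler splitter for the affected lines.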

Next, we recover the text of each section from its start and end indices, using the line2text dict built in the preprocessing step.

In [8]:
def get_sections_text(structured_doc, line2text):
    segments = []
    for s in structured_doc.sections:
        contents = []
        # range() excludes end_index, so the line numbered end_index is not included
        for line_id in range(s.start_index, s.end_index):
            contents.append(line2text.get(line_id, ''))
        segments.append({
            "title": s.title,
            "content": "\n".join(contents),
            "start": s.start_index,
            "end": s.end_index
        })
    return segments
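
The function can be tried without calling the API by passing in a hand-built object with the same attributes. In this sketch, `SimpleNamespace` objects stand in for the pydantic models, and the lines and indices are invented.

```python
from types import SimpleNamespace


def get_sections_text(structured_doc, line2text):
    # Same logic as the function above, repeated to keep this snippet standalone.
    segments = []
    for s in structured_doc.sections:
        contents = [line2text.get(i, "") for i in range(s.start_index, s.end_index)]
        segments.append({
            "title": s.title,
            "content": "\n".join(contents),
            "start": s.start_index,
            "end": s.end_index,
        })
    return segments


line2text = {0: "Intro", 1: "More intro", 2: "Attention", 3: "Scores"}

# Stand-in for a StructuredDocument returned by the LLM.
fake_doc = SimpleNamespace(sections=[
    SimpleNamespace(title="Introduction", start_index=0, end_index=2),
    SimpleNamespace(title="Attention", start_index=2, end_index=4),
])

segments = get_sections_text(fake_doc, line2text)
for seg in segments:
    print(seg["title"], repr(seg["content"]))
```

Because the end index is excluded, a section ending at 2 and the next one starting at 2 share no lines.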

## Example

Here's an example of using these classes and functions to segment a tutorial on Transformers by Sebastian Raschka. We can use the trafilatura package to scrape the content of the article's web page.

In [10]:
from trafilatura import fetch_url, extract
from rich import print

url = 'https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html'
document = extract(fetch_url(url))

document_with_line_numbers, line2text = doc_with_lines(document)
structured_doc = get_structured_document(document_with_line_numbers)
segments = get_sections_text(structured_doc, line2text)

print(segments[5]['title'])
print(segments[5]['content'])