# Long-form text editing with chunking, summarization, and iteration

## Model limitations

Let's first define some constants with descriptive statistics about OpenAI models and their capabilities. These are the constraints that we will be working with as we try to define an LLM pipeline to modernize Herman Melville's very lengthy novel *Moby Dick*.

In [2]:
# OpenAI models and their current costs
LLM_MODELS = [
    {
        "name": "gpt-3.5-turbo",
        "max_tokens": 16385,
        "prompt_cost_per_token": 0.5 / 1000000,
        "completion_cost_per_token": 1.5 / 1000000,
    },
    {
        "name": "gpt-4-turbo",
        "max_tokens": 128000,
        "prompt_cost_per_token": 10 / 1000000,
        "completion_cost_per_token": 30 / 1000000,
    },
    {
        "name": "gpt-4",
        "max_tokens": 8192,
        "prompt_cost_per_token": 30 / 1000000,
        "completion_cost_per_token": 60 / 1000000,
    },
    {
        "name": "gpt-4-32k",
        "max_tokens": 32768,
        "prompt_cost_per_token": 60 / 1000000,
        "completion_cost_per_token": 120 / 1000000,
    }
]

# OpenAI currently limits output tokens to 4096 for all models
MAX_OUTPUT_TOKENS = 4096

## The Task

The goal of our exercise is to modernize the text of Herman Melville's classic public-domain novel *Moby Dick*. The book is much too long to fit in the context window of even a long-context (128,000-token) LLM, and far longer than the 4096-token output limit, so we will use a combination of chunking, summarization, synthesis, and multi-turn text generation with a sliding context window to maintain consistency across the task.

We'll begin by loading the text of Moby Dick, which is downloaded from Project Gutenberg and provided in this repository as `sample_input.txt`. Using the heuristic of about 4 English characters per token, we can estimate the length of the text at close to 300,000 tokens.

In [3]:
# Estimate the number of tokens, assuming about 4 characters per token
def estimate_tokens(s: str) -> int:
    return round(len(s) / 4)

# Open sample input file, get the text, and estimate number of tokens
with open("sample_input.txt", "r", encoding="utf-8") as file:
    text = file.read()

print("Estimated token count: " + str(estimate_tokens(text)))

Estimated token count: 297792
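
As a rough sanity check on cost, the same heuristic can be combined with the prices listed in `LLM_MODELS`. The sketch below is only an estimate under stated assumptions: it reuses the 297,792-token figure printed above, copies the gpt-4-turbo prices from the constants, and assumes (hypothetically) that the modernized output is about as long as the input. The function name `estimate_single_pass_cost` is illustrative and not part of the pipeline.

```python
# Rough cost sketch; all figures are estimates, not billing data
ESTIMATED_INPUT_TOKENS = 297_792            # estimate printed above
PROMPT_COST_PER_TOKEN = 10 / 1_000_000      # gpt-4-turbo, from LLM_MODELS
COMPLETION_COST_PER_TOKEN = 30 / 1_000_000  # gpt-4-turbo, from LLM_MODELS


def estimate_single_pass_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of reading input_tokens once and writing output_tokens once."""
    return (
        input_tokens * PROMPT_COST_PER_TOKEN
        + output_tokens * COMPLETION_COST_PER_TOKEN
    )


# Assumption: the rewritten text is about as long as the original
cost = estimate_single_pass_cost(ESTIMATED_INPUT_TOKENS, ESTIMATED_INPUT_TOKENS)
print(f"Estimated single-pass cost: ${cost:.2f}")
```

Under these assumptions the estimate comes to about $12. It is a lower bound for a multi-turn approach, since each request that carries earlier turns in its context pays for those prompt tokens again.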


## Chunking

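Next, let's define some logic to chunk the text. We'll set a maximum character length for each chunk (`max_characters`), divide the text into the smallest number of roughly equal chunks that fit under that limit, and then move each split point to the nearest natural boundary: a chapter break if one is close enough, otherwise a paragraph break, a sentence ending, or whitespace. Character counts stand in for token counts, using the same heuristic of about 4 English characters per token.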

In [24]:
import re
from warnings import warn
from math import ceil

# Regular expressions for split points
SPLIT_PATTERNS = [
    re.compile(r'\s'), # WHITESPACE
    re.compile(r'[.!?][”’"\']*\s'), # SENTENCE_END
    re.compile(r'\n\n+'), # PARAGRAPH END
    re.compile(r'\n\n+(?=CHAPTER \d)', re.IGNORECASE), # CHAPTER END
]

# Find the end position of the last occurrence of a pattern in a text chunk
def find_last(text: str, pattern: re.Pattern) -> int | None:
    # Get a list of matches for the pattern in the text
    matches = list(re.finditer(pattern, text))

    # Return the last match's end position, or None if there is no match
    return matches[-1].end() if matches else None


# Split text into evenly sized chunks shorter than max_characters
def split_text(
            text: str,
            max_characters: int | float,
            split_patterns: list[re.Pattern] = SPLIT_PATTERNS
        ) -> list[str]:
    # Coerce max_characters to an integer
    max_characters = int(max_characters)
    text_length = len(text)

    # Ceiling divide the text length by max_characters to get the number of chunks
    num_chunks = ceil(text_length / max_characters)

    # Ceiling divide the text length by the number of chunks to get the target_length
    target_length = ceil(text_length / num_chunks)

    # Get target split indices for the text
    target_split_indices = [target_length * i for i in range(1, num_chunks)]

    # Initialize split_indices, threshold, and patterns
    threshold = max(max_characters - target_length, 500)
    split_indices = [None] * (num_chunks - 1)
    patterns = split_patterns.copy()

    # Try patterns, from coarsest to finest, until every split index is set
    while any(index is None for index in split_indices) and patterns:
        # Pop the last split_pattern from the list and get all match end positions
        pattern = patterns.pop()
        end_positions = [match.end() for match in re.finditer(pattern, text)]

        # Skip patterns that never match (min() would fail on an empty list)
        if not end_positions:
            continue

        # For each None index in split_indices, find the nearest end_position (by absolute difference)
        for i, split_index in enumerate(split_indices):
            if split_index is None:
                nearest_end_position = min(end_positions, key=lambda x: abs(x - target_split_indices[i]))
                if abs(nearest_end_position - target_split_indices[i]) <= threshold:
                    split_indices[i] = nearest_end_position

    # If not all split_indices are found, raise a warning and use the target_split_indices
    if any(index is None for index in split_indices):
        warn("Could not split text evenly. Using target split indices instead.")
        split_indices = target_split_indices

    # Split the text into chunks using the split_indices
    chunks = [text[i:j] for i, j in zip([0] + split_indices, split_indices + [None])]

    return chunks


# Get model object for GPT-4 Turbo
model = [model for model in LLM_MODELS if model["name"] == "gpt-4-turbo"][0]

# Allow input no longer than gpt-4-turbo max_tokens - MAX_OUTPUT_TOKENS - 1000
estimate_chars = lambda tokens: tokens * 4
max_input_length = estimate_chars(model["max_tokens"] - MAX_OUTPUT_TOKENS - 1000)

# Split the input text into chunks no longer than max_input_length
input_chunks = split_text(text, max_input_length)

# Estimate the number of tokens for each chunk
token_counts = [estimate_tokens(chunk) for chunk in input_chunks]

# Print estimated token count and first 50 characters of each chunk
for i, chunk in enumerate(input_chunks):
    print(f"Chunk {i+1}: {chunk[:50]}...")
    print(f"Estimated token count: {token_counts[i]}\n")

Chunk 1: CHAPTER 1. Loomings.
Call me Ishmael. Some years a...
Estimated token count: 101272

Chunk 2: CHAPTER 43. Hark!
“HIST! Did you hear that noise, ...
Estimated token count: 97692

Chunk 3: CHAPTER 87. The Grand Armada.
The long and narrow ...
Estimated token count: 98827
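
To make the split logic easier to follow, here is a minimal sketch of the same idea on a toy string. It is a simplification, not the function above: the hypothetical helper `nearest_boundary_splits` uses a single pattern and omits the distance threshold and the fallback through several patterns.

```python
import re
from math import ceil


def nearest_boundary_splits(text: str, max_characters: int, pattern: re.Pattern) -> list[str]:
    """Split text into roughly equal chunks, snapping each cut to the nearest pattern match."""
    num_chunks = ceil(len(text) / max_characters)
    target_length = ceil(len(text) / num_chunks)

    # Evenly spaced cut positions we would use if boundaries did not matter
    targets = [target_length * i for i in range(1, num_chunks)]

    # Candidate cut positions: the end of every pattern match
    ends = [m.end() for m in pattern.finditer(text)]

    # Snap each target to the closest candidate, or keep the target if none exist
    cuts = [min(ends, key=lambda e: abs(e - t)) for t in targets] if ends else targets

    return [text[i:j] for i, j in zip([0] + cuts, cuts + [None])]


sample = "One. Two two. Three three three. Four four. Five."
chunks = nearest_boundary_splits(sample, 20, re.compile(r"[.!?]\s"))
for chunk in chunks:
    print(repr(chunk))
```

The 49-character sample needs three chunks of at most 20 characters, so the target cuts fall at positions 17 and 34. The nearest sentence endings are at 14 and 33, which gives the chunks `'One. Two two. '`, `'Three three three. '`, and `'Four four. Five.'`.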



## Asynchronous summarization and summary synthesis over multiple chunks

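We'll use GPT-4-Turbo with an async client to summarize several large text chunks concurrently, and then we'll synthesize the chunk summaries into a final summary. Because the chunk requests run in parallel, the wall-clock time for the summarization step is roughly that of the slowest single request rather than the sum of all requests, at least until the number of chunks runs into API rate limits. The synthesis step then adds one more sequential call.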
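
The concurrency pattern itself can be sketched without any network calls. In the example below, `fake_summarize` is a hypothetical stand-in for the API request (it just sleeps), and the `asyncio.Semaphore` cap on in-flight requests is an assumption added for illustration; the pipeline in this notebook does not bound concurrency.

```python
import asyncio


async def fake_summarize(chunk: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an API call; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return f"summary of {len(chunk)} characters"


async def summarize_all(chunks: list[str], max_concurrency: int = 2) -> list[str]:
    """Run one task per chunk, with at most max_concurrency in flight at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk: str) -> str:
        async with semaphore:
            return await fake_summarize(chunk)

    # gather returns results in the same order as the input chunks
    return await asyncio.gather(*(bounded(chunk) for chunk in chunks))


# In a script, start an event loop explicitly
demo_summaries = asyncio.run(summarize_all(["aaaa", "bb", "cccccc"]))
print(demo_summaries)
```

Inside a notebook, where an event loop is already running, the last call would be written as `demo_summaries = await summarize_all([...])` instead of using `asyncio.run`.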

In [26]:
import os
import logging
from typing import Optional
from openai import AsyncOpenAI, OpenAI
from openai.types.chat import ChatCompletion
from openai import BadRequestError
from pydantic import BaseModel
import asyncio

# Find smallest model that accommodates message list input + MAX_OUTPUT_TOKENS response
def select_model(message_list: list[dict]) -> str:
    # Estimate the context length including the longest possible output plus a buffer
    estimated_tokens = estimate_tokens(str(message_list)) + MAX_OUTPUT_TOKENS + 1000

    # Keep the model with the smallest context window that still fits the estimate
    best_model = None
    for model in LLM_MODELS:
        max_tokens = model["max_tokens"]
        fits = max_tokens > estimated_tokens
        if fits and (best_model is None or max_tokens < best_model["max_tokens"]):
            best_model = model

    # If we found a model, return its name; otherwise, raise an error
    if best_model:
        return best_model["name"]
    raise ValueError("Context length exceeds the maximum supported token count.")


# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constant for maximum number of API calls
MAX_API_CALLS = 5

# Variable to track the number of API calls
api_calls_counter = 0

async def make_api_call(model: str, messages: list[dict]) -> Optional[ChatCompletion]:
    global api_calls_counter

    # Check if the maximum number of API calls is reached
    if api_calls_counter >= MAX_API_CALLS:
        logger.warning("Maximum number of API calls reached. Skipping API call.")
        return None

    try:
        # Make the API call with a timeout
        response = await asyncio.wait_for(
            async_client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0
            ),
            timeout=100.0  # Adjust the timeout as needed
        )

        # Increment the API calls counter
        api_calls_counter += 1

        return response

    except asyncio.TimeoutError:
        logger.error("API call timed out.")
    except Exception as e:
        logger.error(f"API call failed: {str(e)}")

    return None


# Initialize sync OpenAI client
sync_client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    organization=os.getenv("OPENAI_ORGANIZATION")
)

# Instructions for the LLM synthesizer
SYNTHESIS_INSTRUCTIONS = (
    "Synthesize the summaries of large text chunks from Herman Melville's Moby Dick "
    "into a single cohesive summary. This summary should serve as a comprehensive "
    "reference cheat sheet for an editor tasked with modernizing the text. Focus on "
    "integrating key plot points, character arcs, and thematic elements to provide "
    "a useful guide for text modernization."
)

# Function to call an LLM to synthesize multiple summaries
def synthesize_summaries_with_llm(chunk_summaries: list[str]) -> dict:
    # Concatenate chunk summaries into a single string with informative separators
    text = "\n\n".join(f"Chunk:\n\n{summary}" for summary in chunk_summaries)

    # Select the smallest model that can handle the input + output tokens
    system_message = {"role": "system", "content": SYNTHESIS_INSTRUCTIONS}
    user_message = {"role": "user", "content": text}
    model_name = select_model([system_message, user_message])

    # Call the OpenAI API to synthesize the summaries (system message first)
    response = sync_client.chat.completions.create(
        model=model_name,
        messages=[system_message, user_message],
        temperature=0
    )

    # Look up the model's pricing, then return the response text, model, and cost
    model = [model for model in LLM_MODELS if model["name"] == model_name][0]
    return {
        "text": response.choices[0].message.content,
        "llm_models": [model_name],
        "llm_cost": response.usage.prompt_tokens*model["prompt_cost_per_token"] + \
            response.usage.completion_tokens*model["completion_cost_per_token"]
    }


# Initialize async OpenAI client
async_client = AsyncOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    organization=os.getenv("OPENAI_ORGANIZATION")
)

# Instructions for the LLM summarizer
SUMMARY_INSTRUCTIONS = "The user will provide a portion of the text of Herman " + \
"Melville's classic public domain novel Moby Dick. Your task is to summarize the " + \
"plot and characters in a useful cheat sheet intended for use by an editor who " + \
"will be modernizing the text. You should only summarize the portion of the text " + \
"in the user's prompt."

# Pydantic model for our result
class Result(BaseModel):
    text: str
    llm_models: list[str]
    llm_cost: float

async def summarize_with_llm(input_chunks: list[str], model: dict) -> Result:
    # Initialize variables to track total cost and results
    total_cumulative_cost = 0
    tasks = []
    system_message = {"role": "system", "content": SUMMARY_INSTRUCTIONS}

    # Prepare the tasks for asynchronous API calls
    for chunk in input_chunks:
        # Add chunk as user message
        user_message = {"role": "user", "content": chunk}

        # Prepare the asynchronous task
        task = async_client.chat.completions.create(
            model=model["name"],
            messages=[system_message, user_message],
            temperature=0
        )
        tasks.append(task)

    # Execute the asynchronous API calls and process responses
    responses = await asyncio.gather(*tasks)
    
    chunk_summaries = []
    for response in responses:
        # Access the response text, strip whitespace, and append to the list
        chunk_summaries.append(response.choices[0].message.content.strip())

        # Update total cumulative cost
        total_cumulative_cost += response.usage.prompt_tokens*model["prompt_cost_per_token"] + \
            response.usage.completion_tokens*model["completion_cost_per_token"]

    # Synthesize the chunk summaries into a single summary
    if len(chunk_summaries) > 1:
        synthesis_result: dict = synthesize_summaries_with_llm(chunk_summaries)

        return Result(
                text=synthesis_result["text"],
                llm_models=[model["name"]] + synthesis_result["llm_models"],
                llm_cost=total_cumulative_cost + synthesis_result["llm_cost"]
            )
    else:
        return Result(
                text=chunk_summaries[0],
                llm_models=[model["name"]],
                llm_cost=total_cumulative_cost
            )


# Get model object for GPT-4 Turbo
model = [model for model in LLM_MODELS if model["name"] == "gpt-4-turbo"][0]

# Call the summarizer function with the chunks of the sample input text
result = await summarize_with_llm(input_chunks, model)


In [None]:
LLM_INSTRUCTIONS = "The user will provide a portion of the text of Herman " + \
"Melville's classic public domain novel Moby Dick. Your task is to render the " + \
"text in modern American vernacular. You should try to preserve the meaning and " + \
"setting of the text, while modernizing its voice. You are only responsible to " + \
"render the portion of the text in the user's prompt, but you must provide a " + \
"full, unabbreviated rendition of this text."

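# Pydantic model for the editing result
class EditResult(BaseModel):
    edited_text: str
    llm_models: list[str]
    llm_cost: float

def edit_transcription_with_llm(llm_input: str) -> EditResult: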
    client = OpenAI(
        api_key=os.getenv("OPENAI_API_KEY"),
        organization=os.getenv("OPENAI_ORGANIZATION")
    )

    # Split the input into chunks if it exceeds 4/5 of OpenAI's output token limit (converted to characters)
    max_input_length = (MAX_OUTPUT_TOKENS*4)*4//5
    input_chunks = split_text(llm_input, max_input_length)

    total_cumulative_cost = 0
    message_list = [{"role": "system", "content": LLM_INSTRUCTIONS}]
    llm_models = set()
    for chunk in input_chunks:
        # Add chunk as user message
        message_list.append({"role": "user", "content": chunk})

        done = False
        while not done:
            try:
                # Get the response from the model
                model = select_model(message_list)
                response: ChatCompletion = client.chat.completions.create(
                    model=model,
                    messages=message_list,
                    temperature=0
                )
            except BadRequestError:
                # Retry with the model that has the largest context window
                model = max(LLM_MODELS, key=lambda m: m["max_tokens"])["name"]
                response: ChatCompletion = client.chat.completions.create(
                    model=model,
                    messages=message_list,
                    temperature=0
                )

            # Access the response text, strip whitespace, and append as assistant message
            message_list.append({"role": "assistant", "content": response.choices[0].message.content.strip()})

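            # Update total cumulative cost using the per-token prices in LLM_MODELS
            model_info = next(m for m in LLM_MODELS if m["name"] == model)
            total_cumulative_cost += response.usage.prompt_tokens*model_info["prompt_cost_per_token"] + \
                response.usage.completion_tokens*model_info["completion_cost_per_token"]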

            # Add the model to the set of used models
            llm_models.add(model)

            # A finish_reason of "length" means the output was cut off, so call again to continue;
            # any other finish_reason means we are done with the chunk
            done = response.choices[0].finish_reason != "length"

    # Combine the assistant messages into a single string
    edited_text = " ".join([message["content"] for message in message_list if message["role"] == "assistant"])

    return EditResult(edited_text=edited_text, llm_models=list(llm_models), llm_cost=total_cumulative_cost)
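
The loop above appends every chunk and every response to `message_list`, so the context grows with each turn. One way to get the sliding context window mentioned in the task description is to drop the oldest turns before each request. The helper below, `trim_message_list`, is a hypothetical sketch and not part of the original pipeline; it measures length in characters, which can be converted from tokens with the same 4-characters-per-token heuristic.

```python
def trim_message_list(message_list: list[dict], max_chars: int) -> list[dict]:
    """Keep the system messages plus the most recent messages that fit in max_chars."""
    system_messages = [m for m in message_list if m["role"] == "system"]
    other_messages = [m for m in message_list if m["role"] != "system"]

    # Characters left over after reserving room for the system messages
    budget = max_chars - sum(len(m["content"]) for m in system_messages)

    kept = []
    # Walk backwards so the newest messages are kept first
    for message in reversed(other_messages):
        budget -= len(message["content"])
        # Always keep the newest message, even if it alone exceeds the budget
        if budget < 0 and kept:
            break
        kept.append(message)

    # Restore chronological order
    return system_messages + list(reversed(kept))


history = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "a" * 10},
    {"role": "assistant", "content": "b" * 10},
    {"role": "user", "content": "c" * 10},
]
trimmed = trim_message_list(history, max_chars=25)
print([m["role"] for m in trimmed])
```

With a 25-character budget, the oldest user message is dropped and the system message, the last assistant reply, and the newest user chunk remain. In the editing loop, a call like this could run just before `select_model(message_list)`, with `max_chars` chosen from the target model's context size.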