# Front-loading context for long-form content generation with OpenAI "Turbo" models

## Introduction

OpenAI models have both context limits (various) and output limits (generally 4096 tokens). When working with potentially long inputs and outputs, we need to be able to manage both types of limits. 

The standard paradigm for long-form content editing with OpenAI's ChatCompletion API is to chunk the text into 4096-token segments, and then to edit them one at a time, using a rolling context window to provide some continuity between segments. This works fairly well for most purposes, but the advent of very long context windows in some models (up to 128,000 tokens) opens up the theoretical possibility of providing the model the entire document to be edited, up front, so that it can work with the entire context at once. This might, for instance, help the model recognize what kind of document it is working with, and accordingly tune its interpretation of text to be edited, translated, or otherwise operated upon.
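For comparison with what follows, the chunk-and-roll paradigm might be sketched roughly like this. The function names, the paragraph-boundary splitting, and the 500-character overlap are illustrative assumptions rather than part of any library, and token counts are approximated at 4 characters per token instead of being measured.

```python
def chunk_text(text: str, max_tokens: int = 4096, chars_per_token: int = 4) -> list[str]:
    """Split text into chunks of roughly max_tokens, breaking on blank lines."""
    max_chars = max_tokens * chars_per_token
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits (or the chunk is empty, so we must accept the paragraph)
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def with_rolling_context(chunks: list[str], overlap_chars: int = 500) -> list[tuple[str, str]]:
    """Pair each chunk with the tail of the previous chunk, to be sent as context."""
    pairs = []
    for i, chunk in enumerate(chunks):
        context = chunks[i - 1][-overlap_chars:] if i > 0 else ""
        pairs.append((context, chunk))
    return pairs


demo_text = "\n\n".join(["a" * 10] * 5)
demo_chunks = chunk_text(demo_text, max_tokens=6, chars_per_token=4)
demo_pairs = with_rolling_context(demo_chunks, overlap_chars=5)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

Note that a single paragraph longer than the chunk size is passed through as an oversized chunk here; a real implementation would need to split it further.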

In this notebook, I explore this possibility, showing my work as I go. It was a surprisingly frustrating process, but I gained some valuable insights into how the OpenAI ChatCompletions API works, what the limitations of OpenAI's models are, and how to work around some common problems. I found that the GPT-4-Turbo model is actually extremely bad at tracking its progress through a long document over multiple text generation turns, and it also frequently stops prematurely.

Prompt engineering can help. I can confirm, for instance, that offering to tip the model $20 for complete output helps mitigate the premature stopping problem. However, prompt engineering increases both token costs and potential failure points: the more complex the prompt, the more likely the model is to get confused and fail. I've tried to show my work in this notebook, but there was a lot of trial and error with prompts that I don't show here. (If you just want the final, working code, skip to the "Putting it all together" section at the end.)

In general, I found that the better approach was to keep the prompt simple and the AI task relatively straightforward, and build lots of code around the inputs and outputs to manage the complexity. For instance, I had reasonably good success with numbering the paragraphs so the AI could track its progress through the document, though I still had to carefully explain how.

All in all, I concluded that GPT-4-Turbo's 128,000-token context length is still very limited for long-form content editing, and it's probably better, if possible, to stick with iteratively feeding the model shorter chunks. However, with *lots* of work, it *is* possible to make use of the long context window for multi-turn editing of front-loaded context, as I demonstrate here.

## Counting/estimating input tokens

For this exercise, we'll use a very long classic literary work obtained from Project Gutenberg (Moby Dick, by Herman Melville) to simulate a very long context input (>128,000 tokens). We'll also choose a prompt that requires output of roughly equal length.

OpenAI offers this heuristic for estimating token length: about 1 token per 4 characters of English text. We can check this estimate against an actual token count generated with the `tiktoken` Python library to confirm that this is in the right ballpark.

In [71]:
import os
import re
from openai import OpenAI
from openai.types.chat import ChatCompletion
from openai import BadRequestError
import tiktoken
import warnings

model_cost = {
    "gpt-3.5-turbo": (0.0005/1000, 0.0015/1000),
    "gpt-4": (0.03/1000, 0.06/1000),
    "gpt-4-32k": (0.06/1000, 0.12/1000),
    "gpt-4-turbo-preview": (0.01/1000, 0.03/1000)
}

def estimate_context_length(context: str) -> int:
    # Count characters in the input
    prompt_length = len(context)

    # Assume ~4 characters per token
    prompt_token_count = prompt_length/4

    return int(prompt_token_count)

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

with open("sample_input.txt", "r", encoding="utf-8") as file:
    long_text = file.read()

print("Estimated token count: " + str(estimate_context_length(long_text)))

# Count the tokens in long_text
print("Actual token count: " + str(count_tokens(long_text, "gpt-3.5-turbo")))

Estimated token count: 297792
Actual token count: 281265
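To put numbers on how close the heuristic came, here is a small worked calculation using the two counts printed above. The character count is recovered approximately by multiplying the estimate back by 4, so the last digits are not exact.

```python
estimated_tokens = 297_792   # character-count heuristic (output above)
actual_tokens = 281_265      # tiktoken count (output above)

# Relative overestimate of the heuristic
overestimate = (estimated_tokens - actual_tokens) / actual_tokens

# The heuristic divided the character count by 4, so multiply back to recover it
approx_chars = estimated_tokens * 4
chars_per_token = approx_chars / actual_tokens

print(f"Overestimate: {overestimate:.1%}")
print(f"Observed characters per token: {chars_per_token:.2f}")
```

For this text, the heuristic overshoots by about 6%, which corresponds to a little over 4.2 characters per token rather than exactly 4.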


## Handling errors related to violations of context length

We have an input of approximately 300,000 tokens. The estimate is a little above the actual token count, which is good because it means this estimation method will tend to err conservatively on the side of selecting a longer context model. If we naively submit this input to the OpenAI ChatCompletion API, we surprisingly get a `RateLimitError`, because our organization is only allowed to transmit 160,000 tokens per minute.

In [42]:
client = OpenAI(
        api_key=os.getenv("OPENAI_API_KEY")
    )

system_prompt = "The user will provide a portion of the text of Herman Melville's " + \
"classic public domain novel Moby Dick. Your task is to translate the text into " + \
"modern American vernacular. You should try to preserve the meaning and setting " + \
"of the text, while modernizing its voice. You are only responsible to translate " + \
"the portion of the text in the user's prompt, but you must provide a full, " + \
"unabbreviated translation of this text."

message_list = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": long_text}
            ]

try:
    response: ChatCompletion = client.chat.completions.create(
            model="gpt-3.5-turbo-0613",
            messages=message_list,
            max_tokens=4096,
            temperature=0
        )
    print(response)
except Exception as e:
    print(type(e))
    # Not every exception type has a .message attribute, so fall back to str(e)
    print(getattr(e, "message", str(e)))

<class 'openai.RateLimitError'>
Error code: 429 - {'error': {'message': 'Request too large for gpt-3.5-turbo-0613 in organization org-EitsfepwgP6MWPVn6teUfr7t on tokens per min (TPM): Limit 160000, Requested 301753. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


What if we submit an input below the rate limit, but longer than the context window of the model (4097 tokens in the case of `gpt-3.5-turbo-0613`)? Let's submit about 10,000 tokens to find out.

In [43]:
medium_text = long_text[:10000*4]

message_list = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": medium_text},
            ]

try:
    response: ChatCompletion = client.chat.completions.create(
            model="gpt-3.5-turbo-0613",
            messages=message_list,
            max_tokens=4096,
            temperature=0
        )
    print(response)
except Exception as e:
    print(type(e))
    # Not every exception type has a .message attribute, so fall back to str(e)
    print(getattr(e, "message", str(e)))

<class 'openai.BadRequestError'>
Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens. However, your messages resulted in 9584 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}


In this case, we get a `BadRequestError`, and the `message` attribute contains the string `context_length_exceeded`. We could use string matching for error handling.
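A minimal sketch of that string-matching idea follows. The helper name `classify_api_error` is hypothetical, and the sample messages are abbreviated copies of the two error outputs above. The library's exception objects may also expose the error code in a structured form; this sketch relies only on the message text we have already seen.

```python
def classify_api_error(message: str) -> str:
    """Classify an API error by the error code embedded in its message text."""
    if "context_length_exceeded" in message:
        return "context_length_exceeded"
    if "rate_limit_exceeded" in message:
        return "rate_limit_exceeded"
    return "other"


# Abbreviated versions of the error messages shown in the outputs above
rate_limit_message = (
    "Error code: 429 - {'error': {'message': 'Request too large ...', "
    "'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}"
)
context_message = (
    "Error code: 400 - {'error': {'message': \"This model's maximum context "
    "length is 4097 tokens. ...\", 'type': 'invalid_request_error', "
    "'param': 'messages', 'code': 'context_length_exceeded'}}"
)

print(classify_api_error(rate_limit_message))
print(classify_api_error(context_message))
```

In an `except` block, the string returned here could decide whether to wait and retry (rate limit) or to shorten the input or switch models (context length).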

The same problem arises if our input is below the context limit, but we supply a `max_tokens` parameter that, added to the context length, would cumulatively result in exceeding the context limit. For example, we can submit 3,000 tokens of input and specify a maximum of 4096 tokens of output. When we try this, we get the same `BadRequestError` and `context_length_exceeded` code as we got above.

In [5]:
short_text = long_text[:3000*4]

message_list = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": short_text}
            ]

try:
    response: ChatCompletion = client.chat.completions.create(
            model="gpt-3.5-turbo-0613",
            messages=message_list,
            max_tokens=4096,
            temperature=0
        )
    print(response)
except Exception as e:
    print(type(e))
    # Not every exception type has a .message attribute, so fall back to str(e)
    print(getattr(e, "message", str(e)))

<class 'openai.BadRequestError'>
Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens. However, you requested 6973 tokens (2877 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
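The arithmetic in this error message suggests a simple pre-check: subtract the input tokens from the context limit, and cap the result at the output limit. The following is a sketch only, with a hypothetical helper name, using the figures reported in the error above.

```python
def output_budget(context_limit: int, input_tokens: int, output_cap: int = 4096) -> int:
    """Return the largest max_tokens value that still fits in the context window."""
    remaining = context_limit - input_tokens
    if remaining <= 0:
        raise ValueError("Input alone exceeds the context limit.")
    return min(remaining, output_cap)


# 4,097-token context window, 2,877 tokens already used by the messages
print(output_budget(4097, 2877))  # 1220
```

With the figures from the error message, only 1,220 tokens are left for the completion, well short of the 4,096 we asked for.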


If we remove the `max_tokens` parameter, we can successfully submit the 3,000 token input and receive an output. However, the output will be truncated once the cumulative total of input and output tokens reaches the model's context limit (4,097 tokens here), with `response.choices[0].finish_reason == 'length'`.

In [44]:
message_list = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": short_text}
    ]

response: ChatCompletion = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",
    messages=message_list,
    temperature=0
)

print(response)

ChatCompletion(id='chatcmpl-8vY9l3kUDBY98T5oizwxjQvJ7z41m', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="CHAPTER 1. Loomings.\n\nCall me Ishmael. A few years ago—never mind exactly how long—I was broke and had nothing better to do on land, so I decided to go out to sea and see what the ocean had to offer. It's my way of getting rid of my bad mood and clearing my head. Whenever I start feeling down and depressed, or when it's a gloomy November in my soul, and I find myself staring at coffins and following funerals, or when my anxieties start overwhelming me to the point where I have to fight the urge to go out and knock people's hats off, that's when I know it's time to hit the sea. It's my way of dealing with things. Cato throws himself on his sword to find peace, but I prefer to quietly board a ship. It's not that surprising, really. If people knew, most of them would feel the same way about the ocean as I do, to some extent.\n

## Using `finish_reason==length` to fall back to a longer context model

If we encounter this scenario where `finish_reason==length`, we can add the response to the `messages` list, instruct the model to pick up where it left off, and substitute a longer context model (such as gpt-3.5-turbo-0125, with 16k context length) for the `model` parameter. (One of the fun things about working with LLMs through an API is that we can swap different models in and out of the same chat.)

In [45]:
message_list.append({"role": "assistant", "content": response.choices[0].message.content})

response: ChatCompletion = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=message_list,
    temperature=0
)

print(response)

ChatCompletion(id='chatcmpl-8vYAf3KJ4VwqFBjqYkcqjnTrWVgDQ', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='mummified versions of those animals in the pyramids.\n\nNo, when I go to sea, I go as a simple sailor, working hard on the deck, from the front to the top. Sure, they boss me around a bit, making me jump from one place to another like a grasshopper in a field. At first, it\'s not the most pleasant experience. It challenges my sense of honor, especially coming from a respected family like the Van Rensselaers or Randolphs. And especially if, just before getting my hands dirty with tar, I was a teacher making the big kids fear me. The transition from a teacher to a sailor is a tough one, and it takes a lot of mental strength to handle it. But eventually, you get used to it.\n\nSo what if an old sea captain tells me to grab a broom and sweep the decks? How much of an insult is that, really, when you think about it in the grand sche

This works as expected, returning a total of about 6500 tokens with `finish_reason == 'stop'`. We can simply concatenate the content from messages with role 'assistant' to get the full output:

In [46]:
message_list.append({"role": "assistant", "content": response.choices[0].message.content})

def get_full_output(message_list):
    # Join content with a space, ensuring 'content' key exists and is not empty
    return " ".join([message["content"] for message in message_list if message.get("role") == "assistant" and message.get("content", "")])

full_output = get_full_output(message_list)

print(full_output)

CHAPTER 1. Loomings.

Call me Ishmael. A few years ago—never mind exactly how long—I was broke and had nothing better to do on land, so I decided to go out to sea and see what the ocean had to offer. It's my way of getting rid of my bad mood and clearing my head. Whenever I start feeling down and depressed, or when it's a gloomy November in my soul, and I find myself staring at coffins and following funerals, or when my anxieties start overwhelming me to the point where I have to fight the urge to go out and knock people's hats off, that's when I know it's time to hit the sea. It's my way of dealing with things. Cato throws himself on his sword to find peace, but I prefer to quietly board a ship. It's not that surprising, really. If people knew, most of them would feel the same way about the ocean as I do, to some extent.

Now, let's talk about Manhattan, that island city surrounded by docks like coral reefs. The streets all lead to the water. The southernmost point is the battery, whe

Context length is less of an issue with state-of-the-art models than it used to be, because the current flagship GPT-3.5-Turbo model has a context length of 16,385 tokens. That comes out to about 65,000 characters of English text, or almost 11,000 words. That's long enough to accommodate your typical academic paper, though it falls short of most novellas and books. Thus, context management remains important for edge cases. And even for shorter inputs, we still need to manage output length because of the 4096 token output limit.

To handle the output limit, there's basically only one viable approach: check for the `length` finish reason, and if it's present, add the response to the `messages` list, instruct the model to pick up where it left off, and re-prompt. Unfortunately this is fairly high latency and leads to some redundancy in token usage, but it is what it is.
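The re-prompting loop can be sketched in isolation, without touching the API. In this sketch the completion call is injected as a function, so that a stand-in can play the role of the model; `continue_until_done`, `make_fake_completion`, and the `max_turns` cap are all hypothetical names and choices made for illustration.

```python
from typing import Callable

# Takes the message list, returns (content, finish_reason). In real use this
# would wrap the chat completions call; here it is injected for offline testing.
CompletionFn = Callable[[list[dict]], tuple[str, str]]


def continue_until_done(
    messages: list[dict], complete: CompletionFn, max_turns: int = 16
) -> list[dict]:
    """Re-prompt while the finish reason is 'length', for at most max_turns calls."""
    for _ in range(max_turns):
        content, finish_reason = complete(messages)
        messages.append({"role": "assistant", "content": content})
        if finish_reason != "length":
            break
    return messages


def make_fake_completion(pieces: list[str]) -> CompletionFn:
    """Stand-in for the API: every piece but the last is reported as truncated."""
    queue = list(pieces)

    def fake(messages: list[dict]) -> tuple[str, str]:
        piece = queue.pop(0)
        return piece, ("length" if queue else "stop")

    return fake


demo_messages = [{"role": "user", "content": "Translate this text."}]
demo_result = continue_until_done(demo_messages, make_fake_completion(["Part 1", "Part 2", "Part 3"]))
print([m["content"] for m in demo_result if m["role"] == "assistant"])
```

A plain loop with a turn cap avoids deep recursion and makes the worst-case number of API calls explicit.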

For managing context length, meanwhile, we have a couple options: we can either pre-estimate the necessary context length and select an appropriate model, or we can start with the smallest available model and fall back to a larger one whenever we run up against (or estimate that we will run up against) the context limit. The latter approach is less expensive in terms of token cost and latency, but it also risks poorer output, even from the more powerful fallback model, because the smaller model's output acts as part of the prompt for, and sets the pattern for, the larger model's output. (In fact, if we cared about maximizing output quality rather than minimizing cost, we might start with `gpt-4` or `gpt-4-32k`—with respective context lengths of 8,192 and 32,768 tokens—rather than `gpt-3.5-turbo`. However, in that case we might want to be a little stingier with our input tokens on the first pass instead of dumping the whole text into context!)
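To make the cost side of this tradeoff concrete, here is a rough per-request estimate using the prices from the `model_cost` table defined earlier. The helper name `estimate_cost` is hypothetical, the prices are simply the ones recorded in this notebook (they may since have changed), and the token counts are round illustrative numbers.

```python
# (input price, output price) in USD per token, copied from the table above
model_cost = {
    "gpt-3.5-turbo": (0.0005 / 1000, 0.0015 / 1000),
    "gpt-4": (0.03 / 1000, 0.06 / 1000),
    "gpt-4-32k": (0.06 / 1000, 0.12 / 1000),
    "gpt-4-turbo-preview": (0.01 / 1000, 0.03 / 1000),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its input and output token counts."""
    input_price, output_price = model_cost[model]
    return input_tokens * input_price + output_tokens * output_price


# Illustrative request: 10,000 input tokens and a full 4,096-token completion
for name in ("gpt-3.5-turbo", "gpt-4-turbo-preview"):
    print(f"{name}: ${estimate_cost(name, 10_000, 4_096):.4f}")
```

At these prices the same request costs roughly a cent on `gpt-3.5-turbo` and roughly 22 cents on `gpt-4-turbo-preview`, about a 20x difference, and the gap compounds with every continuation turn because the whole message list is resent as input.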

The following code demonstrates the latter approach, re-estimating the context length on each turn and switching from a smaller to a larger model as necessary:

In [47]:
system_prompt = "The user will provide a portion of the text of Herman Melville's " + \
"classic public domain novel Moby Dick. Your task is to translate the text into " + \
"modern American vernacular. You should try to preserve the meaning and setting " + \
"of the text, while modernizing its voice. You are only responsible to translate " + \
"the portion of the text in the user's prompt, but you must provide a full, " + \
"unabbreviated translation of this text. Your response will be parsed " + \
"by software and not by a human, and you will only be prompted for a continuation " + \
"if your response is interrupted for reasons of length, so do not use a stop " + \
"token until you are completely done with the translation task. If your response " + \
"is interrupted, pick up in the next message exactly where you left off."

# Define LLM parameters
models = {
        16384: "gpt-3.5-turbo",
        128000: "gpt-4-turbo-preview"
    }

def select_model(message_list: list[dict], minimum_output_tokens: int = 4096) -> str:
    # Validate minimum_output_tokens
    if minimum_output_tokens < 1 or minimum_output_tokens > 4096:
        raise ValueError("minimum_output_tokens must be between 1 and 4096.")

    # Estimate the context length including the minimum output tokens
    input_tokens = estimate_context_length(str(message_list))
    total_tokens = input_tokens + minimum_output_tokens

    # Filter models that can handle the total_tokens and select the smallest one
    suitable_models = [size for size in models.keys() if size >= total_tokens]

    if not suitable_models:
        # If no model supports the context length, raise an error
        # (Note: This constraint *should* prevent RateLimitError and BadRequestError.)
        raise ValueError("Context length exceeds the maximum supported token count.")

    selected_model_size = min(suitable_models)
    return models[selected_model_size]


def prompt_llm(
            message_list: list[dict],
            minimum_output_tokens: int = 4096,
            total_tokens_used: int = 0
        ) -> tuple[list[dict], int]:

    print(f"Message list length: {len(message_list)}, Total tokens used: {total_tokens_used}")

    # Select appropriate model based on prompt length and minimum_output_tokens
    model = select_model(message_list, minimum_output_tokens)

    try:
        # Prompt the LLM with the current message_list
        response = client.chat.completions.create(
            model=model,
            messages=message_list,
            temperature=0
        )
        # Update the message list with the response
        message_list.append(
                {"role": "assistant", "content": response.choices[0].message.content}
            )
        # Update total token usage
        total_tokens_used += response.usage.total_tokens

        print("Response appended to message list.")

        # If response was interrupted due to length, append response to message_list
        # and call function recursively with updated message_list to prompt again
        if response.choices[0].finish_reason == "length":
            print("Response finish reason: length. Recursing with updated message list.")
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_tokens_used
            )

    except Exception as e:
        # Try selecting model with more conservative assumptions before raising BadRequestError
        if isinstance(e, BadRequestError) and minimum_output_tokens < 4096:
            print("BadRequestError encountered. Retrying with minimum_output_tokens=4096.")
            return prompt_llm(
                message_list,
                4096,
                total_tokens_used
            )
        else:
            print(f"Exception encountered: {e}")
            raise e

    return message_list, total_tokens_used

message_list_prompt = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": medium_text}
            ]

message_list_result, tokens_used = prompt_llm(message_list_prompt, 4096)
full_output = get_full_output(message_list_result)
print("Tokens used:")
print(tokens_used)
print("Full output:")
print(full_output)

Message list length: 2, Total tokens used: 0
Response appended to message list.
Tokens used:
10663
Full output:
Call me Ishmael. So, a while back—never mind exactly how long—when I was pretty much broke, with no cash in my pocket and nothing exciting happening on land, I figured I'd hit the seas for a bit and check out the watery side of the world. It's my way of shaking off the blues and getting my blood pumping. Whenever I start feeling all serious and gloomy; when my soul feels like a damp, drizzly November day; when I catch myself staring at coffin shops and following funerals; and especially when my bad moods start making me want to knock people's hats off in the street, then I know it's time to head to sea. It's my way of dealing with things. Cato might throw himself on his sword with a flourish, but I prefer to hop on a ship quietly. It's not that surprising. If people knew, most of them probably feel the same way about the ocean at some point in their lives.

Now, picture this:

Unfortunately, we run into a problem when we try to use `finish_reason=='length'` as a condition for continuing the LLM's response. Apparently, the 'length' stopping reason only occurs when the model reaches the end of its context length. I cannot seem to get the model to overflow its 4096 output token limit and throw this signal, even if I explicitly instruct it to do so. Thus, we have to find a different heuristic for determining whether to continue generating text or stop.

## Using a stopping signal to determine whether the model has completed the task

Instead of using `finish_reason`, we can explicitly ask the model to tell us if it has completed the task. If it has, we can break the loop. If it hasn't, we can continue. The danger here is that the model might misrender our stopping signal, and we might therefore fail to detect it. We can mitigate this risk by using fuzzy string matching to detect the stopping signal, and by using a recursion depth limit as a last resort to prevent infinite loops (although our context length constraints should already prevent this).

In [54]:
system_prompt = "The user will provide a portion of the text of Herman Melville's " + \
"classic public domain novel Moby Dick. Your task is to translate the text into " + \
"modern American vernacular. You should try to preserve the meaning and setting " + \
"of the text, while modernizing its voice. You are only responsible to translate " + \
"the portion of the text in the user's prompt, but you must provide a full, " + \
"unabbreviated translation of this text. If your response is interrupted, pick up " + \
"in the next message exactly where you left off."

user_prompt = "If you have completed your translation of the provided text, " + \
"respond with 'STOP'. Otherwise, continue your translation from exactly where you " + \
"left off."

# Define LLM parameters
models = {
        16384: "gpt-3.5-turbo",
        128000: "gpt-4-turbo-preview"
    }


def is_stop_signal(content: str) -> bool:
    """
    Check if the response content loosely matches 'STOP', ignoring case and
    allowing for some variation in spacing and punctuation.
    """
    # Regular expression to match 'STOP' with flexibility
    # \s* allows for any number of spaces, [.,!?]* allows for trailing punctuation
    stop_pattern = re.compile(r"\s*STOP\s*[.,!?]*\s*$", re.IGNORECASE)
    return bool(stop_pattern.match(content))


def prompt_llm(
            message_list: list[dict],
            minimum_output_tokens: int = 4096,
            total_tokens_used: int = 0,
            depth: int = 0,
            max_depth: int = 16
        ) -> tuple[list[dict], int]:
    # Check recursion depth
    if depth > max_depth:
        print("Recursion depth limit reached.")
        return message_list, total_tokens_used

    print(f"Recursion depth: {depth}, Message list length: {len(message_list)}, Total tokens used: {total_tokens_used}")

    try:
        # Select appropriate model based on prompt length and minimum_output_tokens
        model = select_model(message_list, minimum_output_tokens)
        
        # Prompt the LLM with the current message_list
        response = client.chat.completions.create(
            model=model,
            messages=message_list,
            temperature=0
        )

        # Remove any text from message that is enclosed in double backslashes
        cleaned_message = re.sub(r"\\\\.*?\\\\", "", response.choices[0].message.content)

        # Update the message list with the response
        if not is_stop_signal(cleaned_message):
            message_list.append(
                    {"role": "assistant", "content": cleaned_message}
                )
        
        # Update total token usage
        total_tokens_used += response.usage.total_tokens

        print("Response appended to message list.")

        # If response was interrupted due to length, call function recursively with
        # updated message_list to prompt again
        if response.choices[0].finish_reason == "length":
            print("Response finish reason: length. Recursing with updated message list.")
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_tokens_used,
                depth + 1,
                max_depth
            )
        
        # If content of the response is not 'STOP', append user_prompt message and call
        # function recursively with message_list to prompt again
        if is_stop_signal(cleaned_message):
            print("Assistant response is 'STOP'. Returning message list.")
            return message_list, total_tokens_used
        else:
            print("Assistant response is not 'STOP'. Appending user prompt and recursing.")
            message_list.append({"role": "user", "content": user_prompt})
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_tokens_used,
                depth + 1,
                max_depth
            )            

    except Exception as e:
        # Try selecting model with more conservative assumptions before raising BadRequestError
        if isinstance(e, BadRequestError) and minimum_output_tokens < 4096:
            print("BadRequestError encountered. Retrying with minimum_output_tokens=4096.")
            return prompt_llm(
                message_list,
                4096,
                total_tokens_used,
                depth + 1,
                max_depth
            )
        else:
            print(f"Exception encountered: {e}")
            raise e


message_list_prompt = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": medium_text}
            ]

message_list_result, tokens_used = prompt_llm(message_list_prompt, 4096)
full_output = get_full_output(message_list_result)
print("Tokens used:")
print(tokens_used)
print("Full output:")
print(full_output)


Recursion depth: 0, Message list length: 2, Total tokens used: 0
Response appended to message list.
Assistant response is not 'STOP'. Appending user prompt and recursing.
Recursion depth: 1, Message list length: 4, Total tokens used: 10582
Response appended to message list.
Assistant response is 'STOP'. Returning message list.
Tokens used:
21201
Full output:
Call me Ishmael. So, a while back—don't worry about the exact time—when I was broke and bored on land, I decided to hit the seas for a change of scenery. It's my way of shaking off the blues and getting some fresh air. Whenever I start feeling down, or when the weather matches my mood, or when I catch myself lingering near funeral homes and following funerals, or especially when my worries start getting the best of me to the point where I have to really fight the urge to start knocking people's hats off in the street, that's when I know it's time to head out to sea. It's my way of clearing my head. Instead of resorting to violence,
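It is worth checking exactly which responses the stop check accepts. The snippet below reuses the same regular expression as `is_stop_signal` and runs it against a few made-up sample responses (the samples are assumptions for illustration, not actual model output).

```python
import re

# Same pattern as in is_stop_signal above
stop_pattern = re.compile(r"\s*STOP\s*[.,!?]*\s*$", re.IGNORECASE)


def is_stop_signal(content: str) -> bool:
    return bool(stop_pattern.match(content))


samples = [
    "STOP",
    " stop. ",
    "Stop!\n",
    "I will STOP now",
    "...and that was the end of it. STOP",
]
results = {sample: is_stop_signal(sample) for sample in samples}
for sample, matched in results.items():
    print(repr(sample), matched)
```

Because `match` anchors at the start of the string and `$` at the end, the check only fires when the entire message is the stop word (give or take whitespace, case, and trailing punctuation). A response that finishes its translation and appends 'STOP' in the same message is not detected, and the loop will prompt again.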

## Prompt engineering for a more accurate 'STOP' signal

When we look at the above output, we notice a problem: we fed in 10,000 tokens, but the output is only about 1,000. This isn't (only) because the translation is more concise than the original. It's also because, for some reason, the stopping signal is being thrown prematurely. The translation covers only a small portion of the first chapter, about 5,000 characters of English text.

Perhaps this is because we are not allowing the model to reason about whether it has completed the task before returning 'STOP', which allows random chance to determine when the model stops. So, we can try asking the model to reason first about whether it has completed the task. Let's see whether this improves performance:

In [62]:
system_prompt = "The user will provide a portion of the text of Herman Melville's " + \
"classic public domain novel Moby Dick. Your task is to translate the text into " + \
"modern American vernacular. You should try to preserve the meaning and setting " + \
"of the text, while modernizing its voice. Each chapter and each paragraph in the " + \
"original text should have a corresponding chapter or paragraph in the " + \
"translation. You are only responsible to translate the portion of the text in " + \
"the user's prompt, but you must provide a full, unabbreviated translation of " + \
"this text. If your response is interrupted, pick up in the next message exactly " + \
"where you left off."

user_prompt = "Begin your next response by reasoning about how much of the text " + \
"in the user's prompt you have already translated, and where in that text " + \
"you left off. You must enclose your reasoning in \\\\double backslashes\\\\ to " + \
"distinguish it from your output. After your reasoning block, respond with 'STOP' " + \
"if, and only if, you have completed your full, unabbreviated translation of " + \
"the user's ENTIRE text prompt. Consider what chapter the user's prompt ended " + \
"in, and whether your translation has yet reached that chapter. Also consider " + \
"whether the specific narrative events of the final paragraph of the user's " + \
"prompt correspond to the specific narrative events described in the last paragraph " + \
"of your translated text. Only return 'STOP' if both these conditions are met. " + \
"Examples:\n\n" + \
"\\\\The text in the user's prompt ended with a description of the carpenter " + \
"making Ahab a peg leg in Chapter 145. My translation concluded in Chapter 145 " + \
"with a corresponding description of the carpenter making Ahab a wooden " + \
"prosthetic. Thus, the translation of the provided text is complete.\\\\STOP\n\n" + \
"\\\\The text in the user's prompt broke off in the middle of chapter 5. I " + \
"concluded my last message with a translation of Ishmael asking the landlord if " + \
"the harpooner always keeps such late hours in the middle of chapter 3. The " + \
"translation of the provided text has not yet reached chapter 5, so I will " + \
"continue from where I left off.\\\\Narrative continues here."

# Define LLM parameters
models = {
        16384: "gpt-3.5-turbo",
        128000: "gpt-4-turbo-preview"
    }


def is_stop_signal(content: str) -> bool:
    """
    Check if the response content loosely matches 'STOP', ignoring case and
    allowing for some variation in spacing and punctuation.
    """
    # Regular expression to match 'STOP' with flexibility
    # \s* allows for any number of spaces, [.,!?]* allows for trailing punctuation
    stop_pattern = re.compile(r"\s*STOP\s*[.,!?]*\s*$", re.IGNORECASE)
    return bool(stop_pattern.match(content))


def clean_output(content: str) -> str:
    # Strip any reasoning block enclosed in double backslashes
    return re.sub(r"\\\\.*?\\\\", "", content)


def prompt_llm(
            message_list: list[dict],
            minimum_output_tokens: int = 4096,
            total_tokens_used: int = 0,
            depth: int = 0,
            max_depth: int = 3
        ) -> tuple[list[dict], int]:
    # Check recursion depth
    if depth > max_depth:
        print("Recursion depth limit reached.")
        return message_list, total_tokens_used

    print(f"Recursion depth: {depth}, Message list length: {len(message_list)}, Total tokens used: {total_tokens_used}")

    try:
        # Select appropriate model based on prompt length and minimum_output_tokens
        model = select_model(message_list, minimum_output_tokens)
        
        # Prompt the LLM with the current message_list
        response = client.chat.completions.create(
            model=model,
            messages=message_list,
            temperature=0
        )

        # Remove any text from message that is enclosed in double backslashes
        cleaned_message = clean_output(response.choices[0].message.content)

        # Update the message list with the response
        if not is_stop_signal(cleaned_message):
            message_list.append(
                    {"role": "assistant", "content": response.choices[0].message.content}
                )
        
        # Update total token usage
        total_tokens_used += response.usage.total_tokens

        print("Response appended to message list.")

        # If response was interrupted due to length, call function recursively with
        # updated message_list to prompt again
        if response.choices[0].finish_reason == "length":
            print("Response finish reason: length. Recursing with updated message list.")
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_tokens_used,
                depth + 1,
                max_depth
            )
        
        # If content of the response is not 'STOP', append user_prompt message and call
        # function recursively with message_list to prompt again
        if is_stop_signal(cleaned_message):
            print("Assistant response is 'STOP'. Returning message list.")
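            # Guard against a bare 'STOP' with no double-backslash reasoning attached
            stop_reasons = re.findall(r"\\\\(.*?)\\\\", response.choices[0].message.content)
            print("Stop reason: ", stop_reasons[0] if stop_reasons else "(none given)")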
            return message_list, total_tokens_used
        else:
            print("Assistant response is not 'STOP'. Appending user prompt and recursing.")
            message_list.append({"role": "user", "content": user_prompt})
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_tokens_used,
                depth + 1,
                max_depth
            )            

    except Exception as e:
        # Try selecting model with more conservative assumptions before raising BadRequestError
        if isinstance(e, BadRequestError) and minimum_output_tokens < 4096:
            print("BadRequestError encountered. Retrying with minimum_output_tokens=4096.")
            return prompt_llm(
                message_list,
                4096,
                total_tokens_used,
                depth + 1,
                max_depth
            )
        else:
            print(f"Exception encountered: {e}")
            raise e


message_list_prompt = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": medium_text}
            ]

message_list_result, tokens_used = prompt_llm(message_list_prompt, 4096)
full_output = clean_output(get_full_output(message_list_result))
print("Tokens used:")
print(tokens_used)
print("Full output:")
print(full_output)

Recursion depth: 0, Message list length: 2, Total tokens used: 0
Response appended to message list.
Assistant response is not 'STOP'. Appending user prompt and recursing.
Recursion depth: 1, Message list length: 4, Total tokens used: 11376
Response appended to message list.
Assistant response is 'STOP'. Returning message list.
Stop reason:  The user's prompt ended with a portion of text from Chapter 3, specifically describing Ishmael's first night at the Spouter-Inn and his contemplations about sharing a bed with the harpooner. My translation concluded with Ishmael being shown to his room by the landlord, which corresponds to the narrative events described in the last paragraph provided by the user. Therefore, the translation of the user's ENTIRE text prompt is complete.
Tokens used:
23150
Full output:
CHAPTER 1. Loomings.

Just call me Ishmael. A while back—don't worry about the exact time—when I was broke and bored on land, I decided to hit the seas for a change of scenery. Sailing h

We get a *little* further, but even after *extensive* prompt engineering, including providing examples, the model still cannot consistently determine how much of the task it has completed or how far its translation has progressed. Specifically, the chapter breaks seem to pose a problem, because the model wants to return a 'STOP' signal at the end of each chapter. Our prompt continues through Chapter 3, yet we can't seem to consistently get out of Chapter 1.

I'm honestly a little stunned that the model is this stubbornly dumb, but I suppose it's a known issue: OpenAI's models are not great at searching over a large context, and they tend to put too much weight on the beginning and end of it. They also tend not to be very good at planning or executive-function tasks such as deciding whether to stop or continue. And, of course, they're notoriously innumerate, so they can't count paragraphs or even, apparently, tell the difference between Chapters 1 and 3.
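
To see the stop-signal convention in isolation, here is a minimal sketch that separates the double-backslash reasoning preamble from the rest of a response and then checks whether what remains is just 'STOP'. The helpers `split_reasoning` and `is_stop` and the sample strings are illustrative assumptions, not part of the notebook's code. I've also assumed `re.DOTALL`, so that a preamble spanning several lines is still stripped, and `fullmatch`, so that nothing but the stop word is accepted.

```python
import re

# Reasoning is wrapped in double backslashes, e.g. \\...\\STOP
REASONING = re.compile(r"\\\\(.*?)\\\\", re.DOTALL)
STOP = re.compile(r"\s*STOP\s*[.,!?]*\s*", re.IGNORECASE)


def split_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, remainder) for a response (hypothetical helper)."""
    match = REASONING.search(content)
    reasoning = match.group(1).strip() if match else ""
    remainder = REASONING.sub("", content)
    return reasoning, remainder


def is_stop(remainder: str) -> bool:
    # fullmatch: the remainder must be nothing but 'STOP' plus spacing/punctuation
    return STOP.fullmatch(remainder) is not None


finished = "\\\\The translation reached the final paragraph.\\\\STOP"
unfinished = "\\\\I stopped in chapter 3.\\\\Narrative continues here."

print(split_reasoning(finished))
print(is_stop(split_reasoning(finished)[1]))
print(is_stop(split_reasoning(unfinished)[1]))
```

Note that this only tells us what the model *claims*; as the run above shows, the claim itself can be wrong.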

## Numbering paragraphs to improve performance

It's likely, I think, that we could get better results on our stopping task with GPT-4, but since we're specifically trying to take advantage of the Turbo model's long context, let's instead see if we can improve the results by numbering paragraphs. That way the model can keep track of its progress without having to count paragraphs or chapters or identify corresponding narrative events.

In [70]:
def number_lines(text: str) -> str:
    # Replace double line breaks with single line breaks
    text = text.replace("\n\n", "\n")
    # Split the text by line breaks
    paragraphs = text.split("\n")
    # Number the paragraphs
    numbered_paragraphs = [f"{i+1} {paragraph}" for i, paragraph in enumerate(paragraphs)]
    # Join the paragraphs back together
    return "\n".join(numbered_paragraphs)

def unnumber_lines(text: str) -> str:
    # Split the text by line breaks
    paragraphs = text.split("\n")
    # Remove the paragraph numbers
    unnumbered_paragraphs = [re.sub(r"^\d+ ", "", paragraph) for paragraph in paragraphs]
    # Join the paragraphs back together
    return "\n\n".join(unnumbered_paragraphs)

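from typing import Optional

def get_last_line_number(text: str) -> Optional[int]:
    # Split the text by line breaks
    paragraphs = text.strip().split("\n")
    # Get the last paragraph number, or None if the last line isn't numbered
    try:
        return int(paragraphs[-1].split(" ")[0])
    except ValueError:
        return None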

print(number_lines("This is a test.\n\nThis is another test."))
print(unnumber_lines("1 This is a test.\n2 This is another test."))
print(get_last_line_number("1 This is a test.\n2 This is another test."))

1 This is a test.
2 This is another test.
This is a test.

This is another test.
2


This should, in fact, altogether eliminate the need to re-prompt the model for a stopping signal, because we can simply automate comparison of the input and output paragraph numbers to determine whether the model has completed the task.
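
The comparison itself is only a few lines. The sketch below is self-contained and assumes the "number, space, text" line format produced by `number_lines`. The names `last_paragraph_number` and `is_translation_complete` are hypothetical, and treating an unnumbered final line as "not complete" is my assumption rather than something the notebook requires.

```python
from typing import Optional


def last_paragraph_number(text: str) -> Optional[int]:
    """Return the leading number of the last non-empty line, or None if absent."""
    lines = [line for line in text.strip().split("\n") if line.strip()]
    if not lines:
        return None
    try:
        return int(lines[-1].split(" ")[0])
    except ValueError:
        return None


def is_translation_complete(numbered_input: str, numbered_output: str) -> bool:
    """True once the output's last paragraph number reaches the input's."""
    last_in = last_paragraph_number(numbered_input)
    last_out = last_paragraph_number(numbered_output)
    if last_in is None or last_out is None:
        # Assumption: without numbers we cannot confirm completion
        return False
    return last_out >= last_in


source = "1 Call me Ishmael.\n2 There now is your insular city.\n3 Circumambulate the city."
partial = "1 Just call me Ishmael.\n2 So here's your island city."

print(is_translation_complete(source, partial))
print(is_translation_complete(source, partial + "\n3 Take a walk around town."))
```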

In [68]:
system_prompt = "The user will provide a portion of the text of Herman Melville's " + \
"classic public domain novel Moby Dick. Your task is to translate the text into " + \
"modern American vernacular. You should try to preserve the meaning and setting " + \
"of the text while modernizing its voice. Each numbered chapter and paragraph in " + \
"the original text should have a corresponding numbered chapter or paragraph in the " + \
"translation. You are only responsible to translate the portion of the text in " + \
"the user's prompt, but you must provide a full, unabbreviated translation of " + \
"this text. If your response is interrupted, pick up in the next message exactly " + \
"where you left off, using the paragraph numbers as a point of reference."

def prompt_llm(
            message_list: list[dict],
            minimum_output_tokens: int = 4096,
            total_cumulative_cost: float = 0.0,
            depth: int = 0,
            max_depth: int = 16
        ) -> tuple[list[dict], float]:
    # Check recursion depth
    if depth > max_depth:
        print("Recursion depth limit reached.")
        return message_list, total_cumulative_cost

    print(f"Recursion depth: {depth}, Message list length: {len(message_list)}, Total cumulative cost: {total_cumulative_cost}")

    try:
        # Select appropriate model based on prompt length and minimum_output_tokens
        model = select_model(message_list, minimum_output_tokens)

        # Prompt the LLM with the current message_list
        response = client.chat.completions.create(
            model=model,
            messages=message_list,
            temperature=0
        )

        # Update the message list with the response
        message_list.append(
                {"role": "assistant", "content": response.choices[0].message.content}
            )
        
        # Update total cumulative cost
        total_cumulative_cost += response.usage.prompt_tokens*model_cost[model][0] + response.usage.completion_tokens*model_cost[model][1]

        print("Response appended to message list.")

        # If response was interrupted due to length, call function recursively with
        # updated message_list to prompt again
        if response.choices[0].finish_reason == "length":
            print("Response finish reason: length. Recursing with updated message list.")
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_cumulative_cost,
                depth + 1,
                max_depth
            )
        
        # If line number is equal to or greater than last line in text prompt, break recursion
        output_last_line_number = get_last_line_number(response.choices[0].message.content)
        input_last_line_number = get_last_line_number(message_list[1]["content"])
        if output_last_line_number is None:
            warnings.warn("Paragraph numbers missing from output text. Breaking loop.")
            return message_list, total_cumulative_cost
        elif output_last_line_number >= input_last_line_number:
            print("Translation complete. Returning message list.")
            return message_list, total_cumulative_cost
        else:
            # Without this branch the function would implicitly return None
            warnings.warn("Translation incomplete. Returning partial message list.")
            return message_list, total_cumulative_cost

    except Exception as e:
        # Try selecting model with more conservative assumptions before raising BadRequestError
        if isinstance(e, BadRequestError) and minimum_output_tokens < 4096:
            print("BadRequestError encountered. Retrying with minimum_output_tokens=4096.")
            return prompt_llm(
                message_list,
                4096,
                total_cumulative_cost,
                depth + 1,
                max_depth
            )
        else:
            print(f"Exception encountered: {e}")
            raise e


message_list_prompt = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": number_lines(medium_text)}
            ]

message_list_result, tokens_used = prompt_llm(message_list_prompt, 4096)
full_output = unnumber_lines(get_full_output(message_list_result))
print("Tokens used:")
print(tokens_used)
print("Full output:")
print(full_output)

Recursion depth: 0, Message list length: 2, Total tokens used: 0
Response appended to message list.
Response finish reason: length. Recursing with updated message list.
Recursion depth: 1, Message list length: 3, Total tokens used: 13833
Response appended to message list.
Translation complete. Returning message list.
Tokens used:
29729
Full output:
CHAPTER 1. Loomings.

Just call me Ishmael. A while back—can't say exactly how long—when I was broke as a joke and had nothing going on ashore, I figured I'd hit the high seas for a bit. It's my way of shaking off the blues and getting some fresh air. Whenever I start feeling all gloomy and down, like a wet, dreary November day in my soul; whenever I catch myself staring at coffin shops and following funerals; and especially when my bad moods start making me want to start some trouble, like knocking people's hats off for no reason—well, that's when I know it's time to head out to sea. It's my way of clearing my head. While some folks might t

This device performs beautifully, and we can now generate the entire translation of the first three chapters. As a bonus, we use far fewer input tokens than with the prompt-engineering approach, because we don't have to re-prompt the model for a stopping signal after each text generation step. The main problem with this approach, of course, is that it requires a one-to-one correspondence between input and output paragraph numbers, which takes away some of the model's flexibility: it can't merge, split, or reorder paragraphs in the translation.
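
Because the approach depends on that one-to-one correspondence, it can be worth verifying it before trusting the output. The following is a minimal sketch, assuming numbered lines in the "number, space, text" format used above. `paragraph_numbers` and `find_mismatches` are hypothetical helpers, not part of the notebook.

```python
import re

NUMBERED_LINE = re.compile(r"^(\d+) ")


def paragraph_numbers(numbered_text: str) -> list[int]:
    """Collect the leading paragraph numbers, skipping unnumbered lines."""
    numbers = []
    for line in numbered_text.split("\n"):
        match = NUMBERED_LINE.match(line)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


def find_mismatches(numbered_input: str, numbered_output: str) -> dict[str, list[int]]:
    """Report paragraphs the model skipped, invented, or repeated."""
    expected = paragraph_numbers(numbered_input)
    produced = paragraph_numbers(numbered_output)
    return {
        "missing": sorted(set(expected) - set(produced)),
        "unexpected": sorted(set(produced) - set(expected)),
        "duplicated": sorted({n for n in produced if produced.count(n) > 1}),
    }


source = "1 First.\n2 Second.\n3 Third."
output = "1 Uno.\n3 Tres.\n3 Tres again.\n4 Extra."
print(find_mismatches(source, output))
```

An empty list under every key means the numbering lines up. It says nothing about whether each translated paragraph is faithful to its source.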

## Intelligently chunking the input

Note that up till now, we've been dealing with total context lengths no longer than about 20,000 tokens. But remember, our original context was 300,000 tokens long, and our goal was to translate the whole text. We can't submit that all at once, but we can submit it in chunks and then concatenate the responses. Since our output will be about the same length as our input, a 128,000-token context window can theoretically accommodate inputs up to about 64,000 tokens, though we'll want to leave some margin for error. We'll shoot for inputs on the order of 50,000 tokens. For continuity, we'll want to make sure that the last token of each chunk is a sentence-ending punctuation mark, perhaps ideally at the end of a paragraph.
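
The arithmetic behind those numbers can be sketched as follows. It assumes output length is roughly equal to input length, rounds the full text to 300,000 tokens, and treats the 50,000-token target as a ceiling per chunk. `plan_chunks` is a hypothetical name used only for this illustration.

```python
import math


def plan_chunks(total_tokens: int, context_window: int, target_input_tokens: int) -> dict:
    """Work out how many chunks are needed when output is about as long as input."""
    # Input and output share the window, so input can use at most half of it
    max_input_tokens = context_window // 2
    if target_input_tokens > max_input_tokens:
        raise ValueError("Target chunk size leaves no room for the output.")
    num_chunks = math.ceil(total_tokens / target_input_tokens)
    return {
        "max_input_tokens": max_input_tokens,
        "num_chunks": num_chunks,
        "tokens_per_chunk": math.ceil(total_tokens / num_chunks),
        # Tokens left over for the system prompt, re-prompts, and estimation error
        "headroom_per_chunk": context_window - 2 * target_input_tokens,
    }


print(plan_chunks(total_tokens=300_000, context_window=128_000, target_input_tokens=50_000))
```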

In [40]:
import re
import math
import warnings

# Regular expressions for finding paragraph ends, sentence-ending punctuation, and fallback to spaces
PARAGRAPH_END = re.compile(r'\n\n+')
SENTENCE_END = re.compile(r'[.!?][”’"\']*\s')
SPACE = re.compile(r'\s')

def find_nearest_break_point(search_text, start_offset):
    """
    Searches search_text for a split point, trying paragraph breaks first, then sentence
    endings, then any whitespace. Returns the absolute position (start_offset plus the end
    of the first match for the highest-priority pattern that matches), or None if none match.
    
    :param search_text: The text to search through.
    :param start_offset: The starting offset of search_text within the larger text.
    :return: The absolute position of the split point within the larger text or None if not found.
    """
    for pattern in [PARAGRAPH_END, SENTENCE_END, SPACE]:
        match = pattern.search(search_text)
        if match:
            return start_offset + match.end()
    return None

def split_into_chunks(long_text: str, max_token_length: int = 50000, margin_of_error: int = 100):
    """
    Splits a given text into chunks. If the text is <= max_token_length, returns it unchunked.
    Otherwise, splits the text into evenly sized chunks, trying not to exceed max_token_length,
    and aiming to end each chunk at a paragraph break or a sentence-ending punctuation
    within a margin of error.

    :param long_text: The text to be split.
    :param max_token_length: Target length of each chunk in tokens (default 50,000).
    :param margin_of_error: Margin of error for splitting point in tokens (default 100).
    :return: A list of text chunks.
    """
    max_character_length = max_token_length * 4
    total_characters = len(long_text)
    margin_in_chars = margin_of_error * 4

    if total_characters <= max_character_length:
        return [long_text]
    
    num_chunks = math.ceil(total_characters / max_character_length)
    optimal_char_length = total_characters // num_chunks

    chunks = []
    start_index = 0

    for chunk_index in range(num_chunks):
        end_index = start_index + optimal_char_length
        is_last_chunk = chunk_index == num_chunks - 1

        # Adjust the end_index for the last chunk
        if is_last_chunk:
            end_index = total_characters
            chunks.append(long_text[start_index:end_index].strip())
        else:
            search_start = max(start_index, end_index - margin_in_chars)
            search_end = min(end_index + margin_in_chars, total_characters)
            search_range = long_text[search_start:search_end]
            break_point = find_nearest_break_point(search_range, search_start)

            if break_point is not None:
                end_index = break_point
                chunks.append(long_text[start_index:end_index].strip())
                start_index = end_index
            else:
                warnings.warn(f"No suitable break point found in input text for chunk {chunk_index}. Using default split.")
                chunks.append(long_text[start_index:end_index].strip())
                start_index = end_index

    return chunks


# Print estimated number of tokens in long_text (assuming ~4 characters per token)
print("Number of tokens:", len(long_text)/4)

# Split the long_text into chunks
chunks = split_into_chunks(long_text, 50000, 100)
print("Number of chunks:", len(chunks))

# Print first and last 100 characters in each chunk
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk[:100])
    print("...")
    print(chunk[-100:])
    print(len(chunk))
    print()

Number of tokens: 297960.75
Number of chunks: 6
Chunk 1:
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago—never mind how long precisely—having little or 
...
and spare lines and harpoons, and spare everythings, almost, but a spare Captain and duplicate ship.
198705

Chunk 2:
At the period of our arrival at the Island, the heaviest storage of the Pequod had been almost compl
...
 does the passing mention of a White Friar or a White Nun, evoke such an eyeless statue in the soul?
198563

Chunk 3:
Or what is there apart from the traditions of dungeoned warriors and kings (which will not wholly ac
...
r indolent crests; and across the wide trance of the sea, east nodded to west, and the sun over all.
198372

Chunk 4:
Suddenly bubbles seemed bursting beneath my closed eyes; like vices my hands grasped the shrouds; so
...
ephants of antiquity often hailed the morning with their trunks uplifted in the profoundest silence.
198415

Chunk 5:
The chance comparison in this chapter, between the 

## Putting it all together

Let's put all these techniques together to see if the model now performs the task. Since querying these models is expensive, we'll use 20,000-token chunks and only submit the first chunk. The rest is left as an exercise for the reader.
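
To get a feel for why a full run is expensive, here is a rough estimate of what one chunk costs. Because each continuation re-sends the whole conversation so far, prompt tokens are billed again on every turn. This is a sketch under stated assumptions: output is about as long as input, every turn but the last produces the full 4,096-token output limit, re-prompt messages and the system prompt are ignored, and the default prices are the `gpt-4-turbo-preview` rates from this notebook's `model_cost` table. `estimate_chunk_cost` is a hypothetical helper.

```python
import math


def estimate_chunk_cost(
    input_tokens: int,
    output_tokens: int,
    max_output_per_turn: int = 4096,
    input_price: float = 0.01 / 1000,   # assumed $ per prompt token
    output_price: float = 0.03 / 1000,  # assumed $ per completion token
) -> tuple[int, float]:
    """Return (number of turns, estimated cost in dollars) for one chunk."""
    turns = math.ceil(output_tokens / max_output_per_turn)
    cost = 0.0
    generated = 0
    for _ in range(turns):
        # Each turn re-sends the original prompt plus everything generated so far
        prompt_tokens = input_tokens + generated
        completion_tokens = min(max_output_per_turn, output_tokens - generated)
        cost += prompt_tokens * input_price + completion_tokens * output_price
        generated += completion_tokens
    return turns, cost


turns, cost = estimate_chunk_cost(input_tokens=20_000, output_tokens=20_000)
print(f"{turns} turns, roughly ${cost:.2f} for one 20,000-token chunk")
```

The prompt side of the bill grows with every turn, so larger chunks cost disproportionately more than smaller ones.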

### Adjustable parameters

In [18]:
# Define input and output file paths
input_filepath = "sample_input.txt"
output_filepath = "sample_output.txt"

# Select models to use based on token count
models = {
        16384: "gpt-3.5-turbo",
        128000: "gpt-4-turbo-preview"
    }

# Set model costs per input and output token for calculating total cumulative spend
model_cost = {
    "gpt-3.5-turbo": (0.0005/1000, 0.0015/1000),
    "gpt-4": (0.03/1000, 0.06/1000),
    "gpt-4-32k": (0.06/1000, 0.12/1000),
    "gpt-4-turbo-preview": (0.01/1000, 0.03/1000)
}

# Set system prompt for the Moby Dick translation task
system_prompt = "The user will provide a portion of the text of Herman Melville's " + \
"classic public domain novel Moby Dick. Your task is to translate the text into " + \
"modern American vernacular. You should try to preserve the meaning and setting " + \
"of the text while modernizing its voice. Each numbered paragraph in the original " + \
"text should have a corresponding numbered paragraph in the translation. (When " + \
"numbering, don't treat chapter headers as paragraphs.) You are only " + \
"responsible to translate the portion of the text in the user's prompt, but you " + \
"must provide a full, unabbreviated translation of this text. If your response is " + \
"interrupted, pick up in the next message exactly where you left off, using the " + \
"paragraph numbers as a point of reference. If you complete the full task, " + \
"I will tip you $20."

### Function Definitions

In [21]:
import os
import re
import math
import warnings
from openai import OpenAI
from openai.types.chat import ChatCompletion
from openai import BadRequestError

client = OpenAI(
        api_key=os.getenv("OPENAI_API_KEY")
    )

# Regular expressions for finding paragraph ends, sentence-ending punctuation, and fallback to spaces
PARAGRAPH_END = re.compile(r'\n\n+')
SENTENCE_END = re.compile(r'[.!?][”’"\']*\s')
SPACE = re.compile(r'\s')

# Regular expression pattern for chapter titles
CHAPTER_TITLE = re.compile(r"^CHAPTER \d+\..*$", re.IGNORECASE)

def find_nearest_break_point(search_text, start_offset):
    """
    Searches search_text for a split point, trying paragraph breaks first, then sentence
    endings, then any whitespace. Returns the absolute position (start_offset plus the end
    of the first match for the highest-priority pattern that matches), or None if none match.
    
    :param search_text: The text to search through.
    :param start_offset: The starting offset of search_text within the larger text.
    :return: The absolute position of the split point within the larger text or None if not found.
    """
    for pattern in [PARAGRAPH_END, SENTENCE_END, SPACE]:
        match = pattern.search(search_text)
        if match:
            return start_offset + match.end()
    return None


def split_into_chunks(long_text: str, max_token_length: int = 50000, margin_of_error: int = 100):
    """
    Splits a given text into chunks. If the text is <= max_token_length, returns it unchunked.
    Otherwise, splits the text into evenly sized chunks, trying not to exceed max_token_length,
    and aiming to end each chunk at a paragraph break or a sentence-ending punctuation
    within a margin of error.

    :param long_text: The text to be split.
    :param max_token_length: Target length of each chunk in tokens (default 50,000).
    :param margin_of_error: Margin of error for splitting point in tokens (default 100).
    :return: A list of text chunks.
    """
    max_character_length = max_token_length * 4
    total_characters = len(long_text)
    margin_in_chars = margin_of_error * 4

    if total_characters <= max_character_length:
        return [long_text]
    
    num_chunks = math.ceil(total_characters / max_character_length)
    optimal_char_length = total_characters // num_chunks

    chunks = []
    start_index = 0

    for chunk_index in range(num_chunks):
        end_index = start_index + optimal_char_length
        is_last_chunk = chunk_index == num_chunks - 1

        # Adjust the end_index for the last chunk
        if is_last_chunk:
            end_index = total_characters
            chunks.append(long_text[start_index:end_index].strip())
        else:
            search_start = max(start_index, end_index - margin_in_chars)
            search_end = min(end_index + margin_in_chars, total_characters)
            search_range = long_text[search_start:search_end]
            break_point = find_nearest_break_point(search_range, search_start)

            if break_point is not None:
                end_index = break_point
                chunks.append(long_text[start_index:end_index].strip())
                start_index = end_index
            else:
                warnings.warn(f"No suitable break point found in input text for chunk {chunk_index}. Using default split.")
                chunks.append(long_text[start_index:end_index].strip())
                start_index = end_index

    return chunks


def estimate_context_length(context: str) -> int:
    # Count characters in the input
    prompt_length = len(context)

    # Assume ~4 characters per token
    prompt_token_count = prompt_length/4

    return int(prompt_token_count)


def select_model(message_list: list[dict], minimum_output_tokens: int = 4096) -> str:
    # Validate minimum_output_tokens
    if minimum_output_tokens < 1 or minimum_output_tokens > 4096:
        raise ValueError("minimum_output_tokens must be between 1 and 4096.")

    # Estimate the context length including the minimum output tokens
    input_tokens = estimate_context_length(str(message_list))
    total_tokens = input_tokens + minimum_output_tokens

    # Filter models that can handle the total_tokens and select the smallest one
    suitable_models = [size for size in models.keys() if size >= total_tokens]

    if not suitable_models:
        # If no model supports the context length, raise an error
        # (Note: This constraint *should* prevent RateLimitError and BadRequestError.)
        raise ValueError("Context length exceeds the maximum supported token count.")

    selected_model_size = min(suitable_models)
    return models[selected_model_size]


def get_full_output(message_list):
    # Join assistant message contents with line breaks, skipping missing or empty content
    return "\n".join([message["content"] for message in message_list if message.get("role") == "assistant" and message.get("content", "")])


def number_lines(text: str) -> str:
    # Replace double line breaks with single line breaks
    text = text.replace("\n\n", "\n")
    # Split the text by line breaks
    paragraphs = text.split("\n")
    # Initialize a paragraph list and paragraph counter
    numbered_paragraphs = []
    paragraph_number = 1

    for paragraph in paragraphs:
        # Leave chapter titles unnumbered; number every other paragraph and increment the counter
        if CHAPTER_TITLE.match(paragraph):
            numbered_paragraphs.append(paragraph)
        else:
            numbered_paragraphs.append(f"{paragraph_number} {paragraph}")
            paragraph_number += 1

    # Join the paragraphs back together
    return "\n".join(numbered_paragraphs)


def unnumber_lines(text: str) -> str:
    # Replace any double line breaks with single line breaks
    text = text.replace("\n\n", "\n")
    # Split the text by line breaks
    paragraphs = text.strip().split("\n")
    # Remove the paragraph numbers
    unnumbered_paragraphs = [re.sub(r"^\d+ ", "", paragraph) for paragraph in paragraphs]
    # Join the paragraphs back together
    return "\n\n".join(unnumbered_paragraphs)


def get_last_line_number(text: str) -> int:
    # Split the text by line breaks
    paragraphs = text.strip().split("\n")
    
    # Get the last paragraph number
    try:
        print(paragraphs[-1])
        return int(paragraphs[-1].split(" ")[0])
    except ValueError:
        return None


def prompt_llm(
            message_list: list[dict],
            minimum_output_tokens: int = 4096,
            total_cumulative_cost: float = 0.0,
            depth: int = 0,
            max_depth: int = 16
        ) -> tuple[list[dict], float]:
    # Check recursion depth
    if depth > max_depth:
        print("Recursion depth limit reached.")
        return message_list, total_cumulative_cost

    print(f"Recursion depth: {depth}, Message list length: {len(message_list)}, Total cumulative cost: {total_cumulative_cost}")

    try:
        # Select appropriate model based on prompt length and minimum_output_tokens
        model = select_model(message_list, minimum_output_tokens)

        # Prompt the LLM with the current message_list
        response: ChatCompletion = client.chat.completions.create(
            model=model,
            messages=message_list,
            temperature=0
        )

        # Update the message list with the response
        message_list.append(
                {"role": "assistant", "content": response.choices[0].message.content}
            )
        
        # Update total cumulative cost
        total_cumulative_cost += response.usage.prompt_tokens*model_cost[model][0] + response.usage.completion_tokens*model_cost[model][1]

        print("Response appended to message list.")

        # If response was interrupted due to length, call function recursively with
        # updated message_list to prompt again
        if response.choices[0].finish_reason == "length":
            print("Response finish reason: length. Recursing with updated message list.")
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_cumulative_cost,
                depth + 1,
                max_depth
            )
        
        # If line number isn't equal to or greater than last line in text prompt, call
        # function recursively with message_list to prompt again
        print("Last output line: ")
        output_last_line_number = get_last_line_number(response.choices[0].message.content)
        print("Last input line: ")
        input_last_line_number = get_last_line_number(message_list[1]["content"])
        if output_last_line_number is None:
            warnings.warn("Paragraph numbers missing from output text. Breaking loop.")
            # Drop the last message from the message list
            message_list.pop()

            return message_list, total_cumulative_cost
        elif output_last_line_number >= input_last_line_number:
            print("Translation complete. Returning message list.")
            return message_list, total_cumulative_cost
        else:
            print("Translation incomplete. Recursing with updated message list.")
            user_prompt = "The original prompt ended with paragraph " + \
            str(input_last_line_number) + ", and you have only modernized through " + \
            "paragraph " + str(output_last_line_number) + ", so your task is not " + \
            "yet complete. To earn your $20 tip, you must resume from paragraph " + \
            str(output_last_line_number + 1) + ", remembering to number paragraphs " + \
            "and to maintain a one-to-one correspondence between input and output " + \
            "paragraph numbers until the task is complete."
            message_list.append(
                {"role": "user", "content": user_prompt}
            )
            return prompt_llm(
                message_list,
                minimum_output_tokens,
                total_cumulative_cost,
                depth + 1,
                max_depth
            )            

    except Exception as e:
        # Try selecting model with more conservative assumptions before raising BadRequestError
        if isinstance(e, BadRequestError) and minimum_output_tokens < 4096:
            print("BadRequestError encountered. Retrying with minimum_output_tokens=4096.")
            return prompt_llm(
                message_list,
                4096,
                total_cumulative_cost,
                depth + 1,
                max_depth
            )
        else:
            print(f"Exception encountered: {e}")
            raise e

### Chunking

In [22]:
with open(input_filepath, "r", encoding="utf-8") as file:
    long_text = file.read()

print("Estimated token count of input text: " + str(estimate_context_length(long_text)))

# Split the long_text into chunks
chunks = split_into_chunks(long_text, 20000, 100)

# Inspect chunks by printing first and last 100 characters in each chunk
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk[:100])
    print("...")
    print(chunk[-100:])
    print(f"Chunk length: ~{int(round(len(chunk)/4,-2))} tokens")
    print()

Estimated token count of input text: 297792
Chunk 1:
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago—never mind how long precisely—having little or 
...
self-containing stronghold—a lofty Ehrenbreitstein, with a perennial well of water within the walls.
Chunk length: ~19900 tokens

Chunk 2:
But the side ladder was not the only strange feature of the place, borrowed from the chaplain’s form
...
on it, not to speak of my three years’ beef and board, for which I would not have to pay one stiver.
Chunk length: ~19900 tokens

Chunk 3:
It might be thought that this was a poor way to accumulate a princely fortune—and so it was, a very 
...
 themselves there, about something which he would find out when he obeyed the order, and not sooner.
Chunk length: ~20000 tokens

Chunk 4:
What, perhaps, with other things, made Stubb such an easy-going, unfearing man, so cheerily trudging
...
stars; even as the look-outs of a modern ship sing out for a sail, or a whale just bearing in sight.
Chunk len

### Execution

In [25]:
full_output = ""
cumulative_cost = 0.0

for i, chunk in enumerate(chunks):
    # Debug: Only process first chunk
    if i == 0:
        message_list_prompt = [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": number_lines(chunk)}
                ]

        message_list_result, cost = prompt_llm(message_list_prompt, 4096)
        full_output += unnumber_lines(get_full_output(message_list_result)) + "\n"
        cumulative_cost += cost

        print(f"Chunk {i + 1} cost: {cost}")

print("Cumulative cost:")
print(cumulative_cost)

with open("sample_output.txt", "w", encoding="utf-8") as file:
    file.write(full_output)

Recursion depth: 0, Message list length: 2, Total cumulative cost: 0.0
Response appended to message list.
Response finish reason: length. Recursing with updated message list.
Recursion depth: 1, Message list length: 3, Total cumulative cost: 0.31349000000000005
Response appended to message list.
Last output line: 
111 The chapel service, a moment of communal reflection and spiritual preparation, provided a foundation of strength and resolve for the challenges and adventures that lay ahead, a reminder of the enduring human spirit and the bonds that unite us in our shared journey through life.
Last input line: 
140 I pondered some time without fully comprehending the reason for this. Father Mapple enjoyed such a wide reputation for sincerity and sanctity, that I could not suspect him of courting notoriety by any mere tricks of the stage. No, thought I, there must be some sober reason for this thing; furthermore, it must symbolize something unseen. Can it be, then, that by that act of phy

It took some wrestling, but we finally got the expected output, which can be inspected in the `sample_output.txt` file.

To save on cost, I am not going to process the entire text. To do so yourself, remove the `if i == 0:` check from the `for` loop in the `Execution` section and unindent the block beneath it. (Feel free to remove the debugging print statements throughout the code as well!)
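One way to avoid editing the loop by hand is to make the chunk limit a parameter. The sketch below is a hypothetical refactoring, not the notebook's own code: `process_chunks`, `process_fn`, and `max_chunks` are illustrative names, and `process_fn` stands in for the `prompt_llm` call together with the line numbering and un-numbering steps.

```python
from itertools import islice


def process_chunks(chunks, process_fn, max_chunks=None):
    """Run process_fn on up to max_chunks chunks (all of them if None).

    process_fn is assumed to take one chunk and return (output_text, cost).
    """
    full_output = ""
    cumulative_cost = 0.0
    # islice with a stop of None iterates over every chunk
    for i, chunk in enumerate(islice(chunks, max_chunks)):
        output_text, cost = process_fn(chunk)
        full_output += output_text + "\n"
        cumulative_cost += cost
        print(f"Chunk {i + 1} cost: {cost}")
    return full_output, cumulative_cost


# Stand-in for the real LLM call, so the sketch runs without API access
def fake_process(chunk):
    return chunk.upper(), 0.5


demo_output, demo_cost = process_chunks(
    ["one", "two", "three"], fake_process, max_chunks=1
)
```

Passing `max_chunks=1` reproduces the debug behavior of the loop above, and omitting it processes every chunk.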

Note that we've got 13 chunks, and the first chunk cost us nearly a dollar to process, which suggests we could expect to spend $10-12 to process the entire text of Moby Dick. Costs would increase if we used a larger chunk size, such as 50,000 tokens instead of 20,000, because every continuation request resends the whole chunk as input. (I won't tell you how much I spent on API token usage for experimentation while developing this notebook!)
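The arithmetic behind that estimate can be sketched as follows. This is a back-of-envelope extrapolation rather than a measurement: the per-chunk bounds of $0.80 and $0.95 are assumed stand-ins for "nearly a dollar", and `estimate_total_cost` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
def estimate_total_cost(num_chunks, cost_per_chunk):
    """Extrapolate a total, assuming every chunk costs about the same."""
    return num_chunks * cost_per_chunk


num_chunks = 13  # chunk count reported above

# Assumed lower and upper bounds for "nearly a dollar" per chunk
low = estimate_total_cost(num_chunks, 0.80)
high = estimate_total_cost(num_chunks, 0.95)

print(f"Estimated total: ${low:.2f} to ${high:.2f}")
```

The estimate assumes the first chunk is representative. Chunks that need more continuation turns would cost more.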