# Case Study: Document Processing

This case study covers an application that corrects spelling and grammar errors in a document.

More advanced applications might enforce an organization's style guide, or review documents for statements that create legal risk.

LLMs are a powerful tool for document processing. However, as documents get longer, the time the application takes to complete the task becomes increasingly important. If the application runs as an overnight batch process, this is less of a concern; but if a user is waiting for the output to continue their workflow, response time is key to the user experience.

The techniques shown here speed up processing of the document by approximately 100x, from 315 seconds to about 3 seconds.

In [2]:
# import dependencies
import copy
import datetime
import json
import os
import textwrap
import time

from dotenv import load_dotenv
from openai import AzureOpenAI

# Load environment variables
load_dotenv()

def aoai_call(system_message,prompt,model):
    client = AzureOpenAI(
        api_version=os.getenv("API_VERSION"),
        azure_endpoint=os.getenv("AZURE_ENDPOINT"),
        api_key=os.getenv("API_KEY")
    )

    start_time = time.time()

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ],
    )

    end_time = time.time()
    e2e_time = end_time - start_time

    result=json.loads(completion.model_dump_json(indent=2))
    prompt_tokens=result["usage"]["prompt_tokens"]
    completion_tokens=result["usage"]["completion_tokens"]
    completion_text=result["choices"][0]["message"]["content"]

    return result,prompt_tokens,completion_tokens,completion_text,e2e_time

model=os.getenv("MODELGPT432k")

# Read essay from a text file
with open('document_with_errors.txt', 'r') as f:
    document_with_errors = f.read()


### Base Case: LLM used to rewrite document and implement corrections

**Time taken: 315 seconds**

In this example, the LLM rewrites the document with the errors corrected and returns it under the JSON key "rewritten_document". The process is chunked because the LLM will sometimes refuse to rewrite a long document in full. Given one chunk at a time, the LLM writes out the whole document with all the corrections.

The JSON also includes, for each error in the document, an explanation of why it is an error, the incorrect text, and the corrected text, so that an application's front end can present the corrections to a human reviewer who checks the output.
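The chunking step described above can be sketched in isolation. This is a minimal illustration, not the notebook's exact input: `sample_document` is a hypothetical stand-in for `document_with_errors`, and the width of 1000 matches the value used in the cells below.

```python
import textwrap

# Hypothetical sample text standing in for document_with_errors.
sample_document = "The quick brown fox jumps over the lazy dog. " * 50

# textwrap.wrap breaks on whitespace, so words are never split.
# `width` is a maximum number of characters per chunk, not a number of words.
chunks = textwrap.wrap(sample_document, width=1000)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} characters")
```

Note that `textwrap.wrap` replaces newlines with spaces by default and drops the whitespace at each chunk boundary, so paragraph breaks in the original document are not preserved inside the chunks.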

In [52]:
import json
import textwrap

# Initialize variables
total_prompt_tokens = 0
total_completion_tokens = 0
total_time = 0
rewritten_document = ""
list_of_errors_to_correct = []
completion_text_list=[]

# Split the text into chunks of at most 1000 characters, breaking on whitespace
chunks = textwrap.wrap(document_with_errors, 1000)
for i, chunk in enumerate(chunks):
    system_message="""
    You are a helpful AI assistant.
    """
    prompt=f"""
    Document to correct:
    {chunk}

    """
    prompt=prompt+"""
    The output should be a JSON object. Only return the JSON object, with no comments or additional text.
    Use this structure:
    {
        "rewritten_document": "The document with the errors corrected.",
        "list_of_errors_to_correct": [
            {
                "explanation_of_error": "Think step-by-step about identifying potential spelling errors or grammar issues. Consider all errors together and consider the whole sentence of text before applying a rule.",
                "incorrect_text": "If there is a error fill this with the errors text sub-string, consider the full sentence and context of the incorrect text before filling this in",
                "fixed_text": "Think step by step about the error and then fill this with the fixed version of the text, don't apply any other fixes apart from the error if there is one"
            }
        ]
    }
    JSON:
    """

    result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model)
    total_prompt_tokens += prompt_tokens
    total_completion_tokens += completion_tokens
    total_time += e2e_time
    completion_text_list.append(completion_text)

import json

# completion_text_list holds one JSON string per chunk
combined_list = {"list_of_errors_to_correct":[],"rewritten_document":""}
rewritten_chunks = []

for completion_text in completion_text_list:
    json_string = completion_text.replace('\n', '')
    data = json.loads(json_string)
    if data["list_of_errors_to_correct"]:
        combined_list["list_of_errors_to_correct"].append(data["list_of_errors_to_correct"])
    # Keep every rewritten chunk, including chunks that contained no errors
    rewritten_chunks.append(data["rewritten_document"])

# textwrap.wrap drops the whitespace at chunk boundaries, so rejoin with a space
combined_list["rewritten_document"] = " ".join(rewritten_chunks)

document_with_corrections=combined_list["rewritten_document"]
# print(document_with_corrections)


# Print totals
# print(f"Total Prompt Tokens: {total_prompt_tokens}")
# print(f"Total Completion Tokens: {total_completion_tokens}")
print(f"Total Time Taken: {total_time:.2f} seconds")
# print(f"List of Errors to Correct: {list_of_errors_to_correct}")

Total Prompt Tokens: 3292
Total Completion Tokens: 3725
Total Cost: $0.6445
Total Time Taken: 315.50 seconds
Rewritten Document: 
List of Errors to Correct: []


### Implement the "avoid-rewriting-documents" technique

**Time taken: 38 seconds**

Rewriting the entire document is very slow, because the length of the response is proportional to the length of the document. It is also proportional to the number of errors, because the model must explain each mistake.

Instead of rewriting the entire document, the LLM is used only to identify the mistakes, and Python code corrects the text by replacing just the sections that are wrong (similar to how a human would do it!).

The speed of the application is now proportional to the number of mistakes in the document, which will of course vary from document to document.
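The replacement step can be sketched on its own, without a model call. This is a minimal illustration under stated assumptions: `apply_corrections` is a hypothetical helper (the notebook applies `str.replace` directly), and the guard that skips substrings which are missing or ambiguous is an addition, not part of the original approach. The sample corrections reuse the `incorrect_text` and `fixed_text` keys from the prompt.

```python
def apply_corrections(document, corrections):
    """Replace each incorrect substring with its fix.

    Returns the corrected text and the corrections that were skipped
    because their substring was not found exactly once in the document.
    """
    skipped = []
    for item in corrections:
        incorrect, fixed = item["incorrect_text"], item["fixed_text"]
        # Skip anything the model misquoted (0 matches) or that is ambiguous (>1).
        if document.count(incorrect) != 1:
            skipped.append(item)
            continue
        document = document.replace(incorrect, fixed)
    return document, skipped


# Hypothetical sample input, standing in for the model's parsed output.
sample = "Initial projecctions for growth look strong. We aim to created value."
corrections = [
    {"incorrect_text": "Initial projecctions for", "fixed_text": "Initial projections for"},
    {"incorrect_text": "We aim to created", "fixed_text": "We aim to create"},
    {"incorrect_text": "text not in document", "fixed_text": "ignored"},
]

corrected, skipped = apply_corrections(sample, corrections)
print(corrected)
print(f"{len(skipped)} correction(s) skipped")
```

Collecting the skipped items, rather than silently ignoring them, gives a human reviewer a list of corrections that still need attention.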

In [4]:
import json
import textwrap

# Initialize variables
total_prompt_tokens = 0
total_completion_tokens = 0
total_time = 0
rewritten_document = ""
list_of_errors_to_correct = []
completion_text_list=[]

# Split the text into chunks of at most 1000 characters, breaking on whitespace
chunks = textwrap.wrap(document_with_errors, 1000)
for i, chunk in enumerate(chunks):
    system_message="""
    You are a helpful AI assistant.
    """
    prompt=f"""
    Document to correct:
    {chunk}

    """
    prompt=prompt+"""
    The output should be a JSON object. Only return the JSON object, with no comments or additional text.
    Use this structure:
    {
        "list_of_errors_to_correct": [
            {
                "explanation_of_error": "Think step-by-step about identifying potential spelling errors or grammar issues. Consider all errors together and consider the whole sentence of text before applying a rule.",
                "incorrect_text": "If there is a error fill this with the errors text sub-string, consider the full sentence and context of the incorrect text before filling this in",
                "fixed_text": "Think step by step about the error and then fill this with the fixed version of the text, don't apply any other fixes apart from the error if there is one"
            }
        ]
    }
    JSON:
    """

    result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model)
    total_prompt_tokens += prompt_tokens
    total_completion_tokens += completion_tokens
    total_time += e2e_time
    completion_text_list.append(completion_text)


import json

# completion_text_list holds one JSON string per chunk
combined_list = {"list_of_errors_to_correct":[]}

for completion_text in completion_text_list:
    json_string = completion_text.replace('\n', '')
    data = json.loads(json_string)
    if not data["list_of_errors_to_correct"]:
        pass
    else:
        combined_list["list_of_errors_to_correct"].append(data["list_of_errors_to_correct"])

# Apply each correction to the original document text
for error_list in combined_list["list_of_errors_to_correct"]:
    for error_dict in error_list:
        # The prompt names this key "incorrect_text"
        document_with_errors = document_with_errors.replace(error_dict['incorrect_text'], error_dict['fixed_text'])
document_with_corrections=document_with_errors
# print(document_with_corrections)

# Print totals
# print(f"Total Prompt Tokens: {total_prompt_tokens}")
# print(f"Total Completion Tokens: {total_completion_tokens}")
print(f"Total Time Taken: {total_time:.2f} seconds")
# print(f"List of Errors to Correct: {list_of_errors_to_correct}")

Total Prompt Tokens: 3255
Total Completion Tokens: 421
Total Time Taken: 38.15 seconds
Rewritten Document: 
List of Errors to Correct: []


### Implement the "prompt-for-concision" technique

**Time taken: 24 seconds**

Prompting the model to be concise when explaining the errors may lead to further improvements. In real-world testing, this gave up to a 2x improvement.

This is because the explanation of the error, unless it is particularly complex, can be cut from the model's default verbosity to a few words. Additional time can be saved by asking the model for only the shortest amount of text needed for the replacement.

Lastly, the JSON structure has been changed to an array. Because the JSON keys are no longer written out for every mistake, even fewer tokens are generated, which speeds up the application further.
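The saving from dropping the JSON keys can be illustrated by serializing the same two corrections both ways. This is a rough sketch: it compares character counts, which are only a proxy for the number of tokens the model would generate, and the sample corrections are taken from the example in the prompt below.

```python
import json

# The same two corrections in the keyed structure used by the earlier cells...
keyed = {
    "list_of_errors_to_correct": [
        {
            "explanation_of_error": "Verb tense",
            "incorrect_text": "Delayed using your",
            "fixed_text": "Delay using your",
        },
        {
            "explanation_of_error": "Spelling",
            "incorrect_text": "a tinny task",
            "fixed_text": "a tiny task",
        },
    ]
}

# ...and as a plain array of [explanation, incorrect text, fixed text].
as_array = [
    [e["explanation_of_error"], e["incorrect_text"], e["fixed_text"]]
    for e in keyed["list_of_errors_to_correct"]
]

keyed_output = json.dumps(keyed, separators=(",", ":"))
array_output = json.dumps(as_array, separators=(",", ":"))

print(f"keyed: {len(keyed_output)} characters")
print(f"array: {len(array_output)} characters")
```

The keys are repeated once per error, so the gap between the two formats grows with the number of corrections.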

In [18]:
import json
import textwrap

# Initialize variables
total_prompt_tokens = 0
total_completion_tokens = 0
total_time = 0
rewritten_document = ""
list_of_errors_to_correct = []
completion_text_list=[]

# Split the text into chunks of at most 1000 characters, breaking on whitespace
chunks = textwrap.wrap(document_with_errors, 1000)
for i, chunk in enumerate(chunks):
    system_message="""
    You are a helpful AI assistant.
    """
    prompt="""
    The output should be a list object. Only return the list object, with no comments or additional text.
    Use this structure:
    [["The first item in the list is an explanation of the error. Think step-by-step about identifying potential spelling errors or grammar issues. Consider all errors together and consider the whole sentence of text before applying a rule. Explain the error in the shortest sentence possible, ideally 3 to 7 words.","The second item in this list is the specific text that includes the error, typically around 3 words either side of the error. If there is a error fill this with the errors text sub-string, consider the full sentence and context of the incorrect text before filling this in. When selecting the substring of the text, use the shortest amount of text whilst ensuring the sub string is unique in the document.","The third item is the corrected text. It should be exactly the same string as the second item, but with the spelling error or grammar corrected. Think step by step about the error and then fill this with the fixed version of the text."]]

    START_EXAMPLE_1
    
    INPUT_DOCUMENT:
    Start Your Day Positively: Begin your mornings with small victories. Accomplish a tinny task, like making your bed or enjoying a cup of coffee. Set an intention for the day, whether it’s a guiding principle or a specific action you’ll take. Delayed using your phone upon waking up and replace social media scrolling with gratitude. Reflect on simple things you’re thankful for, like someone holding the door open or a warm cup of coffee from your partner.
    Prioritize Self-Care Throughout the Day: As the day progresses, continue nurturing your well-being. Savor encouraging words, read inspirational quotes, or revisit kind texts from friends. Stay active by taking short walks or practicing mindfulness. Remember that small, consistent efforts add up, contributing to your overall health and happiness. 
    OUTPUT_LIST:[["Verb tense","Delayed using your","Delay using your"],["Spelling","a tinny task","a tiny task"]]

    Do not output \n and many whitespaces.

    If there are no errors, return a list like this:[["NA","NA","NA"]]. Do not generate output like this:['\n    [["NA","NA","NA"]]']
    """

    prompt=prompt+f"""
    INPUT_DOCUMENT:
    {chunk}
    OUTPUT_LIST:
    """

    result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model)
    total_prompt_tokens += prompt_tokens
    total_completion_tokens += completion_tokens
    total_time += e2e_time
    completion_text_list.append(completion_text)

import json

# Assuming completion_text_list is your new list of strings
combined_list = []

for completion_text in completion_text_list:
    json_string = completion_text.replace('\n', '')
    data = json.loads(json_string)
    if data[0][0] == "NA":
        pass
    else:
        combined_list.append(data)

# Apply each correction to the document: error[1] is the incorrect text, error[2] is the fix
for error_list in combined_list:
    for error in error_list:
        document_with_errors = document_with_errors.replace(error[1], error[2])
document_with_corrections = document_with_errors
# print(document_with_corrections)


# Print totals
# print(f"Total Prompt Tokens: {total_prompt_tokens}")
# print(f"Total Completion Tokens: {total_completion_tokens}")
print(f"Total Time Taken: {total_time:.2f} seconds")
# print(f"Rewritten Document: {rewritten_document}")
# print(f"List of Errors to Correct: {list_of_errors_to_correct}")

Total Prompt Tokens: 6006
Total Completion Tokens: 96
Total Time Taken: 24.35 seconds
Rewritten Document: 
List of Errors to Correct: []


### Implement the parallelization technique and make it even more concise

**Time taken: 3 seconds**

Because the chunks do not have to be processed in sequential order, each chunk can be analysed in parallel. This means the time taken to process the document is no longer proportional to the length of the document, and is limited only by the chunk size and the amount of compute available!
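The effect of running the chunks concurrently can be seen without calling the API. In this minimal sketch, `fake_llm_call` is a hypothetical stand-in that simply sleeps for a fixed delay, and a semaphore caps the number of calls in flight; the cap of 16 mirrors the batch size used in the cell below, but the semaphore itself is an alternative to that cell's batching, not a copy of it.

```python
import asyncio
import time


async def fake_llm_call(chunk, delay=0.2):
    """Hypothetical stand-in for a network call to the model."""
    await asyncio.sleep(delay)
    return f"processed: {chunk}"


async def process_all(chunks, max_concurrency=16):
    # The semaphore limits how many calls are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk):
        async with semaphore:
            return await fake_llm_call(chunk)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(c) for c in chunks))


def run_demo():
    chunks = [f"chunk {i}" for i in range(8)]
    start = time.perf_counter()
    results = asyncio.run(process_all(chunks))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = run_demo()
print(f"{len(results)} chunks in {elapsed:.2f} seconds")
```

Eight sequential calls at 0.2 seconds each would take about 1.6 seconds, while the concurrent version should finish in roughly the time of a single call. Inside a notebook, where an event loop is already running, `await process_all(chunks)` would be used in place of `asyncio.run`.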

In [6]:
import asyncio
import aiohttp
import os
from aiohttp import ClientSession

async def fetch(session, system_message, user_message):
    url = f'{os.getenv("AZURE_ENDPOINT")}/openai/deployments/gpt-4/chat/completions?api-version={os.getenv("API_VERSION")}'
    headers = {
        "Content-Type": "application/json",
        "api-key": os.getenv("API_KEY")
    }  
    data = {
        "model": "gpt-4",  # Adjust the model as needed
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message}
        ]
    }
    async with session.post(url, json=data, headers=headers) as response:
        return await response.json()

async def main(system_message, user_messages):
    api_call_batch_size = 16

    async with ClientSession() as session:
        tasks = []
        responses = []
        for i, user_message in enumerate(user_messages):
            # await asyncio.sleep(1)  # Optional small delay to avoid hitting the rate limit
            task = asyncio.create_task(fetch(session, system_message, user_message))
            tasks.append(task)
            if len(tasks) >= api_call_batch_size: 
                responses.extend(await asyncio.gather(*tasks))
                tasks = []
        responses.extend(await asyncio.gather(*tasks))  # Process the last batch
        return responses

In [12]:
import json
import textwrap


# Split the text into chunks of at most 1000 characters, breaking on whitespace
chunks = textwrap.wrap(document_with_errors, 1000)
# Prepend "INPUT_DOCUMENT: " and append " OUTPUT_LIST:"
chunks = ["INPUT_DOCUMENT: " + chunk + " OUTPUT_LIST:" for chunk in chunks]

system_message="""
You are a helpful AI assistant.

###Important :
Do not add any additional information.
Make sure to complete all elements of the array.

The output should be a list object. Only return the list object, with no comments or additional text.
Use this structure:
[["The first item in the list is an explanation of the error. Think step-by-step about identifying potential spelling errors or grammar issues. Consider all errors together and consider the whole sentence of text before applying a rule. Explain the violation in the shortest sentence possible, ideally 3 to 7 words.","The second item in this list is the specific text that includes the error, typically around 3 words either side of the error. If there is a error fill this with the errors text sub-string, consider the full sentence and context of the incorrect text before filling this in. When selecting the substring of the text, use the shortest amount of text whilst ensuring the sub string is unique in the document.","The third item is the corrected text. It should be exactly the same string as the second item, but with the spelling error or grammar corrected. Think step by step about the error and then fill this with the fixed version of the text."]]

START_EXAMPLE_1

INPUT_DOCUMENT:
Start Your Day Positively: Begin your mornings with small victories. Accomplish a tinny task, like making your bed or enjoying a cup of coffee. Set an intention for the day, whether it’s a guiding principle or a specific action you’ll take. Delayed using your phone upon waking up and replace social media scrolling with gratitude. Reflect on simple things you’re thankful for, like someone holding the door open or a warm cup of coffee from your partner.
Prioritize Self-Care Throughout the Day: As the day progresses, continue nurturing your well-being. Savor encouraging words, read inspirational quotes, or revisit kind texts from friends. Stay active by taking short walks or practicing mindfulness. Remember that small, consistent efforts add up, contributing to your overall health and happiness. 
OUTPUT_LIST:[["Verb tense","Delayed using your","Delay using your"],["Spelling","a tinny task","a tiny task"]]

Do not output \n and many whitespaces.

If there are no errors, return a list like this:[["NA","NA","NA"]]. Do not generate output like this:['\n    [["NA","NA","NA"]]']
"""


# result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_batched_call(system_message,json.dumps([chunks]), model)

start = time.time()
responses_async = await main(system_message, chunks)
end = time.time()
run_time = end - start
print(f"Total time taken: {run_time} seconds")

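# Collect the completion text from each response
completion_text_list = []
for response in responses_async:
    try:
        completion_text_list.append(response["choices"][0]["message"]["content"])
    except (KeyError, IndexError, TypeError):
        # Skip responses that carry no completion (e.g. an error payload from the API)
        pass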
# print(completion_text_list)
# print(len(completion_text_list))
# # Print totals
# print(f"Total Prompt Tokens: {prompt_tokens}")
# print(f"Total Completion Tokens: {completion_tokens}")
# print(f"Total Time Taken: {e2e_time:.2f} seconds")
# print(f"Rewritten Document: {completion_text}")

import json

# Assuming completion_text_list is your new list of strings
combined_list = []

for completion_text in completion_text_list:
    json_string = completion_text.replace('\n', '')
    data = json.loads(json_string)
    if data[0][0] == "NA":
        pass
    else:
        combined_list.append(data)

# Apply each correction to the document: error[1] is the incorrect text, error[2] is the fix
for error_list in combined_list:
    for error in error_list:
        document_with_errors = document_with_errors.replace(error[1], error[2])
document_with_corrections = document_with_errors
print(document_with_corrections)

Total time taken: 2.819835901260376 seconds
['[["NA","NA","NA"]]', '[["Spelling","Initial projecctions for","Initial projections for"]]', '[["NA","NA","NA"]]', '[["Plural verb agreement","micro-transactions needs some","micro-transactions need some"]]', '[["NA","NA","NA"]]', '[["NA", "NA", "NA"]]', '[["NA","NA","NA"]]', '[["Incorrect verb form","We aim to created","We aim to create"]]', '[["NA","NA","NA"]]']
9


### Further improvements

Some errors can only be detected by analysing the document as a whole, for example consistency errors in the name of a speaker (the person's name is used multiple times in the document, and spelled incorrectly in one instance). A final pass could be made over the whole document, without chunking, to check for such errors.

Similarly, there is a risk of missing errors that span the boundary between two chunks.
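One possible mitigation for errors at chunk boundaries, not implemented in this notebook, is to make consecutive chunks overlap so that text near a boundary appears whole in at least one chunk. The sketch below is an assumption about how that could look: `overlapping_chunks` is a hypothetical helper that works on words rather than characters.

```python
def overlapping_chunks(words, chunk_size, overlap):
    """Split a list of words into chunks that share `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in overlapping_chunks(words, chunk_size=4, overlap=2):
    print(chunk)
```

With overlapping chunks, the same error may be reported twice, so the corrections would need to be de-duplicated before they are applied.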

Clearly there is further work that could be done to improve this process; this is intended only as a reference for how to implement these techniques.