# Using TransformConfig with Long Text

In this example, we demonstrate how to use the Transform flow to handle long text that exceeds the token limit of the underlying model. Two HuggingFace TransformConfigs accept the `auto_split_long_text` parameter:
- TransformQAHuggingFaceConfig
- TransformQAHuggingFaceJsonFormatConfig

We use the text from [War and Peace](https://github.com/mmcky/nyu-econ-370/blob/master/notebooks/data/book-war-and-peace.txt) as our example.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

We store the "War and Peace" text file in the `data/raw_input` directory as "book-war-and-peace.txt". You can download the file from [here](https://github.com/mmcky/nyu-econ-370/blob/master/notebooks/data/book-war-and-peace.txt).

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install -q python-dotenv

### Import dependencies

In [3]:
from dotenv import load_dotenv
from uniflow.flow.client import TransformClient
from uniflow.flow.config import HuggingfaceModelConfig, TransformQAHuggingFaceJsonFormatConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()


True

### Generate Context object with long text

In [4]:
def extract_contexts(file_path, ranges):
    # Read the entire file content
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Initialize the list for holding contexts
    contexts = []

    # Extract specified ranges
    for start, end in ranges:
        # Adjusting end index to fit within the file length if necessary
        end = min(end, len(content))
        context_text = content[start:end]
        contexts.append(Context(context=context_text))

    return contexts

# Define the ranges for the contexts
ranges = [(100, 32000), (32000+12600, 32000+13200)]

# Specify the file path
file_path = './data/raw_input/book-war-and-peace.txt'

# Extract contexts
contexts = extract_contexts(file_path, ranges)

for i, context in enumerate(contexts):
    preview_text = context.context[:50] + "..." if len(context.context) > 50 else context.context
    print(f"---\nContext {i+1} (Preview): '{preview_text}'\n---")

---
Context 1 (Preview): 'arn you, if you don't tell me that this means war,...'
---
---
Context 2 (Preview): ' as an emperor. So it
seems to me."

"Yes, yes, of...'
---
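
As a quick sanity check on the ranges above, the sketch below computes how many characters each range covers after the end index is clamped. `range_lengths` is a hypothetical helper, not part of `uniflow`, and the `content_length` value is a placeholder rather than the real size of the file.

```python
def range_lengths(ranges, content_length):
    """Return the number of characters each (start, end) range covers."""
    lengths = []
    for start, end in ranges:
        # Clamp the end index the same way extract_contexts does
        end = min(end, content_length)
        # A range that starts past the end of the text covers nothing
        lengths.append(max(0, end - start))
    return lengths


# Placeholder length, chosen only so that neither range is clamped
example_ranges = [(100, 32000), (32000 + 12600, 32000 + 13200)]
print(range_lengths(example_ranges, content_length=1_000_000))
```

With these ranges, the first context is about 32,000 characters long and is the one likely to need splitting, while the second is only 600 characters.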


### Prepare sample prompts

In [5]:
guided_prompt = PromptTemplate(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    few_shot_prompt=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
        Context(
            context="""The Compute & Networking segment is comprised of our Data Center accelerated computing platforms and end-to-end networking platforms including Quantum for InfiniBand and Spectrum for Ethernet; our NVIDIA DRIVE automated-driving platform and automotive development agreements;""",
            question="What does the Compute & Networking segment include?",
            answer="""The Compute & Networking segment includes Data Center accelerated computing platforms, end-to-end networking platforms (Quantum for InfiniBand and Spectrum for Ethernet), the NVIDIA DRIVE automated-driving platform, and automotive development agreements.""",
        ),
])
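
To illustrate what a few-shot prompt of this shape asks the model to do, here is a minimal sketch that joins an instruction, worked examples, and a new context into a single string. It uses plain dictionaries, and `render_prompt` is a hypothetical function written for this illustration; how `uniflow` actually serializes a `PromptTemplate` may differ.

```python
def render_prompt(instruction, examples, new_context):
    """Join an instruction, worked examples, and a new context into one prompt."""
    parts = [instruction.strip()]
    for example in examples:
        parts.append(
            f"context: {example['context']}\n"
            f"question: {example['question']}\n"
            f"answer: {example['answer']}"
        )
    # The final entry has only a context; the model is expected to continue
    # with a question and an answer in the same format.
    parts.append(f"context: {new_context}")
    return "\n\n".join(parts)


# Illustrative input only, reusing one example from the template above
examples = [
    {
        "context": "In 1948, Claude E. Shannon published A Mathematical Theory of Communication.",
        "question": "Who published A Mathematical Theory of Communication in 1948?",
        "answer": "Claude E. Shannon.",
    }
]
prompt = render_prompt(
    "Generate one question and its corresponding answer based on the last context.",
    examples,
    "Text taken from the book goes here.",
)
print(prompt)
```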

### Use LLM to generate data

Here, we pass in our `guided_prompt` to the `TransformQAHuggingFaceJsonFormatConfig` to use our customized instructions and examples, instead of the `uniflow` default ones.

We also want to get the response in the `json` format instead of the `text` default, so we set the `response_format` to `json_object`.

Note that we include the `auto_split_long_text` parameter in the transform configuration. If a `Context` object contains text that exceeds the token length limit, it is automatically split into multiple `Context` objects, each holding a text segment that fits within the limit and can be passed to the model.

In [6]:
config = TransformQAHuggingFaceJsonFormatConfig(
    prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(
        batch_size=4,
        response_start_key="question", 
        response_format={"type": "json_object"},
    ),
    auto_split_long_text=True
)

client = TransformClient(config)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00,  1.60s/it]
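
To make the idea behind `auto_split_long_text` concrete, here is a minimal sketch of splitting a long text into chunks that each stay under a size limit. This is not the `uniflow` implementation: `split_long_text` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for model tokens, whereas a real splitter would use the model's tokenizer.

```python
def split_long_text(text, max_tokens=512):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Splitting on whitespace discards the original line breaks;
    # this keeps the sketch simple but is lossy for formatted text.
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# Synthetic input: 1200 identical words
long_text = "word " * 1200
chunks = split_long_text(long_text, max_tokens=512)
print([len(chunk.split()) for chunk in chunks])  # [512, 512, 176]
```

Each chunk could then be wrapped in its own `Context` object, which is the behavior the parameter is described as providing.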


Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

In [7]:
outputs = client.run(contexts)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:24<00:00, 24.94s/it]


### Display the output

In [8]:
from IPython.display import display, HTML

# Define a function to wrap text in HTML tags with style
def format_html_question_answer(context, question, answer):
    return f"""
    <div style="margin-bottom: 20px;">
        <hr/>
        <div style="background-color: #ffffff; padding: 10px; border-radius: 5px;">
            <b>Context:</b> {context} ...
        </div>
        <div style="background-color: #f0f0f0; padding: 10px; border-radius: 5px;">
            <b>Question:</b> {question}
        </div>
        <div style="background-color: #d9edf7; padding: 10px; border-radius: 5px; margin-top: 5px;">
            <b>Answer:</b> {answer}
        </div>
    </div>
    """

html_output = ""
for response in outputs[0]["output"][0]["response"]:
    # Each response is a dict; show only the first 500 characters of the context
    context = response['context'][:500]
    question = response['question']
    answer = response['answer']
    html_output += format_html_question_answer(context, question, answer)

# Display the formatted HTML
display(HTML(html_output))
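
Outside a notebook, the same fields can be printed as plain text. The sketch below assumes each response item is a dictionary with `context`, `question`, and `answer` keys, as accessed in the loop above. `format_text_question_answer` is a hypothetical helper, and the sample item reuses one of the few-shot examples rather than real model output.

```python
import textwrap


def format_text_question_answer(item, max_context_chars=500, width=80):
    """Format one question-answer item as three lines of plain text."""
    context = item["context"][:max_context_chars]
    lines = [
        # shorten() collapses whitespace and truncates to the given width
        "Context: " + textwrap.shorten(context, width=width, placeholder=" ..."),
        "Question: " + item["question"],
        "Answer: " + item["answer"],
    ]
    return "\n".join(lines)


# Sample item for illustration only, not real model output
sample_item = {
    "context": "In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation.",
    "question": "Who published A Mathematical Theory of Communication in 1948?",
    "answer": "Claude E. Shannon.",
}
text_output = format_text_question_answer(sample_item)
print(text_output)
```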

## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>