# Generate QAs based on the target PDF extracted from `Nougat`

This package uses `nougat`, a tool for extracting information from specific PDF files. It also includes a feature that uses OpenAI's language models to check the accuracy of this extracted information. Additionally, the package uses `uniflow` to create questions and answers based on the information taken from the PDFs.

### Before running the code

You will need to `uniflow` conda environment to run this notebook. You can set up the environment following the instruction: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see this [instruction](https://github.com/CambioML/uniflow/tree/main#api-keys)

Finally, we are storing the Nike 10K in the `data\raw_input` directory as "nike-10k-2023.pdf". You can download the file from [here](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf).

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip -q install transformers accelerate bitsandbytes scipy nougat-ocr

### Import Dependency

In [3]:
from dotenv import load_dotenv
from pprint import pprint
import os

from uniflow.flow.client import TransformClient
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.flow.config import TransformOpenAIConfig, TransformHuggingFaceConfig, HuggingfaceModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()

True

### Prepare the input data

First, we need to pre-process the PDF to get text chunks that we can feed into the model. We will use `Nougat`.

In [4]:
pdf_file = "amazon-10k-2023.pdf"

Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(f"{dir_cur}/data/raw_input/", pdf_file)
print(input_file)

/home/ubuntu/uniflow/example/transform/data/raw_input/amazon-10k-2023.pdf


In [6]:
base_name = os.path.splitext(pdf_file)[0]
output_directory = os.path.join(dir_cur, "data")
print(output_directory)

/home/ubuntu/uniflow/example/transform/data


Run `Nougat` model to process content of target PDF.

In [7]:
!nougat {input_file} -o {output_directory} -m 0.1.0-base --markdown --no-skipping 

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
INFO:root:Skipping amazon-10k-2023.pdf, already computed. Run with --recompute to convert again.


Below are helper functions designed to process the output of `Nougat`, ensuring that the output context is efficiently processed by the Hugging Face model.

#### Overview:
The `process_mmd_file` function is designed to process markdown files, particularly handling large sections and table content. It reads a markdown file, splits it into manageable sections, processes these sections to handle table content, and optionally utilizes OpenAI for further processing.

#### Inputs:
- `file_path`: A string representing the path to the markdown file to be processed.
- `client_openAI`: An object representing the OpenAI client, used for processing sections of the markdown file.

#### Workflow:
1. **Reading the File**: The function starts by reading the entire content of the markdown file specified by `file_path`.
2. **Initial Splitting**: The content is split into sections based on '##' headers. The first section is skipped if it's empty.
3. **Sub-Splitting for Large Sections**: Sections larger than a predefined word count (`max_word_count`) are further split using '###' headers.
4. **Processing for Table Content**: Each section is processed for table content if its word count exceeds `max_word_count_for_table`. This involves reducing the word count while preserving essential information.
5. **Word Count Reduction Check**: After processing, if the word count of a section is reduced below a certain threshold (`reduction_threshold`), the section is further processed using the OpenAI client.
6. **Compilation of Processed Sections**: All processed sections that are not empty are compiled into a list.
7. **Statistics**: The function prints the number of sections that were further split and the number of sections that were significantly reduced in word count.
  
#### Output:
- Returns a list of strings, where each string is a processed section of the original markdown file. This list represents the cleaned and potentially AI-processed sections of the markdown content.

#### Note
- We've observed that some text chunks, post table syntax removal processing, contain only headers. To enhance the relevance of the output, you can eliminate these header-only chunks by setting a minimum length requirement for each chunk.

In [8]:
def process_mmd_file(file_path, client_openAI):
    with open(file_path, 'r') as file:
        content = file.read()

    # Constants and counters
    max_word_count_for_table = 25
    max_word_count = 4096
    reduction_threshold = 0.30
    further_splitted_count = 0
    significantly_reduced_count = 0

    # Splitting the content
    sections = content.split('##')
    intermediate_sections = []

    for i, section in enumerate(sections):
        if i == 0 and not section.strip():
            continue

        # Add '##' back to the section header
        if not section.lstrip().startswith('#'):
            section = '##' + section

        # Split large sections using '###'
        if len(section.split()) > max_word_count:
            sub_sections = section.split('###')
            for sub_section in sub_sections:
                if len(sub_section.split()) > max_word_count:
                    further_splitted_sub_sections = split_large_section(sub_section, max_word_count)
                    further_splitted_count += len(further_splitted_sub_sections) - 1
                    intermediate_sections.extend(further_splitted_sub_sections)
                else:
                    intermediate_sections.append(sub_section)
        else:
            intermediate_sections.append(section)

    # Process each section for table content and check word count reduction
    cleaned_sections = []
    for section in intermediate_sections:
        original_word_count = len(section.split())
        processed_section = process_for_table_content(section, max_word_count_for_table)

        # Calculate word count reduction
        processed_word_count = len(processed_section.split())
        if processed_word_count == 0 or processed_word_count / original_word_count < reduction_threshold:
            significantly_reduced_count += 1
            # Use OpenAI-based processing for sections that are significantly reduced
            temp_processed_section = clean_text_from_table_syntax_with_openAI(section, client_openAI)
            if temp_processed_section:
                processed_section = temp_processed_section

        if processed_section:
            cleaned_sections.append(processed_section)

    print(f"Number of chunks further split: {further_splitted_count}")
    print(f"Number of significantly reduced chunks: {significantly_reduced_count}")

    return cleaned_sections

#### Overview:
The `split_large_section` function is designed to split a large text section into smaller chunks based on a specified maximum word count. This function is particularly useful for processing large blocks of text that need to be broken down for readability or specific processing requirements.

#### Inputs:
- `section`: A string representing the text section to be split.
- `max_word_count`: An integer specifying the maximum word count for each chunk.

#### Output:
- Returns a list of strings, where each string represents a chunk of the original section. Each chunk contains words up to the specified `max_word_count`, ensuring no chunk exceeds this limit.

In [9]:
def split_large_section(section, max_word_count):
    words = section.split()
    chunks = []
    current_chunk = []

    for word in words:
        current_chunk.append(word)
        if len(' '.join(current_chunk)) > max_word_count:
            chunks.append(' '.join(current_chunk[:-1]))
            current_chunk = [word]

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

#### Overview:
The `process_for_table_content` function is designed to filter and process text sections, specifically targeting content structured like tables. It aims to retain meaningful content while considering a maximum word count for each processed chunk.

#### Inputs:
- `section`: A string representing the text section to be processed. This section typically contains markdown content.
- `max_word_count_for_table`: An integer specifying the maximum word count for each chunk within a table-like structure.

#### Output:
- Returns a string that represents the processed section. This string is composed of filtered lines that meet the criteria of having an appropriate word count and not being markdown headers.

In [10]:
def process_for_table_content(section, max_word_count_for_table):
    lines = [line for line in section.split('\n') if line.strip() and not line.strip().startswith('##') and not line.strip().startswith('###')]
    filtered_lines = []
    i = 0

    while i < len(lines):
        end_index = min(i + 4, len(lines))
        word_count = sum(len(line.split()) for line in lines[i:end_index])

        if word_count >= max_word_count_for_table or end_index - i < 4:
            filtered_lines.extend(lines[i:end_index])
        i = end_index

    return '\n'.join(filtered_lines).strip()

#### Overview:
The `clean_text_from_table_syntax_with_openAI` function is designed to process a text chunk, particularly focusing on cleaning and formatting text from table-like syntax, using the OpenAI API for advanced processing. This function is ideal for refining and simplifying complex text structures.

#### Inputs:
- `text_chunk`: A string representing the text chunk to be processed. It is expected to be potentially complex or table-like in structure.
- `client_openAI`: An OpenAI client object used to process the text chunk.

#### Output:
- Returns the cleaned and processed text as a string if a valid 'cleaned_context' is extracted from the OpenAI client's response.
- Returns an empty list if the input is invalid, or if the necessary data isn't found in the OpenAI response.

In [11]:
def clean_text_from_table_syntax_with_openAI(text_chunk, client_openAI):
    # Validate input
    if not isinstance(text_chunk, str):
        return []

    input_data = [Context(context=text_chunk)]
    output_openAI = client_openAI.run(input_data)

    # Check if 'output' is in the first item of the output_openAI list
    if isinstance(output_openAI, list) and len(output_openAI) > 0 and 'output' in output_openAI[0]:
        first_output = output_openAI[0]['output']

        # Check if first_output is a list and not empty
        if isinstance(first_output, list) and len(first_output) > 0:
            first_response = first_output[0]

            # Check if 'response' is in the first_response and it's not empty
            if isinstance(first_response, dict) and 'response' in first_response and isinstance(first_response['response'], list) and len(first_response['response']) > 0:
                first_responses = first_response['response'][0]

                # Check if 'responses' is in first_responses and it has at least two elements
                if isinstance(first_responses, dict) and 'responses' in first_responses and isinstance(first_responses['responses'], list) and len(first_responses['responses']) > 1:
                    cleaned_context = first_responses['responses'][1].get('cleaned_context')

                    # Check if cleaned_context is not None
                    if cleaned_context is not None:
                        return cleaned_context

    return []  # Return an empty list if the conditions are not met

Print the location of the output file.

In [12]:
base_name = os.path.splitext(pdf_file)[0]
output_file = os.path.join(output_directory, f"{base_name}.mmd")
print(output_file)

/home/ubuntu/uniflow/example/transform/data/amazon-10k-2023.mmd


Create OpenAI client instance from `uniflow`, for further usage of `process_mmd_file` function

In [13]:
guided_prompt_openAI = PromptTemplate(
instruction="""Revise the original text, focusing on fully retaining the core textual content while removing elements resembling table 
syntax, including lines with a single number and a sign. Preserve headers like '##' and '###' in markdown format. Follow the format of the 
examples below to include original_context and cleaned_context in the response, under the 'responses' key in the JSON object.""",   
few_shot_prompt=[
    Context(
        original_context="Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. \[\text{NON-U.S. RETAIL STORES}\] Shannon introduced the concept of\ninformation entropy for the first time. \[\frac{\text{$}}{\text{$}}\]. \n21%\n507\n25%\n25%\n",
        cleaned_context="Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. Shannon introduced the concept of\ninformation entropy for the first time.",
    ),
])

config_openAI = TransformOpenAIConfig(
    prompt_template=guided_prompt_openAI,
    model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
)

client_openAI = TransformClient(config_openAI)

Number of chunks split by the processed output using the helper function above.

In [14]:
page_contents = process_mmd_file(output_file, client_openAI)
print(len(page_contents))

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

INFO [abs_llm_processor]: Attempt 1 failed, retrying...
INFO [abs_llm_processor]: Attempt 2 failed, retrying...
INFO [abs_llm_processor]: Attempt 3 failed, retrying...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Number of chunks further split: 12
Number of significantly reduced chunks: 51
95


### Prepare sample prompts

First, we need to demonstrate sample prompts for LLM, those include instruction and sample json format. We do this by giving a sample instruction and list of `Context` examples to the `PromptTemplate` class.

In [15]:
sample_instruction = """Generate one question and its corresponding answer based on the context. Following \
the format of the examples below to include only question and answer in the response with reasonable length."""

sample_examples = [
        Context(
            context="The quick brown fox jumps over the lazy dog.",
            question="What is the color of the fox?",
            answer="brown."
        ),
        Context(
            context="The quick brown fox jumps over the lazy black dog.",
            question="What is the color of the dog?",
            answer="black."
        )]

guided_prompt = PromptTemplate(
    instruction=sample_instruction,
    few_shot_prompt=sample_examples
)

Next, for the given page_contents above, we convert them to the Context class to be processed by uniflow.

In [16]:
data = [ Context(context=p[:800], summary="") for p in page_contents[8:18] if len(p) > 200 ]
data

[Context(context='Our businesses encompass a large variety of product types, service offerings, and delivery channels. The worldwide marketplace in which we compete is evolving rapidly and intensely competitive, and we face a broad array of competitors from many different industry sectors around the world. Our current and potential competitors include: (1) physical, e-commerce, and omnichannel retailers, publishers, vendors, distributors, manufacturers, and producers of the products we offer and sell to consumers and businesses; (2) publishers, producers, and distributors of physical, digital, and interactive media of all types and all distribution channels; (3) web search engines, comparison shopping websites, social networks, web portals, and other online and app-based means of discovering, using, or acqu', summary=''),
 Context(context='We regard our trademarks, service marks, copyrights, patents, domain names, trade dress, trade secrets, proprietary technologies, and similar intell

### Use LLM to generate data

In this example, we will use the [HuggingfaceModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L39)'s default LLM to generate questions and answers. Let's import the config and client of this model.

Here, we pass in our `guided_prompt` to the `HuggingfaceConfig` to use our customized instructions and examples, instead of the `uniflow` default ones.

In [17]:
config = TransformHuggingFaceConfig(
    prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(batch_size=128))
client = TransformClient(config)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Now we call the run method on the client object to execute the question-answer generation operation on the data shown above.

In [18]:
output = client.run(data)

  0%|          | 0/1 [00:00<?, ?it/s]



### Process the output

Let's take a look of the generated output. We need to do a little postprocessing on the raw output.

In [19]:
import re

keywords = ["context:", "question:", "answer:"]
pattern = '|'.join(map(re.escape, keywords))

o = output[0]['output'][0]['response'][0] ## we only postprocess the first output
segments = [segment for segment in re.split(pattern, o) if segment.strip()]
result = {
    "context": segments[-3].rstrip("summary:   "),
    "question": segments[-2],
    "answer": segments[-1]
}

pprint(result, sort_dicts=False)

{'context': ' Our businesses encompass a large variety of product types, '
            'service offerings, and delivery channels. The worldwide '
            'marketplace in which we compete is evolving rapidly and intensely '
            'competitive, and we face a broad array of competitors from many '
            'different industry sectors around the world. Our current and '
            'potential competitors include: (1) physical, e-commerce, and '
            'omnichannel retailers, publishers, vendors, distributors, '
            'manufacturers, and producers of the products we offer and sell to '
            'consumers and businesses; (2) publishers, producers, and '
            'distributors of physical, digital, and interactive media of all '
            'types and all distribution channels; (3) web search engines, '
            'comparison shopping websites, social networks, web portals, and '
            'other online and app-based means of discovering, using, or acqu\n',
 

Congrats! Your question answers from the given knowledge context are generated!

## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>