# Build Your Own 10K Agent: Transform an Unstructured Financial Report into a Fine-tuned LLM

# Context Summarization from Annual Reports

Are you looking to transform dense annual reports (10-K filings) into concise, informative summaries? This guide uses Uniflow to extract text from unstructured annual reports and summarize it with a Large Language Model (LLM).

---

## 🚀 Process Overview

We simplify the process into a single, streamlined step:

- **Summary Extraction**:
  - Utilize Uniflow for parsing and summarizing PDF-formatted annual reports.
  - **Example Reports**: Nike, Amazon, and Alphabet serve as our case studies for demonstration.

---

## 📋 Prerequisites

> **Important**: A GPU is necessary for running Reinforcement Learning from Human Feedback (RLHF), a crucial component of the fine-tuning process.

### Setup Instructions

Ensure your environment is ready by following these steps:

#### Conda Environment

- Create a conda environment specific to this project. Follow the provided setup instructions for guidance.

#### Installation of Dependencies

Install necessary libraries within your environment:

```bash
pip3 install uniflow
pip3 install "pykoi[huggingface, rag, rlhf]"
```

For GPU support, ensure the correct version of torch is installed:

```bash
pip3 uninstall torch
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
```

#### OpenAI API Key Configuration

Obtain an OpenAI API key and set it as the `OPENAI_API_KEY` environment variable in a `.env` file at the root directory of this project. Refer to the provided [instructions](https://github.com/CambioML/cambio-cookbook/tree/main#api-keys) for more detail.
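
As a quick sanity check before running the notebook, the sketch below verifies that the key is visible to Python. It uses only the standard library, and the helper name `require_env` is hypothetical (it is not part of Uniflow). In the notebook itself, `load_dotenv()` is what reads the `.env` file into the environment.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demonstration with an explicit mapping, so no real key is needed here.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

Calling `require_env("OPENAI_API_KEY")` without the second argument checks the real process environment.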

---

## 🌟 Getting Started

With your environment prepared and dependencies in place, you're ready to dive into the notebook. Follow the guide to start extracting and summarizing information from annual reports. This tool is designed to streamline the analysis of complex financial documents, making your review process more efficient and insightful.

### Update System Path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

If you already have these installed, feel free to skip this step.

In [2]:
!{sys.executable} -m pip install -q pandas nougat-ocr

### Import dependencies

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd

from uniflow.pipeline import MultiFlowsPipeline
from uniflow.flow.client import TransformClient, ExtractClient
from uniflow.flow.config import PipelineConfig
from uniflow.flow.config import TransformOpenAIConfig, ExtractPDFConfig
from uniflow.flow.config import OpenAIModelConfig, NougatModelConfig
from uniflow.flow.config import TransformHuggingFaceConfig, HuggingfaceModelConfig, TransformQAHuggingFaceJsonFormatConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()



True

### 1. Prepare the input data

Uncomment the line for the 10-K report that you want to use.

In [4]:
pdf_file = "amazon-10k-2023.pdf"
# pdf_file = "nike-10k-2023.pdf"
# pdf_file = "alphabet-10k-2023.pdf"

Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)
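
Because the extraction step that follows can take several minutes, it may be worth confirming that the PDF is where the path above expects it. The sketch below is one way to do that with `pathlib`; the helper name `resolve_input_file` is hypothetical, and the demonstration uses a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path


def resolve_input_file(base_dir, pdf_file):
    """Build <base_dir>/data/raw_input/<pdf_file> and check that the file exists."""
    path = Path(base_dir) / "data" / "raw_input" / pdf_file
    if not path.is_file():
        raise FileNotFoundError(f"Expected the 10-K PDF at {path}")
    return str(path)


# Demonstration in a throwaway directory that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw_input"
    raw_dir.mkdir(parents=True)
    (raw_dir / "example.pdf").write_bytes(b"%PDF-1.4")
    print(resolve_input_file(tmp, "example.pdf"))
```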

### 2. Load the PDF using Nougat

For this example, we'll run the ExtractPDF flow to extract the text from the 10-K PDF. This flow uses the Nougat PDF parser.

In [6]:
pdf_directory = [
    {"pdf": input_file},
]

extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        batch_size=128,  # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
    )
)

nougat_client = ExtractClient(extract_config)

pdf_output = nougat_client.run(pdf_directory)
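
The comment in the cell above notes that a `batch_size` greater than 1 runs Nougat on CUDA, while a value of 1 runs it on CPU. A minimal sketch of choosing the value automatically is shown below, assuming PyTorch is the GPU backend. Both helper names are hypothetical, and the fallback value of 1 simply follows that comment.

```python
import importlib.util


def cuda_available():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


def choose_batch_size(has_cuda, gpu_batch_size=128):
    """Use a large batch on GPU and fall back to 1 on CPU."""
    return gpu_batch_size if has_cuda else 1


print(choose_batch_size(False))  # 1
print(choose_batch_size(True))   # 128
```

The result of `choose_batch_size(cuda_available())` could then be passed as `batch_size` to `NougatModelConfig`.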



### 3. Pre-process data extracted from pdf

Convert the list of strings into a single string to enable efficient processing by the helper function below.

In [7]:
combined_pdf_output = '\n'.join(pdf_output[0]['output'][0]['text'])
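
The indexing above implies that `pdf_output` is a list with one entry per input, each holding an `output` list whose items carry a `text` list of strings. The sketch below reproduces the join on a mock value of that assumed shape. The sample strings are invented for illustration, and `combine_pages` is a hypothetical helper.

```python
# Mock of the structure implied by pdf_output[0]['output'][0]['text'].
mock_pdf_output = [
    {"output": [{"text": ["## Item 1. Business", "We serve consumers through our stores."]}]}
]


def combine_pages(pdf_output):
    """Join the extracted text fragments of the first document with newlines."""
    return "\n".join(pdf_output[0]["output"][0]["text"])


combined = combine_pages(mock_pdf_output)
print(combined)
```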

Import the helper function `process_content` to pre-process the content extracted from the target PDF. If you are interested in the details of the function, you can find it in the directory `./helper_func/nougat_helper_function`.

In [8]:
from helper_func.nougat_helper_function import process_content

Create an OpenAI client instance from Uniflow, which the `process_content` function uses to clean the extracted text.

In [9]:
guided_prompt_openAI = PromptTemplate(
    instruction="""Revise the original text, focusing on fully retaining the core textual content while removing elements resembling table
syntax, including lines with a single number and a sign. Preserve headers like '##' and '###' in markdown format. Follow the format of the
examples below to include original_context and cleaned_context in the response, under the 'responses' key in the JSON object.""",
    few_shot_prompt=[
        Context(
            # Backslashes in the LaTeX fragments are doubled so Python does not read "\t" and "\f" as tab and form feed
            original_context="Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. \\[\\text{NON-U.S. RETAIL STORES}\\] Shannon introduced the concept of\ninformation entropy for the first time. \\[\\frac{\\text{$}}{\\text{$}}\\]. \n21%\n507\n25%\n25%\n",
            cleaned_context="Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. Shannon introduced the concept of\ninformation entropy for the first time.",
        ),
    ],
)

config_openAI = TransformOpenAIConfig(
    prompt_template=guided_prompt_openAI,
    model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
)

client_openAI = TransformClient(config_openAI)

Next, refine the extracted PDF content with the `process_content` function, then wrap each chunk in a `Context` object so it can be passed to the LLM for summary generation.

In [10]:
preprocessed_combined_pdf_output = process_content(combined_pdf_output, client_openAI)
# Take chunks 6-15, skip those of 200 characters or fewer, and truncate each to 800 characters
data = [Context(context=p[:800], summary="") for p in preprocessed_combined_pdf_output[6:16] if len(p) > 200]


Number of chunks further split: 0
Number of significantly reduced chunks: 22
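
The list comprehension in the cell above does three things at once: it takes a slice of the pre-processed chunks, drops the short ones, and truncates the rest. The sketch below isolates that logic on plain strings so the effect of each threshold is easy to see. The helper name `select_chunks` and the toy input are hypothetical.

```python
def select_chunks(chunks, start=6, stop=16, min_len=200, max_len=800):
    """Keep chunks[start:stop] longer than min_len, truncated to max_len characters."""
    return [c[:max_len] for c in chunks[start:stop] if len(c) > min_len]


# Toy input: 20 chunks whose lengths alternate between 100 and 1000 characters.
toy_chunks = ["x" * (1000 if i % 2 else 100) for i in range(20)]
selected = select_chunks(toy_chunks)
print(len(selected), {len(c) for c in selected})  # 5 {800}
```

Only the five odd-indexed chunks inside the slice survive the length filter, and each is cut to 800 characters.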





### 4. Use an LLM to generate data

Craft a prompt that generates a one-sentence summary of a given paragraph. The prompt includes an instruction and a few-shot example featuring "context" and "summary". Then, initialize a `TransformClient` with a `TransformHuggingFaceConfig` so the prepared data can be passed to it.

In [11]:
guided_prompt = PromptTemplate(
    instruction="Generate a one sentence summary based on the last context below. Follow the format of the examples below to include original context and its summary in the response",
    few_shot_prompt=[
        Context(
            context="When you're operating on the maker's schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in. Plus you have to remember to go to the meeting. That's no problem for someone on the manager's schedule. There's always something coming on the next hour; the only question is what. But when someone on the maker's schedule has a meeting, they have to think about it.",
            summary="Meetings disrupt the productivity of those following a maker's schedule, dividing their time into impractical segments, while those on a manager's schedule are accustomed to a continuous flow of tasks.",
        ),
    ],
)

huggingface_config = TransformHuggingFaceConfig(
    prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(batch_size=128),
)

huggingface_client = TransformClient(huggingface_config)
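
To build intuition for how the instruction, the few-shot example, and a new context fit together, the sketch below flattens them into one `key: value` text prompt. This layout is an assumption made for illustration and is not taken from Uniflow's source. It is consistent with the post-processing later in this notebook, which searches the response for `context: ` and `summary: ` markers. The function `render_prompt` and the shortened sample strings are hypothetical.

```python
def render_prompt(instruction, few_shot, item):
    """Flatten an instruction, few-shot examples, and one new item into a text prompt."""
    lines = [f"instruction: {instruction}"]
    for example in few_shot + [item]:
        for key, value in example.items():
            lines.append(f"{key}: {value}")
    return "\n".join(lines)


prompt = render_prompt(
    "Generate a one sentence summary based on the last context below.",
    [{"context": "Meetings break a maker's afternoon in two.", "summary": "Meetings disrupt makers."}],
    {"context": "Net sales increased in 2023.", "summary": ""},
)
print(prompt)
```

The prompt ends with an empty `summary: ` field, which is the slot the model is expected to complete.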



Feed the data into the client and await the generated results.

In [12]:
huggingface_output = huggingface_client.run(data)



### 5. Process the output

Let's take a look at the generated output. The raw output needs a little post-processing first.

In [13]:
from IPython.display import display, HTML

# Function to escape LaTeX special characters for HTML display
def escape_latex_for_html(text):
    return (text.replace("\\", "\\\\")  # Escape backslashes
                .replace("{", "{{")     # Double curly braces
                .replace("}", "}}")     # are placeholders in `str.format()`
                .replace("$", "\\$"))   # Escape dollar signs

# Iterate over each response item
for i, item in enumerate(huggingface_output[0]['output'][0]['response']):
    # Find the last occurrences of "context:" and "summary:"
    last_context_index = item.rfind("context: ")
    last_summary_index = item.rfind("summary: ")

    # Extract the last context and summary from the item
    last_context = escape_latex_for_html(item[last_context_index + len("context: "):last_summary_index].strip())
    last_summary = escape_latex_for_html(item[last_summary_index + len("summary: "):].strip())

    # Display the last context and summary pair with different colors and an index
    display(HTML(f"<div><strong>Pair #{i+1}</strong></div>"
                 f"<div style='margin-bottom: 20px;'><strong style='color: #000000;'>Context:</strong>"
                 f"<p style='background-color: #FFFFFF; padding: 10px;'>{last_context}</p></div>"
                 f"<div style='background-color: #f5f5f5; padding: 10px;'><strong style='color: #000000;'>Summary:</strong>"
                 f"<p>{last_summary}</p></div><hr/>"))
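
The loop above relies on `str.rfind`, which returns `-1` when a marker is missing, so a response without `summary: ` would be sliced at the wrong positions without raising an error. The sketch below is a slightly more defensive variant of the same extraction. The function name `split_last_pair` and the sample response are hypothetical.

```python
def split_last_pair(response, context_tag="context: ", summary_tag="summary: "):
    """Return (context, summary) from the last tagged pair in a response string.

    Missing tags yield empty strings instead of a silently wrong slice.
    """
    summary_at = response.rfind(summary_tag)
    if summary_at == -1:
        return "", ""
    summary = response[summary_at + len(summary_tag):].strip()
    # Only look for the context marker before the last summary marker.
    context_at = response.rfind(context_tag, 0, summary_at)
    if context_at == -1:
        return "", summary
    context = response[context_at + len(context_tag):summary_at].strip()
    return context, summary


sample = (
    "context: first example\nsummary: first summary\n"
    "context: Net sales increased.\nsummary: Sales grew."
)
print(split_last_pair(sample))  # ('Net sales increased.', 'Sales grew.')
```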

Congratulations! You have now generated a summary for each context you provided.

## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>