## Build Your Personal Guide: From Dense PDFs to Accessible Q&A

### Overview

It's that time of the year again—tax season! Are you finding yourself swamped with piles of tax filing information and feeling a bit overwhelmed? Why not give our AI assistant a try? Simply feed it the tax guide, and voilà, it's ready to answer all your related queries with ease.

Today, we will show you how Uniflow turns hard-to-read PDF documents into a responsive assistant that can answer the questions you have about them.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following these [instructions](https://github.com/CambioML/cambio-cookbook/tree/main#installation).

For more details, see this [instruction](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/tree/main?tab=readme-ov-file#huggingfacemodelconfig).

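Finally, we are storing the IRS dataset in the `data\raw_input` directory as "IRS_2023.pdf". You can download the file from [here](https://www.irs.gov/forms-pubs/guide-to-business-expense-resources). Note that the path-building cell in this notebook looks for the file in the current working directory, so either place the PDF there or adjust the path.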

### Update System Path

In [None]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

If you already have these installed, feel free to skip this step.

In [38]:
!{sys.executable} -m pip install -q langchain pandas pypdf

### Import Dependency

In [39]:
import os

import pandas as pd

from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import ExtractPDFConfig, TransformOpenAIConfig
from uniflow.op.extract.split.constants import PARAGRAPH_SPLITTER
from uniflow.op.model.model_config import NougatModelConfig, OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

### Prepare Input data

First, let's get the current directory, build the path to the input PDF, and wrap it in the list of records that the extract client expects.

In [41]:
dir_cur = os.getcwd()
pdf_file = "IRS_2023.pdf"
input_file = os.path.join(dir_cur, pdf_file)

data = [
    {"filename": input_file},
]
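
Because the extraction step takes a while, it can help to confirm that the PDF is where the path says it is before building `data`. The sketch below is one way to do that; `check_pdf` is a hypothetical helper (not part of uniflow), and the demo writes a throwaway file so it runs anywhere.

```python
import tempfile
from pathlib import Path


def check_pdf(path: str) -> Path:
    """Hypothetical helper: fail early if the input PDF is missing."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"Input PDF not found: {pdf_path}")
    if pdf_path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {pdf_path.name}")
    return pdf_path


# Demo on a temporary file; in the notebook you would pass `input_file`.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_pdf = Path(tmp_dir) / "IRS_2023.pdf"
    demo_pdf.write_bytes(b"%PDF-1.4\n")
    checked = check_pdf(str(demo_pdf))
    data_demo = [{"filename": str(checked)}]
```

In the notebook itself, calling `check_pdf(input_file)` right before building `data` would surface a wrong path immediately instead of partway through extraction.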

### Extract PDF using Nougat

For this example, we'll run the `ExtractPDF` flow to extract the text from the IRS PDF. This flow uses the [Nougat](https://pypi.org/project/nougat-ocr/0.1.17/) PDF parser.

#### Create extract_config

In [42]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        batch_size=4,  # When batch_size > 1, Nougat runs on CUDA; otherwise it runs on CPU
    ),
    splitter=PARAGRAPH_SPLITTER
)
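
Going by the comment in the cell above, a `batch_size` greater than 1 implies a CUDA device. If the notebook may run on machines without a GPU, the value could be chosen at runtime. This is a minimal sketch, assuming PyTorch is the backend that reports CUDA availability; `pick_batch_size` is a hypothetical helper, not a uniflow function.

```python
def pick_batch_size(gpu_batch_size: int = 4) -> int:
    """Hypothetical helper: use gpu_batch_size only when CUDA is visible."""
    try:
        import torch  # assumption: PyTorch is installed alongside Nougat
    except ImportError:
        return 1
    return gpu_batch_size if torch.cuda.is_available() else 1


batch_size_demo = pick_batch_size(4)
print("batch_size:", batch_size_demo)
```

The result could then be passed as `NougatModelConfig(batch_size=batch_size_demo)`.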

#### Create extract_client

In [44]:
extract_client = ExtractClient(extract_config)

#### Run extract_client

In [45]:
extract_output = extract_client.run(data)


### Generating Questions

#### Prepare sample prompts

First, we need to tell the LLM what to do. We do this by passing a sample instruction to the `PromptTemplate` class.

In [46]:
sample_instruction = """Assume you are an expert on tax. Please generate as many questions as possible based on the context.
Make sure those questions can cover any question people can think of by reading the context."""

guided_prompt = PromptTemplate(instruction=sample_instruction)

In [47]:
input_context = [Context(context=ctx) for ctx in extract_output[0]["output"][0]["text"]]

print("sample size of processed input data: ", len(input_context))

input_context[:2]

sample size of processed input data:  1554


[Context(context='**Publication 535**'),
 Context(context='**Publication 535**')]
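
The first two chunks printed above are identical, which suggests that repeated page headers survive the split. Since every chunk becomes one LLM call, it may be worth dropping blank and repeated chunks first. The sketch below shows the idea on plain strings; `dedupe_texts` is a hypothetical helper and the sample list is made up for illustration.

```python
def dedupe_texts(texts):
    """Hypothetical helper: drop blank and repeated chunks, keep first-seen order."""
    seen = set()
    unique = []
    for text in texts:
        key = text.strip()
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(text)
    return unique


# Made-up sample; in the notebook this would be the list of extracted chunks.
sample_chunks = ["**Publication 535**", "**Publication 535**", "", "Business Expenses"]
unique_chunks = dedupe_texts(sample_chunks)
print(len(sample_chunks), "->", len(unique_chunks))
```

Whether deduplication is appropriate depends on the document: a repeated header carries little information, but a repeated paragraph might be intentional.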

#### Use LLM to generate data

In this example, we will use an OpenAI model through `TransformOpenAIConfig` to generate the questions. Both the config and the client were imported above.

Here, we pass our `guided_prompt` to `TransformOpenAIConfig` so that it uses our customized instructions instead of the uniflow defaults.

We also want the response in JSON format instead of the default text, so we set `response_format` to `{"type": "json_object"}`.

You can update the `batch_size` based on the size of the data.

In [48]:
question_config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(
        response_format={"type": "json_object"},
    ),
)
question_client = TransformClient(question_config)

Now we call the `run` method on the `question_client` object to execute the question generation operation on the data shown above.

In [49]:
output_question = question_client.run(input_context)


#### Process the output

Let's take a look at the generated output. The raw output needs a little postprocessing first.

In [68]:
contexts = []
questions_list = []

for index, output in enumerate(output_question):
    current_context = input_context[index]
    for item in output.get("output", []):
        for response in item.get("response", []):
            for question in response.get("questions", []):
                contexts.append(current_context)
                questions_list.append(question)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions_list
})
df = df[df["Question"].str.strip().astype(bool)]

pd.set_option("display.max_colwidth", 1000)
pd.set_option("display.width", 1000)

df

| | Context | Question |
|---|---|---|
| 0 | `context='**Publication 535**'` | What is Publication 535? |
| 1 | `context='**Publication 535**'` | What topics are covered in Publication 535? |
| 2 | `context='**Publication 535**'` | Is Publication 535 related to individual or business taxes? |
| 3 | `context='**Publication 535**'` | Where can I find Publication 535? |
| 4 | `context='**Publication 535**'` | Is Publication 535 available in different languages? |
| ... | ... | ... |
| 15813 | `context='**Cry323N 53**'` | Are there any tax planning strategies related to Cry323N 53? |
| 15814 | `context='**Cry323N 53**'` | What are the potential tax consequences of investing in Cry323N 53? |
| 15815 | `context='**Cry323N 53**'` | Are there any tax incentives or exemptions related to Cry323N 53? |
| 15816 | `context='**Cry323N 53**'` | What is the tax treatment of gains or losses from Cry323N 53? |
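
To make the flattening step easier to follow, here is the same traversal on a tiny stand-in for `output_question`. The nesting (`output` → `response` → `questions`) is inferred from the loop in the cell above and is an assumption about the shape, not captured uniflow output; the strings are placeholders.

```python
# Hypothetical stand-ins for `input_context` and `output_question`.
mock_contexts = ["chunk A", "chunk B"]
mock_output = [
    {"output": [{"response": [{"questions": ["Q1?", "Q2?"]}]}]},
    {"output": [{"response": [{"questions": ["Q3?"]}]}]},
]

# One (context, question) row per generated question, paired by position.
rows = [
    (mock_contexts[index], question)
    for index, output in enumerate(mock_output)
    for item in output.get("output", [])
    for response in item.get("response", [])
    for question in response.get("questions", [])
]
print(rows)
```

Each context is repeated once per question it produced, which is why the DataFrame has many more rows than there are input chunks.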


If you want to save the output and run the rest of the generation later, you can pickle the DataFrame here and load it back later with `pd.read_pickle`.

In [None]:
df.to_pickle('my_dataframe.pkl')
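
For the retrieval side, a save-and-reload round trip might look like the sketch below. It uses a small stand-in DataFrame and a temporary directory so it runs on its own; in the notebook you would simply call `pd.read_pickle('my_dataframe.pkl')`.

```python
import os
import tempfile

import pandas as pd

# Stand-in data; the real DataFrame holds the generated questions.
demo_df = pd.DataFrame({"Context": ["c1", "c2"], "Question": ["q1?", "q2?"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "my_dataframe.pkl")
    demo_df.to_pickle(pkl_path)           # save
    restored_df = pd.read_pickle(pkl_path)  # reload in a later session

print(restored_df.equals(demo_df))
```

Pickle stores Python objects by reference to their classes, so if the `Context` column holds uniflow `Context` objects, `uniflow` needs to be importable in the session that reloads the file.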

### Generating Answers

#### Prepare sample prompts

We need to create the prompt and instruction for answer generation.

In [63]:
answer_instruction = """
Based on the context provided, generate an answer that directly addresses the question. Start your response with the question number followed by a period and a space. For example, if the question is number 1, begin your answer with '1. ' followed by the response.
"""


answer_prompt = PromptTemplate(instruction=answer_instruction)

print("answer_instruction:")
print(answer_instruction, "\n")

answer_instruction:

Based on the context provided, generate an answer that directly addresses the question. Start your response with the question number followed by a period and a space. For example, if the question is number 1, begin your answer with '1. ' followed by the response.
 



#### TransformConfig for answer

In [65]:
answer_config = TransformOpenAIConfig(
    prompt_template=answer_prompt,
    model_config=OpenAIModelConfig(
        response_format={"type": "json_object"},
    ),
)
answer_client = TransformClient(answer_config)

#### Format data to feed into `answer_client`

In [66]:
input_question = [
    Context(
        context=row["Context"],
        question=row["Question"],
    )
    for index, row in df.iterrows()
]

print("sample size of processed input data: ", len(input_question))

input_question[:2]

sample size of processed input data:  15818


[Context(context=Context(context='**Publication 535**'), question='What is Publication 535?'),
 Context(context=Context(context='**Publication 535**'), question='What topics are covered in Publication 535?')]

#### `run` the `answer_client`

In [70]:
output_answer = answer_client.run(input_question)

#### Process the output

In [None]:
contexts = []
questions = []
answers = []

for output in output_answer:
    for item in output.get("output", []):
        for response in item.get("response", []):
            # Each response is parsed as newline-separated "key: value" lines.
            response_dict = {}
            last_key = None

            for part in response.split("\n"):
                if not part:
                    continue
                if ":" in part:
                    key, value = part.split(":", 1)
                    key, value = key.strip(), value.strip()
                    if key not in response_dict:
                        response_dict[key] = value
                    else:
                        print(f"duplicate key ignored: {key}")
                    last_key = key
                elif last_key is not None:
                    # A line without a colon continues the previous value.
                    response_dict[last_key] += " " + part

            # Skip responses that are missing any of the expected fields.
            if any(
                key not in response_dict
                for key in ["instruction", "context", "question", "answer"]
            ):
                continue

            contexts.append(response_dict["context"])
            questions.append(response_dict["question"])
            answers.append(response_dict["answer"])
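
The `key: value` parsing in the loop above can be tried on a single string to see what it produces. This is a minimal sketch: `parse_key_value_response` is a hypothetical helper that mirrors that logic, and the sample response is invented for illustration rather than taken from a real model run.

```python
def parse_key_value_response(response: str) -> dict:
    """Hypothetical helper: parse 'key: value' lines into a dict.

    Lines without a colon are appended to the most recent key's value.
    """
    parsed = {}
    last_key = None
    for line in response.split("\n"):
        if not line:
            continue
        if ":" in line:
            key, value = line.split(":", 1)
            key, value = key.strip(), value.strip()
            parsed.setdefault(key, value)  # keep the first value on duplicates
            last_key = key
        elif last_key is not None:
            parsed[last_key] += " " + line
    return parsed


# Invented sample response in the assumed plain-text layout.
sample_response = (
    "instruction: Answer the question.\n"
    "context: **Publication 535**\n"
    "question: What is Publication 535?\n"
    "answer: 1. It is the document named\n"
    "in the context."
)
parsed = parse_key_value_response(sample_response)
print(parsed["answer"])
```

One limitation of this approach: a continuation line that happens to contain a colon is treated as a new key, so answers with colons spread over several lines may be split incorrectly.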

In [None]:
pd.set_option("display.max_colwidth", 1000)
pd.set_option("display.width", 100)

print(len(contexts))
print(len(questions))
print(len(answers))

df = pd.DataFrame({"Context": contexts, "Question": questions, "Answer": answers})

In [None]:
df

In [None]:
df.to_csv("output.csv", index=False)