## Build Your Personal Guide: From Dense PDFs to Accessible Q&A

### Overview
It's that time of the year again—tax season! Are you finding yourself swamped with piles of tax filing information and feeling a bit overwhelmed? Why not give our AI assistant a try? Simply feed it the tax guide, and voilà, it's ready to answer all your related queries with ease.

Today, we will show you how Uniflow turns difficult-to-read PDF documents into a responsive assistant that can answer the questions you have about them.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the [installation instructions](https://github.com/CambioML/cambio-cookbook/tree/main#installation).

For more details, see this [instruction](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/tree/main?tab=readme-ov-file#huggingfacemodelconfig).

Finally, we are storing the IRS dataset in the `data\raw_input` directory as "IRS_2023.pdf". You can download the file from [here](https://www.irs.gov/forms-pubs/guide-to-business-expense-resources).

### Update System Path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

If you already have these installed, feel free to skip this step.

In [None]:
!{sys.executable} -m pip install -q langchain pandas pypdf

### Import Dependencies

In [5]:
import os

import pandas as pd

from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import ExtractPDFConfig, TransformOpenAIConfig
from uniflow.op.extract.split.constants import (
    MARKDOWN_HEADER_SPLITTER,
    PARAGRAPH_SPLITTER,
)
from uniflow.op.model.model_config import NougatModelConfig, OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

In [6]:
import pickle
with open('/content/drive/My Drive/output_question.pkl', 'rb') as file:
    output_question = pickle.load(file)

df = pd.read_csv('/content/drive/My Drive/qlist.csv')

### Prepare Input data

First, let's set the current directory and input data directory, and load the raw input data.

In [7]:
dir_cur = os.getcwd()
pdf_file = "IRS_2023.pdf"
input_file = os.path.join(dir_cur, pdf_file)

data = [
    {"filename": input_file},
]
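
Extraction can take a while, so it can help to confirm that the PDF is where the notebook expects it before starting. The following is a minimal sketch using only the standard library; `build_extract_input` is a hypothetical helper (not part of uniflow) that produces the same `{"filename": ...}` records as the cell above and fails early when a file is missing.

```python
from pathlib import Path


def build_extract_input(pdf_paths):
    """Return {"filename": ...} records, raising early if a file is missing.

    Hypothetical helper for illustration; it is not part of uniflow.
    """
    records = []
    for path in map(Path, pdf_paths):
        if not path.is_file():
            raise FileNotFoundError(f"PDF not found: {path}")
        records.append({"filename": str(path)})
    return records
```

With this helper, the cell above could be written as `data = build_extract_input([input_file])`.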

### Extract PDF using Nougat

For this example, we'll run the `ExtractPDF` flow to extract the text from the IRS PDF. This uses the [Nougat](https://pypi.org/project/nougat-ocr/0.1.17/) PDF parser.

#### Create extract_config

In [8]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        batch_size=4,  # When batch_size > 1, Nougat runs on CUDA; otherwise it runs on CPU
    ),
    splitter=PARAGRAPH_SPLITTER,
)

#### Create extract_client

In [9]:
extract_client = ExtractClient(extract_config)

#### Run extract_client

In [10]:
extract_output = extract_client.run(data)

### Generating Questions

#### Prepare sample prompts

First, we need to tell the LLM what we want it to do. We do this by passing a sample instruction to the `PromptTemplate` class.

In [12]:
sample_instruction = """Assume you are an expert on tax, please generate as many questions as possible based on the context.
Make sure those questions can cover any question people can think of by reading the context."""

guided_prompt = PromptTemplate(instruction=sample_instruction)

In [13]:
input_context = [Context(context=ctx) for ctx in extract_output[0]["output"][0]["text"]]

print("sample size of processed input data: ", len(input_context))

input_context[:2]

sample size of processed input data:  1554


[Context(context='**Publication 535**'),
 Context(context='**Publication 535**')]
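
The first two contexts in the sample output are identical, and each context is sent to the LLM separately. One option, sketched below under the assumption that exact repeats and empty chunks add no value, is to filter the extracted text before wrapping it in `Context`. `dedupe_chunks` is a hypothetical helper, not a uniflow function.

```python
def dedupe_chunks(chunks, min_chars=1):
    """Drop empty or too-short chunks and exact repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for chunk in chunks:
        text = chunk.strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        unique.append(text)
    return unique


# Illustrative sample mirroring the repeated header seen in the output above.
sample = ["**Publication 535**", "**Publication 535**", "  ", "## What's New for 2022"]
print(dedupe_chunks(sample))
# → ['**Publication 535**', "## What's New for 2022"]
```

In this notebook it would be applied to `extract_output[0]["output"][0]["text"]` before building `input_context`. Whether to deduplicate is a judgment call: repeated headers may be noise, but repeated body text could be meaningful.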

#### Use LLM to generate data

In this example, we use `TransformOpenAIConfig` with an OpenAI model to generate questions from each context. The config and client classes were imported above.

Here, we pass our `guided_prompt` to `TransformOpenAIConfig` so that the model uses our customized instruction instead of the uniflow default.

We also want the response in JSON format instead of the default text format, so we set `response_format` to `{"type": "json_object"}`.

You can update the `batch_size` based on the size of the data.

In [None]:
question_config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(
        response_format={"type": "json_object"},
    ),
)
question_client = TransformClient(question_config)
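
To illustrate why a JSON response is convenient, here is a small sketch that parses a reply string with the standard library. The `raw_reply` value is a made-up example, and `parse_questions` is a hypothetical helper; the `"questions"` key is assumed from the post-processing code later in this notebook, and the actual keys depend on the prompt and the model.

```python
import json

# Made-up reply for illustration; real replies depend on the prompt and model.
raw_reply = '{"questions": ["What is Publication 535?", "Where can I access Publication 535?"]}'


def parse_questions(reply):
    """Parse a JSON-object reply and return its non-empty questions, or []."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    questions = payload.get("questions", [])
    return [q for q in questions if isinstance(q, str) and q.strip()]


print(parse_questions(raw_reply))
# → ['What is Publication 535?', 'Where can I access Publication 535?']
```

A plain-text reply would need ad hoc splitting on newlines or numbering, which is more fragile than reading a known key from a JSON object.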

Now we call the `run` method on the `question_client` object to execute the question generation operation on the data shown above.

In [None]:
output_question = question_client.run(input_context)

#### Process the output

Let's take a look at the generated output. We need to do a little postprocessing on the raw output.

In [None]:
contexts = []
questions_list = []

for index, output in enumerate(output_question):
    current_context = input_context[index]
    for item in output.get("output", []):
        for response in item.get("response", []):
            for question in response.get("questions", []):
                contexts.append(current_context)
                questions_list.append(question)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions_list
})
df = df[df["Question"].str.strip().astype(bool)]

pd.set_option("display.max_colwidth", 1000)
pd.set_option("display.width", 1000)

df

| Unnamed: 0 | Context | Question |
|---|---|---|
| 0 | `context='**Publication 535**'` | What is Publication 535? |
| 1 | `context='**Publication 535**'` | What topics does Publication 535 cover? |
| 2 | `context='**Publication 535**'` | Is Publication 535 related to individual taxes or business taxes? |
| 3 | `context='**Publication 535**'` | Where can I access Publication 535? |
| 4 | `context='**Publication 535**'` | Are there any updates or revisions to Publication 535? |
| ... | ... | ... |
| 15722 | `context='**Cry323N 53**'` | What is the purpose of Cry323N 53? |
| 15723 | `context='**Cry323N 53**'` | How is Cry323N 53 enforced and regulated? |
| 15724 | `context='**Cry323N 53**'` | Are there any recent changes or updates to Cry323N 53? |
| 15725 | `context='**Cry323N 53**'` | Is Cry323N 53 a federal, state, or local tax? |
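
The postprocessing loop relies on a particular nesting: each result has an `"output"` list, each item in it has a `"response"` list, and each response has a `"questions"` list. The sketch below reproduces that flattening on mock data so the shape is easier to see. The mock values are illustrative only, and `flatten_questions` is a hypothetical helper that mirrors the loop rather than any uniflow API.

```python
# Mock of the nested structure the postprocessing loop assumes (illustrative values).
mock_output = [
    {"output": [{"response": [{"questions": ["What is Publication 535?", ""]}]}]},
    {"output": [{"response": []}]},  # a context that produced no questions
]
mock_contexts = ["**Publication 535**", "## What's New for 2022"]


def flatten_questions(outputs, contexts):
    """Return (context, question) pairs, skipping blank questions."""
    rows = []
    for context, output in zip(contexts, outputs):
        for item in output.get("output", []):
            for response in item.get("response", []):
                for question in response.get("questions", []):
                    if question.strip():
                        rows.append((context, question))
    return rows


print(flatten_questions(mock_output, mock_contexts))
# → [('**Publication 535**', 'What is Publication 535?')]
```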


If you want to pause here and run the rest of the generation later, you can save the output at this point and reload it when you resume.
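
A minimal sketch of saving and reloading with `pickle` follows. The helper names and file paths are assumptions chosen for illustration; the notebook itself reloads `output_question.pkl` and `qlist.csv` from a Google Drive folder in an earlier cell.

```python
import pickle
from pathlib import Path


def save_checkpoint(obj, path):
    """Pickle obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(obj, file)


def load_checkpoint(path):
    """Load a pickled object. Only unpickle files you created yourself."""
    with Path(path).open("rb") as file:
        return pickle.load(file)
```

For example, `save_checkpoint(output_question, "output_question.pkl")` stores the raw results, and `df.to_csv("qlist.csv", index=False)` stores the question table. Passing `index=False` avoids the extra `Unnamed: 0` column that pandas adds when a CSV written with its index is read back.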

### Generating Answers

#### Prepare sample prompts

We need to create the prompt and instruction for answer generation.

In [None]:
answer_instruction = """
Based on the context provided, generate an answer that directly addresses the question. Start your response with the question number followed by a period and a space. For example, if the question is number 1, begin your answer with '1. ' followed by the response.
"""


answer_prompt = PromptTemplate(instruction=answer_instruction)

print("answer_instruction:")
print(answer_instruction, "\n")

#### TransformConfig for answer

In [None]:
answer_config = TransformOpenAIConfig(
    prompt_template=answer_prompt,
    model_config=OpenAIModelConfig(
        response_format={"type": "json_object"},
    ),
)
answer_client = TransformClient(answer_config)

#### Format data to feed into `answer_client`

In [None]:
input_question = [
    Context(
        context=row["Context"],
        question=row["Question"],
    )
    for _, row in df.iterrows()
]

print("sample size of processed input data: ", len(input_question))

input_question[:2]

sample size of processed input data:  15727


[Context(context="context='**Publication 535**'", question='What is Publication 535?'),
 Context(context="context='**Publication 535**'", question='What topics does Publication 535 cover?')]

#### `run` the `answer_client`

In [None]:
output_answer = answer_client.run(input_question)

#### Process the output

In [None]:
with open('/content/drive/My Drive/output_answer.pkl', 'rb') as file:
    output_answer = pickle.load(file)

#### Get answers list

In [None]:
answers = []
for item in output_answer:
    for output in item.get('output', []):
        for response in output.get('response', []):
            answer = next(iter(response.values()))
            answers.append(answer)
answers
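
The loop above appends an answer only when a response exists, so a single failed or empty response would shift every later answer onto the wrong question. One way to keep answers aligned with the rows of `df`, sketched here on mock data, is to emit exactly one entry per input and use `None` for missing responses. `first_answers` is a hypothetical helper, and the `"answer"` key in the mock is an assumption; like the original loop, it simply takes the first value of the response dict.

```python
def first_answers(outputs):
    """Return one answer per input item, or None when no response is present."""
    answers = []
    for item in outputs:
        answer = None
        for output in item.get("output", []):
            for response in output.get("response", []):
                if isinstance(response, dict) and response:
                    # Same rule as the original loop: take the first value.
                    answer = next(iter(response.values()))
                    break
            if answer is not None:
                break
        answers.append(answer)
    return answers


# Illustrative mock; the second item simulates a request that returned nothing.
mock_answers = [
    {"output": [{"response": [{"answer": "1. First mock answer."}]}]},
    {"output": [{"response": []}]},
    {"output": [{"response": [{"answer": "3. Third mock answer."}]}]},
]
print(first_answers(mock_answers))
# → ['1. First mock answer.', None, '3. Third mock answer.']
```

Because the result has the same length as the input, it can be assigned to a DataFrame column positionally without misalignment.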

In [None]:
df['Answer'] = pd.NA
df.loc[:99, 'Answer'] = answers[:100]

In [None]:
df[:99]

| Unnamed: 0 | Context | Question | Answer |
|---|---|---|---|
| 0 | `context='**Publication 535**'` | What is Publication 535? | Publication 535 is a document that provides guidance on business expenses for sole proprietors and statutory employees. |
| 1 | `context='**Publication 535**'` | What topics does Publication 535 cover? | Publication 535 covers topics related to business expenses, including what qualifies as a deductible business expense and how to report them on your tax return. |
| 2 | `context='**Publication 535**'` | Is Publication 535 related to individual taxes or business taxes? | Publication 535 is related to individual taxes. |
| 3 | `context='**Publication 535**'` | Where can I access Publication 535? | You can access Publication 535 online on the IRS website or request a copy by mail. |
| 4 | `context='**Publication 535**'` | Are there any updates or revisions to Publication 535? | Yes, there may be updates or revisions to Publication 535. It is important to check the latest version for any changes or new information. |
| ... | ... | ... | ... |
| 94 | `context="## What's New for 2022"` | What updates have there been to the rules for deducting medical expenses for 2022? | The updates to the rules for deducting medical expenses for 2022 include a temporary reduction in the adjusted gross income (AGI) threshold for claiming the itemized deduction for medical expenses, allowing individuals to deduct unreimbursed medical expenses that exceed 7.5% of their AGI. |
| 95 | `context="## What's New for 2022"` | Are there any changes to the rules for deducting home office expenses for 2022? | Yes, there are changes to the rules for deducting home office expenses for 2022. The IRS has introduced a simplified method for calculating the home office deduction, allowing taxpayers to use a standard deduction of $5 per square foot of home office space, up to a maximum of 300 square feet. |
| 96 | `context='The following items highlight some changes in the tax law for 2022.'` | What are the major changes in the tax law for 2022? | The major changes in the tax law for 2022 include... |
| 97 | `context='The following items highlight some changes in the tax law for 2022.'` | How will the changes in the tax law for 2022 impact individuals? | The changes in the tax law for 2022 may impact individuals by potentially affecting their tax deductions, credits, and overall tax liability. It is important for individuals to stay informed about these changes to understand how they may be affected. |


In [None]:
df.to_csv("output.csv", index=False)