# NB_DocAI_PDF_Parsing_and_Chunking
This notebook uses Document AI (DocAI) to do the following:

* parse and chunk a PDF with DocAI batch processing
* convert the parsed, chunked output into a JSONL file and save it back to the bucket
* load the JSONL file into text chunks for use in RAG


In [1]:
import sys
import json
import os
import importlib
sys.path.append(os.path.abspath('../utils'))


import doc_ai
# Reload so that edits to ../utils/doc_ai.py take effect without restarting the kernel.
importlib.reload(doc_ai)
from doc_ai import batch_process_documents, load_json_from_bucket, convert_json_to_jsonl, read_jsonl_to_json_list


In [2]:
import json

with open('../config.json', 'r') as file:
    ids = json.load(file)
project_id = ids['project_id']
processor_id = ids['processor_id']
bucket_name = ids['bucket_name']
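
A missing or empty key in `config.json` otherwise only surfaces later, when the batch job is submitted. The following is a minimal sketch of an early check, assuming the config holds exactly the three keys read above; `validate_config` and `REQUIRED_KEYS` are hypothetical names and not part of `doc_ai`.

```python
# Hypothetical helper: fail fast if config.json lacks a required value.
REQUIRED_KEYS = ("project_id", "processor_id", "bucket_name")


def validate_config(config):
    """Return (project_id, processor_id, bucket_name) or raise KeyError."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError("config is missing: " + ", ".join(missing))
    return tuple(config[key] for key in REQUIRED_KEYS)


# Placeholder values, for illustration only.
example = {"project_id": "my-project", "processor_id": "abc123", "bucket_name": "my-bucket"}
project_id_, processor_id_, bucket_name_ = validate_config(example)
```

In the notebook this would be called as `validate_config(ids)` right after `json.load`.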

## DocAI chunking and parsing

In [3]:
pdf_name="naf23.pdf"
# pdf_name="stc.pdf"

location = "us"
gcs_output_uri = "gs://"+bucket_name+"/react_data/output/" 
processor_version_id = "pretrained-layout-parser-v1.0-2024-06-03" 
gcs_input_uri = "gs://"+bucket_name+"/react_data/public_auditor_notes/"+pdf_name
input_mime_type = "application/pdf"
gcs_input_prefix = "gs://"+bucket_name+"/react_data/public_auditor_notes"
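
The URIs above are built by string concatenation, where a doubled or missing slash is easy to introduce. As a sketch only, a small helper could normalize the segments; `gcs_uri` is a hypothetical name and the bucket name below is a placeholder.

```python
# Hypothetical helper: join a bucket name and path segments into a gs:// URI.
def gcs_uri(bucket_name, *parts, trailing_slash=False):
    # Strip stray slashes from each segment and drop empty ones.
    segments = [p.strip("/") for p in parts if p.strip("/")]
    uri = "gs://" + "/".join([bucket_name.strip("/")] + segments)
    return uri + "/" if trailing_slash else uri


# Placeholder bucket name; the paths mirror the ones used in this notebook.
out_uri = gcs_uri("my-bucket", "react_data", "output", trailing_slash=True)
in_uri = gcs_uri("my-bucket", "react_data/public_auditor_notes/", "naf23.pdf")
print(out_uri)
print(in_uri)
```

The output URI keeps its trailing slash, matching the value of `gcs_output_uri` in the cell above.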

In [4]:
chunk_size=200
batch_process_documents(
    project_id=project_id,
    location=location,
    processor_id=processor_id,
    gcs_output_uri=gcs_output_uri,
    processor_version_id=processor_version_id,
    gcs_input_uri=gcs_input_uri,
    input_mime_type=input_mime_type,
    chunk_size=chunk_size,
)

Waiting for operation projects/984528513541/locations/us/operations/6845398121975108687 to complete...
success, all processing complete
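
In this run, the output folder in the bucket is named after the trailing ID of the operation name printed above. Rather than copying that ID by hand, it can be parsed from the string. This is a sketch under the assumption that the operation name is available as a string (for example, copied from the log line); whether `batch_process_documents` returns it is not shown here, and `operation_id_from_name` is a hypothetical helper.

```python
# Hypothetical helper: extract the trailing ID from an operation resource name.
def operation_id_from_name(operation_name):
    parts = operation_name.strip().split("/")
    # Expect ".../operations/<id>" at the end of the name.
    if len(parts) < 2 or parts[-2] != "operations" or not parts[-1]:
        raise ValueError("not an operation name: %r" % operation_name)
    return parts[-1]


name = "projects/984528513541/locations/us/operations/6845398121975108687"
fileNum = operation_id_from_name(name)
prefix = "react_data/output/" + fileNum + "/0"
print(prefix)
```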


## Convert JSON to JSONL, save back to the bucket

In [5]:
fileNum='6845398121975108687'

In [6]:
prefix='react_data/output/'+fileNum+'/0'

In [7]:
pdf_json=load_json_from_bucket(bucket_name=bucket_name, prefix=prefix)

Fetching react_data/output/6845398121975108687/0/naf23-0.json


In [8]:
print(pdf_json.keys())
print(pdf_json['chunkedDocument'].keys())
print(len(pdf_json['chunkedDocument']['chunks']))

dict_keys(['chunkedDocument', 'documentLayout', 'shardInfo'])
dict_keys(['chunks'])
53


In [9]:
# display the first chunk as a sanity check
display(pdf_json['chunkedDocument']['chunks'][0]['content'])

"# Annual Financial Report\n\nNational Ataxia Foundation St. Louis Park, Minnesota For the years ended December 31, 2023 and 2022 Abdo Lighting the path forward Edina Office 5201 Eden Avenue, Ste 250 Edina, MN 55436 P 952.835.9090 Mankato Office 100 Warren Street, Ste 600 Mankato, MN 56001 P 507.625.2727 Scottsdale Office 14500 N Northsight Blvd, Ste 233 Scottsdale, AZ 85260 P 480.864.5579 National Ataxia Foundation Table of Contents December 31, 2023 and 2022 2\n\n|-|-|\n|  | Page No. |\n| Independent Auditor's Report | 3 |\n| Financial Statements |  |\n| Statements of Financial Position | 6 |\n| Statements of Activities | 7 |\n| Statements of Functional Expenses | 9 |\n| Statements of Cash Flows | 11 |\n| Notes to the Financial Statements | 12 |\n\nAbdo\n\n## INDEPENDENT AUDITOR'S REPORT\n\nBoard of Directors National Ataxia Foundation St. Louis Park, Minnesota"

In [10]:
# output_json_fileName='stc-0.json'
output_json_fileName='naf23-0.json'
output_jsonl_path='react_data/output/'+fileNum+'/0/naf23-0_'+'ChunkSize'+str(chunk_size)+'.jsonl'
print(output_jsonl_path)
convert_json_to_jsonl(pdf_json=pdf_json, fileName=output_json_fileName, fileNum=fileNum, 
                      bucket_name=bucket_name, output_path=output_jsonl_path)




react_data/output/6845398121975108687/0/naf23-0_ChunkSize200.jsonl
number of lines: 53
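
The implementation of `convert_json_to_jsonl` lives in `doc_ai` and is not shown in this notebook. The sketch below illustrates one way such a conversion could work, producing one JSON object per chunk with the fields `chunkID`, `fileName`, `fileNum`, `pageStart`, `pageEnd` and `content`. The input keys `chunkId` and `pageSpan` are assumptions about the DocAI output schema, `chunks_to_jsonl` is a hypothetical name, and uploading to the bucket is left out.

```python
import json


# Hypothetical sketch: flatten DocAI chunks into JSONL text (one record per line).
def chunks_to_jsonl(pdf_json, file_name, file_num):
    lines = []
    for chunk in pdf_json["chunkedDocument"]["chunks"]:
        span = chunk.get("pageSpan", {})  # assumed key for the page range
        record = {
            "chunkID": chunk.get("chunkId"),  # assumed key for the chunk ID
            "fileName": file_name,
            "fileNum": file_num,
            "pageStart": span.get("pageStart"),
            "pageEnd": span.get("pageEnd"),
            "content": chunk.get("content", ""),
        }
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return ("\n".join(lines) + "\n") if lines else ""


# Tiny made-up input, for illustration only.
sample_json = {
    "chunkedDocument": {
        "chunks": [
            {"chunkId": "c1", "content": "# Title\n\nBody", "pageSpan": {"pageStart": 1, "pageEnd": 1}},
            {"chunkId": "c2", "content": "More text", "pageSpan": {"pageStart": 2, "pageEnd": 3}},
        ]
    }
}
jsonl_text = chunks_to_jsonl(sample_json, "naf23-0.json", "6845398121975108687")
print("number of lines:", len(jsonl_text.strip().split("\n")))
```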


## Read JSONL from the bucket into text chunks

In [31]:
prefix='react_data/output/'+fileNum+'/0'
print(prefix)

react_data/output/8630714554969055591/0


In [32]:
chunk_json_list=read_jsonl_to_json_list(bucket_name, prefix)

current file name: react_data/output/8630714554969055591/0/stc-0_ChunkSize200.jsonl


In [33]:
len(chunk_json_list)

49

In [34]:
chunk_json_list[0]

{'chunkID': 'c1',
 'fileName': 'stc-0.json',
 'fileNum': '8630714554969055591',
 'pageStart': 1,
 'pageEnd': 1,
 'content': 'TRANSPORT COMMUNITY 2021 Annual Financial Statements'}
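
For reference, parsing JSONL text into a list of records like the one above takes only a few lines. This sketch assumes the file contents have already been downloaded as a string; `parse_jsonl` is a hypothetical name, not the `read_jsonl_to_json_list` used in this notebook, and the second sample record is a placeholder.

```python
import json


# Hypothetical sketch: turn JSONL text into a list of dicts, skipping blank lines.
def parse_jsonl(text):
    records = []
    # Split on "\n" only, since that is the JSONL record separator.
    for line_no, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError("invalid JSON on line %d: %s" % (line_no, exc)) from exc
    return records


sample = (
    '{"chunkID": "c1", "fileName": "stc-0.json", "pageStart": 1, "pageEnd": 1, '
    '"content": "TRANSPORT COMMUNITY 2021 Annual Financial Statements"}\n'
    '\n'
    '{"chunkID": "c2", "fileName": "stc-0.json", "pageStart": 2, "pageEnd": 2, '
    '"content": "placeholder text"}\n'
)
records = parse_jsonl(sample)
contents = [record["content"] for record in records]
```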

In [35]:
# keep only the text of each chunk
text_chunks = [json_obj['content'] for json_obj in chunk_json_list]


In [36]:
text_chunks[0]

'TRANSPORT COMMUNITY 2021 Annual Financial Statements'

In [37]:
for chunk in text_chunks:
    print(chunk)
    print("*"*80)

TRANSPORT COMMUNITY 2021 Annual Financial Statements
********************************************************************************
# STATEMENT OF FINANCIAL POSITION AS AT 31 DECEMBER 2021

1 II 1 II LIABILITIES A. Current liabilities 1. Accounts payable Trade payables Payables to staff Legal commitments Accruals Total accounts payables

|-|-|-|-|-|
|  |  |  | 31 December | 31 December |
|  |  |  | 2021 | 2020 |
|  | ASSETS | Notes | (in EUR) | (in EUR) |
| A. | Current assets |  |  |  |
| . | Cash and cash equivalents | 1 | 1,492,643.83 | 3,254,771.91 |
| . | Receivables and prepayments | 2 | 1,464.48 |  |
|  | Budget contributions receivables |  |  | 185,753.87 |
|  | Total current assets |  | 1,494,108.31 | 3,440,525.78 |
| B. | Non-current assets |  |  |  |
| . | Intangible assets |  |  |  |
|  | Software and licences | 3 | 27,116.40 |  |
| . | Tangible assets |  |  |  |
|  | Infrastructure, plant and equipment | 4 | 4,985.03 | 8,244.43 |
|  | Total non-current assets |  | 32,101