# NB_DocAI_PDF_Parsing_and_Chunking
This notebook uses DocAI to do the following tasks:

* parse and chunk a PDF using DocAI batch processing
* convert the parsed and chunked PDF into JSONL files and save them back to the bucket
* load the JSONL files into text chunks to be used for RAG


In [5]:
import sys
import json
import os
import importlib
sys.path.append(os.path.abspath('../utils'))


import doc_ai
importlib.reload(doc_ai)
from doc_ai import batch_process_documents, load_json_from_bucket, convert_json_to_jsonl, read_jsonl_to_json_list


In [2]:
import json

with open('../config.json', 'r') as file:
    ids = json.load(file)
project_id = ids['project_id']
processor_id = ids['processor_id']
bucket_name = ids['bucket_name']

## DocAI chunking and parsing

In [26]:
location = "us"
# Folder in the bucket where DocAI writes its JSON results
gcs_output_uri = f"gs://{bucket_name}/react_data/output/"
processor_version_id = "pretrained-layout-parser-v1.0-2024-06-03"
# Single PDF to parse and chunk
gcs_input_uri = f"gs://{bucket_name}/react_data/public_auditor_notes/naf23.pdf"
input_mime_type = "application/pdf"
# Folder prefix (not used by the single-file call below)
gcs_input_prefix = f"gs://{bucket_name}/react_data/public_auditor_notes"
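The URIs above are assembled by string concatenation, which makes it easy to drop or double a slash. A minimal sketch of two helpers for building and splitting `gs://` URIs follows; `gcs_uri` and `split_gcs_uri` are hypothetical names that are not part of `doc_ai`, and `my-bucket` is a placeholder bucket name.

```python
def gcs_uri(bucket, *parts, trailing_slash=False):
    """Join a bucket name and path parts into a gs:// URI (hypothetical helper)."""
    # Strip stray slashes so "a/" + "/b" does not produce "a//b".
    path = "/".join(p.strip("/") for p in parts if p.strip("/"))
    uri = f"gs://{bucket}/{path}" if path else f"gs://{bucket}"
    return uri + "/" if trailing_slash else uri


def split_gcs_uri(uri):
    """Split a gs:// URI into (bucket, object_path) (hypothetical helper)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


# "my-bucket" is a placeholder, not the bucket used in this notebook.
print(gcs_uri("my-bucket", "react_data", "output", trailing_slash=True))
print(split_gcs_uri("gs://my-bucket/react_data/public_auditor_notes/naf23.pdf"))
```

Keeping the trailing slash optional lets the same helper build both the output folder URI and the single-file input URI.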

In [7]:
batch_process_documents(
    project_id=project_id,
    location=location,
    processor_id=processor_id,
    gcs_output_uri=gcs_output_uri,
    processor_version_id=processor_version_id,
    gcs_input_uri=gcs_input_uri,
    input_mime_type=input_mime_type,
    chunk_size=800,
)

Waiting for operation projects/984528513541/locations/us/operations/470041917927123113 to complete...
success, all processing complete


## Convert JSON to JSONL, save back to the bucket

In [8]:
prefix='react_data/output/470041917927123113/0'
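The prefix above repeats the operation ID printed by the batch job, followed by `0`. Rather than copying it by hand, it could be derived from the operation name. This is a sketch that assumes the `<output folder>/<operation id>/<index>` layout seen in this run holds in general; `output_prefix_from_operation` is a hypothetical helper, not a `doc_ai` function.

```python
def output_prefix_from_operation(operation_name, output_root, doc_index="0"):
    """Build the bucket prefix of a batch result from the operation name.

    Assumes results land under <output_root>/<operation id>/<doc_index>,
    which matches the run in this notebook but is not guaranteed here.
    """
    # The operation ID is the last path segment of the operation name.
    operation_id = operation_name.rstrip("/").rsplit("/", 1)[-1]
    return "/".join([output_root.strip("/"), operation_id, str(doc_index)])


operation_name = "projects/984528513541/locations/us/operations/470041917927123113"
print(output_prefix_from_operation(operation_name, "react_data/output/"))
# prints: react_data/output/470041917927123113/0
```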

In [11]:
pdf_json=load_json_from_bucket(bucket_name=bucket_name, prefix=prefix)

Fetching react_data/output/470041917927123113/0/naf23-0.json


In [12]:
pdf_json.keys()

dict_keys(['chunkedDocument', 'documentLayout', 'shardInfo'])

In [13]:
pdf_json['chunkedDocument'].keys()

dict_keys(['chunks'])

In [14]:
len(pdf_json['chunkedDocument']['chunks'])

14

In [17]:
# Print the first chunk as a sanity check
display(pdf_json['chunkedDocument']['chunks'][0]['content'])

"# Annual Financial Report\n\nNational Ataxia Foundation St. Louis Park, Minnesota For the years ended December 31, 2023 and 2022 Abdo Lighting the path forward Edina Office 5201 Eden Avenue, Ste 250 Edina, MN 55436 P 952.835.9090 Mankato Office 100 Warren Street, Ste 600 Mankato, MN 56001 P 507.625.2727 Scottsdale Office 14500 N Northsight Blvd, Ste 233 Scottsdale, AZ 85260 P 480.864.5579 National Ataxia Foundation Table of Contents December 31, 2023 and 2022 2\n\n|-|-|\n|  | Page No. |\n| Independent Auditor's Report | 3 |\n| Financial Statements |  |\n| Statements of Financial Position | 6 |\n| Statements of Activities | 7 |\n| Statements of Functional Expenses | 9 |\n| Statements of Cash Flows | 11 |\n| Notes to the Financial Statements | 12 |\n\nAbdo\n\n## INDEPENDENT AUDITOR'S REPORT\n\nBoard of Directors National Ataxia Foundation St. Louis Park, Minnesota\n\n## Opinion\n\nWe have audited the accompanying financial statements of National Ataxia Foundation (the Foundation), which

In [23]:
output_json_fileName='naf23-0.json'
fileNum='470041917927123113'
output_jsonl_path='react_data/output/470041917927123113/0/naf23-0.jsonl'

convert_json_to_jsonl(pdf_json=pdf_json, fileName=output_json_fileName, fileNum=fileNum, 
                      bucket_name=bucket_name, output_path=output_jsonl_path)




number of lines: 14
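The body of `convert_json_to_jsonl` is not shown in this notebook. As a rough sketch of what the conversion step could look like, the function below turns each chunk into one JSON line with the same keys that appear in the JSONL records later on (`chunkID`, `fileName`, `fileNum`, `pageStart`, `pageEnd`, `content`). The input field names `chunkId` and `pageSpan` are assumptions about the DocAI chunk schema, `chunks_to_jsonl` is a hypothetical name, and the upload to the bucket is left out.

```python
import json


def chunks_to_jsonl(pdf_json, file_name, file_num):
    """Serialize DocAI chunks to JSONL text, one record per line (sketch)."""
    lines = []
    chunks = pdf_json["chunkedDocument"]["chunks"]
    for i, chunk in enumerate(chunks, start=1):
        # Assumed input fields: "chunkId" and "pageSpan" {"pageStart", "pageEnd"}.
        span = chunk.get("pageSpan", {})
        record = {
            "chunkID": chunk.get("chunkId", f"c{i}"),
            "fileName": file_name,
            "fileNum": file_num,
            "pageStart": span.get("pageStart"),
            "pageEnd": span.get("pageEnd"),
            "content": chunk.get("content", ""),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    # Every record ends with a newline; an empty chunk list gives an empty string.
    return "".join(line + "\n" for line in lines)


# Tiny stand-in for pdf_json, using the assumed field names.
sample = {
    "chunkedDocument": {
        "chunks": [
            {
                "chunkId": "c1",
                "content": "# Annual Financial Report",
                "pageSpan": {"pageStart": 1, "pageEnd": 7},
            }
        ]
    }
}
jsonl_text = chunks_to_jsonl(sample, "naf23-0.json", "470041917927123113")
print("number of lines:", jsonl_text.count("\n"))
```

`json.dumps` escapes newlines inside `content`, so each chunk stays on a single line even though the chunk text itself contains `\n`.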


## Read JSONL from the bucket into text chunks

In [3]:
prefix='react_data/output/470041917927123113/0'

In [6]:
chunk_json_list=read_jsonl_to_json_list(bucket_name, prefix)

current file name: react_data/output/470041917927123113/0/naf23-0.jsonl
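Once the JSONL file has been downloaded as text, parsing it is a line-by-line `json.loads`. The sketch below covers only that parsing step, under the assumption that the file content is already available as a string; `parse_jsonl` is a hypothetical name, and how `read_jsonl_to_json_list` fetches the file from the bucket is not shown here.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines (sketch)."""
    records = []
    # Split on "\n" only: other Unicode line separators may appear unescaped
    # inside JSON strings and must not break a record in two.
    for line_no, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_no}: {exc}") from exc
    return records


jsonl_text = (
    '{"chunkID": "c1", "pageStart": 1, "pageEnd": 7, "content": "first"}\n'
    '{"chunkID": "c2", "pageStart": 7, "pageEnd": 8, "content": "second"}\n'
)
chunks = parse_jsonl(jsonl_text)
print(len(chunks), chunks[0]["chunkID"])
# prints: 2 c1
```

Reporting the line number on failure makes a truncated or partially written file easier to diagnose.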


In [7]:
len(chunk_json_list)

14

In [8]:
chunk_json_list[0]

{'chunkID': 'c1',
 'fileName': 'naf23-0.json',
 'fileNum': '470041917927123113',
 'pageStart': 1,
 'pageEnd': 7,
 'content': "# Annual Financial Report\n\nNational Ataxia Foundation St. Louis Park, Minnesota For the years ended December 31, 2023 and 2022 Abdo Lighting the path forward Edina Office 5201 Eden Avenue, Ste 250 Edina, MN 55436 P 952.835.9090 Mankato Office 100 Warren Street, Ste 600 Mankato, MN 56001 P 507.625.2727 Scottsdale Office 14500 N Northsight Blvd, Ste 233 Scottsdale, AZ 85260 P 480.864.5579 National Ataxia Foundation Table of Contents December 31, 2023 and 2022 2\n\n|-|-|\n|  | Page No. |\n| Independent Auditor's Report | 3 |\n| Financial Statements |  |\n| Statements of Financial Position | 6 |\n| Statements of Activities | 7 |\n| Statements of Functional Expenses | 9 |\n| Statements of Cash Flows | 11 |\n| Notes to the Financial Statements | 12 |\n\nAbdo\n\n## INDEPENDENT AUDITOR'S REPORT\n\nBoard of Directors National Ataxia Foundation St. Louis Park, Minnesota

In [9]:
# Keep only the text of each chunk
text_chunks = [json_obj['content'] for json_obj in chunk_json_list]


In [10]:
text_chunks[0]

"# Annual Financial Report\n\nNational Ataxia Foundation St. Louis Park, Minnesota For the years ended December 31, 2023 and 2022 Abdo Lighting the path forward Edina Office 5201 Eden Avenue, Ste 250 Edina, MN 55436 P 952.835.9090 Mankato Office 100 Warren Street, Ste 600 Mankato, MN 56001 P 507.625.2727 Scottsdale Office 14500 N Northsight Blvd, Ste 233 Scottsdale, AZ 85260 P 480.864.5579 National Ataxia Foundation Table of Contents December 31, 2023 and 2022 2\n\n|-|-|\n|  | Page No. |\n| Independent Auditor's Report | 3 |\n| Financial Statements |  |\n| Statements of Financial Position | 6 |\n| Statements of Activities | 7 |\n| Statements of Functional Expenses | 9 |\n| Statements of Cash Flows | 11 |\n| Notes to the Financial Statements | 12 |\n\nAbdo\n\n## INDEPENDENT AUDITOR'S REPORT\n\nBoard of Directors National Ataxia Foundation St. Louis Park, Minnesota\n\n## Opinion\n\nWe have audited the accompanying financial statements of National Ataxia Foundation (the Foundation), which
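`text_chunks` keeps only the text, which discards the page range and file name stored in each JSONL record. For RAG it can be useful to carry that metadata along so an answer can point back to its source pages. A minimal sketch, assuming the record keys shown in `chunk_json_list[0]`; `to_rag_records` and the output keys `id`, `text`, and `source` are hypothetical choices rather than anything required by a particular retrieval library.

```python
def to_rag_records(chunk_json_list):
    """Pair each chunk's text with an ID and a readable source label (sketch)."""
    records = []
    for obj in chunk_json_list:
        records.append(
            {
                # fileNum plus chunkID identifies a chunk across batch runs.
                "id": f"{obj['fileNum']}:{obj['chunkID']}",
                "text": obj["content"],
                "source": f"{obj['fileName']} (pages {obj['pageStart']}-{obj['pageEnd']})",
            }
        )
    return records


# Stand-in record with the same keys as chunk_json_list[0]; content shortened.
example = [
    {
        "chunkID": "c1",
        "fileName": "naf23-0.json",
        "fileNum": "470041917927123113",
        "pageStart": 1,
        "pageEnd": 7,
        "content": "# Annual Financial Report",
    }
]
rag_records = to_rag_records(example)
print(rag_records[0]["id"])
print(rag_records[0]["source"])
```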