# Chunking for Better DocQnA

## Queries

|Query|Expectation|Implementation Detail|
|---|---|---|
| What information in a Manufacturer's Manual is irrelevant for writing an SOP? | <li>Irrelevant sections can be dropped.</li><li>If the SOP contained ALL of the information in the Manufacturer's Manual, there would be no point in rewriting it.</li> | Filter the table of contents and drop irrelevant pages |
| How does the procedure in the Manufacturer's Manual differ from the one in an SOP? | <li>SOPs describe pre-defined tasks that need to be carried out: user-specific experiments.</li><li>These tasks require combining the various sub-procedures specified by the manufacturer.</li> | <ol><li>Identify sub-processes</li><li>Store sub-process summaries in metadata</li><li>Prompt the LLM with the experiment and the sub-process options</li></ol> |

In [258]:
import fitz
import re
import pandas as pd
import os
import shutil
from IPython.display import display, HTML
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.utilities import PythonREPL

from dotenv import load_dotenv

In [259]:
if load_dotenv('openai.env'):
    bg, tx, msg = "lightgreen", "darkgreen", "OpenAI environment loaded successfully"
else:
    bg, tx, msg = "#FFC0CB", "#8B0000", "Could not find <code>openai.env</code> file."
display(HTML(f"""<div style="background-color: {bg}; color: {tx}; border-radius: 10px; padding: 10px;">
{msg}
</div>
"""))

In [260]:
model = ChatOpenAI()

In [86]:
doc = fitz.open("data/Akta_Pure_User_Manual.PDF")

In [87]:
pages = []
for page in doc.pages(start = 1, stop = 10):
    pages.append(page)

## Extract Table of Contents

In [382]:
contents_text = pages[0].get_text()
pattern = r'(\d+(\.\d+)*)\s+(.*?)\s+(\d+)'
matches = re.findall(pattern, contents_text)
table_of_contents = []
for match in matches:
    number = match[0]
    heading = re.sub(r'\.{2,}', '', match[2])  # strip the dot leaders ("....") from the heading
    page_number = match[3]
    table_of_contents.append({"Index": number, "Heading": heading})

table_of_contents = pd.DataFrame(table_of_contents)
table_of_contents.head()

| | Index | Heading |
|---|---|---|
| 0 | 1.0 | Introduction |
| 1 | 1.1 | Important user information |
| 2 | 1.2 | ÄKTA pure overview |
| 3 | 1.3 | ÄKTA pure user documentation |
| 4 | 2.0 | The ÄKTA pure instrument |
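
To see what the regular expression above captures, here is a minimal, self-contained sketch run on a made-up contents page. The sample text and the `parse_toc` name are assumptions for illustration; the text that `get_text()` returns for the real page may be laid out differently.

```python
import re

# Hypothetical contents-page text (assumption: one entry per line with dot leaders).
sample_contents = (
    "1 Introduction ........ 6\n"
    "1.1 Important user information ........ 7\n"
    "2 The instrument ........ 13\n"
)

# Section number, lazily matched heading, then the page number.
TOC_PATTERN = re.compile(r'(\d+(?:\.\d+)*)\s+(.*?)\s+(\d+)')


def parse_toc(text):
    """Return one dict per contents entry found in `text`."""
    entries = []
    for number, heading, page in TOC_PATTERN.findall(text):
        # The lazy group also swallows the dot leaders, so remove them.
        heading = re.sub(r'\.{2,}', '', heading).strip()
        entries.append({"Index": number, "Heading": heading, "Page": int(page)})
    return entries


toc_entries = parse_toc(sample_contents)
for entry in toc_entries:
    print(entry)
```

Keeping the section number as a string avoids turning `1` into `1.0`, which is what happens when the column is parsed as a float.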


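Easier alternative: PyMuPDF can read the table of contents directly from the PDF's embedded outline with `get_toc()`.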

In [383]:
table_of_contents = doc.get_toc()
table_of_contents = pd.DataFrame(table_of_contents, columns = ['Level', 'Title', 'Page Number'])
table_of_contents.head()

| | Level | Title | Page Number |
|---|---|---|---|
| 0 | 1 | Coverpage | 1 |
| 1 | 1 | Table of Contents | 2 |
| 2 | 1 | 1 Introduction | 6 |
| 3 | 2 | 1.1 Important user information | 7 |
| 4 | 2 | 1.2 ÄKTA pure overview | 9 |


In [387]:
table_of_contents['End Page'] = table_of_contents['Page Number'].shift(-1) - 1

In [414]:
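# A section ends one page before the next entry of the same or a higher level starts.
# Compute these end pages separately for each depth of the table of contents.
level_wise_ends = []
max_level = table_of_contents.Level.max()
for level in range(2, max_level+2):
    same_or_higher = table_of_contents[table_of_contents.Level < level]
    level_wise_ends.append(same_or_higher['Page Number'].shift(-1) - 1)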

In [419]:
# For each row, keep the value from the first (coarsest) series that has one
section_end_pages = pd.Series(dtype='float')
for series in level_wise_ends:
    section_end_pages = section_end_pages.combine_first(series)

In [420]:
table_of_contents['End Page'] = section_end_pages

In [421]:
table_of_contents.head(30)

| | Level | Title | Page Number | End Page |
|---|---|---|---|---|
| 0 | 1 | Coverpage | 1 | 1.0 |
| 1 | 1 | Table of Contents | 2 | 5.0 |
| 2 | 1 | 1 Introduction | 6 | 12.0 |
| 3 | 2 | 1.1 Important user information | 7 | 8.0 |
| 4 | 2 | 1.2 ÄKTA pure overview | 9 | 10.0 |
| 5 | 2 | 1.3 ÄKTA pure user documentation | 11 | 12.0 |
| 6 | 1 | 2 The ÄKTA pure instrument | 13 | 85.0 |
| 7 | 2 | 2.1 Overview illustrations | 14 | 24.0 |
| 8 | 2 | 2.2 Liquid flow path | 25 | 26.0 |
| 9 | 2 | 2.3 Instrument control panel | 27 | 31.0 |
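
The rule behind the `End Page` column can also be written without pandas: a section ends one page before the next entry at the same or a higher level begins. The sketch below is a plain-Python restatement of that rule, not the notebook's implementation. The `section_end_pages` function name and the `last_page` fallback for entries that have no successor are assumptions; the sample entries are copied from the table above.

```python
def section_end_pages(toc, last_page):
    """Compute the end page of every entry.

    toc: list of (level, title, start_page) tuples in document order,
         the same shape that doc.get_toc() returns.
    last_page: assumed end page for entries with no later entry at the
               same or a higher level.
    """
    ends = []
    for i, (level, _title, _start) in enumerate(toc):
        end = last_page
        for next_level, _next_title, next_start in toc[i + 1:]:
            if next_level <= level:  # same depth or closer to the root
                end = next_start - 1
                break
        ends.append(end)
    return ends


toc = [
    (1, "Coverpage", 1),
    (1, "Table of Contents", 2),
    (1, "1 Introduction", 6),
    (2, "1.1 Important user information", 7),
    (2, "1.2 ÄKTA pure overview", 9),
    (2, "1.3 ÄKTA pure user documentation", 11),
    (1, "2 The ÄKTA pure instrument", 13),
]

# 85 is the end page of chapter 2 shown in the table above.
ends = section_end_pages(toc, last_page=85)
for (level, title, start), end in zip(toc, ends):
    print(f"{title}: {start}-{end}")
```

The explicit fallback covers the final entries of a document, for which the shift-based approach has no following row to read a page number from.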


## Extract Images

In [60]:
IMG_DIR = 'images'
os.makedirs(IMG_DIR, exist_ok=True)

In [61]:
for page_index in range(len(doc)):
    page = doc[page_index]
    image_list = page.get_images()

    for image_index, img in enumerate(image_list, start=1):
        xref = img[0]  # cross-reference number of the image object in the PDF
        pix = fitz.Pixmap(doc, xref)

        # Pixmaps with more than 3 colour components (e.g. CMYK) are converted to RGB before saving as PNG
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)

        pix.save(f"{IMG_DIR}/page_{page_index}-image_{image_index}.png")
        pix = None  # release the pixmap
print(f"Found {len(os.listdir(IMG_DIR))} images. Saved all under {IMG_DIR}/")

Found 732 images. Saved all under images/


## Extract Tables

In [116]:
from pprint import pprint

page = doc[9]
print(page.get_text()[0:200])
print('\n','-'*50)

tabs = page.find_tables()
print(f"{len(tabs.tables)} tables found on {page}\n", '-'*50)
pd.DataFrame(tabs[0].extract()[1:], columns = tabs[0].extract()[0])

Main functions
Module
Create and edit methods using one or a combination of:
Method Editor
•
Predefined methods with built-in application support
•
Drag-and-drop function to build methods with relevan

 --------------------------------------------------
1 tables found on page 9 of data/Akta_Pure_User_Manual.PDF
 --------------------------------------------------


| | Module | Main functions |
|---|---|---|
| 0 | Method Editor | Create and edit methods using one or a combina... |
| 1 | System Control | Start, monitor and control runs. The current f... |
| 2 | Evaluation | Open results, evaluate runs and create reports... |
| 3 | | |


In [153]:
TABLE_DIR = 'tables'
os.makedirs(TABLE_DIR, exist_ok=True)
os.makedirs(os.path.join(TABLE_DIR, '.temp'), exist_ok=True)

In [154]:
def extract_and_save_tables(doc, output_directory = os.path.join(TABLE_DIR, '.temp')): 
    count = 0
    for page_num in range(doc.page_count):
        page = doc[page_num]
        tabs = page.find_tables()
        if len(tabs.tables) > 0:
            for idx, tab in enumerate(tabs):
                table_data = tab.extract()
                header, *rows = table_data
    
                # Convert table data to DataFrame
                df = pd.DataFrame(rows, columns=header)
    
                # Save DataFrame as CSV
                table_filename = f"table_page_{page_num + 1}_idx_{idx + 1}.csv"
                table_path = os.path.join(output_directory, table_filename)
                if df.shape[0]>0:
                    df.to_csv(table_path, index=False)
                    count+=1
    print(f"Saved {count} tables")

In [158]:
def merge_and_save_tables(output_directory):
    input_directory = os.path.join(output_directory, '.temp')
    table_files = sorted(os.listdir(input_directory))
    queue = []

    for table_file in table_files:
        table_path = os.path.join(input_directory, table_file)
        df = pd.read_csv(table_path)

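        # Consecutive tables with identical headers are treated as one table split across pages
        if not queue or list(df.columns) == list(queue[0].columns):
            queue.append(df)
        else:
            # Header changed: write out the run collected so far, then start a new run
            merged_table = pd.concat(queue, ignore_index=True)
            merged_filename = f"merged_{table_file}"
            merged_table_path = os.path.join(output_directory, merged_filename)
            merged_table.to_csv(merged_table_path, index=False)

            queue = [df]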

    if queue:
        merged_table = pd.concat(queue, ignore_index=True)
        merged_filename = f"merged_last_{table_file}"
        merged_table_path = os.path.join(output_directory, merged_filename)
        merged_table.to_csv(merged_table_path, index=False)

    shutil.rmtree(input_directory)
    print(f"{len(os.listdir(output_directory))} tables in repository after merging")

In [156]:
extract_and_save_tables(doc)

Saved 402 tables


In [159]:
merge_and_save_tables(TABLE_DIR)

180 tables in repository after merging
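
The merge step relies on tables from adjacent pages being adjacent in the file list. `sorted(os.listdir(...))` compares names as strings, so `table_page_10_...` sorts before `table_page_2_...`. The sketch below shows a numeric sort key and the same "merge consecutive tables with equal headers" idea on in-memory data. The names `page_order_key` and `merge_consecutive` and the sample rows are assumptions for illustration, not part of the notebook.

```python
import re
from itertools import groupby


def page_order_key(filename):
    """Sort key made of the numbers in the name, e.g. (page, index)."""
    return tuple(int(n) for n in re.findall(r'\d+', filename))


def merge_consecutive(tables):
    """Merge runs of consecutive tables that share a header.

    tables: list of (header, rows) pairs in page order.
    """
    merged = []
    # groupby only groups neighbours, which is the behaviour wanted here.
    for header, run in groupby(tables, key=lambda table: tuple(table[0])):
        rows = [row for _header, table_rows in run for row in table_rows]
        merged.append((list(header), rows))
    return merged


names = ["table_page_10_idx_1.csv", "table_page_2_idx_1.csv", "table_page_2_idx_2.csv"]
lexicographic = sorted(names)
numeric = sorted(names, key=page_order_key)
print(lexicographic)
print(numeric)

# Made-up tables: the first two share a header, the third does not.
tables = [
    (["Module", "Main functions"], [["Method Editor", "Create methods"]]),
    (["Module", "Main functions"], [["Evaluation", "Open results"]]),
    (["Step", "Action"], [["1", "Start the run"]]),
]
merged = merge_consecutive(tables)
print(len(merged))
```

Switching the notebook to a numeric key would change which tables end up adjacent, so the merged count reported above may differ after such a change.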


## Extract Text

### Identify relevant pages

In [292]:
prompt = f'''
Below is the table of contents for my document:
```
{table_of_contents[table_of_contents.Level.apply(lambda l: l<3)]['Title'].to_markdown(index = False)}
```
I want to only keep the actual useful content for further processing.
Most documents have useless sections at the beginning and at the end like covers, index, contents, preface, acknowledgements, promotions, catalogs, appendix etc.
The useless contents are usually at the beginning and at the end of the document.
Can you return the list of such useless titles in the contents I provided that I can drop off? 
Only respond with a comma separated list of Titles. Don't write anything else.
'''

In [303]:
# response = model.predict(prompt)  # commented out to avoid unnecessary API calls
# Response recorded from an earlier run:
response = 'Coverpage, Table of Contents, Index'

In [297]:
titles_to_drop = list(map(lambda s: s.strip(), response.split(',')))

In [304]:
filtered_table_of_contents = table_of_contents[~table_of_contents['Title'].isin(titles_to_drop)]

start_page = filtered_table_of_contents['Page Number'].iloc[0]
end_page = filtered_table_of_contents['End Page'].iloc[-1]

print(f"Range of useful pages: {start_page} - {end_page}")

Range of useful pages: 6 - 512.0
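
Because the list of titles comes back from an LLM, it may be worth checking it against the real table of contents before dropping anything. The following is a minimal sketch under that assumption; `split_titles` and the extra title `Foreword` are hypothetical and only serve the example.

```python
def split_titles(response, known_titles):
    """Split a comma-separated LLM response into known and unknown titles."""
    requested = [title.strip() for title in response.split(',') if title.strip()]
    known = set(known_titles)
    valid = [title for title in requested if title in known]
    unknown = [title for title in requested if title not in known]
    return valid, unknown


known_titles = ["Coverpage", "Table of Contents", "1 Introduction", "Index"]

# 'Foreword' is a made-up title that does not exist in the document.
valid, unknown = split_titles("Coverpage, Table of Contents, Index, Foreword", known_titles)
print("drop:", valid)
print("ignored (not in the table of contents):", unknown)
```

Splitting on commas breaks for titles that themselves contain a comma, so titles reported as unknown deserve a manual look.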


### Extract text bodies

Every page has a header and a footer, so the front end will eventually need a step where the user selects the page region that contains the body text.<br>
For development, the bounding box is hard-coded: it skips the top 10% and the bottom 5% of the page height.

In [373]:
page_bbox = doc[0].bound()

In [374]:
page_height = page_bbox.height
page_width = page_bbox.width

In [375]:
x0, y0, x1, y1 = page_bbox

In [376]:
page_without_header_footer = fitz.Rect(x0, y0+0.1*page_height, x1, y1-0.05*page_height)

In [377]:
page = doc[236]

In [378]:
fulltext = page.get_text()
filteredtext = page.get_text(clip = page_without_header_footer)
print(fulltext.replace(filteredtext, ''))

ÄKTA pure User Manual 29119969 AC
237
6 Performance tests
6.6 Fraction collector F9-C test



The output above is a quick check that the bounding box filters out the header and footer on this page.
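
The arithmetic behind the clipping rectangle can be checked without opening a PDF. The sketch below assumes an A4 page of 595 x 842 points and uses plain tuples instead of `fitz.Rect`; `content_box` and `is_inside` are illustrative names, and the 10% and 5% margins are the ones used in this notebook.

```python
def content_box(x0, y0, x1, y1, top_fraction=0.10, bottom_fraction=0.05):
    """Shrink a page rectangle by a fraction of its height at top and bottom."""
    height = y1 - y0
    return (x0, y0 + top_fraction * height, x1, y1 - bottom_fraction * height)


def is_inside(box, x, y):
    """True if the point (x, y) lies within the box."""
    bx0, by0, bx1, by1 = box
    return bx0 <= x <= bx1 and by0 <= y <= by1


# Assumed page size: A4 in PDF points (origin at the top-left corner).
box = content_box(0, 0, 595, 842)
print(box)

print(is_inside(box, 100, 40))   # a header line near the top of the page
print(is_inside(box, 100, 400))  # body text in the middle of the page
```

Expressing the margins as fractions keeps the same call usable for pages of other sizes, as long as their headers and footers take up a similar share of the page.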

In [379]:
def extract_text_bodies(start_page: int, end_page: int = None, page_bbox: fitz.Rect = page_without_header_footer, doc: fitz.Document = doc):
    """Return the text of pages start_page..end_page (inclusive) without table text.

    Page numbers are 0-based indices into `doc`, whereas the 'Page Number'
    column of the table of contents is 1-based.
    """
    all_text_outside_tables = []
    last_page = start_page if end_page is None else end_page

    for page_num in range(start_page, last_page + 1):
        page = doc[page_num]
        # Text inside the bounding box only, i.e. without header and footer
        all_text = page.get_text("text", clip = page_bbox)

        # Remove the text of every detected table from the page text
        for table in page.find_tables():
            table_text = page.get_textbox(table.bbox)
            all_text = all_text.replace(table_text, '')

        all_text_outside_tables.append(all_text)

    return '\n'.join(all_text_outside_tables)

In [422]:
table_of_contents.head(10)

| | Level | Title | Page Number | End Page |
|---|---|---|---|---|
| 0 | 1 | Coverpage | 1 | 1.0 |
| 1 | 1 | Table of Contents | 2 | 5.0 |
| 2 | 1 | 1 Introduction | 6 | 12.0 |
| 3 | 2 | 1.1 Important user information | 7 | 8.0 |
| 4 | 2 | 1.2 ÄKTA pure overview | 9 | 10.0 |
| 5 | 2 | 1.3 ÄKTA pure user documentation | 11 | 12.0 |
| 6 | 1 | 2 The ÄKTA pure instrument | 13 | 85.0 |
| 7 | 2 | 2.1 Overview illustrations | 14 | 24.0 |
| 8 | 2 | 2.2 Liquid flow path | 25 | 26.0 |
| 9 | 2 | 2.3 Instrument control panel | 27 | 31.0 |


In [381]:
print(extract_text_bodies(6,12))

1.1
Important user information
Read this before operating
ÄKTA pure
All users must read the entire ÄKTA pure Operating Instructions before installing, operating, or
maintaining the instrument. Always keep the ÄKTA pure Operating Instructions at hand when operating
ÄKTA pure.
Do not operate ÄKTA pure in any other way than described in the user documentation. If you do, you
may be exposed to hazards that can lead to personal injury and you may cause damage to the
equipment.
Intended use
ÄKTA pure is intended for purification of bio-molecules, in particular proteins, for research purposes
by trained laboratory staff members in research laboratories.
ÄKTA pure shall not be used in any clinical procedures, or for diagnostic purposes.
Prerequisites
In order to operate the system according to the intended purpose, it is important that:
•
you have a general understanding of how the computer and the Microsoft® Windows® operating
system work.
•
you understand the concepts of liquid chromatograph
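
`get_toc()` reports 1-based page numbers, while `doc[i]` indexes pages from 0. The call above passes the table-of-contents values 6 and 12 unchanged, which would explain why the output starts at section 1.1 (page 7 in the table of contents) rather than at the chapter heading. A small conversion helper, sketched here under the hypothetical name `toc_pages_to_indices`, makes the intent explicit.

```python
def toc_pages_to_indices(first_page, last_page):
    """Convert an inclusive 1-based page range to inclusive 0-based indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return first_page - 1, last_page - 1


# Chapter "1 Introduction" spans pages 6-12 in the table of contents.
start_index, end_index = toc_pages_to_indices(6, 12)
print(start_index, end_index)

# With the notebook's function this would be used as:
# extract_text_bodies(start_index, end_index)
```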

## What Next?