# Revising Docling Errors

The examples below demonstrate a couple of known misconversions that we have encountered and investigate/fix them with the methods available on Docling Documents.

----

## Imports

In [None]:
!pip install -q docling

In [None]:
import re

from docling.document_converter import DocumentConverter
from docling_core.types.doc.document import (
    DoclingDocument,
    RefItem,
    TableCell
)

## BofA 01

The first BofA document that was provided in the PoC: ```01_Advantage_Savings.pdf```

First, I will convert the file and export it to Markdown to check the conversion.

In [None]:
FILE_SOURCE = 'files/pdf/01_Advantage_Savings.pdf'

converter = DocumentConverter()
result = converter.convert(FILE_SOURCE)

doc_1 = result.document

print(doc_1.export_to_markdown())

There are a fair number of imperfections in this conversion, but the one that I will focus on for this example is at the bottom of the document. Notice how the "Keep the Change" section header has been split from its body text by the table. Additionally, a large chunk of text at the start of this section's description has been jumbled. These two issues should be easy to fix.

To re-order the section header and table, I will find the RefItems of both in the body of the document and simply reverse their positions.

In [None]:
# Save the document to JSON for inspection

doc_1.save_as_json(filename = "files/json/01_converted.json")

From inspection of the JSON, the self_ref of the misplaced header is "#/texts/35", the self_ref of the jumbled text is "#/texts/36", and the self_ref of the table is "#/tables/0". All are children of the body, which makes our job easier. Here are the steps we will take:

- Create RefItems with the self_ref values and use resolve() to get the NodeItems of these elements
- Find the indexes of the misordered NodeItems in the body of the document
- Switch the position of the NodeItems using their indexes
- Change the text property of the jumbled description
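
The self_ref values above were found by reading the JSON by eye. As a rough alternative, the sketch below searches a loaded JSON dict for a text snippet and returns the matching self_ref values. The helper name `find_refs` and the tiny `sample` dict are hypothetical stand-ins; the sketch assumes the saved JSON keeps items under top-level `texts` and `tables` lists, each item carrying a `self_ref`, with table text stored under `data.table_cells`.

```python
from typing import Iterator, Tuple


def find_refs(doc_dict: dict, snippet: str, kinds=("texts", "tables")) -> Iterator[Tuple[str, str]]:
    """Yield (self_ref, preview) for every item whose text contains `snippet`."""
    needle = snippet.lower()
    for kind in kinds:
        for item in doc_dict.get(kind, []):
            if kind == "tables":
                # Assumed layout: table text lives in data.table_cells[*].text
                cells = item.get("data", {}).get("table_cells", [])
                text = " ".join(cell.get("text", "") for cell in cells)
            else:
                text = item.get("text", "")
            if needle in text.lower():
                yield item["self_ref"], text[:60]


# Made-up stand-in for a saved Docling JSON; with a real file, use
# json.load(open("files/json/01_converted.json")) instead.
sample = {
    "texts": [
        {"self_ref": "#/texts/0", "text": "Account overview"},
        {"self_ref": "#/texts/1", "text": "Keep the Change"},
    ],
    "tables": [
        {"self_ref": "#/tables/0", "data": {"table_cells": [{"text": "Monthly fee"}, {"text": "$8"}]}},
    ],
}

matches = list(find_refs(sample, "keep the change"))
print(matches)
```

The match is a case-insensitive substring test, so a short, distinctive phrase from the PDF is usually enough to narrow the search to one or two candidates.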

In [None]:
# Create RefItems and access NodeItems

text_ref = RefItem(cref="#/texts/35")
table_ref = RefItem(cref="#/tables/0")

text_item = text_ref.resolve(doc=doc_1)
table_item = table_ref.resolve(doc=doc_1)

print("Text Item Parent: ", text_item.parent)
print("Table Item Parent: ", table_item.parent)

In [None]:
# Get the parent items

text_parent = text_item.parent.resolve(doc=doc_1)
table_parent = table_item.parent.resolve(doc=doc_1)

# Get the indexes of each item

text_index = text_parent.children.index(text_item.get_ref())
table_index = table_parent.children.index(table_item.get_ref())

print("Text index: ", text_index)
print("Table index: ", table_index)

In [None]:
# Switch the values at the two indexes/parents

text_parent.children[text_index] = table_item.get_ref()
table_parent.children[table_index] = text_item.get_ref()

# Visualize the new document

print(doc_1.export_to_markdown())

As can be seen, the elements have flipped in position, and the section header now appears in the correct location, above the body text. Lastly, let's change the text property of the jumbled text by just copy-pasting the correct text from the PDF.

In [None]:
# Get the jumbled node item

jumbled_ref = RefItem(cref="#/texts/36")
jumbled_item = jumbled_ref.resolve(doc=doc_1)

# Correct text from the PDF

correct_text = "Build your savings automatically when you enroll in our Keep the Change savings program. Simply make everyday purchases with your Bank of America debit card, and we’ll round up your purchases to the nearest dollar amount and transfer the difference from your checking account to your savings account."

# Update the text property of the jumbled item
# The original text is still stored in the jumbled_item.orig property, which can help with provenance

jumbled_item.text = correct_text

# Visualize the new document

print(doc_1.export_to_markdown())
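
Since the original text stays available in the orig property, a word-level diff between the old and new text can serve as a simple record of what was changed. The sketch below uses the standard-library `difflib`; the helper name `word_diff` and the two short strings are made up for illustration and are not taken from the converted document.

```python
import difflib
from typing import List


def word_diff(orig: str, corrected: str) -> List[str]:
    """Return only the changed words: '- word' was removed, '+ word' was added."""
    diff = difflib.ndiff(orig.split(), corrected.split())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; keep real changes only
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Made-up strings; with the real document this would be
# word_diff(jumbled_item.orig, jumbled_item.text)
orig_text = "Build your savings when automatically you enroll"
new_text = "Build your savings automatically when you enroll"

changes = word_diff(orig_text, new_text)
print(changes)
```

An empty list means the two strings contain the same words in the same order.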

Now, **the desired part of the document has been fixed**. Let's save the updated JSON.

In [None]:
doc_1.save_as_json(filename = "files/json/01_fixed.json")

## BofA 02

The second BofA document that was provided in the PoC: ```02_BofA_CoreChecking_en_ADA.pdf```

First, I will convert the file and export it to Markdown to check the conversion.

In [None]:
FILE_SOURCE = 'files/pdf/02_BofA_CoreChecking_en_ADA.pdf'

converter = DocumentConverter()
result = converter.convert(FILE_SOURCE)

doc_2 = result.document

print(doc_2.export_to_markdown())

In this document, there are two major issues that I will address/explore. First, the $45.00 row falls outside of the "Additional fees" table. Second, the table between Option 1 and Option 2 in the "Overdraft settings and fees" section is missing from the Markdown.

In [None]:
# Save the document to JSON for inspection

doc_2.save_as_json(filename = "files/json/02_converted.json")

**\#1: Fix the additional fees table**

From inspecting the JSON, the self_ref values of the items of interest are:

- Table: "#/tables/2"
- Texts to go in table: "#/texts/32" through "#/texts/37"

In [None]:
# Get the table item

table_ref = RefItem(cref="#/tables/2")
table_item = table_ref.resolve(doc=doc_2)

# Get the text items

text_refs = [RefItem(cref=f"#/texts/{i}") for i in range(32, 38)]
text_items = [ref.resolve(doc=doc_2) for ref in text_refs]

# Print text items for reference

for idx, item in enumerate(text_items):
    print(f"{idx}: {item.text}")

Now that we have the text saved that needs to go into the table, let's remove the text items themselves from the document.

In [None]:
# Delete the text items

doc_2.delete_items(node_items=text_items)

Finally, let's add a few rows to the table.

In [None]:
# Access the data property of the table

table_data = table_item.data

In [None]:
# Save the previous number of rows

prev_rows = table_data.num_rows

# Add the three rows necessary to fit our new data

table_data.num_rows += 3

We now need to append table cells to the table_cells property of the table data to fill in our new rows. To do this in a streamlined manner, I will set up the text that needs to be added to each cell in a 2D array.

In [None]:
table_text = [
    ["International wire transfers", text_items[0].text, text_items[1].text],
    ["International wire transfers", "", text_items[2].text],
    [f"{text_items[3].text} {text_items[4].text}", "", text_items[5].text]
]

Now, I will instantiate table cells and add them to the table.

In [None]:
# Iterate through table cells and add them to the document

for i, row in enumerate(table_text):
    for j, text in enumerate(row):
        table_data.table_cells.append(
            TableCell(
                row_span = 1,
                col_span = 1,
                start_row_offset_idx = prev_rows + i,
                end_row_offset_idx = prev_rows + i + 1,
                start_col_offset_idx = j,
                end_col_offset_idx = j + 1,
                text = text
            )
        )
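
Because the cells are appended by hand, it is easy to end up with offsets that leave a gap or overlap an existing cell, for example if the rows in the 2D array do not match the table's column count. The sketch below is one possible sanity check, written against plain dicts that use the same offset field names as the TableCell arguments above. The function name `check_grid_coverage` and the sample cells are hypothetical.

```python
from collections import Counter


def check_grid_coverage(cells, num_rows, num_cols):
    """Return the (row, col) positions that are not covered by exactly one cell."""
    covered = Counter()
    for cell in cells:
        # A cell covers every position inside its half-open row/column ranges
        for r in range(cell["start_row_offset_idx"], cell["end_row_offset_idx"]):
            for c in range(cell["start_col_offset_idx"], cell["end_col_offset_idx"]):
                covered[(r, c)] += 1
    return [
        (r, c)
        for r in range(num_rows)
        for c in range(num_cols)
        if covered[(r, c)] != 1
    ]


def make_cell(row, col):
    """Build a 1x1 stand-in cell as a plain dict (not a real TableCell)."""
    return {
        "start_row_offset_idx": row,
        "end_row_offset_idx": row + 1,
        "start_col_offset_idx": col,
        "end_col_offset_idx": col + 1,
    }


# Made-up 2x3 table with the last cell of the second row left out
sample_cells = [make_cell(0, 0), make_cell(0, 1), make_cell(0, 2), make_cell(1, 0), make_cell(1, 1)]
problems = check_grid_coverage(sample_cells, num_rows=2, num_cols=3)
print(problems)
```

With the real table, the dicts could presumably be produced with `[cell.model_dump() for cell in table_data.table_cells]`, assuming TableCell is a Pydantic v2 model. An empty result means every position in the grid is covered exactly once.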

In [None]:
# Save the updated JSON

doc_2.save_as_json(filename = "files/json/02_fixed.json")

# Visualize the new document

print(doc_2.export_to_markdown())

As can be seen, **the table is now formatted exactly as we would like it to be, and the extra text has been removed.**

**\#2: Inspect the missing table**

Upon inspecting the JSON, it appears that the table between Option 1 and Option 2 does appear in the Docling Document and is formatted exactly as expected. It is also referenced correctly as the 11th index of the body pointing to "#/tables/1". **Thus, this appears to be an issue with the Markdown transformer, rather than the document conversion process.** The nature and source of this issue are deserving of their own investigation, but I will demonstrate below how we can use the Docling Document data to detect omissions like this in the Markdown.
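
The manual check that the table is referenced from the body can also be scripted. The sketch below walks the children of the body in a loaded JSON dict and reports any table that is never reached. It assumes that child references are stored as `{"$ref": "#/..."}` entries in each item's `children` list and that a reference such as "#/tables/1" indexes into the top-level `tables` list. The helper names and the `sample` dict are hypothetical.

```python
def resolve(doc_dict, ref):
    """Follow a reference such as '#/tables/1' or '#/body' into the dict."""
    node = doc_dict
    for part in ref[2:].split("/"):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def reachable_refs(doc_dict, root="#/body"):
    """Collect every reference reachable from the root, depth-first, in child order."""
    order = []
    stack = [root]
    while stack:
        ref = stack.pop()
        order.append(ref)
        children = resolve(doc_dict, ref).get("children", [])
        # Reverse so the first child is popped (and visited) first
        stack.extend(child["$ref"] for child in reversed(children))
    return order


# Made-up stand-in for a saved Docling JSON; "#/tables/1" is deliberately orphaned
sample = {
    "body": {"self_ref": "#/body", "children": [{"$ref": "#/texts/0"}, {"$ref": "#/tables/0"}, {"$ref": "#/groups/0"}]},
    "groups": [{"self_ref": "#/groups/0", "children": [{"$ref": "#/texts/1"}]}],
    "texts": [{"self_ref": "#/texts/0", "children": []}, {"self_ref": "#/texts/1", "children": []}],
    "tables": [{"self_ref": "#/tables/0", "children": []}, {"self_ref": "#/tables/1", "children": []}],
}

reached = reachable_refs(sample)
orphaned_tables = [t["self_ref"] for t in sample["tables"] if t["self_ref"] not in reached]
print(orphaned_tables)
```

If a table is reachable here but still missing from the Markdown, that points at the export step rather than the document tree, which is the situation described above.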

In [None]:
# Get the markdown text

markdown = doc_2.export_to_markdown()

# Get the raw text from the DocTags representation of the Docling Document

doctags = doc_2.export_to_doctags()
raw_text = re.sub(r"<.*?>", "", doctags, flags=re.DOTALL).strip()
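
One caveat with the pattern above: it removes anything between a "<" and the next ">", so document text that itself contains angle brackets would be removed along with the tags. The sketch below compares it with a stricter pattern that only matches bare tag names. The sample string is made up and is not real DocTags output, and the stricter pattern assumes that tags never carry attributes or spaces.

```python
import re

ANY_ANGLE = re.compile(r"<.*?>", flags=re.DOTALL)      # the pattern used above
TAG_ONLY = re.compile(r"</?[A-Za-z_][A-Za-z0-9_]*>")   # assumption: tags are bare names

# Made-up tagged string with literal angle brackets inside the text
sample = "<text><loc_1>Fee applies if balance < $25 and deposits > $5</text>"

loose_tokens = ANY_ANGLE.sub("", sample).split()
strict_tokens = TAG_ONLY.sub("", sample).split()

print(loose_tokens)
print(strict_tokens)
```

Whether this matters depends on how the exporter handles literal angle brackets in the text, which I have not checked.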

In [None]:
# Compare the non-overlapping elements that exist in the markdown text and the raw text

from collections import Counter

# The methodology of this function can be refined a lot more depending on our use case
# This is a quick mock up that I made in a few prompts with GenAI

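def remove_overlaps_preserving_order(a, b):
    """Drop tokens shared by both lists (respecting counts); return what is left of each, in order."""
    count_a = Counter(a)
    count_b = Counter(b)

    # Counter intersection keeps the minimum count of every shared token
    overlap = count_a & count_b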
    
    def filter_list(lst, keep_counts):
        seen = Counter()
        result = []
        for item in lst:
            if item in keep_counts:
                if seen[item] < keep_counts[item]:
                    result.append(item)
                    seen[item] += 1
            else:
                result.append(item)
        return result
    
    keep_a = {k: count_a[k] - overlap.get(k, 0) for k in count_a}
    keep_b = {k: count_b[k] - overlap.get(k, 0) for k in count_b}

    return filter_list(a, keep_a), filter_list(b, keep_b)

markdown_list = markdown.split()
raw_text_list = raw_text.split()

(markdown_unique, raw_text_unique) = remove_overlaps_preserving_order(markdown_list, raw_text_list)

print("Unique in the Docling Document: \n\n", ' '.join(raw_text_unique), '\n\n')
print("Unique in the Markdown representation: \n\n", ' '.join(markdown_unique))

As can be seen, this method reveals that **there is a chunk of text present in the original document that does not exist in the Markdown representation**. With more time/refinement, this method could more clearly show significant content that is missing from the Markdown representation, if we deem this useful.
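
One possible refinement is to report missing content as contiguous runs of tokens rather than as individual words, so that a dropped table shows up as a single block. The sketch below uses the standard-library `difflib.SequenceMatcher` instead of the counting approach above. The function name `missing_runs`, the `min_len` threshold, and the token lists are made up for illustration, and the approach assumes both token streams are in largely the same order.

```python
from difflib import SequenceMatcher


def missing_runs(source_tokens, export_tokens, min_len=3):
    """Return contiguous runs of source tokens that have no counterpart in the export."""
    matcher = SequenceMatcher(a=source_tokens, b=export_tokens, autojunk=False)
    runs = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'delete' and 'replace' both mean source tokens i1..i2 were not matched
        if tag in ("delete", "replace") and i2 - i1 >= min_len:
            runs.append(" ".join(source_tokens[i1:i2]))
    return runs


# Made-up token streams; with the real data these would be
# raw_text.split() and markdown.split()
source = "intro one two dropped block of several words here outro three four".split()
export = "intro one two outro three four".split()

runs = missing_runs(source, export)
print(runs)
```

Raising `min_len` filters out small differences such as Markdown table pipes or heading markers and leaves only the larger omissions.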

## Takeaways

**It is definitely possible to use the editing methods discussed in** ```docling_doc_structure.ipynb``` **to fix errors in document conversion** that we have observed in our PoCs. However, this process is difficult to do in a Jupyter Notebook, since it is hard to locate the RefItems of the objects that you would like to edit. Thus, connecting these editing functions to a UI seems to be the best option, as the RefItems of different objects could be stored with their previews and updated when the document is edited.

Simple actions like moving sections with drag-and-drop and editing text/tables are definitely possible given the methods/implementations shown above. However, the existing "Rendered Docling Document" would not be a suitable base for building this editing UI due to its static nature (tooltips placed over static PNG images). **A new way of previewing DoclingDocuments (possibly in a basic format similar to Markdown) would need to be developed before making a UI to edit them.**