# Docling Document Structure

This notebook examines the structure of ```DoclingDocument``` objects, as well as implementations of some relevant properties/methods for editing documents.

---
## Imports
Run the following cells to import necessary packages for this notebook, namely Docling.

In [None]:
!pip install -q docling

In [None]:
import re

from docling.document_converter import DocumentConverter
from docling_core.types.doc.document import (
    DoclingDocument,
    RefItem,
    ListItem,
    TextItem,
    GroupItem
)
from docling_core.types.doc.labels import DocItemLabel

## Create Example Document

The following cell converts the ```01_Advantage_Savings.pdf``` file with default conversion settings. The resulting document is used as an example throughout the notebook.

In [None]:
FILE_SOURCE = 'files/pdf/01_Advantage_Savings.pdf'

converter = DocumentConverter()
result = converter.convert(FILE_SOURCE)

docling_doc = result.document
print(type(docling_doc))

## Visualizing Docling Documents

The following cells cover the methods available for visualizing Docling Documents, both in their structure and content.

<br>

```print_element_tree()``` - Visualizes the content layers and parent/child relationships in the document

In [None]:
docling_doc.print_element_tree()

# Alternative: export_to_element_tree() - same as print_element_tree(), but saves the tree as a string

# tree = docling_doc.export_to_element_tree()
# print(tree)

<br>

```export_to_markdown()``` - Converts the content of the document to Markdown format, which can be printed

In [None]:
markdown = docling_doc.export_to_markdown()

print(markdown[0:200], "\n...")

<br>

```save_as_markdown()``` - A wrapper on export_to_markdown() that saves the Markdown version of the document to the specified file path

In [None]:
MARKDOWN_FILENAME = "files/markdown/" + docling_doc.name + ".md"

docling_doc.save_as_markdown(MARKDOWN_FILENAME)

<br>

```export_to_text()``` - Just wraps ```export_to_markdown()``` currently and gives the same output

<br>

```export_to_html()``` - Returns a string representation of the document in HTML

In [None]:
html = docling_doc.export_to_html()

print(html[0:200], "\n...")

<br>

```save_as_html()``` - A wrapper on export_to_html() that saves the HTML version of the document to the specified file path

In [None]:
HTML_FILENAME = "files/html/" + docling_doc.name + ".html"

docling_doc.save_as_html(HTML_FILENAME)

<br>

```get_visualization()``` - **Should be explored further in a separate spike**, as it uses functions from the transforms.visualizer module, which I felt was too far out of scope. However, this seems to be related to how Docling renders its DoclingRendered preview in Docling Serve UI

```export_to_doctags()``` - Saves a string representation of the document with Docling's DocTags language. **Can be extended to get a 1:1 string of the raw text extracted by Docling from a document**

In [None]:
# Get DocTags

doctags = docling_doc.export_to_doctags()

print(doctags[0:200], "\n...")

In [None]:
raw_text = re.sub(r"<.*?>", "", doctags, flags=re.DOTALL).strip()

print(raw_text[0:200], "\n...")

## Saving/Loading Docling Documents

The following cells cover the methods available for saving 1-to-1 representations of Docling Documents and sourcing them to create new Docling Document objects. Unlike exporting as markdown or HTML, the following exports are lossless, meaning that they store all of the data present in the pydantic class definition of a Docling Document. This makes them good candidates for storing the Docling Document during review, as it allows for edits to be made in a more controlled manner through methods on the Docling Document object.

<br>

```save_as_json()``` - Saves a 1:1 representation of the DoclingDocument in a JSON file at the specified file path

In [None]:
JSON_FILENAME = "files/json/" + docling_doc.name + ".json"

docling_doc.save_as_json(JSON_FILENAME)

<br>

```load_from_json()``` - Loads the contents of a JSON file with the file path specified and (if valid) turns them into a DoclingDocument object.

In [None]:
# load_from_json() is called on the class here, matching its use later in this notebook
from_json = DoclingDocument.load_from_json(JSON_FILENAME)

print(type(from_json))

## The Parts of Docling Documents

Before attempting to edit Docling Documents with built-in methods, it is important to understand how Docling Documents are organized as a data type and how to access parts within them.

If you look in ```01_Advantage_Savings.json```, you will see the standard structure of a Docling Document. There are a few important parts to point out:

1\. Origin - This contains metadata about the original document and can be useful for tracing back document conversions

In [None]:
print(docling_doc.origin)

2\. Furniture & Body - The furniture of the document refers to elements like headers and footers, while the body contains most of the document's content. These two properties of Docling Documents are responsible for reflecting the document's structure. Each has a children property, which contains a list of reference pointers to other items in the DoclingDocument.

In [None]:
print(docling_doc.body.children[:10])

As can be seen, the children property is a list of objects called RefItems, each with a property "cref" that holds a string. If you open the texts property of the Docling Document JSON and look through the elements, you will notice that these "cref" values align with the "self_ref" properties of the text items. This shows how Docling stores the document structure through references. At a high level, the furniture and body are like a "table of contents," showing how the different elements should be ordered and related. Then, one must go to the actual content, stored in groups, texts, pictures, tables, key_value_items, and form_items, in order to piece together the converted document.

In reality, every object has a children property; it is just the body and furniture that all children stem from. For example, look at item 11 in the children of the body. Its ref is "#/groups/0", and following this to the groups property of the root document reveals an object with more RefItems. Typically, only furniture, body and groups have children with RefItems, but the property exists on all objects.

Lastly, it is important to outline exactly how Docling uses RefItems. While as humans, we search through the JSON for matching $ref and self_ref values by comparing the tag ("#/groups/0"), Docling Documents parse this tag in a more literal manner. The second value in the tag references the top-level property to access, and the third value references the index of the element within that property. Thus, resolving a RefItem, which is implemented through ```RefItem.resolve(doc=document)```, actually accesses the following:

```python
document.groups[0]
```
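To make the index-based lookup concrete, here is a minimal sketch that resolves a reference string against a plain dictionary shaped like the saved JSON. The `resolve_cref` helper and the trimmed `sample_doc` are hypothetical and only model the behavior described above; they are not docling-core code.

```python
from typing import Any


def resolve_cref(doc: dict[str, Any], cref: str) -> Any:
    """Follow a reference such as '#/groups/0' into a document dict (illustrative only)."""
    parts = cref.split("/")  # '#/groups/0' -> ['#', 'groups', '0']
    if parts[0] != "#" or len(parts) not in (2, 3):
        raise ValueError(f"unsupported reference: {cref!r}")
    target = doc[parts[1]]  # top-level property, e.g. 'groups' or 'body'
    if len(parts) == 3:
        target = target[int(parts[2])]  # positional index inside that property
    return target


# Hypothetical, heavily trimmed stand-in for a saved Docling Document JSON
sample_doc = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}],
    },
    "groups": [{"self_ref": "#/groups/0", "children": [{"$ref": "#/texts/1"}]}],
    "texts": [
        {"self_ref": "#/texts/0", "text": "Title"},
        {"self_ref": "#/texts/1", "text": "First bullet"},
    ],
}

group = resolve_cref(sample_doc, "#/groups/0")
first_child = resolve_cref(sample_doc, group["children"][0]["$ref"])
print(first_child["text"])
```

Because the last path segment is used as a list position, a reference only stays valid while the item keeps its position in the underlying list.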

This is important to remember when using other DoclingDocument methods, because it means that the cref and self_ref values are dangerous to tamper with after they have been set by Docling. The following code explores RefItems further.

In [None]:
# Access a RefItem
text_ref = docling_doc.body.children[0]

print(type(text_ref))

<br>

```RefItem()``` - Class constructor for RefItems. It just requires a cref string that should match/point to the item that you are trying to access in a document

In [None]:
# Create a RefItem
new_ref = RefItem(cref="#/groups/0")

print(type(new_ref))

<br>

```get_ref()``` - Called on a RefItem, returns the reference itself; printing it shows the cref value

In [None]:
# Get the cref values

print(text_ref.get_ref())
print(new_ref.get_ref())

<br>

```resolve()``` - Called on a RefItem and passed a DoclingDocument. If the RefItem's cref points to an object in the DoclingDocument, that document's NodeItem will be returned. **This is the standard way to go from RefItems to NodeItems**

In [None]:
# **** VERY IMPORTANT ****
# Get items in a document from RefItems

text_item = text_ref.resolve(docling_doc)
group_item = new_ref.resolve(docling_doc)

print(type(text_item))
print(type(group_item))

Note that once using ```resolve```, we are able to access items, rather than refs. There are many different types of items defined for Docling Documents that make up the different components of any document. However, they are all based on the ```NodeItem``` object, which has a self_ref (string), parent (optional, RefItem), children (RefItem[]), and content_layer (tag). For our purposes, the content_layer is not very important. Let's explore these properties, as well as the base functions available on a NodeItem.

In [None]:
# The core properties of any NodeItem. These can all be seen in the JSON as well

print(text_item.self_ref)
print(text_item.parent)
print(text_item.children)
print(text_item.content_layer)

<br>

```get_ref()``` - Returns the RefItem for a NodeItem. **This is the standard way to go from NodeItems to RefItems**

In [None]:
# Get a RefItem for a NodeItem. Reverse of resolve() to go from RefItem to NodeItem

text_item.get_ref()

At a high level, these are all of the properties that one needs to know in order to access items within a Docling Document so that they may edit problematic items, delete unnecessary parts, and add new ones. However, there are a couple more methods on Docling Documents that may make it easier to access these Node and Ref items.

<br>

```iterate_items()``` - A method to traverse all NodeItems in a document. Optional parameters can be set to determine what items are returned:

- root (NodeItem): will only traverse the document downwards from a certain root item. If a root is not provided, the document will be traversed from the body downwards
- with_groups (bool): if set to True, will include groups in the NodeItems that are returned. Otherwise, it will only include leaf-level NodeItems (the NodeItems within groups, but not the GroupItems themselves). Set to False by default
- traverse_pictures (bool): if set to True, the traversal also descends into the children of PictureItems. Otherwise, items nested inside pictures are skipped. Set to False by default
- page_no (int): if provided, will only yield NodeItems that include the specified page number in their provenance
- included_content_layers (set of ContentLayers): if provided, will only yield NodeItems with a content layer that is a member of included_content_layers. If not provided, will default to DEFAULT_CONTENT_LAYERS, which is just the body layer

Note that iterate_items() returns an iterable, rather than a list. It is shown below how to convert the iterable into a list. Also, the iterable does not yield bare NodeItems, but rather tuples that pair each NodeItem with its level (nesting depth) in the tree. To access the NodeItem, take the first element of the tuple.
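The shape of what this kind of traversal yields can be illustrated with a small stand-alone model. The `Node` class and `iterate` function below are hypothetical simplifications (only the `with_groups` flag is modeled) and are not docling-core code; level numbering here is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for a NodeItem; not a docling-core class."""
    name: str
    is_group: bool = False
    children: list["Node"] = field(default_factory=list)


def iterate(root: Node, with_groups: bool = False, level: int = 0) -> Iterator[tuple[Node, int]]:
    """Depth-first walk that lazily yields (node, level) pairs."""
    if with_groups or not root.is_group:
        yield root, level
    for child in root.children:
        yield from iterate(child, with_groups, level + 1)


body = Node("body", is_group=True, children=[
    Node("title"),
    Node("list", is_group=True, children=[Node("bullet 1"), Node("bullet 2")]),
])

# Unpack each tuple to get at the node itself
leaf_names = [node.name for node, _level in iterate(body)]
all_nodes = [(node.name, level) for node, level in iterate(body, with_groups=True)]

print(leaf_names)
print(all_nodes)
```

Since the walk is a generator, nothing is traversed until it is consumed, which is why the cells below wrap the result in a list comprehension before calling `len()`.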

In [None]:
# Default iterate_items behavior

default_items = docling_doc.iterate_items()
default_items_list = [item for item in default_items]

print("Default iterate_items() length: ", len(default_items_list))
print("Group items: ", [item for item in default_items_list if isinstance(item[0], GroupItem)])

In [None]:
# Include GroupItems

with_group_items = docling_doc.iterate_items(with_groups = True)
with_group_items_list = [item for item in with_group_items]

print("With groups iterate_items() length: ", len(with_group_items_list))
print("Group items: ", [item for item in with_group_items_list if isinstance(item[0], GroupItem)])

In the above output, it can also be seen that the body is defined as a NodeItem that is a GroupItem with a self_ref of "#/body" and its own list of children.

In [None]:
# Iterate from GroupItems

# Get the first LIST group item, which has cref "#/groups/0"

group_0_cref = "#/groups/0"
group_0_ref = RefItem(cref = group_0_cref)
group_0_item = group_0_ref.resolve(doc = docling_doc)

iterate_from_group = docling_doc.iterate_items(root = group_0_item)
iterate_from_group_list = [item for item in iterate_from_group]

print("From group 0 iterate_items() length: ", len(iterate_from_group_list))
print("Items: ", iterate_from_group_list)

<br>

While iterating through items can be useful for filtering certain types of items out of the entire document and following all necessary child trees recursively, if you know where in the document the desired items are, you can simply get them by accessing properties on the DoclingDocument object (exactly as outlined in ```01_Advantage_Savings.json```). You will need to resolve each RefItem to turn it into a NodeItem if that is what is required. This approach also does not give the level of the item like iterate_items() does. However, it allows more freedom in targeting areas of the document.

In [None]:
# Get the children NodeItem objects of group_0, as is done above with iterate_items()

group_0 = docling_doc.groups[0]
group_0_ref_items = group_0.children
group_0_node_items = [ref.resolve(doc = docling_doc) for ref in group_0_ref_items]

print("Group 0 length: ", len(group_0_ref_items))
print("Group 0 RefItems: ", group_0_ref_items)
print("Group 0 NodeItems: ", group_0_node_items)

## "Safe" Item Addition

The following methods exist on Docling Documents as "safe" ways of adding a new NodeItem to a DoclingDocument. I consider these methods "safe" because they ensure that each time a NodeItem is added, it is a separate object from every other NodeItem in the document. Since Docling Documents store NodeItems by reference, if a single NodeItem is added twice to a document, the second auto-generated self_ref will overwrite the first, causing errors in processes like deletion. This will be demonstrated later.

**Note:** In the document.py file, there is currently a ##TODO## to "refactor the add* methods," so there may be plans to update the methods below

```add_ordered_list()```

- name (str): optional (None)
- parent (NodeItem): optional (body)
- content_layer (ContentLayer): optional (None)

```add_unordered_list()```

- name (str): optional (None)
- parent (NodeItem): optional (body)
- content_layer (ContentLayer): optional (None)

```add_inline_group()```

- name (str): optional (None)
- parent (NodeItem): optional (body)
- content_layer (ContentLayer): optional (None)

```add_group()```

- label (GroupLabel): optional (None)
- name (str): optional (None)
- parent (NodeItem): optional (body)
- content_layer (ContentLayer): optional (None)

```add_list_item()```

- text (str)
- enumerated (bool): defaults to False
- marker (str): optional ("-")
- orig (str): optional (text)
- prov (ProvenanceItem): optional (None)
- parent (NodeItem): optional (body)
- content_layer (ContentLayer): optional (None)
- formatting (Formatting): optional (None)
- hyperlink (AnyUrl | Path): optional (None)

```add_text()```

- label (DocItemLabel)
- text (str)
- orig (str): optional (text)
- prov (ProvenanceItem): optional (None)
- parent (NodeItem): optional (body)
- content_layer (ContentLayer): optional (None)
- formatting (Formatting): optional (None)
- hyperlink (AnyUrl | Path): optional (None)

```add_table()```

- data (TableData)
- caption (TextItem | RefItem): optional (None)
- prov (ProvenanceItem): optional (None)
- parent (NodeItem): optional (body)
- label (DocItemLabel): defaults to DocItemLabel.TABLE
- content_layer (ContentLayer): optional (None)
- annotations (TableAnnotationType[]): optional (None)

**See** ```docling_error_revision.ipynb``` **for an example of using TableData**

```add_picture()```

- annotations (PictureDataType[]): optional (None)  
- image (ImageRef): optional (None)  
- caption (TextItem | RefItem): optional (None)  
- prov (ProvenanceItem): optional (None)  
- parent (NodeItem): optional (body)  
- content_layer (ContentLayer): optional (None)  

```add_title()```

- text (str)  
- orig (str): optional (text)  
- prov (ProvenanceItem): optional (None)  
- parent (NodeItem): optional (body)  
- content_layer (ContentLayer): optional (None)  
- formatting (Formatting): optional (None)  
- hyperlink (AnyUrl | Path): optional (None)  

```add_code()```

- text (str)  
- code_language (CodeLanguageLabel): optional (None)  
- orig (str): optional (text)  
- caption (TextItem | RefItem): optional (None)  
- prov (ProvenanceItem): optional (None)  
- parent (NodeItem): optional (body)  
- content_layer (ContentLayer): optional (None)  
- formatting (Formatting): optional (None)  
- hyperlink (AnyUrl | Path): optional (None)  

```add_formula()```

- text (str)  
- orig (str): optional (text)  
- prov (ProvenanceItem): optional (None)  
- parent (NodeItem): optional (body)  
- content_layer (ContentLayer): optional (None)  
- formatting (Formatting): optional (None)  
- hyperlink (AnyUrl | Path): optional (None) 

```add_heading()```

- text (str)  
- orig (str): optional (text)  
- level (LevelNumber): optional (1)  
- prov (ProvenanceItem): optional (None)  
- parent (NodeItem): optional (body)  
- content_layer (ContentLayer): optional (None)  
- formatting (Formatting): optional (None)  
- hyperlink (AnyUrl | Path): optional (None) 

```add_key_values()```

- graph (GraphData)  
- prov (ProvenanceItem): optional (None)  
- parent (NodeItem): optional (body)  

```add_form()```

- graph (GraphData)  
- prov (ProvenanceItem): optional (None)  
- parent (NodeItem): optional (body)  

There are too many methods to show implementations for all of them, so I will demonstrate the simple example of adding a list item to group 0:

In [None]:
new_list_item = docling_doc.add_list_item(
    text="This is an item to add onto the end of the list",
    parent=docling_doc.groups[0]
)

JSON_FILENAME = "files/json/extra-list-item-" + docling_doc.name + ".json"
docling_doc.save_as_json(JSON_FILENAME)

print(new_list_item)

Check the saved JSON file, and you should also be able to observe the change. Note that, when using these "safe" functions, there is no way to control where inside of a parent element the new item will go. It can only ever be appended to the end of its parent's children. Functionality for inserting elements behind/in front of siblings is only provided with unsafe methods, as will be shown below. **Thus, a potential next step could be to contribute this extended "safe" Docling Document add functionality to docling-core.**

## "Unsafe" Item Addition & Item Deletion/Replacement

**"Unsafe" Item Addition:** The following three methods I consider "unsafe" because they require as a parameter the full item object that is to be added. If used incorrectly (as I will show), they can cause unintended behavior. However, they are very useful in the functionality that they provide to more flexibly insert items into different places within the document.

<br>

```append_child_item()``` - This has a functionality very similar to the "safe" item addition methods in that it appends an item to the end of a list of children based on a parent NodeItem. However, a NodeItem is passed for the value of this child item, rather than multiple individual parameters. 

Note that the function edits parts of the NodeItem, such as the self_ref, after it is passed, so you do not need to worry about setting appropriate ref values. Your values just need to pass the initial regex check in the format "#/{text}/{number}", and then they will be reassigned.
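As a rough illustration of that format check, the sketch below tests a few candidate strings against an assumed pattern. `SELF_REF_SHAPE` and `looks_like_self_ref` are hypothetical names, and the pattern only approximates the "#/{text}/{number}" shape described here; the validation pattern in docling-core itself may be more permissive.

```python
import re

# Assumed pattern for the "#/{text}/{number}" shape; not copied from docling-core
SELF_REF_SHAPE = re.compile(r"#/[A-Za-z_]+/\d+")


def looks_like_self_ref(value: str) -> bool:
    """Return True if the whole string has the '#/name/index' shape."""
    return SELF_REF_SHAPE.fullmatch(value) is not None


candidates = ["#/new_text/0", "#/texts/12", "new_text/0", "#/texts/first"]
results = {candidate: looks_like_self_ref(candidate) for candidate in candidates}

for candidate, ok in results.items():
    print(f"{candidate!r}: {ok}")
```

This is why a placeholder such as "#/new_text/0" is enough when instantiating an item by hand: it only needs the right shape, since the value is reassigned on insertion.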

To demonstrate the unsafe editing methods (as well as deletion/update methods), I will continue to use list items added to the first group.

In [None]:
# Instantiate a list item

text = "This will appear at the end of the list"
to_append = ListItem(self_ref = "#/new_text/0", orig = text, text = text)

# Get the parent item

parent = docling_doc.groups[0]

# Append item

docling_doc.append_child_item(child=to_append, parent=parent)

print(docling_doc.groups[0].children)

<br>

```insert_item_after_sibling()``` - This function also adds an item to a document, but instead of adding it to the end of a list of children, it places it based on the location of another NodeItem in the document: on the same level (a child of the same parent) and one position after.

In [None]:
# Instantiate a list item

text = "This will be inserted as the third element of the list"
to_insert_after = ListItem(self_ref = "#/new_text/0", orig = text, text = text)

# Get the sibling item (second item in the list)

sibling_ref = docling_doc.groups[0].children[1]
sibling_item = sibling_ref.resolve(doc = docling_doc)

# Insert item after sibling

docling_doc.insert_item_after_sibling(new_item=to_insert_after, sibling = sibling_item)

print(docling_doc.groups[0].children)

<br>

```insert_item_before_sibling()``` - This function is the same as insert_item_after_sibling, but just places the new object in front of the old object, rather than behind.

In [None]:
# Instantiate a list item

text = "This is the first item!!"
to_insert_before = ListItem(self_ref = "#/new_text/0", orig = text, text = text)

# Get the sibling item (first item in the list)

sibling_ref = docling_doc.groups[0].children[0]
sibling_item = sibling_ref.resolve(doc = docling_doc)

# Insert item before sibling

docling_doc.insert_item_before_sibling(new_item=to_insert_before, sibling = sibling_item)

print(docling_doc.groups[0].children)

As can be seen, the "Unsafe Item Addition" methods are considerably more flexible than the standard add* methods explored above. Especially in the use case of editing Docling Documents after conversion, it is very likely that elements will need to be placed within children lists rather than at the ends of them.
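The positional behavior of the two sibling-based methods can be modeled on a plain list of reference strings. The `insert_relative` helper below is hypothetical and only mirrors the placement rule described above; it does not reassign any references the way a real document would.

```python
def insert_relative(children: list[str], sibling: str, new: str, after: bool = True) -> None:
    """Insert `new` directly after (or before) `sibling` in a children list."""
    position = children.index(sibling) + (1 if after else 0)
    children.insert(position, new)


# Hypothetical children list of a group, written as cref strings
children = ["#/texts/5", "#/texts/6", "#/texts/7"]

insert_relative(children, sibling="#/texts/6", new="#/texts/40")               # after the second item
insert_relative(children, sibling="#/texts/5", new="#/texts/41", after=False)  # before the first item

print(children)
```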

**Removing Elements**: Removing items from a Docling Document is fairly simple: all that needs to be done is pass the NodeItems of the elements to be removed, and the reassignment of the self_ref strings will be done automatically. Note that when deleting items, RefItems are used internally that are created from the self_ref string of the NodeItem. Thus, if two NodeItems have the same self_ref string, it can cause unintended effects in deletion.

<br>

```delete_items()``` - Takes in a list of NodeItems and removes them from the document, automatically updating the self_ref values of all other elements to make sure they are sequential

In [None]:
# Remove the 2nd and 6th items in the group from above

print("Group before:", docling_doc.groups[0].children)

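# Get NodeItems to remove (resolve the RefItems stored in the children list)

to_remove = [docling_doc.groups[0].children[i].resolve(doc = docling_doc) for i in [1, 5]]

print("To remove:", [item.self_ref for item in to_remove])

docling_doc.delete_items(node_items = to_remove)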

print("Group after:", docling_doc.groups[0].children)

Notice how, when the two elements are removed, all other cref values update. This is done for the purpose of the ```RefItem.resolve()``` method, as it finds items based on their index (stripped from the cref), rather than matching the cref with one in the document. This means that when items are removed from the "texts" of the document, all other items must have their crefs updated to ensure that each cref points to the item's new location in the "texts" list. **This can be problematic** because it means that there is no persistent identifier of a NodeItem. If a user were to edit a Docling Document, we would need to take care to ensure that the cref values are also updated wherever else we want to track the provenance of NodeItems, such as in chunking and SDG, or contribute a permanent identifier to NodeItems within the Docling Document datatype.
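The renumbering effect can be reproduced on a plain list to show why index-based references are not stable identifiers. This is a simplified model, not docling-core's deletion code; `delete_and_renumber` and the old-to-new mapping it returns are hypothetical bookkeeping that downstream consumers (such as chunk metadata) might need.

```python
def delete_and_renumber(texts: list[dict], index: int) -> dict[str, str]:
    """Remove texts[index], renumber the rest, and return an old -> new self_ref map."""
    del texts[index]
    moved: dict[str, str] = {}
    for new_index, item in enumerate(texts):
        new_ref = f"#/texts/{new_index}"
        if item["self_ref"] != new_ref:
            moved[item["self_ref"]] = new_ref  # remember where this item used to live
            item["self_ref"] = new_ref
    return moved


# Hypothetical stand-in for the "texts" list of a document
texts = [
    {"self_ref": f"#/texts/{i}", "text": text}
    for i, text in enumerate(["Intro", "Rates", "Fees", "Contact"])
]

moved = delete_and_renumber(texts, index=1)  # delete "Rates"

print(moved)
print([(item["self_ref"], item["text"]) for item in texts])
```

After the deletion, "#/texts/2" no longer refers to "Fees" but to "Contact", so anything that stored the old reference outside the document would silently point at the wrong item.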

**"Unsafe" Demonstration**: Now that the ```delete_items()``` method has been introduced, I will quickly show one possible issue that arises from the unsafe item addition methods. Essentially, since removing items is done by passing NodeItems, and NodeItems are passed through the unsafe methods by reference, if the same NodeItem gets added twice to a document, the two instances will forever be coupled in cref, updates, and deletion.

In [None]:
# Load a fresh docling_doc

example_doc = DoclingDocument.load_from_json(filename="files/json/01_Advantage_Savings.json")

# Instantiate a List Item

text = "This bullet is repeated in the document"
repeated = ListItem(self_ref = "#/new_text/0", orig = text, text = text)

# Get the parent group

parent = example_doc.groups[0]

# Add the List Item twice

example_doc.append_child_item(child=repeated, parent=parent)
example_doc.append_child_item(child=repeated, parent=parent)

print(example_doc.groups[0])

Now let's say that we give this document to someone who doesn't know that the same NodeItem was added twice to the document. From the view above, it seems like there are no duplicates, since the RefItem is instantiated from the self_ref of the NodeItem at the moment that it is added to the document. In reality though, the RefItems are linked and point to the same NodeItem. To demonstrate the odd nature of this, let's model our second programmer attempting to remove the last item from this group.

In [None]:
# Save the current document

example_doc.save_as_json(filename="files/json/unsafe-01.json")

In [None]:
# Get the NodeItem of the last element in group_0 using resolve()

last_ref = example_doc.groups[0].children[-1]
last_item = last_ref.resolve(doc = example_doc)

print(last_item)

In [None]:
# Remove ONLY this last Node Item

example_doc.delete_items(node_items = [last_item])

print(example_doc.groups[0])

As you can now see, we attempted to just remove the last item from the group, but we instead removed both duplicate items that were at the end of the list. This is because, since the items pointed to the same object, they had the same self_ref, so the deletion method (based on self_ref) removed both pointers. Look in the saved ```unsafe-01.json``` and you will see that these two items (texts 39 and 40) had the same self_ref. **This inadvertent entanglement of NodeItems and mismatch between RefItems and self_refs is what makes the "unsafe methods" unsafe.** To remedy this, we could consider implementing more advanced insertion behavior to complement the safe addition methods, as these methods guarantee that a new NodeItem is created every time. Alternatively, if this odd behavior is not exposed to users and we are conscious of it, we could just ensure that we never duplicate NodeItems and avoid this issue.
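The aliasing problem, and the "never duplicate NodeItems" remedy, can be shown with a small stand-alone model. `Item` and `TinyDoc` are hypothetical classes that only imitate the "self_ref is assigned on insertion" behavior described in this notebook; they are not docling-core code.

```python
import copy
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a NodeItem."""
    text: str
    self_ref: str = "#/new_text/0"


class TinyDoc:
    """Hypothetical model: stores items by reference and assigns self_ref on append."""

    def __init__(self) -> None:
        self.texts: list[Item] = []
        self.children: list[str] = []  # cref strings captured at append time

    def append(self, item: Item) -> None:
        item.self_ref = f"#/texts/{len(self.texts)}"  # mutates the caller's object
        self.texts.append(item)
        self.children.append(item.self_ref)


shared = Item("This bullet is repeated in the document")

coupled = TinyDoc()
coupled.append(shared)
coupled.append(shared)  # same object again: its self_ref is overwritten

safe = TinyDoc()
safe.append(copy.deepcopy(shared))  # each append gets an independent object
safe.append(copy.deepcopy(shared))

print(coupled.children, [item.self_ref for item in coupled.texts])
print(safe.children, [item.self_ref for item in safe.texts])
```

In the coupled document the children list looks fine, but both stored entries are the same object and report the same self_ref, which is the mismatch described above. Copying before each insertion keeps the entries independent; for the Pydantic-based Docling items, a deep copy such as `model_copy(deep=True)` would presumably play the same role, though that is an assumption not tested in this notebook.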

**Replacing Items**: Lastly, there is a method that implements both delete and insert in order to replace an old NodeItem with a new one. This seems to be one of the most useful methods for editing Docling Documents, although since the NodeItems are passed by reference, we can also just change their properties to make similar edits (as long as the type of NodeItem does not need to be changed).

<br>

```replace_item()``` - Takes a new_item and old_item, replacing the old_item in the document with the new_item.

Let's say that I wanted to replace the entire group_0 with a single text item. This can be done by replacing items.

In [None]:
# Load a new document

replace_doc = DoclingDocument.load_from_json(filename="files/json/01_Advantage_Savings.json")

# Access the first group

old = replace_doc.groups[0]

# Create a text item to replace it with

text = "Group 0 was here"
label = DocItemLabel.TEXT

new = TextItem(text=text, orig=text, label=label, self_ref="#/new_texts/0")

replace_doc.replace_item(old_item=old, new_item=new)

print(replace_doc.export_to_markdown()[450:800])

As can be seen, we successfully replaced the group item with a text item. For more practical demonstrations of fixing Docling Documents with these tools, see the ```docling_error_revision.ipynb``` notebook.