# Docling Document Structure

This notebook examines the structure of ```DoclingDocument``` objects and demonstrates some of the properties and methods relevant to editing them.

---
## Imports
Run the following cells to install and import the packages needed for this notebook, namely Docling.

In [15]:
!pip install -q docling

In [16]:
from docling.document_converter import DocumentConverter
from docling_core.types.doc.document import (
    DoclingDocument,
    RefItem,
    ListItem,
)

## Create Example Document

The following cell converts the ```01_Advantage_Savings.pdf``` file with the default conversion settings. The resulting document is used as an example throughout the notebook.

In [19]:
FILE_SOURCE = 'files/pdf/01_Advantage_Savings.pdf'

converter = DocumentConverter()
result = converter.convert(FILE_SOURCE)

docling_doc = result.document
print(type(docling_doc))

<class 'docling_core.types.doc.document.DoclingDocument'>


## Visualizing Docling Documents

The following cells cover the methods available for visualizing Docling Documents, both in their structure and content.

<br>

```print_element_tree()``` - Visualizes the content layers and parent/child relationships in the document

In [27]:
docling_doc.print_element_tree()

# Alternative: export_to_element_tree() - same as print_element_tree(), but returns the tree as a string instead of printing it

# tree = docling_doc.export_to_element_tree()
# print(tree)

 0: unspecified with name=_root_
  1: section_header
  2: section_header
  3: section_header
  4: text
  5: text
  6: text
  7: text
  8: text
  9: text
  10: text
  11: text
  12: list with name=list
   13: list_item
   14: list_item
   15: list_item
   16: list_item
  17: section_header
  18: key_value_area with name=group
   19: text
   20: text
   21: text
   22: text
   23: text
   24: text
   25: text
  26: section_header
  27: text
  28: text
  29: text
  30: text
  31: section_header
  32: text
  33: text
  34: text
  35: picture
  36: section_header
  37: section_header
  38: table
  39: text
  40: text
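
Because ```export_to_element_tree()``` returns the tree as a string, it can be processed like any other text. The following is a minimal sketch that counts how often each label appears. It assumes the line format shown in the output above (`<index>: <label>`, optionally followed by `with name=...`) and uses a shortened excerpt of that output rather than a live document; `count_labels` is a hypothetical helper, not part of Docling.

```python
from collections import Counter

# Excerpt of the tree printed above. In the notebook, this string would come
# from docling_doc.export_to_element_tree().
tree = """\
 0: unspecified with name=_root_
  1: section_header
  2: section_header
  3: section_header
  4: text
  12: list with name=list
   13: list_item
   14: list_item
  38: table
"""


def count_labels(tree_text: str) -> Counter:
    """Count element labels in an element-tree string (hypothetical helper)."""
    counts = Counter()
    for line in tree_text.splitlines():
        if not line.strip():
            continue
        # Assumed line format: "<index>: <label>[ with name=<name>]"
        _, _, rest = line.strip().partition(": ")
        counts[rest.split(" ")[0]] += 1
    return counts


label_counts = count_labels(tree)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

A summary like this gives a quick sanity check on a conversion, for example whether any tables or pictures were detected at all.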


<br>

```export_to_markdown()``` - Converts the content of the document to Markdown format, which can be printed

In [36]:
markdown = docling_doc.export_to_markdown()

print(markdown[0:200], "\n...")

## Bank of America Advantage Savings

## Clarity Statement ®  - Overview of key policies and fees

## Your Bank of America Advantage Savings Account

FDIC

Coverage

This account is insured by the Fed 
...


<br>

```save_as_markdown()``` - Writes the Markdown export of the document to a file

<br>

```export_to_text()``` - Currently just wraps ```export_to_markdown()```, so it gives nearly the same output with the Markdown markup stripped

<br>

```export_to_html()``` - Returns a string representation of the document in HTML

In [34]:
html = docling_doc.export_to_html()

print(html[0:200], "\n...")

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>01_Advantage_Savings</title>
<meta name="generator" content="Docling HTML Serializer">
<style>
    html {
        background-color: #f5f5f5; 
...


<br>

```save_as_html()``` - Writes the HTML export of the document to a file

<br>

```get_visualization()``` - Returns images of the document's pages with the layout elements drawn on them

```export_to_doctags()``` - Converts the content of the document to DocTags format, returned as a string

## Saving/Loading Docling Documents

The following cells cover the methods for saving one-to-one representations of Docling Documents and loading them back into new Docling Document objects. Unlike the Markdown or HTML exports, these exports are lossless: they store all of the data present in the Pydantic class definition of a Docling Document. This makes them good candidates for storing a Docling Document during review, since edits can then be made in a controlled manner through methods on the reloaded Docling Document object.
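
The difference between a lossless and a lossy export can be sketched without Docling itself. The example below uses a hypothetical, heavily simplified dictionary as a stand-in for a document (it is not the real schema) and a hypothetical `to_markdown` helper: a JSON round trip returns an identical structure, while a Markdown-style rendering keeps the text but drops metadata such as the page number.

```python
import json

# Hypothetical, simplified stand-in for a document's data (not the real schema).
doc_data = {
    "name": "example",
    "texts": [
        {"label": "section_header", "text": "Fees", "page_no": 1},
        {"label": "text", "text": "No monthly fee applies.", "page_no": 2},
    ],
}


def to_markdown(data: dict) -> str:
    """Lossy view: keeps text and heading markers, drops all other fields."""
    lines = []
    for item in data["texts"]:
        prefix = "## " if item["label"] == "section_header" else ""
        lines.append(prefix + item["text"])
    return "\n\n".join(lines)


# Lossless: serializing to JSON and parsing it back yields the same structure.
restored = json.loads(json.dumps(doc_data))
print("JSON round trip identical:", restored == doc_data)

# Lossy: the Markdown string no longer carries the page numbers or labels.
markdown = to_markdown(doc_data)
print("Page info kept in Markdown:", "page_no" in markdown)
```

This is why a lossless format is the safer choice for anything that will be reloaded and edited later: nothing has to be reconstructed from the rendered text.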

<br>

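```save_as_json()``` - Writes the document to a JSON file

<br>

```load_from_json()``` - Creates a ```DoclingDocument``` from a JSON file written by ```save_as_json()```

```save_as_yaml()``` - Writes the document to a YAML file

```load_from_yaml()``` - Creates a ```DoclingDocument``` from a YAML file written by ```save_as_yaml()```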

```save_as_doctags()``` - Writes the DocTags export of the document to a file

```load_from_doctags()``` - Creates a ```DoclingDocument``` from DocTags content

## The Parts of Docling Documents

Before attempting to edit Docling Documents with built-in methods, it is important to understand how Docling Documents are organized as a data type and how to access parts within them. 

<br>

```export_to_dict()``` - Returns the document as a Python dictionary

In [32]:
doc_dict = docling_doc.export_to_dict()

print(list(doc_dict.keys()))

['schema_name', 'version', 'name', 'origin', 'furniture', 'body', 'groups', 'texts', 'pictures', 'tables', 'key_value_items', 'form_items', 'pages']
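
The top-level keys hint at how the parts fit together: `body` holds the tree, while `texts`, `groups`, `tables`, and `pictures` hold the items themselves. The following is a minimal sketch of following references from the tree to the items. It assumes that children are stored as `{"$ref": "#/texts/0"}` style pointers and uses a hypothetical, heavily trimmed dictionary in place of a real export; `resolve_ref` and `walk` are illustrative helpers, not Docling functions.

```python
from typing import Any, Iterator

# Hypothetical, heavily trimmed stand-in for the output of export_to_dict().
doc_dict: dict[str, Any] = {
    "name": "example",
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}],
    },
    "groups": [
        {
            "self_ref": "#/groups/0",
            "label": "list",
            "children": [{"$ref": "#/texts/1"}],
        }
    ],
    "texts": [
        {"self_ref": "#/texts/0", "label": "section_header", "text": "Overview"},
        {"self_ref": "#/texts/1", "label": "list_item", "text": "First point"},
    ],
}


def resolve_ref(doc: dict[str, Any], ref: str) -> Any:
    """Follow a '#/key/index' style pointer into the dictionary."""
    node: Any = doc
    for part in ref.lstrip("#/").split("/"):
        # List levels are indexed by position, dict levels by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def walk(doc: dict[str, Any], node: dict[str, Any], depth: int = 0) -> Iterator[tuple]:
    """Yield (depth, label, text) for every item reachable from node."""
    for child in node.get("children", []):
        item = resolve_ref(doc, child["$ref"])
        yield depth, item["label"], item.get("text", "")
        yield from walk(doc, item, depth + 1)


for depth, label, text in walk(doc_dict, doc_dict["body"]):
    print("  " * depth + f"{label}: {text}")
```

Walking the references this way reproduces the parent/child nesting that ```print_element_tree()``` displays, which can help when deciding which item an edit should target.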
