# Docugami
This notebook covers how to load documents from `Docugami`. It also describes the advantages of using this system over alternative data loaders.

## Prerequisites
1. Install necessary python packages.
2. Grab an access token for your workspace, and make sure it is set as the `DOCUGAMI_API_KEY` environment variable.
3. Grab some docset and document IDs for your processed documents, as described here: https://help.docugami.com/home/docugami-api

In [1]:
# You need the dgml-utils package to use the DocugamiLoader (run pip install directly without "poetry run" if you are not using poetry)
!poetry run pip install --upgrade dgml-utils==0.2.0 --quiet

## Quick start

1. Create a [Docugami workspace](http://www.docugami.com) (free trials available)
2. Add your documents (PDF, DOCX or DOC) and allow Docugami to ingest and cluster them into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. The system has no fixed set of supported document types: the clusters created depend on your particular documents, and you can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later.
3. Create an access token via the Developer Playground for your workspace. [Detailed instructions](https://help.docugami.com/home/docugami-api)
4. Explore the [Docugami API](https://api-docs.docugami.com) to get a list of your processed docset IDs, or just the document IDs for a particular docset. 
5. Use the DocugamiLoader as detailed below, to get rich semantic chunks for your documents.
6. Optionally, build and publish one or more [reports or abstracts](https://help.docugami.com/home/reports). This helps Docugami improve the semantic XML with better tags based on your preferences, which are then added to the DocugamiLoader output as metadata. Use techniques like [self-querying retriever](/docs/modules/data_connection/retrievers/self_query/) to do high accuracy Document QA.

## Advantages vs Other Chunking Techniques

Appropriate chunking is critical for retrieval from documents. Many chunking techniques exist, including simple ones that rely on whitespace and recursive chunk splitting based on character length. Docugami offers a different approach:

1. **Intelligent Chunking:** Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary length or simple whitespace-based chunking.
2. **Semantic Annotations:** Chunks are annotated with semantic tags that are coherent across the document set, facilitating consistent hierarchical queries across multiple documents, even if they are written and formatted differently. For example, in a set of lease agreements, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause.
3. **Structured Representation:** In addition, the XML tree indicates the structural contours of every document, using attributes denoting headings, paragraphs, lists, tables, and other common elements, and does that consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction.
4. **Additional Metadata:** Chunks are also annotated with additional metadata, if a user has been using Docugami. This additional metadata can be used for high-accuracy Document QA without context window restrictions. See detailed code walk-through below.


In [2]:
import os
from langchain.document_loaders import DocugamiLoader

## Load Documents

If the `DOCUGAMI_API_KEY` environment variable is set, there is no need to pass it to the loader explicitly. Otherwise, you can pass it in as the `access_token` parameter.

In [3]:
DOCUGAMI_API_KEY = os.environ.get("DOCUGAMI_API_KEY")
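
The token lookup described above can be sketched as a small helper. This is only an illustration: `resolve_access_token` is a hypothetical function, not part of LangChain or the Docugami API. It prefers an explicitly supplied token, falls back to the `DOCUGAMI_API_KEY` environment variable, and fails early if neither is available.

```python
import os
from typing import Optional


def resolve_access_token(explicit_token: Optional[str] = None) -> str:
    """Hypothetical helper: pick the explicit token, else the environment variable."""
    token = explicit_token or os.environ.get("DOCUGAMI_API_KEY")
    if not token:
        # Fail early with a clear message instead of a later authentication error.
        raise ValueError(
            "No Docugami access token found: set DOCUGAMI_API_KEY or pass a token explicitly."
        )
    return token
```

The returned value could then be handed to the loader through the `access_token` parameter mentioned above.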

In [4]:
docset_id = "26xpy3aes7xp"
document_ids = ["d7jqdzcj50sj", "cgd1eacfkchw"]

# To load all docs in the given docset ID, just don't provide document_ids
loader = DocugamiLoader(docset_id=docset_id, document_ids=document_ids)
chunks = loader.load()
len(chunks)

134

The `metadata` for each `Document` (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information:

1. **id and source:** ID and Name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.
2. **xpath:** XPath inside the XML representation of the document, for the chunk. Useful for source citations directly to the actual chunk inside the document XML.
3. **structure:** Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.
4. **tag:** Semantic tag for the chunk, using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks
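
As a minimal sketch of how the `source` and `xpath` fields could back a source citation, the snippet below formats them into a single string. The `Chunk` dataclass is a hypothetical stand-in for a loaded `Document`, and `format_citation` is an illustrative helper rather than a loader feature.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Chunk:
    """Hypothetical stand-in for a loaded Document (page_content plus metadata)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_citation(chunk: Chunk) -> str:
    """Build a human-readable citation from the chunk's source file and XPath."""
    source = chunk.metadata.get("source", "unknown source")
    xpath = chunk.metadata.get("xpath", "")
    # Fall back to the file name alone when no XPath is available.
    return f"{source} @ {xpath}" if xpath else source


example = Chunk(
    page_content="(chunk text)",
    metadata={
        "source": "Sample Commercial Leases/Shorebucks LLC_AZ.pdf",
        "xpath": "/docset:OFFICELEASE-section/dg:chunk",
    },
)
print(format_citation(example))
```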

You can control chunking behavior by setting the following properties on the `DocugamiLoader` instance:

1. You can set min and max chunk size, which the system tries to adhere to with minimal truncation. You can set `loader.min_text_length` and `loader.max_text_length` to control these.
2. By default, only the text for chunks is returned. However, Docugami's XML knowledge graph has additional rich information including semantic tags for entities inside the chunk. Set `loader.include_xml_tags = True` if you want the additional xml metadata on the returned chunks.
3. In addition, you can set `loader.parent_hierarchy_levels` if you want Docugami to return parent chunks in the chunks it returns. This is useful e.g. with the [MultiVector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) for [small-to-big](https://www.youtube.com/watch?v=ihSiRrOUwmg) retrieval. See detailed example later in this notebook.

In [5]:
loader.min_text_length = 64
loader.include_xml_tags = True
chunks = loader.load()

for chunk in chunks[:5]:
    print(chunk)

page_content='MASTER SERVICES AGREEMENT\n <ThisServicesAgreement> This Services Agreement (the “Agreement”) sets forth terms under which <Company>MagicSoft, Inc. </Company>a <Org><USState>Washington </USState>Corporation </Org>(“Company”) located at <CompanyAddress><CompanyStreetAddress><Company>600 </Company><Company>4th Ave</Company></CompanyStreetAddress>, <Company>Seattle</Company>, <Client>WA </Client><ProvideServices>98104 </ProvideServices></CompanyAddress>shall provide services to <Client>Daltech, Inc.</Client>, a <Company><USState>Washington </USState>Corporation </Company>(the “Client”) located at <ClientAddress><ClientStreetAddress><Client>701 </Client><Client>1st St</Client></ClientStreetAddress>, <Client>Kirkland</Client>, <State>WA </State><Client>98033</Client></ClientAddress>. This Agreement is effective as of <EffectiveDate>February 15, 2021 </EffectiveDate>(“Effective Date”). </ThisServicesAgreement>' metadata={'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-sectio
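
With `include_xml_tags` enabled, the chunk text carries inline tags such as `<Company>` and `<EffectiveDate>`, as in the output above. As a rough sketch (not part of the Docugami or LangChain APIs), the hypothetical helper `extract_leaf_entities` below pulls out the innermost tagged spans with a regular expression. It assumes tags without attributes; a real XML parser would be more robust.

```python
import re
from typing import List, Tuple

# Matches innermost elements only: an opening tag, text containing no nested
# tags, and the matching closing tag.
LEAF_ELEMENT = re.compile(r"<(\w+)>([^<>]*)</\1>")


def extract_leaf_entities(page_content: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs for every innermost tagged span in the chunk text."""
    return [(tag, text.strip()) for tag, text in LEAF_ELEMENT.findall(page_content)]


sample = (
    "<Company>MagicSoft, Inc. </Company>a <Org><USState>Washington </USState>"
    "Corporation </Org>"
)
print(extract_leaf_entities(sample))
```

Outer elements such as `<Org>` are skipped here because they contain nested tags; only leaf spans are returned.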

## Basic Use: Docugami Loader for Document QA

You can use the Docugami Loader like a standard loader for Document QA over multiple docs, albeit with much better chunks that follow the natural contours of the document. There are many great tutorials on how to do this, e.g. [this one](https://www.youtube.com/watch?v=3yPBVii7Ct0). We can just use the same code, but use the `DocugamiLoader` for better chunking, instead of loading text or PDF files directly with basic splitting techniques.

In [6]:
!poetry run pip install --upgrade openai tiktoken chromadb --quiet

In [7]:
# For this example, we already have a processed docset for a set of lease documents
loader = DocugamiLoader(docset_id="zo954yqy53wp")
chunks = loader.load()

# strip semantic metadata intentionally, to test how things work without semantic metadata
for chunk in chunks:
    stripped_metadata = chunk.metadata.copy()
    for key in chunk.metadata:
        if key not in ["name", "xpath", "id", "structure"]:
            # remove semantic metadata
            del stripped_metadata[key]
    chunk.metadata = stripped_metadata

print(len(chunks))

5868


The documents returned by the loader are already split, so we don't need to use a text splitter. Optionally, we can use the metadata on each document, for example the structure or tag attributes, to do any post-processing we want.
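
One possible form of such post-processing is to drop chunks by their structural attributes. The sketch below works on plain dictionaries instead of real `Document` objects, and it assumes the `structure` value is a whitespace-separated list of tokens (as in values like `lim h1` seen in the outputs); `drop_by_structure` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Set


def structure_tokens(metadata: Dict[str, str]) -> Set[str]:
    """Split the structure attribute into its individual tokens."""
    return set(str(metadata.get("structure", "")).split())


def drop_by_structure(chunks: List[dict], unwanted: Iterable[str]) -> List[dict]:
    """Keep only chunks whose structure shares no token with `unwanted`."""
    unwanted_set = set(unwanted)
    return [
        chunk
        for chunk in chunks
        if not (structure_tokens(chunk["metadata"]) & unwanted_set)
    ]


# Toy chunks for illustration only.
toy_chunks = [
    {"page_content": "a paragraph", "metadata": {"structure": "div"}},
    {"page_content": "a table cell", "metadata": {"structure": "table td"}},
    {"page_content": "a heading", "metadata": {"structure": "lim h1"}},
]
kept = drop_by_structure(toy_chunks, unwanted={"table", "td"})
print([chunk["page_content"] for chunk in kept])
```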

We will just use the output of the `DocugamiLoader` as-is to set up a retrieval QA chain the usual way.

In [8]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms.openai import OpenAI
from langchain.chains import RetrievalQA

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=chunks, embedding=embedding)
retriever = vectordb.as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True
)

In [9]:
# Try out the retriever with an example query
qa_chain("What can tenants do with signage on their properties?")

{'query': 'What can tenants do with signage on their properties?',
 'result': " Tenants can provide signage on their properties with accountability or notice to Tenant or any other party, in the manner Landlord shall determine, at Tenant's expense.",
 'source_documents': [Document(page_content="t accountability or notice to Tenant or any other party, in the manner Landlord shall determine, at Tenant's expense.", metadata={'id': '39146811b6c9d0161daefb4a19f4b995', 'name': 'Sample Commercial Leases/Shorebucks LLC_FL.pdf', 'structure': 'div', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:Florida-section/docset:Florida/dg:chunk/dg:chunk[2]/dg:chunk/dg:chunk/dg:chunk/docset:TheTerms/dg:chunk[4]/docset:INDEMNIFICATION-section/docset:INDEMNIFICATION/docset:TheFailure/dg:chunk[7]/docset:ENDOFTERM-section/docset:ENDOFTERM'}),
  Document(page_content="t accountability or notice to Tenant or any other party, in the manner Landlord shal

## Using Docugami Knowledge Graph for High Accuracy Document QA

One issue with large documents is that the correct answer to your question may depend on chunks that are far apart in the document. Typical chunking techniques, even with overlap, will struggle to provide the LLM with sufficient context to answer such questions. With upcoming very large context LLMs, it may be possible to stuff a lot of tokens, perhaps even entire documents, inside the context, but this will still hit limits at some point with very long documents or with many documents.

For example, if we ask a more complex question that requires the LLM to draw on chunks from different parts of the document, even OpenAI's powerful LLM is unable to answer correctly.

In [10]:
chain_response = qa_chain("What is rentable area for the property owned by DHA Group?")
chain_response["result"]  # correct answer should be 13,500 sq ft

" I don't know."

In [11]:
chain_response["source_documents"]

[Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:DhaGroup/dg:chunk[6]/dg:chunk'}),
 Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:CatalystGroup/dg:chunk

The chain does not return the correct answer. If you review the source chunks carefully, you will see that the chunking of the document did not put the landlord name and the rentable area in the same context, since they are far apart in the document. The retriever therefore finds chunks from other documents that are unrelated to the **DHA Group** landlord. That landlord is mentioned on the first page of the file **Shorebucks LLC_NJ.pdf**, but because source chunks from other documents are included, the LLM cannot produce the correct answer (**13,500 sq ft**).

Docugami can help here. Chunks are annotated with additional metadata created using different techniques if a user has been [using Docugami](https://help.docugami.com/home/reports). More technical approaches will be added later.

Specifically, let's ask Docugami to return XML tags on its output, as well as additional metadata:

In [12]:
loader = DocugamiLoader(docset_id="zo954yqy53wp")
loader.include_xml_tags = True  # for additional semantics from the Docugami knowledge graph
chunks = loader.load()
print(chunks[0].metadata)

{'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': '47297e277e556f3ce8b570047304560b', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'structure': 'h1 h1 p', 'tag': 'chunk Lease', 'Lease Date': 'March  29th , 2019', 'Landlord': 'Menlo Group', 'Tenant': 'Shorebucks LLC', 'Premises Address': '1564  E Broadway Rd ,  Tempe ,  Arizona  85282', 'Term of Lease': '96  full calendar months', 'Square Feet': '16,159', 'rudtbce8uctq': {'Lease Date': 'April  30  ,  2020', 'Landlord': 'GLORY ROAD', 'Tenant': 'Truetone Lane LLC'}, 'ik0xx7iubkux': {'Lease Date': 'October 15, 2021', 'Landlord': 'BIRCH STREET , LLC', 'Tenant': 'Trutone Lane LLC'}, 'ql7mfzapbunf': {'Lease Date': 'June 1, 2021', 'Landlord': 'LANDLORDIUS, LLC', 'Tenant': 'TruTone Lane LLC'}, 'pfskzgkwlkug': {'Lease Date': 'December 1, 2021', 'Landlord': 'CHURCH STREET , LLC', 'Tenant': 'Trutone Lane LLC'}, 't3fyvpbdt0k5': {'Lease Date': 'May  17th , 2018', 'Landlor

We can use a [self-querying retriever](/docs/modules/data_connection/retrievers/how_to/self_query/) to improve our query accuracy, using this additional metadata:

In [13]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.vectorstores.chroma import Chroma

EXCLUDE_KEYS = ["id", "xpath", "structure"]
metadata_field_info = [
    AttributeInfo(
        name=key,
        description=f"The {key} for this chunk",
        type="string",
    )
    for key in chunks[0].metadata
    if key.lower() not in EXCLUDE_KEYS
]

document_content_description = "Contents of this chunk"
llm = OpenAI(temperature=0)

vectordb = Chroma.from_documents(documents=chunks, embedding=embedding)
retriever = SelfQueryRetriever.from_llm(
    llm, vectordb, document_content_description, metadata_field_info, verbose=True
)
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True, verbose=True,
)

Let's run the same question again. It returns the correct result because every chunk carries metadata key/value pairs with key information about the document, even when that information is physically far from the source chunk used to generate the answer.

In [None]:
qa_chain(
    "What is rentable area for the property owned by DHA Group?"
)  # correct answer should be 13,500 sq ft

This time the answer is correct, since the self-querying retriever created a filter on the landlord attribute of the metadata, correctly filtering to the document that is specifically about the DHA Group landlord. The resulting source chunks are all relevant to this landlord, and this improves answer accuracy even though the landlord is not directly mentioned in the specific chunk that contains the correct answer.
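
Conceptually, the filter narrows the candidate chunks by metadata before any similarity ranking takes place. The sketch below imitates only that filtering step in plain Python on toy data; `filter_by_metadata` is a hypothetical helper, whereas the real retriever passes the generated filter on to the vector store.

```python
from typing import List


def filter_by_metadata(chunks: List[dict], key: str, value: str) -> List[dict]:
    """Keep chunks whose metadata[key] equals value, ignoring case."""
    wanted = value.casefold()
    return [
        chunk
        for chunk in chunks
        if str(chunk["metadata"].get(key, "")).casefold() == wanted
    ]


# Toy chunks for illustration only: both share the same heading text, so
# similarity search alone cannot tell them apart.
toy_chunks = [
    {
        "page_content": "1.6 Rentable Area of the Premises.",
        "metadata": {"Landlord": "DHA Group"},
    },
    {
        "page_content": "1.6 Rentable Area of the Premises.",
        "metadata": {"Landlord": "Menlo Group"},
    },
]
candidates = filter_by_metadata(toy_chunks, "Landlord", "dha group")
print(len(candidates), candidates[0]["metadata"]["Landlord"])
```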

## Advanced Topic: Small-to-Big Retrieval with Document Knowledge Graph Hierarchy

Documents are inherently semi-structured and the DocugamiLoader is able to navigate the semantic and structural contours of the document to provide parent chunk references on the chunks it returns. This is useful e.g. with the [MultiVector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) for [small-to-big](https://www.youtube.com/watch?v=ihSiRrOUwmg) retrieval.

To get parent chunk references, you can set `loader.parent_hierarchy_levels` to a non-zero value. In the case of `xml_mode` this uses the XML hierarchy; otherwise it uses a sliding window on the text chunks to provide additional parent context.
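
To make the sliding-window idea concrete, here is a generic sketch in which each chunk's parent context is the chunk itself plus a fixed number of neighbors on each side. This illustrates the general technique only; it is an assumption and not the loader's actual windowing logic, and `sliding_window_parents` is a hypothetical name.

```python
from typing import List


def sliding_window_parents(texts: List[str], levels: int) -> List[str]:
    """For each chunk, join it with up to `levels` neighbors on each side."""
    parents = []
    for index in range(len(texts)):
        start = max(0, index - levels)  # clamp at the first chunk
        end = min(len(texts), index + levels + 1)  # clamp at the last chunk
        parents.append(" ".join(texts[start:end]))
    return parents


print(sliding_window_parents(["a", "b", "c", "d"], levels=1))
```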

In [None]:
from langchain.document_loaders import DocugamiLoader

loader = DocugamiLoader(docset_id="zo954yqy53wp")
loader.include_xml_tags = True  # for additional semantics from the Docugami knowledge graph
loader.parent_hierarchy_levels = 3  # for expanded context
loader.min_text_length = 512
loader.max_text_length = (
    1024 * 8
)  # 8K chars are roughly 2K tokens (ref: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)
loader.include_project_metadata_in_page_content = True  # Any project metadata will be included inside the retrieved small chunks, to help boost relevance of direct retrieval from the vector store
loader.include_project_metadata_in_doc_metadata = False  # Not filtering on vector metadata, so remove to lighten the vectors
chunks = loader.load()

In [None]:
# Inspect parent chunks
for chunk in chunks[:15]:
    print(f"PARENT CHUNK {chunk.parent.metadata['id']}: {chunk.parent}")
    print(f"CHUNK {chunk.metadata['id']}: {chunk}")

In [None]:
from langchain.vectorstores.chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="big2small", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    search_kwargs={"k": 10},  # for the vector store
    docstore_k=2,  # for the docstore
)


parents = {}
for chunk in chunks:
    if not chunk.parent:
        continue

    parent_id = chunk.parent.metadata["id"]

    # Set parent metadata on all child chunks
    chunk.metadata["doc_id"] = parent_id

    # Keep track of all unique parents (by ID)
    if parent_id not in parents:
        parents[parent_id] = chunk.parent

# Add child chunks to vector store
retriever.vectorstore.add_documents(chunks)

# Add parent chunks to docstore
retriever.docstore.mset(parents.items())

In [None]:
# Query vector store directly, should return chunks
found_chunks = vectorstore.similarity_search("what signs does Perry & Blair allow on their property?")

for chunk in found_chunks:
    print(chunk)
    print(chunk.metadata['doc_id'])


In [None]:
# Query retriever, should return parents
retrieved_parent_docs = retriever.get_relevant_documents("what signs does Perry & Blair allow on their property?")
for chunk in retrieved_parent_docs:
    print(chunk)
    print(chunk.metadata['id'])
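
The difference between the two queries above comes down to one lookup step: child hits from the vector store are mapped to their parents through `doc_id`, and each parent is returned once. The plain-Python sketch below illustrates that step on toy data; `parents_for_hits` is a hypothetical helper written under that assumption, not the retriever's actual code.

```python
from typing import Dict, List


def parents_for_hits(
    child_hits: List[dict], parent_store: Dict[str, str], id_key: str = "doc_id"
) -> List[str]:
    """Map ranked child hits to unique parents, preserving the ranking order."""
    ordered_ids: List[str] = []
    for child in child_hits:
        parent_id = child["metadata"].get(id_key)
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Skip IDs that have no stored parent.
    return [parent_store[pid] for pid in ordered_ids if pid in parent_store]


# Toy data for illustration only.
toy_parent_store = {"p1": "parent section one", "p2": "parent section two"}
toy_hits = [
    {"page_content": "child a", "metadata": {"doc_id": "p2"}},
    {"page_content": "child b", "metadata": {"doc_id": "p1"}},
    {"page_content": "child c", "metadata": {"doc_id": "p2"}},
]
print(parents_for_hits(toy_hits, toy_parent_store))
```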