# Using the Contextual AI Document Parser


Contextual AI's `/parse` API is purpose-built for RAG to ensure enterprise AI agents can navigate and understand large and complex documents with superior accuracy and context awareness.

Please see our [blog post](https://contextual.ai/blog/) for details on how it compares with other parsers.

This notebook demonstrates how to use `/parse` both through the REST API directly and through our Python SDK. We'll use the same document, [Attention Is All You Need](https://arxiv.org/pdf/1706.03762), for both.

You can run this notebook entirely in Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/03-standalone-api/03-parse/parse.ipynb)

### Fetch the doc

First, we will fetch the document that we'll use throughout the notebook.

In [None]:
url = "https://arxiv.org/pdf/1706.03762"

In [None]:
import requests

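# Download the PDF; fail fast on an HTTP error instead of saving an error page
file_path = "attention-is-all-you-need.pdf"
response = requests.get(url, timeout=60)
response.raise_for_status()
with open(file_path, "wb") as f:
    f.write(response.content)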
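
A download can succeed at the HTTP level and still save something other than a PDF (for example, an HTML error page). The following is a minimal sketch of a local sanity check, relying only on the fact that PDF files begin with the `%PDF-` marker. The helper names `looks_like_pdf` and `file_looks_like_pdf` are hypothetical and not part of the Contextual API or SDK.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file starts with this marker


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header (ignoring leading whitespace)."""
    return data[:1024].lstrip().startswith(PDF_MAGIC)


def file_looks_like_pdf(path: str) -> bool:
    """Read only the first KiB of the file and check it for the PDF header."""
    with open(path, "rb") as f:
        return looks_like_pdf(f.read(1024))
```

In this notebook you could call `file_looks_like_pdf(file_path)` right after the download and stop early if it returns `False`.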

### API Key 🔑

In [None]:
# Set the API key in the 🔑 (Secrets) pane of Google Colab
from google.colab import userdata
api_key = userdata.get('CONTEXTUAL_API_KEY')
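
The cell above only works inside Colab. If you run the notebook elsewhere, one option is to fall back to an environment variable. This is a sketch under that assumption: the helper name `load_api_key` is hypothetical, and reading the key from an environment variable called `CONTEXTUAL_API_KEY` is a convention chosen here, not something the API requires.

```python
import os


def load_api_key(name="CONTEXTUAL_API_KEY", environ=None):
    """Return the API key from Colab secrets if available, else from the environment."""
    environ = os.environ if environ is None else environ
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # Secret missing or notebook access not granted; fall through to the environment
            key = None
        if key:
            return key

    key = environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found; set {name} as a Colab secret or environment variable")
    return key
```

With this in place, `api_key = load_api_key()` behaves the same way in Colab and in a local Jupyter session.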

# REST API implementation

You can use our API directly with the `requests` package. See [docs.contextual.ai](https://docs.contextual.ai/api-reference/parse/parse-file) for details.

In [None]:
import requests
import json

base_url = "https://api.contextual.ai/v1"

headers = {
    "accept": "application/json",
    "authorization": f"Bearer {api_key}"
}

### Submit Parse Job

Next, we'll define the configuration for our parse job and submit it. This initiates an async parse job and returns a `job_id` we can use to monitor progress.

In [None]:
url = f"{base_url}/parse"

config = {
    "parse_mode": "standard",
    "figure_caption_mode": "concise",
    "enable_document_hierarchy": True,
    "page_range": "0-5",
    "enable_split_tables": False,
    "max_split_table_cells": 30,
}

with open(file_path, "rb") as fp:
    file = {"raw_file": fp}
    result = requests.post(url, headers=headers, data=config, files=file)

# Surface HTTP errors (e.g. an invalid API key) before reading the body
result.raise_for_status()
response = result.json()

job_id = response['job_id']
job_id

In [None]:
response

### Monitor Job Status

Using the `job_id` from above, we can monitor the status of our parse job.

In [None]:
# Check on parse job status
from time import sleep

url = f"{base_url}/parse/jobs/{job_id}/status"

while True:
    result = requests.get(url, headers=headers)
    parse_response = json.loads(result.text)['status']
    print(f"Job is {parse_response}")
    if parse_response == "completed":
        break
    sleep(30)
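
The loop above exits only when the status is `completed`, so it would keep polling if a job ended any other way. Below is a hedged sketch of a reusable polling helper with a timeout. The helper name `wait_for_job` is hypothetical, and the failure status names `failed` and `cancelled` are assumptions that should be checked against the API reference; only `completed` appears in this notebook.

```python
import time

# Assumed names for terminal failure states; verify against the API reference
TERMINAL_FAILURES = {"failed", "cancelled"}


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=1800,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "completed", a failure state, or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        print(f"Job is {status}")
        if status == "completed":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Parse job ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"Parse job still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

With the REST setup above, this could be called as `wait_for_job(lambda: requests.get(url, headers=headers).json()["status"])`. Passing the status lookup in as a callable means the same helper also works with the SDK.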

### List all jobs

If we have submitted multiple jobs and want to see the status of each, we can use the list jobs API:

In [None]:
url = f"{base_url}/parse/jobs"

result = requests.get(url, headers=headers)
parse_response = json.loads(result.text)
parse_response

### Get Parse results

In [None]:
from pprint import pprint

url = f"{base_url}/parse/jobs/{job_id}/results"

output_types = ["markdown-per-page"]

result = requests.get(
    url,
    headers=headers,
    params={"output_types": ",".join(output_types)},
)

# Decode the JSON body into a dict and pretty-print it
result = json.loads(result.text)
pprint(result)

### View Document Hierarchy

In [None]:
from IPython.display import display, Markdown

display(Markdown(result['table_of_contents']['markdown']))

### Display 1st Page

In [None]:
from IPython.display import display, Markdown

display(Markdown(result['pages'][0]['markdown']))
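
Beyond displaying a page inline, you may want to keep the parsed markdown on disk. The sketch below writes one file per page, assuming only the `result['pages'][i]['markdown']` shape used above. The helper name `save_pages` and the `page-000.md` naming scheme are hypothetical choices.

```python
from pathlib import Path


def save_pages(result: dict, out_dir: str) -> list:
    """Write each page's markdown to out_dir/page-NNN.md and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, page in enumerate(result.get("pages", [])):
        path = out / f"page-{i:03d}.md"
        # Pages without a markdown field are written as empty files
        path.write_text(page.get("markdown", ""), encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `save_pages(result, "parsed")` creates a `parsed/` folder next to the notebook.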

# Contextual SDK

In [None]:
try:
    from contextual import ContextualAI
except ImportError:
    # Install the SDK on first run, then retry the import
    %pip install --upgrade --quiet contextual-client
    from contextual import ContextualAI

# Set up the Contextual Python SDK client
client = ContextualAI(api_key=api_key)

### Submit Parse Job

In [None]:
with open(file_path, "rb") as fp:
    response = client.parse.create(
        raw_file=fp,
        parse_mode="standard",
        figure_caption_mode="concise",
        enable_document_hierarchy=True,
        page_range="0-5",
    )

job_id = response.job_id
job_id

### Monitor Job Status

In [None]:
# Check on parse job status
from time import sleep


while True:
    result = client.parse.job_status(job_id)
    parse_response = result.status
    print(f"Job is {parse_response}")
    if parse_response == "completed":
        break
    sleep(30)

### Get Job Results

In [None]:
results = client.parse.job_results(job_id, output_types=['markdown-per-page'])

results

## Parse UI

To navigate to the UI, run the following cell:

In [None]:
tenant = "your-tenant-name"
print(f"https://app.contextual.ai/{tenant}/components/parse?job={job_id}")

![](parse-ui.png)

## Document Hierarchy (Table of Contents)

To see the document hierarchy, otherwise known as the table of contents, you can run:

In [None]:
from IPython.display import display, Markdown

display(Markdown(results.table_of_contents.markdown))

## Output Types

You can set the desired output format(s) of the parsed file using the `output_types` parameter. Valid values are `markdown-document`, `markdown-per-page`, and `blocks-per-page`; specify multiple values to get multiple formats in the response:
* `markdown-document` parses the whole document into a single concatenated markdown output.
* `markdown-per-page` provides markdown output per page.
* `blocks-per-page` provides a structured JSON representation of the content blocks on each page, sorted by reading order.
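
When calling the REST endpoint, the output types are sent as a comma-joined query parameter, as in the results request earlier. As a small sketch, the values can be validated locally against the three names listed above before any request is made. The helper name `output_types_param` is hypothetical.

```python
# The three values documented above
ALLOWED_OUTPUT_TYPES = ("markdown-document", "markdown-per-page", "blocks-per-page")


def output_types_param(output_types):
    """Validate the requested output types and join them into one query-parameter string."""
    unknown = [t for t in output_types if t not in ALLOWED_OUTPUT_TYPES]
    if unknown:
        raise ValueError(f"Unsupported output type(s): {unknown}")
    # dict.fromkeys drops duplicates while keeping the original order
    return ",".join(dict.fromkeys(output_types))
```

The result can be passed as `params={"output_types": output_types_param([...])}` so that a typo is caught before the network round trip.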

### Display Markdown-per-page

In [None]:
results = client.parse.job_results(job_id, output_types=['markdown-per-page'])

for page in results.pages:
    display(Markdown(page.markdown))

### Blocks per page

In [None]:
results = client.parse.job_results(job_id, output_types=['blocks-per-page'])

for page in results.pages:
    for block in page.blocks:
        display(Markdown(block.markdown))

### Markdown-document

This returns the text of the whole document in a single field, `markdown_document`.

In [None]:
result = client.parse.job_results(job_id, output_types=['markdown-document'])

display(Markdown(result.markdown_document))

## Document Hierarchy

LLMs work best when they're given as much information about the document's hierarchy as possible. That's why we've built the parse API to be context-aware: each block can carry metadata such as which section the text comes from.

To use this, we'll set `output_types` to `blocks-per-page` and read each block's `parent_ids` field to find its section.

In [None]:
result = client.parse.job_results(job_id, output_types=['blocks-per-page'])

hash_table = {}

for page in result.pages:
  for block in page.blocks:
    hash_table[block.id] = block.markdown

for page in result.pages:
  for block in page.blocks:
    if block.parent_ids:
      parent_content = "\n".join([hash_table[parent_id] for parent_id in block.parent_ids])
      display(Markdown(f"Text\n======\n {block.markdown} \n\n Section:\n======\n{parent_content}"))
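
For a retrieval pipeline you will usually want these pairs as data rather than as rendered output. The sketch below collects each block together with its section context into plain dictionaries, using only the `id`, `markdown`, and `parent_ids` attributes shown above. The function name `blocks_with_sections` and the record keys are hypothetical.

```python
def blocks_with_sections(pages):
    """Return one record per block, pairing its text with the markdown of its parent sections."""
    # First pass: index every block's markdown by its id
    by_id = {block.id: block.markdown for page in pages for block in page.blocks}

    records = []
    for page in pages:
        for block in page.blocks:
            # Skip parent ids that are not among the returned blocks (e.g. outside page_range)
            parents = [by_id[pid] for pid in (block.parent_ids or []) if pid in by_id]
            records.append({
                "id": block.id,
                "section": "\n".join(parents),
                "text": block.markdown,
            })
    return records
```

Calling `blocks_with_sections(result.pages)` yields a list that can be embedded or indexed directly, with the section text available as a prefix for each chunk.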

## Table Extraction

Large tables sometimes need to be split into smaller chunks before they're passed to an LLM, with the table header preserved in each chunk. To do that, we'll use the `enable_split_tables` and `max_split_table_cells` parameters like so:

In [None]:
url = 'https://raw.githubusercontent.com/ContextualAI/examples/refs/heads/main/03-standalone-api/04-parse/data/omnidocbench-text.pdf'

# Download doc
file_path = "omnidocbench-text_pdf.pdf"
with open(file_path, "wb") as f:
    f.write(requests.get(url).content)

In [None]:
# Parse the file downloaded in the previous cell (file_path) with table splitting enabled
with open(file_path, "rb") as fp:
    response = client.parse.create(
        raw_file=fp,
        parse_mode="standard",
        enable_split_tables=True,
        max_split_table_cells=50,
    )

job_id = response.job_id
job_id

In [None]:
client.parse.job_status(job_id)

In [None]:
result = client.parse.job_results(job_id, output_types=['markdown-per-page'])

for page in result.pages:
  display(Markdown(page.markdown))
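
To make the idea of header-preserving splits concrete, here is a purely local illustration that splits a markdown table into chunks and repeats the header row in each one. It is not the parser's algorithm: the function `split_markdown_table` is hypothetical, and how the service counts cells against `max_split_table_cells` (for instance, whether header cells count) is an assumption made only for this sketch.

```python
def split_markdown_table(table_md: str, max_cells: int) -> list:
    """Split a markdown table into chunks of roughly max_cells body cells, repeating the header."""
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    if len(lines) < 3:
        # Not enough lines for header + separator + at least one row; nothing to split
        return [table_md.strip()]

    header, separator, rows = lines[0], lines[1], lines[2:]
    n_cols = len(header.strip().strip("|").split("|"))
    # Assumption for this sketch: only body cells count toward the limit
    rows_per_chunk = max(1, max_cells // n_cols)

    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        body = rows[start:start + rows_per_chunk]
        chunks.append("\n".join([header, separator, *body]))
    return chunks
```

Each returned chunk is a valid markdown table on its own, which is the property that matters when chunks are retrieved independently.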
