# Document Processing

This notebook shows how to convert PDF documents into text (e.g. Markdown, HTML or JSON) and images using the SAIA API (see [README.md](./README.md)).

#### Prerequisites
1. To install the required packages, uncomment the first line of the next cell.
1. Add your API key to a `.env` file in the root directory (see the [.env.example](./.env.example) file).

In [1]:
# !pip install openai python-dotenv requests

In [2]:
import os
from dotenv import load_dotenv
load_dotenv()

True
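
`load_dotenv()` returns `True` when a `.env` file was found, but that does not guarantee the key itself is defined. The sketch below fails early with a readable message if the variable is missing. The helper name `require_env` is hypothetical and not part of `python-dotenv` or the SAIA API.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

A call such as `api_key = require_env("API_KEY")` could then replace the plain `os.getenv("API_KEY")` lookup used later in this notebook.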

SAIA provides [Docling](https://docling-project.github.io/docling/) as a service through its API at this endpoint:
```bash
https://chat-ai.academiccloud.de/v1/documents
```

A minimal example is:



In [None]:
import requests
import json

# API configuration
api_key = os.getenv("API_KEY")
base_url = "https://chat-ai.academiccloud.de/v1"

# Path to your document
file_path = "./test-document.pdf"

# Convert PDF to text and images; the context manager closes the file afterwards
with open(file_path, "rb") as document:
    response = requests.post(
        f"{base_url}/documents/convert",
        headers={
            "accept": "application/json",
            "Authorization": f"Bearer {api_key}"
        },
        files={"document": document}
    )

# Stop early on HTTP errors (e.g. an invalid API key)
response.raise_for_status()

# Print the full response as formatted JSON
print(json.dumps(response.json(), indent=2))

{
  "filename": "test-document",
  "images": [],
  "markdown": "i966]\n\nCOUNTEREXAMPLE  TO EULER'S  CONJECTURE\n\n2.  F.  P.  Ramsey, On  a  problem  of formal logic, Proc.  London  Math.  Soc.  (2) 30  (1930),  264-286.\n\nDARTMOUTH  COLLEGE\n\n## COUNTEREXAMPLE  TO  EULER'S  CONJECTURE ON  SUMS  OF  LIKE  POWERS\n\nBY  L.  J.  LANDER  AND  T.  R.  PARKIN\n\nCommunicated  by J.  D. Swift,  June 27, 1966\n\nA direct search on the  CDC  6600 yielded\n\n27 5 +  84 5 +  HO 5 +  133 6 - 144 5\n\nas  the  smallest  instance  in  which  four  fifth  powers  sum  to  a  fifth power. This is a counterexample to a conjecture  by Euler  [l] that  at least n nth powers are required to sum to an nth power, n>2.\n\n## REFERENCE\n\n1.  L.  E.  Dickson, History of the  theory  of numbers, Vol.  2,  Chelsea,  New York, 1952, p. 648."
}


To extract only the "markdown" field from the response, you can use: `print(response.json().get("markdown", ""))`
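
To keep the converted text, the `markdown` field can be written to a file. This is a minimal sketch that assumes the response has the shape shown above. The helper name `save_markdown` and the output path are hypothetical, not part of the SAIA API.

```python
from pathlib import Path

def save_markdown(result: dict, out_path: str) -> Path:
    """Write the "markdown" field of a conversion result to a UTF-8 text file."""
    path = Path(out_path)
    # Fall back to an empty string if the field is missing
    path.write_text(result.get("markdown", ""), encoding="utf-8")
    return path
```

With the response from the cell above, this could be called as `save_markdown(response.json(), "test-document.md")`.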

You can use advanced settings in your request by adding query parameters:

| Parameter | Values | Description |
|-|-|-|
| `response_type` | `markdown`, `html`, `json`, `tokens` | The output file type |
| `extract_tables_as_images` | `true`, `false` | Whether tables should be returned as images |
| `image_resolution_scale` | `1`, ..., `4` | Scaling factor for image resolution |

For example, in order to extract tables as images, scale image resolution by 4, and convert to HTML, you can call:

```python
f"{base_url}/documents/convert?response_type=html&extract_tables_as_images=true&image_resolution_scale=4"
```
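
Writing the query string by hand makes it easy to mistype a value. As an alternative, the URL can be assembled from keyword arguments with the standard library. This is only a sketch: `build_convert_url` is a hypothetical helper, and the parameter names are taken from the table above.

```python
from urllib.parse import urlencode

def build_convert_url(base_url: str, **options) -> str:
    """Build the convert URL; booleans become lowercase 'true'/'false'."""
    query = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in options.items()
    }
    url = f"{base_url}/documents/convert"
    return f"{url}?{urlencode(query)}" if query else url

# HTML output, tables as images, image resolution scaled by 4
url = build_convert_url(
    "https://chat-ai.academiccloud.de/v1",
    response_type="html",
    extract_tables_as_images=True,
    image_resolution_scale=4,
)
print(url)
```

The resulting `url` can be passed as the first argument to `requests.post` in place of the hand-written f-string.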