In [1]:
import json
from pathlib import Path
import deepsearch as ds

# IBM Deep Search Document Conversion

## Getting started

The [Deep Search Toolkit](https://ds4sd.github.io/deepsearch-toolkit/) lets you convert documents with the few lines of code below. It's that simple! For more information or a step-by-step guide:
- Visit https://ds4sd.github.io/deepsearch-toolkit/guide/convert_doc/
- Follow this example notebook

⚠️ Before running this notebook, generate the file `ds-auth.json` via
```shell
deepsearch login --output ../ds-auth.json
```
More details in the [docs](https://ds4sd.github.io/deepsearch-toolkit/getting_started/#authentication).

In [2]:
host = "https://deepsearch-experience.res.ibm.com"
proj = "1234567890abcdefghijklmnopqrstvwyz123456"

# This file can be generated via `deepsearch login --output ../ds-auth.json`,
# or see the example ../ds-auth.json.example
config_file = Path("../ds-auth.json")

config = ds.DeepSearchConfig.parse_file(config_file)
client = ds.CpsApiClient(config)
api = ds.CpsApi(client)

documents = ds.convert_documents(
    api=api,
    proj_key=proj,
    source_path="../data/samples/2206.01062.pdf",
    progress_bar=True
)
documents.download_all(result_dir="./converted_docs")
info = documents.generate_report(result_dir="./converted_docs")
print(info)

Processing input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00, 53.47it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [01:17<00:00, 77.74s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [01:06<00:00, 66.89s/it]


{'Total documents': 1, 'Successfully converted documents': 1}


---

# There's more! 

The Deep Search Toolkit provides utility functions that convert documents from different types of input:

- From a single URL.
- From a list of URLs. In this case, the toolkit launches a batch process covering all tasks.
- From a local PDF file.
- From a local zip archive containing PDF files.
- From a local folder containing PDF files. In this case, the toolkit packages the files into batches and creates multiple zip archives.
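To make the five input kinds above concrete, here is a minimal sketch of how an input could be told apart before conversion. The helper `classify_source` and its labels are hypothetical and not part of the toolkit; the toolkit performs its own checks internally.

```python
from pathlib import Path
from typing import List, Union
from urllib.parse import urlparse


def classify_source(source: Union[str, Path, List[str]]) -> str:
    """Hypothetical helper: label an input according to the list above."""
    if isinstance(source, list):
        return "url_list"
    text = str(source)
    # Anything with an http(s) scheme is treated as a URL.
    if urlparse(text).scheme in ("http", "https"):
        return "single_url"
    path = Path(text)
    if path.is_dir():
        return "local_folder"
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "local_pdf"
    if suffix == ".zip":
        return "local_zip"
    return "unknown"
```

In this notebook, URL inputs go to the `urls` argument of `ds.convert_documents`, and local inputs go to `source_path`.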


---

# Let's explore document conversion

## Authentication via stored credentials

In this example, we initialize the Deep Search client from the credentials
stored in the file `../ds-auth.json`. This file can be generated with:

```shell
deepsearch login --output ../ds-auth.json
```

More details in the [docs](https://ds4sd.github.io/deepsearch-toolkit/getting_started/#authentication).
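Before handing the file to `ds.DeepSearchConfig.parse_file`, it can help to confirm that it exists and contains valid JSON. The sketch below uses only the standard library; `load_auth_file` is a hypothetical helper, and it makes no assumption about which keys the toolkit expects.

```python
import json
from pathlib import Path


def load_auth_file(path: Path) -> dict:
    """Hypothetical pre-flight check: return the parsed JSON object or raise."""
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; generate it with `deepsearch login --output {path}`"
        )
    with path.open(encoding="utf-8") as f:
        data = json.load(f)  # raises json.JSONDecodeError on malformed content
    if not isinstance(data, dict):
        raise ValueError(f"{path} should contain a JSON object")
    return data
```

A call such as `load_auth_file(Path("../ds-auth.json"))` fails early with a readable message when the login step was skipped.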

In [3]:
config_file = Path("../ds-auth.json")

config = ds.DeepSearchConfig.parse_file(config_file)

client = ds.CpsApiClient(config)
api = ds.CpsApi(client)

In [4]:
PROJ_KEY = "1234567890abcdefghijklmnopqrstvwyz123456"

## Single URL

In [5]:
documents = ds.convert_documents(
    api=api,
    proj_key=PROJ_KEY,
    urls="https://arxiv.org/pdf/2206.00785.pdf",
    progress_bar=True
)

Submitting input:     : 100%|██████████████████████████████| 1/1 [00:01<00:00,  1.94s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:26<00:00, 26.88s/it]


In [6]:
# Let's check what happened:
# generate a CSV report about the conversion task and store it locally.
result_dir = './converted_docs/'
info = documents.generate_report(result_dir=result_dir)
print(info)

{'Total documents': 1, 'Successfully converted documents': 1}


The saved report can help with debugging and analysing the conversion task.

In [7]:
# Let's download all the converted documents:
documents.download_all(result_dir=result_dir, progress_bar=True)

Downloading result:   : 100%|██████████████████████████████| 1/1 [00:01<00:00,  1.89s/it]
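To see what the download step produced, one option is to list the contents of the result directory. The sketch below assumes the results are stored as zip archives directly inside `result_dir`; that layout is an assumption not confirmed by this notebook, and `list_archive_members` is a hypothetical helper.

```python
import zipfile
from pathlib import Path
from typing import Dict, List


def list_archive_members(result_dir: Path) -> Dict[str, List[str]]:
    """Map each zip archive in `result_dir` to the file names it contains."""
    members: Dict[str, List[str]] = {}
    for archive in sorted(result_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            members[archive.name] = zf.namelist()
    return members
```

Calling `list_archive_members(Path("./converted_docs"))` returns an empty dictionary if the directory holds no zip archives, which is itself a useful hint that the assumed layout does not apply.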


In [8]:
# The documents object also stores additional information, such as statuses and task IDs:
documents.statuses, documents.task_ids

(['SUCCESS'], ['1bf1524d-c4d9-40ab-8f8f-505434e5f69e'])
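For larger batches, the two parallel lists above are easier to read when summarized. A minimal sketch, assuming only that a successful task is reported as `'SUCCESS'` as in the output above; `summarize_tasks` is a hypothetical helper that works on plain lists.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize_tasks(statuses: Sequence[str], task_ids: Sequence[str]) -> Dict[str, object]:
    """Count tasks per status and collect the IDs of tasks that did not succeed."""
    counts = Counter(statuses)
    not_successful = [
        task_id for status, task_id in zip(statuses, task_ids) if status != "SUCCESS"
    ]
    return {"counts": dict(counts), "not_successful": not_successful}
```

With the objects from this notebook, the call would be `summarize_tasks(documents.statuses, documents.task_ids)`.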

## Multiple URLs

In [9]:
# let's create a list of urls we want to convert:
urls = ["https://arxiv.org/pdf/2206.00785.pdf", "https://arxiv.org/pdf/2206.01062.pdf"]

In [10]:
# Process multiple urls
documents = ds.convert_documents(
    api=api, 
    proj_key=PROJ_KEY, 
    urls=urls,
    progress_bar=True
)

Submitting input:     : 100%|██████████████████████████████| 2/2 [00:01<00:00,  1.31it/s]
Converting input:     : 100%|██████████████████████████████| 2/2 [00:59<00:00, 29.87s/it]


In [11]:
# As before, we can use the documents object to download all JSON results at once.
# We can also iterate over the documents and download them individually.
for doc in documents:
    doc.download(result_dir=result_dir, progress_bar=True)

Downloading result:   : 100%|██████████████████████████████| 1/1 [00:02<00:00,  2.03s/it]
Downloading result:   : 100%|██████████████████████████████| 1/1 [00:02<00:00,  2.07s/it]


## Process local file

In [12]:
documents = ds.convert_documents(
    api=api, 
    proj_key=PROJ_KEY, 
    source_path="../data/samples/2206.01062.pdf", 
    progress_bar=True
)

Processing input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00, 26.95it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:03<00:00,  3.92s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [01:01<00:00, 61.99s/it]


## Process folder of files

In [13]:
documents = ds.convert_documents(
    api=api, 
    proj_key=PROJ_KEY, 
    source_path="../data/samples", 
    progress_bar=True
)

Processing input:     : 100%|██████████████████████████████| 2/2 [00:00<00:00, 43.12it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:04<00:00,  4.57s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:59<00:00, 59.14s/it]
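The output above shows two files processed but only one submission, which matches the batching described earlier: files from a folder are packed into zip archives before upload. The sketch below illustrates that idea with the standard library only. It is not the toolkit's implementation; `zip_pdfs_in_batches`, the archive naming, and the default batch size are all assumptions made for illustration.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_pdfs_in_batches(folder: Path, out_dir: Path, batch_size: int = 20) -> List[Path]:
    """Pack the PDFs found directly in `folder` into zip archives of at most `batch_size` files."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Collect PDFs in a stable order so batches are reproducible.
    pdfs = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    archives: List[Path] = []
    for start in range(0, len(pdfs), batch_size):
        archive = out_dir / f"batch_{start // batch_size:04d}.zip"
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for pdf in pdfs[start:start + batch_size]:
                # Store only the file name so the archive has a flat layout.
                zf.write(pdf, arcname=pdf.name)
        archives.append(archive)
    return archives
```

With two sample PDFs and a batch size of two or more, this produces a single archive, consistent with the single submission shown above.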


In [14]:
info = documents.generate_report(result_dir=result_dir)
print(info)

{'Total documents': 2, 'Successfully converted documents': 2}


In [15]:
# Let's download all the converted documents:
documents.download_all(result_dir=result_dir, progress_bar=True)

Downloading result:   : 100%|██████████████████████████████| 2/2 [00:03<00:00,  1.95s/it]


---

# What's next?

Explore the other examples, which demonstrate possible use cases of document conversion:

- Visualize the text bounding boxes
- Extract figures
- Convert document to epub for your e-reader
