# Document Conversion - Quick start

## Getting started

The [Deep Search Toolkit](https://ds4sd.github.io/deepsearch-toolkit/) converts documents with the few lines of code below. It's that simple! For more information or a step-by-step guide:
- Visit https://ds4sd.github.io/deepsearch-toolkit/guide/convert_doc/
- Follow this example notebook

### Set notebook parameters

In [1]:
from dsnotebooks.settings import ProjectNotebookSettings

# notebook settings auto-loaded from .env / env vars
notebook_settings = ProjectNotebookSettings()

PROFILE_NAME = notebook_settings.profile  # the profile to use
PROJ_KEY = notebook_settings.proj_key     # the project to use

### Import example dependencies

In [2]:
import deepsearch as ds

### Connect to Deep Search

In [3]:
api = ds.CpsApi.from_env(profile_name=PROFILE_NAME)

In [4]:
documents = ds.convert_documents(
    api=api,
    proj_key=PROJ_KEY,
    source_path="../../data/samples/2206.01062.pdf",
    progress_bar=True
)
documents.download_all(result_dir="./converted_docs")
info = documents.generate_report(result_dir="./converted_docs")
print(info)

Processing input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00, 45.80it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:17<00:00, 17.23s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:44<00:00, 44.85s/it]


{'Total documents': 1, 'Successfully converted documents': 1}


---

# There's more! 

The Deep Search Toolkit provides utility functions that convert documents from different types of input:

- From a single URL.
- From a list of URLs. In this case, the toolkit launches a batch process covering all tasks.
- From a local PDF file.
- From a local zip archive containing PDF files.
- From a local folder containing PDF files. In this case, the toolkit packages the files into batches and creates multiple zip archives.
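
In the cells of this notebook, URL inputs are passed through the `urls` argument and local inputs through `source_path`. The sketch below shows one way to choose between the two for an arbitrary input. `build_convert_kwargs` is a hypothetical helper written for illustration, not part of the toolkit, and it assumes that any list is a list of URLs.

```python
from typing import Dict, List, Union
from urllib.parse import urlparse


def build_convert_kwargs(source: Union[str, List[str]]) -> Dict[str, object]:
    """Map an input to the `urls` or `source_path` keyword (hypothetical helper)."""
    # Assumption: a list is always a batch of URLs.
    if isinstance(source, list):
        return {"urls": source}
    # A string with an http(s) scheme is treated as a single URL.
    if urlparse(source).scheme in ("http", "https"):
        return {"urls": source}
    # Anything else is treated as a local path: a PDF, a zip archive, or a folder.
    return {"source_path": source}
```

The result can then be unpacked into the call used throughout this notebook, for example `ds.convert_documents(api=api, proj_key=PROJ_KEY, progress_bar=True, **build_convert_kwargs(source))`.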


---

# Let's explore document conversion

## Single URL

In [5]:
documents = ds.convert_documents(api=api, 
                                 proj_key=PROJ_KEY, 
                                 urls="https://arxiv.org/pdf/2206.00785.pdf", 
                                 progress_bar=True)

Submitting input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00,  1.57it/s]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:18<00:00, 18.78s/it]


In [6]:
# let's check what happened:
# we generate a CSV report about the conversion task and store it locally
result_dir = './converted_docs/'
info = documents.generate_report(result_dir=result_dir)
print(info)

{'Total documents': 1, 'Successfully converted documents': 1}


The saved report may help in debugging and analysing the conversion task.
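
According to the comment in the cell above, the report is a CSV file stored in `result_dir`. Its file name and column names are not stated in this notebook, so the sketch below assumes neither: it loads every `*.csv` file in the directory and uses whatever header row each file has. `read_csv_reports` is a hypothetical helper, not a toolkit function.

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_csv_reports(result_dir: str) -> Dict[str, List[Dict[str, str]]]:
    """Load every CSV file found directly in result_dir, keyed by file name."""
    reports: Dict[str, List[Dict[str, str]]] = {}
    for csv_path in sorted(Path(result_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # DictReader takes its keys from the header row,
            # so no column names are assumed here.
            reports[csv_path.name] = list(csv.DictReader(handle))
    return reports
```

Calling `read_csv_reports(result_dir)` after `generate_report` should return one entry per report file, each a list of row dictionaries that can be filtered or printed while debugging.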

In [7]:
# let's download all the converted documents:
documents.download_all(result_dir=result_dir, progress_bar=True)

Downloading result:   : 100%|██████████████████████████████| 1/1 [00:00<00:00,  1.08it/s]


In [8]:
# the documents object stores some additional info like:
documents.statuses, documents.task_ids

(['SUCCESS'], ['ae6d4337-abf7-4c2f-8943-a9baf547b91a'])
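
The output above suggests that `documents.statuses` and `documents.task_ids` are lists of equal length with one entry per task. Under that assumption, a small summary can point to the tasks worth investigating. `summarize_statuses` is a hypothetical helper, and `'SUCCESS'` is the only status value shown in this notebook; any other value is simply treated as "not successful".

```python
from collections import Counter
from typing import Dict, Sequence


def summarize_statuses(statuses: Sequence[str], task_ids: Sequence[str]) -> Dict[str, object]:
    """Count statuses and list the task ids whose status is not 'SUCCESS'."""
    counts = Counter(statuses)
    # Assumption: statuses[i] belongs to task_ids[i].
    not_successful = [
        task_id for status, task_id in zip(statuses, task_ids) if status != "SUCCESS"
    ]
    return {"counts": dict(counts), "not_successful": not_successful}
```

With the objects from the cell above, this would be called as `summarize_statuses(documents.statuses, documents.task_ids)`.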

## Multiple URLs

In [9]:
# let's create a list of urls we want to convert:
urls = ["https://arxiv.org/pdf/2206.00785.pdf", "https://arxiv.org/pdf/2206.01062.pdf"]

In [10]:
# Process multiple urls
documents = ds.convert_documents(
    api=api, 
    proj_key=PROJ_KEY, 
    urls=urls,
    progress_bar=True
)

Submitting input:     : 100%|██████████████████████████████| 2/2 [00:01<00:00,  1.98it/s]
Converting input:     : 100%|██████████████████████████████| 2/2 [00:50<00:00, 25.42s/it]


In [11]:
# as before, we can use the documents object to download all JSON files. We can also iterate over them individually.
for doc in documents:
    doc.download(result_dir=result_dir, progress_bar=True)

Downloading result:   : 100%|██████████████████████████████| 1/1 [00:00<00:00,  1.11it/s]
Downloading result:   : 100%|██████████████████████████████| 1/1 [00:00<00:00,  1.25it/s]


## Process local file

In [12]:
documents = ds.convert_documents(
    api=api, 
    proj_key=PROJ_KEY, 
    source_path="../../data/samples/2206.01062.pdf", 
    progress_bar=True
)

Processing input:     : 100%|██████████████████████████████| 1/1 [00:00<00:00, 44.58it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:04<00:00,  4.07s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:33<00:00, 33.33s/it]


## Process folder of files

In [13]:
documents = ds.convert_documents(
    api=api, 
    proj_key=PROJ_KEY, 
    source_path="../../data/samples", 
    progress_bar=True
)

Processing input:     : 100%|██████████████████████████████| 2/2 [00:00<00:00, 52.06it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:12<00:00, 12.53s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:35<00:00, 35.31s/it]
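
In the output above, two files are processed but only one input is submitted, which is consistent with the folder's PDFs being packaged into a zip archive before upload. The following is a minimal sketch of that kind of batching, not the toolkit's actual implementation. `zip_pdfs_in_batches`, the `batch_size` default, and the archive naming scheme are all assumptions made for illustration.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_pdfs_in_batches(folder: str, out_dir: str, batch_size: int = 20) -> List[Path]:
    """Pack the PDFs in folder into zip archives of at most batch_size files each."""
    pdfs = sorted(Path(folder).glob("*.pdf"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archives: List[Path] = []
    for start in range(0, len(pdfs), batch_size):
        # Hypothetical naming scheme: batch_0000.zip, batch_0001.zip, ...
        archive_path = out / f"batch_{start // batch_size:04d}.zip"
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for pdf in pdfs[start:start + batch_size]:
                # Store only the file name so the archive has a flat layout.
                archive.write(pdf, arcname=pdf.name)
        archives.append(archive_path)
    return archives
```

Each resulting archive could then be passed as `source_path`, since a local zip archive containing PDF files is one of the supported inputs.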


In [14]:
info = documents.generate_report(result_dir=result_dir)
print(info)

{'Total documents': 2, 'Successfully converted documents': 2}


In [15]:
# let's download all the converted documents:
documents.download_all(result_dir=result_dir, progress_bar=True)

Downloading result:   : 100%|██████████████████████████████| 2/2 [00:01<00:00,  1.01it/s]


---

# What's next?

Explore the other examples, which demonstrate possible use cases of document conversion:

- Visualize the text bounding boxes
- Extract figures
- Convert a document to EPUB for your e-reader
