# Export your data

This example shows how to export the data in a project data index to JSON. We first
create a demo project and a project data index, then upload some documents for Deep
Search to convert. Finally, we export the resulting structured data as JSON files.

### Access required

The content of this notebook requires access to Deep Search capabilities which are not
available on the public access system.

[Contact us](https://ds4sd.github.io/#unlimited-access) if you are interested in exploring
these Deep Search capabilities.

### Authentication via stored credentials

In this example, we initialize the Deep Search client from the credentials
contained in the file `../../ds-auth.ext-v2.json`. This file can be generated with:

```shell
deepsearch login --host https://deepsearch-ext-v2-535206b87b82b5365d9d6671fbc19165-0000.us-south.containers.appdomain.cloud/ --output ../../ds-auth.ext-v2.json
```

The extra `--host` argument is required in this example to target the limited-access instance.

See the [docs](https://ds4sd.github.io/deepsearch-toolkit/getting_started/#authentication) for more details.
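
Before initializing the client, it can be useful to confirm that the credentials file exists and is well-formed JSON. The following is a minimal sketch using only the standard library. The helper name `check_auth_file` is hypothetical and not part of the deepsearch-toolkit, and the only assumption it makes about the file is that it holds a JSON object.

```python
import json
from pathlib import Path


def check_auth_file(path):
    """Return the parsed credentials file, or raise a descriptive error.

    Hypothetical helper for illustration; not part of deepsearch-toolkit.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; run `deepsearch login` with --output first"
        )
    with open(path, encoding="utf-8") as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError as exc:
            raise ValueError(f"{path} is not valid JSON: {exc}") from exc
    # Assumption: the credentials file holds a JSON object, not a list or scalar
    if not isinstance(data, dict):
        raise ValueError(f"{path} should contain a JSON object")
    return data
```

In this notebook it could be called as `check_auth_file("../../ds-auth.ext-v2.json")` before the client is created.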

### Notebook parameters

The following block defines the parameters used to execute the notebook:

- `CONFIG_FILE`: location of the Deep Search configuration file
- `PROJ_NAME`: the name for the project to create
- `INDEX_NAME`: the name for the data index to create
- `INPUT_FILES_FOLDER`: the folder containing the files to upload
- `CLEANUP`: if `True`, delete the created resources at the end of the execution

In [1]:
# Input parameters for the example flow
from pathlib import Path
import datetime
CONFIG_FILE = Path("../../ds-auth.ext-v2.json")

timestamp = datetime.datetime.now()
PROJ_NAME = f"Example project of {timestamp.strftime('%Y%m%d')}"
INDEX_NAME = f"Example docs upload {timestamp.strftime('%Y%m%d %H%M%S')}"
INPUT_FILES_FOLDER = Path("../../data/samples/")

CLEANUP = True
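
Before creating any resources, it can help to confirm what the input folder contains. The following is a minimal sketch with a hypothetical helper, `list_input_files`. The suffix list (`.pdf`, `.zip`) is an assumption made for illustration, not a statement of which formats Deep Search accepts.

```python
from pathlib import Path


def list_input_files(folder, suffixes=(".pdf", ".zip")):
    """Return the files directly inside `folder` with a matching suffix, sorted.

    Hypothetical helper; the default suffixes are an illustrative assumption.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

With the parameters above, `list_input_files(INPUT_FILES_FOLDER)` would return the candidate files as `Path` objects.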

### Import example dependencies

In [2]:
# Import standard dependencies
from copy import deepcopy
from tqdm.notebook import tqdm
import pandas as pd
pd.set_option('display.max_colwidth', None)
import json
import tempfile

# IPython utilities
from IPython.display import display, Markdown

# Import the deepsearch-toolkit
import deepsearch as ds
from deepsearch.cps.queries import DataQuery
from deepsearch.cps.data_indices import utils as data_indices_utils


### Connect to Deep Search

In [3]:
# Initialize the Deep Search client from the config file
config = ds.DeepSearchConfig.parse_file(CONFIG_FILE)
client = ds.CpsApiClient(config)
api = ds.CpsApi(client)

### Create data index and upload data

In [4]:
# Create example project
proj = api.projects.create(name=PROJ_NAME)
# Create a new data index in your project
data_index = api.data_indices.create(proj_key=proj.key, name=INDEX_NAME)

In [5]:
# Upload and convert documents
data_indices_utils.upload_files(api=api, coords=data_index.source, local_file=INPUT_FILES_FOLDER)

Completed successfully


In [6]:
display(Markdown(f"The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />{api.client.config.host}/projects/{data_index.source.proj_key}/data/project/{data_index.source.index_key}/"))

The data is now available. You can query it programmatically (see next section) or access it via the Deep Search UI at <br />https://deepsearch-ext-v2-535206b87b82b5365d9d6671fbc19165-0000.us-south.containers.appdomain.cloud//projects/2fc4bda3f400dda2373418e38589dc5f283cac8a/data/project/af22219e0eaec35c6d6e26d336f25c00529e04b7/

### Download your data

In [7]:
# Run query
query = DataQuery(search_query="*", source=["*"], coordinates=data_index.source)
cursor = api.queries.run_paginated_query(query)

# Using a temp dir for demo purposes; to persist instead, set output dir accordingly
temp_dir = tempfile.TemporaryDirectory()
output_dir = temp_dir.name

# Iterate through query results
all_results = []
for result_page in tqdm(cursor):
    for row in result_page.outputs["data_outputs"]:

        # Download JSON file
        file_path = Path(output_dir) / f"{row['_id']}.json"
        with open(file_path, "w") as outfile:
            json.dump(row["_source"], outfile, indent=2)

        all_results.append({
            "Title": row["_source"]["description"]["title"],
            "Path": file_path,
        })

print(f"Finished fetching all data. Total is {len(all_results)} records.")
print(f"Data downloaded in {output_dir}")

Finished fetching all data. Total is 2 records.
Data downloaded in /var/folders/nv/nc34fsqx41d5nsp6bds_yw440000kp/T/tmpnkgflki3
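
To keep the exported files beyond the notebook session, the same loop can write to a permanent directory instead of a temporary one. The sketch below factors the per-row write into a hypothetical helper, `save_rows`, assuming each row is a dict with `_id` and `_source` keys as in the loop above.

```python
import json
from pathlib import Path


def save_rows(rows, output_dir):
    """Write each row's `_source` to `<output_dir>/<_id>.json`; return the paths.

    Hypothetical helper; assumes rows shaped like the query results above.
    """
    output_dir = Path(output_dir)
    # Create the target directory (and parents) if it does not exist yet
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for row in rows:
        file_path = output_dir / f"{row['_id']}.json"
        with open(file_path, "w", encoding="utf-8") as outfile:
            json.dump(row["_source"], outfile, indent=2)
        paths.append(file_path)
    return paths
```

Inside the pagination loop this could be called as `save_rows(result_page.outputs["data_outputs"], "exports")`, where `exports` is an illustrative directory name.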


In [8]:
# Visualize the table with all results
df = pd.json_normalize(all_results)
display(df)

| | Title | Path |
|---|---|---|
| 0 | DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis | /var/folders/nv/nc34fsqx41d5nsp6bds_yw440000kp/T/tmpnkgflki3/5dfbd8c115a15fd3396b68409124cfee29fc8efac7b5c846634ff924e635e0dc.json |
| 1 | Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | /var/folders/nv/nc34fsqx41d5nsp6bds_yw440000kp/T/tmpnkgflki3/6627d1b67955c51ff1aa8858de671bb5a62ad70c77e62e0ac57c153d0078b7ea.json |


In [9]:
# Peek at the first 20 lines of a downloaded file
with open(df.iloc[0]["Path"]) as demo_file:
    for i in range(20):
        line = demo_file.readline()
        print(line.rstrip())


{
  "page-headers": [
    {
      "text": " KDD \u201922, August 14-18, 2022, Washington, DC, USA Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar",
      "type": "page-header",
      "prov": [
        {
          "bbox": [
            53.091064453125,
            723.42395,
            558.6610107421875,
            731.77294921875
          ],
          "page": 2,
          "span": [
            0,
            133
          ]
        }
      ]
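
To inspect a whole file rather than its first lines, the exported JSON can be loaded back and summarized. The following is a minimal sketch with a hypothetical helper, `summarize_document`. Apart from `page-headers`, which appears in the output above, it makes no assumption about which top-level keys a document contains.

```python
import json


def summarize_document(path):
    """Map each top-level key to its item count (list/dict) or its type name.

    Hypothetical helper; assumes the file holds a JSON object at the top level.
    """
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    summary = {}
    for key, value in doc.items():
        if isinstance(value, (list, dict)):
            summary[key] = len(value)
        else:
            summary[key] = type(value).__name__
    return summary
```

For the first downloaded file, this could be called as `summarize_document(df.iloc[0]["Path"])`.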


### Cleanup

Delete all resources created in this example, provided the `CLEANUP` option is enabled.

In [10]:
if CLEANUP:
    # Delete data index
    api.data_indices.delete(data_index.source)
    print("Project data deleted")

    # Delete project
    api.projects.remove(project=proj)
    print("Project deleted")

Project data deleted
Project deleted
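
The temporary directory created in the download step is a local resource and is not removed by the cell above. A `tempfile.TemporaryDirectory` object provides a `cleanup()` method and can also be used as a context manager. The following self-contained sketch shows the context-manager form; the file name and content are illustrative only.

```python
import json
import tempfile
from pathlib import Path

# The directory and its contents are removed when the `with` block exits
with tempfile.TemporaryDirectory() as output_dir:
    file_path = Path(output_dir) / "example.json"  # illustrative file name
    with open(file_path, "w", encoding="utf-8") as outfile:
        json.dump({"demo": True}, outfile)
    existed_inside = file_path.exists()

exists_after = file_path.exists()
print(existed_inside, exists_after)  # True False
```

In this notebook, the equivalent step would be calling `temp_dir.cleanup()` once the exported files are no longer needed.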
