# Tutorial 2: Exploring and downloading datasets from the DigitalTWINS Platform

## Introduction
The 12 LABOURS DigitalTWINS Platform’s harmonised database is organised into __Programs__ and __Projects__. For example, Exemplar Project 1 (__EP1__) is a project within the 12 LABOURS (__12L__) Program. Users can only access and download datasets from these projects once they have been granted access. See Tutorial 1 for information on how to request access and connect to the platform. This tutorial shows how to explore and download existing datasets from the DigitalTWINS Platform using its Python API.

## Definitions
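API - Application Programming Interface; a set of functions used to access the features or data of an application or service.

SDS - SPARC Dataset Structure; the standardised folder and metadata layout in which datasets are stored on the platform.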

## Learning outcomes
In this tutorial, you will learn how to:
- access the platform using its Python API.
- find datasets stored in the platform.
- download datasets in SDS format.

## Accessing the platform using its Python API
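First, we will point to the `config.ini` file that specifies the location and API access keys for your instance of the DigitalTWINS Platform. This is an INI file of the kind read by Python's built-in `configparser` module; here we only need its path, which is passed to the API in the following steps.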

Please contact your DigitalTWINS Platform maintainer to obtain your `config.ini` and `credentials.json` files, and place both files in the same directory.

In [1]:
import pathlib
# Change the path below to point to the location of your config.ini file.
#config_file = pathlib.Path(r"./path/to/config.ini")

config_file = pathlib.Path(r"./configs_12l.ini")
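Before passing the path on, it can be useful to confirm that the file exists and parses as an INI file. The following is a minimal sketch using only the standard library. The helper name `list_config_sections` and the section and key names in the throwaway demo file are hypothetical; they do not reflect the real contents of a DigitalTWINS `config.ini`.

```python
import configparser
import pathlib
import tempfile


def list_config_sections(path):
    """Return the section names found in an INI file, or raise if it is missing."""
    path = pathlib.Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Config file not found: {path}")
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser.sections()


# Demonstration with a throwaway file; section and key names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = pathlib.Path(tmp) / "demo_config.ini"
    demo_file.write_text("[example_service]\nhost = localhost\n")
    demo_sections = list_config_sections(demo_file)
    print(demo_sections)
```

In practice, the same helper could be called with `config_file` to check your own configuration before creating any API objects.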


We will use the DigitalTWINS Python API's `Querier` class to list or search for existing datasets in the platform.

In [2]:
import digitaltwins as dts

querier = dts.Querier(config_file)

### Listing programs and projects in the platform

A list of existing programs in the platform can be retrieved as follows.

In [3]:
programs = querier.get_all_programs()


A list of existing projects within a program can be retrieved as follows. The optional `program` argument can be used to only list projects in a specific program.

<span style="color:yellow"> [choose a specific program by the order on the list?]</span>

In [None]:
projects = querier.get_projects(program=programs[0])
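Selecting a program by its position in the list is fragile if the order changes. The sketch below selects a program by name instead. It assumes that the list of programs contains plain name strings such as `12L`, which is not confirmed by this tutorial; `pick_program` and the example list are hypothetical.

```python
def pick_program(programs, name):
    """Return the entry equal to name, or raise ValueError listing the options."""
    if name in programs:
        return name
    raise ValueError(f"Program {name!r} not found; available: {list(programs)}")


# Hypothetical list standing in for the result of querier.get_all_programs().
example_programs = ["12L", "another-program"]
chosen = pick_program(example_programs, "12L")
print(chosen)
```

The chosen value could then be passed as the `program` argument of `get_projects`.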

## Finding datasets
Each dataset stored in the platform has a unique identifier (ID). Datasets can be retrieved using the platform API's `Dataset` class, which provides multiple methods for accessing the metadata files in a dataset without downloading the entire dataset. For example, the `get_dataset_description` method returns a Python dictionary of the metadata elements in the SDS dataset description metadata file. The DigitalTWINS Platform's API documentation (TODO) lists all the methods available for accessing dataset metadata.

The first step in accessing a specific dataset is to find the dataset's ID. There are multiple approaches to identify the ID for a dataset of interest. These are described below.

<span style="color:yellow"> [1. Elements are used here before Tutorial 3. 2. Maybe explain what SDS and dataset_description are here? SDS data is organised in six high-level folders: primary, source, derivative, code, protocol, and docs. The dataset_description metadata file is mandatory for all SPARC datasets. It is a required file in which an investigator provides basic metadata.]</span>


### Using the data catalogue on the platform's portal
The ID of each dataset is shown alongside its entry on the data catalogue page of the 12 LABOURS DigitalTWINS Platform's portal.

<span style="color:yellow">[Dataset IDs of each dataset are listed ...] </span>

TODO add screenshot.

### Using the platform's API to list all datasets
The `get_datasets` method of the `Querier` class retrieves a list of `Dataset` objects from the platform. The ID of each dataset can then be accessed using the `Dataset` class's `get_id` method.

<span style="color:yellow">[What is UID and how is it different from dataset ID?]</span>

In [None]:
datasets = querier.get_datasets(program=all, project=all)
for dataset in datasets:
    print(dataset.get_id())

### Using the platform's API to search for datasets
The `search_datasets` method of the `Querier` class searches for datasets and returns a Python list of `Dataset` objects that match the search criteria.

Currently, the search only supports text that exactly matches the title of an existing dataset in the platform.

In [None]:
dataset_id = 'dataset-1-version-1'
datasets = querier.search_datasets(query=dataset_id)
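Because the search currently supports only exact title matches, partial or case-insensitive matching has to be done on the client side. The following is a minimal sketch of such a filter. It assumes you have already collected dataset IDs and titles into plain dictionaries; how titles are read from `Dataset` objects is not covered here, and the records and the `filter_by_title` helper are hypothetical.

```python
def filter_by_title(records, text):
    """Return the records whose title contains text, ignoring case."""
    needle = text.casefold()
    return [record for record in records if needle in record["title"].casefold()]


# Hypothetical records; real IDs and titles would come from the platform.
example_records = [
    {"id": "dataset-1-version-1", "title": "Example dataset alpha"},
    {"id": "dataset-2-version-1", "title": "Example dataset beta"},
]
matches = filter_by_title(example_records, "ALPHA")
matched_ids = [record["id"] for record in matches]
print(matched_ids)
```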

## Downloading datasets
Datasets are stored in SDS format within the platform's harmonised database. We can use the DigitalTWINS Python API's `Downloader` class to select and download a dataset in SDS format. Once downloaded, the `sparc-me` Python module can be used to explore the metadata in a dataset (see Tutorial 3).

By default, datasets are downloaded to the current working directory; however, the optional `save_dir` argument can be used to select a different download destination.

In [None]:
downloader = dts.Downloader(config_file)
downloader.download(dataset_id, save_dir='./')
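When downloading to a directory other than the current one, it may help to create the destination first. Whether the `Downloader` creates missing directories itself is not stated here, so the sketch below takes the cautious route using only `pathlib`. The helper name `prepare_save_dir` and the example folder names are hypothetical.

```python
import pathlib
import tempfile


def prepare_save_dir(path):
    """Create the directory (and any missing parents) and return its absolute Path."""
    save_dir = pathlib.Path(path).resolve()
    save_dir.mkdir(parents=True, exist_ok=True)
    return save_dir


# Demonstration inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_save_dir(pathlib.Path(tmp) / "downloads" / "datasets")
    target_exists = target.is_dir()
    print(target_exists)
```

The resulting path could then be passed to the download call, for example as `save_dir=str(target)`.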

Some datasets can be very large, so the `metadata_only` option is provided to download only the metadata files in a dataset rather than the entire dataset.

In [None]:
downloader.download(dataset_id, save_dir='./', metadata_only=True)
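After a download, a quick way to see what arrived is to check which SDS top-level folders are present. SDS organises data into six high-level folders: primary, source, derivative, code, protocol and docs. The sketch below reports which of these exist in a given directory. The helper name `present_sds_folders` is hypothetical, and where exactly a dataset is placed inside `save_dir` is an assumption to verify against your own download.

```python
import pathlib
import tempfile

# The six high-level SDS folders.
SDS_FOLDERS = ("primary", "source", "derivative", "code", "protocol", "docs")


def present_sds_folders(dataset_dir):
    """Return the SDS top-level folder names that exist in dataset_dir."""
    root = pathlib.Path(dataset_dir)
    return [name for name in SDS_FOLDERS if (root / name).is_dir()]


# Demonstration with a mock dataset directory containing two of the folders.
with tempfile.TemporaryDirectory() as tmp:
    mock_root = pathlib.Path(tmp)
    (mock_root / "primary").mkdir()
    (mock_root / "docs").mkdir()
    found = present_sds_folders(mock_root)
    print(found)
```

A metadata-only download may contain none of these folders, in which case the function returns an empty list.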

## Next steps
The next tutorial will show how to load and explore SDS datasets using the `sparc-me` Python module.