This notebook demonstrates how to make a request to the Azul [/index/projects](https://service.azul.data.humancellatlas.org/#/Index/get_index_projects__project_id_) endpoint for a single project and download all the project-level matrix files contained in the response.

The first step will be to import the modules we'll need for this notebook.

If any of these modules are not installed in your system or virtual environment, install them by running `python -m pip install {module_name}` in your terminal.

In [1]:
import requests
import os
from tqdm import tqdm

The following function downloads a file from a given URL and saves it to the given output path, showing a progress bar while it streams. It will be used to download the individual matrix files.

In [2]:
def download_file(url, output_path):
    url = url.replace('/fetch', '')  # Work around https://github.com/DataBiosphere/azul/issues/2908
    
    response = requests.get(url, stream=True)
    response.raise_for_status()
    
    total = int(response.headers.get('content-length', 0))
    print(f'Downloading to: {output_path}', flush=True)
    
    with open(output_path, 'wb') as f:
        with tqdm(total=total, unit='B', unit_scale=True, unit_divisor=1024) as bar:
            for chunk in response.iter_content(chunk_size=1024):
                size = f.write(chunk)
                bar.update(size)

Matrices are included in the Azul [/index/projects](https://service.azul.data.humancellatlas.org/#/Index/get_index_projects__project_id_) endpoint response as a JSON tree structure, with the keys indicating the stratification of each matrix file. An abridged example of what this tree might look like can be seen here:

```
{
    "genusSpecies": {
        "Homo sapiens": {
            "developmentStage": {
                "adult": {
                    "libraryConstructionApproach": {
                        "10X v2 sequencing": {
                            "organ": {
                                "blood": [
                                    {
                                        "size": 2377128,
                                        "name": "TCellActivation-Blood-10x_cell_type_2020-03-11.csv",
                                        "source": "HCA Release",
                                        "uuid": "237538e6-7f05-5e56-a47d-01cdfd136a7e",
                                        "version": "2020-11-20T09:03:11.285229Z",
                                        "url": "https://..."
                                    }
                                ],
                                "lung": [
                                    {
                                        "size": 1460428,
                                        "name": "TCellActivation-lung-10x_cell_type_2020-03-11.csv",
                                        "source": "HCA Release",
                                        "uuid": "978eb768-f27a-5a68-9e5d-155b3f35ff95",
                                        "version": "2020-11-20T09:03:11.285229Z",
                                        "url": "https://..."
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}
```

The following function recursively traverses a matrices tree and yields each leaf node together with the tuple of keys leading to it. The leaf nodes contain the details for each matrix file (e.g. file name, URL, size).

In [3]:
def iterate_matrices_tree(tree, keys=()):
    if isinstance(tree, dict):
        for k, v in tree.items():
            yield from iterate_matrices_tree(v, keys=(*keys, k))
    elif isinstance(tree, list):
        for file in tree:
            yield keys, file
    else:
        assert False, f'Unexpected node in matrices tree: {tree!r}'
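
To illustrate what the traversal yields, here is a minimal sketch that runs it over a small made-up tree. The traversal is repeated so the snippet runs on its own. The `stratification` helper, the sample file names and the `example.org` URLs are hypothetical and only serve the illustration. The helper assumes the keys alternate between a dimension name and its value, as they do in the abridged example above.

```python
def iterate_matrices_tree(tree, keys=()):
    # Mirrors the traversal above: dicts are descended into, lists hold the files.
    if isinstance(tree, dict):
        for k, v in tree.items():
            yield from iterate_matrices_tree(v, keys=(*keys, k))
    elif isinstance(tree, list):
        for file in tree:
            yield keys, file
    else:
        assert False


def stratification(keys):
    # Hypothetical helper: pair each dimension name (even positions)
    # with the value that follows it (odd positions).
    return dict(zip(keys[0::2], keys[1::2]))


# Made-up tree with the same shape as the abridged example, but shallower.
sample_tree = {
    "genusSpecies": {
        "Homo sapiens": {
            "organ": {
                "blood": [{"name": "blood.csv", "size": 10,
                           "url": "https://example.org/blood.csv"}],
                "lung": [{"name": "lung.csv", "size": 20,
                          "url": "https://example.org/lung.csv"}],
            }
        }
    }
}

rows = [(stratification(keys), file["name"])
        for keys, file in iterate_matrices_tree(sample_tree)]
for strat, name in rows:
    print(name, strat)
```

Each yielded item pairs one file with the full path of keys that led to it, so a file listed under several strata would be yielded once per stratum.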

Now it is time to set the configuration variables. Change these values as needed to specify the UUID of the desired project, the catalog containing the project, the local path to save the downloaded files, and the URL of the projects service endpoint.

In [4]:
project_uuid = '4a95101c-9ffc-4f30-a809-f04518a23803'
catalog = 'dcp2'
endpoint_url = f'https://service.azul.data.humancellatlas.org/index/projects/{project_uuid}'

save_location = '/tmp'
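
If `save_location` is changed to a directory that may not exist yet, the downloads would fail when opening the output file. One way to guard against that, sketched here with a hypothetical helper `ensure_save_location` and a throwaway temporary directory standing in for the real path:

```python
import os
import tempfile


def ensure_save_location(path):
    # Hypothetical helper: create the directory (and any missing parents)
    # if needed, and return its absolute path.
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)


# Demonstration against a temporary directory rather than a real save location.
demo_dir = os.path.join(tempfile.mkdtemp(), 'hca', 'matrices')
resolved = ensure_save_location(demo_dir)
print(resolved)
```

Because `exist_ok=True` is passed, calling the helper again on an existing directory is harmless.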

Finally, this block of code puts everything together by making the request for the project entity metadata and downloading each matrix file contained within.

Because it is possible for a matrix file to be included multiple times in the projects response (each occurrence with different stratification information), a set of downloaded URLs is maintained to prevent downloading from any URL more than once.

In [5]:
response = requests.get(endpoint_url, params={'catalog': catalog})
response.raise_for_status()

response_json = response.json()
project = response_json['projects'][0]

file_urls = set()
for key in ('matrices', 'contributedAnalyses'):
    tree = project[key]
    for path, file_info in iterate_matrices_tree(tree):
        url = file_info['url']
        if url not in file_urls:
            dest_path = os.path.join(save_location, file_info['name'])
            download_file(url, dest_path)
            file_urls.add(url)
print('Downloads Complete.')

Downloading to: /tmp/4a95101c-9ffc-4f30-a809-f04518a23803.TCellActivation-Blood-10x_cell_type_2020-03-11.csv


100%|██████████| 2.27M/2.27M [00:00<00:00, 9.97MB/s]


Downloading to: /tmp/4a95101c-9ffc-4f30-a809-f04518a23803.GSE126030_RAW.tar


100%|██████████| 111M/111M [00:08<00:00, 14.4MB/s] 


Downloading to: /tmp/4a95101c-9ffc-4f30-a809-f04518a23803.TCellActivation-lymph-node-10x_cell_type_2020-03-11.csv


100%|██████████| 2.12M/2.12M [00:00<00:00, 8.08MB/s]


Downloading to: /tmp/4a95101c-9ffc-4f30-a809-f04518a23803.TCellActivation-lung-10x_cell_type_2020-03-11.csv


100%|██████████| 1.39M/1.39M [00:00<00:00, 7.15MB/s]


Downloading to: /tmp/4a95101c-9ffc-4f30-a809-f04518a23803.TCellActivation-bone-marrow-10x_cell_type_2020-03-11.csv


100%|██████████| 875k/875k [00:00<00:00, 5.66MB/s]

Downloads Complete.
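
As an optional follow-up, a downloaded file can be compared against the `size` field of its entry in the matrices tree. The sketch below assumes that `size` is the file's length in bytes, which is consistent with the abridged example and the progress bars above (2377128 bytes is about 2.27 MiB). The helper name `size_matches` is hypothetical, and the demonstration uses a small temporary file rather than a real matrix.

```python
import os
import tempfile


def size_matches(path, file_info):
    # Hypothetical check: compare the size on disk with the size
    # reported in the file's entry (assumed to be in bytes).
    return os.path.getsize(path) == file_info['size']


# Demonstration with a temporary 8-byte file standing in for a matrix file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'wb') as f:
        f.write(b'a,b\n1,2\n')
    ok = size_matches(demo_path, {'size': 8})
    truncated = size_matches(demo_path, {'size': 9})

print(ok, truncated)
```

A size mismatch would suggest an interrupted or incomplete download, in which case the file could be fetched again.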



