In [1]:
import json
import numpy as np
from dandi.dandiapi import DandiAPIClient
from tqdm.notebook import tqdm
from pandas import Timedelta

In [2]:
client = DandiAPIClient()
dandisets = list(client.get_dandisets())

# More specific identification of NWB dandisets

The simpler tutorial only tested whether "NWB" appeared in the name of any of a dandiset's data standards.

A more official and precise method is to match the specific RRID of NWB, which is `"RRID:SCR_015242"`.

In [3]:
nwb_dandisets = []

for dandiset in tqdm(dandisets):
    raw_metadata = dandiset.get_raw_metadata()

    if any(
        data_standard['identifier'] == "RRID:SCR_015242"  # this is the RRID for NWB
        for data_standard in raw_metadata['assetsSummary'].get('dataStandard', [])
    ):
        nwb_dandisets.append(dandiset)
print(f"There are currently {len(nwb_dandisets)} NWB datasets on DANDI!")

  0%|          | 0/222 [00:00<?, ?it/s]

There are currently 128 NWB datasets on DANDI!


# Average age of subjects used in a dandiset

Let's consider a more advanced calculation: the average age of the subjects used in a particular dandiset.

For this we will directly access the asset-level field `wasAttributedTo` as a key of the raw asset metadata dictionary, instead of as an attribute.

We will also have to do some manual data manipulation, because ages are stored as ISO 8601 duration strings; you can read more about the standard at https://en.wikipedia.org/wiki/ISO_8601
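
To make the duration format concrete, here is a minimal sketch of a parser written with the standard library only. It is an illustration, not the method used in this notebook (the cell below relies on `pandas.Timedelta`), and the helper name `iso_duration_to_days` is hypothetical. It assumes ages use only weeks, days, hours, minutes, and seconds; durations containing years or months are rejected because their length in days is ambiguous.

```python
import re
from typing import Optional

# Matches durations such as "P30D", "P2W", or "P1DT12H".
# Years and months are deliberately unsupported (ambiguous length in days).
_ISO_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+(?:\.\d+)?)W)?"
    r"(?:(?P<days>\d+(?:\.\d+)?)D)?"
    r"(?:T(?:(?P<hours>\d+(?:\.\d+)?)H)?"
    r"(?:(?P<minutes>\d+(?:\.\d+)?)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def iso_duration_to_days(duration: str) -> Optional[float]:
    """Convert an ISO 8601 duration string to fractional days, or None if unsupported."""
    match = _ISO_DURATION.match(duration)
    if match is None:
        return None

    # Keep only the components that were actually present in the string
    parts = {
        name: float(value)
        for name, value in match.groupdict().items()
        if value is not None
    }
    if not parts:  # a bare "P" or "PT" carries no duration
        return None

    return (
        parts.get("weeks", 0.0) * 7
        + parts.get("days", 0.0)
        + parts.get("hours", 0.0) / 24
        + parts.get("minutes", 0.0) / (24 * 60)
        + parts.get("seconds", 0.0) / (24 * 60 * 60)
    )
```

With this sketch, `iso_duration_to_days("P1DT12H")` evaluates to `1.5`, while a string such as `"P1Y"` yields `None` so the caller can decide how to handle it.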

In [5]:
dandiset = client.get_dandiset("000398")
all_subject_ages_in_days = []
assets = list(dandiset.get_assets())

def timedelta_to_fractional_days(time_delta: Timedelta) -> float:
    """
    Return the duration of a Timedelta in float-valued days.
    
    The `.days` attribute of a Timedelta is an integer (rounded down), so the
    conversion is done from `total_seconds()` instead.
    """
    return time_delta.total_seconds() / (  # Evaluate using the total number of seconds
        60 *  # 60 seconds per minute
        60 *  # 60 minutes per hour
        24  # 24 hours per day (ignoring daylight savings time)
    )

for asset in tqdm(assets):
    raw_metadata = asset.get_raw_metadata()
    subjects = raw_metadata["wasAttributedTo"]

    for subject_metadata in subjects:
        if "age" in subject_metadata:
            age_as_time_delta = Timedelta(subject_metadata["age"]["value"])
            if age_as_time_delta.total_seconds():  # Skip if the age is null
                all_subject_ages_in_days.append(timedelta_to_fractional_days(time_delta=age_as_time_delta))
print(f"The average age of the subjects in dandiset #398 is: {np.mean(all_subject_ages_in_days)} days")

  0%|          | 0/42 [00:00<?, ?it/s]

# Maybe a parallelized loop over multiple dandisets
# counting something like the distribution of subject species used for ecephys experiments
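
One possible shape for the idea noted above is sketched here. It is not a tested DANDI workflow. It assumes that each entry of `wasAttributedTo` may carry a `species` dictionary with a `name` key, and that it is acceptable to issue metadata requests from several threads at once. The helper names `species_names` and `count_species` are hypothetical, and filtering to ecephys experiments is left out.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def species_names(asset_metadata: dict) -> List[str]:
    """Collect the species names listed for the subjects of one asset."""
    names = []
    # `or []` also covers the case where the key exists but holds None
    for participant in asset_metadata.get("wasAttributedTo") or []:
        species = participant.get("species") or {}  # assumed field layout
        name = species.get("name")
        if name:
            names.append(name)
    return names


def count_species(
    fetchers: Iterable[Callable[[], dict]], max_workers: int = 8
) -> Counter:
    """
    Call each zero-argument fetcher in a thread pool and tally species names.

    Each fetcher is expected to return the raw metadata dictionary of one asset.
    """
    counts = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # The fetches run concurrently; the tally is updated in this thread only
        for metadata in executor.map(lambda fetch: fetch(), fetchers):
            counts.update(species_names(metadata))
    return counts
```

For the assets of a single dandiset this could be called as `count_species([asset.get_raw_metadata for asset in assets])`; extending it to several dandisets would mean building the list of fetchers across all of them. Threads are used rather than processes because the work is dominated by waiting on network requests.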

# Going beyond
These examples show only a few types of queries. Because the metadata structures are quite rich at both the dandiset and asset levels, they enable many more complex queries than the ones shown here.

These metadata structures are also expanding over time as DANDI becomes more strict about what counts as essential metadata.

The `.get_raw_metadata` method, available on both a dandiset (`client.get_dandiset(...)`) and each asset yielded by `client.get_dandiset(...).get_assets()`, provides a nice view into the available fields.

Note: for any attribute, it is recommended to first check that it is present and not `None` before using its value.
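
The check recommended in the note can be wrapped in a small helper. This is a minimal sketch; the name `get_nested` is hypothetical and not part of the DANDI client. It walks a chain of keys through the raw metadata dictionary and falls back to a default as soon as a key is missing or its value is `None`.

```python
from typing import Any


def get_nested(metadata: Any, *keys: str, default: Any = None) -> Any:
    """Follow `keys` through nested dictionaries, returning `default` on any gap."""
    current = metadata
    for key in keys:
        if not isinstance(current, dict):
            return default  # cannot descend any further
        current = current.get(key)
        if current is None:
            return default  # key is missing or explicitly null
    return current
```

For example, `get_nested(raw_metadata, "assetsSummary", "dataStandard", default=[])` returns an empty list instead of raising when either level is absent or null.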

In [None]:
print(json.dumps(dandisets[0].get_raw_metadata(), indent=4))

In [None]:
print(json.dumps(assets[0].get_raw_metadata(), indent=4))