# Getting Experimental Metadata from DANDI
It can be helpful to view general information about the experimental sessions that produced your data. Since each NWB file typically represents one session, a dandiset's files can be examined to get an overview of its sessions. The fields that are available vary depending on who produced the NWB file. In this notebook, NWB files from one of the Allen Institute's dandisets are opened, and some basic information from each is used to build a table of the experimental sessions and their properties.

### Environment Setup

In [1]:
### if running on Google Colab, run this cell once, then restart the runtime and run the rest of the notebook
import os
if "COLAB_GPU" in os.environ:
    !git clone https://github.com/AllenInstitute/openscope_databook.git
    %cd openscope_databook
    %pip install -e .

In [2]:
import fsspec
import h5py
import pandas as pd

from dandi import dandiapi
from fsspec.implementations.cached import CachingFileSystem
from pynwb import NWBHDF5IO

%matplotlib inline

### Getting Dandiset Metadata
To view other data, change `dandiset_id` to the ID of the dandiset you're interested in. If the dandiset is embargoed, set `authenticate` to `True` and `dandi_api_key` to your DANDI API key.

In [3]:
dandiset_id = "000248"
authenticate = True
dandi_api_key = os.environ["DANDI_API_KEY"]

In [4]:
if authenticate:
    my_dandiset = dandiapi.DandiAPIClient(token=dandi_api_key).get_dandiset(dandiset_id)
else:
    my_dandiset = dandiapi.DandiAPIClient().get_dandiset(dandiset_id)
print(f"Got dandiset {my_dandiset}")

Got dandiset DANDI:000248/draft
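The cell above reads `os.environ["DANDI_API_KEY"]` directly, which raises a `KeyError` when the variable is unset, even for a public dandiset. The following is a minimal sketch of one way to avoid that, assuming a helper named `resolve_dandi_credentials` (a hypothetical name, not part of the `dandi` package) that enables authentication only when a key is actually present.

```python
import os

def resolve_dandi_credentials(environ=os.environ):
    """Return (authenticate, dandi_api_key) based on the DANDI_API_KEY variable.

    Hypothetical helper for illustration; it is not part of the dandi package.
    """
    # .get returns None instead of raising KeyError when the variable is unset
    dandi_api_key = environ.get("DANDI_API_KEY")
    # treat an empty string the same as a missing key
    authenticate = bool(dandi_api_key)
    return authenticate, (dandi_api_key if authenticate else None)

authenticate, dandi_api_key = resolve_dandi_credentials()
```

The two returned values can then be used in place of the hard-coded `authenticate` and `dandi_api_key` assignments.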


### Get NWB Info
Below are two definitions of the function `get_nwb_info`, tailored to our *Ophys* and *Ecephys* NWB files respectively; the *Ophys* version is commented out. Each retrieves a series of important metadata values from the NWB file object. The code for accessing the fields of interest will likely be slightly different for your files. The function can easily be altered to extract any other information from an NWB file, as long as you're familiar with the internal layout of your files. If you change what it returns, also update the `columns` argument of the pandas DataFrame below to match.

In [5]:
# get experimental information from within ophys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
# def get_nwb_info(nwb):
#         session_time = getattr(nwb, "session_start_time", None)

#         metadata_obj = getattr(nwb, "lab_meta_data", {})
#         metadata = metadata_obj.get("metadata", None)
#         session_id = getattr(metadata, "ophys_session_id", None)
#         experiment_id = getattr(metadata, "ophys_experiment_id", None)

#         fov_height = getattr(metadata, "field_of_view_height", None)
#         fov_width = getattr(metadata, "field_of_view_width", None)
#         imaging_depth = getattr(metadata, "imaging_depth", None)
#         group = getattr(metadata, "imaging_plane_group", None)
#         group_count = getattr(metadata, "imaging_plane_group_count", None)
#         container_id = getattr(metadata, "experiment_container_id", None)
        
#         subject = getattr(nwb, "subject", None)
#         specimen_name = getattr(subject, "subject_id", None)
#         age = getattr(subject, "age", None)
#         sex = getattr(subject, "sex", None)
#         genotype = getattr(subject, "genotype", None)
        
#         try: n_rois = nwb.processing["ophys"]["dff"].roi_response_series["traces"].data.shape[1]
#         except: n_rois = None
#         try: location = list(nwb.imaging_planes.values())[0].location
#         except: location = None
        
#         intervals = getattr(nwb, "intervals", {})
#         stim_types = set(intervals.keys())
#         stim_tables = [intervals[table_name] for table_name in intervals]
#         # gets highest value among final "stop times" of all stim tables in intervals
#         session_end = max([table.stop_time[-1] for table in stim_tables if len(table) > 1])

#         return [session_time, session_id, experiment_id, container_id, group, group_count, imaging_depth, location, fov_height, fov_width, specimen_name, sex, age, genotype, stim_types, n_rois, session_end]

In [6]:
# get experimental information from within ecephys file
# getattr is used because not all nwb files have all properties. If not handled like this, errors will arise
def get_nwb_info(nwb):
    session_time = getattr(nwb, "session_start_time", None)

    subject = getattr(nwb, "subject", None)
    specimen_name = getattr(subject, "specimen_name", None)
    age = getattr(subject, "age_in_days", None)
    sex = getattr(subject, "sex", None)
    genotype = getattr(subject, "genotype", None)

    # device names are used as the probe names
    probes = set(getattr(nwb, "devices", {}).keys())
    units = getattr(nwb, "units", [])
    n_units = len(units) if hasattr(units, "__len__") else 0

    intervals = getattr(nwb, "intervals", {})
    stim_types = set(intervals.keys())
    stim_tables = [intervals[table_name] for table_name in intervals]
    # gets highest value among final "stop times" of all stim tables in intervals
    # default=None avoids a ValueError from max() when no table qualifies
    session_end = max((table.stop_time[-1] for table in stim_tables if len(table) > 1), default=None)

    return [session_time, specimen_name, sex, age, genotype, probes, stim_types, n_units, session_end]
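Two patterns in `get_nwb_info` are worth seeing in isolation: `getattr` with a default for fields that may be absent, and taking the latest stop time across interval tables. The sketch below uses `SimpleNamespace` stand-ins rather than real pynwb objects, and `latest_stop_time` is a hypothetical helper. It filters on non-empty `stop_time` sequences, which is slightly different from the `len(table) > 1` check used above.

```python
from types import SimpleNamespace

def latest_stop_time(intervals):
    """Return the largest final stop time across interval tables, or None.

    `intervals` is assumed to be a mapping of table name -> object with a
    `stop_time` sequence, mimicking how nwb.intervals is used above.
    """
    final_stops = [
        table.stop_time[-1]
        for table in intervals.values()
        if len(table.stop_time) > 0
    ]
    # default=None avoids the ValueError that max() raises on an empty list
    return max(final_stops, default=None)

# stand-in objects for illustration only
subject = SimpleNamespace(sex="M")
print(getattr(subject, "sex", None))        # attribute present
print(getattr(subject, "genotype", None))   # attribute missing, falls back to None
print(getattr(None, "genotype", None))      # also works when subject itself is None

intervals = {
    "stim_a": SimpleNamespace(stop_time=[1.0, 5.5]),
    "stim_b": SimpleNamespace(stop_time=[2.0, 9.25]),
    "empty": SimpleNamespace(stop_time=[]),
}
print(latest_stop_time(intervals))
print(latest_stop_time({}))
```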

### Getting Table
Here, each relevant file in the dandiset is streamed and opened remotely, `get_nwb_info` (defined above) extracts the information of interest, and the result is added to a table of sessions and their metadata. Files that hold data for a specific probe rather than an entire session are skipped. Opening each NWB file can take several minutes, so this step can take a very long time for a dandiset with many files.

In [7]:
# set up streaming filesystem
fs = fsspec.filesystem("http")

nwb_table = []
# skip files that aren't main session files
files = [asset for asset in my_dandiset.get_assets() if "probe" not in asset.path]
# swap this with line above for one of our ophys dandisets
# files = [asset for asset in my_dandiset.get_assets() if "raw" not in asset.path]
n_files = len(files)
print(f"{n_files} files retrieved")

for i, file in enumerate(files):
    print(f"Examining file {i}/{n_files}: {file.identifier}")    
    # get basic file metadata
    row = [file.identifier, file.size, file.path]
    
    base_url = file.client.session.head(file.base_download_url)
    file_url = base_url.headers["Location"]

    # open and read nwb file with streaming
    with fs.open(file_url, "rb") as f:
        # use a distinct name so the loop variable `file` is not overwritten
        with h5py.File(f, "r") as h5_file:
            with NWBHDF5IO(file=h5_file, mode="r", load_namespaces=True) as io:
                nwb = io.read()
                # extract experimental info from within file
                row += get_nwb_info(nwb)
                nwb_table.append(row)

16 files retrieved
Examining file 0/16: dbc426a0-aafa-460b-a25a-a86bb31b9ddc


  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."


Examining file 1/16: 181b7651-5f5c-491b-be70-e5d0354439d4
Examining file 2/16: 85bfd56c-f104-4c83-937c-be0d58fce48e
Examining file 3/16: c5e97840-4988-4da8-9f57-a24fb0a4a865
Examining file 4/16: a7ff352c-0b00-47d6-a49f-97027d18264e
Examining file 5/16: a8bc8aaf-ccba-4c27-bb5c-f1bc3c232c84
Examining file 6/16: 32af00b4-4aa6-48de-8210-26a5cf7935a9
Examining file 7/16: 016e7321-807f-4b59-be42-c33511f8f55c
Examining file 8/16: 7252ab67-7acd-4cb7-b7a6-600df600d8e7
Examining file 9/16: 3c6a7667-5f5d-432f-829c-e915dab15c27
Examining file 10/16: 9d0ed5c2-f9e4-4c5e-b5ab-cb4a9d4e7ef6
Examining file 11/16: e0392a2a-0e07-4f7a-82dd-df354bf571d5
Examining file 12/16: 03eba9bf-f850-41a5-9e99-6f65fc5ea13d
Examining file 13/16: 0c343e3e-8f00-4ee8-9778-fc1d953e453b
Examining file 14/16: 348b1e83-4fde-480a-9e4d-ef55b5cac7c5


  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."


Examining file 15/16: 46f9bf9b-f799-4af9-b7b8-d6ed95a5446d


  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
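Because each file can take minutes to open, one unreadable file late in the loop would discard all the time spent so far. A minimal sketch of a fault-tolerant version of the loop structure follows. `collect_rows` and `fake_read_info` are hypothetical names, and the `(identifier, payload)` pairs stand in for the asset objects and streaming code above.

```python
def collect_rows(files, read_info):
    """Build table rows, skipping files whose metadata cannot be read.

    `files` is any iterable of (identifier, payload) pairs and `read_info`
    is a callable returning a list of metadata values; both are stand-ins
    for the asset objects and the streaming code above.
    """
    rows, failures = [], []
    for identifier, payload in files:
        try:
            rows.append([identifier] + read_info(payload))
        except Exception as err:  # broad on purpose: record the failure and move on
            failures.append((identifier, repr(err)))
    return rows, failures

def fake_read_info(payload):
    # hypothetical reader: fails on payloads marked as broken
    if payload == "broken":
        raise OSError("could not open file")
    return [payload.upper()]

rows, failures = collect_rows(
    [("a", "ok"), ("b", "broken"), ("c", "fine")], fake_read_info
)
print(rows)
print(failures)
```

Keeping the failures in a separate list makes it possible to retry just those files afterwards.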


In [8]:
# convert table to pandas dataframe
sessions = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session time", "sub name", "sub sex", "sub age", "sub genotype", "probes", "stim types", "# units", "session length"))
# swap this with line above for one of our ophys dandisets
# sessions = pd.DataFrame(nwb_table, columns=("identifier", "size", "path", "session time", "session id", "experiment id", "container id", "group", "group count", "imaging depth", "location", "fov height", "fov width", "specimen name", "sex", "age", "genotype", "stim types", "# rois", "session end"))
sessions

| | identifier | size | path | session time | sub name | sub sex | sub age | sub genotype | probes | stim types | # units | session length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dbc426a0-aafa-460b-a25a-a86bb31b9ddc | 2242666496 | sub_1175512783/sub_1175512783sess_1187930705/s... | 2022-06-29 00:00:00-07:00 | 619296 | M | 154.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 1918 | 7278.15799 |
| 1 | 181b7651-5f5c-491b-be70-e5d0354439d4 | 2803525629 | sub_1172968426/sub_1172968426sess_1182865981/s... | 2022-06-08 00:00:00-07:00 | 625545 | M | 89.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2793 | 7279.234305 |
| 2 | 85bfd56c-f104-4c83-937c-be0d58fce48e | 2372313526 | sub_1172969394/sub_1172969394sess_1183070926/s... | 2022-06-09 00:00:00-07:00 | 625555 | F | 90.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2621 | 7278.592876 |
| 3 | c5e97840-4988-4da8-9f57-a24fb0a4a865 | 2466318464 | sub_1181585608/sub_1181585608sess_1194644312/s... | 2022-07-27 00:00:00-07:00 | 630507 | F | 99.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2464 | 7278.96487 |
| 4 | a7ff352c-0b00-47d6-a49f-97027d18264e | 2809532134 | sub_1176214862/sub_1176214862sess_1188137866/s... | 2022-06-30 00:00:00-07:00 | 620333 | M | 148.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2593 | 7283.10806 |
| 5 | a8bc8aaf-ccba-4c27-bb5c-f1bc3c232c84 | 3393216313 | sub_1174569641/sub_1174569641sess_1184671550/s... | 2022-06-01 00:00:00-07:00 | 625554 | M | 82.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2930 | 7315.456085 |
| 6 | 32af00b4-4aa6-48de-8210-26a5cf7935a9 | 3556822422 | sub_1181314060/sub_1181314060sess_1191383105/s... | 2022-07-13 00:00:00-07:00 | 630502 | M | 85.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2368 | 7277.54836 |
| 7 | 016e7321-807f-4b59-be42-c33511f8f55c | 2491393884 | sub_1177693342/sub_1177693342sess_1189887297/s... | 2022-07-06 00:00:00-07:00 | 620334 | M | 154.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2092 | 7279.915735 |
| 8 | 7252ab67-7acd-4cb7-b7a6-600df600d8e7 | 3393216313 | sub_1171903433/sub_1171903433sess_1181330601/s... | 2022-06-01 00:00:00-07:00 | 625554 | M | 82.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2930 | 7315.456085 |
| 9 | 3c6a7667-5f5d-432f-829c-e915dab15c27 | 2483160990 | sub_1182593231/sub_1182593231sess_1192952695/s... | 2022-07-20 00:00:00-07:00 | 630506 | F | 92.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 2517 | 7279.167735 |
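The column names passed to `pd.DataFrame` have to be kept in sync by hand with the list returned by `get_nwb_info`. One way to remove that coupling, sketched here under the assumption that you are free to change the function's return type, is to return a dictionary so that each value carries its own label. `get_nwb_info_dict` is a hypothetical variant covering only a few fields, and `fake_nwb` is a stand-in for a real NWB file object.

```python
from types import SimpleNamespace

def get_nwb_info_dict(nwb):
    """Variant of get_nwb_info that returns labeled values (illustrative subset)."""
    subject = getattr(nwb, "subject", None)
    return {
        "session time": getattr(nwb, "session_start_time", None),
        "sub name": getattr(subject, "specimen_name", None),
        "sub sex": getattr(subject, "sex", None),
    }

# stand-in for an NWB file object, for illustration only
fake_nwb = SimpleNamespace(
    session_start_time="2022-06-29",
    subject=SimpleNamespace(specimen_name="619296", sex="M"),
)

# merge file-level fields with the labeled values from inside the file
row = {"identifier": "example-id", **get_nwb_info_dict(fake_nwb)}
columns = list(row)
print(columns)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, which derives the column names from the keys.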


In [9]:
# output all session metadata to local CSV file
sessions.to_csv("sessions.csv")

### Selecting Files
**Pandas** syntax can be used to filter the table above and select individual sessions.

In [10]:
selected_sessions = sessions[sessions["size"] <= 2_300_000_000]
# selected_sessions = sessions[sessions["sub sex"] == "F"]
# selected_sessions = sessions[sessions["# units"] > 2900]
selected_sessions

| | identifier | size | path | session time | sub name | sub sex | sub age | sub genotype | probes | stim types | # units | session length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dbc426a0-aafa-460b-a25a-a86bb31b9ddc | 2242666496 | sub_1175512783/sub_1175512783sess_1187930705/s... | 2022-06-29 00:00:00-07:00 | 619296 | M | 154.0 | Sst-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 1918 | 7278.15799 |
| 10 | 9d0ed5c2-f9e4-4c5e-b5ab-cb4a9d4e7ef6 | 1882268693 | sub_1183369803/sub_1183369803sess_1194857009/s... | 2022-07-28 00:00:00-07:00 | 631570 | F | 92.0 | Pvalb-IRES-Cre/wt;Ai32(RCL-ChR2(H134R)_EYFP)/wt | {probeF, probeB, probeD, probeA, probeC, probeE} | {RFCI_presentations, invalid_times, ICwcfg1_pr... | 1789 | 7278.94295 |
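Conditions can also be combined. The sketch below rebuilds a small DataFrame from three rows of the sessions table (indices 0, 2, and 10, reduced to the columns being filtered) and selects sessions that are both small and from female subjects. `sessions_demo` is a stand-in name so that the real `sessions` table is left untouched.

```python
import pandas as pd

# three rows copied from the sessions table, reduced to the columns being filtered
sessions_demo = pd.DataFrame(
    {
        "size": [2242666496, 2372313526, 1882268693],
        "sub sex": ["M", "F", "F"],
        "# units": [1918, 2621, 1789],
    },
    index=[0, 2, 10],
)

# each condition needs its own parentheses because & binds tighter than comparisons
small_and_female = sessions_demo[
    (sessions_demo["size"] <= 2_300_000_000) & (sessions_demo["sub sex"] == "F")
]
print(small_and_female)
```

Use `|` in place of `&` to keep rows that satisfy either condition.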


### Downloading Selected Files
To download the files, we use the same method that is explained in [Downloading an NWB File](./download_nwb.ipynb). Here it is applied to the paths of the selected sessions above, so that only the files of interest are downloaded. Set `download_loc` to the relative path of the directory where the files should be saved. Note that if the files are large, this can take a long time.

In [11]:
download_loc = "."

In [12]:
selected_paths = set(selected_sessions.path)
selected_paths

{'sub_1175512783/sub_1175512783sess_1187930705/sub_1175512783+sess_1187930705_ecephys.nwb',
 'sub_1183369803/sub_1183369803sess_1194857009/sub_1183369803+sess_1194857009_ecephys.nwb'}

In [13]:
for dandi_filepath in selected_paths:
    filename = dandi_filepath.split("/")[-1]
    file = my_dandiset.get_asset_by_path(dandi_filepath)
    file.download(f"{download_loc}/{filename}")
    print(f"Downloaded file to {download_loc}/{filename}")

Downloaded file to ./sub_1175512783+sess_1187930705_ecephys.nwb
Downloaded file to ./sub_1183369803+sess_1194857009_ecephys.nwb
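Since these files are several gigabytes each, it can be worth checking whether a file is already present before downloading it again. The sketch below shows one way to derive the local destination and test for it with `pathlib`. `local_destination` and `needs_download` are hypothetical helpers, and the check only looks at whether the file exists, not whether an earlier download completed.

```python
from pathlib import Path, PurePosixPath

def local_destination(download_loc, dandi_filepath):
    """Map a dandiset asset path to a local file path inside download_loc."""
    # asset paths use forward slashes regardless of the local operating system
    filename = PurePosixPath(dandi_filepath).name
    return Path(download_loc) / filename

def needs_download(destination):
    """Return True when no file exists yet at the destination."""
    return not Path(destination).exists()

dest = local_destination(
    ".",
    "sub_1175512783/sub_1175512783sess_1187930705/sub_1175512783+sess_1187930705_ecephys.nwb",
)
print(dest, needs_download(dest))
```

In the download loop, the call to `file.download` could then be wrapped in `if needs_download(dest):`.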
