# Exploring the DANDI Archive

This notebook serves as a quick-start guide to the [Distributed Archive for Neurophysiology Data Integration (DANDI)](https://registry.opendata.aws/dandiarchive/).

<div style="background:#fff2cc; border:1px solid #000000; border-radius:2px; padding:2px 5px; margin:-5px 0 5px 20px;
color:#000000;
width:fit-content; ">
    <div style="font-weight:bold;">ℹ️ Definition</div>
    A <i>Dandiset</i> is a collection of neurophysiology data and metadata hosted on the <a
    href="https://dandiarchive.org" style="font-weight:bold;">DANDI Archive</a>.
</div>

The DANDI Archive holds hundreds of Dandisets with a diverse range of neurodata modalities.

These modalities span the spectrum of microscopy, optogenetics, intracellular and extracellular
electrophysiology, and optophysiology.

While we cannot hope to completely showcase this diversity here, there are two key examples which provide a good
starting point:
- [000728 - Visual Coding - Optical Physiology](https://dandiarchive.org/dandiset/000728/) by the Allen Institute for
Brain Science (AIBS)
- [000409 - Brain Wide Map](https://dandiarchive.org/dandiset/000409/) by the International Brain Laboratory (IBL)

For even more [usage guides](https://docs.dandiarchive.org/user-guide-using/exploring-dandisets/),
[dandiset-specific tutorials](https://dandi.github.io/example-notebooks/), and general documentation, check out the main
[DANDI Docs](https://docs.dandiarchive.org/).

### Q: How do I navigate the archive and its datasets?

DANDI provides a [web interface](https://dandiarchive.org/dandiset), [REST API](https://api.dandiarchive.org/api/docs/swagger/),
and [command-line tool](https://pypi.org/project/dandi/) to help users intuitively navigate the contents.
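
As a rough illustration of the REST API route, the sketch below builds a listing request and extracts Dandiset identifiers from the JSON response using only the standard library. The `dandisets/` endpoint, the `page_size` query parameter, and the `results`/`identifier` fields are assumptions based on the Swagger documentation linked above; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.dandiarchive.org/api"


def build_dandiset_list_url(page_size: int = 3) -> str:
    """Build the URL for one page of the Dandiset listing (assumed endpoint)."""
    return f"{API_BASE}/dandisets/?{urlencode({'page_size': page_size})}"


def summarize_dandisets(payload: dict) -> list[str]:
    """Pull the identifiers out of a paginated listing response (assumed field names)."""
    return [entry["identifier"] for entry in payload.get("results", [])]


def fetch_dandiset_identifiers(page_size: int = 3) -> list[str]:
    """Query the archive over HTTPS and return the identifiers on the first page."""
    with urlopen(build_dandiset_list_url(page_size), timeout=30) as response:
        return summarize_dandisets(json.load(response))
```

Calling `fetch_dandiset_identifiers()` requires network access. The Python client used later in this notebook wraps the same API and is usually more convenient.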

The easiest place to start is the primary [Dandiset listing page](https://dandiarchive.org/dandiset).

After browsing for a while, we choose our first Dandiset from the web interface: [000728](https://dandiarchive.org/dandiset/000728/0.240827.1809).

We can see the contents by going to the ["Files" tab](https://dandiarchive.org/dandiset/000728/0.240827.1809/files).

From here, we can see that a Dandiset is organized as a collection of folders, one per subject ID.

Each folder contains files named according to session ID or other unique discriminators.

```text
000728/
├── sub-691657859/
│   ├── sub-691657859_ses-712919679-StimB_ophys.nwb
│   ├── sub-691657859_ses-710504563-StimA_behavior+image+ophys.nwb
│   └── ...
├── sub-501800590/
│   └── ...
└── ...
```
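
The naming pattern above can be unpacked programmatically. The following is a minimal sketch, assuming file names are built from underscore-separated `key-value` pieces followed by a `+`-joined modality suffix, as in the listing above; `parse_dandi_path` is a hypothetical helper, not part of the DANDI client.

```python
from pathlib import PurePosixPath


def parse_dandi_path(path: str) -> dict:
    """Split a DANDI asset path into its key-value entities and modality suffix."""
    stem = PurePosixPath(path).stem  # file name without folder or extension
    entities = {}
    modalities = []
    for part in stem.split("_"):
        key, separator, value = part.partition("-")
        if separator:  # looks like "key-value", e.g. "sub-691657859"
            entities[key] = value
        else:  # no hyphen: treat as the "+"-joined modality suffix
            modalities = part.split("+")
    return {"entities": entities, "modalities": modalities}


example = parse_dandi_path(
    "sub-691657859/sub-691657859_ses-710504563-StimA_behavior+image+ophys.nwb"
)
print(example["entities"])    # {'sub': '691657859', 'ses': '710504563-StimA'}
print(example["modalities"])  # ['behavior', 'image', 'ophys']
```

Splitting on the first hyphen only keeps session labels such as `710504563-StimA` intact.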

### Setup

Before we start accessing data contents, we will need to install and import some Python libraries.

In [None]:
!pip install -q dandi matplotlib remfile

from pathlib import Path

import h5py
import remfile
import matplotlib.pyplot as plt
from dandi.dandiapi import DandiAPIClient
from pynwb import read_nwb, NWBHDF5IO

Next, we will initialize our DANDI API client to interact with the archive database and list a few of the
available Dandisets.

In [None]:
client = DandiAPIClient()
dandisets = list(client.get_dandisets())

# Print the dandiset IDs and titles of the first 3 dandisets
for dandiset in dandisets[:3]:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{dandiset.identifier}: {dandiset.get_raw_metadata()['name']}")

000003: Physiological Properties and Behavioral Correlates of Hippocampal Granule Cells and Mossy Cells
000004: A NWB-based dataset and processing pipeline of human single-neuron activity during a declarative memory task
000005: Electrophysiology data from thalamic and cortical neurons during somatosensation


Now let's return to our first example Dandiset and list out a few of its contents.

In [None]:
dandiset = client.get_dandiset(dandiset_id="000728", version_id="0.240827.1809")
assets = list(dandiset.get_assets())

# Print the file paths as seen on the DANDI web interface
for asset in assets[:3]:
    print(asset.get_raw_metadata()["path"])

sub-691657859/sub-691657859_ses-712919679-StimB_ophys.nwb
sub-501800590/sub-501800590_ses-509522931-StimC_ophys.nwb
sub-572569275/sub-572569275_ses-591563201-StimB_ophys.nwb


<div style="background:#cfe2f3; border:1px solid #000000; border-radius:2px; padding:10px; margin:5px 0 5px 0px;
color:#000000; max-width:650px;">
    <div style="font-weight:bold;">💡 Info</div>
    <div style="margin-top:5px;">
        Notice that we passed a <code style="background:#cfe2f3; color:#000000">version_id</code> in this case.
        Dandisets that are published on the archive are given a citable DOI, such as:
    </div>
    <div style="padding-left:20px; margin:5px 0;">
        Allen Institute (2024) <em>Allen Institute - Visual Coding - Optical Physiology</em> (Version 0.240827.1809) [Data set]. DANDI archive. <a href="https://doi.org/10.48324/dandi.000728/0.240827.1809">https://doi.org/10.48324/dandi.000728/0.240827.1809</a>
    </div>
    <div>These citations should be used in any scientific reuse of the data.</div>
    <div style="margin-top:5px;">
        Otherwise, the most recent 'draft' state of the Dandiset is used by default and is subject to change by the Dandiset contributors.
    </div>
</div>
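
As a small illustration of how the identifier and version fit together, the sketch below assembles the landing page and DOI links following the patterns of the URLs shown in this notebook. The helper name `dandiset_links` is hypothetical, and it treats the draft version as having no DOI, in line with the note above that DOIs are assigned to published Dandisets.

```python
from typing import Optional


def dandiset_links(dandiset_id: str, version_id: str = "draft") -> dict[str, Optional[str]]:
    """Assemble the landing page and DOI links for a Dandiset version."""
    landing_page = f"https://dandiarchive.org/dandiset/{dandiset_id}/{version_id}"
    # Assumption: only published (non-draft) versions have a citable DOI
    doi = None if version_id == "draft" else f"https://doi.org/10.48324/dandi.{dandiset_id}/{version_id}"
    return {"landing_page": landing_page, "doi": doi}


links = dandiset_links(dandiset_id="000728", version_id="0.240827.1809")
print(links["doi"])  # https://doi.org/10.48324/dandi.000728/0.240827.1809
```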



<div style="background:#e9e2f8; border:1px solid #000000; border-radius:2px; padding:2px 5px; margin:10px 0 5px 0px;
color:#000000;
width:fit-content; ">
    <div style="font-weight:bold;">🧠 Learn more</div>
    You may have also noticed that in several cases above, we fetched the metadata associated with the Dandisets and
    their assets.
    <br>
    These are very rich models whose full potential is best showcased in the <a href="https://docs.dandiarchive.org/example-notebooks/tutorials/cosyne_2023/advanced_asset_search/#going-beyond" style="font-weight:bold;">Advanced Search Tutorial</a>.
</div>

### Q: What kinds of data are hosted and what formats do they use?

DANDI accepts a relatively small number of open, community-driven file formats designed according to NIH-accepted
data standards*.

|                                         <center>Data Standard</center>                                          | <center>Acronym</center>  |                 <center>Domain</center>                  |    <center>Data Format(s)</center>    |
|:---------------------------------------------------------------------------------------------------------------:|:-------------------------:|:--------------------------------------------------------:|:-------------------------------------:|
|                         <center>[Neurodata Without Borders](https://nwb.org/)</center>                          |   <center>NWB</center>    |     <center>Neurophysiology<br>and behavior</center>     |     <center>HDF5<br>Zarr</center>     |
|                 <center>[Brain Imaging Data Structure](https://bids.neuroimaging.io/)</center>                  |   <center>BIDS</center>   |    <center>Neuroimaging<br>(MRI, EEG, etc.)</center>     | <center>NIfTI<br>JSON<br>TSV</center> |
|    <center>[Open Microscopy Environment](https://docs.openmicroscopy.org/ome-model/5.6.3/ome-tiff/)</center>    | <center>OME-TIFF</center> |           <center>Microscopy imaging</center>            |         <center>TIFF</center>         |
| <center>[Open Microscopy Environment<br>Next Generation File Format](https://ngff.openmicroscopy.org/)</center> | <center>OME-Zarr</center> | <center>Microscopy imaging<br>(cloud-optimized)</center> |         <center>Zarr</center>         |

These data standards are specifically designed to integrate multi-modal raw and processed neurodata alongside
behavioral data and metadata annotations.

The S3 bucket hosting the DANDI archive allows users to take advantage of cloud-native
services for scalable data access, computation, visualization, and analysis.

This allows DANDI to integrate with many external visualization tools, accessible via the "Open With" button on the web interface:
- [NWB: Neurosift](https://neurosift.app)
    - [Example: 000728/sub-495727000/sub-495727000_ses-51254258-StimC_behavior+image+ophys.nwb](https://neurosift.app/nwb?url=https://api.dandiarchive.org/api/assets/0205b9b1-10c4-467c-b027-20bbbfcce3a0/download/&dandisetId=001172&dandisetVersion=0.260129.0829)
- [OME: Neuroglancer](https://github.com/google/neuroglancer)
    - [Example: 000026/sub-I58/ses-Hip-CT/micr/](https://neuroglancer-demo.appspot.com/#!%7B%22dimensions%22:%7B%22z%22:%5B0.00001513%2C%22m%22%5D%2C%22y%22:%5B0.00001513%2C%22m%22%5D%2C%22x%22:%5B0.00001513%2C%22m%22%5D%7D%2C%22position%22:%5B5257.03564453125%2C4706%2C4218.56396484375%5D%2C%22crossSectionScale%22:21.48356465187443%2C%22projectionScale%22:16384%2C%22layers%22:%5B%7B%22type%22:%22image%22%2C%22source%22:%22https://dandiarchive.s3.amazonaws.com/zarr/5c37c233-222f-4e60-96e7-a7536e08ef61%22%2C%22tab%22:%22rendering%22%2C%22shaderControls%22:%7B%22normalized%22:%7B%22range%22:%5B23257%2C24764%5D%2C%22window%22:%5B22877%2C25144%5D%7D%7D%2C%22name%22:%22798b8b1b-c88d-42e8-91f8-247fd4282fe7%22%7D%5D%2C%22selectedLayer%22:%7B%22visible%22:true%2C%22layer%22:%22798b8b1b-c88d-42e8-91f8-247fd4282fe7%22%7D%2C%22layout%22:%224panel%22%7D)

<div style="background:#e9e2f8; border:1px solid #000000; border-radius:2px; padding:2px 5px; margin:25px 0 5px 0px;
color:#000000;
width:fit-content; ">
    <div style="font-weight:bold;">🧠 Learn more</div>
    <div style="margin-top:2px;">
        For readers interested in exploring more tools compatible with DANDI-supported data formats, refer to:
    </div>
    <ul style="margin-top:5px; margin-bottom:5px;">
        <li><a href="https://nwb-overview.readthedocs.io/en/latest/tools/analysis_tools_home.html#analysis-and-visualization-tools" style="font-weight:bold;">NWB: Analysis Tools</a></li>
        <li><a href="https://bids.neuroimaging.io/tools/others.html#analysis" style="font-weight:bold;">BIDS: Analysis Tools</a></li>
    </ul>
</div>

*The difference between data formats and standards is elaborated in greater detail in the [Data
Standards](https://docs.dandiarchive.org/getting-started/data-standards/#data-standards) section of the documentation.

<!-- DATA PROVIDER INSTRUCTIONS
The goal of this section is to demonstrate loading a portion of data from your dataset, and reveal something about its structure.
1. Load an object from S3
2. Show the structure of data in the object
DATA PROVIDER INSTRUCTIONS -->

### Q: How do I access the contents of a Dandiset?

Data contents from assets on the DANDI archive can either be downloaded or streamed directly from S3.

In [None]:
# Look up a specific file asset in the same Dandiset
dandiset = client.get_dandiset(dandiset_id="000728")
dandi_filename = "sub-491604983/sub-491604983_ses-501560436-StimC_behavior+image+ophys.nwb"
asset = dandiset.get_asset_by_path(path=dandi_filename)

# Download entire file (alter the base directory as needed)
output_path = Path.cwd() / Path(dandi_filename).name
asset.download(filepath=output_path)

To open the file after the download completes, we can use the [PyNWB](https://pynwb.readthedocs.io/en/stable/)
library to read the NWB file and display the basic content layout.

In [None]:
nwbfile = read_nwb(path=output_path)
print(nwbfile)

When running the notebook in compatible environments (_e.g._, Jupyter), you can also interact with the file tree.

In [None]:
nwbfile

Specific data arrays can be accessed by traversing the NWB file structure.

For example, the $\Delta F/F$ time series derived from the raw two-photon calcium imaging can be found under the
'processing' module.

In [None]:
df_over_f_array = nwbfile.processing["ophys"]["DfOverF"]["DfOverF"].data

# Get a subset of the data for visualization
# Note that the `df_over_f_array` has shape `number of frames x number of regions of interest`
# reflecting dimensions of `time x ROIs`
time_series_data = df_over_f_array[:1000, :5]

plt.figure(figsize=(12, 6))
for i in range(time_series_data.shape[1]):
    plt.plot(time_series_data[:, i], alpha=0.7)

plt.xlabel('Time (frames)')
plt.ylabel('ΔF/F')
plt.title('Calcium Imaging Time Series (ΔF/F)')
plt.show()

Note that all data access when reading from an NWB file is 'lazy' in the sense that data arrays are not read into memory
until explicitly requested via slicing operations.

This is particularly useful when working with large (> 60 GB) datasets that
may not otherwise fit into memory.
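
To make the lazy behavior concrete, here is a minimal sketch using `h5py` directly on a small synthetic file rather than a real NWB asset. The dataset name `DfOverF` and the random values are placeholders chosen for illustration only.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as temporary_directory:
    file_path = Path(temporary_directory) / "lazy_demo.h5"

    # Write a small synthetic (frames x ROIs) array; the values are random, not real data
    synthetic_data = np.random.default_rng(seed=0).random((2000, 10))
    with h5py.File(file_path, mode="w") as h5_file:
        h5_file.create_dataset(name="DfOverF", data=synthetic_data, chunks=(500, 5))

    with h5py.File(file_path, mode="r") as h5_file:
        dataset = h5_file["DfOverF"]  # a handle only: no values have been read yet
        handle_type = type(dataset).__name__
        subset = dataset[:1000, :5]  # slicing triggers the actual read from disk
        subset_type = type(subset).__name__
        subset_shape = subset.shape

print(handle_type, subset_type, subset_shape)  # Dataset ndarray (1000, 5)
```

Only the requested slice is materialized as a NumPy array, which is why the same pattern scales to files far larger than available memory.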

Additionally, some files on the DANDI archive can be quite large (up to TB-size files in multi-TB Dandisets)!

Instead of downloading these, you can stream data directly from S3 using any of the [libraries supported by PyNWB](https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/streaming.html).

The previous command can be adapted to stream directly from S3.

In [None]:
s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
rem_file = remfile.File(url=s3_url)
h5py_file = h5py.File(name=rem_file, mode="r")
io = NWBHDF5IO(file=h5py_file)
streamed_nwbfile = io.read()

streamed_df_over_f_array = streamed_nwbfile.processing["ophys"]["DfOverF"]["DfOverF"].data

streamed_time_series_data = streamed_df_over_f_array[:1000, :5]

plt.figure(figsize=(12, 6))
for i in range(streamed_time_series_data.shape[1]):
    plt.plot(streamed_time_series_data[:, i], alpha=0.7)

plt.xlabel('Time (frames)')
plt.ylabel('ΔF/F')
plt.title('Calcium Imaging Time Series (ΔF/F)')
plt.show()

<!-- DATA PROVIDER INSTRUCTIONS
The goal here is to visualize some aspect of your dataset in order to help users understand it. In addition to helping users of your dataset understand the dataset, an additional goal is to impress!

Please demonstrate any data preprocessing or reshaping required for your visualization(s).

https://www.reddit.com/r/dataisbeautiful/ for inspiration.
DATA PROVIDER INSTRUCTIONS -->

### Q: A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.

Using the same Dandiset as the previous section, we can visualize the connection between the segmentation and the
cellular imaging by overlaying the summary image with the ROIs.

In [None]:
summary_image = nwbfile.processing["ophys"]["SummaryImages"]["mean_image"].data

plt.figure(figsize=(10, 8))
plt.imshow(summary_image[:], cmap='gray')
plt.title('Mean Summary Image of Imaging Field')
plt.xlabel('X (pixels)')
plt.ylabel('Y (pixels)')
plt.colorbar(label='Intensity')
plt.tight_layout()
plt.show()

In [None]:
summary_image = nwbfile.processing["ophys"]["SummaryImages"]["maximum_intensity_projection"]
summary_image

In [None]:
nwbfile.processing["ophys"]["ImageSegmentation"]["PlaneSegmentation"]

# TODO: cody finish making pretty overlay of ROI to max proj
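
One possible way to approach such an overlay is sketched below on synthetic arrays. It assumes each ROI is available as a 2D boolean mask with the same shape as the summary image (NWB plane segmentation tables can store these in an `image_mask` column, but whether this particular file does so is an assumption). The helper `build_roi_label_image` is hypothetical.

```python
import numpy as np


def build_roi_label_image(roi_masks: np.ndarray) -> np.ndarray:
    """Collapse boolean masks of shape (n_rois, height, width) into one label image.

    Background pixels are 0 and pixels of the k-th ROI are k (1-based).
    Where masks overlap, the later ROI wins.
    """
    labels = np.zeros(roi_masks.shape[1:], dtype=int)
    for roi_number, mask in enumerate(roi_masks, start=1):
        labels[mask] = roi_number
    return labels


# Synthetic stand-ins for real ROI masks: three non-overlapping discs of radius 5
height, width = 64, 64
rows, columns = np.mgrid[:height, :width]
centers = [(16, 16), (40, 24), (30, 50)]
roi_masks = np.array([(rows - r) ** 2 + (columns - c) ** 2 <= 5**2 for r, c in centers])

labels = build_roi_label_image(roi_masks)
overlay = np.ma.masked_equal(labels, 0)  # hide the background so only ROIs are drawn
print(np.unique(labels))  # [0 1 2 3]
```

The masked `overlay` can then be drawn on top of a grayscale summary image with a second, semi-transparent `plt.imshow(overlay, cmap='tab10', alpha=0.5)` call.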

The examples above showcase optophysiology data, but DANDI hosts diverse neurophysiology modalities.

Let's explore some other data types - such as electrophysiology!

In [None]:
# TODO: Heberto waveforms

DANDI doesn't just host neural data, either - it is quite common for Dandisets to include behavioral data as well.

Let's take a look at one of the most common behavioral data types - pose-estimated video!

### The International Brain Laboratory and Standardized Behavior

The [International Brain Laboratory (IBL)](https://www.internationalbrainlab.com/) is a collaboration of over 20 neuroscience laboratories working together to understand how the brain produces decisions. One of their key innovations is the use of **standardized behavioral protocols** across all participating laboratories.

In each IBL experiment, a mouse performs a visually-guided decision-making task while being recorded by three synchronized cameras:
- **Left camera**: captures facial features and left paw movements
- **Right camera**: captures facial features and right paw movements  
- **Body camera**: records the animal's posture from above

Rather than relying on physical markers attached to the animal, IBL uses [DeepLabCut](https://www.mackenziemathislab.org/deeplabcut) and [Lightning Pose](https://lightning-pose.readthedocs.io/) for markerless pose estimation. These deep learning methods automatically track anatomical landmarks (like paws, nose, tongue, and pupil) across video frames.

We can load and visualize this pose estimation data from an example NWB file in the IBL Dandiset using the `nwb-video-widgets` library.

The IBL data is available on DANDI as [Dandiset 000409](https://dandiarchive.org/dandiset/000409) - "IBL - Brain Wide Map".

In [None]:
# Install the nwb-video-widgets package for interactive pose visualization
!pip install -q nwb-video-widgets[dandi]

In [None]:
from nwb_video_widgets import NWBDANDIPoseEstimationWidget
from dandi.dandiapi import DandiAPIClient

# Connect to DANDI and get the IBL Dandiset
client = DandiAPIClient()
dandiset = client.get_dandiset("000409", "draft")

# IBL session with pose estimation data
session_eid = "64e3fb86-928c-4079-865c-b364205b502e"

# Find assets for this session
session_assets = [asset for asset in dandiset.get_assets() if session_eid in asset.path]
raw_asset = next((asset for asset in session_assets if "desc-raw" in asset.path), None)
processed_asset = next((asset for asset in session_assets if "desc-processed" in asset.path), None)

# Display pose estimation widget - streams video from S3 and overlays keypoints
NWBDANDIPoseEstimationWidget(
    asset=processed_asset,
    video_asset=raw_asset,
)

The widget above demonstrates how behavioral data in NWB format can be visualized alongside neural recordings. Each colored dot represents a tracked body part, with coordinates extracted frame-by-frame by the pose estimation model.

This integration of video, pose estimation, and neural data in a single standardized format (NWB) is a key feature that makes DANDI datasets suitable for studying brain-behavior relationships at scale.

<!-- DATA PROVIDER INSTRUCTIONS
This section is less prescriptive / freeform than previous sections. The goal here is to show an opinionated example of answering a question using your data. The scale of your dataset may preclude a full example, and so feel free to limit the scope of this example (e.g. work on a subset of data). Users should be able to replicate your example in this notebook, and get a sense of how they would scale up.

A "toy" example is better than no example.

Ideally, your example would:
1. Transmit some of your domain & dataset experience to the reader, drawing on your own work as much as possible
2. Provide a jumping off point for users to extend your work, and do novel work of their own.

DATA PROVIDER INSTRUCTIONS -->

### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?

Focusing on the optophysiology example used above - the Visual Coding project by the Allen Institute - one question
that was addressed involves characterizing population-level response characteristics across
visual cortex.


Multiple stimuli (natural scenes, drifting gratings, static gratings) were presented to each subject over the course
of the experiment, and different structures within the visual cortex were targeted across subjects. The neural
responses during each presentation were then characterized to show differing response properties across visual areas.
This demonstrated that different cortical layers have distinct response properties and tuning characteristics. The
experiments also quantified how correlated activity between neurons affects information coding by showing that noise
correlations are stronger between neurons with similar tuning properties. Additional findings demonstrate that
correlations are modulated by behavioral state (running vs. stationary).
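
To make the distinction between tuning similarity and noise correlation concrete, the sketch below computes both for two simulated neurons. This is one simple variant of the calculation on synthetic data, not the Allen Institute's analysis pipeline; all names and parameter values are illustrative assumptions.

```python
import numpy as np


def signal_and_noise_correlation(responses_a: np.ndarray, responses_b: np.ndarray) -> tuple[float, float]:
    """Return (signal, noise) correlation for two neurons.

    Each input has shape (n_stimuli, n_trials): one response per trial of each stimulus.
    """
    # Tuning curve: mean response to each stimulus
    tuning_a = responses_a.mean(axis=1)
    tuning_b = responses_b.mean(axis=1)
    signal_corr = np.corrcoef(tuning_a, tuning_b)[0, 1]

    # Trial-to-trial residuals after removing the stimulus-driven mean
    residual_a = (responses_a - tuning_a[:, np.newaxis]).ravel()
    residual_b = (responses_b - tuning_b[:, np.newaxis]).ravel()
    noise_corr = np.corrcoef(residual_a, residual_b)[0, 1]
    return float(signal_corr), float(noise_corr)


# Synthetic neurons with similar tuning and a shared trial-to-trial fluctuation
rng = np.random.default_rng(seed=0)
n_stimuli, n_trials = 8, 50
tuning = np.linspace(0.0, 7.0, n_stimuli)[:, np.newaxis]
shared_noise = rng.normal(size=(n_stimuli, n_trials))
neuron_a = tuning + shared_noise + 0.5 * rng.normal(size=(n_stimuli, n_trials))
neuron_b = 2.0 * tuning + shared_noise + 0.5 * rng.normal(size=(n_stimuli, n_trials))

signal_corr, noise_corr = signal_and_noise_correlation(neuron_a, neuron_b)
print(f"signal correlation: {signal_corr:.2f}, noise correlation: {noise_corr:.2f}")
```

Applied to real data, the inputs would be per-trial response summaries (for example, mean ΔF/F in a window after stimulus onset) for each ROI.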

A full reproducible analysis of this work can be found through its very detailed [example notebook](https://github.com/dandi/example-notebooks/blob/master/000728/AllenInstitute/visual_coding_ophys_tutorial.ipynb).

<!-- DATA PROVIDER INSTRUCTIONS
This section is, like the previous one, intended to be freeform / non-prescriptive. The goal here is to provide a challenge to the community to do something novel with your dataset. That can either be novel in terms of the task, or novel in terms of methodological or computational approach.

Another way to consider this section, is as a wishlist. If you were less constrained by time, cost, skill, etc., what would you like to see achieved using these data? 

The challenge should, however, be somewhat realistic. A challenge that assumes e.g. original data collection, is likely to go unanswered.
DATA PROVIDER INSTRUCTIONS -->

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

One such question: how do different visual cortical areas (V1, LM, AL, PM) coordinate their activity over time
during naturalistic scene viewing, and can we identify temporal "routing" patterns that predict behavioral state transitions?

While the Visual Coding dataset(s) have characterized individual area responses, differences across cell types, and
other correlations, the temporal dynamics of information flow between areas during natural scene processing remain
less explored - particularly how running vs. stationary states modulate inter-area communication.

It is worth mentioning in this context that the NWB group hosts a regular [NeuroDataReHack event](https://nwb.org/events/hck26-2026-janelia-ndrh/)
where researchers are brought together to work precisely on such questions of how to analyze existing datasets in
novel ways, rather than running entirely new experiments. Check the [NWB Events](https://nwb.org/events/) page and
sign up for the newsletter to stay informed about these kinds of events!