# Introduction to `pybids`

---
## Table of contents
- [The `BIDSLayout`](#layout)
- [Querying the `BIDSLayout`](#query)
  - [Filtering files by entities](#filtering_files)
  - [Filtering by metadata](#filtering_metadata)
  - [Other `return_type` values](#other_return_types)
  - [Other `get()` options](#other_get_options)
- [The `BIDSFile`](#bids_file)
- [Retrieving BIDS variables](#retrieving_variables)
  


---

[`pybids`](https://github.com/bids-standard/pybids) is a tool to query, summarize and manipulate data using the BIDS standard. 

In this tutorial we will use a `pybids` test dataset to illustrate some of the functionality of `pybids.layout`.

---
<a id='layout'></a>
## The `BIDSLayout`

At the core of pybids is the `BIDSLayout` object. 

A `BIDSLayout` is a lightweight Python class that represents a BIDS project file tree and provides a variety of helpful methods for querying and manipulating BIDS files. 

While the `BIDSLayout` initializer has a large number of arguments you can use to control the way files are indexed and accessed, you will most commonly initialize a `BIDSLayout` by passing in the BIDS dataset root location as a single argument:

In [2]:
from bids import BIDSLayout
from bids.tests import get_test_data_path
import os

In [3]:
# Here we're using an example BIDS dataset that's bundled with the pybids tests
data_path = os.path.join(get_test_data_path(), "7t_trt")
print(f"Initializing layout with: {data_path}")

# Initialize the layout
layout = BIDSLayout(data_path)

# Print some basic information about the layout
layout

Initializing layout with: /home/remi/github/pybids/bids/tests/data/7t_trt


BIDS Layout: .../pybids/bids/tests/data/7t_trt | Subjects: 10 | Sessions: 20 | Runs: 20

---
<a id='query'></a>
## Querying the `BIDSLayout`

When we initialize a `BIDSLayout`, all of the files and metadata found under the specified root folder are indexed. 

This can take a few seconds (or, for very large datasets, a minute or two). 

Once initialization is complete, we can start querying the `BIDSLayout` in various ways. 

The workhorse method is [`.get()`](https://bids-standard.github.io/pybids/generated/bids.grabbids.BIDSLayout.html#bids.grabbids.BIDSLayout.get). 

If we call `.get()` with no additional arguments, we get back a list of all the BIDS files in our dataset:

In [36]:
all_files = layout.get()
print(f"There are {len(all_files)} files in the layout.")
print("\nThe first 10 files are:")
all_files[:10]

There are 339 files in the layout.

The first 10 files are:


[<BIDSJSONFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/dataset_description.json'>,
 <BIDSDataFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/participants.tsv'>,
 <BIDSFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/README'>,
 <BIDSImageFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1map.nii.gz'>,
 <BIDSImageFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1w.nii.gz'>,
 <BIDSImageFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude1.nii.gz'>,
 <BIDSImageFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude2.nii.gz'>,
 <BIDSJSONFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_phasediff.json'>,
 <BIDSImageFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub

The returned object is a Python list. 

By default, each element in the list is a `BIDSFile` object. We discuss the `BIDSFile` object in much more detail [below](#bids_file). 

For now, let's simplify things and work with just filenames:

In [16]:
layout.get(return_type="filename")[:10]

['/home/remi/github/pybids/bids/tests/data/7t_trt/dataset_description.json',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/participants.tsv',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/README',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1map.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/anat/sub-01_ses-1_T1w.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude1.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_magnitude2.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_phasediff.json',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-1_phasediff.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/fmap/sub-01_ses-1_run-2_magnitude1.nii.gz']

This time, we get back only the names of the files.

<a id='filtering_files'></a>
### Filtering files by entities

The utility of the `BIDSLayout` would be pretty limited if all we could do was retrieve a list of all files in the dataset. 

Fortunately, the `.get()` method accepts all kinds of arguments that allow us to filter the result set based on specified criteria. 

In fact, we can pass *any* BIDS-defined keywords (or, as they're called in PyBIDS, *entities*) as constraints. 

For example, here's how we would retrieve all BOLD runs with `.nii.gz` extensions for subject `'01'`:

In [17]:
# Retrieve filenames of all BOLD runs for subject 01
layout.get(subject="01", extension="nii.gz", suffix="bold", return_type="filename")

['/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-2_bold.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-prefrontal_bold.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-fullbrain_run-1_bold.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-fullbrain_run-2_bold.nii.gz',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-prefrontal_bold.nii.gz']

If you're wondering what entities you can pass in as filtering arguments, the answer is contained in the `.json` configuration files [housed here](https://github.com/bids-standard/pybids/tree/master/bids/layout/config). 

To save you the trouble, here are a few of the most common entities:

* `suffix`: The part of a BIDS filename just before the extension (e.g., `'bold'`, `'events'`, `'physio'`)
* `subject`: The subject label
* `session`: The session label
* `run`: The run index
* `task`: The task name

New entities are continually being defined as the spec grows, and in principle (though not always in practice), PyBIDS should be aware of all entities that are defined in the BIDS specification.
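To make the idea of entities concrete, here is a minimal sketch of how key-value pairs in a BIDS filename map onto entity names. It is only an illustration: `parse_entities` and `KEY_TO_ENTITY` are hypothetical names invented for this example, the mapping covers just a handful of keys, and pybids itself derives entities from the configuration files linked above rather than from code like this.

```python
# Illustrative subset of filename keys mapped to entity names.
# This mapping is an assumption for the sketch, not pybids' configuration.
KEY_TO_ENTITY = {
    "sub": "subject",
    "ses": "session",
    "task": "task",
    "acq": "acquisition",
    "run": "run",
}


def parse_entities(filename):
    """Split a BIDS-style filename into a dict of entities (hypothetical helper)."""
    # Everything after the first dot is treated as the extension
    stem, _, ext = filename.partition(".")
    parts = stem.split("_")
    entities = {}
    # All parts except the last are key-value pairs such as "sub-01"
    for part in parts[:-1]:
        key, _, value = part.partition("-")
        if key in KEY_TO_ENTITY:
            # Store the run index as an integer, other labels as strings
            entities[KEY_TO_ENTITY[key]] = int(value) if key == "run" else value
    # The last part of the stem is the suffix
    entities["suffix"] = parts[-1]
    entities["extension"] = "." + ext if ext else ""
    return entities


example = parse_entities("sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz")
print(example)
```

Each key in the resulting dictionary corresponds to a keyword you could pass to `.get()` as a filter.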

<a id='filtering_metadata'></a>
### Filtering by metadata
All of the entities listed above are found in the names of BIDS files. 

But sometimes we want to search for files based not just on their names, but also based on metadata defined (per the BIDS spec) in JSON files. 

Fortunately for us, when we initialize a `BIDSLayout`, all metadata files associated with BIDS files are automatically indexed. 

This means we can pass any key that occurs in any JSON file in our project as an argument to `.get()`. 

We can combine these with any number of core BIDS entities (like `subject`, `run`, etc.). 

For example, say we want to retrieve all files where:
- the value of `SamplingFrequency` (a metadata key) is `100`, 
- the `acquisition` type is `"prefrontal"`, 
- the subject is `"01"` or `"02"`.

Here's how we can do that:

In [18]:
# Retrieve all files where SamplingFrequency (a metadata key) = 100
# and acquisition = prefrontal, for the first two subjects
layout.get(subject=["01", "02"], SamplingFrequency=100, acquisition="prefrontal")

[<BIDSDataFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-prefrontal_physio.tsv.gz'>,
 <BIDSDataFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-2/func/sub-01_ses-2_task-rest_acq-prefrontal_physio.tsv.gz'>,
 <BIDSDataFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-02/ses-1/func/sub-02_ses-1_task-rest_acq-prefrontal_physio.tsv.gz'>,
 <BIDSDataFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-02/ses-2/func/sub-02_ses-2_task-rest_acq-prefrontal_physio.tsv.gz'>]

Notice that we passed a list in for `subject` rather than just a string. 

This principle applies to all filters: you can always pass in a list instead of a single value, and this will be interpreted as a logical disjunction (i.e., a file must match any one of the provided values).
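The matching rule described above (any listed value may match within one filter, while all filters must hold together) can be sketched in plain Python. This is a simplified model, not pybids' implementation: the `matches` helper and the small `files` list of dictionaries are made up for illustration.

```python
def matches(entities, **filters):
    """Return True if `entities` satisfies every filter (hypothetical helper)."""
    for key, wanted in filters.items():
        # A single value is treated like a one-element list
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        # Within one filter, matching any allowed value is enough (logical OR)
        if entities.get(key) not in allowed:
            # Across filters, every one must match (logical AND)
            return False
    return True


# Made-up records standing in for indexed files
files = [
    {"subject": "01", "acquisition": "prefrontal", "SamplingFrequency": 100},
    {"subject": "02", "acquisition": "fullbrain", "SamplingFrequency": 100},
    {"subject": "03", "acquisition": "prefrontal", "SamplingFrequency": 100},
]

selected = [
    f
    for f in files
    if matches(
        f, subject=["01", "02"], SamplingFrequency=100, acquisition="prefrontal"
    )
]
print(selected)
```

Only the first record is kept: the second fails the `acquisition` filter and the third fails the `subject` filter.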

<a id='other_return_types'></a>
### Other `return_type` values

While we'll typically want to work with either `BIDSFile` objects or filenames, we can also ask `get()` to return unique values (or ids) of particular entities. 

For example, say we want to know which subjects have at least one `T1w` file. 

We can request that information by setting `return_type='id'`. 

When using this option, we also need to name the entity (or metadata key) whose values we want, via the `target` argument.

This combination tells the `BIDSLayout` to return the unique values for the specified `target` entity. 

In the next example, we ask for all of the unique subject IDs that have at least one file with a `T1w` suffix:

In [19]:
# Ask get() to return the ids of subjects that have T1w files
layout.get(return_type="id", target="subject", suffix="T1w")

['02', '10', '06', '04', '01', '03', '05', '08', '07', '09']

If our `target` is a BIDS entity that corresponds to a particular directory in the BIDS spec (e.g., `subject` or `session`) we can also use `return_type='dir'` to get all matching subdirectories:

In [20]:
layout.get(return_type="dir", target="subject")

['/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-02',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-03',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-04',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-05',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-06',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-07',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-08',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-09',
 '/home/remi/github/pybids/bids/tests/data/7t_trt/sub-10']
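Conceptually, `return_type='id'` amounts to collecting the distinct values of the `target` entity among the files that pass the other filters. The sketch below models that idea on a made-up list of records; `unique_ids` is a hypothetical helper, and it sorts its result for readability, whereas the `.get()` output shown earlier is not sorted.

```python
# Made-up records standing in for indexed files
records = [
    {"subject": "01", "suffix": "T1w"},
    {"subject": "01", "suffix": "bold"},
    {"subject": "02", "suffix": "T1w"},
    {"subject": "03", "suffix": "bold"},
]


def unique_ids(records, target, **filters):
    """Collect distinct `target` values among matching records (hypothetical helper)."""
    values = {
        rec[target]
        for rec in records
        # Keep records that define the target and satisfy every filter
        if target in rec and all(rec.get(k) == v for k, v in filters.items())
    }
    # Sorting is a choice made for this sketch only
    return sorted(values)


subjects_with_t1w = unique_ids(records, target="subject", suffix="T1w")
print(subjects_with_t1w)
```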

<a id='other_get_options'></a>
### Other `get()` options
The `.get()` method has a number of other useful arguments that control its behavior. 

We won't discuss these in detail here, but briefly, here are a couple worth knowing about:
* `regex_search`: If you set this to `True`, string filter argument values will be interpreted as regular expressions.
* `scope`: If your BIDS dataset contains BIDS-derivatives sub-datasets, you can specify the scope (e.g., `derivatives`, or a BIDS-Derivatives pipeline name) of the search space.
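
As a rough illustration of what `regex_search` changes, the sketch below contrasts exact matching with regular-expression matching on a made-up list of subject labels. The `select` helper is hypothetical, and the use of Python's `re.search` (which matches anywhere in the string) is an assumption of this sketch; check the pybids documentation for the exact matching rules it applies.

```python
import re

# Made-up subject labels for illustration
subjects = ["01", "02", "10", "11"]


def select(values, pattern, regex_search=False):
    """Filter `values` by exact equality or by regex (hypothetical helper)."""
    if regex_search:
        # Assumption: search-style matching, not anchored to the full string
        return [v for v in values if re.search(pattern, v)]
    # Without regex_search the pattern is compared literally
    return [v for v in values if v == pattern]


exact = select(subjects, "0[12]")
by_regex = select(subjects, "0[12]", regex_search=True)
print(exact, by_regex)
```

With exact matching nothing is selected, because no label is literally `0[12]`; with regular-expression matching the labels `01` and `02` are selected.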

---
<a id='bids_file'></a>
## The `BIDSFile`

When you call `.get()` on a `BIDSLayout`, the default returned values are objects of class `BIDSFile`. 

A `BIDSFile` is a lightweight container for individual files in a BIDS dataset. It provides easy access to a variety of useful attributes and methods. 

Let's take a closer look. First, let's pick an arbitrary file from our existing `layout`.

In [21]:
# Pick the file at index 15 (the 16th file) in the dataset
bf = layout.get()[15]

# Print it
bf

<BIDSDataFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_physio.tsv.gz'>

Here are some of the attributes and methods available to us in a `BIDSFile`:
* `.path`: The full path of the associated file
* `.filename`: The associated file's filename (without directory)
* `.dirname`: The directory containing the file
* `.get_entities()`: Returns information about entities associated with this `BIDSFile` (optionally including metadata)
* `.get_image()`: Returns the file contents as a nibabel image (only works for image files)
* `.get_df()`: Get file contents as a pandas DataFrame (only works for TSV files)
* `.get_metadata()`: Returns a dictionary of all metadata found in associated JSON files
* `.get_associations()`: Returns a list of all files associated with this one in some way

**Note**: some of these are only available for certain subclasses of `BIDSFile`; e.g., you can't call `get_image()` on a `BIDSFile` that doesn't correspond to an image file!

Let's see some of these in action.

In [22]:
# Print all the entities associated with this file, and their values
bf.get_entities()

{'acquisition': 'fullbrain',
 'datatype': 'func',
 'extension': '.tsv.gz',
 'run': 1,
 'session': '1',
 'subject': '01',
 'suffix': 'physio',
 'task': 'rest'}

In [23]:
# Print all the metadata associated with this file
bf.get_metadata()

{'Columns': ['cardiac', 'respiratory', 'trigger', 'oxygen saturation'],
 'SamplingFrequency': 100,
 'StartTime': 0}

In [31]:
# We can get the union of both of the above in one shot like this
bf.get_entities(metadata="all")

{'Columns': ['cardiac', 'respiratory', 'trigger', 'oxygen saturation'],
 'SamplingFrequency': 100,
 'StartTime': 0,
 'acquisition': 'fullbrain',
 'datatype': 'func',
 'extension': '.tsv.gz',
 'run': 1,
 'session': '1',
 'subject': '01',
 'suffix': 'physio',
 'task': 'rest'}

Here are all the files associated with our target file in some way. 

**Note**: we get back both the JSON sidecar for our target file, and the BOLD run that our target file contains physiological recordings for.

In [25]:
bf.get_associations()

[<BIDSJSONFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/task-rest_acq-fullbrain_run-1_physio.json'>,
 <BIDSImageFile filename='/home/remi/github/pybids/bids/tests/data/7t_trt/sub-01/ses-1/func/sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz'>]

In cases where a file has a `.tsv.gz` or `.tsv` extension, it will automatically be created as a `BIDSDataFile`, and we can easily grab the contents as a pandas `DataFrame`:

In [32]:
# Use a different test dataset--one that contains physio recording files
data_path = os.path.join(get_test_data_path(), "synthetic")
layout2 = BIDSLayout(data_path)

# Get the first physiological recording file
recfile = layout2.get(suffix="physio")[0]

# Get contents as a DataFrame and show the first few rows
df = recfile.get_df()
df.head()

| | onset | respiratory | cardiac |
|---|---|---|---|
| 0 | 0.0 | -0.714844 | -0.262109 |
| 1 | 0.1 | -0.757342 | 0.048933 |
| 2 | 0.2 | -0.796851 | 0.355185 |
| 3 | 0.3 | -0.833215 | 0.626669 |
| 4 | 0.4 | -0.866291 | 0.83681 |


While it would have been easy enough to read the contents of the file ourselves with pandas' `read_csv()` method, notice that in the above example, `get_df()` saved us the trouble of having to read the physiological recording file's metadata, pull out the column names and sampling rate, and add timing information.
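
To give a feel for the timing information involved, here is a minimal sketch of how an onset could be derived for each sample from a sampling frequency and a start time. It is not pybids' code: `add_onsets` is a hypothetical helper, and the sampling frequency of 10 Hz is an assumption inferred from the 0.1 s steps in the `onset` column above.

```python
def add_onsets(rows, sampling_frequency, start_time=0.0):
    """Attach an onset in seconds to each sample (hypothetical helper)."""
    return [
        # Sample i is recorded i / sampling_frequency seconds after the start
        {"onset": start_time + index / sampling_frequency, **row}
        for index, row in enumerate(rows)
    ]


# First three samples copied from the output shown above
samples = [
    {"respiratory": -0.714844, "cardiac": -0.262109},
    {"respiratory": -0.757342, "cardiac": 0.048933},
    {"respiratory": -0.796851, "cardiac": 0.355185},
]

# Assumption: a sampling frequency of 10 Hz
timed = add_onsets(samples, sampling_frequency=10)
onsets = [row["onset"] for row in timed]
print(onsets)
```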

Mind you, if we don't *want* the timing information, we can ignore it:

In [27]:
recfile.get_df(include_timing=False).head()

| | respiratory | cardiac |
|---|---|---|
| 0 | -0.714844 | -0.262109 |
| 1 | -0.757342 | 0.048933 |
| 2 | -0.796851 | 0.355185 |
| 3 | -0.833215 | 0.626669 |
| 4 | -0.866291 | 0.83681 |


In [28]:
# Re-index the synthetic dataset, this time also indexing its derivatives folder
root = os.path.join(get_test_data_path(), "synthetic")
layout2 = BIDSLayout(root, derivatives=True)
layout2



BIDS Layout: ...bids/bids/tests/data/synthetic | Subjects: 5 | Sessions: 10 | Runs: 10

---
<a id='retrieving_variables'></a>
## Retrieving BIDS variables 
BIDS variables are stored in .tsv files at the run, session, subject, or dataset level. You can retrieve these variables with `layout.get_collections()`. The resulting objects can be converted to dataframes and merged with the layout to associate the variables with corresponding scans.

In the following example, we request all subject-level variable data available anywhere in the BIDS project, and merge the results into a single `DataFrame` (by default, we'll get back a single `BIDSVariableCollection` object for each subject). 

In [33]:
# Get subject variables as a dataframe and merge them back in with the layout
subj_df = layout.get_collections(level="subject", merge=True).to_df()
subj_df.head()

| | subject | session | CCPT_FN_count | CCPT_FP_count | CCPT_avg_FN_RT | CCPT_avg_FP_RT | CCPT_avg_succ_RT | CCPT_succ_count | caffeine_daily | diastolic_blood_pressure_left | ... | specific_vague | subject_id | surroundings | systolic_blood_pressure_left | systolic_blood_pressure_right | thirst | vigilance | vigilance_nyc-q | words | suffix |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.0 | 1.0 | | 507.0 | 500.770833 | 96.0 | 0.5 | 64 | ... | 95 | 1 | 0 | 108 | 109 | 9 | 9 | 100 | 100 | sessions |
| 1 | 1 | 2 | 0.0 | 1.0 | | 595.0 | 503.333333 | 96.0 | 2.0 | 66 | ... | 100 | 1 | 20 | 101 | 101 | 9 | 7 | 100 | 100 | sessions |
| 2 | 2 | 1 | 0.0 | 5.0 | | 297.6 | 351.729167 | 96.0 | 0.0 | 65 | ... | 100 | 2 | 70 | 99 | 100 | 2 | 7 | 100 | 100 | sessions |
| 3 | 2 | 2 | 1.0 | 0.0 | 0.0 | | 366.315789 | 95.0 | 0.0 | 75 | ... | 60 | 2 | 60 | 108 | 106 | 4 | 3 | 100 | 80 | sessions |
| 4 | 3 | 1 | 0.0 | 1.0 | | 441.0 | 426.71875 | 96.0 | 1.0 | 69 | ... | 100 | 3 | 10 | 122 | 128 | 3 | 8 | 100 | 0 | sessions |
