## BIDS_files

### What is BIDS?

BIDS (Brain Imaging Data Structure) is a naming convention for files and folders. See https://bids-specification.readthedocs.io/en/stable/
It determines where files are stored and how they are named.

BIDS_files are a datatype that automatically incorporates the BIDS convention.

BIDS_files can automatically create a BIDS-compliant dataset and provide objects to filter and loop through it.

In [1]:
# Let's import the class that represents a whole dataset.
from TPTBox import BIDS_Global_info

# TODO Replace /media/data/robert/datasets/spinegan_T2w/raw/ with the path to a BIDS-compliant dataset that contains rawdata and derivatives.
# You can parse multiple datasets and select which parent folders are read (e.g. rawdata, derivatives)
bids_global_object = BIDS_Global_info(
    ["/media/data/robert/datasets/spinegan_T2w/raw/"],
    ["rawdata", "rawdata_dixon", "derivatives"],
    additional_key=["sequ", "seg", "ovl"],
    verbose=True,
)
# The parser informs you about every non-standard file. To avoid the message that a key is not valid, add the key to the additional_key list.

[!] Dataset  does not start with 'dataset-'


FileNotFoundError: /media/data/robert/datasets/spinegan_T2w/raw/

### How to iterate through a BIDS dataset?

BIDS splits data samples roughly into:
- Subjects: different patients
- Sessions: one patient can have multiple scans

Use enumerate_subjects to iterate over the unique subjects.
Then you can use queries to apply various filters. With flatten=True, you filter individual files rather than groups (families) of files.

In [None]:
# First loop: Loop over subjects
for subject_name, subject_container in bids_global_object.enumerate_subjects(sort=True):
    # Let's filter out information we don't want
    # and search only for CT images.

    # Start the search; you can run multiple independent queries.
    query = subject_container.new_query(flatten=True)
    # flatten=True: filter individual files rather than groups of files (file families).

    # This call removes all files that do not end with "_ct.[filetype]".
    query.filter("format", "ct")
    # Remove all files that are not available as NIfTI (nii.gz).
    query.filter("Filetype", "nii.gz")

    # Now we can loop over the CT files.
    for bids_file in query.loop_list(sort=True):
        # Finally we get a bids_file.
        print("CT BIDS file:", bids_file)
        # We will look at bids_files more closely soon; for now, just open the NIfTI file as a nibabel image.
        nii = bids_file.open_nii()
        print("shape of nii-file:", nii.shape)
        break
    break

# What is a BIDS_file?


Terminology:

A BIDS-conformant path looks like this:
[dataset                                   ]/[  parent   ]/[  subpath ]/[     file_name                                 ]

Example:
/media/data/robert/datasets/spinegan_T2w/raw/rawdata_dixon/spinegan0001/sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz
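
The four path parts can be illustrated with a few lines of plain Python. This is only a sketch and not how TPTBox implements it: the function name decompose_bids_path is hypothetical, and it assumes that the parent folder is the first path component found in a known list of parent folder names.

```python
from pathlib import PurePosixPath


def decompose_bids_path(path, parents):
    """Split a path into (dataset, parent, subpath, file_name).

    Hypothetical helper for illustration; not part of TPTBox.
    """
    parts = PurePosixPath(path).parts
    # Look for the first component that is a known parent folder (e.g. "rawdata").
    for i, part in enumerate(parts[:-1]):
        if part in parents:
            dataset = str(PurePosixPath(*parts[:i]))
            subpath = "/".join(parts[i + 1 : -1])
            return dataset, part, subpath, parts[-1]
    raise ValueError(f"no known parent folder in {path}")


example = "/media/data/robert/datasets/spinegan_T2w/raw/rawdata_dixon/spinegan0001/sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz"
dataset, parent, subpath, file_name = decompose_bids_path(example, ["rawdata", "rawdata_dixon", "derivatives"])
print(dataset)    # everything before the parent folder
print(parent)     # rawdata_dixon
print(subpath)    # spinegan0001
print(file_name)  # the BIDS file name itself
```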

A file name contains all the information needed to find related files.
Let's look at this file.

"sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz"

The ending consists of a filetype and a format:

filetype: "nii.gz"
format: ct

The rest are key-value pairs (stored in info), separated by "_", of the form <key>-<value>.
For example, "sub-spinegan0001" means the key is "sub" (short for subject) and its value is "spinegan0001".

The above sample filename yields:

sub : spinegan0001 <- must be the first key
ses : 20220527
sequ: 202
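
The same decomposition can be reproduced in plain Python. The sketch below is only an illustration of the naming scheme, not the parser used by BIDS_files. The function name parse_bids_name is hypothetical, and it assumes that the filetype is everything after the first "." and that the format is the last "_"-separated token.

```python
def parse_bids_name(file_name):
    """Return (info, format, filetype) for a BIDS-style file name.

    Hypothetical helper for illustration; not part of TPTBox.
    """
    # Everything after the first "." is the filetype, e.g. "nii.gz".
    stem, _, filetype = file_name.partition(".")
    # The last "_"-separated token is the format; the others are key-value pairs.
    *pairs, bids_format = stem.split("_")
    info = {}
    for pair in pairs:
        key, _, value = pair.partition("-")
        info[key] = value
    return info, bids_format, filetype


info, bids_format, filetype = parse_bids_name("sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz")
print(info)         # {'sub': 'spinegan0001', 'ses': '20220527', 'sequ': '202'}
print(bids_format)  # ct
print(filetype)     # nii.gz
```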


In [None]:
# Let's find this information in the BIDS_file

print("\nFull file name")
print(bids_file.file["nii.gz"])
print("\nfiletypes")
print(bids_file.file.keys())
print("\nformat")
print(bids_file.format)
print("\nkey-value")
print(bids_file.info)

print("\n\nparent")
print(bids_file.get_parent("nii.gz"))
print("\nthe 4 path parts")
print(bids_file.get_path_decomposed())

# File family

Everyone needs a family! 
Files that are generated from others should belong to a family. We automatically find related files and cluster them into a dictionary.


In [2]:
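# Import from the TPTBox package used in the rest of this notebook (older versions exposed this module as BIDS.bids_constants)
from TPTBox.core.bids_constants import sequence_splitting_keys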

print("We consider a file not to be in the same family if at least one key in this list differs:")
print(sequence_splitting_keys)

We consider a file not to be in the same family if at least one key in this list differs:
['ses', 'sequ', 'acq', 'hemi', 'sample', 'ce', 'trc', 'stain', 'rec', 'res', 'run', 'desc', 'split']
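
To make the family rule concrete, here is a small sketch in plain Python. It is not the TPTBox implementation. The names family_id and group_families are hypothetical, the key list is copied from the output above, and the second and third file names are made up for illustration. Files end up in the same family when they share the subject and every splitting key; other keys such as seg do not split a family.

```python
from collections import defaultdict

# Copied from the output above.
SPLITTING_KEYS = ["ses", "sequ", "acq", "hemi", "sample", "ce", "trc", "stain", "rec", "res", "run", "desc", "split"]


def family_id(file_name):
    """Build a grouping key from the subject and all splitting keys (hypothetical helper)."""
    stem = file_name.partition(".")[0]
    pairs = stem.split("_")[:-1]  # drop the trailing format token
    info = dict(pair.split("-", 1) for pair in pairs)
    return (info["sub"],) + tuple(info.get(key) for key in SPLITTING_KEYS)


def group_families(file_names):
    """Cluster file names into families (hypothetical helper)."""
    families = defaultdict(list)
    for name in file_names:
        families[family_id(name)].append(name)
    return dict(families)


files = [
    "sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz",
    "sub-spinegan0001_ses-20220527_sequ-202_seg-vert_msk.nii.gz",  # illustrative name: same family, "seg" does not split
    "sub-spinegan0001_ses-20220527_sequ-203_ct.nii.gz",  # illustrative name: different "sequ", so a new family
]
families = group_families(files)
for members in families.values():
    print(len(members), members)
```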


In [None]:
# First loop: Loop over subjects
for subject_name, subject_container in bids_global_object.enumerate_subjects(sort=True):
    # Let's search for CT images and related files

    query = subject_container.new_query(flatten=False)  # <- flatten=False means we search for families
    # This call removes all families that do not have at least one file that ends with "_ct.[filetype]"
    query.filter("format", "ct")
    # Let's require segmentations
    query.filter("seg", "vert")
    query.filter("seg", "subreg")

    # Now we can loop over the families that contain a CT.
    for bids_family in query.loop_dict(sort=True):
        # Finally we get a bids_family
        print("Files in this family:", bids_family.get_key_len())
        print(bids_family)
        break
    break

In [None]:
# We can now collect the individual files by using the short key. Note that a key can match multiple files.
# Usually the key is just the "format" tag.
ct_file = bids_family["ct"][0]
# There could be multiple CTs, so a list is always returned.
from TPTBox.core.bids_constants import sequence_naming_keys

print('These formats will be tagged on with "_", instead of replaced', sequence_naming_keys)
# so a ..._seg-vert_msk.nii.gz will get the key: msk_seg-vert
vert_seg = bids_family["msk_seg-vert"][0]

print(vert_seg.file["nii.gz"])
print(vert_seg)

## Let's generate a new file

We can get new data paths in BIDS format by using <bids_file>.get_changed_path()
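
The idea of "copy everything, change only what is named" can be sketched for the file name alone. This is an illustration in plain Python and not the logic of get_changed_path, which also handles the dataset, parent and subpath. The function name changed_file_name is hypothetical, and the order in which new keys are appended is an assumption.

```python
def changed_file_name(info, bids_format, filetype, new_info=None):
    """Build a BIDS-style file name, overriding or adding key-value pairs.

    Hypothetical helper for illustration; not part of TPTBox.
    """
    # Existing keys keep their position; new keys are appended at the end (assumption).
    merged = {**info, **(new_info or {})}
    pairs = "_".join(f"{key}-{value}" for key, value in merged.items())
    return f"{pairs}_{bids_format}.{filetype}"


ct_info = {"sub": "spinegan0001", "ses": "20220527", "sequ": "202"}
new_name = changed_file_name(ct_info, "msk", "nii.gz", new_info={"seg": "vert"})
print(new_name)
```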

In [None]:
# 1. Take an existing file
ct_file = bids_family["ct"][0]
# 2. Tell the bids file what should differ from the current file; the rest will be copied
path1 = ct_file.get_changed_path("nii.gz", format="msk", info={"seg": "vert"}, parent="derivatives")
print("Path1:", path1)
path2 = ct_file.get_changed_path("json", format="msk", info={"seg": "vert"}, path="ses-{ses}_sub-{sub}/{sequ}", parent="rawdata")
print("Path2:", path2)
print(type(path2))
# 3. Use it like a normal path

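# Running in true parallel

By default, Python code runs in a single thread. To process subjects in parallel, use joblib's Parallel, which with its default backend distributes the calls across worker processes. Here is an example; the per-subject work goes into a helper function.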

In [None]:
import random
import time

from joblib import Parallel, delayed
from TPTBox import BIDS_Global_info, Subject_Container

n_jobs = 10


def __helper(subj_name, subject: Subject_Container):
    time.sleep(random.random() * 0.1)
    # TODO: here is what it should do for each subject
    print(subj_name)


# initialize BIDS dataset
global_info = BIDS_Global_info(
    ["/media/data/robert/datasets/spinegan_T2w/raw"],
    ["sourcedata", "rawdata", "rawdata_ct", "rawdata_dixon", "derivatives"],
    additional_key=["sequ", "seg", "ovl", "e"],
    clear=True,
)

# Call Parallel, which starts n_jobs workers (processes with joblib's default backend) that call __helper() for each subject in the dataset
Parallel(n_jobs=n_jobs)(delayed(__helper)(subj_name, subject) for subj_name, subject in global_info.enumerate_subjects(sort=True))
print("finished")