## BIDS_files

### What is BIDS?

BIDS (Brain Imaging Data Structure) is a naming convention for files and folders; see https://bids-specification.readthedocs.io/en/stable/
It determines where files are stored and how they are named.

BIDS_files is a datatype that automatically incorporates the BIDS convention.

It can automatically create a BIDS-compliant dataset and provides objects to filter and loop through its files.

In [12]:
# Let's import the class that represents a whole dataset.
from TPTBox import BIDS_Global_info

# TODO: Replace /DATA/NAS/datasets_processed/CT_spine/dataset-rsna/ with the path to a BIDS-compliant dataset, i.e. the folder that contains rawdata and derivatives.
# You can parse multiple datasets and select which parent folders are read (e.g. rawdata, derivatives).
ds_path = "/DATA/NAS/datasets_processed/CT_spine/dataset-rsna/"
bids_global_object = BIDS_Global_info(
    [ds_path],
    ["rawdata", "rawdata_dixon", "derivatives"],
    additional_key=["sequ", "seg", "ovl"],
    verbose=True,
)
# The parser reports every non-standard file. To silence the message that a key is not valid, add that key to the additional_key list.

Found: 8744   total= 87      

### How to iterate through a BIDS dataset?

BIDS splits data samples roughly into:
- Subject: different patients
- Sessions: one patient can have multiple scans

Use enumerate_subjects to iterate over the unique subjects.
Then you can use queries to apply various filters. With flatten=True, the filters apply to individual files rather than to a group (family) of files.

In [2]:
# First loop: Loop over subjects
for subject_name, subject_container in bids_global_object.enumerate_subjects(sort=True):
    # Let's filter out the information we don't want
    # and search only for CT images.

    # Start the search; you can start multiple independent queries.
    query = subject_container.new_query(flatten=True)
    # We want to filter individual files, not groups of files (file families), so we set flatten=True.

    # This call removes all files that do not end with "_ct.[filetype]"
    query.filter("format", "ct")
    # Let's remove all files that don't have a NIfTI file.
    query.filter("Filetype", "nii.gz")

    # Now we can loop over the CT files.
    for bids_file in query.loop_list(sort=True):
        # Finally, we get a bids_file
        print("CT BIDS file:", bids_file)
        # We will look at bids_files more closely soon; for now, let's just open the NIfTI image.
        nii = bids_file.open_nii()
        print("shape of nii-file:", nii.shape)
        break
    break

CT BIDS file: sub-10633_ct.['json', 'nii.gz']	 parent = rawdata
shape of nii-file: (512, 512, 429)
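
The effect of `flatten` in the query above can be pictured with plain Python. This is only a toy sketch on hand-written data, not TPTBox's implementation: the subject names `sub-A` and `sub-B` and the helper `is_ct` are made up for illustration. A flat filter keeps single files, while a family filter keeps every family that contains at least one matching file.

```python
# Toy data (hypothetical): each family maps a short key to its file names.
families = {
    "sub-A": {
        "ct": ["sub-A_ct.nii.gz"],
        "msk_seg-vert": ["sub-A_seg-vert_msk.nii.gz"],
    },
    "sub-B": {
        "msk_seg-vert": ["sub-B_seg-vert_msk.nii.gz"],
    },
}


def is_ct(file_name):
    """True if the name ends with "_ct.[filetype]"."""
    return file_name.split(".")[0].endswith("_ct")


# Like flatten=True: the result is a list of individual files.
flat = [
    name
    for family in families.values()
    for names in family.values()
    for name in names
    if is_ct(name)
]

# Like flatten=False: the result is whole families with at least one CT.
kept = {
    family_name: family
    for family_name, family in families.items()
    if any(is_ct(name) for names in family.values() for name in names)
}

print(flat)
print({family_name: list(family) for family_name, family in kept.items()})
```

Note that the family filter keeps the segmentation of `sub-A` alongside its CT, which is what makes families useful for finding related files.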


# What is a BIDS_file?


Terminology:

A BIDS-conformant path looks like this:
[dataset                  ]/[parent ]/[ subpath]/[     file_name                                 ]

Example:
/media/data/dataset-spinegan/rawdata/spinegan0001/sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz
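
The four parts of the example path can be recovered with plain `pathlib`. The sketch below is an illustration only and not how TPTBox does it; the function name `decompose_bids_path` and the tuple `KNOWN_PARENTS` are assumptions made for this example.

```python
from pathlib import PurePosixPath

# Assumption: the parent folder names that may appear below the dataset root.
KNOWN_PARENTS = ("rawdata", "derivatives")


def decompose_bids_path(path, parents=KNOWN_PARENTS):
    """Split a path into (dataset, parent, subpath, file_name)."""
    p = PurePosixPath(path)
    parts = p.parts
    # The first folder that is a known parent separates dataset and subpath.
    for i, part in enumerate(parts[:-1]):
        if part in parents:
            dataset = PurePosixPath(*parts[:i])
            subpath = "/".join(parts[i + 1 : -1])
            return dataset, part, subpath, p.name
    raise ValueError(f"no parent folder from {parents} found in {path}")


example = "/media/data/dataset-spinegan/rawdata/spinegan0001/sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz"
dataset, parent, subpath, file_name = decompose_bids_path(example)
print(dataset, parent, subpath, file_name, sep="\n")
```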

A file name carries all the information needed to relate it to other files.
Let's look at this one:

"sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz"

The ending consists of a filetype and a format:

- filetype: "nii.gz"
- bids_format: ct

The rest are key-value pairs (stored in info), separated by "_" and written as <key>-<value>.
For example, "sub-spinegan0001" means the key is "sub" (standing for subject) and its value is "spinegan0001".

The above sample filename yields:

- sub : spinegan0001 <- must be the first key
- ses : 20220527
- sequ: 202
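
The naming rules above are simple enough to sketch in a few lines of plain Python. The function `parse_bids_name` is a hypothetical helper written for this illustration; the BIDS_file class does this parsing for you and may handle more cases.

```python
def parse_bids_name(file_name):
    """Return (info, bids_format, filetype) for a BIDS-style file name."""
    # Everything after the first "." is the filetype, e.g. "nii.gz".
    stem, _, filetype = file_name.partition(".")
    # The last "_"-separated token is the format; the rest are key-value pairs.
    *pairs, bids_format = stem.split("_")
    info = {}
    for pair in pairs:
        key, sep, value = pair.partition("-")
        if not sep:
            raise ValueError(f"'{pair}' is not of the form <key>-<value>")
        info[key] = value
    # "sub" must be the first key.
    if next(iter(info), None) != "sub":
        raise ValueError("the first key must be 'sub'")
    return info, bids_format, filetype


info, bids_format, filetype = parse_bids_name("sub-spinegan0001_ses-20220527_sequ-202_ct.nii.gz")
print(info)
print(bids_format)
print(filetype)
```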


In [3]:
# Let's find this information in the BIDS_file

print("\nFull file name")
print(bids_file.file["nii.gz"])
print("\nfiletypes")
print(bids_file.file.keys())
print("\nformat")
print(bids_file.format)
print("\nkey-value")
print(bids_file.info)

print("\n\nparent")
print(bids_file.get_parent("nii.gz"))
print("\nthe 4 path parts")
print(bids_file.get_path_decomposed())


Full file name
/DATA/NAS/datasets_processed/CT_spine/dataset-rsna/rawdata/10633/ses-baseline/sub-10633_ct.nii.gz

filetypes
dict_keys(['json', 'nii.gz'])

format
ct

key-value
{'sub': '10633'}


parent
rawdata

the 4 path parts
(PosixPath('/DATA/NAS/datasets_processed/CT_spine/dataset-rsna'), 'rawdata', '10633/ses-baseline', 'sub-10633_ct.json')


# File family

Everyone needs a family! 
Files that are generated from others should belong to a family. We automatically find related files and cluster them into a dictionary.


In [4]:
from TPTBox.core.bids_constants import sequence_splitting_keys

print("Two files are not in the same family if they differ in at least one key from this list:")
print(sequence_splitting_keys)
print("You can change the splitting keys when initializing BIDS_Global_info")
sequence_splitting_keys = sequence_splitting_keys.copy()
sequence_splitting_keys.remove("run")

bids_global_object = BIDS_Global_info(
    [ds_path],
    ["rawdata", "rawdata_dixon", "derivatives"],
    additional_key=["sequ", "seg", "ovl"],
    verbose=True,
    sequence_splitting_keys=sequence_splitting_keys
)

Two files are not in the same family if they differ in at least one key from this list:
['ses', 'sequ', 'acq', 'hemi', 'sample', 'ce', 'trc', 'stain', 'res', 'dir', 'run', 'split', 'chunk']
You can change the splitting keys when initializing BIDS_Global_info
Found: 8744   total= 87      

In [5]:
# First loop: Loop over subjects
for subject_name, subject_container in bids_global_object.enumerate_subjects(sort=True):
    # Let's search for CT images and related files.

    query = subject_container.new_query(flatten=False)  # <- flatten=False means we search for families
    # This call removes all families that do not contain at least one file ending with "_ct.[filetype]"
    query.filter("format", "ct")
    # Let's also require segmentations
    query.filter("seg", "vert")
    query.filter("seg", "subreg")

    # Now we can loop over the file families.
    for bids_family in query.loop_dict(sort=True):
        # Finally, we get a bids_family
        print("Files in this family:", bids_family.get_key_len())
        print(bids_family)
        break

    break

Files in this family: {'ct': 1, 'msk_seg-subreg': 1, 'ctd_seg-subreg': 1, 'snp': 1, 'msk_seg-vert': 1}
ct                             : [sub-10633_ct.['json', 'nii.gz']	 parent = rawdata]
msk_seg-subreg                 : [sub-10633_seg-subreg_msk.['nii.gz']	 parent = derivatives]
ctd_seg-subreg                 : [sub-10633_seg-subreg_ctd.['json']	 parent = derivatives]
snp                            : [sub-10633_snp-cor_snp.['png']	 parent = derivatives]
msk_seg-vert                   : [sub-10633_seg-vert_msk.['nii.gz']	 parent = derivatives]
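
How such a family dictionary could be built can be sketched in plain Python. This is a simplified illustration under stated assumptions, not TPTBox's algorithm: the key lists are copied from the outputs in this tutorial, `parse` and `group_into_families` are hypothetical helpers, and the file `sub-10633_ses-2_ct.nii.gz` is invented to show a second family.

```python
from collections import defaultdict

# Copied from the outputs in this tutorial; TPTBox holds the authoritative lists.
SPLITTING_KEYS = ["ses", "sequ", "acq", "hemi", "sample", "ce", "trc",
                  "stain", "res", "dir", "run", "split", "chunk"]
NAMING_KEYS = ["seg", "label"]


def parse(file_name):
    """Return (info, bids_format) of a BIDS-style file name."""
    stem = file_name.split(".")[0]
    *pairs, bids_format = stem.split("_")
    return dict(pair.split("-", 1) for pair in pairs), bids_format


def group_into_families(file_names):
    families = defaultdict(lambda: defaultdict(list))
    for name in file_names:
        info, bids_format = parse(name)
        # Same family: same subject and same value for every splitting key.
        family_id = (info["sub"],) + tuple(info.get(k) for k in SPLITTING_KEYS)
        # Short key: the format, with naming keys tagged on via "_".
        tags = [f"{k}-{info[k]}" for k in NAMING_KEYS if k in info]
        families[family_id]["_".join([bids_format] + tags)].append(name)
    return families


families = group_into_families([
    "sub-10633_ct.nii.gz",
    "sub-10633_seg-subreg_msk.nii.gz",
    "sub-10633_seg-subreg_ctd.json",
    "sub-10633_snp-cor_snp.png",
    "sub-10633_seg-vert_msk.nii.gz",
    "sub-10633_ses-2_ct.nii.gz",  # hypothetical file from another session
])
for family_id, family in families.items():
    print(family_id[:2], {key: len(files) for key, files in family.items()})
```

Because `ses` is a splitting key, the invented second-session CT ends up in its own family, while the segmentations stay with the first CT.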


In [7]:
from TPTBox.core.bids_constants import sequence_naming_keys
# We can now collect the individual files by using the short key. Note that a key can match multiple files.
# Usually the key is just the "format" tag.
ct_file = bids_family["ct"][0]
# There could be multiple CTs, so a list is always returned.


print('These formats will be tagged on with "_", instead of replaced', sequence_naming_keys)
# so a ..._seg-vert_msk.nii.gz will get the key: msk_seg-vert
vert_seg = bids_family["msk_seg-vert"][0]

print(vert_seg.file["nii.gz"])
print(vert_seg)


These formats will be tagged on with "_", instead of replaced ['seg', 'label']
/DATA/NAS/datasets_processed/CT_spine/dataset-rsna/derivatives/10633/ses-baseline/sub-10633_seg-vert_msk.nii.gz
sub-10633_seg-vert_msk.['nii.gz']	 parent = derivatives


## Let's generate a new file

We can build new paths in BIDS format with <bids_file>.get_changed_path().

In [10]:
# 1. Take an existing file
from TPTBox.core.bids_files import BIDS_FILE


ct_file = bids_family["ct"][0]
# 2. Tell the BIDS file what should differ from the current file; everything else is copied.
path1 = ct_file.get_changed_path("nii.gz", bids_format="msk", info={"seg": "vert"}, parent="derivatives", make_parent=False)
print("Path1:", path1)
# You can set the subpath and fill in key values with the following syntax.
# Note: this CT file has no "ses" or "sequ" key, so those placeholders are not filled with values (see Path2 in the output).
path2 = ct_file.get_changed_path("json", bids_format="msk", info={"seg": "vert"}, path="ses-{ses}_sub-{sub}/{sequ}", parent="rawdata", make_parent=False)
print("Path2:", path2)
print(type(path2))
# 3. Just use it as a normal path.

# If you want a new file handle as a BIDS file, you can use:
bids1 = ct_file.get_changed_bids("nii.gz", bids_format="msk", info={"seg": "vert"}, parent="derivatives", make_parent=False)
print(bids1)
# or create a new BIDS_FILE directly
bids2 = BIDS_FILE(path1, ct_file.dataset)
print(bids2)

Path1: /DATA/NAS/datasets_processed/CT_spine/dataset-rsna/derivatives/10633/ses-baseline/sub-10633_seg-vert_msk.nii.gz
Path2: /DATA/NAS/datasets_processed/CT_spine/dataset-rsna/rawdata/ses-ses_sub-10633/sequ/sub-10633_seg-vert_msk.json
<class 'pathlib.PosixPath'>
sub-10633_seg-vert_msk.['nii.gz']	 parent = derivatives
sub-10633_seg-vert_msk.['nii.gz']	 parent = derivatives
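
To make the path construction less of a black box, here is a plain-Python sketch of the idea. It is not the code behind get_changed_path(): `changed_path`, `_KeyNameIfMissing` and the dataset folder `/data/dataset-demo` are hypothetical. The handling of placeholders without a value simply mimics what the second path in the output above looks like.

```python
from pathlib import PurePosixPath


class _KeyNameIfMissing(dict):
    """Formatting helper: a placeholder without a value becomes its key name."""

    def __missing__(self, key):
        return key


def changed_path(dataset, parent, subpath, info, bids_format, filetype, path_template=None):
    """Assemble <dataset>/<parent>/<subpath>/<key-value pairs>_<format>.<filetype>."""
    if path_template is not None:
        # Fill "{key}" placeholders in the template from the key-value pairs.
        subpath = path_template.format_map(_KeyNameIfMissing(info))
    # "sub" must come first; the other keys keep their insertion order.
    keys = ["sub"] + [k for k in info if k != "sub"]
    name = "_".join(f"{k}-{info[k]}" for k in keys) + f"_{bids_format}.{filetype}"
    return PurePosixPath(dataset, parent, subpath, name)


ct_info = {"sub": "10633"}
new_info = {**ct_info, "seg": "vert"}  # copy the old keys, add the changed one

p1 = changed_path("/data/dataset-demo", "derivatives", "10633/ses-baseline", new_info, "msk", "nii.gz")
p2 = changed_path("/data/dataset-demo", "rawdata", "", new_info, "msk", "json",
                  path_template="ses-{ses}_sub-{sub}/{sequ}")
print(p1)
print(p2)
```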



# Running in true parallel

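By default, Python code runs in a single thread. To process subjects in parallel, you can start several workers with joblib's Parallel (with joblib's default backend these are separate processes). Here is an example; the per-subject work goes into a helper function.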

In [14]:
import time, random
from TPTBox import BIDS_Global_info,Subject_Container
from joblib import Parallel, delayed


n_jobs = 10


def __helper(subj_name, subject: Subject_Container):
    time.sleep(random.random() * 0.1)
    # TODO: put the work that should be done for each subject here
    print(subj_name)


# initialize BIDS dataset
global_info = BIDS_Global_info(
    [ds_path],
    ["sourcedata", "rawdata", "rawdata_ct", "rawdata_dixon", "derivatives"],
    additional_key=["sequ", "seg", "ovl", "e"],
)

# Call Parallel, which starts n_jobs workers; they call __helper() once for each subject in the dataset
Parallel(n_jobs=n_jobs)(delayed(__helper)(subj_name, subject) for subj_name, subject in global_info.enumerate_subjects(sort=True))
print("finished")

Found: 8744   total= 87      
11988
12833
10921
11827
10633
16919
15206
12281
16092
17960
14267
1573
12292
18480
1542
17481
1868
1480
19333
18935
19388
2243
18906
20647
23904
20928
20120
25704
19021
21321
24140
18968
26110
24606
21651
24891
26442
26068
26898
25833
27752
24617
26990
26979
30067
29425
30524
27016
26498
26492
26740
31077
3168
27292
28327
32590
28025
30487
28665
5671
32071
5783
5782
32658
3992
30640
32280
30565
780
3376
8024
6125
32434
8330
4769
4202
8744
6078
6376
8884
8574
9926
32370
5002
3882
32436
finished
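
The subject names above are printed out of order because the workers finish at different times. If you need per-subject results in a stable order, return them from the helper instead of printing them. The sketch below shows the pattern with the standard library's `ThreadPoolExecutor` as a stand-in, so that it runs without TPTBox or joblib; the subject list and the `helper` function are made up for illustration. Threads are subject to Python's global interpreter lock, so for CPU-bound work the joblib call above is the better fit.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical subject names standing in for enumerate_subjects().
subjects = ["10633", "10921", "11827", "11988", "12833"]


def helper(subj_name):
    """Pretend to process one subject and return a result for it."""
    time.sleep(random.random() * 0.01)  # workers finish at random times
    return subj_name, len(subj_name)


with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields the results in input order, whatever the finishing order.
    results = list(pool.map(helper, subjects))

print(results)
```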
