## 0 - Introduction

This tutorial will guide you through the `preparation` step, where
a generic human-organized (or even disorganized) dataset
is automatically organized into a more standard form.

On the one hand, this prepared (organized) dataset helps
`bidsme` to discover and retrieve data; on the other hand,
it allows the user to modify data (e.g. convert it to another format)
without fear of overwriting the original data.

But first, we need to load the paths to the example dataset
defined in [dataset installation](./00-tutoral-paths.ipynb):

In [None]:
%run "./Configuration/00-tutorial-paths.py"

If you see any error, please verify the path definitions in `../01-installation/00-paths.ipynb`.

We will also need some helper functions, defined in the [Tools folder](../Tools/tools.py);
we just need to execute the file:

In [None]:
%run "../Tools/tools.py"

This will make the `clean_data` function available:

In [None]:
?clean_data

We will use this function to reset the prepared datasets.

Finally we will initialize `bidsme` and get the `logger` object
which will control the logging of all bidsme functions:

In [None]:
import bidsme
logger = bidsme.init()

If `bidsme` is installed in the current environment, you should see
its version.

## 1 - Preparation of the dataset

In order to bidsify a given dataset, it must first be "prepared",
i.e. put in a standardised format that bidsme expects.

To do so, the `bidsme.prepare()` function should be run on the dataset.

### Basic usage

The `prepare` function has several options and parameters to accommodate a large variety of datasets:

In [None]:
help(bidsme.prepare)

The two mandatory arguments `source` and `destination` are the paths 
to the unbidsified dataset and to the prepared dataset, respectively.
These paths are stored in `SOURCE_PATH` and `PREPARED_PATH`:

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH)

The command above will produce a few warnings and ultimately do nothing,
because bidsme failed to find any data in the source dataset.

> **Warning**:
>```
>prepare(337) - WARNING Unable to identify data in folder /home/beliy/Works/bidscoin_example/example1/source/004/s01629
>```
>says exactly this: *I scanned folder example1/source/004/s01629 but didn't find any data*.

### Getting logger report

The main goal of the `bidsme.init()` function is to set up logging:

In [None]:
help(bidsme.init)

The `level` and `formatter` parameters set the
[verbosity level](https://docs.python.org/3/library/logging.html#logging-levels)
and the [format](https://docs.python.org/3/library/logging.html#logrecord-attributes)
of the logging string.

The `log_dir` parameter, if specified, will save all logging messages in
the provided directory.

`bidsme.init()` returns a [logger object](https://docs.python.org/3/library/logging.html),
which can be used to change the logging level without re-running `bidsme.init()`:

In [None]:
logger.setLevel("WARNING")
bidsme.prepare(SOURCE_PATH, PREPARED_PATH)

The `logger` object also keeps in memory the number of warnings and errors;
the function `bidsme.tools.info.reporterrors()` shows these counts:

In [None]:
logger.setLevel("INFO")
bidsme.tools.info.reporterrors(logger)

The counters increase at each call of a bidsme function.
To track only the errors from the last execution,
we can reset the counters using `bidsme.tools.info.reseterrors()`:

In [None]:
bidsme.tools.info.reseterrors(logger)
bidsme.tools.info.reporterrors(logger)
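
To give an idea of how such counters can work, here is a minimal sketch built only on the standard `logging` module. It is not the `bidsme` implementation: `CountingHandler` and the `counting_demo` logger are hypothetical names used purely for illustration.

```python
import logging


class CountingHandler(logging.Handler):
    """Hypothetical handler that counts log records per level name."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def emit(self, record):
        # Increment the counter for this record's level (e.g. "WARNING")
        self.counts[record.levelname] = self.counts.get(record.levelname, 0) + 1

    def reset(self):
        # Forget everything counted so far
        self.counts.clear()


demo_logger = logging.getLogger("counting_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo silent
counter = CountingHandler()
demo_logger.addHandler(counter)

demo_logger.warning("Unable to identify data in folder")
demo_logger.warning("File exists at destination")
demo_logger.error("Columns not found in table")

before_reset = dict(counter.counts)  # snapshot of the counts
counter.reset()
after_reset = dict(counter.counts)

print(before_reset)
print(after_reset)
```

The counts accumulate across calls until they are reset explicitly, which mirrors the report/reset pattern used in the cells above.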

### Data retrieval

Bidsme expects a simple organisation of the source dataset:
```
$SOURCE_PATH/<subject folder>/<session folder>/<data files>
```

In the example dataset, this corresponds to:
- `<subject folder>`: 001, 002, 003, 004 etc.
- `<session folder>`: s01629, s01599 etc.
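
As a rough illustration of this layout, the sketch below builds a tiny throw-away tree and lists its subject and session folders with `pathlib`. The helper `list_sessions` is a hypothetical name, not part of `bidsme`, and the tree contains only empty folders.

```python
import tempfile
from pathlib import Path


def list_sessions(source):
    """Return {subject folder: [session folders]} for a source tree."""
    layout = {}
    for subject in sorted(p for p in Path(source).iterdir() if p.is_dir()):
        layout[subject.name] = sorted(
            s.name for s in subject.iterdir() if s.is_dir()
        )
    return layout


# Build a small temporary tree that mimics the example dataset
with tempfile.TemporaryDirectory() as tmp:
    for sub, ses in [("001", "s01513"), ("004", "s01629")]:
        (Path(tmp) / sub / ses / "nii").mkdir(parents=True)
    demo_layout = list_sessions(tmp)

print(demo_layout)
```

Only the first two folder levels are interpreted here; anything below the session folder is treated as data.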

The problem with this dataset is that the actual data are stored in a `nii` subfolder:

In [None]:
!ls $SOURCE_PATH/001/s01513/nii

In this case, the user should communicate this information to bidsme using the
`data_dirs` option, which expects a dictionary whose keys are folder names
and whose values are the data types of the images they contain.

For example, `data_dirs={"nii": "MRI"}` tells bidsme that the `nii` sub-folder
contains `MRI` images:

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH, data_dirs={"nii": "MRI"})
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

> **CLI:**
>```bash
> bidsme prepare $SOURCE_PATH $PREPARED_PATH -r "nii=MRI"
>```
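
The correspondence between the command-line form `"nii=MRI"` and the Python form `data_dirs={"nii": "MRI"}` can be pictured with the small sketch below. How `bidsme` parses the option internally is not shown in this tutorial; `parse_recfolders` is a hypothetical helper written only to illustrate the mapping.

```python
def parse_recfolders(items):
    """Turn ["nii=MRI", ...] into {"nii": "MRI", ...} (hypothetical helper)."""
    mapping = {}
    for item in items:
        folder, sep, data_type = item.partition("=")
        if not sep or not folder or not data_type:
            raise ValueError(f"expected 'folder=type', got {item!r}")
        mapping[folder] = data_type
    return mapping


data_dirs = parse_recfolders(["nii=MRI"])
print(data_dirs)
```

The resulting dictionary has the same shape as the `data_dirs` argument passed to `bidsme.prepare()` above.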

The data files from `SOURCE_PATH` are copied into `PREPARED_PATH`
and organised into a BIDS-like structure: `sub-001/ses-s01513/MRI`:

In [None]:
!ls $PREPARED_PATH/sub-001/ses-s01513/MRI/006-cmrr_mbep2d_bold_mb2_invertpe/

The top folder `sub-001` is the subject folder with `sub-` prefix; the name '001' is generated from the name of the `<subject folder>` in the source dataset.

The first subfolder `ses-s01513` is the session folder with `ses-` prefix;
the name is generated from the name of the `<session folder>` in the source dataset.

The second subfolder `MRI` is the data-type folder; it distinguishes files
from different modalities, like `MRI`, `PET` or `EEG`. The data type was indicated by the user with the `data_dirs` parameter (the `-r` or `--recfolder` option on the command line). If not specified by the user, information on the data type is retrieved directly from the source data.

Finally, the last subfolder `006-cmrr_mbep2d_bold_mb2_invertpe` is the "series"
folder, which contains all images taken during one acquisition or series.
The name of the folder is composed of two parts:
- the number of the acquisition in the current session
(the value and meaning of the number depend on the data type;
in the case of MRI, it corresponds to the series number)
- the name of the acquisition
(the value and meaning of the name depend on the data type;
if not defined, it will be the same as the name of the data file)

Files that share the same number and name of acquisition will be considered as part of the same series.
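
The two-part naming of series folders can be illustrated by splitting a folder name back into its components. This is only a sketch: `split_series_name` is a hypothetical helper, and it assumes the number and the name are separated by the first hyphen, as in the folder names shown above.

```python
def split_series_name(folder):
    """Split '<number>-<name>' into (int number, name); hypothetical helper."""
    number, sep, name = folder.partition("-")
    if not sep or not number.isdigit():
        raise ValueError(f"not a series folder name: {folder!r}")
    return int(number), name


series = split_series_name("006-cmrr_mbep2d_bold_mb2_invertpe")
print(series)
```

Two files whose folder names yield the same `(number, name)` pair would belong to the same series.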

> **Important**: Images in some data formats consist of several files.
For example, an EEG recording stored in BrainVision format usually consists of 3 files:
`.vhdr`, `.vmrk` and `.eeg`. Another example is an MRI NIfTI image associated with a `.json`
file containing the metadata extracted from the DICOM header.
>
>`bidsme` views such files as one entity, and in what follows "file" or "image" will refer
to the full set of files corresponding to the same image.
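
One simple way to picture this "several files, one image" idea is to group file names that differ only by their extension. The sketch below is an assumption-laden illustration, not the logic `bidsme` uses: `group_by_stem` and the file names are hypothetical, and double extensions such as `.nii.gz` are not handled.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_stem(filenames):
    """Group file names that differ only by their extension."""
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePath(name).stem].append(name)
    return {stem: sorted(names) for stem, names in groups.items()}


# Hypothetical file names: one BrainVision recording and one NIfTI image
files = ["rest.vhdr", "rest.vmrk", "rest.eeg", "scan.nii", "scan.json"]
images = group_by_stem(files)
print(images)
```

Here five files collapse into two entities, `rest` and `scan`.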

## 2 - Table of participants

The BIDS standard requires that each dataset contains a
[table](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file)
with the list of participants and some basic demographic data.

For convenience, this table is created by `bidsme` during the preparation step.
By default, the table file contains only one column, `participant_id`, and serves as
a record of the already prepared subjects.
During this step, `bidsme` will also automatically create the description file `participants.json`.

**participants.tsv**: file with basic demographic data

In [None]:
import os
import pandas
pandas.read_csv(open(os.path.join(PREPARED_PATH, 'participants.tsv')), sep='\t')

**participants.json**: description of the demographic file

In [None]:
pandas.read_json(open(os.path.join(PREPARED_PATH, 'participants.json')))

### Adding information to the participants.tsv file

In order to add additional columns, `bidsme` needs a template `.json` file,
which is just a sidecar JSON file for the participants table
(see [documentation](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file) for examples).
An example of such a file for the tutorial dataset can be found in the `resources` folder, e.g. `participants.json`:

In [None]:
pandas.read_json(open(os.path.join(RESOURCES_PATH, 'participants.json')))

This template can be passed to `bidsme` using the `part_template` parameter (`--part-template` on the command line):

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH,
               part_template=os.path.join(RESOURCES_PATH, "participants.json"),
               data_dirs={"nii": "MRI"})
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

> **CLI:**
> `bidsme prepare <source> <destination> --part-template <participants.json> -r "nii=MRI"`

If the prepared dataset already contains `participants.tsv` and its columns differ from
the ones described in the template, the preparation step will fail:

**Error:**
```
bidsMeta.BidsTable(125) - ERROR participants.tsv: Columns ['age', 'sex', 'education', 'group', 'handiness', 'paired', 'ses_1', 'ses_2', 'ses_3'] not found in table
```

This is done to keep the participants table consistent across all participants.
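
The kind of check behind this error can be sketched as a comparison between the header of the existing table and the columns named in the template. This is a minimal illustration using only the standard library; `missing_columns` is a hypothetical helper and the inline table and template are made up for the example.

```python
import csv
import io


def missing_columns(tsv_text, template):
    """Return the template columns that are absent from the table header."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return [col for col in template if col not in header]


# A table created without a template, and a template asking for more columns
existing = "participant_id\nsub-001\nsub-002\n"
template = {"participant_id": {}, "age": {}, "sex": {}}
missing = missing_columns(existing, template)
print(missing)
```

A non-empty result corresponds to the situation reported in the error message above.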

>To correct this error, the table in the prepared dataset must have its columns synchronized with the template
and filled in for the already processed participants.
>
>An alternative is simply to remove `participants.tsv`, as done by the
`clean_data` function, defined in [path definition notebook](../Installation/bidsme_path.ipynb)

In [None]:
clean_data(PREPARED_PATH)

Once the conflicts in the table are fixed, the `prepare` step should work:

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH,
               part_template=os.path.join(RESOURCES_PATH, "participants.json"),
               data_dirs={"nii": "MRI"})
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

If the prepared dataset already contains some data, you may encounter the following
warning message:

**Warning:**
```
Modules.MRI.hmriNIFTI(233) - WARNING 014-cmrr_mbep2d_diff_NODDI_noise/40: File s1629-0014-00005-000281-01.nii exists at destination
```
This warning just indicates that the files in the prepared dataset are being overwritten.

Now the `participants.json` file in the prepared dataset is updated automatically
to match the template:

In [None]:
pandas.read_json(open(os.path.join(PREPARED_PATH, 'participants.json')))

The `participants.tsv` now contains all the columns defined in the template,
including age, sex, education etc.

The values for all subjects are `n/a` (not available);
these values are supposed to be filled in either manually or using plugins,
as will be demonstrated in a dedicated [tutorial](04-advanced-preparation.ipynb).

> BIDS requires `n/a` as the indicator of missing or undefined values. These values are parsed as `NaN` by pandas.

In [None]:
pandas.read_csv(open(os.path.join(PREPARED_PATH, 'participants.tsv')), sep='\t')
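
For reference, a freshly created table of this kind, with every non-id column set to `n/a`, could be produced along the following lines. This is a standard-library sketch, not the writer used by `bidsme`; `empty_table` is a hypothetical helper and the two-column template is a shortened stand-in for the real one.

```python
import csv
import io


def empty_table(template, subjects):
    """Build TSV text where every non-id column holds 'n/a'."""
    columns = ["participant_id"] + [c for c in template if c != "participant_id"]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for sub in subjects:
        # One row per subject: the id followed by 'n/a' placeholders
        writer.writerow([sub] + ["n/a"] * (len(columns) - 1))
    return out.getvalue()


table = empty_table({"age": {}, "sex": {}}, ["sub-001", "sub-002"])
print(table)
```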

### Resolving conflicts in `participants.tsv`

`bidsme` also tracks the values in the `participants.tsv` table
and will report any inconsistency, for example if the same subject
was previously reported as left-handed and is now reported as right-handed.

To demonstrate this, you can manually change some values in `participants.tsv`
and re-run the preparation step:

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH,
               data_dirs={"nii": "MRI"})
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

You will encounter an error (the list of participants may vary):

**Error:**
```
prepare(373) - CRITICAL Participant list contains one or several duplicated entries: ['sub-001', 'sub-001', 'sub-002', 'sub-002', 'sub-003', 'sub-003', 'sub-004', 'sub-004']
```

This error is raised when the values of the table from the previous run
do not match those in the new run.
`bidsme` will put all non-conflicting participants in the `participants.tsv` table,
and all conflicting ones in `__duplicated.tsv`:

In [None]:
!ls $PREPARED_PATH/*.tsv

**participants.tsv** contains all subjects without conflicts

In [None]:
pandas.read_csv(open(os.path.join(PREPARED_PATH, 'participants.tsv')), sep='\t')

**__duplicated.tsv** contains the original entry and the new entries for
each subject with conflicts:

In [None]:
pandas.read_csv(open(os.path.join(PREPARED_PATH, '__duplicated.tsv')), sep='\t')

To resolve the conflict, you just need to move the correct entries from
`__duplicated.tsv` into `participants.tsv`,
then remove `__duplicated.tsv`.
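
The separation into conflicting and non-conflicting participants can be pictured with the sketch below. The exact comparison rules used by `bidsme` (for instance, how `n/a` values are treated) are not described here, so this is only an assumption-based illustration; `split_conflicts` and the example values are hypothetical.

```python
def split_conflicts(old_rows, new_rows):
    """Separate participants whose values agree from those that differ.

    Both arguments map participant_id -> dict of column values.
    """
    consistent, conflicting = {}, {}
    for sub in sorted(set(old_rows) | set(new_rows)):
        old, new = old_rows.get(sub), new_rows.get(sub)
        if old is None or new is None or old == new:
            consistent[sub] = new if new is not None else old
        else:
            conflicting[sub] = [old, new]  # keep both versions for review
    return consistent, conflicting


old = {"sub-001": {"handiness": "left"}, "sub-002": {"handiness": "right"}}
new = {"sub-001": {"handiness": "right"}, "sub-002": {"handiness": "right"}}
kept, duplicated = split_conflicts(old, new)
print(kept)
print(duplicated)
```

In this example `sub-002` stays in the main table, while both versions of `sub-001` are set aside for a manual decision.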

> **Note:** The option `--part-template` may be used only once, to create the initial table; for later runs,
`bidsme` will load the template directly from the `participants.json` file in the prepared dataset.

## 3 - Running on subset of subjects

By default, `bidsme` runs on all subjects discovered in the source dataset,
but sometimes you will need to run it only on selected ones.

Conveniently, the `prepare` function can be restricted to a subset of participants
from the source dataset using one of the following options:

```
sub_list: list
    list of subject to process. Subjects
    are checked after plugin and must
    start with 'sub-', as in destination
    folder
sub_skip_tsv: bool
    if set to True, subjects found in
    destination/participants.tsv will be
    ignored
sub_skip_dir: bool
    if set to true, subjects with already
    created directories will be ignored
    Can conflict with sub_no_dir
ses_skip_dir: bool
    if set to True, sessions with already
    created directories will be ignored
    Can conflict with ses_no_dir
```

>**CLI**
>```
--participants ID [ID ...]
                        Space-separated list of subjects to process, as
                        defined in source folder (i.e. before affecting by
                        plugin) (default: None)
  --skip-in-tsv         Skip participants that exists in the participants.tsv
                        file in destination dataset. (default: False)
  --skip-existing       Skip participants with corresponding folders exists in
                        destination dataset. (default: False)
  --skip-existing-sessions
                        Skip sessions that exists in destination dataset.
                        (default: False)
>```

For example, to run only on the third subject, it is enough to add the option `sub_list=["sub-003"]`
to the previous `prepare()` call (note that the subject id is in its bidsified form, containing `sub-`).

### Providing explicit list of participants

The `--participants` option, followed by a list of subject ids, tells `bidsme` to run only
on the specified subjects.

For example, if we want to run the preparation step only for subject `003`, we can add
`--participants sub-003`.

> **Note that the provided subject id must start with `sub-`; in other words, you need
to give the bidsified subject id.**

In [None]:
clean_data(PREPARED_PATH)
bidsme.prepare(SOURCE_PATH, PREPARED_PATH,
               part_template=os.path.join(RESOURCES_PATH, "participants.json"),
               data_dirs={"nii": "MRI"},
               sub_list=["sub-003", "sub-004"])
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

`bidsme` will scan the full dataset and skip all subjects that are not listed in `sub_list`.

```
prepare(280) - INFO Skipping subject 'sub-001'
```

All subjects are skipped, except the requested ones:

In [None]:
!ls $PREPARED_PATH

Running on a short list of subjects is performed in the same way;
you just need to give that list to the `--participants` option:

> **CLI:**
> `bidsme prepare <source> <destination> -r "nii=MRI" --part-template <participants.json> --participants sub-003 sub-004`

Now the prepared dataset should contain only subjects 3 and 4:

In [None]:
!ls $PREPARED_PATH
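
The selection by `sub_list` amounts to a simple filter over the discovered subjects. The sketch below is a hypothetical stand-alone illustration (`select_subjects` is not a `bidsme` function); it only assumes, as stated above, that the ids are given in their bidsified form.

```python
def select_subjects(found, sub_list=None):
    """Return (to_process, skipped) for discovered ids and an optional list."""
    if not sub_list:
        return list(found), []
    bad = [s for s in sub_list if not s.startswith("sub-")]
    if bad:
        raise ValueError(f"ids must start with 'sub-': {bad}")
    wanted = set(sub_list)
    to_process = [s for s in found if s in wanted]
    skipped = [s for s in found if s not in wanted]
    return to_process, skipped


found = ["sub-001", "sub-002", "sub-003", "sub-004"]
to_process, skipped = select_subjects(found, ["sub-003", "sub-004"])
print(to_process)
print(skipped)
```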

### Skipping already prepared data

The switches `sub_skip_tsv`, `sub_skip_dir`, `ses_skip_dir` allow users
to skip already processed subjects and sessions.

`sub_skip_tsv=True` will skip all subjects that are present in `participants.tsv` file.

In this example, it will skip subjects 3 and 4, which are already listed in the table, and will only run on the remaining subjects:

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH,
               part_template=os.path.join(RESOURCES_PATH, "participants.json"),
               data_dirs={"nii": "MRI"},
               sub_skip_tsv=True)
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

> **CLI:**
`bidsme prepare <source> <destination> -r "nii=MRI" --skip-in-tsv`

Now the prepared dataset should also contain the previously missing subjects:

In [None]:
!ls $PREPARED_PATH

The parameters `sub_skip_dir` and `ses_skip_dir` (`--skip-existing` and `--skip-existing-sessions`
on the command line) will skip subjects and sessions, respectively, that already have their folders in the prepared dataset.

Please remove the `sub-002` and `sub-003/ses-s01584` folders from the prepared dataset:

In [None]:
import os
import shutil
shutil.rmtree(os.path.join(PREPARED_PATH, 'sub-002'))
shutil.rmtree(os.path.join(PREPARED_PATH, 'sub-003', 'ses-s01584'))

Then run the `prepare` step asking to skip existing subjects:

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH,
               part_template=os.path.join(RESOURCES_PATH, "participants.json"),
               data_dirs={"nii": "MRI"},
               sub_skip_dir=True)
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

> **CLI:** `bidsme prepare <source> <destination> -r "nii=MRI" --skip-existing`

The `sub-002` folder should now be recreated in the prepared dataset,
but session `ses-s01584` is still missing, because subject `sub-003`
has already been processed:

In [None]:
!ls $PREPARED_PATH
!ls $PREPARED_PATH/sub-003

Running the next command should restore the missing session:

In [None]:
bidsme.prepare(SOURCE_PATH, PREPARED_PATH,
               part_template=os.path.join(RESOURCES_PATH, "participants.json"),
               data_dirs={"nii": "MRI"},
               ses_skip_dir=True)
bidsme.tools.info.reporterrors(logger)
bidsme.tools.info.reseterrors(logger)

> **CLI:** `bidsme prepare <source> <destination> -r "nii=MRI" --skip-existing-sessions`

All sessions should be skipped:
```
prepare(319) - INFO Skipping session 'ses-s01512'
```
except the `ses-s01584`:
```
prepare(59) - INFO Processing: sub 'sub-003', ses 'ses-s01584' (41 files)
```

And the session `ses-s01584` is back in its place:

In [None]:
!ls $PREPARED_PATH/sub-003
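
The interplay of the three skip switches can be summarised with a small decision function. This is a sketch of the behaviour described in this section, not the actual `bidsme` code; `should_skip` and the placeholder session id `ses-any` are hypothetical.

```python
def should_skip(sub, ses, in_tsv, existing_dirs,
                sub_skip_tsv=False, sub_skip_dir=False, ses_skip_dir=False):
    """Decide whether a (subject, session) pair is skipped.

    in_tsv: set of subject ids listed in participants.tsv
    existing_dirs: {subject id: set of session ids} already on disk
    """
    if sub_skip_tsv and sub in in_tsv:
        return True
    if sub_skip_dir and sub in existing_dirs:
        return True
    if ses_skip_dir and ses in existing_dirs.get(sub, set()):
        return True
    return False


# sub-003 still has a folder, but its session ses-s01584 was removed
on_disk = {"sub-003": {"ses-s01599"}}

# Skipping existing subjects leaves the removed session missing
print(should_skip("sub-003", "ses-s01584", set(), on_disk, sub_skip_dir=True))
# Skipping existing sessions lets the removed session be processed again
print(should_skip("sub-003", "ses-s01584", set(), on_disk, ses_skip_dir=True))
```

The first call returns `True` and the second `False`, which matches the sequence of runs above: only the session-level switch restores `ses-s01584`.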

## 4 - Prepared dataset

If all the steps above were successful, we should have a newly created "prepared" dataset,
which contains all four processed subjects and the demographics table `participants.tsv`:

In [None]:
!ls $PREPARED_PATH

For now, each subject has the same id as the name of its folder in the source dataset,
and the `participants.tsv` table remains unfilled.
The subject names can be changed and the table can be filled in by using plugins,
as described in the [advanced prepare tutorial](04-advanced-preparation.ipynb).

Going down one level, each subject folder should contain three sessions
with cryptic names, which again can be renamed using plugins:

In [None]:
!ls $PREPARED_PATH/sub-003

Finally, in each session folder, you can see a modality folder `MRI`, containing
a list of MRI sequences acquired during one session:

In [None]:
!ls $PREPARED_PATH/sub-003/ses-s01599/MRI

Based on this structure, in the [next tutorial](02-basic-mapping.ipynb),
we will configure `bidsme` so that it can bidsify the dataset.