A script used to delete 'junk' files produced by fMRIPrep, to reduce storage space used by up to 95%

NickESouter/fMRIPrepCleanup

fMRIPrepCleanup

IMPROPER USE OF THIS SCRIPT COULD RESULT IN THE UNINTENDED DELETION OF IMPORTANT FILES. PLEASE READ THIS DOCUMENT IN FULL BEFORE EXECUTING.

As well as reading this file, consider watching the following video, which provides an overview of how the tool works, and some examples of usage.

Link to 'fMRIPrepCleanup: Instructions for use' video

Background

fMRIPrep (https://fmriprep.org) is an optimised fMRI preprocessing pipeline that combines best-practice tools from a number of sources. Running it generates a large number of files: preprocessed outputs that will be used in subsequent analysis, output files that the researcher could use but in practice may never touch, and working directory files that are temporary in nature and very unlikely to serve any practical purpose. In some cases, 95% of the data generated by fMRIPrep, measured by file size, could be considered 'junk' (in our case, cleanup reduced one subject from 5.8 GB to 285 MB; see below).

Servers use energy not only to process data, but also to store it. Additionally, filling servers with data increases the need to produce and purchase additional hardware to accommodate new projects. In both of these ways, the accumulation of unneeded 'junk' data contributes to the carbon footprint of research computing.

Usage

As such, we provide here a script that can be used to automatically identify and delete junk files produced by fMRIPrep, used as follows:

fMRIPrepCleanup.py -dir <path to fMRIPrep directory>
-method <'sim_link'/'sim_copy'/'delete'>
-also_keep <keep1,keep2,keep3>
-also_delete <del1,del2,del3>
-out_path <path to simulation output directory>

Compulsory arguments

  • -dir should be a valid directory. This script was written under the assumption that all files generated by fMRIPrep (including output files, log files, and working directory files) that the user might want tidied up are contained somewhere within this directory. However, the script does not rely on any particular file structure or naming convention. Depending on how you have structured your output folders, this could be run at the level of one participant or a full sample.

  • -method must either be 'sim_link' (simulation link mode), 'sim_copy' (simulation copy mode), or 'delete' (deletion mode).

Optional arguments

The following arguments can be provided by the user, but are not required.

  • -also_keep should be a list of strings that the user also wants targeted for retention. Any files whose names contain these strings will not be deleted (or simulated for deletion). This list should be comma separated, with no spaces between entries.

  • -also_delete takes the same format as above, but lists strings identifying files that the user actively wants to delete.

  • -out_path is only valid in simulation mode. If provided, it must be an existing directory; simulated links to or copies of your target files will be placed in a new folder within it. This argument is not compatible with deletion mode.
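The argument rules above can be illustrated with a short sketch. This is not the script's actual parsing code; it assumes argparse is used, and the helper names (split_csv, build_parser, parse_args) are hypothetical.

```python
import argparse


def split_csv(value):
    """Split a comma-separated string into a list, dropping empty entries."""
    return [item for item in value.split(",") if item]


def build_parser():
    # Hypothetical parser mirroring the documented flags; the real script
    # may parse its arguments differently.
    parser = argparse.ArgumentParser(prog="fMRIPrepCleanup.py")
    parser.add_argument("-dir", dest="directory", required=True)
    parser.add_argument("-method", required=True,
                        choices=["sim_link", "sim_copy", "delete"])
    parser.add_argument("-also_keep", type=split_csv, default=[])
    parser.add_argument("-also_delete", type=split_csv, default=[])
    parser.add_argument("-out_path", default=None)
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # -out_path is documented as incompatible with deletion mode.
    if args.method == "delete" and args.out_path is not None:
        parser.error("-out_path is only valid in simulation mode")
    return args
```

Under these assumptions, parse_args(["-dir", "/data/fmriprep", "-method", "sim_link", "-also_keep", "AROMA,T1"]) yields also_keep == ["AROMA", "T1"].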

The method

Simulation mode will create a new folder within the user's current working directory (or within the specified out_path, if provided). Within this folder, two folders will be created, named 'Retained' and 'Deleted'. These folders will contain either symbolic links to, or copies of, each file to be retained or deleted, respectively, placed under its full relative path. The structure and contents of the 'Retained' folder will reflect what would remain if the user were to run deletion mode on the given directory. The 'Deleted' folder will show you what fMRIPrepCleanup would delete. You can use this folder to check that there's nothing important in there that would be lost (e.g., analysis files).

In simulation LINK mode, no actual files will be moved or deleted. This will be the fastest option to run and will place minimal strain on compute. However, this mode may not work on certain systems (e.g., outside of Linux). In these cases, simulation COPY mode should still work. Copy mode may take a particularly long time to run with increasingly large samples, given that a large number of files will need to be replicated.

We recommend deleting generated simulation folders once you have verified that the script is working correctly.
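As a rough illustration of simulation mode, the sketch below mirrors a directory tree into 'Retained' and 'Deleted' folders using either symbolic links or copies. It is not the script's actual implementation: the function name simulate and the should_keep callback are hypothetical, and it assumes out_root lies outside src_root.

```python
import os
import shutil


def simulate(src_root, out_root, should_keep, mode="sim_link"):
    """Mirror src_root into out_root/Retained and out_root/Deleted.

    should_keep is a hypothetical callback that takes a path relative to
    src_root and returns True if the file would survive deletion mode.
    Assumes out_root is not inside src_root.
    """
    if mode not in ("sim_link", "sim_copy"):
        raise ValueError("mode must be 'sim_link' or 'sim_copy'")
    counts = {"Retained": 0, "Deleted": 0}
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            bucket = "Retained" if should_keep(rel) else "Deleted"
            dest = os.path.join(out_root, bucket, rel)
            # Recreate the file's relative folder structure in the bucket.
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            if mode == "sim_link":
                # Link to the absolute source so the link resolves anywhere.
                os.symlink(os.path.abspath(src), dest)
            else:
                shutil.copy2(src, dest)
            counts[bucket] += 1
    return counts
```

Nothing under src_root is modified in either mode; only the new output folder is written.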

Deletion mode will instead simply delete non-target files within the provided fMRIPrep directory. This script works by deleting all files that do not contain specific strings (see below) in their name.

PLEASE NOTE: The files targeted here for retention and deletion are based on those useful to the authors of this script given our research needs. In its current form, this script will delete fMRIPrep output files that other researchers may have good theoretical motivations to retain, such as those generated by FreeSurfer surface reconstruction. As such, researchers should make sure they fully understand the execution of this code before using it on their data, and update it as necessary. Without this, one may risk irrevocably deleting important data. Improper use of this script could be particularly catastrophic if it is pointed at a directory that contains files other than those generated by fMRIPrep, as these would also be deleted. We recommend testing this script on a copy of one subject's preprocessed data before using it on the full sample/original data, as well as using simulation mode on the full dataset.

The optional -also_keep and -also_delete arguments allow the user to specify criteria beyond the target strings provided as default. For example, using -also_keep the user may request that any files containing 'AROMA' are not deleted, if ICA AROMA has been employed within fMRIPrep. If our selection criteria are deemed to be too broad, users can employ -also_delete to delete files that are not of interest to them. For example, if not interested in retaining preprocessed anatomical files in subjects' native space, one could request that any files containing the string 'T1' be deleted. Again, take extreme care when specifying additional files for deletion.

Any strings in our list of target strings that overlap with those provided within -also_delete will be removed from the target strings. To play it safe, if any strings occur within both -also_keep and -also_delete, they will be removed from -also_delete. As such, -also_keep trumps -also_delete, which in turn trumps our default target strings.
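The precedence described above can be sketched as follows. This is a minimal illustration, not the script's own code: the function name resolve_targets is hypothetical, and 'overlap' is assumed here to mean an exact string match.

```python
DEFAULT_KEEP = ["preproc", "brain_mask", "confounds", "html", "svg", "emissions"]


def resolve_targets(also_keep, also_delete, defaults=DEFAULT_KEEP):
    """Return (keep_strings, delete_strings) after applying precedence.

    Assumes 'overlap' means an exact string match between lists.
    """
    # -also_keep trumps -also_delete: drop shared strings from the delete list.
    delete = [s for s in also_delete if s not in also_keep]
    # -also_delete trumps the defaults: drop deleted strings from the defaults.
    keep = [s for s in defaults if s not in delete]
    # Finally add the user's extra retention strings, avoiding duplicates.
    keep += [s for s in also_keep if s not in keep]
    return keep, delete
```

For example, with -also_keep AROMA,svg and -also_delete svg,html,T1, 'svg' stays in the retention list because it was explicitly kept, while 'html' is removed from the defaults and 'T1' is added for deletion.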

Default strings currently flagged for retention (with rationale for each):

  • 'preproc': Targets any preprocessed functional and anatomical files generated by fMRIPrep, in each output space generated.
  • 'brain_mask': Targets the brain 'mask' generated by fMRIPrep that accompanies preprocessed functional and anatomical output files.
  • 'confounds': Targets the confounds TSV file generated for each BOLD run, as this data is often used in subsequent analysis steps.
  • 'html': Targets the final output report generated by fMRIPrep for each subject, as well as some of the 'figures' used to populate it.
  • 'svg': Targets the remaining 'figures' used to populate the output report for each subject.
  • 'emissions': Targets the estimated carbon emissions output file generated by CodeCarbon (https://codecarbon.io; if toggled on when running fMRIPrep using the 'track-carbon' flag, which we recommend for monitoring purposes).

In deletion mode, any folders that are themselves totally empty following file deletion will also be deleted, to provide a cleaner view of what remains in your fMRIPrep directory. Note that some non-target files do slip through the cracks according to the inclusion criteria specified above. As such, the following are specifically targeted for deletion:

  • 'index.html': A non-subject-specific working directory file that will not be needed.
  • 'single_subject_{subject number}_wf': Subject specific working directory folders that would otherwise remain given the presence of HTML files as well as several containing the string 'preproc'.
  • 'fsaverage': A non-subject-specific FreeSurfer output directory folder that would otherwise remain given the presence of two files containing the string 'preproc'.
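The retention rule, the forced-deletion exceptions, and the removal of empty folders can be sketched together as below. This is an illustration under assumptions rather than the script's actual logic: the names should_keep and prune_empty_dirs are hypothetical, paths are assumed to be POSIX-style and relative to the fMRIPrep directory, and 'fsaverage' and the working directory pattern are assumed to match whole folder names.

```python
import os
import re
from pathlib import PurePosixPath

DEFAULT_KEEP = ("preproc", "brain_mask", "confounds", "html", "svg", "emissions")
# Assumed form of the per-subject working directory folder name.
WF_FOLDER = re.compile(r"single_subject_.+_wf")


def should_keep(rel_path, keep=DEFAULT_KEEP):
    """Decide whether a file (POSIX-style relative path) would be retained."""
    path = PurePosixPath(rel_path)
    # Forced deletions take priority over the retention strings.
    if path.name == "index.html":
        return False
    for folder in path.parts[:-1]:
        if folder == "fsaverage" or WF_FOLDER.fullmatch(folder):
            return False
    # Otherwise keep the file if its name contains any retention string.
    return any(target in path.name for target in keep)


def prune_empty_dirs(root):
    """Remove folders under root that are empty, walking bottom-up."""
    removed = []
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        # Re-check on disk, since child folders may just have been removed.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

Walking bottom-up means a folder that only contained empty folders is itself removed in the same pass.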

Upon executing the script, users will be warned that files within the specified directory are about to be deleted/simulated with symbolic links or copies. The user can proceed with 'Y', or abort with 'N'. If the specified fMRIPrep directory does not exist or is not found, or the provided method is not 'sim_link', 'sim_copy', or 'delete', the user will be warned and the script will exit.

This stage is followed by a final sanity check, where the script checks whether the provided fMRIPrep directory appears to contain fMRIPrep output file structure (folders containing 'sub-', files containing 'preproc'). If these conditions are not met, the user is warned, and again will need to confirm if they are happy to continue. This will catch cases where the script has been pointed at a directory that does not contain fMRIPrep output, but will not catch cases where the specified directory contains other files alongside fMRIPrep output. Again, extreme care should be taken.
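A minimal sketch of this kind of sanity check is shown below. The function name looks_like_fmriprep is hypothetical, and the sketch assumes both conditions must hold somewhere within the directory tree; the script's actual check may differ in detail.

```python
import os


def looks_like_fmriprep(root):
    """Return True if root has a folder containing 'sub-' and a file containing 'preproc'."""
    has_sub_folder = False
    has_preproc_file = False
    for _dirpath, dirnames, filenames in os.walk(root):
        has_sub_folder = has_sub_folder or any("sub-" in d for d in dirnames)
        has_preproc_file = has_preproc_file or any("preproc" in f for f in filenames)
        if has_sub_folder and has_preproc_file:
            return True  # Both conditions met; stop walking early.
    return False
```

A False result here would correspond to the warning described above, after which the user would be asked to confirm before continuing.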

Example output

Below is an example file tree of ALL files that would remain for a single subject using the default version of this script, where all working directory, output, and log files were located in the directory targeted. This data was derived from the default fMRIPrep pipeline, with no extra output spaces specified.

  • derivatives /

    • sub-001.html
    • logs /
      • CITATION.html
      • emissions.csv
    • sub-001 /
      • anat /
        • sub-001_desc-brain_mask.json
        • sub-001_desc-brain_mask.nii.gz
        • sub-001_desc-preproc_T1w.json
        • sub-001_desc-preproc_T1w.nii.gz
        • sub-001_space-MNI152NLin6Asym_res-2_desc-brain_mask.json
        • sub-001_space-MNI152NLin6Asym_res-2_desc-brain_mask.nii.gz
        • sub-001_space-MNI152NLin6Asym_res-2_desc-preproc_T1w.json
        • sub-001_space-MNI152NLin6Asym_res-2_desc-preproc_T1w.nii.gz
      • figures /
        • sub-001_desc-about_T1w.html
        • sub-001_desc-conform_T1w.html
        • sub-001_desc-reconall_T1w.svg
        • sub-001_desc-summary_T1w.html
        • sub-001_dseg.svg
        • sub-001_space-MNI152NLin2009cAsym_T1w.svg
        • sub-001_space-MNI152NLin6Asym_T1w.svg
        • sub-001_task-stopsignal_desc-bbregister_bold.svg
        • sub-001_task-stopsignal_desc-carpetplot_bold.svg
        • sub-001_task-stopsignal_desc-compcorvar_bold.svg
        • sub-001_task-stopsignal_desc-confoundcorr_bold.svg
        • sub-001_task-stopsignal_desc-rois_bold.svg
        • sub-001_task-stopsignal_desc-summary_bold.html
        • sub-001_task-stopsignal_desc-validation_bold.html
      • func /
        • sub-001_task-stopsignal_desc-confounds_timeseries.json
        • sub-001_task-stopsignal_desc-confounds_timeseries.tsv
        • sub-001_task-stopsignal_space-MNI152NLin6Asym_res-2_desc-brain_mask.json
        • sub-001_task-stopsignal_space-MNI152NLin6Asym_res-2_desc-brain_mask.nii.gz
        • sub-001_task-stopsignal_space-MNI152NLin6Asym_res-2_desc-preproc_bold.json
        • sub-001_task-stopsignal_space-MNI152NLin6Asym_res-2_desc-preproc_bold.nii.gz

    For context, the full size of files for this subject prior to deletion was 5.8 GB. Following the clean-up, this number dropped to 285 MB, a 95% reduction in size.
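To reproduce this kind of figure for your own data, the sketch below sums file sizes under a directory and computes the percentage reduction. The helper names directory_size and percent_reduction are hypothetical and are not part of the script.

```python
import os


def directory_size(root):
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def percent_reduction(before_bytes, after_bytes):
    """Percentage of the original size that was removed."""
    if before_bytes <= 0:
        raise ValueError("before_bytes must be positive")
    return 100.0 * (before_bytes - after_bytes) / before_bytes


# The figures quoted above: 5.8 GB before, 285 MB after.
print(round(percent_reduction(5.8e9, 285e6)))  # 95
```

Calling directory_size before and after running the script, or on the 'Retained' folder from simulation COPY mode, gives the two inputs to percent_reduction.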
