Adding robustica option to ICA decomposition to achieve consistent results #1013

Draft: wants to merge 20 commits into base branch main (changes shown from 19 of the 20 commits).

Commits:
f4eaa3e  Add robustica method (BahmanTahayori, Jul 31, 2023)
b0cac3a  Incorporation of major comments regarding robustica addition (BahmanTahayori, Dec 5, 2023)
55c2ae4  Add robustica 0.1.3 to dependency list (BahmanTahayori, Nov 1, 2023)
cd55a3f  Multiple fixes to RobustICA addition from code review (BahmanTahayori, Dec 5, 2023)
2d9b007  Specify magic number fixed seed of 42 as a constant (BahmanTahayori, Nov 29, 2023)
09e565e  Merge remote-tracking branch 'upstream/main' into add_robustica_rsclean (BahmanTahayori, Dec 5, 2023)
fc5f9ea  Updated (BahmanTahayori, Dec 5, 2023)
4fc3043  Robustica Updates (BahmanTahayori, Dec 6, 2023)
a20ff57  Incorporating the third round of Robert E. Smith's comments (BahmanTahayori, Dec 20, 2023)
cc5e05d  Merge pull request #3 from BahmanTahayori/add_robustica_rsclean (BahmanTahayori, Dec 20, 2023)
a449fec  Merge branch 'ME-ICA:main' into main (BahmanTahayori, Feb 9, 2024)
78c8140  Enhance the "ica_method" description suggested by D. Handwerker (BahmanTahayori, Feb 9, 2024)
ac85e6a  Enhancing the "n_robust_runs" description suggested by D. Handwerkerd (BahmanTahayori, Feb 9, 2024)
979d026  RobustICA: Restructure code loop over robust methods (#4) (Lestropie, Feb 11, 2024)
71d8d4a  merging recent changes (BahmanTahayori, Feb 21, 2024)
cac38cd  Applied suggested changes (BahmanTahayori, Feb 29, 2024)
5fcf148  Fixing the conflict (BahmanTahayori, Feb 29, 2024)
b7d08e9  Merge branch 'ME-ICA:main' into main (BahmanTahayori, Feb 29, 2024)
a113423  Incorporating more comments (BahmanTahayori, Mar 4, 2024)
b60e9a6  Merge remote-tracking branch 'upstream/main' (Lestropie, Apr 12, 2024)
pyproject.toml (1 addition, 0 deletions)

@@ -30,6 +30,7 @@ dependencies = [
"pandas>=2.0,<=2.2.1",
"pybtex",
"pybtex-apa-style",
"robustica>=0.1.3",
Member:

Adding this dependency makes a lot of sense, but I want to highlight to @tsalo & others that this will also install two additional dependencies: scikit-learn-extra & https://tqdm.github.io/.
We've had issues with adding modestly supported modules in the past (😱 duecredit 😱), so we'll want to keep an eye on all three of these, particularly in relation to #934.

Also, tqdm is a progress bar module. If we're going to require it, we might want to think about whether there are other parts of the code where it might be useful.

Member:

It looks like there's a problem with scikit-learn-extra in the Python 3.12 tests.

Collaborator:

I have an opinion here: I had a branch where I played with putting the progress bar in for curvefit (I think), because it is slow and can be confusing when it just sits there. I think it worked nicely, but I got distracted. Once this is pulled in, I'll look into doing that again. I don't think it makes sense for loglin, but we could add it anyway (being forward-looking, maybe it will be slow someday on some data?).

Member:

I like that idea. I've used tqdm before without issue (e.g., in NiMARE), so I'm all for adding progress bars. Not sure about applicability to loglin though. Don't we only loop over unique adaptive mask values there, rather than voxels?

Collaborator:

Good point; yeah, it makes no sense there. I hadn't even checked it. Curve fit can use it, and maybe figure creation? Honestly, once it punches through the ICA steps it's hustling, so it may not be needed elsewhere.

"scikit-learn>=0.21, <=1.4.1.post1",
"scipy>=1.2.0, <=1.12.0",
"threadpoolctl",
tedana/config.py (19 additions, 0 deletions)

@@ -0,0 +1,19 @@
"""Setting default values for ICA decomposition."""
Member:

I'd lean against making this a config file. Anyone who installs tedana using pip install tedana won't have easy access to editing or viewing this config file, and it would be overwritten every time they reinstall or pull a new version of the code from GitHub. It would be one thing if tedana looked for a config file in some other default location (definitely worth considering), but I think having it with the code will cause more problems than benefits.

I'd lean towards keeping our current method of setting defaults with input options, so it's clearer to the end user what the used options are.

Member:

I'm noticing that the config file variables get used in both the argument parser and the variable defaults for tedana_workflow. Given we've had a couple of bugs where those diverged, having them defined in one central place might be useful. I don't have a strong opinion on whether a config.py or global variables at the top of tedana.py make more sense.

Member:

If we're concerned about defaults not matching across functions, we can just make all parameters require the parameter name (e.g., func(*, param1, param2)) and drop defaults from our internal functions. The only places where there could be default mismatch at that point are the CLI argument parser and the main workflow function. I think that's more straightforward than adding in a config file or using global variables.

Reviewer:

Without necessarily pushing for one solution or the other code-structure-wise, I would say that if this kind of solution is used, perhaps defaults.py would be a better name than config.py?

Also, while removing defaults from function arguments would reduce the problem, it might not remove it completely. Sometimes, for a command-line option, you want to be able to state the default behaviour or value within the help string. While it's easy to cross-reference within the add_argument() call what's quoted in the help string vs. what's set as the default, having both read from the same symbol is IMO more robust.


DEFAULT_ICA_METHOD = "robustica"
DEFAULT_N_ROBUST_RUNS = 30
DEFAULT_N_MAX_ITER = 500
DEFAULT_N_MAX_RESTART = 10
DEFAULT_SEED = 42


"""Setting extreme values for number of robust runs."""

MIN_N_ROBUST_RUNS = 5
MAX_N_ROBUST_RUNS = 500
WARN_N_ROBUST_RUNS = 200


"""Setting the warning threshold for the index quality."""

WARN_IQ = 0.6
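For context, a sketch of how these constants would be consumed; the actual parser wiring lives in tedana's workflow code and is not part of this diff. Both the CLI parser and any function signature import the same symbol, and the help string interpolates it, which addresses the cross-referencing concern raised above:

    from argparse import ArgumentParser

    from tedana.config import DEFAULT_ICA_METHOD, DEFAULT_N_ROBUST_RUNS

    parser = ArgumentParser()
    parser.add_argument(
        "--ica_method",
        default=DEFAULT_ICA_METHOD,
        help=f"ICA method to use (default: {DEFAULT_ICA_METHOD}).",
    )
    parser.add_argument(
        "--n_robust_runs",
        type=int,
        default=DEFAULT_N_ROBUST_RUNS,
        help=f"Number of robust runs (default: {DEFAULT_N_ROBUST_RUNS}).",
    )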
tedana/decomposition/ica.py (175 additions, 7 deletions)

@@ -4,25 +4,47 @@
import warnings

import numpy as np
from robustica import RobustICA
from scipy import stats
from sklearn.decomposition import FastICA

from tedana.config import (
DEFAULT_ICA_METHOD,
DEFAULT_N_MAX_ITER,
DEFAULT_N_MAX_RESTART,
DEFAULT_N_ROBUST_RUNS,
WARN_IQ,
WARN_N_ROBUST_RUNS,
)

LGR = logging.getLogger("GENERAL")
RepLGR = logging.getLogger("REPORT")


def tedica(data, n_components, fixed_seed, maxit=500, maxrestart=10):
"""Perform ICA on ``data`` and return mixing matrix.
def tedica(
data,
n_components,
fixed_seed,
ica_method=DEFAULT_ICA_METHOD,
n_robust_runs=DEFAULT_N_ROBUST_RUNS,
maxit=DEFAULT_N_MAX_ITER,
maxrestart=DEFAULT_N_MAX_RESTART,
):
"""Perform ICA on `data` with the user selected ica method and returns mixing matrix.

Parameters
----------
data : (S x T) :obj:`numpy.ndarray`
Dimensionally reduced optimally combined functional data, where `S` is
samples and `T` is time
n_components : :obj:`int`
Number of components retained from PCA decomposition
Number of components retained from PCA decomposition.
fixed_seed : :obj:`int`
Seed for ensuring reproducibility of ICA results
Seed for ensuring reproducibility of ICA results.
ica_method : :obj:`str`
Selected ICA method; can be "fastica" or "robustica".
n_robust_runs : :obj:`int`
Selected number of robust runs when robustica is used. Default is 30.
maxit : :obj:`int`, optional
Maximum number of iterations for ICA. Default is 500.
maxrestart : :obj:`int`, optional
@@ -38,16 +38,162 @@ def tedica(data, n_components, fixed_seed, maxit=500, maxrestart=10):
fixed_seed : :obj:`int`
Random seed from final decomposition.

Notes
-----
Uses `sklearn` implementation of FastICA for decomposition
"""
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
RepLGR.info(
"Independent component analysis was then used to "
"decompose the dimensionally reduced dataset."
)

ica_method = ica_method.lower()

if ica_method == "robustica":
mmix, fixed_seed = r_ica(
data,
n_components=n_components,
fixed_seed=fixed_seed,
n_robust_runs=n_robust_runs,
max_it=maxit,
)
elif ica_method == "fastica":
mmix, fixed_seed = f_ica(
data,
n_components=n_components,
fixed_seed=fixed_seed,
maxit=maxit,
maxrestart=maxrestart,
)
else:
raise ValueError(f"The selected ICA method ({ica_method}) is invalid! It must be 'fastica' or 'robustica'.")

return mmix, fixed_seed
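A minimal usage sketch of the new dispatch on synthetic data (array sizes and component count are arbitrary):

    import numpy as np

    from tedana.decomposition.ica import tedica

    data = np.random.rand(1000, 150)  # S x T dimensionally reduced data
    mmix, used_seed = tedica(
        data,
        n_components=20,
        fixed_seed=42,
        ica_method="robustica",
        n_robust_runs=30,
    )
    print(mmix.shape)  # (150, 20): T x C z-scored mixing matrix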


def r_ica(data, n_components, fixed_seed, n_robust_runs, max_it):
"""Perform robustica on `data` and returns mixing matrix.

Parameters
----------
data : (S x T) :obj:`numpy.ndarray`
Dimensionally reduced optimally combined functional data, where `S` is
samples and `T` is time
n_components : :obj:`int`
Number of components retained from PCA decomposition.
fixed_seed : :obj:`int`
Seed for ensuring reproducibility of ICA results.
n_robust_runs : :obj:`int`
Selected number of robust runs when robustica is used. Default is 30.
max_it : :obj:`int`, optional
Maximum number of iterations for ICA. Default is 500.

Returns
-------
mmix : (T x C) :obj:`numpy.ndarray`
Z-scored mixing matrix for converting input data to component space,
where `C` is components and `T` is the same as in `data`
fixed_seed : :obj:`int`
Random seed from final decomposition.
"""
if n_robust_runs > WARN_N_ROBUST_RUNS:
LGR.warning(
"The selected n_robust_runs is a very big number! The process will take a long time!"
)

RepLGR.info("RobustICA package was used for ICA decomposition \\citep{Anglada2022}.")

if fixed_seed == -1:
fixed_seed = np.random.randint(low=1, high=1000)

for robust_method in ("DBSCAN", "AgglomerativeClustering"):

try:
rica = RobustICA(
n_components=n_components,
robust_runs=n_robust_runs,
whiten="arbitrary-variance",
max_iter=max_it,
random_state=fixed_seed,
robust_dimreduce=False,
fun="logcosh",
robust_method=robust_method,
)

s, mmix = rica.fit_transform(data)
q = rica.evaluate_clustering(
rica.S_all, rica.clustering.labels_, rica.signs_, rica.orientation_
)

except Exception:
continue

LGR.info(
f"The {robust_method} clustering algorithm was used to cluster "
f"components across different runs"
)

break  # stop after the first clustering method that succeeds

iq = np.array(
np.mean(q[q["cluster_id"] >= 0].iq)
) # Excluding outliers (cluster -1) from the index quality calculation

if iq < WARN_IQ:
LGR.warning(
f"The resultant mean Index Quality is low ({iq}). It is recommended to rerun the "
"process with a different seed."
)

mmix = mmix[
:, q["cluster_id"] >= 0
] # Excluding outliers (cluster -1) when calculating the mixing matrix
mmix = stats.zscore(mmix, axis=0)

LGR.info(
f"RobustICA with {n_robust_runs} robust runs and seed {fixed_seed} was used. "
f"The mean Index Quality is {iq}."
)

no_outliers = np.count_nonzero(rica.clustering.labels_ == -1)
if no_outliers:
LGR.info(
f"The {robust_method} clustering algorithm detected outliers when clustering "
f"components for different runs. These outliers are excluded when calculating "
f"the index quality and the mixing matrix to maximise the robustness of the "
f"decomposition."
)

return mmix, fixed_seed
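For orientation, the robustica calls used above reduce to the following standalone sketch on toy data (the keyword arguments mirror the call in r_ica; the data and sizes are arbitrary):

    import numpy as np
    from robustica import RobustICA

    x = np.random.rand(500, 100)  # S x T toy matrix
    rica = RobustICA(
        n_components=10,
        robust_runs=30,
        whiten="arbitrary-variance",
        robust_method="DBSCAN",
    )
    s, mmix = rica.fit_transform(x)  # sources and mixing matrix
    q = rica.evaluate_clustering(
        rica.S_all, rica.clustering.labels_, rica.signs_, rica.orientation_
    )
    iq = q[q["cluster_id"] >= 0].iq.mean()  # mean index quality, outliers excluded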


def f_ica(data, n_components, fixed_seed, maxit, maxrestart):
"""Perform FastICA on `data` and returns mixing matrix.

Parameters
----------
data : (S x T) :obj:`numpy.ndarray`
Dimensionally reduced optimally combined functional data, where `S` is
samples and `T` is time
n_components : :obj:`int`
Number of components retained from PCA decomposition
fixed_seed : :obj:`int`
Seed for ensuring reproducibility of ICA results
maxit : :obj:`int`, optional
Maximum number of iterations for ICA. Default is 500.
maxrestart : :obj:`int`, optional
Maximum number of attempted decompositions to perform with different
random seeds. ICA will stop running if there is convergence prior to
reaching this limit. Default is 10.

Returns
-------
mmix : (T x C) :obj:`numpy.ndarray`
Z-scored mixing matrix for converting input data to component space,
where `C` is components and `T` is the same as in `data`
fixed_seed : :obj:`int`
Random seed from final decomposition.

Notes
-----
Uses `sklearn` implementation of FastICA for decomposition
"""
if fixed_seed == -1:
fixed_seed = np.random.randint(low=1, high=1000)

tedana/resources/references.bib (11 additions, 0 deletions)

@@ -333,3 +333,14 @@ @article{tedana_decision_trees
year = {2024},
doi = {10.6084/m9.figshare.25251433.v1}
}

@Article{Anglada2022,
  Author = {Anglada-Girotto, Miquel and Miravet-Verde, Samuel and Serrano, Luis and Head, Sarah},
  Title = {robustica: customizable robust independent component analysis},
  Journal = {BMC Bioinformatics},
  Volume = {23},
  Number = {519},
  doi = {10.1186/s12859-022-05043-9},
  year = {2022}
}

tedana/tests/test_integration.py (9 additions, 0 deletions)

@@ -257,6 +257,8 @@ def test_integration_five_echo(skip_integration):
tedana_cli.tedana_workflow(
data=datalist,
tes=echo_times,
ica_method="robustica",
n_robust_runs=6,
out_dir=out_dir,
tedpca=0.95,
fittype="curvefit",
@@ -302,6 +304,7 @@ def test_integration_four_echo(skip_integration):
data=datalist,
mixm=op.join(op.dirname(datalist[0]), "desc-ICA_mixing_static.tsv"),
tes=[11.8, 28.04, 44.28, 60.52],
ica_method="fastica",
out_dir=out_dir,
tedpca="kundu-stabilize",
gscontrol=["gsr", "mir"],
@@ -346,6 +349,8 @@ def test_integration_three_echo(skip_integration):
tedana_cli.tedana_workflow(
data=f"{test_data_path}/three_echo_Cornell_zcat.nii.gz",
tes=[14.5, 38.5, 62.5],
ica_method="robustica",
n_robust_runs=5,
out_dir=out_dir,
low_mem=True,
tedpca="aic",
@@ -361,6 +366,10 @@
"62.5",
"--out-dir",
out_dir_manual,
"--ica_method",
"robustica",
"--n_robust_runs",
"5",
"--debug",
"--verbose",
"-f",
tedana/workflows/parser_utils.py (27 additions, 0 deletions)

@@ -3,6 +3,8 @@
import argparse
import os.path as op

from tedana.config import MAX_N_ROBUST_RUNS, MIN_N_ROBUST_RUNS


def check_tedpca_value(string, is_parser=True):
"""
@@ -33,6 +35,31 @@ def check_tedpca_value(string, is_parser=True):
return intarg


def check_n_robust_runs_value(string, is_parser=True):
"""
Check n_robust_runs argument.

Check if argument is an int between MIN_N_ROBUST_RUNS and MAX_N_ROBUST_RUNS.
"""
error = argparse.ArgumentTypeError if is_parser else ValueError
try:
intarg = int(string)
except ValueError:
msg = (
f"Argument to n_robust_runs must be an integer "
f"between {MIN_N_ROBUST_RUNS} and {MAX_N_ROBUST_RUNS}."
)
raise error(msg)

if not (MIN_N_ROBUST_RUNS <= intarg <= MAX_N_ROBUST_RUNS):
raise error(
f"n_robust_runs must be an integer between {MIN_N_ROBUST_RUNS} "
f"and {MAX_N_ROBUST_RUNS}."
)
return intarg


def is_valid_file(parser, arg):
"""Check if argument is existing file."""
if not op.isfile(arg) and arg is not None:
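A short sketch of how the validator behaves and how it would plug into the parser; the actual add_argument call lives in tedana's workflow parser, which this diff does not show:

    import argparse

    from tedana.config import DEFAULT_N_ROBUST_RUNS
    from tedana.workflows.parser_utils import check_n_robust_runs_value

    check_n_robust_runs_value("30")  # returns 30
    # check_n_robust_runs_value("2") raises ArgumentTypeError (below MIN_N_ROBUST_RUNS)

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--n_robust_runs",
        type=check_n_robust_runs_value,
        default=DEFAULT_N_ROBUST_RUNS,
    )
    args = parser.parse_args(["--n_robust_runs", "25"])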