
Document decision tree steps in report #959

Merged — 30 commits into ME-ICA:doc-tree on Feb 21, 2024

Conversation

@tsalo (Member) commented Aug 3, 2023

Closes None.

Changes proposed in this pull request:

  • Start adding comments explaining each of the steps in the minimal decision tree. There are still some things I don't understand.
    • I'd like to add more layperson's descriptions in the _comment fields (see the sketch below).
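For example, a commented node might end up looking something like this (a hypothetical, abridged node: the field names are the ones discussed in the review threads below, but the values are illustrative):

```json
{
    "tag_if_true": "Unlikely BOLD",
    "log_extra_info": "",
    "log_extra_report": "Any components with rho greater than kappa are rejected.",
    "_comment": "Developer-facing note: rho > kappa means TE-independence dominates TE-dependence, so the component is unlikely to reflect BOLD signal."
}
```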

@tsalo added the refactoring label (issues proposing/requesting changes to the code which do not impact behavior) on Aug 3, 2023
@tsalo (Member Author) commented Aug 3, 2023

@handwerkerd do you know why the paths to the artifacts in the CircleCI config were wrong? Is it just that we changed the actual output folders in the DTM PR and forgot to update the artifact paths, or did we choose to stop uploading artifacts because of free plan limits or something similar?

@tsalo (Member Author) commented Aug 4, 2023

I didn't get through all of the Kundu decision tree steps, because some are quite complicated, so the report will be incomplete for that one.

@handwerkerd (Member) commented

> Is it just that we changed the actual output folders in the DTM PR and forgot to update the artifact paths

Looking at your change, I think that's what happened. I changed the location of the outputs for testing, but didn't realize I should have also made this change.

@codecov (bot) commented Aug 11, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (b139386) 89.54% compared to head (3875ea7) 89.37%.
Report is 11 commits behind head on main.

❗ Current head 3875ea7 differs from pull request most recent head 2e4f290. Consider uploading reports for the commit 2e4f290 to get more accurate results

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #959      +/-   ##
==========================================
- Coverage   89.54%   89.37%   -0.18%
==========================================
  Files          26       26
  Lines        3396     3397       +1
  Branches      619      619
==========================================
- Hits         3041     3036       -5
- Misses        207      211       +4
- Partials      148      150       +2
```


@tsalo changed the title from "Add comments to minimal decision tree" to "Document decision tree steps in report" on Aug 12, 2023
@tsalo (Member Author) commented Aug 12, 2023

I think I've successfully documented the remaining nodes of the kundu tree.

@tsalo requested a review from @handwerkerd on August 14, 2023
@tsalo requested a review from @eurunuela on August 23, 2023
@eurunuela (Collaborator) left a comment

There are a couple of comments that I find impossible to understand. Could you please make them more understandable to the regular user/contributor? Thanks!

Review thread on tedana/resources/decision_trees/kundu.json:

```diff
-"tag_if_true": "Unlikely BOLD"
+"tag_if_true": "Unlikely BOLD",
+"log_extra_info": "",
+"log_extra_report": "Any components with negative t-statistic comparing distribution of T2* model F-statistics from voxels in clusters to those of voxels not in clusters and variance explained greater than median are rejected."
```
@eurunuela (Collaborator): Also not easy to understand.

@tsalo (Member Author) commented Aug 24, 2023

What about this?

Suggested change:

```diff
-"log_extra_report": "Any components with negative t-statistic comparing distribution of T2* model F-statistics from voxels in clusters to those of voxels not in clusters and variance explained greater than median are rejected."
+"log_extra_report": "Any components with (1) greater than median variance explained and (2) a t-statistic less than zero, determined from a t-test of T2* model F-statistics from voxels in clusters (i.e., \"signal\" voxels) to voxels not in clusters (i.e., \"noise\" voxels), are rejected."
```

"log_extra_info": "",
"log_extra_report": "If any components are still labeled as unclassified or unclassified high-variance, then a revised decision table score is calculated from the provisionally accepted, unclassified, and unclassified high-variance components."
},
"_comment": "Calculate a new d_table_score. Why can't we use the original d_table_score function? Because it's a metric function?"
@eurunuela (Collaborator): Is this a comment for us?

@tsalo (Member Author): Yeah, the _comments are, I believe, for developers.

@handwerkerd (Member): log_extra_report is for text that will be output in the reports. _comment lives just in this file and can be used to provide additional information to anyone who decides to read the json file. In this case, the _comment should be changed.

Suggested change:

```diff
-"_comment": "Calculate a new d_table_score. Why can't we use the original d_table_score function? Because it's a metric function?"
+"_comment": "Calculate a new d_table_score. Five metrics are sorted and ranked across components, and this score is their average. A new score is calculated on a subset of components because the rankings change depending on which components were removed."
```

@eurunuela (Collaborator) commented
@tsalo I've been told there's a typo in the docs:

--manrej should be component indices to reject instead of to accept.

Could you please fix it in this PR? Just asking because this is docs-related.

@tsalo (Member Author) commented Sep 8, 2023

> Currently, that text says "This is based on the criteria of the MEICA v2.5 decision tree" and someone can look at several places to see what happens in that decision tree.

I don't think they can. Previous papers that I've read that describe the decision tree don't cover everything in sufficient detail. The main paper we cite for the v2.5 decision tree is Kundu et al. (2013), but that paper (and its supplement) seems to only describe Kappa- and Rho-based decisions. Olafsson et al. (2015) describes each of the steps in the decision tree in some detail, but it's completely missing the additional constraint of greater-than-median variance explained in several of the nodes.

Plus, that only applies to the v2.5 decision tree. The minimal decision tree doesn't have a paper that describes each step AFAIK, nor would custom decision trees.

> Do we want a line of text in the report explaining what happens in every node?

I think that is the best way to handle it.

> If we're expecting people to copy the report into manuscripts (with light editing), then that's too much detail.

I think all of this information is necessary for manuscripts, but I can also see people moving some contents into their supplementary materials. I've seen folks do this with fMRIPrep methods sections as well.

@handwerkerd (Member) commented

When I wrote "someone can look at several places to see what happens in that decision tree," I meant that there are several places in the existing outputs to understand what happens, but I agree these aren't detailed in any previous publication.

I'm going to disagree that all this information should be included in a manuscript. I've included the report output that would be generated by this PR for the kundu tree below. Not only is this long, but it's confusing and a poor way for anyone to understand what is happening in this process. We can edit various sentences, but I don't think anyone who reads any block of text like that would leave with a solid understanding of the process. If I reviewed a manuscript with a paragraph like that, I'd ask for revisions. The flow chart is a much better way of sharing and understanding this process.

This is an example of the limits of manuscripts as a format and the original purpose for things like BIDS. If a manuscript included every step at this level of detail, it would be very long and unreadable. For example, there are many critical MRI acquisition parameters that are never reported in manuscripts, but I worked with the original BIDS group to make sure most of this acquisition parameter list was in the BIDS jsons. I think we should take a similar approach here. We should have very clear output files that make it easy for people to know what happened, and we should expect those files to be shared with data.

Right now the relevant data are in several places: (1) the flow chart, (2) the info file tedana_[date].tsv, which has descriptors of each step, and (3) desc-ICA_decision_tree.json, which has additional descriptors.

Additional things we could do include generating a run-specific flow chart (#888) or pulling the info lines relating to the decision tree into a stand-alone file. I think much of the added text in this PR is useful for helping someone understand the process, but I just don't see it as useful text to include in a manuscript.

Here's what the report text currently looks like:

TE-dependence analysis was performed on input data using the tedana workflow \citep{dupre2021te}. An initial mask was generated from the first echo using nilearn's compute_epi_mask function. An adaptive mask was then generated, in which each voxel's value reflects the number of echoes with 'good' data. A two-stage masking procedure was applied, in which a liberal mask (including voxels with good data in at least the first echo) was used for optimal combination, T2*/S0 estimation, and denoising, while a more conservative mask (restricted to voxels with good data in at least the first three echoes) was used for the component classification procedure. A monoexponential model was fit to the data at each voxel using log-linear regression in order to estimate T2* and S0 maps. For each voxel, the value from the adaptive mask was used to determine which echoes would be used to estimate T2* and S0. Multi-echo data were then optimally combined using the T2* combination method \citep{posse1999enhancement}. Global signal regression was applied to the multi-echo and optimally combined datasets.

The following metrics were calculated: kappa, rho, countnoise, countsigFT2, countsigFS0, dice_FT2, dice_FS0, signal-noise_t, variance explained, normalized variance explained, d_table_score. Kappa (kappa) and Rho (rho) were calculated as measures of TE-dependence and TE-independence, respectively. A t-test was performed between the distributions of T2*-model F-statistics associated with clusters (i.e., signal) and non-cluster voxels (i.e., noise) to generate a t-statistic (metric signal-noise_z) and p-value (metric signal-noise_p) measuring relative association of the component to signal over noise. The number of significant voxels not from clusters was calculated for each component.

Next, component selection was performed to identify BOLD (TE-dependent) and non-BOLD (TE-independent) components using a decision tree. This is based on the criteria of the MEICA v2.5 decision tree \citep{kundu2013integrated}. For a detailed description of the decision tree steps, with the rationale for each step, see \citet{olafsson2015enhanced}. All components are initially labeled as 'unclassified'. Any components with rho greater than kappa are rejected. Any components with more voxels that are significant based on the S0 model's F-statistics than the T2* model's are rejected, as long as there is at least one significant voxel for the T2 model. The median variance explained is calculated across all components, for later steps. Any components with higher S0 model beta map-F-statistic map Dice similarity index than T2 model beta map-F-statistic map Dice similarity index and greater than median variance explained are rejected. Any components with negative t-statistic comparing distribution of T2* model F-statistics from voxels in clusters to those of voxels not in clusters and variance explained greater than median are rejected. The kappa elbow is calculated from all components, for later steps. Unclassified components exhibiting a large step down in variance explained are classified as 'unclassified high-variance'. The 'kundu' rho elbow is calculated from all and unclassified components, for later steps. Any unclassified components with kappa greater than or equal to the kappa elbow are provisionally accepted. Any provisionally accepted components with rho greater than the rho elbow are reset to 'unclassified'. The variance explained upper threshold is calculated as the 90th percentile of variance explained from provisionally accepted components. The variance explained lower threshold is calculated as the 25th percentile of variance explained from provisionally accepted components. The 'extend factor' is determined based on the number of volumes in the run. The 'max_good_meanmetricrank' is calculated as the number of provisionally accepted components times the extend factor. The variance explained-kappa ratio is calculated from the provisionally accepted components, as the maximum kappa minus the minimum kappa, divided by the maximum variance explained minus the minimum variance explained. Any provisionally accepted, unclassified, or unclassified high-variance components with a decision table score greater than 'max_good_meanmetricrank' and variance explained greater than the variance explained upper threshold multiplied by the extend factor are rejected. Any provisionally accepted, unclassified, or unclassified high-variance components with a decision table score greater than 'max_good_meanmetricrank', variance explained less than or equal to the variance explained lower threshold, and kappa less than or equal to the kappa elbow will be accepted and labeled as 'low variance'. If no components are still labeled as unclassified or unclassified high-variance, then all remaining provisionally accepted components are accepted. If any components are still labeled as unclassified or unclassified high-variance, then a revised decision table score is calculated from the provisionally accepted, unclassified, and unclassified high-variance components. 
Any provisionally accepted, unclassified, or unclassified high-variance components with a revised decision tree score greater than the 'conservative_guess', variance explained-kappa ratio greater than the extend factor times two, and variance explained greater than the variance explained upper threshold times the extend factor are rejected. Any provisionally accepted, unclassified, or unclassified high-variance components with a revised decision table score greater than 'num_acc_guess' times 0.9 and variance explained greater than variance explained lower threshold times the extend factor are rejected. An updated variance explained lower threshold (25th percentile) is calculated from the 'num_acc_guess' highest variance explained components among the remaining provisionally accepted, unclassified, and unclassified high-variance components. Any provisionally accepted, unclassified, or unclassified high-variance components with a revised decision table score greater than 'num_acc_guess' and variance explained greater than the new variance explained lower threshold are accepted and labeled as 'borderline'. Any provisionally accepted, unclassified, or unclassified high-variance components with kappa less than or equal to the kappa elbow and variance explained greater than the new variance explained lower threshold are accepted and labeled as 'borderline'. All remaining unclassified, unclassified high-variance, or provisionally accepted components are accepted. Minimum image regression was then applied to the data in order to remove spatially diffuse noise \citep{kundu2013integrated}.

This workflow used numpy \citep{van2011numpy}, scipy \citep{virtanen2020scipy}, pandas \citep{mckinney2010data,reback2020pandas}, scikit-learn \citep{pedregosa2011scikit}, nilearn, bokeh \citep{bokehmanual}, matplotlib \citep{Hunter:2007}, and nibabel \citep{brett_matthew_2019_3233118}. This workflow also used the Dice similarity index \citep{dice1945measures,sorensen1948method}.

@handwerkerd (Member) commented Oct 12, 2023

The plan for this was to create a DOI for the decision trees so that the json file could reference the tree and include that in the report. In the last meeting we suggested doing this through zenodo or figshare, but I realized we could also get DOIs from osf.io, and that's already where tedana is sharing other data. My plan would be to create a new project, "tedana decision trees", or put the files into an existing project like https://osf.io/bpe8h/ and create a DOI. Then the reference would say "...used the kundu decision tree specified in doi:xyz". Sounds good to others?

That said, one issue I have is that I'd like to store the json file and the flow chart visualization at the DOI link. The trouble is that I'd need to create a DOI and then upload the json file with the newly created DOI into the project. I'm not sure if that newly created DOI would then reference a version of the project before the json was uploaded. Anyone have experience with this?

Update: I'm realizing that the DOI would go into references.bib, so I just need to include a stable citep label in the decision tree json file. This means we'll need to keep a running list of unique labels in references.bib, which might get annoying if we end up creating a lot of tree variations, but this should otherwise be fine.
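Concretely, the pairing could look something like this (the citation key, title, year, and DOI below are invented for illustration):

```json
"report": "Component selection used the minimal decision tree \\citep{tedana_minimal_tree_v1}."
```

with a matching entry in references.bib:

```bibtex
@misc{tedana_minimal_tree_v1,
    author = {{tedana community}},
    title  = {tedana minimal decision tree, version 1},
    year   = {2024},
    doi    = {10.xxxx/placeholder}
}
```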

Update 2: https://help.osf.io/article/220-create-dois says "DOIs point to the current version of the project or registration. OSF does not support DOI versioning at this time." Since we specifically want to generate updated DOIs with content updates, we cannot use OSF. I'm fairly sure both zenodo and figshare will let me create one project with multiple files where I can get a new DOI for each version. Anyone have preferences for figshare vs. zenodo?

@tsalo (Member Author) commented Oct 18, 2023

I feel like FigShare might be a better fit for this, but I don't have a strong preference.

@eurunuela (Collaborator) commented

I have never used FigShare so I have no opinion on this.

@handwerkerd (Member) commented

@tsalo Could you try to merge main into this PR? I tried to do it on my branch and I'm running into problems. If you get it working, that might make it easier for me to figure out what's going wrong on my end. See: #970 (comment)

@tsalo (Member Author) commented Oct 31, 2023

@handwerkerd I updated the branch yesterday, but forgot to comment to that effect until now. Sorry about that!

@handwerkerd (Member) commented

Planned changes from past developers' calls:
Work in progress: https://github.com/handwerkerd/tedana/tree/doc-tree-dh

  • Put more text in _comments rather than making a hyper-detailed report
  • The report log will link to a DOI where each version of the tree is detailed as a json file and a flow chart
  • Use figshare for the DOI since it allows different DOIs for multiple versions of an object
  • Remove log_extra_report fields from the json files but save removing the code that interacts with that field for a separate PR

@handwerkerd (Member) commented Feb 20, 2024

Not sure if anyone but @tsalo can see this yet, but I uploaded the decision trees to Figshare in a new tedana project and created DOI 10.6084/m9.figshare.25251433. The updated data are at: https://figshare.com/account/articles/25251433

This is still private because I think there are limits to what I can change once I publish, but I can probably give individuals access so they can let me know whether it includes what you think it should include.

@tsalo changed the base branch from main to doc-tree on February 20, 2024
@tsalo (Member Author) commented Feb 20, 2024

The current approach of PRs to my branch is confusing me, so I've changed the target branch to a new one under ME-ICA (doc-tree). @handwerkerd, you can merge into the new branch if you'd like and just modify it directly.

@handwerkerd (Member) commented

This is either my git skills or a permissions issue, but I'm not sure how I can push directly to https://github.com/tsalo/tedana/tree/doc-tree. My current updates are now in https://github.com/handwerkerd/tedana/tree/doc-tree
I think this is very close to ready to review/merge. I've fully removed log_extra_report from the code. I needed to change https://osf.io/f6g45 (data for testing reclassify) to remove that key so that tests would pass, but the change was otherwise fairly easy.
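That cleanup amounts to something like the following (a minimal sketch, assuming the tree stores its steps under a top-level "nodes" list; the file name is illustrative):

```python
import json

with open("decision_tree_testing_data.json") as f:  # illustrative file name
    tree = json.load(f)

# Drop the retired key from every node; pop() with a default is a no-op
# wherever the key is absent.
for node in tree.get("nodes", []):
    node.pop("log_extra_report", None)

with open("decision_tree_testing_data.json", "w") as f:
    json.dump(tree, f, indent=4)
```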

I also fixed the five-echo test failure above, but there's a new error with five-echo that I'm having trouble solving. Five-echo is the one that uses the minimal decision tree with this line for reports:
https://github.com/handwerkerd/tedana/blob/09b9e2ef7f79262fad9ab7a32207012b51304ad6/tedana/resources/decision_trees/minimal.json#L4
For some reason, it's now unhappy with the comma-separated citations \\citep{kundu2013integrated, dupre2021te}. Not quite sure why this is happening now.

@tsalo (Member Author) commented Feb 21, 2024

@handwerkerd I have "allow edits by maintainers" selected, so you should be able to push to my branch if you add it as a remote, but I was originally proposing that you merge this PR into the ME-ICA/tedana@doc-tree branch (the new target branch for this PR) and make your changes there.

@handwerkerd merged commit fb4c290 into ME-ICA:doc-tree on Feb 21, 2024
12 of 13 checks passed
Labels: refactoring (issues proposing/requesting changes to the code which do not impact behavior)
Projects: none yet
Linked issues: none yet
Participants: 3