Run doublet detection on ground truth data #454

Merged
merged 43 commits into from
May 24, 2024

@sjspielman sjspielman commented May 22, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

#446

What is the goal of this pull request?

This PR runs two (with a secret third!) doublet detection methods across four ground-truth datasets. The methods are scDblFinder (which runs a version of cxds for free) and scrublet. Results are saved in results/benchmark_results for further exploration in a subsequent PR.

Briefly describe the general approach you took to achieve this goal.

I wrote a couple of different scripts to make this all happen, as documented in scripts/README.md.
First, I have a shell script to download the files from Zenodo, which also calls an R script to format them into SCE and AnnData objects. My reasoning here was that getting files into ScPCA-esque formats will make the code more adaptable for actual use in the future.
Then, I have an R script that runs scDblFinder, and a Python script that runs scrublet. At one point I thought about using reticulate, but I decided I'd rather not get into the R vs. Python environment weeds, and would rather keep things modular for each language for future portability.

There is also an overall script run_doublet-detection.sh to wrap everything.

If known, do you anticipate filing additional pull requests to complete this analysis module?

I sure do!

Results

What is the name of your results bucket on S3?

Everything is here: s3://researcher-654654257431-us-east-2/doublet-detection

What types of results does your code produce (e.g., table, figure)?

  • SCE files with scDblFinder doublet results embedded
  • TSV files with scrublet results

What is your summary of the results?

None yet - that's for the next PR where I analyze these outputs in a notebook.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

  • Can be run on a laptop.
  • The module has both renv and conda environments, and directions for the latter are in README.md.

Are there particular areas you'd like reviewers to have a close look at?

Have I gotten docs everywhere they need to be so far?

Is there anything that you want to discuss further?

Worth noting that the Dockerfile is not currently up-to-date with the environment and considers only renv. I will need to get conda dependencies in there too, but I am waiting to do that until I take some time to evaluate a good base environment that could be used for this case.

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@sjspielman sjspielman changed the title from "Sjspielman/446 run methods" to "Run doublet detection on ground truth data" on May 22, 2024
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

@jashapiro jashapiro left a comment

I gave some Python style comments...

analyses/doublet-detection/scripts/01b_detect-doublets.py (8 resolved comments, outdated)

@allyhawkins allyhawkins left a comment

Overall organization looks pretty good. My main comment is about making these scripts run on any SCE or any AnnData file, rather than hard-coding the filenames for benchmarking. I would add a few more options to the scripts, including the input SCE file and an output file, and then have each script run on a single sample. You already have a bash script for running the benchmarking datasets, so there you can loop through the script with the specific files you are interested in. That will make it easier to reuse these scripts when we get to ScPCA data.
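
As a rough illustration of this suggestion, a per-sample Python script might take its input and output paths as arguments and leave the loop over benchmarking datasets to the wrapper script. This is only a sketch: the flag names `--input_file` and `--output_file` and the example paths are hypothetical, not the names used in this PR, and the doublet detection step itself is omitted.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse arguments for a single-sample run (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Detect doublets in a single AnnData file."
    )
    parser.add_argument(
        "--input_file",
        type=Path,
        required=True,
        help="Path to the input AnnData (.h5ad) file.",
    )
    parser.add_argument(
        "--output_file",
        type=Path,
        required=True,
        help="Path to the output TSV file.",
    )
    # argv=None makes argparse read from sys.argv, as in normal script use
    return parser.parse_args(argv)


# Example with hypothetical paths; a real script would call parse_args()
args = parse_args(
    ["--input_file", "data/sample.h5ad", "--output_file", "results/sample.tsv"]
)
print(args.input_file, args.output_file)
```

The wrapper script could then call a script like this once per benchmarking dataset, and the same script could later be pointed at ScPCA files without changes.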

analyses/doublet-detection/README.md (1 resolved comment, outdated)
analyses/doublet-detection/scripts/01a_detect-doublets.R (2 resolved comments, outdated)
analyses/doublet-detection/scripts/01b_detect-doublets.py (3 resolved comments, outdated)
@sjspielman

Code has now been heavily modularized and, I think, much improved; thanks for all the reviews! And bonus: the code now uses seeds 😄 Results are now all TSV files in my bucket.

I am hoping the code in here is "self-explanatory" from the docs 🤞 - if I'm very wrong, please stop reviewing and let me know!

@allyhawkins allyhawkins left a comment

I think the reorg looks much better and a lot easier to follow! I just had a few remaining comments that I think should be addressed before merging.

analyses/doublet-detection/run_doublet-detection.sh (2 resolved comments, outdated)
analyses/doublet-detection/scripts/01a_detect-doublets.R (2 resolved comments, outdated)

This file is specifically running scDblFinder so I might rename the script to be explicit about that.

Same comment about the filename.

args.results_dir.mkdir(parents = True, exist_ok = True)

# Run scrublet and export the results
input_anndata = args.dataset_name + "_anndata.h5ad"

Maybe a good idea to explicitly check that this file exists rather than the directory.
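
One way this check could look, as a minimal sketch: build the full path to the expected file and fail early if it is missing. The `data_dir` parameter and the helper name `check_input_file` are hypothetical; only the `<dataset_name>_anndata.h5ad` naming comes from the snippet under review.

```python
from pathlib import Path


def check_input_file(data_dir, dataset_name):
    """Return the path to the dataset's AnnData file, or raise if it is missing."""
    # File naming follows the reviewed snippet: <dataset_name>_anndata.h5ad
    input_anndata = Path(data_dir) / (dataset_name + "_anndata.h5ad")
    # is_file() is False for directories and for paths that do not exist
    if not input_anndata.is_file():
        raise FileNotFoundError(f"Input AnnData file not found: {input_anndata}")
    return input_anndata
```

Checking the file rather than its parent directory catches the case where the directory exists but the download or formatting step did not produce this particular dataset.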

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>

@jashapiro jashapiro left a comment

A few more little thoughts

parser.add_argument(
"--dataset_name",
type=str,
default="",

You may want to set this as required=True rather than setting the default like this?

@sjspielman

done all around where required (🥁 )
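
For reference, a small sketch of what the `required=True` suggestion changes: with it, argparse exits with a usage error when the flag is missing, instead of silently continuing with an empty string. The help text and the example value `example-dataset` are placeholders, not taken from the PR.

```python
import argparse


def build_parser():
    """Build a parser where --dataset_name must be supplied by the caller."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset_name",
        type=str,
        required=True,  # replaces default=""; a missing flag is now an error
        help="Name of the dataset to process.",
    )
    return parser


# Placeholder value for illustration only
args = build_parser().parse_args(["--dataset_name", "example-dataset"])
print(args.dataset_name)
```

Calling `build_parser().parse_args([])` would print a usage message and exit, which surfaces the problem before any downstream path is built from an empty name.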

analyses/doublet-detection/scripts/01b_detect-doublets.py (2 resolved comments, outdated)
analyses/doublet-detection/scripts/01a_detect-doublets.R (1 resolved comment, outdated)
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

@allyhawkins allyhawkins left a comment

A few small comments, but otherwise looks good!

analyses/doublet-detection/scripts/01a_run-scdblfinder.R (2 resolved comments, outdated)
@sjspielman

@allyhawkins I ended up doing one more thing in 2b1410c - added a sample_var arg to the script and pass it along with the seed into the scdblfinder function, in case future us would like this.

@allyhawkins

@allyhawkins I ended up doing one more thing in 2b1410c - added a sample_var arg to the script and pass it along with the seed into the scdblfinder function, in case future us would like this.

👍

@sjspielman sjspielman merged commit 9eb11b9 into AlexsLemonade:main May 24, 2024
2 checks passed
@sjspielman sjspielman deleted the sjspielman/446-run-methods branch May 24, 2024 17:26

3 participants