Run doublet detection on ground truth data #454
Conversation
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
I gave some python style comments...
Overall organization looks pretty good. My main comment is about making the decision to have these scripts run on any SCE or any AnnData, rather than hard coding in the filenames for benchmarking. I think I would add a few more options to the scripts, including the SCE file and an output file and then have the script run on a single sample. You already have a bash script for running the benchmarking datasets so you can loop through the script with the specific files you are interested in there. That will make it easier to re-use these scripts when we get to ScPCA data.
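The suggested restructuring can be sketched as a command-line interface that takes a single input file and an output path, so a wrapper script can loop over whatever samples are of interest. This is an illustrative sketch only; the option names (`--input_anndata`, `--output_tsv`) and file names are assumptions, not the module's actual interface.

```python
# Hypothetical per-sample CLI, per the review suggestion: no hard-coded
# benchmarking filenames, so the same script works on any AnnData file.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Run doublet detection on a single AnnData file."
    )
    parser.add_argument("--input_anndata", required=True,
                        help="Path to the input AnnData (.h5ad) file")
    parser.add_argument("--output_tsv", required=True,
                        help="Path to the output TSV of doublet calls")
    return parser

# Example invocation with illustrative file names:
args = build_parser().parse_args(
    ["--input_anndata", "sample1_anndata.h5ad", "--output_tsv", "sample1.tsv"]
)
```

A bash wrapper for the benchmarking datasets would then just call this script once per sample.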
analyses/doublet-detection/scripts/00a_download-benchmark-data.sh
Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Code has now been heavily modularized and, I think, much improved; thanks for all the reviews! And bonus: the code now uses seeds 😄 Results are now all TSV files in my bucket. I am hoping the code in here is "self-explanatory" from the docs 🤞 - if I'm very wrong, please stop reviewing and let me know!
I think the reorg looks much better and a lot easier to follow! I just had a few remaining comments that I think should be addressed before merging.
This file is specifically running `scDblFinder`, so I might rename the script to be explicit about that.
Same comment about the filename.
args.results_dir.mkdir(parents=True, exist_ok=True)

# Run scrublet and export the results
input_anndata = args.dataset_name + "_anndata.h5ad"
Maybe a good idea to explicitly check that this file exists rather than the directory.
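A minimal sketch of that check, assuming the script builds the input path from `dataset_name` as in the diff above: fail fast with a clear error if the expected AnnData file is missing, rather than only verifying the directory. The helper name `resolve_input` and its arguments are illustrative, not part of the actual script.

```python
from pathlib import Path

def resolve_input(data_dir: str, dataset_name: str) -> Path:
    """Build the expected AnnData path and verify the file itself exists."""
    input_anndata = Path(data_dir) / f"{dataset_name}_anndata.h5ad"
    if not input_anndata.is_file():
        raise FileNotFoundError(f"Expected input file not found: {input_anndata}")
    return input_anndata
```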
Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>
A few more little thoughts
parser.add_argument(
    "--dataset_name",
    type=str,
    default="",
You may want to set this as `required=True` rather than setting the default like this?
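For reference, the suggested change looks like this: with `required=True`, argparse exits with an error when `--dataset_name` is omitted, instead of silently proceeding with an empty string. The dataset name used here is illustrative.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataset_name",
    type=str,
    required=True,  # instead of default=""
)

# With the flag supplied, parsing succeeds; omitting it raises SystemExit.
args = parser.parse_args(["--dataset_name", "example-dataset"])
```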
done all around where required (🥁 )
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
A few small comments, but otherwise looks good!
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
…ltiple samples better
…enScPCA-analysis into sjspielman/446-run-methods
@allyhawkins I ended up doing one more thing in 2b1410c - added a
👍
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
#446
What is the goal of this pull request?
This PR runs two (with a secret third!) doublet detection methods across four ground-truth datasets. The methods are `scDblFinder` (which runs a version of `cxds` for free) and `scrublet`. Results are saved in `results/benchmark_results` for further exploration in a subsequent PR.
Briefly describe the general approach you took to achieve this goal.
I wrote a couple of different scripts to make this all happen, as documented in `scripts/README.md`. First, I have a shell script to download the files from Zenodo, which also calls an R script to format them into SCE and AnnData objects. My reasoning here was that getting files into ScPCA-esque formats will make the code more adaptable for future actual use.
Then, I have an R script that runs `scDblFinder`, and a Python script that runs `scrublet`. At one point I thought about using `reticulate`, but then decided I'd rather not get into the R vs. Python environment weeds, and again keep things modular for each language for future portability. There is also an overall script `run_doublet-detection.sh` to wrap everything.
If known, do you anticipate filing additional pull requests to complete this analysis module?
I sure do!
Results
What is the name of your results bucket on S3?
Everything is here:
s3://researcher-654654257431-us-east-2/doublet-detection
What types of results does your code produce (e.g., table, figure)?
What is your summary of the results?
None yet - that's for the next PR where I analyze these outputs in a notebook.
Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
`renv` and `conda` environments; directions for the latter are in `README.md`.
Are there particular areas you'd like reviewers to have a close look at?
Have I gotten docs everywhere they need to be so far?
Is there anything that you want to discuss further?
Worth noting that the `Dockerfile` is not currently up-to-date with the environment, and currently considers only `renv`. I will need to get `conda` dependencies in here too, but I'm waiting to do that until I take some time to evaluate a good base environment that could be used for this case.
Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
- `README.md` has been updated to reflect code changes in this pull request.
Reproducibility checklist
- `Dockerfile`.
- `environment.yml` file.
- `renv.lock` file.