Run doublet detection on ground truth data #454

sjspielman · 2024-05-22T18:32:25Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

This PR runs two (with a secret third!) doublet detection methods across four ground-truth datasets. The methods are scDblFinder (which runs a version of cxds for free) and scrublet. Results are saved in results/benchmark_results for further exploration in a subsequent PR.

Briefly describe the general approach you took to achieve this goal.

I wrote a couple of different scripts to make this all happen, as documented in scripts/README.md.
First, I have a shell script to download the files from Zenodo, which also calls an R script to format them into SCE and AnnData objects. My reasoning here was that getting files into ScPCA-esque formats will make the code more adaptable for future actual use.
Then, I have an R script that runs scDblFinder, and a Python script that runs scrublet. At one point I thought about using reticulate, but then decided I'd rather not get into the R vs. Python environment weeds, and would rather keep things modular for each language for future portability.

There is also an overall script run_doublet-detection.sh to wrap everything.

If known, do you anticipate filing additional pull requests to complete this analysis module?

I sure do!

Results

What is the name of your results bucket on S3?

Everything is here: s3://researcher-654654257431-us-east-2/doublet-detection

What types of results does your code produce (e.g., table, figure)?

SCE files with scDblFinder doublet results embedded
TSV files with scrublet results

What is your summary of the results?

None yet - that's for the next PR where I analyze these outputs in a notebook.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Can be run on a laptop.
The module has both renv and conda environments, and directions for the latter are in README.md.

Are there particular areas you'd like reviewers to have a close look at?

Have I gotten docs everywhere they need to be so far?

Is there anything that you want to discuss further?

Worth noting that the Dockerfile is not currently up to date with the environment and only considers renv. I will need to get conda dependencies in there too, but I'm waiting to do that until I take some time to evaluate a good base environment that could be used for this case.

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

analyses/doublet-detection/environment.yml

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

jashapiro

I gave some python style comments...

analyses/doublet-detection/scripts/01b_detect-doublets.py

allyhawkins

Overall organization looks pretty good. My main comment is about making the decision to have these scripts run on any SCE or any AnnData, rather than hard coding in the filenames for benchmarking. I think I would add a few more options to the scripts, including the SCE file and an output file and then have the script run on a single sample. You already have a bash script for running the benchmarking datasets so you can loop through the script with the specific files you are interested in there. That will make it easier to re-use these scripts when we get to ScPCA data.
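As a rough illustration of the suggestion above (not the code in this PR), the benchmarking loop could live in a driver that builds one command per dataset, while the detection script itself only ever sees a single input file and a single output file. The `--input_file` and `--output_file` flags, the output file naming, and the dataset names below are all hypothetical placeholders.

```python
from pathlib import Path


def build_scrublet_commands(dataset_names, data_dir, results_dir,
                            script="scripts/01b_detect-doublets.py"):
    """Build one command per dataset for a script that handles a single file."""
    commands = []
    for name in dataset_names:
        # Input and output file names here are assumptions for illustration
        input_file = Path(data_dir) / f"{name}_anndata.h5ad"
        output_file = Path(results_dir) / f"{name}_scrublet.tsv"
        commands.append([
            "python", script,
            "--input_file", str(input_file),    # hypothetical flag
            "--output_file", str(output_file),  # hypothetical flag
        ])
    return commands


# Placeholder dataset names, not the real benchmarking datasets
commands = build_scrublet_commands(["dataset-a", "dataset-b"], "data", "results")
for command in commands:
    print(" ".join(command))
```

Each command list could then be handed to `subprocess.run`, or the same loop could be written directly in the existing bash wrapper. Either way the per-file script stays reusable for ScPCA data.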

analyses/doublet-detection/README.md

analyses/doublet-detection/scripts/00a_download-benchmark-data.sh

analyses/doublet-detection/scripts/00b_format-benchmark-data.R

analyses/doublet-detection/scripts/01a_detect-doublets.R

analyses/doublet-detection/scripts/01b_detect-doublets.py

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

analyses/doublet-detection/scripts/01a_detect-doublets.R

sjspielman · 2024-05-23T20:55:23Z

Code has now been heavily modularized and, I think, much improved; thanks for all the reviews! And bonus: the code now uses seeds 😄 Results are now all TSV files in my bucket.

I am hoping the code in here is "self-explanatory" from the docs 🤞 - if I'm very wrong, please stop reviewing and let me know!
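For context on why seeds matter here: both methods rely on random steps such as simulating artificial doublets, so an unseeded run can give slightly different results each time. The following is a generic sketch of that idea using only the standard library. It is not the code from this PR, and the function name and its parameters are made up for illustration.

```python
import random


def simulate_doublet_pairs(n_cells, n_doublets, seed):
    """Pick random pairs of distinct cell indices to combine into simulated doublets."""
    # A local generator keeps the seed from leaking into global random state
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n_cells), 2)) for _ in range(n_doublets)]


# The same seed reproduces the same simulated pairs
pairs_a = simulate_doublet_pairs(n_cells=100, n_doublets=5, seed=2024)
pairs_b = simulate_doublet_pairs(n_cells=100, n_doublets=5, seed=2024)
print(pairs_a == pairs_b)
```

Passing the seed in as an argument, rather than hard coding it, keeps each run reproducible while still letting a caller vary it.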

allyhawkins

I think the reorg looks much better and a lot easier to follow! I just had a few remaining comments that I think should be addressed before merging.

analyses/doublet-detection/run_doublet-detection.sh

analyses/doublet-detection/scripts/00_format-benchmark-data.R

analyses/doublet-detection/scripts/01a_detect-doublets.R

allyhawkins · 2024-05-23T21:57:40Z

analyses/doublet-detection/scripts/01a_detect-doublets.R

This file is specifically running scDblFinder so I might rename the script to be explicit about that.

allyhawkins · 2024-05-23T21:57:54Z

analyses/doublet-detection/scripts/01b_detect-doublets.py

Same comment about the filename.

allyhawkins · 2024-05-23T22:00:36Z

analyses/doublet-detection/scripts/01b_detect-doublets.py

+    args.results_dir.mkdir(parents = True, exist_ok = True)
+
+    # Run scrublet and export the results
+    input_anndata = args.dataset_name + "_anndata.h5ad"


Maybe a good idea to explicitly check that this file exists rather than the directory.
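One way this check could look, as a minimal sketch rather than the final code: build the full path to the expected AnnData file and fail early with a clear message if it is missing. The `_anndata.h5ad` suffix comes from the diff above, while the `resolve_input_file` helper and its `data_dir` parameter are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_input_file(data_dir, dataset_name):
    """Return the path to the dataset's AnnData file, or raise if it does not exist."""
    input_file = Path(data_dir) / f"{dataset_name}_anndata.h5ad"
    # Check the file itself, since an existing directory says nothing about its contents
    if not input_file.is_file():
        raise FileNotFoundError(f"Could not find the input AnnData file: {input_file}")
    return input_file


# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    try:
        resolve_input_file(tmp, "example")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

    (Path(tmp) / "example_anndata.h5ad").touch()
    found_name = resolve_input_file(tmp, "example").name
```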

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>

jashapiro

A few more little thoughts

jashapiro · 2024-05-24T15:13:56Z

analyses/doublet-detection/scripts/01b_detect-doublets.py

+    parser.add_argument(
+        "--dataset_name",
+        type=str,
+        default="",


You may want to set this as required=True rather than setting the default like this?

done all around where required (🥁 )
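For reference, a small sketch of what `required=True` does with `argparse`. The `--dataset_name` argument is from the diff above. The `--results_dir` flag name is an assumption based on `args.results_dir` in the script, and the help text is made up.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser where missing arguments are an error rather than an empty default."""
    parser = argparse.ArgumentParser(description="Sketch of required-argument handling")
    parser.add_argument(
        "--dataset_name",
        type=str,
        required=True,  # argparse exits with a usage message if this is omitted
        help="Name of the dataset to process",
    )
    parser.add_argument(
        "--results_dir",  # assumed flag name
        type=Path,
        required=True,
        help="Directory where results are saved",
    )
    return parser


args = build_parser().parse_args(["--dataset_name", "example", "--results_dir", "results"])
print(args.dataset_name, args.results_dir)
```

With a default of `""`, a forgotten argument would only surface later as a confusing file-not-found error. With `required=True`, the script stops at argument parsing.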

analyses/doublet-detection/scripts/01b_detect-doublets.py

analyses/doublet-detection/scripts/01a_detect-doublets.R

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

analyses/doublet-detection/scripts/01b_run-scrublet.py

allyhawkins

A few small comments, but otherwise looks good!

analyses/doublet-detection/scripts/01a_run-scdblfinder.R

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

sjspielman · 2024-05-24T16:57:07Z

@allyhawkins I ended up doing one more thing in 2b1410c - added a sample_var arg to the script and pass it along with the seed into the scdblfinder function, in case future us would like this.

allyhawkins · 2024-05-24T17:14:52Z

@allyhawkins I ended up doing one more thing in 2b1410c - added a sample_var arg to the script and pass it along with the seed into the scdblfinder function, in case future us would like this.

👍

sjspielman added 14 commits May 21, 2024 16:40

conda environment with scrublet

47e58ee

Merge upstream main

eb7d62d

anndata to conda

3306608

remove outdated notebook

46df73b

Scripts to download and format the data, and associated documentation

26d449f

lock file update

b28074e

remove scripts/.gitkeep

32b0ec9

Add scripts to run doublet detection and associated documentation

40ef01e

cores option

1c2fafa

readme updated

555a649

add module bash script

a88de1b

Merge branch 'AlexsLemonade:main' into sjspielman/446-run-methods

18102e8

conda update

8f2df99

newline for github

e74f95a

sjspielman requested a review from allyhawkins as a code owner May 22, 2024 18:32

Merge branch 'main' into sjspielman/446-run-methods

371794d

sjspielman changed the title from "Sjspielman/446 run methods" to "Run doublet detection on ground truth data" May 22, 2024

jashapiro reviewed May 22, 2024

analyses/doublet-detection/environment.yml

Update analyses/doublet-detection/environment.yml

54dafd2

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

jashapiro reviewed May 22, 2024

allyhawkins reviewed May 22, 2024

sjspielman and others added 9 commits May 23, 2024 13:45

Apply suggestions from code review

178dec5

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com> Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

Merge branch 'main' into sjspielman/446-run-methods

3fff9dc

make sure pandas is locked down

0c81d92

update python script based on review comments

519f061

add zenodo to top comments and use outdir variable

30c9b2f

better path handling

de07fe8

Actually, only do 1 file at a time

0dd2bdc

too many inputs

6550615

totally forgot a seed here

e5b8866

sjspielman added 5 commits May 23, 2024 16:15

module run script massively updated for modularity

43e7f49

final.final environment

0d0a64b

documentation update

85d0d71

results needs a readme

8714800

formatted directory for each dataset, and put the original ones in raw

e03fc28

sjspielman commented May 23, 2024

analyses/doublet-detection/scripts/01a_detect-doublets.R

sjspielman added 2 commits May 23, 2024 16:51

Update analyses/doublet-detection/scripts/01a_detect-doublets.R

c988dcb

new line bork

9987a29

sjspielman requested a review from allyhawkins May 23, 2024 20:55

allyhawkins reviewed May 23, 2024

Apply suggestions from code review

fbcf313

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>

jashapiro reviewed May 24, 2024

Apply suggestions from code review

f997544

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

sjspielman mentioned this pull request May 24, 2024

Update doublet-detection Dockerfile #460

Closed

sjspielman added 2 commits May 24, 2024 11:57

change script names to reflect specific method

4db360e

Better argument handling and checking, and update Path specification

9500945

sjspielman requested review from allyhawkins and jashapiro May 24, 2024 16:05

sjspielman mentioned this pull request May 24, 2024

GHA for doublet-detection module #462

Merged

8 tasks

jashapiro reviewed May 24, 2024

analyses/doublet-detection/scripts/01b_run-scrublet.py

allyhawkins approved these changes May 24, 2024

analyses/doublet-detection/scripts/01a_run-scdblfinder.R

sjspielman and others added 4 commits May 24, 2024 12:38

Update analyses/doublet-detection/scripts/01b_run-scrublet.py

1d03455

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

spacing and better arg checking

a9f81ff

make the script a tad more flexible for future us by accommodating multiple samples better

2b1410c

Merge branch 'sjspielman/446-run-methods' of github.com:sjspielman/OpenScPCA-analysis into sjspielman/446-run-methods

d812619

sjspielman merged commit 9eb11b9 into AlexsLemonade:main May 24, 2024
2 checks passed

sjspielman deleted the sjspielman/446-run-methods branch May 24, 2024 17:26
