Process to export SCE as AnnData #350

allyhawkins · 2023-06-09T14:20:35Z

Closes #226

This PR adds both the script and the process to the workflow for converting an RDS file containing an SCE object to an HDF5 file containing an AnnData object.

The script is pretty straightforward and takes in one SCE file at a time. I'm using the function we have in scpcaTools to do the conversion and export the AnnData object.
The process is also pretty simple and takes in a tuple of the meta object, the SCE file to be converted, and the file type, i.e., whether the file holds an unfiltered, filtered, or processed SCE object. The script for converting to AnnData is then run within that process.
The workflow that I added here is where most of the work is being done. The output of post_process_sce, which contains the meta object and all three files, is passed as the input. I then mapped the channel so that each entry is a tuple with the meta object and a single SCE file, which gets passed to the process for converting. The output from the process is then grouped back together by library ID to create a single tuple with the meta object and all three AnnData files.
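
The fan-out and regrouping described above can be sketched in plain Python to make the data flow concrete. This is only an illustration of the tuple shapes, not the Nextflow code in this PR: the `library_id` key, the file names, the `.hdf5` extension, and the `convert` stand-in are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical file-type labels, in the order the three SCE files arrive.
FILE_TYPES = ("unfiltered", "filtered", "processed")


def fan_out(post_process_output):
    """Turn each (meta, unfiltered, filtered, processed) tuple into
    one (meta, sce_file, file_type) tuple per SCE file."""
    for meta, *sce_files in post_process_output:
        for sce_file, file_type in zip(sce_files, FILE_TYPES):
            yield meta, sce_file, file_type


def convert(meta, sce_file, file_type):
    """Stand-in for the conversion process: it only derives an output
    file name and does not read or write any data."""
    stem = sce_file[: -len(".rds")] if sce_file.endswith(".rds") else sce_file
    return meta, stem + ".hdf5"


def regroup(converted):
    """Group converted files back together by library ID, giving one
    (meta, [anndata files]) tuple per library."""
    grouped = defaultdict(lambda: {"meta": None, "files": []})
    for meta, h5_file in converted:
        entry = grouped[meta["library_id"]]
        entry["meta"] = meta
        entry["files"].append(h5_file)
    return [(entry["meta"], entry["files"]) for entry in grouped.values()]


libraries = [
    ({"library_id": "lib01"}, "lib01_unfiltered.rds", "lib01_filtered.rds", "lib01_processed.rds"),
    ({"library_id": "lib02"}, "lib02_unfiltered.rds", "lib02_filtered.rds", "lib02_processed.rds"),
]
anndata_tuples = regroup(convert(*item) for item in fan_out(libraries))
```

Each library goes in as one tuple of three SCE files, is converted one file at a time, and comes back out as one tuple of three AnnData files.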

Food for thought/ next steps:

For right now I have this publishing to the same results directory as the SCE objects and organized in the same way. This means all RDS and HDF5 files for a single sample will live in the same folder. We may want to re-arrange this depending on how downloads will be organized. I think if the goal is to allow for people to download things as either RDS or HDF5 then we probably want to have them separate?
Once we have a tuple with all of the AnnData files, we can then pass them on to additional downstream processes. One of our goals with this is to make our processed data CZI compliant. Part of that is making sure that things are in the right places and labeled correctly in the AnnData object. I am envisioning a separate process that runs a Python script that works directly with the AnnData object to make any modifications that we need.

sjspielman · 2023-06-09T14:39:39Z

modules/export-anndata.nf

+
+    emit: anndata_ch
+
+}


github gods demand a ~~sacrifice~~ new line.

jashapiro

Thanks for doing this! I think it looks good, with a modification to the final mapping that might be needed.

> For right now I have this publishing to the same results directory as the SCE objects and organized in the same way. This means all RDS and HDF5 files for a single sample will live in the same folder. We may want to re-arrange this depending on how downloads will be organized. I think if the goal is to allow for people to download things as either RDS or HDF5 then we probably want to have them separate?

I am not worried about this for this stage. I think at ingestion to the website they can be moved around as needed, so I don't think we should worry about subfolders.

jashapiro · 2023-06-16T17:16:51Z

bin/sce_to_anndata.R

+scpcaTools::sce_to_anndata(sce,
+                           anndata_file = opt$output_h5_file)


This is incredibly minor, but follows from my feeling that we may want to standardize on tidyverse code formatting.

Suggested change

-scpcaTools::sce_to_anndata(sce,
-                           anndata_file = opt$output_h5_file)
+scpcaTools::sce_to_anndata(
+  sce,
+  anndata_file = opt$output_h5_file
+)

I'm also just wondering how this function handles ADT data. Do we need to convert that separately if it is present? (I have not looked, but I'm kind of assuming you did when writing this.)

Yes, so this does not handle alternative experiments. Our function that we are using will only convert the main experiment. Looking back on AlexsLemonade/scpcaTools#115, we had some initial thoughts on how we wanted to handle it, one of which was outputting a separate file for each altExp. I think we may want to address this in a separate issue/PR here because we will have to think about what we want the output to look like there.

In looking briefly at the Scanpy documentation, it looks like they store everything in one matrix with the adt data as additional rows in the gene by cell counts matrix. Maybe we could do something similar prior to converting to anndata to keep everything in one file?
https://scanpy-tutorials.readthedocs.io/en/multiomics/cite-seq/pbmc5k.html
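
To make the single-matrix idea concrete, here is a toy sketch in plain Python of stacking ADT features under the RNA features in one feature-by-cell table, with a label recording where each row came from. It is not how Scanpy or scpcaTools implement anything; the function name, the `"rna"`/`"adt"` labels, and the feature names are hypothetical.

```python
def combine_modalities(rna, adt):
    """Stack RNA and ADT features into one features-by-cells table.

    rna, adt: dicts mapping a feature name to its per-cell counts,
    with cells in the same order in both dicts.
    Returns (feature names, feature type labels, count rows).
    """
    n_cells = {len(counts) for counts in list(rna.values()) + list(adt.values())}
    if len(n_cells) != 1:
        raise ValueError("all features must cover the same set of cells")

    features, feature_types, matrix = [], [], []
    for label, table in (("rna", rna), ("adt", adt)):
        for name, counts in table.items():
            features.append(name)
            feature_types.append(label)  # lets downstream code split the rows again
            matrix.append(list(counts))
    return features, feature_types, matrix


# Hypothetical counts for three cells.
rna = {"CD3E": [5, 0, 2], "MS4A1": [0, 7, 1]}
adt = {"CD3_antibody": [120, 3, 40]}
features, feature_types, matrix = combine_modalities(rna, adt)
```

The label column is what keeps the two modalities separable after they share one matrix.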

Doing that seems kind of hacky, and I don't love it. I think maybe the right approach going forward is to export mudata objects? https://mudata.readthedocs.io/en/latest/. This allows wrapping multiple AnnData objects in a way much more similar to SCE. Accessing the underlying AnnData objects is done with calls like mudata['rna'].

There are some remaining questions though. For example: do we want all files to be mudata for output, even if there is only RNA data?

All of this I think falls into future discussion, but we should probably resolve it pretty soon to prevent too much rewriting later.

I'm going to file a new issue about this, but think we should tackle it after this goes in/ this sprint. Another question I have is about making sure our output is compliant with CZI cellxgene, since that is part of the goal of creating the AnnData output. I think because of that we may have to keep everything as AnnData rather than use mudata?
https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#general-requirements

That being said, it sounds like they are starting to figure out CITE-seq, so maybe we could discuss with them how best to store the CITE-seq data.

Agree on a new issue. But I think that we can do both pretty easily, as we should be able to make the AnnData RNA object within the mudata compatible with cellxgene.

allyhawkins · 2023-06-20T17:30:46Z

Just noting that I tested this with a library with CITE-seq and things do still work; we just only get the RNA experiment in the output right now. This is ready for another review.

jashapiro

I added a few comments to clarify that (at least for now) the output file only includes RNA data.

Other than that this looks good.

bin/sce_to_anndata.R

main.nf

modules/export-anndata.nf

allyhawkins added 3 commits June 8, 2023 17:05

process for exporting anndata

f1c56c5

add some comments

c1ca268

make sce to anndata executable

c271c26

sjspielman reviewed Jun 9, 2023

allyhawkins and others added 3 commits June 9, 2023 09:42

add stub

4db7472

Merge branch 'development' into allyhawkins/export-anndata

99e1d91

new line

9b3fcee

allyhawkins requested a review from jashapiro June 12, 2023 20:03

jashapiro reviewed Jun 16, 2023

allyhawkins and others added 2 commits June 20, 2023 09:32

Apply suggestions from code review

352d287

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

style update and update comment

becceb9

allyhawkins mentioned this pull request Jun 20, 2023

Handle altExps with AnnData output #351

Closed

allyhawkins requested a review from jashapiro June 20, 2023 17:30

jashapiro approved these changes Jun 20, 2023

Apply suggestions from code review

a63485a

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins merged commit 19722f4 into development Jun 20, 2023
2 checks passed

allyhawkins deleted the allyhawkins/export-anndata branch June 20, 2023 22:56

allyhawkins mentioned this pull request Jun 28, 2023

Add process to convert SCE to AnnData #226

Closed
