Process to export SCE as AnnData #350
modules/export-anndata.nf
  emit: anndata_ch
}
The GitHub gods demand a sacrificial newline.
Thanks for doing this! I think it looks good, with a modification to the final mapping that might be needed.
For right now I have this publishing to the same results directory as the SCE objects and organized in the same way. This means all RDS and HDF5 files for a single sample will live in the same folder. We may want to re-arrange this depending on how downloads will be organized. I think if the goal is to allow for people to download things as either RDS or HDF5 then we probably want to have them separate?
I am not worried about this for this stage. I think at ingestion to the website they can be moved around as needed, so I don't think we should worry about subfolders.
bin/sce_to_anndata.R
scpcaTools::sce_to_anndata(sce,
                           anndata_file = opt$output_h5_file)
This is incredibly minor, but follows from my feeling that we may want to standardize on tidyverse code formatting.
- scpcaTools::sce_to_anndata(sce,
-                            anndata_file = opt$output_h5_file)
+ scpcaTools::sce_to_anndata(
+   sce,
+   anndata_file = opt$output_h5_file
+ )
I'm also just wondering how this function handles ADT data. Do we need to convert that separately if it is present? (I have not looked, but I'm kind of assuming you did when writing this.)
Yes, so this does not handle alternative experiments. The function that we are using will only convert the main experiment. Looking back on AlexsLemonade/scpcaTools#115, we had some initial thoughts on how we wanted to handle it, one of which was outputting a separate file for each altExp. I think we may want to address this in a separate issue/PR here because we will have to think about what we want the output to look like there.
In looking briefly at the Scanpy documentation, it looks like they store everything in one matrix with the adt data as additional rows in the gene by cell counts matrix. Maybe we could do something similar prior to converting to anndata to keep everything in one file?
https://scanpy-tutorials.readthedocs.io/en/multiomics/cite-seq/pbmc5k.html
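To make the "additional rows" idea concrete, here is a minimal sketch in plain Python with no AnnData or Scanpy calls. The toy counts, the function name `stack_modalities`, and the feature-type labels are all hypothetical and used only for illustration. It appends ADT features as extra rows of a feature-by-cell matrix and keeps a parallel list recording which modality each row came from.

```python
# Hypothetical toy data: each key is a feature, each list holds counts per cell.
rna_counts = {"GeneA": [3, 0, 1], "GeneB": [0, 2, 5]}
adt_counts = {"CD3": [10, 0, 7], "CD19": [0, 12, 1]}


def stack_modalities(rna, adt):
    """Append ADT rows after RNA rows in one feature-by-cell matrix.

    Returns (feature_names, feature_types, matrix). The feature-type
    labels are illustrative assumptions, not a required schema.
    """
    cell_counts = {len(v) for v in list(rna.values()) + list(adt.values())}
    if len(cell_counts) != 1:
        raise ValueError("all features must be measured on the same cells")

    names, types, matrix = [], [], []
    for label, table in (("Gene Expression", rna), ("Antibody Capture", adt)):
        for feature, counts in table.items():
            names.append(feature)
            types.append(label)
            matrix.append(list(counts))
    return names, types, matrix


names, types, matrix = stack_modalities(rna_counts, adt_counts)
for name, feature_type, row in zip(names, types, matrix):
    print(name, feature_type, row)
```

The per-row type label is what would let a downstream user split the modalities apart again after everything has been written to a single file.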
Doing that seems kind of hacky, and I don't love it. I think maybe the right approach going forward is to export mudata objects? https://mudata.readthedocs.io/en/latest/. This allows wrapping multiple AnnData objects in a way much more similar to SCE. Accessing the underlying AnnData objects is done with calls like mudata['rna'].
There are some remaining questions though. For example: do we want all files to be mudata for output, even if there is only RNA data?
All of this I think falls into future discussion, but we should probably resolve it pretty soon to prevent too much rewriting later.
I'm going to file a new issue about this, but think we should tackle it after this goes in/this sprint. Another question I have is about making sure our output is compliant with CZI cellxgene, since that is part of the goal of creating the AnnData output. I think because of that we may have to keep everything as AnnData rather than use mudata?
https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#general-requirements
That being said, it sounds like they are starting to figure out CITE-seq, so maybe we could discuss with them how best to store the CITE-seq data.
Agree on a new issue. But I think that we can do both pretty easily, as we should be able to make the AnnData RNA object within the mudata compatible with cellxgene.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Just noting that I tested this with a library with CITE-seq and things do still work; we just only get the RNA experiment in the output right now. This is ready for another review.
I added a few comments to clarify that (at least for now) the output file only includes RNA data.
Other than that this looks good.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Closes #226
This PR adds both the script and the process to the workflow to convert an RDS file containing an SCE object to an HDF5 file containing an AnnData object.
The script uses scpcaTools to do the conversion and export the AnnData object.
The process takes the meta object, the SCE file to be converted, and the type of file, e.g., whether the file represents an unfiltered, filtered, or processed SCE object. The script for converting to AnnData is then run within that process.
The output from post_process_sce, which contains the meta object and all three files, is passed as the input. I then mapped the channel so that each entry of the channel was a tuple with the meta and a single SCE file, which gets passed to the process for converting. The output from the process is then grouped back together by library ID to create a single tuple with the meta object and all three AnnData files.
Food for thought/next steps:
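Returning to the channel handling described in the PR summary: the split, convert, and regroup pattern can be sketched in plain Python. This is only an illustration of the data flow, not the Nextflow code from the PR. The file names, the `library_id` key, the `.hdf5` extension, and every function name here are assumptions made for the example, and `convert` is a placeholder that only renames the file rather than calling scpcaTools.

```python
from collections import defaultdict

FILE_TYPES = ("unfiltered", "filtered", "processed")

# Hypothetical stand-in for the post_process_sce output:
# one (meta, [three SCE files]) tuple per library.
post_process_out = [
    ({"library_id": "lib01"},
     ["lib01_unfiltered.rds", "lib01_filtered.rds", "lib01_processed.rds"]),
    ({"library_id": "lib02"},
     ["lib02_unfiltered.rds", "lib02_filtered.rds", "lib02_processed.rds"]),
]


def split_by_file(entries):
    """Yield one (meta, sce_file, file_type) tuple per SCE file."""
    for meta, sce_files in entries:
        for file_type, sce_file in zip(FILE_TYPES, sce_files):
            yield meta, sce_file, file_type


def convert(meta, sce_file, file_type):
    """Placeholder for the export process: only swaps the file extension."""
    return meta, sce_file.rsplit(".", 1)[0] + ".hdf5"


def group_by_library(converted):
    """Regroup converted files into one (meta, [files]) tuple per library."""
    metas, files = {}, defaultdict(list)
    for meta, anndata_file in converted:
        library_id = meta["library_id"]
        metas[library_id] = meta
        files[library_id].append(anndata_file)
    return [(metas[library_id], files[library_id]) for library_id in metas]


grouped = group_by_library(convert(*entry) for entry in split_by_file(post_process_out))
for meta, anndata_files in grouped:
    print(meta["library_id"], anndata_files)
```

Grouping on the library ID rather than on position means the regrouped tuple does not depend on the order in which the individual conversions finish.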