Process to export SCE as AnnData #350

Merged
merged 9 commits into from
Jun 20, 2023

allyhawkins
Member

Closes #226

This PR adds both a script and a process to the workflow that convert an RDS file containing an SCE object to an HDF5 file containing an AnnData object.

  • The script is pretty straightforward and takes in one SCE file at a time. I'm using the function we have in scpcaTools to do the conversion and export the AnnData object.
  • The process is also pretty simple and takes in a tuple of the meta object, the SCE file to be converted, and the type of file, i.e., whether the file holds an unfiltered, filtered, or processed SCE object. The script for converting to AnnData is then run within that process.
  • The workflow that I added here is where most of the work is being done. The output of post_process_sce, which contains the meta object and all three files, is passed as the input. I then mapped the channel so that each entry is a tuple with the meta object and a single SCE file, which gets passed to the process for converting. The output from the process is then grouped back together by library ID to create a single tuple with the meta object and all three AnnData files.
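
The split-then-regroup pattern described in the last bullet can be sketched outside Nextflow. This is only an illustration in plain Python, not the workflow code from this PR: the `library_id` key, the file names, the `.h5` extension, and the helper names (`split_by_file`, `convert`, `regroup`) are all assumptions made for the example, and `convert` is a placeholder that renames the file instead of running the real conversion.

```python
from collections import defaultdict

# Order in which the three SCE files appear in each entry (assumed).
FILE_TYPES = ("unfiltered", "filtered", "processed")

# Hypothetical stand-in for the post_process_sce output:
# one (meta, unfiltered, filtered, processed) entry per library.
sce_entries = [
    ({"library_id": "lib01"},
     "lib01_unfiltered.rds", "lib01_filtered.rds", "lib01_processed.rds"),
    ({"library_id": "lib02"},
     "lib02_unfiltered.rds", "lib02_filtered.rds", "lib02_processed.rds"),
]


def split_by_file(entries):
    """Emit one (meta, sce_file, file_type) tuple per SCE file."""
    for meta, *sce_files in entries:
        for sce_file, file_type in zip(sce_files, FILE_TYPES):
            yield meta, sce_file, file_type


def convert(meta, sce_file, file_type):
    """Placeholder for the conversion process: only swaps the extension."""
    h5_file = sce_file.rsplit(".", 1)[0] + ".h5"
    return meta, h5_file, file_type


def regroup(converted):
    """Group converted files by library ID into one tuple per library."""
    metas = {}
    files_by_library = defaultdict(dict)
    for meta, h5_file, file_type in converted:
        library_id = meta["library_id"]
        metas[library_id] = meta
        files_by_library[library_id][file_type] = h5_file
    return [
        (metas[library_id], *(files[t] for t in FILE_TYPES))
        for library_id, files in files_by_library.items()
    ]


anndata_entries = regroup(convert(*item) for item in split_by_file(sce_entries))
for entry in anndata_entries:
    print(entry)
```

Keying the regrouping on the library ID rather than on arrival order matters because the per-file conversions may finish in any order.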

Food for thought/ next steps:

  • For right now I have this publishing to the same results directory as the SCE objects and organized in the same way. This means all RDS and HDF5 files for a single sample will live in the same folder. We may want to re-arrange this depending on how downloads will be organized. I think if the goal is to allow for people to download things as either RDS or HDF5 then we probably want to have them separate?
  • Once we have a tuple with all of the AnnData files, we can pass them on to additional downstream processes. One of our goals with this is to make our processed data CZI compliant. Part of that is making sure that things are in the right places and labeled correctly in the AnnData object. I am envisioning a separate process that runs a Python script that works directly with the AnnData object to make any modifications that we need.


emit: anndata_ch

}
Member

The GitHub gods demand a sacrificial newline.

Member

@jashapiro jashapiro left a comment

Thanks for doing this! I think it looks good, with a modification to the final mapping that might be needed.

> For right now I have this publishing to the same results directory as the SCE objects and organized in the same way. This means all RDS and HDF5 files for a single sample will live in the same folder. We may want to re-arrange this depending on how downloads will be organized. I think if the goal is to allow for people to download things as either RDS or HDF5 then we probably want to have them separate?

I am not worried about this for this stage. I think at ingestion to the website they can be moved around as needed, so I don't think we should worry about subfolders.

Comment on lines 45 to 46
scpcaTools::sce_to_anndata(sce,
anndata_file = opt$output_h5_file)
Member

This is incredibly minor, but follows from my feeling that we may want to standardize on tidyverse code formatting.

Suggested change
- scpcaTools::sce_to_anndata(sce,
-   anndata_file = opt$output_h5_file)
+ scpcaTools::sce_to_anndata(
+   sce,
+   anndata_file = opt$output_h5_file
+ )

I'm also just wondering how this function handles ADT data. Do we need to convert that separately if it is present? (I have not looked, but I'm kind of assuming you did when writing this.)

Member Author

Yes, so this does not handle alternative experiments. Our function that we are using will only convert the main experiment. Looking back on AlexsLemonade/scpcaTools#115, we had some initial thoughts on how we wanted to handle it, one of which was outputting a separate file for each altExp. I think we may want to address this in a separate issue/PR here because we will have to think about what we want the output to look like there.

In looking briefly at the Scanpy documentation, it looks like they store everything in one matrix with the adt data as additional rows in the gene by cell counts matrix. Maybe we could do something similar prior to converting to anndata to keep everything in one file?
https://scanpy-tutorials.readthedocs.io/en/multiomics/cite-seq/pbmc5k.html
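
To make the single-matrix idea concrete, here is a rough sketch in plain Python. It is not what scpcaTools or Scanpy does internally: the toy counts, the `stack_features` helper, and the `"rna"` / `"adt"` labels are all assumptions for illustration. It appends ADT rows to a gene-by-cell matrix and keeps a parallel list recording which modality each row came from.

```python
# Toy gene-by-cell counts: each key is a feature, each list holds one value per cell.
rna_counts = {"GENE_A": [3, 0, 1], "GENE_B": [0, 2, 5]}
adt_counts = {"CD3": [10, 0, 7], "CD19": [0, 12, 1]}


def stack_features(rna, adt):
    """Stack RNA and ADT rows into one matrix and label each row's modality."""
    # Both modalities must describe the same set of cells.
    cell_counts = {len(row) for row in list(rna.values()) + list(adt.values())}
    if len(cell_counts) != 1:
        raise ValueError("RNA and ADT counts must cover the same number of cells")

    # Feature names must stay unique once the two modalities share one matrix.
    overlap = rna.keys() & adt.keys()
    if overlap:
        raise ValueError(f"Feature names appear in both modalities: {sorted(overlap)}")

    feature_names, feature_types, matrix = [], [], []
    for feature_type, counts in (("rna", rna), ("adt", adt)):
        for name, row in counts.items():
            feature_names.append(name)
            feature_types.append(feature_type)
            matrix.append(list(row))
    return feature_names, feature_types, matrix


names, types, combined = stack_features(rna_counts, adt_counts)
for name, feature_type, row in zip(names, types, combined):
    print(name, feature_type, row)
```

The modality label is what lets downstream code separate the two feature sets again, since ADT and RNA counts are usually normalized differently.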

Member

> Yes, so this does not handle alternative experiments. Our function that we are using will only convert the main experiment. Looking back on AlexsLemonade/scpcaTools#115, we had some initial thoughts on how we wanted to handle it, one of which was outputting a separate file for each altExp. I think we may want to address this in a separate issue/PR here because we will have to think about what we want the output to look like there.
>
> In looking briefly at the Scanpy documentation, it looks like they store everything in one matrix with the adt data as additional rows in the gene by cell counts matrix. Maybe we could do something similar prior to converting to anndata to keep everything in one file? https://scanpy-tutorials.readthedocs.io/en/multiomics/cite-seq/pbmc5k.html

Doing that seems kind of hacky, and I don't love it. I think maybe the right approach going forward is to export mudata objects? https://mudata.readthedocs.io/en/latest/. This allows wrapping multiple AnnData objects in a way much more similar to SCE. The underlying AnnData objects are accessed with calls like mudata['rna'].

There are some remaining questions, though. For example, do we want all output files to be mudata, even if there is only RNA data?

All of this I think falls into future discussion, but we should probably resolve it pretty soon to prevent too much rewriting later.

Member Author

I'm going to file a new issue about this, but I think we should tackle it after this goes in / this sprint. Another question I have is about making sure our output is compliant with CZI cellxgene, since that is part of the goal of creating the AnnData output. Because of that, I think we may have to keep everything as AnnData rather than use mudata?
https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#general-requirements

That being said, it sounds like they are starting to figure out CITE-seq, so maybe we could discuss with them how best to store the CITE-seq data.

Member

Agree on a new issue. But I think that we can do both pretty easily, as we should be able to make the AnnData RNA object within the mudata compatible with cellxgene.

allyhawkins and others added 2 commits June 20, 2023 09:32
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins
Member Author

Just noting that I tested this with a library with CITE-seq and things do still work; we just only get the RNA experiment in the output right now. This is ready for another review.

Member

@jashapiro jashapiro left a comment

I added a few comments to clarify that (at least for now) the output file only includes RNA data.

Other than that this looks good.

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins allyhawkins merged commit 19722f4 into development Jun 20, 2023
2 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/export-anndata branch June 20, 2023 22:56