Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functions for merging in AltExperiments #31

Merged
merged 8 commits into from
Aug 31, 2021
Merged

Conversation

jashapiro
Copy link
Member

closes #28

This PR adds a function for merging an alternative experiment (i.e. cellhash and CITE-seq data) into a SingleCellExperiment object. While the SingleCellExperiment::altExp() does most of the heavy lifting, it assumes that the altExp already has the same columns (cells) as the SCE it is being added to. As we process the two runs separately, this will not always be the case, so this function adds columns of zeros for cells that appear in the "base" SCE, but are not found in the altExp.

Since alevin output for these alternative experiments processed by alevin-fry is usually in matrix market format (or at least is in our workflow), I also modified read_alevin() to be handle mtx files even when not in USA mode. by adding separate mtx_format argument.

Questions for reviewers

I wrote some tests to be sure this works as expected when the data are as expected, but I was thinking I should throw some kind of warning if the base SCE and the AltExp don't have sufficient overlap in their cells, but I couldn't really decide the best way to do this.

One option was to define some kind of minimal overlap between the sets, but where to draw the threshold was a challenge, as I want this to work on unfiltered data.

I just had the idea that I could check for overlap using only the top X cells with highest total counts, as a very crude filter. If say 50% of those top cells have no corresponding feature data, that would definitely imply a mismatch.

I want to be pretty liberal, but still catch the case where the data may be mismatched.

We could (and probably should!) also check this at a later point, but some generic warning seems like it would be a good idea, if we have any ideas about where to put the threshold.

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks pretty good, and my only comment is about how to add in a check for overlap between CITE-seq and rna counts. In thinking about it more, I wonder if it's something that we should add as a warning in the QC report rather than here? I don't think there is a golden number for overlap, and I think low numbers of overlap can be a sign of a poor quality experiment, which someone might want to be aware of.

R/merge_altexp.R Show resolved Hide resolved
@jashapiro
Copy link
Member Author

I think this looks pretty good, and my only comment is about how to add in a check for overlap between CITE-seq and rna counts. In thinking about it more, I wonder if it's something that we should add as a warning in the QC report rather than here? I don't think there is a golden number for overlap, and I think low numbers of overlap can be a sign of a poor quality experiment, which someone might want to be aware of.

I definitely think it should show up in the QC report. I think my hope here was to catch obvious mispairings of data files, and I wasn't really thinking about just bad libraries. The 50% threshold is probably reasonable, though maybe I would set it even a bit lower if I just want mispairings: The chance of hitting even 10% overlap by chance is quite low (given millions of potential barcodes), so probably a lower threshold makes sense.... 20%?

Then in the QC report we should probably say something about what % of cells have features. At that point though, the # of features with RNA will be lost... unless we add it into the metadata, which we could do, if we know what we want to add.

@allyhawkins
Copy link
Member

I think my hope here was to catch obvious mispairings of data files, and I wasn't really thinking about just bad libraries. The 50% threshold is probably reasonable, though maybe I would set it even a bit lower if I just want mispairings: The chance of hitting even 10% overlap by chance is quite low (given millions of potential barcodes), so probably a lower threshold makes sense.... 20%?

I see your point about catching mispairings of data files. I would say by chance there is going to be some overlap between barcodes so but probably nowhere near as high as 50%. I would agree that 20% seems reasonable there to me.

Then in the QC report we should probably say something about what % of cells have features. At that point though, the # of features with RNA will be lost... unless we add it into the metadata, which we could do, if we know what we want to add.

I was thinking along these lines too that adding the information of how many features are also found in the RNA to the metadata so we have that information if we want to use it. It's probably easiest to add it in now rather than later on during the qc_report generation.

@jashapiro
Copy link
Member Author

jashapiro commented Aug 19, 2021

I was thinking along these lines too that adding the information of how many features are also found in the RNA to the metadata so we have that information if we want to use it. It's probably easiest to add it in now rather than later on during the qc_report generation.

Oh wait, will we have this info from the function that reads the alevin metadata?

@allyhawkins
Copy link
Member

Oh wait, will we have this info from the function that reads the alevin metadata?

We would have the metadata from both the original sce and the alt exp which will have the number of cells in it, so technically yes, we would have this information already.

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>
@jaclyn-taroni
Copy link
Member

Relevant to a conversation we’re having in Google Slides comments right now - should we break out the “what should go into the QC report” component discussed here into its own issue?

@jashapiro
Copy link
Member Author

Relevant to a conversation we’re having in Google Slides comments right now - should we break out the “what should go into the QC report” component discussed here into its own issue?

Yes, absolutely.

I think that discussion should be largely orthogonal to what is being discussed here. At this stage, we just need to decide what will be in the merged object, which should be a superset of what is going into the report (the report will also include things calculated from the intersection of the two). In the current implementation, all of the data we get for an RNA-seq run should be retained for its associated CITE-seq runs. I was planning to test that before merging the PR, but I expect that the metadata (which comes from alevin output stats, not directly from the data) should remain unchanged even after filtering the CITE-seq data to only cells found in the RNA-seq data.

@jashapiro
Copy link
Member Author

Confirmed that metadata is included when adding altexps. Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add function(s) for reading and adding CITE-seq/cellhash data to a SCE
3 participants