Add functions for merging in AltExperiments #31

jashapiro · 2021-08-17T22:02:26Z

closes #28

This PR adds a function for merging an alternative experiment (i.e. cellhash and CITE-seq data) into a SingleCellExperiment object. While the SingleCellExperiment::altExp() does most of the heavy lifting, it assumes that the altExp already has the same columns (cells) as the SCE it is being added to. As we process the two runs separately, this will not always be the case, so this function adds columns of zeros for cells that appear in the "base" SCE, but are not found in the altExp.

Since alevin output for these alternative experiments processed by alevin-fry is usually in matrix market format (or at least is in our workflow), I also modified read_alevin() to be handle mtx files even when not in USA mode. by adding separate mtx_format argument.

Questions for reviewers

I wrote some tests to be sure this works as expected when the data are as expected, but I was thinking I should throw some kind of warning if the base SCE and the AltExp don't have sufficient overlap in their cells, but I couldn't really decide the best way to do this.

One option was to define some kind of minimal overlap between the sets, but where to draw the threshold was a challenge, as I want this to work on unfiltered data.

I just had the idea that I could check for overlap using only the top X cells with highest total counts, as a very crude filter. If say 50% of those top cells have no corresponding feature data, that would definitely imply a mismatch.

I want to be pretty liberal, but still catch the case where the data may be mismatched.

We could (and probably should!) also check this at a later point, but some generic warning seems like it would be a good idea, if we have any ideas about where to put the threshold.

allyhawkins

I think this looks pretty good, and my only comment is about how to add in a check for overlap between CITE-seq and rna counts. In thinking about it more, I wonder if it's something that we should add as a warning in the QC report rather than here? I don't think there is a golden number for overlap, and I think low numbers of overlap can be a sign of a poor quality experiment, which someone might want to be aware of.

R/merge_altexp.R

jashapiro · 2021-08-19T17:37:23Z

I think this looks pretty good, and my only comment is about how to add in a check for overlap between CITE-seq and rna counts. In thinking about it more, I wonder if it's something that we should add as a warning in the QC report rather than here? I don't think there is a golden number for overlap, and I think low numbers of overlap can be a sign of a poor quality experiment, which someone might want to be aware of.

I definitely think it should show up in the QC report. I think my hope here was to catch obvious mispairings of data files, and I wasn't really thinking about just bad libraries. The 50% threshold is probably reasonable, though maybe I would set it even a bit lower if I just want mispairings: The chance of hitting even 10% overlap by chance is quite low (given millions of potential barcodes), so probably a lower threshold makes sense.... 20%?

Then in the QC report we should probably say something about what % of cells have features. At that point though, the # of features with RNA will be lost... unless we add it into the metadata, which we could do, if we know what we want to add.

allyhawkins · 2021-08-19T18:10:44Z

I think my hope here was to catch obvious mispairings of data files, and I wasn't really thinking about just bad libraries. The 50% threshold is probably reasonable, though maybe I would set it even a bit lower if I just want mispairings: The chance of hitting even 10% overlap by chance is quite low (given millions of potential barcodes), so probably a lower threshold makes sense.... 20%?

I see your point about catching mispairings of data files. I would say by chance there is going to be some overlap between barcodes so but probably nowhere near as high as 50%. I would agree that 20% seems reasonable there to me.

Then in the QC report we should probably say something about what % of cells have features. At that point though, the # of features with RNA will be lost... unless we add it into the metadata, which we could do, if we know what we want to add.

I was thinking along these lines too that adding the information of how many features are also found in the RNA to the metadata so we have that information if we want to use it. It's probably easiest to add it in now rather than later on during the qc_report generation.

jashapiro · 2021-08-19T18:13:14Z

I was thinking along these lines too that adding the information of how many features are also found in the RNA to the metadata so we have that information if we want to use it. It's probably easiest to add it in now rather than later on during the qc_report generation.

Oh wait, will we have this info from the function that reads the alevin metadata?

allyhawkins · 2021-08-19T19:39:41Z

Oh wait, will we have this info from the function that reads the alevin metadata?

We would have the metadata from both the original sce and the alt exp which will have the number of cells in it, so technically yes, we would have this information already.

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>

jaclyn-taroni · 2021-08-30T20:28:13Z

Relevant to a conversation we’re having in Google Slides comments right now - should we break out the “what should go into the QC report” component discussed here into its own issue?

jashapiro · 2021-08-30T20:57:56Z

Relevant to a conversation we’re having in Google Slides comments right now - should we break out the “what should go into the QC report” component discussed here into its own issue?

Yes, absolutely.

I think that discussion should be largely orthogonal to what is being discussed here. At this stage, we just need to decide what will be in the merged object, which should be a superset of what is going into the report (the report will also include things calculated from the intersection of the two). In the current implementation, all of the data we get for an RNA-seq run should be retained for its associated CITE-seq runs. I was planning to test that before merging the PR, but I expect that the metadata (which comes from alevin output stats, not directly from the data) should remain unchanged even after filtering the CITE-seq data to only cells found in the RNA-seq data.

jashapiro · 2021-08-31T14:32:47Z

Confirmed that metadata is included when adding altexps. Merging.

jashapiro added 7 commits August 17, 2021 11:11

Initial function for merging altexp

daa4391

Make mtx format an option outside usa_mode

a996bea

Change to take SCE as input

1b0227b

Add a couple checks and rename

7e51d3f

Add tests and fix errors revealed

a2ab1d7

update docs

b6d3989

importFrom S4Vectors and bug fix

e02e75e

jashapiro requested a review from allyhawkins August 17, 2021 22:02

allyhawkins approved these changes Aug 19, 2021

View reviewed changes

R/merge_altexp.R Show resolved Hide resolved

Update R/merge_altexp.R

758268e

Co-authored-by: Ally Hawkins <54039191+allyhawkins@users.noreply.github.com>

jashapiro merged commit a8e7f41 into main Aug 31, 2021

jashapiro deleted the jashapiro/add-altexp branch August 31, 2021 14:34

jashapiro mentioned this pull request Sep 2, 2021

Generate merged SCE objects with feature (cite-seq) data AlexsLemonade/scpca-nf#21

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add functions for merging in AltExperiments #31

Add functions for merging in AltExperiments #31

jashapiro commented Aug 17, 2021

allyhawkins left a comment

jashapiro commented Aug 19, 2021

allyhawkins commented Aug 19, 2021

jashapiro commented Aug 19, 2021 •

edited

allyhawkins commented Aug 19, 2021

jaclyn-taroni commented Aug 30, 2021

jashapiro commented Aug 30, 2021

jashapiro commented Aug 31, 2021

Add functions for merging in AltExperiments #31

Add functions for merging in AltExperiments #31

Conversation

jashapiro commented Aug 17, 2021

Questions for reviewers

allyhawkins left a comment

Choose a reason for hiding this comment

jashapiro commented Aug 19, 2021

allyhawkins commented Aug 19, 2021

jashapiro commented Aug 19, 2021 • edited

allyhawkins commented Aug 19, 2021

jaclyn-taroni commented Aug 30, 2021

jashapiro commented Aug 30, 2021

jashapiro commented Aug 31, 2021

jashapiro commented Aug 19, 2021 •

edited