umi_tools count #114

TomSmithCGAT · 2017-05-05T12:41:40Z

For people doing per-gene scRNA-seq, the reads themselves are immaterial and in general only the count is of interest.
This tool would count unique UMIs for a given contig, cell and gene combination.
It doesn't have the same memory issues as per-gene dedup.

TomSmithCGAT · 2017-05-05T12:50:33Z

Comment above from @IanSudbery's project note.

Clarification:

Input would be BAM
Output would be a tsv with 3 columns: gene, cell, count
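
As a rough illustration of the proposed output, here is a minimal sketch that counts distinct UMIs per gene and cell and writes the three-column tsv. It assumes the (gene, cell, umi) triples have already been extracted from the BAM; the function names and input shape are hypothetical, and it only counts exact UMI matches, with no error-aware grouping.

```python
import csv
import io
from collections import defaultdict


def count_unique_umis(records):
    """Count distinct UMIs for each (gene, cell) pair.

    `records` is an iterable of (gene, cell, umi) tuples. This input
    shape is an assumption made for illustration only.
    """
    umis = defaultdict(set)
    for gene, cell, umi in records:
        umis[(gene, cell)].add(umi)
    return {key: len(value) for key, value in umis.items()}


def write_counts(counts, handle):
    """Write the gene, cell, count table as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "cell", "count"])
    for (gene, cell), n in sorted(counts.items()):
        writer.writerow([gene, cell, n])


records = [
    ("GeneA", "cell1", "ACGT"),
    ("GeneA", "cell1", "ACGT"),  # same UMI seen twice, counted once
    ("GeneA", "cell1", "TTTT"),
    ("GeneB", "cell2", "ACGT"),
]
counts = count_unique_umis(records)

buffer = io.StringIO()
write_counts(counts, buffer)
tsv_text = buffer.getvalue()
```

Only a set of UMIs per gene and cell is held in memory, never the reads themselves, which is where the memory saving over dedup would come from.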

Further considerations:

We should maintain the --per-contig, --gene-transcript-map and --gene-tag options so the user can align reads to the transcript or genome
I suggest we also support a flatfile input keyed by read id (as suggested by @MarinusVL in Dedup by per-gene #44), as this would allow the user to run e.g. featureCounts and pass the output directly to UMI-tools without having to add a gene tag to the BAM

TomSmithCGAT · 2017-05-08T14:07:30Z

@IanSudbery, could you review 5793e24? This is a first attempt to make a count command.

I've written a new simplified generator, umi_methods.get_gene_counts, to return the umi counts per gene, and rejigged the input for network.ReadClusterer so it will work from just umis + umi counts (group and count commands) or with bundles (dedup command).

On a related note, ReadClusterer has become a bit unwieldy. Currently it has a deduplicate switch to determine how far to proceed with the read processing. This allows us to return early for group and count. As it stands we return groups of umis for group and count and reads and associated stats objects for dedup. I think we should consider refactoring this functor so that it always returns groups. For dedup, we would then have a separate function to identify the reads and make the dedup reads derive directly from the groups to ensure the outputs from the commands always agree. We could also then handle all the stats collection in dedup.py. I'll re-work this now to show you what I mean in case that wasn't clear...

IanSudbery · 2017-05-08T14:56:45Z

The code_algos_in_c branch has already refactored along those lines. In this branch the UMICluster class takes a dictionary of UMIs and counts and clusters them into a list of lists. Our best guess at the representative UMI is the first item in each list. The class ReadCluster is now a wrapper around UMIClusterer that takes bundles of reads, converts them into a dictionary of UMI counts and calls UMICluster on them. If dedup is specified then it selects the representative reads to return. Have a look at network.py in the `code_algos_in_c` branch.
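
To make the "dictionary of UMI counts in, list of lists out" shape concrete, here is a minimal sketch. It is not the code in network.py: the function names are hypothetical, and the grouping rule (connected components of UMIs within a Hamming distance of 1) is a simple stand-in chosen for illustration. Each group is sorted by descending count so the first item is the best guess at the representative UMI.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(counts, threshold=1):
    """Group UMIs into connected components of near-identical sequences.

    `counts` maps UMI -> observed count. Returns a list of lists; within
    each list the UMIs are ordered by descending count, so element 0 is
    the candidate representative. Illustrative stand-in only.
    """
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(counts, key=lambda u: (-counts[u], u))

    # Build an adjacency map linking UMIs within `threshold` mismatches.
    neighbours = {u: set() for u in umis}
    for a, b in combinations(umis, 2):
        if hamming(a, b) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    clusters = []
    for umi in umis:
        if umi in seen:
            continue
        # Depth-first walk to collect everything reachable from `umi`.
        component = []
        stack = [umi]
        seen.add(umi)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        component.sort(key=lambda u: (-counts[u], u))
        clusters.append(component)
    return clusters


counts = {"ATAT": 10, "ATAA": 2, "GGGG": 5, "ATTA": 1}
clusters = cluster_umis(counts)
```

With this shape, a count-style command only needs `len(clusters)` per gene, while a dedup-style command would go on to pick one read for the first UMI in each list.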

TomSmithCGAT · 2017-05-08T15:20:15Z

Ah great. I must have seen this previously but forgotten!

I think we should look to merge this into master then rather than waiting until the network code has been recoded into a C library as it'll make the codebase much easier to maintain in the meantime. Are you happy for me to go through the code_algos_in_c branch and merge this into master?

TomSmithCGAT · 2017-05-09T09:57:07Z

The code_algos_in_c branch has now been merged into master.

I made a second branch to add the count command to simplify the process of merging the commits on that branch and the previous {TS}-Add_count_command branch as they covered a lot of the same code. The count command is now ready to be merged into master (I hope): #119

IanSudbery · 2017-05-09T10:43:45Z

Your add_count_command branches are still both using ReadCluster rather than UMICluster. Is this deliberate?

TomSmithCGAT · 2017-05-09T11:03:38Z

The previous add_count_command branch has been deleted. The current (add_count_command2) branch uses the new ReadClusterer (i.e. incorporating UMIClusterer) as per the group command. Sorry for any confusion about the branches. I set up this new branch from master, after the merge with code_algos_in_c, to avoid resolving complex conflicts and make my life easier!

TomSmithCGAT · 2017-05-09T11:06:25Z

Think I may have missed the point there. Are you suggesting just using the UMIClusterer for count without the ReadClusterer wrapper, since all we need is the number of groups, not the reads and umi counts? This does make sense.

IanSudbery · 2017-05-09T11:08:52Z

Yes, that was my point. I was just trying to work out where your read bundle came from at the moment in count, since you don’t remember reads.

IanSudbery · 2017-05-09T11:10:50Z

But of course we don't actually access the "read" part of the bundle unless deduplicate is true!

TomSmithCGAT · 2017-05-09T11:19:19Z

Yup. I didn't bother returning a bundle at all in the previous branch but I went back to it here to fit in with the changes in your branch. Your suggestion to just use UMIClusterer makes much more sense so I'll do that.

TomSmithCGAT · 2017-05-09T11:30:32Z

By the same logic, we don't really need to use the ReadClusterer for group either given we just need the clusters (groups). So ReadClusterer doesn't need the deduplicate switch and should be called something like ReadDeduplicator.

IanSudbery · 2017-05-09T11:38:00Z

I suppose what ReadCluster does is take the output of get_bundles and transform it into an input suitable for UMICluster. Removing support for group from ReadCluster would simplify it, as the deduplicate if statement would not be necessary. But the conversion code would then need to be duplicated in the group script. Swings and roundabouts.

TomSmithCGAT · 2017-05-09T11:56:53Z

Yeah I think the transformation of the bundles into UMIClusterer input is simple enough (2 lines to extract umis and umi counts) that it would be worth duplicating these 2 lines in the group script to simplify the ReadClusterer
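
For reference, the conversion being discussed might look roughly like this. The bundle layout used here (UMI mapped to a dict with "count" and "read" entries) is an assumption for illustration, and the helper name is hypothetical.

```python
def bundle_to_clusterer_input(bundle):
    """Extract the UMIs and their counts from a read bundle.

    Assumes `bundle` maps each UMI to a dict holding at least a
    "count" entry; the "read" entry is ignored here.
    """
    umis = list(bundle.keys())
    counts = {umi: bundle[umi]["count"] for umi in umis}
    return umis, counts


# Hypothetical bundle for one position or gene.
bundle = {
    "ACGT": {"count": 3, "read": "read_1"},
    "ACGA": {"count": 1, "read": "read_7"},
}
umis, counts = bundle_to_clusterer_input(bundle)
```

Since the body is just two statements, repeating it in the group script costs little compared with carrying a deduplicate switch through the clusterer.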

TomSmithCGAT · 2017-05-09T13:16:21Z

group and count now use UMIClusterer, and dedup uses ReadClusterer (renamed ReadDeduplicator). This is all ready to merge into master when the tests pass (#119)

TomSmithCGAT created this issue from a note in Version 0.5 (To Do) May 5, 2017

IanSudbery added the enhancement label May 5, 2017

IanSudbery added this to the 0.5 milestone May 5, 2017

TomSmithCGAT self-assigned this May 8, 2017

TomSmithCGAT moved this from To Do to Ready for review in Version 0.5 May 9, 2017

TomSmithCGAT mentioned this issue May 9, 2017

Dedup by per-gene #44

Closed

TomSmithCGAT moved this from Ready for review to Done in Version 0.5 May 9, 2017

TomSmithCGAT closed this as completed May 9, 2017
