Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

umi_tools count #114

Closed
TomSmithCGAT opened this issue May 5, 2017 · 15 comments
Closed

umi_tools count #114

TomSmithCGAT opened this issue May 5, 2017 · 15 comments
Assignees
Projects
Milestone

Comments

@TomSmithCGAT
Copy link
Member

  • For people doing per-gene scRNA-seq, the reads are immaterial and in general only count is of interest.
  • This tool would count unique UMIs for a given contig, cell, gene combo.
  • Doesn't have same memory issues as per_gene dedup
@TomSmithCGAT TomSmithCGAT created this issue from a note in Version 0.5 (To Do) May 5, 2017
@TomSmithCGAT
Copy link
Member Author

Comment above from @IanSudbery's project note.

Clarification:

  • Input would be BAM
  • Output would be a tsv with 3 columns: gene, cell, count

Further considerations:

  • We should maintain the --per-contig, --gene-transcript-map and --gene-tag options so the user can align reads to the transcript or genome
  • I suggest we also support a flatfile input with read id (as suggested by @MarinusVL Dedup by per-gene  #44) as this would allow the user to run e.g featureCounts and pass the output directly to UMI-tools without having to add a gene tag onto the BAM

@IanSudbery IanSudbery added this to the 0.5 milestone May 5, 2017
@TomSmithCGAT
Copy link
Member Author

TomSmithCGAT commented May 8, 2017

@IanSudbery. Could you review 5793e24. This is a first attempt to make a count command.

I've written a new simplified generator umi_methods.get_gene_counts to return the umi counts per gene and rejigged the input for network..ReadClusterer so it will work from just umis + umi counts (group and count commands) or with bundles (dedup command).

On a related note, ReadClusterer has become a bit unwieldy. Currently it has a deduplicate switch to determine how far to proceed with the read processing. This allows us to return early for group and count. As it stands we return groups of umis for group and count and reads and associated stats objects for dedup. I think we should consider refactoring this functor so that it always returns groups. For dedup, we would then have a separate function to identify the reads and make the dedup reads derive directly from the groups to ensure the outputs from the commands always agree. We could also then handle all the stats collection in dedup.py. I'll re-work this now to show you what I mean in case that wasn't clear...

@IanSudbery
Copy link
Member

IanSudbery commented May 8, 2017 via email

@TomSmithCGAT
Copy link
Member Author

Ah great. I must have seen this previously but forgotten!

I think we should look to merge this into master then rather than waiting until the network code has been recoded into a C library as it'll make the codebase much easier to maintain in the meantime. Are you happy for me to go through the code_algos_in_c branch and merge this into master?

@TomSmithCGAT TomSmithCGAT self-assigned this May 8, 2017
@TomSmithCGAT
Copy link
Member Author

The code_algos_in_c branch has now been merged into master.

I made a second branch to add the count command to simplify the process of merging the commits on that branch and the previous {TS}-Add_count_command branch as they covered a lot of the same code. The count command is now ready to be merged into master (I hope): #119

@TomSmithCGAT TomSmithCGAT moved this from To Do to Ready for review in Version 0.5 May 9, 2017
@IanSudbery
Copy link
Member

IanSudbery commented May 9, 2017 via email

@TomSmithCGAT
Copy link
Member Author

The previous add_count_command branch has been deleted. The current (add_count_command2) branch uses the new ReadClusterer (ie. incoportaing UMIClusterer) as per the group command. Sorry for any confusion about the branches. I set up this new branch from master, post the merge with code_algos_in_c to avoid resolving complex conflicts and make my life easier!

@TomSmithCGAT
Copy link
Member Author

Think I may have missed the point there. Are you suggesting just using the UMIClusterer for count without the ReadClusterer wrapper since all we need are the number of groups, not the reads and umi counts? This does make sense

@IanSudbery
Copy link
Member

IanSudbery commented May 9, 2017 via email

@IanSudbery
Copy link
Member

But of course you we don't actually access the "read" part of the bundle unless deduplicate is true!

@TomSmithCGAT
Copy link
Member Author

Yup. I didn't bother returning a bundle at all in the previous branch but I went back to it here to fit in with the changes in your branch. Your suggestion to just use UMIClusterer makes much more sense so I'll do that.

@TomSmithCGAT
Copy link
Member Author

TomSmithCGAT commented May 9, 2017

By the same logic, we don't really need to use the ReadClusterer for group either given we just need the clusters (groups). So ReadClusterer doesn't need the deduplicate switch and should be called something like ReadDeduplicator.

@IanSudbery
Copy link
Member

IanSudbery commented May 9, 2017 via email

@TomSmithCGAT
Copy link
Member Author

Yeah I think the transformation of the bundles into UMIClusterer input is simple enough (2 lines to extract umis and umi counts) that it would be worth duplicating these 2 lines in the group script to simplify the ReadClusterer

@TomSmithCGAT
Copy link
Member Author

group and count now use UMIClusterer and dedup uses ReadClusterer (renamed ReadDuplicator). This is all ready to merge into master when tests pass (#119)

@TomSmithCGAT TomSmithCGAT moved this from Ready for review to Done in Version 0.5 May 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

No branches or pull requests

2 participants