umi_tools count #114
Comments
Comment above from @IanSudbery's project note. Clarification:
Further considerations:
|
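@IanSudbery. Could you review 5793e24? This is a first attempt to make a count command. I've written a new simplified generator umi_methods.get_gene_counts to return the umi counts per gene and rejigged the input for network.ReadClusterer so it will work from just umis + umi counts (group and count commands) or with bundles (dedup command). On a related note, ReadClusterer has become a bit unwieldy. Currently it has a deduplicate switch to determine how far to proceed with the read processing. This allows us to return early for group and count. As it stands we return groups of umis for group and count and reads and stats objects for dedup. I think we should consider refactoring this functor so that it always returns the groups. For dedup, we would then have a separate function to identify the read and make the dedup output derive from the groups to ensure the outputs from the commands always agree. We could also then handle all the stats collection in dedup. I'll re-work this now to show you what I mean in case that wasn't clear... |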
The code_algos_in_c branch has already been refactored along those lines. In
this branch the UMICluster class takes a dictionary of UMIs and counts and
clusters them into a list of lists. Our best guess at the representative
UMI is the first item in each list.
The class ReadCluster is now a wrapper around UMIClusterer that takes
bundles of reads, converts them into a dictionary of UMI counts and
calls UMICluster on them. If dedup is specified then it selects the
representative reads to return.
Have a look at network.py in the `code_algos_in_c` branch.
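To make the interface described above concrete, here is a minimal sketch of a clusterer that takes a dictionary of UMI counts and returns a list of lists, with the most abundant UMI first in each list. It is a toy stand-in, not the code in network.py: the names `hamming` and `cluster_umis`, the threshold of one mismatch, and the simple connected-components grouping are all assumptions made for illustration.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(counts, threshold=1):
    """Group UMIs into connected components (hypothetical helper).

    counts: dict mapping UMI -> number of reads observed.
    Returns a list of lists; each inner list is sorted by descending
    count, so its first item is the best guess at the representative UMI.
    """
    # Visit the most abundant UMIs first so the output order is deterministic.
    umis = sorted(counts, key=lambda u: (-counts[u], u))
    seen = set()
    clusters = []
    for seed in umis:
        if seed in seen:
            continue
        seen.add(seed)
        queue = deque([seed])
        component = []
        # Breadth-first search over UMIs within `threshold` mismatches.
        while queue:
            current = queue.popleft()
            component.append(current)
            for other in umis:
                if other not in seen and hamming(current, other) <= threshold:
                    seen.add(other)
                    queue.append(other)
        component.sort(key=lambda u: (-counts[u], u))
        clusters.append(component)
    return clusters


example = {"ATAT": 10, "ATAA": 2, "GGCC": 5, "GGCA": 1, "TTTT": 3}
groups = cluster_umis(example)
representatives = [group[0] for group in groups]
```

For count, only `len(groups)` is needed; for dedup, `representatives` identifies which UMI's read to keep.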
On Mon, 8 May 2017 at 15:07 Tom Smith wrote:
@IanSudbery. Could you review 5793e24?
This is a first attempt to make a count command.
I've written a new simplified generator umi_methods.get_gene_counts to
return the umi counts per gene and rejigged the input for
network.ReadClusterer so it will work from just umis + umi counts (group
and count commands) or with bundles (dedup command).
On a related note, ReadClusterer has become a bit unwieldy. Currently it
has a deduplicate switch to determine how far to proceed with the read
processing. This allows us to return early for group and count. As it
stands we return groups of umis for group and count and reads and stats
objects for dedup. I think we should consider refactoring this functor so
that it always returns the groups. For dedup, we would then have a separate
function to identify the read and make the dedup output derive from the
groups to ensure the outputs from the commands always agree. We could also
then handle all the stats collection in dedup. I'll re-work this now to
show you what I mean in case that wasn't clear...
|
Ah great. I must have seen this previously but forgotten! I think we should look to merge this into master then rather than waiting until the network code has been recoded into a C library as it'll make the codebase much easier to maintain in the meantime. Are you happy for me to go through the code_algos_in_c branch and merge this into master? |
The code_algos_in_c branch has now been merged into master. I made a second branch to add the count command to simplify the process of merging the commits on that branch and the previous {TS}-Add_count_command branch as they covered a lot of the same code. The count command is now ready to be merged into master (I hope): #119 |
Your add_count_command branches are still both using ReadCluster rather
than UMICluster. Is this deliberate?
|
The previous |
Think I may have missed the point there. Are you suggesting just using the UMIClusterer for count without the ReadClusterer wrapper, since all we need are the number of groups, not the reads and umi counts? This does make sense. |
Yes, that was my point. I was just trying to work out where your read
bundle came from at the moment in count, since you don’t remember reads.
|
But of course we don't actually access the "read" part of the bundle unless deduplicate is true! |
Yup. I didn't bother returning a bundle at all in the previous branch but I went back to it here to fit in with the changes in your branch. Your suggestion to just use UMIClusterer makes much more sense so I'll do that. |
By the same logic, we don't really need to use the |
I suppose what ReadCluster does is take the output of get_bundles and
transform it into an input suitable for UMICluster.
Removing support for group from ReadCluster would simplify it as the
deduplicate if statement would not be necessary. But the conversion code
would then need to be duplicated in the group script.
Swings and roundabouts.
|
Yeah I think the transformation of the bundles into UMIClusterer input is simple enough (2 lines to extract umis and umi counts) that it would be worth duplicating these 2 lines in the group script to simplify the ReadClusterer |
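A sketch of the two-line conversion being discussed, under the assumption that a bundle is a dictionary keyed by UMI whose values hold a "count" and a "read". That layout, the helper names, and the trivial `exact_clusterer` stand-in are illustrative only and are not taken from the branch.

```python
def umi_counts_from_bundle(bundle):
    """Extract the clusterer input from a bundle (assumed layout:
    bundle[umi] = {"count": int, "read": object})."""
    umis = list(bundle.keys())
    counts = {umi: bundle[umi]["count"] for umi in umis}
    return umis, counts


def exact_clusterer(umis, counts):
    """Stand-in clusterer: every distinct UMI is its own group."""
    return [[umi] for umi in sorted(umis, key=lambda u: (-counts[u], u))]


def count_groups(bundle, clusterer):
    """Number of UMI groups in a bundle; the reads are never touched."""
    umis, counts = umi_counts_from_bundle(bundle)
    return len(clusterer(umis, counts))


bundle = {
    "ATAT": {"count": 3, "read": "read_1"},
    "GGCC": {"count": 1, "read": "read_2"},
}
umis, counts = umi_counts_from_bundle(bundle)
n_groups = count_groups(bundle, exact_clusterer)
```

Because the conversion is this small, repeating it in the group script costs little, and it never touches the "read" entry.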
group and count now use UMIClusterer and dedup uses ReadClusterer (renamed ReadDuplicator). This is all ready to merge into master when tests pass (#119) |
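One way to picture the resulting split: the clusterer always returns groups, and a separate dedup step derives its output from those groups. The sketch below is hypothetical. The function name `reads_from_groups` and the bundle layout (UMI mapped to a dict with "count" and "read") are assumptions, not the ReadDuplicator implementation.

```python
def reads_from_groups(bundle, groups):
    """Derive dedup output from UMI groups (hypothetical helper).

    For each group, keep the read stored under the group's first
    (representative) UMI and report the group's total read count.
    """
    selected = []
    for group in groups:
        representative = group[0]
        total = sum(bundle[umi]["count"] for umi in group)
        selected.append((bundle[representative]["read"], total))
    return selected


bundle = {
    "ATAT": {"count": 10, "read": "read_a"},
    "ATAA": {"count": 2, "read": "read_b"},
    "GGCC": {"count": 5, "read": "read_c"},
}
groups = [["ATAT", "ATAA"], ["GGCC"]]
deduplicated = reads_from_groups(bundle, groups)
```

Since group, count and dedup would all start from the same `groups`, their outputs cannot disagree about how the UMIs were clustered.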