Make dedup parallel #69

Closed
IanSudbery opened this issue Jan 21, 2017 · 21 comments

@IanSudbery
Member

Make dedup use more than one CPU.

First attempt at an implementation: see branch IS_parallel.

Need a large data file to benchmark.

cc: @TomSmithCGAT

@IanSudbery
Member Author

So on one test file (from a previous issue), dedup took 4 times longer with 4 CPUs.

This presumably means that the overhead of setting up 4 processes is higher than the gain from parallelising the edit_distance calculations when there are only a few UMIs at each position.

@TomSmithCGAT
Member

Yeah, that's a shame, but I think your approach of first checking how many UMIs there are at the position makes a lot of sense. This does mean we probably won't see anything like a linear improvement in run time from additional threads, though.

Have you had a look at pinning down how many UMIs you need for the additional threads to be useful? If not, I can look at this. We could then derive a sensible threshold for using the multi-threading for each call to dedup depending on the UMI length (and threads?).

@IanSudbery
Member Author

I don't really have a good file to test it on - I need something that has at least some really high depth positions.

We could probably get linear improvements if we managed to parallelise the whole for (....) in get_bundles loop in dedup.py, but that is way more work than my simple attempt at parallelising the call to edit_distance. I'm not sure we'll ever see an improvement at this level with a simple mp.Pool.map call, as I guess the call to edit_distance is so fast that it is completely swamped by the time taken to pickle the arguments.

The main issue with multiprocessing and the top-level loop is that pysam.AlignedSegment objects are not picklable. There might be a couple of ways around that, but it's going to need more thinking.
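One possible way around that (a rough sketch, not what the IS_parallel branch does) would be to strip each AlignedSegment down to plain, picklable Python types before handing work to a multiprocessing.Pool; the field choice, input file and worker below are purely illustrative:

```python
import multiprocessing as mp

import pysam


def read_to_tuple(read):
    # built-in types pickle cheaply; pysam.AlignedSegment objects cannot be pickled
    return (read.query_name, read.reference_start, read.is_reverse,
            read.query_sequence)


def dedup_bundle(reads):
    # placeholder for the real per-position deduplication work
    return len(reads)


if __name__ == "__main__":
    # hypothetical input file and windows, just to produce picklable bundles
    with pysam.AlignmentFile("test.bam", "rb") as bam:
        bundles = [[read_to_tuple(r) for r in bam.fetch("chr1", s, s + 100000)]
                   for s in range(0, 1000000, 100000)]
    pool = mp.Pool(4)
    print(pool.map(dedup_bundle, bundles))
    pool.close()
    pool.join()
```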

@TomSmithCGAT
Member

I guess the easiest way to get around any potential issue with pickling pysam.AlignedSegment would be to run one contig per thread? Not ideal, but there is still potential for a pretty good improvement, with the minimum run time being roughly the same as deduping just the biggest chromosome.

What do you think @AndreasHeger? Have you any experience with parallelising pysam?

@AndreasHeger
Member

I have not tried parallelism at the AlignedSegment level. However, it does seem to be an embarrassingly parallel problem, and I don't see why the parallelism needs to be at the contig level; could you not do 10Mb windows or similar?
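For what it's worth, a window-based split could look roughly like the sketch below (assumptions: an indexed BAM and a hypothetical dedup_region() standing in for the real per-region work). Each worker opens its own AlignmentFile and fetches its own window, so no AlignedSegment ever has to cross a process boundary; reads spanning window boundaries would still need careful handling.

```python
import multiprocessing as mp

import pysam

BAM = "test.bam"               # hypothetical input
WINDOW = 10 * 1000 * 1000      # ~10Mb windows, as suggested above


def dedup_region(region):
    contig, start, end = region
    # each worker opens its own file handle, so nothing needs to be pickled
    with pysam.AlignmentFile(BAM, "rb") as bam:
        n_reads = sum(1 for _ in bam.fetch(contig, start, end))  # stand-in for dedup work
    return contig, start, n_reads


def make_windows(bam_path):
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for contig, length in zip(bam.references, bam.lengths):
            for start in range(0, length, WINDOW):
                yield contig, start, min(start + WINDOW, length)


if __name__ == "__main__":
    pool = mp.Pool(4)
    for contig, start, n in pool.imap_unordered(dedup_region, make_windows(BAM)):
        print(contig, start, n)
    pool.close()
    pool.join()
```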

@IanSudbery
Member Author

IanSudbery commented Jan 23, 2017 via email

@IanSudbery
Member Author

Looks like we are barking up the wrong tree!
It seems that even on the file I'm working on, calls to edit_distance are not the limiting factor; _get_connected_components_adjacency is:

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      2/1    0.004    0.002   99.064   99.064 dedup.py:209(<module>)
        1    0.157    0.157   97.285   97.285 dedup.py:274(main)
    11427    0.062    0.000   86.838    0.008 network.py:368(__call__)
    11427   41.889    0.004   64.291    0.006 network.py:181(_get_connected_components_adjacency)
    30938    5.579    0.000   22.277    0.001 network.py:27(breadth_first_search)
    11427    0.030    0.000   21.672    0.002 network.py:150(_get_adj_list_directional)
    11571    0.014    0.000   21.607    0.002 {map}
    11505   12.860    0.001   21.580    0.002 network.py:44(distance_thresholded_list)
  2055213   14.996    0.000   14.996    0.000 {method 'difference_update' of 'set' objects}
    11428    3.264    0.000   10.166    0.001 umi_methods.py:172(get_bundles)
 85821942    8.720    0.000    8.720    0.000 {_dedup_umi.edit_distance}
   482686    1.160    0.000    3.462    0.000 umi_methods.py:144(get_read_position)
  6227531    1.499    0.000    1.499    0.000 {method 'update' of 'set' objects}
        1    0.000    0.000    1.266    1.266 network.py:10(<module>)
        1    0.000    0.000    1.265    1.265 pyximport.py:414(load_module)
        1    0.000    0.000    1.265    1.265 pyximport.py:200(load_module)
        1    0.000    0.000    1.265    1.265 pyximport.py:169(build_module)

Line profiling this, it seems that the membership test (node not in found) is the slowest part:

Timer unit: 1e-06 s

Total time: 64.391 s
File: /home/mb1ims/devel/UMI-tools/umi_tools/network.py
Function: _get_connected_components_adjacency at line 181

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   181                                               @profile
   182                                               def _get_connected_components_adjacency(self, umis, graph, counts):
   183                                                   ''' find the connected UMIs within an adjacency dictionary'''
   184                                           
   185     11427         9688      0.8      0.0          found = list()
   186     11427         6682      0.6      0.0          components = list()
   187                                           
   188     64199        97277      1.5      0.2          for node in sorted(graph, key=lambda x: counts[x], reverse=True):
   189     52772     41687934    790.0     64.7              if node not in found:
   190     30938     22486192    726.8     34.9                  component = breadth_first_search(node, graph)
   191     30938        77903      2.5      0.1                  found.extend(component)
   192     30938        20226      0.7      0.0                  components.append(component)
   193                                           
   194     11427         5080      0.4      0.0          return components

I feel like making found a set rather than a list might make things better...

@TomSmithCGAT
Member

Yeah, found really should be a set there.

Was this from applying UMI-tools dedup to a very high depth sample? I'm surprised to see the edit_distance function being used relatively few times.

@IanSudbery
Member Author

With a set instead of a list!

Total time: 22.2338 s
File: /home/mb1ims/devel/UMI-tools/umi_tools/network.py
Function: _get_connected_components_adjacency at line 180

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   180                                               @profile
   181                                               def _get_connected_components_adjacency(self, umis, graph, counts):
   182                                                   ''' find the connected UMIs within an adjacency dictionary'''
   183                                           
   184     11427         9306      0.8      0.0          found = set()
   185     11427         7926      0.7      0.0          components = list()
   186                                           
   187     64199        82103      1.3      0.4          for node in sorted(graph, key=lambda x: counts[x], reverse=True):
   188     52772        29248      0.6      0.1              if node not in found:
   189     30938     22008122    711.4     99.0                  component = breadth_first_search(node, graph)
   190     30938        72322      2.3      0.3                  found.update(component)
   191     30938        19310      0.6      0.1                  components.append(component)
   192                                           
   193     11427         5430      0.5      0.0          return components

This is with the file you suggested in your email.

@TomSmithCGAT
Member

Wow! That's great. And a bit embarrassing!

@IanSudbery
Member Author

BTW, thus far my calculations suggest that multiprocessing only becomes efficient at the edit_distance level when there are more than 1000 UMIs per site.

@TomSmithCGAT
Member

TomSmithCGAT commented Jan 25, 2017

Hi Ian. We can use a recursive function to find the network components for a small decrease in run time (~15%). See the TS-recursiveComponents branch. Note: I haven't checked the additional memory usage from the recursive function.

Using the non-recursive function

1967598 function calls (81959500 primitive calls) in 86.276 seconds

   Ordered by: internal time
   List reduced from 3870 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11427   40.849    0.004   58.025    0.005 umi_tools/network.py:170(_get_connected_components_adjacency)
    52772   14.808    0.000   17.999    0.000 umi_tools/network.py:159(<listcomp>)
  2055213   11.318    0.000   11.318    0.000 {method 'difference_update' of 'set' objects}
    30938    4.376    0.000   17.089    0.001 umi_tools/network.py:30(breadth_first_search)
 48075266    3.192    0.000    3.192    0.000 {built-in method _dedup_umi.edit_distance}
    11428    2.525    0.000    8.200    0.001 umi_tools/umi_methods.py:172(get_bundles)
  6227515    1.296    0.000    1.296    0.000 {method 'update' of 'set' objects}
   482686    0.986    0.000    2.501    0.000 umi_tools/umi_methods.py:144(get_read_position)
   482686    0.494    0.000    0.556    0.000 pysam/calignedsegment.pyx:910(__get__)
   482686    0.403    0.000    1.078    0.000 umi_tools/umi_methods.py:33(get_umi)

Using the recursive function:

1812978 function calls (79266792 primitive calls) in 73.210 seconds

   Ordered by: internal time
   List reduced from 3884 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11427   40.572    0.004   44.968    0.004 umi_tools/network.py:170(_get_connected_components_adjacency)
    52772   14.728    0.000   17.964    0.000 umi_tools/network.py:159(<listcomp>)
 48075266    3.236    0.000    3.236    0.000 {built-in method _dedup_umi.edit_distance}
2569026/30938    2.624    0.000    4.279    0.000 umi_tools/network.py:47(recursive_search)
    11428    2.572    0.000    8.360    0.001 umi_tools/umi_methods.py:172(get_bundles)
  2569026    1.165    0.000    1.165    0.000 umi_tools/network.py:49(<listcomp>)
   482686    1.006    0.000    2.561    0.000 umi_tools/umi_methods.py:144(get_read_position)
   482686    0.506    0.000    0.570    0.000 pysam/calignedsegment.pyx:910(__get__)
  5076176    0.490    0.000    0.490    0.000 {method 'update' of 'set' objects}
   482686    0.400    0.000    1.087    0.000 umi_tools/umi_methods.py:33(get_umi)
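For reference, a recursive component search along these lines might look roughly like the sketch below (the actual TS-recursiveComponents code may differ); graph is the adjacency dict mapping each UMI to its neighbours within the edit-distance threshold, and counts gives the read count per UMI. Note that very large components could also hit Python's default recursion limit.

```python
def recursive_search(node, graph, found):
    # depth-first walk collecting every UMI reachable from node
    found.add(node)
    component = [node]
    for next_node in graph[node]:
        if next_node not in found:
            component.extend(recursive_search(next_node, graph, found))
    return component


def get_connected_components(graph, counts):
    found = set()
    components = []
    # start from the most abundant UMIs, as in _get_connected_components_adjacency
    for node in sorted(graph, key=lambda x: counts[x], reverse=True):
        if node not in found:
            components.append(recursive_search(node, graph, found))
    return components
```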

@TomSmithCGAT
Member

That's a shame about the parallel processing at the edit_distance level. Given the edit distance is not the major bottleneck as you indicate above, this suggests to me that it's not really worth including parallel processing at this level.

@IanSudbery
Member Author

I have combined three optimisations into a new branch.

I ran tests with two different test files. The first used 5bp UMIs and has 3,320,209 reads at 38,779 positions, which de-duplicate down to 43,100 reads.

The second file used 10bp UMIs and has 482,686 reads at 11,427 positions, which de-duplicate down to 30,938 reads.

The first optimisation is using map rather than a list comprehension in the methods that build the networks. See commit 7ce24b9.

The second optimisation is to use a set rather than a list for the found object in _get_connected_components. See commit 147b4c5.

Finally, I merged in your recursive algorithm for the connected-components search.
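To illustrate the shape of Opt 1 (the real change is in commit 7ce24b9 and may differ), the idea is to pull the per-UMI comparison into a helper (distance_thresholded_list in the profile above) and apply it with map, rather than building everything in one big nested list comprehension. edit_distance here is a pure-Python stand-in for the Cython _dedup_umi.edit_distance:

```python
def edit_distance(a, b):
    # pure-Python stand-in for the Cython _dedup_umi.edit_distance
    return sum(x != y for x, y in zip(a, b))


def distance_thresholded_list(umi, umis, threshold=1):
    # neighbours of one UMI within the edit-distance threshold
    return [u for u in umis if u != umi and edit_distance(umi, u) <= threshold]


def get_adjacency(umis, threshold=1):
    # map-based form; the original used a single nested list comprehension
    neighbour_lists = map(lambda u: distance_thresholded_list(u, umis, threshold), umis)
    return dict(zip(umis, neighbour_lists))
```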

The results are below:

| File | Master | Opt 1 | Opt 1 + 2 | Opt 1, 2 + 3 |
| ---- | ------ | ----- | --------- | ------------ |
| 1    | 24s    | 24s   | 25s       | 26s          |
| 2    | 154s   | 74s   | 33s       | 16s          |

For the first file there is not much difference, but for the second file we have an almost 10-fold improvement with all three optimisations in place.

In File 1, the bottleneck is get_read_position and get_umi, followed by various I/O-limited functions.

For File 2, the bottleneck is now neither finding the connected components nor calculating the edit_distance matrix, but the list comprehension that wraps the edit_distance call.

This might be as good as such optimisations get.

@TomSmithCGAT
Member

TomSmithCGAT commented Jan 26, 2017

Thanks for properly performing the comparisons. That's great news about the improvements for the second file which I think is probably more representative of the type of samples which have been taking a long time. A 10-fold improvement is much more than I thought we'd get just from tinkering with the code.

I think we can squeeze out a further slight improvement by replacing the list comprehension with a call to itertools.combinations - see the TS-replaceComprehension branch. This reduces the number of times two UMIs are compared by ~2-fold. Would you mind merging in this branch too and checking the run times on those two files?
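Roughly, the itertools.combinations idea looks like the sketch below (the TS-replaceComprehension branch itself may differ): each unordered pair of UMIs is visited once, so edit_distance runs about half as often as with a full all-against-all comparison. The edit_distance stand-in and the directional count condition are illustrative.

```python
import itertools
from collections import defaultdict


def edit_distance(a, b):
    # pure-Python stand-in for the Cython _dedup_umi.edit_distance
    return sum(x != y for x, y in zip(a, b))


def get_adj_list_directional(umis, counts, threshold=1):
    adj_list = defaultdict(list)
    # combinations() yields each unordered pair exactly once: ~half the comparisons
    for umi1, umi2 in itertools.combinations(umis, 2):
        if edit_distance(umi1, umi2) <= threshold:
            # add an edge in whichever direction(s) the count condition allows
            if counts[umi1] >= 2 * counts[umi2] - 1:
                adj_list[umi1].append(umi2)
            if counts[umi2] >= 2 * counts[umi1] - 1:
                adj_list[umi2].append(umi1)
    return adj_list
```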

This branch also has a commit (25da7f4) to decode the UMIs to strings for outputting to flat files - previously tests were failing because b'ATATA' != 'ATATA'.

@IanSudbery
Member Author

So this brings it down to 12s. Pretty good going.

Looking at this has been instructive: my Opt 1 above replaced one of the list comprehensions with a map function, which halved the time spent in _get_adj_list_directional. Your itertools solution more or less prevents the use of map, but still performs better. At first I thought the difference would be that your solution was spending less time in edit_distance but more on overhead, while mine was doing the opposite, but that turned out not to be the case. In fact, all the improvement from my Opt 1 was due to the fact I'd left the encode calls out, which you've fixed now anyway!

@TomSmithCGAT
Member

Definitely some valuable lessons regarding profiling and code optimisation. I'm amazed that we can shave off so much of the run-time. Hopefully this will prevent any further issues relating to the time taken to run dedup. If there are still issues I guess we can try and implement parallelisation at the ClusterReducer level.

Are you good to merge in the changes to master? I'll update the version on PyPi and conda afterwards.

@IanSudbery
Member Author

IanSudbery commented Jan 26, 2017 via email

@TomSmithCGAT
Member

There's an issue with the recursive search. It produces a slightly different order of output with the group_cluster test (not an issue) and outputs a slightly different set of reads when I run dedup on the high depth sample I sent you (File 2 above?). I've not got to the bottom of the second issue so I thought it best just to leave the recursive search out for now and I'll return to this when I have time.

@IanSudbery
Member Author

Closing this for now, as we came to the conclusion that the overhead of parallelisation was greater than the reward.

@MilosCRF
Copy link

MilosCRF commented Apr 12, 2024

Hi,

I've been using umi_tools extract and umi_tools dedup in combination with GNU parallel for some time in my projects, which has significantly sped up the processing time. I finally managed to consolidate everything into a GitHub repository - UMI_parallel.

I'd be delighted if it proves helpful to others. Please feel free to provide any feedback or suggestions.

Cheers,
Milos
