Make dedup parallel #69

Closed
IanSudbery opened this issue Jan 21, 2017 · 21 comments

@IanSudbery
Member

Make dedup use more than one CPU.

First attempt at an implementation: see branch IS_parallel.

Need a large data file to benchmark.

cc: @TomSmithCGAT

@IanSudbery
Member Author

So on one test file (from a previous issue), dedup took 4 times longer with 4 CPUs.

This presumably means that the overhead of setting up 4 processes is higher than the gain from parallelising the edit_distance calculations when there are only a few UMIs at each position.

@TomSmithCGAT
Member

Yeah, that's a shame, but I think your approach of first checking how many UMIs there are at the position makes a lot of sense. This does mean we probably won't see anything like a linear improvement in run time from additional threads, though.

Have you had a look at pinning down how many UMIs you need for the additional threads to be useful? If not, I can look at this. We could then derive a sensible threshold for using the multi-threading for each call to dedup depending on the UMI length (and threads?).

@IanSudbery
Member Author

I don't really have a good file to test it on - I need something that has at least some really high depth positions.

We could probably get linear improvements if we managed to parallelise the whole for (....) in get_bundles loop in dedup.py, but that is way more work than my simple attempt at parallelising the call to edit_distance. I'm not sure we'll ever see an improvement at this level with a simple mp.Pool.map call, as I guess the call to edit_distance is so fast that it is completely swamped by the time taken to pickle the arguments.

The main issue with multiprocessing and the top-level loop is that pysam.AlignedSegment objects are not picklable. There might be a couple of ways around that, but it's going to need more thinking.
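One possible way around that (a rough sketch, not what the IS_parallel branch does) would be to strip each AlignedSegment down to plain, picklable Python types before handing work to a multiprocessing.Pool; the field choice, input file and worker below are purely illustrative:

```python
import multiprocessing as mp

import pysam


def read_to_tuple(read):
    # built-in types pickle cheaply; pysam.AlignedSegment objects cannot be pickled
    return (read.query_name, read.reference_start, read.is_reverse,
            read.query_sequence)


def dedup_bundle(reads):
    # placeholder for the real per-position deduplication work
    return len(reads)


if __name__ == "__main__":
    # hypothetical input file and windows, just to produce picklable bundles
    with pysam.AlignmentFile("test.bam", "rb") as bam:
        bundles = [[read_to_tuple(r) for r in bam.fetch("chr1", s, s + 100000)]
                   for s in range(0, 1000000, 100000)]
    pool = mp.Pool(4)
    print(pool.map(dedup_bundle, bundles))
    pool.close()
    pool.join()
```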

@TomSmithCGAT
Member

I guess the easiest way to get around any potential issue with pickling pysam.AlignedSegment would be to run one contig per thread? Not ideal, but there is still potential for a pretty good improvement, with the minimum run time being roughly the same as deduping just the biggest chromosome.

What do you think @AndreasHeger? Have you any experience with parallelising pysam?

@AndreasHeger
Member

I have not tried parallelism at the AlignedSegment level. However, it does seem to be an embarrassingly parallel problem, and I don't see why the parallelism needs to be at the contig level; could you not do 10Mb windows or similar?
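For what it's worth, a window-based split could look roughly like the sketch below (assumptions: an indexed BAM and a hypothetical dedup_region() standing in for the real per-region work). Each worker opens its own AlignmentFile and fetches its own window, so no AlignedSegment ever has to cross a process boundary; reads spanning window boundaries would still need careful handling.

```python
import multiprocessing as mp

import pysam

BAM = "test.bam"               # hypothetical input
WINDOW = 10 * 1000 * 1000      # ~10Mb windows, as suggested above


def dedup_region(region):
    contig, start, end = region
    # each worker opens its own file handle, so nothing needs to be pickled
    with pysam.AlignmentFile(BAM, "rb") as bam:
        n_reads = sum(1 for _ in bam.fetch(contig, start, end))  # stand-in for dedup work
    return contig, start, n_reads


def make_windows(bam_path):
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for contig, length in zip(bam.references, bam.lengths):
            for start in range(0, length, WINDOW):
                yield contig, start, min(start + WINDOW, length)


if __name__ == "__main__":
    pool = mp.Pool(4)
    for contig, start, n in pool.imap_unordered(dedup_region, make_windows(BAM)):
        print(contig, start, n)
    pool.close()
    pool.join()
```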

@IanSudbery
Member Author

IanSudbery commented Jan 23, 2017 via email

@IanSudbery
Member Author

Looks like we are barking up the wrong tree!
It seems that even on the file I'm working on, calls to edit_distance are not the limiting factor; _get_connected_components_adjacency is:

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      2/1    0.004    0.002   99.064   99.064 dedup.py:209(<module>)
        1    0.157    0.157   97.285   97.285 dedup.py:274(main)
    11427    0.062    0.000   86.838    0.008 network.py:368(__call__)
    11427   41.889    0.004   64.291    0.006 network.py:181(_get_connected_components_adjacency)
    30938    5.579    0.000   22.277    0.001 network.py:27(breadth_first_search)
    11427    0.030    0.000   21.672    0.002 network.py:150(_get_adj_list_directional)
    11571    0.014    0.000   21.607    0.002 {map}
    11505   12.860    0.001   21.580    0.002 network.py:44(distance_thresholded_list)
  2055213   14.996    0.000   14.996    0.000 {method 'difference_update' of 'set' objects}
    11428    3.264    0.000   10.166    0.001 umi_methods.py:172(get_bundles)
 85821942    8.720    0.000    8.720    0.000 {_dedup_umi.edit_distance}
   482686    1.160    0.000    3.462    0.000 umi_methods.py:144(get_read_position)
  6227531    1.499    0.000    1.499    0.000 {method 'update' of 'set' objects}
        1    0.000    0.000    1.266    1.266 network.py:10(<module>)
        1    0.000    0.000    1.265    1.265 pyximport.py:414(load_module)
        1    0.000    0.000    1.265    1.265 pyximport.py:200(load_module)
        1    0.000    0.000    1.265    1.265 pyximport.py:169(build_module)

Line profiling this, it seems that the membership test (node not in found) is the slowest part:

Timer unit: 1e-06 s

Total time: 64.391 s
File: /home/mb1ims/devel/UMI-tools/umi_tools/network.py
Function: _get_connected_components_adjacency at line 181

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   181                                               @profile
   182                                               def _get_connected_components_adjacency(self, umis, graph, counts):
   183                                                   ''' find the connected UMIs within an adjacency dictionary'''
   184                                           
   185     11427         9688      0.8      0.0          found = list()
   186     11427         6682      0.6      0.0          components = list()
   187                                           
   188     64199        97277      1.5      0.2          for node in sorted(graph, key=lambda x: counts[x], reverse=True):
   189     52772     41687934    790.0     64.7              if node not in found:
   190     30938     22486192    726.8     34.9                  component = breadth_first_search(node, graph)
   191     30938        77903      2.5      0.1                  found.extend(component)
   192     30938        20226      0.7      0.0                  components.append(component)
   193                                           
   194     11427         5080      0.4      0.0          return components

I feel like making found a set rather than a list might make things better...

@TomSmithCGAT
Member

Yeah, found really should be a set there.

Was this from applying UMI-tools dedup to a very high depth sample? I'm surprised to see the edit_distance function being used relatively few times.

@IanSudbery
Member Author

With a set instead of a list!

Total time: 22.2338 s
File: /home/mb1ims/devel/UMI-tools/umi_tools/network.py
Function: _get_connected_components_adjacency at line 180

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   180                                               @profile
   181                                               def _get_connected_components_adjacency(self, umis, graph, counts):
   182                                                   ''' find the connected UMIs within an adjacency dictionary'''
   183                                           
   184     11427         9306      0.8      0.0          found = set()
   185     11427         7926      0.7      0.0          components = list()
   186                                           
   187     64199        82103      1.3      0.4          for node in sorted(graph, key=lambda x: counts[x], reverse=True):
   188     52772        29248      0.6      0.1              if node not in found:
   189     30938     22008122    711.4     99.0                  component = breadth_first_search(node, graph)
   190     30938        72322      2.3      0.3                  found.update(component)
   191     30938        19310      0.6      0.1                  components.append(component)
   192                                           
   193     11427         5430      0.5      0.0          return components

This is with the file you suggested in your email.

@TomSmithCGAT
Member

Wow! That's great. And a bit embarrassing!

@IanSudbery
Member Author

BTW, thus far my calculations suggest that multiprocessing only becomes efficient at the edit_distance level when there are more than 1000 UMIs per site.

@TomSmithCGAT
Member

TomSmithCGAT commented Jan 25, 2017

Hi Ian. We can use a recursive function to find the network components for a small decrease in run time (~15%). See the TS-recursiveComponents branch. Note: I haven't checked the additional memory usage from the recursive function.

Using the non-recursive function

1967598 function calls (81959500 primitive calls) in 86.276 seconds

   Ordered by: internal time
   List reduced from 3870 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11427   40.849    0.004   58.025    0.005 umi_tools/network.py:170(_get_connected_components_adjacency)
    52772   14.808    0.000   17.999    0.000 umi_tools/network.py:159(<listcomp>)
  2055213   11.318    0.000   11.318    0.000 {method 'difference_update' of 'set' objects}
    30938    4.376    0.000   17.089    0.001 umi_tools/network.py:30(breadth_first_search)
 48075266    3.192    0.000    3.192    0.000 {built-in method _dedup_umi.edit_distance}
    11428    2.525    0.000    8.200    0.001 umi_tools/umi_methods.py:172(get_bundles)
  6227515    1.296    0.000    1.296    0.000 {method 'update' of 'set' objects}
   482686    0.986    0.000    2.501    0.000 umi_tools/umi_methods.py:144(get_read_position)
   482686    0.494    0.000    0.556    0.000 pysam/calignedsegment.pyx:910(__get__)
   482686    0.403    0.000    1.078    0.000 umi_tools/umi_methods.py:33(get_umi)

Using the recursive function:

1812978 function calls (79266792 primitive calls) in 73.210 seconds

   Ordered by: internal time
   List reduced from 3884 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11427   40.572    0.004   44.968    0.004 umi_tools/network.py:170(_get_connected_components_adjacency)
    52772   14.728    0.000   17.964    0.000 umi_tools/network.py:159(<listcomp>)
 48075266    3.236    0.000    3.236    0.000 {built-in method _dedup_umi.edit_distance}
2569026/30938    2.624    0.000    4.279    0.000 umi_tools/network.py:47(recursive_search)
    11428    2.572    0.000    8.360    0.001 umi_tools/umi_methods.py:172(get_bundles)
  2569026    1.165    0.000    1.165    0.000 umi_tools/network.py:49(<listcomp>)
   482686    1.006    0.000    2.561    0.000 umi_tools/umi_methods.py:144(get_read_position)
   482686    0.506    0.000    0.570    0.000 pysam/calignedsegment.pyx:910(__get__)
  5076176    0.490    0.000    0.490    0.000 {method 'update' of 'set' objects}
   482686    0.400    0.000    1.087    0.000 umi_tools/umi_methods.py:33(get_umi)
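For reference, a recursive component search along these lines might look roughly like the sketch below (the actual TS-recursiveComponents code may differ); graph is the adjacency dict mapping each UMI to its neighbours within the edit-distance threshold, and counts gives the read count per UMI. Note that very large components could also hit Python's default recursion limit.

```python
def recursive_search(node, graph, found):
    # depth-first walk collecting every UMI reachable from node
    found.add(node)
    component = [node]
    for next_node in graph[node]:
        if next_node not in found:
            component.extend(recursive_search(next_node, graph, found))
    return component


def get_connected_components(graph, counts):
    found = set()
    components = []
    # start from the most abundant UMIs, as in _get_connected_components_adjacency
    for node in sorted(graph, key=lambda x: counts[x], reverse=True):
        if node not in found:
            components.append(recursive_search(node, graph, found))
    return components
```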

@TomSmithCGAT
Member

That's a shame about the parallel processing at the edit_distance level. Given the edit distance is not the major bottleneck as you indicate above, this suggests to me that it's not really worth including parallel processing at this level.

@IanSudbery
Member Author

I have combined three optimisations into a new branch.

I ran tests with two different test files. The first used 5bp UMIs and has 3,320,209 reads at 38,779 positions, which de-duplicate down to 43,100 reads.

The second file used 10bp UMIs and has 482,686 reads at 11,427 positions, which de-duplicate down to 30,938 reads.

The first optimisation is using map rather than a list comprehension in the methods that build the networks. See commit 7ce24b9.

The second optimisation is to use a set rather than a list for the found object in _get_connected_components. See commit 147b4c5.

Finally, I merged in your recursive algorithm for the connected-components search.
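To illustrate the shape of Opt 1 (the real change is in commit 7ce24b9 and may differ), the idea is to pull the per-UMI comparison into a helper (distance_thresholded_list in the profile above) and apply it with map, rather than building everything in one big nested list comprehension. edit_distance here is a pure-Python stand-in for the Cython _dedup_umi.edit_distance:

```python
def edit_distance(a, b):
    # pure-Python stand-in for the Cython _dedup_umi.edit_distance
    return sum(x != y for x, y in zip(a, b))


def distance_thresholded_list(umi, umis, threshold=1):
    # neighbours of one UMI within the edit-distance threshold
    return [u for u in umis if u != umi and edit_distance(umi, u) <= threshold]


def get_adjacency(umis, threshold=1):
    # map-based form; the original used a single nested list comprehension
    neighbour_lists = map(lambda u: distance_thresholded_list(u, umis, threshold), umis)
    return dict(zip(umis, neighbour_lists))
```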

The results are below:

| File | Master | Opt 1 | Opt 1 + 2 | Opt 1, 2 + 3 |
| ---- | ------ | ----- | --------- | ------------ |
| 1    | 24s    | 24s   | 25s       | 26s          |
| 2    | 154s   | 74s   | 33s       | 16s          |

For the first file there is not much difference, but for the second file we have an almost 10-fold improvement with all three optimisations in place.

In File 1, the bottleneck is get_read_position and get_umi, followed by various I/O-limited functions.

For File 2, the bottleneck is now neither finding the connected components nor calculating the edit_distance matrix, but the list comprehension that wraps the edit_distance call.

This might be as good as such optimisations get.

@TomSmithCGAT
Member

TomSmithCGAT commented Jan 26, 2017

Thanks for properly performing the comparisons. That's great news about the improvements for the second file which I think is probably more representative of the type of samples which have been taking a long time. A 10-fold improvement is much more than I thought we'd get just from tinkering with the code.

I think we can squeeze out a further slight improvement by replacing the list comprehension with a call to itertools.combinations - see the TS-replaceComprehension branch. This reduces the number of times two UMIs are compared by ~2-fold. Would you mind merging in this branch too and checking the run times on those two files?
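Roughly, the itertools.combinations idea looks like the sketch below (the TS-replaceComprehension branch itself may differ): each unordered pair of UMIs is visited once, so edit_distance runs about half as often as with a full all-against-all comparison. The edit_distance stand-in and the directional count condition are illustrative.

```python
import itertools
from collections import defaultdict


def edit_distance(a, b):
    # pure-Python stand-in for the Cython _dedup_umi.edit_distance
    return sum(x != y for x, y in zip(a, b))


def get_adj_list_directional(umis, counts, threshold=1):
    adj_list = defaultdict(list)
    # combinations() yields each unordered pair exactly once: ~half the comparisons
    for umi1, umi2 in itertools.combinations(umis, 2):
        if edit_distance(umi1, umi2) <= threshold:
            # add an edge in whichever direction(s) the count condition allows
            if counts[umi1] >= 2 * counts[umi2] - 1:
                adj_list[umi1].append(umi2)
            if counts[umi2] >= 2 * counts[umi1] - 1:
                adj_list[umi2].append(umi1)
    return adj_list
```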

This branch also has a commit (25da7f4) to decode the UMIs to strings for outputting to flat files - previously tests were failing because b'ATATA' != 'ATATA'.

@IanSudbery
Member Author

So this brings it down to 12s. Pretty good going.

Looking at this has been instructive: my Opt 1 above replaced one of the list comprehensions with a map function, which halved the time spent in _get_adj_list_directional. Your itertools solution more or less prevents the use of map, but still performs better. At first I thought the difference would be that your solution was spending less time in edit_distance but more on overhead, while mine was doing the opposite, but that turned out not to be the case. In fact, all the improvement from my Opt 1 was due to the fact I'd left the encode calls out, which you've fixed now anyway!

@TomSmithCGAT
Member

Definitely some valuable lessons regarding profiling and code optimisation. I'm amazed that we can shave off so much of the run-time. Hopefully this will prevent any further issues relating to the time taken to run dedup. If there are still issues I guess we can try and implement parallelisation at the ClusterReducer level.

Are you good to merge in the changes to master? I'll update the version on PyPi and conda afterwards.

@IanSudbery
Member Author

IanSudbery commented Jan 26, 2017 via email

@TomSmithCGAT
Member

There's an issue with the recursive search. It produces a slightly different order of output with the group_cluster test (not an issue) and outputs a slightly different set of reads when I run dedup on the high depth sample I sent you (File 2 above?). I've not got to the bottom of the second issue so I thought it best just to leave the recursive search out for now and I'll return to this when I have time.

@IanSudbery
Member Author

Closing this for now, as we came to the conclusion that the overhead of parallelisation was greater than the reward.

@MilosCRF
Copy link

MilosCRF commented Apr 12, 2024

Hi,

I've been using umi_tools extract and umi_tools dedup in combination with GNU parallel for some time in my projects, which has significantly sped up the processing time. I finally managed to consolidate everything into a GitHub repository - UMI_parallel.

I'd be delighted if it proves helpful to others. Please feel free to provide any feedback or suggestions.

Cheers,
Milos
