Easy inverse index array #3387 by SanjanaSogimatt · Pull Request #3433 · MDAnalysis/mdanalysis

SanjanaSogimatt · 2021-10-10T10:08:34Z

Fixes #

Changes made in this Pull Request:

Functions used to fix the bug
There are two methods used and one is commented out

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

pep8speaks · 2021-10-10T10:08:36Z

Hello @SanjanaSogimatt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file Issue.py:

Line 1:1: E265 block comment should start with '# '
Line 4:1: E265 block comment should start with '# '
Line 5:5: E225 missing whitespace around operator
Line 6:1: E302 expected 2 blank lines, found 0
Line 7:1: W293 blank line contains whitespace
Line 8:8: E225 missing whitespace around operator
Line 10:1: E305 expected 2 blank lines after class or function definition, found 0
Line 10:1: E741 ambiguous variable name 'l'
Line 10:2: E225 missing whitespace around operator
Line 13:1: E265 block comment should start with '# '
Line 14:1: E302 expected 2 blank lines, found 1
Line 15:8: E225 missing whitespace around operator
Line 15:11: E231 missing whitespace after ','
Line 15:13: E231 missing whitespace after ','
Line 15:15: E231 missing whitespace after ','
Line 15:17: E231 missing whitespace after ','
Line 15:20: E231 missing whitespace after ','
Line 16:11: E225 missing whitespace around operator
Line 18:1: W293 blank line contains whitespace
Line 19:1: W293 blank line contains whitespace
Line 19:1: W391 blank line at end of file

Comment last updated at 2021-10-10 13:21:51 UTC

codecov · 2021-10-10T10:29:24Z

Codecov Report

Merging #3433 (f959caa) into develop (149eb50) will decrease coverage by 3.13%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #3433      +/-   ##
===========================================
- Coverage    92.92%   89.79%   -3.14%     
===========================================
  Files          172      167       -5     
  Lines        22380    21786     -594     
  Branches      3302        0    -3302     
===========================================
- Hits         20797    19563    -1234     
- Misses        1030     2223    +1193     
+ Partials       553        0     -553

Impacted Files	Coverage Δ
package/MDAnalysis/converters/RDKitParser.py	`14.51% <0.00%> (-82.24%)`	⬇️
package/MDAnalysis/converters/RDKit.py	`16.98% <0.00%> (-81.08%)`	⬇️
package/MDAnalysis/analysis/hole2/hole.py	`14.47% <0.00%> (-59.80%)`	⬇️
package/MDAnalysis/converters/OpenMM.py	`37.68% <0.00%> (-59.34%)`	⬇️
package/MDAnalysis/converters/OpenMMParser.py	`42.85% <0.00%> (-57.15%)`	⬇️
package/MDAnalysis/analysis/hole2/utils.py	`25.78% <0.00%> (-49.06%)`	⬇️
package/MDAnalysis/topology/guessers.py	`92.24% <0.00%> (-6.98%)`	⬇️
package/MDAnalysis/core/selection.py	`93.22% <0.00%> (-5.41%)`	⬇️
package/MDAnalysis/analysis/psa.py	`79.14% <0.00%> (-4.98%)`	⬇️
package/MDAnalysis/lib/NeighborSearch.py	`96.42% <0.00%> (-3.58%)`	⬇️
... and 113 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 149eb50...f959caa. Read the comment docs.

Fixed errors.

SanjanaSogimatt · 2021-10-10T13:45:34Z

Hello, I wanted to know how to improve my code.

tylerjereddy

This doesn't reference any MDAnalysis issue, or adjust any MDAnalysis source code, so I'm going to close it and suggest browsing our issues/docs if you'd like to contribute.

If you'd like feedback on improving your generic Python code you might try something like StackOverflow.

lilyminium · 2021-10-11T14:33:25Z

Hi @SanjanaSogimatt,

Thanks for having a look at this issue and for linking it in the title. As Tyler points out, it would have been more beneficial to add the issue number after the Fixes # line as that enables GitHub to link the pull request like in the screenshot below -- this makes it much easier for reviewers to track the discussion, especially in a large project like MDAnalysis.

Issue #3387 describes how PR #3368 adds this code to the library:

mdanalysis/package/MDAnalysis/core/groups.py

Lines 811 to 815 in 149eb50

    
           mask = np.zeros_like(self.ix) 
        
           for i, x in enumerate(indices): 
        
               values = np.where(self.ix == x)[0] 
        
               mask[values] = i 
        
           self._unique_restore_mask = mask

What this code does is create an inverse array, akin to what you get below (where np stands for numpy):

sorted_unique_array, inverse_array = np.unique(my_list, return_inverse=True)

What this inverse_array does is:

sorted_unique_array[inverse_array] == my_list

This is not the same thing as reversing an array.

The reason I couldn't use the easy np.unique command is because np.unique sorts the output sorted_unique_array. If you go to the groups.py file by clicking on the link, you'll see that the indices that are used in the function are specifically unsorted, but unique.

mdanalysis/package/MDAnalysis/core/groups.py

Line 809 in 149eb50

indices = unique_int_1d_unsorted(self.ix)

So for example:

[ 1, 5, 3, 3, 6,] is unsorted and not unique
[ 1, 3, 3, 5, 6,] is sorted and not unique
[ 1, 5, 3, 6,] is unsorted and unique
[ 1, 3, 5, 6,] is sorted and unique

So what the original code in #3387 does is create an inverse array for the third scenario there, i.e. [ 1, 5, 3, 6,]. The ideal goal is that if [ 1, 5, 3, 3, 6,] is the input, then

np.array([ 1, 5, 3, 6,])[inverse_array] == [ 1, 5, 3, 3, 6,].

However, I think that the original code in #3387 is clumsy and slow. The goal is to rewrite it in Cython to make it faster. An example of a similar function would be here:

mdanalysis/package/MDAnalysis/lib/_cutil.pyx

Lines 51 to 91 in 149eb50

    
           @cython.boundscheck(False) # turn off bounds-checking for entire function 
        
           @cython.wraparound(False)  # turn off negative index wrapping for entire function 
        
           def unique_int_1d(np.intp_t[:] values): 
        
               """Find the unique elements of a 1D array of integers. 
        
               This function is optimal on sorted arrays. 
        
               Parameters 
        
               ---------- 
        
               values: numpy.ndarray 
        
                   1D array of dtype ``numpy.int64`` in which to find the unique values. 
        
               Returns 
        
               ------- 
        
               numpy.ndarray 
        
                   A deduplicated copy of `values`. 
        
               .. versionadded:: 0.19.0 
        
               """ 
        
               cdef bint is_monotonic = True 
        
               cdef int i = 0 
        
               cdef int j = 0 
        
               cdef int n_values = values.shape[0] 
        
               cdef np.intp_t[:] result = np.empty(n_values, dtype=np.intp) 
        
               if n_values == 0: 
        
                   return np.array(result) 
        
               result[0] = values[0] 
        
               for i in range(1, n_values): 
        
                   if values[i] != result[j]: 
        
                       j += 1 
        
                       result[j] = values[i] 
        
                   if values[i] < values[i - 1]: 
        
                       is_monotonic = False 
        
               result = result[:j + 1] 
        
               if not is_monotonic: 
        
                   result = unique_int_1d(np.sort(result)) 
        
               return np.array(result)

To fix Issue #3387, the solution would need to:

Implement a Cython function in _cutil.pyx, the file linked there
Show that it's faster by benchmarking the speedup. The original code would need to be pulled out and turned into its own function for straightforward comparison
Add a test to make sure that the Cython function works
Call this Cython function from groups.py, replacing the original code
Open a pull request and verify that the testsuite passes

If you're happy to do this then please reopen this PR, or open another one, when you're ready!

SanjanaSogimatt and others added 3 commits October 10, 2021 12:56

initial commit

6345919

used a function

bee226a

Merge branch 'MDAnalysis:develop' into main

afe50c1

Update file

f959caa

Fixed errors.

tylerjereddy reviewed Oct 11, 2021

View reviewed changes

tylerjereddy closed this Oct 11, 2021

lilyminium mentioned this pull request Oct 11, 2021

Fast cythonized inverse array of unsorted, unique indices #3387

Open

orbeckst mentioned this pull request Mar 11, 2026

Optimize inverse index mapping in groups.py (Fixes Issue #3387) #5252

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Easy inverse index array #3387#3433

Easy inverse index array #3387#3433
SanjanaSogimatt wants to merge 4 commits intoMDAnalysis:developfrom
SanjanaSogimatt:main

SanjanaSogimatt commented Oct 10, 2021

Uh oh!

pep8speaks commented Oct 10, 2021 •

edited

Loading

Uh oh!

codecov bot commented Oct 10, 2021 •

edited

Loading

Uh oh!

SanjanaSogimatt commented Oct 10, 2021

Uh oh!

tylerjereddy left a comment

Uh oh!

lilyminium commented Oct 11, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

SanjanaSogimatt commented Oct 10, 2021

PR Checklist

Uh oh!

pep8speaks commented Oct 10, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2021-10-10 13:21:51 UTC

Uh oh!

codecov bot commented Oct 10, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

SanjanaSogimatt commented Oct 10, 2021

Uh oh!

tylerjereddy left a comment

Choose a reason for hiding this comment

Uh oh!

lilyminium commented Oct 11, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pep8speaks commented Oct 10, 2021 •

edited

Loading

codecov bot commented Oct 10, 2021 •

edited

Loading