Skip to content

Easy inverse index array #3387#3433

Closed
SanjanaSogimatt wants to merge 4 commits intoMDAnalysis:developfrom
SanjanaSogimatt:main
Closed

Easy inverse index array #3387#3433
SanjanaSogimatt wants to merge 4 commits intoMDAnalysis:developfrom
SanjanaSogimatt:main

Conversation

@SanjanaSogimatt
Copy link

Fixes #

Changes made in this Pull Request:

  • Functions used to fix the bug
  • There are two methods used and one is commented out

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Oct 10, 2021

Hello @SanjanaSogimatt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1:1: E265 block comment should start with '# '
Line 4:1: E265 block comment should start with '# '
Line 5:5: E225 missing whitespace around operator
Line 6:1: E302 expected 2 blank lines, found 0
Line 7:1: W293 blank line contains whitespace
Line 8:8: E225 missing whitespace around operator
Line 10:1: E305 expected 2 blank lines after class or function definition, found 0
Line 10:1: E741 ambiguous variable name 'l'
Line 10:2: E225 missing whitespace around operator
Line 13:1: E265 block comment should start with '# '
Line 14:1: E302 expected 2 blank lines, found 1
Line 15:8: E225 missing whitespace around operator
Line 15:11: E231 missing whitespace after ','
Line 15:13: E231 missing whitespace after ','
Line 15:15: E231 missing whitespace after ','
Line 15:17: E231 missing whitespace after ','
Line 15:20: E231 missing whitespace after ','
Line 16:11: E225 missing whitespace around operator
Line 18:1: W293 blank line contains whitespace
Line 19:1: W293 blank line contains whitespace
Line 19:1: W391 blank line at end of file

Comment last updated at 2021-10-10 13:21:51 UTC

@codecov
Copy link

codecov bot commented Oct 10, 2021

Codecov Report

Merging #3433 (f959caa) into develop (149eb50) will decrease coverage by 3.13%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3433      +/-   ##
===========================================
- Coverage    92.92%   89.79%   -3.14%     
===========================================
  Files          172      167       -5     
  Lines        22380    21786     -594     
  Branches      3302        0    -3302     
===========================================
- Hits         20797    19563    -1234     
- Misses        1030     2223    +1193     
+ Partials       553        0     -553     
Impacted Files Coverage Δ
package/MDAnalysis/converters/RDKitParser.py 14.51% <0.00%> (-82.24%) ⬇️
package/MDAnalysis/converters/RDKit.py 16.98% <0.00%> (-81.08%) ⬇️
package/MDAnalysis/analysis/hole2/hole.py 14.47% <0.00%> (-59.80%) ⬇️
package/MDAnalysis/converters/OpenMM.py 37.68% <0.00%> (-59.34%) ⬇️
package/MDAnalysis/converters/OpenMMParser.py 42.85% <0.00%> (-57.15%) ⬇️
package/MDAnalysis/analysis/hole2/utils.py 25.78% <0.00%> (-49.06%) ⬇️
package/MDAnalysis/topology/guessers.py 92.24% <0.00%> (-6.98%) ⬇️
package/MDAnalysis/core/selection.py 93.22% <0.00%> (-5.41%) ⬇️
package/MDAnalysis/analysis/psa.py 79.14% <0.00%> (-4.98%) ⬇️
package/MDAnalysis/lib/NeighborSearch.py 96.42% <0.00%> (-3.58%) ⬇️
... and 113 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 149eb50...f959caa. Read the comment docs.

Fixed errors.
@SanjanaSogimatt
Copy link
Author

Hello, I wanted to know how to improve my code.

Copy link
Member

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't reference any MDAnalysis issue, or adjust any MDAnalysis source code, so I'm going to close it and suggest browsing our issues/docs if you'd like to contribute.

If you'd like feedback on improving your generic Python code you might try something like StackOverflow.

@lilyminium
Copy link
Member

Hi @SanjanaSogimatt,

Thanks for having a look at this issue and for linking it in the title. As Tyler points out, it would have been more beneficial to add the issue number after the Fixes # line as that enables GitHub to link the pull request like in the screenshot below -- this makes it much easier for reviewers to track the discussion, especially in a large project like MDAnalysis.

Screen Shot 2021-10-11 at 7 31 15 am

Issue #3387 describes how PR #3368 adds this code to the library:

mask = np.zeros_like(self.ix)
for i, x in enumerate(indices):
values = np.where(self.ix == x)[0]
mask[values] = i
self._unique_restore_mask = mask

What this code does is create an inverse array, akin to what you get below (where np stands for numpy):

sorted_unique_array, inverse_array = np.unique(my_list, return_inverse=True)

What this inverse_array does is:

sorted_unique_array[inverse_array] == my_list

This is not the same thing as reversing an array.

The reason I couldn't use the easy np.unique command is because np.unique sorts the output sorted_unique_array. If you go to the groups.py file by clicking on the link, you'll see that the indices that are used in the function are specifically unsorted, but unique.

indices = unique_int_1d_unsorted(self.ix)

So for example:

[ 1, 5, 3, 3, 6,] is unsorted and not unique
[ 1, 3, 3, 5, 6,] is sorted and not unique
[ 1, 5, 3, 6,] is unsorted and unique
[ 1, 3, 5, 6,] is sorted and unique

So what the original code in #3387 does is create an inverse array for the third scenario there, i.e. [ 1, 5, 3, 6,]. The ideal goal is that if [ 1, 5, 3, 3, 6,] is the input, then

np.array([ 1, 5, 3, 6,])[inverse_array] == [ 1, 5, 3, 3, 6,].

However, I think that the original code in #3387 is clumsy and slow. The goal is to rewrite it in Cython to make it faster. An example of a similar function would be here:

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False) # turn off negative index wrapping for entire function
def unique_int_1d(np.intp_t[:] values):
"""Find the unique elements of a 1D array of integers.
This function is optimal on sorted arrays.
Parameters
----------
values: numpy.ndarray
1D array of dtype ``numpy.int64`` in which to find the unique values.
Returns
-------
numpy.ndarray
A deduplicated copy of `values`.
.. versionadded:: 0.19.0
"""
cdef bint is_monotonic = True
cdef int i = 0
cdef int j = 0
cdef int n_values = values.shape[0]
cdef np.intp_t[:] result = np.empty(n_values, dtype=np.intp)
if n_values == 0:
return np.array(result)
result[0] = values[0]
for i in range(1, n_values):
if values[i] != result[j]:
j += 1
result[j] = values[i]
if values[i] < values[i - 1]:
is_monotonic = False
result = result[:j + 1]
if not is_monotonic:
result = unique_int_1d(np.sort(result))
return np.array(result)

To fix Issue #3387, the solution would need to:

  1. Implement a Cython function in _cutil.pyx, the file linked there
  2. Show that it's faster by benchmarking the speedup. The original code would need to be pulled out and turned into its own function for straightforward comparison
  3. Add a test to make sure that the Cython function works
  4. Call this Cython function from groups.py, replacing the original code
  5. Open a pull request and verify that the testsuite passes

If you're happy to do this then please reopen this PR, or open another one, when you're ready!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants