
Implement Series.nsmallest()/Series.nlargest() in new style #241

Merged
3 commits merged, Nov 4, 2019

Conversation

densmirn
Contributor

No description provided.

@densmirn densmirn changed the title Implement Series.nsmallest() in new style Implement Series.nsmallest()/Seires.nlargest() in new style Oct 18, 2019
@densmirn densmirn changed the title Implement Series.nsmallest()/Seires.nlargest() in new style Implement Series.nsmallest()/Series.nlargest() in new style Oct 18, 2019
np.random.seed(0)
S = pd.Series(np.random.randint(-30, 30, m))
np.testing.assert_array_equal(hpat_func(S).values, test_impl(S).values)
np.testing.assert_array_equal(hpat_func(), test_impl())
Contributor

@PokhodenkoSA PokhodenkoSA Oct 18, 2019

Pandas docs: nsmallest returns a Series.
Consider using pandas.testing.assert_series_equal(); it knows how to compare Series properly.

Suggested change
np.testing.assert_array_equal(hpat_func(), test_impl())
pd.testing.assert_series_equal(hpat_func(), test_impl())
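For context, a minimal illustration of the difference (with made-up values): np.testing.assert_array_equal compares only the underlying data, while pd.testing.assert_series_equal also checks the index, dtype and name.

import numpy as np
import pandas as pd

a = pd.Series([1, 2], index=[0, 1])
b = pd.Series([1, 2], index=[5, 7])

np.testing.assert_array_equal(a.values, b.values)  # passes: data matches
pd.testing.assert_series_equal(a, b)               # raises: indexes differ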

Contributor Author

@PokhodenkoSA You are right, but the "old style" functionality doesn't handle indexes correctly, so until we have completely moved to the "new style" this check is valid. For now we don't drop the "old style" code, to keep MPI parallelism.

Contributor

@densmirn A better solution would be to change the test as @PokhodenkoSA suggested and skip it when executed in the old style (since there's a bug that has to be fixed anyway), while the new-style implementation can be written so that it handles the index correctly from the start.

Contributor Author

@densmirn densmirn Oct 22, 2019

@kozlov-alexey If I skipped the tests, there would be no "old style" tests at all. Or do you propose duplicating the tests for the "old style" and "new style" code with different asserts?

Contributor

@densmirn please use:

if hpat.config.config_pipeline_hpat_default:
    np.testing.assert_array_equal(hpat_func(), test_impl())
else:
    pd.testing.assert_series_equal(hpat_func(), test_impl())

np.testing.assert_array_equal(hpat_func(S).values, test_impl(S).values)
for data in test_global_input_data_numeric + [[]]:
    series = pd.Series(data * 3)
    np.testing.assert_array_equal(hpat_func(series), test_impl(series))
Contributor

@PokhodenkoSA PokhodenkoSA Oct 18, 2019

Suggested change
np.testing.assert_array_equal(hpat_func(series), test_impl(series))
pd.testing.assert_series_equal(hpat_func(series), test_impl(series))

Contributor Author

@PokhodenkoSA You are right, but the "old style" functionality doesn't handle indexes correctly. As far as I can see, np.testing.assert_array_equal checks only the data inside the Series.

Contributor

@shssf shssf left a comment

tests need to be unskipped
parallelization check

]

min_int64 = -9223372036854775808
max_int64 = 9223372036854775807
Contributor

sys.maxint

@@ -34,6 +36,23 @@
),
]]

test_global_input_data_float64 = [
[1.0, np.nan, -1.0, 0.0, 5e-324],
Contributor

Suggested change
[1.0, np.nan, -1.0, 0.0, 5e-324],
[1.0, np.nan, -1.0, 0.0, sys.float_info.min],
sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.2204460492503131e-16, radix=2, rounds=1)
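For reference, the two constants genuinely differ: 5e-324 is the smallest subnormal double, while sys.float_info.min is the smallest normal one, so the suggested change alters the value under test. A quick check:

import sys

print(5e-324)                       # 5e-324, smallest subnormal double
print(sys.float_info.min)           # 2.2250738585072014e-308, smallest normal double
print(5e-324 < sys.float_info.min)  # True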

@@ -559,13 +559,13 @@ def select_k_nonan_overload(A, m, k):
    dtype = A.dtype
    if isinstance(dtype, types.Integer):
        # ints don't have nans
-       return lambda A, m, k: (A[:k].copy(), k)
+       return lambda A, m, k: (A[:max(k, 0)].copy(), k)
Contributor

not sure we have to cut off negatives here

Contributor Author

nsmallest(0) and nsmallest(-1) both return an empty Series, so we need to use 0 instead of a negative k to return an empty array. E.g. with k = -2, A = [1, 2, 3, 4, 5]:
A[:k]  # [1, 2, 3]
A[:0]  # [] - what we want to get when k is negative
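A runnable sketch of the same point, using the values from the example above:

import numpy as np

A = np.array([1, 2, 3, 4, 5])
k = -2
print(A[:k])          # [1 2 3] -- a negative bound slices from the end
print(A[:max(k, 0)])  # []      -- clamping negative k to 0 gives the empty result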

Contributor Author

BTW, the old functionality didn't work with a negative k either.

Comment on lines 211 to 212
if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
    raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))
Contributor

@densmirn I think we don't need this: it checks for a correct type of a supported argument, whereas keep seems to be unsupported. We only need to check that the argument was omitted (hence it should have either the types.Omitted type or str type with the value 'first'), i.e. I suggest using the following check at typing time (not at runtime):

Suggested change
if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
    raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))
if not (keep == 'first' or isinstance(keep, types.Omitted)):
    raise TypingError('{} Unsupported parameters. Given keep : {}'.format(_func_name, keep))

Contributor Author

@densmirn densmirn Oct 22, 2019

@kozlov-alexey In this case, if the parameter keep='first' is passed explicitly as an argument, the exception will be raised, because keep will be unicode_type and the condition will be False.

Contributor

@densmirn True, but why do this if the parameter is unsupported? If it's unsupported, the user should not provide any value for it (and hence only the default value needs to be checked).


Returns
-------
:obj:`scalar`
Contributor

Why scalar?

local_index.extend(indices)
local_index = local_index[:n]

return pandas.Series(local_data, local_index)
Contributor

Don't we have to copy series name as well?
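For reference, pandas does propagate the Series name through nsmallest()/nlargest(), e.g.:

import pandas as pd

s = pd.Series([3, 1, 2], name='prices')
print(s.nsmallest(2).name)  # 'prices' -- the name is preserved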

Comment on lines 218 to 223
local_data = hpat.hiframes.api.nlargest_parallel(self._data, n, False, hpat.hiframes.series_kernels.lt_f)
local_index = []
for a in numpy.unique(local_data):
    for indices in numpy.where(self._data == a):
        local_index.extend(indices)
local_index = local_index[:n]
Contributor

@densmirn This looks inefficient, as we call numpy.where on self._data up to n times.
I think it would be better to build a dict/set of the unique values to look for and iterate over self._data only once.
If self._data[i] is in that dict/set, we can append i to a list of the indices where the value was found. Then we merge all such lists into one local_index and cut off the first n elements as before. What do you think?
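A rough plain-Python sketch of that idea (the variable names and sample values here are assumed, not taken from the PR):

import numpy as np

data = np.array([5, 1, 5, 2, 1])
selected = [1, 1, 2]  # the n smallest values, in the order the kernel returns them

# One pass over data: collect the positions of every selected value.
positions = {val: [] for val in selected}
for i, item in enumerate(data):
    if item in positions:
        positions[item].append(i)

# Merge the per-value lists and keep only the first n indices.
n = len(selected)
local_index = [i for val in positions for i in positions[val]][:n]
print(local_index)  # [1, 4, 3]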

@densmirn
Contributor Author

@shssf, @kozlov-alexey could you re-review the PR?

raise ValueError("Method nsmallest(). Unsupported parameter. "
                 "Given 'keep' != 'first'")

nlargest = hpat.hiframes.api.nlargest(self._data, n, False,
Contributor

I think it's better to rename the variable to nsmallest, to avoid confusion.

Contributor Author

Done.


if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
    raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))
local_index = [idx for item in sorted(indices) for idx in indices[item]]
Contributor

We probably don't need to sort here: nlargest already holds the nsmallest elements in ascending order, so we can iterate over it directly.

Contributor Author

Done.

nlargest = hpat.hiframes.api.nlargest(self._data, n, False,
                                      hpat.hiframes.series_kernels.lt_f)
indices = Dict.empty(dtype, inner_list_type)
for item in set(nlargest):
Contributor

Suggested change
for item in set(nlargest):
for item in nlargest:

Contributor Author

Done.

Comment on lines 275 to 279
for idx, item in enumerate(self._data):
    if item in indices:
        indices[item].append(self._index[idx])

local_index = [idx for item in sorted(indices) for idx in indices[item]]
Contributor

In theory we can have a situation when the whole big array is filled with just those 5 numbers, and with this list comprehension we would have to iterate over it once again; i.e. what we lack is the ability to exit early once we have found exactly the first n indices. What do you think about using an explicit loop here?

Suggested change
for idx, item in enumerate(self._data):
    if item in indices:
        indices[item].append(self._index[idx])
local_index = [idx for item in sorted(indices) for idx in indices[item]]
local_index = []
for item in nsmallest:
    for index in indices[item]:
        local_index.append(index)
        if len(local_index) == n:
            break
    else:
        continue
    break

Contributor Author

Done.

@densmirn
Contributor Author

@kozlov-alexey could you please take the next round of review?

@densmirn
Contributor Author

@shssf could you take the next round of review?

@densmirn
Contributor Author

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@densmirn
Contributor Author

@shssf I moved TypeChecker from this PR to another one. So could you please re-review this PR?

Contributor

@shssf shssf left a comment

It looks much better now. I hope only one more round of changes is needed before this PR can be merged.


# data: [0, 1, -1, 1, 0] -> [1, 1, 0, 0, -1]
# index: [0, 1, 2, 3, 4] -> [1, 3, 0, 4, 2] (not [3, 1, 4, 0, 2])
indices = (-self._data - 1).argsort(kind='mergesort')[:max(n, 0)]
Contributor

Quite strange algorithm.

Contributor Author

(-self._data - 1): 1 is subtracted to ensure reverse ordering at the boundaries, e.g. to turn the minimal integer into the maximal one. self._data.argsort(kind='mergesort')[::-1] is invalid when the data contains duplicates.
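A small runnable check of the trick, using the example from the code comment above; note that for integers -x - 1 equals ~x, a monotone-decreasing bijection that cannot overflow even at the int64 minimum:

import numpy as np

data = np.array([0, 1, -1, 1, 0])

# Stable descending argsort via -x - 1 (bitwise NOT for integers):
print((-data - 1).argsort(kind='mergesort'))  # [1 3 0 4 2] -- ties keep original order
# Reversing a stable ascending argsort breaks tie order:
print(data.argsort(kind='mergesort')[::-1])   # [3 1 4 0 2]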

Contributor

@shssf shssf Nov 3, 2019

Do you think it will work if self._data[i] == type_max, as you expect?

Contributor Author

Yes, I checked it. Moreover, a similar approach is used in Pandas.

if not hpat.config.config_pipeline_hpat_default:
    for attr in _non_hpat_pipeline_attrs:
        if attr in SeriesAttribute.__dict__:
            delattr(SeriesAttribute, attr)
Contributor

I still don't understand why we need to remove attributes after adding them. I still think it would be better to merge _non_hpat_pipeline_attrs with _not_series_array_attrs and remove this piece of code.

Contributor Author

We cannot merge them. The key idea of this code is to remove from SeriesAttribute the predefined series attributes that were not added to SeriesAttribute by the previous loop. E.g. if we added 'resolve_nsmallest' to _not_series_array_attrs and removed the "deleter", SeriesAttribute would still contain the 'resolve_nsmallest' attribute and the "new style" would not be used.

rands_chars = np.array(accepted_chars, dtype=(np.str_, 1))

np.random.seed(100)
return np.random.choice(rands_chars, size=nchars * size).view((np.str_, nchars))
Contributor

I think it is a bad idea to use random input data. If a test fails on some input, it will be more difficult to reproduce the issue.

Contributor Author

BTW, I set the seed here with np.random.seed(100), so the result of the function should always be identical. Anyway, I will think about how to avoid random generation.
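For reference, a standalone version of the seeded pattern (the alphabet and sizes here are made up), showing that a fixed seed makes the "random" strings reproducible:

import numpy as np

accepted_chars = list('abcdefgh')  # assumed alphabet
rands_chars = np.array(accepted_chars, dtype=(np.str_, 1))

np.random.seed(100)  # fixed seed -> identical output on every run
size, nchars = 3, 4
strings = np.random.choice(rands_chars, size=nchars * size).view((np.str_, nchars))
print(strings)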


@unittest.skipIf(hpat.config.config_pipeline_hpat_default,
                 'Series.nlargest() types validation unsupported')
Contributor

Suggested change
'Series.nlargest() types validation unsupported')
'Series.nlargest() do not throw an exception')

Contributor Author

Let me keep the Pythonic style and replace "throw" with "raise" :)


# TODO: check data == [] when index fixed
for data in test_global_input_data_numeric:
    data *= 3
Contributor

Why is this needed?

Contributor Author

The data is repeated to test with duplicates.

Contributor

In this case it might be clearer to use

data_duplicated = data * 3

for n in range(-1, 10):
    ref_result = test_impl(series, n)
    jit_result = hpat_func(series, n)
    pd.testing.assert_series_equal(ref_result, jit_result)
Contributor

I see this test also doesn't work with an index of strings, while, as far as I remember, the old style does work with an index of strings.
Anyway, you could use if/else, as you did in other tests, to make different checks for different styles. Please note, I proposed ref_result.values, which returns an array (not a Series).

@pep8speaks

pep8speaks commented Nov 1, 2019

Hello @densmirn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-04 11:45:43 UTC

Contributor

@shssf shssf left a comment

tests need to pass

@shssf shssf merged commit 6f36f5b into IntelPython:master Nov 4, 2019