Implement Series.nsmallest()/Series.nlargest() in new style #241
Conversation
hpat/tests/test_series.py (outdated)
np.random.seed(0)
S = pd.Series(np.random.randint(-30, 30, m))
np.testing.assert_array_equal(hpat_func(S).values, test_impl(S).values)
np.testing.assert_array_equal(hpat_func(), test_impl())
Pandas docs: nsmallest returns a Series.
Consider using pandas.testing.assert_series_equal(); it knows how to compare Series better.
Suggested change:
- np.testing.assert_array_equal(hpat_func(), test_impl())
+ pd.testing.assert_series_equal(hpat_func(), test_impl())
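For illustration of the difference (a minimal sketch with made-up data, not from the PR): np.testing.assert_array_equal compares only the values, while pd.testing.assert_series_equal also checks the index, dtype, and name:

import numpy as np
import pandas as pd

a = pd.Series([1, 2], index=[0, 1])
b = pd.Series([1, 2], index=[9, 8])

np.testing.assert_array_equal(a, b)   # passes: values match
pd.testing.assert_series_equal(a, b)  # raises AssertionError: indexes differ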
@PokhodenkoSA You are right, but the "old style" functionality doesn't work with indexes correctly, so until we have completely moved to the "new style", the check is valid. For now we don't drop the "old style" code, to keep MPI parallelism.
@densmirn A better solution would be to change the test as @PokhodenkoSA suggested and skip it when executed with old-style (as there's a bug there which has to be fixed anyway), while the new-style impl can be written so that it handles the index correctly from the start.
@kozlov-alexey If I skipped the tests, there would be no "old style" tests at all. Or do you propose duplicating the tests for "old style" and "new style" code with different asserts?
@densmirn please use:

if hpat.config.config_pipeline_hpat_default:
    np.testing.assert_array_equal(hpat_func(), test_impl())
else:
    pd.testing.assert_series_equal(hpat_func(), test_impl())
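Here hpat.config.config_pipeline_hpat_default is True when the old-style (HPAT default) pipeline is active, so the weaker array-level check applies only to the old style, while the new style gets the full Series comparison.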
hpat/tests/test_series.py (outdated)
np.testing.assert_array_equal(hpat_func(S).values, test_impl(S).values)
for data in test_global_input_data_numeric + [[]]:
    series = pd.Series(data * 3)
    np.testing.assert_array_equal(hpat_func(series), test_impl(series))
Suggested change:
- np.testing.assert_array_equal(hpat_func(series), test_impl(series))
+ pd.testing.assert_series_equal(hpat_func(series), test_impl(series))
@PokhodenkoSA You are right, but the "old style" functionality doesn't work with indexes correctly. As far as I can see, np.testing.assert_array_equal checks only the data inside the Series.
Tests need to be unskipped.
Parallelization check is needed.
hpat/tests/test_series.py (outdated)
]

min_int64 = -9223372036854775808
max_int64 = 9223372036854775807
sys.maxint
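A side note on that suggestion: sys.maxint exists only in Python 2 (Python 3 has the platform-dependent sys.maxsize instead), so these int64 bounds can also be computed rather than hard-coded, e.g. via NumPy (a sketch, not PR code):

import numpy as np

print(np.iinfo(np.int64).min)  # -9223372036854775808
print(np.iinfo(np.int64).max)  # 9223372036854775807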
hpat/tests/test_series.py (outdated)
@@ -34,6 +36,23 @@
    ),
]]

test_global_input_data_float64 = [
    [1.0, np.nan, -1.0, 0.0, 5e-324],
Suggested change:
- [1.0, np.nan, -1.0, 0.0, 5e-324],
+ [1.0, np.nan, -1.0, 0.0, sys.float_info.min],
sys.float_info:

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.2204460492503131e-16, radix=2, rounds=1)
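Note that the two values are not equivalent: 5e-324 is the smallest positive subnormal double, while sys.float_info.min is the smallest positive normalized double:

import sys

print(5e-324)              # 5e-324 (smallest subnormal double)
print(sys.float_info.min)  # 2.2250738585072014e-308 (smallest normal double)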
@@ -559,13 +559,13 @@ def select_k_nonan_overload(A, m, k):
    dtype = A.dtype
    if isinstance(dtype, types.Integer):
        # ints don't have nans
-       return lambda A, m, k: (A[:k].copy(), k)
+       return lambda A, m, k: (A[:max(k, 0)].copy(), k)
not sure we have to cut off negatives here
nsmallest(0) == nsmallest(-1): both return an empty Series. So we need to substitute 0 for a negative k to return an empty array. E.g. for k = -2, A = [1, 2, 3, 4, 5]:

A[:k]  # [1, 2, 3]
A[:0]  # [] - what we want to get when we have a negative k
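A runnable version of that point, also showing what the max(k, 0) fix does:

A = [1, 2, 3, 4, 5]
k = -2
print(A[:k])          # [1, 2, 3] - a negative bound slices from the end
print(A[:max(k, 0)])  # [] - the empty result nsmallest(-2) should produce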
BTW, the old functionality didn't work with a negative k.
Force-pushed 6871e0a to fa19c2a.
if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
    raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))
@densmirn I think we don't need this, as it checks for the correct type of a supported argument, whereas keep seems to be unsupported. We only need to check that the argument was omitted (hence it should have either the types.Omitted type, or str type and the value 'first'), i.e. I suggest using the following check at the typing step (and not during runtime):
Suggested change:
- if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
-     raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))
+ if not (keep == 'first' or isinstance(keep, types.Omitted)):
+     raise TypingError('{} Unsupported parameters. Given keep : {}'.format(_func_name, keep))
@kozlov-alexey In this case, if the parameter keep='first' is passed explicitly as an argument, the exception will be raised, because keep will be equal to unicode_type and the condition will be False.
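To illustrate that point with plain Numba types (a sketch; the call patterns in the comments are illustrative):

from numba import types

# What the typing-step guard sees for each call pattern:
#   S.nsmallest(5)               -> keep is types.Omitted('first')
#   S.nsmallest(5, keep='first') -> keep is types.unicode_type
keep = types.unicode_type
print(keep == 'first')                  # False: a type, not a value
print(isinstance(keep, types.Omitted))  # False: the argument was provided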
@densmirn True, but why do this if the parameter is unsupported? If it's unsupported, then the user should not provide any value for it (and hence only the default value can be checked).
Returns
-------
:obj:`scalar`
Why scalar?
        local_index.extend(indices)
local_index = local_index[:n]

return pandas.Series(local_data, local_index)
Don't we have to copy the Series name as well?
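For reference, pandas.Series accepts the name directly in the constructor, so it can be propagated from the source Series (illustrative data):

import pandas as pd

s = pd.Series([3, 1, 2], name='values')
result = pd.Series(sorted(s)[:2], name=s.name)  # propagate the name
print(result.name)  # 'values'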
local_data = hpat.hiframes.api.nlargest_parallel(self._data, n, False, hpat.hiframes.series_kernels.lt_f)
local_index = []
for a in numpy.unique(local_data):
    for indices in numpy.where(self._data == a):
        local_index.extend(indices)
local_index = local_index[:n]
@densmirn This looks inefficient, as we call numpy.where on self._data up to n times. I think it would be better to use a dict/set of the unique values to look for, and iterate over self._data only once. If self._data[i] is in that dict/set, we can append i to a list of indexes where it was found. Then we will need to merge all such lists into one local_index and cut off the first n elements, as before. What do you think?
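A minimal sketch of that single-pass idea in plain Python (illustrative names; nsmallest_values stands for the sorted values returned by the kernel):

import numpy as np

def collect_local_index(data, nsmallest_values, n):
    # One pass over data: record the positions of every wanted value.
    positions = {value: [] for value in nsmallest_values}
    for i, item in enumerate(data):
        if item in positions:
            positions[item].append(i)
    # Merge the per-value index lists in value order and keep the first n.
    local_index = []
    for value in dict.fromkeys(nsmallest_values):  # dedupe, keep order
        local_index.extend(positions[value])
    return local_index[:n]

print(collect_local_index(np.array([1, 3, 1, 2, 3]), [1, 1, 2], 3))  # [0, 2, 3]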
Force-pushed fa19c2a to 4154640.
Force-pushed 4154640 to 9fb7acd.
@shssf, @kozlov-alexey could you re-review the PR?
raise ValueError("Method nsmallest(). Unsupported parameter. " | ||
"Given 'keep' != 'first'") | ||
|
||
nlargest = hpat.hiframes.api.nlargest(self._data, n, False, |
I think it's better to rename the variable to nsmallest, to avoid confusion.
Done.
if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
    raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))

local_index = [idx for item in sorted(indices) for idx in indices[item]]
We probably don't need to sort it, as we already have the n smallest elements in ascending order in nlargest, and we can iterate over that instead.
Done.
nlargest = hpat.hiframes.api.nlargest(self._data, n, False,
                                      hpat.hiframes.series_kernels.lt_f)
indices = Dict.empty(dtype, inner_list_type)
for item in set(nlargest):
Suggested change:
- for item in set(nlargest):
+ for item in nlargest:
Done.
for idx, item in enumerate(self._data):
    if item in indices:
        indices[item].append(self._index[idx])

local_index = [idx for item in sorted(indices) for idx in indices[item]]
In theory we can have a situation when our whole big array is filled with just 5 distinct numbers, and with this list comprehension we would have to iterate over it once again; i.e. what we lack is the ability to leave early after we have found exactly the first n indices. What do you think about using an explicit loop here?
Suggested change:
- for idx, item in enumerate(self._data):
-     if item in indices:
-         indices[item].append(self._index[idx])
- local_index = [idx for item in sorted(indices) for idx in indices[item]]
+ local_index = []
+ for item in nsmallest:
+     for index in indices[item]:
+         local_index.append(index)
+         if len(local_index) == n:
+             break
+     else:
+         continue
+     break
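A note on the suggested loop: the else clause of the inner for runs only when that loop finishes without a break, so continue keeps scanning the remaining values, while a break out of the inner loop (once n indices are collected) falls through to the outer break and exits early.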
Done.
Force-pushed 9fb7acd to caf6e2c.
@kozlov-alexey could you please take the next round of review?
@shssf could you take the next round of review?
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed 07f7468 to e7da101.
@shssf I moved TypeChecker from this PR to another one. So could you please re-review this PR?
Currently it looks much better. I hope it requires only one more effort before this PR can be merged.
# data:  [0, 1, -1, 1, 0] -> [1, 1, 0, 0, -1]
# index: [0, 1, 2, 3, 4] -> [1, 3, 0, 4, 2] (not [3, 1, 4, 0, 2])
indices = (-self._data - 1).argsort(kind='mergesort')[:max(n, 0)]
Quite strange algorithm.
(-self._data - 1): 1 is subtracted to ensure reverse ordering at the boundaries, e.g. to turn the min integer into the max. self._data.argsort(kind='mergesort')[::-1] is invalid in case of duplicates in the data.
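A small NumPy illustration of that point, using the data from the comment above (not the PR code itself):

import numpy as np

data = np.array([0, 1, -1, 1, 0])

# Stable descending order: negate and argsort ascending with a stable kind.
# The extra -1 is the bitwise NOT, which maps e.g. np.int64 min to max
# without overflow, keeping the boundaries representable.
print((-data - 1).argsort(kind='mergesort'))  # [1 3 0 4 2]: ties keep order
print(data.argsort(kind='mergesort')[::-1])   # [3 1 4 0 2]: ties reversed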
Do you think it will work if self._data[i] == type_max, as you expected?
Yes, I checked it. Moreover, a similar approach is used in Pandas.
if not hpat.config.config_pipeline_hpat_default:
    for attr in _non_hpat_pipeline_attrs:
        if attr in SeriesAttribute.__dict__:
            delattr(SeriesAttribute, attr)
I still don't understand why we need to remove attributes after adding them. I still think it would be better to merge _non_hpat_pipeline_attrs with _not_series_array_attrs and remove this piece of code.
We cannot merge them. The key idea of the code is to remove from SeriesAttribute those predefined Series attributes that were not added to SeriesAttribute via the previous loop. E.g. if we added 'resolve_nsmallest' to _not_series_array_attrs and removed the "deleter", then SeriesAttribute would still contain the attribute 'resolve_nsmallest' and the "new style" wouldn't be used.
hpat/tests/test_series.py (outdated)
rands_chars = np.array(accepted_chars, dtype=(np.str_, 1))

np.random.seed(100)
return np.random.choice(rands_chars, size=nchars * size).view((np.str_, nchars))
I think it is a bad idea to use random input data: if some test fails on some input data, it would be more difficult to reproduce the issue.
BTW, I set up the seed here via np.random.seed(100), so the result of the function should always be identical. Anyway, I will think about how to avoid random generation.
hpat/tests/test_series.py (outdated)
@unittest.skipIf(hpat.config.config_pipeline_hpat_default,
                 'Series.nlargest() types validation unsupported')
Suggested change:
- 'Series.nlargest() types validation unsupported')
+ 'Series.nlargest() do not throw an exception')
Let me keep the Pythonic style and replace "throw" with "raise" :)
hpat/tests/test_series.py (outdated)
# TODO: check data == [] when index fixed
for data in test_global_input_data_numeric:
    data *= 3
Why is this here?
The data is repeated to test with duplicates.
In this case it might be clearer to use:

data_duplicated = data * 3
hpat/tests/test_series.py (outdated)
for n in range(-1, 10):
    ref_result = test_impl(series, n)
    jit_result = hpat_func(series, n)
    pd.testing.assert_series_equal(ref_result, jit_result)
I see this test also doesn't work with an index of strings. Also, as far as I remember, the old style does work with an index of strings. Anyway, you could use if/else as you did in other tests to make different checks for the different styles. Please note, I proposed ref_result.values, which returns an array (not a Series).
Force-pushed 60cdd9a to 3038203.
The tests need to pass.
Force-pushed 3038203 to 0e0cf7e.
No description provided.