Implement Series.nsmallest()/Series.nlargest() in new style #241

densmirn · 2019-10-18T15:39:31Z

No description provided.

PokhodenkoSA · 2019-10-18T20:55:49Z

hpat/tests/test_series.py

-        np.random.seed(0)
-        S = pd.Series(np.random.randint(-30, 30, m))
-        np.testing.assert_array_equal(hpat_func(S).values, test_impl(S).values)
+        np.testing.assert_array_equal(hpat_func(), test_impl())


Pandas docs: nsmallest returns Series.
Consider using pandas.testing.assert_series_equal(). It knows how to compare series better.

Suggested change

np.testing.assert_array_equal(hpat_func(), test_impl())

pd.testing.assert_series_equal(hpat_func(), test_impl())

@PokhodenkoSA You are right, but the "old style" functionality doesn't handle indexes correctly, so the check is valid until we have completely moved to "new style". For now we are not dropping the "old style" code, in order to keep MPI parallelism.

@densmirn A better solution would be to change the test as @PokhodenkoSA suggested and skip it if executed with old-style (as there's a bug which has to be fixed anyway), while new-style impl can be written so that it handles index correctly from the start.

@kozlov-alexey If I skipped the tests, there would be no "old style" tests at all. Or do you propose duplicating the tests for "old style" and "new style" code with different asserts?

@densmirn please use

if hpat.config.config_pipeline_hpat_default:
    np.testing.assert_array_equal(hpat_func(), test_impl())
else:
    pd.testing.assert_series_equal(hpat_func(), test_impl())

PokhodenkoSA · 2019-10-18T20:57:07Z

hpat/tests/test_series.py

-        np.testing.assert_array_equal(hpat_func(S).values, test_impl(S).values)
+        for data in test_global_input_data_numeric + [[]]:
+            series = pd.Series(data * 3)
+            np.testing.assert_array_equal(hpat_func(series), test_impl(series))


Suggested change

np.testing.assert_array_equal(hpat_func(series), test_impl(series))

pd.testing.assert_series_equal(hpat_func(series), test_impl(series))

@PokhodenkoSA You are right, but the "old style" functionality doesn't handle indexes correctly. As far as I can see, np.testing.assert_array_equal checks only the data inside the series.
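The difference between the two assertions discussed above can be sketched as follows. This is only an illustration with made-up data, not code from the PR: comparing the underlying arrays ignores the index, while assert_series_equal also compares it.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the PR): same values, different index labels.
left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([1, 2, 3], index=[10, 11, 12])

# Comparing the underlying arrays passes: the index is never looked at.
np.testing.assert_array_equal(left.values, right.values)

# Comparing the Series themselves also compares the index, so this raises.
try:
    pd.testing.assert_series_equal(left, right)
    index_mismatch_detected = False
except AssertionError:
    index_mismatch_detected = True

print(index_mismatch_detected)
```

This is why an array-level check can pass for an implementation that returns the right values with the wrong index.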

shssf

Tests need to be unskipped.
Parallelization check.

shssf · 2019-10-19T16:05:10Z

hpat/tests/test_series.py

+]
+
+min_int64 = -9223372036854775808
+max_int64 = 9223372036854775807


shssf · 2019-10-19T16:07:09Z

hpat/tests/test_series.py

@@ -34,6 +36,23 @@
    ),
 ]]

+test_global_input_data_float64 = [
+    [1.0, np.nan, -1.0, 0.0, 5e-324],


Suggested change

[1.0, np.nan, -1.0, 0.0, 5e-324],

[1.0, np.nan, -1.0, 0.0, sys.float_info.min],

sys.float_info
sys.floatinfo(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.2204460492503131e-16, radix=2, rounds=1)
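One point that may be worth keeping in mind for this suggestion: the two literals are not the same value. 5e-324 is the smallest positive subnormal double, while sys.float_info.min is the smallest positive normalized double. A small sketch of the difference (whether the test data should cover subnormals is left to the authors):

```python
import sys
import numpy as np

smallest_subnormal = 5e-324            # literal used in the original test data
smallest_normal = sys.float_info.min   # value proposed in the suggestion

# The subnormal value is strictly smaller than the smallest normalized one.
is_smaller = smallest_subnormal < smallest_normal

# It is the very next double after 0.0 ...
is_next_after_zero = np.nextafter(0.0, 1.0) == smallest_subnormal

# ... so halving it underflows to exactly 0.0.
halves_to_zero = smallest_subnormal / 2 == 0.0

print(is_smaller, is_next_after_zero, halves_to_zero)
```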

hpat/tests/test_series.py

shssf · 2019-10-19T16:10:01Z

hpat/hiframes/api.py

@@ -559,13 +559,13 @@ def select_k_nonan_overload(A, m, k):
    dtype = A.dtype
    if isinstance(dtype, types.Integer):
        # ints don't have nans
-        return lambda A, m, k: (A[:k].copy(), k)
+        return lambda A, m, k: (A[:max(k, 0)].copy(), k)


not sure we have to cut off negatives here

Both nsmallest(0) and nsmallest(-1) return an empty series, so we need to use 0 instead of a negative k to get an empty array. E.g. for k = -2, A = [1, 2, 3, 4, 5]:
A[:k] # [1, 2, 3]
A[:0] # [] - what we want to get when we have negative k

BTW old functionality didn't work with negative k.
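The slicing behaviour described in this reply can be checked with a short sketch. The helper name take_first_k is hypothetical and only mirrors the `A[:max(k, 0)].copy()` expression from the diff:

```python
import numpy as np

def take_first_k(values, k):
    """Hypothetical helper mirroring A[:max(k, 0)].copy() from the diff."""
    return values[:max(k, 0)].copy()

A = np.array([1, 2, 3, 4, 5])

# A plain negative slice drops elements from the end instead of returning nothing.
plain_negative = A[:-2]

# Clamping k to 0 yields the empty result wanted for negative k.
clamped_negative = take_first_k(A, -2)
clamped_positive = take_first_k(A, 2)

print(plain_negative, clamped_negative, clamped_positive)
```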

kozlov-alexey · 2019-10-22T10:07:15Z

hpat/datatypes/hpat_pandas_series_functions.py

+    if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
+        raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))


@densmirn I think we don't need this, as it checks for the correct type of a supported argument, and keep seems to be unsupported. We only need to check that the argument was omitted (hence it should have either the types.Omitted type, or the str type and the value 'first'), i.e. I suggest using the following check at the typing step (and not at runtime):

Suggested change

if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):

raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))

if not (keep == 'first' or isinstance(keep, types.Omitted)):

raise TypingError('{} Unsupported parameters. Given keep : {}'.format(_func_name, keep))

@kozlov-alexey In this case, if keep='first' is passed explicitly as an argument, the exception will be raised, because keep will be equal to unicode_type and the condition will be False.

@densmirn True, but why do this if this parameter is unsupported? If it's unsupported then the user should not provide any value for it (and hence only default value can be checked).

kozlov-alexey · 2019-10-22T10:14:45Z

hpat/datatypes/hpat_pandas_series_functions.py

+
+    Returns
+    -------
+    :obj:`scalar`


Why scalar?

kozlov-alexey · 2019-10-22T10:29:54Z

hpat/datatypes/hpat_pandas_series_functions.py

+                local_index.extend(indices)
+        local_index = local_index[:n]
+
+        return pandas.Series(local_data, local_index)


Don't we have to copy series name as well?

kozlov-alexey · 2019-10-22T10:53:25Z

hpat/datatypes/hpat_pandas_series_functions.py

+        local_data = hpat.hiframes.api.nlargest_parallel(self._data, n, False, hpat.hiframes.series_kernels.lt_f)
+        local_index = []
+        for a in numpy.unique(local_data):
+            for indices in numpy.where(self._data == a):
+                local_index.extend(indices)
+        local_index = local_index[:n]


@densmirn This looks inefficient, as we call numpy.where on self._data up to n times.
I think it would be better to use a dict/set of the unique values to look for and iterate over self._data only once.
If self._data[i] is in that dict/set, we can append i to a list of indexes where it was found. Then we will need to merge all such lists into one local_index and cut off the first n elements as before. What do you think?
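A minimal sketch of the single-pass idea proposed above, written in plain Python with an ordinary dict rather than the Numba typed containers the PR would need. The function name and sample data are illustrative assumptions:

```python
def collect_index_labels(data, index, wanted_values):
    """Map each wanted value to the index labels where it occurs, in one pass."""
    positions = {value: [] for value in wanted_values}
    for i, item in enumerate(data):
        if item in positions:              # O(1) membership test per element
            positions[item].append(index[i])
    return positions

# Illustrative data (not from the PR).
data = [3, 1, 2, 1, 3, 0]
index = ['a', 'b', 'c', 'd', 'e', 'f']
smallest = [0, 1, 1]                       # e.g. the three smallest values
n = 3

found = collect_index_labels(data, index, smallest)

# Merge the per-value lists in ascending value order and keep the first n labels.
local_index = [label for value in sorted(found) for label in found[value]][:n]
print(local_index)
```

The data is scanned once regardless of how many distinct values are being looked up, instead of once per value as with repeated numpy.where calls.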

densmirn · 2019-10-25T09:32:39Z

@shssf, @kozlov-alexey could you re-review the PR?

kozlov-alexey · 2019-10-25T15:50:04Z

hpat/datatypes/hpat_pandas_series_functions.py

+                raise ValueError("Method nsmallest(). Unsupported parameter. "
+                                 "Given 'keep' != 'first'")
+
+            nlargest = hpat.hiframes.api.nlargest(self._data, n, False,


I think it's better to rename the variable to nsmallest to avoid confusion.

kozlov-alexey · 2019-10-25T15:55:25Z

hpat/datatypes/hpat_pandas_series_functions.py


-    if not isinstance(keep, (types.Omitted, str, types.UnicodeType)):
-        raise TypingError('{} The object must be an unicode. Given keep: {}'.format(_func_name, keep))
+            local_index = [idx for item in sorted(indices) for idx in indices[item]]


We probably don't need to sort it as we already have nsmallest elements in ascending order in nlargest and we can iterate over it.

kozlov-alexey · 2019-10-25T16:01:34Z

hpat/datatypes/hpat_pandas_series_functions.py

+            nlargest = hpat.hiframes.api.nlargest(self._data, n, False,
+                                                  hpat.hiframes.series_kernels.lt_f)
+            indices = Dict.empty(dtype, inner_list_type)
+            for item in set(nlargest):


Suggested change

for item in set(nlargest):

for item in nlargest:

kozlov-alexey · 2019-10-25T16:31:01Z

hpat/datatypes/hpat_pandas_series_functions.py

+        for idx, item in enumerate(self._data):
+            if item in indices:
+                indices[item].append(self._index[idx])
+
+        local_index = [idx for item in sorted(indices) for idx in indices[item]]


In theory we can have a situation where our whole big array is filled with just 5 distinct numbers, and with this list comprehension we would have to iterate over it once again; i.e. what we lack is the ability to exit early once we have found exactly the first n indices. What do you think about using an explicit loop here?

Suggested change

        for idx, item in enumerate(self._data):
            if item in indices:
                indices[item].append(self._index[idx])

        local_index = [idx for item in sorted(indices) for idx in indices[item]]

        local_index = []
        for item in nsmallest:
            for index in indices[item]:
                local_index.append(index)
                if len(local_index) == n:
                    break
            else:
                continue
            break
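The early-exit loop suggested here relies on Python's for/else: the else branch runs only when the inner loop finishes without a break, so the trailing break leaves the outer loop as soon as n labels are collected. A standalone sketch under stated assumptions: the function name is hypothetical, and the values are assumed to be distinct and already sorted (iterating over a list with repeated values would visit the same label list more than once).

```python
def first_n_labels(distinct_sorted_values, positions, n):
    """Collect at most n labels, stopping as soon as n have been gathered."""
    labels = []
    if n <= 0:
        return labels
    for value in distinct_sorted_values:
        for label in positions[value]:
            labels.append(label)
            if len(labels) == n:
                break          # enough labels: leave the inner loop
        else:
            continue           # inner loop was not broken: try the next value
        break                  # inner loop was broken: leave the outer loop too
    return labels

# Illustrative data (not from the PR).
positions = {0: ['f'], 1: ['b', 'd'], 2: ['c']}
print(first_n_labels([0, 1, 2], positions, 2))
```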

densmirn · 2019-10-28T13:10:36Z

@kozlov-alexey could you please take the next round of review?

hpat/datatypes/hpat_pandas_series_functions.py

densmirn · 2019-10-31T07:50:38Z

@shssf could you take the next round of review?

densmirn · 2019-10-31T10:59:59Z

/AzurePipelines run

azure-pipelines · 2019-10-31T11:00:10Z

Azure Pipelines successfully started running 1 pipeline(s).

densmirn · 2019-10-31T15:54:34Z

@shssf I moved TypeChecker from this PR to another one. So could you please re-review this PR?

shssf

It looks much better now. I hope it needs only one more round before this PR can be merged.

shssf · 2019-10-31T17:29:45Z

hpat/datatypes/hpat_pandas_series_functions.py

+
+        # data: [0, 1, -1, 1, 0] -> [1, 1, 0, 0, -1]
+        # index: [0, 1,  2, 3, 4] -> [1, 3, 0, 4,  2] (not [3, 1, 4, 0, 2])
+        indices = (-self._data - 1).argsort(kind='mergesort')[:max(n, 0)]


Quite strange algorithm.

(-self._data - 1): subtracted 1 to ensure reverse ordering at boundaries, e.g. to turn min into max integer. self._data.argsort(kind='mergesort')[::-1] is invalid in case of duplicates in data.

Do you think it will work as you expect if self._data[i] == type_max?

Yes, I checked it. Moreover, a similar approach is used in Pandas.
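The ordering argument in this thread can be reproduced with NumPy alone. The sketch below assumes integer data, as in the diff comment; the function name is hypothetical. It shows that a stable sort of `-data - 1` keeps duplicates in their original order, that reversing an ascending stable sort does not, and that the int64 boundaries are mapped onto each other (array arithmetic wraps around).

```python
import numpy as np

def nlargest_positions(data, n):
    """Positions of the n largest values, ties kept in original order."""
    return (-data - 1).argsort(kind='mergesort')[:max(n, 0)]

data = np.array([0, 1, -1, 1, 0], dtype=np.int64)

stable_desc = nlargest_positions(data, 5)
reversed_asc = data.argsort(kind='mergesort')[::-1]

# Boundary values: -min wraps back to min, and min - 1 wraps to max.
info = np.iinfo(np.int64)
edge = np.array([info.min, info.max, 0], dtype=np.int64)
edge_desc = nlargest_positions(edge, 3)

print(stable_desc, reversed_asc, edge_desc)
```

For two's-complement integers `-x - 1` is the same as the bitwise complement `~x`, which is one way to see why the boundaries swap cleanly.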

shssf · 2019-10-31T17:32:21Z

hpat/hiframes/pd_series_ext.py

+if not hpat.config.config_pipeline_hpat_default:
+    for attr in _non_hpat_pipeline_attrs:
+        if attr in SeriesAttribute.__dict__:
+            delattr(SeriesAttribute, attr)


I still don't understand why we need to remove attributes after adding them. I still think it would be better to merge _non_hpat_pipeline_attrs with _not_series_array_attrs and remove this piece of code.

We cannot merge them. The key idea of the code is to remove the predefined series attributes from SeriesAttribute, not the ones added to SeriesAttribute by the previous loop. E.g. if we add 'resolve_nsmallest' to _not_series_array_attrs and remove the "deleter", then SeriesAttribute will still contain the attribute 'resolve_nsmallest' and "new style" won't be used.
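The mechanism described in this reply can be illustrated with a toy class. Everything below is a hypothetical stand-in (class, flag, and attribute list), not the HPAT code; it only shows how attributes defined directly on a class are removed with delattr when a configuration flag is off.

```python
class SeriesAttributeSketch:
    """Toy stand-in for a class with predefined resolve_* methods."""

    def resolve_nsmallest(self):
        return 'old style'

    def resolve_nlargest(self):
        return 'old style'

    def resolve_sum(self):
        return 'old style'

# Hypothetical stand-ins for _non_hpat_pipeline_attrs and the config flag.
non_hpat_pipeline_attrs = ['resolve_nsmallest', 'resolve_nlargest']
use_old_pipeline = False

if not use_old_pipeline:
    for attr in non_hpat_pipeline_attrs:
        # Only delete attributes defined directly on the class itself.
        if attr in SeriesAttributeSketch.__dict__:
            delattr(SeriesAttributeSketch, attr)

print(hasattr(SeriesAttributeSketch, 'resolve_nsmallest'))
```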

shssf · 2019-10-31T17:33:59Z

hpat/tests/test_series.py

+    rands_chars = np.array(accepted_chars, dtype=(np.str_, 1))
+
+    np.random.seed(100)
+    return np.random.choice(rands_chars, size=nchars * size).view((np.str_, nchars))


I think it is a bad idea to use random input data. If a test fails on some input data, it would be more difficult to reproduce the issue.

BTW, I set the seed here with np.random.seed(100), so the result of the function should always be identical. Anyway, I will think about how to avoid random generation.
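The reproducibility claim in this reply can be checked with a sketch built around the lines shown in the diff. The function name and the contents of accepted_chars are assumptions, since the diff does not show them:

```python
import string
import numpy as np

def rands_array(nchars, size):
    """Sketch of a seeded random string generator based on the diff lines."""
    # Assumption: the character set is letters plus digits.
    accepted_chars = list(string.ascii_letters + string.digits)
    rands_chars = np.array(accepted_chars, dtype=(np.str_, 1))

    np.random.seed(100)  # fixed seed, so every call draws the same characters
    return np.random.choice(rands_chars, size=nchars * size).view((np.str_, nchars))

first = rands_array(4, 3)
second = rands_array(4, 3)
print(np.array_equal(first, second))
```

Because the seed is reset inside the function, repeated calls return identical arrays, though the concrete strings still depend on the random generator's implementation.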

shssf · 2019-10-31T17:35:42Z

hpat/tests/test_series.py


+    @unittest.skipIf(hpat.config.config_pipeline_hpat_default,
+                     'Series.nlargest() types validation unsupported')


Suggested change

'Series.nlargest() types validation unsupported')

'Series.nlargest() do not throw an exception')

Let me keep Pythonic style and replace throw with raise:)

shssf · 2019-10-31T17:37:26Z

hpat/tests/test_series.py

+
+        # TODO: check data == [] when index fixed
+        for data in test_global_input_data_numeric:
+            data *= 3


Why is this?

The data is multiplied to test with duplicates.

In this case it might be clearer to use

data_duplicated = data * 3
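Besides readability, there is a behavioural difference between the two spellings when the inputs are plain Python lists (an assumption about test_global_input_data_numeric, which the diff only hints at): `data *= 3` grows the shared list in place, while `data * 3` builds a new one. A small sketch with made-up data:

```python
# Illustrative stand-in for a module-level list of test inputs.
shared_inputs = [[1, 2], [3, 4]]

# Building a new list leaves the shared inputs untouched.
for data in shared_inputs:
    data_duplicated = data * 3
after_copy = [list(d) for d in shared_inputs]

# In-place multiplication mutates the shared inputs for every later user.
for data in shared_inputs:
    data *= 3
after_inplace = [list(d) for d in shared_inputs]

print(after_copy, after_inplace)
```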

shssf · 2019-10-31T17:39:31Z

hpat/tests/test_series.py

+                for n in range(-1, 10):
+                    ref_result = test_impl(series, n)
+                    jit_result = hpat_func(series, n)
+                    pd.testing.assert_series_equal(ref_result, jit_result)


I see this test also doesn't work with an index of strings. Also, as far as I remember, old-style works with an index of strings.
Anyway, you could use if/else as you did in other tests to make different checks for different styles. Please note, I proposed ref_result.values, which returns an array (not a series).

hpat/tests/test_series.py

pep8speaks · 2019-11-01T07:48:46Z

Hello @densmirn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-04 11:45:43 UTC

shssf

Tests need to pass.

densmirn requested review from shssf, akharche and kozlov-alexey October 18, 2019 15:39

densmirn changed the title ~~Implement Series.nsmallest() in new style~~ Implement Series.nsmallest()/Seires.nlargest() in new style Oct 18, 2019

densmirn changed the title ~~Implement Series.nsmallest()/Seires.nlargest() in new style~~ Implement Series.nsmallest()/Series.nlargest() in new style Oct 18, 2019

PokhodenkoSA reviewed Oct 18, 2019

shssf suggested changes Oct 19, 2019

densmirn force-pushed the feature/series_nsmallest branch from 6871e0a to fa19c2a Compare October 21, 2019 07:09

densmirn requested review from shssf and PokhodenkoSA October 21, 2019 07:10

kozlov-alexey reviewed Oct 22, 2019

densmirn force-pushed the feature/series_nsmallest branch from fa19c2a to 4154640 Compare October 23, 2019 13:41

densmirn requested review from kozlov-alexey and 1e-to October 23, 2019 13:45

densmirn force-pushed the feature/series_nsmallest branch from 4154640 to 9fb7acd Compare October 23, 2019 14:18

densmirn requested a review from Hardcode84 October 25, 2019 06:55

densmirn added the Ready for Review label Oct 25, 2019

kozlov-alexey mentioned this pull request Oct 25, 2019

Refactor Series.fillna() in a new style and add more tests #251

Merged

kozlov-alexey reviewed Oct 25, 2019

shssf added Waiting on author and removed Ready for Review labels Oct 27, 2019

densmirn force-pushed the feature/series_nsmallest branch from 9fb7acd to caf6e2c Compare October 28, 2019 13:03

densmirn added Ready for Review and removed Waiting on author labels Oct 28, 2019

densmirn requested a review from kozlov-alexey October 30, 2019 13:01

densmirn added Ready for Review and removed Waiting on author labels Oct 30, 2019

kozlov-alexey reviewed Oct 30, 2019

kozlov-alexey approved these changes Oct 30, 2019

densmirn force-pushed the feature/series_nsmallest branch from 07f7468 to e7da101 Compare October 31, 2019 12:16

shssf suggested changes Oct 31, 2019

shssf added Waiting on author and removed Ready for Review labels Oct 31, 2019

densmirn force-pushed the feature/series_nsmallest branch from 60cdd9a to 3038203 Compare November 1, 2019 07:48

densmirn added Ready for Review and removed Waiting on author labels Nov 1, 2019

densmirn requested a review from shssf November 1, 2019 07:51

shssf suggested changes Nov 3, 2019

shssf added Waiting on author and removed Ready for Review labels Nov 3, 2019

densmirn added 3 commits November 4, 2019 14:26

Implement Series.nsmallest() in new style

d9f587e

Replace rand chararray generator with strlist one

84355ef

Minor changes in tests for nsmallest/nlargest

0e0cf7e

densmirn force-pushed the feature/series_nsmallest branch from 3038203 to 0e0cf7e Compare November 4, 2019 11:45

densmirn added Ready for Review and removed Waiting on author labels Nov 4, 2019

densmirn requested a review from shssf November 4, 2019 12:42

shssf approved these changes Nov 4, 2019

shssf merged commit 6f36f5b into IntelPython:master Nov 4, 2019
