Add support for Series.operator.add in a new-style #305

kozlov-alexey · 2019-11-19T10:06:37Z

No description provided.

pep8speaks · 2019-11-19T10:06:45Z

Hello @kozlov-alexey! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-09 22:19:47 UTC

densmirn · 2019-11-19T12:20:28Z

sdc/datatypes/common_functions.py

+@numba.njit
+def _hpat_ensure_array_capacity(new_size, arr):
+    '''Function creating a copy of numpy array with a size more than specified'''
+    # TODO: replace this function with np.resize when supported by Numba


np.resize resizes array in other way than this algorithm. The algorithm can generates array of size more than new_size and doesn't repeat values of arr in the resulting array. Looks like there is mismatch somewhere :)

Removed the comment at all)

densmirn · 2019-11-19T12:27:06Z

sdc/datatypes/common_functions.py

+
+@numba.njit
+def _hpat_ensure_array_capacity(new_size, arr):
+    '''Function creating a copy of numpy array with a size more than specified'''


Could you apply below recommendation everywhere in this PR?
pep8 recommendation: For triple-quoted strings, always use double quote characters to be consistent with the docstring convention in PEP 257.

densmirn · 2019-11-19T12:29:44Z

sdc/datatypes/hpat_pandas_series_functions.py

+    _func_name = 'Operator add().'
+
+    if not isinstance(self, SeriesType):
+        raise TypingError('{} The object must be a pandas.series. Given: {}'.format(_func_name, self))


Could you use TypeChecker where it's possible?

kozlov-alexey · 2019-11-27T09:19:24Z

@AlexanderKalistratov @Hardcode84 Can you please review too?

AlexanderKalistratov

I think we need to think how to support not just operators like '+', '-' and so, but also corresponding methods like 'add', 'sub' and so on.

Also it is better to have common way to implement all these operations/operators without copy and past

AlexanderKalistratov · 2019-11-27T16:17:39Z

sdc/datatypes/hpat_pandas_series_functions.py

+
+            return joined, lidx, ridx
+
+        return hpat_join_series_indexes_impl


I think it would be better to add explicitly return None in the function end (if niether if nor elif is True)

Yeap, done.

AlexanderKalistratov · 2019-11-27T16:51:56Z

sdc/datatypes/hpat_pandas_series_functions.py

+        np_dtypes = [numpy_support.as_dtype(left.dtype), numpy_support.as_dtype(right.dtype)]
+        np_common_dtype = numpy.find_common_type([], np_dtypes)
+        numba_common_dtype = numpy_support.from_dtype(np_common_dtype)
+        if (isinstance(left.dtype, types.Number) and isinstance(right.dtype, types.Number)):


check numba_common_dtype once instead of checking two types? And add return None (or raise exception?) in case condition doesn't met?

AlexanderKalistratov · 2019-11-28T13:53:28Z

sdc/datatypes/hpat_pandas_series_functions.py

+    series_indexes_alignable = False
+    if isinstance(other, SeriesType):
+        if (other.index == string_array_type and self.index == string_array_type):
+            series_indexes_alignable = True


What would happen if one index is string and another one if number?

Well, pandas will use dtype that can hold objects of diff types, i.e. resulting series will have index with dtype=object and all elements of resulting series will be NaN, since indexes won't align.

AlexanderKalistratov · 2019-11-28T13:55:09Z

sdc/datatypes/hpat_pandas_series_functions.py

+            series_indexes_alignable = True
+
+        if ((isinstance(self.index, types.NoneType) or
+             isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number))


Could you please add extra braces around isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number)?

AlexanderKalistratov · 2019-11-28T13:57:25Z

sdc/datatypes/hpat_pandas_series_functions.py

+        if ((isinstance(self.index, types.NoneType) or
+             isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number))
+            and (isinstance(other.index, types.NoneType) or
+                 isinstance(other.index, types.Array) and isinstance(other.index.dtype, types.Number))):


Probably it is better to add small utility function, like check_index_is_numeric() in order not to duplicate code for self.index and other.index?

Not insisting

AlexanderKalistratov · 2019-11-28T14:03:32Z

sdc/datatypes/hpat_pandas_series_functions.py

+        return hpat_pandas_series_add_scalar_impl
+
+    elif (isinstance(other, SeriesType)):
+        is_numeric_index = isinstance(self.index, (types.Array, types.NoneType))


Shouldn't it be like: (isinstance(self.index, types.NoneType) or isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number). The same as check above? Or Check above should be the same as this one?

Given that the above check passed, we can only check first series here, that's why it worked. But I agree that it's better to reuse the check from above, so I'll change to using 'none_or_numeric_indexes' variable in the next commit.

AlexanderKalistratov · 2019-11-28T14:07:39Z

sdc/datatypes/hpat_pandas_series_functions.py

+        is_numeric_index = isinstance(self.index, (types.Array, types.NoneType))
+
+        if is_numeric_index:
+            ty_left_index_dtype = types.int64 if isinstance(self.index, types.NoneType) else self.index.dtype


Looks like you can move this block to line 3958 (after line if is_numeric_index == True: # noqa

Alright, most likely you can't. Since line 3958 inside jitted region

Well, yes, but this block can be moved to the else branch below, where it's probably a better place for it. Will do in the next commit.

AlexanderKalistratov · 2019-11-28T14:16:37Z

sdc/datatypes/hpat_pandas_series_functions.py

+                    max_data_size = max(len(self._data), len(other._data))
+                    new_data = numpy.empty(max_data_size, dtype=numpy.float64)
+                    new_data[:min_data_size] = self._data[:min_data_size] + other._data[:min_data_size]
+                    new_data[min_data_size:] = numpy.repeat(numpy.nan, max_data_size - min_data_size)


This behavior could be changed by parameters passed to add method of series.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.add.html#pandas.Series.add

This PR is about '+' operator, but what about 'add' method?

@AlexanderKalistratov Yes, I understand, I actually planned to have two different templates: one for binary operators and the other for add/sub/div/... methods, which can then be used with auto-generation script, because latter need special handling for method args and so on. But yes, I will address that in subsequent PRs.

AlexanderKalistratov · 2019-11-28T14:22:07Z

sdc/datatypes/hpat_pandas_series_functions.py

+                # check if indexes are equal and series don't have to be aligned
+                if is_numeric_index == True:  # noqa
+                    if (numpy.array_equal(left_index, right_index)):
+                        return pandas.Series(numpy.asarray(self._data + other._data, numpy.float64),


Are you sure about numpy.float64? shouldn't is be something like common_dtype?

Yes, thank you, that's incorrect of course.

@kozlov-alexey I'm already not that sure it is incorrect. In case we are adding nan dtype should become numpy.float64 anyway

Right, just verified that the result of an operation in Pandas has unstable dtype: if indexes match it will be common numpy dtype of two series, but if one series has indexes absent in other, then np.nan will appear in data and resulting dtype will be float. So we have to use types.float64 due to type stability requirements.

AlexanderKalistratov · 2019-11-28T14:22:31Z

sdc/datatypes/hpat_pandas_series_functions.py

+                            is_index_equal = False
+
+                    if is_index_equal:
+                        return pandas.Series(numpy.asarray(self._data + other._data, numpy.float64),


Are you sure about numpy.float64? shouldn't is be something like common_dtype?

PokhodenkoSA · 2019-12-05T17:50:34Z

sdc/tests/test_series.py

    "test_series_op4",
    "test_series_op6",
    "test_series_op7",
+    "test_series_operator_add_numeric_same_index_default",


This code is not in master after #340.

Yes, thank you, I'll use new skip decorators.

AlexanderKalistratov

LGTM

AlexanderKalistratov · 2019-12-07T20:50:26Z

sdc/datatypes/hpat_pandas_series_functions.py

+                    new_data[:min_data_size] = self._data[:min_data_size] + other._data[:min_data_size]
+                    new_data[min_data_size:] = numpy.repeat(numpy.nan, max_data_size - min_data_size)
+
+                    return pandas.Series(new_data, self._index)


@kozlov-alexey In some cases resulting series should have the same name as original one. This cases, seems to be are:

series + scalar (i.e. A + 1)

series + series, but both series has the same name

This is very important in order to implement DataFrame filters. I'm ok to do it in separate PR, but this must be done. Also tests for named series should be added

Well, after some more investigations, it's not important for filters, but still we should be consistent with pandas

It looks like, at the moment we can't support second case (when two series have the same name) since it is runtime condition, but whether series has name or not we should know at compile time.

@AlexanderKalistratov Thank you! I think we should address this in a separate PR. BTW, I just checked if there's any difference between '1 + A' and 'A + 1' and there's no, in both cases resulting series takes name from A. I'm going to dig in pandas sources to clarify this behavior.

kozlov-alexey requested review from AlexanderKalistratov, Hardcode84, densmirn and shssf November 19, 2019 10:06

densmirn reviewed Nov 19, 2019

View reviewed changes

kozlov-alexey added [WIP] Work in progress Waiting other PR This PR depends on functionality to be merged in other PR labels Nov 21, 2019

kozlov-alexey force-pushed the feature/refactor_series_operator_add branch 2 times, most recently from 62793ef to 943c557 Compare November 26, 2019 17:29

kozlov-alexey added Ready for Review and removed Waiting other PR This PR depends on functionality to be merged in other PR [WIP] Work in progress labels Nov 27, 2019

kozlov-alexey requested a review from densmirn November 27, 2019 11:59

densmirn approved these changes Nov 27, 2019

View reviewed changes

AlexanderKalistratov reviewed Nov 28, 2019

View reviewed changes

kozlov-alexey added [WIP] Work in progress and removed Ready for Review labels Dec 2, 2019

PokhodenkoSA reviewed Dec 5, 2019

View reviewed changes

kozlov-alexey force-pushed the feature/refactor_series_operator_add branch from 943c557 to 452e8ba Compare December 6, 2019 17:49

AlexanderKalistratov approved these changes Dec 7, 2019

View reviewed changes

AlexanderKalistratov reviewed Dec 7, 2019

View reviewed changes

kozlov-alexey force-pushed the feature/refactor_series_operator_add branch 2 times, most recently from cafa353 to ffbb414 Compare December 9, 2019 21:36

kozlov-alexey added Waiting on CI and removed [WIP] Work in progress labels Dec 9, 2019

kozlov-alexey added 3 commits December 10, 2019 01:12

Add support for Series.operator.add in a new-style

0db8ef8

Applying review comments and addding tests

80de726

More comments and refactoring from review

738d74a

Bugfix in indexes join and minor changes

7c630fa

kozlov-alexey force-pushed the feature/refactor_series_operator_add branch from ffbb414 to 7c630fa Compare December 9, 2019 22:19

AlexanderKalistratov merged commit 81208c7 into IntelPython:master Dec 10, 2019


		return joined, lidx, ridx

		return hpat_join_series_indexes_impl

Add support for Series.operator.add in a new-style #305

Add support for Series.operator.add in a new-style #305

Uh oh!

Conversation

kozlov-alexey commented Nov 19, 2019

Uh oh!

pep8speaks commented Nov 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2019-12-09 22:19:47 UTC

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kozlov-alexey commented Nov 27, 2019

Uh oh!

AlexanderKalistratov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlexanderKalistratov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pep8speaks commented Nov 19, 2019 •

edited

Loading