-
Notifications
You must be signed in to change notification settings - Fork 62
Add support for Series.operator.add in a new-style #305
Add support for Series.operator.add in a new-style #305
Conversation
Hello @kozlov-alexey! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-12-09 22:19:47 UTC |
sdc/datatypes/common_functions.py
Outdated
@numba.njit | ||
def _hpat_ensure_array_capacity(new_size, arr): | ||
'''Function creating a copy of numpy array with a size more than specified''' | ||
# TODO: replace this function with np.resize when supported by Numba |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.resize
resizes array in other way than this algorithm. The algorithm can generates array of size more than new_size
and doesn't repeat values of arr
in the resulting array. Looks like there is mismatch somewhere :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the comment at all)
sdc/datatypes/common_functions.py
Outdated
|
||
@numba.njit | ||
def _hpat_ensure_array_capacity(new_size, arr): | ||
'''Function creating a copy of numpy array with a size more than specified''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you apply below recommendation everywhere in this PR?
pep8 recommendation: For triple-quoted strings, always use double quote characters to be consistent with the docstring convention in PEP 257.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied
_func_name = 'Operator add().' | ||
|
||
if not isinstance(self, SeriesType): | ||
raise TypingError('{} The object must be a pandas.series. Given: {}'.format(_func_name, self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use TypeChecker
where it's possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
62793ef
to
943c557
Compare
@AlexanderKalistratov @Hardcode84 Can you please review too? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to think how to support not just operators like '+', '-' and so, but also corresponding methods like 'add', 'sub' and so on.
Also it is better to have common way to implement all these operations/operators without copy and past
|
||
return joined, lidx, ridx | ||
|
||
return hpat_join_series_indexes_impl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to add explicitly return None
in the function end (if niether if
nor elif
is True
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, done.
np_dtypes = [numpy_support.as_dtype(left.dtype), numpy_support.as_dtype(right.dtype)] | ||
np_common_dtype = numpy.find_common_type([], np_dtypes) | ||
numba_common_dtype = numpy_support.from_dtype(np_common_dtype) | ||
if (isinstance(left.dtype, types.Number) and isinstance(right.dtype, types.Number)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check numba_common_dtype
once instead of checking two types? And add return None
(or raise exception?) in case condition doesn't met?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
series_indexes_alignable = False | ||
if isinstance(other, SeriesType): | ||
if (other.index == string_array_type and self.index == string_array_type): | ||
series_indexes_alignable = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if one index is string and another one if number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, pandas will use dtype that can hold objects of diff types, i.e. resulting series will have index with dtype=object and all elements of resulting series will be NaN, since indexes won't align.
series_indexes_alignable = True | ||
|
||
if ((isinstance(self.index, types.NoneType) or | ||
isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add extra braces around isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if ((isinstance(self.index, types.NoneType) or | ||
isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number)) | ||
and (isinstance(other.index, types.NoneType) or | ||
isinstance(other.index, types.Array) and isinstance(other.index.dtype, types.Number))): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it is better to add small utility function, like check_index_is_numeric()
in order not to duplicate code for self.index
and other.index
?
Not insisting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
return hpat_pandas_series_add_scalar_impl | ||
|
||
elif (isinstance(other, SeriesType)): | ||
is_numeric_index = isinstance(self.index, (types.Array, types.NoneType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be like: (isinstance(self.index, types.NoneType) or isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number)
. The same as check above? Or Check above should be the same as this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the above check passed, we can only check first series here, that's why it worked. But I agree that it's better to reuse the check from above, so I'll change to using 'none_or_numeric_indexes' variable in the next commit.
is_numeric_index = isinstance(self.index, (types.Array, types.NoneType)) | ||
|
||
if is_numeric_index: | ||
ty_left_index_dtype = types.int64 if isinstance(self.index, types.NoneType) else self.index.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you can move this block to line 3958 (after line if is_numeric_index == True: # noqa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, most likely you can't. Since line 3958 inside jitted region
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yes, but this block can be moved to the else branch below, where it's probably a better place for it. Will do in the next commit.
max_data_size = max(len(self._data), len(other._data)) | ||
new_data = numpy.empty(max_data_size, dtype=numpy.float64) | ||
new_data[:min_data_size] = self._data[:min_data_size] + other._data[:min_data_size] | ||
new_data[min_data_size:] = numpy.repeat(numpy.nan, max_data_size - min_data_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior could be changed by parameters passed to add
method of series.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.add.html#pandas.Series.add
This PR is about '+' operator, but what about 'add' method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlexanderKalistratov Yes, I understand, I actually planned to have two different templates: one for binary operators and the other for add/sub/div/... methods, which can then be used with auto-generation script, because latter need special handling for method args and so on. But yes, I will address that in subsequent PRs.
# check if indexes are equal and series don't have to be aligned | ||
if is_numeric_index == True: # noqa | ||
if (numpy.array_equal(left_index, right_index)): | ||
return pandas.Series(numpy.asarray(self._data + other._data, numpy.float64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure about numpy.float64
? shouldn't is be something like common_dtype
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you, that's incorrect of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kozlov-alexey I'm already not that sure it is incorrect. In case we are adding nan
dtype should become numpy.float64 anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, just verified that the result of an operation in Pandas has unstable dtype: if indexes match it will be common numpy dtype of two series, but if one series has indexes absent in other, then np.nan will appear in data and resulting dtype will be float. So we have to use types.float64 due to type stability requirements.
is_index_equal = False | ||
|
||
if is_index_equal: | ||
return pandas.Series(numpy.asarray(self._data + other._data, numpy.float64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure about numpy.float64
? shouldn't is be something like common_dtype
?
sdc/tests/test_series.py
Outdated
"test_series_op4", | ||
"test_series_op6", | ||
"test_series_op7", | ||
"test_series_operator_add_numeric_same_index_default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is not in master after #340.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you, I'll use new skip decorators.
943c557
to
452e8ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
new_data[:min_data_size] = self._data[:min_data_size] + other._data[:min_data_size] | ||
new_data[min_data_size:] = numpy.repeat(numpy.nan, max_data_size - min_data_size) | ||
|
||
return pandas.Series(new_data, self._index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kozlov-alexey In some cases resulting series should have the same name as original one. This cases, seems to be are:
- series + scalar (i.e. A + 1)
- series + series, but both series has the same name
This is very important in order to implement DataFrame filters. I'm ok to do it in separate PR, but this must be done. Also tests for named series should be added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, after some more investigations, it's not important for filters, but still we should be consistent with pandas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like, at the moment we can't support second case (when two series have the same name) since it is runtime condition, but whether series has name or not we should know at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlexanderKalistratov Thank you! I think we should address this in a separate PR. BTW, I just checked if there's any difference between '1 + A' and 'A + 1' and there's no, in both cases resulting series takes name from A. I'm going to dig in pandas sources to clarify this behavior.
cafa353
to
ffbb414
Compare
ffbb414
to
7c630fa
Compare
No description provided.