Skip to content
This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Conversation

kozlov-alexey
Copy link
Contributor

No description provided.

@pep8speaks
Copy link

pep8speaks commented Nov 19, 2019

Hello @kozlov-alexey! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-09 22:19:47 UTC

@numba.njit
def _hpat_ensure_array_capacity(new_size, arr):
'''Function creating a copy of numpy array with a size more than specified'''
# TODO: replace this function with np.resize when supported by Numba
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.resize resizes array in other way than this algorithm. The algorithm can generates array of size more than new_size and doesn't repeat values of arr in the resulting array. Looks like there is mismatch somewhere :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment at all)


@numba.njit
def _hpat_ensure_array_capacity(new_size, arr):
'''Function creating a copy of numpy array with a size more than specified'''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you apply below recommendation everywhere in this PR?
pep8 recommendation: For triple-quoted strings, always use double quote characters to be consistent with the docstring convention in PEP 257.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied

_func_name = 'Operator add().'

if not isinstance(self, SeriesType):
raise TypingError('{} The object must be a pandas.series. Given: {}'.format(_func_name, self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use TypeChecker where it's possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@kozlov-alexey kozlov-alexey added [WIP] Work in progress Waiting other PR This PR depends on functionality to be merged in other PR labels Nov 21, 2019
@kozlov-alexey kozlov-alexey force-pushed the feature/refactor_series_operator_add branch 2 times, most recently from 62793ef to 943c557 Compare November 26, 2019 17:29
@kozlov-alexey
Copy link
Contributor Author

@AlexanderKalistratov @Hardcode84 Can you please review too?

@kozlov-alexey kozlov-alexey added Ready for Review and removed Waiting other PR This PR depends on functionality to be merged in other PR [WIP] Work in progress labels Nov 27, 2019
Copy link
Collaborator

@AlexanderKalistratov AlexanderKalistratov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to think how to support not just operators like '+', '-' and so, but also corresponding methods like 'add', 'sub' and so on.

Also it is better to have common way to implement all these operations/operators without copy and past


return joined, lidx, ridx

return hpat_join_series_indexes_impl
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to add explicitly return None in the function end (if niether if nor elif is True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, done.

np_dtypes = [numpy_support.as_dtype(left.dtype), numpy_support.as_dtype(right.dtype)]
np_common_dtype = numpy.find_common_type([], np_dtypes)
numba_common_dtype = numpy_support.from_dtype(np_common_dtype)
if (isinstance(left.dtype, types.Number) and isinstance(right.dtype, types.Number)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check numba_common_dtype once instead of checking two types? And add return None (or raise exception?) in case condition doesn't met?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

series_indexes_alignable = False
if isinstance(other, SeriesType):
if (other.index == string_array_type and self.index == string_array_type):
series_indexes_alignable = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if one index is string and another one if number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, pandas will use dtype that can hold objects of diff types, i.e. resulting series will have index with dtype=object and all elements of resulting series will be NaN, since indexes won't align.

series_indexes_alignable = True

if ((isinstance(self.index, types.NoneType) or
isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add extra braces around isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if ((isinstance(self.index, types.NoneType) or
isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number))
and (isinstance(other.index, types.NoneType) or
isinstance(other.index, types.Array) and isinstance(other.index.dtype, types.Number))):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it is better to add small utility function, like check_index_is_numeric() in order not to duplicate code for self.index and other.index?

Not insisting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

return hpat_pandas_series_add_scalar_impl

elif (isinstance(other, SeriesType)):
is_numeric_index = isinstance(self.index, (types.Array, types.NoneType))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be like: (isinstance(self.index, types.NoneType) or isinstance(self.index, types.Array) and isinstance(self.index.dtype, types.Number). The same as check above? Or Check above should be the same as this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the above check passed, we can only check first series here, that's why it worked. But I agree that it's better to reuse the check from above, so I'll change to using 'none_or_numeric_indexes' variable in the next commit.

is_numeric_index = isinstance(self.index, (types.Array, types.NoneType))

if is_numeric_index:
ty_left_index_dtype = types.int64 if isinstance(self.index, types.NoneType) else self.index.dtype
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you can move this block to line 3958 (after line if is_numeric_index == True: # noqa

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, most likely you can't. Since line 3958 inside jitted region

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yes, but this block can be moved to the else branch below, where it's probably a better place for it. Will do in the next commit.

max_data_size = max(len(self._data), len(other._data))
new_data = numpy.empty(max_data_size, dtype=numpy.float64)
new_data[:min_data_size] = self._data[:min_data_size] + other._data[:min_data_size]
new_data[min_data_size:] = numpy.repeat(numpy.nan, max_data_size - min_data_size)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior could be changed by parameters passed to add method of series.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.add.html#pandas.Series.add

This PR is about '+' operator, but what about 'add' method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexanderKalistratov Yes, I understand, I actually planned to have two different templates: one for binary operators and the other for add/sub/div/... methods, which can then be used with auto-generation script, because latter need special handling for method args and so on. But yes, I will address that in subsequent PRs.

# check if indexes are equal and series don't have to be aligned
if is_numeric_index == True: # noqa
if (numpy.array_equal(left_index, right_index)):
return pandas.Series(numpy.asarray(self._data + other._data, numpy.float64),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about numpy.float64? shouldn't is be something like common_dtype?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you, that's incorrect of course.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kozlov-alexey I'm already not that sure it is incorrect. In case we are adding nan dtype should become numpy.float64 anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, just verified that the result of an operation in Pandas has unstable dtype: if indexes match it will be common numpy dtype of two series, but if one series has indexes absent in other, then np.nan will appear in data and resulting dtype will be float. So we have to use types.float64 due to type stability requirements.

is_index_equal = False

if is_index_equal:
return pandas.Series(numpy.asarray(self._data + other._data, numpy.float64),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about numpy.float64? shouldn't is be something like common_dtype?

"test_series_op4",
"test_series_op6",
"test_series_op7",
"test_series_operator_add_numeric_same_index_default",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is not in master after #340.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you, I'll use new skip decorators.

@kozlov-alexey kozlov-alexey force-pushed the feature/refactor_series_operator_add branch from 943c557 to 452e8ba Compare December 6, 2019 17:49
Copy link
Collaborator

@AlexanderKalistratov AlexanderKalistratov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

new_data[:min_data_size] = self._data[:min_data_size] + other._data[:min_data_size]
new_data[min_data_size:] = numpy.repeat(numpy.nan, max_data_size - min_data_size)

return pandas.Series(new_data, self._index)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kozlov-alexey In some cases resulting series should have the same name as original one. This cases, seems to be are:

  • series + scalar (i.e. A + 1)
  • series + series, but both series has the same name

This is very important in order to implement DataFrame filters. I'm ok to do it in separate PR, but this must be done. Also tests for named series should be added

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, after some more investigations, it's not important for filters, but still we should be consistent with pandas

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like, at the moment we can't support second case (when two series have the same name) since it is runtime condition, but whether series has name or not we should know at compile time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexanderKalistratov Thank you! I think we should address this in a separate PR. BTW, I just checked if there's any difference between '1 + A' and 'A + 1' and there's no, in both cases resulting series takes name from A. I'm going to dig in pandas sources to clarify this behavior.

@kozlov-alexey kozlov-alexey force-pushed the feature/refactor_series_operator_add branch 2 times, most recently from cafa353 to ffbb414 Compare December 9, 2019 21:36
@kozlov-alexey kozlov-alexey force-pushed the feature/refactor_series_operator_add branch from ffbb414 to 7c630fa Compare December 9, 2019 22:19
@AlexanderKalistratov AlexanderKalistratov merged commit 81208c7 into IntelPython:master Dec 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants