Skip to content
This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

Series.argsort() / Series.sort_values() #248

Merged
merged 18 commits into from
Nov 5, 2019

Conversation

1e-to
Copy link
Contributor

@1e-to 1e-to commented Oct 22, 2019

No description provided.

@1e-to 1e-to changed the title Add functional and 3 tests Series.argsort() Add functional and 3 tests Oct 22, 2019
input arg
axis: :obj:`int`
Has no effect but is accepted for compatibility with numpy.
kind: {‘mergesort’, ‘quicksort’, ‘heapsort’}, default ‘quicksort’
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether it passed style check? I guess it was copied from somewhere with specific quotes formatting


return hpat_pandas_series_argsort_impl

def hpat_pandas_series_argsort_impl(self, axis=0, kind='quicksort', order=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is almost identical. Could we combine this solutions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we dont, because in cause with index we need to return (data, index)

@1e-to 1e-to changed the title Series.argsort() Add functional and 3 tests Series.argsort() / Series.sort_values() Oct 25, 2019
Parameters
-----------
self: :class:`pandas.Series`
input arg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input arg -> input series

Has no effect but is accepted for compatibility with numpy.
kind: {‘mergesort’, ‘quicksort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See np.sort for more information. ‘mergesort’ is the only stable algorithm
order: None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add type of the parameter in docstring?

raise TypingError('{} Currently function supports only numeric values. Given data type: {}'.format(_func_name,
self.data.dtype))

if not (isinstance(axis, types.Omitted) or isinstance(axis, types.Integer) or axis == 0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually axis can be represented as string, not only integer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pandas' docs axis can be only integer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

axis : {0 or ‘index’}, default 0. The value ‘index’ is accepted for compatibility
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_values.html


hpat_func = hpat.jit(test_impl)

data_test = [[6, 6, 2, 1, 3, 3, 2, 1, 2],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to use globally defined data as input data in tests: f2435b0#diff-deca39d332649cea819383154a5d2cb3R39-R62

*unsupported*
kind: {'mergesort', 'quicksort', 'heapsort'}, default 'quicksort'
Choice of sorting algorithm. See np.sort for more information. 'mergesort' is the only stable algorithm
*unsupported, uses python func - sorted()*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move *unsupported after description of the parameter.

Comment on lines 2275 to 2280
Test: python -m hpat.runtests hpat.tests.test_series.TestSeries.test_series_sort_values1
Test: python -m hpat.runtests hpat.tests.test_series.TestSeries.test_series_sort_values2
Test: python -m hpat.runtests hpat.tests.test_series.TestSeries.test_series_sort_values_index1
Test: python -m hpat.runtests hpat.tests.test_series.TestSeries.test_series_sort_values_noidx
Test: python -m hpat.runtests hpat.tests.test_series.TestSeries.test_series_sort_values_idx
Test: python -m hpat.runtests hpat.tests.test_series.TestSeries.test_series_sort_values_parallel1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use pattern to represent many tests via one line in docstring:
python -m hpat.runtests -k hpat.tests.test_series.TestSeries.test_series_sort_values*

Comment on lines 2255 to 2262
data_test = [[6, 6, 2, 1, 3, 3, 2, 1, 2],
[1.1, 0.3, 2.1, 1, 3, 0.3, 2.1, 1.1, 2.2],
[6, 6.1, 2.2, 1, 3, 0, 2.2, 1, 2],
[6, 6, 2, 1, 3, np.nan, np.nan, np.nan, np.nan],
[3., 5.3, np.nan, np.nan, 33.2, 56.3, 4.4, 3.7, 8.9],
['a', 's', 'dd', 'm', 'll', '345', 'xrt', 'kd', 'qq'],
['dh', 'a', '', 'cv', 'b', '', 'b', 'b', 'p']
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use globally defined data as input data in tests: f2435b0#diff-deca39d332649cea819383154a5d2cb3R39-R62

index = numpy.arange(len(self._data))
my_index = numpy.arange(len(self._data))
used_index = numpy.full((len(self._data)), -1)
indices = numpy.arange(len(self._data))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like indices is not required and index_result is enough. Correct me if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_result has type int, and our indexes can be other type, we need to create array of type equal self.index (in my case indices)

index = self._index
my_index = numpy.arange(len(self._data))
used_index = numpy.full((len(self._data)), -1)
indices = self._index.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same, why do you need indices? Isn't it enough to have index_result?

@@ -65,6 +65,12 @@
'大处着眼,小处着手。',
]

test_global_input_data_unicode_kind1 = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As known the list exists in master branch. Please do rebase.

return np.random.choice(rands_chars, size=nchars * size).view((np.str_, nchars))


def gen_frand_array(size, min=-100, max=100):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use seed to be able to reproduce the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bad idea to generate input randomly

axis: :obj:`int`
Has no effect but is accepted for compatibility with numpy.
*unsupported*
kind: {'mergesort', 'quicksort', 'heapsort'}, default: 'quicksort'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe type of the parameter, not only values.

Comment on lines 85 to 86
float_list = (max - min) * np.random.sample(size) + min
return float_list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return (max - min) * np.random.sample(size) + min?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@densmirn I have, at least, one explanation - it is easier to debug return value if you have it in separate value. You an not see result of equations if it written in return operator.
So, I would say this is a good practice to have float_list and use it in return later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you are always able to debug the result outside of the function.

Copy link
Contributor

@shssf shssf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to address diffrent perspective in this PR
If you have to use same algorithm (with minor differences) for diffrent branches - it is good idea to create some _methodname_algo function and use it here to avoid massive code duplication.


.. only:: developer

Test: python -m -k hpat.runtests hpat.tests.test_series.TestSeries.test_series_argsort*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use it:

(base) $ python -m -k hpat.runtests hpat.tests.test_series.TestSeries.test_series_argsort*
/bin/python: No module named -k
(base) $ python -m hpat.runtests hpat.tests.test_series.TestSeries.test_series_argsort*
E
======================================================================
ERROR: test_series_argsort* (unittest.loader._FailedTest)
----------------------------------------------------------------------
AttributeError: type object 'TestSeries' has no attribute 'test_series_argsort*'

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to list all tests for this functionality here. one or two good test are enough by your choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mistaken, the correct call tests for the mask: python -m hpat.runtests -k hpat.tests.test_series.TestSeries.test_series_argsort*

raise TypingError('{} The object must be a pandas.series. Given: {}'.format(_func_name, self))

if not isinstance(self.data.dtype, types.Number):
raise TypingError('{} Currently function supports only numeric values. Given data type: {}'.format(_func_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise TypingError('{} Currently function supports only numeric values. Given data type: {}'.format(_func_name,
raise TypingError('{} Non-numeric type unsupported. Given: {}'.format(_func_name,

Is it shorter, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 2588 to 2606
na = 0
for i in series_data.isna():
if i:
na += 1
id = 0
i = 0
list_no_nan = numpy.empty(len(self._data) - na)
for bool_value in series_data.isna():
if not bool_value:
list_no_nan[id] = self._data[i]
id += 1
i += 1
sort_no_nan = numpy.argsort(list_no_nan)
ne_na = sort[:len(sort) - na]
num = 0
result = numpy.full((len(self._data)), -1)
for i in numpy.sort(ne_na):
result[i] = sort_no_nan[num]
num += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is ambiguous algorithm from my perspective. I think it might be easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 2664 to 2686
for i in range(len(result_index)):
find = 0
for search in cycle:
check = 0
for j in used_index:
if my_index[search] == j:
check = 1
if (self._data[search] == result[i]) and check == 0 and find == 0:
result_index[i] = index[search]
used_index[i] = my_index[search]
find = 1
na = 0
for i in self.isna():
if i:
na += 1
num = 0
for i in self.isna():
j = len(result_index) - na
if i and used_index[j] == -1:
result_index[j] = index[num]
used_index[j] = my_index[num]
na -= 1
num += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Numpy can be used more intensively here to reduce complexity of the code.

@overload_method(SeriesType, 'dropna')
def hpat_pandas_series_dropna(self, axis=0, inplace=False):
"""
Pandas Series method :meth:`pandas.Series.dropna` implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the line should be here to form documentation properly

@@ -2556,3 +2814,4 @@ def hpat_pandas_series_dropna_impl(self, axis=0, inplace=False):
return pandas.Series(data, index, self._name)

return hpat_pandas_series_dropna_impl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line should not be here :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return np.random.choice(rands_chars, size=nchars * size).view((np.str_, nchars))


def gen_frand_array(size, min=-100, max=100):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bad idea to generate input randomly

raise TypingError('{} Unsupported parameters. Given axis: {}'.format(_func_name, axis))

if not isinstance(self.index, types.NoneType):
def hpat_pandas_series_argsort_impl(self, axis=0, kind='quicksort', order=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these branches looks big. I think this is a good practice to insert a comments inside such branch with short explanations and test name for this branch.
First question will be asked here is "what differences between these branches?"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, please don't use same name of functions. it looks like you use hpat_pandas_series_argsort_impl for any branch

@pep8speaks
Copy link

pep8speaks commented Nov 1, 2019

Hello @1e-to! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 2789:49: E128 continuation line under-indented for visual indent
Line 2826:49: E128 continuation line under-indented for visual indent
Line 2845:49: E128 continuation line under-indented for visual indent
Line 2882:49: E128 continuation line under-indented for visual indent
Line 3061:50: W292 no newline at end of file

Line 76:1: E302 expected 2 blank lines, found 1
Line 367:5: E303 too many blank lines (2)

Comment last updated at 2019-11-05 20:09:28 UTC

@shssf shssf merged commit a62195d into IntelPython:master Nov 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants