Add functions for dataframe: median, mean, min, max, sum #345

Rubtsowa · 2019-11-28T06:56:41Z

No description provided.

pep8speaks · 2019-11-28T06:56:45Z

Hello @Rubtsowa! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-31 12:51:56 UTC

kozlov-alexey · 2019-11-28T20:18:00Z

sdc/config.py

+
+use_default_dataframe = distutils_util.strtobool(os.getenv('SDC_CONFIG_USE_DEFAULT_DATAFRAME', 'True'))
+'''
+Default value used to select compiler pipeline in a function decorator


Description is not right - should be something like "Config variable used to select DataFrameType model (default is legacy model)"

kozlov-alexey · 2019-11-28T20:20:33Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        return sdc_pandas_dataframe_count_impl
+
+else:
+    def reduce(df, name):


I think function name should be more specific, e.g. "sdc_pandas_dataframe_reduce_columns"

kozlov-alexey · 2019-11-28T20:30:24Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        saved_columns = df.columns
+        n_cols = len(saved_columns)
+        data_args = tuple('data{}'.format(i) for i in range(n_cols))
+        func_text = "def _reduce_impl(df, axis=None, skipna=None, level=None, numeric_only=None):\n"


This line defines the function signature, which is currently hardcoded to the signature of 'median' (some other functions such as max, min and mean share it, but not all of them do). That won't work for functions with a different signature, e.g. Series.sum, which has additional arguments:
Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
You have to pass the additional arguments into this function so that the generated function text matches the signature used in the overload.
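
One way this could look, as a sketch only: the signature is built from a mapping of parameter names to defaults that the caller passes in. The names `make_reduce_func_text`, `median_params` and `sum_params` are illustrative assumptions, and the generated body is a placeholder rather than the PR's per-column code.

```python
def make_reduce_func_text(name, params):
    """Build source text for a per-method implementation.

    `params` maps parameter name -> default value, in signature order.
    This is an assumed interface, not the one used in the PR.
    """
    signature = ', '.join('{}={!r}'.format(k, v) for k, v in params.items())
    lines = [
        'def _reduce_impl(df, {}):'.format(signature),
        '    return df.{}()'.format(name),  # placeholder body for the sketch
    ]
    return '\n'.join(lines)


median_params = {'axis': None, 'skipna': None, 'level': None, 'numeric_only': None}
# sum takes everything median takes, plus min_count
sum_params = dict(median_params, min_count=0)

print(make_reduce_func_text('sum', sum_params))
```

With this shape, each overload supplies its own parameter mapping, so the generated text follows whatever signature that overload declares.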

kozlov-alexey · 2019-11-28T20:34:51Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        if not isinstance(df, DataFrameType):
+            raise TypingError('{} The object must be a pandas.dataframe. Given: {}'.format(name, df))
+
+        if not (isinstance(axis, types.Omitted) or axis is None):


Probably we should use ty_checker here and everywhere else, e.g.

Suggested change

-        if not (isinstance(axis, types.Omitted) or axis is None):
+        if (not isinstance(axis, (int, types.Integer, str, types.UnicodeType, types.StringLiteral, types.Omitted))
+                and axis not in (0, 'index')):
+            ty_checker.raise_exc(axis, 'integer or string', 'axis')

All functions seem to have more or less the same checks. Shouldn't we move the generic checks into a shared function to avoid copy and paste?

kozlov-alexey · 2019-11-28T20:40:49Z

sdc/tests/test_dataframe.py

+        def test_impl(df):
+            return df.min()
+        sdc_func = sdc.jit(test_impl)
+        df = pd.DataFrame({"A": [12, 4, 5, 44, 1],


Since we just use the Series method for each column, I would prefer an aggregated DF with many columns that actually differ from each other, each testing a specific case: ints, floats, datetimes, etc.

@Rubtsowa Why haven't you applied the above comment? You don't need two different tests for min; you can use just one with:

df = pd.DataFrame({
    "A": [12, 4, 5, 44, 1],
    "B": [5.0, np.nan, 9, 2, -1],
    "C": ['a', 'aa', 'd', 'cc', None],
    "D": [True, True, False, True, True]
})

AlexanderKalistratov · 2019-11-28T21:20:14Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        if not isinstance(df, DataFrameType):
+            raise TypingError('{} The object must be a pandas.dataframe. Given: {}'.format(name, df))
+
+        if not (isinstance(axis, types.Omitted) or axis is None):


All functions seem to have more or less the same checks. Shouldn't we move the generic checks into a shared function to avoid copy and paste?
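
A sketch of what such a shared check could look like. To keep it runnable without numba, `Omitted` here is a stand-in class for `types.Omitted`, and `TypeError` replaces the `TypingError` used in the PR; `check_reduce_args` is a hypothetical name.

```python
class Omitted:
    """Stand-in for numba's types.Omitted, so the sketch runs without numba."""

    def __init__(self, value):
        self.value = value


def check_reduce_args(func_name, axis=None, level=None, numeric_only=None):
    """Hypothetical shared validator: reject arguments not left at their default."""
    for arg_name, arg in (('axis', axis), ('level', level), ('numeric_only', numeric_only)):
        if not (isinstance(arg, Omitted) or arg is None):
            raise TypeError(
                '{} Unsupported parameter {}. Given: {}'.format(func_name, arg_name, arg))


# Each overload would then start with a single call instead of repeated checks:
check_reduce_args('Method median().', axis=None, level=Omitted(None))
```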

AlexanderKalistratov · 2019-11-28T21:22:08Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        loc_vars = {}
+
+        print()
+        print(func_text)


I believe it should be removed

densmirn · 2019-11-29T05:00:13Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+            func_text += "  {} = hpat.hiframes.api.init_series(hpat.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))\n".format(
+                d + '_S', i)
+            func_text += "  {} = {}.{}()\n".format(d + '_O', d + '_S', name)
+        func_text += "  data = np.array(({},))\n".format(
+            ", ".join(d + '_O' for d in data_args))
+        func_text += "  index = hpat.str_arr_ext.StringArray(({},))\n".format(
+            ", ".join("'{}'".format(c) for c in saved_columns))
+        func_text += "  return hpat.hiframes.api.init_series(data, index)\n"


You concatenate strings with the + operator, but a more efficient way is to create a list, fill it via .append() (do NOT forget to cut off the trailing \n), and then join the list into a string via '\n'.join(), e.g.:

func_lines = []
func_lines.append('the first line')
func_lines.append('the second line')
func_lines.append('the last line')
func_text = '\n'.join(func_lines)

densmirn · 2019-11-29T05:01:58Z

Hello @Rubtsowa! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file sdc/datatypes/hpat_pandas_dataframe_functions.py:

Line 47:5: E303 too many blank lines (2)
Line 70:121: E501 line too long (122 > 120 characters)
Line 108:121: E501 line too long (132 > 120 characters)
Line 128:5: E303 too many blank lines (2)
Line 176:5: E303 too many blank lines (2)
Line 224:5: E303 too many blank lines (2)
Line 272:5: E303 too many blank lines (2)
Line 320:5: E303 too many blank lines (2)

In the file sdc/hiframes/pd_dataframe_ext.py:

Line 1641:1: E402 module level import not at top of file

Please fix the code style issues.

densmirn · 2019-12-02T10:50:16Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        return sdc_pandas_dataframe_count_impl
+
+else:
+    def sdc_pandas_dataframe_reduce_columns(df, name, param):


Maybe replace param with params.

kozlov-alexey · 2019-12-30T21:44:08Z

sdc/datatypes/hpat_pandas_series_functions.py

@@ -1994,8 +1994,7 @@ def hpat_pandas_series_sum(
    .. only:: developer

        Tests:
-            python -m sdc.runtests sdc.tests.test_series.TestSeries.test_series_sum1
-            # python -m sdc.runtests sdc.tests.test_series.TestSeries.test_series_sum2
+            python -m sdc.runtests -k sdc.tests.test_series.TestSeries.test_series_sum


This needs to refer to all tests for the sum method:
python -m sdc.runtests -k sdc.tests.test_series.TestSeries.test_series_sum*

kozlov-alexey · 2019-12-30T21:52:54Z

sdc/datatypes/hpat_pandas_series_functions.py

+        if skipna is None:
+            skipna = True


This should not compile, since the type of the skipna variable cannot be unified: it would be None and bool at the same time. You can add a branch based on a compile-time value instead, i.e. define a skipna_is_none variable at typing time and refer to it with:

if skipna_is_none == True:  # noqa
    _skipna = True
else:
    _skipna = skipna
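
The pattern can be illustrated in plain Python, as an analogy only. Here `make_sum_impl` is a hypothetical factory playing the role of the overload body, which runs once before the implementation is built; under numba the check would presumably inspect the argument type (for example `types.Omitted` or `types.NoneType`) rather than a runtime value.

```python
def make_sum_impl(skipna):
    """Stand-in for an overload body: runs once, at 'typing' time."""
    # Decided once, before the implementation exists, so it is a constant
    # from the implementation's point of view.
    skipna_is_none = skipna is None

    def sum_impl(values, skipna=None):
        if skipna_is_none == True:  # noqa
            _skipna = True
        else:
            _skipna = skipna
        if _skipna:
            values = [v for v in values if v == v]  # drop NaN (NaN != NaN)
        return sum(values)

    return sum_impl


impl = make_sum_impl(None)
print(impl([1.0, float('nan'), 2.0]))
```

The implementation never reassigns `skipna` itself; it writes to a separate `_skipna` variable, which is what avoids mixing None and bool in one variable.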

kozlov-alexey · 2019-12-30T22:07:24Z

sdc/tests/test_dataframe.py

+                           "F": [np.nan, np.nan, np.inf, np.nan]})
+        pd.testing.assert_series_equal(hpat_func(df), test_impl(df))
+
+    def test_prod(self):


All these tests check default argument values, so I suggest adding a _default suffix to all of them.

AlexanderKalistratov · 2019-12-31T12:24:27Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

 from numba.errors import TypingError
+from sdc.hiframes.pd_dataframe_ext import DataFrameType
+import sdc.datatypes.hpat_pandas_dataframe_types


Suggested change

-import sdc.datatypes.hpat_pandas_dataframe_types

AlexanderKalistratov · 2019-12-31T12:24:39Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

 from numba.errors import TypingError
+from sdc.hiframes.pd_dataframe_ext import DataFrameType


Suggested change

-from sdc.hiframes.pd_dataframe_ext import DataFrameType
+from sdc.hiframes.pd_dataframe_type import DataFrameType

All comments applied

Rubtsowa added 2 commits November 27, 2019 14:13

change

c12c7ba

Impl functions for dataframe:median, mean, min, max, sum

063e5ea

Rubtsowa requested review from kozlov-alexey and densmirn November 28, 2019 06:56

add *arga, **kwars

3d5badd

kozlov-alexey reviewed Nov 28, 2019

AlexanderKalistratov reviewed Nov 28, 2019

densmirn suggested changes Nov 29, 2019

Rubtsowa added 3 commits November 29, 2019 09:50

change config and use TypeChecker

009c9c9

add functions: std, var, prod, count. Add tests for functions

d56d693

change

eeed327

densmirn previously requested changes Dec 2, 2019

densmirn reviewed Dec 2, 2019

AlexanderKalistratov requested review from PokhodenkoSA and Vyacheslav-Smirnov December 2, 2019 14:49

Rubtsowa added 9 commits December 3, 2019 09:05

change name input parameter

57d95fe

refactor

599fb4a

added change for methods median and min

947e2d2

correct input parameters

83d1bf6

change

5db8506

delete method count

454ab71

change

461129a

merge

a23adc0

Merge branch 'master' of https://github.com/IntelPython/hpat into add…

c7ed931

…_dataframe_min

Rubtsowa added 6 commits December 29, 2019 09:12

correction function. problems with parameters for series methods

da8f9e4

correction functions for Series and for DataFrame

5263e1f

correction problem with PEP8

2460aba

delete print

31ab157

merge

2edfaed

skip some tests

74d75a5

Rubtsowa requested a review from densmirn December 30, 2019 16:28

kozlov-alexey added the Coverage decreased label Dec 30, 2019

kozlov-alexey reviewed Dec 30, 2019

Rubtsowa added 4 commits December 31, 2019 10:47

correction tests and Series mehods

293c57e

Merge branch 'master' of https://github.com/IntelPython/hpat into add…

6a6430b

…_dataframe_min

correction doc for df methods

89e51f4

correction test

e6b74f8

AlexanderKalistratov reviewed Dec 31, 2019

Rubtsowa added 2 commits December 31, 2019 15:50

delete 1 import

5aaefcf

Merge branch 'master' of https://github.com/IntelPython/hpat into add…

d3086b0

…_dataframe_min

AlexanderKalistratov approved these changes Dec 31, 2019

AlexanderKalistratov merged commit 433fa97 into IntelPython:master Dec 31, 2019

Rubtsowa deleted the add_dataframe_min branch April 7, 2020 07:04
