-
Notifications
You must be signed in to change notification settings - Fork 62
Add functions for dataframe: median, mean, min, max, sum #345
Add functions for dataframe: median, mean, min, max, sum #345
Conversation
sdc/config.py
Outdated
|
||
use_default_dataframe = distutils_util.strtobool(os.getenv('SDC_CONFIG_USE_DEFAULT_DATAFRAME', 'True')) | ||
''' | ||
Default value used to select compiler pipeline in a function decorator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description is not right - should be something like "Config variable used to select DataFrameType model (default is legacy model)"
return sdc_pandas_dataframe_count_impl | ||
|
||
else: | ||
def reduce(df, name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think function name should be more specific, e.g. "sdc_pandas_dataframe_reduce_columns"
saved_columns = df.columns | ||
n_cols = len(saved_columns) | ||
data_args = tuple('data{}'.format(i) for i in range(n_cols)) | ||
func_text = "def _reduce_impl(df, axis=None, skipna=None, level=None, numeric_only=None):\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line defines function signature, which is now actually hardcoded to signature that 'median' has (the other function like max, min, mean also have the same signature, but not all). That won't work for other functions, e.g. Series.sum, which has additional arguments:
Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
You have to pass additional arguments to this function to be able to generate correct function text that matches the signature used in overload.
if not isinstance(df, DataFrameType): | ||
raise TypingError('{} The object must be a pandas.dataframe. Given: {}'.format(name, df)) | ||
|
||
if not (isinstance(axis, types.Omitted) or axis is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should use ty_checker here and everywhere else, e.g.
if not (isinstance(axis, types.Omitted) or axis is None): | |
if (not isinstance(axis, (int, types.Integer, str, types.UnicodeType, types.StringLiteral, types.Omitted)) | |
and axis not in (0, 'index')): | |
ty_checker.raise_exc(axis, 'integer or string', 'axis') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All functions seems to have more or less the same checks. Shouldn't we move generic checks into generic function in order to avoid copy and paste?
sdc/tests/test_dataframe.py
Outdated
def test_impl(df): | ||
return df.min() | ||
sdc_func = sdc.jit(test_impl) | ||
df = pd.DataFrame({"A": [12, 4, 5, 44, 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we just use Series method for each column, I prefer to use an aggregated DF with many columns that actually differ between each other, that is, each testing some specific case - ints, floats, datatimes, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa Why haven't you applied the above comment? You don't need two different tests for min, you can just use one with:
df = pd.DataFrame({
"A": [12, 4, 5, 44, 1],
"B": [5.0, np.nan, 9, 2, -1],
"C": ['a', 'aa', 'd', 'cc', None],
"D": [True, True, False, True, True]
})
if not isinstance(df, DataFrameType): | ||
raise TypingError('{} The object must be a pandas.dataframe. Given: {}'.format(name, df)) | ||
|
||
if not (isinstance(axis, types.Omitted) or axis is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All functions seems to have more or less the same checks. Shouldn't we move generic checks into generic function in order to avoid copy and paste?
loc_vars = {} | ||
|
||
print() | ||
print(func_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it should be removed
func_text += " {} = hpat.hiframes.api.init_series(hpat.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))\n".format( | ||
d + '_S', i) | ||
func_text += " {} = {}.{}()\n".format(d + '_O', d + '_S', name) | ||
func_text += " data = np.array(({},))\n".format( | ||
", ".join(d + '_O' for d in data_args)) | ||
func_text += " index = hpat.str_arr_ext.StringArray(({},))\n".format( | ||
", ".join("'{}'".format(c) for c in saved_columns)) | ||
func_text += " return hpat.hiframes.api.init_series(data, index)\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You concatenate strings via operator +
, but more effective way is to create list
, fill the list via .append()
(do NOT forget to cut off \n
), then join the list to the string via '\n'.join()
, e.g.:
func_lines = []
func_lines.append('the first line')
func_lines.append('the second line')
func_lines.append('the last line')
func_text = '\n'.join(func_lines)
Please fix the code style issues. |
return sdc_pandas_dataframe_count_impl | ||
|
||
else: | ||
def sdc_pandas_dataframe_reduce_columns(df, name, param): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe replace param
with params
.
…_dataframe_min
@@ -1994,8 +1994,7 @@ def hpat_pandas_series_sum( | |||
.. only:: developer | |||
|
|||
Tests: | |||
python -m sdc.runtests sdc.tests.test_series.TestSeries.test_series_sum1 | |||
# python -m sdc.runtests sdc.tests.test_series.TestSeries.test_series_sum2 | |||
python -m sdc.runtests -k sdc.tests.test_series.TestSeries.test_series_sum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to refer to all tests for sum method:
python -m sdc.runtests -k sdc.tests.test_series.TestSeries.test_series_sum*
if skipna is None: | ||
skipna = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not compile, since type won't be unified for skipna variable which is None and bool at the same time. You can add a branch based on compile time value, i.e. define skipna_is_none variable at typing and refer to it with:
if skipna_is_none == True: #noqa
_skipna = True
else:
_skipna = skipna
sdc/tests/test_dataframe.py
Outdated
"F": [np.nan, np.nan, np.inf, np.nan]}) | ||
pd.testing.assert_series_equal(hpat_func(df), test_impl(df)) | ||
|
||
def test_prod(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these tests test default argument values, so I suggest adding _default suffix for all of them.
from numba.errors import TypingError | ||
from sdc.hiframes.pd_dataframe_ext import DataFrameType | ||
import sdc.datatypes.hpat_pandas_dataframe_types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import sdc.datatypes.hpat_pandas_dataframe_types |
from numba.errors import TypingError | ||
from sdc.hiframes.pd_dataframe_ext import DataFrameType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from sdc.hiframes.pd_dataframe_ext import DataFrameType | |
from sdc.hiframes.pd_dataframe_type import DataFrameType |
No description provided.