-
Notifications
You must be signed in to change notification settings - Fork 62
add reduce #366
add reduce #366
Conversation
| from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType | ||
|
|
||
|
|
||
| @overload_method(DataFrameType, 'count') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove the overload?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this can be done in separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overload_method count is not used. It is ok to remove it. It looks like implementation of sdc_pandas_dataframe_reduce_columns function is preparation for other PR #363
not sure in it.
| loc_vars = {} | ||
| func_text = '\n'.join(func_lines) | ||
|
|
||
| exec(func_text, {'hpat': sdc, 'np': numpy}, loc_vars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa I think we should stick to the new name:
| exec(func_text, {'hpat': sdc, 'np': numpy}, loc_vars) | |
| exec(func_text, {'sdc': sdc, 'np': numpy}, loc_vars) |
and use sdc.hiframes instead hpat.hiframes in the generated function text too.
| for i, d in enumerate(data_args): | ||
| line = ' {} = hpat.hiframes.api.init_series(hpat.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))' | ||
| func_lines.append(line.format(d + '_S', i)) | ||
| func_lines.append(' {} = {}.{}()'.format(d + '_O', d + '_S', name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we call the method of series without any arguments?
You have to actually forward the arguments passed to DF call into the Series method. And also take care about how they actually map to each other (e.g. for DF.mean and Series.mean arguments are the same, but will it be so for all functions?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though arguments of df.mean and series.mean are the same, it can't be forwarded straight forward.
axisfor series supports onlyindex (0)numeric_onlyis not support by series at all, and shouldn't be forwardedlevelat the moment we are not storing series in data frame. So there are no hierarchical index and level parameter doesn't make any sense for underlying series, but still there is sense for dataframe. on the other hand at the moment hierarchical index is not support for dataframe either.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mean.html
| n_cols = len(saved_columns) | ||
| data_args = tuple('data{}'.format(i) for i in range(n_cols)) | ||
| all_params = ['df'] | ||
| for key, value in params: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A blank line, e.g. before a block that generates func text, wouldn't be that bad.
Also n_cols is used only once and probably can be removed at all.
kozlov-alexey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please apply comments and fix forwarding of arguments into the inner Series call.
| func_definition = 'def _reduce_impl({}):'.format(', '.join(all_params)) | ||
| func_lines = [func_definition] | ||
| for i, d in enumerate(data_args): | ||
| line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))' | |
| line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(all_params[0], {}))' |
|
|
||
| if not (isinstance(axis, types.Omitted) or axis == 0): | ||
| raise TypingError("{} 'axis' unsupported. Given: {}".format(_func_name, axis)) | ||
| def sdc_pandas_dataframe_reduce_columns(df, name, params_s, params_df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like params_df is not used at all.
| for key, value in params_s: | ||
| all_params.append('{}={}'.format(key, value)) | ||
| ap = all_params.copy() | ||
| ap.pop(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for avoiding of using pop? Let's separate all parameters and the last one initially.
| for i, d in enumerate(data_args): | ||
| line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))' | ||
| func_lines.append(line.format(d + '_S', i)) | ||
| func_lines.append(' {} = {}.{}({})'.format(d + '_O', d + '_S', name, par)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func_lines.append(' {} = {}.{}({})'.format(d + '_O', d + '_S', name, par)) -> func_lines.append(' {}_O = {}_S.{}({})'.format(d, d, name, par))
|
|
||
| if not (isinstance(numeric_only, types.Omitted) or numeric_only is False): | ||
| raise TypingError("{} 'numeric_only' unsupported. Given: {}".format(_func_name, axis)) | ||
| for key, value in params_s: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we use params_s (is it parameters of series function?) to to create df function signature?
|
|
||
| if not (isinstance(axis, types.Omitted) or axis == 0): | ||
| raise TypingError("{} 'axis' unsupported. Given: {}".format(_func_name, axis)) | ||
| def sdc_pandas_dataframe_reduce_columns(df, name, params): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def sdc_pandas_dataframe_reduce_columns(df, name, params): | |
| def sdc_pandas_dataframe_reduce_columns(df, name, series_call_params): |
| raise TypingError("{} 'numeric_only' unsupported. Given: {}".format(_func_name, axis)) | ||
| for key, value in params: | ||
| all_params.append('{}={}'.format(key, value)) | ||
| ap = all_params.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the copy of all_params is needed. Also it would be good to have a comment here, to indicate that it relies on the fact we use series params to generate signature of DF method, i.e. something like:
# This relies on parameters part of the signature of Series method called below being the same
# as for the corresponding DataFrame method
series_call_params_str = '{}'.format(', '.join(all_params[1:]))
func_definition = 'def _reduce_impl({}):'.format(', '.join(all_params))
| for key, value in params: | ||
| all_params.append('{}={}'.format(key, value)) | ||
| ap = all_params.copy() | ||
| par = '{}'.format(', '.join(ap[1:])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not valid. Params which passed to dataframe method and params which pass to series methods are different params. You shouldn't pass axis or numeric_only params to series.
Also currently you always passing params with default values to series!
| all_params.append('{}={}'.format(key, value)) | ||
| ap = all_params.copy() | ||
| par = '{}'.format(', '.join(ap[1:])) | ||
| func_definition = 'def _reduce_impl({}):'.format(', '.join(all_params)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use fstring here and everywhere:
func_definition = f'def _reduce_impl({', '.join(all_params)}):'Or, better:
func_params = ', '.join(all_params)
func_definition = f'def _reduce_impl({func_params)}):'| Overload Numba function to allow call SDC pass in Numba compiler pipeline | ||
| Functions are: | ||
| - Numba DefaultPassBuilder define_nopython_pipeline() | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe since @AlexanderKalistratov PR with rewrites this should be commented out. Please pay attention while rebasing.
|
|
||
| saved_columns = df.columns | ||
| data_args = tuple('data{}'.format(i) for i in range(len(saved_columns))) | ||
| def _dataframe_reduce_columns_codegen(func_name, func_params, series_params, columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a multi-line comment at the top of this function (not a docstring) with a short example of how func_text will look like (for example for mean and a DF with 1 column)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still seeing here docstring instead of comment
kozlov-alexey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@Rubtsowa please fix style issues:
|
sdc/tests/test_dataframe.py
Outdated
| pd.testing.assert_series_equal(hpat_func(n), test_impl(n)) | ||
|
|
||
| @skip_numba_jit | ||
| @skip_sdc_jit('SDC pipeline does not support arguments for Series.count()') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't feel right that count stops to work with sdc pipeline.
|
|
||
| saved_columns = df.columns | ||
| data_args = tuple('data{}'.format(i) for i in range(len(saved_columns))) | ||
| def _dataframe_reduce_columns_codegen(func_name, func_params, series_params, columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still seeing here docstring instead of comment
sdc/hiframes/pd_dataframe_ext.py
Outdated
| return _impl | ||
|
|
||
|
|
||
| from sdc.datatypes.hpat_pandas_dataframe_functions import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
| # numba.compiler.DefaultPassBuilder.define_nopython_pipeline | ||
| # numba.compiler.DefaultPassBuilder.define_nopython_pipeline = \ | ||
| # sdc.datatypes.hpat_pandas_dataframe_pass.sdc_nopython_pipeline_lite_register | ||
| # sdc.datatypes.hpat_pandas_dataframe_pass.sdc_nopython_pipeline_lite_register | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this changes.
|
|
||
| from numba import types | ||
| from numba.extending import (overload, overload_method, overload_attribute) | ||
| from sdc.hiframes.pd_dataframe_ext import DataFrameType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move imports from sdc after imports from numba.
| """ | ||
| Pandas DataFrame method :meth:`pandas.DataFrame.count` implementation. | ||
| .. only:: developer | ||
| Test: python -m sdc.runtests sdc.tests.test_dataframe.TestDataFrame.test_count | ||
| Test: python -m sdc.runtests sdc.tests.test_dataframe.TestDataFrame.test_count | ||
| Test: python -m sdc.runtests sdc.tests.test_dataframe.TestDataFrame.test_count1 | ||
| Parameters | ||
| ----------- | ||
| self: :class:`pandas.DataFrame` | ||
| input arg | ||
| input arg | ||
| axis: | ||
| *unsupported* | ||
| *unsupported* | ||
| level: | ||
| *unsupported* | ||
| *unsupported* | ||
| numeric_only: | ||
| *unsupported* | ||
| *unsupported* | ||
| Returns | ||
| ------- | ||
| :obj:`pandas.Series` or `pandas.DataFrame` | ||
| returns: For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame. | ||
| for each column/row the number of non-NA/null entries. If level is specified returns a DataFrame. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation is not like in other function. See #455 and docs from other functions.
| if not sdc.config.config_pipeline_hpat_default: | ||
| from sdc.datatypes.hpat_pandas_dataframe_functions import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, consider another @overload_method(DataFrameType, 'count') in pd_dataframe_ext.py on line ~1564. You should switch from old to new overload not only switch on new one.
@densmirn have we something like list _non_hpat_pipeline_attrs in pd_series_ext.py but for DataFrame?
No description provided.