Skip to content
This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Conversation

@Rubtsowa
Copy link
Contributor

@Rubtsowa Rubtsowa commented Dec 5, 2019

No description provided.

@Rubtsowa Rubtsowa requested a review from densmirn December 5, 2019 09:48
@pep8speaks
Copy link

pep8speaks commented Dec 5, 2019

Hello @Rubtsowa! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-27 10:54:22 UTC

from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType


@overload_method(DataFrameType, 'count')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the overload?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this can be done in separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overload_method count is not used. It is ok to remove it. It looks like implementation of sdc_pandas_dataframe_reduce_columns function is preparation for other PR #363
not sure in it.

@densmirn densmirn closed this Dec 5, 2019
@densmirn densmirn reopened this Dec 5, 2019
loc_vars = {}
func_text = '\n'.join(func_lines)

exec(func_text, {'hpat': sdc, 'np': numpy}, loc_vars)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rubtsowa I think we should stick to the new name:

Suggested change
exec(func_text, {'hpat': sdc, 'np': numpy}, loc_vars)
exec(func_text, {'sdc': sdc, 'np': numpy}, loc_vars)

and use sdc.hiframes instead hpat.hiframes in the generated function text too.

for i, d in enumerate(data_args):
line = ' {} = hpat.hiframes.api.init_series(hpat.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))'
func_lines.append(line.format(d + '_S', i))
func_lines.append(' {} = {}.{}()'.format(d + '_O', d + '_S', name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we call the method of series without any arguments?
You have to actually forward the arguments passed to DF call into the Series method. And also take care about how they actually map to each other (e.g. for DF.mean and Series.mean arguments are the same, but will it be so for all functions?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though arguments of df.mean and series.mean are the same, it can't be forwarded straight forward.

  1. axis for series supports only index (0)
  2. numeric_only is not support by series at all, and shouldn't be forwarded
  3. level at the moment we are not storing series in data frame. So there are no hierarchical index and level parameter doesn't make any sense for underlying series, but still there is sense for dataframe. on the other hand at the moment hierarchical index is not support for dataframe either.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mean.html

n_cols = len(saved_columns)
data_args = tuple('data{}'.format(i) for i in range(n_cols))
all_params = ['df']
for key, value in params:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A blank line, e.g. before a block that generates func text, wouldn't be that bad.
Also n_cols is used only once and probably can be removed at all.

Copy link
Contributor

@kozlov-alexey kozlov-alexey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please apply comments and fix forwarding of arguments into the inner Series call.

func_definition = 'def _reduce_impl({}):'.format(', '.join(all_params))
func_lines = [func_definition]
for i, d in enumerate(data_args):
line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))'
line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(all_params[0], {}))'


if not (isinstance(axis, types.Omitted) or axis == 0):
raise TypingError("{} 'axis' unsupported. Given: {}".format(_func_name, axis))
def sdc_pandas_dataframe_reduce_columns(df, name, params_s, params_df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like params_df is not used at all.

for key, value in params_s:
all_params.append('{}={}'.format(key, value))
ap = all_params.copy()
ap.pop(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for avoiding of using pop? Let's separate all parameters and the last one initially.

for i, d in enumerate(data_args):
line = ' {} = sdc.hiframes.api.init_series(sdc.hiframes.pd_dataframe_ext.get_dataframe_data(df, {}))'
func_lines.append(line.format(d + '_S', i))
func_lines.append(' {} = {}.{}({})'.format(d + '_O', d + '_S', name, par))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func_lines.append(' {} = {}.{}({})'.format(d + '_O', d + '_S', name, par)) -> func_lines.append(' {}_O = {}_S.{}({})'.format(d, d, name, par))


if not (isinstance(numeric_only, types.Omitted) or numeric_only is False):
raise TypingError("{} 'numeric_only' unsupported. Given: {}".format(_func_name, axis))
for key, value in params_s:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we use params_s (is it parameters of series function?) to to create df function signature?


if not (isinstance(axis, types.Omitted) or axis == 0):
raise TypingError("{} 'axis' unsupported. Given: {}".format(_func_name, axis))
def sdc_pandas_dataframe_reduce_columns(df, name, params):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def sdc_pandas_dataframe_reduce_columns(df, name, params):
def sdc_pandas_dataframe_reduce_columns(df, name, series_call_params):

raise TypingError("{} 'numeric_only' unsupported. Given: {}".format(_func_name, axis))
for key, value in params:
all_params.append('{}={}'.format(key, value))
ap = all_params.copy()
Copy link
Contributor

@kozlov-alexey kozlov-alexey Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the copy of all_params is needed. Also it would be good to have a comment here, to indicate that it relies on the fact we use series params to generate signature of DF method, i.e. something like:

    # This relies on parameters part of the signature of Series method called below being the same
    # as for the corresponding DataFrame method
    series_call_params_str = '{}'.format(', '.join(all_params[1:]))
    func_definition = 'def _reduce_impl({}):'.format(', '.join(all_params))

for key, value in params:
all_params.append('{}={}'.format(key, value))
ap = all_params.copy()
par = '{}'.format(', '.join(ap[1:]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not valid. Params which passed to dataframe method and params which pass to series methods are different params. You shouldn't pass axis or numeric_only params to series.

Also currently you always passing params with default values to series!

all_params.append('{}={}'.format(key, value))
ap = all_params.copy()
par = '{}'.format(', '.join(ap[1:]))
func_definition = 'def _reduce_impl({}):'.format(', '.join(all_params))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use fstring here and everywhere:

func_definition = f'def _reduce_impl({', '.join(all_params)}):'

Or, better:

func_params = ', '.join(all_params)
func_definition = f'def _reduce_impl({func_params)}):'

Overload Numba function to allow call SDC pass in Numba compiler pipeline
Functions are:
- Numba DefaultPassBuilder define_nopython_pipeline()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe since @AlexanderKalistratov PR with rewrites this should be commented out. Please pay attention while rebasing.


saved_columns = df.columns
data_args = tuple('data{}'.format(i) for i in range(len(saved_columns)))
def _dataframe_reduce_columns_codegen(func_name, func_params, series_params, columns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a multi-line comment at the top of this function (not a docstring) with a short example of how func_text will look like (for example for mean and a DF with 1 column)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still seeing here docstring instead of comment

Copy link
Contributor

@kozlov-alexey kozlov-alexey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AlexanderKalistratov
Copy link
Collaborator

@Rubtsowa please fix style issues:

Hello @Rubtsowa! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 63:121: E501 line too long (122 > 120 characters)
Line 64:121: E501 line too long (143 > 120 characters)

Line 57:1: E302 expected 2 blank lines, found 0

Line 1633:1: E402 module level import not at top of file

Comment last updated at 2019-12-26 14:32:57 UTC

pd.testing.assert_series_equal(hpat_func(n), test_impl(n))

@skip_numba_jit
@skip_sdc_jit('SDC pipeline does not support arguments for Series.count()')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't feel right that count stops to work with sdc pipeline.


saved_columns = df.columns
data_args = tuple('data{}'.format(i) for i in range(len(saved_columns)))
def _dataframe_reduce_columns_codegen(func_name, func_params, series_params, columns):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still seeing here docstring instead of comment

return _impl


from sdc.datatypes.hpat_pandas_dataframe_functions import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

@AlexanderKalistratov AlexanderKalistratov merged commit 923b23a into IntelPython:master Dec 27, 2019
Comment on lines +64 to +67
# numba.compiler.DefaultPassBuilder.define_nopython_pipeline
# numba.compiler.DefaultPassBuilder.define_nopython_pipeline = \
# sdc.datatypes.hpat_pandas_dataframe_pass.sdc_nopython_pipeline_lite_register
# sdc.datatypes.hpat_pandas_dataframe_pass.sdc_nopython_pipeline_lite_register

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this changes.


from numba import types
from numba.extending import (overload, overload_method, overload_attribute)
from sdc.hiframes.pd_dataframe_ext import DataFrameType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move imports from sdc after imports from numba.

Comment on lines 104 to 127
"""
Pandas DataFrame method :meth:`pandas.DataFrame.count` implementation.
.. only:: developer
Test: python -m sdc.runtests sdc.tests.test_dataframe.TestDataFrame.test_count
Test: python -m sdc.runtests sdc.tests.test_dataframe.TestDataFrame.test_count
Test: python -m sdc.runtests sdc.tests.test_dataframe.TestDataFrame.test_count1
Parameters
-----------
self: :class:`pandas.DataFrame`
input arg
input arg
axis:
*unsupported*
*unsupported*
level:
*unsupported*
*unsupported*
numeric_only:
*unsupported*
*unsupported*
Returns
-------
:obj:`pandas.Series` or `pandas.DataFrame`
returns: For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.
for each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is not like in other function. See #455 and docs from other functions.

Comment on lines +1633 to +1634
if not sdc.config.config_pipeline_hpat_default:
from sdc.datatypes.hpat_pandas_dataframe_functions import *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, consider another @overload_method(DataFrameType, 'count') in pd_dataframe_ext.py on line ~1564. You should switch from old to new overload not only switch on new one.
@densmirn have we something like list _non_hpat_pipeline_attrs in pd_series_ext.py but for DataFrame?

@Rubtsowa Rubtsowa deleted the df_reduce branch April 7, 2020 07:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants