Skip to content
This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Conversation

@akharche
Copy link
Contributor

@akharche akharche commented Dec 9, 2019

Implementation of a simple case when two dfs with same column names are concatenated

@pep8speaks
Copy link

pep8speaks commented Dec 9, 2019

Hello @akharche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 93:5: E303 too many blank lines (3)

Comment last updated at 2020-01-13 14:55:51 UTC

@akharche akharche changed the title DataFrame.append base implementation DataFrame.append Dec 16, 2019
Comment on lines 33 to 41
# Concat dfs with the same column names
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
result1 = df.append(df2)

# Concat dfs with the different column names
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))
result2 = df.append(df2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could merge the cases to a single one:

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BC'))
result = df.append(df2)

if not isinstance(verify_integrity, (bool, types.Boolean, types.Omitted)) and verify_integrity:
ty_checker.raise_exc(verify_integrity, 'boolean', 'verify_integrity')

if not isinstance(sort, (bool, types.Boolean, types.Omitted)) and verify_integrity:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check sort belongs to NoneType as well.

args = (('ignore_index', ignore_index), ('verify_integrity', False), ('sort', None))

def sdc_pandas_dataframe_append_impl(df, other, name, args):
spaces = 4 * ' '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be name it as indentation or indent?


for col_name, i in df_columns_indx.items():
if col_name in other_columns_indx:
func_text.append(get_dataframe_column('df', col_name, i))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this line above of the if-else block.

func_text.append(get_dataframe_column('df', col_name, i))
func_text.append(get_dataframe_column('other', col_name, other_columns_indx.get(col_name)))
func_text.append(get_append_result('df', 'other', col_name))
column_list.append((f'new_col_{col_name}', col_name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this line below of the if-else block.


pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2))

@unittest.skip('Unsupported functionality df.append([df2, df3]')
Copy link
Contributor

@densmirn densmirn Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we going to keep append in the "old style" and switch the style via SDC_CONFIG_PIPELINE_SDC?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexanderKalistratov what do you think about that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@unittest.skip('Unsupported functionality df.append([df2, df3]')
@unittest.skip('Unsupported functionality df.append([df2, df3])')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep old functionality under SDC_CONFIG_PIPELINE_SDC = 1

So, please modify decorator to skip_numba_jit and make sure this test is passed

from sdc.str_arr_ext import StringArrayType

from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType
# from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just remove the import or do you think it can be uncommented in further?

Comment on lines 63 to 70
.. code-block:: console
> python ./dataframe_append.py
A B C
0 1.0 2 NaN
1 3.0 4 NaN
2 NaN 5 6.0
3 NaN 7 8.0
dtype: object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the result will appear in the documentation due to missed empty lines around of the block.


def get_append_result(df1, df2, column):
return f'new_col_{column} = ' \
f'init_series(new_col_{column}_data_{df1}).append(init_series(new_col_{column}_data_{df2}))._data'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would create series strings separately then insert in to the resulting string, something like that:

s1 = f'init_series(new_col_{column}_data_{df1})'
s2 = f'init_series(new_col_{column}_data_{df2})'
return f'new_col_{column} = {s1}.append{s2}._data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can help to avoid long code lines.

n = 11
df = pd.DataFrame({'A': np.arange(n), 'B': np.arange(n)**2})
df2 = pd.DataFrame({'A': np.arange(n), 'C': np.arange(n)**2})
df2 = pd.DataFrame({'A': np.arange(n), 'B': np.arange(n)**2})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using of method 'copy()'?
df2 = df.copy(deep=True)

str_arr_is_na_mask = []
for i in numba.prange(string_array_size):
if sdc.hiframes.api.isna(data, i):
str_arr_is_na_mask.append(i)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't looks safe or efficient. It would be better to allocate array of size string_array_size and then fill it with True and False values

sdc.str_arr_ext.cp_str_list_to_array(result_data, result_list)

for i in numba.prange(len(str_arr_is_na_mask)):
str_arr_set_na(result_data, str_arr_is_na_mask[i])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's could be a big problem. We need to group it by 64 elements, so only one thread would access this 64 elements


pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2))

@unittest.skip('Unsupported functionality df.append([df2, df3]')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@unittest.skip('Unsupported functionality df.append([df2, df3]')
@unittest.skip('Unsupported functionality df.append([df2, df3])')


pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2))

@unittest.skip('Unsupported functionality df.append([df2, df3]')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep old functionality under SDC_CONFIG_PIPELINE_SDC = 1

So, please modify decorator to skip_numba_jit and make sure this test is passed

return _impl


from sdc.datatypes.hpat_pandas_dataframe_functions import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add empty line at the end of the file

dtype: object
"""

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work?

if key not in func_args:
if isinstance(value, types.Literal):
value = value.literal_value
func_args.append('{}={}'.format(key, value))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func_args.append('{}={}'.format(key, value))
func_args.append(f'{key}={value}')

func_text = []
column_list = []

func_text.append(f'len_df = len(get_dataframe_data(df, {0}))')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func_text.append(f'len_df = len(get_dataframe_data(df, {0}))')
func_text.append(f'len_df = len(get_dataframe_data(df, 0))')

column_list = []

func_text.append(f'len_df = len(get_dataframe_data(df, {0}))')
func_text.append(f'len_other = len(get_dataframe_data(other, {0}))')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func_text.append(f'len_other = len(get_dataframe_data(other, {0}))')
func_text.append(f'len_other = len(get_dataframe_data(other, 0))')

# TODO: Handle index
index = None
col_names = ', '.join(f"'{column_name}'" for _, column_name in column_list)
func_text.append(f"return sdc.hiframes.pd_dataframe_ext.init_dataframe({data}, {index}, {col_names})\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func_text.append(f"return sdc.hiframes.pd_dataframe_ext.init_dataframe({data}, {index}, {col_names})\n")
func_text.append(f"return init_dataframe({data}, {index}, {col_names})\n")

And then pass sdc.hiframes.pd_dataframe_ext.init_dataframe as global into exec?

Comment on lines 156 to 163
.. code-block:: console
> python ./dataframe_append.py
A B C
0 1.0 3 NaN
1 2.0 4 NaN
2 NaN 5 7.0
3 NaN 6 8.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +33 to +41
"""
Expected result:
A B C
0 1.0 3 NaN
1 2.0 4 NaN
2 NaN 5 7.0
3 NaN 6 8.0
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@AlexanderKalistratov AlexanderKalistratov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good


# Keep NaN values of initial array
arr_is_na_mask = numpy.array([sdc.hiframes.api.isna(data, i) for i in
numba.prange(string_array_size)])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List comprehension should be auto paralled. You don't need to use prange here

for j in range(i, max(i + batch_size, string_array_size)):
if arr_is_na_mask[j]:
str_arr_set_na(result_data, j)
for i in numba.prange(none_array_size//batch_size + 1):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work only in case string_array_size is multiple to 64. But if it is not?

new_col_C_data_other = get_dataframe_data(other, 1)
new_col_C_data = init_series(new_col_C_data_other)._data
new_col_C = fill_str_array(new_col_C_data, len_df+len_other, push_back=False)
return init_dataframe(new_col_A, new_col_B, new_col_C, None, 'A', 'B', 'C')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please replace it if pandas.DataFrame(...)?

raise TypingError("{} 'numeric_only' unsupported. Given: {}".format(_func_name, axis))
data = ', '.join(f'"{column_name}": {column}' for column, column_name in column_list)
# TODO: Handle index
func_text.append(f"return pandas.DataFrame({{{data}}})\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add index support. It should work

func_definition.extend([indent + func_line for func_line in func_text])
func_def = '\n'.join(func_definition)

global_vars = {'sdc': sdc, 'np': numpy, 'pandas': pandas,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sdc and numpy are not used at all. Please remove it.


from numba import types
from numba.extending import (overload, overload_method, overload_attribute)
from sdc.hiframes.pd_dataframe_ext import DataFrameType
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrameType is already importing from correct module at line 48.

@AlexanderKalistratov AlexanderKalistratov merged commit b83641f into IntelPython:master Jan 13, 2020
@akharche akharche deleted the dataframe_append branch January 13, 2020 17:39
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants