-
Notifications
You must be signed in to change notification settings - Fork 62
DataFrame.append #401
DataFrame.append #401
Conversation
|
Hello @akharche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-01-13 14:55:51 UTC |
| # Concat dfs with the same column names | ||
| df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) | ||
| df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB')) | ||
| result1 = df.append(df2) | ||
|
|
||
| # Concat dfs with the different column names | ||
| df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) | ||
| df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD')) | ||
| result2 = df.append(df2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could merge the cases to a single one:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BC'))
result = df.append(df2)
| if not isinstance(verify_integrity, (bool, types.Boolean, types.Omitted)) and verify_integrity: | ||
| ty_checker.raise_exc(verify_integrity, 'boolean', 'verify_integrity') | ||
|
|
||
| if not isinstance(sort, (bool, types.Boolean, types.Omitted)) and verify_integrity: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check sort belongs to NoneType as well.
| args = (('ignore_index', ignore_index), ('verify_integrity', False), ('sort', None)) | ||
|
|
||
| def sdc_pandas_dataframe_append_impl(df, other, name, args): | ||
| spaces = 4 * ' ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be name it as indentation or indent?
|
|
||
| for col_name, i in df_columns_indx.items(): | ||
| if col_name in other_columns_indx: | ||
| func_text.append(get_dataframe_column('df', col_name, i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this line above of the if-else block.
| func_text.append(get_dataframe_column('df', col_name, i)) | ||
| func_text.append(get_dataframe_column('other', col_name, other_columns_indx.get(col_name))) | ||
| func_text.append(get_append_result('df', 'other', col_name)) | ||
| column_list.append((f'new_col_{col_name}', col_name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this line below of the if-else block.
sdc/tests/test_dataframe.py
Outdated
|
|
||
| pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2)) | ||
|
|
||
| @unittest.skip('Unsupported functionality df.append([df2, df3]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we going to keep append in the "old style" and switch the style via SDC_CONFIG_PIPELINE_SDC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlexanderKalistratov what do you think about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @unittest.skip('Unsupported functionality df.append([df2, df3]') | |
| @unittest.skip('Unsupported functionality df.append([df2, df3])') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep old functionality under SDC_CONFIG_PIPELINE_SDC = 1
So, please modify decorator to skip_numba_jit and make sure this test is passed
| from sdc.str_arr_ext import StringArrayType | ||
|
|
||
| from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType | ||
| # from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just remove the import or do you think it can be uncommented in further?
| .. code-block:: console | ||
| > python ./dataframe_append.py | ||
| A B C | ||
| 0 1.0 2 NaN | ||
| 1 3.0 4 NaN | ||
| 2 NaN 5 6.0 | ||
| 3 NaN 7 8.0 | ||
| dtype: object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the result will appear in the documentation due to missed empty lines around of the block.
|
|
||
| def get_append_result(df1, df2, column): | ||
| return f'new_col_{column} = ' \ | ||
| f'init_series(new_col_{column}_data_{df1}).append(init_series(new_col_{column}_data_{df2}))._data' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would create series strings separately then insert in to the resulting string, something like that:
s1 = f'init_series(new_col_{column}_data_{df1})'
s2 = f'init_series(new_col_{column}_data_{df2})'
return f'new_col_{column} = {s1}.append{s2}._data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can help to avoid long code lines.
| n = 11 | ||
| df = pd.DataFrame({'A': np.arange(n), 'B': np.arange(n)**2}) | ||
| df2 = pd.DataFrame({'A': np.arange(n), 'C': np.arange(n)**2}) | ||
| df2 = pd.DataFrame({'A': np.arange(n), 'B': np.arange(n)**2}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using of method 'copy()'?
df2 = df.copy(deep=True)
sdc/datatypes/common_functions.py
Outdated
| str_arr_is_na_mask = [] | ||
| for i in numba.prange(string_array_size): | ||
| if sdc.hiframes.api.isna(data, i): | ||
| str_arr_is_na_mask.append(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't looks safe or efficient. It would be better to allocate array of size string_array_size and then fill it with True and False values
sdc/datatypes/common_functions.py
Outdated
| sdc.str_arr_ext.cp_str_list_to_array(result_data, result_list) | ||
|
|
||
| for i in numba.prange(len(str_arr_is_na_mask)): | ||
| str_arr_set_na(result_data, str_arr_is_na_mask[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's could be a big problem. We need to group it by 64 elements, so only one thread would access this 64 elements
sdc/tests/test_dataframe.py
Outdated
|
|
||
| pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2)) | ||
|
|
||
| @unittest.skip('Unsupported functionality df.append([df2, df3]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @unittest.skip('Unsupported functionality df.append([df2, df3]') | |
| @unittest.skip('Unsupported functionality df.append([df2, df3])') |
sdc/tests/test_dataframe.py
Outdated
|
|
||
| pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2)) | ||
|
|
||
| @unittest.skip('Unsupported functionality df.append([df2, df3]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep old functionality under SDC_CONFIG_PIPELINE_SDC = 1
So, please modify decorator to skip_numba_jit and make sure this test is passed
sdc/hiframes/pd_dataframe_ext.py
Outdated
| return _impl | ||
|
|
||
|
|
||
| from sdc.datatypes.hpat_pandas_dataframe_functions import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add empty line at the end of the file
| dtype: object | ||
| """ | ||
|
|
||
| df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work?
| if key not in func_args: | ||
| if isinstance(value, types.Literal): | ||
| value = value.literal_value | ||
| func_args.append('{}={}'.format(key, value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func_args.append('{}={}'.format(key, value)) | |
| func_args.append(f'{key}={value}') |
| func_text = [] | ||
| column_list = [] | ||
|
|
||
| func_text.append(f'len_df = len(get_dataframe_data(df, {0}))') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func_text.append(f'len_df = len(get_dataframe_data(df, {0}))') | |
| func_text.append(f'len_df = len(get_dataframe_data(df, 0))') |
| column_list = [] | ||
|
|
||
| func_text.append(f'len_df = len(get_dataframe_data(df, {0}))') | ||
| func_text.append(f'len_other = len(get_dataframe_data(other, {0}))') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func_text.append(f'len_other = len(get_dataframe_data(other, {0}))') | |
| func_text.append(f'len_other = len(get_dataframe_data(other, 0))') |
| # TODO: Handle index | ||
| index = None | ||
| col_names = ', '.join(f"'{column_name}'" for _, column_name in column_list) | ||
| func_text.append(f"return sdc.hiframes.pd_dataframe_ext.init_dataframe({data}, {index}, {col_names})\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func_text.append(f"return sdc.hiframes.pd_dataframe_ext.init_dataframe({data}, {index}, {col_names})\n") | |
| func_text.append(f"return init_dataframe({data}, {index}, {col_names})\n") |
And then pass sdc.hiframes.pd_dataframe_ext.init_dataframe as global into exec?
| .. code-block:: console | ||
| > python ./dataframe_append.py | ||
| A B C | ||
| 0 1.0 3 NaN | ||
| 1 2.0 4 NaN | ||
| 2 NaN 5 7.0 | ||
| 3 NaN 6 8.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| """ | ||
| Expected result: | ||
| A B C | ||
| 0 1.0 3 NaN | ||
| 1 2.0 4 NaN | ||
| 2 NaN 5 7.0 | ||
| 3 NaN 6 8.0 | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AlexanderKalistratov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good
sdc/datatypes/common_functions.py
Outdated
|
|
||
| # Keep NaN values of initial array | ||
| arr_is_na_mask = numpy.array([sdc.hiframes.api.isna(data, i) for i in | ||
| numba.prange(string_array_size)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List comprehension should be auto paralled. You don't need to use prange here
sdc/datatypes/common_functions.py
Outdated
| for j in range(i, max(i + batch_size, string_array_size)): | ||
| if arr_is_na_mask[j]: | ||
| str_arr_set_na(result_data, j) | ||
| for i in numba.prange(none_array_size//batch_size + 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would work only in case string_array_size is multiple to 64. But if it is not?
| new_col_C_data_other = get_dataframe_data(other, 1) | ||
| new_col_C_data = init_series(new_col_C_data_other)._data | ||
| new_col_C = fill_str_array(new_col_C_data, len_df+len_other, push_back=False) | ||
| return init_dataframe(new_col_A, new_col_B, new_col_C, None, 'A', 'B', 'C') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please replace it if pandas.DataFrame(...)?
| raise TypingError("{} 'numeric_only' unsupported. Given: {}".format(_func_name, axis)) | ||
| data = ', '.join(f'"{column_name}": {column}' for column, column_name in column_list) | ||
| # TODO: Handle index | ||
| func_text.append(f"return pandas.DataFrame({{{data}}})\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add index support. It should work
| func_definition.extend([indent + func_line for func_line in func_text]) | ||
| func_def = '\n'.join(func_definition) | ||
|
|
||
| global_vars = {'sdc': sdc, 'np': numpy, 'pandas': pandas, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sdc and numpy are not used at all. Please remove it.
|
|
||
| from numba import types | ||
| from numba.extending import (overload, overload_method, overload_attribute) | ||
| from sdc.hiframes.pd_dataframe_ext import DataFrameType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrameType is already importing from correct module at line 48.
Implementation of a simple case when two dfs with same column names are concatenated