DataFrame.append #401

akharche · 2019-12-09T16:11:59Z

Implementation of the simple case where two DataFrames with the same column names are concatenated.

pep8speaks · 2019-12-09T16:12:05Z

Hello @akharche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file sdc/datatypes/hpat_pandas_dataframe_functions.py:

Line 93:5: E303 too many blank lines (3)

Comment last updated at 2020-01-13 14:55:51 UTC

densmirn · 2019-12-16T11:36:25Z

examples/dataframe/dataframe_append.py

+    # Concat dfs with the same column names
+    df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
+    df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
+    result1 = df.append(df2)
+
+    # Concat dfs with the different column names
+    df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
+    df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))
+    result2 = df.append(df2)


You could merge the cases to a single one:

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BC'))
result = df.append(df2)

densmirn · 2019-12-16T11:39:35Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    if not isinstance(verify_integrity, (bool, types.Boolean, types.Omitted)) and verify_integrity:
+        ty_checker.raise_exc(verify_integrity, 'boolean', 'verify_integrity')
+
+    if not isinstance(sort, (bool, types.Boolean, types.Omitted)) and verify_integrity:


Please also check that sort may be of NoneType.
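
A minimal plain-Python sketch of the requested check. The helper name `check_optional_bool` is hypothetical, and built-in types stand in for the Numba types used in the diff. It tests the argument itself; the quoted `sort` condition ends in `and verify_integrity`, which may be a copy-paste slip.

```python
def check_optional_bool(value, name):
    """Return value if it is a bool or None, otherwise raise TypeError.

    Hypothetical helper: the real code checks Numba types
    (types.Boolean, types.Omitted) instead of built-in ones.
    """
    if not isinstance(value, (bool, type(None))):
        raise TypeError(
            f"'{name}' must be a boolean or None, given: {type(value).__name__}"
        )
    return value


# Both the default (None) and an explicit flag are accepted.
check_optional_bool(None, 'sort')
check_optional_bool(False, 'verify_integrity')
```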

densmirn · 2019-12-16T11:40:28Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    args = (('ignore_index', ignore_index), ('verify_integrity', False), ('sort', None))
+
+    def sdc_pandas_dataframe_append_impl(df, other, name, args):
+        spaces = 4 * ' '


Maybe name it indentation or indent?

densmirn · 2019-12-16T11:45:49Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+
+        for col_name, i in df_columns_indx.items():
+            if col_name in other_columns_indx:
+                func_text.append(get_dataframe_column('df', col_name, i))


Let's move this line above the if-else block.

densmirn · 2019-12-16T11:46:22Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+                func_text.append(get_dataframe_column('df', col_name, i))
+                func_text.append(get_dataframe_column('other', col_name, other_columns_indx.get(col_name)))
+                func_text.append(get_append_result('df', 'other', col_name))
+                column_list.append((f'new_col_{col_name}', col_name))


Let's move this line below the if-else block.
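
The two suggestions above could be combined roughly as follows. This is only a sketch: the function name `build_column_code` is hypothetical, the generated strings are placeholders rather than the PR's real helpers, and the `else` branch is an assumption because the thread does not show it.

```python
def build_column_code(df_columns_indx, other_columns_indx):
    """Return (func_text, column_list) for the columns of the left frame."""
    func_text = []
    column_list = []
    for col_name, i in df_columns_indx.items():
        # Needed in both branches, so it sits above the if-else.
        func_text.append(f'new_col_{col_name}_data_df = get_dataframe_data(df, {i})')
        if col_name in other_columns_indx:
            j = other_columns_indx[col_name]
            func_text.append(f'new_col_{col_name}_data_other = get_dataframe_data(other, {j})')
            func_text.append(f'new_col_{col_name} = <append df and other data>')
        else:
            # Assumption: a column missing from `other` is padded with NaNs.
            func_text.append(f'new_col_{col_name} = <pad df data>')
        # Needed in both branches, so it sits below the if-else.
        column_list.append((f'new_col_{col_name}', col_name))
    return func_text, column_list


func_text, column_list = build_column_code({'A': 0, 'B': 1}, {'B': 0, 'C': 1})
```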

densmirn · 2019-12-18T11:26:55Z

sdc/tests/test_dataframe.py

+
+        pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2))
+
+    @unittest.skip('Unsupported functionality df.append([df2, df3]')


Aren't we going to keep append in the "old style" and switch the style via SDC_CONFIG_PIPELINE_SDC?

@AlexanderKalistratov what do you think about that?

Suggested change

- @unittest.skip('Unsupported functionality df.append([df2, df3]')
+ @unittest.skip('Unsupported functionality df.append([df2, df3])')

I think we should keep old functionality under SDC_CONFIG_PIPELINE_SDC = 1

So, please change the decorator to skip_numba_jit and make sure this test passes.

densmirn · 2019-12-20T06:18:56Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+from sdc.str_arr_ext import StringArrayType

-from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType
+# from sdc.datatypes.hpat_pandas_dataframe_types import DataFrameType


Maybe just remove the import, or do you think it will be uncommented later?

densmirn · 2019-12-20T06:23:59Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    .. code-block:: console
+        > python ./dataframe_append.py
+             A  B    C
+        0  1.0  2  NaN
+        1  3.0  4  NaN
+        2  NaN  5  6.0
+        3  NaN  7  8.0
+        dtype: object


I'm not sure the result will appear in the documentation, because the empty lines around the block are missing.
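
For illustration, reStructuredText generally expects a blank line before a directive and between the directive line and its content. The sketch below shows a docstring laid out that way, plus a small hypothetical checker (`directive_is_separated`) that is not part of Sphinx or of this PR.

```python
DOCSTRING = """
Output of the example script:

.. code-block:: console

    > python ./dataframe_append.py
         A  B    C
    0  1.0  3  NaN

Text after the block.
"""


def directive_is_separated(doc, directive='.. code-block::'):
    """Return True if the first matching directive has blank lines around it."""
    lines = doc.splitlines()
    for idx, line in enumerate(lines):
        if line.strip().startswith(directive):
            before = idx == 0 or not lines[idx - 1].strip()
            after = idx + 1 >= len(lines) or not lines[idx + 1].strip()
            return before and after
    return False  # no such directive found


print(directive_is_separated(DOCSTRING))
```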

densmirn · 2019-12-20T06:29:42Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+
+        def get_append_result(df1, df2, column):
+            return f'new_col_{column} = ' \
+                f'init_series(new_col_{column}_data_{df1}).append(init_series(new_col_{column}_data_{df2}))._data'


I would create the series strings separately and then insert them into the resulting string, something like this:

s1 = f'init_series(new_col_{column}_data_{df1})'
s2 = f'init_series(new_col_{column}_data_{df2})'
return f'new_col_{column} = {s1}.append({s2})._data'

This helps to avoid long code lines.

densmirn · 2019-12-20T06:31:42Z

sdc/tests/test_dataframe.py

        n = 11
        df = pd.DataFrame({'A': np.arange(n), 'B': np.arange(n)**2})
-        df2 = pd.DataFrame({'A': np.arange(n), 'C': np.arange(n)**2})
+        df2 = pd.DataFrame({'A': np.arange(n), 'B': np.arange(n)**2})


What do you think about using the copy() method?
df2 = df.copy(deep=True)

AlexanderKalistratov · 2019-12-20T19:32:50Z

sdc/datatypes/common_functions.py

+    str_arr_is_na_mask = []
+    for i in numba.prange(string_array_size):
+        if sdc.hiframes.api.isna(data, i):
+            str_arr_is_na_mask.append(i)


That doesn't look safe or efficient. It would be better to allocate an array of size string_array_size and then fill it with True and False values.
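
A plain-Python sketch of the suggested approach, with `range` standing in for `numba.prange` and a hypothetical `is_na` predicate standing in for `sdc.hiframes.api.isna`. The point is that each iteration writes only its own preallocated slot instead of appending to a shared list.

```python
def build_na_mask(data, is_na):
    """Return a list of booleans, one per element of data."""
    n = len(data)
    mask = [False] * n               # allocated once, fixed size
    for i in range(n):               # stands in for numba.prange
        mask[i] = is_na(data[i])     # iteration i touches only slot i
    return mask


# Hypothetical NA test: None marks a missing string.
print(build_na_mask(['a', None, 'b'], lambda value: value is None))
```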

AlexanderKalistratov · 2019-12-20T19:38:04Z

sdc/datatypes/common_functions.py

+    sdc.str_arr_ext.cp_str_list_to_array(result_data, result_list)
+
+    for i in numba.prange(len(str_arr_is_na_mask)):
+        str_arr_set_na(result_data, str_arr_is_na_mask[i])


That could be a big problem. We need to group the elements in batches of 64, so that only one thread accesses each group of 64 elements.
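
One possible reading of this concern, stated as an assumption rather than something the thread confirms: if NA flags are packed as bits into 64-bit words, setting a flag is a read-modify-write of the whole word, so two threads updating neighbouring elements could overwrite each other's bits. The sketch below uses a hypothetical list-of-ints bitmap to show which word each element index lands in.

```python
WORD_BITS = 64  # assumption: 64 NA flags are packed into one word


def word_of(index):
    """Return the bitmap word that holds the flag for element `index`."""
    return index // WORD_BITS


def set_na(bitmap, index):
    """Mark element `index` as NA in a packed bitmap (a list of ints)."""
    word, bit = divmod(index, WORD_BITS)
    bitmap[word] |= 1 << bit  # read-modify-write of the whole word


bitmap = [0, 0]
set_na(bitmap, 3)    # lands in word 0
set_na(bitmap, 64)   # lands in word 1
```

Under that assumption, assigning elements 0..63 to one worker and 64..127 to the next means no two workers ever modify the same word.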

AlexanderKalistratov · 2019-12-20T19:41:40Z

sdc/tests/test_dataframe.py

+
+        pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2))
+
+    @unittest.skip('Unsupported functionality df.append([df2, df3]')


Suggested change

- @unittest.skip('Unsupported functionality df.append([df2, df3]')
+ @unittest.skip('Unsupported functionality df.append([df2, df3])')

AlexanderKalistratov · 2019-12-20T19:42:47Z

sdc/tests/test_dataframe.py

+
+        pd.testing.assert_frame_equal(hpat_func(df, df2), test_impl(df, df2))
+
+    @unittest.skip('Unsupported functionality df.append([df2, df3]')


I think we should keep old functionality under SDC_CONFIG_PIPELINE_SDC = 1

So, please change the decorator to skip_numba_jit and make sure this test passes.

AlexanderKalistratov · 2019-12-20T19:43:39Z

sdc/hiframes/pd_dataframe_ext.py

    return _impl
+
+
+from sdc.datatypes.hpat_pandas_dataframe_functions import *


Please add an empty line at the end of the file.

AlexanderKalistratov · 2019-12-20T19:45:33Z

examples/dataframe/dataframe_append.py

+    dtype: object
+    """
+
+    df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))


Does it work?

AlexanderKalistratov · 2019-12-20T19:52:33Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+            if key not in func_args:
+                if isinstance(value, types.Literal):
+                    value = value.literal_value
+                func_args.append('{}={}'.format(key, value))


Suggested change

- func_args.append('{}={}'.format(key, value))
+ func_args.append(f'{key}={value}')

AlexanderKalistratov · 2019-12-20T19:56:11Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        func_text = []
+        column_list = []
+
+        func_text.append(f'len_df = len(get_dataframe_data(df, {0}))')


Suggested change

- func_text.append(f'len_df = len(get_dataframe_data(df, {0}))')
+ func_text.append(f'len_df = len(get_dataframe_data(df, 0))')

AlexanderKalistratov · 2019-12-20T19:56:24Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        column_list = []
+
+        func_text.append(f'len_df = len(get_dataframe_data(df, {0}))')
+        func_text.append(f'len_other = len(get_dataframe_data(other, {0}))')


Suggested change

- func_text.append(f'len_other = len(get_dataframe_data(other, {0}))')
+ func_text.append(f'len_other = len(get_dataframe_data(other, 0))')

AlexanderKalistratov · 2019-12-20T20:19:35Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        # TODO: Handle index
+        index = None
+        col_names = ', '.join(f"'{column_name}'" for _, column_name in column_list)
+        func_text.append(f"return sdc.hiframes.pd_dataframe_ext.init_dataframe({data}, {index}, {col_names})\n")


Suggested change

- func_text.append(f"return sdc.hiframes.pd_dataframe_ext.init_dataframe({data}, {index}, {col_names})\n")
+ func_text.append(f"return init_dataframe({data}, {index}, {col_names})\n")

And then pass sdc.hiframes.pd_dataframe_ext.init_dataframe as a global into exec?

PokhodenkoSA · 2019-12-26T11:34:12Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        .. code-block:: console
+
+            > python ./dataframe_append.py
+                 A  B    C
+            0  1.0  3  NaN
+            1  2.0  4  NaN
+            2  NaN  5  7.0
+            3  NaN  6  8.0


PokhodenkoSA · 2019-12-26T11:34:46Z

examples/dataframe/dataframe_append.py

+    """
+    Expected result:
+         A  B    C
+    0  1.0  3  NaN
+    1  2.0  4  NaN
+    2  NaN  5  7.0
+    3  NaN  6  8.0
+
+    """


AlexanderKalistratov

Looks very good

AlexanderKalistratov · 2019-12-26T21:30:38Z

sdc/datatypes/common_functions.py

+
+    # Keep NaN values of initial array
+    arr_is_na_mask = numpy.array([sdc.hiframes.api.isna(data, i) for i in
+                                  numba.prange(string_array_size)])


The list comprehension should be parallelized automatically. You don't need to use prange here.

AlexanderKalistratov · 2019-12-26T21:40:33Z

sdc/datatypes/common_functions.py

+            for j in range(i, max(i + batch_size, string_array_size)):
+                if arr_is_na_mask[j]:
+                    str_arr_set_na(result_data, j)
+        for i in numba.prange(none_array_size//batch_size + 1):


This works only if string_array_size is a multiple of 64. What if it is not?
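
A sketch of batch bounds that also handle a size that is not a multiple of the batch size. The generator name `batch_ranges` is hypothetical. Note that the quoted inner loop uses `max(i + batch_size, string_array_size)` as its upper bound, which appears to overshoot; `min` clips the last batch instead.

```python
def batch_ranges(size, batch_size=64):
    """Yield (start, stop) index pairs covering range(size) in batches."""
    n_batches = (size + batch_size - 1) // batch_size  # ceiling division
    for b in range(n_batches):
        start = b * batch_size
        stop = min(start + batch_size, size)  # clip the final, shorter batch
        yield start, stop


# 130 elements: two full batches and one batch of 2.
print(list(batch_ranges(130)))
```

Each batch could then be handled by one iteration of the outer parallel loop, with a plain `range(start, stop)` inside it.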

AlexanderKalistratov · 2019-12-26T21:41:19Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+            new_col_C_data_other = get_dataframe_data(other, 1)
+            new_col_C_data = init_series(new_col_C_data_other)._data
+            new_col_C = fill_str_array(new_col_C_data, len_df+len_other, push_back=False)
+            return init_dataframe(new_col_A, new_col_B, new_col_C, None, 'A', 'B', 'C')


Could you please replace it with pandas.DataFrame(...)?

AlexanderKalistratov · 2019-12-26T21:48:17Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

-        raise TypingError("{} 'numeric_only' unsupported. Given: {}".format(_func_name, axis))
+        data = ', '.join(f'"{column_name}": {column}' for column, column_name in column_list)
+        # TODO: Handle index
+        func_text.append(f"return pandas.DataFrame({{{data}}})\n")


Please add index support. It should work.
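
A sketch of how the generated return line could carry an index. The function name `build_return_line` and the `index` parameter (the name of a variable in the generated code) are hypothetical; only the `{{{data}}}` pattern comes from the diff. In an f-string, `{{` and `}}` produce literal braces, so `{{{data}}}` renders as a dict literal around the substituted text.

```python
def build_return_line(column_list, index=None):
    """Build the final line of the generated function as source text."""
    data = ', '.join(
        f'"{column_name}": {column}' for column, column_name in column_list
    )
    if index is None:
        return f'return pandas.DataFrame({{{data}}})\n'
    # Hypothetical: `index` names a variable defined earlier in the generated code.
    return f'return pandas.DataFrame({{{data}}}, index={index})\n'


columns = [('new_col_A', 'A'), ('new_col_B', 'B')]
print(build_return_line(columns))
print(build_return_line(columns, index='new_index'))
```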

densmirn · 2020-01-10T08:18:43Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    func_definition.extend([indent + func_line for func_line in func_text])
+    func_def = '\n'.join(func_definition)
+
+    global_vars = {'sdc': sdc, 'np': numpy, 'pandas': pandas,


sdc and numpy are not used at all. Please remove them.

AlexanderKalistratov · 2020-01-13T16:32:50Z

sdc/datatypes/hpat_pandas_dataframe_functions.py


 from numba import types
 from numba.extending import (overload, overload_method, overload_attribute)
+from sdc.hiframes.pd_dataframe_ext import DataFrameType


DataFrameType is already imported from the correct module at line 48.

DataFrame.append base implementation

f405df1

akharche added the [WIP] Work in progress label Dec 9, 2019

akharche requested review from AlexanderKalistratov, kozlov-alexey and shssf December 9, 2019 16:11

Added functionality for appending columns with different names

82188fc

akharche added Ready for Review and removed [WIP] Work in progress labels Dec 16, 2019

akharche changed the title ~~DataFrame.append base implementation~~ DataFrame.append Dec 16, 2019

Delete duplicate

7b2ffb6

densmirn reviewed Dec 16, 2019

akharche added 5 commits December 16, 2019 15:16

resolve conflicts

588698c

Merge branch 'master' of https://github.com/IntelPython/sdc into data…

2122fcc

…frame_append

Merge branch 'master' of https://github.com/IntelPython/sdc into data…

8cb66a0

…frame_append

Handle StringArrayType

e664723

Refactor

661ad74

densmirn reviewed Dec 18, 2019

densmirn reviewed Dec 20, 2019

Refactoring

a364c37

AlexanderKalistratov reviewed Dec 20, 2019

akharche added 6 commits December 23, 2019 15:58

Merge branch 'master' of https://github.com/IntelPython/sdc into data…

718fb62

…frame_append

Merge branch 'master' of https://github.com/IntelPython/sdc into data…

be53a39

…frame_append

Separated codegen func+refactoring

c360dd3

Batch iteration to add nans to StringArray

133ef8d

Style fixes

f44d2b2

Create df through rewrite

65c28dc

PokhodenkoSA reviewed Dec 26, 2019

AlexanderKalistratov reviewed Dec 26, 2019

Merge conflicts

d7acd26

Fix appending nones to StringArrayType columns

c8a6430

akharche added the Coverage decreased label Jan 10, 2020

densmirn approved these changes Jan 10, 2020

akharche added 2 commits January 13, 2020 16:15

Fix threads competition cases

80ce11b

Merge branch 'master' into dataframe_append

b9d1f4c

AlexanderKalistratov approved these changes Jan 13, 2020

AlexanderKalistratov merged commit b83641f into IntelPython:master Jan 13, 2020

akharche deleted the dataframe_append branch January 13, 2020 17:39
