This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

Conversation

akharche
Contributor

Extension for #801

- Implementation of a new DataFrame structure based on lists instead of tuples
- Improved df.count() codegen for testing
Example:

df = pd.DataFrame({'A': [1,2,3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

(['A', 'B', 'C', 'D'],)
([array([1, 2, 3], dtype=int64), array([4, 5, 6], dtype=int64)], [array([0.5, 0.6, 0.7])], [array(['a', 'b', 'c'], dtype=object)])

Reproduce:

import pandas as pd
from numba import njit

@njit
def run_df():
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

    print(df._columns)
    print(df._data)

    return df.count()
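The grouped-by-type layout printed above can be modeled in plain Python. This is only an illustrative sketch (the function and variable names are hypothetical, not SDC's actual implementation): columns sharing a dtype land in the same list, so four columns collapse into three typed lists.

```python
import numpy as np

def build_df_structure(data):
    """Group column arrays into per-dtype lists, mimicking the layout
    shown above: columns that share a dtype go into the same list."""
    type_ids = {}   # dtype -> (type_id, list of arrays of that dtype)
    columns = []
    for name, values in data.items():
        arr = np.asarray(values)
        columns.append(name)
        if arr.dtype not in type_ids:
            type_ids[arr.dtype] = (len(type_ids), [])
        type_ids[arr.dtype][1].append(arr)
    data_lists = tuple(arrs for _, arrs in type_ids.values())
    return tuple(columns), data_lists

cols, data = build_df_structure(
    {'A': [1, 2, 3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})
print(cols)  # ('A', 'B', 'C', 'D')
print(data)  # three lists: [A, C] (ints), [B] (floats), [D] (strings)
```

Note that `np.asarray` gives the string column a unicode dtype rather than `object`; the grouping behavior is the point here, not the exact dtypes.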

@pep8speaks

pep8speaks commented Apr 23, 2020

Hello @akharche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-30 18:36:58 UTC

@@ -52,6 +53,9 @@ def generic_resolve(self, df, attr):
return SeriesType(arr_typ.dtype, arr_typ, df.index, True)


Column_id = namedtuple('Column_id', ('type_id', 'col_type_id'))
Contributor

You mixed CamelCase and snake_case in one place. I would propose using CamelCase only for named tuples.
Also, the definition of a named tuple looks better to me in the following representation:

from typing import NamedTuple

class ColumnId(NamedTuple):
    type_id: int
    col_type_id: int

But it's up to you!
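The two styles are interchangeable at runtime. A minimal sketch comparing them (the CamelCase rename follows the comment's suggestion; both produce ordinary tuples with named fields):

```python
from collections import namedtuple
from typing import NamedTuple

# Functional style, as in the PR (renamed to CamelCase per the review):
ColumnIdTuple = namedtuple('ColumnIdTuple', ('type_id', 'col_type_id'))

# Class style with type annotations, as proposed in the comment:
class ColumnId(NamedTuple):
    type_id: int
    col_type_id: int

# Both behave identically: field access, indexing, and tuple equality.
a = ColumnIdTuple(0, 1)
b = ColumnId(0, 1)
print(a.type_id == b.type_id)  # True
print(a == b)                  # True -- tuples compare elementwise
```

The class style additionally documents the field types for readers and type checkers, which is the main argument for it.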

Comment on lines 87 to 90
col_idx_list = len(data_typs_map[col_typ][1])
type_id = data_typs_map[col_typ][0]
df_structure[col_name] = Column_id(type_id, col_idx_list)
data_typs_map[col_typ][1].append(col_id)
Contributor

I would unpack the tuple, something like this:

type_id, col_indices = data_typs_map[col_typ]

col_idx_list = len(col_indices)
df_structure[col_name] = ColumnId(type_id, col_idx_list)
col_indices.append(col_id)

n_cols = len(fe_type.columns)
types_unique = set()
df_types = []
for col_id, col_type in enumerate(fe_type.data):
Contributor

Suggested change:
- for col_id, col_type in enumerate(fe_type.data):
+ for col_type in fe_type.data:

@@ -76,15 +101,23 @@ def codegen(context, builder, signature, args):
dataframe = cgutils.create_struct_proxy(
signature.return_type)(context, builder)

data_list_type = [types.List(typ) for typ in data_typs_map.keys()]
Contributor

Are you sure that the order of types in the dict will always match the order of columns in the DF?

Contributor

Maybe it is more reliable to create an extra list of types, as was done in the model, and then use that list for ordered iteration.
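For context on the ordering question: since Python 3.7, dicts preserve insertion order, so iterating the map's keys yields types in first-seen order. The subtlety is that once a dtype repeats, there are fewer type slots than columns, so key order alone cannot reproduce column order. A minimal illustration (variable names are illustrative, not SDC's actual code):

```python
# dtypes of columns A, B, C, D from the PR's example DataFrame
col_types = ['int64', 'float64', 'int64', 'object']

data_typs_map = {}
for col_id, typ in enumerate(col_types):
    # first-seen insertion order is preserved by the dict
    data_typs_map.setdefault(typ, []).append(col_id)

print(list(data_typs_map))     # ['int64', 'float64', 'object'] -- 3 slots, 4 columns
print(data_typs_map['int64'])  # [0, 2] -- columns A and C share one slot
```

An explicit `types_order` list, as suggested, makes the iteration order a stated invariant instead of a side effect of dict semantics.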

types_order.append(col_typ)
else:
# Get index of column in list of types
type_id, col_indices = data_typs_map[col_typ]
Contributor

type_id looks like a counter. Is it OK that we overwrite it here? Maybe use another name for type_id in the else-block and increment type_id only inside the if-block.
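The suggested separation can be sketched as follows (a hypothetical illustration of the pattern, not the PR's actual code): the running counter keeps its own name and is only advanced when a new type is registered, while the else-branch binds the looked-up id to a distinct name.

```python
data_typs_map = {}
types_order = []
type_id = 0  # running counter: id of the next fresh type slot

for col_id, col_typ in enumerate(['int64', 'float64', 'int64']):
    if col_typ not in data_typs_map:
        data_typs_map[col_typ] = (type_id, [col_id])
        types_order.append(col_typ)
        type_id += 1          # incremented only for new types
    else:
        # distinct name: the counter is never overwritten here
        existing_type_id, col_indices = data_typs_map[col_typ]
        col_indices.append(col_id)

print(type_id)                 # 2 -- two distinct types seen
print(data_typs_map['int64'])  # (0, [0, 2])
```

This keeps the counter's meaning ("number of type slots allocated so far") valid at every point in the loop.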

data_list_type = [types.List(typ) for typ in types_order]

data_lists = []
for typ_id, typ in enumerate(data_typs_map.keys()):
Contributor

Is there no problem with the order of the lists in the tuple?

@akharche
Contributor Author

akharche commented Apr 29, 2020

Performance results of DataFrame with mixed types of columns

| Columns | Current compile time, s | New compile time, s | New/Current |
| ---: | ---: | ---: | ---: |
| 16 | 11.092775 | 9.713415 | 0.875652 |
| 32 | 28.611168 | 20.057998 | 0.701055 |
| 64 | 128.307209 | 67.369364 | 0.525062 |
| 128 | 803.671405 | 310.529887 | 0.386389 |

@densmirn
Contributor

> Performance results of DataFrame with mixed types of columns
>
> | Columns | Current compile time, s | New compile time, s | New/Current |
> | ---: | ---: | ---: | ---: |
> | 4 | 11.092775 | 9.713415 | 0.875652 |
> | 8 | 28.611168 | 20.057998 | 0.701055 |
> | 16 | 128.307209 | 67.369364 | 0.525062 |
> | 32 | 803.671405 | 310.529887 | 0.386389 |

The results look very good.

@AlexanderKalistratov
Collaborator

@akharche the examples failed

@AlexanderKalistratov AlexanderKalistratov merged commit efb0c64 into IntelPython:feature/dataframe_model_refactoring Apr 30, 2020
densmirn added a commit that referenced this pull request May 30, 2020
* Turn on Azure CI for branch (#822)

* Redesign DataFrame structure (#817)

* Merge master (#840)

* Df.at impl (#738)

* Series.add / Series.lt with fill_value (#655)

* Impl Series.skew() (#813)

* Run tests in separate processes (#833)

* Run tests in separate processes

* Take tests list from sdc/tests/__init__.py

* change README (#818)

* change README

* change README for doc

* add refs

* change ref

* change ref

* change ref

* change readme

* Improve boxing (#832)

* Specify sdc version from channel for examples testing (#837)

* Specify sdc version from channel for examples testing

It occurs that conda resolver can take Intel SDC package
not from first channel where it is found.
Specify particular SDC version to avoid this in examples
for now.
Also print info for environment creation and package installing

* Fix incorrectly used f-string

* Fix log_info call

* Numba 0.49.0 all (#824)

* Fix run tests

Remove import of _getitem_array1d

* expectedFailure

* expectedFailure-2

* expectedFailure-3

* Conda recipe numba==0.49

* expectedFailure-4

* Refactor imports from Numba

* Unskip tests

* Fix using of numpy_support.from_dtype()

* Unskip tests

* Fix DataFrame tests with rewrite IR without Del statements

* Unskip tests

* Fix corr_overload with type inference error for none < 1

* Fix hpat_pandas_series_cov with type inference error for none < 2

* Unskip tests

* Unskip tests

* Fixed iternext_series_array with using _getitem_array1d.

_getitem_array1d is replaced with _getitem_array_single_int in Numba 0.49.

* Unskip tests

* Unskip old test

* Fix Series.at

* Unskip tests

* Add decrefs in boxing (#836)

* Adding extension type for pd.RangeIndex (#820)

* Adding extension type for pd.RangeIndex

This commit adds Numba extension types for pandas.RangeIndex class,
allowing creation of pd.RangeIndex objects and passing and returning them
to/from nopython functions.

* Applying review comments

* Fix for PR 831 (#839)

* Update pyarrow version to 0.17.0

Update recipe, code and docs.

* Disable intel channel

* Disable intel channel for testing

* Fix remarks

Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>

* Update to Numba 0.49.1 (#838)

* Update to Numba 0.49.1

* Fix requirements.txt

* Add travis

Co-authored-by: Elena Totmenina <totmeninal@mail.ru>
Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com>
Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com>
Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com>
Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com>
Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>

* Re-implement df.getitem based on new structure (#845)

* Re-implement df.getitem based on new structure

* Re-implemented remaining getitem overloads, add tests

* Re-implement df.values based on new structure (#846)

* Re-implement df.pct_change based on new structure (#847)

* Re-implement df.drop based on new structure (#848)

* Re-implement df.append based on new structure (#857)

* Re-implement df.reset_index based on new structure (#849)

* Re-implement df._set_column based on new structure (#850)

* Re-implement df.rolling methods based on new structure (#852)

* Re-implement df.index based on new structure (#853)

* Re-implement df.copy based on new structure (#854)

* Re-implement df.isna based on new structure (#856)

* Re-implement df.at/iat/loc/iloc based on new structure (#858)

* Re-implement df.head based on new structure (#855)

* Re-implement df.head based on new structure

* Simplify codegen docstring

* Re-implement df.groupby methods based on new structure (#859)

* Re-implement dataframe boxing based on new structure (#861)

* Re-implement DataFrame unboxing (#860)

* Boxing draft

Merge branch 'master' of https://github.com/IntelPython/sdc into merge_master

# Conflicts:
#	sdc/hiframes/pd_dataframe_ext.py
#	sdc/tests/test_dataframe.py

* Implement unboxing in new structure

* Improve variable names + add error handling

* Return error status

* Move getting list size to if_ok block

* Unskipped unexpected success tests

* Unskipped unexpected success tests in GroupBy

* Remove decorators

* Change to incref False

* Skip tests failed due to unimplemented df structure

* Bug in rolling

* Fix rolling (#865)

* Undecorate tests on reading CSV (#866)

* Re-implement df structure: enable rolling tests that pass (#867)

* Re-implement df structure: refactor len (#868)

* Re-implement df structure: refactor len

* Undecorated all the remaining methods

Co-authored-by: Denis <denis.smirnov@intel.com>

* Merge master to feature/dataframe_model_refactoring (#869)

* Enable CI on master

Co-authored-by: Angelina Kharchevnikova <angelina.kharchevnikova@intel.com>
Co-authored-by: Elena Totmenina <totmeninal@mail.ru>
Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com>
Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com>
Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com>
Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com>
Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>