-
Notifications
You must be signed in to change notification settings - Fork 61
Redesign DataFrame structure #817
Redesign DataFrame structure #817
Conversation
sdc/hiframes/pd_dataframe_ext.py
Outdated
@@ -52,6 +53,9 @@ def generic_resolve(self, df, attr): | |||
return SeriesType(arr_typ.dtype, arr_typ, df.index, True) | |||
|
|||
|
|||
Column_id = namedtuple('Column_id', ('type_id', 'col_type_id')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mixed CamelCase and snake_case in one palce. I would propose to use CamelCase only for named tuples.
Also the definition of named tuple for me looks better in the following representation:
from typing import NamedTuple
class ColumnId(NamedTuple):
type_id: int
col_type_id: int
But it's up to you!
sdc/hiframes/pd_dataframe_ext.py
Outdated
col_idx_list = len(data_typs_map[col_typ][1]) | ||
type_id = data_typs_map[col_typ][0] | ||
df_structure[col_name] = Column_id(type_id, col_idx_list) | ||
data_typs_map[col_typ][1].append(col_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would unpack tuple something like this:
type_id, col_indices = data_typs_map[col_typ]
col_idx_list = len(col_indices)
df_structure[col_name] = ColumnId(type_id, col_idx_list)
col_indices.append(col_id)
sdc/hiframes/pd_dataframe_type.py
Outdated
n_cols = len(fe_type.columns) | ||
types_unique = set() | ||
df_types = [] | ||
for col_id, col_type in enumerate(fe_type.data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for col_id, col_type in enumerate(fe_type.data): | |
for col_type in fe_type.data: |
sdc/hiframes/pd_dataframe_ext.py
Outdated
@@ -76,15 +101,23 @@ def codegen(context, builder, signature, args): | |||
dataframe = cgutils.create_struct_proxy( | |||
signature.return_type)(context, builder) | |||
|
|||
data_list_type = [types.List(typ) for typ in data_typs_map.keys()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that the order of types in the dict will always match the order of columns in DF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe more reliable to create extra list of types as it was done in the model. Then use the list for ordered iterating.
sdc/hiframes/pd_dataframe_ext.py
Outdated
types_order.append(col_typ) | ||
else: | ||
# Get index of column in list of types | ||
type_id, col_indices = data_typs_map[col_typ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type_id
looks like a counter. Is it ok that we rewrite it here? Maybe use another name for type_id
in else-block and increment type_id
only inside of if-block.
sdc/hiframes/pd_dataframe_ext.py
Outdated
data_list_type = [types.List(typ) for typ in types_order] | ||
|
||
data_lists = [] | ||
for typ_id, typ in enumerate(data_typs_map.keys()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no problem with order of lists in the tuple?
Performance results of DataFrame with mixed types of columns
|
The results look very good. |
@akharche examples failed |
efb0c64
into
IntelPython:feature/dataframe_model_refactoring
* Turn on Azure CI for branch (#822) * Redesign DataFrame structure (#817) * Merge master (#840) * Df.at impl (#738) * Series.add / Series.lt with fill_value (#655) * Impl Series.skew() (#813) * Run tests in separate processes (#833) * Run tests in separate processes * Take tests list from sdc/tests/__init__.py * change README (#818) * change README * change README for doc * add refs * change ref * change ref * change ref * change readme * Improve boxing (#832) * Specify sdc version from channel for examples testing (#837) * Specify sdc version from channel for examples testing It occurs that conda resolver can take Intel SDC package not from first channel where it is found. Specify particular SDC version to avoid this in examples for now. Also print info for environment creation and package installing * Fix incerrectly used f-string * Fix log_info call * Numba 0.49.0 all (#824) * Fix run tests Remove import of _getitem_array1d * expectedFailure * expectedFailure-2 * expectedFailure-3 * Conda recipe numba==0.49 * expectedFailure-4 * Refactor imports from Numba * Unskip tests * Fix using of numpy_support.from_dtype() * Unskip tests * Fix DataFrame tests with rewrite IR without Del statements * Unskip tests * Fix corr_overload with type inference error for none < 1 * Fix hpat_pandas_series_cov with type inference error for none < 2 * Unskip tests * Unskip tests * Fixed iternext_series_array with using _getitem_array1d. _getitem_array1d is replaced with _getitem_array_single_int in Numba 0.49. * Unskip tests * Unskip old test * Fix Series.at * Unskip tests * Add decrefs in boxing (#836) * Adding extension type for pd.RangeIndex (#820) * Adding extension type for pd.RangeIndex This commit adds Numba extension types for pandas.RangeIndex class, allowing creation of pd.RangeIndex objects and passing and returning them to/from nopython functions. * Applying review comments * Fix for PR 831 (#839) * Update pyarrow version to 0.17.0 Update recipe, code and docs. * Disable intel channel * Disable intel channel for testing * Fix remarks Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com> * Update to Numba 0.49.1 (#838) * Update to Numba 0.49.1 * Fix requirements.txt * Add travis Co-authored-by: Elena Totmenina <totmeninal@mail.ru> Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com> Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com> Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com> Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com> Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com> * Re-implement df.getitem based on new structure (#845) * Re-implement df.getitem based on new structure * Re-implemented remaining getitem overloads, add tests * Re-implement df.values based on new structure (#846) * Re-implement df.pct_change based on new structure (#847) * Re-implement df.drop based on new structure (#848) * Re-implement df.append based on new structure (#857) * Re-implement df.reset_index based on new structure (#849) * Re-implement df._set_column based on new strcture (#850) * Re-implement df.rolling methods based on new structure (#852) * Re-implement df.index based on new structure (#853) * Re-implement df.copy based on new structure (#854) * Re-implement df.isna based on new structure (#856) * Re-implement df.at/iat/loc/iloc based on new structure (#858) * Re-implement df.head based on new structure (#855) * Re-implement df.head based on new structure * Simplify codegen docstring * Re-implement df.groupby methods based on new structure (#859) * Re-implement dataframe boxing based on new structure (#861) * Re-implement DataFrame unboxing (#860) * Boxing draft Merge branch 'master' of https://github.com/IntelPython/sdc into merge_master # Conflicts: # sdc/hiframes/pd_dataframe_ext.py # sdc/tests/test_dataframe.py * Implement unboxing in new structure * Improve variable names + add error handling * Return error status * Move getting list size to if_ok block * Unskipped unexpected success tests * Unskipped unexpected success tests in GroupBy * Remove decorators * Change to incref False * Skip tests failed due to unimplemented df structure * Bug in rolling * Fix rolling (#865) * Undecorate tests on reading CSV (#866) * Re-implement df structure: enable rolling tests that pass (#867) * Re-implement df structure: refactor len (#868) * Re-implement df structure: refactor len * Undecorated all the remaining methods Co-authored-by: Denis <denis.smirnov@intel.com> * Merge master to feature/dataframe_model_refactoring (#869) * Enable CI on master Co-authored-by: Angelina Kharchevnikova <angelina.kharchevnikova@intel.com> Co-authored-by: Elena Totmenina <totmeninal@mail.ru> Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com> Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com> Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com> Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com> Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>
Extension for #801
Implementation of new DataFrame structure based on lists instead of tuples
Improved df.count() codegen for testing
Example:
Reproduce: