This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

Conversation

akharche
Contributor

Extension for #801

- Implementation of a new DataFrame structure based on lists instead of tuples
- Improved df.count() codegen for testing
Example:

df = pd.DataFrame({'A': [1,2,3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

(['A', 'B', 'C', 'D'],)
([array([1, 2, 3], dtype=int64), array([4, 5, 6], dtype=int64)], [array([0.5, 0.6, 0.7])], [array(['a', 'b', 'c'], dtype=object)])

Reproduce:

import pandas as pd
from numba import njit

@njit
def run_df():
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

    print(df._columns)
    print(df._data)

    return df.count()
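The grouped-by-type layout printed above can be modeled in plain Python. This is only an illustrative sketch (the function and variable names are hypothetical, not SDC's actual implementation): columns sharing a dtype land in the same list, so four columns collapse into three typed lists.

```python
import numpy as np

def build_df_structure(data):
    """Group column arrays into per-dtype lists, mimicking the layout
    shown above: columns that share a dtype go into the same list."""
    type_ids = {}   # dtype -> (type_id, list of arrays of that dtype)
    columns = []
    for name, values in data.items():
        arr = np.asarray(values)
        columns.append(name)
        if arr.dtype not in type_ids:
            type_ids[arr.dtype] = (len(type_ids), [])
        type_ids[arr.dtype][1].append(arr)
    data_lists = tuple(arrs for _, arrs in type_ids.values())
    return tuple(columns), data_lists

cols, data = build_df_structure(
    {'A': [1, 2, 3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})
print(cols)  # ('A', 'B', 'C', 'D')
print(data)  # three lists: [A, C] (ints), [B] (floats), [D] (strings)
```

Note that `np.asarray` gives the string column a unicode dtype rather than `object`; the grouping behavior is the point here, not the exact dtypes.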

@pep8speaks

pep8speaks commented Apr 23, 2020

Hello @akharche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-30 18:36:58 UTC

@@ -52,6 +53,9 @@ def generic_resolve(self, df, attr):
return SeriesType(arr_typ.dtype, arr_typ, df.index, True)


Column_id = namedtuple('Column_id', ('type_id', 'col_type_id'))
Contributor

You mixed CamelCase and snake_case in one place. I would propose using CamelCase only for named tuples.
Also, the definition of a named tuple looks better to me in the following representation:

from typing import NamedTuple

class ColumnId(NamedTuple):
    type_id: int
    col_type_id: int

But it's up to you!
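The two styles are interchangeable at runtime. A minimal sketch comparing them (the CamelCase rename follows the comment's suggestion; both produce ordinary tuples with named fields):

```python
from collections import namedtuple
from typing import NamedTuple

# Functional style, as in the PR (renamed to CamelCase per the review):
ColumnIdTuple = namedtuple('ColumnIdTuple', ('type_id', 'col_type_id'))

# Class style with type annotations, as proposed in the comment:
class ColumnId(NamedTuple):
    type_id: int
    col_type_id: int

# Both behave identically: field access, indexing, and tuple equality.
a = ColumnIdTuple(0, 1)
b = ColumnId(0, 1)
print(a.type_id == b.type_id)  # True
print(a == b)                  # True -- tuples compare elementwise
```

The class style additionally documents the field types for readers and type checkers, which is the main argument for it.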

Comment on lines 87 to 90
col_idx_list = len(data_typs_map[col_typ][1])
type_id = data_typs_map[col_typ][0]
df_structure[col_name] = Column_id(type_id, col_idx_list)
data_typs_map[col_typ][1].append(col_id)
Contributor

I would unpack the tuple, something like this:

type_id, col_indices = data_typs_map[col_typ]

col_idx_list = len(col_indices)
df_structure[col_name] = ColumnId(type_id, col_idx_list)
col_indices.append(col_id)

n_cols = len(fe_type.columns)
types_unique = set()
df_types = []
for col_id, col_type in enumerate(fe_type.data):
Contributor

Suggested change:
- for col_id, col_type in enumerate(fe_type.data):
+ for col_type in fe_type.data:

@@ -76,15 +101,23 @@ def codegen(context, builder, signature, args):
dataframe = cgutils.create_struct_proxy(
signature.return_type)(context, builder)

data_list_type = [types.List(typ) for typ in data_typs_map.keys()]
Contributor

Are you sure that the order of types in the dict will always match the order of columns in the DF?

Contributor

Maybe it is more reliable to create an extra list of types, as was done in the model, and then use that list for ordered iteration.
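For context on the ordering question: since Python 3.7, dicts preserve insertion order, so iterating the map's keys yields types in first-seen order. The subtlety is that once a dtype repeats, there are fewer type slots than columns, so key order alone cannot reproduce column order. A minimal illustration (variable names are illustrative, not SDC's actual code):

```python
# dtypes of columns A, B, C, D from the PR's example DataFrame
col_types = ['int64', 'float64', 'int64', 'object']

data_typs_map = {}
for col_id, typ in enumerate(col_types):
    # first-seen insertion order is preserved by the dict
    data_typs_map.setdefault(typ, []).append(col_id)

print(list(data_typs_map))     # ['int64', 'float64', 'object'] -- 3 slots, 4 columns
print(data_typs_map['int64'])  # [0, 2] -- columns A and C share one slot
```

An explicit `types_order` list, as suggested, makes the iteration order a stated invariant instead of a side effect of dict semantics.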

types_order.append(col_typ)
else:
# Get index of column in list of types
type_id, col_indices = data_typs_map[col_typ]
Contributor

type_id looks like a counter. Is it OK that we overwrite it here? Maybe use another name for type_id in the else-block and increment type_id only inside the if-block.
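The suggested separation can be sketched as follows (a hypothetical illustration of the pattern, not the PR's actual code): the running counter keeps its own name and is only advanced when a new type is registered, while the else-branch binds the looked-up id to a distinct name.

```python
data_typs_map = {}
types_order = []
type_id = 0  # running counter: id of the next fresh type slot

for col_id, col_typ in enumerate(['int64', 'float64', 'int64']):
    if col_typ not in data_typs_map:
        data_typs_map[col_typ] = (type_id, [col_id])
        types_order.append(col_typ)
        type_id += 1          # incremented only for new types
    else:
        # distinct name: the counter is never overwritten here
        existing_type_id, col_indices = data_typs_map[col_typ]
        col_indices.append(col_id)

print(type_id)                 # 2 -- two distinct types seen
print(data_typs_map['int64'])  # (0, [0, 2])
```

This keeps the counter's meaning ("number of type slots allocated so far") valid at every point in the loop.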

data_list_type = [types.List(typ) for typ in types_order]

data_lists = []
for typ_id, typ in enumerate(data_typs_map.keys()):
Contributor

Is there no problem with the order of the lists in the tuple?

@akharche
Contributor Author

akharche commented Apr 29, 2020

Performance results of DataFrame with mixed types of columns

| Columns | Current compile time, s | New compile time, s | New/Current |
| ---: | ---: | ---: | ---: |
| 16 | 11.092775 | 9.713415 | 0.875652 |
| 32 | 28.611168 | 20.057998 | 0.701055 |
| 64 | 128.307209 | 67.369364 | 0.525062 |
| 128 | 803.671405 | 310.529887 | 0.386389 |

@densmirn
Contributor

> Performance results of DataFrame with mixed types of columns
>
> | Columns | Current compile time, s | New compile time, s | New/Current |
> | ---: | ---: | ---: | ---: |
> | 4 | 11.092775 | 9.713415 | 0.875652 |
> | 8 | 28.611168 | 20.057998 | 0.701055 |
> | 16 | 128.307209 | 67.369364 | 0.525062 |
> | 32 | 803.671405 | 310.529887 | 0.386389 |

The results look very good.

@AlexanderKalistratov
Collaborator

@akharche the examples failed

@AlexanderKalistratov AlexanderKalistratov merged commit efb0c64 into IntelPython:feature/dataframe_model_refactoring Apr 30, 2020
densmirn added a commit that referenced this pull request May 30, 2020
* Turn on Azure CI for branch (#822)

* Redesign DataFrame structure (#817)

* Merge master (#840)

* Df.at impl (#738)

* Series.add / Series.lt with fill_value (#655)

* Impl Series.skew() (#813)

* Run tests in separate processes (#833)

* Run tests in separate processes

* Take tests list from sdc/tests/__init__.py

* change README (#818)

* change README

* change README for doc

* add refs

* change ref

* change ref

* change ref

* change readme

* Improve boxing (#832)

* Specify sdc version from channel for examples testing (#837)

* Specify sdc version from channel for examples testing

It occurs that conda resolver can take Intel SDC package
not from first channel where it is found.
Specify particular SDC version to avoid this in examples
for now.
Also print info for environment creation and package installing

* Fix incorrectly used f-string

* Fix log_info call

* Numba 0.49.0 all (#824)

* Fix run tests

Remove import of _getitem_array1d

* expectedFailure

* expectedFailure-2

* expectedFailure-3

* Conda recipe numba==0.49

* expectedFailure-4

* Refactor imports from Numba

* Unskip tests

* Fix using of numpy_support.from_dtype()

* Unskip tests

* Fix DataFrame tests with rewrite IR without Del statements

* Unskip tests

* Fix corr_overload with type inference error for none < 1

* Fix hpat_pandas_series_cov with type inference error for none < 2

* Unskip tests

* Unskip tests

* Fixed iternext_series_array with using _getitem_array1d.

_getitem_array1d is replaced with _getitem_array_single_int in Numba 0.49.

* Unskip tests

* Unskip old test

* Fix Series.at

* Unskip tests

* Add decrefs in boxing (#836)

* Adding extension type for pd.RangeIndex (#820)

* Adding extension type for pd.RangeIndex

This commit adds Numba extension types for pandas.RangeIndex class,
allowing creation of pd.RangeIndex objects and passing and returning them
to/from nopython functions.

* Applying review comments

* Fix for PR 831 (#839)

* Update pyarrow version to 0.17.0

Update recipe, code and docs.

* Disable intel channel

* Disable intel channel for testing

* Fix remarks

Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>

* Update to Numba 0.49.1 (#838)

* Update to Numba 0.49.1

* Fix requirements.txt

* Add travis

Co-authored-by: Elena Totmenina <totmeninal@mail.ru>
Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com>
Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com>
Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com>
Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com>
Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>

* Re-implement df.getitem based on new structure (#845)

* Re-implement df.getitem based on new structure

* Re-implemented remaining getitem overloads, add tests

* Re-implement df.values based on new structure (#846)

* Re-implement df.pct_change based on new structure (#847)

* Re-implement df.drop based on new structure (#848)

* Re-implement df.append based on new structure (#857)

* Re-implement df.reset_index based on new structure (#849)

* Re-implement df._set_column based on new structure (#850)

* Re-implement df.rolling methods based on new structure (#852)

* Re-implement df.index based on new structure (#853)

* Re-implement df.copy based on new structure (#854)

* Re-implement df.isna based on new structure (#856)

* Re-implement df.at/iat/loc/iloc based on new structure (#858)

* Re-implement df.head based on new structure (#855)

* Re-implement df.head based on new structure

* Simplify codegen docstring

* Re-implement df.groupby methods based on new structure (#859)

* Re-implement dataframe boxing based on new structure (#861)

* Re-implement DataFrame unboxing (#860)

* Boxing draft

Merge branch 'master' of https://github.com/IntelPython/sdc into merge_master

# Conflicts:
#	sdc/hiframes/pd_dataframe_ext.py
#	sdc/tests/test_dataframe.py

* Implement unboxing in new structure

* Improve variable names + add error handling

* Return error status

* Move getting list size to if_ok block

* Unskipped unexpected success tests

* Unskipped unexpected success tests in GroupBy

* Remove decorators

* Change to incref False

* Skip tests failed due to unimplemented df structure

* Bug in rolling

* Fix rolling (#865)

* Undecorate tests on reading CSV (#866)

* Re-implement df structure: enable rolling tests that pass (#867)

* Re-implement df structure: refactor len (#868)

* Re-implement df structure: refactor len

* Undecorated all the remaining methods

Co-authored-by: Denis <denis.smirnov@intel.com>

* Merge master to feature/dataframe_model_refactoring (#869)

* Enable CI on master

Co-authored-by: Angelina Kharchevnikova <angelina.kharchevnikova@intel.com>
Co-authored-by: Elena Totmenina <totmeninal@mail.ru>
Co-authored-by: Rubtsowa <36762665+Rubtsowa@users.noreply.github.com>
Co-authored-by: Sergey Pokhodenko <sergey.pokhodenko@intel.com>
Co-authored-by: Vyacheslav-Smirnov <51660067+Vyacheslav-Smirnov@users.noreply.github.com>
Co-authored-by: Alexey Kozlov <52973316+kozlov-alexey@users.noreply.github.com>
Co-authored-by: Vyacheslav Smirnov <vyacheslav.s.smirnov@intel.com>