This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Conversation

@kozlov-alexey
Contributor

@kozlov-alexey kozlov-alexey commented Oct 19, 2020

Motivation: init_dataframe was implemented via a Numba intrinsic taking *args,
which seems to generate redundant extractvalue/insertvalue LLVM
instructions. This produces IR that grows quadratically with the number of DF columns
and increases the total compilation time of functions that create large DFs. This PR
replaces the single init_dataframe with multiple functions, one per number of columns
in a DF, which are generated at compile time, thus avoiding the use of *args.
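
The idea of generating one fixed-arity function per column count can be illustrated in plain Python. This is only a minimal sketch of the general technique (building function source from a template and compiling it with `exec`), not the code from this PR: the names `make_init_dataframe` and `get_init_dataframe` are hypothetical, the body simply packs its arguments into a tuple, and no Numba intrinsic is involved.

```python
def make_init_dataframe(n_columns):
    """Build a function that takes exactly n_columns positional arguments.

    Hypothetical helper for illustration only; the real SDC code generates
    Numba intrinsics, which is not reproduced here.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")

    arg_names = [f"col_{i}" for i in range(n_columns)]
    func_name = f"init_dataframe_{n_columns}"
    args = ", ".join(arg_names)

    # The generated signature lists every column explicitly, so no *args is needed.
    source = (
        f"def {func_name}({args}):\n"
        f"    return ({args},)\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace[func_name]


_generated = {}


def get_init_dataframe(n_columns):
    """Return the generated function for n_columns, creating it on first use."""
    if n_columns not in _generated:
        _generated[n_columns] = make_init_dataframe(n_columns)
    return _generated[n_columns]


init_df_3 = get_init_dataframe(3)
result = init_df_3([1, 2], [3.0, 4.0], ["a", "b"])
```

Caching by column count means each arity is generated only once, which mirrors the intent of creating the specialised functions on demand at compile time.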

| n_columns | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|
| LLVM IR size, MB (master) | 0.287622 | 0.55394 | 1.262865 | 3.383549 | 10.44003 | 35.79943 | 131.384 |
| LLVM IR size, MB (with PR #936) | 0.143275 | 0.209119 | 0.341938 | 0.608992 | 1.143528 | 2.220672 | 4.406426 |
| IR size ratio (master / PR) | 2.007482 | 2.648924 | 3.693257 | 5.555986 | 9.12967 | 16.12099 | 29.81645 |
| Compilation time, s (master) | 0.521313 | 0.366884 | 0.67621 | 1.39326 | 4.603106 | 17.54948 | 126.7943 |
| Compilation time, s (with PR #936) | 0.683099 | 0.413965 | 0.450348 | 0.715598 | 1.454044 | 3.210638 | 6.943996 |
| Compilation time ratio (master / PR) | 0.763159 | 0.886268 | 1.501529 | 1.946987 | 3.165726 | 5.466041 | 18.25956 |
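
One way to read the IR size rows is to look at how much the size grows each time the column count doubles: a factor near 2 suggests linear growth, a factor near 4 suggests quadratic growth. The sketch below only recomputes this from the numbers reported above; the helper name `doubling_growth` is introduced here for illustration.

```python
n_columns = [8, 16, 32, 64, 128, 256, 512]

# LLVM IR sizes copied from the measurements above.
ir_master = [0.287622, 0.55394, 1.262865, 3.383549, 10.44003, 35.79943, 131.384]
ir_pr = [0.143275, 0.209119, 0.341938, 0.608992, 1.143528, 2.220672, 4.406426]


def doubling_growth(values):
    """Growth factor between consecutive measurements (column count doubles each step)."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


growth_master = doubling_growth(ir_master)
growth_pr = doubling_growth(ir_pr)

for n, g_master, g_pr in zip(n_columns[1:], growth_master, growth_pr):
    print(f"{n:>4} columns: master x{g_master:.2f}, PR x{g_pr:.2f}")

# Ratio of IR sizes at the largest measured column count.
ratio_512 = ir_master[-1] / ir_pr[-1]
```

For the last doubling (256 to 512 columns) the factor on master is roughly 3.7, while with the PR it is roughly 2, which is consistent with the quadratic versus linear behaviour described in the motivation.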

@pep8speaks

pep8speaks commented Oct 19, 2020

Hello @kozlov-alexey! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-13 15:02:36 UTC

@kozlov-alexey kozlov-alexey force-pushed the feature/reduce_df_ctor_ir_size branch from d2a6b7e to 45bbc80 Compare October 19, 2020 01:02
@kozlov-alexey kozlov-alexey added the "Waiting other PR" label (This PR depends on functionality to be merged in other PR) Oct 19, 2020
@kozlov-alexey
Contributor Author

The read_csv test failures with:

Failed in nopython mode pipeline (step: nopython rewrites)
module 'sdc.hiframes.pd_dataframe_ext' has no attribute 'init_dataframe'

are expected, because this PR requires changes from #918, which was recently rolled back. This PR will stay blocked until #918 is restored.

@kozlov-alexey kozlov-alexey removed the "Waiting other PR" label (This PR depends on functionality to be merged in other PR) Nov 11, 2020
@kozlov-alexey kozlov-alexey requested review from AlexanderKalistratov, Hardcode84 and densmirn and removed request for densmirn November 11, 2020 16:31
@AlexanderKalistratov
Collaborator

@kozlov-alexey @xaleryb win 3.6 build fails with svml error again:

test_series_apply_np (sdc.tests.test_series.TestSeries) ... LLVM ERROR: Symbol not found: __svml_log4_ha

@kozlov-alexey kozlov-alexey force-pushed the feature/reduce_df_ctor_ir_size branch from 7448111 to 473d773 Compare November 13, 2020 15:02
@kozlov-alexey
Contributor Author

> @kozlov-alexey @xaleryb win 3.6 build fails with svml error again:
>
> test_series_apply_np (sdc.tests.test_series.TestSeries) ... LLVM ERROR: Symbol not found: __svml_log4_ha

I think something's wrong with the packages being used (mkl and many others are installed from public channels rather than built). Could this be the reason?


3 participants