-
Notifications
You must be signed in to change notification settings - Fork 62
Overload df.rolling.cov, add perf.test #547
Overload df.rolling.cov, add perf.test #547
Conversation
…ure/df_rolling_cov
…ure/df_rolling_cov
sdc/datatypes/common_functions.py
Outdated
| min_length = min(arr_len, other_arr_len) | ||
| length = max(arr_len, other_arr_len) if size == 'max' else min_length | ||
|
|
||
| aligned_arr = numpy.array([numpy.nan] * length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably use numpy.repeat(numpy.nan, length), but I doubt it's parallelized. So it might turn out that using numpy.array is faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Measured res = numpy.array([numpy.nan] * 10 ** 8) vs. numpy.repeat(numpy.nan, 10 ** 8) with SDC_AUTO_PARALLEL = True
| name | nthreads | type | size | median | min | max | compile | boxing |
|---|---|---|---|---|---|---|---|---|
| numpy.array | 1.0 | SDC | 100000000.0 | 1.0090000629425049 | 1.002000093460083 | 1.022000074386597 | 0.09800267219543456 | 0.09000849723815918 |
| numpy.array | 4.0 | SDC | 100000000.0 | 1.2319998741149902 | 1.0130000114440918 | 2.2649998664855957 | 0.0879981517791748 | 0.09700202941894531 |
| numpy.repeat | 1.0 | SDC | 100000000.0 | 0.4199998378753662 | 0.3970000743865967 | 0.49299979209899897 | 0.15896368026733398 | 5.0067901611328125e-06 |
| numpy.repeat | 4.0 | SDC | 100000000.0 | 0.3919999599456787 | 0.3859999179840088 | 0.4030001163482666 | 0.15500760078430176 | 1.6689300537109375e-06 |
I don't see scalability at all. numpy.repeat looks faster than numpy.array in this case. So let me replace numpy.array with numpy.repeat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I measured performance based on SDC performance test system:
master...densmirn:perf/array_vs_repeat
| kws = {'other': 'None', 'pairwise': 'None', 'ddof': '1'} | ||
|
|
||
| if none_other: | ||
| return gen_df_rolling_method_other_none_impl('_df_cov', self, kws=kws) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a comment here to explain the need for _df_cov (or maybe add docstrings to cov and _df_cov overloads).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add a comment here.
kozlov-alexey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with this implementation, but it would be good to clarify what's pandas motivation on having different implementations. Should we maybe open an issue?
Maybe but currently we are based on Pandas 0.25. So I think firstly needed to check this on Pandas 1.0 before opening the issue. To be honest I would like to spend time on other useful activity if you do not mind. @kozlov-alexey what do you think about that? |
@densmirn I agree of course, we should put this as a low priority. |
…ure/df_rolling_cov
…ure/df_rolling_cov
…ure/df_rolling_cov
…ure/df_rolling_cov
No description provided.