Skip to content
This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Conversation

@densmirn
Copy link
Contributor

No description provided.

min_length = min(arr_len, other_arr_len)
length = max(arr_len, other_arr_len) if size == 'max' else min_length

aligned_arr = numpy.array([numpy.nan] * length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably use numpy.repeat(numpy.nan, length), but I doubt it's parallelized. So it might turn out that using numpy.array is faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Measured res = numpy.array([numpy.nan] * 10 ** 8) vs. numpy.repeat(numpy.nan, 10 ** 8) with SDC_AUTO_PARALLEL = True

name nthreads type size median min max compile boxing
numpy.array 1.0 SDC 100000000.0 1.0090000629425049 1.002000093460083 1.022000074386597 0.09800267219543456 0.09000849723815918
numpy.array 4.0 SDC 100000000.0 1.2319998741149902 1.0130000114440918 2.2649998664855957 0.0879981517791748 0.09700202941894531
numpy.repeat 1.0 SDC 100000000.0 0.4199998378753662 0.3970000743865967 0.49299979209899897 0.15896368026733398 5.0067901611328125e-06
numpy.repeat 4.0 SDC 100000000.0 0.3919999599456787 0.3859999179840088 0.4030001163482666 0.15500760078430176 1.6689300537109375e-06

I don't see scalability at all. numpy.repeat looks faster than numpy.array in this case. So let me replace numpy.array with numpy.repeat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I measured performance based on SDC performance test system:
master...densmirn:perf/array_vs_repeat

kws = {'other': 'None', 'pairwise': 'None', 'ddof': '1'}

if none_other:
return gen_df_rolling_method_other_none_impl('_df_cov', self, kws=kws)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a comment here to explain the need for _df_cov (or maybe add docstrings to cov and _df_cov overloads).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add a comment here.

Copy link
Contributor

@kozlov-alexey kozlov-alexey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this implementation, but it would be good to clarify what's pandas motivation on having different implementations. Should we maybe open an issue?

@densmirn
Copy link
Contributor Author

I'm OK with this implementation, but it would be good to clarify what's pandas motivation on having different implementations. Should we maybe open an issue?

Maybe but currently we are based on Pandas 0.25. So I think firstly needed to check this on Pandas 1.0 before opening the issue. To be honest I would like to spend time on other useful activity if you do not mind. @kozlov-alexey what do you think about that?

@kozlov-alexey
Copy link
Contributor

To be honest I would like to spend time on other useful activity if you do not mind. @kozlov-alexey what do you think about that?

@densmirn I agree of course, we should put this as a low priority.

@densmirn densmirn merged commit b5bbbaa into IntelPython:master Feb 14, 2020
@densmirn densmirn deleted the feature/df_rolling_cov branch June 9, 2020 12:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants