Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixing rounding errors in to_dataframe in wfdb records #456

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

lamawmouk
Copy link
Member

This pull request fixes issue #444.

This PR refactors the index generation logic in the to_dataframe method in BaseRecord class

The current code uses the freq argument in the pd.date_range and pd.timedelta_range functions to generate the index for the DataFrame. However, using end argument improves accuracy of the time without being off by 0.287sec

>>>r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')`
>>> str(r.base_datetime)
'2148-08-16 09:00:17.566000'
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320232945 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336239988 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352247031 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368254074 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384261117 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020
[6661120 rows x 7 columns]
>>> str(r.get_absolute_time(6661119)
'2148-08-17 14:37:22.384920'

The get_absolute_time is correct to the nearest microsecond 14:37:22.384920 and the to_dataframe() is correct to the nearest nanosecond 14:37:22.384261117
Please review the PR at your convenience.

@bemoody
Copy link
Collaborator

bemoody commented Jun 12, 2023

Thanks, Lama!

I have to nitpick here, though.

It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)

So this code is much more accurate than before but still has the same fundamental problem.

Here's how I figure the timestamp:

  • base time is 9:00:17.566 = (9 * 3600 + 17.566) seconds
  • interval between start and end is (6661119 / 62.4725) seconds
$ bc -l
9 * 3600 + 17.566 + 6661119 / 62.4725
139042.38492032494297490895
last - (24 + 14) * 3600 - 37 * 60
22.38492032494297490895

so the end time in nanseconds should be 14:37:22.384920325.

(On the other hand, 9 * 3600 + 17.566 + 6661119 * 0.016007043 - (24 + 14) * 3600 - 37 * 60 = 22.384261117 exactly.)

I think this is a bug in pandas. But still this is an improvement and good enough for many practical purposes.

@bemoody
Copy link
Collaborator

bemoody commented Jun 12, 2023

This also fails in the case where base_datetime is None. Try it out with a record that doesn't have a base time, such as one of the records from mitdb. If base_datetime is None then we would want to use get_elapsed_time instead of get_absolute_time.

@lamawmouk
Copy link
Member Author

Thank you Benjamin!

I have updated the changes. Please merge at your convenience.

@bemoody
Copy link
Collaborator

bemoody commented Jun 21, 2023

It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)

I get different results (on pandas main branch as well as on pandas 1.1.5), so perhaps this issue has been fixed?

Lama, what do you get if you run the following?

import pandas, platform
print(platform.python_implementation(), platform.python_version(), platform.platform())
print(pandas.__version__)
print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))

@lamawmouk
Copy link
Member Author

Lama, what do you get if you run the following?

import pandas, platform
print(platform.python_implementation(), platform.python_version(), platform.platform())
print(pandas.__version__)
print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))

Below is my output:

>>> print(platform.python_implementation(), platform.python_version(), platform.platform())
CPython 3.8.10 Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
>>> print(pandas.__version__)
2.0.2
>>> print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
TimedeltaIndex([          '0 days 00:00:00', '0 days 00:00:00.000000016',
               '0 days 00:00:00.000000033', '0 days 00:00:00.000000050',
               '0 days 00:00:00.000000066', '0 days 00:00:00.000000083',
               '0 days 00:00:00.000000100'],
              dtype='timedelta64[ns]', freq=None)

My pandas version is 2.0.2 and has timedelta64[ns]

@bemoody
Copy link
Collaborator

bemoody commented Jun 21, 2023

Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)?

Here's what I get:

>>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320891827 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336898870 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352905913 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368912956 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384920000 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020

[6661120 rows x 7 columns]

@lamawmouk
Copy link
Member Author

Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)?
Yes, I reran it in a new environment.

>>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320232945 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336239988 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352247031 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368254074 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384261117 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020

[6661120 rows x 7 columns]
>>> import pandas, platform
>>> print(platform.python_implementation(), platform.python_version(), platform.platform())
CPython 3.9.13 Windows-10-10.0.19045-SP0
>>> print(pandas.__version__)
2.0.2
>>> print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
TimedeltaIndex([          '0 days 00:00:00', '0 days 00:00:00.000000016',
                '0 days 00:00:00.000000033', '0 days 00:00:00.000000050',
                '0 days 00:00:00.000000066', '0 days 00:00:00.000000083',
                '0 days 00:00:00.000000100'],
               dtype='timedelta64[ns]', freq=None)

@bemoody
Copy link
Collaborator

bemoody commented Jun 22, 2023

Thanks for checking. It is really strange that we're seeing different results. While I'd like to get to the bottom of this, I don't really have time to debug it right now.

It looks like the sequence of values originates from np.linspace and therefore might depend on the version of numpy or how it was compiled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants