fixing rounding errors in to_dataframe in wfdb records #456

lamawmouk · 2023-06-09T21:41:14Z

This pull request fixes issue #444.

This PR refactors the index generation logic in the to_dataframe method in BaseRecord class

The current code uses the freq argument in the pd.date_range and pd.timedelta_range functions to generate the index for the DataFrame. However, using end argument improves accuracy of the time without being off by 0.287sec

>>>r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')`
>>> str(r.base_datetime)
'2148-08-16 09:00:17.566000'
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320232945 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336239988 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352247031 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368254074 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384261117 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020
[6661120 rows x 7 columns]
>>> str(r.get_absolute_time(6661119)
'2148-08-17 14:37:22.384920'

The get_absolute_time is correct to the nearest microsecond 14:37:22.384920 and the to_dataframe() is correct to the nearest nanosecond 14:37:22.384261117
Please review the PR at your convenience.

bemoody · 2023-06-12T18:47:03Z

Thanks, Lama!

I have to nitpick here, though.

It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)

So this code is much more accurate than before but still has the same fundamental problem.

Here's how I figure the timestamp:

base time is 9:00:17.566 = (9 * 3600 + 17.566) seconds
interval between start and end is (6661119 / 62.4725) seconds

$ bc -l
9 * 3600 + 17.566 + 6661119 / 62.4725
139042.38492032494297490895
last - (24 + 14) * 3600 - 37 * 60
22.38492032494297490895

so the end time in nanseconds should be 14:37:22.384920325.

(On the other hand, 9 * 3600 + 17.566 + 6661119 * 0.016007043 - (24 + 14) * 3600 - 37 * 60 = 22.384261117 exactly.)

I think this is a bug in pandas. But still this is an improvement and good enough for many practical purposes.

bemoody · 2023-06-12T18:57:21Z

This also fails in the case where base_datetime is None. Try it out with a record that doesn't have a base time, such as one of the records from mitdb. If base_datetime is None then we would want to use get_elapsed_time instead of get_absolute_time.

lamawmouk · 2023-06-13T19:03:53Z

Thank you Benjamin!

I have updated the changes. Please merge at your convenience.

bemoody · 2023-06-21T18:07:22Z

It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)

I get different results (on pandas main branch as well as on pandas 1.1.5), so perhaps this issue has been fixed?

Lama, what do you get if you run the following?

import pandas, platform
print(platform.python_implementation(), platform.python_version(), platform.platform())
print(pandas.__version__)
print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))

lamawmouk · 2023-06-21T19:45:45Z

Lama, what do you get if you run the following?

import pandas, platform
print(platform.python_implementation(), platform.python_version(), platform.platform())
print(pandas.__version__)
print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))

Below is my output:

>>> print(platform.python_implementation(), platform.python_version(), platform.platform())
CPython 3.8.10 Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
>>> print(pandas.__version__)
2.0.2
>>> print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
TimedeltaIndex([          '0 days 00:00:00', '0 days 00:00:00.000000016',
               '0 days 00:00:00.000000033', '0 days 00:00:00.000000050',
               '0 days 00:00:00.000000066', '0 days 00:00:00.000000083',
               '0 days 00:00:00.000000100'],
              dtype='timedelta64[ns]', freq=None)

My pandas version is 2.0.2 and has timedelta64[ns]

bemoody · 2023-06-21T20:15:32Z

Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)?

Here's what I get:

>>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320891827 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336898870 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352905913 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368912956 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384920000 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020

[6661120 rows x 7 columns]

lamawmouk · 2023-06-21T22:13:18Z

Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)?
Yes, I reran it in a new environment.

>>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320232945 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336239988 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352247031 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368254074 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384261117 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020

[6661120 rows x 7 columns]
>>> import pandas, platform
>>> print(platform.python_implementation(), platform.python_version(), platform.platform())
CPython 3.9.13 Windows-10-10.0.19045-SP0
>>> print(pandas.__version__)
2.0.2
>>> print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
TimedeltaIndex([          '0 days 00:00:00', '0 days 00:00:00.000000016',
                '0 days 00:00:00.000000033', '0 days 00:00:00.000000050',
                '0 days 00:00:00.000000066', '0 days 00:00:00.000000083',
                '0 days 00:00:00.000000100'],
               dtype='timedelta64[ns]', freq=None)

bemoody · 2023-06-22T16:54:36Z

Thanks for checking. It is really strange that we're seeing different results. While I'd like to get to the bottom of this, I don't really have time to debug it right now.

It looks like the sequence of values originates from np.linspace and therefore might depend on the version of numpy or how it was compiled.

fixing rounding erros in to_dataframe

6e545cf

lamawmouk added 2 commits June 13, 2023 14:44

adds the timedelta if no base time is provided

f71c8b6

updates code format

b9d343f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixing rounding errors in to_dataframe in wfdb records #456

fixing rounding errors in to_dataframe in wfdb records #456

lamawmouk commented Jun 9, 2023

bemoody commented Jun 12, 2023

bemoody commented Jun 12, 2023

lamawmouk commented Jun 13, 2023

bemoody commented Jun 21, 2023 •

edited

Loading

lamawmouk commented Jun 21, 2023

bemoody commented Jun 21, 2023

lamawmouk commented Jun 21, 2023

bemoody commented Jun 22, 2023

fixing rounding errors in to_dataframe in wfdb records #456

Are you sure you want to change the base?

fixing rounding errors in to_dataframe in wfdb records #456

Conversation

lamawmouk commented Jun 9, 2023

bemoody commented Jun 12, 2023

bemoody commented Jun 12, 2023

lamawmouk commented Jun 13, 2023

bemoody commented Jun 21, 2023 • edited Loading

lamawmouk commented Jun 21, 2023

bemoody commented Jun 21, 2023

lamawmouk commented Jun 21, 2023

bemoody commented Jun 22, 2023

bemoody commented Jun 21, 2023 •

edited

Loading