exit code -1073741819 (0xC0000005) & Out of bounds read to internal chunk buffer of size 65536 #1721

trinhtrannp · 2020-07-13T19:39:32Z

Hi teams,

I'm currently trying tiledb (Python) with historical FX data, which looks like this.

date bidopen bidclose bidhigh ... askhigh asklow tickqty
2020-07-01 00:00:00 1.12330 1.12319 1.12331 ... 1.12342 1.12330 107
2020-07-01 00:01:00 1.12319 1.12304 1.12320 ... 1.12332 1.12315 243
2020-07-01 00:02:00 1.12304 1.12297 1.12305 ... 1.12317 1.12308 168
2020-07-01 00:03:00 1.12297 1.12288 1.12300 ... 1.12312 1.12291 156
2020-07-01 00:04:00 1.12288 1.12286 1.12289 ... 1.12302 1.12293 195
2020-07-01 00:05:00 1.12286 1.12286 1.12290 ... 1.12302 1.12297 109
2020-07-01 00:06:00 1.12286 1.12302 1.12302 ... 1.12315 1.12297 111
2020-07-01 00:07:00 1.12302 1.12293 1.12304 ... 1.12317 1.12306 82
2020-07-01 00:08:00 1.12293 1.12287 1.12293 ... 1.12307 1.12298 63
2020-07-01 00:09:00 1.12287 1.12285 1.12293 ... 1.12305 1.12291 165

I have one data frame per month, each with about 30k rows.

The storing process goes well: I checked the total row count in tiledb and no data is missing. However, I see this log message many times.

[2020-07-13 21:30:37.845] [tiledb] [Process: 24960] [Thread: 29992] [error] [TileDB::Dimension] Error: Out of bounds read to internal chunk buffer of size 65536

This happens even though I have already set the chunksize.

existed = tiledb.highlevel.array_exists(uri)
tiledb.from_pandas(uri, df, 
  sparse=True,
  mode='append' if existed else 'ingest',
  tile_order='row_major',
  cell_order='row_major',
  allows_duplicates=True,
  attrs_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000),
  coords_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000)
)

Also, when I tried to persist a larger data set (from 2007 until now), the process exited with exit code -1073741819 (0xC0000005), which suggests that tiledb may have tried to access an invalid memory address.
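For reference, the two values quoted for the exit code are the same 32-bit pattern: -1073741819 is the signed reading and 0xC0000005 the unsigned one, which is the Windows status code for an access violation. A minimal sketch of the conversion follows; the helper name `to_ntstatus` is hypothetical and is not part of tiledb or the standard library.

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF maps a negative value onto its
    # two's-complement 32-bit pattern; non-negative values are unchanged.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741819))  # prints 0xC0000005
```

The same masking works on the `returncode` of a child process started with `subprocess.run`, which is where a negative value like this is typically seen on Windows.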

My setup: Windows 10 64-bit build 19577, Python 3.7.7, tiledb-py version 0.6.5.

Cheers.


joe-maley · 2020-07-13T20:07:13Z

Hi @trinhtrannp

This definitely looks like a bug, thanks for opening the ticket. Could you provide me with the exact steps to reproduce this issue? I am specifically looking for the code snippet required to create and write the array at uri in the current code snippet.

Thanks!

ihnorton · 2020-07-13T20:13:20Z

Hi @trinhtrannp -- also, if the test dataset cannot be shared, please let us know the following information about the dataframe, and I will try to reproduce it myself on Windows, with generated data:

df.columns, df.dtypes, df.shape, df.index

For example:

>>> p df.columns
Index(['time', 'double_range', 'int_vals'], dtype='object')
>>> p df.dtypes
time            datetime64[ns]
double_range           float64
int_vals                 int64
dtype: object
>>> p df.shape
(10, 3)
>>> p len(df)
10
>>> p df.index
RangeIndex(start=0, stop=10, step=1)

trinhtrannp · 2020-07-14T22:34:28Z

Hi @joe-maley, @ihnorton

I was able to work around the exit code 0xC0000005 by setting the number of writer threads to 1. The issue comes back when I remove that setting or set it to more than 1 thread.

config = tiledb.Config()
config["sm.num_writer_threads"] = 1

Here is the information about the dataframe for June 2020. All other dataframes are nearly the same, and I have one dataframe for each month from 11/2001 until 7/2020.

df.columns
Index(['bidopen', 'bidclose', 'bidhigh', 'bidlow', 'askopen', 'askclose',
       'askhigh', 'asklow', 'tickqty'],
      dtype='object')
----------------
df.dtypes
bidopen     float64
bidclose    float64
bidhigh     float64
bidlow      float64
askopen     float64
askclose    float64
askhigh     float64
asklow      float64
tickqty       int64
dtype: object
----------------
df.shape
(31727, 9)
----------------
df.index
Index(['2020-06-01 00:00:00', '2020-06-01 00:01:00', '2020-06-01 00:02:00',
       '2020-06-01 00:03:00', '2020-06-01 00:04:00', '2020-06-01 00:05:00',
       '2020-06-01 00:06:00', '2020-06-01 00:07:00', '2020-06-01 00:08:00',
       '2020-06-01 00:09:00',
       ...
       '2020-06-30 23:50:00', '2020-06-30 23:51:00', '2020-06-30 23:52:00',
       '2020-06-30 23:53:00', '2020-06-30 23:54:00', '2020-06-30 23:55:00',
       '2020-06-30 23:56:00', '2020-06-30 23:57:00', '2020-06-30 23:58:00',
       '2020-06-30 23:59:00'],
      dtype='object', name='date', length=31727)

I use code similar to the following to write the data.

import os

import pandas as pd
import tiledb

# Assumption: `vfs` was not defined in the posted snippet; a default tiledb.VFS() is assumed here.
vfs = tiledb.VFS()

if not vfs.is_bucket("s3://raw/"):
    vfs.create_bucket("s3://raw")

for ccypair in ['EURUSD', 'EURJPY', 'USDJPY']:
    uri = f"s3://raw/{ccypair}"
    for year in range(2001, 2021):
        for month in range(1, 13):
            data_file = f"E:\\data\\{ccypair}-{year}-{month}.csv"
            # Months without a CSV file are skipped.
            if os.path.exists(data_file):
                df = pd.read_csv(data_file, index_col="date")
                # The first write creates the array; later writes append to it.
                existed = tiledb.highlevel.array_exists(uri)
                tiledb.from_pandas(uri, df,
                                   sparse=True,
                                   mode='append' if existed else 'ingest',
                                   tile_order='row_major',
                                   cell_order='row_major',
                                   allows_duplicates=True,
                                   attrs_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000),
                                   coords_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000))
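The loop above visits every month of every year from 2001 to 2020 and relies on `os.path.exists` to skip months that have no file. As an alternative, here is a minimal sketch that enumerates only the months in the stated range (11/2001 to 7/2020). The generator `month_range` is a hypothetical helper, not something from the original code or from tiledb.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, both inclusive."""
    year, month = start
    # Tuples compare element by element, so (2020, 7) <= (2020, 7) is True
    # and (2020, 8) <= (2020, 7) is False.
    while (year, month) <= end:
        yield year, month
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


months = list(month_range((2001, 11), (2020, 7)))
print(months[0], months[-1], len(months))
```

Each `(year, month)` pair can then be substituted into the same file name pattern used in the loop above.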

ihnorton · 2020-07-16T18:50:35Z

Hi @trinhtrannp,
Thanks for the information. So far I have not been able to reproduce this with synthetic data on Windows (Python 3.7.7; Windows 10 64-bit, build 18362). I will share my test script below, and if possible could you also let us know:

- Any other Config parameters you have set (I checked sm.num_writer_threads with 1, 4, and 8).
- Any output you can share from pip list, in particular the numpy version (excluding non-public package names).
- Whether the test program below reproduces the same issue.

Thanks!

import tiledb
import pandas as pd
import numpy as np
import tempfile

from tiledb.tests.common import rand_datetime64_array

config = tiledb.Config()
config["sm.num_writer_threads"] = 8

tiledb.default_ctx(config)


uri = tempfile.mkdtemp()
#uri = "C:\\tmp\\fx.tiledb"

#col_size = 100000
col_size = 31727
#col_size = 317

def fcol():
    return np.random.rand(col_size)


data = {
    'date': rand_datetime64_array(col_size),
    'bidopen': fcol(),
    'bidclose': fcol(),
    'bidhigh': fcol(),
    'bidlow': fcol(),
    'askopen': fcol(),
    'askclose': fcol(),
    'askhigh': fcol(),
    'asklow': fcol(),
    'tickqty': np.arange(col_size, dtype=np.int64)
}

df = pd.DataFrame.from_dict(data)
df = df.set_index(['date'])


for i in range(4):
    tiledb.from_pandas(uri, df,
                   sparse=True,
                   mode='ingest' if i == 0 else 'append',
                   tile_order='row-major',
                   cell_order='row-major',
                   allows_duplicates=True,
                   attrs_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)]),
                   coords_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)]))

For reference, my test environment was using Python 3.7.7 from python.org, set up with:

C:\opt\pyorg37\python.exe -m venv vv 
.\vv\Scripts\activate
pip install tiledb pandas

pip list
Package         Version
--------------- -------
numpy           1.16.0
pandas          1.0.5
pip             19.2.3
python-dateutil 2.8.1
pytz            2020.1
setuptools      41.2.0
six             1.15.0
tiledb          0.6.5
wheel           0.34.2
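The version details requested in this comment can also be collected from within Python rather than from `pip list`. The following is a minimal sketch, assuming Python 3.8 or newer, where `importlib.metadata` is in the standard library; on the Python 3.7.7 used in this thread the `importlib_metadata` backport would be needed instead. The function name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


for name, version in installed_versions(["tiledb", "numpy", "pandas"]).items():
    print(f"{name:<10} {version or 'not installed'}")
```

Limiting the query to a fixed list of public package names also avoids leaking non-public package names, which the request above asks to exclude.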

joe-maley · 2020-07-22T13:26:46Z

Hi @trinhtrannp --

I've reproduced this issue and am now working on a fix. Thanks for the bug report and I'll let you know once it has been merged.

joe-maley · 2020-07-23T13:38:22Z

Fixed by #1732 and #1736

ihnorton · 2020-08-05T15:20:07Z

Hi @trinhtrannp, the fixes here are included in the latest TileDB-Py release (0.6.6) on PyPI. If you have a chance to test again, please let us know if the issue is not resolved after upgrading.

joe-maley self-assigned this Jul 13, 2020

joe-maley closed this as completed Jul 23, 2020
