exit code -1073741819 (0xC0000005) & Out of bounds read to internal chunk buffer of size 65536 #1721

Closed
trinhtrannp opened this issue Jul 13, 2020 · 7 comments

@trinhtrannp

Hi team,

I'm currently trying TileDB (Python) with historical FX data that looks like this:

date bidopen bidclose bidhigh ... askhigh asklow tickqty
2020-07-01 00:00:00 1.12330 1.12319 1.12331 ... 1.12342 1.12330 107
2020-07-01 00:01:00 1.12319 1.12304 1.12320 ... 1.12332 1.12315 243
2020-07-01 00:02:00 1.12304 1.12297 1.12305 ... 1.12317 1.12308 168
2020-07-01 00:03:00 1.12297 1.12288 1.12300 ... 1.12312 1.12291 156
2020-07-01 00:04:00 1.12288 1.12286 1.12289 ... 1.12302 1.12293 195
2020-07-01 00:05:00 1.12286 1.12286 1.12290 ... 1.12302 1.12297 109
2020-07-01 00:06:00 1.12286 1.12302 1.12302 ... 1.12315 1.12297 111
2020-07-01 00:07:00 1.12302 1.12293 1.12304 ... 1.12317 1.12306 82
2020-07-01 00:08:00 1.12293 1.12287 1.12293 ... 1.12307 1.12298 63
2020-07-01 00:09:00 1.12287 1.12285 1.12293 ... 1.12305 1.12291 165

I have one data frame per month, each with about 30k rows.

The write itself goes well: I checked the total row count in TileDB and no data is missing. However, I see many log entries like this one:

[2020-07-13 21:30:37.845] [tiledb] [Process: 24960] [Thread: 29992] [error] [TileDB::Dimension] Error: Out of bounds read to internal chunk buffer of size 65536

This happens even though I have already set the chunksize:

existed = tiledb.highlevel.array_exists(uri)
tiledb.from_pandas(uri, df, 
  sparse=True,
  mode='append' if existed else 'ingest',
  tile_order='row_major',
  cell_order='row_major',
  allows_duplicates=True,
  attrs_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000),
  coords_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000)
)
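
To gauge how often the error above is logged, and on which threads, the log output can be summarized with a short script. This is only a minimal sketch: the regular expression assumes every line follows the format of the single sample shown above, and the names `LOG_LINE` and `summarize_errors` are hypothetical helpers, not part of TileDB.

```python
import re
from collections import Counter

# Assumed line format, inferred from the one sample in this report:
# [timestamp] [tiledb] [Process: N] [Thread: N] [level] [component] message
LOG_LINE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] \[tiledb\] "
    r"\[Process: (?P<process>\d+)\] \[Thread: (?P<thread>\d+)\] "
    r"\[(?P<level>\w+)\] \[(?P<component>[^\]]+)\] (?P<message>.*)"
)


def summarize_errors(lines):
    """Count error-level lines per (thread id, component, message)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "error":
            key = (
                int(match.group("thread")),
                match.group("component"),
                match.group("message"),
            )
            counts[key] += 1
    return counts


sample = [
    "[2020-07-13 21:30:37.845] [tiledb] [Process: 24960] [Thread: 29992] "
    "[error] [TileDB::Dimension] Error: Out of bounds read to internal "
    "chunk buffer of size 65536",
]
for key, count in summarize_errors(sample).items():
    print(key, count)
```

Lines that do not match the assumed format are skipped silently, so the counts are a lower bound if the log contains other layouts.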

Also, when I tried to persist a larger dataset (from 2007 until now), the process exited with exit code -1073741819 (0xC0000005), which suggests that TileDB may have tried to access an invalid memory address.
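
The two numbers above are the same value: Windows reports the exit code as a signed 32-bit integer, while status codes are usually written as unsigned hexadecimal, and 0xC0000005 is the Windows status code for an access violation. A minimal sketch of the conversion follows; the function name `to_ntstatus` is hypothetical.

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit hex string."""
    # Masking with 0xFFFFFFFF maps negative values to their
    # two's-complement unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741819))  # prints 0xC0000005
```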

My setup: Windows 10 64-bit build 19577, Python 3.7.7, tiledb-py version 0.6.5.

Cheers.

@joe-maley joe-maley self-assigned this Jul 13, 2020
@joe-maley
Contributor

Hi @trinhtrannp

This definitely looks like a bug, thanks for opening the ticket. Could you provide me with the exact steps to reproduce this issue? I am specifically looking for the code snippet required to create and write the array at uri in the current code snippet.

Thanks!

@ihnorton
Member

Hi @trinhtrannp -- also, if the test dataset cannot be shared, please let us know the following information about the dataframe, and I will try to reproduce it myself on Windows, with generated data:

df.columns, df.dtypes, df.shape, df.index

For example:

>>> df.columns
Index(['time', 'double_range', 'int_vals'], dtype='object')
>>> df.dtypes
time            datetime64[ns]
double_range           float64
int_vals                 int64
dtype: object
>>> df.shape
(10, 3)
>>> len(df)
10
>>> df.index
RangeIndex(start=0, stop=10, step=1)

@trinhtrannp
Author

trinhtrannp commented Jul 14, 2020

Hi @joe-maley, @ihnorton

I was able to avoid the exit code 0xC0000005 crash by setting the number of writer threads to 1. The issue comes back when I remove that setting or set it to more than one thread.

config = tiledb.Config()
config["sm.num_writer_threads"] = 1


Here is the information about one dataframe, for June 2020. All the other dataframes are nearly the same, and I have one dataframe for each month from 11/2001 until 7/2020.

df.columns
Index(['bidopen', 'bidclose', 'bidhigh', 'bidlow', 'askopen', 'askclose',
       'askhigh', 'asklow', 'tickqty'],
      dtype='object')
----------------
df.dtypes
bidopen     float64
bidclose    float64
bidhigh     float64
bidlow      float64
askopen     float64
askclose    float64
askhigh     float64
asklow      float64
tickqty       int64
dtype: object
----------------
df.shape
(31727, 9)
----------------
df.index
Index(['2020-06-01 00:00:00', '2020-06-01 00:01:00', '2020-06-01 00:02:00',
       '2020-06-01 00:03:00', '2020-06-01 00:04:00', '2020-06-01 00:05:00',
       '2020-06-01 00:06:00', '2020-06-01 00:07:00', '2020-06-01 00:08:00',
       '2020-06-01 00:09:00',
       ...
       '2020-06-30 23:50:00', '2020-06-30 23:51:00', '2020-06-30 23:52:00',
       '2020-06-30 23:53:00', '2020-06-30 23:54:00', '2020-06-30 23:55:00',
       '2020-06-30 23:56:00', '2020-06-30 23:57:00', '2020-06-30 23:58:00',
       '2020-06-30 23:59:00'],
      dtype='object', name='date', length=31727)

I use code similar to the following to write the data.

# Not shown in the original snippet: the imports and the VFS handle are
# assumed here, with S3 access already configured on the default context.
import os

import pandas as pd
import tiledb

vfs = tiledb.VFS()

if not vfs.is_bucket("s3://raw/"):
    vfs.create_bucket("s3://raw")

for ccypair in ['EURUSD', 'EURJPY', 'USDJPY']:
    uri = f"s3://raw/{ccypair}"
    for year in range(2001, 2021):
        for month in range(1, 13):
            data_file = f"E:\\data\\{ccypair}-{year}-{month}.csv"
            if os.path.exists(data_file):
                df = pd.read_csv(data_file, index_col="date")
                # Create the array on the first write, then append to it.
                existed = tiledb.highlevel.array_exists(uri)
                tiledb.from_pandas(uri, df,
                                   sparse=True,
                                   mode='append' if existed else 'ingest',
                                   tile_order='row_major',
                                   cell_order='row_major',
                                   allows_duplicates=True,
                                   attrs_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000),
                                   coords_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)], chunksize=512000))

@ihnorton
Member

ihnorton commented Jul 16, 2020

Hi @trinhtrannp,
Thanks for the information. So far I have not been able to reproduce this with synthetic data on Windows (Python 3.7.7; Windows 10 64-bit, build 18362). I will share my test script below, and if possible could you also let us know:

  • any other Config parameters you have set? (I checked sm.num_writer_threads 1, 4, and 8).
  • any output you can share from pip list, in particular the numpy version (excluding non-public package names)
  • please try the test program below, and let us know if the same issue is reproduced.

Thanks!

import tiledb
import pandas as pd
import numpy as np
import tempfile

from tiledb.tests.common import rand_datetime64_array

config = tiledb.Config()
config["sm.num_writer_threads"] = 8

tiledb.default_ctx(config)


uri = tempfile.mkdtemp()
#uri = "C:\\tmp\\fx.tiledb"

#col_size = 100000
col_size = 31727
#col_size = 317

def fcol():
    return np.random.rand(col_size)


data = {
    'date': rand_datetime64_array(col_size),
    'bidopen': fcol(),
    'bidclose': fcol(),
    'bidhigh': fcol(),
    'bidlow': fcol(),
    'askopen': fcol(),
    'askclose': fcol(),
    'askhigh': fcol(),
    'asklow': fcol(),
    'tickqty': np.arange(col_size, dtype=np.int64)
}

df = pd.DataFrame.from_dict(data)
df = df.set_index(['date'])


for i in range(4):
    tiledb.from_pandas(uri, df,
                       sparse=True,
                       mode='ingest' if i == 0 else 'append',
                       tile_order='row-major',
                       cell_order='row-major',
                       allows_duplicates=True,
                       attrs_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)]),
                       coords_filters=tiledb.FilterList([tiledb.GzipFilter(level=-1)]))

For reference, my test environment was using Python 3.7.7 from python.org, set up with:

C:\opt\pyorg37\python.exe -m venv vv 
.\vv\Scripts\activate
pip install tiledb pandas
pip list
Package         Version
--------------- -------
numpy           1.16.0
pandas          1.0.5
pip             19.2.3
python-dateutil 2.8.1
pytz            2020.1
setuptools      41.2.0
six             1.15.0
tiledb          0.6.5
wheel           0.34.2
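
Where sharing the full pip list output is not possible, the versions of just the relevant packages can be printed instead. The sketch below uses `importlib.metadata`, which assumes Python 3.8 or newer; on Python 3.7, as used in this thread, the `importlib_metadata` backport would be needed. The function name `package_versions` is hypothetical.

```python
from importlib import metadata  # standard library in Python 3.8+


def package_versions(names):
    """Map each distribution name to its installed version string."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            versions[name] = "not installed"
    return versions


for name, version in package_versions(["numpy", "pandas", "tiledb"]).items():
    print(f"{name:<10} {version}")
```

This reports only the packages named in the list, so non-public package names never appear in the output.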

@joe-maley
Contributor

Hi @trinhtrannp --

I've reproduced this issue and am now working on a fix. Thanks for the bug report and I'll let you know once it has been merged.

@joe-maley
Contributor

Fixed by #1732 and #1736

@ihnorton
Member

ihnorton commented Aug 5, 2020

Hi @trinhtrannp, the fixes here are included in the latest TileDB-Py release (0.6.6) on PyPI. If you have a chance to test again, please let us know if the issue is not resolved after upgrading.
