[python] Try to work around Arrow combine_chunks error #2462

Open
johnkerl opened this issue Apr 22, 2024 · 0 comments
Labels
bug Something isn't working python-api


johnkerl commented Apr 22, 2024

Reported by @ebezzi (see also #2120)

Repro script

import cellxgene_census
import pandas as pd
import pyarrow as pa
import tiledbsoma

# Open the stable Census release and select the human experiment.
census = cellxgene_census.open_soma(census_version="stable")
exp = census["census_data"]["homo_sapiens"]
obs = exp.obs

# Read all of obs into pandas, then shuffle the rows deterministically.
obs_df = obs.read().concat().to_pandas()
obs_df_shuffled = obs_df.sample(frac=1, random_state=1).reset_index(drop=True)

# Reassign soma_joinid as 0..n-1 in the shuffled row order.
obs_df_shuffled["soma_joinid"] = pd.Series(range(len(obs_df_shuffled)))

# Save the new join IDs for later reference.
idx = obs_df_shuffled.copy()["soma_joinid"]
idx.to_pickle('index.pkl')

# Write the shuffled table to a new SOMA DataFrame; df.write is the call that fails.
with tiledbsoma.DataFrame.create("./obs", schema=obs.schema) as df:
    data = pa.Table.from_pandas(obs_df_shuffled)
    df.write(data)

Output

    df.write(data)
  File "/home/ssm-user/venv/lib/python3.10/site-packages/tiledbsoma/_dataframe.py", line 408, in write
    col = values.column(name).combine_chunks()
  File "pyarrow/table.pxi", line 746, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3775, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Analysis

This looks like a bug in Arrow's combine_chunks, but there may be something we can do on our side to work around it.

cc @johnkerl @nguyenv for visibility.
