Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting signed integer is greater than maximum error when querying large table #414

Open
maky-hnou opened this issue Mar 1, 2023 · 5 comments

Comments

@maky-hnou
Copy link

When I query a large table (about 5 million rows), I get this error:

pyathena.error.OperationalError: signed integer is greater than maximum

Here is the code I am using:

from pyathena import connect as athena_connect
from pyathena.pandas.cursor import PandasCursor

AWS_ACCESS_KEY_ID = 'my_key_id'
AWS_SECRET_ACCESS_KEY = 'my_secret_key'
ATHENA_SCHEMA = 'my_schema'
REGION_NAME = 'us-west-2'
WORK_GROUP = 'my_work_group'


schema_cursor = athena_connect(
            region_name=REGION_NAME,
            schema_name=ATHENA_SCHEMA,
            work_group=WORK_GROUP,
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            cursor_class=PandasCursor,
        ).cursor()

data = schema_cursor.execute(f"""SELECT * FROM my_table""").as_pandas()

print(data.shape)

This is the output I get:

Failed to read s3://path_to_csv.csv.
Traceback (most recent call last):
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/pandas/result_set.py", line 273, in _read_csv
    return pd.read_csv(
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/filesystem/s3.py", line 395, in _get_object
    return ranges[0], cast(bytes, response["Body"].read())
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/botocore/response.py", line 99, in read
    chunk = self._raw_stream.read(amt)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/urllib3/response.py", line 518, in read
    data = self._fp.read() if not fp_closed else b""
  File "/usr/lib/python3.8/http/client.py", line 472, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 613, in _safe_read
    data = self.fp.read(amt)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum
Traceback (most recent call last):
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/pandas/result_set.py", line 273, in _read_csv
    return pd.read_csv(
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/filesystem/s3.py", line 395, in _get_object
    return ranges[0], cast(bytes, response["Body"].read())
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/botocore/response.py", line 99, in read
    chunk = self._raw_stream.read(amt)
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/urllib3/response.py", line 518, in read
    data = self._fp.read() if not fp_closed else b""
  File "/usr/lib/python3.8/http/client.py", line 472, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 613, in _safe_read
    data = self.fp.read(amt)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "py_doc.py", line 24, in <module>
    data = schema_cursor.execute(f"""SELECT * FROM my_table""").as_pandas()
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/pandas/cursor.py", line 173, in execute
    self.result_set = AthenaPandasResultSet(
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/pandas/result_set.py", line 141, in __init__
    df = self._as_pandas()
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/pandas/result_set.py", line 382, in _as_pandas
    df = self._read_csv()
  File "/home/hani.yousfi/.virtualenvs/my_env/lib/python3.8/site-packages/pyathena/pandas/result_set.py", line 297, in _read_csv
    raise OperationalError(*e.args) from e
pyathena.error.OperationalError: signed integer is greater than maximum

My dependencies:

boto3==1.26.74
botocore==1.29.74
pandas==1.5.3
psycopg2-binary==2.9.5
pyathena==2.23.0
urllib3==1.26.9
python --version 
Python 3.8.10

Further info:
The table contains 15 columns, most of them are with timestamps and strings values.
The column with timestamps contains unix format timestamps and 'NOT APPLICABLE' when the value is empty.

I was not getting this issue with lower versions of pyathena and pandas. But I had to update them due to vulnerabilities.

Do you know how to fix this?
Or how to query the table in chunks then merge them?

@laughingman7743
Copy link
Owner

https://stackoverflow.com/questions/70905872/overflowerror-when-reading-from-s3-signed-integer-is-greater-than-maximum
https://bugs.python.org/issue42853
It may be an issue with Python 3.8, could you try with Python 3.9 or higher version?

Please check the following section of the README for usage of the chunksize option.
https://github.com/laughingman7743/PyAthena#pandascursor-chunksize-options

@maky-hnou
Copy link
Author

Hey @laughingman7743
Thank you for you quick answer.
I tried with Python3.9, I got the same error. Then I tried with Python3.10 and it got stuck then the process was killed.

I do believe it is related to Python not to pyathena.

I will try the links you mentioned and close the issue if it will solve my problem.
Thank you.

@maky-hnou
Copy link
Author

Hey @laughingman7743

Unfortunately, the chunksize didn't work either.
I think it has a relation to the data inside the timestamp column.
However I tried with these versions and it worked fine without even need to set the chuncksize option:

numpy==1.19.4
pandas==1.2.0
boto3==1.16.47
PyAthena==1.11.3

I will investigate more tomorrow and post the updates here.

@laughingman7743
Copy link
Owner

PyAthena==1.11.3 🤔

@maky-hnou
Copy link
Author

Yup, the code I took ownership of is a bit old, so I decided to update the dependencies. But I got that error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants