Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converter class does not convert Athena string data to pandas str type #148

Open
krishanunandy opened this issue Jul 1, 2020 · 5 comments

Comments

@krishanunandy
Copy link

krishanunandy commented Jul 1, 2020

First of all, thank you for creating this library! It's been immensely helpful and I've used it in multiple contexts over several years and would love to contribute - especially if it helps solve my current problem!

With pyathena=1.10.7 and pandas=1.0.5 I am running the following code with the expectation that the converter class will cast the Athena string data type as an str pandas dtype.

from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.converter import Converter

class CustomPandasTypeConverter(Converter):

    def __init__(self):
        super(CustomPandasTypeConverter, self).__init__(
            mappings=None,
            types={
                'boolean': bool,
                'tinyint': int,
                'smallint': int,
                'integer': int,
                'bigint': int,
                'float': float,
                'real': float,
                'double': float,
                'decimal': float,
                'char': str,
                'varchar': str,
                'array': str,
                'map': str,
                'row': str,
                'varbinary': str,
                'json': str,
                'string': str
            }
        )

    def convert(self, type_, value):
        # Not used in PandasCursor.
        pass
    
cur = connect(s3_staging_dir='<staging_directory_url>',
                region_name='<aws_region>',
                cursor_class = PandasCursor,
                converter=CustomPandasTypeConverter(),
                work_group = '<workgroup_name>').cursor()

query = 'SELECT * FROM <schema>.<table>'
df = cur.execute(query).as_pandas()
df.dtypes

When I inspect the dtypes, Athena ints are converted to pandas ints, decimals are converted to floats and strings are consistently returned as object dtypes. However Athena string NULLs are cast as NaNs which require explicit column-by-column fillna operations. This is particularly inconvenient, since I'm trying to subsequently convert the pandas dataframe to a Spark dataframe. Now that I've typed all this out, I'm guessing this is related to #118?

Also, I'm not sure where the right place to ask this is, but are there any plans to implement a PySparkCursor for PyAthena? If not can I help by contributing?

@laughingman7743
Copy link
Owner

The problem with #118 is that the string type cannot distinguish between empty and null characters. Currently, all empty characters are treated as null.

The reason Dtype is an Object is probably because the type is represented as a Python String object.
Isn't the missing value of the String object a NaN? Is there any other way to describe it?

I don't plan to implement PySparkCursor.
Contributions are free. However, you are required to provide your own test environment with AWS.

@krishanunandy
Copy link
Author

krishanunandy commented Jul 2, 2020

The reason Dtype is an Object is probably because the type is represented as a Python String object.

That's my understanding too.

Isn't the missing value of the String object a NaN? Is there any other way to describe it?

A NaN is a contraction of 'Not a Number' - it probably shouldn't be used to describe the missing value for a string though I think it is. Until very recently, pandas int dtypes didn't support null values - only float did - and datetimes have their own NaT (Not a Time) type. While not ideal, giving the user the choice to explicitly cast an Athena NULL string as an empty string ('') might be a good solution until someone comes up with an NaS type?

I don't plan to implement PySparkCursor.
Contributions are free. However, you are required to provide your own test environment with AWS.

Thanks for the information. I'll look at the source code and see what I can come up with.

@laughingman7743
Copy link
Owner

https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values
If Pnadas 1.0.0 or higher, it seems to good to handle all missing values in pd.NA.

@krishanunandy
Copy link
Author

Good find. It seems though, that -

  1. PyAthena casts strings as object type (i.e. Numpy arrays) instead of the new dedicated StringDtype
  2. The StringDtype is still experimental and probably shouldn't be a dependency of PyAthena

Any thoughts on a potential solution? Pinging @EdwardJRoss since he seems to understand the codebase better than I do.

@Liam3851
Copy link

@krishanunandy As noted upthread in Pandas the canonical way of representing strings, before the dedicated StringDtype, was as objects. dtype=str is accepted but results in an object dtype. This is just how Pandas works and has nothing to do with PyAthena, at least not directly:

>>> pd.Series(['a', 'b'], dtype=str).reindex([0, 1, 2])
0    a
1    b
2    NaN
dtype: object

Pandas of course is developing the new String dtype which is the ultimate solution here. You could change your custom converter to use "string" instead of str and use the new dtype.

However, given your use case, it is relevant that through version 3.0, Spark has no support for pd.NA, in any of the new nullable dtypes (including Int64, which PyAthena supports). See SPARK-30966 in the Spark Jira for the ticket on this. Realistically you're probably looking at Spark 3.1... maybe early 2021?... at the earliest.

My suggestion would just be to write a wrapper method that calls .as_pandas(), takes the result, loops over the cursor description to get the type of each column, and runs .fillna('') on any columns with string type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants