Append error: TypeError: Cannot compare tz-naive and tz-aware timestamps #35

Open
yohplala opened this issue Jan 9, 2020 · 6 comments

Comments

@yohplala

yohplala commented Jan 9, 2020

Hello,

I am passing a tz-aware dataframe to pystore's append(), and I get this error message:

  collection.append(item_ID, df, npartitions=item.data.npartitions)
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\pystore\collection.py", line 184, in append
    combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\dask\dataframe\multi.py", line 1070, in concat
    for i in range(len(dfs) - 1)
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\dask\dataframe\multi.py", line 1070, in <genexpr>
    for i in range(len(dfs) - 1)
  File "pandas\_libs\tslibs\c_timestamp.pyx", line 109, in pandas._libs.tslibs.c_timestamp._Timestamp.__richcmp__
  File "pandas\_libs\tslibs\c_timestamp.pyx", line 169, in pandas._libs.tslibs.c_timestamp._Timestamp._assert_tzawareness_compat
TypeError: Cannot compare tz-naive and tz-aware timestamps
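
For context, the comparison pandas refuses here can be reproduced on its own (a minimal sketch, independent of pystore):

import pandas as pd

naive = pd.Timestamp('2019-12-22 08:40:00')            # tz-naive
aware = pd.Timestamp('2019-12-22 08:40:00', tz='UTC')  # tz-aware

naive < aware  # raises TypeError: Cannot compare tz-naive and tz-aware timestamps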

[EDIT]
Here is code that can simply be copy/pasted to reproduce the error message.
Does anyone see what I might be doing wrong?

import pandas as pd
import pystore

ts_list = ['Sun Dec 22 2019 07:40:00 GMT-0100',
           'Sun Dec 22 2019 07:45:00 GMT-0100',
           'Sun Dec 22 2019 07:50:00 GMT-0100',
           'Sun Dec 22 2019 07:55:00 GMT-0100']

op_list = [7134.0, 7134.34, 7135.03, 7131.74]

GC = pd.DataFrame(list(zip(ts_list, op_list)), columns=['date', 'open'])

# Parse the timestamps and convert them to UTC (tz-aware index)
GC['date'] = pd.to_datetime(GC['date'], utc=True)

# Rename the timestamp column
GC.rename(columns={'date': 'Timestamp'}, inplace=True)

# Set the timestamp column as index
GC.set_index('Timestamp', inplace=True, verify_integrity=True)

# Connect to the datastore (created if it does not exist)
store = pystore.store('OHLCV')
# Access a collection (created if it does not exist)
collection = store.collection('AAPL')
item_ID = 'EOD'

# Write all rows but the last, then try to append the last one
collection.write(item_ID, GC[:-1], overwrite=True)
item = collection.item(item_ID)
collection.append(item_ID, GC[-1:], npartitions=item.data.npartitions)

Thank you for your help.
Have a good day,
Best,
Pierrot

@yohplala
Author

Hello,
I have updated the code above so that anyone can run it in a terminal and reproduce the error. (The previous code did not work on its own: it needed a data file, so I embedded an extract of the data directly in the code.)
Thanks in advance for any help and advice.
Best,
Pierrot

@yohplala
Author

yohplala commented Jan 10, 2020

[ADDITION]
OK, I first tested pandas' concat() function on its own (without pystore), and I do not get the error message.
Would that mean the trouble comes from Dask's dataframe handling?

The following code (using pandas directly, not pystore/Dask/Parquet) works:

import pandas as pd

ts_list = ['Sun Dec 22 2019 07:40:00 GMT-0100',
           'Sun Dec 22 2019 07:45:00 GMT-0100',
           'Sun Dec 22 2019 07:50:00 GMT-0100',
           'Sun Dec 22 2019 07:55:00 GMT-0100']

op_list = [7134.0, 7134.34, 7135.03, 7131.74]

GC = pd.DataFrame(list(zip(ts_list, op_list)), columns=['date', 'open'])

# Parse the timestamps and convert them to UTC (tz-aware index)
GC['date'] = pd.to_datetime(GC['date'], utc=True)

# Rename the timestamp column
GC.rename(columns={'date': 'Timestamp'}, inplace=True)

# Set the timestamp column as index
GC.set_index('Timestamp', inplace=True, verify_integrity=True)

# Same concat/drop_duplicates as pystore's append, but in pure pandas
combined = pd.concat([GC[:-1], GC[-1:]]).drop_duplicates(keep="last")

The problem is not solved.
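
For what it's worth, here is a sketch of what I suspect the failure mode is (my assumption, not verified against pystore's code): dd.concat compares the divisions (index boundaries) of its inputs, and if the stored item comes back with a tz-naive index while the new data is tz-aware, that comparison raises exactly the TypeError from the traceback:

import pandas as pd
import dask.dataframe as dd

idx = pd.date_range('2019-12-22 08:40', periods=4, freq='5min', tz='UTC')
df = pd.DataFrame({'open': [7134.0, 7134.34, 7135.03, 7131.74]}, index=idx)

# Simulate the timezone being lost when the stored item is read back
naive = dd.from_pandas(df[:-1].tz_localize(None), npartitions=1)
aware = dd.from_pandas(df[-1:], npartitions=1)

# dd.concat compares the divisions of its inputs; mixing tz-naive and
# tz-aware boundaries raises:
# TypeError: Cannot compare tz-naive and tz-aware timestamps
combined = dd.concat([naive, aware])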

@yohplala
Author

yohplala commented Jan 10, 2020

Hmm, it seems I cannot reproduce the error in a standalone script without rewriting collection.py in depth.
I am stopping the investigation here. (It seemed to me the error might come from my dataframe formatting, in which case I could have reported it on Stack Overflow, or on the pandas or Dask GitHub if it was Dask-related.) But I have no clue where the bug is without digging further into Dask.

As this is not my priority at the moment, I will only use pystore's write() function; when I have to append data, I will do it with pandas' concat() and then write() with pystore, using overwrite=True.

I hope this trouble in a Windows 10 environment can be solved. (I suspect this error and the need to pass npartitions=item.data.npartitions to append() may actually be linked.)
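
If someone wants to dig, a quick way to inspect what comes back from the store (a sketch, reusing the collection and item from the script above; item.data is the Dask dataframe pystore exposes):

item = collection.item(item_ID)
print(item.data.divisions)         # the index boundaries dd.concat compares
print(item.to_pandas().index.tz)   # None here would mean the tz was lost on read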

Have a good day,
Best,
Pierrot

@yohplala
Author

For those in the same situation, here is an ugly workaround whose logic I described in the comment above.

import pandas as pd
import pystore

ts_list = ['Sun Dec 22 2019 07:40:00 GMT-0100',
           'Sun Dec 22 2019 07:45:00 GMT-0100',
           'Sun Dec 22 2019 07:50:00 GMT-0100',
           'Sun Dec 22 2019 07:55:00 GMT-0100']

op_list = [7134.0, 7134.34, 7135.03, 7131.74]

GC = pd.DataFrame(list(zip(ts_list, op_list)), columns=['date', 'open'])

# Parse the timestamps and convert them to UTC (tz-aware index)
GC['date'] = pd.to_datetime(GC['date'], utc=True)

# Rename the timestamp column
GC.rename(columns={'date': 'Timestamp'}, inplace=True)

# Set the timestamp column as index
GC.set_index('Timestamp', inplace=True, verify_integrity=True)

# Connect to the datastore (created if it does not exist)
store = pystore.store('OHLCV')
# Access a collection (created if it does not exist)
collection = store.collection('AAPL')
item_ID = 'EOD'
collection.write(item_ID, GC[:-1], overwrite=True)

# WORKAROUND
# Re-create an append: read back into pandas, concat, and overwrite
item = collection.item(item_ID)
current = item.to_pandas()
combined = pd.concat([current, GC[-1:]]).drop_duplicates(keep="last")
collection.write(item_ID, combined, overwrite=True)
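
The same logic can be wrapped in a small helper (append_via_pandas is a name I made up, not pystore API; this is just a sketch around pystore's write()/item()/to_pandas() calls used above):

import pandas as pd

def append_via_pandas(collection, item_id, new_df):
    # Read the stored item back into pandas, concatenate, and overwrite.
    # This bypasses Dask's division comparison entirely.
    current = collection.item(item_id).to_pandas()
    combined = pd.concat([current, new_df]).drop_duplicates(keep="last")
    collection.write(item_id, combined, overwrite=True)

append_via_pandas(collection, item_ID, GC[-1:])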

Best,

@sdementen

sdementen commented Nov 26, 2020

I think that https://github.com/ranaroussi/pystore/blob/master/pystore/collection.py#L181 should be

combined = dd.concat([current.to_pandas(), new]).drop_duplicates(keep="last")

instead of the current

combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

@ranaroussi could you confirm?

@sdementen

Probably related to issue dask/dask#6925.
