write_sav() and datetime gives wrong format #69
Comments
Can you please provide a complete minimal example (meaning a short piece of code that reproduces the issue)? In my hands datetime64[ns] gets translated to SPSS DATETIME8, so I need more context to understand what is happening in your case. |
Here is my example:
|
Question: if you read the produced file back into pyreadstat and check meta.original_variable_types, what do you get? I get DATETIME8. If you get the same, does that mean that SPSS is ignoring the format and translating to F8.2? (Or how do you get the F8.2?)
|
I am getting {'dates': 'DATETIME8', 'times': 'DATETIME8'} |
Let me check what SPSS thinks |
SPSS reads it correctly as dates, so it seems to be a problem with PSPP. However, there is one strange thing: NaTs are being translated to 1677-09-21 00:12:43 instead of missing. That looks like a bug in pyreadstat. Maybe PSPP gets confused by that very old date and therefore translates everything to a number? Can you try to save a file without NaTs or any other strange date to see if PSPP reads it correctly?
|
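A side note on where 1677-09-21 00:12:43 may come from. This is only a plausible explanation, not something confirmed in this thread: pandas represents NaT internally as the smallest 64-bit integer, counted in nanoseconds since 1970-01-01. If that raw integer is treated as an ordinary timestamp instead of a missing value, it lands on the date reported above. A minimal sketch using only the standard library (the names `NAT_NS` and `as_date` are made up for illustration):

```python
from datetime import datetime, timedelta

# pandas stores NaT as the minimum int64 value (nanoseconds since the Unix epoch).
NAT_NS = -(2 ** 63)
UNIX_EPOCH = datetime(1970, 1, 1)

# Python's datetime has microsecond resolution, so floor the nanoseconds to microseconds.
as_date = UNIX_EPOCH + timedelta(microseconds=NAT_NS // 1000)

print(as_date.strftime("%Y-%m-%d %H:%M:%S"))
```

If this is indeed the cause, the writer would need to check for NaT explicitly and emit a missing value instead of converting the raw integer.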
what if you leave out the time variable? maybe that one causes problems since it's year 1900? |
Still F8.2 with this one:
|
also when removing the "time" variable |
I was thinking of using only the column "datetime". In any case, since SPSS is reading it correctly, it is an issue in PSPP. I would suggest you create an issue there. I will check what is happening with the NaTs being translated to funny dates. |
Ok, thanks a lot! |
The NaT writing issue has been corrected on the dev branch. I will make a release once some other changes that are coming in the next few weeks are ready. If urgent, you can compile from the dev branch and at least you will get missing values where you have NaTs when visualising in PSPP. |
A final piece of information: pyreadstat handles formatting for datetime/date/time. In the case of date, you get it if you use Python's built-in datetime.date, and it gets translated to SPSS DATE8. I tested PSPP and I got numeric for DATETIME8 and DATE8; interestingly, TIME8 is working fine. That means, in case you open a ticket in PSPP, you could ask them to correct both DATETIME8 and DATE8. |
and this is how to translate pandas Timestamp to python's datetime.time (that will become SPSS TIME8) and datetime.date (that will become SPSS DATE8):
|
I found out it can even be vectorized! |
true! that's even better |
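For readers following along, here is a minimal sketch of the vectorized conversion mentioned above. It assumes a DataFrame with a datetime64[ns] column; the column names and sample values are made up for illustration and are not the code originally posted in this thread. It relies on the pandas `.dt.date` and `.dt.time` accessors, which return Python `datetime.date` and `datetime.time` objects.

```python
import datetime

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {"datetime": pd.to_datetime(["2020-08-01 10:30:00", "2020-08-02 23:15:45"])}
)

# Vectorized extraction: no explicit loop or apply() needed.
df["date"] = df["datetime"].dt.date  # Python datetime.date objects
df["time"] = df["datetime"].dt.time  # Python datetime.time objects

print(df)
```

According to the comments above, columns holding `datetime.date` objects are written as SPSS DATE8 and columns holding `datetime.time` objects as SPSS TIME8.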
When checking the PSPP log files after opening the files, I find errors for date and time fields like this one:
|
The thing is that all of this works fine when SPSS reads these files. Can you produce a file with the features you need in PSPP and put it here, so that we can examine how it encodes things? |
In any case, the way the variables are written is controlled in the C library underlying pyreadstat, "Readstat". Maybe you can open an issue there asking why this is happening. If they fix it, then I can inherit their changes and it will work for pyreadstat. Otherwise I cannot change those things directly here in pyreadstat; it always needs to be fixed in Readstat first. |
OK, I used PSPP to write a file with one date variable (with format dd.mm.yy) and one datetime variable (with format yyyy-mm-dd HH:MM:SS). There is no time variable, because PSPP does not have the option to format as time. When reading the file in pyreadstat I noticed something interesting: both variables are read as numeric.
So, what I think is happening is that PSPP uses its own date and time formats that are not necessarily compatible with SPSS. Reading PSPP files in pyreadstat is easy for me to solve, I think: I just need to add things like EDATE8 and YMDHMS20 to the list of recognized formats. I will actually do so. I have seen that PSPP has many possible formats; if somebody would tell me what those are, then I could add the others as well. In my opinion, since DATE8 and DATETIME8 (the formats that Readstat uses when writing files) are valid SPSS formats, PSPP should support those. I would suggest opening an issue on their side.
A bit more elaboration: the way this works is that SPSS has only two kinds of variables, character and numeric. For numeric variables there is a format (which you can see in meta.original_variable_types) that tells how the number should be displayed, for example as a number with n decimals, or in this case as a date, datetime or time with a certain arrangement of months, days, year, week, hours, minutes, seconds, etc. So, in PSPP, every time you change the type of the variable to DATE and then select a different format, the format I am talking about will change as well.
In pyreadstat, when I get a number and a format, I check if that format is something I know to be a date, datetime, or time. If so, I convert the number to the corresponding Python type. If not, I let it be a number. So, the only thing I have to do to read more formats is to include them in the list of known formats.
I assume a similar thing happens in PSPP: it seems their list of known formats does not include DATE8 or DATETIME8. They would need to include those (and probably some information on how to display them in the GUI), and that should do it. |
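The format-lookup logic described above can be sketched roughly as follows. This is an illustration only, not pyreadstat's actual implementation: the set names, the function name `convert_numeric`, and the exact contents of the sets are hypothetical. It relies on the fact that SPSS stores dates and datetimes as seconds since 1582-10-14 00:00:00 and times as seconds since midnight.

```python
from datetime import datetime, timedelta

# SPSS counts date/datetime values in seconds from this origin.
SPSS_EPOCH = datetime(1582, 10, 14)

# Hypothetical lists of known formats, using names mentioned in this thread.
DATETIME_FORMATS = {"DATETIME8", "YMDHMS20"}
DATE_FORMATS = {"DATE8", "EDATE8", "EDATE11"}
TIME_FORMATS = {"TIME8"}


def convert_numeric(value, fmt):
    """Convert a numeric SPSS value according to its display format."""
    if fmt in DATETIME_FORMATS:
        return SPSS_EPOCH + timedelta(seconds=value)
    if fmt in DATE_FORMATS:
        return (SPSS_EPOCH + timedelta(seconds=value)).date()
    if fmt in TIME_FORMATS:
        # Times are seconds since midnight, independent of the epoch.
        return (datetime.min + timedelta(seconds=value)).time()
    # Unknown format: leave the value as a plain number.
    return value


print(convert_numeric(86400, "EDATE11"))  # known date format
print(convert_numeric(86400, "F8.2"))     # unknown format stays numeric
```

Under this scheme, supporting a new format such as EDATE11 only requires adding its name to the right set, which matches the fix described in this thread.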
I have added EDATE8, EDATE11 (PSPP default) and YMDHMS20 to the list of known formats for SPSS and now pyreadstat can read those formats coming from sav produced by PSPP. Still on dev branch for now. |
Looking at the PSPP source code I found EDATE in several places, among those this comment:
So, we are giving TIME8 and it's happy (meaning that the 8 doesn't matter; also it says EDATE and not EDATE8), but for some reason DATE8 and DATETIME8, which theoretically are in the list, are not recognized. So they probably have a bug somewhere else (as you already pointed out with that strange message in your logs). No idea, you have to ask them. @josmos In any case, my current conclusion is that everything is good with Readstat/pyreadstat and the bug is on the PSPP side. I will leave this open for a few more days in case you would like to add something; otherwise I will eventually close this. |
the particular log warning that you observed comes from file src/data/sys-file-reader.c, from this function:
No idea what it means or if it is related to the issue here |
the command line tool pspp-dump-sav seems to recognize the formats OK.
But trying to read the file in PSPP command line fails:
Interestingly the dump command says |
Readstat is following the file specification given by PSPP, in which DATE and DATETIME are specified: https://www.gnu.org/software/pspp/pspp-dev/html_node/Variable-Record.html |
@josmos solved!!!! With a big help from @evanmiller, on the latest code on dev this is working well in both SPSS and PSPP.
|
fixed in version 1.0.2 |
According to the README, datetime, date and time values should be converted to numeric with datetime/date/time formatting.
I am converting strings to datetime64[ns] dtype with:
df[field_name] = pd.to_datetime(df[field_name], errors="coerce")
or
df[field_name] = pd.to_datetime(df[field_name], format='%H:%M', errors="coerce")
for time, respectively.
The resulting sav variable has numeric type with F8.2 format.
How can I convert it to SPSS DATE or TIME format?
Is this a bug? In case I am getting something wrong, please explain how to get the right format.
Thanks
Josef
Setup Information:
I installed pyreadstat via pipenv.
INSTALLED VERSIONS
commit : d9fff2792bf16178d4e450fe7384244e50635733
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-42-generic
Version : #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1