Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write_sav() and datetime gives wrong format #69

Closed
josmos opened this issue Aug 21, 2020 · 27 comments
Closed

write_sav() and datetime gives wrong format #69

josmos opened this issue Aug 21, 2020 · 27 comments

Comments

@josmos
Copy link

josmos commented Aug 21, 2020

According to the Readme datetime, date, time should be converted to numeric with datetime/date/time formatting.

I am converting strings to datetime64[ns] dtype with:

df[field_name] = pd.to_datetime(df[field_name], errors="coerce")

or

df[field_name] = pd.to_datetime(df[field_name], format='%H:%M', errors="coerce")

for time, respectively.

The resulting sav variable has numeric type with F8.2 format

How can I convert it to SPSS DATE or TIME format?
Is this a bug? In case I am getting something wrong, please explain how to get the right format.

Thanks
Josef

Setup Information:

i installed pyreadstat via pipenv

INSTALLED VERSIONS

commit : d9fff2792bf16178d4e450fe7384244e50635733
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-42-generic
Version : #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1

@ofajardo
Copy link
Collaborator

ofajardo commented Aug 21, 2020

Can you please provide a complete minimal example? (meaning a short piece of code that would reproduce the issue) in my hands datetime64[ns] gets translated to SPSS DATETIME8, so I need more context to understand what is happening in your case.

@josmos
Copy link
Author

josmos commented Aug 21, 2020

Here is my example:

import pandas as pd
import pyreadstat

df = pd.DataFrame({
    "dates": {
        "10001": "nk.nk.2001",
        "10002": "nk.May.2003",
        "10003": "",
        "10004": "nk.nk.2018",
        "10005": None,
        "10006": "30.Dec.2010",
    },
    "times": {
        "10001": "",
        "10002": None,
        "10003": "16:30",
        "10004": "na:na",
        "10005": "10:10",
        "10006": "10:12",
    }
})

df["dates"] = df["dates"].str.replace("nk.nk", "15.Jun")
df["dates"] = df["dates"].str.replace("nk", "15")
df["dates"] = pd.to_datetime(df["dates"], errors="coerce")
print(df["dates"].dtype)
print(list(df["dates"]))

df["times"] = pd.to_datetime(df["times"], format='%H:%M', errors="coerce")
print(df["times"].dtype)
print(list(df["times"]))

pyreadstat.write_sav(df, "export.sav",
                     column_labels=["mydate", "mytime"],
                     variable_measure={"dates": "scale", "times": "scale"},
                     )

@ofajardo
Copy link
Collaborator

question: if you read the produced file back into pyreadstat and check meta.original_variable_types, what do you get? I get DATETIME8. If you get the same, does that mean that SPSS is ignoring the format and translating to F8.2? (or how do you get the F8.2)?

import pyreadstat 
df, meta = pyreadstat.read_sav("export.sav")
print(meta.original_variable_types)
# {'dates': 'DATETIME8', 'times': 'DATETIME8'}

@josmos
Copy link
Author

josmos commented Aug 21, 2020

I am getting {'dates': 'DATETIME8', 'times': 'DATETIME8'}
So this is probably an issue with the viewer i am using.
I am using PSPP (https://www.gnu.org/software/pspp/) to view the result (I do not have a licence for SPSS)

@ofajardo
Copy link
Collaborator

Let me check what SPSS thinks

@ofajardo
Copy link
Collaborator

SPSS reads it correctly as Dates. So, it seems it is a problem with PSPP. However, there is one strange thing, NaTs are being translated to 1677-09-21 00:12:43 instead of missing. That looks like a bug in pyreadstat.

Maybe PSPP gets confused by that very old date and therefore translates everything to number? Can you try to save a file without NaTs or any other strange date to see if PSPP reads it correctly?

import pyreadstat 
df, meta = pyreadstat.read_sav("export.sav")
df
                 dates                times
0  2001-06-15 00:00:00  1677-09-21 00:12:43
1  2003-05-15 00:00:00  1677-09-21 00:12:43
2  1677-09-21 00:12:43  1900-01-01 16:30:00
3  2018-06-15 00:00:00  1677-09-21 00:12:43
4  1677-09-21 00:12:43  1900-01-01 10:10:00
5  2010-12-30 00:00:00  1900-01-01 10:12:00

@ofajardo
Copy link
Collaborator

what if you leave out the time variable? maybe that one causes problems since it's year 1900?

@josmos
Copy link
Author

josmos commented Aug 21, 2020

Still F8.2 with this one:

import pandas as pd
import pyreadstat

df = pd.DataFrame({
    "dates": {
        "10001": "nk.nk.2001",
        "10002": "nk.May.2003",
        "10003": "30.Dec.2010",
    },
    "times": {
        "10001": "16:30",
        "10002": "10:10",
        "10003": "10:12",
    },
    "datetime": {
        "10001": "1999-04-11 00:42:02",
        "10002": "2000-05-01 01:12:01",
        "10003": "2020-07-21 13:10:00",
    }
})

df["dates"] = df["dates"].str.replace("nk.nk", "15.Jun")
df["dates"] = df["dates"].str.replace("nk", "15")
df["dates"] = pd.to_datetime(df["dates"], errors="coerce")

print(df["dates"].dtype)
print(list(df["dates"]))

df["times"] = pd.to_datetime(df["times"], format='%H:%M', errors="coerce")
print(df["times"].dtype)
print(list(df["times"]))

df["datetime"] = pd.to_datetime(df["datetime"], errors="coerce")
print(df["datetime"].dtype)
print(list(df["datetime"]))

pyreadstat.write_sav(df, "export.sav",
                     column_labels=["mydate", "mytime", "mydatetime"],
                     variable_measure={"dates": "scale", "times": "scale"},
                     )

df, meta = pyreadstat.read_sav("export.sav")
print(meta.original_variable_types)

@josmos
Copy link
Author

josmos commented Aug 21, 2020

also when removing the "time" variable

@ofajardo
Copy link
Collaborator

I was thinking using only column "datetime"

In any case since SPSS is reading it correctly, it is an issue in PSPP. I would suggest you create an issue there.

I will check what is happening with the NaTs being translated to funny dates.

@josmos
Copy link
Author

josmos commented Aug 21, 2020

Ok, thanks a lot!

@ofajardo
Copy link
Collaborator

the NaT writing issue has been corrected on dev branch. I will make a release once some other changes that are coming in the next few weeks are ready. If urgent, you can compile from dev branch and at least you will get missing values where you have NaTs when visualising in PSPP.

@ofajardo
Copy link
Collaborator

ofajardo commented Aug 21, 2020

and a final piece of information is that pyreadstat handles formatting for datetime/date/time.

In the case of date you get it if you use python's built in datetime.date and it get's translated to SPSS DATE8.
In the case of time, you get it if you use python's built in datetime.time and it get's translated to SPSS TIME8.
In the case of datetime, you get it if you use python's built in datetime.datetime, pandas TimeStamp or np.datetime64 and it gets translated to SPSS DATETIME8.

I tested PSPP and I got numeric for DATETIME8 and DATE8, interestingly TIME8 is working fine.

That means, in case you open a ticket in PSPP you could ask them to correct both DATETIME8 and DATE8.

@ofajardo
Copy link
Collaborator

and this is how to translate pandas Timestamp to python's datetime.time (that will become SPSS TIME8) and datetime.date (that will become SPSS DATE8):

# after exectuting your code, but before you use write_sav
df["times"] = df["times"].apply(lambda x: x.time()) # time
df["dates"] = df["dates"].apply(lambda x: x.date()) # date

@josmos
Copy link
Author

josmos commented Aug 21, 2020

I found out it can even be vectorized!
df[field_name] = df[field_name].dt.time

@ofajardo
Copy link
Collaborator

true! that's even better

@josmos
Copy link
Author

josmos commented Aug 25, 2020

When checking the log files from PSPP when opening files I find error for date and time fields like this one:

warning: test.sav near offset 0x248: variable V4 with width 0 has invalid print format 0x140800.
However, if i set variable_display_with I still get the same errors.
Maybe this is causing the incorrect display of datetime values?

@ofajardo
Copy link
Collaborator

The thing is that all of this works fine when SPSS reads these files.

Can you produce a file with the features you need in PSPP and put it here in order to examine how the code things?

@ofajardo
Copy link
Collaborator

In any case the way the variables are written is controlled in the C library underlying pyreadstat, "Readstat". Maybe you can write an issue there asking why this is happening. If they would somehow fix it, then I can inherit their changes and it will work for pyreadstat. Otherwise I cannot change those things direclty here in pyreadstat, but I always need it to be fixed in Readstat first.

@ofajardo
Copy link
Collaborator

ofajardo commented Aug 28, 2020

OK I used PSPP to write a file with one date variable (with format dd.mm.yy) and one datetime (with format yyyy-mm-dd HH:MM:SS ) (No time variable, because PSPP does not have the option to format as time). When reading in pyreadstat I noticed two interesting things:

1- Both variables are read as numeric in pyreadstat.
2- The format of the variables is as follows: (meta.original_variable_types)
{'mydate': 'EDATE8', 'mydatetime': 'YMDHMS20'}
3- Opening the file in SPSS: it did understand the format of mydate, but it did not understand mydatetime and it was shown as numeric.

So, what I think is happening is that PSPP uses its own date and time formats that are not necessarily compatible with SPSS.

Reading PSPP files in pyreadstat is easy to solve for me - I think- , I just need to add things like EDATE8 and YMDHMS20 to the list of recognized formats. I will do so actually. I have seen that PSPP has many possible formats, if somebody would start telling me what those are, then I could add these others as well.

In my opinion since DATE8 and DATETIME8 (the formats that Readstat is using when writing files) are valid SPSS formats, PSPP should support those. I would suggest open an issue on their side.

a bit more elaboration:

the way this works is that SPSS has only two kind of variables: character and numeric. For numeric, there is a format (that you can see in meta.original_variable_types ) that tells how the number should be displayed, for example as a numer with n decimals, or in this case as a date, datetime or time with certain arrangement of months, days, year, week, hours, minutes, seconds, etc. So, in PSPP every time you change the type of the variable to DATE and then select a different format, the format I am talking about will change as well.

In pyreadstat when I get a number and a format, I check if that format is something I know to be a date, datetime, or time. If so, I convert the number to the corresponding python type. If not, I let it be a number. So, the only thing I have to do to read more formats is to include them in the list of known formats.

I assume a similar thing happens in PSPP, it seems they list of known formats is not including DATE8 or DATETIME8 and they would need to include those (and probably some information on how to display it in the GUI), and that should do it.

@ofajardo
Copy link
Collaborator

I have added EDATE8, EDATE11 (PSPP default) and YMDHMS20 to the list of known formats for SPSS and now pyreadstat can read those formats coming from sav produced by PSPP. Still on dev branch for now.

@ofajardo
Copy link
Collaborator

ofajardo commented Aug 28, 2020

Looking at the PSPP source code I found EDATE in several places, among those this comment:

/* Parses DATE, ADATE, EDATE, JDATE, SDATE, QYR, MOYR, KWYR,
1120    DATETIME, YMDHMS, MTIME, TIME, and DTIME formats. */

so, we are giving TIME8 and it's happy (meaning that the 8 doesn't matter, also it says EDATE and not EDATE8), but for some reason DATE8 and DATETIME8, which theoretically are in the list are not recognized.

So probably they have a bug somewhere else (as you pointed already looking at that strange message in your logs). No idea, you have to ask them.

@josmos In any case my current conclusion is that everything is good with Readstat/Pyreadstat and the bug is in PSPP side. I will leave this open for a few more days in case you would like to say something else, otherwise I will eventually close this.

@ofajardo
Copy link
Collaborator

the particular log warning that you observed comes from file src/data/sys-file-reader.c, from this function:

/* Translates the format spec from sysfile format to internal
1527    format. */
1528 static void
1529 parse_format_spec (struct sfm_reader *r, off_t pos, unsigned int format,
1530                    enum which_format which, struct variable *v,
1531                    int *n_warnings)

1549       if (which == PRINT_FORMAT)
1550         sys_warn (r, pos, _("Variable %s with width %d has invalid print "
1551                             "format 0x%x."),
1552                   var_get_name (v), var_get_width (v), format);

No idea what it means or if it is related to the issue here

@ofajardo
Copy link
Collaborator

ofajardo commented Aug 28, 2020

the command line tool pspp-dump-sav seems to recognize the formats OK.

(base) # pspp-dump-sav export.sav
File header record:
	     Product name: @(#) SPSS DATA FILE - https://github.com/WizardMac/ReadStat
	      Layout code: 2
	       Compressed: 0 (no compression)
	     Weight index: 0
	  Number of cases: 6
	 Compression bias: 100
	    Creation date: 28 Aug 20
	    Creation time: 13:13:26
	       File label: ""
000000b4: variable record #0
	Width: 0 (numeric)
	Variable label: 1
	Missing values code: 0 (no missing values)
	Print format: 160800 (DATETIME8.0)
	Write format: 160800 (DATETIME8.0)
	Name: DATES
	000000d0 Variable label: "mydate"
000000e0: variable record #1
	Width: 0 (numeric)
	Variable label: 1
	Missing values code: 0 (no missing values)
	Print format: 160800 (DATETIME8.0)
	Write format: 160800 (DATETIME8.0)
	Name: TIMES
	000000fc Variable label: "mytime"

But trying to read the file in PSPP command line fails:

PSPP> get file='export.sav' .

warning: `export.sav' near offset 0xc0: Variable DATES with width 0 has invalid print format 0x160800.
warning: `export.sav' near offset 0xc4: Variable DATES with width 0 has invalid write format 0x160800.
warning: `export.sav' near offset 0xec: Variable TIMES with width 0 has invalid print format 0x160800.
warning: `export.sav' near offset 0xf0: Variable TIMES with width 0 has invalid write format 0x160800.

Interestingly the dump command says Print format: 160800 (DATETIME8.0) (meaning it is recognizing the format correctly?) but later PSPP invalid print format 0x160800. It truly looks like a bug in PSPP

@ofajardo ofajardo added the wontfix This will not be worked on label Aug 28, 2020
@ofajardo
Copy link
Collaborator

Readstat is following the file specification given by PSPP, DATE and DATETIME are specifiied:

https://www.gnu.org/software/pspp/pspp-dev/html_node/Variable-Record.html

@ofajardo
Copy link
Collaborator

ofajardo commented Aug 28, 2020

@josmos solved !!!! with a big help from @evanmiller

on the latest code on dev, this is working well both in SPSS and PSPP

import pandas as pd
import pyreadstat

df = pd.DataFrame({
    "dates": {
        "10001": "nk.nk.2001",
        "10002": "nk.May.2003",
        "10003": "",
        "10004": "nk.nk.2018",
        "10005": None,
        "10006": "30.Dec.2010",
    },
    "times": {
        "10001": "",
        "10002": None,
        "10003": "16:30",
        "10004": "na:na",
        "10005": "10:10",
        "10006": "10:12",
    }
})

df["dates"] = df["dates"].str.replace("nk.nk", "15.Jun")
df["dates"] = df["dates"].str.replace("nk", "15")
df["dates"] = pd.to_datetime(df["dates"], errors="coerce")

df["times"] = pd.to_datetime(df["times"], format='%H:%M', errors="coerce")

df["datetimes"] = df["dates"]
df.dates = df.dates.dt.date
df.times = df.times.dt.time

pyreadstat.write_sav(df, "export.sav",
                     column_labels=["mydate", "mytime", "mydatetime"],
                     variable_measure={"dates": "scale", "times": "scale"},
                     )

@ofajardo
Copy link
Collaborator

ofajardo commented Sep 5, 2020

fixed in version 1.0.2

@ofajardo ofajardo closed this as completed Sep 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants