write_sav() and datetime gives wrong format #69

josmos · 2020-08-21T09:04:48Z

According to the Readme datetime, date, time should be converted to numeric with datetime/date/time formatting.

I am converting strings to datetime64[ns] dtype with:

df[field_name] = pd.to_datetime(df[field_name], errors="coerce")

or

df[field_name] = pd.to_datetime(df[field_name], format='%H:%M', errors="coerce")

for time, respectively.

The resulting sav variable has numeric type with F8.2 format

How can I convert it to SPSS DATE or TIME format?
Is this a bug? In case I am getting something wrong, please explain how to get the right format.

Thanks
Josef

Setup Information:

i installed pyreadstat via pipenv

INSTALLED VERSIONS

commit : d9fff2792bf16178d4e450fe7384244e50635733
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-42-generic
Version : #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1

The text was updated successfully, but these errors were encountered:

ofajardo · 2020-08-21T09:48:24Z

Can you please provide a complete minimal example? (meaning a short piece of code that would reproduce the issue) in my hands datetime64[ns] gets translated to SPSS DATETIME8, so I need more context to understand what is happening in your case.

josmos · 2020-08-21T10:15:15Z

Here is my example:

import pandas as pd
import pyreadstat

df = pd.DataFrame({
    "dates": {
        "10001": "nk.nk.2001",
        "10002": "nk.May.2003",
        "10003": "",
        "10004": "nk.nk.2018",
        "10005": None,
        "10006": "30.Dec.2010",
    },
    "times": {
        "10001": "",
        "10002": None,
        "10003": "16:30",
        "10004": "na:na",
        "10005": "10:10",
        "10006": "10:12",
    }
})

df["dates"] = df["dates"].str.replace("nk.nk", "15.Jun")
df["dates"] = df["dates"].str.replace("nk", "15")
df["dates"] = pd.to_datetime(df["dates"], errors="coerce")
print(df["dates"].dtype)
print(list(df["dates"]))

df["times"] = pd.to_datetime(df["times"], format='%H:%M', errors="coerce")
print(df["times"].dtype)
print(list(df["times"]))

pyreadstat.write_sav(df, "export.sav",
                     column_labels=["mydate", "mytime"],
                     variable_measure={"dates": "scale", "times": "scale"},
                     )

ofajardo · 2020-08-21T10:24:58Z

question: if you read the produced file back into pyreadstat and check meta.original_variable_types, what do you get? I get DATETIME8. If you get the same, does that mean that SPSS is ignoring the format and translating to F8.2? (or how do you get the F8.2)?

import pyreadstat 
df, meta = pyreadstat.read_sav("export.sav")
print(meta.original_variable_types)
# {'dates': 'DATETIME8', 'times': 'DATETIME8'}

josmos · 2020-08-21T10:30:07Z

I am getting {'dates': 'DATETIME8', 'times': 'DATETIME8'}
So this is probably an issue with the viewer i am using.
I am using PSPP (https://www.gnu.org/software/pspp/) to view the result (I do not have a licence for SPSS)

ofajardo · 2020-08-21T10:32:23Z

Let me check what SPSS thinks

ofajardo · 2020-08-21T10:41:43Z

SPSS reads it correctly as Dates. So, it seems it is a problem with PSPP. However, there is one strange thing, NaTs are being translated to 1677-09-21 00:12:43 instead of missing. That looks like a bug in pyreadstat.

Maybe PSPP gets confused by that very old date and therefore translates everything to number? Can you try to save a file without NaTs or any other strange date to see if PSPP reads it correctly?

import pyreadstat 
df, meta = pyreadstat.read_sav("export.sav")
df
                 dates                times
0  2001-06-15 00:00:00  1677-09-21 00:12:43
1  2003-05-15 00:00:00  1677-09-21 00:12:43
2  1677-09-21 00:12:43  1900-01-01 16:30:00
3  2018-06-15 00:00:00  1677-09-21 00:12:43
4  1677-09-21 00:12:43  1900-01-01 10:10:00
5  2010-12-30 00:00:00  1900-01-01 10:12:00

ofajardo · 2020-08-21T10:53:07Z

what if you leave out the time variable? maybe that one causes problems since it's year 1900?

josmos · 2020-08-21T10:54:06Z

Still F8.2 with this one:

import pandas as pd
import pyreadstat

df = pd.DataFrame({
    "dates": {
        "10001": "nk.nk.2001",
        "10002": "nk.May.2003",
        "10003": "30.Dec.2010",
    },
    "times": {
        "10001": "16:30",
        "10002": "10:10",
        "10003": "10:12",
    },
    "datetime": {
        "10001": "1999-04-11 00:42:02",
        "10002": "2000-05-01 01:12:01",
        "10003": "2020-07-21 13:10:00",
    }
})

df["dates"] = df["dates"].str.replace("nk.nk", "15.Jun")
df["dates"] = df["dates"].str.replace("nk", "15")
df["dates"] = pd.to_datetime(df["dates"], errors="coerce")

print(df["dates"].dtype)
print(list(df["dates"]))

df["times"] = pd.to_datetime(df["times"], format='%H:%M', errors="coerce")
print(df["times"].dtype)
print(list(df["times"]))

df["datetime"] = pd.to_datetime(df["datetime"], errors="coerce")
print(df["datetime"].dtype)
print(list(df["datetime"]))

pyreadstat.write_sav(df, "export.sav",
                     column_labels=["mydate", "mytime", "mydatetime"],
                     variable_measure={"dates": "scale", "times": "scale"},
                     )

df, meta = pyreadstat.read_sav("export.sav")
print(meta.original_variable_types)

josmos · 2020-08-21T10:55:50Z

also when removing the "time" variable

ofajardo · 2020-08-21T10:56:49Z

I was thinking using only column "datetime"

In any case since SPSS is reading it correctly, it is an issue in PSPP. I would suggest you create an issue there.

I will check what is happening with the NaTs being translated to funny dates.

josmos · 2020-08-21T10:59:26Z

Ok, thanks a lot!

ofajardo · 2020-08-21T14:48:51Z

the NaT writing issue has been corrected on dev branch. I will make a release once some other changes that are coming in the next few weeks are ready. If urgent, you can compile from dev branch and at least you will get missing values where you have NaTs when visualising in PSPP.

ofajardo · 2020-08-21T15:08:32Z

and a final piece of information is that pyreadstat handles formatting for datetime/date/time.

In the case of date you get it if you use python's built in datetime.date and it get's translated to SPSS DATE8.
In the case of time, you get it if you use python's built in datetime.time and it get's translated to SPSS TIME8.
In the case of datetime, you get it if you use python's built in datetime.datetime, pandas TimeStamp or np.datetime64 and it gets translated to SPSS DATETIME8.

I tested PSPP and I got numeric for DATETIME8 and DATE8, interestingly TIME8 is working fine.

That means, in case you open a ticket in PSPP you could ask them to correct both DATETIME8 and DATE8.

ofajardo · 2020-08-21T15:51:35Z

and this is how to translate pandas Timestamp to python's datetime.time (that will become SPSS TIME8) and datetime.date (that will become SPSS DATE8):

# after exectuting your code, but before you use write_sav
df["times"] = df["times"].apply(lambda x: x.time()) # time
df["dates"] = df["dates"].apply(lambda x: x.date()) # date

josmos · 2020-08-21T15:53:15Z

I found out it can even be vectorized!
df[field_name] = df[field_name].dt.time

ofajardo · 2020-08-21T15:54:58Z

true! that's even better

josmos · 2020-08-25T13:10:33Z

When checking the log files from PSPP when opening files I find error for date and time fields like this one:

warning: test.sav near offset 0x248: variable V4 with width 0 has invalid print format 0x140800.
However, if i set variable_display_with I still get the same errors.
Maybe this is causing the incorrect display of datetime values?

ofajardo · 2020-08-25T15:06:01Z

The thing is that all of this works fine when SPSS reads these files.

Can you produce a file with the features you need in PSPP and put it here in order to examine how the code things?

ofajardo · 2020-08-25T16:04:09Z

In any case the way the variables are written is controlled in the C library underlying pyreadstat, "Readstat". Maybe you can write an issue there asking why this is happening. If they would somehow fix it, then I can inherit their changes and it will work for pyreadstat. Otherwise I cannot change those things direclty here in pyreadstat, but I always need it to be fixed in Readstat first.

ofajardo · 2020-08-28T10:10:24Z

OK I used PSPP to write a file with one date variable (with format dd.mm.yy) and one datetime (with format yyyy-mm-dd HH:MM:SS ) (No time variable, because PSPP does not have the option to format as time). When reading in pyreadstat I noticed two interesting things:

1- Both variables are read as numeric in pyreadstat.
2- The format of the variables is as follows: (meta.original_variable_types)
{'mydate': 'EDATE8', 'mydatetime': 'YMDHMS20'}
3- Opening the file in SPSS: it did understand the format of mydate, but it did not understand mydatetime and it was shown as numeric.

So, what I think is happening is that PSPP uses its own date and time formats that are not necessarily compatible with SPSS.

Reading PSPP files in pyreadstat is easy to solve for me - I think- , I just need to add things like EDATE8 and YMDHMS20 to the list of recognized formats. I will do so actually. I have seen that PSPP has many possible formats, if somebody would start telling me what those are, then I could add these others as well.

In my opinion since DATE8 and DATETIME8 (the formats that Readstat is using when writing files) are valid SPSS formats, PSPP should support those. I would suggest open an issue on their side.

a bit more elaboration:

the way this works is that SPSS has only two kind of variables: character and numeric. For numeric, there is a format (that you can see in meta.original_variable_types ) that tells how the number should be displayed, for example as a numer with n decimals, or in this case as a date, datetime or time with certain arrangement of months, days, year, week, hours, minutes, seconds, etc. So, in PSPP every time you change the type of the variable to DATE and then select a different format, the format I am talking about will change as well.

In pyreadstat when I get a number and a format, I check if that format is something I know to be a date, datetime, or time. If so, I convert the number to the corresponding python type. If not, I let it be a number. So, the only thing I have to do to read more formats is to include them in the list of known formats.

I assume a similar thing happens in PSPP, it seems they list of known formats is not including DATE8 or DATETIME8 and they would need to include those (and probably some information on how to display it in the GUI), and that should do it.

ofajardo · 2020-08-28T10:29:06Z

I have added EDATE8, EDATE11 (PSPP default) and YMDHMS20 to the list of known formats for SPSS and now pyreadstat can read those formats coming from sav produced by PSPP. Still on dev branch for now.

ofajardo · 2020-08-28T10:42:42Z

Looking at the PSPP source code I found EDATE in several places, among those this comment:

/* Parses DATE, ADATE, EDATE, JDATE, SDATE, QYR, MOYR, KWYR,
1120    DATETIME, YMDHMS, MTIME, TIME, and DTIME formats. */

so, we are giving TIME8 and it's happy (meaning that the 8 doesn't matter, also it says EDATE and not EDATE8), but for some reason DATE8 and DATETIME8, which theoretically are in the list are not recognized.

So probably they have a bug somewhere else (as you pointed already looking at that strange message in your logs). No idea, you have to ask them.

@josmos In any case my current conclusion is that everything is good with Readstat/Pyreadstat and the bug is in PSPP side. I will leave this open for a few more days in case you would like to say something else, otherwise I will eventually close this.

ofajardo · 2020-08-28T10:58:58Z

the particular log warning that you observed comes from file src/data/sys-file-reader.c, from this function:

/* Translates the format spec from sysfile format to internal
1527    format. */
1528 static void
1529 parse_format_spec (struct sfm_reader *r, off_t pos, unsigned int format,
1530                    enum which_format which, struct variable *v,
1531                    int *n_warnings)

1549       if (which == PRINT_FORMAT)
1550         sys_warn (r, pos, _("Variable %s with width %d has invalid print "
1551                             "format 0x%x."),
1552                   var_get_name (v), var_get_width (v), format);

No idea what it means or if it is related to the issue here

ofajardo · 2020-08-28T13:23:15Z

the command line tool pspp-dump-sav seems to recognize the formats OK.

(base) # pspp-dump-sav export.sav
File header record:
	     Product name: @(#) SPSS DATA FILE - https://github.com/WizardMac/ReadStat
	      Layout code: 2
	       Compressed: 0 (no compression)
	     Weight index: 0
	  Number of cases: 6
	 Compression bias: 100
	    Creation date: 28 Aug 20
	    Creation time: 13:13:26
	       File label: ""
000000b4: variable record #0
	Width: 0 (numeric)
	Variable label: 1
	Missing values code: 0 (no missing values)
	Print format: 160800 (DATETIME8.0)
	Write format: 160800 (DATETIME8.0)
	Name: DATES
	000000d0 Variable label: "mydate"
000000e0: variable record #1
	Width: 0 (numeric)
	Variable label: 1
	Missing values code: 0 (no missing values)
	Print format: 160800 (DATETIME8.0)
	Write format: 160800 (DATETIME8.0)
	Name: TIMES
	000000fc Variable label: "mytime"

But trying to read the file in PSPP command line fails:

PSPP> get file='export.sav' .

warning: `export.sav' near offset 0xc0: Variable DATES with width 0 has invalid print format 0x160800.
warning: `export.sav' near offset 0xc4: Variable DATES with width 0 has invalid write format 0x160800.
warning: `export.sav' near offset 0xec: Variable TIMES with width 0 has invalid print format 0x160800.
warning: `export.sav' near offset 0xf0: Variable TIMES with width 0 has invalid write format 0x160800.

Interestingly the dump command says Print format: 160800 (DATETIME8.0) (meaning it is recognizing the format correctly?) but later PSPP invalid print format 0x160800. It truly looks like a bug in PSPP

ofajardo · 2020-08-28T14:43:59Z

Readstat is following the file specification given by PSPP, DATE and DATETIME are specifiied:

https://www.gnu.org/software/pspp/pspp-dev/html_node/Variable-Record.html

See #211 Roche/pyreadstat#69

ofajardo · 2020-08-28T16:00:29Z

@josmos solved !!!! with a big help from @evanmiller

on the latest code on dev, this is working well both in SPSS and PSPP

import pandas as pd
import pyreadstat

df = pd.DataFrame({
    "dates": {
        "10001": "nk.nk.2001",
        "10002": "nk.May.2003",
        "10003": "",
        "10004": "nk.nk.2018",
        "10005": None,
        "10006": "30.Dec.2010",
    },
    "times": {
        "10001": "",
        "10002": None,
        "10003": "16:30",
        "10004": "na:na",
        "10005": "10:10",
        "10006": "10:12",
    }
})

df["dates"] = df["dates"].str.replace("nk.nk", "15.Jun")
df["dates"] = df["dates"].str.replace("nk", "15")
df["dates"] = pd.to_datetime(df["dates"], errors="coerce")

df["times"] = pd.to_datetime(df["times"], format='%H:%M', errors="coerce")

df["datetimes"] = df["dates"]
df.dates = df.dates.dt.date
df.times = df.times.dt.time

pyreadstat.write_sav(df, "export.sav",
                     column_labels=["mydate", "mytime", "mydatetime"],
                     variable_measure={"dates": "scale", "times": "scale"},
                     )

ofajardo · 2020-09-05T07:14:46Z

fixed in version 1.0.2

ofajardo pushed a commit that referenced this issue Aug 21, 2020

fixing writing NaT values, partially fixes issue #69

25c8cef

ofajardo added the wontfix This will not be worked on label Aug 28, 2020

ofajardo mentioned this issue Aug 28, 2020

Question: variables with DATE and DATETIME formats not read correctly in PSPP WizardMac/ReadStat#211

Closed

evanmiller added a commit to WizardMac/ReadStat that referenced this issue Aug 28, 2020

Provide better default widths for SPSS formats

174a899

See #211 Roche/pyreadstat#69

ofajardo removed the wontfix This will not be worked on label Aug 28, 2020

ofajardo closed this as completed Sep 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

write_sav() and datetime gives wrong format #69

write_sav() and datetime gives wrong format #69

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020 •

edited

josmos commented Aug 21, 2020 •

edited

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

ofajardo commented Aug 21, 2020

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

ofajardo commented Aug 21, 2020 •

edited

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

josmos commented Aug 25, 2020

ofajardo commented Aug 25, 2020

ofajardo commented Aug 25, 2020

ofajardo commented Aug 28, 2020 •

edited

ofajardo commented Aug 28, 2020

ofajardo commented Aug 28, 2020 •

edited

ofajardo commented Aug 28, 2020

ofajardo commented Aug 28, 2020 •

edited

ofajardo commented Aug 28, 2020

ofajardo commented Aug 28, 2020 •

edited

ofajardo commented Sep 5, 2020

write_sav() and datetime gives wrong format #69

write_sav() and datetime gives wrong format #69

Comments

josmos commented Aug 21, 2020

INSTALLED VERSIONS

ofajardo commented Aug 21, 2020 • edited

josmos commented Aug 21, 2020 • edited

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

ofajardo commented Aug 21, 2020

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

ofajardo commented Aug 21, 2020 • edited

ofajardo commented Aug 21, 2020

josmos commented Aug 21, 2020

ofajardo commented Aug 21, 2020

josmos commented Aug 25, 2020

ofajardo commented Aug 25, 2020

ofajardo commented Aug 25, 2020

ofajardo commented Aug 28, 2020 • edited

ofajardo commented Aug 28, 2020

ofajardo commented Aug 28, 2020 • edited

ofajardo commented Aug 28, 2020

ofajardo commented Aug 28, 2020 • edited

ofajardo commented Aug 28, 2020

ofajardo commented Aug 28, 2020 • edited

ofajardo commented Sep 5, 2020

ofajardo commented Aug 21, 2020 •

edited

josmos commented Aug 21, 2020 •

edited

ofajardo commented Aug 21, 2020 •

edited

ofajardo commented Aug 28, 2020 •

edited

ofajardo commented Aug 28, 2020 •

edited

ofajardo commented Aug 28, 2020 •

edited

ofajardo commented Aug 28, 2020 •

edited