Inconsistent formatting information in SPSS metadata? #77

Closed
TheManInTheShack opened this issue Sep 11, 2020 · 6 comments

Comments

@TheManInTheShack

Hi! I've been using this package for a good while now, and love it immensely - it is the centerpiece of several advanced applications that I have written for organizing and modifying SPSS files, and it's made a real difference to my organization and clients. I can't thank you enough for providing it.

This issue is something that I detected a while back, but have so far just been working around; I'm not sure how to classify it, and I'm hoping I can get some information about how the metadata is gathered.

Describe the issue

The basic problem is that there is a difference between these three things:
Here's what we see in variable view of SPSS:
[image: screenshot of the SPSS Variable View]
Here's the original_variable_types:
{'ResponseId': 'A18', 'StartDate': 'A255', 'Duration__in_seconds_': 'F40.2', 'Finished': 'F1.0'}
...and here's the variable_storage_width:
{'ResponseId': 24, 'StartDate': 1024, 'Duration__in_seconds_': 8, 'Finished': 8}

Look at the two text variables: ResponseId reads the A18 'correctly', but the StartDate field is showing A255 when it should be showing A1024. If variable_storage_width were always the reliable source, I could use it to overwrite the format, but, looking again at ResponseId, doing that here would give me A24, which would be incorrect. Note that the numeric variables do report the correct format; I just left those in for visibility/comparison.

So I guess the question is, how does original_variable_types gather its data, and is there a way that I can predict which one of these items is the one that SPSS will expect, so that I can reliably hold the 'real' format? Or is this a bug, and the A255 is showing because it's hitting some kind of small-string limit? Thinking about it as I'm writing all of this out, I suppose 255 is a very suspicious number for that to insert...

To Reproduce

This isn't really a code issue, but here's the simple code I ran to produce those, nothing out of the ordinary:

import pyreadstat
df, meta = pyreadstat.read_sav("test_width.sav")
print(meta.original_variable_types)
print(meta.variable_storage_width)

File example

test_width.zip

Expected behavior

I guess what I'm really after is how do I reliably recreate the 'actual' format as shown in the variable view, so that I can write syntax against it that refers to the correct size.

Setup Information:

How did you install pyreadstat? (pip)
Platform (windows, 64 bit)
Python Version (3.7)
Python Distribution (plain python)
Using Virtualenv or condaenv? (No)

@evanmiller

Hi, your issue looks very much like WizardMac/ReadStat#210. Try updating to pyreadstat 1.0.2 and see if that fixes the issue. pip install --upgrade pyreadstat should do it.

@TheManInTheShack
Author

I agree it does look similar, but I'm afraid this is happening on 1.0.2.

@evanmiller

Okay, it looks like each variable has several "widths" that need to be distinguished.

  • ReadStat's display_width corresponds to SPSS's "Columns"
  • ReadStat's format corresponds to SPSS's "Width" and "Decimals"

Then there is the storage_width, which SPSS does not display. (For strings, this should be the format width rounded up to the next multiple of 8 bytes.)
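
As a rough illustration of that rounding rule, here is a minimal sketch. It is not ReadStat's actual code, and the function name expected_storage_width is made up for this example; it only checks the rule against the two string variables reported above.

```python
def expected_storage_width(format_width: int) -> int:
    """Round a string format width up to the next multiple of 8 bytes.

    Hypothetical helper for illustration; it encodes only the rule of thumb
    described in this thread, not ReadStat's implementation.
    """
    return (format_width + 7) // 8 * 8


# Widths taken from the report: ResponseId is A18, StartDate should be A1024.
print(expected_storage_width(18))    # 24, matches variable_storage_width for ResponseId
print(expected_storage_width(1024))  # 1024, matches variable_storage_width for StartDate
```

This also shows why the storage width cannot simply be copied back into the format: 24 is the rounded-up value for A18, not the width SPSS displays.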

It looks like there is a bug in ReadStat similar to #210 that affects format rather than display_width. The underlying cause is similar: the old "print format" and "write format" data fields were limited to a single byte, i.e. maxed out at 255. SPSS later added other records to override those values.

While ReadStat is successfully reading the special record that indicates the length of a long string, it's only using that information to determine the storage width. It should be using that information to override the format width as well.
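
Until a fixed version is installed, affected variables can at least be detected. The sketch below is a heuristic, not part of pyreadstat: the helper name suspicious_string_formats is hypothetical, and it assumes that a string format of exactly A255 paired with a storage width above 256 bytes indicates the truncated format width described here. It takes the two dictionaries exactly as printed in the report.

```python
import re

# Matches plain SPSS string formats such as "A18" or "A255".
STRING_FORMAT = re.compile(r"^A(\d+)$")


def suspicious_string_formats(original_variable_types, variable_storage_width):
    """Return {name: (format, storage_width)} for likely truncated string formats.

    Assumption: a genuine A255 variable needs at most 256 bytes of storage,
    so A255 with a larger storage width points at the one-byte format limit.
    """
    flagged = {}
    for name, fmt in original_variable_types.items():
        match = STRING_FORMAT.match(fmt)
        if match is None:
            continue  # numeric or other non-string format
        storage = variable_storage_width.get(name)
        if storage is None:
            continue  # no storage width reported for this variable
        if int(match.group(1)) == 255 and storage > 256:
            flagged[name] = (fmt, storage)
    return flagged


# Metadata values copied from the report above.
types = {'ResponseId': 'A18', 'StartDate': 'A255',
         'Duration__in_seconds_': 'F40.2', 'Finished': 'F1.0'}
widths = {'ResponseId': 24, 'StartDate': 1024,
          'Duration__in_seconds_': 8, 'Finished': 8}

print(suspicious_string_formats(types, widths))
# {'StartDate': ('A255', 1024)}
```

The check only flags candidates. Because the storage width is rounded up, it gives an upper bound on the real format width rather than the exact value.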

All of this is to say, I think I know what's going on, and a fix should find its way through the pipes before long.

Thanks for the detailed report!

evanmiller added a commit to WizardMac/ReadStat that referenced this issue Sep 12, 2020
@TheManInTheShack
Author

Great to hear, and thanks so much for the swift attention! :)

@ofajardo
Collaborator

ofajardo commented Dec 3, 2020

The issue will be solved in the next version, as it has already been fixed in ReadStat.

@ofajardo
Collaborator

Solved in pyreadstat version 1.0.6.
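
For scripts that depend on the corrected format widths, a simple guard on the installed version may help. This is a minimal sketch: the helper name includes_format_width_fix is hypothetical, and it assumes a plain dotted version string such as the one exposed by pyreadstat.__version__.

```python
import re


def includes_format_width_fix(version: str, fixed_in=(1, 0, 6)) -> bool:
    """Return True if a dotted version string is at least 1.0.6.

    Hypothetical helper: only the leading digits of the first three
    components are compared, so a suffix like "rc1" is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)  # treat "1.1" as "1.1.0"
    return tuple(parts) >= fixed_in


print(includes_format_width_fix("1.0.2"))  # False, the version used in the report
print(includes_format_width_fix("1.0.6"))  # True
```

In practice the argument would be pyreadstat.__version__, read after importing the package.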
