Reading and Writing of long String Variables from SPSS #119

Open
Ov-ille opened this issue Mar 31, 2021 · 12 comments
Labels
bug (Something isn't working), requires changes in Readstat (waiting for changes in the C library Readstat)

Comments

@Ov-ille

Ov-ille commented Mar 31, 2021

When reading and writing SPSS files with long string variables, the affected variable is split into several variables.

Reproducing writing issue:

import pandas as pd
import pyreadstat as sav

a = pd.DataFrame()
a["LongString1"] = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."]
a["LongString2"] = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
]

sav.write_sav(
    a,
    r"C:\Users\XX\test_out1.sav"
)

When this file is opened in SPSS, it contains 5 variables instead of 2 ("LongString2" is followed by "V2_A1", "V2_A2", "V2_A3").
When read back into Python with pyreadstat it only shows the 2 created variables.

Strangely, when only "LongString2" is created and written, or when its variable name is shorter ("LongStr"), the splitting does not occur.
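
The read-back check described above can be sketched roughly as follows. This is a minimal sketch and not part of the original report: `extra_columns` and `roundtrip_columns` are hypothetical helper names, and the only pyreadstat calls assumed are `write_sav` and `read_sav`. Because pyreadstat reads the file back without the extra variables, a round trip like this is expected to look clean even for files that SPSS displays incorrectly, so opening the file in SPSS is still needed to see the problem.

```python
def extra_columns(expected, actual):
    """Return the names in `actual` that are not in `expected`, in order."""
    expected_set = set(expected)
    return [name for name in actual if name not in expected_set]


def roundtrip_columns(df, path):
    """Write `df` to `path` as .sav and return the column names read back.

    Hypothetical helper; requires pandas and pyreadstat to be installed.
    """
    import pyreadstat  # imported lazily so extra_columns works without it

    pyreadstat.write_sav(df, path)
    df_back, _meta = pyreadstat.read_sav(path)
    return list(df_back.columns)


# Column names as SPSS showed them in the report above, compared with the
# two columns that were actually written.
written = ["LongString1", "LongString2"]
seen_in_spss = ["LongString1", "LongString2", "V2_A1", "V2_A2", "V2_A3"]
print(extra_columns(written, seen_in_spss))
```

With pyreadstat installed, `extra_columns(list(a.columns), roundtrip_columns(a, "test_out1.sav"))` would run the same comparison on the file written above.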

Reproducing reading issue:
Unfortunately I can't offer a file to reproduce the reading issue. The one that causes the problem for me can't be shared due to data protection,
and I didn't succeed in creating a sample file that produces the same problem.

Setup Information:

  • pyreadstat was installed with pip
  • a virtual environment created with venv
  • Python 3.8 (plain)
  • Windows 10, 64-bit
@ofajardo
Collaborator

ofajardo commented Apr 2, 2021

Very similar to #118. Reported to ReadStat for them to take a look.

ofajardo added the bug and requires changes in Readstat labels May 5, 2021
@Ov-ille
Author

Ov-ille commented Dec 14, 2021

@ofajardo Is there any news regarding this bug? I just stumbled across this problem again when reading SPSS data with long strings. Some standard code suddenly stopped working, and it took me ages to realise that it was down to this problem again (columns being split without any warning).

@ofajardo
Collaborator

no news, sorry

@ofajardo
Collaborator

the issue can be replicated in pure C: WizardMac/ReadStat#260

@Ov-ille
Author

Ov-ille commented Aug 12, 2022

@ofajardo Since I keep encountering this issue, I spent some time creating data to reproduce it, in case that helps with finding the bug (sadly I don't have the skills to actually help solve it).

There are a lot of variations in how the error shows up when opening the file in SPSS; I tried to find a few examples.
It seems to depend on the number of columns in the dataset, the number of characters in the strings, and also the format of the variable name.

import pandas as pd
import numpy as np
import pyreadstat as sav

### create dataframe
error_file = pd.DataFrame()
columnNames = ['so3_10_9_1', 'so3_10_10_1', 'so3_10_11_1', 'so3_10_12_1',
       'so3_10_13_1', 'so3_10_14_1', 'so3_10_15_1', 'so3_10_16_1',
       'so3_10_17_1', 'so3_10_18_1', 'so3_10_19_1', 'so3_10_20_1',
       'so3_10_96opn', 'so3_10_97opn', 'so3_10_98opn']
error_file[columnNames] = np.nan
# 505 characters: above the 504-character threshold, produces the error
error_file.loc[0, "so3_10_98opn"] = "a" * 505
# 504 characters: the shortest length that produces the error
error_file.loc[0, "so3_10_97opn"] = "a" * 504
# 503 characters or fewer: works
error_file.loc[0, "so3_10_96opn"] = "a" * 503
# bug example: variable is split with pattern V{number of variable}A{1/2/3/...}
sav.write_sav(error_file, "error_file.sav")

image

### keep only problem-column with 504 characters
# bug example: variable is split (different pattern!)
sav.write_sav(error_file[["so3_10_97opn"]], "error2_file.sav")

image

### keep only problem-column with 505 characters
# bug example: variable is exported CORRECTLY!
sav.write_sav(error_file[["so3_10_98opn"]], "error3_file.sav")

image

### keep only problem-column and create different variable names
error4_file = error_file[["so3_10_97opn"]].copy()
## variable name without underscores (two characters shorter)
error4_file["so31097opn"] = error4_file["so3_10_97opn"]
## variable name with only one underscore
error4_file["so3_1097opn"] = error4_file["so3_10_97opn"]
## variable name with no underscores but same amount of characters
error4_file["so3x10x97opn"] = error4_file["so3_10_97opn"]
# bug example: variables are split with different naming patterns!
sav.write_sav(error4_file, "error4_file.sav")

image
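
As background, and as an assumption that is not stated anywhere in this thread: according to the PSPP description of the system file format, a .sav file stores a string wider than 255 bytes as a series of segment variables that each carry up to 252 bytes of data, plus a record telling the reader how to join them. That would explain why a long string can appear as several variables at all when the joining information is not understood. The sketch below only does that arithmetic; `segment_count` is a hypothetical name, and whether the 504-character threshold above is related to 504 = 2 × 252 is a guess, not a finding.

```python
SEGMENT_DATA_BYTES = 252  # assumed usable bytes per segment
MAX_SHORT_WIDTH = 255     # assumed widest string stored as a single variable


def segment_count(width):
    """Number of segments assumed for a string variable of `width` bytes."""
    if width < 1:
        raise ValueError("width must be positive")
    if width <= MAX_SHORT_WIDTH:
        return 1
    # Ceiling division by the data bytes carried per segment.
    return (width + SEGMENT_DATA_BYTES - 1) // SEGMENT_DATA_BYTES


for width in (255, 256, 503, 504, 505):
    print(width, segment_count(width))
```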

@ofajardo
Collaborator

ofajardo commented Feb 22, 2023

Hi @Ov-ille, I have tested the code from your initial report, and it seems to be fixed in version 1.2.1, which I just released. Could you please check whether it is fully solved now?

@ofajardo
Collaborator

I also tried the other examples and all of them seem good now. Closing this.

@Ov-ille
Author

Ov-ille commented Feb 23, 2023

Hi @ofajardo, and thanks for testing! I just installed the newest version (1.2.1), but the problems from this issue haven't changed. Did you open the file in SPSS, or how did you check whether it worked? When the same files are read back into Python after being written with pyreadstat, the split columns don't appear, but when opened in SPSS they are split.

When creating the file directly in spss and then reading with pyreadstat, the variables were kept the way they should be.

@ofajardo
Collaborator

ofajardo commented Feb 23, 2023

I see, I was checking by reading them with pyreadstat only. I'll reopen this issue then. Now I realize the issue was always that pyreadstat was reading the file correctly but SPSS was not.

ofajardo reopened this Feb 23, 2023
@KevinCrossDCL

I'm also experiencing a similar issue:

variable_format = { 'VariableName': 'A1000' }
....
pyreadstat.write_sav(df, "SPSS.sav", column_labels=variable_labels, variable_format=variable_format, variable_value_labels=variable_value_labels)

In the SPSS file that's created, the length is set to 255. If I set it to 255 or less it works, but anything higher defaults to 255.
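
One way to see which columns would be affected is to measure the longest value per column before writing. This is a minimal sketch, assuming the 255 limit applies to the encoded byte length (an assumption; the comment above only says "length") and using a plain dict of lists instead of a DataFrame so it runs without pandas. `columns_over_limit` is a hypothetical helper name, not a pyreadstat function.

```python
def columns_over_limit(columns, limit=255, encoding="utf-8"):
    """Map column name -> longest encoded length, for columns above `limit`.

    `columns` maps a column name to a list of values; non-string values
    (for example None or NaN) are ignored.
    """
    result = {}
    for name, values in columns.items():
        longest = max(
            (len(v.encode(encoding)) for v in values if isinstance(v, str)),
            default=0,
        )
        if longest > limit:
            result[name] = longest
    return result


data = {
    "short_text": ["abc", None],
    "VariableName": ["a" * 1000, "b" * 10],
}
print(columns_over_limit(data))
```

For a pandas DataFrame, `columns_over_limit(df.to_dict(orient="list"))` should give the same kind of report.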

@ME-researchgroup

I am running into the same issue.
I see you have already opened an issue at WizardMac/ReadStat#260 that is still open.

The names of the variables that are being created seem quite unpredictable, which makes writing a hacky quick fix difficult. Hopefully our friends at ReadStat can look into it!

@gulchitai

gulchitai commented Jan 19, 2024

image
I'm facing the same problem too, using the haven 2.5.3 library from R.


5 participants