Reading and Writing of long String Variables from SPSS #119

Ov-ille · 2021-03-31T09:09:51Z

When reading and writing SPSS files with long string variables, the affected variable is split into several variables.

Reproducing writing issue:

import pandas as pd
import pyreadstat as sav

a = pd.DataFrame()
a["LongString1"] = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."]
a["LongString2"] = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
]

sav.write_sav(
    a,
    r"C:\Users\XX\test_out1.sav"
)

When this file is opened in SPSS, it contains 5 variables instead of 2 ("LongString2" is followed by "V2_A1", "V2_A2", "V2_A3").
When read back into Python with pyreadstat, it shows only the 2 variables that were created.

Strangely, when only "LongString2" is created and written, or when its variable name is shorter ("LongStr"), the splitting does not occur.

Reproducing reading issue:

Unfortunately I can't offer a file to reproduce the reading issue. The one that causes the problem for me can't be shared due to data protection,
and I didn't succeed in creating a sample file that produces the same problem.

Setup Information:

pyreadstat was installed with pip
a virtual environment created with venv
Python3.8 (plain)
Windows10, 64bit

ofajardo · 2021-04-02T13:11:48Z

Very similar to #118. Reported to ReadStat so they can take a look.

Ov-ille · 2021-12-14T16:37:51Z

@ofajardo Is there any news regarding this bug? I just stumbled across this problem again when reading SPSS data with long strings. Some standard code suddenly stopped working, and it took me ages to realise that it was down to this problem again (columns being split without any warning).

ofajardo · 2021-12-14T16:39:44Z

no news, sorry

ofajardo · 2022-02-23T13:53:14Z

the issue can be replicated in pure C: WizardMac/ReadStat#260

Ov-ille · 2022-08-12T09:53:34Z

@ofajardo Since I keep encountering this issue, I spent some time creating data to reproduce it, in case that helps with finding the bug (sadly I don't have the skills to actually help solve it).

There are a lot of variations in how the error shows up when opening the file in SPSS; I tried to find a few examples.
It seems to depend on the number of columns in the dataset, the number of characters in the strings, and also the format of the variable name.

import pandas as pd
import numpy as np
import pyreadstat as sav

### create dataframe
error_file = pd.DataFrame()
columnNames = ['so3_10_9_1', 'so3_10_10_1', 'so3_10_11_1', 'so3_10_12_1',
       'so3_10_13_1', 'so3_10_14_1', 'so3_10_15_1', 'so3_10_16_1',
       'so3_10_17_1', 'so3_10_18_1', 'so3_10_19_1', 'so3_10_20_1',
       'so3_10_96opn', 'so3_10_97opn', 'so3_10_98opn']
error_file[columnNames] = np.nan
# 504 characters or more produce the error
error_file.loc[0,"so3_10_98opn"] = "a"*505
# exactly 504 characters also produces the error
error_file.loc[0,"so3_10_97opn"] = "a"*504
# 503 characters or fewer work
error_file.loc[0,"so3_10_96opn"] = "a"*503
# bug example: variable is split with pattern V{number of variable}A{1/2/3/...}
sav.write_sav(error_file, "error_file.sav")

### keep only problem-column with 504 characters
# bug example: variable is split (different pattern!)
sav.write_sav(error_file[["so3_10_97opn"]], "error2_file.sav")

### keep only problem-column with 505 characters
# counter-example: variable is exported CORRECTLY!
sav.write_sav(error_file[["so3_10_98opn"]], "error3_file.sav")

### keep only problem-column and create different variable names
error4_file = error_file[["so3_10_97opn"]].copy()
## variable name without underscores (two characters shorter)
error4_file["so31097opn"] = error4_file["so3_10_97opn"]
## variable name with only one underscore
error4_file["so3_1097opn"] = error4_file["so3_10_97opn"]
## variable name with no underscores but same amount of characters
error4_file["so3x10x97opn"] = error4_file["so3_10_97opn"]
# bug example: variables are split with different naming patterns!
sav.write_sav(error4_file, "error4_file.sav")
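
Since the split appears to depend on string length, a quick check before writing can at least flag the columns that might be affected. The following is only a minimal sketch: the helper name `find_long_string_columns` is hypothetical, the 255-byte default threshold is an assumption (the examples above only start failing at 504 characters), and it works on plain Python lists per column, such as those returned by `df.to_dict("list")`.

```python
from typing import Dict, List


def find_long_string_columns(
    columns: Dict[str, List[object]],
    threshold: int = 255,      # assumed limit, adjust to what you observe
    encoding: str = "utf-8",
) -> Dict[str, int]:
    """Return {column name: longest encoded length} for columns above threshold."""
    at_risk = {}
    for name, values in columns.items():
        # Only real strings count; NaN/None placeholders are skipped.
        longest = max(
            (len(v.encode(encoding)) for v in values if isinstance(v, str)),
            default=0,
        )
        if longest > threshold:
            at_risk[name] = longest
    return at_risk


data = {
    "so3_10_96opn": ["a" * 503],
    "so3_10_97opn": ["a" * 504],
    "short_col": ["abc", float("nan")],
}
print(find_long_string_columns(data))
```

The length is measured in encoded bytes rather than characters, because non-ASCII text takes more than one byte per character in UTF-8.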

ofajardo · 2023-02-22T15:13:09Z

Hi @Ov-ille, I have tested the code from your initial report, and in version 1.2.1, which I just released, it seems to be fixed. Could you please check whether it is fully solved now?
@mtr

ofajardo · 2023-02-23T14:14:39Z

I also tried the other examples and all of them seem good now. Closing this.

Ov-ille · 2023-02-23T16:10:08Z

Hi @ofajardo and thanks for testing! I just installed the newest version (1.2.1), but the problems from this issue haven't changed. Did you open the file in SPSS, or how did you check whether it worked? When reading the same files back into Python after writing them with pyreadstat, the split columns don't appear. But when opened in SPSS, they are split.

When creating the file directly in SPSS and then reading it with pyreadstat, the variables were kept the way they should be.

ofajardo · 2023-02-23T16:16:55Z

I see, I was checking by reading them with pyreadstat only. I re-open this issue then. Now I realize the issue was always that pyreadstat was reading it correctly but SPSS was not.

KevinCrossDCL · 2023-07-12T15:20:21Z

I'm also experiencing a similar issue:

variable_format = { 'VariableName': 'A1000' }
....
pyreadstat.write_sav(df, "SPSS.sav", column_labels=variable_labels, variable_format=variable_format, variable_value_labels=variable_value_labels)

In the SPSS file that's created, the length is set to 255. If I set the width to 255 or less it works, but anything higher defaults to 255.
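
For context, the 255 figure matches how the SAV format stores strings according to the PSPP documentation of the system file format: a string wider than 255 bytes is stored as several segments, one per 252 bytes of width (rounded up), where every segment except the last is declared with width 255. The sketch below reproduces only that documented arithmetic. It is not ReadStat's or pyreadstat's actual code, the function name `segment_widths` is hypothetical, and whether this is related to the splitting seen in SPSS is an open question in this thread.

```python
from typing import List


def segment_widths(width: int) -> List[int]:
    """Declared widths of the segments storing a string of `width` bytes.

    Follows the layout described in the PSPP system file format docs;
    this is an illustration, not code taken from any library.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if width <= 255:
        return [width]                     # ordinary string, single segment
    n_segments = (width + 251) // 252      # one segment per 252 bytes, rounded up
    last = width - (n_segments - 1) * 252  # remainder goes into the last segment
    return [255] * (n_segments - 1) + [last]


for w in (255, 256, 503, 504, 505, 1000):
    print(w, segment_widths(w))
```

Note that 504 is exactly 2 × 252, so 505 is the first width that needs a third segment; that boundary is close to the 503/504 threshold reported earlier in this thread, which may or may not be a coincidence.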

ME-researchgroup · 2023-11-27T14:22:40Z

I am running into the same issue.
I see you have already opened an issue at WizardMac/ReadStat#260 that is still open.

The names of the variables that are being created seem quite unpredictable, which makes writing a hacky quick fix difficult. Hopefully our friends at ReadStat can look into it!

gulchitai · 2024-01-19T00:59:41Z

I'm facing the same problem too. I'm using the haven library (version 2.5.3) from R.

ofajardo mentioned this issue Apr 2, 2021

long string variable split when reading in SPSS WizardMac/ReadStat#236

Open

ofajardo mentioned this issue May 5, 2021

Long string handling #118

Closed

ofajardo added the labels "bug" (Something isn't working) and "requires changes in Readstat" (waiting for changes in the C library Readstat) May 5, 2021

Ov-ille mentioned this issue Feb 25, 2022

Variable name not imported correctly #165

Open

ofajardo closed this as completed Feb 23, 2023

ofajardo reopened this Feb 23, 2023
