Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reading Multiple Response from sav (WIP, DO NOT MERGE) #259

Open
wants to merge 20 commits into
base: dev
Choose a base branch
from

Conversation

slobodan-ilic
Copy link

@slobodan-ilic slobodan-ilic commented Apr 24, 2024

This PR implements reading multiple response metadata from sav format. It's intended to be based on a change to the readstat library. The code is included here for convenience, but should eventually be rebased, once the readstat lib is updated.

I'll keep posting here regularly, as updates happen on the readstat side.

Here's also an example on how to use the new functionality:

[ins] In [10]: import pyreadstat
[ins] In [11]: _, metadata = pyreadstat.read_sav('./simple_alltypes.sav')
[ins] In [12]: metadata.mr_sets

which gives the following output:

{
  'categorical_array': {
    'type': 'C',
    'is_dichotomy': False,
    'counted_value': None,
    'label': '',
    'variable_list': ['ca_subvar_1', 'ca_subvar_2', 'ca_subvar_3']
 },
 'mymrset': {
    'type': 'D',
    'is_dichotomy': True,
    'counted_value': 1,
    'label': 'My multiple response set',
    'variable_list': ['bool1', 'bool2', 'bool3']
  }
}

you can test it with the following (very basic) file:
simple_alltypes.sav.zip
but you have to unzip it (GitHub won't allow direct upload of the *.sav extension).

@joaquimadraz
Copy link

Hey @slobodan-ilic, tried to install your version with poetry and got these two errors:

  ./src/spss/readstat_sav_read.c:1878:40: error: call to undeclared library function 'toupper' with type 'int (int)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                      sv_name_upper[c] = toupper((unsigned char) mr.subvariables[j][c]);

  ./src/spss/readstat_sav_read.c:176:51: error: call to undeclared library function 'isdigit' with type 'int (int)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
              for (int i = 0; i < internal_count && isdigit(*next_part); i++) {

The exact command: poetry add git+ssh://git@github.com:slobodan-ilic/pyreadstat.git#ISS-25-support-mr-in-sav

@slobodan-ilic
Copy link
Author

@joaquimadraz fixed this, and also some other issues... Will report here when I update the readstat as well...

@ofajardo
Copy link
Collaborator

ofajardo commented May 21, 2024

hi there, thanks a lot for this PR, highly appreciated!
I am not familiar with this SPSS feature. It would be super nice if at some point (before final merge) you could provide a SPSS file with a toy example of this Multiple response, and even better a test to be added to the unit test file.

PS: I see that in the corresponding issue there is a sample file, so I guess we can use that one.

@slobodan-ilic
Copy link
Author

Hi @ofajardo , thanks for your comment! This work is likely to go through at least 1 or 2 more changes, where I'd need to make it more in-line with the rest of the design of readstat library. I will add the test file here, but you can use the one from the issue in the meantime, to test. You can also do some comparison with pyspssio library, they should provide very similar results with regards to the MR data.

@slobodan-ilic
Copy link
Author

Hi @ofajardo! 👋

We've done some work on this, fixed bugs and tested on bunch of real-life survey data. We also have our pr on readstat ready. Waiting for Evan to provide another round of 👀 on that.

Would you be able to provide some more info about the process your releases take? Do you always need to be in sync with readstat? How are the files copied over to pyreadstat, is it manually or a submodule or something else? Any info is more than welcome, so we can plan our roadmap accordingly (whether we fork or wait basically).

All the best!

@joaquimadraz fyi ☝️

@ofajardo
Copy link
Collaborator

hi, thanks a lot for working on this!

What I would like to do is the following:

  • Merging your PR into a new branch on the repo.
  • I would take a look at your code and test what you did (sorry, have not had time for that yet).
  • Then I would like to get it through the CD/CI pipeline to make sure that the code runs smoothly on all supported platforms, wheels are built correctly etc.
  • Once Evan merges into Readstat (meaning the code is final), I would be happy to merge the branch into the main pyreadstat branch and make a release, no need to wait for him to release on Readstat first. Later I would do another release once he releases, if he has added more features.
  • These are the two basic things that have to happen. I would be fine with whatever order they happen, i.e. if you prefer to wait until Evan finishes reviewing and merges, and then we start here, or if you prefer maybe certain things can already be initiated, like testing the code on the CI/CD.
  • I copy Readstat files manually to the src folder in pyreadstat, this is to make possible that people can do a pip install from the github repo or from the source package in pypi, a submodule approach would make that difficult (or not possible).

Hope that makes it clearer. Let me know if you have further questions!

@slobodan-ilic
Copy link
Author

@ofajardo Hi, thanks for the quick response. I'd definitely like to go forward with CI path first, as we might discover new issues more quickly this way. I guess the communication on the readstat is going to happen with higher latency than here.

Thanks for the explanation about copying files, that's what I had supposed.

Btw, do you know how to run the fuzzy tests locally? I had tried doing that (on readstat lib) but didn't have success, even after following instructions from Evan's repo. I see that the CI there relies on VS17, which seems old (and can't run it since I'm on mac).

Any advice is really welcome! Thanks!

@slobodan-ilic
Copy link
Author

@ofajardo Hi, thanks for the quick response. I'd definitely like to go forward with CI path first, as we might discover new issues more quickly this way. I guess the communication on the readstat is going to happen with higher latency than here.

Thanks for the explanation about copying files, that's what I had supposed.

Btw, do you know how to run the fuzzy tests locally? I had tried doing that (on readstat lib) but didn't have success, even after following instructions from Evan's repo. I see that the CI there relies on VS17, which seems old (and can't run it since I'm on mac).

Any advice is really welcome! Thanks!

Ok, through combination of perseverance and GPT, I was able to build and run fuzzers locally, even though the instructions on readstat don't seem to be applicable to (at least my own) setup. Found another bug and fixed it (since it was immediately clear what it was). Now it's in better shape then it was :)

@slobodan-ilic
Copy link
Author

Short update: After fixing all CI issues in readstat, Evan did a thorough review. Other than a couple minor malloc changes, it seemed as if it's gonna be finished quick, until I saw the proposition to completely rewrite the parsing part in ragel. I've been on it ever since, should be done soon-ish.

@ofajardo
Copy link
Collaborator

ofajardo commented Jul 2, 2024

hi

would you mind adding a few short lines in this file describing the new field in the metadata object? also subfields if applicable.

Regarding trying the pyreadstat CI/CD ... is this something you want to do on your side, or you prefer me doing it? If me, I would need to merge this PR to a dedicated new branch to launch the CI/CD pointing there. We can wait a bit also until everything is ready on your side.

@ofajardo ofajardo changed the base branch from master to dev July 18, 2024 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants