Ingest: Update Stata 13 ("New Stata") ingest plugin; add support for v.14 and v.15; fix bugs in handling of v.13 format #2301

Closed
kcondon opened this Issue Jun 30, 2015 · 32 comments

Comments

@kcondon
Contributor

kcondon commented Jun 30, 2015

This issue is for updating the ingest plugin for the 2nd-gen. Stata format ("Stata 13" or "dta 117"), DTA117FileReader.java. See below for a primer on the Stata formats and ingest plugins, copy-and-pasted from the "spike" issue #4408. The tasks involved are 1) adding support for versions 14 and 15 (both are minor variants of this 2nd-gen. format); and 2) fixing bugs in the handling of the v.13 format proper that result in some dta 117 files not being ingested as tabular data. 1) can be achieved by consulting the format documentation on the Stata site and modifying the plugin accordingly. The updated plugin can then be tested on all the uningested dta 118 and dta 119 files currently in our prod. holdings. For 2) we can consult all the IngestReport entries saved since 4.0, find all the Stata 13/dta 117 ingest failures, and debug the plugin until they pass the ingest (or until we verify that the failing files are in some way corrupt and are indeed uningestable).

From #4408:

We currently maintain 2 Stata ingest plugins: one for Stata 13 (Stata's internal format "dta 117") and one for the older versions. Their v.13 format was re-engineered completely from scratch. It's very different from the older formats, so it warranted a new and separately maintained piece of ingest code.

Having reviewed the format documentation quickly, the good news is that the newer formats appear to be merely extensions of the Stata 13 format, not new developments. So we don't seem to need a new ingest plugin; rather, we should be able to simply teach the current "new Stata" ingest to understand the latest flavors of the format.

There have been 2 format variations since v.13:

Stata 14 ("dta 118")
Stata 15 ("dta 119")

A very large portion of the v.14 format specification document appears to be 1:1 identical to the v.13 spec. I'm seeing some minor differences (for example, in the later version the number of observations is encoded as an 8-byte integer; in v.13 it was 4 bytes). It'll take more careful work to identify all such differences, but it seems manageable.

The v.15 format is explicitly advertised as exactly the same as v.14, with a single exception: the later format allows for more than 32K variables.
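To make the observation-count difference concrete, here is a minimal sketch of how a reader might pick the field width from the format release. It is not the plugin's code: the class and method names are hypothetical, `java.nio.ByteBuffer` stands in for whatever reader DTA117FileReader.java actually uses, and treating the 4-byte count as unsigned is an assumption. The caller is expected to set the buffer's byte order to match the file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not taken from DTA117FileReader.java.
final class ObservationCountSketch {

    // Width in bytes of the observation-count field, as described above:
    // 4 bytes in dta 117, 8 bytes in dta 118 and dta 119.
    static int observationCountWidth(int release) {
        switch (release) {
            case 117:
                return 4;
            case 118:
            case 119:
                return 8;
            default:
                throw new IllegalArgumentException("Unsupported dta release: " + release);
        }
    }

    // Reads the observation count at the buffer's current position.
    // Assumption: the 4-byte field is unsigned, so it is widened to a long.
    // The 8-byte field is read as a signed long, which only matters for
    // counts above Long.MAX_VALUE.
    static long readObservationCount(ByteBuffer buf, int release) {
        if (observationCountWidth(release) == 4) {
            return Integer.toUnsignedLong(buf.getInt());
        }
        return buf.getLong();
    }

    public static void main(String[] args) {
        // 10000 observations, little-endian, in each layout.
        ByteBuffer v13 = ByteBuffer.wrap(new byte[] {0x10, 0x27, 0, 0})
                .order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer v14 = ByteBuffer.wrap(new byte[] {0x10, 0x27, 0, 0, 0, 0, 0, 0})
                .order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(readObservationCount(v13, 117));
        System.out.println(readObservationCount(v14, 118));
    }
}
```

Keeping the width decision in one small function means the rest of the header parsing can stay shared across releases, which matches the "extend the existing plugin" approach described here.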

The conclusion is: it appears to be possible to add support for v.14 and 15 by extending and improving the already existing code. We should definitely add support for both of these at the same time (since v.15 is a minor extension of v.14).

@scolapasta scolapasta modified the milestone: In Review Jul 2, 2015

@scolapasta scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016

@tlchristian

tlchristian commented May 4, 2016

Just wanted to check in to see the status on this issue. Our concern is that without Stata 14 support, Dataverse isn't generating that .tab derivative for preservation purposes. Once Stata 14 is supported, will the system be able to create the .tab subsettable file?

@djbrooke

Contributor

djbrooke commented Nov 21, 2017

@sbarbosadataverse - you said you wanted to discuss?

@donsizemore

Contributor

donsizemore commented Nov 21, 2017

At the risk of butting in... would you want to add ingest support for Stata 15 to this issue as well (if it isn't supported already)?

@djbrooke

Contributor

djbrooke commented Nov 21, 2017

Good idea. Maybe? @sbarbosadataverse and I will discuss early next week (not sure what she has in mind) and follow up here. Have a good Thanksgiving, @donsizemore!

@pdurbin

Member

pdurbin commented Jan 11, 2018

We discussed this issue in our backlog grooming meeting today and created #4408 as a spike for more investigation.

@tlchristian I'm not sure I understand your question. Support for newer versions of Stata would work the same as the versions Dataverse supports now, I assume.

@tlchristian

tlchristian commented Jan 11, 2018

@pdurbin I think I wanted to be sure that newer Stata versions would work the same way as older versions do in Dataverse. As of now, we've had issues with Dataverse generating the .tab derivative, and we want to be sure that still happens for Stata 14 and future versions.

@djbrooke djbrooke changed the title from Ingest: Add support for Stata 14 to Ingest: Add support for Stata 14 and Stata 15 Jan 16, 2018

@landreev

Contributor

landreev commented Jun 22, 2018

I pushed in my commit, removing and/or rewriting some old comments and many TODOs that are no longer relevant.
I left a couple of TODOs in place for the future. One of them I will double-check on now: the missing values for strings vs. empty strings.
Aside from that, there's one thing I have a question about; @oscardssmith could you please check on the following:
line 1173

category_values[i] = reader.readUInt();

The use of readUInt() here suggests we can't read a negative category value. Could you try to create a Stata 13+ file with a numeric vector containing some negative values, and then assign some value labels to them? Is that possible? And if so, are we going to read the values incorrectly on ingest?
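For anyone following along, the concern can be illustrated with a small sketch that is independent of the plugin. It assumes the labeled values are stored as 4-byte integers and uses `java.nio.ByteBuffer` in place of the plugin's own reader; `readUnsigned` and `readSigned` are hypothetical helpers, not the plugin's `readUInt()`. The same four bytes come out differently depending on whether they are interpreted as signed or unsigned:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not taken from DTA117FileReader.java.
final class CategoryValueSketch {

    // Interprets 4 bytes as an unsigned 32-bit integer, widened to a long.
    static long readUnsigned(byte[] raw, ByteOrder order) {
        return Integer.toUnsignedLong(ByteBuffer.wrap(raw).order(order).getInt());
    }

    // Interprets the same 4 bytes as a signed 32-bit integer.
    static int readSigned(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getInt();
    }

    public static void main(String[] args) {
        // Two's-complement encoding of -1; identical in either byte order.
        byte[] minusOne = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        System.out.println(readSigned(minusOne, ByteOrder.LITTLE_ENDIAN));   // -1
        System.out.println(readUnsigned(minusOne, ByteOrder.LITTLE_ENDIAN)); // 4294967295
    }
}
```

If a file does carry a label on a negative value, an unsigned read would turn it into a large positive number, and the label would no longer match the data values in the column. Whether that actually happens depends on what the test file shows.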

@landreev

Contributor

landreev commented Jun 26, 2018

@kcondon This is ready for QA.
I'll point you to a whole pile of Stata 14 files (that were uploaded, but not ingested in prod.); and a (smaller) pile of Stata 13 files, that the previous version of the plugin failed to ingest.

pdurbin added a commit that referenced this issue Jun 28, 2018

@oscardssmith oscardssmith referenced this issue Jul 5, 2018

Merged

4676 mixed labels in r #4814

0 of 2 tasks complete

@kcondon kcondon closed this in 4145d10 Jul 10, 2018

kcondon added a commit that referenced this issue Jul 10, 2018

Merge pull request #4708 from IQSS/2301-stata
support for Stata 14 and 15 #2301

@kcondon kcondon removed the Status: QA label Jul 10, 2018