Ingest: Update Stata 13 ("New Stata") ingest plugin; add support for v.14 and v.15; fix bugs in handling of v.13 format #2301
This issue is for updating the ingest plugin for the 2nd-gen. Stata format ("Stata 13" or "dta 117"), DTA117FileReader.java. See below for a primer on the Stata formats and ingest plugins, copied and pasted from the "spike" issue #4408. The tasks involved are 1) adding support for versions 14 and 15 (both are minor variants of this 2nd-gen. format); and 2) fixing bugs in the handling of the v.13 format proper that cause some dta 117 files to fail ingest as tabular data. 1) can be achieved by consulting the format documentation on the Stata site and modifying the plugin accordingly. The updated plugin can then be tested on all the uningested dta 118 and dta 119 files currently in our prod. holdings. For 2) we can consult all the IngestReport entries saved since 4.0, find all the Stata 13/dta 117 ingest failures, and debug the plugin until they pass the ingest (or until we verify that the failing files are in some way corrupt and are indeed uningestable).
We currently maintain 2 Stata ingest plugins: one for Stata 13 (Stata's internal format "dta 117") and one for the older versions. Their v.13 format was re-engineered completely from scratch. It's very different from the older formats, so it warranted a new and separately maintained piece of ingest code.
Having reviewed the format documentation quickly, the good news is that the newer formats appear to be merely extensions of Stata 13, not new developments. So we don't seem to need a new ingest plugin; rather, we should be able to simply teach the current "new Stata" ingest to understand the latest flavors of the format.
There have been 2 format variations since v.13:
Stata 14 ("dta 118")
A very large portion of the v.14 format specification document appears to be identical to the v.13 spec. I'm seeing some minor differences (for example, in the later version the number of observations is encoded as an 8-byte integer; in v.13 it was 4 bytes). It'll take more careful work to identify all such differences, but it seems manageable.
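To illustrate the observation-count difference noted above, here is a minimal sketch of reading that header field in a release-aware way. The class and method names are hypothetical and are not taken from DTA117FileReader.java; the sketch assumes the field is 4 bytes in release 117 and 8 bytes in later releases, and that the byte order has already been determined from the file header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of DTA117FileReader.java.
final class ObservationCountSketch {

    // Decodes the observation count from the raw bytes of the header field.
    // Assumption: the field is 4 bytes wide in release 117 and 8 bytes wide
    // in releases 118 and 119.
    static long readObservationCount(byte[] field, int release, ByteOrder order) {
        int expectedLength = (release == 117) ? 4 : 8;
        if (field.length != expectedLength) {
            throw new IllegalArgumentException(
                "release " + release + " expects " + expectedLength
                + " bytes, got " + field.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(field).order(order);
        if (release == 117) {
            // Widen to long and mask so the 4-byte value is treated as unsigned.
            return buffer.getInt() & 0xFFFFFFFFL;
        }
        long count = buffer.getLong();
        if (count < 0) {
            // A count this large cannot be represented in a signed Java long.
            throw new IllegalArgumentException("observation count out of range");
        }
        return count;
    }
}
```

Returning a `long` in both cases would let the rest of the reader stay agnostic about which release it is handling.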
The v.15 format ("dta 119") is explicitly advertised as exactly the same as v.14, with a single exception: the later format allows for more than 32K variables.
The conclusion is: it appears to be possible to add support for v.14 and 15 by extending and improving the already existing code. We should definitely add support for both of these at the same time (since v.15 is a minor extension of v.14).
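As a rough sketch of what "extending the existing code" could look like, the reader might first extract the release number from the start of the file and then branch on it wherever field widths differ. Everything below is hypothetical: the class and method names are invented for illustration, the opening tag sequence is assumed to be `<stata_dta><header><release>NNN</release>`, and the variable-count widths (2 bytes for 117 and 118, 4 bytes for 119) are assumptions that should be checked against the Stata format documentation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of DTA117FileReader.java.
final class ReleaseTagSketch {

    // Assumed opening tag sequence of a 2nd-gen. Stata file.
    private static final String PREFIX = "<stata_dta><header><release>";
    private static final String CLOSE = "</release>";

    // Extracts the release number (117, 118 or 119) from the first bytes of the file.
    static int parseRelease(byte[] headerStart) {
        String text = new String(headerStart, StandardCharsets.US_ASCII);
        if (!text.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a 2nd-gen. Stata file");
        }
        int end = text.indexOf(CLOSE, PREFIX.length());
        if (end < 0) {
            throw new IllegalArgumentException("release tag is not closed");
        }
        int release = Integer.parseInt(text.substring(PREFIX.length(), end));
        if (release < 117 || release > 119) {
            throw new IllegalArgumentException("unsupported release: " + release);
        }
        return release;
    }

    // Assumption: the variable count is a 2-byte field in releases 117 and 118
    // and a 4-byte field in release 119 (the "more than 32K variables" case).
    static int variableCountWidth(int release) {
        return release == 119 ? 4 : 2;
    }
}
```

Keeping the release-specific widths in small helpers like this would let one plugin serve all three releases without duplicating the parsing logic.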
Just wanted to check in to see the status on this issue. Our concern is that without Stata 14 support, Dataverse isn't generating that .tab derivative for preservation purposes. Once Stata 14 is supported, will the system be able to create the .tab subsettable file?
referenced this issue
Jan 10, 2018
@pdurbin I wanted to be sure that newer Stata versions would work the same way as older versions in Dataverse. As of now, we've had issues with Dataverse generating the .tab derivative, and we want to be sure that it is still generated for Stata 14 and future versions.
changed the title from "Ingest: Add support for Stata 14" to "Ingest: Add support for Stata 14 and Stata 15" Jan 16, 2018
Jun 19, 2018
I pushed my commit, which removes or rewrites some old comments and many TODOs that are no longer relevant.
The use of readUint() here suggests we can't read a negative category value. Could you try to create a Stata 13+ file with a numeric vector that has some negative values, and then assign value labels to them? Is that possible? And if so, are we going to read the values incorrectly on ingest?
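To make the concern concrete, here is a small sketch of how the same 4 bytes decode under a signed and an unsigned read. The class and method names are hypothetical and do not reproduce the plugin's readUint(); whether the category values are actually stored as signed 32-bit integers is an assumption to be confirmed against the format documentation and a test file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not the plugin's actual reader methods.
final class SignedVsUnsignedSketch {

    // Interprets 4 bytes as a signed 32-bit integer (two's complement).
    static int readSignedInt(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getInt();
    }

    // Interprets the same 4 bytes as an unsigned 32-bit integer, widened to long.
    static long readUnsignedInt(byte[] raw, ByteOrder order) {
        return Integer.toUnsignedLong(ByteBuffer.wrap(raw).order(order).getInt());
    }
}
```

For the little-endian bytes `FE FF FF FF`, the signed read yields -2 while the unsigned read yields 4294967294, so a labeled negative category read as unsigned would no longer match the value in the data column.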