Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for multi-file Dominion format #569

Merged
merged 11 commits into from Jul 17, 2022

Conversation

catrope
Copy link
Contributor

@catrope catrope commented May 18, 2021

Starting with the March 2020 election, San Francisco has started
publishing their Dominion CVR data as many small files named
CvrExport_1.json, CvrExport_2.json, etc. rather than one big
CvrExport.json file.

To support this new format, first look for CvrExport.json, but if that
doesn't exist, look for CvrExport_1.json. If that does exist, read all
CvrExport_%d.json files in numerical sequence, until we hit one that
doesn't exist.

HEdingfield and others added 3 commits December 12, 2020 20:27
Starting with the March 2020 election, San Francisco has started
publishing their Dominion CVR data as many small files named
CvrExport_1.json, CvrExport_2.json, etc. rather than one big
CvrExport.json file.

To support this new format, first look for CvrExport.json, but if that
doesn't exist, look for CvrExport_1.json. If that does exist, read all
CvrExport_%d.json files in numerical sequence, until we hit one that
doesn't exist.
Instead of logging for every file, log for every 50k records parsed.
@tarheel tarheel requested review from tarheel and moldover May 18, 2021 21:52
The Dominion JSON format has a Sessions key at the top level that
contains an array of objects. Each of these objects has a .Cards
property that is typically an array of one object that corresponds to a
CVR. But sometimes (~1-2% of the time) it's an array of multiple
objects, each of which is a CVR. Our Dominion CVR parser was ignoring
these, because it only ever looked at the first element of the array.

Instead, iterate over this array, so that we also capture the CVRs
stored in the second or third elements of these Cards arrays. Without
this change, some votes are missing when parsing San Francisco's
November 2020 data.
Copy link
Contributor

@moldover moldover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @catrope. Thanks for the PR! I've been traveling so I apologize I'm late responding. Looks pretty good and it will be great to support the new SF format. It would be really good to have a regression test which parses the new format. Is it possible to add one? How large would the data be to use actual published SF cvr data?

We use google checkstyle plugin for IntelliJ to ensure consistent code "style". It's good practice to install that and run it before commiting anything. Now that I'm writing this I think we should add some documentation for contributors around this stuff.

@HEdingfield HEdingfield changed the base branch from master to develop August 11, 2021 07:32
@tarheel
Copy link
Contributor

tarheel commented Feb 23, 2022

Hi @catrope, we haven't forgotten about this PR! Are you interested in finishing it, or do you want us to update it? We're actually hoping to make use of this functionality so that we can use our application to verify some SF results.

@catrope
Copy link
Contributor Author

catrope commented Feb 23, 2022

Hi! So sorry for forgetting about this last year. Yes, I am interested in finishing this, probably within the next week or so.

@catrope
Copy link
Contributor Author

catrope commented Mar 10, 2022

Looks pretty good and it will be great to support the new SF format. It would be really good to have a regression test which parses the new format. Is it possible to add one? How large would the data be to use actual published SF cvr data?

The SF data is huge. The Nov 2020 election data was split over 24,940 JSON files, adding up to 4.2 GB. The Feb 2022 special election was "only" 1902 files totaling 200 MB, and that one didn't have a meaningful RCV contest on the ballot.

Instead, I'll add tests with a manually created example of a multi-file election with only 3 files, including one where there are multiple Cards in one Session.

This uses a selection of 4 files (out of 24,940) from the November 2020
election in San Francisco, some with one Card per Session, some with
multiple Cards per Session.

These files come from https://www.sfelections.org/results/20201103/data/20201201/CVR_Export_20201201091840.zip
and were renamed as follows:

- CvrExport_9231.json -> CvrExport_0.json
- CvrExport_9232.json -> CvrExport_1.json
- CvrExport_9233.json -> CvrExport_2.json
- CvrExport_9234.json -> CvrExport_3.json
- All other CvrExport_*.json files were removed
@catrope
Copy link
Contributor Author

catrope commented Mar 10, 2022

I've added a test case based on 4 files from the SF Nov 2020 election, 2 of which have the multi-Card format. Using only four files limited the damage to 3.6 MB. All pre-existing test data combined was 60 MB (now 64 MB with this addition).

@catrope catrope requested a review from moldover March 10, 2022 23:15
Copy link
Contributor

@HEdingfield HEdingfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya @catrope, thanks for working on this, and for adding the tests! Unfortunately, it looks like you were working off a fairly old version. Could you pull the latest changes from develop into your branch and sort out the merging? The biggest change that will affect you is the fact that we've moved all our test data into a separate repo. Check out PR #576, in particular here.

@HEdingfield
Copy link
Contributor

@catrope would you also be able to add an automated test that actually uses the data you provided in the process?

@moldover
Copy link
Contributor

moldover commented Mar 12, 2022

@catrope - Code changes look solid and the new test is great. Thank you. As @HEdingfield mentions, we work off of develop branch now, and the latest develop code moves all test_data into a submodule -- in part so that we can accommodate larger data sets. But it also means more commits to get things where they need to be. But I can totally help with that if you'd like, or even if you just want to hand off what you've done that's cool too. Apologies for the confusion. We need to get better about communicating and onboarding. I think you are the first outside contributor!

I believe something like this should happen:

  1. delete new test data from dominion-multi-file (make a copy first)
  2. add test data to the test_data repo: https://github.com/artoonie/rcvtests via another PR which we can quickly approve
  3. merge latest develop -> dominion-multi-file
  4. pin test_data to your new
  5. add lines into TabulatorTests.java to execute your test there
  6. we approve and merge dominion-multi-file -> develop

@moldover
Copy link
Contributor

moldover commented Jul 11, 2022

@HEdingfield I went ahead and merged the latest develop into this PR branch, including the new test files. Should be good to go but just wanted to get your eyeballs on this too :)

@moldover moldover self-assigned this Jul 11, 2022
Copy link
Contributor

@HEdingfield HEdingfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@moldover moldover merged commit 8abc64f into BrightSpots:develop Jul 17, 2022
@moldover
Copy link
Contributor

moldover commented Jul 17, 2022

@catrope we finally pushed your work here. Thanks so much for your contribution!!

We are also putting together a contributors guide which will hopefully help make things cleaner in the future: https://github.com/BrightSpots/rcv/blob/develop/contributor_guide.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants