OGE.gov Scraper #8

Open
gregoryfoster opened this issue Apr 3, 2017 · 6 comments

Comments

@gregoryfoster
Collaborator

gregoryfoster commented Apr 3, 2017

The US Office of Government Ethics posts certain Public Financial Disclosure Reports on their website directly. In particular, there is a table of reports for "executive branch officials occupying positions for which the pay is set at Levels 1 and 2 of the Executive Schedule" sorted in reverse chronological order here:
https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Filings%20by%20Date?OpenView

There is also a screen sorted by name which includes Ethics Agreements:
https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Index?OpenView

The President and Vice President's reports are here:
https://extapps2.oge.gov/201/Presiden.nsf/President%20and%20Vice%20President%20Index

It seems reasonable to build a web scraper to search this site daily for new PFDs. That should probably be a separate GitHub project, but I wanted to document the idea here.
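A minimal sketch of what the daily check could look like, using only the Python standard library. It assumes the index page exposes each report as an ordinary `<a href>` link whose URL contains `.pdf`; that has not been verified against the actual OGE markup, and the names `LinkCollector`, `extract_report_links`, and `find_new` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Index URL taken from the issue text above.
INDEX_URL = "https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Filings%20by%20Date?OpenView"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_report_links(html, base_url=INDEX_URL):
    """Return absolute URLs for links that look like PDF reports.

    Assumption: a report link is any href containing '.pdf' (case-insensitive).
    """
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs if ".pdf" in h.lower()]


def find_new(links, seen):
    """Return links not already in `seen`, in page order, without duplicates."""
    new = []
    for link in links:
        if link not in seen and link not in new:
            new.append(link)
    return new
```

A daily job would fetch the index HTML (for example with `urllib.request.urlopen`), pass it to `extract_report_links`, and compare the result against the set of URLs recorded on previous runs.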

@bnsmith3

Do you want help with this?

@gregoryfoster
Collaborator Author

Hi @bnsmith3! I'd love some help on this one! Feel free to approach this however you'd like, though I do recommend standing up an independent repository solely focused on a scraper.

As this would be a civic data service, I was considering implementation on morph.io:
https://morph.io/

@bnsmith3

OK. Is it okay with you if the first version simply stores the PDFs, and version two then adds actual parsing of the PDFs?
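For the "just store the PDFs" version, one possible sketch is to name each file by the SHA-256 of its contents, so a report downloaded twice is only stored once. The function name `store_pdf` and the content-hash naming scheme are assumptions for illustration, not something agreed in this thread.

```python
import hashlib
from pathlib import Path


def store_pdf(content, directory):
    """Write PDF bytes to <directory>/<sha256>.pdf.

    Returns (path, True) if the file was newly written, or
    (path, False) if identical content was already stored.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()
    path = directory / f"{digest}.pdf"
    if path.exists():
        return path, False
    path.write_bytes(content)
    return path, True
```

Hash-based names lose the original file name, so a real scraper would probably also want to record the source URL and filer alongside each stored file.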

@Nix-Bohon

If you want to just do the downloading part, I can take a swing at extracting the text from the PDFs.

@gregoryfoster
Collaborator Author

Hi @zacherybohon!

re: parsing. Thus far, we've been using CPI's pfd-parser project to parse the (somewhat) structured PDF files. It's a Node.js script, and it's pretty ugly, but it seems to work well enough.

That said, the outputs of that script assume you're parsing a set of Public Financial Disclosure Reports and want the data partitioned by table and indexed by filer. That may or may not make sense for your use case. There's also a use case for parsing a single file into, say, JSON. Long story short, I wouldn't be opposed to seeing another Public Financial Disclosure Report parser crop up. I would again suggest that any new parser do one thing well and therefore live in a standalone repository.

@bnsmith3

Sounds good. Based on that, it made sense to me to make the parsing a separate issue, so I made one: #11.
