Scrape .docs with bill projects #6

Closed
gvilarino opened this issue Aug 13, 2013 · 10 comments
Comments

@gvilarino
Member

Here you can find two .docs: one with the originally presented bill project (like the ones you can scrape from CEDOM) and a Despacho, which is the final version that ACTUALLY got debated by congressmen in the chamber.

So, we need to be able to turn the latter into HTML (https://crocodoc.com/ seems like a fine tool for that) and scrape it into our platform.

@ghost ghost assigned ultraklon Aug 13, 2013
@gvilarino
Member Author

So, after a lot of research and trial and error, we've concluded that crocodoc isn't useful for us. What we'll do instead is the following:

  1. Use either the OpenOffice command-line tool or unoconv to convert Despachos from .doc/.docx into scrapable HTML files (I'd rather we used unoconv, since it supports both OpenOffice and LibreOffice formats and is a dedicated command-line tool).
  2. Scrape the resulting HTML Despacho with noodle into a DemocracyOS-compatible JSON structure.
  3. Persist the resulting JSON in mongo.
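
A rough sketch of what step 2 could look like. This is only an illustration: the plan above names noodle for the scraping, and the sketch swaps in Python's standard `html.parser` instead. The `title` and `paragraphs` fields are hypothetical placeholders, not the actual DemocracyOS JSON structure, and step 3 (mongo) is left out.

```python
import json
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace left over from the converter.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False


def despacho_to_json(html, title):
    """Return a JSON string; the field names are assumptions for this sketch."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    doc = {"title": title, "paragraphs": parser.paragraphs}
    return json.dumps(doc, ensure_ascii=False)
```

Empty paragraphs are dropped here; a real scraper would probably also need to tell articles apart from headings, which this sketch does not attempt.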

There you go, @ultraklon

@ultraklon
Contributor

I finally got this working with LibreOffice, using the following command line:
soffice --headless --convert-to htm:html --outdir ./ despacho.doc
Nothing special. I just tried it again, and it seems LibreOffice needs to be closed for this to work.
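
If it helps, the same command could be wrapped in a small script so the conversion fails loudly. This is a minimal sketch, not tested against every LibreOffice version: the function names are made up, it assumes `soffice` is on the PATH, and it assumes the output file takes the `.htm` extension given before the colon in `htm:html`.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir=".", soffice="soffice"):
    """Return the argument list for the headless conversion quoted above."""
    return [
        soffice,
        "--headless",
        "--convert-to", "htm:html",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_html(doc_path, out_dir="."):
    """Run the conversion and return the path of the expected .htm file."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    html_path = Path(out_dir) / (Path(doc_path).stem + ".htm")
    if not html_path.exists():
        # Covers the case noted above, where the conversion does not
        # happen while LibreOffice is still open.
        raise RuntimeError(f"conversion produced no file: {html_path}")
    return html_path
```

Checking that the file exists afterwards is deliberate, so a run that writes nothing is reported as an error instead of passing unnoticed.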

@gvilarino
Member Author

Will this work server side?

@ultraklon
Contributor

I don't know what server we're using. What server are we using?

@ultraklon
Contributor

I mean, we need somewhere to keep a copy of LibreOffice and a way to run it with parameters and disk access.

@gvilarino
Member Author

No, we can't guarantee disk access on a server running DemocracyOS.

For the time being, convert it to HTML and scrape it locally. Even if we have to upload the resulting JSON by hand, I'd rather have that than no process at all. We could then add a script that does it all from a single command.

I guess we could upload the resulting HTML to an accessible URL (GDrive, Dropbox, whatever) and have the scraper run server side. Still, following @jazzido's advice, it's better to have it run locally, halting on ALL errors and ensuring things get scraped the right way, than to rely too much on an automated solution that fails silently and messes up our data.
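
To make "halting on ALL errors" concrete, here is a hedged sketch of a local batch runner that stops at the first problem instead of logging it and moving on. Everything in it is illustrative: `ScrapeError`, `validate_scraped`, `scrape_all` and the `title`/`paragraphs` fields are hypothetical names, and `scrape_one` stands in for whatever scraper we end up with.

```python
class ScrapeError(Exception):
    """Raised when a document cannot be scraped or fails validation."""


def validate_scraped(doc):
    """Reject incomplete results; field names are assumptions for this sketch."""
    if not doc.get("title"):
        raise ScrapeError("missing title")
    paragraphs = doc.get("paragraphs")
    if not paragraphs:
        raise ScrapeError("no paragraphs scraped")
    for index, paragraph in enumerate(paragraphs):
        if not isinstance(paragraph, str) or not paragraph.strip():
            raise ScrapeError(f"paragraph {index} is empty")
    return doc


def scrape_all(paths, scrape_one):
    """Scrape every path in order and stop at the first failure."""
    results = []
    for path in paths:
        try:
            results.append(validate_scraped(scrape_one(path)))
        except Exception as exc:
            # Re-raise with the file name so the failure is easy to trace;
            # nothing after this path is processed.
            raise ScrapeError(f"{path}: {exc}") from exc
    return results
```

Nothing is returned unless every file passes, so a half-scraped batch never gets uploaded by mistake.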

@gvilarino
Member Author

Anyway :shipit:

@ultraklon
Contributor

Got it. I'm thinking about using the GDrive API to convert the docs.
I'll proceed with Noodle, without worrying (yet) about how we receive the HTML docs.

@gvilarino
Member Author

@ultraklon, @oscarguindzberg will be uploading newly obtained data files to @DemocracyOS's beta app; he'll be getting in touch with you for some assistance.

BTW: where's the code for converting .docs to .htmls through Google's GDrive API?

@gvilarino
Member Author

I'm closing this as it's now followed by #10
