Scrape .docs with bill projects #6

gvilarino · 2013-08-13T22:01:49Z

Here you can find two .docs: one with the bill project as originally presented (like the ones you can scrape from CEDOM), and a Despacho, which is the final version that ACTUALLY got debated by congressmen in the chamber.

So, we need to be able to turn the latter into HTML (https://crocodoc.com/ seems a fine tool to do so) and scrape it back into our platform.

gvilarino · 2013-09-04T18:12:31Z

So, after a lot of research, trial and error, we've concluded that crocodoc isn't useful for us. What we'll do instead is the following:

1. Use either the OpenOffice command line tool or unoconv to convert Despachos from .doc/.docx to scrapable HTML files (I'd rather we used unoconv, since it supports both OpenOffice and LibreOffice formats and it's a dedicated command-line tool).
2. Scrape the resulting HTML Despacho with noodle into a DemocracyOS-compatible JSON structure.
3. Persist the resulting JSON in mongo.
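
For illustration only, here is a rough sketch of what the scraping step could look like. It is not noodle (which is a Node.js library); it swaps in Python's standard `html.parser` just to show the idea. The record keys (`source`, `title`, `paragraphs`) and the function name `despacho_to_record` are hypothetical, since the actual DemocracyOS JSON structure isn't described in this thread, and it assumes the converted Despacho keeps its text in `<p>` elements.

```python
import json
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty paragraphs.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate a previous <p> that was never closed
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()


def despacho_to_record(html, source_name):
    """Build a JSON-serializable dict from converted Despacho HTML.

    The keys below are hypothetical placeholders, not the real
    DemocracyOS schema.
    """
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing, unclosed <p>
    if not parser.paragraphs:
        raise ValueError(f"no paragraphs found in {source_name}")
    return {
        "source": source_name,
        "title": parser.paragraphs[0],
        "paragraphs": parser.paragraphs[1:],
    }
```

`json.dumps(despacho_to_record(html, "despacho.htm"))` would then give the text to persist; the mongo step is left out here because it depends on how the platform stores bills.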

There you go, @ultraklon

ultraklon · 2013-09-04T23:16:51Z

I finally made this work with LibreOffice, using the following command line:
soffice --headless --convert-to htm:html --outdir ./ despacho.doc
Nothing special. I just tried it again, and it seems LibreOffice needs to be closed for this to work.
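
In case it helps for scripting this later, here is a minimal sketch that wraps the same command in Python. The function names are made up for illustration, and it assumes `soffice` is on the PATH and that the output file is named after the input with an `.htm` extension. It also checks that the file actually exists afterwards, on the assumption that a zero exit code alone may not prove anything was written (for example when another LibreOffice instance is open).

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, outdir="."):
    """Return the argv list for the headless conversion quoted above."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "htm:html",
        "--outdir", str(outdir),
        str(doc_path),
    ]


def convert_to_html(doc_path, outdir="."):
    """Run the conversion and return the path of the expected output file.

    Raises subprocess.CalledProcessError on a non-zero exit code, and
    FileNotFoundError if no output file shows up afterwards.
    """
    subprocess.run(build_convert_command(doc_path, outdir), check=True)
    # Assumption: the output keeps the input's base name with ".htm".
    output = Path(outdir) / (Path(doc_path).stem + ".htm")
    if not output.is_file():
        raise FileNotFoundError(f"expected {output} after conversion")
    return output
```

Calling `convert_to_html("despacho.doc", "./")` should be equivalent to running the command by hand, just with an explicit failure if nothing was produced.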

gvilarino · 2013-09-04T23:27:59Z

Will this work server side?

ultraklon · 2013-09-04T23:31:52Z

I don't know what server we are using. What server are we using?

ultraklon · 2013-09-04T23:38:58Z

I mean, we need a place where we can keep a copy of LibreOffice and execute it somehow, with parameters and disk access.

gvilarino · 2013-09-05T18:06:24Z

No, we can't guarantee disk access on a server running DemocracyOS.

For the time being, convert it to HTML and scrape it locally. Even if we have to upload the resulting JSON by hand, I'd rather have that than no process at all. We could then add a script that does it all from a single command.

I guess we could upload the resulting HTML to an accessible URL (GDrive, Dropbox, whatever) and have the scraper run server side. Still, following @jazzido's advice, it's better to have it run locally, halting on ALL errors and ensuring things get scraped the right way, than to rely too much on an automated solution that fails silently and messes up our data.
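
To make the "halt on ALL errors" idea concrete, a minimal sketch of a local batch run could look like the following. Everything here is an assumption for illustration: `scrape_all` is a hypothetical name, `scrape` stands for whatever scraping function we end up with, and the `source` key is a placeholder. The point is only that the first failure aborts the run and no JSON is produced unless every document scraped cleanly.

```python
import json


def scrape_all(html_by_name, scrape):
    """Scrape every document, halting on the first error.

    `html_by_name` maps a file name to its HTML text; `scrape` is any
    callable that takes HTML and returns a dict, or raises on failure.
    Returns the JSON text for all records, or raises RuntimeError.
    """
    records = []
    for name, html in html_by_name.items():
        try:
            record = scrape(html)
        except Exception as exc:
            # Stop immediately and say which file broke the run.
            raise RuntimeError(f"scraping {name} failed: {exc}") from exc
        if not record:
            # An empty result is treated as an error, not silently skipped.
            raise RuntimeError(f"scraping {name} produced no data")
        records.append({"source": name, **record})
    return json.dumps(records, ensure_ascii=False, indent=2)
```

The returned JSON could then be uploaded by hand, as suggested above, knowing that every file in the batch went through.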

gvilarino · 2013-09-05T18:06:33Z

Anyway

ultraklon · 2013-09-05T19:45:33Z

Got it. I'm thinking about using the GDrive API to convert the docs.
I'll proceed with noodle, without worrying about how we receive the HTML docs (yet).

gvilarino · 2013-09-23T15:56:49Z

@ultraklon, @oscarguindzberg will be uploading newly obtained data files to @DemocracyOS's beta app; he'll be getting in touch with you for some assistance.

BTW: where's the code for converting .docs to .htmls through Google's GDrive API?

gvilarino · 2013-09-24T22:20:46Z

I'm closing this as it's now followed by #10

ghost assigned ultraklon Aug 13, 2013

gvilarino closed this as completed Sep 24, 2013
