Scrape .docs with bill projects #6
Comments
So, after a lot of research and trial and error, we've concluded that crocodoc isn't useful for us. So what we'll do is the following:
There you go, @ultraklon
I finally made this work with LibreOffice, using the following command line
Will this work server side?
I don't know what server we are using. What server are we using?
I mean, we need a place to keep a copy of LibreOffice and run it somehow, with parameters and disk access.
No, we can't guarantee disk access on a server running DemocracyOS. For the time being, convert it to HTML and scrape it locally. Even if we have to upload the resulting JSON by hand, I'd rather have that than no process at all. We could then add a script that does it all from a single command. I guess we could upload the resulting HTML to an accessible URL (GDrive, Dropbox, whatever) and have the scraper run server side. Still, following @jazzido's advice, it's better to have it run locally, halting on ALL errors and ensuring things get scraped the right way, than to rely too much on an automated solution that fails silently and messes up our data.
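A minimal sketch of the "run locally, halt on ALL errors" idea, assuming the converted HTML keeps each block of text in a `<p>` element and that the first paragraph is the bill's title. The names `ScrapeError`, `ParagraphCollector`, `scrape_bill`, `bill_to_json` and the JSON shape (`title`, `paragraphs`) are hypothetical, not something this project defines:

```python
import json
from html.parser import HTMLParser


class ScrapeError(Exception):
    """Raised on anything unexpected, so a bad file never yields partial JSON."""


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._buffer is not None:
                raise ScrapeError("nested or unclosed <p>")
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            if self._buffer is None:
                raise ScrapeError("</p> without a matching <p>")
            # Collapse runs of whitespace left over from the conversion.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def scrape_bill(html):
    """Return a dict for one bill, or raise ScrapeError instead of guessing."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if parser._buffer is not None:
        raise ScrapeError("document ended inside a <p>")
    if len(parser.paragraphs) < 2:
        raise ScrapeError("expected a title and at least one paragraph")
    title, *body = parser.paragraphs
    return {"title": title, "paragraphs": body}


def bill_to_json(html):
    """Serialize the scraped bill; accented characters are kept as-is."""
    return json.dumps(scrape_bill(html), ensure_ascii=False, indent=2)
```

Run locally, any `ScrapeError` stops the process before a JSON file is written, so whatever gets uploaded by hand comes from a file that parsed cleanly under these assumptions.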
Anyway
Got it, I'm thinking about using the GDrive API to convert the docs.
@ultraklon, @oscarguindzberg will be uploading newly obtained data files to @DemocracyOS's beta app; he'll be getting in touch with you for some assistance. BTW: where's the code for converting .doc files to HTML through Google's GDrive API?
I'm closing this, as it's now followed up in #10.
Here you can find two .docs: one with the bill project as originally presented (like the ones you can scrape from CEDOM) and a Despacho, which is the final version that ACTUALLY got debated by congressmen in the chamber.
So, we need to be able to turn the latter into HTML (https://crocodoc.com/ seems like a fine tool to do so) and scrape it back into our platform.