20240609 data parser #2
If anyone is interested, you can download this gist: https://gist.github.com/Jeffrey04/b1737d5ee02a1ced0ac9c34b1a8fe827 (run it in the same virtualenv as the project).
@sweemeng once mentioned that unstructured.io requires a GPU to run; I will try it on another computer one day to test that.
Notes from @samqi on the GitHub sayit repo: the file import_akomontoso.py uses debate: Thus contextually for
src/legisdata/main.py
parse_time: str = str(datetime.now())


class Inquiry(NamedTuple):
Just keeping as a note, this is the structure of the resulting JSON file for an inquiry
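As a rough illustration of how a NamedTuple like this could end up as a JSON file, here is a minimal sketch. The class name `InquirySketch` and the fields `title` and `inquirer` are hypothetical placeholders, not the fields actually defined in src/legisdata/main.py; only `parse_time` is taken from the diff above.

```python
import json
from datetime import datetime
from typing import NamedTuple


class InquirySketch(NamedTuple):
    # Hypothetical fields for illustration only; not the project's actual schema.
    title: str
    inquirer: str
    # Taken from the diff; note the default is evaluated once, when the class is defined.
    parse_time: str = str(datetime.now())


def to_json(record: InquirySketch) -> str:
    # _asdict() turns a flat NamedTuple into a dict keyed by field name.
    return json.dumps(record._asdict(), indent=2)


example = InquirySketch(title="Example inquiry", inquirer="Example member")
print(to_json(example))
```

This only works as-is for flat records; nested NamedTuples would be serialized as JSON arrays unless they are converted recursively.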
You can test on a GPU for free on Google Colab. Use the T4 GPU (https://research.google.com/colaboratory/faq.html).
oh ya, forgot about it, but the notebook is mostly used to test out ideas and/or explore data
In order to extract data out of the downloaded data (refer to #1), run
This will extract the data with unstructured.io, then pickle and save it. In order to parse the data into a structured format,
The schema of the structured format can be found at src/legisdata/schema.py. The data in akomaNtoso format can then be found in each JSON file we generated above (./data/2020/session-2/{hansard,inquiry}-parse/*.json), by issuing
If you have batcat installed, you can pipe the output to it to get syntax highlighting.
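For anyone who prefers Python over the shell, a minimal sketch of pulling the akomaNtoso data out of the generated JSON files might look like the following. It assumes each file is a JSON object with a top-level `akomaNtoso` key holding a string; that layout is a guess and is not confirmed here, so check src/legisdata/schema.py first. The helper names `read_akoma_ntoso` and `iter_akoma_ntoso` are made up for this example.

```python
import json
from pathlib import Path
from typing import Iterator, Optional, Tuple


def read_akoma_ntoso(json_path: Path, key: str = "akomaNtoso") -> Optional[str]:
    """Return the value stored under `key` in one parsed JSON file, if present."""
    with json_path.open(encoding="utf-8") as handle:
        document = json.load(handle)
    # ASSUMPTION: the file is a JSON object with the data under a top-level key.
    if isinstance(document, dict):
        return document.get(key)
    return None


def iter_akoma_ntoso(directory: Path) -> Iterator[Tuple[Path, Optional[str]]]:
    """Yield (path, akomaNtoso value) for every *.json file in a directory."""
    for json_path in sorted(directory.glob("*.json")):
        yield json_path, read_akoma_ntoso(json_path)
```

Usage would be along the lines of `for path, xml in iter_akoma_ntoso(Path("./data/2020/session-2/hansard-parse")): ...`, with files lacking the key yielding `None`.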
As usual, data is posted to https://huggingface.co/datasets/sinarproject/legisdata/tree/main