
20240609 data parser #2

Merged
merged 9 commits into from
Jun 25, 2024

Conversation

Jeffrey04
Collaborator

@Jeffrey04 Jeffrey04 commented Jun 9, 2024

To extract content from the downloaded data (see #1), run

$ legisdata extract 2020 2

This extracts the data with unstructured.io, then pickles and saves the result. To parse the data into a structured format, run

$ legisdata parse 2020 2

The schema of the structured format is in src/legisdata/schema.py. Each JSON file generated above (./data/2020/session-2/{hansard,inquiry}-parse/*.json) also carries the data in Akoma Ntoso format, which you can print with

$ jq -r .akn $filename

If you have batcat installed, you can pipe the output to it for syntax highlighting

$ jq -r .akn $filename | batcat
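
If you would rather do this from Python than jq, the same lookup can be sketched roughly as below. The only thing taken from the project is the top-level `akn` key used in the jq command above; the helper name `read_akn` is made up for illustration, and the value is assumed to be a string of XML since `jq -r` prints it raw.

```python
import json
from pathlib import Path


def read_akn(path):
    """Return whatever is stored under the top-level "akn" key of a parsed JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        document = json.load(handle)
    # Same lookup as `jq -r .akn`; raises KeyError if the key is missing.
    return document["akn"]


# Example (hypothetical file name):
# print(read_akn("./data/2020/session-2/hansard-parse/some-file.json"))
```

A missing key raises `KeyError` here, whereas jq would print `null`, so a bad file is easier to spot.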

As usual, data is posted to https://huggingface.co/datasets/sinarproject/legisdata/tree/main

@Jeffrey04
Collaborator Author

Jeffrey04 commented Jun 9, 2024

If anyone is interested, you can download a .pickle file from https://huggingface.co/datasets/sinarproject/legisdata and start experimenting with it. These notebooks give an idea of what the data looks like:

https://gist.github.com/Jeffrey04/b1737d5ee02a1ced0ac9c34b1a8fe827

(run in the same virtualenv as the project)
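
For a quick look without opening the notebooks, something like the sketch below should work. It assumes nothing about what is inside the pickle beyond it being a Python object; `load_pickle` and `summarise` are hypothetical helper names, not part of legisdata. Unpickling needs the classes of the stored objects to be importable, which is presumably why the project's virtualenv is required, and pickle files should only be loaded from sources you trust.

```python
import pickle
from collections import Counter
from pathlib import Path


def load_pickle(path):
    """Unpickle one downloaded file. Only do this with files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def summarise(obj):
    """Count type names of the items if obj is a list or tuple, else of obj itself."""
    items = obj if isinstance(obj, (list, tuple)) else [obj]
    return Counter(type(item).__name__ for item in items)


# Example (hypothetical file name):
# print(summarise(load_pickle("downloaded-file.pickle")))
```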

@Jeffrey04
Collaborator Author

@sweemeng once mentioned that unstructured.io requires a GPU to run; I will try it on another computer one day to test that.

@Jeffrey04
Collaborator Author

Notes from @samqi:

In the sayit repo on GitHub, the file import_akomontoso.py uses debate: debatesection records, which map cleanly to 6.4 debate structure:
('debateSection', 'administrationOfOath', 'rollCall',
'prayers', 'oralStatements', 'writtenStatements',
'personalStatements', 'ministerialStatements',
'resolutions', 'nationalInterest', 'declarationOfVote',
'communication', 'petitions', 'papers', 'noticesOfMotion',
'questions', 'address', 'proceduralMotions',
'pointOfOrder', 'adjournment')
and containers (speech, question, answer only).
The full vocabulary has more, but we ignore it because sayit only takes these 3.

So, in context:
hansard
would follow the Open Voices govhack-parliament.nz repo, which also produced sayit-compatible hansard XML.
Example XML:
.......
inquiries
would use the debate structure, with soalan mulut and soalan bertulis mapping to oralStatements and writtenStatements respectively.
Example for soalan mulut / oral statements:
.......
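
The vocabulary and the inquiry mapping from these notes can also be written down as a small lookup. This is only a sketch: it is not the elided XML examples above and not code from this PR or from sayit. The tag names are copied from the notes; the constant and function names (`DEBATE_STRUCTURE_TAGS`, `CONTAINER_TAGS`, `section_tag_for`, `is_supported`) are hypothetical.

```python
# Debate structure tag names, copied from the notes above.
DEBATE_STRUCTURE_TAGS = frozenset({
    "debateSection", "administrationOfOath", "rollCall",
    "prayers", "oralStatements", "writtenStatements",
    "personalStatements", "ministerialStatements",
    "resolutions", "nationalInterest", "declarationOfVote",
    "communication", "petitions", "papers", "noticesOfMotion",
    "questions", "address", "proceduralMotions",
    "pointOfOrder", "adjournment",
})

# The only three containers the notes say sayit accepts.
CONTAINER_TAGS = frozenset({"speech", "question", "answer"})

# Hypothetical lookup for inquiries, following the mapping in the notes.
INQUIRY_SECTION_TAG = {
    "soalan mulut": "oralStatements",
    "soalan bertulis": "writtenStatements",
}


def section_tag_for(inquiry_kind):
    """Return the debate structure tag for an inquiry kind, e.g. "soalan mulut"."""
    # Raises KeyError for kinds the notes do not mention.
    return INQUIRY_SECTION_TAG[inquiry_kind.strip().lower()]


def is_supported(tag):
    """True if the tag is in the subset of the vocabulary listed in the notes."""
    return tag in DEBATE_STRUCTURE_TAGS or tag in CONTAINER_TAGS
```

A check like `is_supported` could run before writing any element, so that tags outside this subset are caught early.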

parse_time: str = str(datetime.now())


class Inquiry(NamedTuple):
Collaborator Author

Just keeping as a note, this is the structure of the resulting JSON file for an inquiry

@samqi
Contributor

samqi commented Jun 10, 2024

You can test with a GPU for free on Google Colab (https://research.google.com/colaboratory/faq.html). Use the T4 GPU.

@Jeffrey04
Collaborator Author

Oh ya, forgot about that. The notebooks are mostly for testing out ideas and exploring the data, though.
If you just want to read the content of a .pickle file, you don't need a GPU.

@Jeffrey04 Jeffrey04 marked this pull request as ready for review June 16, 2024 10:34
@samqi samqi merged commit e0ea1a5 into Sinar:main Jun 25, 2024

2 participants