20240609 data parser #2

Jeffrey04 · 2024-06-09T10:33:20Z

To extract data from the downloaded files (see #1), run

$ legisdata extract 2020 2

This extracts the data with unstructured.io, then pickles and saves the result. To parse the data into a structured format, run

$ legisdata parse 2020 2

The schema of the structured format can be found at src/legisdata/schema.py. The data in Akoma Ntoso format is stored in each JSON file generated above (./data/2020/session-2/{hansard,inquiry}-parse/*.json) and can be printed with

$ jq -r .akn $filename

If you have batcat installed, you can pipe the output to it for syntax highlighting

$ jq -r .akn $filename | batcat
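For anyone who prefers to stay in Python, here is a minimal sketch of the same lookup that `jq -r .akn` performs. It assumes only what the command above implies, namely that each parsed JSON file has a top-level `akn` key holding the XML as a string. The helper names `read_akn` and `iter_akn` are made up for this example and are not part of legisdata.

```python
import json
from pathlib import Path


def read_akn(path: Path) -> str:
    """Return the XML string stored under the top-level "akn" key."""
    with path.open(encoding="utf-8") as handle:
        document = json.load(handle)
    return document["akn"]


def iter_akn(directory: Path):
    """Yield (file name, XML string) for every *.json file in a directory."""
    for path in sorted(directory.glob("*.json")):
        yield path.name, read_akn(path)
```

Calling `iter_akn(Path("data/2020/session-2/inquiry-parse"))` would then walk the inquiry output produced by the parse step.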

As usual, data is posted to https://huggingface.co/datasets/sinarproject/legisdata/tree/main

Jeffrey04 · 2024-06-09T10:38:05Z

If anyone is interested, you can download a .pickle file from https://huggingface.co/datasets/sinarproject/legisdata and start experimenting with it. These notebooks give an idea of what the data looks like:

https://gist.github.com/Jeffrey04/b1737d5ee02a1ced0ac9c34b1a8fe827

(run in the same virtualenv as the project)
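A minimal sketch of opening one of the downloaded .pickle files outside a notebook. The structure of the pickled object is not documented in this thread, so the helper only reports its type and length instead of assuming a layout. `load_pickle` and `describe` are hypothetical names. Unpickling needs the classes of the pickled objects to be importable, which is presumably why the project virtualenv is required, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path
from typing import Any


def load_pickle(path: Path) -> Any:
    """Load a pickled object; its classes must be importable in this environment."""
    with path.open("rb") as handle:
        return pickle.load(handle)


def describe(obj: Any) -> str:
    """Summarise an object's type and length without assuming its structure."""
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    return f"{type(obj).__name__} (len={size})"
```

`describe(load_pickle(Path("some-file.pickle")))` is a reasonable first look before digging into individual elements; the file name here is a placeholder.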

Jeffrey04 · 2024-06-09T10:39:07Z

@sweemeng once mentioned that unstructured.io requires a GPU to run; I will try it on another computer one day to test this.

Jeffrey04 · 2024-06-09T10:40:13Z

notes from @samqi

In the SayIt repo on GitHub, the file import_akomontoso.py uses two kinds of elements:
debate section records, which map cleanly to the 6.4 debate structure:
('debateSection', 'administrationOfOath', 'rollCall',
'prayers', 'oralStatements', 'writtenStatements',
'personalStatements', 'ministerialStatements',
'resolutions', 'nationalInterest', 'declarationOfVote',
'communication', 'petitions', 'papers', 'noticesOfMotion',
'questions', 'address', 'proceduralMotions',
'pointOfOrder', 'adjournment')
and speech containers (speech, question, answer only).
The full vocabulary has more, but we ignore the rest because SayIt only takes these 3.
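The two groups above can be written down as lookup sets, which makes it easy to check whether a generated element is something SayIt would pick up. This is only a sketch built from the names listed in the notes; `classify` is a hypothetical helper, not a SayIt function, and the namespace stripping assumes tags in ElementTree's `{namespace}local` form.

```python
# Debate section element names, copied from the list in the notes above.
DEBATE_SECTIONS = frozenset({
    "debateSection", "administrationOfOath", "rollCall",
    "prayers", "oralStatements", "writtenStatements",
    "personalStatements", "ministerialStatements",
    "resolutions", "nationalInterest", "declarationOfVote",
    "communication", "petitions", "papers", "noticesOfMotion",
    "questions", "address", "proceduralMotions",
    "pointOfOrder", "adjournment",
})

# The only speech containers the notes say are taken into account.
SPEECH_CONTAINERS = frozenset({"speech", "question", "answer"})


def classify(tag: str) -> str:
    """Return "section", "speech" or "ignored" for an element tag."""
    local = tag.rsplit("}", 1)[-1]  # drop a leading "{namespace}" if present
    if local in DEBATE_SECTIONS:
        return "section"
    if local in SPEECH_CONTAINERS:
        return "speech"
    return "ignored"
```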

Thus, contextually:
For hansard, the output would follow the Open Voices govhack-parliament.nz repo, which also produced a SayIt-compatible hansard XML.
Example XML:
.......
For inquiries, the debate structure would map soalan mulut (oral questions) and soalan bertulis (written questions) to oralStatements and writtenStatements respectively.
Example for soalan mulut / oral statements:
.......
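To make the inquiry mapping concrete, here is a hypothetical skeleton built with the standard library. It is not the example elided above and not the project's generator: the dictionary keys, the function name and the bare `p` children are assumptions, and namespaces, speaker attributes and headings are left out entirely.

```python
import xml.etree.ElementTree as ET

# Inquiry type -> debate section element, per the mapping in the notes.
# The keys "mulut" and "bertulis" are illustrative labels only.
SECTION_FOR_INQUIRY = {
    "mulut": "oralStatements",       # soalan mulut (oral)
    "bertulis": "writtenStatements",  # soalan bertulis (written)
}


def build_inquiry_section(kind: str, question: str, answer: str) -> ET.Element:
    """Wrap one question/answer pair in the matching debate section element."""
    section = ET.Element(SECTION_FOR_INQUIRY[kind])
    question_el = ET.SubElement(section, "question")
    ET.SubElement(question_el, "p").text = question
    answer_el = ET.SubElement(section, "answer")
    ET.SubElement(answer_el, "p").text = answer
    return section
```

`ET.tostring(build_inquiry_section("mulut", "Q", "A"), encoding="unicode")` then gives a single `oralStatements` element holding one `question` and one `answer`.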

Jeffrey04 · 2024-06-09T10:41:25Z

src/legisdata/main.py

+    parse_time: str = str(datetime.now())
+
+
+class Inquiry(NamedTuple):


Just keeping this as a note: this is the structure of the resulting JSON file for an inquiry.
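One thing worth knowing about a default like `parse_time: str = str(datetime.now())`: assuming the class it belongs to is a NamedTuple like `Inquiry`, the default is evaluated once, when the class is defined, so every record created without an explicit value shares the same timestamp. The sketch below shows this with a made-up class; `RecordMeta`, its `source` field and `stamped` are hypothetical and do not come from src/legisdata.

```python
import json
from datetime import datetime
from typing import NamedTuple


class RecordMeta(NamedTuple):
    """Hypothetical stand-in for the class that carries parse_time."""

    source: str
    # Evaluated once at class definition time, not per instance.
    parse_time: str = str(datetime.now())


def stamped(source: str) -> RecordMeta:
    """Create a record whose parse_time is taken at call time instead."""
    return RecordMeta(source, parse_time=str(datetime.now()))


def to_json(record: RecordMeta) -> str:
    """Serialise a NamedTuple as a JSON object keyed by field name."""
    return json.dumps(record._asdict())
```

If one timestamp per run is the intent, the class-level default is fine as it is; a helper like `stamped` only matters if each record should carry its own time.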

samqi · 2024-06-10T04:45:43Z

You can test on a GPU for free on Google Colab, using the T4 GPU (https://research.google.com/colaboratory/faq.html).

Jeffrey04 · 2024-06-10T04:52:33Z

Oh yes, I forgot about that, but the notebooks are mostly used to test ideas and explore the data.
If you just want to read the content of a .pickle file, you don't need a GPU.

Jeffrey04 added 2 commits June 9, 2024 18:30

new dependencies

73e76e2

add parser for inquiry data

f7bf84b

Jeffrey04 added 6 commits June 13, 2024 19:08

add hansard parsing to the script

c2e7b5c

split schema to individual module

8321fcb

move parser code out from main

94f4ce0

cleanup list manipulation code

e41cf71

added code to generate atomaNtoso xml format

6cc3b0b

remove unused import

e4c8bc4

Jeffrey04 marked this pull request as ready for review June 16, 2024 10:34

remove own implementation of serializer

8fef7de

samqi merged commit e0ea1a5 into Sinar:main Jun 25, 2024
