Extracting non-textuals from ParlaMint-{HR,BA,RS}
Idea: open a component file. For every utterance, join its segments back into the full utterance. Extract the non-textual elements. Then rebuild the segments and renumber the component.
The RegEx patterns have been written and sequenced in proper order.
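A minimal sketch of the join-then-extract step, to make the idea concrete. The single pattern below (any parenthesised span) is only an assumption for illustration; the real patterns and their order are not listed in these notes, and the names `NON_TEXTUAL` and `extract_non_textuals` are hypothetical.

```python
import re

# Hypothetical stand-in for the real, ordered pattern list:
# treat every parenthesised span as a non-textual element.
NON_TEXTUAL = re.compile(r"\(([^()]*)\)")


def extract_non_textuals(segments):
    """Join segments into one utterance, then split it into
    ("text", ...) and ("note", ...) pieces in document order."""
    utterance = " ".join(s.strip() for s in segments if s.strip())
    pieces = []
    pos = 0
    for match in NON_TEXTUAL.finditer(utterance):
        before = utterance[pos:match.start()].strip()
        if before:
            pieces.append(("text", before))
        pieces.append(("note", match.group(1).strip()))
        pos = match.end()
    tail = utterance[pos:].strip()
    if tail:
        pieces.append(("text", tail))
    return pieces


if __name__ == "__main__":
    demo = ["Hvala.", "(NASTAVAK NAKON STANKE U 9,45 SATI)", "Nastavljamo."]
    for kind, content in extract_non_textuals(demo):
        print(kind, content)
```

Joining first means a note that was cut in half by the old segmentation is still matched as one span.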
To discuss:
- I have the option of not splitting into sentences this time. Should I go for unsplit utterances? -> ask Tomaž! Answer: yes, do not split.
To add:
- (NASTAVAK NAKON STANKE U 9,45 SATI) ["continuation after the break at 9:45"]
As of now the component-level fixer works marvelously. Next step: running it on all the datasets.
I found a few more bugs, but it finally ran successfully on all three branches. Next I'll check whether the add common content step was done correctly.
- Change segment notation to seg (e.g. <seg xml:id="ParlaMint-HR_T6.S12.u37297.seg0">)
- Split files on agenda (for now 500 - 1k utterances)
- Rerun everything.
- Run annotation
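
A rough sketch of the segment renumbering from the list above, assuming each segment ID is the parent utterance ID plus `.segN`, counted from 0 (inferred from the single example ID, not confirmed). For brevity the elements here carry no TEI namespace, so real component files would need namespaced tag names; `renumber_segments` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def renumber_segments(utterance):
    """Rewrite the xml:id of every <seg> under one <u> element as
    <utterance id>.seg0, .seg1, ... (assumed scheme, see above)."""
    u_id = utterance.get(XML_ID)
    for index, seg in enumerate(utterance.iter("seg")):
        seg.set(XML_ID, f"{u_id}.seg{index}")
    return utterance


if __name__ == "__main__":
    u = ET.fromstring(
        '<u xml:id="ParlaMint-HR_T6.S12.u37297">'
        '<seg xml:id="old-a">Prvi dio.</seg>'
        '<seg xml:id="old-b">Drugi dio.</seg>'
        "</u>"
    )
    renumber_segments(u)
    print(ET.tostring(u, encoding="unicode"))
```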