Segmentation fault when combining datasets #192

Open
cboulanger opened this issue Aug 17, 2022 · 11 comments

@cboulanger (Contributor) commented Aug 17, 2022

I am getting a segmentation fault when loading a parser model that has been trained like so:

  paths = [...]
  ds = Wapiti::Dataset.new
  paths.each do |path|
    ds = ds | Wapiti::Dataset.open(path)
  end

Isn't this the way to combine datasets?
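
The same loop folded with reduce, using only the Dataset.open and | calls from above:

  # start from an empty dataset and union each file into it
  ds = paths.reduce(Wapiti::Dataset.new) do |acc, path|
    acc | Wapiti::Dataset.open(path)
  end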

@inukshuk (Owner)

Looks good at a quick glance. The individual datasets could still contain empty tags or other issues.
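
A quick sanity check along those lines, assuming Wapiti::Dataset is enumerable and each sequence's tokens respond to #label (adjust if your token attributes differ):

  # list sequences that contain tokens with a missing or blank label
  ds.each_with_index do |sequence, i|
    empty = sequence.tokens.select { |t| t.label.to_s.strip.empty? }
    puts "sequence #{i}: #{empty.length} token(s) without a label" unless empty.empty?
  end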

@cboulanger (Contributor, Author)

The segmentation fault seems to kick in after a certain size of the dataset, which might be around 3000 sequences; I need to test more.

Training new parser model 'core' with 1514 sequences...
Loading model...
Training new parser model 'footnotes' with 1883 sequences...
Loading model...
Training new parser model 'excite-soc' with 1000 sequences...
Loading model...
Training new parser model 'core+excite-soc' with 2514 sequences...
Loading model...
Training new parser model 'core+footnotes' with 3397 sequences...
[BUG] Segmentation fault at 0x0000000000000000
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]

@inukshuk (Owner)

Hm, this could easily be an issue in wapiti itself, i.e. that it can't handle such large models. I doubt that you'd need such a large training set, though.

@cboulanger (Contributor, Author) commented Aug 19, 2022

Ok, interesting - I was assuming "the more the better", which is obviously not the way to go. I am not familiar with how Wapiti works. Is a larger amount of similar training data unnecessary for improving the labeling, i.e. would one way of improving my dataset be to throw out structurally similar sequences? But what about the tokens in those sequences? Do they make a difference as words/characters (i.e. as a dictionary of terms/tokens that occur and increase the prediction value for a label)? That is where I thought a larger corpus would improve the model.

@cboulanger (Contributor, Author)

Maybe one solution would be to take different source datasets, randomly pick sequences from each to compose a dataset of a given size in a specific ratio, and then see which ratio performs best.
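
Something like this sketch, assuming Wapiti::Dataset is enumerable and that its constructor accepts an array of sequences (core and footnotes are placeholder names for two source datasets):

  # compose a fixed-size dataset from two sources in a given ratio
  def sample_mix(a, b, size:, ratio:)
    from_a = a.to_a.sample((size * ratio).round)
    from_b = b.to_a.sample(size - from_a.length)
    Wapiti::Dataset.new(from_a + from_b)
  end

  mixed = sample_mix(core, footnotes, size: 1500, ratio: 0.6)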

@cboulanger (Contributor, Author) commented Aug 19, 2022

Ok, I see progress! Using reduced parser datasets (1500 sequences) and two PDFs containing footnotes:

Training:

Model set 'excite-soc':

  • Finder model has been trained with 225 documents
  • Parser model has been trained with 1499 sequences.

Model set 'zfrsoz-footnotes':

  • Finder model has been trained with 50 documents
  • Parser model has been trained with 1494 sequences.

Testing:

Using model set 'anystyle-default':

  • 10.1515_zfrs-1980-0104.pdf: 18 reference lines and 18 sequences
  • 10.1111_j.1467-6478.2005.00328.x.pdf: 44 reference lines and 44 sequences

Using model set 'excite-soc':

  • 10.1515_zfrs-1980-0104.pdf: 3 reference lines and 3 sequences
  • 10.1111_j.1467-6478.2005.00328.x.pdf: 6 reference lines and 6 sequences

Using model set 'zfrsoz-footnotes':

  • 10.1515_zfrs-1980-0104.pdf: 74 reference lines and 73 sequences
  • 10.1111_j.1467-6478.2005.00328.x.pdf: 88 reference lines and 87 sequences

@cboulanger (Contributor, Author)

Oh, but checking the results: 'zfrsoz-footnotes' mainly consists of false positives, so no progress really :-(

@cboulanger (Contributor, Author)

Ok, my finder training material is much worse than I thought it was - there were some problems with the automatic translation from the excite material. I'm fixing it now to see if it makes a difference.

@cboulanger (Contributor, Author) commented Aug 24, 2022

After correcting the training material and reducing the size of the dataset, I thought I was seeing progress, but I am back to getting segmentation faults even though I reduced the parser dataset to 1500 sequences (it also segfaults at 1000 sequences). Wouldn't this be a size that Wapiti should be able to handle? Should I open an issue at the Wapiti repo?

@inukshuk (Owner)

It's probably not a size issue then. I think even the core set is larger at the moment. This sounds like there are specific sequences causing the segfault; if you could isolate them, we might be able to figure out why.
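
One way to narrow it down, sketched under a couple of assumptions (train_parser stands in for whatever training call you use, and Dataset.new is assumed to accept an array of sequences): train on slices of the dataset in a forked child process, so a segfault only kills the child and the parent can report which slice crashed.

  # returns true if training on this slice of sequences crashes the child process
  def slice_crashes?(sequences)
    pid = fork do
      train_parser(Wapiti::Dataset.new(sequences)) # placeholder training call
      exit!(0)
    end
    _, status = Process.wait2(pid)
    !status.success? # a segfault surfaces as a child killed by SIGSEGV
  end

  ds.to_a.each_slice(100).with_index do |slice, i|
    puts "slice #{i} triggers the crash" if slice_crashes?(slice)
  end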

@cboulanger (Contributor, Author) commented Aug 25, 2022

Ok, I got it to work. Here's what I changed (I'm not sure which was the decisive fix):

  • According to a closed issue, a <note> tag at the beginning or end of a sequence seems to trigger a segmentation fault (see the sketch after this list).
  • Also, empty lines in the XML files may cause it (I haven't re-checked), although whitespace should be insignificant when parsing XML.
  • Finally, while looping over the different models, I loaded them using AnyStyle.X.model.load instead of (correctly) AnyStyle.X.load_model (X being finder/parser). This seems not to have cleared previous data and resulted in a mix of datasets (and maybe also the segfaults).
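
For the first point, this is roughly how such sequences could be found, assuming each sequence exposes its tokens, each token responds to #label, and the label is literally named "note" to mirror the <note> tag (adjust to your data):

  # flag sequences whose first or last token carries a note label
  ds.each_with_index do |sequence, i|
    labels = sequence.tokens.map { |t| t.label.to_s }
    if labels.first == 'note' || labels.last == 'note'
      puts "sequence #{i} starts or ends with a note tag"
    end
  end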

After cleaning the XML data and fixing the bugs, I got this:

Training:

Model set 'excite-soc':

  • Finder model has been trained with 225 documents
  • Parser model has been trained with 1500 sequences.

Model set 'zfrsoz-footnotes':

  • Finder model has been trained with 50 documents
  • Parser model has been trained with 1500 sequences.

Testing:

Using model set 'anystyle-default':

  • 10.1515_zfrs-1980-0104: 18 sequences found
  • 10.1111_j.1467-6478.2005.00328.x: 44 sequences found

Using model set 'excite-soc':

  • 10.1515_zfrs-1980-0104: 15 sequences found
  • 10.1111_j.1467-6478.2005.00328.x: 48 sequences found

Using model set 'zfrsoz-footnotes':

  • 10.1515_zfrs-1980-0104: 57 sequences found
  • 10.1111_j.1467-6478.2005.00328.x: 68 sequences found

The zfrsoz-footnotes model results look pretty good at first glance, even though I need to find a better evaluation algorithm. The others perform worse because their finder models are trained on bibliography-at-the-end-of-the-paper datasets, which do not catch the references in the footnotes.
