Segmentation fault when combining datasets #192

Open
cboulanger opened this issue Aug 17, 2022 · 11 comments

@cboulanger (Contributor) commented Aug 17, 2022

I am getting a segmentation fault when loading a parser model that has been trained like so:

  paths = [...]
  ds = Wapiti::Dataset.new
  paths.each do |path|
    ds = ds | Wapiti::Dataset.open(path)
  end

Isn't this the way to combine datasets?
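
The same loop folded with reduce, using only the Dataset.open and | calls from above:

  # start from an empty dataset and union each file into it
  ds = paths.reduce(Wapiti::Dataset.new) do |acc, path|
    acc | Wapiti::Dataset.open(path)
  end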

@inukshuk (Owner)

Looks good at a quick glance. The individual datasets could still contain empty tags or other issues.
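
A quick sanity check along those lines, assuming Wapiti::Dataset is enumerable and each sequence's tokens respond to #label (adjust if your token attributes differ):

  # list sequences that contain tokens with a missing or blank label
  ds.each_with_index do |sequence, i|
    empty = sequence.tokens.select { |t| t.label.to_s.strip.empty? }
    puts "sequence #{i}: #{empty.length} token(s) without a label" unless empty.empty?
  end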

@cboulanger (Contributor, Author)

The segmentation fault seems to kick in after a certain size of the dataset, which might be around 3000 sequences; I need to test more.

Training new parser model 'core' with 1514 sequences...
Loading model...
Training new parser model 'footnotes' with 1883 sequences...
Loading model...
Training new parser model 'excite-soc' with 1000 sequences...
Loading model...
Training new parser model 'core+excite-soc' with 2514 sequences...
Loading model...
Training new parser model 'core+footnotes' with 3397 sequences...
[BUG] Segmentation fault at 0x0000000000000000
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]

@inukshuk (Owner)

Hm, this could easily be an issue in wapiti itself, i.e. that it can't handle such large models. I doubt that you'd need such a large training set, though.

@cboulanger (Contributor, Author) commented Aug 19, 2022

Ok, interesting - I was assuming "the more the better", which is obviously not the way to go. I am not familiar with how Wapiti works. Is a larger amount of similar training data unnecessary for improving the labeling, i.e. would one way of improving my dataset be to throw out structurally similar sequences? But what about the tokens in those sequences? Do they make a difference as words/characters (i.e. as a dictionary of terms/tokens that occur and increase the prediction value for a label)? That is where I thought a larger corpus would improve the model.

@cboulanger (Contributor, Author)

Maybe one solution would be to take different source datasets, randomly pick sequences from each to compose a dataset of a given size in a specific ratio, and then see which ratio performs best.
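
Something like this sketch, assuming Wapiti::Dataset is enumerable and that its constructor accepts an array of sequences (core and footnotes are placeholder names for two source datasets):

  # compose a fixed-size dataset from two sources in a given ratio
  def sample_mix(a, b, size:, ratio:)
    from_a = a.to_a.sample((size * ratio).round)
    from_b = b.to_a.sample(size - from_a.length)
    Wapiti::Dataset.new(from_a + from_b)
  end

  mixed = sample_mix(core, footnotes, size: 1500, ratio: 0.6)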

@cboulanger (Contributor, Author) commented Aug 19, 2022

Ok, I see progress! Using reduced parser datasets (1500 sequences) and two PDFs containing footnotes:

Training:

Model set 'excite-soc':

  • Finder model has been trained with 225 documents
  • Parser model has been trained with 1499 sequences.

Model set 'zfrsoz-footnotes':

  • Finder model has been trained with 50 documents
  • Parser model has been trained with 1494 sequences.

Testing:

Using model set 'anystyle-default':

  • 10.1515_zfrs-1980-0104.pdf: 18 reference lines and 18 sequences
  • 10.1111_j.1467-6478.2005.00328.x.pdf: 44 reference lines and 44 sequences

Using model set 'excite-soc':

  • 10.1515_zfrs-1980-0104.pdf: 3 reference lines and 3 sequences
  • 10.1111_j.1467-6478.2005.00328.x.pdf: 6 reference lines and 6 sequences

Using model set 'zfrsoz-footnotes':

  • 10.1515_zfrs-1980-0104.pdf: 74 reference lines and 73 sequences
  • 10.1111_j.1467-6478.2005.00328.x.pdf: 88 reference lines and 87 sequences

@cboulanger (Contributor, Author)

Oh, but checking the results: 'zfrsoz-footnotes' mainly consists of false positives, so no progress really :-(

@cboulanger (Contributor, Author)

Ok, my finder training material is much worse than I thought it was - there were some problems with the automatic translation from the excite material. I'm fixing it now to see if it makes a difference.

@cboulanger (Contributor, Author) commented Aug 24, 2022

After correcting the training material and reducing the size of the dataset, I thought I was seeing progress, but I am back to getting segmentation faults even though I reduced the parser dataset to 1500 sequences (it also segfaults at 1000 sequences). Wouldn't this be a size that Wapiti should be able to handle? Should I open an issue at the Wapiti repo?

@inukshuk (Owner)

It's probably not a size issue then. I think even the core set is larger at the moment. This sounds like there are specific sequences causing the segfault; if you could isolate them, we might be able to figure out why.
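
One way to narrow it down, sketched under a couple of assumptions (train_parser stands in for whatever training call you use, and Dataset.new is assumed to accept an array of sequences): train on slices of the dataset in a forked child process, so a segfault only kills the child and the parent can report which slice crashed.

  # returns true if training on this slice of sequences crashes the child process
  def slice_crashes?(sequences)
    pid = fork do
      train_parser(Wapiti::Dataset.new(sequences)) # placeholder training call
      exit!(0)
    end
    _, status = Process.wait2(pid)
    !status.success? # a segfault surfaces as a child killed by SIGSEGV
  end

  ds.to_a.each_slice(100).with_index do |slice, i|
    puts "slice #{i} triggers the crash" if slice_crashes?(slice)
  end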

@cboulanger (Contributor, Author) commented Aug 25, 2022

Ok, I got it to work. Here's what I changed (I'm not sure which was the decisive fix):

  • According to a closed issue, a <note> tag at the beginning or end of a sequence seems to trigger a segmentation fault (see the sketch after this list).
  • Also, empty lines in the XML files may cause it (I haven't re-checked), although whitespace should be insignificant when parsing XML.
  • Finally, while looping over the different models, I loaded them using AnyStyle.X.model.load instead of (correctly) AnyStyle.X.load_model (X being finder/parser). This seems not to have cleared previous data and resulted in a mix of datasets (and maybe also the segfaults).
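
For the first point, this is roughly how such sequences could be found, assuming each sequence exposes its tokens, each token responds to #label, and the label is literally named "note" to mirror the <note> tag (adjust to your data):

  # flag sequences whose first or last token carries a note label
  ds.each_with_index do |sequence, i|
    labels = sequence.tokens.map { |t| t.label.to_s }
    if labels.first == 'note' || labels.last == 'note'
      puts "sequence #{i} starts or ends with a note tag"
    end
  end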

After cleaning the XML data and fixing the bugs, I got this:

Training:

Model set 'excite-soc':

  • Finder model has been trained with 225 documents
  • Parser model has been trained with 1500 sequences.

Model set 'zfrsoz-footnotes':

  • Finder model has been trained with 50 documents
  • Parser model has been trained with 1500 sequences.

Testing:

Using model set 'anystyle-default':

  • 10.1515_zfrs-1980-0104: 18 sequences found
  • 10.1111_j.1467-6478.2005.00328.x: 44 sequences found

Using model set 'excite-soc':

  • 10.1515_zfrs-1980-0104: 15 sequences found
  • 10.1111_j.1467-6478.2005.00328.x: 48 sequences found

Using model set 'zfrsoz-footnotes':

  • 10.1515_zfrs-1980-0104: 57 sequences found
  • 10.1111_j.1467-6478.2005.00328.x: 68 sequences found

The zfrsoz-footnotes model results look pretty good at first glance, even though I need to find a better evaluation algorithm. The others perform worse because their finder models are trained on bibliography-at-the-end-of-the-paper datasets, which do not catch the references in the footnotes.
