
Provide some more info on Anystyle algorithms #189

Open
cboulanger opened this issue Aug 11, 2022 · 1 comment

Comments

@cboulanger (Contributor) commented Aug 11, 2022

Hi,

Even though I don't speak Ruby (yet), I am able to follow what's happening under the hood in the AnyStyle source code, since it is so well organized and cleanly written -- thanks for that! But it would help tremendously if you could provide some more information on how the whole thing works algorithmically. Have you published any papers on it?

In particular, did I understand the following correctly?

  • the Finder extracts contiguous lines of reference text, which means the training distinguishes "reference line vs. non-reference line" rather than "reference starts here / reference ends here" (which is the EXparser training model)
  • the Parser is then responsible for segmenting these contiguous lines into individual references, and further into the individual bits of reference information that can then be formatted into the output format.

Can you point me to the parts in the code where this magic happens?

@inukshuk (Owner) commented

Both the Finder and the Parser use wapiti to label tokens in a sequence. To do this, the input is first tokenized (Finder tokens are whole lines, Parser tokens are words), and a set of features is computed for each token. A model is trained by supplying labeled token feature sets; based on this model and the features, the parser then labels each token.
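To make the tokenize → features → label pipeline concrete, here is a minimal, self-contained sketch. The names (`tokenize`, `features_for`) and the feature set are illustrative assumptions, not AnyStyle's actual code; in AnyStyle the labeling is done by a trained wapiti CRF model, which this sketch only gestures at in the final comment.

```ruby
# Parser-style tokenization: split an input line into word tokens.
def tokenize(line)
  line.split(/\s+/)
end

# Compute a small feature set per token. A real model uses many more
# features (dictionaries, position, punctuation patterns, etc.).
def features_for(token)
  {
    capitalized: token.match?(/\A[A-Z]/),
    numeric:     token.match?(/\A\d+\z/),
    has_period:  token.include?('.'),
    length:      token.length
  }
end

tokens   = tokenize('Doe, J. (2020). Some Title. Journal, 1(2), 3-4.')
features = tokens.map { |t| [t, features_for(t)] }
# A trained CRF (wapiti) maps each token's feature set -- in the context
# of the whole sequence -- to a label such as author, date, or title.
```

Training then amounts to handing wapiti many such token sequences with their gold labels attached, so it can learn the feature-to-label mapping.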

For full texts this approach can be useful for parsing the structure of a document (e.g., to produce tables of contents based on section headings, or to extract information like title, author, etc.). Since we use it only to detect reference sections, we tried to use a minimal set of labels for the finder model (usually you'd want to label things like 'title', 'header', 'figure', etc., but in our case that just adds extra complexity). I'm not fully convinced it's the best approach, but it works reasonably well.

After references are labeled, segments are grouped by label and then 'normalized' -- the normalizers usually operate only on selected segments and, contrary to the parser itself, they may alter the input (e.g., strip away punctuation). Because all of the post-processing/normalizers currently assume there is only a single reference per input, one limitation of the parser is that each line should represent a single reference. Since the Finder just returns 'reference lines' from a larger body of text, one of the biggest obstacles is splitting/joining those lines correctly. There's currently no good solution for that, just a naive scoring system based on lots of regular expressions.
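A toy version of such a regex-based scoring system might look like the sketch below. The patterns, weights, and method names are invented for illustration; AnyStyle's actual heuristics live in its source and are considerably more extensive.

```ruby
# Hints that a line STARTS a new reference, with ad-hoc weights.
NEW_REF_HINTS = [
  [/\A[A-Z][a-z]+,\s+[A-Z]\./, 2],  # "Surname, I." at line start
  [/\A\[\d+\]/, 2],                 # numeric marker like "[12]"
  [/\(\d{4}\)/, 1]                  # a year in parentheses
].freeze

def starts_new_reference?(line)
  NEW_REF_HINTS.sum { |re, weight| line.match?(re) ? weight : 0 } >= 2
end

# Join wrapped continuation lines onto the reference they belong to.
def join_lines(lines)
  lines.each_with_object([]) do |line, refs|
    if refs.empty? || starts_new_reference?(line)
      refs << line
    else
      refs[-1] = "#{refs.last} #{line}"
    end
  end
end
```

With input like `['Doe, J. (2020). A Study.', 'Journal of Examples, 1(2).', 'Roe, A. (2019). Another.']`, the middle line scores low and is joined onto the first reference, yielding two references. The hard part, as noted above, is that no fixed set of patterns covers all citation styles.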
