
Provide some more info on Anystyle algorithms #189

Open
cboulanger opened this issue Aug 11, 2022 · 1 comment

Comments

@cboulanger (Contributor) commented Aug 11, 2022

Hi,

Even though I don't speak Ruby (yet), I am able to follow what's happening under the hood in the AnyStyle source code, since it is so well organized and cleanly written -- thanks for that! But it would help tremendously if you could provide some more information on how the whole thing works algorithmically. Have you published any papers on it?

In particular, did I understand the following correctly?

  • the Finder extracts contiguous lines of reference text, which means the training distinguishes "reference line vs. non-reference line" rather than "reference starts here / reference ends here" (which is the EXparser training model)
  • the Parser is then responsible for segmenting these contiguous lines into individual references, and further into the individual bits of reference information that can then be formatted into the output format.

Can you point me to the parts in the code where this magic happens?

@inukshuk (Owner) commented

Both the Finder and the Parser use wapiti to label tokens in a sequence. To do this, the input is first tokenized (Finder tokens are whole lines, Parser tokens are words), and a set of features is computed for each token. A model is trained by supplying labeled token feature sets; based on this model and the features, the parser then labels each token.
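To make the tokenize → features → label pipeline concrete, here is a minimal, self-contained sketch. The names (`tokenize`, `features_for`) and the feature set are illustrative assumptions, not AnyStyle's actual code; in AnyStyle the labeling is done by a trained wapiti CRF model, which this sketch only gestures at in the final comment.

```ruby
# Parser-style tokenization: split an input line into word tokens.
def tokenize(line)
  line.split(/\s+/)
end

# Compute a small feature set per token. A real model uses many more
# features (dictionaries, position, punctuation patterns, etc.).
def features_for(token)
  {
    capitalized: token.match?(/\A[A-Z]/),
    numeric:     token.match?(/\A\d+\z/),
    has_period:  token.include?('.'),
    length:      token.length
  }
end

tokens   = tokenize('Doe, J. (2020). Some Title. Journal, 1(2), 3-4.')
features = tokens.map { |t| [t, features_for(t)] }
# A trained CRF (wapiti) maps each token's feature set -- in the context
# of the whole sequence -- to a label such as author, date, or title.
```

Training then amounts to handing wapiti many such token sequences with their gold labels attached, so it can learn the feature-to-label mapping.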

For full texts this approach can be useful for parsing the structure of a document (e.g., to produce tables of contents based on section headings, or to extract information like title, author, etc.). Since we use it only to detect reference sections, we tried to use a minimal set of labels for the finder model (usually you'd want to label things like 'title', 'header', 'figure', etc., but in our case that just adds extra complexity). I'm not fully convinced it's the best approach, but it works reasonably well.

After references are labeled, segments are grouped by label and then 'normalized' -- the normalizers usually operate only on selected segments and, contrary to the parser itself, they may alter the input (e.g., strip away punctuation). Because all of the post-processing/normalizers currently assume there is only a single reference per input, one limitation of the parser is that each line should represent a single reference. Since the Finder just returns 'reference lines' from a larger body of text, one of the biggest obstacles is splitting/joining those lines correctly. There's currently no good solution for that, just a naive scoring system based on lots of regular expressions.
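A toy version of such a regex-based scoring system might look like the sketch below. The patterns, weights, and method names are invented for illustration; AnyStyle's actual heuristics live in its source and are considerably more extensive.

```ruby
# Hints that a line STARTS a new reference, with ad-hoc weights.
NEW_REF_HINTS = [
  [/\A[A-Z][a-z]+,\s+[A-Z]\./, 2],  # "Surname, I." at line start
  [/\A\[\d+\]/, 2],                 # numeric marker like "[12]"
  [/\(\d{4}\)/, 1]                  # a year in parentheses
].freeze

def starts_new_reference?(line)
  NEW_REF_HINTS.sum { |re, weight| line.match?(re) ? weight : 0 } >= 2
end

# Join wrapped continuation lines onto the reference they belong to.
def join_lines(lines)
  lines.each_with_object([]) do |line, refs|
    if refs.empty? || starts_new_reference?(line)
      refs << line
    else
      refs[-1] = "#{refs.last} #{line}"
    end
  end
end
```

With input like `['Doe, J. (2020). A Study.', 'Journal of Examples, 1(2).', 'Roe, A. (2019). Another.']`, the middle line scores low and is joined onto the first reference, yielding two references. The hard part, as noted above, is that no fixed set of patterns covers all citation styles.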
