Training set creation using data from GIANT project? #198

heikojansen · 2022-10-13T10:30:36Z

This isn't exactly an issue but a question: would you consider it feasible and worthwhile to adopt the data generated here:
GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing
as training input for AnyStyle?
Just curious if you see enough potential there.

inukshuk · 2022-10-13T12:52:17Z

That's interesting! We've also discussed using CSL to generate training data in the past; I'd be curious to know how a model trained on such data performs with real-world input.

Obviously you would not want to train a model on 1 billion references, but with such a large resource you could just pick out samples (would also be interesting to see if a model improves after the first couple of thousand references).
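One way to pick out samples from a file that is too large to load into memory is reservoir sampling, which keeps a fixed-size uniform random sample while reading the input once. The following is only a minimal sketch: the function name `reservoir_sample` and the idea that GIANT can be read as one reference per line are assumptions for illustration, not something confirmed in this thread.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    `lines` can be any iterable, e.g. an open file handle over a large dump
    (assumed here to hold one reference per line).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


# Usage with a generator standing in for a very large file.
picked = reservoir_sample((f"ref {n}" for n in range(10000)), 5)
```

Training on samples of increasing size (for example 1,000, 2,000, 5,000 references) would be one way to check whether the model keeps improving after the first couple of thousand.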

heikojansen · 2022-10-13T14:03:41Z

So the basic idea would be to take a random set of publications from the GIANT dataset and, for each publication, create many citations using a number of different CSL styles. Instead of plain strings, though, these citations would be converted to XML sequence elements in which the different parts of the citation are split into child elements declaring the type of information they contain. That XML would then be used as training input.

So the most interesting question is how to generate the "annotated" (by way of XML elements) sequences for different CSL styles.
Is there a list of allowed child element names for the sequence elements?
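To make the idea concrete, here is a rough sketch of rendering one publication in several styles while keeping track of which label each piece belongs to. Everything in it is a stand-in: the `STYLES` table, the `render_labeled` function, the label names and the sample record are hypothetical, and a real pipeline would need an actual CSL processor that can report which variable produced which part of the output.

```python
# Hypothetical stand-in for CSL styles: each "style" is an ordered list of
# (label, format string) pairs. This is NOT how CSL itself works.
STYLES = {
    "author-date": [
        ("author", "{author}"),
        ("date", "({year})."),
        ("title", "{title}."),
    ],
    "title-first": [
        ("title", "{title}."),
        ("author", "{author},"),
        ("date", "{year}."),
    ],
}


def render_labeled(record, style):
    """Render one record as a list of (label, text) segments for a given style."""
    return [(label, fmt.format(**record)) for label, fmt in STYLES[style]]


# Made-up sample record, for illustration only.
record = {"author": "Doe, J.", "year": "2020", "title": "A sample title"}
variants = {name: render_labeled(record, name) for name in STYLES}

# Joining the segment texts gives the plain citation string for each style.
plain = {name: " ".join(text for _, text in segs) for name, segs in variants.items()}
```

Keeping the punctuation inside the labeled segments means the plain string and the annotated version always stay in sync.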

inukshuk · 2022-10-13T14:22:58Z

You can put any element into the sequence: each element will correspond to a label that is known to the model. From what I saw it should be enough to wrap each generated XML reference in a <sequence> and then the whole sample in a <dataset>.
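The wrapping described above could be sketched as follows, using Python's standard `xml.etree.ElementTree`. The helper name `build_dataset`, the label names (`author`, `date`, `title`) and the sample reference are assumptions for illustration; only the `<sequence>` inside `<dataset>` layout comes from the comment above.

```python
import xml.etree.ElementTree as ET


def build_dataset(references):
    """Build a <dataset> document from labeled references.

    `references` is a list of references; each reference is a list of
    (label, text) pairs, where the label becomes the child element name.
    """
    dataset = ET.Element("dataset")
    for segments in references:
        sequence = ET.SubElement(dataset, "sequence")
        for label, text in segments:
            child = ET.SubElement(sequence, label)
            child.text = text  # special characters are escaped on serialization
    return ET.tostring(dataset, encoding="unicode")


# Made-up sample reference with illustrative labels.
example = [
    [
        ("author", "Doe, J."),
        ("date", "(2020)."),
        ("title", "A sample title."),
    ],
]
xml_text = build_dataset(example)
```

Building the tree with an XML library rather than string concatenation ensures that characters such as `&` or `<` in titles are escaped properly.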
