Training set creation using data from GIANT project? #198

Open
heikojansen opened this issue Oct 13, 2022 · 3 comments

@heikojansen

This isn't exactly an issue but a question: would you consider it feasible and worthwhile to adopt the data generated here:
GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing
as training input for AnyStyle?
Just curious whether you see enough potential there.

@inukshuk
Owner

That's interesting! We've also discussed using CSL to generate training data in the past; I'd be curious to know how a model trained on such data performs with real-world input.

Obviously you would not want to train a model on 1 billion references, but with such a large resource you could just pick out samples (would also be interesting to see if a model improves after the first couple of thousand references).
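
Picking out samples from a resource that is too large to load at once could be done with reservoir sampling. The following is only a minimal sketch: it assumes the references can be read one at a time from some iterable, and the function name `reservoir_sample` is made up for illustration, not part of AnyStyle or GIANT.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and never held in memory as a whole,
    so its total length does not need to be known in advance.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Stand-in for a huge file of references: any iterable works here.
picked = reservoir_sample(range(1_000_000), 2000, seed=42)
```

Drawing samples of increasing size (for example 1,000, 2,000, 5,000) and training on each would be one way to check whether the model still improves after the first couple of thousand references.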

@heikojansen
Author

So the basic idea would be to take a random set of publications from the GIANT dataset and, for each publication, create many citations using a number of different CSL styles. Instead of plain strings, though, these citations would be converted to XML sequence elements, with the different parts of the citation split into child elements that declare the type of information they contain. That XML would then be used as training input.

So the most interesting question is how to generate the annotated sequences (annotated by way of XML elements) for different CSL styles.
Is there a list of allowed child element names for the sequence elements available?
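
To make the question concrete, here is a rough sketch of what "annotated" output per style could look like before it is turned into XML. It does not use a real CSL processor: the two templates in `STYLES` are hypothetical stand-ins for CSL styles, and the label names (`author`, `date`, `title`, `journal`) and the record fields are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (label, text as it appears in the citation)

# Hypothetical templates standing in for CSL styles. Each entry gives the
# label of a segment and how that segment is punctuated in the style.
STYLES: Dict[str, List[Tuple[str, str]]] = {
    "author-date": [
        ("author", "{author}."),
        ("date", "({year})."),
        ("title", "{title}."),
        ("journal", "{journal}."),
    ],
    "numeric": [
        ("author", "{author},"),
        ("title", '"{title},"'),
        ("journal", "{journal},"),
        ("date", "{year}."),
    ],
}


def render_segments(record: Dict[str, str], style: str) -> List[Segment]:
    """Format one publication record as labelled segments for one style."""
    return [(label, template.format(**record)) for label, template in STYLES[style]]


record = {
    "author": "Jane Doe",
    "year": "2020",
    "title": "A Title",
    "journal": "Some Journal",
}

for style in STYLES:
    segments = render_segments(record, style)
    # Joining the segment texts gives the plain citation string.
    print(style, "->", " ".join(text for _, text in segments))
```

The point of keeping the label next to each piece of text is that the same record yields a different order and punctuation per style while the annotation comes for free; with a real CSL processor the hard part would be recovering these boundaries from the rendered output.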

@inukshuk
Owner

You can put any element into the sequence: each element will correspond to a label that is known to the model. From what I saw it should be enough to wrap each generated XML reference in a <sequence> and then the whole sample in a <dataset>.
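
As a minimal sketch of that wrapping, assuming each reference is already available as a list of (label, text) pairs: the helper name `build_dataset` and the example labels are made up for illustration, and only the `<sequence>` and `<dataset>` element names come from the description above.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple

Segment = Tuple[str, str]  # (label, text)


def build_dataset(references: Iterable[List[Segment]]) -> ET.Element:
    """Wrap each labelled reference in <sequence>, and all of them in <dataset>."""
    dataset = ET.Element("dataset")
    for segments in references:
        sequence = ET.SubElement(dataset, "sequence")
        for label, text in segments:
            # The element name is the label; the element text is the token span.
            ET.SubElement(sequence, label).text = text
    return dataset


# Example input; the labels used here are assumptions.
refs = [
    [("author", "Doe, J."), ("date", "(2020)."), ("title", "An example title.")],
]

xml_text = ET.tostring(build_dataset(refs), encoding="unicode")
print(xml_text)
```

ElementTree escapes characters such as `&` and `<` in the text on serialization, which matters for titles taken from real metadata.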
