Sep 24, 2018

Changes for generating the arXMLiv 08.2018 token models:

  • Update dependencies
  • Improve corpus_token_model generation to include math lexemes
  • Improve paragraph iterator to skip over paragraphs containing ltx_ERROR markup
  • Improve sentence tokenization to treat any word containing a capital letter as a potential sentence breaker
  • Word lexemes now properly attach 's possessives
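
As a rough illustration of two of the changes above (skipping paragraphs with ltx_ERROR markup, and keeping 's attached to its word), here is a minimal Python sketch. It is not the project's code. The function names has_error_markup, clean_paragraphs and word_lexemes are hypothetical, and it assumes that paragraphs are <p> elements, that ltx_ERROR appears as a token in a class attribute, and that possessives use a plain ASCII apostrophe.

```python
import re
import xml.etree.ElementTree as ET


def has_error_markup(paragraph):
    """Return True if the element or any descendant has the ltx_ERROR class.

    Assumption: ltx_ERROR shows up as one token of the `class` attribute.
    """
    return any(
        "ltx_ERROR" in (el.get("class") or "").split()
        for el in paragraph.iter()
    )


def clean_paragraphs(root):
    """Yield paragraphs (assumed to be <p> elements) free of ltx_ERROR markup."""
    for p in root.iter("p"):
        if not has_error_markup(p):
            yield p


# A word with an optional trailing 's, or any single non-space,
# non-letter character (punctuation, digits) as its own lexeme.
WORD_RE = re.compile(r"[A-Za-z]+(?:'s)?|[^\sA-Za-z]")


def word_lexemes(text):
    """Split text into lexemes, keeping 's attached to the preceding word."""
    return WORD_RE.findall(text)


if __name__ == "__main__":
    doc = ET.fromstring(
        "<div>"
        "<p>Euler's formula holds.</p>"
        r'<p>Broken <span class="ltx_ERROR undefined">\foo</span> text.</p>'
        "</div>"
    )
    for p in clean_paragraphs(doc):
        print(word_lexemes("".join(p.itertext())))
```

Only the first paragraph survives the filter, and "Euler's" stays a single lexeme instead of being split into "Euler", "'" and "s".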
