@dginev dginev released this Sep 24, 2018 · 14 commits to master since this release

Changes for generating the arXMLiv 08.2018 token models:

  • Update dependencies
  • Improve corpus_token_model generation to include math lexemes
  • Improve paragraph iterator to skip over paragraphs containing ltx_ERROR markup
  • Improve sentence tokenization to treat words containing any capital letter as potential sentence breakers
  • Word lexemes now properly keep 's possessives attached
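
The two tokenization changes above can be illustrated with a minimal Python sketch. This is not the code shipped in this release. The function names (`has_capital`, `candidate_breaks`, `word_lexemes`) and the regular expressions are assumptions made for illustration only, and the real tokenizer may apply further rules.

```python
import re


def has_capital(word: str) -> bool:
    """True if any character of the word is uppercase (e.g. "The", "arXMLiv")."""
    return any(ch.isupper() for ch in word)


def candidate_breaks(text: str) -> list[int]:
    """Offsets of words that follow '.', '!' or '?' and contain a capital letter.

    Hypothetical heuristic: such a word is treated as a *potential* start of a
    new sentence, even if the capital is not its first character.
    """
    breaks = []
    # The lookahead captures the next word without consuming it, so that
    # consecutive short tokens such as "A. B. C." are all inspected.
    for m in re.finditer(r"[.!?]\s+(?=(\S+))", text):
        if has_capital(m.group(1)):
            breaks.append(m.start(1))
    return breaks


def word_lexemes(sentence: str) -> list[str]:
    """Split into word lexemes, keeping a trailing 's attached to its word."""
    return re.findall(r"[A-Za-z0-9]+(?:'s)?", sentence)
```

For example, `word_lexemes("Newton's method converges")` yields `["Newton's", "method", "converges"]` rather than splitting off the possessive, and `candidate_breaks` flags a word such as `arXMLiv` after a period even though it starts with a lowercase letter.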