PARC/MicroTextTokenizer

MicroText Tokenizer

The MicroText Tokenizer is a Java tokenization library that NLP systems can use to process text from Twitter, SMS, Slack, and other messaging platforms. Micro-text refers to messages posted on microblogging platforms such as Twitter or sent through SMS and similar messaging services. It differs from the text of more traditional documents in several ways, and standard NLP tools perform poorly on text like the following:

Cake?@Username K gotta get the hair,eyebrows n nails
done today b4 9 then I WISH I had sum one I could go
CAKE wit :( sigh

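This example demonstrates several of the challenges in tokenizing micro-text, among them missing whitespace around punctuation and mentions ("Cake?@Username", "hair,eyebrows"), nonstandard spellings ("b4", "sum", "wit", "n"), and emoticons (":(").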
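
To make the problem concrete, the sketch below compares plain whitespace splitting with a small regex-based tokenizer on the sample text. It is an illustration only: the class `MicroTextSketch`, its methods, and the `TOKEN` pattern are hypothetical and are not the MicroText Tokenizer's API or algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MicroTextSketch {
    // Hypothetical pattern written for this illustration; it is not taken
    // from the MicroText Tokenizer. Alternatives are tried left to right.
    private static final Pattern TOKEN = Pattern.compile(
        "@\\w+"                   // user mention such as @Username
        + "|[:;=][-']?[()DPp]"    // a few common emoticons such as :( or ;-)
        + "|\\w+"                 // run of letters, digits, or underscores (b4, sum)
        + "|\\S");                // any other single non-space character

    // Baseline: split on runs of whitespace only.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Sketch: collect every successive match of TOKEN; whitespace is skipped.
    static List<String> sketchTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Cake?@Username K gotta get the hair,eyebrows n nails";
        // Whitespace splitting leaves "Cake?@Username" and "hair,eyebrows" fused.
        System.out.println(whitespaceTokens(text));
        // The sketch separates the word, the punctuation, and the mention.
        System.out.println(sketchTokens(text));
    }
}
```

Even this sketch only addresses token boundaries. It does nothing about nonstandard spellings such as "b4" or "wit", and its emoticon list is deliberately tiny.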
