pytextpreprocess
================

written by Joseph Turian
released under a BSD license

Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)
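
To make the steps named above concrete, here is a minimal standard-library
sketch of sentence splitting, tokenizing, and lowercasing. It is NOT the code
in textpreprocess.py and does not use NLTK or Splitta; the function names
(split_sentences, tokenize, preprocess) and the regular expressions are
assumptions chosen for illustration only. Stemming is left out because it
needs a real stemmer.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # A token is a run of letters/digits (optionally with an apostrophe suffix,
    # as in "it's"), or a single punctuation character.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\w\s]", sentence)

def preprocess(text):
    # One list of lowercased tokens per sentence.
    return [[tok.lower() for tok in tokenize(s)] for s in split_sentences(text)]

if __name__ == "__main__":
    print(preprocess("Hello World! It's fine."))
```

A regex splitter like this will break on abbreviations such as "Dr. Smith",
which is one reason to prefer a dedicated sentence tokenizer for real text.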

REQUIREMENTS:
    * My Python common library, and its sub-requirements:
        http://github.com/turian/common

    * NLTK, for word tokenization, e.g.:
        apt-get install python-nltk

    * Splitta, if you want sentence tokenization

The English stoplist is from:
    http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
However, I added words of my own at the top of the file (above "a").
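
As an illustration of how a stoplist like this is typically used, the sketch
below loads a stop-word file and filters a token list against it. It assumes
the file holds one word per line, and the helper names (load_stoplist,
remove_stopwords) are hypothetical; they are not taken from
textpreprocess.py.

```python
def load_stoplist(path):
    # Assumption: one stop word per line; blank lines are skipped.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stoplist):
    # Compare case-insensitively, but keep the original token text.
    return [t for t in tokens if t.lower() not in stoplist]

if __name__ == "__main__":
    stoplist = load_stoplist("english.stop")
    print(remove_stopwords(["The", "cat", "sat", "on", "a", "mat"], stoplist))
```

Holding the stoplist in a set keeps each membership check constant-time, which
matters when filtering large corpora.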