GitHub - Munduruca/pytextpreprocess: Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)

forked from turian/pytextpreprocess

pytextpreprocess
================

written by Joseph Turian
released under a BSD license

Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)

REQUIREMENTS:
    * My Python common library:
        http://github.com/turian/common
    and sub-requirements thereof.
    * NLTK, for word tokenization
        e.g.
            apt-get install python-nltk

    * Splitta, if you want sentence tokenization
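
To illustrate the word tokenization and lowercasing steps, here is a minimal sketch. It is not the code in textpreprocess.py: the function name tokenize_and_lowercase and the regular expression are hypothetical, and the regex is only a rough stand-in for the NLTK tokenizer listed above, used so the example runs with the standard library alone. Stemming and sentence splitting are left out.

```python
import re

# Hypothetical stand-in for NLTK's word tokenizer (an assumption, not the
# library's behaviour): a token is either a run of letters/digits with an
# optional apostrophe suffix, or a single punctuation character.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize_and_lowercase(text):
    """Split text into tokens, then lowercase each token."""
    return [tok.lower() for tok in _TOKEN_RE.findall(text)]


print(tokenize_and_lowercase("The cat's toys aren't HERE."))
# ['the', "cat's", 'toys', "aren't", 'here', '.']
```

A real tokenizer handles many cases this regex does not (abbreviations, hyphenated words, non-ASCII text), which is one reason to rely on NLTK for this step.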

The English stoplist is from:
    http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
I added some words of my own at the top of the file (above "a").
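
A stoplist like this can be applied to tokenized text roughly as follows. This is a sketch under assumptions, not the code in textpreprocess.py: it assumes the stoplist file holds one word per line, and the names load_stoplist and remove_stopwords are hypothetical. A small temporary file stands in for english.stop so the example is self-contained.

```python
import os
import tempfile


def load_stoplist(path):
    """Read a stoplist file, assumed to hold one word per line."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines; lowercase so lookups are case-insensitive.
        return {line.strip().lower() for line in f if line.strip()}


def remove_stopwords(tokens, stoplist):
    """Return the tokens that do not appear in the stoplist."""
    return [tok for tok in tokens if tok.lower() not in stoplist]


# Demo: a tiny temporary file standing in for english.stop.
with tempfile.NamedTemporaryFile(
    "w", suffix=".stop", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a\nthe\nof\n")
demo_path = tmp.name

stoplist = load_stoplist(demo_path)
os.remove(demo_path)

print(remove_stopwords(["the", "history", "of", "a", "language"], stoplist))
# ['history', 'language']
```

Holding the stoplist in a set keeps each membership check constant time, which matters when filtering large corpora.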