pytextpreprocess
================

written by Joseph Turian
released under a BSD license

Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)

REQUIREMENTS:
    * My Python common library (and its sub-requirements):
        http://github.com/turian/common
    * NLTK, for word tokenization, e.g.:
        apt-get install python-nltk
    * Splitta, if you want to split text into sentences
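The word-level steps named above (tokenizing, lowercasing, stemming) can be sketched in a few lines of dependency-free Python. This is only an illustration of the steps, not this library's API: the `preprocess` and `simple_stem` names are made up here, and the naive suffix-stripping stand-in is far cruder than the NLTK stemmers the library actually relies on.

```python
import re

def simple_stem(token):
    # Naive suffix stripping, purely for illustration; the library
    # itself uses NLTK for real tokenization and stemming.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenize on word characters, lowercase, then stem each token.
    tokens = re.findall(r"\w+", text.lower())
    return [simple_stem(t) for t in tokens]

print(preprocess("The cats were running quickly"))
# -> ['the', 'cat', 'were', 'runn', 'quickly']
```

Note that "runn" shows why real stemmers (Porter, Snowball) need more than suffix stripping: they also repair the stem after removing a suffix.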

The English stoplist is from:
    http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
However, I added words at the top (above "a").
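Filtering against a stoplist like the one above is a simple set-membership test. A minimal sketch follows; the `remove_stopwords` helper and the tiny inline stoplist are assumptions for illustration (in practice you would load the full SMART list from the file linked above, one word per line).

```python
def remove_stopwords(tokens, stoplist):
    # Keep only tokens whose lowercased form is not in the stoplist.
    return [t for t in tokens if t.lower() not in stoplist]

# Tiny stand-in for the SMART English stoplist; the real list would be
# loaded from a file, e.g. {line.strip() for line in open("english.stop")}.
stoplist = {"a", "an", "the", "of"}

print(remove_stopwords(["The", "quick", "fox"], stoplist))
# -> ['quick', 'fox']
```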
