nltk utility which more accurately lemmatizes text using a pre-trained part-of-speech tagger.

KT12/tag-lemmatize

Summary

tag-lemmatize is a small bolt-on utility function for use with the nltk package. The function accepts untokenized strings. It was originally written to ease the use of the VADER sentiment analysis tool.

The function uses nltk.tokenize.word_tokenize to tokenize the string. It then tags each token's part of speech (POS) in context using nltk.pos_tag, which assigns a Penn Treebank POS tag. The function then converts each Penn Treebank tag into the corresponding WordNet POS tag. Finally, it lemmatizes each word using nltk.stem.WordNetLemmatizer.
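The four steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the names penn_to_wordnet and tag_and_lemmatize_sketch are hypothetical, and the choice to treat every tag that is not an adjective, verb, or adverb as a noun is an assumption (it matches the lemmatizer's default POS).

```python
# WordNet POS constants; these equal nltk.corpus.wordnet.ADJ, VERB, NOUN, ADV.
WN_ADJ, WN_VERB, WN_NOUN, WN_ADV = "a", "v", "n", "r"


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS tag (hypothetical helper)."""
    if penn_tag.startswith("J"):   # JJ, JJR, JJS
        return WN_ADJ
    if penn_tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return WN_VERB
    if penn_tag.startswith("R"):   # RB, RBR, RBS (also catches RP, particles)
        return WN_ADV
    # Assumption: everything else falls back to noun, the lemmatizer's default.
    return WN_NOUN


def tag_and_lemmatize_sketch(text):
    """Tokenize, POS-tag, and lemmatize an untokenized string."""
    # Imported here so penn_to_wordnet stays usable without nltk installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    tagged = pos_tag(word_tokenize(text))  # list of (word, Penn tag) pairs
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged
    ]
```

Calling tag_and_lemmatize_sketch on a sentence requires the nltk tokenizer, tagger, and WordNet data to be installed via nltk.download; the exact resource names depend on the nltk version.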

Installation

Clone the repository and add it to your Python path.

Import the module in the Python interpreter.

tag_and_lem is the primary function.

Motivation

The nltk pre-trained part-of-speech tagger uses Penn Treebank tags, which must be converted to WordNet tags before they can be passed to nltk's lemmatizer. This small utility should make it easier to test Natural Language Processing techniques without training a tagger that uses WordNet tags.

Requirements

Python 2.6+

nltk

Contributors

@KT12

If this small function was useful, please star/follow me!
