
In [0]:
import nltk

## tokenizer

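`word_tokenize` requires the Punkt sentence tokenizer models, downloaded below (newer NLTK releases may ask for the `punkt_tab` resource instead).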

In [0]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
word_tokenize("Dr. Lee, what's your problem? $1.9b")

['Dr.', 'Lee', ',', 'what', "'s", 'your', 'problem', '?', '$', '1.9b']

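`wordpunct_tokenize` is a simpler, purely regular-expression-based tokenizer: it splits text into runs of alphanumeric characters and runs of punctuation, so abbreviations, contractions, and decimals are broken apart.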

In [0]:
from nltk.tokenize import wordpunct_tokenize

In [0]:
wordpunct_tokenize("Dr. Lee, what's your? $1.9")

['Dr', '.', 'Lee', ',', 'what', "'", 's', 'your', '?', '$', '1', '.', '9']
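The output above can be reproduced with a plain regular expression, which makes the splitting rule explicit. This is a minimal sketch: the pattern is assumed to match the one `wordpunct_tokenize` uses internally, and `regex_wordpunct` is a hypothetical helper name, not part of NLTK.

```python
import re

# Assumption: this pattern mirrors wordpunct_tokenize.
# \w+       -> a run of letters, digits, or underscores
# [^\w\s]+  -> a run of characters that are neither word characters nor whitespace
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Split text into alphanumeric runs and punctuation runs (hypothetical helper)."""
    return WORDPUNCT.findall(text)

print(regex_wordpunct("Dr. Lee, what's your? $1.9"))
```

Because the rule knows nothing about abbreviations or numbers, `Dr.` and `1.9` are split at the period, which is the main difference from `word_tokenize`.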

## pos-tag

In [0]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
pos_tag(['hello', 'world'])

[('hello', 'NN'), ('world', 'NN')]

## stemmer vs lemmatizer

> The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
>
> However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
>
> Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.

https://stackoverflow.com/questions/1787110/what-is-the-true-difference-between-lemmatization-vs-stemming

### stemmer
[What is the best stemming method in Python?](https://stackoverflow.com/questions/24647400/what-is-the-best-stemming-method-in-python)
> Stemmers vary in their aggressiveness. Porter is one of the most aggressive stemmers for English. I find it usually hurts more than it helps. On the lighter side you can either use a lemmatizer instead as already suggested, or a lighter algorithmic stemmer. The limitation of lemmatizers is that they cannot handle unknown words.

In [0]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [0]:
words = ["program", "programs", "programer", "programing", "programers"]
for w in words:
    print(w, ps.stem(w))

# a stemmer only strips suffixes by rule, so the result may not be a real word
ps.stem('languages')

program program
programs program
programer program
programing program
programers program


'languag'
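The `languag` result shows what "chopping off the ends of words" means in practice. The sketch below is a toy suffix stripper, not the Porter algorithm: the suffix list, the `min_stem` threshold, and the name `toy_stem` are all assumptions made for illustration. It happens to agree with the output above on these six words, but it will diverge from Porter on most other input.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of the idea.
# Hypothetical rule list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ing", "ers", "er", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["program", "programs", "programer", "programing", "programers", "languages"]:
    print(w, toy_stem(w))
```

No dictionary is consulted at any point, which is why a stemmer can handle unknown words but can also return non-words.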

### lemmatizer
Based on WordNet, so it needs the `wordnet` corpus. The lemma returned depends on the part of speech you pass in.

In [0]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemma = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
# the default pos is 'n' (noun), so pass pos='v' to lemmatize a verb
lemma.lemmatize('is'), lemma.lemmatize('is', pos='v')

('is', 'be')

In [0]:
lemma.lemmatize('better'), lemma.lemmatize('better', pos='a')

('better', 'good')

## workflow

In [0]:
# steps: Document -> Sentences -> Tokens -> POS -> Lemmas
# pos_tag returns Penn Treebank tags, but the WordNet lemmatizer expects WordNet POS tags,
# so translate between the two
# Treebank tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
# https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS, falling back to noun."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
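The same mapping can be written as a lookup table keyed on the first letter of the tag. This is a sketch under one assumption: that the WordNet POS constants (`wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, `wordnet.ADV`) are the single letters `'a'`, `'v'`, `'n'`, `'r'`, written out here as plain strings so the block runs without the corpus. `treebank_to_wordnet` is a hypothetical name.

```python
# Assumption: WordNet's POS constants are the single letters below
# (wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV in NLTK).
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(treebank_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character."""
    # Slicing with [:1] also copes with an empty tag string.
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)

for tag in ["JJ", "VBZ", "NNS", "RB", "PRP$", "."]:
    print(tag, treebank_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as `PRP$` or punctuation, fall back to noun, which matches the lemmatizer's own default.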

In [0]:
# lemmatize every token of a sentence, using its POS tag
seq = "What is your problems?"
tokens = word_tokenize(seq)
print('tokens:', tokens)
words_with_tags = pos_tag(tokens)
print('tags:', words_with_tags)
for word, tag in words_with_tags:
    print(lemma.lemmatize(word, pos=get_wordnet_pos(tag)))

tokens: ['What', 'is', 'your', 'problems', '?']
tags: [('What', 'WP'), ('is', 'VBZ'), ('your', 'PRP$'), ('problems', 'NNS'), ('?', '.')]
What
be
your
problem
?
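The Tokens -> POS -> Lemmas part of the flow can be folded into one reusable function. The sketch below takes each step as a callable, so it runs here with small stand-ins instead of NLTK. `lemmatize_tokens` and every `fake_*` name are hypothetical, and the stand-in tables are hard-coded to this one sentence, not real tagger or WordNet data.

```python
def lemmatize_tokens(tokens, tag, lemmatize, to_wordnet_pos):
    """Tokens -> POS tags -> lemmas, with every step passed in as a callable."""
    return [lemmatize(word, to_wordnet_pos(treebank_tag))
            for word, treebank_tag in tag(tokens)]

# Stand-ins so the sketch runs without NLTK data (illustrative values only).
fake_tags = {"is": "VBZ", "problems": "NNS"}
fake_lemmas = {("is", "v"): "be", ("problems", "n"): "problem"}

def fake_tag(tokens):
    # Unknown tokens are tagged as nouns in this stand-in.
    return [(t, fake_tags.get(t, "NN")) for t in tokens]

def fake_lemmatize(word, pos):
    # Words missing from the table come back unchanged.
    return fake_lemmas.get((word, pos), word)

def fake_pos(treebank_tag):
    return "v" if treebank_tag.startswith("V") else "n"

print(lemmatize_tokens(["What", "is", "your", "problems", "?"],
                       fake_tag, fake_lemmatize, fake_pos))
```

In the notebook, the equivalent call would presumably be `lemmatize_tokens(word_tokenize(seq), pos_tag, lemma.lemmatize, get_wordnet_pos)`.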
