# Categorizing and Tagging Words with NLTK

## Tagger

We tag our tokens (words) with their part of speech (POS) so that we can analyze the text grammatically, that is, make sense of the role each word plays in the context in which it appears.

In [1]:
import nltk

In [2]:
text = 'We are learning Natural Languages Processing'

In [3]:
tokens = nltk.word_tokenize(text)

After tokenizing the text, we use the <b>POS tagger</b> to attach a tag to each word:

In [4]:
print(nltk.pos_tag(tokens))

[('We', 'PRP'), ('are', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Languages', 'NNP'), ('Processing', 'NNP')]
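
The result is an ordinary list of (word, tag) tuples, so it can be filtered like any other list. The sketch below copies the output printed above as a literal (so it runs without the tagger model); the helper name words_with_tag is made up for illustration.

```python
# Tagged output copied from the cell above, so no tagger model is needed here.
tagged = [('We', 'PRP'), ('are', 'VBP'), ('learning', 'VBG'),
          ('Natural', 'NNP'), ('Languages', 'NNP'), ('Processing', 'NNP')]


def words_with_tag(tagged_tokens, prefix):
    """Return the words whose tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


proper_nouns = words_with_tag(tagged, 'NNP')  # proper nouns only
verbs = words_with_tag(tagged, 'VB')          # every verb form (VBP, VBG, ...)
print(proper_nouns)
print(verbs)
```

Matching on a tag prefix is a convenient way to group related Penn Treebank tags such as VBP and VBG.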


If we don't know what a tag means, we can use NLTK's built-in help:

In [5]:
nltk.help.upenn_tagset('PRP')

PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
None


---

Now we can search through a corpus (we will use the Brown corpus) to find words that occur in the same contexts, using the <b>.similar()</b> method.

From the documentation:

+  .similar(): find other words which appear in the same contexts as the specified word; list most similar words first.

Let's apply it to the word "woman". First, which type of word is "woman"?

In [10]:
nltk.pos_tag(['woman'])

[('woman', 'NN')]

In [11]:
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


So we use .similar():

In [9]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

man time day year car moment world house family child country boy
state job place way war girl work word


In [8]:
text.similar('bought')

made said done put had seen found given left heard was been brought
set got that took in told felt
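
To get a feel for what "same context" means, here is a simplified pure-Python sketch. It assumes a context is the pair (previous word, next word) and ranks words by how many contexts they share with the target. It is an illustration on a toy text, not NLTK's actual implementation, and the function name similar_words is hypothetical.

```python
from collections import Counter, defaultdict


def similar_words(words, target, top_n=5):
    """Rank words sharing (previous word, next word) contexts with `target`."""
    contexts = defaultdict(set)
    padded = ['<s>'] + [w.lower() for w in words] + ['</s>']
    # Record every (left neighbour, right neighbour) pair seen around each word.
    for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
        contexts[word].add((prev, nxt))

    target_contexts = set(contexts.get(target, ()))
    scores = Counter()
    for word, ctxs in contexts.items():
        shared = len(ctxs & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]


toy_text = ('the man saw the dog . the woman saw the cat . '
            'the boy fed the dog . the woman fed the bird .').split()
print(similar_words(toy_text, 'woman'))
```

In the toy text, "man" and "boy" are returned because they also appear between "the" and "saw" or "fed", which mirrors why "man" tops the list for "woman" in the Brown corpus.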


### Create a Tagged Corpus

If we have a text in the following format:

In [14]:
sentence = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''

We can convert it into a list of (word, tag) tuples:

In [19]:
tagged_Sentence = [nltk.tag.str2tuple(token) for token in sentence.split()]
print(tagged_Sentence)

[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ('of', 'IN'), ('other', 'AP'), ('topics', 'NNS'), (',', ','), ('AMONG', 'IN'), ('them', 'PPO'), ('the', 'AT'), ('Atlanta', 'NP'), ('and', 'CC'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('purchasing', 'VBG'), ('departments', 'NNS'), ('which', 'WDT'), ('it', 'PPS'), ('said', 'VBD'), ('``', '``'), ('ARE', 'BER'), ('well', 'QL'), ('operated', 'VBN'), ('and', 'CC'), ('follow', 'VB'), ('generally', 'RB'), ('accepted', 'VBN'), ('practices', 'NNS'), ('which', 'WDT'), ('inure', 'VB'), ('to', 'IN'), ('the', 'AT'), ('best', 'JJT'), ('interest', 'NN'), ('of', 'IN'), ('both', 'ABX'), ('governments', 'NNS'), ("''", "''"), ('.', '.')]
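
str2tuple splits each token at the last '/' and upper-cases the tag (note how 'Fulton/NP-tl' became ('Fulton', 'NP-TL') above); a token without a separator gets None as its tag. A minimal pure-Python sketch of that behavior, using a hypothetical function name rather than NLTK's own code:

```python
def split_tagged_token(token, sep='/'):
    """Split 'word/tag' at the last separator and upper-case the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator in the token: keep the word and leave it untagged.
        return (token, None)
    return (word, tag.upper())


examples = ['Fulton/NP-tl', './.', 'and/or/CC', 'untagged']
pairs = [split_tagged_token(tok) for tok in examples]
print(pairs)
```

Splitting at the last separator matters for tokens that contain '/' themselves, such as 'and/or/CC'.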


NLTK also ships corpora that are already tagged, such as Brown. We can read their (word, tag) pairs with .tagged_words():

In [13]:
nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

### The Default Tagger

The default tagger assigns the same tag to <b>each word</b> (all of them): the tag we pass in when we create it.

We want to tag the following text:

In [20]:
raw_text = 'We are learning taggers right now and aim to build applications which uses language to perform analysis'

First, we have to tokenize our text:

In [21]:
raw_text_tokens = nltk.word_tokenize(raw_text)

We can pass any string as the tag, for example:

In [22]:
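default_tagger = nltk.DefaultTagger('OurTag')
print(default_tagger.tag(raw_text_tokens))

[('We', 'OurTag'), ('are', 'OurTag'), ('learning', 'OurTag'), ('taggers', 'OurTag'), ('right', 'OurTag'), ('now', 'OurTag'), ('and', 'OurTag'), ('aim', 'OurTag'), ('to', 'OurTag'), ('build', 'OurTag'), ('applications', 'OurTag'), ('which', 'OurTag'), ('uses', 'OurTag'), ('language', 'OurTag'), ('to', 'OurTag'), ('perform', 'OurTag'), ('analysis', 'OurTag')]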


Instead, let's use the most common tag in the <b>news category</b> of the <b>Brown corpus</b>:

In [24]:
tags = [tag for (word, tag) in nltk.corpus.brown.tagged_words(categories='news')] # create a list of tags from the corpus
print(nltk.FreqDist(tags).max()) # print the most common tag in the corpus

NN


In [25]:
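default_tagger = nltk.DefaultTagger(nltk.FreqDist(tags).max()) # creating our tagger
print(default_tagger.tag(raw_text_tokens))

[('We', 'NN'), ('are', 'NN'), ('learning', 'NN'), ('taggers', 'NN'), ('right', 'NN'), ('now', 'NN'), ('and', 'NN'), ('aim', 'NN'), ('to', 'NN'), ('build', 'NN'), ('applications', 'NN'), ('which', 'NN'), ('uses', 'NN'), ('language', 'NN'), ('to', 'NN'), ('perform', 'NN'), ('analysis', 'NN')]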


By doing this, we assign the wrong tag to most of the words.

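#### Evaluating our tagger's performance

We use .evaluate() (recent NLTK releases deprecate it in favor of the equivalent .accuracy()):

From the documentation:
* default_tagger.evaluate(gold)
    * :type gold: list(list(tuple(str, str)))
    * :param gold: The list of tagged sentences to score the tagger on.

Explanation: .evaluate() re-tags the words of a gold-standard set of tagged sentences and returns the fraction of tokens whose predicted tag matches the gold tag.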

In [26]:
print(default_tagger.evaluate(nltk.corpus.brown.tagged_sents(categories='news')))

0.13089484257215028


As we can see, performance is very poor: only about 13% of the tokens get the right tag.
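
The score is token-level accuracy: the share of tokens whose predicted tag equals the gold tag. For a default tagger, that share is simply the relative frequency of the chosen tag in the gold data. The sketch below shows this on a few made-up gold sentences; the data and the function names are assumptions for illustration only.

```python
from collections import Counter

# Hand-made gold data (not taken from the Brown corpus).
gold = [
    [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
    [('the', 'AT'), ('court', 'NN'), ('met', 'VBD'), ('today', 'NR')],
    [('a', 'AT'), ('report', 'NN'), ('on', 'IN'), ('crime', 'NN')],
]


def most_common_tag(tagged_sents):
    """Return the tag that occurs most often in the tagged sentences."""
    counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    return counts.most_common(1)[0][0]


def default_accuracy(tagged_sents, tag):
    """Accuracy of a tagger that assigns `tag` to every token."""
    total = sum(len(sent) for sent in tagged_sents)
    correct = sum(1 for sent in tagged_sents for _, gold_tag in sent
                  if gold_tag == tag)
    return correct / total


best_tag = most_common_tag(gold)
print(best_tag, default_accuracy(gold, best_tag))
```

Here 4 of the 11 tokens are tagged NN, so tagging everything as NN is right 4 times out of 11.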

### RegExp Tagger

The RegexpTagger assigns tags to tokens by comparing their
word strings to a series of regular expressions.

From the documentation:

- nltk.RegexpTagger(regexps, backoff=None)
    - :type regexps: list(tuple(str, str))
    - :param regexps: A list of ``(regexp, tag)`` pairs, each of which indicates that a word matching ``regexp`` should be tagged with ``tag``.  The pairs will be evaluated in order.
    - :param backoff: If none of the regexps match a word, then the optional backoff tagger is invoked, else it is assigned the tag None.


We can create a list of RegEx patterns:

In [32]:
patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*es$', 'VBZ'), (r'.*ould$', 'MD'), (r'.*\'s$', 'NN$'),
             (r'.*s$', 'NNS'),  (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), (r'.*', 'NN')]  # \. matches a literal decimal point

Now, we can create and use our tagger:

In [36]:
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag(raw_text_tokens))


[('We', 'NN'), ('are', 'NN'), ('learning', 'VBG'), ('taggers', 'NNS'), ('right', 'NN'), ('now', 'NN'), ('and', 'NN'), ('aim', 'NN'), ('to', 'NN'), ('build', 'NN'), ('applications', 'NNS'), ('which', 'NN'), ('uses', 'VBZ'), ('language', 'NN'), ('to', 'NN'), ('perform', 'NN'), ('analysis', 'NNS')]
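
Because the pairs are evaluated in order, the first matching pattern wins, which explains some of the tags above: 'uses' is caught by the 'es' rule before the plural rule, and 'analysis' is tagged NNS only because it ends in 's'. The pure-Python sketch below mimics that first-match logic with the re module. The helper first_match_tag is hypothetical, and the pattern list is the one from this section with the decimal point escaped.

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd person singular present
    (r'.*ould$', 'MD'),                # modals
    (r".*'s$", 'NN$'),                 # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # catch-all: everything else is a noun
]


def first_match_tag(word, rules):
    """Return the tag of the first rule whose regexp matches `word`."""
    for regexp, tag in rules:
        if re.match(regexp, word):
            return tag
    return None


for word in ['uses', 'analysis', 'red', '3.14', 'language']:
    print(word, first_match_tag(word, patterns))
```

The adjective 'red' ends up as VBD, a reminder that suffix rules are only rough heuristics and that the order of the list is part of the tagger's logic.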


### Unigram Tagger

The UnigramTagger finds the most likely tag for each word in a training corpus, and then uses that information to assign tags to new tokens.

In order to use it, we will have to give it a set of tagged sentences that it will use for training.

In [38]:
brown_tagged_sents = nltk.corpus.brown.tagged_sents(categories = 'news')

And now we create and train our tagger:

In [39]:
unigram_tagger = nltk.UnigramTagger(train=brown_tagged_sents)

In [40]:
print(unigram_tagger.tag(nltk.word_tokenize('We are using the programming language Python')))

[('We', 'PPSS'), ('are', 'BER'), ('using', 'VBG'), ('the', 'AT'), ('programming', None), ('language', 'NN'), ('Python', None)]
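
Conceptually, a unigram tagger is a lookup table from each word to the tag it received most often in training, which is why words missing from the training data ('programming' and 'Python' above) come back as None. A minimal sketch of that idea on made-up training data; the function names are hypothetical and this is not NLTK's implementation:

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, tokens):
    """Look every token up in the table; unknown words get None."""
    return [(token, model.get(token)) for token in tokens]


# Hand-made training sentences: 'run' is a verb twice and a noun once.
train = [
    [('the', 'AT'), ('run', 'NN'), ('ended', 'VBD')],
    [('they', 'PPSS'), ('run', 'VB'), ('home', 'NR')],
    [('we', 'PPSS'), ('run', 'VB'), ('daily', 'RB')],
]
model = train_unigram(train)
print(tag_unigram(model, ['they', 'run', 'Python']))
```

The table always picks VB for 'run', regardless of the surrounding words, because a unigram model ignores context.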


Evaluating our Unigram Tagger:

In [41]:
print(unigram_tagger.evaluate(brown_tagged_sents))

0.9349006503968017


That's not bad! But <u>we evaluated it on the same data we trained it on</u>, so the score is overly optimistic.

We should split our set into **Train** and **Test** sets:

In [42]:
size = int(len(brown_tagged_sents) * 0.9)
train_sentences = brown_tagged_sents[:size]
test_sentences = brown_tagged_sents[size:]

We have to **retrain** our tagger, using only our **train set**:

In [43]:
unigram_tagger = nltk.UnigramTagger(train_sentences)

And we **test it again**, this time on our **test set**:

In [44]:
print(unigram_tagger.evaluate(test_sentences))

0.8121200039868434


That is more realistic now.
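
One plausible contributor to the drop is that a unigram tagger returns None for every word it never saw during training, and a held-out test set contains such words. The sketch below measures that share on toy data; unseen_rate is a hypothetical helper and the sentences are made up.

```python
def unseen_rate(train_sents, test_sents):
    """Fraction of test tokens whose word never occurs in the training data."""
    vocabulary = {word for sent in train_sents for word, _ in sent}
    test_words = [word for sent in test_sents for word, _ in sent]
    unseen = [word for word in test_words if word not in vocabulary]
    return len(unseen) / len(test_words)


toy_train = [
    [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
    [('the', 'AT'), ('court', 'NN'), ('met', 'VBD')],
]
toy_test = [
    [('the', 'AT'), ('jury', 'NN'), ('met', 'VBD'), ('yesterday', 'NR')],
]
rate = unseen_rate(toy_train, toy_test)
print(rate)
```

With the variables from this section, the same function could be called as unseen_rate(train_sentences, test_sentences) to see how many test tokens the tagger cannot know.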

### N-gram Tagger

**n-gram**: a sequence of n consecutive words. N-gram models use such sequences to estimate how likely a word (or tag) is given what precedes it, so they take into account **the context** in which the word appears.

From the documentation:
- The N-gram tagger is a tagger that chooses a token's tag based on its word string and on the preceding n word's tags.  In particular, a tuple (tags[i-n:i-1], words[i]) is looked up in a table, and the corresponding tag is returned.  N-gram taggers are typically trained on a tagged corpus.

Building n-grams simply involves splitting the sentence into overlapping chunks of "n" consecutive words.

- For example: "I don't know what to say"

    - 1-gram (unigram): I, don't, know, what, to, say
    - 2-gram (bigram): I don't, don't know, know what, what to, to say
    - 3-gram (trigram): I don't know, don't know what, know what to, what to say
    - ...
    - n-gram
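
The splitting shown in the list above can be sketched with a small sliding window. The function name ngrams is chosen here for illustration (NLTK also provides its own nltk.ngrams utility, which this sketch does not use).

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I don't know what to say".split()
for n in (1, 2, 3):
    print(n, [' '.join(gram) for gram in ngrams(tokens, n)])
```

A sentence of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.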

In [45]:
ngram_tagger = nltk.NgramTagger(len(brown_tagged_sents),train=brown_tagged_sents)
print(ngram_tagger.tag(nltk.word_tokenize('We are using the programming language Python')))

[('We', 'PPSS'), ('are', 'BER'), ('using', None), ('the', None), ('programming', None), ('language', None), ('Python', None)]


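Evaluating our N-gram tagger. Two caveats apply to the score below. First, n was set to len(brown_tagged_sents), so the context is effectively every preceding tag in the sentence, which is why the unseen sentence above falls to None after its first two words. Second, the tagger was trained on all of brown_tagged_sents, which includes test_sentences, so the result is optimistic: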

In [46]:
print(ngram_tagger.evaluate(test_sentences))

0.9441841921658527
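
In practice, n is usually kept small (2 or 3), and n-gram taggers are chained through the backoff parameter so that a context never seen in training falls back to a simpler tagger instead of None. The sketch below combines NLTK's BigramTagger, UnigramTagger and DefaultTagger on a tiny hand-made training set. The training data is an assumption for illustration, chosen so that no corpus download is needed.

```python
import nltk

# Hand-made training data: 'run' is a noun after an article, a verb after a pronoun.
toy_train = [
    [('the', 'AT'), ('run', 'NN'), ('ended', 'VBD')],
    [('a', 'AT'), ('run', 'NN'), ('helps', 'VBZ')],
    [('they', 'PPSS'), ('run', 'VB'), ('home', 'NR')],
]

# Each tagger defers to its backoff when it has no entry for the current context.
default_tagger = nltk.DefaultTagger('NN')
unigram_tagger = nltk.UnigramTagger(toy_train, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(toy_train, backoff=unigram_tagger)

print(bigram_tagger.tag(['they', 'run']))
print(bigram_tagger.tag(['the', 'run']))
print(bigram_tagger.tag(['Python']))
```

The bigram tagger uses the previous tag to pick VB for 'run' after a pronoun, while unknown words such as 'Python' end up with the default tag rather than None. To get a trustworthy score on real data, such a chain would be trained on train_sentences only and then scored on test_sentences.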
