**Categorizing and tagging words**

The process of classifying words into their parts of speech and labelling them accordingly is known as part-of-speech tagging, or POS tagging.

In [1]:
import nltk

text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
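
To make the idea of tagging concrete, here is a minimal lookup-tagger sketch that uses only the standard library. The `LEXICON` table and the `lookup_tag` helper are hypothetical names invented for this illustration; `nltk.pos_tag` itself relies on a trained statistical model, not on a hand-written table like this one.

```python
# Toy lookup tagger: each token gets the tag stored in a small hand-made lexicon.
# LEXICON and lookup_tag are illustrative assumptions, not part of NLTK.
LEXICON = {
    "and": "CC",
    "now": "RB",
    "for": "IN",
    "something": "NN",
    "completely": "RB",
    "different": "JJ",
}

def lookup_tag(tokens, default="NN"):
    """Tag tokens by case-insensitive lookup, falling back to a default tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

tagged = lookup_tag("And now for something completely different".split())
print(tagged)
```

Unknown words fall back to `default`, which is the main weakness of a pure lookup approach and one reason trained taggers are preferred in practice.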

In [39]:
# POS tagging

text = '''Natural language processing (NLP) is a field of computer science, artificial intelligence  
       and computational linguistics concerned with the interactions between computers and human 
       (natural) languages, and, in particular, concerned with programming computers to 
       fruitfully process large natural language corpora. Challenges in natural language 
       processing frequently involve natural language understanding, natural language 
       generation frequently from formal, machinereadable logical forms), connecting language  
       and machine perception, managing human-computer dialog systems, or some combinationthereof.'''

# Tokenize the paragraph defined above (the variable is named `text`)
tk = nltk.word_tokenize(text)
answer = nltk.pos_tag(tk, tagset='universal')

words = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

## expect return (in of to on for at as that with is or but was from when by it if the all)
# words.similar('and')

tag_2 = nltk.FreqDist(tag for (word, tag) in answer )
tag_2.most_common()

[('NOUN', 27),
 ('ADJ', 16),
 ('.', 16),
 ('VERB', 7),
 ('ADP', 7),
 ('CONJ', 5),
 ('DET', 3),
 ('ADV', 3),
 ('PRT', 1)]
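
The tag count above can be reproduced in miniature without NLTK. `nltk.FreqDist` is a subclass of `collections.Counter`, so a plain `Counter` behaves the same way for this purpose. The tagged sample below is made up for illustration and is not output from `nltk.pos_tag`.

```python
from collections import Counter

# A tiny hand-tagged sample (universal tags); the data is invented for illustration.
tagged = [
    ("Natural", "ADJ"), ("language", "NOUN"), ("processing", "NOUN"),
    ("is", "VERB"), ("a", "DET"), ("field", "NOUN"), (".", "."),
]

# Count only the tag of each (word, tag) pair.
tag_counts = Counter(tag for (_, tag) in tagged)
print(tag_counts.most_common())
```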

**Tagged Corpora** 
Create tagged tokens from `word/TAG` strings using the function str2tuple().

In [25]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''
[nltk.tag.str2tuple(t) for t in sent.split()]

#reading from the tagged corpora
nltk.corpus.brown.tagged_words()
nltk.corpus.brown.tagged_words(tagset='universal')

## Expect return of [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
nltk.corpus.treebank.tagged_words()

## Expect returning data of  [('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]
nltk.corpus.treebank.tagged_words(tagset = 'universal')

## Tagged corpora for several other languages are distributed with NLTK as well

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]
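
The `word/TAG` conversion done by `str2tuple()` earlier in this section can be approximated in plain Python. This sketch reflects my understanding of its behaviour (split at the last separator and upper-case the tag); `split_tagged_token` is a hypothetical name, not an NLTK function.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' at the last separator; return (token, None) if there is none."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    # Upper-casing normalizes tags such as 'NP-tl' to 'NP-TL'.
    return (word, tag.upper())

sample = "The/AT grand/JJ jury/NN ,/, Fulton/NP-tl"
pairs = [split_tagged_token(t) for t in sample.split()]
print(pairs)
```

Splitting at the last separator rather than the first matters for tokens whose word part itself contains a slash.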

**Universal part-of-speech tagset**

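Tagged corpora use many different conventions for tagging words. The universal tagset maps these corpus-specific tags onto a small shared set of coarse categories such as NOUN, VERB and ADJ, which makes counts comparable across corpora.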
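
As a rough sketch of what a tagset mapping does, the snippet below collapses a few Penn Treebank tags into universal tags. The dictionary is a small illustrative subset written by hand, and both `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical names; sending unlisted tags to `X` is an assumption of this sketch.

```python
# Illustrative subset of a Penn Treebank -> universal tag mapping (hand-written).
PTB_TO_UNIVERSAL = {
    "NNP": "NOUN", "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBZ": "VERB", "VBN": "VERB",
    "JJ": "ADJ", "RB": "ADV", "IN": "ADP", "DT": "DET", "CC": "CONJ",
    ",": ".", ".": ".",
}

def to_universal(tagged_words, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Replace each fine-grained tag with its coarse tag (assumption: unknown -> 'X')."""
    return [(word, mapping.get(tag, unknown)) for (word, tag) in tagged_words]

fine = [("Pierre", "NNP"), ("Vinken", "NNP"), (",", ",")]
coarse = to_universal(fine)
print(coarse)
```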

In [42]:
from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()


[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]

**Nouns and verbs**


In [50]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a,b) in word_tag_pairs if b[1] =='NOUN']

## Should return FreqDist({'NOUN': 7959, 'DET': 7373, 'ADJ': 4761, 'ADP': 3781, '.': 2796, 'VERB': 1842, 'CONJ': 938, 'NUM': 894, 'ADV': 186, 'PRT': 94, ...})
fdist = nltk.FreqDist(noun_preceders)

[tag for (tag, _) in fdist.most_common()]

['NOUN',
 'DET',
 'ADJ',
 'ADP',
 '.',
 'VERB',
 'CONJ',
 'NUM',
 'ADV',
 'PRT',
 'PRON',
 'X']
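
The "which tags precede a noun" computation above can be tried on a small sample without downloading the Brown corpus. Here `zip(tagged, tagged[1:])` stands in for `nltk.bigrams`, and the tagged sentence is invented for illustration.

```python
from collections import Counter

# Hand-tagged toy sentence (universal tags); invented for illustration.
tagged = [
    ("The", "DET"), ("grand", "ADJ"), ("jury", "NOUN"),
    ("praised", "VERB"), ("the", "DET"), ("city", "NOUN"),
    ("officials", "NOUN"), (".", "."),
]

# Pair each token with its successor and keep the first tag when the second is NOUN.
noun_preceders = Counter(
    prev_tag
    for (_, prev_tag), (_, tag) in zip(tagged, tagged[1:])
    if tag == "NOUN"
)
print(noun_preceders.most_common())
```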

In [66]:
wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)

# Return the list of common words in verbs : ['is', 'said', 'are', 'was', 'be', 'has', 'have', 'will', 'says', 'would',
[wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB']

cfd1 = nltk.ConditionalFreqDist(wsj)

## Should return [('VERB', 28), ('NOUN', 20)]
cfd1['yield'].most_common()

## Should return [('VERB', 25), ('NOUN', 3)]
cfd1['cut'].most_common()

[('VERB', 25), ('NOUN', 3)]
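
A conditional frequency distribution keyed by word is essentially a dictionary of counters. The following standard-library sketch shows that structure on invented data; `tags_by_word` is a hypothetical name and the counts are not taken from the treebank.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs; invented for illustration.
tagged = [
    ("cut", "VERB"), ("the", "DET"), ("cut", "NOUN"), ("cut", "VERB"),
    ("yield", "NOUN"), ("the", "DET"),
]

# One Counter of tags per word, created on first access.
tags_by_word = defaultdict(Counter)
for word, tag in tagged:
    tags_by_word[word.lower()][tag] += 1

print(tags_by_word["cut"].most_common())

# Words seen with more than one tag are ambiguous in this sample.
ambiguous = sorted(w for w, tags in tags_by_word.items() if len(tags) > 1)
print(ambiguous)
```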

In [79]:
# Looking through the list of words
wsj = nltk.corpus.treebank.tagged_words()
cfd1= nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)

## Should return ['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold','named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
list(cfd1['VBN'])

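## To clarify the distinction between VBD (past tense) and VBN (past participle),
## find words that occur with both tags. This needs a word -> tag distribution,
## so build a separate one instead of reusing the tag -> word cfd1 above.
cfd2 = nltk.ConditionalFreqDist(wsj)
[w for w in cfd2.conditions() if 'VBD' in cfd2[w] and 'VBN' in cfd2[w]]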

# Should return [('While', 'IN'), ('program', 'NN'), ('trades', 'NNS'), ('swiftly', 'RB'), ('kicked', 'VBD')]
idx1 = wsj.index(('kicked', 'VBD'))
wsj[idx1-4:idx1+1]

# Should return [('head', 'NN'), ('of', 'IN'), ('state', 'NN'), ('has', 'VBZ'), ('kicked', 'VBN')]
idx2 = wsj.index(('kicked', 'VBN'))
wsj[idx2-4:idx2+1]

[('head', 'NN'),
 ('of', 'IN'),
 ('state', 'NN'),
 ('has', 'VBZ'),
 ('kicked', 'VBN')]
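
The slice `wsj[idx-4:idx+1]` works when the match is at least four tokens into the corpus, but a negative start index would wrap around and return an empty or unexpected window. The helper below is a hedged sketch of a safer version operating on a plain list; `context_before` is a hypothetical name and the sample data is invented.

```python
def context_before(tagged_words, target, width=4):
    """Return up to `width` pairs before the first occurrence of target, plus target."""
    idx = tagged_words.index(target)   # raises ValueError if target is absent
    start = max(0, idx - width)        # clamp so the slice never starts below zero
    return tagged_words[start:idx + 1]

sample = [
    ("program", "NN"), ("trades", "NNS"), ("swiftly", "RB"), ("kicked", "VBD"),
    ("the", "DT"), ("head", "NN"), ("of", "IN"), ("state", "NN"),
    ("has", "VBZ"), ("kicked", "VBN"),
]
print(context_before(sample, ("kicked", "VBD")))
print(context_before(sample, ("kicked", "VBN")))
```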

**Adjectives and adverbs**


**Unsimplified tags**

In [86]:

def findTags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text 
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = findTags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])


NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("city's", 6)]
NN$-HL [("Golf's", 1), ("Navy's", 1)]
NN$-TL [("President's", 11), ("Administration's", 3), ("Army's", 3), ("League's", 3), ("University's", 3)]
NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('cut', 2), ('party', 2)]
NN-NC [('ova', 1), ('eva', 1), ('aya', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Mayor', 1), ('Commissioner', 1), ('City', 1), ('Oak', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("men's", 3), ("janitors'", 3), ("taxpayers'", 2)]
NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Princes'", 1), ("Bombers'", 1)]
NNS-HL [('Wards', 1), ('deputies', 1), ('bonds', 1), ('aspects', 1), ('Decisions', 1)]
NNS-TL [

**Exploring Tagged Corpora**


In [97]:
brown_learned_text = brown.words(categories='learned')

## should return [',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming','became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]
sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))

brown_lrned_text = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrned_text) if a[0] == 'often']

fd = nltk.FreqDist(tags)
fd.tabulate()

VERB  ADV  ADP  ADJ    .  PRT 
  37    8    7    6    4    2 


In [101]:
##trigram

from nltk.corpus import brown

def process(sentence):
    for (w1, t1), (w2,t2),(w3, t3) in nltk.trigrams(sentence):
        # Compare the variable t2, not the string literal 't2', against the 'TO' tag
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
    


brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
data = nltk.ConditionalFreqDist((word.lower(), tag)
                               for (word, tag) in brown_news_tagged)

# Print every word that appears with more than three distinct tags
for word in sorted(data.conditions()):
    if len(data[word]) >3:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))
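
The verb + "to" + verb trigram search from this cell can be checked on a single hand-tagged sentence. In this sketch `zip` over three offsets stands in for `nltk.trigrams`; the helper names and the example sentence (tagged by hand with Brown-style tags) are assumptions made for illustration.

```python
def trigrams(seq):
    """Yield consecutive triples from a list, similar in spirit to nltk.trigrams."""
    return zip(seq, seq[1:], seq[2:])

def verb_to_verb(tagged_sent):
    """Collect (w1, w2, w3) where a verb is followed by TO and another verb."""
    return [
        (w1, w2, w3)
        for (w1, t1), (w2, t2), (w3, t3) in trigrams(tagged_sent)
        if t1.startswith("V") and t2 == "TO" and t3.startswith("V")
    ]

# Hand-tagged example: the first "to" is an infinitive marker (TO),
# the second is a preposition (IN) and must not match.
sent = [
    ("She", "PPS"), ("wanted", "VBD"), ("to", "TO"), ("leave", "VB"),
    ("and", "CC"), ("went", "VBD"), ("to", "IN"), ("town", "NN"), (".", "."),
]
matches = verb_to_verb(sent)
print(matches)
```

Returning a list instead of printing inside the loop makes the pattern easier to test and to reuse on other sentences.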