In [1]:
import nltk

with open("../data/signpost_corpus.txt", "r", encoding="utf8") as f:
    sentences = nltk.sent_tokenize(f.read())

In [10]:
import numpy as np

# This is, incidentally, a pretty good example of a case when piping (via something akin to pandas.pipe) would be useful.
# This expression is effectively unreadable, and no easy alternatives exist.
sentences_pos = np.concatenate(
    [nltk.pos_tag(
            [word for word in nltk.word_tokenize(sentence) if len(word) <= 34]  # "Supercalifragilisticexpialidocious"
        ) for sentence in sentences
    ])

It worked.
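
For what it's worth, one possible alternative to the nested expression is to name the steps in a small function. This is only a sketch: `tag_corpus` is a hypothetical helper, and the tokenizer and tagger are passed in as arguments so that the demo can use stand-ins instead of `nltk.word_tokenize` and `nltk.pos_tag`, which need downloaded models.

```python
import numpy as np

MAX_WORD_LEN = 34  # same cutoff as the cell above


def tag_corpus(sentences, tokenize, tag, max_len=MAX_WORD_LEN):
    """Return an (n_tokens, 2) array of [word, tag] rows."""
    tagged = []
    for sentence in sentences:
        # Drop overlong tokens so the fixed-width string dtype stays small.
        words = [w for w in tokenize(sentence) if len(w) <= max_len]
        tagged.extend(tag(words))
    return np.array(tagged)


# Stand-in tagger so the sketch runs without NLTK data (an assumption for
# the demo); in the notebook the real calls would be
# tag_corpus(sentences, nltk.word_tokenize, nltk.pos_tag).
def fake_tag(words):
    return [(w, "X") for w in words]


demo = tag_corpus(["short words only", "x" * 40 + " stays out"], str.split, fake_tag)
print(demo.shape)
```

Collecting the pairs in a list and converting once also sidesteps `np.concatenate` having to join one small array per sentence.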

In [11]:
sentences_pos.dtype

dtype('<U34')

Memory usage is barely anything.
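
This is easy to check rather than assume. A `<U34` array reserves room for 34 code points at 4 bytes each in every cell, however short the word is. A sketch on a toy array (the rows are made up):

```python
import numpy as np

# Toy stand-in for sentences_pos: rows of [word, tag].
toy = np.array([
    ["Supercalifragilisticexpialidocious", "NN"],
    ["the", "DT"],
])

print(toy.dtype)     # fixed-width unicode, sized by the longest string
print(toy.itemsize)  # bytes reserved per cell
print(toy.nbytes)    # total bytes: itemsize times the number of cells
```

On the real array, `sentences_pos.nbytes` gives the figure directly.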

In [32]:
freq_d = nltk.FreqDist(sentences_pos[...,0])

The most common elements are the usual suspects (mostly stopwords and punctuation). Note also that I haven't applied any stemming yet, so `running` and `run`, for example, will be reported separately.

In [31]:
freq_d.most_common()[:10]

[(',', 211342),
 ('the', 204079),
 ('.', 161247),
 ('of', 122910),
 ('to', 98783),
 ('and', 97656),
 ('a', 79379),
 ('in', 65785),
 (')', 49361),
 ('(', 49048)]

Since right now we're playing with parts of speech...

In [34]:
nltk.FreqDist(sentences_pos[...,1]).most_common()[:10]

[('NN', 608156),
 ('IN', 509969),
 ('NNP', 487714),
 ('DT', 414182),
 ('JJ', 307300),
 ('NNS', 265547),
 (',', 211342),
 ('.', 172903),
 ('RB', 150107),
 ('VBN', 131700)]

You can look up what these tags mean using the help system (`nltk.help.upenn_tagset()`), but...eh...
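
The tags follow the Penn Treebank tag set. Here are glosses for the ten tags above; the dictionary and the `describe` helper are hand-written for this note, not something NLTK provides.

```python
# Hand-written glosses for the ten most common tags in the output above.
PENN_GLOSSES = {
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    "NNP": "proper noun, singular",
    "DT": "determiner",
    "JJ": "adjective",
    "NNS": "noun, plural",
    ",": "comma",
    ".": "sentence-final punctuation",
    "RB": "adverb",
    "VBN": "verb, past participle",
}


def describe(tag):
    """Return a short gloss for a tag, or a placeholder if it is not listed."""
    return PENN_GLOSSES.get(tag, "unknown tag")


for tag in ["NN", "IN", "VBN"]:
    print(tag, "->", describe(tag))
```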

In [46]:
freq_d['Wikipedia']

33342

In [49]:
words_pos = sentences_pos

In [56]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "NN"],0]
).most_common()[:25]

[('article', 13894),
 ('week', 9839),
 ('nom', 8025),
 ('project', 7706),
 ('page', 6301),
 ('case', 5727),
 ('time', 5640),
 ('community', 4483),
 ('year', 4042),
 ('number', 3657),
 ('list', 3596),
 ('information', 3511),
 ('content', 3410),
 ('work', 3302),
 ('%', 3080),
 ('discussion', 2843),
 ('process', 2747),
 ('status', 2738),
 ('coverage', 2627),
 ('editor', 2364),
 ('news', 2314),
 ('way', 2298),
 ('part', 2233),
 (']', 2190),
 ('talk', 2179)]
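
The row selection in the cell above builds a Python list of indices one element at a time. A boolean mask should select the same rows and is easier to read. A sketch on a toy array, where `top_words_for_tag` is a hypothetical helper name:

```python
import nltk
import numpy as np


def top_words_for_tag(words_pos, tag, n=25):
    """Most common words carrying the given POS tag."""
    mask = words_pos[:, 1] == tag  # True for rows whose tag column matches
    return nltk.FreqDist(words_pos[mask, 0].tolist()).most_common(n)


# Toy stand-in for words_pos (rows invented for illustration).
toy = np.array([
    ["article", "NN"], ["the", "DT"], ["article", "NN"],
    ["week", "NN"], ["of", "IN"],
])
print(top_words_for_tag(toy, "NN"))
```

The same helper would cover the per-tag cells that follow, for example `top_words_for_tag(words_pos, "IN", 5)`.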

In [58]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "IN"],0]
).most_common()[:5]

[('of', 122910), ('in', 65785), ('on', 41511), ('for', 38708), ('by', 30754)]

In [59]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "NNP"],0]
).most_common()[:25]

[('Wikipedia', 33192),
 ('Wikimedia', 9857),
 ('Foundation', 5629),
 ('Signpost', 3931),
 ('WikiProject', 3733),
 ('New', 2837),
 ('WMF', 2740),
 ('English', 2472),
 ('Committee', 2365),
 ('Wales', 2342),
 ('Board', 1942),
 ('Featured', 1910),
 ('United', 1702),
 ('User', 1647),
 ('January', 1634),
 ('April', 1600),
 ('May', 1563),
 ('List', 1545),
 ('US', 1525),
 ('March', 1521),
 ('August', 1508),
 (']', 1505),
 ('October', 1475),
 ('December', 1473),
 ('February', 1472)]

In [61]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "DT"],0]
).most_common()[:5]

[('the', 204079), ('a', 79379), ('The', 35638), ('this', 18258), ('an', 17413)]

In [62]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "JJ"],0]
).most_common()[:25]

[('new', 8196),
 ('other', 8059),
 ('last', 4818),
 ('many', 4361),
 ('such', 4137),
 ('first', 4067),
 ('featured', 3922),
 ('good', 2762),
 ('several', 2553),
 ('few', 2221),
 ('same', 1868),
 ('active', 1820),
 ('recent', 1818),
 ('nom', 1801),
 ('own', 1771),
 ('different', 1765),
 ('related', 1745),
 ('current', 1716),
 ('open', 1714),
 ('available', 1634),
 ('English', 1612),
 ('important', 1597),
 ('much', 1544),
 ('German', 1493),
 ('public', 1487)]

In [64]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "NNS"],0]
).most_common()[:25]

[('articles', 16781),
 ('editors', 6684),
 ('people', 4487),
 ('users', 4154),
 ('projects', 3662),
 ('pages', 3429),
 ('years', 2735),
 ('members', 2525),
 ('edits', 2480),
 ('changes', 2181),
 ('images', 2146),
 ('media', 2109),
 ('cases', 1949),
 ('topics', 1938),
 ('issues', 1921),
 ('lists', 1758),
 ('sources', 1749),
 ('others', 1644),
 ('pictures', 1450),
 ('candidates', 1421),
 ('results', 1375),
 ('things', 1364),
 ('months', 1353),
 ('links', 1232),
 ('students', 1228)]

In [66]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "RB"],0]
).most_common()[:5]

[('not', 13761), ('also', 7118), ("n't", 5008), ('now', 4020), ('only', 3652)]

In [67]:
nltk.FreqDist(
    words_pos[[i for i, word_pos in enumerate(words_pos[...,1]) if word_pos == "VBN"],0]
).most_common()[:5]

[('been', 11312),
 ('nominated', 3784),
 ('created', 3580),
 ('promoted', 2543),
 ('used', 2215)]

Interesting. You can already see how much value we would get out of further normalization by stemming the data: `article` and `articles`, for instance, are counted as two unrelated words.
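
As a sketch of what stemming would buy, here is NLTK's Porter stemmer folding inflected forms into one count. The word list is invented for illustration, and the stems are not always dictionary words.

```python
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["run", "running", "runs", "nominated", "nominate"]

raw_counts = nltk.FreqDist(words)                            # one entry per surface form
stem_counts = nltk.FreqDist(stemmer.stem(w) for w in words)  # one entry per stem

print(raw_counts.most_common())
print(stem_counts.most_common())
```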

A conditional frequency distribution can also be calculated, which reverses this: instead of getting a single total from `freq_d['WORD']`, we get a count for each part of speech that the word was tagged with.

In [70]:
nltk.ConditionalFreqDist(words_pos.tolist())['cut'].most_common()

[('VB', 48), ('VBN', 43), ('NN', 24), ('VBD', 21)]

In [71]:
nltk.ConditionalFreqDist(words_pos.tolist())['Wales'].most_common()

[('NNP', 2342), ('NNS', 157), ('VBZ', 5)]
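
Both cells above rebuild the whole distribution just to read one entry. Building it once and indexing it repeatedly should give the same answers. A sketch with invented pairs standing in for `words_pos.tolist()`:

```python
import nltk

# (word, tag) pairs; the first element of each pair is the condition.
pairs = [("cut", "VB"), ("cut", "VBN"), ("cut", "VB"), ("Wales", "NNP"), ("cut", "NN")]

cfd = nltk.ConditionalFreqDist(pairs)
print(cfd["cut"].most_common())
print(cfd["Wales"].most_common())

# Swapping each pair gives the other direction: which words occur under a tag.
by_tag = nltk.ConditionalFreqDist((tag, word) for word, tag in pairs)
print(by_tag["VB"].most_common())
```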

In [68]:
# nltk.ConditionalFreqDist(wsj) 

Here's an example of using part-of-speech tagging to do something interesting via trigrams:

In [76]:
next(nltk.trigrams(words_pos))  # it's a generator

(array(['The', 'DT'], 
       dtype='<U34'), array(['Association', 'NNP'], 
       dtype='<U34'), array(['of', 'IN'], 
       dtype='<U34'))

In [77]:
verb_to_verb_trigrams = []
for ((w1, pos1), (w2, pos2), (w3, pos3)) in nltk.trigrams(words_pos):
    if pos1[0] == "V" and w2 == "to" and pos3[0] == "V":
        verb_to_verb_trigrams.append(((w1, pos1), (w2, pos2), (w3, pos3)))

In [80]:
verb_to_verb_trigrams[:20]

[(('have', 'VB'), ('to', 'TO'), ('wait', 'VB')),
 (('continued', 'VBD'), ('to', 'TO'), ('make', 'VB')),
 (('take', 'VB'), ('to', 'TO'), ('issue', 'VB')),
 (('managed', 'VBN'), ('to', 'TO'), ('begin', 'VB')),
 (('seemed', 'VBD'), ('to', 'TO'), ('be', 'VB')),
 (('proceeded', 'VBD'), ('to', 'TO'), ('add', 'VB')),
 (('had', 'VBD'), ('to', 'TO'), ('be', 'VB')),
 (('remains', 'VBZ'), ('to', 'TO'), ('be', 'VB')),
 (('voted', 'VBN'), ('to', 'TO'), ('accept', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('reject', 'VB')),
 (('decide', 'VB'), ('to', 'TO'), ('accept', 'VB')),
 (('begins', 'VBZ'), ('to', 'TO'), ('shift', 'VB')),
 (('fail', 'VBP'), ('to', 'TO'), ('recognize', 'VB')),
 (('appears', 'VBZ'), ('to', 'TO'), ('be', 'VB')),
 (('have', 'VB'), ('to', 'TO'), ('wait', 'VB')),
 (('doomed', 'VBN'), ('to', 'TO'), ('fail', 'VB')),
 (('failed', 'VBD'), ('to', 'TO'), ('reach', 'VB')),
 (('need', 'VBP'), ('to', 'TO'), ('be', 'VB')),
 (('chosen', 'VBN'), ('to', 'TO'), ('be', 'VB')),
 (('happen', 'VB'), 

In [82]:
[tri for tri in verb_to_verb_trigrams if tri[0][0] == 'voted'][:10]

[(('voted', 'VBN'), ('to', 'TO'), ('accept', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('reject', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('hear', 'VB')),
 (('voted', 'VBN'), ('to', 'TO'), ('do', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('stay', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('accept', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('keep', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('reject', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('accept', 'VB')),
 (('voted', 'VBD'), ('to', 'TO'), ('close', 'VB'))]

In [87]:
nltk.FreqDist([tri[2][0] for tri in verb_to_verb_trigrams if tri[0][0] == 'voted']).most_common()

[('accept', 12),
 ('reject', 6),
 ('approve', 4),
 ('advise', 3),
 ('keep', 3),
 ('admonish', 2),
 ('close', 2),
 ('stay', 2),
 ('affirm', 2),
 ('appoint', 1),
 ('purchase', 1),
 ('disband', 1),
 ('award', 1),
 ('hear', 1),
 ('change', 1),
 ('strip', 1),
 ('leave', 1),
 ('require', 1),
 ('adopt', 1),
 ('select', 1),
 ('modify', 1),
 ('open', 1),
 ('desysop', 1),
 ('reprimand', 1),
 ('do', 1),
 ('restore', 1),
 ('request', 1),
 ('have', 1),
 (']', 1),
 ('delete', 1),
 ('ban', 1),
 ('abolish', 1),
 ('suspend', 1),
 ('remove', 1)]

In [88]:
del verb_to_verb_trigrams

In [92]:
jimbos = [trigram for trigram in nltk.trigrams(words_pos) if trigram[2][0] == "Jimbo"][:25]

In [102]:
[np.array(jimbo).tolist() for jimbo in jimbos]

[[['.', '.'], ["''", "''"], ['Jimbo', 'NNP']],
 [['debate', 'NN'], [',', ','], ['Jimbo', 'NNP']],
 [['.', '.'], ["''", "''"], ['Jimbo', 'NNP']],
 [['rumor', 'NN'], ['that', 'IN'], ['Jimbo', 'NNP']],
 [['approval', 'NN'], ['of', 'IN'], ['Jimbo', 'NNP']],
 [['passed', 'VBD'], ['to', 'TO'], ['Jimbo', 'NNP']],
 [['email', 'NN'], ['to', 'TO'], ['Jimbo', 'NNP']],
 [['based', 'VBN'], ['on', 'IN'], ['Jimbo', 'NNP']],
 [['banned', 'VBN'], ['by', 'IN'], ['Jimbo', 'NNP']],
 [['Still', 'RB'], [',', ','], ['Jimbo', 'NNP']],
 [['statements', 'NNS'], ['by', 'IN'], ['Jimbo', 'NNP']],
 [['involved', 'VBN'], ['.', '.'], ['Jimbo', 'NNP']],
 [['God-King', 'JJ'], ["''", "''"], ['Jimbo', 'NN']],
 [['lost', 'VBN'], [',', ','], ['Jimbo', 'NNP']],
 [['treatment', 'NN'], ['of', 'IN'], ['Jimbo', 'NNP']],
 [['Grunt', 'NNP'], [',', ','], ['Jimbo', 'NNP']],
 [['relieved', 'JJ'], ['that', 'IN'], ['Jimbo', 'NNP']],
 [['message', 'NN'], ['for', 'IN'], ['Jimbo', 'NNP']],
 [['meetup', 'NN'], ['without', 'IN'], ['Jimbo',

Would need to get rid of the punctuation characters for this to be truly effective!
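
One way to do that, sketched on a toy token list: drop every token made up entirely of punctuation before building the trigrams. `is_punct` is a hypothetical helper, and `string.punctuation` only covers ASCII, so curly quotes and the like would slip through.

```python
import string

import nltk


def is_punct(token):
    """True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


# Toy stand-in for words_pos.tolist() (pairs invented for illustration).
toy = [
    ("debate", "NN"), (",", ","), ("Jimbo", "NNP"), ("said", "VBD"),
    (".", "."), ("''", "''"), ("Jimbo", "NNP"),
]

no_punct = [(w, t) for w, t in toy if not is_punct(w)]
jimbos_clean = [tri for tri in nltk.trigrams(no_punct) if tri[2][0] == "Jimbo"]
print(jimbos_clean)
```

The catch is that trigrams now run across sentence boundaries, since the full stops that marked them are gone.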

...at this point the book goes into a long discussion on the subject of corpus tagging. All very interesting, but not useful here...I did read it.

Of particular interest is the **sparse data problem**. Simple **n-gram tagging** is done by assigning the likeliest tags to n-length combinations of words, as determined by training data (which is cataloged by hand). If an n-gram contains a word that was never seen in training, however, an n-gram tagger will fail on *every* n-gram containing that word: two n-grams in the case of bigrams (`new word`, `old word`; `old word`, `new word`), three in the case of trigrams, and so on.
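
A small sketch of the problem and the usual fix, using a two-sentence training set invented for illustration. In NLTK's implementation the context is the current word plus the preceding tags, so a miss on one word also leaves the next word without a usable context.

```python
import nltk

# Tiny hand-tagged training set (invented for illustration).
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# A bigram tagger with nothing to fall back on.
bigram_only = nltk.BigramTagger(train)
seen = bigram_only.tag(["the", "dog", "barks"])
unseen = bigram_only.tag(["the", "wug", "barks"])  # "wug" never occurs in train
print(seen)
print(unseen)  # tags after the unknown word are lost as well

# The same tagger with a backoff chain: bigram -> unigram -> default tag.
default = nltk.DefaultTagger("NN")
unigram = nltk.UnigramTagger(train, backoff=default)
with_backoff = nltk.BigramTagger(train, backoff=unigram)
recovered = with_backoff.tag(["the", "wug", "barks"])
print(recovered)
```

Guessing `NN` for anything unknown is a crude default, but it keeps one unfamiliar word from derailing the tags that follow it.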

This is closely related to the precision-recall tradeoff: a larger n gives more specific contexts, which are more accurate when they match but match less often.

Nevertheless, the accuracy of n-gram taggers with fallbacks was encouraging to research groups back when the approach was state-of-the-art (in the 90s), as it needs little lexical knowledge of the text to tag correctly much of the time.

A **Brill tagger** is a guess-and-reguess tagger and a form of supervised learning: it starts from an initial tagging, compares it against the gold version of the text, and learns correction rules from the mistakes. Thousands of candidate rules are scored this way, and the best of them are combined into a "tag-by-modification" type of tagger.

That was chapter 6. Chapter 7 has lots more text classification material, including some information on hidden Markov models:

> One shortcoming of this approach is that we commit to every decision that we make. For example, if we decide to label a word as a noun, but later find evidence that it should have been a verb, there's no way to go back and fix our mistake. One solution to this problem is to adopt a transformational strategy instead. Transformational joint classifiers work by creating an initial assignment of labels for the inputs, and then iteratively refining that assignment in an attempt to repair inconsistencies between related inputs. The Brill tagger, described in (1), is a good example of this strategy.
>
> Another solution is to assign scores to all of the possible sequences of part-of-speech tags, and to choose the sequence whose overall score is highest. This is the approach taken by Hidden Markov Models. Hidden Markov Models are similar to consecutive classifiers in that they look at both the inputs and the history of predicted tags. However, rather than simply finding the single best tag for a given word, they generate a probability distribution over tags. These probabilities are then combined to calculate probability scores for tag sequences, and the tag sequence with the highest probability is chosen. Unfortunately, the number of possible tag sequences is quite large. Given a tag set with 30 tags, there are about 600 trillion (30^10) ways to label a 10-word sentence. In order to avoid considering all these possible sequences separately, Hidden Markov Models require that the feature extractor only look at the most recent tag (or the most recent n tags, where n is fairly small). Given that restriction, it is possible to use dynamic programming (4.7) to efficiently find the most likely tag sequence. In particular, for each consecutive word index i, a score is computed for each possible current and previous tag. This same basic approach is taken by two more advanced models, called Maximum Entropy Markov Models and Linear-Chain Conditional Random Field Models; but different algorithms are used to find scores for tag sequences.
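
The arithmetic and the dynamic-programming idea in the quote can be sketched in a few lines. This is a plain Viterbi decoder over a model that only looks at the previous tag, not NLTK's HMM implementation, and every probability below is invented for illustration.

```python
# Size of the brute-force search space from the quote: 30 tags, 10 words.
n_sequences = 30 ** 10
print(n_sequences)


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most likely tag sequence when each tag depends only on the previous tag."""
    # best[i][t] = (score of the best sequence ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Pick the best previous tag for landing in t.
            prev, score = max(
                ((p, best[-1][p][0] * trans_p[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score * emit_p[t].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Toy model (all numbers are assumptions made up for this sketch).
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.0, "NN": 0.9, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"cut": 0.5, "hurts": 0.5},
    "VB": {"cut": 0.6, "hurts": 0.4},
}

decoded = viterbi(["the", "cut", "hurts"], tags, start_p, trans_p, emit_p)
print(decoded)
```

In this toy model `cut` on its own is slightly more likely to be a verb, but the sequence as a whole scores higher with it as a noun. The work per word is proportional to the square of the number of tags, instead of the search space growing exponentially with sentence length.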

RTE (recognizing textual entailment) is fascinating! IBM Watson, as a quasi-institution, is built atop this basic task.

The rest of the chapter is tons of somewhat tedious explanations of machine learning tasks, with examples drawn from NLP problems.

...well, it appears that NLTK doesn't come with a pre-trained general-purpose phrase chunker (only building blocks such as `nltk.RegexpParser`), so we'll have to build one ourselves!
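
A rule-based starting point, as a sketch: `nltk.RegexpParser` applies a hand-written tag pattern to an already tagged sentence. The grammar and the sentence below are my own illustration, not taken from the book or the corpus.

```python
import nltk

# Noun phrase: optional determiner, any adjectives, then one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = [
    ("The", "DT"), ("Association", "NNP"), ("of", "IN"), ("editors", "NNS"),
    ("voted", "VBD"), ("to", "TO"), ("accept", "VB"),
    ("the", "DT"), ("new", "JJ"), ("case", "NN"),
]

tree = chunker.parse(tagged)

# Collect the words under each NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```

A pattern like this only sees the tags, so its quality is bounded by the tagger's.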