In [1]:
import nltk
import spacy

# Prescriptive grammar

Here we will illustrate some basic tools for Natural Language Processing (NLP), using the example of searching a text for violations of two contentious rules of good grammar:

* A preposition (such as *for*, *to*, etc.) is not a good thing to end a sentence with.
* And you shouldn't begin a sentence with a conjunction (such as *and*, *but*, etc.).

In my semi-humble opinion, neither of these rules is really worth observing. Descriptively, both constructions occur regularly in everyday writing, and practically, neither of them obscures or impoverishes the meaning of a sentence. This also seems to be the majority opinion among grammarians. Nonetheless, the rules crop up in antique style guides or in conversation with provincial English teachers from time to time. Here they will just serve us as simple starting examples of linguistic phenomena to search for in a text.

If this topic has already raised your hackles, you can read one of the recent rounds of the debate in an article in the [New Yorker](https://www.newyorker.com/culture/cultural-comment/steven-pinkers-bad-grammar) rebutting Steven Pinker's ideas on the subject, and Pinker's rebuttal of the rebuttal in the [Guardian](https://www.theguardian.com/books/booksblog/2015/oct/06/steven-pinker-alleged-rules-of-writing-superstitions).

## nltk

In order to find instances of the two violations described above, we need to search for specific patterns in text. But the patterns we need here go somewhat beyond what is easily achievable with regular expressions. For example, for the second rule we need to define a set of conjunctions and search for any one of these either right at the beginning of the text or following a sentence end, which itself can occur in several ways (period, question mark, exclamation mark, etc., plus any of these followed by a quote mark or parenthesis). This is doable and could be an instructive exercise if you want to practice regular expressions, but as usual people with more expertise and more time on their hands have gone before and laid some of the groundwork for us already.
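For comparison, here is a minimal sketch of the regular-expression approach to the conjunction rule. The list of conjunctions and the example text are made up for illustration, and the pattern only covers the sentence endings mentioned above, so treat it as a rough starting point rather than a complete solution.

```python
import re

# Hypothetical, deliberately incomplete list of conjunctions.
CONJUNCTIONS = ("and", "but", "or", "so", "yet")

# A sentence start is either the start of the text, or a sentence-ending mark
# (optionally followed by a closing quote or parenthesis) plus whitespace.
SENTENCE_START = r"""(?:^|[.!?]["')]?\s+)"""

pattern = re.compile(
    SENTENCE_START + "(" + "|".join(CONJUNCTIONS) + r")\b",
    re.IGNORECASE,
)

text = 'I went home. And then I slept. "Why?" But nobody answered. Sand is dry.'

# Collect the sentence-initial conjunctions that the pattern finds.
found = [match.group(1) for match in pattern.finditer(text)]
print(found)
```

The trailing `\b` keeps words such as *Sand* from matching merely because they contain a conjunction. Abbreviations like *Dr.* would still fool this pattern, which is one reason to reach for a dedicated toolkit.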

The *natural language toolkit* (nltk) is a Python package that provides tools for parsing text into sentences, words, etc. as well as models that can recognize the grammatical role of a word (noun, verb, etc.) with a reasonable degree of accuracy. The package is accompanied by a very comprehensive guidebook that is available both [online](https://www.nltk.org/book/) and [in print](http://shop.oreilly.com/product/9780596516499.do).

### Corpora

Let's begin by downloading and reviewing an example text. One useful feature of nltk is that it provides a downloader for various corpora of text. These include news stories, novels, and various other sources that are already suitable material for many basic research questions. Here we will use the 'gutenberg' corpus, a collection of texts that are freely available via [Project Gutenberg](https://www.gutenberg.org/). Downloading them via nltk rather than going to the Project Gutenberg website ensures that our work is easily reproducible for others who work with nltk.

(And in Germany, possibly the only country in the world that takes intellectual property laws seriously, this brings the added bonus of avoiding the blockade of Project Gutenberg for German IP addresses. At the time of writing, the German courts are involved in a legal dispute with the Project Gutenberg Foundation concerning different schedules for copyright expiry in Germany and the US, and a difference in interpretation of jurisdictions over web content. The [legal details](https://cand.pglaf.org/germany/) are actually quite interesting.)

The `download()` function accepts the name of the corpus (or other downloadable resource) as its first argument. Called without argument, the function opens a graphical interface for selecting specific resources for download. (Note that for me the print output of this command merely confirms that the package has already been downloaded.)

In [2]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /home/lt/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

Now that we have downloaded it, we can import our chosen corpus from the `corpus` submodule. The `fileids()` function lists the files in the corpus.

In [3]:
from nltk.corpus import gutenberg

gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

As well as example corpora, nltk provides several tools and models for processing text. To determine the structure of a text, an algorithm needs to apply some background knowledge about the structure of a language. This sometimes requires making use of a fairly complex model. The quantity of background knowledge required is not always trivial, so in order to conserve disk space in installation, nltk does not install all this information by default. If we want to use it, we will need to download it first.

The easiest way to download the tools that are likely to be useful for basic tasks is to download the 'popular' bundle. This includes various models for assigning structure to text. (Note that it also includes the Gutenberg corpus, so it is also a shortcut to getting hold of the text we are using.)

In [4]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /home/lt/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/lt/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /home/lt/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/lt/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/lt/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/lt/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /home/lt/nltk

True

We will use the text of Herman Melville's novel *Moby Dick*.

The nltk corpora have already been subjected to a fair bit of preprocessing, and many of them are already annotated with additional linguistic information. However, to illustrate working with unpreprocessed text, we will load just the raw text of the novel. This can be done with the `raw()` function of the corpus object.

In [5]:
md = gutenberg.raw('melville-moby_dick.txt')

print(md[:1000])

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true." --HACKLUYT

"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness
or rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER'S
DICTIONARY

"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;
A.S. WALW-IAN, to roll, to wallow." --RICHARDSON'S DICTIONARY


### Tokenizing

A first common task in nltk is to 'tokenize' the text, or split it into 'tokens': instances of a basic unit of language, such as syllable, word, sentence, etc. Since our task concerns identifying uses of words at the beginning and end of sentences, we need to tokenize the text into sentences first.

nltk provides several tokenizing functions. The `sent_tokenize()` function splits a text into sentences, returning them as a list of strings.

(Note that the tokenizing functions depend on our having downloaded the additional nltk resources for processing text as described above.)
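To see why a dedicated sentence tokenizer is worth having, here is a sketch of a naive splitter applied to one of the etymology quotes printed above. The splitting rule (sentence-ending punctuation followed by whitespace) is an assumption chosen for illustration and is not how nltk works internally.

```python
import re

sample = ('"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness '
          'or rolling; for in Dan. HVALT is arched or vaulted."')

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r'(?<=[.!?])\s+', sample)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule breaks the quote into seven fragments, splitting after abbreviations such as *Sw.* and *Dan.* as though they ended sentences. Handling cases like these is the main thing a trained sentence tokenizer adds.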

In [6]:
sentences = nltk.sent_tokenize(md)

print(sentences[0])

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.


We saw above that the text begins not with the well-known opening line but with a title and some preamble concerning the etymology of the word *whale*. This etymological preamble is also part of the full text as Melville wrote it (a little-known fact that you may once in a lifetime have a chance to impress your friends with), but for illustration purposes we will cut the text so that it begins with the better-known opener of the novel proper.

We can use the text tokenized as sentences to more easily search for a specific sentence, and cut out those sentences preceding it.

In [7]:
def find_first_sentence_starting_with(sents, substr):
    for i, s in enumerate(sents):
        if s.startswith(substr):
            return i
    return -1

start = find_first_sentence_starting_with(sentences, 'Call me Ishmael')
sentences = sentences[start:]

for i in range(8):
    print(sentences[i])

Call me Ishmael.
Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.
It is a way I have of
driving off the spleen and regulating the circulation.
Whenever I
find myself growing grim about the mouth; whenever it is a damp,
drizzly November in my soul; whenever I find myself involuntarily
pausing before coffin warehouses, and bringing up the rear of every
funeral I meet; and especially whenever my hypos get such an upper
hand of me, that it requires a strong moral principle to prevent me
from deliberately stepping into the street, and methodically knocking
people's hats off--then, I account it high time to get to sea as soon
as I can.
This is my substitute for pistol and ball.
With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship.
There is nothing surprising in this.
If they but


We also want each sentence to be tokenized into words. For this we can apply nltk's word tokenizer to each of our sentences. Since it will be convenient later to retain the raw sentences, we assign the word-tokenized sentences to a new list.

In [8]:
tokens = [nltk.word_tokenize(s) for s in sentences]

### Tagging

Our task also requires knowing something about the grammatical roles of the words in the sentences. For this, we need to 'tag' the tokens with extra information. The information we require concerns the *parts of speech* (POS) of the words in the text. A part of speech is a grammatical role such as verb, noun, modifier, preposition, etc.

nltk's POS tagging functions attach a 'tag' to each word token. A tag is a shortened label that identifies the probable POS of the word. The basic function for assigning POS tags to text is `pos_tag()`, but note that its documentation recommends `pos_tag_sents()` instead if we have text that is first tokenized as sentences.

In [9]:
help(nltk.pos_tag)

Help on function pos_tag in module nltk.tag:

pos_tag(tokens, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
    
        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
    
    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.
    
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be u

We will also need to specify what tags we wish to assign to our text. The `tagset` argument to `pos_tag()` allows us to choose a standardized set of POS tags. The simplest of these is the so-called 'universal' tagset [proposed by researchers at Google](https://arxiv.org/pdf/1104.2086.pdf). This is a very simplified set of twelve broad types of words (or parts of words, morphemes etc.) that occur across various languages. You can read a list of the tags and their meanings at the [GitHub page](https://github.com/slavpetrov/universal-pos-tags/blob/master/README) for the universal tagset project.
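The two example outputs in the `pos_tag()` help text above show the same sentence tagged with the default fine-grained tags and with the universal tags. The relationship between the two can be sketched as a simple lookup. The mapping below is a partial one, read off from that help output only, and the fallback to `'X'` for unknown tags is an assumption for illustration. nltk ships its own complete mapping.

```python
# Partial mapping covering only the tags in the pos_tag() help output.
FINE_TO_UNIVERSAL = {
    'NNP': 'NOUN', 'NN': 'NOUN', 'POS': 'PRT', 'JJ': 'ADJ',
    'VBZ': 'VERB', 'RB': 'ADV', 'PDT': 'DET', 'DT': 'DET', '.': '.',
}

# The fine-grained tagging shown in the help output.
fine_tagged = [
    ('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
    ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.'),
]

# Collapse each fine-grained tag to its broad universal category.
universal_tagged = [
    (word, FINE_TO_UNIVERSAL.get(tag, 'X')) for word, tag in fine_tagged
]
print(universal_tagged)
```

Several fine-grained tags collapse into one universal tag (here `NNP` and `NN` both become `NOUN`), which is what makes the universal tagset easier to work with for a coarse search like ours.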

(The functions for assigning parts of speech to words also require the additional nltk data we downloaded above.)

Applying a POS model is a fairly computationally intensive operation, so for a long text it may take a moment to execute.

In [10]:
tokens = nltk.pos_tag_sents(tokens, tagset='universal')

print(tokens[0])

[('Call', 'VERB'), ('me', 'PRON'), ('Ishmael', 'NOUN'), ('.', '.')]


Each word in the sentence is now a tuple of the form *(word, POS)*.
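Because a tagged sentence is just a list of tuples, ordinary Python tools are enough to pick it apart. The short sketch below copies the tagged sentence from the output above by hand, so it runs without nltk.

```python
from collections import Counter

# Copied by hand from the output above: a list of (word, POS) tuples.
tagged = [('Call', 'VERB'), ('me', 'PRON'), ('Ishmael', 'NOUN'), ('.', '.')]

# Tuple unpacking separates the words from their tags.
words = [word for word, pos in tagged]
tags = [pos for word, pos in tagged]

# Filter on the tag to pick out words with a given part of speech.
nouns = [word for word, pos in tagged if pos == 'NOUN']

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tags)

print(words)
print(nouns)
print(tag_counts)
```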

### Working with Parts of Speech

We can now proceed to our first task: finding prepositions at the end of sentences. For this, we need to process each sentence in two steps:

* If the sentence ends with a token tagged as punctuation, ignore this and treat the previous token as the final token in the sentence.
* Check the POS tag for the final token to see whether it is a preposition.

So we can define a function for applying these two steps. Before charging in on the whole text, it is worth checking a couple of test sentences.

In [11]:
def endswith_preposition(sentence):
    if sentence[-1][1] == '.':
        endPos = -2
    else:
        endPos = -1
    return sentence[endPos][1] == 'ADP'

testNegative = 'This is the type of arrant pedantry up with which I will not put.'
testPositive = 'This is the type of arrant pedantry that I will not put up with.'

for test in [testNegative, testPositive]:
    test = nltk.pos_tag(nltk.word_tokenize(test), tagset='universal')
    print(endswith_preposition(test))

False
True


Now for the whole tokenized and tagged text. We use our function this time to get the indices of the sentences that violate the prescription. We can then use these indices with the original list of sentences to get them in their original form without all the tokens and tags.

In [12]:
violations = [i for i, s in enumerate(tokens) if endswith_preposition(s)]

for i in range(10):
    print(sentences[violations[i]])

They must get just as nigh
the water as they possibly can without falling in.
Tell me that.
Again, I always go to sea as a sailor, because they make a point of
paying me for my trouble, whereas they never pay passengers a single
penny that I ever heard of.
Not ignoring what is good, I am
quick to perceive a horror, and could still be social with it--would
they let me--since it is but well to be on friendly terms with all
the inmates of the place one lodges in.
Too expensive and
jolly, again thought I, pausing one moment to watch the broad glare
in the street, and hear the sounds of the tinkling glasses within.
Supper over, the company went back to the bar-room, when, knowing not
what else to do with myself, I resolved to spend the rest of the
evening as a looker on.
Presently a rioting noise was heard without.
I began to twitch all
over.
I then placed the
first bench lengthwise along the only clear space against the wall,
leaving a little interval between, for my back to s

We can see that the search has worked out more or less as desired, but the results are not perfect. Among the first few hits we already notice that the word *that* has been erroneously tagged as a preposition where it is really being used as a demonstrative pronoun ("Tell me that."). There are also some words that are not really bare prepositions but part of a phrasal verb, such as *falling in*, or that are used adverbially, such as *without* in "was heard without".

Identifying the grammatical structure of natural language is a hard task of the sort that humans can still do much more accurately than computers.
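Separately from these tagging errors, the `endswith_preposition()` function above has a more mechanical limitation: it skips at most one trailing punctuation token, and it raises an `IndexError` on a sentence that consists of a single punctuation token. The variant below is a sketch of one way around this, assuming (as the universal tagset does) that closing quote marks are also tagged `'.'`. The function name and the hand-tagged test sentences are made up for illustration.

```python
def endswith_preposition_robust(sentence):
    """Return True if the last non-punctuation token is tagged 'ADP'."""
    # Walk backwards past any number of trailing punctuation tokens.
    for word, pos in reversed(sentence):
        if pos != '.':
            return pos == 'ADP'
    # Empty sentence, or nothing but punctuation.
    return False

# Hand-tagged examples; the tags are illustrative assumptions.
quoted = [('``', '.'), ('What', 'PRON'), ('is', 'VERB'), ('it', 'PRON'),
          ('for', 'ADP'), ('?', '.'), ("''", '.')]
only_punct = [('!', '.')]
plain = [('Call', 'VERB'), ('me', 'PRON'), ('Ishmael', 'NOUN'), ('.', '.')]

for example in [quoted, only_punct, plain]:
    print(endswith_preposition_robust(example))
```

This does nothing about mis-tagged words such as *that*. It only makes the search less sensitive to how a sentence happens to be punctuated.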

Let's see how we fare with the second task: identifying conjunctions at the beginning of sentences.

In [13]:
def startswith_conjunction(sentence):
    return sentence[0][1] == 'CONJ'

testNegative = 'This is an example of a sentence that does not begin with a conjunction.'
testPositive = 'And this is an example of one that does.'

for test in [testNegative, testPositive]:
    test = nltk.pos_tag(nltk.word_tokenize(test), tagset='universal')
    print(startswith_conjunction(test))

False
True


In [14]:
violations = [i for i, s in enumerate(tokens) if startswith_conjunction(s)]

for i in range(10):
    print(sentences[violations[i]])

But these are all landsmen; of week days pent up in
lath and plaster--tied to counters, nailed to benches, clinched to
desks.
But look!
And there they
stand--miles of them--leagues.
But here is an artist.
But though the picture
lies thus tranced, and though this pine-tree shakes down its sighs
like leaves upon this shepherd's head, yet all were vain, unless the
shepherd's eye were fixed upon the magic stream before him.
And still deeper the meaning of
that story of Narcissus, who because he could not grasp the
tormenting, mild image he saw in the fountain, plunged into it and
was drowned.
But that same image, we ourselves see in all rivers and
oceans.
And as for going as cook,--though I confess
there is considerable glory in that, a cook being a sort of officer
on ship-board--yet, somehow, I never fancied broiling fowls;--though
once broiled, judiciously buttered, and judgmatically salted and
peppered, there is no one who will speak more respectfully, not to
say reverent

We fare rather better on this task, probably because far fewer words can serve as conjunctions and because these words only very rarely play other grammatical roles.

## spacy

nltk began life as a teaching tool in linguistics rather than as state-of-the-art NLP software, and this remains its primary application. The tokenizing and tagging tools provided in nltk are in principle extensible with custom-made tokenizing and tagging models, but for even moderately complex research tasks basic nltk is inadequate. If you find that nltk fails to identify the linguistic phenomenon you are interested in, there are other tools available that lie further out along the performance dimension of the simplicity-performance trade-off.

For Python, foremost among these tools is currently [spacy](https://spacy.io/). spacy provides more complex language models that can assign POS tags in a way that depends a lot more on the context in which words are used, and can be more easily extended with machine learning techniques. spacy has been fairly extensively optimized so as to make efficient use of memory when processing large texts. And it supports many more languages than just English.

There is also a spacy [conference and workshop](https://irl.spacy.io) that this year (2019) takes place here in Berlin.

Like nltk, spacy makes use of additional knowledge in the form of fairly large data files that need to be downloaded in addition to the base package. spacy organizes this additional data by language. To download the data for English, you will need to first run spacy as a script using Python from the command line with the following parameters:

`python3 -m spacy download en`

* `python3` ensures that Python 3 is used. If you prefer to use your system's default Python version or Python 2, replace this with just `python` or `python2`, respectively.
* The `-m` option to Python runs a Python module as a script.
* The remaining arguments `download` and `en` are passed on to the main spacy file when run as a script, instructing it to download the English language models for tokenizing, tagging, etc.

If you have run the above command, you can load spacy's English language model into a spacy NLP object using the `load()` function. Since spacy does some fairly memory-intensive processing, it guards against running out of memory by setting a safe limit on the number of characters that should be processed (by default one million).

If we know that the processing that we want to do will be fairly simple and is unlikely to exceed memory requirements, we can increase the maximum allowed length to that of our document (which in this case is not very far over a million characters).

In [15]:
print(len(md))

1242990


In [16]:
nlp = spacy.load('en', max_length=len(md))

The resulting NLP object is directly callable (i.e. can be used as a function), with a text as its input argument.

The way that spacy NLP objects interact with text is organized as a 'pipeline': a sequence of processing steps, each of which takes as its input the output of the previous step. By default the pipeline includes the following steps:

* *tokenize*: split the text into tokens, most commonly words
* *tag*: tag each token according to its role
* *parse*: arrange tokens in a structure of 'dependencies' indicating which other word each word refers to
* *name entities*: recognize 'named entities' in the text (for example proper names of countries, people, companies, etc.)

The first two of these we have already encountered in nltk. The third is new and provides us with a lot more information about the structure of a text, since in addition to just labelling words with their roles it links words that refer to each other. The fourth is also new, and particularly important for political or commercial applications. You can read more about the default spacy pipeline [here](https://spacy.io/usage/processing-pipelines).
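The pipeline idea itself is simple enough to sketch in plain Python: each step is a function that receives the output of the previous step. The toy tokenizer, the two-word lexicon and all the names below are invented for illustration and have nothing to do with how spacy implements its components.

```python
def tokenize(text):
    # Step 1: split the raw text into tokens (naively, on whitespace).
    return [{'text': word} for word in text.split()]

def tag(tokens):
    # Step 2: attach a POS label to each token from a tiny made-up lexicon.
    lexicon = {'Call': 'VERB', 'me': 'PRON'}
    for token in tokens:
        token['pos'] = lexicon.get(token['text'], 'X')
    return tokens

def run_pipeline(text, steps):
    # Feed the output of each step into the next one.
    result = text
    for step in steps:
        result = step(result)
    return result

toy_doc = run_pipeline('Call me Ishmael', [tokenize, tag])
print(toy_doc)
```

Removing a step from the list skips that part of the processing, which is the same idea as dropping unneeded components from a spacy pipeline to save time and memory.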

We can view the current pipeline of an NLP object via an attribute.

In [17]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f903635c5f8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f90305da588>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f90305da5e8>)]

We won't always need every step in the pipeline, and we can save processing time and memory usage by excluding the steps we don't need with the `remove_pipe()` method. We do not need dependency parsing or tagging of named entities.

We can also add components. Since our task concerns sentences, we will need a component that parses the text into sentences. spacy already provides a `'sentencizer'` component for this. To add components, we must first initialize them with `create_pipe()`, then add them to our NLP object.

In [18]:
for component in ['parser', 'ner']:
    nlp.remove_pipe(component)

sentencizer = nlp.create_pipe('sentencizer')
nlp.add_pipe(sentencizer, before='tagger')

nlp.pipeline

[('sentencizer', <spacy.pipeline.pipes.Sentencizer at 0x7f90306d0dd8>),
 ('tagger', <spacy.pipeline.pipes.Tagger at 0x7f903635c5f8>)]

Now we can apply the pipeline to process the text of Moby Dick that we loaded above. (This is again computationally intensive and may take a moment.)

In [19]:
doc = nlp(md)

The resulting document object is an iterable of tokens.

In [20]:
firstWord = doc[5100]

print(firstWord)

Call


Each token is a spacy Token object.

In [21]:
type(firstWord)

spacy.tokens.token.Token

These Token objects have various attributes depending on the processing pipeline that we applied. If the pipeline included POS tagging, then the tokens have `.pos` attributes.

One of the ways in which spacy conserves memory when repeatedly processing tokens is by referring to attributes via integer codes (since integers are cheaper to store and quicker to look up).
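The general idea of storing integer codes in place of repeated strings can be sketched with a small lookup table. This is a toy illustration under my own assumptions (the class name and the sequential numbering are made up), and it is not how spacy assigns its codes.

```python
class ToyStringStore:
    """Two-way mapping between strings and small integer codes."""

    def __init__(self):
        self._code_of = {}    # string -> integer code
        self._string_of = []  # integer code -> string

    def add(self, string):
        # Assign the next free code the first time a string is seen.
        if string not in self._code_of:
            self._code_of[string] = len(self._string_of)
            self._string_of.append(string)
        return self._code_of[string]

    def lookup(self, code):
        # Recover the human-readable string for a code.
        return self._string_of[code]

store = ToyStringStore()
pos_codes = [store.add(pos) for pos in ['VERB', 'PRON', 'NOUN', 'VERB']]

print(pos_codes)
print([store.lookup(code) for code in pos_codes])
```

Each distinct label is stored as a string only once, however many tokens carry it, and the tokens themselves hold only integers.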

In [22]:
print(firstWord.pos)

100


But we will more often want to know the meaning of the POS tags. The human-readable labels for spacy attributes are stored in the same-named attributes followed by an underscore `_`.

In [23]:
print(firstWord.pos_)

VERB


We can see that the first word of the novel (*call*) has been tagged as a verb.

spacy got this one right. Note that the word *call* can be either a noun or a verb depending on context. It is a noun in such well-known phrases as "Final call for passenger Tudge on Ryanair flight FR1144", but a verb in the imperative mood in such phrases as "[call me](https://www.youtube.com/watch?v=StKVS0eI85I)" (or if you are younger and a little more timid "[call me *maybe*](https://www.youtube.com/watch?v=fWNaR-rxAic)"). Though this distinction may seem trivial, drawing it is an example of a task that is [easy for humans but hard for computers](https://en.wikipedia.org/wiki/Moravec%27s_paradox).

Many of spacy's POS tags have intuitive names or names in common with other standard tagsets. For a full listing, see the [documentation here](https://spacy.io/api/annotation#pos-tagging).

Because we included a sentencizer in the pipeline, the document has a `.sents` attribute that can be iterated. We can now use this to get the sentences that match our pattern. It is necessary also to update our search functions to take into account spacy's POS tags.

In [24]:
def endswith_preposition(sentence):
    if sentence[-1].pos_ == 'PUNCT':
        endPos = -2
    else:
        endPos = -1
    return sentence[endPos].pos_ == 'ADP'

def startswith_conjunction(sentence):
    # spacy labels coordinating conjunctions 'CCONJ' (nltk's universal tagset uses 'CONJ')
    return sentence[0].pos_ == 'CCONJ'

violations = [s.text for s in doc.sents if endswith_preposition(s)]

spacy seems to have done somewhat better here at recognizing prepositions only when they are used as prepositions proper.

In [25]:
for violation in violations[:10]:
    print(violation)



"No, Sir, 'tis a Right Whale," answered Tom; "I saw his sprout; he
threw up a pair of as pretty rainbows as a Christian would wish to
look at.


Again, I always go to sea as a sailor, because they make a point of
paying me for my trouble, whereas they never pay passengers a single
penny that I ever heard of.


Presently a rioting noise was heard without.
 But I lay perfectly still, and resolved not to say a
word till spoken to.
 And the man that has anything bountifully
laughable about him, be sure there is more in that man than you
perhaps think for.


The bar-room was now full of the boarders who had been dropping in
the night previous, and whom I had not as yet had a good look at.
 He charges him thrice the
usual sum; and it's assented to.
 Hearing him foolishly fumbling
there, the Captain laughs lowly to himself, and mutters something
about the doors of convicts' cells being never allowed to be locked
within.

Upon this, I told him that whaling was my own des

END