The NLTK's corpora: An introduction
=======

_Practical Python for Linguistics and the Humanities -- Alexis
Dimitriadis_

## Contents


**[1. Preliminaries: Corpora and other "data"](#1.-Preliminaries:-Corpora-and-other-"data")**  

**[2. Corpora in the NLTK](#2.-Corpora-in-the-NLTK)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.1 Recap: How we open a regular file](#2.1-Recap:-How-we-open-a-regular-file)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.2 How we access NLTK corpora](#2.2-How-we-access-NLTK-corpora)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.3 Reading words from a corpus](#2.3-Reading-words-from-a-corpus)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.4 Reading sentences from a corpus](#2.4-Reading-sentences-from-a-corpus)  

**[3. Some common tasks](#3.-Some-common-tasks)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.1 Examining each sentence in a corpus](#3.1-Examining-each-sentence-in-a-corpus)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.2 Making a list of words into a string again](#3.2-Making-a-list-of-words-into-a-string-again)  

**[4. The common NLTK corpus model](#4.-The-common-NLTK-corpus-model)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.1 Categorized corpora](#4.1-Categorized-corpora)  

**[5. Activities: Working with NLTK corpora](#5.-Activities:-Working-with-NLTK-corpora)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.1 Exercise A1](#5.1-Exercise-A1)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.2 Exercise A2](#5.2-Exercise-A2)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.3 Exercise A3](#5.3-Exercise-A3)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.4 Exercise A4](#5.4-Exercise-A4)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.5 Exercise A5](#5.5-Exercise-A5)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.6 Exercise A6](#5.6-Exercise-A6)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.7 Exercise A7](#5.7-Exercise-A7)  

**[6. What we learned](#6.-What-we-learned)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [6.1 What you should know by heart](#6.1-What-you-should-know-by-heart)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [6.2 What you should remember you saw](#6.2-What-you-should-remember-you-saw)  


The Natural Language Toolkit (NLTK) is a collection of programs and
resources for teaching and carrying out tasks in natural language
processing. Today we focus on corpora.

### Background reading

[Chapter 2, section 1][ch2.1] of the NLTK book, for the NLTK corpus
environment

[ch2.1]: http://www.nltk.org/book/ch02.html#accessing-text-corpora

## 1. Preliminaries: Corpora and other "data"

The NLTK **software** is included as part of Anaconda, but its collection of **corpora** and other "NLTK data" must be downloaded separately.
In the PC classrooms, we have taken care of that for you; at home or
in the Mac rooms, download a starter kit yourself by running the
following commands. They will download the collection of resources
discussed in the NLTK book and make them available for later use. A
resource only needs to be downloaded **once** (per computer), hence this
code tests the waters and only launches the download if the NLTK
resources are absent.

In [3]:
import nltk
try:
    nltk.pos_tag("ok".split())
    print("The nltk is ready to use")
except LookupError:
    print("Downloading the nltk's \"book\" bundle. This will take a few minutes.")
    nltk.download("book")


If you call `nltk.download()` without an argument, it will pop up an
interactive graphical window that allows you to browse the NLTK's
collection of resources. You can use it to download individual corpora
and resources, or bundled "collections" like `"book"`. 

On some configurations, this window cannot be opened from inside a Notebook
without special arrangements. If necessary, run it from the IPython console in Spyder, from a script run with Spyder, or from a command-line Python
prompt.
But first check whether the downloader window is simply hidden under other windows!

If a resource is successfully installed, you can access it through the
NLTK modules that are designed to use it. After installing the book
bundle, you will (inter alia) be able to execute the following test
code successfully:

In [None]:
from nltk.corpus import brown
print(brown.readme())

### Your turn:

1. If you have not done so yet, run the code cell above that installs
   the NLTK's "book" bundle. Check that it succeeded by running the test
   snippet.

2. Pop up an interactive `nltk.download()` window. Browse the available
   resources, and find and install the Alpino parsed corpus of Dutch.

3. To confirm that the Alpino corpus is now available, import the
   object `alpino` from `nltk.corpus` and view its README text.

In [None]:
# YOUR CODE:



## 2. Corpora in the NLTK

The NLTK gives us easy access to its collection of corpora (and later,
to our own).

Corpora usually consist of a lot of files. We've learned how to open
one or more files directly with `os.listdir()` and `open()`, and how
to read their contents in various ways. But the NLTK provides ways of
accessing an entire corpus with one command, and different formats for
its text. In hopes of preventing confusion, we'll compare the two
methods now.

### 2.1 Recap: How we open a regular file

In [None]:
with open("RedCircle.txt") as conn: # or the path to this file, e.g. "../practicum-4/RedCircle.txt"
    text = conn.read()

We open files with `open()` and read their contents with `read()`, as
a single long **string**. We then split them into suitably-sized
pieces (words or lines).

The string method `split()` gives us a **list of words** (with
punctuation still attached).

The string method `splitlines()` splits our text into a
**list of lines**.
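As a minimal sketch, here are both methods applied to a short made-up string. The variable `sample` and its content are assumptions for illustration, standing in for the text read from a real file.

```python
# A short stand-in for the text of a file (hypothetical content).
sample = "It was a dark night.\nNobody spoke, not even Holmes."

words = sample.split()       # split on any whitespace, including newlines
lines = sample.splitlines()  # split at line breaks only

print(words)   # note "night." and "spoke,": punctuation stays attached
print(lines)
```

Both results are ordinary lists of strings, so `len()`, indexing and `for` loops work on them as usual.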

### Your turn:

Split the string `text` into a list of lines and into a list of words.

In [None]:
# YOUR CODE:



Although lines are easy to work with,
we are almost always interested in **sentences**, not lines.
And while we have been splitting strings on whitespace, it's better to
separate punctuation from words so that the word tokens are easier to
work with.
The NLTK can do these things for us, but it has a _different_ way of
representing text.

### 2.2 How we access NLTK corpora

As we have already seen, standard NLTK corpora are simply imported
from the `nltk.corpus` module. Here is another one, a selection of
texts from the Gutenberg project. It is part of the NLTK's "book"
collection, so it should now be present and ready to use.

In [None]:
import nltk
from nltk.corpus import gutenberg

The variable `gutenberg` is not a list of words or sentences: It is a
"corpus reader" object that knows where to find the Gutenberg corpus
data (which the NLTK downloader installed in a known location). The
corpus reader's methods allow us to access its contents in various
ways without specifying a path.

Among other things we can list the files that make up the corpus, or
we can get a list of the words or _sentences_ of selected files, or of
the entire corpus; all **without explicitly opening a single file.**
Here's how to list the files that make up the corpus:

In [None]:
print(gutenberg.fileids())

The `fileids()` method returns an ordinary list of strings. We can save the list in a variable, loop over it to print the file names one per line, etc. We won't need to open the files ourselves, though. The `nltk` does that for us.
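The following sketch shows what can be done with such a list. To keep it runnable without the corpus data, it uses a short hand-written list (the name `fileids` and its four entries are stand-ins; the real list returned by `gutenberg.fileids()` is longer).

```python
# Stand-in for the list returned by gutenberg.fileids() (assumed subset).
fileids = ["austen-emma.txt", "austen-persuasion.txt",
           "austen-sense.txt", "blake-poems.txt"]

# Print the file names one per line.
for name in fileids:
    print(name)

# Ordinary list and string operations work too,
# e.g. keeping only the names that start with "austen-".
austen = [name for name in fileids if name.startswith("austen-")]
print(len(austen), "files by Austen")
```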

### 2.3 Reading words from a corpus

Getting a list of all word tokens in a corpus is very simple:

In [None]:
allwords = gutenberg.words()

To get the words from just one file from the corpus, specify it as an
argument:

In [None]:
sense = gutenberg.words("austen-sense.txt")
print(sense[0:60])

Or more readably:

In [None]:
print(" ".join(sense[0:60]))

It is also possible to specify **a list of file ids**, instead of just
one.
```python
sometexts = gutenberg.words(["austen-emma.txt", "blake-poems.txt"])
```
If you look carefully, you'll see that commas and other punctuation
are returned as separate tokens: The NLTK has separated them from the
word they were attached to.
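To see the difference this makes, the sketch below contrasts whitespace splitting with a rough regular-expression approximation of tokenization. The pattern is an assumption chosen for illustration, and should not be taken as the tokenizer the NLTK actually applies to this corpus.

```python
import re

sentence = "Emma Woodhouse, handsome, clever, and rich, was happy."

# Splitting on whitespace leaves punctuation glued to the words.
glued = sentence.split()
print(glued)

# Rough approximation of tokenization: a run of word characters,
# or a single character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```

With separate punctuation tokens, a test like `w == "rich"` succeeds, whereas the whitespace-split version contains `"rich,"` instead.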

Often there's no need to assign the words to a variable, since we can
process the value returned by `words()` directly:

In [None]:
for w in gutenberg.words("austen-sense.txt"):
    if w.endswith("inging"):
        print(w, end=" ")
        # or do something else with the word

### Your turn:

Print out all words in Jane Austen's _Sense and Sensibility_ that end
with `-nesses`. [There are just two.]

In [None]:
# YOUR CODE:



### 2.4 Reading sentences from a corpus

We often want to examine our corpus one sentence at a time. The NLTK
can give us a whole novel (or an entire corpus) as a **list of
sentences**, with each sentence broken up into words (or more
precisely, into tokens):

In [None]:
emma = gutenberg.sents(fileids="austen-emma.txt")
for sent in emma[:6]:
    print(sent)

**Note carefully:** Each sentence is now a **list of words**, not a
single string! A list of sentences, such as `emma`, is no longer a
list of strings but **a list of lists of strings** (where each string
is a word, not a sentence).


This is different from how we worked until now. Lists of lists are
more complex to work with, but generally more useful.

In fact, `sents()` and `words()` do not return a list, but a "view". The view object will read the corpus in small pieces, and only when necessary. This allows very large corpora to be searched without running out of memory.

In [None]:
type(gutenberg.sents(fileids="austen-emma.txt"))

## 3. Some common tasks

### 3.1 Examining each sentence in a corpus

Searching in a list of words is fine if we're just looking for
individual words. But to see the sentences that contain some word, we
must loop over sentences and examine each word of the current
sentence. If we find a match, we can print or save the entire
sentence.

Some care is required:  If what we're searching for occurs in a
sentence twice, we don't want to count or save the sentence twice. As
soon as we find something in a sentence, we should print or process it
and immediately move on to the next sentence.

For this we can use the `break` statement, which immediately ends
execution of the nearest *loop.* Execution continues with the next
statement, which might be inside another, higher loop, as in our example:

In [None]:
emma = gutenberg.sents('austen-emma.txt')
examples = list()
for sentence in emma:
    for word in sentence:
        if word.lower() == 'these':
            examples.append(sentence)
            break   # "break" out of the nearest loop: "for word in sentence"
    
print(len(examples), 'sentences contain the word "these"')
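To see what `break` prevents, the sketch below runs the same search twice on a tiny invented list of sentences (the name `tiny` and its content are assumptions), once with `break` and once without.

```python
# Invented data: the first sentence contains the search word twice.
tiny = [
    ["these", "and", "these", "again"],
    ["nothing", "here"],
    ["These", "too"],
]

with_break = []
for sentence in tiny:
    for word in sentence:
        if word.lower() == "these":
            with_break.append(sentence)
            break   # stop scanning this sentence after the first match

without_break = []
for sentence in tiny:
    for word in sentence:
        if word.lower() == "these":
            without_break.append(sentence)   # runs once per matching word

print(len(with_break), "sentences saved with break")
print(len(without_break), "sentences saved without break")
```

Without `break`, the first sentence is saved twice, so the count is too high.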

Remember that `emma` is a **list of sentences,** and each sentence is
a **list of words.** If it's still not clear, print out some elements
(e.g., `emma[6]`) and look at them carefully.

1. The variable `emma` is a list of sentences. We can loop over it.
2. Each sentence is a list of words. We can loop over it.
3. Each word is a string, and we can use string methods or regexp
searches on it.
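The three levels can be sketched as follows, using a regular expression at the innermost level. The data and the pattern (words of the form un...ly) are invented for illustration.

```python
import re

# Invented stand-in for a list of sentences.
tiny = [
    ["She", "was", "unusually", "quiet", "."],
    ["He", "left", "early", "."],
]

pattern = re.compile(r"^un\w*ly$")   # hypothetical pattern: un...ly

matches = []
for sentence in tiny:                      # 1. loop over the sentences
    for word in sentence:                  # 2. loop over the words
        if pattern.search(word.lower()):   # 3. test one word (a string)
            matches.append(sentence)
            break                          # one match per sentence is enough

print(matches)
```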

### 3.2 Making a list of words into a string again

It can be useful to join the list of words back into a single string,
e.g. to easily print a sentence in a readable way, or to save it to a
file with `write()`. Recall also that regular expressions can only
search in a string, not in a list of words. To transform *each
sentence* into a separate string, we loop over the list of sentences
and transform each one into a string using `join()`. The punctuation
looks a little odd, but it's readable enough and easy to search with a regexp.

In [None]:
for sent in emma[55:60]:
    print(" ".join(sent))
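Joining also makes it possible to search for something that spans several words, which a word-by-word loop cannot easily do. The sketch below uses invented sentences and a hypothetical two-word search phrase.

```python
import re

# Invented stand-in for a list of tokenized sentences.
tiny = [
    ["I", "am", "very", "much", "obliged", "to", "you", "."],
    ["It", "was", "very", "cold", "."],
]

hits = []
for sent in tiny:
    line = " ".join(sent)                   # one string per sentence
    if re.search(r"\bvery much\b", line):   # the phrase spans two tokens
        hits.append(line)

for line in hits:
    print(line)
```

Note that the joined string has a space before the final period, because every token is separated by a space.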

### Your turn:

Print (readably, as shown above) only the sentences from Jane Austen's
_Emma_ that contain the word `contemptible`. Repeat with the word
`abominable`. (Note how they differ in frequency.)

In [None]:
# YOUR CODE:



## 4. The common NLTK corpus model

NLTK corpora have a consistent interface. E.g., they all have a
`fileids()` method. If the corpus has a `README` file with information, we can view it with the `readme()` method. 
All corpora have methods that make the text available in
a choice of formats:

In [None]:
gutenberg.raw()    # The corpus as one huge string
gutenberg.words()  # All words (tokens) as a flat list of strings
gutenberg.sents()  # A sent is a list of words
gutenberg.paras()  # A paragraph is a list of sents

All these can be restricted to one or
more files by specifying the file ids. Multiple file ids must be given
as a list:

In [None]:
emma_words = gutenberg.words('austen-emma.txt')
all_austen = gutenberg.paras(['austen-emma.txt', 
                              'austen-persuasion.txt', 'austen-sense.txt'])
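The formats differ only in how deeply the lists are nested. The sketch below uses a hand-made stand-in for the value of `paras()` (invented content) and flattens it step by step to the shapes of `sents()` and `words()`.

```python
# Stand-in for paras(): a list of paragraphs, each a list of sentences,
# each a list of word tokens.
paras = [
    [["Chapter", "1"]],
    [["Emma", "was", "happy", "."], ["She", "was", "rich", "."]],
]

# Removing one level of nesting gives the shape of sents() ...
sents = [sent for para in paras for sent in para]
# ... and removing another gives the flat shape of words().
words = [word for sent in sents for word in sent]

print(len(paras), "paragraphs,", len(sents), "sentences,", len(words), "tokens")
```

In practice there is no need to flatten by hand, since the corpus reader provides each format directly; the sketch only shows how the shapes relate.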

Many corpora are subdivided into categories; others include annotations to the text, such as part of speech
tags (POS tags). These features are available through additional methods, which are also consistently named. We'll see some of them later.

### 4.1 Categorized corpora

The Brown corpus consists of hundreds of files, which are "categorized" into a number of different genres (types of text). We can list them with the method `categories()`. 

In [None]:
print(brown.categories())

Categories, like file ids, are returned in an ordinary list which we
can use with other Python functions. When listing file IDs or
extracting text, we can use the argument `categories` to restrict
results to files belonging to one or more categories:

In [None]:
print(brown.fileids(categories='government'))

newstexts = brown.sents(categories='news')
print(len(newstexts), 'sentences in the "news" category')

Note that many corpora are not subdivided into categories, and have no `categories()` method. 
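Code that should work with either kind of corpus can check for the method before calling it. The sketch below uses two tiny stand-in classes, `PlainReader` and `CategorizedReader`, and a helper `list_categories()`; all three are hypothetical and are not NLTK classes or functions.

```python
class PlainReader:
    """Hypothetical stand-in for a corpus reader without categories."""
    def fileids(self):
        return ["a.txt", "b.txt"]


class CategorizedReader(PlainReader):
    """Hypothetical stand-in for a categorized corpus reader."""
    def categories(self):
        return ["news", "romance"]


def list_categories(corpus):
    """Return the corpus categories, or an empty list if it has none."""
    if hasattr(corpus, "categories"):
        return corpus.categories()
    return []


print(list_categories(PlainReader()))
print(list_categories(CategorizedReader()))
```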

## 5. Activities: Working with NLTK corpora

**Most of the activities rely on the Gutenberg corpus,** which
includes several novels by Jane Austen.

Starred exercises are more challenging than unstarred ones.

### 5.1 Exercise A1

Import the nltk's Brown corpus, as follows:

    
```python
from nltk.corpus import brown
```
a. Find out *how many* files it contains. (Do **not** count them by
   hand!)

b. Report the total number of files, words and sentences in the Brown
   corpus. It's easier to count punctuation as "words" too (i.e., to simply count tokens), but you
   can choose to filter it out and only count words that contain
   letters or numbers.

   Hint: The value returned by `words()` etc. is a kind of list
   (really a "view"), so you can use `len()` on it.

In [None]:
# YOUR CODE:



### 5.2 Exercise A2

Find and print all words in Jane Austen's "Emma" that
start with "ung" or "Ung".

In [None]:
# YOUR CODE:



### 5.3 Exercise A3

Instead of printing the words from the previous
problem,
   collect them into a list.
   Print the number of words you found (with an explanatory message).

In [None]:
# YOUR CODE:



### 5.4 Exercise A4

Construct a *set* of all words ending in -ings from
_all_ of Jane Austen's novels. How many different words were found?
Make your code report the number.

In [None]:
# YOUR CODE:



### 5.5 Exercise A5

**Count** how many words in the Gutenberg corpus end
in -ings, and how many sentences contain one or more such words. There
are 4717 words in 4133 sentences, so don’t print them all out!

In [None]:
# YOUR CODE:



### 5.6 Exercise A6

Count and report how many words in the Gutenberg
corpus contain two k’s, not necessarily adjacent. Don’t forget about
capitalization. [Correct answer: 355]

In [None]:
# YOUR CODE:



### 5.7 Exercise A7

Collect into a list all __sentences__ from Jane
Austen's novels that contain a word with two k's.

In [None]:
# YOUR CODE:



## 6. What we learned

### 6.1 What you should know by heart

* How to gain access to the nltk's `brown` and `gutenberg` corpora.
* How to iterate over the words and sentences in an nltk corpus.
* How to print out a sentence (or several) in a readable way.
* How to search over all sentences, or over all words, in a corpus.

### 6.2 What you should remember you saw

* The use of the methods `fileids()` and `categories()`
* Searches can be restricted to one, or several, categories of a categorized corpus.

THE END