# NLTK - Corpora

A **text corpus** is a dataset of **natural language** text, compiled to represent a specific domain or aspect of language. Typical sources include books, transcripts, websites, and correspondence. Corpora can also be annotated with linguistic features and may contain text in many languages.

A corpus should be large, principled, and authentic.

Unlike a simple database dump of text data (such as those offered by Wikipedia), corpora are curated to be useful in linguistic analysis, often offering balanced representation of different genres or languages, and human annotation.

- How can they be useful? What kind of projects would you use a corpus for?
- What types of corpora exist? Consider diachronic/synchronic/monitor, monolingual/multilingual/parallel, raw/tagged, learner corpora, and error-tagged corpora.

## 1. Exploring Corpora

NLTK enables us to access pre-compiled corpora. We will open and explore four of them today.

In [1]:
import nltk
from nltk.corpus import gutenberg, inaugural, brown, dependency_treebank
import matplotlib.pyplot as plt
import random

We download the four corpora we will study:

In [2]:
# The Gutenberg Corpus contains several literary texts from Project Gutenberg.
nltk.download('gutenberg')

# The Inaugural Address Corpus contains the texts of the inaugural addresses of U.S. presidents.
nltk.download('inaugural')

# The Brown Corpus is a standard corpus of American English texts.
nltk.download('brown')

# The Dependency Treebank Corpus contains sentences annotated with their syntactic structure.
nltk.download('dependency_treebank')

# The Punkt tokenizer is used for sentence splitting.
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\KatElmas/nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.
[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\KatElmas/nltk_data...
[nltk_data]   Unzipping corpora\inaugural.zip.
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\KatElmas/nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.
[nltk_data] Downloading package dependency_treebank to
[nltk_data]     C:\Users\KatElmas/nltk_data...
[nltk_data]   Unzipping corpora\dependency_treebank.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\KatElmas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Next, we print the IDs of the files that make up a corpus, here the Brown Corpus:

In [7]:
# print the file IDs in the Brown corpus
for fileid in brown.fileids():
    print(fileid)

ca01
ca02
ca03
ca04
ca05
ca06
ca07
ca08
ca09
ca10
ca11
ca12
...

### 1.1. Analysing the Gutenberg Corpus

Let's dive into one of the texts and perform some simple exploratory analysis.

We can load the dataset in several forms, including the raw characters, list of words, or list of sentences, like so:

```
raw_text = gutenberg.raw()
words = gutenberg.words()
sentences = gutenberg.sents()
```

In [9]:
# read raw text of Alice in Wonderland using .raw()
alice_raw = gutenberg.raw('carroll-alice.txt')

# print first 500 characters
print(alice_raw[0:500])


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an


In [10]:
# .words() method to access the text as a list of words
alice_words = gutenberg.words('carroll-alice.txt')

# print first 50 words
print(alice_words[0:50])


['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit', '-', 'Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into']


In [11]:
# .sents() method to access the sentences
alice_sentences = gutenberg.sents('carroll-alice.txt')

# print two sentences
print(alice_sentences[6:8])


[['Oh', 'dear', '!'], ['I', 'shall', 'be', 'late', "!'"]]


Let's print some statistics about the text:

In [12]:
# print statistics
print(f"Total Characters: {len(alice_raw)}")
print(f"Total Words: {len(alice_words)}")
print(f"Total Sentences: {len(alice_sentences)}")


Total Characters: 144395


In [14]:
# find out number of unique words
unique_words = set(alice_words)
print(f"Total unique words: {len(unique_words)}")


Total unique words: 3016


Let's perform some frequency analysis on Alice in Wonderland. You will use `FreqDist` later on.

In [16]:
# print the frequency distribution of words in Alice in Wonderland
fdist = nltk.FreqDist(alice_words)
fdist


FreqDist({',': 1993, "'": 1731, 'the': 1527, 'and': 802, '.': 764, 'to': 725, 'a': 615, 'I': 543, 'it': 527, 'she': 509, ...})

In [None]:
# show 30 most common words



Most of these aren't very insightful. Punctuation and function words are very common in all texts.
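To illustrate the point, here is a minimal sketch on a hypothetical toy token list rather than on the real corpus. It uses `collections.Counter` from the standard library; `nltk.FreqDist` is a subclass of `Counter`, so `most_common()` behaves the same way there.

```python
from collections import Counter

# Hypothetical toy tokens standing in for a real corpus.
tokens = ["the", "cat", ",", "the", "dog", ",", "and", "the", "bird", "."]

# Count every token, punctuation included.
counts = Counter(tokens)

# The top of the list is dominated by a function word and a punctuation mark.
top_three = counts.most_common(3)
print(top_three)
```

Even in ten tokens, "the" and the comma come out on top, which is the same pattern the full text shows.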

In [21]:
# filter punctuation and stopwords for better visualization
nltk.download('stopwords', quiet=True)  # the stopword list must be downloaded before first use
stopwords = set(nltk.corpus.stopwords.words("english"))
alice_words_filtered = [word for word in alice_words if word.lower() not in stopwords and word.isalpha()]
fdist_filtered = nltk.FreqDist(alice_words_filtered)
fdist_filtered


FreqDist({'said': 456, 'Alice': 396, 'little': 125, 'one': 94, 'know': 87, 'like': 84, 'went': 83, 'thought': 74, 'Queen': 74, 'could': 73, ...})

Why and when should we filter out stopwords and punctuation? What about capitalisation?
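As one way to think about the capitalisation question, the sketch below counts the same toy tokens twice: once case-sensitively and once after lowercasing. The token list and the three-word stopword set are hypothetical stand-ins; NLTK's English stopword list is much longer.

```python
from collections import Counter

# Hypothetical mini stopword list and toy tokens, for illustration only.
toy_stopwords = {"the", "and", "a"}
tokens = ["The", "Queen", "and", "the", "queen", ",", "a", "Rabbit", "!"]

def keep(token):
    """Keep alphabetic tokens that are not stopwords (compared in lowercase)."""
    return token.isalpha() and token.lower() not in toy_stopwords

# Case-sensitive: "Queen" and "queen" are counted as different words.
case_sensitive = Counter(t for t in tokens if keep(t))

# Case-folded: both spellings are merged into "queen".
case_folded = Counter(t.lower() for t in tokens if keep(t))

print(case_sensitive)
print(case_folded)
```

Lowercasing merges sentence-initial and mid-sentence spellings, but it also merges proper nouns with common nouns, so whether to do it depends on the question being asked.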

### 1.2. ✏️ Your turn!

Try performing a similar exploration of the `inaugural` or `brown` corpus. Any surprises? What about if you restrict it to one of their files or categories?

For either the entire corpus or one of the files/categories, print:

- The number of words
- The 30 most common words
- The 30 most common words, filtering out stopwords and punctuation

Extra:
- Find "hapax legomena": words that occur only once in a corpus. Hint: `fdist['word']` returns the count for that word. So you can do `for word in fdist.keys():`...
- Normalise capitalisation, so that "The" and "the" are counted as the same word. Hint: in Python, you can do `word.lower()` to get the lowercase version of a word.
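The two hints above can be combined as in the following sketch, which runs on a hypothetical toy token list and uses `collections.Counter` in place of `nltk.FreqDist` (a `FreqDist` can be iterated the same way, and it also offers a `hapaxes()` method).

```python
from collections import Counter

# Hypothetical toy tokens; in the notebook these would come from a corpus.
tokens = ["To", "be", "or", "not", "to", "be", "that", "is"]

# Normalise capitalisation so "To" and "to" are counted together.
counts = Counter(token.lower() for token in tokens)

# A hapax legomenon is a word whose count is exactly 1.
hapaxes = sorted(word for word, count in counts.items() if count == 1)
print(hapaxes)
```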

In [None]:
# ✏️ STUDENTS: Your code here!



What can we learn by exploring a corpus in this way? 

### 1.3. Comparative Analysis

What can comparing different corpora or categories within a corpus tell us?

Finding differences between texts of different genres, authors, or periods is called **stylistics**.

In [None]:
# use FreqDist to compute proportions of "said" per Gutenberg file
# Fill code here

# plot the proportion of "said" in each Gutenberg file as a barplot
plt.figure(figsize=(14,4))
plt.bar(gutenberg.fileids(), fd_gutenberg)
plt.xticks(rotation=45)
plt.show()

In [None]:
# use FreqDist to compute counts of "states" and "america" per inaugural address

# IDs look like '1789-Washington.txt', so we extract the year part
address_ids = inaugural.fileids()
years = [fileid[:4] for fileid in address_ids]

# Fill code here

# plot the occurrence of "states" and "america" in each inaugural address as a line plot
plt.figure(figsize=(14,4))
plt.plot(years, fd_inaugural_states, label='states')
plt.plot(years, fd_inaugural_america, label='america')
plt.xticks(rotation=60)
plt.legend()
plt.show()

Notice how we used proportions with the Gutenberg corpus, but total counts with the inaugural corpus. What are the pros and cons of each?
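The difference can be seen in a small sketch with two hypothetical toy texts of different lengths. The helper name `relative_frequency` is an assumption made for this example, not an NLTK function.

```python
from collections import Counter

def relative_frequency(tokens, target):
    """Return the share of tokens (lowercased) that equal `target`."""
    lowered = [token.lower() for token in tokens]
    if not lowered:
        return 0.0
    return Counter(lowered)[target] / len(lowered)

# Hypothetical toy texts: the second is longer than the first.
short_text = ["She", "said", "no"]
long_text = ["He", "said", "yes", "and", "she", "said", "maybe", "then", "they", "left"]

# Raw counts favour the longer text...
raw_short = Counter(t.lower() for t in short_text)["said"]
raw_long = Counter(t.lower() for t in long_text)["said"]

# ...while proportions show the word is relatively more frequent in the shorter one.
prop_short = relative_frequency(short_text, "said")
prop_long = relative_frequency(long_text, "said")
print(raw_short, raw_long, prop_short, prop_long)
```

Counts are easy to interpret but depend on text length; proportions allow comparison across texts of different sizes but can be noisy for very short texts.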

In [None]:
# use FreqDist to compute proportions of "will" per Brown category
# Fill code here

# plot the proportion of "will" in each Brown category as a barplot
plt.figure(figsize=(14,4))
plt.bar(brown.categories(), fd_brown)
plt.xticks(rotation=45)
plt.show()

What happens if we switch to "could"?

### 1.4. ✏️ Test your own hypothesis

Use the examples above to come up with your own comparative hypothesis and test it. Some examples:
- Are words in a certain category of Brown longer on average than in another one?
- Does an antiquated term fade out of use in the inaugural addresses as the years go by?
- Do poems use a certain word more frequently than novels? What about plays?
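For the first kind of hypothesis, a helper along these lines might be a starting point. It is only a sketch: the function name `average_word_length` and the two toy "categories" are hypothetical, and in the notebook the token lists would come from something like `brown.words(categories=...)`.

```python
def average_word_length(tokens):
    """Mean length of the alphabetic tokens; punctuation and numbers are ignored."""
    words = [token for token in tokens if token.isalpha()]
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

# Hypothetical toy data standing in for two corpus categories.
toy_categories = {
    "news": ["Officials", "announced", "the", "decision", "."],
    "romance": ["She", "held", "his", "hand", "."],
}

averages = {name: average_word_length(tokens) for name, tokens in toy_categories.items()}
print(averages)
```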

In [None]:
# ✏️ STUDENTS: Your code here!

## 2. Annotated Corpora

Corpora are often annotated with linguistic features. Which ones do you think would be useful, and for what tasks?

In [None]:
# Explore Brown corpus annotations: print tagged words in the "news" category


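Do you recognise these tags? The Brown Corpus uses its own tagset, which overlaps with but is not identical to the Penn Treebank tagset listed at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Let's look at another corpus, this time annotated with _dependency parses_. Such syntactically annotated corpora are called "treebanks".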

In [None]:
# Load Dependency Treebank corpus, format as conll, and print


✏️ Now try using the Brown Corpus' annotations to do some analysis of your own:
- What are the most common verbs per category? Any insightful differences?
- How does pronoun use change in each category? What can pronouns tell us about a text?
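One possible approach to the verb question is sketched below on a handful of hypothetical `(word, tag)` pairs written in the style of `brown.tagged_words()`. It assumes that main-verb tags start with `VB`; in the Brown tagset, forms of "be", "have", and "do" carry their own tag families, so they would need separate handling.

```python
from collections import Counter

# Hypothetical (word, tag) pairs imitating the shape of tagged corpus output.
tagged = [
    ("The", "AT"), ("jury", "NN"), ("said", "VBD"), ("it", "PPS"),
    ("saw", "VBD"), ("and", "CC"), ("said", "VBD"), ("nothing", "PN"),
]

# Keep only words whose tag marks them as a verb, then count them in lowercase.
verb_counts = Counter(word.lower() for word, tag in tagged if tag.startswith("VB"))
print(verb_counts.most_common(2))
```

The same filter-then-count pattern applies to pronouns once the relevant tag prefixes have been identified.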

In [None]:
# ✏️ STUDENTS: Your code here!

## 3. ✏️ Creating your own Corpus

NLTK allows us to build our own corpus from text files. Find a book you like at https://www.gutenberg.org/ebooks/categories, and open it as "Plain Text UTF-8". Hit `Ctrl+S` or `Cmd+S` to save it in the same directory as this notebook.



In [None]:
# Read a txt file using PlaintextCorpusReader, and print fileids
my_file = ''

In [None]:
# Print excerpt from my corpus


Now, use the code from earlier to print some statistics about your file:
- The 30 most common words, filtering out stopwords and punctuation, and ignoring capitalisation
- Find the most common word in the Brown corpus that does not appear in your corpus. _Hint_: You can create a `FreqDist` of both corpora, and use `for word, count in fdist.most_common(100)` to find it.
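The hint for the second task could be turned into code roughly as follows. This is a sketch on two hypothetical toy token lists, with `collections.Counter` standing in for `FreqDist`; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical toy data: a "reference" corpus and your own corpus.
reference_tokens = ["the", "government", "the", "city", "government", "said"]
my_tokens = ["The", "rabbit", "said", "hello"]

# Count the reference corpus and collect the vocabulary of your own corpus, both lowercased.
reference_counts = Counter(token.lower() for token in reference_tokens)
my_vocabulary = {token.lower() for token in my_tokens}

# Walk the reference words from most to least common and stop at the first one you lack.
missing = next(
    (word for word, _ in reference_counts.most_common() if word not in my_vocabulary),
    None,
)
print(missing)
```

Using `most_common()` without an argument walks the whole vocabulary, which avoids coming up empty when the first 100 words all occur in both corpora.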


In [None]:
# ✏️ STUDENTS: Your code here!

### ✏️ Extra activity: Zipf's Law

Verify Zipf's law using different corpora. Zipf's law states that the frequency of the Nth most common word is proportional to 1/N, so the second most common word appears 1/2 as many times as the first, and the third appears 1/3 as many times as the first.
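To make the prediction concrete, the sketch below builds `ranks`, `frequencies`, and the 1/N prediction from a hypothetical toy token list. Real corpora will fit the law only approximately.

```python
from collections import Counter

# Hypothetical toy tokens with a steep frequency drop-off.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]

counts = Counter(tokens)

# Frequencies sorted from most to least common, with ranks starting at 1.
frequencies = [count for _, count in counts.most_common()]
ranks = list(range(1, len(frequencies) + 1))

# Zipf's prediction: the frequency at rank N is the top frequency divided by N.
predicted = [frequencies[0] / rank for rank in ranks]

for rank, observed, expected in zip(ranks, frequencies, predicted):
    print(rank, observed, expected)
```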

In [None]:
# Create a FreqDist of the corpus, filtering out stopwords and punctuation, and ignoring capitalisation


plt.figure(figsize=(10,6))
plt.plot(ranks, frequencies, marker='o')
plt.plot(ranks, [frequencies[0]/rank for rank in ranks], linestyle='dashed', color='red', marker='D')
plt.xlabel("Rank")
plt.ylabel("Frequency")
#plt.yscale("log")
#plt.xscale("log")
plt.show()

Now try yours:

In [None]:
# ✏️ STUDENTS: Your code here!

## 4. Generating Random Text with Bigrams

Bigrams are word pairs that appear consecutively in a text. So the text "I had some coffee" has 3 bigrams: (I had), (had some), and (some coffee).
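The example sentence can be checked with a few lines of plain Python. This sketch pairs each token with its successor using `zip`; NLTK also provides `nltk.bigrams` for the same purpose.

```python
# Split the example sentence into tokens on whitespace.
tokens = "I had some coffee".split()

# Pair each token with the one that follows it.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
```

A text of N tokens yields N - 1 bigrams, which is why four words give three pairs.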

- Have you encountered bigrams before? What about N-grams?
- How can they be useful?
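One use is predicting a plausible next word. The following is a minimal sketch, not a reference implementation: the function names `build_bigram_model` and `likely_next_word` are hypothetical, the training text is a toy sentence, and the next word is sampled in proportion to how often it followed the given word.

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        model[first][second] += 1
    return model

def likely_next_word(model, word, rng=random):
    """Sample a follower of `word`, weighted by bigram count; None if unseen."""
    followers = model.get(word)
    if not followers:
        return None
    candidates = list(followers)
    weights = list(followers.values())
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical toy training text.
tokens = "the cat sat on the mat and the cat slept".split()
model = build_bigram_model(tokens)

print(model["the"])
print(likely_next_word(model, "the"))
```

Calling `likely_next_word` repeatedly, feeding each output back in as the next input, produces a chain of words that is locally plausible but has no long-range coherence.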

In [None]:
# Create a function that given a word will output a likely next word


In [None]:
# Test our function with the Brown corpus


Now try it with your own corpus! Anything interesting? 

In [None]:
# ✏️ STUDENTS: Your code here!

Some questions:
- What are the limitations of bigrams? 
- How big should N be so that N-grams are useful? 
- How do modern machine learning models differ from N-grams?