# 1 Accessing Text Corpora

**Corpora:** large bodies of linguistic data.

**Text corpus:** large body of text.

## 1.1 Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive.

The `fileids()` function of the `corpus.gutenberg` package returns the file identifiers of the Gutenberg corpus.

In [12]:
from nltk.corpus import gutenberg # import the gutenberg package from the corpus package of the nltk library
gutenberg.fileids()               # return the gutenberg file identifiers

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

The `words("`*`file.txt`*`")` function of the `corpus.gutenberg` package returns all the tokens from a Gutenberg *text file*.

The `raw("`*`file.txt`*`")` function of the `corpus.gutenberg` package returns all the characters (a string of unprocessed text, or raw text) from a Gutenberg *text file*.

In [27]:
emma = gutenberg.words("austen-emma.txt")   # store the tokens of the austen-emma.txt file
print(len(emma))                            # print the total number of tokens in the austen-emma.txt file
emma_raw = gutenberg.raw("austen-emma.txt") # store all the characters (raw text) of the austen-emma.txt file
print(len(emma_raw))                        # print the total number of characters in the austen-emma.txt file

192427
887071
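
The gap between the two numbers comes from what is being counted: `words()` counts tokens, while `raw()` counts every character, including spaces and line breaks. The following minimal sketch illustrates this on a short string typed in by hand, without the corpus. The regular expression is an assumption that only approximates word-and-punctuation tokenization; it is not guaranteed to match NLTK's tokenizer exactly.

```python
import re

# sample text typed in by hand (an assumption for illustration, not read from the corpus)
sample_raw = "Emma Woodhouse, handsome, clever, and rich."

# rough tokenization: runs of word characters, or runs of punctuation
sample_tokens = re.findall(r"\w+|[^\w\s]+", sample_raw)

print(sample_tokens)       # the individual tokens
print(len(sample_tokens))  # number of tokens
print(len(sample_raw))     # number of characters, spaces included
```

Punctuation marks become tokens of their own, while spaces are counted only in the character total.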


In the following example, the program loops through all the Gutenberg file identifiers and extracts the tokens that are words from each text. For each text it prints the following statistics, in this order:

1. The average word length

2. The average sentence length

3. The average number of times each vocabulary item appears in the text

4. The file identifier of the text

In [36]:
for fileid in gutenberg.fileids():
    words = [token for token in gutenberg.words(fileid) if token.isalpha()]
    word_count = len(words)
    word_lengths_sum = sum(len(word) for word in words)
    sentence_count = len(gutenberg.sents(fileid))
    vocabulary_count = len(set(word.upper() for word in words))

    average_word_length = round(word_lengths_sum / word_count)
    average_sentence_length = round(word_count / sentence_count)
    average_vocabulary_occurrence_count = round(word_count / vocabulary_count)

    print(f"{average_word_length:1} {average_sentence_length:2} {average_vocabulary_occurrence_count:2} {fileid}")

4 21 23 austen-emma.txt
4 22 15 austen-persuasion.txt
4 24 19 austen-sense.txt
4 26 63 bible-kjv.txt
4 16  5 blake-poems.txt
4 16 12 bryant-stories.txt
4 15 11 burgess-busterbrown.txt
4 16 11 carroll-alice.txt
4 17 10 chesterton-ball.txt
4 19  9 chesterton-brown.txt
4 16  9 chesterton-thursday.txt
4 17 21 edgeworth-parents.txt
4 22 13 melville-moby_dick.txt
4 43  9 milton-paradise.txt
4 10  7 shakespeare-caesar.txt
4 10  6 shakespeare-hamlet.txt
4 10  5 shakespeare-macbeth.txt
4 30 10 whitman-leaves.txt
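
Because every average is rounded to a whole number, the first column shows 4 for all texts. A minimal sketch of the same three statistics without rounding is shown below. The helper `text_statistics` and the sample sentences are hypothetical names and data introduced for illustration; with NLTK installed, the result of `gutenberg.sents(fileid)` could be passed in instead.

```python
def text_statistics(sentences):
    """Return (average word length, average sentence length,
    average occurrences per vocabulary item) for a list of token lists."""
    # keep only alphabetic tokens, as in the loop above
    words = [token for sentence in sentences for token in sentence if token.isalpha()]
    # case-insensitive vocabulary
    vocabulary = {word.upper() for word in words}
    return (
        sum(len(word) for word in words) / len(words),
        len(words) / len(sentences),
        len(words) / len(vocabulary),
    )

# hand-written sample data (hypothetical, not taken from the corpus)
sample_sentences = [
    ["The", "cat", "sat", "."],
    ["The", "dog", "sat", "too", "."],
]

avg_word, avg_sentence, avg_occurrence = text_statistics(sample_sentences)
print(f"{avg_word:.2f} {avg_sentence:.2f} {avg_occurrence:.2f}")
```

Formatting with `:.2f` instead of calling `round()` keeps two decimal places, which makes small differences between texts visible.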


The `sents("`*`file.txt`*`")` function of the `corpus.gutenberg` package returns a list of sentences from a Gutenberg *text file*. Each sentence is a list of tokens that usually ends with a sentence-final punctuation token. In other words, the result is a list of lists of tokens.

In [29]:
# store the list of sentences of the shakespeare-macbeth.txt file
macbeth_sentences = gutenberg.sents("shakespeare-macbeth.txt")

# print the list of sentences of the text
print(macbeth_sentences)

# print the sentence at index 1116 (indexing starts at 0, so this is the 1117th sentence)
print(macbeth_sentences[1116])

# store the length of the longest sentence of the text
longest_sentence_length = max(len(sentence) for sentence in macbeth_sentences)

# print the length of the longest sentence of the text
print(longest_sentence_length)

# print all the sentences that have the same length as the longest sentence in the text
print([sentence for sentence in macbeth_sentences if len(sentence) == longest_sentence_length])

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';', 'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble']
158
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Ste
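
Since the result is a list of lists, ordinary list operations apply. The sketch below uses a small hand-written list in place of the corpus (an assumption for illustration) and shows that `max()` with `key=len` returns only the first longest sentence, whereas filtering by length, as above, returns every sentence that ties for longest.

```python
# hand-written stand-in for the value returned by sents() (not corpus data)
sample_sentences = [
    ["Short", "one", "."],
    ["This", "one", "is", "longer", "."],
    ["So", "is", "this", "one", "."],
]

# max() with key=len returns the first sentence that has the maximum length
first_longest = max(sample_sentences, key=len)

# filtering by the maximum length keeps every sentence that ties for longest
longest_length = len(first_longest)
all_longest = [s for s in sample_sentences if len(s) == longest_length]

# join the tokens of a sentence back into a readable string
print(" ".join(first_longest))
print(len(all_longest))
```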

## 1.2 Web and Chat Text

NLTK has a small collection of web text that includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.

The `fileids()` function of the `corpus.webtext` package returns the file identifiers of the web text corpus.

In [37]:
# import the webtext package from the corpus package of the nltk library
from nltk.corpus import webtext

# print all the file identifiers of the web text package with a small snippet from the beginning of each text file
for fileid in webtext.fileids():
    print(f"{fileid:13} {webtext.raw(fileid)[:33]:33} ...")

firefox.txt   Cookie Manager: "Don't allow site ...
grail.txt     SCENE 1: [wind] [clop clop clop]  ...
overheard.txt White guy: So, do you have any pl ...
pirates.txt   PIRATES OF THE CARRIBEAN: DEAD MA ...
singles.txt   25 SEXY MALE, seeks attrac older  ...
wine.txt      Lovely delicate, fragrant Rhone w ...


NLTK also has a corpus of instant messaging chat sessions. The corpus contains over 10,000 posts. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).

The `posts("`*`file.xml`*`")` function of the `corpus.nps_chat` package returns an `XMLCorpusView` from an *xml file*. The returned `XMLCorpusView` contains a list of posts. Each post is a list of tokens.

In [44]:
# import the nps_chat package from the corpus package of the nltk library
from nltk.corpus import nps_chat

# store the posts from the 10-19-20s_706posts.xml file (20s chatroom, collected on 10/19/2006, 706 posts)
chatroom = nps_chat.posts("10-19-20s_706posts.xml")

# print the post at index 123 (indexing starts at 0, so this is the 124th post)
print(chatroom[123])

['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
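
The file identifier itself encodes the date, the age group, and the number of posts. As a small sketch, the hypothetical helper `parse_chat_fileid` below splits an identifier into those parts with a regular expression. The pattern is an assumption based on the file name used above and may not cover every file in the corpus.

```python
import re

# assumed layout: MM-DD-<age group>_<post count>posts.xml
FILE_ID_PATTERN = re.compile(r"(\d{2})-(\d{2})-([0-9a-z]+)_(\d+)posts\.xml")

def parse_chat_fileid(fileid):
    """Split a chat file identifier into month, day, age group and post count."""
    match = FILE_ID_PATTERN.fullmatch(fileid)
    if match is None:
        raise ValueError(f"unexpected file identifier: {fileid!r}")
    month, day, age_group, post_count = match.groups()
    return {
        "month": int(month),
        "day": int(day),
        "age_group": age_group,
        "post_count": int(post_count),
    }

chat_info = parse_chat_fileid("10-19-20s_706posts.xml")
print(chat_info)
```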


## 1.3 Brown Corpus

NLTK includes the Brown Corpus, which contains text from 500 sources. The sources are categorized by genre, such as news, editorial, and so on.

The `categories()` function of the `corpus.brown` package returns a list of all the available categories of the Brown Corpus.

The `words("`*`fileid`*`")` function of the `corpus.brown` package returns all the tokens from the file with the specified *file id*. The file id argument can also be a list of file ids. The `words()` function also has a `categories` parameter, which accepts a category or a list of categories.

The `sents("`*`fileid`*`")` function of the `corpus.brown` package returns a list of sentences from the Brown Corpus file with the specified *file id*. Like `words()`, it also accepts a `categories` parameter.

In [59]:
# import the brown package from the corpus package of the nltk library
from nltk.corpus import brown

# print the categories of the brown corpus
print(brown.categories())

# print the tokens from texts of the brown corpus that are categorized as news
print(brown.words(categories="news"))

# print the tokens from the text of the brown corpus with id "cg22"
print(brown.words(fileids=["cg22"]))

# print the sentences from the texts of the brown corpus that are categorized as news, editorial and reviews
print(brown.sents(categories=["news", "editorial", "reviews"]))

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
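
The way a `categories` argument can be either a single category or a list of categories can be mimicked with plain Python. The sketch below is not NLTK's implementation; `SAMPLE_CORPUS` and `sample_words` are hypothetical stand-ins that only imitate the calling convention on hand-written data.

```python
# hand-written stand-in for a categorized corpus (hypothetical data)
SAMPLE_CORPUS = {
    "news": ["The", "jury", "said", "."],
    "editorial": ["We", "must", "act", "."],
    "reviews": ["A", "fine", "film", "."],
}

def sample_words(categories=None):
    """Return the tokens of one category, several categories, or all of them."""
    if categories is None:
        selected = list(SAMPLE_CORPUS)   # no filter: use every category
    elif isinstance(categories, str):
        selected = [categories]          # a single category name
    else:
        selected = list(categories)      # a list of category names
    return [token for category in selected for token in SAMPLE_CORPUS[category]]

print(sample_words(categories="news"))
print(len(sample_words(categories=["news", "editorial"])))
print(len(sample_words()))
```

Treating a single string as a one-element list keeps the rest of the function identical for both call styles.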
