### Accessing some popular nltk text corpora

In [None]:
# You can skip the following code if you have already installed the nltk Python library.
# Run this. The output is omitted here to save space.
%pip install nltk

In [None]:
# Once you install nltk, you need to download the data.
import nltk
nltk.download('all')  # Run this. The output is omitted here to save space.

### Accessing Gutenberg corpus

In [2]:
from nltk.corpus import gutenberg
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


- The output above lists the fileids (file identifiers) available in nltk's Gutenberg corpus.
- Each fileid names a text file containing one of the literary works in the corpus.

In [4]:
# This code reads the words of a particular file
persuasion = gutenberg.words('austen-persuasion.txt')
print(persuasion[:50])

['[', 'Persuasion', 'by', 'Jane', 'Austen', '1818', ']', 'Chapter', '1', 'Sir', 'Walter', 'Elliot', ',', 'of', 'Kellynch', 'Hall', ',', 'in', 'Somersetshire', ',', 'was', 'a', 'man', 'who', ',', 'for', 'his', 'own', 'amusement', ',', 'never', 'took', 'up', 'any', 'book', 'but', 'the', 'Baronetage', ';', 'there', 'he', 'found', 'occupation', 'for', 'an', 'idle', 'hour', ',', 'and', 'consolation']
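
Because `gutenberg.words()` returns a sequence of string tokens, ordinary Python tools work on it directly. As a minimal sketch, the snippet below counts token frequencies with `collections.Counter`. To stay runnable without the corpus data, it uses the first 20 tokens copied from the output above; `top_tokens` is a hypothetical helper name, not part of nltk.

```python
from collections import Counter

def top_tokens(tokens, n=3):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)

# First 20 tokens of 'austen-persuasion.txt', copied from the output above.
# With the corpus installed, gutenberg.words('austen-persuasion.txt')
# could be passed in instead of this sample.
sample = ['[', 'Persuasion', 'by', 'Jane', 'Austen', '1818', ']', 'Chapter', '1',
          'Sir', 'Walter', 'Elliot', ',', 'of', 'Kellynch', 'Hall', ',', 'in',
          'Somersetshire', ',']

print(top_tokens(sample, n=1))
```

In this small sample the comma is the most frequent token, a reminder that punctuation marks are tokens too.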


- Let's access a few more such text corpora

### Accessing Reuters Corpus

In [7]:
from nltk.corpus import reuters
file_ids = reuters.fileids()[:10]
print(file_ids)

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']


- Categories in the Reuters corpus are not mutually exclusive: a document can belong to several categories.
- We can request the topics covered by one or more documents.
- We can also request the documents included in one or more categories.
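
The two lookups described above are the two directions of a many-to-many mapping between documents and categories. The sketch below illustrates the idea with plain dictionaries. The document IDs and their labels are made up for illustration and are not taken from the Reuters corpus; `invert` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical document IDs and labels, made up for illustration only.
doc_categories = {
    'doc/1': ['grain'],
    'doc/2': ['crude', 'nat-gas'],
    'doc/3': ['crude', 'grain'],
}

def invert(mapping):
    """Turn {document: [categories]} into {category: [documents]}."""
    inverted = defaultdict(list)
    for doc, cats in mapping.items():
        for cat in cats:
            inverted[cat].append(doc)
    return dict(inverted)

category_docs = invert(doc_categories)
print(category_docs['crude'])   # documents filed under 'crude'
print(doc_categories['doc/3'])  # categories of a single document
```

With the real corpus, `reuters.categories(fileids)` and `reuters.fileids(categories)` answer the same two questions.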

In [9]:
# The corpus methods accept either a single file ID or a list of file IDs.
reuters.categories(['test/14828', 'test/14829'])

['crude', 'grain', 'nat-gas']

In [11]:
file_ids = reuters.fileids(['crude', 'grain'])[:10]
print(file_ids)

['test/14828', 'test/14829', 'test/14832', 'test/14841', 'test/14858', 'test/15033', 'test/15043', 'test/15063', 'test/15097', 'test/15106']


In [13]:
reuters.words('test/14843')[:5]

['SUMITOMO', 'BANK', 'AIMS', 'AT', 'QUICK']

In [14]:
reuters.words(categories=['crude', 'grain', 'nat-gas'])

['CHINA', 'DAILY', 'SAYS', 'VERMIN', 'EAT', '7', '-', ...]

### Accessing Brown Corpus

In [17]:
from nltk.corpus import brown
categories = brown.categories()[:5]
print(categories)

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government']


In [23]:
sentences = brown.sents(categories=['adventure', 'belles_lettres'])[:2]
for sentence in sentences:
    print(' '.join(sentence))

Northern liberals are the chief supporters of civil rights and of integration .
They have also led the nation in the direction of a welfare state .
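
Each sentence returned by `brown.sents()` is a list of tokens, so joining with a single space leaves a gap before the final period. A minimal sketch of one way to tidy this up is shown below. The `detokenize` helper is hypothetical and only handles a few common punctuation marks; the sample sentence is copied from the output above.

```python
import re

def detokenize(tokens):
    """Join tokens with spaces, then remove the space before common punctuation."""
    text = ' '.join(tokens)
    return re.sub(r'\s+([.,;:!?])', r'\1', text)

# Tokens of the second sentence printed above.
sentence = ['They', 'have', 'also', 'led', 'the', 'nation', 'in', 'the',
            'direction', 'of', 'a', 'welfare', 'state', '.']

print(detokenize(sentence))
```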


### Accessing Gutenberg Corpus

- nltk includes a small selection of texts from the Project Gutenberg archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/.
- Calling nltk.corpus.gutenberg.fileids() returns the file identifiers in this corpus.

In [26]:
import nltk
from nltk.corpus import gutenberg
file_ids = gutenberg.fileids()[:3]
print(file_ids)

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']


In [27]:
# Let's pick the first text: Emma by Jane Austen.
# We will name it "emma_text".
# Then we will find out how many tokens (words and punctuation marks) it contains.

emma_text = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma_text)

192427
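
The figure above counts every token, and the earlier Persuasion output shows that punctuation marks and numbers are tokens too. For a rough count of actual words, one option is to keep only alphabetic tokens. The sketch below uses a short sample copied from that earlier output so it runs without the corpus; `count_words` is a hypothetical helper name.

```python
def count_words(tokens):
    """Count tokens made up only of letters, skipping punctuation and numbers."""
    return sum(1 for token in tokens if token.isalpha())

# First 13 tokens of 'austen-persuasion.txt', copied from the earlier output.
# With the corpus installed, emma_text could be passed in instead.
sample = ['[', 'Persuasion', 'by', 'Jane', 'Austen', '1818', ']', 'Chapter', '1',
          'Sir', 'Walter', 'Elliot', ',']

print(len(sample), count_words(sample))
```

Here 13 tokens contain 8 alphabetic words.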

### Accessing Web and Chat Text

- This corpus includes content from a Firefox discussion forum, the movie scripts of Pirates of the Caribbean and Monty Python and the Holy Grail, conversations overheard in New York, personal advertisements, and wine reviews.

In [28]:
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:30], '...')

firefox.txt Cookie Manager: "Don't allow s ...
grail.txt SCENE 1: [wind] [clop clop clo ...
overheard.txt White guy: So, do you have any ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD ...
singles.txt 25 SEXY MALE, seeks attrac old ...
wine.txt Lovely delicate, fragrant Rhon ...


In [31]:
# A corpus of instant messaging chat sessions is also available in nltk.
# This corpus is organized into 15 files.

from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
# Each post is a list of tokens; take the first 10 tokens of post 123.
first_10_tokens = chatroom[123][:10]
print(first_10_tokens)

['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',']
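
The file name used above appears to follow a convention: per the NLTK book, `10-19-20s_706posts.xml` holds 706 posts gathered from the 20s chat room on 10/19. Assuming the other files follow the same `date-agegroup_Nposts.xml` pattern, a file ID can be split into its parts as sketched below. `parse_chat_fileid` is a hypothetical helper, not an nltk function.

```python
import re

def parse_chat_fileid(fileid):
    """Split an NPS Chat file ID into (date, age_group, post_count).

    Assumes the pattern '<MM-DD>-<age group>_<N>posts.xml'.
    """
    match = re.fullmatch(r'(\d{2}-\d{2})-(.+)_(\d+)posts\.xml', fileid)
    if match is None:
        raise ValueError(f'unexpected file ID: {fileid!r}')
    date, age_group, posts = match.groups()
    return date, age_group, int(posts)

print(parse_chat_fileid('10-19-20s_706posts.xml'))
```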


### Accessing nltk Lexical Resources 

- nltk contains multiple lexical resources.
- The most significant one is WordNet.
- WordNet is a large lexical database of English.
- It groups words into sets of synonyms.

In [33]:
from nltk.corpus import wordnet as wn

In [34]:
synonyms = wn.synsets('book')
print(synonyms)

[Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11'), Synset('book.v.01'), Synset('reserve.v.04'), Synset('book.v.03'), Synset('book.v.04')]


- The output above is a list of synonym sets (called synsets) for the word “book”.

- Each synset has a name such as `book.n.01`, made up of three parts:
- `book` is the word that names the synset.
- `n` is the part of speech ("n" stands for noun in this example).
- `01` is the sense number, which distinguishes the different meanings of the word.
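
The three-part naming scheme described above can be taken apart with ordinary string handling. The sketch below works on the name strings shown in the output and does not need WordNet installed; `parse_synset_name` is a hypothetical helper, not part of nltk.

```python
def parse_synset_name(name):
    """Split a synset name such as 'book.n.01' into (word, pos, sense_number)."""
    # Split from the right so a word that itself contains a dot stays intact.
    word, pos, sense = name.rsplit('.', 2)
    return word, pos, int(sense)

print(parse_synset_name('book.n.01'))
print(parse_synset_name('reserve.v.04'))
```

With WordNet loaded, `syn.name()` returns the name string for a synset, so the same helper could be applied to each item in the list.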

In [37]:
# Get definitions and examples. Printing only the first five.
for syn in synonyms[:5]:
    print(syn.definition())
    print(syn.examples())

a written work or composition that has been published (printed on pages bound together)
['I am reading a good book on economics']
physical objects consisting of a number of pages bound together
['he used a large book as a doorstop']
a compilation of the known facts regarding something or someone
["Al Smith used to say, `Let's look at the record'", 'his name is in all the record books']
a written version of a play or other dramatic composition; used in preparing for a performance
[]
a record in which commercial accounts are recorded
['they got a subpoena to examine our books']


- The output above gives the definition and example sentences for the first five senses of the word “book”.
- These are retrieved from WordNet.
- They also illustrate the various contexts in which the word can be used.