# Proof-of-Concept 1: Pull Text from NLTK (Natural Language Toolkit) Corpora

## How to use this PoC:
Run the notebook in either of two ways:

* In the drop-down menu, click **Kernel --> Restart & Run All --> Restart and Run All Cells**.
* In the icon toolbar, click **the Fast-Forward button --> Restart and Run All Cells**.

After the run finishes, you may have to scroll back up to the top.

## Attribution:
**Author**: Steven Kyle Crawford

Special thanks to the NLTK team and numerous authors.

## Description:
This notebook illustrates pulling text from the NLTK corpora installed on the local machine.


This notebook demonstrates:
* printing all sentences from a specified Gutenberg book
* printing a specific number of the first sentences from a specified book
* printing sentences from the first 5 available books
* printing sentences from specific books

## Procedure:

### Step 0) Install the dependencies

In [1]:
# # Run this only once to avoid unnecessary redownloading
# # To enable or disable: highlight all lines and <Ctrl> + /
# !pip install -U nltk
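# !python -m nltk.downloader all-corpora # This will install only the corpora (no grammars or trained models)
# !python -m nltk.downloader punkt # Sentence tokenizer model needed by sent_tokenize in Step 3 (newer NLTK releases look for punkt_tab instead)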
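
Before re-running the install commands, it can help to check what is already present. The following is a minimal sketch, not part of the original notebook: the helper names `nltk_is_installed` and `gutenberg_is_downloaded` are hypothetical, and the resource path `corpora/gutenberg` assumes the default NLTK data layout.

```python
import importlib.util


def nltk_is_installed():
    """Return True if the nltk package can be imported in this environment."""
    return importlib.util.find_spec("nltk") is not None


def gutenberg_is_downloaded():
    """Return True if NLTK can locate the Gutenberg corpus on disk."""
    if not nltk_is_installed():
        return False
    import nltk  # imported lazily so the check also works without nltk

    try:
        # Assumes the default data layout, where corpora live under "corpora/"
        nltk.data.find("corpora/gutenberg")
    except LookupError:
        return False
    return True


print("nltk installed:", nltk_is_installed())
print("gutenberg corpus found:", gutenberg_is_downloaded())
```

If either line prints `False`, uncomment and run the install cell above.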

### Step 1) Get the fileid of the Gutenberg book you want

In [2]:
from nltk.corpus import gutenberg


print("All available Gutenberg book titles:\n")
for fileid in gutenberg.fileids():
    print(fileid)

All available Gutenberg book titles:

austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
bryant-stories.txt
burgess-busterbrown.txt
carroll-alice.txt
chesterton-ball.txt
chesterton-brown.txt
chesterton-thursday.txt
edgeworth-parents.txt
melville-moby_dick.txt
milton-paradise.txt
shakespeare-caesar.txt
shakespeare-hamlet.txt
shakespeare-macbeth.txt
whitman-leaves.txt
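
Each fileid in this listing has a `prefix-title.txt` shape, so the list can be grouped by prefix. The sketch below assumes that naming pattern holds for every entry; `group_by_prefix` is a hypothetical helper, and the fileids are hard-coded from the output above so the example runs without NLTK. Note that the prefix is usually an author name, but for `bible-kjv.txt` it names the work.

```python
from collections import defaultdict

# Copied from the listing above; with NLTK installed,
# gutenberg.fileids() can be passed in instead.
fileids = [
    "austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt",
    "bible-kjv.txt", "blake-poems.txt", "bryant-stories.txt",
    "burgess-busterbrown.txt", "carroll-alice.txt", "chesterton-ball.txt",
    "chesterton-brown.txt", "chesterton-thursday.txt", "edgeworth-parents.txt",
    "melville-moby_dick.txt", "milton-paradise.txt", "shakespeare-caesar.txt",
    "shakespeare-hamlet.txt", "shakespeare-macbeth.txt", "whitman-leaves.txt",
]


def group_by_prefix(fileids):
    """Group fileids by the text before the first hyphen."""
    groups = defaultdict(list)
    for fileid in fileids:
        prefix, _, _rest = fileid.partition("-")
        groups[prefix].append(fileid)
    return dict(groups)


books_by_prefix = group_by_prefix(fileids)
for prefix, books in books_by_prefix.items():
    print(f"{prefix}: {', '.join(books)}")
```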


### Step 2) Get the book's raw text
Get the raw text of a Gutenberg book from the NLTK corpora installed in Step 0. The raw text is a single large string that still contains the book's line-break characters. It is not tagged for parts of speech (POS).

In [3]:
from nltk.corpus import gutenberg


fileid = "carroll-alice.txt"
book = gutenberg.raw(fileid)
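
To see what "one string with line-break characters" means in practice, the sketch below inspects a short stand-in string. It is an illustration under assumptions: `summarize_raw_text` is a hypothetical helper, and `sample` is copied from the sentence output shown later in this notebook, so the real file's exact whitespace may differ. With NLTK installed, `book` from the cell above can be passed in instead.

```python
def summarize_raw_text(text, preview_length=40):
    """Hypothetical helper: report basic facts about a raw text string."""
    return {
        "characters": len(text),
        "lines": len(text.splitlines()),
        "preview": text[:preview_length],
    }


# Stand-in for a slice of gutenberg.raw("carroll-alice.txt")
sample = (
    "Alice was beginning to get very tired of sitting by her sister on the\n"
    "bank, and of having nothing to do: once or twice she had peeped into the\n"
    "book her sister was reading, but it had no pictures or conversations in\n"
    "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
    "conversation?'\n"
)

summary = summarize_raw_text(sample)
print(summary["characters"], "characters on", summary["lines"], "lines")
print(repr(summary["preview"]))  # repr() makes any "\n" characters visible
```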

### Step 3) Split the raw text into a list of sentences

Split the raw text of a Gutenberg book into a list of sentences, one string per sentence. Each sentence keeps the line breaks it had in the raw text. The tokenizer used here is a pre-trained NLTK sentence tokenizer.

In [4]:
from nltk.tokenize import sent_tokenize


sentences = sent_tokenize(book)
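
The tokenized sentences keep the book's hard line breaks, which is why the printed output later in this notebook wraps mid-sentence. If single-line sentences are preferred, one option is to collapse whitespace after tokenizing. This is a minimal sketch; `collapse_whitespace` is a hypothetical helper, not an NLTK function, and the sample sentence is taken from the Moby Dick output shown later.

```python
def collapse_whitespace(sentence):
    """Replace every run of spaces, tabs, and line breaks with one space."""
    return " ".join(sentence.split())


# A sentence as the tokenizer returns it, with an embedded line break
sentence = (
    "He loved to dust his old grammars; it\n"
    "somehow mildly reminded him of his mortality."
)

print(collapse_whitespace(sentence))
# He loved to dust his old grammars; it somehow mildly reminded him of his mortality.
```

Applied to the whole list, this would read `[collapse_whitespace(s) for s in sentences]`.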

### Step 4) Pretty print each sentence

In [5]:
def pretty_print_sentences(sentences, number_of_sentences=-1):
    """Pretty print a given number of sentences from a list of sentences.
    If number_of_sentences == -1, print all sentences.

    Given a list of strings and either -1 or a positive integer, returns None.
    Raises ValueError if number_of_sentences is neither -1 nor a positive integer.
    """

    if number_of_sentences == 0 or number_of_sentences < -1:
        raise ValueError('The number of sentences to print must be -1 (all) or greater than 0')

    # A slice ending at -1 would drop the last sentence, so map -1 to None (no limit)
    limit = None if number_of_sentences == -1 else number_of_sentences

    # Get the first n sentences where n is number_of_sentences
    for sentence in sentences[:limit]:
        print(sentence + "\n")
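
A note on using -1 to mean "all sentences": in a Python slice, an end index of -1 stops just before the last item, while an end index of `None` means no limit. The sketch below illustrates the difference with a three-item list; `first_n` is a hypothetical helper showing one way to map the -1 sentinel onto `None`.

```python
sentences = ["One.", "Two.", "Three."]

print(sentences[:2])     # ['One.', 'Two.']
print(sentences[:-1])    # ['One.', 'Two.'] (the last item is left out)
print(sentences[:None])  # ['One.', 'Two.', 'Three.']


def first_n(items, n=-1):
    """Return the first n items, or every item when n is -1."""
    limit = None if n == -1 else n
    return items[:limit]


print(first_n(sentences))     # ['One.', 'Two.', 'Three.']
print(first_n(sentences, 1))  # ['One.']
```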

### Step 5) Put it all together

In [6]:
def print_sentences_from_gutenberg_book(fileid, number_of_sentences=-1):
    """Pretty print a given number of sentences from a Gutenberg book using the book's fileid.

    Given a fileid string and either -1 or a positive integer, returns None.
    Raises ValueError if number_of_sentences is neither -1 nor a positive integer.
    """

    book = gutenberg.raw(fileid)
    sentences = sent_tokenize(book)
    pretty_print_sentences(sentences, number_of_sentences)
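
The function above passes `fileid` straight to the corpus reader, so a typo only surfaces when the reader tries to load the file. One possible refinement is to check the fileid first and suggest close matches. This is a sketch under assumptions: `check_fileid` is a hypothetical helper, and the short `available` list stands in for `gutenberg.fileids()` so the example runs without NLTK.

```python
import difflib


def check_fileid(fileid, available_fileids):
    """Return fileid if it is known; otherwise raise ValueError with suggestions."""
    if fileid in available_fileids:
        return fileid
    suggestions = difflib.get_close_matches(fileid, available_fileids, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown fileid {fileid!r}.{hint}")


# Stand-in for gutenberg.fileids()
available = ["austen-emma.txt", "carroll-alice.txt", "melville-moby_dick.txt"]

print(check_fileid("carroll-alice.txt", available))

try:
    check_fileid("carrol-alice.txt", available)  # misspelled on purpose
except ValueError as error:
    print(error)
```

Calling `check_fileid(fileid, gutenberg.fileids())` at the top of `print_sentences_from_gutenberg_book` would apply the same check to the real corpus.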

### Step 6) Use it

In [7]:
fileid = "carroll-alice.txt"
number_of_sentences = 10

print_sentences_from_gutenberg_book(fileid, number_of_sentences)

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.

Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!

Oh dear!

I shall be late!'

(when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural);

## Interactive Example:

### Try changing these settings
Press Ctrl + Enter to re-run the cell (code block) after each change.

In [8]:
# Change this: keep the quotation marks and the .txt extension
fileid = "melville-moby_dick.txt"

# Change this: -1 = give me all of the sentences
number_of_sentences = 5


# Don't change this
print_sentences_from_gutenberg_book(fileid, number_of_sentences)

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.

He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.

He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true."



#### All available Gutenberg book titles:
* austen-emma.txt
* austen-persuasion.txt
* austen-sense.txt
* bible-kjv.txt
* blake-poems.txt
* bryant-stories.txt
* burgess-busterbrown.txt
* carroll-alice.txt
* chesterton-ball.txt
* chesterton-brown.txt
* chesterton-thursday.txt
* edgeworth-parents.txt
* melville-moby_dick.txt
* milton-paradise.txt
* shakespeare-caesar.txt
* shakespeare-hamlet.txt
* shakespeare-macbeth.txt
* whitman-leaves.txt

## Other Examples:

### Example 1: Print the first 5 sentences from the first 5 books

In [9]:
number_of_books = 5
number_of_sentences = 5

# Get the first n books where n is number_of_books
all_fileids = gutenberg.fileids()
trimmed_fileids = all_fileids[:number_of_books]

for fileid in trimmed_fileids:  # "fileid" avoids shadowing the built-in id()
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
    print_sentences_from_gutenberg_book(fileid, number_of_sentences)
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.

Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.

Between _them_ it was more the intimacy
of sisters.

~~~~~~~~~

### Example 2: Print the first 7 sentences from The Bible (KJV), Paradise Lost (Milton), and Julius Caesar (Shakespeare)

In [10]:
number_of_sentences = 7
fileids = [
    # "austen-emma.txt",
    # "austen-persuasion.txt",
    # "austen-sense.txt",
    "bible-kjv.txt",
    # "blake-poems.txt",
    # "bryant-stories.txt",
    # "burgess-busterbrown.txt",
    # "carroll-alice.txt",
    # "chesterton-ball.txt",
    # "chesterton-brown.txt",
    # "chesterton-thursday.txt",
    # "edgeworth-parents.txt",
    # "melville-moby_dick.txt",
    "milton-paradise.txt",
    "shakespeare-caesar.txt",
    # "shakespeare-hamlet.txt",
    # "shakespeare-macbeth.txt",
    # "whitman-leaves.txt",
]

for fileid in fileids:
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
    print_sentences_from_gutenberg_book(fileid, number_of_sentences)
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon
the face of the deep.

And the Spirit of God moved upon the face of the
waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light
from the darkness.

1:5 And God called the light Day, and the darkness he called Night.

And the evening and the morning were the first day.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[Paradise Lost by John Milton 1667] 
 
 
Book I 
 
 
Of Man's first disobedience, and the fruit 
Of that forbidden tree whose mortal taste 
Brought death into the Wo