# Discover Insights and Generate Text from Project Gutenberg



Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, as well as to "encourage the creation and distribution of eBooks." It was founded in 1971 by American writer Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of books or individual stories in the public domain. All files can be accessed for free in open formats that can be read on almost any computer. As of 3 October 2015, Project Gutenberg had reached 50,000 items in its collection of free eBooks.

## Import and Preprocess Text Data

In [2]:
import nltk
from nltk.corpus import gutenberg as gb
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('gutenberg')

files_names = gb.fileids()
print(files_names)



['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Chrispdl\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


## Preview the data
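Let's write a short program that displays summary statistics for each text by looping over every `fileid` returned by `gb.fileids()`. The three numbers printed for each text are the average word length (characters per word), the average sentence length (words per sentence), and the average number of times each vocabulary item appears in the text (a lexical diversity score). Because `num_chars` is taken from the raw text, it also counts whitespace, so the first figure slightly overstates the true word length. For a compact display, we round each number to the nearest integer using `round()`.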

In [3]:

from nltk.corpus import gutenberg as gb
for fileid in gb.fileids():
    num_chars = len(gb.raw(fileid))
    num_words = len(gb.words(fileid))
    num_sents = len(gb.sents(fileid))
    num_vocab = len(set(w.lower() for w in gb.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 19 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt
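
The three columns are easier to read with labels attached. The following is a minimal sketch of the same computation on a tiny hand-made text, so it runs without downloading the corpus. The function name `text_stats` and the sample sentences are illustrative assumptions, not part of the notebook.

```python
def text_stats(raw, sents):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item).

    raw   -- the text as one string (hypothetical stand-in for gb.raw(fileid))
    sents -- list of tokenized sentences (stand-in for gb.sents(fileid))
    """
    words = [w for sent in sents for w in sent]   # stand-in for gb.words(fileid)
    vocab = set(w.lower() for w in words)         # case-folded vocabulary
    return (
        len(raw) / len(words),      # characters per word (spaces are counted too)
        len(words) / len(sents),    # words per sentence
        len(words) / len(vocab),    # how often each vocabulary item is used
    )


raw = "The cat sat. The dog sat."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]
avg_word_len, avg_sent_len, avg_uses = text_stats(raw, sents)
print(round(avg_word_len), round(avg_sent_len), round(avg_uses))
```

Read this way, the `bible-kjv.txt` row (4 34 79) says that each vocabulary item is reused about 79 times on average, far more than in any other text in the list.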


## Import necessary libraries

In [4]:
import nltk
import random 
from collections import Counter, defaultdict
from nltk.corpus import PlaintextCorpusReader
from nltk import bigrams
from nltk import trigrams
from nltk.util import ngrams 
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Chrispdl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Choose 10 books 

In [5]:
# Get a list of texts by Chesterton, Austen, Shakespeare, and the Bible
files_used = []                        # empty list to hold filenames
for file in files_names:               # look at each filename in turn
    # keep the file if its name contains one of the four keywords
    if 'chesterton' in file or 'austen' in file or 'shakespeare' in file or 'bible' in file:
        files_used.append(file)        # add file to the list
print(files_used)
print(len(files_used))

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt']
10
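
The next cell builds n-gram models for each selected book and samples sentences from them. As a minimal sketch of the bigram case, the example below trains on a three-sentence toy corpus in plain Python, so it runs without any NLTK data. The helper names (`padded_bigrams`, `train_bigram_model`, `generate`) and the toy corpus are assumptions made for illustration. It differs from the notebook cell in two deliberate ways: sampling uses `random.choices` with weights instead of a hand-written cumulative sum, and generation stops at the end-of-sentence padding (`None`) rather than at the first `.`.

```python
import random
from collections import Counter, defaultdict


def padded_bigrams(sentence):
    """Yield (previous, current) pairs with None padding at both ends."""
    tokens = [None] + list(sentence) + [None]
    return zip(tokens, tokens[1:])


def train_bigram_model(sents):
    """Map each word to a dict of P(next word | word)."""
    counts = defaultdict(Counter)
    for sentence in sents:
        for word1, word2 in padded_bigrams(sentence):
            counts[word1][word2] += 1
    model = {}
    for word1, followers in counts.items():
        total = sum(followers.values())
        model[word1] = {word2: n / total for word2, n in followers.items()}
    return model


def generate(model, rng, max_len=30):
    """Sample one sentence, stopping at the end padding or after max_len words."""
    text = []
    previous = None                      # None marks the sentence start
    while len(text) < max_len:
        followers = model[previous]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word is None:                 # end-of-sentence padding reached
            break
        text.append(word)
        previous = word
    return text


toy_sents = [
    ["the", "cat", "sat", "."],
    ["the", "dog", "sat", "."],
    ["the", "cat", "ran", "."],
]
model = train_bigram_model(toy_sents)
rng = random.Random(0)                   # fixed seed so runs are repeatable
print(model["the"])                      # probabilities of words following "the"
print(" ".join(generate(model, rng)))
```

Stopping at the padding token is one possible design choice. The notebook cell instead keeps sampling until a `.` appears, so a sentence that ends in `!` or `?` runs on into the start of another one, which is likely why some of the generated sentences in the output are so long.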


In [8]:
for j in files_used:
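    # Load the book as tokenized sentences and as a flat list of words
    sents = gb.sents(j)
    words = gb.words(j)
    # Unigram model: relative frequency of each word in the book
    counts_word = Counter(words)
    #print(counts_word.most_common(n=5)) #prints the most common words
    # Divide by the total number of tokens, not the number of distinct words
    total_words = sum(counts_word.values())
    for word in counts_word:
        counts_word[word] /= total_words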

    # Bigram model
    # Dictionary mapping each word to the words that follow it, with their counts
    bigram_model = defaultdict(lambda: defaultdict(lambda: 0))

    # Add 1 each time a pair of consecutive words occurs;
    # None padding marks the start and end of each sentence
    for sentence in sents:
        for word1, word2 in bigrams(sentence, pad_right = True, pad_left = True):
            bigram_model[word1][word2] += 1

    # Convert the counts to conditional probabilities P(word2 | word1)
    for word1 in bigram_model:
        total_count = float(sum(bigram_model[word1].values()))
        # Every word1 in the model was seen at least once, so total_count > 0
        for word2 in bigram_model[word1]:
            bigram_model[word1][word2] /= total_count
            
        
    #print(str(dict(bigram_model)))
    print('Generating sentences from book', j, 'using bigrams')
    # Generate 10 sentences, sampling each word from P(word | previous word)
    for i in range(10):
        text = [None]                      # None marks the start of a sentence
        sentence_finished = False
        while not sentence_finished:
            r = random.random()
            accumulator = .0
            # Distribution over words that follow the last generated token
            var = bigram_model[text[-1]]
            for word in var.keys():
                accumulator += var[word]
                # Pick the first word whose cumulative probability reaches r
                if accumulator >= r:
                    text.append(word)
                    break
            # Stop once a full stop has been generated
            if text[-1:] == ['.']:
                sentence_finished = True

        print('Generated sentence', i, '\n', ' '.join([t for t in text if t]))

    # Trigram model
    # Dictionary mapping each pair of words to the words that follow it, with their counts
    trigram_model = defaultdict(lambda: defaultdict(lambda: 0))

    # Add 1 each time a trigram occurs
    for sentence in sents:
        for word1, word2, word3 in trigrams(sentence, pad_right = True, pad_left = True):
            trigram_model[(word1, word2)][word3] += 1

    # Convert the counts to conditional probabilities P(word3 | word1, word2)
    for word1_word2 in trigram_model:
        total_count = float(sum(trigram_model[word1_word2].values()))
        # Every context in the model was seen at least once, so total_count > 0
        for word3 in trigram_model[word1_word2]:
            trigram_model[word1_word2][word3] /= total_count



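    print('Generating sentences from book', j, 'using trigrams')
    # Generate 10 sentences, sampling each word from P(word | previous two words)
    for i in range(10):
        text = [None, None]                # two None tokens mark the sentence start
        sentence_finished = False

        while not sentence_finished:
            r = random.random()
            accumulator = .0
            # Distribution over words that follow the last two generated tokens
            var = trigram_model[tuple(text[-2:])]
            for word in var.keys():
                accumulator += var[word]
                # Pick the first word whose cumulative probability reaches r
                if accumulator >= r:
                    text.append(word)
                    break
            # Stop once a full stop has been generated
            if text[-1:] == ['.']:
                sentence_finished = True

        print('Generated sentence', i, '\n', ' '.join([t for t in text if t]))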

Generating sentences from book austen-emma.txt using bigrams
Generated sentence 0 
 But have always do not like therefore , at seventeen .
Generated sentence 1 
 The evil , for her cheeks in a large a heavy work to Donwell !" " But after dinner , at Enscombe must consider whether in character , and it was most deplorable mistake not prejudiced ; and yet it , to get walnuts , which accounted for her visitors perhaps .-- Yes ; it had at all there was led her .
Generated sentence 2 
 But hush , so odd !-- Can you at somebody had reached Mrs .
Generated sentence 3 
 at a very glad to town , a few turns out again , without great deal I find her it as I am I are always good deal under my intentions were not unpretty ; but the sick of mutton for a chair near so very well as the convenience of his sanction to myself .
Generated sentence 4 
 " Does Mrs .
Generated sentence 5 
 He had ever refuse him to do envy him to Box Hill though she found more moderate -- it be so much more safely entered t

Generating sentences from book bible-kjv.txt using bigrams
Generated sentence 0 
 16 And they will give unto him , which I have already attained not leave in the common sort : 15 : 24 For ye , and thy people murmured against the right unto Sephar a sharp razor , and prosper into the city , and when she said , and she lifteth up the psalms and his work thereof , I will swallow by fire from the woman , Behold , and which he shall not send out .
Generated sentence 1 
 28 And the LORD .
Generated sentence 2 
 Do we see David his neck , and also , behold , the LORD hath the time , because I will be troubled .
Generated sentence 3 
 5 By sword beside the cities , Thy watchmen to lift up , which is within your faith fail .
Generated sentence 4 
 O LORD said Jesus is none to stink ; ( he knoweth them ; and the Amorites .
Generated sentence 5 
 5 : for the three days and Jebusites , for there cometh : 8 And as the flood arose a gluttonous , and he rebuked the inhabitants of Judah ' s wisdom to 

Generating sentences from book chesterton-brown.txt using trigrams
Generated sentence 0 
 he muttered ; " but there is no curse on him -- a muffled figure bending forward , evidently peering out into the steering - seat of a watering - place , who was stepping with considerable coolness into the street ." We can direct our moral wills ; but deliberation could not ` X - ray ' the coin from Hawker , because they must be the chance blunders of a mysterious chieftain with an air of some of your critique of the road along the sea ?" But such things before -- were at once luminous and discoloured by the pursuers ; but when the brute has huge humped shoulders and hog ' s private room of the Pendragons in the time when he spoke , the man turned to look after her ? The convict settlement at Sequah is thirty miles from the room was standing with his mouth to speak and his hair ( which was that the house ; and I thought I had , however high they went , the hues of a banquet over - powering projec

Generating sentences from book shakespeare-hamlet.txt using trigrams
Generated sentence 0 
 Polon .
Generated sentence 1 
 Ham .
Generated sentence 2 
 Come away .
Generated sentence 3 
 Sir , Enquire me first what Danskers are in Paris ; And thy Commandment all alone shall liue Within the Center Guild .
Generated sentence 4 
 Ser .
Generated sentence 5 
 I dare Damnation : to me I know not what we are Pictures , or the Poesie of a sorrow , I could tell you why ; so fare ye well : y ' are a good end ; For by the image of my Watch , bid them make hast .
Generated sentence 6 
 Vnhand me Gentlemen : By Heauen I charge you ; And how , and sent into England ? Is it not that I hope will teach you to imagine - Enter a King and Queene , that it were better my Mother , for the Players .
Generated sentence 7 
 King .
Generated sentence 8 
 Laer .
Generated sentence 9 
 Mar .
Generating sentences from book shakespeare-macbeth.txt using bigrams
Generated sentence 0 
 1 Appar .
Generated sentence 1