# Core language example: Reading in, investigating, and editing a book.

In [2]:
import os  # this module lets us run terminal commands from within Python

## Getting a book

Many old books are freely available on [Project Gutenberg](http://www.gutenberg.org/). I have already chosen a book to download and use. Since we have the web address of the book (`url` below), we can download it with a basic Linux command, `wget`. The `os` package lets us run terminal commands like this one from within Python.

In [5]:
# This file is already saved into the github repo, but this is how you can get it:
url = 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt'
loc = '../data/'  # relative path location for the book
# os.system('wget --directory-prefix=' + loc + ' ' + url)  # this downloads the text
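
If `wget` is not installed, the same download can probably be done with Python's standard library instead. The sketch below is an alternative to the `os.system` call, not what was used here. The helper names `local_path` and `download_book` are made up for illustration.

```python
import os
import urllib.request


def local_path(url, loc):
    """Return the path where the file at `url` would be saved inside `loc`."""
    return os.path.join(loc, url.split('/')[-1])


def download_book(url, loc):
    """Download `url` into the directory `loc` and return the saved path.

    Hypothetical helper for illustration; it needs network access.
    """
    os.makedirs(loc, exist_ok=True)  # create the data directory if missing
    path = local_path(url, loc)
    urllib.request.urlretrieve(url, path)
    return path


url = 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt'
loc = '../data/'
print(local_path(url, loc))
# download_book(url, loc)  # uncomment to actually fetch the file
```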

Now that we have downloaded the book, we can do something with it. We split the web address on `/` and take the last piece to get the filename. Then we can simply open the book as a file object, `f`.

In [6]:
filename = url.split('/')[-1]
f = open(loc + filename)
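
The file starts with a byte-order mark, which appears as `\ufeff` in the outputs further down. Here is a minimal sketch of one way to drop it while reading, assuming the file is UTF-8 encoded. The helper name `read_lines` is hypothetical, and the rest of this walkthrough keeps using `f` as opened above.

```python
def read_lines(path):
    """Read all lines from a UTF-8 text file, dropping a leading BOM if present."""
    # 'utf-8-sig' decodes UTF-8 and strips the byte-order mark at the start;
    # the with-block closes the file even if reading raises an error.
    with open(path, encoding='utf-8-sig') as f:
        return f.readlines()


# Hypothetical usage with the variables defined above:
# lines = read_lines(loc + filename)
```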

So what book is it? Let's check the first line to see:

In [7]:
print(f.readline())

﻿The Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited



## Investigate

Let's define a function that will do something useful for us, like counting words.

In [8]:
def wordcount(text, word):
    '''Count instances in text of the input word.
    
    Input:
        text (str): a string of text
        word (str): a word to search for and count in the text
    
    Output:
        (int) number of instances of word in text
    '''
    
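    if not isinstance(text, str):
        # warn only; the call to count() below still runs
        print('text is not a string! Try again!')

    # on a string, count() counts non-overlapping substring matches
    return text.count(word)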
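
As written, `wordcount()` prints a warning for non-string input but still goes on to call `count()`. One possible variation, sketched below, raises a `TypeError` instead so the problem cannot go unnoticed. The name `wordcount_strict` is made up here, and the rest of this walkthrough keeps using `wordcount()`.

```python
def wordcount_strict(text, word):
    """Count occurrences of `word` in `text`, refusing non-string input."""
    if not isinstance(text, str):
        raise TypeError('text must be a str, got ' + type(text).__name__)
    return text.count(word)


print(wordcount_strict('the cat and the hat', 'the'))  # prints 2
```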

Now that the function `wordcount()` has been defined, we can use it in the subsequent cells.

First, though, we need to get the text cleaned up. We read in all the lines of the text as follows:

In [9]:
f.seek(0)  # This sets the pointer back to the beginning of the file. This allows us to run this
           # block of code many times without reopening the file each time.
lines = f.readlines()
print(lines)

['\ufeffThe Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited\n', 'by R. W. (Robert William) Chapman\n', '\n', '\n', 'This eBook is for the use of anyone anywhere at no cost and with\n', 'almost no restrictions whatsoever.  You may copy it, give it away or\n', 're-use it under the terms of the Project Gutenberg License included\n', 'with this eBook or online at www.gutenberg.org\n', '\n', '\n', '\n', '\n', '\n', 'Title: Pride and Prejudice\n', '\n', '\n', 'Author: Jane Austen\n', '\n', 'Editor: R. W. (Robert William) Chapman\n', '\n', 'Release Date: May 9, 2013  [eBook #42671]\n', '\n', 'Language: English\n', '\n', '\n', '***START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***\n', '\n', '\n', 'E-text prepared by Greg Weeks, Jon Hurst, Mary Meehan, and the Online\n', 'Distributed Proofreading Team (http://www.pgdp.net) from page images\n', 'generously made available by Internet Archive (https://archive.org)\n', '\n', '\n', '\n', 'Note: Project Gutenberg also ha

Now that we have our text, let's count some words.

In [10]:
wordcount(lines, 'the')

text is not a string! Try again!


0

We are told that we haven't input our text into the function in the correct format. Why is that?

In [11]:
type(lines)

list

In [12]:
wordcount?

Since we have a list of strings rather than a single string, we need to join the pieces together and then clean up the text.

In [13]:
joined = ''.join(lines)  # joins the strings in lines, using the string in the quotes (here, nothing) as the separator
type(joined)

str
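
To see what the separator string does, here is a small sketch on a made-up list rather than the book:

```python
pieces = ['Pride', 'and', 'Prejudice']

print(''.join(pieces))    # no separator
print(' '.join(pieces))   # a single space between items
print(', '.join(pieces))  # comma and space between items
```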

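Now we have a string, but it still contains many newline characters (`\n`). The apostrophes are displayed with a backslash (`\'`), but that is only how Python shows a single quote inside a single-quoted string; the backslash is not part of the text: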

In [14]:
joined

'\ufeffThe Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited\nby R. W. (Robert William) Chapman\n\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\n\n\n\nTitle: Pride and Prejudice\n\n\nAuthor: Jane Austen\n\nEditor: R. W. (Robert William) Chapman\n\nRelease Date: May 9, 2013  [eBook #42671]\n\nLanguage: English\n\n\n***START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***\n\n\nE-text prepared by Greg Weeks, Jon Hurst, Mary Meehan, and the Online\nDistributed Proofreading Team (http://www.pgdp.net) from page images\ngenerously made available by Internet Archive (https://archive.org)\n\n\n\nNote: Project Gutenberg also has an HTML version of this\n      file which includes the original illustrations.\n      See 42671-h.htm or 42671-h.zip:\n      (http://ww

In [15]:
# remove newlines and apostrophes ("\'" is the same string as "'")
cleanedtext = joined.replace('\n', '').replace("\'", '')
cleanedtext[:1000]

'\ufeffThe Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Editedby R. W. (Robert William) ChapmanThis eBook is for the use of anyone anywhere at no cost and withalmost no restrictions whatsoever.  You may copy it, give it away orre-use it under the terms of the Project Gutenberg License includedwith this eBook or online at www.gutenberg.orgTitle: Pride and PrejudiceAuthor: Jane AustenEditor: R. W. (Robert William) ChapmanRelease Date: May 9, 2013  [eBook #42671]Language: English***START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***E-text prepared by Greg Weeks, Jon Hurst, Mary Meehan, and the OnlineDistributed Proofreading Team (http://www.pgdp.net) from page imagesgenerously made available by Internet Archive (https://archive.org)Note: Project Gutenberg also has an HTML version of this      file which includes the original illustrations.      See 42671-h.htm or 42671-h.zip:      (http://www.gutenberg.org/files/42671/42671-h/42671-h.htm)      or      (http://www.
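
Because the newlines are replaced with an empty string, the last word of each line runs into the first word of the next. A hedged alternative, not the approach used for the counts in this walkthrough, is to split on any whitespace and rejoin with single spaces. The sample string is a shortened stand-in for the book.

```python
sample = 'Jane Austen, Edited\nby R. W. (Robert William) Chapman\n\n\nThis eBook'

# str.split() with no arguments splits on runs of whitespace (including
# newlines), so joining with ' ' leaves exactly one space between words.
spaced = ' '.join(sample.split())
print(spaced)
```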

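Now it looks much cleaner, although removing the newlines without adding a space has fused the words on either side of each line break (for example, `Editedby`). Let's use our counting function: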

In [16]:
wordcount(cleanedtext, 'the')

7482
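
`str.count` counts substrings, so this total also includes words such as "there" or "other" that merely contain "the". A minimal sketch of counting whole words with the `re` module follows. `count_whole_word` is a hypothetical helper, and the sample sentence is made up.

```python
import re


def count_whole_word(text, word):
    """Count occurrences of `word` that stand alone as a whole word."""
    # \b matches a word boundary; re.escape protects characters like '.'
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text))


sample = 'the other day, there was the theatre'
print(sample.count('the'))              # substring matches
print(count_whole_word(sample, 'the'))  # whole-word matches only
```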

Let's see how often different characters are mentioned in the book. We'll store each character's name with its count in a dictionary.

In [17]:
characters = {}  # initialize the dictionary of characters
characters['Mr. Darcy'] = wordcount(cleanedtext, 'Mr. Darcy')

In [18]:
name = 'Elizabeth'  # store the name as a variable and then use it multiple times
characters[name] = wordcount(cleanedtext, name)

In [19]:
# use a list of names and loop through
names = ['Mr. Bingley', 'Mrs. Bennet', 'Jane']

for name in names:
    characters[name] = wordcount(cleanedtext, name)

In [20]:
print(characters)

{'Mr. Bingley': 107, 'Jane': 294, 'Mrs. Bennet': 133, 'Elizabeth': 634, 'Mr. Darcy': 250}


Clearly Elizabeth is stealing the show!
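
To rank the characters, the dictionary can be sorted by count. A small sketch, using the counts printed above:

```python
characters = {'Mr. Bingley': 107, 'Jane': 294, 'Mrs. Bennet': 133,
              'Elizabeth': 634, 'Mr. Darcy': 250}

# sorted() with key=characters.get orders the names by their counts
ranked = sorted(characters, key=characters.get, reverse=True)
for name in ranked:
    print(name, characters[name])
```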

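Note that the dictionary entries are not printed in the order we added them. In the Python version used to run this notebook, dictionaries did not preserve insertion order; since Python 3.7, they are guaranteed to keep the order in which keys were added.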

## Mad Libs

Replace every occurrence of one character's name with another word.

In [21]:
newtext = cleanedtext.replace('Mr. Bingley', 'BACON')
newtext

'\ufeffThe Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Editedby R. W. (Robert William) ChapmanThis eBook is for the use of anyone anywhere at no cost and withalmost no restrictions whatsoever.  You may copy it, give it away orre-use it under the terms of the Project Gutenberg License includedwith this eBook or online at www.gutenberg.orgTitle: Pride and PrejudiceAuthor: Jane AustenEditor: R. W. (Robert William) ChapmanRelease Date: May 9, 2013  [eBook #42671]Language: English***START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***E-text prepared by Greg Weeks, Jon Hurst, Mary Meehan, and the OnlineDistributed Proofreading Team (http://www.pgdp.net) from page imagesgenerously made available by Internet Archive (https://archive.org)Note: Project Gutenberg also has an HTML version of this      file which includes the original illustrations.      See 42671-h.htm or 42671-h.zip:      (http://www.gutenberg.org/files/42671/42671-h/42671-h.htm)      or      (http://www.
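
Several names can be swapped in one pass by looping over a dictionary of replacements. This is a sketch on a made-up sentence, and the substitute words are arbitrary:

```python
swaps = {'Mr. Bingley': 'BACON', 'Mr. Darcy': 'EGGS'}
sample = 'Mr. Bingley danced while Mr. Darcy stood by.'

newsample = sample
for name, substitute in swaps.items():
    # str.replace returns a new string; the original is left unchanged
    newsample = newsample.replace(name, substitute)

print(newsample)
```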

## Write out

Let's write our version of Pride and Prejudice out to a file.

In [19]:
# This step has already been done and the file saved in the repo.
# fout = open(loc + 'pp+bacon.txt', 'w')  # open up a file to write into
# fout.write(newtext)
# fout.close()
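
A `with` block closes the file automatically, even if writing fails. Below is a sketch that writes a short stand-in string to a temporary directory, so nothing in the repo is overwritten. The sample text and the temporary location are assumptions for the demo.

```python
import os
import tempfile

newtext = 'BACON entered the room.'  # made-up stand-in for the full edited text

outdir = tempfile.mkdtemp()          # throwaway directory for the demo
outpath = os.path.join(outdir, 'pp+bacon.txt')

with open(outpath, 'w', encoding='utf-8') as fout:
    fout.write(newtext)

# read it back to confirm the contents were written
with open(outpath, encoding='utf-8') as fin:
    print(fin.read() == newtext)
```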