# Project Gutenberg Analysis

In [1]:
import re
import string
import heapq
import random
from collections import defaultdict

In [2]:
with open('alice.txt', 'rt') as f:
    alice = f.read()

In [12]:
def createWordCounts(text):
    word_counts = defaultdict(int)
    for line in text.split('\n'):
        words = line.split()
        for word in words:
            word = word.strip(string.punctuation).lower()
            if word == '':
                continue
            word_counts[word] += 1
    return word_counts

In [13]:
word_counts = createWordCounts(alice)

## Whole-Text Analysis

Create a method that will take in a .txt file and return the number of words in the file. This method should be named getTotalNumberOfWords()

In [15]:
def getTotalNumberOfWords(word_counts):
    return sum(word_counts.values())

In [16]:
getTotalNumberOfWords(word_counts)

26376

Create a new method that returns the number of UNIQUE words in the novel. This method should be named getTotalUniqueWords()

In [17]:
def getTotalUniqueWords(word_counts):
    return len(word_counts)

In [18]:
getTotalUniqueWords(word_counts)

2773

Implement an algorithm that will return the 20 most frequently used words in the novel and the number of times they were used. This method should be named get20MostFrequentWords()

In [19]:
def get20MostFrequentWords(word_counts):
    negative_counts = [(-count, word) 
                       for word, count in word_counts.items()]
    heapq.heapify(negative_counts)
    ret = []
    for _ in range(20):
        count, word = heapq.heappop(negative_counts)
        ret.append((word, -count))
    return ret

In [20]:
get20MostFrequentWords(word_counts)

[('the', 1629),
 ('and', 844),
 ('to', 721),
 ('a', 627),
 ('she', 537),
 ('it', 526),
 ('of', 508),
 ('said', 462),
 ('i', 400),
 ('alice', 385),
 ('in', 365),
 ('you', 360),
 ('was', 357),
 ('that', 276),
 ('as', 262),
 ('her', 248),
 ('at', 209),
 ('on', 193),
 ('with', 181),
 ('all', 179)]
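
The heap-based selection above is hard-coded to 20 words and would raise an error on a text with fewer than 20 distinct words. As a rough sketch, the same ranking can be expressed with `heapq.nsmallest` and a configurable size. The name `getMostFrequentWords`, the parameter `n`, and the sample counts are illustrative assumptions, not part of the original exercise.

```python
import heapq

def getMostFrequentWords(word_counts, n=20):
    # Hypothetical generalization of get20MostFrequentWords().
    # Sort key: highest count first, ties broken alphabetically,
    # which mirrors the (-count, word) tuples used in the heap version.
    return heapq.nsmallest(n, word_counts.items(),
                           key=lambda item: (-item[1], item[0]))

# Made-up counts, used only to show the call shape.
sample_counts = {'the': 5, 'and': 3, 'cat': 3, 'hat': 1}
top_three = getMostFrequentWords(sample_counts, n=3)
print(top_three)
```

If `n` exceeds the number of distinct words, `heapq.nsmallest` simply returns every entry in ranked order.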

Implement a new algorithm that filters the most common 100 English words and then returns the 20 most frequently used words and the number of times they were used. This method should be named get20MostInterestingFrequentWords()

In [21]:
def get20MostInterestingFrequentWords(word_counts, words_to_exclude):
    negative_counts = [(-count, word)
                       for word, count in word_counts.items()
                       if word not in words_to_exclude]
    heapq.heapify(negative_counts)
    ret = []
    for _ in range(20):
        count, word = heapq.heappop(negative_counts)
        ret.append((word, -count))
    return ret

In [22]:
with open('common-english-words.txt') as f:
    text = f.read()
# A set comprehension gives constant-time membership checks
words_to_exclude = {word.lower() for word in text.split('\n')}
get20MostInterestingFrequentWords(word_counts, words_to_exclude)

[('alice', 385),
 ('herself', 83),
 ('queen', 68),
 ('into', 67),
 ("i'm", 57),
 ('its', 57),
 ('mock', 56),
 ('turtle', 56),
 ('gryphon', 55),
 ('hatter', 55),
 ("it's", 54),
 ('looked', 45),
 ('rabbit', 43),
 ('dormouse', 39),
 ('duchess', 39),
 ('mouse', 38),
 ("i've", 34),
 ('march', 33),
 ("that's", 33),
 ('hare', 31)]

Implement a new algorithm that returns the 20 LEAST frequently used words and the number of times they were used. This method should be named get20LeastFrequentWords()

In [23]:
def get20LeastFrequentWords(word_counts):
    positive_counts = [(count, word) 
                       for word, count in word_counts.items()]
    heapq.heapify(positive_counts)
    ret = []
    for _ in range(20):
        count, word = heapq.heappop(positive_counts)
        ret.append((word, count))
    return ret

In [24]:
get20LeastFrequentWords(word_counts)

[("a--i'm", 1),
 ('a-piece', 1),
 ('abide', 1),
 ('able', 1),
 ('absence', 1),
 ('acceptance', 1),
 ('accidentally', 1),
 ('account', 1),
 ('accounting', 1),
 ('accounts', 1),
 ('accusation', 1),
 ('accustomed', 1),
 ('ache', 1),
 ('act', 1),
 ('actually', 1),
 ('ada', 1),
 ('adding', 1),
 ('addressing', 1),
 ('adjourn', 1),
 ('adoption', 1)]

## Chapter-by-Chapter Analysis

In [42]:
chapters = re.split(r'CHAPTER \w+\.', alice)[1:]
chapters = [re.sub(r'\n', ' ', chapter) for chapter in chapters]
chapter_freqs = {}
for i, chapter in enumerate(chapters):
    chapter_freqs[i] = createWordCounts(chapter)

Implement a method that takes in a word and returns an array of the number of times the word was used in each chapter. This method should be named getFrequencyOfWord()

In [43]:
def getFrequencyOfWord(word, chapter_freqs):
    # .get() avoids inserting zero-count entries into the per-chapter defaultdicts
    return [chapter_freqs[chapter].get(word, 0) for chapter in chapter_freqs]

In [44]:
getFrequencyOfWord('alice', chapter_freqs)

[27, 24, 23, 30, 35, 43, 50, 39, 47, 29, 16, 22]

In [45]:
assert sum(getFrequencyOfWord('alice', chapter_freqs)) == word_counts['alice']

Implement a way for us to find out which chapter a given quote from the book can be found in. Your method should take in a string (the quote), return a number (the chapter number), and be named getChapterQuoteAppears(). If the quote cannot be found in the book, your method should return -1.

In [46]:
def getChapterQuoteAppears(quote, chapters):
    # Escape the quote so punctuation such as '?' or '(' is matched literally
    pat = re.compile(re.escape(quote))
    for i, chapter in enumerate(chapters):
        if pat.search(chapter):
            return i + 1
    return -1

In [48]:
getChapterQuoteAppears('The Rabbit started violently, dropped the white kid gloves and the fan, and skurried away into the darkness as hard as he could go.', chapters)

2
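
The lookup above only succeeds when the quote's spacing matches the chapter text exactly. A minimal sketch of a whitespace-tolerant variant follows. `normalizeWhitespace` and `findChapterOfQuote` are hypothetical helper names that do not appear in the original notebook, and the sample chapters are made up for illustration.

```python
def normalizeWhitespace(text):
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return ' '.join(text.split())

def findChapterOfQuote(quote, chapters):
    # Plain substring search, so no regex escaping is needed.
    needle = normalizeWhitespace(quote)
    for i, chapter in enumerate(chapters):
        if needle in normalizeWhitespace(chapter):
            return i + 1  # chapters are numbered from 1
    return -1

# Made-up chapter snippets, used only to exercise the function.
sample_chapters = ['Down the\nRabbit-Hole.', 'The Pool  of Tears (again?)']
print(findChapterOfQuote('the Rabbit-Hole', sample_chapters))
print(findChapterOfQuote('Pool of Tears (again?)', sample_chapters))
```

Normalizing every chapter on each call is wasteful for repeated lookups, so caching the normalized chapters once would likely be preferable in practice.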

Write a sentence in the author’s voice by implementing a method named generateSentence()


In [120]:
def getRandomNextWord(word, text):
    # Escape the word so any regex metacharacters in it are matched literally
    pattern = re.compile(re.escape(word) + r' (\S+)', re.IGNORECASE)
    next_words = [w.lower().strip(string.punctuation)
                  for w in pattern.findall(text)]
    if not next_words:
        return 'the'
    return random.choice(next_words)

In [121]:
def generateSentence(text):
    word = 'The'
    sentence = [word]
    for _ in range(19):
        next_word = getRandomNextWord(word, text)
        sentence.append(next_word)
        word = next_word
    return ' '.join(sentence) + '.'

In [133]:
generateSentence(alice)

'The royal children digging her head down their names were the reason of her than ever eat hurry that she.'
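
The generator above rescans the whole text with a regular expression for every word it emits. One possible alternative, sketched here under assumptions, is to build a table of successor words once and sample from it. `buildSuccessorTable` and `generateSentenceFromTable` are hypothetical names, the `rng` parameter is added only to make the example reproducible, and the sample text is made up.

```python
import random
import string
from collections import defaultdict

def buildSuccessorTable(text):
    # Map each word to the list of words that follow it somewhere in the text.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    successors = defaultdict(list)
    for current, following in zip(words, words[1:]):
        successors[current].append(following)
    return successors

def generateSentenceFromTable(successors, start='the', length=20, rng=random):
    word = start.lower()
    sentence = [word]
    for _ in range(length - 1):
        candidates = successors.get(word)
        # At a dead end, fall back to the start word, similar in spirit
        # to the 'the' fallback in getRandomNextWord().
        word = rng.choice(candidates) if candidates else start.lower()
        sentence.append(word)
    return ' '.join(sentence).capitalize() + '.'

# Made-up text, used only to show the call shape.
table = buildSuccessorTable('The cat sat. The cat ran.')
sentence = generateSentenceFromTable(table, length=5, rng=random.Random(0))
print(sentence)
```

Because repeated successors stay in the list, frequent word pairs are sampled proportionally more often, which is the same weighting the regex-based version gets from `random.choice`.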

### Sentence Completion / Prediction with Tries (OPTIONAL)
We want to create a simple autocomplete-like feature: given one or more words as input, return a list of all sentences that start with that input. Implement a method `List<String> getAutocompleteSentence(String startOfSentence)` that takes in a string and returns a list of strings that start with the input.
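
The notebook leaves this optional task unimplemented. The following is a rough sketch of one way to approach it in Python with a word-level trie built from nested dictionaries. The `SentenceTrie` class, the `buildSentenceTrie` helper, the extra `trie` argument, and the naive sentence splitter are all assumptions made for illustration, and the sample text is invented.

```python
import re

class SentenceTrie:
    # Word-level trie: each node is a dict keyed by lowercase word.
    # The key None marks the end of one or more stored sentences.
    def __init__(self):
        self.root = {}

    def insert(self, sentence):
        node = self.root
        for word in sentence.lower().split():
            node = node.setdefault(word, {})
        node.setdefault(None, []).append(sentence)

    def startingWith(self, prefix):
        node = self.root
        for word in prefix.lower().split():
            if word not in node:
                return []
            node = node[word]
        # Collect every sentence stored at or below the prefix node.
        results, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key is None:
                    results.extend(child)
                else:
                    stack.append(child)
        return results

def buildSentenceTrie(text):
    trie = SentenceTrie()
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r'(?<=[.!?])\s+', text):
        sentence = ' '.join(sentence.split())
        if sentence:
            trie.insert(sentence)
    return trie

def getAutocompleteSentence(startOfSentence, trie):
    return sorted(trie.startingWith(startOfSentence))

sample_trie = buildSentenceTrie('Alice was tired. Alice was\nvery sleepy. The Rabbit ran!')
print(getAutocompleteSentence('alice was', sample_trie))
```

Since the trie is keyed by whole words, a partial word such as 'Ali' does not match 'Alice', and punctuation attached to a word is treated as part of it. A character-level trie would relax the first limitation at the cost of more nodes.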

### Finding closest match (OPTIONAL)
For this challenge, your task is to implement a method defined as String findClosestMatchingQuote(String s). This method takes in a quote and returns the chapter the quote is found in. The catch is that the input may be misquoted, and the method should still be able to find it.
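
This optional task is also left unimplemented in the notebook. A minimal sketch of one possible approach uses `difflib.SequenceMatcher` from the standard library to score the quote against each sentence of each chapter. The extra `chapters` argument, the `cutoff` threshold of 0.6, the integer return value, and the sample chapters are assumptions for illustration only.

```python
import difflib
import re

def findClosestMatchingQuote(quote, chapters, cutoff=0.6):
    # Returns the 1-based chapter number of the best-matching sentence,
    # or -1 if no sentence scores above the (assumed) cutoff.
    target = ' '.join(quote.lower().split())
    best_chapter, best_score = -1, cutoff
    for i, chapter in enumerate(chapters):
        # Naive splitter: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r'(?<=[.!?])\s+', chapter):
            candidate = ' '.join(sentence.lower().split())
            matcher = difflib.SequenceMatcher(None, target, candidate,
                                              autojunk=False)
            score = matcher.ratio()
            if score > best_score:
                best_chapter, best_score = i + 1, score
    return best_chapter

# Made-up chapter snippets, used only to exercise the function.
sample_chapters = [
    'Alice was beginning to get very tired. She peeped into the book.',
    'The Rabbit started violently, dropped the white kid gloves and the fan.',
]
misquote = 'The Rabit started violently, droped the white gloves and the fan'
print(findClosestMatchingQuote(misquote, sample_chapters))
```

Comparing the quote against every sentence is slow on a full novel, and quotes that span several sentences would need a sliding window instead. Both refinements are left out to keep the sketch short.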