# Chapter 9 Reading and Writing Natural Languages

## Summarizing Data

Before we start processing natural language, let's recall the 2-grams program we wrote in the last chapter:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
from collections import Counter

def cleanSentence(sentence):
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation+string.whitespace) for word in sentence]
    sentence = [word for word in sentence if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
    return sentence

def cleanInput(content):
    content = content.upper()
    content = re.sub('\n', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    sentences = content.split('. ')
    return [cleanSentence(sentence) for sentence in sentences]

def getNgramsFromSentence(content, n):
    output = []
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

def getNgrams(content, n):
    content = cleanInput(content)
    ngrams = Counter()
    for sentence in content:
        # Join each n-gram's words into a single string so it can be counted
        newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
        ngrams.update(newNgrams)
    return ngrams

In [2]:
content = str(
    urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(),
    'utf-8')
ngrams = getNgrams(content, 2)
ngrams.most_common(10)

[('OF THE', 213),
 ('IN THE', 65),
 ('TO THE', 61),
 ('BY THE', 41),
 ('THE CONSTITUTION', 34),
 ('OF OUR', 29),
 ('TO BE', 26),
 ('THE PEOPLE', 24),
 ('FROM THE', 24),
 ('THAT THE', 23)]
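
To make the sliding window in `getNgramsFromSentence()` concrete, here is a minimal sketch on a toy sentence. The helper name `sliding_ngrams` and the toy word list are assumptions made for this illustration; they are not part of the program above.

```python
def sliding_ngrams(words, n):
    """Return every run of n consecutive words as a list of lists."""
    # A sentence with fewer than n words yields an empty list.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# Hypothetical, already-cleaned sentence (uppercase, punctuation stripped)
toy_sentence = ['WE', 'THE', 'PEOPLE', 'OF', 'THE', 'UNITED', 'STATES']

# Join each 2-gram into one string, as getNgrams() does before counting
bigrams = [' '.join(gram) for gram in sliding_ngrams(toy_sentence, 2)]
print(bigrams)
```

A sentence of 7 words produces 6 overlapping 2-grams, which is why very frequent words such as "THE" show up in so many of the top entries.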

Among the 10 most common 2-grams, "the constitution" seems more noteworthy than "of the", "in the", and "to the". So how can we get rid of 2-grams that are built from such common words? Here's a naively simple method, `isCommon()`.

In [3]:
def isCommon(ngram):
    commonWords = ['THE', 'BE', 'AND', 'OF', 'A', 'IN', 'TO', 'HAVE', 'IT', 'I', 'THAT', 'FOR', 'YOU', 'HE', 'WITH', 'ON', 'DO', 'SAY', 'THIS', 'THEY', 'IS', 'AN', 'AT', 'BUT', 'WE', 'HIS', 'FROM', 'THAT', 'NOT', 'BY', 'SHE', 'OR', 'AS', 'WHAT', 'GO', 'THEIR', 'CAN', 'WHO', 'GET', 'IF', 'WOULD', 'HER', 'ALL', 'MY', 'MAKE', 'ABOUT', 'KNOW', 'WILL', 'AS', 'UP', 'ONE', 'TIME', 'HAS', 'BEEN', 'THERE', 'YEAR', 'SO', 'THINK', 'WHEN', 'WHICH', 'THEM', 'SOME', 'ME', 'PEOPLE', 'TAKE', 'OUT', 'INTO', 'JUST', 'SEE', 'HIM', 'YOUR', 'COME', 'COULD', 'NOW', 'THAN', 'LIKE', 'OTHER', 'HOW', 'THEN', 'ITS', 'OUR', 'TWO', 'MORE', 'THESE', 'WANT', 'WAY', 'LOOK', 'FIRST', 'ALSO', 'NEW', 'BECAUSE', 'DAY', 'MORE', 'USE', 'NO', 'MAN', 'FIND', 'HERE', 'THING', 'GIVE', 'MANY', 'WELL']
    for word in ngram:
        if word in commonWords:
            return True
    return False

def getNgramsFromSentence(content, n):
    output = []
    for i in range(len(content)-n+1):
        if not isCommon(content[i:i+n]):
            output.append(content[i:i+n])
    return output

In [4]:
ngrams = getNgrams(content, 2)
ngrams.most_common(10)

[('UNITED STATES', 10),
 ('EXECUTIVE DEPARTMENT', 4),
 ('GENERAL GOVERNMENT', 4),
 ('CALLED UPON', 3),
 ('CHIEF MAGISTRATE', 3),
 ('LEGISLATIVE BODY', 3),
 ('SAME CAUSES', 3),
 ('GOVERNMENT SHOULD', 3),
 ('WHOLE COUNTRY', 3),
 ('WAS OBSERVABLE', 2)]

After filtering out common words, we find that "United States" and "executive department" top the list, which hints that the text might be a presidential inauguration speech.
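
One consequence of this filter is that a 2-gram is dropped as soon as any one of its words is common, so "THE CONSTITUTION" disappears along with "OF THE". The sketch below illustrates that behaviour with a set lookup and `any()`. The name `contains_stop_word` and the short `STOP_WORDS` set are assumptions for this example, not the full list used in `isCommon()`.

```python
# Hypothetical, much shorter stop-word set used only for this illustration
STOP_WORDS = frozenset({'THE', 'OF', 'IN', 'TO', 'BY', 'OUR', 'BE'})

def contains_stop_word(ngram):
    """Return True if any word of the n-gram (a list of words) is a stop word."""
    return any(word in STOP_WORDS for word in ngram)

candidates = [['OF', 'THE'], ['THE', 'CONSTITUTION'], ['UNITED', 'STATES']]

# Keep only the n-grams made up entirely of uncommon words
kept = [gram for gram in candidates if not contains_stop_word(gram)]
print(kept)
```

Using a set instead of a list makes each membership test a constant-time lookup, and defining it once at module level avoids rebuilding the collection on every call.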

## Markov Models

Following the "Markov Models" section of the textbook, we can use the same text to generate Markov chains of arbitrary length (here the chain length is set to 100).

In [5]:
from urllib.request import urlopen
from random import randint

def wordListSum(wordList):
    # Total number of times any word followed the current word
    # (avoids shadowing the built-in sum() with a local variable)
    return sum(wordList.values())

def retrieveRandomWord(wordList):
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    # Remove newlines and quotes
    text = text.replace('\n', ' ')
    text = text.replace('"', '')

    # Make sure punctuation marks are treated as their own "words,"
    # so that they will be included in the Markov chain
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, ' {} '.format(symbol))

    words = text.split(' ')
    # Filter out empty words
    words = [word for word in words if word != '']

    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            # Create a new dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict
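
The dictionary of dictionaries that `buildWordDict()` returns maps each word to the words that follow it, together with how often they do. As a rough sketch of the same idea, the version below uses `collections.defaultdict` and `Counter` on a toy sentence. The name `build_transitions` and the toy text are assumptions for this example; it is an alternative formulation, not the function above.

```python
from collections import Counter, defaultdict

def build_transitions(words):
    """Map each word to a Counter of the words that directly follow it."""
    transitions = defaultdict(Counter)
    # Pair every word with its successor: (w0, w1), (w1, w2), ...
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word][next_word] += 1
    return transitions

# Hypothetical toy text, with punctuation already split into its own "words"
toy_words = 'the cat sat on the mat . the cat ran .'.split()
transitions = build_transitions(toy_words)
print(dict(transitions['the']))
```

Here "the" is followed twice by "cat" and once by "mat", so a chain standing at "the" should move to "cat" about two thirds of the time.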

In [6]:
text = str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt')
          .read(), 'utf-8')
wordDict = buildWordDict(text)

length = 100
chain = ['I']
for i in range(0, length):
    newWord = retrieveRandomWord(wordDict[chain[-1]])
    chain.append(newWord)

print(' '.join(chain))

I give some indications to be prescribed by my fellow-citizens , and to encourage them , the more of the temple of the executive branch , those from which their country and the officers who would compare our institutions of its accomplishment . If such constituents in the exercise of the concurrence of the public money , quadrupled in it makes him the master . The people . It is suffered to the great alarm . Never has necessarily resulted from the one of the very reverse is danger is the genuine spirit of the genuine spirit which it brings to
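
The weighted pick in `retrieveRandomWord()` can also be expressed with the standard library's `random.choices`, which accepts a `weights` argument. The sketch below is one possible variant under that assumption. The names `pick_next_word` and `generate_chain` and the toy dictionary are made up for this example. It also stops early when the current word has no recorded followers, a case in which the loop above could raise a `KeyError` (for instance, a word that appears only at the very end of the text).

```python
import random

def pick_next_word(followers, rng=random):
    """Pick one key of `followers`, with probability proportional to its count."""
    words = list(followers)
    weights = [followers[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate_chain(word_dict, start, length, rng=random):
    """Walk the chain for up to `length` steps, stopping at a dead end."""
    chain = [start]
    for _ in range(length):
        followers = word_dict.get(chain[-1])
        if not followers:
            # No word was ever seen after this one, so the chain ends here
            break
        chain.append(pick_next_word(followers, rng))
    return chain

# Hypothetical toy dictionary in the same shape as buildWordDict()'s output
toy_dict = {'I': {'am': 2, 'was': 1}, 'am': {'here': 1}, 'was': {'here': 1}}
rng = random.Random(0)  # seeded generator so repeated runs give the same chain
print(' '.join(generate_chain(toy_dict, 'I', 10, rng)))
```

Passing a seeded `random.Random` instance keeps the output reproducible while experimenting; leaving `rng` at its default uses the module-level generator.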
