## Summarizing Data

The following code takes the text data and returns a `Counter` object containing all 2-grams in the text:

In [1]:
import re
import string
from collections import Counter

from urllib.request import urlopen
from bs4 import BeautifulSoup


def cleanInput(content):
    content = content.lower()
    content = re.sub('\n', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    sentences = content.split('. ')
    return [cleanSentence(sentence) for sentence in sentences]

def cleanSentence(sentence):
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation+string.whitespace) for word in sentence]
    sentence = [word for word in sentence if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
    return sentence



def getNgramsFromSentence(content, n):
    output = []
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

def getNgrams(content, n):
    content = cleanInput(content)
    ngrams = Counter()
    for sentence in content:
        # Join each n-gram (a list of words) into a single space-separated string
        newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
        ngrams.update(newNgrams)
    return ngrams


content = str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(), 'utf-8')
ngrams = getNgrams(content, 2)
print(ngrams.most_common(10))

[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ('of our', 29), ('to be', 26), ('the people', 24), ('from the', 24), ('that the', 23)]
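
To see what the counting step does on a small scale, here is a minimal, self-contained sketch. The helper name `count_ngrams` and the sample sentences are made up for illustration, and the cleaning is reduced to lowercasing and splitting on whitespace, so this is not the `getNgrams` function above. Like that function, it builds n-grams one sentence at a time, so no n-gram spans a sentence boundary.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams per sentence (hypothetical helper, simplified cleaning)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Slide a window of n words across the sentence
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts

# Made-up sample sentences, already split at sentence boundaries
sample = ['The people elect the president', 'The president serves the people']
bigram_counts = count_ngrams(sample, 2)
print(bigram_counts.most_common(2))
```

Even though the first sentence ends with "president" and the second begins with "the", the 2-gram "president the" is never counted, because each sentence is processed separately.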


Of these 2-grams, “the constitution” seems like a reasonably popular subject in the speech, but “of the,” “in the,” and “to the” don’t seem especially noteworthy. How can you automatically get rid of unwanted words in an accurate way? Here we introduce the concept of stop words.

In [5]:
def isCommon(ngram):
    StopWords = ['THE', 'BE', 'AND', 'OF', 'A', 'IN', 'TO', 'HAVE', 'IT', 'I', 'THAT', 'FOR', 'YOU', 'HE', 'WITH', 'ON', 'DO', 
                 'SAY', 'THIS', 'THEY', 'IS', 'AN', 'AT', 'BUT', 'WE', 'HIS', 'FROM', 'THAT', 'NOT', 'BY', 'SHE', 'OR', 'AS', 
                 'WHAT', 'GO', 'THEIR', 'CAN', 'WHO', 'GET', 'IF', 'WOULD', 'HER', 'ALL', 'MY', 'MAKE', 'ABOUT', 'KNOW', 
                 'WILL', 'AS', 'UP', 'ONE', 'TIME', 'HAS', 'BEEN', 'THERE', 'YEAR', 'SO', 'THINK', 'WHEN', 'WHICH', 'THEM', 
                 'SOME', 'ME', 'PEOPLE', 'TAKE', 'OUT', 'INTO', 'JUST', 'SEE', 'HIM', 'YOUR', 'COME', 'COULD', 'NOW', 'THAN', 
                 'LIKE', 'OTHER', 'HOW', 'THEN', 'ITS', 'OUR', 'TWO', 'MORE', 'THESE', 'WANT', 'WAY', 'LOOK', 'FIRST', 'ALSO', 
                 'NEW', 'BECAUSE', 'DAY', 'MORE', 'USE', 'NO', 'MAN', 'FIND', 'HERE', 'THING', 'GIVE', 'MANY', 'WELL']
    StopWords = [word.lower() for word in StopWords]
    for word in ngram:
        if word in StopWords:
            return True
    return False

def getNgramsFromSentence(content, n):
    output = []
    for i in range(len(content)-n+1):
        if not isCommon(content[i:i+n]):
            output.append(content[i:i+n])
    return output

ngrams = getNgrams(content, 2)
print(ngrams.most_common(10))

[('united states', 10), ('executive department', 4), ('general government', 4), ('called upon', 3), ('chief magistrate', 3), ('legislative body', 3), ('same causes', 3), ('government should', 3), ('whole country', 3), ('was observable', 2)]
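
One possible refinement, sketched here under the assumption that the stop-word list is fixed, is to store the words in a set that is built once instead of rebuilding and lowercasing a list on every call. The names `STOP_WORDS` and `is_common` and the shortened word list are illustrative only.

```python
# Shortened, illustrative stop-word set; a real list would be much longer
STOP_WORDS = frozenset({'the', 'of', 'in', 'to', 'by', 'a', 'and', 'be'})

def is_common(ngram):
    """Return True if any word of the n-gram (a list of words) is a stop word."""
    return any(word.lower() in STOP_WORDS for word in ngram)

candidates = [['of', 'the'], ['united', 'states'], ['the', 'constitution']]
kept = [ngram for ngram in candidates if not is_common(ngram)]
print(kept)
```

Note that an n-gram is dropped as soon as any one of its words is a stop word, which is why “the constitution” no longer appears in the filtered output above.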


## Markov Models
Again using the inauguration speech of William Henry Harrison analyzed in the previous example, you can write the following code that generates arbitrarily long Markov chains (with the chain length set to 100) based on the structure of its text. The output changes every time the code is run; an example of the uncannily nonsensical text it generates is shown after the code.


The function `buildWordDict` takes in the string of text and then does some cleaning and formatting, removing quotes and putting spaces around punctuation so that each punctuation mark is treated as a separate word. After this, it builds a two-dimensional dictionary (a dictionary of dictionaries) that has the following form:

    '''
    {word_a : {word_b : 2, word_c : 1, word_d : 1}, 
    word_e : {word_b : 5, word_d : 2},...}
    '''
    
In this example dictionary, “word_a” was found four times, two instances of which were followed by “word_b,” one instance followed by “word_c,” and one instance followed by “word_d.” “word_e” was followed seven times: five times by “word_b” and twice by “word_d.”

If we were to draw a node model of this result, the node representing word_a would have a 50% arrow pointing toward “word_b” (which followed it two out of four times), a 25% arrow pointing toward “word_c,” and a 25% arrow pointing toward “word_d.”
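
The percentages on those arrows can be computed directly from the counts. The following is a small sketch using the example dictionary from above; the helper name `transition_probabilities` is hypothetical.

```python
# Example counts in the dictionary-of-dictionaries form described above
word_dict = {
    'word_a': {'word_b': 2, 'word_c': 1, 'word_d': 1},
    'word_e': {'word_b': 5, 'word_d': 2},
}

def transition_probabilities(followers):
    """Convert follower counts for one word into probabilities that sum to 1."""
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

probs_a = transition_probabilities(word_dict['word_a'])
print(probs_a)  # {'word_b': 0.5, 'word_c': 0.25, 'word_d': 0.25}
```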

In [6]:
from urllib.request import urlopen
from random import randint

def wordListSum(wordList):
    # Total number of times any word followed the current word
    return sum(wordList.values())

def retrieveRandomWord(wordList):
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    # Remove newlines and quotes
    text = text.replace('\n', ' ')
    text = text.replace('"', '')

    # Add spaces around punctuation marks so that they are included in the Markov chain as separate words
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, ' {} '.format(symbol))

    words = text.split(' ')
    # Filter out empty words
    words = [word for word in words if word != '']

    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            # Create a new dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict

text = str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt')
          .read(), 'utf-8')
wordDict = buildWordDict(text)

# Generate a Markov chain of length 100
length = 100
chain = ['I']
for i in range(0, length):
    newWord = retrieveRandomWord(wordDict[chain[-1]])
    chain.append(newWord)

print(' '.join(chain))

I refer to the execution of power in attributing the States accept a profound reverence for the dictates of want of the errors there is suffered to constitute a spirit which our free people of a thorough examination of the servant , not yet , the Convention that of no member of each were alarmed at least to communicate information and safety from his growth and sense of disunion , is the departments , virtually subject the greatest of encroachments of interest , discord , that which belong to recommend measures so long exist . In other person . There is
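
The weighted selection done by `retrieveRandomWord` can also be expressed with `random.choices` from the standard library, which accepts a `weights` argument. The sketch below is one possible variant, not the code above: the function name `generate_chain`, the `rng` parameter, and the toy dictionary are assumptions made for illustration. It also stops early when the current word has no recorded followers, a case that can occur for the final word of a text.

```python
import random

def generate_chain(word_dict, start, length, rng=random):
    """Walk the word dictionary, picking each next word in proportion to its count."""
    chain = [start]
    for _ in range(length):
        followers = word_dict.get(chain[-1])
        if not followers:
            # The current word was never followed by anything; stop early
            break
        words = list(followers)
        weights = list(followers.values())
        chain.append(rng.choices(words, weights=weights, k=1)[0])
    return chain

# Toy dictionary in the same shape that buildWordDict produces
toy_dict = {'i': {'am': 3, 'was': 1}, 'am': {'here': 1}, 'was': {'here': 1}}
print(' '.join(generate_chain(toy_dict, 'i', 5, random.Random(0))))
```

Passing a seeded `random.Random` instance makes a run repeatable, which can be handy when debugging; omitting it falls back to the module-level generator.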


## Statistical Analysis with NLTK
NLTK is great for generating statistical information about word counts, word frequency, and word diversity in sections of text. Analysis with NLTK always starts with the `Text` object. `Text` objects can be created from simple Python strings in the following way:

In [2]:
from nltk import Text
from nltk import word_tokenize

tokens = word_tokenize('Here is some not very interesting text')
text = Text(tokens)
print(text)
print(tokens)

<Text: Here is some not very interesting text...>
['Here', 'is', 'some', 'not', 'very', 'interesting', 'text']


The input for the `word_tokenize` function can be any Python text string. If you don’t have long strings handy but still want to play around with the features, NLTK has quite a few books already built into the library, which can be accessed as follows:

In [3]:
from nltk.book import text6
print(text6)

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### Count unique words
`Text` objects can be manipulated much like normal Python lists. Using this property, you can count the number of unique words in a text and compare it against the total number of words. The ratio computed below shows that each word in the script was used about eight times on average.

In [7]:
len(text6)/len(set(text6))

7.833333333333333
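
The same calculation works on any list of tokens. Here is a minimal sketch with a made-up token list, so it runs without NLTK:

```python
# Made-up tokens standing in for a Text object
tokens = ['spam', 'spam', 'eggs', 'spam', 'ham', 'eggs']

total_words = len(tokens)          # 6 tokens in total
unique_words = len(set(tokens))    # 3 distinct tokens
average_uses = total_words / unique_words
print(average_uses)  # 2.0
```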

### Word frequency
We can also put the text into a frequency distribution object to determine some of the most common words and the frequencies for various words:

In [10]:
from nltk import FreqDist

fdist = FreqDist(text6)
fdist.most_common(10)

[(':', 1197),
 ('.', 816),
 ('!', 801),
 (',', 731),
 ("'", 421),
 ('[', 319),
 (']', 312),
 ('the', 299),
 ('I', 255),
 ('ARTHUR', 225)]
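
Because the top of this list is dominated by punctuation, it can help to filter the tokens before counting. The sketch below uses `collections.Counter` from the standard library as a stand-in so that it runs without NLTK, and the token list is made up. The same generator expression should also work as the argument to `FreqDist`.

```python
from collections import Counter

# Made-up tokens in the style of the tokenized script
tokens = ['ARTHUR', ':', 'Whoa', 'there', '!', 'ARTHUR', ':',
          'It', 'is', 'I', ',', 'Arthur', '!']

# Keep only purely alphabetic tokens, dropping punctuation such as ':' and '!'
word_counts = Counter(token for token in tokens if token.isalpha())
print(word_counts.most_common(1))  # [('ARTHUR', 2)]
```

Counting is case-sensitive here, so 'ARTHUR' and 'Arthur' are tallied separately; lowercasing the tokens first would merge them.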

### 2-grams and 4-grams
We can create, search, and list n-grams extremely easily in NLTK. The `bigrams` function breaks a text object into 2-grams, and the more general `ngrams` function produces n-grams of any size, governed by its second parameter.

In [15]:
from nltk import bigrams

# Use a distinct name so the imported bigrams function is not overwritten
text6Bigrams = bigrams(text6)

bigramsDist = FreqDist(text6Bigrams)
bigramsDist.most_common(7)

[(('ARTHUR', ':'), 217),
 (("'", 's'), 140),
 ((']', '['), 94),
 (('!', '['), 82),
 ((':', 'Oh'), 82),
 (('Oh', ','), 79),
 (("'", 't'), 77)]

In [16]:
from nltk import ngrams

fourgrams = ngrams(text6, 4)
fourgramsDist = FreqDist(fourgrams)
fourgramsDist.most_common(7)

[((':', '[', 'singing', ']'), 25),
 (('GUARD', '#', '1', ':'), 23),
 (('VILLAGER', '#', '1', ':'), 22),
 (('Hello', '.', 'Hello', '.'), 22),
 (('.', 'Hello', '.', 'Hello'), 21),
 (('SOLDIER', '#', '1', ':'), 18),
 (('witch', '!', 'A', 'witch'), 17)]
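
To make the tuple structure of these results concrete, here is a small sketch that builds n-gram tuples from a plain list using `zip`. The helper name `make_ngrams` and the token list are hypothetical; this illustrates the sliding-window idea and is not NLTK's implementation.

```python
def make_ngrams(tokens, n):
    """Return a list of n-gram tuples using a sliding window (hypothetical helper)."""
    # zip stops at the shortest slice, so only full windows are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['A', 'witch', '!', 'A', 'witch', '!']
fourgram_list = make_ngrams(tokens, 4)
print(fourgram_list)
```

A list of six tokens yields three 4-grams, one for each position where a full window of four tokens fits.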

### Lexicographical Analysis with NLTK