## Parsing Text with Python
This notebook walks us through some basic examples of parsing text with Python. There are places for you to fill in code of your own. A companion file has my solutions.

In [1]:
%matplotlib inline

import nltk
from nltk.book import *

import numpy as np
import matplotlib 
import matplotlib.pyplot as plt

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
texts()

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [None]:
sents()

## Concordance
In NLP, a _concordance_ lists every occurrence of a word of interest along with the words around it. Concordances let us see words in context. Let's look at some examples.

In [None]:
text1.concordance("monstrous")

In [None]:
text2.concordance("monstrous")
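
To make the idea concrete, here is a minimal pure-Python sketch of a keyword-in-context listing. The function name `keyword_in_context`, the `width` parameter, and the toy sentence are all made up for illustration. This is not how NLTK implements `concordance`, just the same idea on a plain list of tokens.

```python
def keyword_in_context(tokens, word, width=3):
    """Return (left, match, right) for each occurrence of `word` in `tokens`."""
    target = word.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # up to `width` tokens on either side of the match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Toy token list (made up, not taken from any NLTK corpus).
toy = "the monstrous whale rose and a most monstrous size it was".split()
for left, tok, right in keyword_in_context(toy, "monstrous", width=2):
    print(f"{left:>12} [{tok}] {right}")
```

Because NLTK texts behave like lists of tokens, a function like this should also accept `text1` or `text2`, though the built-in `concordance` is the better tool in practice.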

Question: How do Austen and Melville seem to be using "monstrous" differently?

Play around with some other words and other corpora. (You knew that was the plural, right? Yeah, me too.) Here's one to get you started.

In [None]:
text5.concordance(':P')

---
## Similarity
While a concordance lets us see context, _similarity_ lets us see which other words share that kind of context.

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")
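
One way to think about "sharing a context" is to record the word before and the word after each token, then look for other tokens that sit between the same pair. The sketch below does that on a toy list. The name `similar_words` and the simple shared-context count are assumptions for illustration. NLTK's own scoring in `similar` is more involved.

```python
from collections import defaultdict

def similar_words(tokens, word):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    contexts = defaultdict(set)
    for prev, mid, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[mid].add((prev, nxt))
    target = contexts.get(word, set())
    # number of contexts each other word has in common with the target word
    shared = {w: len(c & target) for w, c in contexts.items() if w != word}
    return sorted((w for w, n in shared.items() if n > 0),
                  key=lambda w: (-shared[w], w))

# Toy token list (made up for illustration).
toy = "a monstrous size and a curious size and a very big size".split()
print(similar_words(toy, "monstrous"))
```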

Again, play around with some different words and different corpora. Another one to get you started...

In [None]:
text5.similar("lol") # The rise of 'haha' began as early as the mid-aughts!

---
## Common Contexts
Common contexts take this to the next level, as you can now define a list of words and see the contexts they tend to have in common.

In [None]:
text1.common_contexts(["monstrous","curious"])

In [None]:
text2.common_contexts(["monstrous","very"])

The underscore marks the spot where either of these words appears. Pretty cool.
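
Here is a rough sketch of that idea on a toy token list: collect the (previous, next) pairs around each word, then keep only the pairs that every word has. The helper name `shared_contexts` is hypothetical and this is a simplification, not NLTK's implementation of `common_contexts`.

```python
def shared_contexts(tokens, words):
    """Return the (previous, next) pairs that every word in `words` appears between."""
    shared = None
    for target in words:
        seen = {(p, n)
                for p, w, n in zip(tokens, tokens[1:], tokens[2:])
                if w == target}
        shared = seen if shared is None else shared & seen
    return sorted(shared or [])

# Toy token list (made up for illustration).
toy = "a monstrous size and a curious size and a very big size".split()
for prev, nxt in shared_contexts(toy, ["monstrous", "curious"]):
    print(f"{prev}_{nxt}")  # the underscore stands in for the word itself
```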

---

## Visualizing Word Usage
As you know, I love me some data viz. NLTK gives you a way to visualize the distribution of words in a corpus called a dispersion plot. 

In [None]:
words = ["citizens","democracy","freedom","America","duties","fear"]

text4.dispersion_plot(words)

Each vertical line represents a word usage. The `offset` is how deep into the corpus it is. Usually that's not so helpful, but for the inaugural corpus (or any corpus that's ordered chronologically), it gives us a time-series view. This corpus runs through Obama's 2009 address. Feel free to play around with some of the other corpora.
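
The offsets behind a plot like this are just token positions. A minimal sketch, assuming a plain list of tokens (the helper name `offsets` and the toy sentence are made up, and no plotting is done here):

```python
def offsets(tokens, word):
    """Return the positions in `tokens` at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Toy token list (made up for illustration).
toy = "we the people ask the people to trust the people".split()
for w in ["people", "trust"]:
    print(w, offsets(toy, w))
```

Each position would become one vertical tick on that word's row of the dispersion plot.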

In [None]:
# in the chat there's much less of that time-series idea.
text5.dispersion_plot([":-)","lol","haha",";-)","hey","hi"])

Pick another corpus and some words that might have an interesting dispersion and make another plot.

In [None]:
# your example here.

---

## Summary Statistics on Text
Now we'll dive into the basics of summary statistics on text. I'll add comments in the code so you can tell what's going on.

In [None]:
print(" : ".join([text1.name,str(len(text1))]))
print(" : ".join([text2.name,str(len(text2))])) 
# the length in "tokens". We'll talk more about those, 
# but think of them as characters we want to group together like "Ishmael", "Dr. Snodhead" and ":-)"

print(sorted(set(text3))[:20]) # the first 20 distinct tokens in Genesis, in sorted order
print(len(set(text1))) # the number of distinct tokens in Moby Dick

# If we wanted to see the average number of times a token is used, across all tokens, we can do this: 
print(len(text1)/len(set(text1))) # this is called lexical diversity

# If we wanted to see how often a word was used, we just do
print(text1.count("whale"))

# and, as a percentage
print(100 * text1.count("the")/len(text1)) # 5% of the words in Moby Dick are "the". Writing is easy! ;-)

Explore some of these summary stats on your own. Find me an interesting usage in the chat data or something. Keep it clean...

Let's write some functions that calculate these quantities.

In [None]:
# Summary Statistic Functions
def lexical_diversity(text) :
    # Write a function that returns the lexical diversity of a text as calculated above.
    pass # your code replaces this line.
    
def token_percentage(text, word) :
    # write a function that takes a text and a word and returns the percentage
    # of words in `text` that are `word`
    pass # your code replaces this line.


In [None]:
print(lexical_diversity(text4))
assert(14.9 < lexical_diversity(text4) < 15) # testing our functions

In [None]:
print(token_percentage(text5,"lol"))
assert(1.43 < token_percentage(text5,"lol") < 1.44) 

---

In NLTK, texts are basically just lists of words. You can see that behavior by writing something like this:

In [None]:
x = 143000 # just picking something that lands us in Obama's address
print(text4[x:(x+20)])
print(" ".join(text4[x:(x+200)]))

So all of our list tricks like slices and whatnot will still work.

---

## Frequency Distributions
One of the most common places to start understanding a corpus is a frequency analysis. NLTK makes that pretty easy.

In [None]:
# FreqDist returns a Counter object, which we learned about a bit in the fall:
# basically a dictionary that's optimized for counting. Let's look at the top 50.
moby_fd = FreqDist(text1)

moby_fd.most_common(50)

What do you notice about these keys? How many of them are informative about the text as you skim that list?

To compare, let's look at S&S:

In [None]:
ss_fd = FreqDist(text2)
ss_fd.most_common(50)

Many of these words are what we call "stopwords". These are words that commonly occur in most writing. More on these below.

Let's plot the cumulative distribution of words in _Moby Dick_.

In [None]:
moby_fd.plot(50, cumulative=True)

In [None]:
len(text1)

Recall that this text has 260K words. So the top 50 account for almost 50%!
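
The arithmetic behind that claim is the combined count of the top 50 tokens divided by the total number of tokens. Here is a small sketch using `collections.Counter` on a toy list. The function name `top_k_coverage` is hypothetical, and the input is assumed to be non-empty.

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent distinct tokens."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(tokens)

# Toy token list (made up for illustration): "the" x3, "and" x2, three singletons.
toy = "the whale and the sea and the ship".split()
print(top_k_coverage(toy, 2))
```

Since NLTK texts act like lists, calling `top_k_coverage(text1, 50)` should give the share covered by the top 50 tokens in _Moby Dick_.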

---

## Finding Specific Words
Thus far we've worked with a corpus in a general way. Now let's find some words that are interesting to us specifically. We'll take advantage of this "list of words" idea. First, let's look at the long words that are used by presidents and compare them to long words used in chat.

In [None]:
# Presidential addresses
print(sorted({w.lower() for w in set(text4) if len(w) > 15}))

In [None]:
# Versus chat rooms. Sigh. Is this the beginning of the end of written discourse? 
print(sorted({w.lower() for w in set(text5) if len(w) > 15}))

A lot of these are probably only used once. Let's get longer words that are used a few times in chat.

In [None]:
# Print all words in the chat corpus that are longer than 
# 7 characters and are used more than 5 times. 

# Your code here

Still not great, but now we're starting to get some signal in here about what people are chatting about.

---

## Bigrams
Bigrams are pairs of words in a text. For instance, here are the bigrams in that previous sentence.

In [None]:
from nltk.util import bigrams
list(bigrams(["bigrams","are","pairs","of","words","in","a","text"])) # wrap in list to get printing
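
Under the hood, a bigram list is just each token paired with its neighbor. A minimal sketch with `zip`, plus a raw count of the pairs (the name `adjacent_pairs` and the toy sentence are made up for illustration):

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

# Toy token list (made up for illustration).
toy = "of the whale and of the sea".split()
pair_counts = Counter(adjacent_pairs(toy))
print(pair_counts.most_common(1))
```

Raw counts like these tend to be dominated by pairs of very common words, which is one reason a collocation finder does more than count.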

Let's look at the popular bigrams. NLTK gives us the method `collocations`, which makes this easy.

In [None]:
text4.collocations()

In [None]:
text8.collocations()

We can extend the idea of counting words to count other things. In the cell below we get the frequency distribution of lengths of words.

In [None]:
word_len_count = FreqDist([len(w) for w in text1])

word_len_count

In [None]:
word_len_count.freq(1) # `.freq` gives us the proportion (between 0 and 1) of tokens with length 1; multiply by 100 for a percentage.
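
As a worked example of what that relative frequency means, here is the same calculation by hand with a plain `Counter` on a toy list (the variable names and the sentence are made up for illustration):

```python
from collections import Counter

# Toy token list (made up for illustration): word lengths are 1, 3, 1, 5.
toy = "I saw a whale".split()

lengths = Counter(len(w) for w in toy)
total = sum(lengths.values())

# share of tokens that are exactly one character long
proportion_len_1 = lengths[1] / total
print(proportion_len_1)
```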

---

## Other Corpora
There are a lot of corpora available in NLTK.

In [None]:
nltk.corpus.gutenberg.fileids()

In [None]:
nltk.corpus.shakespeare.fileids()

---

## Lexical Resources
There are corpora that we may use to generate further analyses. There's one that's just English words:

In [None]:
# Here's a corpus with 236,736 English words
len(nltk.corpus.words.words())

In [None]:
x = nltk.corpus.words.words()

print(x[100200:100210])
print("mark" in x) # you can use this as a rudimentary spell checker
print("makr" in x) # but more on this later.

Michael Lewis recently published a new book, _The Undoing Project_. It's great and you should totally read it. It's about two researchers who uncovered and tested many of the fundamental cognitive biases and heuristics. One of these is the [availability heuristic](https://en.wikipedia.org/wiki/Availability_heuristic). One example they give is the following: do you think there are more words that start with "k" or that have "k" as the third letter? It's much easier for people to think of words that start with a letter than words that have it in another spot.

We can test that here. Let's count the words with letters in each spot.

In [None]:
# how many words in nltk.corpus.words.words() have length greater than three and start with k?

In [None]:
# how many words in text4, the Inaugural Addresses, have length greater than three and start with k?

In [3]:
texts()

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [None]:
# Same question but for text1, Moby Dick.

In [None]:
# how many words in nltk.corpus.words.words() have length greater than three and have k in the third spot?

In [None]:
# Same question, but for text4.

Does the assertion seem to be supported?

Look at some of the words in the list of words from nltk.corpus.words.words() and give me a hypothesis as to what's going on. Test your hypothesis using one of these corpora.

In [None]:
# Let's find words in S&S that are unusual. We'll start with a function
def unusual_words(text):
    '''
        Write a docstring for this function. Google docstrings to see what should go in them.
    '''
    
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return(sorted(unusual))

In [None]:
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')) # You'll see a lot of verb tenses and the like

In [None]:
# There are also stopwords. These are very common words.
list(nltk.corpus.stopwords.words('english'))

In [None]:
# this might be useful to get a freqdist that's not dominated by stopwords and punctuation with the words in lowercase.
ss_fd = FreqDist([w.lower() for w in nltk.corpus.gutenberg.words('austen-sense.txt') 
                  if (w.lower() not in nltk.corpus.stopwords.words("english") 
                      and w.isalpha())])

# This takes a bit of time to run. There are faster ways to do this using sets.
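
One possible version of the faster approach: build the stopword set once, outside the loop, so each membership test is a quick set lookup instead of a fresh call that scans a list. This is a sketch under assumptions. The function name `content_word_counts` and the tiny stand-in stopword list are made up so the example runs without downloading any corpus.

```python
from collections import Counter

def content_word_counts(tokens, stopwords):
    """Count lowercased alphabetic tokens that are not in `stopwords`."""
    stop = set(stopwords)  # built once; set lookups are fast
    return Counter(
        w.lower() for w in tokens
        if w.isalpha() and w.lower() not in stop
    )

# Stand-in data (made up for illustration, not NLTK's stopword list).
toy_stop = ["the", "and", "of"]
toy = ["The", "whale", "and", "the", "sea", ",", "Whale"]
print(content_word_counts(toy, toy_stop).most_common(2))
```

In the notebook you would pass `nltk.corpus.gutenberg.words('austen-sense.txt')` and `nltk.corpus.stopwords.words("english")` as the two arguments.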

In [None]:
ss_fd.most_common(30)

In [None]:
in_fd = FreqDist([w.lower() for w in nltk.corpus.inaugural.words() 
                  if (w.lower() not in nltk.corpus.stopwords.words("english") 
                      and w.isalpha())])
in_fd.most_common(30)

There's a giant corpus of names!

In [None]:
names = nltk.corpus.names
names.fileids()
male_names = names.words('male.txt')
female_names = names.words('female.txt')

In [None]:
x = [w for w in male_names if w in female_names]
x[1:30]

In [None]:
x = [w for w in male_names if w not in female_names]
x[1:30]
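
The list comprehensions above check each male name against the whole female list, which gets slow as the lists grow. Set operations express the same two questions more directly. A small sketch with made-up name lists (the variable names are hypothetical):

```python
# Toy name lists (made up for illustration, not the NLTK names corpus).
toy_male = ["Alex", "Chris", "Omar"]
toy_female = ["Alex", "Chris", "Maria"]

# names that appear in both lists
both = sorted(set(toy_male) & set(toy_female))

# names that appear only in the male list
male_only = sorted(set(toy_male) - set(toy_female))

print(both)
print(male_only)
```

Note that sets drop duplicates and ordering, so the results are sorted here to keep the output stable.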

It's semi-well known that names ending in "a" are almost always female. Let's visualize names by last letter.

In [None]:
fd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))

# This uses some trickery from a function we haven't talked about. 
# Try to figure out ConditionalFreqDist! 
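
As a hint, one way to picture a conditional frequency distribution is as one counter per condition. The sketch below builds that structure from (condition, sample) pairs using only the standard library. The function name `conditional_counts` and the toy name lists are assumptions for illustration. It mimics the idea, not NLTK's actual class.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Toy data (made up for illustration): file id -> list of names.
toy_names = {
    "female.txt": ["Anna", "Maria", "Beth"],
    "male.txt": ["Noah", "Luca", "Omar"],
}

# same shape of generator as in the cell above: (fileid, last letter)
cfd = conditional_counts(
    (fileid, name[-1])
    for fileid, name_list in toy_names.items()
    for name in name_list
)
print(cfd["female.txt"].most_common(1))
```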

In [None]:
fd.plot()

In [None]:
# B is not a popular last letter. Let's look at those
[w for w in female_names if w[-1]=="b"] # change to male_names to see those

---

## WordNet
WordNet gives us meanings and synonyms for many words. We're just going to scratch the surface.

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets('bottle') # there are 5 "synonym sets" for bottle.

In [None]:
wn.synset('bottle.v.01').definition() # change the stuff in quotes to see others.

In [None]:
# Check out set!
wn.synsets('set')

In [None]:
wn.synset('plant.v.01').lemma_names() # WordNet calls the words that belong to a synset its "lemmas"

In [None]:
# Let's look at them all...
for synset in wn.synsets('set'):
    print(synset.lemma_names())

We'll talk more about lexical relations further down the road...

Before we leave this, let's take a look at some other corpora.

In [None]:
# type a period after "corpus" below, hit tab, and check out all the lowercase options--these 
# are corpora. Google one to see what's in it.
nltk.corpus.

---

## Load Your Own
It's fun to play around with corpora from other people, and we'll use these multiple times. But it's also nice to load your own corpus. Let's load in the Twitter descriptions you pulled from one of your files. First, we'll just read in the data normally.

In [None]:
# Change these next two lines based on your data
file_location = "C:\\Users\\jchan\\Dropbox\\Teaching\\2017_Spring\\UnstructuredData\\PreWork\\"
file_name = "20170305_GeneralMills_followers.txt"
#file_name = "20170305_michaelpollan_followers.txt"

descs = []
with open(file_location + file_name,'r') as ifile :
    next(ifile)
    for idx, line in enumerate(ifile.readlines()) :
        line = line.strip().split("\t")
        
        # spot 6 has the description
        if len(line) >= 7 : # sometimes we don't have descriptions
            descs.extend(line[6].split())
        
        # for now we'll just add on to a big list

In [None]:
with open(file_location + file_name,'r') as ifile :
    print(ifile.readline())
    print(ifile.readline())

In [None]:
# how many tokens do we have? (descs holds the individual words from every description, not whole descriptions)
len(descs)

Now let's use some of our techniques on this corpus.

In [None]:
fd = FreqDist(descs)
fd.most_common(10)

Let's get serious about cleaning a list like we did above by writing a function.

In [None]:
stopwords = set(nltk.corpus.stopwords.words("english"))
stopwords_sp = set(nltk.corpus.stopwords.words("spanish"))

def clean_list(text) :
    ''' takes a list of tokens and returns a new list with 
        * words cast to lowercase
        * English and Spanish stopwords removed
        * only alphabetic words (tokens with digits or punctuation are dropped)
    '''
    text_clean = [w.lower() for w in text if w.isalpha()]
    text_clean = [w for w in text_clean if w not in stopwords]
    text_clean = [w for w in text_clean if w not in stopwords_sp]
    return(text_clean)
    

In [None]:
descs_clean_gm = clean_list(descs)

In [None]:
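# descs_clean_mp is not defined above: build it by re-running the load and
# clean cells with the michaelpollan file before running this cell.
fd_mp = FreqDist(descs_clean_mp)
fd_mp.most_common(20)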

In [None]:
fd_gm = FreqDist(descs_clean_gm)
fd_gm.most_common(25)

Now, go do the same for your other file. I'll ask you to report back on the similarities and differences in a bit.