
# Python Clinic Day 2: Corpus Processing

Na-Rae Han (naraehan@pitt.edu), 07/13/2017, [Pittsburgh NEH Institute "Make Your Edition"](https://github.com/Pittsburgh-NEH-Institute/Institute-Materials-2017) 

# Preparation
- This tutorial is available at https://github.com/Pittsburgh-NEH-Institute/Institute-Materials-2017/tree/master/schedule/week_1
- Download and unzip the "C-Span Inaugural Address Corpus", available on NLTK's corpora page: http://www.nltk.org/nltk_data/
- Place the unzipped "inaugural" folder **on your DESKTOP** 

Jupyter tips:
- Shift+ENTER to run cell, go to next cell
- Alt+ENTER to run cell, create a new cell below

More on https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/

# Review 
- Let's review [what we learned yesterday](Python Clinic Day 1.ipynb). 

# Processing a single text file, continued
### Reading in a text file
* Start with opening up the 1789 Washington address, using `open(filename).read()`. 

In [None]:
myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt'  # Mac users should leave out C:
wtxt = open(myfile).read()
print(wtxt[:500])
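
A slightly more defensive way to read a file, sketched under the assumption that the text is saved as UTF-8. The `with` statement closes the file automatically, and `encoding` is stated explicitly rather than left to the platform default. The file `sample.txt` and its one-line content are stand-ins created so that the snippet runs anywhere.

```python
import os
import tempfile

# Create a small stand-in file (hypothetical name and content)
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('Fellow-Citizens of the Senate and of the House of Representatives:')

# Read it back; the file is closed as soon as the with-block ends
with open(sample_path, encoding='utf-8') as f:
    sample_txt = f.read()

print(sample_txt[:15])
```

The same pattern should work for the inaugural files by swapping in `myfile` for `sample_path`.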

### Tokenize text, compile frequency count

In [None]:
import nltk    # Don't forget to import nltk
%pprint    # Turn off/on pretty printing (prints too many lines)

In [None]:
wtokens = nltk.word_tokenize(wtxt)
len(wtokens)     # Number of tokens (words and punctuation) in text

In [None]:
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']

In [None]:
len(wfreq)      # Number of unique tokens (word types) in text

In [None]:
wfreq.most_common(40)     # 40 most common words
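
`nltk.FreqDist` is built on Python's `collections.Counter`, so the same counting idea can be tried without NLTK. A minimal sketch on a made-up token list (the tokens are illustrative and not taken from the corpus):

```python
from collections import Counter

# Made-up tokens for illustration only
toy_tokens = ['the', 'people', 'of', 'the', 'United', 'States', ',', 'the', 'people', '.']
toy_freq = Counter(toy_tokens)

print(toy_freq['the'])          # count of a single token
print(len(toy_freq))            # number of unique tokens
print(toy_freq.most_common(2))  # two most frequent tokens with their counts
```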

### Average sentence length, frequency of long words

In [None]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!']  # Assuming every sentence ends with ., ! or ?
print(sentcount)

In [None]:
# Tokens include symbols and punctuation. First 50 tokens:
wtokens[:50]

In [None]:
wtokens_nosym = [t for t in wtokens if t.isalnum()]    # alpha-numeric tokens only
len(wtokens_nosym)

In [None]:
# Try "n't", "20th", "."
"n't".isalnum()
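
For reference, a quick sketch of how `str.isalnum()` treats the three suggested strings. It returns `True` only when every character is a letter or a digit, so a token that contains an apostrophe, or that consists only of punctuation, is filtered out:

```python
samples = ["n't", "20th", "."]

# Map each sample string to the result of isalnum()
results = {s: s.isalnum() for s in samples}
for s, ok in results.items():
    print(repr(s), ok)
```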

In [None]:
# First 50 tokens, alpha-numeric tokens only: 
wtokens_nosym[:50]

In [None]:
len(wtokens_nosym)/sentcount     # Average sentence length in number of words
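
The calculation above can be checked by hand on a tiny example. This sketch uses `collections.Counter` in place of `nltk.FreqDist` and a made-up token list, so the numbers are easy to verify; like the cell above, it only estimates the sentence count from sentence-final punctuation.

```python
from collections import Counter

# Made-up tokens: three short "sentences"
demo_tokens = ['We', 'the', 'people', '.', 'Who', 'are', 'we', '?', 'We', 'are', 'many', '!']
demo_freq = Counter(demo_tokens)

# Sentence count estimated from sentence-final punctuation
demo_sentcount = demo_freq['.'] + demo_freq['?'] + demo_freq['!']

# Word count excludes punctuation tokens
demo_words = [t for t in demo_tokens if t.isalnum()]

demo_avg = len(demo_words) / demo_sentcount
print(demo_sentcount, len(demo_words), demo_avg)
```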

In [None]:
[w for w in wfreq if len(w) >= 13]       # all 13+ character words

In [None]:
long = [w for w in wfreq if len(w) >= 13] 
for w in long :
    print(w, len(w), wfreq[w])               # long words tend to be less frequent
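
To make the frequency pattern easier to see, the long words can be sorted by count before printing. A small sketch on made-up tokens, again with `collections.Counter` standing in for `nltk.FreqDist`:

```python
from collections import Counter

# Made-up tokens for illustration only
demo_tokens = ['responsibilities', 'of', 'the', 'administration', 'of', 'the',
               'government', 'and', 'the', 'responsibilities', 'of', 'office']
demo_freq = Counter(demo_tokens)

# Words of 13+ characters, most frequent first
demo_long = sorted((w for w in demo_freq if len(w) >= 13),
                   key=lambda w: demo_freq[w], reverse=True)
for w in demo_long:
    print(w, len(w), demo_freq[w])
```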

# Processing a corpus

- NLTK can read in an entire corpus from a directory (the 'root' directory).
- As it reads in a corpus, it applies word tokenization (shown below) and sentence tokenization (not shown here). 

In [None]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/narae/Desktop/inaugural'  # Mac users should leave out C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 
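
The second argument to `PlaintextCorpusReader` is a regular expression, not a shell wildcard. Below is a minimal sketch of how `'.*txt'` behaves, assuming the pattern has to match the whole file name (approximated here with `re.fullmatch`). Apart from the two corpus file names, the names in the list are hypothetical.

```python
import re

names = ['1789-Washington.txt', '2009-Obama.txt', 'README', 'notes_txt', 'draft.txt.bak']

# '.*txt' accepts any name that ends in the letters "txt"
loose = [n for n in names if re.fullmatch('.*txt', n)]

# r'.*\.txt' additionally requires a literal dot before "txt"
strict = [n for n in names if re.fullmatch(r'.*\.txt', n)]

print(loose)
print(strict)
```

For a folder that contains only `.txt` files the two patterns select the same files; the stricter one matters only when other names happen to end in "txt".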

In [None]:
# .txt file names as file IDs
inaug.fileids()

In [None]:
# NLTK automatically tokenizes the corpus. First 50 words: 
print(inaug.words()[:50])

In [None]:
# You can also specify an individual file ID. Last 50 words from Obama 2009:
print(inaug.words('2009-Obama.txt')[-50:])

In [None]:
# NLTK automatically segments sentences too, which are accessed through .sents()
print(inaug.sents('2009-Obama.txt')[0])   # first sentence
print(inaug.sents('2009-Obama.txt')[1])   # 2nd sentence

In [None]:
# How long are these speeches in terms of word and sentence count?
print('Washington 1789:', len(inaug.words('1789-Washington.txt')), len(inaug.sents('1789-Washington.txt')))
print('Obama 2009:', len(inaug.words('2009-Obama.txt')), len(inaug.sents('2009-Obama.txt')))

In [None]:
# for-loop through file IDs and print out word count. 
for f in inaug.fileids():
    print(len(inaug.words(f)), f)
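
The same loop-over-files idea can be sketched with the standard library alone. This is only a rough stand-in for what the corpus reader does: it splits on whitespace instead of using NLTK's tokenizer, and the two tiny files it creates are made up.

```python
import tempfile
from pathlib import Path

# Build a throwaway "corpus" folder with two made-up files
root = Path(tempfile.mkdtemp())
(root / 'a-First.txt').write_text('We the people .', encoding='utf-8')
(root / 'b-Second.txt').write_text('Ask not what your country can do for you .', encoding='utf-8')

# Collect a whitespace-based token count for every .txt file
word_counts = {}
for path in sorted(root.glob('*.txt')):
    words = path.read_text(encoding='utf-8').split()
    word_counts[path.name] = len(words)

for name, n in word_counts.items():
    print(n, name)
```

Storing the counts in a dictionary, rather than only printing them, makes it possible to sort or compare the files afterwards.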


### Troubleshooting
- Unfortunately, the 2005 Bush file produces a Unicode encoding error. 
- Let's make a new text file from [http://www.presidency.ucsb.edu/inaugurals.php](http://www.presidency.ucsb.edu/inaugurals.php)
- Copy the text and paste it into Notepad (Windows) or TextEdit (Mac). When saving, make sure to choose UTF-8 encoding and not ANSI. 
- The text files are locked, so we will need to save, halt, and then restart the Python notebook. 
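
A Unicode error of this kind usually means that the bytes in the file do not match the encoding used to decode them. As an alternative to re-saving the file, here is a hedged sketch of a reader that tries UTF-8 first and then falls back to another encoding. The helper `read_with_fallback`, the choice of `cp1252` (what Notepad's "ANSI" typically means on Western-locale Windows), and the sample file are all assumptions made for illustration; the actual encoding of the problem file is not known here.

```python
import os
import tempfile

def read_with_fallback(path, encodings=('utf-8', 'cp1252')):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError('none of the encodings worked: ' + ', '.join(encodings))

# A made-up file saved as cp1252, containing a curly apostrophe
demo_path = os.path.join(tempfile.mkdtemp(), 'ansi_sample.txt')
with open(demo_path, 'wb') as f:
    f.write('America\u2019s ideal of freedom'.encode('cp1252'))

text, used = read_with_fallback(demo_path)
print(used, text)
```

`PlaintextCorpusReader` also accepts an `encoding` argument, so once the right encoding is known it can be passed to the reader directly.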

In [None]:
# Corpus size in number of words
print(len(inaug.words()))

In [None]:
# Building word frequency distribution for the entire corpus
inaug_freq = nltk.FreqDist(inaug.words())
inaug_freq.most_common(100)
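
Raw counts treat `The` and `the` as different items and include punctuation tokens. One possible cleanup, sketched here on made-up tokens with `collections.Counter` standing in for `nltk.FreqDist`, is to lowercase the tokens and keep only the alpha-numeric ones before counting:

```python
from collections import Counter

# Made-up tokens for illustration only
raw_tokens = ['The', 'people', ',', 'the', 'Union', ',', 'and', 'the', 'Constitution', '.']

# Lowercase everything and drop punctuation tokens
clean_tokens = [t.lower() for t in raw_tokens if t.isalnum()]
clean_freq = Counter(clean_tokens)

print(clean_freq.most_common(1))
```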

# What next?
Take a Python course. There are many online courses available on Coursera, edX, Udemy, DataCamp, and more. 

[Coursera](http://www.coursera.org), [edX](https://www.edx.org/), [Udemy](https://www.udemy.com/courses/), [DataCamp](https://www.datacamp.com/courses)