
# Python clinic day 1: Text processing

Na-Rae Han ([naraehan@pitt.edu](mailto:naraehan@pitt.edu)), 2017-07-12, [Pittsburgh NEH Institute “Make Your Edition”](https://github.com/Pittsburgh-NEH-Institute/Institute-Materials-2017) 

## Preparation

### Data

- This tutorial is available at https://github.com/Pittsburgh-NEH-Institute/Institute-Materials-2017/tree/master/schedule/week_1
- Download and unzip the “C-Span Inaugural Address Corpus”, available on NLTK’s corpora page: http://www.nltk.org/nltk_data/
- Place the unzipped `inaugural` folder **on your desktop** 

### Jupyter tips

- `Shift+ENTER` to run cell, go to next cell
- `Alt+ENTER` to run cell, create a new cell below

More on https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/

## The very basics

### First code

* Printing a string, using `print()`. 

In [None]:
print("hello, world!")

### Strings

* Strings are enclosed in quotation marks (`"` or `'`).
* `+` is the concatenation operator.
* Below, `greet` is a variable name assigned to a string value; note the `Out[]:` label and the absence of quotation marks around the variable name.

In [None]:
greet = "Hello, world!"
greet = greet + " I come in peace."
greet

* String methods such as `.upper()` and `.lower()` transform a string. 
* Unlike `print()`, a string method **returns** a new string value; the original string is left unchanged. 

In [None]:
greet.upper()
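
A minimal sketch of the difference between returning and printing, reusing the `greet` string from above. The names `shouted` and `greet_lower` are introduced here only for illustration.

```python
greet = "Hello, world! I come in peace."

# .upper() returns a new string; it does not modify greet.
shouted = greet.upper()
print(shouted)    # HELLO, WORLD! I COME IN PEACE.
print(greet)      # Hello, world! I come in peace.

# To keep a transformed value, assign it to a variable name.
greet_lower = greet.lower()
print(greet_lower)    # hello, world! i come in peace.
```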

* `len()` returns the length of a string, measured in number of characters. 

In [None]:
len(greet)

### Numbers

* Integers and floats are written without quotes. 
* You can use algebraic operations such as `+`, `-`, `*` and `/` with numbers. 

In [None]:
num1 = 5678
num2 = 3.141592
result = num1 / num2
print(num1, "divided by", num2, "is", result)
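
As a small illustration of how the operators behave in Python 3, the sketch below reuses `num1` and `num2` from the cell above. The operators `//` and `%` and the built-in `round()` are not covered in this section; they are shown only as commonly used companions to `/`.

```python
num1 = 5678
num2 = 3.141592

# "/" always returns a float, even when both numbers are integers.
half = num1 / 2
print(half)          # 2839.0

# "//" is floor division and "%" gives the remainder.
print(num1 // 2)     # 2839
print(num1 % 10)     # 8

# round() trims a float to a given number of decimal places.
print(round(num1 / num2, 2))
```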

### Lists
* Lists are enclosed in `[ ]`, with elements separated by commas. Lists can contain strings, numbers, and more. 
* As with strings, you can use `len()` to get the size of a list. 
* As with strings, you can use `in` to see whether an element is in a list. 

In [None]:
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)

In [None]:
# Try logical operators not, and, or
'mauve' in li
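
One way to try the logical operators mentioned in the comment above, using the same `li` list. This is a sketch of a few combinations, not an exhaustive list.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

print('mauve' in li)                     # False
print('mauve' not in li)                 # True
print('red' in li and 'blue' in li)      # True
print('mauve' in li or 'pink' in li)     # True

# With strings, "in" tests for a substring.
print('ell' in 'hello')                  # True
```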

* A list can be indexed through `li[i]`. Python indexing starts at 0. 
* A list can be sliced: `li[3:5]` returns a sub-list beginning with index 3 up to and not including index 5. 

In [None]:
# Try [0], [2], [-1], [3:5], [3:], [:5]
li[4]
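
The indexing and slicing expressions suggested in the comment above work out as follows for the same `li` list.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

print(li[0])      # red
print(li[2])      # green
print(li[-1])     # pink (negative indexes count from the end)
print(li[3:5])    # ['black', 'white']
print(li[3:])     # ['black', 'white', 'pink']
print(li[:5])     # ['red', 'blue', 'green', 'black', 'white']
```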

### `for` loop
* Using a `for` loop, you can loop through a list of items, applying the same set of operations to each element. 
* The embedded code block is marked with indentation. 

In [None]:
for x in li:
    print(x, "is", len(x), "characters long.")
print("Done!")

### List comprehension
* List comprehension builds a new list from an existing list. 
* You can filter to include only certain elements, and you can apply transformations in the process.
* Try: `.upper()`, `len()`, `+'ish'`

In [None]:
# filter
[x for x in li if x.endswith('e')]

In [None]:
# transform
[x+'ish' for x in li]

In [None]:
# filter and transform
[x.upper() for x in li if len(x)>=5]
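
To see what a list comprehension does, it can help to write the same filter-and-transform step as an ordinary `for` loop. The sketch below assumes the `li` list from above; `loop_result` and `comp_result` are names made up for this comparison.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

# Loop version: start with an empty list and append matching items.
loop_result = []
for x in li:
    if len(x) >= 5:
        loop_result.append(x.upper())
print(loop_result)    # ['GREEN', 'BLACK', 'WHITE']

# Comprehension version: the same filter and transformation in one line.
comp_result = [x.upper() for x in li if len(x) >= 5]
print(loop_result == comp_result)    # True
```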

### Dictionaries
- Dictionaries hold **key:value** mappings. 
- `len()` on a dictionary returns the number of keys. 
- Looping over a dictionary means looping over its keys. 

In [None]:
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Lisa']

In [None]:
# 20 years old or younger. x is bound to keys. 
[x for x in di if di[x] <= 20]

In [None]:
len(di)
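
A short sketch of looping over the keys of the `di` dictionary from above. The `.items()` method, which yields key and value together, is not covered in this section and is shown only as a common alternative; `young` is a name made up for illustration.

```python
di = {'Homer': 35, 'Marge': 35, 'Bart': 10, 'Lisa': 8}

# Looping over a dictionary binds the loop variable to each key.
for name in di:
    print(name, "is", di[name], "years old.")

# "in" checks the keys, not the values.
print('Bart' in di)    # True
print(35 in di)        # False

# .items() yields (key, value) pairs.
young = [name for name, age in di.items() if age <= 20]
print(young)           # ['Bart', 'Lisa']
```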

## Using NLTK

* NLTK ([Natural Language Toolkit](http://www.nltk.org/)) is an external module; you can start using it after importing it. 
* `nltk.word_tokenize()` is a handy tokenizing function, one of hundreds available in NLTK.
* `nltk.word_tokenize()` turns a text (a single string) into a list of word tokens (that is, a list of words, punctuation, etc.). 

In [None]:
import nltk

In [None]:
nltk.word_tokenize(greet)

In [None]:
help(nltk.word_tokenize)

In [None]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)

* `nltk.FreqDist()` is another useful NLTK tool. 
* It builds a frequency count dictionary from a list. 

In [None]:
# First "Rose" is capitalized. How to lowercase? 
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)

In [None]:
freq = nltk.FreqDist(toks)
freq

In [None]:
freq.most_common(3)

In [None]:
freq['rose']

In [None]:
len(freq)
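
One possible answer to the lowercasing question above is to lowercase each token before counting. So that the sketch runs without NLTK installed, it uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which is a subclass of `Counter`), and the `toks` list is typed by hand to mirror the tokens printed above; both are assumptions made for illustration.

```python
from collections import Counter

# Hand-typed stand-in for the output of nltk.word_tokenize(sent).
toks = ['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

# Without lowercasing, 'Rose' and 'rose' are counted separately.
freq_raw = Counter(toks)
print(freq_raw['rose'])      # 3

# Lowercase every token with a list comprehension, then count again.
toks_lower = [t.lower() for t in toks]
freq_lower = Counter(toks_lower)
print(freq_lower['rose'])    # 4
print(len(freq_lower))       # 4
```

With NLTK available, the same `toks_lower` list can be passed to `nltk.FreqDist()` instead. Another option is to lowercase the string first, as in `nltk.word_tokenize(sent.lower())`.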

## Processing a single text file

### Reading in a text file
* `open(filename).read()` reads in the content of a text file as a single continuous string. 

In [None]:
myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt'  # Use your own userid; Mac users should omit C:
wtxt = open(myfile).read()
print(wtxt)
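
A common variant of the file-reading step uses a `with` block, which closes the file automatically once reading is done. The sketch below writes a small throwaway file first so that it runs anywhere; the file name `sample.txt`, its contents, and the `utf-8` encoding are all assumptions for illustration, not properties of the inaugural corpus.

```python
import os
import tempfile

# Create a hypothetical stand-in text file in a temporary folder.
tmpdir = tempfile.mkdtemp()
myfile = os.path.join(tmpdir, 'sample.txt')
with open(myfile, 'w', encoding='utf-8') as f:
    f.write("This is a short sample text.\nIt has two lines.\n")

# Read the whole file as a single string; the file is closed
# automatically when the with block ends.
with open(myfile, encoding='utf-8') as f:
    txt = f.read()

print(txt)
print('sample' in txt)    # True
```

For your own files, replace `myfile` with the path to the text on your desktop and adjust `encoding` if the text does not display correctly.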

In [None]:
len(wtxt)     # Number of characters in text

In [None]:
'fellow citizens' in wtxt  # phrase as a substring. try "Americans"

In [None]:
'th' in wtxt

### Tokenize text, compile frequency count

In [None]:
# Turn off/on pretty printing (prints too many lines)
%pprint    

In [None]:
# Tokenize text
nltk.word_tokenize(wtxt)

In [None]:
wtokens = nltk.word_tokenize(wtxt)
len(wtokens)     # Number of tokens (words and punctuation) in text

In [None]:
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']

In [None]:
wfreq['we']

In [None]:
len(wfreq)      # Number of unique tokens (word types) in text

In [None]:
wfreq.most_common(40)     # 40 most common words

## More tomorrow

- How long are Washington’s sentences on average? 
- Which long words did he use, and how frequent were they? 
- Processing the entire Inaugural Address corpus
    - Which inaugural speech was the longest? The shortest?
    - Which presidents favored long sentences?

All answered in [Python Clinic Day 2: Corpus Processing](Python Clinic Day 2.ipynb)