This notebook accompanies the lectures on tokenization, normalization, and stemming. 

Before we get started, you're going to need the NLTK book corpus. Here are the steps to install it: 

1. Open a console or command window.
1. Type `python` to start the Python interpreter.
1. Type `import nltk` and hit enter.
1. Type `nltk.download()` and hit enter.
1. This will open a little window. 
1. Click "All Packages" at the top of the list. 
1. Click "Download"

Let me know if you run into any issues!

---

Now we can get started in earnest.

In [1]:
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Tokenization

Tokenization is the process of splitting text into tokens. The simplest tokens are those separated by whitespace. Let's begin by counting the words in a file that I've included in the repo: the text of _Beowulf_. 

1. Read the file into a variable that holds a (large) string. 
1. Look at the first 1000 characters of that string.
1. Split that string on whitespace (spaces, returns, tabs, etc.)
1. Count the number of tokens. 
1. Determine the most common 10 tokens. 
1. Find the tokens that include punctuation within the token. 

In [2]:
# Read in beowulf here
beowulf = open("beowulf.txt").read()

In [3]:
# Display first 1000 characters
beowulf[:1000]

"BEOWULF\n\nBy Anonymous\n\nTranslated by Gummere\n\n\n\n\nBEOWULF\n\n\n\n\nPRELUDE OF THE FOUNDER OF THE DANISH HOUSE\n\n\n\nLO, praise of the prowess of people-kings\nof spear-armed Danes, in days long sped,\nwe have heard, and what honor the athelings won!\nOft Scyld the Scefing from squadroned foes,\nfrom many a tribe, the mead-bench tore,\nawing the earls. Since erst he lay\nfriendless, a foundling, fate repaid him:\nfor he waxed under welkin, in wealth he throve,\ntill before him the folk, both far and near,\nwho house by the whale-path, heard his mandate,\ngave him gifts:  a good king he!\nTo him an heir was afterward born,\na son in his halls, whom heaven sent\nto favor the folk, feeling their woe\nthat erst they had lacked an earl for leader\nso long a while; the Lord endowed him,\nthe Wielder of Wonder, with world's renown.\nFamed was this Beowulf:  {0a} far flew the boast of him,\nson of Scyld, in the Scandian lands.\nSo becomes it a youth to quit him well\nwith his father's

In [4]:
# Split it on whitespace
beo_tokens = beowulf.split() # with no argument, split() splits on any run of whitespace

In [5]:
# Display the first 1000 tokens
beo_tokens[:1000]

['BEOWULF',
 'By',
 'Anonymous',
 'Translated',
 'by',
 'Gummere',
 'BEOWULF',
 'PRELUDE',
 'OF',
 'THE',
 'FOUNDER',
 'OF',
 'THE',
 'DANISH',
 'HOUSE',
 'LO,',
 'praise',
 'of',
 'the',
 'prowess',
 'of',
 'people-kings',
 'of',
 'spear-armed',
 'Danes,',
 'in',
 'days',
 'long',
 'sped,',
 'we',
 'have',
 'heard,',
 'and',
 'what',
 'honor',
 'the',
 'athelings',
 'won!',
 'Oft',
 'Scyld',
 'the',
 'Scefing',
 'from',
 'squadroned',
 'foes,',
 'from',
 'many',
 'a',
 'tribe,',
 'the',
 'mead-bench',
 'tore,',
 'awing',
 'the',
 'earls.',
 'Since',
 'erst',
 'he',
 'lay',
 'friendless,',
 'a',
 'foundling,',
 'fate',
 'repaid',
 'him:',
 'for',
 'he',
 'waxed',
 'under',
 'welkin,',
 'in',
 'wealth',
 'he',
 'throve,',
 'till',
 'before',
 'him',
 'the',
 'folk,',
 'both',
 'far',
 'and',
 'near,',
 'who',
 'house',
 'by',
 'the',
 'whale-path,',
 'heard',
 'his',
 'mandate,',
 'gave',
 'him',
 'gifts:',
 'a',
 'good',
 'king',
 'he!',
 'To',
 'him',
 'an',
 'heir',
 'was',
 'afterwa

In [6]:
# Calculate the number of tokens
len(beo_tokens)

26116
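
That count depends on how `split()` treats whitespace. Here's a small sketch of the difference between calling it with no argument and with an explicit space. The `sample` string is made up for illustration and is not taken from `beowulf.txt`.

```python
# A small made-up sample (not from beowulf.txt) with mixed whitespace.
sample = "LO,  praise of\tthe prowess\n\nof people-kings\n"

# No argument: split on any run of whitespace and drop empty strings.
tokens_default = sample.split()

# Explicit single space: every space is a separator, so a run of two
# spaces yields an empty string, and tabs/newlines stay inside tokens.
tokens_space = sample.split(" ")

print(tokens_default)
print(tokens_space)
print(len(tokens_default), len(tokens_space))
```

The no-argument form is usually what you want for a first pass at whitespace tokenization, since it never produces empty tokens.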

## Counting elements

There are many ways to count things. The handiest is the `Counter` data type. I'll illustrate its use and show you how you could do the same thing with a dictionary.

***First, the Counter version***

In [7]:
from collections import Counter

In [8]:
# Determine the 10 most common tokens
# Hint: check out the Counter object in the collections library
beo_counter = Counter(beo_tokens) # creates a Counter object called beo_counter
beo_counter.most_common(10) # calls the method "most_common()" on the Counter object

[('the', 1701),
 ('of', 1032),
 ('and', 689),
 ('to', 531),
 ('in', 452),
 ('his', 428),
 ('that', 322),
 ('he', 312),
 ('with', 286),
 ('was', 240)]
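
Here's a minimal sketch of a few `Counter` behaviors that are handy to know. The `tokens` list is a made-up sample, not the Beowulf data.

```python
from collections import Counter

# A short made-up token list, used only for illustration.
tokens = ["the", "of", "the", "and", "the", "of"]
counts = Counter(tokens)

print(isinstance(counts, dict))   # Counter is a subclass of dict
print(counts["the"])              # look up a count like a dictionary value
print(counts["dragon"])           # a missing key gives 0 instead of a KeyError
print(counts.most_common(2))      # the two highest counts, largest first
```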

A `Counter` is really just a dictionary, where the keys are the elements of the list that you fed in and the values are the integer counts. Here's how you'd do the same thing with a plain dictionary. Notice how much more work it is, including the awkward construction needed to sort the dictionary by value for reading out.

***Second, the 'dictionary' version***

In [23]:
beo_dict = dict()

for t in beo_tokens : # 't' is each token in 'beo_tokens'
    
    # Have to create the spot in the dictionary if it's NOT in there.
    if t not in beo_dict :
        beo_dict[t] = 0  # adds the token 't' as a key with an initial value of 0
    
    # And now increment the count
    beo_dict[t] += 1  # the first time we see 't', this takes the count from 0 to 1
    
# After the loop, beo_dict has one key per unique token,
# and each value is the number of times that token appeared.
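
If the "create the key, then increment" pattern feels clunky, here are two other ways to write the same loop, sketched on a made-up `tokens` list. Both `dict.get` and `collections.defaultdict` are part of standard Python; the variable names are just for illustration.

```python
from collections import Counter, defaultdict

tokens = ["the", "of", "the", "and", "the", "of"]  # made-up sample

# Option 1: dict.get returns a default (0 here) when the key is missing.
counts_get = {}
for t in tokens:
    counts_get[t] = counts_get.get(t, 0) + 1

# Option 2: defaultdict(int) creates a missing entry as 0 on first access.
counts_dd = defaultdict(int)
for t in tokens:
    counts_dd[t] += 1

print(counts_get)
print(dict(counts_dd))

# All three approaches agree on the counts.
print(counts_get == dict(counts_dd) == dict(Counter(tokens)))
```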

In [27]:
beo_dict

{'BEOWULF': 6,
 'By': 7,
 'Anonymous': 1,
 'Translated': 1,
 'by': 147,
 'Gummere': 1,
 'PRELUDE': 1,
 'OF': 2,
 'THE': 3,
 'FOUNDER': 1,
 'DANISH': 1,
 'HOUSE': 1,
 'LO,': 1,
 'praise': 2,
 'of': 1032,
 'the': 1701,
 'prowess': 4,
 'people-kings': 1,
 'spear-armed': 1,
 'Danes,': 11,
 'in': 452,
 'days': 17,
 'long': 26,
 'sped,': 1,
 'we': 27,
 'have': 39,
 'heard,': 7,
 'and': 689,
 'what': 19,
 'honor': 8,
 'athelings': 6,
 'won!': 1,
 'Oft': 4,
 'Scyld': 2,
 'Scefing': 1,
 'from': 148,
 'squadroned': 1,
 'foes,': 6,
 'many': 39,
 'a': 216,
 'tribe,': 1,
 'mead-bench': 3,
 'tore,': 1,
 'awing': 1,
 'earls.': 3,
 'Since': 2,
 'erst': 12,
 'he': 312,
 'lay': 16,
 'friendless,': 3,
 'foundling,': 1,
 'fate': 7,
 'repaid': 5,
 'him:': 1,
 'for': 215,
 'waxed': 2,
 'under': 25,
 'welkin,': 3,
 'wealth': 9,
 'throve,': 1,
 'till': 34,
 'before': 5,
 'him': 120,
 'folk,': 12,
 'both': 13,
 'far': 35,
 'near,': 8,
 'who': 78,
 'house': 13,
 'whale-path,': 1,
 'heard': 21,
 'his': 428,
 'ma

In [11]:
# Printing out the top 10 is pretty tricky. Do the work to understand what's happening below:
#   beo_dict.items() yields (token, count) pairs, e.g. ('BEOWULF', 6).
#   sorted() sorts in ascending order of whatever the 'key' function returns.
#   The lambda is a one-line, unnamed function (vs. a named 'def'); it returns
#   -1 * item[1], the negated count, so the largest counts sort first.
#   Only the sort key is negated; the counts we print are unchanged.

num_printed = 0  # how many token/count pairs we've printed so far

for token, count in sorted(beo_dict.items(), key=lambda item: -1*item[1]) :
    print(token + " had " + str(count) + " instances.")
    num_printed += 1
    
    if num_printed == 10 : # stop once the top 10 have been printed
        break

the had 1701 instances.
of had 1032 instances.
and had 689 instances.
to had 531 instances.
in had 452 instances.
his had 428 instances.
that had 322 instances.
he had 312 instances.
with had 286 instances.
was had 240 instances.
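
Here's a sketch of another way to get the same kind of top-N readout, using `reverse=True` and a slice instead of negating the counts and breaking out of the loop. The `word_counts` dictionary is a made-up stand-in for `beo_dict`.

```python
from collections import Counter

# Made-up counts, standing in for beo_dict.
word_counts = {"the": 5, "of": 3, "and": 2, "to": 4}

# Sort the (token, count) pairs by count, largest first, then keep the top 3.
top_three = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)[:3]

for token, count in top_three:
    print(f"{token} had {count} instances.")

# For this sample, Counter gives the same answer in one call.
print(top_three == Counter(word_counts).most_common(3))
```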


## Punctuation

### Now let's count the number of tokens that have punctuation in them. 

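A ***set*** is an ***unordered*** and ***mutable*** collection of unique elements. Sets are written with curly brackets ({}), with the elements separated by commas. The elements themselves must be hashable, so a list cannot be a member of a set.
- A set:    numbers = {1, 2, 3, 4}
- NOT a set:    numbers = {[1, 2], 3, 4}  (raises a TypeError, because [1, 2] is a list)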
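
Here's a quick sketch of what happens when you try the "NOT a set" example, and one way around it using a tuple. The variable name `error_message` is just for illustration.

```python
# Lists are mutable and therefore unhashable, so they cannot be set elements.
try:
    numbers = {[1, 2], 3, 4}
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# A tuple is immutable and hashable, so this version works.
numbers = {(1, 2), 3, 4}
print((1, 2) in numbers)
```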

***Sets*** can also be defined with the built-in function set([iterable]). This function takes an iterable as its argument (i.e. any type of sequence, collection, or iterator) and returns a set that contains the ***unique*** items from the input (duplicated values are removed). Because sets are unordered, the elements may display in a different order than shown here.
- a ***string***:  set('Amanda') ***=>***  {'A', 'a', 'd', 'm', 'n'}
- a ***tuple***: set(('Madrid', 'Valencia', 'Munich')) ***=>*** {'Madrid', 'Munich', 'Valencia'}
- a ***dictionary***: set({'hydrogen': 1, 'helium': 2, 'carbon': 6, 'oxygen': 8}) ***=>*** {'carbon', 'helium', 'hydrogen', 'oxygen'}
- a ***list***: set(['Madrid', 'Valencia', 'Munich', 'Munich']) ***=>*** {'Madrid', 'Munich', 'Valencia'}


### Add an element to a set
cities = {'Madrid', 'Valencia', 'Barcelona'}

cities.add('Munich') ***=>*** print(cities) ***=>***  {'Valencia', 'Barcelona', 'Munich', 'Madrid'} (the printed order may vary)

### Resource:  https://towardsdatascience.com/10-things-you-should-know-about-sets-in-python-9902828c0e80

In [12]:
# Hint: the string library has an object with all the punctuation marks in it
from string import punctuation
for x in punctuation:
    print("[" + x + "]")
# https://www.codespeedy.com/python-string-punctuation-get-all-sets-of-punctuation/

[!]
["]
[#]
[$]
[%]
[&]
[']
[(]
[)]
[*]
[+]
[,]
[-]
[.]
[/]
[:]
[;]
[<]
[=]
[>]
[?]
[@]
[[]
[\]
[]]
[^]
[_]
[`]
[{]
[|]
[}]
[~]


### testing "set.intersection()"

In [13]:
x = {"apple", "banana", "cherry"}
y = {"google", "microsoft", "apple"}

z = x.intersection(y)

print(z)

{'apple'}
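
A couple of related set tools, sketched on the same `x` and `y` as above: the `&` operator, `isdisjoint()`, and passing a non-set iterable to `intersection()`. The name `shared` is just for illustration.

```python
x = {"apple", "banana", "cherry"}
y = {"google", "microsoft", "apple"}

print(x & y)             # same result as x.intersection(y)
print(x.isdisjoint(y))   # False, because the sets share at least one element

# intersection() accepts any iterable, not only sets.
shared = x.intersection(["cherry", "kiwi"])
print(shared)
```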


In [28]:
# TEST: print w_set for the first 10 tokens
num_printed = 0
punct_set = set(punctuation) # the punctuation characters from the string library, as a set

for w in beo_tokens :  # 'w' is each token in beo_tokens
    w_set = set(w)     # the unique characters in the token
    print(w_set)
    num_printed += 1 
    
    if num_printed == 10 : 
        break

{'W', 'E', 'F', 'O', 'U', 'L', 'B'}
{'B', 'y'}
{'u', 'm', 'n', 'y', 's', 'A', 'o'}
{'r', 't', 'd', 'l', 'T', 'e', 'n', 's', 'a'}
{'b', 'y'}
{'u', 'r', 'm', 'G', 'e'}
{'W', 'E', 'F', 'O', 'U', 'L', 'B'}
{'R', 'E', 'P', 'U', 'D', 'L'}
{'O', 'F'}
{'H', 'E', 'T'}


In [15]:
# TEST: print the size of the overlap for the first 10 tokens
num_printed = 0
punct_set = set(punctuation) # the punctuation characters from the string library, as a set

for w in beo_tokens :  # 'w' is each token in beo_tokens
    w_set = set(w)     # the unique characters in the token
    overlap = w_set.intersection(punct_set)  # punctuation characters found in the token
    print(len(overlap)) 
    num_printed += 1 
    
    if num_printed == 10 : 
        break

0
0
0
0
0
0
0
0
0
0


In [16]:
# We'll use a set trick, so need punctuation in a set
punct_set = set(punctuation) # the punctuation characters from the string library, as a set

beo_tokens_punct = []  # will hold every token that contains punctuation

for w in beo_tokens :  # 'w' is each token in beo_tokens (a list, not a dict)
    w_set = set(w)   # the unique characters in the token
    overlap = w_set.intersection(punct_set) # punctuation characters found in the token
    
    if len(overlap) > 0 :  # true if the token contains ANY punctuation
        beo_tokens_punct.append(w)  # add this token to the list "beo_tokens_punct"

        
#print(beo_tokens_punct[:100])
print(len(beo_tokens_punct))

5921
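
Here's a more compact sketch of the same filter, written as a list comprehension. It runs on a made-up `sample_tokens` list in the style of the Beowulf tokens rather than on the full text.

```python
from string import punctuation

punct_set = set(punctuation)

# Made-up sample tokens in the style of the Beowulf output above.
sample_tokens = ["LO,", "praise", "people-kings", "won!", "the", "--"]

# Keep a token if at least one of its characters is a punctuation mark.
with_punct = [w for w in sample_tokens if any(ch in punct_set for ch in w)]

# Equivalent test using sets: "not disjoint" means they share a character.
with_punct_sets = [w for w in sample_tokens if not punct_set.isdisjoint(w)]

print(with_punct)
print(with_punct == with_punct_sets)
```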


As you look at those tokens with punctuation, what do you notice? 
- Lots of commas just stuck onto words.
- Some tokens that are just punctuation (e.g., "--").
- Lots of capitals that might not be what we want in our tokenization.
- A lot of tokens with punctuation.
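
One possible way to deal with the first three observations is to trim punctuation off the ends of each token and lowercase it. This is only a sketch on a made-up `sample_tokens` list, and it is an assumption about how you might normalize, not the only reasonable choice. Note that `strip()` only removes leading and trailing characters, so hyphens inside a word are kept.

```python
from string import punctuation

# Made-up sample tokens in the style of the Beowulf output above.
sample_tokens = ["LO,", "praise", "people-kings", "Danes,", "won!", "--", "him:"]

normalized = []
for w in sample_tokens:
    cleaned = w.strip(punctuation).lower()  # trims only leading/trailing marks
    if cleaned:                             # skip tokens that were all punctuation
        normalized.append(cleaned)

print(normalized)
```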

What fraction of tokens contain punctuation?

In [17]:
len(beo_tokens_punct)/len(beo_tokens)

0.2267192525654771

### Tokenization Second Exercise

Now let's try working with some NLTK data. Count the words (or, more precisely, tokens) in one of the first three books included in the book corpus (_Moby Dick_, _Sense and Sensibility_, and The Book of Genesis from the _Bible_).

1. Pick one of the texts and assign it to a new variable. It'll have a name like `text1` before you assign it. That variable was created when we imported everything from `nltk.book`.
1. Look at the structure of the variable.
1. Count the tokens as above. Use the `Counter` object.
1. Display the 10 most common tokens.

In [18]:
# Assign the text to the new variable (here text6, Monty Python and the Holy Grail).
shrubbery = text6

In [19]:
shrubbery[:1000]

['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa',
 'there',
 '!',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'SOLDIER',
 '#',
 '1',
 ':',
 'Halt',
 '!',
 'Who',
 'goes',
 'there',
 '?',
 'ARTHUR',
 ':',
 'It',
 'is',
 'I',
 ',',
 'Arthur',
 ',',
 'son',
 'of',
 'Uther',
 'Pendragon',
 ',',
 'from',
 'the',
 'castle',
 'of',
 'Camelot',
 '.',
 'King',
 'of',
 'the',
 'Britons',
 ',',
 'defeator',
 'of',
 'the',
 'Saxons',
 ',',
 'sovereign',
 'of',
 'all',
 'England',
 '!',
 'SOLDIER',
 '#',
 '1',
 ':',
 'Pull',
 'the',
 'other',
 'one',
 '!',
 'ARTHUR',
 ':',
 'I',
 'am',
 ',',
 '...',
 'and',
 'this',
 'is',
 'my',
 'trusty',
 'servant',
 'Patsy',
 '.',
 'We',
 'have',
 'ridden',
 'the',
 'length',
 'and',
 'breadth',
 'of',
 'the',
 'land',
 'in',
 'search',
 'of',
 'knights',
 'who',
 'will',
 'join',
 'me',
 'in',
 'my',
 'court',
 'at',
 'Camelot',
 '.',
 'I',
 'must',
 'speak',
 'with',
 'your',
 'lord',
 'and',
 'ma

In [20]:
# Count the tokens
len(shrubbery)

16967

In [21]:
# Display the 10 most common tokens.
shrub_counter = Counter(shrubbery) # creates a Counter object called shrub_counter
shrub_counter.most_common(10)


[(':', 1197),
 ('.', 816),
 ('!', 801),
 (',', 731),
 ("'", 421),
 ('[', 319),
 (']', 312),
 ('the', 299),
 ('I', 255),
 ('ARTHUR', 225)]
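
Since this text is already tokenized with punctuation as separate tokens, the punctuation marks dominate the top of the list. Here's a sketch of filtering them out before counting. The `sample` list is made up in the style of the output above, and the subset test is one possible filter, not the only one.

```python
from collections import Counter
from string import punctuation

punct_set = set(punctuation)

# Made-up sample in the style of the pre-tokenized text shown above.
sample = ["ARTHUR", ":", "It", "is", "I", ",", "Arthur", ",", "son", "of",
          "Uther", "Pendragon", ",", "...", "King", "of", "the", "Britons", "!"]

# Drop tokens made up entirely of punctuation characters:
# set(t) <= punct_set is True when every character of t is punctuation.
word_tokens = [t for t in sample if not set(t) <= punct_set]

print(Counter(sample).most_common(1))
print(Counter(word_tokens).most_common(1))
```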

We pull in nine corpora when we load the books, but there are a ton of books we get with NLTK. Here are the ones that come from Project Gutenberg.

In [22]:
for book in nltk.corpus.gutenberg.fileids() :
    # fileids() Return a list of file identifiers for the files that make up
    # this corpus, or that make up the given category(s) if specified.
    print(book)

austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
bryant-stories.txt
burgess-busterbrown.txt
carroll-alice.txt
chesterton-ball.txt
chesterton-brown.txt
chesterton-thursday.txt
edgeworth-parents.txt
melville-moby_dick.txt
milton-paradise.txt
shakespeare-caesar.txt
shakespeare-hamlet.txt
shakespeare-macbeth.txt
whitman-leaves.txt
