## Text Normalization

Normalizing text is a critical component of text mining and a step we'll take in every single analysis. Eventually it will become second nature. This notebook accompanies the lecture, where we mention six common types of text normalization:

1. Case folding
1. Removing punctuation
1. Handling numbers, dates, and times
1. Extracting special information
1. Removing stopwords
1. Correcting spelling

We'll work through a few examples of most of these, although we'll save spelling correction for another day.
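
As a quick illustration of item 3 in the list above, here is a minimal sketch of one common way to handle numbers: replace each one with a placeholder token so that "1851" and "1811" count as the same thing. The names `NUM_TOKEN`, `NUMBER_RE`, and `replace_numbers` are made up for this example, and the regular expression is an assumption that only covers plain integers and decimals, not dates or times.

```python
import re

# Hypothetical placeholder token; any string that can't be a real word works.
NUM_TOKEN = "<num>"

# Assumed pattern: digits, optional thousands separators, optional decimal part.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def replace_numbers(text):
    """Replace each number in `text` with NUM_TOKEN."""
    return NUMBER_RE.sub(NUM_TOKEN, text)

sample = "Moby Dick appeared in 1851 and Sense and Sensibility in 1811."
normalized = replace_numbers(sample)
print(normalized)
```

Whether to replace numbers, drop them, or keep them depends on the analysis, so treat this as one option rather than a rule.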

In [2]:
import nltk
from nltk.book import *
from collections import Counter
from nltk.corpus import stopwords

from string import punctuation

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### 1. Case Folding

We'll often discover that having a mixture of upper and lower case doesn't serve us very well. Case folding helps us handle this. Let's start by finding all the words that appear in the top 1000 most frequent words in the chat corpus with multiple capitalizations.

In [3]:
chat = text5
chat_count = Counter(chat)

I'll use a dictionary to hold all the words in the top 1000. The key will be the lowercase word and the value will be a list of every word that maps onto that lowercase word. 

In [4]:
case_collisions = dict() # maps each lowercase word to the list of original spellings

for word, count in chat_count.most_common(1000) :
    lc_word = word.lower()
    
    if lc_word not in case_collisions : 
        case_collisions[lc_word] = [word]
    else :
        case_collisions[lc_word].append(word)

In [5]:
for word, wlist in case_collisions.items() :
    if len(wlist) > 1 :
        print(f'The words {",".join(wlist)} map onto {word}') 
        # using the new-ish f strings

The words PART,part map onto part
The words lol,LOL,LoL,Lol map onto lol
The words i,I map onto i
The words the,The map onto the
The words you,You,YOU map onto you
The words a,A map onto a
The words hi,Hi map onto hi
The words me,ME map onto me
The words is,Is map onto is
The words and,And map onto and
The words it,It map onto it
The words that,That map onto that
The words hey,Hey map onto hey
The words my,My map onto my
The words what,What map onto what
The words not,NOT map onto not
The words do,Do map onto do
The words no,No map onto no
The words im,Im map onto im
The words how,How map onto how
The words pm,PM map onto pm
The words lmao,LMAO map onto lmao
The words who,Who map onto who
The words if,If map onto if
The words ok,OK map onto ok
The words am,AM map onto am
The words but,But map onto but
The words this,This map onto this
The words he,He map onto he
The words well,Well map onto well
The words m,M map onto m
The words now,Now map onto now
The words oh,Oh map onto oh
The wor

Now for a slightly easier one: how many times are "the" and "The" used in _Moby Dick_?

In [6]:
Counter(text1)['the']

13721

In [7]:
Counter(text1)['The']

612
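
To see what case folding does to counts like these, here's a minimal sketch on a toy token list (made up for the example, not taken from _Moby Dick_). It compares a `Counter` on the raw tokens with a `Counter` on the lowercased tokens.

```python
from collections import Counter

# Toy token list standing in for a corpus (not real corpus data).
tokens = ["The", "whale", "saw", "the", "ship", "and", "THE", "Whale"]

raw_counts = Counter(tokens)                        # case-sensitive counts
folded_counts = Counter(t.lower() for t in tokens)  # counts after case folding

print(raw_counts["the"])     # only the exact string "the"
print(folded_counts["the"])  # "the" in any capitalization
```

In _Moby Dick_ the same merge would give at least 13721 + 612 = 14333 for "the", plus any other capitalizations such as "THE". Python also has `str.casefold()`, which is a more aggressive version of `lower()` intended for caseless matching; for plain English text the two usually agree.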

### 2. Punctuation

Punctuation can be tricky to handle. The easiest thing is to remove it, but that's not always the best thing to do. To practice playing around with it, count the number of **unique** words that have punctuation in them in _Beowulf_. Print out a few to look at (although there are a lot, so maybe don't print them all).

In [8]:
beowulf = open("beowulf.txt").read()

In [9]:
# Let's grab every word with punctuation. 
# One straightforward way to do this is to make punctuation a 
# set and intersect it with the set of characters in the word. 

punct_set = set(punctuation)
punct_words = set() # since we want uniques

for word in beowulf.split() :
    wset = set(word)
    if punct_set.intersection(wset) :
        punct_words.add(word)
    
print(len(punct_words))

# Let's print 20 or so
print(list(punct_words)[:20])


3478
['flashes,', 'ether-robed', 'bearer.', 'murder,', 'anew.', 'good,', 'spoken,', 'ravens;', 'misery.', "prince's", 'he,', 'bold-hearted', 'asunder,', 'breath,', 'cavern-hall,', 'weary-hearted,', 'ash:', 'added,', 'saw,', '{28e}']


In [10]:
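# While we're here, we can use the `isalnum` string method to test whether a string is alphanumeric. 
# This makes the code much simpler. There are also methods like `isalpha` and `isnumeric`.
# Note that this isn't strictly the same test: any non-alphanumeric character fails `isalnum`,
# not just the characters in `string.punctuation`. On this text both give the same count.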
# https://docs.python.org/3/library/stdtypes.html#str.isalpha
punct_set_2 = set() 

for word in beowulf.split() :
    if not word.isalnum() :
        punct_set_2.add(word)

print(len(punct_set_2))

3478


Lots of that punctuation is at the end of words (e.g., "gallows." and "vain;"). Let's count the number of words that have punctuation in the _middle_ of the word. Let's also throw them in a `Counter` object and look at the most common. 

In [None]:
punct_mid_words = [] # Use a list so we can use a counter. 

for word in beowulf.split() :
    if not word.isalnum() and len(word) > 1:
        # The word contains at least one non-alphanumeric character.
        # Keep it only if neither the first nor the last character
        # is punctuation, so the punctuation must be in the middle.
        if (word[0] not in punctuation and
            word[-1] not in punctuation) :
            punct_mid_words.append(word)


In [None]:
Counter(punct_mid_words).most_common(20)
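
A slightly different take on the same question, as a sketch: strip punctuation from both ends of each word with `str.strip`, then check whether any punctuation is left. The helper names `strip_edge_punct` and `has_internal_punct` are made up for this example. Unlike the loop above, this version still counts a word like "cavern-hall," as having internal punctuation, because the trailing comma is stripped first.

```python
from string import punctuation

def strip_edge_punct(word):
    """Remove punctuation from both ends of `word`, leaving the middle alone."""
    return word.strip(punctuation)

def has_internal_punct(word):
    """True if punctuation remains once the edges have been stripped."""
    core = strip_edge_punct(word)
    return any(ch in punctuation for ch in core)

# Sample tokens copied from the Beowulf output printed earlier.
samples = ["flashes,", "ether-robed", "cavern-hall,", "prince's", "{28e}"]

for w in samples:
    print(w, "->", strip_edge_punct(w), has_internal_punct(w))
```

Which behavior you want depends on whether trailing commas and periods should disqualify a word or simply be ignored.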

### 3. Stopwords

There are many common words that don't help analysis that much (and can take up a lot of space). These are called stopwords. Let's play around with the English stopwords.
1. Load in the English stopwords and assign them to a variable called `sw`. Print them out. Any surprises?
1. Look at the top words in _Moby Dick_ and _Sense and Sensibility_.
1. Look at the top words in both of those that _aren't_ stopwords. 

In [11]:
sw = stopwords.words("english")
sw

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [12]:
Counter(text1).most_common(10)

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982)]

In [13]:
Counter(text2).most_common(10)

[(',', 9397),
 ('to', 4063),
 ('.', 3975),
 ('the', 3861),
 ('of', 3565),
 ('and', 3350),
 ('her', 2436),
 ('a', 2043),
 ('I', 2004),
 ('in', 1904)]

To look at the same stats but without stopwords and non-alpha strings, I'm going to use a list comprehension. If you haven't seen these before, here's a nice [tutorial](https://www.youtube.com/watch?v=AhSvKGTh28Q). 

In [14]:
Counter([w for w in text1 if w.lower() not in sw and w.isalpha()]).most_common(10)

[('whale', 906),
 ('one', 889),
 ('like', 624),
 ('upon', 538),
 ('man', 508),
 ('ship', 507),
 ('Ahab', 501),
 ('ye', 460),
 ('old', 436),
 ('sea', 433)]

In [15]:
Counter([w for w in text2 if w.lower() not in sw and w.isalpha()]).most_common(10)

[('Elinor', 684),
 ('could', 568),
 ('Marianne', 566),
 ('Mrs', 530),
 ('would', 507),
 ('said', 397),
 ('every', 361),
 ('one', 304),
 ('much', 287),
 ('sister', 282)]
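
One small design note on the cells above: `sw` is a list, so every `not in sw` check scans the list. Converting it to a set gives constant-time lookups on average. Here's a minimal sketch of the same filter wrapped in a function. The name `content_counts` and the tiny `STOPWORDS` set are made up so the example runs on its own; in the notebook you would pass `set(sw)` instead.

```python
from collections import Counter

# Tiny hypothetical stopword set; in the notebook this would be set(sw).
STOPWORDS = {"the", "of", "and", "a", "to", "in", "that", "her", "i"}

def content_counts(tokens, stopwords=STOPWORDS):
    """Count alphabetic tokens whose lowercase form is not a stopword."""
    return Counter(
        t for t in tokens
        if t.isalpha() and t.lower() not in stopwords
    )

# Toy token list (not real corpus data).
tokens = ["The", "whale", ",", "the", "ship", "and", "the", "whale", "."]
top = content_counts(tokens).most_common(2)
print(top)
```

Note that, like the cells above, this only lowercases for the stopword check, so "Whale" and "whale" would still be counted separately.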

## Stemming

Stemming is the process by which we move from a token to some "root" of that word. Let's explore one of the stemmers available through NLTK.

First, let's find all the words in the NLTK words corpus that end in "ing", then let's find those that have no vowels before an instance of "ing". You can access the words corpus with the confusing call of `nltk.corpus.words.words()`. To make it easier to deal with "y", let's just consider it a vowel.

In [16]:
words = nltk.corpus.words.words()
vowels = set('aeiouy')

In [17]:
ing_words = [w for w in words if len(w) > 3 and w.endswith("ing")]

In [18]:
len(ing_words)

5557

In [19]:
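# Now let's find the subset that don't have a vowel before the 'ing'.
# The check is case-sensitive, so a word like 'Irving' slips through (capital 'I' isn't in `vowels`).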
ing_no_vowel = []

for word in ing_words :
    remainder = word[:-3]
    if len(set(remainder).intersection(vowels))==0 :
        ing_no_vowel.append(word)
        
ing_no_vowel

['bing',
 'bring',
 'ching',
 'cling',
 'ding',
 'fling',
 'ging',
 'hing',
 'Irving',
 'jing',
 'King',
 'king',
 'Kling',
 'ling',
 'Ming',
 'ming',
 'Ning',
 'Ping',
 'ping',
 'ring',
 'sing',
 'sling',
 'Spring',
 'spring',
 'sting',
 'string',
 'swing',
 'thing',
 'thring',
 'Ting',
 'ting',
 'whing',
 'wing',
 'wring',
 'zing',
 'ring',
 'spring',
 'thing',
 'wing']

Now let's play around with the Porter Stemmer in NLTK. First we'll look at 200 tokens from the inaugural addresses, both stemmed and not stemmed.

In [20]:
porter = nltk.PorterStemmer() # give it a short name.
start = 30000
distance = 200

print(" ".join(text4[start:(start + distance)]))
print("\n\n")
print(" ".join([porter.stem(w) for w in text4[start:(start + distance)]]))



aid of that Almighty Power which has hitherto protected me and enabled me to bring to favorable issues other important but still greatly inferior trusts heretofore confided to me by my country . The broad foundation upon which our Constitution rests being the people -- a breath of theirs having made , as a breath can unmake , change , or modify it -- it can be assigned to none of the great divisions of government but to that of democracy . If such is its theory , those who are called upon to administer it must recognize as its leading principle the duty of shaping their measures so as to produce the greatest good to the greatest number . But with these broad admissions , if we would compare the sovereignty acknowledged to exist in the mass of our people with the power claimed by other sovereignties , even by those which have been considered most purely democratic , we shall find a most essential difference . All others lay claim to power limited only by their own will . The majority of

Now for you: how many unique words are in the inaugural addresses? How many unique lowercase stems are in them?

In [21]:
# unique words (types) in the inaugural addresses
print(len(set(text4)))

9913


In [22]:
inaug_stemmed = {porter.stem(w.lower()) for w in text4}

print(len(inaug_stemmed))

print(len(set(text4))/len(inaug_stemmed))

5562
1.7822725638259618
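
The ratio above says there are roughly 1.78 distinct tokens for every distinct lowercase stem. To make that computation easy to check by hand, here's a minimal sketch on a toy token list. `crude_stem` is a hypothetical toy stemmer written for this example; it is not the Porter algorithm. It strips a suffix only if a vowel remains, echoing the vowel check from the "ing" exercise, which is why "sing" is left alone.

```python
VOWELS = set("aeiouy")  # same simplification as above: treat "y" as a vowel

def crude_stem(word):
    """Toy stemmer (NOT Porter): strip one suffix if a vowel remains before it."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            remainder = word[: -len(suffix)]
            if VOWELS & set(remainder):
                return remainder
    return word

# Toy token list (not real corpus data).
tokens = ["Walk", "walks", "walked", "walking", "talk", "Talks", "sing"]

types = set(tokens)                              # distinct tokens, case-sensitive
stems = {crude_stem(t.lower()) for t in tokens}  # distinct lowercase stems
ratio = len(types) / len(stems)

print(len(types), len(stems), ratio)
```

Here seven types collapse to three stems, so the ratio is 7/3. A real stemmer like Porter applies many more rules, so its stems will differ from this toy version.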


---

Okay, let's have some "fun" and play around with some sets of characters that aren't words. Text 5 is the chat corpus. Find the text emojis (emoticons such as ":)") in there (doesn't have to be perfect) and count up the happy and sad ones.

In [None]:
chat = text5 # give it a nice name. 

# Let's find emojis in chat. 
potential_emojis = {w for w in chat if ":" in w or ";" in w or "=" in w}

In [None]:
potential_emojis

Clearly we're catching some non-emojis, but let's assume we're getting most of the list. 

In [None]:
# Count happy vs sad
happy = [w for w in chat if w in {":-)",":)",":D",";-)","=)"}]
sad = [w for w in chat if w in {":-(",":(",";-(","=("}]

print(len(happy))
print(len(sad))
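
Hard-coding the sets works, but a regular expression can cover more variants at once. Here's a minimal sketch, assuming an emoticon is "eyes" (`:`, `;`, or `=`), an optional "nose" (`-`), and one "mouth" character. The pattern and the names `EMOTICON_RE` and `classify` are assumptions made for this example, and the token list is a toy one rather than the chat corpus.

```python
import re
from collections import Counter

# Assumed pattern: eyes (: ; =), optional nose (-), then one mouth character.
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")

HAPPY_MOUTHS = set(")D")
SAD_MOUTHS = set("(")

def classify(token):
    """Return 'happy', 'sad', 'other', or None if token isn't an emoticon."""
    if not EMOTICON_RE.fullmatch(token):
        return None
    mouth = token[-1]
    if mouth in HAPPY_MOUTHS:
        return "happy"
    if mouth in SAD_MOUTHS:
        return "sad"
    return "other"

# Toy token list (not real chat data).
tokens = [":)", "hello", ":-(", ";-)", "=D", ":P", "10:30", ":(", "lol"]
mood_counts = Counter(c for c in map(classify, tokens) if c is not None)
print(mood_counts)
```

Using `fullmatch` means a token like "10:30" is rejected even though it contains a colon, which removes some of the false positives that the simple "contains `:`" test picks up.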