## Text Normalization

Normalizing text is a critical component of text mining and a step we'll take in every single analysis. With practice, it becomes basically second nature. This notebook accompanies the lecture, where we mention six common types of text normalization:

1. Case folding
1. Removing punctuation
1. Handling numbers, dates, and times
1. Extracting special information
1. Removing stopwords
1. Correcting spelling

We'll work through a few examples of most of these, although we'll save spelling correction for another day.

In [1]:
import nltk
from nltk.book import *
from collections import Counter
from nltk.corpus import stopwords

from string import punctuation

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### 1. Case Folding

We'll often discover that having a mixture of upper and lower case doesn't serve us very well. Case folding helps us handle this. Let's start by finding all the words among the 1000 most frequent tokens in _Monty Python and the Holy Grail_ (`text6`, stored below under the name `chat`) that appear with multiple capitalizations.

In [2]:
chat = text6
chat_count = Counter(chat)

In [3]:
chat_count

Counter({'SCENE': 24,
         '1': 76,
         ':': 1197,
         '[': 319,
         'wind': 3,
         ']': 312,
         'clop': 39,
         'KING': 1,
         'ARTHUR': 225,
         'Whoa': 1,
         'there': 25,
         '!': 801,
         'SOLDIER': 24,
         '#': 127,
         'Halt': 3,
         'Who': 25,
         'goes': 1,
         '?': 207,
         'It': 35,
         'is': 106,
         'I': 255,
         ',': 731,
         'Arthur': 36,
         'son': 5,
         'of': 158,
         'Uther': 1,
         'Pendragon': 1,
         'from': 20,
         'the': 299,
         'castle': 18,
         'Camelot': 26,
         '.': 816,
         'King': 27,
         'Britons': 11,
         'defeator': 1,
         'Saxons': 1,
         'sovereign': 1,
         'all': 30,
         'England': 2,
         'Pull': 2,
         'other': 5,
         'one': 32,
         'am': 22,
         '...': 118,
         'and': 135,
         'this': 59,
         'my': 38,
         'trusty': 1

In [4]:
chat_count.most_common(1000)

[(':', 1197),
 ('.', 816),
 ('!', 801),
 (',', 731),
 ("'", 421),
 ('[', 319),
 (']', 312),
 ('the', 299),
 ('I', 255),
 ('ARTHUR', 225),
 ('?', 207),
 ('you', 204),
 ('a', 188),
 ('of', 158),
 ('--', 148),
 ('to', 144),
 ('s', 141),
 ('and', 135),
 ('#', 127),
 ('...', 118),
 ('Oh', 110),
 ('it', 107),
 ('is', 106),
 ('-', 88),
 ('in', 86),
 ('that', 84),
 ('t', 77),
 ('1', 76),
 ('No', 76),
 ('LAUNCELOT', 76),
 ('your', 75),
 ('not', 70),
 ('GALAHAD', 69),
 ('KNIGHT', 68),
 ('What', 65),
 ('FATHER', 63),
 ('we', 62),
 ('You', 61),
 ('BEDEVERE', 61),
 ('We', 60),
 ('this', 59),
 ('no', 55),
 ('Well', 54),
 ('HEAD', 54),
 ('have', 53),
 ('GUARD', 53),
 ('are', 52),
 ('Sir', 52),
 ('A', 50),
 ('And', 50),
 ('on', 47),
 ('VILLAGER', 47),
 ('Ni', 47),
 ('me', 46),
 ('He', 46),
 ('boom', 45),
 ('be', 43),
 ('he', 43),
 ('Yes', 42),
 ('2', 42),
 ('ha', 42),
 ('re', 41),
 ('her', 40),
 ('clop', 39),
 ('ROBIN', 39),
 ('my', 38),
 ('with', 38),
 ('away', 38),
 ('witch', 37),
 ('KNIGHTS', 37),


Now for a slightly easier one: how many times are "the" and "The" used in _Moby Dick_?

In [5]:
Counter(text1)['the']

13721

In [6]:
Counter(text1)['The']

612
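
Case folding collapses those two counts into a single entry. Here is a minimal sketch on a made-up token list (the `tokens` list is an illustration, not data from `text1`):

```python
from collections import Counter

# Hypothetical tokens for illustration; in the notebook these would come from text1.
tokens = ["The", "whale", "and", "the", "sea", "THE", "Whale"]

raw_counts = Counter(tokens)                        # case-sensitive counts
folded_counts = Counter(t.lower() for t in tokens)  # counts after case folding

print(raw_counts["the"], raw_counts["The"])  # the two spellings are counted separately
print(folded_counts["the"])                  # all spellings are counted together
```

For text that isn't English, `str.casefold()` is a more aggressive alternative to `str.lower()`.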

I'll use a ***dictionary*** to hold all the words in the top 1000. The ***key*** will be the lowercase word and the ***value*** will be a ***list*** of every word that maps onto that lowercase word.

In [7]:
case_collisions = dict() # a dictionary mapping each lowercase form to the spellings seen for it

for word, count in chat_count.most_common(1000) : 
    # most_common(1000) yields (token, count) tuples for the 1000 most frequent tokens
    # 'word' is the token and 'count' is how often it appears (count is unused here)
    lc_word = word.lower() # lowercase version of the token, used as the dictionary KEY
    # e.g., word = 'ARTHUR' gives lc_word = 'arthur'
    
    if lc_word not in case_collisions : # first time we see this lowercase form...
        case_collisions[lc_word] = [word] # start its list:  'arthur' : ['ARTHUR']
    else : # e.g., the next word is 'Arthur', so 'arthur' is already a key...
        case_collisions[lc_word].append(word) # the list becomes ['ARTHUR', 'Arthur']
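
The same grouping can probably be written more compactly with `collections.defaultdict`, which removes the need for the `if`/`else`. This is a sketch under the assumption that a short list of made-up `(token, count)` pairs stands in for `chat_count.most_common(1000)`:

```python
from collections import defaultdict

# Hypothetical (token, count) pairs standing in for chat_count.most_common(1000).
pairs = [("ARTHUR", 225), ("the", 299), ("Arthur", 36), ("The", 10)]

collisions = defaultdict(list)  # a missing key starts out as an empty list
for word, _count in pairs:
    collisions[word.lower()].append(word)

print(dict(collisions))
```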

In [8]:
case_collisions

{':': [':'],
 '.': ['.'],
 '!': ['!'],
 ',': [','],
 "'": ["'"],
 '[': ['['],
 ']': [']'],
 'the': ['the', 'The', 'THE'],
 'i': ['I', 'i'],
 'arthur': ['ARTHUR', 'Arthur'],
 '?': ['?'],
 'you': ['you', 'You'],
 'a': ['a', 'A'],
 'of': ['of', 'OF', 'Of'],
 '--': ['--'],
 'to': ['to', 'To'],
 's': ['s', 'S'],
 'and': ['and', 'And'],
 '#': ['#'],
 '...': ['...'],
 'oh': ['Oh', 'oh'],
 'it': ['it', 'It'],
 'is': ['is', 'Is'],
 '-': ['-'],
 'in': ['in', 'In'],
 'that': ['that', 'That'],
 't': ['t'],
 '1': ['1'],
 'no': ['No', 'no'],
 'launcelot': ['LAUNCELOT', 'Launcelot'],
 'your': ['your'],
 'not': ['not', 'Not'],
 'galahad': ['GALAHAD', 'Galahad'],
 'knight': ['KNIGHT', 'Knight', 'knight'],
 'what': ['What', 'what'],
 'father': ['FATHER', 'father', 'Father'],
 'we': ['we', 'We'],
 'bedevere': ['BEDEVERE', 'Bedevere'],
 'this': ['this', 'This'],
 'well': ['Well', 'well'],
 'head': ['HEAD', 'head'],
 'have': ['have', 'Have'],
 'guard': ['GUARD', 'guard'],
 'are': ['are', 'Are'],
 'sir': ['

***F-strings*** provide a way to embed expressions inside string literals using a minimal syntax. In Python source code, an f-string is a string literal prefixed with `f` that contains expressions inside braces. Note that an f-string is really an expression evaluated at run time, not a constant value.
- general form: `ACTION(f'some_opening_text {"separator".join(VALUE_name)} map onto {KEY_name}')`
- ex: `print(f'The words {",".join(wlist)} map onto {word}')`
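
A small sketch of the behavior described above, using made-up values for `word`, `wlist`, and `count`:

```python
word = "knight"
wlist = ["KNIGHT", "Knight", "knight"]

# Expressions inside the braces are evaluated when the line runs.
message = f'The words {", ".join(wlist)} map onto {word}'
print(message)

# Without the f prefix, the braces are ordinary characters.
plain = 'The words {", ".join(wlist)} map onto {word}'
print(plain)

# A format spec can follow a colon, e.g. a thousands separator for a count.
count = 13721
print(f"'the' appears {count:,} times")
```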

In [9]:
for word, wlist in case_collisions.items() :
    # 'word' is the KEY (lowercase form) and 'wlist' is the VALUE (a LIST of spellings)
    # for each key:value pair in case_collisions
    if len(wlist) > 1 : # i.e., more than one spelling maps onto this lowercase form
        print(f'The words {",".join(wlist)} map onto {word}') # using the new-ish f-strings
        # try a different separator:
        #print(f'The words {"?".join(wlist)} map onto {word}')
        # without the f prefix, the braces are printed literally:
        #print('The words {"?".join(wlist)} map onto {word}')

The words the,The,THE map onto the
The words I,i map onto i
The words ARTHUR,Arthur map onto arthur
The words you,You map onto you
The words a,A map onto a
The words of,OF,Of map onto of
The words to,To map onto to
The words s,S map onto s
The words and,And map onto and
The words Oh,oh map onto oh
The words it,It map onto it
The words is,Is map onto is
The words in,In map onto in
The words that,That map onto that
The words No,no map onto no
The words LAUNCELOT,Launcelot map onto launcelot
The words not,Not map onto not
The words GALAHAD,Galahad map onto galahad
The words KNIGHT,Knight,knight map onto knight
The words What,what map onto what
The words FATHER,father,Father map onto father
The words we,We map onto we
The words BEDEVERE,Bedevere map onto bedevere
The words this,This map onto this
The words Well,well map onto well
The words HEAD,head map onto head
The words have,Have map onto have
The words GUARD,guard map onto guard
The words are,Are map onto are
The words Sir,sir,SIR map 

In [10]:
# tweak the sentence ordering (swapping positions of KEY and VALUE references)
for word, wlist in case_collisions.items() :
    # 'word' reflects the KEY and 'wlist' the VALUE (a LIST) 
    # for each ordered pair in case_collisions pairs 
    if len(wlist) > 1 : # ie, if the VALUE(list) contents for any record > 1 element
        print(f'The word {word} maps with {"=>".join(wlist)}')

The word the maps with the=>The=>THE
The word i maps with I=>i
The word arthur maps with ARTHUR=>Arthur
The word you maps with you=>You
The word a maps with a=>A
The word of maps with of=>OF=>Of
The word to maps with to=>To
The word s maps with s=>S
The word and maps with and=>And
The word oh maps with Oh=>oh
The word it maps with it=>It
The word is maps with is=>Is
The word in maps with in=>In
The word that maps with that=>That
The word no maps with No=>no
The word launcelot maps with LAUNCELOT=>Launcelot
The word not maps with not=>Not
The word galahad maps with GALAHAD=>Galahad
The word knight maps with KNIGHT=>Knight=>knight
The word what maps with What=>what
The word father maps with FATHER=>father=>Father
The word we maps with we=>We
The word bedevere maps with BEDEVERE=>Bedevere
The word this maps with this=>This
The word well maps with Well=>well
The word head maps with HEAD=>head
The word have maps with have=>Have
The word guard maps with GUARD=>guard
The word are maps with ar

### 2. Punctuation

Punctuation can be tricky to handle. The easiest thing is to remove it, but that's not always the best thing to do. To practice playing around with it, count the number of **unique** words that have punctuation in them in _Beowulf_. Print out a few to look at (although there are a lot, so maybe don't print them all).

In [11]:
beowulf = open("beowulf.txt").read()

In [12]:
# APPROACH BASED ON TOKENIZATION LECTURE... HAS REDUNDANCIES...
beo_tokens = beowulf.split() # split() with no argument splits on any run of whitespace
# we have already imported string.punctuation and collections.Counter...
punct_set = set(punctuation) # a SET of the punctuation characters in string.punctuation

beo_tokens_punct = []  # a blank LIST... so it COULD HAVE DUPLICATES (unlike a SET)

for w in beo_tokens :  # 'w' is each whitespace-separated token in turn
    # NOTE:  rather than creating beo_tokens, could have just put in beowulf.split() here..
    w_set = set(w)   # the set of distinct characters in the token
    overlap = w_set.intersection(punct_set) # the punctuation characters found in the token
    
    if len(overlap) > 0 :  # true if the token contains ANY punctuation
        beo_tokens_punct.append(w)  # ADD this token to the LIST "beo_tokens_punct"

print(len(beo_tokens_punct))

print(beo_tokens_punct[:20])

5921
['LO,', 'people-kings', 'spear-armed', 'Danes,', 'sped,', 'heard,', 'won!', 'foes,', 'tribe,', 'mead-bench', 'tore,', 'earls.', 'friendless,', 'foundling,', 'him:', 'welkin,', 'throve,', 'folk,', 'near,', 'whale-path,']


In [13]:
# EXTRA CODE NEEDED TO MAKE THIS UNIQUE
beo_tokens_punct_uniq = []
for b in beo_tokens_punct :
    if b not in beo_tokens_punct_uniq :
        beo_tokens_punct_uniq.append(b)
        
print(len(beo_tokens_punct_uniq))
print(beo_tokens_punct_uniq[:20])

3478
['LO,', 'people-kings', 'spear-armed', 'Danes,', 'sped,', 'heard,', 'won!', 'foes,', 'tribe,', 'mead-bench', 'tore,', 'earls.', 'friendless,', 'foundling,', 'him:', 'welkin,', 'throve,', 'folk,', 'near,', 'whale-path,']
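
If the original order matters, one alternative to the nested membership check is `dict.fromkeys`, which drops duplicates in a single pass. This is a sketch on a made-up token list rather than the real `beo_tokens_punct`:

```python
# Hypothetical tokens; in the notebook this would be beo_tokens_punct.
tokens_with_punct = ["LO,", "Danes,", "won!", "Danes,", "LO,", "folk,"]

# dict keys are unique and keep insertion order (Python 3.7+).
unique_in_order = list(dict.fromkeys(tokens_with_punct))

# A set also removes duplicates but does not keep the original order.
unique_unordered = set(tokens_with_punct)

print(unique_in_order)
print(len(unique_unordered))
```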


In [None]:
punct_set = set(punctuation)
punct_set # a set holds single values, so unlike a dictionary it does NOT need key:value pairs

A ***set*** is an unordered, mutable collection of ***unique elements***. Set literals are written with curly brackets and comma-separated elements, e.g. `{'a', 'b'}`. An empty set must be created with `set()`, because `{}` creates an empty dictionary.
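
A short sketch of those properties, using the token `"whale-path,"` as an example input:

```python
from string import punctuation

punct_set = set(punctuation)

empty_set = set()   # an empty set
empty_dict = {}     # bare braces create a dict, not a set

letters = set("whale-path,")    # one element per distinct character
shared = letters & punct_set    # same as letters.intersection(punct_set)

print(type(empty_set).__name__, type(empty_dict).__name__)
print(sorted(shared))
```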

In [None]:
# APPROACH PER JOHN'S SOLUTIONS... CLEANER/LESS CODE
punct_set = set(punctuation)
punct_words = set() # creates blank SET (since we want uniques)

for word in beowulf.split() :
    wset = set(word)  # the set of distinct characters in the token
    if punct_set.intersection(wset) : # a non-empty intersection is truthy: the token has punctuation
        punct_words.add(word) # add the token (duplicates are ignored by the set)
    
print(len(punct_words))

# Let's print 20 or so
print(list(punct_words)[:20]) # sets are unordered, so this is an arbitrary 20, shown as a LIST

In [None]:
# While we're here, we can use the `isalnum` function to test if a string is alphanumeric. 
# This makes the code much simpler. There are also functions like isalpha and isnumeric
# https://docs.python.org/3/library/stdtypes.html#str.isalpha
punct_set_2 = set() 

for word in beowulf.split() :
    if not word.isalnum() :
        punct_set_2.add(word)

print(len(punct_set_2))
print(list(punct_set_2)[:20])

In [None]:
punct_set_3 = set() 

for word in beowulf.split() :
    if not word.isalpha() :  # CHANGED FUNCTION TO ISALPHA
        punct_set_3.add(word)

print(len(punct_set_3))
print(list(punct_set_3)[:20])

In [None]:
punct_set_4 = set() 

for word in beowulf.split() :
    if not word.isnumeric() :  # CHANGED FUNCTION TO ISNUMERIC (True only for all-digit tokens, so this keeps nearly every token)
        punct_set_4.add(word)

print(len(punct_set_4))
print(list(punct_set_4)[:20])
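
The three string tests answer different questions, which is why the three cells above give such different counts. A quick sketch on a handful of made-up tokens:

```python
samples = ["Danes", "Danes,", "mead-bench", "1851", "text1"]

# For each token, record (isalpha, isalnum, isnumeric).
results = {s: (s.isalpha(), s.isalnum(), s.isnumeric()) for s in samples}

for token, flags in results.items():
    print(token, flags)
```

Only `not word.isalnum()` comes close to "contains punctuation". `not word.isalpha()` also flags tokens with digits, and `not word.isnumeric()` flags everything that is not made up entirely of digits.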

***From John***:
Lots of that punctuation is at the end of words (e.g., "gallows." and "vain;"). Let's count the number of words that have punctuation in the middle of the word. Let's also throw them in a Counter object and look at the most common.

In [14]:
# Single-character tokens can't have punctuation in the "middle", so first let's see
# which non-alphanumeric tokens are only one character long.
for word in beowulf.split() :
    if not word.isalnum() and len(word) <= 1 :
        print(word)

"


In [None]:
punct_mid_words = [] # Use a LIST (vs a SET) so we can use a counter. 

for word in beowulf.split() :
    if not word.isalnum() and len(word) > 1: # len(word) counts characters
        # 2 conditions.. can't be alphanumeric, must have more than 1 character
        # now we're in the case of punctuation somewhere
        # need to test if it's at the start or end. 
        if (word[0] not in punctuation and
            word[-1] not in punctuation) : # punctuation NOT in first or last position..
            punct_mid_words.append(word) # then add this word to our "mid-word-punctuation"

In [None]:
Counter(punct_mid_words).most_common(20)
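
The test above skips any token that starts or ends with punctuation, so a token like "whale-path," is left out even though it has an internal hyphen. One possible variant trims the edges first and then looks for what remains. The helper name `has_internal_punct` and the sample tokens are made up for illustration:

```python
from string import punctuation

def has_internal_punct(token: str) -> bool:
    """Return True if punctuation remains after trimming both ends."""
    core = token.strip(punctuation)   # drop leading/trailing punctuation characters
    return any(ch in punctuation for ch in core)

samples = ["whale-path,", "gallows.", "o'er", '"Hrothgar', "mead-bench"]
internal = [t for t in samples if has_internal_punct(t)]
print(internal)
```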

### Stopwords

There are many common words that don't help analysis that much (and can take up a lot of space). These are called stopwords. Let's play around with the English stopwords.
1. Load in the English stopwords and assign them to a variable called `sw`. Print them out. Any surprises?
1. Look at the top words in _Moby Dick_ and _Sense and Sensibility_.
1. Look at the top words in both of those that _aren't_ stopwords. 

**When might you need to keep what is considered a 'Stopword'**?
- when you DO care about the use of one
- when a stopword has a different meaning (ex: woman named "May")
- titles of material (ex: song "Me, Myself and I")

In [None]:
sw = stopwords.words("english")
sw

In [None]:
Counter(text1).most_common(10)

In [None]:
Counter(text2).most_common(10)

To look at the same stats but without stopwords and non-alpha strings, I'm going to use a list comprehension. If you haven't seen these before, here's a nice tutorial: https://www.youtube.com/watch?v=AhSvKGTh28Q

### see "List Comprehension" notebook...

In [None]:
Counter([w for w in text1 if w.lower() not in sw and w.isalpha()]).most_common(10)

In [None]:
Counter([w for w in text2 if w.lower() not in sw and w.isalpha()]).most_common(10)
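
Stopword lists are often adjusted for the task at hand. The sketch below uses a tiny hand-picked stopword set (an assumption for illustration, not NLTK's list) and adds a domain-specific entry, "rt":

```python
from collections import Counter

# A tiny hand-picked stopword set for illustration; the notebook uses
# NLTK's stopwords.words("english") instead.
base_stopwords = {"the", "a", "of", "and", "i", "to"}

# Keeping stopwords in a set makes each membership test fast.
custom_stopwords = base_stopwords | {"rt"}   # domain-specific addition

tokens = ["RT", "The", "whale", "and", "I", "saw", "the", "whale", "!"]

kept = [t for t in tokens if t.lower() not in custom_stopwords and t.isalpha()]
top = Counter(kept).most_common(2)
print(top)
```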

### Lemmas vs Stemming (from John's lecture on normalization, 9/8/21 class review)

- 19:52m: review text normalization / lemma / stemming
- 20:29m: lemma defined: generally used in different parts of speech (noun vs verb, etc)
- 20:55m: stemming defined
- 21:36m: example: learn, learns, learning… **diff lemmas, same stem**
- 21:30m: whiteboard
- 22:44m: whiteboard “wordform vs lemma”
- 23:30m: type vs tokens (each instance of the type): => complexity vs size of the corpus
  - Want a set of tokens to work with
- 24:30m: steps to normalize a corpus (keep what needs being kept, delete what doesn’t)
  - Ex: sarcasm/satire? “I’m REALLY excited…” vs. “i’m really excited…”
- 26:15m: whiteboard: stop words, punctuation, case-folding
  - case-folding (make all lower or all upper); ? re: formatting (bold, italics, etc)
- 30:40m: punctuation: tend to strip out.. make simpler, don’t care about it
- 35:55m: dealing with “it’s” => “it is” before other cleansing…
- 36:25m: stop words; generally don’t contain much value, so remove
  - MacBeth: an exception
  - Can add words… ex: Twitter “rt” (re-tweet) or NOT delete (“#”, “@”)
- 41:05m: stemming (vs tokenization)
  - Always need tokenization
  - Almost always need normalization
  - Stemming is more of “maybe” to do it..
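
The normalization steps in these notes can be strung together roughly as follows. This is only a sketch: the `contractions` and `stop_words` tables are deliberately tiny and made up, and the function name `normalize` is hypothetical.

```python
from string import punctuation

# Hypothetical, deliberately tiny lookup tables for illustration.
contractions = {"it's": "it is", "i'm": "i am"}
stop_words = {"it", "is", "i", "am", "the"}

def normalize(text):
    tokens = []
    for raw in text.split():
        word = raw.lower().strip(punctuation)   # case fold, trim edge punctuation
        word = contractions.get(word, word)     # expand "it's" to "it is" if listed
        tokens.extend(word.split())             # an expansion may yield two tokens
    return [t for t in tokens if t and t not in stop_words]

cleaned = normalize("It's the best day, and I'm REALLY excited!")
print(cleaned)
```

Note that case folding turns "REALLY" into "really", which is the loss of emphasis the notes flag as something to weigh before normalizing.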



## Stemming

Stemming is the process by which we move from a token to some "root" of that word. Let's explore one of the stemmers available through NLTK.
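
To get a feel for the idea before using a real stemmer, here is a naive suffix-stripping sketch. It is a toy illustration, not the Porter algorithm; the function name `naive_stem` and the three-character minimum are assumptions made for this example.

```python
def naive_stem(word: str) -> str:
    """Strip one common suffix; a toy illustration, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so "sing" is not reduced to "s".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["learn", "learns", "learning", "learned", "sing"]]
print(stems)
```

Real stemmers apply many more rules and conditions, which is part of why words like "sing" need special care.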

First, let's find all the words in the NLTK words corpus that end in "ing", then let's find those that have no vowels before an instance of "ing". You can access the words corpus with the confusing call of `nltk.corpus.words.words()`. To make it easier to deal with "y", let's just consider it a vowel.

In [None]:
words = nltk.corpus.words.words()
vowels = set('aeiouy')
vowels

In [None]:
# 2 conditions.. word length > 3 characters, and last 3 MUST be 'ing'
# (Not YET concerned about ANY preceding characters being a vowel...)
ing_words = [w for w in words if len(w) > 3 and w[-3:]=="ing"]

In [None]:
len(ing_words)

In [None]:
# Now let's find the subset that don't have a vowel before the 'ing'
ing_no_vowel = []  # creating a blank LIST

for word in ing_words :  # iterate over each "ing_word"...
    remainder = word[:-3] # all characters preceding the 'ing'
    # we have a set "vowels", so convert "remainder" to a set, for intersection testing
    # determine if any of the characters in our remainder intersect with vowels;
    #   count the number of intersections
    if len(set(remainder).intersection(vowels))==0 : # looking for 0 intersections!
        ing_no_vowel.append(word) # TRUE... make this an 'ing_no_vowel' word (ex: sing)
        
ing_no_vowel
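
The same filter can be sketched with `str.endswith` and `set.isdisjoint`, shown here on a small made-up word list instead of the full NLTK words corpus:

```python
# Hypothetical word list; the notebook uses nltk.corpus.words.words().
words = ["sing", "string", "running", "thing", "king", "ing", "spring", "eating"]
vowels = set("aeiouy")

ing_words = [w for w in words if len(w) > 3 and w.endswith("ing")]

# isdisjoint is True when the two collections share no elements.
ing_no_vowel = [w for w in ing_words if vowels.isdisjoint(w[:-3])]

print(ing_no_vowel)
```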

Now let's play around with the Porter Stemmer in NLTK. First we'll look at a couple hundred tokens of the inaugural addresses, both stemmed and not stemmed.

In [None]:
porter = nltk.PorterStemmer() # give it a short name.
start = 30000
distance = 200 # the slice below returns exactly 200 tokens
# take the tokens from index 30,000 up to (but not including) 30,200... 
# then join them with a space
print(" ".join(text4[start:(start + distance)])) 

In [None]:
print("\n\n") # this is just to create a separator between these two outputs...

In [None]:
# to get the stem of each, take the ORIGINAL..
#    text4[start:(start + distance)]
# and make a list comprehension out of it
#    [porter.stem(w) for w in ORIGINAL]
print(" ".join([porter.stem(w) for w in text4[start:(start + distance)]]))

Now for you: how many distinct words are in the inaugural addresses? How many distinct lowercase stems are in them?

In [None]:
# distinct words (types) in inaugural addresses
print(len(set(text4)))

In [None]:
# Identify the lower-cased stems by adding in the qualifier for w
#    ie, "porter.stem(w.lower())..."  INSTEAD OF "porter.stem(w)...""
inaug_stemmed = {porter.stem(w.lower()) for w in text4}
# what is the number of lower-cased stems?
print(len(inaug_stemmed))
# what is the ratio of distinct words to distinct lower-cased stems?
print(len(set(text4))/len(inaug_stemmed))
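
These counts rest on the type vs. token distinction from the lecture notes. A sketch on a made-up token list (standing in for `text4`) shows the three numbers side by side, with case folding as the only normalization step:

```python
# Hypothetical token list standing in for text4.
tokens = ["We", "the", "People", "of", "the", "United", "States", ",", "we", "THE"]

n_tokens = len(tokens)                        # every instance counts
n_types = len(set(tokens))                    # distinct strings as written
n_folded = len({t.lower() for t in tokens})   # distinct strings after case folding

print(n_tokens, n_types, n_folded)
print(n_types / n_folded)   # how much the vocabulary shrinks
```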

---

Okay, let's have some "fun" and play around with some sets of characters that aren't words. Text 5 is the chat corpus. Find the emojis in there (strictly speaking, text emoticons such as `:)`; the search doesn't have to be perfect) and count up the happy and sad ones.

In [None]:
chat = text5 # give it a nice name. 

# Let's find emojis in chat. Assume a token is an emoji if it includes ":", ";", or "="
potential_emojis = {w for w in chat if ":" in w or ";" in w or "=" in w}

In [None]:
potential_emojis

Some non-emojis are present, but we've likely captured most or all of the true emojis.

In [None]:
# Count happy vs sad
# we look over the candidates and make our own call on which are happy vs sad
happy = [w for w in chat if w in {":-)",":)",":D",";-)","=)"}]
sad = [w for w in chat if w in {":-(",":(",";-(","=("}]

print(len(happy))
print(len(sad))
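
A regular expression is one way to tighten the search. The pattern below is an assumption about what counts as an emoticon (eyes, optional nose, mouth), and the token list is made up, so treat it as a starting point rather than a complete matcher:

```python
import re
from collections import Counter

# Assumed pattern: eyes (: ; =), optional nose (-), then one mouth character.
emoticon_re = re.compile(r"^[:;=]-?[)(DPp]$")

# Hypothetical chat tokens for illustration.
tokens = [":)", "hi", ":-(", ";-)", "10:30", "=D", ":", ":(", "lol", ":P"]

found = [t for t in tokens if emoticon_re.match(t)]

# Classify by the mouth character; anything else is left as "other".
mood = Counter("happy" if t[-1] in ")D" else "sad" if t[-1] == "(" else "other"
               for t in found)
print(found)
print(mood)
```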