# My First Natural Language Processor

CSCI 1360: Foundations for Informatics and Analytics

## Overview and Objectives

Last week, we introduced the concept of natural language processing, and in particular the "bag of words" model for representing and quantifying text for later analysis. In this lecture, we'll expand on those topics, including some additional preprocessing and text representation methods. By the end of this lecture, you should be able to
 - Implement several preprocessing techniques like stemming, stopwords, and minimum counts
 - Understand the concept of *feature vectors* in natural language processing
 - Compute inverse document frequencies to up- or down-weight term frequencies

### Text Preprocessing

Here are some text preprocessing techniques we covered in the last lecture:
 - Lower case (or upper case) everything
 - Split into single words
 - Remove trailing whitespace (spaces, tabs, newlines)

To start, let's download a few more books in addition to the *Alice in Wonderland* example from the previous lecture. Links are below:

 - [*Alice in Wonderland*](https://www.gutenberg.org/ebooks/11), by Lewis Carroll
 - [*Pride and Prejudice*](https://www.gutenberg.org/ebooks/1342), by Jane Austen
 - [*Frankenstein*](https://www.gutenberg.org/ebooks/84), by Mary Shelley
 - [*Beowulf*](https://www.gutenberg.org/ebooks/16328), by Lesslie Hall
 - [*The Adventures of Sherlock Holmes*](https://www.gutenberg.org/ebooks/1661), by Sir Arthur Conan Doyle
 - [*The Adventures of Tom Sawyer*](https://www.gutenberg.org/ebooks/74), by Mark Twain
 - [*The Adventures of Huckleberry Finn*](https://www.gutenberg.org/ebooks/76), by Mark Twain

First, we'll read the raw contents of each book into a dictionary, keyed by a short name derived from the filename.

In [3]:
books = {}  # We'll use a dictionary to store all the text from the books.
files = ['NLP-ICA/alice.txt',
         'NLP-ICA/pride.txt',
         'NLP-ICA/frank.txt',
         'NLP-ICA/bwulf.txt',
         'NLP-ICA/holmes.txt',
         'NLP-ICA/tom.txt',
         'NLP-ICA/finn.txt']

for f in files:
    # This weird line just takes the part of the filename between the "/" and "." as the dict key.
    prefix = f.split("/")[-1].split(".")[0]
    try:
        with open(f, "r", encoding = "ISO-8859-1") as descriptor:
            books[prefix] = descriptor.read()
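    except OSError as err:
        # Catch only file-related errors (e.g. a missing file), not every exception.
        print("File '{}' had an error: {}".format(f, err))
        books[prefix] = None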

In [4]:
# Here you can see the dict keys (i.e. the results of the weird line of code in the last cell)
print(books.keys())

dict_keys(['alice', 'pride', 'frank', 'bwulf', 'holmes', 'tom', 'finn'])
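
The "weird line" that builds the dictionary keys can also be written with the standard library's `pathlib` module. This is a minimal sketch, assuming the same relative paths as above; the `key_for` helper name is made up for illustration.

```python
from pathlib import Path

def key_for(filename):
    """Return the filename without its directory or extension."""
    # Path.stem drops the parent directories and the final suffix.
    return Path(filename).stem

print(key_for("NLP-ICA/alice.txt"))   # alice
print(key_for("NLP-ICA/holmes.txt"))  # holmes
```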


Let's go ahead and lowercase everything and strip out whitespace.

In [5]:
def preprocess(book):
    # First, lowercase everything.
    lower=book.lower()
    
    # Second, split into lines.
    lines=lower.split("\n")
    
    # Third, split each line into words.
    words = []
    for line in lines:
        words.extend(line.strip().split(" "))

    # Finally, count up the words (count() is defined in the next cell).
    return count(words)

Then we will count all the words. This function takes a list of words as input, and counts them all up.

In [6]:
from collections import defaultdict # Our good friend from the last lecture, defaultdict!


def count(words):
    # Map each word to the number of times it appears; missing keys start at 0.
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return counts

Finally, let's loop through our books and count all the words that show up!

In [7]:
counts = {}
for k, v in books.items():
    counts[k] = preprocess(v)

Let's see how our basic preprocessing techniques worked out.

In [8]:
from collections import Counter

def print_results(counts):
    for key, bag_of_words in counts.items():
        word_counts = Counter(bag_of_words) # Remember "Counter"?
        mc_word, mc_count = word_counts.most_common(1)[0]
        print("'{}': {} unique words; most common word '{}' appeared {} times."
              .format(key, len(bag_of_words.keys()), mc_word, mc_count))
print_results(counts)

'alice': 5584 unique words; most common word 'the' appeared 1777 times.
'pride': 13127 unique words; most common word 'the' appeared 4479 times.
'frank': 11725 unique words; most common word 'the' appeared 4327 times.
'bwulf': 11024 unique words; most common word '' appeared 3497 times.
'holmes': 14544 unique words; most common word 'the' appeared 5704 times.
'tom': 13449 unique words; most common word 'the' appeared 3907 times.
'finn': 13840 unique words; most common word 'and' appeared 6107 times.


Let's take a quick step back and think about the code we just saw.

 - The **`preprocess`** function takes a single book string as input and does some preprocessing: it lowercases everything so it's all the same case, it splits up the string into single words, and it adds all these words to one big list.

 - We also have a **`count`** function, which takes the list of words built inside `preprocess` and counts everything up into a dictionary (the keys are unique words, the values are how many times those words appear in the book).

 - Finally, we have a block of code that loops over all our books and runs these two functions on each of them, building dictionaries of word counts. These are fed into **`print_results`** so that we can see 1) the number of unique words in each book, and 2) the most common word in each book.
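
As a point of comparison, the same lowercase / split / count pipeline can be sketched in a few lines with `Counter` on a toy string. This is only an illustration on made-up text, not a replacement for the functions above; `toy_text` and `bag_of_words` are names invented for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it line by line on single spaces, and count the words."""
    words = []
    for line in text.lower().split("\n"):
        words.extend(line.strip().split(" "))
    return Counter(words)

toy_text = "The cat saw the dog\nThe dog ran"
bag = bag_of_words(toy_text)
print(len(bag))             # 5 unique words
print(bag.most_common(1))   # [('the', 3)]
```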

We're going to repeat this process for the rest of this activity, slowly upgrading the **`preprocess`** function so that the final result (top words for each book) becomes more meaningful and indicative of the books' contents.

### Stop words

A great first step is to filter out stop words. (You can use [this list of 319 stop words](http://xpo6.com/list-of-english-stop-words/).) This code reads the words from a stop list file into a list we can use later.

In [9]:
with open("NLP-ICA/stopwords.txt", "r") as f:
    lines = f.read().split("\n")
    stopwords = [w.strip() for w in lines]
print(stopwords)

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', '

We'll now use the words in the `stopwords` list to eliminate words from our books (remember: we consider "stop words" to be meaningless to the overall semantics of the text; go back to the previous lecture if you need a refresher on stop words).
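
To make the idea concrete, here is a small sketch of stop word filtering on a toy sentence. The tiny `toy_stopwords` collection and the `remove_stopwords` helper are invented for this example. The stop words are stored as a `set` because membership tests on a set typically take about the same time regardless of its size, whereas `in` on a list scans the elements one by one.

```python
toy_stopwords = {"the", "a", "of", "and", "was"}

def remove_stopwords(words, stopwords):
    """Keep only the words that are not in the stop word collection."""
    return [w for w in words if w not in stopwords]

sentence = "the queen of hearts was a fan of tarts"
kept = remove_stopwords(sentence.split(" "), toy_stopwords)
print(kept)  # ['queen', 'hearts', 'fan', 'tarts']
```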

Now we'll augment our `preprocess` function to include stop word processing.

In [10]:
def preprocess_v2(book, stopwords):  # Note the "_v2"--this is a new function!
    # First, lowercase everything.
    
    lower=book.lower()
    
    # Second, split into lines.
    lines=lower.split("\n")
    
    # Third, split each line into words.
    words = []
    for line in lines:
        words.extend(line.strip().split(" "))

    ### NEW CODE HERE! Check for stopwords.
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)

    # That's it!
    return count(new_words)

Now let's see what we have! Same code as before--count the words in the books, but this time the stopwords will be filtered out per the new code in the preprocess_v2() function.

In [11]:
counts = {}
for k, v in books.items():
    counts[k] = preprocess_v2(v, stopwords)
    
print_results(counts)

'alice': 5350 unique words; most common word '' appeared 1178 times.
'pride': 12857 unique words; most common word '' appeared 2474 times.
'frank': 11463 unique words; most common word '' appeared 2841 times.
'bwulf': 10781 unique words; most common word '' appeared 3497 times.
'holmes': 14278 unique words; most common word '' appeared 2750 times.
'tom': 13180 unique words; most common word '' appeared 2283 times.
'finn': 13590 unique words; most common word '' appeared 2719 times.


Errr... this seems even worse! What's with the `''` word?! What could we try next?

### Minimum length

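Our problem with "words" that are just blank strings `''` is that
 1. Text is weird: blank lines and runs of consecutive spaces turn into empty strings when we split on a single space `" "`, and
 2. We haven't told Python that we don't care about super-short "words" (that aren't really words)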

We can get around this by adding another filtering step in our preprocess function: drop any word that has fewer than a certain number of characters.

Put another way: **ignore words under a certain length.**
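
To see where the blank "words" come from, consider what `split(" ")` does with a blank line or with two spaces in a row. The sample strings below are made up for illustration.

```python
# Splitting an empty (blank) line on a single space yields one empty string.
print("".split(" "))            # ['']

# Two spaces in a row produce an empty string between them.
print("down  the".split(" "))   # ['down', '', 'the']

# A minimum-length filter removes these empty strings (and other very short tokens).
tokens = "down  the rabbit hole".split(" ")
long_enough = [t for t in tokens if len(t) >= 2]
print(long_enough)              # ['down', 'the', 'rabbit', 'hole']
```

Calling `split()` with no argument is another option: it splits on any run of whitespace and never returns empty strings.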

Here's another new preprocessing function, this time enforcing a minimum word length of 2 (anything shorter than that is ignored the same way we ignore stop words).

In [12]:
def preprocess_v3(book, stopwords):  # We've reached "_v3"!
    # First, lowercase everything.
    
    low=book.lower()
    
    # Second, split into lines.
    lines=low.split("\n")
    
    # Third, split each line into words.
    words = []
    for line in lines:
        words.extend(line.strip().split(" "))

    ### NEW CODE HERE! Skip stopwords AND words with length under 2.
    tokens = []
    for word in words:
        if word not in stopwords and len(word) >= 2:
            tokens.append(word)

    # That's it!
    return count(tokens)

Maybe this will be better? Same word-counting again, this time using preprocess_v3 to enforce both stopwords and minimum word length.

In [13]:
counts = {}
for k, v in books.items():
    counts[k] = preprocess_v3(v, stopwords)
    
print_results(counts)

'alice': 5344 unique words; most common word 'said' appeared 421 times.
'pride': 12845 unique words; most common word 'mr.' appeared 766 times.
'frank': 11451 unique words; most common word 'me,' appeared 148 times.
'bwulf': 10761 unique words; most common word 'beowulf' appeared 112 times.
'holmes': 14267 unique words; most common word 'said' appeared 448 times.
'tom': 13170 unique words; most common word 'tom' appeared 461 times.
'finn': 13581 unique words; most common word 'got' appeared 603 times.


Ooh! Definite improvement! Though clearly, punctuation is getting in the way; you can see it in at least two of the top words in the list above.

We spoke last time about how removing punctuation could be a little dangerous; what if the punctuation is inherent to the meaning of the word (e.g., a contraction)?

In our fourth version of the preprocess function, we'll compromise a little: we'll get the "easy" punctuation, like exclamation marks, periods, and commas, and leave the rest.
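
One way to sketch this compromise is with `str.rstrip`, which removes every trailing character found in the given set while leaving punctuation inside the word (such as the apostrophe in a contraction) untouched. Note that this is a slightly different technique from removing a single final character: a token like "what?!" loses both marks. The `EASY_PUNCTUATION` constant, the helper name, and the sample tokens are assumptions made for this example.

```python
EASY_PUNCTUATION = "!.,:;?"

def strip_trailing_punctuation(word):
    """Remove any run of 'easy' punctuation from the end of a word."""
    return word.rstrip(EASY_PUNCTUATION)

for token in ["mr.", "me,", "don't", "what?!"]:
    print(token, "->", strip_trailing_punctuation(token))
```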

In [14]:
def preprocess_v4(book, stopwords):

    # First, lowercase everything.
    
    low=book.lower()
    
    # Second, split into lines.
    lines=low.split("\n")
    
    # Third, split each line into words.
    words = []
    for line in lines:
        words.extend(line.strip().split(" "))

    # Skip stopwords AND words with length under 2.
    tokens = []
    for word in words:
        if word not in stopwords and len(word) >= 2:
            tokens.append(word)

    ### NEW CODE HERE! Cut off one trailing end-of-sentence punctuation mark.
    ### endswith() accepts a tuple of suffixes to test.
    new_tokens = []
    for word in tokens:
        if word.endswith(("!", ".", ",", ":", ";", "?")):
            new_tokens.append(word[:-1])
        else:
            new_tokens.append(word)

    # That's it!
    return count(new_tokens)

Alright, let's check it out again. Same thing AGAIN, this time with `preprocess_v4` to knock out commas, exclamation marks, periods, colons, semicolons, and question marks.

In [15]:
counts = {}
for k, v in books.items():
    counts[k] = preprocess_v4(v, stopwords)
    
print_results(counts)

'alice': 4181 unique words; most common word 'said' appeared 456 times.
'pride': 8544 unique words; most common word 'mr' appeared 766 times.
'frank': 7960 unique words; most common word 'me' appeared 324 times.
'bwulf': 8736 unique words; most common word 'beowulf' appeared 156 times.
'holmes': 10781 unique words; most common word 'said' appeared 484 times.
'tom': 10038 unique words; most common word 'tom' appeared 611 times.
'finn': 9976 unique words; most common word 'says' appeared 628 times.


Now we're getting somewhere! But this introduces a new concept--in looking at this list, wouldn't you say that "says" and "said" are probably, semantically, more or less the same word?

### Stemming

Stemming is the process of reducing related forms of a word to a common root (the "stem"), so that their similarity is reflected in our analysis. Words like "imaging" and "images", or "says" and "said", should probably be considered the same thing.

To do this, we'll need an external Python package: the Natural Language Toolkit, or NLTK. (It's installed on JupyterHub, so go ahead and play with it!)

[NLTK](https://www.nltk.org/) has a ton of really interesting stuff that goes way beyond the scope of this lecture. For our purposes, though, we're just going to use its "stemming" capabilities.

In [16]:
import nltk # This package!

def preprocess_v5(book, stopwords):
    stemmer=nltk.stem.SnowballStemmer('english')
    
    # First, lowercase everything.
    
    low=book.lower()
    
    # Second, split into lines.
    lines=low.split("\n")
    
    # Third, split each line into words.
    words = []
    for line in lines:
        words.extend(line.strip().split(" "))

    # Skip stopwords AND words with length under 2.
    tokens = []
    for word in words:
        if word not in stopwords and len(word) >= 2:
            tokens.append(word)

    # Cut off one trailing end-of-sentence punctuation mark.
    new_tokens = []
    for word in tokens:
        if word.endswith(("!", ".", ",", ":", ";", "?")):
            new_tokens.append(word[:-1])
        else:
            new_tokens.append(word)

    ### NEW CODE HERE! Stem each word (the stemmer itself is
    ### initialized at the start of the function).
    stem_words = []
    for word in new_tokens:
        stemmed = stemmer.stem(word)  # This is all that is required--nltk does the rest!
        stem_words.append(stemmed)

    # That's it!
    return count(stem_words)

How did this go?

In [17]:
# Up to version 5 to enforce word stemming!
counts = {}
for k, v in books.items():
    counts[k] = preprocess_v5(v, stopwords)
    
print_results(counts)

'alice': 3489 unique words; most common word 'said' appeared 456 times.
'pride': 5975 unique words; most common word 'mr' appeared 767 times.
'frank': 5516 unique words; most common word 'me' appeared 324 times.
'bwulf': 7072 unique words; most common word 'beowulf' appeared 192 times.
'holmes': 8197 unique words; most common word 'said' appeared 484 times.
'tom': 7628 unique words; most common word 'tom' appeared 705 times.
'finn': 8017 unique words; most common word 'say' appeared 847 times.


Well, this *kinda* helped--"says" was reduced to "say", and its count clearly increased from the 628 it was before, meaning stemmed versions that were previously viewed as different words were merged. But "said" is still there; clearly, there are limitations to this stemmer.
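
To get a feel for why "said" slips through, here is a deliberately crude suffix-stripping stemmer. It is a toy written for this example, not the Snowball algorithm that NLTK implements, and the suffix list is an assumption. Rule-based stemmers of this kind work by trimming endings, so an irregular form like "said" has no ending that links it to "say".

```python
def toy_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # No rule applied: the word is left unchanged.

for w in ["says", "said", "imaging", "images", "jumped"]:
    print(w, "->", toy_stem(w))
```

Here "imaging" and "images" collapse to the same stem and "says" becomes "say", but "said" is returned unchanged because none of the suffix rules match it.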

As one final step, it's sometimes convenient to simply drop words that occur only once or twice. This can dramatically help with processing time, as quite a few words (usually proper nouns) will only be seen a few times.

We'll remove any word that appears only once, keeping only words that were observed more than once. (This version also strips punctuation *before* checking for stop words, and raises the minimum word length to 3.)
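
The minimum-count filter itself is short enough to sketch on its own. The `drop_rare` helper, its `min_count` parameter, and the toy counts are all invented for this example.

```python
def drop_rare(word_counts, min_count=2):
    """Return a new dict keeping only words seen at least min_count times."""
    return {word: n for word, n in word_counts.items() if n >= min_count}

toy_counts = {"alice": 12, "rabbit": 5, "zephyr": 1, "quince": 1}
print(drop_rare(toy_counts))  # {'alice': 12, 'rabbit': 5}
```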

In [24]:
def preprocess_v6(book, stopwords):
    # Note: the order of some steps has changed from v5 to improve processing.
    stemmer = nltk.stem.SnowballStemmer('english')

    low = book.lower()
    lines = low.split("\n")
    words = []
    for line in lines:
        words.extend(line.strip().split(" "))

    # Cut off one trailing end-of-sentence punctuation mark first,
    # so that tokens like "the," can match the stop word list.
    new_tokens = []
    for word in words:
        if word.endswith(("!", ".", ",", ":", ";", "?")):
            new_tokens.append(word[:-1])
        else:
            new_tokens.append(word)

    # Skip stopwords AND words with length of 2 or less
    # (<= rather than <, because there were lots of "mr" tokens).
    new_words = []
    for word in new_tokens:
        if word not in stopwords and len(word) > 2:
            new_words.append(word)

    stem_words = []
    for word in new_words:
        stemmed = stemmer.stem(word)  # This is all that is required--nltk does the rest!
        stem_words.append(stemmed)

    word_counts = count(stem_words)

    ### NEW CODE HERE! It drops any keys whose values
    ### (word counts) are only 1.
    trimmed_counts = {}
    for k, v in word_counts.items():
        if v > 1:  # Perform the check that the value (count) is larger than 1
            trimmed_counts[k] = v
    return trimmed_counts

One final check:

In [25]:
counts = {}
for k, v in books.items():
    counts[k] = preprocess_v6(v, stopwords)
    
print_results(counts)

'alice': 1502 unique words; most common word 'said' appeared 456 times.
'pride': 3002 unique words; most common word 'elizabeth' appeared 625 times.
'frank': 3073 unique words; most common word 'feel' appeared 152 times.
'bwulf': 2543 unique words; most common word 'beowulf' appeared 192 times.
'holmes': 3822 unique words; most common word 'said' appeared 484 times.
'tom': 3419 unique words; most common word 'tom' appeared 705 times.
'finn': 3428 unique words; most common word 'say' appeared 847 times.


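The most common words and their counts are mostly unchanged. The exceptions are *Pride and Prejudice* and *Frankenstein*: "mr" and "me" are only two characters long, so the stricter minimum length in `preprocess_v6` removes them. Hopefully you can also see there's a big difference in the number of unique words after dropping any word that only appeared once (comparing the `preprocess_v5` and `preprocess_v6` outputs above)!

 - *Frankenstein*: 5516 to 3073
 - *Beowulf*: 7072 to 2543
 - *Alice in Wonderland*: 3489 to 1502
 - *Tom Sawyer*: 7628 to 3419
 - *Pride and Prejudice*: 5975 to 3002
 - *Sherlock Holmes*: 8197 to 3822
 - *Huckleberry Finn*: 8017 to 3428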

By removing and ignoring words that only appeared once, we eliminated a *ton* of words from the word count dictionaries of each of these books. Since the words only appeared once they were unlikely to contribute anything meaningful to any analysis we did; at the same time, we have significantly fewer words to worry about. Win-win!

Now that we have the document vectors in bag of words format, fully preprocessed, we can do some analysis, right?

**What is it we're really hoping these word counts tell us?**

Which words are important in the book. The top 20 words could give us a rough summary of a story and help match search queries to the book. They can also rank books by how often the word you're looking for appears, with a higher count suggesting the book is more important or more relevant to your search.