# The Basics of Natural Language Processing

### Nicholas Zufelt, CSC630 Data Visualization

In this notebook, we use the `nltk` library to vectorize the text of "Siddhartha" by Hermann Hesse.

The main goal of vectorization is to take [natural language](https://en.wikipedia.org/wiki/Natural_language) and "turn it into a data set".  What that means depends on how you plan to use it, but for us, we'll take a common, basic approach:

**Given a novel, break it into parts (sections, chapters, etc.) and then into sentences.  Turn each sentence into a vector of length $N$ consisting of 1's and 0's, given by whether the top $N$ non-stop words are contained in each sentence.**

There's a lot in there:
1. Break the text into pieces appropriately,
2. Break each part into sentences,
3. Remove _stop words_, _i.e._ words that are common in most natural language and which don't add much meaning to the text (e.g. `the`, `is`, _etc._).  Which words count as stop words often depends on the text itself, and the list usually requires some fine-tuning.
4. Determine the top $N$ non-stop words, deciding upon $N$ by some kind of process.  This typically involves _stemming_ the words, _i.e._ reducing words to their stem: `charge`, `charging`, `charged` all become `charg`.
5. Replace each sentence with the appropriate vector.  This is often called the _bag-of-words_ approach, because it doesn't take into account the interaction between words.  For example, having the word "peanut" in a sentence increases the chances of having the word "butter" in the same sentence, but we'll disregard such concerns here.
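
On a toy text, the five steps can be sketched by hand with nothing but the standard library.  Everything here is a simplifying assumption of mine: the three-sentence text, the regex sentence splitter, the four-word stop list, and skipping stemming entirely.  Those shortcuts are exactly what stops working on a real novel.

```python
import re
from collections import Counter

# Toy text and a hand-picked stop list (both made up for this sketch).
text = "The river is wide. The river sings to the ferryman. The ferryman listens to the river."
stop_words = {"the", "is", "to", "a"}

# Steps 1-2: naive sentence split after ., ! or ? (no handling of abbreviations).
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Step 3: lowercase, keep alphabetic words, drop stop words.
tokenized = [
    [w for w in re.findall(r"[a-z]+", s.lower()) if w not in stop_words]
    for s in sentences
]

# Step 4: the top N non-stop words across the whole text (no stemming here).
N = 2
counts = Counter(w for sentence in tokenized for w in sentence)
top_words = [w for w, _ in counts.most_common(N)]

# Step 5: one 0/1 vector per sentence, one slot per top word.
vectors = [[1 if w in sentence else 0 for w in top_words] for sentence in tokenized]

print(top_words)
print(vectors)
```

The rest of the notebook swaps each of these shortcuts for a sturdier `nltk` tool.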

Now, in our line of work, doing all that by hand would take a lot of lines of Python. Fortunately, the Natural Language Toolkit (`nltk`) is a Python library that has many built-in, helpful tools for these tasks.

In [123]:
import json
import string

import numpy as np
import pandas as pd
import requests

import nltk

You'll likely need to install nltk the first time you use it.  Recall that you can do that inside of this notebook!  You use 

```
!pip install nltk
```

to have [pip](https://en.wikipedia.org/wiki/Pip_%28package_manager%29) install it for you.

## Removing non-content

Okay, let's get a book from [Project Gutenberg](http://www.gutenberg.org/), as an example to work with.

In [2]:
url = "http://www.gutenberg.org/cache/epub/2500/pg2500.txt"

res = requests.get(url)

Let's take a look at what we're working with here.

In [34]:
print(res.text[:1000])

﻿The Project Gutenberg EBook of Siddhartha, by Herman Hesse

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Siddhartha

Author: Herman Hesse

Translator: Gunther Olesch, Anke Dreher, Amy Coulter, Stefan Langer and Semyon Chaichenets

Release Date: April 6, 2008 [EBook #2500]
Last updated: July 2, 2011
Last updated: January 23, 2013

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK SIDDHARTHA ***




Produced by Michael Pullen,  Chandra Yenco, Isaac Jones





SIDDHARTHA

An Indian Tale

by Hermann Hesse





FIRST PART

To Romain Rolland, my dear friend




THE SON OF THE BRAHMAN

In the shade of the house, in the sunshine of the riverbank near the
boats, in the shade of the Sal-wood forest, in the shade of the fig

Okay, there's some front matter at the beginning that we don't want to include in our analysis of the text, ending with 

```
by Hermann Hesse
```

The book proper starts at `FIRST PART`, so let's remove everything before that, and then see what's at the end, as well.

In [3]:
i = res.text.index("FIRST PART")

text = res.text[i:]

print(text[:100])

FIRST PART

To Romain Rolland, my dear friend




THE SON OF THE BRAHMAN

In the shade of t


In [4]:
# Tinkering with the numbers below brought me to:
print(text[-20000:-19000])

s innermost self, Govinda
still stood for a little while bent over Siddhartha's quiet face, which
he had just kissed, which had just been the scene of all manifestations,
all transformations, all existence.  The face was unchanged, after under
its surface the depth of the thousandfoldness had closed up again, he
smiled silently, smiled quietly and softly, perhaps very benevolently,
perhaps very mockingly, precisely as he used to smile, the exalted one.

Deeply, Govinda bowed; tears he knew nothing of, ran down his old face;
like a fire burnt the feeling of the most intimate love, the humblest
veneration in his heart.  Deeply, he bowed, touching the ground, before
him who was sitting motionlessly, whose smile reminded him of everything
he had ever loved in his life, what had ever been valuable and holy to
him in his life.





End of the Project Gutenberg EBook of Siddhartha, by Herman Hesse

*** END OF THIS PROJECT GUTENBERG EBOOK SIDDHARTHA ***

***** This file 

So the end of the book is signaled by 

```
End of the Project Gutenberg EBook of Siddhartha, by Herman Hesse
```
So we'll strip that off.

In [5]:
i = text.index("End of the Project Gutenberg EBook")

text = text[:i]
print(text[-100:])

he had ever loved in his life, what had ever been valuable and holy to
him in his life.








Great, now our text is ready.  
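
If you plan to repeat this for other books, the two trimming steps can be wrapped in a small helper.  This is only a sketch: the function name `strip_wrapper` and the sample string are my own, and the markers differ from file to file, so you still need to look at the raw text first.

```python
def strip_wrapper(raw, start_marker, end_marker):
    """Return raw from start_marker (inclusive) up to end_marker (exclusive)."""
    start = raw.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Search for the end marker only after the start marker.
    end = raw.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return raw[start:end]


# Made-up miniature of a Project Gutenberg file.
sample = (
    "License header\n\n"
    "FIRST PART\n\nOnce upon a time.\n\n"
    "End of the Project Gutenberg EBook of Sample\nMore license text"
)
body = strip_wrapper(sample, "FIRST PART", "End of the Project Gutenberg EBook")
print(repr(body))
```

Raising an error that names the missing marker makes it easier to see which assumption failed when a different book uses different wording.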

## Breaking into parts/chapters

I also saw that the headings of the big sections (parts, chapters, etc.) are written in all caps.  So to split the text up by chapters, I'm going to use our first tool from `nltk`, a _tokenizer_.  A tokenizer is an object which can break up natural language in a way you specify.  For example: 

In [6]:
from nltk import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize("The cat in the hat sat on the bat, then spat.")

tokens

['The', 'cat', 'in', 'the', 'hat', 'sat', 'on', 'the', 'bat,', 'then', 'spat.']

The tokenizer we'll use first is `RegexpTokenizer`, whose constructor takes a regular expression pattern to match.  Here, we'll use the pattern 
```
^(?:[A-Z][A-Z]+ ?)+
```
which matches a run of words at the start of a line, each made of two or more capital letters.  I tweaked this until I got what I wanted.  Use [this website](https://regex101.com/) to help fine-tune your regular expression to be what you want.  Since regular expressions are an [entire problem in themselves](http://regex.info/blog/2006-09-15/247), I won't spend time explaining them.  Just ask for help if you need them.
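
To get a feel for that tweaking, here is a sketch using plain `re` on a made-up snippet.  It compares a looser pattern (one or more capitals per word) with the stricter one (two or more capitals per word).  I pass `re.MULTILINE` explicitly so that `^` matches at the start of every line; as far as I know `RegexpTokenizer` turns that flag on by default, but treat that as an assumption.

```python
import re

# Made-up snippet: three all-caps headings and two ordinary lines.
toy = "FIRST PART\n\nTHE SON\n\nIn the shade of the house.\n\nOM\n\nFor a long time.\n"

loose = re.compile(r"^(?:[A-Z]+ ?)+", re.MULTILINE)
strict = re.compile(r"^(?:[A-Z][A-Z]+ ?)+", re.MULTILINE)

loose_matches = loose.findall(toy)
strict_matches = strict.findall(toy)

print(loose_matches)   # also catches the capital letter that starts a sentence
print(strict_matches)  # only the all-caps headings
```

The looser pattern picks up the lone capital at the start of every ordinary sentence, which is why each word is required to have at least two capitals.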

In [7]:
from nltk import RegexpTokenizer

# grab all lines that are all caps
re_tokenizer = RegexpTokenizer("^(?:[A-Z][A-Z]+ ?)+")
sections = re_tokenizer.tokenize(text)

In [8]:
sections

['FIRST PART',
 'THE SON OF THE BRAHMAN',
 'WITH THE SAMANAS',
 'GOTAMA',
 'AWAKENING',
 'SECOND PART',
 'KAMALA',
 'WITH THE CHILDLIKE PEOPLE',
 'SANSARA',
 'BY THE RIVER',
 'THE FERRYMAN',
 'THE SON',
 'OM',
 'GOVINDA']

That `"OM"` looks suspect to me, so I want to check it out.

In [9]:
i = text.index("OM")

print(text[i-30:i+30])

dy found him asleep.




OM

For a long time, the wou


No, it actually seems fine.  What a funny name for a chapter header!  At any rate, we can break the text into chapters using this tokenizer.  We can use `span_tokenize` to get the starting and ending indices of the titles, and then `zip` to "zip together" the titles and their indices.  Finally, `zip` returns a lazy iterator, so we need to "exhaust" it, either by unpacking it with `*` (as below) or by turning it into a list.

In [10]:
print(*zip(re_tokenizer.tokenize(text), re_tokenizer.span_tokenize(text)), sep="\n")

('FIRST PART', (0, 10))
('THE SON OF THE BRAHMAN', (57, 79))
('WITH THE SAMANAS', (15469, 15485))
('GOTAMA', (34498, 34504))
('AWAKENING', (52193, 52202))
('SECOND PART', (60454, 60465))
('KAMALA', (60527, 60533))
('WITH THE CHILDLIKE PEOPLE', (86856, 86881))
('SANSARA', (103516, 103523))
('BY THE RIVER', (121625, 121637))
('THE FERRYMAN', (144091, 144103))
('THE SON', (168032, 168039))
('OM', (186036, 186038))
('GOVINDA', (199596, 199603))


That's _almost_ what I want.  What I have here is the beginning and ending indices of the _chapter titles_, and I want the beginning and ending indices of the _chapters_.  That's an easy fix:

In [11]:
chapters_by_title_indices = list(zip(re_tokenizer.tokenize(text), re_tokenizer.span_tokenize(text)))

chapters = []

for i, chapter in enumerate(chapters_by_title_indices):
    # Carry over the chapter title; the chapter starts where its title ends
    c = [chapter[0], [chapter[1][1], 0]]
    try: 
        # grab the starting index of the next chapter, if it exists
        c[1][1] = chapters_by_title_indices[i+1][1][0]
    except IndexError:
        # there is no next title, meaning we're at the end; the end index stays 0
        pass
    chapters.append(c)
    
print(*chapters, sep="\n")

['FIRST PART', [10, 57]]
['THE SON OF THE BRAHMAN', [79, 15469]]
['WITH THE SAMANAS', [15485, 34498]]
['GOTAMA', [34504, 52193]]
['AWAKENING', [52202, 60454]]
['SECOND PART', [60465, 60527]]
['KAMALA', [60533, 86856]]
['WITH THE CHILDLIKE PEOPLE', [86881, 103516]]
['SANSARA', [103523, 121625]]
['BY THE RIVER', [121637, 144091]]
['THE FERRYMAN', [144103, 168032]]
['THE SON', [168039, 186036]]
['OM', [186038, 199596]]
['GOVINDA', [199603, 0]]
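
The same bookkeeping can be written without `try`/`except` by pairing each title with the start of the next one.  This is a sketch on a made-up string, with `re.finditer` standing in for `span_tokenize` so that it runs without `nltk`.  One deliberate difference: the last chapter ends at `len(toy)` rather than at a placeholder `0`.

```python
import re

# Toy stand-in for the novel: two all-caps titles, each followed by body text.
toy = "ONE\n\nalpha beta.\n\nTWO\n\ngamma delta.\n"

# (title, (start, end)) for each all-caps line.
title_spans = [
    (m.group(), m.span())
    for m in re.finditer(r"^(?:[A-Z][A-Z]+ ?)+", toy, re.MULTILINE)
]

# Each chapter ends where the next title starts; the last one ends at len(toy).
next_starts = [span[0] for _, span in title_spans[1:]] + [len(toy)]
chapter_spans = [
    (title, (span[1], next_start))
    for (title, span), next_start in zip(title_spans, next_starts)
]

for title, (start, end) in chapter_spans:
    print(title, repr(toy[start:end].strip()))
```

Using a real end index for the final chapter means later slicing code doesn't have to treat it as a special case.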


Okay, now we're ready to roll!  Once the text is split into chapters, the newlines are no longer useful, so we'll get rid of them. 

In [12]:
# split into the two parts of the text
sectionized_text = [[], []]

part = -1
for chapter in chapters:
    if "PART" in chapter[0]:
        # this is not a chapter
        part += 1
        continue
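    else:
        # The `-1` drops the character just before the next title.  For the
        # last chapter the end index is 0, so `0 - 1 = -1` slices to the end
        # of the text (minus its final character).
        chapter_text = text[chapter[1][0]: chapter[1][1]-1]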
        sectionized_text[part].append(
            (chapter[0],
             chapter_text.replace("\n", " ").replace("\r", "")
            ))
        
sectionized_text[0][0]

('THE SON OF THE BRAHMAN',
 '  In the shade of the house, in the sunshine of the riverbank near the boats, in the shade of the Sal-wood forest, in the shade of the fig tree is where Siddhartha grew up, the handsome son of the Brahman, the young falcon, together with his friend Govinda, son of a Brahman.  The sun tanned his light shoulders by the banks of the river when bathing, performing the sacred ablutions, the sacred offerings.  In the mango grove, shade poured into his black eyes, when playing as a boy, when his mother sang, when the sacred offerings were made, when his father, the scholar, taught him, when the wise men talked.  For a long time, Siddhartha had been partaking in the discussions of the wise men, practising debate with Govinda, practising with Govinda the art of reflection, the service of meditation.  He already knew how to speak the Om silently, the word of words, to speak it silently into himself while inhaling, to speak it silently out of himself while exhaling, w

We now have our text broken into chapters, and now let's break each chapter into sentences.

## Sentences as lists of words

For the following sections to work, you may need to install three things from nltk: 
1. the [`punkt`](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf) sentence-finder, 
2. the nltk `stopwords` corpus, and 
3. the `wordnet` corpus.

Why do I need a sentence-finder tool (the other parts will be clear later)?  Well, to quote [Wikipedia](https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation):

> Sentence boundary disambiguation (SBD), also known as sentence breaking or sentence boundary detection, is the problem in natural language processing of deciding where sentences begin and end. Often, natural language processing tools require their input to be divided into sentences for a number of reasons; however, sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.
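
To see the problem concretely, here is a sketch of the "obvious" approach: split wherever a period, question mark, or exclamation mark is followed by whitespace.  The example sentence is made up for illustration.

```python
import re

text = "Dr. Smith paid $3.50 for tea. He left early."

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
```

There are two sentences, but the naive rule produces three pieces because it treats the abbreviation `Dr.` as the end of a sentence.  (The decimal point in `$3.50` survives only because no space follows it.)  Handling cases like this is what a trained sentence tokenizer is for.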

In [37]:
# After running this cell, a dialog window should open:
# Under "Models", download "punkt",
# Under "Corpora", download "stopwords" and "wordnet"

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [14]:
# just to make sure it's working:
nltk.sent_tokenize(sectionized_text[0][0][1])[0:5]

['  In the shade of the house, in the sunshine of the riverbank near the boats, in the shade of the Sal-wood forest, in the shade of the fig tree is where Siddhartha grew up, the handsome son of the Brahman, the young falcon, together with his friend Govinda, son of a Brahman.',
 'The sun tanned his light shoulders by the banks of the river when bathing, performing the sacred ablutions, the sacred offerings.',
 'In the mango grove, shade poured into his black eyes, when playing as a boy, when his mother sang, when the sacred offerings were made, when his father, the scholar, taught him, when the wise men talked.',
 'For a long time, Siddhartha had been partaking in the discussions of the wise men, practising debate with Govinda, practising with Govinda the art of reflection, the service of meditation.',
 'He already knew how to speak the Om silently, the word of words, to speak it silently into himself while inhaling, to speak it silently out of himself while exhaling, with all the con

Now that our text has been split into sentences, we want to split each sentence into words.  You might want to say "okay, just do `my_sentence.split(' ')`", but then we'll get strange punctuation in some of our words.  Okay, then how about removing all the punctuation?  Well:

In [65]:
string.punctuation      # this seems useful!  Let's use this

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [66]:
sentence = "Immediately, he couldn't believe his eyes; what a feast!"
for punc in string.punctuation:
    sentence = sentence.replace(punc, "")
    
sentence

'Immediately he couldnt believe his eyes what a feast'

That's not bad, but `could` and `couldn't` are very different words, and I fear that we need to separate those if possible.  Fortunately, `nltk` has a tool for us, again!

In [67]:
sentence = "Immediately, he couldn't believe his eyes; what a feast!"
nltk.word_tokenize(sentence)

['Immediately',
 ',',
 'he',
 'could',
 "n't",
 'believe',
 'his',
 'eyes',
 ';',
 'what',
 'a',
 'feast',
 '!']

And now I can remove those "words" that are just punctuation:

In [71]:
# Build a new list rather than popping while iterating, which can skip tokens.
word_tokens = [
    word for word in nltk.word_tokenize(sentence)
    if not (len(word) == 1 and word in string.punctuation)
]
        
word_tokens

['Immediately',
 'he',
 'could',
 "n't",
 'believe',
 'his',
 'eyes',
 'what',
 'a',
 'feast']
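
A word of caution about removing items from a list inside a `for` loop over that same list: each `pop` shifts the remaining items left, so the token right after a removed one is never inspected.  The example sentence has no two punctuation tokens in a row, so it is unaffected, but a toy token list (my own, not from the book) shows the difference against building a filtered copy.

```python
import string

tokens = ["Wait", "!", "?", "What", "."]

# Approach 1: pop while iterating.  The "?" slides into the slot of the
# removed "!" and is skipped by the loop.
popped = list(tokens)
for i, tok in enumerate(popped):
    if len(tok) == 1 and tok in string.punctuation:
        popped.pop(i)

# Approach 2: build a new list, inspecting every token exactly once.
filtered = [
    tok for tok in tokens
    if not (len(tok) == 1 and tok in string.punctuation)
]

print(popped)
print(filtered)
```

The filtered copy is the safer habit whenever punctuation marks can sit next to each other.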

## Removing stop words and vectorizing

The last step before creating a vector for each sentence is to remove all the useless words.  These words include things like `"the"` and `"a"`, but there are a lot more than that.  They're called stopwords.

In [103]:
from nltk.corpus import stopwords
print(stopwords.words('english')[:15], len(stopwords.words('english')[:15]))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him'] 15


This seems perfect, but this is often a stage where you need to look through and see if you want to keep some of them, or add some of your own.  It might not be what you want, but it's a useful thing to start from!  To remove them, you do the most obvious thing:

In [85]:
useful_word_tokens = [word for word in word_tokens if word not in stopwords.words('english')]
useful_word_tokens

['Immediately', 'could', "n't", 'believe', 'eyes', 'feast']

Okay, now to create a vector.  The idea here is to create a _dictionary_ of the most common useful (non-stop) words, and then for each sentence, record whether each of those words appears in that sentence.  The useful tool here is `collections.Counter`, which is a (Python) dictionary that counts how many times each item has been added to it.

In [92]:
from collections import Counter

dictionary = Counter()
dictionary.update(useful_word_tokens + ['eyes', 'eyes', 'feast'])
    
print(dictionary)
print(dictionary.most_common(2))

Counter({'eyes': 3, 'feast': 2, 'Immediately': 1, 'could': 1, "n't": 1, 'believe': 1})
[('eyes', 3), ('feast', 2)]


This seems perfect.  Let's make a dictionary!

In [104]:
dictionary = Counter()

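def tokenize_sentence(sentence):
    """
    This is just the code from above that tokenizes the sentence into 
    words, then deletes things like `.` and removes stop words, wrapped 
    in a function.
    """
    # Look the stop words up once per sentence, as a set, instead of once per word.
    english_stopwords = set(stopwords.words('english'))
    # Filter into a new list; popping while iterating can skip tokens.
    word_tokens = [
        word for word in nltk.word_tokenize(sentence)
        if not (len(word) == 1 and word in string.punctuation)
    ]

    # Note: the stop-word check runs before lowercasing, so capitalized
    # stop words such as "I" or "The" are kept.
    return [word.lower() for word in word_tokens if word not in english_stopwords]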
    

for part in sectionized_text:
    for chapter in part:
        # recall that `chapter` is a tuple, `(chapter_name, chapter_text)`
        sentences = nltk.sent_tokenize(chapter[1])
        sentences_in_word_tokens = [tokenize_sentence(sentence) for sentence in sentences]
        for tokenized_sentence in sentences_in_word_tokens:
            dictionary.update(tokenized_sentence)

In [105]:
dictionary.most_common(100)

[('i', 476),
 ('siddhartha', 403),
 ('``', 373),
 ("''", 370),
 ('one', 229),
 ("'s", 193),
 ('but', 148),
 ('govinda', 145),
 ('time', 139),
 ('like', 138),
 ('would', 132),
 ('he', 128),
 ('also', 113),
 ('river', 109),
 ("n't", 104),
 ('saw', 99),
 ('long', 96),
 ('many', 96),
 ('and', 96),
 ('said', 95),
 ('man', 90),
 ('life', 87),
 ('love', 82),
 ('thought', 82),
 ('kamala', 82),
 ('you', 81),
 ('learned', 80),
 ('teachings', 78),
 ('felt', 74),
 ('people', 73),
 ('nothing', 72),
 ('heart', 71),
 ('without', 71),
 ('vasudeva', 70),
 ('become', 68),
 ('world', 68),
 ('face', 68),
 ('the', 67),
 ('eyes', 67),
 ('friend', 66),
 ('it', 66),
 ('every', 66),
 ('oh', 66),
 ('still', 63),
 ('samana', 63),
 ('know', 62),
 ('spoke', 61),
 ('much', 60),
 ('path', 60),
 ('everything', 59),
 ('old', 57),
 ('good', 57),
 ('looked', 57),
 ('could', 57),
 ("'ve", 57),
 ('words', 56),
 ('buddha', 55),
 ('thus', 54),
 ('father', 53),
 ('well', 53),
 ('--', 52),
 ('already', 51),
 ('never', 51),
 (

Here, you might notice that it isn't perfect, but there's some good stuff there!  
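
Part of why `i`, `he`, `but`, and `the` show up is the order of operations in `tokenize_sentence`: words are compared against the (all-lowercase) stop list first and lowercased afterwards, so capitalized stop words pass the check.  A small sketch, with a hand-written stop set standing in for the nltk list, shows the difference:

```python
# Hand-written stand-in for the nltk stop list (an assumption for this sketch).
stop_words = {"he", "his", "the", "but", "i"}
tokens = ["He", "rubbed", "his", "eyes", "but", "I", "saw", "the", "river"]

# Check first, lowercase second: "He" and "I" are not in the lowercase set.
filter_then_lower = [w.lower() for w in tokens if w not in stop_words]

# Lowercase first, check second: every stop word is caught.
lower_then_filter = [w.lower() for w in tokens if w.lower() not in stop_words]

print(filter_then_lower)
print(lower_then_filter)
```

Whether to lowercase first is a judgment call, since it also merges proper nouns with ordinary words, but it is the simplest way to make the stop list apply at the start of sentences.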

Some things to add to the picture, which I'll leave as an: 
##### Exercise
1. Make a small "lookup table" to turn things like `"n't"` into `"not"`,
2. Remove stray tokens like `"''"`,
3. Tweak your list of stop words, adding or removing words from the default,
4. [Stem your words](#Stemming).

For now, I'll just use the top 200 words as my dictionary (with the above errors not fixed, as this is already taking a while), and vectorize the sentence like that.  To "vectorize", just replace the sentence with a length-200 array of `1`'s and `0`'s corresponding to whether the top `i`th word appears in the sentence.

For example, if my dictionary were only 6 words (instead of 200): 
```
["food", "glorious", "upon", "wonderment", "lark", "cunning"]
```
and my sentence was `"Hark, a lark dost descend upon me."`, then the sentence's vector becomes `[0, 0, 1, 0, 1, 0]` (because `upon` and `lark` are in both the dictionary and the sentence, but the other words of the dictionary are not in the sentence).
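
That small example can be checked in a few lines.  This sketch uses a simple regular expression instead of `nltk.word_tokenize` to pull out the words, which is an assumption that is good enough for this one sentence.

```python
import re

toy_dictionary = ["food", "glorious", "upon", "wonderment", "lark", "cunning"]
sentence = "Hark, a lark dost descend upon me."

# Lowercase the sentence and keep runs of letters as words.
words = re.findall(r"[a-z]+", sentence.lower())

# One slot per dictionary word: 1 if it occurs in the sentence, else 0.
toy_vector = [1 if word in words else 0 for word in toy_dictionary]

print(toy_vector)
```

Note that the vector records only presence, not position or count: a sentence with `lark` three times gets the same `1` in that slot.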

In [112]:
dictionary_200 = [word[0] for word in dictionary.most_common(200)]

def vectorize(word_tokens):
    vector = []
    for word in dictionary_200:
        if word in word_tokens:
            vector.append(1)
        else:
            vector.append(0)
    
    return vector

print(vectorize(["i", "siddhartha", "am", "brave"]))

[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Perfect!  Let's finish off this example.

In [114]:
len(sectionized_text[1])

8

In [117]:
final_text_format = [[[], [], [], []],   # Part 1
                     [[], [], [], [], [], [], [], []]      # Part 2
                    ]

for part_ind, part in enumerate(sectionized_text):
    for chap_ind, chapter in enumerate(part):
        # recall that `chapter` is a tuple, `(chapter_name, chapter_text)`
        sentences = nltk.sent_tokenize(chapter[1])
        sentences_in_word_tokens = [tokenize_sentence(sentence) for sentence in sentences]
        for tokenized_sentence in sentences_in_word_tokens:
            final_text_format[part_ind][chap_ind].append(vectorize(tokenized_sentence))

In [124]:
# The first three sentences, vectorized
print(*final_text_format[0][0][:3], sep="\n\n")

with open('siddhartha.json', 'w') as f:
    json.dump(final_text_format, f)

[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

This is now ready for whatever analysis you want to perform ... with the [above caveats](#Exercise), along with choosing an appropriate length for your dictionary, taken into consideration.
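
As one small example of a first analysis, here is a sketch that round-trips a nested structure through JSON and counts how many sentences contain each dictionary word.  The toy data (3-word vectors, two parts) is made up; it only mimics the part → chapter → sentence → vector nesting of `final_text_format`.

```python
import json

# Toy stand-in: 2 parts; part 1 has one chapter, part 2 has two chapters.
toy_format = [
    [[[1, 0, 1], [0, 0, 1]]],
    [[[1, 1, 0]], [[0, 0, 0], [1, 0, 0]]],
]

# JSON stores nested lists of integers without loss.
restored = json.loads(json.dumps(toy_format))

# Flatten parts and chapters into one list of sentence vectors.
all_vectors = [vec for part in restored for chapter in part for vec in chapter]

# Column sums: the number of sentences containing each dictionary word.
sentences_per_word = [sum(column) for column in zip(*all_vectors)]

print(len(all_vectors), sentences_per_word)
```

The same column sums computed per chapter, instead of over the whole book, would give a chapter-by-word table that is a natural input for a visualization.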

## Other Options
### Stemming
You'll probably want to **stem** the words: strip off endings so that different inflections of a word, like "running", "runs", and "run", all (hopefully!) become the same "word stem".  Irregular forms such as "ran" generally won't be caught.

In [107]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

for word in tokenize_sentence("Quickly running cats start jumping on birds"):
    print(word, ":", stemmer.stem(word))

quickly : quick
running : run
cats : cat
start : start
jumping : jump
birds : bird


That worked really well!  It does indeed have issues, though:

In [109]:
stemmer.stem("flying")

'fli'

Language is hard.

### Part of speech tagging

Check out [this link](http://textminingonline.com/dive-into-nltk-part-iii-part-of-speech-tagging-and-pos-tagger) for an explanation of how to do part of speech tagging, if that's something you'd like to try!

### Lemmatizing

Instead of stemming, you might try your hand at **lemmatizing**: turning all the forms of a word into a single, usable form (its _lemma_, the form you'd look up in a dictionary).

In [49]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

In [39]:
words = ["fairly", "running", "quicker", "dogs", "fishes"]

for word in words:
    print(word, ":", lemmatizer.lemmatize(word))

fairly : fairly
running : running
quicker : quicker
dogs : dog
fishes : fish


As you can see, it's not always successful.  It sometimes helps if you add the part of speech tags:

In [64]:
# pos options are 'n' for noun, 'v' for verb, 'r' for adverb, 'a' for adjective, 's' for satellite adjective
print("running", ":", lemmatizer.lemmatize("running", pos='v'))

# but it doesn't always work:
for part in ['n', 'v', 'r', 'a', 's']:
    print("fairly", ":", lemmatizer.lemmatize("fairly", pos=part))

running : run
fairly : fairly
fairly : fairly
fairly : fairly
fairly : fairly
fairly : fairly
