# Reducing words to their base form: Stemming and Lemmatisation

So far we have tidied up our text, split it into smaller chunks and removed frequent joining words. Now we need to dig a little deeper into the nature of language and words. One task is to reduce words to a more basic form: for example, waiting, waited and waits can all be reduced to wait.

This is where things get interesting - and tricky!


### What is the difference? 
Let's start with **stemming**: the process of reducing inflected (or sometimes derived) words to their word *stem*, base or root form. For example, argue, argued, argues and arguing all reduce to the stem argu. Stemming is usually a crude heuristic process that chops off the ends of words in the hope of getting the root right most of the time.

**Lemmatisation** aims to do this using vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the *lemma*. 

If confronted with the token 'saw', stemming might return just 's', whereas lemmatisation would attempt to return either 'see' or 'saw', depending on whether the token was used as a verb or a noun.
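
To make the 'saw' example concrete, here is a minimal sketch of how a lemma can depend on the part of speech. The lookup table `TOY_LEMMAS` and the function `toy_lemmatise` are hypothetical names invented for this illustration, and the entries are hand-written; a real lemmatiser relies on a full vocabulary and morphological analysis.

```python
# Toy lookup table: (token, part of speech) -> lemma.
# Hand-written entries for illustration only, not taken from a real lexicon.
TOY_LEMMAS = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("waited", "verb"): "wait",
    ("waits", "verb"): "wait",
}


def toy_lemmatise(token, pos):
    """Return the lemma for (token, pos), or the token unchanged if unknown."""
    return TOY_LEMMAS.get((token.lower(), pos), token)


print(toy_lemmatise("saw", "verb"))  # see
print(toy_lemmatise("saw", "noun"))  # saw
```

The same token gives two different lemmas, which is exactly the information a suffix-chopping stemmer has no access to.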


### Implementing 
Now we could start to build our own functions here, using rules such as:

* if the word ends in 'ed', remove the 'ed'

* if the word ends in 'ing', remove the 'ing'

* if the word ends in 'ly', remove the 'ly'

This might work for stemming, but lemmatising is a far more complex challenge.
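
As a rough sketch, the three rules above could be written as a single function. The name `naive_stem` and the sample words are assumptions made for this illustration; the function applies the rules literally, with no safeguards, so that their weaknesses are visible.

```python
# Suffixes from the rules above, checked in the order they are listed.
SUFFIXES = ("ed", "ing", "ly")


def naive_stem(word):
    """Strip the first matching suffix; return the word unchanged otherwise."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for word in ["waited", "waiting", "quickly", "running", "sing", "bed"]:
    print(word, "->", naive_stem(word))

# waited -> wait
# waiting -> wait
# quickly -> quick
# running -> runn
# sing -> s
# bed -> b
```

The first three words behave well, but 'running', 'sing' and 'bed' show how quickly blind suffix stripping goes wrong.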

But there is good news - someone has already done all the hard work for us!

Using existing libraries like [**nltk**](https://www.nltk.org) we can perform stemming and lemmatising quickly and easily.

Guess what - they can also be used to tokenise your text! Let's see how we might use NLTK for this:

## Stemming with nltk

NLTK has a number of stemming algorithms, but Porter (`PorterStemmer`) is among the most popular.

In [2]:
from nltk.stem import PorterStemmer

words = ["game","gaming","gamed","games"]
stemmer = PorterStemmer()
 
for word in words:
    print(stemmer.stem(word))

game
game
game
game
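
Since NLTK ships more than one stemming algorithm, it can be instructive to run a few of them side by side. The sketch below assumes NLTK is installed and uses `SnowballStemmer` and `LancasterStemmer` alongside Porter; the dictionary name `stemmers_by_name` is only an illustrative choice, and the stems may differ between algorithms, so no output is shown here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three of the stemmers available in nltk.stem, keyed by a short label.
stemmers_by_name = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

sample_words = ["game", "gaming", "gamed", "games"]

# Print the stem each algorithm produces for every word.
for word in sample_words:
    stems = {name: s.stem(word) for name, s in stemmers_by_name.items()}
    print(word, stems)
```

None of these stemmers needs an extra `nltk.download` call, so the cell should run as soon as the library itself is available.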


## Lemmatising with nltk

In [7]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()

for word in words:
    print(lemmatiser.lemmatize(word))

[nltk_data] Downloading package wordnet to /Users/pughd/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
game
gaming
gamed
game

Notice that gaming and gamed come back unchanged: `lemmatize` treats each word as a noun unless a different part of speech is passed via its `pos` argument.


## Using a larger body of text - using nltk to tokenise

Let's use nltk to tokenise the text before we apply lemmatisation and stemming.


In [21]:
text_o = """
Locked in a sorry dream
You know we're drowning in designer ice creams
I scream, this is the present but it's no surprise
Then I realize
What I see I spies
(Nice to see you, to see you, to see you)
The past was eagle-eyed
(To see you, nice to see you, to see you, to see you)
The future's pixelised
(To see you, nice to see you, to see you, to see you, to see you)
I had my Frisbee sharpened and honed
I had it galvanized and chromed
Decapitate and bury your toys
My Frisbee brings the noise
What I see I spies
(Nice to see you, to see you, to see you)
The past was eagle-eyed
(To see you, nice to see you, to see you, to see you)
The future's pixelised
(To see you, nice to see you, to see you, to see you, to see you)
To see you, to see you, to see you, nice to see you
(Wipe your windscreen with a chocolate cake)
To see you, to see you, to see you, nice to see you
(Count your pizzas before they bake)
To see you, to see you, to see you, nice to see you
(Wipe your windscreen with a chocolate cake)

"""

In [22]:
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# tokenise the text
tokens = word_tokenize(text_o)

# show each token alongside its lemma and its stem
for token in tokens:
    print("{} \t -> Lemma: {} \t -> stem: {}".format(token, lemmatiser.lemmatize(token), stemmer.stem(token)))


[nltk_data] Downloading package punkt to /Users/pughd/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Locked 	 -> Lemma: Locked 	 -> stem: lock
in 	 -> Lemma: in 	 -> stem: in
a 	 -> Lemma: a 	 -> stem: a
sorry 	 -> Lemma: sorry 	 -> stem: sorri
dream 	 -> Lemma: dream 	 -> stem: dream
You 	 -> Lemma: You 	 -> stem: you
know 	 -> Lemma: know 	 -> stem: know
we 	 -> Lemma: we 	 -> stem: we
're 	 -> Lemma: 're 	 -> stem: 're
drowning 	 -> Lemma: drowning 	 -> stem: drown
in 	 -> Lemma: in 	 -> stem: in
designer 	 -> Lemma: designer 	 -> stem: design
ice 	 -> Lemma: ice 	 -> stem: ice
creams 	 -> Lemma: cream 	 -> stem: cream
I 	 -> Lemma: I 	 -> stem: I
scream 	 -> Lemma: scream 	 -> stem: scream
, 	 -> Lemma: , 	 -> stem: ,
this 	 -> Lemma: this 	 -> stem: thi
is 	 -> Lemma: is 	 -> stem: is
the 	 -> Lemma: the 	 -> stem: the
present 	 -> Lemma: present 	 -> stem: present
but 	 -> Lemma: but 	 -> stem: but
it 	 -> Lemma: it 	 -> stem: it
's 	 -> Lemma: 's 	 -> stem: 's
...

### What other preprocessing would be useful here? 
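
One possible answer, sketched under assumptions: in the output above, 'Locked' keeps its capital letter through lemmatisation and punctuation marks are passed along as tokens, so lower-casing and dropping punctuation-only tokens could help. The function name `normalise_tokens` and the sample list are hypothetical, chosen for this illustration.

```python
import string


def normalise_tokens(tokens):
    """Lower-case tokens and drop those made up solely of punctuation."""
    cleaned = []
    for token in tokens:
        # Skip tokens such as ',', '(' or ')' that carry no word content.
        if all(ch in string.punctuation for ch in token):
            continue
        cleaned.append(token.lower())
    return cleaned


sample = ["Locked", "in", "a", "sorry", "dream", ",", "(", "Nice", "to", "see", "you", ")"]
print(normalise_tokens(sample))
# ['locked', 'in', 'a', 'sorry', 'dream', 'nice', 'to', 'see', 'you']
```

Tokens that mix letters and punctuation, such as 'eagle-eyed' or "'s", are kept, since only tokens consisting entirely of punctuation are removed.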

## Bonus - POS with nltk
The **part of speech (POS)** describes how a word is used in a sentence. There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. To learn more, [read here](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb).
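
Part of speech also matters for lemmatisation. NLTK's default tagger labels tokens with Penn Treebank tags (for example NN, VBP, JJ), while `WordNetLemmatizer.lemmatize` accepts a `pos` argument of 'n', 'v', 'a' or 'r'. The helper below is a minimal sketch of one way to bridge the two; the name `penn_to_wordnet` and the choice to fall back to noun are assumptions made for this illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    # Nouns and everything else fall back to noun,
    # which is also the lemmatiser's default.
    return "n"


for tag in ["NN", "NNS", "VBG", "VBP", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))

# NN -> n
# NNS -> n
# VBG -> v
# VBP -> v
# JJ -> a
# RB -> r
# IN -> n
```

The returned letter could then be supplied when lemmatising a tagged token, for instance `lemmatiser.lemmatize(token, pos=penn_to_wordnet(tag))`.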

In [24]:
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

tokens_pos = pos_tag(tokens)

print(tokens_pos)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pughd/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('Locked', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('sorry', 'JJ'), ('dream', 'NN'), ('You', 'PRP'), ('know', 'VBP'), ('we', 'PRP'), ("'re", 'VBP'), ('drowning', 'VBG'), ('in', 'IN'), ('designer', 'NN'), ('ice', 'NN'), ('creams', 'NNS'), ('I', 'PRP'), ('scream', 'VBP'), (',', ','), ('this', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('present', 'JJ'), ('but', 'CC'), ('it', 'PRP'), ("'s", 'VBZ'), ('no', 'DT'), ('surprise', 'NN'), ('Then', 'RB'), ('I', 'PRP'), ('realize', 'VBP'), ('What', 'WP'), ('I', 'PRP'), ('see', 'VBP'), ('I', 'PRP'), ('spies', 'NNS'), ('(', '('), ('Nice', 'NNP'), ('to', 'TO'), ('see', 'VB'), ('you', 'PRP'), (',', ','), ('to', 'TO'), ('see', 'VB'), ('you', 'PRP'), (',', ','), ('to', 'TO'), ('see', 'VB'), ('you', 'PRP'), (')', ')'), ('The', 'DT'), ('past', 'NN'), ('was', 'VBD'), ('eagle-eyed', 'JJ'), ('(', '('), 