# Reducing words to their meaning
## Stemming and Lemmatisation

So far we have tidied up our text, split it into smaller chunks and removed the most frequent joining words. Now we have to start digging into the nature of language and words a little more. One task is to reduce words to a more basic form, e.g. waiting, waited and waits can all be reduced to wait.


Question: Should they be? What benefits may we gain from not doing this?


This is where things get interesting - and tricky!


### What is the difference? 
Let's take the example of **stemming**: the process of reducing inflected (or sometimes derived) words to their word *stem*, that is, their base or root form. For example, the words argue, argued, argues and arguing reduce to the stem argu. Stemming is usually a crude heuristic process that chops off the ends of words in the hope of reaching the root correctly most of the time.

**Lemmatisation** aims to do this using vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the *lemma*. Most lemmatisers achieve this using a lookup table, so with large volumes of text this process may be slower than stemming. However, if it suits your data, lemmatising is generally the recommended approach. This is why the spaCy Python module only includes a lemmatiser and no stemming functions.

If confronted with the token 'saw', a very crude stemmer might return just 's', whereas lemmatisation would attempt to return either 'see' or 'saw', depending on whether the token was used as a verb or a noun. However, most stemming algorithms impose a minimum length on the tokens they try to stem, so we should get 'saw' back in this instance unless we write our own stemmer and it is very crude.
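
The lookup-table idea behind lemmatisation can be sketched in a few lines. This is only a toy illustration: the table entries, the `LEMMA_TABLE` name and the `toy_lemmatise` function are hypothetical and hand-written here, whereas a real lemmatiser relies on a full lexicon and morphological rules.

```python
# Toy lookup table keyed on (token, part of speech).
# The entries are hand-written for illustration, not taken from a real lexicon.
LEMMA_TABLE = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("waited", "verb"): "wait",
    ("waits", "verb"): "wait",
}


def toy_lemmatise(token, pos):
    """Return the lemma for (token, pos), or the token unchanged if unknown."""
    return LEMMA_TABLE.get((token.lower(), pos), token)


print(toy_lemmatise("saw", "verb"))      # see
print(toy_lemmatise("saw", "noun"))      # saw
print(toy_lemmatise("frisbee", "noun"))  # frisbee (not in the table)
```

The same token maps to different lemmas depending on its part of speech, which is exactly the information a suffix-chopping stemmer does not have.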


### Implementing 
Now we could start to build our own functions here, using rules such as

* if the word ends in 'ed', remove the 'ed'

* if the word ends in 'ing', remove the 'ing'

* if the word ends in 'ly', remove the 'ly'

This might work for stemming, but lemmatising is a far more complex challenge, as you have to build a whole database of the English language that understands word morphology.
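
The three suffix rules above could be sketched roughly as follows. The function name `crude_stem` and the minimum stem length of 3 are assumptions made for this illustration; this is not how the Porter algorithm works.

```python
SUFFIXES = ("ing", "ed", "ly")
MIN_STEM_LENGTH = 3  # assumed threshold so very short words are left alone


def crude_stem(word):
    """Strip the first matching suffix, if enough of the word would remain."""
    for suffix in SUFFIXES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


for word in ["waiting", "waited", "quickly", "red", "gaming"]:
    print(word, "->", crude_stem(word))
```

With these rules 'waiting' and 'waited' both become 'wait', and 'red' is left untouched because of the length check. 'gaming' becomes 'gam' rather than 'game', which shows how quickly simple rules break down.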


But there is good news - someone has already done all the hard work for us! Using existing libraries like [**nltk**](https://www.nltk.org) we can perform stemming and lemmatising quickly and easily.


Guess what - they can also tokenise your text! Let's see how we might use NLTK for this:


## Stemming with NLTK

NLTK has a number of stemming algorithms, but Porter (`PorterStemmer`) is one of the most widely used.

In [None]:
from nltk.stem import PorterStemmer

words = ["game","gaming","gamed","games"]
stemmer = PorterStemmer()
 
for word in words:
    print(stemmer.stem(word))

## Lemmatising with NLTK

In [None]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

for word in words:
    print(lemmatizer.lemmatize(word))

## Using a larger body of text 
### Using NLTK to tokenise

Let's use NLTK to tokenise the text before we apply lemmatisation and stemming.


In [None]:
text_o = """
Locked in a sorry dream
You know we're drowning in designer ice creams
I scream, this is the present but it's no surprise
Then I realize
What I see I spies
(Nice to see you, to see you, to see you)
The past was eagle-eyed
(To see you, nice to see you, to see you, to see you)
The future's pixelised
(To see you, nice to see you, to see you, to see you, to see you)
I had my Frisbee sharpened and honed
I had it galvanized and chromed
Decapitate and bury your toys
My Frisbee brings the noise
What I see I spies
(Nice to see you, to see you, to see you)
The past was eagle-eyed
(To see you, nice to see you, to see you, to see you)
The future's pixelised
(To see you, nice to see you, to see you, to see you, to see you)
To see you, to see you, to see you, nice to see you
(Wipe your windscreen with a chocolate cake)
To see you, to see you, to see you, nice to see you
(Count your pizzas before they bake)
To see you, to see you, to see you, nice to see you
(Wipe your windscreen with a chocolate cake)

"""

In [None]:
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize

# tokenise the text
tokens = word_tokenize(text_o)

# print a header row to label the output columns
print("\n\nword    \t    lemma     \t    stem")
print("-"*50)

# reuse the lemmatizer and stemmer created in the earlier cells
for token in tokens:
    print(f"{token:10} \t -> {lemmatizer.lemmatize(token):10} \t -> {stemmer.stem(token):10}")


### What other preprocessing would be useful here? 

## Bonus - POS with NLTK

The **part of speech (POS)** tag describes how a word is used in a sentence. There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. Want to know more? [Read here](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb).

In [None]:
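nltk.download('averaged_perceptron_tagger')  # newer NLTK releases may ask for 'averaged_perceptron_tagger_eng'
from nltk import pos_tag

# tag each token; the result is a list of (token, tag) tuples
tokens_pos = pos_tag(tokens)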

print(tokens_pos)
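
One way these tags could feed back into lemmatisation is sketched below. `pos_tag` returns Penn Treebank style tags by default, while the WordNet lemmatiser expects a single-letter part of speech. The helper name `penn_to_wordnet_pos` and the sample list are hypothetical, written by hand for illustration rather than produced by running the tagger.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the single-letter POS that WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun for everything else


# Hand-written sample in the same (token, tag) shape that pos_tag returns
sample = [("I", "PRP"), ("had", "VBD"), ("my", "PRP$"),
          ("Frisbee", "NNP"), ("sharpened", "VBN")]

for token, tag in sample:
    print(f"{token:10} {tag:5} -> {penn_to_wordnet_pos(tag)}")
```

The mapped letter could then be passed as the `pos` argument of `lemmatizer.lemmatize(token, pos=...)`, which treats every word as a noun when no `pos` is given.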