# Lemmatisation with nltk

"Lemmatisation" is the process of reducing a word down to its "lemma" or dictionary form. The words "walked", "walking", and "walks" all have the same lemma - "walk", and this is what you would find in a dictionary if you were searching for any of those words. Using lemmatisation, the word "horses" is converted to "horse", while "is", "are", "be", and "was" are all reduced to the root verb "be".

Lemmatisation is useful for text analysis because it means that words with essentially the same meaning can be considered together, rather than being viewed as unrelated by the computer: "dog" and "dogs" do not refer to completely separate concepts, and it is important for accurate text analytics that the link is acknowledged.

This notebook contains commented examples of how to apply `nltk`'s `WordNetLemmatizer` to strings. Most of the examples demonstrate how to lemmatise a single string; the final section shows how to apply the technique to an entire column in a dataframe.

### Imports

In [1]:
import pandas as pd  # Data manipulation
import nltk  # Text analysis methods
from nltk.tokenize import word_tokenize  # Split strings into tokens/words
from nltk.stem import WordNetLemmatizer  # Reduce a single word to its lemma (root form)
from nltk import pos_tag  # Tag a word with its part of speech (e.g. verb)
from nltk.corpus import wordnet  # Group words together
from collections import defaultdict  # Dictionaries with a fallback value

### Install nltk wordlists

You will need to run `nltk.download("all")` once to download the necessary wordlists that `nltk` depends on. If you'd rather download less, it is also possible to download each list separately.
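
If you would rather not download everything, a minimal sketch along the following lines fetches only the resources this notebook appears to rely on. The resource names are assumptions based on common NLTK releases (newer versions may instead ask for `punkt_tab` and `averaged_perceptron_tagger_eng`), and `download_resources` is a hypothetical helper, not part of `nltk`.

```python
# Resource names are assumptions; if one is wrong for your NLTK version, the
# LookupError raised when a resource is missing names the one to download.
RESOURCES = ["punkt", "wordnet", "averaged_perceptron_tagger"]


def download_resources(names=RESOURCES, downloader=None):
    """Download each named resource and report which ones succeeded."""
    if downloader is None:
        import nltk  # imported lazily so the helper has no hard dependency

        downloader = nltk.download
    # nltk.download(name) returns True on success and False otherwise
    return {name: bool(downloader(name)) for name in names}
```

Calling `download_resources()` once in a notebook cell should be enough; the returned dictionary shows which downloads succeeded.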

### Example data

In [2]:
example_a = "She runs faster through the shallows to the sandiest banks"

example_b = "He thanks the women for the happier years"

### Text preparation

The lemmatisation function works on one word at a time - it can't handle strings of multiple words, and just returns them unaltered. In order to lemmatise a string effectively, it must first be split into word tokens.

In [3]:
# Tokenize each string, converting them to lists of words

example_a_tokens = word_tokenize(example_a)
example_b_tokens = word_tokenize(example_b)

# Display the tokenized strings

print(example_a_tokens)
print(example_b_tokens)

['She', 'runs', 'faster', 'through', 'the', 'shallows', 'to', 'the', 'sandiest', 'banks']
['He', 'thanks', 'the', 'women', 'for', 'the', 'happier', 'years']


### Basic lemmatisation

In [4]:
# Create an object that can be used to lemmatise

lemma = WordNetLemmatizer()

In [5]:
# Apply lemmatisation to each word/token in example_a

example_a_lemmas = [lemma.lemmatize(token) for token in example_a_tokens]

print(example_a_lemmas)

['She', 'run', 'faster', 'through', 'the', 'shallow', 'to', 'the', 'sandiest', 'bank']


In [6]:
# Apply lemmatisation to each word/token in example_b

example_b_lemmas = [lemma.lemmatize(token) for token in example_b_tokens]

print(example_b_lemmas)

['He', 'thanks', 'the', 'woman', 'for', 'the', 'happier', 'year']


You can see that the lemmatiser has had some success - "years" has been reduced to "year", and "women" to "woman", for example. However, "thanks" is still plural, "happier" is unchanged, and "faster" has not been converted to "fast".

This is because, by default, the lemmatiser treats every word as a noun (the `pos` argument of `lemmatize` defaults to `"n"`). A word is reduced only if the lemmatiser recognises it as a noun, so many words are left unchanged even though they do have a base form - unfamiliar nouns and all other parts of speech come back exactly as they went in.

### 'Part of speech' tagging


In order to lemmatise more effectively, we first need a way of identifying what part of speech (PoS) a particular word is. While this could be done manually, that would be impractical for large amounts of text and tedious even for small ones. The solution is to use another function in `nltk` - `pos_tag`. This function attempts to classify a given word as a particular PoS. It's not a perfect method, but it gets us closer to a more versatile lemmatisation process.

The function `pos_tag` takes a list of tokens rather than a single word, because one of the ways that the tagger identifies a word is through the words around it: any word after "very" is likely to be an adjective, for example.

In [7]:
# pos_tag applied to a list of tokens.

pos_tag(example_a_tokens)

[('She', 'PRP'),
 ('runs', 'VBZ'),
 ('faster', 'RBR'),
 ('through', 'IN'),
 ('the', 'DT'),
 ('shallows', 'NNS'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('sandiest', 'JJS'),
 ('banks', 'NNS')]

Each word token is now a `tuple` containing the original word and the `pos_tag` tag.

The default tagset - the "Penn Treebank" tagset - has [36 different tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), classifying words into highly-specific categories. In the example above, the word "She" was tagged as "PRP" for "personal pronoun", and the word "sandiest" was tagged as "JJS" for "adjective, superlative".
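
To make the tags in the output above easier to read, here is a small illustrative lookup. It covers only the tags that appear in that output, not the full tagset, and `PENN_TAGS` and `describe` are names invented for this sketch.

```python
# Descriptions follow the Penn Treebank tag list; only the tags seen in the
# output above are included, so this is not the full 36-tag set.
PENN_TAGS = {
    "PRP": "personal pronoun",
    "VBZ": "verb, 3rd person singular present",
    "RBR": "adverb, comparative",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "NNS": "noun, plural",
    "TO": "to",
    "JJS": "adjective, superlative",
}


def describe(tagged_tokens):
    """Pair each word with a readable description of its Penn Treebank tag."""
    return [(word, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged_tokens]


# A few (word, tag) pairs copied from the pos_tag output above
tagged = [("She", "PRP"), ("runs", "VBZ"), ("faster", "RBR"), ("sandiest", "JJS")]

for word, description in describe(tagged):
    print(f"{word}: {description}")
```

`nltk` also ships its own tag documentation via `nltk.help.upenn_tagset("JJS")`, which should print a description and examples, provided the relevant tagset resource has been downloaded.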

The WordNetLemmatizer also understands tags, but unfortunately, not the same ones. It's necessary to convert between the two tagging systems. WordNet groups words into four categories: nouns, adjectives, verbs, and adverbs. 

By converting between the two systems, moving from 36 tags to 4, we will lose some information. Certain words will not be correctly classified, and so will not be converted - the lemmatiser can only convert words it understands. However, this will still result in an improvement upon our previous system of only reducing nouns.

In [8]:
# Create a dictionary to map tags to ones that the lemmatiser will understand.

tag_map = defaultdict(lambda: "n")  # by default, assume nouns
tag_map['J'] = "a"  # adjectives
tag_map['V'] = "v"  # verbs
tag_map['R'] = "r"  # adverbs

To actually convert the tags from one type to another, the following steps need to happen:

1. Tag each token using `pos_tag`
2. For each token, look in `tag_map` for the entry that matches the *first letter* of its tag (all the different types of adverb start with "R" in the PoS tags, so this collapses them into a single WordNet tag; the same holds for adjectives and verbs)
3. If the first letter doesn't match, treat the word as a noun (this is a fallback to past behaviour - no matter what happens, our lemmatiser is as good as the default)
4. Return the new tokens
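
The steps above can be sketched without calling the tagger at all, by hard-coding a few of the `(word, tag)` pairs that `pos_tag` returned earlier. This is only an illustration of the lookup; the variable names `penn_tagged` and `wordnet_tagged` are made up for the example.

```python
from collections import defaultdict

# Same mapping as in the notebook: unknown first letters fall back to noun
tag_map = defaultdict(lambda: "n")
tag_map["J"] = "a"  # adjectives
tag_map["V"] = "v"  # verbs
tag_map["R"] = "r"  # adverbs

# Step 1 (hard-coded here): pairs copied from the pos_tag output shown earlier
penn_tagged = [
    ("She", "PRP"),
    ("runs", "VBZ"),
    ("faster", "RBR"),
    ("the", "DT"),
    ("sandiest", "JJS"),
    ("banks", "NNS"),
]

# Steps 2-4: look up the first letter of each tag, falling back to "n"
wordnet_tagged = [(word, tag_map[tag[0]]) for word, tag in penn_tagged]

print(wordnet_tagged)
# [('She', 'n'), ('runs', 'v'), ('faster', 'r'), ('the', 'n'), ('sandiest', 'a'), ('banks', 'n')]
```

One design point worth knowing: looking up a missing key in a `defaultdict` also stores that key, so `tag_map` grows as new first letters are seen. If that matters, a plain dictionary with `tag_map.get(tag[0], "n")` gives the same fallback without the side effect.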

In [9]:
# Create a function to get the PoS tags for a list of tokens, and return the tokens in a form the
# lemmatiser can interpret
def get_wordnet_tags(tokens):
    """Return (token, WordNet PoS tag) pairs for a list of tokens."""

    # Tag tokens with pos_tag (Penn Treebank tags)
    tagged_tokens = pos_tag(tokens)

    # Map each tag to one WordNet understands, using the tag's first letter
    tagged_tokens = [(word, tag_map[tag[0]]) for word, tag in tagged_tokens]

    return tagged_tokens

In [10]:
get_wordnet_tags(example_a_tokens)

[('She', 'n'),
 ('runs', 'v'),
 ('faster', 'r'),
 ('through', 'n'),
 ('the', 'n'),
 ('shallows', 'n'),
 ('to', 'n'),
 ('the', 'n'),
 ('sandiest', 'a'),
 ('banks', 'n')]

Each word is now represented as a `tuple` of the word itself and a tag that the lemmatiser can understand.

### More complex lemmatisation

Once you have tagged tokens, lemmatising the words more effectively involves only a slight alteration to our code.

In [11]:
# Apply lemmatisation to each word/token in example_a, using pos tags

# Tag each token
example_a_tagged = get_wordnet_tags(example_a_tokens)

example_a_lemmas = [lemma.lemmatize(word=token[0], pos=token[1]) for token in example_a_tagged]

print(example_a_lemmas)

['She', 'run', 'faster', 'through', 'the', 'shallow', 'to', 'the', 'sandy', 'bank']


In [12]:
# Apply lemmatisation to each word/token in example_b, using pos tags

# Tag each token
example_b_tagged = get_wordnet_tags(example_b_tokens)

example_b_lemmas = [lemma.lemmatize(word=token[0], pos=token[1]) for token in example_b_tagged]

print(example_b_lemmas)

['He', 'thank', 'the', 'woman', 'for', 'the', 'happy', 'year']


This is still not perfect ("faster" has not been simplified), but text analysis deals with such complex and messy data that very little ever is.

There is a noticeable improvement though - "thanks" and "happier" have both been correctly reduced to their lemma forms ("thank" and "happy"), making our data more consistent.
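
One possible way to handle cases like "faster", offered as a sketch rather than as part of the approach above: if a word tagged as an adverb comes back unchanged, retry it as an adjective, since comparative adverbs often share their form with comparative adjectives. `lemmatize_with_fallback` is a hypothetical helper that accepts any callable taking `(word, pos)`, and `stub_lemmatize` is a toy stand-in for the real lemmatiser so that the example runs without WordNet.

```python
def lemmatize_with_fallback(word, pos, lemmatize):
    """Lemmatise `word`; if an adverb comes back unchanged, retry as an adjective."""
    result = lemmatize(word, pos)
    if result == word and pos == "r":
        result = lemmatize(word, "a")
    return result


def stub_lemmatize(word, pos):
    """Toy stand-in for lemma.lemmatize: only strips 'er' from adjectives."""
    if pos == "a" and word.endswith("er"):
        return word[:-2]
    return word


print(lemmatize_with_fallback("faster", "r", stub_lemmatize))   # fast
print(lemmatize_with_fallback("through", "n", stub_lemmatize))  # through
```

In the notebook you would pass `lemma.lemmatize` in place of the stub. The trade-off is that a retry may occasionally reduce a word that should have been left alone, so it is worth checking the results on your own data.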

### Lemmatisation within a column/dataframe

So far, we have lemmatised only individual strings. The code below demonstrates how to apply these concepts to a column in a dataframe.

For this particular example, we're using Shakespeare's Sonnet 116, because who says data scientists can't appreciate culture?

In [13]:
# Create an example dataframe

df = pd.DataFrame(columns=["line_number", "text"], data=[[1, "Let me not to the marriage of true minds"],
                                                         [2, "Admit impediments. Love is not love"],
                                                         [3, "Which alters when its alteration finds,"],
                                                         [4, "Or bends with the remover to remove."],
                                                         [5, "O no! it is an ever-fixed mark"],
                                                         [6, "That looks on tempests and is never shaken;"],
                                                         [7, "It is the star to every wand'ring bark,"],
                                                         [8, "Whose worth's unknown, although his height be taken."],
                                                         [9, "Love's not Time's fool, though rosy lips and cheeks"],
                                                         [10, "Within his bending sickle's compass come;"],
                                                         [11, "Love alters not with his brief hours and weeks,"],
                                                         [12, "But bears it out even to the edge of doom."],
                                                         [13, "If this be error and upon me prov'd,"],
                                                         [14, "I never writ, nor no man ever lov'd."]])

In [14]:
# Clean the text a little

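# Manage poetic conventions
# (assign the result back rather than using inplace=True on a column selection)
df["text"] = df["text"].str.replace("'r", "er", regex=False)
df["text"] = df["text"].str.replace("'d", "ed", regex=False)

# Remove punctuation (raw string avoids an invalid-escape warning for \w and \s)
df["text"] = df["text"].str.replace("-", " ", regex=False)
df["text"] = df["text"].str.replace(r"[^\w\s]", "", regex=True)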

# Case-fold for consistency
df["text"] = df["text"].str.lower()

In [15]:
# View the cleaned text

df["text"]

0              let me not to the marriage of true minds
1                    admit impediments love is not love
2                which alters when its alteration finds
3                   or bends with the remover to remove
4                         o no it is an ever fixed mark
5            that looks on tempests and is never shaken
6                it is the star to every wandering bark
7     whose worths unknown although his height be taken
8      loves not times fool though rosy lips and cheeks
9               within his bending sickles compass come
10       love alters not with his brief hours and weeks
11            but bears it out even to the edge of doom
12                  if this be error and upon me proved
13                   i never writ nor no man ever loved
Name: text, dtype: object

In [16]:
# Tokenise the text

df["tokens"] = df["text"].apply(word_tokenize)

In [17]:
# Tag the tokens

df["tagged"] = df["tokens"].apply(get_wordnet_tags)

In [18]:
# Lemmatise the tagged tokens

df["lemmas"] = df["tagged"].apply(lambda tokens: [lemma.lemmatize(word=token[0], pos=token[1])
                                                   for token in tokens]) 

In [19]:
# View the lemmas

df["lemmas"]

0     [let, me, not, to, the, marriage, of, true, mind]
1              [admit, impediment, love, be, not, love]
2            [which, alter, when, it, alteration, find]
3            [or, bend, with, the, remover, to, remove]
4                  [o, no, it, be, an, ever, fix, mark]
5      [that, look, on, tempest, and, be, never, shake]
6       [it, be, the, star, to, every, wandering, bark]
7     [whose, worth, unknown, although, his, height,...
8     [love, not, time, fool, though, rosy, lip, and...
9         [within, his, bending, sickle, compass, come]
10    [love, alters, not, with, his, brief, hour, an...
11    [but, bear, it, out, even, to, the, edge, of, ...
12          [if, this, be, error, and, upon, me, prove]
13           [i, never, writ, nor, no, man, ever, love]
Name: lemmas, dtype: object

In [20]:
# Join the lemmatised text back together

df["processed_text"] = df["lemmas"].apply(lambda lemmas: " ".join(lemmas))

In [21]:
# View the results

df["processed_text"]

0             let me not to the marriage of true mind
1                   admit impediment love be not love
2                 which alter when it alteration find
3                  or bend with the remover to remove
4                         o no it be an ever fix mark
5             that look on tempest and be never shake
6              it be the star to every wandering bark
7     whose worth unknown although his height be take
8        love not time fool though rosy lip and cheek
9              within his bending sickle compass come
10       love alters not with his brief hour and week
11           but bear it out even to the edge of doom
12                 if this be error and upon me prove
13                  i never writ nor no man ever love
Name: processed_text, dtype: object

The code above breaks the lemmatisation process down into several steps. It could be shortened significantly, even into a single line if you really wanted, but we took the slow route to show each transformation.
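
As a rough sketch of what a shortened version might look like, the tokenise, tag, lemmatise, and join steps can be wrapped in one function. `lemmatise_text` is a hypothetical helper that takes the three processing steps as arguments; the toy stand-ins below exist only so the example runs on its own, without `nltk`.

```python
def lemmatise_text(text, tokenize, tag_tokens, lemmatize):
    """Tokenise, tag, lemmatise, and re-join a string in one call."""
    tagged = tag_tokens(tokenize(text))
    return " ".join(lemmatize(word, pos) for word, pos in tagged)


# Stand-ins so the sketch runs on its own; in the notebook you would pass
# word_tokenize, get_wordnet_tags, and lemma.lemmatize instead.
def toy_tag(tokens):
    """Tag every token as a noun."""
    return [(token, "n") for token in tokens]


def toy_lemmatize(word, pos):
    """Crude plural stripping for words longer than three letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


print(lemmatise_text("his brief hours and weeks", str.split, toy_tag, toy_lemmatize))
# his brief hour and week
```

With the notebook's own objects, the whole column could then be processed in one step: `df["processed_text"] = df["text"].apply(lambda text: lemmatise_text(text, word_tokenize, get_wordnet_tags, lemma.lemmatize))`.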