# Text Preprocessing

This notebook demonstrates a simple text preprocessing pipeline using the [Natural Language Toolkit (NLTK)](https://www.nltk.org/index.html). 

Make sure you first follow the [instructions on Wattle](https://wattlecourses.anu.edu.au/mod/page/view.php?id=2943340) to set up your environment for this lab.

In [1]:
import nltk
import string
from collections import Counter

Raw text from [this Wikipedia page](https://en.wikipedia.org/wiki/Australia).

In [2]:
raw_text = "Australia, officially the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. With an area of 7,617,930 square kilometres (2,941,300 sq mi), Australia is the largest country by area in Oceania and the world's sixth-largest country. Australia is the oldest, flattest, and driest inhabited continent, with the least fertile soils. It is a megadiverse country, and its size gives it a wide variety of landscapes and climates, with deserts in the centre, tropical rainforests in the north-east, and mountain ranges in the south-east.\nIndigenous Australians have inhabited the continent for approximately 65,000 years. The European maritime exploration of Australia commenced in the early 17th century with the arrival of Dutch explorers. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day."

In [3]:
print(raw_text)

Australia, officially the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. With an area of 7,617,930 square kilometres (2,941,300 sq mi), Australia is the largest country by area in Oceania and the world's sixth-largest country. Australia is the oldest, flattest, and driest inhabited continent, with the least fertile soils. It is a megadiverse country, and its size gives it a wide variety of landscapes and climates, with deserts in the centre, tropical rainforests in the north-east, and mountain ranges in the south-east.
Indigenous Australians have inhabited the continent for approximately 65,000 years. The European maritime exploration of Australia commenced in the early 17th century with the arrival of Dutch explorers. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 Januar

## Sentence splitting

Splitting text into sentences.

In [4]:
from nltk.tokenize import sent_tokenize

In [5]:
# sent_tokenize?  # uncomment this line to see the documentation of `sent_tokenize`

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/juneehome/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
sentences = sent_tokenize(raw_text)

In [7]:
print(f'There are {len(sentences)} sentences')

There are 7 sentences
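To see why a trained sentence splitter is useful, here is a minimal standard-library sketch of a naive rule-based splitter. The function name `naive_sent_split` and the sample sentence are hypothetical and not part of the lab. The rule breaks after every `.`, `!` or `?` followed by whitespace, so it wrongly splits after an abbreviation such as "Dr."; `sent_tokenize` relies on the Punkt model, which is designed to cope with cases like this.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Hypothetical sample text, chosen to contain an abbreviation.
sample = "Dr. Smith visited Tasmania. It is an island."
for s in naive_sent_split(sample):
    print(repr(s))
```

The naive rule returns three pieces for this two-sentence sample, because "Dr." is treated as a sentence end.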


## Tokenisation

Dividing a string into a list of tokens.

In [8]:
from nltk.tokenize import word_tokenize

In [None]:
# word_tokenize?

In [9]:
tokens_list = [word_tokenize(s) for s in sentences]

In [10]:
tokens_list

[['Australia',
  ',',
  'officially',
  'the',
  'Commonwealth',
  'of',
  'Australia',
  ',',
  'is',
  'a',
  'sovereign',
  'country',
  'comprising',
  'the',
  'mainland',
  'of',
  'the',
  'Australian',
  'continent',
  ',',
  'the',
  'island',
  'of',
  'Tasmania',
  ',',
  'and',
  'numerous',
  'smaller',
  'islands',
  '.'],
 ['With',
  'an',
  'area',
  'of',
  '7,617,930',
  'square',
  'kilometres',
  '(',
  '2,941,300',
  'sq',
  'mi',
  ')',
  ',',
  'Australia',
  'is',
  'the',
  'largest',
  'country',
  'by',
  'area',
  'in',
  'Oceania',
  'and',
  'the',
  'world',
  "'s",
  'sixth-largest',
  'country',
  '.'],
 ['Australia',
  'is',
  'the',
  'oldest',
  ',',
  'flattest',
  ',',
  'and',
  'driest',
  'inhabited',
  'continent',
  ',',
  'with',
  'the',
  'least',
  'fertile',
  'soils',
  '.'],
 ['It',
  'is',
  'a',
  'megadiverse',
  'country',
  ',',
  'and',
  'its',
  'size',
  'gives',
  'it',
  'a',
  'wide',
  'variety',
  'of',
  'landscapes',
  '

The top-10 most common tokens.

In [11]:
Counter([w for x in tokens_list for w in x]).most_common(10)

[('the', 15),
 (',', 14),
 ('of', 8),
 ('Australia', 7),
 ('and', 7),
 ('.', 7),
 ('in', 5),
 ('is', 4),
 ('a', 4),
 ('country', 4)]

### Exercise

Try [other tokenisers provided by NLTK](https://www.nltk.org/api/nltk.tokenize.html) (e.g. RegexpTokenizer, WhitespaceTokenizer, WordPunctTokenizer etc.) and compare their outputs.

In [None]:
#from nltk.tokenize import WhitespaceTokenizer

In [15]:
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer, wordpunct_tokenize

In [18]:
wordpunct_tokens_list = [wordpunct_tokenize(s) for s in sentences]
wordpunct_tokens_list

[['Australia',
  ',',
  'officially',
  'the',
  'Commonwealth',
  'of',
  'Australia',
  ',',
  'is',
  'a',
  'sovereign',
  'country',
  'comprising',
  'the',
  'mainland',
  'of',
  'the',
  'Australian',
  'continent',
  ',',
  'the',
  'island',
  'of',
  'Tasmania',
  ',',
  'and',
  'numerous',
  'smaller',
  'islands',
  '.'],
 ['With',
  'an',
  'area',
  'of',
  '7',
  ',',
  '617',
  ',',
  '930',
  'square',
  'kilometres',
  '(',
  '2',
  ',',
  '941',
  ',',
  '300',
  'sq',
  'mi',
  '),',
  'Australia',
  'is',
  'the',
  'largest',
  'country',
  'by',
  'area',
  'in',
  'Oceania',
  'and',
  'the',
  'world',
  "'",
  's',
  'sixth',
  '-',
  'largest',
  'country',
  '.'],
 ['Australia',
  'is',
  'the',
  'oldest',
  ',',
  'flattest',
  ',',
  'and',
  'driest',
  'inhabited',
  'continent',
  ',',
  'with',
  'the',
  'least',
  'fertile',
  'soils',
  '.'],
 ['It',
  'is',
  'a',
  'megadiverse',
  'country',
  ',',
  'and',
  'its',
  'size',
  'gives',
  'it

In [17]:
Counter([w for x in wordpunct_tokens_list for w in x]).most_common(10)

[(',', 18),
 ('the', 15),
 ('of', 8),
 ('Australia', 7),
 ('and', 7),
 ('.', 7),
 ('in', 5),
 ('is', 4),
 ('a', 4),
 ('country', 4)]

In [21]:
# RegexpTokenizer expects a regular expression, not the text; r'\w+' is one illustrative pattern.
Regexp_tokens_list = [RegexpTokenizer(r'\w+').tokenize(s) for s in sentences]
Regexp_tokens_list

In [22]:
# tokenize() returns the tokens; span_tokenize() would yield lazy (start, end) offsets instead.
whitespace_tokens_list = [WhitespaceTokenizer().tokenize(s) for s in sentences]
whitespace_tokens_list

### Question 

What are the differences and how can we choose the best tokeniser for a task?

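The tokenisers differ mainly in how they treat punctuation. `word_tokenize` keeps `7,617,930` and `sixth-largest` as single tokens and splits the possessive into `world` and `'s`, whereas `wordpunct_tokenize` breaks at every run of punctuation (`7`, `,`, `617`, `,`, `930` and `sixth`, `-`, `largest`). The best tokeniser depends on the task and the language. For example, Chinese and Japanese are written without spaces between words, so a whitespace tokeniser cannot be used for them.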
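To make the difference concrete, the sketch below uses only the standard library to approximate two of the tokenisers on a phrase taken from the raw text. The helper names are hypothetical: `whitespace_tokens` splits on whitespace only, and `wordpunct_tokens` uses the regular expression `\w+|[^\w\s]+`, which is the pattern behind NLTK's `WordPunctTokenizer`.

```python
import re

def whitespace_tokens(s):
    # Split on runs of whitespace only; punctuation stays attached to words.
    return s.split()

def wordpunct_tokens(s):
    # Alternate between runs of word characters and runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", s)

phrase = "the world's sixth-largest country."
print(whitespace_tokens(phrase))
print(wordpunct_tokens(phrase))
```

Whitespace splitting leaves `country.` with its full stop attached, while the word/punctuation pattern breaks `world's` and `sixth-largest` into several tokens. Which behaviour is preferable depends on what the downstream task treats as a term.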

## Removing punctuation and stop words

Stop words and punctuation carry little information for many IR tasks, and removing them reduces the number of tokens we need to process.

In [23]:
from nltk.corpus import stopwords

In [24]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juneehome/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [25]:
stopwords_en = set(stopwords.words('english'))

In [26]:
stopwords_en

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [27]:
tokens_list[:] = [[w for w in x if w not in string.punctuation and w not in stopwords_en] for x in tokens_list]

In [28]:
tokens_list

[['Australia',
  'officially',
  'Commonwealth',
  'Australia',
  'sovereign',
  'country',
  'comprising',
  'mainland',
  'Australian',
  'continent',
  'island',
  'Tasmania',
  'numerous',
  'smaller',
  'islands'],
 ['With',
  'area',
  '7,617,930',
  'square',
  'kilometres',
  '2,941,300',
  'sq',
  'mi',
  'Australia',
  'largest',
  'country',
  'area',
  'Oceania',
  'world',
  "'s",
  'sixth-largest',
  'country'],
 ['Australia',
  'oldest',
  'flattest',
  'driest',
  'inhabited',
  'continent',
  'least',
  'fertile',
  'soils'],
 ['It',
  'megadiverse',
  'country',
  'size',
  'gives',
  'wide',
  'variety',
  'landscapes',
  'climates',
  'deserts',
  'centre',
  'tropical',
  'rainforests',
  'north-east',
  'mountain',
  'ranges',
  'south-east'],
 ['Indigenous',
  'Australians',
  'inhabited',
  'continent',
  'approximately',
  '65,000',
  'years'],
 ['The',
  'European',
  'maritime',
  'exploration',
  'Australia',
  'commenced',
  'early',
  '17th',
  'century',


The top-10 most common tokens.

In [29]:
Counter([w for x in tokens_list for w in x]).most_common(10)

[('Australia', 7),
 ('country', 4),
 ('continent', 3),
 ("'s", 3),
 ('area', 2),
 ('inhabited', 2),
 ('officially', 1),
 ('Commonwealth', 1),
 ('sovereign', 1),
 ('comprising', 1)]

### Question

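Will we get a different set of tokens if we lowercase all words before removing stop words? What are the potential problems with doing that?

Yes. The stop word list is all lowercase, so capitalised tokens such as `With`, `It` and `The` survive the filter above; lowercasing first would remove them. The risk is that lowercasing conflates proper nouns, names and acronyms with ordinary words, for example Apple (the company) with apple, or FOCUS (the organisation) with focus.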
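The effect of case on stop word removal can be sketched with the standard library alone. The stop word set `STOP` below is a tiny hypothetical subset used for illustration (the notebook uses `stopwords.words('english')`), and `keep` is a hypothetical helper. It lowercases a token only for the comparison, so surviving tokens keep their original casing.

```python
import string

# Hypothetical, tiny stop word set for illustration only.
STOP = {"with", "an", "of", "the", "it", "is", "in"}

def keep(token, lowercase_first):
    """Return True if the token is neither pure punctuation nor a stop word."""
    if all(ch in string.punctuation for ch in token):
        return False
    key = token.lower() if lowercase_first else token
    return key not in STOP

tokens = ["With", "an", "area", "of", "7,617,930", "square", "kilometres", "(", ")", ","]
print([t for t in tokens if keep(t, lowercase_first=False)])  # 'With' survives
print([t for t in tokens if keep(t, lowercase_first=True)])   # 'With' is removed
```

Comparing in lowercase removes sentence-initial stop words, but it would also drop an acronym such as "IT", because `it` is a stop word.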

## Stemming

Turning words into stems.

In [30]:
from nltk.stem import PorterStemmer

In [31]:
stemmer = PorterStemmer()

In [32]:
tokens_stem = [stemmer.stem(w) for x in tokens_list for w in x]

In [33]:
tokens_stem

['australia',
 'offici',
 'commonwealth',
 'australia',
 'sovereign',
 'countri',
 'compris',
 'mainland',
 'australian',
 'contin',
 'island',
 'tasmania',
 'numer',
 'smaller',
 'island',
 'with',
 'area',
 '7,617,930',
 'squar',
 'kilometr',
 '2,941,300',
 'sq',
 'mi',
 'australia',
 'largest',
 'countri',
 'area',
 'oceania',
 'world',
 "'s",
 'sixth-largest',
 'countri',
 'australia',
 'oldest',
 'flattest',
 'driest',
 'inhabit',
 'contin',
 'least',
 'fertil',
 'soil',
 'it',
 'megadivers',
 'countri',
 'size',
 'give',
 'wide',
 'varieti',
 'landscap',
 'climat',
 'desert',
 'centr',
 'tropic',
 'rainforest',
 'north-east',
 'mountain',
 'rang',
 'south-east',
 'indigen',
 'australian',
 'inhabit',
 'contin',
 'approxim',
 '65,000',
 'year',
 'the',
 'european',
 'maritim',
 'explor',
 'australia',
 'commenc',
 'earli',
 '17th',
 'centuri',
 'arriv',
 'dutch',
 'explor',
 'in',
 '1770',
 'australia',
 "'s",
 'eastern',
 'half',
 'claim',
 'great',
 'britain',
 'initi',
 'settl'

In [34]:
Counter(tokens_stem).most_common(10)

[('australia', 7),
 ('countri', 4),
 ('contin', 3),
 ("'s", 3),
 ('australian', 2),
 ('island', 2),
 ('area', 2),
 ('inhabit', 2),
 ('explor', 2),
 ('offici', 1)]

### Exercise

Try other NLTK stemmers (e.g. SnowballStemmer, RegexpStemmer). You may need to download additional data packages; see https://www.nltk.org/data.html

In [35]:
from nltk.stem import SnowballStemmer, RegexpStemmer

In [38]:
snow_stemmer = SnowballStemmer("english")

In [39]:
snow_tokens_stem = [snow_stemmer.stem(w) for x in tokens_list for w in x]
snow_tokens_stem

['australia',
 'offici',
 'commonwealth',
 'australia',
 'sovereign',
 'countri',
 'compris',
 'mainland',
 'australian',
 'contin',
 'island',
 'tasmania',
 'numer',
 'smaller',
 'island',
 'with',
 'area',
 '7,617,930',
 'squar',
 'kilometr',
 '2,941,300',
 'sq',
 'mi',
 'australia',
 'largest',
 'countri',
 'area',
 'oceania',
 'world',
 "'s",
 'sixth-largest',
 'countri',
 'australia',
 'oldest',
 'flattest',
 'driest',
 'inhabit',
 'contin',
 'least',
 'fertil',
 'soil',
 'it',
 'megadivers',
 'countri',
 'size',
 'give',
 'wide',
 'varieti',
 'landscap',
 'climat',
 'desert',
 'centr',
 'tropic',
 'rainforest',
 'north-east',
 'mountain',
 'rang',
 'south-east',
 'indigen',
 'australian',
 'inhabit',
 'contin',
 'approxim',
 '65,000',
 'year',
 'the',
 'european',
 'maritim',
 'explor',
 'australia',
 'commenc',
 'earli',
 '17th',
 'centuri',
 'arriv',
 'dutch',
 'explor',
 'in',
 '1770',
 'australia',
 "'s",
 'eastern',
 'half',
 'claim',
 'great',
 'britain',
 'initi',
 'settl'

In [40]:
Counter(snow_tokens_stem).most_common(10)

[('australia', 7),
 ('countri', 4),
 ('contin', 3),
 ("'s", 3),
 ('australian', 2),
 ('island', 2),
 ('area', 2),
 ('inhabit', 2),
 ('explor', 2),
 ('offici', 1)]

In [41]:
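# RegexpStemmer expects a regular expression of affixes to strip, e.g. 'ing$|s$|e$|able$'.
# The pattern "english" matches none of these tokens, so the output below equals the input.
reg_stemmer = RegexpStemmer("english")
reg_tokens_stem = [reg_stemmer.stem(w) for x in tokens_list for w in x]
reg_tokens_stem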

['Australia',
 'officially',
 'Commonwealth',
 'Australia',
 'sovereign',
 'country',
 'comprising',
 'mainland',
 'Australian',
 'continent',
 'island',
 'Tasmania',
 'numerous',
 'smaller',
 'islands',
 'With',
 'area',
 '7,617,930',
 'square',
 'kilometres',
 '2,941,300',
 'sq',
 'mi',
 'Australia',
 'largest',
 'country',
 'area',
 'Oceania',
 'world',
 "'s",
 'sixth-largest',
 'country',
 'Australia',
 'oldest',
 'flattest',
 'driest',
 'inhabited',
 'continent',
 'least',
 'fertile',
 'soils',
 'It',
 'megadiverse',
 'country',
 'size',
 'gives',
 'wide',
 'variety',
 'landscapes',
 'climates',
 'deserts',
 'centre',
 'tropical',
 'rainforests',
 'north-east',
 'mountain',
 'ranges',
 'south-east',
 'Indigenous',
 'Australians',
 'inhabited',
 'continent',
 'approximately',
 '65,000',
 'years',
 'The',
 'European',
 'maritime',
 'exploration',
 'Australia',
 'commenced',
 'early',
 '17th',
 'century',
 'arrival',
 'Dutch',
 'explorers',
 'In',
 '1770',
 'Australia',
 "'s",
 'east

In [42]:
Counter(reg_tokens_stem).most_common(10)

[('Australia', 7),
 ('country', 4),
 ('continent', 3),
 ("'s", 3),
 ('area', 2),
 ('inhabited', 2),
 ('officially', 1),
 ('Commonwealth', 1),
 ('sovereign', 1),
 ('comprising', 1)]
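`RegexpStemmer` takes a regular expression describing the affixes to remove, so a pattern such as `"english"` leaves these tokens untouched, which is why the output above is identical to the input. The sketch below imitates the idea with the standard library; the function `regexp_stem`, the suffix pattern and the minimum length are illustrative assumptions rather than recommended settings.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_len=4):
    """Strip any match of `pattern`; leave words shorter than `min_len` untouched."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

for w in ["islands", "comprising", "ranges", "square", "Australia", "mi"]:
    print(w, "->", regexp_stem(w))
```

With this pattern `islands` becomes `island` and `comprising` becomes `compris`, while `Australia` and the short token `mi` are unchanged. A purely suffix-based rule is easy to control, but it knows nothing about the language, so it over-strips words like `square`.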

## Lemmatisation

Turning words into lemmas (entries in a dictionary). It requires knowledge of the context (typically the intended part of speech of a word in the context).

In [43]:
from nltk.stem import WordNetLemmatizer

In [44]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/juneehome/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/juneehome/nltk_data...


True

POS tagging for lemmatisation.

In [45]:
nltk.download('averaged_perceptron_tagger')
tags_list = nltk.pos_tag_sents(tokens_list)
tags_list

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/juneehome/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[[('Australia', 'NNP'),
  ('officially', 'RB'),
  ('Commonwealth', 'NNP'),
  ('Australia', 'NNP'),
  ('sovereign', 'JJ'),
  ('country', 'NN'),
  ('comprising', 'VBG'),
  ('mainland', 'NN'),
  ('Australian', 'JJ'),
  ('continent', 'NN'),
  ('island', 'NN'),
  ('Tasmania', 'NNP'),
  ('numerous', 'JJ'),
  ('smaller', 'JJR'),
  ('islands', 'NNS')],
 [('With', 'IN'),
  ('area', 'NN'),
  ('7,617,930', 'CD'),
  ('square', 'JJ'),
  ('kilometres', 'NNS'),
  ('2,941,300', 'CD'),
  ('sq', 'JJ'),
  ('mi', 'NN'),
  ('Australia', 'NNP'),
  ('largest', 'JJS'),
  ('country', 'NN'),
  ('area', 'NN'),
  ('Oceania', 'NNP'),
  ('world', 'NN'),
  ("'s", 'POS'),
  ('sixth-largest', 'JJ'),
  ('country', 'NN')],
 [('Australia', 'NNP'),
  ('oldest', 'JJS'),
  ('flattest', 'JJS'),
  ('driest', 'NN'),
  ('inhabited', 'VBN'),
  ('continent', 'NN'),
  ('least', 'JJS'),
  ('fertile', 'JJ'),
  ('soils', 'NNS')],
 [('It', 'PRP'),
  ('megadiverse', 'VBZ'),
  ('country', 'NN'),
  ('size', 'NN'),
  ('gives', 'VBZ'),
  (

A heuristic to convert POS tags to the [four syntactic categories that WordNet recognises (i.e. **noun**, **verb**, **adj** and **adv**)](https://wordnet.princeton.edu/):
- `n` for nouns
- `v` for verbs
- `a` for adjectives
- `r` for adverbs

In [None]:
# tags_list

In [46]:
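# Input: the lowercased first letter of a Penn Treebank tag; anything unrecognised defaults to noun.
wordnet_tag = lambda t: 'a' if t == 'j' else (t if t in ['n', 'v', 'r'] else 'n')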
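The same heuristic can be written as a named function that takes the full Penn Treebank tag. This is a sketch, and `penn_to_wordnet` is a hypothetical name. One deliberate difference from the one-liner: it treats only tags starting with `RB` as adverbs, whereas the one-liner maps every tag starting with `R` (including the particle tag `RP`) to `r`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"   # JJ, JJR, JJS -> adjective
    if tag.startswith("V"):
        return "v"   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
    if tag.startswith("RB"):
        return "r"   # RB, RBR, RBS -> adverb
    return "n"       # nouns and everything else (CD, IN, POS, ...)

for tag in ["JJS", "VBG", "RB", "NNS", "CD"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Defaulting to noun mirrors the original heuristic: tokens with tags such as `CD` or `IN` are looked up as nouns, which usually leaves them unchanged.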

Lemmatising

In [48]:
lemmatizer = WordNetLemmatizer()

In [49]:
tokens_lemma = [lemmatizer.lemmatize(w.lower(), pos=wordnet_tag(t[0].lower())) for x in tags_list for (w, t) in x]




In [50]:
tokens_lemma

['australia',
 'officially',
 'commonwealth',
 'australia',
 'sovereign',
 'country',
 'comprise',
 'mainland',
 'australian',
 'continent',
 'island',
 'tasmania',
 'numerous',
 'small',
 'island',
 'with',
 'area',
 '7,617,930',
 'square',
 'kilometre',
 '2,941,300',
 'sq',
 'mi',
 'australia',
 'large',
 'country',
 'area',
 'oceania',
 'world',
 "'s",
 'sixth-largest',
 'country',
 'australia',
 'old',
 'flat',
 'driest',
 'inhabit',
 'continent',
 'least',
 'fertile',
 'soil',
 'it',
 'megadiverse',
 'country',
 'size',
 'give',
 'wide',
 'variety',
 'landscape',
 'climates',
 'desert',
 'centre',
 'tropical',
 'rainforest',
 'north-east',
 'mountain',
 'range',
 'south-east',
 'indigenous',
 'australian',
 'inhabited',
 'continent',
 'approximately',
 '65,000',
 'year',
 'the',
 'european',
 'maritime',
 'exploration',
 'australia',
 'commence',
 'early',
 '17th',
 'century',
 'arrival',
 'dutch',
 'explorer',
 'in',
 '1770',
 'australia',
 "'s",
 'eastern',
 'half',
 'claim',
 '

In [51]:
Counter(tokens_lemma).most_common(10)

[('australia', 7),
 ('country', 4),
 ('continent', 3),
 ("'s", 3),
 ('australian', 2),
 ('island', 2),
 ('area', 2),
 ('officially', 1),
 ('commonwealth', 1),
 ('sovereign', 1)]

In [61]:
tokens_list_flat = [item for sublist in tokens_list for item in sublist]

for original, lemma in zip(tokens_list_flat, tokens_lemma):
    if original != lemma:
        print(f"Original: {original}, Lemma: {lemma}")

Original: Australia, Lemma: australia
Original: Commonwealth, Lemma: commonwealth
Original: Australia, Lemma: australia
Original: comprising, Lemma: comprise
Original: Australian, Lemma: australian
Original: Tasmania, Lemma: tasmania
Original: smaller, Lemma: small
Original: islands, Lemma: island
Original: With, Lemma: with
Original: kilometres, Lemma: kilometre
Original: Australia, Lemma: australia
Original: largest, Lemma: large
Original: Oceania, Lemma: oceania
Original: Australia, Lemma: australia
Original: oldest, Lemma: old
Original: flattest, Lemma: flat
Original: inhabited, Lemma: inhabit
Original: soils, Lemma: soil
Original: It, Lemma: it
Original: gives, Lemma: give
Original: landscapes, Lemma: landscape
Original: deserts, Lemma: desert
Original: rainforests, Lemma: rainforest
Original: ranges, Lemma: range
Original: Indigenous, Lemma: indigenous
Original: Australians, Lemma: australian
Original: years, Lemma: year
Original: The, Lemma: the
Original: European, Lemma: europe

### Question

Compare the results of stemming and lemmatisation. Can you see the differences and the potential problems with stemming and lemmatisation?
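One way to start the comparison is to line up each token with its stem and its lemma and list the cases where the two disagree. The sketch below hard-codes a few triples copied from the outputs above so that it runs on its own; the helper `disagreements` is hypothetical. In the notebook, the same triples could be built with `zip(tokens_list_flat, tokens_stem, tokens_lemma)`.

```python
# (token, Porter stem, WordNet lemma) triples copied from the outputs above.
rows = [
    ("officially", "offici", "officially"),
    ("country", "countri", "country"),
    ("smaller", "smaller", "small"),
    ("largest", "largest", "large"),
    ("climates", "climat", "climates"),
    ("Australians", "australian", "australian"),
]

def disagreements(rows):
    """Return the triples where the stem and the lemma differ."""
    return [(tok, stem, lemma) for tok, stem, lemma in rows if stem != lemma]

for tok, stem, lemma in disagreements(rows):
    print(f"{tok:12} stem={stem:12} lemma={lemma}")
```

In these rows, stems such as `offici` and `countri` are not dictionary words, while the lemmatiser maps `smaller` to `small` but leaves `climates` unchanged, possibly because of the POS tag it was given.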