# Text Preprocessing

This notebook demonstrates a simple text preprocessing pipeline using the [Natural Language Toolkit (NLTK)](https://www.nltk.org/index.html). 

Make sure you first follow the [instructions on Wattle](https://wattlecourses.anu.edu.au/mod/page/view.php?id=2683737) to set up your environment for this lab.

In [None]:
import nltk
import string
from collections import Counter

Raw text from [this Wikipedia page](https://en.wikipedia.org/wiki/Australia).

In [None]:
# Adjacent string literals are concatenated; each literal below is one paragraph.
raw_text = (
    "Australia, officially the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. With an area of 7,617,930 square kilometres (2,941,300 sq mi), Australia is the largest country by area in Oceania and the world's sixth-largest country. Australia is the oldest, flattest, and driest inhabited continent, with the least fertile soils. It is a megadiverse country, and its size gives it a wide variety of landscapes and climates, with deserts in the centre, tropical rainforests in the north-east, and mountain ranges in the south-east.\n"
    "Indigenous Australians have inhabited the continent for approximately 65,000 years. The European maritime exploration of Australia commenced in the early 17th century with the arrival of Dutch explorers. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day. The European population grew steadily in subsequent decades, and by the time of an 1850s gold rush, most of the continent had been explored by European settlers and an additional five self-governing crown colonies established. On 1 January 1901, the six colonies federated, forming the Commonwealth of Australia. Australia has since maintained a stable liberal democratic political system and wealthy market economy.\n"
    "Politically, Australia is a federal parliamentary constitutional monarchy, comprising six states and ten territories. Australia's population of nearly 26 million is highly urbanised and heavily concentrated on the eastern seaboard. Canberra is the nation's capital, while the five largest cities are Sydney, Melbourne, Brisbane, Perth, and Adelaide. Australia's demography has been shaped by centuries of immigration: immigrants account for 30% of the country's population, and almost half of Australians have at least one parent born overseas. Australia's abundant natural resources and well-developed international trade relations are crucial to the country's economy, which generates its income from various sources including services, mining exports, banking, manufacturing, agriculture and international education.\n"
    "Australia is a highly developed country with a high-income economy; it has the world's thirteenth-largest economy, tenth-highest per capita income and eighth-highest Human Development Index. Australia is a regional power, and has the world's thirteenth-highest military expenditure. Australia ranks amongst the highest in the world for quality of life, democracy, health, education, economic freedom, civil liberties, safety, and political rights, with all its major cities faring exceptionally in global comparative livability surveys. It is a member of international groupings including the United Nations, the G20, the OECD, the WTO, ANZUS, AUKUS, Five Eyes, the Quad, APEC, the Pacific Islands Forum, the Pacific Community and the Commonwealth of Nations."
)

In [None]:
# print(raw_text)

## Sentence splitting

Split the raw text into a list of sentences.

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
# sent_tokenize?  # uncomment this line to see the documentation of `sent_tokenize`

In [None]:
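nltk.download('punkt')
# Recent NLTK releases (3.9 and later) may look for 'punkt_tab' instead:
# nltk.download('punkt_tab')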

In [None]:
sentences = sent_tokenize(raw_text)

In [None]:
print(f'There are {len(sentences)} sentences')
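
To see what a trained splitter such as `sent_tokenize` is for, it helps to look at a naive rule first. The sketch below is illustrative only: `naive_sent_split` and the sample string are hypothetical and are not part of NLTK or of this notebook's data. It splits after every `.`, `!` or `?` that is followed by whitespace, so it breaks on abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Made-up sample containing an abbreviation.
sample = "Dr. Smith moved to Canberra in 1988. She works at the ANU."
naive = naive_sent_split(sample)
print(naive)
# ['Dr.', 'Smith moved to Canberra in 1988.', 'She works at the ANU.']
```

Running `sent_tokenize` on the same string is a quick way to check whether the Punkt model handles the abbreviation better.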

Use the first seven sentences to demonstrate text preprocessing.

In [None]:
text = ' '.join(sentences[:7])
print(text)

## Tokenisation

Dividing a string into a list of tokens.

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
# word_tokenize?

In [None]:
tokens = word_tokenize(text)
# tokens

The top-10 most common tokens.

In [None]:
Counter(tokens).most_common(10)

### Question

Try [other tokenisers provided by NLTK](https://www.nltk.org/api/nltk.tokenize.html) (e.g. `RegexpTokenizer`, `WhitespaceTokenizer`, `WordPunctTokenizer`) and compare their outputs.

What are the differences and how can we choose the best tokeniser for a task?

In [None]:
# from nltk.tokenize import WhitespaceTokenizer

# tokeniser = WhitespaceTokenizer()
# tokeniser.tokenize(text)
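
As a starting point for the comparison, the sketch below uses only the standard library to contrast two simple strategies: splitting on whitespace, and a regular expression that separates runs of word characters from runs of punctuation. The pattern `\w+|[^\w\s]+` should match the one used by NLTK's `WordPunctTokenizer`, but that is worth verifying against the NLTK documentation. The sample string is made up for illustration.

```python
import re

sample = "Australia's eastern half, the north-east."

# Strategy 1: split on whitespace only; punctuation stays glued to words.
whitespace_tokens = sample.split()

# Strategy 2: runs of word characters OR runs of non-word, non-space characters.
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sample)

print(whitespace_tokens)
# ["Australia's", 'eastern', 'half,', 'the', 'north-east.']
print(wordpunct_tokens)
# ['Australia', "'", 's', 'eastern', 'half', ',', 'the', 'north', '-', 'east', '.']
```

Neither output is right in general: the first keeps `half,` and `half` as different tokens, while the second breaks `north-east` into three pieces. Which behaviour is preferable depends on the task.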

## Removing punctuation and stop words

Stop words and punctuation are not helpful for many IR tasks, and removing them reduces the number of tokens we need to process.

In [None]:
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')

In [None]:
stopwords_en = set(stopwords.words('english'))
# stopwords_en

In [None]:
# Keep tokens that are neither punctuation nor stop words (the stop word test is case-sensitive).
tokens[:] = [w for w in tokens if w not in string.punctuation and w not in stopwords_en]
# tokens

The top-10 most common tokens.

In [None]:
Counter(tokens).most_common(10)
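
The filter above compares each token with the stop list exactly as written, so capitalisation matters. The sketch below illustrates this on a toy example; `stop_demo` is a tiny hypothetical stop list and `toks` is a hand-written token list, neither of which comes from NLTK.

```python
import string

stop_demo = {"the", "is", "a", "of"}  # tiny hypothetical stop list (all lower case)
toks = ["The", "island", "of", "Tasmania", "is", "a", "state", "."]
punct = set(string.punctuation)       # set of single punctuation characters

# Exact match: "The" is not in the stop list, so it survives.
case_sensitive = [w for w in toks if w not in punct and w not in stop_demo]

# Lower-case only for the comparison; the kept tokens retain their original case.
case_insensitive = [w for w in toks if w not in punct and w.lower() not in stop_demo]

print(case_sensitive)    # ['The', 'island', 'Tasmania', 'state']
print(case_insensitive)  # ['island', 'Tasmania', 'state']
```

The sketch uses `set(string.punctuation)` because `w in string.punctuation` on the plain string is a substring test rather than an exact match.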

### Question

Will we get a different set of tokens if we lowercase all words before removing stop words? What potential problems could that cause?

## Stemming or lemmatisation

In [None]:
from nltk.stem import PorterStemmer
# from nltk.stem import SnowballStemmer, RegexpStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
tokens_stem = [stemmer.stem(w) for w in tokens]
# tokens_stem

In [None]:
Counter(tokens_stem).most_common(10)
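
Stemmers apply suffix-stripping rules rather than looking words up in a dictionary. The sketch below is a deliberately crude illustration of that idea: `crude_stem` is a hypothetical helper with four hand-picked rules, and it is not the Porter algorithm.

```python
def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied

words = ["islands", "explored", "comprising", "colonies", "is"]
stems = [crude_stem(w) for w in words]
print(list(zip(words, stems)))
# [('islands', 'island'), ('explored', 'explor'), ('comprising', 'compris'),
#  ('colonies', 'coloni'), ('is', 'is')]
```

Several outputs (`explor`, `compris`, `coloni`) are not English words, which is typical of rule-based stemming. `PorterStemmer` uses a larger, staged rule set and should not be expected to agree with this sketch on every word.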

Lemmatisation reduces each token to its dictionary form (lemma) using WordNet.

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

POS tagging for lemmatisation.

In [None]:
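nltk.download('averaged_perceptron_tagger')
# Recent NLTK releases (3.9 and later) may need 'averaged_perceptron_tagger_eng' instead:
# nltk.download('averaged_perceptron_tagger_eng')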
tags = nltk.pos_tag(tokens)
# tags

Convert the POS tags to the [four syntactic categories that WordNet recognises](https://wordnet.princeton.edu/documentation/wndb5wn).

In [None]:
def wordnet_tag(t):
    # `t` is the lower-cased first letter of a POS tag; unknown letters default to noun.
    return 'a' if t == 'j' else (t if t in ['n', 'v', 'r'] else 'n')
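
To make the mapping concrete, the sketch below applies the same first-letter rule to a few Penn Treebank tags, the tagset that `nltk.pos_tag` returns by default. `penn_to_wordnet` is a hypothetical helper name introduced here for illustration only.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter via its first letter."""
    first = penn_tag[0].lower()
    if first == 'j':              # adjectives: JJ, JJR, JJS
        return 'a'
    if first in ('n', 'v', 'r'):  # nouns, verbs, and tags starting with R
        return first
    return 'n'                    # everything else falls back to noun

examples = ['JJ', 'NNP', 'VBD', 'RB', 'IN', 'RP']
mapped = {tag: penn_to_wordnet(tag) for tag in examples}
print(mapped)
# {'JJ': 'a', 'NNP': 'n', 'VBD': 'v', 'RB': 'r', 'IN': 'n', 'RP': 'r'}
```

The rule is approximate: `RP` (particle) is treated as an adverb only because it starts with `R`, and prepositions such as `IN` are lemmatised as nouns.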

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
# Each entry of `tags` is a (token, POS tag) pair; lemmatise the lower-cased token with its mapped POS.
tokens_lemma = [lemmatizer.lemmatize(w.lower(), wordnet_tag(tag[0].lower())) for w, tag in tags]
# tokens_lemma

In [None]:
Counter(tokens_lemma).most_common(10)

### Question

Compare the results of stemming and lemmatisation. What differences do you see, and what are the potential problems with each approach?