# Week 2: Preprocessing Text (Part 2)


In [1]:
#necessary library imports and setup introduced previously

import sys
import re
import pandas as pd
#import matplotlib.pyplot as plt
#%matplotlib inline
from itertools import zip_longest
import nltk
from nltk.corpus import reuters
from nltk.tokenize import word_tokenize
#nltk.download('punkt')

In [2]:
##uncomment these lines below if working on colab
#from google.colab import drive
##mount google drive
#drive.mount('/content/drive/')


## Overview 
Remember, a raw text document is just a sequence of characters. A number of basic steps are often performed when processing natural language text, and the lab sessions this week cover some of these basic pre-processing methods. In the previous notebook, you looked at:
- <b>segmentation</b> - breaking down large units of text into smaller units such as documents and sentences; and
- <b>tokenisation</b> - roughly speaking, this involves grouping characters into words.

In this notebook, you will be looking at:
- <b>case normalisation</b> - this involves converting all of the text into lower case; 
- <b>stemming</b> - this involves removing a word's inflections to find the stem; and 
- <b>punctuation and stop-word removal</b> - stop-words are common function words that in some situations can be ignored.

Note that we do not always apply all of the above preprocessing methods; it depends on the application. One of the things you will learn in this module is when applying each of these methods is, and is not, appropriate.

## Normalising text and removing unimportant tokens
In this next section we will consider several methods that pre-process (tokenised) text in ways that are sometimes helpful to 'downstream' processing.

### Number and case normalisation
Without any kind of normalisation, the tokens `"help"` and `"Help"` are two distinct types. In some contexts you may not want to distinguish them.

As another example, `"1998"` and `"1999"` count as distinct types. There are situations where there is no need to distinguish between different numbers.

The following code performs case normalisation and replaces tokens that consist of digits by "NUM". 
- Python strings provide a [number of methods](http://docs.python.org/library/stdtypes.html#string-methods), which you can call to analyse a string's content or to produce new strings from it.
- The code uses a [list comprehension](http://docs.python.org/tutorial/datastructures.html#list-comprehensions) to build a new list by looping through and filtering items.

In [3]:
tokens = ["The","cake","is","a","LIE"]      #a list of tokens, some of which contain uppercase letters
print([token.lower() for token in tokens])   #print newly created list of all lowercase tokens

numbers = ['in', 'the', 'year', '120', 'of', 'the', 'fourth', 'age', ',', 'after', '120', 'years', 'as', 'king', ',' , 'aragorn', 'died', 'at', 'the', 'age', 'of', '210']
print(["NUM" if token.isdigit() else token for token in numbers])  #replace all number tokens with "NUM" in a new list of tokens

['the', 'cake', 'is', 'a', 'lie']
['in', 'the', 'year', 'NUM', 'of', 'the', 'fourth', 'age', ',', 'after', 'NUM', 'years', 'as', 'king', ',', 'aragorn', 'died', 'at', 'the', 'age', 'of', 'NUM']
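
To see the effect on the number of distinct types, one can compare the size of the set of tokens before and after normalisation. The sketch below uses a made-up token list (`demo_tokens` is an assumption for illustration, not taken from the Reuters corpus).

```python
# Hypothetical token list, invented for illustration
demo_tokens = ["Help", "help", "arrived", "in", "1998", "and", "again", "in", "1999"]

# Lower-case every token, then map all-digit tokens to the placeholder "NUM"
lowered = [token.lower() for token in demo_tokens]
normalised = ["NUM" if token.isdigit() else token for token in lowered]

# A set keeps one copy of each distinct token, so its length is the number of types
raw_types = len(set(demo_tokens))
normalised_types = len(set(normalised))

print(raw_types, normalised_types)
```

Here `"Help"`/`"help"` and `"1998"`/`"1999"` each collapse to a single type, so the second number printed is smaller than the first.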


### Exercise 1.1
- Write a function <code>number_normalise</code> which 
    * replaces numbers with NUM; 
    * and replaces tokens such as `"4th"`, `"1st"` and `"22nd"` with `"Nth"`.
- Test your code on the list `["Within","5","minutes",",","the", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]`. 
- Check that the token `"and"` isn't changed to `"Nth"`.
- You will find [this page](http://docs.python.org/library/stdtypes.html#string-methods) useful.


In [4]:
tokens = ["Within","5","minutes",",","the", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]
def number_normalise(tokens):
    normalised=["Nth" if (token.endswith(("st","nd","rd","th")) and token[:-2].isdigit()) else token for token in tokens]
    normalised=["NUM" if token.isdigit() else token for token in normalised]
    return normalised

print(number_normalise(tokens))


['Within', 'NUM', 'minutes', ',', 'the', 'Nth', 'and', 'Nth', 'placed', 'runners', 'lapped', 'the', 'Nth', '.']
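
An alternative sketch of the same idea uses a regular expression to spell out the accepted token shapes explicitly: one or more digits followed by one of `st`, `nd`, `rd` or `th`. The name `number_normalise_re` is hypothetical, and the pattern is case-sensitive, so it assumes lower-case suffixes.

```python
import re

# One or more digits followed by an ordinal suffix, e.g. "1st", "22nd", "3rd", "5th"
ORDINAL = re.compile(r"\d+(st|nd|rd|th)")

def number_normalise_re(tokens):
    """Hypothetical regex-based variant of number_normalise."""
    normalised = []
    for token in tokens:
        if token.isdigit():
            normalised.append("NUM")      # plain numbers
        elif ORDINAL.fullmatch(token):
            normalised.append("Nth")      # ordinals such as "3rd"
        else:
            normalised.append(token)      # everything else is left unchanged
    return normalised

result = number_normalise_re(["the", "3rd", "and", "22nd", "of", "7", "stand"])
print(result)
```

Because `fullmatch` requires the whole token to fit the pattern, words such as `"and"` or `"stand"` are left alone.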


### Exercise 1.2
- Complete the code in the cell below. You have just two lines to complete. The goal is to use a large sample of the Reuters corpus to establish the extent to which vocabulary size is reduced when number and case normalisation is applied.
- For each of the two incomplete lines you should ideally use nested list comprehensions. This is described in Section 5.1.4 in [this document](http://docs.python.org/tutorial/datastructures.html#list-comprehensions).  Alternatively, you could define functions which iterate over the sentences in each sample and the tokens within each sentence.
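
As a reminder of how a nested list comprehension works on a list of sentences, here is a minimal sketch on made-up data (the `sentences` list is an assumption for illustration). It applies a different transformation from the one the exercise asks for.

```python
# Hypothetical tokenised sentences: a list of lists of strings
sentences = [["Prices", "rose", "5", "pct"], ["Trade", "fell"]]

# The outer comprehension walks over the sentences; the inner one
# builds a new list from the tokens of each sentence
token_lengths = [[len(token) for token in sentence] for sentence in sentences]

print(token_lengths)
```

The result keeps the sentence structure: one inner list per sentence, one item per token.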


In [5]:
#the sample_sentences() function from the last lab will be useful here
#next week we will look at including useful functions like this in a utils python file which we can import from
import random

def sample_sentences(corpus,sample_size):
    
    size=len(corpus)
    ids=random.sample(range(size),sample_size) 
    sample=[corpus[i] for i in ids]  
    return sample

In [6]:
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token]=tok_counts.get(token,0)+1
    return len(tok_counts.keys())

  

sample_size = 10000

raw_sentences = sample_sentences(reuters.sents(),sample_size)

############################################
lowered_sentences = # complete this line
normalised_sentences = # complete this line

############################################

raw_vocab_size = vocabulary_size(raw_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))


SyntaxError: invalid syntax (<ipython-input-6-8efe6d4b2586>, line 15)

In [7]:
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token]=tok_counts.get(token,0)+1
    return len(tok_counts.keys())
    

sample_size = 10000

raw_sentences = sample_sentences(reuters.sents(),sample_size)

############################################
lowered_sentences = [[word.lower() for word in sentence] for sentence in raw_sentences]
normalised_sentences = [number_normalise(sentence) for sentence in lowered_sentences]
print(lowered_sentences[0])
print(normalised_sentences[0])
############################################

raw_vocab_size = vocabulary_size(raw_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))


['bass', 'group', 'cuts', 'national', 'distillers', '&', 'lt', ';', 'dr', '>', 'stake', 'an', 'investor', 'group', 'led', 'by', 'members', 'of', 'the', 'bass', 'family', 'of', 'fort', 'worth', ',', 'texas', ',', 'said', 'it', 'lowered', 'its', 'stake', 'in', 'national', 'distillers', 'and', 'chemical', 'corp', 'to', '1', ',', '159', ',', '400', 'shares', ',', 'or', '3', '.', '6', 'pct', 'of', 'the', 'total', 'common', ',', 'from', '1', ',', '727', ',', '200', ',', 'or', '5', '.', '3', 'pct', '.']
['bass', 'group', 'cuts', 'national', 'distillers', '&', 'lt', ';', 'dr', '>', 'stake', 'an', 'investor', 'group', 'led', 'by', 'members', 'of', 'the', 'bass', 'family', 'of', 'fort', 'worth', ',', 'texas', ',', 'said', 'it', 'lowered', 'its', 'stake', 'in', 'national', 'distillers', 'and', 'chemical', 'corp', 'to', 'NUM', ',', 'NUM', ',', 'NUM', 'shares', ',', 'or', 'NUM', '.', 'NUM', 'pct', 'of', 'the', 'total', 'common', ',', 'from', 'NUM', ',', 'NUM', ',', 'NUM', ',', 'or', 'NUM', '.', 'NU

## Stemming
A considerable amount of the lexical variation found in documents results from the use of morphological variants which we might not wish to distinguish - e.g. when determining the topic of a document. An easy way to remove these varied forms is to use a stemmer. NLTK includes a number of stemmers in the `nltk.stem` package.
- [NLTK stem module API](http://nltk.org/api/nltk.stem.html)

- [NLTK Porter stemmer](http://nltk.org/api/nltk.stem.html?highlight=stemmer#nltk.stem.porter.PorterStemmer)

- The code below shows how the NLTK implementation of the Porter stemmer in `nltk.stem.porter.PorterStemmer` stems a sample of sentences from the Reuters corpus.
- Have a close look at the differences between the columns. This will give you a good indication of what the stemmer does.
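
Before looking at corpus sentences, it may help to see a stemmer applied to a single family of morphological variants. This is a small sketch using the same `PorterStemmer` class; the word list is made up for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word family, chosen for illustration
variants = ["connect", "connected", "connecting", "connection", "connections"]

# Map each variant to its stem
stems = {word: stemmer.stem(word) for word in variants}
print(stems)

# Number of distinct types before and after stemming
print(len(set(variants)), len(set(stems.values())))
```

Variants that share a stem are counted as one type after stemming, which is why stemming can shrink the vocabulary.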

In [8]:
from nltk.stem.porter import PorterStemmer
 
st = PorterStemmer()

sample_size = 10

tokenised_sentences = sample_sentences(reuters.sents(),sample_size)

for sentence in tokenised_sentences:
  df = pd.DataFrame(list(zip_longest(sentence,[st.stem(token) for token in sentence])),columns=["BEFORE","AFTER"])
  display(df)

Unnamed: 0,BEFORE,AFTER
0,It,it
1,said,said
2,Rice,rice
3,Hall,hall
4,manages,manag
5,investments,invest
6,for,for
7,institutions,institut
8,and,and
9,individuals,individu


Unnamed: 0,BEFORE,AFTER
0,Iran,iran
1,against,against
2,United,unit
3,States,state
4,vessels,vessel
5,in,in
6,the,the
7,Gulf,gulf
8,",""",","""
9,Walters,walter


Unnamed: 0,BEFORE,AFTER
0,That,that
1,budget,budget
2,deficit,deficit
3,has,ha
4,meant,meant
5,that,that
6,the,the
7,U,u
8,.,.
9,S,s


Unnamed: 0,BEFORE,AFTER
0,Widdrington,widdrington
1,said,said
2,Labatt,labatt
3,','
4,s,s
5,three,three
6,-,-
7,year,year
8,business,busi
9,plan,plan


Unnamed: 0,BEFORE,AFTER
0,The,the
1,new,new
2,facility,facil
3,will,will
4,lower,lower
5,labor,labor
6,and,and
7,mill,mill
8,costs,cost
9,and,and


Unnamed: 0,BEFORE,AFTER
0,Abe,abe
1,','
2,s,s
3,visit,visit
4,is,is
5,to,to
6,prepare,prepar
7,for,for
8,Prime,prime
9,Minister,minist


Unnamed: 0,BEFORE,AFTER
0,Taiwan,taiwan
1,','
2,s,s
3,surplus,surplu
4,with,with
5,the,the
6,U,u
7,.,.
8,S,s
9,.,.


Unnamed: 0,BEFORE,AFTER
0,Bretz,bretz
1,said,said
2,the,the
3,sharp,sharp
4,rise,rise
5,in,in
6,the,the
7,growth,growth
8,of,of
9,new,new


Unnamed: 0,BEFORE,AFTER
0,The,the
1,firms,firm
2,will,will
3,apply,appli
4,to,to
5,form,form
6,the,the
7,cartel,cartel
8,to,to
9,the,the


Unnamed: 0,BEFORE,AFTER
0,Staley,staley
1,and,and
2,Archer,archer
3,Daniels,daniel
4,Midland,midland
5,soon,soon
6,.,.


### Exercise 2.1
- By looking at the impact on a large sample of the Reuters corpus, establish the extent to which vocabulary size is reduced by stemming.
- Write code to do this in the empty cell below. You should be able to re-use much of the code you wrote when measuring the impact of case and number normalisation.

In [9]:
sample_size=10000
tokenised_sentences = sample_sentences(reuters.sents(),sample_size)
normalised_sentences=[number_normalise(sentence) for sentence in tokenised_sentences]
stemmed_sentences = [[st.stem(token) for token in sentence] for sentence in normalised_sentences]
raw_vocab_size = vocabulary_size(tokenised_sentences)
stemmed_vocab_size = vocabulary_size(stemmed_sentences)
print("Stemming produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - stemmed_vocab_size)/raw_vocab_size,raw_vocab_size,stemmed_vocab_size))

Stemming produced a 46.94% reduction in vocabulary size from 20224 to 10731


### Exercise 2.2
* Try using the WordNetLemmatizer <code>nltk.stem.wordnet.WordNetLemmatizer</code> instead of the Porter Stemmer.
* Using a large sample of the Reuters corpus, establish the extent to which vocabulary size is reduced by lemmatization.
* As an extension, you could look at different sample sizes and/or different corpora and display the results in a table or graph (e.g., using <code>pandas</code> and/or <code>matplotlib</code>)

In [10]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/juliewe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:

sample_size=10000
lemm = WordNetLemmatizer()
tokenised_sentences = sample_sentences(reuters.sents(),sample_size)
normalised_sentences=[number_normalise(sentence) for sentence in tokenised_sentences]
lemmatised_sentences = [[lemm.lemmatize(token) for token in sentence] for sentence in normalised_sentences]
raw_vocab_size = vocabulary_size(tokenised_sentences)
lemmatised_vocab_size = vocabulary_size(lemmatised_sentences)
print("Lemmatizing produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - lemmatised_vocab_size)/raw_vocab_size,raw_vocab_size,lemmatised_vocab_size))

Lemmatizing produced a 11.44% reduction in vocabulary size from 20097 to 17797


### Punctuation and stop-word removal
A stopword is a word that occurs so often that it loses its usefulness in some tasks. We may get more meaningful information from our corpus analysis if we remove stopwords and punctuation.

The code below takes a list of tokens and creates a new list, which contains only those strings which are alphabetic and non-stop-words.

In [12]:
from nltk.corpus import stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juliewe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:

stop = stopwords.words('english')
tokens="The cat , which is really fat , sat on the mat".lower().split()
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop]
print(tokens)
print(filtered_tokens)

['the', 'cat', ',', 'which', 'is', 'really', 'fat', ',', 'sat', 'on', 'the', 'mat']
['cat', 'really', 'fat', 'sat', 'mat']


**Note**: `isalpha` only returns `True` if the string is non-empty and composed entirely of alphabetic characters. If you want a test that also returns `True` when a word contains digits, use `isalnum`.
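
The difference between the two methods can be checked on a few made-up tokens (the `examples` list is an assumption for illustration):

```python
# Made-up tokens covering the cases of interest
examples = ["cat", "b2b", "1998", ",", "well-known"]

# For each token, record (isalpha, isalnum)
results = {token: (token.isalpha(), token.isalnum()) for token in examples}

for token, (alpha, alnum) in results.items():
    print(token, alpha, alnum)
```

Note that a hyphenated token such as `"well-known"` fails both tests, so filtering with either method would also remove it.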

### Exercise 3.1
- In the empty cell below, write code that looks at a large sample of the Reuters corpus, establishing what proportion of tokens are stop-words.
- As an extension, you could establish the mean (and/or the distribution) of the number of stop-words per sentence, or compare the numbers of stop-words in different corpora.

In [14]:
num_stopwords = 0
num_tokens = 0
for sentence in tokenised_sentences:
    for token in sentence:
        num_tokens += 1
        if token in stop:
            num_stopwords += 1
############################################

#report the proportion of tokens that are stop-words and how many tokens remain once they are removed
print("Stop-words make up {0:.2f}% of tokens ({1} of {2}); removing them leaves {3} tokens".format(
    100*num_stopwords/num_tokens,num_stopwords,num_tokens,num_tokens - num_stopwords))

Stop-words make up 23.93% of tokens (75153 of 314009); removing them leaves 238856 tokens
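
One detail worth checking when counting stop-words: membership tests are case-sensitive and the NLTK English stop-word list is lower-case, so a token such as `"The"` is only counted if tokens are lower-cased first. The sketch below illustrates this with a tiny hand-picked stop set, which is an assumption standing in for `stopwords.words('english')`; the function name `stopword_proportion` is also hypothetical.

```python
# Tiny hand-picked stop set, standing in for the NLTK list (assumption)
stop_set = {"the", "is", "on", "which"}

demo_tokens = ["The", "cat", "is", "on", "the", "mat", "."]

def stopword_proportion(tokens, stop_set, lowercase=False):
    """Return the fraction of tokens that are in stop_set."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    hits = sum(1 for t in tokens if t in stop_set)
    return hits / len(tokens)

case_sensitive = stopword_proportion(demo_tokens, stop_set)
case_insensitive = stopword_proportion(demo_tokens, stop_set, lowercase=True)
print(case_sensitive, case_insensitive)
```

Using a `set` rather than a list for the stop-words also makes each membership test faster, which matters on a large sample.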


### Exercise 3.2
Explain the difference between the number of tokens in a corpus and the size of the vocabulary of a corpus.  Would you expect stopword removal to have a greater effect on the size of the vocabulary or the number of tokens in the corpus?

#### My answer
The number of tokens counts every occurrence of a token, whereas the size of the vocabulary counts distinct types: if the same token occurs $n$ times in the corpus, it adds $n$ to the number of tokens but only 1 to the size of the vocabulary. There is a relatively small number of stopwords, each of which occurs many times throughout a corpus. Removing stopwords will therefore have a much larger impact on the number of tokens in the corpus than on the vocabulary size.
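
The distinction can be checked on a toy example. This is a minimal sketch with an invented corpus and stop set (both are assumptions for illustration); `token_and_type_counts` is a hypothetical helper.

```python
from collections import Counter

# Toy corpus and stop set, invented for illustration
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "cat"]]
stop_set = {"the", "on"}

def token_and_type_counts(sentences):
    """Return (number of tokens, number of distinct types)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sum(counts.values()), len(counts)

# Remove stop-words while keeping the sentence structure
filtered = [[t for t in sentence if t not in stop_set] for sentence in corpus]

before = token_and_type_counts(corpus)
after = token_and_type_counts(filtered)
print(before, after)
```

Even in this tiny example the token count drops by half while the vocabulary loses only two types. In a real corpus the vocabulary is far larger than the stop-word list, so the relative effect on vocabulary size would be expected to be smaller still.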