# Week 2: Preprocessing Text (Part 2)


In [1]:
#necessary library imports and setup introduced previously

import sys
import re
import pandas as pd
#import matplotlib.pyplot as plt
#%matplotlib inline
from itertools import zip_longest
import nltk
from nltk.corpus import reuters
from nltk.tokenize import word_tokenize
#nltk.download('punkt')

In [3]:
##uncomment these lines below if working on colab
#from google.colab import drive
#drive.mount('/content/drive/')

## Overview
Remember, a raw text document is just a sequence of characters. A number of basic steps are often performed when processing natural language text, and this week's lab sessions cover some of these basic pre-processing methods. In the previous notebook, you looked at:
- <b>segmentation</b> - breaking down large units of text into smaller units such as documents and sentences; and
- <b>tokenisation</b> - roughly speaking, grouping characters into words.

In this notebook, you will be looking at:
- <b>case normalisation</b> - this involves converting all of the text into lower case;
- <b>stemming</b> - this involves removing a word's inflections to find the stem; and
- <b>punctuation and stop-word removal</b> - stop-words are common function words that in some situations can be ignored.

Note that we do not always apply all of the above preprocessing methods; it depends on the application. One of the things you will learn in this module is when each of these methods is, and is not, appropriate.

## Normalising text and removing unimportant tokens
In this next section we will consider several methods that pre-process (tokenised) text in ways that are sometimes helpful to 'downstream' processing.

### Number and case normalisation
Without any kind of normalisation, the tokens `"help"` and `"Help"` are two distinct types. In some contexts you may not want to distinguish them.

As another example, `"1998"` and `"1999"` count as distinct types, yet there are situations where there is no need to distinguish between different numbers.

The following code performs case normalisation and replaces tokens that consist of digits by "NUM".
- Python strings provide a [number of methods](http://docs.python.org/library/stdtypes.html#string-methods), which you can call to analyse a string's content or to produce new strings from it.
- The code uses a [list comprehension](http://docs.python.org/tutorial/datastructures.html#list-comprehensions) to build a new list by looping through and filtering items.

In [5]:
tokens = ["The","cake","is","a","LIE"]      #a list of tokens, some of which contain uppercase letters
print([token.lower() for token in tokens])   #print newly created list of all lowercase tokens

numbers = ['in', 'the', 'year', '120', 'of', 'the', 'fourth', 'age', ',', 'after', '120', 'years', 'as', 'king', ',' , 'aragorn', 'died', 'at', 'the', 'age', 'of', '210']
print(["NUM" if token.isdigit() else token for token in numbers])  #replace all number tokens with "NUM" in a new list of tokens

['the', 'cake', 'is', 'a', 'lie']
['in', 'the', 'year', 'NUM', 'of', 'the', 'fourth', 'age', ',', 'after', 'NUM', 'years', 'as', 'king', ',', 'aragorn', 'died', 'at', 'the', 'age', 'of', 'NUM']


### Exercise 1.1
- Write a function <code>number_normalise</code> which
    * replaces numbers with NUM;
    * and replaces tokens such as `"4th"`, `"1st"` and `"22nd"` with `"Nth"`.
- Test your code on the list `["Within","5","minutes",",","the", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]`.
- Check that the token `"and"` isn't changed to `"Nth"`.
- You will find [this page](http://docs.python.org/library/stdtypes.html#string-methods) useful.


In [9]:
tokens = ["Within","5","minutes",",","the", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]
def number_normalise(tokens):
    normalised1=["Nth" if (token.endswith(("nd","st","th","rd")) and token[:-2].isdigit()) else token for token in tokens]
    normalised2=["NUM" if token.isdigit() else token for token in normalised1]
    return normalised2

print(number_normalise(tokens))

['Within', 'NUM', 'minutes', ',', 'the', 'Nth', 'and', 'Nth', 'placed', 'runners', 'lapped', 'the', 'Nth', '.']
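
The same normalisation can also be sketched with regular expressions (the notebook imports `re` at the top). This is only one possible approach: the helper name `number_normalise_re` and the two patterns are assumptions made for illustration, not part of the exercise specification.

```python
import re

# Assumed patterns: a "number" is one or more ASCII digits, and an "ordinal"
# is one or more ASCII digits followed by st, nd, rd or th.
NUMBER_RE = re.compile(r"[0-9]+")
ORDINAL_RE = re.compile(r"[0-9]+(?:st|nd|rd|th)")

def number_normalise_re(tokens):
    """Replace digit-only tokens with "NUM" and ordinals such as "4th" with "Nth"."""
    normalised = []
    for token in tokens:
        if NUMBER_RE.fullmatch(token):
            normalised.append("NUM")
        elif ORDINAL_RE.fullmatch(token):
            normalised.append("Nth")
        else:
            normalised.append(token)  # leave every other token untouched
    return normalised

tokens = ["Within", "5", "minutes", ",", "the", "1st", "and", "2nd",
          "placed", "runners", "lapped", "the", "5th", "."]
print(number_normalise_re(tokens))
```

Because `fullmatch` requires the whole token to fit the pattern, `"and"` is left alone even though it ends in `"nd"`. Note that `[0-9]` only matches ASCII digits, whereas `str.isdigit()` also accepts some other Unicode digit characters, so the two versions can differ on unusual input.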


### Exercise 1.2
- Complete the code in the cell below. You have just two lines to complete. The goal is to use a large sample of the Reuters corpus to establish the extent to which vocabulary size is reduced when number and case normalisation is applied.
- For each of the two incomplete lines you should ideally use nested list comprehensions. This is described in Section 5.1.4 in [this document](http://docs.python.org/tutorial/datastructures.html#list-comprehensions).  Alternatively, you could define functions which iterate over the sentences in each sample and the tokens within each sentence.
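
As a small illustration of the nested list comprehension pattern, the sketch below applies it to a toy list of sentences (invented here, not taken from Reuters) and compares it with the equivalent explicit loops.

```python
# Toy data standing in for a corpus sample: a list of sentences,
# where each sentence is a list of tokens.
sentences = [["The", "cake", "is", "a", "LIE"],
             ["In", "1998", "and", "1999"]]

# Nested comprehension: the outer part visits each sentence,
# the inner part visits each token within that sentence.
lowered = [[token.lower() for token in sentence] for sentence in sentences]

# The same result written with explicit loops.
lowered_loops = []
for sentence in sentences:
    new_sentence = []
    for token in sentence:
        new_sentence.append(token.lower())
    lowered_loops.append(new_sentence)

# A conditional expression can sit inside the inner comprehension.
normalised = [["NUM" if token.isdigit() else token for token in sentence]
              for sentence in lowered]

print(lowered)
print(normalised)
```

The result keeps the sentence structure (a list of lists), which is what a function such as `vocabulary_size` expects as input.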


In [13]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\efemi\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

In [15]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\efemi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [17]:
#the sample_sentences() function from the last lab will be useful here
#next week we will look at including useful functions like this in a utils python file which we can import from
import random

def sample_sentences(corpus,sample_size):

    size=len(corpus)
    ids=random.sample(range(size),sample_size)
    sample=[corpus[i] for i in ids]
    return sample

#in vocabulary_size(), we use a dictionary to store the frequency of each token type in the corpus
#the number of keys in this dictionary is the size of the vocab
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token]=tok_counts.get(token,0)+1
    return len(tok_counts.keys())



sample_size = 10000

raw_sentences = sample_sentences(reuters.sents(),sample_size)

############################################
lowered_sentences = [[token.lower() for token in sentence] for sentence in raw_sentences]
normalised_sentences = [["NUM" if token.isdigit() or (token.endswith(('st', 'nd', 'rd', 'th')) and token[:-2].isdigit()) else token for token in sentence] for sentence in lowered_sentences]

############################################

raw_vocab_size = vocabulary_size(raw_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))


Normalisation produced a 28.25% reduction in vocabulary size from 20351 to 14601
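
The vocabulary size measured above is the number of distinct token types, which is different from the total number of tokens. A minimal sketch of the distinction, using `collections.Counter` on toy sentences (the helper name `token_and_type_counts` and the data are invented for illustration):

```python
from collections import Counter

def token_and_type_counts(sentences):
    """Return (total number of tokens, number of distinct token types)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sum(counts.values()), len(counts)

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["The", "mat", "was", "flat"]]

n_tokens, n_types = token_and_type_counts(sentences)

# Case normalisation merges "The" with "the", so the number of types drops
# while the number of tokens stays the same.
lowered = [[token.lower() for token in sentence] for sentence in sentences]
n_tokens_low, n_types_low = token_and_type_counts(lowered)

reduction = 100 * (n_types - n_types_low) / n_types
print(n_tokens, n_types)
print(n_tokens_low, n_types_low)
print("{0:.2f}% reduction in vocabulary size".format(reduction))
```

The percentage is computed with the same formula as in the cell above: the difference between the two vocabulary sizes divided by the original vocabulary size.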


## Stemming
A considerable amount of the lexical variation found in documents results from the use of morphological variants which we might not wish to distinguish - e.g. when determining the topic of a document. An easy way to remove these varied forms is to use a stemmer. NLTK includes a number of stemmers in the `nltk.stem` package.
- [NLTK stem module API](https://www.nltk.org/api/nltk.stem.html)

- [NLTK Porter stemmer](https://www.nltk.org/api/nltk.stem.porter.html)

- The code below shows how the NLTK implementation of the Porter stemmer, `nltk.stem.porter.PorterStemmer`, stems a sample of sentences from the Reuters corpus.
- Have a close look at the differences between the columns. This will give you a good indication of what the stemmer does.

In [19]:
from nltk.stem.porter import PorterStemmer

st = PorterStemmer()

sample_size = 10

tokenised_sentences = sample_sentences(reuters.sents(),sample_size)

for sentence in tokenised_sentences:
    df = pd.DataFrame(list(zip_longest(sentence,[st.stem(token) for token in sentence])),columns=["BEFORE","AFTER"])
    display(df)

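| | BEFORE | AFTER |
|---|---|---|
| 0 | Johnson | johnson |
| 1 | Geneva | geneva |
| 2 | said | said |
| 3 | the | the |
| 4 | buy | buy |
| 5 | out | out |
| 6 | was | wa |
| 7 | accomplished | accomplish |
| 8 | through | through |
| 9 | & | & |


| | BEFORE | AFTER |
|---|---|---|
| 0 | " | " |
| 1 | The | the |
| 2 | sentiment | sentiment |
| 3 | in | in |
| 4 | the | the |
| 5 | market | market |
| 6 | is | is |
| 7 | bullish | bullish |
| 8 | and | and |
| 9 | I | i |


| | BEFORE | AFTER |
|---|---|---|
| 0 | PHLCORP | phlcorp |
| 1 | & | & |
| 2 | lt | lt |
| 3 | ; | ; |
| 4 | PHX | phx |
| ... | ... | ... |
| 61 | dlrs | dlr |
| 62 | in | in |
| 63 | tax | tax |
| 64 | credits | credit |


| | BEFORE | AFTER |
|---|---|---|
| 0 | IPCO | ipco |
| 1 | CORP | corp |
| 2 | & | & |
| 3 | lt | lt |
| 4 | ; | ; |
| 5 | IHS | ih |
| 6 | > | > |
| 7 | SETS | set |
| 8 | REGULAR | regular |
| 9 | PAYOUT | payout |


| | BEFORE | AFTER |
|---|---|---|
| 0 | Low | low |
| 1 | yields | yield |
| 2 | from | from |
| 3 | the | the |
| 4 | country | countri |
| 5 | ' | ' |
| 6 | s | s |
| 7 | ageing | age |
| 8 | coffee | coffe |
| 9 | plantations | plantat |


| | BEFORE | AFTER |
|---|---|---|
| 0 | BRITAIN | britain |
| 1 | CALLS | call |
| 2 | ON | on |
| 3 | JAPAN | japan |
| 4 | TO | to |
| 5 | INCREASE | increas |
| 6 | IMPORTS | import |
| 7 | Britain | britain |
| 8 | today | today |
| 9 | called | call |


| | BEFORE | AFTER |
|---|---|---|
| 0 | USAir | usair |
| 1 | was | wa |
| 2 | halted | halt |
| 3 | on | on |
| 4 | the | the |
| 5 | New | new |
| 6 | York | york |
| 7 | Stock | stock |
| 8 | EXcahnge | excahng |
| 9 | for | for |


| | BEFORE | AFTER |
|---|---|---|
| 0 | It | it |
| 1 | said | said |
| 2 | the | the |
| 3 | debt | debt |
| 4 | was | wa |
| 5 | part | part |
| 6 | of | of |
| 7 | a | a |
| 8 | 33 | 33 |
| 9 | . | . |


| | BEFORE | AFTER |
|---|---|---|
| 0 | The | the |
| 1 | crop | crop |
| 2 | reporting | report |
| 3 | board | board |
| 4 | said | said |
| 5 | the | the |
| 6 | estimates | estim |
| 7 | for | for |
| 8 | the | the |
| 9 | 1986 | 1986 |


| | BEFORE | AFTER |
|---|---|---|
| 0 | 1986 | 1986 |
| 1 | 4th | 4th |
| 2 | qtr | qtr |
| 3 | includes | includ |
| 4 | 3 | 3 |
| 5 | . | . |
| 6 | 5 | 5 |
| 7 | mln | mln |
| 8 | dlr | dlr |
| 9 | provision | provis |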
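
To see why stemming shrinks the vocabulary, the sketch below uses a deliberately crude suffix stripper. It is not the Porter algorithm and does not reproduce its output; `toy_stem`, its suffix list and its length threshold are all invented here purely to illustrate how several surface forms collapse onto one stem.

```python
def toy_stem(token):
    """Lowercase a token and strip at most one common suffix.

    A toy illustration only, NOT the Porter algorithm.
    """
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = ["CALLS", "called", "calling", "call", "yields", "yield"]
stems = [toy_stem(token) for token in tokens]

print(stems)
print(len(set(tokens)), "types before,", len(set(stems)), "types after")
```

Six distinct types collapse to two stems here. A real stemmer applies many more rules, and its stems need not be words: in the output above, the Porter stemmer maps `was` to `wa` and `country` to `countri`, whereas this toy version would leave both unchanged.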


### Exercise 2.1
- By looking at the impact on a large sample of the Reuters corpus, establish the extent to which vocabulary size is reduced by stemming.
- Write code to do this in the empty cell below. You should be able to re-use much of the code you used when measuring the impact of case and number normalisation.

In [21]:
import random

def sample_sentences(corpus,sample_size):

    size=len(corpus)
    ids=random.sample(range(size),sample_size)
    sample=[corpus[i] for i in ids]
    return sample

#in vocabulary_size(), we use a dictionary to store the frequency of each token type in the corpus
#the number of keys in this dictionary is the size of the vocab
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token]=tok_counts.get(token,0)+1
    return len(tok_counts.keys())



sample_size = 10000

raw_sentences = sample_sentences(reuters.sents(),sample_size)

############################################
normalised_sentences=[number_normalise(sentence) for sentence in raw_sentences]
stem_sentences = [[st.stem(token) for token in sentence] for sentence in normalised_sentences]


############################################

raw_vocab_size = vocabulary_size(raw_sentences)
stem_vocab_size = vocabulary_size(stem_sentences)
print("Stemming produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - stem_vocab_size)/raw_vocab_size,raw_vocab_size,stem_vocab_size))

Stemming produced a 47.30% reduction in vocabulary size from 20208 to 10649


### Exercise 2.2
* Try using the WordNetLemmatizer <code>nltk.stem.wordnet.WordNetLemmatizer</code> instead of the Porter stemmer.
* Using a large sample of the Reuters corpus, establish the extent to which vocabulary size is reduced by lemmatisation.
* As an extension, you could look at different sample sizes and/or different corpora and display the results in a table or graph (e.g., using <code>pandas</code> and/or <code>matplotlib</code>)

In [30]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\efemi\AppData\Roaming\nltk_data...


True

In [32]:

from nltk.stem.wordnet import WordNetLemmatizer
import random

wnl = WordNetLemmatizer()

def sample_sentences(corpus,sample_size):

    size=len(corpus)
    ids=random.sample(range(size),sample_size)
    sample=[corpus[i] for i in ids]
    return sample

#in vocabulary_size(), we use a dictionary to store the frequency of each token type in the corpus
#the number of keys in this dictionary is the size of the vocab
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token]=tok_counts.get(token,0)+1
    return len(tok_counts.keys())



sample_size = 10000

raw_sentences = sample_sentences(reuters.sents(),sample_size)

############################################
normalised_sentences=[number_normalise(sentence) for sentence in raw_sentences]
lemma_sentences = [[wnl.lemmatize(token) for token in sentence] for sentence in normalised_sentences]


############################################

raw_vocab_size = vocabulary_size(raw_sentences)
lemma_vocab_size = vocabulary_size(lemma_sentences)
print("Lemmatisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - lemma_vocab_size)/raw_vocab_size,raw_vocab_size,lemma_vocab_size))


Lemmatisation produced a 11.45% reduction in vocabulary size from 20153 to 17846


### Punctuation and stop-word removal
A stopword is a word that occurs so often that it loses its usefulness in some tasks. We may get more meaningful information from our corpus analysis if we remove stopwords and punctuation.

The code below takes a list of tokens and creates a new list, which contains only those strings which are alphabetic and non-stop-words.

In [35]:
from nltk.corpus import stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\efemi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
stop = stopwords.words('english')
tokens="The cat , which is really fat , sat on the mat".lower().split()
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop]
print(tokens)
print(filtered_tokens)

['the', 'cat', ',', 'which', 'is', 'really', 'fat', ',', 'sat', 'on', 'the', 'mat']
['cat', 'really', 'fat', 'sat', 'mat']


**Note**: `isalpha` only returns `True` if the string is entirely composed of alphabetic characters. If you want a function that returns `True` even when a word contains digits, use `isalnum`.
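
A quick way to compare the string tests is to filter the same token list with each of them. The token list below is invented for illustration.

```python
tokens = ["cat", "4th", "1998", ",", "mid-1980s", "B2B"]

# isalpha: every character must be a letter.
alpha_only = [t for t in tokens if t.isalpha()]

# isalnum: every character must be a letter or a digit.
alnum = [t for t in tokens if t.isalnum()]

# isdigit: every character must be a digit.
digits = [t for t in tokens if t.isdigit()]

print(alpha_only)
print(alnum)
print(digits)
```

Tokens containing punctuation, such as `","` or the hyphenated `"mid-1980s"`, fail all three tests, so filtering with either `isalpha` or `isalnum` also removes them.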

### Exercise 3.1
- In the empty cell below, write code that looks at a large sample of the Reuters corpus, establishing what proportion of tokens are stop-words.
- As an extension, you could establish the mean (and/or the distribution) of the number of stop-words per sentence, or compare the numbers of stop-words in different corpora.

In [49]:
sample_size = 10000

raw_sentences = sample_sentences(reuters.sents(),sample_size)

normalised_sentences=[number_normalise(sentence) for sentence in raw_sentences]

wnl = WordNetLemmatizer()

lemma_sentences = [[wnl.lemmatize(token) for token in sentence] for sentence in normalised_sentences]

stop_word_exc = [[w for w in sentence if w.isalpha() and w not in stop ] for sentence in lemma_sentences]
stop_word_exc[:5]

[['It',
  'said',
  'might',
  'take',
  'legal',
  'action',
  'seek',
  'support',
  'shareholder',
  'calling',
  'special',
  'meeting',
  'replace',
  'board',
  'consider',
  'proposal'],
 ['A',
  'high',
  'command',
  'communique',
  'said',
  'warplane',
  'hit',
  'western',
  'jetty',
  'Iran',
  'Kharg',
  'island',
  'oil',
  'terminal',
  'afternoon',
  'struck',
  'supertanker',
  'nearby',
  'time'],
 ['As',
  'result',
  'net',
  'income',
  'first',
  'quarter',
  'reduced',
  'NUM',
  'mln',
  'dlrs'],
 ['CORN',
  'SOYBEANS',
  'TOLEDO',
  'NUM',
  'UND',
  'MAY',
  'UNC',
  'NUM',
  'UND',
  'MAY',
  'UNC',
  'CINCINNATI',
  'NUM',
  'UND',
  'MAY',
  'UNC',
  'NUM',
  'OVR',
  'MAY',
  'UP',
  'NUM',
  'NEW',
  'HAVEN',
  'NUM',
  'UND',
  'MAY',
  'UNC',
  'NUM',
  'UND',
  'MAY',
  'DN',
  'NUM',
  'N',
  'E'],
 ['The',
  'federal',
  'government',
  'drew',
  'NUM',
  'NUM',
  'billion',
  'mark',
  'Bundesbank',
  'cash',
  'deposit',
  'stood',
  'NUM',
  'NUM

In [51]:
# Count all tokens and the tokens that are stop-words, to establish what proportion of tokens are stop-words.
# Note: the membership test is case-sensitive, so capitalised forms such as "The" are not counted as stop-words.
number_token = 0
number_stop = 0

for sentence in lemma_sentences:
    for token in sentence:
        number_token += 1
        if token in stop:
            number_stop += 1

print(number_token)
print(number_stop)

315526
73645
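
The two counts above can be turned into a proportion directly. The sketch below also shows, on a toy example, how a case-sensitive membership test can undercount stop-words. The small `toy_stop` set is invented for illustration and is not NLTK's English stop-word list.

```python
# Hypothetical, tiny stop-word set for illustration only.
toy_stop = {"the", "is", "on", "which"}
tokens = ["The", "cat", ",", "which", "is", "really", "fat", ",",
          "sat", "on", "the", "mat"]

# Case-sensitive test: "The" is missed because the set holds lowercase forms.
case_sensitive = sum(1 for t in tokens if t in toy_stop)

# Case-insensitive test: lowercase each token before the lookup.
case_insensitive = sum(1 for t in tokens if t.lower() in toy_stop)

print(case_sensitive, "of", len(tokens))
print(case_insensitive, "of", len(tokens))

# Proportion implied by the two counts printed by the cell above.
total_tokens, stop_tokens = 315526, 73645
proportion = 100 * stop_tokens / total_tokens
print("{0:.2f}% of tokens are stop-words".format(proportion))
```

On the recorded counts this gives roughly 23.34%. Since those counts come from a case-sensitive test on tokens that were not lowercased, the true proportion of stop-words in the sample is probably somewhat higher.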


### Exercise 3.2
Explain the difference between the number of tokens in a corpus and the size of the vocabulary of a corpus.
Would you expect stopword removal to have a greater effect on the size of the vocabulary or the number of tokens in the corpus?

In [None]:
#### My answer
