# Topic 1: Preprocessing Text

## Preliminaries 
Run this cell.

In [1]:
import sys
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import collections
from collections import defaultdict,Counter
from itertools import zip_longest
from IPython.display import display
from random import seed
import random
import math
import matplotlib.pylab as pylab
%matplotlib inline
params = {'legend.fontsize': 'large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'large',
         'axes.titlesize':'large',
         'xtick.labelsize':'large',
         'ytick.labelsize':'large'}
pylab.rcParams.update(params)
from pylab import rcParams
from operator import itemgetter, attrgetter, methodcaller
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns
import csv

## Overview 
A raw text document is just a sequence of characters. There are a number of basic steps that are often performed when processing natural language text. In this lab session we will cover some of the basic text pre-processing methods. In particular, you will be looking at:
- <b>tokenisation</b> - roughly speaking, this involves grouping characters into words;
- <b>case normalisation</b> - this involves converting all of the text into lower case; 
- <b>stemming</b> - this involves removing a word's inflections to find the stem; and 
- <b>punctuation and stop-word removal</b> - stop-words are common function words that in some situations can be ignored.

Note that we do not always apply all of the above preprocessing methods; it depends on the application. One of the things that you will be learning about in this module is when applying each of these methods is, and is not, appropriate.
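
As a rough illustration of how some of these steps chain together, here is a minimal sketch using only the Python standard library. The function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical and chosen for illustration only. Stemming is left out because it needs a proper stemmer, which is covered later in this notebook.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

def preprocess(text):
    """Tokenise on whitespace, lower-case, then drop punctuation and stop-words."""
    tokens = text.split()                                    # tokenisation (whitespace only)
    tokens = [t.lower() for t in tokens]                     # case normalisation
    tokens = [t.strip(string.punctuation) for t in tokens]   # strip leading/trailing punctuation
    return [t for t in tokens if t and t not in STOP_WORDS]  # remove empties and stop-words

print(preprocess("The velocity of an unladen swallow is unknown."))
# ['velocity', 'unladen', 'swallow', 'unknown']
```

Whether each of these steps is a good idea depends on the application, so treat the order and the choices here as one possibility rather than a recipe.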

### Available corpora
We have provided simple interfaces to each of the following corpora, which interact well with NLTK tools.

- The NLTK texts
- Amazon product reviews (~78k documents, ~640k sentences)
- Wall Street Journal text (~2k documents, ~51k sentences)
- Reuters articles (~61k documents, ~740k sentences)
  - Reuters / Finance (~47k documents, ~550k sentences)
  - Reuters / Sport (~13k documents, ~185k sentences)
- Medline abstracts (~985k documents, ~6100k sentences)
- Twitter posts (~962k documents, ~1720k sentences)

## Getting raw sentences from a corpus
The corpora are too large to easily process with some of the functions you will be using, so we have provided a way for you to work on a randomly selected sample of each corpus.

The Reuters, Twitter and Medline corpora have a function called <code style="background-color: #F5F5F5;">sample_raw_sents</code>, which returns a specified number of random sentences, where each sentence is an un-tokenised string.

The code in the next cell shows you how to iterate over a random sample of 10 sentences. When you are using a tokeniser, you will replace
`# do something with sentence`
with code that tokenises each sentence and prints the results.

In [2]:
from sussex_nltk.corpus_readers import ReutersCorpusReader

rcr = ReutersCorpusReader()    #Create a new reader

sample_size = 10

candidate_number = 123456

for sentence in rcr.sample_raw_sents(sample_size): #get a sample of random sentences, where each sentence is a string
    # do something with sentence
    pass    # placeholder so that the loop body is valid Python; replace it with your own code

### Exercise

- Make a copy of the cell above and move the copied cell so that it is positioned below this cell. 
- Adapt the code in the new cell so that it prints a sample of **20** sentences from the **Twitter** corpus.

In [3]:
from sussex_nltk.corpus_readers import TwitterCorpusReader

tcr = TwitterCorpusReader()    #Create a new reader

sample_size = 20

for sentence in tcr.sample_raw_sents(sample_size): #get a sample of random sentences, where each sentence is a string
    print(sentence)


Sussex NLTK root directory is \\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources
New google doodle is fun! They've really upped their game for #olympics2012 #awfulpun
@KimConley amazing job in the 5K Semi-Finals at The Olympics
These brothers are freaking awesome #TeamGB #Triathlon
Imagine if all the Caribbean islands united as a team for the Olympics. The domination would be like China with the Indoor events.
This pretty much sums it up the feelings of a nation. @cathalkelly terrific piece on last night's soccer game.  http://t.co/xxBXWAVJ
Ma watching Olympics women's volleyball. Says she used to play. I is amazed. Pa says they used to play on same team in uni. LAGI AMAZED.
GG lagi ning CHINA WOMEN'S VOLLEYBALL TEAM :X   #olympics
It looks like Water Polo's version of a footy dive is to reenact the opening scene of Jaws. #london2012
These male #gymnasts have me all sorts of hot and bothered #Olympics
#London2012 Penalty corner to Argentina, the

### Exercise

- Point your browser at [Sussex NLTK package documentation](http://www.sussex.ac.uk/Users/davidw/courses/nle/SussexNLTK-API/) and have a look around. This provides information about the above corpora. Take a particularly careful look at the [corpus_readers Module](http://www.sussex.ac.uk/Users/davidw/courses/nle/SussexNLTK-API/sussex_nltk.html#module-sussex_nltk.corpus_readers)

### Exercise

- In the code cell below write code that will establish whether there are systematic differences between the  average sentence length (as measured in terms of the number of characters in the sentence) of the sentences in the Reuters, Twitter and Medline corpora.

In [11]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from sussex_nltk.corpus_readers import TwitterCorpusReader
from sussex_nltk.corpus_readers import MedlineCorpusReader

rcr = ReutersCorpusReader()
tcr = TwitterCorpusReader()
mcr = MedlineCorpusReader()

sample_size = 1000

R = 0
T = 0
M = 0

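# Note: len(sentence.split()) counts whitespace-separated tokens, so the figures
# below are in tokens per sentence; use len(sentence) for the length in characters.
for sentence in rcr.sample_raw_sents(sample_size):
    R += len(sentence.split())
for sentence in tcr.sample_raw_sents(sample_size):
    T += len(sentence.split())
for sentence in mcr.sample_raw_sents(sample_size):
    M += len(sentence.split())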
    
ASL_R = R/sample_size
ASL_T = T/sample_size
ASL_M = M/sample_size

ddict = {"Corpus": ["Reuters", "Twitter", "Medline"], "ASL": [ASL_R, ASL_T, ASL_M]}

display(pd.DataFrame(ddict, columns = ["Corpus", "ASL"]))

|   | Corpus | ASL |
|---|---|---|
| 0 | Reuters | 14.579 |
| 1 | Twitter | 14.496 |
| 2 | Medline | 22.024 |
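
Sentence length can be measured in characters (as the exercise asks) or in whitespace-separated tokens, and the two give quite different numbers. The sketch below compares them on a small hand-made list of strings. The helper name `average_length` and its `unit` parameter are hypothetical, and it assumes the sentences are held in a list (wrap an iterator in `list(...)` first).

```python
def average_length(sentences, unit="chars"):
    """Mean sentence length in characters or in whitespace-separated tokens."""
    if not sentences:
        return 0.0
    if unit == "chars":
        total = sum(len(s) for s in sentences)          # count characters
    else:
        total = sum(len(s.split()) for s in sentences)  # count tokens
    return total / len(sentences)

sample = ["Shares fell sharply.", "Go #TeamGB !"]
print(average_length(sample, "chars"))    # 16.0
print(average_length(sample, "tokens"))   # 3.0
```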


In [None]:
# uncomment the next line and then run the cell to load a partial solution
#%load solutions/average_sentence_length_part

In [None]:
# %load solutions/average_sentence_length
from sussex_nltk.corpus_readers import MedlineCorpusReader
from sussex_nltk.corpus_readers import TwitterCorpusReader
from sussex_nltk.corpus_readers import ReutersCorpusReader

rcr = ReutersCorpusReader()    #Create a new reader
tcr = TwitterCorpusReader()    #Create a new reader
mcr = MedlineCorpusReader()    #Create a new reader

samplesize = 1000

TSL_R = 0 #initialise reuters total sentence length variable
TSL_T = 0 #initialise twitter total sentence length variable
TSL_M = 0 #initialise medline total sentence length variable
   
for sentence in rcr.sample_raw_sents(samplesize): 
    TSL_R += len(sentence)
for sentence in tcr.sample_raw_sents(samplesize): 
    TSL_T += len(sentence)
for sentence in mcr.sample_raw_sents(samplesize): 
    TSL_M += len(sentence)

ASL_Reuters = TSL_R/samplesize
ASL_Twitter = TSL_T/samplesize
ASL_Medline = TSL_M/samplesize

# A Pandas dataframe is a convenient way to display the average sentence length (ASL) of each corpus in a table. 
 
# Create a dictionary.
# There is a key for each column - in this case we have two columns, 'Corpus' and 'ASL'.
# The value of each key is a list of the values for each row of the corresponding column.
# The lists need to have the same length, corresponding to the number of rows in the table.

datadict = {'Corpus' : ['Reuters','Twitter','Medline'],
            'ASL' : [ASL_Reuters,ASL_Twitter,ASL_Medline]}

# Make a dataframe from the dictionary.
# The columns parameter allows us to specify the order of the columns.
# In older versions of pandas the columns would otherwise appear in alphabetical order of their key.

df = pd.DataFrame(datadict,columns=['Corpus','ASL'])

display(df)


## DIY Tokenisation with Regular Expressions
Text doesn't come in neat tokens ready for analysis; it must first undergo sentence segmentation and tokenisation.  
We have already sentence-segmented the corpora.  
In this lab you will be focusing on tokenisation. In particular, you will be comparing the merits of the following tokenisers:  
- Your own regular expression based tokeniser
- The (NLTK implemented) Penn Treebank style regular expression based tokeniser
- A Twitter-specific CMU tokeniser

### Issues to consider
Your goal when working through this next section should be to investigate the strengths and weaknesses of each of the 3 tokenisers on three rather different kinds of corpora: 
- the Reuters corpus, 
- the Twitter corpus and 
- the Medline corpus.

### Making your own tokeniser
In this section, you will write your own Python function, which takes as input a single string representing a sentence, and returns a <b>list of strings</b> obtained by splitting the sentence into tokens.

Let's start by simply splitting by whitespace. 

In [12]:
print("   What    is the    air-speed   velocity of  an unladen swallow?   ".split()) 

['What', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow?']
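
Calling `split()` with no argument treats any run of whitespace as a single separator and ignores leading and trailing whitespace, which is why the output above contains no empty strings. Passing an explicit separator behaves differently, as this small sketch shows:

```python
text = "  air-speed   velocity "

# No argument: runs of whitespace are collapsed and the ends are trimmed.
print(text.split())       # ['air-speed', 'velocity']

# Explicit single-space separator: every space splits, producing empty strings.
print(text.split(" "))    # ['', '', 'air-speed', '', '', 'velocity', '']
```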


### Exercise

- In the empty code cell below write a [function](http://docs.python.org/tutorial/controlflow.html#defining-functions), `tokenise` which takes a sentence as input and returns a list of the tokens making up the sentence. Your first version of this function should tokenise only on whitespace, as shown in the cell above. Show that your function works on the sentence shown above.


In [13]:
def tokenise(text):
    return text.split()

In [14]:
# %load solutions/simple_tokenise
def tokenise(sentence):
    return sentence.split()

print(tokenise(' What is the    air-speed . velocity of  an unladen swallow?   '))


['What', 'is', 'the', 'air-speed', '.', 'velocity', 'of', 'an', 'unladen', 'swallow?']


### Exercise

- In the empty code cell below write code that applies your tokenise function to each sentence in a sample of 30 sentences taken from  the Reuters, Twitter and Medline corpora, 10 sentences from each corpus.

In [16]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from sussex_nltk.corpus_readers import TwitterCorpusReader
from sussex_nltk.corpus_readers import MedlineCorpusReader

rcr = ReutersCorpusReader()
tcr = TwitterCorpusReader()
mcr = MedlineCorpusReader()

sent = 10

for sentence in rcr.sample_raw_sents(sent):
    print(tokenise(sentence))
for sentence in tcr.sample_raw_sents(sent):
    print(tokenise(sentence))
for sentence in mcr.sample_raw_sents(sent):
    print(tokenise(sentence))

['DELIVERY:', '45', 'days', 'ORDERS:']
['SUMMARY', 'NOTICE', 'OF', 'SALE']
['But', 'on', 'Monday,', 'confusion', 'surrounded', 'the', 'plan,', 'depositors', 'amd', 'company', 'officials', 'said.']
['Downey', 'Unified', 'School', 'District']
['PSP', '=', 'Postipankki', '(May', '27)']
['PSBR', '(in', 'million', 'stg)', 'APRIL', 'MARCH', 'APRIL', '96']
['--$21.039', 'million', 'Plain', 'LSD,', 'G.O.']
['A', 'booming', 'U.S.', 'economy', 'drew', 'in', 'record', 'imports', 'during', 'February,', 'the', 'Commerce', 'Department', 'said', 'on', 'Thursday,', 'slowing', 'improvement', 'in', 'the', 'monthly', 'deficit', 'and', 'spotlighting', 'trade', 'tensions', 'with', 'China.']
['The', 'tax', 'bill', 'includes', 'a', '$500-per-child', 'credit,', 'a', 'reduction', 'in', 'the', 'capital', 'gains', 'rate', 'and', 'new', 'tax', 'incentives', 'for', 'for', 'higher', 'education.']
['Analysts', 'and', 'politicians', 'agreed', 'the', 'cachet', 'of', 'membership', 'in', 'exclusive', 'Western', 'clubs',

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions/tokenise_samples

In most tokenisation policies (e.g. in the Wall Street Journal corpus), contractions like "I'm" tend to be split into "I" and "'m".  

When it comes to more than just splitting by whitespace, it can be convenient to use [regular expressions](http://docs.python.org/library/re.html) to process the string in some way. The following code cell illustrates this. Try running it and then read on to discover how it works.

In [14]:
print(re.sub("([.?!'])", " \g<1>", "You're using coconuts!").split())   

['You', "'re", 'using', 'coconuts', '!']


Let's look at how the above code works by breaking it down.  

First, run the following cell.

In [15]:
print(re.sub("'", " '", "You're using coconuts!")   )

You 're using coconuts!


As you can see, this code takes the string "You're using coconuts!" and inserts a space before the apostrophe, the `'` character. 

Let's see how it works...

The first argument of `re.sub`, i.e. `"'"`, is a regular expression that in this case is extremely simple, since it only matches the apostrophe character, `'`.

The second argument of `re.sub`, where we see `" '"`, indicates that an apostrophe should be substituted by a space followed by an apostrophe.

Now let's make it slightly more complicated. We also want to insert a space before the `"!"`, so let's look at how to do that. 

Run the following code cell.

In [22]:
print(re.sub("(['!])", " \g<1>", "You're using coconuts!")   )

You 're using coconuts !


The first argument of `re.sub` has been changed to `"(['!])"`, which is a regular expression that matches either an apostrophe, `'`, or an exclamation mark, `!`.

This is achieved with the regular expression `"['!]"`, where the square brackets enclose the alternative characters. 

Why does the regular expression contain parentheses? 

It has to do with what we need to put as the second argument of `re.sub` where the substitution is specified. 

To understand this, you need to appreciate that we want to add a space before an apostrophe and also a space before an exclamation mark. How can we specify that in the second argument of `re.sub`? 

The answer is that we need to make use of the idea of a **group**.

The parentheses in `"(['!])"` define the start and end of a group. In this case the whole regular expression is a group. In general, however, there can be several sets of parentheses defining several groups. For example, the regular expression `"([Tt]h)e (m*n)"` has two groups. Groups are numbered from left to right, so the group in the regular expression `"(['!])"` is group 1. 

Defining this group allows us to refer to the string that matches the regular expression `"(['!])"`, which will be either an apostrophe or an exclamation mark. This is then used in the second argument of `re.sub`, where we see `" \g<1>"`, which indicates that the material that matches the apostrophe or exclamation mark should be substituted by a space followed by the symbol that was matched. The `1` in `\g<1>` tells us that it refers to group 1.
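
To see group numbering in action, the following sketch uses the two-group pattern mentioned above on a made-up string. It writes the patterns as raw strings (`r"..."`), which is an addition to the style used in this notebook: raw strings stop Python itself from interpreting the backslash, and recent Python versions warn about `\g` inside an ordinary string.

```python
import re

# Two groups: group 1 is "Th" or "th", group 2 is zero or more "m" followed by "n".
pattern = r"([Tt]h)e (m*n)"

match = re.search(pattern, "See the mmn here")
print(match.group(1))   # th
print(match.group(2))   # mmn

# Refer to the groups in the replacement to swap them around.
print(re.sub(pattern, r"\g<2> \g<1>e", "See the mmn here"))   # See mmn the here

# The single-group case from the text above, with raw strings.
print(re.sub(r"(['!])", r" \g<1>", "You're using coconuts!"))  # You 're using coconuts !
```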

We are now ready to look at the original code, which is reproduced below and should now make sense. 

In [None]:
print(re.sub("([.?!'])", " \g<1>", "You're using coconuts!").split())   

First, the spaces are added before any full stop, question mark, exclamation mark or apostrophe.
The resulting string is then split on white space.

### Exercise

- Create an empty code cell below, and write a new version of your `tokenise` function that uses `re.sub` in the way we've just considered. 

In [19]:
def token(string):
    return re.sub("([.?!'])", " \g<1>", string).split()

print(token("You're using coconuts!"))

['You', "'re", 'using', 'coconuts', '!']


In [None]:
# %load solutions/tokenise_with_re.sub
def tokenise(sentence):
    return re.sub("([.?!'])", " \g<1>", sentence).split()

print(tokenise(' What is the    air-speed . velocity of  an unladen swallow?   '))


### Exercise


- Create an empty code cell below, and extend your tokeniser function to cater for the following guidelines. 
- Test out your new tokeniser on the string  
`"After saying \"I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved."`  
 Notice that the `"` characters in the test sentence have been escaped, appearing as `\"`.

In [20]:
tokenise("I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved.")

['I',
 'wo',
 "n't",
 'help',
 ',',
 'I',
 "'m",
 'gon',
 'na',
 'leave',
 '!',
 '"',
 ',',
 'on',
 'his',
 'parents',
 "'",
 'arrival',
 ',',
 'the',
 'boy',
 "'s",
 'behaviour',
 'improved',
 '.']

### Guidelines

- punctuation is split from adjoining words
- opening double quotes are changed to two single forward quotes, i.e. two backtick characters.
- closing double quotes are changed to two single backward quotes, i.e. two apostrophes (`''`).
- the Anglo-Saxon genitive of nouns is split into its component morphemes, and each morpheme is tagged separately.
  - e.g. `"children's"` produces `"children 's"`
  - e.g. `"parents'"` produces `"parents '"`
- contractions should be split into component morphemes
  - e.g. `"won't"` produces `"wo n't"`
  - e.g. `"gonna"` produces `"gon na"`
  - e.g. `"I'm"` produces `"I 'm"`
  
  
These tokenisation guidelines are a subset of those found [here](http://www.cis.upenn.edu/~treebank/tokenization.html).



### Hints:

- Use multiple calls to `re.sub` to deal with different cases one at a time. As in...

```
    sentence = re.sub(<pattern1>, <replacement1>,sentence)
    sentence = re.sub(<pattern2>, <replacement2>,sentence)
    sentence = re.sub(<pattern3>, <replacement3>,sentence)
```

- Order your calls to `re.sub` so that you deal with the specific cases first and the more general cases later.

- In dealing with the replacement of start and end `"`, you will find the following useful:

>The `'*'`, `'+'`, and `'?'` qualifiers are all *greedy*; they match
>as much text as possible.  Sometimes this behaviour isn't desired; if the RE
>`<.*>` is matched against `<a> b <c>`, it will match the entire
>string, and not just `<a>`.  Adding `'?'` after the qualifier makes it
>perform the match in *non-greedy* or *minimal* fashion; as *few*
>characters as possible will be matched.  Using the RE `<.*?>` will match
>only `<a>`.  
(taken from https://docs.python.org/2/library/re.html).
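
The difference between greedy and non-greedy matching can be checked directly. This short sketch first reproduces the example from the quoted documentation and then applies a non-greedy pattern to a made-up sentence containing two quoted spans, which is the situation that matters when replacing opening and closing double quotes.

```python
import re

text = "<a> b <c>"
print(re.findall(r"<.*>", text))    # greedy: ['<a> b <c>']
print(re.findall(r"<.*?>", text))   # non-greedy: ['<a>', '<c>']

# With two quoted spans, the non-greedy version pairs each opening quote
# with the nearest closing quote instead of the last one in the string.
quoted = 'He said "yes" and then "no".'
print(re.sub(r'"(.+?)"', r"`` \g<1> ''", quoted))
# He said `` yes '' and then `` no ''.
```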


In [21]:
import re    #import regex module

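def tokenise(sentence):
    # the print calls trace the effect of each substitution in turn
    # split clitics such as 's, 'm, 're, 've, 'll and 'd from the preceding word
    sentence = re.sub("'(s|m|(re)|(ve)|(ll)|(d))", " '\g<1> ", sentence + " ")
    print(sentence)
    # split the possessive apostrophe from plural nouns, e.g. parents' -> parents '
    sentence = re.sub("s'", "s '", sentence)
    print(sentence)
    # split negative contractions, e.g. won't -> wo n't
    sentence = re.sub("n't", " n't", sentence)
    print(sentence)
    # split the contraction gonna -> gon na
    sentence = re.sub("gonna", "gon na", sentence)
    print(sentence)
    # replace a pair of double quotes, matched non-greedily, by `` and ''
    sentence = re.sub("\"(.+?)\"", "`` \g<1> ''", sentence)   
    print(sentence)
    # finally, the general case: split remaining punctuation from adjoining words
    sentence = re.sub("([.,?!])", " \g<1> ", sentence)
    print(sentence)
    return sentence.split()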



testsentence = "After saying \"I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved."

print(tokenise(testsentence))

After saying "I won't help, I 'm  gonna leave!", on his parents' arrival, the boy 's  behaviour improved. 
After saying "I won't help, I 'm  gonna leave!", on his parents ' arrival, the boy 's  behaviour improved. 
After saying "I wo n't help, I 'm  gonna leave!", on his parents ' arrival, the boy 's  behaviour improved. 
After saying "I wo n't help, I 'm  gon na leave!", on his parents ' arrival, the boy 's  behaviour improved. 
After saying `` I wo n't help, I 'm  gon na leave! '', on his parents ' arrival, the boy 's  behaviour improved. 
After saying `` I wo n't help ,  I 'm  gon na leave !  '' ,  on his parents ' arrival ,  the boy 's  behaviour improved .  
['After', 'saying', '``', 'I', 'wo', "n't", 'help', ',', 'I', "'m", 'gon', 'na', 'leave', '!', "''", ',', 'on', 'his', 'parents', "'", 'arrival', ',', 'the', 'boy', "'s", 'behaviour', 'improved', '.']


In [24]:
# %load solutions/my_tokeniser



## The NLTK regular expression tokeniser
The NLTK implements a regular expression tokeniser `word_tokenize` that is based on the above tokenisation guidelines. 

**Function**: `word_tokenize`

- Arguments
 - a single string, representing a sentence
- Returns
 - a list of strings, where each string is a token within the sentence

### Exercise

- Make sure you understand the code in the cell below and then run it so that you can compare the way that the test sentence has been tokenised by the two tokenisers.

In [36]:
from nltk.tokenize import word_tokenize
    
testsentence = "After saying \"I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved."

# run the nltk tokeniser and your tokeniser on the test sentence
nltk_toks = word_tokenize(testsentence) # run the nltk tokeniser
my_toks = tokenise(testsentence) # run your tokeniser

pd.DataFrame(list(zip_longest(nltk_toks,my_toks)),columns=["NLTK", "MINE"])

|   | NLTK | MINE |
|---|---|---|
| 0 | After | After |
| 1 | saying | saying |
| 2 | \`\` | \`\` |
| 3 | I | I |
| 4 | wo | wo |
| 5 | n't | n't |
| 6 | help | help |
| 7 | , | , |
| 8 | I | I |
| 9 | 'm | 'm |
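
The comparison cell above relies on `zip_longest` from `itertools` to line up two token lists that may differ in length. The sketch below, on two short made-up lists, shows how it differs from the built-in `zip`, which silently stops at the end of the shorter list.

```python
from itertools import zip_longest

left = ["E.", "Europe"]
right = ["E", ".", "Europe"]

# zip stops as soon as the shorter list runs out, so a token is lost.
print(list(zip(left, right)))
# [('E.', 'E'), ('Europe', '.')]

# zip_longest keeps going and pads the shorter list with None.
print(list(zip_longest(left, right)))
# [('E.', 'E'), ('Europe', '.'), (None, 'Europe')]

# A different padding value can be chosen with fillvalue.
print(list(zip_longest(left, right, fillvalue="")))
# [('E.', 'E'), ('Europe', '.'), ('', 'Europe')]
```

Once the two tokenisers disagree on a token, every later row is shifted, so the rows after the first difference need to be read with that in mind.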


### Exercise

- In the code cell below write code to run both the NLTK tokeniser, `word_tokenize`, and your own `tokenise` function on a sample of 10 sentences from the Reuters corpus.
- Look for differences in the output of the two tokenisers.


In [61]:
from nltk.tokenize import word_tokenize
from sussex_nltk.corpus_readers import ReutersCorpusReader

rcr = ReutersCorpusReader()

sample_size = 10

for sentence in rcr.sample_raw_sents(sample_size):
    nltk_tok = word_tokenize(sentence)
    my_tok = tokenise(sentence)
    # display inside the loop so that every sampled sentence is shown, not just the last one
    display(pd.DataFrame(list(zip_longest(nltk_tok, my_tok)), columns = ["NLTK", "Mine"]))

|   | NLTK | Mine |
|---|---|---|
| 0 | Swiss | Swiss |
| 1 | fund | fund |
| 2 | helps | helps |
| 3 | E. | E |
| 4 | Europe | . |
| 5 | Holocaust | Europe |
| 6 | victims | Holocaust |
| 7 | first | victims |
| 8 | . | first |
| 9 |  | . |


In [43]:
# %load solutions/nltk_vs_mine

## The Twitter-specific Tokeniser
The third tokeniser for you to explore is a Twitter-specific tokeniser that has been developed by [Gimpel et al.](http://ttic.uchicago.edu/~kgimpel/papers/gimpel+etal.acl11.pdf) as part of a Twitter-specific part-of-speech tagger (featured in later lab classes).

---
**Function**: `twitter_tokenize`
- Arguments
 - a single string, representing a sentence
- Returns
 - a list of strings, where each string is a token within the sentence
---

`twitter_tokenize` can be quite slow, so we have provided the following function to tokenise an entire sample of sentences at once.  

---
**Function**: `twitter_tokenize_batch`
- Arguments
 - a list of strings, where each string represents a sentence
- Returns
 - a list of sentences, where each sentence is a list of tokens
---

### Exercise
- In the empty cell below, write code to run both `twitter_tokenize` and the NLTK tokeniser, `word_tokenize`, on each sentence in a sample of 10 sentences from the Twitter corpus.
- Display each sentence as tokenised by the two tokenisers side by side, using a pandas DataFrame built with `zip_longest` as in the earlier comparison cells.
- Once you have done this, look for differences in the output of the two tokenisers.


In [65]:
from sussex_nltk.tokenize  import twitter_tokenize, twitter_tokenize_batch
from sussex_nltk.corpus_readers import TwitterCorpusReader

tcr = TwitterCorpusReader()

sample_size = 10

for sentence in tcr.sample_raw_sents(sample_size):
    twitter_tok = twitter_tokenize(sentence)
    nltk_tok = word_tokenize(sentence)
    # display inside the loop so that every sampled sentence is shown, not just the last one
    display(pd.DataFrame(list(zip_longest(twitter_tok, nltk_tok)), columns = ["TWITTER", "NLTK"]))

|   | TWITTER | NLTK |
|---|---|---|
| 0 | 7 | 7 |
| 1 | Cameroon | Cameroon |
| 2 | Athletes | Athletes |
| 3 | Flee | Flee |
| 4 | Olympics | Olympics |
| 5 | http://t.co/RJurwbox | http |
| 6 |  | : |
| 7 |  | //t.co/RJurwbox |
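
The output above shows a general-purpose tokeniser breaking a URL into pieces. To get a feel for why a Twitter-aware tokeniser treats such items as single tokens, here is a small regular expression sketch. It is purely illustrative and is not the CMU tokeniser; the names `TWEET_TOKEN` and `simple_tweet_tokenise` are hypothetical.

```python
import re

# Alternatives are tried in order, so the more specific patterns come first.
TWEET_TOKEN = re.compile("|".join([
    r"https?://\S+",    # URLs kept whole (any trailing punctuation is included too)
    r"[@#]\w+",         # @mentions and #hashtags
    r"\w+(?:'\w+)?",    # words, optionally with an internal apostrophe
    r"[^\w\s]",         # any other single non-space character
]))

def simple_tweet_tokenise(text):
    """Return the list of non-overlapping matches, scanning left to right."""
    return TWEET_TOKEN.findall(text)

print(simple_tweet_tokenise("7 Cameroon Athletes Flee Olympics http://t.co/RJurwbox"))
# ['7', 'Cameroon', 'Athletes', 'Flee', 'Olympics', 'http://t.co/RJurwbox']
print(simple_tweet_tokenise("Go #TeamGB @KimConley!"))
# ['Go', '#TeamGB', '@KimConley', '!']
```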


In [67]:
# %load solutions/nltk_vs_twitter
#%load ../Solutions/3/nltk_vs_twitter

from nltk.tokenize import word_tokenize
from sussex_nltk.tokenize import twitter_tokenize,twitter_tokenize_batch  #import CMU tokenize functions
from sussex_nltk.corpus_readers import TwitterCorpusReader

tcr = TwitterCorpusReader()

samplesize = 10

for sentence in tcr.sample_raw_sents(samplesize): 
    nltk_toks = word_tokenize(sentence)
    twit_toks = twitter_tokenize(sentence)
    # show the two tokenisations side by side, padding the shorter list
    display(pd.DataFrame(list(zip_longest(nltk_toks,twit_toks)),columns=["NLTK","TWIT"]))

### Exercise
- Copy the code cell above and move the copy to below this cell. Then use both the NLTK and Twitter tokenisers on a sample of 10 sentences from the **Medline** corpus.
- Look for situations where the  tokenisers do not tokenise appropriately.
- Try to figure out the differences in tokenisation policies of the tokenisers.
- Think about possible motivations for the differences in tokenisation policy, by considering how the tokens may be used in subsequent (down-stream) language processing steps.


In [72]:
from sussex_nltk.tokenize import twitter_tokenize,twitter_tokenize_batch
from nltk.tokenize import word_tokenize
from sussex_nltk.corpus_readers import MedlineCorpusReader

mcr = MedlineCorpusReader()    #Create a new reader

samplesize = 10   

for sentence in mcr.sample_raw_sents(samplesize):
    nltk_toks = word_tokenize(sentence)
    twitter_toks = twitter_tokenize(sentence)
    # display inside the loop so that every sampled sentence is shown, not just the last one
    display(pd.DataFrame(list(zip_longest(nltk_toks,twitter_toks)),columns=["NLTK","TWIT"]))

|   | NLTK | TWIT |
|---|---|---|
| 0 | Burulin | Burulin |
| 1 | was | was |
| 2 | found | found |
| 3 | to | to |
| 4 | be | be |
| 5 | highly | highly |
| 6 | specific | specific |
| 7 | for | for |
| 8 | patients | patients |
| 9 | in | in |


In [None]:
# %load solutions/nltk_vs_twitter_medline
# %load ../Solutions/3/nltk_vs_twitter_medline

from sussex_nltk.tokenize import twitter_tokenize,twitter_tokenize_batch  #import CMU tokenize functions
from nltk.tokenize import word_tokenize
from sussex_nltk.corpus_readers import MedlineCorpusReader

mcr = MedlineCorpusReader()    #Create a new reader

samplesize = 10   

for sentence in mcr.sample_raw_sents(samplesize): 
    nltk_toks = word_tokenize(sentence)
    twit_toks = twitter_tokenize(sentence)
    display(pd.DataFrame(list(zip_longest(nltk_toks,twit_toks)),columns=["NLTK","TWIT"]))

## Normalising text and removing unimportant tokens
In this next section we will consider several methods that pre-process (tokenised) text in ways that are sometimes helpful to 'downstream' processing.

### Number and case normalisation
Without any kind of normalisation, the tokens `"help"` and `"Help"` are two distinct types. In some contexts you may not want to distinguish them.

Another example is that `"1998"` and `"1999"` count as distinct types. There are situations where there is no need to distinguish between different numbers.

The following code performs case normalisation and replaces tokens that consist of digits by "NUM". 
- Python provides a [number of string methods](http://docs.python.org/library/stdtypes.html#string-methods), which you can call on strings in order to analyse their content or produce new strings from them.
- The code uses [list comprehension](http://docs.python.org/tutorial/datastructures.html#list-comprehensions) to build a new list by looping through and filtering items.

In [16]:
tokens = ["The","cake","is","a","LIE"]      #a list of tokens, some of which contain uppercase letters
print([token.lower() for token in tokens])   #print newly created list of all lowercase tokens

numbers = ['in', 'the', 'year', '120', 'of', 'the', 'fourth', 'age', ',', 'after', '120', 'years', 'as', 'king', ',' , 'aragorn', 'died', 'at', 'the', 'age', 'of', '210']
print(["NUM" if token.isdigit() else token for token in numbers])  #replace all number tokens with "NUM" in a new list of tokens

['the', 'cake', 'is', 'a', 'lie']
['in', 'the', 'year', 'NUM', 'of', 'the', 'fourth', 'age', ',', 'after', 'NUM', 'years', 'as', 'king', ',', 'aragorn', 'died', 'at', 'the', 'age', 'of', 'NUM']


### Exercise
- In the empty cell below, write code that normalises tokens such as `"4th"`, `"1st"` and `"22nd"` to `"Nth"`.
- Try to adapt this code from the cell above: `["NUM" if token.isdigit() else token for token in numbers]`
- Test your code on the list `["The", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]`. 
- Check that the token `"and"` isn't changed to `"Nth"`.
- You will find [this page](http://docs.python.org/library/stdtypes.html#string-methods) useful.


In [78]:
tokens = ["4th", "1st", "22nd", "bruh"]
print(["Nth" if not token.isalpha() else token for token in tokens])

['Nth', 'Nth', 'Nth', 'bruh']


In [82]:
tokens = ["The", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]
punct = "?!.,:;"
print(["Nth" if not token.isalpha() and token not in punct else token for token in tokens])

['The', 'Nth', 'and', 'Nth', 'placed', 'runners', 'lapped', 'the', 'Nth', '.']
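
Testing with `isalpha` also turns plain numbers such as `"120"` into `"Nth"`, and it needs an explicit list of punctuation. One possible alternative, sketched here with a hypothetical helper name `normalise_ordinals`, is to match the shape of an ordinal directly with a regular expression: one or more digits followed by `st`, `nd`, `rd` or `th`.

```python
import re

# One or more digits followed by an ordinal suffix; fullmatch requires
# the whole token to have this shape.
ORDINAL = re.compile(r"\d+(st|nd|rd|th)")

def normalise_ordinals(tokens):
    """Replace ordinal tokens such as '1st' or '22nd' by 'Nth'."""
    return ["Nth" if ORDINAL.fullmatch(token) else token for token in tokens]

tokens = ["The", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th", ".", "120"]
print(normalise_ordinals(tokens))
# ['The', 'Nth', 'and', 'Nth', 'placed', 'runners', 'lapped', 'the', 'Nth', '.', '120']
```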


In [83]:
# %load solutions/normalise_to_Nth

### Exercise
- Complete the code in the cell below. You have just two lines to complete. The goal is to use a large sample of the Reuters corpus to establish the extent to which vocabulary size is reduced when number and case normalisation is applied.
- For each of the two incomplete lines you should use nested list comprehensions. This is described in Section 5.1.4 in [this document](http://docs.python.org/tutorial/datastructures.html#list-comprehensions)


In [97]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize

def vocabulary_size(sentences):
    tok_counts = collections.defaultdict(int)
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token] += 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader()    

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

############################################
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [["NUM" if token.isdigit() else token for token in sentence] for sentence in lowered_sentences]
############################################

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))


Normalisation produced a 13.22% reduction in vocabulary size from 19109 to 16583
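
The same calculation can be followed by hand on a toy example. The sketch below is a simplified variant of the cell above, not a replacement for it: it counts types with a set comprehension and applies both normalisations in one pass. The helper name `normalise` and the two toy sentences are made up for illustration.

```python
def vocabulary_size(sentences):
    """Number of distinct token types across a list of tokenised sentences."""
    return len({token for sentence in sentences for token in sentence})

def normalise(sentences):
    """Replace digit-only tokens by 'NUM' and lower-case everything else."""
    return [["NUM" if t.isdigit() else t.lower() for t in s] for s in sentences]

toy = [["The", "index", "rose", "in", "1998"],
       ["the", "Index", "fell", "in", "1999"]]

raw = vocabulary_size(toy)                 # 9 types: "The"/"the" and "index"/"Index" are distinct
norm = vocabulary_size(normalise(toy))     # 6 types: the, index, rose, in, NUM, fell
print("{0:.2f}% reduction from {1} to {2}".format(100 * (raw - norm) / raw, raw, norm))
# 33.33% reduction from 9 to 6
```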


In [96]:
# %load solutions/impact_of_normalisation
#%load ../Solutions/3/impact_of_normalisation

from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize

def vocabulary_size(sentences):
    tok_counts = collections.defaultdict(int)
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token] += 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader()    

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
normalised_sentences = [["NUM" if token.isdigit() else token for token in sentence] for sentence in lowered_sentences]
raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))


Normalisation produced a 13.48% reduction in vocabulary size from 19183 to 16598
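
The percentage printed above is a relative reduction: (raw size − normalised size) / raw size. The sketch below applies that formula to a small hand-tokenised sample and then to the figures reported in the output above. The toy sentences and the helper name `percentage_reduction` are assumptions made for illustration; they are not part of the lab code.

```python
def vocabulary_size(sentences):
    """Number of distinct token types in a list of tokenised sentences."""
    return len({token for sentence in sentences for token in sentence})


def percentage_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return 100 * (before - after) / before


# Toy, hand-tokenised sample (illustrative only).
toy_sentences = [["The", "shares", "rose", "5", "%"],
                 ["the", "Shares", "fell", "12", "%"]]

lowered = [[t.lower() for t in s] for s in toy_sentences]
normalised = [["NUM" if t.isdigit() else t for t in s] for s in lowered]

raw_size = vocabulary_size(toy_sentences)   # 9 distinct types
norm_size = vocabulary_size(normalised)     # 6 distinct types
print(raw_size, norm_size, round(percentage_reduction(raw_size, norm_size), 2))

# The same formula applied to the sizes reported in the output above.
print(round(percentage_reduction(19183, 16598), 2))  # 13.48
```

Because `sample_raw_sents` draws a random sample, the vocabulary sizes (and so the percentage) vary slightly from run to run, which is why the two cells above report different figures.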


## Stemming
A considerable amount of the lexical variation found in documents results from morphological variants that we might not wish to distinguish, e.g. when determining the topic of a document. An easy way to conflate these variants is to use a stemmer, which strips a word's inflections to leave its stem. NLTK includes a number of stemmers in the `nltk.stem` package.
- [NLTK stem module API](http://nltk.org/api/nltk.stem.html)

- [NLTK Porter stemmer](http://nltk.org/api/nltk.stem.html?highlight=stemmer#nltk.stem.porter.PorterStemmer)
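
To make the idea of conflating variants concrete, here is a minimal sketch of a toy suffix stripper. It is deliberately not the Porter algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions chosen for illustration. A real stemmer applies many more rules and often produces stems that are not words.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of
# how removing inflectional endings maps several variants to one stem.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, tried in this order


def toy_stem(token):
    word = token.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["boost", "boosts", "boosted", "boosting", "Loans", "loan"]
stems = [toy_stem(t) for t in tokens]
print(stems)
print(len(set(tokens)), "types before,", len(set(stems)), "types after")
```

Six distinct tokens collapse to two stems here, which is the effect on vocabulary size that stemming is meant to have.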

### Exercise
- Complete the code below to show how the NLTK implementation of the Porter stemmer in `nltk.stem.porter.PorterStemmer` stems a sample of sentences in the Reuters corpus. All you need to do is to provide the missing first two arguments to the call to `print_lists_in_columns`.
- Have a close look at the differences between the columns. This will give you a good indication of what the stemmer does.

In [98]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

rcr = ReutersCorpusReader() 
st = PorterStemmer()

sample_size = 10

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

for sentence in tokenised_sentences:
    display(pd.DataFrame(list(zip_longest(sentence,[st.stem(token) for token in sentence])),columns=["BEFORE","AFTER"]))

Unnamed: 0,BEFORE,AFTER
0,The,the
1,same,same
2,priorities,prioriti
3,would,would
4,apply,appli
5,in,in
6,financing,financ
7,law,law
8,enforcement,enforc
9,and,and


Unnamed: 0,BEFORE,AFTER
0,GENERAL,gener
1,ELECTION,elect
2,BONDS,bond
3,SER,ser
4,.,.


Unnamed: 0,BEFORE,AFTER
0,``,``
1,Boris,bori
2,Nikolayevich,nikolayevich
3,said,said
4,he,he
5,would,would
6,look,look
7,at,at
8,the,the
9,status,statu


Unnamed: 0,BEFORE,AFTER
0,08/01/2001,08/01/2001
1,"8,270M","8,270m"
2,5.00,5.00
3,%,%
4,5.10,5.10


Unnamed: 0,BEFORE,AFTER
0,--,--
1,Sydney,sydney
2,Newsroom,newsroom
3,61-2,61-2
4,373-1800,373-1800


Unnamed: 0,BEFORE,AFTER
0,2006.0,2006.0
1,150000.0,150000.0
2,5.3,5.3
3,5.3,5.3


Unnamed: 0,BEFORE,AFTER
0,TRADE,trade
1,DATE,date
2,WILL,will
3,BE,BE
4,TODAY,today
5,",",","
6,APRIL,april
7,3RD,3rd
8,.,.


Unnamed: 0,BEFORE,AFTER
0,--,--
1,Mike,mike
2,Peacock,peacock
3,",",","
4,London,london
5,Newsroom,newsroom
6,+44,+44
7,171,171
8,542,542
9,5109,5109


Unnamed: 0,BEFORE,AFTER
0,ISSUE,issu
1,:,:
2,General,gener
3,Obligation,oblig
4,",",","
5,1997,1997
6,TAX,tax
7,STAT,stat
8,:,:
9,Exempt-ULT,exempt-ult


Unnamed: 0,BEFORE,AFTER
0,--,--
1,U.S.,u.s.
2,Municipal,municip
3,Desk,desk
4,",",","
5,212-859-1650,212-859-1650


In [11]:
# %load solutions/show_stemmer_sample
from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

rcr = ReutersCorpusReader() 
st = PorterStemmer()

sample_size = 10

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

for sentence in tokenised_sentences:
    df = pd.DataFrame(list(zip_longest(sentence,[st.stem(token) for token in sentence])),columns=["BEFORE","AFTER"])
    print(df)


      BEFORE   AFTER
0   Seasonal  season
1      Loans    loan
2        ...     ...
3        ...     ...
4        ...     ...
5        ...     ...
6        ...     ...
7        ...     ...
8        295     295
9         up      up
10       ...     ...
11       ...     ...
12       ...     ...
13       ...     ...
14        74      74
      BEFORE    AFTER
0       This      thi
1       year     year
2          ,        ,
3       hits      hit
4       from     from
5        the      the
6      Spice    spice
7      Girls     girl
8        and      and
9      other    other
10    Virgin   virgin
11      acts      act
12      like     like
13       the      the
14  Smashing    smash
15  Pumpkins  pumpkin
16         ,        ,
17  Scarface  scarfac
18       and      and
19      Blur     blur
20      have     have
21    helped     help
22     boost    boost
23       EMI      emi
24        's       's
25     sales     sale
26         .        .
   BEFORE   AFTER
0       $       $
1   Price   

### Exercise
- By looking at the impact on a large sample of the Reuters corpus, establish the extent to which vocabulary size is reduced by stemming.
- Write code to do this in the empty cell below. You should be able to reuse much of the code you wrote when measuring the impact of lower-casing and number normalisation.

In [2]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

def vocabulary_size(sentences):
    tok_counts = collections.defaultdict(int)
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token] += 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader() 
st = PorterStemmer()

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
stemmed_sentences = [[st.stem(token) for token in sentence] for sentence in tokenised_sentences]
raw_vocab_size = vocabulary_size(tokenised_sentences)
stemmed_vocab_size = vocabulary_size(stemmed_sentences)
print("Stemming produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - stemmed_vocab_size)/raw_vocab_size,raw_vocab_size,stemmed_vocab_size))

Sussex NLTK root directory is \\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources
Stemming produced a 26.14% reduction in vocabulary size from 19393 to 14323


In [None]:
# %load solutions/impact_of_stemming
from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

def vocabulary_size(sentences):
    tok_counts = collections.defaultdict(int)
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token] += 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader()    
st = PorterStemmer()

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
stemmed_sentences = [[st.stem(token) for token in sentence] for sentence in tokenised_sentences]
raw_vocab_size = vocabulary_size(tokenised_sentences)
stemmed_vocab_size = vocabulary_size(stemmed_sentences)
print("Stemming produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - stemmed_vocab_size)/raw_vocab_size,raw_vocab_size,stemmed_vocab_size))


### Punctuation and stop-word removal
A stopword is a word that occurs so often that it loses its usefulness in some tasks. We may get more meaningful information from our corpus analysis if we remove stopwords and punctuation.

The code below takes a list of tokens and creates a new list, which contains only those strings which are alphabetic and non-stop-words.

In [3]:
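from nltk.corpus import stopwords

# Example token list (illustrative; any list of token strings will do)
tokens = ["the", "cat", "sat", "on", "the", "mat", ",", "and", "it", "purred", "."]

stop_words = set(stopwords.words('english'))  # new name, so the imported module is not shadowed
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]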

**Note**: `isalpha` only returns `True` if the string is composed entirely of alphabetic characters. If you want a function that returns `True` even when a word contains digits, use `isalnum`.
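
A quick way to see the difference is to apply both methods to a few token shapes like those that appear in the Reuters samples earlier. This is only a small illustration; the list `samples` is made up for the purpose.

```python
samples = ["sales", "3rd", "1997", "U.S.", "Exempt-ULT"]

for s in samples:
    print(f"{s!r:14} isalpha={s.isalpha()!s:5} isalnum={s.isalnum()}")

# Tokens that survive each filter.
alpha_only = [s for s in samples if s.isalpha()]
alnum_only = [s for s in samples if s.isalnum()]
print(alpha_only)
print(alnum_only)
```

Note that both methods reject tokens containing punctuation, such as full stops or hyphens, so abbreviations and hyphenated words are filtered out along with punctuation tokens.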

### Exercise
- In the empty cell below, write code that looks at a large sample of the Medline corpus, establishing what proportion of tokens are stop-words.
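
One way to frame the measurement is to count all tokens, count those that appear in the stop list, and report the stop-word share; that share is also the relative reduction in token count that removal would produce. The sketch below does this on toy data. The stop list and sentences are hand-written assumptions, standing in for `stopwords.words('english')` and a real corpus sample.

```python
# Hypothetical, hand-written stop list; the lab itself uses
# nltk.corpus.stopwords.words('english').
toy_stop_words = {"the", "of", "in", "and", "a", "is"}

# Toy, hand-tokenised sentences (illustrative only).
toy_sentences = [["the", "effect", "of", "aspirin", "in", "mice"],
                 ["aspirin", "is", "a", "drug"]]

num_tokens = sum(len(s) for s in toy_sentences)
num_stop = sum(1 for s in toy_sentences for t in s if t in toy_stop_words)

proportion = 100 * num_stop / num_tokens   # share of tokens that are stop-words
remaining = num_tokens - num_stop          # tokens left after removal
print(f"{num_stop} of {num_tokens} tokens are stop-words ({proportion:.2f}%); "
      f"{remaining} tokens remain after removal")
```

Holding the stop list in a `set` keeps each membership test fast, which matters once the sample runs to hundreds of thousands of tokens.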

In [20]:
from sussex_nltk.corpus_readers import MedlineCorpusReader
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

mcr = MedlineCorpusReader()
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
sample_size = 10000

# Sample sentences from the Medline corpus
raw_sentences = mcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

num_stopwords = 0
num_tokens = 0
for sentence in tokenised_sentences:
    for token in sentence:
        num_tokens += 1
        if token in stop_words:
            num_stopwords += 1

# The reduction is the share of tokens that are stop-words;
# the tokens left after removal are num_tokens - num_stopwords
print("Stopword removal produced a {0:.2f}% reduction in number of tokens from {1} to {2}".format(
    100*num_stopwords/num_tokens,num_tokens,num_tokens - num_stopwords))


In [18]:
# %load solutions/impact_of_stopword_removal
from sussex_nltk.corpus_readers import MedlineCorpusReader
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def vocabulary_size(sentences):
    tok_counts = collections.defaultdict(int)
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token] += 1
    return len(tok_counts.keys())

mcr = MedlineCorpusReader()
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

sample_size = 10000

# Sample sentences from the Medline corpus
raw_sentences = mcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

############################################
num_stopwords = 0
num_tokens = 0
for sentence in tokenised_sentences:
    for token in sentence:
        num_tokens += 1
        if token in stop_words:
            num_stopwords += 1
############################################

# The reduction is the share of tokens that are stop-words;
# the tokens left after removal are num_tokens - num_stopwords
print("Stopword removal produced a {0:.2f}% reduction in number of tokens from {1} to {2}".format(
    100*num_stopwords/num_tokens,num_tokens,num_tokens - num_stopwords))
