## DIY Tokenisation with Regular Expressions (EXTENSION)
Text doesn't come in neat tokens ready for analysis; it must first undergo sentence segmentation and tokenisation.  
The nltk package can handle sentence segmentation and word tokenisation of a corpus for you.
The word tokeniser in NLTK is based on regular expressions.  For a deeper understanding, work through this extension lab where you will build your own regular expression based tokeniser and compare it with the NLTK implementation.


### Making your own tokeniser
In this section, you will write your own Python function, which takes as input a single string representing a sentence, and returns a **list of strings** obtained by splitting the sentence into tokens.

Let's start by simply splitting by whitespace. 

In [None]:
print("   What    is the    air-speed   velocity of  an unladen swallow?   ".split()) 

### Exercise 1.1

- In the empty code cell below write a [function](http://docs.python.org/tutorial/controlflow.html#defining-functions), `tokenise` which takes a sentence as input and returns a list of the tokens making up the sentence. Your first version of this function should tokenise only on whitespace, as shown in the cell above. Show that your function works on the sentence shown above.
- Note: this is intended to be an easy exercise - just a couple of lines of code - don't overcomplicate it!

### Exercise 1.2

- In the empty code cell below write code that applies your `tokenise` function to each sentence in a sample of 30 sentences taken from the `twitter_samples` corpus.

In most tokenisation policies (e.g. in the Wall Street Journal corpus), contractions like "I'm" tend to be split into "I" and "'m".  

When it comes to more than just splitting by whitespace, it can be convenient to use [regular expressions](http://docs.python.org/library/re.html) to process the string in some way. The following code cell illustrates this. Try running it and then read on to discover how it works.

In [None]:
import re

# Raw strings (r"...") stop Python from treating the backslash in \g<1> as a string escape.
print(re.sub(r"([.?!'])", r" \g<1>", "You're using coconuts!").split())

Let's look at how the above code works by breaking it down.  

First, run the following cell.

In [None]:
print(re.sub("'", " '", "You're using coconuts!"))

As you can see, this code takes the string "You're using coconuts!" and inserts a space before the apostrophe, the `'` character.

Let's see how it works...

The first argument of `re.sub`, i.e. `"'"`, is a regular expression that in this case is extremely simple, since it only matches the apostrophe character, `'`.

The second argument of `re.sub`, where we see `" '"`, indicates that an apostrophe should be substituted by a space followed by an apostrophe.
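Two properties of `re.sub` are worth checking before moving on: it replaces every match in the string, not just the first, and it returns a new string rather than changing the original. The sketch below illustrates both on a made-up phrase; the variable names `original` and `spaced` are only for illustration.

```python
import re

original = "the boy's parents' car"

# Every apostrophe in the string is replaced, not only the first one.
spaced = re.sub("'", " '", original)

print(spaced)    # the boy 's parents ' car
print(original)  # the original string is left unchanged
```

This is why the result of `re.sub` has to be assigned to a variable if you want to keep it.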

Now let's make it slightly more complicated. We also want to insert a space before the `"!"`, so let's look at how to do that. 

Run the following code cell.

In [None]:
print(re.sub(r"(['!])", r" \g<1>", "You're using coconuts!"))

The first argument of `re.sub` has been changed to `"(['!])"`, which is a regular expression that matches either an apostrophe, `'`, or an exclamation mark, `!`.

This is achieved with the regular expression `"['!]"`, where the square brackets enclose the alternative characters. 
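To see the square brackets at work on their own, the short sketch below uses `re.findall` (a standard `re` function not otherwise used in this lab) to list every character in the example sentence that `"['!]"` matches. The variable names are illustrative only.

```python
import re

sentence = "You're using coconuts!"

# Each match of a character class is exactly one character
# drawn from the set listed inside the square brackets.
matches = re.findall("['!]", sentence)

print(matches)  # ["'", '!']
```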

Why does the regular expression contain parentheses? 

It has to do with what we need to put as the second argument of `re.sub` where the substitution is specified. 

To understand this, you need to appreciate that we want to add a space before an apostrophe and also a space before an exclamation mark. How can we specify that in the second argument of `re.sub`? 

The answer is that we need to make use of the idea of a **group**.

The parentheses in `"(['!])"` define the start and end of a group. In this case the whole regular expression is a group. In general, however, there can be several sets of parentheses defining several groups. For example, the regular expression `"([Tt]h)e (m*n)"` has two groups. Groups are numbered from left to right, so the group in the regular expression `"(['!])"` is group 1. 

Defining this group allows us to refer to the string that matched the regular expression `"(['!])"`, which will be either an apostrophe or an exclamation mark. This is then used in the second argument of `re.sub`, where we see `" \g<1>"`: each apostrophe or exclamation mark that is matched is replaced by a space followed by the symbol that was matched. The `1` in `\g<1>` refers to group 1.
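Group numbering is easier to see with more than one group. The sketch below reuses the two-group expression `"([Tt]h)e (m*n)"` mentioned earlier; the input string `"The mmn sat"` and the square-bracket markers in the replacement are invented purely to make the groups visible.

```python
import re

pattern = r"([Tt]h)e (m*n)"
text = "The mmn sat"

# Groups are numbered from left to right by their opening parenthesis.
match = re.search(pattern, text)
print(match.group(1))  # Th
print(match.group(2))  # mmn

# In the replacement, \g<1> and \g<2> stand for whatever each group matched.
marked = re.sub(pattern, r"[\g<1>]e [\g<2>]", text)
print(marked)  # [Th]e [mmn] sat
```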

We are now ready to look at the original code, which is reproduced below and should now make sense. 

In [None]:
print(re.sub(r"([.?!'])", r" \g<1>", "You're using coconuts!").split())

First, the spaces are added before any full stop, question mark, exclamation mark or apostrophe.
The resulting string is then split on white space.

### Exercise 2.1

- Write a new version of your `tokenise` function that uses `re.sub` in the way we've just considered. Make sure you test your new function.

### Exercise 2.2


- In an empty code cell below, extend your tokeniser function to cater for the following guidelines. 
- Test out your new tokeniser on the string  
`"After saying \"I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved."`  
 Notice that the `"` characters in the test sentence have been escaped, appearing as `\"`.
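If the escaping is unfamiliar, the sketch below shows that `\"` inside a double-quoted Python string is just a way of typing a `"` character: no backslash ends up in the string itself. The variable name `test_sentence` is an assumption made for illustration.

```python
test_sentence = "After saying \"I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved."

print(test_sentence)

# The string contains two double-quote characters and no backslashes.
print(test_sentence.count('"'))  # 2
print("\\" in test_sentence)     # False
```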

### Guidelines

- punctuation is split from adjoining words
- opening double quotes are changed to two single forward quotes (two backtick characters).
- closing double quotes are changed to two single backward quotes (two apostrophes, `''`).
- the Anglo-Saxon genitive of nouns is split into its component parts.
  - e.g. `"children's"` produces `"children 's"`
  - e.g. `"parents'"` produces `"parents '"`
- contractions should be split into component parts
  - e.g. `"won't"` produces `"wo n't"`
  - e.g. `"gonna"` produces `"gon na"`
  - e.g. `"I'm"` produces `"I 'm"`
  
  
These tokenisation guidelines are a subset of those found at
ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html



### Hints:

- Use multiple calls to `re.sub` to deal with different cases one at a time. As in...

```
    sentence = re.sub(<pattern1>, <replacement1>,sentence)
    sentence = re.sub(<pattern2>, <replacement2>,sentence)
    sentence = re.sub(<pattern3>, <replacement3>,sentence)
```

- Order your calls to `re.sub` so that you deal with the specific cases first and the more general cases later.

- In dealing with the replacement of start and end `"`, you will find the following useful:

>The `'*'`, `'+'`, and `'?'` qualifiers are all *greedy*; they match
>as much text as possible.  Sometimes this behaviour isn't desired; if the RE
>`<.*>` is matched against `<a> b <c>`, it will match the entire
>string, and not just `<a>`.  Adding `'?'` after the qualifier makes it
>perform the match in *non-greedy* or *minimal* fashion; as *few*
>characters as possible will be matched.  Using the RE `<.*?>` will match
>only `<a>`.  
(taken from https://docs.python.org/2/library/re.html).
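Two of the hints above can be tried out on small inputs. The sketch below first reproduces the greedy and non-greedy example from the quoted documentation, then shows why the order of `re.sub` calls matters. The ellipsis rule and the function names `specific_first` and `general_first` are invented for illustration; they are not part of the guidelines and not a solution to the exercise.

```python
import re

# Greedy versus non-greedy matching, using the example from the quoted text.
text = "<a> b <c>"
greedy = re.findall(r"<.*>", text)
non_greedy = re.findall(r"<.*?>", text)
print(greedy)      # ['<a> b <c>']
print(non_greedy)  # ['<a>', '<c>']


def specific_first(sentence):
    # Specific rule (hypothetical): three full stops become one ellipsis token.
    sentence = re.sub(r"\.\.\.", " …", sentence)
    # General rule: split remaining sentence punctuation from the preceding word.
    sentence = re.sub(r"([.?!])", r" \g<1>", sentence)
    return sentence.split()


def general_first(sentence):
    # General rule applied first: each full stop gets its own space.
    sentence = re.sub(r"([.?!])", r" \g<1>", sentence)
    # The specific rule can no longer match, because the stops are now separated.
    sentence = re.sub(r"\.\.\.", " …", sentence)
    return sentence.split()


print(specific_first("Wait... what?"))  # ['Wait', '…', 'what', '?']
print(general_first("Wait... what?"))   # ['Wait', '.', '.', '.', 'what', '?']
```

Applying the specific rule first works here only because its output (`…`) is something the general rule does not match; when designing your own rules, check that a later, more general substitution does not undo an earlier one.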


### Exercise 3.1

- In the code cell below write code to run both the `NLTK_Tokenise` and your `tokenise` function on a sample of 10 sentences from the twitter_samples corpus.
- Compare the output.  What differences do you notice?
