<div class="alert alert-danger">
**Due date:** 2017-01-20
</div>

# Lab 0: Text Segmentation

**Students:** Maria Johansson (marjo123), Erik Karlsson (erika456)

## Introduction

From a computer's perspective, a text is primarily a sequence of characters, such as letters and digits. Before we can process a text with language technology tools, we need to segment it into linguistically more meaningful units, such as paragraphs, sentences, or words. This basic technique is called **text segmentation**. When the target units are words, it is called **tokenisation**. In this lab you will implement a simple tokeniser for continuous text.

In [None]:
import lab0

## Data

The text you will be working with is an article from Swedish Wikipedia: [Gustav III](https://sv.wikipedia.org/wiki/Gustav_III). Look at the webpage and see how it is built up.

A Wikipedia page consists not only of text but also of other data, such as pictures and tables. Before you can start tokenising the text, you would usually need to extract it from the page using a tool like [Scrapy](https://scrapy.org). For this lab this has already been done for you, so your starting point is the extracted text.

### Read in the raw text

To read the extracted text into Python, we define a helper function `read_data()`. The function opens the given file and returns its content as a list of lines. Each line in the text file ends with a newline character (`\n`); this character is removed using Python's [`str.rstrip()`](https://docs.python.org/3.5/library/stdtypes.html#str.rstrip), which, when called without arguments, also strips any other trailing whitespace.

In [None]:
def read_data(filename):
    with open(filename) as f:
        return [line.rstrip() for line in f]
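
The effect of `str.rstrip()` can be seen in a minimal sketch. The file content below is invented for illustration and is simulated with `io.StringIO` instead of a real file; it is not taken from the lab data.

```python
import io

# Invented three-line "file"; the second line has trailing spaces
# and the third line is empty.
fake_file = io.StringIO("Gustav III\nvar kung   \n\n")

# Iterating over a file object yields lines that still end in "\n".
raw = list(fake_file)

# rstrip() without arguments removes all trailing whitespace.
stripped_all = [line.rstrip() for line in raw]

# rstrip("\n") removes only trailing newline characters.
stripped_newline = [line.rstrip("\n") for line in raw]

print(raw)
print(stripped_all)
print(stripped_newline)
```

For whitespace-based tokenisation the difference between the two variants does not matter, since trailing spaces never end up inside a token.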

You can now read in the raw text:

In [None]:
text1 = read_data("/home/TDP030/labs/lab0/data/text1.txt")

Look at the text in a text editor and try to identify peculiarities that might create problems for further analysis. The text is automatically extracted, using methods that read the data from the website's HTML tree.

You can also look at the text directly from the notebook. The next command prints a list with the first 50 lines of the text:

In [None]:
print(text1[:50])

The following code snippet reconstructs lines 51 to 60 of the text file by joining the corresponding list elements with the newline character:

In [None]:
print("\n".join(text1[50:60]))

### Read in the gold standard

There exists a gold standard tokenisation for the raw text. This tokenisation follows the rules used in the [Stockholm–Umeå Corpus (SUC)](https://spraakbanken.gu.se/swe/resurs/suc3), a standard corpus for Swedish. The file containing the gold standard tokenisation consists of all tokens from the raw text, with one token per line.

In [None]:
gold1 = read_data("/home/TDP030/labs/lab0/data/gold1.txt")

Look at the gold standard and try to understand the principles it is based on. Most tokens are normal words or punctuation marks, but note that abbreviations are handled as one token.

In [None]:
print(gold1[:50])

## Whitespace tokenisation

The next cell contains a very simple tokeniser:

In [None]:
def tokenize_ws(lines):
    tokens = []
    for line in lines:
        for token in line.split():
            tokens.append(token)
    return tokens

This function takes a list with text lines, splits every line at whitespace using the function [`str.split()`](https://docs.python.org/3.5/library/stdtypes.html#str.split), and collects the resulting strings in a list `tokens`.
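
The following sketch shows how `str.split()` behaves on a single line. The example sentence is made up for illustration and is not taken from the lab data.

```python
# Made-up example line with a double space and a tab character.
line = "Gustav III  föddes 1746,\ti Stockholm."

# Without arguments, split() treats any run of whitespace as one separator
# and never produces empty strings.
tokens = line.split()
print(tokens)

# With an explicit separator, every single occurrence splits,
# which can produce empty strings.
explicit = " a  b ".split(" ")
print(explicit)

# An empty line yields no tokens at all.
empty = "".split()
print(empty)
```

Note that punctuation marks stay attached to the neighbouring word, because there is no whitespace between them.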

### Compare the tokenisation with the gold standard

Test the tokeniser on the first 50 lines of the text:

In [None]:
print(tokenize_ws(text1[:50]))

Compare this tokenisation with the gold standard. Which differences do you find?

Most differences can be explained as **undersegmentation**, where the tokeniser has failed to split a token. The opposite is **oversegmentation**, where the tokeniser splits a character sequence that should really be one token.

In order to examine the differences, you can use the function `diff()` from the lab module. This function expects two arguments, a list with gold standard tokens and a list with automatically predicted tokens. It returns a new list that shows the differences between the two tokenisations in a compact way. The following command shows the first ten differences:

In [None]:
lab0.diff(gold1, tokenize_ws(text1))[:10]

The list contains pairs whose first component is a sequence of tokens that appear in the gold standard but not in the automatic tokenisation, and whose second component is a sequence of tokens that appear in the automatic tokenisation but not in the gold standard. The following code snippet prints the list in a more readable way:
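
To get a feeling for what such pairs look like, here is a minimal sketch of a comparable function built on the standard library's `difflib.SequenceMatcher`. The name `sketch_diff` and the example tokens are hypothetical, and this is an assumption about how such a comparison could be done, not the implementation used in `lab0.diff()`.

```python
import difflib

def sketch_diff(gold, pred):
    """Return pairs (gold_tokens, pred_tokens) for the divergent stretches.

    Hypothetical stand-in for illustration only; lab0.diff() may work
    differently.
    """
    matcher = difflib.SequenceMatcher(None, gold, pred, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Opcodes tagged "equal" cover stretches where both lists agree.
        if tag != "equal":
            pairs.append((gold[i1:i2], pred[j1:j2]))
    return pairs

# Made-up toy data, not taken from the lab files.
toy_gold = ["Gustav", "III", "föddes", "1746", ",", "i", "Stockholm", "."]
toy_pred = ["Gustav", "III", "föddes", "1746,", "i", "Stockholm."]

toy_pairs = sketch_diff(toy_gold, toy_pred)
for gold_tokens, pred_tokens in toy_pairs:
    print(gold_tokens, pred_tokens)
```

In this toy example both pairs are cases of undersegmentation: two gold tokens correspond to one predicted token.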

In [None]:
# Helper function that formats a list of tokens
def fmt_tokens(tokens):
    return " ".join(tokens) + " (%d)" % len(tokens)

# Print out information about divergent subsequences
print("Gold tokens".ljust(40), "Predicted tokens".ljust(40))
print()
for gold_tokens, pred_tokens in lab0.diff(gold1, tokenize_ws(text1)):
    print(fmt_tokens(gold_tokens).ljust(40), fmt_tokens(pred_tokens).ljust(40))

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Examine the differences between the gold standard and the whitespace-based tokenisation. Try to classify different types of undersegmentation and think of ways to eliminate them. Give at least three examples from different classes. Give at least one example of oversegmentation. To solve this problem, you can examine the output from the previous code cell by hand or write code that does the analysis for you.
</div>
</div>

In [None]:
# You might want to write some code here.

*Room for your answer*

Examples of different types of undersegmentation:

* Example 1
* Example 2
* Example 3

Example of oversegmentation:

### Compute precision and recall

One way to evaluate the tokeniser more quantitatively is to compute its **precision** and its **recall**. Precision is defined as the proportion of correct tokens among all tokens the system has identified. Recall is defined as the proportion of correctly identified tokens among all tokens in the gold standard. To compute these values you can use the next code cell:
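
As a rough illustration of these two measures, the sketch below counts a predicted token as correct if it covers exactly the same characters as some gold standard token. The function names and the toy data are hypothetical, and the approach assumes that both token lists are non-empty and cover the same text apart from whitespace; the functions in `lab0` may be implemented differently.

```python
def token_spans(tokens):
    """Return the set of (start, end) character offsets of the tokens,
    counted over the text with all whitespace removed."""
    spans = set()
    position = 0
    for token in tokens:
        spans.add((position, position + len(token)))
        position += len(token)
    return spans

def sketch_precision_recall(gold, pred):
    """Hypothetical helper: precision and recall based on matching spans."""
    gold_spans = token_spans(gold)
    pred_spans = token_spans(pred)
    n_correct = len(gold_spans & pred_spans)
    precision = n_correct / len(pred_spans)
    recall = n_correct / len(gold_spans)
    return precision, recall

# Made-up toy data, not taken from the lab files.
toy_gold = ["Gustav", "III", "föddes", "1746", ",", "i", "Stockholm", "."]
toy_pred = ["Gustav", "III", "föddes", "1746,", "i", "Stockholm."]

toy_precision, toy_recall = sketch_precision_recall(toy_gold, toy_pred)
print("Precision: %.4f" % toy_precision)
print("Recall: %.4f" % toy_recall)
```

Here four predicted tokens are correct, so precision is 4/6 and recall is 4/8. Undersegmentation lowers recall more than precision because the gold standard contains more tokens than the prediction.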

In [None]:
tokens_ws = tokenize_ws(text1)

print("Errors: %d" % lab0.n_errors(gold1, tokens_ws))
print("Precision: %.4f" % lab0.precision(gold1, tokens_ws))
print("Recall: %.4f" % lab0.recall(gold1, tokens_ws))

## Tokenisation based on regular expressions

In the second part of this lab you will replace the simple whitespace-based tokenisation with a more advanced tokenisation based on **regular expressions**. Before you can use regular expressions in Python you first have to import the relevant module:

In [None]:
import re

A simple tokeniser based on regular expressions looks like this:

In [None]:
def tokenize_re(regex, lines):
    output = []
    for line in lines:
        for match in re.finditer(regex, line):
            output.append(match.group(0))
    return output

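This function finds all non-overlapping matches of the pattern `regex` in each line of `lines` and returns them as a list. Each line is scanned from left to right and the matching substrings are returned in the same order. Note that quantifiers such as `+` are greedy, but in an alternation `A|B` the first alternative that matches wins, even if a later alternative would match a longer string.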
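
The scanning behaviour of `re.finditer()` can be tried out on toy strings. The helper name `matches` and the inputs below are made up for illustration. The sketch also shows that a match is not always the longest possible one when the pattern contains an alternation.

```python
import re

def matches(regex, line):
    """Hypothetical helper: all non-overlapping matches, left to right."""
    return [m.group(0) for m in re.finditer(regex, line)]

# Quantifiers are greedy: each run of digits becomes one match.
digits = matches(r"\d+", "1746-1792")
print(digits)

# Alternatives are tried in order: "t" wins before "t\.ex\." is considered.
short_first = matches(r"t|t\.ex\.", "t.ex.")
print(short_first)

# Putting the longer alternative first gives the longer match.
long_first = matches(r"t\.ex\.|t", "t.ex.")
print(long_first)
```

The order of alternatives therefore matters when one alternative is a prefix of another.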

To simulate the whitespace-based tokeniser using a regular expression, you can use the following lines of code:

In [None]:
# Regular expression the tokeniser will use
regex = r'\S+'

tokens_re = tokenize_re(regex, text1)

print("Errors: %d" % lab0.n_errors(gold1, tokens_re))
print("Precision: %.4f" % lab0.precision(gold1, tokens_re))
print("Recall: %.4f" % lab0.recall(gold1, tokens_re))

# In order to debug the regex, you might want to uncomment the next line.
# lab0.diff(gold1, tokens_re)

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Find a regular expression that eliminates as many differences between the gold standard and the automatic tokenisation as possible. Your finished tokeniser should have at least 99.5% precision and recall.
</div>
</div>

Here are some hints that can help you with the exercise:

* Read the [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) and [the documentation for the module `re`](https://docs.python.org/3.5/library/re.html).

* If you want to use grouping sub-expressions, you might want to use *non-capturing* groups.

* If your expression gets too long and hard to read, have a look at [`re.VERBOSE`](https://docs.python.org/3.5/library/re.html#re.VERBOSE) for writing the expression over multiple lines.

* If you want to practice your regex skills a little more, hop over to [RegexOne](https://regexone.com) or [RegExr](http://regexr.com).
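
The hints about non-capturing groups and `re.VERBOSE` can be illustrated with a small sketch. The pattern matches dates of the form `YYYY-MM-DD` in a made-up sentence; it is a toy example and not meant as part of a solution to Problem 2.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts
# a comment, so the expression can be spread over several lines.
date_re = re.compile(r"""
    \d{4}          # four digits for the year
    (?:-\d{2}){2}  # twice: a hyphen and two digits (non-capturing group)
""", re.VERBOSE)

# Made-up example sentence.
sentence = "Due 2017-01-20, not 2017-02-20."

dates = [m.group(0) for m in date_re.finditer(sentence)]
print(dates)

# A non-capturing group does not create a numbered group.
groups = date_re.search(sentence).groups()
print(groups)

# With a capturing group, re.findall() returns the group, not the match.
captured = re.findall(r"\d{4}(-\d{2}){2}", sentence)
print(captured)
```

Because `tokenize_re()` uses `match.group(0)`, it always returns the whole match; non-capturing groups mainly keep the pattern tidy and avoid surprises with functions such as `re.findall()`.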

## Evaluate the tokeniser on new text

Your last exercise is to evaluate your tokeniser on another article from Swedish Wikipedia: [Katarina II av Ryssland](https://sv.wikipedia.org/wiki/Katarina_II_av_Ryssland). (She was Gustav&nbsp;III's cousin.)

The raw text and the gold standard tokenisation are loaded like this:

In [None]:
text2 = read_data("/home/TDP030/labs/lab0/data/text2.txt")
gold2 = read_data("/home/TDP030/labs/lab0/data/gold2.txt")

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Redo Problem&nbsp;1 on the new text. Compute precision and recall as well. Report the results and try to explain them. Write a short discussion (max. 250 words).
</div>
</div>

In [None]:
# Room for your evaluation

*Room for your discussion (max. 250 words)*