# Week 1 - Second Year Project

---

**Learning goals**

- be familiar with the concept of regular expressions
- be able to discuss issues that arise in tokenization and segmentation
- be able to use the Unix command-line tools for navigation, search (`grep`), count (`wc`), and basic text processing (`sed` for substitution), as well as the pipe (`|`), e.g., to count word types or extract a simple word frequency list

**Notebook overview**

*Lecture 1*
1. Regular expressions - get basic familiarity with the concept
2. Tokenization - learn about how to approach tokenization and its challenges, then apply the knowledge to a small example text
3. Twitter tokenization - learn how to tokenize domain-specific text
4. Sentence segmentation - learn how to segment given text into sentences

*Lecture 2*

5. Linux Command Line for NLP - learn how to use the command line to quickly extract useful information from a provided text file
6. Advanced Use of Linux Command Line - construct a more complex command to extract word frequencies from a text


## 1. Regular Expressions (pen and paper)
For this section, it might be handy to use the website https://regex101.com/ to test your solutions.
Note: By word, we mean any alphabetic string separated from other words by whitespace, any relevant punctuation, line breaks, etc., as defined in [J&M](https://web.stanford.edu/~jurafsky/slp3/old_dec21/). If we do not specify word, any substring match might be sufficient.
- a) Write a regular expression (regex or pattern) that matches any of the following words: `cat`, `sat`, `mat`.
<br>
(Bonus: What is a possible long solution? Can you find a shorter solution? hint: match characters instead of words)
- b) Write a regular expression that matches numbers, e.g., `12`, `1,000`, `39.95`.
- c) Expand the previous solution to match Danish price indications, e.g., `1,000 kr` or `39.95 DKK` or `19.95`.
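
If you would rather test candidate patterns locally than on regex101, a small harness along these lines may help. It is only a sketch: `check_pattern` is a hypothetical helper that is not part of the course material, and the four-digit pattern is a stand-in that does not solve any of the exercises above.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the strings on which `pattern` behaves unexpectedly (hypothetical helper)."""
    regex = re.compile(pattern)
    failures = []
    for s in should_match:
        # fullmatch() requires the whole string to match, not just a substring.
        if regex.fullmatch(s) is None:
            failures.append(("missed", s))
    for s in should_not_match:
        if regex.fullmatch(s) is not None:
            failures.append(("wrongly matched", s))
    return failures

# Stand-in pattern: exactly four digits. Replace it with your own solution.
example_pattern = r"\d{4}"
print(check_pattern(example_pattern, ["1999", "2024"], ["99", "20245", "year"]))
```

An empty list means the pattern agrees with both example lists. Swap `fullmatch` for `search` if a substring match is sufficient for the exercise at hand.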

## 2. Tokenization

(Adapted notebook from S. Riedel, UCL & Facebook: https://github.com/uclnlp/stat-nlp-book).

In Python, a simple way to tokenize a text is via the `split` method, which divides a text wherever a particular substring is found. In the code below, this substring is a single space character, which seems like a reasonable starting point for an English tokenization approach.

This is analogous to the `sed` command we have seen in the lecture.

In [1]:
text = "Mr. Bob Dobolina is thinkin' of a master plan." + \
       "\nWhy doesn't he quit?"
text.split(" ")

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.\nWhy',
 "doesn't",
 'he',
 'quit?']
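
The token `'plan.\nWhy'` appears because a newline is not a space. As a minimal sketch of the difference, the snippet below compares splitting on a single space with calling `split()` without an argument, which breaks on any run of whitespace. The variable names `on_space` and `on_whitespace` are chosen here for illustration only.

```python
text = "Mr. Bob Dobolina is thinkin' of a master plan." + \
       "\nWhy doesn't he quit?"

# Splitting on a single space leaves the newline inside one token.
on_space = text.split(" ")
# Without an argument, split() breaks on any run of whitespace
# (spaces, tabs, newlines).
on_whitespace = text.split()

print("plan.\nWhy" in on_space)   # True
print(on_whitespace[8:10])        # ['plan.', 'Why']
```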

To make more fine-grained decisions, we will focus on using regular expressions for tokenization in this assignment. This can be done by either:
1. Defining the character sequence patterns at which to split.
2. Specifying patterns that define what constitutes a token.

In the code below we use a simple pattern `\s` that matches **any whitespace** to define where to split.

In [2]:
import re
gap = re.compile(r'\s')  # raw string, so that \s reaches the regex engine unchanged
gap.split(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']

One **shortcoming** of this tokenization is its treatment of punctuation: it considers `plan.` a single token, whereas ideally we would prefer `plan` and `.` to be distinct tokens. It might be easier to address this problem if we define what a token is, instead of what constitutes a gap. Below, we define a token as either a sequence of word characters (`\w+`: letters, digits, and the underscore) or a single punctuation mark from the set `.`, `?`, `:`.

In [3]:
token = re.compile(r'\w+|[.?:]')
token.findall(text)

['Mr',
 '.',
 'Bob',
 'Dobolina',
 'is',
 'thinkin',
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 'doesn',
 't',
 'he',
 'quit',
 '?']

This still isn't perfect: `Mr.` is split into two tokens, but it should be a single token. Moreover, we have lost the apostrophes in `thinkin'` and `doesn't`. Both are fixed below, although we now fail to break up the contraction `doesn't`.

In [4]:
token = re.compile(r"Mr\.|[\w']+|[.?]")  # \. matches a literal dot only
tokens = token.findall(text)
tokens

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 "doesn't",
 'he',
 'quit',
 '?']
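
Two details of this kind of pattern are easy to overlook: alternatives are tried from left to right, and an unescaped `.` matches any character. The sketch below illustrates both on short sample strings. The variable names and the sample strings are made up for illustration.

```python
import re

sample = "Mr. Bob Dobolina"

# Specific alternative first: 'Mr.' is kept as one token.
specific_first = re.compile(r"Mr\.|[\w']+|[.?]")
# General alternative first: [\w']+ consumes 'Mr' before Mr\. is ever tried.
general_first = re.compile(r"[\w']+|Mr\.|[.?]")

print(specific_first.findall(sample))  # ['Mr.', 'Bob', 'Dobolina']
print(general_first.findall(sample))   # ['Mr', '.', 'Bob', 'Dobolina']

# An unescaped dot matches any character, here the space after 'Mr'.
unescaped = re.compile(r"Mr.|[\w']+|[.?]")
print(unescaped.findall("Mr Bob"))       # ['Mr ', 'Bob']
print(specific_first.findall("Mr Bob"))  # ['Mr', 'Bob']
```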

In the code below, we apply the tokenizer defined above to a longer input text:

In [5]:
import re
text = """'Curiouser and curiouser!' cried Alice (she was so much surprised, that for the moment she quite
forgot how to speak good English); 'now I'm opening out like the largest telescope that ever was! Good-bye,
feet!' (for when she looked down at her feet, they seemed to be almost out of sight, they were getting so far
off). 'Oh, my poor little feet, I wonder who will put on your shoes and stockings for you now, dears? I'm sure I
shan't be able! I shall be a great deal too far off to trouble myself about you: you must manage the best
way you can; —but I must be kind to them,' thought Alice, 'or perhaps they won't walk the way I want to go!
Let me see: I'll give them a new pair of boots every Christmas...'
"""

token = re.compile(r"Mr\.|[\w']+|[.?]")
tokens = token.findall(text)
print(tokens[:10])
print(len(tokens))

["'Curiouser", 'and', 'curiouser', "'", 'cried', 'Alice', 'she', 'was', 'so', 'much']
147


Questions:

* a) The tokenizer clearly makes a few mistakes. Where?

* b) Write a tokenizer to correctly tokenize the text.

* c) Should one separate `'m`, `'ll`, `n't`, possessives, and other forms of contractions from the word? Implement a tokenizer that separates these, and attaches the `'` to the latter part of the contraction.

* d) Should an ellipsis (...) be considered as three `.`s or one `...`? Design a regular expression for both solutions.


## 3. Twitter Tokenization
As you might imagine, tokenizing tweets differs from standard tokenization. There are 'rules' on what specific elements of a tweet might be (mentions, hashtags, links), and how they are tokenized. The goal of this exercise is not to create a bullet-proof Twitter tokenizer but to understand tokenization in a different domain.

In the next exercises, we will focus on the following tweet:

In [6]:
tweet = "@robv New vids coming tomorrow #excited_as_a_child, can't w8!!"

In [7]:
token = re.compile(r'\w+')
tokens = token.findall(tweet)
print(tokens)

['robv', 'New', 'vids', 'coming', 'tomorrow', 'excited_as_a_child', 'can', 't', 'w8']
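
One way to see what a token-defining pattern silently discards is to list the non-whitespace characters that end up in no token. The following is a minimal sketch of that idea. `dropped_characters` is a hypothetical helper introduced here, not part of the notebook.

```python
import re

def dropped_characters(pattern, text):
    """Return the non-whitespace characters of `text` that no token covers."""
    covered = set()
    for m in re.finditer(pattern, text):
        covered.update(range(m.start(), m.end()))
    return [ch for i, ch in enumerate(text)
            if i not in covered and not ch.isspace()]

tweet = "@robv New vids coming tomorrow #excited_as_a_child, can't w8!!"
print(dropped_characters(r"\w+", tweet))
# ['@', '#', ',', "'", '!', '!']
```

The same check can be run with your own pattern to see whether mentions, hashtags, and punctuation survive tokenization.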


Questions:
- a) What is the correct tokenization of the tweet above, in your opinion?
- b) Try your tokenizer from the previous exercise (Section 2). Which cases are going wrong? Make sure your tokenizer handles the above tweet correctly.
- c) Will your tokenizer correctly tokenize emojis?
- d) Think of at least one other example where your tokenizer will behave incorrectly.

## 4. Segmentation


Sentence segmentation is not a trivial task either.

First, make sure you understand the following sentence segmentation code used in the lecture:

In [8]:
import re

def sentence_segment(match_regex, tokens):
    r"""
    Splits a sequence of tokens into sentences, splitting wherever the given matching regular expression
    matches.

    Parameters
    ----------
    match_regex : the compiled regular expression that defines at which token to split.
    tokens : the input sequence of string tokens.

    Returns
    -------
    a list of token lists, where each inner list represents a sentence.

    >>> tokens = ['the','man','eats','.','She', 'sleeps', '.']
    >>> sentence_segment(re.compile(r'\.'), tokens)
    [['the', 'man', 'eats', '.'], ['She', 'sleeps', '.']]
    """
    current = []
    sentences = [current]
    for tok in tokens:
        current.append(tok)
        if match_regex.match(tok):
            current = []
            sentences.append(current)
    if not sentences[-1]:
        sentences.pop(-1)
    return sentences
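
Note that the segmenter calls `match_regex.match(tok)`, which only anchors the pattern at the start of the token. The sketch below shows what that means for a few sample tokens. The name `boundary` and the sample tokens are chosen here for illustration.

```python
import re

boundary = re.compile(r"\.")

# match() only anchors at the start of the token, so all of these
# would be treated as sentence boundaries:
print(bool(boundary.match(".")))      # True
print(bool(boundary.match("...")))    # True: an ellipsis token starts with '.'
print(bool(boundary.match(".5")))     # True

# A token that merely contains or ends with '.' is not a boundary:
print(bool(boundary.match("U.K.")))   # False

# fullmatch() requires the whole token to be exactly the pattern:
print(bool(boundary.fullmatch("...")))  # False
```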


In the following code, there is a variable `text` containing a small text and a regular expression-based segmenter:

In [9]:
text = """
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is the longest official one-word placename in U.K. Isn't that weird? I mean, someone took the effort to really make this name as complicated as possible, huh?! Of course, U.S.A. also has its own record in the longest name, albeit a bit shorter... This record belongs to the place called Chargoggagoggmanchauggagoggchaubunagungamaugg. There's so many wonderful little details one can find out while browsing http://www.wikipedia.org during their Ph.D. or an M.Sc.
"""

token = re.compile(r"Mr\.|[\w']+|[.?]+")

tokens = token.findall(text)
sentences = sentence_segment(re.compile(r'\.'), tokens)
for sentence in sentences:
    print(sentence)

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U', '.']
['K', '.']
["Isn't", 'that', 'weird', '?', 'I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?', 'Of', 'course', 'U', '.']
['S', '.']
['A', '.']
['also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '...']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '.']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http', 'www', '.']
['wikipedia', '.']
['org', 'during', 'their', 'Ph', '.']
['D', '.']
['or', 'an', 'M', '.']
['Sc', '.']


Questions:
- a) Improve the segmenter so that it segments the text in the way you think is correct.
- b) How would you deal with all URLs effectively?
- c) Can you think of other problematic cases not covered in the example?

# Lecture 2

## 5. Linux Command Line for NLP: CoNLL Format
In natural language processing, the "CoNLL" format is a very common standard for representing annotated text. There is a variety of CoNLL formats, so it might be more correct to refer to them as CoNLL-like formats. These formats have one word per line, separate sentences with an empty line, and have separate columns (separated with tabs) for each annotation layer.

In this assignment, we will use the CoNLL format for named entity recognition (from CoNLL-2002: [paper](https://aclanthology.org/W02-2024.pdf)). We will use Danish data from [DaN+](https://aclanthology.org/2020.coling-main.583.pdf). This data follows the BIO labels as discussed in the lecture. An example of the data is shown below; it contains one entity span, "goergh bush":

```
-       O
en      O
mand    O
der     O
hedder  O
goergh  B-PER
bush    I-PER
.       O

```
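
To make the format concrete, here is a minimal Python sketch that reads CoNLL-like text and collects the BIO spans. It assumes exactly what is described above (tab-separated columns, token first and label second, empty line between sentences). `read_conll` and `bio_spans` are hypothetical helpers written for this illustration. The exercises below should still be solved with Unix tools.

```python
def read_conll(text):
    """Parse CoNLL-like text into sentences of (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():           # an empty line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: at least two tab-separated columns per line.
        token, label = line.split("\t")[:2]
        current.append((token, label))
    if current:                        # last sentence without trailing blank line
        sentences.append(current)
    return sentences

def bio_spans(sentence):
    """Collect (entity_type, tokens) spans from one BIO-labelled sentence."""
    spans = []
    open_span = None                   # span currently being extended, if any
    for token, label in sentence:
        if label.startswith("B-"):
            open_span = (label[2:], [token])
            spans.append(open_span)
        elif label.startswith("I-") and open_span and open_span[0] == label[2:]:
            open_span[1].append(token)
        else:                          # 'O', or an I- tag without a matching B-
            open_span = None
    return spans

sample = ("-\tO\nen\tO\nmand\tO\nder\tO\nhedder\tO\n"
          "goergh\tB-PER\nbush\tI-PER\n.\tO\n\n")
sentences = read_conll(sample)
print(bio_spans(sentences[0]))  # [('PER', ['goergh', 'bush'])]
```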


Use Unix command-line tools for this assignment (`grep`, `sed`, etc.).

* a) Search in the `da_arto.conll` file (in the assingment1 directory) for first names. You can assume that first names always have the label B-PER, and that the string "B-PER" does not occur in the first column. 
* b) How many names occur in the data?
* c) How can we make sure that we do not match the string "B-PER" occurring in the first column?
* d) How can we clean away the labels, so that we have only a list of names left? (hint: pipe the result of the previous command into a tool that splits each line into columns, such as `cut`)
* e) How many of the names you found start with an uppercase character?




## 6. More Advanced Usage of Unix Tools: Creating a word frequency list, finding function words
Let us now create a simple word frequency list from a book using Unix tools to answer the following question: Which four (function) words are the most frequent in *The Adventures of Sherlock Holmes* by Arthur Conan Doyle (`pg1661.txt`)?

* The first step is to split the text into separate words. Here, we will use the command `sed` to replace every space with a newline:

```
sed 's/ /\n/g' FILE
```

Note: Remember the flag `g`, which stands for global. It replaces all occurrences of a space on a line.
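
For readers more at home in Python, the effect of the `g` flag can be mimicked with the `count` argument of `re.sub`. This is only an analogy on a made-up one-line string, not a replacement for the `sed` command.

```python
import re

line = "the quick brown fox"

# Like sed without the g flag: only the first space on the line is replaced.
first_only = re.sub(" ", "\n", line, count=1)
# Like sed with the g flag: every space on the line is replaced.
all_spaces = re.sub(" ", "\n", line)

print(first_only.split("\n"))  # ['the', 'quick brown fox']
print(all_spaces.split("\n"))  # ['the', 'quick', 'brown', 'fox']
```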

* Hint: It is handy to forward this command to a tool called `less`, which lets you browse through the result (type `q` to quit).

```
sed 's/ /\n/g' FILE | less
```

* Now we can sort the list of tokens and count unique words:

```
sed 's/ /\n/g' FILE | sort | uniq -c
```

* To list the most frequent words first, sort again in reverse numeric order (find the options of `sort` to do so, e.g., check `man sort`).

Note: Here we used `sed`; our textbook shows an alternative with `tr`.
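
For comparison, the same counting idea can be sketched in Python with `collections.Counter`. The two-line `sample` string is a made-up stand-in for the book, so the numbers say nothing about `pg1661.txt`.

```python
from collections import Counter

# Tiny stand-in for the book; the exercise itself uses pg1661.txt and Unix tools.
sample = "the dog saw the cat\nthe cat saw a bird"

# Step 1 (the sed step): put every space-separated word on its own line.
words = sample.replace(" ", "\n").split("\n")
# Step 2 (the sort | uniq -c step): count how often each distinct word occurs.
counts = Counter(words)

# Show the most frequent words first.
for word, freq in counts.most_common(3):
    print(freq, word)
```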