In [2]:
import re

# Regular Expressions

A formal language for specifying a set of text strings.

A regular expression is a pattern that specifies the text strings to search for in a corpus of text.

For more info, refer to [Regular Expressions](../general_computing/regular_expression.ipynb)



## Errors

1. False positive (Type I): Matching strings that we do not want to be matched
2. False negative (Type II): Not matching strings we want to match

Suppose that we wish to find all occurrences of the word `at`, regardless of capitalization, in the following corpus:

`All cats like to sleep at home. At noon, they wake up to eat`

If we use the regular expression `at`, we get:

In [4]:
def show_matches(exp, text):
    for m in list(re.finditer(exp, text))[::-1]:
        a, b = m.start(), m.end()
        text = text[:b] + "\x1b[0m" + text[b:]
        text = text[:a] + "\x1b[31m" + text[a:]

    return text


corpus = "All cats like to sleep at home. At noon, they wake up to eat"

print(show_matches(r"at", corpus))

All c[31mat[0ms like to sleep [31mat[0m home. At noon, they wake up to e[31mat[0m


This result has false positives, since it matches `c(at)s` and `e(at)`, and a false negative, since it misses `At`.

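A more accurate expression is `\b[Aa][Tt]\b`, where `\b` matches a word boundary, so that `at` is matched only as a whole word, regardless of capitalization.

In [5]:
print(show_matches(r"\b[Aa][Tt]\b", corpus))

All cats like to sleep [31mat[0m home. [31mAt[0m noon, they wake up to eat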


Reducing errors involves two opposing efforts
* Increase **accuracy/precision** $\rightarrow$ Minimise false positives
* Increase **coverage/recall** $\rightarrow$ Minimise false negatives

One can see that if we increase the strictness of our match, it is likely that we improve precision but reduce recall.
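To make the trade-off concrete, the sketch below scores a loose pattern, a strict pattern, and a balanced pattern against hand-labelled target spans in the corpus above. The helper name `precision_recall` and the way the gold spans are built are assumptions made for illustration.

```python
import re


def precision_recall(pattern, text, gold_spans):
    """Score a pattern against the set of (start, end) spans we wanted to match."""
    found = {m.span() for m in re.finditer(pattern, text)}
    true_pos = len(found & gold_spans)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    return precision, recall


corpus = "All cats like to sleep at home. At noon, they wake up to eat"

# Hand-labelled gold standard (assumed): the standalone words "at" and "At"
starts = (corpus.index(" at ") + 1, corpus.index("At noon"))
gold = {(s, s + 2) for s in starts}

for pattern in (r"[Aa][Tt]", r"\bat\b", r"\b[Aa][Tt]\b"):
    p, r = precision_recall(pattern, corpus, gold)
    print(f"{pattern:15} precision={p:.2f} recall={r:.2f}")
```

The loose pattern `[Aa][Tt]` finds both targets but also matches inside `cats` and `eat`, while the strict pattern `\bat\b` makes no wrong matches but misses `At`.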

# Corpus Preprocessing
To determine how many words there are in a corpus, we need to define what a unique word is. 
Words can be defined by their:

* Lemma: same stem, part of speech, rough word sense 

    * cat and cats = same lemma

* Wordform: the full inflected surface form

    * cat and cats = different wordforms

Type: Element of the vocabulary

Token: Instance of that type in running text
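As a small illustration of the difference between types and tokens, the sketch below counts both in a toy sentence. The sentence and the simple `\w+` tokenizer are assumptions chosen for brevity, not a recommended tokenization scheme.

```python
import re
from collections import Counter

text = "the cat sat on the mat and the cats sat too"

# Tokens: every instance of a word in running text
tokens = re.findall(r"\w+", text.lower())

# Types: the distinct wordforms in the vocabulary, with their frequencies
types = Counter(tokens)

print("tokens:", len(tokens))
print("types:", len(types))
print("occurrences of 'the':", types["the"])
```

Here `cat` and `cats` count as two types because they are different wordforms, even though they share a lemma.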


## Issues with Tokenization
Tokenization in English has many ambiguities:

* Apostrophe used for belonging may split into different word forms
    * London's $\rightarrow$ London, Londons, London's
* Apostrophe used for contraction may cause odd word fragments
    * what're $\rightarrow$ what + re
* Hyphenated compound words may be split or merged inconsistently
    * break-through $\rightarrow$ breakthrough
* Acronyms and abbreviations with periods may cause issues:
    * L.A.P.D, Mr.
    
    
Similar issues arise in other languages as well:

* French: L'ensemble

* German: Lebensversicherungsgesellschaftsangestellter

    * This is a single compound word, so we need a scheme to split it into its components

* Chinese/Japanese have no spaces between words

    * Multiple writing systems are intermingled in Japanese: Hiragana, Katakana, Kanji


## Word Tokenization in Chinese

Standard baseline segmentation algorithm: 

* Maximum/Greedy Matching

### Maximum Matching
1. Start a pointer at the beginning of the string
2. Find the longest word in the dictionary that matches the string starting at the pointer
3. Move the pointer past that word
4. Repeat steps 2-3 until the end of the string is reached

This generally doesn't work in English

`Themeterratedropped` may be tokenized to "Theme terra ted ropped", when "The meter rate dropped" may be the intended meaning.

However, this works rather well for Chinese
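The steps above can be sketched as follows. The function name `max_match`, the single-character fallback for unmatched text, and the toy dictionary are assumptions made for illustration; a real segmenter would use a full lexicon.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character (assumed fallback)
            tokens.append(text[i])
            i += 1
    return tokens


# Toy dictionary (assumed) containing both the intended and the misleading words
dictionary = {"the", "theme", "meter", "rate", "dropped", "terra", "ted"}
print(max_match("themeterratedropped", dictionary))
```

Because the algorithm commits to `theme` instead of `the`, it can never recover `meter`, which is the failure described above for English.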

## Morphology <a id="morphology"></a>
**The study of the way words are built from morphemes**

Morphemes: The minimal meaning-bearing units in a language
* Stems: Core meaning-bearing unit (e.g., cow)
* Affixes: Bits that adhere to stems. Often carry grammatical function (e.g., -s)

### Benefits of Morphology

It is inefficient to list all the different morphological variants of a word in a dictionary.
For example, storing `ride`, `riding`, `rides`, `rode` ... will take up excessive space.

Instead, because affixes are productive, we can generate new word forms from a stem.

Also, it might be impossible to list all morphological variants of every word, especially for morphologically complex languages like Turkish.


### Forms of Morphology

#### Inflectional
* Combine a stem and an affix to form a word in the same class as stem
    * e.g., noun $\rightarrow$ noun
* For syntactic function like agreement
* e.g., -s to form plural form of a noun

#### Derivational
* Combine a stem and an affix to form a word in a different class
    * e.g., verb $\rightarrow$ noun
* Harder to predict the meaning of the derived form
* e.g., -ation in organize and organization

### Out of Vocabulary
New words are constantly being created.

However, we can apply morphological analysis to try to understand them even though we have never seen them before.

For example, even if we are in the year before "selfie" was first coined, we can infer that it has something to do with "self".

### Repurposed Byte Pair Encoding (BPE)

1. Find the most frequent pair of adjacent symbols in the string
2. Add a new symbol representing that pair to the vocabulary
3. Replace every occurrence of the pair in the string with the new symbol
4. Repeat steps 1-3 up to k times, where k is a tunable parameter (dependent on corpus size)

In [33]:
from collections import Counter


def bpe(text, k):
    # Lowercase the text so that uppercase letters are free to act as merge symbols
    text = text.lower()
    vocab = {}
    curr_token = "A"
    for _ in range(k):
        # Count every pair of adjacent symbols
        counter = Counter(text[i : i + 2] for i in range(len(text) - 1))
        if not counter:
            break  # fewer than two symbols left, nothing to merge
        most_common_pair, _freq = counter.most_common(1)[0]
        # Record the merge, then replace the pair with the new symbol
        vocab[curr_token] = most_common_pair
        text = text.replace(most_common_pair, curr_token)
        curr_token = chr(ord(curr_token) + 1)

    return text, vocab


bpe("wwwxzwwwxwy", 3)

('CzCwy', {'A': 'ww', 'B': 'Aw', 'C': 'Bx'})

## Normalization
Convert text to a convenient standard form.

* `U.S.A` to `USA`

Alternative: asymmetric expansion

* `beat` to `beat`,`beats`

* `beats` to `beat`, `beats`, `Beats`

* `Beats` to `Beats`

Notice that `beats` $\rightarrow$ `Beats` but `Beats` $\not \rightarrow$ `beats`, because "Beats" is likely referring to the brand "Beats".

### Case folding
Converting all text to a single case (upper or lower) for analysis.

Folding allows us to group together tokens that are the same word but appear in different parts of a sentence.

`He jogs in the morning. Jogs in the morning are tiring.`

where both "jogs" are tokenized to the same token.

However, case folding may also group tokens together improperly.

For example, `No one can divide us! US (United States) will stand united!`, where "us" has a different meaning from "US".
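The effect of case folding on both examples can be checked with a short sketch. The `\w+` tokenizer and the combined example text are assumptions made for illustration.

```python
import re
from collections import Counter

text = (
    "He jogs in the morning. Jogs in the morning are tiring. "
    "No one can divide us! US will stand united!"
)

# Without folding, "jogs" and "Jogs" are counted as different tokens
raw_counts = Counter(re.findall(r"\w+", text))

# With folding, they are grouped, but "us" and "US" are conflated as well
folded_counts = Counter(re.findall(r"\w+", text.lower()))

print(raw_counts["jogs"], raw_counts["Jogs"])
print(folded_counts["jogs"])
print(folded_counts["us"])
```

The last count shows the cost of folding: the pronoun and the country name can no longer be told apart.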

### Lemmatization
Reduce words to their base form

* am, are, is, were, was $\rightarrow$ be
* cow, cows, cow's $\rightarrow$ cow

Requires a dictionary that maps each word to all of its possible forms

### Penn Treebank Tokenization
* Separate out clitics (morphemes that play the role of an independent word but do not occur alone)

`don't` $\rightarrow$ `do n't`

* Keep hyphenated words together

* Separate out all punctuation symbols

### Stemming
Reduce terms to their stems

Crude chopping of affixes

`Activate, activated, activation` $\rightarrow$ `activat`

* Language dependent

#### Porter Stemmer
An efficient stemming algorithm that is built on regular expressions

A series of rules that update the text in each pass.

Does not require a lexicon
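The idea of rule-based suffix stripping can be sketched with a few regular expressions. The three rules below are hypothetical and heavily simplified; they are not the actual Porter rule set, which has many more rules and guards each one with conditions on the remaining stem.

```python
import re

# Hypothetical suffix rules, tried in order; the first rule that applies wins
RULES = [
    (r"ion$", ""),  # activation -> activat
    (r"ed$", ""),   # activated  -> activat
    (r"e$", ""),    # activate   -> activat
]


def crude_stem(word):
    """Strip the first matching suffix; no lexicon is consulted."""
    for pattern, replacement in RULES:
        stemmed, n_subs = re.subn(pattern, replacement, word)
        if n_subs:
            return stemmed
    return word


for word in ("activate", "activated", "activation"):
    print(word, "->", crude_stem(word))
```

Without the guarding conditions, rules like these over-stem short words (for example, `bed` would lose its `ed`), which is one reason the real algorithm is more elaborate.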


### Sentence Segmentation
`!` and `?` are generally unambiguous as the end of the sentence.

However, `.` is ambiguous, in the case of `U.S.A`, `Dr.`,  `4.3`.

To solve this, we can use a decision tree to determine if it is the end of the sentence.
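As a rough illustration, the hand-written rules below stand in for a learned decision tree: each `if` plays the role of one node. The abbreviation list, the capitalization test, and the function names are assumptions; a trained classifier would learn such features and thresholds from data.

```python
# Tiny, assumed abbreviation list; a real system would use a much larger one
ABBREVIATIONS = {"dr.", "mr.", "mrs."}


def is_sentence_end(token, next_token):
    """Decide whether a whitespace-delimited token ends a sentence."""
    if token.endswith(("!", "?")):
        return True  # generally unambiguous
    if not token.endswith("."):
        return False  # periods inside "4.3" or "U.S.A" are not token-final
    if token.lower() in ABBREVIATIONS:
        return False
    # A final period followed by a capitalized word (or nothing) ends the sentence
    return next_token is None or next_token[0].isupper()


def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_sentence_end(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Lee paid 4.3 dollars in the U.S.A last year. Was it worth it? Yes!"
for sentence in split_sentences(text):
    print(sentence)
```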


## Spelling Errors

1. Non-word error detection
    * Detecting that `desret` is not a word
2. Isolated-word error correction
    * Seeing that `desret` may be a misspelling of `desert`
3. Context-sensitive error detection and correction
    * Noticing that the context is about sweets and thus `dessert` is more fitting

### Spelling Error Patterns <a id="single_error"></a>
Single errors are the most common type of error in typewritten text

Types of single-errors:
* Insertion
    * `desserrt`
* Deletion
    * `dssert`
* Substitution
    * `dessett`
* Transposition
    * `dessret`

### Candidate Generation
Similar spelling
* Small edit distance

Similar pronunciation
* Small edit distance to the pronounced word
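Candidates with similar spelling can be generated by applying every possible single edit to the observed string and keeping the results that are real words. The sketch below follows that idea; the function name `edits1` and the toy vocabulary are assumptions for illustration.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings that are one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | transposes | substitutes | inserts


# Toy vocabulary (assumed); a real system would use a full word list
vocabulary = {"desert", "dessert", "deserts", "deter"}

candidates = edits1("desret") & vocabulary
print(candidates)
```

Only `desert` is a single transposition away from `desret`; reaching `dessert` would need a second edit, which could be covered by applying `edits1` again to each result.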

## Noisy Channel Intuition

We can model the generation of the corpus as the intended text passing through a noisy channel, which may corrupt some of the words

Our task is to decode the noisy signal back into the originally intended words


### Noisy Channel

Given an observation $x$ of a misspelled word, we wish to find the correct word $w$:

$$\hat w = \text{argmax} _{w\in V} Pr(w|x)$$

By Bayes' rule:

$$\hat w = \text{argmax} _{w\in V} \frac{Pr(x| w)Pr(w)}{Pr(x)}$$

Since $Pr(x)$ is the same for every candidate word $w$, we only need to compare:

$$\hat w = \text{argmax} _{w\in V} Pr(x| w)Pr(w)$$

#### Estimating $Pr(w)$

We can use a Maximum Likelihood Estimate (MLE):

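$Pr(w) = \frac{freq(w)}{N}$

where $freq(w)$ is the number of times $w$ occurs in the corpus and $N$ is the total number of word tokens. This represents the probability of the word $w$ appearing in the text.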

#### Estimating $Pr(x|w)$  <a id="confusion_matrix"></a>

We can use a pre-computed confusion matrix of spelling errors to determine the probability that $w$ will be misspelled as $x$.
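Putting the two estimates together, the sketch below picks the candidate that maximizes $Pr(x|w)Pr(w)$. All counts and channel probabilities are hypothetical numbers invented for illustration, and the names `prior` and `best_correction` are assumptions.

```python
# Hypothetical corpus counts, used for the MLE prior Pr(w) = freq(w) / N
word_counts = {"desert": 60, "dessert": 40}
N = sum(word_counts.values())

# Hypothetical channel probabilities Pr(x | w), as might be derived from a confusion matrix
channel = {
    ("desret", "desert"): 0.01,     # one transposition
    ("desret", "dessert"): 0.0001,  # needs two edits, so far less likely
}


def prior(w):
    return word_counts[w] / N


def best_correction(x, candidates):
    """Return the candidate w that maximizes Pr(x | w) * Pr(w)."""
    return max(candidates, key=lambda w: channel.get((x, w), 0.0) * prior(w))


print(best_correction("desret", ["desert", "dessert"]))
```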



## Edit Distance

A method to compare words based on their similarity in spelling.

### Uses of Edit Distance
* Candidate Generation
    * Most errors are within edit distance 1; almost all errors are within edit distance 2.
    * Allow insertion of space/hyphens

* Evaluating Machine Translation and Speech Recognition
* Named Entity Extraction and Entity Coreference
* Nucleotide Comparison in Computational Biology

The allowed edit operations are the same as those in [single errors](#single_error):
* Insertion
* Deletion
* Substitution
* Transposition

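However, we can assign a different cost to each of the operations.

In the variant of minimum edit distance (**Levenshtein distance**) used here, insertion and deletion each have a cost of 1, while substitution has a cost of 2 (equivalent to a deletion followed by an insertion). The classic Levenshtein definition charges 1 for substitution. Transposition is not allowed in either version.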

The solution to this is the same as the variant discussed in [Dynamic Programming](../algorithm_analysis/dynamic_programming.ipynb#edit_distance) section.

### Backtrace
By remembering which cell each value came from, we can trace back the path to recover the alignment of the two strings.

### Complexity
For two strings of lengths m and n:

Time: O(mn)

Space: O(mn)

Backtrace: O(m+n)
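For reference, here is a minimal sketch of the dynamic programming table with backpointers, using the costs described above (insertion 1, deletion 1, substitution 2). The function name, the operation labels, and the tie-breaking order are assumptions made for illustration.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Return (distance, list of operations turning source into target)."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]

    # First column: delete everything; first row: insert everything
    for i in range(1, m + 1):
        dist[i][0], back[i][0] = dist[i - 1][0] + del_cost, "delete"
    for j in range(1, n + 1):
        dist[0][j], back[0][j] = dist[0][j - 1] + ins_cost, "insert"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            options = [
                (dist[i - 1][j - 1] + (0 if same else sub_cost),
                 "match" if same else "substitute"),
                (dist[i - 1][j] + del_cost, "delete"),
                (dist[i][j - 1] + ins_cost, "insert"),
            ]
            # Keep the cheapest option and remember which cell it came from
            dist[i][j], back[i][j] = min(options, key=lambda option: option[0])

    # Backtrace from the bottom-right cell: at most m + n steps
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ("match", "substitute"):
            i, j = i - 1, j - 1
        elif op == "delete":
            i -= 1
        else:
            j -= 1
    return dist[m][n], ops[::-1]


distance, ops = min_edit_distance("intention", "execution")
print(distance)
print(ops)
```

The two tables account for the O(mn) time and space, and the final loop is the O(m+n) backtrace.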

## Weighted Min Edit Distance
We can add weights to certain edit operations, because some operations may be more common than others.

In the case of spelling correction, some letters are more likely to be mistyped than others (due to proximity on the keyboard etc)

In biology, certain deletion/insertion of sequences may be more common than others.

To account for this, we can use the same [confusion matrix](#confusion_matrix) we used to determine the word from a noisy channel.
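A weighted variant only needs the substitution cost to depend on the pair of letters involved. In the sketch below, the set of adjacent keys and the cost values 0.5 and 2.0 are hypothetical; in practice the weights might be derived from confusion-matrix counts.

```python
def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance where sub_cost(a, b) gives the cost of replacing a with b."""
    # Only the previous row is kept, since the distance alone needs no backtrace
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [i * del_cost]
        for j, t in enumerate(target, 1):
            curr.append(min(
                prev[j - 1] + (0.0 if s == t else sub_cost(s, t)),
                prev[j] + del_cost,
                curr[j - 1] + ins_cost,
            ))
        prev = curr
    return prev[-1]


# Hypothetical weights: a few pairs of neighbouring QWERTY keys are cheap to confuse
ADJACENT = {frozenset("er"), frozenset("rt"), frozenset("as"), frozenset("sd")}


def keyboard_sub_cost(a, b):
    return 0.5 if frozenset((a, b)) in ADJACENT else 2.0


print(weighted_edit_distance("dessett", "dessert", keyboard_sub_cost))
print(weighted_edit_distance("dessept", "dessert", keyboard_sub_cost))
```

Typing `t` for the neighbouring `r` is scored as a smaller error than typing the distant `p`, so `dessett` ends up closer to `dessert` than `dessept` does.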