In [5]:
import re

## Regular Expressions

A formal language for specifying a set of text strings.

A pattern that specifies the text strings to search for in a corpus.

For more info, refer to [Regular Expressions](../general-computing/regular_expression.ipynb)



### Errors

1. False positive (Type I): Matching strings that we do not want to be matched
2. False negative (Type II): Not matching strings we want to match

Suppose that we wish to find all occurrences of the word `at`, regardless of capitalization, in the following corpus:

`All cats like to sleep at home. At noon, they wake up to eat`

If we use the regular expression `at`, we get:

In [6]:
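def show_matches(exp, text):
    """Return text with every match of exp wrapped in parentheses."""
    # Insert from the last match backwards so earlier offsets stay valid.
    for m in reversed(list(re.finditer(exp, text))):
        a, b = m.start(), m.end()
        text = text[:b] + ")" + text[b:]
        text = text[:a] + "(" + text[a:]

    return text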

corpus = "All cats like to sleep at home. At noon, they wake up to eat"

print(show_matches(r"at", corpus))

All c(at)s like to sleep (at) home. At noon, they wake up to e(at)


This result has false positives, where it matches `c(at)s` and `e(at)`, and a false negative, where it misses `At`.

A better expression is `\b[Aa][Tt]\b`, which accepts either capitalization and uses the word boundary `\b` to reject `at` inside longer words

In [7]:
print(show_matches(r"\b[Aa][Tt]\b", corpus))

All cats like to sleep (at) home. (At) noon, they wake up to eat


Reducing errors involves two opposing efforts
* Increase **accuracy/precision** $\rightarrow$ Minimise false positives
* Increase **coverage/recall** $\rightarrow$ Minimise false negatives
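
As a rough illustration of the trade-off, the sketch below scores a pattern against a hand-labelled set of spans for the standalone word `at`/`At` in the corpus. The `gold` set and the helper `precision_recall` are assumptions introduced here for illustration, not part of the original notes.

```python
import re

corpus = "All cats like to sleep at home. At noon, they wake up to eat"

# Hand-labelled (start, end) spans of the standalone word "at"/"At".
# This gold standard is an assumption made for the example.
gold = {(23, 25), (32, 34)}

def precision_recall(pattern, text, gold):
    """Compare the spans matched by pattern against the gold spans."""
    predicted = {m.span() for m in re.finditer(pattern, text)}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

print(precision_recall(r"at", corpus, gold))
print(precision_recall(r"\b[Aa][Tt]\b", corpus, gold))
```

The plain pattern `at` finds three spans of which only one is wanted (precision 1/3) and misses `At` (recall 1/2), while a word-boundary pattern recovers both gold spans and nothing else.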

## How Many Words?

I do uh main- mainly business data processing

Should fragments (`main-`) and filled pauses (`uh`) count as words?

* Lemma: same stem, part of speech, rough word sense 

cat and cats = same lemma

* Wordform: the full inflected surface form

cat and cats = different wordforms

they lay back on the San Francisco grass and looked at the stars and their

Type: Element of the vocabulary

Token: Instance of that type in running text

`San Francisco`: Should it be one word or 2?
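
A minimal sketch of counting tokens and types for the San Francisco sentence above, assuming plain whitespace tokenization (so `San` and `Francisco` count as two separate tokens):

```python
sentence = "they lay back on the San Francisco grass and looked at the stars and their"

# Whitespace tokenization is a simplifying assumption for this example.
tokens = sentence.split()
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
```

Under this assumption there are 15 tokens and 13 types, because `the` and `and` each occur twice. Treating `San Francisco` as a single word would lower both counts by one.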

## Issues with Tokenization

Finland's capital $\rightarrow$ Finland, Finlands, Finland's

what're $\rightarrow$ what are

Hewlett-Packard $\rightarrow$ Hewlett Packard?

Similar issues arise in other languages

French: L'ensemble

German: Lebensversicherungsgesellschaftsangestellter (life insurance company employee)

German builds long compound words like this one, which we would want to split into their components

Chinese/Japanese have no spaces between words

Multiple writing systems are intermingled in Japanese: Hiragana, Katakana, Kanji


## Word Tokenization in Chinese

Standard baseline segmentation algorithm: 

* Maximum/Greedy Matching

### Maximum Matching
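
Maximum matching starts at the beginning of the string, takes the longest dictionary word that matches at the current position, and then continues from the end of that word. The sketch below is one possible reading of that idea. The function name `max_match`, the single-character fallback for unknown text, and the toy dictionary are assumptions made for illustration.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word is found.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # No dictionary word starts here: emit a single character.
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens

# Toy dictionary, hypothetical and far smaller than a real lexicon.
toy_dictionary = {"他", "特别", "喜欢", "北京", "烤鸭", "北京烤鸭"}
print(max_match("他特别喜欢北京烤鸭", toy_dictionary))
```

Greedy matching tends to work less well for English, where words are longer and overlap more: with a dictionary containing `the`, `theta`, `table`, `bled`, `down`, `own` and `there`, the string `thetabledownthere` is segmented as `theta bled own there`.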



## Why Morphology?
Listing all the different morphological variants of a word in a
dictionary is inefficient.

Affixes are productive; they apply to new words (e.g., fax and faxing).

For morphologically complex languages like Turkish, it is
impossible to list all morphological variants of every word.

## Forms of Morphology

### Inflectional
* Combine a stem and an affix to form a word in the same class as the stem
* For syntactic function like agreement
* e.g., -s to form plural form of a noun

### Derivational
* Combine a stem and an affix to form a word in a different class
* Harder to predict the meaning of the derived form
* e.g., -ation turns the verb computerize into the noun computerization

### Out of Vocabulary
New words are always being created

However, we can apply morphological analysis to try to understand them even though we have never seen them before

### Repurposed Byte Pair Encoding (BPE)

1. Find the most frequent pair of adjacent symbols in the string
2. Add a new symbol representing that pair to the vocabulary
3. Replace every occurrence of the pair in the string with the new symbol
4. Repeat up to k times, where k is a tunable parameter (dependent on corpus size)
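
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `learn_bpe`, the end-of-word marker `_`, and breaking ties by first occurrence are all assumptions made here.

```python
from collections import Counter

def learn_bpe(words, k):
    """Learn up to k merges from a list of words (repeats act as frequencies)."""
    # Represent each word as a tuple of symbols plus an assumed end marker "_".
    vocab = Counter(tuple(w) + ("_",) for w in words)
    merges = []
    for _ in range(k):
        # Step 1: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        # Step 2: record the new merged symbol.
        merges.append(best)
        # Step 3: replace every occurrence of the pair with the new symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(toy_words, k=3)
print(merges)
```

On this toy word list the first three merges build up the suffix `est_`, which is the kind of subword unit that helps with out-of-vocabulary words.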

## Normalization
Convert text to a convenient standard form.

`U.S.A` to `USA`

Alternative: asymmetric expansion

### Case folding

### Lemmatization
Reduce words to their base form

### Penn Treebank Tokenization
Separate out clitics

`don't` $\rightarrow$ `do n't`

### Stemming
Reduce terms to their stems

Crude chopping of affixes

* Language dependent

#### Porter Stemmer
Efficient stemming algorithm that is built on regular expressions

Does not require a lexicon
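
To give a flavour of rule-based suffix stripping, the sketch below applies only the first group of Porter rules (step 1a: `sses` to `ss`, `ies` to `i`, `ss` unchanged, `s` removed). It is a small fragment for illustration, not the full stemmer, and the names `RULES_1A` and `porter_step_1a` are made up here.

```python
import re

# Step 1a rules, tried in order; the first matching suffix wins.
RULES_1A = [
    (r"sses$", "ss"),
    (r"ies$", "i"),
    (r"ss$", "ss"),
    (r"s$", ""),
]

def porter_step_1a(word):
    """Apply the first matching step 1a rewrite rule to a lowercase word."""
    for pattern, replacement in RULES_1A:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that `ponies` becomes `poni`, which is not a dictionary word: stems only need to be consistent, not readable, which is why no lexicon is needed.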


## Sentence Segmentation
`!` and `?` are generally unambiguous as the end of the sentence.

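However, `.` is ambiguous: it also appears inside abbreviations such as `U.S.A` and `Dr.`

To solve this, we can use a classifier such as a decision tree to decide whether each `.` marks the end of a sentence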
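
A decision tree for this task asks a sequence of yes/no questions about each candidate punctuation mark. The sketch below hand-codes a few such questions rather than learning them from data. The abbreviation list and the function names are assumptions for illustration only.

```python
# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof"}

def is_sentence_end(text, i):
    """Hand-written decision tree for the character at position i."""
    ch = text[i]
    if ch in "!?":
        return True
    if ch != ".":
        return False
    # Is the word before the period a known abbreviation?
    start = text.rfind(" ", 0, i) + 1
    if text[start:i].lower() in ABBREVIATIONS:
        return False
    # Is the period inside a token such as U.S.A (no whitespace after it)?
    if i + 1 < len(text) and not text[i + 1].isspace():
        return False
    return True

def split_sentences(text):
    """Split text at every position the tree labels as a sentence end."""
    sentences, start = [], 0
    for i in range(len(text)):
        if is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith moved to the U.S.A in May. Was it cold? Yes!"))
```

A learned tree could add further features, such as the capitalization of the following word or how often the preceding word ends a sentence in a training corpus.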


## Spelling Errors

1. Non-word error detection
2. Isolated-word error correction

### Spelling Error Patterns
Single-character errors are the most common type of error in typewritten text

* Insertion
* Deletion
* Substitution
* Transposition

### Candidate Generation
Similar spelling
* Small edit distance

Similar pronunciation
* Small edit distance between pronunciations
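
For the spelling-based case, one common way to generate candidates is to enumerate every string one edit away from the misspelling and keep those found in the vocabulary. The sketch below assumes a lowercase ASCII alphabet and a toy vocabulary; `edits1` and `candidates` are names chosen here for illustration.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word, vocabulary):
    """Vocabulary words within one edit of the given word."""
    return edits1(word) & vocabulary

# Toy vocabulary, assumed for the example.
toy_vocabulary = {"actress", "cress", "caress", "access", "across", "acres", "dog"}
print(sorted(candidates("acress", toy_vocabulary)))
```

Every word in the toy vocabulary except `dog` is a single edit away from `acress`, so all six are returned as candidates.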

### Noisy Channel Intuition

We can model a misspelling as the intended word after it has passed through a noisy channel

Our task is to decode the noisy signal back into the word that was originally intended


### Noisy Channel

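Given the misspelled word $x$, we wish to find the most probable intended word $w$ in the vocabulary $V$

$$\hat w = \underset{w\in V}{\operatorname{argmax}}\; P(w|x)$$

By Bayes' rule

$$\hat w = \underset{w\in V}{\operatorname{argmax}}\; \frac{P(x|w)P(w)}{P(x)}$$

Since $P(x)$ is the same for every candidate $w$, it can be dropped

$$\hat w = \underset{w\in V}{\operatorname{argmax}}\; P(x|w)P(w)$$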
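
In code, decoding amounts to scoring each candidate $w$ by the product of a channel model $P(x|w)$ and a prior $P(w)$, then picking the best. The sketch below uses made-up toy probabilities purely to show the mechanics. The numbers, the dictionary layout, and the function name are all assumptions.

```python
def noisy_channel_correct(x, prior, channel):
    """Return the candidate w that maximises P(x | w) * P(w)."""
    return max(prior, key=lambda w: channel.get((x, w), 0.0) * prior[w])

# Made-up toy values for illustration only, not estimated from any corpus.
toy_prior = {"across": 0.0003, "actress": 0.0001, "acres": 0.0002}
toy_channel = {
    ("acress", "across"): 0.00001,
    ("acress", "actress"): 0.0001,
    ("acress", "acres"): 0.00003,
}

print(noisy_channel_correct("acress", toy_prior, toy_channel))
```

With these toy numbers `actress` wins even though `across` has the higher prior, because the channel model considers a dropped `t` a far more likely error.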

### Edit Distance
* 80% of errors are within edit distance 1
* Almost all errors are within edit distance 2

Allow insertion of space/hyphens
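
A minimal dynamic-programming sketch of edit distance that counts insertion, deletion, substitution and transposition of adjacent characters as one edit each (the optimal string alignment variant). Unit costs for every operation are an assumption; some formulations charge 2 for a substitution.

```python
def edit_distance(s, t):
    """Edit distance with unit-cost insert, delete, substitute and transpose."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and first j chars of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(edit_distance("acress", "caress"))
print(edit_distance("kitten", "sitting"))
```

Because spaces and hyphens are ordinary characters here, a missing space such as `inlaw` versus `in law` is also a single insertion.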

