# Text preprocessing

- Cleaning
    - Non-text
        - Punctuation
        - JavaScript
    - Encoding

- Segmentation

    - Sentence segmentation (sentence boundary recognition): Document -> Sentence

    - Tokenization
        - Character level
        - Subword level: Byte Pair Encoding or WordPiece
        - Word level

- Normalization

    - Spell checking and correction

    - Lowercase
    
    - Standardization

- Stopword removal

- Morphological processing

    - Stemming: strip affixes with heuristic rules to reach a stem, removing derivational morphology: happiness -> happy
    
    - Lemmatization: map a word to its dictionary form (lemma), removing inflectional morphology: better -> good

- Build the vocabulary from the training set.

    - All novel words in the test set are replaced by the [UNK] token.

- Part-of-speech tagging

- Entity recognition
    - Special characters (emoticons and emojis)
    - Named entity
    - URLs, emails
    - Numbers and dates
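
Several of the steps in the outline above (cleaning, lowercasing, word-level tokenization, stopword removal, vocabulary building with `[UNK]`) can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `STOPWORDS` set, the ASCII-only cleaning rule, and the function names `preprocess`, `build_vocab`, and `encode` are all assumptions made for the example.

```python
import re
from collections import Counter

UNK = "[UNK]"
# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "on", "of"}

def preprocess(text):
    """Lowercase, drop non-letter characters, split on whitespace, remove stopwords."""
    text = text.lower()
    # Crude cleaning assumption: keep ASCII letters and whitespace only.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def build_vocab(train_docs, min_count=1):
    """Collect every token seen at least min_count times in the training set."""
    counts = Counter(tok for doc in train_docs for tok in preprocess(doc))
    return {UNK} | {tok for tok, c in counts.items() if c >= min_count}

def encode(text, vocab):
    """Replace every token that is not in the vocabulary with [UNK]."""
    return [tok if tok in vocab else UNK for tok in preprocess(text)]

train = ["The cat sat on the mat.", "A dog sat on the log!"]
vocab = build_vocab(train)
print(encode("The cat sat on a sofa.", vocab))  # ['cat', 'sat', '[UNK]']
```

The word "sofa" never appears in the training set, so it is mapped to `[UNK]`, which is exactly the information loss that subword tokenization tries to avoid.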


## sentence segmentation

Sentence boundary recognition, also known as sentence segmentation, is the task of identifying the boundaries between sentences in a given text. 

It is an essential preprocessing step in many NLP tasks, such as parsing, machine translation, and information extraction. 

Sentence boundary recognition can be treated as a classification problem, where each potential boundary (usually a punctuation mark) is classified as either a sentence boundary or not.

To perform sentence boundary recognition, various features can be used, including:

- Punctuation: Periods `.`, question marks `?`, and exclamation marks `!`.

- Formatting: Line breaks, paragraph breaks.

- Fonts: Changes in font style or size.

- Spacing: Spaces before and after punctuation marks. For example, a period with no space after it (as in `3.50` or `example.com`) is less likely to be a sentence boundary.

- Capitalization: The first word of a sentence is typically capitalized in many languages.

- Case: Beyond the first letter, the case pattern of the words around a candidate boundary (all caps, all lowercase, mixed) can provide further clues about whether a sentence ends there.

- Use of abbreviations: Certain abbreviations, like "Dr." or "a.m.", end in a period that does not necessarily mark the end of a sentence.
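
A few of the features above (punctuation, spacing, capitalization, abbreviations) can be combined into a hand-written boundary classifier. The sketch below is rule-based rather than learned, and the `ABBREVIATIONS` set and the function name `split_sentences` are assumptions for illustration; a real system would use a much larger abbreviation list or a trained model.

```python
import re

# Hypothetical abbreviation list; real systems need far more entries.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "a.m.", "p.m.", "e.g.", "i.e."}

def split_sentences(text):
    """Classify each candidate boundary and split the text at accepted ones."""
    sentences, start = [], 0
    # Candidate boundary: a run of . ? ! that is followed by whitespace (spacing feature).
    for match in re.finditer(r"[.?!]+(?=\s)", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        rest = text[end:].lstrip()
        # Abbreviation feature: "Dr." or "a.m." does not end the sentence here.
        if last_token in ABBREVIATIONS:
            continue
        # Capitalization feature: the next word should start with an uppercase letter.
        if rest and not rest[0].isupper():
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "Dr. Smith arrived at 9 a.m. on Monday. He paid $3.50 for coffee. Was it good? Yes!"
for sentence in split_sentences(text):
    print(sentence)
```

One known limitation of this rule set: an abbreviation that really does end a sentence (for example "... at 9 a.m. He left.") is not split, which is one reason learned classifiers are often preferred.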

 

# subword tokenization

## algorithm

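| Subword tokenization algorithm | BPE | WordPiece |
|------------------------|----------------------------|--------------------------------|
| Vocab initialization | All individual characters in the training corpus | All individual characters in the training corpus (in BERT's variant, word-internal pieces carry a `##` prefix) |
| Token selection | Merge the most frequent adjacent symbol pair | Merge the pair that most increases the likelihood of the training data, commonly described as pair frequency divided by the product of the frequencies of its two parts |
| Tokenization process | Apply the learned merges in the order they were learned | Greedy longest-match-first against the vocabulary |
| Pros | Simple, language-agnostic | Merge selection accounts for how common the parts are, not only the pair |
| Cons | Frequency alone can give suboptimal token selection | A word with any unmatchable piece becomes [UNK] as a whole |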
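
The BPE column of the table can be made concrete with a short sketch: start from characters, repeatedly merge the most frequent adjacent pair, then tokenize new words by replaying the merges in order. The function names (`merge_pair`, `learn_bpe`, `apply_bpe`) and the toy word counts are assumptions for illustration, and ties between equally frequent pairs are broken by first occurrence, which is one choice among several.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Vocabulary initialization: every word is a sequence of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Token selection: most frequent adjacent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(symbols, best): count for symbols, count in corpus.items()}
    return merges

def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus (assumed counts, for illustration only).
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, num_merges=4)
print(merges)                       # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(apply_bpe("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it is segmented into two known subwords instead of being replaced by `[UNK]`.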


## motivation

- The unit of vocabulary in subword tokenization is the subword (parts of words, bytes, characters).

    byte: here this does not refer to the traditional storage unit (1 byte = 8 bits). It refers to the symbols, typically characters or character sequences, that the BPE algorithm operates on.

- Finite vocabulary assumption: a fixed vocabulary of tens of thousands of words is built from the training set, and all novel words in the test set are replaced by the [UNK] token.

- Issues with the finite vocabulary assumption:

    - Many languages have complex morphology that encodes tense, mood, definiteness, negation, information about the object, etc.
    
    - Replacing every novel word form in the test set with the same unknown token, and therefore the same embedding, loses a lot of information.

    - Morphological processing (stemming and lemmatization) can only address part of the issue:
    
        - Loss of information: words reduced to their base or root form can be ambiguous without context, and the morphological features that carry important syntactic or semantic information are lost.

        - Language-dependent: stemming and lemmatization algorithms are typically language-dependent, requiring separate rule sets or resources for each language. This makes them less flexible and scalable than subword encoding techniques like BPE, which can be applied across languages.
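
The difference between the two approaches can be seen on a single unseen word. The sketch below contrasts a word-level lookup with a greedy longest-match-first segmentation in the style of WordPiece; both vocabularies are hand-picked for the example, and the function names `word_level` and `longest_match` are assumptions, not any library's API.

```python
UNK = "[UNK]"

def word_level(tokens, vocab):
    """Finite word vocabulary: any unseen word collapses to [UNK]."""
    return [tok if tok in vocab else UNK for tok in tokens]

def longest_match(word, subword_vocab):
    """Greedy longest-match-first segmentation; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in subword_vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [UNK]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabularies for illustration only.
word_vocab = {"happy", "kind", "kindness"}
subword_vocab = {"un", "happy", "kind", "##happi", "##kind", "##ness", "##ly"}

print(word_level(["unhappiness"], word_vocab))        # ['[UNK]']
print(longest_match("unhappiness", subword_vocab))    # ['un', '##happi', '##ness']
print(longest_match("unkindly", subword_vocab))       # ['un', '##kind', '##ly']
```

The subword output keeps the negation prefix and the suffix as separate, reusable units, so the morphological information that a single `[UNK]` embedding would discard remains visible to the model.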

