# Utils for Implementation

In [None]:
from transformers import AutoTokenizer

In [None]:
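def useTokenizer(tokenizer_name, text, token=None):
    # `token` is an optional Hugging Face access token; it is only needed for
    # gated or private models, so public tokenizers load fine with None.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, token=token)
    return tokenizer.tokenize(text)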

In [None]:
# ANSI escape codes used to color consecutive tokens in terminal or notebook output

colors = [
    '\033[91m',  # Red
    '\033[92m',  # Green
    '\033[93m',  # Yellow
    '\033[94m',  # Blue
    '\033[95m',  # Magenta
    '\033[96m',  # Cyan
    '\033[97m',  # White
]

reset = '\033[0m'

def colorful_print(tokens):
    color_count = len(colors)
    for idx, token in enumerate(tokens):
        color = colors[idx % color_count]  # Cycle
        print(f"{color}{token}{reset}", end=' ')
    print()

# Tokenization in NLP

To deepen your understanding of tokenization in Natural Language Processing (NLP), check out this useful blog post:

🔗 [What is NLP? Natural Language Processing & Tokenization](https://www.ixopay.com/blog/what-is-nlp-natural-language-processing-tokenization)


# Types of Tokenization

## Whitespace Tokenization

**Whitespace tokenization** is one of the simplest methods of breaking text into tokens. It splits the text on runs of whitespace (spaces, tabs, and newlines) and does nothing else.

### ✅ Pros:
- Very quick to implement and requires minimal processing power.
- Works for any language that uses spaces between words without needing special rules.

### ❌ Cons:
- It keeps punctuation attached to words (e.g., `hello!` stays as `hello!` instead of separating `hello` and `!`).
- Struggles with languages like Chinese or Japanese, where words are not separated by spaces.
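
Both limitations are easy to reproduce with Python's built-in `str.split()`. The sample strings below are hypothetical and chosen only to illustrate the point:

```python
# Hypothetical sample strings chosen only to illustrate the two limitations.
english = "Wait, hello!"
chinese = "我喜欢自然语言处理"  # "I like natural language processing"

english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # punctuation stays attached to the words
print(chinese_tokens)  # no spaces, so the whole sentence is one token
```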


In [None]:
def whitespace_tokenize(text):
    tokens = text.split()
    return tokens

text = "This is a simple text.\nLet's tokenize it!"
tokens = whitespace_tokenize(text)
colorful_print(tokens)

[91mThis[0m [92mis[0m [93ma[0m [94msimple[0m [95mtext.[0m [96mLet's[0m [97mtokenize[0m [91mit![0m 


## Word Tokenization

**Word tokenization** involves splitting text into individual words, often by using spaces and punctuation as boundaries.

### ✅ Pros:
- It separates punctuation from words (e.g., `hello!` becomes `hello` and `!`).
- Produces cleaner input, which often helps downstream models.

### ❌ Cons:
- Some languages require specialized tokenizers (e.g., German compound words, Chinese without spaces).
- Struggles with contractions (e.g., "don't" could be treated differently depending on the tokenizer settings).
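
How contractions are handled is a design choice. As a minimal sketch, a regex-based word tokenizer can keep an apostrophe that sits between word characters inside the token, so that `isn't` stays whole. The function name `word_tokenize_keep_contractions` is hypothetical, and this is only one of several reasonable conventions:

```python
import re

def word_tokenize_keep_contractions(text):
    # \w+(?:'\w+)* : a word, optionally followed by apostrophe + word parts
    # [^\w\s]      : any single character that is neither word nor whitespace
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sample = "Hello world! This is a text: isn't it great?"
contraction_tokens = word_tokenize_keep_contractions(sample)
print(contraction_tokens)
```

Other tokenizers split contractions differently (for example into `is` and `n't`), so the right choice depends on what the downstream model expects.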

In [None]:
import re

def word_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return tokens

text = "Hello world! This is a text: isn't it great?"
tokens = word_tokenize(text)
colorful_print(tokens)

[91mHello[0m [92mworld[0m [93m![0m [94mThis[0m [95mis[0m [96ma[0m [97mtext[0m [91m:[0m [92misn[0m [93m'[0m [94mt[0m [95mit[0m [96mgreat[0m [97m?[0m 


## Sentence Tokenization

**Sentence tokenization** involves splitting a text into individual sentences, usually by identifying punctuation marks like periods (`.`), exclamation points (`!`), and question marks (`?`).

### ✅ Pros:
- Useful for tasks like summarization, translation, and question answering, where sentence structure matters.
- Helps models understand the flow and relationship between different parts of a text.

### ❌ Cons:
- Periods can appear in abbreviations (e.g., "Dr.", "e.g.") and may confuse the tokenizer.
- Different languages and writing styles (like missing punctuation) make accurate sentence splitting harder.
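
The abbreviation problem can be shown with a small sketch. It compares a naive split on sentence-ending punctuation with a version that re-joins chunks ending in a known abbreviation. The `ABBREVIATIONS` set and both function names are hypothetical and far from complete:

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercased).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}

def naive_sentences(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_sentences(text):
    sentences, current = [], []
    for chunk in naive_sentences(text):
        if not chunk:
            continue
        current.append(chunk)
        # Only close the sentence if the chunk does not end in an abbreviation.
        if chunk.split()[-1].lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

sample = "Dr. Smith arrived. He was late."
print(naive_sentences(sample))
print(abbreviation_aware_sentences(sample))
```

The naive version breaks after `Dr.`, while the second version keeps `Dr. Smith arrived.` together. A sentence that genuinely ends with an abbreviation would still be merged with the next one, which is why production tokenizers rely on trained models or larger rule sets.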


In [None]:
import re

def sentence_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?', except after
    # abbreviations such as "e.g." or short titles such as "Dr."
    sentence_endings = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.!?])\s+')
    sentences = sentence_endings.split(text)
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

text = "Hello world! This is a sentence tokenization text. How's it going? Ali is here."
sentences = sentence_tokenize(text)
colorful_print(sentences)

[91mHello world![0m [92mThis is a sentence tokenization text.[0m [93mHow's it going?[0m [94mAli is here.[0m 


## Character Tokenization

**Character tokenization** breaks down text into individual characters. It is useful for languages without clear word boundaries or when modeling at a very granular level.

### ✅ Pros:
- Works well with languages like Chinese, Japanese, and others that don't use spaces to separate words.
- Allows for detailed feature extraction, useful in tasks like character-level language modeling or OCR.

### ❌ Cons:
- It ignores word-level semantics, making it harder for the model to understand high-level meaning.
- Results in longer sequences, which can increase computation time and memory usage in models.
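
The sequence-length cost is easy to measure. The sketch below builds a character vocabulary, encodes a string as integer ids, and compares the token counts. The helper names `build_char_vocab` and `encode_chars` are hypothetical:

```python
def build_char_vocab(text):
    # Map every distinct character to an integer id (sorted for determinism).
    return {ch: idx for idx, ch in enumerate(sorted(set(text)))}

def encode_chars(text, vocab):
    return [vocab[ch] for ch in text]

sample = "Hello, world! It is Ali"
char_vocab = build_char_vocab(sample)
char_ids = encode_chars(sample, char_vocab)

print(len(sample.split()), "whitespace tokens")
print(len(char_ids), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The vocabulary stays tiny, but the same sentence becomes several times longer than its whitespace-tokenized form.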


In [None]:
def character_tokenize(text):
    return list(text)

text = "Hello, world! It is Ali"
tokens = character_tokenize(text)
colorful_print(tokens)

[91mH[0m [92me[0m [93ml[0m [94ml[0m [95mo[0m [96m,[0m [97m [0m [91mw[0m [92mo[0m [93mr[0m [94ml[0m [95md[0m [96m![0m [97m [0m [91mI[0m [92mt[0m [93m [0m [94mi[0m [95ms[0m [96m [0m [97mA[0m [91ml[0m [92mi[0m 


## Subword Tokenization

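**Subword tokenization** breaks words into smaller units that are learned from a corpus. These units sometimes line up with prefixes, suffixes, or roots, but they are chosen statistically rather than linguistically. It is often used to handle rare words and improve the model's ability to generalize.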

### ✅ Pros:
- Handles unknown words by breaking them into known subword units instead of mapping them to a single unknown token.
- Captures part of the semantic meaning while producing shorter sequences than character tokenization.

### ❌ Cons:
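- Requires more sophisticated algorithms, since the vocabulary has to be learned from a corpus before any text can be tokenized.
- For texts made up mostly of frequent, common words, subword tokenization may add complexity with little benefit.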


### Byte-Pair Encoding (BPE)

**Byte-Pair Encoding (BPE)** is a subword tokenization method that starts from individual characters and iteratively merges the most frequent pair of adjacent characters or subwords in a corpus to form new subword units. This method is often used in machine translation and other NLP tasks.
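
The merge loop can be sketched on a toy corpus. This is a simplified, character-level illustration, not the byte-level variant used by GPT-2. The word counts in `toy_corpus` and the function names are hypothetical:

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    # Start with every word split into single characters.
    word_freqs = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Hypothetical word counts, used only for illustration.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Each merge adds one entry to the vocabulary, so the number of merges directly controls the vocabulary size.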

### ✅ Pros:
- It helps manage out-of-vocabulary words by breaking them into smaller, known subword units.
- By using subwords, BPE reduces the vocabulary size while still capturing meaningful information from the text.

### ❌ Cons:
- The merging process can be computationally expensive and requires a large corpus to be effective.
- While it handles rare words, it might lose finer semantic details, especially for languages with complex morphology.


In [None]:
text = "Hello world! This is an example of BPE."
# text ="""
# English and CAPITALIZATION
# 🎵⻦
# show_tokens False None elif == >= else: two tabs:" " Three tabs: " "
# 12.0*50=600
# """
tokens = useTokenizer("openai-community/gpt2", text)
colorful_print(tokens) # The Ġ character marks a token that begins with a space in GPT-2's byte-level BPE.

[91mHello[0m [92mĠworld[0m [93m![0m [94mĠThis[0m [95mĠis[0m [96mĠan[0m [97mĠexample[0m [91mĠof[0m [92mĠB[0m [93mPE[0m [94m.[0m 


### WordPiece

**WordPiece** is a subword tokenization method used in models like BERT. It builds a vocabulary of subword units based on frequency and likelihood, with the goal of efficiently handling rare words and improving model generalization.
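
Once a vocabulary exists, a word is commonly segmented with a greedy longest-match-first search, where pieces that continue a word carry the `##` prefix. The sketch below shows only this segmentation step, not how the vocabulary is trained. The `toy_vocab` set and the function name are hypothetical:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, used only for illustration.
toy_vocab = {"token", "##ization", "##ize", "word", "##piece"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("wordpiece", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```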

### ✅ Pros:
- Widely used in state-of-the-art models like BERT, making it highly optimized for language understanding tasks.
- It helps the model generalize by learning both subword and word-level representations, providing better handling of rare or unknown words.

### ❌ Cons:
- WordPiece requires a large corpus to effectively capture the most frequent subwords, making it less effective for small datasets.
- The algorithm for building subword units can be complex and computationally intensive.


In [None]:
text = "Hello world! This is an example of WordPiece tokenization."
tokens = useTokenizer("bert-base-uncased", text)
# NOTE: use bert-base-cased when capitalization matters for the task, e.g. NER (named entity recognition)
#       NER example: Ali from Hermel ==> entities: Ali (person) and Hermel (location)
colorful_print(tokens)

[91mhello[0m [92mworld[0m [93m![0m [94mthis[0m [95mis[0m [96man[0m [97mexample[0m [91mof[0m [92mword[0m [93m##piece[0m [94mtoken[0m [95m##ization[0m [96m.[0m 


### Unigram Language Model

The **Unigram Language Model** is used by **SentencePiece** for subword tokenization. It selects subword units based on a probabilistic model, aiming to maximize the likelihood of the corpus under the chosen subword vocabulary.
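
Given piece probabilities, the most likely segmentation of a string can be found with dynamic programming (a Viterbi search). The sketch below covers only this decoding step, not the training procedure that estimates the probabilities. The values in `toy_probs` are made up for illustration:

```python
import math

def unigram_segment(text, log_probs):
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability up to position i
    best_start = [0] * (n + 1)          # start index of the last piece ending at i
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return None  # the text cannot be built from the vocabulary
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, used only for illustration.
toy_probs = {"token": 0.2, "ization": 0.1, "tok": 0.05,
             "en": 0.1, "iz": 0.02, "ation": 0.08}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(unigram_segment("tokenization", toy_log_probs))
```

Here `token` + `ization` wins because its joint probability is higher than that of any other split, such as `tok` + `en` + `iz` + `ation`.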

### ✅ Pros:
- Uses a probabilistic model to select the most likely subwords, which helps capture the structure of language more effectively.
- Allows control over the vocabulary size, which can be fine-tuned based on the task and available data.

### ❌ Cons:
- The model can sometimes generate subwords that are not linguistically meaningful, especially when the vocabulary size is too small.
- Like other tokenization methods, its effectiveness depends on the quality and quantity of training data.


In [None]:
text = "Hello world! This is an example of SentencePiece tokenization."
# text ="""
# English and CAPITALIZATION
# 🎵⻦
# show_tokens False None elif == >= else: two tabs:" " Three tabs: " "
# 12.0*50=600
# """
tokens = useTokenizer("t5-small", text)
colorful_print(tokens) # The ▁ character is used to indicate the start of a new word in SentencePiece tokenization.

[91m▁Hello[0m [92m▁world[0m [93m![0m [94m▁This[0m [95m▁is[0m [96m▁an[0m [97m▁example[0m [91m▁of[0m [92m▁Sen[0m [93mt[0m [94mence[0m [95mP[0m [96mi[0m [97me[0m [91mce[0m [92m▁token[0m [93mization[0m [94m.[0m 


## Morpheme-Based Tokenization

**Morpheme-Based Tokenization** segments text into morphemes, the smallest meaningful units in a language. This method is especially useful for morphologically rich languages where words can have many affixes or complex structures.
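
As a rough illustration of the idea, a toy segmenter can strip one known prefix and one known suffix from a word. The affix lists and the function name below are hypothetical, and real morphological analyzers rely on lexicons or trained models instead of string matching:

```python
# Hypothetical, deliberately tiny affix lists for English.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ing", "ed", "ly")

def toy_morpheme_split(word):
    morphemes = []
    # Strip at most one prefix, keeping at least 3 characters of stem.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            morphemes.append(prefix)
            word = word[len(prefix):]
            break
    # Strip at most one suffix under the same length constraint.
    suffix_found = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            suffix_found = suffix
            word = word[:-len(suffix)]
            break
    morphemes.append(word)
    if suffix_found:
        morphemes.append(suffix_found)
    return morphemes

print(toy_morpheme_split("unhappiness"))
print(toy_morpheme_split("kindly"))
print(toy_morpheme_split("under"))
```

The last call wrongly splits `under` into `un` + `der`, which shows why pure string matching is not enough for reliable morpheme segmentation.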

### ✅ Pros:
- It breaks down words into meaningful units, allowing models to better understand complex word structures.
- Works well for languages like Finnish, Turkish, and **Arabic**, which have extensive word formation rules.

### ❌ Cons:
- Identifying morphemes requires sophisticated algorithms and linguistic resources, making it more challenging than simpler tokenization methods.
- For languages with simpler morphology (like English), this method might not provide a significant advantage over word tokenization.


In [None]:
!pip install spacy



In [None]:
!python -m spacy download en_core_web_sm


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm") #english

text = "unhappiness"
doc = nlp(text)

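# NOTE: spaCy does not segment words into morphemes. `prefix_` and `suffix_`
# are fixed-length character slices (first character and last three characters
# by default), not linguistic affixes, so "unhappiness" is not split here.
for token in doc: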
    print(f"Text: {token.text}")
    print(f"Lemma: {token.lemma_}")
    print(f"Prefix: {token.prefix_}")
    print(f"Suffix: {token.suffix_}")
    print(f"POS: {token.pos_}")
    print(f"Tag: {token.tag_}")
    print(f"Dep: {token.dep_}")
    print("-" * 20)

Text: unhappiness
Lemma: unhappiness
Prefix: u
Suffix: ess
POS: NOUN
Tag: NN
Dep: ROOT
--------------------
