# What is Tokenization?

Tokenizers are among the most important tools in NLP. They break text down into smaller units called tokens, which can be words, characters, or subwords, so that complex sentences become tractable for computers. In short, tokenizers bridge the gap between human language and machine understanding.

# Setup

This lab uses the following libraries:

* [`nltk`](https://www.nltk.org/), the Natural Language Toolkit, is used for data management tasks. It offers comprehensive tools and resources for natural language processing, making it a valuable choice for tasks such as text preprocessing and analysis.

* [`spaCy`](https://spacy.io/) is an open-source library for advanced natural language processing in Python. `spaCy` is renowned for its speed and accuracy in processing large volumes of text data.

* [`BertTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#berttokenizer) is part of the Hugging Face Transformers library, a popular library for working with state-of-the-art pre-trained language models. `BertTokenizer` is specifically designed for tokenizing text according to the BERT model's specifications.

* [`XLNetTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#xlnettokenizer) is another component of the Hugging Face Transformers library. It is tailored for tokenizing text in alignment with the XLNet model's requirements.

* [`torchtext`](https://pytorch.org/text/stable/index.html) is part of the PyTorch ecosystem and handles various natural language processing tasks. It simplifies working with text data and provides functionality for data preprocessing, tokenization, vocabulary management, and batching.


# Installing Required Libraries

In [1]:
!pip install nltk
!pip install transformers==4.42.1
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install scikit-learn
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install numpy==1.26.0



# Importing Required Libraries

In [2]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from transformers import BertTokenizer
from transformers import XLNetTokenizer


from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

import warnings

# Silence library warnings so they do not clutter the notebook output.
def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Why is tokenization important?

Tokenization segments text into smaller units called tokens. These tokens are then converted into `numerical representations` called token indices, which deep learning algorithms consume directly.

## Types of tokenizer

Tokenization methods fall into three main categories:

* Word-based
* Character-based
* Subword-based

## Word-based Tokenizer

### nltk

As the name suggests, word-based tokenization splits text into words.

In [13]:
text = 'This is a sample sentence for word tokenization'
tokens = word_tokenize(text)
tokens

['This', 'is', 'a', 'sample', 'sentence', 'for', 'word', 'tokenization']

General-purpose libraries such as nltk and spaCy often split contractions like "don't" and "couldn't" into separate tokens. There is no universal rule: each library applies its own rules for word-based tokenization. The general guideline, however, is to tokenize input in the same way as the text the model was trained on.


In [14]:
# This showcases word_tokenize from nltk library

text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are"

tokens = word_tokenize(text)
tokens

['I',
 'could',
 "n't",
 'help',
 'the',
 'dog',
 '.',
 'Ca',
 "n't",
 'you',
 'do',
 'it',
 '?',
 'Do',
 "n't",
 'be',
 'afraid',
 'if',
 'you',
 'are']

In [16]:
text = "Unicorns are real. I saw a unicorn yesterday"
token = word_tokenize(text)
token

['Unicorns', 'are', 'real', '.', 'I', 'saw', 'a', 'unicorn', 'yesterday']

The problem with this approach is that closely related words are assigned different IDs and are therefore treated as entirely separate words. For example, "Unicorns" is the plural form of "unicorn", but a word-based tokenizer treats them as two unrelated tokens, potentially causing the model to miss their semantic relationship.

Because every distinct word becomes its own token, the model's overall vocabulary grows significantly. Each token is mapped to a large vector that encodes the word's meaning, so a large vocabulary results in a large number of model parameters.
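To make the vocabulary issue concrete, here is a minimal sketch in plain Python. The `build_word_vocab` helper is hypothetical, and it uses simple whitespace splitting instead of `word_tokenize` so that it runs without any downloads.

```python
def build_word_vocab(sentences):
    """Assign an integer ID to each distinct whitespace-separated word."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next unused ID
    return vocab

word_vocab = build_word_vocab(["Unicorns are real", "I saw a unicorn yesterday"])

# The plural and singular forms receive unrelated IDs.
print(word_vocab["Unicorns"], word_vocab["unicorn"])
print(len(word_vocab))  # every distinct surface form adds one entry
```

Nothing in the IDs tells the model that the two forms are related, and every new surface form adds another entry to the vocabulary.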


## Character-based Tokenizer

As the name suggests, character-based tokenization splits text into individual characters. The advantage of this approach is that the resulting vocabularies are inherently small. Furthermore, because a language has a limited set of characters, out-of-vocabulary tokens are rare.

For example:

Input text: `This is a sample sentence for tokenization.`

Character-based tokenization output (spaces omitted): `['T', 'h', 'i', 's', 'i', 's', 'a', 's', 'a', 'm', 'p', 'l', 'e', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 'f', 'o', 'r', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '.']`

However, character-based tokenization has its limitations. Single characters do not convey as much information as entire words, and the number of tokens per input increases significantly, which can cause issues with model size and degrade performance.
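The trade-off can be checked with a few lines of plain Python. This is only an illustrative sketch: spaces are dropped to mirror the example above, and words are obtained with a simple `split()`.

```python
text = "This is a sample sentence for tokenization."

# Character-level tokens; whitespace is dropped, as in the example above.
char_tokens = [ch for ch in text if not ch.isspace()]
# Word-level tokens via whitespace splitting, for comparison.
word_tokens = text.split()

print(len(set(char_tokens)))               # number of distinct characters (vocabulary size)
print(len(char_tokens), len(word_tokens))  # sequence length at each granularity
```

The character vocabulary stays small, but the same sentence becomes a much longer sequence than its word-level counterpart.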

## Subword-based Tokenizer

The subword-based tokenizer allows frequently used words to remain unsplit while breaking down infrequent words into meaningful subwords. Techniques such as `WordPiece` and `SentencePiece` are commonly used for subword tokenization. These methods learn subword units from a given text corpus, identifying common prefixes, suffixes, and root words as subword tokens based on their frequency of occurrence. This approach offers the advantage of representing a broader range of words and adapting to the specific language patterns within a text corpus.


1. `Unhappiness` -> 'Un' & 'Happiness'
2. `Unicorns` -> 'Unicorn' & 's'

In both examples above, words are split into subwords, which helps preserve the semantic information associated with the overall word. For instance, 'Unhappiness' is split into 'un' and 'happiness', both of which can appear as stand-alone subwords. Combined, these subwords form 'unhappiness', which retains its meaningful context. This approach helps maintain the overall information and semantic meaning of words.
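The two splits above can be reproduced with a toy greedy longest-match segmenter. This is a sketch under stated assumptions: the `greedy_subword_split` helper and the four-entry vocabulary are hypothetical, and real subword tokenizers learn their vocabulary from a corpus.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matches, so the word cannot be segmented
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

toy_subwords = {"un", "happiness", "unicorn", "s"}  # hypothetical vocabulary

print(greedy_subword_split("unhappiness", toy_subwords))  # ['un', 'happiness']
print(greedy_subword_split("unicorns", toy_subwords))     # ['unicorn', 's']
```

With only four vocabulary entries, both words are covered, and "unicorns" shares the piece `unicorn` with its singular form.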


## WordPiece

WordPiece initializes its vocabulary with every `character present` in the training data and then progressively learns a `specified number of merge rules`.
In other words, it starts from character-level tokens and progressively merges them into larger subword units and whole words.
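A single merge step might look like the sketch below. The scoring rule used here, pair frequency divided by the product of the two piece frequencies, is the one commonly described for WordPiece; it is an assumption rather than something this lab states. The `best_wordpiece_merge` helper and the toy word counts are hypothetical.

```python
from collections import Counter

def best_wordpiece_merge(word_freqs):
    """Return the adjacent pair with the highest freq(pair) / (freq(a) * freq(b))."""
    piece_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        # Start from characters; non-initial pieces carry the "##" prefix.
        pieces = [word[0]] + ["##" + ch for ch in word[1:]]
        for piece in pieces:
            piece_freq[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_freq[(a, b)] += freq
    return max(
        pair_freq,
        key=lambda p: pair_freq[p] / (piece_freq[p[0]] * piece_freq[p[1]]),
    )

toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}  # word -> count
print(best_wordpiece_merge(toy_corpus))  # ('##g', '##s')
```

Under this rule the pair `('##g', '##s')` wins even though other pairs occur more often, because its parts are comparatively rare on their own. Repeating the step on the merged corpus would yield the next merge rule.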

The WordPiece tokenizer is implemented in `BertTokenizer`. Note that `BertTokenizer` treats composite words as separate words.

In [19]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("IBM taught me tokenization.")


['ibm', 'taught', 'me', 'token', '##ization', '.']

Here's a breakdown of the output:
- 'ibm': "IBM" is tokenized as 'ibm'. The "bert-base-uncased" model lowercases its input, so case information is not retained.
- 'taught', 'me', '.': These tokens match the original words and punctuation; the words are simply lowercased.
- 'token', '##ization': "Tokenization" is broken into two tokens. "Token" is a whole word, and "##ization" is a part of the original word. The "##" indicates that "ization" should be connected back to "token" when detokenizing (transforming tokens back to words).
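The role of the "##" marker during detokenization can be sketched in a few lines. The `merge_wordpieces` helper is hypothetical and only illustrates the idea; it is not how the Transformers library implements decoding.

```python
def merge_wordpieces(tokens):
    """Rebuild words by attaching '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the marker and glue onto the previous piece
        else:
            words.append(token)
    return words

print(merge_wordpieces(["ibm", "taught", "me", "token", "##ization", "."]))
# ['ibm', 'taught', 'me', 'tokenization', '.']
```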


## Unigram and SentencePiece

Unigram is a method for breaking words or text into smaller pieces. It accomplishes this by starting with a large list of possibilities and gradually narrowing it down based on how frequently those pieces appear in the text. This approach aids in efficient text tokenization.

SentencePiece is a tool that takes text, divides it into smaller, more manageable parts, assigns IDs to these segments, and ensures that it does so consistently. Consequently, if you use SentencePiece on the same text repeatedly, you will consistently obtain the same subwords and IDs.

Unigram and SentencePiece work together by implementing Unigram's subword tokenization method within the SentencePiece framework. SentencePiece handles subword segmentation and ID assignment, while Unigram's principles guide the vocabulary reduction process to create a more efficient representation of the text data. This combination is particularly valuable for various NLP tasks in which subword tokenization can enhance the performance of language models.


`Unigram` helps you figure out which subwords should be part of the model's vocabulary by looking at frequency.

`SentencePiece` is the tool that applies Unigram's method to break text into those subwords, ensuring consistency and assigning IDs to them.
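Once a Unigram vocabulary is fixed, each subword carries a probability, and a word is split into the sequence of subwords with the highest combined probability. The sketch below illustrates that search with a small dynamic program. The `unigram_segment` helper is hypothetical and the probabilities are made up for illustration; a real model estimates them from a corpus.

```python
import math

def unigram_segment(word, piece_probs):
    """Viterbi search for the segmentation with the highest total log-probability."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best log-probability of word[:i]
    best_start = [0] * (n + 1)            # start index of the last piece in word[:i]
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(piece_probs[piece])
                if score > best_score[end]:
                    best_score[end], best_start[end] = score, start
    if best_score[n] == -math.inf:
        return None  # no segmentation exists with this vocabulary
    # Walk backwards through the stored split points to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[best_start[end]:end])
        end = best_start[end]
    return pieces[::-1]

# Hypothetical subword probabilities.
toy_probs = {"token": 0.2, "ization": 0.1, "tok": 0.05, "en": 0.05, "iz": 0.02, "ation": 0.05}
print(unigram_segment("tokenization", toy_probs))  # ['token', 'ization']
```

Several segmentations are possible with this vocabulary, for example `tok` + `en` + `ization`, but the two-piece split has the highest probability and is the one returned.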

In [20]:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
tokenizer.tokenize("IBM taught me tokenization.")

['▁IBM', '▁taught', '▁me', '▁token', 'ization', '.']

Here's what's happening with each token:
- '▁IBM': The "▁" (often referred to as "whitespace character") before "IBM" indicates that this token is preceded by a space in the original text. "IBM" is kept as is because it's recognized as a whole token by XLNet and it preserves the casing because you are using the "xlnet-base-cased" model.
- '▁taught', '▁me', '▁token': Similarly, these tokens are prefixed with "▁" to indicate they are new words preceded by a space in the original text, preserving the word as a whole and maintaining the original casing.
- 'ization': Unlike "BertTokenizer," "XLNetTokenizer" does not use "##" to indicate subword tokens. "ization" appears as its own token without a prefix because it directly follows the preceding word "token" without a space in the original text.
- '.': The period is tokenized as a separate token since punctuation is treated separately.
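Because the "▁" marker records where spaces were, the original text can be recovered from the tokens. The helper below is a hypothetical sketch of that idea, not the library's own decoding routine.

```python
def sentencepiece_detokenize(tokens):
    """Join pieces and turn the '▁' word-boundary marker back into spaces."""
    return "".join(tokens).replace("▁", " ").strip()

print(sentencepiece_detokenize(["▁IBM", "▁taught", "▁me", "▁token", "ization", "."]))
# IBM taught me tokenization.
```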


## Tokenization with PyTorch

In PyTorch, and in particular with the torchtext library, the tokenizer breaks text from a dataset into individual words or subwords, facilitating their conversion into numerical form. After tokenization, the vocab maps these tokens to unique integers, allowing them to be fed into neural networks. This process is vital because deep learning models operate on numerical data and can't process raw text directly.

In [3]:
dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1," NLP Named Entity,Sentiment Analysis,Machine Translation "),
    (1," Machine Translation with NLP "),
    (1," Named Entity vs Sentiment Analysis  NLP ")
]

In [4]:
from torchtext.data.utils import get_tokenizer

In [5]:
tokenizer = get_tokenizer("basic_english")

Next, apply the tokenizer to the dataset. Note: if `basic_english` is selected, `get_tokenizer` returns the `_basic_english_normalize()` function, which first normalizes the string and then splits it on spaces.
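To get a feel for what this normalization does, the sketch below mimics it in plain Python. It is a rough approximation written for illustration, not torchtext's actual implementation, which applies its own set of replacement rules. The `approx_basic_english` name is hypothetical.

```python
import re

def approx_basic_english(text):
    """Rough stand-in for basic_english: lowercase, isolate punctuation, split on spaces."""
    text = text.lower()
    text = re.sub(r"([.,!?()])", r" \1 ", text)  # pad common punctuation with spaces
    return text.split()

print(approx_basic_english(" NLP Named Entity,Sentiment Analysis,Machine Translation "))
# ['nlp', 'named', 'entity', ',', 'sentiment', 'analysis', ',', 'machine', 'translation']
```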

In [6]:
tokenizer(dataset[0][1])

['introduction', 'to', 'nlp']

## Token Indices

The integers that `build_vocab_from_iterator` assigns to tokens are typically referred to as `token indices` or simply `indices`. These indices are the numeric representations of the tokens in the vocabulary.

The **```build_vocab_from_iterator```** function takes an iterator that yields lists of tokens and assigns a unique index to each token. Any special tokens come first, followed by the remaining tokens in order of decreasing frequency. These indices serve as a way to represent the tokens in a numerical format that can be easily processed by machine learning models.

For example, given a vocabulary with tokens ["apple", "banana", "orange"], the corresponding indices might be [0, 1, 2], where "apple" is represented by index 0, "banana" by index 1, and "orange" by index 2.
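The mapping can be imitated in plain Python. The sketch below is an approximation that assumes special tokens come first and the remaining tokens are ordered by descending frequency. The `build_toy_vocab` helper is hypothetical and is not part of torchtext.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort alphabetically first so that ties in frequency are broken consistently.
    ordered = sorted(sorted(counts), key=lambda tok: counts[tok], reverse=True)
    return {token: index for index, token in enumerate(list(specials) + ordered)}

fruit_vocab = build_toy_vocab([["apple", "banana"], ["apple", "orange"], ["apple", "banana"]])
print(fruit_vocab)  # {'<unk>': 0, 'apple': 1, 'banana': 2, 'orange': 3}
```

With a special token included, the ordinary tokens are shifted by one position compared with the example above.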

`dataset` is an iterable, so you use a generator function, `yield_tokens`, to apply the `tokenizer` to it. The purpose of `yield_tokens` is to yield tokenized texts `one at a time`. Instead of processing the entire dataset and returning all the tokenized texts in one go, the generator function processes and yields each tokenized text individually as it is requested. Tokenization is therefore performed lazily: the next tokenized text is generated only when needed, which saves memory and computation.

In [25]:
def yield_tokens(data_iter):
    for _,text in data_iter:
        yield tokenizer(text)

In [8]:
my_iterator = yield_tokens(dataset)

This creates an iterator called **```my_iterator```** using the generator. To begin the evaluation of the generator and retrieve the values, you can iterate over **```my_iterator```** using a for loop or retrieve values from it using the **```next()```** function.
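The lazy behavior is easy to observe with a small stand-alone generator. In this sketch, `lazy_tokens` is a hypothetical helper that uses `lower().split()` in place of the torchtext tokenizer and prints a message whenever it actually does work.

```python
def lazy_tokens(texts):
    """Yield one tokenized text at a time, logging when work actually happens."""
    for text in texts:
        print("tokenizing:", text)
        yield text.lower().split()

# Creating the generator runs none of the function body, so nothing is printed yet.
token_stream = lazy_tokens(["Introduction to NLP", "Basics of PyTorch"])

first = next(token_stream)  # only now is the first text tokenized
print(first)
```

The second text is not touched until `next()` is called again or the generator is consumed in a loop.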


In [9]:
next(my_iterator)

['introduction', 'to', 'nlp']

In [76]:
len(dataset)

9

In [10]:
for tokens in my_iterator:
    print(tokens)

['basics', 'of', 'pytorch']
['nlp', 'techniques', 'for', 'text', 'classification']
['named', 'entity', 'recognition', 'with', 'pytorch']
['sentiment', 'analysis', 'using', 'pytorch']
['machine', 'translation', 'with', 'pytorch']
['nlp', 'named', 'entity', ',', 'sentiment', 'analysis', ',', 'machine', 'translation']
['machine', 'translation', 'with', 'nlp']
['named', 'entity', 'vs', 'sentiment', 'analysis', 'nlp']


In [11]:
type(my_iterator)

generator

Next, you build a `vocabulary` from the tokenized texts produced by the `yield_tokens` generator function, which represents the dataset. The `build_vocab_from_iterator()` function constructs the vocabulary, including a special token `<unk>` to represent out-of-vocabulary words.

## Out-of-Vocabulary (OOV)

When text data is tokenized, some words may be missing from the vocabulary because they are rare or were never seen while the vocabulary was being built. When the model encounters such OOV words during language processing tasks like text generation or language modeling, it can use the `<unk>` token to represent them.


For example, if the word "apple" is present in the vocabulary, but "pineapple" is not, "apple" will be used normally in the text, but "pineapple" (being an OOV word) would be replaced by the ```<unk>``` token.
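The fallback can be sketched with an ordinary dictionary. The vocabulary and the `tokens_to_indices` helper below are hypothetical and only illustrate the lookup rule.

```python
oov_vocab = {"<unk>": 0, "i": 1, "like": 2, "apple": 3}  # hypothetical vocabulary

def tokens_to_indices(tokens, vocab, unk_token="<unk>"):
    """Look up each token, falling back to the <unk> index for unknown tokens."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

print(tokens_to_indices(["i", "like", "apple"], oov_vocab))      # [1, 2, 3]
print(tokens_to_indices(["i", "like", "pineapple"], oov_vocab))  # [1, 2, 0]
```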

In [34]:
a_new_iterator = yield_tokens(dataset)

In [35]:
next(a_new_iterator)

['introduction', 'to', 'nlp']

In [20]:
vocab = build_vocab_from_iterator(yield_tokens(dataset),specials = ['<unk>'])
vocab.set_default_index(vocab["<unk>"])

The `get_tokenized_sentences_and_indices` function, defined after the pipeline summary below, fetches a tokenized sentence from an iterator and converts its tokens into indices using the vocabulary. The cell then prints both the tokenized sentence and its corresponding indices.

## Pipeline
1. Get the dataset.
2. Import the tokenizer factory: `from torchtext.data.utils import get_tokenizer`.
3. Create the tokenizer: `tokenizer = get_tokenizer('basic_english')`, where `'basic_english'` is the name of the tokenizer.
4. Tokenize a single example: `tokens = tokenizer(dataset[idx][1])`. The text is at position 1 because each dataset entry is a tuple whose second element is the text.
5. Create a generator function that iterates over the dataset and yields each text in tokenized form:

        def yield_tokens(data_iter):
            for _, text in data_iter:
                yield tokenizer(text)

6. To create the vocabulary, pass the generator to the `build_vocab_from_iterator` function.


In [36]:
def get_tokenized_sentences_and_indices(data_iter):
    tokenized_sentence = next(data_iter)  # Grab a single example
    token_indices = [vocab[token] for token in tokenized_sentence]  # Look up each token's index
    return tokenized_sentence, token_indices

tokenized_sentence, token_indices = get_tokenized_sentences_and_indices(a_new_iterator)
next(a_new_iterator)  # Advance the iterator once more; this sentence is discarded

print("Tokenized Sentence: ", tokenized_sentence)
print("Token Indices: ", token_indices)

Tokenized Sentence:  ['basics', 'of', 'pytorch']
Token Indices:  [11, 15, 2]


In [37]:
# 'tinon' is not in the vocabulary, so the default <unk> index is returned
vocab['tinon']

0