### Installing required libraries

In [1]:
!pip install nltk
!pip install transformers
!pip install sentencepiece
!pip install spacy
!pip install numpy==1.24
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install numpy scikit-learn
!pip install torch==2.0.1
!pip install torchtext==0.15.2


### Importing required libraries

In [2]:
import nltk
nltk.download("punkt")
nltk.download('punkt_tab')
import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from transformers import BertTokenizer
from transformers import XLNetTokenizer

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /home/jupyterlab/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/jupyterlab/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


### Types of tokenizer
There are three main types of tokenizer: word-based, character-based, and subword-based.

### Word-based tokenizer
#### nltk

In [9]:
# This showcases word_tokenize from nltk library

text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are."
tokens = word_tokenize(text)
print("text word_tokenize:",tokens)

text1 = "This is a sample sentence for word tokenization."
tokens1 = word_tokenize(text1)
print("text1 word_tokenize:",tokens1)

text word_tokenize: ['I', 'could', "n't", 'help', 'the', 'dog', '.', 'Ca', "n't", 'you', 'do', 'it', '?', 'Do', "n't", 'be', 'afraid', 'if', 'you', 'are', '.']
text1 word_tokenize: ['This', 'is', 'a', 'sample', 'sentence', 'for', 'word', 'tokenization', '.']
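
For contrast, here is a minimal sketch of a naive rule-based splitter that uses only Python's `re` module. The helper name `simple_word_tokenize` is hypothetical and is not part of nltk. It illustrates why contractions need special handling: the naive rule breaks "couldn't" at the apostrophe, whereas `word_tokenize` in the output above keeps "n't" together as one token.

```python
import re

def simple_word_tokenize(text):
    # Match either a run of word characters or a single character
    # that is neither a word character nor whitespace (punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

sample = "I couldn't help the dog."
simple_tokens = simple_word_tokenize(sample)
print("naive regex tokens:", simple_tokens)
```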


### token.pos_:
This feature indicates the part of speech (POS) of each token. Parts of speech refer to the grammatical categories of words, such as noun, verb, adjective, pronoun, and so on. Examples include:

PRON: Pronoun (e.g., "I", "you", "he")

VERB: Verb (e.g., "run", "help", "eat")

AUX: Auxiliary verb (e.g., "is", "could", "do")

NOUN: Noun (e.g., "dog", "cat", "car")

ADJ: Adjective (e.g., "big", "happy")

ADV: Adverb (e.g., "quickly", "always")

DET: Determiner (e.g., "the", "a")

PUNCT: Punctuation (e.g., period, question mark)

### token.dep_:
This feature shows the syntactic dependency of the token. Syntactic dependency refers to the grammatical relationship of a word with other words in a sentence. Examples include:

nsubj: Subject of the sentence

dobj: Direct object

iobj: Indirect object

ROOT: Root of the sentence

amod: Adjectival modifier

punct: Punctuation mark

prep: Prepositional modifier

In [10]:
# This showcases tokenization with spaCy's en_core_web_sm pipeline, loaded directly with spacy.load

text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Making a list of the tokens and printing the list
token_list = [token.text for token in doc]
print("Tokens:", token_list)

# Showing token details
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tokens: ['I', 'could', "n't", 'help', 'the', 'dog', '.', 'Ca', "n't", 'you', 'do', 'it', '?', 'Do', "n't", 'be', 'afraid', 'if', 'you', 'are', '.']
I PRON nsubj
could AUX aux
n't PART neg
help VERB ROOT
the DET det
dog NOUN dobj
. PUNCT punct
Ca AUX aux
n't PART neg
you PRON nsubj
do VERB ROOT
it PRON dobj
? PUNCT punct
Do AUX aux
n't PART neg
be AUX ROOT
afraid ADJ acomp
if SCONJ mark
you PRON nsubj
are AUX advcl
. PUNCT punct


### Character-based tokenizer

Character-based tokenization splits text into individual characters. This approach has limitations: single characters do not convey as much information as entire words, and the token sequence becomes significantly longer, which can cause issues with model size and hurt performance.

In [11]:
# input
text = "This is a sample sentence for tokenization."

# convert text to character list
char_tokens = list(text)


print("Character-based tokenization output:", char_tokens)


Character-based tokenization output: ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 'a', 'm', 'p', 'l', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ' ', 'f', 'o', 'r', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '.']
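
To make the increase in sequence length concrete, the sketch below compares the number of tokens produced for the same sentence at word level and at character level. The whitespace split is only a stand-in for a real word tokenizer and is used here purely for comparison.

```python
text = "This is a sample sentence for tokenization."

# Naive whitespace split, used only as a rough word-level baseline
word_tokens = text.split()
# Character-level tokens, as in the cell above
char_tokens = list(text)

print("Word-level sequence length:", len(word_tokens))
print("Character-level sequence length:", len(char_tokens))
print("Distinct characters (vocabulary size):", len(set(char_tokens)))
```

The character-level vocabulary is small, but the sequence the model has to process is several times longer.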


### Subword-based tokenizer

A subword-based tokenizer keeps frequently used words intact while breaking infrequent words into meaningful subwords. Techniques such as SentencePiece and WordPiece are commonly used for subword tokenization. These methods learn subword units from a given text corpus, identifying common prefixes, suffixes, and root words as subword tokens based on their frequency of occurrence. This approach can represent a broader range of words and adapts to the specific language patterns within a text corpus.

### WordPiece
WordPiece starts with a vocabulary that includes every character present in the training data and progressively learns a specified number of merge rules. Rather than merging the most frequent symbol pair, it merges the pair that most increases the likelihood of the training data once added to the vocabulary. In essence, WordPiece evaluates what it sacrifices by merging two symbols to make sure the merge is worthwhile.
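
The difference between the two selection rules can be sketched on a toy corpus. The snippet below assumes one commonly described formulation of the WordPiece criterion, in which a pair's score is its frequency divided by the product of the frequencies of its two parts. The word counts are hypothetical, and this is an illustration of the idea rather than the training code used for BERT.

```python
from collections import Counter

# Toy corpus: word -> frequency (hypothetical counts for illustration)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Split each word into characters; non-initial pieces carry the "##" prefix
splits = {
    word: [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]
    for word in word_freqs
}

symbol_freqs = Counter()
pair_freqs = Counter()
for word, freq in word_freqs.items():
    pieces = splits[word]
    for piece in pieces:
        symbol_freqs[piece] += freq
    for left, right in zip(pieces, pieces[1:]):
        pair_freqs[(left, right)] += freq

# Assumed score: pair frequency normalized by the frequencies of its parts
pair_scores = {
    pair: freq / (symbol_freqs[pair[0]] * symbol_freqs[pair[1]])
    for pair, freq in pair_freqs.items()
}

most_frequent_pair = max(pair_freqs, key=pair_freqs.get)
best_scoring_pair = max(pair_scores, key=pair_scores.get)

print("Most frequent pair:", most_frequent_pair)
print("Best-scoring pair:", best_scoring_pair)
```

In this toy corpus the most frequent pair is built from symbols that are themselves very common, so the normalized score prefers a rarer pair whose parts almost always occur together.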

The WordPiece tokenizer is implemented in `BertTokenizer`. Note that `BertTokenizer` splits composite words into separate subword tokens and marks each continuation piece with a `##` prefix.

In [13]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("This  is example of  tokenization.")

['this', 'is', 'example', 'of', 'token', '##ization', '.']
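
The `##` pieces in the output come from matching each word against the learned vocabulary. The following is a minimal sketch of greedy longest-match-first lookup, assuming a tiny hypothetical vocabulary (`toy_vocab`) and a hypothetical helper name (`wordpiece_tokenize`). The real tokenizer also normalizes text and uses a vocabulary of roughly thirty thousand entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, shrinking until a match is found
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            # No piece matches at this position: the whole word becomes unknown
            return [unk_token]
        tokens.append(current)
        start = end
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"token", "##ization", "##ize", "example"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokenize", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```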

### Unigram and SentencePiece
Unigram is a subword tokenization algorithm. It starts with a large candidate vocabulary and gradually prunes it, removing the pieces that contribute least to the likelihood of the training text, until the vocabulary reaches the desired size.

SentencePiece is a tool that takes text, divides it into smaller, more manageable parts, assigns IDs to these segments, and ensures that it does so consistently. Consequently, if you use SentencePiece on the same text repeatedly, you will consistently obtain the same subwords and IDs.

Unigram and SentencePiece work together by implementing Unigram's subword tokenization method within the SentencePiece framework. SentencePiece handles subword segmentation and ID assignment, while Unigram's principles guide the vocabulary reduction process to create a more efficient representation of the text data. This combination is particularly valuable for various NLP tasks in which subword tokenization can enhance the performance of language models.

In [14]:
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer.tokenize("This is an example of  Unigram and SentencePiece.")

['▁This',
 '▁is',
 '▁an',
 '▁example',
 '▁of',
 '▁Uni',
 'gram',
 '▁and',
 '▁Sen',
 't',
 'ence',
 'Piece',
 '.']
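
Once a Unigram vocabulary has been learned, a word is segmented by choosing the split whose pieces have the highest combined probability. The sketch below implements that search with dynamic programming over a hypothetical table of piece probabilities (`toy_probs`). The values are invented for illustration and are not taken from the XLNet model, and `unigram_segment` is a hypothetical helper name.

```python
import math

def unigram_segment(text, piece_probs):
    """Return the most probable segmentation of text, or None if none exists."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of the last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(piece_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Walk backwards through the stored start indices to recover the pieces
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities
toy_probs = {
    "uni": 0.1, "gram": 0.1, "un": 0.02,
    "u": 0.05, "n": 0.05, "i": 0.05, "g": 0.05, "r": 0.05, "a": 0.05, "m": 0.05,
}

print(unigram_segment("unigram", toy_probs))
```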

### Tokenization with PyTorch
In PyTorch, especially with the torchtext library, the tokenizer breaks down text from a dataset into individual words or subwords so that they can be converted into numerical format. After tokenization, the vocab (vocabulary) maps these tokens to unique integers, allowing them to be fed into neural networks. This process is vital because deep learning models operate on numerical data and cannot process raw text directly. Tokenization and vocabulary mapping therefore serve as a bridge between human-readable text and machine-operable numerical data. Consider the dataset:

In [16]:
from torchtext.data.utils import get_tokenizer

In [15]:
dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1," NLP Named Entity,Sentiment Analysis,Machine Translation "),
    (1," Machine Translation with NLP "),
    (1," Named Entity vs Sentiment Analysis  NLP ")]

In [18]:
tokenizer = get_tokenizer("basic_english")
tokenizer(dataset[0][1])

['introduction', 'to', 'nlp']

### Token indices
Words are represented as numbers because NLP algorithms can process and manipulate numbers more efficiently than raw text. The function `build_vocab_from_iterator` builds a vocabulary that maps each token to a number; these numbers are typically referred to as 'token indices' or simply 'indices'.

When applied to an iterator of token lists, `build_vocab_from_iterator` assigns a unique index to each token based on its position in the vocabulary. These indices represent the tokens in a numerical format that machine learning models can process easily.

For example, given a vocabulary with tokens ["apple", "banana", "orange"], the corresponding indices might be [0, 1, 2], where "apple" is represented by index 0, "banana" by index 1, and "orange" by index 2.
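
The mapping in this example can be sketched in plain Python. The helper `build_toy_vocab` below is hypothetical and is not the torchtext function; its ordering rule (most frequent tokens first, ties broken alphabetically) is an assumption chosen for illustration.

```python
from collections import Counter

def build_toy_vocab(token_lists):
    """Map each distinct token to an integer index."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Assumed ordering: descending frequency, then alphabetical
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}

toy_vocab = build_toy_vocab([["apple", "banana", "orange"]])
print(toy_vocab)

# Convert a token sequence into its indices
indices = [toy_vocab[token] for token in ["banana", "apple", "orange"]]
print(indices)
```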

`dataset` is an iterable of (label, text) pairs. Therefore, you use a generator function, `yield_tokens`, to apply the tokenizer to each text. Instead of processing the entire dataset and returning all the tokenized texts at once, the generator yields each tokenized text individually as it is requested. Tokenization is therefore performed lazily: the next tokenized text is generated only when needed, which saves memory and computational resources.

In [19]:
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

my_iterator = yield_tokens(dataset) 

In [20]:
next(my_iterator)

['introduction', 'to', 'nlp']
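
The lazy behavior can be observed directly. In the sketch below, a hypothetical generator `lazy_tokens` records each text at the moment it is processed; a lowercase whitespace split stands in for the real tokenizer. Nothing is processed when the generator is created, and only one text is processed per call to `next`.

```python
def lazy_tokens(pairs, log):
    for _, text in pairs:
        log.append(text)            # record when each text is actually processed
        yield text.lower().split()  # stand-in for a real tokenizer

toy_dataset = [(1, "Introduction to NLP"), (2, "Basics of PyTorch")]
processed = []

gen = lazy_tokens(toy_dataset, processed)
processed_before = list(processed)  # still empty: no work has been done yet

first = next(gen)
processed_after = list(processed)   # only the first text has been processed

print("Processed before next():", processed_before)
print("First tokenized text:", first)
print("Processed after next():", processed_after)
```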

### Out-of-vocabulary (OOV)
When text data is tokenized, there might be words that aren't present in the vocabulary because they are rare or weren't seen during the vocabulary-building process. When the model encounters these out-of-vocabulary (OOV) words during tasks like text generation or language modeling, it can use the `<unk>` token to represent them.

For example, if "apple" is in the vocabulary but "pineapple" is not, "apple" will be used normally in the text, but "pineapple," being an OOV word, will be replaced with the `<unk>` token.

By adding the `<unk>` token to the vocabulary, you provide a consistent way to handle OOV words in your language model or any other NLP tasks.
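
The fallback can be sketched with an ordinary dictionary. The mapping `toy_stoi` and the helper `lookup` are hypothetical names used only to illustrate the idea of a default index for unseen tokens.

```python
# Hypothetical token-to-index mapping with "<unk>" at index 0
toy_stoi = {"<unk>": 0, "apple": 1, "banana": 2}

def lookup(tokens, stoi, unk_token="<unk>"):
    """Return the index of each token, falling back to the <unk> index."""
    unk_index = stoi[unk_token]
    return [stoi.get(token, unk_index) for token in tokens]

ids = lookup(["apple", "pineapple"], toy_stoi)
print(ids)
```

Here "apple" keeps its own index, while "pineapple" is mapped to the index of `<unk>`.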

In [21]:
vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

In [22]:
# Function to get the next tokenized sentence and its corresponding token indices
def get_tokenized_sentence_and_indices(iterator):
    # Get the next tokenized sentence from the iterator
    tokenized_sentence = next(iterator)  
    
    # Convert each token in the tokenized sentence into its corresponding index from the vocabulary
    token_indices = [vocab[token] for token in tokenized_sentence]  
    
    # Return both the tokenized sentence and its token indices
    return tokenized_sentence, token_indices

# Get the next tokenized sentence and its token indices from the iterator
tokenized_sentence, token_indices = get_tokenized_sentence_and_indices(my_iterator)

# Advance the iterator once more (optional); note that this consumes and discards the following sentence
next(my_iterator)

# Print the tokenized sentence and the token indices
print("Tokenized Sentence:", tokenized_sentence)
print("Token Indices:", token_indices)


Tokenized Sentence: ['basics', 'of', 'pytorch']
Token Indices: [11, 15, 2]


In [23]:
# Define sample lines of text
lines = ["IBM taught me tokenization", 
         "Special tokenizers are ready and they will blow your mind", 
         "just saying hi!"]

# Define special symbols that will be used for tokenization
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

# Initialize a tokenizer using the spaCy model for English
tokenizer_en = get_tokenizer('spacy', language='en_core_web_sm')

# Initialize an empty list to store the tokenized lines
tokens = []
max_length = 0  # Variable to store the maximum length of tokenized sentences

# Tokenize each line and add special tokens (<bos> at the beginning and <eos> at the end)
for line in lines:
    tokenized_line = tokenizer_en(line)  # Tokenize the sentence
    tokenized_line = ['<bos>'] + tokenized_line + ['<eos>']  # Add <bos> and <eos> tokens
    tokens.append(tokenized_line)  # Append the tokenized sentence to the tokens list
    max_length = max(max_length, len(tokenized_line))  # Update max_length for padding

# Padding shorter tokenized sentences to match the max length
for i in range(len(tokens)):
    tokens[i] = tokens[i] + ['<pad>'] * (max_length - len(tokens[i]))  # Add <pad> tokens to make all sentences equal in length

# Print the lines after adding the special tokens and padding
print("Lines after adding special tokens:\n", tokens)

# Build vocabulary from the tokenized sentences, registering all special symbols so they occupy the first indices
vocab = build_vocab_from_iterator(tokens, specials=special_symbols)
vocab.set_default_index(vocab["<unk>"])  # Set the default index to <unk> for OOV words

# Print the vocabulary (list of tokens) and the token-to-ID mapping
print("Vocabulary:", vocab.get_itos())  # List of tokens, ordered by index
print("Token IDs for 'tokenization':", vocab.get_stoi())  # Despite the label, get_stoi() returns the full token-to-ID dictionary


Lines after adding special tokens:
 [['<bos>', 'IBM', 'taught', 'me', 'tokenization', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'Special', 'tokenizers', 'are', 'ready', 'and', 'they', 'will', 'blow', 'your', 'mind', '<eos>'], ['<bos>', 'just', 'saying', 'hi', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]
Vocabulary: ['<unk>', '<pad>', '<bos>', '<eos>', '!', 'IBM', 'Special', 'and', 'are', 'blow', 'hi', 'just', 'me', 'mind', 'ready', 'saying', 'taught', 'they', 'tokenization', 'tokenizers', 'will', 'your']
Token IDs for 'tokenization': {'will': 20, 'tokenizers': 19, 'tokenization': 18, 'taught': 16, 'your': 21, 'saying': 15, '<unk>': 0, 'and': 7, 'hi': 10, '<pad>': 1, '<bos>': 2, 'they': 17, '<eos>': 3, '!': 4, 'ready': 14, 'IBM': 5, 'are': 8, 'Special': 6, 'mind': 13, 'me': 12, 'blow': 9, 'just': 11}


1. **Special Tokens**:
- Token: "`<unk>`", Index: 0: `<unk>` stands for "unknown" and represents words that were not seen during vocabulary building, usually during inference on new text.
- Token: "`<pad>`", Index: 1: `<pad>` is a "padding" token used to make sequences of words the same length when batching them together. 
- Token: "`<bos>`", Index: 2: `<bos>` is an acronym for "beginning of sequence" and is used to denote the start of a text sequence.
- Token: "`<eos>`", Index: 3: `<eos>` is an acronym for "end of sequence" and is used to denote the end of a text sequence.

2. **Word Tokens**:
The rest of the tokens are words or punctuation extracted from the provided sentences, each assigned a unique index:
- Token: "IBM", Index: 5
- Token: "taught", Index: 16
- Token: "me", Index: 12
    ... and so on.
    
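3. **Vocabulary**:
It is the list of unique tokens returned by `vocab.get_itos()`, ordered by index. Here it contains 22 entries, including the four special tokens.
    
4. **Token IDs for 'tokenization'**:
Despite its label, this line prints the full dictionary returned by `vocab.get_stoi()`, which maps every token to its index (for example, 'tokenization' maps to 18). The order in which the dictionary entries are printed is not meaningful.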
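
The two printed structures are inverses of each other. The sketch below rebuilds the token-to-index dictionary from the vocabulary list shown in the output above, using plain Python rather than the torchtext `vocab` object, and looks up a single token.

```python
# Vocabulary list copied from the output above (index = position in the list)
itos = ['<unk>', '<pad>', '<bos>', '<eos>', '!', 'IBM', 'Special', 'and',
        'are', 'blow', 'hi', 'just', 'me', 'mind', 'ready', 'saying',
        'taught', 'they', 'tokenization', 'tokenizers', 'will', 'your']

# Inverse mapping: token -> index
stoi = {token: index for index, token in enumerate(itos)}

print("Vocabulary size:", len(itos))
print("ID of 'tokenization':", stoi["tokenization"])
print("Token at index 18:", itos[18])
```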

In [24]:
new_line = "I learned about embeddings and attention mechanisms."

# Tokenize the new line
tokenized_new_line = tokenizer_en(new_line)
tokenized_new_line = ['<bos>'] + tokenized_new_line + ['<eos>']

# Pad the new line to match the maximum length of previous lines
new_line_padded = tokenized_new_line + ['<pad>'] * (max_length - len(tokenized_new_line))

# Convert tokens to IDs and handle unknown words
new_line_ids = [vocab[token] if token in vocab else vocab['<unk>'] for token in new_line_padded]

# Print the token IDs for the new line
print("Token IDs for new line:", new_line_ids)

Token IDs for new line: [2, 0, 0, 0, 0, 7, 0, 0, 0, 3, 1, 1]
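
Decoding the IDs back into tokens shows which words fell outside the vocabulary. This sketch reuses the vocabulary list and the ID sequence from the outputs above and relies only on plain Python.

```python
# Vocabulary list copied from the earlier output (index = position in the list)
itos = ['<unk>', '<pad>', '<bos>', '<eos>', '!', 'IBM', 'Special', 'and',
        'are', 'blow', 'hi', 'just', 'me', 'mind', 'ready', 'saying',
        'taught', 'they', 'tokenization', 'tokenizers', 'will', 'your']

# Token IDs printed for the new line
new_line_ids = [2, 0, 0, 0, 0, 7, 0, 0, 0, 3, 1, 1]

decoded = [itos[token_id] for token_id in new_line_ids]
unknown_count = decoded.count('<unk>')

print("Decoded tokens:", decoded)
print("Number of out-of-vocabulary tokens:", unknown_count)
```

Only "and" was seen when the vocabulary was built, so every other word and the final period decode to `<unk>`.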
