
#NLP Basics and Tokenization Key Terms

##Raw Text and Data Units

Must watch: https://youtu.be/kCc8FmEb1nY?si=5M1k1KJ_3Q45TYSc

Corpus:

A large collection of text documents used for training or analysis.

Document:

A single unit of text such as a sentence, paragraph, or article.

Sentence:

A sequence of words forming a complete thought, often used as a single input.

⸻

##Text Preprocessing

Text Normalization:

The process of cleaning and standardizing text before tokenization.

Lowercasing:

Converting all text to lowercase to reduce vocabulary size.

Punctuation Removal:

Removing symbols such as commas and periods when they are not useful.

Whitespace Normalization:

Removing extra spaces or line breaks.

Stop Words:

Very common words like “the”, “is”, and “and” that may be removed in classical NLP.
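
The preprocessing steps above can be chained into one small function. This is a minimal sketch using only the Python standard library; the name `normalize` and the three-word `STOP_WORDS` set are illustrative assumptions, not a standard list.

```python
import re
import string

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "is", "and"}

def normalize(text, remove_stop_words=False):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()                                               # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    text = re.sub(r"\s+", " ", text).strip()                          # whitespace normalization
    if remove_stop_words:
        text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    return text

print(normalize("The  cat,   and the DOG!\n"))                          # the cat and the dog
print(normalize("The  cat,   and the DOG!\n", remove_stop_words=True))  # cat dog
```

Stop-word removal is left optional here because, as noted above, it belongs mostly to classical NLP pipelines.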

⸻

##Tokenization Fundamentals

Token:

The basic unit of text a language model processes: a word, subword, character, or byte, depending on the tokenizer.

Tokenization:

The process of splitting raw text into tokens.

Tokenizer:

The algorithm or tool that converts text into tokens.

⸻

##Types of Tokenization

Word Level Tokenization:

Splitting text into words based on spaces or punctuation.

Character Level Tokenization:

Splitting text into individual characters.

Subword Tokenization:

Splitting text into units smaller than words but larger than characters.
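
To compare the three granularities on one input, here is a rough sketch. The regular expression, the helper `greedy_subwords`, and the four-entry `toy_vocab` are assumptions made for illustration; trained subword tokenizers learn their vocabulary from data instead of using a hand-picked one.

```python
import re

text = "unhappy cats."

# Word level: runs of word characters, plus each punctuation mark on its own
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character level: every character, spaces included
char_tokens = list(text)

def greedy_subwords(word, vocab):
    """Split a word by repeatedly taking the longest prefix found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing matched: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"un", "happy", "cat", "s"}  # hand-picked, hypothetical vocabulary
subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, toy_vocab)]

print(word_tokens)     # ['unhappy', 'cats', '.']
print(len(char_tokens))  # 13
print(subword_tokens)  # ['un', 'happy', 'cat', 's', '.']
```

The subword split sits between the other two: fewer tokens than characters, but "unhappy" and "cats" are still covered without needing their own vocabulary entries.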

⸻

##Modern Tokenization Methods

Byte Pair Encoding (BPE):

A subword tokenization method that starts from characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols into a new token.

WordPiece:

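A subword tokenization method used in BERT that builds its vocabulary by merging the symbol pairs that most increase the likelihood of the training data, rather than simply the most frequent pairs.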

Unigram Language Model:

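A probabilistic subword tokenization approach that starts from a large candidate vocabulary and repeatedly prunes the tokens whose removal least reduces the likelihood of the training corpus.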

SentencePiece:

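A tokenizer library that treats text as a raw character sequence, encoding whitespace as an ordinary symbol instead of pre-splitting on spaces; it implements both BPE and unigram models.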
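
The BPE merge loop is short enough to sketch directly. The code below is a simplified training loop, not the implementation of any particular library; the function names and the word counts in `toy_counts` are made up for illustration.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe_merges(word_counts, num_merges):
    """Return the learned merges and the corpus re-segmented with them."""
    words = {tuple(w): c for w, c in word_counts.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical word frequencies from a tiny corpus
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, words = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges the frequent word "hug" has become a single token, while rarer words such as "bun" remain split into smaller pieces.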

⸻

##Vocabulary and Encoding

Vocabulary:

The complete set of tokens known to a tokenizer.

Vocabulary Size:

The total number of unique tokens in the vocabulary.

Token ID:

A numerical representation assigned to each token.

Encoding:

Converting text into token IDs.

Decoding:

Converting token IDs back into readable text.
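
The relationship between vocabulary, token IDs, encoding, and decoding can be shown with a toy word-level vocabulary. The corpus sentence and the helper names `encode` and `decode` are assumptions for illustration; real tokenizers ship a fixed, pre-trained vocabulary.

```python
corpus = "the cat sat on the mat"

# Vocabulary: the sorted set of unique tokens seen in the corpus
vocab = sorted(set(corpus.split()))
vocab_size = len(vocab)

# Token ID: each token's position in the vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def encode(text):
    """Text -> list of token IDs (assumes every word is in the vocabulary)."""
    return [token_to_id[tok] for tok in text.split()]

def decode(ids):
    """List of token IDs -> text."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("the cat sat")
print(vocab)        # ['cat', 'mat', 'on', 'sat', 'the']
print(vocab_size)   # 5
print(ids)          # [4, 0, 3]
print(decode(ids))  # the cat sat
```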

⸻

##Special Tokens

Padding Token [PAD]:

Used to make all sequences in a batch the same length.

Unknown Token [UNK]:

Represents words or subwords not present in the vocabulary.

Classification Token [CLS]:

A special token added at the beginning of a sequence for classification tasks.

Separator Token [SEP]:

Used to separate sentences or segments in a single input.
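
As a sketch of how these four tokens appear together, the snippet below assembles a BERT-style sentence-pair input. The vocabulary and its ID numbers are invented for this example and do not match the IDs of any real model.

```python
# Toy vocabulary; the IDs are made up for illustration
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "how": 4, "are": 5, "you": 6, "fine": 7}

def build_pair_input(tokens_a, tokens_b, max_len):
    """[CLS] A [SEP] B [SEP], mapped to IDs and padded to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]  # unseen tokens -> [UNK]
    ids += [vocab["[PAD]"]] * (max_len - len(ids))            # fill the rest with [PAD]
    return tokens, ids

tokens, ids = build_pair_input(["how", "are", "you"], ["fine", "thanks"], max_len=10)
print(tokens)  # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]']
print(ids)     # [2, 4, 5, 6, 3, 7, 1, 3, 0, 0]
```

The word "thanks" is not in the toy vocabulary, so it maps to the `[UNK]` ID, and the two trailing zeros are padding.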

⸻

##Sequence Length Handling

Padding:

Adding extra tokens to shorter sequences so all sequences have equal length.

Truncation:

Cutting longer sequences to a fixed maximum length.

Maximum Sequence Length:

The largest number of tokens a model can process in a single input; longer inputs must be truncated or split.
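
Padding and truncation are often handled by one helper. This is a minimal sketch that pads on the right and truncates from the right; the function name `pad_or_truncate` and the default `pad_id=0` are assumptions, and some setups pad or truncate on the left instead.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a list of exactly max_len token IDs."""
    ids = ids[:max_len]                              # truncation: drop tokens past max_len
    return ids + [pad_id] * (max_len - len(ids))     # padding: fill up to max_len

print(pad_or_truncate([5, 6, 7], max_len=5))           # [5, 6, 7, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))  # [1, 2, 3, 4]
```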

⸻

##Attention Related Concepts

Attention Mask:

A binary mask indicating which tokens are real (1) and which are padding (0).

Self Attention:

A mechanism where each token attends to every token in the sequence, itself included, to build a context-aware representation.

Masked Attention:

Preventing certain tokens from participating in attention computation, for example padding tokens or, in causal language models, future positions.
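
The three ideas above fit into a few lines of NumPy. This is a deliberately simplified sketch: a single head, with queries, keys, and values all equal to the input `x` (real models apply learned projections first). The function name and the random input are assumptions for illustration.

```python
import numpy as np

def masked_self_attention(x, attention_mask):
    """x: (seq_len, d) token vectors; attention_mask: (seq_len,), 1 = real, 0 = padding."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                      # every token scores every token
    # Masked attention: padded key positions get a very negative score
    scores = np.where(attention_mask[None, :] == 1, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 positions, 8-dimensional vectors
mask = np.array([1, 1, 1, 0])      # last position is padding
out, weights = masked_self_attention(x, mask)

print(out.shape)                   # (4, 8)
print(weights[:, 3])               # attention paid to the padded position
```

Each row of `weights` sums to 1, and the column for the padded position is zero, so padding contributes nothing to any output vector.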

⸻

##Practical NLP Pipeline

Tokenize:

Convert raw text into tokens.

Encode:

Convert tokens into token IDs.

Pad or Truncate:

Adjust sequence lengths for batching.

Create Attention Mask:

Ensure padding tokens do not affect the model.
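
Put together, the four pipeline steps might look like the sketch below. It assumes a whitespace tokenizer and a tiny hand-written vocabulary; `prepare_batch` is a hypothetical helper, not a library function.

```python
import numpy as np

# Toy vocabulary with made-up IDs
vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "like": 3, "nlp": 4,
         "tokenizers": 5, "are": 6, "fun": 7}

def prepare_batch(texts, max_len):
    """Tokenize, encode, pad or truncate, and build attention masks."""
    batch_ids, batch_mask = [], []
    for text in texts:
        tokens = text.lower().split()                          # 1. tokenize
        ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # 2. encode
        ids = ids[:max_len]                                    # 3. truncate ...
        n_real = len(ids)
        ids = ids + [vocab["[PAD]"]] * (max_len - n_real)      #    ... or pad
        mask = [1] * n_real + [0] * (max_len - n_real)         # 4. attention mask
        batch_ids.append(ids)
        batch_mask.append(mask)
    return np.array(batch_ids), np.array(batch_mask)

input_ids, attention_mask = prepare_batch(
    ["I like NLP", "Tokenizers are really fun today"], max_len=4)
print(input_ids)       # rows: [2 3 4 0] and [5 6 1 7]
print(attention_mask)  # rows: [1 1 1 0] and [1 1 1 1]
```

The first sentence is padded by one position, while the second is truncated from five tokens to four and contains one `[UNK]` for the out-of-vocabulary word "really".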


#Assignment Questions

##Q1. Padding and Attention Mask Computation

Given batch token lengths:
[12, 7, 15, 9]

###(a) Longest token size

The longest sequence length in the batch is the maximum value in the list.

L = max([12, 7, 15, 9]) = 15

###(b) Padding added per sequence

Padding added to each sequence is calculated as:

padding = L - original_length

Sequence 1: 15 - 12 = 3
Sequence 2: 15 - 7  = 8
Sequence 3: 15 - 15 = 0
Sequence 4: 15 - 9  = 6

Padding per sequence:
[3, 8, 0, 6]

###(c) Attention mask for the length-7 sequence

After padding the length-7 sequence to length 15:

- Real tokens are marked with 1
- Padding tokens are marked with 0

Attention mask:
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

In [19]:
#demo code
import numpy as np
token_lengths = np.array([12, 7, 15, 9])

#(a) Longest sequence length in the batch
LM = token_lengths.max()
print("Longest token size", LM)

#(b) Padding added per sequence
Padding = LM - token_lengths
print("Padding per sequence", Padding)

#(c) Attention mask for the length-7 sequence
seq_len = 7
no_of_zero = LM - seq_len  # 15 - 7 = 8 padding positions
attention_mask = np.concatenate([
    np.ones(seq_len, dtype=int),       # 1 = real token
    np.zeros(no_of_zero, dtype=int)])  # 0 = padding
print("Attention mask for the length-7 sequence")
print(attention_mask)

Longest token size 15
Padding per sequence [3 8 0 6]
Attention mask for the length-7 sequence
[1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]


In [23]:
np.concatenate([np.zeros(5).astype(int),np.ones(5).astype(int)])

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

In [None]:
import numpy as np

# Given token lengths
token_lengths = np.array([12, 7, 15, 9])

# (a) Longest token size
L = token_lengths.max()

# (b) Padding added per sequence
padding = L - token_lengths

# (c) Attention mask for the length-7 sequence
seq_length = 7
attention_mask = np.concatenate([
    np.ones(seq_length),
    np.zeros(L - seq_length)
])

# Print results
print("Token lengths:", token_lengths)
print("Longest token size (L):", L)
print("Padding per sequence:", padding)
print("Attention mask for length-7 sequence:")
print(attention_mask.astype(int))

Token lengths: [12  7 15  9]
Longest token size (L): 15
Padding per sequence: [3 8 0 6]
Attention mask for length-7 sequence:
[1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]


##Q2. Truncation with Fixed Maximum Length

You choose a fixed maximum token length:
Lmax = 128

Given tokenized sequence lengths:
[80, 140, 128, 200, 50]

###(a) Number of truncated samples

A sample is truncated if:
original_length > Lmax

Lengths greater than 128 are:
140 and 200 (the length-128 sequence fits exactly and is not truncated)

Number of truncated samples:
2

###(b) Tokens removed for each truncated sample

Tokens removed are calculated as:
removed = original_length - Lmax

For length 140:
140 - 128 = 12 tokens removed

For length 200:
200 - 128 = 72 tokens removed

###(c) Total tokens removed across the dataset

Total removed tokens:
12 + 72 = 84 tokens

In [36]:
#demo code
import numpy as np
Lmax = 128
seq_token = np.array([80, 140, 128, 200, 50])

#(a) Number of truncated samples
truncated_mask = seq_token > Lmax  # True where a sequence exceeds Lmax
print(truncated_mask)
# Manual count, kept for comparison with the vectorized np.sum below
count = 0
for is_truncated in truncated_mask:
    if is_truncated:
        count += 1
num_truncated = np.sum(truncated_mask)  # same count without a loop
print(count)

#(b) Tokens removed for each truncated sample
removed_tokens = seq_token[truncated_mask] - Lmax
print(removed_tokens)

#(c) Total tokens removed across the dataset
print(np.sum(removed_tokens))

[False  True False  True False]
2
[12 72]
84


In [25]:
import numpy as np

# Fixed maximum length
Lmax = 128

# Given tokenized lengths
token_lengths = np.array([80, 140, 128, 200, 50])

# (a) Identify truncated samples
truncated_mask = token_lengths > Lmax
num_truncated = np.sum(truncated_mask)

# (b) Tokens removed per truncated sample
removed_tokens = token_lengths[truncated_mask] - Lmax

# (c) Total tokens removed
total_removed = np.sum(removed_tokens)

# Print results
print("Token lengths:", token_lengths.tolist())
print("Fixed maximum length (Lmax):", Lmax)
print("Number of truncated samples:", int(num_truncated))
print("Tokens removed per truncated sample:", removed_tokens.tolist())
print("Total tokens removed:", int(total_removed))

Token lengths: [80, 140, 128, 200, 50]
Fixed maximum length (Lmax): 128
Number of truncated samples: 2
Tokens removed per truncated sample: [12, 72]
Total tokens removed: 84
