# Tokenization in NLP

## Introduction
Tokenization is the process of breaking text into smaller units called **tokens**. Tokens can be:
- Words
- Sentences
- Subwords (used in modern models like BERT/GPT)

Tokenization is a **crucial step** in NLP because it transforms raw text into a structured format that models can understand.

**In this notebook, we will learn:**
1. Word tokenization
2. Sentence tokenization
3. Subword tokenization


In [1]:
# Step 0: Import libraries
import re  # for regex sentence splitting

# Sample text
text = "Hello there! NLP is amazing. Let's learn tokenization."

print("Original Text:")
print(text)


Original Text:
Hello there! NLP is amazing. Let's learn tokenization.


In [2]:
# Step 1: Word Tokenization (Basic)
# Using simple split
words = text.split()
print("Word Tokens:", words)


Word Tokens: ['Hello', 'there!', 'NLP', 'is', 'amazing.', "Let's", 'learn', 'tokenization.']


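**Explanation:**  
- Word tokenization splits the text into individual words.  
- Here we used `.split()`, which splits on whitespace only, so punctuation stays attached to the neighboring word (`'there!'`, `'amazing.'`).  
- This method needs only the Python standard library, with no extra packages.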
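If punctuation should become its own token, one possible refinement is a regex-based tokenizer. The sketch below is only an illustration: the function name `regex_word_tokenize` and the pattern are assumptions made for this example, not part of the notebook above, and the pattern handles only simple English contractions such as `Let's`.

```python
import re

def regex_word_tokenize(text):
    # Match either a word (optionally with an apostrophe part, e.g. "Let's")
    # or a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

text = "Hello there! NLP is amazing. Let's learn tokenization."
print("Regex Word Tokens:", regex_word_tokenize(text))
```

Compared with `.split()`, this keeps `there` and `!` as separate tokens, which is usually closer to what downstream steps expect.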

In [3]:
# Step 2: Sentence Tokenization (Basic)
# Using regex to split on the spaces that follow '.', '!' or '?'
sentences = re.split(r'(?<=[.!?]) +', text)
print("Sentence Tokens:", sentences)

Sentence Tokens: ['Hello there!', 'NLP is amazing.', "Let's learn tokenization."]


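**Explanation:**  
- Sentence tokenization splits the text into sentences.  
- We used a regex that splits on the spaces following `.`, `!`, or `?`; the lookbehind `(?<=[.!?])` keeps the punctuation with its sentence.  
- This is a simple approach that needs no NLTK, but it is not fully reliable: abbreviations such as `Dr.` are treated as sentence endings.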
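The abbreviation problem can be seen, and partly worked around, with a small wrapper around the same regex. This is a minimal sketch under stated assumptions: `ABBREVIATIONS` is a hypothetical, deliberately incomplete list, and `split_sentences` is a name introduced here for illustration.

```python
import re

# Hypothetical, incomplete list of abbreviations that should not end a sentence
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

def split_sentences(text):
    # Same split as in the cell above: break on spaces that follow . ! or ?
    pieces = re.split(r'(?<=[.!?]) +', text)
    sentences = []
    buffer = ""
    for piece in pieces:
        # Glue the piece onto any text held back from a previous iteration
        buffer = f"{buffer} {piece}" if buffer else piece
        words = buffer.split()
        last_word = words[-1] if words else ""
        if last_word in ABBREVIATIONS:
            continue  # not a real sentence end, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = "Dr. Smith arrived. He was late."
print("Plain regex:", re.split(r'(?<=[.!?]) +', sample))
print("With abbreviation check:", split_sentences(sample))
```

The plain regex splits after `Dr.`, while the wrapper merges that piece back into the following sentence. A fixed list will never cover every abbreviation, so this only illustrates the limitation.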


In [4]:
# Step 3: Tokenizing a small dataset
sample_texts = [
    "Hello there! How are you?",
    "Tokenization is essential for NLP.",
    "We will learn preprocessing step by step."
]

# Word tokenization for every text in the dataset
# (the loop variable is named `sample` so it does not shadow the earlier `text`)
tokenized_texts = [sample.split() for sample in sample_texts]

for i, tokens in enumerate(tokenized_texts):
    print(f"Original: {sample_texts[i]}")
    print(f"Word Tokens: {tokens}")
    print('-'*50)


Original: Hello there! How are you?
Word Tokens: ['Hello', 'there!', 'How', 'are', 'you?']
--------------------------------------------------
Original: Tokenization is essential for NLP.
Word Tokens: ['Tokenization', 'is', 'essential', 'for', 'NLP.']
--------------------------------------------------
Original: We will learn preprocessing step by step.
Word Tokens: ['We', 'will', 'learn', 'preprocessing', 'step', 'by', 'step.']
--------------------------------------------------
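To illustrate what a "structured format that models can understand" can look like, the tokens above could be mapped to integer IDs. The following is a minimal sketch, not a step from the original notebook: the names `vocab` and `encoded` are assumptions for this example, and IDs are simply assigned in order of first appearance.

```python
sample_texts = [
    "Hello there! How are you?",
    "Tokenization is essential for NLP.",
    "We will learn preprocessing step by step."
]
tokenized_texts = [sample.split() for sample in sample_texts]

# Build a vocabulary: each new token gets the next free integer ID
vocab = {}
for tokens in tokenized_texts:
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)

# Replace every token with its ID
encoded = [[vocab[token] for token in tokens] for tokens in tokenized_texts]

print("Vocabulary size:", len(vocab))
print("Encoded texts:", encoded)
```

Because `.split()` leaves punctuation attached, `step` and `step.` receive different IDs here, which is one reason cleaner word tokenization matters.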


In [5]:
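# Step 4: Subword Tokenization (toy illustration)
# Split each word into two halves. Real subword tokenizers, such as
# those used by BERT/GPT, learn their splits from data instead.

def simple_subword_tokenize(word):
    # Split the word at its midpoint; one-character words are returned whole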
    mid = len(word) // 2
    if mid == 0:
        return [word]
    return [word[:mid], word[mid:]]

subword_tokens = [simple_subword_tokenize(w) for w in words]
print("Subword Tokens (Basic):", subword_tokens)


Subword Tokens (Basic): [['He', 'llo'], ['the', 're!'], ['N', 'LP'], ['i', 's'], ['amaz', 'ing.'], ['Le', "t's"], ['le', 'arn'], ['tokeni', 'zation.']]
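The halving function above produces arbitrary pieces such as `'tokeni'` and `'zation.'`. For comparison, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece, which looks pieces up in a vocabulary. The vocabulary `toy_vocab` is hand-written and hypothetical (real vocabularies are learned from a large corpus), and `wordpiece_tokenize` is a name chosen for this example, not a library function.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration only
toy_vocab = {"token", "##ization", "learn", "##ing", "amaz"}

for w in ["tokenization", "amazing", "learning", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, `tokenization` becomes `['token', '##ization']`, while a word with no matching pieces, such as `xyz`, falls back to `['[UNK]']`.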
