# 📌 Topic: Tokenization

### What you will learn
- What tokenization is and why it's fundamental to NLP
- How to split text into sentences (sentence tokenization)
- How to split text into individual words (word tokenization)
- Common edge cases (contractions, punctuation, possessives)
- Hands-on examples with real text

### Why this matters
Tokenization is the **first step** in almost every NLP pipeline, and one of the most consequential. Before you can analyze text, you must break it down into meaningful units. Most downstream tasks (sentiment analysis, NER, machine translation, etc.) depend heavily on tokenization quality, because errors made here propagate to every later stage.

---

## What is Tokenization?

**Tokenization** is the process of splitting raw text into smaller, meaningful units called **tokens**. Tokens are usually words or sentences, but can be characters or subword units depending on your use case.

### Example:
- Input: `"Her cat's name is Luna."`
- Tokens: `['Her', 'cat', "'s", 'name', 'is', 'Luna', '.']`
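
To make the point about granularity concrete, here is a minimal sketch that uses only built-in string operations (no NLTK). The word-level list is copied by hand from the example above; the character-level list is just one simple alternative, not a recommendation.

```python
text = "Her cat's name is Luna."

# Word-level tokens, copied by hand from the example above
word_level = ['Her', 'cat', "'s", 'name', 'is', 'Luna', '.']

# Character-level tokens: every character, including spaces, becomes a token
char_level = list(text)

print(len(word_level), "word-level tokens")
print(len(char_level), "character-level tokens")
print(char_level[:7])  # ['H', 'e', 'r', ' ', 'c', 'a', 't']
```

Finer granularity produces more tokens per text but a smaller set of distinct token types.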

### Why tokenize?
- Computers process discrete units, not continuous streams
- Enables counting word frequencies, building vocabularies, and creating features
- Necessary for sentence-level analysis (sentiment per sentence, etc.)

**Trade-off**: Simple splitting (e.g., on spaces) doesn't handle punctuation, contractions, or other edge cases well. NLTK handles these with a pre-trained Punkt model for sentence boundaries and rule-based, Penn Treebank-style patterns for words.
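
The gap is easy to see with the standard library alone. The sketch below compares whitespace splitting with a naive regular expression; the pattern is an illustrative assumption, not what NLTK uses internally.

```python
import re

text = "Her cat's name is Luna."

# Whitespace splitting leaves punctuation glued to the neighbouring word
space_tokens = text.split()
print(space_tokens)   # ['Her', "cat's", 'name', 'is', 'Luna.']

# Naive regex (illustrative only): runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)   # ['Her', 'cat', "'", 's', 'name', 'is', 'Luna', '.']
```

Neither result matches the `['cat', "'s"]` split from the example: whitespace splitting keeps `Luna.` intact, and the naive regex breaks the possessive into two fragments.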

In [None]:
# Import NLTK (Natural Language Toolkit) - one of the most popular NLP libraries
import nltk

# Download 'punkt_tab' - the pre-trained Punkt sentence-tokenizer data
# sent_tokenize needs it, and so does word_tokenize (it splits sentences first)
# You only need to run this once per environment
nltk.download('punkt_tab')

# Import two key tokenization functions:
# - sent_tokenize: splits text into sentences
# - word_tokenize: splits text (or sentences) into words
from nltk.tokenize import word_tokenize, sent_tokenize

## Step 1: Prepare Sample Text

Let's create a simple example with two sentences. Notice the possessive apostrophes (cat's, dog's) - these will test how well the tokenizer handles edge cases.

In [None]:
# Store a multi-sentence text with possessives as a test case
# Possessives ('s) are tricky - the tokenizer must decide whether to attach or separate them
sentences = "Her cat's name is Luna. Her dog's name is Max"

## Step 2: Sentence Tokenization

The first level of tokenization is **sentence splitting**. The `sent_tokenize()` function identifies sentence boundaries (typically marked by `.`, `!`, or `?`).

### Why sentence tokenization first?
- Sentences are often the relevant unit for analysis (e.g., sentiment per sentence)
- Many NLP models process one sentence at a time
- Helps avoid errors: words from one sentence shouldn't be mixed with context from another

### How it works:
NLTK's `sent_tokenize()` uses the Punkt tokenizer, an unsupervised statistical model that learns abbreviations and other sentence-boundary cues from text. It handles edge cases like abbreviations (e.g., "Dr. Smith") better than naive period-splitting.
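
To see why abbreviations are a problem, here is a rough standard-library sketch. It is not how NLTK works internally: the `ABBREVIATIONS` set and both helper functions are hypothetical, hand-written stand-ins for what a trained sentence tokenizer learns from data.

```python
import re

# Hypothetical, hand-written abbreviation list (illustration only)
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text):
    """Re-join pieces whose last word is a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        last_word = buffer.split()[-1].lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = "Dr. Smith adopted Luna. Max is her dog."
naive_result = naive_split(sample)
aware_result = abbreviation_aware_split(sample)
print(naive_result)  # ['Dr.', 'Smith adopted Luna.', 'Max is her dog.']
print(aware_result)  # ['Dr. Smith adopted Luna.', 'Max is her dog.']
```

A fixed list like this breaks as soon as the text contains an abbreviation it doesn't know, which is the motivation for learning such cues from data instead.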

In [None]:
# Use sent_tokenize to split the text into sentences
# This returns a list of strings, one per sentence
sentence_tokens = sent_tokenize(sentences)

# Let's inspect what we got
print("Number of sentences:", len(sentence_tokens))
print("\nSentences:")
for i, sent in enumerate(sentence_tokens, 1):
    print(f"  {i}. {sent}")

## Step 3: Word Tokenization

Now that we have sentences, we split each into individual words. The `word_tokenize()` function is more complex than simple space-splitting because it must:

1. **Separate punctuation** from words (e.g., "Luna." → "Luna", ".")
2. **Handle contractions** intelligently (e.g., "don't" → "do", "n't")
3. **Respect possessives** (e.g., "cat's" → "cat", "'s")

### Common beginner mistake:
❌ **Wrong**: Using `.split()` naively
```python
"Luna.".split()  # Returns ['Luna.'] - punctuation still attached!
```

✅ **Right**: Using `word_tokenize()`
```python
word_tokenize("Luna.")  # Returns ['Luna', '.'] - punctuation separated
```

### Trade-off:
NLTK's word tokenizer is quite aggressive about splitting contractions and possessives. Some systems prefer to keep "don't" as one token, while others split it to "do" + "n't". Choose based on your downstream task.
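
The two policies can be sketched with regular expressions. Both functions below are hypothetical simplifications written for this comparison; they are not NLTK's implementation and ignore cases such as decimals or abbreviations.

```python
import re

def keep_contractions(text):
    """Policy A: keep "don't" and "Luna's" as single tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def split_contractions(text):
    """Policy B: Treebank-style split, e.g. "don't" becomes "do" + "n't"."""
    # Split off the negation clitic n't
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Split off 's, 're, 've, 'll, 'd, 'm
    text = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Put spaces around common punctuation marks
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()

sample = "Don't touch Luna's bowl."
kept = keep_contractions(sample)
split = split_contractions(sample)
print(kept)   # ["Don't", 'touch', "Luna's", 'bowl', '.']
print(split)  # ['Do', "n't", 'touch', 'Luna', "'s", 'bowl', '.']
```

Policy A keeps surface words intact, which suits tasks like keyword search. Policy B exposes the negation and possessive markers as separate tokens, which can help tasks that depend on them, such as sentiment analysis.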

In [None]:
# Let's tokenize the second sentence into words
# sentence_tokens[1] is "Her dog's name is Max"
second_sentence = sentence_tokens[1]
word_tokens = word_tokenize(second_sentence)

print(f"Original sentence: {second_sentence}")
print(f"\nWord tokens: {word_tokens}")
print(f"\nNumber of tokens: {len(word_tokens)}")

# Notice how "dog's" was split into ['dog', "'s"]
# This is expected: following Penn Treebank conventions, the possessive marker 's becomes its own token
print("\nToken breakdown:")
for i, token in enumerate(word_tokens):
    print(f"  {i}: {token!r}")  # !r adds quotes correctly, even around "'s"

## Advanced: Tokenize All Sentences

In practice, you'll want to tokenize all sentences. Here's a complete workflow:

### Workflow:
1. Split text into sentences
2. For each sentence, split into words
3. Store results in a structured format (e.g., list of lists)

In [None]:
# Complete two-level tokenization example
text = "Her cat's name is Luna. Her dog's name is Max"

# Step 1: Sentence tokenization
sentences = sent_tokenize(text)

# Step 2: Word tokenization for each sentence
all_word_tokens = []
for sent in sentences:
    words = word_tokenize(sent)  # Split sentence into words
    all_word_tokens.append(words)

# Display results
print("Complete tokenization:")
print("=" * 50)
for sent_id, (sentence, words) in enumerate(zip(sentences, all_word_tokens), 1):
    print(f"\nSentence {sent_id}: {sentence}")
    print(f"Tokens: {words}")
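
Once tokens are stored as a list of lists, counting frequencies and building a vocabulary takes a few lines. The sketch below hard-codes the nested list so that it runs without NLTK; it assumes `word_tokenize` splits both sentences the same way as the example at the top of this notebook.

```python
from collections import Counter

# Hand-written copy of the list of lists (an assumption, see above)
all_word_tokens = [
    ['Her', 'cat', "'s", 'name', 'is', 'Luna', '.'],
    ['Her', 'dog', "'s", 'name', 'is', 'Max'],
]

# Flatten the nested structure into one stream of tokens
flat_tokens = [token for sentence in all_word_tokens for token in sentence]

# Count how often each token occurs and collect the distinct token types
frequencies = Counter(flat_tokens)
vocabulary = sorted(frequencies)

print(frequencies.most_common(4))
print(len(flat_tokens), "tokens,", len(vocabulary), "unique")
```

Note that `'Her'` and a lowercase `'her'` would be counted separately here, which is one reason normalization usually follows tokenization.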

## Key Takeaways

1. **Tokenization is essential**: Almost every NLP task starts here
2. **Two-level approach**: Sentence → Words (not the other way around)
3. **Prefer an established tokenizer**: NLTK's tokenizers beat naive string splitting because they handle edge cases such as abbreviations, punctuation, and contractions
4. **Punctuation matters**: It gets separated from words, which is useful for downstream processing
5. **Understand your tokenizer's behavior**: Different systems split contractions/possessives differently

## Next steps:
- Try tokenizing text with contractions like "don't", "can't", "it's"
- Compare results from NLTK with simple `.split()` to see the difference
- Move on to **normalization** (lowercasing, removing punctuation) in the next notebook