# Introduction to Tokenization

![Scissors cutting a sentence into individual word pieces](images/text_being_cut_into_pieces.png)

*The art of breaking text into meaningful chunks...*

## What is Tokenization?

**Breaking text into individual pieces called "tokens"**

- 🔪 Split text into smaller units, most simply words
- 📝 Each unit (a word, a punctuation mark, or a piece of a word) becomes a "token"
- 🔧 Foundation for most NLP tasks
- 🎯 Makes text analysis possible

*Think of it as creating a word puzzle - first, separate all the pieces!*
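
As a small illustration of why tokens make analysis possible, here is a minimal sketch that counts word frequencies. The sample sentence is made up for this example, and the whitespace split is the simplest tokenizer available, not a recommended one.

```python
from collections import Counter

# Made-up sample text (lowercase, no punctuation, so a plain split is enough)
text = "the cat sat on the mat and the dog sat too"

# Tokenize on whitespace, then count how often each token appears
tokens = text.split()
counts = Counter(tokens)

print(counts.most_common(2))
# Output: [('the', 3), ('sat', 2)]
```

Without the split, the text is one long string and there is nothing to count; with it, the counting is a single call.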

## Tokenization in Action

> **Input:** "I love Python programming!"
>
> **After Tokenization:**
> `["I", "love", "Python", "programming", "!"]`

> **Another Example:** "It's a beautiful day, isn't it?"
>
> **Smart Tokenization (contractions split into their parts):**
> `["It", "'s", "a", "beautiful", "day", ",", "is", "n't", "it", "?"]`
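
Splits like these can be approximated with a single regular expression. The sketch below is an assumption-laden illustration, not how any particular library works: the names `TOKEN_PATTERN` and `tokenize` are hypothetical, and the pattern follows the Penn Treebank convention, which splits "isn't" into "is" and "n't". Other tokenizers draw that boundary differently.

```python
import re

# Hypothetical pattern: alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(
    r"""
    \w+(?=n't)    # word stem directly before n't (the 'is' in isn't)
    | n't         # the negation part of a contraction
    | '\w+        # other contraction parts such as 's, 're, 'll
    | \w+         # ordinary words
    | [^\w\s]     # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text (whitespace is discarded)."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("I love Python programming!"))
# Output: ['I', 'love', 'Python', 'programming', '!']

print(tokenize("It's a beautiful day, isn't it?"))
# Output: ['It', "'s", 'a', 'beautiful', 'day', ',', 'is', "n't", 'it', '?']
```

This only handles straight apostrophes and simple English contractions; real tokenizers cover many more cases.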

## Simple Python Tokenization

```python
import re

# Basic tokenization with Python
text = "Machine learning is fascinating!"

# Simple split (basic tokenization): splits on whitespace only,
# so punctuation stays attached to the neighboring word
tokens_basic = text.split()
print("Basic:", tokens_basic)
# Output: Basic: ['Machine', 'learning', 'is', 'fascinating!']

# Better tokenization: keeps runs of word characters,
# so punctuation is dropped rather than kept as a token
tokens_better = re.findall(r'\w+', text)
print("Better:", tokens_better)
# Output: Better: ['Machine', 'learning', 'is', 'fascinating']
```

[🚀 Try in Colab](https://colab.research.google.com/github/Roopesht/codeexamples/blob/main/genai/python_easy/1/concept_3.ipynb)

## Tokenization Made Simple

**Imagine tokenization like organizing your closet:**

<div style="font-size: 1.1em;">
  <p>👕 <strong>Step 1:</strong> Take everything out (your text)</p>
  <p>📂 <strong>Step 2:</strong> Sort into categories (words, punctuation)</p>
  <p>🏷️ <strong>Step 3:</strong> Label each item (create tokens)</p>
  <p>✨ <strong>Result:</strong> Everything organized and findable!</p>
</div>
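
The three steps map onto code fairly directly. The sketch below is only an illustration of the analogy: the function name `label_tokens` and the two category names are hypothetical choices made for this example.

```python
import re


def label_tokens(text):
    """Split text into tokens and label each one as a word or punctuation."""
    # Steps 1 and 2: pull out runs of word characters and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)

    # Step 3: attach a label to every token
    labeled = []
    for token in tokens:
        category = "word" if re.fullmatch(r"\w+", token) else "punctuation"
        labeled.append((token, category))
    return labeled


print(label_tokens("Hello, world!"))
# Output: [('Hello', 'word'), (',', 'punctuation'), ('world', 'word'), ('!', 'punctuation')]
```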

## Tokenization from a Different Angle

**Think of tokenization as preparing ingredients for cooking:**

- 🥗 **Raw text:** A whole salad (mixed ingredients)
- 🔪 **Tokenizer:** Your knife (cutting tool)
- 🥕 **Tokens:** Chopped vegetables (individual pieces)
- 🍽️ **Ready to cook:** Prepared for NLP recipes!

*I hope this cooking prep analogy makes tokenization clear now!*

## Quick Reflection

**Tokenization is like creating building blocks from text...**

> 💭 How do you think tokenization might handle tricky cases like "don't", "U.S.A.", or "😊"?
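
One way to explore the question is to run the three tricky strings through simple approaches and compare the results. This is a sketch for experimentation; the dictionary name `results` and the three method labels are made up for this example.

```python
import re

tricky = ["don't", "U.S.A.", "😊"]

results = {}
for text in tricky:
    results[text] = {
        "split": text.split(),                              # whitespace only
        "words": re.findall(r"\w+", text),                  # word characters only
        "words_and_punct": re.findall(r"\w+|[^\w\s]", text),  # words plus single symbols
    }
    print(repr(text), results[text])

# "don't"  -> split: ["don't"]  words: ['don', 't']       words_and_punct: ['don', "'", 't']
# "U.S.A." -> split: ['U.S.A.'] words: ['U', 'S', 'A']    words_and_punct: ['U', '.', 'S', '.', 'A', '.']
# "😊"     -> split: ['😊']     words: []                 words_and_punct: ['😊']
```

None of the three gets every case right: the contraction is broken apart, the abbreviation loses its shape, and the emoji disappears under the words-only pattern. That is why practical tokenizers add special rules for such cases.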