# 1. What is Tokenization?

**Tokenization** means **breaking text into smaller pieces** called **tokens**.
Tokens can be:

* **Words**

  * “NLP is fun” → `["NLP", "is", "fun"]`
* **Sentences**

  * “NLP is fun. I love it.” → `["NLP is fun.", "I love it."]`
* **Subwords / pieces** (used by models such as BERT and GPT)

  * “playing” → `["play", "##ing"]` (WordPiece-style pieces, as used by BERT)
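
To make the subword case concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy BERT's WordPiece tokenizer applies to each word. The function name `wordpiece_split` and the tiny `toy_vocab` are invented for illustration; real vocabularies are learned from data and contain tens of thousands of pieces.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
toy_vocab = {"play", "##ing", "##ed", "walk", "##s"}

print(wordpiece_split("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_split("walks", toy_vocab))    # ['walk', '##s']
print(wordpiece_split("jumped", toy_vocab))   # ['[UNK]']
```

The `##` prefix marks a piece that continues a word rather than starting one, which is why `playing` comes out as `play` plus `##ing`.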

**Why do we tokenize?**
Because models operate on numbers, not raw text. Tokenization splits text into units that can be mapped to numeric IDs, which ML models can then process.
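
As a rough illustration of that idea, the sketch below maps a list of tokens to integer IDs with a plain dictionary. The helper `build_vocab` and the sample token list are made up for this example; real pipelines use much larger, pre-built vocabularies.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

tokens = ["NLP", "is", "fun", ".", "NLP", "is", "useful", "."]
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(vocab)  # {'NLP': 0, 'is': 1, 'fun': 2, '.': 3, 'useful': 4}
print(ids)    # [0, 1, 2, 3, 0, 1, 4, 3]
```

Repeated tokens get the same ID, so the model sees a sequence of numbers in place of the original text.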


# 2. Tokenization with NLTK  


### 🔽 Install NLTK

In [None]:
pip install nltk

### 🔽 Download tokenizer models

In [1]:
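import nltk
# Note: NLTK 3.9 and later look for 'punkt_tab' instead;
# download that resource if word_tokenize raises a LookupError.
nltk.download('punkt')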

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### ⭐ Word Tokenization

In [2]:
from nltk.tokenize import word_tokenize

text = "NLP is fun. I love learning it!"
tokens = word_tokenize(text)
print(tokens)

['NLP', 'is', 'fun', '.', 'I', 'love', 'learning', 'it', '!']


👉 Notice that punctuation marks also appear as separate tokens.
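
If the punctuation tokens are not wanted, one simple option is to filter them out afterwards. The sketch below uses only the standard library and starts from the token list printed above; it assumes that "punctuation" means the ASCII characters in `string.punctuation`.

```python
import string

tokens = ['NLP', 'is', 'fun', '.', 'I', 'love', 'learning', 'it', '!']

# Keep a token unless every character in it is ASCII punctuation.
words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]

print(words)  # ['NLP', 'is', 'fun', 'I', 'love', 'learning', 'it']
```

Whether to drop punctuation depends on the task: it is often removed for bag-of-words features but kept when sentence structure matters.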

### ⭐ Sentence Tokenization

In [3]:
from nltk.tokenize import sent_tokenize

text = "NLP is fun. I love learning it!"
sentences = sent_tokenize(text)
print(sentences)

['NLP is fun.', 'I love learning it!']
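
To see why a dedicated sentence tokenizer is useful, compare it with a naive rule that splits after every `.`, `!` or `?` followed by whitespace. The function `naive_sent_split` below is a hypothetical helper written for this comparison, not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_split("NLP is fun. I love learning it!"))
# ['NLP is fun.', 'I love learning it!']

print(naive_sent_split("Dr. Smith teaches NLP. I love it!"))
# ['Dr.', 'Smith teaches NLP.', 'I love it!']
```

The naive rule works on the first text but breaks after the abbreviation "Dr." in the second. The Punkt model behind `sent_tokenize` is designed to recognize abbreviations like this, which is the main reason to prefer it over a hand-written regex.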


# 3. Tokenization with spaCy

### 🔽 Install spaCy

In [None]:
pip install spacy

### 🔽 Download & Load English model

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy

# Load the English pipeline used for tokenization, POS tagging, NER, etc.
nlp = spacy.load("en_core_web_sm")

**`en_core_web_sm`**

It is one of the **English pipelines** that spaCy provides. The name breaks down as:

* **en** → English language
* **core** → general-purpose pipeline (tagging, parsing, NER)
* **web** → trained on web text
* **sm** → small model size

### ⭐ Word Tokenization

In [None]:
text = "NLP is fun. I love learning it!"

# Pass the text through spaCy’s pipeline to get a processed Doc object
doc = nlp(text)

# Extract the text of each token (word, punctuation, etc.) from the Doc
word_tokens = [token.text for token in doc]

print(word_tokens)

['NLP', 'is', 'fun', '.', 'I', 'love', 'learning', 'it', '!']


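👉 The tokens match NLTK's output for this sentence. spaCy is often faster on large volumes of text; which tokenizer is more accurate depends on the kind of text you process.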
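
For comparison, here is what plain Python produces on the same text without either library. This is only a sketch: the regex is an assumption chosen to reproduce the output above on this one sentence, and it does not implement the rules NLTK or spaCy apply to contractions, URLs, numbers, and similar cases.

```python
import re

text = "NLP is fun. I love learning it!"

# Whitespace splitting leaves punctuation attached to the neighbouring word.
print(text.split())
# ['NLP', 'is', 'fun.', 'I', 'love', 'learning', 'it!']

# A simple regex: runs of word characters, or any single non-space symbol.
print(re.findall(r"\w+|[^\w\s]", text))
# ['NLP', 'is', 'fun', '.', 'I', 'love', 'learning', 'it', '!']
```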

### ⭐ Sentence Tokenization

In [5]:
sentence_tokens = [sent.text for sent in doc.sents]
print(sentence_tokens)

['NLP is fun.', 'I love learning it!']


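spaCy sets sentence boundaries with the dependency parser included in `en_core_web_sm`, so `doc.sents` only works when the pipeline contains that component (or another one that marks sentence boundaries, such as the `sentencizer`).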

## Summary

| Feature           | NLTK   | spaCy                        |
| ----------------- | ------ | ---------------------------- |
| Speed             | Slow   | Very fast                    |
| Beginner friendly | ✔✔✔    | ✔✔                           |
| Word tokens       | ✔      | ✔                            |
| Sentence tokens   | ✔      | ✔                            |
| Industry use      | Medium | High                         |
| Extra features    | Basic  | NER, POS, dependency parsing |
