# Tokenization and Stopwords

In NLP, **tokenization** means breaking a text into smaller units — called **tokens** — such as words, subwords, or sentences.

**Stopwords** are commonly used words (like *and, is, the, in*) that are often removed to focus on more meaningful content.
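
As a rough, library-free illustration of both ideas, the sketch below tokenizes a sentence with a regular expression and then filters a tiny stopword set. The regex, the helper names, and the four-word stopword list are assumptions made for this example; NLTK and spaCy use more elaborate rules and much longer lists.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# real libraries ship lists with a hundred or more entries.
TOY_STOPWORDS = {"and", "is", "the", "in"}

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Drop tokens whose lowercase form appears in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

sample = "The cat is in the garden, and the dog is asleep."
sample_tokens = simple_tokenize(sample)
content_tokens = remove_stopwords(sample_tokens)

print(sample_tokens)
print(content_tokens)  # ['cat', 'garden', ',', 'dog', 'asleep', '.']
```

Punctuation survives this filter because it is not in the stopword set; the library examples later in this notebook drop it with an explicit alphabetic check.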

---
## 🎯 Objectives
- Understand word and sentence tokenization
- Compare tokenizers: NLTK, spaCy, and Hugging Face
- Learn stopword removal techniques

---
## 🧩 1️⃣ Word Tokenization using NLTK

In [None]:
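import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Tokenizer data: older NLTK releases use 'punkt'; recent releases (3.9 and later) look for 'punkt_tab'
nltk.download('punkt')
nltk.download('punkt_tab')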

text = "Natural Language Processing (NLP) helps machines understand human language. It’s amazing!"

tokens = word_tokenize(text)
print(tokens)

### 📌 Sentence Tokenization

In [None]:
sentences = sent_tokenize(text)
print(sentences)

---
## 🧠 2️⃣ Tokenization using spaCy

In [None]:
import spacy

# Load the small English pipeline
# (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

tokens_spacy = [token.text for token in doc]
print(tokens_spacy)

### 🔤 spaCy also provides detailed token-level information

In [None]:
for token in doc:
    print(f"{token.text:<15} POS: {token.pos_:<10} Lemma: {token.lemma_}")

---
## 🤗 3️⃣ Tokenization using Hugging Face Tokenizers

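Many modern NLP models, such as BERT and GPT, use **subword tokenization**: a vocabulary of word pieces is learned from a training corpus, so frequent words stay whole while rare words are split into smaller known pieces. BERT uses the WordPiece scheme, which marks continuation pieces with a `##` prefix.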
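
To make the splitting step concrete, here is a simplified sketch of greedy longest-match-first lookup, the strategy WordPiece-style tokenizers apply to a single word once a vocabulary exists. The function name and the toy vocabulary are hypothetical; a real vocabulary is learned from data and holds tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

The sketch handles one already-separated word at a time; a full tokenizer first normalizes the text and splits it on whitespace and punctuation.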

Let’s use a BERT tokenizer to see how it handles complex words.

In [None]:
from transformers import AutoTokenizer

# Downloads the tokenizer files on first use; the 'uncased' model lowercases text before splitting
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens_bert = tokenizer.tokenize(text)
print(tokens_bert)

### 📦 Converting Tokens to IDs

In [None]:
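# Map each token to its integer index in the model vocabulary.
# This call does not add special tokens such as [CLS] and [SEP].
token_ids = tokenizer.convert_tokens_to_ids(tokens_bert)
print(token_ids)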
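
Conceptually, converting tokens to IDs is a dictionary lookup with a fallback for unknown tokens. The sketch below shows that idea with a made-up five-entry vocabulary; the ID values and helper names are hypothetical and do not correspond to BERT's actual vocabulary.

```python
# Hypothetical toy vocabulary: token -> integer ID.
toy_vocab_ids = {"[UNK]": 0, "token": 1, "##ization": 2, "is": 3, "fun": 4}

# Inverse mapping, used to turn IDs back into tokens.
toy_id_to_token = {idx: tok for tok, idx in toy_vocab_ids.items()}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

def ids_to_tokens(ids, inverse_vocab):
    """Map integer IDs back to their token strings."""
    return [inverse_vocab[i] for i in ids]

toy_ids = tokens_to_ids(["token", "##ization", "is", "great"], toy_vocab_ids)
print(toy_ids)                                  # [1, 2, 3, 0]
print(ids_to_tokens(toy_ids, toy_id_to_token))  # ['token', '##ization', 'is', '[UNK]']
```

Note that the round trip is lossy for out-of-vocabulary tokens: `great` comes back as `[UNK]`.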

---
## 🧹 4️⃣ Stopword Removal using NLTK

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
# Keep alphabetic tokens whose lowercase form is not a stopword (this also drops punctuation)
filtered = [word for word in tokens if word.isalpha() and word.lower() not in stop_words]
print(filtered)

---
## 🧰 5️⃣ Stopword Removal using spaCy

In [None]:
filtered_spacy = [token.text for token in doc if not token.is_stop and token.is_alpha]
print(filtered_spacy)
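
Stopword lists are rarely one-size-fits-all. The sketch below shows one common adjustment, assuming a task where negation matters: removing negation words from the stopword set before filtering. The short base list here is a hypothetical stand-in for a library list; in practice the same set subtraction could be applied to the `stop_words` set built earlier.

```python
# Hypothetical stand-in for a library stopword list.
BASE_STOPWORDS = {"the", "is", "a", "not", "no", "this", "was"}

# Words we want to keep because they flip the meaning of a sentence.
NEGATIONS = {"not", "no"}
custom_stopwords = BASE_STOPWORDS - NEGATIONS

review_tokens = ["This", "movie", "was", "not", "good"]

# Filtering with the full list discards the negation.
naive = [t for t in review_tokens if t.lower() not in BASE_STOPWORDS]
# Filtering with the customized list keeps it.
careful = [t for t in review_tokens if t.lower() not in custom_stopwords]

print(naive)    # ['movie', 'good']
print(careful)  # ['movie', 'not', 'good']
```

With the unmodified list, a negative review reduces to the same tokens as a positive one, which is why the choice of stopword set depends on the downstream task.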

---
## 🧾 Summary
- **Tokenization** splits text into tokens (words, subwords, or sentences).
- **Stopwords** are very common words that are often removed so that analysis focuses on content-bearing words.
- Libraries used: **NLTK, spaCy, Hugging Face Transformers.**

---
Next: move to `03-Stemming_and_Lemmatization.ipynb` to explore how words are reduced to their root forms.