# 📝 NLP Fundamentals & HuggingFace

**Objective:** Essential NLP techniques and the HuggingFace ecosystem

**Contents:**
- Text preprocessing (NLTK, spaCy)
- Tokenization methods
- Text vectorization (TF-IDF, embeddings)
- HuggingFace Transformers basics
- Fine-tuning patterns
- Common NLP tasks

**Level:** Intermediate

---

In [None]:
# Installation (uncomment if needed)
# !pip install transformers datasets tokenizers nltk spacy scikit-learn
# !python -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import torch
import transformers

print(f"✅ Transformers: {transformers.__version__}")
print(f"✅ PyTorch: {torch.__version__}")

---

## 1. Text Preprocessing

### Basic Cleaning

In [None]:
import re
import string

def clean_text(text):
    """
    Basic text cleaning
    """
    # Lowercase
    text = text.lower()
    
    # Remove URLs (with or without a scheme)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Example
sample = "Check out https://example.com! @user #NLP is AMAZING!!! 😊"
cleaned = clean_text(sample)
print(f"Original: {sample}")
print(f"Cleaned:  {cleaned}")

### NLTK Preprocessing

In [None]:
import nltk
# nltk.download('punkt')
# nltk.download('punkt_tab')  # newer NLTK releases load this instead of 'punkt'
# nltk.download('stopwords')
# nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats are running faster than the dogs were running"

# Tokenization
tokens = word_tokenize(text.lower())
print(f"Tokens: {tokens}")

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w not in stop_words]
print(f"No stopwords: {filtered}")

# Stemming (chop word endings)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
print(f"Stemmed: {stemmed}")

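# Lemmatization (find the dictionary form)
# Note: pos='v' treats every token as a verb; a fuller pipeline would pass each token's own POS tag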
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w, pos='v') for w in filtered]
print(f"Lemmatized: {lemmatized}")

print("\n💡 Stemming vs Lemmatization:")
print("   Stemming: fast, rule-based, crude (running → run)")
print("   Lemmatization: slower, dictionary-based (better → good with pos='a')")
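
The contrast can also be seen without downloading any NLTK data. The sketch below is a toy: `toy_stem` is a crude suffix stripper (not the Porter algorithm) and `toy_lemmatize` is a lookup in a hand-written table (not WordNet). Both names, the suffix list, and the table are assumptions made for illustration only.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical suffix list
LEMMA_TABLE = {                      # hypothetical dictionary of known forms
    "running": "run",
    "studies": "study",
    "better": "good",
    "mice": "mouse",
}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["running", "studies", "better", "mice"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

Rule-based stripping can produce non-words and cannot relate irregular forms, while a dictionary-based approach handles them at the cost of needing the dictionary (and, in practice, a part-of-speech tag).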

---

## 2. Text Vectorization

### TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = [
    "Machine learning is awesome",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning"
]

# TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=10)
tfidf_matrix = vectorizer.fit_transform(docs)

print(f"Vocabulary: {vectorizer.get_feature_names_out()}")
print(f"\nTF-IDF Matrix shape: {tfidf_matrix.shape}")
print(f"Matrix (dense):\n{tfidf_matrix.toarray()}")

# Get top words for each document
feature_names = vectorizer.get_feature_names_out()
for i, doc in enumerate(docs):
    scores = list(zip(feature_names, tfidf_matrix[i].toarray()[0]))
    top_words = sorted(scores, key=lambda x: x[1], reverse=True)[:3]
    print(f"\nDoc {i} top words: {top_words}")
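
To make the numbers in the matrix less opaque, here is a minimal pure-Python sketch of the weighting. It assumes scikit-learn's default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_by_hand` and the whitespace tokenizer are illustrative assumptions; `TfidfVectorizer` uses a regex tokenizer that drops one-character tokens such as "a", so its vocabulary differs slightly.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """Return (vocab, rows) where rows[i][j] is the weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]  # simplistic tokenizer (assumption)
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = [
    "Machine learning is awesome",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning",
]
vocab, rows = tfidf_by_hand(docs)
doc0 = sorted(zip(vocab, rows[0]), key=lambda pair: pair[1], reverse=True)
print([(tok, round(weight, 3)) for tok, weight in doc0 if weight > 0])
```

Terms that occur in every document (here "machine" and "learning") get the minimum IDF of 1, so rarer terms such as "awesome" dominate their document's row.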

---

## 3. HuggingFace Transformers

### Basic Pipeline API

In [None]:
from transformers import pipeline

# Sentiment Analysis
print("🔍 Sentiment Analysis:")
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product! It's amazing!")
print(f"  Result: {result}")

# Text Generation
print("\n✍️  Text Generation:")
generator = pipeline("text-generation", model="gpt2")
result = generator("Artificial intelligence will", max_length=30, num_return_sequences=1)
print(f"  Generated: {result[0]['generated_text']}")

# Named Entity Recognition
print("\n🏷️  Named Entity Recognition:")
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
result = ner("Elon Musk founded SpaceX in California.")
for entity in result:
    print(f"  {entity['word']}: {entity['entity_group']} (score: {entity['score']:.2f})")

print("\n✅ Pipeline API = quick inference in a few lines of code!")
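
Roughly speaking, a classification pipeline does three things: tokenize the text, run the model, and turn the raw logits into a label and a score. The sketch below illustrates only the last step, assuming a single-label classifier where softmax is applied to the logits. The logits, the `id2label` mapping, and the helper name `postprocess` are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # illustrative mapping
logits = [-2.1, 3.4]                       # made-up model output
prediction = postprocess(logits, id2label)
print(prediction)
```

The returned dictionary mirrors the `label`/`score` shape printed by the sentiment example above.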

### Manual Model Loading

In [None]:
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

print(f"Input IDs: {inputs['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

print(f"\nOutput shape: {outputs.last_hidden_state.shape}")
print(f"  (batch_size, sequence_length, hidden_size)")

# Extract the [CLS] token embedding (a common, if rough, sentence-level representation)
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"\n[CLS] embedding shape: {cls_embedding.shape}")
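
A frequently used alternative to the [CLS] vector is to average the token vectors, skipping padding positions via the attention mask. The sketch below shows the idea on plain Python lists so it runs without a model. The tiny 2-dimensional "hidden states" and the function name `mean_pool` are assumptions for illustration.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention_mask entry is 1."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# One made-up sequence: 3 real tokens followed by 1 padding position
hidden_states = [
    [1.0, 0.0],  # [CLS]
    [3.0, 2.0],  # a word piece
    [2.0, 4.0],  # [SEP]
    [9.0, 9.0],  # [PAD], must not influence the result
]
attention_mask = [1, 1, 1, 0]

cls_vector = hidden_states[0]
pooled = mean_pool(hidden_states, attention_mask)
print(f"[CLS] vector: {cls_vector}")
print(f"Mean-pooled:  {pooled}")
```

With PyTorch tensors the same idea is usually written by multiplying `last_hidden_state` by the unsqueezed `attention_mask`, summing over the sequence dimension, and dividing by the mask sum.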

---

## 4. Tokenization Deep Dive

### WordPiece vs BPE vs Unigram

In [None]:
# Different tokenizers
models = [
    ("bert-base-uncased", "WordPiece"),
    ("gpt2", "BPE"),
    ("albert-base-v2", "Unigram")
]

text = "unhappiness"

print(f"Text: '{text}'\n")
for model_name, method in models:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer.tokenize(text)
    print(f"{method} ({model_name}):")
    print(f"  Tokens: {tokens}")
    print(f"  IDs: {tokenizer.convert_tokens_to_ids(tokens)}\n")

print("💡 Observations:")
print("   - Different methods split words differently")
print("   - Subword tokenization handles OOV words")
print("   - ## prefix (WordPiece) = continuation token")
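
The `##` continuation prefix comes from how WordPiece splits a word at tokenization time: it repeatedly takes the longest vocabulary entry that matches the remaining characters. Below is a minimal sketch of that greedy longest-match-first lookup. The vocabulary `toy_vocab` is hypothetical, so the real `bert-base-uncased` tokenizer may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "happy", "##happi", "##happy", "##ness", "##s"}  # hypothetical
print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because any word can fall back to smaller pieces, only words with characters missing from the vocabulary end up as the unknown token.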

### Special Tokens

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("Special tokens:")
print(f"  PAD: {tokenizer.pad_token} ({tokenizer.pad_token_id})")
print(f"  CLS: {tokenizer.cls_token} ({tokenizer.cls_token_id})")
print(f"  SEP: {tokenizer.sep_token} ({tokenizer.sep_token_id})")
print(f"  UNK: {tokenizer.unk_token} ({tokenizer.unk_token_id})")
print(f"  MASK: {tokenizer.mask_token} ({tokenizer.mask_token_id})")

# Example with special tokens
text = "Hello world"
encoded = tokenizer(text, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])

print(f"\nText: '{text}'")
print(f"Tokens with special: {tokens}")
print("  [CLS] = Start of sequence")
print("  [SEP] = End of sequence")
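
The role of [SEP] and [PAD] is easier to see with two sentences. The sketch below lays out a BERT-style sentence pair by hand, assuming the usual `[CLS] A [SEP] B [SEP]` layout with segment ids 0 and 1 and zero-padded masks. The helper `build_pair_input` is hypothetical and works on already-split tokens rather than real vocabulary ids.

```python
def build_pair_input(tokens_a, tokens_b, max_length, pad_token="[PAD]"):
    """Lay out a sentence pair the way BERT-style tokenizers do."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("sequence longer than max_length; truncate first")
    # Segment 0 covers [CLS] + A + first [SEP]; segment 1 covers B + final [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    padding = max_length - len(tokens)
    return {
        "tokens": tokens + [pad_token] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = ignore this position
    }

encoded = build_pair_input(["hello", "world"], ["how", "are", "you"], max_length=10)
for key, value in encoded.items():
    print(f"{key:15} {value}")
```

Calling the real tokenizer with two texts, as in `tokenizer(text_a, text_b)`, returns the corresponding `input_ids`, `token_type_ids`, and `attention_mask` fields.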

---

## 5. Fine-tuning Example

### Text Classification

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset

# Sample data
data = {
    'text': [
        "I love this movie!",
        "Terrible experience, never again.",
        "Absolutely fantastic product!",
        "Worst purchase ever."
    ],
    'label': [1, 0, 1, 0]  # 1=positive, 0=negative
}

dataset = Dataset.from_dict(data)

# Tokenize
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_strategy="epoch"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

print("✅ Setup complete! Ready to train:")
print("   trainer.train()")
print("\n💡 Trainer API handles:")
print("   - Training loop")
print("   - Evaluation (when an eval_dataset is passed)")
print("   - Checkpointing")
print("   - Logging")
print("   - Mixed precision (when fp16/bf16 is enabled)")
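
The setup above passes neither an `eval_dataset` nor a `compute_metrics` function, so no evaluation metrics would be reported. A minimal sketch of a metric function follows, assuming the common pattern in which `Trainer` hands it a `(logits, labels)` pair. The logits and labels used to exercise it are made up.

```python
def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair, as Trainer passes at evaluation time."""
    logits, labels = eval_pred
    # Argmax over the class dimension, written without NumPy so it runs anywhere
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Made-up logits for four examples and two classes
fake_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [0.9, 0.4]]
fake_labels = [1, 0, 0, 0]
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

To use it, one would pass `compute_metrics=compute_metrics` together with a held-out `eval_dataset` when constructing the `Trainer`. A four-example dataset like the one above is far too small for either training or evaluation to be meaningful.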

---

## 6. Common NLP Tasks

### Task Overview

In [None]:
nlp_tasks = """
🎯 Common NLP Tasks:

1. Text Classification
   - Sentiment analysis
   - Spam detection
   - Topic classification
   Model: BERT, DistilBERT, RoBERTa

2. Named Entity Recognition (NER)
   - Extract person, location, organization
   Model: BERT, RoBERTa

3. Question Answering
   - Extract answer from context
   Model: BERT, ALBERT, RoBERTa

4. Text Generation
   - Story generation, completion
   Model: GPT-2, GPT-3, T5

5. Summarization
   - Abstractive or extractive
   Model: BART, T5, Pegasus

6. Translation
   - Machine translation
   Model: MarianMT, T5, mBART

7. Embedding/Similarity
   - Semantic search, clustering
   Model: Sentence-BERT, SimCSE
"""

print(nlp_tasks)
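
For the embedding/similarity task in the list above, the core operation is usually cosine similarity between vectors. The sketch below ranks a toy corpus against a query. The 3-dimensional vectors are invented stand-ins for real sentence embeddings, and `rank_by_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented vectors standing in for sentence embeddings
corpus = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
ranking = rank_by_similarity(query, corpus)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Semantic search follows the same pattern at larger scale, with embeddings produced by a sentence-embedding model and an index replacing the linear scan.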

### Model Selection Guide

In [None]:
model_guide = """
📊 Model Selection:

| Task | Fast & Light | Balanced | High Accuracy |
|------|--------------|----------|---------------|
| Classification | DistilBERT | BERT-base | RoBERTa-large |
| NER | DistilBERT | BERT-base | RoBERTa-large |
| QA | DistilBERT | BERT-base | ALBERT-xxlarge |
| Generation | DistilGPT2 | GPT-2 | GPT-3 (API) |
| Summarization | DistilBART | BART | PEGASUS-large |
| Embeddings | MiniLM | SBERT | MPNet |

💡 Trade-offs:
   Speed: DistilBERT (about 60% faster, 40% smaller) vs BERT-base
   Size: TinyBERT (about 7.5x smaller) vs BERT-base (~440MB)
   Accuracy: large models are often a few points better than base, depending on the task
"""

print(model_guide)

---

## 7. Best Practices

### Data Preprocessing

In [None]:
best_practices = """
✅ Text Preprocessing Best Practices:

1. For Traditional ML (TF-IDF, CountVectorizer):
   ✓ Lowercase
   ✓ Remove punctuation
   ✓ Remove stopwords
   ✓ Stemming/Lemmatization

2. For Transformers (BERT, GPT):
   ✓ Keep original text (case-sensitive models exist)
   ✓ Keep punctuation
   ✗ DON'T remove stopwords
   ✗ DON'T stem/lemmatize
   → Tokenizer handles it!

3. Data Augmentation:
   - Back-translation
   - Synonym replacement
   - Random insertion/deletion
   - Paraphrasing

4. Handling Imbalanced Data:
   - Oversampling minority class
   - Class weights in loss
   - Focal loss
"""

print(best_practices)
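
For the "class weights in loss" item above, a common heuristic is to weight each class by `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for `class_weight="balanced"`. A minimal sketch, with a made-up 90/10 label distribution and a hypothetical helper name:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

labels = [0] * 90 + [1] * 10  # made-up imbalanced dataset
weights = balanced_class_weights(labels)
print(weights)
```

The resulting values can be supplied, in class-index order, as the per-class weight tensor of a weighted cross-entropy loss, so that mistakes on the minority class cost more.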

---

## 🎯 Key Takeaways

### NLP Pipeline

```text
# Classical ML
text → clean → tokenize → vectorize (TF-IDF) → ML model

# Transformers
text → tokenizer → model → output
```

### HuggingFace Workflow

```python
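from transformers import pipeline, AutoTokenizer, AutoModel, Trainer

# 1. Quick inference (avoid naming the result `pipeline`, which would shadow the function)
classifier = pipeline("sentiment-analysis")
result = classifier(text)

# 2. Custom model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# 3. Fine-tuning
# clf_model: a task model such as AutoModelForSequenceClassification, as in section 5
trainer = Trainer(model=clf_model, args=training_args, train_dataset=tokenized_dataset)
trainer.train()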
```

### Essential Concepts

1. **Tokenization**: Text → IDs
   - WordPiece (BERT)
   - BPE (GPT-2)
   - Handles OOV with subwords

2. **Special Tokens**:
   - [CLS]: Sentence representation
   - [SEP]: Separator
   - [PAD]: Padding
   - [MASK]: Masked token

3. **Transfer Learning**:
   - Pretrained models learn language
   - Fine-tune on task-specific data
   - Much better than training from scratch
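
The BPE entry in the list above refers to a vocabulary learned by repeatedly merging the most frequent adjacent pair of symbols. The sketch below runs three merge steps on a toy word-frequency table. It is character-level for readability (GPT-2 itself uses a byte-level variant), and the corpus counts and helper names are assumptions.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word split into characters, with a made-up frequency
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print("Learned merges:", merges)
print("Segmented corpus:", list(corpus))
```

Frequent character sequences end up as single symbols, while rare words stay split into smaller pieces, which is how subword vocabularies cover out-of-vocabulary words.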

### Quick Reference

| Need | Use |
|------|-----|
| Quick inference | `pipeline()` |
| Custom model | `AutoModel` |
| Fine-tuning | `Trainer` API |
| Feature extraction | Model embeddings |
| Text generation | GPT-2 (local), GPT-3 (API) |
| Classification | BERT, RoBERTa |

---

**🎉 Series Complete!**

You now have the full set of fundamentals for ML/DL:
- ✅ Libraries (NumPy, Pandas, Matplotlib, Scikit-learn, OpenCV)
- ✅ Deep Learning (PyTorch Advanced, Transformers)
- ✅ Computer Vision (timm)
- ✅ NLP (HuggingFace)
- ✅ Deployment (Docker)