# 📗 Chapter 4: Embeddings - Representation Learning

**Welcome to the interactive notebook on embeddings!**

## 🎯 Learning objectives

By the end of this chapter, you will understand:
- 🔢 **The difference** between one-hot encoding and embeddings
- 🧠 **How embeddings learn** semantic representations
- ⚖️ **The trade-offs** of embedding size
- 🎨 **How to visualize and analyze** an embedding space
- 💻 **How to implement** and train embeddings from scratch

---

## 📚 Contents

1. [One-Hot vs Embedding](#section1)
2. [Word/Token Embedding Implementation](#section2)
3. [Embedding Properties](#section3)
4. [Practical Experiments](#section4)

In [None]:
# ============================================
# SETUP: Import the required libraries
# ============================================

# If the packages are not installed yet, uncomment the line below:
# !pip install torch numpy matplotlib seaborn scikit-learn plotly pandas

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import plotly.express as px
from typing import List, Dict, Tuple
import pandas as pd
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Display configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print("="*60)
print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ Device: {device}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
print("="*60)

---

<a id='section1'></a>
## 4.1 One-Hot Encoding vs Embeddings

### 🤔 The problem with one-hot encoding

**One-hot encoding** is the simplest representation:
- Each word/token is a vector with **exactly one element equal to 1** and all others equal to 0
- With a vocabulary of 10,000 words, each word needs a **10,000-dimensional** vector!

#### ❌ Drawbacks of one-hot encoding:
1. **Sparse**: 99.99% of the entries are zeros (9,999 out of 10,000)
2. **No semantic information**: all distinct vectors are orthogonal, so "king" and "emperor" are as far apart as "king" and "banana"
3. **Huge memory footprint**: 10K words × 10K dims = 100 million values when stored densely!
4. **No generalization**: nothing learned about one word transfers to related words

#### ✅ Advantages of embeddings:
1. **Dense**: every dimension carries learned information (though a single dimension is rarely interpretable on its own)
2. **Learned representation**: semantics are learned from data during training
3. **Compact**: 10K words × 300 dims = only 3 million parameters
4. **Similarity**: words with similar meanings end up with nearby embeddings
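
To make the "no semantic information" point concrete, here is a minimal sketch in plain Python (no PyTorch needed). The four-word vocabulary is hypothetical and exists only for illustration. It shows that every pair of distinct one-hot vectors has cosine similarity 0, and it reproduces the storage arithmetic from the lists above.

```python
import math

def one_hot(index, size):
    """Return a vector of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-word vocabulary, for illustration only
vocab = {"king": 0, "emperor": 1, "banana": 2, "apple": 3}
vectors = {word: one_hot(idx, len(vocab)) for word, idx in vocab.items()}

# Related and unrelated pairs look exactly the same to one-hot encoding
sim_king_emperor = cosine(vectors["king"], vectors["emperor"])
sim_king_banana = cosine(vectors["king"], vectors["banana"])
print(sim_king_emperor, sim_king_banana)  # 0.0 0.0

# Storage arithmetic for a 10K vocabulary
vocab_size, embedding_dim = 10_000, 300
onehot_values = vocab_size * vocab_size        # dense one-hot matrix
embedding_params = vocab_size * embedding_dim  # embedding table
print(onehot_values, embedding_params)  # 100000000 3000000
```

In practice nobody stores the dense one-hot matrix; the comparison is only meant to show how the two representations scale.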

In [None]:
# ============================================
# Demo: Comparing one-hot vs embedding
# ============================================

class OneHotVsEmbedding:
    """Demo class showing the difference between one-hot and embedding"""
    
    def __init__(self, vocab_size: int, embedding_dim: int):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        
    def create_onehot(self, word_idx: int) -> torch.Tensor:
        """Create the one-hot vector for a single word"""
        onehot = torch.zeros(self.vocab_size)
        onehot[word_idx] = 1
        return onehot
    
    def visualize_comparison(self):
        """Visualize the difference in memory and sparsity"""
        
        # Compute memory usage (4 bytes per float32 value)
        onehot_memory = self.vocab_size * self.vocab_size * 4 / (1024**2)  # MB
        embed_memory = self.vocab_size * self.embedding_dim * 4 / (1024**2)  # MB
        
        # Create the visualization
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # 1. Memory Comparison
        ax1 = axes[0, 0]
        methods = ['One-Hot', 'Embedding']
        memory = [onehot_memory, embed_memory]
        colors = ['#ff6b6b', '#4ecdc4']
        bars = ax1.bar(methods, memory, color=colors, alpha=0.7, edgecolor='black')
        ax1.set_ylabel('Memory (MB)', fontsize=12, fontweight='bold')
        ax1.set_title('📊 Memory Usage Comparison', fontsize=14, fontweight='bold')
        ax1.grid(axis='y', alpha=0.3)
        
        # Add value labels on top of the bars
        for bar, mem in zip(bars, memory):
            height = bar.get_height()
            ax1.text(bar.get_x() + bar.get_width()/2., height,
                    f'{mem:.1f} MB',
                    ha='center', va='bottom', fontweight='bold')
        
        # 2. Sparsity Visualization (One-Hot)
        ax2 = axes[0, 1]
        sample_size = min(20, self.vocab_size)
        onehot_sample = torch.eye(sample_size)[:5]  # first 5 words
        im2 = ax2.imshow(onehot_sample, cmap='RdYlGn', aspect='auto')
        ax2.set_xlabel('Vocabulary Index', fontsize=11)
        ax2.set_ylabel('Word Index', fontsize=11)
        ax2.set_title('❌ One-Hot: Sparse (mostly zeros)', fontsize=13, fontweight='bold')
        plt.colorbar(im2, ax=ax2)
        
        # 3. Embedding Visualization (Dense)
        ax3 = axes[1, 0]
        embedding_sample = torch.randn(5, min(20, self.embedding_dim))
        im3 = ax3.imshow(embedding_sample, cmap='viridis', aspect='auto')
        ax3.set_xlabel('Embedding Dimension', fontsize=11)
        ax3.set_ylabel('Word Index', fontsize=11)
        ax3.set_title('✅ Embedding: Dense (every value carries information)', 
                     fontsize=13, fontweight='bold')
        plt.colorbar(im3, ax=ax3)
        
        # 4. Parameter Count Comparison
        ax4 = axes[1, 1]
        vocab_sizes = [100, 1000, 5000, 10000, 50000]
        onehot_params = [v**2 for v in vocab_sizes]
        embed_params = [v * self.embedding_dim for v in vocab_sizes]
        
        ax4.plot(vocab_sizes, onehot_params, 'o-', label='One-Hot', 
                linewidth=2, markersize=8, color='#ff6b6b')
        ax4.plot(vocab_sizes, embed_params, 's-', label='Embedding', 
                linewidth=2, markersize=8, color='#4ecdc4')
        ax4.set_xlabel('Vocabulary Size', fontsize=12, fontweight='bold')
        ax4.set_ylabel('# Parameters (log scale)', fontsize=12, fontweight='bold')
        ax4.set_title('📈 Scalability: Parameters vs Vocab Size', 
                     fontsize=14, fontweight='bold')
        ax4.set_yscale('log')
        ax4.legend(fontsize=11, loc='upper left')
        ax4.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Print statistics
        print("\n" + "="*70)
        print("📊 DETAILED COMPARISON: ONE-HOT vs EMBEDDING")
        print("="*70)
        print(f"\n🔹 Vocabulary Size: {self.vocab_size:,} words")
        print(f"🔹 Embedding Dimension: {self.embedding_dim}")
        print("\n" + "-"*70)
        print("ONE-HOT ENCODING:")
        print("-"*70)
        print(f"  • Vector dimension: {self.vocab_size:,}")
        print(f"  • Sparsity: {((self.vocab_size-1)/self.vocab_size*100):.2f}% zeros")
        print(f"  • Memory: {onehot_memory:.2f} MB")
        print(f"  • Stored values: {self.vocab_size**2:,}")
        print("\n" + "-"*70)
        print("EMBEDDING:")
        print("-"*70)
        print(f"  • Vector dimension: {self.embedding_dim}")
        print(f"  • Sparsity: 0% (dense representation)")
        print(f"  • Memory: {embed_memory:.2f} MB")
        print(f"  • Parameters: {self.vocab_size * self.embedding_dim:,}")
        print("\n" + "-"*70)
        print("SAVINGS:")
        print("-"*70)
        print(f"  • Memory reduction: {(1 - embed_memory/onehot_memory)*100:.1f}%")
        print(f"  • Parameter reduction: {(1 - (self.vocab_size*self.embedding_dim)/(self.vocab_size**2))*100:.1f}%")
        print("="*70 + "\n")

# Demo with vocabulary size = 5000, embedding dimension = 128
demo = OneHotVsEmbedding(vocab_size=5000, embedding_dim=128)
demo.visualize_comparison()

### 🎓 The embedding dimension trade-off

Choosing the embedding size is as much art as science!

| Dimension | Pros | Cons | When to use |
|-----------|------|------|-------------|
| **Small (16-64)** | ⚡ Fast, lightweight | ⚠️ Limited capacity | Small datasets, mobile |
| **Medium (128-256)** | ⚖️ Good balance | 😐 Moderate cost and capacity | Most use cases |
| **Large (512-1024)** | 🧠 High capacity | 🐌 Slow, prone to overfitting | Large, complex datasets |

**Rule of thumb** (a rough starting point; validate on your own task):
- Vocabulary < 10K → dim = 128
- Vocabulary 10K-100K → dim = 256-512
- Vocabulary > 100K → dim = 512-768
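
The rule of thumb can be written down as a tiny helper. This is only a sketch: `suggest_embedding_dim` and `embedding_memory_mb` are hypothetical names, and the helper returns the lower end of each suggested range as a starting point rather than a tuned value.

```python
def suggest_embedding_dim(vocab_size):
    """Return a starting embedding dimension from the rule of thumb.

    Assumption: we pick the lower end of each suggested range.
    """
    if vocab_size < 10_000:
        return 128
    if vocab_size <= 100_000:
        return 256
    return 512

def embedding_memory_mb(vocab_size, embedding_dim, bytes_per_value=4):
    """Memory of a float32 embedding table in MB (1 MB = 1024**2 bytes)."""
    return vocab_size * embedding_dim * bytes_per_value / (1024 ** 2)

for vocab_size in (5_000, 50_000, 500_000):
    dim = suggest_embedding_dim(vocab_size)
    memory = embedding_memory_mb(vocab_size, dim)
    print(f"vocab={vocab_size:>7,}  dim={dim:>3}  memory={memory:.2f} MB")
```

Treat the output as a first guess; the right dimension depends on the amount of training data and on the downstream task.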

In [None]:
# ============================================
# Experiment: Embedding Dimension Trade-off
# ============================================

def analyze_embedding_dimensions():
    """Analyze the impact of the embedding dimension"""
    
    vocab_size = 10000
    dimensions = [16, 32, 64, 128, 256, 512, 768, 1024]
    
    # Metrics to track
    memory_usage = []
    capacity = []  # Information capacity
    computation_cost = []
    
    for dim in dimensions:
        # Memory usage (MB)
        mem = vocab_size * dim * 4 / (1024**2)
        memory_usage.append(mem)
        
        # Raw storage per word in bits (an upper bound, not a measure of learned information)
        cap = dim * 32  # 32-bit float
        capacity.append(cap)
        
        # Relative computation cost
        comp = dim / dimensions[0]  # relative to smallest
        computation_cost.append(comp)
    
    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # 1. Memory Usage
    axes[0].plot(dimensions, memory_usage, 'o-', linewidth=3, 
                markersize=10, color='#3498db')
    axes[0].fill_between(dimensions, memory_usage, alpha=0.3, color='#3498db')
    axes[0].set_xlabel('Embedding Dimension', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Memory (MB)', fontsize=12, fontweight='bold')
    axes[0].set_title('💾 Memory Usage', fontsize=14, fontweight='bold')
    axes[0].grid(True, alpha=0.3)
    axes[0].set_xscale('log', base=2)
    
    # 2. Information Capacity
    axes[1].plot(dimensions, capacity, 's-', linewidth=3, 
                markersize=10, color='#2ecc71')
    axes[1].fill_between(dimensions, capacity, alpha=0.3, color='#2ecc71')
    axes[1].set_xlabel('Embedding Dimension', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('Capacity (bits)', fontsize=12, fontweight='bold')
    axes[1].set_title('🧠 Information Capacity', fontsize=14, fontweight='bold')
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xscale('log', base=2)
    
    # 3. Computation Cost
    axes[2].plot(dimensions, computation_cost, '^-', linewidth=3, 
                markersize=10, color='#e74c3c')
    axes[2].fill_between(dimensions, computation_cost, alpha=0.3, color='#e74c3c')
    axes[2].set_xlabel('Embedding Dimension', fontsize=12, fontweight='bold')
    axes[2].set_ylabel('Relative Cost', fontsize=12, fontweight='bold')
    axes[2].set_title('⚡ Computation Cost', fontsize=14, fontweight='bold')
    axes[2].grid(True, alpha=0.3)
    axes[2].set_xscale('log', base=2)
    
    plt.tight_layout()
    plt.show()
    
    # Print recommendations
    print("\n" + "="*70)
    print("🎯 RECOMMENDATIONS FOR CHOOSING THE EMBEDDING DIMENSION")
    print("="*70)
    for i, dim in enumerate(dimensions):
        print(f"\nDimension = {dim}:")
        print(f"  Memory: {memory_usage[i]:.2f} MB")
        print(f"  Capacity: {capacity[i]:,} bits")
        print(f"  Cost: {computation_cost[i]:.1f}x")
        
        # Recommendation
        if dim <= 64:
            print("  → 📱 Good for: Mobile, edge devices, real-time")
        elif dim <= 256:
            print("  → 💻 Good for: General purpose, balanced performance")
        else:
            print("  → 🖥️ Good for: Large datasets, complex relationships")
    print("="*70 + "\n")

analyze_embedding_dimensions()

---

<a id='section2'></a>
## 4.2 Word/Token Embedding: Implementation

### 🔧 How an embedding layer works

An embedding layer is essentially a **lookup table**:

```
Word Index → Embedding Table → Embedding Vector
    3      →  [0.2, -0.1, 0.5, ...]  → Dense vector
```

**The learning process:**
1. Random initialization: `torch.nn.Embedding(vocab_size, embedding_dim)`
2. Forward pass: look up the vector for each index
3. Backprop: only the rows that were looked up receive a nonzero gradient
4. Optimizer step: update the embedding table (with plain SGD, only those rows change)
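
The four steps can be traced by hand on a tiny table. The sketch below uses plain Python lists instead of tensors and makes two simplifying assumptions: the loss is just the sum of all looked-up values (so each occurrence of a token contributes a gradient of 1 per dimension), and the optimizer is plain SGD. It checks that a lookup equals a one-hot vector multiplied by the table, and that only the rows used in the batch change.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3

# Step 1: random initialization of the embedding table (one row per token)
table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def one_hot_matmul(table, index):
    """Multiply a one-hot row vector by the table (the slow way to look up)."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(table))]
    return [sum(one_hot[i] * table[i][d] for i in range(len(table)))
            for d in range(len(table[0]))]

# Step 2: the forward pass is a plain row lookup, equivalent to one_hot @ table
via_lookup = list(table[3])
via_matmul = one_hot_matmul(table, 3)

# Step 3: gradient of loss = sum(looked-up values) with respect to the table
batch = [3, 1, 3]  # token 3 appears twice, tokens 0, 2, 4 never appear
grad = [[0.0] * embedding_dim for _ in range(vocab_size)]
for idx in batch:
    for d in range(embedding_dim):
        grad[idx][d] += 1.0

# Step 4: plain SGD update
learning_rate = 0.1
before = [row[:] for row in table]
for i in range(vocab_size):
    for d in range(embedding_dim):
        table[i][d] -= learning_rate * grad[i][d]

changed_rows = [i for i in range(vocab_size) if table[i] != before[i]]
print(changed_rows)  # [1, 3]
```

Optimizers with momentum or weight decay can also move rows that were not in the current batch, which is why the sketch sticks to plain SGD.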

### 🎯 Padding & Masking

In practice, sequences have different lengths:
- **Padding**: append a special token (usually index 0) so that all sequences in a batch have the same length
- **Masking**: hide the padding positions so the model does not learn from them
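
A small dependency-free sketch of both ideas follows. `pad_batch`, `masked_mean`, and the 2-dimensional `toy_table` are hypothetical and only serve the illustration; the sketch also assumes that index 0 is reserved for padding and never used by a real token.

```python
PAD_ID = 0  # assumption: index 0 is reserved for padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1 if tok != pad_id else 0 for tok in seq] for seq in padded]
    return padded, mask

def masked_mean(vectors, mask_row):
    """Average only the vectors whose mask entry is 1."""
    kept = [vec for vec, m in zip(vectors, mask_row) if m == 1]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

batch = [[5, 2, 9], [7, 4], [3]]
padded, mask = pad_batch(batch)
print(padded)  # [[5, 2, 9], [7, 4, 0], [3, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]

# Hypothetical 2-dimensional embedding table; the padding row is all zeros
toy_table = {0: [0.0, 0.0], 2: [1.0, 1.0], 3: [1.0, 3.0], 4: [2.0, 0.0],
             5: [0.0, 2.0], 7: [4.0, 2.0], 9: [2.0, 2.0]}

vectors = [toy_table[tok] for tok in padded[1]]  # tokens 7, 4, PAD
masked = masked_mean(vectors, mask[1])
naive = [sum(vec[d] for vec in vectors) / len(vectors) for d in range(2)]
print(masked)  # [3.0, 1.0]
print(naive)   # the padding position drags the average toward zero
```

Without the mask, the pooled representation of a sequence would depend on how much padding the batch happened to need.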

In [None]:
# ============================================
# Implementation: Embedding layer from scratch
# ============================================

class CustomEmbedding(nn.Module):
    """
    Custom embedding layer that exposes how the lookup works internally
    
    Args:
        vocab_size: Number of words in the vocabulary
        embedding_dim: Size of each embedding vector
        padding_idx: Index of the padding token (default 0)
    """
    
    def __init__(self, vocab_size: int, embedding_dim: int, padding_idx: int = 0):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.padding_idx = padding_idx
        
        # Embedding table: the main lookup table
        self.weight = nn.Parameter(torch.randn(vocab_size, embedding_dim))
        
        # Initialize the padding embedding to zeros
        # (unlike nn.Embedding, this class does not block gradients to that row)
        if padding_idx is not None:
            with torch.no_grad():
                self.weight[padding_idx].fill_(0)
    
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: look up embeddings in the table
        
        Args:
            input_ids: (batch_size, seq_len) - token indices
        
        Returns:
            embeddings: (batch_size, seq_len, embedding_dim)
        """
        # Simple lookup operation
        return self.weight[input_ids]
    
    def get_gradient_mask(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        Build a mask (1.0 = real token, 0.0 = padding) that callers can use
        to exclude padding positions from the loss or from updates
        """
        return (input_ids != self.padding_idx).float()


# ============================================
# Demo: Comparing the custom layer with PyTorch's nn.Embedding
# ============================================

def demo_embedding_layers():
    """Demo comparing the custom embedding with PyTorch's embedding"""
    
    vocab_size = 100
    embedding_dim = 16
    batch_size = 4
    seq_len = 10
    
    print("="*70)
    print("🔧 DEMO: EMBEDDING LAYER IMPLEMENTATION")
    print("="*70)
    
    # 1. Custom Embedding
    custom_embed = CustomEmbedding(vocab_size, embedding_dim, padding_idx=0)
    
    # 2. PyTorch Embedding
    pytorch_embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
    
    # Copy weights so both layers start from the same initialization
    with torch.no_grad():
        pytorch_embed.weight.copy_(custom_embed.weight)
    
    # Create an input with padding
    # 0 is the padding token
    input_ids = torch.randint(1, vocab_size, (batch_size, seq_len))
    input_ids[:, -3:] = 0  # Add padding at the end
    
    print(f"\n📝 Input shape: {input_ids.shape}")
    print(f"\nSample input (batch 0):")
    print(input_ids[0])
    print(f"  → Padding positions: {(input_ids[0] == 0).nonzero().squeeze()}")
    
    # Forward pass
    custom_output = custom_embed(input_ids)
    pytorch_output = pytorch_embed(input_ids)
    
    print(f"\n✅ Output shape: {custom_output.shape}")
    print(f"   ({batch_size} sequences, {seq_len} tokens, {embedding_dim} dimensions)")
    
    # Verify outputs are identical
    diff = (custom_output - pytorch_output).abs().max().item()
    print(f"\n🔍 Difference between custom and PyTorch: {diff:.10f}")
    print(f"   → {'✅ Identical!' if diff < 1e-6 else '❌ Different'}")
    
    # Visualize embeddings
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot the embeddings of one sequence in the batch
    sample_embed = custom_output[0].detach().numpy()
    
    # 1. Heatmap of the embeddings
    im1 = axes[0].imshow(sample_embed.T, cmap='RdBu_r', aspect='auto')
    axes[0].set_xlabel('Token Position', fontsize=11)
    axes[0].set_ylabel('Embedding Dimension', fontsize=11)
    axes[0].set_title('🎨 Embedding Vectors Visualization', fontsize=13, fontweight='bold')
    plt.colorbar(im1, ax=axes[0], label='Value')
    
    # Mark the padding positions
    padding_pos = (input_ids[0] == 0).nonzero().squeeze().tolist()
    if isinstance(padding_pos, int):
        padding_pos = [padding_pos]
    for pos in padding_pos:
        axes[0].axvline(x=pos, color='red', linestyle='--', linewidth=2, alpha=0.7)
    
    # 2. Norm of the embeddings (padding should be ~0)
    norms = torch.norm(custom_output[0], dim=1).detach().numpy()
    positions = range(len(norms))
    colors = ['red' if input_ids[0, i] == 0 else 'blue' for i in range(len(norms))]
    
    axes[1].bar(positions, norms, color=colors, alpha=0.7, edgecolor='black')
    axes[1].set_xlabel('Token Position', fontsize=11)
    axes[1].set_ylabel('L2 Norm', fontsize=11)
    axes[1].set_title('📊 Embedding Norms (Red = Padding)', fontsize=13, fontweight='bold')
    axes[1].grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n" + "-"*70)
    print("📌 KEY TAKEAWAYS:")
    print("-"*70)
    print("1. Padding tokens (red) have norm ≈ 0")
    print("2. Non-padding tokens (blue) have norm > 0")
    print("3. With nn.Embedding(padding_idx=0), gradients flow only to non-padding rows")
    print("="*70 + "\n")

demo_embedding_layers()

In [None]:
# ============================================
# Visualize: Gradient flow in the embedding layer
# ============================================

def visualize_gradient_flow():
    """
    Visualize how gradients flow through the embedding layer
    and update only the embeddings that were actually used
    """
    
    vocab_size = 20
    embedding_dim = 8
    
    # Create embedding layer
    embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
    
    # Input: uses only a few tokens
    input_ids = torch.tensor([3, 5, 7, 0, 0])  # 0 is padding
    
    # Forward
    embeddings = embed(input_ids)  # (5, 8)
    
    # Dummy loss: we only need a backward pass to inspect the gradients
    loss = embeddings.sum()
    loss.backward()
    
    # Check gradients
    grad_norm = torch.norm(embed.weight.grad, dim=1).numpy()
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # 1. Gradient magnitude for each word
    colors = ['red' if i == 0 else 'green' if i in [3, 5, 7] else 'gray' 
              for i in range(vocab_size)]
    
    bars = axes[0].bar(range(vocab_size), grad_norm, color=colors, 
                       alpha=0.7, edgecolor='black')
    axes[0].set_xlabel('Word Index', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
    axes[0].set_title('🔥 Gradient Flow (Green = Used, Gray = Unused, Red = Padding)', 
                     fontsize=12, fontweight='bold')
    axes[0].grid(axis='y', alpha=0.3)
    
    # Highlight used tokens
    used_tokens = [3, 5, 7]
    for tok in used_tokens:
        axes[0].text(tok, grad_norm[tok], '✓', ha='center', va='bottom', 
                    fontsize=16, fontweight='bold', color='darkgreen')
    
    # 2. Embedding weight heatmap with the rows that received gradients marked
    weights = embed.weight.detach().numpy()
    im = axes[1].imshow(weights, cmap='coolwarm', aspect='auto')
    axes[1].set_xlabel('Embedding Dimension', fontsize=11)
    axes[1].set_ylabel('Word Index', fontsize=11)
    axes[1].set_title('💾 Embedding Weights', fontsize=13, fontweight='bold')
    plt.colorbar(im, ax=axes[1])
    
    # Highlight the rows that received gradients
    for tok in used_tokens:
        axes[1].axhline(y=tok, color='yellow', linestyle='--', linewidth=2, alpha=0.5)
    axes[1].axhline(y=0, color='red', linestyle='--', linewidth=2, alpha=0.7, 
                   label='Padding (no grad)')
    axes[1].legend(loc='upper right')
    
    plt.tight_layout()
    plt.show()
    
    print("\n" + "="*70)
    print("🎓 GRADIENT FLOW ANALYSIS")
    print("="*70)
    print(f"\nVocabulary size: {vocab_size}")
    print(f"Input tokens: {input_ids.tolist()}")
    print(f"\nGradient Statistics:")
    print(f"  • Padding token (0): gradient = {grad_norm[0]:.6f}")
    print(f"  • Used tokens {used_tokens}: gradient = {[f'{grad_norm[t]:.6f}' for t in used_tokens]}")
    print(f"  • Unused tokens: gradient = {grad_norm[1:3].mean():.6f} (mean of first few)")
    print(f"\n✅ Only {len(used_tokens)}/{vocab_size} embeddings receive an update!")
    print("="*70 + "\n")

visualize_gradient_flow()

---

<a id='section3'></a>
## 4.3 Embedding Properties

### 🔍 Cosine Similarity

Good embeddings have an important property: **words with similar meanings → nearby vectors**

**Cosine similarity** measures how closely two vectors point in the same direction, regardless of their length:

$$\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} \in [-1, 1]$$

- **+1**: same direction (maximally similar)
- **0**: orthogonal (unrelated)
- **-1**: opposite directions
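
The three reference values can be checked by hand. This is a minimal sketch of the formula in plain Python; the vectors are chosen so that the norms are whole numbers and the results come out exact.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [3.0, 4.0]  # norm 5

same_direction = cosine_similarity(a, [6.0, 8.0])    # a scaled by 2
orthogonal = cosine_similarity(a, [-4.0, 3.0])       # a rotated by 90 degrees
opposite = cosine_similarity(a, [-3.0, -4.0])        # a flipped

print(same_direction, orthogonal, opposite)  # 1.0 0.0 -1.0
```

Note that scaling a vector does not change the result: only direction matters, which is why cosine similarity is popular for comparing embeddings of different norms.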

### 📊 Visualization: PCA & t-SNE

Embeddings are usually high-dimensional (128-768), which makes them hard to visualize directly. Two common tools:

1. **PCA** (Principal Component Analysis)
   - Linear projection down to 2D/3D
   - Preserves as much variance as possible
   - Fast and stable

2. **t-SNE** (t-Distributed Stochastic Neighbor Embedding)
   - Non-linear projection
   - Preserves local structure better
   - Slower, and the result depends on the random seed
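
To see what "preserves as much variance as possible" means, here is a from-scratch sketch of the first principal component using power iteration on the covariance matrix. It is not how `sklearn.decomposition.PCA` is implemented; it is a simplified illustration on hypothetical 3D points that all lie on one line, so a single component captures all of the variance.

```python
import math

def center(points):
    """Subtract the per-dimension mean from every point."""
    n, dim = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    return [[p[d] - means[d] for d in range(dim)] for p in points]

def covariance(centered):
    """Sample covariance matrix of already-centered points."""
    n, dim = len(centered), len(centered[0])
    return [[sum(p[i] * p[j] for p in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def first_principal_component(cov, iterations=100):
    """Power iteration: repeatedly apply cov and renormalize.

    Assumption: the all-ones start vector is not orthogonal to the
    dominant eigenvector (true for the data below).
    """
    dim = len(cov)
    vec = [1.0] * dim
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

# Hypothetical data: points on the line (x, 2x, 0)
points = [[float(x), 2.0 * x, 0.0] for x in range(1, 6)]
centered = center(points)
pc1 = first_principal_component(covariance(centered))

# Project every point onto the first component (3D -> 1D)
projected = [sum(p[d] * pc1[d] for d in range(3)) for p in centered]
print([round(v, 4) for v in pc1])
print([round(v, 4) for v in projected])
```

The recovered direction is proportional to (1, 2, 0), the line the points lie on. t-SNE has no comparable closed-form sketch, since it optimizes a non-linear objective iteratively.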

In [None]:
# ============================================
# Demo: Cosine Similarity & Semantic Space
# ============================================

class SemanticEmbeddingDemo:
    """
    Demo of the semantic properties of embeddings
    """
    
    def __init__(self):
        # Build a toy vocabulary with clear semantic groups
        self.vocab = {
            # Animals: cat, dog, mouse, elephant, lion
            'mèo': 0, 'chó': 1, 'chuột': 2, 'voi': 3, 'sư_tử': 4,
            # Fruits: apple, orange, banana, mango, melon
            'táo': 5, 'cam': 6, 'chuối': 7, 'xoài': 8, 'dưa': 9,
            # Colors: red, blue/green, yellow, white, black
            'đỏ': 10, 'xanh': 11, 'vàng': 12, 'trắng': 13, 'đen': 14,
        }
        
        # Hand-craft structured embeddings (these are not learned)
        # Dimensions: [animal_score, fruit_score, color_score, small random noise, ...]
        self.embeddings = torch.zeros(len(self.vocab), 16)
        
        # Animals: high on dim 0
        for word in ['mèo', 'chó', 'chuột', 'voi', 'sư_tử']:
            idx = self.vocab[word]
            self.embeddings[idx, 0] = 1.0
            self.embeddings[idx, 1:] = torch.randn(15) * 0.1
        
        # Fruits: high on dim 1
        for word in ['táo', 'cam', 'chuối', 'xoài', 'dưa']:
            idx = self.vocab[word]
            self.embeddings[idx, 1] = 1.0
            self.embeddings[idx, 0] = 0.0
            self.embeddings[idx, 2:] = torch.randn(14) * 0.1
        
        # Colors: high on dim 2
        for word in ['đỏ', 'xanh', 'vàng', 'trắng', 'đen']:
            idx = self.vocab[word]
            self.embeddings[idx, 2] = 1.0
            self.embeddings[idx, :2] = 0.0
            self.embeddings[idx, 3:] = torch.randn(13) * 0.1
        
        self.idx_to_word = {v: k for k, v in self.vocab.items()}
    
    def cosine_similarity(self, word1: str, word2: str) -> float:
        """Compute the cosine similarity between two words"""
        idx1, idx2 = self.vocab[word1], self.vocab[word2]
        vec1, vec2 = self.embeddings[idx1], self.embeddings[idx2]
        return F.cosine_similarity(vec1, vec2, dim=0).item()
    
    def find_similar(self, word: str, top_k: int = 5) -> List[Tuple[str, float]]:
        """Find the top-k nearest words"""
        idx = self.vocab[word]
        query_vec = self.embeddings[idx]
        
        # Compute similarities against every word
        sims = F.cosine_similarity(query_vec.unsqueeze(0), self.embeddings, dim=1)
        
        # Sort and take the top-k (skipping the query word itself, which ranks first)
        top_indices = sims.argsort(descending=True)[1:top_k+1]
        
        results = []
        for idx in top_indices:
            word_similar = self.idx_to_word[idx.item()]
            sim_score = sims[idx].item()
            results.append((word_similar, sim_score))
        
        return results
    
    def visualize_similarity_matrix(self):
        """Visualize the similarity matrix for all words"""
        n_words = len(self.vocab)
        sim_matrix = torch.zeros(n_words, n_words)
        
        # Compute pairwise similarities
        for i in range(n_words):
            for j in range(n_words):
                sim_matrix[i, j] = F.cosine_similarity(
                    self.embeddings[i], self.embeddings[j], dim=0
                )
        
        # Plot
        fig, ax = plt.subplots(figsize=(12, 10))
        
        im = ax.imshow(sim_matrix.numpy(), cmap='RdYlGn', vmin=-1, vmax=1)
        
        # Set ticks
        words = [self.idx_to_word[i] for i in range(n_words)]
        ax.set_xticks(range(n_words))
        ax.set_yticks(range(n_words))
        ax.set_xticklabels(words, rotation=45, ha='right')
        ax.set_yticklabels(words)
        
        # Add text annotations
        for i in range(n_words):
            for j in range(n_words):
                text = ax.text(j, i, f'{sim_matrix[i, j]:.2f}',
                             ha="center", va="center", color="black", fontsize=8)
        
        ax.set_title('🎨 Cosine Similarity Matrix', fontsize=16, fontweight='bold', pad=20)
        plt.colorbar(im, ax=ax, label='Similarity Score')
        plt.tight_layout()
        plt.show()
    
    def visualize_2d(self):
        """Visualize the embeddings in 2D space using PCA"""
        # PCA projection
        pca = PCA(n_components=2)
        embeddings_2d = pca.fit_transform(self.embeddings.numpy())
        
        # Assign a category to each word
        categories = []
        for word in self.vocab.keys():
            if word in ['mèo', 'chó', 'chuột', 'voi', 'sư_tử']:
                categories.append('Animals')
            elif word in ['táo', 'cam', 'chuối', 'xoài', 'dưa']:
                categories.append('Fruits')
            else:
                categories.append('Colors')
        
        # Plot
        fig, ax = plt.subplots(figsize=(12, 8))
        
        # Scatter plot colored by category
        category_colors = {'Animals': '#e74c3c', 'Fruits': '#2ecc71', 'Colors': '#3498db'}
        
        for category, color in category_colors.items():
            mask = [c == category for c in categories]
            ax.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1],
                      c=color, label=category, s=200, alpha=0.6, edgecolors='black', linewidth=2)
        
        # Add labels
        for i, word in enumerate(self.vocab.keys()):
            ax.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                       fontsize=11, fontweight='bold', ha='center', va='bottom')
        
        ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)', 
                     fontsize=12, fontweight='bold')
        ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)', 
                     fontsize=12, fontweight='bold')
        ax.set_title('📊 Embedding Space Visualization (PCA)', 
                    fontsize=16, fontweight='bold')
        ax.legend(fontsize=12, loc='best')
        ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# Demo
demo = SemanticEmbeddingDemo()

print("="*70)
print("🔍 COSINE SIMILARITY DEMO")
print("="*70)

# Test a few pairs
test_pairs = [
    ('mèo', 'chó'),     # cat, dog
    ('mèo', 'táo'),     # cat, apple
    ('táo', 'cam'),     # apple, orange
    ('đỏ', 'xanh'),     # red, blue/green
    ('voi', 'chuối')    # elephant, banana
]

for word1, word2 in test_pairs:
    sim = demo.cosine_similarity(word1, word2)
    print(f"\n'{word1}' ↔ '{word2}': {sim:.4f}")

print("\n" + "-"*70)
print("🔎 FIND SIMILAR WORDS")
print("-"*70)

for word in ['mèo', 'táo', 'đỏ']:
    similar = demo.find_similar(word, top_k=3)
    print(f"\nWords most similar to '{word}':")
    for similar_word, score in similar:
        print(f"  → {similar_word}: {score:.4f}")

print("\n" + "="*70 + "\n")

# Visualizations
demo.visualize_similarity_matrix()
demo.visualize_2d()

---

<a id='section4'></a>
## 4.4 Practical Experiments

### 🧪 Experiment 1: Train Embeddings on Toy Dataset

Now we will train embeddings from scratch on a toy dataset and observe how they pick up semantic relationships!
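
Before the training code, it helps to see what the training data looks like. The sketch below shows how skip-gram style (center, context) pairs can be generated from one sentence with a sliding window. `skipgram_pairs` is a hypothetical helper that works on raw strings, and the English sentence is only an example.

```python
def skipgram_pairs(sentence, window_size=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window_size)
        end = min(len(sentence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

sentence = ["cats", "and", "dogs", "are", "animals"]
pairs = skipgram_pairs(sentence, window_size=1)

for center, context in pairs:
    print(f"{center:>8} -> {context}")
print(len(pairs))  # 8
```

With `window_size=1`, each inner word yields two pairs and each edge word yields one. A larger window produces more pairs and captures broader, more topical context.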

In [None]:
# ============================================
# Experiment 1: Train Embeddings
# ============================================

# Build a toy dataset, skip-gram style
# Goal: predict context words from the center word

class SkipGramDataset(Dataset):
    """
    Dataset for the skip-gram model
    Each sample: (center_word, context_word)
    """
    
    def __init__(self, sentences: List[List[str]], vocab: Dict[str, int], window_size: int = 2):
        self.data = []
        self.vocab = vocab
        
        # Generate training pairs
        for sentence in sentences:
            for i, center_word in enumerate(sentence):
                # Collect the context words inside the window
                start = max(0, i - window_size)
                end = min(len(sentence), i + window_size + 1)
                
                for j in range(start, end):
                    if i != j:  # Skip center word itself
                        context_word = sentence[j]
                        if center_word in vocab and context_word in vocab:
                            self.data.append((
                                vocab[center_word],
                                vocab[context_word]
                            ))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        center, context = self.data[idx]
        return torch.tensor(center), torch.tensor(context)


class SkipGramModel(nn.Module):
    """
    Simple skip-gram model for learning embeddings
    """
    
    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        # Center word embeddings
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Context word embeddings
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Initialize
        self.center_embeddings.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
        self.context_embeddings.weight.data.zero_()
    
    def forward(self, center_words, context_words):
        # Get embeddings
        center_embeds = self.center_embeddings(center_words)  # (batch, embed_dim)
        context_embeds = self.context_embeddings(context_words)  # (batch, embed_dim)
        
        # Dot product
        scores = (center_embeds * context_embeds).sum(dim=1)  # (batch,)
        
        return scores
    
    def get_embeddings(self):
        """Return the final embeddings (average of center and context tables)"""
        return (self.center_embeddings.weight + self.context_embeddings.weight) / 2


def train_embeddings(embedding_dim: int = 50, epochs: int = 100):
    """
    Train embeddings on a toy corpus
    """
    
    # Toy corpus about animals and fruits (Vietnamese tokens)
    sentences = [
        ['m√®o', 'v√†', 'ch√≥', 'l√†', 'ƒë·ªông_v·∫≠t'],
        ['ch√≥', 'th√≠ch', 'ch∆°i', 'v·ªõi', 'm√®o'],
        ['t√°o', 'v√†', 'cam', 'l√†', 'tr√°i_c√¢y'],
        ['t√¥i', 'th√≠ch', 'ƒÉn', 't√°o', 'v√†', 'cam'],
        ['ƒë·ªông_v·∫≠t', 'nh∆∞', 'm√®o', 'ch√≥', 'r·∫•t', 'd·ªÖ_th∆∞∆°ng'],
        ['tr√°i_c√¢y', 'nh∆∞', 't√°o', 'cam', 'r·∫•t', 'ngon'],
        ['m√®o', 'th√≠ch', 'ƒÉn', 'c√°'],
        ['ch√≥', 'th√≠ch', 'ƒÉn', 'th·ªãt'],
        ['t√°o', 'm√†u', 'ƒë·ªè', 'r·∫•t', 'ngon'],
        ['cam', 'm√†u', 'cam', 'r·∫•t', 't∆∞∆°i'],
    ] * 5  # Repeat ƒë·ªÉ c√≥ ƒë·ªß data
    
    # Build vocabulary
    all_words = [word for sent in sentences for word in sent]
    word_counts = Counter(all_words)
    vocab = {word: i for i, (word, _) in enumerate(word_counts.most_common())}
    idx_to_word = {i: word for word, i in vocab.items()}
    
    print("="*70)
    print("🚀 TRAINING EMBEDDINGS")
    print("="*70)
    print(f"\nVocabulary size: {len(vocab)}")
    print(f"Embedding dimension: {embedding_dim}")
    print(f"Number of sentences: {len(sentences)}")
    print(f"\nVocabulary: {list(vocab.keys())[:10]}...")
    
    # Create dataset
    dataset = SkipGramDataset(sentences, vocab, window_size=2)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    print(f"\nTraining pairs: {len(dataset)}")
    
    # Create model
    model = SkipGramModel(len(vocab), embedding_dim).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.025)
    
    # Training loop
    losses = []
    
    print("\n" + "-"*70)
    print("Training progress:")
    print("-"*70)
    
    for epoch in range(epochs):
        total_loss = 0
        
        for center, context in dataloader:
            center, context = center.to(device), context.to(device)
            
            # Forward
            scores = model(center, context)
            
            # Simplified objective: only positive (center, context) pairs are scored.
            # No negative samples are drawn, so nothing pushes unrelated words apart.
            loss = -F.logsigmoid(scores).mean()
            
            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        avg_loss = total_loss / len(dataloader)
        losses.append(avg_loss)
        
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1:3d}/{epochs} | Loss: {avg_loss:.4f}")
    
    print("-"*70)
    print("✅ Training completed!")
    print("="*70 + "\n")
    
    # Plot loss curve
    plt.figure(figsize=(10, 5))
    plt.plot(losses, linewidth=2, color='#3498db')
    plt.xlabel('Epoch', fontsize=12, fontweight='bold')
    plt.ylabel('Loss', fontsize=12, fontweight='bold')
    plt.title('📉 Training Loss Curve', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    return model, vocab, idx_to_word

# Train model
model, vocab, idx_to_word = train_embeddings(embedding_dim=50, epochs=100)
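
The training loop above scores only positive pairs. One way the objective could be extended is with negative sampling, sketched below. This is a minimal illustration, not the notebook's method: `negative_sampling_loss` and `num_negatives` are hypothetical names, and negatives are drawn uniformly for simplicity (word2vec itself samples from a smoothed unigram distribution).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def negative_sampling_loss(center_emb: nn.Embedding,
                           context_emb: nn.Embedding,
                           center: torch.Tensor,
                           context: torch.Tensor,
                           num_negatives: int = 5) -> torch.Tensor:
    """Skip-gram loss with uniformly sampled negatives (hypothetical helper)."""
    vocab_size = context_emb.num_embeddings
    c = center_emb(center)        # (batch, dim)
    pos = context_emb(context)    # (batch, dim)

    # Assumption: uniform negatives; a sampled index may occasionally
    # coincide with the true context word, which this sketch ignores.
    neg_idx = torch.randint(0, vocab_size, (center.size(0), num_negatives),
                            device=center.device)
    neg = context_emb(neg_idx)    # (batch, k, dim)

    pos_score = (c * pos).sum(dim=1)                       # (batch,)
    neg_score = torch.bmm(neg, c.unsqueeze(2)).squeeze(2)  # (batch, k)

    pos_loss = -F.logsigmoid(pos_score)                # pull true pairs together
    neg_loss = -F.logsigmoid(-neg_score).sum(dim=1)    # push sampled pairs apart
    return (pos_loss + neg_loss).mean()


# Tiny demo with made-up sizes and indices
torch.manual_seed(0)
center_emb = nn.Embedding(10, 8)
context_emb = nn.Embedding(10, 8)
center = torch.tensor([0, 1, 2])
context = torch.tensor([1, 2, 3])
loss = negative_sampling_loss(center_emb, context_emb, center, context)
```

With the model above, the two embedding tables would be `model.center_embeddings` and `model.context_embeddings`, and the returned value would replace the positive-only loss inside the batch loop.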

In [None]:
# ============================================
# Visualize learned embeddings
# ============================================

def visualize_learned_embeddings(model, vocab, idx_to_word):
    """
    Visualize the learned embeddings.
    """
    
    # Get embeddings
    embeddings = model.get_embeddings().detach().cpu().numpy()
    
    # Compute similarity matrix
    n_words = len(vocab)
    sim_matrix = np.zeros((n_words, n_words))
    
    for i in range(n_words):
        for j in range(n_words):
            sim_matrix[i, j] = np.dot(embeddings[i], embeddings[j]) / (
                np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j])
            )
    
    # 1. Similarity matrix
    fig, axes = plt.subplots(1, 2, figsize=(16, 7))
    
    words = [idx_to_word[i] for i in range(min(15, n_words))]  # Top 15 words
    sim_subset = sim_matrix[:len(words), :len(words)]
    
    im = axes[0].imshow(sim_subset, cmap='RdYlGn', vmin=-1, vmax=1)
    axes[0].set_xticks(range(len(words)))
    axes[0].set_yticks(range(len(words)))
    axes[0].set_xticklabels(words, rotation=45, ha='right')
    axes[0].set_yticklabels(words)
    axes[0].set_title('🎨 Learned Similarity Matrix', fontsize=14, fontweight='bold')
    plt.colorbar(im, ax=axes[0])
    
    # 2. PCA visualization
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings)
    
    # Plot all words
    axes[1].scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], 
                   s=100, alpha=0.6, c='steelblue', edgecolors='black')
    
    # Add labels
    for i, word in idx_to_word.items():
        axes[1].annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                        fontsize=10, ha='center', va='bottom')
    
    axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)', 
                      fontsize=11, fontweight='bold')
    axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)', 
                      fontsize=11, fontweight='bold')
    axes[1].set_title('📊 Embedding Space (PCA)', fontsize=14, fontweight='bold')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # 3. Find similar words
    print("\n" + "="*70)
    print("🔍 SIMILAR WORDS (learned from context)")
    print("="*70)
    
    test_words = ['mèo', 'chó', 'táo', 'cam']  # cat, dog, apple, orange
    
    for word in test_words:
        if word in vocab:
            word_idx = vocab[word]
            word_vec = embeddings[word_idx]
            
            # Compute similarities
            sims = []
            for i in range(n_words):
                if i != word_idx:
                    sim = np.dot(word_vec, embeddings[i]) / (
                        np.linalg.norm(word_vec) * np.linalg.norm(embeddings[i])
                    )
                    sims.append((idx_to_word[i], sim))
            
            sims.sort(key=lambda x: x[1], reverse=True)
            
            print(f"\n'{word}' → Similar words:")
            for similar_word, sim_score in sims[:5]:
                print(f"  {similar_word}: {sim_score:.4f}")
    
    print("\n" + "="*70 + "\n")

visualize_learned_embeddings(model, vocab, idx_to_word)
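
The nested loops in `visualize_learned_embeddings` compute cosine similarity one pair at a time. A vectorized variant could look like the sketch below. `cosine_similarity_matrix` and `top_k_similar` are hypothetical helper names that are not part of the notebook, and random vectors stand in for the trained embeddings.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """All-pairs cosine similarity for an (n_words, dim) array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)  # guard against zero vectors
    return unit @ unit.T


def top_k_similar(sim_matrix: np.ndarray, word_idx: int, k: int = 5) -> list:
    """Indices of the k most similar rows, excluding the word itself."""
    order = np.argsort(-sim_matrix[word_idx])
    return [int(i) for i in order if i != word_idx][:k]


# Toy check: random vectors stand in for learned embeddings
rng = np.random.default_rng(0)
demo = rng.normal(size=(6, 4))
sim = cosine_similarity_matrix(demo)
neighbors = top_k_similar(sim, word_idx=0, k=3)
```

Normalizing the rows once and taking a single matrix product gives the same matrix as the double loop, and the `eps` guard avoids a division by zero if a row happens to be all zeros.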

### 🧪 Experiment 2: Compare Different Embedding Dimensions

Now let's compare how the learned embeddings behave across different embedding dimensions!

In [None]:
# ============================================
# Experiment 2: Compare Embedding Dimensions
# ============================================

def compare_embedding_dimensions():
    """
    Train and compare models with different embedding dimensions.
    """
    
    dimensions = [8, 16, 32, 64, 128]
    results = {}
    
    print("="*70)
    print("🔬 EXPERIMENT: Comparing Embedding Dimensions")
    print("="*70 + "\n")
    
    for dim in dimensions:
        print(f"Training with dimension = {dim}...")
        model, vocab, idx_to_word = train_embeddings(embedding_dim=dim, epochs=50)
        
        # Get embeddings
        embeddings = model.get_embeddings().detach().cpu().numpy()
        
        # Compute average pairwise similarity
        n = len(vocab)
        total_sim = 0
        count = 0
        
        for i in range(n):
            for j in range(i+1, n):
                sim = np.dot(embeddings[i], embeddings[j]) / (
                    np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j])
                )
                total_sim += sim
                count += 1
        
        avg_sim = total_sim / count
        
        # Variance of all embedding values (a crude proxy for spread, not a true capacity measure)
        variance = np.var(embeddings)
        
        results[dim] = {
            'avg_similarity': avg_sim,
            'variance': variance,
            'embeddings': embeddings
        }
        
        print(f"  → Avg similarity: {avg_sim:.4f}")
        print(f"  → Variance: {variance:.4f}\n")
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Average similarity
    avg_sims = [results[d]['avg_similarity'] for d in dimensions]
    axes[0, 0].plot(dimensions, avg_sims, 'o-', linewidth=3, markersize=10, color='#3498db')
    axes[0, 0].set_xlabel('Embedding Dimension', fontsize=11, fontweight='bold')
    axes[0, 0].set_ylabel('Average Similarity', fontsize=11, fontweight='bold')
    axes[0, 0].set_title('📊 Semantic Coherence', fontsize=13, fontweight='bold')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].set_xscale('log', base=2)
    
    # 2. Variance (capacity)
    variances = [results[d]['variance'] for d in dimensions]
    axes[0, 1].plot(dimensions, variances, 's-', linewidth=3, markersize=10, color='#2ecc71')
    axes[0, 1].set_xlabel('Embedding Dimension', fontsize=11, fontweight='bold')
    axes[0, 1].set_ylabel('Variance', fontsize=11, fontweight='bold')
    axes[0, 1].set_title('🧠 Information Capacity', fontsize=13, fontweight='bold')
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].set_xscale('log', base=2)
    
    # 3. Embedding distribution (small dim)
    small_dim = dimensions[0]
    axes[1, 0].hist(results[small_dim]['embeddings'].flatten(), bins=50, 
                   color='#e74c3c', alpha=0.7, edgecolor='black')
    axes[1, 0].set_xlabel('Value', fontsize=11)
    axes[1, 0].set_ylabel('Frequency', fontsize=11)
    axes[1, 0].set_title(f'📈 Distribution (dim={small_dim})', 
                        fontsize=13, fontweight='bold')
    axes[1, 0].grid(axis='y', alpha=0.3)
    
    # 4. Embedding distribution (large dim)
    large_dim = dimensions[-1]
    axes[1, 1].hist(results[large_dim]['embeddings'].flatten(), bins=50, 
                   color='#9b59b6', alpha=0.7, edgecolor='black')
    axes[1, 1].set_xlabel('Value', fontsize=11)
    axes[1, 1].set_ylabel('Frequency', fontsize=11)
    axes[1, 1].set_title(f'📈 Distribution (dim={large_dim})', 
                        fontsize=13, fontweight='bold')
    axes[1, 1].grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Summary
    print("\n" + "="*70)
    print("📊 SUMMARY: Embedding Dimension Comparison")
    print("="*70)
    print(f"\n{'Dimension':<12} {'Avg Similarity':<18} {'Variance':<15} {'Recommendation'}")
    print("-"*70)
    
    for dim in dimensions:
        sim = results[dim]['avg_similarity']
        var = results[dim]['variance']
        
        # Rule-of-thumb labels keyed on dimension only; they are not derived from sim/var
        if dim <= 16:
            rec = "Too small, low capacity"
        elif dim <= 64:
            rec = "✅ Good balance"
        else:
            rec = "High capacity, may overfit"
        
        print(f"{dim:<12} {sim:<18.4f} {var:<15.4f} {rec}")
    
    print("="*70 + "\n")

compare_embedding_dimensions()
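
Average similarity over the whole vocabulary and raw variance say little about whether related words actually cluster. One possible complementary score, sketched here under the assumption that two small word groups are labeled by hand, compares within-group similarity to cross-group similarity. `group_separation` is a hypothetical helper, and the English tokens and 2-D vectors below are stand-ins, not notebook output.

```python
import numpy as np


def group_separation(embeddings, vocab, group_a, group_b):
    """Mean within-group cosine similarity minus mean cross-group similarity.

    Each group needs at least two words so that within-group pairs exist.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    a = unit[[vocab[w] for w in group_a]]
    b = unit[[vocab[w] for w in group_b]]

    def mean_off_diagonal(m):
        # Average similarity over distinct pairs (drop self-similarity)
        n = m.shape[0]
        return (m.sum() - np.trace(m)) / (n * (n - 1))

    within = 0.5 * (mean_off_diagonal(a @ a.T) + mean_off_diagonal(b @ b.T))
    across = (a @ b.T).mean()
    return within - across


# Stand-in data: two well-separated groups in 2-D
toy_vocab = {'cat': 0, 'dog': 1, 'apple': 2, 'orange': 3}
toy_embeddings = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]])
score = group_separation(toy_embeddings, toy_vocab, ['cat', 'dog'], ['apple', 'orange'])
```

A score near zero means the groups are not separated; larger values mean animals sit closer to animals and fruit closer to fruit. In the experiment above it could be computed per dimension by passing the animal and fruit tokens from the toy corpus.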

---

## 🎯 Summary

### ✅ What you have learned:

1. **One-Hot vs Embedding**
   - One-hot: sparse, carries no semantic meaning
   - Embedding: dense, learns relationships between tokens
   - Dimension trade-off: capacity vs. efficiency

2. **Implementation Details**
   - Embedding = lookup table
   - Gradients flow only through the rows of tokens used in the batch
   - Padding & masking

3. **Embedding Properties**
   - Cosine similarity measures semantic closeness
   - PCA/t-SNE for visualization
   - Similar words → close vectors

4. **Training Embeddings**
   - Skip-gram: predict context from center
   - Embeddings are learned from data
   - Dimension size impacts performance
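
The three points under "Implementation Details" can be checked directly in a few lines. This is a small sketch with made-up sizes and token ids, assuming index 0 is reserved for padding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 6, 4
emb = nn.Embedding(vocab_size, dim, padding_idx=0)  # row 0 reserved for padding

tokens = torch.tensor([[2, 3, 0]])  # one sequence, last position is padding

# 1. A lookup is equivalent to multiplying a one-hot matrix by the weight table
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()
lookup_matches_matmul = torch.allclose(emb(tokens), one_hot @ emb.weight)

# 2. Only rows of tokens present in the batch receive gradient
#    (the padding row is excluded by padding_idx)
emb(tokens).sum().backward()
rows_with_grad = emb.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(rows_with_grad)  # [2, 3]

# 3. Masked mean pooling ignores padded positions
mask = (tokens != 0).unsqueeze(-1).float()  # (batch, seq, 1)
pooled = (emb(tokens) * mask).sum(dim=1) / mask.sum(dim=1)
```

Rows 1, 4 and 5 never appear in `tokens`, so their gradient stays zero. That is why rare words in a small corpus barely move from their initialization.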

### 🚀 Next Steps:

- Try a larger corpus (Wikipedia, news)
- Implement CBOW (continuous bag of words)
- Pre-trained embeddings (Word2Vec, GloVe, FastText)
- Contextual embeddings (BERT, GPT)
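
As a starting point for the CBOW item, here is a minimal sketch that mirrors the Skip-gram setup in reverse: average the context embeddings, then predict the center word. `CBOWModel` is a hypothetical class, and it uses a full softmax over the vocabulary, which is only practical for a toy vocabulary like the one in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBOWModel(nn.Module):
    """Minimal CBOW sketch: average context embeddings, predict the center word."""

    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_words: torch.Tensor) -> torch.Tensor:
        # context_words: (batch, n_context) token indices
        context_mean = self.embeddings(context_words).mean(dim=1)  # (batch, dim)
        return self.output(context_mean)                           # (batch, vocab) logits


# Made-up indices: two examples, each with four context words
torch.manual_seed(0)
cbow = CBOWModel(vocab_size=10, embedding_dim=8)
context = torch.tensor([[1, 2, 4, 5], [0, 1, 3, 4]])
center = torch.tensor([3, 2])
logits = cbow(context)
loss = F.cross_entropy(logits, center)
```

A dataset for this model would yield the words in each window as `context` and the middle word as `center`, the reverse of the pairs produced by `SkipGramDataset`.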

---

## 💡 Key Takeaways

1. **Embeddings are fundamental** to modern NLP
2. **Dimension size** is a trade-off between capacity and efficiency
3. **Training data quality** matters more than model complexity
4. **Visualization** helps you understand learned representations

---

## 📚 References

- [Word2Vec paper](https://arxiv.org/abs/1301.3781)
- [GloVe paper](https://nlp.stanford.edu/pubs/glove.pdf)
- [PyTorch Embedding docs](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)

**Happy learning! 🎉**