# 📗 FILE 2-B - Optimizer, Activation & Regularization

## 🎯 Objectives

After this lesson you will understand:
- **Activation functions** beyond the basics (ReLU variants, GELU)
- **Optimizers** - SGD, Adam, AdamW, and how to choose between them
- **Learning rate** - scheduling strategies
- **Regularization** - Dropout and L2 to combat overfitting

---

## 📌 Why does this matter?

- **Activation** → determines whether the model can learn non-linear patterns
- **Optimizer** → determines training speed and final performance
- **Learning rate** → too high: training fails to converge; too low: training is too slow
- **Regularization** → determines how well the model generalizes

---

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

print(f"TensorFlow version: {tf.__version__}")

---

## 1️⃣ Activation Functions - Deep Dive

### 🔹 ReLU Family

**1. ReLU (Rectified Linear Unit)**
```
f(x) = max(0, x)
```

**2. Leaky ReLU**
```
f(x) = max(0.01x, x)
```

**3. PReLU (Parametric ReLU)**
```
f(x) = max(αx, x)  # α is learnable
```

**4. ELU (Exponential Linear Unit)**
```
f(x) = x if x > 0 else α(e^x - 1)
```
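
What separates these variants in practice is the gradient they pass back for negative inputs. The following is a minimal pure-Python sketch of the derivatives, assuming a Leaky ReLU slope of 0.01 and α = 1.0 for ELU. The `*_grad` helper names are illustrative only and are not library functions.

```python
import math

def relu_grad(x):
    """Derivative of max(0, x): zero for every negative input."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    """Derivative of max(slope * x, x): a small constant slope for x < 0."""
    return 1.0 if x > 0 else slope

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: alpha * e^x for x <= 0, decaying smoothly toward 0."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# Compare the gradient each variant passes back at a few sample inputs
for x in (-2.0, -0.5, 1.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.3f}  "
          f"leaky'={leaky_relu_grad(x):.3f}  elu'={elu_grad(x):.3f}")
```

A unit whose pre-activation stays negative receives zero gradient under ReLU and stops updating, while the other two variants keep a non-zero signal.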

---

In [None]:
# Visualize ReLU family
x = np.linspace(-3, 3, 200)

relu = np.maximum(0, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)
elu = np.where(x > 0, x, 1.0 * (np.exp(x) - 1))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# ReLU
axes[0].plot(x, relu, 'b-', linewidth=2)
axes[0].axhline(0, color='k', linestyle='--', alpha=0.3)
axes[0].axvline(0, color='k', linestyle='--', alpha=0.3)
axes[0].set_title('ReLU: max(0, x)')
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].grid(True, alpha=0.3)

# Leaky ReLU
axes[1].plot(x, leaky_relu, 'g-', linewidth=2)
axes[1].axhline(0, color='k', linestyle='--', alpha=0.3)
axes[1].axvline(0, color='k', linestyle='--', alpha=0.3)
axes[1].set_title('Leaky ReLU: max(0.01x, x)')
axes[1].set_xlabel('x')
axes[1].set_ylabel('f(x)')
axes[1].grid(True, alpha=0.3)

# ELU
axes[2].plot(x, elu, 'r-', linewidth=2)
axes[2].axhline(0, color='k', linestyle='--', alpha=0.3)
axes[2].axvline(0, color='k', linestyle='--', alpha=0.3)
axes[2].set_title('ELU: x if x>0 else α(e^x-1)')
axes[2].set_xlabel('x')
axes[2].set_ylabel('f(x)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey differences:")
print("- ReLU: dead-neuron problem (x<0 → gradient=0)")
print("- Leaky ReLU: small gradient when x<0 → mitigates dying ReLU")
print("- ELU: smooth, mean activation closer to 0 → can speed up convergence")

### 🔹 GELU (Gaussian Error Linear Unit)

**Formula:**
```
GELU(x) = x * Φ(x)
```
where Φ(x) is the CDF of the standard normal distribution.

**Characteristics:**
- Smooth, non-monotonic
- Used in **Transformers** (BERT, GPT)
- Often works better than ReLU on NLP tasks
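
To make the formula concrete, here is a small pure-Python sketch that evaluates GELU exactly through the error function and compares it with the widely used tanh approximation. The function names `gelu_exact` and `gelu_tanh` are illustrative choices, not library APIs.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the standard normal CDF written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Print both versions side by side at a few sample inputs
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  tanh={gelu_tanh(x):+.5f}")
```

The value at x = -1 is lower than the value at x = -3, which is the non-monotonic dip mentioned in the list above.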

---

In [None]:
# GELU implementation (tanh approximation of x * Φ(x))
def gelu(x):
    """GELU activation, computed with the tanh approximation."""
    return 0.5 * x * (1 + tf.tanh(tf.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

# Compare with ReLU
x_vals = np.linspace(-3, 3, 200)
x_tf = tf.constant(x_vals, dtype=tf.float32)

relu_vals = tf.nn.relu(x_tf).numpy()
gelu_vals = gelu(x_tf).numpy()

plt.figure(figsize=(10, 6))
plt.plot(x_vals, relu_vals, label='ReLU', linewidth=2)
plt.plot(x_vals, gelu_vals, label='GELU', linewidth=2)
plt.axhline(0, color='k', linestyle='--', alpha=0.3)
plt.axvline(0, color='k', linestyle='--', alpha=0.3)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('ReLU vs GELU')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("GELU advantages:")
print("- Smooth (differentiable everywhere)")
print("- Non-monotonic (can decrease over part of x<0)")
print("- Widely used in Transformers")

### 🔹 Which activation should you use, and when?

| Task | Hidden Layers | Output Layer |
|------|---------------|-------------|
| **Computer Vision** | ReLU, Leaky ReLU | Softmax |
| **NLP (Transformers)** | GELU | Softmax |
| **Simple MLP** | ReLU | Depends on task |
| **Deep Networks** | ELU, SELU | - |

**Rule of thumb:**
- Start with **ReLU** (default)
- Try **Leaky ReLU** if dying ReLU problem
- Try **GELU** for Transformers/NLP
- Try **ELU** for very deep networks

---

In [None]:
# Demo: Compare activations on real data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X = X.astype(np.float32)
y = y.astype(np.int32)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def build_model(activation='relu'):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation=activation, input_shape=(20,)),
        tf.keras.layers.Dense(32, activation=activation),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

activations = ['relu', 'elu', tf.keras.layers.LeakyReLU(alpha=0.01)]
activation_names = ['ReLU', 'ELU', 'LeakyReLU']
histories = []

for act, name in zip(activations, activation_names):
    print(f"Training with {name}...")
    model = build_model(act)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    hist = model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)
    histories.append(hist)

# Plot comparison
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for hist, name in zip(histories, activation_names):
    plt.plot(hist.history['loss'], label=name)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
for hist, name in zip(histories, activation_names):
    plt.plot(hist.history['val_accuracy'], label=name)
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Validation Accuracy Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 2️⃣ Optimizers - Optimization Algorithms

### 🔹 Basic Gradient Descent

**Vanilla Gradient Descent:**
```
w = w - learning_rate × gradient
```

**Problems:**
- Fixed learning rate → inflexible
- Slow near saddle points
- Not adaptive: every parameter uses the same step size
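
The sensitivity to a fixed learning rate is easy to see on a toy problem. The sketch below runs the vanilla update on f(w) = w², whose gradient is 2w. The function name, the starting point, and the three learning rates are arbitrary choices made for illustration.

```python
def gradient_descent(lr, w0=5.0, steps=50):
    """Minimize f(w) = w**2 (gradient 2w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

# Too small, reasonable, and too large learning rates on the same problem
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4g}")
```

With `lr=0.001` the weight has barely moved from its starting value, with `lr=0.1` it is close to the minimum at 0, and with `lr=1.1` each step overshoots by more than it corrects, so the iterate diverges.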

---

### 🔹 1. SGD (Stochastic Gradient Descent)

**Formula:**
```
w = w - learning_rate × gradient   # gradient estimated on a mini-batch
```

**With momentum:**
```
v = momentum × v - learning_rate × gradient
w = w + v
```
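
As a rough illustration of the momentum rule, the following pure-Python sketch applies it to f(w) = w² and records the weight after each step. The function name and default values are assumptions chosen for the example.

```python
def sgd_momentum(lr=0.1, momentum=0.9, w0=1.0, steps=2):
    """Apply the momentum update to f(w) = w**2 and return the weight trajectory."""
    w, v = w0, 0.0
    trajectory = [w]
    for _ in range(steps):
        grad = 2.0 * w                      # gradient of w**2
        v = momentum * v - lr * grad        # velocity accumulates past gradients
        w = w + v
        trajectory.append(w)
    return trajectory

print("with momentum:   ", sgd_momentum())
print("without momentum:", sgd_momentum(momentum=0.0))
```

With momentum the second step is larger than the first, because the velocity from step one carries over. With `momentum=0.0` the rule reduces to plain SGD.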

**Characteristics:**
- ✅ Simple, stable
- ✅ Good for convex problems
- ❌ Slow convergence
- ❌ Learning rate needs careful tuning

**When to use:**
- Simple problems
- When stability matters more than speed

---

In [None]:
# SGD demo
model_sgd = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# SGD with momentum
optimizer_sgd = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9  # Momentum accumulates past gradients to speed up and smooth updates
)

model_sgd.compile(optimizer=optimizer_sgd, loss='binary_crossentropy', metrics=['accuracy'])
hist_sgd = model_sgd.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)

print(f"Final val accuracy: {hist_sgd.history['val_accuracy'][-1]:.4f}")

### 🔹 2. Adam (Adaptive Moment Estimation)

**Formula:**
```
m = β1 × m + (1-β1) × gradient       # 1st moment (mean)
v = β2 × v + (1-β2) × gradient²      # 2nd moment (uncentered variance)
m_hat = m / (1 - β1^t)               # Bias correction
v_hat = v / (1 - β2^t)
w = w - learning_rate × m_hat / (√v_hat + ε)
```
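
The update rule can be traced by hand on a single scalar parameter. The sketch below is an illustrative pure-Python version, not the Keras implementation. The function name `adam_minimize` and the two test gradients are assumptions made for the example.

```python
import math

def adam_minimize(grad_fn, w0, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Run Adam on a single scalar parameter, following the update rule above."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # 2nd moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# f(w) = w**2 and f(w) = 1000 * w**2 have very different gradient scales,
# yet the first Adam step moves w by roughly lr in both cases.
print(adam_minimize(lambda w: 2.0 * w, w0=1.0, steps=1))
print(adam_minimize(lambda w: 2000.0 * w, w0=1.0, steps=1))
```

Dividing by the square root of the second moment is what makes the step size roughly independent of the gradient's scale, which is the "adaptive learning rate per parameter" property.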

**Characteristics:**
- ✅ Adaptive learning rate per parameter
- ✅ Fast convergence
- ✅ Works well out-of-the-box
- ❌ May generalize worse than SGD

**Hyperparameters:**
- `learning_rate`: 0.001 (default)
- `beta_1`: 0.9 (first-moment decay, i.e. momentum)
- `beta_2`: 0.999 (second-moment decay)
- `epsilon`: 1e-7 (numerical stability)

**When to use:**
- **Default choice** for most problems
- When you want fast convergence
- Sparse gradients (NLP)

---

In [None]:
# Adam demo
model_adam = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

optimizer_adam = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # Default
    beta_1=0.9,
    beta_2=0.999
)

model_adam.compile(optimizer=optimizer_adam, loss='binary_crossentropy', metrics=['accuracy'])
hist_adam = model_adam.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)

print(f"Final val accuracy: {hist_adam.history['val_accuracy'][-1]:.4f}")

### 🔹 3. AdamW (Adam with Weight Decay)

**Difference from Adam:**
```
Adam:  w = w - lr × m_hat / (√v_hat + ε)
AdamW: w = w - lr × m_hat / (√v_hat + ε) - lr × λ × w  # Decoupled weight decay
```
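
A single hand-computed step shows what "decoupled" means. The sketch below compares Adam with an L2 term folded into the gradient against the AdamW form, at step t = 1 with m = v = 0. It assumes the L2 term contributes λ·w to the gradient (a penalty of (λ/2)·w²), and the helper name `first_adam_step` and all numeric values are illustrative.

```python
import math

def first_adam_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Return the Adam update (to be subtracted) at t = 1, starting from m = v = 0."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)          # bias correction with t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

w, loss_grad, lam, lr = 1.0, 10.0, 0.01, 0.001

# Adam + L2 penalty: the decay term lam * w is folded into the gradient,
# so it is rescaled by the adaptive denominator like everything else.
w_adam_l2 = w - first_adam_step(loss_grad + lam * w, lr=lr)

# AdamW: the Adam step sees only the loss gradient; decay is applied separately.
w_adamw = w - first_adam_step(loss_grad, lr=lr) - lr * lam * w

print(w_adam_l2, w_adamw)
```

In the first variant the penalty is absorbed by the normalization and the step is still about `lr`. In the second, the weight is shrunk by an extra `lr × λ × w` regardless of the gradient's scale.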

**Characteristics:**
- ✅ Often generalizes better than Adam
- ✅ Decoupled weight decay
- ✅ Widely used for training Transformers

**When to use:**
- Large models (Transformers, Vision Transformers)
- When you want better generalization than Adam

---

In [None]:
# AdamW demo
model_adamw = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Note: TensorFlow 2.11+ ships AdamW built in
try:
    optimizer_adamw = tf.keras.optimizers.AdamW(
        learning_rate=0.001,
        weight_decay=0.01  # Weight decay coefficient
    )
except AttributeError:
    # Fall back to Adam if AdamW is not available in this TF version
    print("AdamW not available in this TF version, using Adam")
    optimizer_adamw = tf.keras.optimizers.Adam(learning_rate=0.001)

model_adamw.compile(optimizer=optimizer_adamw, loss='binary_crossentropy', metrics=['accuracy'])
hist_adamw = model_adamw.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)

print(f"Final val accuracy: {hist_adamw.history['val_accuracy'][-1]:.4f}")

In [None]:
# Compare optimizers
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

histories_opt = [
    (hist_sgd, 'SGD'),
    (hist_adam, 'Adam'),
    (hist_adamw, 'AdamW')
]

# Loss
for hist, name in histories_opt:
    axes[0].plot(hist.history['loss'], label=name, linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
for hist, name in histories_opt:
    axes[1].plot(hist.history['val_accuracy'], label=name, linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- SGD: Slower but stable")
print("- Adam: Fast convergence")
print("- AdamW: Similar to Adam but better generalization (on larger datasets)")

---

## 3️⃣ Learning Rate Scheduling

### 🔹 Why use a learning rate schedule?

**Fixed LR:**
- Too high → overshoots, fails to converge
- Too low → too slow

**Dynamic LR:**
- Start high → fast progress
- Gradually decrease → fine-tune

### 🔹 Common strategies
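
Before using the Keras schedule classes, it can help to see the arithmetic behind three common strategies. The sketch below is a pure-Python approximation of exponential, cosine, and piecewise-constant decay. The function names are illustrative, and the formulas reflect the default behavior of the corresponding Keras schedules as I understand it (no staircase, final cosine value of 0), so treat them as an assumption and not a reference implementation.

```python
import bisect
import math

def exponential_decay(step, initial_lr=0.1, decay_steps=1000, decay_rate=0.96):
    """Smooth exponential decay: scaled by decay_rate once per decay_steps steps."""
    return initial_lr * decay_rate ** (step / decay_steps)

def cosine_decay(step, initial_lr=0.1, decay_steps=1000):
    """Half-cosine from initial_lr down to 0, then flat at 0."""
    progress = min(step, decay_steps) / decay_steps
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def step_decay(step, boundaries=(500, 1000, 1500), values=(0.1, 0.01, 0.001, 0.0001)):
    """Piecewise-constant schedule: values[i] applies up to and including boundaries[i]."""
    return values[bisect.bisect_left(boundaries, step)]

# Evaluate each schedule at a few training steps
for step in (0, 500, 1000, 1999):
    print(step, exponential_decay(step), cosine_decay(step), step_decay(step))
```

Note that the cosine schedule reaches 0 at `decay_steps` and stays there, so `decay_steps` should normally cover the whole training run.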

In [None]:
# 1. Exponential Decay
lr_schedule_exp = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96  # LR = initial_lr × 0.96^(step / decay_steps)
)

# 2. Cosine Decay
lr_schedule_cos = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000
)

# 3. Piecewise Constant (Step Decay)
lr_schedule_step = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[500, 1000, 1500],
    values=[0.1, 0.01, 0.001, 0.0001]
)

# Visualize
steps = np.arange(0, 2000)
lr_exp = [lr_schedule_exp(s).numpy() for s in steps]
lr_cos = [lr_schedule_cos(s).numpy() for s in steps]
lr_step = [lr_schedule_step(s).numpy() for s in steps]

plt.figure(figsize=(12, 5))
plt.plot(steps, lr_exp, label='Exponential Decay', linewidth=2)
plt.plot(steps, lr_cos, label='Cosine Decay', linewidth=2)
plt.plot(steps, lr_step, label='Step Decay', linewidth=2)
plt.xlabel('Training Steps')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("When to use:")
print("- Exponential: Smooth, gradual decrease")
print("- Cosine: Popular for vision tasks")
print("- Step: Simple, interpretable")

In [None]:
# Demo: Train with LR schedule
model_scheduled = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

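# Create LR schedule. decay_steps=1000 matches the full run here:
# 640 training samples / batch size 32 = 20 steps per epoch, × 50 epochs.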
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000
)

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
model_scheduled.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train
hist_scheduled = model_scheduled.fit(
    X_train, y_train,
    epochs=50,
    validation_split=0.2,
    verbose=0
)

print(f"Final val accuracy: {hist_scheduled.history['val_accuracy'][-1]:.4f}")

---

## 4️⃣ Regularization - Fighting Overfitting

### 🔹 1. L2 Regularization (Weight Decay)

**Idea:** Penalize large weights

**Formula:**
```
Loss_total = Loss_original + λ × Σ(w²)
```

**Effect:**
- Weights tend to be small → simpler model
- Reduces overfitting
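
The penalty and its effect on the weights can be checked with a few lines of arithmetic. This is a minimal pure-Python sketch. The helper names and the sample weights are made up for illustration, and the step shown uses plain SGD on the penalty term alone.

```python
def l2_penalty(weights, lam):
    """lam * sum(w^2), the term added to the loss."""
    return lam * sum(w * w for w in weights)

def sgd_step_penalty_only(weights, lam, lr):
    """One SGD step on the penalty alone; its gradient for each weight is 2 * lam * w."""
    return [w - lr * 2.0 * lam * w for w in weights]

weights = [1.0, -2.0, 3.0]
print(l2_penalty(weights, lam=0.01))
print(sgd_step_penalty_only(weights, lam=0.01, lr=0.1))
```

Each weight is multiplied by `1 - 2 × lr × λ` per step, so larger weights are pulled toward zero by a proportionally larger amount. This is where the name "weight decay" comes from.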

---

In [None]:
# Model WITHOUT L2
model_no_l2 = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Model WITH L2
model_with_l2 = tf.keras.Sequential([
    tf.keras.layers.Dense(
        128, 
        activation='relu', 
        kernel_regularizer=tf.keras.regularizers.L2(0.01),  # L2 regularization
        input_shape=(20,)
    ),
    tf.keras.layers.Dense(
        64, 
        activation='relu',
        kernel_regularizer=tf.keras.regularizers.L2(0.01)
    ),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile & train
model_no_l2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_with_l2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

print("Training without L2...")
hist_no_l2 = model_no_l2.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

print("Training with L2...")
hist_with_l2 = model_with_l2.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(hist_no_l2.history['loss'], label='Train (No L2)', linestyle='--')
axes[0].plot(hist_no_l2.history['val_loss'], label='Val (No L2)', linestyle='--')
axes[0].plot(hist_with_l2.history['loss'], label='Train (With L2)', linewidth=2)
axes[0].plot(hist_with_l2.history['val_loss'], label='Val (With L2)', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('L2 Regularization Effect on Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(hist_no_l2.history['val_accuracy'], label='Val (No L2)', linestyle='--')
axes[1].plot(hist_with_l2.history['val_accuracy'], label='Val (With L2)', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservation:")
print("With L2: Smaller gap between train & val loss → less overfitting")

### 🔹 2. Dropout

**Idea:** Randomly "drop" neurons during training

**How it works:**
```
Training:   Randomly drop neurons (e.g., 50%)
Inference:  Use all neurons
```

**Effect:**
- Prevents co-adaptation
- Acts like an implicit ensemble of sub-networks
- Often very effective in large networks

**Rate:**
- 0.2-0.5 for hidden layers
- Higher rate → more regularization
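
The training/inference difference can be sketched in pure Python. The version below uses the "inverted dropout" convention, in which surviving activations are scaled by 1 / (1 - rate) during training so that nothing needs rescaling at inference. The Keras `Dropout` layer documents the same scaling. The function signature and the seeded `random.Random` generator are assumptions made for the example.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate` during training
    and scale the survivors by 1 / (1 - rate); pass values through at inference."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10
print(dropout(acts, rate=0.5, training=True, rng=rng))   # mix of 0.0 and 2.0
print(dropout(acts, rate=0.5, training=False, rng=rng))  # unchanged
```

Because of the scaling, the expected value of each activation is the same in both modes, which is why the model can use all neurons at inference without any correction.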

---

In [None]:
# Model WITHOUT Dropout
model_no_dropout = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Model WITH Dropout
model_with_dropout = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),  # Drop 50% neurons
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),  # Drop 30% neurons
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile & train
model_no_dropout.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_with_dropout.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

print("Training without Dropout...")
hist_no_dropout = model_no_dropout.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

print("Training with Dropout...")
hist_with_dropout = model_with_dropout.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(hist_no_dropout.history['loss'], label='Train (No Dropout)', linestyle='--')
axes[0].plot(hist_no_dropout.history['val_loss'], label='Val (No Dropout)', linestyle='--')
axes[0].plot(hist_with_dropout.history['loss'], label='Train (With Dropout)', linewidth=2)
axes[0].plot(hist_with_dropout.history['val_loss'], label='Val (With Dropout)', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Dropout Effect on Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(hist_no_dropout.history['val_accuracy'], label='Val (No Dropout)', linestyle='--')
axes[1].plot(hist_with_dropout.history['val_accuracy'], label='Val (With Dropout)', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey observations:")
print("- Training is slower with Dropout (expected)")
print("- Validation performance better/more stable")
print("- Smaller train-val gap → less overfitting")

### 🔹 L2 vs Dropout - Which one, and when?

| Method | Pros | Cons | When to use |
|--------|------|------|-------------|
| **L2** | Simple, always applicable | Weaker effect | Small/medium networks |
| **Dropout** | Very effective | Slower training | Large networks, Computer Vision |
| **Both** | Best generalization | Most compute | Production models |

**Best practice:**
- Start with Dropout
- Add L2 if still overfitting
- Adjust rates based on validation performance

---

## 5️⃣ Putting It All Together

### 🔹 Production-ready model template

In [None]:
def build_production_model(
    input_shape,
    num_classes,
    hidden_units=(128, 64),  # tuple avoids a shared mutable default argument
    activation='relu',
    dropout_rate=0.3,
    l2_reg=0.01,
    use_batch_norm=False
):
    """
    Build production-ready model with best practices
    
    Args:
        input_shape: Input shape tuple
        num_classes: Number of output classes
        hidden_units: List of hidden layer sizes
        activation: Activation function
        dropout_rate: Dropout rate (0 to disable)
        l2_reg: L2 regularization coefficient (0 to disable)
        use_batch_norm: Whether to use batch normalization
    """
    model = tf.keras.Sequential()
    
    # Input (tf.keras.Input works with both Keras 2 and Keras 3)
    model.add(tf.keras.Input(shape=input_shape))
    
    # Hidden layers
    for units in hidden_units:
        # Dense layer with L2 regularization
        regularizer = tf.keras.regularizers.L2(l2_reg) if l2_reg > 0 else None
        model.add(tf.keras.layers.Dense(
            units,
            kernel_regularizer=regularizer
        ))
        
        # Batch normalization (optional)
        if use_batch_norm:
            model.add(tf.keras.layers.BatchNormalization())
        
        # Activation
        model.add(tf.keras.layers.Activation(activation))
        
        # Dropout (optional)
        if dropout_rate > 0:
            model.add(tf.keras.layers.Dropout(dropout_rate))
    
    # Output layer
    if num_classes == 2:
        model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    else:
        model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
    
    return model

# Example usage
model_prod = build_production_model(
    input_shape=(20,),
    num_classes=2,
    hidden_units=[128, 64, 32],
    activation='relu',
    dropout_rate=0.3,
    l2_reg=0.01,
    use_batch_norm=True
)

model_prod.summary()  # summary() prints directly and returns None

In [None]:
# Train production model
# LR schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.01,
    decay_steps=1000
)

# Optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# Compile
model_prod.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train
history_prod = model_prod.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(history_prod.history['loss'], label='Train')
plt.plot(history_prod.history['val_loss'], label='Val')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Production Model - Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(history_prod.history['accuracy'], label='Train')
plt.plot(history_prod.history['val_accuracy'], label='Val')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Production Model - Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final val accuracy: {history_prod.history['val_accuracy'][-1]:.4f}")

---

## 6️⃣ Exercises

### 📝 Exercise 1: Hyperparameter Tuning

Try different combinations:
- Activations: ReLU, LeakyReLU, ELU
- Optimizers: SGD, Adam, AdamW
- Dropout rates: 0, 0.3, 0.5

Find best combination for validation accuracy.

In [None]:
# YOUR CODE HERE
# TODO: Grid search or random search

### 📝 Exercise 2: Custom LR Schedule

Implement warmup + cosine decay:
- Warmup: LR increases from 0 → max over 5 epochs
- Cosine decay: afterwards, LR decreases along a cosine curve

In [None]:
# YOUR CODE HERE
# TODO: Implement custom LR schedule class

### 📝 Exercise 3: Regularization Comparison

Compare 4 models:
1. No regularization
2. L2 only
3. Dropout only
4. L2 + Dropout

Plot validation curves and analyze.

In [None]:
# YOUR CODE HERE
# TODO: Train 4 models and compare

---

## 🎯 Summary

### ✅ What you learned

1. **Activations**: ReLU, LeakyReLU, ELU, GELU
2. **Optimizers**: SGD, Adam, AdamW
3. **LR Scheduling**: Exponential, Cosine, Step
4. **Regularization**: L2, Dropout

### 🎓 Decision Guide

**Activation:**
```
Default: ReLU
Dying ReLU problem: LeakyReLU
Transformers/NLP: GELU
```

**Optimizer:**
```
Default: Adam (lr=0.001)
Large models: AdamW
Need stability: SGD with momentum
```

**Regularization:**
```
First choice: Dropout (0.2-0.5)
If still overfitting: Add L2 (0.001-0.01)
```

### 📚 Next Steps

- **File 2-C**: CNN & Callbacks

---

## 📖 References

- [Keras Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
- [Keras Activations](https://www.tensorflow.org/api_docs/python/tf/keras/activations)
- [Dropout Paper](https://jmlr.org/papers/v15/srivastava14a.html)

---

**Happy learning! 🚀**