# Problem 4: Gradient Descent - Rolling the Ball Downhill

## Learning Objectives
By the end of this problem, you will:
- Understand gradient descent as the fundamental optimization algorithm
- Calculate gradients by hand and see how they guide weight updates
- Explore how learning rate affects convergence
- Watch gradient descent automatically find optimal weights for "Go Dolphins!"

## Task Overview

1. **Manual Gradient Calculation** - Compute gradients by hand to understand the math
2. **Gradient Descent Implementation** - Build the algorithm from scratch
3. **Learning Rate Exploration** - See how step size affects optimization
4. **Full Training Loop** - Watch the model learn optimal weights automatically

---

## The Story Continues

In Problem 3, you discovered the loss landscape - a mathematical terrain where:
- **Valleys** represent good weights (low loss)
- **Hills** represent bad weights (high loss)
- **The goal** is to find the deepest valley (optimal weights)

But here's the crucial question: **How do we automatically navigate this landscape to find the optimal weights?**

The answer is **gradient descent** - an algorithm that repeatedly steps in the steepest downhill direction, like a ball rolling to the bottom of a valley. It's the optimization engine behind most of modern machine learning.

## What Is Gradient Descent?

**Physical intuition**: Imagine you're in thick fog on a hillside and want to reach the bottom. You can't see the valley, but you can feel which direction slopes downward most steeply. Take a step in that direction, feel the slope again, repeat.

**Mathematical definition**: 
- **Gradient** = direction of steepest increase in loss
- **Negative gradient** = direction of steepest decrease (downhill)
- **Step** = adjust weights slightly in the downhill direction

**The algorithm**:
```
1. Calculate gradient of loss with respect to weights
2. Move weights in the opposite direction (downhill)
3. Repeat until you reach the bottom
```
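As a minimal sketch of these three steps, assume a toy one-dimensional loss `L(w) = (w - 3)^2`. This loss, the function names, and the stopping tolerance are chosen here purely for illustration; they are not the tweet classifier's loss. The loop repeats the gradient/step cycle until the slope is nearly flat:

```python
def toy_loss(w):
    """Illustrative bowl-shaped loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def toy_gradient(w):
    """Derivative of toy_loss with respect to w."""
    return 2.0 * (w - 3.0)

def descend(w, learning_rate=0.1, max_steps=1000, tolerance=1e-8):
    """Repeat: compute the gradient, step the opposite way, stop when nearly flat."""
    for _ in range(max_steps):
        grad = toy_gradient(w)           # 1. gradient of the loss at the current w
        if abs(grad) < tolerance:        # 3. stop once the slope is almost zero
            break
        w = w - learning_rate * grad     # 2. move against the gradient (downhill)
    return w

w_final = descend(w=0.0)
print(f"w = {w_final:.6f}, loss = {toy_loss(w_final):.2e}")
```

The same loop structure carries over to the classifier later in this problem; only the loss and its gradient change.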

In [None]:
# Setup and imports
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple, Callable

# Import our utilities
import sys
sys.path.append('./utils')
from data_generators import load_sports_dataset
from gradient_helpers import numerical_gradient, analytical_gradient_mse, gradient_descent_with_history
from visualization import plot_gradient_descent_path

# Load our data
features, labels, feature_names, texts = load_sports_dataset()

print("Ready to optimize 'Go Dolphins!' sentiment classification!")
print(f"Goal: Find optimal weights for {len(texts)} tweets")
print("Starting point: random weights that must be learned from data")
print(f"Method: Gradient descent optimization")

# Define our loss and activation functions
def sigmoid(x):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def binary_cross_entropy(prediction, actual):
    """Binary cross-entropy loss"""
    prediction = np.clip(prediction, 1e-15, 1 - 1e-15)
    if actual == 1:
        return -np.log(prediction)
    else:
        return -np.log(1 - prediction)

def compute_loss_and_prediction(features_single, label_single, weights):
    """Compute prediction and loss for a single example"""
    raw_prediction = np.dot(features_single, weights)
    probability = sigmoid(raw_prediction)
    loss = binary_cross_entropy(probability, label_single)
    return probability, loss, raw_prediction

## Task 1: Manual Gradient Calculation

Before we automate gradient descent, let's understand the math by calculating gradients by hand for our "Go Dolphins!" example.

In [None]:
# Step-by-step gradient calculation for "Go Dolphins!"
go_dolphins_features = features[0]  # [2, 1, 1]
go_dolphins_label = labels[0]       # 1 (positive)

# Start with some initial weights
initial_weights = np.array([0.1, 0.2, 0.1])

print("MANUAL GRADIENT CALCULATION FOR 'GO DOLPHINS!'")
print("=" * 55)
print(f"Input features: {go_dolphins_features} (word_count, has_team, has_exclamation)")
print(f"True label: {go_dolphins_label} (positive sentiment)")
print(f"Initial weights: {initial_weights}")
print()

# Step 1: Forward pass
raw_pred = np.dot(go_dolphins_features, initial_weights)
probability = sigmoid(raw_pred)
loss = binary_cross_entropy(probability, go_dolphins_label)

print("FORWARD PASS:")
print(f"1. Raw prediction (z): {go_dolphins_features} · {initial_weights} = {raw_pred:.4f}")
print(f"2. Probability (σ(z)): sigmoid({raw_pred:.4f}) = {probability:.4f}")
print(f"3. Loss: -log({probability:.4f}) = {loss:.4f}")
print()

# Step 2: Manual gradient calculation using chain rule
print("GRADIENT CALCULATION (Chain Rule):")
print("Goal: Find ∂Loss/∂w for each weight")
print()

# Chain rule: ∂Loss/∂w = (∂Loss/∂prob) × (∂prob/∂z) × (∂z/∂w)

# ∂Loss/∂prob (derivative of BCE loss)
dloss_dprob = -1/probability  # For positive examples
print(f"∂Loss/∂probability = -1/{probability:.4f} = {dloss_dprob:.4f}")

# ∂prob/∂z (derivative of sigmoid)
dprob_dz = probability * (1 - probability)
print(f"∂probability/∂z = {probability:.4f} × (1 - {probability:.4f}) = {dprob_dz:.4f}")

# ∂z/∂w (derivative of dot product)
dz_dw = go_dolphins_features  # This is just the input features!
print(f"∂z/∂w = {dz_dw} (input features)")

# Combine using chain rule
dloss_dz = dloss_dprob * dprob_dz
gradient = dloss_dz * dz_dw

print(f"\nCombined:")
print(f"∂Loss/∂z = {dloss_dprob:.4f} × {dprob_dz:.4f} = {dloss_dz:.4f}")
print(f"∂Loss/∂w = {dloss_dz:.4f} × {dz_dw} = {gradient}")
print()

print("INTERPRETATION:")
print(f"Gradient: {gradient}")
print("This tells us:")
for feature_name, grad_val in zip(feature_names, gradient):
    # Gradient descent steps AGAINST the gradient: a negative gradient means the weight goes up
    direction = "decrease" if grad_val > 0 else "increase"
    print(f"  - {direction} {feature_name} weight (gradient: {grad_val:+.4f})")

print(f"\nNote: All gradients are negative because our prediction ({probability:.4f}) is too low")
print(f"compared to the true label ({go_dolphins_label}). Stepping against the gradient increases all weights!")
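
The three chain-rule factors in this cell can also be multiplied out symbolically. The derivation below is the standard result for a sigmoid output with binary cross-entropy loss, written for a single example with label $y$, features $\mathbf{x}$, and weights $\mathbf{w}$ (symbols introduced here for the derivation):

```latex
\begin{aligned}
L &= -\bigl[\,y \log p + (1 - y)\log(1 - p)\,\bigr], \qquad p = \sigma(z), \qquad z = \mathbf{w}\cdot\mathbf{x} \\
\frac{\partial L}{\partial p} &= -\frac{y}{p} + \frac{1 - y}{1 - p}, \qquad
\frac{\partial p}{\partial z} = p\,(1 - p), \qquad
\frac{\partial z}{\partial \mathbf{w}} = \mathbf{x} \\
\frac{\partial L}{\partial \mathbf{w}} &= \left(-\frac{y}{p} + \frac{1 - y}{1 - p}\right) p\,(1 - p)\,\mathbf{x} \;=\; (p - y)\,\mathbf{x}
\end{aligned}
```

The sign of each gradient component is therefore set by the prediction error $(p - y)$ and the sign of the corresponding feature.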

In [None]:
# Verify manual calculation with numerical gradient
def loss_function_single_example(weights):
    """Loss function for gradient verification"""
    _, loss, _ = compute_loss_and_prediction(go_dolphins_features, go_dolphins_label, weights)
    return loss

# Calculate numerical gradient (finite differences)
numerical_grad = numerical_gradient(loss_function_single_example, initial_weights)

print("GRADIENT VERIFICATION:")
print("=" * 30)
print(f"Manual calculation:    {gradient}")
print(f"Numerical calculation: {numerical_grad}")
print(f"Difference:            {np.abs(gradient - numerical_grad)}")
print(f"Max difference:        {np.max(np.abs(gradient - numerical_grad)):.2e}")

if np.max(np.abs(gradient - numerical_grad)) < 1e-6:
    print("✅ Manual calculation is correct!")
else:
    print("❌ Check manual calculation - there might be an error")

# Show what a gradient descent step would look like
learning_rate = 0.1
new_weights = initial_weights - learning_rate * gradient

print(f"\nGRADIENT DESCENT STEP:")
print(f"Learning rate: {learning_rate}")
print(f"Weight update: w_new = w_old - α × gradient")
print(f"New weights: {initial_weights} - {learning_rate} × {gradient}")
print(f"           = {new_weights}")

# Check if this improved our prediction
new_prob, new_loss, _ = compute_loss_and_prediction(go_dolphins_features, go_dolphins_label, new_weights)
print(f"\nIMPROVEMENT CHECK:")
print(f"Old prediction: {probability:.4f}, loss: {loss:.4f}")
print(f"New prediction: {new_prob:.4f}, loss: {new_loss:.4f}")
print(f"Loss change: {new_loss - loss:+.4f} ({'✅ improved!' if new_loss < loss else '❌ got worse'})")
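
`numerical_gradient` is imported from `gradient_helpers`, and its implementation is not shown in this notebook. The block below is a minimal central-difference sketch of what such a helper might look like. The name `central_difference_gradient`, the step size `eps=1e-5`, and the stand-alone loss function are assumptions for illustration, not the helper's actual code.

```python
import numpy as np

def central_difference_gradient(f, w, eps=1e-5):
    """Approximate the gradient of a scalar function f at w, one coordinate at a time."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        # Symmetric difference: (f(w + eps) - f(w - eps)) / (2 * eps)
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Stand-alone check using the "Go Dolphins!" numbers from this notebook
x = np.array([2.0, 1.0, 1.0])      # assumed features, per the comment on features[0]
w0 = np.array([0.1, 0.2, 0.1])     # same initial weights as the manual calculation

def bce_positive(w):
    """Cross-entropy loss for a single positive (label = 1) example."""
    p = 1.0 / (1.0 + np.exp(-np.dot(x, w)))
    return -np.log(p)

p0 = 1.0 / (1.0 + np.exp(-np.dot(x, w0)))
analytic = (p0 - 1.0) * x          # chain-rule result (p - y) * x with y = 1
numeric = central_difference_gradient(bce_positive, w0)

print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

A central difference is used rather than a one-sided one because its error shrinks with the square of `eps`, which makes a tight agreement tolerance realistic.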

## Task 2: Gradient Descent Implementation

Now let's implement the full gradient descent algorithm and watch it automatically optimize our weights.

In [None]:
# Implement gradient descent from scratch

def compute_batch_gradient(features_batch, labels_batch, weights):
    """
    Compute gradient across all training examples.
    """
    total_gradient = np.zeros_like(weights)
    total_loss = 0.0
    
    for i in range(len(features_batch)):
        # Forward pass
        raw_pred = np.dot(features_batch[i], weights)
        probability = sigmoid(raw_pred)
        loss = binary_cross_entropy(probability, labels_batch[i])
        
        # Backward pass (gradient calculation)
        # Chain rule: ∂Loss/∂w = (∂Loss/∂prob) × (∂prob/∂z) × (∂z/∂w)
        
        if labels_batch[i] == 1:
            dloss_dprob = -1/probability
        else:
            dloss_dprob = 1/(1-probability)
        
        dprob_dz = probability * (1 - probability)
        dz_dw = features_batch[i]
        
        gradient = dloss_dprob * dprob_dz * dz_dw
        
        total_gradient += gradient
        total_loss += loss
    
    # Return average gradient and loss
    avg_gradient = total_gradient / len(features_batch)
    avg_loss = total_loss / len(features_batch)
    
    return avg_gradient, avg_loss

def gradient_descent(features_batch, labels_batch, initial_weights, 
                    learning_rate=0.1, num_iterations=100, verbose=True):
    """
    Run gradient descent optimization.
    """
    weights = initial_weights.copy()
    weight_history = [weights.copy()]
    loss_history = []
    gradient_norms = []
    
    if verbose:
        print(f"Starting gradient descent with learning rate {learning_rate}")
        print(f"Initial weights: {weights}")
        print()
    
    for iteration in range(num_iterations):
        # Compute gradient and loss
        gradient, avg_loss = compute_batch_gradient(features_batch, labels_batch, weights)
        gradient_norm = np.linalg.norm(gradient)
        
        # Store history
        loss_history.append(avg_loss)
        gradient_norms.append(gradient_norm)
        
        # Update weights
        weights = weights - learning_rate * gradient
        weight_history.append(weights.copy())
        
        # Print progress
        if verbose and (iteration % 20 == 0 or iteration < 10):
            print(f"Iteration {iteration:3d}: Loss = {avg_loss:.4f}, |Gradient| = {gradient_norm:.4f}, Weights = {weights}")
        
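        # Check for convergence (this iteration's weight update has already been applied)
        if gradient_norm < 1e-6:
            if verbose:
                print(f"\nConverged after {iteration + 1} iterations!")
            break
    
    if verbose:
        print(f"\nFinal weights: {weights}")
        if loss_history:
            # Loss measured at the start of the last iteration, before its weight update
            print(f"Last recorded loss: {loss_history[-1]:.4f}")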
    
    return weights, weight_history, loss_history, gradient_norms

# Run gradient descent on our sports tweets
print("TRAINING 'GO DOLPHINS!' SENTIMENT CLASSIFIER")
print("=" * 50)

# Start with random weights
np.random.seed(42)  # For reproducibility
initial_weights = np.random.normal(0, 0.1, 3)
print(f"Random initial weights: {initial_weights}")
print()

# Run optimization
optimal_weights, weight_history, loss_history, gradient_norms = gradient_descent(
    features, labels, initial_weights, learning_rate=0.5, num_iterations=100
)
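
The per-example loop in `compute_batch_gradient` divides by `probability` and by `1 - probability`, which can misbehave if a prediction saturates at exactly 0 or 1. For a sigmoid output with cross-entropy loss, the three chain-rule factors multiply out to `(p - y) * x`, so the same average gradient can be computed without any division. The sketch below relies on that simplification; the function names and the toy arrays are hypothetical and are not part of the notebook's utilities.

```python
import numpy as np

def batch_gradient_simplified(X, y, w):
    """Average BCE gradient for a sigmoid model, using dL/dw = (p - y) * x."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
    # Each row of X is weighted by its prediction error, then averaged
    return X.T @ (p - y) / len(y)

def batch_gradient_loop(X, y, w):
    """Reference version: explicit three-factor chain rule, one example at a time."""
    total = np.zeros_like(w, dtype=float)
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-np.dot(xi, w)))
        dloss_dprob = -1.0 / p if yi == 1 else 1.0 / (1.0 - p)
        total += dloss_dprob * p * (1.0 - p) * xi
    return total / len(y)

# Toy data made up for this check (not the sports dataset)
X_toy = np.array([[2, 1, 1], [2, 0, 0], [3, 1, 1], [3, 0, 0]], dtype=float)
y_toy = np.array([1, 0, 1, 0])
w_toy = np.array([0.1, 0.2, 0.1])

print("simplified:", batch_gradient_simplified(X_toy, y_toy, w_toy))
print("loop:      ", batch_gradient_loop(X_toy, y_toy, w_toy))
```

Both functions should print the same vector up to floating-point rounding; the simplified form stays finite even when a prediction is extremely confident.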

In [None]:
# Visualize the optimization process
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

iterations = range(len(loss_history))

# Plot 1: Loss over time
axes[0, 0].plot(iterations, loss_history, 'b-', linewidth=2, marker='o', markersize=4)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Average Loss')
axes[0, 0].set_title('Loss Reduction During Training')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_yscale('log')  # Log scale to see improvement better

# Plot 2: Gradient magnitude over time
axes[0, 1].plot(iterations, gradient_norms, 'r-', linewidth=2, marker='s', markersize=4)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Gradient Magnitude')
axes[0, 1].set_title('Gradient Magnitude (Convergence Indicator)')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_yscale('log')

# Plot 3: Weight evolution
weight_history_array = np.array(weight_history)
for i, feature_name in enumerate(feature_names):
    axes[1, 0].plot(range(len(weight_history)), weight_history_array[:, i], 
                   linewidth=2, marker='o', markersize=3, label=f'{feature_name}')

axes[1, 0].set_xlabel('Iteration')
axes[1, 0].set_ylabel('Weight Value')
axes[1, 0].set_title('Weight Evolution During Training')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].legend()

# Plot 4: Final predictions
final_predictions = []
for i in range(len(features)):
    prob, _, _ = compute_loss_and_prediction(features[i], labels[i], optimal_weights)
    final_predictions.append(prob)

final_predictions = np.array(final_predictions)
pos_mask = labels == 1
neg_mask = labels == 0

# Use each tweet's actual position in the dataset so the x-axis really is the tweet index
axes[1, 1].scatter(np.flatnonzero(pos_mask), final_predictions[pos_mask], 
                  c='green', s=100, alpha=0.7, label='Positive tweets')
axes[1, 1].scatter(np.flatnonzero(neg_mask), final_predictions[neg_mask], 
                  c='red', s=100, alpha=0.7, label='Negative tweets')
axes[1, 1].axhline(y=0.5, color='black', linestyle='--', alpha=0.7, label='Decision threshold')
axes[1, 1].set_xlabel('Tweet Index')
axes[1, 1].set_ylabel('Predicted Probability')
axes[1, 1].set_title('Final Predictions After Training')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate final accuracy
final_binary_predictions = (final_predictions > 0.5).astype(int)
accuracy = np.mean(final_binary_predictions == labels)

print(f"\nTRAINING RESULTS:")
print(f"Final accuracy: {accuracy:.1%}")
print(f"Initial loss: {loss_history[0]:.4f}")
print(f"Final loss: {loss_history[-1]:.4f}")
print(f"Loss reduction: {loss_history[0] - loss_history[-1]:.4f}")
print(f"Training iterations: {len(loss_history)}")

## Task 3: Learning Rate Exploration

The learning rate is crucial for gradient descent. Let's explore how different step sizes affect the optimization process.
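
To build intuition before running the real experiment, consider a toy loss `L(w) = w^2` (an illustrative assumption, not the classifier's loss). Its gradient is `2w`, so every update multiplies `w` by `(1 - 2α)`: the iterates shrink when that factor has magnitude below 1, flip sign forever when it equals -1, and grow when its magnitude exceeds 1. The function name and the chosen rates below are hypothetical.

```python
def run_toy_descent(learning_rate, w_start=1.0, num_steps=20):
    """Gradient descent on L(w) = w**2; returns the list of visited w values."""
    w = w_start
    path = [w]
    for _ in range(num_steps):
        w = w - learning_rate * 2.0 * w   # gradient of w**2 is 2w
        path.append(w)
    return path

for alpha in [0.01, 0.1, 0.5, 1.0, 1.1]:
    final_w = run_toy_descent(alpha)[-1]
    print(f"alpha = {alpha:<4}: |w| after 20 steps = {abs(final_w):.4g}")
```

On this toy loss the stability boundary is exactly `α = 1`. For the tweet classifier the boundary depends on the data, which is what the comparison in this task measures.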

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.0, 2.0]
lr_results = {}

print("LEARNING RATE COMPARISON")
print("=" * 40)

for lr in learning_rates:
    print(f"\nTesting learning rate: {lr}")
    print("-" * 30)
    
    try:
        weights, weight_hist, loss_hist, grad_norms = gradient_descent(
            features, labels, initial_weights.copy(), 
            learning_rate=lr, num_iterations=100, verbose=False
        )
        
        # Calculate final accuracy
        final_preds = []
        for i in range(len(features)):
            prob, _, _ = compute_loss_and_prediction(features[i], labels[i], weights)
            final_preds.append(prob)
        
        final_accuracy = np.mean((np.array(final_preds) > 0.5) == labels)
        
        lr_results[lr] = {
            'final_weights': weights,
            'loss_history': loss_hist,
            'final_loss': loss_hist[-1],
            'final_accuracy': final_accuracy,
            'iterations': len(loss_hist),
            # Looser threshold than the 1e-6 early-stopping rule inside gradient_descent
            'converged': grad_norms[-1] < 1e-3,
            'stable': not (np.any(np.isnan(weights)) or np.any(np.isinf(weights)))
        }
        
        status = "✅ Converged" if lr_results[lr]['converged'] else "⚠️  Did not converge"
        if not lr_results[lr]['stable']:
            status = "❌ Unstable (NaN/Inf)"
        
        print(f"Final loss: {lr_results[lr]['final_loss']:.4f}")
        print(f"Final accuracy: {lr_results[lr]['final_accuracy']:.1%}")
        print(f"Iterations: {lr_results[lr]['iterations']}")
        print(f"Status: {status}")
        
    except Exception as e:
        print(f"❌ Failed: {str(e)}")
        lr_results[lr] = None

# Visualize learning rate effects
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Loss curves for different learning rates
colors = ['blue', 'green', 'orange', 'red', 'purple']
for i, lr in enumerate(learning_rates):
    if lr_results[lr] is not None and lr_results[lr]['stable']:
        loss_hist = lr_results[lr]['loss_history']
        axes[0].plot(range(len(loss_hist)), loss_hist, 
                    color=colors[i], linewidth=2, label=f'α = {lr}')

axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Curves for Different Learning Rates')
axes[0].set_yscale('log')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Final performance summary
valid_lrs = [lr for lr in learning_rates if lr_results[lr] is not None and lr_results[lr]['stable']]
final_losses = [lr_results[lr]['final_loss'] for lr in valid_lrs]
final_accuracies = [lr_results[lr]['final_accuracy'] for lr in valid_lrs]

axes[1].scatter(valid_lrs, final_losses, s=100, alpha=0.7, color='red', label='Final Loss')
ax2 = axes[1].twinx()
ax2.scatter(valid_lrs, final_accuracies, s=100, alpha=0.7, color='blue', label='Final Accuracy')

axes[1].set_xlabel('Learning Rate')
axes[1].set_ylabel('Final Loss', color='red')
ax2.set_ylabel('Final Accuracy', color='blue')
axes[1].set_title('Learning Rate vs Final Performance')
axes[1].set_xscale('log')
axes[1].grid(True, alpha=0.3)

# Add legends
axes[1].legend(loc='upper left')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

In [None]:
# Analyze learning rate effects in detail
print("\nLEARNING RATE ANALYSIS")
print("=" * 30)

# Find the best learning rate
valid_results = [(lr, result) for lr, result in lr_results.items() 
                if result is not None and result['stable']]

if valid_results:
    # Sort by final loss (lower is better)
    valid_results.sort(key=lambda x: x[1]['final_loss'])
    
    print("Learning rates ranked by final loss (best to worst):")
    for i, (lr, result) in enumerate(valid_results, 1):
        converged = "✓" if result['converged'] else "✗"
        print(f"{i}. α = {lr:<4} | Loss: {result['final_loss']:.4f} | Accuracy: {result['final_accuracy']:.1%} | Converged: {converged}")
    
    best_lr, best_result = valid_results[0]
    print(f"\n🏆 Best learning rate: {best_lr}")
    print(f"   Final loss: {best_result['final_loss']:.4f}")
    print(f"   Final accuracy: {best_result['final_accuracy']:.1%}")
    print(f"   Optimal weights: {best_result['final_weights']}")

print("\nLEARNING RATE INSIGHTS (rules of thumb; compare with the ranking above):")
print("1. Too small (e.g. 0.01): Slow convergence, many iterations needed")
print("2. Moderate (e.g. 0.1-0.5): Often fast and stable")
print("3. Large (e.g. 1.0): May overshoot or oscillate")
print("4. Very large (e.g. 2.0+): May oscillate or diverge")

# Show what happens with extreme learning rates
print("\nWHY LEARNING RATE MATTERS:")

for lr in learning_rates:
    if lr_results[lr] is None:
        print(f"α = {lr}: ❌ Training failed completely")
    elif not lr_results[lr]['stable']:
        print(f"α = {lr}: ❌ Weights became infinite (exploded)")
    elif lr_results[lr]['final_loss'] > 1.0:
        print(f"α = {lr}: ⚠️  Poor final performance")
    elif not lr_results[lr]['converged']:
        print(f"α = {lr}: ⚠️  Slow convergence")
    else:
        print(f"α = {lr}: ✅ Good performance and convergence")

## Task 4: Full Training Loop

Let's put it all together and watch gradient descent automatically learn to classify "Go Dolphins!" and all our other tweets.

In [None]:
# Complete training demonstration with detailed analysis

def detailed_training_analysis(features, labels, texts, learning_rate=0.3, num_iterations=150):
    """
    Run complete training with detailed tracking and analysis.
    """
    # Initialize with small random weights
    np.random.seed(42)
    weights = np.random.normal(0, 0.1, 3)
    
    # Track everything
    weight_history = [weights.copy()]
    loss_history = []
    accuracy_history = []
    prediction_history = []
    
    print(f"COMPLETE TRAINING DEMONSTRATION")
    print(f"=" * 50)
    print(f"Dataset: {len(texts)} sports tweets")
    print(f"Learning rate: {learning_rate}")
    print(f"Initial weights: {weights}")
    print()
    
    # Training loop with detailed tracking
    for iteration in range(num_iterations):
        # Compute predictions and gradients for all examples
        total_gradient = np.zeros_like(weights)
        total_loss = 0.0
        predictions = []
        
        for i in range(len(features)):
            # Forward pass
            prob, loss, _ = compute_loss_and_prediction(features[i], labels[i], weights)
            predictions.append(prob)
            
            # Backward pass
            if labels[i] == 1:
                dloss_dprob = -1/prob
            else:
                dloss_dprob = 1/(1-prob)
            
            dprob_dz = prob * (1 - prob)
            dz_dw = features[i]
            
            gradient = dloss_dprob * dprob_dz * dz_dw
            total_gradient += gradient
            total_loss += loss
        
        # Average across all examples
        avg_gradient = total_gradient / len(features)
        avg_loss = total_loss / len(features)
        
        # Calculate accuracy
        binary_predictions = (np.array(predictions) > 0.5).astype(int)
        accuracy = np.mean(binary_predictions == labels)
        
        # Store history
        loss_history.append(avg_loss)
        accuracy_history.append(accuracy)
        prediction_history.append(predictions.copy())
        
        # Update weights
        weights = weights - learning_rate * avg_gradient
        weight_history.append(weights.copy())
        
        # Print periodic updates
        if iteration % 30 == 0 or iteration < 5:
            print(f"Iteration {iteration:3d}: Loss = {avg_loss:.4f}, Accuracy = {accuracy:.1%}, Weights = {weights}")
    
    return weights, weight_history, loss_history, accuracy_history, prediction_history

# Run the complete training
final_weights, weight_hist, loss_hist, acc_hist, pred_hist = detailed_training_analysis(
    features, labels, texts, learning_rate=0.3, num_iterations=100
)

print(f"\nTRAINING COMPLETE!")
print(f"Final weights: {final_weights}")
print(f"Final accuracy: {acc_hist[-1]:.1%}")
print(f"Final loss: {loss_hist[-1]:.4f}")

In [None]:
# Analyze what the model learned
print("\nWHAT DID THE MODEL LEARN?")
print("=" * 40)

# Analyze final weights
print(f"Learned weights: {final_weights}")
print("\nWeight interpretation:")
for i, (feature_name, weight) in enumerate(zip(feature_names, final_weights)):
    strength = "strongly" if abs(weight) > 0.5 else "moderately" if abs(weight) > 0.2 else "weakly"
    direction = "positive" if weight > 0 else "negative"
    print(f"  {feature_name:<15}: {weight:+.3f} → {strength} {direction} sentiment signal")

# Test on our key examples
print("\nFINAL PREDICTIONS ON KEY EXAMPLES:")
key_examples = [
    (0, "Go Dolphins!"),
    (1, "Terrible game"),
    (2, "Love the fins!"),
    (4, "Great win!!"),
    (7, "Worst season ever")
]

for idx, text in key_examples:
    prob, loss, raw_pred = compute_loss_and_prediction(features[idx], labels[idx], final_weights)
    true_sentiment = "Positive" if labels[idx] == 1 else "Negative"
    pred_sentiment = "Positive" if prob > 0.5 else "Negative"
    confidence = prob if prob > 0.5 else (1 - prob)
    correct = "✅" if (prob > 0.5) == (labels[idx] == 1) else "❌"
    
    print(f"{correct} '{text:<20}' | True: {true_sentiment:<8} | Pred: {pred_sentiment:<8} ({confidence:.1%} confidence)")

# Show the learning trajectory
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Training curves
iterations = range(len(loss_hist))
ax1 = axes[0, 0]
ax1.plot(iterations, loss_hist, 'b-', linewidth=2, label='Loss')
ax1.set_xlabel('Iteration')
ax1.set_ylabel('Loss', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

ax1_twin = ax1.twinx()
ax1_twin.plot(iterations, acc_hist, 'r-', linewidth=2, label='Accuracy')
ax1_twin.set_ylabel('Accuracy', color='red')
ax1_twin.tick_params(axis='y', labelcolor='red')

ax1.set_title('Training Progress: Loss and Accuracy')
ax1.grid(True, alpha=0.3)

# Plot 2: Weight evolution
weight_history_array = np.array(weight_hist)
for i, feature_name in enumerate(feature_names):
    axes[0, 1].plot(range(len(weight_hist)), weight_history_array[:, i], 
                   linewidth=2, marker='o', markersize=2, label=f'{feature_name}')

axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Weight Value')
axes[0, 1].set_title('How Weights Changed During Learning')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].legend()

# Plot 3: Prediction evolution for key examples
pred_history_array = np.array(pred_hist)
for idx, text in [(0, "Go Dolphins!"), (1, "Terrible game")]:
    axes[1, 0].plot(range(len(pred_hist)), pred_history_array[:, idx], 
                   linewidth=2, label=f'"{text}" (true: {"pos" if labels[idx]==1 else "neg"})')

axes[1, 0].axhline(y=0.5, color='black', linestyle='--', alpha=0.7, label='Decision threshold')
axes[1, 0].set_xlabel('Iteration')
axes[1, 0].set_ylabel('Predicted Probability')
axes[1, 0].set_title('How Predictions Improved During Training')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Final weight visualization
axes[1, 1].bar(feature_names, final_weights, color=['blue', 'green', 'orange'], alpha=0.7)
axes[1, 1].set_ylabel('Weight Value')
axes[1, 1].set_title('Final Learned Weights')
axes[1, 1].grid(True, alpha=0.3, axis='y')
axes[1, 1].axhline(y=0, color='black', linewidth=1)

# Add weight values on bars
for i, (name, weight) in enumerate(zip(feature_names, final_weights)):
    axes[1, 1].text(i, weight + 0.02 if weight > 0 else weight - 0.05, 
                    f'{weight:.3f}', ha='center', va='bottom' if weight > 0 else 'top')

plt.tight_layout()
plt.show()

print("\n🎉 GRADIENT DESCENT COMPLETE!")
print("The algorithm automatically learned a weight for each feature:")
for name, w in [("Team mentions", final_weights[1]),
                ("Exclamation marks", final_weights[2]),
                ("Word count", final_weights[0])]:
    # Report the sign that was actually learned instead of assuming it
    effect = "pushes predictions toward positive" if w > 0 else "pushes predictions toward negative"
    print(f"- {name} (weight: {w:+.3f}) {effect} sentiment")
print("\nCheck whether these signs match your intuition about sports sentiment!")

## What's Next?

You've now witnessed the magic of gradient descent - automatic optimization that finds optimal weights! Here's what we discovered:

**🔑 Key Insights:**
1. **Gradients show the steepest uphill direction** - so we go downhill (negative gradient)
2. **The chain rule connects all the pieces** - from final loss back to every weight
3. **Learning rate controls the step size** - too small is slow, too large is unstable
4. **The algorithm discovers patterns automatically** - no human programming of decision rules!

**🔗 The Connection:**
- **Problem 1**: Text → Features (`[2, 1, 1]`)
- **Problem 2**: Features + Weights → Predictions (via dot products)
- **Problem 3**: Predictions + Truth → Loss (measuring quality)
- **Problem 4**: Loss + Gradients → Automatic weight optimization
- **Problem 5**: Coming up - How do we scale this to millions of examples?

**The Big Picture:**
Gradient descent is the engine behind much of modern machine learning. From simple classifiers to large language models like ChatGPT, variations of this same algorithm fit model parameters to data. You've now seen the mathematical heart of how these models learn!

**Coming up in Problem 5: Matrix Operations**
- How do we process thousands of tweets simultaneously?
- What are the computational tricks that make large-scale ML possible?
- How do GPUs accelerate the math we've been doing?

The journey from "Go Dolphins!" to scalable machine learning is almost complete! 🐬➡️📊➡️🎯➡️⚡➡️🚀