# PyTorch Tutorial: Practical Example - Regression

In this notebook, we will work through a complete regression example: predicting a continuous value (a house price) from input features. The data is synthetic, so we know the true relationship the model should recover.

## Learning Objectives

By the end of this notebook, you will:
- Understand how to prepare data for regression
- Build a flexible neural network for regression tasks
- Implement a complete training loop with validation
- Visualize training progress and model predictions
- Analyze model errors

## Setting Up

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

# Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)

## 1. Generating a Synthetic Dataset

To understand regression, let's create a synthetic dataset representing house prices.

**Features:**
1. **Size**: Square footage (500 - 2500 sq ft)
2. **Bedrooms**: Number of bedrooms (1 - 5)
3. **Age**: Age of the house (0 - 30 years)

**Target:**
- **Price**: A linear function of the features, plus a base price and Gaussian noise

In [None]:
# Number of samples
n_samples = 1000

# Generate random features
# Size: Random values between 500 and 2500
size = torch.rand(n_samples, 1) * 2000 + 500

# Bedrooms: Random integers between 1 and 5
bedrooms = torch.randint(1, 6, (n_samples, 1)).float()

# Age: Random values between 0 and 30
age = torch.rand(n_samples, 1) * 30

# Create the feature matrix X (concatenate columns)
X = torch.cat([size, bedrooms, age], dim=1)

# Define the "true" relationship (Price formula)
# Price = 100 * Size + 50000 * Bedrooms - 2000 * Age + 50000 (base price) + Noise
true_weights = torch.tensor([100.0, 50000.0, -2000.0])
bias = 50000.0  # Base price

# Calculate target y
noise = torch.randn(n_samples, 1) * 20000  # Gaussian noise with a $20,000 standard deviation
y = (X @ true_weights.unsqueeze(1)) + bias + noise

print(f"Dataset shape: {X.shape}")
print(f"Target shape: {y.shape}")
print("\nSample data (first 3 rows):")
print("Size | Bedrooms | Age | Price")
for i in range(3):
    print(f"{X[i,0]:.0f} | {X[i,1]:.0f} | {X[i,2]:.1f} | ${y[i,0]:,.2f}")

## 2. Data Preprocessing

**Crucial Step:** Neural networks train much better when input features are on a similar scale (e.g., between 0 and 1, or mean 0 and std 1).

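Our raw features are on very different scales:

- **Size**: 500 to 2500
- **Bedrooms**: 1 to 5
- **Age**: 0 to 30

If we don't normalize, the gradient for the "Size" weights will typically be hundreds of times larger than the gradient for the "Bedrooms" weights, so no single learning rate suits both. We also normalize the target, because raw prices are in the hundreds of thousands of dollars and would make the MSE loss enormous.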
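
To make the scale problem concrete, here is a minimal sketch (not part of the original notebook) that computes the MSE gradient of a single linear layer on one made-up house. The feature values, the target price, and the variable names are illustrative assumptions.

```python
import torch

# One hypothetical house: size (sq ft), bedrooms, age (years)
x = torch.tensor([[2000.0, 3.0, 10.0]])
y = torch.tensor([[380000.0]])  # illustrative target price

# A linear layer with all-zero parameters, so the first prediction is 0
w = torch.zeros(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

loss = ((x @ w + b - y) ** 2).mean()
loss.backward()

# For MSE, d(loss)/d(w_j) = 2 * (prediction - target) * x_j,
# so each weight's gradient is proportional to its raw feature value.
grad = w.grad.squeeze()
ratio = (grad[0] / grad[1]).item()
print("Gradient per weight:", grad.tolist())
print(f"Size gradient / Bedrooms gradient: {ratio:.1f}")
```

The ratio equals 2000 / 3, the ratio of the raw feature values. After normalization every feature has a standard deviation of about 1, so the gradients end up on comparable scales.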

In [None]:
# Split into train and validation sets (80% train, 20% val)
train_split = int(0.8 * n_samples)

X_train_raw = X[:train_split]
y_train_raw = y[:train_split]
X_val_raw = X[train_split:]
y_val_raw = y[train_split:]

# Compute Mean and Std from TRAINING data only (to avoid data leakage)
X_mean = X_train_raw.mean(dim=0)
X_std = X_train_raw.std(dim=0)

y_mean = y_train_raw.mean()
y_std = y_train_raw.std()

# Normalize function
def normalize(data, mean, std):
    return (data - mean) / std

def denormalize(data, mean, std):
    return data * std + mean

# Normalize inputs and targets
X_train = normalize(X_train_raw, X_mean, X_std)
X_val = normalize(X_val_raw, X_mean, X_std)

y_train = normalize(y_train_raw, y_mean, y_std)
y_val = normalize(y_val_raw, y_mean, y_std)

print("Data normalized!")
print(f"X_train mean: {X_train.mean(dim=0)} (should be close to 0)")
print(f"X_train std: {X_train.std(dim=0)} (should be close to 1)")
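
The split above takes the first 80% of rows. That is safe here because the synthetic rows are generated independently, but data that arrives sorted (for example by date or neighborhood) would need shuffling first. Here is a minimal sketch using `torch.randperm`; the helper name `shuffled_split` and its arguments are hypothetical, not part of the notebook.

```python
import torch

def shuffled_split(X, y, train_fraction=0.8, seed=42):
    """Shuffle rows with a fixed seed, then split into train and validation parts."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(X.shape[0], generator=generator)
    n_train = int(train_fraction * X.shape[0])
    train_idx, val_idx = perm[:n_train], perm[n_train:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Toy data: 10 rows whose single feature equals the row index
X_demo = torch.arange(10.0).unsqueeze(1)
y_demo = X_demo * 2
X_tr, y_tr, X_va, y_va = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_va.shape)
```

Indexing `X` and `y` with the same permutation keeps each feature row paired with its target.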

## 3. Building the Regression Model

In [None]:
class HousePriceModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        # A simple feed-forward network
        self.net = nn.Sequential(
            nn.Linear(input_size, 64),  # Input layer -> Hidden layer 1
            nn.ReLU(),                  # Activation
            nn.Linear(64, 32),          # Hidden layer 1 -> Hidden layer 2
            nn.ReLU(),                  # Activation
            nn.Linear(32, 1)            # Hidden layer 2 -> Output (1 value)
        )
    
    def forward(self, x):
        return self.net(x)

model = HousePriceModel(input_size=3)
print(model)
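
The layer sizes in `HousePriceModel` are hard-coded. One way to make the architecture configurable is to build the `nn.Sequential` from a list of hidden sizes. This is only a sketch: the `build_mlp` helper and its `hidden_sizes` parameter are illustrative names, not part of the notebook.

```python
import torch
import torch.nn as nn

def build_mlp(input_size, hidden_sizes=(64, 32), output_size=1):
    """Stack Linear + ReLU pairs, ending in a Linear layer with no activation."""
    layers = []
    in_features = input_size
    for hidden in hidden_sizes:
        layers.append(nn.Linear(in_features, hidden))
        layers.append(nn.ReLU())
        in_features = hidden
    layers.append(nn.Linear(in_features, output_size))
    return nn.Sequential(*layers)

mlp = build_mlp(input_size=3)
n_params = sum(p.numel() for p in mlp.parameters())
out = mlp(torch.randn(5, 3))
print(mlp)
print("Trainable parameters:", n_params)
print("Output shape:", tuple(out.shape))
```

With the default `hidden_sizes=(64, 32)` this gives the same 3 -> 64 -> 32 -> 1 layout as the class above, which has 2,369 trainable parameters.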

## 4. Training the Model

In [None]:
# Loss function: MSE (Mean Squared Error) is standard for regression
criterion = nn.MSELoss()

# Optimizer: Adam adapts the step size per parameter and usually needs less tuning than plain SGD
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training Loop (full-batch: each epoch is one gradient step on the whole training set)
num_epochs = 500
train_losses = []
val_losses = []

print("Starting training...")

for epoch in range(num_epochs):
    # --- Training Phase ---
    model.train()
    # 1. Forward pass
    predictions = model(X_train)
    
    # 2. Compute loss
    loss = criterion(predictions, y_train)
    
    # 3. Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # --- Validation Phase ---
    model.eval()
    with torch.no_grad():
        val_predictions = model(X_val)
        val_loss = criterion(val_predictions, y_val)
    
    # Record losses
    train_losses.append(loss.item())
    val_losses.append(val_loss.item())
    
    # Print progress
    if (epoch + 1) % 50 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}] | Train Loss: {loss.item():.4f} | Val Loss: {val_loss.item():.4f}")

print("Training complete!")
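
The loop above is full-batch: every epoch performs a single gradient step on the entire training set. For larger datasets, a common alternative is mini-batch training with `TensorDataset` and `DataLoader`. The sketch below uses small made-up tensors and a plain linear model so that it runs on its own. The batch size of 64, the stand-in data, and all `demo_*` names are assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 800 rows, 3 already-normalized features, linear target plus noise
X_demo = torch.randn(800, 3)
y_demo = X_demo @ torch.tensor([[0.5], [0.3], [-0.2]]) + 0.05 * torch.randn(800, 1)

loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=64, shuffle=True)
demo_model = nn.Linear(3, 1)
demo_criterion = nn.MSELoss()
demo_optimizer = optim.Adam(demo_model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(20):
    demo_model.train()
    running = 0.0
    for xb, yb in loader:  # one gradient step per mini-batch
        demo_optimizer.zero_grad()
        batch_loss = demo_criterion(demo_model(xb), yb)
        batch_loss.backward()
        demo_optimizer.step()
        running += batch_loss.item() * xb.shape[0]  # weight by batch size
    epoch_losses.append(running / len(loader.dataset))

print(f"Batches per epoch: {len(loader)}")
print(f"First epoch loss: {epoch_losses[0]:.4f} | Last epoch loss: {epoch_losses[-1]:.4f}")
```

Each epoch now performs several parameter updates instead of one, and `shuffle=True` reorders the rows every epoch so the batches differ between epochs.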

## 5. Visualizing Results

In [None]:
# Plot Training vs Validation Loss
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("MSE Loss (Normalized)")
plt.title("Training and Validation Loss Over Time")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### Predictions vs Actuals

Let's see how well our model predicts prices on the validation set. We need to **denormalize** the predictions to get actual dollar values.

In [None]:
model.eval()
with torch.no_grad():
    # Get predictions on normalized validation data
    pred_normalized = model(X_val)
    
    # Convert back to real dollars
    pred_actual = denormalize(pred_normalized, y_mean, y_std)
    y_val_actual = denormalize(y_val, y_mean, y_std)

# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(y_val_actual.numpy(), pred_actual.numpy(), alpha=0.6, color="blue")
# Diagonal reference line: points on it are perfect predictions
lo, hi = y_val_actual.min().item(), y_val_actual.max().item()
plt.plot([lo, hi], [lo, hi], "r--", lw=2)
plt.xlabel("Actual Price ($)")
plt.ylabel("Predicted Price ($)")
plt.title("Actual vs Predicted Prices")
plt.grid(True, alpha=0.3)
plt.show()
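
The scatter plot gives a qualitative picture. To quantify the errors in dollars, common summary metrics are mean absolute error (MAE), root mean squared error (RMSE), and R². Here is a minimal sketch with a hypothetical helper, `regression_errors`, shown on three made-up prices. In the notebook it could be called with `pred_actual` and `y_val_actual` instead.

```python
import torch

def regression_errors(predicted, actual):
    """Return MAE, RMSE and R^2 for two tensors of the same shape."""
    residual = predicted - actual
    mae = residual.abs().mean().item()
    rmse = residual.pow(2).mean().sqrt().item()
    ss_res = residual.pow(2).sum()                   # unexplained variation
    ss_tot = (actual - actual.mean()).pow(2).sum()   # total variation
    r2 = (1 - ss_res / ss_tot).item()
    return mae, rmse, r2

# Made-up values, in dollars
demo_pred = torch.tensor([[100.0], [200.0], [300.0]])
demo_actual = torch.tensor([[110.0], [190.0], [330.0]])
mae, rmse, r2 = regression_errors(demo_pred, demo_actual)
print(f"MAE: ${mae:,.2f} | RMSE: ${rmse:,.2f} | R^2: {r2:.3f}")
```

Because the noise added when generating the data has a standard deviation of $20,000, a validation RMSE near that value is about the best any model can be expected to achieve on this dataset.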

## 6. Checking Individual Examples

Let's pick a few houses from the validation set and see the specific numbers.

In [None]:
print("{:<15} {:<15} {:<15}".format("Actual", "Predicted", "Difference"))
print("-" * 45)

for i in range(10):
    actual = y_val_actual[i].item()
    predicted = pred_actual[i].item()
    diff = abs(actual - predicted)
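    # Build each dollar string first, then pad it; a format spec cannot
    # combine ",.0f" and "<14" in a single field
    actual_str = f"${actual:,.0f}"
    predicted_str = f"${predicted:,.0f}"
    diff_str = f"${diff:,.0f}"
    print(f"{actual_str:<15} {predicted_str:<15} {diff_str:<15}")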

## Key Takeaways

1. **Normalization**: Normalize input features when they have different scales (e.g., square footage vs number of bedrooms), using statistics computed from the training set only.
2. **Regression Output**: The output layer usually has 1 neuron and **no activation function** (or linear activation) because we want to predict any continuous value.
3. **Loss**: MSE (Mean Squared Error) is the usual default loss for regression; it penalizes large errors more heavily than small ones.
4. **Evaluation**: Visualizing "Predicted vs Actual" is a great way to check regression performance.