# PyTorch Deep Dive: Building Neural Networks

You know Tensors (Data). You know Autograd (Math). Now let's build the **Machine**.

Before we start assembling, we need to define the **Parts**.

## Learning Objectives
- **The Vocabulary**: What is a "Layer", "Module", "Parameter", and "Activation"?
- **The Intuition**: Layers as "Filters" or "Assembly Line Stations".
- **The Mechanism**: `nn.Module` and the `forward()` method.
- **The Deep Dive**: What are "Parameters" and where do they live?


In [1]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(42)

<torch._C.Generator at 0x10ea443b0>

## Part 1: The Vocabulary (Definitions First)

A Neural Network is just a big function made of smaller functions. Here are the names of the parts:

### 1. Layer
- A reusable block of math that transforms data.
- Example: `nn.Linear`, `nn.Conv2d`.
- Analogy: A single machine in a factory.

### 2. Module (`nn.Module`)
- The base class for all neural network parts in PyTorch.
- Your entire model is a Module. A single layer is also a Module.
- Analogy: The "Blueprint" for the machine.
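
To make the "everything is a Module" point concrete, here is a minimal sketch that checks it with `isinstance`. The layer sizes and the use of `nn.Sequential` as the container are illustrative assumptions, not part of the tutorial's own model.

```python
import torch.nn as nn

# A single layer, an activation, and a container that holds both
layer = nn.Linear(4, 2)
activation = nn.ReLU()
stack = nn.Sequential(layer, activation)

# All three inherit from the same base class
is_layer_module = isinstance(layer, nn.Module)
is_activation_module = isinstance(activation, nn.Module)
is_stack_module = isinstance(stack, nn.Module)

print(is_layer_module, is_activation_module, is_stack_module)
```

Because layers and whole models share one base class, a model can be nested inside another model just like a layer.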

### 3. Parameter
- The internal numbers (Weights and Biases) that the model learns.
- These are the "knobs" the optimizer turns.
- Analogy: The settings on the machine.
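
As a rough illustration of what makes a number a "knob", the sketch below builds a toy module (the class name `Scale` and its attributes are hypothetical) with one `nn.Parameter` and one plain tensor. Only the `nn.Parameter` shows up when the module lists its parameters.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):  # hypothetical toy module, for illustration only
    def __init__(self):
        super().__init__()
        # Wrapped in nn.Parameter: registered with the module and learnable
        self.gain = nn.Parameter(torch.ones(3))
        # Plain tensor attribute: NOT registered, so an optimizer never sees it
        self.offset = torch.zeros(3)

    def forward(self, x):
        return x * self.gain + self.offset

toy = Scale()
names = [name for name, _ in toy.named_parameters()]

print(names)                    # only the registered parameter is listed
print(toy.gain.requires_grad)   # parameters track gradients by default
```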

### 4. Activation Function
- A non-linear function applied after a layer.
- Without this, stacked linear layers collapse into a single linear map, so the network is no more powerful than one big linear regression.
- Example: ReLU, Sigmoid.
- Analogy: The "Spark" or "Decision" to fire.
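
The claim that a network without activations is just one linear model can be checked numerically. The sketch below (layer sizes and seed are arbitrary assumptions) stacks two `nn.Linear` layers with nothing in between and compares the result to a single hand-built linear map with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f1 = nn.Linear(3, 4)
f2 = nn.Linear(4, 2)
x = torch.randn(5, 3)  # batch of 5 samples, 3 features each

# Two linear layers back to back, no activation in between
stacked = f2(f1(x))

# The equivalent single linear map
W = f2.weight @ f1.weight           # shape (2, 3)
b = f2.weight @ f1.bias + f2.bias   # shape (2,)
collapsed = x @ W.T + b

same_without_activation = torch.allclose(stacked, collapsed, atol=1e-5)
print("Stacked == single linear map:", same_without_activation)

# With a ReLU in between, the single linear map generally no longer matches
with_relu = f2(torch.relu(f1(x)))
print("Still equal with ReLU:", torch.allclose(with_relu, collapsed, atol=1e-5))
```

The first comparison holds for any weights, since composing linear maps yields a linear map. The second depends on the data: it fails whenever at least one pre-activation value is negative.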

## Part 2: The Intuition (The Assembly Line)

Imagine a car factory assembly line.

1. **Input**: Raw Steel (Data).
2. **Station 1 (Layer 1)**: Stamps steel into doors. (Transforms shape).
3. **Station 2 (Layer 2)**: Welds doors to frame. (Combines features).
4. **Station 3 (Layer 3)**: Paints the car. (Final polish).
5. **Output**: Finished Car (Prediction).

In PyTorch, this factory is a `nn.Module`. The stations are Layers. The conveyor belt is the `forward()` method.
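
One way to sketch this assembly line in a few lines is `nn.Sequential`, which chains stations in order and supplies the conveyor belt for you. The layer sizes (4, 6, 4, 2) and the variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Three stations chained in order; Sequential's forward() runs them one by one
factory = nn.Sequential(
    nn.Linear(4, 6), nn.ReLU(),   # Station 1
    nn.Linear(6, 4), nn.ReLU(),   # Station 2
    nn.Linear(4, 2),              # Station 3
)

raw_steel = torch.randn(3, 4)     # batch of 3 samples, 4 features each
finished = factory(raw_steel)

print(finished.shape)             # one 2-value prediction per sample
```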

In [None]:
import numpy as np

# Create a simple visualization of network architecture
fig, ax = plt.subplots(figsize=(14, 7))

# Layer positions
layers = [
    {"name": "Input\nLayer", "neurons": 4, "x": 0.1, "color": "lightblue"},
    {"name": "Hidden\nLayer 1", "neurons": 6, "x": 0.35, "color": "lightgreen"},
    {"name": "Hidden\nLayer 2", "neurons": 4, "x": 0.6, "color": "lightyellow"},
    {"name": "Output\nLayer", "neurons": 2, "x": 0.85, "color": "lightcoral"}
]

# Draw neurons and connections
neuron_positions = {}

for layer_idx, layer in enumerate(layers):
    x_pos = layer["x"]
    n_neurons = layer["neurons"]
    y_positions = np.linspace(0.2, 0.8, n_neurons)
    
    neuron_positions[layer_idx] = []
    
    for neuron_idx, y_pos in enumerate(y_positions):
        # Draw neuron
        circle = plt.Circle((x_pos, y_pos), 0.03, color=layer["color"], 
                          ec='black', linewidth=2, zorder=3)
        ax.add_patch(circle)
        neuron_positions[layer_idx].append((x_pos, y_pos))
    
    # Add layer label
    ax.text(x_pos, 0.95, layer["name"], ha='center', fontsize=12, 
            fontweight='bold', bbox=dict(boxstyle='round', facecolor=layer["color"], alpha=0.8))
    
    # Draw connections to next layer
    if layer_idx < len(layers) - 1:
        for pos1 in neuron_positions[layer_idx]:
            for pos2 in neuron_positions[layer_idx + 1]:
                ax.plot([pos1[0], pos2[0]], [pos1[1], pos2[1]], 
                       'gray', alpha=0.3, linewidth=0.5, zorder=1)

# Add arrows between layers
for i in range(len(layers) - 1):
    mid_x = (layers[i]["x"] + layers[i+1]["x"]) / 2
    ax.annotate('', xy=(layers[i+1]["x"] - 0.05, 0.5), 
               xytext=(layers[i]["x"] + 0.05, 0.5),
               arrowprops=dict(arrowstyle='->', lw=3, color='blue'))
    
    # Add operation label
    if i == 0:
        operation = "Linear + ReLU"
    elif i == 1:
        operation = "Linear + ReLU"
    else:
        operation = "Linear"
    
    ax.text(mid_x, 0.05, operation, ha='center', fontsize=10,
           bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
ax.set_title('Neural Network Architecture: Data Flow', fontsize=16, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("Data Flow:")
print("1. Input (4 features) → Linear transformation → 6 hidden neurons → ReLU activation")
print("2. Hidden (6) → Linear transformation → 4 hidden neurons → ReLU activation")
print("3. Hidden (4) → Linear transformation → 2 output predictions")
print("\nEach connection represents a learnable weight parameter!")

### Visualization: Neural Network as Assembly Line

The plot above shows how data flows through the layers of a neural network: each station transforms its input and hands the result to the next.

## Part 3: The Linear Layer (The Workhorse)

The most basic layer is `nn.Linear`. It performs the equation of a line (in N dimensions):

$$ y = xA^T + b $$

Where:
- $x$: Input features.
- $A$: Weights (The "Slope"), stored as a matrix of shape `(out_features, in_features)`.
- $b$: Bias (The "Intercept").

It maps input points to output points with an affine transformation: the weights rotate, scale, or shear the points, and the bias shifts them.
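
As a quick sanity check of the formula, the sketch below computes $xA^T + b$ by hand and compares it with what `nn.Linear` returns. The sizes, seed, and input values are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(in_features=3, out_features=2)
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # batch of 2 samples

builtin = layer(x)                          # what PyTorch computes
manual = x @ layer.weight.T + layer.bias    # y = x A^T + b, written out

matches = torch.allclose(builtin, manual, atol=1e-5)
print("Weight shape:", layer.weight.shape)  # (out_features, in_features)
print("Manual result matches nn.Linear:", matches)
```

The transpose is there because PyTorch stores the weight matrix with one row per output feature.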

In [None]:
# Visualize linear transformation in 2D
torch.manual_seed(42)

# Create sample 2D points (spiral pattern)
theta = torch.linspace(0, 4*np.pi, 100)
r = torch.linspace(0.5, 2, 100)
x_input = torch.stack([r * torch.cos(theta), r * torch.sin(theta)], dim=1)

# Create a linear layer (2D -> 2D transformation)
linear_layer = nn.Linear(2, 2)
with torch.no_grad():
    # Overwrite the randomly initialized parameters in place for a clear transformation
    linear_layer.weight.copy_(torch.tensor([[1.5, 0.5], [-0.3, 1.2]]))
    linear_layer.bias.copy_(torch.tensor([0.5, -0.3]))

# Transform the points
x_output = linear_layer(x_input)

# Visualize
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))

# Original points
ax1.scatter(x_input[:, 0].numpy(), x_input[:, 1].numpy(), 
           c=range(len(x_input)), cmap='viridis', s=30, alpha=0.7)
ax1.set_xlabel('Feature 1', fontsize=12)
ax1.set_ylabel('Feature 2', fontsize=12)
ax1.set_title('Input Data (Original)', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.axis('equal')
ax1.set_xlim(-3, 3)
ax1.set_ylim(-3, 3)

# Show the transformation
ax2.scatter(x_input[:, 0].numpy(), x_input[:, 1].numpy(), 
           c=range(len(x_input)), cmap='viridis', s=30, alpha=0.3, label='Original')
ax2.scatter(x_output[:, 0].detach().numpy(), x_output[:, 1].detach().numpy(), 
           c=range(len(x_output)), cmap='plasma', s=30, alpha=0.7, label='Transformed')
ax2.set_xlabel('Feature 1', fontsize=12)
ax2.set_ylabel('Feature 2', fontsize=12)
ax2.set_title('Linear Transformation', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend()
ax2.axis('equal')
ax2.set_xlim(-3, 3)
ax2.set_ylim(-3, 3)

# Transformed points
ax3.scatter(x_output[:, 0].detach().numpy(), x_output[:, 1].detach().numpy(), 
           c=range(len(x_output)), cmap='plasma', s=30, alpha=0.7)
ax3.set_xlabel('Feature 1', fontsize=12)
ax3.set_ylabel('Feature 2', fontsize=12)
ax3.set_title('Output Data (Transformed)', fontsize=13, fontweight='bold')
ax3.grid(True, alpha=0.3)
ax3.axis('equal')
ax3.set_xlim(-3, 3)
ax3.set_ylim(-3, 3)

plt.tight_layout()
plt.show()

print("What Linear Layers Do:")
print("• Rotation: Change the orientation of data")
print("• Scaling: Stretch or compress along axes")
print("• Translation: Shift the data (via bias)")
print(f"\nWeight matrix:\n{linear_layer.weight.data}")
print(f"Bias vector: {linear_layer.bias.data}")

### Visualization: How Linear Layers Transform Data

The plots above show what a linear layer does geometrically: it rotates, scales, shears, and shifts points in space.

In [None]:
# Create a Linear Layer
# Input: 3 features (e.g., Age, Height, Weight)
# Output: 1 feature (e.g., Life Expectancy)
layer = nn.Linear(in_features=3, out_features=1)

print("Weights (A):", layer.weight)
print("Bias (b):", layer.bias)

# Pass data through it
input_data = torch.tensor([[1.0, 2.0, 3.0]]) # Batch of 1 sample
output = layer(input_data)

print("Output:", output)

## Part 4: Activation Functions (The Spark)

Linear layers can only learn straight lines. But the world is curved.

To learn curves, we need **Non-Linearity**. We call these "Activation Functions".

Think of a biological neuron. It gathers signals. If the signal is strong enough, it **FIRES** (Action Potential). If not, it stays silent.

- **ReLU (Rectified Linear Unit)**: The most common. If x > 0, return x. Otherwise, return 0. (Like a switch).
- **Sigmoid**: Squashes numbers between 0 and 1. (Like a probability).
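
Both functions are short enough to write by hand. The sketch below (the sample inputs are arbitrary) implements them from their definitions and checks them against PyTorch's built-in modules.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# ReLU: keep positive values, replace everything else with 0
manual_relu = torch.clamp(x, min=0.0)

# Sigmoid: 1 / (1 + e^(-x)), always strictly between 0 and 1
manual_sigmoid = 1.0 / (1.0 + torch.exp(-x))

relu_ok = torch.equal(manual_relu, nn.ReLU()(x))
sigmoid_ok = torch.allclose(manual_sigmoid, nn.Sigmoid()(x))

print("ReLU:   ", manual_relu)
print("Sigmoid:", manual_sigmoid)
print("Match built-ins:", relu_ok, sigmoid_ok)
```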

In [None]:
# Comprehensive visualization of activation functions
x = torch.linspace(-5, 5, 200)
relu = nn.ReLU()
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()
leaky_relu = nn.LeakyReLU(0.1)

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# 1. ReLU
ax = axes[0, 0]
ax.plot(x, relu(x), linewidth=3, color='blue', label='ReLU')
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax.axvline(x=0, color='k', linestyle='--', alpha=0.3)
ax.fill_between(x.numpy(), 0, relu(x).numpy(), alpha=0.2, color='blue')
ax.set_xlabel('Input (x)', fontsize=11)
ax.set_ylabel('Output', fontsize=11)
ax.set_title('ReLU: max(0, x)\n"The Switch" - Dead below 0', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()

# 2. Sigmoid
ax = axes[0, 1]
ax.plot(x, sigmoid(x), linewidth=3, color='green', label='Sigmoid')
ax.axhline(y=0.5, color='k', linestyle='--', alpha=0.3, label='Midpoint')
ax.axhline(y=0, color='r', linestyle=':', alpha=0.3)
ax.axhline(y=1, color='r', linestyle=':', alpha=0.3)
ax.set_xlabel('Input (x)', fontsize=11)
ax.set_ylabel('Output', fontsize=11)
ax.set_title('Sigmoid: 1/(1+e⁻ˣ)\n"The Probability" - Output (0,1)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()

# 3. Tanh
ax = axes[0, 2]
ax.plot(x, tanh(x), linewidth=3, color='orange', label='Tanh')
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax.axhline(y=-1, color='r', linestyle=':', alpha=0.3)
ax.axhline(y=1, color='r', linestyle=':', alpha=0.3)
ax.set_xlabel('Input (x)', fontsize=11)
ax.set_ylabel('Output', fontsize=11)
ax.set_title('Tanh: (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)\n"Centered" - Output (-1,1)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()

# 4. Leaky ReLU
ax = axes[1, 0]
ax.plot(x, leaky_relu(x), linewidth=3, color='purple', label='Leaky ReLU')
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax.axvline(x=0, color='k', linestyle='--', alpha=0.3)
ax.fill_between(x.numpy(), 0, leaky_relu(x).numpy(), alpha=0.2, color='purple')
ax.set_xlabel('Input (x)', fontsize=11)
ax.set_ylabel('Output', fontsize=11)
ax.set_title('Leaky ReLU: max(0.1x, x)\n"Allows small negative values"', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()

# 5. Comparison of all
ax = axes[1, 1]
ax.plot(x, relu(x), linewidth=2, label='ReLU', alpha=0.8)
ax.plot(x, sigmoid(x), linewidth=2, label='Sigmoid', alpha=0.8)
ax.plot(x, tanh(x), linewidth=2, label='Tanh', alpha=0.8)
ax.plot(x, leaky_relu(x), linewidth=2, label='Leaky ReLU', alpha=0.8)
ax.set_xlabel('Input (x)', fontsize=11)
ax.set_ylabel('Output', fontsize=11)
ax.set_title('Comparison of All Activations', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()

# 6. Why non-linearity matters
ax = axes[1, 2]
x_demo = torch.linspace(-2, 2, 100)
# Linear only (no activation)
linear_output = 2 * x_demo + 1
# With ReLU activation
nonlinear_output = relu(2 * x_demo + 1)

ax.plot(x_demo, linear_output, linewidth=2.5, label='Linear only (boring!)', color='gray', linestyle='--')
ax.plot(x_demo, nonlinear_output, linewidth=2.5, label='Linear + ReLU (learns curves!)', color='red')
ax.set_xlabel('Input (x)', fontsize=11)
ax.set_ylabel('Output', fontsize=11)
ax.set_title('Why We Need Activation Functions', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
ax.text(0, -1.5, 'Without activation: just a straight line!\nWith activation: can learn complex patterns', 
        ha='center', fontsize=9, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))

plt.tight_layout()
plt.show()

print("Activation Function Properties:")
print("• ReLU: Fast, but can 'die' (always output 0)")
print("• Sigmoid: Smooth, but gradients vanish for large |x|")
print("• Tanh: Zero-centered, but also suffers from vanishing gradients")
print("• Leaky ReLU: Fixes dying ReLU problem with small negative slope")
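
The "gradients vanish for large |x|" property of Sigmoid can be seen directly with Autograd. This is a small sketch with arbitrarily chosen inputs, not a full analysis.

```python
import torch

# Three inputs: the midpoint, a moderate value, and a large value
x = torch.tensor([0.0, 2.0, 10.0], requires_grad=True)

y = torch.sigmoid(x)
y.sum().backward()       # fills x.grad with d(sigmoid)/dx at each input

grads = x.grad
print("Sigmoid gradients at x = 0, 2, 10:", grads)
```

The gradient peaks at 0.25 when x = 0 and shrinks rapidly as x grows, which is why deep stacks of Sigmoid layers can learn slowly.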

In [None]:
# Visualize the parameters of our network
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Get parameters from the model
# NOTE: this cell needs `model` (the SimpleNet instance created in Part 5); run that cell first
params_list = list(model.named_parameters())

# Plot Layer 1 weights
ax = axes[0, 0]
weights_layer1 = params_list[0][1].detach().numpy()
im1 = ax.imshow(weights_layer1, cmap='RdBu', aspect='auto', vmin=-1, vmax=1)
ax.set_xlabel('Input Features (10)', fontsize=11)
ax.set_ylabel('Hidden Neurons (5)', fontsize=11)
ax.set_title('Layer 1 Weights (5×10)\nEach row = one neuron\'s weights', fontsize=12, fontweight='bold')
plt.colorbar(im1, ax=ax, label='Weight Value')
for i in range(weights_layer1.shape[0]):
    for j in range(weights_layer1.shape[1]):
        text = ax.text(j, i, f'{weights_layer1[i, j]:.2f}',
                      ha="center", va="center", color="black", fontsize=7)

# Plot Layer 1 bias
ax = axes[0, 1]
bias_layer1 = params_list[1][1].detach().numpy()
ax.barh(range(len(bias_layer1)), bias_layer1, color='steelblue', edgecolor='black')
ax.set_ylabel('Hidden Neurons (5)', fontsize=11)
ax.set_xlabel('Bias Value', fontsize=11)
ax.set_title('Layer 1 Biases (5)\nShift each neuron\'s activation', fontsize=12, fontweight='bold')
ax.axvline(x=0, color='red', linestyle='--', linewidth=1)
ax.grid(True, alpha=0.3, axis='x')
for i, v in enumerate(bias_layer1):
    ax.text(v + 0.02, i, f'{v:.3f}', va='center', fontsize=9)

# Plot Layer 2 weights
ax = axes[1, 0]
weights_layer2 = params_list[2][1].detach().numpy()
im2 = ax.imshow(weights_layer2, cmap='RdBu', aspect='auto', vmin=-1, vmax=1)
ax.set_xlabel('Hidden Features (5)', fontsize=11)
ax.set_ylabel('Output Neurons (1)', fontsize=11)
ax.set_title('Layer 2 Weights (1×5)\nCombines hidden features', fontsize=12, fontweight='bold')
plt.colorbar(im2, ax=ax, label='Weight Value')
for i in range(weights_layer2.shape[0]):
    for j in range(weights_layer2.shape[1]):
        text = ax.text(j, i, f'{weights_layer2[i, j]:.2f}',
                      ha="center", va="center", color="black", fontsize=9)

# Plot parameter histogram
ax = axes[1, 1]
all_weights = torch.cat([p.flatten() for name, p in model.named_parameters() if 'weight' in name])
all_biases = torch.cat([p.flatten() for name, p in model.named_parameters() if 'bias' in name])

ax.hist(all_weights.detach().numpy(), bins=30, alpha=0.6, label='Weights', color='blue', edgecolor='black')
ax.hist(all_biases.detach().numpy(), bins=15, alpha=0.6, label='Biases', color='orange', edgecolor='black')
ax.set_xlabel('Parameter Value', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('Distribution of All Parameters', fontsize=12, fontweight='bold')
ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='Zero')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"Total parameters: {sum(p.numel() for p in model.parameters())}")
print(f"• Layer 1: {weights_layer1.size} weights + {bias_layer1.size} biases = {weights_layer1.size + bias_layer1.size}")
print(f"• Layer 2: {weights_layer2.size} weights + {params_list[3][1].numel()} biases = {weights_layer2.size + params_list[3][1].numel()}")
print("\nThese are the 'knobs' gradient descent will tune during training!")

### Visualization: Network Parameters

The plots above show the actual parameters (weights and biases) that the network learns. They read from the `model` built in Part 5, so run that cell first.

## Part 5: Building the Factory (nn.Module)

To build a full network, we subclass `nn.Module`. We must define two things:

1. `__init__`: **Define the stations**. (Create the layers).
2. `forward`: **Define the conveyor belt**. (Connect the layers).

In [None]:
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Station 1: 10 inputs -> 5 hidden features
        self.layer1 = nn.Linear(10, 5)
        # Station 2: 5 hidden -> 1 output
        self.layer2 = nn.Linear(5, 1)
        # The Spark
        self.activation = nn.ReLU()

    def forward(self, x):
        # The Conveyor Belt
        x = self.layer1(x)      # Step 1
        x = self.activation(x)  # Step 2 (Non-linearity)
        x = self.layer2(x)      # Step 3
        return x

model = SimpleNet()
print(model)
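
To see the conveyor belt run, the sketch below pushes a random batch through the network. The class is repeated from the cell above so the example runs on its own; the batch size of 8 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):  # same structure as above, repeated for self-containment
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 5)
        self.layer2 = nn.Linear(5, 1)
        self.activation = nn.ReLU()

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        return x

model = SimpleNet()
batch = torch.randn(8, 10)   # 8 samples, 10 features each

# Call the model like a function; nn.Module routes the call to forward()
out = model(batch)
print(out.shape)             # one prediction per sample
```

Calling `model(batch)` rather than `model.forward(batch)` is the usual convention, because the call goes through `nn.Module`'s own machinery (such as hooks) before reaching `forward()`.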

## Part 6: The Deep Dive (Parameters)

Where does the "Knowledge" live?

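It lives in the **Parameters** (Weights and Biases). PyTorch tracks these for you automatically: each `nn.Linear` creates its weight and bias as `nn.Parameter` objects, and assigning a layer to an attribute of your `nn.Module` (such as `self.layer1`) registers them with the model.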

Let's inspect them.

In [None]:
print("Model Parameters:")
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

# Total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal Learnable Parameters: {total_params}")
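
The total can also be worked out by hand: a linear layer with `n_in` inputs and `n_out` outputs holds `n_in * n_out` weights plus `n_out` biases. The sketch below uses a hypothetical helper, `linear_param_count`, and an `nn.Sequential` stand-in with the same layer sizes as `SimpleNet`, so it runs on its own.

```python
import torch.nn as nn

# Stand-in with the same layer sizes as SimpleNet (10 -> 5 -> 1)
net = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))

def linear_param_count(n_in, n_out):
    """Hypothetical helper: weights (n_in * n_out) plus biases (n_out)."""
    return n_in * n_out + n_out

layer1_count = linear_param_count(10, 5)   # 50 weights + 5 biases
layer2_count = linear_param_count(5, 1)    # 5 weights + 1 bias
expected = layer1_count + layer2_count

actual = sum(p.numel() for p in net.parameters())
print(f"By hand: {expected}, counted by PyTorch: {actual}")
```

ReLU contributes nothing to the count because it has no learnable parameters.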

## Summary Checklist

1. **nn.Module** = The Blueprint for your network.
2. **Layers** = The transformation stations (Linear, Conv2d).
3. **Activation** = The non-linear spark (ReLU, Sigmoid).
4. **forward()** = The path data takes through the network.
5. **Parameters** = The learnable weights and biases that hold the knowledge.

Next, we will learn how to **Train** this machine.