<a href="https://colab.research.google.com/github/ThomasWong-ST/Intro-to-RL/blob/main/Deep_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from collections import namedtuple, deque
import numpy as np

#DQN Implementation Example

In [2]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Define layers: for example, 1 input -> 10 hidden -> 10 hidden -> 1 output
        self.hidden1 = nn.Linear(1, 10)   # linear layer (1 -> 10)
        self.hidden2 = nn.Linear(10, 10)  # linear layer (10 -> 10)
        self.output = nn.Linear(10, 1)    # linear layer (10 -> 1)
        self.activation = nn.ReLU()       # ReLU activation for hidden layers

    def forward(self, x):
        # Forward pass: apply linear layers and activation
        x = self.activation(self.hidden1(x))
        x = self.activation(self.hidden2(x))
        x = self.output(x)  # output layer (no activation for regression)
        return x

# Instantiate the model
model = Net()
print(model)


Net(
  (hidden1): Linear(in_features=1, out_features=10, bias=True)
  (hidden2): Linear(in_features=10, out_features=10, bias=True)
  (output): Linear(in_features=10, out_features=1, bias=True)
  (activation): ReLU()
)


In [3]:
criterion = nn.MSELoss()                   # Mean Squared Error loss
optimizer = optim.SGD(model.parameters(), lr=0.01)

In [None]:
# Example function and data: let's approximate f(x) = sin(x)^2 on [0, 2]
# Generate training data (e.g., 20 random points in [0,2])
x_train = 2 * torch.rand(20, 1)            # shape (20,1) inputs in [0,2]
y_train = torch.sin(x_train) ** 2          # shape (20,1) outputs f(x) = sin^2(x)

# Training loop
epochs = 5000
for epoch in range(epochs):
    optimizer.zero_grad()                 # 1. zero out gradients from previous step
    y_pred = model(x_train)               # 2. forward pass: compute predictions
    loss = criterion(y_pred, y_train)     # 3. compute loss (MSE between y_pred and y_train)
    loss.backward()                       # 4. backward pass: compute gradients dL/dw for each param
    optimizer.step()                      # 5. update weights: w <- w - lr * grad

    if (epoch+1) % 100 == 0:  # print loss every 100 epochs
        print(f"Epoch {epoch+1}/{epochs}, Loss = {loss.item():.6f}")

#A Non-Linear Function Approximation
Environment Description: Imagine a one-dimensional track from -1.0 to +1.0. The agent starts at position 0.0. It has two actions: move left (decrease position) or move right (increase position) by a fixed step of 0.1. If the agent reaches the left end (-1.0) or the right end (+1.0), it receives a reward of +1 and the episode ends. If the agent fails to reach either end within 50 steps, the episode ends with 0 reward. This environment is deterministic and fully observable (state = current position).

In [2]:
class DoubleGoalEnv:
    def __init__(self):
        # Define the 1D state space boundaries
        self.min_position = -1.0
        self.max_position =  1.0
        self.step_size    =  0.1    # movement increment for each action
        self.max_steps    = 50     # episode terminates if this many steps elapse without reaching a goal
        self.state        = None
        self.current_step = 0

    def reset(self):
        """Resets the environment to the starting state."""
        self.state = 0.0               # start at the center
        self.current_step = 0
        # Return state as a NumPy array (for compatibility with PyTorch later)
        return np.array([self.state], dtype=np.float32)

    def step(self, action):
        """
        Executes one action in the environment.
        Action: 0 = move left, 1 = move right.
        Returns: next_state, reward, done, info
        """
        # Validate action
        if action not in [0, 1]:
            raise ValueError("Invalid action. Must be 0 or 1.")
        # Determine new position after the action
        if action == 1:  # move right
            new_state = self.state + self.step_size
        else:            # move left
            new_state = self.state - self.step_size
        # Increase step count
        self.current_step += 1
        # Avoid floating-point accumulation errors by rounding
        new_state = float(np.round(new_state, 5))

        # Initialize reward and done flag
        reward = 0.0
        done   = False
        # Check if the new state crosses or reaches the goal boundaries
        if new_state >= self.max_position:
            self.state = self.max_position
            reward = 1.0
            done   = True    # reached right end -> success
        elif new_state <= self.min_position:
            self.state = self.min_position
            reward = 1.0
            done   = True    # reached left end -> success
        else:
            # Not yet at a goal, just update state
            self.state = new_state
        # If max steps exceeded and no goal reached, end the episode (failure)
        if self.current_step >= self.max_steps and not done:
            done   = True
            reward = 0.0    # no goal reached within time limit

        # Return next state as a float32 array, reward, termination flag, and an empty info dict
        return np.array([self.state], dtype=np.float32), reward, done, {}

#DQN Implementation Test

In [15]:
class NNet(nn.Module):
    def __init__(self, input_dim = 1, output_dim = 2):
        super(NNet, self).__init__()
        # Define layers: input_dim -> 10 hidden -> 10 hidden -> output_dim (one Q-value per action)
        self.hidden1 = nn.Linear(input_dim, 10)   # linear layer (input_dim -> 10)
        self.hidden2 = nn.Linear(10, 10)  # linear layer (10 -> 10)
        self.output = nn.Linear(10, output_dim)    # linear layer (10 -> output_dim)
        self.activation = nn.ReLU()       # ReLU activation for hidden layers

    def forward(self, x):
        # Forward pass: apply linear layers and activation
        x = self.activation(self.hidden1(x))
        x = self.activation(self.hidden2(x))
        x = self.output(x)  # output layer (no activation: raw Q-values)
        return x

# Instantiate the model
model = NNet()
print(model)

NNet(
  (hidden1): Linear(in_features=1, out_features=10, bias=True)
  (hidden2): Linear(in_features=10, out_features=10, bias=True)
  (output): Linear(in_features=10, out_features=2, bias=True)
  (activation): ReLU()
)


In [4]:
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done'))

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # automatically drops oldest when full

    def push(self, state, action, reward, next_state, done):
        """Save a transition."""
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Randomly sample a batch of transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

In [5]:
def select_action(q_net, state, epsilon):
    """
    state: np.array or torch.tensor shape [input_dim] or [1, input_dim]
    returns: action index (int)
    """
    if random.random() < epsilon:
        # explore
        return random.randrange(2)   # 2 actions: 0 or 1

    # exploit
    if not isinstance(state, torch.Tensor):
        state = torch.tensor(state, dtype=torch.float32)

    if state.dim() == 1:
        state = state.unsqueeze(0)   # [1, input_dim]

    with torch.no_grad():
        q_values = q_net(state)      # [1, 2]
        action = q_values.argmax(dim=1).item()
    return action

In [17]:
def compute_dqn_loss(batch, q_net, target_q_net, gamma, criterion):
    # batch is a list of Transition
    batch = Transition(*zip(*batch))
    # batch.state is a tuple of states, etc.

    # Stack the per-transition state arrays into one NumPy array first, so that
    # states and next_states have shape [B, 1] and the Q-values have shape [B, 2]
    states      = torch.tensor(np.array(batch.state),      dtype=torch.float32)     # [B, 1]
    actions     = torch.tensor(batch.action,     dtype=torch.long).unsqueeze(1)     # [B, 1]
    rewards     = torch.tensor(batch.reward,     dtype=torch.float32).unsqueeze(1)  # [B, 1]
    next_states = torch.tensor(np.array(batch.next_state), dtype=torch.float32)     # [B, 1]
    dones       = torch.tensor(batch.done,       dtype=torch.float32).unsqueeze(1)  # [B, 1]

    # 1) Q_pred = Q_online(s,a; θ)
    q_values = q_net(states)                     # [B, 2]
    q_pred = q_values.gather(1, actions)         # pick Q(s,a) → [B, 1]

    # 2) y = r + γ * (1 - done) * max_a' Q_target(s',a'; θ⁻)
    with torch.no_grad():
        next_q_values = target_q_net(next_states)           # [B, 2]
        max_next_q = next_q_values.max(dim=1, keepdim=True)[0]   # [B, 1]
        targets = rewards + gamma * (1.0 - dones) * max_next_q   # [B, 1]

    loss = criterion(q_pred, targets)
    return loss
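
To make the target computation in `compute_dqn_loss` concrete, here is a small worked sketch on a hand-made batch of three transitions. All numbers are invented for illustration, and the Q-value tensors are stand-ins for the outputs of `q_net` and `target_q_net`; only `gamma = 0.99` matches the training cell.

```python
import torch

gamma = 0.99  # discount factor, same value as in the training cell

# Made-up mini-batch of B = 3 transitions (illustrative values only)
rewards = torch.tensor([[0.0], [1.0], [0.0]])   # [B, 1]
dones = torch.tensor([[0.0], [1.0], [0.0]])     # [B, 1]; the second transition is terminal
actions = torch.tensor([[0], [1], [1]])         # [B, 1] actions that were taken

# Stand-ins for q_net(states) and target_q_net(next_states)
q_values = torch.tensor([[0.4, 0.1], [0.6, 0.7], [0.0, 0.2]])        # [B, 2]
next_q_values = torch.tensor([[0.2, 0.5], [0.9, 0.1], [0.3, 0.3]])   # [B, 2]

# Q(s, a) for the action taken in each row: 0.4, 0.7, 0.2
q_pred = q_values.gather(1, actions)                                 # [B, 1]

# y = r + gamma * (1 - done) * max_a' Q_target(s', a')
max_next_q = next_q_values.max(dim=1, keepdim=True)[0]               # [B, 1]
targets = rewards + gamma * (1.0 - dones) * max_next_q               # [B, 1]

print(q_pred.squeeze(1).tolist())
print(targets.squeeze(1).tolist())
```

The terminal row keeps only its reward (target 1.0), because `(1 - done)` zeroes the bootstrapped term; the other two rows become `0.99 * 0.5` and `0.99 * 0.3`.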

In [18]:
buffer = ReplayBuffer(capacity=10_000)
num_episodes = 500
batch_size = 64
gamma = 0.99
target_update_freq = 10   # episodes

epsilon_start = 1.0
epsilon_end = 0.05
epsilon_decay = 300       # higher = slower decay

step_count = 0
env = DoubleGoalEnv()

q_net = NNet(input_dim=1, output_dim=2)
target_q_net = NNet(input_dim=1, output_dim=2)
# copy weights from q_net → target_q_net initially
target_q_net.load_state_dict(q_net.state_dict())
target_q_net.eval()  # we don't train this with gradients

criterion = nn.MSELoss()
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)

for episode in range(num_episodes):
    state = env.reset()   # NumPy array of shape (1,), e.g. array([0.], dtype=float32)
    done = False

    while not done:
        # decay epsilon over time
        epsilon = epsilon_end + (epsilon_start - epsilon_end) * \
                  torch.exp(torch.tensor(-step_count / epsilon_decay)).item()

        action = select_action(q_net, state, epsilon)

        next_state, reward, done, info = env.step(action)

        # store transition
        buffer.push(state, action, reward, next_state, done)
        state = next_state
        step_count += 1

        # learn if we have enough samples
        if len(buffer) >= batch_size:
            batch = buffer.sample(batch_size)
            loss = compute_dqn_loss(batch, q_net, target_q_net, gamma, criterion)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # periodically update target network
    if (episode + 1) % target_update_freq == 0:
        target_q_net.load_state_dict(q_net.state_dict())


##Q-Network Implementation:
Below, we define a neural network using PyTorch's nn.Module. It has a couple of fully-connected layers with non-linear activations (ReLU). The final layer outputs one Q-value per action. We keep the network small (given the simple state space) – this also ensures it trains quickly in Colab. The code comments explain the torch.nn components:

In [3]:
# Neural network for Q-value approximation
class QNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(QNetwork, self).__init__()
        # Define feed-forward neural network layers
        self.fc1 = nn.Linear(input_dim, 64)   # first hidden layer with 64 units
        self.fc2 = nn.Linear(64, 64)          # second hidden layer with 64 units
        self.fc3 = nn.Linear(64, output_dim)  # output layer (Q-values for each action)
    def forward(self, x):
        # x is a tensor of shape [batch_size, input_dim]
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # No activation on output layer (we want raw Q-values)
        return self.fc3(x)

In [4]:
# Initialize environment and Q-network
env = DoubleGoalEnv()
state_dim  = 1            # state is one-dimensional (position)
action_dim = 2            # two actions: 0 or 1
q_net = QNetwork(state_dim, action_dim)
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.001)

# Replay memory to store past transitions
memory = deque(maxlen=10000)

# Hyperparameters
episodes = 200           # number of episodes to train
batch_size = 32
gamma = 0.99             # discount factor for future rewards
epsilon = 1.0            # initial exploration probability
epsilon_min = 0.05       # minimum exploration probability
epsilon_decay = 0.995    # decay rate per episode

for episode in range(1, episodes+1):
    state = env.reset()                        # reset environment at start of episode
    total_reward = 0.0
    done = False
    # Iterate for each step of the episode
    while not done:
        # Choose action using epsilon-greedy policy
        if random.random() < epsilon:
            action = random.randint(0, action_dim-1)  # explore: random action
        else:
            # exploit: choose best action according to Q-network
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)  # shape [1,1]
            q_values = q_net(state_tensor)       # get Q-values for both actions
            action = int(torch.argmax(q_values, dim=1).item())  # select action with max Q

        # Apply action to the environment
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        # Store transition in replay buffer
        memory.append((state, action, reward, next_state, done))
        # Update current state
        state = next_state

        # Train the Q-network with a batch sampled from memory (if enough samples exist)
        if len(memory) >= batch_size:
            # Sample a random minibatch of transitions
            batch = random.sample(memory, batch_size)
            # Separate the components of each transition
            states  = np.vstack([transition[0] for transition in batch])   # shape (batch_size, state_dim)
            actions = np.array([transition[1] for transition in batch])
            rewards = np.array([transition[2] for transition in batch], dtype=np.float32)
            next_states = np.vstack([transition[3] for transition in batch])
            dones   = np.array([transition[4] for transition in batch], dtype=np.float32)

            # Convert to tensors
            state_tensor      = torch.tensor(states, dtype=torch.float32)        # shape [B, 1]
            next_state_tensor = torch.tensor(next_states, dtype=torch.float32)   # shape [B, 1]
            action_tensor     = torch.tensor(actions, dtype=torch.int64)         # shape [B]
            reward_tensor     = torch.tensor(rewards, dtype=torch.float32)       # shape [B]
            done_tensor       = torch.tensor(dones, dtype=torch.float32)         # shape [B]

            # Compute current Q values for the actions taken: Q(s, a)
            q_values = q_net(state_tensor)            # shape [B, action_dim]
            # Gather the Q-values for the chosen actions
            # Using unsqueeze to align indices: shape of chosen_q = [B]
            chosen_q = q_values.gather(1, action_tensor.unsqueeze(1)).squeeze(1)

            # Compute target Q values: if done, target = reward; else target = reward + γ * max_a' Q_next
            with torch.no_grad():
                # Get max Q for next state
                next_q_values = q_net(next_state_tensor)               # [B, action_dim]
                max_next_q = torch.max(next_q_values, dim=1)[0]        # [B]
                target = reward_tensor + gamma * max_next_q * (1 - done_tensor)
                # (When done_tensor is 1, the term (1-done) will zero out the future reward.)

            # Compute loss (Mean Squared Error between targets and current Q estimates)
            loss = F.mse_loss(chosen_q, target)

            # Backpropagation and optimization step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Decay exploration rate each episode
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    # Optionally, print training progress
    if episode % 20 == 0:
        print(f"Episode {episode}: Total Reward = {total_reward:.1f}, Epsilon = {epsilon:.3f}")


Episode 20: Total Reward = 1.0, Epsilon = 0.905
Episode 40: Total Reward = 1.0, Epsilon = 0.818
Episode 60: Total Reward = 1.0, Epsilon = 0.740
Episode 80: Total Reward = 0.0, Epsilon = 0.670
Episode 100: Total Reward = 1.0, Epsilon = 0.606
Episode 120: Total Reward = 1.0, Epsilon = 0.548
Episode 140: Total Reward = 1.0, Epsilon = 0.496
Episode 160: Total Reward = 1.0, Epsilon = 0.448
Episode 180: Total Reward = 1.0, Epsilon = 0.406
Episode 200: Total Reward = 1.0, Epsilon = 0.367


In [5]:
def generate_episode(env, q_net, epsilon=0.0, max_steps=1000):
    """
    Generate one episode by following the policy implied by the trained
    Q-network: in each state, take the action with the largest Q-value.
    Returns the lists of states, actions and rewards for the episode.

    If epsilon > 0, follow an epsilon-greedy policy (exploration).
    If epsilon = 0, follow a greedy policy (pure exploitation).
    """
    states, actions, rewards = [], [], []
    state = env.reset()

    for _ in range(max_steps):
        # --- ε-greedy action from the Q-network ---
        if np.random.rand() < epsilon:
            action = np.random.randint(2)   # DoubleGoalEnv has two actions: 0 or 1
        else:
            # The state is continuous, so query the network instead of indexing a table
            with torch.no_grad():
                q = q_net(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
            q = q.squeeze(0).numpy()
            best = np.flatnonzero(q == q.max())   # break ties at random
            action = int(np.random.choice(best))

        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state

        if done:
            break

    return states, actions, rewards

In [6]:
generate_episode(env, q_net, epsilon=0.0)

([array([0.], dtype=float32),
  array([0.1], dtype=float32),
  array([0.2], dtype=float32),
  array([0.3], dtype=float32),
  array([0.4], dtype=float32),
  array([0.5], dtype=float32),
  array([0.6], dtype=float32),
  array([0.7], dtype=float32),
  array([0.8], dtype=float32),
  array([0.9], dtype=float32)],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
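
The reward list printed above contains a single +1 at the tenth step. As a rough sketch, the discounted return from the start state can be computed from that list with the training discount factor `gamma = 0.99`. The helper name `discounted_return` is hypothetical and not part of the notebook.

```python
gamma = 0.99  # same discount factor as in the training cell

# Reward list copied from the episode printed above
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

def discounted_return(rewards, gamma):
    """Hypothetical helper: G_0 = sum over t of gamma**t * rewards[t]."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g0 = discounted_return(rewards, gamma)
print(f"Discounted return from the start state: {g0:.4f}")
```

Because the only reward arrives after nine zero-reward steps, the result equals `gamma ** 9`, roughly 0.91. A well-trained Q-network should predict a value close to this for moving right at the start state.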

#Actor-Critic Implementation

In [2]:
# Policy (Actor) network
class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 32)       # smaller network (32 hidden units)
        self.fc2 = nn.Linear(32, output_dim)      # output logits for each action
    def forward(self, x):

        '''F.relu(self.fc1(x)) takes the output of the first layer
        (which has 32 units) and applies the ReLU activation function to it.
        This helps the network learn non-linear patterns.'''

        x = F.relu(self.fc1(x))
        # Use softmax to get probabilities over actions

        '''The 32 hidden activations go through a linear map (fc2), which
        returns a vector of size output_dim (the logits); softmax then turns
        those logits into one probability per action.'''

        probs = F.softmax(self.fc2(x), dim=-1)
        return probs  # tensor of shape [batch_size, output_dim] with probabilities

# Value (Critic) network
class ValueNetwork(nn.Module):
    def __init__(self, input_dim):
        super(ValueNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 32)
        self.fc2 = nn.Linear(32, 1)  # outputs a single value
    def forward(self, x):
        x = F.relu(self.fc1(x))

        '''The Critic estimates the Value ($V(s)$) of the state, which represents
        the total expected future reward'''

        value = self.fc2(x)  # no activation on value output
        return value         # tensor of shape [batch_size, 1]



In [4]:
# Initialize environment, actor, critic, and optimizers
env = DoubleGoalEnv()
actor = PolicyNetwork(state_dim, action_dim)
critic = ValueNetwork(state_dim)

'''the optimizer is the "manager." We tell the manager exactly which
"employees" (parameters) to supervise (actor.parameters()) and how fast
they should learn (lr or learning rate).'''

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=0.001)

'''Notice the Critic's learning rate (0.005) is higher than the Actor's
(0.001). This is common; we often want the Critic to learn the value of
states faster so it can provide accurate feedback to the Actor'''

critic_optimizer = torch.optim.Adam(critic.parameters(), lr=0.005)  # critic can have a different LR

# Hyperparameters
episodes = 200
gamma = 0.99
r'''The usual policy gradient update rule is "disguised" in the code across three
different steps. The Update ($\theta_{t+1} = \theta_t + \dots$): This is handled
by actor_optimizer.step(). The Alpha ($\alpha$): This is the lr=0.001 inside the
optimizer. The Gradient ($\nabla \dots$): This is calculated automatically when
we call .backward().'''

for episode in range(1, episodes+1):
    state = env.reset()
    '''state is a single observation from the environment; here
    env.reset() returns np.array([0.0], dtype=np.float32).
    torch.tensor(...) converts it into a PyTorch tensor of shape [1],
    i.e. tensor([0.]), and .unsqueeze(0) changes the shape to [1, 1].

    Even though we only feed one state at a time, we keep the usual
    layout Input Shape = [Batch Size, Input Features]. nn.Linear would
    also accept an unbatched input of shape [1], but adding the batch
    dimension with .unsqueeze(0) keeps the shapes consistent: the
    critic then returns V(s) with shape [1, 1], matching td_target.'''
    state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
    done = False
    ep_return = 0.0  # track total reward in this episode
    while not done:
        # 1. Actor: get action probabilities and sample an action
        probs = actor(state_tensor)            # probabilities over actions
        # Sample action from the probability distribution
        '''Categorical(probs) creates a statistical distribution based
        on those probabilities'''
        action_dist = torch.distributions.Categorical(probs)
        '''Using sample() introduces Exploration. If we used argmax
        (always picking the highest probability) right from the start,
        the agent would fall into a trap called a Local Optimum'''
        action = action_dist.sample()          # draws an action index (0 or 1)
        # Log probability of the chosen action (for loss calculation)
        log_prob = action_dist.log_prob(action)

        # 2. Take the action in the environment
        next_state, reward, done, _ = env.step(int(action.item()))
        ep_return += reward

        # 3. Critic: evaluate state and next state
        '''The critic in this case is just standard TD(0) learning with function
        approximation, so the update is
        v(s_t) <- v(s_t) + beta * (r_{t+1} + gamma * v(s_{t+1}) - v(s_t)),
        with a neural network approximating v(s). The actor influences the
        critic only through the action it samples from pi(a|s, theta): that
        action is passed to env.step(a), which produces the next state s'.
        The critic then evaluates v(s') to build the TD target.'''
        value = critic(state_tensor)                             # V(s)
        next_state_tensor = torch.tensor(next_state, dtype=torch.float32).unsqueeze(0)
        next_value = critic(next_state_tensor) if not done else torch.zeros(1, 1)

        # 4. Compute the TD target and advantage
        # TD target = r + γ * V(s') (or just r if done since next_value = 0)
        td_target = torch.tensor([[reward]], dtype=torch.float32) + gamma * next_value
        advantage = td_target - value  # (TD Error) how much better (positive) or worse (negative) the outcome was than expected

        # 5. Compute critic loss (MSE between V(s) and TD target) and actor loss
        critic_loss = F.mse_loss(value, td_target.detach())      # detach target: no gradient flows through V(s') (semi-gradient TD)
        actor_loss = -log_prob * advantage.detach()              # policy gradient loss (detached advantage)

        # 6. Backpropagate losses
        actor_optimizer.zero_grad()
        critic_optimizer.zero_grad()
        actor_loss.backward()
        critic_loss.backward()
        '''This is a method belonging to a PyTorch optimizer object, its purpose
        is to update the weights (parameters) of a neural network based on the
        gradients that were computed during the backward pass (.backward()).'''
        actor_optimizer.step()
        critic_optimizer.step()

        # 7. Move to the next state
        state_tensor = next_state_tensor

    # Optionally print episode result
    if episode % 20 == 0:
        print(f"Episode {episode}: Total Reward (Return) = {ep_return:.1f}")
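
The per-step update in the loop above can be traced by hand on a single non-terminal step. This is only a sketch: the numbers are made up, and the tensors `value`, `next_value` and `probs` are stand-ins for the outputs of `critic` and `actor`, not results from the trained networks.

```python
import torch
import torch.nn.functional as F

gamma = 0.99

# Made-up values for one non-terminal step (illustrative only)
reward = 0.0
value = torch.tensor([[0.50]])        # stand-in for critic(state_tensor), V(s)
next_value = torch.tensor([[0.80]])   # stand-in for critic(next_state_tensor), V(s')
probs = torch.tensor([[0.3, 0.7]], requires_grad=True)  # stand-in for actor(state_tensor)

# Suppose the sampled action was 1 (move right)
action = torch.tensor([1])
log_prob = torch.distributions.Categorical(probs).log_prob(action)  # log(0.7)

# TD target and advantage (TD error)
td_target = torch.tensor([[reward]]) + gamma * next_value   # 0.99 * 0.80 = 0.792
advantage = td_target - value                               # 0.792 - 0.50 = 0.292

critic_loss = F.mse_loss(value, td_target.detach())         # 0.292 ** 2
actor_loss = (-log_prob * advantage.detach()).sum()

# Gradient of the actor loss with respect to the action probabilities
actor_loss.backward()
print(f"advantage = {advantage.item():.3f}, critic_loss = {critic_loss.item():.4f}")
print("d(actor_loss)/d(probs) =", probs.grad.tolist())
```

The advantage is positive, so the outcome was better than the critic expected. The gradient for the chosen action's probability is negative, which means a gradient-descent step on `actor_loss` raises the probability of that action.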



##Batch Size, Zeroing Gradients, Gradient Descent, SGD, Mini-Batch SGD

We categorize these based on the **Batch Size** (how much data the model sees before updating weights).

### **Batch Gradient Descent (Full Batch)**
* **Batch Size:** $N$ (The entire dataset).
* **Process:** The model calculates the error for *every* data point in the training set, averages them, and then performs a single weight update.
* **Pros:** The trajectory is smooth and stable because it uses the true gradient of the dataset.
* **Cons:** It is computationally very slow per step and memory-intensive. It can get stuck in "saddle points" (flat areas) because it lacks noise.
* **Zero Grad:** Required between **Epochs** (passes through the dataset).

### **Stochastic Gradient Descent (SGD)**
* **Batch Size:** $1$ (A single data point).
* **Process:** The model calculates the error and updates weights after *each* individual sample.
* **Pros:** Updates are extremely fast. The "noise" from individual samples helps the model escape shallow local minima.
* **Cons:** The path to the minimum is very jittery (high variance), which can make it hard to settle on the exact best solution.
* **Zero Grad:** Required between **every sample**.

### **Mini-Batch Gradient Descent**
* **Batch Size:** $M$ (e.g., 32, 64, 128), where $1 < M < N$.
* **Process:** The model updates weights after processing a small group of samples.
* **Pros:** The "Sweet Spot." It utilizes GPU parallelism efficiently (faster than SGD) and offers a balance of stability and useful noise (better than Full Batch).
* **Cons:** Requires tuning an extra hyperparameter (the batch size).
* **Zero Grad:** Required between **every batch**.
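
The three variants differ only in the batch size passed to the data loader. The sketch below counts how many weight updates one epoch produces for each. It reuses the 20-point $\sin^2(x)$ data from the regression example earlier in the notebook; the tiny linear model, the mini-batch size of 4, and the helper name `updates_per_epoch` are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

N = 20                                   # dataset size
x = 2 * torch.rand(N, 1)                 # inputs in [0, 2)
y = torch.sin(x) ** 2                    # targets f(x) = sin^2(x)
dataset = TensorDataset(x, y)

def updates_per_epoch(batch_size):
    """Hypothetical helper: train for one epoch and count the weight updates."""
    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.MSELoss()
    n_updates = 0
    for xb, yb in DataLoader(dataset, batch_size=batch_size, shuffle=True):
        optimizer.zero_grad()            # clear gradients from the previous batch
        loss = criterion(model(xb), yb)  # loss averaged over this batch only
        loss.backward()
        optimizer.step()                 # one weight update per batch
        n_updates += 1
    return n_updates

counts = {
    "full batch": updates_per_epoch(N),  # batch size N
    "SGD": updates_per_epoch(1),         # batch size 1
    "mini-batch": updates_per_epoch(4),  # batch size M = 4
}
print(counts)  # {'full batch': 1, 'SGD': 20, 'mini-batch': 5}
```

Note that `optim.SGD` is the same optimizer class in all three cases; which variant you get is decided by how much data goes into each `backward()` call.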

---

### **Why We Zero Out Gradients (`zero_grad()`)**

* **The Mechanism:** In PyTorch, calling `.backward()` **adds** (accumulates) the new gradients to the existing ones rather than replacing them.
* **The Reason:** We generally want each update step to be based **only** on the error from the current batch (the current "terrain").
* **The Rule:** If you do not zero out the gradients, each `.backward()` adds to the gradients left over from earlier batches, so `.step()` applies the sum of old and new gradients instead of the current batch's gradient. Zero them once per update, before the `.backward()` call that feeds the next `.step()`.
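
The accumulation behaviour is easy to observe on a single scalar parameter. This is a minimal sketch with a made-up function $3w$, whose gradient with respect to $w$ is always 3.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one
(3 * w).backward()
accumulated = w.grad.item()

# Zero the stored gradient, then run backward again
w.grad.zero_()
(3 * w).backward()
after_zero = w.grad.item()

print(first, accumulated, after_zero)  # 3.0 6.0 3.0
```

`optimizer.zero_grad()` does the same reset for every parameter the optimizer manages, which is why the training loops in this notebook call it before each `loss.backward()`.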