### 7. Second Dense Layer — Output Neuron

Our second layer takes the 64 outputs from the previous (hidden) layer and maps them to a **single output neuron**:

$$
z^{(2)} = \vec{a}^{(1)} \cdot \vec{w}^{(2)} + b^{(2)}
$$

Where:
- $\vec{a}^{(1)}$: activations from first layer (after ReLU)
- $\vec{w}^{(2)}$: weights connecting 64 hidden units to this output neuron
- $b^{(2)}$: scalar bias

This output neuron produces a **logit** (a raw, unbounded prediction score), which is passed through a **Sigmoid activation** to convert it to a probability.

Since we are doing **binary classification**, the value after the Sigmoid lies in the open interval (0, 1) and is read as the probability of belonging to class 1.
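
To make the shapes in the equation above concrete, here is a minimal NumPy sketch of the output-layer computation for a small batch. The arrays `a1`, `w2`, and `b2` are hypothetical stand-ins filled with random values; they are not the parameters held by the notebook's `Layer_Dense` object.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 5 samples of post-ReLU hidden activations and
# randomly initialized output-layer parameters (assumed shapes).
a1 = np.maximum(0, rng.standard_normal((5, 64)))  # shape (5, 64)
w2 = 0.01 * rng.standard_normal((64, 1))          # shape (64, 1)
b2 = np.zeros((1, 1))                             # shape (1, 1)

# One logit per sample: z2 = a1 . w2 + b2
z2 = a1 @ w2 + b2
print(z2.shape)  # (5, 1)
```

Each row of `z2` is the dot product of one sample's 64 activations with the weight vector, plus the bias.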


In [None]:
dense2 = Layer_Dense(64, 1)


### 8. Sigmoid Activation Function

Sigmoid is an S-shaped curve that maps any real number to a value in (0, 1):

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:
- $z$: input logit (raw score)
- $\sigma(z)$: output probability

This makes it well suited to binary classification: the output can be interpreted as the **probability that a sample belongs to class 1**.

### Derivative (used in backward pass):

$$
\sigma'(z) = \sigma(z)(1 - \sigma(z))
$$

This derivative shows how the output changes with respect to the input — critical for computing gradients during training.
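
The derivative formula can be sanity-checked numerically. The sketch below compares it with a central finite difference at a few arbitrary points; the helper names `sigmoid` and `sigmoid_derivative` are illustrative and separate from the class defined in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 3.0])  # arbitrary test points
h = 1e-5                        # finite-difference step size

# Central difference approximation of the derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_derivative(z)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```

The difference should be tiny, and the derivative peaks at $z = 0$, where $\sigma'(0) = 0.25$.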


In [None]:
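class Activation_Sigmoid:
    def forward(self, inputs):
        # Cache the inputs, then apply the sigmoid element-wise
        self.inputs = inputs
        self.output = 1 / (1 + np.exp(-inputs))

    def backward(self, dvalues):
        # Chain rule: upstream gradient times sigma'(z) = sigma(z) * (1 - sigma(z))
        self.dinputs = dvalues * (1 - self.output) * self.output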


In [None]:
activation2 = Activation_Sigmoid()


### 9. Final Output and Classification

The output of the Sigmoid layer is:

$$
\hat{y} = \sigma(z^{(2)}) \in (0, 1)
$$

We interpret this as:

- $\hat{y} > 0.5 \Rightarrow$ predict class 1
- $\hat{y} \leq 0.5 \Rightarrow$ predict class 0

This simple rule turns a probability into a **class prediction**.

For 2D visualization tasks (like spiral data), the points where $\hat{y} = 0.5$ form the decision boundary that separates the two classes in the input space.
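
As a small illustration of the rule, the sketch below thresholds a few made-up probabilities. The array `probs` is hypothetical and stands in for the Sigmoid layer's output.

```python
import numpy as np

# Hypothetical sigmoid outputs for four samples, shape (4, 1)
probs = np.array([[0.20], [0.50], [0.73], [0.99]])

# Strictly greater than 0.5 maps to class 1; everything else to class 0
classes = (probs > 0.5).astype(int)
print(classes.ravel())  # [0 0 1 1]
```

Note that a probability of exactly 0.5 falls on the class 0 side, matching the $\hat{y} \leq 0.5$ case.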


In [None]:
predictions = (activation2.output > 0.5) * 1


### 10. What Is a Loss Function?

A **loss function** quantifies how wrong the network's predictions are.

If $\hat{y}$ is the prediction and $y$ is the true label, then the loss is a **measure of error**.

In **binary classification**, we use:

### Binary Cross-Entropy Loss:

For a single sample:

$$
\mathcal{L}(\hat{y}, y) = -\left[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})\right]
$$

Why this form?
- It heavily penalizes confident but wrong predictions (e.g., $\hat{y} = 0.99$ but $y = 0$)
- It is derived from maximum likelihood estimation under a Bernoulli model
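
The maximum likelihood connection can be sketched in two lines (a standard derivation, added here for context). Treat $\hat{y}$ as the parameter of a Bernoulli distribution over the label $y \in \{0, 1\}$. The likelihood of the observed label is:

$$
P(y \mid \hat{y}) = \hat{y}^{\,y} \, (1 - \hat{y})^{1 - y}
$$

Taking the negative logarithm recovers the per-sample loss:

$$
-\log P(y \mid \hat{y}) = -\left[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})\right]
$$

Maximizing the likelihood is therefore equivalent to minimizing this loss.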

We compute the average loss over the dataset:

$$
\mathcal{L}_{avg} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\hat{y}_i, y_i)
$$
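
A short worked example may help here. The sketch below evaluates the per-sample loss and its average for three made-up predictions; the values in `y_true` and `y_pred` are hypothetical, and the clipping constant mirrors the one used in this notebook's loss class.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.9, 0.2, 0.99])  # the last one is confident but wrong

# Clip to keep log() away from 0
eps = 1e-7
p = np.clip(y_pred, eps, 1 - eps)

sample_losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
average_loss = sample_losses.mean()

print(np.round(sample_losses, 3))  # [0.105 0.223 4.605]
```

The confident wrong prediction contributes a loss more than forty times larger than the confident correct one, which is the penalty behavior described above.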


In [None]:
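class Loss_BinaryCrossentropy:
    def forward(self, y_pred, y_true):
        # Clip predictions to avoid log(0)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        sample_losses = -(y_true * np.log(y_pred_clipped) +
                          (1 - y_true) * np.log(1 - y_pred_clipped))
        # Mean over the output axis: one loss value per sample.
        # Take np.mean of the result to get the dataset average.
        return np.mean(sample_losses, axis=-1)

    def backward(self, dvalues, y_true):
        # dvalues holds the predictions (sigmoid outputs) from the forward pass
        samples = len(dvalues)
        outputs = len(dvalues[0])
        # Clip to avoid division by zero
        clipped_dvalues = np.clip(dvalues, 1e-7, 1 - 1e-7)
        # Gradient of the loss w.r.t. each prediction, averaged over outputs
        self.dinputs = -(y_true / clipped_dvalues -
                         (1 - y_true) / (1 - clipped_dvalues)) / outputs
        # Normalize by the number of samples
        self.dinputs = self.dinputs / samples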


In [None]:
loss_function = Loss_BinaryCrossentropy()


### 11. Why Do We Train?

Our aim is to **minimize the loss function** by adjusting the weights and biases.

The loss value tells us how the model is doing:
- If the loss is high, the model is making poor predictions.
- If the loss is low, the model is making accurate predictions.

### How to Minimize?

We use **Gradient Descent** — a method that tweaks each parameter in the direction that reduces the loss.

Each parameter update is:

$$
\theta \leftarrow \theta - \eta \cdot \frac{\partial \mathcal{L}}{\partial \theta}
$$

Where:
- $\theta$: a parameter (weight or bias)
- $\eta$: learning rate
- $\frac{\partial \mathcal{L}}{\partial \theta}$: gradient of the loss with respect to $\theta$

This process repeats across many **epochs**, gradually moving the parameters toward values that give a low loss.
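
To see the update rule in action, here is a minimal sketch on a toy problem rather than the network itself. The one-parameter loss $(\theta - 3)^2$, the starting value, and the learning rate are all assumptions chosen for illustration.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)

theta = 0.0   # assumed starting value
eta = 0.1     # assumed learning rate
history = []

for step in range(100):
    # theta <- theta - eta * dL/dtheta
    theta = theta - eta * grad(theta)
    history.append(loss(theta))

print(theta, history[-1])
```

Each step shrinks the distance to the minimum by a constant factor of 0.8, so `theta` ends very close to 3. In the network, the same rule is applied to every weight and bias using gradients from the backward pass.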


### 12. Accuracy

Accuracy is the **fraction of correctly predicted labels**:

$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Predictions}}
$$

In binary classification, a prediction counts as correct when:

- $\hat{y} > 0.5$ and $y = 1$
- $\hat{y} \le 0.5$ and $y = 0$

All other cases are incorrect. We calculate this metric after every epoch to monitor performance.
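
One practical detail when computing accuracy with NumPy is array shape. The sketch below uses made-up arrays `demo_predictions` and `demo_y` (both hypothetical) and assumes the labels arrive as a flat vector. If the labels in this notebook are already stored as a column, the reshape is unnecessary.

```python
import numpy as np

# Hypothetical data: predictions as a column, labels as a flat vector
demo_predictions = np.array([[1], [0], [1], [1]])  # shape (4, 1)
demo_y = np.array([1, 0, 0, 1])                    # shape (4,)

# Comparing (4, 1) with (4,) broadcasts to a (4, 4) grid, which is not
# a per-sample comparison
broadcast_shape = (demo_predictions == demo_y).shape

# Reshape the labels to a column so the comparison is element-wise
demo_y_col = demo_y.reshape(-1, 1)
demo_accuracy = np.mean(demo_predictions == demo_y_col)

print(broadcast_shape, demo_accuracy)  # (4, 4) 0.75
```

Three of the four predictions match their labels, giving an accuracy of 0.75.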


In [None]:
accuracy = np.mean(predictions == y)
