# Multiclass Classification

In multiclass classification, the target variable $y$ can take on more than two discrete values (e.g., digits 0–9, various diseases, or different defect types).  

Imagine plotting data points on a plane where each cluster represents a different class. Instead of just two clusters (binary), you may have several clusters, each with a unique marker (e.g., circles, triangles, squares).

**Binary vs. Multiclass:**  
- **Binary:** $y \in \{0, 1\}$
- **Multiclass:** $y \in \{1, 2, \dots, n\}$

**Examples:**  
- **Handwritten Digit Recognition:** Recognize digits $0$ through $9$ (10 classes).
- **Medical Diagnosis:** Identify one disease among several possibilities.
- **Quality Inspection:** Classify parts with different types of defects (e.g., scratch, discoloration, chip).

---

## Softmax Regression

Softmax regression extends logistic regression to handle multiclass problems.


**Logistic Regression (Binary):**  

Compute:
  
$$z = \mathbf{w} \cdot \mathbf{x} + b$$
  
Then apply the sigmoid function:
  
$$a = g(z) = \frac{1}{1 + e^{-z}}$$
  
Here $a$ is the probability that $y = 1$, so the probability that $y = 0$ is $1 - a$.


**Softmax Regression (Multiclass):**  

For each class $j$ (with $j = 1, 2, \dots, n$), compute:
  
$$z_j = \mathbf{w}_j \cdot \mathbf{x} + b_j.$$
  
Then, calculate the probabilities using the softmax function:
  
$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}$$
  
The probabilities satisfy:

$$\sum_{j=1}^{n} a_j = 1$$
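
The softmax formula above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the logit values are made up, and the function does not guard against overflow for very large logits.

```python
import numpy as np

def softmax(z):
    """Map a 1-D array of logits z_1..z_n to probabilities a_1..a_n."""
    ez = np.exp(z)          # e^{z_j} for every class j
    return ez / ez.sum()    # divide by the sum over all classes

z = np.array([2.0, 1.0, 0.1])   # hypothetical logits for n = 3 classes
a = softmax(z)

print(a)         # one probability per class; the largest logit gets the largest probability
print(a.sum())   # sums to 1 (up to floating-point rounding)
```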

### Example with 4 Classes

For $n = 4$, compute:

$$
z_1 = \mathbf{w}_1 \cdot \mathbf{x} + b_1,
$$

$$
z_2 = \mathbf{w}_2 \cdot \mathbf{x} + b_2,
$$

$$
z_3 = \mathbf{w}_3 \cdot \mathbf{x} + b_3,
$$

$$
z_4 = \mathbf{w}_4 \cdot \mathbf{x} + b_4.
$$

Then, for each class:

$$
a_j = \frac{e^{z_j}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}} \quad \text{for } j=1,2,3,4.
$$
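
As a worked instance of the four-class case, the sketch below stacks the four weight vectors as rows of a matrix `W`. All numbers (`W`, `b`, `x`) are invented for illustration, and picking the class with the largest probability as the prediction is a common convention assumed here rather than something stated above.

```python
import numpy as np

# Hypothetical parameters: 4 classes, 2 input features.
W = np.array([[ 1.0, -1.0],    # w_1
              [ 0.5,  0.5],    # w_2
              [-1.0,  2.0],    # w_3
              [ 0.0,  0.0]])   # w_4
b = np.array([0.0, 0.1, -0.2, 0.3])
x = np.array([1.0, 2.0])

z = W @ x + b                     # z_j = w_j . x + b_j, approximately [-1.0, 1.6, 2.8, 0.3]
a = np.exp(z) / np.exp(z).sum()   # softmax over the four logits

pred = int(np.argmax(a)) + 1      # classes are numbered 1..4, so shift the 0-based index
print(a, pred)                    # class 3 has the largest logit and the largest probability
```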

---

## Cost Function for Softmax Regression

### Cross-Entropy Loss

**For Logistic Regression (Binary):**

$$
\text{Loss} = -\Big[ y \log(a) + (1-y) \log(1-a) \Big].
$$

**For Softmax Regression (Multiclass):**  

If the true label is $y = j$, the loss for that example is:

$$
L = -\log(a_j).
$$

- A high $a_j$ (close to 1) leads to a low loss.
- A low $a_j$ results in a high loss.
- The overall cost is the average loss over all training examples.
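
To make the relationship between $a_j$ and the loss concrete, here is a small sketch with an invented softmax output. The helper name `loss` and the probability values are assumptions for illustration only.

```python
import numpy as np

a = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output for 3 classes

def loss(a, y):
    """Cross-entropy loss -log(a_y) for a true label y in 1..n."""
    return -np.log(a[y - 1])    # shift to NumPy's 0-based indexing

print(loss(a, 1))   # about 0.357: the true class received high probability
print(loss(a, 3))   # about 2.303: the true class received low probability
```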

**Cost**

For each example, only the term for the true class contributes to the loss; the terms for all other classes are zero. To write the cost as a single sum, we use an indicator function that equals 1 when the class index matches the target and 0 otherwise:

$$\mathbf{1}\{y = j\} = \begin{cases}
1, & \text{if } y = j,\\
0, & \text{otherwise.}
\end{cases}$$

The cost function can therefore be written as:

$$
J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{n} \mathbf{1}\left\{y^{(i)} = j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^{n} e^{z^{(i)}_k}} \right]
$$
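
The cost can be sketched in vectorized NumPy, with the indicator function realized as a boolean mask. The function name `softmax_cost`, the array layout (one row of logits per example), and the test data are assumptions made for this sketch.

```python
import numpy as np

def softmax_cost(Z, y):
    """Average cross-entropy cost.

    Z: (m, n) array of logits, one row per example.
    y: (m,) array of integer labels in 1..n.
    """
    m, n = Z.shape
    A = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)   # row-wise softmax
    indicator = (y[:, None] == np.arange(1, n + 1))         # indicator[i, j-1] is True when y^(i) = j
    return -(indicator * np.log(A)).sum() / m

# With all-zero logits every class gets probability 1/4,
# so the cost is -log(1/4) = log(4), about 1.386.
Z = np.zeros((2, 4))
y = np.array([1, 3])
cost = softmax_cost(Z, y)
print(cost)
```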

---

## Neural Networks with a Softmax Output Layer

**Structure:**
- **Input Layer:** Features $\mathbf{X}$.
- **Hidden Layers:** One or more layers with activations (e.g., ReLU).
- **Output Layer:** $K$ neurons (one per class) with a softmax activation.

### Forward Propagation in the Output Layer

For a network with $K$ output classes, compute the logits:

$$
Z_j = \mathbf{W}_j \cdot a^{(L-1)} + b_j \quad \text{for } j = 1, \dots, K,
$$

where $a^{(L-1)}$ are the activations from the previous layer. Then, apply softmax:

$$
a_j = \frac{e^{Z_j}}{\sum_{k=1}^{K} e^{Z_k}}.
$$

**Note:**  
Each $a_j$ depends on all $Z_k$ values, unlike element-wise activations (e.g., sigmoid).
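
The difference can be checked numerically. In the sketch below (with made-up logits), changing a single logit leaves the other sigmoid outputs untouched but shifts every softmax output.

```python
import numpy as np

def softmax(z):
    ez = np.exp(z)
    return ez / ez.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([1.0, 2.0, 3.0])
z_changed = np.array([1.0, 2.0, 5.0])   # only the last logit differs

# Element-wise activation: only the changed position moves.
sig_same = np.isclose(sigmoid(z), sigmoid(z_changed))    # [True, True, False]

# Softmax: the shared denominator changes, so every output moves.
soft_same = np.isclose(softmax(z), softmax(z_changed))   # [False, False, False]

print(sig_same, soft_same)
```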

### TensorFlow Implementation (Conceptual Overview)

```python
import tensorflow as tf

input_dim = 400  # placeholder: number of input features

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu', input_shape=(input_dim,)),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')  # For 10 classes
])
```

**Loss Function:**  

Use `SparseCategoricalCrossentropy`, which expects integer class labels (0 to 9 here) rather than one-hot vectors:
  
```python
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```

---

## Numerical Stability in Softmax Implementations

- **Floating-Point Precision:** Exponential functions can produce very large or very small numbers, leading to round-off errors.
- **Example:** Computing $2/10000$ directly versus through a mathematically equivalent formulation such as $(1 + 1/10000) - (1 - 1/10000)$ can yield slightly different results due to limited precision.
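
The round-off effect is easy to reproduce in plain Python. The second expression below is one possible equivalent formulation, chosen here for illustration. Printing both values with 18 decimal places exposes the difference in the trailing digits.

```python
# Two mathematically equal ways of computing 2/10000.
x1 = 2.0 / 10000
x2 = (1 + 1 / 10000) - (1 - 1 / 10000)

print(f"{x1:.18f}")
print(f"{x2:.18f}")   # differs from x1 in the trailing digits

# The intermediate values near 1 carry far fewer digits after the
# decimal point than a value near 0.0002, so rounding error leaks in.
print(x1 == x2)       # False
```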

### Improving Stability

**Combine Computations:**  

Instead of computing softmax probabilities and then applying cross-entropy, combine them. This allows frameworks like TensorFlow to rearrange operations for better numerical accuracy.

**TensorFlow Example with Logits:**  

Use `from_logits=True` to compute the loss more stably:

```python
import tensorflow as tf

input_dim = 400  # placeholder: number of input features

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu', input_shape=(input_dim,)),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(10)  # Linear activation; outputs are logits
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
```

---

## Multi-label Classification

**Multi-class vs. Multi-label:**
- **Multi-class:** Each example is assigned exactly one label (even if there are many classes).
- **Multi-label:** Each example can have **multiple labels** simultaneously.

**Example:** In self-driving cars, an image may be labeled for:
- Presence of a car
- Presence of a bus
- Presence of a pedestrian
  
Here, the output might be a vector such as $[1, 0, 1]$, indicating "yes" for car and pedestrian, and "no" for bus.

### Neural Network Implementation for Multi-label Classification

**Approach 1:**  

Train separate binary classifiers for each label.

**Approach 2:**  

Use a single network with multiple outputs:
- **Output Layer:** One neuron per label.
- **Activation:** Use the sigmoid function for each output, as each label represents an independent binary decision.
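
A minimal NumPy sketch of how independent sigmoid outputs turn into a label vector is shown below. The logit values are invented, and the 0.5 decision threshold is an assumption, not something fixed by the method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical output-layer logits for (car, bus, pedestrian).
z = np.array([2.0, -1.5, 0.8])

p = sigmoid(z)                    # one independent probability per label
labels = (p >= 0.5).astype(int)   # assumed threshold of 0.5

print(p)        # unlike softmax outputs, these need not sum to 1
print(labels)   # [1 0 1]: car and pedestrian present, bus absent
```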
  
**TensorFlow Example:**
  
```python
import tensorflow as tf

input_dim = 400  # placeholder: number of input features

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(input_dim,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3, activation='sigmoid')  # 3 labels: car, bus, pedestrian
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',  # Suitable for multi-label classification
    metrics=['accuracy']
)
```
