In [2]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import display, Markdown, Latex
from sklearn.datasets import make_blobs
from matplotlib.widgets import Slider
from lab_utils_common import dlc
from lab_utils_softmax import plt_softmax
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

## Softmax Function
In both softmax regression and neural networks with softmax outputs, N outputs are generated and one output is selected as the predicted category. In both cases, a linear function produces a vector $\mathbf{z}$, which is then passed to the softmax function. The softmax function converts $\mathbf{z}$ into a probability distribution as described below. After softmax is applied, each output lies between 0 and 1 and the outputs sum to 1, so they can be interpreted as probabilities. Larger inputs correspond to larger output probabilities.

The softmax function can be written:
$$a_j = \frac{e^{z_j}}{ \sum_{k=1}^{N}{e^{z_k} }} \tag{1}$$
The output $\mathbf{a}$ is a vector of length N, so for softmax regression, you could also write:
\begin{align}
\mathbf{a}(x) =
\begin{bmatrix}
P(y = 1 | \mathbf{x}; \mathbf{w},b) \\
\vdots \\
P(y = N | \mathbf{x}; \mathbf{w},b)
\end{bmatrix}
=
\frac{1}{ \sum_{k=1}^{N}{e^{z_k} }}
\begin{bmatrix}
e^{z_1} \\
\vdots \\
e^{z_{N}} \\
\end{bmatrix} \tag{2}
\end{align}


In [4]:
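def my_softmax(z):
    """Return the softmax of a 1-D array z, following equation (1)."""
    ez = np.exp(z)        # element-wise exponential
    sm = ez / np.sum(ez)  # normalize so the outputs sum to 1
    return sm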

As you vary the values of the z's, there are a few things to note:
* The exponential in the numerator of the softmax magnifies small differences in the values.
* The output values sum to one.
* The softmax spans all of the outputs. A change in `z0`, for example, changes the values of `a0`-`a3`. Compare this to other activations such as ReLU or sigmoid, which have a single input and a single output.
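
The three observations above can be checked numerically. The following is a minimal sketch, assuming four made-up logits `z0`-`z3`; the helper `softmax` is a local stand-in written for this illustration, not part of the lab utilities.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array, following equation (1)."""
    ez = np.exp(z)
    return ez / np.sum(ez)

z = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical logits z0..z3
a = softmax(z)

# The outputs sum to one.
total = a.sum()

# Raising only z0 changes every output, not just a0.
z_shifted = z.copy()
z_shifted[0] += 2.0
a_shifted = softmax(z_shifted)

# The exponential magnifies differences: a gap of 1 between two logits
# becomes a ratio of e (about 2.718) between the corresponding outputs.
ratio = a[3] / a[2]

print(a, total)
print(a_shifted)
print(ratio)
```

Because every output shares the same denominator, increasing `z0` raises `a0` and lowers all the other outputs at once.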

### Softmax + Cross-Entropy Loss (Step-by-Step)

---

#### 1. Softmax Output
- The model outputs **logits**; passing them through softmax gives **probabilities**.
- Softmax formula:
\begin{equation}
a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}
\end{equation}
- That is, $a_j$ is the **probability the model assigns to class $j$**.

---

#### 2. Loss for a single example
- If the true label is $y = 3$ and the model output is $\mathbf{a} = [0.1, 0.2, 0.7]$:
\begin{equation}
\text{Loss} = -\log(a_{\text{true label}}) = -\log(0.7)
\end{equation}
- Only the **probability of the true label** is used; the other probabilities are ignored.
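
As a quick check of the single-example loss, here is a small sketch using the values from the example above. It assumes the natural logarithm and labels counted from 1, as in the text.

```python
import numpy as np

a = np.array([0.1, 0.2, 0.7])  # softmax output from the example above
y = 3                          # true label, counted from 1 as in the text

# Labels in the text start at 1, NumPy indices start at 0, hence y - 1.
loss = -np.log(a[y - 1])
print(round(loss, 4))          # 0.3567
```

The closer the true-label probability is to 1, the closer the loss is to 0.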

---

#### 3. Indicator Function
- **Indicator function**:
$$
\mathbf{1}\{y=j\} =
\begin{cases}
1 & \text{if the true label is } j \\
0 & \text{otherwise}
\end{cases}
$$
- With it, a single formula can **sum over all classes while picking out only the true class**.

---

#### 4. Cost Function (batch average)
- For a single example whose true label is $j$:
\begin{equation}
\text{Loss} = -\log \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}
\end{equation}
- **Average over a batch of $m$ examples:**
$$
J(\mathbf{w},b) = - \frac{1}{m}\sum_{i=1}^m \sum_{j=1}^N
\mathbf{1}\{y^{(i)} = j\}
\log \frac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}}
$$
- The **indicator** ensures that, for each example, only the term for its true label contributes to the loss.
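
The batch cost can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the one Keras uses: the function names and logits are made up, class indices are 0-based (the text counts from 1), and the max-subtraction step is an added numerical-stability measure that does not change the result.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax for an (m, N) array of logits."""
    Z = Z - Z.max(axis=1, keepdims=True)   # stability shift; softmax is unchanged
    eZ = np.exp(Z)
    return eZ / eZ.sum(axis=1, keepdims=True)

def cross_entropy_cost(Z, y):
    """Average loss J over m examples; y holds 0-based class indices."""
    m, N = Z.shape
    A = softmax_rows(Z)
    indicator = np.eye(N)[y]               # row i is the one-hot vector 1{y_i = j}
    return -np.sum(indicator * np.log(A)) / m

# Hypothetical logits for m = 2 examples and N = 3 classes.
Z = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
y = np.array([0, 2])

J = cross_entropy_cost(Z, y)

# Same value obtained by picking the true-class probability directly.
A = softmax_rows(Z)
J_direct = -np.mean(np.log(A[np.arange(len(y)), y]))
print(J, J_direct)
```

Multiplying by the indicator and indexing the true class directly give the same number, which is exactly what the double sum in the cost formula expresses.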

---

### **Numerical Example**

**Setup:**
- Classes (N) = 3 → (Dog, Cat, Horse)  
- Batch size (m) = 3 examples  
- Logits $\mathbf{z}$ (the raw model outputs) are passed through softmax to give the probabilities in Step 1.


---

#### Step 1: Softmax Outputs
- Example 1 → Softmax: **[0.66, 0.24, 0.10]**  
- Example 2 → Softmax: **[0.10, 0.66, 0.24]**  
- Example 3 → Softmax: **[0.24, 0.10, 0.66]**

---

#### Step 2: Loss for each example
$$
L_i = -\log(a_{\text{true label}})
$$
- Example 1 → true label = 1 → $-\log(0.66)\approx 0.416$  
- Example 2 → true label = 2 → $-\log(0.66)\approx 0.416$  
- Example 3 → true label = 3 → $-\log(0.66)\approx 0.416$

---

#### Step 3: Cost (average)
$$
J = \frac{1}{m}\sum_{i=1}^{m}L_i
  \approx \frac{0.416+0.416+0.416}{3}
  = 0.416
$$
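
The numerical example can be reproduced with a few lines of NumPy. This sketch starts from the softmax outputs listed in Step 1 and assumes the natural logarithm; the variable names are chosen for this illustration.

```python
import numpy as np

# Softmax outputs from Step 1, one row per example.
A = np.array([[0.66, 0.24, 0.10],
              [0.10, 0.66, 0.24],
              [0.24, 0.10, 0.66]])
y = np.array([1, 2, 3])              # true labels, counted from 1 as in the text

# Pick the true-label probability in each row (y - 1 converts to 0-based indices).
losses = -np.log(A[np.arange(len(y)), y - 1])
J = losses.mean()
print(np.round(losses, 4), round(J, 4))
```

Each per-example loss comes out at about 0.4155, so the average is the same value.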

---

#### Role of the indicator
- Each example has a **one-hot** indicator:
  - Example 1 → [1,0,0]
  - Example 2 → [0,1,0]
  - Example 3 → [0,0,1]
- In the formula:
$$
\sum_j \mathbf{1}\{y=j\} \log(a_j)
\to \text{only the true label's term is picked.}
$$

---

**Bottom line:**  
- Compute probabilities with softmax  
- Take the probability of the true label  
- Average $-\log(\text{true prob})$ over the examples


In [6]:
# make a 4-class dataset for the example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

In [9]:
model = Sequential([
    Dense(25, activation="relu"),
    Dense(15, activation="relu"),
    Dense(4, activation="linear"),
])

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = tf.keras.optimizers.Adam(0.001)
)

model.fit(X_train, y_train, epochs=10)

Epoch 1/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 1.2010
Epoch 2/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.5391
Epoch 3/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.2577
Epoch 4/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1435
Epoch 5/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0918
Epoch 6/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0755
Epoch 7/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0611
Epoch 8/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0501
Epoch 9/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0421
Epoch 10/10
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0341


<keras.src.callbacks.history.History at 0x1958dd257e0>

In [10]:
output = model.predict(X_train)
print(np.max(output), np.min(output))

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
16.596172 -7.2160234


In [11]:
sm_output = tf.nn.softmax(output).numpy()
print(np.max(sm_output), np.min(sm_output))

0.99999714 2.86837e-10


In [12]:
for i in range(5):
    print(sm_output[i])

[0.   0.   0.98 0.02]
[9.94e-01 5.92e-03 7.62e-05 6.85e-06]
[9.66e-01 3.36e-02 7.19e-04 1.20e-04]
[1.18e-02 9.79e-01 5.20e-04 8.25e-03]
[2.48e-03 9.25e-05 9.97e-01 6.35e-05]


### SparseCategoricalCrossentropy vs CategoricalCrossentropy

---

#### 1. SparseCategoricalCrossentropy
- **Target format:** Just the class index as an integer.  
- **Example:** 10 classes (0–9)  
  - If true label is class 2 → `y = 2`  
- **Why?** You don't need to build one-hot vectors yourself; the loss works with the integer index directly.

---

#### 2. CategoricalCrossentropy
- **Target format:** One-hot vector.  
- **Example:** 10 classes (0–9)  
  - If true label is class 2 → `y = [0,0,1,0,0,0,0,0,0,0]`  
- **Why?** Use when your labels are already one-hot encoded.

---

### Visual difference

| Label Type                  | Example (true class = 2)     |
|----------------------------|-----------------------------|
| Sparse (integer index)     | `2`                         |
| One-hot vector (length=10) | `[0,0,1,0,0,0,0,0,0,0]`     |



### When to Use Which?

---

#### Use **SparseCategoricalCrossentropy** when:
- Your labels are **integers** (e.g., `[2, 0, 5, 3]`).
- Your dataset **does not** already have one-hot encoded labels.
- **Memory efficient** (no need to store full one-hot vectors).

---

#### Use **CategoricalCrossentropy** when:
- Your labels are **already one-hot encoded** (e.g., `[[0,0,1,0], [1,0,0,0], ...]`).
- You are **manually handling one-hot encoding**, or your data pipeline already produces one-hot targets.

---

### Bottom Line:
- **SparseCategoricalCrossentropy → Integer labels**  
- **CategoricalCrossentropy → One-hot labels**  
- **The loss value is the same** for equivalent labels; only the label format differs.
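
The equivalence of the two label formats can be illustrated with plain NumPy. This is a sketch of the underlying arithmetic, not the Keras implementation; the probabilities and labels are made up for the illustration.

```python
import numpy as np

# Hypothetical softmax outputs for 2 examples and 4 classes.
probs = np.array([[0.1, 0.2, 0.6, 0.1],
                  [0.7, 0.1, 0.1, 0.1]])

y_sparse = np.array([2, 0])        # integer class indices
y_onehot = np.eye(4)[y_sparse]     # the same labels, one-hot encoded

# Sparse form: index the true-class probability directly.
loss_sparse = -np.mean(np.log(probs[np.arange(len(y_sparse)), y_sparse]))

# One-hot form: mask with the one-hot vector and sum over classes.
loss_onehot = -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

print(y_onehot)
print(loss_sparse, loss_onehot)
```

Both computations select the same true-class probabilities, so the two losses agree; the sparse form simply avoids storing the one-hot vectors.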

