# How to solve Multi-Class Classification Problems

# Objective

This tutorial covers **multi-class classification** problems in **deep learning** with **TensorFlow & Keras**.

#  Import Dependencies

In [None]:
#@title Import Dependencies
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds
from tensorflow import keras
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [None]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

First, we will download the MNIST dataset.

In multi-class classification problems, we have **two options** to **encode** the true **labels**  by using either:

* integer numbers, or
* one-hot vector

We will experiment with both encodings to observe how different combinations of last-layer activation functions and loss functions affect the performance of a Keras CNN model.


In both experiments, we will discuss the relationship between
 ***activation & loss functions***, ***label encodings***, and **accuracy metrics** in detail.

We will understand why we can sometimes get **surprising results** when using parameter settings ***different*** from the generally **recommended** ones.

As a result, we will **gain insight** into activation and loss functions and their interactions.

If you are ready, let's get started!

---
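# Load a Multi-Class Dataset
I picked MNIST, a well-known multi-class dataset of handwritten digits.

First, let's load the MNIST dataset from [TensorFlow Datasets](https://www.tensorflow.org/datasets). To keep training short, only the first 10% of the train and test splits are loaded.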

In [None]:
[ds_raw_train, ds_raw_test], info = tfds.load('mnist', 
                                              split=['train[:10%]','test[:10%]'], 
                                              as_supervised=True, 
                                              with_info=True)

Check the number of samples in Train and Test datasets

In [None]:
print("Number of samples in train : ", ds_raw_train.cardinality().numpy(),
      " in test : ",ds_raw_test.cardinality().numpy())

Observe the information about labels

In [None]:
print("Number of classes/labels: ",info.features["label"].num_classes)
print("Names of classes/labels: ",info.features["label"].names)

labels= info.features["label"].names

See some sample images (data) with their true labels

In [None]:
def show_samples(dataset):
  """Plot a 3x3 grid of images, titled with their shape and label."""
  columns = 3
  rows = 3
  fig = plt.figure(figsize=(16, 16))

  print(columns*rows, "samples from the dataset")
  for i, (image, label) in enumerate(dataset.take(columns*rows), start=1):
    fig.add_subplot(rows, columns, i)
    # Drop the singleton channel axis of grayscale images for imshow.
    plt.imshow(np.squeeze(image))
    plt.title("image shape:" + str(image.shape) + " (" + str(label.numpy()) + ")")
  plt.show()

show_samples(ds_raw_test)

**Notice that:**
* There are **10 classes**
* Each sample has a **single integer label** that identifies its class



### Let's resize and scale the images so that we can save time in training

In [None]:
#VGG16 expects min 32 x 32 
def resize_scale_image(image, label):
  image = tf.image.resize(image, [32, 32])
  image = image/255.0
  image = tf.image.grayscale_to_rgb(image)
  return image, label

In [None]:
ds_train_resize_scale=ds_raw_train.map(resize_scale_image)
ds_test_resize_scale=ds_raw_test.map(resize_scale_image)

show_samples(ds_test_resize_scale)

### Prepare the data pipeline by setting the batch size and enabling caching & prefetching using [tf.data](https://www.tensorflow.org/guide/data)

In [None]:
batch_size = 64 

ds_train_resize_scale_batched=ds_train_resize_scale.batch(batch_size, drop_remainder=True ).cache().prefetch(tf.data.AUTOTUNE)
ds_test_resize_scale_batched=ds_test_resize_scale.batch(batch_size, drop_remainder=True ).cache().prefetch(tf.data.AUTOTUNE)

print("Number of batches in train: ", ds_train_resize_scale_batched.cardinality().numpy())
print("Number of batches in test: ", ds_test_resize_scale_batched.cardinality().numpy())


### To train fast, let's use Transfer Learning by importing VGG16

In [None]:
base_model = keras.applications.VGG16(
    weights='imagenet',  # Load weights pre-trained on ImageNet.
    input_shape=(32, 32, 3), # VGG16 expects min 32 x 32
    include_top=False)  # Do not include the ImageNet classifier at the top.
base_model.trainable = False

# 1. True (Actual) Labels are encoded with a **single integer number** 

### Create the classification model


In [None]:
number_of_classes = 10

In [None]:
inputs = keras.Input(shape=(32, 32, 3))
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
initializer = tf.keras.initializers.GlorotUniform(seed=42)

activation =  None  # tf.keras.activations.sigmoid or softmax

outputs = keras.layers.Dense(number_of_classes,
                             kernel_initializer=initializer,
                             activation=activation)(x) 
model = keras.Model(inputs, outputs)

**Pay attention**:
* The last layer has 10 (***number_of_classes***) units. So the output (***y_pred***) will be **10 floating-point numbers**, whereas the true (actual) label (***y_true***) is **a single integer number**!

* For the last layer, the activation function can be:
  * None 
  * sigmoid 
  * softmax
* When **no activation** function is used in the model's last layer, we need to set `from_logits=True` **in the cross-entropy loss function**. The loss then applies the missing transformation itself: a **softmax** over the logits for `SparseCategoricalCrossentropy` and `CategoricalCrossentropy` (a **sigmoid** is applied only by `BinaryCrossentropy`). Conceptually:

  `if from_logits: return nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=output)`
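
To make the effect of `from_logits=True` concrete, here is a minimal NumPy sketch of a sparse categorical cross-entropy computed directly from logits. The helper name `sparse_ce_from_logits` and the toy values are assumptions for illustration only; this is not the Keras implementation, just the same formula written out.

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    """Sparse categorical cross-entropy computed directly from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum before exponentiating to avoid overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class for every sample.
    rows = np.arange(len(labels))
    return -log_softmax[rows, np.asarray(labels)]

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])   # toy last-layer outputs (no activation)
labels = [0, 1]                        # integer-encoded true classes
losses = sparse_ce_from_logits(labels, logits)
print(losses)
```

The loss for each sample is simply the negative log of the softmax probability assigned to its true class.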


### Compile the model

In [None]:
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True), # default from_logits=False
              metrics=[keras.metrics.SparseCategoricalAccuracy()])

**IMPORTANT:** We need to use **keras.metrics.SparseCategoricalAccuracy()** for measuring the accuracy, since it calculates how often predictions match **integer labels**.


* Keras does ***not*** define a ***single*** accuracy metric, but ***several*** different ones, among them `accuracy`, `binary_accuracy`, `categorical_accuracy`, and `sparse_categorical_accuracy`.
* If you do ***not*** specify a particular accuracy metric and just write

 `metrics=['accuracy']`

  Keras **infers** which accuracy to report from the loss function and the output shape. For example, if you ***mistakenly*** select categorical cross-entropy as the loss function in a **binary classification** problem, Keras reports **categorical_accuracy**, while you are actually interested in **binary_accuracy**.

In summary:
* to get `model.fit()` and `model.evaluate()` to report correct numbers (without mixing up the loss function and the classification problem at hand), we need to **specify the actual accuracy metric**!
* if the true (actual) labels are encoded with integer numbers, you need to use **keras.metrics.SparseCategoricalAccuracy()** for measuring the accuracy, since it calculates how often predictions match **integer labels**.
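
What "how often predictions match integer labels" means can be sketched in a few lines of NumPy. The function below is a hypothetical stand-in written for illustration, not the Keras metric itself: it takes the index of the largest output as the predicted class and compares it with the integer label.

```python
import numpy as np

def sparse_categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose highest-scoring class equals the integer label."""
    y_true = np.asarray(y_true).reshape(-1)      # (batch, 1) -> (batch,)
    predicted = np.argmax(y_pred, axis=-1)       # index of the largest output
    return float(np.mean(predicted == y_true))

y_true = [[2], [0], [1]]                         # integer labels, one per sample
y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4]])
acc = sparse_categorical_accuracy(y_true, y_pred)
print(acc)
```

Only the position of the largest output matters here, so the same function works on logits, sigmoid outputs, or softmax outputs.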

### Try & See
Now, we can try and see the performance of the model with **different combinations of activation and loss functions.**

In [None]:
model.fit(ds_train_resize_scale_batched, validation_data=ds_test_resize_scale_batched, epochs=40)

In [None]:
ds= ds_test_resize_scale
print("Test Accuracy: ", model.evaluate(ds.batch(batch_size=10))[1])
predictions= model.predict(ds.batch(batch_size=10).take(1))
y=[]
print("10 Sample predictions:")
for (pred,(a,b)) in zip(predictions,ds.take(10)):
  print("predicted: " , np.argmax(pred), "Actual Label: "+labels[b.numpy()]+" ("+str(b.numpy()) +")", " True" if (np.argmax(pred)==b.numpy()) else " False" )
  y.append(b.numpy())

---
## Obtained Results*:

| Activation | Loss | Accuracy |
| :- | -: | :-: |
| softmax | BinaryCrossentropy() | **ValueError: logits and labels must have the same shape ((64, 10) vs (64, 1))** |
| sigmoid | BinaryCrossentropy() | **ValueError: logits and labels must have the same shape ((64, 10) vs (64, 1))** |
| None | BinaryCrossentropy(from_logits=True) | **ValueError: logits and labels must have the same shape ((64, 10) vs (64, 1))** |
| None | CategoricalCrossentropy(from_logits=True) | **ValueError: Shapes (64, 1) and (64, 10) are incompatible** |
| sigmoid | CategoricalCrossentropy() | **ValueError: Shapes (64, 1) and (64, 10) are incompatible** |
| softmax | CategoricalCrossentropy() | **ValueError: Shapes (64, 1) and (64, 10) are incompatible** |
| **softmax** | **SparseCategoricalCrossentropy()** | **0.9440** |
| **sigmoid** | **SparseCategoricalCrossentropy()** | **0.9440** |
| **None** | **SparseCategoricalCrossentropy(from_logits=True)** | **0.9440** |


   *When you run this notebook, you will most probably not get exactly these numbers; you should observe very similar values, because training ANNs is stochastic.*

---
### **Why do BinaryCrossentropy & CategoricalCrossentropy loss functions generate errors?**




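Because each true label is **a single integer** value, so a batch of labels has shape (64, 1).

However, the last layer outputs **a vector of size 10** (number_of_classes) per sample, so a batch of predictions has shape (64, 10).

`BinaryCrossentropy` and `CategoricalCrossentropy` expect the labels and the predictions to have the same shape. Therefore, these loss functions can ***NOT compare a single integer with a vector!***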
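
The shape mismatch reported in the error messages can be reproduced without training anything. The NumPy sketch below uses random stand-in data (an assumption for illustration) to show the two shapes, and how one-hot encoding the integer labels would bring them in line with the predictions.

```python
import numpy as np

batch_size, number_of_classes = 64, 10
rng = np.random.default_rng(0)

# Integer labels: one class index per sample.
y_true_int = rng.integers(0, number_of_classes, size=(batch_size, 1))
# Stand-in for the model output: one score per class per sample.
y_pred = rng.random((batch_size, number_of_classes))

print("labels:", y_true_int.shape, "predictions:", y_pred.shape)

# One-hot encoding turns each integer into a vector of length number_of_classes,
# so the labels get the same shape as the predictions.
y_true_one_hot = np.eye(number_of_classes)[y_true_int.reshape(-1)]
print("one-hot labels:", y_true_one_hot.shape)
```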

---
### **Why do softmax & sigmoid activation functions with SparseCategoricalCrossentropy loss lead to the same accuracy?**



* Generally, we use **softmax activation** instead of **sigmoid** with the **cross-entropy loss** because softmax distributes the probability mass across all output nodes (classes), so the outputs sum to 1.
* For **multi-class classification**, **softmax** is recommended over **sigmoid**.
* The practical reason is that
  * **softmax** is specially designed for **multi-class** classification tasks.
  * **Sigmoid** is equivalent to a 2-element **softmax** where the second element is assumed to be zero. Therefore, **sigmoid** is mostly used for **binary classification** and **multi-label classification**.
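
The equivalence between sigmoid and a 2-element softmax can be checked numerically. This is a small NumPy sketch with hand-written `sigmoid` and `softmax` helpers (written here for illustration, not taken from Keras) and an arbitrary logit value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Shift by the maximum for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = 1.7                                  # arbitrary logit
two_class = softmax(np.array([z, 0.0]))  # second logit fixed at zero

print("sigmoid(z)          :", sigmoid(z))
print("softmax([z, 0])[0]  :", two_class[0])
```

Both lines print the same value, because exp(z) / (exp(z) + exp(0)) simplifies to 1 / (1 + exp(-z)).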

Let's see a simple example:

In [None]:
# Assume last layer output is as:
y_pred_logit = tf.constant([[-20, -1.0, 4.5, 12.5, 74, 43.2, -58.4, 8.2, 99.9, -101]], dtype = tf.float32)
print("y_pred_logit:\n", y_pred_logit.numpy())

# and last layer activation function is softmax:
y_pred_softmax = tf.keras.activations.softmax(y_pred_logit)
print("\nsoftmax(y_pred) :\n", y_pred_softmax.numpy())

# and last layer activation function is sigmoid:
y_pred_sigmoid = tf.keras.activations.sigmoid(y_pred_logit)
print("\nsigmoid(y_pred) :\n", y_pred_sigmoid.numpy())



As seen above, when the last layer generates some logits, ***sigmoid and softmax functions produce different results***.

However, when we apply the **sparse_categorical_crossentropy** loss function to their results, ***a loss can be computed in both cases***:

In [None]:
y_true=[[5]]
y_pred = y_pred_sigmoid
print("\ny_true {} \n\ny_pred by sigmoid {}\n".format(y_true, y_pred))
print("sparse_categorical_crossentropy loss: ", tf.keras.losses.sparse_categorical_crossentropy
      (y_true, y_pred).numpy())


y_pred = y_pred_softmax
print("\ny_true {} \n\ny_pred by softmax {}\n".format(y_true, y_pred))
print("sparse_categorical_crossentropy loss: ", tf.keras.losses.sparse_categorical_crossentropy
      (y_true, y_pred).numpy())

Notice that sigmoid and softmax convert the logits differently, so the loss values themselves need not be identical. When `from_logits=False`, the loss does not require its input to sum to 1, so it accepts sigmoid outputs as well.

Both functions are increasing, so the class with the largest logit also receives the largest output under either activation. That is the reason why softmax & sigmoid activation functions with SparseCategoricalCrossentropy loss can lead to the same accuracy.
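
The point that the predicted class does not depend on the choice between sigmoid and softmax can be illustrated with NumPy. The logits below are made-up, moderate values chosen for this sketch so that no sigmoid output saturates to exactly 1.0.

```python
import numpy as np

# Hypothetical last-layer outputs for one sample (10 classes).
logits = np.array([-2.0, -1.0, 0.5, 1.5, 3.0, 2.2, -0.4, 0.2, 4.1, -3.0])

sigmoid_out = 1.0 / (1.0 + np.exp(-logits))
exp_shifted = np.exp(logits - logits.max())
softmax_out = exp_shifted / exp_shifted.sum()

print("sum of sigmoid outputs:", sigmoid_out.sum())   # not 1 in general
print("sum of softmax outputs:", softmax_out.sum())   # 1 up to rounding
print("predicted class (sigmoid):", np.argmax(sigmoid_out))
print("predicted class (softmax):", np.argmax(softmax_out))
```

The individual values and their sums differ, but `argmax` selects the same class, and the accuracy metric depends only on that class.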

---
### **Why does the SparseCategoricalCrossentropy loss function with from_logits=True lead to good accuracy without any activation function?**

Because ***from_logits=True*** tells a cross-entropy loss function to apply its own transformation to the inputs. For the sparse categorical (and categorical) loss this is a **softmax**; a sigmoid is applied only by the binary cross-entropy loss. Conceptually:

```Python
if from_logits:
    return nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=output)
```
[In Keras documentation](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy): "***Note - Using ```from_logits=True``` may be more numerically stable.***"



A simple example:

In [None]:
# Assume last layer output is as:
y_pred_logit = tf.constant([[-20, -1.0, 4.5, 12.5, 74, 43.2, -58.4, 8.2, 99.9, -101]], dtype = tf.float32)
print("y_pred_logit:\n", y_pred_logit.numpy())

# and last layer activation function is sigmoid:
y_pred_sigmoid = tf.keras.activations.sigmoid(y_pred_logit)

y_true=[[5]]
y_pred = y_pred_sigmoid
print("\ny_true {} \n\ny_pred by sigmoid {}\n".format(y_true, y_pred))
print("sparse categorical_crossentropy loss: ", tf.keras.losses.sparse_categorical_crossentropy
      (y_true, y_pred).numpy())

print("\ny_true {} \n\ny_pred by None activation {}\n".format(y_true, y_pred_logit))
print("sparse_categorical_crossentropy (from_logits=True) loss: ", tf.keras.losses.sparse_categorical_crossentropy
      (y_true, y_pred_logit,from_logits=True).numpy())

Notice that if we **do not apply any activation** function at the last layer, we need to ***inform*** the cross-entropy loss function by setting the parameter ***from_logits=True***, so that the loss function **applies the transformation** (a softmax in the sparse categorical case) to the given **logits** by itself!
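
The numerical-stability remark can be made tangible with the same logits as in the example above. The NumPy sketch below compares two ways of computing the loss for class 5: directly from the logits with the log-sum-exp trick, and from softmax probabilities that are clipped before taking the logarithm. The clipping constant `eps = 1e-7` is an assumption chosen to mirror a typical small epsilon; the exact internals of Keras are not reproduced here.

```python
import numpy as np

logits = np.array([-20, -1.0, 4.5, 12.5, 74, 43.2, -58.4, 8.2, 99.9, -101])
y_true = 5
eps = 1e-7   # assumed clipping constant for the probability path

# Path 1: work on the logits directly (log-sum-exp trick).
m = logits.max()
log_sum_exp = m + np.log(np.exp(logits - m).sum())
loss_from_logits = log_sum_exp - logits[y_true]

# Path 2: convert to probabilities first, clip, then take the log.
probs = np.exp(logits - m) / np.exp(logits - m).sum()
loss_from_probs = -np.log(np.clip(probs[y_true], eps, 1 - eps))

print("loss computed from logits       :", loss_from_logits)
print("loss computed from probabilities:", loss_from_probs)
```

The probability of class 5 is so small that clipping replaces it with `eps`, which caps the loss at about 16.1, while the logit-based computation keeps the full value of about 56.7.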

---
## **In summary:**

We can **conclude** that, if the task is **multi-class classification** and the true (actual) labels are encoded as a **single integer number**, we have 2 options:
  * Option 1: 
```Python 
  activation = sigmoid or softmax

  loss = SparseCategoricalCrossentropy()
  
  accuracy metric = SparseCategoricalAccuracy()
```
  * Option 2: 
```Python 
  activation = None
  
  loss = SparseCategoricalCrossentropy(from_logits=True)
  
  accuracy metric = SparseCategoricalAccuracy()
```

# 2. True (Actual) Labels are one-hot encoded 

In multi-class classification problems, we can also use **one-hot encoding** for the **target (y_true)** values. 
Now, let's see **which activation, loss, and accuracy** functions we need to select when the true classes are one-hot encoded.

### First convert the true (actual) label encoding to one-hot

In [None]:
def one_hot(image, label):
  label = tf.one_hot(label, depth=number_of_classes)
  return image, label

In [None]:
ds_train_resize_scale_one_hot= ds_train_resize_scale.map(one_hot)
ds_test_resize_scale_one_hot= ds_test_resize_scale.map(one_hot)
show_samples(ds_test_resize_scale_one_hot)

**Notice that:**
* There are **10 labels / classes** 
* Labels are now **one-hot encoded** 


### Prepare the data pipeline by setting batch size

In [None]:
ds_train_resize_scale_one_hot_batched=ds_train_resize_scale_one_hot.batch(64)
ds_test_resize_scale_one_hot_batched=ds_test_resize_scale_one_hot.batch(64)

### Create the classification model


In [None]:
inputs = keras.Input(shape=(32, 32, 3))
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)

initializer = tf.keras.initializers.GlorotUniform(seed=42)
activation = tf.keras.activations.softmax # None  #  tf.keras.activations.sigmoid or softmax

outputs = keras.layers.Dense(number_of_classes,
                             kernel_initializer=initializer,
                             activation=activation)(x) 
 
model = keras.Model(inputs, outputs)

**Pay attention**:
* The last layer has **number_of_classes (10) units**. Thus the output has the same shape as the **one-hot** encoded true (actual) label.

* For the last layer, the activation function can be:
  * None 
  * sigmoid 
  * softmax
* When **no activation** function is used, we need to set `from_logits=True` **in the cross-entropy loss function**, as in the first experiment.

### Compile the model

In [None]:
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.CategoricalCrossentropy(), # default from_logits=False
              metrics=[keras.metrics.CategoricalAccuracy()])

**IMPORTANT:** We need to use **keras.metrics.CategoricalAccuracy()** for measuring the accuracy, since it calculates how often predictions match **one-hot labels**. **DO NOT USE** just `metrics=['accuracy']` as a performance metric, as explained [in Part A!](https://kmkarakaya.medium.com/how-to-solve-classification-problems-in-deep-learning-with-tensorflow-keras-6e39c5b09501)  



---
### Try & See
You can try and see the performance of the model with **different combinations of activation and loss functions.**


Each epoch takes almost 15 seconds on a Colab TPU accelerator.

In [None]:
model.fit(ds_train_resize_scale_one_hot_batched, validation_data=ds_test_resize_scale_one_hot_batched, epochs=20)

In [None]:
ds= ds_test_resize_scale_one_hot
print("Test Accuracy: ", model.evaluate(ds.batch(batch_size=10))[1])
print("10 Sample predictions ")
predictions= model.predict(ds.batch(batch_size=10).take(1))
y=[]
for (pred,(a,b)) in zip(predictions,ds.take(10)):
  print("predicted: " , (pred), "Actual Label: "+str(b.numpy()) , " True" if (np.argmax(pred)==np.argmax(b.numpy())) else " False" )
  print()
  y.append(b.numpy())

---
## Obtained Results*:


| Activation | Loss | Accuracy |
| :- | -: | :-: |
| softmax | BinaryCrossentropy() | 0.9060 |
| sigmoid | BinaryCrossentropy() | 0.9060 |
| None | BinaryCrossentropy(from_logits=True) | 0.9060 |
| **softmax** | **CategoricalCrossentropy()** | **0.9300** |
| **sigmoid** | **CategoricalCrossentropy()** | **0.9300** |
| **None** | **CategoricalCrossentropy(from_logits=True)** | **0.9300** |
| softmax | SparseCategoricalCrossentropy() | InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [64,10] and labels shape [640] |
| sigmoid | SparseCategoricalCrossentropy() | InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [64,10] and labels shape [640] |
| None | SparseCategoricalCrossentropy(from_logits=True) | InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [64,10] and labels shape [640] |

*When you run this notebook, you will most probably not get exactly these numbers; you should observe very similar values, because training ANNs is stochastic.*


---
### **Why do SparseCategoricalCrossentropy loss functions generate errors?**



Because the true labels are now **one-hot encoded**: each label is a vector of size 10 (number_of_classes), just like the last layer's output.

However, the `SparseCategoricalCrossentropy` loss function expects **one integer class index per sample** for the true labels. Thus, it can ***NOT work with one-hot vectors!***
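
The `[640]` in the error message is consistent with a batch of 64 one-hot labels of length 10 being flattened into 64 × 10 = 640 entries, although the exact internal reshaping is an assumption here. The NumPy sketch below, with made-up labels, shows that flattening on a smaller batch and how `argmax` recovers the integer labels that a sparse loss expects.

```python
import numpy as np

number_of_classes = 10
int_labels = np.array([3, 0, 9, 4])                      # made-up integer labels
one_hot_labels = np.eye(number_of_classes)[int_labels]   # shape (4, 10)

# Flattening one-hot labels yields batch_size * number_of_classes entries.
flattened = one_hot_labels.reshape(-1)
print("flattened one-hot labels:", flattened.shape)

# argmax along the class axis converts one-hot vectors back to integers.
recovered = np.argmax(one_hot_labels, axis=-1)
print("recovered integer labels:", recovered)
```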

---
### **Why do Binary and Categorical cross-entropy loss functions lead to similar accuracy with one-hot labels?**

When the true labels are **one-hot** encoded, both loss functions can be applied to them. Categorical cross-entropy looks only at the predicted probability of the true class, whereas binary cross-entropy treats each of the 10 outputs as an independent yes/no decision and averages the 10 losses.
Both losses get smaller as the prediction approaches the one-hot target, so both ***might*** drive training toward similar models, and the resulting trained models would have similar accuracy as seen above.

A simple example:


In [None]:
y_true=[[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]

# Assume last layer output is as:
y_pred_logit = tf.constant([[-20, -1.0, 4.5, 12.5, 74, 43.2, -58.4, 8.2, 99.9, -101]], dtype = tf.float32)
print("y_pred_logit:\n", y_pred_logit.numpy())

# and last layer activation function is softmax:
y_pred_softmax = tf.keras.activations.softmax(y_pred_logit)
print("\nsoftmax(y_pred) :\n", y_pred_softmax.numpy())

print("\ncategorical_crossentropy loss: ", tf.keras.losses.categorical_crossentropy
      (y_true, y_pred_softmax).numpy())
print("\nbinary_crossentropy loss: ", tf.keras.losses.binary_crossentropy
      (y_true, y_pred_softmax).numpy())
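
The two losses can also be written out by hand to see how each one treats a one-hot target. The following NumPy sketch uses hypothetical helper functions and toy probabilities with three classes; the clipping constant `eps` is an assumption to keep the logarithm finite.

```python
import numpy as np

eps = 1e-7  # assumed clipping constant

def categorical_ce(y_true, y_pred):
    """Uses only the predicted probability of the true class."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

def binary_ce(y_true, y_pred):
    """Treats every output as its own yes/no problem and averages the losses."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_class = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_class.mean(axis=-1)

y_true = np.array([[0.0, 0.0, 1.0]])
good = np.array([[0.05, 0.05, 0.90]])   # confident and correct
bad = np.array([[0.60, 0.30, 0.10]])    # confident and wrong

for name, pred in [("good", good), ("bad", bad)]:
    print(name, "categorical:", categorical_ce(y_true, pred)[0],
          "binary:", binary_ce(y_true, pred)[0])
```

The two losses report different numbers, but both are small for the good prediction and large for the bad one, so both reward the same behaviour.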

---
### **Why do Sigmoid and Softmax activation with Categorical cross-entropy loss function lead to the same accuracy?**




* Sigmoid transforms each of the 10 numbers from the last layer into a floating-point number between 0.0 and 1.0, but the sum of these 10 numbers **does not necessarily equal 1** (they do not form a probability distribution).
* On the other hand, softmax transforms the 10 numbers from the last layer into floating-point numbers between 0.0 and 1.0 **and** the sum of these 10 numbers **equals 1.0**.
* Normally, the Categorical cross-entropy loss function expects a probability distribution as its input (when `from_logits=False`, the default).

* Even so, the Categorical cross-entropy loss function can consume sigmoid outputs, because it rescales its input to sum to 1 before computing the loss.

A simple example:

In [None]:
y_true=[[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]

# Assume last layer output is as:
y_pred_logit = tf.constant([[-20, -1.0, 4.5, 12.5, 74, 43.2, -58.4, 8.2, 99.9, -101]], dtype = tf.float32)
print("y_pred_logit:\n", y_pred_logit.numpy())

# and last layer activation function is softmax:
y_pred_softmax = tf.keras.activations.softmax(y_pred_logit)
print("\nsoftmax(y_pred_logit) :\n", y_pred_softmax.numpy())
print("categorical_crossentropy(y_true, y_pred_softmax) loss:\n", 
      tf.keras.losses.categorical_crossentropy
      (y_true, y_pred_softmax).numpy())

# and last layer activation function is sigmoid:
y_pred_sigmoid = tf.keras.activations.sigmoid(y_pred_logit)
print("\nsigmoid(y_pred_logit) :\n", y_pred_sigmoid.numpy())
print("categorical_crossentropy(y_true, y_pred_sigmoid) loss:\n", 
      tf.keras.losses.categorical_crossentropy
      (y_true, y_pred_sigmoid).numpy())

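As seen above, when the last layer generates some logits, ***sigmoid and softmax functions produce different results***.

The **categorical_crossentropy** loss function rescales its input so that each row sums to 1 before taking the logarithm, so ***a loss can be computed from either output***, although the two loss values need not be identical. Since both activations are increasing functions, the class with the largest logit also gets the largest output, and the accuracy is not affected by the choice.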
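
How a categorical cross-entropy can accept outputs that do not sum to 1 is sketched below in NumPy. The helper rescales each row before taking the log, which is an assumption about the behaviour with `from_logits=False` made for this illustration; the logits are made-up values.

```python
import numpy as np

def categorical_ce(y_true, y_pred, eps=1e-7):
    # Rescale every row to sum to 1, clip, then take the log of the true class.
    y_pred = y_pred / y_pred.sum(axis=-1, keepdims=True)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

logits = np.array([[1.0, 3.0, 0.5]])     # hypothetical last-layer outputs
y_true = np.array([[0.0, 1.0, 0.0]])     # one-hot label for class 1

sigmoid_out = 1.0 / (1.0 + np.exp(-logits))
softmax_out = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

loss_sigmoid = categorical_ce(y_true, sigmoid_out)[0]
loss_softmax = categorical_ce(y_true, softmax_out)[0]
print("sum of sigmoid outputs:", sigmoid_out.sum(), "loss:", loss_sigmoid)
print("sum of softmax outputs:", softmax_out.sum(), "loss:", loss_softmax)
```

Both calls return a finite loss, and in both cases class 1 has the largest output, so the prediction counted by the accuracy metric is the same.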

---
### Summary

According to the above experiment results, if the task is **multi-class classification** and the true (actual) labels are **one-hot** encoded, we have 2 options:
* Option A
```Python 
  activation = None
  loss = CategoricalCrossentropy(from_logits=True)
  accuracy metric = CategoricalAccuracy()
```
* Option B
```Python 
  activation = sigmoid or softmax
  loss = CategoricalCrossentropy()
  accuracy metric = CategoricalAccuracy()
```






# Multi-Class Classification Summary

In a nutshell, in a **multi-class** classification problem:
* We can use **integer numbers** or **one-hot encoding** to encode the **true** classes / labels 
* ***The correct accuracy metric*** depends on **the selected true label encoding**
* The last layer's activation function can be **Sigmoid, Softmax or None**
* ***The correct loss function*** should be chosen according to **the selected true label encoding**

The summary of the experiments is below:

![image.png](attachment:57c8d4d0-3cc1-49d1-8c7d-6e8c398e094f.png)
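
The decision rules from the two experiments can be condensed into a small helper. The function `suggest_loss_and_metric` below is purely hypothetical and written for illustration; it only returns the names of the Keras classes discussed in this tutorial as strings, based on the shape of the labels and on whether the last layer has an activation.

```python
import numpy as np

def suggest_loss_and_metric(y_true, last_layer_has_activation):
    """Return (loss, metric) names matching the label encoding."""
    y_true = np.asarray(y_true)
    # One-hot labels have one column per class; integer labels have a single value.
    is_one_hot = y_true.ndim == 2 and y_true.shape[-1] > 1
    prefix = "" if is_one_hot else "Sparse"
    # Without an activation the model outputs logits.
    from_logits = not last_layer_has_activation
    loss = f"{prefix}CategoricalCrossentropy(from_logits={from_logits})"
    metric = f"{prefix}CategoricalAccuracy()"
    return loss, metric

print(suggest_loss_and_metric([[5]], last_layer_has_activation=False))
print(suggest_loss_and_metric([[0, 0, 1, 0]], last_layer_has_activation=True))
```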

**Acknowledgement:** 
All credit goes to [Professor Murat Karakaya](https://www.muratkarakaya.net) for his excellent contributions in this series of deep learning tutorials.


---
# References

- [Keras API reference / Losses / Probabilistic losses](https://keras.io/api/losses/probabilistic_losses/)

- [Keras Activation Functions](https://keras.io/api/layers/activations/)

- [Tensorflow Data pipeline (tf.data) guide](https://www.tensorflow.org/guide/data#using_tfdata_with_tfkeras)

- [How does tensorflow sparsecategoricalcrossentropy work?](https://stackoverflow.com/questions/59787897/how-does-tensorflow-sparsecategoricalcrossentropy-work)


- [Cross-entropy vs sparse-cross-entropy: when to use one over the other](https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other)

- [Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?](https://stackoverflow.com/questions/42081257/why-binary-crossentropy-and-categorical-crossentropy-give-different-performances)