
Evidential Deep Learning to Quantify Classification Uncertainty <br>
Murat Sensoy, Lance Kaplan, Melih Kandemir <br>
https://arxiv.org/abs/1806.01768

## Motivation
Deep learning models lack transparency in their decision-making process. This raises concerns about robustness and reliability in safety-critical applications. Evidential deep learning is an approach that quantifies the uncertainty of a deep learning model's predictions.

## NN with cross entropy loss

In [2]:
# Only pydot is needed to build the graph; SVG is used when rendering the saved file
from IPython.display import SVG
import pydot

dot_graph1 = pydot.Dot(graph_type='digraph')

sd_node = pydot.Node('Standard NN')
sd_node.set_shape('box3d')
dot_graph1.add_node(sd_node)

riq_node = pydot.Node('Softmax \n (Cross entropy loss)')
riq_node.set_shape('square')
dot_graph1.add_node(riq_node)

iedge = pydot.Edge(sd_node,riq_node)
iedge.set_label('logits \n [2, -1, 4]')
dot_graph1.add_edge(iedge)

ariq_node = pydot.Node('Predicted class = argmax(logits)')
ariq_node.set_shape('square')
dot_graph1.add_node(ariq_node)

aiedge = pydot.Edge(sd_node,ariq_node)
aiedge.set_label('logits \n [2, -1, 4]')
dot_graph1.add_edge(aiedge)

asp_node1 = pydot.Node('NLL(Cross entropy loss) \n \n CE = -∑_i(y_i * log(p_i)) \n y_i = one hot label \n p_i = class probabilities')
asp_node1.set_shape('square')
dot_graph1.add_node(asp_node1)

iedge = pydot.Edge(riq_node, asp_node1)
iedge.set_label('Class probabilities \n [0.12, 0.01, 0.88]')
dot_graph1.add_edge(iedge)

asp_node2 = pydot.Node('Optimize weights \n to minimize CE loss')
asp_node2.set_shape('square')
dot_graph1.add_node(asp_node2)

iedge = pydot.Edge(asp_node1, asp_node2)
# Assumes the true class is the third one (y = [0, 0, 1]) and the natural logarithm
iedge.set_label('Cross entropy loss \n CE = -(0 * log(0.12) + 0 * log(0.01) + 1 * log(0.88)) = -log(0.88) = 0.13')
dot_graph1.add_edge(iedge)


# dot_graph1.write_svg('big_data1.svg')
# dot_graph1.write_ps2('big_data1.ps2')
# SVG('big_data1.svg')

* Multinomial distribution (discrete class probabilities): softmax is used in the last layer to convert the continuous activations of the output layer into class probabilities
* The neural network is responsible for adjusting the ratio of class probabilities, and softmax squashes that ratio onto the probability simplex
* The softmax-squashed multinomial likelihood is then maximised w.r.t. $\theta$
* Cross entropy loss (NLL + log-softmax): $-\log p(y \mid x, \theta) = -\log \sigma(f_{y}(x, \theta))$
* $\sigma(u_{j})=\frac{e^{u_{j}}}{\sum_{k=1}^{K}e^{u_{k}}}$
* MLE yields only a point estimate and cannot infer the variance of the predictive distribution. Softmax outputs always lie in (0, 1) and sum to one, but the exponentiation inflates the probability of the predicted class.
* Because of this inflation, the softmax score of the predicted class is informative only relative to the scores of the other classes; it does not accurately reflect the true uncertainty of the model's prediction.
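The softmax and cross entropy steps above can be checked numerically. The following is a minimal sketch in plain Python for the logits [2, -1, 4] used in the diagram; the one-hot label `y` (third class is the true class) is an assumption made for illustration, and the natural logarithm is used.

```python
import math

def softmax(logits):
    """Map raw scores to class probabilities on the simplex."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(u - m) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """CE = -sum_i y_i * log(p_i), using the natural logarithm."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

logits = [2.0, -1.0, 4.0]
y = [0, 0, 1]  # assumed true class: the third one
probs = softmax(logits)
ce = cross_entropy(probs, y)

print([round(p, 2) for p in probs])  # [0.12, 0.01, 0.88]
print(round(ce, 2))                  # 0.13
```

Every probability stays strictly between 0 and 1, yet the largest logit takes almost all of the mass, which is the inflation effect described above.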

## Evidential Deep Learning
The outputs of a deep learning model are treated as probability distributions over class labels. These distributions can be used to quantify the uncertainty of the model's predictions in a number of ways. <br>
Example: the entropy of the probability distribution measures the amount of uncertainty in the model's predictions.

The softmax output of a standard classification network is interpreted as the parameter set of a categorical distribution. By replacing this parameter set with the parameters of a Dirichlet density, the predictions of the learner are represented as a distribution over possible softmax outputs, rather than as a point estimate of a softmax output.
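As a small illustration of the entropy example mentioned above, the sketch below compares a peaked class distribution with a uniform one. The two probability vectors are made-up values chosen only for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.90, 0.05, 0.05]   # hypothetical peaked prediction
uniform = [1 / 3, 1 / 3, 1 / 3]  # maximally uncertain prediction

print(entropy(confident))
print(entropy(uniform))  # equals log(3), the maximum for three classes
```

Entropy is computed from a single point estimate of the class probabilities, which is why the Dirichlet view, a distribution over such vectors, carries more information about uncertainty.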

In [3]:
# Only pydot and SVG are needed to build and display the graph
from IPython.display import SVG
import pydot

dot_graph1 = pydot.Dot(graph_type='digraph')

sd_node = pydot.Node('Standard NN')
sd_node.set_shape('box3d')
dot_graph1.add_node(sd_node)

riq_node = pydot.Node('ReLU/Exponential \n (activation function)')
riq_node.set_shape('square')
dot_graph1.add_node(riq_node)

iedge = pydot.Edge(sd_node,riq_node)
iedge.set_label('logits \n [2, -1, 4]')
dot_graph1.add_edge(iedge)

ariq_node = pydot.Node('Predicted class = argmax(logits)')
ariq_node.set_shape('square')
dot_graph1.add_node(ariq_node)

aiedge = pydot.Edge(sd_node,ariq_node)
aiedge.set_label('logits \n [2, -1, 4]')
dot_graph1.add_edge(aiedge)

asp_node1 = pydot.Node('DST \n u + ∑ b_k = 1 \n \n b_k = e_k/S \n u = K/S \n S = ∑(e_i + 1)')
asp_node1.set_shape('square')
dot_graph1.add_node(asp_node1)

iedge = pydot.Edge(riq_node, asp_node1)
# ReLU applied to the logits [2, -1, 4] gives the evidence [2, 0, 4]
iedge.set_label('evidence(e_k) \n [2, 0, 4]')
dot_graph1.add_edge(iedge)

asp_node2 = pydot.Node('Dirichlet Distribution \n α_k = e_k + 1 \n \n Expected class \n probability \n p_k = α_k/S')
asp_node2.set_shape('square')
dot_graph1.add_node(asp_node2)

iedge = pydot.Edge(asp_node1, asp_node2)
iedge.set_label('evidence(e_k) \n [2, 0, 4]')
dot_graph1.add_edge(iedge)


dot_graph1.write_svg('big_data1.svg')
dot_graph1.write_ps2('big_data1.ps2')
SVG('big_data1.svg')

- e: evidence
- b: belief
- u: uncertainty
- K: number of classes

* The Dempster–Shafer Theory of Evidence (DST) assigns a belief mass $b_{k}$ to each class and an overall uncertainty mass u
* All belief masses and the uncertainty mass are non-negative and sum up to one
* u + $\sum\limits_{k=1}^K$ $b_{k}$ = 1
* $b_{k}$ = $\frac{e_{k}}{S}$
* u = $\frac{K}{S}$
* S = $\sum\limits_{i=1}^K$ ($e_{i}$ + 1)
* Uncertainty is inversely proportional to S, so it shrinks as the total evidence grows

* The parameters of the Dirichlet distribution are $\alpha_{k}$ = $e_{k}$ + 1
* $b_{k}$ = $\frac{\alpha_{k}-1}{S}$
* S = $\sum\limits_{i=1}^K$ $\alpha_{i}$
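The relations between evidence, Dirichlet parameters, belief and uncertainty can be sketched as follows. The evidence vector [2, 0, 4] is an assumed example (what a ReLU would produce from the logits [2, -1, 4]), and the function name is hypothetical.

```python
def subjective_opinion(evidence):
    """Return (alpha, S, belief, uncertainty) for a non-negative evidence vector."""
    K = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # alpha_k = e_k + 1
    S = sum(alpha)                        # Dirichlet strength
    belief = [e / S for e in evidence]    # b_k = e_k / S = (alpha_k - 1) / S
    uncertainty = K / S                   # u = K / S
    return alpha, S, belief, uncertainty

alpha, S, belief, u = subjective_opinion([2.0, 0.0, 4.0])

print(alpha)                           # [3.0, 1.0, 5.0]
print(S)                               # 9.0
print([round(b, 4) for b in belief])   # [0.2222, 0.0, 0.4444]
print(round(u, 4))                     # 0.3333

# With no evidence at all, the whole mass goes to uncertainty.
_, _, belief0, u0 = subjective_opinion([0.0, 0.0, 0.0])
print(belief0, u0)                     # [0.0, 0.0, 0.0] 1.0
```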

* Replace the softmax layer of the standard NN with a ReLU layer to ensure non-negative outputs, which are taken as the evidences $e_{1}$, $e_{2}$, $e_{3}$, ..., $e_{K}$
* A Dirichlet distribution parametrized over the evidence represents the density of each possible probability assignment. Hence it models second-order probabilities and uncertainty
* The Dirichlet distribution D(p∣α) is a pdf over the parameters p of a categorical distribution, parameterized by [$\alpha_{1}$, $\alpha_{2}$, $\alpha_{3}$, ..., $\alpha_{K}$] (so class probabilities can be sampled from it)
* D(p∣$\alpha$) = $\frac{1}{B(\alpha)}\prod_{k=1}^{K}p_{k}^{\alpha_{k}-1}$ for p on the simplex (and 0 otherwise), where $B(\alpha)$ is the K-dimensional multinomial beta function
* Expected class probability, $\hat{p}_{k}$ = $\frac{\alpha_{k}}{S}$
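To make the Dirichlet density and the expected class probability concrete, here is a minimal stdlib-only sketch. It assumes the logits [2, -1, 4] from the diagram and a ReLU activation; the helper names are hypothetical, and sampling uses the standard construction of a Dirichlet draw from normalized gamma variates.

```python
import math
import random

def dirichlet_log_pdf(p, alpha):
    """log D(p | alpha) for p strictly inside the simplex."""
    log_B = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(pk) for a, pk in zip(alpha, p)) - log_B

def expected_probability(alpha):
    """Expected class probability: alpha_k / S."""
    S = sum(alpha)
    return [a / S for a in alpha]

def sample_dirichlet(alpha, rng):
    """One class-probability vector drawn from D(p | alpha)."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

logits = [2.0, -1.0, 4.0]
evidence = [max(0.0, u) for u in logits]  # ReLU
alpha = [e + 1.0 for e in evidence]       # [3.0, 1.0, 5.0]
p_hat = expected_probability(alpha)       # [1/3, 1/9, 5/9]

rng = random.Random(0)                    # fixed seed for repeatability
samples = [sample_dirichlet(alpha, rng) for _ in range(20000)]
sample_mean = [sum(s[k] for s in samples) / len(samples) for k in range(3)]

print([round(p, 4) for p in p_hat])
print([round(m, 2) for m in sample_mean])  # close to p_hat
```

Each sample is itself a valid class-probability vector, which is what "a distribution over possible softmax outputs" means in practice.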

### Type II Maximum Likelihood loss
* D(p∣α) acts as a prior on the likelihood Multi(y∣p); the negative log marginal likelihood is obtained by integrating out the class probabilities
* Loss, $L_{i}(\theta)$ = -log(marginal likelihood)
$$L_{i}(\theta) = -\log\left(\int\prod_{j=1}^{K}p_{ij}^{y_{ij}}\frac{1}{B(\alpha_{i})}\prod_{j=1}^{K}p_{ij}^{\alpha_{ij}-1}dp_{i}\right)$$
$$L_{i}(\theta) = \sum \limits_{j=1}^{K}y_{ij}(\log(S_{i})-\log(\alpha_{ij}))$$
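For a one-hot label, the closed form above reduces to the negative log of the expected probability of the true class, since the marginal likelihood equals $\alpha_{ij}/S_{i}$ for the true class j. A minimal sketch, assuming the example parameters α = [3, 1, 5] and the third class as the true label:

```python
import math

def type2_ml_loss(alpha, one_hot):
    """sum_j y_j * (log S - log alpha_j) for one sample."""
    S = sum(alpha)
    return sum(y * (math.log(S) - math.log(a)) for y, a in zip(one_hot, alpha))

alpha = [3.0, 1.0, 5.0]  # assumed Dirichlet parameters
y = [0, 0, 1]            # assumed true class: the third one

loss = type2_ml_loss(alpha, y)
print(round(loss, 4))    # 0.5878, i.e. -log(5/9)

# More evidence for the true class lowers the loss.
print(type2_ml_loss([3.0, 1.0, 50.0], y) < loss)  # True
```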

### Bayes risk with cross entropy loss
$$L_{i}(\theta) = \int\left[\sum\limits_{j=1}^{K}-y_{ij}\log(p_{ij})\right]\frac{1}{B(\alpha_{i})}\prod_{j=1}^{K}p_{ij}^{\alpha_{ij}-1}dp_{i}$$
$$L_{i}(\theta) = \sum \limits_{j=1}^{K}y_{ij}(\psi(S_{i})-\psi(\alpha_{ij}))$$

<br>

$\psi$ is the digamma function
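The digamma form can be evaluated without external libraries. The sketch below uses a simple approximation of $\psi$ (upward recurrence followed by an asymptotic series), which is an implementation choice made here for self-containment rather than anything prescribed by the paper; in practice a library routine such as `scipy.special.digamma` would normally be used. The parameters α = [3, 1, 5] and the label are assumed example values.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def bayes_risk_cross_entropy(alpha, one_hot):
    """sum_j y_j * (psi(S) - psi(alpha_j)) for one sample."""
    S = sum(alpha)
    return sum(y * (digamma(S) - digamma(a)) for y, a in zip(one_hot, alpha))

alpha = [3.0, 1.0, 5.0]  # assumed Dirichlet parameters
y = [0, 0, 1]            # assumed true class: the third one

loss = bayes_risk_cross_entropy(alpha, y)
print(round(loss, 4))    # 0.6345, i.e. psi(9) - psi(5) = 1/5 + 1/6 + 1/7 + 1/8
```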

### Bayes risk with squared loss
$$L_{i}(\theta) = \int \|y_{i}-p_{i}\|_{2}^{2}\frac{1}{B(\alpha_{i})}\prod_{j=1}^{K}p_{ij}^{\alpha_{ij}-1}dp_{i}$$
$$L_{i}(\theta) = \sum \limits_{j=1}^{K}E[y_{ij}^2 - 2y_{ij}p_{ij} + p_{ij}^2]$$
$$L_{i}(\theta) = \sum \limits_{j=1}^{K}(y_{ij}^2 - 2y_{ij}E[p_{ij}] + E[p_{ij}^2])$$

<br>

* $L_{i}(\theta) = \sum \limits_{j=1}^{K}\left[(y_{ij} - E[p_{ij}])^2 + Var(p_{ij})\right]$
* $L_{i}(\theta)$ = $\sum\limits_{j=1}^K$ ($L_{ij}^{err}$ + $L_{ij}^{var}$), with $L_{ij}^{err} = (y_{ij} - E[p_{ij}])^2$ and $L_{ij}^{var} = Var(p_{ij})$
* The loss aims to achieve the joint goals of minimizing the prediction error and the variance of the Dirichlet experiment
* It prioritizes data fit over variance estimation
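Using the standard Dirichlet moments $E[p_{ij}] = \alpha_{ij}/S_{i}$ and $Var(p_{ij}) = \frac{\alpha_{ij}(S_{i}-\alpha_{ij})}{S_{i}^{2}(S_{i}+1)}$, the error and variance terms can be computed directly. This is a minimal sketch with assumed example values (α = [3, 1, 5], third class as the true label); the function name is hypothetical.

```python
def squared_bayes_risk_terms(alpha, one_hot):
    """Per-class error and variance terms of the squared-loss Bayes risk."""
    S = sum(alpha)
    p_hat = [a / S for a in alpha]                      # E[p_j]
    err = [(y - p) ** 2 for y, p in zip(one_hot, p_hat)]
    var = [p * (1.0 - p) / (S + 1.0) for p in p_hat]    # Var(p_j)
    return err, var

alpha = [3.0, 1.0, 5.0]  # assumed Dirichlet parameters
y = [0, 0, 1]            # assumed true class: the third one

err, var = squared_bayes_risk_terms(alpha, y)
loss = sum(err) + sum(var)

print(round(sum(err), 4))  # 0.321
print(round(sum(var), 4))  # 0.0568
print(round(loss, 4))      # 0.3778
```

In this example the error term dominates the variance term, in line with the statement that the loss prioritizes data fit.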

**Proposition 1** <br>
For any $\alpha_{ij}$ $\geq$ 1, the inequality $L_{ij}^{var}$ < $L_{ij}^{err}$ is satisfied.\
i.e. The loss prioritizes data fit over variance estimation

**Proposition 2** <br>
For a given sample i with the correct label j, $L_{i}^{err}$ decreases when new evidence is added to $\alpha_{ij}$ and increases when evidence is removed from $\alpha_{ij}$\
i.e. The loss has a tendency to fit to the data

**Proposition 3** <br>
For a given sample i with the correct class label j, $L_{i}^{err}$ decreases when some evidence is removed from the biggest Dirichlet parameter $\alpha_{il}$ such that $l \neq$ j.\
i.e. The loss performs learned loss attenuation
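The three propositions can be spot-checked on a single example. This is only a numerical illustration under assumed values (α = [3, 1, 5], third class correct), not a proof; the helper name is hypothetical.

```python
def err_and_var(alpha, one_hot):
    """Per-class (error, variance) terms of the squared-loss Bayes risk."""
    S = sum(alpha)
    p_hat = [a / S for a in alpha]
    err = [(y - p) ** 2 for y, p in zip(one_hot, p_hat)]
    var = [p * (1.0 - p) / (S + 1.0) for p in p_hat]
    return err, var

y = [0, 0, 1]            # assumed true class: the third one
base = [3.0, 1.0, 5.0]   # assumed Dirichlet parameters
err, var = err_and_var(base, y)
base_err = sum(err)

# Proposition 1: each variance term is below its error term.
prop1 = all(v < e for e, v in zip(err, var))

# Proposition 2: evidence added to / removed from the correct class.
err_more_correct = sum(err_and_var([3.0, 1.0, 6.0], y)[0])
err_less_correct = sum(err_and_var([3.0, 1.0, 4.0], y)[0])

# Proposition 3: evidence removed from the biggest incorrect parameter (class 1).
err_less_wrong = sum(err_and_var([2.0, 1.0, 5.0], y)[0])

print(prop1)                                          # True
print(err_more_correct < base_err < err_less_correct)  # True
print(err_less_wrong < base_err)                       # True
```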

### KL Divergence
* If a sample cannot be correctly classified, it is preferable for its total evidence to shrink to zero. This is achieved by incorporating a Kullback-Leibler (KL) divergence term into the loss function
* $\hat{\alpha}_{i} = y_{i} + (1 - y_{i}) \odot \alpha_{i}$ denotes the Dirichlet parameters after removing the non-misleading evidence from the predicted parameters $\alpha_{i}$
* The annealing coefficient $\lambda_{t}$ is increased gradually during training so that the effect of the KL divergence in the loss function grows over time
$$L(\theta) = \sum \limits_{i=1}^{N} L_{i}(\theta) + \lambda_{t}\sum \limits_{i=1}^{N}KL[D(p_{i}|\hat{\alpha}_{i})\,||\,D(p_{i}|\langle 1,...,1\rangle)]$$
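The KL term against the uniform Dirichlet has a closed form in terms of log-gamma and digamma functions. The sketch below is stdlib-only and makes several assumptions: the digamma approximation is a local helper, the example values (α = [3, 1, 5], third class correct) are illustrative, and the linear ramp $\lambda_{t} = \min(1, t/10)$ over the epoch index t follows the schedule described in the paper, with the `ramp_epochs` parameter name being hypothetical.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def remove_correct_evidence(alpha, one_hot):
    """alpha_hat = y + (1 - y) * alpha: keeps only evidence for the wrong classes."""
    return [y + (1 - y) * a for y, a in zip(one_hot, alpha)]

def kl_to_uniform_dirichlet(alpha):
    """KL[D(p | alpha) || D(p | <1, ..., 1>)]."""
    K, S = len(alpha), sum(alpha)
    log_norm = math.lgamma(S) - math.lgamma(K) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * (digamma(a) - digamma(S)) for a in alpha)

def annealing_coefficient(epoch, ramp_epochs=10):
    """Linear ramp from 0 to 1 over the first ramp_epochs epochs (assumed schedule)."""
    return min(1.0, epoch / ramp_epochs)

alpha = [3.0, 1.0, 5.0]  # assumed Dirichlet parameters
y = [0, 0, 1]            # assumed true class: the third one

alpha_hat = remove_correct_evidence(alpha, y)
kl = kl_to_uniform_dirichlet(alpha_hat)

print(alpha_hat)                    # [3.0, 1.0, 1.0]
print(round(kl, 4))                 # 0.6251
print(annealing_coefficient(5))     # 0.5
print(annealing_coefficient(5) * kl)
```

The penalty is zero only when no evidence remains for the wrong classes, so misleading evidence is pushed toward zero while evidence for the correct class is left untouched.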