# Sigmoid and Softmax

## Bayes' theorem
From Bayes' theorem, the posterior probability takes the form

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)p(C_k)}{\sum_j p(\mathbf{x}|C_j)p(C_j)}$$

## Logistic sigmoid function
Consider first the case of two classes. Dividing the numerator and denominator of the posterior by $p(\mathbf{x}|C_1)p(C_1)$ gives the *logistic sigmoid* form of the posterior probability.

<font color='red'>$$p(C_1|\mathbf{x}) = \sigma(a) = \frac{1}{1+\exp(-a)}\qquad \text{where}\ a=\ln\frac{p(\mathbf{x}|C_1)p(C_1)}{p(\mathbf{x}|C_2)p(C_2)}\tag{4.57,4.58,4.59}$$</font>

and $\sigma(a)$ is the *logistic sigmoid* function. The term 'sigmoid' means S-shaped, which describes the curve of $\sigma(a)$ when plotted against $a$.
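The equivalence between Bayes' theorem and the sigmoid form can be checked numerically. The following is a minimal sketch; the joint values `joint_1` and `joint_2` are hypothetical numbers chosen only for illustration.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical values of p(x|C_k) p(C_k) at one fixed point x (illustrative only).
joint_1 = 0.30 * 0.6   # p(x|C_1) * p(C_1)
joint_2 = 0.10 * 0.4   # p(x|C_2) * p(C_2)

# Posterior straight from Bayes' theorem.
posterior_bayes = joint_1 / (joint_1 + joint_2)

# Same posterior through the log odds a and the sigmoid.
a = math.log(joint_1 / joint_2)
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)
```

Both routes should agree up to floating-point rounding, and $\sigma(-a) = 1-\sigma(a)$ gives the posterior of the other class.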

## Softmax function
For the multiclass case, the posterior probability can be written in the form

<font color='red'>$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)p(C_k)}{\sum_j p(\mathbf{x}|C_j)p(C_j)}=\frac{\exp(a_k)}{\sum_j \exp(a_j)}\qquad \text{where}\ a_k=\ln\big(p(\mathbf{x}|C_k)p(C_k)\big)\tag{4.62,4.63}$$</font>

which is known as the *normalized exponential* and can be regarded as a multiclass generalization of the logistic sigmoid. It is also known as the *softmax function*, as it represents a smoothed version of the 'max' function: if $a_k\gg a_j$ for all $j\neq k$, then $p(C_k|\mathbf{x})\simeq 1$ and $p(C_j|\mathbf{x})\simeq 0$.
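The 'smoothed max' behaviour is easy to see with numbers. This is a minimal sketch with made-up activations; subtracting the maximum before exponentiating is a common numerical safeguard and is an addition here, not part of the derivation above.

```python
import math

def softmax(a):
    """Normalized exponential; subtracting max(a) avoids overflow and leaves the result unchanged."""
    m = max(a)
    exps = [math.exp(a_k - m) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]

close = softmax([1.0, 1.2, 0.8])       # comparable activations: probabilities stay spread out
dominant = softmax([20.0, 1.2, 0.8])   # first activation much larger: nearly all mass on it

print(close)
print(dominant)
```

With two classes and activations $(a, 0)$, the first output reduces to $\sigma(a)$, which is the sense in which softmax generalizes the sigmoid.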

## Why we need the exponential (sigmoid/softmax) form
<font color='red'>The reason is that, when the class-conditional densities are assumed to be Gaussian with a **shared covariance** matrix, the argument of the sigmoid/softmax becomes a linear function of $\mathbf{x}$, so the decision boundaries between the classes are linear.</font>

Assume the density for class $C_k$ is given by

$$p(\mathbf{x}|C_k)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu}_k)^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu}_k)\right\} \tag{4.64}$$

### Sigmoid
Consider first the case of **two classes**. From (4.57) and (4.58), we have

<font color='red'>$$p(C_1|\mathbf{x})=\sigma(\mathbf{w}^T\mathbf{x}+w_0) \tag{4.65}$$</font>

where we have defined

<font color='red'>$$\begin{align*}
\mathbf{w} &=\Sigma^{-1}(\mathbf{\mu}_1-\mathbf{\mu}_2) \tag{4.66}\\
w_0 &=-\frac{1}{2}\mathbf{\mu}_1^T\Sigma^{-1}\mathbf{\mu}_1+\frac{1}{2}\mathbf{\mu}_2^T\Sigma^{-1}\mathbf{\mu}_2+\ln\frac{p(C_1)}{p(C_2)} \tag{4.67}
\end{align*}$$</font>

We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled due to the assumption of common covariance matrices. 
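The cancellation can be made explicit. A sketch of the intermediate algebra: substituting (4.64) into the log odds $a$ of the two-class posterior, the normalization constants cancel because $\Sigma$ is shared, leaving

$$\begin{align*}
a &= -\frac{1}{2}(\mathbf{x}-\mathbf{\mu}_1)^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu}_1)+\frac{1}{2}(\mathbf{x}-\mathbf{\mu}_2)^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu}_2)+\ln\frac{p(C_1)}{p(C_2)}\\
&= (\mathbf{\mu}_1-\mathbf{\mu}_2)^T\Sigma^{-1}\mathbf{x}-\frac{1}{2}\mathbf{\mu}_1^T\Sigma^{-1}\mathbf{\mu}_1+\frac{1}{2}\mathbf{\mu}_2^T\Sigma^{-1}\mathbf{\mu}_2+\ln\frac{p(C_1)}{p(C_2)}
\end{align*}$$

The two $\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ terms appear with opposite signs and drop out. Since $\Sigma^{-1}$ is symmetric, the coefficient of $\mathbf{x}$ is $\mathbf{w}^T$ with $\mathbf{w}=\Sigma^{-1}(\mathbf{\mu}_1-\mathbf{\mu}_2)$, and the remaining terms form $w_0$.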

The decision boundary is the set of points where $p(C_1|\mathbf{x})=p(C_2|\mathbf{x})$, i.e. where $\sigma(\mathbf{w}^T\mathbf{x}+w_0)=0.5$. Since $\sigma(0)=0.5$, it follows that

$$\mathbf{w}^T\mathbf{x}+w_0=0$$

is the boundary between the two classes, which is linear in $\mathbf{x}$.
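The two-class result can be verified numerically. The sketch below assumes a 2-D input and hypothetical values for the means, the shared covariance and the priors; the small 2x2 helper functions are illustrative and not part of the derivation.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def gauss2(x, mu, sigma):
    """2-D Gaussian density, (4.64) with D = 2."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    quad = dot(d, matvec(inv2(sigma), d))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

# Hypothetical parameters, chosen only for illustration.
mu1, mu2 = [1.0, 2.0], [-1.0, 0.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
prior1, prior2 = 0.3, 0.7

# (4.66) and (4.67)
sigma_inv = inv2(sigma)
w = matvec(sigma_inv, [mu1[0] - mu2[0], mu1[1] - mu2[1]])
w0 = (-0.5 * dot(mu1, matvec(sigma_inv, mu1))
      + 0.5 * dot(mu2, matvec(sigma_inv, mu2))
      + math.log(prior1 / prior2))

def posterior_linear(x):
    """(4.65): sigmoid of a linear function of x."""
    return 1.0 / (1.0 + math.exp(-(dot(w, x) + w0)))

def posterior_bayes(x):
    """Bayes' theorem applied directly to the two Gaussian densities."""
    j1 = gauss2(x, mu1, sigma) * prior1
    j2 = gauss2(x, mu2, sigma) * prior2
    return j1 / (j1 + j2)

x = [0.5, -0.3]
print(posterior_linear(x), posterior_bayes(x))
```

The two posteriors should match at any point, and any $\mathbf{x}$ with $\mathbf{w}^T\mathbf{x}+w_0=0$ should give a posterior of 0.5 under both computations.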

### Softmax
For the general case of **$K$ classes** we have, from (4.62) and (4.63),

<font color='red'>$$a_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x}+w_{k0} \tag{4.68}$$</font>

where we have defined

<font color='red'>$$\begin{align*}
\mathbf{w}_k &= \Sigma^{-1}\mathbf{\mu}_k \tag{4.69}\\
w_{k0} &= -\frac{1}{2}\mathbf{\mu}_k^T\Sigma^{-1}\mathbf{\mu}_k+\ln p(C_k) \tag{4.70}
\end{align*}$$</font>

We see that the $a_k(\mathbf{x})$ are again linear functions of $\mathbf{x}$ as a consequence of the cancellation of the quadratic terms due to the shared covariance.

For a given point $\mathbf{x}$, the denominator $\sum_j \exp(a_j)$ of the posterior probability is the same for every class, so $\mathbf{x}$ should be assigned to the class $C_k$ with the largest $\exp(a_k)$, or equivalently the largest $a_k$. The resulting decision boundaries, corresponding to the minimum misclassification rate, <font color='red'>will occur when two of the posterior probabilities (the two largest) are equal</font>, and so will be defined by linear functions of $\mathbf{x}$. We therefore again have a generalized linear model.
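A small numerical sketch of the $K$-class case follows. To keep it short it assumes a 1-D input, so $\Sigma$ reduces to a scalar variance `var`; the means, priors and variance are hypothetical.

```python
import math

# Hypothetical 1-D setup: three classes with a shared variance.
means = [-2.0, 0.0, 3.0]
priors = [0.2, 0.5, 0.3]
var = 1.5

# (4.69) and (4.70) with Sigma reduced to the scalar `var`.
w = [mu / var for mu in means]
w0 = [-0.5 * mu * mu / var + math.log(p) for mu, p in zip(means, priors)]

def activations(x):
    """(4.68): a_k(x) = w_k x + w_k0, linear in x."""
    return [wk * x + wk0 for wk, wk0 in zip(w, w0)]

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

def posterior_bayes(x):
    """Bayes' theorem applied directly to the Gaussian densities."""
    joint = [p * math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
             for mu, p in zip(means, priors)]
    s = sum(joint)
    return [j / s for j in joint]

def classify(x):
    """Assign x to the class with the largest a_k."""
    a = activations(x)
    return a.index(max(a))

def boundary(j, k):
    """Point where a_j(x) = a_k(x), i.e. where the two posteriors are equal."""
    return (w0[k] - w0[j]) / (w[j] - w[k])

x = 1.0
print(softmax(activations(x)), posterior_bayes(x), classify(x))
print(boundary(0, 1), boundary(1, 2))
```

Even though the common quadratic term has been dropped from the $a_k$, the softmax of the linear activations should reproduce the exact Bayes posterior, because that term cancels between numerator and denominator.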

-------------

# Maximum likelihood solution

We have seen that the posterior probabilities can be modeled with linear functions of $\mathbf{x}$. The next step is to determine the parameters of these models by maximum likelihood.

There are three kinds of parameters: the class priors $p(C_k)$, the class means $\mathbf{\mu}_k$, and the shared covariance matrix $\Sigma$.

We solve for these parameters in the two-class case below; the multiclass case can be solved with the same approach.

## Likelihood function
The likelihood function is given by

$$p(\mathbf{t},\mathbf{X}|\pi, \mathbf{\mu}_1,\mathbf{\mu}_2,\Sigma) = \prod_{n=1}^N \big[\pi\mathcal{N}(\mathbf{x}_n|\mathbf{\mu}_1,\Sigma)\big]^{t_n}\big[(1-\pi)\mathcal{N}(\mathbf{x}_n|\mathbf{\mu}_2,\Sigma)\big]^{1-t_n} \tag{4.71}$$

where
- $\pi$ denotes the prior class probability $p(C_1)=\pi$, so that $p(C_2) = 1-\pi$.
- $\mathbf{\mu}_1,\mathbf{\mu}_2$ denote the means of the class-conditional densities of $C_1$ and $C_2$.
- $\Sigma$ denotes the shared covariance matrix.
- $t_n$ denotes the target, which takes one of two values: $t_n=1$ denotes class $C_1$ and $t_n=0$ denotes class $C_2$.
- $\mathbf{t} = (t_1,\cdots,t_N)^T$ denotes the targets of the training set.
- $\mathbf{X} = (\mathbf{x}_1,\cdots,\mathbf{x}_N)$ denotes the input vectors of the training set.
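As a concrete reading of (4.71), the sketch below evaluates the likelihood and its logarithm on toy data. It assumes 1-D inputs, so $\Sigma$ becomes a scalar variance `var`; the data and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def likelihood(xs, ts, pi, mu1, mu2, var):
    """Direct product form of (4.71), with Sigma reduced to a scalar variance."""
    value = 1.0
    for x, t in zip(xs, ts):
        value *= (pi * normal_pdf(x, mu1, var)) ** t
        value *= ((1.0 - pi) * normal_pdf(x, mu2, var)) ** (1 - t)
    return value

def log_likelihood(xs, ts, pi, mu1, mu2, var):
    """Logarithm of (4.71): the product becomes a sum over data points."""
    total = 0.0
    for x, t in zip(xs, ts):
        total += t * (math.log(pi) + math.log(normal_pdf(x, mu1, var)))
        total += (1 - t) * (math.log(1.0 - pi) + math.log(normal_pdf(x, mu2, var)))
    return total

# Hypothetical toy data: t = 1 marks class C_1, t = 0 marks class C_2.
xs = [2.1, 1.7, 2.6, -0.4, 0.3]
ts = [1, 1, 1, 0, 0]

params = dict(pi=0.6, mu1=2.0, mu2=0.0, var=0.5)
print(likelihood(xs, ts, **params), log_likelihood(xs, ts, **params))
```

Because $t_n\in\{0,1\}$, each data point contributes exactly one of the two bracketed factors, which is why the log likelihood separates into per-class sums.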



## Prior class probability
Picking out the terms in the log likelihood function that depend on $\pi$, we have

$$\sum_{n=1}^N \big\{ t_n\ln \pi+(1-t_n)\ln(1-\pi) \big\} \tag{4.72}$$

Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain

<font color='red'>$$\pi = \frac{1}{N}\sum_{n=1}^{N}t_n=\frac{N_1}{N}=\frac{N_1}{N_1+N_2} \tag{4.73}$$</font>

where $N_1, N_2$ denote the total number of data points in class $C_1$ and $C_2$.
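The estimate is simply the fraction of training points in class $C_1$. A minimal sketch with hypothetical targets, which also evaluates (4.72) so the maximum can be checked against nearby values of $\pi$:

```python
import math

ts = [1, 0, 1, 1, 0, 0, 1, 1]   # hypothetical targets: 1 -> C_1, 0 -> C_2

N = len(ts)
N1 = sum(ts)
N2 = N - N1
pi_ml = N1 / N                   # (4.73)

def log_lik_pi(pi):
    """Terms of the log likelihood that depend on pi, (4.72)."""
    return sum(t * math.log(pi) + (1 - t) * math.log(1.0 - pi) for t in ts)

print(pi_ml, log_lik_pi(pi_ml))
```

Evaluating `log_lik_pi` slightly above or below `pi_ml` should give a smaller value, consistent with the stationary point being a maximum.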


## Mean vectors

Picking out the terms in the log likelihood function that depend on $\mathbf{\mu}_1$, we have

$$\sum_{n=1}^N t_n \ln \mathcal{N}(\mathbf{x}_n|\mathbf{\mu}_1,\Sigma) = -\frac{1}{2}\sum_{n=1}^{N}t_n(\mathbf{x}_n-\mathbf{\mu}_1)^T\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_1)+const \tag{4.74}$$

Setting the derivative with respect to $\mathbf{\mu}_1$ to zero and rearranging, we obtain
<font color='Red'>$$\mathbf{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^N t_n\mathbf{x}_n = \frac{1}{N_1}\sum_{n\in C_1}\mathbf{x}_n \tag{4.75}$$</font>

Likewise, 

<font color='Red'>$$\mathbf{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^N (1-t_n)\mathbf{x}_n = \frac{1}{N_2}\sum_{n\in C_2}\mathbf{x}_n \tag{4.76}$$</font>
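In code, (4.75) and (4.76) are weighted means in which the targets act as 0/1 weights. A minimal sketch on hypothetical 2-D data:

```python
# Hypothetical training set: 2-D inputs with targets (1 -> C_1, 0 -> C_2).
xs = [[2.0, 1.0], [0.0, -1.0], [3.0, 2.0], [-1.0, 0.0], [1.0, 3.0]]
ts = [1, 0, 1, 0, 1]

def class_mean(xs, weights):
    """Weighted mean: sum_n weight_n x_n / sum_n weight_n."""
    total = sum(weights)
    dim = len(xs[0])
    return [sum(wt * x[i] for x, wt in zip(xs, weights)) / total for i in range(dim)]

mu1 = class_mean(xs, ts)                    # (4.75): weights t_n, total N_1
mu2 = class_mean(xs, [1 - t for t in ts])   # (4.76): weights 1 - t_n, total N_2
print(mu1, mu2)
```

Each estimate is the ordinary sample mean of the points assigned to that class.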

## Shared covariance matrix
Picking out the terms in the log likelihood function that depend on $\Sigma$, we have

$$\begin{align*}&\quad -\frac{1}{2}\sum_{n=1}^N t_n\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^N t_n(\mathbf{x}_n-\mathbf{\mu}_1)^T\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_1)
-\frac{1}{2}\sum_{n=1}^N (1-t_n)\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^N (1-t_n)(\mathbf{x}_n-\mathbf{\mu}_2)^T\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_2)\\
&=-\frac{N}{2}\ln|\Sigma|-\frac{1}{2}\sum_{n\in C_1}(\mathbf{x}_n-\mathbf{\mu}_1)^T\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_1)
-\frac{1}{2}\sum_{n\in C_2}(\mathbf{x}_n-\mathbf{\mu}_2)^T\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_2)\\
&=-\frac{N}{2}\ln|\Sigma|-\frac{1}{2}\sum_{n\in C_1}\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_1)\cdot(\mathbf{x}_n-\mathbf{\mu}_1)
-\frac{1}{2}\sum_{n\in C_2}\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_2)\cdot(\mathbf{x}_n-\mathbf{\mu}_2)\qquad \Sigma^{-1T}=\Sigma^{-1}\\
&=-\frac{N}{2}\ln|\Sigma|-\frac{1}{2}Tr\left\{\sum_{n\in C_1}\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_1)(\mathbf{x}_n-\mathbf{\mu}_1)^T\right\}
-\frac{1}{2}Tr\left\{\sum_{n\in C_2}\Sigma^{-1}(\mathbf{x}_n-\mathbf{\mu}_2)(\mathbf{x}_n-\mathbf{\mu}_2)^T\right\}\\
&=-\frac{N}{2}\ln|\Sigma|-\frac{1}{2}Tr\left\{\Sigma^{-1}\sum_{n\in C_1}(\mathbf{x}_n-\mathbf{\mu}_1)(\mathbf{x}_n-\mathbf{\mu}_1)^T\right\}
-\frac{1}{2}Tr\left\{\Sigma^{-1}\sum_{n\in C_2}(\mathbf{x}_n-\mathbf{\mu}_2)(\mathbf{x}_n-\mathbf{\mu}_2)^T\right\}\\
&=-\frac{N}{2}\ln|\Sigma|-\frac{N_1}{2}Tr\{\Sigma^{-1}S_1\}-\frac{N_2}{2}Tr\{\Sigma^{-1}S_2\}\qquad \color{Red}{
\left\{\begin{array}{ll}
S_1 &= \frac{1}{N_1}\sum_{n\in C_1}(\mathbf{x}_n-\mathbf{\mu}_1)(\mathbf{x}_n-\mathbf{\mu}_1)^T\qquad (4.79)\\
S_2 &= \frac{1}{N_2}\sum_{n\in C_2}(\mathbf{x}_n-\mathbf{\mu}_2)(\mathbf{x}_n-\mathbf{\mu}_2)^T\qquad (4.80)
\end{array}\right.}\\
&=-\frac{N}{2}\ln|\Sigma|-\frac{N}{2}Tr\{\Sigma^{-1}S\}\qquad \color{Red}{S = \frac{N_1}{N}S_1+\frac{N_2}{N}S_2} \tag{4.77, 4.78}
\end{align*}$$

Setting the derivative with respect to $\Sigma$ to zero

$$\begin{align*}
&\quad \frac{d}{d\Sigma}\left(-\frac{N}{2}\ln|\Sigma|-\frac{N}{2}Tr\{\Sigma^{-1}S\}\right) \\
&= \underbrace{-\frac{N}{2}\Sigma^{-1T}}_{C.28}+\underbrace{\frac{N}{2}\left(\Sigma^{-1}S\Sigma^{-1}\right)^T}_{C.24}\\
&= -\frac{N}{2}\Sigma^{-1}+\frac{N}{2}\Sigma^{-1}S\Sigma^{-1}\qquad \Sigma^{T} =\Sigma,\ S^T=S \\
&= 0
\end{align*}$$

and multiplying on the left and on the right by $\Sigma$, we obtain

<font color='Red'>$$\Sigma = S$$</font>
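The result says that the shared covariance is the weighted average of the per-class covariances. A minimal sketch on hypothetical 2-D data; the helper names are illustrative.

```python
# Hypothetical training set: 2-D inputs with targets (1 -> C_1, 0 -> C_2).
xs = [[2.0, 1.0], [0.0, -1.0], [3.0, 2.0], [-1.0, 0.0], [1.0, 3.0]]
ts = [1, 0, 1, 0, 1]

def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, mu):
    """Per-class covariance S_k: average outer product of deviations from mu."""
    n = len(points)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j] / n
    return s

c1 = [x for x, t in zip(xs, ts) if t == 1]
c2 = [x for x, t in zip(xs, ts) if t == 0]
n1, n2, n = len(c1), len(c2), len(xs)

s1 = scatter(c1, mean(c1))   # (4.79)
s2 = scatter(c2, mean(c2))   # (4.80)

# (4.78): Sigma = S is the weighted average of the per-class covariances.
sigma_ml = [[(n1 * s1[i][j] + n2 * s2[i][j]) / n for j in range(2)] for i in range(2)]
print(sigma_ml)
```

Note that each class is centred on its own mean before the outer products are averaged, so the estimate measures within-class spread only.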

---------------

# Discrete features

Here we make the *naive Bayes* assumption in which the feature values are treated as independent, conditioned on the class $C_k$. Thus we have class-conditional distributions of the form

$$p(\mathbf{x}|C_k)=\prod_{i=1}^{D}\mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i} \tag{4.81}$$

where

- $\mathbf{x} = (x_1,\cdots, x_D)^T$ denotes the vector of $D$ input features.
- $x_i$ denotes an input feature, which is binary and takes values $x_i\in\{0,1\}$.
- $\mu_{ki}$ denotes the mean of the $i^{th}$ input feature in class $C_k$, i.e. the probability $p(x_i=1|C_k)$.

Substituting into (4.63) then gives

$$a_k(\mathbf{x})=\sum_{i=1}^D\big\{ x_i\ln\mu_{ki}+(1-x_i)\ln(1-\mu_{ki}) \big\}+\ln p(C_k) \tag{4.82}$$

which again are linear functions of the input values $x_i$.

The two-class case can likewise be verified to give a linear function by substituting into (4.57).
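The linearity of (4.82) becomes explicit after collecting the coefficients of $x_i$: $a_k(\mathbf{x})=\sum_i x_i\ln\frac{\mu_{ki}}{1-\mu_{ki}}+\sum_i\ln(1-\mu_{ki})+\ln p(C_k)$. The sketch below checks this rearrangement numerically; the feature probabilities and priors are hypothetical.

```python
import math

# Hypothetical parameters: 2 classes, D = 3 binary features.
mu = [[0.9, 0.2, 0.5],    # mu_{1i}
      [0.3, 0.7, 0.5]]    # mu_{2i}
priors = [0.4, 0.6]

def a_direct(k, x):
    """(4.82) evaluated term by term."""
    return sum(xi * math.log(m) + (1 - xi) * math.log(1 - m)
               for xi, m in zip(x, mu[k])) + math.log(priors[k])

def a_linear(k, x):
    """Same quantity rearranged as w_k^T x + w_k0."""
    w = [math.log(m / (1 - m)) for m in mu[k]]
    w0 = sum(math.log(1 - m) for m in mu[k]) + math.log(priors[k])
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def posterior(x):
    """Softmax of the a_k, i.e. p(C_k | x)."""
    a = [a_direct(k, x) for k in range(len(priors))]
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

x = [1, 0, 1]
print(a_direct(0, x), a_linear(0, x), posterior(x))
```

The weights are the log odds of each feature being 1 within the class, which is the usual form of a naive Bayes classifier for binary features.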

------------

# Exponential family

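The general exponential family form is $p(\mathbf{x}|\mathbf{\lambda}_k)=h(\mathbf{x})g(\mathbf{\lambda}_k)\exp\{\mathbf{\lambda}_k^T\mathbf{u}(\mathbf{x})\}$. With the restriction $\mathbf{u}(\mathbf{x})=\mathbf{x}$ and the introduction of a scaling parameter $s$, we obtain the restricted set of exponential family class-conditional densities of the form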

$$p(\mathbf{x}|\mathbf{\lambda}_k, s)=\frac{1}{s}h\left(\frac{1}{s}\mathbf{x}\right)g(\mathbf{\lambda}_k)exp\left\{\frac{1}{s}\mathbf{\lambda}_k^T\mathbf{x}\right\} \tag{4.84}$$

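Substituting into (4.63) and dropping the terms $-\ln s+\ln h(\mathbf{x}/s)$, which are the same for every class and therefore cancel in the softmax, gives

$$a_k(\mathbf{x}) = \frac{1}{s}\mathbf{\lambda}_k^T\mathbf{x}+\ln g(\mathbf{\lambda}_k)+\ln p(C_k) \tag{4.86}$$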

which again are linear functions of the input vector $\mathbf{x}$.

The two-class case can likewise be verified to give a linear function by substituting into (4.57).
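As one concrete member of this family, a Bernoulli variable $x\in\{0,1\}$ with mean $\mu_k$ can be written with $u(x)=x$, $s=1$, $h(x)=1$, $\lambda_k=\ln\frac{\mu_k}{1-\mu_k}$ and $g(\lambda_k)=1-\mu_k=\frac{1}{1+\exp(\lambda_k)}$. The sketch below uses this choice, with hypothetical means and priors, to check that the softmax of the linear activations matches Bayes' theorem.

```python
import math

# Hypothetical Bernoulli class-conditionals p(x|C_k) = mu_k^x (1 - mu_k)^(1 - x), x in {0, 1}.
mus = [0.8, 0.3, 0.5]
priors = [0.5, 0.3, 0.2]

# Exponential-family form with u(x) = x, s = 1, h(x) = 1:
#   lambda_k = ln(mu_k / (1 - mu_k)),  g(lambda_k) = 1 - mu_k = 1 / (1 + exp(lambda_k))
lams = [math.log(m / (1 - m)) for m in mus]

def g(lam):
    return 1.0 / (1.0 + math.exp(lam))

def a(k, x):
    """(4.86) with s = 1: linear in x."""
    return lams[k] * x + math.log(g(lams[k])) + math.log(priors[k])

def posterior_softmax(x):
    acts = [a(k, x) for k in range(len(mus))]
    m = max(acts)
    exps = [math.exp(v - m) for v in acts]
    s = sum(exps)
    return [e / s for e in exps]

def posterior_bayes(x):
    joint = [p * (m ** x) * ((1 - m) ** (1 - x)) for m, p in zip(mus, priors)]
    s = sum(joint)
    return [j / s for j in joint]

print(posterior_softmax(1), posterior_bayes(1))
```

The same check works for any density that fits (4.84); only `lams` and `g` would change.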