## Binary Logistic Regression (Discriminative)

### Negative Log-Likelihood

The hypothesis and assumed probabilities are

$$
h_\theta(x)\equiv g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}
$$

$$\begin{gather}
P(y^{(i)}=1\mid x^{(i)};\theta) = h_\theta(x^{(i)})\\\\
P(y^{(i)}=0\mid x^{(i)};\theta) = 1-h_\theta(x^{(i)})\\\\
\end{gather}$$

More compactly:

$$
p(y^{(i)}\mid x^{(i)};\theta) = \big(h_\theta(x^{(i)})\big)^{y^{(i)}}\big(1-h_\theta(x^{(i)})\big)^{1-y^{(i)}}
$$

The likelihood function is

$$
\mathscr{L}(\theta)=p(y\mid X;\theta)=\prod_{i=1}^{m}p(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^{m}\big(h_\theta(x^{(i)})\big)^{y^{(i)}}\big(1-h_\theta(x^{(i)})\big)^{1-y^{(i)}}
$$

The log-likelihood function is

$$\begin{align*}
\ell(\theta)=\log[\mathscr{L}(\theta)]&=\sum_{i=1}^{m}\log\big[\big(h_\theta(x^{(i)})\big)^{y^{(i)}}\big(1-h_\theta(x^{(i)})\big)^{1-y^{(i)}}\big]\\
    &=\sum_{i=1}^{m}\log\big[\big(h_\theta(x^{(i)})\big)^{y^{(i)}}\big]+\log\big[\big(1-h_\theta(x^{(i)})\big)^{1-y^{(i)}}\big]\\
    &=\sum_{i=1}^{m}y^{(i)}\log\big(h_\theta(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)
\end{align*}$$

The __negative-log-likelihood__ objective function is

$$\begin{align*}
J(\theta)\equiv-\frac{1}{m}\ell(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_\theta(x^{(i)})\big)+(1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big]\\
    &=\frac{1}{m}\sum_{i=1}^{m}\Big[\log\big(1+e^{\theta^Tx^{(i)}}\big)-y^{(i)}\theta^Tx^{(i)}\Big]
\end{align*}$$

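The last equality follows from $\log h_\theta(x)=\theta^Tx-\log(1+e^{\theta^Tx})$ and $\log\big(1-h_\theta(x)\big)=-\log(1+e^{\theta^Tx})$. The scaling factor $\frac{1}{m}$ is optional: it rescales $J$ but does not change the minimizing $\theta$.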
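As a numerical check on the two forms of $J(\theta)$, here is a minimal sketch in plain Python. The helper names (`sigmoid`, `softplus`, `nll_cross_entropy`, `nll_softplus`) and the toy data are assumptions made for illustration; they do not come from the original notebook.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def softplus(z):
    """log(1 + e^z), computed stably for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dot(theta, x):
    """Inner product theta^T x for two equal-length lists."""
    return sum(t * xi for t, xi in zip(theta, x))

def nll_cross_entropy(theta, X, y):
    """J(theta) as -(1/m) * sum[y log h + (1 - y) log(1 - h)], labels in {0, 1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(dot(theta, xi))
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / len(X)

def nll_softplus(theta, X, y):
    """J(theta) as (1/m) * sum[log(1 + e^{theta^T x}) - y theta^T x]."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = dot(theta, xi)
        total += softplus(z) - yi * z
    return total / len(X)

# Hypothetical toy data: the first feature is a constant 1 (intercept).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]
theta = [0.1, 0.8]

print(nll_cross_entropy(theta, X, y))
print(nll_softplus(theta, X, y))
```

The two printed values should agree up to floating-point error. The second form is usually the safer one to implement, since it never takes the logarithm of a probability that has rounded to zero.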

### Logistic Likelihood

Define $g$ to be the sigmoid function:

$$
g(z)\equiv\frac{1}{1+e^{-z}}
$$

In the notebook Binary-Logistic-Regression, we showed that the sigmoid is a cumulative distribution function and its density is symmetric about zero. This makes the sigmoid a good candidate to define a probability model for binary classification.

Let $\mathscr{Y}$ be a discrete random variable with values in $\{-1,1\}$, and let $\mathscr{X}=[\mathscr{X}_1,...,\mathscr{X}_n]^{T}$ be a continuous random vector.

Define the __logistic model__ for classification as

$$
P(\mathscr{Y}=y^{(i)}\mid\mathscr{X}=x^{(i)};\theta)=g(y^{(i)}\theta^Tx^{(i)})=\frac{1}{1+e^{-y^{(i)}\theta^Tx^{(i)}}}
$$

For interpretation, note that if the margin $y^{(i)}\theta^Tx^{(i)}$ is large (bigger than, say, 5 or so), then $P(\mathscr{Y} = y^{(i)}\mid\mathscr{X}=x^{(i)};\theta) = g(y^{(i)}\theta^Tx^{(i)})\approx1$. That is, we assign probability close to 1 to the event that the label is $y^{(i)}$. Conversely, if the margin $y^{(i)}\theta^Tx^{(i)}$ is very negative, then $P(\mathscr{Y} = y^{(i)}\mid\mathscr{X}=x^{(i)};\theta)\approx0$.
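To make the role of the margin concrete, the following short sketch evaluates the model at a margin of $+5$ and $-5$. The function name `label_probability` and the values of `theta` and `x` are hypothetical, chosen only so that $\theta^Tx=5$.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def label_probability(y, theta, x):
    """P(Y = y | X = x; theta) = g(y * theta^T x) for a label y in {-1, +1}."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(margin)

# Hypothetical parameters and input with theta^T x = 1*1 + 2*2 = 5.
theta = [1.0, 2.0]
x = [1.0, 2.0]

p_pos = label_probability(+1, theta, x)  # margin +5, probability near 1
p_neg = label_probability(-1, theta, x)  # margin -5, probability near 0
print(p_pos, p_neg, p_pos + p_neg)
```

Because $g(z)+g(-z)=1$, the probabilities of the two labels sum to one, so the model is a valid distribution over $\{-1,1\}$ for every $x$.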

Define the hypothesis function $h_\theta$ as

$$
h_\theta(x)\equiv g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}
$$

Given a training set $\{X,y\}=\{x^{(i)},y^{(i)}\}_{i=1}^{m}$ of independent observations from $(\mathscr{X},\mathscr{Y})$, we define the likelihood as

$$\begin{align*}
\mathscr{L}(\theta) &= p(y\mid X;\theta) \\
    &= P(\mathscr{Y}=y^{(1)}\mid\mathscr{X}=x^{(1)},...,\mathscr{Y}=y^{(m)}\mid\mathscr{X}=x^{(m)};\theta) \\
    &= \prod_{i=1}^{m}P(\mathscr{Y}=y^{(i)}\mid\mathscr{X}=x^{(i)};\theta)\tag{Independent observations} \\
    &= \prod_{i=1}^{m}g(y^{(i)}\theta^Tx^{(i)})\tag{LLG.0}
\end{align*}$$

Then our log-likelihood is

$$
\ell(\theta)=\sum_{i=1}^{m}\log\big(g(y^{(i)}\theta^Tx^{(i)})\big)
$$

Let $\phi$ be the __logistic loss__ defined as

$$
\phi(z)\equiv\log(1+e^{-z})=\log\Big(\Big[\frac{1}{1+e^{-z}}\Big]^{-1}\Big)=\log\big(g(z)^{-1}\big)=-\log\big(g(z)\big)
$$

The __logistic objective__ function (aka __logistic risk__) is

$$
J(\theta)\equiv\frac{1}{m}\sum_{i=1}^{m}\phi\big(y^{(i)}\theta^Tx^{(i)}\big)=-\frac{1}{m}\sum_{i=1}^{m}\log\big(g(y^{(i)}\theta^Tx^{(i)})\big)=-\frac{1}{m}\ell(\theta)
$$
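The logistic risk with labels in $\{-1,1\}$ and the negative log-likelihood with labels in $\{0,1\}$ are the same objective under the relabeling $y_{01}=(y_{\pm}+1)/2$. The sketch below checks this numerically; the function names and toy data are illustrative assumptions, not part of the original notebook.

```python
import math

def logistic_loss(z):
    """phi(z) = log(1 + e^{-z}), computed stably for large |z|."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def dot(theta, x):
    """Inner product theta^T x for two equal-length lists."""
    return sum(t * xi for t, xi in zip(theta, x))

def logistic_risk(theta, X, y_pm):
    """(1/m) * sum phi(y * theta^T x), labels in {-1, +1}."""
    return sum(logistic_loss(yi * dot(theta, xi)) for xi, yi in zip(X, y_pm)) / len(X)

def nll_01(theta, X, y01):
    """(1/m) * sum[log(1 + e^{theta^T x}) - y theta^T x], labels in {0, 1}."""
    total = 0.0
    for xi, yi in zip(X, y01):
        z = dot(theta, xi)
        total += logistic_loss(-z) - yi * z  # logistic_loss(-z) = log(1 + e^z)
    return total / len(X)

# Hypothetical toy data with an intercept feature.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
y_pm = [1, -1, 1, -1]
y01 = [(yi + 1) // 2 for yi in y_pm]  # map -1 -> 0 and +1 -> 1
theta = [0.1, 0.8]

print(logistic_risk(theta, X, y_pm))
print(nll_01(theta, X, y01))
```

The agreement follows term by term: for $y=1$, $\phi(z)=\log(1+e^{z})-z$, and for $y=-1$, $\phi(-z)=\log(1+e^{z})$.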

## Gaussian Discriminant Analysis (Generative)

Probability assumptions:

$$\begin{align*} 
p(x\mid y=1) &= \frac{1}{(2\pi)^{n/2} \lvert \Sigma\rvert^{1/2}} \mathrm{exp}\bigg(-\frac{1}{2}(x - \mu_{1})^T \Sigma^{-1} (x - \mu_{1}) \bigg) \\
p(x\mid y=-1) &= \frac{1}{(2\pi)^{n/2} \lvert \Sigma\rvert^{1/2}} \mathrm{exp}\bigg(-\frac{1}{2}(x - \mu_{-1})^T \Sigma^{-1} (x - \mu_{-1}) \bigg) \\
\end{align*}$$

and

$$\begin{align*} 
p(y) &= \begin{cases}
  \phi          & \text{if} \; y = 1 \\
  1 - \phi     & \text{if} \; y = -1
\end{cases} \\\\
   &=\phi^{1\{y=1\}}(1-\phi)^{1\{y=-1\}}
\end{align*}$$

Taking the natural logarithm gives

$$\begin{align*} 
\ln(p(y))&=\ln(\phi^{1\{y=1\}}(1-\phi)^{1\{y=-1\}})\\
    &=\ln(\phi^{1\{y=1\}}) + \ln((1-\phi)^{1\{y=-1\}})\\
    &={1\{y=1\}}\ln(\phi) + {1\{y=-1\}}\ln((1-\phi))
\end{align*}$$

Hence the log-likelihood is

$$\begin{align*} 
\ell(\phi,\mu_{-1},\mu_1,\Sigma) &= \mathrm{ln}\prod_{i=1}^mp(x^{(i)},y^{(i)};\phi,\mu_{-1},\mu_1,\Sigma) \\
     &= \mathrm{ln}\prod_{i=1}^mp(x^{(i)}\mid y^{(i)};\phi,\mu_{-1},\mu_1,\Sigma) p(y^{(i)};\phi)\\
     &= \sum_{i=1}^m\mathrm{ln}p(x^{(i)}\mid y^{(i)};\phi,\mu_{-1},\mu_1,\Sigma)+ \sum_{i=1}^m\mathrm{ln}p(y^{(i)};\phi)\\
     &\approxeq \sum_{i=1}^m\big[\frac{1}{2}\mathrm{ln}\frac{1}{\lvert\Sigma\rvert}-\frac{1}{2}(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}})+{1\{y^{(i)}=1\}}\ln(\phi) + {1\{y^{(i)}=-1\}}\ln(1-\phi)\big]
\end{align*}$$

In the last line, $\approxeq$ indicates that we have discarded the constant term $\ln\big(\frac{1}{(2\pi)^{n/2}}\big)$ from each summand. It does not depend on any of the parameters, so it has no effect on where the maximum occurs.
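The log-likelihood above can be evaluated directly. The following is a minimal sketch in plain Python, assuming a symmetric positive-definite $\Sigma$ and using a hand-written Cholesky factorization to get $\ln\lvert\Sigma\rvert$ and the quadratic form. All function names, the `keep_constant` flag, and the example values are hypothetical.

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S, for a symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L

def mahalanobis_sq(L, d):
    """d^T Sigma^{-1} d via forward substitution: solve L z = d, return z . z."""
    z = []
    for i in range(len(d)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((d[i] - s) / L[i][i])
    return sum(v * v for v in z)

def gda_log_likelihood(X, y, phi, mu_neg, mu_pos, Sigma, keep_constant=False):
    """Joint log-likelihood for labels in {-1, +1} with a shared covariance.

    With keep_constant=False the -(n/2) ln(2 pi) term is dropped, as in the text.
    """
    n = len(Sigma)
    L = cholesky(Sigma)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))  # ln|Sigma|
    const = -0.5 * n * math.log(2.0 * math.pi) if keep_constant else 0.0
    total = 0.0
    for xi, yi in zip(X, y):
        mu = mu_pos if yi == 1 else mu_neg
        d = [a - b for a, b in zip(xi, mu)]
        total += const - 0.5 * log_det - 0.5 * mahalanobis_sq(L, d)
        total += math.log(phi) if yi == 1 else math.log(1.0 - phi)
    return total

# Hypothetical two-dimensional example.
Sigma = [[2.0, 1.0], [1.0, 2.0]]
X = [[1.0, 0.0], [2.0, 1.5]]
y = [-1, 1]
ll = gda_log_likelihood(X, y, 0.3, [0.0, 0.0], [2.0, 1.0], Sigma)
ll_full = gda_log_likelihood(X, y, 0.3, [0.0, 0.0], [2.0, 1.0], Sigma, keep_constant=True)
print(ll, ll_full)
```

The two values differ by exactly $m\cdot\frac{n}{2}\ln(2\pi)$, which is why dropping the constant leaves the maximizing parameters unchanged.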

## Naive Bayes (Generative)

### Bernoulli Naive Bayes

Let $y$ be a discrete random variable with values in $\{0,1\}$. Define $\phi_y\equiv \hat{P}(y=1)=\hat{p}_{y}(1)$.

Let $x=[x_1,...,x_n]^{T}$ be a discrete random vector with values in $\{0,1\}^{n}$, where $n$ is the number of features. That is, an observation from $x$ is a column vector in $\mathbb{R}^n$ whose elements are $0$ or $1$. Define

$$\begin{gather}
\phi_{j\mid y=1}\equiv\hat{P}(x_j=1\mid y=1)=\hat{p}_{x_j\mid y}(1\mid 1)\\\\
\phi_{j\mid y=0}\equiv\hat{P}(x_j=1\mid y=0)=\hat{p}_{x_j\mid y}(1\mid 0)
\end{gather}$$

Given a training set $\{x^{(i)},y^{(i)}\}_{i=1}^{m}$ in which each $(x^{(i)},y^{(i)})$ is an independent observation of the joint random variable $(x,y)$, the joint likelihood of the training data is

$$\begin{align*} 
\mathscr{L}\big(\phi_{j\mid y=1},\phi_{j\mid y=0},\phi_y;\{x^{(i)},y^{(i)}\}_{i=1}^{m}\big) &= P\big((x,y)=(x^{(1)},y^{(1)}),...,(x,y)=(x^{(m)},y^{(m)});\phi_{j\mid y=1},\phi_{j\mid y=0},\phi_y\big)\\\\
    &= \prod_{i=1}^{m}P\big((x,y)=(x^{(i)},y^{(i)});\phi_{j\mid y=1},\phi_{j\mid y=0},\phi_y\big)\tag{SCN.1}\\\\
    &= \prod_{i=1}^{m}p_{x,y}\big(x^{(i)},y^{(i)};\phi_{j\mid y=1},\phi_{j\mid y=0},\phi_y\big)\\\\
    &= \prod_{i=1}^{m}p_{x\mid y}\big(x^{(i)}\mid y^{(i)};\phi_{j\mid y=1},\phi_{j\mid y=0}\big)p_{y}\big(y^{(i)};\phi_y\big)\tag{SCN.2}\\\\
\end{align*}$$

SCN.1 follows from the assumption that the observations are independent, and SCN.2 from the definition of conditional probability. The log-likelihood is

$$\begin{align*} 
\ell\big(\phi_{j\mid y=1},\phi_{j\mid y=0},\phi_y\big) &= \sum_{i=1}^m\mathrm{ln}\big[p_{x\mid y}(x^{(i)}\mid y^{(i)};\phi_{j\mid y=1},\phi_{j\mid y=0})\big]+ \sum_{i=1}^m\mathrm{ln}\big[p_{y}(y^{(i)};\phi_y)\big]\\\\
     &= \sum_{i=1}^m\mathrm{ln}\Big[\prod_{j=1}^{n}p_{x_{j}\mid y}(x_j^{(i)}\mid y^{(i)};\phi_{j\mid y=1},\phi_{j\mid y=0})\Big]+ \sum_{i=1}^m\mathrm{ln}\big[p_{y}(y^{(i)};\phi_y)\big]\tag{SCN.3}\\\\
     &= \sum_{i=1}^m\sum_{j=1}^{n}\mathrm{ln}\big[p_{x_{j}\mid y}(x_j^{(i)}\mid y^{(i)};\phi_{j\mid y=1},\phi_{j\mid y=0})\big]+ \sum_{i=1}^m\mathrm{ln}\big[p_{y}(y^{(i)};\phi_y)\big] \tag{SCN.4}\\\\
     &= \sum_{i=1}^m\sum_{j=1}^{n}\big\{\boldsymbol{1}\{y^{(i)}=1\wedge x_j^{(i)}=1\}\mathrm{ln}(\phi_{j\mid y=1})+\boldsymbol{1}\{y^{(i)}=0\wedge x_j^{(i)}=1\}\mathrm{ln}(\phi_{j\mid y=0}) \\
     &+ \boldsymbol{1}\{y^{(i)}=1\wedge x_j^{(i)}=0\}\mathrm{ln}(1-\phi_{j\mid y=1})+\boldsymbol{1}\{y^{(i)}=0\wedge x_j^{(i)}=0\}\mathrm{ln}(1-\phi_{j\mid y=0})\big\} \tag{SCN.5} \\
     &+ \sum_{i=1}^m\big\{\boldsymbol{1}\{y^{(i)}=1\}\mathrm{ln}(\phi_y)+\boldsymbol{1}\{y^{(i)}=0\}\mathrm{ln}(1-\phi_y)\big\}\\\\
\end{align*}$$

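SCN.3 follows from the Naive Bayes assumption: $x_1^{(i)},...,x_n^{(i)}$ are conditionally independent given $y^{(i)}$. Intuitively, once we know whether an email is spam or not, the appearances of different words are assumed to be independent of one another. The assumption is called naive because it rarely holds in practice. SCN.4 holds because the log of a product is the sum of the logs. SCN.5 writes out each Bernoulli factor with indicator functions: for each pair $(i,j)$ exactly one of the four indicators equals 1, and it selects the log-probability of the observed value of $x_j^{(i)}$ given the observed class $y^{(i)}$.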
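The final expression for the log-likelihood translates almost line for line into code. The sketch below is one possible reading in plain Python, with labels and features in $\{0,1\}$. The function names and toy data are hypothetical. The `frequency_estimates` helper relies on the standard result, not derived in this section, that relative frequencies maximize this log-likelihood; it applies no smoothing, so it assumes every estimate lies strictly between 0 and 1.

```python
import math

def bernoulli_nb_log_likelihood(X, y, phi_y, phi_j1, phi_j0):
    """Joint log-likelihood; phi_j1[j] = P(x_j=1 | y=1), phi_j0[j] = P(x_j=1 | y=0)."""
    total = 0.0
    for xi, yi in zip(X, y):
        phis = phi_j1 if yi == 1 else phi_j0  # pick parameters for the observed class
        for xij, p in zip(xi, phis):
            # Exactly one of the four indicators in SCN.5 is active per (i, j).
            total += math.log(p) if xij == 1 else math.log(1.0 - p)
        total += math.log(phi_y) if yi == 1 else math.log(1.0 - phi_y)
    return total

def frequency_estimates(X, y):
    """Relative-frequency estimates of phi_y, phi_{j|y=1}, phi_{j|y=0} (no smoothing)."""
    m, n = len(y), len(X[0])
    m1 = sum(y)   # number of examples with y = 1
    m0 = m - m1   # number of examples with y = 0
    phi_y = m1 / m
    phi_j1 = [sum(xi[j] for xi, yi in zip(X, y) if yi == 1) / m1 for j in range(n)]
    phi_j0 = [sum(xi[j] for xi, yi in zip(X, y) if yi == 0) / m0 for j in range(n)]
    return phi_y, phi_j1, phi_j0

# Hypothetical toy data: six examples, two binary features.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0]

phi_y, phi_j1, phi_j0 = frequency_estimates(X, y)
ll_hat = bernoulli_nb_log_likelihood(X, y, phi_y, phi_j1, phi_j0)
ll_half = bernoulli_nb_log_likelihood(X, y, 0.5, [0.5, 0.5], [0.5, 0.5])
print(phi_y, phi_j1, phi_j0)
print(ll_hat, ll_half)
```

On this data the frequency-based parameters give a higher log-likelihood than setting every parameter to 0.5, as expected of a maximizer.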

### Multinomial Naive Bayes