## Chapter 9 Linear two-class classification

# 9.7  The probabilistic view of logistic regression

We saw in Section 8.2 that the Least Squares solution to linear regression can be re-derived via a maximum likelihood approach. The same can be done here for logistic regression, leading to the *cross-entropy* cost. This cost, as we also show here, is completely equivalent to the softmax cost used in previous Sections.

Note that here we explicitly split the entire weight vector into two components: the bias $w_0$ and the remainder of the weights, the normal vector, which we denote by $\mathbf{w}$.

In [1]:
# This code cell will not be shown in the HTML version of this notebook
# import custom library
import sys
sys.path.append('../../')
from mlrefined_libraries import superlearn_library as superlearn
from mlrefined_libraries import math_optimization_library as optlib
datapath = '../../mlrefined_datasets/superlearn_datasets/'

# demos for this notebook
regress_plotter = superlearn.lin_regression_demos
optimizers = optlib.optimizers
static_plotter = optlib.static_plotter.Visualizer();

# import autograd functionality to build functions properly for optimizers
import autograd.numpy as np

# import timer
from datetime import datetime 

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

## 9.7.1  Assumption 1: geometric relationship

## 9.7.2  Assumption 2: noise distribution

## 9.7.3  Assumption 3: statistical independence

As with linear regression, we again make the assumption that the datapoints are statistically independent of each other, so that the likelihood can be written as 

\begin{equation}
{\cal L}=\prod_{p=1}^{P}{\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)
\end{equation}  

where ${\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)$ is the probability of the $p^{th}$ datapoint $\mathbf{x}_p$ having the label $y_p$ when the separating hyperplane is given by its defining parameters $w_0$ and $\mathbf{w}$. As pointed out previously in Subsection 1.1, where we chose $y_p \in\left\{-1,+1\right\}$, the class labels can take arbitrary (but distinct) values. With the probabilistic framework it is more convenient to replace $-1$ labels with $0$, so that $y_p \in\left\{0,1\right\}$ and we can write each product term in the likelihood compactly as


\begin{equation}
{\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)={\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)^{y_p}\,{\cal P}\left(y=0\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)^{1-y_p}
\end{equation} 


Since $y_p$ takes only one of two values, we have that 

\begin{equation}
{\cal P}\left(y=0\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)= 1-{\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)
\end{equation} 

giving the likelihood as 

\begin{equation}
{\cal L}=\prod_{p=1}^{P}{\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)^{y_p}\left[1-{\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)\right]^{1-y_p}
\end{equation}

which is now written only in terms of ${\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)$.  
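As a small illustration of this compact product form, the sketch below evaluates the likelihood for a handful of points. The function name `likelihood` and the probability values are made up for illustration; only the formula itself comes from the discussion above.

```python
import math

def likelihood(prob_pos, labels):
    """Likelihood of labels in {0, 1}, given P(y=1 | x_p) for each point."""
    total = 1.0
    for p1, y in zip(prob_pos, labels):
        # p1**y * (1 - p1)**(1 - y) picks p1 when y = 1 and (1 - p1) when y = 0
        total *= (p1 ** y) * ((1.0 - p1) ** (1 - y))
    return total

# hypothetical values of P(y=1 | x_p) for three points, with their labels
prob_pos = [0.9, 0.2, 0.6]
labels = [1, 0, 1]

L = likelihood(prob_pos, labels)

# the same quantity written out term by term
L_by_hand = 0.9 * (1.0 - 0.2) * 0.6
```

Because each exponent is either 0 or 1, every factor reduces to exactly one of the two class probabilities, which is why the product depends only on ${\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)$.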

Before proposing a mathematical form for this probability and attempting to maximize the resulting likelihood, recall from our geometric discussion of logistic regression in previous subsections that the separating hyperplane $w_0+\mathbf{x}^T\mathbf{w}=0$ divides the input space of the problem into two half-spaces with the one characterized by $w_0+\mathbf{x}^T\mathbf{w}>0$ encompassing the $y=1$ class, and the one characterized by $w_0+\mathbf{x}^T\mathbf{w}<0$ encompassing the $y=0$ (previously $y=-1$) class.

Thinking probabilistically, this means that a point $\mathbf{x}_p$ belongs to class $y=1$ with probability 1 when $w_0+\mathbf{x}_p^T\mathbf{w}>0$, and to class $y=0$ with probability 1 when $w_0+\mathbf{x}_p^T\mathbf{w}<0$. Focusing only on the positive class, we can then write

\begin{equation}
{\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)=\begin{cases}
1 \,\,\,\,\,\text{if} \,\, w_0+\mathbf{x}_p^T\mathbf{w}>0 \\
0 \,\,\,\,\,\text{else} \\
\end{cases}
\end{equation} 

This probability expression can itself be thought of as a step function - with steps of $1$ and $0$ - jumping from one step to the other when $w_0+\mathbf{x}_p^T\mathbf{w}$ changes sign.    

Although seemingly a proper probability, the expression in equation (29) cannot be used for any dataset that is not perfectly separable by a hyperplane: for such a dataset any (inevitable) misclassification sends the entire likelihood in equation (28) to zero. Using the logistic sigmoid function $\sigma(\cdot)$ as a strictly positive approximation to the 0/1 step function, we can resolve this issue by writing    

\begin{equation}
{\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0,\mathbf{w}\right)=\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)
\end{equation} 

which gives the likelihood as 

\begin{equation}
{\cal L}=\prod_{p=1}^{P}\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)^{y_p}\left[1-\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)\right]^{1-y_p}
\end{equation}

and the corresponding negative log-likelihood as 


\begin{equation}
g\left(w_0,\mathbf{w}\right)=-\sum_{p=1}^{P}\left[y_p\,\text{log}\,\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)+\left(1-y_p\right)\text{log}\,\left(1-\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)\right)\right]
\end{equation}  

The cost function in equation (32) is typically referred to as *the cross-entropy cost*.
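The following is a minimal sketch, using only the Python standard library, of the two likelihoods discussed above. It contrasts the step-function likelihood, which collapses to zero as soon as one point is misclassified, with the cross-entropy cost built on the sigmoid. The helper names (`model`, `step_likelihood`, `cross_entropy`), the toy dataset, and the parameter values are all assumptions made for illustration and are not part of the notebook's libraries.

```python
import math

def sigmoid(s):
    """Logistic sigmoid, written to avoid overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def model(x, w0, w):
    """Evaluate the hyperplane w0 + x^T w at a single point x."""
    return w0 + sum(xi * wi for xi, wi in zip(x, w))

def step_likelihood(X, y, w0, w):
    """Likelihood when P(y=1 | x) is the 0/1 step function."""
    L = 1.0
    for x, yp in zip(X, y):
        p1 = 1.0 if model(x, w0, w) > 0 else 0.0
        L *= (p1 ** yp) * ((1.0 - p1) ** (1 - yp))
    return L

def cross_entropy(X, y, w0, w):
    """Negative log-likelihood with P(y=1 | x) = sigmoid(w0 + x^T w)."""
    cost = 0.0
    for x, yp in zip(X, y):
        p1 = sigmoid(model(x, w0, w))
        cost -= yp * math.log(p1) + (1 - yp) * math.log(1.0 - p1)
    return cost

# toy one-dimensional dataset (made up); the point x = 0.5 has label 0 but
# lies on the positive side of the hyperplane below, so it is misclassified
X = [[-2.0], [-1.0], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1]
w0, w = 0.0, [1.0]

L_step = step_likelihood(X, y, w0, w)   # one wrong point zeroes the product
g = cross_entropy(X, y, w0, w)          # finite and positive
L_sigmoid = math.exp(-g)                # sigmoid likelihood, strictly positive
```

In this sketch the misclassified point only adds a finite penalty to the cross-entropy cost, so the cost remains usable for comparing different choices of $w_0$ and $\mathbf{w}$.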

## 9.7.4  The equivalency of the cross-entropy and softmax costs

Here we show the equivalency of the cross-entropy and softmax costs. Geometrically speaking, both costs arise from the same goal, the only difference being the particular label values chosen in each instance ($\{0,1\}$ with cross-entropy, $\{-1,+1\}$ with softmax). Both then use a sigmoidal function to perform the classification, so it is not too surprising that the two costs are equivalent.  

We can show equivalency between the two cost functions in two cases (addressing one class at a time):

- the cross-entropy cost at a point with label $y_p = 0$ equals the softmax cost at the same point using label $y_p = -1$


- the cross-entropy cost at a point with label $y_p = +1$ equals the softmax cost at the same point using label $y_p = +1$

We will show the equivalency of the first case, with the second case following in completely the same way.

To begin suppose we have a point $\left(\mathbf{x}_p,y_p\right)$ with label $y_p=0$. The cross-entropy term for $\mathbf{x}_p$ is   

\begin{equation}
-\text{log}\,\left(1-\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)\right).
\end{equation}  

Using the definition of $\sigma\left(s\right)=\frac{1}{1+e^{-s}}$  this term can be rewritten equivalently as

\begin{equation}
-\text{log}\,\left(1-\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)\right) = -\text{log}\,\left(\frac{e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}}{1 + e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}}\right).
\end{equation}

Next using the property $-\text{log}(s) = \text{log}\left(\frac{1}{s}\right)$ and that $\frac{1}{e^{-s}} = e^{s}$ the above is equal to

\begin{equation}
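-\text{log}\,\left(\frac{e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}}{1 + e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}}\right) = \text{log}\,\left(\frac{1 + e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}}{ e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}}\right) = \text{log}\,\left(1 + e^{\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}\right).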
\end{equation}


Notice that this is precisely the softmax term for the datapoint $\left(\mathbf{x}_p,y_p\right)$ if instead of using the label value $y_p = 0$ we used the label value $y_p = -1$ since such a term takes the form

\begin{equation}
\text{log}\,\left(1+e^{-y_p\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}\right) = \text{log}\,\left(1+e^{\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}\right).
\end{equation}  
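For completeness, the second case can be sketched in the same way. Suppose the point $\left(\mathbf{x}_p,y_p\right)$ has label $y_p=+1$, so that its cross-entropy term is $-\text{log}\,\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)$. Using the definition of the sigmoid and the property $-\text{log}(s) = \text{log}\left(\frac{1}{s}\right)$ we have

\begin{equation}
-\text{log}\,\sigma\left(w_0+\mathbf{x}_p^T\mathbf{w}\right) = -\text{log}\,\left(\frac{1}{1 + e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}}\right) = \text{log}\,\left(1 + e^{-\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}\right)
\end{equation}

which is the softmax term $\text{log}\,\left(1+e^{-y_p\left(w_0+\mathbf{x}_p^T\mathbf{w}\right)}\right)$ evaluated with the label value $y_p=+1$.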

> The maximum likelihood approach to logistic regression leads to the cross-entropy cost, which is equivalent to the softmax cost (after simple reassignment of class labels).   
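The equivalence can also be checked numerically. The sketch below compares the per-point cross-entropy term (labels in $\{0,1\}$) with the per-point softmax term (labels in $\{-1,+1\}$) at a few arbitrary values of the evaluation $s = w_0+\mathbf{x}_p^T\mathbf{w}$. The function names and the test values are assumptions chosen for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def cross_entropy_term(s, y01):
    """Cross-entropy term for model evaluation s and label y01 in {0, 1}."""
    return -(y01 * math.log(sigmoid(s)) + (1 - y01) * math.log(1.0 - sigmoid(s)))

def softmax_term(s, ypm):
    """Softmax term log(1 + exp(-y s)) for label ypm in {-1, +1}."""
    return math.log(1.0 + math.exp(-ypm * s))

# a few arbitrary values of s = w0 + x^T w (made up for the check)
evaluations = [-3.0, -0.5, 0.0, 0.7, 2.5]

max_gap = 0.0
for s in evaluations:
    for y01 in (0, 1):
        ypm = 2 * y01 - 1   # relabel: 0 -> -1, 1 -> +1
        gap = abs(cross_entropy_term(s, y01) - softmax_term(s, ypm))
        max_gap = max(max_gap, gap)
```

Any remaining difference in `max_gap` is due to floating-point rounding only, since the two expressions are algebraically identical.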

## 9.7.5  Probability scoring

The probabilistic perspective on logistic regression, by design, enables us to assign a probability score to each point in our dataset once the optimal hyperplane parameters $w_0^{\star}$ and $\mathbf{w}^{\star}$ are learned. Specifically, the probability of the point $\mathbf{x}_p$ belonging to class $y=1$ is given by

\begin{equation}
{\cal P}\left(y=1\,|\,\mathbf{x}_{p},w_0^{\star},\mathbf{w}^{\star}\right)=\sigma\left(w_0^{\star}+\mathbf{x}_p^T\mathbf{w}^{\star}\right)
\end{equation} 

This probability score can be interpreted as how 'confident' our classifier is in predicting a label of $y=1$ for $\mathbf{x}_p$. If $w_0^{\star}+\mathbf{x}_p^T\mathbf{w}^{\star}>0$, the larger this evaluation gets the closer the probability gets to a value of 1, and hence the more confident we are of $\mathbf{x}_p$ belonging to class $y=1$. On the other hand, when $w_0^{\star}+\mathbf{x}_p^T\mathbf{w}^{\star}<0$ the probability of $y=1$ falls below $50\%$ and as the evaluation gets smaller the probability approaches zero, while at the same time its complement - the probability of $y=0$ - gets closer to 1.       
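A minimal sketch of such scoring is given below. The parameter values standing in for $w_0^{\star}$ and $\mathbf{w}^{\star}$ are hypothetical (they are not learned from any dataset in this Section), and the helper name `score` is an assumption made for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def score(x, w0, w):
    """Return (P(y=1 | x), predicted label, confidence in that label)."""
    s = w0 + sum(xi * wi for xi, wi in zip(x, w))
    p1 = sigmoid(s)
    # a tie at exactly 0.5 is assigned to class 1 here (an arbitrary choice)
    label = 1 if p1 >= 0.5 else 0
    confidence = p1 if label == 1 else 1.0 - p1
    return p1, label, confidence

# hypothetical learned parameters (not taken from the text)
w0_star, w_star = -1.0, [2.0, 1.0]

p_far, label_far, conf_far = score([3.0, 2.0], w0_star, w_star)      # evaluation  7
p_near, label_near, conf_near = score([0.5, 0.5], w0_star, w_star)   # evaluation  0.5
p_neg, label_neg, conf_neg = score([-1.0, 0.0], w0_star, w_star)     # evaluation -3
```

The first two points are both assigned to class $y=1$, but with very different confidence, while the third is assigned to class $y=0$ with high confidence because its probability of $y=1$ is close to zero.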


This notion of confidence - which can be particularly useful when dealing with new datapoints whose labels are not known a priori - is strongly connected to the geometric notion of distance from the separating hyperplane. The figure below shows two datapoints $\mathbf{x}_p$ and $\mathbf{x}_q$, both lying in the positive half-space created by the hyperplane $w_0+\mathbf{x}^T\mathbf{w}=0$, whose distances from the hyperplane can be calculated - using simple algebra - as $\frac{w_{0}+\mathbf{x}_{p}^{T}\mathbf{w}}{\left\Vert \mathbf{w}\right\Vert _{2}}$ and $\frac{w_{0}+\mathbf{x}_{q}^{T}\mathbf{w}}{\left\Vert \mathbf{w}\right\Vert _{2}}$, respectively.      

<figure>
  <img src= '../../mlrefined_images/superlearn_images/distance_to_boundary.png' width="70%" height="80%" alt=""/>
  <figcaption>   
<strong>Figure 4:</strong> <em> Visual representation of the distance to the hyperplane $w_0+\mathbf{x}^T\mathbf{w}=0$, from two points $\mathbf{x}_p$ and $\mathbf{x}_q$ lying above it. </em>  </figcaption> 
</figure>

Ignoring the common denominator $\left\Vert \mathbf{w}\right\Vert _{2}$, the distance from a point to the hyperplane is proportional to its evaluation using the hyperplane, which is then mapped to a number between 0 and 1 via the sigmoid function, producing a valid probability score.  

> The notion of class probability is closely connected to the geometric notion of distance to the decision boundary. The farther a point is from the decision boundary the more confident the classifier becomes about its predicted label. Conversely, as we get closer to the boundary, the classifier loses 'confidence' and the probability scores for each class get closer in value. In the most extreme case where the point lies precisely on the separating hyperplane, the distance is zero, giving a 50-50 probability of the point belonging to either class.
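The relationship between distance and probability can be sketched numerically as follows. The hyperplane parameters and the three points are hypothetical, chosen so that $\left\Vert \mathbf{w}\right\Vert _{2}=5$ and the distances are easy to verify by hand; the helper names are likewise assumptions.

```python
import math

def evaluate(x, w0, w):
    """Evaluate w0 + x^T w at a single point x."""
    return w0 + sum(xi * wi for xi, wi in zip(x, w))

def signed_distance(x, w0, w):
    """Signed distance from x to the hyperplane w0 + x^T w = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return evaluate(x, w0, w) / norm_w

def probability(x, w0, w):
    """P(y=1 | x) obtained by passing the evaluation through the sigmoid."""
    return 1.0 / (1.0 + math.exp(-evaluate(x, w0, w)))

# hypothetical hyperplane and points, chosen so the numbers are easy to check
w0, w = -5.0, [3.0, 4.0]          # ||w||_2 = 5
x_p = [3.0, 4.0]                  # evaluation 20, distance 4
x_q = [1.0, 1.0]                  # evaluation  2, distance 0.4
x_b = [3.0, -1.0]                 # evaluation  0, lies on the hyperplane

d_p, d_q, d_b = (signed_distance(x, w0, w) for x in (x_p, x_q, x_b))
P_p, P_q, P_b = (probability(x, w0, w) for x in (x_p, x_q, x_b))
```

In this example the ordering of the distances matches the ordering of the probabilities, and the point on the hyperplane receives a probability of exactly one half.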

&copy; This material is not to be distributed, copied, or reused without written permission from the authors.