<!-- dom:TITLE: Data Analysis and Machine Learning: Logistic Regression -->
# Data Analysis and Machine Learning: Logistic Regression
<!-- dom:AUTHOR: Morten Hjorth-Jensen at Department of Physics, University of Oslo & Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University -->
<!-- Author: -->  
**Morten Hjorth-Jensen**, Department of Physics, University of Oslo and Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University

Date: **Sep 17, 2018**

Copyright 1999-2018, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license




<!-- !split  -->
## Logistic Regression

So far we have focused on learning from datasets for which there is a
**continuous** output. In linear regression we have been 
concerned with learning the coefficients of a polynomial to predict
the response of a continuous variable $y_i$ on unseen data based on
its independent variables ${\bf x}_i$. 

Classification problems,
however, are concerned with outcomes taking the form of discrete
variables (i.e. categories). For example, we may want to detect if
there's a cat or a dog in an image. Or given a specific system,
we'd like to identify its state, say whether it is ordered or disordered (a typical situation in solid state physics).

**Logistic regression deals with binary, dichotomous outcomes (e.g. True or
False, Success or Failure, etc.). It is worth noting that logistic
regression is also commonly used in modern supervised Deep Learning
models**, as we will see later.


<!-- !split  -->
## Basics

We consider the case where the dependent variables $y_i\in\mathbb{Z}$
are discrete and only take values from $m=0,\dots,M-1$ (i.e. $M$
classes).

The goal is to predict the
output classes from the design matrix $X\in\mathbb{R}^{n\times p}$
made of $n$ samples, each of which bears $p$ features. The
primary goal is to identify the classes to which new unseen samples
belong.


## Linear classifier

Let us start by considering a slightly simpler classifier: a linear classifier that categorizes examples using a weighted linear combination of the features and an additive offset:

<!-- Equation labels as ordinary links -->
<div id="_auto1"></div>

$$
\begin{equation}
s_i = \boldsymbol{x}_i^T\boldsymbol{w} + b_0 \equiv  \mathbf{x}_i^T\mathbf{w},
\label{_auto1} \tag{1}
\end{equation}
$$

where we use the short-hand notation 
$\mathbf{x}_i = (1,\boldsymbol{x}_i)$ and $\mathbf{w} = (b_0,\boldsymbol{w})$. 
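To make the short-hand notation concrete, here is a minimal NumPy sketch. The toy data, the weights and the offset are made-up illustrative values, not taken from any dataset. It checks that absorbing the offset $b_0$ into the weight vector, by prepending a constant 1 to each sample, gives the same scores.

```python
import numpy as np

# Hypothetical toy data: n = 4 samples, p = 2 features (illustrative values only)
X = np.array([[0.5, 1.0],
              [-1.0, 2.0],
              [1.5, -0.5],
              [0.0, 0.0]])
w = np.array([2.0, -1.0])   # feature weights (assumed for illustration)
b0 = 0.5                    # additive offset (assumed for illustration)

# Scores with the offset written explicitly: s_i = x_i^T w + b_0
s_explicit = X @ w + b0

# Short-hand form: prepend a column of ones to X and put b_0 first in w
X_aug = np.column_stack([np.ones(X.shape[0]), X])
w_aug = np.concatenate([[b0], w])
s_aug = X_aug @ w_aug

print(s_explicit)
print(s_aug)
```

Both arrays contain the same four scores, so in what follows the offset does not need separate treatment.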

## Some selected properties

The score $s_i$ takes values on the entire real axis. In the case of
logistic regression, however, the labels $y_i$ are discrete
variables. One simple way to get a discrete output is to apply a
sign-like step function that maps the output of a linear regressor to $\{0,1\}$:
$f(s_i)=1$ if $s_i\ge 0$ and $0$ otherwise. This
is commonly known as the "perceptron" in the machine learning
literature. This model is extremely simple, but in
many cases (e.g. noisy data) it is preferable to have a "soft" classifier that outputs
the probability of a given category. For example, given
$\mathbf{x}_i$, the classifier outputs the probability of being in
category $m$. One such function is the logistic (or sigmoid) function:

<!-- Equation labels as ordinary links -->
<div id="eq:log_fun"></div>

$$
\begin{equation}
f(s) = \frac{1}{1+\mathrm e^{-s}}.
\label{eq:log_fun} \tag{2}
\end{equation}
$$

Note that $1-f(s)= f(-s)$, which will be useful shortly. 
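As a quick numerical illustration, the logistic function and the identity $1-f(s)=f(-s)$ can be checked with a few lines of NumPy. The function name `sigmoid` and the grid of test points are choices made here for illustration.

```python
import numpy as np

def sigmoid(s):
    """Logistic function f(s) = 1 / (1 + exp(-s)), evaluated elementwise."""
    s = np.asarray(s, dtype=float)
    return 1.0 / (1.0 + np.exp(-s))

# Evaluate on a small grid of scores (arbitrary illustrative range)
s = np.linspace(-5.0, 5.0, 11)
f = sigmoid(s)

# Check the identity 1 - f(s) = f(-s) on the grid
symmetry_holds = np.allclose(1.0 - f, sigmoid(-s))
print(symmetry_holds)
print(sigmoid(0.0))   # the midpoint of the curve, f(0) = 1/2
```

For very large negative scores, `np.exp(-s)` can overflow. A numerically more careful variant such as `scipy.special.expit` may be preferable in practice.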

## The cross-entropy as a cost function for logistic regression

The perceptron is an example of a "hard" classifier: each datapoint is deterministically assigned to a category (i.e. $y_i=0$ or $y_i=1$). In many cases, it is favorable to have a "soft" classifier that outputs the probability of a given category rather than a single value. For example, given $\mathbf{x}_i$, the classifier outputs the probability of being in category $m$. 
Logistic regression is the most canonical example of a soft classifier. In logistic regression, the probability that a data point $\boldsymbol{x}_i$ belongs to a category $y_i\in\{0,1\}$ is given by

$$
\begin{eqnarray}
P(y_i=1|\boldsymbol{x}_i,\boldsymbol{\theta}) &=& \frac{1}{1+\mathrm{e}^{-\mathbf{x}^T_i\mathbf{w}}},\nonumber\\
P(y_i=0|\boldsymbol{x}_i,\boldsymbol{\theta}) &=& 1 - P(y_i=1|\boldsymbol{x}_i,\boldsymbol{\theta}),
\end{eqnarray}
$$

where $\boldsymbol{\theta}=\mathbf{w}$ are the weights we wish to learn from the data. 


Notice that in terms of the logistic function, we can write

$$
P(y_i=1) =f(\mathbf{x}_i^T\mathbf{w})=1-P(y_i=0).
$$
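The difference between hard and soft classification can be sketched as follows. The inputs and weights are made-up values, and the threshold of $0.5$ used to turn probabilities into labels is a common convention assumed here, not something fixed by the model.

```python
import numpy as np

def sigmoid(s):
    """Logistic function, elementwise."""
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical augmented inputs (first column is the constant 1) and weights
X_aug = np.array([[1.0, 0.5, 1.0],
                  [1.0, -1.0, 2.0],
                  [1.0, 1.5, -0.5]])
w = np.array([0.5, 2.0, -1.0])

s = X_aug @ w                 # linear scores x_i^T w
p1 = sigmoid(s)               # soft output: P(y_i = 1)
p0 = 1.0 - p1                 # P(y_i = 0)

hard = (s >= 0).astype(int)            # perceptron-style step function
soft_label = (p1 >= 0.5).astype(int)   # thresholding the probability at 1/2

print(p1)
print(hard, soft_label)
```

Thresholding the probability at $1/2$ reproduces the perceptron's labels, since $f(s)\ge 1/2$ exactly when $s\ge 0$. The soft classifier additionally reports how confident each assignment is.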

<!-- !split  -->
## Maximum likelihood

We now define the cost function for logistic regression using Maximum
Likelihood Estimation (MLE). Recall that in MLE we choose parameters
to maximize the probability of seeing the observed data. Consider a
dataset $\mathcal{D}=\{(y_i,\boldsymbol{x}_i)\}$ with binary labels
$y_i\in\{0,1\}$ where the data points are drawn independently.  The
likelihood of seeing the data under our model is just:

<!-- Equation labels as ordinary links -->
<div id="_auto2"></div>

$$
\begin{equation}
P(\mathcal{D}|\mathbf{w}) = \prod_{i=1}^n \left[f(\mathbf{x}_i^T\mathbf{w})\right]^{y_i}\left[1-f(\mathbf{x}_i^T\mathbf{w})\right]^{1-y_i},
\label{_auto2} \tag{3}
\end{equation}
$$

from which we can readily compute the log-likelihood:

<!-- Equation labels as ordinary links -->
<div id="_auto3"></div>

$$
\begin{equation}
l(\mathbf{w}) = \sum_{i=1}^n  y_i\log f(\mathbf{x}_i^T\mathbf{w}) + (1-y_i)\log\left[1-f(\mathbf{x}_i^T\mathbf{w})\right].
\label{_auto3} \tag{4}
\end{equation}
$$

The maximum likelihood estimator is defined as the set of parameters that maximizes the log-likelihood, where we maximize with respect to $\mathbf{w}$:

$$
\hat{\mathbf{w}} = \underset{\mathbf{w}}{\mathrm{arg\,max}} \sum_{i=1}^n y_i\log f(\mathbf{x}_i^T\mathbf{w}) + (1-y_i)\log\left[1-f(\mathbf{x}_i^T\mathbf{w})\right].
$$

Since the cost (error) function is just the negative log-likelihood, for logistic regression we have that

$$
\begin{eqnarray}
\mathcal{C}(\mathbf{w}) &=& - l(\mathbf{w}) \\
&=& \sum_{i=1}^n  -y_i\log f(\mathbf{x}_i^T\mathbf{w}) - (1-y_i)\log\left[1-f(\mathbf{x}_i^T\mathbf{w})\right].\nonumber
\end{eqnarray}
$$

This equation is known in statistics as the *cross entropy*. Finally, we note that just as in linear regression, 
in practice we usually supplement the cross-entropy with additional regularization terms, typically $L_1$ and $L_2$ regularization as we did for Lasso and Ridge regression, respectively.
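A minimal sketch of the (unregularized) cross-entropy cost is given below. The function names, the tiny dataset and the clipping constant `eps`, which guards against evaluating $\log 0$, are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(s):
    """Logistic function, elementwise."""
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy(w, X_aug, y, eps=1e-12):
    """Negative log-likelihood C(w) = -sum_i [y_i log f_i + (1 - y_i) log(1 - f_i)]."""
    p = sigmoid(X_aug @ w)
    p = np.clip(p, eps, 1.0 - eps)   # numerical guard (an assumption, not part of the model)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical one-feature dataset; the first column is the constant 1
X_aug = np.array([[1.0, 0.5],
                  [1.0, -1.0],
                  [1.0, 2.0],
                  [1.0, -2.0]])
y = np.array([1, 0, 1, 0])

# With w = 0 every probability is 1/2, so the cost equals n * log(2)
cost_zero = cross_entropy(np.zeros(2), X_aug, y)

# Weights that separate the two classes give a lower cost
cost_good = cross_entropy(np.array([0.0, 3.0]), X_aug, y)

print(cost_zero, cost_good)
```

An $L_2$ penalty could be added by returning the same expression plus a term proportional to `np.sum(w**2)`, with a regularization strength chosen by the user.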

## Minimizing the cross entropy

The cross entropy is a convex function of the weights $\mathbf{w}$ and,
therefore, any local minimizer is a global minimizer. Minimizing this
cost function leads to the following equation

<!-- Equation labels as ordinary links -->
<div id="_auto4"></div>

$$
\begin{equation}
\boldsymbol{0}=\boldsymbol{\nabla} \mathcal{C}(\mathbf{w}) = \sum_{i=1}^n\left[f(\mathbf{x}_i^T\mathbf{w})-y_i\right]\mathbf{x}_i,
\label{_auto4} \tag{5}
\end{equation}
$$

where we made use of the logistic function identity $\partial_z f(z) =
f(z)[1-f(z)]$.  Equation (5) is a transcendental equation for
$\mathbf{w}$, the solution of which, unlike in linear regression, cannot
be written in a closed form. 
We therefore need iterative methods such as gradient descent.
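As an illustration, the following sketch minimizes the cross entropy with plain gradient descent, using the gradient $\sum_{i}\left[f(\mathbf{x}_i^T\mathbf{w})-y_i\right]\mathbf{x}_i$ from above. Everything specific here is an assumption made for the example: the synthetic data, the "true" weights used to generate it, the fixed step size `eta` and the number of iterations are hand-picked, and no regularization is included.

```python
import numpy as np

def sigmoid(s):
    """Logistic function, elementwise."""
    return 1.0 / (1.0 + np.exp(-s))

def cost(w, X_aug, y, eps=1e-12):
    """Cross entropy (negative log-likelihood)."""
    p = np.clip(sigmoid(X_aug @ w), eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X_aug, y):
    """Gradient of the cross entropy: sum_i [f(x_i^T w) - y_i] x_i."""
    return X_aug.T @ (sigmoid(X_aug @ w) - y)

# Synthetic data drawn from a logistic model with hypothetical weights
rng = np.random.default_rng(2018)
n = 200
x = rng.normal(size=n)
X_aug = np.column_stack([np.ones(n), x])      # constant 1 plus one feature
w_true = np.array([-0.5, 2.0])                # assumed only to generate labels
y = (rng.uniform(size=n) < sigmoid(X_aug @ w_true)).astype(float)

# Plain gradient descent with a fixed, hand-picked step size
eta = 1.0 / n
w = np.zeros(2)
for _ in range(5000):
    w = w - eta * gradient(w, X_aug, y)

grad_norm = np.linalg.norm(gradient(w, X_aug, y))
print("estimated weights:", w)
print("gradient norm at the end:", grad_norm)
print("cost at w = 0 and at the estimate:", cost(np.zeros(2), X_aug, y), cost(w, X_aug, y))
```

Because the cost is convex, a sufficiently small fixed step size drives the gradient towards zero. The estimated weights should be close to, but not identical with, the weights used to generate the data, since the labels are noisy.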