# Logistic Regression is a Neural Network

## Summary

The purpose of this tutorial is to explain what some may consider a complex idea in machine learning: neural networks. Here, the reader will see that logistic regression can be framed as a specific neural network architecture. Specifically, a logistic regression model is a neural network with no hidden layers, a particular optimization method, no minibatches, and a slightly altered loss function.


The target reader is the data analytics professional who may not consider themselves a machine learning practitioner. The only prerequisites are basic linear algebra and calculus. Familiarity with the neural network concepts mentioned in this article is not required, but readers who are new to them are encouraged to follow the provided links.

This tutorial will not appeal to visual representations of neurons like the image below; rather, it will use basic concepts in linear algebra, probability, and calculus.

![Source: Wikipedia](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/296px-Colored_neural_network.svg.png)
Source: Wikipedia

Finally, this tutorial is in R in order to drive home the fact that tools which data analytics professionals use on a regular basis (namely, `glm`) leverage the same concepts 'under the hood' that power machine learning methods like neural networks.

## Setup

To run this tutorial in your own R environment, you will need the following R CRAN library:

-  [Keras](https://cran.r-project.org/web/packages/keras/index.html)

Make sure the required package is installed if it isn't already. [Keras](https://keras.io/) is a machine learning library that acts as a wrapper around other deep learning frameworks such as TensorFlow and Theano. Here we import the R interface to Keras. Note that this post will not cover the Keras API; however, code snippets are provided.

The script below will help you verify that the package is installed. If it is not, the command `install.packages()` will install it. Be advised that the package has dependencies that will also be downloaded.

In [1]:
# Install keras from CRAN only if it is not already installed
if (!("keras" %in% installed.packages()[, "Package"])) {
    install.packages("keras")
}

library(keras)

As the insights in this post are not specific to any particular dataset, we can generate random Gaussian features with a random Bernoulli target. We will initialize the matrix $X$ as having 1000 rows and 10 columns and the array $Y$ as having 1000 rows. Keep these dimensions in mind for later.

In [2]:
set.seed(42)

X <- matrix(rnorm(10 * 1000, mean = 1, sd = 0.5), ncol=10)
Y <- matrix(rbinom(1000, 1, 0.5))

The reader is free to rename the features of the dataset, but since the particular names aren't as important as the values of the estimated coefficients, the features will remain nameless for this demonstration.

In [3]:
data = data.frame(X)
data$target = Y

In [4]:
head(data)

X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,target
1.6854792,2.1625292,1.125289034,0.6571692,0.929095672,1.0356112,1.0864161,1.7081707,0.9712737,0.5389325,1
0.7176509,1.2620611,0.861037975,0.6036428,0.593050963,1.485145,0.3635182,1.278617,0.8754823,0.7520914,0
1.1815642,1.4853667,0.137632133,0.7964979,0.837229705,1.1550176,0.5660523,1.4906207,0.2379189,-0.5552311,0
1.3164313,1.1884867,-0.003352472,0.4256647,1.189078699,0.9302257,1.3131606,0.7069086,1.2317955,0.6536197,1
1.2021342,0.5020333,0.354095835,1.5578802,0.002757319,0.8368444,0.9471847,1.4695853,0.4061896,1.1494506,0
0.9469377,0.7012585,1.182919114,0.5602716,0.500321653,0.9405952,0.8718393,0.9676495,1.2470321,0.9656635,0


Note that the terms *weights* and *coefficients* are used interchangeably in this tutorial, as are their mathematical symbols $\theta$ and $\beta$, respectively.

## Conceptual Review

It’s worth pausing now to consider the dimensions of the data arrays we’re going to use for the model. The actual dimensions aren’t as important as the formalization of the dimensions, which are as follows:

- __N__: Number of rows (observations)
- __K__: Number of columns (features)

In a standard logistic regression setting, where there is also an intercept that is estimated in addition to the coefficients, the dimensionality of the input data is $N \times (K + 1)$. However, for this demonstration we won't be estimating an intercept for simplicity. 

Therefore, __the dimension of our input data $X$ can be formalized as $N \times K$__.

In [5]:
dim(X)

__The dimension of our output data or target $Y$ is $N \times 1$__.

In [6]:
dim(Y)

For generalized linear models, the problem is formalized as $\hat{Y} = f(X\beta)$ (where $\beta$ is our coefficient matrix).

- In a __linear regression__ setting, $f$ is the identity function so __$\hat{Y} = X\beta$__. 

- In a __logistic regression setting__, $f$ is the sigmoid function, defined as $\frac{1}{1 + e^{-x}}$, so __$\hat{Y} = \mathrm{sigmoid}(X\beta)$__.
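As a quick illustration of the sigmoid function, the following sketch defines it in base R and compares it against `plogis`, the logistic distribution function from R's `stats` package. The name `sigmoid` and the sample inputs are chosen here purely for illustration.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1)
sigmoid <- function(x) {
    1 / (1 + exp(-x))
}

# Illustrative inputs (not taken from the tutorial's dataset)
x_values <- c(-5, -1, 0, 1, 5)

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps to 0.5
sigmoid_values <- sigmoid(x_values)
print(sigmoid_values)

# Base R's plogis() computes the same function
print(all.equal(sigmoid_values, plogis(x_values)))
```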

Recall from linear algebra that in order to multiply two matrices, the number of columns of the first matrix must equal the number of rows of the second matrix.

It follows that in our example, __the dimension of the coefficient matrix $\beta$ is $K \times 1$__. 
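The dimension bookkeeping can be checked with a small sketch. The objects `X_demo` and `beta_demo` and their sizes are hypothetical and deliberately smaller than the tutorial's data.

```r
# Illustrative dimensions (hypothetical, smaller than the tutorial's data)
N_demo <- 5
K_demo <- 3

X_demo <- matrix(rnorm(N_demo * K_demo), nrow = N_demo, ncol = K_demo)
beta_demo <- matrix(0.1, nrow = K_demo, ncol = 1)

# (N x K) %*% (K x 1) yields an (N x 1) matrix, matching the shape of Y
linear_predictor <- X_demo %*% beta_demo
print(dim(linear_predictor))
```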

## Building a 1 Layer Neural Network

Full disclosure: all of the hyperparameters were carefully chosen to produce the same weights in the NN model as the logistic regression model. It is by no means guaranteed to get similar weights for the 2 types of models across different hyperparameters or even across datasets.

Without further ado, let's dive right in and build a neural network model using Keras.

$1$. We first __define the input layer__, specifying that the shape of the input data is equal to the number of columns of the dataset $X$. Notice that we don't have to define the number of rows $N$. This is because while training the neural network, we can control the number of rows used in each gradient update (the batch size).

In [7]:
inputs <- layer_input(shape = c(10))

Consider why we used 10 in the `shape` argument of our input layer. It corresponds to the number of columns $K$ of our dataset $X$.

$2$. Since this model is only 1 layer (ignoring the input layer), we next __define the output layer__ which is the only layer with trainable weights. We use a sigmoid [activation function](https://en.wikipedia.org/wiki/Activation_function) which is essentially $f$ from above.

In [8]:
outputs <- inputs %>%
    layer_dense(units = 1, kernel_initializer = initializer_constant(value = 0), activation = 'sigmoid', use_bias = FALSE)

Recall from above the dimensions of target $Y$: $N \times 1$. Similar to above, the value in the `units` argument in `layer_dense` corresponds to the number of columns of our target $Y$.

Using matrix multiplication we can describe at a very high level what is going on:

`layer_input` (dimension: $? \times K)$ `%*%` `layer_dense` (dimension: $K \times 1$) `%>%` `sigmoid_function` -> `outputs` (dimension: $? \times 1$)

-  `%*%` is R syntax for matrix multiplication
-  `%>%` is the `magrittr` pipe operator (re-exported by packages such as `dplyr` and `keras`), which passes the output of the previous function as input to the following function
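The same computation can be sketched in base R without Keras. This is only an illustration of the matrix arithmetic, not of what Keras does internally; `X_demo`, `W_demo`, and `outputs_demo` are hypothetical names. Because the weights start at 0, as with `initializer_constant(value = 0)`, every initial prediction is 0.5.

```r
set.seed(1)

# 4 illustrative rows with K = 10 features, standing in for a batch of inputs
X_demo <- matrix(rnorm(4 * 10), ncol = 10)

# Weight matrix of the dense layer (K x 1), initialized to 0
W_demo <- matrix(0, nrow = 10, ncol = 1)

sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: (4 x 10) %*% (10 x 1), followed by the sigmoid activation
outputs_demo <- sigmoid(X_demo %*% W_demo)

print(dim(outputs_demo))
print(as.vector(outputs_demo))
```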

$3$. Finally, we __define the overall structure of the model__.

In [9]:
nn_model <- keras_model(inputs = inputs, outputs = outputs)

$4$. The model has to be compiled with a few hyperparameters that we set. We __compile our model__ using `binary_crossentropy` as the loss function to minimize, which is explained below. We also define `Adam` as its optimization method. A brief discussion of optimization methods follows later.

In [10]:
nn_model %>% compile(
    loss = 'binary_crossentropy',
    optimizer = optimizer_adam(),
    metrics = c('binary_crossentropy')
)

What is binary crossentropy? Binary cross-entropy is defined as:

$$\large - \sum_{i=1}^{n} \big( y_i \log(\hat{p}) + (1 - y_i) \log(1 - \hat{p}) \big)$$

- $y_i$ is each _observed_ label

- $\hat{p}$ is the _predicted_ probability that $y_i = 1$. It's calculated as the output of the sigmoid function $\frac{1}{1 + e^{- X \theta}}$. You may also find it as $\hat{y}$ in other online resources.
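To make the formula concrete, here is a minimal sketch that evaluates it on a handful of made-up labels and predicted probabilities. The function name `binary_crossentropy` and the example vectors are assumptions for illustration. Keep in mind that Keras, as far as we are aware, reports this loss averaged over the samples rather than summed, which rescales the value but does not move the minimum.

```r
# Summed binary cross-entropy, as written in the formula above
binary_crossentropy <- function(y, p_hat) {
    -sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Made-up labels and predictions, for illustration only
y_demo <- c(1, 0, 1, 1)
p_confident <- c(0.9, 0.1, 0.8, 0.7)   # predictions that lean the right way
p_uninformed <- rep(0.5, 4)            # predictions from all-zero weights

loss_confident <- binary_crossentropy(y_demo, p_confident)
loss_uninformed <- binary_crossentropy(y_demo, p_uninformed)

# Predictions of 0.5 cost log(2) per observation; better predictions cost less
print(loss_confident)
print(loss_uninformed)
```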


In a logistic regression setting, we are interested in doing maximum likelihood estimation using the Bernoulli likelihood function, which is defined as:

$$\large \prod_{i=1}^{n} \big( \hat{p}^{y_i} \ (1 - \hat{p})^{(1 - y_i)} \big) $$


However, taking the log of the above makes the problem more computationally tractable. 

Recall that:
- $\log(a \cdot b) = \log(a) + \log(b)$

- $\log(b^{a}) = a \ \log(b)$.

So taking the log of the above we obtain the Bernoulli log-likelihood function, defined as:

$$ \large \sum_{i=1}^{n} \big( y_i \log(\hat{p}) + (1 - y_i) \log(1-\hat{p}) \big)$$



The two functions are identical except for the sign reversal! In a neural network setting, we are trying to _minimize_ binary cross-entropy while in a logistic regression setting we are trying to _maximize_ Bernoulli log-likelihood (aka [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)).
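The relationship can be verified numerically. The sketch below uses made-up labels and probabilities (`y_demo` and `p_demo` are hypothetical) and relies on base R's `dbinom`, which with `size = 1` gives the Bernoulli probability mass function.

```r
# Made-up labels and predicted probabilities, for illustration only
y_demo <- c(1, 0, 1, 1)
p_demo <- c(0.9, 0.1, 0.8, 0.7)

# Bernoulli likelihood (a product) and log-likelihood (a sum)
likelihood <- prod(p_demo^y_demo * (1 - p_demo)^(1 - y_demo))
log_likelihood <- sum(y_demo * log(p_demo) + (1 - y_demo) * log(1 - p_demo))

# Binary cross-entropy is the log-likelihood with the sign flipped
binary_crossentropy <- -log_likelihood

# The log of the product equals the sum of the logs
print(all.equal(log(likelihood), log_likelihood))

# dbinom() with size = 1 returns the same Bernoulli log-probabilities
bernoulli_log_probs <- dbinom(y_demo, size = 1, prob = p_demo, log = TRUE)
print(all.equal(sum(bernoulli_log_probs), log_likelihood))
```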

In [11]:
history <- nn_model %>% fit(
    X, Y, 
    epochs = 1000, batch_size = 1000
)

-  `epochs` is the number of passes over the entire dataset.
-  `batch_size` is the number of samples out of the original data set we're using to run optimization at one time. It's considered good practice to use a value that's a fraction of the size of your data but in this example we're using the entire dataset at once to mimic logistic regression.
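The interaction between the two arguments can be sketched as simple arithmetic: each epoch is split into `ceiling(N / batch_size)` batches, and each batch triggers one weight update. The helper `count_updates` is a hypothetical name, and the batch size of 32 is only a contrasting example.

```r
# Number of weight updates performed over a full training run
count_updates <- function(n_rows, n_epochs, batch_size) {
    n_epochs * ceiling(n_rows / batch_size)
}

# Settings used in this tutorial: one full-batch update per epoch
full_batch_updates <- count_updates(n_rows = 1000, n_epochs = 1000, batch_size = 1000)

# A contrasting minibatch setting (illustrative only)
mini_batch_updates <- count_updates(n_rows = 1000, n_epochs = 1000, batch_size = 32)

print(full_batch_updates)
print(mini_batch_updates)
```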

Next, we can get the weights from the model. In our model there is only 1 trainable layer so we only see an array of dimension $K \times 1$, just as defined above in `layer_dense`.

In [12]:
get_weights(nn_model)[1]

0
-0.003036723
-0.005477559
0.149673387
-0.211892888
0.006067039
-0.126188636
0.02324024
0.038955178
0.108082101
0.118358888


To recap so far: we’ve trained a 1 layer neural network. Since the target variable is binary, we used `binary_crossentropy` as the loss function. We used [Adam](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam) as the optimizer. After training the model we extracted the weights from the single trainable layer.

## Building a Logistic Regression Model

Let's return to logistic regression by fitting a logistic regression model to the data, leaving out the intercept as with the neural network. Just like with other machine learning methods, logistic regression in R is doing some optimization under the hood.

In [13]:
logistic_model <- glm(Y ~ X - 1, family = binomial(link='logit'), control = list(maxit = 25, epsilon=1e-10))

Just as we obtained the weights from the neural network, we can obtain the coefficients from the logistic regression model for comparison to the weights from the neural network model.

In [14]:
logistic_model$coefficients

Do they look similar to the weights from the neural network model above? Let's find the mean absolute percentage error of the neural network weights from the logistic regression coefficients.

In [15]:
nn_weights <- get_weights(nn_model)[[1]]
lr_coefs <- logistic_model$coefficients

cat("Mean absolute percentage error between NN model and LR: ", 100 * mean(abs((lr_coefs - nn_weights) / lr_coefs)))

Mean absolute percentage error between NN model and LR:  0.0002394773

The coefficients from the logistic regression model and the weights from the neural network model are very similar, but not exactly the same. Any differences might be attributable to any number of things, but most probably to the different optimization methods used by the two models.

## Side Discussion on Optimization Methods

Let's pause to highlight perhaps the biggest difference between logistic regression and our neural network: the optimization method. The links provided in this section will be more challenging.

R uses [Iteratively reweighted least squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) to solve for the coefficient values of logistic regression. IRLS is the algorithm implementation of [Fisher's scoring](https://en.wikipedia.org/wiki/Scoring_algorithm), a variant of [Newton-Raphson](https://en.wikipedia.org/wiki/Newton%27s_method), which is a _second order method_ as it makes use of the second derivative of the loss function.  
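To give a feel for what a second order method looks like, here is a minimal Newton-Raphson sketch for logistic regression without an intercept. It is not the source code of `glm`; the simulated data (`X_demo`, `y_demo`), the iteration cap, and the stopping tolerance are assumptions made for illustration.

```r
set.seed(1)

# Small simulated dataset (hypothetical, not the tutorial's data)
X_demo <- matrix(rnorm(200 * 3), ncol = 3)
y_demo <- rbinom(200, 1, 0.5)

beta_demo <- rep(0, ncol(X_demo))

for (iteration in 1:25) {
    p_hat <- as.vector(1 / (1 + exp(-X_demo %*% beta_demo)))

    # First derivative of the log-likelihood (the score)
    score <- t(X_demo) %*% (y_demo - p_hat)

    # Negative second derivative of the log-likelihood: X' W X,
    # where W is diagonal with entries p_hat * (1 - p_hat)
    fisher_information <- t(X_demo) %*% (X_demo * (p_hat * (1 - p_hat)))

    # Newton step: solve the linear system instead of inverting the matrix
    newton_step <- as.vector(solve(fisher_information, score))
    beta_demo <- beta_demo + newton_step

    if (max(abs(newton_step)) < 1e-10) break
}

# Compare against glm fitted on the same simulated data
glm_coefs <- coef(glm(y_demo ~ X_demo - 1, family = binomial(link = 'logit'),
                      control = list(epsilon = 1e-10)))

print(iteration)
print(max(abs(beta_demo - glm_coefs)))
```

Note how few iterations are needed compared with the 1000 epochs used for the neural network: using curvature information lets each step travel much closer to the maximum.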

All of the commonly used optimization methods for deep neural networks, such as `Adam` here, are _first order methods_. First order methods are preferred over second order methods for deep neural networks because the loss functions are almost certainly not [convex](https://en.wikipedia.org/wiki/Convex_function) and computing the second order derivatives would be extremely time consuming and computationally expensive.

Logistic regression can make use of second order methods because the log-likelihood function is concave, which guarantees that any local maximum is also a global maximum. Side note: it is possible that no finite maximum exists, or that the maximum is not unique. The reader has perhaps encountered logistic regression models that failed to converge, for example when there is multicollinearity among the predictors or when the predictors perfectly predict the target variable ([complete separation](https://en.wikipedia.org/wiki/Separation_(statistics))).
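Complete separation can be illustrated with a tiny made-up dataset in which the sign of a single predictor determines the label. In the sketch below (`x_sep`, `y_sep`, and `theta_grid` are hypothetical), the log-likelihood keeps increasing toward 0 as the coefficient grows, so no finite coefficient attains the maximum.

```r
# Perfectly separated data: y is 1 exactly when x is positive
x_sep <- c(-2, -1, 1, 2)
y_sep <- c(0, 0, 1, 1)

# Bernoulli log-likelihood for a one-coefficient model without an intercept
log_likelihood <- function(theta) {
    sum(dbinom(y_sep, size = 1, prob = plogis(theta * x_sep), log = TRUE))
}

# Evaluate the log-likelihood at increasingly large coefficients
theta_grid <- c(0.5, 1, 2, 5, 10)
ll_values <- sapply(theta_grid, log_likelihood)

# Each value is larger (closer to 0) than the previous one
print(ll_values)
```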



## Re-building the Neural Network 

Perhaps you're still skeptical that logistic regression is the same as a 1 layer neural network. So far we've built a 1 layer neural network model in Keras with binary cross-entropy as the loss function and we've also demonstrated that binary cross-entropy and Bernoulli log-likelihood are equivalent except for a sign reversal. To close the loop let's implement our neural network model from above from scratch, with one exception: instead of minimizing binary cross-entropy we're going to maximize log-likelihood.

Recall that 

$$\text{binary cross-entropy} = \large - \sum_{i=1}^{n} \big( y_i \log(\hat{p}) + (1 - y_i) \log(1 - \hat{p}) \big)$$

and since $\hat{p} = \frac{1}{1 + e^{- X \theta}}$, we obtain

$$\frac{\partial \ \text{binary cross-entropy}}{\partial \ \theta} = \large \sum_{i=1}^{n} \big( \hat{p} - y_i \big) x_i$$

The derivative of the log-likelihood with respect to $\theta$ is the negative of the above. The generalization of this concept applied to loss functions of deep neural networks is called [backpropagation](https://en.wikipedia.org/wiki/Backpropagation).
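The gradient of binary cross-entropy follows from the chain rule together with the fact that the derivative of the sigmoid is $\hat{p}(1 - \hat{p})$. As a sanity check, the sketch below compares the analytic gradient with a central finite-difference approximation. The simulated data, the point `theta_demo`, and the step size `h` are assumptions chosen for illustration.

```r
set.seed(3)

# Small simulated dataset (hypothetical, not the tutorial's data)
X_demo <- matrix(rnorm(20 * 3), ncol = 3)
y_demo <- rbinom(20, 1, 0.5)
theta_demo <- c(0.1, -0.2, 0.3)

# Summed binary cross-entropy as a function of the weights
bce <- function(theta) {
    p_hat <- as.vector(plogis(X_demo %*% theta))
    -sum(y_demo * log(p_hat) + (1 - y_demo) * log(1 - p_hat))
}

# Analytic gradient: sum over observations of (p_hat - y_i) * x_i
p_hat_demo <- as.vector(plogis(X_demo %*% theta_demo))
analytic_gradient <- as.vector(t(X_demo) %*% (p_hat_demo - y_demo))

# Central finite differences, one coordinate at a time
h <- 1e-6
numeric_gradient <- sapply(seq_along(theta_demo), function(j) {
    shift <- rep(0, length(theta_demo))
    shift[j] <- h
    (bce(theta_demo + shift) - bce(theta_demo - shift)) / (2 * h)
})

# The two gradients agree up to numerical error
print(max(abs(analytic_gradient - numeric_gradient)))
```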

The following implements the `Adam` optimizer from the original paper [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980). The choices of hyperparameters are taken from the default Keras implementation of `Adam`, which are identical to those chosen by the paper's authors except for the value of epsilon. In-depth understanding of the optimizer is not necessary and the below is only given for a demonstration that __if you switch the sign of the original loss function (binary cross-entropy to Bernoulli log-likelihood) and update weights accordingly, you again get weights that are similar to the logistic regression coefficients__.
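For reference, the update rules from the paper can be summarized as follows, where $g_t$ is the gradient of the loss at update step $t$, $\alpha$ is the learning rate, and $m_0 = v_0 = 0$. The notation follows the paper rather than any particular implementation.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

When a log-likelihood is maximized instead of a loss being minimized, $g_t$ is the gradient of the log-likelihood and the minus sign in the last line becomes a plus.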

In [16]:
N <- dim(X)[1]
K <- dim(X)[2]
n_epochs <- 1000
batch_size <- 1000


sigmoid_function <- function(z) {
    return (1 / (1 + exp(-z)))
}

# The following is a direct implementation of Adam per the paper above. The reader is not expected
# to develop intuition for the algorithm; rather, it's presented only as a more explicit demonstration
# of the Keras implementation of the neural network model.

# Initialize weight vector theta
theta <- rep(0, K)

# Initialize hyperparameters
alpha <- 0.001
beta1 <- 0.9
beta2 <- 0.999

# Initialize 1st moment vector
m <- 0

# Initialize 2nd moment vector
v <- 0

# Keras default epsilon value but different from paper
eps <- 1e-7

# Adam timestep: counts weight updates. It equals the epoch number here because
# each epoch contains exactly one batch, but stays correct for smaller batches.
timestep <- 0

for (epoch in 1:n_epochs) {
    
    # Similar to batch from above. Note that we also randomize the indexes of the minibatches.
    for (batch in split(sample(seq(1, length(Y))), ceiling(seq_along(seq(1, length(Y))) / batch_size))) {
    
        timestep <- timestep + 1

        # drop = FALSE keeps a matrix even if a batch contains a single row
        X_batch <- X[batch, , drop = FALSE]

        z <- X_batch %*% theta
        p_hat <- sigmoid_function(z)

        # In a normal neural network setting we would use the derivative of binary cross-entropy, but we're
        # using the derivative of the log-likelihood here. Remember they are identical except for a sign reversal.

        # deriv_binary_crossentropy <- t(X_batch) %*% (p_hat - Y[batch,])
        deriv_log_likelihood <-  - t(X_batch) %*% (p_hat - Y[batch,])
        
        # Update biased first moment estimate
        m <- beta1 * m + (1 - beta1) * deriv_log_likelihood
    
        # Update biased second raw moment estimate
        v <- beta2 * v + (1 - beta2) * deriv_log_likelihood^2
        
        # Compute bias-corrected first moment estimate
        m_hat <- m / (1 - beta1^timestep)
        
        # Compute bias-corrected second raw moment estimate
        v_hat <- v / (1 - beta2^timestep)

        # Update parameters. Remember that we are adding the update because we are maximizing 
        # Bernoulli log-likelihood.
        theta <- theta + alpha * (m_hat / (sqrt(v_hat) + eps))
    }
    
}

In [17]:
cat("Mean absolute percentage error between new NN model and LR: ", 100 * mean(abs((lr_coefs - theta) / lr_coefs)))

Mean absolute percentage error between new NN model and LR:  0.000326558

## Conclusion

Let's recap what we've done here:

-  We built a 1 layer neural network model using the popular deep learning library `keras` and a logistic regression model using `glm` in R.
- We've seen that the weights from the neural network model we built are very similar to the coefficients obtained from a logistic regression model. 
-  We've seen how Bernoulli log-likelihood used in logistic regression is just binary cross-entropy with a sign reversal.
- We built the same neural network model 'from scratch' except that we switched the sign of the loss function and obtained weights that are very similar to the logistic regression coefficients.

To sum up: __logistic regression is a neural network!__

If you've used logistic regression in R before, you've used a neural network. (Though you probably shouldn't change your LinkedIn profile to Deep Learning Engineer).

## References

-  [What is the relation between Logistic Regression and Neural Networks and when to use which?](https://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html)
-  [Single-Layer Neural Networks and Gradient Descent](http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html)
-  [A Statistical View of Deep Learning (I): Recursive GLMs](http://blog.shakirm.com/2015/01/a-statistical-view-of-deep-learning-i-recursive-glms/)
- [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)