## Logistic Regression Demo 

Instructor: Nedelina Teneva

In [1]:
# Import our standard libraries.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns  # for nicer plots
sns.set(style='darkgrid')  # default style
import tensorflow as tf

## Logistic Regression

Suppose we have a dataset with 2 datapoints, $x^{(0)}$ and $x^{(1)}$, each with 3 features plus a leading dummy feature fixed at 1, so the bias is learned as an ordinary weight. This time our target labels are binary (0 or 1).

In [2]:
# Here are our inputs.
X = np.array([[1, 3, -2, 0],
              [1, 1, 0, 1]])
Y = np.array([0, 1])

Let's write out our model function:

\begin{align}
h_W(x) = \phi(w_0x_0 + w_1x_1 + w_2x_2 + w_3x_3) = \phi(xW^T) = \frac{1}{1+e^{-xW^T}}
\end{align}

We can get all predictions with this matrix product:

\begin{align}
\hat{Y} = h_W(X) = \phi(XW^T)= \phi\left(
\begin{pmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3}\\
x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3}\\
\vdots & \vdots & \vdots & \vdots \\
x_{m-1,0} & x_{m-1,1} & x_{m-1,2} & x_{m-1,3} \\
\end{pmatrix}
\begin{pmatrix}
w_0 \\
w_1 \\
w_2 \\
w_3 \\
\end{pmatrix}
\right)
\end{align}


First let's write the sigmoid (logistic) function $\phi$. 

Sigmoid details: https://mathworld.wolfram.com/SigmoidFunction.html

In [3]:
def sigmoid(z):
  return 1 / (1 + np.exp(-z))
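
For very negative inputs, `np.exp(-z)` in the definition above can overflow and trigger a runtime warning. One common workaround, sketched here under the assumption that the input is an array, is to branch on the sign of `z`. The name `stable_sigmoid` is illustrative and not part of the original demo.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in exp() by branching on the sign of z."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the usual formula is safe.
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so exp() never sees a large argument.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1 + ez)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 2.0, 3.0, 1000.0])))
```

For the small inputs used in this demo, both versions give the same values.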

Now, given some initial parameter values (below), compute the model's initial predictions.

In [4]:
# Initial parameter values.
W = [1, 1, 1, 1]

# Compute predictions.
preds = sigmoid(np.dot(X, W))
print(preds)

[0.88079708 0.95257413]
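
These two values can be checked by hand. With every weight equal to 1, each pre-activation is just the sum of that row of $X$:

\begin{align}
x^{(0)}W^T &= 1 + 3 - 2 + 0 = 2, & \phi(2) &= \frac{1}{1+e^{-2}} \approx 0.8808 \\
x^{(1)}W^T &= 1 + 1 + 0 + 1 = 3, & \phi(3) &= \frac{1}{1+e^{-3}} \approx 0.9526
\end{align}

The first example has label 0, so the initial model is confidently wrong about it.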


We're not going to use MSE for logistic regression. Instead, we'll use the *logistic loss*, also called *binary cross-entropy*. With hard 0/1 labels, it equals the KL divergence between the empirical and predicted label distributions.

\begin{align}
LogLoss = \frac{1}{m} \sum_i \left[ -y_i\log(\hat{y}_i) - (1-y_i)\log(1-\hat{y}_i) \right]
\end{align}
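
As a rough illustration of the formula, the sketch below evaluates the loss per example for the initial predictions printed earlier. The helper name `log_loss` and the clipping constant `eps` are assumptions added here to guard against `log(0)`; they are not part of the original demo.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    """Return (mean loss, per-example losses) for binary labels."""
    # Clip predictions away from exactly 0 and 1 so log() stays finite.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_example = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
    return per_example.mean(), per_example

Y = np.array([0, 1])
preds = np.array([0.88079708, 0.95257413])  # initial predictions from above

mean_loss, per_example = log_loss(Y, preds)
print('per-example loss:', per_example)
print('mean loss:', mean_loss)
```

Almost all of the loss comes from the first example, where the model predicts about 0.88 for a label of 0.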

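Despite this new loss function, the gradient has the same form as it did for MSE with linear regression, up to any constant factor from how the MSE was scaled. This is less of a coincidence than it looks: it follows from pairing the sigmoid with the log loss. The only difference is that $h_W$ now includes the sigmoid.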
\begin{align}
\nabla J(W) &= \frac{1}{m}(h_W(X) - Y)X
\end{align}
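
One way to build confidence in this formula is to compare it against a finite-difference estimate of the gradient. The sketch below does that for the demo's data and initial weights. The helper `loss_at` and the step size `eps` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_at(W, X, Y):
    """Mean log loss of the logistic model with weights W."""
    preds = sigmoid(X @ W)
    return (-Y * np.log(preds) - (1 - Y) * np.log(1 - preds)).mean()

X = np.array([[1, 3, -2, 0],
              [1, 1, 0, 1]], dtype=float)
Y = np.array([0, 1], dtype=float)
W = np.ones(4)

# Analytic gradient from the formula above.
analytic = (sigmoid(X @ W) - Y) @ X / len(Y)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for j in range(len(W)):
    step = np.zeros_like(W)
    step[j] = eps
    numeric[j] = (loss_at(W + step, X, Y) - loss_at(W - step, X, Y)) / (2 * eps)

print('analytic:', analytic)
print('numeric: ', numeric)
```

The two vectors should agree to several decimal places.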

Let's write the code for a single gradient descent step:

In [5]:
# Run one step of gradient descent.
m, n = X.shape  # m = number of examples; n = number of features (including bias)
learning_rate = 0.1

preds = sigmoid(np.dot(X, W))

loss = (-Y * np.log(preds) - (1 - Y) * np.log(1 - preds)).mean()

gradient = np.dot((preds - Y), X) / m
W = W - learning_rate * gradient

print('predictions:', preds)
print('loss:', loss)
print('gradient:', gradient)
print('weights:', W)

predictions: [0.88079708 0.95257413]
loss: 1.0877576813083567
gradient: [ 0.4166856   1.29748268 -0.88079708 -0.02371294]
weights: [0.95833144 0.87025173 1.08807971 1.00237129]


## Now with TensorFlow/Keras

In [6]:
tf.keras.backend.clear_session()

model = tf.keras.Sequential()

model.add(tf.keras.layers.Dense(
    units=1,                     # output dim
    input_shape=[4],             # input dim
    use_bias=False,              # we included the bias in X
    activation='sigmoid',        # apply a sigmoid to the output
    kernel_initializer=tf.ones_initializer,  # initialize params to 1
))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

model.compile(loss='binary_crossentropy', optimizer=optimizer)

In [7]:
# As above, get predictions for the current model first.
preds = model.predict(X)

# Run 40 epochs of gradient descent. With batch_size=2, each epoch is one full-batch update.
history = model.fit(
  x = X,
  y = Y,
  epochs=40,
  batch_size=2,
  verbose=0)

# Show the first epoch's loss (computed before any update) and the weights after all 40 epochs.
loss = history.history['loss'][0]
weights = model.layers[0].get_weights()[0].T
print('train predictions:', preds.T)
print('loss:', loss)
print('W:', weights)

train predictions: [[0.8807971  0.95257413]]
loss: 1.0877577066421509
W: [[0.8144151  0.01065879 1.8037566  1.2162935 ]]
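
Because each Keras epoch here is a single full-batch update, the weights above should correspond to 40 repetitions of the NumPy step from earlier. The following is a minimal sketch of that loop, assuming plain SGD with no momentum as configured above. Small differences are expected, since Keras works in float32 and clips predictions inside its loss.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[1, 3, -2, 0],
              [1, 1, 0, 1]], dtype=float)
Y = np.array([0, 1], dtype=float)

W = np.ones(4)          # same initialization as the Keras model
learning_rate = 0.1
losses = []

for step in range(40):
    preds = sigmoid(X @ W)
    # Record the loss before the update, as Keras does for each batch.
    losses.append((-Y * np.log(preds) - (1 - Y) * np.log(1 - preds)).mean())
    gradient = (preds - Y) @ X / len(Y)
    W = W - learning_rate * gradient

print('first loss:', losses[0])
print('last loss:', losses[-1])
print('weights after 40 steps:', W)
```

The final weights should land close to the Keras values printed above, and the recorded loss should decrease from step to step.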
