<a href="https://colab.research.google.com/github/HSE-LAMBDA/MLatFIAN2020/blob/master/MLatFIAN2020_optional_DL_homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optional homework: from Logistic Regression to MLP

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

We'll consider the XOR problem, a notable example of a simple problem that cannot be solved with a linear classifier.

In [None]:
X = np.random.uniform(-1, 1, size=(5000, 2))
y = ((X.T[0] >= 0.) ^ (X.T[1] >= 0.)).astype('int32')

plt.scatter(*X.T, c=y, s=0.4, alpha=0.8, cmap='bwr');
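To make the "cannot be solved with a linear classifier" claim concrete, here is a minimal NumPy-only sketch. It regenerates the same kind of data (with a fixed seed, which is an assumption made only for reproducibility), fits a linear model with a bias term by least squares, and measures how often its sign agrees with the labels. The names `X_demo`, `y_demo` and `linear_acc` are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
X_demo = rng.uniform(-1, 1, size=(5000, 2))
y_demo = ((X_demo[:, 0] >= 0.) ^ (X_demo[:, 1] >= 0.)).astype('int32')

# Append a column of ones so the linear model also has a bias term
A = np.c_[X_demo, np.ones(len(X_demo))]

# Least-squares fit to targets -1 / +1 (a simple stand-in for a linear classifier)
w, *_ = np.linalg.lstsq(A, 2.0 * y_demo - 1.0, rcond=None)

# Classify by the sign of the linear score and compare with the true labels
linear_acc = np.mean((A @ w >= 0) == (y_demo == 1))
print("accuracy of the linear fit:", linear_acc)
```

The printed accuracy should stay far from 1.0, since a single straight line cannot separate the four quadrants into the two XOR classes.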

Helper function to iterate through batches of data:

In [None]:
def sample_batch(X, y, batch_size):
  assert len(X) == len(y)
  idx = np.random.choice(len(X), len(X), replace=False)
  shuffled_X = X[idx]
  shuffled_y = y[idx]

  for i in range(0, len(X), batch_size):
    yield (shuffled_X[i : i + batch_size], shuffled_y[i : i + batch_size])
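A quick way to see what the helper yields is to run it on a tiny toy dataset. The sketch below repeats the function so that it runs on its own; `X_toy` and `y_toy` are made-up arrays used only for illustration.

```python
import numpy as np

def sample_batch(X, y, batch_size):
  # Same helper as in the cell above, repeated so this snippet is self-contained
  assert len(X) == len(y)
  idx = np.random.choice(len(X), len(X), replace=False)
  shuffled_X = X[idx]
  shuffled_y = y[idx]

  for i in range(0, len(X), batch_size):
    yield (shuffled_X[i : i + batch_size], shuffled_y[i : i + batch_size])

# Toy data: row i of X_toy is [2*i, 2*i + 1] and its label is i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

batches = list(sample_batch(X_toy, y_toy, batch_size=4))
batch_shapes = [(xb.shape, yb.shape) for xb, yb in batches]
print(batch_shapes)  # the last batch is smaller: 10 is not a multiple of 4

# Rows and labels are shuffled together, so each pair stays aligned
pairs_aligned = all(np.array_equal(xb[:, 0], 2 * yb) for xb, yb in batches)
print(pairs_aligned)
```

Each call to `sample_batch` reshuffles the data, so looping over it once corresponds to one epoch with a fresh random order.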

**Step 1:** Define the model

In [None]:
# At first, try just a single Dense layer without activation
# (combined with the sigmoid inside the loss function, that will
# result in a simple logistic regression model)
# 
# After having trained it, come back and make your model more complicated
# (add more layers and activations)

model = tf.keras.Sequential([
    # <YOUR CODE>
])


# Automatic checks:
assert isinstance(model, tf.keras.Sequential), 'Your model should be an instance of tf.keras.Sequential'
dummy_pred = model.predict([[-7.04634833, -2.39160895],
                            [ 0.09212291,  7.40897428],
                            [ 2.53335199, -2.70660191],
                            [ 3.62886011, -4.02756296],
                            [ 2.7433485 , -1.10504784],
                            [ 3.99561633, -8.68322612],
                            [ 8.13889866, -8.92227882],
                            [-0.10157622,  4.26008939],
                            [ 6.10780474,  7.75495299],
                            [-4.96919624,  3.83381552]])
assert dummy_pred.shape == (10, 1), 'The last layer needs only a single output (since we are doing binary classification)'
assert isinstance(model.layers[-1], tf.keras.layers.Dense), "Why isn't your last layer Dense? o0"
assert model.layers[-1].activation is tf.keras.activations.linear, "No activation needed in the last layer. We'll combine CrossEntropy and Sigmoid activation in the loss function"

**Step 2:** Define the loss function

In [None]:
loss_fn = <YOUR CODE> # Cross-entropy loss function with signature:
                      #      (y_true, y_pred) -> loss_value
                      # where y_true are labels (0 or 1), y_pred are logit predictions, i.e.
                      # the predicted probability is `sigmoid(y_pred)`.
                      # Make sure to return a scalar (average the loss over predictions).
                      #
                      # Hint: check out the losses available in `tf.losses`
                      # Alternatively, you can define it explicitly as:
                      #   loss_fn = lambda y_true, y_pred: ...

# Automatic checks:
dummy_y_true = tf.convert_to_tensor([0, 0, 1, 1, 0, 1])
dummy_y_pred = tf.convert_to_tensor([-7.04634833, -2.39160895, 0.09212291, 7.40897428, 2.53335199, -2.70660191])
with tf.GradientTape() as t:
  t.watch(dummy_y_pred)
  dummy_loss_value = loss_fn(dummy_y_true, dummy_y_pred)
dummy_grads = t.gradient(dummy_loss_value, dummy_y_pred)

assert isinstance(dummy_loss_value, tf.Tensor)
assert dummy_loss_value.shape == []
assert np.isclose(dummy_loss_value.numpy(), 1.01969)
assert np.allclose(dummy_grads.numpy(), [1.4497087e-04, 1.3969133e-02, -7.9497591e-02, -1.0090417e-04, 1.5440786e-01, -1.5623584e-01])
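For intuition about what the checks above expect, the quantity can be sketched in plain NumPy. For a label `y` and a logit `z`, the per-example cross-entropy is `-y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z))`, which can be rewritten in the commonly used numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))`. The function name `sigmoid_cross_entropy_np` is hypothetical, and this is only a reference for the math: it does not operate on tensors, so it cannot replace `loss_fn` in the gradient check or the training loop.

```python
import numpy as np

def sigmoid_cross_entropy_np(y_true, logits):
    # Hypothetical NumPy reference, not a TensorFlow loss
    y_true = np.asarray(y_true, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    # Stable form of -y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z))
    per_example = (np.maximum(logits, 0)
                   - logits * y_true
                   + np.log1p(np.exp(-np.abs(logits))))
    # Average over predictions to get a scalar
    return per_example.mean()

# Same dummy values as in the automatic checks above
y_check = [0, 0, 1, 1, 0, 1]
z_check = [-7.04634833, -2.39160895, 0.09212291, 7.40897428, 2.53335199, -2.70660191]
reference_loss = sigmoid_cross_entropy_np(y_check, z_check)
print(reference_loss)
```

The printed value should agree with the number used in the automatic check, up to rounding.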

**Step 3:** Run the training loop

I'm providing a ready-to-use training loop here, but feel free to ignore my example and write it from scratch.

In [None]:
from IPython.display import clear_output

opt = tf.optimizers.Adam()

num_epochs = 200
batch_size = 512

losses = []
for i_epoch in range(num_epochs):
  if (i_epoch + 1) % 10 == 0:
    print("Epoch:", i_epoch + 1)

  epoch_loss = 0
  for X_batch, y_batch in sample_batch(X, y, batch_size=batch_size):
    with tf.GradientTape() as t:
      loss_batch = loss_fn(y_batch[:,None], model(X_batch, training=True))
    grads = t.gradient(loss_batch, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    epoch_loss += loss_batch * len(X_batch)

  losses.append(epoch_loss.numpy() / len(X))

  if (i_epoch + 1) % 100 == 0:
    clear_output()
    plt.figure()
    plt.plot(losses)
    plt.show()

Check out the predictions:

In [None]:
xx0, xx1 = np.meshgrid(
    np.linspace(-1, 1, 100),
    np.linspace(-1, 1, 100)
)

yy = model.predict(np.c_[xx0.ravel(), xx1.ravel()]).reshape(xx0.shape)

plt.contourf(xx0, xx1, yy, cmap='bwr', alpha=0.5, levels=30)
plt.scatter(*X.T, c=y, s=0.4, alpha=0.8, cmap='bwr');
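Because the last layer has no activation, the values plotted above are logits rather than probabilities. A minimal sketch of how logits map to probabilities and hard class labels, using made-up values (`example_logits` is purely illustrative):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps a logit to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

example_logits = np.array([-2.0, 0.0, 3.0])  # illustrative values, not model output
example_probs = sigmoid(example_logits)

# A probability of at least 0.5 corresponds to a logit of at least 0,
# so the decision boundary is the zero level of the logit surface
example_labels = (example_logits >= 0).astype('int32')

print(example_probs)
print(example_labels)
```

In the contour plot, the decision boundary is therefore where the predicted logit changes sign.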


What predictions do you get with no hidden layers (simple logistic regression model)?

Try other things:

- Single hidden layer:

  - Try adding a single hidden layer with 100 units and an activation function (e.g. ELU) and run the training again. Does it get better?

  - Play around with the number of neurons in the hidden layer.
    - Try very small numbers (e.g. 1-2 neurons in the hidden layer). Does it work?
    - Is it possible to solve this problem with just 1 neuron in the hidden layer?
    - Can you think of a theoretical solution with 2 neurons in the hidden layer? (Hint: it exists.) Were you able to find a solution by training your model with a gradient-based method?

- MLP:
  - Play around with the number of hidden layers
  - Which is better from the training perspective: more layers (a deeper network) or more neurons per layer (a wider network)?