# Exercise Sheet 4

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

## XOR in Keras

On the last exercise sheet we saw that a single-layer perceptron cannot represent the XOR function. Show that XOR can be realised by a network with a single hidden layer of sigmoid units.
* Implement this network utilising Keras.
* What is the minimum number of hidden units needed in this network?
* What are the network parameters after training?
* $\star$ Could there be significantly different results depending on the weight initialisation?

### Solution

See the Keras documentation for the [Sequential model API](https://keras.io/models/sequential/).

In [None]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

Solve it "Regression Style"

In [None]:
# XOR
data = np.array([[0.,0.],[0.,1.],[1.,0.],[1.,1.]])
labels = np.array([[0.],[1.],[1.],[0.]])

# We construct the neural network
model = tf.keras.models.Sequential()
# Hidden layer. Note that the first argument (2) is the dimensionality of the layer's OUTPUT,
# i.e. the number of hidden units; input_dim=2 declares the two network inputs
model.add(tf.keras.layers.Dense(2, input_dim=2, activation='sigmoid'))
# Output layer: a single sigmoid unit
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
# `learning_rate` replaces the deprecated `lr` argument of Adam
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.01), metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['binary_accuracy'])

# Visualize the model
tf.keras.utils.plot_model(model, 'model.png', show_shapes=True, show_layer_names=True)

![Our Model](model.png)

In [None]:
model.summary()

In [None]:
model.predict(data)

In [None]:
# Train NN:
model.fit(data, labels, epochs=6000,shuffle=True)

We see that two hidden units are enough to learn XOR. If we only use one hidden unit, the network output is a monotone function of a single linear combination of the inputs, so the decision boundary is still a straight line, just as for the perceptron. The accuracy then cannot go above 0.75, which corresponds to learning e.g. OR instead of XOR. So two is indeed minimal.
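That two hidden units suffice can also be checked by hand, without any training. The following is a minimal NumPy sketch, assuming hand-picked weights (the values 20, -10 and -30 are illustrative choices, not the parameters Keras finds): one hidden unit approximates OR, the other approximates AND, and the output unit fires for "OR and not AND".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# Hand-picked (hypothetical) parameters, not the result of training
W1 = np.array([[20., 20.],
               [20., 20.]])       # column 0 feeds the OR unit, column 1 the AND unit
b1 = np.array([-10., -30.])       # OR fires if x1 + x2 > 0.5, AND if x1 + x2 > 1.5
W2 = np.array([[20.], [-20.]])    # output unit: OR and not AND
b2 = np.array([-10.])

hidden = sigmoid(X @ W1 + b1)
xor_out = sigmoid(hidden @ W2 + b2)
print(np.round(xor_out, 3).ravel())
```

The rounded outputs reproduce the XOR truth table, so a 2-2-1 sigmoid network can represent XOR. Whether training actually finds such a solution is a separate question.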

The code above contains two `compile` lines, one with the mean squared error and one with the cross-entropy as loss function; only one of them is active at a time. The cross-entropy does not work reliably with a single output. It compares probability distributions and is more suitable if we frame this as a classification problem and assign each class label its own output, see below.

In [None]:
# Network parameters after training:

print('First layer weights:\n',model.layers[0].get_weights()[0])
print('First layer bias:\n',model.layers[0].get_weights()[1])

print('Second layer weights:\n',model.layers[1].get_weights()[0])
print('Second layer bias:\n',model.layers[1].get_weights()[1])

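Usually, the weights are initialized using small random numbers. If we use zeros instead, the NN is stuck and will never converge: all hidden units start out identical, every input produces the output 0.5, and for XOR the gradient contributions of the four data points cancel, so no weight is ever updated.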

In [None]:
# We construct the neural network
model = tf.keras.Sequential()
# Hidden layer: 2 units, all kernel weights initialised to zero (biases are zero by default)
model.add(tf.keras.layers.Dense(2, input_dim=2, activation='sigmoid', kernel_initializer='zeros'))
# Output layer: a single sigmoid unit, kernel weights also zero
model.add(tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer='zeros'))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.01), metrics=['binary_accuracy'])

# Train NN:
model.fit(data, labels, epochs=6000,shuffle=True)
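Why the zero-initialised network does not move can be checked directly. The following is a minimal NumPy sketch, assuming the same 2-2-1 sigmoid architecture, the mean squared error, and zero biases (the default for a Keras `Dense` layer). The backward pass is written out by hand, so none of the names below are Keras API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# All-zero parameters, mirroring kernel_initializer='zeros' with zero biases
W1, b1 = np.zeros((2, 2)), np.zeros(2)
W2, b2 = np.zeros((2, 1)), np.zeros(1)

# Forward pass
h = sigmoid(X @ W1 + b1)      # every hidden activation is 0.5
out = sigmoid(h @ W2 + b2)    # every network output is 0.5

# Backward pass for the loss mean((out - y)**2)
d_out = 2.0 * (out - y) / len(X) * out * (1.0 - out)
grad_W2 = h.T @ d_out
grad_b2 = d_out.sum(axis=0)
d_h = (d_out @ W2.T) * h * (1.0 - h)
grad_W1 = X.T @ d_h
grad_b1 = d_h.sum(axis=0)

for name, g in [("W1", grad_W1), ("b1", grad_b1), ("W2", grad_W2), ("b2", grad_b2)]:
    print(name, "gradient:", g.ravel())
```

All four gradients vanish: the two positive and two negative XOR labels contribute equal and opposite terms, and the hidden-layer gradient is additionally multiplied by the zero output weights. A gradient-based optimiser therefore has nothing to follow from this starting point.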

If we initialize the weights to very large values, training also stops working. This is because the sigmoid is very flat for inputs of large magnitude, so its gradient vanishes.

In [None]:
# Initialize all kernel weights to a very large constant
crazyInit = tf.keras.initializers.Constant(value=1e3)

# We construct the neural network
model = tf.keras.Sequential()
# Hidden layer: 2 units, all kernel weights set to the large constant
model.add(tf.keras.layers.Dense(2, input_dim=2, activation='sigmoid', kernel_initializer=crazyInit))
# Output layer: a single sigmoid unit, kernel weights also set to the large constant
model.add(tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer=crazyInit))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.01), metrics=['binary_accuracy'])

# Train NN:
model.fit(data, labels, epochs=6000,shuffle=True)
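The flatness of the sigmoid can be made concrete. The following is a minimal sketch, assuming float64 NumPy arithmetic rather than the float32 that Keras uses by default. The derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ peaks at 0.25 for $z = 0$ and is numerically zero at $z = 10^3$. With all kernel weights set to 1e3 and zero biases, the output unit of the network above sees a pre-activation of at least 1e3 for every XOR input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Non-negative inputs only, which avoids overflow warnings from np.exp
z = np.array([0.0, 2.0, 10.0, 1000.0])
grads = sigmoid_grad(z)
for zi, gi in zip(z, grads):
    print(f"z = {zi:7.1f}   sigmoid'(z) = {gi:.3e}")
```

Since every weight gradient contains the factor $\sigma'(z)$ of the output unit, a derivative that underflows to zero blocks all updates.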

Solve it "Classification Style"

In [None]:
# XOR
data = np.array([[0.,0.],[0.,1.],[1.,0.],[1.,1.]])
# One-hot labels: column 0 is active when XOR is 1, column 1 when XOR is 0
labels = np.array([[0.,1.],[1.,0.],[1.,0.],[0.,1.]])

# We construct the neural network
model = tf.keras.models.Sequential()
# Hidden layer. Note that the first argument (2) is the dimensionality of the layer's OUTPUT, i.e. the number of hidden units
model.add(tf.keras.layers.Dense(2, input_dim=2, activation='sigmoid'))
# Output layer: one softmax unit per class
model.add(tf.keras.layers.Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['binary_accuracy'])

# train
model.fit(data, labels, epochs=6000,shuffle=True)

In [None]:
model.predict(data)
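The classification-style network combines a two-unit softmax with `binary_crossentropy`. The following minimal NumPy sketch suggests why this is harmless here. It assumes the binary loss is averaged over the output units of each sample; the formulas are written out by hand and the logits are made-up values, not outputs of the trained model. For two softmax outputs and one-hot labels, the averaged binary cross-entropy coincides with the categorical cross-entropy.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# One-hot labels as in the cell above; the logits are arbitrary illustrative numbers
labels = np.array([[0., 1.], [1., 0.], [1., 0.], [0., 1.]])
logits = np.array([[-1.0, 0.5], [2.0, -0.3], [0.1, 0.4], [-2.0, 1.5]])
p = softmax(logits)

# Binary cross-entropy per output unit, averaged over the two units
bce = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)).mean(axis=1)
# Categorical cross-entropy over the two classes
cce = -(labels * np.log(p)).sum(axis=1)

print("binary     :", bce)
print("categorical:", cce)
```

The two softmax outputs sum to one, so both units contribute the same term to the binary loss and the average equals the categorical loss. With more than two classes this identity no longer holds.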