# A Neural Network with just one Neuron?

In general, neural networks represent very complicated functions with a huge number of trainable parameters. To visualize the training process, this chapter focuses on a minimal neural network with just two trainable parameters. It is meant as a short and easy example of a neural network application. For a deeper understanding of neural networks, feel free to follow the next chapters :)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import functools

# Import some common functions created for this notebook
import common

# Set a random state
random_state = 42

### Which problem could be solved?

Since the model we want to train will be extremely simple, we also have to find a task that can be solved with only two trainable parameters.
In the following, we want the neural network to distinguish two sets of points according to their x- and y-coordinates.
Each set of points is drawn from a two-dimensional Gaussian distribution.

One set of points is centered at $\mu_1 = (-1, -1)$ and the second set is distributed around $\mu_2 = (1, 1)$. To get a visible separation but still a relevant overlap, the covariance matrix for both sets of points is the identity matrix, $ cov = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$.
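
The helper `common.generate_gaussian_points` used in the next cell is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal NumPy sketch. The function name `sample_gaussian_points` and its `seed` argument are hypothetical and only serve as an illustration.

```python
import numpy as np

def sample_gaussian_points(mean, cov, n, seed=None):
    """Draw n points from a 2D Gaussian (hypothetical stand-in, not the notebook's helper)."""
    rng = np.random.default_rng(seed)
    # Each row of the result is one (x, y) point
    return rng.multivariate_normal(mean, cov, size=n)

# Example: 1000 points centered at (1, 1) with identity covariance
points = sample_gaussian_points(np.array([1.0, 1.0]), np.eye(2), 1000, seed=42)
print(points.shape)         # (1000, 2)
print(points.mean(axis=0))  # close to the requested mean
```

With an identity covariance matrix, the x- and y-coordinates are independent and each has a standard deviation of 1.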

In [None]:
# Define parameters for the two distributions
mean1 = np.array([-1, -1])
mean2 = np.array([1, 1])

# Covariance matrix (assuming diagonal covariance matrix for simplicity)
cov_matrix = np.eye(2)

# Number of points in each set
num_points = 1000

# Generate points for each distribution
points_set1 = common.generate_gaussian_points(mean1, cov_matrix, num_points)
points_set2 = common.generate_gaussian_points(mean2, cov_matrix, num_points)

In [None]:
# Plot the generated points
plt.scatter(points_set1[:, 0], points_set1[:, 1], label='Set 1', alpha=0.3, s=10, c='darkorange')
plt.scatter(points_set2[:, 0], points_set2[:, 1], label='Set 2', alpha=0.3, s=10, c='blue')

# Set labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Add legend
plt.legend()
plt.show()

As you can see, the two sets of points overlap, so a perfect separation will not be possible. Nevertheless, most points of the two sets lie in clearly different regions. Let's see if this difference can be learned by the network.

Since a network has no concept of "set 1" and "set 2", we assign numeric labels to the points. The points of "set 1" get the label 0 and the points of "set 2" the label 1. The neural network can then be a function that classifies a point according to its coordinates with an output score between 0 and 1.

In [None]:
# Create labels for the two sets (0 for Set 1, 1 for Set 2)
labels_set1 = np.zeros(num_points)
labels_set2 = np.ones(num_points)

# Combine the data and labels
point_coordinates = np.concatenate([points_set1, points_set2], axis=0)
point_labels = np.concatenate([labels_set1, labels_set2])

### How to build a network for classification?

Now we have the labeled input data required to perform a supervised training of a neural network.
In this machine learning course we will use TensorFlow, which is one of the commonly used machine learning libraries.

In [None]:
import tensorflow as tf
from tensorflow.data import Dataset

To classify the points, our neural network needs an input layer with two input neurons representing the x- and y-coordinate of the given points. Usually, the input layer of a network is followed by several hidden layers, but for our example we connect the inputs directly to a single output neuron that yields the score.
<div>
<center>
<img src='figures/DNN_1_neuron.png' width='400'/>
</center>
</div>

The activation of the score neuron is given by its corresponding activation function applied to the weighted sum of the input neurons.
\begin{equation}
score = f_{activation}(w_1 \cdot x + w_2 \cdot y)
\end{equation}

To enforce a classification score between 0 and 1, we use the Sigmoid function as the activation of the score neuron:
\begin{equation}
Sigmoid(x) = \frac{1}{1 + e^{-x}}
\end{equation}

<div>
<center>
<img src='figures/sigmoid.png' width='500'/>
</center>
</div>

Thus the resulting network has two trainable parameters, $w_1$ and $w_2$.
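
To make the formula concrete, the score of this one-neuron network can be written in a few lines of NumPy. This is only an illustrative sketch of the equations above, not the Keras model itself. The function names `sigmoid` and `score` and the example coordinates are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def score(w1, w2, x, y):
    """Output of the one-neuron network: Sigmoid of the weighted sum of the inputs."""
    return sigmoid(w1 * x + w2 * y)

# A few example coordinates (made up for illustration)
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-1.5, 0.5, 2.0])

print(score(0.0, 0.0, x, y))  # all 0.5, since Sigmoid(0) = 0.5
print(score(1.0, 1.0, x, y))  # below 0.5 where x + y < 0, above 0.5 where x + y > 0
```

With both weights at 0, every point gets the score 0.5, so the network cannot distinguish the two sets yet.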

We create this network as a sequential Keras model with two inputs. The activation of its single layer is given by the Sigmoid function, and we don't use any trainable bias for this layer to keep it as simple as possible.
Usually, one uses a random initialization of the trainable parameters, but in order to investigate the training progress in more detail, we initialize the parameters $w_1$ and $w_2$ both to 0.

In [None]:
# Build the neural network with bias fixed at zero
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_dim=2, activation='sigmoid', use_bias=False, kernel_initializer=tf.keras.initializers.Constant([0.0, 0.0]))
])

# Display the model's architecture
model.summary()

### How to train the network?

Training the network means adjusting the weights $w_1$ and $w_2$ so that the resulting score for each point is as close as possible to its actual label. This is achieved by calculating the current score for several points and comparing it with the actual labels.
\begin{equation}
score(w_1, w_2) = Sigmoid \left( 
w_1 \cdot \begin{bmatrix} -1.03 \\ 0.63 \\ -0.36 \\ \vdots \\ 0.86 \\ 1.47 \\ 0.59 \\\end{bmatrix} + w_2 \cdot \begin{bmatrix} -1.21 \\ -2.15 \\ -2.66 \\ \vdots \\ 1.74 \\ 0.07 \\ 1.69 \\\end{bmatrix}
\right)
= \begin{bmatrix} 0.32 \\ 0.49 \\ 0.12 \\ \vdots \\ 0.85 \\ 0.69 \\ 0.55 \\\end{bmatrix}
\qquad \xleftrightarrow{\text{difference} \,=\,  loss} \qquad
label = \begin{bmatrix} \color{BurntOrange}0.00 \\ \color{BurntOrange}0.00 \\ \color{BurntOrange}0.00 \\ \vdots \\ \color{Blue}1.00 \\ \color{Blue}1.00 \\ \color{Blue}1.00 \\\end{bmatrix}
\end{equation}

For a binary classification, the difference between the label and the score (called the loss in machine learning) is usually given by the binary cross entropy (BCE). The BCE depends on both the score and the label, and the smaller the loss, the better the agreement between the score and the label.

\begin{equation}
\begin{split}
loss_\text{BCE}(score, label) & = \color{Blue}-label \cdot log(score) \color{BurntOrange}\,-\,(1-label) \cdot log(1-score) \\
& =
\begin{cases}
     \color{BurntOrange} -log(1-score) & \text{for Set 1 } (label = 0)\\
    \color{Blue} -log(score) & \text{for Set 2 } (label = 1)
\end{cases}
\end{split}
\end{equation}

<div>
<center>
<img src='figures/binary_cross_entropy_points.png' width='600'/>
</center>
</div>
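
The BCE formula above translates directly into NumPy. The following is a minimal sketch. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions made here to avoid evaluating `log(0)`; they are not taken from the notebook.

```python
import numpy as np

def binary_cross_entropy(score, label, eps=1e-7):
    """Per-point BCE loss; eps is an assumed safeguard against log(0)."""
    score = np.clip(score, eps, 1.0 - eps)
    return -label * np.log(score) - (1.0 - label) * np.log(1.0 - score)

# Three points that all belong to Set 2 (label = 1) with different scores
scores = np.array([0.5, 0.9, 0.1])
labels = np.array([1.0, 1.0, 1.0])

print(binary_cross_entropy(scores, labels))  # roughly 0.69, 0.11 and 2.30
```

A confident correct score (0.9) gives a small loss, while a confident wrong score (0.1) is penalized strongly.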

Since our network only has two trainable parameters, we can directly visualize the mean loss for the given points as a function of $w_1$ and $w_2$.

\begin{equation}
\begin{split}
loss_\text{BCE}(w_1, w_2) = & \frac{1}{N} \sum_{i=1}^N \left(\color{Blue}-label_i \cdot log(score_i(w_1, w_2)) \color{BurntOrange}\,-\,(1-label_i) \cdot log(1-score_i(w_1, w_2)) \color{black} \right) \\
score_i(w_1, w_2) = & \, Sigmoid(w_1 \cdot x_i + w_2 \cdot y_i)
\end{split}
\end{equation}
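
As a rough sketch of how such a loss surface can be computed, the mean BCE can be evaluated on a grid of $(w_1, w_2)$ values. The points are regenerated here with plain NumPy so that the example is self-contained. The grid range, the number of points and the helper name `mean_bce` are arbitrary choices for illustration.

```python
import numpy as np

# Regenerate two Gaussian point sets (stand-in for the notebook's data)
rng = np.random.default_rng(42)
pts0 = rng.multivariate_normal([-1.0, -1.0], np.eye(2), 500)  # label 0
pts1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), 500)    # label 1
xy = np.vstack([pts0, pts1])
labels = np.concatenate([np.zeros(500), np.ones(500)])

def mean_bce(w1, w2, xy, labels, eps=1e-7):
    """Mean BCE loss of the one-neuron network for the weights (w1, w2)."""
    z = w1 * xy[:, 0] + w2 * xy[:, 1]
    s = np.clip(1.0 / (1.0 + np.exp(-z)), eps, 1.0 - eps)
    return np.mean(-labels * np.log(s) - (1.0 - labels) * np.log(1.0 - s))

# Evaluate the loss on a regular grid in the (w1, w2) plane
w_grid = np.linspace(-1.0, 4.0, 51)
loss_grid = np.array([[mean_bce(w1, w2, xy, labels) for w2 in w_grid] for w1 in w_grid])

# Locate the grid point with the smallest loss
i, j = np.unravel_index(np.argmin(loss_grid), loss_grid.shape)
print(f"smallest loss {loss_grid[i, j]:.3f} at w1 = {w_grid[i]:.1f}, w2 = {w_grid[j]:.1f}")
```

The array `loss_grid` could then be passed to a surface or contour plot to obtain a picture similar to the one shown here.
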
<div>
<center>
<img src='figures/loss_surface_rotation.gif' width='700'/>
</center>
</div>

As you can see, the mean BCE loss has a clear minimum. At this minimum, the network gives the best classification for the given set of points.
During training, the parameters $w_1$ and $w_2$ are adjusted towards this minimum until there is no further improvement. For the given network, this minimization seems trivial, but note that networks usually have hundreds of thousands of parameters, if not many more. For this reason, special optimization algorithms are used that efficiently search for a minimum of the loss even in very large parameter spaces. With the chosen loss and a suitable optimizer, the network can now be compiled and is ready for training.

In [None]:
# Loss function
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)

# Optimizer
adam_optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.005, beta_1=0.7)

# Compilation
model.compile(optimizer=adam_optimizer, loss=loss_fn)

During training, the average loss is calculated for the given training data, the direction in which the loss decreases in the parameter space is determined, a small step is taken in this direction, and then this process is repeated. However, such a training step is usually not performed on the entire training data set. Instead, the training data set is divided into smaller sets, so-called batches, which are processed one after the other. On the one hand, this is advantageous if you have a very large set of training data: the division into batches speeds up the training, as only a fraction of the training data needs to be evaluated for each step. On the other hand, the use of batches introduces a certain amount of noise into the training, as all batches differ slightly from each other. This noise can prevent the training from getting stuck in small local minima.
A complete run over all batches is referred to as an epoch, in which the network has seen the entire training data set once. The training can then be continued for any number of further epochs until the minimum of the loss has been reached.
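
The loop described above can be sketched by hand for our two-parameter network. Note that this sketch uses plain mini-batch gradient descent, not the Adam optimizer configured in the notebook. The data is regenerated with NumPy, and the learning rate and number of epochs are arbitrary illustrative choices. For the Sigmoid output combined with the BCE loss, the gradient of the mean loss with respect to the weights is the batch mean of $(score_i - label_i) \cdot (x_i, y_i)$.

```python
import numpy as np

# Regenerate two Gaussian point sets (stand-in for the notebook's data)
rng = np.random.default_rng(42)
xy = np.vstack([
    rng.multivariate_normal([-1.0, -1.0], np.eye(2), 500),  # label 0
    rng.multivariate_normal([1.0, 1.0], np.eye(2), 500),    # label 1
])
labels = np.concatenate([np.zeros(500), np.ones(500)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_bce(w, xy, labels, eps=1e-7):
    s = np.clip(sigmoid(xy @ w), eps, 1.0 - eps)
    return np.mean(-labels * np.log(s) - (1.0 - labels) * np.log(1.0 - s))

w = np.zeros(2)        # start at w1 = w2 = 0, as in the notebook
learning_rate = 0.05   # illustrative value
batch_size = 32
loss_before = mean_bce(w, xy, labels)

for epoch in range(5):
    order = rng.permutation(len(xy))            # reshuffle every epoch
    for start in range(0, len(xy), batch_size):
        batch = order[start:start + batch_size]
        scores = sigmoid(xy[batch] @ w)
        # Gradient of the mean BCE over this batch with respect to (w1, w2)
        grad = xy[batch].T @ (scores - labels[batch]) / len(batch)
        w -= learning_rate * grad               # small step against the gradient

loss_after = mean_bce(w, xy, labels)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}, weights: {w}")
```

Each pass of the outer loop is one epoch, and each pass of the inner loop is one training step on a single batch.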

To train our neural network for classification, we combine the point positions with their labels into a TensorFlow dataset. To have evenly distributed training data, the dataset is shuffled for the training. Then the dataset is split into batches of 32 data points each.

In [None]:
train_data = Dataset.from_tensor_slices((point_coordinates, point_labels))
train_data = train_data.shuffle(len(point_coordinates), seed=random_state)
# Set the batch size
train_data = train_data.batch(32)

Before we start the training, let's define a callback that stores the current training parameters $w_1$ and $w_2$ at the end of each epoch. Usually, this is not needed, but it will allow us to visualize the training process in more detail later on.

In [None]:
# Lists to store parameter values during training
weight1_history = [0]
weight2_history = [0]

# Custom callback to store weights during training
class CustomCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        weights = self.model.get_weights()
        weights = weights[0].flatten()
        weight1_history.append(weights[0])
        weight2_history.append(weights[1])

Finally, we can train the network.

In [None]:
# The dataset is already split into batches, so batch_size is not passed to fit()
training_history = model.fit(train_data, epochs=50, callbacks=[CustomCallback()])

After training, the network can be used to give predictions for the given data points.

In [None]:
# Predict the labels for the points
point_predictions = model.predict(point_coordinates)


# Select a few example points from both sets
idx = [0, 400, 800, 1200, 1600]

print(f'The true labels: {point_labels[idx]}')
print(f'The classification: {point_predictions[idx].flatten()}')
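
The network returns a score between 0 and 1 rather than a hard class. One common convention, assumed here and not prescribed by the notebook, is to assign a point to "set 2" if its score is above 0.5. The following sketch shows this with a few made-up example values. The arrays `example_labels` and `example_scores` are hypothetical and not actual model output.

```python
import numpy as np

# Made-up labels and scores for illustration (not actual model output)
example_labels = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
example_scores = np.array([0.10, 0.62, 0.55, 0.93, 0.40])

# Turn scores into hard class predictions with an assumed threshold of 0.5
predicted_classes = (example_scores > 0.5).astype(int)

# Fraction of points whose predicted class matches the true label
accuracy = np.mean(predicted_classes == example_labels)

print(predicted_classes)  # [0 1 1 1 0]
print(accuracy)           # 0.6
```

The same two lines could be applied to `point_predictions.flatten()` and `point_labels` to estimate the accuracy of the trained network.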

Let's visualize the classification of the data points.

In [None]:
# Plot the model output as a histogram
common.plot_dnn_output(point_predictions, point_labels)

# Visualize the classification in a scatter plot
common.classification_scatter_plot(point_coordinates, point_predictions)
_ = plt.show()

To better understand the training process, an interactive plot is provided below. You can choose the training epoch you are interested in to see the training parameters and the classification performance of the network at that epoch.

In [None]:
from ipywidgets import interact
%matplotlib widget

# Give the points and the training history to the visualization function
partial_visualise = functools.partial(
    common.visualise_training_minimal_dnn,
    weight1_history=weight1_history,
    weight2_history=weight2_history,
    point_coordinates=point_coordinates,
    point_labels=point_labels
)

# Define a wrapper function that calls the partial function
def interactive_plot_wrapper(epoch):
    return partial_visualise(epoch=epoch)

# Interactive plot
_ = interact(interactive_plot_wrapper, epoch=(1, 50, 1))