## Artificial neurons - a brief glimpse into the early history of machine learning

In 1943, Warren McCulloch and Walter Pitts published the first concept of a simplified brain cell, the so-called McCulloch-Pitts (MCP) neuron, while trying to understand how the biological brain works in order to design artificial intelligence.

Biological neurons are interconnected nerve cells in the brain that process and transmit chemical and electrical signals, as illustrated in the figure below.

<img src="./imgs/biological_neurons.png">

McCulloch and Pitts simplified the biological neuron to a simple model of an artificial neuron. They described it as a simple logic gate with binary outputs; multiple signals arrive at the dendrites, they are then integrated into the cell body, and, if the accumulated signal exceeds a certain threshold, an output signal is generated that will be passed on by the axon.

The artificial neuron either fires or it doesn't. Whether it fires depends on whether the net input reaches a certain threshold, $\theta$; i.e., the output signal can be thought of as a unit step function.

$$ \text{output} = \begin{cases} \text{1 : net input } \geq \theta \\ \text{0 : otherwise} \end{cases} $$
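The logic-gate view above can be sketched in a few lines of Python. This is only an illustration, not McCulloch and Pitts' original formulation: the function name `mcp_neuron`, the unweighted sum of binary inputs, and the thresholds chosen for AND and OR are assumptions made for this example.

```python
# Hypothetical helper for illustration: an MCP-style neuron with binary
# inputs and a fixed threshold (no learning, no weights).
def mcp_neuron(inputs, threshold):
    """Return 1 (fire) if the summed inputs reach the threshold, else 0."""
    net_input = sum(inputs)
    return 1 if net_input >= threshold else 0


# With two binary inputs, a threshold of 2 behaves like AND
# and a threshold of 1 behaves like OR.
for a in (0, 1):
    for b in (0, 1):
        and_out = mcp_neuron([a, b], threshold=2)
        or_out = mcp_neuron([a, b], threshold=1)
        print(f"a={a} b={b} AND={and_out} OR={or_out}")
```

Changing only the threshold changes which gate the neuron implements, which is why the threshold plays such a central role in the formal definition.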

In 1957, Frank Rosenblatt published the first concept of the perceptron learning rule, based on the MCP neuron model.

With his perceptron rule, Rosenblatt proposed an algorithm that would automatically *learn the optimal weight coefficients* that would *then be multiplied with the input features* in order to make the decision of whether a neuron fires (transmits a signal) or not.

In the context of supervised learning and classification, such an algorithm could then be used to predict whether a new data point belongs to one class or the other.

### The formal definition of an artificial neuron

Formally, we can describe the concept of MCP (McCulloch-Pitts) neurons in the context of a binary classification task with two classes: 0 and 1.

The weight vector $w$ contains real-valued weights, and $x$ is a vector of the input features:

$$ w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix} \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} $$

The output of a neuron is then a function of the net input, $z$, where $z$ is the sum of the weighted inputs:

$$ z = w_1x_1 + \dots + w_mx_m = \sum_{j=1}^{m}w_jx_j = w^Tx $$
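As a minimal sketch, the net input is just a dot product of the weight and feature vectors. The function name `net_input` and the example values below are made up for illustration.

```python
# Hypothetical helper for illustration: z = w^T x as a plain Python sum.
def net_input(w, x):
    """Return the weighted sum of the inputs, sum_j w_j * x_j."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of elements")
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


w = [0.5, -1.0, 2.0]   # example weights (assumed values)
x = [2.0, 1.0, 0.5]    # example feature values (assumed values)
z = net_input(w, x)    # 0.5*2.0 + (-1.0)*1.0 + 2.0*0.5
print(z)
```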

Now, if the net input of a particular observation, $x^{(i)}$, is greater than or equal to a defined threshold, $\theta$, the neuron fires and takes the value 1; otherwise, it takes the value 0. The decision function is simply:

$$ \sigma(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{otherwise} \end{cases} $$

**`Note:`** Here the vector of input features, $x$, is an $(m \times 1)$ dimensional matrix, also called a column vector, because each observation $i$ has $m$ features. If we wrote the feature matrix for all observations as a whole, it would be an $(n \times m)$ dimensional matrix, where $n$ is the number of observations and $m$ is the number of features. The weights then form an $(m \times p)$ dimensional matrix, where $m$ is the number of columns in the feature matrix and $p$ is the number of outputs, and the output of the neuron, $y$, is an $(n \times p)$ dimensional matrix. Note that the weight vector $w$ is the same for all observations; only the input feature vector $x$ changes.
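To make the shapes in the note concrete, here is a small sketch that applies the same weights to every row of a feature matrix. The helper names `net_input_batch` and `step_batch`, the nested-list representation, and all numbers are assumptions for this example; in practice a numerical library would usually handle the matrix product.

```python
# Hypothetical helpers for illustration, using nested lists as matrices.
def net_input_batch(X, W):
    """Multiply an (n x m) feature matrix X by an (m x p) weight matrix W."""
    n_features = len(W)
    n_outputs = len(W[0])
    return [
        [sum(row[j] * W[j][k] for j in range(n_features)) for k in range(n_outputs)]
        for row in X
    ]


def step_batch(Z, theta):
    """Apply the threshold decision function to every net input."""
    return [[1 if z >= theta else 0 for z in row] for row in Z]


X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # n = 3 observations, m = 2 features
W = [[0.4], [-0.2]]                         # m = 2 rows, p = 1 output
theta = 0.5                                 # assumed threshold

Z = net_input_batch(X, W)   # (n x p) net inputs
Y = step_batch(Z, theta)    # (n x p) outputs
print(len(Y), len(Y[0]))    # number of rows and columns of the output
print(Y)
```

Only `X` varies from row to row; `W` and `theta` are shared across all observations.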

To simplify the code implementation, we move the threshold to the left-hand side and rewrite the firing condition $z \geq \theta$ as:

$$z - \theta \geq 0$$ 

And we also define a **bias unit,** 

$$b = -\theta$$

So that the net input $z$ becomes:

$$ z = w_1x_1 + \dots + w_mx_m + b = \sum_{j=1}^{m}w_jx_j + b = w^Tx + b $$

And thus the decision function becomes:

$$ \sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases} $$
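The following sketch checks that the threshold form and the bias form give the same decisions when $b = -\theta$. The function names and sample values are illustrative assumptions; the values were chosen to be exactly representable as floats, so the comparison at the decision boundary is not disturbed by rounding.

```python
# Hypothetical helpers for illustration: two equivalent decision functions.
def predict_threshold(w, x, theta):
    """Fire if w^T x >= theta."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 if z >= theta else 0


def predict_bias(w, x, b):
    """Fire if w^T x + b >= 0."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if z >= 0 else 0


w = [0.5, -1.0]   # assumed weights
theta = 0.25      # assumed threshold
b = -theta        # bias unit as defined above

samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.25], [2.0, 0.5]]
with_threshold = [predict_threshold(w, x, theta) for x in samples]
with_bias = [predict_bias(w, x, b) for x in samples]
print(with_threshold, with_bias)
```

The third sample lies exactly on the decision boundary ($w^Tx = \theta$), and both forms classify it as 1.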

### The perceptron learning rule

Rosenblatt's initial perceptron rule is fairly simple and can be summarized by the following steps:

1. Initialize the weights to 0 or small random numbers.
2. For each training sample $x^{(i)}$ perform the following steps:
    - Compute the output value, $\hat{y}$.
    - Update the weights and bias unit.

The weights and bias are updated based on the following rules:

$$ w_j := w_j + \Delta w_j $$
$$ \Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)})x_j^{(i)} $$

$$ b := b + \Delta b $$
$$ \Delta b = \eta (y^{(i)} - \hat{y}^{(i)}) $$

Where, 
- $\eta$ is the learning rate (a constant between 0.0 and 1.0)
- $y^{(i)}$ is the true class label of the $i$ th training sample 
- $\hat{y}^{(i)}$ is the predicted class label of the $i$ th training sample
- $x_j^{(i)}$ is the $j$ th feature value of the $i$ th training sample
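The steps and update rules above can be sketched as a short training loop. This is a minimal illustration, not a reference implementation: the function names, the zero initialization, the learning rate of 0.5, the fixed number of epochs, and the AND-gate toy data are all assumptions chosen for the example.

```python
# Hypothetical helpers for illustration of the perceptron learning rule.
def predict(w, b, x):
    """Decision function: 1 if w^T x + b >= 0, else 0."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if z >= 0 else 0


def train_perceptron(X, y, eta=0.5, n_epochs=10):
    """Run the perceptron rule over the data for a fixed number of epochs."""
    w = [0.0] * len(X[0])  # step 1: initialize the weights to 0
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):  # step 2: visit each training sample
            # eta * (y - y_hat); this is 0 when the prediction is correct
            update = eta * (y_i - predict(w, b, x_i))
            w = [w_j + update * x_j for w_j, x_j in zip(w, x_i)]
            b += update
    return w, b


# Toy data (assumed): the logical AND of two binary inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

Because the update is zero whenever $\hat{y}^{(i)} = y^{(i)}$, the weights and bias change only on misclassified samples. On this linearly separable toy problem the loop stops changing the parameters well before the ten epochs are over.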