# Introduction

### PLA for basic logic gates

We also examine how well the Perceptron Learning Algorithm (PLA) can represent some extremely simple binary problems: the logical functions NOT, AND, OR, and XOR (whose output equals 1 if and only if the two inputs differ). Because PLA outputs 1 or -1, we replace the zero output of these functions with -1. In the top row of Figure 1 below, the blue squares are points labeled 1 and the red circles are points labeled -1. The bottom row of Figure 1 shows the perceptron models with the corresponding coefficients.

![image.png](attachment:image.png)
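
The behaviour described above can be reproduced with a minimal sketch of PLA. The function name `pla`, the zero initialization, and the `max_epochs` cap are assumptions made for this illustration, not part of the original experiment. The sketch trains on each gate and reports whether an error-free pass over the data was reached.

```python
import numpy as np

def pla(X, y, max_epochs=100):
    """Perceptron Learning Algorithm; returns (w, converged).

    X: (N, d) inputs, y: (N,) labels in {-1, +1}.
    w has d + 1 entries; w[0] is the bias coefficient.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias feature 1
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:   # misclassified, or exactly on the boundary
                w += yi * xi         # PLA update rule
                mistakes += 1
        if mistakes == 0:            # a full pass without errors: done
            return w, True
    return w, False                  # gave up after max_epochs passes

X2 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
gates = {
    "AND": (X2, np.array([-1, -1, -1, 1])),
    "OR":  (X2, np.array([-1, 1, 1, 1])),
    "XOR": (X2, np.array([-1, 1, 1, -1])),
    "NOT": (np.array([[0.0], [1.0]]), np.array([1, -1])),
}

results = {name: pla(X, y) for name, (X, y) in gates.items()}
for name, (w, ok) in results.items():
    print(name, "converged" if ok else "did not converge", w)
```

NOT, AND, and OR are linearly separable, so the loop stops early for them. XOR is not, so the loop runs out of epochs no matter how large `max_epochs` is.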

### XOR in Neural Network

For the XOR function, the data is not linearly separable: no single straight line can divide the red and blue classes, so PLA has no solution. Replacing PLA with Logistic Regression, that is, changing the activation function from sgn to sigmoid, does not help either, because Logistic Regression in essence also creates only linear boundaries. So the neural network models we have seen so far cannot represent this simple logic function.

Notice that if two straight lines are allowed, the XOR function can be represented, as shown in Figure 2 (left) below:

![image.png](attachment:image.png)

The coefficients corresponding to the two lines in Figure 2 (left) are shown in Figure 2 (right) at the blue nodes (there are two shades of blue). The output $a_1^{(1)}$ equals 1 for data points on the (+) side of the line $-2x_1 - 2x_2 + 3 = 0$ and -1 for points on the (-) side. The same holds for $a_2^{(1)}$ and the line $2x_1 + 2x_2 - 1 = 0$.

Because XOR has only one output, one more step is needed: treat $a_1^{(1)}$ and $a_2^{(1)}$ as the inputs of another PLA. In this new PLA, the inputs are the blue nodes together with a bias node, and the output is the red node.

Let's double check:
- If $a_1^{(1)} = 1$ and $a_2^{(1)} = 1$, then the output of the second PLA is $\text{sgn}(1 + 1 - 1) = 1$ (correct for the blue square points, which are labeled 1).
- If $a_1^{(1)} = 1$ and $a_2^{(1)} = -1$ (or the reverse), then the output of the second PLA is $\text{sgn}(1 - 1 - 1) = -1$ (correct for the red circle points, which are labeled -1).
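
The same check can be run over all four inputs. This is a small sketch: the hidden-layer coefficients are the two lines quoted above, and the output coefficients $(1, 1)$ with bias $-1$ are read off from the expression $\text{sgn}(1 + 1 - 1)$. The names `W1`, `b1`, `w2`, `b2`, and `xor_mlp` are chosen here for illustration only.

```python
import numpy as np

def sgn(z):
    # sgn as used by PLA: +1 for z >= 0, -1 otherwise
    return np.where(z >= 0, 1, -1)

# Hidden layer: one row per line, -2*x1 - 2*x2 + 3 and 2*x1 + 2*x2 - 1
W1 = np.array([[-2.0, -2.0],
               [ 2.0,  2.0]])
b1 = np.array([3.0, -1.0])

# Output PLA: a1 + a2 - 1
w2 = np.array([1.0, 1.0])
b2 = -1.0

def xor_mlp(x):
    a1 = sgn(W1 @ x + b1)            # outputs of the two blue nodes
    return int(sgn(w2 @ a1 + b2))    # output of the red node

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_mlp(np.array(p, dtype=float)) for p in inputs]
for p, out in zip(inputs, outputs):
    print(p, out)
# (0, 0) -1
# (0, 1) 1
# (1, 0) 1
# (1, 1) -1
```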

To sum up, we have a model built from several PLA models, which is the simplest form of a neural network, also called a **Multi-Layer Perceptron (MLP)**. The model has 3 layers: an input layer, a hidden layer, and an output layer. A hidden layer is any layer between the input and output layers. The number of hidden layers and the number of nodes in each layer can be adjusted according to the complexity of the problem.

**Some notes:**
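- The Perceptron Learning Algorithm is a simple single-layer neural network model with the activation function ***sgn***. Perceptron is a general name for a Neural Network that has only an input layer and an output layer, with no hidden layer.
- The activation function does not have to be sgn; other nonlinear functions such as sigmoid or tanh can be used. It must be nonlinear, though, because otherwise several layers are equivalent to a single layer.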
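
To see why a nonlinear activation is needed, consider two stacked layers whose activation is the identity. The symbols $W^{(l)}$ and $b^{(l)}$ (weights and biases of layer $l$) are notation assumed here for illustration:

$$
\begin{aligned}
a^{(1)} &= W^{(1)} x + b^{(1)}, \\
a^{(2)} &= W^{(2)} a^{(1)} + b^{(2)} = \left(W^{(2)} W^{(1)}\right) x + \left(W^{(2)} b^{(1)} + b^{(2)}\right).
\end{aligned}
$$

The result again has the form $Wx + b$, so the two layers collapse into one linear layer and the extra layer adds no representational power.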

# Notation and Definitions

### Layer

Besides the *input layer* and the *output layer*, an MLP can have multiple *hidden layers*. Layers are indexed $1, 2, \dots, L$ from input to output, starting after the input layer: the first hidden layer is layer 1, the second hidden layer is layer 2, and so on, and the output layer is layer $L$. The input layer itself can be regarded as layer 0.

### Units

A *node* in a layer is called a *unit*. A unit in the input layer is called an *input unit*, and a unit in the output layer is called an *output unit*. The number of units in layer $l$ is denoted by $n^{[l]}$.

The notation for the input and output of a unit varies between sources. In general, the input of unit $j$ in layer $l$ is denoted by $z_j^{[l]}$, and its output by $a_j^{[l]}$ (the activation of unit $j$ in layer $l$).

![image.png](attachment:image.png)

### Weight and Bias

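There are $L$ weight matrices, one for each layer, denoted by $W^{[l]}$ for $l = 1, \dots, L$. The element at row $i$ and column $j$ of $W^{[l]}$ is the weight from unit $j$ in layer $l-1$ to unit $i$ in layer $l$, so $W^{[l]}$ has $n^{[l]}$ rows and $n^{[l-1]}$ columns. The biases of the units in layer $l$ are collected in the vector $b^{[l]}$, which has $n^{[l]}$ entries.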
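
The shapes implied by this convention can be checked with a short sketch. The helper `init_params`, the layer sizes `[2, 3, 1]`, and the small random initialization are all assumptions made for illustration.

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Create W[l] and b[l] for l = 1..L.

    layer_sizes[0] is the number of input units (layer 0),
    layer_sizes[l] is n^[l], the number of units in layer l.
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n_cur = layer_sizes[l - 1], layer_sizes[l]
        # Row i, column j: weight from unit j in layer l-1 to unit i in layer l
        params[f"W{l}"] = 0.01 * rng.standard_normal((n_cur, n_prev))
        params[f"b{l}"] = np.zeros(n_cur)  # one bias per unit in layer l
    return params

params = init_params([2, 3, 1])  # 2 inputs, 3 hidden units, 1 output unit
for name, value in params.items():
    print(name, value.shape)
# W1 (3, 2)
# b1 (3,)
# W2 (1, 3)
# b2 (1,)
```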

### Activation Function

I will keep this short because I already made a separate notebook for each of them. An activation function takes the weighted sum of the inputs coming from the previous layer (plus the bias), transforms it (typically nonlinearly), and passes the result on to the next layer.

# Backpropagation & Forward Propagation

We will not dive too deep into this here (if you want the details, check my Deep Learning section). In general, training a neural network means adjusting the weights and biases to minimize the difference between the predicted output and the actual output. The gradients that drive these adjustments are computed by an algorithm called backpropagation, while the process of computing the predicted output from the input is called forward propagation.
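
As a rough illustration only, here is a minimal sketch of one forward pass, one backward pass, and one gradient descent step for a network with a single hidden layer. The tanh hidden activation, the linear output, the squared-error loss, the layer sizes, and the learning rate `lr` are all assumptions chosen to keep the example short.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1        # input of the hidden units
    a1 = np.tanh(z1)        # output (activation) of the hidden units
    z2 = W2 @ a1 + b2       # linear output layer: the prediction is z2
    return z1, a1, z2

def loss_and_grads(x, y, W1, b1, W2, b2):
    _, a1, y_hat = forward(x, W1, b1, W2, b2)
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass: apply the chain rule from the output back to the input
    dz2 = y_hat - y                        # dL/dz2
    dW2 = np.outer(dz2, a1)
    db2 = dz2
    dz1 = (W2.T @ dz2) * (1 - a1 ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
x, y = np.array([0.5, -1.0]), np.array([1.0])

lr = 0.01  # small learning rate (assumed value)
loss_before, grads = loss_and_grads(x, y, W1, b1, W2, b2)
# One gradient descent step on every parameter
W1, b1, W2, b2 = (p - lr * g for p, g in zip((W1, b1, W2, b2), grads))
loss_after, _ = loss_and_grads(x, y, W1, b1, W2, b2)
print(loss_before, loss_after)
```

For a small enough learning rate, the loss after the step would typically be lower than before. A finite-difference comparison is a common way to sanity-check gradients like these.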

Overall in this Machine Learning section, we will mostly stick to some basic derivations and formulas.

# Implementation

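- In theory, a network with a single hidden layer can approximate any continuous function arbitrarily well on a bounded domain, given enough hidden units and a suitable nonlinear activation function. In practice, however, finding that number of hidden units and that *nonlinear activation function* is often impossible. Instead, experiments show that Neural Networks with many hidden layers combined with nonlinear activation functions (even ones as simple as ReLU) are better at approximating (representing) the training data.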

- If every unit of one layer is connected to every unit of the next layer (as we are considering in this article), we call it a ***fully connected layer***. Neural Networks made up only of fully connected layers are rarely used in practice. Instead, there are methods that reduce model complexity by reducing the number of connections, either by setting many of them to zero (e.g., sparse autoencoders) or by constraining coefficients to be equal, which reduces the number of coefficients that need to be optimized.