## **2. Neural Networks**

### **2.1 Perceptron**

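* The **perceptron** is the basic building block of neural networks.
* It was introduced by Frank Rosenblatt in the late 1950s and described at length in his 1962 book *Principles of Neurodynamics*.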

1. **Inputs**: The perceptron receives one or more input values, often denoted as $x_1, x_2, \dots, x_n$.

2. **Weights**: Each input has a corresponding weight ($w_1, w_2, \dots, w_n$) which indicates the importance of that input.

3. **Weighted Sum**: The perceptron calculates a weighted sum of the inputs:
   
   $$
   z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
   $$

   Here, $b$ is the **bias** term, which helps shift the decision boundary.

4. **Activation Function**: The weighted sum $z$ is passed through an activation function to produce the output. In a basic perceptron, this is usually a **step function**, which outputs a binary value (e.g., 0 or 1):
   
   $$
   \mathrm{output} =
   \begin{cases}
   1 & \text{if } z \geq 0 \\
   0 & \text{if } z < 0
   \end{cases}
   $$
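The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `perceptron_output` and the hand-picked weights (chosen so the unit computes logical AND) are assumptions made for this example.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Step-activated perceptron: 1 if w.x + b >= 0, else 0."""
    z = np.dot(w, x) + b          # weighted sum of the inputs plus bias
    return 1 if z >= 0 else 0     # step activation

# Hypothetical weights, chosen by hand so that the unit behaves like AND.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron_output(np.array(x, dtype=float), w, b) for x in inputs]
print(outputs)  # [0, 0, 0, 1]
```

Only the input $(1, 1)$ pushes the weighted sum above zero ($1 + 1 - 1.5 = 0.5$), so it is the only one mapped to 1.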

#### **Functionality**

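A single perceptron can be used for **regression** or **classification**, depending on how its output is read:

* **Regression**: with a linear (identity) output, the unit returns the weighted sum $z$ directly, which is a linear fit to the data.
* **Classification**: with the step function, the perceptron is a linear classifier that separates data into two classes using a **decision boundary** (a line, plane, or hyperplane depending on the number of dimensions). During training, the perceptron adjusts its weights and bias so that the training inputs fall on the correct side of this boundary.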
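The notes do not spell out how the weights are adjusted. One common choice is the classic perceptron learning rule, which nudges the weights by `lr * (target - output) * x` after each example. The sketch below assumes that rule; the function name `train_perceptron`, the learning rate, and the AND dataset are all illustrative choices.

```python
import numpy as np

def train_perceptron(X, t, lr=1.0, epochs=20):
    """Train a step-activated perceptron with the classic perceptron rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            y_n = 1 if np.dot(w, x_n) + b >= 0 else 0   # current prediction
            error = t_n - y_n                           # 0 when correct, +1 or -1 otherwise
            w += lr * error * x_n                       # move the boundary toward the example
            b += lr * error
    return w, b

# Logical AND is linearly separable, so the rule converges on it.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t_and = np.array([0, 0, 0, 1])

w_and, b_and = train_perceptron(X_and, t_and)
pred_and = [1 if np.dot(w_and, x) + b_and >= 0 else 0 for x in X_and]
print(pred_and)  # [0, 0, 0, 1]
```

Weights change only when an example is misclassified, so once every training point is on the correct side of the boundary the parameters stop moving.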

#### **Limitations**
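- The basic perceptron can only solve **linearly separable problems**, that is, problems where a single line (or, in higher dimensions, a hyperplane) separates the two classes.
- It cannot represent non-linear decision boundaries. The classic counterexample is XOR: no single line puts $(0,1)$ and $(1,0)$ on one side and $(0,0)$ and $(1,1)$ on the other.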
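The XOR limitation can be checked numerically. The sketch below brute-forces a small grid of weights and biases (the grid range and resolution are arbitrary assumptions) and records the best accuracy any single step unit reaches on the four XOR points.

```python
import itertools
import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t_xor = np.array([0, 1, 1, 0])

# Candidate values for w1, w2 and b: -2.0, -1.5, ..., 2.0 (an arbitrary grid).
grid = np.linspace(-2.0, 2.0, 9)

best_acc = 0.0
for w1, w2, b in itertools.product(grid, repeat=3):
    pred = (X_xor @ np.array([w1, w2]) + b >= 0).astype(int)  # step activation
    best_acc = max(best_acc, float(np.mean(pred == t_xor)))

print(best_acc)  # 0.75
```

At most three of the four points are ever classified correctly. A finer grid would not help, because no linear boundary separates the XOR classes.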

#### **Usage and Evolution**
- A single-layer perceptron is limited to simple classification tasks.
- A **Multi-Layer Perceptron (MLP)**, which includes multiple layers of perceptrons with non-linear activation functions, can solve more complex tasks and is the foundation of modern neural networks.
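To illustrate why stacking layers helps, here is a two-layer network of step units that computes XOR. The weights are set by hand rather than learned, and the names `step` and `xor_mlp` are made up for this sketch: one hidden unit acts as OR, the other as AND, and the output fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    """Element-wise step activation."""
    return (z >= 0).astype(int)

def xor_mlp(X):
    """Two-layer network with hand-picked (not learned) weights that computes XOR."""
    W_hidden = np.array([[1.0, 1.0],    # hidden unit 0: OR  (fires if x1 + x2 >= 0.5)
                         [1.0, 1.0]])   # hidden unit 1: AND (fires if x1 + x2 >= 1.5)
    b_hidden = np.array([-0.5, -1.5])
    h = step(X @ W_hidden.T + b_hidden)

    w_out = np.array([1.0, -1.0])       # output: OR and not AND
    b_out = -0.5
    return step(h @ w_out + b_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_pred = xor_mlp(X)
print(xor_pred.tolist())  # [0, 1, 1, 0]
```

The hidden layer re-maps the inputs into a space where the two classes become linearly separable, which is what a single perceptron lacks.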

### **2.2 Multiclass Perceptron with Softmax**

#### 1. **Model Structure**
- Consider an input vector $ \mathbf{x} = [x_1, x_2, \dots, x_d] $, where $ d $ is the number of input features.
- The model aims to classify $ \mathbf{x} $ into one of $ k $ classes: $ C_1, C_2, \dots, C_k $. (The symbol $ y_i $ is reserved below for the predicted probability of class $ C_i $.)
- Each class $ C_i $ has a corresponding weight vector $ \mathbf{w}_i = [w_{i1}, w_{i2}, \dots, w_{id}] $ and a bias term $ w_{i0} $.

#### 2. **Score Calculation**
- The **score** $ o_i $ for class $ C_i $ is computed as:

$$
o_i = \mathbf{w}_i^T \mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
$$

  where:
  - $ \mathbf{w}_i^T $ is the transpose of the weight vector for class $ C_i $.
  - $ w_{i0} $ is the bias term for class $ C_i $.
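Stacking the per-class weight vectors as rows of a matrix turns the score formula into a single matrix-vector product. The sketch below assumes that layout (`W` with shape `(k, d)`, biases `w0` with shape `(k,)`); the function name and the numbers are illustrative only.

```python
import numpy as np

def class_scores(x, W, w0):
    """Return o, where o[i] = w_i . x + w_i0 (W has one row per class)."""
    return W @ x + w0

# Hypothetical parameters for k = 3 classes and d = 2 features.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.5, 1.0])
x = np.array([2.0, 1.0])

o = class_scores(x, W, w0)
print(o)  # [ 2.   1.5 -2. ]
```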

#### 3. **Softmax Activation for Class Probabilities**
- To convert the raw scores $ o_i $ into **class probabilities**, we apply the **softmax** function:

$$
y_i = \frac{\exp(o_i)}{\sum_{j=1}^{k} \exp(o_j)}
$$

  - $ \exp $ is the exponential function.
  - The output $ y_i $ is the predicted probability of class $ i $. Every $ y_i $ lies strictly between 0 and 1 and the outputs sum to 1, so they form a proper probability distribution over the classes.
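A direct implementation of the softmax formula can overflow when scores are large. A common workaround, assumed in this sketch, is to subtract the maximum score first. This leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = o - np.max(o)     # largest entry becomes 0, so exp() cannot overflow
    e = np.exp(shifted)
    return e / e.sum()

y = softmax(np.array([2.0, 1.5, -2.0]))
print(y.sum())                                # 1.0 up to rounding

# Large scores that would overflow a naive exp() are handled correctly.
y_large = softmax(np.array([1000.0, 1000.0]))
print(y_large)                                # [0.5 0.5]
```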

#### 4. **Prediction**
- The final predicted class $ \hat{y} $ is the one with the highest probability, where the index $ i $ ranges over the $ k $ classes:

$$
\hat{y} = \arg\max_{i \in \{1, \dots, k\}} y_i
$$
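In code, the prediction step is a single `argmax`. The sketch below uses made-up scores and also illustrates that, because softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same class. Note that NumPy indices are zero-based, so index 0 corresponds to the first class.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # shift by the max for numerical stability
    return e / e.sum()

o = np.array([2.0, 1.5, -2.0])  # hypothetical class scores
y = softmax(o)

y_hat = int(np.argmax(y))               # index of the most probable class
same_from_scores = int(np.argmax(o))    # softmax does not change the ranking
print(y_hat, same_from_scores)          # 0 0
```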

#### 5. **Loss Function**
- The Multiclass Perceptron with softmax typically uses the **Cross-Entropy Loss** to measure prediction error:

$$
L = -\sum_{i=1}^{k} r_i \log(y_i)
$$

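  - Here, $ r_i $ is the $ i $-th component of the one-hot target vector: $ r_i = 1 $ if $ \mathbf{x} $ belongs to class $ i $ and $ r_i = 0 $ otherwise. For a single example the loss therefore reduces to $ -\log(y_c) $, where $ c $ is the true class.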
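As a small worked example, the loss for one training example can be computed as follows. The probabilities and target are made up, and the clipping constant `eps` is an added safeguard against `log(0)` that is not part of the formula above.

```python
import numpy as np

def cross_entropy(y, r, eps=1e-12):
    """Cross-entropy between predicted probabilities y and one-hot target r."""
    y_safe = np.clip(y, eps, 1.0)          # assumption: guard against log(0)
    return float(-np.sum(r * np.log(y_safe)))

y = np.array([0.1, 0.7, 0.2])   # hypothetical softmax output
r = np.array([0.0, 1.0, 0.0])   # true class is the second one

loss = cross_entropy(y, r)
print(loss)                     # equals -log(0.7), roughly 0.357
```

Only the term for the true class contributes, so a confident correct prediction (probability near 1) gives a loss near 0, while a confident wrong one gives a large loss.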

#### 6. **Gradient Descent Weight Update**
- The weights and biases are updated using **Gradient Descent** based on the Cross-Entropy Loss.
- For weight $ w_{ij} $ (weight for the $ j $-th feature in class $ i $), the update $ \Delta w_{ij} $ is given by:

$$
\Delta w_{ij} = - \eta \frac{\partial L}{\partial w_{ij}}
$$

  where:
  - $ \eta $ is the **learning rate**.

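- For softmax combined with cross-entropy, $ \partial L / \partial o_i = y_i - r_i $ (using $ \sum_i r_i = 1 $), and $ \partial o_i / \partial w_{ij} = x_j $, so the chain rule gives the gradient with respect to $ w_{ij} $: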

$$
\frac{\partial L}{\partial w_{ij}} = (y_i - r_i) x_j
$$

  Therefore, the weight update becomes:

$$
\Delta w_{ij} =\eta (r_i - y_i) x_j
$$

- For the bias $ w_{i0} $, which acts as a weight on a constant input $ x_0 = 1 $, the same rule gives the update $ \Delta w_{i0} $:

$$
\Delta w_{i0} = \eta (r_i - y_i)
$$
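The gradient formula can be sanity-checked numerically. The sketch below compares $(y_i - r_i) x_j$ against central finite differences of the loss on random parameters, then applies one update step and confirms the loss goes down. The random seed, dimensions, step size `h`, and learning rate are all arbitrary assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def loss_fn(W, w0, x, r):
    """Cross-entropy loss of a softmax model on a single example."""
    y = softmax(W @ x + w0)
    return float(-np.sum(r * np.log(y)))

rng = np.random.default_rng(0)
k, d = 3, 4
W = rng.normal(size=(k, d))
w0 = rng.normal(size=k)
x = rng.normal(size=d)
r = np.array([0.0, 1.0, 0.0])          # one-hot target

y = softmax(W @ x + w0)
grad_W = np.outer(y - r, x)            # analytic gradient: (y_i - r_i) * x_j

# Central finite differences, one weight at a time.
h = 1e-6
numeric_W = np.zeros_like(W)
for i in range(k):
    for j in range(d):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += h
        W_minus[i, j] -= h
        numeric_W[i, j] = (loss_fn(W_plus, w0, x, r) - loss_fn(W_minus, w0, x, r)) / (2 * h)

max_diff = float(np.max(np.abs(grad_W - numeric_W)))

# One gradient descent step with a small learning rate.
eta = 0.01
loss_before = loss_fn(W, w0, x, r)
W_new = W + eta * np.outer(r - y, x)   # Delta w_ij = eta * (r_i - y_i) * x_j
w0_new = w0 + eta * (r - y)            # Delta w_i0 = eta * (r_i - y_i)
loss_after = loss_fn(W_new, w0_new, x, r)

print(max_diff, loss_before, loss_after)
```

A tiny `max_diff` indicates that the analytic and numerical gradients agree, and `loss_after` should be smaller than `loss_before`.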

#### **Summary of Key Formulas**

1. **Score Computation**:
   
$$
o_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
$$

2. **Softmax Probability**:
   
$$
y_i = \frac{\exp(o_i)}{\sum_{j=1}^{k} \exp(o_j)}
$$

3. **Cross-Entropy Loss**:
   
$$
L = -\sum_{i=1}^{k} r_i \log(y_i)
$$

4. **Weight Update**:
   
$$
\Delta w_{ij} = \eta (r_i - y_i) x_j
$$
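The four formulas combine into a short training loop. The sketch below is one possible arrangement, not a definitive implementation: it uses full-batch gradient descent with the update averaged over all examples (the formulas above are stated for a single example), and the six-point dataset, learning rate, and epoch count are made up for illustration.

```python
import numpy as np

def softmax_rows(O):
    """Row-wise stable softmax: one probability vector per example."""
    O = O - O.max(axis=1, keepdims=True)
    E = np.exp(O)
    return E / E.sum(axis=1, keepdims=True)

# Toy dataset (assumed): three linearly separable classes in 2-D.
X = np.array([[-2.0, -1.0], [-2.0, 1.0],
              [ 2.0, -1.0], [ 2.0, 1.0],
              [ 0.0,  3.0], [ 0.0, 4.0]])
labels = np.array([0, 0, 1, 1, 2, 2])
k = 3
R = np.eye(k)[labels]                 # one-hot targets, shape (n, k)

W = np.zeros((k, X.shape[1]))         # one weight row per class
w0 = np.zeros(k)
eta = 0.1
losses = []

for epoch in range(5000):
    Y = softmax_rows(X @ W.T + w0)                            # scores, then softmax
    losses.append(float(-np.mean(np.sum(R * np.log(Y), axis=1))))  # mean cross-entropy
    W += eta * (R - Y).T @ X / len(X)                         # averaged eta * (r_i - y_i) * x_j
    w0 += eta * (R - Y).mean(axis=0)                          # averaged eta * (r_i - y_i)

pred = np.argmax(X @ W.T + w0, axis=1)
accuracy = float(np.mean(pred == labels))
print(losses[0], losses[-1], accuracy)
```

With zero initial weights every class starts at probability $1/3$, so the first recorded loss is $\log 3 \approx 1.099$. On this separable toy set the loss then falls steadily and the training accuracy reaches 1.0.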
