A feedforward neural network is a type of artificial neural network in which information moves strictly in one direction: from the input layer, through one or more hidden layers, to the output layer. There are no cycles or loops, so each neuron receives input only from the previous layer and passes its output only to the next layer. When it contains one or more hidden layers of fully connected neurons, this architecture is often called a Multi-Layer Perceptron (MLP).

Below is a detailed breakdown, including the essential math:

---

### 1. **Architecture Overview**

- **Input Layer:**  
  The network receives the input data as a vector. For example, if your data has \( n \) features, then the input is represented as  
  \[
  \mathbf{x} = [x_1, x_2, \dots, x_n]^T \in \mathbb{R}^n.
  \]

- **Hidden Layers:**  
  One or more layers where neurons process the input using weighted sums followed by an activation function. Each hidden layer transforms its input into a higher-level representation.

- **Output Layer:**  
  This layer produces the final output of the network. The nature of the output (e.g., probabilities for classification, continuous values for regression) determines the choice of activation function in this layer.

---

### 2. **Mathematical Formulation of a Single Layer**

Consider a layer with an input vector \(\mathbf{x}\), weight matrix \(\mathbf{W}\), bias vector \(\mathbf{b}\), and activation function \(f\). The computations in this layer are:

1. **Linear Transformation (Weighted Sum):**
   \[
   \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b},
   \]
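   where, for a layer with \(m\) neurons and an input \(\mathbf{x} \in \mathbb{R}^{n}\):
   - \(\mathbf{W} \in \mathbb{R}^{m \times n}\), with row \(i\) holding the weights of neuron \(i\),
   - \(\mathbf{b} \in \mathbb{R}^{m}\), so that \(\mathbf{z} \in \mathbb{R}^{m}\).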

2. **Activation:**
   \[
   \mathbf{a} = f(\mathbf{z}).
   \]
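   The function \(f\) is applied element-wise and is typically non-linear; without a non-linearity, stacked layers would collapse into a single linear map. Common choices are: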
   - **Sigmoid:** \(\sigma(z) = \frac{1}{1 + e^{-z}}\),
   - **Hyperbolic Tangent (tanh):** \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\),
   - **ReLU:** \(\text{ReLU}(z) = \max(0, z)\).
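
The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper name `dense_layer` and the numeric values of `W`, `b`, and `x` are assumptions chosen for the example, and plain lists are used only to keep the sketch dependency-free.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def dense_layer(W, b, x, f):
    """Compute a = f(W x + b) for one layer.

    W has m rows of n weights, b has m entries, x has n entries.
    """
    # Linear transformation: one weighted sum plus bias per neuron.
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    # Activation, applied element-wise.
    return [f(z_i) for z_i in z]


# Illustrative values (not from the text): m = 2 neurons, n = 3 inputs.
W = [[0.5, -1.0, 2.0],
     [1.0, 0.0, -0.5]]
b = [0.0, 1.0]
x = [1.0, 2.0, 3.0]

a_relu = dense_layer(W, b, x, relu)        # z = [4.5, 0.5], so a = [4.5, 0.5]
a_tanh = dense_layer(W, b, x, math.tanh)
a_sigmoid = dense_layer(W, b, x, sigmoid)
```

Passing the activation as an argument keeps the linear step and the non-linearity separate, mirroring the two-step formulation above.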

---

### 3. **Stacking Layers: The Complete Feedforward Process**

In a network with \(L\) layers (including the output layer), the forward pass through the network can be written as:

- **First Hidden Layer:**
  \[
  \mathbf{a}^{(1)} = f^{(1)}\left(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\right)
  \]

- **Subsequent Hidden Layers:**
  For layer \( l \) (where \( 2 \leq l \leq L-1 \)):
  \[
  \mathbf{a}^{(l)} = f^{(l)}\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)
  \]

- **Output Layer:**
  \[
  \mathbf{y} = f^{(L)}\left(\mathbf{W}^{(L)} \mathbf{a}^{(L-1)} + \mathbf{b}^{(L)}\right)
  \]

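Here, each \( \mathbf{W}^{(l)} \) and \( \mathbf{b}^{(l)} \) are the weight matrix and bias vector for layer \( l \), and each \( f^{(l)} \) is the activation function applied in that layer. It is convenient to write \( \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \) for the pre-activation of layer \( l \), with the conventions \( \mathbf{a}^{(0)} = \mathbf{x} \) and \( \mathbf{a}^{(L)} = \mathbf{y} \).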
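
As a rough sketch, the forward pass is a loop that feeds each layer's activation into the next. The `forward` function, the `(W, b, f)` tuple layout, and the 2-3-1 network below are hypothetical choices made for illustration; they are not prescribed by the formulation above.

```python
def relu(z):
    """ReLU activation for the hidden layer."""
    return max(0.0, z)


def identity(z):
    """Linear output activation, as might be used for regression."""
    return z


def forward(layers, x):
    """Run the forward pass; `layers` is a list of (W, b, f) tuples, one per layer."""
    a = x  # a^(0) = x
    for W, b, f in layers:
        # z^(l) = W^(l) a^(l-1) + b^(l)
        z = [sum(w * a_j for w, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        # a^(l) = f^(l)(z^(l))
        a = [f(z_i) for z_i in z]
    return a  # a^(L) = y


# Hypothetical network: 2 inputs -> 3 hidden neurons (ReLU) -> 1 output (linear).
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 1.0], relu),
    ([[1.0, 2.0, -1.0]], [0.5], identity),
]

y = forward(layers, [2.0, 1.0])
# Hidden activations are [1.0, 1.5, 1.0], so y = [1.0 + 3.0 - 1.0 + 0.5] = [3.5]
```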

---

### 4. **Training the Network: Backpropagation and Gradient Descent**

The goal during training is to adjust the weights and biases so that the network's output \(\mathbf{y}\) is as close as possible to the desired target \(\mathbf{t}\). This is typically done by minimizing a loss (or cost) function \(J\).

- **Example Loss Functions:**
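  - **Mean Squared Error (MSE):** shown here for a single example as a sum of squared errors, with the factor \(\frac{1}{2}\) chosen so that it cancels when differentiating:
    \[
    J(\mathbf{y}, \mathbf{t}) = \frac{1}{2} \sum_{i} (y_i - t_i)^2.
    \]
  - **Cross-Entropy Loss (for classification):** usually paired with a softmax activation in the output layer; for a one-hot target \(\mathbf{t}\) it takes the form
    \[
    J(\mathbf{y}, \mathbf{t}) = -\sum_{i} t_i \log y_i.
    \]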

- **Backpropagation:**
  This algorithm computes the gradient of the loss function with respect to each weight and bias using the chain rule. The key steps are:

  1. **Compute the error at the output layer:**
     \[
     \delta^{(L)} = \nabla_{\mathbf{a}^{(L)}} J \odot f'^{(L)}(\mathbf{z}^{(L)}),
     \]
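     where \(\mathbf{a}^{(L)} = \mathbf{y}\) is the network output, \(\mathbf{z}^{(L)}\) is the pre-activation (weighted sum plus bias) of the output layer, \(\odot\) denotes the element-wise (Hadamard) product, and \( f'^{(L)} \) is the derivative of the activation function at the output layer. This form assumes an element-wise output activation; for softmax combined with cross-entropy, the output error simplifies to \(\delta^{(L)} = \mathbf{y} - \mathbf{t}\).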

  2. **Propagate the error backward for hidden layers:**
     For \( l = L-1, L-2, \dots, 1 \):
     \[
     \delta^{(l)} = \left( (\mathbf{W}^{(l+1)})^T \delta^{(l+1)} \right) \odot f'^{(l)}(\mathbf{z}^{(l)}).
     \]

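  3. **Gradient Descent Update:**
     The gradients follow directly from the errors: \(\frac{\partial J}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{a}^{(l-1)})^T\) and \(\frac{\partial J}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}\), where \(\mathbf{a}^{(0)} = \mathbf{x}\). The weights and biases are then updated using a learning rate \(\eta\):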
     \[
     \mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \, \frac{\partial J}{\partial \mathbf{W}^{(l)}},
     \]
     \[
     \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta \, \frac{\partial J}{\partial \mathbf{b}^{(l)}}.
     \]
  
  This cycle of forward pass, loss computation, backpropagation, and parameter update is repeated until a stopping criterion is met, for example when the loss stops improving or a fixed number of iterations is reached.
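
The training steps above can be put together in a small, dependency-free sketch. Everything specific here is an assumption made for illustration: the function names (`forward`, `loss`, `backprop`, `sgd_step`), the 2-2-1 network with sigmoid activations in every layer, the initial parameters, the single training example, and the learning rate. The loss is the half sum of squared errors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the activations of every layer, starting with a^(0) = x."""
    activations = [x]
    for W, b in params:
        a_prev = activations[-1]
        z = [sum(w * a for w, a in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        activations.append([sigmoid(z_i) for z_i in z])
    return activations


def loss(y, t):
    """Half sum of squared errors: J = 1/2 * sum_i (y_i - t_i)^2."""
    return 0.5 * sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, t))


def backprop(params, x, t):
    """Return a list of (dJ/dW, dJ/db) pairs, one per layer."""
    activations = forward(params, x)
    y = activations[-1]
    # Output error: dJ/dy = (y - t); for the sigmoid, f'(z) = a * (1 - a).
    delta = [(y_i - t_i) * y_i * (1.0 - y_i) for y_i, t_i in zip(y, t)]
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        a_prev = activations[l]  # input to the layer stored at params[l]
        # dJ/dW is the outer product of delta and a_prev; dJ/db is delta.
        grads[l] = ([[d * a for a in a_prev] for d in delta], list(delta))
        if l > 0:
            W = params[l][0]
            # Propagate backward: (W^T delta), element-wise times f'(z).
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads


def sgd_step(params, grads, eta):
    """Return new parameters after W <- W - eta*dJ/dW and b <- b - eta*dJ/db."""
    return [([[w - eta * g for w, g in zip(w_row, g_row)]
              for w_row, g_row in zip(W, dW)],
             [b_i - eta * g for b_i, g in zip(b, db)])
            for (W, b), (dW, db) in zip(params, grads)]


# Hypothetical 2-2-1 network and a single training example.
params = [
    ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]),
    ([[0.5, -0.6]], [0.2]),
]
x, t = [1.0, 0.5], [1.0]
eta = 0.5

initial_loss = loss(forward(params, x)[-1], t)
for _ in range(200):
    params = sgd_step(params, backprop(params, x, t), eta)
final_loss = loss(forward(params, x)[-1], t)
```

After the loop, `final_loss` should be smaller than `initial_loss`. A practical setup would average gradients over mini-batches and monitor held-out data, which this sketch leaves out for brevity.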

---

### 5. **Summary**

- **Feedforward Neural Network:**  
  A network where data flows in one direction only, from input to output, with no cycles or loops.

- **Core Computation:**  
  Each layer computes a weighted sum of its inputs, adds a bias, and then applies an activation function, which is typically non-linear.

- **Mathematics:**  
  The operation for each layer can be expressed as:
  \[
  \mathbf{a} = f(\mathbf{W}\mathbf{x} + \mathbf{b}),
  \]
  and in networks with multiple layers, this process is repeated sequentially.

- **Training:**  
  The network is trained using backpropagation to compute gradients and update the parameters via gradient descent, minimizing a loss function that quantifies the difference between the network's predictions and the actual targets.

This detailed overview should provide a comprehensive understanding of feedforward neural networks, including the mathematical foundations underlying their operation.