# Neural Networks

Neural networks are computational models loosely inspired by the brain. A network can be viewed as many logistic-regression-like units stacked in layers, which lets it learn hierarchical features from data and reduces the need for manual feature engineering. Instead of relying on hand-picked features, the network learns useful representations during training.

**Key Advantages:**
- **Automatic Feature Learning:**  
  Neural networks discover and extract useful features directly from raw data.  
  *Analogy:* Rather than a chef pre-selecting ingredients, the network experiments with combinations until it finds the best recipe for prediction.
- **Flexible Architecture:**  
  Design choices include the number of hidden layers and neurons per layer, which can be adapted to the problem at hand.

---

## Network Architecture

A typical neural network consists of several layers:

- **Input Layer:**  Receives the raw data as a feature vector.
- **Hidden Layers:** Intermediate layers where the network extracts increasingly complex features.
- **Output Layer:** Produces the final prediction (e.g., a probability or classification label).

**Fully Connected Layers:**  
In these layers, each neuron receives input from every neuron in the previous layer, so the learned weights determine how much each input contributes to that neuron's output.

**Activation Functions:**  
Each neuron applies an activation function (such as the sigmoid function) to a linear combination of its inputs. The sigmoid function is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

where $z$ is the weighted sum of inputs plus a bias term.

**Illustration of a Neuron:**
- **Linear Combination:**  
  For an input vector $\mathbf{x}$, weights $\mathbf{w}$, and bias $b$, the neuron computes:

$$z = \mathbf{w} \cdot \mathbf{x} + b$$

- **Activation:**  
  The output (or activation) is:

$$a = \sigma(z) = \frac{1}{1 + e^{-z}}$$
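
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function names `sigmoid` and `neuron` and the input values are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Single neuron: linear combination followed by a sigmoid activation."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Hypothetical values, chosen so that z works out to exactly 0.
x = np.array([2.0, 1.0])
w = np.array([0.5, -1.0])
b = 0.0

a = neuron(x, w, b)  # z = 0.5*2 + (-1)*1 + 0 = 0, so a = sigmoid(0) = 0.5
```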

---

## Examples

### T-Shirt Demand Prediction (Single Feature)

Suppose we want to predict whether a T-shirt is a top seller based solely on its price.

- **Input Feature:** Price, denoted by $x$.
- **Neuron Computation:**  
  The neuron calculates a weighted sum plus a bias and passes it through the sigmoid activation:

$$
a = \sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}}
$$

  Here, $w$ is the weight, $b$ is the bias, and $a$ represents the probability that the T-shirt is a top seller.
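
A small sketch of this single-feature model follows. The parameter values `w = -0.2` and `b = 4.0` are made up for illustration (a trained model would learn them from sales data); they place the 50% point at a price of 20.

```python
import math

def top_seller_probability(price, w=-0.2, b=4.0):
    """Sigmoid of w*price + b. The default w and b are illustrative, not learned."""
    z = w * price + b
    return 1.0 / (1.0 + math.exp(-z))

p_cheap = top_seller_probability(10.0)   # z = 2, probability above 0.5
p_mid = top_seller_probability(20.0)     # z = 0, probability of about 0.5
p_pricey = top_seller_probability(30.0)  # z = -2, probability below 0.5
```

With a negative weight, the predicted probability falls as the price rises, which matches the intuition that cheaper shirts are more likely to sell well.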

### Expanded Case (Multiple Features)
Now, consider a model that uses several features:
- **Input Features:** Price, Shipping Cost, Marketing Spend, Material Quality.
- **Key Factors:**
  1. **Affordability:** A function of price and shipping cost.
  2. **Awareness:** Driven by marketing spend.
  3. **Perceived Quality:** Influenced by material quality and price.

**Network Structure:**
- **Input Layer:**  
  The feature vector is:

$$
\mathbf{x} = [\text{Price}, \text{Shipping Cost}, \text{Marketing Spend}, \text{Material Quality}]
$$

- **Hidden Layer:**  
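  Contains 3 neurons. For intuition, think of each one as capturing one of the key factors (Affordability, Awareness, Perceived Quality); in practice the network learns its own intermediate features, which need not match these labels.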
- **Output Layer:**  
  A single neuron that combines the hidden layer activations to compute the final probability that the T-shirt is a top seller.

**Layer Summary:**

| **Layer**       | **Number of Neurons** | **Role**                                                       |
|-----------------|-----------------------|----------------------------------------------------------------|
| **Input Layer** | 4                     | Receives raw features.                                         |
| **Hidden Layer**| 3                     | Learns intermediate features (Affordability, Awareness, Quality). |
| **Output Layer**| 1                     | Outputs the final probability prediction.                      |
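
The 4-3-1 structure in the table can be sketched as two fully connected layers. The weights below are random placeholders and the feature values are hypothetical, already-scaled numbers, so the resulting probability carries no meaning; the point is only to show the shapes involved.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(a_in, W, b):
    """Fully connected layer: each row of W holds the weights of one neuron."""
    return sigmoid(W @ a_in + b)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Untrained placeholder parameters: 4 inputs -> 3 hidden neurons -> 1 output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

# Hypothetical scaled features: price, shipping cost, marketing spend, material quality.
x = np.array([0.4, 0.1, 0.8, 0.6])

hidden = dense(x, W1, b1)        # 3 hidden activations
prob = dense(hidden, W2, b2)[0]  # scalar output in (0, 1)
```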

---

### Neural Networks in Computer Vision – Face Recognition

Consider training a neural network to recognize a face in an image.

- **Image Details:**  
  A grayscale image of size $1000 \times 1000$ pixels. Each pixel has an intensity value from 0 to 255.
- **Data Representation:**  
  The image is represented as a $1000 \times 1000$ matrix, which is flattened into a vector:

$$
\mathbf{x} \in \mathbb{R}^{1\,000\,000}
$$
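
The flattening step can be sketched as below. The all-zero image is a placeholder for real pixel data, the scaling to $[0, 1]$ is a common but optional preprocessing choice, and the 25-neuron first hidden layer is an assumption used only to show how quickly the weight count grows with a fully connected layer.

```python
import numpy as np

# Placeholder image: a real pipeline would load pixel data from a file.
image = np.zeros((1000, 1000), dtype=np.uint8)

# Flatten row by row into a single vector of 1,000,000 intensities.
x = image.reshape(-1)

# Scale intensities from [0, 255] to [0, 1] (optional preprocessing).
x_scaled = x.astype(np.float32) / 255.0

# Assumption for illustration: a fully connected first hidden layer with 25 neurons.
hidden_units = 25
n_weights = hidden_units * x.size  # one weight per (neuron, pixel) pair
n_biases = hidden_units            # one bias per neuron
```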

**Network Architecture for Face Recognition:**
- **Input Layer:**  
  Receives the flattened pixel intensity vector.
- **Hidden Layers:**  
  - **First Hidden Layer:** Detects low-level features such as edges (using small image regions).
  - **Second Hidden Layer:** Combines edges to form facial parts (like eyes and nose) using larger regions.
  - **Third Hidden Layer (optional):** Aggregates parts to recognize complete face shapes.
- **Output Layer:**  
  Outputs a probability distribution over possible identities (often using a softmax function).

| **Layer**         | **Role**                                       | **Learned Features**                                  |
|-------------------|------------------------------------------------|-------------------------------------------------------|
| **Input Layer**   | Raw pixel intensities                          | N/A                                                   |
| **1st Hidden Layer** | Extracts basic features                      | Edges and simple lines                                |
| **2nd Hidden Layer** | Combines features into facial parts          | Eyes, nose, and other facial features                |
| **3rd Hidden Layer** | Aggregates parts into full face shapes       | Complete facial structure (if used)                   |
| **Output Layer**  | Classifies the image into a person’s identity  | Identity probabilities (via softmax)                  |

*Tip:* The network learns these features automatically from the training data.
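
For the output layer, a softmax turns a vector of raw scores into probabilities that sum to 1. The sketch below uses three made-up scores standing in for three identities; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Convert raw scores into a probability distribution."""
    shifted = z - np.max(z)  # avoids overflow in exp for large scores
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Hypothetical output-layer scores for three identities.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

predicted_identity = int(np.argmax(probs))  # index of the highest probability
```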

---

### Handwritten Digit Recognition

- **Task:** Classify an $8 \times 8$ grayscale image as either the digit 0 or 1.
- **Input:**  
  The image is flattened into a 64-dimensional vector:

$$
\mathbf{x} \in \mathbb{R}^{64}
$$

- **Neural Network Architecture:**
  - **Input Layer (Layer 0):** 64 features.
  - **Hidden Layer 1 (Layer 1):** 25 neurons.
  - **Hidden Layer 2 (Layer 2):** 15 neurons.
  - **Output Layer (Layer 3):** 1 neuron that outputs a probability for the digit 1.
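
The layer sizes above fix the shape of every weight matrix and bias vector. The short sketch below derives those shapes and the total parameter count, assuming each weight matrix stores one row per neuron.

```python
layer_sizes = [64, 25, 15, 1]  # input, hidden 1, hidden 2, output

# Weight matrix for each layer has shape (neurons in layer, neurons in previous layer).
shapes = [(n_out, n_in) for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Each layer has n_out * n_in weights plus n_out biases.
n_params = sum(n_out * n_in + n_out for n_out, n_in in shapes)

print(shapes)    # [(25, 64), (15, 25), (1, 15)]
print(n_params)  # 2031
```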

## Forward Propagation Process

Forward propagation computes the network output by passing the input through each layer:

1. **From Input to First Hidden Layer:**

$$
\mathbf{a}^{[1]} = \sigma\left(W^{[1]} \mathbf{x} + b^{[1]}\right)
$$

   - $W^{[1]} \in \mathbb{R}^{25 \times 64}$ is the weight matrix and $b^{[1]} \in \mathbb{R}^{25}$ is the bias vector for Layer 1.

2. **From First to Second Hidden Layer:**

$$
\mathbf{a}^{[2]} = \sigma\left(W^{[2]} \mathbf{a}^{[1]} + b^{[2]}\right)
$$

3. **From Second Hidden Layer to Output Layer:**

$$
a^{[3]} = \sigma\left(W^{[3]} \mathbf{a}^{[2]} + b^{[3]}\right)
$$

   - Here, $a^{[3]}$ is a scalar representing the probability that the image is the digit 1.

4. **Prediction:**  
   A threshold is applied (commonly 0.5) to convert the probability into a binary label:
   - If $a^{[3]} \ge 0.5$, predict **1**.
   - Otherwise, predict **0**.
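
The four steps can be sketched as a loop over layers followed by a threshold. The parameters here are small random placeholders rather than trained values, and the input is random noise rather than a real digit image, so the predicted label is not meaningful; the helper names `forward` and `predict` are chosen for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Apply a = sigmoid(W a_prev + b) layer by layer, starting from a = x."""
    a = x
    for W, b in params:
        a = sigmoid(W @ a + b)
    return a

def predict(x, params, threshold=0.5):
    """Return (binary label, probability) for a single input vector."""
    prob = forward(x, params)[0]
    return int(prob >= threshold), prob

rng = np.random.default_rng(0)
sizes = [64, 25, 15, 1]

# Untrained placeholder parameters with the shapes the architecture requires.
params = [
    (rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.random(64)  # stand-in for a flattened 8x8 image
label, prob = predict(x, params)
```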

**Summary of Forward Propagation:**
- **Step 1:** Process input features through the network layer by layer.
- **Step 2:** Compute activations using:

$$a_j^{[l]} = \sigma\left(w_j^{[l]} \cdot \mathbf{a}^{[l-1]} + b_j^{[l]}\right)$$

  Here $l$ is the layer index and $j$ indexes the neurons within that layer.

- **Step 3:** Use the final activation to make a prediction.

---

## Mathematical Computation and Notation

The following subsections summarize the per-neuron computation and the notation conventions used in the examples above.

### Neuron Computation

Each neuron performs two key operations:
1. **Linear Combination:**

$$
z = \mathbf{w} \cdot \mathbf{x} + b
$$

   where:
   - $\mathbf{x}$ is the input vector.
   - $\mathbf{w}$ is the weight vector.
   - $b$ is the bias term.
2. **Activation:**

$$
a = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

   The activation function $\sigma(\cdot)$ (here, the sigmoid) determines the neuron's output.
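
As a worked instance of the two steps, take the hypothetical values $\mathbf{w} = (1, -2)$, $\mathbf{x} = (3, 1)$, and $b = 1$ (chosen only for illustration):

$$
z = (1)(3) + (-2)(1) + 1 = 2, \qquad a = \sigma(2) = \frac{1}{1 + e^{-2}} \approx 0.881
$$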

### Notation Conventions

- **Layer Indexing:**  
  The input vector $\mathbf{x}$ is also written $\mathbf{a}^{[0]}$, the activation of Layer 0; subsequent layers are labeled with superscripts in square brackets. For example:
  - $\mathbf{a}^{[1]}$ is the activation vector of the first hidden layer.
  - $W^{[l]}$ and $b^{[l]}$ denote the weights and biases for Layer $l$.
- **Neuron-Specific Parameters:**  
  For the $j^\text{th}$ neuron in layer $l$, the activation is:

$$
a_j^{[l]} = \sigma\left(w_j^{[l]} \cdot \mathbf{a}^{[l-1]} + b_j^{[l]}\right)
$$

  where $w_j^{[l]}$ is the weight vector and $b_j^{[l]}$ is the bias for that neuron.
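
The per-neuron formula and the matrix form $\sigma(W^{[l]} \mathbf{a}^{[l-1]} + b^{[l]})$ describe the same computation, assuming row $j$ of $W^{[l]}$ holds the weight vector $w_j^{[l]}$. The sketch below checks this numerically with random placeholder values; the function names are chosen for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_loop(a_prev, W, b):
    """Compute each activation a_j separately from row j of W and entry j of b."""
    a = np.zeros(W.shape[0])
    for j in range(W.shape[0]):
        z_j = np.dot(W[j], a_prev) + b[j]
        a[j] = sigmoid(z_j)
    return a

def dense_vectorized(a_prev, W, b):
    """Compute all activations of the layer at once."""
    return sigmoid(W @ a_prev + b)

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))  # 3 neurons, 4 inputs each
b = rng.normal(size=3)
a_prev = rng.random(4)

same = np.allclose(dense_loop(a_prev, W, b), dense_vectorized(a_prev, W, b))
```

The vectorized form is what is normally used in practice, since a single matrix product is much faster than a Python loop over neurons.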

### Example Computation in a Hidden Layer

Suppose a hidden layer has three neurons. For each neuron $j \in \{1, 2, 3\}$:

1. **Compute the Linear Combination:**

$$
z_j^{[1]} = w_j^{[1]} \cdot \mathbf{x} + b_j^{[1]}
$$

2. **Apply the Activation Function:**

$$
a_j^{[1]} = \sigma(z_j^{[1]})
$$

If the computed activations are:

$$
a_1^{[1]} \approx 0.3,\quad a_2^{[1]} \approx 0.7,\quad a_3^{[1]} \approx 0.2,
$$

the activation vector for this layer is:

$$
\mathbf{a}^{[1]} = \begin{bmatrix} 0.3 \\ 0.7 \\ 0.2 \end{bmatrix}.
$$

This vector is then used as the input to the next layer.
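
To illustrate that last step, the sketch below feeds the activation vector into a single neuron of the next layer. The weights and bias of that neuron are hypothetical values picked so the weighted sum is close to zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Activation vector of the hidden layer from the example above.
a1 = np.array([0.3, 0.7, 0.2])

# Hypothetical parameters for one neuron in the next layer.
w = np.array([1.0, 2.0, -1.0])
b = -1.5

z = np.dot(w, a1) + b  # 0.3 + 1.4 - 0.2 - 1.5, which is approximately 0
a2 = sigmoid(z)        # approximately 0.5
```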
