# Neural Network Architecture and Hyperparameters

## Overview
This notebook covers activation functions (sigmoid and softmax), output-layer design for regression and classification, and basic parameter access and weight updates in PyTorch.

---

## 1. Activation Functions

### What are Activation Functions?
Activation functions introduce **non-linearity** into neural networks, allowing them to learn complex patterns. Without activation functions, neural networks would only be able to learn linear relationships, no matter how many layers you add.
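
To make this claim concrete, here is a minimal sketch that is not part of the original notebook. The layer sizes, the random input, and the seed are arbitrary choices for illustration. It stacks two `nn.Linear` layers with no activation between them and checks that the result equals one linear map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Two linear layers with no activation in between (sizes are illustrative)
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

x = torch.randn(5, 4)  # batch of 5 samples with 4 features each

with torch.no_grad():
    stacked = layer2(layer1(x))

    # Collapse both layers into one weight matrix and one bias vector
    combined_weight = layer2.weight @ layer1.weight            # shape (3, 4)
    combined_bias = layer2.weight @ layer1.bias + layer2.bias  # shape (3,)
    single = x @ combined_weight.T + combined_bias

    # With a ReLU in between, the collapse no longer holds in general
    with_relu = layer2(torch.relu(layer1(x)))

same_as_single_layer = torch.allclose(stacked, single, atol=1e-5)
relu_changes_output = not torch.allclose(with_relu, single, atol=1e-5)
print(same_as_single_layer, relu_changes_output)
```

The same argument extends to any number of stacked linear layers, which is why depth alone adds no expressive power without a non-linearity.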

### Why are they important?
- Enable networks to learn complex, non-linear patterns
- Affect gradient flow during backpropagation (some activations saturate and produce very small gradients)
- Determine the output format (probabilities, classes, etc.)

---

## 2. Sigmoid Activation Function

### Mathematical Formula:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

### Properties:
- **Output range**: between 0 and 1 (exclusive)
- **Use case**: Binary classification (yes/no, true/false)
- **Interpretation**: Output can be interpreted as a probability

### When to use Sigmoid:
- Binary classification problems (2 classes)
- When you need output as probability (0 to 1 range)
- Typically used in the **output layer** for binary tasks

### Example:
```python
sigmoid = nn.Sigmoid()
probability = sigmoid(input_tensor)  # Output between 0 and 1
```

---

## 3. Softmax Activation Function

### Mathematical Formula:
$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

### Properties:
- **Output range**: 0 to 1 for each class
- **Sum**: All outputs sum to 1.0
- **Use case**: Multi-class classification (choosing one class from many)
- **Interpretation**: Probability distribution across classes

### When to use Softmax:
- Multi-class classification (3+ classes)
- When you need probabilities for each class
- Typically used in the **output layer** for classification
- Use `dim=-1` to apply softmax across the last dimension

### Example:
```python
softmax = nn.Softmax(dim=-1)
probabilities = softmax(input_tensor)  # Outputs sum to 1.0
```
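
The effect of the `dim` argument is easier to see on a batch. The sketch below uses a made-up 2×3 tensor (2 samples, 3 class scores each) and compares `dim=-1` with `dim=0`.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 2 samples, 3 class scores each
logits = torch.tensor([[1.0, 2.0, 3.0],
                       [0.5, 0.5, 0.5]])

per_sample = nn.Softmax(dim=-1)(logits)  # normalizes each row (one sample)
per_column = nn.Softmax(dim=0)(logits)   # normalizes each column (across samples)

print(per_sample.sum(dim=-1))  # each row sums to 1
print(per_column.sum(dim=0))   # each column sums to 1
```

For classification you normally want each sample's class scores normalized on their own, which is what `dim=-1` does when the classes sit in the last dimension.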

---

## 4. Neural Network Architecture Design

### 4.1 Architecture Components:
1. **Input Layer**: Size matches number of input features
2. **Hidden Layers**: Intermediate processing layers
3. **Output Layer**: Size and activation depend on task type

### 4.2 Common Architecture Patterns:

#### Regression Tasks:
- **Output layer size**: 1 (or number of values to predict)
- **Output activation**: None (linear output)
- **Example**: Predicting house prices, temperature, etc.

```python
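# There are no activations between layers here, so this stack is equivalent
# to a single linear map; in practice, insert e.g. nn.ReLU() after each hidden layer.
model = nn.Sequential(
    nn.Linear(11, 20),  # 11 input features -> 20 units
    nn.Linear(20, 12),  # Hidden layer
    nn.Linear(12, 6),   # Hidden layer
    nn.Linear(6, 1),    # Output: 1 value (regression)
)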
```

#### Multi-Class Classification:
- **Output layer size**: Number of classes
- **Output activation**: Softmax
- **Example**: Image classification, sentiment analysis

```python
model = nn.Sequential(
    nn.Linear(11, 20),     # Input layer
    nn.Linear(20, 12),     # Hidden layer
    nn.Linear(12, 6),      # Hidden layer
    nn.Linear(6, 4),       # Output: 4 classes
    nn.Softmax(dim=-1)     # Convert to probabilities
)
```

---

## 5. Key Design Decisions

### Choosing Output Layer Size:
- **Binary classification**: 1 neuron + Sigmoid OR 2 neurons + Softmax
- **Multi-class classification**: N neurons (N = number of classes) + Softmax
- **Regression**: 1 neuron (or number of values to predict) + No activation
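
The two binary-classification options are closely related. As a small sketch using only the standard library (the score `2.4` is an illustrative value), a softmax over the pair `[0, x]` assigns the second class the same probability that a sigmoid assigns to `x`:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

x = 2.4  # illustrative score for the positive class
p_sigmoid = sigmoid(x)
p_softmax = softmax([0.0, x])[1]  # second entry = probability of the positive class
print(p_sigmoid, p_softmax)
```

So the choice between them is mostly a matter of convenience and of which loss function you pair with the output.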

### Choosing Hidden Layer Sizes:
- Start with layers that gradually decrease in size
- Common pattern: Input → Larger → Medium → Smaller → Output
- No strict rules - requires experimentation

### Layer Size Example:
```
Input (11) → 20 → 12 → 6 → Output (4)
```
This creates a "funnel" pattern, gradually compressing information.
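
Because sizes need experimentation, it can help to build the network from a list of sizes. The sketch below is one possible way to do this. `build_mlp` is a hypothetical helper, not a PyTorch function, and the ReLU between layers is an assumption added here (the notebook's own examples use no hidden activations).

```python
import torch
import torch.nn as nn

def build_mlp(sizes):
    """Hypothetical helper: stack Linear layers for consecutive sizes,
    with a ReLU after every layer except the last."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

model = build_mlp([11, 20, 12, 6, 4])
output = model(torch.randn(1, 11))  # one random sample with 11 features
print(model)
print(output.shape)
```

Trying a different funnel is then a one-line change, for example `build_mlp([11, 32, 16, 4])`. The output here is raw scores; add a Softmax on top only when you need probabilities.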

---

## 6. Hyperparameters

### What are Hyperparameters?
Hyperparameters are settings you choose **before** training, in contrast to parameters (weights and biases), which are learned during training. Examples:
- Number of layers
- Number of neurons per layer
- Learning rate
- Batch size
- Number of epochs

### Architecture Hyperparameters:
1. **Depth**: Number of hidden layers
2. **Width**: Number of neurons in each layer
3. **Activation functions**: Which functions to use and where
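
Depth and width directly determine how many parameters the network has to learn. As a rough sketch, the function below counts weights and biases for a fully connected network. `count_parameters` is an illustrative name, and the "wider" and "deeper" variants are made-up alternatives to this notebook's funnel.

```python
def count_parameters(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

funnel = [11, 20, 12, 6, 4]      # the layer sizes used in this notebook
wider = [11, 40, 24, 12, 4]      # hypothetical: hidden widths doubled
deeper = [11, 20, 12, 12, 6, 4]  # hypothetical: one extra hidden layer

for name, sizes in [("funnel", funnel), ("wider", wider), ("deeper", deeper)]:
    print(name, count_parameters(sizes))
```

For the 11 → 20 → 12 → 6 → 4 funnel this gives 240 + 252 + 78 + 28 = 598 parameters.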

---

## 7. Quick Reference Table

| Task Type | Output Size | Activation | Example |
|-----------|-------------|------------|---------|
| Binary Classification | 1 | Sigmoid | Spam detection |
| Multi-Class Classification | N classes | Softmax | Digit recognition (0-9) |
| Regression | 1 or more | None | Price prediction |

---

## 8. Important Notes

⚠️ **Common Mistakes to Avoid:**
- Using Softmax for regression (should use no activation)
- Wrong output size (must match number of classes)
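- Forgetting to convert raw outputs to probabilities (Sigmoid or Softmax) when you need them at inference time. During training, check what your loss expects: `nn.CrossEntropyLoss` takes raw logits and applies log-softmax internally, so a Softmax layer in front of it is applied twice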

✅ **Best Practices:**
- Match output layer to your task type
- Start with simple architectures and add complexity if needed
- Use appropriate activation functions for each task
- Monitor training to ensure model is learning
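
One related pitfall is specific to PyTorch: `nn.CrossEntropyLoss` expects raw scores (logits), not softmax outputs. The sketch below uses made-up logits and targets to compare the intended usage with the case where softmax has already been applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical raw scores (logits) for 2 samples and 4 classes
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1],
                       [0.3, 1.7, 0.2, -0.5]])
targets = torch.tensor([0, 1])  # class indices, not one-hot vectors

criterion = nn.CrossEntropyLoss()

# Intended usage: pass raw logits
loss_from_logits = criterion(logits, targets)

# Equivalent formulation: log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# Common mistake: softmax is applied twice (once here, once inside the loss)
loss_double_softmax = criterion(F.softmax(logits, dim=-1), targets)

print(loss_from_logits.item(), loss_manual.item(), loss_double_softmax.item())
```

If a model ends in `nn.Softmax` and is trained with this loss, training still runs, but the loss is computed on doubly normalized values. A common arrangement is to train on logits and apply softmax only when probabilities are needed.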

---

## 9. Summary

- **Sigmoid**: Binary classification (0 to 1)
- **Softmax**: Multi-class classification (probabilities summing to 1)
- **Architecture**: Design based on task (regression vs classification)
- **Output layer**: Must match task requirements
- **Hidden layers**: Experiment with sizes and depths

In [49]:
import torch
import torch.nn as nn
import numpy as np
from torch.nn import functional as F
import torch.optim as optim

input_tensor = torch.tensor([[2.4]])

sigmoid = nn.Sigmoid()
probability = sigmoid(input_tensor)
print(probability)

tensor([[0.9168]])


## Example 1: Sigmoid Activation Function

This example demonstrates:
- **Input**: A single value (2.4) as a tensor
- **Sigmoid function**: Squashes any real input into the open interval (0, 1)
- **Output**: A probability value between 0 and 1

**How it works:**
- Input 2.4 is passed through sigmoid: σ(2.4) = 1/(1 + e^(-2.4)) ≈ 0.917
- Larger positive values → closer to 1
- More negative values → closer to 0
- Zero → exactly 0.5

**Use case**: Binary classification where you need to predict probability of one class (e.g., probability of being spam email).
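
The arithmetic above can be checked by hand with the standard library. This is a small sketch; the inputs `-6.0` and `6.0` are extra illustrative values that the original cell does not use.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

for value in [-6.0, 0.0, 2.4, 6.0]:  # -6 and 6 are illustrative extra inputs
    print(f"sigmoid({value:+.1f}) = {sigmoid(value):.4f}")
```

The value for 2.4 agrees with the `0.9168` printed by `nn.Sigmoid` in the cell above.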

In [50]:
input_tensor = torch.tensor([[1.0, -6.0, 2.5, -0.3, 1.2, 0.8]])

softmax = nn.Softmax(dim=-1)
probability = softmax(input_tensor)
print(probability)

tensor([[1.2828e-01, 1.1698e-04, 5.7492e-01, 3.4961e-02, 1.5669e-01, 1.0503e-01]])


## Example 2: Softmax Activation Function

This example demonstrates:
- **Input**: A tensor with 6 values (representing 6 classes)
- **Softmax function**: Converts values to probabilities that sum to 1.0
- **dim=-1**: Applies softmax across the last dimension (the 6 values)

**How it works:**
- Each value is exponentiated: e^x
- Then divided by sum of all exponentials
- Result: Probability distribution across all classes
- The highest input value gets the highest probability

**Use case**: Multi-class classification when you have 3+ classes (e.g., classifying images into 6 categories). Each output represents the probability of belonging to that class.
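
The steps listed under "How it works" can be reproduced with plain Python. This sketch also subtracts the maximum score before exponentiating. That is a standard stabilization trick assumed here, not something shown in the notebook, and it leaves the result unchanged.

```python
import math

scores = [1.0, -6.0, 2.5, -0.3, 1.2, 0.8]  # same inputs as the cell above

# Subtracting the maximum does not change the result but avoids overflow
# for large scores.
max_score = max(scores)
exps = [math.exp(s - max_score) for s in scores]
total = sum(exps)
probabilities = [e / total for e in exps]

print([round(p, 4) for p in probabilities])
print("sum:", sum(probabilities))
print("most likely class:", probabilities.index(max(probabilities)))
```

The largest input (2.5, at index 2) receives the largest probability, matching the `5.7492e-01` entry in the printed tensor.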

In [51]:
# From regression to multi-class classification

input_tensor = torch.tensor([[3, 4, 6, 7, 10, 12, 2, 3, 6, 8, 9]], dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(11, 20),
    nn.Linear(20, 12),
    nn.Linear(12, 6),
    nn.Linear(6, 1),
)

output = model(input_tensor)
print(output)

tensor([[0.9888]], grad_fn=<AddmmBackward0>)


## Example 3: Regression Network Architecture

This example shows a network designed for **regression tasks**:

**Architecture breakdown:**
- **Input**: 11 features
- **Layer 1**: 11 → 20 neurons (expanding)
- **Layer 2**: 20 → 12 neurons (compressing)
- **Layer 3**: 12 → 6 neurons (compressing)
- **Output**: 6 → 1 neuron (single continuous value)

**Key features:**
- **No activation function** on output (raw value prediction)
- **Funnel architecture**: Gradually reduces dimensions
- Output is a continuous value (not bounded)

**Use case**: Predicting continuous values like prices, temperatures, distances, or any numerical quantity. The network learns to map 11 input features to 1 output value.
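
To see the architecture breakdown in action, the following sketch passes one random sample (an assumption; any input with 11 features would do) through the same layers one at a time and records the shape after each layer.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(11, 20),
    nn.Linear(20, 12),
    nn.Linear(12, 6),
    nn.Linear(6, 1),
)

x = torch.randn(1, 11)  # one random sample with 11 features
shapes = []
for layer in model:  # nn.Sequential is iterable over its layers
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(layer, "->", tuple(x.shape))
```

The first dimension is the batch size and stays fixed, while the second follows the 20 → 12 → 6 → 1 funnel.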

In [52]:
input_tensor = torch.tensor([[3, 4, 6, 7, 10, 12, 2, 3, 6, 8, 9]], dtype=torch.float32)

# Network for multi-class classification with four labels
model = nn.Sequential(
    nn.Linear(11, 20),
    nn.Linear(20, 12),
    nn.Linear(12, 6),
    nn.Linear(6, 4),
    nn.Softmax(dim=-1),
)

output = model(input_tensor)
print(output)

tensor([[0.3251, 0.2706, 0.2406, 0.1637]], grad_fn=<SoftmaxBackward0>)


## Example 4: Multi-Class Classification Network

This example converts the regression network into a **multi-class classification** network:

**Architecture modifications:**
- **Input**: Same 11 features
- **Hidden layers**: Same structure (11→20→12→6)
- **Output layer**: Changed from 1 to **4 neurons** (4 classes)
- **Added Softmax**: Converts outputs to probability distribution

**Key changes from regression:**
1. Output size = number of classes (4)
2. Added `nn.Softmax(dim=-1)` to get probabilities
3. Each output represents probability of one class
4. All outputs sum to 1.0

**Output interpretation:**
- 4 values, each between 0 and 1
- Highest value = most likely class
- Example: [0.7, 0.1, 0.15, 0.05] → Class 0 is predicted with 70% confidence

**Use case**: Classifying inputs into one of 4 categories (e.g., sentiment analysis with 4 emotions, product categorization with 4 types).
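
Turning the probabilities into a class prediction is a matter of taking the index of the largest value. The sketch below uses the illustrative probabilities from the interpretation above. The `logits` tensor is a made-up example showing that softmax does not change which class is largest.

```python
import torch

# The illustrative probabilities from the text above
probabilities = torch.tensor([[0.7, 0.1, 0.15, 0.05]])

# max over the class dimension returns (largest value, its index)
confidence, predicted_class = probabilities.max(dim=-1)
print(predicted_class.item(), confidence.item())

# Softmax preserves ordering, so the argmax of raw scores gives the same class
logits = torch.tensor([[2.0, 0.1, 0.5, -0.6]])  # hypothetical raw scores
same_prediction = torch.equal(
    logits.argmax(dim=-1),
    torch.softmax(logits, dim=-1).argmax(dim=-1),
)
print(same_prediction)
```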

In [53]:
# Creating one-hot encoded labels

y = 1
num_classes = 3

one_hot_numpy = np.eye(num_classes, dtype=int)[y]  # row y of the identity matrix

one_hot_pytorch = F.one_hot(torch.tensor(y), num_classes=num_classes)

print("One-hot Vector using NumPy:", one_hot_numpy)
print("One-hot Vector using PyTorch:", one_hot_pytorch)

One-hot Vector using NumPy: [0 1 0]
One-hot Vector using PyTorch: tensor([0, 1, 0])


In [54]:
# Accessing the model parameters

model = nn.Sequential(
    nn.Linear(16, 8),
    nn.Linear(8, 2),
    nn.Linear(2, 1)
)

weight0 = model[0].weight
print("Weight of first layer:", weight0)

bias_1 = model[1].bias
print("Bias of second layer:", bias_1)

Weight of first layer: Parameter containing:
tensor([[ 0.1723,  0.2378,  0.1054, -0.0426,  0.0125, -0.2186, -0.2464, -0.0612,
          0.2222, -0.1872, -0.1316,  0.1619, -0.1131, -0.0674,  0.0598, -0.0621],
        [-0.0280, -0.0523, -0.0172, -0.1929, -0.1212, -0.0987,  0.1360, -0.2201,
          0.0828, -0.1478, -0.0396,  0.1866, -0.0482,  0.0894, -0.1385, -0.0157],
        [ 0.2057, -0.0845, -0.2106, -0.1561, -0.1500,  0.0995, -0.0650,  0.0982,
          0.0675,  0.2314,  0.2010, -0.2123,  0.0730, -0.1607,  0.1091, -0.1293],
        [-0.1849,  0.2496,  0.1694, -0.0150,  0.1661,  0.2066,  0.1009,  0.1684,
         -0.1981,  0.1089, -0.1963,  0.0232, -0.1143,  0.0834,  0.0576, -0.1283],
        [ 0.1495, -0.0761,  0.0874,  0.2147,  0.2308, -0.0480, -0.0643,  0.1971,
         -0.1072, -0.1620,  0.0268, -0.1141,  0.0657,  0.0013, -0.0491,  0.0591],
        [-0.0696,  0.1303,  0.0989,  0.0194,  0.0238,  0.0427,  0.2092, -0.2401,
          0.0374, -0.2434,  0.1693, -0.1067, -0.2210, -0.21

In [55]:
# First, perform a forward pass and backward pass to compute gradients
input_tensor = torch.tensor([[2, 4, 6, 7, 9, 3, 2, 1, 5, 6, 3, 8, 9, 2, 1, 3]], dtype=torch.float32)
target = torch.tensor([[1.0]])

# Forward pass
output = model(input_tensor)

# Compute loss (mean squared error)
loss = ((output - target) ** 2).mean()

# Backward pass to compute gradients
loss.backward()

# Now access the gradients
weight0 = model[0].weight
weight1 = model[1].weight
weight2 = model[2].weight

grads0 = weight0.grad
grads1 = weight1.grad
grads2 = weight2.grad

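# Compute updated weight values using the learning rate and the gradients.
# Note: each assignment below creates a new tensor; the parameters stored
# in `model` itself are left unchanged.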
learning_rate = 0.01

weight0 = weight0 - learning_rate * grads0
weight1 = weight1 - learning_rate * grads1
weight2 = weight2 - learning_rate * grads2

print("Updated weights of first layer:", weight0)

print("Loss:", loss.item())

Updated weights of first layer: tensor([[ 0.1739,  0.2409,  0.1101, -0.0372,  0.0194, -0.2163, -0.2448, -0.0604,
          0.2261, -0.1826, -0.1293,  0.1681, -0.1061, -0.0658,  0.0606, -0.0598],
        [-0.0259, -0.0480, -0.0107, -0.1854, -0.1115, -0.0954,  0.1382, -0.2190,
          0.0882, -0.1413, -0.0363,  0.1952, -0.0384,  0.0915, -0.1374, -0.0124],
        [ 0.2062, -0.0834, -0.2089, -0.1541, -0.1474,  0.1004, -0.0645,  0.0985,
          0.0689,  0.2332,  0.2019, -0.2100,  0.0756, -0.1601,  0.1094, -0.1284],
        [-0.1923,  0.2348,  0.1472, -0.0409,  0.1328,  0.1955,  0.0935,  0.1647,
         -0.2166,  0.0868, -0.2074, -0.0064, -0.1476,  0.0761,  0.0539, -0.1394],
        [ 0.1440, -0.0872,  0.0708,  0.1953,  0.2059, -0.0563, -0.0699,  0.1944,
         -0.1210, -0.1786,  0.0185, -0.1362,  0.0408, -0.0043, -0.0518,  0.0508],
        [-0.0702,  0.1291,  0.0972,  0.0174,  0.0212,  0.0419,  0.2086, -0.2404,
          0.0360, -0.2451,  0.1685, -0.1090, -0.2235, -0.2150,  0.0739, 

In [56]:
# Using the PyTorch optimizer

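optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Apply the gradients computed by loss.backward() in the previous cell.
# In a training loop, call optimizer.zero_grad() before each backward pass
# so gradients do not accumulate across iterations.
optimizer.step()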
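
The manual update and the optimizer can be tied together in one self-contained training step. This is a sketch under stated assumptions: the input is random rather than the tensor used above, the seed is arbitrary, and `nn.MSELoss` stands in for the hand-written mean squared error. It checks that plain SGD applies the rule `w - learning_rate * grad`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

model = nn.Sequential(nn.Linear(16, 8), nn.Linear(8, 2), nn.Linear(2, 1))
input_tensor = torch.randn(1, 16)  # hypothetical random sample
target = torch.tensor([[1.0]])

learning_rate = 0.01
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()

optimizer.zero_grad()   # clear gradients left over from earlier steps
loss = criterion(model(input_tensor), target)
loss.backward()         # populate .grad on every parameter

# What plain SGD should produce for the first layer: w - lr * grad
weight0 = model[0].weight
expected = (weight0 - learning_rate * weight0.grad).detach().clone()

optimizer.step()        # updates all parameters in place

matches_manual_rule = torch.allclose(model[0].weight, expected)
print(matches_manual_rule)
```

Unlike the manual version, `optimizer.step()` modifies the model's parameters in place and covers biases as well as weights.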