| Model | Use Case | Definition | Performance Metrics | Key Formulas |
|-------|----------|------------|---------------------|--------------|
| Linear Regression | Predicting continuous values | A model that assumes a linear relationship between input features and the target variable | - Mean Squared Error (MSE)<br>- R-squared (R²)<br>- Root Mean Squared Error (RMSE) | y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε |
| Logistic Regression | Binary classification | A model that predicts the probability of an instance belonging to a particular class | - Accuracy<br>- Precision<br>- Recall<br>- F1-score<br>- ROC AUC | P(Y=1) = 1 / (1 + e^(-z))<br>where z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ |
| Decision Trees | Classification and regression | A tree-like model of decisions based on feature values | - Accuracy (classification)<br>- MSE (regression)<br>- Split criteria (used to grow the tree, not to evaluate it): Gini impurity, information gain | Gini impurity = 1 - Σ(pᵢ²)<br>where pᵢ is the proportion of class i in the node |
| Random Forest | Classification and regression | An ensemble of decision trees | - Accuracy (classification)<br>- MSE (regression)<br>- Out-of-bag error | N/A (Ensemble of decision trees) |
| Support Vector Machines (SVM) | Classification and regression | A model that finds the hyperplane that best separates classes in high-dimensional space | - Accuracy<br>- Margin<br>- Hinge loss | w · x - b = 0 (Linear SVM hyperplane) |
| K-Nearest Neighbors (KNN) | Classification and regression | A model that predicts from the K nearest neighbors: the majority class for classification, the average target for regression | - Accuracy (classification)<br>- F1-score (classification)<br>- MSE (regression) | Euclidean distance (a common choice of distance metric):<br>d(p,q) = √(Σ(pᵢ - qᵢ)²) |
| Naive Bayes | Classification | A probabilistic model based on Bayes' theorem with independence assumptions | - Accuracy<br>- Precision<br>- Recall<br>- F1-score | P(A\|B) = (P(B\|A) * P(A)) / P(B) |
| K-Means Clustering | Unsupervised clustering | A model that partitions n observations into k clusters | - Inertia<br>- Silhouette score<br>- Calinski-Harabasz index | Inertia = Σᵢ minⱼ ‖xᵢ - μⱼ‖²<br>where μⱼ is the center of cluster j |
| Principal Component Analysis (PCA) | Dimensionality reduction | A technique to reduce the dimensionality of data while preserving variance | - Explained variance ratio<br>- Cumulative explained variance | Cov(X) = (1/n) * X^T * X<br>for mean-centered X (the sample covariance uses 1/(n-1)) |
| Neural Networks | Various (classification, regression, etc.) | A model inspired by biological neural networks, capable of learning complex patterns | - Accuracy (classification)<br>- MSE (regression)<br>- Cross-entropy loss | Activation function (e.g., ReLU):<br>f(x) = max(0, x) |
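A few of the formulas in the table can be checked directly in code. The following is a minimal sketch using only the Python standard library; the function names are illustrative assumptions, not part of any library.

```python
import math

def gini_impurity(class_probs):
    """Gini impurity = 1 - sum(p_i^2) for a list of class proportions."""
    return 1.0 - sum(p * p for p in class_probs)

def euclidean_distance(p, q):
    """d(p, q) = sqrt(sum((p_i - q_i)^2)) for equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def logistic_probability(weights, bias, x):
    """P(Y=1) = 1 / (1 + e^(-z)), with z = bias + sum(w_i * x_i)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

print(gini_impurity([0.5, 0.5]))                           # 0.5
print(euclidean_distance([0, 0], [3, 4]))                  # 5.0
print(logistic_probability([1.0, -1.0], 0.0, [2.0, 2.0]))  # 0.5
```

A perfectly mixed two-class node has Gini impurity 0.5, and a pure node has 0. The logistic example gives 0.5 because the chosen weights make z exactly 0.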

| Neural Network Type | Use Case | Architecture | Key Components | Activation Functions | Loss Functions | Training Algorithm | Advantages | Challenges |
|---------------------|----------|--------------|-----------------|----------------------|-----------------|---------------------|------------|------------|
| Feedforward Neural Network (FNN) | Classification, Regression | Input layer, hidden layer(s), output layer | Neurons, Weights, Biases | ReLU, Sigmoid, Tanh | MSE, Cross-entropy | Backpropagation | Simple, Versatile | Limited for sequential data |
| Convolutional Neural Network (CNN) | Image Recognition, Computer Vision | Convolutional layers, Pooling layers, Fully connected layers | Filters, Feature maps | ReLU, Softmax | Cross-entropy | Gradient descent with backpropagation | Efficient for image data, Parameter sharing | Computationally intensive |
| Recurrent Neural Network (RNN) | Sequential data, Time series, NLP | Recurrent connections | Hidden state, Recurrent weights shared across time steps | Tanh, Sigmoid | Cross-entropy, MSE | Backpropagation Through Time (BPTT) | Handles variable-length sequences | Vanishing/exploding gradients |
| Long Short-Term Memory (LSTM) | Long-term dependencies in sequences | Memory cells, Gates (forget, input, output) | Cell state, Hidden state | Sigmoid, Tanh | Cross-entropy, MSE | Backpropagation Through Time (BPTT) | Addresses vanishing gradient problem | Complex architecture, Computationally expensive |
| Generative Adversarial Network (GAN) | Image generation, Data augmentation | Generator and Discriminator networks | Generator, Discriminator | ReLU, Tanh, Sigmoid | Binary cross-entropy | Alternating training of Generator and Discriminator | Can generate new, realistic data | Training instability, Mode collapse |
| Autoencoder | Dimensionality reduction, Feature learning | Encoder and Decoder networks | Encoder, Decoder, Bottleneck layer | ReLU, Sigmoid | MSE, Binary cross-entropy | Backpropagation | Unsupervised feature learning | May learn trivial solutions |
| Transformer | NLP, Sequence-to-sequence tasks | Multi-head attention, Feed-forward layers | Self-attention, Positional encoding | ReLU, Softmax | Cross-entropy | Adam optimizer | Parallelizable, Captures long-range dependencies | High memory requirements |
| Deep Belief Network (DBN) | Feature extraction, Dimensionality reduction | Stack of Restricted Boltzmann Machines (RBMs) | RBMs, Visible layer, Hidden layers | Sigmoid, Softmax | Contrastive divergence | Layer-wise pre-training, Fine-tuning | Unsupervised pre-training | Complex training process |
| Radial Basis Function Network (RBFN) | Function approximation, Time series prediction | Input layer, RBF layer, Output layer | RBF neurons, Centroids | Gaussian RBF | MSE | Two-stage training (unsupervised + supervised) | Fast training, Good at interpolation | Poor extrapolation, Curse of dimensionality |

| Neural Network Type | Architecture | Use Cases | Key Characteristics | Activation Functions | Loss Functions | Notable Variants |
|---------------------|--------------|-----------|---------------------|----------------------|----------------|-------------------|
| Feedforward Neural Network (FNN) | Input layer, hidden layer(s), output layer | Classification, regression | Simple, fully connected layers | ReLU, Sigmoid, Tanh | MSE, Cross-entropy | Multilayer Perceptron (MLP) |
| Convolutional Neural Network (CNN) | Convolutional layers, pooling layers, fully connected layers | Image recognition, computer vision | Local connectivity, parameter sharing | ReLU | Cross-entropy | ResNet, VGG, Inception |
| Recurrent Neural Network (RNN) | Recurrent connections, hidden state | Sequence data, time series, NLP | Can handle variable-length sequences | Tanh, Sigmoid | Cross-entropy | LSTM, GRU |
| Long Short-Term Memory (LSTM) | Special RNN architecture with memory cells and gates | Long-term dependencies in sequences | Mitigates the vanishing gradient problem | Sigmoid, Tanh | Cross-entropy | Bidirectional LSTM |
| Autoencoder | Encoder, bottleneck, decoder | Dimensionality reduction, feature learning | Unsupervised learning, data compression | ReLU, Sigmoid | MSE | Variational Autoencoder (VAE) |
| Generative Adversarial Network (GAN) | Generator and discriminator networks | Image generation, style transfer | Adversarial training | ReLU, Tanh | Binary cross-entropy | DCGAN, CycleGAN |
| Transformer | Self-attention mechanisms, feed-forward layers | NLP, sequence-to-sequence tasks | Parallelizable, captures long-range dependencies | ReLU, GELU | Cross-entropy | BERT, GPT |
| Graph Neural Network (GNN) | Node embeddings, edge operations | Graph-structured data, social networks | Operates on graphs, preserves structure | ReLU | Cross-entropy | Graph Convolutional Network (GCN) |

Below is a CNN with a 5x5 kernel and 2x2 max pooling:

![image.png](attachment:image.png)

Input -> [Conv -> BatchNorm -> ReLU] -> [Conv -> BatchNorm -> ReLU] -> MaxPool -> ...

| Component | Purpose | How it Works | Benefits |
|-----------|---------|--------------|----------|
| Kernel Window (Convolutional Filter) | Extract features from the input image | - Slides over the input image<br>- Performs element-wise multiplication and summation<br>- Creates feature maps | - Detects local patterns (e.g., edges, textures)<br>- Preserves spatial relationships<br>- Enables parameter sharing, reducing model complexity |
| Max Pooling | Reduce spatial dimensions of feature maps | - Slides a window over the feature map<br>- Takes the maximum value in each window<br>- Creates a downsampled output | - Reduces computational load<br>- Provides a degree of invariance to small translations<br>- Can help reduce overfitting |
| Activation Function (e.g., ReLU) | Introduce non-linearity into the model | - Applies a non-linear transformation to each element<br>- For ReLU: f(x) = max(0, x) | - Allows learning of complex patterns<br>- ReLU helps mitigate the vanishing gradient problem<br>- ReLU introduces sparsity in activations |
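The three components in the table can be sketched on plain nested lists. This is a minimal illustration under simplifying assumptions (single channel, no padding, stride 1 for the convolution), not how a deep learning framework implements these layers; the function names and the toy image are hypothetical. As is conventional in deep learning, the "convolution" here does not flip the kernel.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum
    the element-wise products at each position to build a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(feature_map):
    """Apply f(x) = max(0, x) to every element."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum of each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1], [-1, 1]]

features = relu(conv2d(image, kernel))
pooled = max_pool(features)
print(features)  # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(pooled)    # [[2, 0], [2, 0]]
```

The feature map is non-zero only where the edge sits, and pooling halves each spatial dimension while keeping that response.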

# The Vanishing Gradient Problem

## Explanation

The vanishing gradient problem occurs in deep neural networks when gradients become extremely small as they're propagated backwards through the network during training. This can lead to very slow learning or no learning at all, especially in the earlier layers of the network.

## Example: Deep Network with Sigmoid Activation

Let's consider a simple deep neural network with 10 layers, all using the sigmoid activation function. The sigmoid function is defined as:

σ(x) = 1 / (1 + e^(-x))

Its derivative is:
σ'(x) = σ(x) * (1 - σ(x))

Now, let's see what happens during backpropagation:

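1. Assume each layer's pre-activation is around 0, which is common right after initialization.
2. The derivative of sigmoid at 0 is σ'(0) = 0.25, which is also its maximum value.
3. During backpropagation, the chain rule multiplies these derivatives layer by layer (the weight terms are ignored here for simplicity).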

- Gradient at the 10th (last) layer: 0.25
- Gradient at the 9th layer: 0.25 * 0.25 = 0.0625
- Gradient at the 8th layer: 0.25 * 0.25 * 0.25 = 0.015625
- ...
- Gradient at the 1st layer: 0.25^10 ≈ 0.00000095

## Visualization

Layer | Gradient
------|----------
10    | 0.25
9     | 0.0625
8     | 0.015625
7     | 0.00390625
6     | 0.0009765625
5     | 0.000244140625
4     | 0.00006103515625
3     | 0.0000152587890625
2     | 0.000003814697265625
1     | 0.00000095367431640625
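The values in the table can be reproduced with a short loop. This is a sketch under the same simplifying assumptions as the example (every pre-activation is 0 and weight terms are ignored); the variable names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

num_layers = 10
local_grad = sigmoid_derivative(0.0)  # 0.25 when the pre-activation is 0

# Walk backwards from the last layer, multiplying in one derivative per layer.
factors = {}
running = 1.0
for layer in range(num_layers, 0, -1):
    running *= local_grad
    factors[layer] = running

for layer in range(num_layers, 0, -1):
    print(f"layer {layer:2d}: {factors[layer]:.3e}")
```

Layer 10 prints 2.500e-01 and layer 1 prints 9.537e-07, matching the first and last rows of the table.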

## Consequences

1. The gradient at the first layer is extremely small (≈ 0.00000095).
2. This tiny gradient means that the weights in the early layers will update very slowly.
3. In much deeper networks, or with low-precision arithmetic, such gradients can underflow and effectively become zero.
4. As a result, the early layers of the network learn very slowly or not at all, while later layers do most of the learning.

This is one reason activation functions like ReLU (Rectified Linear Unit) are often preferred. ReLU's gradient is exactly 1 for all positive inputs, so multiplying it across layers does not shrink the signal, which helps mitigate the vanishing gradient problem.
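To make the contrast concrete, the sketch below compares the chained derivative factor for sigmoid and ReLU across 10 layers. It assumes, purely for illustration, that every layer sees the same pre-activation value and that weight terms are ignored; `chained_factor` is a hypothetical helper.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (the value at exactly 0 is a convention).
    return 1.0 if x > 0 else 0.0

def chained_factor(derivative, pre_activation, depth):
    """Product of `depth` identical local derivatives."""
    return derivative(pre_activation) ** depth

depth = 10
x = 0.5  # hypothetical positive pre-activation, the same in every layer

print(chained_factor(sigmoid_derivative, x, depth))  # tiny: sigmoid' is at most 0.25
print(chained_factor(relu_derivative, x, depth))     # 1.0
print(chained_factor(relu_derivative, -x, depth))    # 0.0
```

The last line shows the flip side: for negative inputs ReLU's gradient is 0, so a unit that only ever sees negative pre-activations passes no gradient at all.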


**Detailed CNN Architecture Flow with Filter and Kernel Sizes**

Input Image: 224x224x3

Conv1: 64 filters, 5x5 kernel, stride 1, padding 2
Output: 224x224x64
  → Spatial: (224 - 5 + 2*2) / 1 + 1 = 224
  → Channels: 64 (number of filters)

ReLU1: 224x224x64

MaxPool1: 2x2 kernel, stride 2
Output: 112x112x64
  → Spatial: (224 - 2) / 2 + 1 = 112
  → Channels unchanged

Conv2: 128 filters, 3x3 kernel, stride 1, padding 1
Output: 112x112x128
  → Spatial: (112 - 3 + 2*1) / 1 + 1 = 112
  → Channels: 128 (number of filters)

ReLU2: 112x112x128

MaxPool2: 2x2 kernel, stride 2
Output: 56x56x128
  → Spatial: (112 - 2) / 2 + 1 = 56
  → Channels unchanged

Conv3: 256 filters, 3x3 kernel, stride 1, padding 1
Output: 56x56x256
  → Spatial: (56 - 3 + 2*1) / 1 + 1 = 56
  → Channels: 256 (number of filters)

ReLU3: 56x56x256

MaxPool3: 2x2 kernel, stride 2
Output: 28x28x256
  → Spatial: (56 - 2) / 2 + 1 = 28
  → Channels unchanged

Flatten: 28 * 28 * 256 = 200,704 features

FC1: 200,704 → 1024 neurons

ReLU4: 1024 neurons

FC2: 1024 → 1000 neurons (e.g., for 1000-class classification)

Softmax: 1000 class probabilities
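The shape arithmetic above follows one formula, output = (input - kernel + 2 * padding) / stride + 1, applied layer by layer. The sketch below walks the same architecture; the `layers` list simply restates the configuration from the text, and the helper name is an assumption.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size for a conv or pooling layer (floor division)."""
    return (size - kernel + 2 * padding) // stride + 1

# (name, kernel, stride, padding, out_channels); None keeps the channel count.
layers = [
    ("Conv1", 5, 1, 2, 64),
    ("MaxPool1", 2, 2, 0, None),
    ("Conv2", 3, 1, 1, 128),
    ("MaxPool2", 2, 2, 0, None),
    ("Conv3", 3, 1, 1, 256),
    ("MaxPool3", 2, 2, 0, None),
]

size, channels = 224, 3  # input image: 224x224x3
shapes = {}
for name, kernel, stride, padding, out_channels in layers:
    size = conv_output_size(size, kernel, stride, padding)
    if out_channels is not None:
        channels = out_channels
    shapes[name] = (size, size, channels)
    print(f"{name}: {size}x{size}x{channels}")

flattened = size * size * channels
fc1_params = flattened * 1024 + 1024  # weights plus biases of FC1
print("Flatten:", flattened)
print("FC1 parameters:", fc1_params)
```

Running the numbers this way also shows that FC1 alone holds roughly 205 million parameters, far more than the convolutional layers.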

# CNN Classification Output Visualization

Let's consider a CNN trained to classify images into 5 categories: Dog, Cat, Bird, Fish, and Rabbit.

## Output Structure

```
[P(Dog), P(Cat), P(Bird), P(Fish), P(Rabbit)]
```

Where P(x) is the probability of the image belonging to class x.

## Example Output

```python
[0.05, 0.75, 0.15, 0.03, 0.02]
```

Interpretation:
- Dog:    5% probability
- Cat:   75% probability
- Bird:  15% probability
- Fish:   3% probability
- Rabbit: 2% probability

## Visual Representation

Dog   |▋ 5%
Cat   |██████████████████████████████████████ 75%
Bird  |███████ 15%
Fish  |▏ 3%
Rabbit|▏ 2%

In this example, the model is most confident that the image is of a cat.
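Turning the output vector into a prediction is an argmax over the probabilities. A minimal sketch, reusing the class order and example values from above (the variable names are illustrative):

```python
class_names = ["Dog", "Cat", "Bird", "Fish", "Rabbit"]
probabilities = [0.05, 0.75, 0.15, 0.03, 0.02]

# Index of the largest probability (argmax).
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_class = class_names[best_index]
confidence = probabilities[best_index]
print(f"{predicted_class} ({confidence:.0%})")  # Cat (75%)

# Classes ranked from most to least likely.
ranking = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
print([name for name, _ in ranking])  # ['Cat', 'Bird', 'Dog', 'Fish', 'Rabbit']
```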

Purpose of Softmax:

- Converts raw scores (logits) into a probability distribution.
- Ensures all output values are between 0 and 1.
- Makes sure the sum of all outputs equals 1.


# Softmax Function

## Formula

For a vector z of K elements, the softmax function is defined as:

```
softmax(z_i) = e^(z_i) / Σ(e^(z_j))
```
where j = 1 to K

## Example

Input (logits):  [2.0, 1.0, 0.1]

Step 1: Calculate e^(z_i) for each element
e^2.0 ≈ 7.389
e^1.0 ≈ 2.718
e^0.1 ≈ 1.105

Step 2: Sum all e^(z_i)
7.389 + 2.718 + 1.105 ≈ 11.212

Step 3: Divide each e^(z_i) by the sum
7.389 / 11.212 ≈ 0.659
2.718 / 11.212 ≈ 0.242
1.105 / 11.212 ≈ 0.099

Output (probabilities): [0.659, 0.242, 0.099]
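The worked example can be reproduced in a few lines. The sketch below subtracts the largest logit before exponentiating, a common trick that avoids overflow for large inputs and leaves the result unchanged because the shift cancels in the ratio. This variant is an addition for illustration and goes slightly beyond the formula as written above.

```python
import math

def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sum(probs))                    # approximately 1
```

Without the shift, an input such as `[1000.0, 1000.0]` would raise an `OverflowError` in `math.exp`; with it, the function returns `[0.5, 0.5]`.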