### Terms

RNN: processes sequential data by maintaining an internal hidden state that is updated as it processes each element in a sequence
- Hidden state: the internal memory that captures and maintains a summary of previous steps
- Hidden states suffer from information decay and vanishing/exploding gradient problems because they must compress all past steps into a fixed-size summary
- Transformers have proved to be a better alternative: self-attention allows every position to be weighted against every other position, even in sequential data
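
The hidden-state update described above can be sketched with PyTorch's `nn.RNNCell`. The sizes (4 input features, 8 hidden units, 5 time steps) and the random input are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the notes).
input_size, hidden_size, seq_len = 4, 8, 5
cell = nn.RNNCell(input_size, hidden_size)

sequence = torch.randn(seq_len, 1, input_size)  # (time, batch, features)
hidden = torch.zeros(1, hidden_size)            # initial hidden state

for x_t in sequence:
    # The new hidden state depends on the current input and the previous hidden state.
    hidden = cell(x_t, hidden)

final_hidden_shape = tuple(hidden.shape)  # the fixed-size summary of the whole sequence
```

However long the sequence is, everything the model remembers has to fit in that one `hidden` tensor, which is where the information decay comes from.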

CNN: processes grid-like data, exploiting spatial structure through local connectivity patterns and weight sharing
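
As a rough illustration of weight sharing, the sketch below counts the parameters of a single `nn.Conv2d` layer and applies it to two image sizes. The channel counts and image sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# One 3x3 kernel per (input channel, output channel) pair, reused at every spatial location.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# 16 * 3 * 3 * 3 weights + 16 biases, independent of the image size.
n_params = sum(p.numel() for p in conv.parameters())

small = conv(torch.randn(1, 3, 32, 32))
large = conv(torch.randn(1, 3, 128, 128))
```

The same 448 parameters handle both inputs because each kernel only looks at a local 3x3 neighborhood and slides across the grid.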

### Data Manipulation

One-hot encoding: converts categorical data into a binary matrix with one column per category and a single 1 per row
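
A minimal sketch using `torch.nn.functional.one_hot`, assuming the categories have already been mapped to integer labels 0..2 (the label values are made up for illustration):

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1, 2])         # hypothetical integer-encoded categories
one_hot = F.one_hot(labels, num_classes=3)  # shape (4, 3), one row per sample
```

Each row contains exactly one 1, placed in the column of that sample's category.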

### Programming

MaxPooling -- downsampling operation that reduces the spatial dimensions of a feature map by keeping the maximum value in each window
- nn.MaxPool2d(kernel_size=size of the pooling window, stride=how far the window moves each step (defaults to kernel_size), padding=padding added to each side of the input, dilation=spacing between elements within the window)
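
A small sketch of `nn.MaxPool2d` on a 4x4 feature map; the input values are chosen so the result is easy to check by hand.

```python
import torch
import torch.nn as nn

# 4x4 feature map holding 0..15, shaped (batch, channels, height, width).
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # non-overlapping 2x2 windows
y = pool(x)                                   # shape (1, 1, 2, 2)
```

Each 2x2 window contributes its largest element, so height and width are both halved.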

Flatten -- collapses a multidimensional tensor into a single vector per sample (nn.Flatten() keeps the batch dimension by default): 1x7x7 → nn.Flatten() → 1x49
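
A short sketch of the shapes involved; the batch size of 32 and the 16 channels in the second example are assumptions for illustration.

```python
import torch
import torch.nn as nn

flatten = nn.Flatten()  # default start_dim=1, so dimension 0 (the batch) is preserved

single = flatten(torch.randn(1, 7, 7))      # -> (1, 49)
batch = flatten(torch.randn(32, 16, 7, 7))  # -> (32, 16 * 7 * 7)
```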

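model = nn.Sequential() -- chains modules in order so that each module's output feeds the next; the forward pass is built in, as long as every element is an nn.Module that takes a single input and returns a single output. The examples below show which patterns do not fit (✗) and what to do instead (✓):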

In [None]:
import torch.nn as nn

# ✗ Skip connections (ResNet-style)
class ResidualBlock(nn.Module):
    def forward(self, x):
        residual = x
        x = self.conv1(x)
        x = self.conv2(x)
        x = x + residual  # ← Can't do this in Sequential!
        return x

# ✗ Multiple inputs/outputs
class MultiPath(nn.Module):
    def forward(self, x1, x2):
        out1 = self.path1(x1)
        out2 = self.path2(x2)
        return out1, out2  # ← Sequential only handles single input→output

# ✗ Conditional logic
class Conditional(nn.Module):
    def forward(self, x, training):
        x = self.conv(x)
        if training:  # ← Can't have if-statements in Sequential
            x = self.dropout(x)
        return x

# ✗ Attention mechanisms (query, key, value from same input)
class Attention(nn.Module):
    def forward(self, x):
        q = self.query_proj(x)
        k = self.key_proj(x)    # ← Three branches from same input
        v = self.value_proj(x)  # Can't represent in Sequential
        return attention(q, k, v)

# ✓ Custom layers that fit the single-input, single-output pattern
class ComplexModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Can use Sequential for sub-components
        self.encoder = nn.Sequential(
            nn.Linear(100, 50),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(50, 100),
            nn.ReLU()
        )
        self.skip_proj = nn.Linear(100, 100)
    
    def forward(self, x):
        identity = x
        x = self.encoder(x)
        x = self.decoder(x)
        skip = self.skip_proj(identity)
        return x + skip  # ← Custom logic in forward()
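
For contrast with the cases above, here is a sketch of a model that does fit `nn.Sequential`, because every layer takes one tensor and returns one tensor. The layer sizes and the 28x28 single-channel input are assumptions, not taken from the notes.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # (N, 1, 28, 28) -> (N, 8, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                # -> (N, 8, 14, 14)
    nn.Flatten(),                               # -> (N, 8 * 14 * 14)
    nn.Linear(8 * 14 * 14, 10),                 # -> (N, 10)
)

logits = model(torch.randn(4, 1, 28, 28))  # forward() is provided by Sequential
```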

BatchNorm -- normalizes each feature (channel) over the mini-batch to μ≈0 & σ²≈1, then applies a learnable scale and shift -- nn.BatchNorm1d(num_features) / nn.BatchNorm2d(num_features)
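
A quick check of this behavior in training mode, assuming a batch of 64 samples with 3 features drawn far from zero mean and unit variance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(num_features=3)
bn.train()  # training mode: normalize with the statistics of the current batch

x = torch.randn(64, 3) * 5.0 + 10.0  # features with mean ~10 and std ~5
y = bn(x)

mean = y.mean(dim=0)                # per-feature mean, close to 0
var = y.var(dim=0, unbiased=False)  # per-feature variance, close to 1
```

In `eval()` mode the layer uses its running statistics instead, so the output of a single batch is generally not exactly normalized.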

XGBoost -- tree-based ensemble method that uses gradient boosting

### Training

Epoch: one complete pass through the entire training dataset

K-fold Cross-Validation: splits data into K folds. In each of K rounds, the model uses 1 fold for validation and the remaining K-1 folds for training, so every fold serves as the validation set exactly once
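
The fold rotation can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for this note, and it splits in order without shuffling, which real pipelines usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


splits = list(k_fold_indices(n_samples=10, k=5))
```

A model would be trained from scratch on each `train_idx` and scored on the matching `val_idx`; the K scores are then averaged.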

Batch: a subset of training examples processed together in a single forward/backward pass before updating model parameters
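
To relate batches to epochs, the sketch below iterates once over a `DataLoader`. The dataset of 100 random samples and the batch size of 32 are assumptions.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(100, 4)
targets = torch.randint(0, 2, (100,))
loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

# One full pass over the loader is one epoch; each iteration yields one batch.
batch_sizes = [len(xb) for xb, yb in loader]
batches_per_epoch = math.ceil(100 / 32)
```

The last batch is smaller because 100 is not a multiple of 32; passing `drop_last=True` to the loader would discard it.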

Activation Function: nonlinearity applied to a layer's output, e.g., sigmoid, tanh
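
A brief sketch of the two functions named above, evaluated on a few arbitrary points:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

sig = torch.sigmoid(x)   # squashes to (0, 1); sigmoid(0) = 0.5
tanh_out = torch.tanh(x)  # squashes to (-1, 1); tanh(0) = 0
```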

### Addressing Optimization Problems

Gradient Clipping: prevents training instability from gradient explosion by limiting gradient magnitudes, typically after backpropagation and before the optimizer step
- Clip by value / Clip by norm
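
A sketch of where clipping usually sits in a training step, using `torch.nn.utils.clip_grad_norm_`. The tiny linear model, the scaled-up inputs, and `max_norm=1.0` are assumptions chosen to provoke a large gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Large inputs (assumption) so the raw gradient norm is far above the threshold.
x, y = torch.randn(8, 10) * 100.0, torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# Clip by norm: rescale all gradients together so their global L2 norm is at most max_norm.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

# Alternative, clip by value: clamp each gradient element to [-clip_value, clip_value].
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

optimizer.step()
```

Clipping by norm preserves the direction of the gradient, while clipping by value can change it because each element is clamped independently.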

Gradient Vanishing: occurs when gradients become very small as they propagate backward through a deep neural network (DNN), effectively preventing the earlier layers from learning
- Mitigated with ReLU, batch normalization, residual connections, & careful weight initialization schemes
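
A toy illustration of why the choice of activation matters: the sketch chains an activation 30 times with no weights in between (a deliberate simplification) and measures the gradient that reaches the input. `input_gradient` is a hypothetical helper.

```python
import torch


def input_gradient(activation, depth=30):
    """Gradient of the final output with respect to the input of a weightless chain."""
    x = torch.tensor(0.5, requires_grad=True)
    h = x
    for _ in range(depth):
        h = activation(h)
    h.backward()
    return x.grad.item()


# The sigmoid derivative is at most 0.25, so the product over 30 layers collapses.
sigmoid_grad = input_gradient(torch.sigmoid)

# ReLU has derivative 1 for positive inputs, so the gradient passes through unchanged.
relu_grad = input_gradient(torch.relu)
```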

Gradient Explosion: occurs when gradients become excessively large as they propagate backward, effectively destabilizing training
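
A toy sketch of the effect, assuming a scalar "network" that multiplies by the same weight 1.5 at each of 50 steps (similar in spirit to an unrolled recurrence):

```python
import torch

w = torch.tensor(1.5)  # hypothetical shared weight, larger than 1
x = torch.tensor(1.0, requires_grad=True)

h = x
for _ in range(50):
    h = w * h  # each step multiplies the backward gradient by w as well

h.backward()
grad = x.grad.item()  # equals w ** 50, on the order of 1e8
```

Any factor consistently above 1 compounds this way, which is why clipping or a smaller weight scale is needed to keep updates stable.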