#### Loss Function

| Model Type        | Task                       | Loss Function          |
|-------------------|----------------------------|------------------------|
| ANN (FCN)         | Multi-class Classification  | CrossEntropyLoss       |
| CNN               | Binary Classification       | BCEWithLogitsLoss      |
| RNN               | Regression                  | MSELoss                |
| Any Model         | Robust Regression           | SmoothL1Loss           |
| Any Model         | Distribution Learning       | KLDivLoss              |
| Any Model         | Margin-based Similarity     | HingeEmbeddingLoss     |
| Any Model         | Similarity Learning         | CosineEmbeddingLoss    |
| Any Model         | Metric Learning             | TripletMarginLoss      |


In [None]:
# 1. Cross-Entropy Loss
# ------------------------------------
# Used for multi-class classification problems.
# Apply it directly to the model's raw logits (unscaled scores); do not add a softmax layer first.
# Internally it applies log-softmax to the logits and then takes the negative log-likelihood
# of the true class index.

# Example Usage:
criterion = nn.CrossEntropyLoss() 
# Model output: Raw logits of shape (batch_size, num_classes).
# Target: Integer class indices (0 to num_classes-1) for each input.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 2. Binary Cross-Entropy Loss with Logits (BCEWithLogitsLoss)
# ------------------------------------
# Used for binary classification tasks, where the model outputs raw logits.
# This loss combines a sigmoid activation function with binary cross-entropy loss 
# to directly calculate the loss on raw model output without needing separate sigmoid activation.

# Example Usage:
criterion = nn.BCEWithLogitsLoss()  # Suitable for binary classification
# Model output: Raw logits of shape (batch_size, 1).
# Target: Float tensor of 0s and 1s with the same shape as the model output.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 3. Mean Squared Error Loss (MSELoss)
# ------------------------------------
# Typically used for regression tasks. This loss measures the squared difference 
# between the predicted and target continuous values.
# The goal is to minimize the squared error between predicted and actual values.

# Example Usage:
criterion = nn.MSELoss()  # Suitable for regression
# Model output: Continuous values (e.g., predicted house price, temperature).
# Target: Continuous values representing ground truth.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 4. Smooth L1 Loss
# ------------------------------------
# A robust alternative to Mean Squared Error (MSELoss) that is less sensitive to outliers.
# It is quadratic (like MSE) when the absolute error is below `beta` (default 1.0) and linear (like MAE) above it.
# This is useful when the data contains outliers that would otherwise dominate the squared error.

# Example Usage:
criterion = nn.SmoothL1Loss()  # Replaces MSE for robustness
# Model output: Continuous values.
# Target: Continuous values (e.g., ground truth).

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------


# 5. Kullback-Leibler Divergence Loss (KLDivLoss)
# ------------------------------------
# Used to measure how one probability distribution diverges from a second, expected distribution.
# It’s often applied in generative models like Variational Autoencoders (VAE) or in tasks involving 
# distribution learning. This loss calculates the divergence between the predicted distribution 
# and the true probability distribution.

# Example Usage:
criterion = nn.KLDivLoss(reduction='batchmean')  # Calculates KL divergence with batch mean reduction
# Model output: Log-probabilities of the predicted distribution, log(q) (e.g., the output of log_softmax).
# Target: Probabilities of the true distribution, p (not log-probabilities, unless log_target=True).

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 6. Hinge Embedding Loss
# ------------------------------------
# Used to learn whether two inputs are similar or dissimilar, typically on top of a distance
# between their embeddings (e.g., a pairwise distance). It is not the SVM hinge loss on class scores.
# For a target of 1 the loss is the distance itself, so similar pairs are pulled together.
# For a target of -1 the loss is max(0, margin - distance), so dissimilar pairs are pushed
# at least `margin` apart.

# Example Usage:
criterion = nn.HingeEmbeddingLoss()
# Model output: A non-negative distance for each pair.
# Target: 1 for similar pairs, -1 for dissimilar pairs.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 7. Cosine Embedding Loss
# ------------------------------------
# Commonly used in similarity learning, where the goal is to minimize the angle 
# between two vectors. This loss encourages similar samples (vectors) to be closer in angle and 
# dissimilar ones to be farther apart. It’s often used for tasks like face verification.

# Example Usage:
criterion = nn.CosineEmbeddingLoss()
# Model output: Embedding vectors (e.g., face embeddings, text embeddings).
# Target: 1 for similar pairs, -1 for dissimilar pairs.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 8. Triplet Margin Loss
# ------------------------------------
# Used in metric learning to learn an embedding space where the distance between 
# similar samples is smaller than the distance between dissimilar samples by a specified margin.
# This loss is typically applied in tasks such as face recognition, where the model learns 
# to distinguish between positive and negative pairs of embeddings.

# Example Usage:
criterion = nn.TripletMarginLoss(margin=1.0)
# Model output: Embeddings of three samples: anchor, positive, and negative.
# Target: No explicit target needed; loss function works based on relative distances between embeddings.
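
The input and target conventions listed above can be checked with random tensors. The following is a minimal sketch, not a training recipe: the batch size, class count, and embedding size are arbitrary assumptions chosen only to make the shapes visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, num_classes, embed_dim = 4, 3, 8  # illustrative sizes (assumptions)

# CrossEntropyLoss: raw logits (N, C) and integer class indices (N,)
logits = torch.randn(batch_size, num_classes)
class_idx = torch.randint(0, num_classes, (batch_size,))
ce = nn.CrossEntropyLoss()(logits, class_idx)

# BCEWithLogitsLoss: raw logits and float 0/1 targets of the same shape
bin_logits = torch.randn(batch_size, 1)
bin_target = torch.randint(0, 2, (batch_size, 1)).float()
bce = nn.BCEWithLogitsLoss()(bin_logits, bin_target)

# MSELoss / SmoothL1Loss: continuous predictions and targets of the same shape
pred, target = torch.randn(batch_size, 1), torch.randn(batch_size, 1)
mse = nn.MSELoss()(pred, target)
smooth_l1 = nn.SmoothL1Loss()(pred, target)

# KLDivLoss: input holds log-probabilities, target holds probabilities
log_q = F.log_softmax(torch.randn(batch_size, num_classes), dim=1)
p = F.softmax(torch.randn(batch_size, num_classes), dim=1)
kl = nn.KLDivLoss(reduction='batchmean')(log_q, p)

# CosineEmbeddingLoss: two embedding batches plus a label of 1 or -1 per pair
x1, x2 = torch.randn(batch_size, embed_dim), torch.randn(batch_size, embed_dim)
pair_label = torch.tensor([1.0, -1.0, 1.0, -1.0])
cos = nn.CosineEmbeddingLoss()(x1, x2, pair_label)

# HingeEmbeddingLoss: a distance per pair plus a label of 1 or -1
dist = F.pairwise_distance(x1, x2)
hinge = nn.HingeEmbeddingLoss(margin=1.0)(dist, pair_label)

# TripletMarginLoss: anchor, positive, and negative embeddings, no explicit target
anchor, positive, negative = (torch.randn(batch_size, embed_dim) for _ in range(3))
triplet = nn.TripletMarginLoss(margin=1.0)(anchor, positive, negative)

for name, value in [("ce", ce), ("bce", bce), ("mse", mse), ("smooth_l1", smooth_l1),
                    ("kl", kl), ("cos", cos), ("hinge", hinge), ("triplet", triplet)]:
    print(f"{name:>10}: {value.item():.4f}")
```

Every criterion returns a scalar by default (`reduction='mean'`, or `'batchmean'` where set explicitly), which is what `loss.backward()` expects.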


#### Bias

In [None]:
"""
Each layer adds a bias term by default. To keep the bias in a layer, set its `bias` parameter to True; to drop it, set `bias` to False.
"""
# Example in a linear layer:
self.fc1 = nn.Linear(28*28, 128, bias=True)  # Bias enabled

# Example Convolution layer
self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1, bias=True)

# Example RNN layer
self.rnn = nn.RNN(input_size, hidden_size, batch_first=True, bias=True)

"""
Set bias=False in any of these layers if you don't want to use a bias.
"""

#### Activation Functions


| Activation Function | ANN | CNN | RNN |
|--------------------|----|----|----|
| **ReLU** (`nn.ReLU()`) | ✅ | ✅ | ❌ |
| **Leaky ReLU** (`nn.LeakyReLU()`) | ✅ | ✅ | ❌ |
| **Sigmoid** (`nn.Sigmoid()`) | ✅ | ✅ | ❌ |
| **Tanh** (`nn.Tanh()`) | ❌ | ❌ | ✅ |
| **Softmax** (`nn.Softmax()`) | ✅ | ❌ | ❌ |
| **ELU** (`nn.ELU()`) | ✅ | ✅ | ❌ |
| **SELU** (`nn.SELU()`) | ✅ | ❌ | ✅ |
| **GELU** (`nn.GELU()`) | ✅ | ✅ | ✅ |
| **Swish** (`nn.SiLU()`) | ✅ | ✅ | ❌ |


In [None]:
# 1. ReLU (Rectified Linear Unit)
# ------------------------------------
# The ReLU activation function is one of the most widely used activation functions.
# It outputs zero for all negative inputs and outputs the input itself for all positive inputs.
# It helps with avoiding the vanishing gradient problem by allowing positive gradients to flow 
# through the network during backpropagation.

self.relu = nn.ReLU() 
# Example: input = -2, output = 0; input = 3, output = 3.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 2. Sigmoid
# ------------------------------------
# The Sigmoid function outputs values in the range (0, 1), making it useful for binary classification tasks.
# It "squashes" the output into a probability-like range. 
# However, it can suffer from the vanishing gradient problem when values are too close to 0 or 1.

self.sigmoid = nn.Sigmoid()
# Example: input = -3, output = 0.047; input = 2, output = 0.881.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 3. Leaky ReLU
# ------------------------------------
# Leaky ReLU is a variant of ReLU. Instead of outputting zero for negative values, 
# it allows a small, non-zero gradient (controlled by the `negative_slope` parameter) to flow for negative inputs.
# This helps with the "dying ReLU" problem, where ReLU neurons may get stuck during training.

self.leaky_relu = nn.LeakyReLU(0.1)
# Example: input = -2, output = -0.2; input = 3, output = 3.
# The slope for negative inputs is set to 0.1 here; the PyTorch default is 0.01.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 4. Tanh (Hyperbolic Tangent)
# ------------------------------------
# Tanh is similar to the sigmoid function but its range is (-1, 1) rather than (0, 1).
# It is zero-centered, which helps in some cases, especially with gradient flow during backpropagation.
# However, it can still suffer from the vanishing gradient problem.

self.tanh = nn.Tanh()
# Example: input = -2, output = -0.964; input = 2, output = 0.964.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 5. GELU (Gaussian Error Linear Unit)
# ------------------------------------
# GELU is a smooth, ReLU-like activation that weights the input by the standard normal CDF: x * Phi(x).
# Unlike ReLU, it gives small non-zero outputs for negative inputs and has no kink at zero.
# It is often used in transformer models and other deep architectures.

self.gelu = nn.GELU()
# Example: input = -2, output = -0.046; input = 2, output = 1.954.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 6. ELU (Exponential Linear Unit)
# ------------------------------------
# The ELU activation function outputs the input for positive values and applies an exponential function for negative values.
# It helps with avoiding the vanishing gradient problem and can accelerate learning in some cases.
# It has a small negative output for negative inputs, unlike ReLU which outputs 0.

self.elu = nn.ELU()
# Example: input = -2, output = -0.865 (exp(-2) - 1); input = 3, output = 3.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 7. SELU (Scaled Exponential Linear Unit)
# ------------------------------------
# SELU is a self-normalizing activation: with suitable weight initialization (e.g., LeCun normal),
# it pushes activations toward zero mean and unit variance from layer to layer, which helps deep fully connected networks.
# It is ELU with a fixed alpha (about 1.6733), multiplied by a fixed scale (about 1.0507).

self.selu = nn.SELU()
# Example: input = -2, output = -1.520; input = 2, output = 2.101.

# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------
# ----------------------------------------------------------------------

# 8. Swish (SiLU: Sigmoid Linear Unit)
# ------------------------------------
# Swish, or SiLU, multiplies the input by its own sigmoid: x * sigmoid(x).
# It is smooth and non-monotonic, and it can match or improve on ReLU in some deep neural networks.
# Because its gradient is non-zero for negative inputs, it is less prone to dead neurons than ReLU.

self.silu = nn.SiLU()
# Example: input = -2, output = -0.238; input = 2, output = 1.762.
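
The example values quoted in the comments above can be checked by evaluating each activation on the same inputs. This is a small sketch; the dictionary of modules and the two test inputs (-2 and 2) are choices made here for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 2.0])  # the two inputs used in most of the comments above

activations = {
    "ReLU": nn.ReLU(),
    "LeakyReLU(0.1)": nn.LeakyReLU(0.1),
    "Sigmoid": nn.Sigmoid(),
    "Tanh": nn.Tanh(),
    "GELU": nn.GELU(),
    "ELU": nn.ELU(),
    "SELU": nn.SELU(),
    "SiLU": nn.SiLU(),
}

# Apply every activation element-wise to the same tensor
results = {name: fn(x) for name, fn in activations.items()}

for name, y in results.items():
    print(f"{name:>15}: {[round(v, 3) for v in y.tolist()]}")
```

Activation modules have no learnable parameters in their default form, so they can be created once and reused across layers.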


#### Regularization



| Technique | ANN | CNN | RNN | When to Use | Why to Use |
|-----------|----|----|----|-------------|------------|
| **Batch Normalization** (`nn.BatchNorm1d`, `nn.BatchNorm2d`) | ✅ | ✅ | ❌ | When training deep networks, especially CNNs and ANNs | Speeds up training and stabilizes learning by normalizing inputs per mini-batch |
| **Layer Normalization** (`nn.LayerNorm`) | ✅ | ❌ | ✅ | When using RNNs and transformers | Normalizes across features instead of batches, making it useful for varying batch sizes |
| **Instance Normalization** (`nn.InstanceNorm1d`, `nn.InstanceNorm2d`) | ❌ | ✅ | ❌ | When working with style transfer and image generation | Normalizes per sample, effective for tasks where batch statistics vary |
| **Group Normalization** (`nn.GroupNorm`) | ❌ | ✅ | ❌ | When BatchNorm is ineffective due to small batch sizes | Normalizes across grouped channels instead of full batches |
| **Dropout** (`nn.Dropout`) | ✅ | ❌ | ✅ | When overfitting occurs in fully connected layers and RNNs | Randomly disables neurons to prevent co-adaptation and improve generalization |
| **Dropout2d** (`nn.Dropout2d`) | ❌ | ✅ | ❌ | When overfitting occurs in CNNs | Drops entire feature maps instead of individual neurons, improving feature independence |


In [None]:
#####################################################
################### Normalization ###################
#####################################################

# 1. Batch Normalization (1D)
# ------------------------------------
# Batch Normalization is used to normalize the activations of the neurons in a mini-batch.
# It helps the network train faster and stabilizes the learning process by reducing internal covariate shift.
# Here, it normalizes over a batch of 1D data (e.g., sequence data or flat vectors).
self.bn1 = nn.BatchNorm1d(128)  # Batch Normalization for 1D input with 128 features.
# This normalizes each feature (column) across the mini-batch.

# 2. Batch Normalization (2D)
# ------------------------------------
# This performs batch normalization for 2D data, which is typically used for image data or other 2D input.
# The normalization is done per channel: statistics are computed over the batch, height, and width dimensions, and the model learns scale and shift parameters for each channel.
self.bn2 = nn.BatchNorm2d(32)  # Batch Normalization for 2D input with 32 channels.
# The input is expected to be of shape (batch_size, num_channels, height, width).

# 3. Instance Normalization (2D)
# ------------------------------------
# Instance Normalization normalizes each channel of each sample independently, over height and width only, so no batch statistics are used.
# This is especially useful in tasks like style transfer and image generation, where each image needs to be normalized independently.
self.in2 = nn.InstanceNorm2d(64)  # Instance Normalization for 2D input with 64 channels.
# Each image (instance) is normalized separately, so the result does not depend on the other images in the batch.

# 4. Layer Normalization
# ------------------------------------
# Layer Normalization normalizes the activations of each layer, rather than each mini-batch. 
# It’s applied across the features for each sample independently, and it's often used in models like RNNs.
self.ln = nn.LayerNorm(hidden_size)  # Layer Normalization for hidden_size features.
# The input is expected to be a tensor where each sample is normalized across the features (not the batch dimension).

# 5. Group Normalization
# ------------------------------------
# Group Normalization divides the channels into groups and normalizes each group separately.
# It’s an alternative to batch normalization, and it works well when the batch size is small or when batch statistics are unreliable.
self.gn = nn.GroupNorm(num_groups=8, num_channels=64)  # Group Normalization with 8 groups and 64 channels.
# This is particularly useful in scenarios with smaller batches, like segmentation tasks.


#####################################################
###################### Dropout ######################
#####################################################

# 1. Dropout
# ------------------------------------
# Dropout is a regularization technique that randomly sets a fraction of the input units to 0 during training.
# This prevents the model from overfitting by reducing reliance on specific neurons, forcing the network to learn more robust features.
self.dropout = nn.Dropout(0.3)  # Dropout with 30% probability of setting units to zero.
# This helps prevent overfitting by randomly deactivating 30% of the neurons during training.

# 2. 3D Dropout
# ------------------------------------
# Dropout3d is the channel-wise form of dropout for 3D feature maps such as video or volumetric image data.
# Instead of zeroing individual elements, it zeroes entire channels of an input of shape (N, C, D, H, W).
self.dropout3d = nn.Dropout3d(0.3)  # Each channel is zeroed with 30% probability.
# Useful for regularizing 3D CNNs in tasks like 3D object detection or video processing.
# nn.Dropout2d does the same for 2D feature maps of shape (N, C, H, W).
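
Normalization and dropout layers behave differently in training and evaluation mode, which is easy to overlook. The sketch below illustrates this on random data; the batch size, feature count, and input scale are assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 128) * 5 + 3   # 16 samples, 128 features, deliberately not standardized

# BatchNorm1d in training mode: each feature (column) is normalized across the batch
bn = nn.BatchNorm1d(128)
bn.train()
y = bn(x)
print(y.mean(dim=0).abs().max().item())   # close to 0 for every feature

# LayerNorm: each sample (row) is normalized across its own features
ln = nn.LayerNorm(128)
z = ln(x)
print(z.mean(dim=1).abs().max().item())   # close to 0 for every sample

# Dropout in training mode: units are zeroed, survivors are scaled by 1 / (1 - p)
drop = nn.Dropout(0.3)
drop.train()
d_train = drop(torch.ones(1000))

# Dropout in evaluation mode: the layer is an identity
drop.eval()
d_eval = drop(torch.ones(1000))
```

Calling `model.train()` before training and `model.eval()` before inference switches all such layers at once.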


#### Weight Initialization



| Method | ANN | CNN | RNN | When to Use | Why to Use |
|--------|----|----|----|-------------|------------|
| **Xavier (Glorot) Initialization** (`xavier_uniform_`, `xavier_normal_`) | ✅ | ❌ | ✅ | When using **Tanh/Sigmoid** activations | Maintains variance across layers |
| **Kaiming (He) Initialization** (`kaiming_uniform_`, `kaiming_normal_`) | ❌ | ✅ | ❌ | When using **ReLU/LeakyReLU** activations | Helps deeper networks converge faster |
| **Orthogonal Initialization** (`orthogonal_`) | ❌ | ❌ | ✅ | For **RNNs, LSTMs, GRUs** | Preserves information across time steps |
| **Uniform Initialization** (`uniform_`) | ✅ | ✅ | ✅ | General purpose | Randomized initialization within a fixed range |
| **Normal Initialization** (`normal_`) | ✅ | ✅ | ✅ | General purpose | Gaussian distribution for weights |
| **Constant Initialization** (`constant_`) | ✅ | ✅ | ✅ | Used for bias terms | Sets weights/biases to fixed values |


In [None]:
import torch.nn.init as init
init.xavier_uniform_(m.weight)  # `m` is any layer that has a weight tensor, e.g. nn.Linear
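
One common way to apply an initializer to a whole model is to pass a function to `model.apply`, which visits every submodule. The following is a sketch under assumptions: `init_weights` is a hypothetical helper, and the pairing of Xavier with linear layers and Kaiming with convolutional layers is just one reasonable choice, not a rule.

```python
import torch.nn as nn
import torch.nn.init as init

def init_weights(m):
    """Hypothetical helper: choose an initializer based on the layer type."""
    if isinstance(m, nn.Linear):
        init.xavier_uniform_(m.weight)
        if m.bias is not None:
            init.constant_(m.bias, 0.0)
    elif isinstance(m, nn.Conv2d):
        init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            init.constant_(m.bias, 0.0)

# A small illustrative model (layer sizes are arbitrary)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 4 * 4, 10),
)

# apply() calls init_weights on every submodule, including the container itself
model.apply(init_weights)
```

The functions ending in an underscore modify the tensor in place, so no reassignment is needed.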





#### Embedding

In [None]:
# A custom word embedding has to be trained.

token_embedding = nn.Embedding(vocab_size, embed_dim)
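
An embedding layer is a lookup table from integer token ids to learnable vectors. A minimal sketch of the lookup follows; the vocabulary size, embedding dimension, and token ids are made-up values.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16   # illustrative sizes (assumptions)
token_embedding = nn.Embedding(vocab_size, embed_dim)

# Token ids must be integers in [0, vocab_size); shape here is (batch, seq_len)
token_ids = torch.tensor([[1, 5, 42], [7, 0, 99]])

# Each id is replaced by its row of the weight matrix
vectors = token_embedding(token_ids)
print(vectors.shape)   # (batch, seq_len, embed_dim)
```

The weight matrix is an ordinary parameter, so it is updated by the optimizer together with the rest of the model.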

#### Utilities and Tools

In [None]:
###########################################################
################# Device Management #######################
###########################################################

# Move tensor to the default GPU device (CUDA device).
tensor = tensor.to('cuda')
# Move the model to the default GPU device. This is necessary before training starts if a GPU is available, as it speeds up computations.
model = model.to('cuda')

# Another way to move the tensor to GPU. This defaults to the first GPU if multiple GPUs are available.
tensor = tensor.cuda()
# Move the model to the GPU. Equivalent to using .to('cuda') but specifically calls out CUDA.
model = model.cuda()

# Move the tensor back to CPU. This is useful for operations that need to be performed on the CPU or for compatibility reasons.
tensor = tensor.cpu()
# Move the model back to CPU. Useful when you need to save the model or perform CPU-only operations.
model = model.cpu()

##############################################################
############### Model Saving and Loading  ####################
##############################################################

# Save the model's state dictionary. This contains the model's weights, not the entire model structure.
torch.save(model.state_dict(), 'model.pth')

# Load a model state dictionary from the disk. This does not load the full model but only the parameters.
state_dict = torch.load('model.pth')

# Load the parameters into the existing model structure. This is necessary because torch.load does not load the model structure.
model.load_state_dict(state_dict)

#################################################################
#################### Gradient Calculation #######################
#################################################################

# Calculate gradients of the loss with respect to model parameters. This is used during the backward pass of training.
loss.backward()

# Context manager that disables gradient calculations. Gradients are not needed during inference, which makes operations faster and reduces memory usage.
with torch.no_grad():
    predictions = model(inputs)

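# Access the gradient of a specific tensor. Useful for debugging or for custom operations involving gradients.
# It is filled in by loss.backward() only for leaf tensors with requires_grad=True; otherwise it is None.
gradient = tensor.grad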

# Compute the gradients of 'loss' with respect to the parameters of the model. This function returns the gradients and can be used when you need more control over how gradients are calculated or when you need to retrieve them explicitly.
grads = torch.autograd.grad(loss, model.parameters())
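
The snippets above hard-code `'cuda'`, which fails on a machine without a GPU. A device-agnostic variant that also round-trips the state dictionary might look like the sketch below. The tiny `nn.Linear` model and the temporary file path are placeholders for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(4, 2).to(device)           # placeholder model
inputs = torch.randn(3, 4, device=device)    # create the data on the same device

# Save only the parameters
path = os.path.join(tempfile.mkdtemp(), 'model.pth')
torch.save(model.state_dict(), path)

# Rebuild the same architecture, then load the parameters into it.
# map_location lets a checkpoint saved on a GPU load on a CPU-only machine.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location='cpu'))
restored = restored.to(device)
restored.eval()

# Inference without tracking gradients
with torch.no_grad():
    predictions = restored(inputs)
```

The restored model produces the same outputs as the original because both hold identical parameters.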



#### ANN

In [None]:
# Linear layer
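
The cell above is left as a placeholder. As one possible illustration, a small fully connected network built from `nn.Linear` layers could look like this. The class name `SimpleANN` and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SimpleANN(nn.Module):
    """Hypothetical two-layer fully connected classifier."""

    def __init__(self, in_features=28 * 28, hidden=128, num_classes=10):
        super().__init__()
        self.flatten = nn.Flatten()                 # (N, 1, 28, 28) -> (N, 784)
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = self.flatten(x)
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)                          # raw logits, suitable for CrossEntropyLoss

model = SimpleANN()
logits = model(torch.randn(5, 1, 28, 28))
print(logits.shape)
```

The last layer returns raw logits on purpose, since `nn.CrossEntropyLoss` applies log-softmax itself.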



#### CNN

In [None]:
# Convolutional Layers
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)

nn.Conv3d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)

# Pooling Layers
nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)

nn.AvgPool2d(kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True)

# Fully Connected (Dense) Layers
nn.Linear(in_features, out_features, bias=True)

# Padding Layers
nn.ZeroPad2d(padding)

# Flatten
self.flatten = nn.Flatten()
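
When stacking these layers, the size passed to the first `nn.Linear` depends on how convolution and pooling change the spatial dimensions. The sketch below uses the output-size formula from the PyTorch documentation; `conv_out_size` is a hypothetical helper, and the 32x32 RGB input is an assumption.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Hypothetical helper: output length of one spatial dimension after conv or pooling."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

side = conv_out_size(32, kernel_size=3, stride=1, padding=1)   # conv keeps 32
side = conv_out_size(side, kernel_size=2, stride=2)            # pooling halves it to 16

cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),        # stride defaults to kernel_size
    nn.Flatten(),
    nn.Linear(32 * side * side, 10),    # channels * height * width
)

x = torch.randn(2, 3, 32, 32)           # (batch, channels, height, width)
out = cnn(x)
print(out.shape)
```

Computing the flattened size from the formula avoids hard-coding a number that silently breaks when the input resolution changes.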

#### RNN

In [None]:
# RNN Layer

# GRU Layer 

# LSTM Layer
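
The three placeholders above share the same constructor pattern and differ mainly in what they return. A minimal sketch of the shapes, with arbitrary sizes and `batch_first=True` assumed as in the Bias section:

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 2, 5, 8, 16   # illustrative sizes
x = torch.randn(batch_size, seq_len, input_size)             # (batch, seq, features)

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

# RNN and GRU return (all outputs, final hidden state)
rnn_out, rnn_h = rnn(x)
gru_out, gru_h = gru(x)

# LSTM additionally returns the final cell state
lstm_out, (lstm_h, lstm_c) = lstm(x)

print(rnn_out.shape)   # (batch, seq_len, hidden_size)
print(rnn_h.shape)     # (num_layers, batch, hidden_size)
print(lstm_c.shape)    # (num_layers, batch, hidden_size)
```

For a single-layer, unidirectional network the output at the last time step equals the final hidden state, so either can feed a classifier head.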

#### Container

In [None]:
# nn.Sequential

# nn.ModuleList

# nn.ModuleDict
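
The three containers named above differ in how the submodules are called. The following sketch shows one way to use each; the class name `Stack` and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

# nn.Sequential: layers are called in order, no forward() needed
seq = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

class Stack(nn.Module):
    """Hypothetical module using ModuleList and ModuleDict."""

    def __init__(self, width=8, depth=3):
        super().__init__()
        # nn.ModuleList: a list whose layers are registered as submodules
        self.layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
        # nn.ModuleDict: submodules selected by name at call time
        self.acts = nn.ModuleDict({"relu": nn.ReLU(), "tanh": nn.Tanh()})

    def forward(self, x, act="relu"):
        for layer in self.layers:
            x = self.acts[act](layer(x))
        return x

x = torch.randn(2, 8)
seq_out = seq(x)
stack = Stack()
stack_out = stack(x, act="tanh")

# Parameters inside ModuleList are visible to optimizers; a plain Python list would hide them
num_params = sum(p.numel() for p in stack.parameters())
print(seq_out.shape, stack_out.shape, num_params)
```

Using `nn.ModuleList` or `nn.ModuleDict` instead of a plain list or dict is what makes `.parameters()`, `.to(device)`, and `state_dict()` see the contained layers.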