# <font color='blue' size='5px'>Model Building & Training with PyTorch Overview</font>

## 1 Model Building In PyTorch

### Introduction

1. **Introduction to PyTorch Model Building**

  PyTorch is an open-source machine learning library that is widely used for building and training deep neural networks. PyTorch provides a flexible and intuitive approach to building models using dynamic computation graphs, making it easy to experiment with different architectures and training techniques.

2. **Defining the Model Architecture**

  The first step in building a PyTorch model is defining the architecture of the neural network. This involves specifying the **number of layers, the size of each layer, and the activation functions used in each layer**. PyTorch provides a wide range of pre-built layers and activation functions, as well as the ability to define custom layers and functions.

3. **Implementing the Forward Pass**

  Once the architecture of the model has been defined, the next step is to implement the **forward pass: the process of feeding input data through the neural network to produce an output**. In PyTorch, this is done by defining a **`forward()` method** on the model, which takes the input tensor and returns the model's output.

4. **Defining the Loss Function**

  In order to train a PyTorch model, it is necessary to define a loss function. The loss function measures how well the model is performing on a given task, and provides a way to **calculate the error between the predicted output and the true output**. PyTorch provides a wide range of **pre-built loss functions**, as well as the ability to define custom loss functions.

5. **Backpropagation and Optimization**

  Once the loss function has been defined, the next step is to **train the model using backpropagation and optimization**.
  
  Backpropagation is the process of **calculating the gradients of the loss function with respect to each parameter in the model**. An optimization algorithm such as **stochastic gradient descent (SGD)** then uses these gradients to update the parameters. PyTorch provides automatic differentiation (autograd), which computes these gradients for you.
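
The five steps above can be sketched in a few lines. This is a minimal illustration rather than a recommended setup: the layer sizes, the learning rate, and the toy data (targets equal to the sum of the input features) are assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# Architecture and forward pass: nn.Sequential supplies forward() for us.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Loss function for a regression-style target.
criterion = nn.MSELoss()

# Optimizer that applies the gradient-based parameter updates.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Hypothetical toy data: 16 samples, 4 features, target = sum of the features.
inputs = torch.randn(16, 4)
targets = inputs.sum(dim=1, keepdim=True)

optimizer.zero_grad()                      # clear gradients from earlier steps
loss = criterion(model(inputs), targets)   # forward pass and loss
loss.backward()                            # backpropagation: compute gradients
optimizer.step()                           # optimization: update parameters

# After backward(), every parameter carries a gradient of its own shape.
grad_shapes_match = all(p.grad.shape == p.shape for p in model.parameters())
```

In a real training run, the last four statements are repeated for many batches and epochs.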



### Steps of Model Building


**1. Import PyTorch Libraries:**

Begin by importing the necessary PyTorch libraries:

```python
import torch
import torch.nn as nn
```



**2. Define the Model Class:**

In PyTorch, you create a model by defining a Python class that **inherits from `torch.nn.Module`**. This class will represent your neural network. Within this class, you define the layers and operations that make up your model in the constructor (`__init__`) method and specify the forward pass in the `forward` method. Here's a basic structure of a model class:

```python
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # Define layers and operations here

    def forward(self, x):
        # Specify the forward pass
        # Use layers and operations defined in __init__
        return x
```



**nn.Module:**
   - In PyTorch, you create a neural network model by defining a **Python class that inherits from `torch.nn.Module`**. This class represents your neural network architecture.

   - **nn.Module** is a **class in PyTorch** that provides a convenient way to organize and encapsulate all the learnable parameters of a neural network model. When a custom neural network model is defined in PyTorch, it usually inherits from the `nn.Module` class. This allows the model to **inherit all the functionalities of nn.Module**, such as tracking all its parameters, moving the model to GPU, and many more.

   - **Within the constructor (`__init__`)** of the class, you define the layers and operations that make up your model.
    - For example, you might define linear layers, convolutional layers, activation functions, etc.

**Forward Method:**

  - **The `forward` method** is where you specify the **forward pass of your network**. You define how the data flows through the layers from input to output. This method computes the predictions or outputs of your model.

  - In PyTorch, the `forward` method is a required method for any custom neural network model that inherits from the `nn.Module` class.
  - This method defines the computation that the **model performs on input data to produce its output**.
  - Specifically, given an input tensor, the `forward` method specifies how that tensor should be transformed as it passes through the layers of the model.
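  - When a PyTorch model is called with an input tensor, as in `model(x)`, the `forward` method is executed automatically to produce the output tensor. Call the model itself rather than `model.forward(x)` so that any registered hooks also run.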

  - You can create complex models by composing various layers and modules in a modular and organized manner. This allows you to design architectures tailored to specific tasks.



**3. Define Layers and Operations:**

Inside the constructor (`__init__`) of your model class, define the layers and operations that you want to use in your model. PyTorch provides a wide range of built-in layers and modules to choose from. For example, you can define **linear (fully connected) layers, convolutional layers, activation functions, and more**.


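- The main idea is to define the model in two stages. This differs from the common Keras style in TensorFlow, where an activation is often passed as an argument to the layer itself.
- Layers that hold parameters are created in `__init__`, while the computation the model performs, including the activation functions, is written in `forward`.
- Here's an example of defining a simple feedforward neural network with two linear layers and a ReLU activation function: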

```python
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(in_features=64, out_features=128)
        self.fc2 = nn.Linear(in_features=128, out_features=10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```



**In Python, `super` is a built-in function**

- It is used to call a method in a parent class. In the context of PyTorch, `super` is used to call the `__init__` method of the parent class (`nn.Module`) in the `__init__` method of the custom neural network model.

- This is necessary to properly **initialize the internal state** that `nn.Module` uses to keep track of the model's parameters and submodules.

- Here's an example of how `super` is used in the code:

```python
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # Define the rest of the model here
```

- Here, `super(MyModel, self)` takes the current class (`MyModel`) and the current instance (`self`) as arguments and returns a proxy for the parent class (`nn.Module`). Calling `.__init__()` on that proxy **runs the parent constructor with no additional arguments**, which sets up the bookkeeping `nn.Module` needs to register parameters and submodules. In Python 3, the shorter `super().__init__()` is equivalent.
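
To see why the `super` call matters, the sketch below compares a class that skips it with one that includes it. `BrokenModel` and `FixedModel` are hypothetical names used only for this illustration.

```python
import torch.nn as nn

class BrokenModel(nn.Module):
    def __init__(self):
        # super().__init__() is deliberately missing here.
        self.fc = nn.Linear(4, 2)

class FixedModel(nn.Module):
    def __init__(self):
        super().__init__()  # Python 3 shorthand for super(FixedModel, self).__init__()
        self.fc = nn.Linear(4, 2)

try:
    BrokenModel()
    error_message = None
except AttributeError as exc:
    # nn.Module refuses to register a submodule before its own __init__ has run.
    error_message = str(exc)

fixed = FixedModel()
param_names = [name for name, _ in fixed.named_parameters()]
```

With the `super` call in place, the layer assigned to `self.fc` is registered and its weight and bias appear in `named_parameters()`.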

**4. Specify the Forward Pass:**

In the `forward` method of your model class, specify how the data flows through the layers. This method defines the **computation that produces the model's predictions**. You can use the layers and operations defined in the constructor.

In the example below, the forward pass applies a ReLU activation function after the first linear layer and returns the output of the second linear layer as the model's predictions.

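```python
# Method of the MyModel class defined above
def forward(self, x):
    x = torch.relu(self.fc1(x))  # first linear layer followed by ReLU
    x = self.fc2(x)              # second linear layer produces the raw outputs
    return x
```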


**5. Instantiate the Model:**

To use your model, create an instance of it:

```python
model = MyModel()
```



**6. Model Summary and Parameters:**

`print(model)` lists the layers that make up the model's architecture. For per-layer output shapes and parameter counts, you can use the `summary` function from the third-party `torchsummary` package.

  - `summary` shows the **output shape of each layer and the number of trainable parameters**. For example:

```python
from torchsummary import summary

summary(model, (64,))  # shape of one input sample; the batch dimension is left out
```

  - The **second parameter** of the `summary` function is a **tuple that specifies the input shape of the model**. In this case, `(64,)` represents a single input sample with 64 features. The batch dimension is not part of this tuple; `torchsummary` adds it itself and displays it as `-1` in the output shapes unless a `batch_size` argument is given.

  - The **`summary` function uses this input shape to calculate the output shape of each layer** in the model, which is then displayed along with the number of parameters in each layer. This information can be useful for debugging and optimizing the model architecture.
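
As a cross-check that needs no extra package, the parameter count can also be computed directly from `model.parameters()`. The sketch below repeats the `MyModel` definition from above so that it runs on its own; the batch size of 4 is an arbitrary choice.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_features=64, out_features=128)
        self.fc2 = nn.Linear(in_features=128, out_features=10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = MyModel()
print(model)  # lists the registered submodules

# Each Linear layer holds in_features * out_features weights plus out_features biases:
# (64 * 128 + 128) + (128 * 10 + 10) = 9610
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

# A dummy batch confirms the output shape.
out = model(torch.randn(4, 64))
```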


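**Full example**

For image models, the input shape is `(channels, height, width)`: `channels` is the number of color channels in the input data (e.g. 3 for RGB images), `height` is the height of the input data in pixels, and `width` is the width of the input data in pixels. The batch dimension is again left out of the tuple. `batch_size`, the number of input samples in each batch, can be passed as a separate keyword argument and only affects the shapes that are displayed.


  ```python
  from torchsummary import summary

  # `model` must accept image input here (e.g. a CNN); batch_size is optional
  summary(model, (channels, height, width), batch_size=batch_size)
  ```


  - For example, if you were working with 32x32 RGB images and a batch size of 16, you would use the following call:

  ```python
  summary(model, (3, 32, 32), batch_size=16)
  ```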




**7. Model to GPU (if available):**

If you have a GPU available and want to accelerate training, move your model and data to the GPU using `model.to(device)` and `data.to(device)`.

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
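
The data has to live on the same device as the model. A minimal sketch, assuming a single linear layer as the model and a random batch as the data:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(64, 10).to(device)   # for modules, .to() moves the parameters in place
batch = torch.randn(8, 64)             # tensors are created on the CPU by default
batch = batch.to(device)               # for tensors, .to() returns a new tensor

with torch.no_grad():                  # no gradients needed for a quick check
    output = model(batch)
```

Note the asymmetry: `model.to(device)` modifies the module itself, whereas `batch.to(device)` returns a new tensor, so its result must be assigned.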


**10. Model Serialization:**

PyTorch allows you to save and load model weights for reuse or deployment. You can save the entire model or just its `state_dict`, which contains the learned parameters. Here's an example of saving and loading model weights:

```python
# Save only the learned parameters (the state_dict)
torch.save(model.state_dict(), 'model_weights.pth')

# Load: create the model first, then restore its parameters
model = MyModel()
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()  # put dropout/batch-norm layers into inference mode
```
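
A round trip can be verified by comparing outputs before and after loading. The sketch below writes to a temporary directory; `build_model` is a hypothetical helper standing in for your own model class.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    # Hypothetical stand-in for the model class; any nn.Module works.
    return nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

model = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_weights.pth")
    torch.save(model.state_dict(), path)

    restored = build_model()                       # same architecture, fresh weights
    state = torch.load(path, map_location="cpu")   # load onto the CPU regardless of origin
    restored.load_state_dict(state)

restored.eval()  # switch to inference mode before evaluating

x = torch.randn(2, 64)
with torch.no_grad():
    same_output = torch.allclose(model(x), restored(x))
```

`map_location` is useful when the weights were saved on a GPU machine but are loaded on a CPU-only one.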


**11. Model Architecture Patterns:**

Depending on the task, you may follow specific architectural patterns:
   - **Sequential Models:** For linear stack-like networks, use `nn.Sequential` to create a sequence of layers.
   - **Residual Networks (ResNets):** Introduce shortcut connections to ease the training of very deep networks.
   - **Recurrent Neural Networks (RNNs):** Utilized for sequence data.
   - **Convolutional Neural Networks (CNNs):** Commonly used for image-related tasks.
   - **Transformer Models:** Suitable for sequence-to-sequence tasks with self-attention mechanisms.
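
Two of these patterns are small enough to sketch directly. The layer sizes are arbitrary, and `ResidualBlock` is a hypothetical minimal block written for illustration, not the block used in the published ResNet architectures.

```python
import torch
import torch.nn as nn

# Sequential model: layers are applied in the order given.
mlp = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

class ResidualBlock(nn.Module):
    """Minimal residual block: output = x + F(x)."""

    def __init__(self, features):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(features, features),
            nn.ReLU(),
            nn.Linear(features, features),
        )

    def forward(self, x):
        return x + self.body(x)  # the shortcut connection

x = torch.randn(4, 64)
mlp_out = mlp(x)
res_out = ResidualBlock(64)(x)
```

Because the shortcut adds the input to the block's output, a residual block must preserve the feature dimension (or project the shortcut to match).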



**Common neural network architectures you can build in PyTorch:**

1. Feedforward Neural Networks (FFNN)
2. Convolutional Neural Networks (CNN)
3. Recurrent Neural Networks (RNN)
4. Long Short-Term Memory (LSTM) Networks
5. Gated Recurrent Units (GRU)
6. Autoencoders
7. Variational Autoencoders (VAE)
8. Generative Adversarial Networks (GAN)
9. Deep Belief Networks (DBN)
10. Restricted Boltzmann Machines (RBM)
11. Deep Convolutional GANs (DCGAN)
12. CycleGAN
13. Adversarial Autoencoders
14. Siamese Networks
15. Capsule Networks
16. Transformer Networks


**12. Transfer Learning:**

You can leverage pre-trained models to solve similar tasks more efficiently. Fine-tuning a pre-trained model on your specific data can save significant training time and resources.

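```python
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # set this to the number of classes in your dataset

# Load a pre-trained model (e.g., ResNet-18). torchvision >= 0.13 uses `weights`,
# which replaces the deprecated `pretrained=True` argument.
pretrained_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final classification layer with your custom layer
pretrained_model.fc = nn.Linear(pretrained_model.fc.in_features, num_classes)
```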
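
A common variant of fine-tuning freezes the pre-trained layers and trains only the new head. The sketch below uses a small stand-in network, `TinyBackbone` (a hypothetical class), so that it runs without downloading weights; the same pattern applies to a torchvision model with an `fc` attribute.

```python
import torch.nn as nn

num_classes = 5  # hypothetical number of target classes

class TinyBackbone(nn.Module):
    """Stand-in for a pre-trained network: a feature extractor plus a final `fc` layer."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
        self.fc = nn.Linear(16, 1000)

    def forward(self, x):
        return self.fc(self.features(x))

model = TinyBackbone()

# Freeze every existing parameter so it receives no gradient updates.
for param in model.parameters():
    param.requires_grad = False

# A freshly created layer has requires_grad=True, so only the new head is trained.
model.fc = nn.Linear(model.fc.in_features, num_classes)

trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]
```

When building the optimizer, you can then pass only the parameters that still require gradients.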




**13. Debugging and Visualization:**

During model building and training, it's essential to use tools for debugging and visualization. Plain `print` statements, TensorBoard (which PyTorch supports through `torch.utils.tensorboard`), and plotting libraries like Matplotlib help you understand your model's behavior and identify issues.


**14. Loss Functions:**

When building your model, you should also consider the choice of an appropriate loss function. The choice of loss function depends on the task you are solving. For example:
   - Cross-Entropy Loss (`nn.CrossEntropyLoss`) is commonly used for multi-class classification.
   - Mean Squared Error Loss (`nn.MSELoss`) is used for regression problems.
   - Custom loss functions can be defined for specialized tasks.

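You can specify the loss function in the training loop when calculating the loss between model predictions and ground truth. Note that `nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and integer class indices. It applies log-softmax internally, so the model should not end with a softmax layer.

```python
criterion = nn.CrossEntropyLoss()

# predictions: raw logits of shape (batch_size, num_classes)
# ground_truth: class indices of shape (batch_size,)
loss = criterion(predictions, ground_truth)
```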
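
The following sketch makes the cross-entropy computation concrete with made-up logits and labels, and checks that `nn.CrossEntropyLoss` matches log-softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 3)             # raw scores: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])   # class indices (dtype int64)

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)

# Equivalent two-step computation: log-softmax, then negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

losses_match = torch.allclose(loss, manual)
```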



**15. Model Evaluation:**

During model building, it's essential to define evaluation metrics relevant to your task. For classification tasks, common evaluation metrics include accuracy, precision, recall, F1-score, and ROC AUC. For regression tasks, metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are commonly used.

You can use these metrics to evaluate the performance of your model on a validation or test dataset.
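
Two of these metrics can be written in a few lines of PyTorch. `accuracy` and `rmse` are hypothetical helper names, and the tensors are made-up values chosen so the results are easy to check by hand.

```python
import torch

def accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    predictions = logits.argmax(dim=1)
    return (predictions == labels).float().mean().item()

def rmse(predictions, targets):
    """Root mean squared error between two tensors of the same shape."""
    return torch.sqrt(torch.mean((predictions - targets) ** 2)).item()

logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
labels = torch.tensor([0, 1, 1, 1])
acc = accuracy(logits, labels)   # predicted classes are [0, 1, 0, 1]: 3 of 4 correct

err = rmse(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 5.0]))
```

When evaluating a real model, call `model.eval()` first and wrap the forward passes in `torch.no_grad()` so that dropout is disabled and no gradients are tracked.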



**16. Hyperparameter Tuning:**

Model building also involves hyperparameter tuning. Hyperparameters include learning rates, batch sizes, the number of hidden units or layers, dropout rates, weight decay, and more. You can experiment with different hyperparameter settings to find the best configuration for your model.




**17. Regularization Techniques:**

To prevent overfitting during model training, you can employ regularization techniques like dropout and weight decay. Dropout layers randomly deactivate neurons during training to improve model generalization. Weight decay adds a regularization term to the loss to discourage large weights.
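
Both techniques take one line each in PyTorch. In this sketch the dropout probability, weight-decay coefficient, and layer sizes are assumed values for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # zeroes each activation with probability 0.5 while training
    nn.Linear(128, 10),
)

# weight_decay applies an L2 penalty on the parameters inside the optimizer update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(8, 64)

model.eval()                      # dropout is disabled in evaluation mode
with torch.no_grad():
    eval_a = model(x)
    eval_b = model(x)             # identical to eval_a: nothing is dropped

drop = nn.Dropout(p=0.5)          # new modules start in training mode
dropped = drop(torch.ones(1000))  # surviving values are scaled by 1 / (1 - p) = 2
```

The scaling of the surviving activations keeps the expected value unchanged, which is why no rescaling is needed at evaluation time.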


**18. Model Saving and Loading:**

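Once you have built and trained your model, you should save it to disk for future use or deployment. PyTorch allows you to save either the model's state dictionary or the entire model object, as described under Model Serialization. Saving the state dictionary is generally preferred, because the saved file then does not depend on the exact class definitions available when it was written.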





## 2 Types of Layers that can be added

### Theory Guide

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_N9j0r2eLyrRos9mb7U89VO6Qq5PiJs1?usp=sharing)


## 3 Types of Activation Functions, Loss Functions, and Optimizers:

An activation function is a mathematical function that is applied to the output of a neural network layer or a single neuron. It helps to introduce non-linearity into the network, allowing it to learn complex patterns and relationships in the data.

The activation function determines the output of a neuron or a layer based on the weighted sum of its inputs. Intuitively, it decides how strongly the neuron is activated for a given input. The activation function can be linear or non-linear, although only non-linear functions let the network model non-linear relationships.

Some commonly used activation functions include:
1. Sigmoid function: It maps the input to a value between 0 and 1, which is useful for binary classification problems.
2. Hyperbolic tangent (tanh) function: It maps the input to a value between -1 and 1. Because its output is zero-centered, it is common in hidden layers and recurrent networks.
3. Rectified Linear Unit (ReLU) function: It returns 0 for negative inputs and the input value for positive inputs, which is widely used in deep learning models.
4. Leaky ReLU function: It is similar to ReLU but allows a small gradient for negative inputs, preventing dead neurons.
5. Softmax function: It is used in multi-class classification problems to convert the output into probabilities.

**Common choices in PyTorch:**
   - Activation Functions:
     - ReLU (`nn.ReLU`) is widely used due to its effectiveness in combating vanishing gradients.
     - Sigmoid (`nn.Sigmoid`) is often used on the output of binary classifiers, while Tanh (`nn.Tanh`) is used where zero-centered activations are helpful, such as in recurrent networks.
   - Loss Functions:
     - Cross-Entropy Loss (`nn.CrossEntropyLoss`) is suitable for multi-class classification problems.
     - Mean Squared Error Loss (`nn.MSELoss`) is commonly used for regression tasks.
   - Optimizers:
     - Adam (`optim.Adam`) is a popular choice due to its adaptive learning rate capabilities.
     - Stochastic Gradient Descent (`optim.SGD`) is a classic optimization algorithm.
     - RMSprop (`optim.RMSprop`) adapts the learning rate based on the recent gradients.




### Activation Function

**Activation Functions:**

Activation functions are mathematical functions applied to the output of a neuron in a neural network to introduce non-linearity into the model. The choice of activation function affects how the neuron responds to its input and, consequently, the network's ability to learn and represent complex patterns in the data. Here are some common types of activation functions used in neural networks:


1. **Sigmoid Function (Logistic Activation):**
   - Formula:  $f(x) = \frac{1}{1 + e^{-x}}$
   - Range: $(0, 1)$
   - Characteristics: Sigmoid functions squash input values to the range $(0, 1)$. They are useful in binary classification problems and can be interpreted as providing probabilities.

2. **Hyperbolic Tangent Function (tanh):**
   - Formula: $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
   - Range: $(-1, 1)$
   - Characteristics: tanh functions are similar to sigmoid functions but squash input values to the range $(-1, 1)$. They are zero-centered, making optimization easier in some cases.

3. **Rectified Linear Unit (ReLU):**
   - Formula: $f(x) = \max(0, x)$
   - Range: $[0, \infty)$
   - Characteristics: ReLU functions are piecewise linear and set negative inputs to zero. They are computationally efficient and have been widely adopted in deep learning models.

4. **Leaky Rectified Linear Unit (Leaky ReLU):**
   - Formula: $f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{if } x \leq 0 \end{cases}$
   - Range: $(-\infty, \infty)$
   - Characteristics: Leaky ReLU is similar to ReLU but allows a small gradient for negative inputs ($\alpha$ is typically a small positive constant). It helps mitigate the "dying ReLU" problem.

5. **Parametric Rectified Linear Unit (PReLU):**
   - Formula: $f(x) = \begin{cases} x, & \text{if } x > 0 \\ a_i x, & \text{if } x \leq 0 \end{cases}$
   - Range: $(-\infty, \infty)$
   - Characteristics: PReLU is an extension of Leaky ReLU, where the slope of the negative part can be learned during training.

6. **Exponential Linear Unit (ELU):**
   - Formula: $f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha(e^{x} - 1), & \text{if } x \leq 0 \end{cases}$
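   - Range: $(-\alpha, \infty)$
   - Characteristics: ELU is similar to ReLU but smooth for negative inputs, where it saturates exponentially toward $-\alpha$. Because its gradient is non-zero for negative inputs, it avoids dead neurons and can help reduce the vanishing gradient problem.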

7. **Scaled Exponential Linear Unit (SELU):**
   - Formula: $f(x) = \lambda \begin{cases} x, & \text{if } x > 0 \\ \alpha(e^{x} - 1), & \text{if } x \leq 0 \end{cases}$
   - Range: $(-\lambda\alpha, \infty)$
   - Characteristics: SELU is an extension of ELU with specific values for $\alpha$ and $\lambda$ that aim to make the activations maintain a stable mean and variance during training.

8. **Softmax Function:**
   - Formula: $f(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$ for each element $x_i$
   - Range: $(0, 1)$ with the sum of all elements equal to $1$
   - Characteristics: Softmax is used primarily in the output layer of multi-class classification models. It converts a vector of raw scores into a probability distribution over classes.
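
The formulas above can be checked numerically against PyTorch's built-in implementations. This is a small sanity check; the input values and the Leaky ReLU slope `alpha = 0.01` are chosen arbitrarily for the example.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
alpha = 0.01  # negative slope chosen for this example

# Each line writes out one of the formulas listed above.
sigmoid_manual = 1 / (1 + torch.exp(-x))
tanh_manual = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))
relu_manual = torch.clamp(x, min=0)
leaky_manual = torch.where(x > 0, x, alpha * x)
elu_manual = torch.where(x > 0, x, torch.exp(x) - 1)  # ELU with alpha = 1
softmax_manual = torch.exp(x) / torch.exp(x).sum()

checks = {
    "sigmoid": torch.allclose(sigmoid_manual, torch.sigmoid(x)),
    "tanh": torch.allclose(tanh_manual, torch.tanh(x)),
    "relu": torch.allclose(relu_manual, F.relu(x)),
    "leaky_relu": torch.allclose(leaky_manual, F.leaky_relu(x, negative_slope=alpha)),
    "elu": torch.allclose(elu_manual, F.elu(x, alpha=1.0)),
    "softmax": torch.allclose(softmax_manual, F.softmax(x, dim=0)),
}
```

In practice, prefer the built-in functions: the hand-written softmax, for example, can overflow for large inputs, whereas `F.softmax` is implemented in a numerically stable way.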



### Why Use Activation Functions

**Why use activation functions:**

Adding nonlinearity through activation functions is a critical aspect of neural networks, and it serves several important purposes:

  1. **Model Complex Relationships:** Without activation functions, neural networks would be limited to representing linear transformations of the input data. In many real-world problems, especially those with complex, non-linear patterns, linear models are insufficient. Activation functions introduce nonlinearity, allowing neural networks to capture and model intricate relationships within the data.

  2. **Enable Representation Learning:** Neural networks are often referred to as universal function approximators: with enough hidden units, a network with a non-linear activation can approximate any continuous function on a bounded domain to arbitrary accuracy. This capability depends on the activation functions, because without them the network stays linear. Activation functions allow neural networks to learn and represent complex data patterns, making them highly adaptable to various tasks.

  3. **Enable Deep Architectures:** Deep neural networks, consisting of multiple layers, have the potential to learn hierarchical features from raw data. Activation functions prevent the network from collapsing into a linear model as you stack more layers. Without nonlinearity, the composition of multiple linear transformations would still result in a linear transformation, limiting the network's expressiveness.

  4. **Learn and Capture Local Patterns:** Activation functions introduce nonlinearity at the level of individual neurons. This nonlinearity enables neurons to capture and respond to local patterns and features in the input data. Neurons with activation functions can act as feature detectors, identifying specific characteristics in the data.

  5. **Enable Nonlinear Decision Boundaries:** In classification tasks, activation functions allow neural networks to create nonlinear decision boundaries. This is crucial for distinguishing between complex classes that cannot be separated by simple linear boundaries. Activation functions enable neural networks to learn decision regions that are more flexible and expressive.

  6. **Overcome the Vanishing Gradient Problem:** Certain activation functions, like the rectified linear unit (ReLU), help mitigate the vanishing gradient problem, which can hinder the training of deep networks. By allowing gradients to pass more freely during backpropagation, ReLU and similar functions enable the training of very deep architectures.

  7. **Introduce Sparse or Dense Activations:** ReLU outputs exactly zero for every negative input, so it produces sparse activations, which are cheap to compute and can have a mild regularizing effect. Saturating functions like the sigmoid and hyperbolic tangent (tanh), as well as variants such as Leaky ReLU and ELU, almost never output exactly zero and therefore produce dense activations.
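
The claim in point 3, that stacked linear layers collapse into a single linear layer, can be verified numerically. In the sketch below (layer sizes chosen arbitrarily), two linear layers without an activation are rewritten as one layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer1 = nn.Linear(6, 5)
layer2 = nn.Linear(5, 3)
x = torch.randn(10, 6)

with torch.no_grad():
    # Two stacked linear layers with no activation in between.
    stacked = layer2(layer1(x))

    # The same mapping written as ONE linear layer.
    W = layer2.weight @ layer1.weight
    b = layer2.weight @ layer1.bias + layer2.bias
    collapsed = x @ W.T + b

    # With a ReLU in between, the single-layer rewrite no longer matches.
    nonlinear = layer2(torch.relu(layer1(x)))

is_linear = torch.allclose(stacked, collapsed, atol=1e-5)
differs = not torch.allclose(nonlinear, collapsed, atol=1e-5)
```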

![image](https://www.simplilearn.com/ice9/free_resources_article_thumb/list-of-activation-functions-used-with-perceptron.jpg)

## 4 Training Loops (Forward and Backward)

**Training Loops (Forward and Backward):**
   - Training loops consist of two main phases:
     - Forward Pass: During this phase, input data is passed through the model to obtain predictions.
     - Backward Pass (Autograd): PyTorch's automatic differentiation engine (Autograd) automatically tracks operations during the forward pass. Gradients with respect to the loss are computed during the backward pass.
   - Training loops are organized with nested iterations. The inner loop typically iterates over batches of data, while the outer loop iterates over epochs, which are complete passes through the entire training dataset.
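
A complete loop with both levels of iteration might look like the sketch below. The synthetic data (a noisy line `y = 3x + 1`), the batch size, the learning rate, and the number of epochs are all assumptions made for the example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical synthetic regression data: y = 3x + 1 plus a little noise.
features = torch.randn(256, 1)
labels = 3 * features + 1 + 0.05 * torch.randn(256, 1)
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):                             # outer loop: epochs
    running = 0.0
    for batch_x, batch_y in loader:                 # inner loop: mini-batches
        optimizer.zero_grad()                       # reset gradients from the last step
        loss = criterion(model(batch_x), batch_y)   # forward pass
        loss.backward()                             # backward pass (autograd)
        optimizer.step()                            # parameter update
        running += loss.item() * batch_x.size(0)
    epoch_losses.append(running / len(loader.dataset))
```

Calling `optimizer.zero_grad()` in every iteration matters because PyTorch accumulates gradients across `backward()` calls by default.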



## 5 Gradient Descent Optimization Techniques

- Gradient Descent Optimization:
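  - Stochastic Gradient Descent (SGD) updates weights using gradients computed on mini-batches of data rather than on the full dataset. In PyTorch, `optim.SGD` applies the update to whatever gradients the latest `backward()` call produced.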
  - Mini-batch Gradient Descent balances computation efficiency and convergence speed.
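
The update rule behind plain SGD is $w \leftarrow w - \eta \, \nabla_w L$, where $\eta$ is the learning rate. The sketch below applies it by hand and checks that `optim.SGD` (without momentum or weight decay) produces the same result. The toy loss function is an assumption for illustration.

```python
import torch

torch.manual_seed(0)
lr = 0.1

# Two copies of the same parameter: one updated by hand, one by optim.SGD.
w_manual = torch.randn(3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)
x = torch.tensor([1.0, 2.0, 3.0])

def loss_fn(w):
    # Toy squared-error loss on a single example.
    return ((w * x).sum() - 1.0) ** 2

# Manual update: w <- w - lr * dL/dw
loss_fn(w_manual).backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same step through the optimizer.
optimizer = torch.optim.SGD([w_optim], lr=lr)
optimizer.zero_grad()
loss_fn(w_optim).backward()
optimizer.step()

same_update = torch.allclose(w_manual, w_optim)
```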

## 6 Learning Rates


- Learning Rates:
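  - Learning rate is a hyperparameter that controls the step size during weight updates. Choosing the right learning rate is crucial for training stability and convergence: a rate that is too large can make the loss oscillate or diverge, while one that is too small makes training slow.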
  - Learning rate schedulers adjust the learning rate during training to improve convergence.
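
As one example of a scheduler, `StepLR` multiplies the learning rate by a fixed factor at regular intervals. The values `step_size=2` and `gamma=0.5` below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Multiply the learning rate by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lr_history = []
for epoch in range(6):
    lr_history.append(optimizer.param_groups[0]["lr"])
    # ... the training batches for this epoch would run here ...
    optimizer.step()      # placeholder so the scheduler is stepped after the optimizer
    scheduler.step()      # advance the schedule once per epoch

expected = [0.1, 0.1, 0.05, 0.05, 0.025, 0.025]
```

The scheduler is stepped once per epoch here, after the optimizer; some schedulers are instead designed to be stepped after every batch.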




## 7 Weight Initialization Strategies

- Weight Initialization:
  - Proper weight initialization helps prevent vanishing/exploding gradients and accelerates convergence.
  - Xavier/Glorot initialization (designed for tanh and sigmoid activations) and He initialization (designed for ReLU-style activations) are commonly used strategies.
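
Both strategies are available in `torch.nn.init`. In the sketch below, `init_weights` is a hypothetical helper and the layer sizes are arbitrary. It applies He initialization to a small network and Xavier initialization to a single layer.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

def init_weights(module):
    """He (Kaiming) init for Linear weights, zeros for biases."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(init_weights)   # apply() visits every submodule recursively

# He normal init draws weights with standard deviation sqrt(2 / fan_in).
he_std = model[0].weight.std().item()

# Xavier/Glorot uniform draws from [-a, a] with a = sqrt(6 / (fan_in + fan_out)).
layer = nn.Linear(64, 128)
nn.init.xavier_uniform_(layer.weight)
bound = math.sqrt(6.0 / (64 + 128))
max_abs_weight = layer.weight.abs().max().item()
```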

## 8 Regularization Techniques

**Regularization Techniques:**
   - Regularization techniques are used to prevent overfitting:
     - Dropout randomly drops neurons during training, forcing the network to be more robust.
     - Weight Decay (L2 Regularization) adds a penalty term to the loss to discourage large weight values.
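     - Batch Normalization normalizes activations within a layer across each mini-batch. It stabilizes and accelerates training (an effect originally attributed to reducing internal covariate shift) and also has a mild regularizing effect.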
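
The effect of batch normalization is easy to observe on a batch with a deliberately shifted and scaled distribution. A minimal sketch, with the feature count and input statistics chosen arbitrarily:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(num_features=3)

# A batch whose features have clearly non-zero mean and non-unit variance.
x = torch.randn(256, 3) * 5.0 + 10.0

bn.train()                 # training mode: normalize with the batch statistics
y = bn(x)

batch_mean = y.mean(dim=0)                 # close to 0 for every feature
batch_var = y.var(dim=0, unbiased=False)   # close to 1 for every feature

bn.eval()                  # evaluation mode: use the running estimates instead
with torch.no_grad():
    y_eval = bn(x)
```

Like dropout, batch normalization behaves differently in training and evaluation mode, which is another reason to call `model.eval()` before inference.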