# Assignment 4

#### 1. What is the concept of cyclical momentum?

The term "cyclical momentum" refers to varying the momentum hyperparameter over the course of training instead of keeping it fixed. Momentum itself is a technique that stochastic gradient descent (SGD) and related optimisation algorithms use to speed up convergence and make the learning process more stable.

Momentum is the parameter that determines how much weight the previous update receives when the current update of the model parameters is computed. Even when the gradient shifts or fluctuates, it helps the optimisation algorithm keep moving in its previous direction. The momentum term acts as a moving average of the gradients and lets the algorithm move through flat regions or shallow local optima by building up speed in consistent directions.
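The update rule described above can be sketched for a single scalar weight. This is a minimal illustration, not a reference implementation: the function name `sgd_momentum_step`, the learning rate, and the momentum value are assumptions chosen for the example, and the velocity is accumulated as `momentum * velocity + grad` (some formulations scale the gradient by `1 - momentum` instead).

```python
def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    # The velocity blends the previous update direction with the new gradient.
    velocity = momentum * velocity + grad
    # The weight moves along the accumulated velocity, not the raw gradient.
    w = w - lr * velocity
    return w, velocity


# Minimise f(w) = w**2, whose gradient is 2*w, starting from w = 5.0.
w, velocity = 5.0, 0.0
for _ in range(200):
    w, velocity = sgd_momentum_step(w, 2 * w, velocity)
```

After the loop, `w` should be very close to the minimum at zero.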

In the concept of cyclical momentum, the momentum parameter is varied cyclically during training. Instead of using a fixed momentum value throughout the training process, the momentum is gradually increased and decreased within a specific range or cycle. This cyclic variation allows the optimization algorithm to explore different update directions and adapt the momentum based on the current state of the training process.

The idea behind cyclical momentum is to introduce additional exploration and adaptability to the optimization process. By varying the momentum, the algorithm can escape from poor local minima, explore different regions of the parameter space, and potentially find better solutions. This cyclic variation of momentum helps to balance exploration and exploitation, allowing the optimization process to explore wider regions during some cycles and exploit promising regions during others.

Cyclical momentum is often used in conjunction with other techniques, such as learning rate schedules or cyclical learning rates. In the 1cycle policy, for example, momentum is lowered while the learning rate is raised and raised again while the learning rate is annealed, so that the two move in opposite directions. Combining cyclical momentum with a cyclical learning rate in this way can make training more robust and can help the network reach better solutions.
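A cyclical schedule of this kind can be sketched as a function that returns a learning rate and a momentum value for each training step. The function names, the cosine shape, and all numeric values below are illustrative assumptions, and for simplicity the learning rate ends at the value it started from.

```python
import math


def cosine_anneal(start, end, pct):
    """Move smoothly from `start` to `end` as `pct` goes from 0 to 1."""
    return end + (start - end) * (1 + math.cos(math.pi * pct)) / 2


def one_cycle(step, total_steps, lr_max=0.01, lr_min=0.0004,
              mom_high=0.95, mom_low=0.85, warmup_frac=0.25):
    """Return (learning_rate, momentum) for a given training step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        pct = step / warmup_steps
        # Warm-up: learning rate rises while momentum falls.
        return (cosine_anneal(lr_min, lr_max, pct),
                cosine_anneal(mom_high, mom_low, pct))
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Annealing: learning rate falls while momentum rises again.
    return (cosine_anneal(lr_max, lr_min, pct),
            cosine_anneal(mom_low, mom_high, pct))


schedule = [one_cycle(s, 100) for s in range(100)]
```

Plotting the two columns of `schedule` against the step index would show the learning rate rising and then falling while the momentum does the opposite.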

#### 2. What callback keeps track of hyperparameter values (along with other data) during training?

In fastai, the callback that records hyperparameter values (together with other data) during training is `Recorder`. Other deep learning libraries built on PyTorch or TensorFlow offer comparable callback loggers or metrics loggers. Throughout the training process, such a callback is in charge of recording a variety of data, including hyperparameter values such as the learning rate and momentum, loss values, accuracy, validation metrics, and other relevant statistics.

The callback logger typically operates by intercepting specific events or stages during training, such as the completion of each training batch, the end of each epoch, or the completion of the entire training process. It collects the desired data at these points and stores them for later analysis or visualization.

The specific implementation and functionality of the callback logger vary with the deep learning framework or library being used. Plain PyTorch has no built-in callback system, but the `torch.utils.tensorboard.SummaryWriter` class can play a similar role when it is called from the training loop. It writes the logged metrics and hyperparameters to TensorBoard-compatible files, which can then be visualized with the TensorBoard tool.

Different deep learning frameworks and packages offer different callback loggers or metrics recording systems. These recorders are useful for tracking the progress of training, evaluating the model's performance, and making informed decisions based on the logged data, such as adjusting hyperparameters, spotting overfitting, or comparing training runs.
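A callback of this kind can be sketched in a few lines. The class name `HyperparamRecorder` and its `after_batch` hook are hypothetical and are not the API of any particular framework; the training loop is a stand-in with made-up values.

```python
class HyperparamRecorder:
    """Hypothetical callback that stores hyperparameters and loss per batch."""

    def __init__(self):
        self.history = {"lr": [], "momentum": [], "loss": []}

    def after_batch(self, lr, momentum, loss):
        # Called by the training loop once each batch has been processed.
        self.history["lr"].append(lr)
        self.history["momentum"].append(momentum)
        self.history["loss"].append(loss)


recorder = HyperparamRecorder()
# Stand-in training loop: the values below are made up for illustration.
for step in range(5):
    lr = 0.01 * (step + 1)
    loss = 1.0 / (step + 1)
    recorder.after_batch(lr=lr, momentum=0.9, loss=loss)
```

After training, `recorder.history` holds one entry per batch for each tracked quantity, ready to be plotted or compared across runs.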

#### 3. In the color dim plot, what does one column of pixels represent?

In fastai's colour dim plot (`color_dim`, provided by the `ActivationStats` callback), one column of pixels represents the histogram of one layer's activations for a single training batch.

The plot is laid out so that the horizontal axis runs over the training batches and the vertical axis runs over the histogram bins, that is, ranges of activation values. The colour of each pixel denotes how many activations from that batch fall into that bin, usually on a logarithmic scale so that sparsely populated bins remain visible.

For example, a column that is bright only in the bins closest to zero means that almost all of the layer's activations were near zero for that batch, whereas a column with colour spread over many bins means that the activations covered a wide range of values.

Reading the columns from left to right shows how the distribution of activations evolves during training. This can reveal whether training is smooth and stable or whether the activations repeatedly grow and collapse.


#### 4. In color dim, what does "bad training" look like? What is the reason for this?

"Bad training" in the context of a colour dim plot refers to a picture in which the activations repeatedly grow and then collapse instead of settling into a stable spread. At the start of training almost all activations are near zero, so each column is bright only in the bins closest to zero. Over the next batches the number of non-zero activations grows rapidly, roughly exponentially, until the activations overshoot and collapse back to near zero. The plot then looks as if training had restarted from scratch, and the same grow-and-collapse cycle may repeat several times before the activations finally spread over the whole range.

There are a few reasons why this pattern arises and why it indicates a problem:

1. **Unstable Activations**: The cycles are a sign that training is unstable in the early batches, typically because the learning rate is too high for the initial weights or because the initialisation is poor. Each update pushes the activations further until they blow up.

2. **Many Near-Zero Activations**: Every collapse leaves a large share of the activations close to zero. Units that output near-zero values contribute little to the computation and receive little gradient, so training is slow and the final results are worse.

3. **Contrast with Healthy Training**: A well-behaved run shows a smooth colour dim plot in which the activations spread across the range early on and stay there. Techniques such as better initialisation, a gentler learning rate schedule, or batch normalization aim to produce that picture.

#### 5. Does a batch normalization layer have any trainable parameters?

Yes. A batch normalisation layer has trainable parameters. During training, the layer learns and updates two kinds of parameters:

1. **Scale Parameter (γ)**: This parameter is used to scale the normalized activations. It allows the batch normalization layer to control the magnitude of the normalized activations. The scale parameter is learned and updated during training, which means it is a trainable parameter.

2. **Shift Parameter (β)**: This parameter is used to shift the normalized activations. It allows the batch normalization layer to control the bias or offset of the normalized activations. Similar to the scale parameter, the shift parameter is also learned and updated during training, making it a trainable parameter.

These trainable parameters, the scale parameter (γ) and the shift parameter (β), provide flexibility to the batch normalization layer to adapt to different distributions of the input data and optimize the performance of the neural network. By adjusting the scale and shift of the normalized activations, batch normalization helps to improve the gradient flow, reduce internal covariate shift, and stabilize the training process.

It's important to note that batch normalization also keeps two additional sets of values that are not trainable. They are updated as moving averages during training rather than by gradient descent:

1. **Running Mean**: This statistic keeps track of the moving average of the mean value of the input batch during training. It is used during inference to normalize the input based on the estimated population mean.

2. **Running Variance**: This statistic keeps track of the moving average of the variance of the input batch during training. It is used during inference to normalize the input based on the estimated population variance.

#### 6. In batch normalization during training, what statistics are used to normalize? What about during the validation process?

During training, the statistics used to normalise the input are calculated from the mini-batch currently being processed. Specifically, two statistics are computed: the mean and the variance.

The mean is the average of the input batch along each channel or feature dimension. It gives the average value of the batch's activations.

The variance is computed from the input batch along each channel or feature dimension. It measures the spread or dispersion of the batch's activations.

Once the mean and variance are computed for the current mini-batch, the input batch is normalized using these statistics. The normalization subtracts the mean from each activation and then divides by the square root of the variance, to which a small constant is added for numerical stability. This step gives the activations a mean of approximately zero and a variance of approximately one, which helps to stabilize the input distribution before the learned scale and shift are applied.

During the validation process or inference, batch normalization behaves slightly differently. Instead of using the statistics calculated from the current mini-batch, the running mean and running variance are used for normalization. These running statistics are computed as moving averages during the training process.

The running mean represents the moving average of the mean values calculated across all the mini-batches seen during training. It provides an estimation of the population mean.

Similarly, the running variance represents the moving average of the variance values calculated across all the mini-batches seen during training. It provides an estimation of the population variance.

These running mean and variance values are used to normalise the input data during validation or inference. As a result, the normalisation does not depend on the composition of the current batch, and it stays consistent with the statistics seen during training.

By using the running mean and variance during validation, batch normalisation gives consistent, deterministic behaviour at inference time, which helps the model treat unseen data the same way it treated the training data.
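The difference between the two modes can be sketched for a single feature in plain Python. This is a simplified illustration under stated assumptions: the class name `BatchNorm1Feature` is hypothetical, the moving-average factor of 0.1 and the constant `eps` are chosen for the example, and the biased variance is used for both the normalisation and the running estimate.

```python
class BatchNorm1Feature:
    """Batch normalisation for a single feature, written out in plain Python."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0                # trainable scale and shift
        self.running_mean, self.running_var = 0.0, 1.0  # not trainable
        self.momentum, self.eps = momentum, eps

    def forward(self, batch, training):
        if training:
            # Training: normalise with the statistics of this mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Update the moving averages used later at inference time.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Validation/inference: use the stored running statistics.
            mean, var = self.running_mean, self.running_var
        return [self.gamma * (x - mean) / (var + self.eps) ** 0.5 + self.beta
                for x in batch]


bn = BatchNorm1Feature()
train_out = bn.forward([2.0, 4.0, 6.0, 8.0], training=True)
eval_out = bn.forward([2.0, 4.0, 6.0, 8.0], training=False)
```

The same input produces different outputs in the two modes: `train_out` is centred on zero because it uses the batch's own mean, while `eval_out` is normalised with the running estimates accumulated so far.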

#### 7. Why do batch normalization layers help models generalize better?

Batch normalization layers help models generalize better for several reasons:

1. **Normalization of Activations**: Batch normalization normalizes the activations within each mini-batch by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. This normalization process helps to stabilize the distribution of activations, making them more consistent and reducing the internal covariate shift. By ensuring that the activations have similar statistical properties across mini-batches, batch normalization reduces the sensitivity of the model to the scale and distribution of the input data. This, in turn, helps the model generalize better to unseen examples.

2. **Regularization Effect**: Batch normalization introduces a regularization effect by adding noise to the hidden units through the normalization process. The noise injected into the activations during training acts as a form of regularization, which can prevent overfitting. By adding noise to the activations, batch normalization reduces the reliance of the model on specific instances or patterns within the mini-batches, forcing it to learn more robust and generalized representations.

3. **Stabilized Gradient Flow**: Batch normalization helps to alleviate the vanishing gradient problem and ensures a more stable gradient flow during training. By normalizing the activations and reducing the internal covariate shift, batch normalization keeps the gradients within a reasonable range, preventing them from becoming too small or too large. This facilitates the backpropagation of gradients through deep networks, enabling more effective and efficient training. The stabilized gradient flow allows the model to converge faster and generalize better to new examples.

4. **Reduction of Dependency on Initialization**: Batch normalization reduces the dependence of the model on the specific initialization of the network parameters. By normalizing the activations, batch normalization makes the model less sensitive to the choice of initialization values. This reduces the burden of finding the optimal initialization scheme and makes the training process more robust and reliable.

#### 8. Explain the difference between MAX POOLING and AVERAGE POOLING.

1. **Max Pooling**: In max pooling, the input feature map is divided into small patches (pooling regions), typically non-overlapping and of size 2x2 or 3x3. Within each patch, the maximum value is selected as the representative value. This means that only the highest activation within each patch is preserved and the rest are discarded. Max pooling emphasizes the most prominent activations within each patch and retains the strongest signal. It is effective at capturing the most salient features and provides some translation invariance, which makes it particularly useful for tasks where detecting the presence of specific features is important.

2. **Average Pooling**: In average pooling, the input feature map is divided into patches in the same way as in max pooling. However, instead of selecting the maximum value, average pooling calculates the average (mean) value within each patch. This means that all the activations within each patch contribute equally to the pooled value. Average pooling provides a smoother representation of the input and helps to reduce noise or small variations. It can be useful in scenarios where the absolute presence or precise location of specific features is less important and a more generalized representation is desired.

By lowering the spatial dimensions of the feature maps, both max pooling and average pooling reduce the number of parameters and the computational cost in subsequent layers. By condensing the data in each pooling region, they also introduce a certain amount of translation invariance.

The choice between max pooling and average pooling depends on the problem at hand and the desired properties of the final representation. Because it preserves the strongest activations and is more robust to small translations, max pooling is frequently chosen when recognising specific features or patterns. Average pooling, on the other hand, can be helpful when a smoother, more generalised summary is wanted, for example when summarising whole feature maps in image classification.
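The two operations can be compared on a small example. The sketch below is a plain-Python illustration that assumes non-overlapping square patches; the function name `pool2d` and the sample feature map are made up for the example.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool a 2-D list of numbers with non-overlapping size x size patches."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            patch = [feature_map[r + i][c + j]
                     for i in range(size) for j in range(size)]
            if mode == "max":
                pooled_row.append(max(patch))
            else:
                pooled_row.append(sum(patch) / len(patch))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 9, 3],
    [4, 2, 1, 3],
]
max_pooled = pool2d(feature_map, mode="max")      # [[7, 2], [4, 9]]
avg_pooled = pool2d(feature_map, mode="average")  # [[4.0, 1.0], [2.0, 4.0]]
```

In the bottom-right patch, max pooling keeps the single strong activation 9, while average pooling dilutes it to 4.0 because the three weaker values count equally.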

#### 9. What is the purpose of the POOLING LAYER?

1. **Dimensionality Reduction**: The pooling layer reduces the spatial dimensions (width and height) of the feature maps. This downsampling helps to decrease the computational complexity of the network and reduce the number of parameters. By reducing the spatial resolution, the pooling layer enables subsequent layers to focus on more high-level and abstract features.

2. **Translation Invariance**: Pooling layers introduce a degree of translation invariance by summarizing the information within local regions. By aggregating the features within each pooling region, the pooling layer can capture the most important information and discard less relevant or redundant details. This translation invariance property allows the network to recognize patterns or features regardless of their exact location within the input.

3. **Feature Extraction**: The pooling layer acts as a feature extractor by preserving the most prominent features. By selecting the most representative value within each pooling region (e.g., maximum value in max pooling), the pooling layer retains the strongest activations and discards weaker ones. This helps to emphasize important features and reduce the influence of noisy or less significant activations.

4. **Spatial Hierarchies**: Pooling layers create spatial hierarchies by gradually reducing the spatial dimensions. As the pooling layers are stacked, the receptive fields of the neurons in higher layers become larger, capturing information from larger regions of the input. This hierarchical structure allows the network to learn increasingly complex and abstract features at higher layers.
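The dimensionality reduction described in the list above follows a simple size formula along each spatial dimension. The sketch below assumes the common convention of rounding down and a stride equal to the kernel size unless stated otherwise; the function name `pooled_size` and the input size of 224 are chosen only for illustration.

```python
def pooled_size(input_size, kernel_size, stride=None, padding=0):
    """Spatial size of a pooling output along one dimension."""
    # Assumption: stride defaults to the kernel size (non-overlapping windows).
    stride = kernel_size if stride is None else stride
    return (input_size + 2 * padding - kernel_size) // stride + 1


size = 224
sizes = [size]
for _ in range(3):  # three stacked 2x2 pooling layers
    size = pooled_size(size, kernel_size=2)
    sizes.append(size)
# sizes == [224, 112, 56, 28]
```

Each 2x2 pooling layer halves the width and height, so three of them shrink the number of spatial positions by a factor of 64.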

#### 10. Why do we end up with FULLY CONNECTED LAYERS?

1. **Global Information Aggregation**: Convolutional layers in a CNN extract local features through convolutional operations with small receptive fields. However, these local features may not capture the global context or relationships between different parts of the input. Fully connected layers, on the other hand, provide a way to aggregate information from the entire input space by connecting each neuron to every neuron in the previous layer. This allows the network to capture global patterns and dependencies, making it suitable for tasks that require holistic understanding or higher-level reasoning.

2. **Non-Local Relationships**: Fully connected layers enable the model to learn non-local relationships between different features. While convolutional layers are effective at learning local patterns and spatial hierarchies, they may not capture long-range dependencies or relationships that extend beyond the receptive field of the convolutional filters. Fully connected layers allow for more flexible connections between neurons, facilitating the learning of complex non-local relationships in the data.

3. **Feature Combination**: Fully connected layers provide a mechanism for combining features learned by earlier layers. The convolutional layers in a CNN extract low-level features in the earlier layers and progressively learn more abstract and higher-level features in subsequent layers. Fully connected layers receive these learned features as inputs and can combine them to form more complex representations. This allows the model to learn task-specific combinations of features that are relevant for the given problem.

4. **Decision Making**: Fully connected layers are often used as the final layers of a CNN to make predictions or decisions based on the learned features. The last of these layers typically has a fixed number of neurons corresponding to the desired output classes or regression targets. Each neuron in that final layer represents a specific class or target, and its activation indicates the model's confidence or prediction for that class or target.

#### 11. What do you mean by PARAMETERS?

"Parameters" in the context of neural networks refer to the learnable variables that determine how the model behaves. The neural network learns these parameters from the training data in order to perform a given task or generate predictions. Because they control how the input data is transformed and processed within the network, parameters are a crucial part of the model.

The two main types of parameters in a neural network are:

1. **Weights**: Weights are the coefficients that multiply the input values at each layer. They represent the strength of the connections between neurons or units in different layers of the network. Each connection between two neurons has an associated weight that determines the contribution of the input value to the activation of the receiving neuron. The weights control the influence of the input on the output and are adjusted during the training process to optimize the performance of the network.

2. **Biases**: Biases are additional parameters in a neural network that represent the intercept or offset term. A bias term is added to the weighted sum of inputs at each neuron in the network. It allows the network to learn an offset or bias in the predictions, providing flexibility in modeling different patterns and relationships. Biases help the network capture non-zero mean patterns in the data and adjust the decision boundaries of the model.

These parameters are initialised with random values, and the network learns better values through training. A loss function that measures the difference between the predicted outputs and the actual targets is used to assess the network's performance during training. Optimisation techniques such as gradient descent then update the weights and biases in order to reduce the loss and improve the model's performance.

The architecture and configuration of the network, including the number of layers, the number of neurons in each layer, and any other design decisions, determine the number of parameters in a neural network. Finding good values for these parameters during learning allows the network to represent the underlying patterns accurately and to make predictions on new, unseen data.
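The training process for weights and biases can be sketched with the smallest possible model, a single neuron computing `w * x + b`. The function name `train_neuron`, the learning rate, the number of epochs, and the toy data are assumptions for illustration, and the parameters start at zero rather than at random values so that the example is reproducible.

```python
def train_neuron(inputs, targets, lr=0.05, epochs=500):
    """Fit y = w * x + b to the data with plain gradient descent on MSE."""
    w, b = 0.0, 0.0  # zeros instead of random values, for reproducibility
    n = len(inputs)
    for _ in range(epochs):
        preds = [w * x + b for x in inputs]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (p - t) * x
                     for p, t, x in zip(preds, targets, inputs)) / n
        grad_b = sum(2 * (p - t) for p, t in zip(preds, targets)) / n
        # Step both parameters against their gradients to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_neuron(xs, ys)
```

After training, `w` and `b` should be close to 2 and 1, the weight and bias that generated the data.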

#### 12. What formulas are used to count these PARAMETERS?

1. **Number of Weights (W)**: The number of weights in a neural network can be calculated by summing the number of weights in each layer. For a fully connected layer, the number of weights is determined by the sizes of its input and output. If the input has dimension M and the output has dimension N, the number of weights in that layer is M * N. For a standard convolutional layer, the number of weights is `K_h * K_w * C_in * C_out`, where `K_h` and `K_w` are the height and width of the filter and `C_in` and `C_out` are the numbers of input and output channels.

2. **Number of Biases (B)**: The number of biases in a neural network is equal to the number of neurons or units in each layer, excluding the input layer. Each neuron (except in the input layer) has its own bias term, so the number of biases in a fully connected layer is the same as the number of neurons in that layer. A convolutional layer has one bias per output channel.

3. **Total Number of Parameters**: The total number of parameters in a neural network is the sum of the number of weights and the number of biases. It represents the overall complexity of the model and the amount of memory required to store and process the parameters during inference.

4. **Total Memory Usage**: The memory usage of a neural network is related to the total number of parameters. Each parameter typically requires a certain number of bits or bytes to be stored. The memory usage can be computed by multiplying the total number of parameters by the number of bits/bytes required to store each parameter.
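The formulas above can be applied to a small example. The helper names `dense_params` and `conv2d_params` and the layer sizes are hypothetical, and the memory estimate assumes that each parameter is stored as a 32-bit float.

```python
def dense_params(n_in, n_out, bias=True):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_params(c_in, c_out, kernel_h, kernel_w, bias=True):
    """Weights plus biases of a standard 2-D convolutional layer."""
    return kernel_h * kernel_w * c_in * c_out + (c_out if bias else 0)


# Hypothetical small network, chosen only for illustration.
total = (
    conv2d_params(3, 16, 3, 3)      # 3*3*3*16 + 16 = 448
    + dense_params(16 * 8 * 8, 64)  # 1024*64 + 64 = 65,600
    + dense_params(64, 10)          # 64*10 + 10 = 650
)
bytes_per_param = 4  # assuming 32-bit floats
memory_bytes = total * bytes_per_param
```

Here the first fully connected layer accounts for almost all of the 66,698 parameters, which is typical when a dense layer follows flattened feature maps.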