### Which weight initializer is used when?

* **Xavier/Glorot Initialization**: Typically used with the Sigmoid or Tanh activation function.


* **He Initialization**: Typically used with ReLU and its variants.

Both initializers come in two variants, `normal` and `uniform`. Try both and keep whichever performs better on validation data.
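As a rough illustration of what these initializers do, the NumPy sketch below draws weights with the usual Glorot-uniform and He-normal scaling rules. The function names and layer sizes are made up for this example, and a plain normal distribution is used for simplicity.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He normal: N(0, std^2) with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w_glorot = glorot_uniform(256, 128, rng)  # e.g. for a tanh or sigmoid layer
w_he = he_normal(256, 128, rng)           # e.g. for a ReLU layer
```

In Keras the same choice is usually made by passing a string such as `'glorot_uniform'`, `'glorot_normal'`, `'he_uniform'` or `'he_normal'` to a layer's `kernel_initializer` argument.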

* Weight initialization is important for both the input and hidden layers of a neural network. Proper weight initialization helps in improving the convergence speed and stability of the training process. 


* Weight initializers are used in all layers of a neural network, not just the input layer. The best practice for weight initialization depends on the activation function used in the neural network.


* There is no strict rule that ties each weight initializer to particular activation functions. Experiment with trial and error; this is `important` in real projects.


* Proper weight initialization can help the neural network learn from the data in a more efficient way. 


* There are many different weight initialization techniques available. The best technique to use depends on the specific task and the architecture of the ANN.


* It is important to experiment with different weight initialization techniques to see which one works best for your particular task.


* For neural networks with ReLU activation functions, it is common to use a He initialization.


* He initialization scales the initial weight variance to 2 / fan_in, which keeps the variance of the activations roughly constant from layer to layer with ReLU and so lets the network learn more quickly.

**Note**



* Weight initializers apply to layers that have trainable weights, such as the convolutional and dense layers of a CNN. Pooling layers have no weights, so there is nothing to initialize. A suitable initializer in every weighted layer helps the network converge; it is not a regularizer, so it does not by itself prevent overfitting.


* We can use weight initializers in every hidden layer of an ANN. In fact, it is generally recommended to choose an initializer for all weighted layers of an ANN, as it can improve the convergence of the network.


* Whether to use the same weight initializer in every layer of a CNN or ANN, or to use different initializers at each layer, is a matter of debate. There is no single answer that is universally correct, as the best approach will depend on the specific dataset and task that the network is being trained on.


* If the network is very deep, it may be beneficial to use different initializers at different layers. This is because the deeper layers of the network will need to learn more complex features, and using a different initializer can help to prevent the network from becoming too unstable.


* If the network is not very deep, it may be sufficient to use the same initializer in every layer. This is because the shallower layers of the network will not need to learn as complex features, and using the same initializer can help to ensure that the network converges more quickly.

### Which Activation Function is to apply in which layer and when?

* **Regression**: ReLU in the hidden layers + Linear in the output layer


* **Binary Classification**: ReLU in the hidden layers + Sigmoid in the output layer


* **Multi-Class Classification**: ReLU in the hidden layers + Softmax in the output layer


**`Note`**: These combinations are common defaults, not fixed rules. Use trial and error to check whether another combination works better.
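The combinations above can be sketched as a small Keras helper. The function name `build_model`, the task labels, and the hidden layer sizes (32 and 16) are assumptions made for this illustration, not a recommended architecture.

```python
import tensorflow as tf

def build_model(task, n_features, n_classes=None):
    """Build a small dense network whose output layer matches the task."""
    # Output layer settings per task: (number of units, activation)
    heads = {
        "regression": (1, "linear"),
        "binary": (1, "sigmoid"),
        "multiclass": (n_classes, "softmax"),
    }
    units, activation = heads[task]
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),   # hidden layers use ReLU
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(units, activation=activation),
    ])

model = build_model("multiclass", n_features=10, n_classes=4)
```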

### Q) Is batch normalization used in ANNs, and when? Code and best practice for applying it

Yes, batch normalization can be used in Artificial Neural Networks (ANNs). In fact, batch normalization was originally introduced for feedforward ANNs and has since been widely adopted in various neural network architectures.

* Batch normalization standardizes each feature over the mini-batch to roughly zero mean and unit variance (not to the range 0-1), then applies a learnable scale and shift. The activation function of the layer then sees inputs with a similar distribution from batch to batch, so the network learns more efficiently. That is why we use BN.


* BN is commonly placed before the activation function, as in the steps below. Placing it after the activation also works in practice, so treat the order as something to test. The vanishing gradient problem of the 1990s came from saturating activations such as sigmoid, not from BN.


**Steps**:-

1) Input data (features) are fed into the layer.


2) The layer applies a linear transformation (weights and biases) to the input data.


3) Batch Normalization is applied.


4) The batch-normalized output is passed through an activation function, producing the activations of the layer.
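The four steps can be sketched in NumPy as follows. This is a simplified training-mode forward pass under assumed shapes; a framework layer such as Keras `BatchNormalization` additionally keeps moving averages of the mean and variance for use at inference time.

```python
import numpy as np

def dense_bn_relu(x, w, b, gamma, beta, eps=1e-5):
    """Linear transformation -> batch normalization -> ReLU for one layer."""
    z = x @ w + b                             # step 2: linear transformation
    mean = z.mean(axis=0)                     # per-feature mean over the batch
    var = z.var(axis=0)                       # per-feature variance over the batch
    z_hat = (z - mean) / np.sqrt(var + eps)   # step 3: normalize
    z_bn = gamma * z_hat + beta               # learnable scale and shift
    return np.maximum(z_bn, 0.0), z_hat       # step 4: activation (ReLU)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))                 # step 1: a batch of 64 samples, 10 features
w = rng.normal(size=(10, 5))
b = np.zeros(5)
gamma, beta = np.ones(5), np.zeros(5)
activations, z_hat = dense_bn_relu(x, w, b, gamma, beta)
```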


**`Best Practice`**:-

**Apply Batch Normalization after Every Hidden Layer (or Almost All)**

**Avoid Batch Normalization in the Output Layer**

In [None]:
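import tensorflow as tf

# Placeholder sizes for illustration; set these from your own dataset.
input_dim = 20    # number of input features
output_dim = 3    # number of classes

model = tf.keras.Sequential([
    # Hidden block 1: Dense -> BatchNormalization -> ReLU
    tf.keras.layers.Dense(128, input_shape=(input_dim,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    # Hidden block 2: same ordering
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    # Output layer: no batch normalization
    tf.keras.layers.Dense(output_dim, activation='softmax')
])

# categorical_crossentropy expects one-hot encoded labels
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# x_train, y_train, x_val, y_val, batch_size and num_epochs are assumed to be defined earlier
model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs, validation_data=(x_val, y_val))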

### Which optimizer is best in which case?

In the large majority of problem statements 'Adam' is used as the default optimizer. Use trial and error to check whether another optimizer works better.

### Q) How to specify number of neuron in each layer?

**Input Layer**: The number of neurons equals the number of independent features in the dataset.


**Hidden Layer**: A common heuristic is to give the first hidden layer as many neurons as the input layer, the second hidden layer half of that, and so on. This is not a fixed rule, so use trial and error as well.


* As a rule of thumb, the number of neurons in the hidden layers should be between the number of neurons in the input layer and the number of neurons in the output layer.


**Output Layer**: For regression and binary classification, the output layer has one neuron. For multi-class classification (including image classification), the number of output neurons depends on the number of categories in the dependent variable.
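The halving heuristic for hidden layers can be written as a small helper. The function name and the `min_units` floor are assumptions for this sketch; the result is only a starting point for experimentation.

```python
def halving_layer_sizes(n_features, n_hidden_layers, min_units=1):
    """Return hidden layer sizes that start at n_features and halve each layer."""
    sizes = []
    units = n_features
    for _ in range(n_hidden_layers):
        sizes.append(max(units, min_units))  # never go below min_units
        units //= 2
    return sizes

hidden_sizes = halving_layer_sizes(n_features=64, n_hidden_layers=3)
print(hidden_sizes)
```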

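### Which loss function is used to calculate the loss?

* **Regression**: MSE, MAE, Huber Loss


* **Binary Classification**: Binary Cross-Entropy


* **Multi-Class Classification** (one-hot encoded labels): Categorical Cross-Entropy


* **Multi-Class Classification with integer labels** (common for image datasets loaded from folders): Sparse Categorical Cross-Entropy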
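The two multi-class cross-entropy losses compute the same quantity and differ only in the label format they expect. A minimal NumPy sketch with made-up probabilities shows this:

```python
import numpy as np

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    """Cross-entropy for one-hot encoded labels."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

def sparse_categorical_crossentropy(y_int, probs, eps=1e-12):
    """Cross-entropy for integer class labels."""
    picked = probs[np.arange(len(y_int)), y_int]  # probability of the true class
    return -np.mean(np.log(picked + eps))

probs = np.array([[0.7, 0.2, 0.1],    # predicted class probabilities, 2 samples
                  [0.1, 0.8, 0.1]])
y_int = np.array([0, 1])              # integer labels
y_onehot = np.eye(3)[y_int]           # the same labels, one-hot encoded

loss_onehot = categorical_crossentropy(y_onehot, probs)
loss_sparse = sparse_categorical_crossentropy(y_int, probs)
```

Both calls return the same value, so the choice between the two losses depends on how the labels are stored.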



### Good Practice for Dropout percentage.


**Smaller layers**: Use a lower dropout percentage, such as 0.1 or 0.2.


**Larger layers**: Use a higher dropout percentage, such as 0.3 or 0.4.


**More complex tasks**: Use a higher dropout percentage.


**Less training data**: Use a higher dropout percentage.


* Dropout regularization can be used in both the input and hidden layers of a neural network. However, it is more commonly used in the hidden layers. This is because the input layer is typically not as prone to overfitting as the hidden layers.
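To make the dropout rate concrete, here is a NumPy sketch of inverted dropout, the scheme commonly used by deep learning frameworks: units are zeroed during training and the survivors are scaled up so the expected activation stays the same. The function name and array sizes are illustrative assumptions.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return x                              # dropout is inactive at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob    # True where the unit is kept
    return x * mask / keep_prob               # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout(x, rate=0.3, training=True, rng=rng)
eval_out = dropout(x, rate=0.3, training=False, rng=rng)
```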

## Batch_size parameter

* The batch_size parameter in deep learning represents the number of training examples utilized in one forward/backward pass.


* A larger batch size gives a smoother gradient estimate and uses the hardware more efficiently, but it needs more memory and makes fewer parameter updates per epoch.


* A smaller batch size needs less memory and updates the parameters more often, but the gradient estimates are noisier.

**Small Batch Size**:


**Advantages**: Smaller batch sizes tend to converge faster as they update the model's parameters more frequently. They also require less memory, which can be beneficial when working with limited resources.


**Disadvantages**: Smaller batch sizes can introduce more noise to the parameter updates, potentially leading to less stable convergence and noisy gradients.


**Large Batch Size**:


**Advantages**: Larger batch sizes can provide smoother gradient updates, potentially leading to more stable convergence. They can also benefit from vectorized operations, improving computational efficiency.


**Disadvantages**: Larger batch sizes require more memory and make fewer parameter updates per epoch, so they may need more epochs to reach the same loss. They might also generalize less well to the validation set, which is often attributed to convergence toward sharp minima.
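The trade-off is easier to see with the arithmetic: the number of parameter updates per epoch is the dataset size divided by the batch size, rounded up. The dataset size below is a hypothetical value.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of forward/backward passes (parameter updates) in one epoch."""
    return math.ceil(n_samples / batch_size)

n_samples = 50_000  # hypothetical training set size
for batch_size in (32, 128, 512):
    print(batch_size, updates_per_epoch(n_samples, batch_size))
```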

### `Notes`:

* When working with image data, we first scale the pixel matrix. We can scale it manually or use a library. To feed a dense network, we then convert each image matrix into a 1D array with NumPy's `.reshape` function and build the neural network on that.


* After the neural network is built, we evaluate its performance with the same metrics we use in ML.
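A minimal sketch of the scaling and flattening step, assuming 8-bit grayscale images of size 28x28 (the random array stands in for a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch of 100 grayscale images with pixel values 0..255
images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

scaled = images.astype("float32") / 255.0   # scale pixel values to the range [0, 1]
flat = scaled.reshape(len(scaled), -1)      # flatten each 28x28 image to 784 values
print(flat.shape)
```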

# CNN

* **Strides can be set on both convolutional and pooling layers. In these notes the convolutional layers keep the default stride of 1, because there we want to extract as many primitive features as possible, and downsampling is left to the pooling layers, where we only want the high-level features.**


* **Using strides makes the deployed model faster at prediction, but some information is lost.**


* **In a CNN we apply batch normalization after convolutional layers (it can also be used after fully connected layers), but not after pooling layers.**


* **In a CNN the loss is calculated with binary cross-entropy, categorical cross-entropy, or sparse categorical cross-entropy, depending on the task and the label encoding.**


* **Pooling layers are used to reduce the dimensionality of the feature maps output by the convolutional layers, while activation functions are used to introduce non-linearity into the network. Non-linearity is important for deep learning models because it allows them to learn more complex patterns. Max pooling is itself non-linear and normally follows an activated convolution, so there is no need to add an activation function after it.**


* **When strides are used, more information is lost from the input data, which can lead to a slight increase in the error rate. However, the advantage of using strides is that it can speed up the training process and the processing speed of the application after deployment.**


* **Data augmentation is a technique that can be used to artificially increase the size of a training dataset by creating modified copies of images in the dataset. This can help to prevent overfitting, which is a problem that can occur when a model is trained on a small dataset.**


* **Data augmentation Increases training data, Reduces overfitting and improves performance of model.**


* **In real-world projects image sizes are large, so we use pooling in almost every project.**


* **In these notes we use strides greater than 1 only in the pooling layers, because the goal there is to keep the high-level features of the input. Strides lose information, so the convolutional layers are left at stride 1, although strided convolutions are also used in practice.**

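How much a stride shrinks a feature map follows from the standard output-size formula, `floor((n + 2p - k) / s) + 1`. The sketch below applies it to a 256-pixel input; the helper name is an assumption.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

conv_out = output_size(256, kernel=3, stride=1)        # 3x3 convolution, stride 1
strided_conv = output_size(256, kernel=3, stride=2)    # same convolution, stride 2
pooled = output_size(conv_out, kernel=2, stride=2)     # 2x2 pooling after the stride-1 convolution
print(conv_out, strided_conv, pooled)
```

A stride of 2 roughly halves each spatial dimension, which is where both the speed-up and the loss of information come from.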

### Q) Use of the batch_size parameter?

Batch size refers to the number of training examples used in one iteration of the training process in a CNN. It is important because it determines how many samples are processed together in each step during training. Using a proper batch size allows efficient utilization of computational resources and can lead to more stable and faster convergence during training.

### Q) What is Batch Normalization in CNN and when to apply?

* The purpose of batch normalization in CNNs is to stabilize and speed up the training process by normalizing the input data within each mini-batch(subset of data) during training. This helps the model converge faster(minimize loss faster) and improves overall performance.


* Batch normalization (BN) is a technique used to improve the training of deep neural networks. It is typically used after convolutional layers and before activation functions.


* Batch normalization works by normalizing the activations of a layer across a batch of inputs. This helps to stabilize the training process and prevents the activations from becoming too large or too small. BN can also help to improve the generalization performance of the model.

In [None]:
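from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D
model = Sequential()
# Convolution -> batch normalization -> activation -> pooling
model.add(Conv2D(32, kernel_size=(3, 3), padding='valid', input_shape=(256, 256, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding="valid"))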

### Q) Why do we apply batch normalization only after convolutional layers and not after pooling layers?

We use batch normalization after convolutional layers because it helps make the training process more stable and faster. Batch normalization normalizes the inputs to each layer, making the optimization process more efficient. Pooling layers are used for downsampling and do not require batch normalization for their specific task. Therefore, we apply batch normalization only after convolutional layers to improve the performance of the model during training.

### Q) Why do we apply the activation function after batch normalization instead of before?

* We apply the activation function after batch normalization to get stable and faster training. Batch normalization normalizes the inputs of the activation function, which keeps them out of its saturated regions and leads to more effective learning.


* Applying batch normalization before the activation function rescales the layer outputs within each mini-batch, so the model learns efficiently. It also speeds up the training process, and the activation function performs effectively on the normalized output of batch normalization.


* If we apply batch normalization after the activation function instead, it normalizes the activated outputs to roughly zero mean and unit variance (not to a range between 0 and 1). This ordering can also train well, so the choice is largely empirical; batch normalization before the activation is the convention used in these notes.

### Activation functions are applied after convolutional layers, not after pooling layers.

**Activation Function after Convolutional Layer**:
After applying convolution to the input data using a set of filters (kernels), the resulting feature maps go through an activation function element-wise. The activation function introduces non-linearity to the output of the convolutional layer, which allows the network to learn and model complex relationships in the data. Common activation functions used in CNNs include ReLU (Rectified Linear Unit), Leaky ReLU, ELU (Exponential Linear Unit), and others.

**No Activation Function after Pooling Layer**:
After applying pooling (e.g., MaxPooling or AveragePooling), no activation function is applied directly to the pooled feature maps. Pooling layers are there for down-sampling and spatial dimension reduction. Max pooling is already a non-linear operation, and because it follows an activated convolution, a further activation would add nothing.

The typical CNN architecture sequence involving convolutional layers, activation functions, and pooling layers is as follows:

**Input Data -> Convolutional Layer -> Activation Function -> Pooling Layer ->..**
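One way to see why an extra activation after max pooling adds little: ReLU is monotonic, so applying it before or after a max pooling step gives the same result. The NumPy check below uses a random 8x8 feature map and a simple 2x2 pooling helper written for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D feature map with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8))

pool_after_relu = max_pool_2x2(relu(feature_map))
relu_after_pool = relu(max_pool_2x2(feature_map))
print(np.array_equal(pool_after_relu, relu_after_pool))
```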


### Q) When to apply padding?

* Padding is applied to the input data before performing convolutional operations in a Convolutional Neural Network (CNN). The primary purpose of padding is to maintain the spatial dimensions of the input and output feature maps during convolution.


* Padding is usually not applied in pooling layers (they are left at `padding="valid"`), because there we only want to keep the high-level features.


* In industry, zero padding is usually used, but there are other padding methods as well, such as replicating the nearest border pixels (edge padding).
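Zero padding and edge (replicate) padding can be compared on a tiny matrix with NumPy's `np.pad`. The `same_padding` helper is an assumption for this sketch and only covers odd kernel sizes with stride 1.

```python
import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Zero padding: surround the image with a border of zeros
zero_padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# Edge padding: repeat the nearest border pixel instead of using zeros
edge_padded = np.pad(image, pad_width=1, mode="edge")

def same_padding(kernel):
    """Padding per side that keeps the output size equal to the input size (odd kernel, stride 1)."""
    return (kernel - 1) // 2

print(zero_padded)
print(edge_padded)
```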

### Dropout percentages

**In CNNs, dropout can be applied after convolutional blocks and after fully connected layers; it is rarely placed around pooling layers.**

**Good Practice in convolutional layers and pooling layers**

In **pooling layers**, dropout is not typically used. Pooling layers already discard information to reduce the size of the feature maps, and they have no trainable parameters to regularize. However, some researchers have experimented with using dropout in pooling layers, and some have found that it can improve the performance of the model.


**Here are some guidelines for using dropout in CNNs:**


**Convolutional layers**: Dropout can be used in convolutional layers to prevent overfitting. A good starting point is to use a dropout percentage of 0.2 or 0.3.


**Pooling layers**: Dropout is not typically used in pooling layers.


**Fully connected layers**: Dropout can be used in fully connected layers to prevent overfitting. A good starting point is to use a dropout percentage of 0.5 or 0.6.


### Which optimizer, loss function and metric to use in CNN:- 

* Use the binary cross-entropy loss function for binary classification tasks and categorical cross-entropy for multi-class classification tasks.


* Use the Adam optimizer, which is a good default choice for most deep learning tasks.


* Use the accuracy metric, which is a good choice for classification tasks when the classes are reasonably balanced.

### Data Augmentation


* Data augmentation is a technique used to artificially increase the size of a dataset by applying various transformations to the existing images, such as rotations, flips, translations, zooming, etc. While performing data augmentation, the total number of images may remain the same, but the dataset's diversity and variability increase due to the transformed images.


* However, it's important to note that data augmentation does not increase the unique images in the dataset. The augmented images are generated on-the-fly during training and do not create additional unique images. They are applied dynamically during each training epoch, allowing the model to see different variations of the original images and improve its generalization.


* So the number of unique images in the dataset stays the same (5000 in this example), but the model effectively sees a more diverse and varied dataset due to data augmentation, which can help improve the model's performance and robustness.
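On-the-fly augmentation can be sketched with Keras preprocessing layers. The choice of transformations, their strengths (0.1), and the random input tensor are assumptions for illustration; the layers only transform images in training mode and pass them through unchanged at inference.

```python
import tensorflow as tf

# Augmentation pipeline applied to each batch during training
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),   # rotate by up to about 10% of a full turn
    tf.keras.layers.RandomZoom(0.1),       # zoom in or out by up to 10%
])

images = tf.random.uniform((8, 64, 64, 3))      # hypothetical batch of 8 RGB images
augmented = augment(images, training=True)      # new random variations on every call
unchanged = augment(images, training=False)     # inference: images pass through as-is
```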



### Perfect Code

* **We can stack multiple convolutional layers in one block before the pooling layer.**


* **We can use batch normalization in dense layers also.**


* **We can use data augmentation when loading the dataset to prevent the model from overfitting.**


* **If you want to train a pretrained model on additional data without rebuilding its fully connected part, build the model on the same fully connected architecture.**

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization, Activation, Dropout

model = Sequential()

model.add(Conv2D(256, kernel_size=(3, 3), padding='same', input_shape=(256, 256, 3))) # for RGB Image
model.add(Conv2D(256, kernel_size=(3, 3)))  # input_shape is only needed on the first layer
model.add(Conv2D(256, kernel_size=(3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Conv2D(128, kernel_size=(3, 3), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Conv2D(64, kernel_size=(3, 3), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Conv2D(32, kernel_size=(3, 3), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Flatten())


model.add(Dense(256))  # ReLU is applied after batch normalization below
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.3))


model.add(Dense(128))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))


model.add(Dense(64))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))


model.add(Dense(32))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.1))


model.add(Dense(1, activation='sigmoid'))


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


model.summary()

**Create an early stopping callback object here, then continue with the code below.**
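A minimal sketch of such an object using the Keras `EarlyStopping` callback. The monitored quantity and the patience value are assumptions and should be tuned for the task.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 3 epochs,
# and roll back to the weights from the best epoch.
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# To use it, pass the callback to fit, for example:
# history = model.fit(train_ds, epochs=20, validation_data=validation_ds,
#                     callbacks=[early_stopping])
```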

In [None]:
# Fit the model to the training data and validate on the validation data
history = model.fit(train_ds, epochs=20, validation_data=validation_ds)



### Q) How to choose appropriate number of filters at every convolutional layer.


A common practice is to start with a small number of filters in the first layer and gradually increase the number of filters in subsequent layers. For example, in a basic CNN architecture, the number of filters might be set as follows:

**`First Convolutional Layer`**: Fewer filters, e.g., 32 or 64.


**`Intermediate Convolutional Layers`**: Increasing number of filters, e.g., 128, 256, etc.


**`Final Convolutional Layer`**: Usually the largest number of filters. In architectures that end with global average pooling instead of dense layers, the number of filters can be set to the number of classes in the classification task.

**Note**
* The choice of the number of filters can also depend on the available computational resources and the size of the dataset. 


* A larger number of filters increases the computational cost and, if the dataset is small, the risk of overfitting, while too few filters can lose information because the layer cannot capture enough distinct features.


* That is why we use a (3,3) filter: this size is neither too big nor too small, and it is the size most commonly used in industry. We can also use (1,1), (5,5), (7,7), or (11,11) filter sizes.
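The cost of adding filters or enlarging the kernel can be estimated from the parameter count of a convolutional layer, `kernel_h * kernel_w * in_channels * filters` weights plus one bias per filter. The layer sizes below are illustrative.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters, use_bias=True):
    """Number of trainable parameters in a standard 2D convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * filters
    biases = filters if use_bias else 0
    return weights + biases

first = conv2d_params(3, 3, in_channels=3, filters=32)     # 3x3 kernel on an RGB input
second = conv2d_params(3, 3, in_channels=32, filters=64)   # 3x3 kernel on 32 feature maps
wide = conv2d_params(7, 7, in_channels=32, filters=64)     # same layer with a 7x7 kernel
print(first, second, wide)
```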

### Q) How to specify correct input shape of images while creating dataset?


The choice of the image size (image_size) when creating a dataset of images depends on various factors, including the characteristics of the dataset, the available computational resources, and the requirements of the specific deep learning model or task you are working on.


As a starting point, image sizes such as (256, 256) or (224, 224) are commonly used in various computer vision tasks. These sizes are often chosen because they strike a balance between maintaining relevant details in the images and computational efficiency.


### Q) What will happen if we increase the number of convolutional layers?


Adding more convolutional layers makes the model more complex and harder to train, but it can also improve the model's performance by extracting more features. Whether it helps generalization depends on the amount of data and the regularization used.


**`Increased complexity`**: Adding more convolutional layers will increase the complexity of the CNN model. This can make the model more difficult to train, but it can also improve the model's performance.


**`Improved feature extraction`**: More convolutional layers can extract more features from the input data. This can improve the model's ability to recognize patterns in the data and make more accurate predictions.


**`Effect on overfitting`**: Extra layers add capacity. With enough data (or with augmentation and regularization), the richer features can help the model generalize better to new data, but on a small dataset a deeper model is more likely to memorize the training data.

### `'mini-batch'` parameter

The size of the mini-batch is a hyperparameter that can be tuned to improve the performance of the model. A smaller mini-batch size can help to prevent overfitting, but it can also make the training process slower. A larger mini-batch size can speed up the training process, but it can also increase the risk of overfitting.

The optimal mini-batch size depends on the specific problem and the dataset. However, a good starting point is a mini-batch size of 32 to 128.