### Which weight initializer is used when?

* **Xavier/Glorot Initialization**: Used with Sigmoid or Tanh activation functions.


* **He Initialization**: Used with ReLU.

Both initializers come in two variants: normal and uniform. Try both and select the one that performs better.

* Proper weight initialization can help the neural network learn from the data in a more efficient way. 


* There are many different weight initialization techniques available. The best technique to use depends on the specific task and the architecture of the ANN.


* It is important to experiment with different weight initialization techniques to see which one works best for your particular task.
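As a rough illustration of what these initializers do, here is a minimal NumPy sketch of Glorot uniform and He normal sampling. The function names and layer sizes are hypothetical, and the He variant uses a plain normal distribution for simplicity (Keras draws from a truncated normal).

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He normal (simplified): N(0, std^2) with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w_glorot = glorot_uniform(256, 128, rng)  # e.g. for a tanh/sigmoid layer
w_he = he_normal(256, 128, rng)           # e.g. for a ReLU layer

print("Glorot std:", w_glorot.std())
print("He std:    ", w_he.std())
```

In Keras the same choice is usually made through the `kernel_initializer` argument of a layer, for example `kernel_initializer='he_normal'` or `'glorot_uniform'`.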

### Which activation function should be applied in which layer, and when?

* **Regression**: ReLU (hidden layers) + Linear (output layer)


* **Binary Classification**: ReLU (hidden layers) + Sigmoid (output layer)


* **Multi-Class Classification**: ReLU (hidden layers) + Softmax (output layer)


**`Note`**: There is no rule that these combinations must always be used. Use trial and error to find a combination that works better for your problem.
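To make the role of the output activation concrete, the following NumPy sketch applies a linear, a sigmoid and a softmax output to the same made-up logits. The helper functions and values are illustrative assumptions, not part of any library.

```python
import numpy as np

def sigmoid(z):
    """Squash values into (0, 1); used for binary classification outputs."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a row of logits into probabilities that sum to 1."""
    shifted = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0]])  # hypothetical pre-activation outputs

regression_out = logits[:, :1]         # linear output: the value is used as-is
binary_out = sigmoid(logits[:, :1])    # probability of the positive class
multiclass_out = softmax(logits)       # one probability per class

print(regression_out, binary_out, multiclass_out)
```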

### Q) Can batch normalization be used in an ANN, when should it be applied, and what are the best practices (with code)?

Yes, batch normalization can be used in Artificial Neural Networks (ANNs). In fact, batch normalization was originally introduced for feedforward ANNs and has since been widely adopted in various neural network architectures.

**Steps**:-

1) Input data (features) are fed into the layer.


2) The layer applies a linear transformation (weights and biases) to the input data.


3) Batch Normalization is applied.


4) The normalized output is passed through an activation function, producing the activations of the layer.
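The four steps above can be sketched in plain NumPy for a single layer. This is a training-time illustration only: the shapes, weights and `eps` value are assumptions, and the moving averages that a real batch normalization layer keeps for inference are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a mini-batch of 32 samples with 10 features (made-up data).
x = rng.normal(size=(32, 10))

# Step 2: linear transformation with hypothetical weights and biases.
W = rng.normal(size=(10, 4)) * 0.5
b = np.zeros(4)
z = x @ W + b

# Step 3: batch normalization, computed per feature over the batch.
eps = 1e-3
mean = z.mean(axis=0)
var = z.var(axis=0)
z_hat = (z - mean) / np.sqrt(var + eps)
gamma, beta = np.ones(4), np.zeros(4)  # learnable scale and shift (initial values)
bn_out = gamma * z_hat + beta

# Step 4: activation function (ReLU here).
a = np.maximum(bn_out, 0.0)

print(bn_out.mean(axis=0), bn_out.std(axis=0))
```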


**`Best Practice`**:-

**Apply Batch Normalization after Every Hidden Layer (or Almost All)**

**Avoid Batch Normalization in the Output Layer**

In [None]:
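import tensorflow as tf

# input_dim, output_dim, x_train, y_train, x_val, y_val, batch_size and num_epochs
# are assumed to be defined earlier in the notebook.
model = tf.keras.Sequential([
    # Hidden block 1: Dense -> BatchNormalization -> ReLU
    tf.keras.layers.Dense(128, input_shape=(input_dim,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    # Hidden block 2: Dense -> BatchNormalization -> ReLU
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    # Output layer: no batch normalization here
    tf.keras.layers.Dense(output_dim, activation='softmax')
])

# categorical_crossentropy expects one-hot encoded labels
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs, validation_data=(x_val, y_val))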

### Which optimizer is best in which case?

In the large majority of problem statements (roughly 90% of the time) we use the Adam optimizer. Still, use trial and error to check whether another optimizer works better.

### Q) How to specify the number of neurons in each layer?

**Input Layer**: The number of neurons is equal to the number of independent features in the dataset.


**Hidden Layer**: A common starting point is to give the first hidden layer as many neurons as the input layer, the second hidden layer half as many as the first, and so on. There is no rule that this pattern must always be used, so apply trial and error as well.


**Output Layer**: For regression and binary classification the output layer has one neuron. For multi-class classification (including image classification) the number of output neurons depends on the number of categories in the dependent variable.
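The halving heuristic and the output-layer rule can be written down as two small helpers. Both function names are hypothetical and the result is only a starting point for experimentation.

```python
def hidden_layer_sizes(n_features, n_hidden_layers, min_units=1):
    """First hidden layer matches the input size, each later layer has half as many units."""
    sizes = []
    units = n_features
    for _ in range(n_hidden_layers):
        sizes.append(max(units, min_units))
        units //= 2
    return sizes

def output_units(task, n_classes=None):
    """One unit for regression/binary classification, one unit per class otherwise."""
    if task in ("regression", "binary"):
        return 1
    if task == "multiclass":
        if n_classes is None:
            raise ValueError("n_classes is required for multi-class tasks")
        return n_classes
    raise ValueError(f"unknown task: {task}")

print(hidden_layer_sizes(64, 3))      # [64, 32, 16]
print(output_units("binary"))         # 1
print(output_units("multiclass", 10)) # 10
```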

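### Which loss function should be used to calculate the loss?

* **Regression**: MSE, MAE, Huber Loss


* **Binary Classification**: Binary Cross-Entropy


* **Multi-Class Classification**: Categorical Cross-Entropy (one-hot encoded labels)


* **Multi-Class Classification with integer labels (common for image datasets)**: Sparse Categorical Cross-Entropy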
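The categorical and sparse categorical variants compute the same quantity and differ only in how the labels are encoded (one-hot vectors versus integer class ids). A small NumPy sketch with made-up predictions shows this; the function names mirror the Keras loss names but are plain re-implementations written for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    """Mean cross-entropy for one-hot encoded labels."""
    probs = np.clip(probs, eps, 1.0)
    return float(np.mean(-np.sum(y_onehot * np.log(probs), axis=1)))

def sparse_categorical_crossentropy(y_int, probs, eps=1e-7):
    """Mean cross-entropy for integer class labels."""
    probs = np.clip(probs, eps, 1.0)
    return float(np.mean(-np.log(probs[np.arange(len(y_int)), y_int])))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])    # hypothetical softmax outputs
labels_int = np.array([0, 1])          # integer labels
labels_onehot = np.eye(3)[labels_int]  # the same labels, one-hot encoded

cce = categorical_crossentropy(labels_onehot, probs)
scce = sparse_categorical_crossentropy(labels_int, probs)
bce = binary_crossentropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
print(cce, scce, bce)
```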



### Good practice for the dropout rate


**Smaller layers**: Use a lower dropout rate, such as 0.1 or 0.2.


**Larger layers**: Use a higher dropout rate, such as 0.3 or 0.4.


**More complex tasks**: Use a higher dropout rate.


**Less training data**: Use a higher dropout rate.
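For intuition about what these values mean, here is a minimal sketch of inverted dropout in NumPy. The `dropout` helper is a hypothetical re-implementation, not the Keras layer: during training it zeroes a fraction of the activations and rescales the rest so the expected value stays the same.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the survivors."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                 # made-up layer activations
dropped = dropout(a, rate=0.3, rng=rng)  # roughly 30% of the values become 0

print("fraction zeroed:", (dropped == 0).mean())
print("mean after dropout:", dropped.mean())
```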

### `Notes`:

* When working with image data, we first scale the pixel values. This can be done manually (for example by dividing by 255) or with a library. For a plain ANN we then flatten each image matrix into a 1D array using NumPy's `.reshape` function and build the neural network on top of it.


* After the neural network is built, we evaluate its performance with the same metrics we use in ML.
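The scaling and flattening step might look like the following sketch. The random array stands in for a real image batch, and the 28x28 grayscale shape is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 5 grayscale images, 28x28 pixels, values 0..255.
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Scale pixel values to the range [0, 1].
scaled = images.astype("float32") / 255.0

# Flatten each image into a 1D vector so it can be fed to a Dense layer.
flat = scaled.reshape(len(scaled), -1)

print(flat.shape)  # (5, 784)
```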

# CNN

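* **Strides larger than 1 are mostly used in pooling layers. Convolutional layers usually keep the default stride of 1 so that they extract as many fine-grained features as possible, while pooling down-samples the feature maps and keeps only the strongest responses. Strided convolutions are also possible and are sometimes used in place of pooling.**


* **In a CNN, batch normalization is typically applied after convolutional layers. It is not applied to pooling layers, and it is optional in the fully connected layers.**


* **In a CNN, the loss is calculated with binary cross-entropy, categorical cross-entropy, or sparse categorical cross-entropy, depending on the task and the label encoding.**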

### Q) What is Batch Normalization in CNN and when to apply?

* The purpose of batch normalization in CNNs is to stabilize and speed up the training process by normalizing the inputs to a layer within each mini-batch (a subset of the data) during training. This helps the model converge faster (reach a low loss in fewer steps) and often improves overall performance.


* Batch normalization (BN) is a technique used to improve the training of deep neural networks. It is typically used after convolutional layers and before activation functions.


* Batch normalization works by normalizing the activations of a layer across a batch of inputs. This helps to stabilize the training process and prevents the activations from becoming too large or too small. BN can also help to improve the generalization performance of the model.
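For convolutional feature maps the statistics are computed per channel, over the batch and both spatial dimensions. A minimal NumPy sketch, assuming channels-last data of shape `(batch, height, width, channels)` and made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conv output: 8 images, 16x16 feature maps, 4 channels.
feature_maps = rng.normal(loc=3.0, scale=2.0, size=(8, 16, 16, 4))

# One mean and one variance per channel, taken over batch, height and width.
mean = feature_maps.mean(axis=(0, 1, 2), keepdims=True)
var = feature_maps.var(axis=(0, 1, 2), keepdims=True)

eps = 1e-3
normalized = (feature_maps - mean) / np.sqrt(var + eps)

print(mean.shape)                        # (1, 1, 1, 4)
print(normalized.mean(axis=(0, 1, 2)))   # close to 0 for every channel
print(normalized.std(axis=(0, 1, 2)))    # close to 1 for every channel
```

The learnable scale and shift and the moving averages used at inference time are omitted here to keep the sketch short.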

In [None]:
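from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), padding='valid', input_shape=(256, 256, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding="valid"))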

### Activation functions are applied after convolutional layers, not after pooling layers.

**Activation Function after Convolutional Layer**:
After applying convolution to the input data using a set of filters (kernels), the resulting feature maps go through an activation function element-wise. The activation function introduces non-linearity to the output of the convolutional layer, which allows the network to learn and model complex relationships in the data. Common activation functions used in CNNs include ReLU (Rectified Linear Unit), Leaky ReLU, ELU (Exponential Linear Unit), and others.

**No Activation Function after Pooling Layer**:
After applying pooling (e.g., MaxPooling or AveragePooling), there is no activation function applied directly to the pooled feature maps. Pooling layers are purely for down-sampling and spatial dimension reduction, and they do not introduce non-linearity.

The typical CNN architecture sequence involving convolutional layers, activation functions, and pooling layers is as follows:

**Input Data -> Convolutional Layer -> Activation Function -> Pooling Layer ->..**
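A tiny NumPy sketch of the last two stages of that sequence: ReLU is applied element-wise to a made-up 4x4 convolution output, and 2x2 max pooling with stride 2 then down-samples it without any further activation. The helpers are illustrative and assume even height and width.

```python
import numpy as np

def relu(x):
    """Element-wise non-linearity applied to the convolution output."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D map (height and width must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

conv_output = np.array([
    [ 1.0, -2.0,  3.0,  0.5],
    [-1.0,  4.0, -3.0, -0.5],
    [ 2.0, -6.0, -1.0, -2.0],
    [-4.0,  0.0, -7.0, -8.0],
])  # hypothetical feature map

activated = relu(conv_output)     # negative values become 0
pooled = max_pool_2x2(activated)  # 4x4 -> 2x2, no activation afterwards

print(pooled)
# [[4. 3.]
#  [2. 0.]]
```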

### Q) When to apply padding?

Padding is applied to the input data before performing convolutional operations in a Convolutional Neural Network (CNN). The primary purpose of padding is to maintain the spatial dimensions of the input and output feature maps during convolution.


Padding is usually not applied in pooling layers (they keep `padding='valid'`), because the purpose of pooling is to down-sample the feature maps and keep only the most prominent features.
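The effect of padding on the spatial size can be checked with a small helper. The `output_size` function is a hypothetical name; it follows the usual formulas for `'same'` and `'valid'` padding along one spatial dimension.

```python
import math

def output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Spatial output size of a conv or pooling layer along one dimension."""
    if padding == "same":
        # Input is padded so that the output only shrinks by the stride.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding}")

size = 256
same_conv = output_size(size, 3, stride=1, padding="same")     # 256, size preserved
valid_conv = output_size(size, 3, stride=1, padding="valid")   # 254, border is lost
pooled = output_size(same_conv, 2, stride=2, padding="valid")  # 128, halved by pooling

print(same_conv, valid_conv, pooled)
```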

### Dropout rates

**You can use dropout after convolutional layers and fully connected layers in CNNs. It is not typically applied to pooling layers.**

**Good practice in convolutional and pooling layers**

In **pooling layers**, dropout is not typically used. Pooling layers have no trainable weights and only reduce the size of the feature maps, so there is little for dropout to regularize there. Dropout is, however, sometimes placed right after a pooling layer to regularize the convolutional block that precedes it.


**Here are some guidelines for using dropout in CNNs:**


**Convolutional layers**: Dropout can be used after convolutional layers to prevent overfitting. A good starting point is a dropout rate of 0.2 or 0.3.


**Pooling layers**: Dropout is not typically used in pooling layers.


**Fully connected layers**: Dropout can be used in fully connected layers to prevent overfitting. A good starting point is a dropout rate of 0.5 or 0.6.


### Which optimizer, loss function and metric to use in a CNN?

* Use binary cross-entropy loss for binary classification tasks and categorical cross-entropy for multi-class classification tasks.


* Use the Adam optimizer, which is a good choice for most deep learning tasks.


* Use the accuracy metric, which is a good choice for classification tasks with reasonably balanced classes.

### Perfect Code

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization, Activation

model = Sequential()

model.add(Conv2D(256, kernel_size=(3, 3), padding='same', input_shape=(256, 256, 3))) # for RGB Image
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Conv2D(128, kernel_size=(3, 3), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Conv2D(64, kernel_size=(3, 3), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Conv2D(32, kernel_size=(3, 3), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))


model.add(Flatten())


model.add(Dense(256, activation='relu'))
model.add(Dropout(0.3))


model.add(Dense(128, activation='relu'))
model.add(Dropout(0.25))


model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))


model.add(Dense(32, activation='relu'))
model.add(Dropout(0.1))


model.add(Dense(1, activation='sigmoid'))


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


model.summary()

**Create the early stopping callback object here, then continue with the code below.**
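A minimal sketch of such an early stopping object, using the Keras `EarlyStopping` callback. The monitored metric and the patience value are assumptions to be tuned for your problem.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 5 epochs
# (assumed value) and roll back to the best weights seen so far.
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)
```

To take effect, the callback has to be passed to training, for example `model.fit(..., callbacks=[early_stopping])`.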

In [None]:
# Fit the model to the training data and validate on the validation data
history = model.fit(train_ds, epochs=20, validation_data=validation_ds)



### Q) How to choose an appropriate number of filters for each convolutional layer?


A common practice is to start with a small number of filters in the first layer and gradually increase the number of filters in subsequent layers. For example, in a basic CNN architecture, the number of filters might be set as follows:

**`First Convolutional Layer`**: Fewer filters, e.g., 32 or 64.


**`Intermediate Convolutional Layers`**: Increasing number of filters, e.g., 128, 256, etc.


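**`Final Convolutional Layer`**: In architectures that replace the dense layers with global average pooling, the number of filters can be set to the number of classes in the classification task. Otherwise the final convolutional layer usually continues the increasing pattern, and the dense output layer has one unit per class.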


The choice of the number of filters can also depend on the available computational resources and the size of the dataset. Larger numbers of filters increase the model's capacity to learn intricate patterns but also increase the computational cost and the risk of overfitting if the dataset is small.
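To see how the filter count drives model size, the number of trainable parameters of a standard 2D convolution can be computed directly. The helper name is hypothetical; the formula is kernel height x kernel width x input channels x filters, plus one bias per filter.

```python
def conv2d_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Trainable parameters of a standard 2D convolution layer."""
    kh, kw = kernel_size
    weights = kh * kw * in_channels * filters
    biases = filters if use_bias else 0
    return weights + biases

# A 3x3 stack on an RGB input with 32 -> 64 -> 128 filters.
print(conv2d_param_count((3, 3), 3, 32))    # 896
print(conv2d_param_count((3, 3), 32, 64))   # 18496
print(conv2d_param_count((3, 3), 64, 128))  # 73856
```

Doubling the filters in two consecutive layers roughly quadruples the parameters of the second one, which is why wide layers are costly and easier to overfit on small datasets.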


### Q) How to specify correct input shape of images while creating dataset?


The choice of the image size (image_size) when creating a dataset of images depends on various factors, including the characteristics of the dataset, the available computational resources, and the requirements of the specific deep learning model or task you are working on.


As a starting point, image sizes such as (256, 256) or (224, 224) are commonly used in various computer vision tasks. These sizes are often chosen because they strike a balance between maintaining relevant details in the images and computational efficiency.


### Q) What will happen if we increase the number of convolutional layers?


Adding more convolutional layers makes the model more complex and harder to train, but it can also improve the model's performance by extracting richer features. It does not reduce overfitting by itself.


**`Increased complexity`**: Adding more convolutional layers increases the complexity of the CNN model. This can make the model more difficult and slower to train, but it can also improve the model's performance.


**`Improved feature extraction`**: More convolutional layers can extract more features from the input data, with deeper layers combining simple patterns into more abstract ones. This can improve the model's ability to recognize patterns in the data and make more accurate predictions.


**`Overfitting risk`**: More layers mean more parameters, so the model can memorize the training data more easily, especially when the dataset is small. Deeper models generalize well only when there is enough data or enough regularization, such as dropout, batch normalization, data augmentation, or early stopping.

### `'mini-batch'` parameter

The size of the mini-batch (the `batch_size` argument in Keras) is a hyperparameter that can be tuned to improve the performance of the model. Smaller mini-batches give noisier gradient updates, which can act as a mild regularizer and help against overfitting, but each epoch needs more update steps and is usually slower. Larger mini-batches speed up each epoch, but they need more memory and may generalize slightly worse.

The optimal mini-batch size depends on the specific problem and the dataset. However, a good starting point is a mini-batch size of 32 to 128.
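The speed side of this trade-off is easy to quantify: the batch size determines how many weight updates happen per epoch. A small sketch with a hypothetical dataset of 10,000 samples:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates per epoch (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_samples = 10_000  # assumed dataset size
for batch_size in (32, 64, 128):
    print(batch_size, steps_per_epoch(n_samples, batch_size))
# 32 313
# 64 157
# 128 79
```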