Q1.  **Explain the Activation Functions in your own language**

1.  **sigmoid**

2.  **tanh**

3.  **ReLU**

4.  **ELU**

5.  **LeakyReLU**

6.  **Swish**

> **a. Sigmoid:**
>
> The sigmoid activation function is a mathematical function that
> squashes the input value between 0 and 1. It has an S-shaped curve and
> is commonly used in binary classification problems. The sigmoid
> function maps any real number to a value between 0 and 1, making it
> useful for producing probabilities. For large positive inputs the
> sigmoid function approaches 1, and for large negative inputs it
> approaches 0.
>
> **b. Tanh:**
>
> The hyperbolic tangent (tanh) activation function is similar to the
> sigmoid function but maps the input values between -1 and 1. It has an
> S-shaped curve like the sigmoid function but is symmetric around the
> origin. The tanh function is useful in models where negative values
> are significant or when we need a stronger gradient than the sigmoid
> function provides.
>
> **c. ReLU:**
>
> The rectified linear unit (ReLU) activation function is a piecewise
> linear function that outputs the input directly if it is positive, and
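> 0 otherwise. It is one of the most widely used activation functions in
> deep learning models. ReLU provides a simple way to introduce
> non-linearity to the model and helps in learning complex patterns. It
> is computationally efficient and, because its gradient is 1 for
> positive inputs, it mitigates the vanishing gradient problem that
> affects sigmoid and tanh. Neurons whose inputs stay negative receive
> zero gradient, however, and can stop learning.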
>
> **d. ELU:**
>
> The exponential linear unit (ELU) activation function matches ReLU
> for positive inputs but, for negative inputs, follows the smooth curve
> α(e^x − 1), which levels off at −α (α is usually 1). It is designed to
> alleviate the dying ReLU problem, where some neurons in a network
> become inactive and stop learning. Because ELU outputs negative
> values, mean activations stay closer to zero, which can speed up
> learning and lead to better generalization.
>
> **e. LeakyReLU:**
>
> The LeakyReLU activation function is a variation of the ReLU function
> that addresses the "dying ReLU" problem. It introduces a small slope
> for negative input values, allowing the gradient to flow even when the
> neuron is not active. This prevents the neuron from completely dying
> out during training and helps in better learning and convergence.
>
> **f. Swish:**
>
> The Swish activation function is a smooth, non-monotonic function
> that multiplies the input by its own sigmoid: swish(x) = x ·
> sigmoid(βx), where β is a constant or a learnable parameter. For large
> positive inputs it behaves like ReLU, for large negative inputs it
> approaches 0, and in between it dips slightly below zero. Swish has
> been found to perform well in deep neural networks, where its
> smoothness can help gradient flow and let the model capture more
> complex patterns in the data.
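
The six functions described above can be written in a few lines each. The following is a minimal sketch in plain Python for scalar inputs. The parameter names `alpha`, `negative_slope`, and `beta` and their default values are common conventions assumed here for illustration, not values taken from the question.

```python
import math

def sigmoid(x):
    """Squash x into (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    """Squash x into (-1, 1); the output is centred on zero."""
    return math.tanh(x)

def relu(x):
    """Pass positive inputs through and clamp everything else to 0."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """Identity for x > 0; alpha * (e^x - 1) for x <= 0 (assumed alpha = 1)."""
    return x if x > 0 else alpha * math.expm1(x)

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs keep a small slope (assumed 0.01)."""
    return x if x > 0 else negative_slope * x

def swish(x, beta=1.0):
    """x * sigmoid(beta * x) (assumed beta = 1)."""
    return x * sigmoid(beta * x)

for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("elu", elu), ("leaky_relu", leaky_relu), ("swish", swish)]:
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
```

Printing each function at a negative, zero, and positive input makes the differences on the negative side easy to compare.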

Q2.  **What happens when you increase or decrease the optimizer learning
    rate?**

The learning rate sets the size of each step the optimizer takes when it
updates the model parameters during training. Here's what happens when
you increase or decrease it:

**Increase in learning rate:**

**1. Faster convergence:** A higher learning rate can lead to faster
convergence since the model parameters are updated more aggressively. It
means that the model learns from the data more quickly and reaches a
good solution in fewer iterations.

**2. Risk of overshooting:** However, a very high learning rate can
cause the optimizer to overshoot the optimal solution. In this case, the
parameter updates may be too large, leading to unstable training and the
model failing to converge.

**3. Skipping local minima:** With a higher learning rate, the optimizer
is more likely to jump out of local minima, which can be beneficial if
the current local minimum is not the global minimum. This can help the
model escape from suboptimal solutions and explore a wider parameter
space.

**Decrease in learning rate:**

**1. Smoother convergence:** A lower learning rate leads to smoother
convergence as the updates to the model parameters are smaller. It
allows the model to fine-tune its performance gradually and can help in
reaching a more precise solution.

**2. Increased training time:** Since smaller updates are made to the
parameters, it takes more iterations for the model to converge.
Consequently, reducing the learning rate increases the training time, as
the model needs more epochs to reach a satisfactory performance level.

**3. Improved stability:** A lower learning rate can make the training
process more stable, reducing the risk of overshooting or oscillating
around the optimal solution. The trade-off is that very small steps make
it harder to escape a poor local minimum or plateau, so the model may
settle in a suboptimal region.

Finding the appropriate learning rate is crucial in training neural
networks. It often requires experimentation and tuning to strike the
right balance between convergence speed, stability, and performance.
Techniques such as learning rate schedules (including annealing) and
adaptive optimizers (e.g., Adam) can be used to adjust the learning rate
during training.
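
The three regimes described above (too small, reasonable, too large) can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x², whose minimum is at 0. The function, the starting point, and the three learning rates are illustrative assumptions chosen so that each regime shows up clearly.

```python
def gradient_descent(lr, x0=5.0, steps=50):
    """Minimise f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # the step size is proportional to the learning rate
    return x

# Small, moderate, and overly large learning rates on the same problem.
for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: final x = {gradient_descent(lr):.4g}")
```

With the small rate the iterate is still far from 0 after 50 steps, with the moderate rate it is essentially at 0, and with the large rate every step overshoots and the iterate moves further away.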

Q3.  **What happens when you increase the number of internal hidden
    neurons?**

Increasing the number of internal hidden neurons in a neural network can
have several effects on the model's performance and behavior. **Here's
what generally happens when you increase the number of hidden neurons:**

**1. Increased model capacity:** Adding more hidden neurons increases
the model's capacity to learn complex patterns and representations from
the data. The neural network becomes capable of capturing more intricate
relationships and can potentially improve its ability to generalize to
unseen examples.

**2. Potential overfitting:** Increasing the number of hidden neurons
without appropriate regularization techniques can lead to overfitting.
Overfitting occurs when the model becomes too complex and starts to
memorize the training data instead of learning generalizable patterns.
This can result in poor performance on new, unseen data.

**3. Longer training time:** With more hidden neurons, the model becomes
larger and more computationally demanding to train. Training a neural
network with a higher number of hidden neurons may require more
computational resources and time to converge.

**4. Improved learning capacity:** Increasing the number of hidden
neurons can enhance the model's learning capacity, enabling it to better
fit the training data. The network becomes more flexible in representing
complex relationships and can potentially achieve higher training
accuracy.

**5. Higher risk of overparameterization:** Adding more hidden neurons
increases the number of parameters in the model. When the parameters
greatly outnumber the training examples, the model has many degrees of
freedom and can fit the training data in many different ways, so
regularization and validation become more important for selecting a
solution that generalizes.

**6. Potential vanishing or exploding gradients:** These problems are
driven mainly by network depth, but layer width matters too. If the
weight initialization is not scaled to the number of neurons feeding
each layer (as Xavier/Glorot or He initialization does), wider layers
can produce activations and gradients that are too small or too large,
impeding effective training.

It's important to strike a balance when determining the number of hidden
neurons in a neural network. It often requires experimentation and model
validation to find the optimal architecture that provides good
generalization while avoiding overfitting or other issues associated
with excessive complexity. Regularization techniques such as dropout,
weight decay, or early stopping can be employed to mitigate the risk of
overfitting when increasing the number of hidden neurons.
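
To make the growth in model size concrete, the sketch below counts the weights and biases of a fully connected network. The 784-input, 10-output shape and the hidden sizes are assumptions picked for illustration.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the units per layer, e.g. [inputs, hidden, outputs].
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix between the two layers
        total += fan_out           # one bias per unit in the next layer
    return total

# Same inputs and outputs, increasingly wide hidden layer.
for hidden in (32, 128, 512):
    print(hidden, count_parameters([784, hidden, 10]))
```

For a single hidden layer the parameter count grows linearly with the number of hidden neurons, which is why both capacity and training cost rise together.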

Q4.  **What happens when you increase the size of batch computation?**

Increasing the size of the batch computation, also known as batch size,
in the context of training a neural network can have several
implications. **Here's what generally happens when you increase the size
of the batch computation:**

**1. Faster epochs, not necessarily faster convergence:** With a larger
batch size, more training samples are processed in parallel before each
parameter update, so there are fewer updates per epoch and each epoch
usually finishes sooner on parallel hardware. Whether the model
converges sooner overall depends on the learning rate: with fewer
updates, a larger batch often needs a higher learning rate or more
epochs to reach the same loss.

**2. More memory usage:** Larger batch sizes require more memory to
store the batch data and intermediate results during forward and
backward propagation. If the available memory is limited, increasing the
batch size beyond a certain point may lead to out-of-memory errors or
significantly slow down the training process.

**3. Smoother gradient estimates:** A larger batch size provides a more
accurate estimation of the true gradient because it incorporates
information from a larger number of samples. This can lead to more
stable and consistent updates to the model parameters, although lower
gradient noise does not by itself mean better generalization (see the
next point).

**4. Possible degradation of generalization:** While increasing the
batch size can speed up each epoch, very large batches can generalize
worse. Their gradient estimates are less noisy, and that noise is
thought to help small-batch training settle in flat minima; large-batch
training tends to converge to sharp minima that do not generalize as
well to unseen data. Adjusting the learning rate along with the batch
size, or adding regularization such as dropout or weight decay, may be
needed to mitigate this risk.

**5. Impact on learning dynamics:** The batch size can affect the
learning dynamics of the model. Smaller batch sizes introduce more
stochasticity into the parameter updates, which can help the model
escape from poor local minima and explore a wider range of solutions.
Larger batch sizes, on the other hand, exhibit more deterministic
behavior due to the reduced noise in the gradient estimates.

**6. Computational efficiency:** Increasing the batch size can improve
computational efficiency, especially on hardware architectures optimized
for parallel processing, such as GPUs. Utilizing larger batch sizes
allows for better utilization of hardware resources, potentially
speeding up the training process.

Choosing the appropriate batch size is a trade-off between convergence
speed, memory constraints, and generalization performance. It often
requires experimentation and consideration of the specific
characteristics of the dataset and the computational resources
available. Different batch sizes may work better for different problems
and architectures, and it is common to perform hyperparameter tuning to
find the optimal batch size for a given scenario.
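
Two of the effects above are easy to quantify: the number of updates per epoch and the noise in the gradient estimate. In the sketch below, per-sample gradients are simulated as Gaussian noise around a true value of 1. The distribution, the dataset size of 50,000, and the batch sizes are assumptions for illustration, not properties of any real model.

```python
import math
import random
import statistics

def updates_per_epoch(num_samples, batch_size):
    """Parameter updates in one pass over the data (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)

def gradient_estimate_std(batch_size, trials=500, seed=0):
    """Spread of the mini-batch mean of simulated per-sample gradients.

    Each per-sample gradient is drawn from Gaussian(mean=1, std=2),
    an assumed stand-in for real gradient noise.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        batch = [rng.gauss(1.0, 2.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)

for batch_size in (8, 32, 128):
    print(batch_size,
          updates_per_epoch(50_000, batch_size),
          round(gradient_estimate_std(batch_size), 3))
```

Larger batches give fewer updates per epoch and a tighter gradient estimate; the spread shrinks roughly with the square root of the batch size.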

Q5.  **Why do we adopt regularization to avoid overfitting?**

Regularization techniques are adopted in machine learning, including
deep learning, to combat overfitting. Overfitting occurs when a model
becomes too complex and starts to memorize the training data, leading to
poor generalization performance on unseen data. Regularization helps to
prevent or reduce overfitting by adding additional constraints to the
model during training. Here are the key reasons why regularization is
used to avoid overfitting:

**1. Complexity control:** Regularization techniques introduce
constraints that control the complexity of the model. By limiting the
complexity, the model is less likely to memorize noise or irrelevant
patterns in the training data, focusing on the essential features that
generalize well to unseen examples.

**2. Parameter shrinkage:** Penalty-based methods such as L1 and L2
regularization encourage the model parameters to take smaller values.
This helps to prevent large parameter magnitudes that can lead to
overfitting. By shrinking the parameter values, the model becomes more
robust and less sensitive to noise in the training data.

**3. Occam's razor principle:** Regularization aligns with the principle
of Occam's razor, which states that simpler explanations are generally
more likely to be correct. Regularization encourages the model to favor
simpler explanations or hypotheses by penalizing complex models that fit
the training data too closely. This principle helps to avoid overfitting
by promoting models that balance complexity and generalization.

**4. Noise reduction:** Regularization techniques can reduce the impact
of noisy or irrelevant features in the training data. By applying
regularization, the model is encouraged to focus on the most informative
features, filtering out the noise and reducing the chances of
overfitting to specific instances or idiosyncrasies in the training
data.

**5. Improved generalization:** Regularization helps in improving the
generalization performance of the model by reducing overfitting. By
controlling the model's complexity and encouraging simpler solutions,
regularization allows the model to capture the underlying patterns and
relationships that hold true across different examples, leading to
better performance on unseen data.

Common regularization techniques in deep learning include L1 and L2
regularization (the latter is often called weight decay), dropout, and
early stopping. Batch normalization is primarily a normalization
technique, but it often has a mild regularizing effect as well. These
techniques can be used individually or in combination to regularize the
model and strike a balance between fitting the training data and
generalizing to new data.
Regularization is an essential tool in the machine learning
practitioner's toolbox to avoid overfitting and improve the overall
performance and robustness of the models.
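
Parameter shrinkage can be shown directly. The sketch below adds an L2 penalty to a loss and applies SGD steps. The weights, the learning rate, and the penalty strength `lam` are hypothetical values, and the data gradient is set to zero so that only the effect of the penalty is visible.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def sgd_step(weights, data_grads, lr, lam):
    """One SGD step on (data loss + L2 penalty).

    The penalty contributes 2 * lam * w to each gradient, which pulls every
    weight towards zero in addition to the data-driven update.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, data_grads)]

weights = [3.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]   # assume the data loss is flat at this point
for _ in range(100):
    weights = sgd_step(weights, zero_grads, lr=0.1, lam=0.05)
print([round(w, 4) for w in weights])
```

Each step multiplies every weight by the same factor slightly below 1, so large weights survive only if the data gradient keeps pushing them back up.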

Q6.  **What are loss and cost functions in deep learning?**

In deep learning, loss and cost functions are mathematical functions
that quantify the discrepancy between the predicted outputs of a neural
network and the true labels or targets associated with the training
data. The terms "loss function" and "cost function" are often used
interchangeably, although there can be slight differences in their usage
depending on the context. Here's a brief explanation of each:

**Loss Function:**

A loss function, also known as an error function, measures the
inconsistency between the predicted output of a model and the true label
for a single training example. It calculates a scalar value that
represents the discrepancy or error for that specific example. The loss
function takes the predicted output (for classification, typically
logits or probabilities) and the corresponding true label as input.

The choice of a specific loss function depends on the nature of the
problem being addressed. For example, binary classification tasks
commonly use binary cross-entropy (also called sigmoid cross-entropy).
Multiclass classification problems often use categorical cross-entropy
(also called softmax cross-entropy). In regression tasks, mean squared
error (MSE) or mean absolute error (MAE) are frequently employed as loss
functions.

**Cost Function:**

The cost function, also known as the objective or the average loss
function, is a measure of the overall discrepancy or error between the
predicted outputs and the true labels for the entire training dataset.
It is obtained by taking the average (or sum) of the individual loss
values calculated for each training example. The cost function is used
to evaluate the performance of a model and guide the learning process
during training.

The cost function provides a single scalar value that quantifies the
overall performance of the model on the training data. The goal of the
learning process is to minimize this cost or error by adjusting the
model's parameters (weights and biases) through optimization algorithms
like gradient descent or its variants. The model iteratively updates its
parameters to find the values that minimize the cost function, leading
to improved predictions and better generalization on unseen data.

It's important to note that the choice of loss and cost functions
depends on the specific task, the nature of the output (e.g., binary
classification, multiclass classification, regression), and the desired
properties of the model. Different loss functions have different
characteristics and can influence the learning process and the behavior
of the trained model.
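
The distinction between a per-example loss and a dataset-level cost can be sketched as follows. The function names and the sample predictions are made up for illustration, and the mean is used as the aggregate (a sum would work equally well).

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Loss for ONE example: p is the predicted probability, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def squared_error(prediction, target):
    """Loss for ONE regression example."""
    return (prediction - target) ** 2

def cost(loss_fn, predictions, targets):
    """Cost: the mean of the per-example losses over a dataset or batch."""
    losses = [loss_fn(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# Classification: mean binary cross-entropy over three examples.
print(cost(binary_cross_entropy, [0.9, 0.2, 0.6], [1, 0, 1]))
# Regression: the mean of squared errors is the familiar MSE.
print(cost(squared_error, [2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))
```

Keeping the loss and the aggregation separate makes it clear that the optimizer minimizes the cost, while the loss defines what counts as an error on each example.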

Q7.  **What do you mean by underfitting in neural networks?**

Underfitting in neural networks refers to a situation where the model
fails to capture the underlying patterns and relationships present in
the training data. It occurs when the model is too simple or lacks the
capacity to represent the complexity of the data, resulting in poor
performance on both the training data and unseen data.

**When a neural network underfits the data, it exhibits high bias or a
high training error. Here are some characteristics and indicators of
underfitting:**

**1. Insufficient complexity:** The neural network may have insufficient
capacity or architectural limitations to model the complexity of the
data. It fails to capture important patterns, dependencies, or nuances
present in the training data.

**2. High training error:** The model's performance on the training data
is poor, indicated by high training error or low accuracy. The model
struggles to fit the training examples and does not capture the
underlying relationships effectively.

**3. Poor generalization:** Underfitting often leads to poor
generalization, meaning that the model's performance on unseen data
(validation or test data) is also subpar. It fails to capture the
essential features and patterns that hold true across the entire
dataset.

**4. Oversimplified decision boundaries:** The underfitted model may
result in oversimplified decision boundaries that do not properly
separate different classes or capture the intricacies of the data
distribution. This can lead to misclassifications and poor predictive
performance.

**5. Low variance, high bias:** Underfitting is associated with a high
bias and low variance. The model is biased toward a certain set of
assumptions or oversimplified representations, which prevents it from
adequately adapting to the data and learning more complex patterns.

Addressing underfitting typically requires increasing the model's
complexity or capacity. This can be done by adding more layers,
increasing the number of neurons, or adjusting other architectural
aspects. Training for longer, changing the activation functions, or
modifying the optimization algorithm can also help. Regularization (such
as dropout or weight decay) works in the opposite direction: it
constrains the model, so reducing it, rather than adding more, is what
helps an underfitting model capture more complex relationships and
reduce bias.

Finding the right balance between model complexity and data complexity
is crucial. It is important to monitor the model's performance, diagnose
underfitting, and make appropriate adjustments to improve the model's
ability to learn and generalize effectively.
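
Underfitting is easiest to see when a model is too simple for data that has no noise at all. The sketch below fits a straight line to points from y = x², then fits the same linear model on the feature x² instead. The data and the helper names are assumptions for illustration, and ordinary least squares stands in for network training.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(-10, 11)]   # 21 points in [-1, 1]
ys = [x * x for x in xs]                # the true relationship is quadratic

# Too simple: a straight line in x cannot bend to follow x**2.
linear_mse = mse(xs, ys, *fit_line(xs, ys))

# More capacity: give the model the feature x**2 to work with.
squares = [x * x for x in xs]
quadratic_mse = mse(squares, ys, *fit_line(squares, ys))

print(f"linear training MSE:    {linear_mse:.4f}")
print(f"quadratic training MSE: {quadratic_mse:.2e}")
```

The straight line has a high error on the training data itself, which is the signature of underfitting; the error disappears once the model can represent the curve.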

Q8.  **Why do we use Dropout in Neural Networks?**

Dropout is a regularization technique commonly used in neural networks
to prevent overfitting and improve generalization performance. It works
by randomly dropping out (i.e., temporarily removing) a proportion of
neurons during the training phase. **Here are the main reasons why
dropout is used in neural networks:**

**1. Reducing overfitting:** Dropout helps to reduce overfitting by
introducing noise and randomness during training. By dropping out
neurons, the network becomes less reliant on specific neurons and
prevents the co-adaptation of neurons, forcing the network to learn more
robust and generalized representations.

**2. Ensemble learning effect:** Dropout can be seen as training
multiple different neural networks simultaneously, as each dropout
configuration creates a unique subnetwork. During training, different
subsets of neurons are dropped out, leading to different paths and
interactions within the network. This creates an ensemble of networks
that work together to make predictions, effectively reducing the risk of
overfitting and improving generalization.

**3. Regularizing complex models:** Dropout is particularly useful when
training deep neural networks with many layers and a large number of
parameters. Deep models tend to have a higher risk of overfitting due to
their increased capacity to memorize the training data. Dropout helps
regularize the model and prevents it from becoming overly complex and
overfitting the data.

**4. Handling co-adaptation:** Neural networks have a tendency to
develop co-adaptations between neurons during training, where specific
neurons rely heavily on the presence of other neurons for effective
functioning. This can lead to a fragile network that is overly sensitive
to slight changes or variations in the input. Dropout breaks up these
co-adaptations and encourages neurons to be more self-reliant, improving
the overall robustness of the network.

**5. Effect on optimization:** Dropout does not fix vanishing gradients;
its benefit is regularization. The noise it injects into each update
usually makes training converge more slowly, so networks with dropout
often need more epochs, but that same noise can help the optimizer
explore the parameter space rather than settling early in a poor local
minimum.

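It's important to note that dropout is applied only during the training
phase and is turned off during inference or evaluation. During
inference, the full network with all neurons is used to make
predictions. To keep the expected size of the activations the same in
both phases, most frameworks scale the surviving activations up during
training.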

By incorporating dropout in the training process, neural networks can
become more resilient to overfitting, generalize better to unseen data,
and enhance the overall performance and robustness of the models.
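
The training and inference behaviour can be sketched with "inverted" dropout, the variant most frameworks use. The function below works on a plain list of activations. Its name, its signature, and the drop probability of 0.5 are assumptions for illustration rather than any library's API.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout on a list of activations.

    p is the probability of dropping each unit. Survivors are scaled by
    1 / (1 - p) during training so the expected activation is unchanged,
    which lets inference use the activations as they are.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)
    return [a * keep_scale if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the example is repeatable
hidden = [1.0] * 10
print(dropout(hidden, p=0.5, training=True, rng=rng))   # each entry is 0.0 or 2.0
print(dropout(hidden, p=0.5, training=False, rng=rng))  # unchanged at inference
```

Because a different random subset is dropped on every call during training, no neuron can rely on a particular neighbour being present.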