1. **Explain the Activation Functions in your own words:**
   - **Sigmoid**: The sigmoid function squashes input values to a range between 0 and 1. It's common in the output layer of binary classifiers because its output can be read as a probability. However, it saturates for inputs of large magnitude, where its gradient approaches zero; this vanishing-gradient effect makes deep networks difficult to train.
     $$\sigma(x) = \frac{1}{1 + e^{-x}}$$

   - **Tanh**: The tanh function squashes input values to a range between -1 and 1. It's zero-centered, which can help with convergence during training. Like the sigmoid, it can also suffer from vanishing gradients.
     $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

   - **ReLU (Rectified Linear Unit)**: ReLU outputs the input directly if it is positive; otherwise, it outputs zero. It's simple and effective, and because its gradient is 1 for positive inputs it helps to mitigate the vanishing gradient problem. However, it can suffer from the "dying ReLU" problem: the gradient is zero for negative inputs, so a neuron whose inputs are always negative stops updating and stays stuck outputting zero.
     $$\text{ReLU}(x) = \max(0, x)$$

   - **ELU (Exponential Linear Unit)**: ELU outputs the input if it is positive, and a scaled exponential if it is zero or negative. The constant α > 0 sets the value (-α) that large negative inputs saturate toward. Allowing negative outputs helps to keep the mean activations closer to zero and can improve learning dynamics.
     $$\text{ELU}(x) = \begin{cases}
     x & \text{if } x > 0 \\
     \alpha (e^x - 1) & \text{if } x \leq 0
     \end{cases}$$

   - **Leaky ReLU**: Leaky ReLU is a variant of ReLU that keeps a small, non-zero slope α (a small positive constant, commonly 0.01) when the input is negative. Because the gradient never becomes exactly zero, this helps to mitigate the dying ReLU problem.
     $$\text{Leaky ReLU}(x) = \begin{cases}
     x & \text{if } x > 0 \\
     \alpha x & \text{if } x \leq 0
     \end{cases}$$

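   - **Swish**: Swish is a newer activation function that is smooth and non-monotonic: it dips slightly below zero for negative inputs before rising. It has been reported to match or outperform ReLU in some deeper networks, though the gain depends on the task and architecture.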
     $$\text{Swish}(x) = x \cdot \sigma(x) = x \cdot \frac{1}{1 + e^{-x}}$$
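
As a rough illustration, the six formulas above can be written as plain Python functions. This is a minimal sketch using only the standard library; the default `alpha` values (1.0 for ELU, 0.01 for Leaky ReLU) are common choices assumed here, not values fixed by the definitions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, branched so math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z is in (0, 1)
    return z / (1.0 + z)


def tanh(x: float) -> float:
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(x)


def relu(x: float) -> float:
    """max(0, x)."""
    return max(0.0, x)


def elu(x: float, alpha: float = 1.0) -> float:
    """x for x > 0, alpha * (e^x - 1) otherwise (alpha is an assumed default)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """x for x > 0, alpha * x otherwise (alpha is an assumed default)."""
    return x if x > 0 else alpha * x


def swish(x: float) -> float:
    """x * sigmoid(x)."""
    return x * sigmoid(x)


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        print(x, sigmoid(x), tanh(x), relu(x), elu(x), leaky_relu(x), swish(x))
```

Evaluating `swish` at a few negative inputs shows the non-monotonic dip: `swish(-1.0)` is lower than both `swish(-5.0)` and `swish(0.0)`.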

2. **What happens when you increase or decrease the optimizer learning rate?**
   - **Increase**: Increasing the learning rate can speed up training, but if it's too high, it can cause the model to overshoot the optimal solution, leading to divergence or instability.
   - **Decrease**: Decreasing the learning rate usually makes training more stable and lets the model settle closer to a minimum, but if it's too low, training becomes very slow and may stall in flat regions or poor local minima before reaching a good solution.
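
A toy example may make both effects concrete. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w², whose minimum is at w = 0. The function, the starting point `w0 = 5.0`, and the three learning rates are all assumptions chosen for illustration.

```python
def gradient_descent(lr: float, steps: int = 50, w0: float = 5.0) -> float:
    """Minimize f(w) = w**2 with fixed-step gradient descent; return final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w   # derivative of w**2
        w -= lr * grad   # each step multiplies w by (1 - 2 * lr)
    return w


if __name__ == "__main__":
    for lr in (0.01, 0.4, 1.1):
        print(f"lr={lr}: final w = {gradient_descent(lr):.3e}")
```

With these settings, `lr=0.01` is still far from zero after 50 steps (too slow), `lr=0.4` lands essentially at zero, and `lr=1.1` grows in magnitude at every step because |1 - 2·lr| > 1 (divergence).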

3. **What happens when you increase the number of internal hidden neurons?**
   - Increasing the number of hidden neurons can allow the network to learn more complex patterns and improve performance. However, it also increases the risk of overfitting, especially if the training data is limited. It also increases computational cost and memory usage.
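
The growth in cost and memory can be estimated by counting parameters. The sketch below assumes a fully connected network with a single hidden layer; the input size of 784 and output size of 10 are hypothetical values picked for the example.

```python
def mlp_parameter_count(n_inputs: int, n_hidden: int, n_outputs: int) -> int:
    """Weights plus biases for one hidden layer and one output layer."""
    hidden_layer = n_inputs * n_hidden + n_hidden    # weights + biases
    output_layer = n_hidden * n_outputs + n_outputs  # weights + biases
    return hidden_layer + output_layer


if __name__ == "__main__":
    for n_hidden in (64, 128, 256):
        print(n_hidden, mlp_parameter_count(784, n_hidden, 10))
```

Doubling the hidden width from 64 to 128 roughly doubles the parameter count here (50,890 to 101,770), which is the extra capacity that can fit more complex patterns and also overfit a small dataset.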

4. **What happens when you increase the size of batch computation?**
   - Increasing the batch size gives less noisy gradient estimates and often shortens each epoch because the hardware is used more efficiently. However, memory use grows with the batch size, and very large batches may lead to poorer generalization.
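
The "more stable gradient estimates" effect can be simulated without a real network. In the sketch below, per-example gradients are faked as random numbers scattered around a true gradient of 1.0 (an assumption for illustration), and the function measures how much the mini-batch average varies from batch to batch.

```python
import random
import statistics


def batch_gradient_spread(batch_size: int, trials: int = 500, seed: int = 0) -> float:
    """Standard deviation of the mini-batch mean gradient across random batches."""
    rng = random.Random(seed)
    # Hypothetical per-example gradients: true value 1.0 plus unit Gaussian noise.
    per_example = [1.0 + rng.gauss(0.0, 1.0) for _ in range(10_000)]
    estimates = [
        statistics.fmean(rng.sample(per_example, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    for batch_size in (1, 16, 256):
        print(batch_size, round(batch_gradient_spread(batch_size), 3))
```

The spread shrinks roughly in proportion to the square root of the batch size, so going from 16 to 256 examples cuts the noise by about a factor of four while costing sixteen times the computation per step.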

5. **Why do we adopt regularization to avoid overfitting?**
   - Regularization techniques help to prevent overfitting by limiting how closely the model can fit the training data: L1/L2 regularization penalizes large weights, dropout injects noise during training, and early stopping halts training before the model starts fitting noise. This encourages the model to learn more general patterns rather than memorizing the training data.
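
As one concrete case, L2 regularization can be sketched as an extra term added to the cost and to each gradient step. The names `l2_penalty`, `regularized_cost`, `sgd_step_with_l2`, and the strength `lam` are hypothetical, and the penalty is written as `lam * sum(w**2)`; some texts use `lam / 2` instead.

```python
from typing import List


def l2_penalty(weights: List[float], lam: float) -> float:
    """lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_cost(data_loss: float, weights: List[float], lam: float) -> float:
    """Data-fitting loss plus the L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)


def sgd_step_with_l2(
    weights: List[float], grads: List[float], lr: float, lam: float
) -> List[float]:
    """One gradient step; the penalty contributes 2 * lam * w to each gradient."""
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]


if __name__ == "__main__":
    w = [3.0, -4.0]
    print(regularized_cost(data_loss=1.0, weights=w, lam=0.1))
    print(sgd_step_with_l2(w, grads=[0.0, 0.0], lr=0.1, lam=0.5))
```

Even when the data gradient is zero, the step pulls every weight toward zero, which is why large weights survive only if they reduce the data loss by more than the penalty they incur.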

6. **What are loss and cost functions in deep learning?**
   - **Loss Function**: A loss function measures the difference between the predicted output and the actual target for a single training example. It is used to guide the optimization process.
   - **Cost Function**: The cost function is typically the average of the loss function over the entire training dataset (or over a mini-batch), sometimes with a regularization term added. It provides a single scalar value that the optimization algorithm aims to minimize. In practice the two terms are often used interchangeably.
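
The distinction can be shown in a few lines. This sketch assumes squared error as the per-example loss; the predictions and targets are made-up numbers.

```python
from typing import List


def squared_error_loss(prediction: float, target: float) -> float:
    """Loss for a single example."""
    return (prediction - target) ** 2


def mse_cost(predictions: List[float], targets: List[float]) -> float:
    """Cost: the per-example loss averaged over the whole set."""
    losses = [squared_error_loss(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    preds = [2.0, 0.0, 3.0]
    targets = [1.0, 0.0, 5.0]
    print([squared_error_loss(p, t) for p, t in zip(preds, targets)])  # per-example losses
    print(mse_cost(preds, targets))                                     # single scalar cost
```

The per-example losses here are 1.0, 0.0, and 4.0, and the cost is their mean, 5/3.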

7. **What do you mean by underfitting in neural networks?**
   - Underfitting occurs when a neural network is too simple to capture the underlying patterns in the data. This results in poor performance on both the training and validation datasets. It can be caused by too few neurons or layers, by too little training, or by regularization that is too strong.
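
Underfitting is easiest to see with a model far simpler than a neural network. The sketch below fits a straight line to data generated from y = x² (an assumed toy dataset) and compares it with a model that is given the x² feature; both use the same closed-form least-squares fit.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def training_mse(features: List[float], ys: List[float]) -> float:
    """Fit a line on the given feature and return its error on the same data."""
    slope, intercept = fit_line(features, ys)
    return sum((slope * f + intercept - y) ** 2 for f, y in zip(features, ys)) / len(ys)


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]                                    # true relationship is quadratic

linear_mse = training_mse(xs, ys)                           # model too simple
quadratic_mse = training_mse([x * x for x in xs], ys)       # model with enough capacity

if __name__ == "__main__":
    print("linear model training MSE:", linear_mse)
    print("quadratic model training MSE:", quadratic_mse)
```

The straight line has a training error of 12.0 even though it was fitted on exactly this data, while the quadratic model reaches zero. High error on the training set itself is the signature of underfitting.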

8. **Why do we use Dropout in Neural Networks?**
   - Dropout is used to prevent overfitting by randomly setting a fraction of the neuron activations to zero at each training step. This forces the network to learn redundant representations and improves generalization by reducing reliance on any single neuron. Dropout is applied only during training; at inference time all neurons are kept active.
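
The mechanism can be sketched in a few lines. This version uses "inverted" dropout, a common variant (assumed here, not stated above) that rescales the surviving activations during training so that nothing needs to change at inference time. The function name and parameters are hypothetical.

```python
import random
from typing import List


def dropout(
    activations: List[float], p_drop: float, training: bool, rng: random.Random
) -> List[float]:
    """Zero each activation with probability p_drop; scale survivors by 1 / (1 - p_drop)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)  # inference: every neuron stays active
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    layer_output = [1.0] * 10
    print(dropout(layer_output, p_drop=0.5, training=True, rng=rng))
    print(dropout(layer_output, p_drop=0.5, training=False, rng=rng))
```

Because survivors are scaled up by 1 / (1 - p_drop), the expected value of each activation is the same in training and inference, even though a different random subset of neurons is silenced at every training step.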
