In [1]:
# 1. Activation Functions
# a) Sigmoid
# The sigmoid function, f(x) = 1 / (1 + e^(-x)), maps any real input to a value between 0 and 1. Its "S" shaped curve makes it useful at the output of binary classifiers. However, it saturates for inputs of large magnitude: the gradient approaches zero there (vanishing gradients), which slows down the learning process.

# b) Tanh
# The hyperbolic tangent (tanh) function has the same "S" shape as the sigmoid but outputs values between -1 and 1. Because its outputs are zero-centered, it often converges better during training than the sigmoid. Like the sigmoid, it saturates for inputs of large magnitude and can suffer from vanishing gradients.

# c) ReLU (Rectified Linear Unit)
# ReLU is a simple activation function defined as f(x) = max(0, x). It outputs zero for negative inputs and passes positive inputs unchanged. This function helps mitigate the vanishing gradient problem and leads to faster training, but it can result in dead neurons where some neurons become inactive and never recover.

# d) ELU (Exponential Linear Unit)
# ELU is designed to address some of the shortcomings of ReLU. For positive inputs it passes the value through unchanged, like ReLU. For negative inputs it outputs α(e^x - 1), a smooth curve that saturates at -α instead of a hard zero. Because the gradient stays non-zero for negative inputs, this helps prevent dead neurons and can improve learning speed, particularly in deeper networks.

# e) Leaky ReLU
# Leaky ReLU is a variant of ReLU that allows a small, non-zero gradient (a "leak") for negative inputs, typically defined as f(x) = x for x > 0 and f(x) = αx for x ≤ 0 (where α is a small constant). This helps keep the neurons active during training, reducing the risk of dead neurons.

# f) Swish
# Swish is a newer activation function defined as f(x) = x⋅sigmoid(x). It can output both negative and positive values and tends to work well in practice, often outperforming ReLU and its variants in deep networks.
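
# The six functions above can be sketched in plain Python as scalar functions. This is a minimal illustration, not a library implementation; the default values alpha=1.0 (ELU) and alpha=0.01 (Leaky ReLU) are assumptions chosen for the example.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function; output lies between 0 and 1
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    # Zero-centered S-curve; output lies between -1 and 1
    return math.tanh(x)

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return max(0.0, x)

def elu(x, alpha=1.0):
    # Identity for positive inputs; smooth curve saturating at -alpha otherwise
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def leaky_relu(x, alpha=0.01):
    # Identity for positive inputs; small slope alpha for the rest
    return x if x > 0 else alpha * x

def swish(x):
    # Input scaled by its own sigmoid
    return x * sigmoid(x)

# Compare the functions on a negative, a zero, and a positive input
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), elu(x), leaky_relu(x), swish(x))
```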

In [2]:
# Learning Rate in Optimizers
# Increasing the learning rate makes each parameter update larger, so training progresses faster, but the optimizer may overshoot the optimal point, leading to oscillation or divergence. Decreasing the learning rate makes the updates smaller. This can lead to more precise convergence, but it also lengthens training and raises the risk of getting stuck in local minima.
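
# A minimal sketch of this trade-off, assuming plain gradient descent on the toy objective f(w) = w^2 (gradient 2w). The function name, the starting point, and the three learning rates are illustrative choices, not values from any particular optimizer.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    # Minimize f(w) = w**2 starting from w0; the minimum is at w = 0
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # standard gradient descent update
    return w

# Small rate: slow progress. Moderate rate: close to 0. Too large: diverges.
for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

# Each step multiplies w by (1 - 2*lr), so on this objective any rate above 1.0 makes |w| grow instead of shrink.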



In [3]:
# Number of Hidden Neurons
# Increasing the number of hidden neurons can give the network more capacity to learn complex patterns in the data. However, it can also lead to overfitting, where the model learns noise instead of the underlying patterns. The model may perform well on the training data but poorly on unseen data due to its increased complexity.
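
# To make "capacity" concrete, the sketch below counts the trainable parameters of a fully connected network with a single hidden layer. The helper name and the layer sizes (784 inputs, 10 outputs) are hypothetical values picked for the example.

```python
def mlp_param_count(n_inputs, n_hidden, n_outputs):
    # Each layer has one weight per input plus one bias per neuron
    hidden_params = (n_inputs + 1) * n_hidden
    output_params = (n_hidden + 1) * n_outputs
    return hidden_params + output_params

# The parameter count grows roughly linearly with the hidden layer width
for n_hidden in (32, 128, 512):
    print(n_hidden, mlp_param_count(784, n_hidden, 10))
```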



In [4]:
# Batch Size
# Increasing the batch size gives more stable, lower-variance gradient estimates, which can improve training speed and convergence. However, it also requires more memory, and very large batches can reduce the generalization capability of the model. Smaller batch sizes introduce more noise in the gradients, which can act as a mild regularizer and help the model generalize better.
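
# The effect of batch size on gradient noise can be simulated without a real model. The sketch below assumes each per-example gradient is a noisy draw around a true gradient of 1.0 and measures how much the batch average varies; the Gaussian noise model and all constants are assumptions made for illustration.

```python
import random
import statistics

def batch_gradient_spread(batch_size, trials=200, seed=0):
    # Standard deviation of the mini-batch gradient estimate across many batches
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Simulated per-example gradients: true value 1.0 plus unit Gaussian noise
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)

# Larger batches average out more of the per-example noise
for batch_size in (8, 64, 256):
    print(batch_size, batch_gradient_spread(batch_size))
```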



In [5]:
# Regularization to Avoid Overfitting
# Regularization techniques are adopted to prevent overfitting by discouraging overly complex models. L1 and L2 regularization add a penalty on the size of the weights to the cost function, while dropout randomly deactivates neurons during training. This encourages the model to focus on the most significant features, reduces reliance on noise in the training data, and promotes generalization to unseen data.
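
# A minimal sketch of how L1 and L2 penalties enter the cost. The function names and the strength parameter lam are hypothetical; real frameworks expose this differently (for example as weight decay).

```python
def l1_penalty(weights, lam):
    # Sum of absolute weights; tends to push some weights to exactly zero
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # Sum of squared weights; shrinks all weights toward zero
    return lam * sum(w * w for w in weights)

def regularized_cost(data_cost, weights, lam, kind="l2"):
    # Total cost = cost on the data + penalty on the weights
    if kind == "l1":
        return data_cost + l1_penalty(weights, lam)
    if kind == "l2":
        return data_cost + l2_penalty(weights, lam)
    raise ValueError("kind must be 'l1' or 'l2'")

weights = [1.0, -2.0, 0.5]
print(regularized_cost(1.0, weights, lam=0.1, kind="l1"))
print(regularized_cost(1.0, weights, lam=0.1, kind="l2"))
```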

In [6]:
# Loss and Cost Functions
# In deep learning, a loss function measures how well the model's predictions match the true labels for a single training example, while a cost function is the average of the loss over the entire dataset or a batch of examples. The goal of training is to minimize the cost function, which typically reflects the model's performance during training.
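
# The distinction can be shown in a few lines. Squared error is used here only as an example loss; the function names are illustrative.

```python
def squared_error(y_true, y_pred):
    # Loss: error for a single training example
    return (y_true - y_pred) ** 2

def cost(y_true_batch, y_pred_batch):
    # Cost: average of the per-example losses over a batch or dataset
    losses = [squared_error(t, p) for t, p in zip(y_true_batch, y_pred_batch)]
    return sum(losses) / len(losses)

print(squared_error(3.0, 5.0))
print(cost([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```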

In [7]:
# Underfitting
# Underfitting occurs when a model is too simple to capture the underlying patterns in the data. This can happen if the model has too few parameters, inadequate training, or an inappropriate model architecture. An underfitted model performs poorly on both training and test data, failing to achieve good accuracy.
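
# A small sketch of underfitting, assuming synthetic data that follows y = 3x + 1. A constant model is too simple for this data and has a high error on both splits, while a straight-line fit matches both. The data, the split, and the helper names are all made up for the example.

```python
def mse(model, xs, ys):
    # Mean squared error of a model (a callable) on a dataset
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_constant(xs, ys):
    # Too-simple model: always predicts the mean training target
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    # Closed-form least-squares fit of y = slope * x + intercept
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

xs = list(range(20))
ys = [3.0 * x + 1.0 for x in xs]
train_x, train_y = xs[0::2], ys[0::2]   # even x values for training
test_x, test_y = xs[1::2], ys[1::2]     # odd x values for testing

too_simple = fit_constant(train_x, train_y)
adequate = fit_line(train_x, train_y)
print("constant:", mse(too_simple, train_x, train_y), mse(too_simple, test_x, test_y))
print("line:    ", mse(adequate, train_x, train_y), mse(adequate, test_x, test_y))
```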



In [9]:
# Dropout in Neural Networks
# Dropout is a regularization technique used to prevent overfitting by randomly setting the outputs of a fraction of neurons to zero at each training step. This forces the network to learn robust features that are not reliant on any single neuron, promoting better generalization. During inference, dropout is turned off and all neurons are used, with activations scaled so that their expected value matches what the next layer saw during training.
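
# A minimal sketch of the "inverted" dropout variant, which rescales the surviving activations during training so that inference needs no extra scaling. The function signature is hypothetical and operates on a plain list of activations rather than a framework tensor.

```python
import random

def dropout(activations, drop_prob, training, rng=None):
    # drop_prob is the probability of zeroing each activation
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # Inference: every neuron is kept and nothing is rescaled
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    # Keep each activation with probability keep_prob and scale it by
    # 1 / keep_prob so the expected value is unchanged
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

layer_output = [1.0, 2.0, 3.0, 4.0]
print(dropout(layer_output, 0.5, training=True, rng=random.Random(0)))
print(dropout(layer_output, 0.5, training=False))
```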