# Neural Network Architecture

A **neural network architecture** refers to the *structure or layout of a neural network*. It defines how the individual nodes or layers are organized and connected to each other. The architecture *determines how the network processes and transforms input data to produce output predictions or classifications*.

**Direction** definition is *PENDING*

## Input Layer

The **input layer** is responsible for *preprocessing the input data and transforming it into an appropriate numeric representation* before it is passed to every node in the first hidden layer. It ***does not have its own weights, biases, or activation functions***; these come into play only in the layers that follow. The input layer ensures that the neural network can *better handle different types of data*, *improve convergence during training*, and *enhance the overall performance of the model*.

*Below are the specific and detailed steps performed by the input layer for the data preprocessing:*
* **Data Scaling/Normalization**: This involves scaling the input data to a specific range or normalizing it to have zero mean and unit variance. Scaling the data ensures that all features have a similar scale, which can help improve the performance and convergence of the neural network during training.
* **Data Encoding**: If the input data contains categorical variables, they may need to be encoded into numerical values. This can be done using techniques such as one-hot encoding, label encoding, or ordinal encoding, depending on the nature of the categorical variables.
* **Missing Data Handling**: If the input data contains missing values, preprocessing in the input layer may involve handling these missing values. This can be done by imputing the missing values with a suitable strategy, such as mean imputation, median imputation, or using more advanced techniques like regression imputation or multiple imputation.
* **Feature Selection/Extraction**: In some cases, the input data may contain a large number of features, and not all of them may be relevant for the task at hand. Preprocessing in the input layer may involve selecting a subset of the most informative features or performing feature extraction techniques, such as principal component analysis (PCA) or linear discriminant analysis (LDA), to reduce the dimensionality of the input data.
* **Data Reshaping**: Depending on the specific requirements of the neural network architecture, the input data may need to be reshaped or restructured. For example, if the network expects a certain input shape, such as a specific number of rows and columns, the input data may need to be reshaped accordingly.
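
The preprocessing steps above can be sketched with NumPy. This is a minimal illustration, not a reference pipeline: the helper names (`impute_mean`, `standardize`, `one_hot`) and the toy data are assumptions made for the example, and it covers only mean imputation, standardization, one-hot encoding, and assembling the final input matrix.

```python
import numpy as np

def impute_mean(X):
    """Replace NaN entries with the mean of their column."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std

def one_hot(labels):
    """Encode a 1-D array of category labels as one-hot rows."""
    categories, indices = np.unique(labels, return_inverse=True)
    return np.eye(len(categories))[indices], categories

# Hypothetical toy data: two numeric features (one value missing)
# and one categorical feature.
numeric = np.array([[1.0, 200.0],
                    [2.0, np.nan],
                    [3.0, 400.0]])
colors = np.array(["red", "green", "red"])

numeric_clean = standardize(impute_mean(numeric))
color_codes, color_names = one_hot(colors)

# Assemble one numeric matrix of shape (samples, features).
X = np.hstack([numeric_clean, color_codes])
print(X.shape)  # (3, 4)
```

In practice these steps are often delegated to a preprocessing library; the sketch only shows what each step does to the data.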

## Hidden Layer

The **hidden layers** serve as the brain of the neural network, responsible for ***processing the data to capture the complex relationships and patterns present in it***, leading to *improved generalization and prediction capabilities in the output layer*. Each node in a hidden layer processes its inputs to identify relationships between the target variable (y) and the feature variables (x), and passes its output to all nodes in the succeeding layer, where it is combined with that layer's weights and biases.

*Below are the specific and detailed tasks performed by the hidden layer:*
* **Weight and Bias Adjustment**: The hidden layers contribute to determining the optimal values of weights and biases. The weights control the strength of the connections between neurons, while the biases introduce an additional parameter that helps in shifting the activation function. By adjusting these parameters, the hidden layers allow the neural network to learn and adapt to the patterns in the data.
* **Non-Linear Transformations**: The primary function of the hidden layers is to perform non-linear transformations on the inputs. Each neuron in the hidden layer applies an activation function to the weighted sum of its inputs. This introduces non-linearity into the model, enabling the neural network to capture complex relationships and patterns in the data. The hidden layers allow the network to learn and represent non-linear mappings between the input and output variables.
* **Feature Extraction and Representation**: The hidden layers also play a crucial role in feature extraction and representation learning. As the data passes through the hidden layers, they learn to extract relevant features from the input data. Each hidden layer can learn to detect and represent different levels of abstraction and complexity. By combining these learned features, the network can make more accurate predictions in the output layer.
* **Hierarchical Learning**: Neural networks with multiple hidden layers can learn hierarchical representations of the data. The early hidden layers learn low-level features, such as edges or corners, while the deeper hidden layers learn more abstract and high-level features. This hierarchical learning allows the network to capture intricate patterns and relationships in the data, leading to improved performance.
* **Generalization and Prediction**: The hidden layers, along with the learned weights and biases, contribute to the network's ability to generalize and make predictions. By learning from the training data, the hidden layers adjust the weights and biases to minimize the difference between the predicted output and the actual output. This enables the network to make accurate predictions on unseen data.
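
As a rough sketch of what a hidden layer computes, the snippet below passes a small batch through two fully connected hidden layers. The layer sizes, the choice of ReLU, and the helper name `dense_forward` are assumptions for illustration; the weights are random rather than trained.

```python
import numpy as np

def relu(z):
    """ReLU activation: keep positive values, zero out the rest."""
    return np.maximum(0.0, z)

def dense_forward(x, W, b, activation):
    """One fully connected layer: activation(x @ W + b)."""
    return activation(x @ W + b)

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, hidden layers of 5 and 3 neurons.
x = rng.normal(size=(2, 4))                     # batch of 2 samples
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)   # first hidden layer
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)   # second hidden layer

h1 = dense_forward(x, W1, b1, relu)    # output of the first hidden layer
h2 = dense_forward(h1, W2, b2, relu)   # output of the second hidden layer
print(h1.shape, h2.shape)  # (2, 5) (2, 3)
```

Each row of `h2` is the representation that would be handed to the next layer for the corresponding input sample.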

More hidden layers often improve accuracy, but this is not always the case. It is also important to consider that more hidden layers mean more processing power and time are needed for training and inference.

## Weights and Biases in Neural Networks

***Weights** and **biases** are fundamental components of a neural network. They are its trainable parameters, determined during the training process to improve the model's prediction capabilities, and are stored as arrays of numeric values.*

### Weights

**Weights** are parameters associated with the connections between neurons in a neural network. Each connection between two neurons has a weight assigned to it. These weights determine the strength or importance of the connection. They ***determine the contribution of each input to the output of a node***. *Larger weights* amplify the input's effect, while *smaller weights* diminish it. By adjusting the weights, the network learns to assign appropriate importance to different features and patterns in the data.

Initially, weights are assigned randomly using **initialization techniques** such as *random initialization*, *Xavier initialization*, and *He initialization*.
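
A minimal sketch of two of these schemes, using the commonly cited formulas for Xavier (Glorot) uniform and He normal initialization. The function names and layer sizes are assumptions for the example; deep learning frameworks ship their own initializers.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He normal: zero-mean Gaussian with variance 2 / fan_in, often paired with ReLU."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(42)

# Hypothetical layer: 256 inputs feeding 128 neurons.
W_xavier = xavier_uniform(256, 128, rng)
W_he = he_normal(256, 128, rng)
print(W_xavier.shape, W_he.shape)  # (256, 128) (256, 128)
```

Both schemes scale the random values by the layer size so that signals neither blow up nor fade as they pass through many layers.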

During training, the network updates the weights using **optimization algorithms** like *gradient descent*. The goal is to minimize the difference between the *predicted output* and the *actual output* through such algorithms.
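
To make the update rule concrete, here is a single gradient descent step for one linear neuron with a squared-error loss. The input, target, starting parameters, and learning rate are made-up values chosen for the illustration.

```python
def predict(w, b, x):
    """Output of a single linear neuron."""
    return w * x + b

def squared_error(w, b, x, y):
    """Squared difference between prediction and target."""
    return (predict(w, b, x) - y) ** 2

# Hypothetical training example and starting parameters.
x, y = 2.0, 3.0
w, b = 0.5, 0.0
learning_rate = 0.05

loss_before = squared_error(w, b, x, y)

# Gradients of (w*x + b - y)^2 with respect to w and b.
error = predict(w, b, x) - y
grad_w = 2.0 * error * x
grad_b = 2.0 * error

# Move each parameter a small step against its gradient.
w = w - learning_rate * grad_w
b = b - learning_rate * grad_b

loss_after = squared_error(w, b, x, y)
print(loss_before, loss_after)
```

The loss after the step is smaller than before it, which is the behavior gradient descent repeats over many steps and many examples.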

### Biases

**Biases** are additional parameters in neural networks that shift the input to the activation function. Each neuron, with the exception of the input layer nodes, has a bias associated with it. Because the bias is added to the weighted sum before the activation function is applied, a neuron can produce a non-zero output even when all of its inputs are zero. Biases ***introduce a certain level of flexibility and adaptability***: predictions are not solely dependent on the input values, and, together with the weights and activation functions, biases enable the network to learn complex decision boundaries.

Biases are commonly initialized to zero or a small constant, although random values can also be used. The choice can depend on the specific activation function used in the network.

During training, biases are updated along with the weights. The network learns the optimal bias values that help in achieving the desired output.
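
The shifting effect of a bias can be seen with a single sigmoid neuron. In this small sketch the weight, the inputs, and the bias value of 2.0 are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, 0.0, 2.0])
w = 1.0

no_bias = sigmoid(w * x + 0.0)    # output crosses 0.5 at x = 0
with_bias = sigmoid(w * x + 2.0)  # output crosses 0.5 at x = -2

print(no_bias)
print(with_bias)
```

With the same weight, the bias moves the point at which the neuron's output crosses 0.5 from `x = 0` to `x = -2`, without changing the shape of the curve.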

## Activation Functions in Deep Learning

**Activation functions** play a crucial role in deep learning by introducing ***non-linearity*** into the neural network.

*Activation functions* are applied to the weighted sum of inputs in a neuron to introduce non-linearity. Without activation functions, a neural network would be limited to learning linear relationships between inputs and outputs. Activation functions allow the network to learn and represent complex patterns and non-linear relationships in the data.
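
The claim that a network without activation functions can only learn linear relationships can be checked numerically. The sketch below, with arbitrary random weights and made-up layer sizes, shows that two stacked layers with no activation are equivalent to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 4 inputs, 6 hidden units, 2 outputs.
x = rng.normal(size=(3, 4))
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 2)), rng.normal(size=2)

# Two layers with no activation function in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# A single linear layer that computes exactly the same mapping.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU between the layers, this collapse no longer holds in general.
with_relu = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

However many linear layers are stacked, the result is still one linear map, which is why a non-linear activation is needed between layers.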

### Non-Linearity through Activation Functions

**Non-linearity** refers to the property of a function or relationship where the output does not change linearly with the input. ***Non-linear transformations***, introduced through activation functions in deep learning, enable neural networks to capture and model complex patterns and non-linear relationships in the data. Some activation functions, such as sigmoid and tanh, also squash their input into a bounded output range, typically between 0 and 1 or between -1 and 1.

In the context of deep learning, non-linearity is crucial because many real-world problems and data patterns are inherently non-linear. Linear relationships can only capture simple patterns and relationships that can be represented by straight lines or planes. However, complex patterns and relationships often require non-linear transformations to be accurately captured and modeled.

By introducing non-linearity through activation functions, neural networks can learn and represent these non-linear relationships. Activation functions allow the network to capture intricate patterns, make complex decisions, and approximate highly non-linear functions. This enables deep learning models to handle a wide range of tasks, such as image recognition, natural language processing, and time series prediction, where non-linear relationships are prevalent.

### Common Activation Functions

*The choice of activation function depends on the problem at hand and the characteristics of the data. Different activation functions have different properties and can impact the network's performance. It is important to experiment with different activation functions and select the one that yields the best results for a specific task.*

**Sigmoid Function**: The sigmoid function is a popular activation function that ***maps the input to a value between 0 and 1***. It has a smooth S-shaped curve. The sigmoid function is useful in binary classification problems or when the output needs to be interpreted as a probability.

**ReLU (Rectified Linear Unit)**: The ReLU activation function is widely used in deep learning. It ***returns the input if it is positive, and 0 otherwise***. ReLU helps mitigate the vanishing gradient problem and often accelerates the convergence of the network during training.

**Leaky ReLU**: Leaky ReLU is a variation of the ReLU function that ***introduces a small slope for negative inputs***. This helps in addressing the "dying ReLU" problem, where neurons can become inactive and stop learning.

**Tanh (Hyperbolic Tangent)**: The tanh function ***maps the input to a value between -1 and 1***. It is similar to the sigmoid function but centered around 0. Tanh is useful when the input data is normalized or when negative values are meaningful.

**Softmax**: The softmax function is commonly used in the output layer of a neural network for multi-class classification problems. It ***converts the output values into probabilities, where the sum of all probabilities is equal to 1***. Softmax is useful when dealing with mutually exclusive classes.
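
The functions described above can be written in a few lines of NumPy. This is an illustrative sketch: the `negative_slope` default of 0.01 for Leaky ReLU is an assumed, commonly used value, and the max-subtraction in `softmax` is a standard trick for numerical stability rather than part of the definition.

```python
import numpy as np

def sigmoid(z):
    """Map input to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Return the input if it is positive, and 0 otherwise."""
    return np.maximum(0.0, z)

def leaky_relu(z, negative_slope=0.01):
    """Like ReLU, but with a small slope for negative inputs."""
    return np.where(z > 0, z, negative_slope * z)

def tanh(z):
    """Map input to a value between -1 and 1, centered around 0."""
    return np.tanh(z)

def softmax(z):
    """Convert scores to probabilities that sum to 1 along the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
for fn in (sigmoid, relu, leaky_relu, tanh, softmax):
    print(fn.__name__, fn(z))
```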

*Note that each hidden layer may have a different activation function from other hidden layers of the same neural network to improve the performance of the neural network.*

## Output Layer

The **output layer** in a deep neural network is the final layer of neurons that ***produces the network's predictions or outputs***. It is responsible for *mapping the learned features and representations* from the *preceding layers to the desired output format*.

The **number of neurons** in the output layer ***depends on the specific task and the desired output format***. For example, in *binary classification*, there is typically *one node in the output layer* to represent the probability or prediction for one class. In *multi-class classification*, the *number of neurons in the output layer corresponds to the number of classes*.

The choice of activation function in the output layer depends on the nature of the problem and the desired output format. Common activation functions used in the output layer with relation to the nature of the problem are listed below:
* **Sigmoid Function**: Used for ***binary classification problems***, the sigmoid function ***maps the output to a value between 0 and 1***, representing the probability of belonging to a particular class.
* **Softmax Function**: Used for ***multi-class classification problems***, the softmax function ***normalizes the outputs across all classes, producing a probability distribution where the sum of probabilities is equal to 1***. It helps in determining the most likely class for a given input.
* **Linear Function**: Used for ***regression problems***, the linear activation function allows the network to ***directly output continuous values without any transformation***.

*It is also important to note that the loss function is applied to the predictions of the output layer*. The **loss function** measures the ***discrepancy between the predicted output and the true output***. Common loss functions include **mean squared error (MSE)** for *regression problems*, **binary cross-entropy** for *binary classification*, and **categorical cross-entropy** for *multi-class classification*. Loss functions are covered in more detail later.

The outputs produced by the output layer are then run through an **interpretation process**. For example, in *binary classification*, the output can be ***interpreted as the probability of belonging to a particular class***. In *multi-class classification*, the output can be ***interpreted as the probabilities of belonging to each class***. In *regression*, the ***output represents the predicted continuous value***.
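
A small sketch of this interpretation step for the three cases. The output values, the class names, and the 0.5 decision threshold are assumptions made for the example; the appropriate threshold depends on the application.

```python
import numpy as np

# Binary classification: one sigmoid output per sample.
binary_probs = np.array([0.92, 0.13, 0.50])
binary_labels = (binary_probs >= 0.5).astype(int)  # assumed threshold of 0.5

# Multi-class classification: one softmax row per sample.
class_names = np.array(["cat", "dog", "bird"])  # hypothetical classes
multi_probs = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
predicted = class_names[np.argmax(multi_probs, axis=1)]

# Regression: the linear output is used directly as the prediction.
regression_output = np.array([12.4, 7.9])

print(binary_labels)      # [1 0 1]
print(predicted)          # ['cat' 'bird']
print(regression_output)  # [12.4  7.9]
```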

## Loss Function in Output Layer

A **loss function** in the output layer of a neural network ***measures the discrepancy or error between the predicted output and the true output (also known as the ground truth)***. The loss function quantifies how well the network is performing on the given task and provides a measure of the network's ability to make accurate predictions. Below are the major aspects of a loss function:

The loss function is used as an **error measurement**: it calculates the difference between the predicted output and the true output, and provides a numerical value that represents the error or discrepancy between the predicted and actual values.

The loss function is used as an **optimization guide** for the network's parameters (weights and biases) during the training process. The goal is to minimize the loss function by adjusting the parameters, which leads to better predictions and improved performance.

The loss function is **task-specific**. The choice of the loss function depends on the specific task and the desired output format. Different tasks, such as regression, binary classification, or multi-class classification, require different loss functions.
* **Mean Squared Error (MSE)**: Used for regression tasks, MSE calculates the average squared difference between the predicted and true values. It penalizes larger errors more heavily.
* **Binary Cross-Entropy**: Used for binary classification tasks, binary cross-entropy measures the dissimilarity between the predicted probabilities and the true binary labels. It is commonly used with sigmoid activation in the output layer.
* **Categorical Cross-Entropy**: Used for multi-class classification tasks, categorical cross-entropy measures the dissimilarity between the predicted class probabilities and the true class labels. It is commonly used with softmax activation in the output layer.
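
The three losses listed above can be sketched as follows. The function names and the clipping constant `eps` (used to avoid taking the logarithm of zero) are choices made for this example, and the inputs are toy values.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and targets."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Cross-entropy between binary labels and predicted probabilities."""
    p = np.clip(p, eps, 1.0 - eps)  # keep log() finite
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Cross-entropy between one-hot labels and predicted class probabilities."""
    probs = np.clip(probs, eps, 1.0)  # keep log() finite
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

# Toy regression example: one prediction is off by 2.
mse = mean_squared_error(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

# Toy binary example: the model is maximally unsure (p = 0.5).
bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))

# Toy multi-class example: the true class receives probability 0.5.
cce = categorical_cross_entropy(np.array([[1.0, 0.0, 0.0]]),
                                np.array([[0.5, 0.25, 0.25]]))

print(mse, bce, cce)
```

Here `mse` is 4/3, and both cross-entropy values equal ln 2 (about 0.693), the loss of assigning probability 0.5 to the correct answer.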

The loss function is used for **training and evaluation**. During training, the loss function is used to compute the gradient of the loss with respect to the network's parameters. This gradient is then used to update the parameters through backpropagation and gradient descent optimization. The loss function guides the learning process by indicating the direction in which the parameters should be adjusted to minimize the error.
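
The loop below sketches how the loss drives training for the simplest possible case: a single sigmoid output neuron trained with binary cross-entropy. With only one layer, backpropagation reduces to one application of the chain rule. The toy data, learning rate, and number of steps are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-12):
    """Binary cross-entropy, clipped to keep log() finite."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Hypothetical toy data: the label is 1 when the two features sum to more than 1.
X = np.array([[0.1, 0.2], [0.9, 0.8], [0.4, 0.3], [0.7, 0.9]])
y = np.array([0.0, 1.0, 0.0, 1.0])

w = np.zeros(2)
b = 0.0
learning_rate = 0.5
losses = []

for step in range(200):
    p = sigmoid(X @ w + b)       # forward pass
    losses.append(bce(y, p))     # loss for the current parameters
    grad_z = (p - y) / len(y)    # gradient of the loss w.r.t. the pre-activation
    grad_w = X.T @ grad_z        # chain rule back to the weights
    grad_b = grad_z.sum()        # chain rule back to the bias
    w -= learning_rate * grad_w  # gradient descent update
    b -= learning_rate * grad_b

print(losses[0], losses[-1])
```

The recorded loss starts at ln 2 (every prediction is 0.5) and ends lower, which is the signal that the parameter updates are moving in a useful direction.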

Lastly, the loss function is often used as an **evaluation metric** to assess the performance of the trained model. However, it may not always directly reflect the model's overall performance. Additional evaluation metrics, such as accuracy, precision, recall, or F1 score, may be used to provide a more comprehensive assessment of the model's performance.