# Neural Networks

## 1. Neurons:
A neuron is the fundamental building block of a neural network. It computes a weighted sum of its inputs, adds a bias term, and passes the result through an activation function.

**Equation:**
A single neuron's output can be represented as:
<h4><center>$ y = f\left( \sum_{i=1}^{n} w_i \cdot x_i + b \right) $</center></h4>

where:
- $y$: output of the neuron
- $w_i$: weight of the i-th input
- $x_i$: i-th input
- $b$: bias term
- $f$: activation function, such as ReLU or Sigmoid
- $n$: number of inputs

In [3]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

#### Sigmoid Function
The mathematical expression for the sigmoid function is given by:

<h4><center>$ \sigma(x) = \frac{1}{1 + e^{-x}} $</center></h4>

##### Properties and Applications:
1. **Range**: The sigmoid function takes any real-valued number and squashes it into the range between 0 and 1. The output can be interpreted as a probability, making it useful for binary classification tasks.

2. **Smooth Gradient**: The function is smooth, and its derivative is easy to compute. The gradient is given by:

   <h4><center>$ \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) $</center></h4>

   The smooth gradient helps in gradient-based optimization methods like Gradient Descent.

3. **Activation in Hidden Layers**: Earlier in neural network history, the sigmoid function was commonly used in hidden layers to introduce non-linear properties to the model. The non-linearity allows the network to learn complex mappings from inputs to outputs.

4. **Vanishing Gradient Problem**: One downside of the sigmoid function is that its gradient becomes very small for inputs of large magnitude (large positive or large negative values), where the function saturates near 1 or 0. This leads to the so-called "vanishing gradient" problem during training, where the gradients become so small that the weights hardly update, slowing down or even halting the learning process.

5. **Computational Cost**: The sigmoid function is more expensive to compute than activation functions like ReLU, because it requires evaluating an exponential. This cost adds up in deep networks with many activations.
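To make the gradient behaviour described above concrete, here is a minimal sketch that evaluates the derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ at a few points. The helper name `sigmoid_grad` and the sample inputs are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative expressed through the function value: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

# Sample inputs chosen for illustration: saturated tails and the centre
inputs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(inputs)

for x_val, g in zip(inputs, grads):
    print(f"x = {x_val:6.1f}   sigmoid'(x) = {g:.6f}")
```

The gradient peaks at 0.25 for $x = 0$ and falls below $10^{-4}$ at $x = \pm 10$, which is the saturation behind the vanishing gradient problem.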

In [None]:
def single_neuron(x, w, b):
    weighted_sum = np.dot(x, w) + b
    return sigmoid(weighted_sum)

# Inputs, Weights, and Bias
x = np.array([0.5, 0.3, 0.1])
w = np.array([0.4, 0.2, 0.6])
b = 0.1

# Output of the Neuron
output = single_neuron(x, w, b)
print(f"Output of the neuron: {output}")

In the **Single Neuron function**, `x` is the input vector, `w` is the weight vector, and `b` is the bias. The function calculates the weighted sum of the inputs, adds the bias, and passes the result through the sigmoid activation function. The `np.dot` function computes the dot product of the inputs and weights, so the function implements the equation $ y = f\left( \sum_{i=1}^{n} w_i \cdot x_i + b \right) $, where $ f $ is the sigmoid function.
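As a worked check of the example above, the same computation can be done by hand using the values of `x`, `w`, and `b` from the code (the final value is rounded):

<h4><center>$ \sum_{i=1}^{3} w_i \cdot x_i + b = (0.4)(0.5) + (0.2)(0.3) + (0.6)(0.1) + 0.1 = 0.42 $</center></h4>

<h4><center>$ y = \sigma(0.42) = \frac{1}{1 + e^{-0.42}} \approx 0.6035 $</center></h4>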


## 2. Weights:
Weights are parameters within the neural network that are adjusted during training. They define the strength of the connections between neurons in adjacent layers.

**Importance:**
Weights allow the model to generalize from the training data to unseen data. By adjusting the weights, the network minimizes the error between the predicted and actual output.

In [2]:
def update_weights(x, y_true, w, b, learning_rate=0.1):
    # Predicted Output
    y_pred = single_neuron(x, w, b)
    
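    # Derivative of the squared error (y_pred - y_true)^2 with respect to y_pred
    d_error = 2 * (y_pred - y_true)
    
    # Chain rule: multiply by the sigmoid derivative y_pred * (1 - y_pred),
    # then by x for the weights (and by 1 for the bias)
    d_weights = d_error * x * y_pred * (1 - y_pred)
    d_bias = d_error * y_pred * (1 - y_pred)
    
    # Gradient descent step; build new values instead of using `-=`,
    # which would modify the caller's `w` array in place
    w = w - learning_rate * d_weights
    b = b - learning_rate * d_bias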
    
    return w, b

# Target Output
y_true = 0.4

# Updated Weights and Bias
new_w, new_b = update_weights(x, y_true, w, b)
print(f"Updated weights: {new_w}")
print(f"Updated bias: {new_b}")

Updated weights: [0.39513082 0.19707849 0.59902616]
Updated bias: 0.09026164910025532


The **Update Weights function** updates the weights and bias based on the error between the predicted and true output.
- `y_pred` is the output from the neuron.
- `y_true` is the actual target output.
- `d_error` is the derivative of the error with respect to the output, using the squared error $ (y_{\text{pred}} - y_{\text{true}})^2 $ for this single example.
- `d_weights` and `d_bias` are the gradients for the weights and bias, respectively. By the chain rule, each includes the derivative of the sigmoid activation, $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $, which in the code is `y_pred * (1 - y_pred)`.
- The weights and bias are updated by subtracting a fraction (controlled by `learning_rate`) of the computed gradients. This helps in reducing the error.
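A single update only nudges the output toward the target. The sketch below repeats the same gradient step many times to show the error shrinking. It is self-contained, so it redefines the neuron inline; the helper name `gradient_step` and the choice of 1000 iterations are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(x, y_true, w, b, learning_rate=0.1):
    # Same gradients as update_weights, returned as new values
    y_pred = sigmoid(np.dot(x, w) + b)
    d_z = 2.0 * (y_pred - y_true) * y_pred * (1.0 - y_pred)
    return w - learning_rate * d_z * x, b - learning_rate * d_z

x = np.array([0.5, 0.3, 0.1])
w = np.array([0.4, 0.2, 0.6])
b = 0.1
y_true = 0.4

losses = []
for step in range(1000):  # iteration count is an arbitrary choice
    y_pred = sigmoid(np.dot(x, w) + b)
    losses.append((y_pred - y_true) ** 2)
    w, b = gradient_step(x, y_true, w, b)

final_pred = sigmoid(np.dot(x, w) + b)
print(f"Initial loss: {losses[0]:.6f}")
print(f"Final loss:   {losses[-1]:.2e}")
print(f"Final prediction: {final_pred:.4f} (target {y_true})")
```

With a single training example the neuron can fit the target almost exactly; with real data the loss would be averaged over many examples and would not usually reach zero.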

## 3. Convolutional Layer:
A convolutional layer applies a convolution operation to its input, helping the network to automatically and adaptively learn spatial hierarchies of features.

1. **Filters/Kernels**: Convolutional layers use a set of learnable filters or kernels. Each kernel is small spatially but extends through the full depth of the input volume. These filters are applied to the input to produce feature maps, emphasizing certain features like edges or textures.

2. **Sliding Window**: The kernels slide across the input data (such as an image) to produce a feature map. This sliding window operation is the convolution.

3. **Stride**: This defines how much the filter moves at each step of the sliding window. A stride of 1 moves the filter one pixel at a time, while a stride of 2 moves it two pixels, etc.

4. **Padding**: Padding adds extra pixels around the input image. It controls the spatial dimensions of the output volumes (for example, to keep them the same as the input).

5. **Pooling**: While not part of the convolution operation itself, pooling layers often follow convolutional layers to reduce the spatial dimensions of the data.
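Stride and padding together determine the size of the feature map. As a small sketch, the helper below computes the output size along one dimension using the standard formula $\lfloor (n - k + 2p) / s \rfloor + 1$; the function name `conv_output_size` is a hypothetical helper introduced here.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # floor((n - k + 2p) / s) + 1, applied to one spatial dimension
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(4, 2))             # 3: a 2x2 kernel on a 4x4 image
print(conv_output_size(4, 2, stride=2))   # 2: stride 2 halves the resolution
print(conv_output_size(5, 3, padding=1))  # 5: padding keeps the input size
print(conv_output_size(7, 3, stride=2))   # 3
```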

**Equation:**
The convolution operation can be represented as:
<h4><center>$ (f * g)(t) = \sum_{\tau=-\infty}^{\infty} f(\tau) \cdot g(t - \tau) $</center></h4>

where:
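- $f$: input signal (for images, the input feature map)
- $g$: kernel or filter
- $t$: location in the output feature map
- $\tau$: summation index over input positions

This is the one-dimensional form. For images, the same sum runs over both spatial dimensions, and only over the finite extent of the kernel.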
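To connect the equation to something executable, the sketch below evaluates the sum directly for two short, finite sequences and compares the result with NumPy's `np.convolve`. The function name `convolve1d_full` and the sample values are illustrative.

```python
import numpy as np

def convolve1d_full(f, g):
    # Direct evaluation of (f * g)(t) = sum over tau of f(tau) * g(t - tau)
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):  # g is zero outside its support
                out[t] += f[tau] * g[t - tau]
    return out

f = [1.0, 2.0, 3.0]
g = [0.0, 1.0, 0.5]

manual = convolve1d_full(f, g)
reference = np.convolve(f, g)  # default mode is "full"

print(manual)     # [0.  1.  2.5 4.  1.5]
print(reference)
```

The index $t - \tau$ means the kernel is traversed in reverse as it slides over the input, which is what distinguishes convolution from cross-correlation.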

In [11]:
def convolution2D(image, kernel, stride=1, padding=0):
    # Add padding to the image
    image_padded = np.pad(image, padding, mode='constant')
    
    # Dimensions of the output feature map
    output_shape = ((image.shape[0] - kernel.shape[0] + 2 * padding) // stride + 1,
                    (image.shape[1] - kernel.shape[1] + 2 * padding) // stride + 1)
    
    # Initialize the output feature map
    feature_map = np.zeros(output_shape)
    
    # Apply the kernel to the image
    for i in range(0, output_shape[0]):
        for j in range(0, output_shape[1]):
            x_start = i * stride
            x_end = x_start + kernel.shape[0]
            y_start = j * stride
            y_end = y_start + kernel.shape[1]
            
            # Element-wise multiplication of the kernel and the image patch
            feature_map[i, j] = np.sum(image_padded[x_start:x_end, y_start:y_end] * kernel)
    
    return feature_map

In [12]:
image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16]])

# Define a kernel, for example, a 2x2 kernel for edge detection
kernel = np.array([[-1, -1],
                   [1, 1]])

# Call the convolution2D function
convolution2D_output = convolution2D(image, kernel)

# Print the output
print(convolution2D_output)


[[8. 8. 8.]
 [8. 8. 8.]
 [8. 8. 8.]]


- **Padding**: The function begins by adding padding to the image using `np.pad` if necessary.

- **Output Shape Calculation**: The dimensions of the output feature map are calculated based on the input size, kernel size, stride, and padding.

- **Convolution Operation**: The nested loops slide the kernel across the padded image with the given stride. For each position, a patch of the image is multiplied element-wise with the kernel, and the result is summed up to form a single pixel in the output feature map. The kernel is applied without flipping, so strictly speaking this computes a cross-correlation, which is the usual convention for convolutional layers in deep learning.

- **Result**: The function returns the feature map representing the convolutional layer's output.

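The example kernel subtracts the upper row of each patch from the lower row, so it responds to vertical changes in intensity, that is, to horizontal edges. In the example image every row is 4 greater than the row above it, which is why every output value is 8. Below we will explore how these different concepts can be combined into a more functional model.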

***Significance:*** Convolutional layers form the core of CNNs, enabling them to learn spatial features from images in a hierarchical and translation-equivariant manner: a feature that shifts in the input produces a correspondingly shifted response in the feature map. Applying the same filter across the whole image captures essential features, enabling deep learning models to perform complex tasks in computer vision, such as object detection, image segmentation, and face recognition.
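As a cross-check on the loop-based `convolution2D` above, here is a vectorized sketch of the same operation. It assumes NumPy 1.20 or later for `sliding_window_view`; the function name `convolution2D_vectorized` and the step-edge test image are introduced here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolution2D_vectorized(image, kernel, stride=1, padding=0):
    image_padded = np.pad(image, padding, mode='constant')
    # All kernel-sized patches: shape (H - kh + 1, W - kw + 1, kh, kw)
    windows = sliding_window_view(image_padded, kernel.shape)
    # Keep every `stride`-th patch along both spatial axes
    windows = windows[::stride, ::stride]
    # Multiply each patch by the kernel and sum over the patch axes
    return np.einsum('ijkl,kl->ij', windows, kernel).astype(float)

image = np.arange(1, 17).reshape(4, 4)
kernel = np.array([[-1, -1],
                   [1, 1]])
ramp_output = convolution2D_vectorized(image, kernel)
print(ramp_output)

# Illustrative image with one horizontal edge between rows 1 and 2
edge_image = np.zeros((4, 4))
edge_image[2:, :] = 1
edge_output = convolution2D_vectorized(edge_image, kernel)
print(edge_output)
```

On the step-edge image the response is non-zero only in the output row that straddles the edge, which is the sense in which this kernel detects horizontal edges.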

## 4. Pooling Layer:
Pooling layers reduce the dimensionality of the data, retaining only essential information. 
1. **Function**: Pooling reduces the spatial size of the representation, reducing the number of parameters and computation in the network. This helps to control overfitting.

2. **Types**: The two most common types of pooling are Max Pooling and Average Pooling.
   - **Max Pooling**: Selects the maximum value from each group of values in a local region of the input.
   - **Average Pooling**: Calculates the average value for each group of values in a local region of the input.

3. **Window Size**: The size of the window determines how many pixels are grouped together for each pooling operation.

4. **Stride**: Like in convolution, stride controls how the window moves across the input.

5. **No Learnable Parameters**: Unlike convolutional layers, pooling operations have no learnable parameters, meaning they do not change during the training process.

**Equation for Max Pooling:**
If you have a $2 \times 2$ pooling window, the max pooling operation selects the maximum value from the $2 \times 2$ grid:
<h4><center>$ \text{Max} \left( a, b, c, d \right)$</center></h4>

where $a, b, c,$ and $d$ are the values in the $2 \times 2$ window.

In [8]:
def max_pooling2D(image, pool_size=2, stride=2):
    # Dimensions of the output
    output_shape = ((image.shape[0] - pool_size) // stride + 1,
                    (image.shape[1] - pool_size) // stride + 1)

    # Initialize the output
    pooled_output = np.zeros(output_shape)

    # Apply max pooling
    for i in range(0, output_shape[0]):
        for j in range(0, output_shape[1]):
            x_start = i * stride
            x_end = x_start + pool_size
            y_start = j * stride
            y_end = y_start + pool_size
            
            # Take the max value from the window
            pooled_output[i, j] = np.max(image[x_start:x_end, y_start:y_end])

    return pooled_output

In [9]:
image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16]])

pooled_output = max_pooling2D(image)
print(pooled_output)

[[ 6.  8.]
 [14. 16.]]


- **Output Shape Calculation**: Based on the input size, pool size, and stride, the dimensions of the output are calculated.

- **Max Pooling Operation**: The nested loops slide the pooling window across the image with the given stride. For each position, the maximum value within the window is taken and assigned to the corresponding position in the output.

- **Result**: The function returns the pooled output, which is a reduced-size version of the input, retaining only the most prominent features.

***Significance:*** Pooling layers are essential in CNNs for down-sampling the spatial dimensions, reducing the computational complexity, and making the detection of features more robust to small translations of the input. Max pooling, in particular, keeps the strongest response in each window, thus preserving the most salient information while discarding redundant details.
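For comparison with max pooling, here is a minimal sketch of average pooling that mirrors the structure of `max_pooling2D`. The function name `average_pooling2D` is an illustrative choice; only the reduction applied to each window changes.

```python
import numpy as np

def average_pooling2D(image, pool_size=2, stride=2):
    # Same output-shape calculation as for max pooling
    output_shape = ((image.shape[0] - pool_size) // stride + 1,
                    (image.shape[1] - pool_size) // stride + 1)
    pooled_output = np.zeros(output_shape)

    for i in range(output_shape[0]):
        for j in range(output_shape[1]):
            x_start = i * stride
            y_start = j * stride
            window = image[x_start:x_start + pool_size,
                           y_start:y_start + pool_size]
            # Take the mean of the window instead of the max
            pooled_output[i, j] = np.mean(window)

    return pooled_output

image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16]])

avg_output = average_pooling2D(image)
print(avg_output)
# [[ 3.5  5.5]
#  [11.5 13.5]]
```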

## 5. Activation Function:
Activation functions introduce non-linearity to the model, allowing it to learn complex patterns. Common activation functions include ReLU and Sigmoid.

**Equations:**
- **ReLU:** $ f(x) = \max(0, x) $
- **Sigmoid:** $ f(x) = \frac{1}{1 + e^{-x}} $
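The two equations can be sketched in NumPy as follows, together with the ReLU gradient. The helper `relu_grad` and its value of 0 at exactly $x = 0$ are a common convention assumed here, not something specified in the text above.

```python
import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0, x)

def relu_grad(x):
    # 1 where x > 0, otherwise 0 (the value at exactly 0 is a convention)
    return (np.asarray(x) > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

values = np.array([-2.0, 0.0, 3.0])
print("ReLU:      ", relu(values))       # [0. 0. 3.]
print("ReLU grad: ", relu_grad(values))  # [0. 0. 1.]
print("Sigmoid:   ", sigmoid(values))
```

Unlike the sigmoid, the ReLU gradient does not shrink for large positive inputs, which is one reason ReLU is widely used in hidden layers.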

## 6. Backpropagation:
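Backpropagation is the algorithm used to compute the gradients needed to update the weights in the network. It computes the gradient of the loss function with respect to each weight by applying the chain rule backward from the output layer. An optimizer such as gradient descent then uses these gradients to update the weights.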

**Equation:**
The gradient descent weight update can be represented as:
<h4><center>$ w_i \leftarrow w_i - \alpha \frac{\partial L}{\partial w_i} $</center></h4>

where:
- $w_i$: the $i$-th weight in the network
- $\alpha$: learning rate
- $L$: loss function
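One practical way to validate a chain-rule gradient is to compare it with a finite-difference estimate. The sketch below does this for the single sigmoid neuron with squared-error loss used earlier in this notebook. The function names and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y_true):
    return (sigmoid(np.dot(x, w) + b) - y_true) ** 2

def analytic_grad(w, b, x, y_true):
    # Chain rule: dL/dy * dy/dz, then dz/dw = x and dz/db = 1
    y_pred = sigmoid(np.dot(x, w) + b)
    dL_dz = 2.0 * (y_pred - y_true) * y_pred * (1.0 - y_pred)
    return dL_dz * x, dL_dz

def numeric_grad(w, b, x, y_true, eps=1e-6):
    # Central differences, one parameter at a time
    grad_w = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad_w[i] = (loss(w + step, b, x, y_true)
                     - loss(w - step, b, x, y_true)) / (2 * eps)
    grad_b = (loss(w, b + eps, x, y_true)
              - loss(w, b - eps, x, y_true)) / (2 * eps)
    return grad_w, grad_b

x = np.array([0.5, 0.3, 0.1])
w = np.array([0.4, 0.2, 0.6])
b, y_true = 0.1, 0.4

gw_a, gb_a = analytic_grad(w, b, x, y_true)
gw_n, gb_n = numeric_grad(w, b, x, y_true)
print("Analytic:", gw_a, gb_a)
print("Numeric: ", gw_n, gb_n)
```

If the two estimates disagree by more than a small tolerance, the analytic derivation (or its implementation) likely contains an error.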

## Relevance to Data Science:
These concepts lead to powerful models that can capture complex patterns and relationships in data. The ability to build hierarchical representations through convolutional and pooling layers makes CNNs particularly effective for image analysis. They have revolutionized areas like image recognition, video analysis, and even non-visual tasks like natural language processing. By understanding the underlying mathematics and theories, data scientists can better utilize these tools, design more efficient models, and innovate in various applications.

In data science, understanding the theoretical components enables the practitioner to make informed decisions about model architecture, optimization, and evaluation. It bridges the gap between mathematical abstractions and real-world applications, enhancing both the interpretability and effectiveness of the models.