Let's start off by making a simple neuron.
![image.png](attachment:image.png)

**Building a Neuron: The Role of the Sigmoid Function, An Activation Function**

Creating a neuron involves several steps and relies heavily on mathematics. Among these, the sigmoid function plays a crucial role.

The sigmoid function, also known as the logistic function, introduces non-linearity into the neuron. This means it helps the model learn complex relationships between variables, unlike linear models where a change in one variable produces a proportional change in another (e.g., doubling the length of a rectangle of fixed width doubles its area).

**Understanding Non-Linearity:**

In real-world scenarios, relationships between variables are often non-linear. For example, consider a car's speed (variable A) and its fuel consumption (variable B). Doubling the speed (A) wouldn't simply double the fuel consumption (B). This is because air resistance (another variable C affecting B) grows roughly with the square of speed, so the power needed to overcome it grows roughly with the cube of speed. This creates a roughly cubic relationship between speed and the rate at which fuel is burned.

Here's an illustration:

* **Linear Assumption (incorrect):** One might assume that doubling a car's speed would simply double its fuel consumption.
* **Reality (non-linear):** Doubling speed increases air resistance significantly, leading to a much higher (cubic) increase in fuel consumption.

Illustrative data points (mpg is fuel economy, so a lower number means more fuel used per mile):

* At 30 mph, a car might get 24 mpg.
* At 60 mph, instead of the 12 mpg a linear assumption would predict, it might get 30 mpg (non-linear effect).
* At 90 mph, fuel economy might drop to 18 mpg (air resistance becomes more dominant).

The sigmoid function allows neurons to capture these non-linear relationships, making them powerful tools for tackling complex problems.

**Formula:**
$$
\frac{1}{1+e^{-x}}
$$
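
To get a feel for the formula, here is a minimal sketch in plain Python (standard library only, so nothing extra needs installing). The sample inputs are arbitrary values picked for illustration. Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 gives exactly 0.5.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary sample inputs, chosen only to show the S-shaped squashing.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.4f}")
```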

**Weights and Biases:**

We usually initialize the weights and biases with random values. This is because we want to avoid every neuron in a neural network starting with the same weights: if they did, all neurons would likely behave exactly the same way, which is not helpful for learning different aspects of the data.

**So what is a weight and a bias?**
A weight is a learnable value within a neuron that scales a data point's importance up or down.
A bias is a learnable constant added to the weighted sum of the inputs. Its whole purpose is to shift the activation function left or right, which helps in handling inputs that are not centered around zero.
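
As a rough sketch of how weights and a bias combine, the snippet below computes the weighted sum that gets fed into the activation function. The input, weight, and bias values are made up for illustration, and `weighted_sum` is just a helper name used here.

```python
# Hypothetical values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.1, -0.4]
bias = 0.25

def weighted_sum(inputs, weights, bias):
    """Scale each input by its weight, add them up, then shift by the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

z = weighted_sum(inputs, weights, bias)
print(f"weighted sum z = {z:.2f}")

# Raising the bias by 1 shifts the weighted sum by exactly 1,
# no matter what the inputs are.
z_shifted = weighted_sum(inputs, weights, bias + 1.0)
print(f"with bias + 1  = {z_shifted:.2f}")
```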

**Evaluating a Predicted Data Point:**
Evaluating a predicted data point is crucial, as it tells you how far the prediction is from the actual value. We are going to cover two specific ways to evaluate a predicted data point.

**Binary Cross Entropy Loss:**
This is commonly used for binary classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1.
**Formula:**
$$
-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)\right]
$$
where $\hat{y}_i$ is the predicted output and $y_i$ is the actual output.
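
Here is a small worked sketch of the binary cross-entropy formula in plain Python. The labels and predicted probabilities are invented for illustration, and the `eps` clipping is an added safeguard (not part of the formula) so that `log(0)` is never evaluated. Predictions that are confidently right should score a much lower loss than predictions that are confidently wrong.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Average binary cross-entropy over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Made-up labels and predictions.
y_true = [1, 0, 1]
mostly_right = [0.9, 0.1, 0.8]
mostly_wrong = [0.1, 0.9, 0.2]

loss_right = binary_cross_entropy(y_true, mostly_right)
loss_wrong = binary_cross_entropy(y_true, mostly_wrong)
print(f"loss when mostly right: {loss_right:.4f}")
print(f"loss when mostly wrong: {loss_wrong:.4f}")
```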

**Mean Squared Error:**
This one is used in regression tasks. It measures the average squared difference between the predicted output and the actual output.
**Formula:**
$$
\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2
$$
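
A quick worked sketch of the mean squared error in plain Python, using made-up target and predicted values. The per-sample errors are 1, 0, and 2, so the squared errors are 1, 0, and 4, and their average is 5/3.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between prediction and target."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values for illustration.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
print(f"MSE = {mse:.4f}")
```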

**Calculating Gradient:**
Oh boy, this is really hard to explain but I'm gonna try my best. Basically, the gradient is a vector of partial derivatives of the loss function with respect to each parameter (weight and bias) in the network. It indicates the direction and magnitude of the steepest increase in the loss function. We use its negative to update parameters, moving towards minimizing the loss.
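
To make "use its negative to update parameters" concrete, here is a toy sketch with a single parameter. The loss `(w - 3)^2`, the starting point, the learning rate, and the step count are all assumptions picked for illustration; they are not the neuron's real loss. Each step moves `w` a little in the direction opposite to the gradient, and `w` drifts towards the minimum at 3.

```python
def loss(w: float) -> float:
    """Toy loss with its minimum at w = 3 (an assumption for illustration)."""
    return (w - 3.0) ** 2

def gradient(w: float) -> float:
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # arbitrary step size

for step in range(50):
    # Step against the gradient, since the gradient points uphill.
    w -= learning_rate * gradient(w)

print(f"w after 50 steps = {w:.5f}, loss = {loss(w):.2e}")
```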

**Steps:**
1. Calculate the Error. The simplest choice is the difference between the predicted point and the actual point, which is the derivative of the half squared error $\frac{1}{2}(\hat{y} - y)^2$ with respect to the predicted output:
   $$
   \hat{y} - y
   $$
   For binary cross-entropy loss, the derivative with respect to the predicted output is instead $\frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$.

2. Calculate the Partial Derivatives:
   Partial derivatives are kinda simple: you take the derivative with respect to one variable and treat every other variable as a constant.
   To compute the derivative of the predicted output with respect to the weighted sum:
   $$
   \hat{y}(1 - \hat{y})
   $$

   To compute the derivative of the weighted sum with respect to each weight (for the bias, this derivative is simply 1):
   $$
   x_i
   $$

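3. Use the Chain Rule to multiply the pieces together, which gives the gradient of the loss with respect to each weight:
   $$
   \frac{\partial L}{\partial w_i} = x_i \cdot \hat{y}(1 - \hat{y}) \cdot (\hat{y} - y)
   $$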
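
One way to sanity-check the chain rule result is to compare it against a numerical estimate. The sketch below does that in plain Python for a single sample. It assumes a half squared error loss, so that the derivative of the loss with respect to the prediction is the prediction minus the target; the weights, bias, input, and target are made-up values, and the function names are just helpers for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def loss(weights, bias, inputs, y):
    # Assumption: half squared error, so dL/dy_hat = y_hat - y.
    return 0.5 * (predict(weights, bias, inputs) - y) ** 2

def analytic_gradient(weights, bias, inputs, y):
    """Chain rule: x_i * y_hat * (1 - y_hat) * (y_hat - y) for each weight."""
    y_hat = predict(weights, bias, inputs)
    return [x * y_hat * (1.0 - y_hat) * (y_hat - y) for x in inputs]

def numerical_gradient(weights, bias, inputs, y, h=1e-6):
    """Central finite differences: nudge one weight at a time."""
    grads = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += h
        down[i] -= h
        diff = loss(up, bias, inputs, y) - loss(down, bias, inputs, y)
        grads.append(diff / (2.0 * h))
    return grads

# Made-up parameters and a single made-up sample.
weights, bias = [0.4, -0.7], 0.1
inputs, y = [1.5, 2.0], 1.0

analytic = analytic_gradient(weights, bias, inputs, y)
numeric = numerical_gradient(weights, bias, inputs, y)
for a, n in zip(analytic, numeric):
    print(f"analytic = {a:.8f}   numeric = {n:.8f}")
```

If the two columns agree to several decimal places, the hand-derived gradient is very likely correct for this loss.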


In [None]:
import numpy as np

def sigmoid(x):
    # Logistic function: squashes any real input into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Derivative of the sigmoid, written in terms of the sigmoid itself.
    sig = sigmoid(x)
    return sig * (1 - sig)

def initialize_parameters(input_size):
    # Random starting values in [0, 1): one weight per input feature, one bias.
    weights = np.random.rand(input_size)
    bias = np.random.rand(1)
    return weights, bias

def forward_propagation(inputs, weights, bias):
    # Weighted sum z = inputs . weights + bias, followed by the sigmoid.
    weighted_sum = np.dot(inputs, weights) + bias
    output = sigmoid(weighted_sum)
    return output, weighted_sum

def binary_cross_entropy_loss(predicted_output, actual_output):
    # Clip predictions away from 0 and 1 so np.log never receives 0.
    epsilon = 1e-15
    predicted_output = np.clip(predicted_output, epsilon, 1 - epsilon)
    loss = -(actual_output * np.log(predicted_output)
             + (1 - actual_output) * np.log(1 - predicted_output))
    return np.mean(loss)

def calculate_gradients(inputs, predicted_output, actual_output, weighted_sum):
    # Gradients for a single sample, following the chain rule steps above.
    # Note: dL/dy = y_hat - y is the derivative of the half squared error
    # 0.5 * (y_hat - y)**2. With binary cross-entropy, the sigmoid derivative
    # cancels and dL/dz reduces to (y_hat - y).
    d_loss_d_predicted = predicted_output - actual_output          # dL/dy
    d_predicted_d_weighted_sum = sigmoid_derivative(weighted_sum)  # dy/dz
    d_weighted_sum_d_weights = inputs                              # dz/dw
    d_weighted_sum_d_bias = 1                                      # dz/db

    gradients = d_loss_d_predicted * d_predicted_d_weighted_sum * d_weighted_sum_d_weights
    bias_gradient = d_loss_d_predicted * d_predicted_d_weighted_sum * d_weighted_sum_d_bias

    return gradients, bias_gradient

