Regression
 - Use when you want to predict scalar target values from the inputs, as opposed to predicting the class that an object belongs to
 - Regression examples: predicting tomorrow's temperature, or price of a car
 - To implement in network, we will have to change our output layer and loss calculation (error calc)

Linear Activation:
- this will be used for the output layer
- it is just y = x, or outputs = inputs
- derivative wrt the input is 1, so the backward pass just passes the gradients from the loss through unchanged
- this code really does nothing; we include it for completeness and to have an explicit output layer for better clarity in our code. It should add a minimal amount to training time, if any.

In [None]:
class Activation_Linear:
    # Forward pass
    def forward(self, inputs):
        # Just remember values
        self.inputs = inputs
        self.output = inputs
    # Backward pass
    def backward(self, dvalues):
        # derivative is 1, 1 * dvalues = dvalues - the chain rule
        self.dinputs = dvalues.copy()
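
As a quick check of the pass-through behaviour, the sketch below repeats the class in compact form and feeds it a small array. The input values and upstream gradients are made up for illustration.

```python
import numpy as np

class Activation_Linear:
    def forward(self, inputs):
        self.inputs = inputs
        self.output = inputs

    def backward(self, dvalues):
        self.dinputs = dvalues.copy()

activation = Activation_Linear()

x = np.array([[-1.5, 0.0, 2.0]])          # made-up outputs of the last dense layer
activation.forward(x)

upstream = np.array([[0.1, -0.2, 0.3]])   # made-up gradients coming from the loss
activation.backward(upstream)

# Both passes should leave the values untouched
outputs_unchanged = np.array_equal(activation.output, x)
gradients_unchanged = np.array_equal(activation.dinputs, upstream)
print(outputs_unchanged, gradients_unchanged)
```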

Mean Squared Error Loss:
- the average of the squared difference between the outputs and their respective ground truth values for a particular sample
- Li = 1/J * sum((yij - y_hatij)^2); where Li = loss for sample i, J is the number of outputs, yij is ground truth, y_hatij is prediction
- the further we get from the ground truth, the more harshly the network is penalized because the loss grows quadratically
- Derivative (use chain rule): dLi/dy_hatij = d/dy_hatij ( 1/J * sum((yij - y_hatij)^2) ) = 1/J * d/dy_hatij sum((yij - y_hatij)^2) = 1/J * 2 (yij - y_hatij) * d/dy_hatij (yij - y_hatij) = 1/J * 2 (yij - y_hatij) * (0 - 1) = -2/J * (yij - y_hatij)
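
To sanity-check the derivative above, here is a small sketch for a single sample that compares the analytic gradient -2/J * (y - y_hat) against a central finite-difference estimate. The numbers are made up, and `eps` is an arbitrary small step.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])       # made-up ground truth for one sample
y_hat = np.array([1.5, 2.0, 1.0])   # made-up predictions for the same sample
J = len(y)

# Per-sample MSE loss
loss = np.mean((y - y_hat) ** 2)

# Analytic gradient from the derivation: -2/J * (y - y_hat)
analytic = -2.0 / J * (y - y_hat)

# Central finite differences, one output at a time
eps = 1e-6
numeric = np.zeros(J)
for j in range(J):
    step = np.zeros(J)
    step[j] = eps
    loss_plus = np.mean((y - (y_hat + step)) ** 2)
    loss_minus = np.mean((y - (y_hat - step)) ** 2)
    numeric[j] = (loss_plus - loss_minus) / (2 * eps)

print(loss)
print(analytic)
print(numeric)
```

The two gradient vectors should agree to within floating-point error.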

Mean Squared Error Loss code
- forward pass: remember, axis=-1 takes the average over each row (aka the outputs for a sample)
- backward pass: still normalizing the gradient by dividing by the number of samples, so its magnitude does not depend on the batch size

In [None]:
# Mean Squared Error loss
class Loss_MeanSquaredError(Loss): # L2 loss
    # Forward pass
    def forward(self, y_pred, y_true):
        
        # Calculate loss
        sample_losses = np.mean((y_true - y_pred)**2, axis=-1)
        # Return losses
        return sample_losses
    
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        
        # Number of outputs in every sample
        # We'll use the first sample to count them
        outputs = len(dvalues[0])
        
        # Gradient on values
        self.dinputs = -2 * (y_true - dvalues) / outputs
        
        # Normalize gradient
        self.dinputs = self.dinputs / samples
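
The sketch below mirrors the forward and backward logic as two standalone functions (the names `mse_forward` and `mse_backward` are hypothetical, and the `Loss` base class is not needed here). It assumes the overall loss is the mean of the per-sample losses, and checks that the normalized gradient matches a finite-difference estimate of that batch mean. The data are made up.

```python
import numpy as np

def mse_forward(y_pred, y_true):
    # One loss value per sample (row)
    return np.mean((y_true - y_pred) ** 2, axis=-1)

def mse_backward(y_pred, y_true):
    samples, outputs = y_pred.shape
    # Per-output gradient, then normalized by the number of samples
    return -2 * (y_true - y_pred) / outputs / samples

y_true = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # made-up targets
y_pred = np.array([[1.5, 1.0], [3.0, 5.0], [4.0, 6.5]])   # made-up predictions

sample_losses = mse_forward(y_pred, y_true)
grad = mse_backward(y_pred, y_true)

# Finite-difference gradient of the batch-mean loss
eps = 1e-6
numeric = np.zeros_like(y_pred)
for idx in np.ndindex(*y_pred.shape):
    up = y_pred.copy()
    down = y_pred.copy()
    up[idx] += eps
    down[idx] -= eps
    numeric[idx] = (mse_forward(up, y_true).mean()
                    - mse_forward(down, y_true).mean()) / (2 * eps)

print(sample_losses)
print(np.allclose(grad, numeric, atol=1e-6))
```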

Mean Absolute Error Loss
- average of the absolute value of the difference between ground truth and predicted values
- Li = 1/J * sum(abs(yij - y_hatij)); same definitions as above
- penalizes error linearly: the loss grows 1:1 with the distance from the ground truth, which makes it more robust to outliers than MSE. L1 is also often said to produce sparser results. Sparsity means weights/biases that are exactly 0, which can be beneficial for interpretability, feature selection, and computational efficiency, but can also make the model less stable and passes up on incorporating that information. Note that this sparsity is a property of an L1 penalty on the weights (regularization) rather than of the MAE loss on predictions. Another commonly cited downside is sensitivity to the scaling of the training/testing data
- MAE (L1) used less frequently than MSE (L2)
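- Derivative (use chain rule): dLi/dy_hatij = d/dy_hatij ( 1/J * sum(abs(yij - y_hatij)) ) = 1/J * d/dy_hatij abs(yij - y_hatij) = 1/J * sign(yij - y_hatij) * d/dy_hatij (yij - y_hatij) = 1/J * sign(yij - y_hatij) * (0 - 1) = -1/J * sign(yij - y_hatij); the abs derivative is 1 when its term is > 0 and -1 when it is < 0, and the inner term contributes the same (0 - 1) factor as in MSE. So dLi/dy_hatij = 1/J * {-1; yij - y_hatij > 0 or 1; yij - y_hatij < 0}. Remember that the abs derivative is undefined at 0, so we just pick a value there for ease (np.sign() returns 0)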
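
A small numerical check helps with the sign here. Differentiating abs(y - y_hat) with respect to the prediction y_hat gives -sign(y - y_hat), so the gradient is positive when the prediction is too high and negative when it is too low. The sketch below compares that expression with a central finite-difference estimate on made-up values, chosen so that no error is exactly 0.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])       # made-up ground truth for one sample
y_hat = np.array([1.5, 1.0, 3.5])   # made-up predictions, none equal to y
J = len(y)

# Per-sample MAE loss
loss = np.mean(np.abs(y - y_hat))

# Analytic gradient with respect to the predictions
analytic = -np.sign(y - y_hat) / J

# Central finite differences, one output at a time
eps = 1e-6
numeric = np.zeros(J)
for j in range(J):
    step = np.zeros(J)
    step[j] = eps
    loss_plus = np.mean(np.abs(y - (y_hat + step)))
    loss_minus = np.mean(np.abs(y - (y_hat - step)))
    numeric[j] = (loss_plus - loss_minus) / (2 * eps)

print(analytic)
print(numeric)
```

The first prediction (1.5) overshoots its target (1.0), so raising it further increases the loss and its gradient entry is positive.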

Mean Absolute Error Loss code
- Forward Pass: same -1 axis concept in the forward pass mean
- Backward pass: use np.sign() to get the derivative of the abs

In [None]:
# Mean Absolute Error loss
class Loss_MeanAbsoluteError(Loss): # L1 loss
    # Forward pass
    def forward(self, y_pred, y_true):
        # Calculate loss
        sample_losses = np.mean(np.abs(y_true - y_pred), axis=-1)
        # Return losses
        return sample_losses
    
    # Backward pass
    def backward(self, dvalues, y_true):
        # Number of samples
        samples = len(dvalues)
        # Number of outputs in every sample
        # We'll use the first sample to count them
        outputs = len(dvalues[0])
        # Calculate gradient: d|y_true - y_pred| / dy_pred = -sign(y_true - y_pred)
        self.dinputs = -np.sign(y_true - dvalues) / outputs
        # Normalize gradient
        self.dinputs = self.dinputs / samples

Comparison between L1 (MAE) and L2 (MSE) (credit ChatGPT):
1. **Robustness to Outliers**:
   - L1 Loss: More robust to outliers because it doesn't square the errors. Outliers have a linear impact on the loss.
   - L2 Loss: Sensitive to outliers due to the squaring of errors. Outliers have a quadratic impact on the loss, making it less robust.

2. **Sparsity**:
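   - L1 Loss: Often associated with sparsity, but that property belongs to an L1 penalty on the weights (regularization), where many coefficients can become exactly zero, effectively performing feature selection. Used as a loss on prediction errors, MAE does not by itself zero out weights.
   - L2 Loss: Does not inherently encourage sparsity, leading to solutions where most coefficients are non-zero.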

3. **Computational Efficiency**:
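   - L1 Loss: Sparse solutions (as produced by an L1 penalty on the weights) can be computationally efficient because they involve fewer non-zero parameters. The MAE loss itself costs about the same to compute as MSE.
   - L2 Loss: Non-sparse solutions leave more non-zero parameters to store and evaluate.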

4. **Solution Stability**:
   - L1 Loss: Can lead to less stable solutions, especially when features are highly correlated, as it may arbitrarily select one feature over another.
   - L2 Loss: Generally produces more stable solutions, particularly when features are correlated.

5. **Impact of Scaling**:
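   - L1 Loss: Sensitive to scaling because the gradient has the same magnitude regardless of the size of the error, so the update does not shrink as predictions approach the targets.
   - L2 Loss: The gradient shrinks along with the error, which makes it less sensitive in this respect, but it can still be affected by extreme values because large errors produce large gradients.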

6. **Optimization Challenges**:
   - L1 Loss: Not differentiable where the error is exactly zero, which can make optimization more challenging.
   - L2 Loss: Smooth and differentiable, making optimization relatively easier compared to L1 loss.

7. **Performance on Small Datasets**:
   - L1 Loss: May not perform well on small datasets where sparsity could be too aggressive, leading to underfitting.
   - L2 Loss: Generally performs well on small datasets, providing more stable estimates.
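
The robustness point (item 1) can be illustrated with a deliberately simple setup: find the single constant prediction that minimizes each loss over a small made-up dataset containing one outlier. This is only a sketch using a brute-force grid search, not a trained network. The constant that minimizes MSE is the mean, which the outlier drags upward, while the constant that minimizes MAE is the median, which stays with the bulk of the data.

```python
import numpy as np

# Made-up targets with one obvious outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions on a grid with step 0.01
candidates = np.linspace(0.0, 100.0, 10001)

# errors[i, j] = y[j] - candidates[i]
errors = y[np.newaxis, :] - candidates[:, np.newaxis]

mse_per_candidate = np.mean(errors ** 2, axis=-1)
mae_per_candidate = np.mean(np.abs(errors), axis=-1)

best_for_mse = candidates[np.argmin(mse_per_candidate)]
best_for_mae = candidates[np.argmin(mae_per_candidate)]

print("best constant under MSE:", best_for_mse, "mean:", np.mean(y))
print("best constant under MAE:", best_for_mae, "median:", np.median(y))
```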

Accuracy Calculation:
- need a new way to calculate accuracy because we cannot just check for exact equality anymore. Exact comparison would make the model appear very inaccurate, when in reality it is ok for a prediction to just be close enough. Ex: ground truth = 100, predicted value = 99.999
- to do this, we calculate a cushion value and consider a prediction correct if it falls within the cushion of the ground truth.
- one way: take the standard deviation of the ground truth values in the dataset, and divide by some scalar. The larger the scalar, the smaller/stricter our cushion is.
- then, variations from ground truth that are less than the cushion allowance are considered accurate/correct.
- Code example below: y = ground truth of the dataset; 250 is our scalar for the cushion size. Then we take the absolute value of the difference between the predicted values and the ground truth and do a boolean check that it is within our cushion. If it is, the value is 1; if not, it is 0. Then we average the 1s and 0s for our accuracy metric.

In [None]:
accuracy_precision = np.std(y) / 250  # 250 is the scalar

predictions = activation2.output  # network outputs
accuracy = np.mean(np.absolute(predictions - y) < accuracy_precision)
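
For a self-contained version of the same calculation, the sketch below swaps the network output for a small made-up `predictions` array so the numbers can be followed by hand. With these values the cushion is about 0.0447, and three of the four predictions fall inside it.

```python
import numpy as np

# Made-up ground truth and predictions, one output per sample
y = np.array([[10.0], [20.0], [30.0], [40.0]])
predictions = np.array([[10.01], [20.5], [29.99], [40.0]])

# Cushion: standard deviation of the targets divided by the scalar
accuracy_precision = np.std(y) / 250

# Boolean check per prediction, then average the 1s and 0s
within_cushion = np.absolute(predictions - y) < accuracy_precision
accuracy = np.mean(within_cushion)

print(accuracy_precision)
print(within_cushion.ravel())
print(accuracy)
```

Only the second prediction (off by 0.5) misses the cushion, so the accuracy comes out to 0.75.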
