Goal of the research paper - what specific problem is being analyzed and in what way

Abstract

## A
### A.1
To use the gradient machinery, we need the following cost/loss functions together with their gradients:
- Mean squared error (MSE)
- Log loss (binary cross entropy), with and without $L_1$ and $L_2$ regularization
- The multiclass cross-entropy cost/loss function

These need to be explained in the Methods section.


#### Methods

#### Code

In [None]:
import numpy as np

# Cost/loss functions
def mse(y_true, y_pred):
    # Mean squared error, averaged over all elements
    return np.mean((y_true - y_pred)**2)

def cross_entropy(predict, target, epsilon=1e-15):
    # Multiclass cross entropy, summed over samples and classes.
    # predict holds class probabilities, target is one-hot encoded.
    # see https://medium.com/data-science/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1
    predict = np.clip(predict, epsilon, 1.0)  # avoid log(0)
    return np.sum(-target * np.log(predict))

def log_loss(y_true, y_pred, regularization=None, weights=None, lambda_reg=0.01):
    # Binary log loss with an optional L1 or L2 penalty on the weights.
    # Clip the predictions with epsilon to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    m = len(y_true)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Both penalties are scaled by lambda_reg / (2 * m)
    if regularization == "L1" and weights is not None:
        loss += (lambda_reg / (2 * m)) * np.sum(np.abs(weights))
    elif regularization == "L2" and weights is not None:
        loss += (lambda_reg / (2 * m)) * np.sum(np.square(weights))
    return loss
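
The gradients that go with these cost functions could be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names `mse_grad`, `log_loss_grad`, `softmax` and `softmax_cross_entropy_grad` are assumptions introduced here, the regularization terms are left out, and the softmax gradient assumes one-hot targets and a cross entropy that is summed (not averaged) over samples.

```python
import numpy as np

def mse_grad(y_true, y_pred):
    # Derivative of mean((y_true - y_pred)^2) with respect to y_pred
    return 2.0 * (y_pred - y_true) / y_true.size

def log_loss_grad(y_true, y_pred, eps=1e-15):
    # Derivative of the unregularized binary log loss with respect to y_pred
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return (y_pred - y_true) / (y_pred * (1 - y_pred)) / y_true.size

def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_cross_entropy_grad(z, target):
    # Derivative of sum(-target * log(softmax(z))) with respect to the logits z,
    # valid when every row of target sums to one (one-hot encoding)
    return softmax(z) - target
```

Combining softmax and cross entropy before differentiating gives the compact expression in the last function, which avoids forming the full softmax Jacobian.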

#### Results - comparison

### A.2 - Activation Functions
Set up the expressions and first derivatives of the following activation functions:
- Sigmoid
- ReLU
- Leaky ReLU

#### Methods

#### Code

In [None]:
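# Activation functions and their first derivatives
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def der_sigmoid(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

def ReLu(z):
    return np.where(z > 0, z, 0)

def der_ReLu(X):
    # The derivative at X = 0 is set to 0 by convention
    return np.where(X > 0, 1, 0)

def LeakyReLu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def der_LeakyReLu(X, alpha=0.01):
    # The derivative at X = 0 is set to alpha by convention
    return np.where(X > 0, 1, alpha)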


#### Results - comparison

## B


Use only the mean-squared error as cost function (no regularization terms) and 
write an FFNN code for a regression problem with a flexible number of hidden
layers and nodes using only the Sigmoid function as activation function for
the hidden layers. Initialize the weights using a normal
distribution. How would you initialize the biases? And which
activation function would you select for the final output layer?
And how would you set up your design/feature matrix? Hint: does it have to represent a polynomial approximation as you did in project 1? 
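
One possible answer to these questions is sketched below: weights drawn from a standard normal distribution, biases set to a small constant, a linear (identity) output layer for regression, and the raw input $x$ as a single-column design matrix. All function names (`init_params`, `forward`, `backprop`, `train`), the bias value 0.01, the learning rate and the number of epochs are assumptions made for illustration, not prescribed choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, rng, bias_init=0.01):
    # Weights from a standard normal distribution, biases set to a small constant
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        params.append((rng.normal(size=(n_in, n_out)), np.full(n_out, bias_init)))
    return params

def forward(X, params):
    # Returns the activations of every layer, with the input as the first entry
    activations = [X]
    a = X
    for i, (W, b) in enumerate(params):
        z = a @ W + b
        a = z if i == len(params) - 1 else sigmoid(z)  # linear output layer
        activations.append(a)
    return activations

def backprop(X, y, params):
    # Gradients of the MSE cost with respect to every (W, b) pair
    activations = forward(X, params)
    delta = 2.0 * (activations[-1] - y) / y.size  # dC/dz at the linear output
    grads = []
    for i in reversed(range(len(params))):
        W, _ = params[i]
        a_prev = activations[i]
        grads.append((a_prev.T @ delta, delta.sum(axis=0)))
        if i > 0:
            # Propagate through the sigmoid of the previous layer
            delta = (delta @ W.T) * a_prev * (1.0 - a_prev)
    return grads[::-1]

def train(X, y, layer_sizes, eta=0.01, epochs=1000, seed=0):
    # Plain (full-batch) gradient descent
    rng = np.random.default_rng(seed)
    params = init_params(layer_sizes, rng)
    for _ in range(epochs):
        grads = backprop(X, y, params)
        params = [(W - eta * dW, b - eta * db)
                  for (W, b), (dW, db) in zip(params, grads)]
    return params

# Usage on the one-dimensional Runge function with one hidden layer of 50 nodes
x = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 1.0 / (1.0 + 25.0 * x**2)
init = init_params([1, 50, 1], np.random.default_rng(0))
params = train(x, y, [1, 50, 1], eta=0.01, epochs=1000, seed=0)
mse_before = np.mean((forward(x, init)[-1] - y) ** 2)
mse_after = np.mean((forward(x, params)[-1] - y) ** 2)
print(f"MSE before training: {mse_before:.4f}, after: {mse_after:.4f}")
```

A second hidden layer is obtained by passing, for example, `[1, 50, 50, 1]` as `layer_sizes`. The learning rate has to be tuned together with the layer width, since wide sigmoid layers with standard-normal weights can make plain gradient descent unstable for larger step sizes.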

Train your network and compare the results with those from your OLS
regression code from project 1 using the one-dimensional Runge
function.  When comparing your neural network code with the OLS
results from project 1, use the same data sets which gave you the best
MSE score. Moreover, use the polynomial order from project 1 that gave you the
best result.  Compare these results with your neural network with one
and two hidden layers using $50$ and $100$ hidden nodes, respectively.
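
The two design matrices involved in this comparison could be set up as in the sketch below. The grid, the noise level, the random seed and the polynomial degree are placeholder assumptions; the actual values should be the ones that gave the best MSE in project 1.

```python
import numpy as np

def runge(x):
    # One-dimensional Runge function
    return 1.0 / (1.0 + 25.0 * x**2)

rng = np.random.default_rng(0)                  # assumed seed
x = np.linspace(-1, 1, 200)                     # assumed grid
y = runge(x) + 0.05 * rng.normal(size=x.shape)  # assumed noise level

# OLS benchmark: polynomial design matrix with columns 1, x, ..., x^degree
degree = 10                                     # placeholder for the best order from project 1
X_poly = np.vander(x, degree + 1, increasing=True)
beta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
mse_ols = np.mean((y - X_poly @ beta) ** 2)

# Neural network: the raw input as a single feature column
X_nn = x.reshape(-1, 1)

print(f"OLS training MSE (degree {degree}): {mse_ols:.5f}")
```

For a fair comparison, both models should be evaluated on the same train/test split rather than on the training data alone, as is done here for brevity.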

Comment on your results and give a critical discussion of the results
obtained with the OLS code from project 1 and your own neural network
code.  Make an analysis of the learning rates employed to find the
optimal MSE score. Test both stochastic gradient descent
with RMSprop and ADAM and plain gradient descent with different
learning rates.
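
The three update rules can be compared on a toy problem before they are attached to the network. The sketch below uses a simple quadratic cost instead of the neural network, and the function names, the state dictionary, and the default values of `rho`, `beta1`, `beta2` and `eps` are assumptions (commonly used defaults), not values taken from this project.

```python
import numpy as np

def gd_step(theta, grad, state, eta):
    # Plain gradient descent, no adaptive scaling
    return theta - eta * grad, state

def rmsprop_step(theta, grad, state, eta, rho=0.9, eps=1e-8):
    # Running average of squared gradients scales the step per parameter
    v = rho * state.get("v", np.zeros_like(theta)) + (1 - rho) * grad**2
    return theta - eta * grad / (np.sqrt(v) + eps), {"v": v}

def adam_step(theta, grad, state, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates with bias correction
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", np.zeros_like(theta)) + (1 - beta1) * grad
    v = beta2 * state.get("v", np.zeros_like(theta)) + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), {"t": t, "m": m, "v": v}

def minimize(step, eta, n_iter=500):
    # Toy cost: sum((theta - 3)^2), with gradient 2 * (theta - 3)
    theta, state = np.zeros(2), {}
    for _ in range(n_iter):
        grad = 2.0 * (theta - 3.0)
        theta, state = step(theta, grad, state, eta)
    return theta

for eta in (0.001, 0.01, 0.1):
    for name, step in (("GD", gd_step), ("RMSprop", rmsprop_step), ("ADAM", adam_step)):
        theta = minimize(step, eta)
        print(f"eta={eta:<6} {name:<8} distance to minimum: {np.linalg.norm(theta - 3.0):.2e}")
```

For the stochastic variants, `grad` would be computed on a mini-batch instead of the full data set; the update rules themselves stay the same.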

You should, as you did in project 1, scale your data.

### Methods


### Code
Should I use the same sparse data set as in the earlier report, or change it? Maybe include an example of how the network performs on a larger data set?

### Results

We do not construct polynomial features for the input variables in the neural network; instead, we let the network learn the structure that best predicts the output variables.

## C
Test the code from B against Scikit-learn, comparing training time and results. Possibly also compare against Keras and PyTorch.
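
A Scikit-learn reference run with timing could look like the sketch below. The architecture (one hidden layer with 50 nodes), the solver, the learning rate, the tolerance and the iteration limit are assumptions chosen to resemble the setup in B; `MLPRegressor` uses an identity output activation, and `activation="logistic"` selects the sigmoid for the hidden layers.

```python
import time
import numpy as np
from sklearn.neural_network import MLPRegressor

# One-dimensional Runge function as the test problem
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = (1.0 / (1.0 + 25.0 * x**2)).ravel()

model = MLPRegressor(
    hidden_layer_sizes=(50,),   # assumed architecture
    activation="logistic",      # sigmoid hidden layers
    solver="adam",
    alpha=0.0,                  # no L2 regularization
    learning_rate_init=0.01,    # assumed learning rate
    tol=1e-8,
    max_iter=2000,
    random_state=0,
)

start = time.perf_counter()
model.fit(x, y)
elapsed = time.perf_counter() - start

mse_sklearn = np.mean((model.predict(x) - y) ** 2)
print(f"Scikit-learn MSE: {mse_sklearn:.5f}, training time: {elapsed:.2f} s")
```

The same timing pattern (`time.perf_counter()` around the training call) can be wrapped around the network from B so that both numbers are measured the same way.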

Test that the derivatives are correct using autograd.
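
As a library-free stand-in for the autograd comparison, the analytical derivatives can first be checked against central finite differences. This is not autograd itself; the helper `numerical_derivative`, the step size `h`, and the test points are assumptions made for this sketch. The test points avoid $z = 0$, where ReLU-type functions are not differentiable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def der_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def der_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def numerical_derivative(f, z, h=1e-6):
    # Central finite difference, elementwise
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = np.array([-2.0, -0.5, 0.3, 1.7])
err_sigmoid = np.max(np.abs(der_sigmoid(z) - numerical_derivative(sigmoid, z)))
err_leaky = np.max(np.abs(der_leaky_relu(z) - numerical_derivative(leaky_relu, z)))
print(f"max error sigmoid: {err_sigmoid:.2e}, leaky ReLU: {err_leaky:.2e}")
```

The same comparison applies to the back-propagated gradients: perturb one weight at a time, recompute the cost, and compare the difference quotient with the corresponding entry of the analytical gradient.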

### Methods Code Comparison

## D
Test different activation functions for the hidden layers: Sigmoid, ReLU and Leaky ReLU.

### Results 
Bias variance trade-off analysis?

## E
Testing different hyperparameters $\lambda, \eta, N(Layers), N(Height)$
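
A grid scan over two hyperparameters at a time could be organized as in the sketch below, which produces a 2D array of scores suitable for a heatmap. To stay short and self-contained, `train_and_score` is a hypothetical stand-in that trains a ridge-penalized polynomial model with gradient descent; in the project it would be replaced by a call that trains the neural network and returns its test MSE. The grids for $\eta$ and $\lambda$ are assumptions.

```python
import numpy as np

x = np.linspace(-1, 1, 100)
y = 1.0 / (1.0 + 25.0 * x**2)
X = np.vander(x, 6, increasing=True)  # stand-in model: degree-5 polynomial

def train_and_score(eta, lam, n_iter=500):
    # Hypothetical stand-in for training the network:
    # gradient descent on MSE plus an L2 penalty lam * ||theta||^2
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y) + 2.0 * lam * theta
        theta -= eta * grad
    return np.mean((X @ theta - y) ** 2)

etas = np.logspace(-4, -1, 4)      # assumed learning-rate grid
lambdas = np.logspace(-5, -1, 5)   # assumed regularization grid

# Rows follow etas, columns follow lambdas
scores = np.array([[train_and_score(eta, lam) for lam in lambdas] for eta in etas])

i, j = np.unravel_index(np.argmin(scores), scores.shape)
print(f"best eta={etas[i]:.0e}, lambda={lambdas[j]:.0e}, MSE={scores[i, j]:.5f}")
```

The `scores` array can be passed directly to a heatmap plot, with `etas` and `lambdas` as axis labels. The number of layers and nodes per layer can be scanned with the same nested-loop pattern.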


### Code

### Results

# Classification Section

## F
Change the cost function in B, D and E and perform a classification analysis on the MNIST problem. Evaluate the results using the accuracy score, with a critical analysis of the parameters, activation functions and architecture of the network. Compare the results with similar results from Scikit-learn.
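
The pieces that change when moving from regression to classification are the output activation, the target encoding and the metric. A minimal sketch, with assumed helper names `softmax`, `one_hot` and `accuracy`, and a tiny made-up batch of logits in place of MNIST:

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the maximum avoids overflow in exp
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def one_hot(labels, n_classes):
    # Integer labels to one-hot rows, as needed by the cross-entropy cost
    return np.eye(n_classes)[labels]

def accuracy(probs, labels):
    # Fraction of samples whose most probable class matches the label
    return np.mean(np.argmax(probs, axis=1) == labels)

# Made-up logits for three samples and three classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.5, 0.0]])
labels = np.array([0, 2, 0])

probs = softmax(logits)
targets = one_hot(labels, 3)
print("accuracy:", accuracy(probs, labels))
```

For MNIST the number of classes is 10, so the output layer has 10 nodes and `one_hot(labels, 10)` gives the targets.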

### Code

### Results

## G
Summarize all the algorithms and results to give a critical evaluation of their pros and cons. Which algorithm worked best for the regression case, and which for the classification case?

### Results

# Checklist

## Summary of methods to implement and analyze

**Required Implementation:**
1. Reuse the regression code and results from project 1. These will act as a benchmark for how well suited a neural network is to this regression task.

2. Implement a neural network with

  * A flexible number of layers

  * A flexible number of nodes in each layer

  * A changeable activation function in each layer (Sigmoid, ReLU, LeakyReLU, as well as Linear and Softmax)

  * A changeable cost function, which will be set to MSE for regression and cross-entropy for multiclass classification

  * An optional L1 or L2 norm of the weights and biases in the cost function (only used for computing gradients, not interpretable metrics)

3. Implement the back-propagation algorithm to compute the gradient of your neural network

4. Reuse the implementation of Plain and Stochastic Gradient Descent from Project 1 (and adapt the code to work with your neural network)

  * With no optimization algorithm

  * With RMSprop

  * With ADAM

5. Implement scaling and train-test splitting of your data, preferably using sklearn

6. Implement and compute metrics like the MSE and Accuracy
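
Item 5 of this list could be sketched with Scikit-learn as follows. The test fraction, the random seed and the choice of `StandardScaler` are assumptions; the key point is that the scaler is fitted on the training data only and then applied to both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One-dimensional Runge function as example data
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = 1.0 / (1.0 + 25.0 * x**2)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0  # assumed split and seed
)

# Fit the scaler on the training set only to avoid leaking test information
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
```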

### Required Analysis:

1. Briefly show and argue for the advantages and disadvantages of the methods from Project 1.

2. Explore and show the impact of changing the number of layers, nodes per layer, choice of activation function, and inclusion of L1 and L2 norms. Present only the most interesting results from this exploration. 2D Heatmaps will be good for this: Start with finding a well performing set of hyper-parameters, then change two at a time in a range that shows good and bad performance.

3. Show and argue for the advantages and disadvantages of using a neural network for regression on your data

4. Show and argue for the advantages and disadvantages of using a neural network for classification on your data

5. Show and argue for the advantages and disadvantages of the different gradient methods and learning rates when training the neural network

### Optional (Note that you should include at least two of these in the report):

1. Implement Logistic Regression as a simple classification model (equivalent to a Neural Network with just the output layer)

2. Compute the gradient of the neural network with autograd, to show that it gives the same result as your hand-written backpropagation.

3. Compare your results with results from using a machine-learning library like PyTorch (https://docs.pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html)

4. Use a more complex classification dataset instead, like the fashion MNIST (see <https://www.kaggle.com/datasets/zalando-research/fashionmnist>)

5. Use a more complex regression dataset instead, like the two-dimensional Runge function $f(x,y)=\left[(10x - 5)^2 + (10y - 5)^2 + 1 \right]^{-1}$, or even more complicated two-dimensional functions (see the supplementary material of <https://www.nature.com/articles/s41467-025-61362-4> for an extensive list of two-dimensional functions). 

6. Compute and interpret a confusion matrix of your best classification model (see <https://www.researchgate.net/figure/Confusion-matrix-of-MNIST-and-F-MNIST-embeddings_fig5_349758607>)
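
For the confusion matrix in item 6, a minimal NumPy sketch is given here. The function name, the row/column convention (rows are true classes, columns are predicted classes) and the small made-up label arrays are assumptions for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Entry (i, j) counts samples of true class i that were predicted as class j
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulates correctly for repeated index pairs
    return cm

# Made-up labels for six samples and three classes
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])

cm = confusion_matrix(y_true, y_pred, 3)
print(cm)
print("accuracy from the diagonal:", np.trace(cm) / cm.sum())
```

The diagonal holds the correctly classified samples, so off-diagonal entries show which classes are confused with each other.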

## Background literature

1. The text of Michael Nielsen is highly recommended, see Nielsen's book at <http://neuralnetworksanddeeplearning.com/>. It is an excellent read.

2. Goodfellow, Bengio and Courville, Deep Learning at <https://www.deeplearningbook.org/>. Here we recommend chapters 6, 7 and 8

3. Raschka et al. at <https://sebastianraschka.com/blog/2022/ml-pytorch-book.html>. Here we recommend chapters 11, 12 and 13.