# Basic of Neural Networks

Neural networks rose to fame in the late 1980s, thanks in part to advancements like the backpropagation algorithm, which allowed for more effective training of multi-layer networks. However, due to challenges such as inefficient training methods and limited computational power, their practical applications were restricted, leading to a decline in widespread interest.

 Despite this, neural networks did not vanish entirely as a field of study and experienced a resurgence after 2010 under the name *deep learning*. This revival was driven by new architectures, larger datasets, and increased computational capabilities, enabling breakthroughs in areas like image and video classification, speech recognition, and text modeling. Many attribute these successes to the availability of vast training datasets made possible by the digitalization of data in science and industry.

## Neural Network Layers
Layers are fundamental components that enable infor,ation processing. Each of them is defined:

### **1. Input Layer**
The input layeer is the first layer of a neural network and is responsible for receiving the initial data to be processed.
- **Function**: Provides the input values of the model, like features or variables of a problem.
- **Neuron Numbers**: The number of neurons in this layer corresponds to the number of features (or attributes) of the dataset.

### **2. Hidden Layers**
The hidden layers are the intermediate between the input and the output layer. This layers is where the complex calculations occurs to learn the patterns in the data.
- **Function**: Transform inputs in more useful representations for the output layer to produce more precise predictions
- **Quantity of layers and neurons**: Could be one or many hidden layers, and the numbers of neurons in each layer depends of the complexity of the problem. A common *architecture* can include two hidden layers with 12 neurons each one.
- **Learning**: Hidden layers  uses algorithms like **backpropagation** to adjust the weights and minimize the error within the training.

### **3. Output Layers** 
The **output layer** is the last layer of neural networjs and is responsible of generate the final predictions.
- **Function**: Generates the final result based in the transformations realized by the hidden layers.
- **Number of neurons**: The number of neurons in this layer depends of the problem:
    - For **binary classification**, there are usally one neuron that produces the probability (i.e., sigmoid function)
    - For **multi-class classification**, the number of neurons corresponds to the number of classes (softmax funcion)
    - For **regression**, there are usually only one neuron that predicts the numeric value



In [48]:
import random
import time
# only one 3rd party library
import numpy as np

In [49]:
class Network(object):
    def __init__(self, sizes):
        """
        Sizes contains the number of neurons in the respective
        layers of networks. For example, if the list was [2,3,1]
        would be a three layer network with 2, 3, 1 neurons respectively
        Biases and weights are initializated randomly, using Gaussian 
        distribution with mean 0 and variance 1. 

        NOTE the first layer is assumed to be an input layer and by convention
        we won't set any biases for those neurons, since biases are only ever 
        used in computing the outputs from later layers.
        """
        self.num_layers = (len(sizes))
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes [1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
                        # tuples with number ofneurons in actual layer,
                        # neurons in next layer

# Single Layer Neural Networks
A neural network takes an input of $p$ variables $ X = (X_1,\ X_2,\dots\ X_p)$ and builds a nonlinear function $f(X)$ to predict the response $Y$. 

Before I've seen methods like trees, boosting and geralized addtive methods. What distinguishes neural networks from these methods is the particular *structure* of the model.

## Fitting a Neural Network
For a Single Layer Neural Network, the parameters are $ \beta = (\beta_0, \beta_1,\dots, \beta_k)$ as well as each of $w_k = w_{k0}, w_{k1}, \dots, w_{k0})$, $k=1,\dots, K$. Given observations $(x_i, y_i),\; i=1,\dots,n$ we, could fit the model by solving a nonlinear least squares problem

$$
minimize: \frac{1}{2}\sum_{i=1}^{n}(y_i-f(x_i)^2), \tag{0}
$$
where
$$
f(x_i) = \beta_0 + \sum_{k=1}^{K}\beta_{k}g \left (w_{k0}+ \sum_{j=1}^{p}w_{kj}x_{ij}\right) \tag{1}
$$
The objective looks simple, but because of the nested arrangement of the parameters and the symmetry of the hidden units, it is not straightforwar to minimize. The problem is nonconvex in the parameters, and hence there are *multiple solutions*.

To overcome some of these issues and to protect from overfitting, two general strtegies are employed when fitting a neurla networks:
- *Slow Learning*: The model is fit in a somewhat slow iterative fashion, using *gradient descent*. The fitting process is then stopped when overfitting is detected.
- *Regularization*: Penalties are imposed on the parameters, usually lasso or ridge.

#### Suppose
We represent all the parameters in one long vector $\theta$. Then we can rewrite the objective in (0) as:
$$
R(\theta) = \frac{1}{2}\sum_{i=1}^{n}(y_i-f_{\theta}(x_i))^2, \tag{2}
$$ 
where we make explicit the dependence of $f$ on the parameters. The idea of gradient descent is very simple.

1. Start wit a guess $\theta^0$ for all the parameters in $\theta$, and set t=0

2. Iterate until the objective fails to decrase: \
    (a) Find a vector $\delta$ that reflects a small change in $\theta$, such that $\theta^{t+1}=\theta^{t}+\delta$
    (b) Set $t\leftarrow t+1$

One can visualize (Figure 10.17) standing in a mountainous terrain, and
the goal is to get to the bottom through a series of steps. As long as each
step goes downhill, we must eventually get to the bottom. In this case we
were lucky, because with our starting guess θ0 we end up at the global
minimum. In general we can hope to end up at a (good) local minimum.

#### Backpropagation
How do we find the directions to move $\theta$ so as to decrase the objective $R(\theta)$ in 2? The gradient of $R(\theta)$, evaluates at some current value $\theta=\theta^m$, is the vector of partial derivatives at that point:
$$
\nabla R(\theta^m)= \frac{\partial R(\theta)}{\partial\theta}\bigg|_{\theta=\theta^m} \tag{3}
$$

The subscript $\theta=\theta^m$ means that after computing the vector of derivatives we evaluate it at current guess $\theta^m$. This gives the direction in $\theta$-space in which $R(θ)$ increases most rapidly. The idea of gradient descent is to move $θ$ a little in the opposite direction (since we wish to go downhill):
$$
\theta^{m+1}\leftarrow \theta^m-\rho\nabla R(\theta^m) \tag{4}
$$

At small learning rate $\phi$, this step will decrease the objective $R(\theta)$; i.e.  $R(\theta^{m+1}) \leq R(\theta^m),$ if the gradient vector is zero, then we may have arrived at a minimum of the objective.

Usually in many networks, the calculation is simple because of the *chain rule* of differentation.
$$
\frac{dz}{dx}= \frac{dz}{dy} \frac{dy}{dx} 
$$

Since $R(\theta) = \sum_{i=1}^{n} R_i(\theta) = \frac{1}{2}\sum_{i=1}^{n}(y_i-f_{\theta}(x_i)^{2})$ is a sum, its gradient is also a sum over the *n* observations, so we will just examine one of these terms,
$$
R_i({\theta}) = \frac{1}{2}\left(y_i - \beta_0-\sum_{k=1}^{K}\beta_k g(w_{k0} + \sum_{j=1}^{p}w_{kj}x_{ij})\right)^2 \tag{5}

To simplify the expressions to follow, we write $z_{ik} = w_{k0} + \sum_{j=1}^{p}w_{kj}x_{ij}$ \
First we take the derivative with respect to $\beta_k$:
$$
\frac{\partial R_i(\theta)}{\partial \beta_k} = \frac{\partial R_i(\theta)}{\partial f_{\theta}(x_i)} \cdot \frac{\partial f_{\theta}(x_i) }{\partial \beta_k}  \\[5mm]
\boxed{= -(y_i - f_{\theta}(x_i)) \cdot g(z_{ik})} \tag{6}
$$
And now we take the derivative with respect to $w_{kj}$:
$$
\frac{\partial R_i(\theta)}{\partial w_{kj}} = \frac{\partial R_i(\theta)}{\partial f_{\theta}(x_i)} \cdot \frac{\partial f_{\theta}(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{w_{kj}}  \\[5mm] 
\boxed{= -(y_i- f_{\theta}(x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij}} \tag{7}
$$

Notice that both these expressions contain the residual $y_i - f_{theta}(x_i)$. In (6) we see that a fraction of that residual gets attributed to each of the hidden units according to the value of $g_{ik}$. Then in (7) we see a similar attribution to input $j$ via hidden unit $k$.
So the act of diferentiation assigns a fraction of the residual to each of the parameters via the chain rule —  a process known as *backpropagation* in the neural network literature. Although these calculations are straightforward, it takes careful bookkeeping to keep track of all the pieces

In [None]:
class Network(object):    
    def backprop(self, x, y):
        """
        Return a tuple ``(nabla_b, nabla_w)`` representing the gradient 
        for the cost function C_x ``nabla_b`` and ``nabla_w`` are layer by layer
        lists of numpy arrays, similar to ``self.biases`` and ``self.weights``

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] #list to store all the activations, layer by layer
        zs = [] #list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())

        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)


### **Regularization and Stochastic Gradient Descent**
Gradient descent takes takes many steps to reach a local minimum. There are a number of approaches for accelerating the process. Also, when *n* is large, instead of summing (6)-(7) over all *n* observations, we can sample a small fraction widely known as **mini-batch** of them each time we compute a gradient step. 

This process is known as **stochastic gradient descent** (SGD) and is `the state of the art for learning deep neural networks. Fortunately, there is very good software for setting up deep learning models, and for fitting them to data.



Regularization is essential here to avoid overfitting. Taking for example *ridge regularization* for the  MNSIT dataset which takes 9 inputs (numbers) $(x_i)$. 
$$ 
R(\theta; \lambda) = -\sum_{i=1}^{n} \sum_{m=0}^{9}y_{im}log(f_m(x_i)) + \lambda \sum_{j}\theta_j^2

The parameter $\lambda$ is often preset at a small value or else it is found using validation-set approach. We can also use different values of $\lambda$ for the groups of weights from different layers; in this case *$W_1$* and *$W_2$* were penalized. We need two things:

a. term that penalizes large weights and is controlled by the hyperparameter $\lambda$.
$$
C = C_0+\frac{\lambda}{2n}\sum_{w}w^2
$$
b. a weight update rule
$$
w \leftarrow w-\eta \left(\frac{\partial C_0}{\partial w} + \frac{\lambda}{n}w \right)
$$


Which could be implemented as:

```Python
weights = [(1 - eta * (lmbda / n)) * w - (eta / len(mini_batch)) * nw
                for w, nw in zip(weights, nabla_w)]
```

In [None]:
class Network(object):    
    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        """
        Train the neural network using mini-batch stochastic gradient descent.
        ``training_data`` is a list of tuples ``(x, y)`` representing the training
        inputs and the desired outputs. The non-optional parameters are self 
        explanatory. If ``test_data`` is provided then the network will be evaluated
        against the test data after each epoch, and partial progress printed out.
        """
        if test_data:
            n_test = len(test_data)  
        n = len(training_data)  # stores the total number of training examples
        for j in range(epochs): # training loop
            time1 = time.time()
            random.shuffle(training_data)
            mini_batches = [    # mini batch creation
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta, lmbda, n)
            time2 = time.time()
            if test_data:
                print("Epoch {0}: {1} / {2}, took {3:.2f} seconds".format(
                    j, self.evaluate(test_data), n_test, time2-time1))
            else:
                print("Epoch {0} complete in {1:.2f} seconds".format(j, time2-time1))

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """
        Update the network's weights and biases by applying grad des using
        backpropagation to a single mini batch,
        Mini-batch is just a list of tuples ``(x, y)`` and ``eta`` the 
        learning rate
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]  # 1. initialize
        nabla_w = [np.zeros(w.shape) for w in self.weights] #    gradients accumulators
        for x,y in mini_batch:  # 2. process each example (iterate over mini-batch)
            delta_nabla_b, delta_nabla_w = self.backprop(x, y) # 3. update weights and biases
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)] # accumulate gradients
            nabla_w = [nb + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # update weights with L2 regularization
        self.weights = [(1 - eta * (lmbda / n)) *w -(eta/len(mini_batch)) # 4. update weights and biases
                        for w, nw in zip(self.weights, nabla_w)]
        # obviously bias doesnt need regularization
        self.biases = [b-(eta/len(mini_batch))*nb
                    for b, nb in zip(self.biases, nabla_b)]


In [102]:
class Network(object):
    def __init__(self, sizes):
        """
        Sizes contains the number of neurons in the respective
        layers of networks. For example, if the list was [2,3,1]
        would be a three layer network with 2, 3, 1 neurons respectively
        Biases and weights are initializated randomly, using Gaussian 
        distribution with mean 0 and variance 1. 

        NOTE the first layer is assumed to be an input layer and by convention
        we won't set any biases for those neurons, since biases are only ever 
        used in computing the outputs from later layers.
        """
        self.num_layers = (len(sizes))
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes [1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
                        # tuples with number ofneurons in actual layer,
                        # neurons in next layer

    def backprop(self, x, y):
        """
        Return a tuple ``(nabla_b, nabla_w)`` representing the gradient 
        for the cost function C_x ``nabla_b`` and ``nabla_w`` are layer by layer
        lists of numpy arrays, similar to ``self.biases`` and ``self.weights``

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] #list to store all the activations, layer by layer
        zs = [] #list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())

        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)
    
    def feedforward(self, a):
        """
        Return the output of the network
        """
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a # activaction of the neurons in the current layer of the NN

    def SGD(self, training_data, epochs, mini_batch_size, eta, lmbda, test_data=None):
        """
        Train the neural network using mini-batch stochastic gradient descent.
        ``training_data`` is a list of tuples ``(x, y)`` representing the training
        inputs and the desired outputs. The non-optional parameters are self 
        explanatory. If ``test_data`` is provided then the network will be evaluated
        against the test data after each epoch, and partial progress printed out.
        """
        if test_data:
            n_test = len(test_data)  
        n = len(training_data)  # stores the total number of training examples
        for j in range(epochs): # training loop
            time1 = time.time()
            random.shuffle(training_data)
            mini_batches = [    # mini batch creation
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta, lmbda, n)
            time2 = time.time()
            if test_data:
                print("Epoch {0}: {1} / {2}, took {3:.2f} seconds".format(
                    j, self.evaluate(test_data), n_test, time2-time1))
            else:
                print("Epoch {0} complete in {1:.2f} seconds".format(j, time2-time1))

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """
        Update the network's weights and biases by applying grad des using
        backpropagation to a single mini batch,
        Mini-batch is just a list of tuples ``(x, y)`` and ``eta`` the 
        learning rate
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]  # 1. initialize
        nabla_w = [np.zeros(w.shape) for w in self.weights] #    gradients accumulators
        for x,y in mini_batch:  # 2. process each example (iterate over mini-batch)
            delta_nabla_b, delta_nabla_w = self.backprop(x, y) # 3. update weights and biases
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)] # accumulate gradients
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # update weights with L2 regularization
        self.weights = [(1 - eta * (lmbda / n)) *w -(eta/len(mini_batch)) # 4. update weights and biases
                        for w, nw in zip(self.weights, nabla_w)]
        # obviously bias doesnt need regularization
        self.biases = [b-(eta/len(mini_batch))*nb
                    for b, nb in zip(self.biases, nabla_b)]

    def evaluate(self, test_data):
        """
        Return the number of test inputs for which the neural netwokr outputs
        the correct result. Note that the neural network's output is assumed 
        to be the index of whichever neuron in the final layer has the highest 
        activation
        """
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x==y) for (x, y) in test_results)
    
    def cost_derivative(self, output_activations, y): #loss function
        r"""
        Return the vector of partial derivatives \partial C_x/ 
        \partial a for the output activations.
        """
        return (output_activations - y) # original cost function (cross entropy or MSE)

In [103]:
#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

# One-hot encoding for labels
def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

## Implemmenting it

In [104]:
from sklearn.datasets import fetch_openml

# fetch the MNIST dataset
mnist = fetch_openml('mnist_784', as_frame=False)

# extract the data and labels
X, y = mnist.data.astype(np.float32), mnist.target.astype(np.int32)

# normalize the data to the range [0, 1]
X /= 255.0

X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

In [105]:

# Load MNIST data
def load_data():
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', as_frame=False)
    X, y = mnist.data.astype(np.float32), mnist.target.astype(np.int32)
    X /= 255.0  # Normalize pixel values to [0, 1]
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]
    training_inputs = [np.reshape(x, (784, 1)) for x in X_train]
    training_results = [vectorized_result(y) for y in y_train]
    training_data = list(zip(training_inputs, training_results))
    test_inputs = [np.reshape(x, (784, 1)) for x in X_test]
    test_data = list(zip(test_inputs, y_test))
    return training_data, test_data

In [114]:
print(len(training_data[1][:10]))

2


In [109]:
# Main script
if __name__ == "__main__":
    training_data, test_data = load_data()
    net = Network([784, 30, 10])
    net.SGD(training_data, epochs=10, mini_batch_size=10, eta=.01, test_data=test_data, lmbda=1e-3)



  return 1.0/(1.0+np.exp(-z))


Epoch 0: 1135 / 10000, took 12.42 seconds
Epoch 1: 1135 / 10000, took 11.82 seconds


KeyboardInterrupt: 

### Dropout Learning
Is a form of regularization, similar in some respects to ridge regularization. Inspired by random forests, the idea is to randomly remove a fraction $\phi$, this is done separately each time a training observation is processed. 

The surviving units stand in for those missing, and their weights are scaled up by a factor of $1/(1 − \phi)$ to compensate. This prevents nodes from becoming over-specialized, and can be seen as a form of regularization. In practice dropout is achieved by randomly set￾ting the activations for the “dropped out” units to zero, while keeping the
architecture intact.

### Interpolation and Double Descent

These concepts are important phenomena in the field of machine learning, particularly when dealing with overparamaterized models.
#### **Interpolation**
Occurs when a model is complex enough to perfectly fit the training data, meaning that it can pass through every single training point without any error.

This typically happens when the number of parameters in the model exceeds the number of data points, allowing the model to find solutions that exactly match the training set. In traditional statistical learning theory, reaching this interpolation theshold was ofted associated with overfitting,, where the model performs well on training data but poorly on unseen test data.

#### **Double Descent**
Double descent refers to a surprising phenomenon observed in the relationship between model complexity (e.g., number of parameters) and generalization error. Traditionally, as you increase the complexity of a model, the error decreases up to a point, after which it increases due to overfitting. However, in the case of double descent, beyond the interpolation threshold—where the model has enough capacity to perfectly fit the training data—the error begins to decrease again, leading to a second descent in the error curve

This behavior can be explained by the fact that very large models, despite their capacity to overfit, also have the ability to generalize well if trained appropriately. For instance, overparameterized deep networks might **interpolate noisy data** yet still exhibit good generalization performance. The mechanism behind this involves not only fitting the data but doing so in a way that implicitly prefers simpler solutions among all possible interpolating ones, a property sometimes referred to as "*smooth interpolation*".

In summary, while increasing model complexity usually leads to an initial rise in test error past a certain point due to overfitting, for sufficiently complex models like those used in modern deep learning, there’s often a second phase where further increasing the model size improves generalization, resulting in what we call the double descent curve.