# Writing Neural Networks

Writing neural networks with TensorFlow's low-level API is an essential skill for implementing most research papers.  The ability to translate mathematical equations into code is incredibly powerful, not only because you can take the work of others to a functional place, but because you can come up with your own ideas mathematically, prove some properties, and then implement them.  Reading and understanding techniques through mathematics and then implementing them in code is the most productive way to make progress in your development as a scientist and researcher.  It's also very useful for being an effective engineer.

Through mathematics we can quickly understand and verify that some property ought to hold.  Once you understand the definitions and consequences of an idea in mathematics, verifying its validity often takes a day or so, much faster than verifying scientific quantities, because mathematics works in the world of absolute truth.  There aren't really grey areas, so you can make definitive statements.  Science in general does not follow this binary notion of absolutes, hence its extensive use of statistics, yet scientific tools are primarily written in the language of mathematics.  A given idea or theory of the real world may seem enticing, but without scientific evidence to back it up, the theory means little.

That said, once a theory has been validated through statistical analysis, it can be folded into a theoretical analysis.  From time to time it is even possible to draw conclusions beyond the phenomenon under inquiry, and those conclusions then guide future discovery.  This has happened several times in mathematical physics.  One striking example is the black hole, which was theorized long before it was observed: Schwarzschild's solution to Einstein's field equations dates to 1916, while the first direct image of a black hole was published in 2019, roughly a century later.

Note: If you don't know how to write TensorFlow code, please see [this notebook](https://github.com/EricSchles/datascience_book/blob/master/python_programming/tensorflow_basics/Tensorflow%20Basics.ipynb).

## The Neural Network framework

If you aren't already familiar with neural networks, I suggest reading [this chapter](https://github.com/EricSchles/datascience_book/blob/master/5/An%20Introduction%20to%20Neural%20Networks%20-%2007.ipynb) first.

In general, neural nets are essentially a collection of linear regression models tied together through a meta-optimization algorithm called backpropagation.  The linear regression models are sometimes tied together with an 'activation' function, which is just a secondary transform applied to the output of each linear model.  We've already seen that we can use stochastic gradient descent to optimize a linear regression model [here](https://github.com/EricSchles/datascience_book/blob/master/2/An%20Introduction%20to%20Regression%20-%2003.ipynb) and that there is a linear algebra equivalent [here](https://github.com/EricSchles/datascience_book/blob/master/5/An%20Introduction%20to%20Neural%20Networks%20-%2007.ipynb).  Neural networks tie these two ideas together.  The linear algebra happens locally in the so-called 'forward pass', where each layer computes its output from the current weights, and then the gradient is used explicitly to update the weights on the 'backward pass'.  Strictly speaking, only the backward pass optimizes anything, but keep in mind that the two passes work in tandem.
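To make the forward and backward passes concrete, here is a minimal NumPy sketch of a network with one hidden layer.  Everything in it (the toy data, the `params` dictionary, the helper names `forward`, `mse` and `backward`, the tanh activation and the learning rate) is an assumption chosen for illustration, not code from the chapters linked above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))     # 16 samples, 3 features (toy data)
y = rng.normal(size=16)

# Hypothetical parameters: one hidden linear layer and one output linear layer.
params = {
    "W1": rng.normal(size=(3, 4)), "b1": np.zeros(4),
    "w2": rng.normal(size=4), "b2": 0.0,
}

def forward(p, x):
    """Forward pass: linear model, activation, then a second linear model."""
    hidden = np.tanh(x @ p["W1"] + p["b1"])
    return hidden, hidden @ p["w2"] + p["b2"]

def mse(p, x, y):
    """Mean squared error of the network's predictions."""
    return np.mean((forward(p, x)[1] - y) ** 2)

def backward(p, x, y):
    """Backward pass: apply the chain rule layer by layer, output first."""
    hidden, pred = forward(p, x)
    d_pred = 2.0 * (pred - y) / len(y)                      # dLoss/dPred
    d_z = np.outer(d_pred, p["w2"]) * (1.0 - hidden ** 2)   # tanh'(z) = 1 - tanh(z)^2
    return {
        "W1": x.T @ d_z, "b1": d_z.sum(axis=0),
        "w2": hidden.T @ d_pred, "b2": d_pred.sum(),
    }

loss_before = mse(params, x, y)
grads = backward(params, x, y)

# One gradient descent step on every parameter.
learning_rate = 0.001
updated = {name: value - learning_rate * grads[name] for name, value in params.items()}
loss_after = mse(updated, x, y)
```

The forward pass only evaluates the model; the backward pass is where the weights actually move.  With a small enough step, `loss_after` should come out below `loss_before`.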

So really a neural network is just an ensemble of roughly linear models, or models with easy derivatives, and each 'layer' of the network is just a given model, optimizing a bit of the ensemble in an explicit way.  The power of neural networks comes from their flexibility.  Unlike random forests or gradient boosted trees, which optimize either in parallel or in sequence, neural networks can do both.  Some of the layers can optimize for certain inputs and others can optimize for others.  Or we can feed copies of the same data to a very wide neural network, which essentially acts like a random forest.  We'll see a number of ideas in this chapter for how to write down different neural network architectures.  Even if they aren't necessarily useful, it's still good practice to see how to work with these different tools.
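As a rough picture of "in sequence" versus "in parallel", the NumPy sketch below wires the same kind of linear layer together in both ways.  The helper names (`linear_layer`, `apply_layer`), the layer sizes, the tanh activation and the choice to average the parallel branches are all assumptions made for illustration; none of the layers is trained here, the point is only the wiring.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_layer(n_in, n_out):
    """Return (weights, bias) for one linear layer with random weights."""
    return rng.normal(size=(n_in, n_out)), np.zeros(n_out)

def apply_layer(layer, x):
    """Apply a linear layer to a batch of rows."""
    weights, bias = layer
    return x @ weights + bias

x = rng.normal(size=(8, 4))  # 8 samples, 4 features (toy data)

# Deep: layers applied in sequence, each consuming the previous layer's output.
hidden_layers = [linear_layer(4, 4), linear_layer(4, 4)]
output_layer = linear_layer(4, 1)
deep_out = x
for layer in hidden_layers:
    deep_out = np.tanh(apply_layer(layer, deep_out))
deep_out = apply_layer(output_layer, deep_out)

# Wide: every branch sees the same input, and the branch outputs are averaged,
# loosely in the spirit of how a random forest averages its trees.
branches = [linear_layer(4, 1) for _ in range(5)]
wide_out = np.mean([apply_layer(branch, x) for branch in branches], axis=0)
```

Both `deep_out` and `wide_out` hold one prediction per sample; only the route the data takes through the layers differs.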

## A first primitive example

Our first example is going to be a purely linear model: essentially linear regression.

In [44]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

class LinearRegression:
    def __init__(self, shape, learning_rate=0.5):
        # One weight per feature, drawn from a standard normal distribution.
        array = np.random.normal(0, 1, size=shape)
        # The weights must be float32 to match the float32 training data.
        array = tf.cast(array, tf.float32)
        self.weights = tf.Variable(array)
        self.bias = tf.Variable(1.0)
        self.learning_rate = learning_rate

    def predict(self, x):
        # x has shape (n_samples, n_features); the result has shape (n_samples,).
        return tf.tensordot(x, self.weights, axes=1) + self.bias

    def loss(self, y_pred, y_true):
        # Mean squared error.
        return tf.reduce_mean(
           tf.square(y_pred - y_true)
        )

    def update_weights(self, X_train, y_true):
        # Record the forward pass so the tape can differentiate the loss.
        with tf.GradientTape() as tape:
            y_pred = self.predict(X_train)
            loss = self.loss(y_pred, y_true)
        gradients = tape.gradient(
            loss, [self.weights, self.bias]
        )
        # One step of gradient descent on each parameter.
        self.weights.assign_sub(gradients[0] * self.learning_rate)
        self.bias.assign_sub(gradients[1] * self.learning_rate)

X, y = make_regression()
X_train, X_test, y_train, y_test = train_test_split(X, y)
epochs = 10
lin_reg = LinearRegression(X_train.shape[1])
X_train = tf.cast(
    tf.constant(X_train), 
    tf.float32
)
X_test = tf.cast(
    tf.constant(X_test),
    tf.float32
)
y_train = tf.cast(
    tf.constant(y_train),
    tf.float32
)
y_test = tf.cast(
    tf.constant(y_test),
    tf.float32
)
for i in range(epochs):
    lin_reg.update_weights(X_train, y_train)
    
y_pred = lin_reg.predict(X_test)
loss = lin_reg.loss(y_pred, y_test)
print("MSE:", loss)

MSE: tf.Tensor(27385605000000.0, shape=(), dtype=float32)
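
The MSE printed above is enormous, which suggests that gradient descent diverged rather than converged.  A plausible cause is the default learning rate of 0.5: for a mean squared error loss, plain gradient descent is only stable when the learning rate is below 2 divided by the largest eigenvalue of the loss Hessian.  The NumPy sketch below illustrates this on synthetic data that is assumed to have the same shape as the default training split (75 samples, 100 features).  The data, the function name `mse_after_training`, and the omission of a bias term are all simplifications for illustration, not a rerun of the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 75, 100   # assumed to mirror the default training split
X = rng.normal(size=(n_samples, n_features))
y = X @ rng.normal(size=n_features)

def mse_after_training(learning_rate, epochs=10):
    """Run plain gradient descent on MSE and return the final training loss."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        grad = (2.0 / n_samples) * (X.T @ (X @ w - y))
        w -= learning_rate * grad
    return np.mean((X @ w - y) ** 2)

# The Hessian of the MSE loss is (2 / n) * X^T X.  Gradient descent is stable
# only while learning_rate < 2 / (largest eigenvalue of that Hessian).
hessian = (2.0 / n_samples) * (X.T @ X)
stable_limit = 2.0 / np.linalg.eigvalsh(hessian).max()

initial_mse = np.mean(y ** 2)             # loss at w = 0
mse_large_step = mse_after_training(0.5)   # above the stability limit
mse_small_step = mse_after_training(0.01)  # well below the stability limit
```

On data like this, `stable_limit` lands well below 0.5, so the large step makes the loss grow with every epoch while the small step shrinks it.  Passing a smaller `learning_rate` to `LinearRegression` and running more epochs would be the analogous change to try in the cell above.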


There are a few things to note here, all of which are mandatory for TensorFlow to train a model successfully:

1. You must cast everything to the same type.  As an exercise, try copy/pasting the above code into a new cell and removing all the type casting (the code with tf.cast).  It fails because NumPy produces the X and y data as doubles (float64), while the weights are floats (float32).

2. You must call predict and your loss function inside the gradient tape context, but apply your gradient updates outside it.  The tape records the operations run inside the context so that it can differentiate them afterwards.  This choice has always felt somewhat arbitrary to me.  However, because of how distributed training works and the fact that only some variables are *trainable* while others are frozen, we need some way to manage state.  It's an annoying trade-off, but it needed to be made somewhere.  Fortunately, the code is not particularly ugly, just very pedantic.

3. In general, tensor products are not commutative, so the order of the arguments to your tensor product matters.  This can be seen in the predict function defined above: if we swapped the order of x and the weights, the shapes would no longer line up and the code would not work.

4. Notice the use of tf.Variable for our weights and bias term.  If we made those tf.constant instead, the gradient tape would not track them by default, and we could not update them in place, because constants have no assign_sub.
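
The four points above can be checked on a toy problem.  The sketch below is a minimal illustration with made-up data; the names `w_var`, `w_const` and the helper `fails` are hypothetical, and the exact exception type TensorFlow raises for a dtype or shape mismatch may vary between versions, which is why the helper catches several.

```python
import numpy as np
import tensorflow as tf

# Toy data: 4 samples, 3 features, all ones (hypothetical values).
x = tf.ones((4, 3))                       # float32 by default
y = tf.constant([1.0, 2.0, 3.0, 4.0])

w_var = tf.Variable(tf.zeros(3))          # tracked by the tape automatically
w_const = tf.constant([0.0, 0.0, 0.0])    # not tracked unless tape.watch() is called

# Point 2: the forward pass and the loss go inside the tape so they are recorded.
with tf.GradientTape() as tape:
    pred = tf.tensordot(x, w_var, axes=1) + tf.reduce_sum(w_const)
    loss = tf.reduce_mean(tf.square(pred - y))

# Gradients and updates happen after the context has closed.
# Point 4: the gradient for the untracked constant comes back as None.
grad_var, grad_const = tape.gradient(loss, [w_var, w_const])
w_var.assign_sub(0.1 * grad_var)

def fails(fn):
    """Return True if calling fn raises a typical TensorFlow dtype/shape error."""
    try:
        fn()
    except (TypeError, ValueError, tf.errors.InvalidArgumentError):
        return True
    return False

# Point 1: NumPy arrays default to float64, which does not mix with float32 weights.
x64 = tf.constant(np.ones((4, 3)))
dtype_mismatch_fails = fails(lambda: tf.tensordot(x64, w_var, axes=1))

# Point 3: swapping the arguments tries to contract 3 weights against 4 samples.
swapped_order_fails = fails(lambda: tf.tensordot(w_var, x, axes=1))
```

If the constant needed a gradient, calling `tape.watch(w_const)` inside the context would make the tape track it, but it still could not be updated in place.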

These basic concerns may seem like an impediment, but this code is far cleaner and more scalable than the vanilla numpy implementation found [here in the section: A Naive Implementation of a Neural Network](https://github.com/EricSchles/datascience_book/blob/master/5/An%20Introduction%20to%20Neural%20Networks%20-%2007.ipynb).  Also, we don't need to worry about figuring out the derivative for each of our activation functions.  While some people love calculus (like myself, yes even in higher dimensions), many programmers don't, which is why automatic differentiation, as implemented in packages like TensorFlow, has opened up a world of mathematics to programmers.
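
As a small illustration of what automatic differentiation saves us, the sketch below asks the gradient tape for the derivative of an activation function and compares it with the derivative worked out by hand.  The choice of tanh and the sample points are assumptions made for the example.

```python
import numpy as np
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

with tf.GradientTape() as tape:
    tape.watch(x)            # x is a constant, so it must be watched explicitly
    activated = tf.tanh(x)

# tanh acts elementwise, so the gradient of its summed output with respect
# to x is simply the elementwise derivative.
auto_grad = tape.gradient(activated, x)

# The same derivative worked out by hand: d/dx tanh(x) = 1 - tanh(x)^2.
manual_grad = 1.0 - np.tanh(x.numpy()) ** 2
```

The two results should agree to floating point precision, and swapping `tf.tanh` for another differentiable activation requires no new calculus on our part.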

Now that we've seen how the TensorFlow framework can be used to train a neural network, let's do a multilayer perceptron.  This will be our first multilayer neural network, or ensemble of layers.  We'll look at a deep neural network first, deep because it trains the layers in sequence.  Then we'll look at a wide neural network, wide because it trains the layers in parallel.

In [4]:
import tensorflow as tf
import numpy as np
import pandas as pd
import random
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

class Dense(tf.Module):
    def __init__(self, output_size, name=None, final=False):
        super().__init__(name=name)
        if final:
            # Final layer: a weight vector of length output_size that
            # collapses each row of inputs to a single prediction.
            shape = (output_size,)
        else:
            # Hidden layer: a square (output_size, output_size) matrix, so
            # the layer's input and output widths are the same.
            shape = (output_size, output_size)
        weights = np.random.normal(0, 1, size=shape)
        self.weights = tf.Variable(tf.cast(weights, tf.float32))
        # A single scalar bias, broadcast across every output unit.
        self.bias = tf.Variable(1.0)

    def __call__(self, x):
        # Linear transform only; no activation function is applied.
        return tf.tensordot(x, self.weights, axes=1) + self.bias

        
class NeuralNet(tf.Module):
    def __init__(self, X_in, X_out, optimizer):
        # X_in is accepted but unused; every layer is sized by X_out.
        super().__init__()
        self.layer_one = Dense(X_out)
        self.layer_two = Dense(X_out)
        self.layer_three = Dense(X_out, final=True)
        self.optimizer = optimizer

    def _collect_trainable_variables(self):
        # Every weight and bias that the optimizer should update.
        return [
            self.layer_one.weights,
            self.layer_two.weights,
            self.layer_three.weights,
            self.layer_one.bias,
            self.layer_two.bias,
            self.layer_three.bias
        ]

    def __call__(self, x):
        return self.predict(x)

    def predict(self, x):
        # Three layers applied in sequence.  There is no activation between
        # them, so the composition is still a linear (affine) model of x.
        res = self.layer_one(x)
        res = self.layer_two(res)
        return self.layer_three(res)

    def loss(self, y_pred, y_true):
        # Mean squared error.
        return tf.reduce_mean(
           tf.square(y_pred - y_true)
        )

    def step(self, x, y):
        x = tf.cast(x, tf.float32)
        y = tf.cast(y, tf.float32)
        # Record the forward pass and loss so they can be differentiated.
        with tf.GradientTape() as tape:
            pred = self.predict(x)
            loss = self.loss(pred, y)
        trainable_variables = self._collect_trainable_variables()
        # Debugging aid: prints every variable in full on each step.
        print(trainable_variables)
        gradients = tape.gradient(loss, trainable_variables)
        # The optimizer applies the update, outside the tape context.
        self.optimizer.apply_gradients(zip(gradients, trainable_variables))

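if __name__ == '__main__':
    X, y = make_regression(n_samples=1000, n_features=100)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    # The evaluation data must match the float32 weights; step() casts
    # its own inputs, but predict() and loss() do not.
    X_test = tf.cast(X_test, tf.float32)
    y_test = tf.cast(y_test, tf.float32)
    learning_rate = 0.9
    optimizer = tf.optimizers.Adam(learning_rate)
    # The first argument (X_in) is unused; layers are sized by the feature count.
    nn = NeuralNet(1000, X_train.shape[1], optimizer)
    num_steps = 110
    losses = []
    for step in range(num_steps):
        nn.step(X_train, y_train)
        # Track the test loss after every training step.
        pred = nn(X_test)
        loss = nn.loss(pred, y_test)
        losses.append(float(loss))
    plt.plot(losses)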

[<tf.Variable 'Variable:0' shape=(100, 100) dtype=float32, numpy=
array([[-0.7212728 ,  1.1348922 ,  0.61802864, ...,  0.22765778,
        -1.1133646 , -1.440366  ],
       [ 1.4324218 , -0.60617477, -0.01505937, ..., -0.30103448,
         0.7750409 ,  0.25534207],
       [-1.3063195 , -0.8149815 ,  0.72941184, ..., -1.0462462 ,
        -0.8580679 , -0.54124355],
       ...,
       [-1.530326  ,  0.01813192,  1.3527733 , ..., -1.6122704 ,
         0.2747804 , -0.83125126],
       [-0.5919334 , -0.40140167,  0.61836475, ..., -2.0201757 ,
         1.2274144 ,  0.26729208],
       [-1.280319  ,  1.3908894 , -0.7895513 , ..., -0.6349584 ,
        -0.02296463,  0.11261811]], dtype=float32)>, <tf.Variable 'Variable:0' shape=(100, 100) dtype=float32, numpy=
array([[ 0.24141675, -0.05211819,  1.6133287 , ..., -1.0225772 ,
         0.29146394, -0.07150515],
       [-0.7354816 ,  0.9732139 , -1.4986281 , ...,  0.17546143,
         0.35929096, -1.2626256 ],
       [-1.1389118 ,  0.25035045,  0.87

2023-07-11 13:28:29.149922: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x110dd7ba0


NotFoundError: could not find registered platform with id: 0x110dd7ba0 [Op:__inference__update_step_xla_647]