First, we import the necessary modules from NumPy and PyTorch (code copied from neural_net.py).

The last import, time, is only there so I can slow the training loop down a little later.

In [1]:
import numpy as np
import torch.nn as nn
import torch
import time

I create a set of input data. It will eventually need to be a Torch tensor, but first I will create it as a list. It is entirely possible to create it directly as a tensor or a NumPy array, but I am taking the long route.

In [2]:
Input_Data = [n for n in range(10)]
print(Input_Data, type(Input_Data))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] <class 'list'>


I originally thought that the list first had to be turned into a NumPy array before it could become a Torch tensor. (Edit: this is not true. torch.tensor accepts lists as well, so cell [3] could be commented out and it wouldn't affect anything.)

In [3]:
Input_Data = np.array(Input_Data)
print(Input_Data, type(Input_Data))

[0 1 2 3 4 5 6 7 8 9] <class 'numpy.ndarray'>


We can then turn this array into a tensor. Contrary to what the guidelines in the 60 Minute Blitz tutorial say, you can turn a NumPy array into a PyTorch tensor simply by using torch.tensor. This, crucially, allows you to specify the data type as float. torch.tensor infers the data type from its input, so an array of integers becomes an integer tensor, and this creates errors later on when it meets the float weights of the network.

The requires_grad=True is not necessary, at least for feedforward use of the network. It only matters when gradients with respect to the input itself are wanted; the network's own parameters track gradients regardless.

In [4]:
Input_Data = torch.tensor(Input_Data, dtype=torch.float32, requires_grad=True)
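As a quick check of the data type behaviour described above, the sketch below shows what torch.tensor infers when no dtype is given. The variable names are illustrative only and are not part of the original notebook.

```python
import torch

# Integers in, integers out: torch.tensor infers the dtype from the data.
from_int_list = torch.tensor([0, 1, 2])
print(from_int_list.dtype)    # torch.int64

# Python floats are inferred as the default float type (float32 unless changed).
from_float_list = torch.tensor([0., 1., 2.])
print(from_float_list.dtype)  # torch.float32

# Passing dtype explicitly converts integer data to floats in one step.
as_float = torch.tensor([0, 1, 2], dtype=torch.float32)
print(as_float.dtype)         # torch.float32

# torch.arange can build the same data directly, skipping the list and NumPy steps.
direct = torch.arange(10, dtype=torch.float32)
print(direct.shape)           # torch.Size([10])
```

The last line is one way of taking the short route instead of the long one.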

We then need to make sure the tensor is arranged as a column and not as a row. Each row is later taken as a separate instance for training, and the columns hold the features associated with that same instance. If we supplied our simple tensor as a row vector, it would look like one single instance with 10 input features. Thus, we reshape it into a column. A plain transpose does not do this, because a 1-dimensional array or tensor has no second dimension to swap, so we use view instead. The -1 tells view to infer that dimension (here the number of rows) from the number of elements.

In [5]:
Input_Data = Input_Data.view(-1, 1)
print(Input_Data, type(Input_Data))

tensor([[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.],
        [8.],
        [9.]], grad_fn=<ViewBackward>) <class 'torch.Tensor'>
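The effect of the reshape can be checked on shapes alone. The sketch below assumes a throwaway nn.Linear(1, 3) layer (not the network built later) to show that the column form is accepted and the row form is rejected.

```python
import torch
import torch.nn as nn

x = torch.arange(10, dtype=torch.float32)  # 1-D tensor, shape (10,)

column = x.view(-1, 1)   # 10 instances, 1 feature each -> shape (10, 1)
row = x.view(1, -1)      # 1 instance, 10 features      -> shape (1, 10)
print(column.shape, row.shape)

layer = nn.Linear(1, 3)     # expects exactly 1 feature per instance
column_output = layer(column)
print(column_output.shape)  # one row of 3 outputs per instance: (10, 3)

# The row form has 10 features per instance, so it does not match the layer.
try:
    layer(row)
    row_accepted = True
except RuntimeError as err:
    row_accepted = False
    print('Shape mismatch:', type(err).__name__)
```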


We can now create the target data, which in this case is the result of evaluating $y = 2x + 20$, where $y$ is the Target_Data and $x$ is the Input_Data. The same method is used, but in a single cell this time.

In [6]:
def Poly(x):
    y = x*2 + 20
    return y

Target_Data = [Poly(n) for n in range(10)]
Target_Data = np.array(Target_Data)
Target_Data = torch.tensor(Target_Data, dtype=torch.float32, requires_grad=True)
Target_Data = Target_Data.view(-1,1)
print(Target_Data, type(Target_Data))

tensor([[20.],
        [22.],
        [24.],
        [26.],
        [28.],
        [30.],
        [32.],
        [34.],
        [36.],
        [38.]], grad_fn=<ViewBackward>) <class 'torch.Tensor'>


Now that we have data and the corresponding targets, we change subject slightly and set up a simple neural network. The design will have a single input, 2 hidden layers with 3 nodes each, and no activation functions. (Edit: the loop fails within the first 10 epochs at most unless some kind of activation function is used, presumably because the results blow up. All results become NaN.) There will be one output. Deciding on the shape and functional description of each node should be enough to make a forward pass through the network. To do this we will need to use PyTorch.

I will construct the network the same way as in "neural_net", which seems to differ significantly from what I read in the tutorial. In the tutorial, a class was created with the layers as its properties, and these were then linked together by a method of that class. In DeepMoD, what seems to occur is that a list of neural network layers is created, and then some function is used to transform that list into a network of some class that I cannot guess.

We create the list of layers, starting with the first hidden layer. It has 1 as the first argument to specify that each node should expect 1 input (each single element of our input data). It then has a 3 to specify the number of nodes in this layer.

In [7]:
Network = [nn.Linear(1, 3)]

We then append the activation function, a sigmoid that is applied elementwise to the outputs of the layer.

In [8]:
Network.append(nn.Sigmoid())

We then append the second hidden layer, which we chose to be identical to the first. This time the 1st argument is 3, as there were 3 nodes in the previous layer. The activation function appended above sits between these two layers, and a second one follows this layer. append() is a method of lists.

In [9]:
Network.append(nn.Linear(3, 3))
Network.append(nn.Sigmoid())

We then append the output layer. It is single valued, so the second argument becomes 1. There is no activation function here, as we just want to see the result.

In [10]:
Network.append(nn.Linear(3, 1))

We then apply the function that changes the list into a network, nn.Sequential. The * unpacks the list so that each layer is passed as a separate argument:

In [11]:
Torch_Network = nn.Sequential(*Network)
print(type(Torch_Network))

<class 'torch.nn.modules.container.Sequential'>
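For comparison, here is a minimal sketch of the two construction styles side by side: the class-based style from the tutorial and the list-based style used here. The class name TinyNet and its attribute names are hypothetical and chosen only for illustration; this is not DeepMoD's code.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        # The layers are stored as attributes of the class...
        self.hidden_1 = nn.Linear(1, 3)
        self.hidden_2 = nn.Linear(3, 3)
        self.output = nn.Linear(3, 1)

    def forward(self, x):
        # ...and linked together by this method.
        x = torch.sigmoid(self.hidden_1(x))
        x = torch.sigmoid(self.hidden_2(x))
        return self.output(x)

class_based = TinyNet()
list_based = nn.Sequential(
    nn.Linear(1, 3), nn.Sigmoid(),
    nn.Linear(3, 3), nn.Sigmoid(),
    nn.Linear(3, 1),
)

x = torch.arange(10, dtype=torch.float32).view(-1, 1)
print(class_based(x).shape, list_based(x).shape)  # both torch.Size([10, 1])
print(isinstance(list_based, nn.Module))          # True
```

Both objects are nn.Module instances with the same layer shapes; nn.Sequential simply supplies a forward method that calls the layers in order.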


We now have a built network. We can feed a tensor to this network and get the feedforward output. Currently, the weights and biases have all been set randomly (by default, nn.Linear draws them uniformly between $-1/\sqrt{k}$ and $1/\sqrt{k}$, where $k$ is the number of inputs to the layer), so the result will itself be fairly random. However, for a given input, it will be consistent, as no backprop occurs and the network is unchanged.

To demonstrate the feedforward output, we can create a tensor holding a single value, which will therefore produce a single output value. We need to write the number 1 as "1." so that the tensor is created with a float data type. torch.tensor([1]) would be inferred as an integer tensor, which does not match the float weights of the network. We will output what the network currently tells us for 1 and 20, twice each, to show consistency.

In [12]:
print(Torch_Network(torch.tensor([1.])))
print(Torch_Network(torch.tensor([1.])))
print(Torch_Network(torch.tensor([20.])))
print(Torch_Network(torch.tensor([20.])))

tensor([0.3264], grad_fn=<AddBackward0>)
tensor([0.3264], grad_fn=<AddBackward0>)
tensor([0.3230], grad_fn=<AddBackward0>)
tensor([0.3230], grad_fn=<AddBackward0>)


We can also demonstrate feeding the entire Input_Data tensor into the network to show that it returns a tensor with an equal number of rows. The input tensor is always rows x columns (number of instances x number of features for each instance) and the output tensor is always (number of instances x number of output results for each instance). As we have the same number of input features as output results per instance, both the input and output tensors have the same shape.

In [13]:
Output_Data = Torch_Network(Input_Data)
print(Output_Data, type(Output_Data))

tensor([[0.3243],
        [0.3264],
        [0.3271],
        [0.3266],
        [0.3258],
        [0.3249],
        [0.3242],
        [0.3237],
        [0.3234],
        [0.3233]], grad_fn=<AddmmBackward>) <class 'torch.Tensor'>


We can now examine the network a little to understand how the parameters are stored. In theory, I have worked out that the number of parameters (a catch-all term for weights and biases) is equal to

$\text{Number of Parameters} = \sum\limits_{Layer=1}^{N} n_{Layer} \, (n_{Layer-1} + 1)$

where $N$ is the number of layers (not counting the input), $n_{Layer}$ is the number of nodes in a given layer and $n_0$ is the number of input features. The $+ 1$ comes from the bias parameter of each node, and the product of the node counts of 2 adjacent layers gives the number of weights, as each node in the previous layer connects to each node in the current layer.

Considering we have 1 input feature and nodes that go 3, 3, 1, where the result of the final one is our output, we expect $3 \times 2 + 3 \times 4 + 1 \times 4 = 22$ parameters in our network.
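The formula above can be written as a few lines of plain Python. The function name count_parameters and its list argument are my own choices for this sketch; it assumes fully connected layers that each have a bias.

```python
def count_parameters(layer_sizes):
    """Count weights and biases for fully connected layers.

    layer_sizes lists the number of input features followed by the number
    of nodes in each subsequent layer, e.g. [1, 3, 3, 1].
    """
    total = 0
    for n_previous, n_current in zip(layer_sizes[:-1], layer_sizes[1:]):
        # n_previous weights plus 1 bias for every node in the current layer
        total += n_current * (n_previous + 1)
    return total

print(count_parameters([1, 3, 3, 1]))  # 3*2 + 3*4 + 1*4 = 22
```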

We examine parameters using the .parameters() method of the network (it is a method of nn.Module, not of tensors), but printing it directly is useless as it is a generator.

In [14]:
print(Torch_Network.parameters(), type(Torch_Network.parameters()))

<generator object Module.parameters at 0x7f09be3e0f10> <class 'generator'>


But we can pull out the components of this using list(). We get a list whose length is equal to twice the number of layers in our system (not including the input or the activation functions). In this case, that is 6. Each element of this list is itself a tensor. The reason for the number of tensors is that, for each layer, one tensor containing the weights and one containing the biases is produced.

In [15]:
Parameter_Tensor = list(Torch_Network.parameters())
print(len(Parameter_Tensor))

6


For each pair of tensors produced for each layer, the first is for the weights applied to the inputs from the previous layer. This is generally in the form of a matrix where each row corresponds to a different node in the current layer, and each column corresponds to input from a node in the previous layer. In the first hidden layer, there are 3 nodes and one input per node:

In [16]:
print('Parameter #', 1, 'has size', Parameter_Tensor[0].size())
print('And looks like \n', Parameter_Tensor[0])

Parameter # 1 has size torch.Size([3, 1])
And looks like 
 Parameter containing:
tensor([[-0.6825],
        [ 0.7378],
        [ 0.5528]], requires_grad=True)


The 2nd tensor is for the biases on the nodes in the first layer. As the shape of this tensor depends only on the number of nodes, it is returned as a 1-dimensional vector, which is neither a column nor a row vector. There is an element for each node in the layer.

In [17]:
print('Parameter #', 2, 'has size', Parameter_Tensor[1].size())
print('And looks like \n', Parameter_Tensor[1])

Parameter # 2 has size torch.Size([3])
And looks like 
 Parameter containing:
tensor([-0.2242,  0.1473, -0.7378], requires_grad=True)
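To see how these two tensors are used, the sketch below reproduces the output of a single linear layer by hand as $xW^T + b$. It uses a freshly created layer with its own random values, not the Parameter_Tensor entries printed above.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1, 3)
x = torch.arange(10, dtype=torch.float32).view(-1, 1)  # shape (10, 1)

# weight has shape (3, 1): one row per node, one column per input.
# bias has shape (3,) and is broadcast across all 10 rows.
manual = x @ layer.weight.T + layer.bias

print(manual.shape)  # torch.Size([10, 3])
# A small absolute tolerance allows for float32 round-off differences.
print(torch.allclose(manual, layer(x), atol=1e-5))  # True
```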


By now examining the 3rd and 4th elements of Parameter_Tensor, we examine the weights and biases of the second layer. The shape of the biases tensor is identical to that of the first hidden layer, but the weights tensor is now in the form of a 3x3 matrix, as each node in the second hidden layer (each row) receives input from each node in the first hidden layer (each column).

In [18]:
print('Parameter #', 3, 'has size', Parameter_Tensor[2].size())
print('And looks like \n', Parameter_Tensor[2])
print('Parameter #', 4, 'has size', Parameter_Tensor[3].size())
print('And looks like \n', Parameter_Tensor[3])

Parameter # 3 has size torch.Size([3, 3])
And looks like 
 Parameter containing:
tensor([[-0.2523,  0.2746, -0.0982],
        [-0.1485, -0.4616, -0.2932],
        [ 0.1923,  0.3346, -0.2605]], requires_grad=True)
Parameter # 4 has size torch.Size([3])
And looks like 
 Parameter containing:
tensor([ 0.2470, -0.3116, -0.1488], requires_grad=True)


Finally, the output layer only has one node, and so only 1 bias. It receives input from all three nodes in the previous layer, and so its weights tensor is a row vector. This time, unlike the bias vector, the row shape is explicit.

In [19]:
print('Parameter #', 5, 'has size', Parameter_Tensor[4].size())
print('And looks like \n', Parameter_Tensor[4])
print('Parameter #', 6, 'has size', Parameter_Tensor[5].size())
print('And looks like \n', Parameter_Tensor[5])

Parameter # 5 has size torch.Size([1, 3])
And looks like 
 Parameter containing:
tensor([[0.1714, 0.0276, 0.2445]], requires_grad=True)
Parameter # 6 has size torch.Size([1])
And looks like 
 Parameter containing:
tensor([0.0946], requires_grad=True)
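The same walk through the parameters can be done in one loop with named_parameters(), which also reports which layer each tensor belongs to. The sketch rebuilds a network of the same shape; the variable names are illustrative.

```python
import torch.nn as nn

network = nn.Sequential(
    nn.Linear(1, 3), nn.Sigmoid(),
    nn.Linear(3, 3), nn.Sigmoid(),
    nn.Linear(3, 1),
)

# Names are "<position in Sequential>.<weight|bias>". The Sigmoid entries
# at positions 1 and 3 have no parameters and so do not appear.
summary = [(name, tuple(parameter.shape))
           for name, parameter in network.named_parameters()]
for name, shape in summary:
    print(name, shape)
```

This prints the six tensors in the same order as above: weight then bias for positions 0, 2 and 4.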


The next thing to do is to start training the neural network. For that, 2 additional things need to be decided upon: the loss function and the optimisation function. For the former, we will simply use MSE loss:

In [20]:
Loss_Function = nn.MSELoss()

And for the optimisation function, we will simply use stochastic gradient descent with a learning rate of 0.01.

In [21]:
optimizer = torch.optim.SGD(Torch_Network.parameters(), lr=0.01)

Borrowing the simple components of the loop from the Neural Networks part of the PyTorch tutorial, we combine everything into a loop that trains the neural network. Each pass starts with optimizer.zero_grad(), which clears the gradients left over from the previous pass; without it, new gradients are added to the old ones.
After that, we calculate the network's prediction on the data. We are feeding it a tensor of 10 rows, so it evaluates all 10 instances in a single batched matrix operation and outputs a tensor with the same number of rows.
The loss is then calculated. This is a tensor as well, holding a single value.
Then Loss.backward() does the backprop, working out the gradient of the loss with respect to each weight and bias.
Finally, optimizer.step() triggers the SGD update, which adjusts every weight and bias once using those gradients. As the MSE loss is averaged over all 10 instances, this is one update per pass, not one per input data element.

In [22]:
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Torch_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data)
        #time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''

print('Ready to end this')

Ready to end this
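The reason optimizer.zero_grad() needs to be run can be seen with a single scalar in place of a network. This is a minimal sketch: w stands in for any one weight, and the "loss" is just 3w, whose gradient is always 3.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
print(w.grad)   # tensor(3.)

# A second backward pass adds to the stored gradient instead of replacing it.
(3 * w).backward()
print(w.grad)   # tensor(6.)

# Clearing the stored gradient first, as optimizer.zero_grad() does for
# every parameter it manages, gives the correct value again.
w.grad.zero_()
(3 * w).backward()
print(w.grad)   # tensor(3.)
```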


As the loop ends on an update, the final version of the network has not yet been used to calculate results, so we run the forward pass and the loss one last time:

In [23]:
Output_Data = Torch_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data)

tensor(0.0458, grad_fn=<MeanBackward0>)
tensor([[20.1411],
        [21.8972],
        [24.0307],
        [26.1969],
        [28.1849],
        [30.0871],
        [32.1267],
        [34.3484],
        [36.4528],
        [38.0558]], grad_fn=<AddmmBackward>)
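When the network is only being evaluated, as in this last step, the gradient bookkeeping can be switched off. The sketch below uses a small untrained network of my own (not Torch_Network) to show torch.no_grad() and how to read the loss as a plain number.

```python
import torch
import torch.nn as nn

network = nn.Sequential(nn.Linear(1, 3), nn.Sigmoid(), nn.Linear(3, 1))
loss_function = nn.MSELoss()

x = torch.arange(10, dtype=torch.float32).view(-1, 1)
y = 2 * x + 20

# Inside no_grad, no computation graph is recorded, so the resulting
# tensors carry no grad_fn.
with torch.no_grad():
    prediction = network(x)
    loss = loss_function(prediction, y)

print(loss.shape)    # torch.Size([]): a zero-dimensional tensor
print(loss.item())   # the same value as a plain Python float
```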


And that's it! Currently, unless I put in something as simple as $y = 2x + 20$, the NN isn't good enough. I need to play around with

\begin{itemize}
\item shape
\item activation functions
\item learning rate
\end{itemize}

I don't really want to muck around with loss calculation or optimisation functions just yet.

....
Some time Later
....

To try to improve the network, we can increase the size of each layer and the number of layers. Let's create a new network, this time with 4 hidden layers and 30 nodes per layer.

In [24]:
Better_Network = nn.Sequential(*[nn.Linear(1, 30), nn.Sigmoid(), nn.Linear(30, 30), nn.Sigmoid(), nn.Linear(30, 30), nn.Sigmoid(),
                                nn.Linear(30, 30), nn.Sigmoid(), nn.Linear(30, 1)])

The number of parameters in this network is much larger. It is $30 \times 2 + 30 \times 31 + 30 \times 31 + 30 \times 31 + 1 \times 31 = $...

In [25]:
print(60 + 3*30*31 + 31)

2881
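The hand count can be cross-checked against PyTorch itself by summing the number of elements in every parameter tensor. The sketch rebuilds a network of the same shape under a lowercase name so that it stands alone.

```python
import torch.nn as nn

better_network = nn.Sequential(
    nn.Linear(1, 30), nn.Sigmoid(),
    nn.Linear(30, 30), nn.Sigmoid(),
    nn.Linear(30, 30), nn.Sigmoid(),
    nn.Linear(30, 30), nn.Sigmoid(),
    nn.Linear(30, 1),
)

# numel() returns the number of elements in each weight or bias tensor.
parameter_count = sum(p.numel() for p in better_network.parameters())
print(parameter_count)  # 2881
```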


We won't change anything but the shape, so we can now reconfigure the optimiser and then run the training loop as before:

In [26]:
optimizer_2 = torch.optim.SGD(Better_Network.parameters(), lr=0.01)
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer_2.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer_2.step()    # Does the update
    '''
    if n % 100 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data)
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data)

Final Result:
tensor(0.0292, grad_fn=<MeanBackward0>)
tensor([[19.9913],
        [21.8427],
        [23.9340],
        [25.9418],
        [27.8624],
        [29.7803],
        [31.7710],
        [33.8421],
        [35.8780],
        [37.6839]], grad_fn=<AddmmBackward>)


After a bit of thought, it is possible to note that any network without activation functions simplifies to a linear equation in the input features, i.e. $y = \alpha + ax_1 + bx_2 + cx_3 + \dots$ Considering that our target equation is linear (essentially meaning that it can be expressed as a polynomial of order 1 or 0), we shouldn't need an activation function at all. Let's test this.

In [27]:
Simple_Network = nn.Sequential(*[nn.Linear(1, 30), nn.Linear(30, 30), nn.Linear(30, 30), nn.Linear(30, 30), nn.Linear(30, 1)])

optimizer_3 = torch.optim.SGD(Simple_Network.parameters(), lr=0.0001)
Max_Iterations = 1000

for n in range(Max_Iterations):
    optimizer_3.zero_grad()   # zero the gradient buffers
    Output_Data = Simple_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer_3.step()    # Does the update
    '''
    if n % 100 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data)
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Simple_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data)

Final Result:
tensor(1.0477e-10, grad_fn=<MeanBackward0>)
tensor([[20.0000],
        [22.0000],
        [24.0000],
        [26.0000],
        [28.0000],
        [30.0000],
        [32.0000],
        [34.0000],
        [36.0000],
        [38.0000]], grad_fn=<AddmmBackward>)


This result is almost perfect, and certainly as good as we could ever reasonably require. Notice, however, that the learning rate has been set to 0.0001. Setting it at the higher rate of 0.01 led to NaNs by epoch 10 at the latest, and 0.001 led to results that were highly variable, with a loss that rapidly spiked and fell. Clearly the minimisation was unstable and oscillated widely around the minimum. It may eventually have converged; I did not check.
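One plausible reading of this behaviour can be illustrated with a one-parameter toy problem. The sketch below is not the notebook's network: it assumes a loss of the form $\frac{1}{2} c w^2$ with a made-up curvature c, for which each gradient descent step multiplies w by (1 - learning_rate * c). When that factor has magnitude above 1 the iterates grow without bound, when it is close to -1 they flip sign on every step and shrink slowly, and when it is between 0 and 1 they shrink smoothly.

```python
def gradient_descent(curvature, learning_rate, steps=50, w=1.0):
    """Minimise 0.5 * curvature * w**2 by plain gradient descent."""
    for _ in range(steps):
        gradient = curvature * w
        w = w - learning_rate * gradient
    return w

# Hypothetical curvature, chosen so that the three learning rates tried
# above land in the three regimes: blow-up, oscillation, smooth decay.
curvature = 1900.0

results = {}
for learning_rate in (0.01, 0.001, 0.0001):
    results[learning_rate] = gradient_descent(curvature, learning_rate)
    print(learning_rate, results[learning_rate])
```

In a real network the steepest direction of the loss plays the role of c, and it is not known here, so this only shows the mechanism and not the actual threshold.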

With the very low learning rate, however, the convergence is rapid and precise.
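The claim that a network without activation functions reduces to a single linear equation can be checked directly. This sketch stacks two linear layers (a smaller stand-in for Simple_Network, with names of my own) and collapses them by hand using $W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$.

```python
import torch
import torch.nn as nn

# Two stacked linear layers with no activation in between. Double precision
# keeps the round-off small enough for a strict comparison.
first = nn.Linear(1, 30).double()
second = nn.Linear(30, 1).double()
stacked = nn.Sequential(first, second)

# Collapse the two layers into one weight and one bias.
with torch.no_grad():
    combined_weight = second.weight @ first.weight            # shape (1, 1)
    combined_bias = second.weight @ first.bias + second.bias  # shape (1,)

x = torch.arange(10, dtype=torch.float64).view(-1, 1)
collapsed = x @ combined_weight.T + combined_bias

print(combined_weight.shape, combined_bias.shape)
print(torch.allclose(stacked(x), collapsed))  # True
```

The same argument applies layer by layer, so any depth of purely linear layers still amounts to one slope and one intercept per input feature.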

# Trying a more complicated result

Let's use our Better_Network to try to fit a higher order polynomial. This time, we shall use a larger training data set and aim to fit the equation $y = x^2 + x + 50$. First of all, we create the data:

In [1]:
import numpy as np
import torch.nn as nn
import torch
import time

In [2]:
Input_Data = np.array([n for n in range(100)])
Input_Data = Input_Data.reshape(-1, 1)
Target_Data = Input_Data**2 + Input_Data + 50
print(Target_Data.reshape(1, -1)) # reshaped just so it fits on the page nicely
Input_Data = torch.tensor(Input_Data, dtype=torch.float32, requires_grad=True)
Target_Data = torch.tensor(Target_Data, dtype=torch.float32, requires_grad=True)

[[  50   52   56   62   70   80   92  106  122  140  160  182  206  232
   260  290  322  356  392  430  470  512  556  602  650  700  752  806
   862  920  980 1042 1106 1172 1240 1310 1382 1456 1532 1610 1690 1772
  1856 1942 2030 2120 2212 2306 2402 2500 2600 2702 2806 2912 3020 3130
  3242 3356 3472 3590 3710 3832 3956 4082 4210 4340 4472 4606 4742 4880
  5020 5162 5306 5452 5600 5750 5902 6056 6212 6370 6530 6692 6856 7022
  7190 7360 7532 7706 7882 8060 8240 8422 8606 8792 8980 9170 9362 9556
  9752 9950]]


Reinitialise the network, this time with tanh activation functions:

In [3]:
Better_Network = nn.Sequential(*[nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 1)])
print(Better_Network(torch.tensor([1.])))
print(Better_Network(torch.tensor([1.])))
print(Better_Network(torch.tensor([20.])))
print(Better_Network(torch.tensor([20.])))

tensor([0.1028], grad_fn=<AddBackward0>)
tensor([0.1028], grad_fn=<AddBackward0>)
tensor([0.0418], grad_fn=<AddBackward0>)
tensor([0.0418], grad_fn=<AddBackward0>)


Redefine the optimizer and run the training loop:

In [4]:
Loss_Function = nn.MSELoss() # this has been put here just in case one wants to skip beyond the earlier stuff, and so miss out this line.
optimizer_4 = torch.optim.SGD(Better_Network.parameters(), lr=0.001)
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer_4.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer_4.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data.view(1, -1)) # reshaped for readability

Final Result:
tensor(8887778., grad_fn=<MeanBackward0>)
tensor([[3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978, 3382.9978,
         3382.9978, 3382.9978, 3382.9978, 33