First, we import all necessary modules from Numpy and PyTorch (Code copied from neural_net.py).

The last import is just so I can slow it down a little later.

In [1]:
import numpy as np
import torch.nn as nn
import torch
import time

I create a set of input data. I believe this must be as a Torch tensor, but first I will create it as a list. It is completely possible to create it directly as a tensor or a NumPy array also, but I am just taking the long route.

In [2]:
Input_Data = [n for n in range(10)]
print(Input_Data, type(Input_Data))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] <class 'list'>


To turn this into a Torch tensor, we first need to turn it into a NumPy array (Edit: this is not true. torch.tensor can accept lists as well. [3] could be commented out and it wouldn't affect anything).

In [3]:
Input_Data = np.array(Input_Data)
print(Input_Data, type(Input_Data))

[0 1 2 3 4 5 6 7 8 9] <class 'numpy.ndarray'>


We can then turn this array into a tensor. Contrary to the guidelines in the 60 minute blitz tutorial say, you can turn a numpy array into a PyTorch tensor simply using the torch.tensor operation. This, crucially, allows you specify the data type as float. For reasons best known to the developers of PyTorch, the datatype of a float is not assumed, and then this creates errors later on.

The requires_grad=True is not necessary at least for feedfoward use of the network.

In [4]:
Input_Data = torch.tensor(Input_Data, dtype=torch.float32, requires_grad=True)

We then need to make sure the tensor is not given in terms of columns. Each row is taken later as a seperate instance for training, and all higher dimensions are data associated with that same instance. If we supply our simple tensor as a row vector, it will look like one single instance with 10 input parameters. This, we transpose it. Transposing is not made easy because when the array or tensor is just a vector, python modules like to strip it of its awareness of higher dimensions for reasons best known to the developers of NumPy and PyTorch. The -1 means default to whatever dimension (#rows) necessary.

In [5]:
Input_Data = Input_Data.view(-1, 1)
print(Input_Data, type(Input_Data))

tensor([[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.],
        [8.],
        [9.]], grad_fn=<ViewBackward>) <class 'torch.Tensor'>


We can now create the target data, which in this case is the result of performing $y = 2x + 20$ where y is the Target Data and x is the Input_Data. The same method will be used but in one step this time.

In [6]:
def Poly(x):
    y = x*2 + 20
    return y

Target_Data = [Poly(n) for n in range(10)]
Target_Data = np.array(Target_Data)
Target_Data = torch.tensor(Target_Data, dtype=torch.float32, requires_grad=True)
Target_Data = Target_Data.view(-1,1)
print(Target_Data, type(Target_Data))

tensor([[20.],
        [22.],
        [24.],
        [26.],
        [28.],
        [30.],
        [32.],
        [34.],
        [36.],
        [38.]], grad_fn=<ViewBackward>) <class 'torch.Tensor'>


Now that we have Data and the corresponding target, which change subject slightly and set up a simple neural network. The design for this will be one that has a single input, 2 hidden layers, each with 3 nodes, and no activation functions. Edit:(the loop fails after at maximum teh first 10 epochs unless some kind of activation function is used, presumably because the results blow up. All results become Nan). There will be one output. Deciding on the shape and functional description of each node should be enough to make a forward pass on the network. To do this we will need to use PyTorch.

I will construct the netowrk the same way as in "neural_net" which seems to differ significantly from what I read in the tutorial. In the tutorial, fundamentally, a class was created with certain properties that were then linked together by a method of that class. In DeepMoD, what seems to be occurring is that a list is created of neural network layers, and then some function is used to transform that into a network of some class that i cannot guess.

Create list of layers, we start with the first hidden layer. It has 1 as the first arguement to specify that each node should expect 1 input (each single element of out input data). It then has a 3 to specify the number of nodes in this layer.

In [7]:
Network = [nn.Linear(1, 3)]

We then append the piecewise activation function.

In [8]:
Network.append(nn.Sigmoid())

We then append the second hidden layer, which we chose to be identical to the first. This time the 1st arguement is 3 as there were 3 nodes in teh previous layer. In between these two elements is where we would have applied the "activation function" if we were including one. append() is a method of lists.

In [9]:
Network.append(nn.Linear(3, 3))
Network.append(nn.Sigmoid())

We then append the output. It is single valued, so the second arguement becomes 1. There is no activation function here as we just want to see the result.

In [10]:
Network.append(nn.Linear(3, 1))

We then apply ths funny function that changes it into a network:

In [11]:
Torch_Network = nn.Sequential(*Network)
print(type(Torch_Network))

<class 'torch.nn.modules.container.Sequential'>


We now have a built network. We can feed a tensor to this network and get the feedforward output. Currently, the weights and biases have all been set randomly between -1 and 1, so the result will itself be fairly random. However, for a given input, it will be consistant, as no backprop occurs, and the network is unchanged.

To demonstrate the feedforward output, we can create a tensor of single value that will therefore output a single value. We need to call the number 1 as "1." so that it takes it as a torch.float class object, which otherwise is, for some reason, not assumed. We will output what the network currently tells us for 1 and 20, twice each, to show consistency.

In [12]:
print(Torch_Network(torch.tensor([1.])))
print(Torch_Network(torch.tensor([1.])))
print(Torch_Network(torch.tensor([20.])))
print(Torch_Network(torch.tensor([20.])))

tensor([0.3264], grad_fn=<AddBackward0>)
tensor([0.3264], grad_fn=<AddBackward0>)
tensor([0.3230], grad_fn=<AddBackward0>)
tensor([0.3230], grad_fn=<AddBackward0>)


We can also demonstrate feeding the entire Input_Data Tensor into the network to show how it provides a tensor in return of equal number of rows. the input tensor is always [rowxcolumn] (Number of instances x Number of features or variables for each instance) and the output tensor is always (Number of instances x Number of output results for each instance). As we have the same number of input features as output results per instance, both the input and output tensors have the same shape.

In [13]:
Output_Data = Torch_Network(Input_Data)
print(Output_Data, type(Output_Data))

tensor([[0.3243],
        [0.3264],
        [0.3271],
        [0.3266],
        [0.3258],
        [0.3249],
        [0.3242],
        [0.3237],
        [0.3234],
        [0.3233]], grad_fn=<AddmmBackward>) <class 'torch.Tensor'>


We can now exam the Network a little bit to understand how the parameters are stored. In theory, I have worked out that the number of parameters (catch all term for weights and biases) is equal to

$Number of Parameters = \sum\limits_{Layer=1} n_{Layer} * (n_{Layer-1} + 1)$

Where $n_{Layer}$ is the number of nodes in a given layer and $n_0$ is the number of input features. The $+ 1$ comes from the bias parameters in each layer, and the product of the nodes in 2 layers gives the weights, as each node in the previous layer connects to each node in the current layer.

Considering we have 1 input feature and nodes that go 3, 3, 1, where the result of the final one is our output, we expect 22 parameters in our network.

We examine parameters using the .parameters() method of tensors, but printing it directly is useless as it is a generator.

In [14]:
print(Torch_Network.parameters(), type(Torch_Network.parameters()))

<generator object Module.parameters at 0x7f09be3e0f10> <class 'generator'>


But we can pull out the components of this using list. We get a list of length that is equal to twice the number of layers in our system (not including input). In this case, that is 6. Each element of this list is a tensor itself. The reason for the number of tensors is that, for each a layer, one tensor containing the weights, and one containing the biases is produced.

In [15]:
Parameter_Tensor = list(Torch_Network.parameters())
print(len(Parameter_Tensor))

6


For each pair of tensors produced for each layer, the first is for the weights applied to the inputs from the previous layer. This is generally in the form of a matrix where each row corresponds to a different node in the current layer, and each column corresponds to input from a node in the previous layer. In the first hidden layer, there are 3 nodes and one input per node:

In [16]:
print('Parameter #', 1, 'has size', Parameter_Tensor[0].size())
print('And looks like \n', Parameter_Tensor[0])

Parameter # 1 has size torch.Size([3, 1])
And looks like 
 Parameter containing:
tensor([[-0.6825],
        [ 0.7378],
        [ 0.5528]], requires_grad=True)


The 2nd tensor is for the biases on the nodes in the first layer. As the shape of this tensor is independant of anything but the number of nodes, it is returned as a tensor in the form of a 1 dimensional vector, unspecified as to whether it is a column or row vector. There is an element for each node in the layer.

In [17]:
print('Parameter #', 2, 'has size', Parameter_Tensor[1].size())
print('And looks like \n', Parameter_Tensor[1])

Parameter # 2 has size torch.Size([3])
And looks like 
 Parameter containing:
tensor([-0.2242,  0.1473, -0.7378], requires_grad=True)


By now examining the 3rd and 4th elements of Parameter_Tensor, we examine the weights and biases of the second layer. The shape of the biases tensor is identical to the first hidden layer, but the weights tensor now is in th form of a 3x3 matrix, as each node in the second hidden layer (each row) receives input from each node in the first hidden layer (each column).

In [18]:
print('Parameter #', 3, 'has size', Parameter_Tensor[2].size())
print('And looks like \n', Parameter_Tensor[2])
print('Parameter #', 4, 'has size', Parameter_Tensor[3].size())
print('And looks like \n', Parameter_Tensor[3])

Parameter # 3 has size torch.Size([3, 3])
And looks like 
 Parameter containing:
tensor([[-0.2523,  0.2746, -0.0982],
        [-0.1485, -0.4616, -0.2932],
        [ 0.1923,  0.3346, -0.2605]], requires_grad=True)
Parameter # 4 has size torch.Size([3])
And looks like 
 Parameter containing:
tensor([ 0.2470, -0.3116, -0.1488], requires_grad=True)


Finally, the output layer, only has one node, and so only 1 bias. It receives input from all three nodes in the previous layer and so it's weights tensor is a row vector, but this time, unlike the bias vector, this row shape is explicit.

In [19]:
print('Parameter #', 5, 'has size', Parameter_Tensor[4].size())
print('And looks like \n', Parameter_Tensor[4])
print('Parameter #', 6, 'has size', Parameter_Tensor[5].size())
print('And looks like \n', Parameter_Tensor[5])

Parameter # 5 has size torch.Size([1, 3])
And looks like 
 Parameter containing:
tensor([[0.1714, 0.0276, 0.2445]], requires_grad=True)
Parameter # 6 has size torch.Size([1])
And looks like 
 Parameter containing:
tensor([0.0946], requires_grad=True)


The next thing to do is to start to train the neural network. For that, 2 additional things need to be decided upon; the loss function, and the optimisation function. For the former, we will simply use MSE loss:

In [20]:
Loss_Function = nn.MSELoss()

And for the optimisation function, we will simply use stochastic gradient descent with a learning rate of 0.01.

In [21]:
optimizer = torch.optim.SGD(Torch_Network.parameters(), lr=0.01)

Borrowing the simple components of the loop from the Neural Networks part of the PyTorch Tutorial, we combine everything into a loop, so that we train the neural network. The function "optimizer.zero_grad()" apparently needs to be run.
After that, we calculate the networks prediction on the data. We are feeding it a Tensor of size 10, so it will sequentially (I think) use the network to evaluate each element in turn, and output a tensor of equal size to give the results.
The loss is then calculated, I believe this is a tensor as well
Then we just sort of *do* the backprop to work out the gradients in loss with respect to each weight and bias
Then we trigger the SGD to adjust each bias for each input data element

In [22]:
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Torch_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data)
        #time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''

print('Ready to end this')

Ready to end this


Seeing as we never used the final iteration of our network to calculate results, we do this last bit one last time:

In [23]:
Output_Data = Torch_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data)

tensor(0.0458, grad_fn=<MeanBackward0>)
tensor([[20.1411],
        [21.8972],
        [24.0307],
        [26.1969],
        [28.1849],
        [30.0871],
        [32.1267],
        [34.3484],
        [36.4528],
        [38.0558]], grad_fn=<AddmmBackward>)


And that's it! Currently, unless I put in something as simple as $y = 2x + 20$, the NN isn't good enough. Need to play around with

- shape

- activation functions

- learning rate

I don't really want to muck around with loss calculation or optimisation functions just yet

....
Some time Later
....

If we want to improve the network, we should increase the size of each layer and the number of layers. Let's create a new network, this time with 4 hidden layers, and 30 nodes per layer.

In [24]:
Better_Network = nn.Sequential(*[nn.Linear(1, 30), nn.Sigmoid(), nn.Linear(30, 30), nn.Sigmoid(), nn.Linear(30, 30), nn.Sigmoid(),
                                nn.Linear(30, 30), nn.Sigmoid(), nn.Linear(30, 1)])

The number of parameters in this network is much larger. It is $30*2 + 30*31 + 30*31 + 30*31 + 1*31 = $...

In [25]:
print(60 + 3*30*31 + 31)

2881


We won't change anything but the shape, so we can now reconfigure the optimiser and then run the training loop as before:

In [26]:
optimizer_2 = torch.optim.SGD(Better_Network.parameters(), lr=0.01)
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer_2.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer_2.step()    # Does the update
    '''
    if n % 100 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data)
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data)

Final Result:
tensor(0.0292, grad_fn=<MeanBackward0>)
tensor([[19.9913],
        [21.8427],
        [23.9340],
        [25.9418],
        [27.8624],
        [29.7803],
        [31.7710],
        [33.8421],
        [35.8780],
        [37.6839]], grad_fn=<AddmmBackward>)


After a bit of thought, it is possible to note that any network without activation function will essentially just be simplifiable to a linear equation in each of the input parameters, ie  $y = \alpha + ax_1 + bx_2 + cx_3$ ... Considering that our input equation is linear (essentially meaning can be expressed as a polynomial of order 1 or 0), we shouldn't need an activation function at all. Let's test this.

In [27]:
Simple_Network = nn.Sequential(*[nn.Linear(1, 30), nn.Linear(30, 30), nn.Linear(30, 30), nn.Linear(30, 30), nn.Linear(30, 1)])

optimizer_3 = torch.optim.SGD(Simple_Network.parameters(), lr=0.0001)
Max_Iterations = 1000

for n in range(Max_Iterations):
    optimizer_3.zero_grad()   # zero the gradient buffers
    Output_Data = Simple_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer_3.step()    # Does the update
    '''
    if n % 100 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data)
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Simple_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data)

Final Result:
tensor(1.0477e-10, grad_fn=<MeanBackward0>)
tensor([[20.0000],
        [22.0000],
        [24.0000],
        [26.0000],
        [28.0000],
        [30.0000],
        [32.0000],
        [34.0000],
        [36.0000],
        [38.0000]], grad_fn=<AddmmBackward>)


This result is almost perfect, and certainly perfect as far as we could ever reasonably require. Notice however, the learning rate has been set to 0.0001. Setting it at the higher rate of 0.01 lead to Nans by Epoch 10 at the latest, and 0.001 lead to results that were highly variable with a loss that rapidly spiked and reduced - clearly the minimisation was unstable and widely oscillated around an optimal minimisation direction. It may eventually have convergeed, I did not check.

With the very low learning rate however, the convergence is rapid and precise.

# Trying a more complicated result

Let's use our Better_Network to try and deduce a higher order polynomial: This time, we shall use a larger training data set and aim to fit the Equation $y = x^2 + x + 50$. First of all, we create the data:

This time, I have also chosen to use `np.arange` to create a data array in a style more similar to matlab. ie, i could specify `np.arange(start_value, stop_value, step_size)` where only `stop_value` is compulsory.

In [2]:
import numpy as np
import torch.nn as nn
import torch
import time

In [3]:
Input_Data = np.arange(100)
Input_Data = Input_Data.reshape(-1, 1)
Target_Data = Input_Data**2 + Input_Data + 50
print(Target_Data.reshape(1, -1)) # reshaped just so it fits on the page nicely
Input_Data = torch.tensor(Input_Data, dtype=torch.float32, requires_grad=True)
Target_Data = torch.tensor(Target_Data, dtype=torch.float32, requires_grad=True)

[[  50   52   56   62   70   80   92  106  122  140  160  182  206  232
   260  290  322  356  392  430  470  512  556  602  650  700  752  806
   862  920  980 1042 1106 1172 1240 1310 1382 1456 1532 1610 1690 1772
  1856 1942 2030 2120 2212 2306 2402 2500 2600 2702 2806 2912 3020 3130
  3242 3356 3472 3590 3710 3832 3956 4082 4210 4340 4472 4606 4742 4880
  5020 5162 5306 5452 5600 5750 5902 6056 6212 6370 6530 6692 6856 7022
  7190 7360 7532 7706 7882 8060 8240 8422 8606 8792 8980 9170 9362 9556
  9752 9950]]


Reinitialise the Network:

Here I have noted that the `*[]` construction inside `nn.sequential` is only necessary when creating the network over multiple lines.

In [4]:
Better_Network = nn.Sequential(nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 1))
print(Better_Network(torch.tensor([1.])))
print(Better_Network(torch.tensor([1.])))
print(Better_Network(torch.tensor([20.])))
print(Better_Network(torch.tensor([20.])))

tensor([0.3304], grad_fn=<AddBackward0>)
tensor([0.3304], grad_fn=<AddBackward0>)
tensor([0.3288], grad_fn=<AddBackward0>)
tensor([0.3288], grad_fn=<AddBackward0>)


Redefine the optimizer and run the training loop

In [5]:
Loss_Function = nn.MSELoss() # this has been put here just in case one wants to skip beyond the earlier stuff, and so miss out this line.
optimizer_4 = torch.optim.SGD(Better_Network.parameters(), lr=0.001)
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer_4.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer_4.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data.view(1, -1)) # reshaped for readability

Final Result:
tensor(8887777., grad_fn=<MeanBackward0>)
tensor([[3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998, 3382.9998,
         3382.9998, 3382.9998, 3382.9998, 33

This doesn't work it turns out. I believe what is happening is due to 2 things.

1. The optimisation is getting stuck on a saddle point or local minimum as the optimisation has no momentum


2. The extensive use of Tanh is too much.

Instead, let's use the Adam optimiser as in DeepMoD which seems to be a mix of gradient decent with momentum and batch normalisation. We will also change the output of the 4th hidden layer to be activated by the ReLU function which essentially does nothing if the output is positive, and sets it to zero if the output is negative.

In [6]:
Better_Network_2 = nn.Sequential(nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 1))

In [7]:
optimizer = torch.optim.Adam(Better_Network_2.parameters()) # Adam has a default lr which I have chosen to use
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network_2(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network_2(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data)
print(Loss)
print(Output_Data.view(1, -1)) # reshaped for readability

Final Result:
tensor(79.3329, grad_fn=<MeanBackward0>)
tensor([[  52.2700,   50.9930,   55.0556,   63.3784,   69.7567,   79.3826,
           91.5507,  105.4291,  121.9146,  140.4485,  160.9191,  182.8826,
          206.5374,  232.0595,  259.5842,  289.2013,  320.9562,  354.8628,
          390.9084,  429.0680,  469.3065,  511.5880,  555.8788,  602.1511,
          650.3813,  700.5551,  752.6631,  806.7026,  862.6768,  920.5922,
          980.4595, 1042.2909, 1106.1000, 1171.9000, 1239.7051, 1309.5266,
         1381.3743, 1455.2565, 1531.1786, 1609.1436, 1689.1501, 1771.1963,
         1855.2766, 1941.3848, 2029.5120, 2119.6489, 2211.7859, 2305.9131,
         2402.0220, 2500.1050, 2600.1567, 2702.1743, 2806.1572, 2912.1094,
         3020.0347, 3129.9417, 3241.8403, 3355.7400, 3471.6501, 3589.5786,
         3709.5288, 3831.5022, 3955.4939, 4081.4971, 4209.5029, 4339.5020,
         4471.4888, 4605.4619, 4741.4292, 4879.4058, 5019.4097, 5161.4541,
         5305.5415, 5451.6450, 5599.7026, 574

These results are finally much better. However, it is possible to make them better by scaling the data so that we are dealing with numbers between -1 and 1. I will scale the data by dividing by the maximum value in the array, prior to importing into a tensor

## Normalisation

In [1]:
import numpy as np
import torch.nn as nn
import torch
import time
Loss_Function = nn.MSELoss()

In [2]:
Input_Data = np.arange(100)
Input_Data = Input_Data.reshape(-1, 1)
Target_Data = Input_Data**2 + Input_Data + 50

# These are the lines where scaling takes place
Scaling_Factor = np.amax(Target_Data)
Input_Data_Scaled = Input_Data/Scaling_Factor
Target_Data_Scaled = Target_Data/Scaling_Factor

print(Target_Data_Scaled.reshape(1, -1)) # reshaped just so it fits on the page nicely

Input_Data = torch.tensor(Input_Data, dtype=torch.float32, requires_grad=True)
Target_Data = torch.tensor(Target_Data, dtype=torch.float32) #Turn this into a tensor also, just to allow computation of loss
Input_Data_Scaled = torch.tensor(Input_Data_Scaled, dtype=torch.float32, requires_grad=True)
Target_Data_Scaled = torch.tensor(Target_Data_Scaled, dtype=torch.float32, requires_grad=True)

[[0.00502513 0.00522613 0.00562814 0.00623116 0.00703518 0.0080402
  0.00924623 0.01065327 0.01226131 0.01407035 0.0160804  0.01829146
  0.02070352 0.02331658 0.02613065 0.02914573 0.03236181 0.03577889
  0.03939698 0.04321608 0.04723618 0.05145729 0.0558794  0.06050251
  0.06532663 0.07035176 0.07557789 0.08100503 0.08663317 0.09246231
  0.09849246 0.10472362 0.11115578 0.11778894 0.12462312 0.13165829
  0.13889447 0.14633166 0.15396985 0.16180905 0.16984925 0.17809045
  0.18653266 0.19517588 0.2040201  0.21306533 0.22231156 0.23175879
  0.24140704 0.25125628 0.26130653 0.27155779 0.28201005 0.29266332
  0.30351759 0.31457286 0.32582915 0.33728643 0.34894472 0.36080402
  0.37286432 0.38512563 0.39758794 0.41025126 0.42311558 0.4361809
  0.44944724 0.46291457 0.47658291 0.49045226 0.50452261 0.51879397
  0.53326633 0.5479397  0.56281407 0.57788945 0.59316583 0.60864322
  0.62432161 0.64020101 0.65628141 0.67256281 0.68904523 0.70572864
  0.72261307 0.73969849 0.75698492 0.77447236 0.79

Reinitialise the network:

In [3]:
Better_Network_2 = nn.Sequential(nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 1))

In the first instance, we will try to train the network after only scaling the target data.

In [4]:
optimizer = torch.optim.Adam(Better_Network_2.parameters())
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network_2(Input_Data) # Notice here that it is NOT Input_Data_Scaled
    Loss = Loss_Function(Output_Data, Target_Data_Scaled)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network_2(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data_Scaled)
print(Loss)
print(Output_Data.view(1, -1)) # reshaped for readability

Final Result:
tensor(1.5672e-06, grad_fn=<MeanBackward0>)
tensor([[0.0049, 0.0049, 0.0054, 0.0058, 0.0065, 0.0076, 0.0089, 0.0103, 0.0120,
         0.0139, 0.0159, 0.0180, 0.0202, 0.0225, 0.0255, 0.0290, 0.0325, 0.0362,
         0.0399, 0.0438, 0.0477, 0.0517, 0.0558, 0.0599, 0.0642, 0.0685, 0.0739,
         0.0801, 0.0863, 0.0926, 0.0990, 0.1054, 0.1120, 0.1186, 0.1253, 0.1321,
         0.1389, 0.1458, 0.1527, 0.1606, 0.1693, 0.1781, 0.1869, 0.1957, 0.2046,
         0.2136, 0.2226, 0.2316, 0.2406, 0.2496, 0.2594, 0.2705, 0.2817, 0.2928,
         0.3039, 0.3150, 0.3261, 0.3372, 0.3482, 0.3592, 0.3701, 0.3821, 0.3959,
         0.4095, 0.4230, 0.4364, 0.4497, 0.4629, 0.4759, 0.4888, 0.5016, 0.5166,
         0.5322, 0.5476, 0.5628, 0.5777, 0.5925, 0.6070, 0.6212, 0.6382, 0.6557,
         0.6729, 0.6898, 0.7063, 0.7226, 0.7385, 0.7541, 0.7717, 0.7911, 0.8100,
         0.8286, 0.8467, 0.8644, 0.8838, 0.9035, 0.9228, 0.9416, 0.9599, 0.9778,
         0.9937]], grad_fn=<ViewBackward>)


However, we need to scale the results back up as well now:

In [5]:
Output_Data = Output_Data*Scaling_Factor

print("Loss is:", Loss_Function(Output_Data, Target_Data))
print(Output_Data.view(1, -1)) # reshaped for readability

Loss is: tensor(155.1595, grad_fn=<MseLossBackward>)
tensor([[  48.8473,   48.9315,   53.2814,   57.4380,   64.6736,   75.2767,
           88.1173,  102.8596,  119.5233,  137.9920,  157.9742,  179.0984,
          201.0042,  223.3885,  253.5789,  288.3331,  323.6436,  360.0324,
          397.3939,  435.6455,  474.7223,  514.5760,  555.1672,  596.4682,
          638.4562,  681.1106,  735.6388,  796.5303,  858.3719,  921.1119,
          984.7042, 1049.1069, 1114.2838, 1180.1959, 1246.8099, 1314.0876,
         1381.9955, 1450.4958, 1519.5520, 1598.3336, 1684.8268, 1771.9187,
         1859.5593, 1947.6923, 2036.2635, 2125.2153, 2214.4890, 2304.0227,
         2393.7571, 2483.6277, 2580.7625, 2691.5605, 2802.4282, 2913.2837,
         3024.0452, 3134.6299, 3244.9543, 3354.9373, 3464.4968, 3573.5520,
         3682.0229, 3802.2026, 3938.8142, 4074.4907, 4209.1372, 4342.6641,
         4474.9814, 4606.0083, 4735.6626, 4863.8687, 4990.5552, 5140.2793,
         5295.5347, 5448.7031, 5599.7148, 5748.

This doesn't immediately seem to work much better, so now we will try with Input Data that was also scaled. First, we reinitialise the network, then we train and see our result:

In [6]:
Better_Network_2 = nn.Sequential(nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 1))

In [7]:
optimizer = torch.optim.Adam(Better_Network_2.parameters())
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network_2(Input_Data_Scaled) # Notice here that it is NOT Input_Data_Scaled
    Loss = Loss_Function(Output_Data, Target_Data_Scaled)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network_2(Input_Data_Scaled)
Output_Data = Output_Data*Scaling_Factor
print("Loss is:", Loss_Function(Output_Data, Target_Data))
print(Output_Data.view(1, -1)) # reshaped for readability

Final Result:
Loss is: tensor(12228.5029, grad_fn=<MseLossBackward>)
tensor([[-7.7099e+01, -5.1873e+01, -2.6623e+01, -1.3481e+00,  2.3947e+01,
          4.9264e+01,  7.4604e+01,  9.9966e+01,  1.2535e+02,  1.5076e+02,
          1.7619e+02,  2.0164e+02,  2.2711e+02,  2.5260e+02,  2.7812e+02,
          3.0366e+02,  3.3170e+02,  3.6725e+02,  4.0282e+02,  4.3842e+02,
          4.7403e+02,  5.0968e+02,  5.4534e+02,  5.8102e+02,  6.1673e+02,
          6.5246e+02,  6.8820e+02,  7.2398e+02,  7.5977e+02,  7.9558e+02,
          8.3141e+02,  8.6726e+02,  9.5209e+02,  1.0387e+03,  1.1253e+03,
          1.2119e+03,  1.2985e+03,  1.3852e+03,  1.4719e+03,  1.5586e+03,
          1.6453e+03,  1.7320e+03,  1.8188e+03,  1.9056e+03,  1.9923e+03,
          2.0792e+03,  2.1660e+03,  2.2528e+03,  2.3397e+03,  2.4266e+03,
          2.5134e+03,  2.6003e+03,  2.6927e+03,  2.8120e+03,  2.9313e+03,
          3.0506e+03,  3.1700e+03,  3.2894e+03,  3.4088e+03,  3.5282e+03,
          3.6476e+03,  3.7671e+03,  3.8865e

Okay, this was horrible in comparison. The scale up in the end is an external process to the NN, just to allow us to see how good a job it did more intuitively. We know what the relationship between Target_Data and Target_Data_Scaled is, and if Output_Data approximates Target_Data_Scaled well, then making the same transformation should yeild a small loss between it and Target_Data.

So the question is which scaling allows the best approximation of Target_Data_Scaled.

- Unscaled Input -> Unscaled Target was alright

- Unscaled Input -> Scaled Target was about the same. Running both multiple times shows that either one can perform better on any given attempt.

- Scaled Input -> Scaled target produced a terrible approximation.

Remember, the issue is not how well the network approximates $y = x^2 + x + 50$, but simply how well it approximates the given target data.

The issue is potentially that the scaling factor used is the same for both Data sets. Let us try independantly normalising the two:

### Double Normalisation

In [1]:
import numpy as np
import torch.nn as nn
import torch
import time
Loss_Function = nn.MSELoss()

In [2]:
Input_Data = np.arange(100)
Input_Data = Input_Data.reshape(-1, 1)
Target_Data = Input_Data**2 + Input_Data + 50

# These are the lines where scaling takes place
Scaling_Factor_Input = np.amax(Input_Data)
Scaling_Factor_Target = np.amax(Target_Data)
Input_Data_Scaled = Input_Data/Scaling_Factor_Input
Target_Data_Scaled = Target_Data/Scaling_Factor_Target

print(Target_Data_Scaled.reshape(1, -1)) # reshaped just so it fits on the page nicely

Target_Data = torch.tensor(Target_Data, dtype=torch.float32) #Turn this into a tensor also, just to allow computation of loss
Input_Data_Scaled = torch.tensor(Input_Data_Scaled, dtype=torch.float32, requires_grad=True)
Target_Data_Scaled = torch.tensor(Target_Data_Scaled, dtype=torch.float32, requires_grad=True)

[[0.00502513 0.00522613 0.00562814 0.00623116 0.00703518 0.0080402
  0.00924623 0.01065327 0.01226131 0.01407035 0.0160804  0.01829146
  0.02070352 0.02331658 0.02613065 0.02914573 0.03236181 0.03577889
  0.03939698 0.04321608 0.04723618 0.05145729 0.0558794  0.06050251
  0.06532663 0.07035176 0.07557789 0.08100503 0.08663317 0.09246231
  0.09849246 0.10472362 0.11115578 0.11778894 0.12462312 0.13165829
  0.13889447 0.14633166 0.15396985 0.16180905 0.16984925 0.17809045
  0.18653266 0.19517588 0.2040201  0.21306533 0.22231156 0.23175879
  0.24140704 0.25125628 0.26130653 0.27155779 0.28201005 0.29266332
  0.30351759 0.31457286 0.32582915 0.33728643 0.34894472 0.36080402
  0.37286432 0.38512563 0.39758794 0.41025126 0.42311558 0.4361809
  0.44944724 0.46291457 0.47658291 0.49045226 0.50452261 0.51879397
  0.53326633 0.5479397  0.56281407 0.57788945 0.59316583 0.60864322
  0.62432161 0.64020101 0.65628141 0.67256281 0.68904523 0.70572864
  0.72261307 0.73969849 0.75698492 0.77447236 0.79

In [3]:
Better_Network_2 = nn.Sequential(nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 1))

In [4]:
optimizer = torch.optim.Adam(Better_Network_2.parameters())
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network_2(Input_Data_Scaled) # Notice here that it is NOT Input_Data_Scaled
    Loss = Loss_Function(Output_Data, Target_Data_Scaled)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network_2(Input_Data_Scaled)
Output_Data = Output_Data*Scaling_Factor_Target
print("Loss is:", Loss_Function(Output_Data, Target_Data))
print(Output_Data.view(1, -1)) # reshaped for readability

Final Result:
Loss is: tensor(159.8869, grad_fn=<MseLossBackward>)
tensor([[  48.5908,   53.7471,   58.8083,   63.7739,   71.6121,   80.1517,
           88.6266,   97.0361,  117.2570,  137.6169,  157.9832,  178.3547,
          198.7278,  225.6299,  258.8672,  292.1520,  325.4800,  358.8480,
          392.2490,  425.6808,  459.1374,  502.9512,  553.3946,  603.9490,
          654.6070,  705.3601,  756.1992,  807.1157,  858.1019,  909.1488,
          960.2452, 1014.7471, 1088.3585, 1162.1637, 1236.1511, 1310.3107,
         1384.6320, 1459.1012, 1533.7090, 1608.4413, 1683.2867, 1760.9408,
         1850.1373, 1939.4685, 2028.9198, 2118.4749, 2208.1172, 2297.8303,
         2387.5969, 2477.3989, 2580.6611, 2690.9771, 2801.3008, 2911.6064,
         3021.8674, 3132.0598, 3242.1543, 3352.1267, 3461.9487, 3571.5935,
         3681.0317, 3806.6938, 3940.4795, 4074.0957, 4207.5049, 4340.6704,
         4473.5566, 4606.1284, 4738.3462, 4870.1743, 5001.5728, 5132.5059,
         5280.3901, 5436.6855, 55

Okay, this hasn't really helped at all. The approximation is about as good, once again, at no scaling whatsoever. Perhaps the values are too small? Let's try a logaritmic scaling.

### Log scaling

Remember if $\log_{10}(a) = b$ then $10^b = a$. This will be the reversing transformation at the end.

Also, log scaling like this will introduce negative values for elements less than 1. Nothing in my dataset is less than 1 prior to scaling, but of course this is just luck. this ay be an issue simply because I haven't experimented with negative values yet and am not sure what teh affect on choice of activation functions will be (ie sigmoid has no negative output, unlike tanh)

Let's just try both applying a logarithm to just the target data and to both the input and the target.

First, just the target:

In [1]:
import numpy as np
import torch.nn as nn
import torch
import time
Loss_Function = nn.MSELoss()

In [2]:
Input_Data = np.arange(100)
Input_Data = Input_Data.reshape(-1, 1)
Target_Data = Input_Data**2 + Input_Data + 50

# These are the lines where scaling takes place
Input_Data_Scaled = Input_Data # No change here, but name change to reuse code
Target_Data_Scaled = np.log10(Target_Data)

print(Target_Data_Scaled.reshape(1, -1)) # reshaped just so it fits on the page nicely

Target_Data = torch.tensor(Target_Data, dtype=torch.float32) #Turn this into a tensor also, just to allow computation of loss
Input_Data_Scaled = torch.tensor(Input_Data_Scaled, dtype=torch.float32, requires_grad=True)
Target_Data_Scaled = torch.tensor(Target_Data_Scaled, dtype=torch.float32, requires_grad=True)

[[1.69897    1.71600334 1.74818803 1.79239169 1.84509804 1.90308999
  1.96378783 2.02530587 2.08635983 2.14612804 2.20411998 2.26007139
  2.31386722 2.36548798 2.41497335 2.462398   2.50785587 2.55145
  2.59328607 2.63346846 2.67209786 2.70926996 2.74507479 2.77959649
  2.81291336 2.84509804 2.87621784 2.90633504 2.93550727 2.96378783
  2.99122608 3.01786772 3.04375513 3.06892761 3.09342169 3.1172713
  3.14050804 3.16316137 3.18525877 3.20682588 3.2278867  3.24846372
  3.26857797 3.28824923 3.30749604 3.32633586 3.34478512 3.3628593
  3.380573   3.39794001 3.41497335 3.43168534 3.44808767 3.46419137
  3.48000694 3.49554434 3.51081301 3.52582195 3.54057972 3.55509445
  3.56937391 3.5834255  3.59725628 3.610873   3.6242821  3.63748973
  3.65050179 3.66332393 3.67596155 3.68841982 3.70070372 3.712818
  3.72476725 3.73655585 3.74818803 3.75966784 3.77099921 3.78218587
  3.79323145 3.80413943 3.81491318 3.82555593 3.83607081 3.84646083
  3.85672889 3.86687781 3.87691031 3.886829   3.8966364

In [3]:
Better_Network_2 = nn.Sequential(nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 1))

In [4]:
optimizer = torch.optim.Adam(Better_Network_2.parameters())
Max_Iterations = 10000

print("Now starting training ...")

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network_2(Input_Data_Scaled) # Notice here that it is NOT Input_Data_Scaled
    Loss = Loss_Function(Output_Data, Target_Data_Scaled)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network_2(Input_Data_Scaled)
Output_Data = 10**Output_Data # Notice the reversal of the logarithm is here, and not a simple scale factor liek before
print("Loss is:", Loss_Function(Output_Data, Target_Data))
print(Output_Data.view(1, -1)) # reshaped for readability

Now starting training ...
Final Result:
Loss is: tensor(323.8916, grad_fn=<MseLossBackward>)
tensor([[  50.0038,   52.0103,   55.9414,   62.1794,   69.8635,   79.5464,
           92.2009,  106.4471,  122.4087,  140.2052,  159.9356,  181.6750,
          205.4736,  231.3596,  259.3419,  289.4162,  321.5676,  355.7767,
          392.0223,  430.2835,  470.5422,  512.7838,  556.9971,  603.1741,
          651.3096,  701.4017,  753.4480,  807.4496,  863.4081,  921.3236,
          981.1986, 1043.0355, 1106.8380, 1172.6067, 1240.3473, 1310.0623,
         1381.7561, 1455.4355, 1531.1040, 1608.7714, 1688.4447, 1770.1328,
         1853.8477, 1939.5975, 2027.3962, 2117.2554, 2209.1924, 2303.2163,
         2399.3445, 2496.6638, 2595.8093, 2697.0361, 2800.3540, 2905.7788,
         3013.3242, 3122.9995, 3234.8159, 3348.7874, 3464.9155, 3583.2092,
         3703.6768, 3826.3203, 3951.1372, 4078.1294, 4207.2949, 4338.6235,
         4472.1123, 4607.7446, 4745.5151, 4885.3955, 5027.3706, 5171.4189,
       

Next, both log scaled:

However, note that the first element of Input_Data is 0, so the logarithm of it will be $-\infty$. This isn't brilliant, but I am just going to not have the 0 for this test.

In [5]:
Input_Data = np.arange(1, 100) # Set a starting value of 1, rather than the default of 0. This results in 1 fewer eleent in the array than in previous examples.
Input_Data = Input_Data.reshape(-1, 1)
Target_Data = Input_Data**2 + Input_Data + 50

# These are the lines where scaling takes place
Input_Data_Scaled = np.log10(Input_Data)
Target_Data_Scaled = np.log10(Target_Data)

Target_Data = torch.tensor(Target_Data, dtype=torch.float32) #Turn this into a tensor also, just to allow computation of loss
Input_Data_Scaled = torch.tensor(Input_Data_Scaled, dtype=torch.float32, requires_grad=True)
Target_Data_Scaled = torch.tensor(Target_Data_Scaled, dtype=torch.float32, requires_grad=True)


Better_Network_2 = nn.Sequential(nn.Linear(1, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.Tanh(),
                                nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 1))


optimizer = torch.optim.Adam(Better_Network_2.parameters())
Max_Iterations = 10000

print("Now starting training ...")

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = Better_Network_2(Input_Data_Scaled) # Notice here that it is NOT Input_Data_Scaled
    Loss = Loss_Function(Output_Data, Target_Data_Scaled)
    Loss.backward()
    optimizer.step()    # Does the update
    '''
    if n % 10 == 0:
        print('Reached Epoch', n)
        print('Loss is', Loss)
        print('Current Prediction is', Output_Data.view(1, -1)) # reshaped for readability
        time.sleep(5) # This is just for my convenience so i can see what is happening before it continues
    '''
print('Final Result:')
Output_Data = Better_Network_2(Input_Data_Scaled)
Output_Data = 10**Output_Data # Notice the reversal of the logarithm is here, and not a simple scale factor liek before
print("Loss is:", Loss_Function(Output_Data, Target_Data))
print(Output_Data.view(1, -1)) # reshaped for readability

[[1.71600334 1.74818803 1.79239169 1.84509804 1.90308999 1.96378783
  2.02530587 2.08635983 2.14612804 2.20411998 2.26007139 2.31386722
  2.36548798 2.41497335 2.462398   2.50785587 2.55145    2.59328607
  2.63346846 2.67209786 2.70926996 2.74507479 2.77959649 2.81291336
  2.84509804 2.87621784 2.90633504 2.93550727 2.96378783 2.99122608
  3.01786772 3.04375513 3.06892761 3.09342169 3.1172713  3.14050804
  3.16316137 3.18525877 3.20682588 3.2278867  3.24846372 3.26857797
  3.28824923 3.30749604 3.32633586 3.34478512 3.3628593  3.380573
  3.39794001 3.41497335 3.43168534 3.44808767 3.46419137 3.48000694
  3.49554434 3.51081301 3.52582195 3.54057972 3.55509445 3.56937391
  3.5834255  3.59725628 3.610873   3.6242821  3.63748973 3.65050179
  3.66332393 3.67596155 3.68841982 3.70070372 3.712818   3.72476725
  3.73655585 3.74818803 3.75966784 3.77099921 3.78218587 3.79323145
  3.80413943 3.81491318 3.82555593 3.83607081 3.84646083 3.85672889
  3.86687781 3.87691031 3.886829   3.89663643 3.90

While it is nice to see that applying logs isn't breaking anything (as long as all the data is $\geq 0$), there was also no real advantage. Let's try changing the network again:

## Testing Variations

Now we will use the `nn.ModuleList` object to test a number of networks of various types to see which manages to produce the lowest loss on the same problem.

We can vary

- The amount of layers

- The size of the layers

- The use of ReLU, Tanh and sigmoid (R, T, S)

However i will not change the optimiser now and I will use the optimiser's default learning rate.

In [1]:
import numpy as np
import torch.nn as nn
import torch
import time
Loss_Function = nn.MSELoss()

In order to test all the variations, I have created a general purpose permutation function below. The idea is that you tell the function the amount of layers (Instances) and the possible values that each layer can be described by. In this case, I will feed it a list of `[1, 2, 3]` with 1 corresponding to a Sigmoid function, 2 corresponding to a Tanh, and 3 corresponding to a ReLU. However, this translation is not done in this block of code.

The logic of the function is that I track 2 lists. I do the leg work on a list that iterates through indices, ie [0, 0, 1, 2] then [0, 0, 1, 3] and then each iteration use this list of indices to create a new list where the value at each point in the index list chooses the value in my new list. I then add this permutation to my grand list of lists.

In [5]:
def Permutate(Instances, Option_List):
    '''
    Instances: Integer
            Number of elements in list, each of which is expected to be able to take any value given in Option_List
    Option_List: List
            List of options that each element in the list can be
    '''
    
    Element_Index = -1 # negative as we cycle back from the final index
    Option_Index = 0
    Option_Total = len(Option_List)
    Permutation_Tracker = [0 for _ in range(Instances)] # List of indices used to do the legwork for easier cranking through permutations
    Current_Permutation = [Option_List[0] for _ in range(Instances)] # assigns an option to each element based on the index in the current Permutation_Tracker

    List_of_Permutations = [list(Current_Permutation)] # The default list of zeroth options can be immediately added
  
    while True:
        if Permutation_Tracker[Element_Index] == Option_Total-1: #flags that the highest option has been reached for this point in the list, and so slides up
            Element_Index -= 1
            if -1*Element_Index > Instances: #final flag to say that to perform any more permutations, would need to fall off the begining of the list
                break

        else: # in case where sliding was not triggered, this means that the index at this point in the tracker can be notched up
            Permutation_Tracker[Element_Index] += 1
            for n in range(Element_Index+1, 0): # and all indices to the right can be zeroed
                Permutation_Tracker[n] = 0

            Element_Index = -1 # and we begin from the rightmost point again

            for i in range(Instances): # uses the indices described by the tracker to create a list from the provided options
                Current_Permutation[i] = Option_List[Permutation_Tracker[i]]

            List_of_Permutations.append(list(Current_Permutation))

    return List_of_Permutations

I now use this permutation function to create a list of distinct neural networks, which are appended into a list using `nn.ModuleList`. The idea is that I

1. Examine all each number of hidden layers (up to 4)

2. For each number of hidden layers, examine the affect on the size of the layers (all size 10, to all size 30)

3. For each size and number of hidden layer combination, examine each combination of activation functions at each hidden layer.

This last one proved very hard to manage as the amount of permutations is dependant on the number of layers, hence the number of nested loops needed to be a variable itself!

The solution was to use my Permutate function above, and create a reference, fully shaped NN, through which I could modify the activation functions present through indexing provided in Permutate().

> Note: This probably means, that for each shape of NN, the random initialisation of the weights is the same across all permutations of activation functions. 

In [3]:
TrialNNs = nn.ModuleList([])
All_Sizes = []
All_Perms = []
for layers in [1,2,3,4]:
    for size in [10,20,30]:
        # Initial shaping of Network
        NewNetworkList = [nn.Linear(1, size), nn.Sigmoid()]
        for layer in range(1, layers):
            NewNetworkList.append(nn.Linear(size, size))
            NewNetworkList.append(nn.Sigmoid)
            
        NewNetworkList.append(nn.Linear(size, 1))  
        
        
        Available_Permutations = Permutate(layers, [1,2,3])
        
        for Curr_Perm in Available_Permutations: #Curr_Perm is itself a list, containing the indices for defining the activation functions in this permutation
            All_Sizes += [size]
            All_Perms.append(Curr_Perm)
            for Act_layer, Act_Type in enumerate(Curr_Perm):
                index = (Act_layer*2)+1 # The activation functions are present at point indexed to be 1, 3, 5 etc (seperated by the nn.Linear)
                
                if Act_Type == 1:
                    NewNetworkList[index] = nn.Sigmoid()
                elif Act_Type == 2:
                    NewNetworkList[index] = nn.Tanh()
                else:
                    NewNetworkList[index] = nn.ReLU()
                
            NewNetwork = nn.Sequential(*NewNetworkList)
            TrialNNs.append(NewNetwork)

Now that the list of NNs is ready, we modify the standard loop from before to train each NN in the list in turn, and print the loss for each.

In [4]:
Input_Data = np.arange(100)
Input_Data = Input_Data.reshape(-1, 1)
Target_Data = Input_Data**2 + Input_Data + 50

Target_Data = torch.tensor(Target_Data, dtype=torch.float32)
Input_Data = torch.tensor(Input_Data, dtype=torch.float32, requires_grad=True)

Loss_Tracker = []

for NNIndex, TrialModule in enumerate(TrialNNs):
    optimizer = torch.optim.Adam(TrialModule.parameters())
    Max_Iterations = 10000

    for n in range(Max_Iterations):
        optimizer.zero_grad()   # zero the gradient buffers
        Output_Data = TrialModule(Input_Data)
        Loss = Loss_Function(Output_Data, Target_Data)
        Loss.backward()
        optimizer.step()    # Does the update

    print('For the module of size:', All_Sizes[NNIndex], ', and type:', All_Perms[NNIndex])
    Output_Data = TrialModule(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data).item() #I don't want PyTorch to track loss back through previous modules, so this is convenient place to set Loss as a number
    print("Loss is:", Loss)
    if len(Loss_Tracker) > 0 and Loss < min(Loss_Tracker):
        print('New Record! \n \n')
    Loss_Tracker += [Loss]
    
print('Done!')

For the module of size: 10 , and type: [1]
Loss is: 19587660.0
For the module of size: 10 , and type: [2]
Loss is: 18879778.0
New Record! 
 

For the module of size: 10 , and type: [3]
Loss is: 146501.5
New Record! 
 

For the module of size: 20 , and type: [1]
Loss is: 18911486.0
For the module of size: 20 , and type: [2]
Loss is: 17619076.0
For the module of size: 20 , and type: [3]
Loss is: 158141.765625
For the module of size: 30 , and type: [1]
Loss is: 18261726.0
For the module of size: 30 , and type: [2]
Loss is: 16433583.0
For the module of size: 30 , and type: [3]
Loss is: 430116.40625
For the module of size: 10 , and type: [1, 1]
Loss is: 19586834.0
For the module of size: 10 , and type: [1, 2]
Loss is: 18878858.0
For the module of size: 10 , and type: [1, 3]
Loss is: 811.5999755859375
New Record! 
 

For the module of size: 10 , and type: [2, 1]
Loss is: 17921158.0
For the module of size: 10 , and type: [2, 2]
Loss is: 17263256.0
For the module of size: 10 , and type: [2, 3]

In [10]:
BestIndex = np.argmin(np.array(Loss_Tracker))
print('The best network was the network module of description:', TrialNNs[BestIndex])
print('It had a loss of', Loss_Tracker[BestIndex])
print('And was able to produce an Output of:', TrialNNs[BestIndex](Input_Data).view(1, -1))

The best network was the network module of description: Sequential(
  (0): Linear(in_features=1, out_features=30, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=30, out_features=30, bias=True)
  (3): Tanh()
  (4): Linear(in_features=30, out_features=30, bias=True)
  (5): ReLU()
  (6): Linear(in_features=30, out_features=1, bias=True)
)
It had a loss of 7.395278453826904
And was able to produce an Output of: tensor([[176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778,
         176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778,
         176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778,
         176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778,
         176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778,
         176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778,
         176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778, 176.4778,
         176.4778, 176.4778, 

Okay, so mixed findings here. Let's start with what is positive. Learnings include:

- Most networks are bad. Very few produced final loss values of under 100. This tells us not to worry if teh network isn't working as it needs to be designed just right!

- It is a very good idea to end with a ReLU

- Size isn't the most important thing. The lowest loss was from a network with only 3 layers, although the 4-layer network most similar to it (an extra Sigmoid layer at the start) also produced almost as good loss. This implies that there is a ceiling at which point increasing the number of hidden layers is no use.

- Stepping up the activations from Sigmoid -> Tanh -> ReLU seems to work best.

Now the bad. As can be seen when i tried to print it at the end. Something perplexing has occurred. The best network has a terrible output!! And in a way that makes 0 sense. How was a loss <10 calculated from that abysmal display???

To check what is going on, let's recreate the network and test it fresh.

In [15]:
But_Why_NN = nn.Sequential(nn.Linear(1, 30), nn.Sigmoid(), nn.Linear(30, 30), nn.Tanh(), nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 1))

optimizer = torch.optim.Adam(But_Why_NN.parameters())
Max_Iterations = 10000

for n in range(Max_Iterations):
    optimizer.zero_grad()   # zero the gradient buffers
    Output_Data = But_Why_NN(Input_Data)
    Loss = Loss_Function(Output_Data, Target_Data)
    Loss.backward()
    optimizer.step()    # Does the update

Output_Data = TrialModule(Input_Data)
Loss = Loss_Function(Output_Data, Target_Data).item()
print(Output_Data.view(1, -1))
print('Loss is:', Loss)

tensor([[370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593, 370.8593,
         370.8593, 370.8593, 370.8593, 370.8593, 370

Well then, I give up, I don't know what happened, but at least on re-creation, the loss isn't lying to me.....