Warning: this project is heavily experimental; it is more of a research project.
PAGNN stands for Persistent Artificial Graph-based Neural Networks.
s_{t+1} = σ(s_t W + b), where:

- W is a square adjacency matrix of shape N x N.
- s_t is a row vector of shape 1 x N.
- N is the total number of neurons in the network.
- σ(...) is an activation function.
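A minimal sketch of this state update in PyTorch (the `pagnn_step` helper and the choice of `torch.relu` as the activation are illustrative assumptions, not the repository's actual API):

```python
import torch

def pagnn_step(s_t, W, b, activation=torch.relu):
    """One PAGNN state update: s_{t+1} = activation(s_t @ W + b)."""
    return activation(s_t @ W + b)

N = 5                            # total number of neurons
s_t = torch.zeros(1, N)          # state row vector, shape 1 x N
W = torch.randn(N, N)            # square adjacency (weight) matrix, shape N x N
b = torch.zeros(N)               # per-neuron bias

s_next = pagnn_step(s_t, W, b)   # output keeps the 1 x N shape
```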
- PAGNNs are a stateful artificial neural network architecture that is as general as possible. They can operate on temporal & non-temporal datasets with no modifications to the API or architecture.
- They also have a general topology, represented by square adjacency matrices:
- The architecture itself makes no assumptions about the topology other than the TOTAL number of neurons. In FFNNs, tons of assumptions are heuristically made about the topology beyond the neuron count including: number of layers, number of neurons per layer, density per layer, and the directionality of dataflow.
- A fully dense PAGNN is a topological “complete graph”. This means that every neuron is connected to every other neuron in the network bi-directionally, including itself:
- However, PAGNNs do not need to be fully dense! There has been a great deal of research into sparsity for FFNNs, and it has been shown that sufficiently deep networks (ex. ResNet50) pruned to have fewer nonzero weights than a shallower network of similar architecture (ex. ResNet18) still significantly outperform their smaller counterparts. In some cases they suffer less than 1% accuracy loss relative to their dense base architecture, whereas demoting the deeper network to a shallower one can cost significant accuracy (ex. going from ResNet50 to ResNet18 on ImageNet loses about 7-9% top-1 accuracy, whereas pruning a ResNet50 down to ResNet18's parameter count can lose under 1%). However, I hypothesize that sparsity in PAGNNs can produce results beyond redundancy reduction: truly more general learners. In the figure below, the 2 topologies on the left are sparse PAGNNs, and the one on the right is a fully dense PAGNN, all with 7 total neurons (nodes):
- It is also worth mentioning that an FFNN with n neurons, k layers, and per-layer densities d_k is a special case of a PAGNN with n neurons. This holds for all possible values of n, k, and d_k. So, intuitively, a PAGNN topology optimizer could find something similar to an FFNN as one of its intermediate solutions. In other words, FFNNs are pruned PAGNNs (a sketch of this embedding appears after this list).
- So if we prove that an FFNN can generalize better than a fully dense PAGNN in some cases, then we already have our answer: pruned PAGNNs have the potential to outperform dense PAGNNs. According to early experimentation, this seems to be the case. This is very exciting because I hypothesize that the topology is as important as the parameterization, if not more so.
- I also hypothesize that it may not even be necessary for the engineer to determine the number of neurons a PAGNN should have. This may also be learned, and it is far simpler with this sort of network than with an FFNN (see NEAT).
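As an illustration of the "FFNNs are pruned PAGNNs" point above, here is a rough sketch (not the repository's code) of how a tiny 2-3-1 FFNN's weights could be embedded into a single PAGNN-style adjacency matrix, where the layered topology becomes just a sparsity pattern:

```python
import torch

# Hypothetical example: embed a 2-3-1 feedforward network into one 6x6
# adjacency matrix (neurons 0-1 = inputs, 2-4 = hidden, 5 = output).
N = 6
W = torch.zeros(N, N)

W_in_hidden = torch.randn(2, 3)    # layer 0 weights (2 inputs -> 3 hidden)
W_hidden_out = torch.randn(3, 1)   # layer 1 weights (3 hidden -> 1 output)

W[0:2, 2:5] = W_in_hidden          # only input -> hidden edges are nonzero
W[2:5, 5:6] = W_hidden_out         # only hidden -> output edges are nonzero

# Every other entry (including self-connections) stays 0, so this adjacency
# matrix is simply a pruned version of the fully dense complete graph.
print((W != 0).int())
```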
Due to the very arbitrary representation of PAGNNs, it may be hard to understand how we actually provide our input data & return our predictions. It helps to compare PAGNNs to FFNNs:
- FFNNs: We have some feature column vector X that contains our input data. Let's say we have 2 features we want to operate upon; that means our first layer's weight matrix has a dimensionality of 2 x N (remember that for FFNNs we always transpose W before applying it to our input data, so after transposition it becomes N x 2), and our input column vector X has a dimensionality of 2 x 1. There is nowhere to store these features other than X, so we simply take the matrix multiplication between W^T & X (shapes = N x 2 & 2 x 1). The only "state" an FFNN has is the intermediary calculations between layers. This W^T X matmul gives us an output Z_0, which is the intermediary output for layer 0; this is now the input to layer 1. It is thrown away after a single X sample, so FFNNs are not stateful.
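For reference, the stateless layer computation described above might look like this in PyTorch (illustrative only; the shapes follow the 2-feature example):

```python
import torch

N = 4                      # hidden layer width (illustrative)
W = torch.randn(2, N)      # layer-0 weight matrix, shape 2 x N
X = torch.randn(2, 1)      # input column vector, shape 2 x 1

Z0 = W.T @ X               # W^T X, shapes (N x 2) @ (2 x 1) -> N x 1
# Z0 is the intermediary output of layer 0 and becomes the input to layer 1.
# Nothing is carried over to the next sample, so the network is not stateful.
```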
- Loading X into a PAGNN: We still have some feature column vector X containing our input data. Again, we'll say we have 2 features we want to operate upon, so again X has the shape 2 x 1. However, now we introduce the PAGNN's state variable s_t. It is a row vector of shape 1 x N, where N is the total number of neurons in the network, and it is initialized to all 0s. N must be greater than or equal to the number of input features. Before we can input any data, we must first choose neurons to allocate as input neurons, which in this case must be 2 since we have 2 input features in X. For simplicity, we will also say that this network has 5 total neurons. Now, before we can do anything, we have to load the input features from X into s_t. We do so by simply setting the first 2 elements of s_t (technically this can be an arbitrary choice, but for simplicity we choose the first 2) to the 2 elements of X. This is called "loading input". Now our s_t variable has 2 feature values followed by 3 trailing 0s. All parameters in the network are stored in the adjacency matrix W, a square matrix of shape N x N. Conceptually, W represents the weightings between neurons (nodes), where W[i, j] is the weight (or edge value) for the connection between neuron i & neuron j. To get the next state s_{t+1}, we take the matrix multiplication between s_t & W (shapes = 1 x 5 & 5 x 5). Conceptually this is the transfer of energy between neurons: each element of the state vector is passed through the edges of the network, and each neuron takes the sum of all incoming weighted energies. The output of this operation is 1 x N, or in this case 1 x 5, which is the exact shape we started with for s_t. This output row vector becomes s_{t+1}. That completes 1 full PAGNN step.
- Extracting predictions from a PAGNN: If we want our output vector Y to be a feature vector of size 1, we simply take the s_t vector (again, of size 1 x 5) and pull out the last element. Again, this choice can be arbitrary, but for simplicity we pull out the last element!
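A minimal sketch of the load → step → extract procedure described above, using plain PyTorch tensors (the variable names and the ReLU activation are illustrative, not the repository's API):

```python
import torch

N = 5                              # total neurons
num_steps = 3                      # how many PAGNN steps to run

X = torch.tensor([0.7, -1.2])      # 2 input features
W = torch.randn(N, N)              # adjacency / weight matrix, N x N
b = torch.zeros(N)

# Load input: write the features into the first 2 neurons of the state.
s = torch.zeros(1, N)
s[0, :2] = X

# Step: repeatedly transfer "energy" between neurons.
for _ in range(num_steps):
    s = torch.relu(s @ W + b)      # s keeps shape 1 x N

# Extract output: read the prediction from the last neuron.
y = s[0, -1]
```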
- PAGNN input/output caveats: Technically the number of neurons in a PAGNN can be equal to `max(num_input_features, num_output_features)`.
  - Although, this means some neurons will double as inputs & outputs. It is unknown if this is useful, and it is a topic of active research.
  - Also, depending on the topology of the adjacency matrix W & the chosen number of steps, the output neurons' states may remain 0 even after a number of steps. If W's topology is disconnected, where some inputs are isolated from some outputs both directly and indirectly, this "empty output neurons" problem will necessarily happen. This is why training PAGNNs can be somewhat cumbersome if done naively (a rough reachability check is sketched after this list).
  - Fully dense PAGNNs do not suffer from the "empty neurons" problem mentioned above; however, they may contain noise/interference. Again, this is under active research, so this is speculative.
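As referenced above, one way to detect the disconnected-topology case is to check whether every output neuron can receive signal from at least one input neuron within the chosen number of steps. A rough sketch (hypothetical helper, not part of the repository):

```python
import torch

def outputs_reachable(W, input_idx, output_idx, max_steps):
    """Check that every output neuron is reachable from some input neuron
    within `max_steps` hops, using the boolean adjacency structure of W."""
    adj = (W != 0).float()                   # 1 where an edge exists
    reach = torch.zeros(1, W.shape[0])
    reach[0, input_idx] = 1.0                # start from the input neurons
    for _ in range(max_steps):
        reach = ((reach @ adj) + reach).clamp(max=1.0)   # propagate one hop
    return bool(reach[0, output_idx].min() > 0)

# Example: 5 neurons, inputs = {0, 1}, output = {4}, 3 steps.
W = torch.randn(5, 5)
print(outputs_reachable(W, input_idx=[0, 1], output_idx=[4], max_steps=3))
```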
- After extracting our predicted values in the previous section, we can calculate the loss. PyTorch makes this very easy as it builds a dynamic compute graph that allows us to backpropagate the error. Then, we simply call the optimizer's step function!
- More on this coming soon.
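A hedged sketch of what such a training step might look like, assuming a hypothetical `PAGNN` module (the class, its `forward` behavior, and all hyperparameters below are assumptions for illustration, not the repository's actual API):

```python
import torch

# Hypothetical PAGNN wrapper; the real repository's implementation may differ.
class PAGNN(torch.nn.Module):
    def __init__(self, n_neurons):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(n_neurons, n_neurons) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_neurons))
        self.n = n_neurons

    def forward(self, x, steps=3):
        s = torch.zeros(x.shape[0], self.n)
        s[:, : x.shape[1]] = x                   # load input features
        for _ in range(steps):
            s = torch.relu(s @ self.W + self.b)  # run PAGNN steps
        return s[:, -1:]                         # extract last neuron as output

model = PAGNN(n_neurons=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = torch.nn.MSELoss()

x = torch.randn(8, 2)                            # batch of 2-feature samples
y = torch.randn(8, 1)

pred = model(x)
loss = criterion(pred, y)                        # calculate the loss
optimizer.zero_grad()
loss.backward()                                  # backpropagate the error
optimizer.step()                                 # optimizer step
```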
Note: In the current state of the repository, PAGNNs are fully dense. I hypothesize that significantly larger, but sparse, PAGNNs can perform significantly better.
- Clone this repository.
- Create a virtual environment and install in editable mode:

```bash
cd path/to/cloned/PAGNNs/
python -m venv env
source env/bin/activate
pip install -e .
```