# LARGER GPU-ACCELERATED BRAIN SIMULATIONS WITH PROCEDURAL CONNECTIVITY

#### JAMES C KNIGHT AND THOMAS NOWOTNY

Centre for Computational Neuroscience and Robotics, School of Engineering and Informatics, University of Sussex, Brighton, United Kingdom

ABSTRACT. Large-scale simulations of spiking neural network models are an important tool for improving our understanding of the dynamics and ultimately the function of brains. However, even small mammals such as mice have on the order of  $1 \times 10^{12}$  synaptic connections which, in simulations, are each typically characterized by at least one floating-point value. This amounts to several terabytes of data an unrealistic memory requirement for a single desktop machine. Large models are therefore typically simulated on distributed supercomputers which is costly, places limits on the number and duration of simulations that can be run and typically requires formal procedures to gain access to the necessary resources. In this work, we describe extensions to GeNN – our Graphical Processing Unit (GPU) accelerated spiking neural network simulator – that enable it to 'procedurally' generate connectivity and synaptic weights 'on the go' as spikes are triggered, instead of storing and retrieving them from memory. We find that GPUs are well-suited to this approach because of their raw computational power which, due to memory bandwidth limitations, is often under-utilised when simulating spiking neural networks. We demonstrate the value of our approach with a recent model of the Macaque visual cortex consisting of  $4.13\times10^6$  neurons and  $24.2\times10^9$  synapses. Using our new method, it can be simulated on a single GPU – a significant step forward in making large-scale brain modelling accessible to many more researchers

#### 1. Introduction

The brain of a mouse has around  $70 \times 10^6$  neurons, but this number is dwarfed by the  $1 \times 10^{12}$  synapses which connect them  $^{16}$ . In computer simulations of spiking neural networks, propagating spikes involves adding the synaptic input from each spiking presynaptic neuron to the postsynaptic neurons. The information describing which neurons are synaptically connected and with what weight is typically generated before a simulation is run and stored in large arrays. For large-scale brain models this creates high memory requirements, so that they can typically only be simulated on large distributed computer systems using software such as NEST  $^{15}$  or NEURON  $^8$ . By careful design, these simulators can keep the memory requirements for each node almost constant, even when a simulation is distributed across tens of thousands of nodes  $^{19}$ . However, high performance computer (HPC) systems are bulky, expensive and consume a lot of power and are hence typically shared resources, only accessible to a limited number of researchers and for time-limited investigations.

Neuromorphic systems <sup>13,14,24,30,33</sup> take inspiration from the brain and have been developed specifically for simulating large spiking neural networks more efficiently. One particular relevant feature of the brain is that its memory elements – the synapses – are co-located with the computing elements – the neurons. In neuromorphic systems, this often translates to dedicating a large proportion of each chip to memory. This allows some classes of spiking neural networks to be simulated very efficiently, but reducing the degree of connectivity to fit within the constraints of current neuromorphic systems inevitably changes the dynamics of brain simulations <sup>38</sup>. However, while such on-chip memory is fast, it can only be fabricated at relatively low density so that many of these systems economize – either by reducing the maximum number of synapses per neuron to as few as 256 or by reducing the precision of the synaptic weights to 6 bit <sup>33</sup>, 4 bit <sup>13</sup> or even 1 bit <sup>24</sup>. Unlike

 $E\text{-}mail\ address{:}\ \texttt{J.C.Knight@sussex.ac.uk}$ 

most other neuromorphic systems, the SpiNNaker  $^{14}$  neuromorphic supercomputer is fully programmable and combines large on-chip and external memories, distributed across the system, which enables real-time simulation of large-scale models  $^{31}$ . This is promising for the future but, due to its prototype nature, the availability of SpiNNaker hardware is limited and even moderately-sized simulations still require a physically large system (9 boards for a model with around  $100 \times 10^3$  neurons and  $300 \times 10^6$  synapses  $^{31}$ ).

Modern GPUs have relatively little on-chip memory and, instead, dedicate the majority of their silicon area to arithmetic logic units. GPUs use dedicated hardware to rapidly switch between tasks so that the latency of accessing external memory can be 'hidden' behind computation, as long as there is sufficient computation to be performed. For example, the memory latency of a typical modern GPU can be completely hidden if each CUDA core performs approximately 10 arithmetic operations per byte of data accessed from memory. Unfortunately, propagating a spike in a spiking neural network simulation typically requires reading a synaptic weight ( $\geq 2 \, \mathrm{B}$ ) and postsynaptic index ( $\geq 2 \, \mathrm{B}$ ) from memory and then writing back the synaptic input ( $\geq 4$  B), while performing many fewer than the corresponding 80 instructions. This makes spike propagation highly memory bound. Nonetheless, we have shown in previous work 20 that, as GPUs have significantly higher total memory bandwidth than even the fastest CPU, moderately sized models of around  $100 \times 10^3$  neurons and  $1 \times 10^9$  synapses can be simulated on a single GPU with competitive speed and energy consumption. However, individual GPUs do not have enough memory to simulate larger brain models. One can install a small number of GPUs within a single node and connect them using high-speed NVLink<sup>27</sup> interconnect, through which they can communicate with low latency and high bandwidth <sup>21</sup>. However, if a larger number of GPUs is required, multiple nodes need to be used which are connected via the same MPI based protocols and network fabric as CPU based systems and hence suffer from similar communication overheads as CPU based clusters.

In this work, we present two innovations which enable large-scale brain simulations on a single GPU workstation. The first innovation, which we describe in section 2.1, we call 'procedural connectivity'. Procedural connectivity uses the large amount of computational power available on a GPU to generate synaptic connectivity 'on the fly' – reducing the memory and memory bandwidth requirements. While a similar approach was used by Eugene Izhikevich for simulating an extremely large thalamo-cortical model with  $1\times 10^{11}$  neurons and  $1\times 10^{15}$  synapses on a modest PC cluster in  $2005^{\,17}$  – an incredible achievement – it has not been subsequently applied to modern hardware.

Procedural connectivity vastly reduces the memory requirements of large-scale brain simulations, but previous GPU code generation strategies do also not scale well to models containing large numbers of neural populations and synaptic projections. To address this second problem, we have developed a 'kernel merging' code generator, described in section 2.2.

# 2. Results

In the following subsections, we first present the two recent innovations in our GeNN simulator  $^{42}$  which enable simulations of very large models on a GPU. We then demonstrate the power of the new features by simulating a recent model of the Macaque visual cortex  $^{35}$  with  $4.13\times10^6$  neurons and  $24.2\times10^9$  synapses.

2.1. **Procedural connectivity.** In a brain simulation, neurons and synapses can be described by a variety of mathematical models but these are eventually all translated into time or event-driven update algorithms<sup>6</sup>. Our GeNN simulator<sup>42</sup> uses code generation to convert neuron and synapse update algorithms – described using 'snippets' of C-like code – into CUDA code for efficient GPU simulation. Before a simulation can be run, its parameters, in particular the state variables and the synaptic connectivity, need to be initialised. Traditionally, this is done by running initialisation algorithms on the main CPU prior to the simulation. The results are stored in CPU memory, uploaded to GPU memory and then used during the simulation. We have recently extended GeNN to use code generation from code snippets to also generate efficient, parallel code for model initialisation <sup>20</sup>. Offloading initialisation to the GPU in this way made it around 20×



FIGURE 1. Scaling of 1s balanced random network simulation using different algorithms on a range of modern GPUs. All data points represent the mean of 5 simulations and standard deviations are smaller than line width so not shown. A Jetson TX2. B GeForce MX130. C GeForce GTX 1650. D Titan RTX.

faster on a desktop PC<sup>20</sup>, demonstrating that initialisation algorithms are well-suited for GPU acceleration. Here, we are going one step further. We realised that, if each synaptic connection can be re-initialised in less than the 80 operations required to hide the latency incurred when fetching its 8B of state from memory, it could be faster and vastly more memory efficient to regenerate synaptic connections on demand rather than storing them in memory. This is the concept of procedural connectivity. It is applicable whenever synapses are static – plastic synapses which change their weights during a simulation will have to be simulated in the traditional way.

We implemented procedural connectivity in GeNN by repurposing our previously developed parallel initialisation methods. Instead of running them once for all synapses at the beginning of the simulation, we rerun the methods during the simulation to regenerate the outgoing synapses of each neuron that fires a spike and immediately use the identified connections and weights to run the post-synaptic code which calculates the effect of the spike onto other neurons. This is possible because the outgoing synaptic connections from each neuron are typically largely independent from those of other neurons as we shall see from typical examples below.

In the absence of knowledge of the exact microscopic connectivity in the brain, there are a number of typical connectivity schemes that are used in brain models. We will now discuss two typical examples and how they can be implemented efficiently on a GPU. One very common connectivity scheme is the 'fixed probability connector' for which each neuron in the presynaptic population is connected to each neuron in the postsynaptic population with fixed probability  $P_{\text{conn}}$ . The postsynaptic targets of any presynaptic neuron can hence be sampled from a Bernoulli process with success probability  $P_{\text{conn}}$ . One simple way of sampling from the Bernoulli process is to repeatedly draw samples from the uniform distribution Unif[0,1] and generate a synapse if the sample is less than  $P_{\text{conn}}$ . However, for sparse connectivity ( $P_{\text{conn}} \ll 1$ ), it is much more efficient to sample from the geometric distribution  $Geom[P_{conn}]$  which governs the number of Bernoulli trials until the next success (i.e. a synapse). The geometric distribution can be sampled in constant time by inverting the cumulative density function of the equivalent continuous distribution (the exponential distribution) to obtain  $\frac{\log(\text{Unif}[0,1])}{\log(1-P_{\text{conn}})}$  p499. Note that, if we were to directly draw from the uniform distribution, the sampling for each potential synapse would be independent from any other potential synapse and all these operations could be performed in parallel. However, for the more efficient 'geometric sampling' employed here, the sampling for the post-synaptic targets of a presynaptic neuron must be done serially, but is still independent from the sampling for any other presynaptic neuron.

Another common scheme for defining connectivity is the 'fixed total number connector' in which a fixed total number  $N_{\text{syn}}$  of synapses is placed between randomly chosen

partners from the pre- and postsynaptic populations. In order to initialise this connectivity in parallel, the number of synapses that originate from each of the  $N_{\rm pre}$  presynaptic neurons must first be calculated by sampling from the multinomial distribution  ${\rm Mult}[N_{\rm syn},\{P_n,P_n,\ldots,P_n\}]$ , where  $P_n=\frac{1}{N_{\rm pre}}$ , on the host CPU up front because these numbers need to add to  $N_{\rm syn}$  and are hence not independent. However, once the numbers of outgoing synapses are determined, the postsynaptic targets for a presynaptic neuron can be generated very efficiently in parallel by sampling from the discrete uniform distribution  ${\rm Unif}[0,N_{\rm post}]$  where  $N_{\rm post}$  is the size of the postsynaptic population. Note, that this can only be done because the targets of each presynaptic neuron are independent from those of any other presynaptic neuron. Where synaptic weights and delays are not constant across synapses, but are described by some statistical distribution, they can also be sampled independently from each other and hence in parallel.

In order to use these parallel initialisation schemes for procedural connectivity, we require reproducible pseudorandom numbers that can be generated independently for each presynaptic neuron. In principle this could be done with 'convential' pseudorandom number generators (PRNGs), but each presynaptic neuron would need to maintain its own PRNG state which would lead to a significant memory overhead. Instead, we use the 'counter-based' Philox $4\times32-10$  PRNG $^{32}$ . Counter-based PRNGs are designed for parallel applications and essentially consist of a pseudo-random bijective function which takes a counter as an input (for Philox $4\times32-10$  a 128 bit number) and outputs a random number. In contrast to convential PRNGs, this means that generating the  $n^{\rm th}$  random number in a stream has exactly the same cost as generating the 'next' random number, allowing us to trivially divide up the random number stream between multiple parallel processes (in this case presynaptic neurons).

For an initial demonstration of the performance and scalability of procedural connectivity, we simulated a balanced random network of current-based Leaky Integrate-and-Fire (LIF) neurons (see section 4.2 for model description) at scales ranging from  $1\times10^3$  to  $1\times10^6$  neurons ( $100\times10^3$  to  $100\times10^9$  synapses respectively) on a representative selection of NVIDIA GPU hardware: Jetson TX2, a low-power embedded system with 8 GB (shared memory); Geforce MX130, a laptop GPU with 2 GB; Geforce GTX 1650, a low-end desktop GPU with 4 GB; and Titan RTX, a high-end workstation GPU with 24 GB. Fig. 1 shows the duration of these simulations using our new procedural approach or using the standard approach of storing synaptic connections in memory employing two different data structures. Both data structures are described in more detail in our previous work  $^{20}$  but briefly, in the 'sparse' data structure, a presynaptic neuron's postsynaptic targets are represented as an array of indices whereas, in the 'bitfield' data structure, they are represented as a  $N_{\rm post}$  array of bits where a '1' at position i indicates that there is a connection to postsynaptic neuron i and a '0' that there is not.

None of our devices have enough memory to store the  $100 \times 10^9$  synapses required for the largest scale using either data structure but, at the  $100 \times 10^3$  neuron scale, the bitfield data structure allows the model to fit into the memory of several devices it otherwise would not. However, not only is the new procedural approach the *only* way of simulating models at the largest scales but, as Fig. 1 illustrates, even at smaller scales the performance of the procedural approach is competitive with and sometimes better than the standard approach. Although, when using the procedural approach, synaptic weights and postsynaptic indices are no longer read from memory, our current implementation still accumulates postsynaptic input using a 'fire-and-forget' atomic add operation. Atomic operations are entirely handled within the L2 cache of modern GPUs and thus have relatively low latency. However, profiling using NVIDIA Nsight compute 26 suggests that the latency of these operations, combined with those incurred when sampling from the PRNG and calculating logarithms are currently the limiting factor in this benchmark rather than the compute resources of the GPU, suggesting that further optimisation could significantly improve performance. All of the synapses in this model have the same synaptic weight meaning that they can be hard-coded into the procedural connectivity kernels. However, if weights vary across synapses, the 'bitfield' cannot be used and the memory constraints for the 'sparse' representation become even more severe.

2.2. Kernel merging. NVIDIA GPUs are typically programmed in CUDA using a Single Instruction Multiple Thread (SIMT) paradigm where programmers write 'kernel' functions containing serial C-like code which is then executed in parallel across many virtual threads. We call our second innovation 'kernel merging' and it relates to the way these kernels are implemented. While the procedural connectivity presented in the previous section allows simulating models which would otherwise not fit into the memory of a GPU, there are additional problems when using code generation for models with a large number of neuron and synapse populations. GeNN and other SNN simulators which use code generation to generate all of their simulation code<sup>4</sup> (as opposed to, for example NESTML<sup>28</sup>, which uses code generation only to generate neuron simulation code) generate separate pieces of code for each population of neurons and synapses. This allows optimizations such as hard-coding constant parameters and, although generating code for models with many populations will result in large code size, C++ CPU code can easily be divided between multiple modules and compiled in parallel, minimizing the effects on build time. However, GPUs can only run a small number of kernels - which are equivalent to modules in this context – simultaneously (128 on the latest NVIDIA GPUs<sup>25</sup> p278). Therefore, GeNN employs a "Kernel Fusion" approach 41 where multiple neuron populations are simulated within each kernel. CUDA threads are grouped into 'thread blocks' and to maximise performance, the code running on threads within a thread block should not 'diverge'. To minimise this divergence, we add 'padding' threads around each population of neurons so population and block boundaries are aligned. Therefore, in the following pseudocode we simulating 3 populations of 100 neurons each in a single kernel and - assuming a thread block size of 32 – we pad each population to 128 threads:

This approach works well for a small number of populations but, as Fig. 2A illustrates, when we partition a model consisting of a single population of 1000000 LIF neurons see section 4.3 for model description) into an increasingly large number of ever smaller populations, compilation time increases super-linearly and quickly becomes impractical. Furthermore, the simulation also runs slower with a large number of populations (Fig. 2B). Normally, we would expect this model to be memory bound as each thread in the model reads 32 B of data (8 B of neuron state and 24 B of RNG state) and, as discussed above, hiding the latency of these memory accesses would require approximately 320 arithmetic operations - many more than required to sample an input current from the normal distribution and update a LIF neuron. Fig. 2C - obtained using data from the NVIDIA Nsight compute profiler <sup>26</sup> – shows that this is true for small numbers of populations. In this case, the memory system is around 90% utilised. However, when the model is partitioned into larger numbers of smaller populations, the memory is used less efficiently and the kernel becomes latency bound, i.e. neither memory nor compute are used efficiently. Investigating further, we found that this drop in performance was accompanied by an increasing number of "No instruction" stalls (Fig. 2D) which are events that prevent the GPU from doing any work during a clock cycle. These particular events are likely to be caused by



FIGURE 2. Performance of a 1 s simulation of  $1 \times 10^6$  LIF neurons driven by a Gaussian input current, partitioned into varying numbers  $(N_{pop})$  of populations and running on a workstation equipped with a Titan RTX GPU. All data points represent the mean of 5 simulations and standard deviations are smaller than line width so not shown. A Compilation time  $(T_{comp})$  using GCC 7.5.0. B Neuron kernel time  $(T_{neuron})$  for a 1 s simulation. C Memory throughput  $(K_{mem})$  reported by NVIDIA Nsight compute profiler "Speed of light" metric. D Number of "No instruction" stalls reported by NVIDIA Nsight compute profiler  $(N_{stall})$ .

"Excessively jumping across large blocks of assembly code" <sup>26</sup> p47, which makes sense as we are generating kernels with hundreds of thousands of lines of code.

Several neural modelling tools including Brian2<sup>37</sup> provide modellers with tools to work with 'slices' of neuron populations, allowing models to be defined with fewer populations. However, if a model is defined by connecting these slices together, the resulting connectivity is the result of *multiple* simple connection rules of the type discussed in the previous section, making it much more difficult to apply our procedural connectivity approach. Furthermore, such an approach places the responsibility for structuring a model in such a way that it can be simulated efficiently onto the modellers, who often prefer to concentrate on the science and organise populations according to anatomy or physiology.

To address the issue of too many populations, we developed a new code generator for GeNN which applies a Single Program Multiple Data (SPMD) approach and 'merges' the model description, grouping together populations which can be simulated using the same generated code. From this merged description, structures are generated to store the pointers to state variables and parameter values which are still allowed to differ between merged populations:

```
struct NeuronUpdateGroup {
  unsigned int numNeurons;
  float* V;
};
```

An array of these structures is then declared for each merged population and each element is initialised with pointers to state variables and parameter values:

```
\label{eq:neuronUpdateGroup [3];} $\operatorname{neuronUpdateGroup}[0] = \{100, \, VA\};$ $\operatorname{neuronUpdateGroup}[1] = \{100, \, VB\};$ $\operatorname{neuronUpdateGroup}[2] = \{100, \, VC\};$ $
```

where VA is a pointer to the array containing the state variable 'V' of populations 'A' and so on. In order for a thread to determine which neuron in which population it should simulate, we generate an additional data structure – an array containing a cumulative sum of threads used for each population:

unsigned int startThread  $[3] = \{0, 128, 256\};$ 

Each thread performs a simple binary search within this array to find the index of the neuron and population it should simulate. As Fig. 2 shows, this approach solves the observed issues with compilation time and simulation performance.

2.3. The multi-area model. Due to lack of computing power and sufficiently detailed connectivity data, previous models of the cortex have either focused on modelling individual local microcircuits at the level of individual cells <sup>18,29</sup> or modelling multiple connected areas at a higher level of abstraction 7. However, recent data 2 has shown that cortical activity has distinct features at both the global and local levels which can only be captured by modelling interconnected microcircuits at the level of individual cells. The recent  $multi-area\ model^{34,35}$  is an example of such multi-scale modeling. It uses scaled versions of a previous, 4 layer microcircuit model<sup>29</sup> to implement 1 mm<sup>2</sup> 'patches' for 32 areas of the macaque visual cortex. The patches are connected together according to inter-area axon tracing data from the CoCoMac<sup>1</sup> database, further refined using additional anatomical data<sup>22</sup> and heuristics<sup>11</sup> to obtain estimates for the number of synapses between areas. The synapses are distributed between populations in the source and target area using layer-specific tracing data <sup>23</sup> and cell-type-specific dendritic densities <sup>3</sup>. Individual populations are connected by the fixed number connectors described above. For a full description of the multi-area model please see Schmidt et al.  $^{34}$ ,  $^{35}$ . In 2018, this model was simulated using NEST 15 on an IBM Blue Gene/Q supercomputer. Because each Blue Gene/Q compute node only had a small amount of memory, these simulation were distributed across an entire rack (1024 compute nodes) of the system and simulating 1s of biological time took approximately  $12 \min^{35}$ . However, on the newer 'JURECA' supercomputer (where each node has 128 GB of memory) this model can be simulated on as few as 11 compute nodes. Furthermore, by distributing the model across 160 compute nodes, it can be initialized in approximately 8 min and simulating 1 s of biological time takes only 31 s<sup>39</sup>.

The multi-area model consists of  $4.13 \times 10^6$  neurons in 254 populations and  $24.2 \times 10^9$  synapses in 64516 projections. Without kernel merging, it would therefore be unlikely that the model would compile or simulate at a workable speed using GeNN. Additionally, unlike the model we benchmarked previously, each synapse in this model has an independent weight and synaptic delay sampled from a normal distribution so the bitfield data structure cannot be used. Even if we assume that 16 bit floating-point would provide sufficient weight precision, that delays could be expressed as 8 bit integers and that neuron populations are all small enough to be indexed using 16 bit indices, our sparse data structure would still require 5 B per synapse, such that the complete synaptic data would need over 100 GB of GPU memory. While a cluster of GPUs connected using NVLink could be built with this much memory, it is more than any single GPU has available. However, using procedural connectivity, we are able to simulate this model on a single workstation with a Titan RTX GPU.

In order to validate our GeNN simulations, we ran a  $10.5\,\mathrm{s}$  simulation of the multi-area model in a 'ground state' where inter-area connections have the same strength as intra-area connections and a  $100.5\,\mathrm{s}$  simulation in a 'resting state' where inter-area connections are  $1.9\times$  stronger. Initialization of our model took 6 min (3 min of which was spent generating and compiling code) and simulation of each biological second took 7.9 min in the ground state and  $8.5\,\mathrm{min}$  in the resting state (averaged over 3 simulation runs). While this is significantly faster than the older Blue Gene/Q supercomputer, it is around  $15\times$  slower than the newer JURECA system. However, because individual synapses simulated using procedural connectivity do not require initialization, in shorter simulations this large performance difference can be somewhat overcome by the faster initialisation time of the GeNN simulations. For example a 1s simulation would take just under 11 min using GeNN (if the model is not recompiled) and around  $8.5\,\mathrm{min}$  using  $160\,\mathrm{JURECA}$  nodes – only  $2.5\,\mathrm{min}$  faster.

Fig. 3A-C shows some example spike rasters from three of the modelled areas, illustrating the asynchronous irregular nature of the model's ground state whereas Fig. 3D-F illustrate the characteristic irregular activity and population bursts of the same areas in



FIGURE 3. Results of full-scale multi-area model simulation. **A-C** Raster plots from GeNN ground state simulation showing spiking activity of 3% of the neurons in area V1 (A), V2 (B), and Frontal Eye Field (C). **G-I** Raster plots from GeNN resting state simulation showing spiking activity of 3% of the neurons in area V1 (A), V2 (B), and Frontal Eye Field (C). Blue: excitatory neurons, red: inhibitory neurons. **D-F** Spiking statistics for each population across all 32 areas simulated in ground state using GeNN and NEST shown as split violin plots. **J-L** Spiking statistics for each population across all 32 areas simulated in resting state using GeNN and NEST shown as split violin plots. Solid lines: medians, Dashed lines: Interquartile range. **D,J** Population-averaged firing rates. **E,K** Average pairwise correlation coefficients of spiking activity. **F,L** Irregularity measured by revised local variation LvR <sup>36</sup> averaged across neurons.



FIGURE 4. Comparison of full-scale multi-area model spike statistics between multiple GeNN simulations with different seeds; and NEST simulations. Comparisons calculated using the Kullback-Leibler (KL) divergence. A Neuron firing rates  $\bf B$  Pairwise correlation coefficients of spiking activity.  $\bf C$  Irregularity measured by revised local variation LvR  $^{36}$ .

the resting state. Next, we calculated the per-layer distributions of rates, spike-train irregularity and cross-correlation coefficients across all areas (disregarding the first 500 ms of simulation) and compared them to the same measures obtained from spike trains generated by the supercomputer simulations. We calculated irregularity using the revised local variation LvR<sup>36</sup>, averaged over a subsample of 2000 neurons and cross-correlation from spike histograms with 1 ms bins, calculated from a subset of 2000 non-silent neurons. The distributions of these values, obtained from the NEST and GeNN simulations, are shown as violin plots in Fig. 3G-L. Upon visual inspection, the distributions are very similar but, to assess how meaningful the differences between these distributions are, we compared the distributions obtained from 3 GeNN simulations with different random number generator seeds against those obtained from NEST. We performed this comparison by computing histograms of the three measures – using bin sizes determined with the Freedman-Diaconis rule  $^{12}$  – and comparing pairs of distributions using the Kullback-Leibler divergence  $D_{\rm KL}$ . The results of these comparisons are shown in Fig. 4 and suggest that the influence of the random seeds is comparable to the influence of random seeds and simulator combined indicating that the choice of simulator has little effect on the model dynamics.

## 3. Discussion

In this work we have presented a novel approach for large-scale brain simulation on GPU devices which entirely removes the need to store connectivity data in memory. We have shown that this approach allows us to simulate a cortical model with  $4.13 \times 10^6$  neurons and  $24.2 \times 10^9$  synapses  $^{34,35}$  on a single modern GPU. While this represents a significant step forward in terms of making large-scale brain modelling tools accessible to a large community of brain researchers, this model still has around  $20 \times$  fewer neurons and  $40 \times$  fewer synapses than the brain of even a small mammal such as a mouse  $^{16}$ . Our implementation of the multi-area model requires a little over 12 GB of GPU memory, with

| Parameter                    | Procedural connectivity benchmark | Merging<br>benchmark | Multi-area<br>model |
|------------------------------|-----------------------------------|----------------------|---------------------|
| $\tau_{ m m} \ [{ m ms}]$    | 20                                | 20                   | 2                   |
| $V_{\rm rest}   [{ m mV}]$   | -60.0                             | -70.0                | -65                 |
| $V_{ m th} \ [{ m mV}]$      | -50.0                             | -51.0                | -50                 |
| $R_{\rm m} \ [{ m M}\Omega]$ | 20                                | 20                   | 40                  |
| $	au_{ m ref} \ [ m ms]$     | 5                                 | 2                    | 2                   |

Table 1. Neuron parameters.

the majority (8.5 GB) being used for the circular dendritic delay buffers (see Knight and Nowotny  $^{20}$ ). These are a per-neuron (rather than per-synapse) data structure but, because the inter-area connections in the model have delays of up to 500 simulation timesteps, the delay buffers become quite large. However, for models without large delays such as the balanced random network simulated in section 2.1, only around 20 B is required per neuron meaning that, theoretically, over  $1 \times 10^9$  neurons could be simulated on a 24 GB GPU.

One important aspect of large-scale brain simulations not addressed in this work is synaptic plasticity and its role in learning. As discussed by Knight and Nowotny <sup>20</sup>, GeNN supports a wide variety of synaptic plasticity rules. In order to modify synaptic weights, they need to be stored in memory rather than generated procedurally. However, connectivity could still be generated procedurally, potentially halving the memory requirements of models with synaptic plasticity. This would be sufficient for synaptic plasticity rules that only require access to presynaptic spikes and postsynaptic neuron states <sup>5,9</sup> but, for many Spike-Timing-Dependent Plasticity (STDP) rules, access to postsynaptic spikes is also required. GeNN supports such rules by automatically generating a lookup table structure (see Knight and Nowotny <sup>20</sup>). While this process could be adapted to generate a lookup table from procedural connectivity, this would further erode memory savings. However, typically not all synapses in a simulation are plastic and those that are not could be simulated fully procedurally.

In this work, we have discussed the idea of procedural connectivity in the context of GPU hardware but, we believe that there is also potential for developing new types of neuromorphic hardware built from the ground up for procedural connectivity. The latencies of sampling from the random number generator and performing atomic add operations were identified in section 2.1 as the factors currently limiting performance of the procedural connectivity algorithm. By dedicating more silicon area to fast on-chip memory – which would alleviate the need for atomic operations – and by implementing the random number generator in hardware, these limitations could be overcome, leading to truly game-changing compute time improvements.

### 4. Methods

4.1. **Neuron model.** In all experiments presented in this work, neurons are modelled as Leaky Integrate-and-Fire (LIF) units with the parameters listed in Table 1. The membrane voltage  $V_i$  of neuron i is modelled as

(1) 
$$\tau_{\rm m} \frac{dV_i}{dt} = (V_{\rm rest} - V_i) + R_{\rm m} (I_{\rm syn}_i + I_{\rm ext}_i),$$

where  $\tau_{\rm m}$  and  $R_{\rm m}$  represent the time constant and resistance of the neuron's cell membrane,  $V_{\rm rest}$  defines the resting potential,  $I_{{\rm syn}_i}$  represents the synaptic input current and  $I_{{\rm ext}_i}$  represents an external input current. When the membrane voltage crosses a threshold  $V_{\rm th}$  a spike is emitted, the membrane voltage is reset to  $V_{\rm rest}$  and updating of V is suspended for a refractory period  $\tau_{\rm ref}$ . In the models where there are synaptic connections, they are current-based, i.e. pre-synaptic spikes lead to exponentially-decaying input currents  $I_{{\rm syn}_i}$ 

(2) 
$$\tau_{\text{syn}} \frac{dI_{\text{syn}_i}}{dt} = -I_{\text{syn}_i} + \sum_{i=0}^n w_{ij} \sum_{t_j} \delta(t - t_j),$$

where  $\tau_{\text{syn}}$  represents the decay time constant and  $t_j$  are the arrival times of incoming spikes from n presynaptic neurons. The ordinary differential equations (1) and (2) are solved with an exponential Euler algorithm.

- 4.2. Balanced random network model. This model was first presented by Vogels and Abbott <sup>40</sup> but has subsequently been widely used as a scalable benchmark <sup>6</sup>. The network consists of N LIF neurons as described in section 4.1 with parameters shown in table 1. The neurons are partitioned into one population of  $\frac{4N}{5}$  excitatory and a second of  $\frac{N}{5}$ inhibitory neurons. The two populations of neurons are connected to each other and with themselves with fixed probability  $P_{\rm conn}=10\,\%$  (the highest density at which Vogels and Abbott 40 suggests their results hold). All excitatory synapses have  $\tau_{\rm syn}=5\,{\rm ms}$ and  $w_{ij} = \frac{3.2}{N} \text{nA}$ ; and all inhibitory synapses have  $\tau_{\text{syn}} = 10 \,\text{ms}$  and  $w_{ij} = \frac{40.8}{N} \text{nA}$ . Additionally, every cell is depolarized by approximately 10 mV by applying a constant external current  $I_{\text{ext}_i} = 0.55 \,\text{nA}$ . Simulations were run with a time step  $\Delta t = 1.0 \,\text{ms}$ .
- 4.3. Merging model. This model simply consists of 1000000 LIF neurons, as described in section 4.1 with parameters shown in table 1, each driven by a Gaussian input current of  $I_{\text{ext}_i} \sim \mathcal{N}(1.0, 0.25)$ . Simulations were run with a time step  $\Delta t = 1.0 \,\text{ms}$ .

#### 4.4. Multi-area model.

#### 5. Acknowledgments

We would like to thank Jari Pronold, Sacha van Albada, Agnes Korcsak-Gorzo and Maximilian Schmidt for their help with the multi-area model data; and Dan Goodman and Mantas Mikaitis for their feedback on the manuscript. This work was funded by the EPSRC (Brains on Board project, grant number EP/P006094/1).

# 6. Author contributions

J.K. and T.N. wrote the paper. T.N. is the original developer of GeNN. J.K. is currently the primary GeNN developer and was responsible for extending the code generation approach to the procedural simulation of synaptic connectivity. J.K. performed the experiments and the analysis of the results that are presented in this work.

# 7. Data availability

The raw data used to produce Fig. 1 and Fig. 2; and the pre-processed data used to produce Fig. 3 are available at https://github.com/BrainsOnBoard/procedural\_paper. The raw spiking data from the multi-area model simulations is available upon request.

# 8. Code availability

All experiments were carried out using the GeNN 4.3.0, available at https://github. com/genn-team/genn/releases/tag/4.3.0. A GeNN port of the multi-area model is available at https://github.com/neworderofjamie/multi-area-model. The models used to produce Fig. 1 and Fig. 2 are all available at https://github.com/BrainsOnBoard/ procedural\_paper.

- [1] Bakker, R., Wachtler, T., and Diesmann, M. (2012). CoCoMac 2.0 and the future of tract-tracing databases. Frontiers in Neuroinformatics, 6:1-6.
- [2] Belitski, A., Gretton, A., Magri, C., Murayama, Y., Montemurro, M. A., Logothetis, N. K., and Panzeri, S. (2008). Low-frequency local field potentials and spikes in primary visual cortex convey independent visual information. Journal of Neuroscience, 28(22):5696-5709.
- [3] Binzegger, T., Douglas, R. J., and Martin, K. A. (2004). A quantitative map of the circuit of cat primary visual cortex. Journal of Neuroscience, 24(39):8441–8453.

- [4] Blundell, I., Brette, R., Cleland, T. A., Close, T. G., Coca, D., Davison, A. P., Diaz-Pier, S., Fernandez Musoles, C., Gleeson, P., Goodman, D. F. M., Hines, M., Hopkins, M. W., Kumbhar, P., Lester, D. R., Marin, B., Morrison, A., Müller, E., Nowotny, T., Peyser, A., Plotnikov, D., Richmond, P., Rowley, A., Rumpe, B., Stimberg, M., Stokes, A. B., Tomkins, A., Trensch, G., Woodman, M., and Eppler, J. M. (2018). Code Generation in Computational Neuroscience: A Review of Tools and Techniques. Frontiers in Neuroinformatics, 12.
- [5] Brader, J. M., Senn, W., and Fusi, S. (2007). Learning real-world stimuli in a neural network with spike-driven synaptic dynamics. *Neural computation*, 19(11):2881–912.
- [6] Brette, R., Rudolph, M., Carnevale, T., Hines, M., Beeman, D., Bower, J. M., Diesmann, M., Morrison, A., Goodman, P. H., Harris, F. C., Zirpe, M., Natschläger, T., Pecevski, D., Ermentrout, B., Djurfeldt, M., Lansner, A., Rochel, O., Vieville, T., Muller, E., Davison, A. P., El Boustani, S., and Destexhe, A. (2007). Simulation of networks of spiking neurons: a review of tools and strategies. *Journal of computational neuroscience*, 23(3):349–98.
- [7] Cabral, J., Kringelbach, M. L., and Deco, G. (2014). Exploring the network dynamics underlying brain activity during rest. *Progress in Neurobiology*, 114:102–131.
- [8] Carnevale, N. T. and Hines, M. L. (2006). *The NEURON book*. Cambridge University Press.
- [9] Clopath, C., Büsing, L., Vasilaki, E., and Gerstner, W. (2010). Connectivity reflects coding: a model of voltage-based STDP with homeostasis. *Nature neuroscience*, 13:344– 352.
- [10] Devroye, L. (2013). Non-uniform random variate generation. Springer-Verlag New York, New York.
- [11] Ercsey-Ravasz, M., Markov, N. T., Lamy, C., Van Essen, D. C., Knoblauch, K., Toroczkai, Z., and Kennedy, H. (2013). A Predictive Network Model of Cerebral Cortical Connectivity Based on a Distance Rule. *Neuron*, 80(1):184–197.
- [12] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L<sub>2</sub> theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(4):453–476.
- [13] Frenkel, C., Lefebvre, M., Legat, J.-D., and Bol, D. (2019). A 0.086-mm<sup>2</sup> 12.7-pJ/SOP 64k-Synapse 256-Neuron Online-Learning Digital Spiking Neuromorphic Processor in 28nm CMOS. *IEEE Transactions on Biomedical Circuits and Systems*, 13(1):145–158.
- [14] Furber, S. B., Galluppi, F., Temple, S., and Plana, L. A. (2014). The SpiNNaker Project. *Proceedings of the IEEE*, 102(5):652–665.
- [15] Gewaltig, M.-O. and Diesmann, M. (2007). NEST (NEural Simulation Tool). Scholarpedia, 2(4):1430.
- [16] Herculano-Houzel, S., Mota, B., and Lent, R. (2006). Cellular scaling rules for rodent brains. *Proceedings of the National Academy of Sciences*, 103(32):12138–12143.
- [17] Izhikevich, E. M. (2005). Large-Scale Simulation of the Human Brain.
- [18] Izhikevich, E. M. and Edelman, G. M. (2008). Large-scale model of mammalian thalamocortical systems. *Proceedings of the National Academy of Sciences of the United States of America*, 105(9):3593–8.
- [19] Jordan, J., Ippen, T., Helias, M., Kitayama, I., Sato, M., Igarashi, J., Diesmann, M., and Kunkel, S. (2018). Extremely Scalable Spiking Neuronal Network Simulation Code: From Laptops to Exascale Computers. *Frontiers in Neuroinformatics*, 12:2.
- [20] Knight, J. C. and Nowotny, T. (2018). GPUs Outperform Current HPC and Neuromorphic Solutions in Terms of Speed and Energy When Simulating a Highly-Connected Cortical Model. Frontiers in Neuroscience, 12:1–19.
- [21] Li, A., Song, S. L., Chen, J., Li, J., Liu, X., Tallent, N. R., and Barker, K. J. (2020). Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. *IEEE Transactions on Parallel and Distributed Systems*, 31(1):94–110.
- [22] Markov, N. T., Ercsey-Ravasz, M. M., Ribeiro Gomes, A. R., Lamy, C., Magrou, L., Vezoli, J., Misery, P., Falchier, A., Quilodran, R., Gariel, M. A., Sallet, J., Gamanut, R., Huissoud, C., Clavagnier, S., Giroud, P., Sappey-Marinier, D., Barone, P., Dehay, C., Toroczkai, Z., Knoblauch, K., Van Essen, D. C., and Kennedy, H. (2014a). A weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cerebral Cortex,

- 24(1):17-36.
- [23] Markov, N. T., Vezoli, J., Chameau, P., Falchier, A., Quilodran, R., Huissoud, C., Lamy, C., Misery, P., Giroud, P., Ullman, S., Barone, P., Dehay, C., Knoblauch, K., and Kennedy, H. (2014b). Anatomy of hierarchy: Feedforward and feedback pathways in macaque visual cortex. Journal of Comparative Neurology, 522(1):225-259.
- [24] Merolla, P. A., Arthur, J. V., Alvarez-Icaza, R., Cassidy, A. S., Sawada, J., Akopyan, F., Jackson, B. L., Imam, N., Guo, C., Nakamura, Y., Brezzo, B., Vo, I., Esser, S. K., Appuswamy, R., Taba, B., Amir, A., Flickner, M. D., Risk, W. P., Manohar, R., and Modha, D. S. (2014). A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668-673.
- [25] NVIDIA Corporation (2019). CUDA C++ Programming Guide.
- [26] NVIDIA Corporation (2020a). Nsight Compute.
- [27] NVIDIA Corporation (2020b). NVLink Fabric Multi-GPU Processing.
- [28] Plotnikov, D., Blundell, I., Ippen, T., Eppler, J. M., Rumpe, B., and Morrison, A. (2016). NESTML: a modeling language for spiking neurons. In Lecture Notes in Informatics (LNI), volume P-254, pages 93–108. Modellierung 2016, Karlsruhe (Germany), 17 Mar 2016 - 19 Mar 2016, Gesellschaft für Informatik e.V. (GI).
- [29] Potjans, T. C. and Diesmann, M. (2014). The Cell-Type Specific Cortical Microcircuit: Relating Structure and Activity in a Full-Scale Spiking Network Model. Cerebral Cortex, 24(3):785-806.
- [30] Qiao, N., Mostafa, H., Corradi, F., Osswald, M., Stefanini, F., Sumislawska, D., and Indiveri, G. (2015). A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses. Frontiers in Neuroscience, 9:1-17.
- [31] Rhodes, O., Peres, L., Rowley, A. G. D., Gait, A., Plana, L. A., Brenninkmeijer, C., and Furber, S. B. (2020). Real-time cortical simulation on neuromorphic hardware. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 378(2164):20190160.
- [32] Salmon, J. K., Moraes, M. A., Dror, R. O., and Shaw, D. E. (2011). Parallel random numbers: As Easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, volume 81, page 1, New York, New York, USA. ACM Press.
- [33] Schemmel, J., Kriener, L., Müller, P., and Meier, K. (2017). An accelerated analog neuromorphic hardware system emulating NMDA- and calcium-based non-linear dendrites. Proceedings of the International Joint Conference on Neural Networks, pages 2217 - 2226.
- [34] Schmidt, M., Bakker, R., Hilgetag, C. C., Diesmann, M., and van Albada, S. J. (2018a). Multi-scale account of the network structure of macaque visual cortex. Brain Structure and Function, 223(3):1409-1435.
- [35] Schmidt, M., Bakker, R., Shen, K., Bezgin, G., Diesmann, M., and van Albada, S. J. (2018b). A multi-scale layer-resolved spiking network model of resting-state dynamics in  ${\it macaque\ visual\ cortical\ areas.\ PLoS\ Computational\ Biology,\ 14(10):1-38.}$
- [36] Shinomoto, S., Kim, H., Shimokawa, T., Matsuno, N., Funahashi, S., Shima, K., Fujita, I., Tamura, H., Doi, T., Kawano, K., Inaba, N., Fukushima, K., Kurkin, S., Kurata, K., Taira, M., Tsutsui, K. I., Komatsu, H., Ogawa, T., Koida, K., Tanji, J., and Toyama, K. (2009). Relating neuronal firing patterns to functional differentiation of cerebral cortex. PLoS Computational Biology, 5(7).
- [37] Stimberg, M., Brette, R., and Goodman, D. F. (2019). Brian 2, an intuitive and efficient neural simulator. eLife, 8:1-41.
- [38] van Albada, S. J., Helias, M., and Diesmann, M. (2015). Scalability of Asynchronous Networks Is Limited by One-to-One Mapping between Effective Connectivity and Correlations. PLoS Computational Biology, 11(9):1–37.
- [39] van Albada, S. J., Pronold, J., van Meegen, A., and Diesmann, M. (in press). Usage and Scaling of an Open-Source Spiking Multi-Area Model of Monkey Cortex. In Amunts, K., Grandinetti, L., Lippert, T., and Petkov, N., editors, Brain-Inspired Computing. Springer.

- [40] Vogels, T. P. and Abbott, L. F. (2005). Signal Propagation and Logic Gating in Networks of Integrate-and-Fire Neurons. *The Journal of Neuroscience*, 25(46):10786–10795
- [41] Wang, G., Lin, Y. S., and Yi, W. (2010). Kernel fusion: An effective method for better power efficiency on multithreaded GPU. Proceedings 2010 IEEE/ACM International Conference on Green Computing and Communications, GreenCom 2010, 2010 IEEE/ACM International Conference on Cyber, Physical and Social Computing, CP-SCom 2010, pages 344–350.
- [42] Yavuz, E., Turner, J., and Nowotny, T. (2016). GeNN: a code generation framework for accelerated brain simulations. *Scientific Reports*, 6:18854.