# Large-scale brain simulations on the desktop using procedural connectivity

James C Knight<sup>a,1</sup> and Thomas Nowotny<sup>a</sup>

<sup>a</sup>Centre for Computational Neuroscience and Robotics, School of Engineering and Informatics, University of Sussex, Brighton, United Kingdom

This manuscript was compiled on March 19, 2020

Large-scale simulations of spiking neural networks have become important tools in helping us improve our understanding of the dynamics of brain circuits, and ultimately brain function. However, even small mammals such as mice have on the order of  $1 \times 10^{12}$  synaptic connections (1), which are typically characterized by at least one floating point value per synapse, usually describing their synaptic conductance. If single precision floating point variables were used to store individual conductance values for all synapses of a mouse brain, several terabytes of memory would be required. Because such memory requirements are beyond what is plausible for a single desktop machine, large-scale spiking neural networks are typically simulated on large distributed high performance computing systems. This is costly and limits this level of modelling to a select few research groups who have access to appropriate supercomputing resources. However, large parts of current brain models are typically described by simple algorithms which determine the connectivity and conductances of synaptic connections. In this work, we describe extensions to GeNN (2), our GPU-based spiking neural network simulator, that enable it to 'procedurally' generate connectivity and synaptic weights 'on the go' as spikes are triggered. This is in contrast to generating large connectivity matrices up front and retrieving them from memory when needed, which is how practically all simulations are run today. We find that high-end GPUs are well-suited to this approach as they provide a large amount of raw computational power which is often under-utilised when simulating spiking neural networks due to the limited memory bandwidth available to each parallel computing element. To demonstrate the value of this approach, we have implemented a recent model of the macaque visual cortex consisting of  $4.13 \times 10^6$  neurons and  $24.2 \times 10^9$  synapses. With our new method, this model can be simulated on a single GPU. We demonstrate that the results match those obtained on a supercomputer and that the simulation runs faster on a single high-end GPU than a previous simulation which was executed on over 1000 supercomputer nodes.

spiking neural networks | GPU | high-performance computing | brain simulation

The brain of a mouse has around  $70 \times 10^6$  neurons, but this number is dwarfed by the  $1 \times 10^{12}$  synapses which connect them. In computer simulations of spiking neural networks, propagating spikes through synapses involves reading a 'row' of synapses connecting a spiking presynaptic neuron to its postsynaptic partners and adding the 'weight' of each synapse in the row to a 'bin' containing the postsynaptic neuron's input for the next simulation timestep. Typically, the information describing which neurons are connected by a synapse, and with what conductance, is generated before a simulation is run and stored in large matrices in random access memory (RAM). This creates high memory requirements for large-scale brain models, so that they can typically only be simulated on large distributed computer systems using software such as NEST (3) or NEURON (4). By careful design, these simulators can keep the memory requirements for each node constant, even when a simulation is distributed across thousands of nodes (5). However, high performance computer systems are bulky, expensive and consume large amounts of power, meaning that they are typically shared resources that are only accessible to a limited number of researchers and for strongly time-limited investigations.
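The row-based spike propagation described above can be sketched in a few lines of C++. This is a minimal illustration under assumptions, not the code of any particular simulator: the compressed-row layout and the names `SparseSynapses`, `rowStart`, `postIndex` and `propagateSpikes` are all introduced here for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed compressed-row layout: the synapses of presynaptic neuron i
// occupy entries [rowStart[i], rowStart[i + 1]) of postIndex and weight.
struct SparseSynapses
{
    std::vector<std::size_t> rowStart;
    std::vector<unsigned int> postIndex;
    std::vector<float> weight;
};

// For each spiking presynaptic neuron, read its row and add each synaptic
// weight to the input 'bin' of the corresponding postsynaptic neuron.
void propagateSpikes(const SparseSynapses &syn,
                     const std::vector<unsigned int> &spikes,
                     std::vector<float> &inputBins)
{
    for(unsigned int pre : spikes) {
        assert(pre + 1 < syn.rowStart.size());
        for(std::size_t s = syn.rowStart[pre]; s < syn.rowStart[pre + 1]; s++) {
            inputBins[syn.postIndex[s]] += syn.weight[s];
        }
    }
}
```

With a 32-bit index and a single precision weight, each synapse visited in the inner loop costs 8 bytes of memory traffic for a single addition, which is why the stored-matrix approach is dominated by memory capacity and bandwidth rather than arithmetic.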

Neuromorphic systems (6–11) take inspiration from the brain and have been developed specifically for simulating large spiking neural networks. One particularly relevant feature of the brain is that its memory elements (the synapses) are co-located with the computing elements (the neurons) throughout the entire system. In neuromorphic systems, this often translates to dedicating a large proportion of each chip to memory. However, while such on-chip memory is fast, it can only be fabricated at relatively low density, meaning that many of these systems economize, either by reducing the maximum number of synapses per neuron to as few as 256 or by reducing the precision of the synaptic weights to 6 (11), 4 (6) or even 1 bit (7, 9). Such strategies allow some classes of spiking neural networks to be simulated very efficiently, but reducing the degree of connectivity in large-scale brain simulations to fit within the constraints of current neuromorphic systems inevitably changes their dynamics (12). Unlike the majority of other neuromorphic systems, the SpiNNaker (8) neuromorphic supercomputer is entirely programmable and combines a large amount of on-chip memory with external memories, distributed across the system for the storage of synaptic connectivity. SpiNNaker's external memory bandwidth, on-chip memory capacity and the computational power of each core are all tailored to large-scale brain simulation, meaning that the output bins of the synapse processing algorithm can fit in on-chip memory and there is enough external memory bandwidth to fetch synaptic rows fast enough for real-time simulation of large-scale models (13). This is a promising approach for future research but, because of its prototype nature, the availability of SpiNNaker hardware is limited and a physically large system is still required for even moderately-sized simulations (9 boards for a simulation with around  $10 \times 10^3$  neurons and  $300 \times 10^6$  synapses (13)).

## **Significance Statement**

Authors must submit a 120-word maximum statement about the significance of their research paper written at a level understandable to an undergraduate educated scientist outside their field of speciality. The primary goal of the Significance Statement is to explain the relevance of the work in broad context to a broad readership. The Significance Statement appears in the paper itself and is required for all research papers.

J.K. and T.N. wrote the paper. T.N. is the original developer of GeNN. J.K. is currently the primary GeNN developer and was responsible for extending the code generation approach to the procedural simulation of synaptic connectivity. J.K. performed the experiments and the analysis of the results that are presented in this work.

The authors declare no conflict of interest.

<sup>1</sup>To whom correspondence should be addressed. E-mail: J.C.Knight@sussex.ac.uk

Modern GPUs have relatively small amounts of on-chip memory and, instead, dedicate the majority of their silicon area to arithmetic logic units (ALUs). GPUs use dedicated hardware to rapidly switch between tasks so that the latency of accessing external memory can be 'hidden' behind computation, as long as there is sufficient computation to be performed. For example, the memory latency of a typical modern GPU can be completely hidden if each CUDA core performs approximately 10 arithmetic operations per byte of data accessed from memory. Unfortunately, processing a synapse in a spiking neural network simulation is likely to require accessing approximately 8 B of memory and performing many fewer than the required 80 instructions. This makes synaptic updates highly memory bound. Nonetheless, we have shown in previous work (14) that, as GPUs have significantly higher total memory bandwidth than even the most expensive CPU, moderately sized models of around  $10 \times 10^3$  neurons and  $1 \times 10^9$  synapses can be simulated on a single GPU with competitive speed and energy requirements. However, individual GPUs do not have enough memory to simulate truly large-scale brain models and, although small numbers of GPUs can be connected together using the high-speed NVLink (TODO: cite) interconnect, beyond such small GPU clusters, scaling will be dictated by the same communication overheads as for other MPI-based distributed systems.
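The figure of 80 operations per synapse quoted above can be checked with a one-line calculation, taking the rough estimate of 10 arithmetic operations per byte as given. The symbols $I_{\mathrm{hide}}$ (operations per byte needed to hide latency), $B_{\mathrm{syn}}$ (bytes read per synapse) and $N_{\mathrm{ops}}$ are introduced here purely for illustration:

$$
N_{\mathrm{ops}} \approx I_{\mathrm{hide}} \, B_{\mathrm{syn}} = 10\,\mathrm{B}^{-1} \times 8\,\mathrm{B} = 80 .
$$

A synaptic update that performs far fewer than 80 operations therefore leaves the ALUs waiting on memory.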

In this work we present a novel approach which converts large-scale brain simulation from a problem which is memory-bound on a GPU to one where the large amount of computational power available on a GPU can be used to reduce both memory and memory bandwidth requirements and enable truly large-scale brain simulations on a single GPU workstation.

#### Results

In the following subsections, we will first present two recent innovations in our GeNN simulator (2) which allow it to be used to simulate large-scale models on a single GPU. We will then demonstrate the power of these new features by simulating a recent model of the macaque visual cortex (15) consisting of  $4.13\times10^6$  neurons and  $24.2\times10^9$  synapses on a single GPU. We find that we not only obtain the same results as in the previous simulation on a high-performance supercomputer, but our simulation also runs faster.

**Procedural connectivity.** Our GeNN simulator (2) uses code generation to convert neuron and synapse models, described using 'snippets' of C-like code, into CUDA code for GPU simulation. We previously extended GeNN to allow the same approach to be applied to generating efficient, parallel model initialisation code from code snippets describing the algorithms to use for initialising individual state variables and synaptic connectivity (14). Parallelising initialisation in this manner sped up model initialisation by around  $20\times$  on a desktop PC, and also indicates just how well-suited these initialisation algorithms are to GPU implementation. In fact, it seems somewhat illogical to run these algorithms once only to fill the limited memory of the GPU with data and subsequently read it back throughout the simulation at the expense of equally limited memory bandwidth. Instead, what if we used some of the

To demonstrate the performance and scalability of this new approach, we ran several simulations of a network, initially designed as a medium for experimentation into signal propagation through cortical networks (16), but subsequently widely used as a scalable benchmark (17). The network consists of 10 000 integrate-and-fire neurons, split between an excitatory population of 8000 cells and an inhibitory population of 2000 cells.
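The idea of regenerating connectivity when a spike occurs, rather than storing it, can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions and not GeNN's generated code: the names `generateRow` and `propagateSpike`, the fixed connection probability, the constant weight and the use of `std::mt19937` seeded per row are all choices made here for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: rebuild the postsynaptic targets of one presynaptic
// neuron from (seed, row) alone, so no connectivity matrix is ever stored.
std::vector<std::uint32_t> generateRow(std::uint64_t seed, std::uint32_t row,
                                       std::uint32_t numPost, double prob)
{
    assert(prob > 0.0 && prob < 1.0);
    // Seeding with (seed, row) makes every row reproducible on its own
    std::seed_seq seq{static_cast<std::uint32_t>(seed),
                      static_cast<std::uint32_t>(seed >> 32), row};
    std::mt19937 rng(seq);
    // Number of unconnected neurons skipped before the next connection
    std::geometric_distribution<std::uint32_t> skip(prob);

    std::vector<std::uint32_t> targets;
    for(std::uint64_t j = skip(rng); j < numPost; j += std::uint64_t{1} + skip(rng)) {
        targets.push_back(static_cast<std::uint32_t>(j));
    }
    return targets;
}

// Propagate one spike: regenerate the row, accumulate a constant weight
// into the postsynaptic input bins, then discard the row again.
void propagateSpike(std::uint64_t seed, std::uint32_t pre, double prob,
                    float weight, std::vector<float> &input)
{
    const auto numPost = static_cast<std::uint32_t>(input.size());
    for(std::uint32_t post : generateRow(seed, pre, numPost, prob)) {
        input[post] += weight;
    }
}
```

Because the generator state depends only on the seed and the row index, calling `generateRow` twice with the same arguments yields the same targets, so memory is traded for recomputation each time a neuron spikes.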

**Kernel merging.** While the procedural connectivity approach presented in the previous section allows us to simulate models which would otherwise not fit within the memory of a single GPU, there are additional problems when using code generation to generate simulation code for models with large numbers of neuron and synapse populations.

GeNN and, to the best of our knowledge (18), all other SNN simulators which use code generation to generate all of their simulation code (rather than, for example, NESTML (19), which uses code generation to generate neuron simulation code) generate separate pieces of code to simulate each population of neurons and synapses. This approach allows optimizations such as the hard-coding of constant parameters to be easily performed and, although generating code for models with many populations will result in large code size, C++ CPU code can be easily divided between multiple modules and compiled in parallel, minimizing the effect on build time. However, GPUs can only run a small number of kernels (which are equivalent to modules in this context) simultaneously (128 on the latest NVIDIA GPUs (TODO: cite)). Therefore, in GeNN, multiple neuron populations are simulated within each kernel, resulting in code such as the following example, which illustrates how 3 populations of 1000 neurons could be simulated in a single kernel:

```
void updateNeurons()
{
   if(thread < 1000) {
      // Update neuron population A
   }
   else if(thread >= 1000 && thread < 2000) {
      // Update neuron population B
   }
   else if(thread >= 2000 && thread < 3000) {
      // Update neuron population C
   }
}
```

This works well for models with small numbers of populations but, as Fig. 2A illustrates, compilation time increases super-linearly with the number of populations (i.e. the size of the neuron kernel), quickly becoming impractical. Additionally, and even more critically, Fig. 2B shows that simulations of the same model, artificially divided into more populations, run much slower. Each thread in this model reads 32 B of data and, as we discussed previously, hiding the latency of these memory accesses would require approximately 320 arithmetic operations. Sampling from the uniform distribution and updating a LIF neuron requires many fewer operations than this, so we would expect this kernel to be memory bound. Fig. 2C, obtained using data from the NVIDIA Nsight Compute profiler (TODO: cite), shows this to be true, with the memory system being around 90 % utilised for small numbers of populations. However, if the model is partitioned into large numbers of populations, the kernel stops being able to use the memory efficiently. Investigating further using the profiler showed that this drop in performance was accompanied by a large number of "No instruction" stalls (events preventing the GPU from doing any work during a given clock cycle), as illustrated in Fig. 2D.

Fig. 1. Performance scaling on a range of modern GPUs. A The best performing approach at each scale. B Raw performance of each approach.

Fig. 2. Performance of a simulation of  $1\,000\,000$  LIF neurons driven by a Gaussian input current, partitioned into varying numbers of populations. A Compilation time using GCC 7.5.0. B Simulation time for a  $1\,\mathrm{s}$  simulation. C Memory throughput reported by NVIDIA Nsight Compute profiler 'Speed of light' metric. D Number of 'No instruction' stalls reported by NVIDIA Nsight Compute profiler.
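To give a feel for how few operations a LIF update involves compared with the roughly 320 needed to hide memory latency, here is a minimal forward-Euler sketch in C++. The parameter names, the struct `LIFParams` and the choice of forward Euler are assumptions made for illustration; the text does not specify the integration scheme or parameters used in this benchmark.

```cpp
#include <cassert>

// Illustrative parameter set; the names are assumptions for this sketch
struct LIFParams
{
    float tauM;     // membrane time constant
    float vRest;    // resting potential
    float vThresh;  // spike threshold
    float vReset;   // reset potential
    float rMem;     // membrane resistance
};

// Advance one neuron by a timestep dt using forward Euler on
// tauM * dV/dt = (vRest - V) + rMem * I; returns true if it spiked.
bool updateLIF(float &v, float inputCurrent, const LIFParams &p, float dt)
{
    assert(dt > 0.0f && p.tauM > 0.0f);
    v += (dt / p.tauM) * ((p.vRest - v) + p.rMem * inputCurrent);
    if(v >= p.vThresh) {
        v = p.vReset;
        return true;
    }
    return false;
}
```

The whole update amounts to fewer than ten floating point operations and a comparison, so the memory accesses, not the arithmetic, would be expected to dominate such a kernel.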

```
struct NeuronUpdateGroup
{
   unsigned int numNeurons;
   float *V;
};

NeuronUpdateGroup neuronUpdateGroup[] =
{
   {1000, d_VA},
   {1000, d_VB},
   {1000, d_VC}
};

void updateNeurons()
{
   if(thread < 3000) {
      // Determine which population thread
      // should be processing and update using
      // variables in neuronUpdateGroup
   }
}
```
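One way the 'determine which population' step in the listing above could be realised is a binary search over the first thread index assigned to each group. The following CPU-side C++ sketch is an assumption-laden illustration, not GeNN's generated code: the extra `startThread` field and the names `GroupIndex` and `findGroup` do not appear in the original listing, and the loop over `thread` stands in for GPU threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per merged population; startThread is an assumed extra field
// holding the index of the first thread assigned to the population.
struct NeuronUpdateGroup
{
    unsigned int numNeurons;
    unsigned int startThread;
    float *V;
};

struct GroupIndex
{
    std::size_t group;   // which population the thread belongs to
    unsigned int local;  // neuron index within that population
};

// Binary search for the last group whose startThread is <= thread
GroupIndex findGroup(const std::vector<NeuronUpdateGroup> &groups, unsigned int thread)
{
    assert(!groups.empty() && thread >= groups.front().startThread);
    const auto it = std::upper_bound(
        groups.begin(), groups.end(), thread,
        [](unsigned int t, const NeuronUpdateGroup &g) { return t < g.startThread; });
    const std::size_t group = static_cast<std::size_t>(it - groups.begin()) - 1;
    return {group, thread - groups[group].startThread};
}

// CPU stand-in for the merged kernel: each iteration plays one GPU thread
// and applies the same placeholder update (adding dV) to its neuron.
void updateNeurons(std::vector<NeuronUpdateGroup> &groups, unsigned int numThreads, float dV)
{
    for(unsigned int thread = 0; thread < numThreads; thread++) {
        const GroupIndex idx = findGroup(groups, thread);
        if(idx.local < groups[idx.group].numNeurons) {
            groups[idx.group].V[idx.local] += dV;
        }
    }
}
```

The point of the design is that the code size no longer grows with the number of populations: adding a population adds one struct entry rather than another branch of generated code.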

**The multi-area model.** Due to a lack of computing power and sufficiently detailed connectivity data, previous models of the cortex have either focussed on modelling individual local microcircuits at the level of individual cells (21, 22) or on modelling multiple connected areas at a higher level of abstraction, where entire ensembles of neurons are described by a small number of differential equations (TODO: find citation). However, data from several species (TODO: find citation) have shown that cortical activity has distinct features at both the global and local levels which can only be captured by modelling interconnected microcircuits at the level of individual cells. The multi-area model (15, 23) does just this, using scaled versions of a previous 4-layer microcircuit model (22) to implement 1 mm<sup>2</sup> 'patches' for each of 32 areas of the macaque cortex involved in visual processing. The 32 areas are connected together with connectivity based on inter-area axon tracing data from the CoCoMac (24) database, further refined using additional anatomical data (25) and heuristics (26) to obtain estimates for the number of synapses connecting pairs of areas. Synapses between areas are then distributed between the populations which make up each area

By using a supercomputer to simulate a model based on the latest connectivity data and The multi-scale model of the macaque visual cortex (15) developed by

### **Discussion**

- Further scaling: memory only required for neuron parameters
- Learning
- Hardware for procedural connectivity?

#### **Materials and Methods**

Please describe your materials and methods here. This can be more than one paragraph, and may contain subsections and equations as required. Authors should include a statement in the methods section describing how readers will be able to access the data in the paper.

- LIF neuron
- Exponential static synapses
- Connectivity
- Parameter values for scaling and merging experiments

Neuron models. Example text for subsection.

**ACKNOWLEDGMENTS.** Please include your acknowledgments here, set in a single paragraph. Please do not include any acknowledgments in the Supporting Information, or anywhere else in the manuscript.

1. Herculano-Houzel S, Mota B, Lent R (2006) Cellular scaling rules for rodent brains. Proceedings of the National Academy of Sciences 103(32):12138–12143.
2. Yavuz E, Turner J, Nowotny T (2016) GeNN: a code generation framework for accelerated brain simulations. Scientific Reports 6(November 2015):18854.
3. Gewaltig MO, Diesmann M (2007) NEST (NEural Simulation Tool). Scholarpedia 2(4):1430.
4. Carnevale NT, Hines ML (2006) The NEURON book. (Cambridge University Press).
5. Jordan J, et al. (2018) Extremely Scalable Spiking Neuronal Network Simulation Code: From Laptops to Exascale Computers. Frontiers in Neuroinformatics 12(February):2.
6. Frenkel C, Lefebvre M, Legat JD, Bol D (2018) A 0.086-mm<sup>2</sup> 12.7-pJ/SOP 64k-Synapse 256-Neuron Online-Learning Digital Spiking Neuromorphic Processor in 28nm CMOS. IEEE Transactions on Biomedical Circuits and Systems PP(XX):1–1.
7. Frenkel C, Legat JD, Bol D (2019) A 65-nm 738k-Synapse/mm<sup>2</sup> Quad-Core Binary-Weight Digital Neuromorphic Processor with Stochastic Spike-Driven Online Learning in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). (IEEE), pp. 1–5.
8. Furber SB, Galluppi F, Temple S, Plana LA (2014) The SpiNNaker Project. Proceedings of the IEEE 102(5):652–665.
9. Merolla PA, et al. (2014) A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345(6197):668–673.
10. Qiao N, et al. (2015) A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses. Frontiers in Neuroscience 9(APR):1–17.
11. Schemmel J, Kriener L, Muller P, Meier K (2017) An accelerated analog neuromorphic hardware system emulating NMDA- and calcium-based non-linear dendrites. Proceedings of the International Joint Conference on Neural Networks 2017-May:2217–2226.
12. van Albada SJ, Helias M, Diesmann M (2015) Scalability of Asynchronous Networks Is Limited by One-to-One Mapping between Effective Connectivity and Correlations. PLoS Computational Biology 11(9):1–37.
13. Rhodes O, et al. (2019) Real-Time Cortical Simulation on Neuromorphic Hardware.
14. Knight JC, Nowotny T (2018) GPUs Outperform Current HPC and Neuromorphic Solutions in Terms of Speed and Energy When Simulating a Highly-Connected Cortical Model. Frontiers in Neuroscience 12(December):1–19.
15. Schmidt M, et al. (2018) A multi-scale layer-resolved spiking network model of resting-state dynamics in macaque visual cortical areas. PLoS Computational Biology 14(10):1–38.
16. Vogels TP, Abbott LF (2005) Signal Propagation and Logic Gating in Networks of Integrate-and-Fire Neurons. The Journal of Neuroscience 25(46):10786–10795.
17. Brette R, et al. (2007) Simulation of networks of spiking neurons: a review of tools and strategies. Journal of Computational Neuroscience 23(3):349–98.
18. Blundell I, et al. (2018) Code Generation in Computational Neuroscience: A Review of Tools and Techniques. Frontiers in Neuroinformatics 12(November).
19. Plotnikov D, et al. (2016) NESTML: a modeling language for spiking neurons. pp. 93–108.
20. Shinomoto S, et al. (2009) Relating neuronal firing patterns to functional differentiation of cerebral cortex. PLoS Computational Biology 5(7).
21. Izhikevich EM, Edelman GM (2008) Large-scale model of mammalian thalamocortical systems. Proceedings of the National Academy of Sciences of the United States of America 105(9):3593–8.
22. Potjans TC, Diesmann M (2014) The Cell-Type Specific Cortical Microcircuit: Relating Structure and Activity in a Full-Scale Spiking Network Model. Cerebral Cortex 24(3):785–806.
23. Schmidt M, Bakker R, Hilgetag CC, Diesmann M, van Albada SJ (2018) Multi-scale account of the network structure of macaque visual cortex. Brain Structure and Function 223(3):1409–1435.
24. Bakker R, Wachtler T, Diesmann M (2012) CoCoMac 2.0 and the future of tract-tracing databases. Frontiers in Neuroinformatics 6(DEC):1–6.
25. (2014) A weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cerebral Cortex 24(1):17–36.
26. Ercsey-Ravasz M, et al. (2013) A Predictive Network Model of Cerebral Cortical Connectivity Based on a Distance Rule. Neuron 80(1):184–197.



Fig. 3. Results of full-scale multi-area model simulation. A-C Raster plots of spiking activity of 3 % of the neurons in area V1 A, V2 B, and FEF C. Blue: excitatory neurons, red: inhibitory neurons. D-F Spiking statistics for each population across all 32 areas simulated using GeNN and NEST shown as split violin plots. Solid lines: medians, Dashed lines: Interquartile range (IQR). D Population-averaged firing rates. E Average pairwise correlation coefficients of spiking activity. F Irregularity measured by revised local variation LvR (20) averaged across neurons.