This repository features a C++ library for an external SRAM for the Arduino Uno. More specifically, the HM3-2064-5 external RAM is used together with two 74HC595 shift registers. This external RAM accepts
As RAM access times are crucial for performance, the project focuses on reducing the overhead of the external RAM and shows the idea of using the faster internal RAM as a buffer, analogous to how caching works on modern CPUs.
The pinout of the external SRAM HM3-2064-5 can be seen in the graphic below. A0 : A12 are the pins for the
As the Arduino has only
The following circuit diagram shows all connections between these components and the Arduino. The following color code is used:
- Purple wires are used for the $13$ address bits of the external RAM, where A0, A1 and A2 are connected directly to the Arduino, with A2 using the same pin as the serial communication with the shift registers. This setup allows for better performance when writing more than single bytes (e.g. floats with $4$ bytes each), which will be explained in more detail later.
- Green is used for the serial output of the Arduino and the daisy-chain serial wire between both shift registers.
- Ochre is the clock signal for the serial input of the shift registers.
- Orange wires are used for the $8$ I/O bits of the external RAM, connected directly to the Arduino to enable a parallel read and write after the address is set.
- Yellow wires are the read enable and write enable pins, which initiate read or write operations at the given address.
- Black wires are GND.
- Red wires are $5$ V.
For Linux users there is a Makefile for compiling, flashing and connecting to the serial port using `screen` or `cu`. This Makefile should also work for Mac users. Windows users could use an IDE like Microchip Studio. The Makefile needs the following packages, which are all available via package managers like apt.
- `make`
- `avr-libc` (libraries)
- `gcc-avr` (compiling)
- `avrdude` (flashing)
- `screen` or `cu` (serial connection)
The Makefile automatically finds a connected Arduino. Calling just `make` without any further recipes lists all main programs that can be flashed onto the Arduino, together with two recipes to connect to the serial port via `screen` or `cu`. As an example, to compile and flash the basic test program and to connect to the serial port using screen, one can use the following command while the Arduino is connected.
```sh
make test screen
```
This is the core part of the repository. This library carries out the read and write accesses to the external RAM. The pinout and other useful definitions are found in `config.hpp`.
The `extram_setup()` function needs to be called once. It configures the I/O pins of the Arduino that are required to communicate with the shift registers and the external SRAM.
The send_addr_to_sr()
function takes the ADDR_MSB
is set to ADDR_SR_LSB
is set to
```cpp
void send_addr_to_sr(uint16_t addr) {
    // send each bit starting from most significant
    for (uint16_t i = (1 << ADDR_MSB); i >= (1 << (ADDR_SR_LSB - 1)); i >>= 1) {
        if (addr & i)  // send 1 (green wire)
            PORT_SER |= MASK_SER;
        else  // send 0 (green wire)
            PORT_SER &= ~MASK_SER;
        // read into shift register by giving a clock pulse (ochre wire)
        PORT_SRCLK |= MASK_SRCLK;
        PORT_SRCLK &= ~MASK_SRCLK;
    }
}
```
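The bit-by-bit transfer can be tried out without hardware. The following is a minimal host-side sketch, not the library code: `ShiftRegisterChain` and `shift_address` are hypothetical names, the chain is modeled as a plain 16-bit integer, and the bit range is passed as parameters because the actual values of `ADDR_MSB` and `ADDR_SR_LSB` live in `config.hpp`.

```cpp
#include <cstdint>

// Host-side model of two daisy-chained 8-bit shift registers (16 bits in total).
// Hypothetical helper for illustration; it does not touch any AVR registers.
struct ShiftRegisterChain {
    uint16_t bits = 0;

    // One clock pulse: shift everything up by one position and take `ser` in at the bottom.
    void clock_in(bool ser) {
        bits = static_cast<uint16_t>((bits << 1) | (ser ? 1u : 0u));
    }
};

// Shift the address bits msb..lsb (inclusive) into the chain, most significant bit first.
// msb and lsb are illustrative parameters, not the values from config.hpp.
inline void shift_address(ShiftRegisterChain &sr, uint16_t addr, uint8_t msb, uint8_t lsb) {
    for (int bit = msb; bit >= lsb; bit--) {
        sr.clock_in(((addr >> bit) & 1u) != 0);
    }
}
```

After shifting bits 12 down to 3 of an address into an empty chain, the chain holds exactly `addr >> 3`, which is the part of the address that the Arduino does not drive directly in this model.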
The functions `extram_read()` and `extram_write()` carry out the read and write operations on the external SRAM. They are implemented as templates, which makes it easy to store and access different data types on the external SRAM.
The external SRAM is organized in bytes. If the data type spans multiple bytes, it is stored contiguously on the external SRAM. This is done using a reinterpret cast of the data pointer to `uint8_t *`. Here it matters that the three least significant address bits A0 : A2 are connected directly to the Arduino. Because of this, it suffices to call the costly `send_addr_to_sr()` function only once to jump to the right position on the external SRAM, while the following contiguous bytes are accessed by changing A0 : A2 directly. This increases performance significantly for larger data types, as the benchmarks show later. The only requirement is that the address has to be a multiple of the size of the data type.
The actual read and write command is given to the external SRAM by changing the state of the output enable OE and write enable WE accordingly.
```cpp
template <typename T>
T extram_read(uint16_t addr, uint16_t ind = 0) {
    // extram address
    uint16_t addr_extram = addr + ind * sizeof(T);
    // variable to return
    T data;
    // pointer to read the single bytes
    uint8_t *ptr = reinterpret_cast<uint8_t *>(&data);
    // send starting address to shift register
    send_addr_to_sr(addr_extram);
    // set OE active (yellow wire)
    PORT_OE &= ~MASK_OE;
    // set IO pins to input with pullup (orange wires)
    DDR_IO0 &= ~MASK_IO0;
    PORT_IO0 |= MASK_IO0;
    DDR_IO1 &= ~MASK_IO1;
    PORT_IO1 |= MASK_IO1;
    // read the single bytes
    for (uint8_t i = 0; i < sizeof(T); i++) {
        // least significant bits of address (purple wires A0:A2)
        PORT_ADDRLSB &= ~MASK_ADDRLSB;
        PORT_ADDRLSB |= MASK_ADDRLSB & (addr_extram + i);
        // wait for output ready
        _delay_us(0.12);
        // read from external RAM (orange wires)
        ptr[i] = PIN_IO0 & MASK_IO0;
        ptr[i] |= PIN_IO1 & MASK_IO1;
    }
    // set OE inactive (yellow wire)
    PORT_OE |= MASK_OE;
    // return
    return data;
}
```
```cpp
template <typename T>
void extram_write(T &data, uint16_t addr, uint16_t ind = 0) {
    // extram address
    uint16_t addr_extram = addr + ind * sizeof(T);
    // pointer to write single bytes
    uint8_t *ptr = reinterpret_cast<uint8_t *>(&data);
    // send starting address to shift register
    send_addr_to_sr(addr_extram);
    // set IO pins to output (orange wires)
    DDR_IO0 |= MASK_IO0;
    DDR_IO1 |= MASK_IO1;
    // write the single bytes
    for (uint8_t i = 0; i < sizeof(T); i++) {
        // least significant bits of address (purple wires A0:A2)
        PORT_ADDRLSB &= ~MASK_ADDRLSB;
        PORT_ADDRLSB |= MASK_ADDRLSB & (addr_extram + i);
        // set IO pins (orange wires)
        PORT_IO0 &= ~MASK_IO0;
        PORT_IO0 |= ptr[i] & MASK_IO0;
        PORT_IO1 &= ~MASK_IO1;
        PORT_IO1 |= ptr[i] & MASK_IO1;
        // write to external RAM by giving low pulse on WE (yellow wire)
        PORT_WE &= ~MASK_WE;
        PORT_WE |= MASK_WE;
    }
}
```
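To see why the address has to be a multiple of the data type size, the addressing scheme can be imitated on a host machine. The sketch below is an illustration under assumptions, not part of the library: `MockExtram`, `effective_addr` and `stays_in_window` are hypothetical names, the SRAM is replaced by a byte array of $2^{13}$ entries, and it is assumed that only the lowest three address bits follow the byte counter while the upper bits stay at the value sent for the start address.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical host-side stand-in for the external SRAM (13 address bits -> 8192 bytes).
struct MockExtram {
    std::array<uint8_t, 8192> cells{};

    // Upper address bits are fixed by the start address;
    // only the lowest three bits follow the byte counter i.
    static uint16_t effective_addr(uint16_t start, uint8_t i) {
        return static_cast<uint16_t>(((start & ~0x7u) | ((start + i) & 0x7u)) & 0x1FFFu);
    }

    template <typename T>
    void write(const T &data, uint16_t addr, uint16_t ind = 0) {
        const uint16_t start = static_cast<uint16_t>(addr + ind * sizeof(T));
        uint8_t bytes[sizeof(T)];
        std::memcpy(bytes, &data, sizeof(T));  // byte-wise view of the value
        for (uint8_t i = 0; i < sizeof(T); i++) cells[effective_addr(start, i)] = bytes[i];
    }

    template <typename T>
    T read(uint16_t addr, uint16_t ind = 0) const {
        const uint16_t start = static_cast<uint16_t>(addr + ind * sizeof(T));
        uint8_t bytes[sizeof(T)];
        for (uint8_t i = 0; i < sizeof(T); i++) bytes[i] = cells[effective_addr(start, i)];
        T data;
        std::memcpy(&data, bytes, sizeof(T));
        return data;
    }
};

// True if all bytes of an object of `size` bytes starting at `addr`
// share the same upper address bits, i.e. no carry into bit 3 occurs.
inline bool stays_in_window(uint16_t addr, uint16_t size) {
    return size > 0 && (addr >> 3) == ((addr + size - 1) >> 3);
}
```

In this model, a 4-byte value written at address 24 occupies addresses 24 to 27 as expected. Written at address 6, its last two bytes would land at addresses 0 and 1 instead of 8 and 9, because the carry into A3 is never sent to the shift registers. Addresses that are multiples of the type size (for sizes 1, 2, 4 and 8) never produce such a carry.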
This is a simple test which checks the functionality of the external RAM. It should be run after connecting the hardware to make sure that everything is working. For a few different data types, a vector is written to addresses spread randomly over the whole external SRAM, and the user is notified if there are any errors when reading the data back.
This is a test which measures the time of reading and writing a vector of length
data type | size [B] | write time [ms] | read time [ms] |
---|---|---|---|
`uint8_t` | | | |
`uint16_t` | | | |
`uint32_t`, `float` | | | |
`uint64_t` | | | |
It seems that reading is slightly faster than writing for bigger data types. This makes sense, as one iteration of the for loop over the single bytes in `extram_write()` is slightly more costly than in `extram_read()`.
Additionally, we can nicely see the benefit of connecting the three least significant address bits A0 : A2 directly to the Arduino. Although `uint64_t` is eight times the size of `uint8_t`, it takes less than
This raises the question of the bandwidth of the read and write operations for each data type. For this measurement, the time to read and write the whole external RAM with different data types is measured to calculate the corresponding bandwidth. The following table shows the results.
data type | size [B] | write time [ms] | write bandwidth [kB/s] | read time [ms] | read bandwidth [kB/s] |
---|---|---|---|---|---|
`uint8_t` | | | | | |
`uint16_t` | | | | | |
`uint32_t`, `float` | | | | | |
`uint64_t` | | | | | |
This supports the observation from above that reading is slightly faster than writing, especially for larger data types. The bandwidth results nicely show how the direct control of the three least significant address bits improves bandwidth for bigger data types by avoiding the costly calls of `send_addr_to_sr()` for each byte.
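The bandwidth figures follow from simple arithmetic, which the next sketch spells out. It is only an illustration: the function names are hypothetical, $1$ kB is assumed to be $1000$ bytes, and the cost model with one address transfer per element plus a fixed cost per byte is a simplification, not a measured result.

```cpp
#include <cstdint>

// Capacity implied by 13 address bits with 8 data bits per address.
constexpr uint32_t kExtramBytes = 1u << 13;  // 8192 bytes

// Bandwidth in kB/s for `bytes` transferred in `time_ms` milliseconds.
// Assumes 1 kB = 1000 B, so bytes per millisecond equals kB per second.
inline double bandwidth_kB_per_s(uint32_t bytes, double time_ms) {
    return static_cast<double>(bytes) / time_ms;
}

// Simplified cost model (assumption): each element needs one address transfer
// of cost t_shift, and every byte costs t_byte on top of that.
inline double model_time_per_byte(double t_shift, double t_byte, uint32_t elem_size) {
    return t_shift / static_cast<double>(elem_size) + t_byte;
}
```

In this model the address transfer is paid once per element, so its share per byte shrinks with the element size. This is the effect that the direct control of A0 : A2 is meant to exploit.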
This is a library for benchmarking the Bubblesort algorithm. Bubblesort is implemented on internal RAM and on external RAM. Additionally, there is a chunked Bubblesort. The list to be sorted is divided into
The following table shows the time to sort a vector of length
method | time uint8_t [ms] | time uint16_t [ms] | time uint32_t [ms] |
---|---|---|---|
internal list without chunks | |||
external list without chunks | |||
internal list internal chunks | not enough space | ||
external list external chunks | |||
external list internal chunks |
The chunked Bubblesort algorithm clearly outperforms standard Bubblesort. This is because the computational effort of Bubblesort grows quadratically with the length
For normal Bubblesort, the performance on the external RAM is considerably worse than on the internal RAM, by a factor of
The performance difference for chunked Bubblesort between internal RAM and external RAM with external chunks is smaller with around factor uint32_t
values, as the corresponding vector takes
Let us consider the external RAM as the main RAM and the internal RAM as a cache, analogous to modern CPUs, which have much faster but also smaller caches in addition to the RAM. Between the external Bubblesort with external chunks and with internal chunks, the required time is around
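The idea of sorting chunks in a small fast buffer can be sketched on a host machine as follows. This is not the code from the benchmark library: `bubblesort` and `chunked_sort` are hypothetical names, two `std::vector` objects stand in for the external list and the internal chunk buffer, and the final merge step (repeatedly taking the smallest chunk head) is an assumption about how sorted chunks could be combined.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Plain Bubblesort on the half-open range [first, last) of v.
template <typename T>
void bubblesort(std::vector<T> &v, std::size_t first, std::size_t last) {
    for (std::size_t end = last; end > first + 1; end--) {
        bool swapped = false;
        for (std::size_t i = first; i + 1 < end; i++) {
            if (v[i] > v[i + 1]) {
                std::swap(v[i], v[i + 1]);
                swapped = true;
            }
        }
        if (!swapped) break;  // already sorted
    }
}

// Sort each chunk in a small buffer, then merge the sorted chunks (assumed merge strategy).
template <typename T>
std::vector<T> chunked_sort(const std::vector<T> &list, std::size_t chunk_len) {
    if (chunk_len == 0) chunk_len = 1;
    std::vector<T> work = list;  // stands in for the list in slow external RAM
    std::vector<T> chunk;        // stands in for the fast internal buffer
    const std::size_t n = work.size();
    for (std::size_t start = 0; start < n; start += chunk_len) {
        const std::size_t stop = std::min(start + chunk_len, n);
        const auto first = work.begin() + static_cast<std::ptrdiff_t>(start);
        const auto last = work.begin() + static_cast<std::ptrdiff_t>(stop);
        chunk.assign(first, last);                      // load chunk into the buffer
        bubblesort(chunk, 0, chunk.size());             // sort in the buffer
        std::copy(chunk.begin(), chunk.end(), first);   // store chunk back
    }
    // Merge: repeatedly take the smallest element among the current chunk heads.
    std::vector<std::size_t> head;
    for (std::size_t s = 0; s < n; s += chunk_len) head.push_back(s);
    std::vector<T> out;
    out.reserve(n);
    while (out.size() < n) {
        std::size_t best = head.size();
        for (std::size_t c = 0; c < head.size(); c++) {
            const std::size_t stop = std::min((c + 1) * chunk_len, n);
            if (head[c] < stop && (best == head.size() || work[head[c]] < work[head[best]]))
                best = c;
        }
        out.push_back(work[head[best]++]);
    }
    return out;
}
```

With $k$ chunks, each Bubblesort pass works on a list that is $k$ times shorter, so the quadratic part of the effort drops roughly by a factor of $k$, while the merge adds a cost that grows with the number of chunks. Only the load and store of each chunk touch the slow memory in this sketch.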
This is another library for benchmarking. The 2D Poisson equation with Dirichlet boundary conditions on the unit square
is solved using a Jacobi-solver with a mesh of
$$ \phi^\text{new}_{i,j} = \frac{1}{4} \left(\phi^\text{old}_{i+1,j} + \phi^\text{old}_{i-1,j} + \phi^\text{old}_{i,j+1} + \phi^\text{old}_{i,j-1} - f_{i,j}\right) \text{ for } i,j \in \{1, \dots, N+1\}\text{.} $$
Typically
The function `solve()` is implemented on the internal RAM. The function `solve_extram()` is implemented on external RAM, with the described buffer also on external RAM. `solve_extram_buffered()` has the buffer allocated on internal RAM, reducing the number of external RAM accesses. The function `solve_extram_doublebuffered()` has an even bigger buffer on the internal RAM to reduce the number of external RAM accesses further.
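One common way to run a Jacobi sweep with a small buffer instead of a second full grid is to keep only the old values of the previous row. The sketch below illustrates that idea on a host machine and is an assumption about the buffering scheme, not the code from `lib_poisson.cpp`. The function names, the row-major layout with a boundary ring, and the parameter `n` (interior points per direction) are hypothetical, and `f` is taken as already scaled, as in the update formula above.

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<float>;  // row-major, (n + 2) x (n + 2) including the boundary

// Reference: one Jacobi sweep that writes into a full second grid.
inline Grid jacobi_sweep_two_grids(const Grid &phi, const Grid &f, std::size_t n) {
    const std::size_t w = n + 2;
    Grid out = phi;  // boundary values are carried over unchanged
    for (std::size_t i = 1; i <= n; i++) {
        for (std::size_t j = 1; j <= n; j++) {
            out[i * w + j] = 0.25f * (phi[(i + 1) * w + j] + phi[(i - 1) * w + j] +
                                      phi[i * w + j + 1] + phi[i * w + j - 1] - f[i * w + j]);
        }
    }
    return out;
}

// In-place sweep: only the old values of one row are buffered.
inline void jacobi_sweep_row_buffer(Grid &phi, const Grid &f, std::size_t n) {
    const std::size_t w = n + 2;
    // Old values of row i - 1; initially the boundary row 0, which never changes.
    std::vector<float> prev_row(phi.begin(), phi.begin() + static_cast<std::ptrdiff_t>(w));
    for (std::size_t i = 1; i <= n; i++) {
        float old_left = phi[i * w];  // left boundary value of row i
        for (std::size_t j = 1; j <= n; j++) {
            const float old_center = phi[i * w + j];
            // Row i + 1 and column j + 1 have not been updated yet, so they are still old.
            phi[i * w + j] = 0.25f * (phi[(i + 1) * w + j] + prev_row[j] +
                                      phi[i * w + j + 1] + old_left - f[i * w + j]);
            prev_row[j] = old_center;  // becomes the "previous row" value for row i + 1
            old_left = old_center;
        }
    }
}
```

Both functions compute the same sweep. The second one needs only $n + 2$ buffered values, which is why placing such a buffer in fast memory can pay off when the grid itself lives in slow memory.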
The following table shows the times for solving the 2D Poisson equation with `lib_poisson.cpp` using the different methods described above for
method | description | time [ms] |
---|---|---|
internal | internal | |
external | external | |
external buffered | external | |
external double buffered | external | |
This shows that using only the external RAM is slower, but in this case it takes just
Let us again consider the external RAM as the main RAM and the internal RAM as a cache, analogous to modern CPUs. Then the two implementations using the internal cache as a buffer show how caching can improve performance. The most sophisticated caching approach reduces the time by around
This paragraph is meant as an outlook. On modern CPUs, the compiler tries its best to make use of the cache by itself, but sometimes the user has to change the program slightly to make caching possible.
In our specific case, the following problem could arise. We are using a vector of length
But in our specific example, this problem can easily be fixed. At the moment, the grid update is calculated row-wise. Cutting the grid into stripes and using the row-wise approach afterwards makes it possible to use a buffer depending on the stripe width instead of the whole width
This is a helper library for serial printing using the USART. One could also use the serial library from the Arduino IDE. After calling the `usart_setup()` function once, all implemented serial print functions for the specific data types can be used.
This is a helper library to measure elapsed time in ms using the Timer/Counter 0. One could also use the `millis()` function from the Arduino IDE. The following code shows the basic usage.
```cpp
// setup millisecond timer
timer_setup();  // only one call required
uint32_t t;
// measure and print time
timer_reset();
// CODE TO BE MEASURED HERE
t = timer_getms();
serprintuint32(t);
serprint(" ms\n\r");
```
This test tries to find the overhead caused by the time measurement. If one plugs in the `_delay_ms()` function from `util/delay.h`, the following times are measured with the `lib_time` library when counting milliseconds and decimilliseconds, respectively.
OCR1A | precision [ms] | time [ms] for `_delay_ms(100000)` | overhead |
---|---|---|---|
 | | | |
 | | | |
One could set `OCR1A` a little bit lower to compensate for the overhead, but this might depend on compiler options, which is why we will just leave it at
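The relation between a compare value and the tick rate, as well as the overhead figure, can be written down as a short sketch. It assumes that the timer runs in clear-timer-on-compare (CTC) mode, where the counter runs from 0 to the compare value, so one period lasts compare value plus one timer ticks. The function names and the prescaler in the usage note are hypothetical and not taken from `lib_time`.

```cpp
#include <cstdint>

// Compare value for CTC mode (assumed): the counter runs 0..OCR,
// so one period is (OCR + 1) * prescaler CPU cycles.
constexpr uint32_t ctc_compare_value(uint32_t f_cpu_hz, uint32_t prescaler, uint32_t tick_hz) {
    return f_cpu_hz / (prescaler * tick_hz) - 1;
}

// Relative overhead of a measured time against the nominal delay, in percent.
constexpr double overhead_percent(double measured_ms, double nominal_ms) {
    return 100.0 * (measured_ms - nominal_ms) / nominal_ms;
}
```

As a usage example with illustrative inputs, `ctc_compare_value(16000000, 64, 1000)` yields 249 for the $16$ MHz clock of the Arduino Uno with a hypothetical prescaler of 64 and a millisecond tick. Lowering the compare value by one shortens each tick by one prescaled timer cycle.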
- Datasheet HM3-2064-5
- Datasheet 74HC595
- Bubblesort
- Discrete Poisson Equation
- Jacobi-solver