# Import

Example of journalling along with the codebase.

In [2]:
import numpy as np
import pandas as pd

In [20]:
performance_flops = {}
power_eff = {}
transistor_count = {}
processor_count = {}

# Microsoft Brainwave

While exploring a tabular dataset, how do you find unique points?

Stratix10 280 has 5760 DSPs (11.5K 18x19 muls) + 930K ALMs. On dot products it can do (11.5K + 11.5K) x 0.5GHz =~12 TOPS
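The peak figure above can be reproduced as a quick check. This is a minimal sketch, assuming each DSP block supplies two 18x19 multipliers and that every multiply is paired with one add per cycle, which is what the note's "(11.5K + 11.5K)" implies. The variable names are illustrative only.

```python
# Back-of-envelope peak dot-product rate for the Stratix 10 280,
# using only the figures quoted in the note above.
dsp_blocks = 5760      # DSP blocks on the device
muls_per_dsp = 2       # assumption: two 18x19 multipliers per DSP block
clock_ghz = 0.5        # clock rate used in the note

multipliers = dsp_blocks * muls_per_dsp         # 11,520, i.e. about 11.5K
ops_per_cycle = 2 * multipliers                 # one multiply plus one add each
peak_tops = ops_per_cycle * clock_ghz / 1000    # Gops/s -> Tops/s

print(f"{multipliers} multipliers, {peak_tops:.2f} TOPS peak")
```

That lands at roughly 11.5 TOPS from the DSP blocks alone, consistent with the "~12 TOPS" estimate.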

In [9]:

performance_flops["brainwave"] = 39.5 #Teraops/s
power_eff["brainwave"] = 720 #GFlops/W Underwhelming

In [3]:
stratix10 = {}
stratix10["dsp_count"] = 5760
11.5 * 1024 * 18 * 19

4027392.0

![image.png](attachment:image.png)

ms-fp8/fp9: Microsoft's "proprietary parameterizable narrow precision format": FPGA-efficient 8- and 9-bit floating point (?) formats. ms-fp9 at 65 TOPS and ms-fp8 at 90 TOPS at 0.5 GHz. Yesterday's demo was 39.5 TOPS at "130000/cycle" = ~300 MHz
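The clock rates and per-cycle counts quoted above are related by simple division. A small sketch, assuming "130000/cycle" means operations issued per clock cycle; the variable names are made up for this check.

```python
# Relating the quoted TOPS figures to clock rate and ops per cycle.
demo_tops = 39.5            # demo throughput, Tera-ops/s
demo_ops_per_cycle = 130_000

# Clock implied by the demo: ops/s divided by ops/cycle.
demo_clock_mhz = demo_tops * 1e12 / demo_ops_per_cycle / 1e6

# Ops per cycle implied by the headline figures at 0.5 GHz.
target_clock_hz = 0.5e9
fp9_ops_per_cycle = 65e12 / target_clock_hz
fp8_ops_per_cycle = 90e12 / target_clock_hz

print(f"demo clock   ~ {demo_clock_mhz:.0f} MHz")
print(f"ms-fp9 needs ~ {fp9_ops_per_cycle:,.0f} ops/cycle at 0.5 GHz")
print(f"ms-fp8 needs ~ {fp8_ops_per_cycle:,.0f} ops/cycle at 0.5 GHz")
```

Under that reading, the 65 TOPS ms-fp9 figure is the same 130,000 ops/cycle as the demo, just clocked at 0.5 GHz instead of about 300 MHz.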

You can do >1 short mul in one 18x19 DSP mul, but that won't get you to 65 or 90 TOPS (32 or 45 Tmul/s?). I think mostly LUT mul/adds
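One common way to get more than one short multiply out of a wide multiplier is to pack two operands into one input when they share the other factor. The sketch below is an assumption-laden illustration, not Microsoft's scheme: it uses unsigned 5-bit operands and a hypothetical helper `packed_mul`. Signed operands would need correction terms that are not shown.

```python
# Two 5-bit x 5-bit unsigned products from one wide multiplication.
FIELD_BITS = 10  # a 5-bit x 5-bit unsigned product fits in 10 bits

def packed_mul(a: int, b: int, c: int) -> tuple:
    """Return (a*c, b*c) using a single wide multiply (hypothetical helper)."""
    assert 0 <= a < 32 and 0 <= b < 32 and 0 <= c < 32
    packed = a | (b << FIELD_BITS)          # 15 bits, within an 18-bit input
    wide = packed * c                       # equals a*c + ((b*c) << 10)
    low = wide & ((1 << FIELD_BITS) - 1)    # low field holds a*c
    high = wide >> FIELD_BITS               # high field holds b*c
    return low, high

print(packed_mul(31, 7, 29))
```

Even with two products per multiplier, 11.5K multipliers at 0.5 GHz stay well short of the 32 or 45 Tmul/s mentioned above, which is why the note points at LUT-based multiply/adds.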

So it's fun to think what ms-fp8 might be. Separate sign or 2's comp? 5-6 bit mantissa? 2-3 bit exponent? Balance mul and add area.

Worst part of fp-add is denorm one operand to match exponents, then add, then norm the sum. Two barrel shifts = brutal in FPGAs.

For narrow exp dot products, no need. Sum terms a[i]*w[i] with a wide fixed point adder tree. FPGAs are sheer terrors at fixed pt add.
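The idea of skipping per-add normalization can be sketched in software. The format below (`TinyFloat`, 1 sign bit, 2-bit exponent, 5-bit integer mantissa) is purely hypothetical and is not a description of ms-fp8. It only shows that with a narrow exponent each product can be shifted into a wide integer and summed with plain integer adds.

```python
from typing import NamedTuple

class TinyFloat(NamedTuple):
    """Hypothetical narrow format: value = (-1)**sign * mant * 2**exp."""
    sign: int   # 0 or 1
    exp: int    # unsigned exponent, 0..3 (2 bits)
    mant: int   # unsigned integer mantissa, 0..31 (5 bits)

def value(x: TinyFloat) -> int:
    """Reference decode of one element."""
    return (-1) ** x.sign * x.mant * 2 ** x.exp

def dot_fixed_point(a, w) -> int:
    """Sum a[i]*w[i] in a wide integer accumulator, with no per-add shifts."""
    acc = 0
    for x, y in zip(a, w):
        # 10-bit mantissa product, shifted by at most 6: a 16-bit term.
        term = (x.mant * y.mant) << (x.exp + y.exp)
        acc += -term if x.sign ^ y.sign else term
    return acc

a = [TinyFloat(0, 1, 3), TinyFloat(1, 2, 5), TinyFloat(0, 0, 31)]
w = [TinyFloat(0, 3, 7), TinyFloat(0, 0, 2), TinyFloat(1, 3, 31)]
print(dot_fixed_point(a, w))  # matches sum(value(x) * value(y))
```

The only shift is the one that places each product into the fixed-point range. The additions themselves are ordinary integer adds, which is the part FPGAs handle well.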

Anyway, after 25 years of FPGA reduced prec research, now have 'killer app': machine learning w/ 9-8-7-6-5-4-3-2b? weights/activations

With FPGAs there is / can be an everything bit numeric format. Or variable precision even in different operators of same type.

When Altera did 32b SP FPU-DSPs we asked designer ML how A might stretch to 64b DP (e.g. II=2 cycles). Now thinking <= 9b FPUs!

Parallelizing:

An RDMA-like protocol breaks messages into packets that can be spread across a pool of networked FPGAs within 10 microseconds. A single instruction can access 130,000 of Microsoft’s floating point operations in a cycle. The company declined to describe its fp format except to say it was not based on the IEEE standard.



![image.png](attachment:image.png)

# Baidu XPU

Baidu: tiny cores are 1250 LUTs/4 DSP/5 BRAM, run at 600 MHz. >500 MHz in Xilinx is nontrivial. I've seen deep pipelining / MTA, limited result fwding/reg file bypass, and/or DSP datapath.

Recurrence: operand reg FF clk-out => ALU => result mux => result forwarding mux (regfile bypass) => operand reg FF setup is >2 ns.
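To put the ">2 ns" recurrence in clock terms: a register-to-register path limits the clock to the reciprocal of its delay. A minimal sketch with a hypothetical helper `fmax_mhz`; the only inputs are the 2 ns path and the 600 MHz target quoted in these notes.

```python
def fmax_mhz(path_delay_ns: float) -> float:
    """Maximum clock (MHz) for a single-cycle path of the given delay."""
    return 1000.0 / path_delay_ns

recurrence_ns = 2.0                    # the ALU -> forwarding mux -> operand reg loop
target_mhz = 600.0                     # Baidu tiny-core clock
target_period_ns = 1000.0 / target_mhz

print(f"2 ns loop caps the clock at {fmax_mhz(recurrence_ns):.0f} MHz")
print(f"600 MHz needs the loop to close in {target_period_ns:.2f} ns")
```

So a full single-cycle forwarding loop tops out at or below 500 MHz, and reaching 600 MHz means shortening or pipelining that loop, for example by limiting result forwarding.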

DSP datapath soft processors (NTU iDEA https://dr.ntu.edu.sg/handle/10220/16290?show=full … @sfahmy) lack full result fwding => reuse hazards in pipeline

Which can lead to lower IPC in compiled code. Note: no C compiler for Baidu tiny core -- use assembly.

That can be a reasonable approach and tradeoff vs. RTL but not as high productivity as "same C++ code runs on host or accelerator".

Anyway nice Baidu work validates a "software first, software mostly" path to making FPGA accelerators accessible to SW engineers.

![image.png](attachment:image.png)

# Nvidia Tesla

The Tesla P100 can deliver 21 TFLOPS of 16-bit floating point, which is ideal for DNN applications. It employs CoWoS (Chip-on-Wafer-on-Substrate) with HBM2 (high-bandwidth memory version 2) technology. AMD used HBM on its Radeon R9 GPU (see “Best of 2015: High Bandwidth Memory Helps GPU Deliver on Performance” on electronicdesign.com). The Tesla P100 has four NVLinks, allowing multiple chips to be combined into a single compute node.



# Google TPU

Google got 20 percent better results by having a neural net decide how to spread a graph compute problem over a pool of GPUs.

![image.png](attachment:image.png)

Parallelizing:

In addition, no one knows how to best create machines to process dynamic models across multiple boards. Developers still want batch sizes measured in millions, requiring new leaps in parallel training techniques.


Perhaps the trickiest bit is knowing what algorithms will be important in the next three or four years, he said.

“The field is moving extremely quickly. In a weekend we can design new optimization update rules. You have to make your best guess…[but] I’m pretty convinced we will end up in a pretty good state,” he said.



In [21]:
performance_flops["TPU"] = 92 #Teraops/s

![image.png](attachment:image.png)

Google discovered inference work needs low latency more than high throughput. It also appreciated the choice to use a single-threaded design. “I find 18-threaded machines hard to think about,” said Young.

Asked what he would change, he said, “if I had it to do all over I would buy a real floating point unit and not worry about little problems that are still giving us bugs.”



# Amazon Soft FPGA

Xilinx VU9P chips on F1 Service. 

The F1 service provides either one or eight FPGAs paired with eight or 64 CPUs, respectively. Each FPGA provides users with 64 GBytes of memory, 2.5M logic elements and 6,800 DSPs. Amazon places multiple FPGAs on a 400 Gbits/s ring bus to enable streaming bandwidth.

Amazon’s FPGA Shell encapsulates the chip, its DDR4 memory controllers, PCIe links and the 400 Gbit/s interconnect. The company offers an SDK, including tools and APIs to program it.



![image.png](attachment:image.png)

# Baidu 

Voice search service on about 3,000 servers with FPGAs. Altogether, it uses more than 10,000 FPGA cards today.



The device, currently implemented in a Xilinx VU9P, packs 256 tiny cores in 32-unit clusters running at a blazing 600 MHz. Each core contains 1,252 LUTs and four DSPs. It runs simple MIPS-like instructions with a bit of scratchpad memory.

![image.png](attachment:image.png)

The chip acts as an accelerator with no operating system or cache. It rides a PCIe card and currently is programmed in assembler.

The XPU has access to four 72-bit banks of 2400 MHz DDR memory. The chip also contains custom logic to run Baidu’s SDA-II deep learning software at 6.144 TeraOps/second.
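The XPU figures above can be totalled up for a sense of scale. This is only arithmetic on the quoted numbers. The last line divides the 6.144 TeraOps/s figure by cores times clock, which is an assumption made here for comparison: the source attributes that throughput to the chip's custom logic, so the quotient is not a claim about what a single tiny core does.

```python
# Totals derived from the quoted Baidu XPU figures.
cores = 256
cores_per_cluster = 32
luts_per_core = 1252
dsps_per_core = 4
clock_hz = 600e6
chip_tops = 6.144

clusters = cores // cores_per_cluster    # number of 32-core clusters
total_luts = cores * luts_per_core       # LUTs used by the tiny cores
total_dsps = cores * dsps_per_core       # DSPs used by the tiny cores

# Scale comparison only: chip-level ops per cycle per tiny core.
ops_per_cycle_per_core = chip_tops * 1e12 / (cores * clock_hz)

print(f"{clusters} clusters, {total_luts:,} LUTs, {total_dsps:,} DSPs")
print(f"~{ops_per_cycle_per_core:.0f} ops/cycle per core equivalent")
```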



In [22]:
performance_flops["baidu"] = 6.144 # Teraops/s

# Wave Computing

![image.png](attachment:image.png)

# Nvidia 

Nvidia’s Volta (below), announced in May, is nearly twice Vega’s size, a massive 815mm2. It more squarely targets the emerging market for neural network training that today is hungry for more performance.

Volta has 5,120 processors, 21 billion transistors, and can deliver up to 120 TFlops.

In [23]:
processor_count["nvidia_volta"] = 5120
transistor_count["nvidia_volta"] = 21 #billion
performance_flops["nvidia_volta"] = 120 #Tflops

# AMD Vega

![image.png](attachment:image.png)

AMD said its Vega 10 (above), announced in January, is the first consumer chip to use HBM2 memory stacks and marks its return to the market for general-purpose GPU computing.



Vega has 4,096 processors, 12.5 billion transistors, and can deliver up to 26 TFlops.

In [17]:
processor_count["amd_vega"] = 4096
transistor_count["amd_vega"] = 12.5 #billion
performance_flops["amd_vega"] = 26 #TFlops
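With a few entries collected, the dictionaries can be lined up for a rough comparison. The sketch below repeats the values recorded in this notebook so that it runs on its own. The ranking is only indicative, because the throughput figures mix Tera-ops and TFlops at different numeric precisions. The derived name `tera_per_billion` is introduced here for illustration.

```python
# Values as recorded in the cells of this notebook (Tera-ops/s or TFlops).
performance_flops = {
    "brainwave": 39.5,
    "TPU": 92,
    "baidu": 6.144,
    "nvidia_volta": 120,
    "amd_vega": 26,
}
transistor_count = {"nvidia_volta": 21, "amd_vega": 12.5}   # billions
processor_count = {"nvidia_volta": 5120, "amd_vega": 4096}

# Rank chips by quoted peak throughput, highest first.
ranked = sorted(performance_flops, key=performance_flops.get, reverse=True)

# Throughput per billion transistors, only where both figures are known.
tera_per_billion = {
    chip: performance_flops[chip] / transistor_count[chip]
    for chip in transistor_count
}

for chip in ranked:
    density = tera_per_billion.get(chip)
    extra = f", {density:.2f} per billion transistors" if density else ""
    print(f"{chip:>13}: {performance_flops[chip]:7.3f} Tera-ops/s{extra}")
```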

![image.png](attachment:image.png)

# Academic - Celerity

16nm TSMC PDK to tapeout of a 25mm2 device in nine months

Celerity packs into its 360 million transistors 511 RISC-V cores and a custom neural network accelerator. The cores are split into a 496-unit array running at 1.05 GHz and five Linux-capable Rocket cores along with the accelerator running at 625 MHz.


The students claim an astounding 700-1,220x speedup by using specialty and many-core tiers in collaboration, compared to using either in isolation. They expect first silicon in September.

The distributed team attributed its condensed design time and relatively minuscule $1.3 million budget to a mixture of factors. They relied heavily on open source IP and automated tools, and their design is highly modular. The teams hailed from Cornell, the University of Michigan and the Universities of California in Los Angeles and San Diego, supported by funds from DARPA.

