# Architecture and Performances comparison of Network on Chip router for Hierarchical Mesh Topology

Bouraoui Chemli

Electronics and Microelectronics Laboratory
Faculty of Sciences of Monastir
University of Monastir
Tunisia

E-mail: bouraoui.chemli@fsm.rnu.tn

Abdelkrim Zitouni

College of Education in Jubail
University of Dammam
KSA

E-mail: azitouni@uod.edu.sa

Abstract— Recently, Network on Chip (NoC) has emerged as a promising solution for future complex Systems on Chip (SoC). As opposed to bus technology, a NoC allows hundreds or thousands of cores (processors, memories, etc.) to communicate on a single chip. This work provides a comparison and performance analysis of three regular NoC topologies. We present the pipeline stages of the proposed router, which is the backbone of the NoC. The proposal supports the hierarchical mesh topology and uses a minimal routing algorithm to avoid deadlocks and a priority-based arbiter to satisfy the quality of service (QoS) expected of the NoC. Results are presented and compared with other works in terms of maximal clock frequency, area, power consumption and peak performance.

Keywords— Architecture; NoC; Router; topology

## I. INTRODUCTION

Recently, NoC has been the subject of much research. It implements routing calculation and packet switching to decrease hardware cost and power consumption and to increase scalability and system performance [1]. A NoC can handle the communication of hundreds of cores and allows several concurrent transactions [2]. Hence, NoC is presented as a better candidate for future on-chip communication. It offers more flexibility, higher scalability and lower latency than conventional bus technology [3]. A NoC architecture mainly consists of three parts: the router, which forwards packets across the network; the network interface, which gives access to the network; and the link, which interconnects the NoC parts [4]. The way these parts are connected affects data transmission. Thus, the NoC topology not only plays an important role in determining network latency, throughput and area, but also affects the scalability and reusability of the NoC design. In this paper, we present a scalable and flexible router architecture for a hierarchical mesh NoC topology. We then compare and analyze the performance of the proposal against other common topologies, namely mesh, 3D mesh and diagonal mesh.

This paper is organized as follows. Section II deals with related work. Section III describes the pipeline stages of the proposed router in detail. Section IV provides the performance results, and Section V concludes the paper.

## II. RELATED WORK

Several research efforts have addressed NoC architecture. This section introduces and discusses some of the NoC designs proposed in the scientific literature.

In [5], the authors presented a NoC architecture that uses the mesh topology, implements wormhole switching and adopts stall-and-go flow control. To decrease latency and boost throughput, they implement a Short-Pass-Link customization. However, their design still needs to reduce both the hardware cost and the power consumption. In [6], the authors suggested a scalable packet-based router architecture that dynamically transfers and manages many transactions concurrently. Furthermore, they use their router in mesh and torus NoCs. Their design suffers, however, from low throughput and high power consumption. In [7], the authors presented a new NoC architecture. They used the mesh topology, the XY routing algorithm and credit-based flow control. To support and differentiate between packet QoS levels, they implement a dynamic arbiter. Although their architecture reduced the average network latency, they did not study deadlock situations, which are critical for NoC design. In [8], the authors proposed a flexible router design for mesh NoCs. The ASIC synthesis of their design showed promising results in terms of power consumption and area. Nonetheless, handshaking flow control can increase calculation time, and a credit-based flow control seems preferable. In [9], a Mesh/Torus NoC design was introduced. It uses input queuing, the XY routing algorithm and virtual channels. Although the design is deadlock free and implements a simple routing mechanism, it suffers from area overhead and a lack of scalability. In [10], we proposed a router architecture for use in the mesh NoC topology. We applied the deadlock-free negative-first algorithm, which is a turn-model based routing algorithm. The router also uses a dynamic arbiter to provide QoS and serve packets fairly. Still, the router pipeline stages suffer from dependencies, which increase latency and hardware cost.

ICEMIS2017, Monastir, Tunisia

Consequently, we propose a router architecture for the hierarchical mesh NoC topology. The proposal has been prototyped on Virtex5 and Virtex6 FPGAs. We provide a performance comparison with three common topologies: diagonal mesh, 2D mesh and 3D mesh.

## III. NOC ARCHITECTURE

## A. Router pipeline stages

The router is the backbone of the NoC design and defines its communication architecture. As presented in Figure 1, the proposed router architecture is primarily composed of three pipeline stages: routing calculation, switch allocation and crossbar traversal. First, packets are received at the input ports from neighboring routers or from the locally connected core. Then, the routing calculation and the arbitration process are performed at the same time in order to reduce latency. Once both processes have finished, the router sends the destination port information to the crossbar, which finally establishes a connection allowing the packets to reach their target.



Fig. 1. Router pipeline stages.

## B. NoC Topology

As illustrated in Figure 2, the proposed NoC uses a regular topology, a 4x4 hierarchical mesh, with the well-known wormhole switching as the switching technique and handshaking as the flow control. We assign a unique address, defined in XY coordinates, to each router. Routers are connected to each other via ports, depending on their position in the network. Each router can have a maximum of six bidirectional ports. Four of them are connected to the adjacent routers in each direction (north, east, south and west), one is connected to the (i+2, j) router, and the last one is connected to the local core. To reduce both the power consumption and the area of the chip, we deactivate any port that has no connection to another router.



Fig. 2. 4x4 NoC hierarchical mesh topology.
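The port rules of this topology can be illustrated with a short sketch that counts the active ports of each router in the 4x4 case. It assumes that the extra link pairs R (i, j) with R (i+2, j), so that a router with i = 2 or i = 3 reaches R (i-2, j) over the same bidirectional link. The port label `express` and the function names are hypothetical and are not taken from the design.

```python
N = 4  # the network is N x N (4 x 4 in the text)

def neighbours(i, j, n=N):
    """Return the routers linked to R(i, j), keyed by port name (4x4 case)."""
    links = {}
    if j + 1 < n:
        links["north"] = (i, j + 1)
    if i + 1 < n:
        links["east"] = (i + 1, j)
    if j - 1 >= 0:
        links["south"] = (i, j - 1)
    if i - 1 >= 0:
        links["west"] = (i - 1, j)
    # Assumed pairing of the extra link: R(i, j) <-> R(i+2, j).
    if i + 2 < n:
        links["express"] = (i + 2, j)
    elif i - 2 >= 0:
        links["express"] = (i - 2, j)
    return links

def port_count(i, j, n=N):
    """Active ports = links to other routers + the local core port."""
    return len(neighbours(i, j, n)) + 1

if __name__ == "__main__":
    # Print the port count of every router, top row (largest j) first.
    for j in reversed(range(N)):
        print(" ".join(str(port_count(i, j)) for i in range(N)))
```

Under this assumption, corner routers end up with four ports, edge routers with five and the four central routers with six; ports without a link are simply never created, which mirrors the deactivation of unconnected ports.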

## C. Switching technique

The proposed architecture applies the well-known wormhole switching technique. As displayed in Figure 3, a packet is composed of a header flit and the body flits. Each flit is 32 bits wide. The first four bits of the header flit are dedicated to the destination address. The fifth bit is dedicated to the QoS type. The next four bits give the number of flits per packet, the following three bits are dedicated to the packet priority, and the rest of the flit is an extension. Each body flit carries a 32-bit payload. The flit size can be changed according to the application requirements.



Fig. 3. Packet format.
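The header layout described above can be sketched as a pair of pack and unpack helpers. The field widths follow the text (4 + 1 + 4 + 3 bits, leaving 20 bits of extension). Placing the "first" bits at the most significant end of the word is an assumption, and the field and function names are illustrative.

```python
FLIT_BITS = 32

# (name, width) in the order given in the text; the extension fills the rest.
FIELDS = [("dest", 4), ("qos", 1), ("n_flits", 4), ("priority", 3), ("ext", 20)]

def pack_header(**values):
    """Build a 32-bit header flit; fields that are not given default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value  # assumed: first field is most significant
    return word

def unpack_header(word):
    """Split a 32-bit header flit back into its fields."""
    fields = {}
    shift = FLIT_BITS
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields

if __name__ == "__main__":
    header = pack_header(dest=0b0110, qos=1, n_flits=5, priority=3)
    print(f"{header:#010x}", unpack_header(header))
```

Keeping the layout in a single table makes it straightforward to change the flit size or field widths, in line with the remark that the flit size depends on the application.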

## D. Routing

We adopted a minimal, simple-to-implement routing algorithm for our design. Unlike routing algorithms that rely on virtual channels, it avoids deadlock and livelock without the hardware complexity overhead that virtual channels cause. Each input port has its own routing block where the routing calculation is performed. It compares the current router address R (i, j) with the destination address of the flit to define the destination port:

- If the destination address equals the R (i, j) address, then the output port will be the local port.
- If the destination address equals the R (i, j+1) address, then the output port will be north.
- If the destination address equals the R (i+1, j) address, then the output port will be east.
- If the destination address equals the R (i, j-1) address, then the output port will be south.
- If the destination address equals the R (i-1, j) address, then the output port will be west.
- If the destination address equals the R (i+2, j) address, then the output port will be south/west.
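The six comparisons listed above can be transcribed almost literally into a lookup, as in the following sketch. Addresses are modelled as (i, j) tuples, which is an assumption about the encoding, and the function name is illustrative. Destinations that match none of the listed rules return `None`, since the text does not describe that case.

```python
def output_port(current, dest):
    """Map a destination address to an output port for router R(i, j).

    Only the six cases listed in the text are covered.
    """
    i, j = current
    rules = {
        (i, j): "local",
        (i, j + 1): "north",
        (i + 1, j): "east",
        (i, j - 1): "south",
        (i - 1, j): "west",
        (i + 2, j): "south/west",  # port name as given in the text
    }
    return rules.get(dest)  # None when no listed rule applies

if __name__ == "__main__":
    here = (1, 1)
    for dest in [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1), (3, 1)]:
        print(dest, "->", output_port(here, dest))
```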

## E. Switch allocation

While the routing calculation is performed, the switch allocator receives signals from the input ports about the priority of the packets and the requested output port. The switch allocator is composed of a priority scheduler block and a round-robin arbiter. This scheme grants the selected output port to the packet with the highest priority while serving packets fairly.
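One plausible way to combine a priority scheduler with a round-robin arbiter is sketched below: for each output port the highest-priority request wins, and ties are broken by a rotating pointer. This is an illustration under stated assumptions rather than the exact circuit. A larger number is taken to mean a higher priority, and the class and method names are hypothetical.

```python
class SwitchAllocator:
    """Priority scheduler with round-robin tie-breaking (illustrative sketch)."""

    def __init__(self, n_inputs=6):
        self.n_inputs = n_inputs
        self.next_rr = {}  # output port -> input index favoured on the next tie

    def grant(self, requests):
        """requests: {input_index: (output_port, priority)}.

        Returns {output_port: granted_input_index}.
        """
        by_output = {}
        for inp, (out, prio) in requests.items():
            by_output.setdefault(out, []).append((inp, prio))

        grants = {}
        for out, candidates in by_output.items():
            top = max(prio for _, prio in candidates)
            tied = [inp for inp, prio in candidates if prio == top]
            start = self.next_rr.get(out, 0)
            # Round-robin: first tied input at or after the pointer, wrapping.
            winner = min(tied, key=lambda inp: (inp - start) % self.n_inputs)
            grants[out] = winner
            self.next_rr[out] = (winner + 1) % self.n_inputs
        return grants

if __name__ == "__main__":
    alloc = SwitchAllocator()
    reqs = {0: ("east", 3), 1: ("east", 3), 2: ("east", 1)}
    print(alloc.grant(reqs))  # inputs 0 and 1 tie on priority
    print(alloc.grant(reqs))  # the pointer has moved past the previous winner
```

Advancing the pointer past the last winner is what keeps equal-priority inputs from starving each other.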

## F. Crossbar

As shown in Figure 4, the crossbar is composed of a maximum of six multiplexers. It waits for the notification from the switch allocator concerning the chosen output port. Based on this notification, it forwards flits to the appropriate output port. The number of flits per packet tells the crossbar when the transmission ends, after which the channel can be used for other flits.



Fig. 4. Crossbar circuit.
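The behaviour described for the crossbar, one multiplexer per output port that holds its connection until the announced number of flits has passed, might be modelled as follows. The sketch assumes that the flit count includes the header flit; the class and method names are hypothetical.

```python
class Crossbar:
    """One multiplexer per output port, released after the last flit (sketch)."""

    def __init__(self):
        self.select = {}     # output port -> input port currently connected
        self.remaining = {}  # output port -> flits still to forward

    def is_free(self, out_port):
        return out_port not in self.select

    def connect(self, out_port, in_port, n_flits):
        """Set the multiplexer of out_port, as notified by the switch allocator."""
        if not self.is_free(out_port):
            raise RuntimeError(f"output port {out_port} is busy")
        if n_flits < 1:
            raise ValueError("a packet needs at least one flit")
        self.select[out_port] = in_port
        self.remaining[out_port] = n_flits

    def forward(self, out_port, flits_at_inputs):
        """Pass one flit from the selected input; free the channel at the end."""
        flit = flits_at_inputs[self.select[out_port]]
        self.remaining[out_port] -= 1
        if self.remaining[out_port] == 0:
            del self.select[out_port]
            del self.remaining[out_port]
        return flit

if __name__ == "__main__":
    xbar = Crossbar()
    xbar.connect("east", "west", n_flits=2)
    print(hex(xbar.forward("east", {"west": 0x6AB00000})), xbar.is_free("east"))
    print(hex(xbar.forward("east", {"west": 0xDEADBEEF})), xbar.is_free("east"))
```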

## IV. EXPERIMENTAL RESULTS

This section gives the synthesis results, the performance analysis and the evaluation of the different router implementations. The proposed routers were described in VHDL at the register transfer level (RTL). They were simulated and synthesized using the Xilinx ISE 13.1 tool. They were then implemented on two different FPGAs, Virtex5 and Virtex6, as shown in Table I. Table II gives the simulation parameters of the proposed router.

Table III summarizes the results of the different router implementations in terms of estimated peak performance, area, power consumption and maximal clock frequency. 4-Port routers are positioned at the corners of the network. They are similar to the R (0, 0) router of Figure 2 and are connected to three neighboring routers. 5-Port routers are positioned at the edges of the network. They are similar to the R (0, 1) router of Figure 2 and are connected to four neighboring routers. 6-Port routers are positioned at the center of the network. They are similar to the R (1, 1) router of Figure 2 and are connected to five neighboring routers. Table III shows that the area occupation and the power consumption of the router rise when the number of ports per router increases. The 6-Port router requires more hardware resources and consumes more power than the 5-Port and 4-Port routers. This increase in area is caused by the added logic and the growth in router complexity. The maximum clock frequency is 258 MHz. It decreases as the number of ports per router increases, because the arbitration scheduling grows. The operating frequency of the router allows us to calculate the estimated peak performance (PP). The PP depends on the maximal operating frequency ($F_{max}$) of the router, the number of clock cycles (T) needed to transmit one flit and the flit size:

$$PP_{\text{per port}} = \frac{F_{max}}{T} \times \text{flit size}$$
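As a small worked example of this formula, the helper below converts a clock frequency, a number of cycles per flit and a flit size into a per-port peak rate. The input values in the example (250 MHz, T = 2 cycles) are purely illustrative and are not taken from the tables, since the value of T used for the reported results is not stated; the function name is hypothetical.

```python
def peak_performance_gbps(f_max_mhz, cycles_per_flit, flit_bits):
    """PP per port = (F_max / T) * flit size, returned in Gbit/s."""
    flits_per_second = f_max_mhz * 1e6 / cycles_per_flit
    return flits_per_second * flit_bits / 1e9

if __name__ == "__main__":
    # Illustrative inputs only: 250 MHz, 2 cycles per flit, 32-bit flits.
    print(peak_performance_gbps(250, 2, 32), "Gbit/s per port")
```

For a fixed flit size the result scales linearly with the clock frequency, which is why the routers with fewer ports, and therefore higher frequency, also show the higher estimated peak performance.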

TABLE I. PROPOSED ROUTER RESULTS ON THE SAME FPGAS AS OTHER WORKS

| FPGA                                          | Virtex5              | Virtex6              |
|-----------------------------------------------|----------------------|----------------------|
| Topology                                      | Hierarchical<br>Mesh | Hierarchical<br>Mesh |
| Number of ports                               | 6                    | 6                    |
| Routing algorithm                             | Minimal routing      | Minimal routing      |
| Frequency (MHz)                               | 239                  | 227                  |
| Area (Slice)                                  | 1690                 | 1308                 |
| Power estimation (mW)                         | 20                   | 18                   |
| Estimated Peak Performance per port (Gbits/s) | 77                   | 72.6                 |

TABLE II. SIMULATION PARAMETERS.

| Router Parameters | 2D router           |
|-------------------|---------------------|
| Buffer Depth      | 4                   |
| Flit size (bit)   | 32                  |
| Switching         | wormhole            |
| Flow control      | Acknowledgment      |
| Arbiter           | Priority scheduler  |
| Routing           | Minimal routing     |
| Target device     | Virtex5 and Virtex6 |

TABLE III. RESULTS OF THE NOC ROUTER'S IMPLEMENTATION ON FPGA.

| Design                                              | 4-Port Router | 5-Port Router | 6-Port Router |
|-----------------------------------------------------|---------------|---------------|---------------|
| Frequency<br>(MHz)                                  | 260           | 252           | 239           |
| Area (Slice)                                        | 735           | 1232          | 1695          |
| Power estimation (mW)                               | 9             | 14            | 20            |
| Estimated Peak<br>Performance per<br>port (Gbits/s) | 82            | 80            | 77            |

This research aims to deliver a comparative study of different NoC topologies and to help designers choose their NoC architecture carefully. In [11], the authors describe a router implementation on Virtex2 and Virtex5 FPGAs. Their design is flexible and extensible. It supports the diagonal mesh topology, adopts a deterministic routing algorithm, and uses packet switching and a dynamic arbiter. In [12], the authors present a 2D mesh topology NoC based on a virtual router. Their architecture comes in two versions: one for reducing the resource cost and the other for reducing the latency of the network. In [13], we describe a router design for a 3D NoC topology. We used the turn-model negative-first routing algorithm to avoid deadlocks. Table IV lists the results of these works for comparison with the proposal. The results are presented in terms of area, maximal clock frequency and estimated peak performance. Compared to [12, 13], the proposal achieves a higher maximal clock frequency, a smaller area and a higher estimated peak performance. In comparison with [11], the proposal achieves a higher maximal clock frequency and estimated peak performance. The proposal, though, shows an area overhead (1690 versus 989 slices on Virtex5) caused by the additional hardware resources in use.

TABLE IV. STATE OF THE ART OF ROUTERS IN FPGA.

| Design                                        | [11]                  | [12]          | [13]                          |
|-----------------------------------------------|-----------------------|---------------|-------------------------------|
| Topology                                      | Diagonal<br>Mesh      | Mesh          | Mesh 3D                       |
| Number of ports                               | -                     | 5             | 7                             |
| Routing algorithm                             | Deterministic routing | XY<br>routing | Negative-<br>first<br>routing |
| Frequency<br>(MHz)                            | 200                   | 23            | 195                           |
| Area (Slice)                                  | 989                   | 25821         | 7847                          |
| Power estimation (mW)                         | 33                    | -             | 1                             |
| Estimated Peak Performance per port (Gbits/s) | 8.44                  | -             | 62.4                          |
| FPGA device                                   | Virtex5               | Virtex6       | Virtex6                       |
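The area comparison on Virtex6 can be checked with a few lines of arithmetic. The slice counts below are copied from Tables I and IV; the variable and function names are illustrative.

```python
# Slice counts on Virtex6: proposal from Table I, other works from Table IV.
PROPOSAL_V6_SLICES = 1308
OTHERS_V6_SLICES = {"[12] 2D mesh": 25821, "[13] 3D mesh": 7847}

def area_ratio(other_slices, own_slices=PROPOSAL_V6_SLICES):
    """How many times larger the other design is than the proposal."""
    return other_slices / own_slices

if __name__ == "__main__":
    for name, slices in OTHERS_V6_SLICES.items():
        print(f"{name}: {area_ratio(slices):.1f}x the proposal's area")
```

The ratios come out at roughly 19.7 and 6.0, which agrees with the factors of about 19 and 6 quoted in the conclusion.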

## V. CONCLUSION

We proposed a router architecture for the hierarchical mesh NoC topology. The router pipeline stages, namely routing calculation, switch allocation and crossbar traversal, were presented in detail. To evaluate the performance of the proposal, we compared it with other works in terms of maximal clock frequency, area, estimated power consumption and estimated peak performance. The evaluation results show that the proposal reaches a higher clock frequency than the three other works. In terms of hardware cost, the proposal is about 19 and 6 times smaller than the routers that address the 2D and 3D mesh topologies, respectively.

#### REFERENCES

- [1] Salah, Y., & Tourki, R. (2011, December). Design and fpga implementation of a qos router for networks-on-chip. In 2011 3rd International Conference on Next Generation Networks and Services (NGNS) (pp. 84-89). IEEE.
- [2] Attia, B., Chouchene, W., Zitouni, A., Abid, N., & Tourki, R. (2011, March). A modular router architecture desgin for Network on Chip. In Systems, Signals and Devices (SSD), 2011 8th International Multi-Conference on (pp. 1-6). IEEE.
- [3] Elhaji, M., Boulet, P., Zitouni, A., Meftali, S., Dekeyser, J. L., & Tourki, R. (2012). System level modeling methodology of NoC design from UML-MARTE to VHDL. Design Automation for Embedded Systems, 16(4), 161-187.
- [4] Chemli, B., & Zitouni, A. (2016, November). Design and Evaluation of Optimized router pipeline stages for Network on Chip. In Image Processing, Applications and Systems Conference (IPAS), 2016 Second International. IEEE.
- [5] Ahmed, A. B., & Abdallah, A. B. (2012, August). ONoC-SPL Customized Network-on-Chip (NoC) Architecture and Prototyping for Data-intensive Computation Applications. In Proceedings of the 4th International Conference on Awareness Science and Technology, Seoul, Korea (Vol. 2124, p. 257262).
- [6] Salah, Y., Atri, M., & Tourki, R. (2007, December). Design of a 2d mesh-torus router for network on chip. In 2007 IEEE International Symposium on Signal Processing and Information Technology (pp. 626-631). IEEE.
- [7] Wissem, C., Attia, B., Noureddine, A., Zitouni, A., & Tourki, R. (2011, December). A quality of service network on chip based on a new priority arbitration mechanism. In ICM 2011 Proceeding (pp. 1-6). IEEE.
- [8] Asghari, S. A., Pedram, H., Khademi, M., & Yaghini, P. (2009). Designing and Implementation of a Network on Chip Router Based on Handshaking Communication Mechanism. World Applied Sciences Journal, 6(1), 88-93.
- [9] Salah, Y., Kaddachi, M. L., & Tourki, R. (2013). FPGA Hardware Implementation and Evaluation of a Micro-Network Architecture for Multi-Core Systems. World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, 7(1), 53-59.
- [10] Chemli, B., & Zitouni, A. (2015, December). Design of a Network on Chip router based on turn model. In Sciences and Techniques of Automatic Control and Computer Engineering (STA), 2015 16th International Conference on (pp. 85-88). IEEE.
- [11] Elhajji, M., Attia, B., Zitouni, A., Tourki, R., Meftali, S., & Dekeyser, J. L. (2011, November). FERONOC: flexible and extensible router implementation for diagonal mesh topology. In Design and Architectures for Signal and Image Processing (DASIP), 2011 Conference on (pp. 1-8). IEEE.
- [12] Chatmen, M. F., Baganne, A., & Tourki, R. (2016). New design of Network on Chip Based on Virtual Routers. Indonesian Journal of Electrical Engineering and Computer Science, 2(1), 115-131.
- [13] Chemli, B., & Zitouni, A. (2014). A Turn Model Based Router Design for 3D Network on Chip. World Applied Sciences Journal, 32(8), 1499-1505.