# VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY FACULTY OF COMPUTER SCIENCE AND ENGINEERING



#### SPECIALIZED PROJECT

# STUDYING AND DEVELOPING NONBLOCKING DISTRIBUTED MPSC QUEUES

Major: Computer Science

THESIS COMMITTEE: 0
MEMBER SECRETARY:

**SUPERVISORS**: THOẠI NAM

DIỆP THANH ĐĂNG

--000---

STUDENTS: ĐỖ NGUYỄN AN HUY - 2110193

HCMC, 03/2025



#### **Disclaimers**

I affirm that this specialized project is the product of my original research and experimentation. Any references, resources, results which this project is based on or a derivative work of have been given due citations and properly listed in the footnotes and the references section. All original contents presented are the culmination of my dedication and perserverance under the close guidance of my supervisors, Mr. Thoại Nam and Mr. Diệp Thanh Đăng, from the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology. I take full responsibility for the accuracy and authenticity of this document. Any misinformation, copyright infrigment or plagiarism shall be faced with serious punishment.



This thesis is the culmination of joint efforts coming from not only myself, but also my professors, my family, my friends and other teachers of Ho Chi Minh University of Technology.

I want to first acknowledge my university, Ho Chi Minh University of Tecnology. Throughout my four years of pursuing education here, I have built a strong theoretical foundation and earned various practical experiences. These all lend themselves well to the completion of this thesis. Especially, I want to extend my gratitude towards my supervisors, Mr. Thoại Nam and Mr. Diệp Thanh Đăng, who have acted as constant counselors and advisors right from the project's inception throughout. They have provided guidance on the project's direction and laid the academic basis for this project, upon which my work is essentially built upon. Furthermore, they inspired me to work hard and push through all the technical obstacles. Without them, this project wouldn't have reached this point of creation.

I also want to give my family the sincerest thanks for their emotional and financial support, without which I couldn't have whole-heartedly followed my research till the end.

Last but not least important, I want to thank my closest friends for their informal but everconstant check-ups to make sure I didn't miss the timeline for this specialized project, which I usually don't have the mental capacity for.



# **Contents**

| Chapter I Introduction                                    | 8  |
|-----------------------------------------------------------|----|
| 1.1 Motivation                                            | 8  |
| 1.2 Objective                                             | 9  |
| 1.3 Scope                                                 | 9  |
| 1.4 Structure                                             | 9  |
| Chapter II Background                                     | 11 |
| 2.1 Irregular applications                                | 11 |
| 2.2 Multiple-producer, single-consumer (MPSC)             | 11 |
| 2.3 Progress guarantee                                    | 11 |
| 2.3.1 Lock-free algorithms                                | 12 |
| 2.3.2 Wait-free algorithms                                | 12 |
| 2.4 Correctness - Linearizability                         | 12 |
| 2.5 Common issues when designing lock-free algorithms     |    |
| 2.5.1 ABA problem                                         | 13 |
| 2.5.2 Safe memory reclamation problem                     | 14 |
| 2.6 MPI-3                                                 |    |
| 2.6.1 MPI-3 RMA                                           | 15 |
| 2.6.2 MPI-RMA communication operations                    | 15 |
| 2.6.3 MPI-RMA synchronization                             | 15 |
| 2.7 Pure MPI approach of porting shared memory algorithms |    |
| Chapter III Related works                                 | 19 |
| Chapter IV Distributed queues                             | 21 |
| Chapter V Theoretical aspects                             | 22 |
| Chapter VI Preliminary results                            | 23 |
| Chapter VII Conclusion & Future works                     |    |
| References                                                | 25 |

# **List of Listings**

| Listing 1 | An example snippet showcasing our synchronization approach in MPI |    |
|-----------|-------------------------------------------------------------------|----|
|           | PMD                                                               | 18 |





# **List of Tables**

| Table 1 | Specification of MPI_Win_lock_all and MPI_Win_unlock_all                       |
|---------|--------------------------------------------------------------------------------|
| Table 2 | Characteristic summary of existing shared memory MPSCs. The cell marked        |
|         | with (*) indicates that our evaluation contradicts with the author's claims 19 |



# **List of Images**

| Figure 1 | Linerization points of method 1, method 2, method 3, method 4 happens at                |  |  |
|----------|-----------------------------------------------------------------------------------------|--|--|
|          | $t_1 < t_2 < t_3 < t_4$ , therefore, their effects will be observed in this order as if |  |  |
|          | we call method 1, method 2, method 3, method 4 sequentially                             |  |  |
| Figure 2 | An illustration of passive target communication. Dashed arrows represent                |  |  |
|          | synchronization (source: [1]) 16                                                        |  |  |
| Figure 3 | An illustration of our synchronization approach in MPI RMA 18                           |  |  |



The demand for computation power has always been increasing relentlessly. Increasingly complex computation problems arise and accordingly more computation power is required to solve them. Much engineering efforts have been put forth towards obtaining more computation power. A popular topic in this regard is distributed computing: The combined power of clusters of commodity hardware can surpass that of a single powerful machine. To fully take advantage of the potential of distributed computing, specialized algorithms and data structures need to be devised.

Noticeably, multi-producer single-consumer (MPSC) is one of those data structures that are utilized heavily in distributed computing, forming the backbone of many applications. For example, MPSC is the core data structure used in the ISx application, a scalable integer sort application [2]. Consequently, an MPSC can easily present a performance bottleneck if not designed properly. A desirable distributed MPSC should be able to exploit the highly concurrent nature of distributed computing. Currently, in the literature, most distributed data structures are designed from the ground up, completely disregarding the any existing data structures developed in the shared memory area, e.g. [3]. This is partly due to the historical differences between the programming models utilized in these two areas. However, since the introduction of specialized networking hardware RDMA and the improved support of the remote memory access (RMA) programming model in MPI-3, this gap has been bridged. Thus, it has opened up a lot new research ([4]) on reusing the principles in the shared memory literature to distributed computing. One favorable characteristic of concurrent data structures that has been heavily researched in the shared memory literature, which is also equally important in distributed computing, is the property of non-blocking, or in particular, lock-freedom. Lock-freedom guarantees that if some processes suspend or die, other processes can still complete. This provides both progress guarantee and fault-tolerance, especially in distributed computing where nodes can fail any time. Thus, the rest of this document concerns itself with investigating and devising efficient non-blocking distributed MPSCs. Interestingly, we choose to adapt current MPSC algorithms in the shared-memory literature to distributed context, which enables a wealth of accumulated knowledge in this literature.

#### 1.1 Motivation

Lock-free MPSC and other FIFO variants, such as multi-producer multi-consumer (MPMC), concurrent single-producer single-consumer (SPSC), are heavily studied in the shared memory literature, dating back from the 1980s-1990s [5], [6], [7] and more recently [8], [9]. It comes as no surprise that algorithms in this domain are highly developed and optimized for performance and scalability. However, most research about MPSC or FIFO algorithms in general completely disregard the available state-of-the-art algorithms in the shared memory literature. With the new RDMA networking hardware support and capabilities added to MPI-3 RMA API: lock-free shared-memory algorithms can be straightforwardly ported to distributed context using this programming model.



This presents an opportunity to make use of the highly accumulated research in the shared memory literature, which if adapted and mapped properly to the distributed context, may produce comparable results to algorithms exclusively devised within the distributed computing domain. Therefore, we decide to take this novel route to developing new nonblocking MPSC algorithms: Port and adapt potential lock-free shared-memory MSPCs to distributed context using the MPI-3 RMA programming model. If this approach proves to be effective, a huge intellectual reuse of shared-memory MSPC algorithms into the distributed domain is possible. Consequently, there may be no need to develop distributed MPSC algorithms from the ground up.

### 1.2 Objective

This thesis aims to:

- Investigate state-of-the-art shared-memory MPSCs.
- Select and appropriately modify potential MPSC algorithms so they can be implemented in popular distributed programming environments.
- Port MPSC algorithms using MPI-3 RMA.
- Evaluate various theoretical aspects of ported MPSC algorithms: Correctness, progress guarantee, time complexity analysis.
- · Benchmark the ported MPSC algorithms and compare them with current distributed MPSCs in the literature.
- Discover distributed-environment-specific optimization opportunities for ported MPSC algorithms.

### 1.3 Scope

- For related works on shared-memory MPSCs, we only focus on linearizable MPSCs that support at least lock-free enqueue and dequeue operations.
- · Any implementation details, benchmarking and optimizations assume MPI-3 settings.
- For optimizations, we focus on performance-related metrics, e.g. time-complexity (theoretically), throughput (empirically).

#### 1.4 Structure

The rest of this report is structured as follows:

Chapter II discusses the theoretical foundation this thesis is based on and the technical terminology that's heavily utilized in this domain. As mentioned, this thesis investigates state-of-the-art shared-memory MPSCs. Therefore, we discuss the theory related to the design of concurrent algorithms such as lock-freedom and linearizability, the practical challenges such as the ABA problem and safe memory reclamation problem. We then explore the utilities offered by C++11 to implement concurrent algorithms and MPI-3 to port shared memory algorithms.

Chapter III surveys the shared-memory literature for state-of-the-art queue algorithms, specifically MPSC and SPSC algorithms (as SPSC can be modified to implement MPSC). We specifically focus on algorithms that have the potential to be ported efficiently to distributed context, such as NUMA-aware or can be made to be NUMA-aware. We then conclude with a comparison of the most potential shared-memory queue algorithms.

Chapter IV documents distributed-versions of potential shared-memory MPSC algorithms surveys in Chapter III. It specifically presents our adaptation efforts of existing algorithms in the shared-memory literature to make their distributed implementations feasible.

Chapter V discusses various interesting theoretical aspects of our distributed MPSC algorithms in Chapter IV, specifically correctness (linearizability), progress guarantee (lock-freedom and wait-freedom), performance model. Our analysis of performance model helps back our empirical findings in Chapter VI, together, they work hand-in-hand to help us discover optimization opportunities.

Chapter VI introduces our benchmarking setup, including metrics, environments, benchmark/microbenchmark suites and conducting methods. We aim to demonstrate some preliminary results on how well ported shared-memory MPSCs can compare to existing distributed MPSCs. Finally, we discuss important factors that affect the runtime properties distributed MPSC algorithm, which have partly been explained by our theoretical analysis in Chapter V.

Chapter VII concludes what we have accomplished in this thesis and considers future possible improvements to our research.



#### 2.1 Irregular applications

Irregular applications are a class of programs particularly interesting in distributed computing. They are characterized by:

- Unpredictable memory access: Before the program is actually run, we cannot know which data it will need to access. We can only know that at run time.
- Data-dependent control flow: The decision of what to do next (such as which data tp
  accessed next) is highly dependent on the values of the data already accessed. Hence
  the unpredictable memory access property because we cannot statically analyze the
  program to know which data it will access. The control flow is inherently engraved
  in the data, which is not known until runtime.

Irregular applications are interesting because they demand special treatments to achieve high performance. One specific challenge is that this type of applications is hard to model in traditional MPI APIs. The introduction of MPI RMA (remote memory access) in MPI-2 and its improvement in MPI-3 has significantly improved MPI's capability to express irregular applications comfortably.

#### 2.2 Multiple-producer, single-consumer (MPSC)

Multiple-producer, single-consumer (MPSC) is a specialized concurrent first-in first-out (FIFO) data structure. A FIFO is a container data structure where items can be inserted into or taken out of, with the constraint that the items that are inserted earlier are taken out of earlier. Hence, it's also known as the queue data structure. The process that performs item insertion into the FIFO is called the producer and the process that performs items deletion (and retrieval) is called the consumer. In concurrent queues, multiple producers and consumers can run in parallel. Concurrent queues have many important applications, namely event handling, scheduling, etc. One class of concurrent FIFOs is MPSC, where one consumer may run in parallel with multiple producers. The reasons we're interested in MPSCs instead of the more general multiple-producer, multiple-consumer data structures (MPMCs) are that (1) high-performance and high-scalability MPSCs are much simpler to design than MPMCs while (2) MPSCs are noticeably as powerful as MPMCs - its consensus number equals the number of producers [10]. Thus, MPSCs can see as many use cases as MPMCs while being easily scalable and performant.

### 2.3 Progress guarantee

Many concurrent algorithms are based on locks to create mutual exclusion, in which only some processes that have acquired the locks are able to act, while the others have to wait. While lock-based algorithms are simple to read, write and verify, these algorithms are said to be blocking: One slow process may slow down the other faster processes, for example, if the slow process successfully acquires a lock and then the operating system (OS) decides to suspends it to schedule another one, this means until the process



is awaken, the other processes that contend for the lock cannot continue. Lock-based algorithms introduces many problems such as:

- Deadlock: There's a circular lock-wait dependencies among the processes, effectively prevent any processes from making progress.
- Convoy effect: One long process holding the lock will block other shorter processes contending for the lock.
- Priority inversion: A higher-priority process effectively has very low priority because it has to wait for another low priority process.

Furthermore, if a process that holds the lock dies, this will halt the whole program. This consideration holds even more weight in distributed computing because of a lot more failure modes, such as network failures, node falures, etc.

Therefore, while lock-based algorithms are easy to write, they do not provide **progress** guarantee because deadlock or livelock can occur and its use of mutual exclusion is unnecessarily restrictive. These algorithms are said to be **blocking**. An algorithm is said to be **non-blocking** if a failure or slow-down in one process cannot cause the failure or slow-down in another process. Lock-free and wait-free algorithms are to especially interesting subclasses of non-blocking algorithms. Unlike lock-based algorithms, they provide progress guarantee.

#### 2.3.1 Lock-free algorithms

Lock-free algorithms provide the following guarantee: Even if some processes are suspended, the remaining processes are ensured to make global progress and complete in bounded time. This property is invaluable in distributed computing, one dead or suspended process will not block the whole program, providing fault-tolerance. Designing lock-free algorithms requires careful use of atomic instructions, such as Fetch-and-add (FAA), Compare-and-swap (CAS), etc.

#### 2.3.2 Wait-free algorithms

Wait-freedom is a stronger progress guarantee than lock-freedom. While lock-freedom ensures that at least one of the alive processes will make progress, wait-freedom guarantees that any alive processes will finish in bounded time. Wait-freedom is useful to have because it prevents starvation. Lock-freedom still allows the possibility of one process having to wait for another indefinitely, as long as some still makes progress.

# 2.4 Correctness - Linearizability

Correctness of concurrent algorithms is hard to defined, especially when it comes to the semantics of concurrent data structures like MPSC. One effort to formalize the correctness of concurrent data structures is the definition of linearizability. A method call on the FIFO can be visualized as an interval spanning two points in time. The starting point is called the **invocation event** and the ending point is called the **response event**. **Linearizability** informally states that each method call should appear to take effect instantaneously at some moment between its invocation event and response event [11].



The moment the method call takes effect is termed the **linearization point**. Specifically, suppose the followings:

- We have n concurrent method calls  $m_1, m_2, ..., m_n$ .
- Each method call  $m_i$  starts with the **invocation event** happening at timestamp  $s_i$  and ends with the **response event** happening at timestamp  $e_i$ . We have  $s_i < e_i$  for all  $1 \le i \le n$ .
- Each method call  $m_i$  has the **linearization point** happening at timestamp  $l_i$ , so that  $s_i \leq l_i \leq e_i$ .

Then, linerizability means that if we have  $l_1 < l_2 < ... < l_n$ , the effect of these n concurrent method calls  $m_1, m_2, ..., m_n$  must be equivalent to calling  $m_1, m_2, ..., m_n$  sequentially, one after the other in that order.



Figure 1: Linerization points of method 1, method 2, method 3, method 4 happens at  $t_1 < t_2 < t_3 < t_4$ , therefore, their effects will be observed in this order as if we call method 1, method 2, method 3, method 4 sequentially

### 2.5 Common issues when designing lock-free algorithms

#### 2.5.1 ABA problem

In implementing concurrent lock-free algorithms, hardware atomic instructions are utilized to achieve linearizability. The most popular atomic operation instruction is compare-and-swap (CAS). The reason for its popularity is (1) CAS is a **universal atomic instruction** - it has the **concensus number** of  $\infty$  - which means it's the most powerful atomic instruction [12] (2) CAS is implemented in most hardware (3) some concurrent lock-free data structures such as MPSC can only be implemented using powerful atomic instructions such as CAS. The semantic of CAS is as follows. Given the instruction CAS(memory location, old value, new value), atomically compares the value at memory location to see if it equals old value; if so, sets the value at memory location to



new value and returns true; otherwise, leaves the value at memory location unchanged and returns false. Concurrent algorithms often utilize CAS as follows:

- 1. Read the current value old value = read(memory location).
- 2. Compute new value from old value by manipulating some resources associated with old value and allocating new resources for new value.
- 3. Call CAS(memory location, old value, new value). If that succeeds, the new resources for new value remain valid because it was computed using valid resources associated with old value, which has not been modified since the last read. Otherwise, free up new value because old value is no longer there, so its associated resources are not valid.

This scheme is susceptible to the notorious ABA problem:

- 1. Process 1 reads the current value of memory location and reads out A.
- 2. Process 1 manipulates resources associated with A, and allocates resources based on these resources.
- 3. Process 1 suspends.
- 4. Process 2 reads the current value of memory location and reads out A.
- 5. Process 2 CAS(memory location, A, B) so that resources associated with A are no longer valid.
- 6. Process 3 CAS(memory location, B, A) and allocates new resources associated with A.
- 7. Process 1 continues and CAS(memory location, A, new value) relying on the fact that the old resources associated with A are still valid while in fact they aren't.

To safe-guard against ABA problem, one must ensure that between the time a process reads out a value from a shared memory location and the time it calls CAS on that location, there's no possibility another process has CAS the memory location to the same value. Some notable schemes are **monotonic version tag** ([7]) and **hazard pointer** ([13]).

#### 2.5.2 Safe memory reclamation problem

The problem of safe memory reclamation often arises in concurrent algorithms that dynamically allocate memory. In such algorithms, dynamically-allocated memory must be freed at some point. However, there's a good chance that while a process is freeing memory, other processes contending for the same memory are keeping a reference to that memory. Therefore, deallocated memory can potentially be accessed, which is erroneneous. Solutions ensure that memory is only freed when no other processes are holding references to it. In garbage-collected programming environments, this problem can be conveniently push to the garbage collector. In non-garbage-collected programming environments, however, custom schemes must be utilized. Examples include using a reference counter to count the number of processes holding a reference to some memory and hazard pointer [13] to announce to other processes that some memory is not to be freed.

#### 2.6 MPI-3

MPI stands for message passing interface, which is a **message-passing library interface specification**. Design goals of MPI includes high availability across platforms, efficient communication, thread-safety, reliable and convenient communication interface while still allowing hardware-specific accelerated mechanisms to be exploited [1].

#### 2.6.1 MPI-3 RMA

RMA in MPI RMA stands for remote memory access. As introduced in the first section of Section Chapter II, RMA APIs is introduced in MPI-2 and its capabilities are further extended in MPI-3 to conveniently express irregular applications. In general, RMA is intended to support applications with dynamically changing data access patterns where the data distribution is fixed or slowly changing [1]. In such applications, one process, based on the data it needs, knowing the data distribution, can compute the nodes where the data is stored. However, because data acess pattern is not known, each process cannot know whether any other processes will access its data.

Using the traditional Send/Receive interface, both sides need to issue matching operations by distributing appropriate transfer parameters. This is not suitable, as previously explain, only the side that needs to access the data knows all the transfer parameters while the side that stores the data cannot anticipate this.

#### 2.6.2 MPI-RMA communication operations

RMA only requires one side to specify all the transfer parameters and thus only that side to participate in data communication.

To utilize MPI RMA, each process needs to open a memory window to expose a segment of its memory to RMA communication operations such as remote writes (MPI\_PUT), remote reads (MPI\_GET) or remote accumulates (MPI\_ACCUMULATE, MPI\_GET\_ACCUMULATE, MPI\_FETCH\_AND\_OP, MPI\_COMPARE\_AND\_SWAP) [1]. These remote communication operations only requires one side to specify.

#### 2.6.3 MPI-RMA synchronization

Besides communication of data from the sender to the receiver, one also needs to synchronize the sender with the receiver. That is, there must be a mechanism to ensure the completion of RMA communication calls or that any remote operations have taken effect. For this purpose, MPI RMA provides **active target synchronization** and **passive target synchronization**. In this document, we're particularly interested in **passive target synchronization** as this mode of synchronization does not require the target process of an RMA operation to explicitly issue a matching synchronization call with the origin process, easing the expression of irregular applications [14].

In **passive target synchronization**, any RMA communication calls must be within a pair of MPI\_Win\_lock/MPI\_Win\_unlock or MPI\_Win\_lock\_all/MPI\_Win\_unlock\_all. After the unlock call, those RMA communication calls are guaranteed to have taken effect.



One can also force the completion of those RMA communication calls without the need for the call to unlock using flush calls such as MPI\_Win\_flush or MPI\_Win\_flush\_local.



Figure 2: An illustration of passive target communication. Dashed arrows represent synchronization (source: [1])

# 2.7 Pure MPI approach of porting shared memory algorithms

In pure MPI, we use MPI exclusively for communication and synchronization. With MPI RMA, the communication calls that we utilize are:

• Remote read: MPI\_Get

• Remote write: MPI\_Put



• Remote accumulation: MPI\_Accumulate, MPI\_Get\_accumulate, MPI\_Fetch\_and\_op and MPI\_Compare\_and\_swap.

For lock-free synchronization, we choose to use **passive target synchronization** with MPI\_Win\_lock\_all/MPI\_Win\_unlock\_all.

In the MPI-3 specification [1], these functions are specified as follows:

| Operation          | Usage                                                    |
|--------------------|----------------------------------------------------------|
| MPI_Win_lock_all   | Starts and RMA access epoch to all processes in a memory |
|                    | window, with a lock type of MPI_LOCK_SHARED. The calling |
|                    | process can access the window memory on all processes in |
|                    | the memory window using RMA operations. This routine is  |
|                    | not collective.                                          |
| MPI_Win_unlock_all | Matches with an MPI_Win_lock_all to unlock a window      |
|                    | previously locked by that MPI_Win_lock_all.              |

Table 1: Specification of MPI\_Win\_lock\_all and MPI\_Win\_unlock\_all

The reason we choose this is 3-fold:

- Unlike **active target synchronization**, **passive target synchronization** does not require the process whose memory is being accessed by an MPI RMA communication call to participate in. This is in line with our intention to use MPI RMA to easily model irregular applications like MPSCs.
- Unlike **active target synchronization**, MPI\_Win\_lock\_all and MPI\_Win\_unlock\_all do not need to wait for a matching synchronization call in the target process, and thus, is not delayed by the target process.
- Unlike **passive target synchronization** with MPI\_Win\_lock/MPI\_Win\_unlock, multiple calls of MPI\_Win\_lock\_all can succeed concurrently, so one process needing to issue MPI RMA communication calls do not block others.

An example of our pure MPI approach with MPI\_Win\_lock\_all/MPI\_Win\_unlock\_all, inspired by [14], is illustrated in the following:



```
MPI_Win_lock_all(0, win);
MPI_Get(...); // Remote get
MPI_Put(...); // Remote put
MPI_Accumulate(..., MPI_REPLACE, ...); // Atomic put
MPI_Get_accumulate(..., MPI_NO_OP, ...); // Atomic get
MPI_Fetch_and_op(...); // Remote fetch-and-op
MPI_Compare_and_swap(...); // Remote compare and swap
MPI_Win_flush(...); // Make previous RMA operations take effects
MPI_Win_flush_local(...); // Make previous RMA operations take
effects locally
MPI_Win_unlock_all(win);
```

Listing 1: An example snippet showcasing our synchronization approach in MPI RMA



Figure 3: An illustration of our synchronization approach in MPI RMA



There exists numerous research into the design of lock-free shared memory MPMCs and SPSCs. Interestingly, research into lock-free MPSCs are noticeably scarce. Although in principle, MPMCs and SPSCs can both be adapted for MPSCs use cases, specialized MPSCs can usually yield much more performance. In reality, we have only found 4 papers that are concerned with direct support of lock-free MPSCs: LTQueue [8], DQueue [10], WRLQueue [15] and Jiffy [9]. Table 2 summarizes the charateristics of these algorithms.

| MPSCs                         | LTQueue    | DQueue     | WR-           | Jiffy     |
|-------------------------------|------------|------------|---------------|-----------|
|                               |            |            | <b>LQueue</b> |           |
| ABA solution                  | Load-link/ | Incorrect  | Custom        | Custom    |
|                               | Store-con- | custom     | scheme        | scheme    |
|                               | ditional   | scheme (*) |               |           |
| Memory reclamation            | Custom     | Incorrect  | Custom        | Custom    |
|                               | scheme     | custom     | scheme        | scheme    |
|                               |            | scheme (*) |               |           |
| Progress guarantee of dequeue | Wait-free  | Wait-free  | Blocking      | Wait-free |
|                               |            |            | (*)           |           |
| Progress guarantee of enqueue | Wait-free  | Wait-free  | Wait-free     | Wait-free |
| Number of elements            | Un-        | Un-        | Un-           | Un-       |
|                               | bounded    | bounded    | bounded       | bounded   |

Table 2: Characteristic summary of existing shared memory MPSCs. The cell marked with (\*) indicates that our evaluation contradicts with the author's claims

LTQueue [8] is the earliest wait-free shared memory MPSC to our knowledge. This algorithm is wait-free with  $O(\log n)$  time complexity for both enqueues and dequeues, with n being the number of enqueuers. Their main idea is to split the MPSC among the enqueuers so that each enqueuer maintains a local SPSC data structure, which is only shared with the dequeuer. This improves the MPSC's scalability as multiple enqueues can complete the same time. The enqueuers shared a distributed counter and use it to label each item in their local SPSC with a specific timestamp. The timestamps are organized into nodes of a min-heap-like tree so that the dequeuer can look at the root of tree to determine which local SPSC to dequeue next. The min-heap property of the tree is preserved by a novel wait-free timestamp-refreshing operation. Memory reclamation becomes trivial as each MPSC entry is only shared by one enqueuer and one dequeuer in the local SPSC. The algorithm avoids ABA problem by utilizing load-link/store-conditional (LL/SC). This, on the other hand, presents a challenge in directly porting LTQueue as LL/SC is not widely available as the more popular CAS instruction.

DQueue [10] focuses on optimizing performance. It aims to be cache-friendly by having each enqueuer batches their updates in a local buffer to decrease cache misses. It also try to replace expensive atomic instructions such as CAS as many as possible. The MPSC



WRLQueue [15] is a lock-free MPSC for embedded real-time system. Its main purpose is to avoid excessive modification of storage space. WRLQueue is simplfy a pair of buffer, one is worked on by multiple enqueuers and the other is work on by the dequeuer. The enqueuers batch their enqueues and write multiple elements onto the buffer once at a time. The dequeuer upon invocation will swap its buffer with the enqueuer's buffers to dequeue from it. However, this requires the dequeuer to wait for all enqueue operations to complete in their buffer. If an enqueue suspends or dies, the dequeuer will have to wait forever, this clearly violates the property of non-blocking.

Jiffy [9] is a fast and memory-efficient wait-free MPSC by avoiding excessive allocation of memory. Like DQueue, Jiffy represents the queue as a linked list of segments. Each enqueue reserves a slot in the segment, extends the linked-list as appropriately, writes the value into the slot and sets a per-slot flag to indicate that the slot is ready to be dequeued. To dequeue, the dequeuer repeatedly scan all the slots to find the first-ready-to-bedequeue slot. Jiffy shows significant good memory usage and throughput compared to other previous state-of-the-art MPMCs.





# **Chapter V Theoretical aspects**

# **Chapter VI Preliminary results**

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim aeque doleamus animo, cum corpore dolemus, fieri.

# **Chapter VII Conclusion & Future works**

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim aeque doleamus animo, cum corpore dolemus, fieri.



# References

- [1] Message Passing Interface Forum, *MPI: A Message-Passing Interface Standard Version* 3.1. 2015. [Online]. Available: <a href="https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report.pdf">https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report.pdf</a>
- [2] U. Hanebutte and J. Hemstad, "ISx: A Scalable Integer Sort for Co-design in the Exascale Era," 2015. doi: 10.1109/PGAS.2015.21.
- [3] B. Brock, A. Buluç, and K. Yelick, "BCL: A Cross-Platform Distributed Data Structures Library," 2019, *Association for Computing Machinery*. doi: 10.1145/3337821.3337912.
- [4] Thanh-Dang Diep, Phuong Hoai Ha, and Karl Fürlinger, "A general approach for supporting nonblocking data structures on distributed-memory systems," 2023. doi: <a href="https://doi.org/10.1016/j.jpdc.2022.11.006">https://doi.org/10.1016/j.jpdc.2022.11.006</a>.
- [5] John D. Valois, "Implementing Lock-Free Queues," 1994.
- [6] L. Lamport, "Specifying Concurrent Program Modules," 1983, *Association for Computing Machinery*. doi: 10.1145/69624.357207.
- [7] Mage M.Michael and Michael L.Scott, "Simple, fast, and practical non-blocking and blocking concurrent queue algorithms," 1996, *Association for Computing Machinery*. doi: 10.1145/248052.248106.
- [8] P. Jayanti and S. Petrovic, "Logarithmic-time single deleter, multiple inserter wait-free queues and stacks," 2005, *Springer-Verlag*. doi: 10.1007/11590156\_33.
- [9] D. Adas and R. Friedman, "A Fast Wait-Free Multi-Producers Single-Consumer Queue," 2022, *Association for Computing Machinery*. doi: 10.1145/3491003.3491004.
- [10] J. Wang, Q. Jin, X. Fu, Y. Li, and P. Shi, "Accelerating Wait-Free Algorithms: Pragmatic Solutions on Cache-Coherent Multicore Architectures," 2019. doi: 10.1109/ACCESS.2019.2920781.
- [11] M. Herlihy and N. Shavit, *The Art of Multiprocessor Programming, Revised Reprint*. Morgan Kaufmann, 2012.
- [12] M. Herlihy, "Wait-free synchronization," 1991, *Association for Computing Machinery*. doi: 10.1145/114005.102808.
- [13] M. M. Michael, "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects," 2004, *IEEE Press.* doi: 10.1109/TPDS.2004.8.
- [14] J. Dinan, P. Balaji, D. Buntinas, D. Goodell, W. Gropp, and R. Thakur, "An implementation and evaluation of the MPI 3.0 one-sided communication interface," 2016, *John Wiley and Sons Ltd.* doi: 10.1002/cpe.3758.



[15] Q. Yang, L. Tang, Y. Guo, N. Kuang, S. Zhong, and H. Luo, "WRLqueue: A Lock-Free Queue For Embedded Real-Time System," 2022. doi: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00197.