

# ChipSort: a SIMD and cache-aware sorting module

## Nicolau Leal Werneck

## TomTom NV

#### ABSTRACT

Sorting is a fundamental programming problem with many important applications. While there are well-known sorting algorithms for conventional computers, special techniques are required to take the most out of real hardware offering features such as cache memory and instruction-level parallelism. ChipSort.jl is a Julia module for SIMD and cache-aware sorting. It implements sorting networks and bitonic merge networks with SIMD instructions, with configurable vector sizes. It also implements Combsort, which lends itself easily to vectorization and can achieve good performance depending on the memory access cost. Insertion sort is used to finalize. Large arrays are approached with a multi-way Mergesort. The implementation of ChipSort itself is of interest from a programming languages perspective due to its use of metaprogramming, more specifically Julia's generated functions. This enables custom code generation for different tasks and hardware at run-time, a feat owed to Julia. This article presents the implemented techniques as well as experiments that demonstrate speed gains compared to multiple standard libraries.

## **Keywords**

Julia, Sorting, SIMD, Parallelism, Metaprogramming

## 1. Introduction

Sorting has enjoyed an important place within computer programming topics for a long time. It has great importance to practical applications, being very useful to organize data for fast retrieval, and also attracts attention as a theoretically interesting problem in itself [13]. While sorting is a well-understood problem in general, with a few classic algorithms available, achieving optimal performance for specific problems and architectures may require a careful algorithm selection rather than a simple parameter tuning [4, Part II, Introduction].

Many factors can influence the running time of a computer program, starting with algorithmic complexity and the base clock speed of the processor. The performance observed in real computers has been increasingly depending on processor features that are not taken into account by the simplest models, though. These features include cache memory and parallelism at the processor and instruction levels. Cache memory has become popular in consumer processors in the past couple of decades, and as long as memory access patterns exhibit temporal and spatial locality, it allows programs to attain a higher performance than would be possible with a simpler memory architecture [7, Chapter 3]. Writing software with better support for parallelism has also become necessary to utilize the full potential offered by modern and future processors [24].

The interest in sorting and the great relevance of cache memory and parallelism to modern computing have naturally led to research on sorting algorithms that account for these factors. Processors offering SIMD (single instruction, multiple data) instructions and other forms of instruction-level parallelism started to become widely available in the early 2000s. One example is the introduction of SSE instructions in the x86 architecture [14, Chapter 1]. The current availability of processors with 512 bit registers underscores the necessity of being able to write software that utilizes vector operations, what can also benefit GPU programming.

At the time SIMD CPUs became widely available there was already a trend to employ specialized hardware like GPUs for scientific applications [16, 23]. Although SIMD was used mostly for multimedia applications at first [2, 21], some early alternative uses include 3D graphics [17] and machine learning [22].

Following earlier work on parallel sorting in other architectures, the late 2000s saw the first practical demonstrations of sorting with modern consumer SIMD chips [10, 3]. Most of the techniques utilized then are still employed in more recent works [1, 11].

This article presents ChipSort.jl, a Julia module that implements SIMD and cache-aware sorting utilizing some of the techniques found in the literature. These include sorting and merging networks, the use of Combsort with Insertion sort finalization for medium sized arrays and of multi-way merging for large arrays. These techniques require not only tuning parameters such as vector and array sizes, but also generating specific code when implementing different size networks. ChipSort seeks to generate efficient custom code for each task, relying on Julia metaprogramming capabilities for that.

Section 2 ahead presents the techniques employed in ChipSort. Section 3 presents experimental results assessing the performance of the module and Section 4 brings some concluding remarks.

#### 2. SIMD sorting techniques

ChipSort offers four high-level functions that can be used in applications as replacements for other sorting functions such as sort in the Julia standard library. The chipsort! function does not require any configuration and offers a great performance on a wide range of cases. The remaining functions are suitable for arrays of specific size ranges, and require choosing some numeric parameters. The function chipsort\_large offers a good overview of all the techniques implemented in the module, and this section is structured according to its stages. Some of these techniques may also be useful in other applications instead of sorting, one clear example being the in-place matrix transpose.

The way ChipSort works mostly follows [11], although ChipSort does not support yet registers with key and payload. Another main difference is that Combsort is always employed with an Insertion sort finalization.

The function chipsort\_large achieves sorting by first splitting the input array in chunks that are sorted separately with

chipsort\_medium! and then merged simultaneously. This inner sort is performed in multiple steps:

- (1) sorting small blocks of data with a sorting network;
- (2) reordering vectors in memory according to a matrix transpose;
- (3) vectorized Combsort with a limited number of iterations;
- (4) regular insertion sort until the whole data is sorted.

The remainder of this section details the multiple techniques employed in this process.

## 2.1 Sorting networks

Sorting networks [13, Sec. 5.3.4] are graphs composed by comparator modules, units that take two numbers in, and output the smallest and largest at specific outputs. By properly arranging comparators we can create a procedure to take an array of elements and output them in sequence. The design of a sorting network can minimize the total number of comparisons, or if they can be done in parallel, minimize the number of steps to complete the process.

A sorting network can be represented by a Knuth graph such as Figure 1. Each vertical line is a comparator. The graph represents a network that takes an array of four values  $(A_1,A_2,A_3,A_4)$  and outputs the sorted sequence  $(S_1,S_2,S_3,S_4)$ .



Fig. 1. A sorting network for 4 elements. The dashed lines delimit each step where operations are independent of each other.

In ChipSort, as in the related work, parallelism is mainly exploited by carrying out the comparisons of the sorting network over vectors of size V contained in SIMD registers, effectively running V networks simultaneously. Other forms of instruction-level parallelism can also be in play, though. It may be possible to parallelize instructions such as min and max, or the loading of data into registers. This is often done implicitly by the microarchitecture, and programmers can only make sure their code is suitable for this by properly ordering operations and avoiding branching, for instance. Part of this work is expected to be performed by the compiler, and in particular LLVM in the case of Julia.

ChipSort supports sorting networks of different sizes, currently only powers of 2, which are predefined as data structures in the file sorting-network-parameters.jl. The generated function sort\_net assigns each element of the sequence at each step to a variable, and the values from the next step are calculated according to the network specification. The comparisons are performed by the min and max functions, and the function is generic on the element types. Figure 2 shows one example of generated code.

This function is a good first example of how ChipSort utilizes Julia's metaprogramming features. Multiple implementations of sorting networks are available in the literature [3, 8, 6], however they implement networks of different sizes straight as code. In

```
4 => (((1,2), (3,4)), ((1,3), (2,4)), ((2,3),))
```

```
input_0_1 =
             input[1]
input_0_2
              input[2]
input_0_3
              input[3]
input_0_4
              input[4]
input_1_1
              min(input_0_1, input_0_2)
input_1_2
              max(input_0_1,
                               input_0_2)
input_1_3
              min(input_0_3,
                               input_0_4
input_1_4
              max(input_0_3,
                               input_0_4)
input_2_1
              min(input_1_1,
                               input_1_3)
input_2_3
input_2_2
              max(input_1_1,
                               input_1_3)
              min(input_1_2,
                               input 1 4
input_2_4
              max(input_1_2, input_1_4)
input_3_1
              input_2_1
input_3_4
              input_2_4
input_3_2
              min(input_2_2, input_2_3)
input 3_3 = max(input_2_2, input_2_3)
return (input_3_1, input_3_2, input_3_3, input_3_4)
```

Fig. 2. The parameters from a four-element sorting network consisting of three steps, and the corresponding code generated by sort\_net.

ChipSort they are represented as data structures that guide a generated function to produce equivalent code. While this data structure is currently hard-coded, it could be potentially produced only when necessary by an algorithm such as Bose-Nelson, what remains a future plan for the module.

The sort\_net generated function expects elements to which max and min are defined. It can be called with simple data types, e.g. sort\_net(4,2,5,3) works. In practice ChipSort often calls this function with vectors, completely relying on SIMD.jl [20]. This module maps a vector in Julia similar to tuples or static arrays straight to LLVM vector types, ensuring SIMD code can be produced when possible and also allowing the flexibility of handling logic vectors that are larger than the actual registers, or spilling data to the stack if all registers are occupied.

## 2.2 SIMD vectors transpose

Consider a group of stacked SIMD vectors. Each column in this representation is termed a *lane*. The result from <code>sort\_net</code> with vectors is that each lane contains a sorted sequence. This data must be transposed to produce sorted vectors. This is done in <code>ChipSort</code> through <code>transpose\_vecs</code>, which is a generated function that supports rectangular matrices, but with dimensions constrained to powers of 2. The generated code consists mostly of calls to <code>SIMD.shufflevec</code>. This is another instance of an operation that other projects [1, 6] implement with multiple low-level functions contemplating each different case, while in <code>ChipSort</code> there is only a single Julia generated function that produces equivalent code. Figure 3 displays an example of 16 random integers loaded into

SIMD registers as 4 vectors. After sort\_net each lane contains a sorted sequence, and transpose\_vecs turns the lanes into vectors (registers). The chipsort\_small! function is essentially a sorting network followed by a transpose and a merge procedure.

## 2.3 Bitonic Merge Networks

A bitonic sequence starts either increasing or decreasing and contains at most one change of direction. Bitonic merge networks allow the creation of merge networks of different sizes, and they are



Fig. 3. Four vectors with four values, the result from a vectorized fourelements sorting network, and the transpose.

again implemented in ChipSort with a single generated function, bitonic\_merge. Merging two sorted vectors requires first reversing one of them, and the result is stored in two vectors containing the first and second halves of the complete sorted sequence.

Apart from the use in sorting small sequences and in the multi-way merge tree, bitonic merge networks can also be used to implement a regular merge sort. This technique was investigated in ChipSort only superficially, and while there were no promising results in terms of performance, ChipSort also contains the generated function bitonic\_merge\_interleaved to illustrate how metaprogramming can help implement multiple simultaneous merges with interleaved execution, as done in other projects [3].

#### 2.4 Vectorized Combsort

One of the most peculiar aspects of ChipSort, taken directly from AA-sort [10, 11], is the use of the Combsort algorithm. Combsort essentially generalizes Bubble sort in the same way that Shell sort generalizes Insertion sort [9, 5, 15, 13, 4]. The array is swept multiple times, ordering pairs of values at a given distance from each other. The distance is reduced each time by  $\frac{4}{3}$ . Parallelism is achieved either by parallelizing the operations from a single array or by processing independent arrays of the same size.

ChipSort contains two functions that utilize Combsort, employing these two distinct forms of parallelism. The first is chipsort!, a serial implementation of the algorithm which is vectorized implicitly by the optimizing compiler. chipsort! starts with Combsort until the interval size is 1, when it switches to Insertion sort.

The other function is chipsort\_medium!, an explicitly vectorized implementation utilizing SIMD.jl. This second function also performs a number of operations before and after Combsort. It consists of the following steps:

- (1) Apply a sorting network on K blocks of J vectors of size V.
- (2) Vectorized Combsort until the interval size is 1.
- (3) Transpose blocks.
- (4) Transpose data in-place from K blocks of J vectors into J blocks of K vectors.
- (5) Vectorized Combsort again.
- (6) Sort blocks again.
- (7) Insertion sort.

After the first two steps the result is essentially a matrix where the  $K \times J$  columns are the vectors, and each of the V rows contains an approximately sorted sequence. The transpose steps reshape each row into a new block. Each row is processed independently at first, and the transpose allows elements from distinct groups to interact. The last step is necessary to ensure the whole array is sorted, not viewed anymore as a matrix with independent rows.

In both implementations switching to Insertion sort presents a better performance than sticking to Combsort until the end. Once the interval becomes 1 Combsort essentially becomes Bubble sort, thus finding an alternative seems beneficial. Insertion sort should fare well because the array is now approximately sorted, and any out-of-order elements should be already close to their destinations.

Finishing a sorting method with Insertion sort is a common practice, present at least in Musser's Introsort [18]. This is also akin to the classic method used in the Julia standard library, switching to Insertion sort when the array becomes too small. The difference is that Combsort is not a divide-and-conquer approach, and therefore Insertion sort must be applied once on the whole array.

The utilization of Insertion sort as a final stage to Combsort has been considered before [19, 9], and Insertion sort was even employed by [11] for dealing with partial keys. The main contribution of the present work may be to extend AA-sort [10, 11] by giving a larger role to Insertion sort and also relying more on automatic vectorization by the compiler.

#### 2.5 In-place matrix transpose

In-place matrix transposition can be attained by moving a number to its destination, then moving away the number that was found there, and so on, until a cycle is completed and a new number is selected to be moved [25]. While carrying out this procedure is simple, it is not trivial to find out what are the cycles to transpose a matrix of any given dimensions, and even finding out the number of cycles turns out to be a difficult problem [12, 1.3.3-12].

ChipSort contains the generated function transpose! to perform in-place matrix transposition. The cycle seeds are computed at the time of code generation, and at run-time the function only moves the data, computing the index sequences and detecting when each cycle finishes. Therefore even though the function is completely generic, supporting any shape, the most complex part of the problem is tackled programmatically during code generation, resulting in a very simple run-time code.

## 2.6 Vectorized multi-way merging

Merging multiple arrays simultaneously requires the creation of a *merge tree* to keep some intermediate data. Each node in this tree is essentially an iterator traversing the merger of two arrays in the lower level. To employ the vectorized bitonic merge network in this process it is necessary to keep a whole vector at each node of the tree. To produce the next vector of elements one of the two source arrays is selected according to their smallest next element, and a whole vector is pulled from that array. This vector is merged with the intermediate one, and the first half of the resulting sequence can be taken to the next level of the tree, while the second half is kept as the new intermediate data.

Given a set of sorted arrays, once the tree is initialized data is taken from it one vector at a time, producing the final sorted sequence. Each time a new vector is requested the tree is traversed following the smallest element among the node children, until one of the input arrays is reached. The data from the array is then loaded from memory and merged with the intermediate data, producing new updates moving through the tree until the root is reached.

The intermediate data from the tree is kept by ChipSort in a contiguous array, treated as a heap. Assuming 1-based indexing, each node ancestor can be simply found by the Euclidean division of its index by 2. This data is assumed to fit some level of cache memory during operation, as in the related work [10, 3, 11].

# 3. Experiments

This section reports experiments with the ChipSort methods, as well as alternative ones. The term *method* can be understood as both the algorithms and their implementations in the Julia language. The specific methods tested were:

- —chipsort! Combsort finishing with Insertion sort, written as a serial program and vectorized by the compiler.
- -chipsort\_small! in-register, based on sorting networks.
- —chipsort\_medium! vectorized Combsort, plus additional steps, finishing with Insertion sort.
- —chipsort\_large merge-sort with a multi-way merge tree, starting from chunks processed by chipsort\_medium!.
- —insertion\_sort! Insertion sort taken from the Julia library.
- —sort! the Julia standard sort, Quicksort finishing with Insertion sort for small arrays (20 elements).

The ! implies in-place operation. The allocation time for chipsort\_large was not discounted in our analyses.

#### 3.1 Small arrays

In our first experiment we evaluate the sorting of small arrays, with 64 32-bit values. The chipsort\_small! method based on sorting and bitonic merge networks was compared to serial Insertion sort, the Julia sort and chipsort!. The short running time makes it challenging to perform proper measurements, therefore the task was to sort 128 sequences of 64 contiguous values in memory, forming a single block of  $2^{13}$  uniform random entries. Sorting was performed in-place, in a for-loop. The results are in Table 1.

Table 1. Sorting  $128 \times 64$  UInt32 elements.

| Method          | Median time |  |  |
|-----------------|-------------|--|--|
| chipsort_small! | 57.206 μs   |  |  |
| chipsort!       | 133.510 μs  |  |  |
| Insertion sort  | 194.679 μs  |  |  |
| Julia standard  | 222.910 μs  |  |  |

The chipsort\_small function achieved a significant acceleration of over 3 times relative to Insertion sort or the Julia standard function, and twice over chipsort!. Even though the in-register function was clearly faster, this first experiment already demonstrates the potential of chipsort!, which presented a speedup of at least 46% compared to the other alternatives.

#### 3.2 Medium and large arrays

Our second experiment compared many different methods on inputs of exponentially increasing sizes, from  $2^6$  to  $2^{20}$ . Figure 4 displays the resulting measurements. Although the curves look too concentrated in this visualization, it is possible to notice that chipsort! consistently outperforms the Julia standard library by a factor of 80% to 100%, or up to twice as fast. We can also observe that Insertion sort soon becomes too inefficient.

Figure 5 shows a different visualization from the same data with the time divided by the input size. The vertical dotted lines indicate the size of each cache level. Here we can see more clearly how chipsort\_medium! and chipsort\_large slightly outperform chipsort! in a few cases.

Figure 6 displays the measured times for  $2^{13}$  and  $2^{18}$  elements. All the tested ChipSort methods surpass the Julia standard library in these cases. In the  $2^{13}$  case chipsort\_medium! exhibits



Fig. 4. Running time of the different sorting methods studied.



Fig. 5. Curves displaying the relative running time from each studied method. The ordinate represents the median of the measured running times divided by the input size at each test. Lower is faster.

a slight speedup of 7.3% over chipsort!, and on the  $2^{13}$  case it is chipsort\_large! with 4.4%.

#### 3.3 Other platforms

A final experiment was carried out with different programming languages in the same machine and with Julia in different machines, to provide more context. Table 2 displays statistics collected from Julia, Python and C++ programs. For the case of Python the measurements obtained by calling the function through PyCall were better than using the native Python shell, therefore the best value was kept. Results obtained for C++ with Julia inter-operation using Cxx.jl were slightly worse than native, thus discarded. While Numpy was almost as fast as Julia, no reason could be found for the dismal performance of the C++ standard library.

Table 3 contains the results for chipsort! and the Julia standard sort! on an Intel processor with AVX512 instructions, and an ARM processor with 128-bit NEON instructions. No change in performance can be seen on the AVX512 machine, and on the ARM machine the speedup was smaller than before: only 60%. The most notable fact in this experiment is that the same high-level and serial code successfully produced a vectorized sort on machines with



Fig. 6. Empirical CDF from different methods with 8k and 256k UInt32 values. Results from 10,000 measurements.

Table 2. Sorting arrays of 8k and 1M UInt32 elements in different software platforms.

| in different software platforms. |                  |           |  |
|----------------------------------|------------------|-----------|--|
| Method                           | 8k               | 1M        |  |
| ChipSort                         | 172.35 $\mu$ s   | 35.28 ms  |  |
| Julia standard                   | 345.46 μs        | 68.39 ms  |  |
| Numpy via PyCall                 | $369.50 \ \mu s$ | 70.01 ms  |  |
| C++ std::qsort                   | 788.59 μs        | 147.37 ms |  |
| C++ std::sort                    | 1,207.23 μs      | 224.71 ms |  |

Table 3. Sorting arrays of 8k and 1M UInt32 elements in different hardware platforms.

| Chip   | Method         | 8k            | 1M        |
|--------|----------------|---------------|-----------|
| AXV512 | ChipSort       | 177.4 μs      | 35.20 ms  |
| AXV512 | Julia standard | 347.5 μs      | 69.07 ms  |
| NEON   | ChipSort       | 350.9 $\mu$ s | 73.36 ms  |
| NEON   | Julia standard | 573.2 μs      | 113.68 ms |

different vector sizes and even instruction sets, always with a beneficial outcome.

# 4. Conclusion

We have presented ChipSort, a Julia module for SIMD and cacheaware sorting. The main sorting method utilized in ChipSort is Combsort finalized with Insertion sort. This method is based on previous proposals for SIMD sorting [10, 11] and other practical sorting methods [9, 18]. This method was implemented as a serial program vectorized by the Julia compiler and LLVM, resulting in a sorting function up to twice as fast as the Julia standard library sort. This performance holds for at least a million 32-bit integers in a personal computer.

Other techniques available in ChipSort are sorting networks, bitonic merge networks, in-place matrix transpose and a multi-way merge tree. Vectorized in-register sorting, although limited to small powers of 2, proved to deliver speedups of up to 3 times relative to the standard library.

Results were limited for the method dedicated to large arrays, only narrowly surpassing the main ChipSort method in few cases. Merge-based techniques may be more advantageous when sorting large records, though.

While the main ChipSort method relies on compiler optimizations, other methods utilize explicit vectorization trough SIMD.jl. ChipSort makes extensive use of metaprogramming, more specifically Julia generated functions. This allows functions like sorting networks to be implemented in an abstract and generic way. The networks are represented as data structures, used by a higher-level program to assemble the final network code. This contrasts with other libraries where different networks are implemented straight as final functions. Julia also has the opportunity to perform custom optimizations since ChipSort is implemented in pure Julia with abstract methods that can be specialized.

It is a testament to the quality of the Julia compiler infrastructure that the most successful method implemented in ChipSort has no explicit vectorization or intrinsics. It is just a simple, scalar implementation of Combsort that is optimized by the compiler to generate code adapted to different problem settings and to the hardware architecture at run-time. And in the cases where explicit vectorization is necessary Julia still offers a suitable framework that allowed ChipSort to implement sophisticated techniques with concise code. This project can illustrate how well Julia implements traditional language features such as parametric types, and its metaprogramming features make it stand out from other languages even more.

Some future prospects for ChipSort are to implement missing features such as buffers in the merge tree and stable sorting of keypayload data; explore the limits of in-register processing in modern 512-bit architectures an eventually GPUs; explore how to interact with the compiler in order to optimize code for complex tasks such as multi-stage multi-threaded merge-sorting of large arrays.

## 5. References

- Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, 7(1):85–96, 2013.
- [2] Yen-Kuang Chen, Eric Q. Li, Xiaosong Zhou, and Steven Ge. Implementation of h.264 encoder and decoder on personal computers. *Journal of Visual Communication and Image Rep*resentation, 17(2):509 – 532, 2006. Introduction: Special Issue on emerging H.264/AVC video coding standard.
- [3] Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, William Macy, Mostafa Hagog, Yen-Kuang Chen, Akram Baransi, Sanjeev Kumar, and Pradeep Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. *PVLDB*, 1(2):1313–1324, 2008.
- [4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. *Introduction to Algorithms, 3rd Edition*. MIT Press, 2009.
- [5] Wlodzimierz Dobosiewicz. An efficient variation of bubble sort. *Information Processing Letters*, 11(1):5–6, 1980.

[6] Dee Dong and Saatvik Shah. Ultra-sort: Extremely parallel hardware optimized sorting, 2018.

- [7] Ulrich Drepper. What every programmer should know about memory, 2007. Linux Weekly News.
- [8] Jeffrey Sarnoff et alii. Sortingnetworks.jl, 2018.
- [9] Janet Incerpi and Robert Sedgewick. Practical variations of shellsort. *Information Processing Letters*, 26(1):37 – 43, 1987.
- [10] Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, and Toshio Nakatani. AA-Sort: A new parallel sorting algorithm for multi-core SIMD processors. In 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), Brasov, Romania, September 15-19, 2007, pages 189–198. IEEE Computer Society, 2007.
- [11] Hiroshi Inoue and Kenjiro Taura. SIMD- and cache-friendly algorithm for sorting an array of structures. PVLDB, 8(11):1274–1285, 2015.
- [12] Donald Ervin Knuth. The art of computer programming, volume 1: Fundamental Algorithms. Addison-Wesley, 3rd edition, 1997.
- [13] Donald Ervin Knuth. The art of computer programming, volume 3: Sorting and Searching. Addison-Wesley, 2nd edition, 1998
- [14] Daniel Kusswurm. Modern X86 Assembly Language Programming: Covers x86 64-bit, AVX, AVX2, and AVX-512. 01 2018
- [15] Stephen Lacey and Richard Box. A fast, easy sort. *BYTE*, 16(4):315–ff., April 1991.
- [16] E Scott Larsen and David McAllister. Fast matrix multiplies using graphics hardware. In *Proceedings of the 2001 ACM/IEEE conference on Supercomputing*, pages 55–55. ACM, 2001.
- [17] Wan-Chun Ma and Chia-Lin Yang. Using Intel streaming SIMD extensions for 3D geometry processing. In Yung-Chang Chen, Long-Wen Chang, and Chiou-Ting Hsu, editors, Advances in Multimedia Information Processing - PCM 2002, Third IEEE Pacific Rim Conference on Multimedia, Hsinchu, Taiwan, December 16-18, 2002, Proceedings, volume 2532 of Lecture Notes in Computer Science, pages 1080–1087. Springer, 2002.
- [18] David R Musser. Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8):983–993, 1997.
- [19] Martín Knoblauch Revuelta. Comb sort, or Dobosiewicz sort, 2013.
- [20] Erik Schnetter, Takafumi Arakaki, Valentin Churavy, Kristoffer Carlsson, Nicolau Werneck, Gunnar Farnebäck, Miguel Raz Guzmán Macedo, Matt Bauman, Kenta Sato, and Elliot Saba. eschnett/simd.jl, March 2019.
- [21] Nathan T. Slingerland and Alan Jay Smith. Multimedia extensions for general purpose microprocessors: a survey. *Microprocessors and Microsystems*, 29(5):225–246, 2005.
- [22] Alfred Strey and Martin Bange. Performance analysis of intel's MMX and SSE: A case study. In Rizos Sakellariou, John A. Keane, John R. Gurd, and Len Freeman, editors, Euro-Par 2001: Parallel Processing, 7th International Euro-Par Conference Manchester, UK August 28-31, 2001, Proceedings, volume 2150 of Lecture Notes in Computer Science, pages 142–147. Springer, 2001.

- [23] Chris J. Thompson, Sahngyun Hahn, and Mark Oskin. Using modern graphics architectures for general-purpose computing: a framework and analysis. In Erik R. Altman, Kemal Ebcioglu, Scott A. Mahlke, B. Ramakrishna Rau, and Sanjay J. Patel, editors, *Proceedings of the 35th Annual International Symposium on Microarchitecture, Istanbul, Turkey, November 18-22*, 2002, pages 306–317. ACM/IEEE Computer Society, 2002.
- [24] Sophie Wilson. The future of microprocessors, 2018. Julia-Con keynote address.
- [25] P. F. Windley. Transposing Matrices in a Digital Computer. The Computer Journal, 2(1):47–48, 01 1959.