In [1]:
import sys

sys.path.append("..")

In [2]:
from common.utility import show_implementation

In [3]:
import random

random.seed(42)

# Linear sorting
Suppose that we don't know the nature of the elements we are sorting, and are only given an operator that compares two elements $a,b$.


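We can model our sorting algorithm as a decision tree. The tree below sorts a 3-element array: `comp(i, j)` compares the elements at positions $i$ and $j$, and each leaf lists the original positions in sorted order.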

```
                                               comp(1, 2)
                                      <= /                    \   >
                                comp(2, 3)                    comp(1, 3)
                          <=   /         \>               <= /         \  > 
                          [1, 2, 3]      comp(1, 3)     [2, 1, 3]     comp(2, 3)
                       <= /       \  >               <= /        \  >
                  [1, 3, 2]       [3, 1, 2]     [2, 3, 1]        [3, 2, 1]
```

At each step, we compare two elements $a$ and $b$ and are told which is greater.
Then, we decide what to do next based on the result.

This means that without prior knowledge about the elements of the array, any sorting algorithm must follow a path from the root of the decision tree down to a leaf to determine the sorted order of the array.
(Otherwise, if the algorithm stops partway down a branch, we can supply it with a sequence of comparison results that lands it at that point, where it has not yet determined the exact ordering of the elements, making the sort incomplete.)

The worst-case complexity of the sorting algorithm is determined by the longest root-to-leaf path in the decision tree, which is the height of the tree.
A binary tree of height $h$ has at most $2^h$ leaves.
Each of the $n!$ permutations of an array of size $n$ must appear as a leaf, so there must be at least $n!$ leaves.
This gives us the bound $n! \leq 2^h$, that is, $h \geq \log_2 n!$.
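To get a feel for this bound, the short sketch below computes the smallest $h$ satisfying $n! \leq 2^h$ for a few values of $n$ and prints it next to $n \log_2 n$. The helper name `min_comparisons` is made up for this illustration.

```python
from math import factorial, log2


def min_comparisons(n):
    """Smallest h with n! <= 2**h, computed exactly with integers."""
    # (x - 1).bit_length() equals ceil(log2(x)) for any integer x >= 1
    return (factorial(n) - 1).bit_length()


for n in (3, 10, 100, 1000):
    print(n, min_comparisons(n), round(n * log2(n)))
```

For $n = 3$ the helper returns 3, which matches the height of the decision tree drawn above.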

Using Stirling's approximation, we get 
$$
h = \Omega(n\log n)
$$
which means that any comparison-based sorting algorithm must perform $\Omega(n \log n)$ comparisons in the worst case.
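The same bound can also be reached without Stirling's approximation. The estimate below is a standard elementary alternative (not the derivation referred to above): it keeps only the largest half of the terms of $\log_2 n!$, each of which is at least $\log_2 \frac{n}{2}$.

$$
h \geq \log_2 n! = \sum_{i=1}^{n} \log_2 i \geq \sum_{i=\lceil n/2 \rceil}^{n} \log_2 i \geq \frac{n}{2} \log_2 \frac{n}{2} = \Omega(n \log n)
$$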

Since we know that merge sort has a complexity of $O(n \log n)$, it may seem like there is no more room for optimization. However, what if we could sort the array without doing any comparisons between its elements?

## Counting sort

Suppose that we wish to sort $n$ integers, each in the range $0 \dots k$.
A way to sort the array is to:
1. Initialize an array of $k + 1$ zeros, $A[0\dots k]$
2. For each element $i$ in the input array, increment $A[i]$ by 1
3. For each index $i$ in $A$, output the value $i$ repeated $A[i]$ times to obtain our sorted array
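As a small worked trace of the three steps, consider the input `[2, 0, 2, 1, 0]` with $k = 2$. The variable names are chosen for this illustration only.

```python
arr = [2, 0, 2, 1, 0]
k = 2  # assumed known upper bound on the values

# Step 1: one counter for every possible value 0..k
A = [0] * (k + 1)

# Step 2: count how often each value occurs
for value in arr:
    A[value] += 1
print(A)  # [2, 1, 2]

# Step 3: emit each value as many times as it was counted
output = []
for value, count in enumerate(A):
    output.extend([value] * count)
print(output)  # [0, 0, 1, 2, 2]
```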

### Analysis

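Step 1 is $O(k)$, step 2 is $O(n)$, and step 3 is $O(n + k)$, since it visits all $k + 1$ counters and outputs $n$ elements.

Thus, we get the complexity of $O(n + k)$.

If $k = O(n)$, then we get a complexity of $O(n)$, which beats $\Omega(n \log n)$, the theoretical lower bound of comparison sort! This is not a contradiction, since counting sort never compares two elements with each other.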

### Implementation

In [4]:
from random import randint
from module.sort import merge_sort

n = 100_000
arr = [randint(0, 100) for _ in range(n)]
sorted_arr = merge_sort(arr)

In [5]:
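def counting_sort(arr):
    # Assumes non-negative integers; max() would fail on an empty list
    if not arr:
        return []

    # A[v] counts how many times the value v occurs
    A = [0 for _ in range(max(arr) + 1)]

    for i in arr:
        A[i] += 1
    output = []

    # Emit each value as many times as it was counted
    for index, i in enumerate(A):
        output += [index] * i
    return output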

In [6]:
counting_sort(arr) == sorted_arr

True

### Improvement

We can make the sort stable by slightly modifying the algorithm.
A sorting algorithm is said to be **stable** if, whenever two elements $a,b$ have the same value and $a$ appears before $b$ in the input array, the algorithm places $a$ before $b$ in the output array.

Notice that our current counting sort does not handle elements that are equal in sorting order but are otherwise different.
For example, if we wanted to sort an array of 2-tuples by their first value, our current implementation of counting sort would not work, as it treats all elements with the same sorting value as identical.

The improvement below resolves this issue.
We will later see that a stable sorting algorithm has some desirable properties.

In [7]:
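def stable_counting_sort(arr, key_func=lambda x: x):
    # key_func maps each item to a non-negative integer key
    keys = [key_func(i) for i in arr]
    if not keys:
        return keys

    # Count how many items have each key
    A = [0 for _ in range(max(keys) + 1)]

    for i in keys:
        A[i] += 1

    # Prefix sums: A[i] is now the number of items with key <= i
    for i in range(1, len(A)):
        A[i] += A[i - 1]

    output = [None for _ in arr]

    # Walk backwards so that items with equal keys keep their input order
    for item, i in zip(reversed(arr), reversed(keys)):
        output[A[i] - 1] = item
        A[i] -= 1
    return output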

In [8]:
sorted_arr == stable_counting_sort(arr)

True
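The check above only uses plain integers, where stability is invisible. The sketch below sorts 2-tuples by their first value to make it visible. So that the snippet runs on its own, it includes a condensed copy of the same idea under the hypothetical name `counting_sort_by_key`.

```python
def counting_sort_by_key(items, key_func):
    """Condensed stable counting sort; keys must be non-negative integers."""
    keys = [key_func(item) for item in items]
    if not keys:
        return []
    counts = [0] * (max(keys) + 1)
    for k in keys:
        counts[k] += 1
    # Prefix sums: counts[k] becomes the number of items with key <= k
    for k in range(1, len(counts)):
        counts[k] += counts[k - 1]
    output = [None] * len(items)
    # Fill from the back so that equal keys keep their input order
    for item, k in zip(reversed(items), reversed(keys)):
        counts[k] -= 1
        output[counts[k]] = item
    return output


records = [(2, "b"), (1, "x"), (2, "a"), (0, "z"), (1, "y")]
by_first = counting_sort_by_key(records, key_func=lambda pair: pair[0])
print(by_first)
# [(0, 'z'), (1, 'x'), (1, 'y'), (2, 'b'), (2, 'a')]
```

Note that `(2, "b")` still precedes `(2, "a")`, exactly as in the input: the sort looks only at the key and leaves ties in their original order.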

## Radix sort
Suppose that we are given $n$ numbers with $d$ digits each.
Since the largest possible number is $10^d - 1$, counting sort is $O(n + 10^d)$, which is not very efficient.

Using a similar idea to counting sort, we might first sort by the most significant (left-most) digit using counting sort, and then repeatedly sort by the less significant digits.

In [9]:
d = 10
n = 20
arr = [randint(10 ** (d - 1) + 1, 10**d) for _ in range(n)]

First, we sort by the most significant digit.

In [10]:
first_pass = stable_counting_sort(arr, key_func=lambda x: x // 10 ** (d - 1))
first_pass

[1985526561,
 1505356216,
 1223789820,
 2744156774,
 2763744940,
 4042147037,
 4343537916,
 4110390618,
 5166997149,
 5355831453,
 6649451454,
 6599334165,
 7379154982,
 7167549285,
 7535805496,
 7016552533,
 8289183507,
 8991777060,
 8117694333,
 9710836999]

Now, we need to group together all the numbers with the same first digit, and sort each group by the next digit.

In [11]:
grouped = [[k for k in first_pass if k // 10 ** (d - 1) == i] for i in range(10)]
grouped

[[],
 [1985526561, 1505356216, 1223789820],
 [2744156774, 2763744940],
 [],
 [4042147037, 4343537916, 4110390618],
 [5166997149, 5355831453],
 [6649451454, 6599334165],
 [7379154982, 7167549285, 7535805496, 7016552533],
 [8289183507, 8991777060, 8117694333],
 [9710836999]]

In [12]:
[stable_counting_sort(g, key_func=lambda x: x // 10 ** (d - 2)) for g in grouped]

[[],
 [1223789820, 1505356216, 1985526561],
 [2744156774, 2763744940],
 [],
 [4042147037, 4110390618, 4343537916],
 [5166997149, 5355831453],
 [6599334165, 6649451454],
 [7016552533, 7167549285, 7379154982, 7535805496],
 [8117694333, 8289183507, 8991777060],
 [9710836999]]

Now, we've sorted all the numbers based on the first 2 digits, and we just need to repeat this for every digit.
However, notice that after the first pass, there can be up to 10 different groups of numbers.
After the 2nd pass, there can be up to 100 different groups of numbers.
After $d$ passes, there can be up to $10^d$ different groups of numbers.
Hence, our algorithm is still $O(10^d)$.

Counter-intuitively, we can obtain our sorted array simply by **repeatedly sorting by the least-significant digit first** instead, moving one digit to the left on each pass.

In [13]:
def radix_sort(arr, d):
    temp_arr = arr
    for i in range(d):
        temp_arr = stable_counting_sort(
            temp_arr, key_func=lambda x: (x // 10**i) % 10
        )
    return temp_arr


radix_sort(arr, 10)

[1223789820,
 1505356216,
 1985526561,
 2744156774,
 2763744940,
 4042147037,
 4110390618,
 4343537916,
 5166997149,
 5355831453,
 6599334165,
 6649451454,
 7016552533,
 7167549285,
 7379154982,
 7535805496,
 8117694333,
 8289183507,
 8991777060,
 9710836999]

In [14]:
radix_sort(arr, 10) == merge_sort(arr)

True

### Analysis

Each pass of the counting sort is $O(n + k)$, and since $k = 10$, it is $O(n)$.
And since we performed $d$ passes of the counting sort, the complexity is simply $O(dn)$.

Convince yourselves that **if the sorting algorithm of each pass is stable, then the resultant array would be (properly) sorted**.
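One way to convince yourself is to watch the passes on a small input. The sketch below is an illustration only: it uses Python's built-in `sorted` (which is guaranteed to be stable) as the per-digit pass instead of counting sort, and the helpers `digit` and `lsd_passes` are made up for this example. Setting `stable=False` deliberately reverses the order of ties in every pass, to show what goes wrong without stability.

```python
def digit(x, i):
    """The i-th decimal digit of x, counting from the least significant."""
    return (x // 10**i) % 10


def lsd_passes(arr, d, stable=True):
    """Sort by each digit in turn, least significant first."""
    current = list(arr)
    for i in range(d):
        if stable:
            # sorted() is stable: ties keep their current relative order
            current = sorted(current, key=lambda x: digit(x, i))
        else:
            # Reversing first makes ties come out in reversed order
            current = sorted(reversed(current), key=lambda x: digit(x, i))
        print(i, current)
    return current


nums = [170, 45, 75, 90, 802, 24, 2, 66]
stable_result = lsd_passes(nums, 3)
unstable_result = lsd_passes(nums, 3, stable=False)
```

With stable passes, the order established by the lower digits survives whenever two numbers tie on the current digit, so after the last pass the numbers are ordered by all digits. With the tie-reversing passes, the final pass scrambles the numbers that share a leading digit, and the result is not sorted.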

#### Different base

Pre-requisite: [Binary representation](TODO).

We had represented each number as a $d$-digit number in base 10.
Suppose that we instead represent the numbers in a different base $b$.

Suppose that we know that our numbers are at most $m$ (this is equal to $10^d-1$ in our previous example).
Then, our numbers would have at most $\lfloor \log_b m \rfloor + 1 = O(\log_b m)$ digits in base $b$.

Each pass of the stable sort would be $O(n + b)$.
And we would need to repeat this $O(\log_b m)$ times to fully sort the array.
Thus, our overall complexity would be $O((n + b) \log_b m)$.
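A minimal sketch of a radix sort with a configurable base is given below. It is not the notebook's `radix_sort`: the name `radix_sort_base` is hypothetical, it assumes non-negative integers, and it uses per-digit bucket lists as the stable pass, which is also $O(n + b)$ per pass.

```python
import random


def radix_sort_base(arr, base=10):
    """LSD radix sort of non-negative integers using digits in `base`."""
    if not arr:
        return []
    current = list(arr)
    largest = max(current)
    place = 1  # base**i for the digit currently being processed
    while place <= largest:
        # One stable pass: append to per-digit buckets in current order
        buckets = [[] for _ in range(base)]
        for x in current:
            buckets[(x // place) % base].append(x)
        current = [x for bucket in buckets for x in bucket]
        place *= base
    return current


rng = random.Random(0)
data = [rng.randrange(10**6) for _ in range(1000)]
for b in (2, 10, 256):
    print(b, radix_sort_base(data, base=b) == sorted(data))
```

A larger base means fewer passes but more buckets per pass. A common choice is $b \approx n$, which gives $O(n \log_n m)$; if $m$ is polynomial in $n$, say $m = n^c$, this is $O(cn)$.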

## Bucket sort
Suppose that we are given $n$ numbers, each drawn from a [uniform distribution](../statistic/probability_distributions.ipynb#Discrete-uniform-distribution) over $[0, 1)$.

One way to sort them would be to:
1. Create $n$ buckets, which cover: $[0, \frac{1}{n}), [\frac{1}{n}, \frac{2}{n}), \dots [\frac{n-1}{n}, 1)$
2. Fill the numbers from the array into their respective buckets
    * Use insertion sort when adding the number into the bucket
3. Output the numbers from each bucket, in bucket order

### Analysis

Step 1 is $O(n)$.

For step 2, notice that since the distribution is uniform, we would expect on average only $1$ number in each bucket.
Since inserting into a sorted bucket is $O(k)$, where $k$ is the size of the bucket, this reduces to an expected complexity of $O(1)$ per insertion as $k \approx 1$.
Hence, step 2 is expected to be $O(n \times 1) = O(n)$.

Lastly, step 3 is $O(n)$, hence overall the algorithm is $O(n)$ in expectation. In the worst case, where every number lands in the same bucket, it degrades to $O(n^2)$.
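The argument for step 2 can be made more precise. As a sketch, assume the $n$ numbers are drawn independently, and let $n_i$ be the number of elements that land in bucket $i$ (a symbol introduced here). Building bucket $i$ by repeated insertion costs $O(n_i^2)$, and $n_i$ follows a binomial distribution with $n$ trials and success probability $\frac{1}{n}$, so $\mathbb{E}[n_i] = 1$ and $\operatorname{Var}(n_i) = 1 - \frac{1}{n}$. Therefore

$$
\mathbb{E}[n_i^2] = \operatorname{Var}(n_i) + \mathbb{E}[n_i]^2 = \left(1 - \frac{1}{n}\right) + 1 = 2 - \frac{1}{n}
$$

and summing over all buckets gives

$$
\mathbb{E}\left[\sum_{i=1}^{n} n_i^2\right] = n \left(2 - \frac{1}{n}\right) = 2n - 1 = O(n)
$$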

### Implementation

In [15]:
from math import floor


def insert(arr, item):
    # Insert item into the already-sorted list arr, in O(len(arr)) time
    return [i for i in arr if i <= item] + [item] + [i for i in arr if i > item]


def bucket_sort(arr):
    buckets = [[] for _ in arr]

    for i in arr:
        index = floor(i * len(arr))
        buckets[index] = insert(buckets[index], i)

    output = []
    for b in buckets:
        output += b

    return output

In [16]:
n = 10
arr = [random.random() for _ in range(n)]
bucket_sort(arr)

[0.058546743423030345,
 0.3387211476675075,
 0.4291099505621351,
 0.5545338248590267,
 0.593639596051472,
 0.6415794894044458,
 0.6866080537532946,
 0.7450222325657058,
 0.8378054739343479,
 0.9611608997285658]

In [17]:
bucket_sort(arr) == merge_sort(arr)

True
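As a possible variant (not part of the notebook's implementation), the standard library's `bisect.insort` can do the sorted insertion in place. The name `bucket_sort_bisect` is hypothetical, and the `min(...)` clamp is an added guard in case the product `v * n` rounds up to `n` in floating point.

```python
import random
from bisect import insort


def bucket_sort_bisect(values):
    """Bucket sort for floats in [0, 1), using bisect.insort per bucket."""
    n = len(values)
    buckets = [[] for _ in range(n)]
    for v in values:
        # Clamp as a guard in case v * n rounds up to n
        index = min(int(v * n), n - 1)
        # insort keeps the bucket sorted after every insertion
        insort(buckets[index], v)
    return [v for bucket in buckets for v in bucket]


rng = random.Random(0)
sample = [rng.random() for _ in range(1000)]
print(bucket_sort_bisect(sample) == sorted(sample))
```

`insort` finds the insertion point by binary search, but inserting into a Python list still shifts elements, so each insertion remains $O(k)$ in the bucket size $k$ and the analysis above is unchanged.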