# BucketSort

BucketSort works by splitting the elements into ordered buckets. The elements are sorted within their buckets, and the buckets' contents are then concatenated. This sort works best when the input data is uniformly distributed across its range. If inserting an element into its bucket takes O(n) time, as with the sorted linked list used here, the whole algorithm is O($n^2$) in the worst case.
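
As a rough sketch of where such bounds come from, assume $n$ elements, $k$ buckets, bucket $i$ holding $n_i$ elements, and a sorted insert whose cost is proportional to the bucket's current length (these symbols are introduced here for illustration only):

$$T(n, k) = O(n + k) + \sum_{i} O(n_i^2)$$

If every element lands in the same bucket, then $n_i = n$ and the sum is $O(n^2)$. If the elements are spread evenly, then $n_i \approx n/k$ and the sum is about $k \cdot (n/k)^2 = n^2/k$, which becomes $O(n)$ when $k$ grows in proportion to $n$.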

In [1]:
import numpy as np

# Linked List functionality

To implement this sort we use a linked list for each bucket. The code below implements the necessary functionality.

In [2]:
class LNode:
    def __init__(self, elem=None, next=None):
        self.elem=elem
        self.next=next

class LList:
    def __init__(self, elem):
        first = LNode(elem)
        self.head=first
        self.count=1
    
    # Functionality to print list        
    def print(self):
        currentnode = self.head
        for i in range(self.count):
            print(currentnode.elem)
            currentnode = currentnode.next
    
    # Functionality for adding an element into the list in a sorted fashion
    def append_sorted(self, newdata):
        # Creating new node
        newnode = LNode(newdata)
        # Pointers for the current node and the node before it
        currentnode = self.head
        prevnode = None
        # Advance while the current node exists and holds a smaller element
        # (the None check also covers a list that has been emptied by popleft)
        while currentnode is not None and newnode.elem > currentnode.elem:
            prevnode = currentnode
            currentnode = currentnode.next
        # In the case we are adding to the front of the list (or the list is empty)
        if prevnode is None:
            newnode.next = currentnode
            self.head = newnode
        # In all other cases, stitch newnode between prevnode and currentnode
        else:
            prevnode.next = newnode
            newnode.next = currentnode

        self.count += 1

    # Functionality for deleting elements from the left
    def popleft(self):
        if self.count==0:
            raise Exception("Cannot pop from empty list")
        else:
            popped = self.head.elem
            self.head = self.head.next
            self.count -=1
        return popped
        

In [3]:
linky = LList(7)

In [4]:
linky.append_sorted(10)

In [5]:
linky.print()

7
10


In [6]:
linky.count

2
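
For comparison, the same sorted-insert behaviour can be sketched with the standard library's `bisect.insort` on a plain Python list. This is a swapped-in technique, not the linked list used in this notebook, and the name `bucket` is only illustrative. `insort` finds the position by binary search, but inserting into a Python list still shifts elements, so each insert remains O(n) in the worst case.

```python
import bisect

# Plain Python list standing in for one bucket (an assumption; the notebook uses LList)
bucket = [7]
for value in (10, 3, 8, 7):
    # insort keeps the list sorted after every insertion
    bisect.insort(bucket, value)

print(bucket)  # [3, 7, 7, 8, 10]
```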

# Generating a list of random numbers

In [110]:
np.random.seed(122)
n = 10
high = 100
randseq = np.random.randint(0, high+1, n).tolist()
print(randseq)

[26, 10, 54, 76, 62, 16, 15, 59, 98, 79]


# Step 1: Determining the buckets

The buckets' width is determined by the following simple formula: $$\dfrac{\max-\min}{k}$$ where $k$ is the chosen number of buckets.

In [111]:
k = 10
# Storing the maximum and minimum of the sequence
max_val = max(randseq)
min_val = min(randseq)
bucket_det = (max_val-min_val)/k
bucket_det

8.8

The bucket table can then be generated from k. It has k+1 entries (indices 0 to k) because the maximum element maps to index k.

In [101]:
buckets = {i:None for i in range(k+1)}

In [102]:
print(buckets)

{0: None, 1: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None, 9: None, 10: None}


# Step 2: Adding elements to their designated buckets

We want the smallest value to land in bucket 0 and the largest value in bucket 10 (that is, bucket k). We therefore use the following bucketing function, rounded down to an integer, to decide which bucket each element goes into: $$\left\lfloor\dfrac{element-min}{bucket\ det}\right\rfloor$$

In [128]:
# for each element in the sequence
for element in randseq:
    # bucket index (named idx to avoid shadowing the built-in hash)
    idx = int((element-min_val)/bucket_det)
    # if the bucket at this index is currently empty, create a new linked list
    if buckets[idx] is None:
        buckets[idx] = LList(element)
    # Else, insert the element in sorted position
    else:
        buckets[idx].append_sorted(element)

In [129]:
buckets

{0: <__main__.LList at 0x1077c4f50>,
 1: <__main__.LList at 0x1077e9cd0>,
 2: None,
 3: None,
 4: None,
 5: <__main__.LList at 0x1077c4810>,
 6: None,
 7: <__main__.LList at 0x1077d49d0>,
 8: None,
 9: None,
 10: <__main__.LList at 0x1077d51d0>}

In [135]:
buckets[5].print()

54
59
62
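
The bucket assignment for the example sequence can be reproduced in isolation with the sketch below. It hard-codes the sequence printed earlier and applies the same bucketing formula; the names `width`, `indices` and `groups` are illustrative and do not appear elsewhere in the notebook.

```python
seq = [26, 10, 54, 76, 62, 16, 15, 59, 98, 79]
k = 10
min_val, max_val = min(seq), max(seq)
width = (max_val - min_val) / k  # bucket width for this sequence

# Same bucketing function as above: int((element - min) / width)
indices = [int((x - min_val) / width) for x in seq]
print(indices)  # [1, 0, 5, 7, 5, 0, 0, 5, 10, 7]

# Group the elements by bucket index to see which buckets are occupied
groups = {}
for x, i in zip(seq, indices):
    groups.setdefault(i, []).append(x)
print(sorted(groups))  # [0, 1, 5, 7, 10]
```

Only five of the eleven buckets are occupied, which matches the `None` entries in the bucket table above.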


# Step 3: Concatenating the outputs

In [136]:
sorted_array = []
# For each bucket
for i in buckets:
    # If the bucket is empty, go to the next iteration
    if buckets[i] is None:
        continue
    # Empty the bucket into the new array
    while buckets[i].count!=0:
        sorted_array.append(buckets[i].popleft())

In [137]:
sorted_array

[10, 15, 16, 26, 54, 59, 62, 76, 79, 98]

# Putting everything together

In [138]:
np.random.seed(123)
n = 100
high = 100
randseq = np.random.randint(0, high+1, n).tolist()
print(randseq)

[66, 92, 98, 17, 83, 57, 86, 97, 96, 47, 73, 32, 46, 96, 25, 83, 78, 36, 96, 80, 68, 49, 55, 67, 2, 84, 39, 66, 84, 47, 61, 48, 7, 99, 92, 52, 97, 85, 94, 27, 34, 97, 76, 40, 3, 69, 64, 75, 34, 58, 10, 22, 77, 18, 100, 15, 27, 30, 52, 70, 26, 80, 6, 14, 75, 54, 71, 1, 43, 58, 55, 25, 50, 84, 56, 49, 12, 18, 81, 1, 51, 44, 48, 56, 91, 49, 86, 3, 67, 11, 21, 89, 98, 3, 11, 3, 94, 6, 9, 87]


In [141]:
def BucketSort(seq, k):
    # An empty sequence is already sorted
    if len(seq) == 0:
        return []
    # Getting the required max and min information. Note that max() and min() are each O(n) in Python
    max_val = max(seq)
    min_val = min(seq)
    # If all elements are equal the bucket width would be 0, and the sequence is already sorted
    if max_val == min_val:
        return list(seq)
    # Determining bucket width
    bucket_det = (max_val-min_val)/k

    # Generating buckets (k+1 of them, since the maximum maps to index k)
    buckets = {i:None for i in range(k+1)}

    # Filling buckets
    # for each element in the sequence
    for element in seq:
        # bucket index (named idx to avoid shadowing the built-in hash)
        idx = int((element-min_val)/bucket_det)
        # if the bucket at this index is currently empty, create a new linked list
        if buckets[idx] is None:
            buckets[idx] = LList(element)
        # Else, insert the element in sorted position
        else:
            buckets[idx].append_sorted(element)

    # Filling a new array with the sorted elements
    sorted_array = []
    # For each bucket
    for i in buckets:
        # If the bucket is empty, go to the next iteration
        if buckets[i] is None:
            continue
        # Empty the bucket into the new array
        while buckets[i].count!=0:
            sorted_array.append(buckets[i].popleft())

    return sorted_array

In [142]:
print(BucketSort(randseq, 10))

[1, 1, 2, 3, 3, 3, 3, 6, 6, 7, 9, 10, 11, 11, 12, 14, 15, 17, 18, 18, 21, 22, 25, 25, 26, 27, 27, 30, 32, 34, 34, 36, 39, 40, 43, 44, 46, 47, 47, 48, 48, 49, 49, 49, 50, 51, 52, 52, 54, 55, 55, 56, 56, 57, 58, 58, 61, 64, 66, 66, 67, 67, 68, 69, 70, 71, 73, 75, 75, 76, 77, 78, 80, 80, 81, 83, 83, 84, 84, 84, 85, 86, 86, 87, 89, 91, 92, 92, 94, 94, 96, 96, 96, 97, 97, 97, 98, 98, 99, 100]


In [143]:
BucketSort(randseq, 10)==sorted(randseq)

True
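
A single equality check covers one input only. The sketch below repeats the comparison against `sorted` on many random inputs, including empty and constant sequences. To stay self-contained it uses a hypothetical list-based variant, `bucket_sort_lists`, which keeps the same bucketing formula but stores each bucket as a plain Python list; it is not the linked-list implementation above.

```python
import random

def bucket_sort_lists(seq, k):
    """Hypothetical list-based variant: buckets are Python lists sorted with sorted()."""
    if not seq:
        return []
    min_val, max_val = min(seq), max(seq)
    if min_val == max_val:
        # All elements equal: the width would be 0, and the input is already sorted
        return list(seq)
    width = (max_val - min_val) / k
    # k + 1 buckets, because the maximum maps to index k
    buckets = [[] for _ in range(k + 1)]
    for x in seq:
        buckets[int((x - min_val) / width)].append(x)
    result = []
    for b in buckets:
        result.extend(sorted(b))
    return result

random.seed(0)
for _ in range(200):
    n = random.randint(0, 50)
    k = random.randint(1, 20)
    data = [random.randint(-100, 100) for _ in range(n)]
    assert bucket_sort_lists(data, k) == sorted(data)
print("all checks passed")
```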

# Timing the algorithm

In [144]:
np.random.seed(123)
randseq = np.random.randint(0,101,10).tolist()
print(randseq)

[66, 92, 98, 17, 83, 57, 86, 97, 96, 47]


In [145]:
print(BucketSort(randseq, 10))

[17, 47, 57, 66, 83, 86, 92, 96, 97, 98]


In [146]:
def BucketSortTester(k, n, high, low=0):
    randseq = np.random.randint(low, high+1, n).tolist()
    return randseq, BucketSort(randseq, k)

In [154]:
BucketSortTester(10,10,100)

([55, 25, 50, 84, 56, 49, 12, 18, 81, 1],
 [1, 12, 18, 25, 49, 50, 55, 56, 81, 84])

In [155]:
# 10 buckets, 10 elements, numbers 0-10
%timeit BucketSortTester(10,10,10)

7.56 μs ± 408 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [156]:
# 10 buckets, 100 elements, numbers 0-100
%timeit BucketSortTester(10,100,100)

46.8 μs ± 3.86 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [162]:
# 10 buckets, 100 elements, numbers 0-10,000
%timeit BucketSortTester(10, 100, 10_000)

45.4 μs ± 1.21 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [161]:
# 10 buckets, 10_000 elements, numbers 0-100
%timeit BucketSortTester(10, 10_000, 100)

70.2 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [160]:
# 10 buckets, 10,000 elements, numbers 0-10,000
%timeit BucketSortTester(10, 10_000, 10_000)

81.5 ms ± 224 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Naturally, the algorithm is slower for larger sequences. Would increasing the number of buckets help?

In [163]:
# 100 buckets, 10,000 elements, numbers 0-10,000
%timeit BucketSortTester(100, 10_000, 10_000)

11 ms ± 25.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Unsurprisingly, increasing the number of buckets reduces the runtime, since each bucket's linked list is now shorter: when inserting a new element into a bucket, the algorithm has to traverse fewer elements. However, more buckets also mean more memory usage.
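
The effect of the bucket count can be made concrete by counting how many nodes the sorted insert would step past, rather than measuring time. The sketch below is an approximation under stated assumptions: it uses plain lists and `bisect` in place of the linked list, and the helper name `count_moves` is hypothetical. The position returned by `bisect_left` equals the number of smaller elements already in the bucket, which is how many times the `while` loop in `append_sorted` would advance.

```python
import bisect
import random

def count_moves(seq, k):
    """Count the node advances a sorted insert would make across all buckets."""
    min_val, max_val = min(seq), max(seq)
    width = (max_val - min_val) / k
    buckets = [[] for _ in range(k + 1)]
    moves = 0
    for x in seq:
        bucket = buckets[int((x - min_val) / width)]
        # Number of strictly smaller elements already in this bucket
        pos = bisect.bisect_left(bucket, x)
        moves += pos
        bucket.insert(pos, x)
    return moves

random.seed(123)
seq = [random.randint(0, 10_000) for _ in range(10_000)]
moves_10 = count_moves(seq, 10)
moves_100 = count_moves(seq, 100)
print(moves_10, moves_100, moves_10 / moves_100)
```

For uniformly distributed data the ratio should come out near 10, in line with the drop in measured runtime between 10 and 100 buckets.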

# Trying a distribution whose range is very different from the number of elements

In [170]:
# 10 buckets, 100 elements, high = 100_000_000, low = 99_999_990
%timeit BucketSortTester(10, 100, 100_000_000, 99_999_990)

40 μs ± 4.65 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Since the algorithm rescales the values by subtracting the minimum and dividing by the bucket width, the elements are still spread evenly across the buckets, so this sequence is not taxing for the algorithm.
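
A small sketch of this rescaling, assuming for illustration that every integer in the range from 99,999,990 to 100,000,000 appears exactly once (the timed run above draws random values instead):

```python
# Every integer in the narrow range, once each (an illustrative assumption)
seq = list(range(99_999_990, 100_000_001))
k = 10
min_val, max_val = min(seq), max(seq)
width = (max_val - min_val) / k  # (100_000_000 - 99_999_990) / 10 = 1.0

# Subtracting the minimum removes the large offset before bucketing
indices = [int((x - min_val) / width) for x in seq]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The absolute size of the values does not matter; only their position between the minimum and the maximum does.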

# Trying a very skewed distribution

However, the algorithm can run into trouble when the distribution is heavily skewed towards one point. In that case, one bucket receives most of the elements, and the algorithm has to spend a lot of time traversing the linked list in that bucket.

In [176]:
outlier = [50_000_000+i for i in range(98)] + [100_000_000, 0]
print(outlier)

[50000000, 50000001, 50000002, 50000003, 50000004, 50000005, 50000006, 50000007, 50000008, 50000009, 50000010, 50000011, 50000012, 50000013, 50000014, 50000015, 50000016, 50000017, 50000018, 50000019, 50000020, 50000021, 50000022, 50000023, 50000024, 50000025, 50000026, 50000027, 50000028, 50000029, 50000030, 50000031, 50000032, 50000033, 50000034, 50000035, 50000036, 50000037, 50000038, 50000039, 50000040, 50000041, 50000042, 50000043, 50000044, 50000045, 50000046, 50000047, 50000048, 50000049, 50000050, 50000051, 50000052, 50000053, 50000054, 50000055, 50000056, 50000057, 50000058, 50000059, 50000060, 50000061, 50000062, 50000063, 50000064, 50000065, 50000066, 50000067, 50000068, 50000069, 50000070, 50000071, 50000072, 50000073, 50000074, 50000075, 50000076, 50000077, 50000078, 50000079, 50000080, 50000081, 50000082, 50000083, 50000084, 50000085, 50000086, 50000087, 50000088, 50000089, 50000090, 50000091, 50000092, 50000093, 50000094, 50000095, 50000096, 50000097, 100000000, 0]


In [177]:
%timeit BucketSort(outlier, 10)

156 μs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Here we see that, although it is still reasonably fast, this run is much slower than the other runs involving 100 elements (roughly 40 to 47 μs). Furthermore, it is also slower than RadixSort when faced with roughly the same issue.

In [178]:
%timeit BucketSort(outlier, 100)

159 μs ± 352 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Increasing the number of buckets also does not appear to help, which makes sense since the pain point is one specific bucket taking on too many elements.
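
The bucket occupancy for the skewed sequence can be checked directly. The sketch below rebuilds the `outlier` list and counts how many elements fall into each bucket index for 10 and for 100 buckets; the helper name `occupancy` is hypothetical.

```python
from collections import Counter

outlier = [50_000_000 + i for i in range(98)] + [100_000_000, 0]

def occupancy(seq, k):
    """Count how many elements map to each bucket index."""
    min_val, max_val = min(seq), max(seq)
    width = (max_val - min_val) / k
    return Counter(int((x - min_val) / width) for x in seq)

print(sorted(occupancy(outlier, 10).items()))   # [(0, 1), (5, 98), (10, 1)]
print(sorted(occupancy(outlier, 100).items()))  # [(0, 1), (50, 98), (100, 1)]
```

In both cases 98 of the 100 elements share a single bucket, so the sorted insert does the same amount of traversal regardless of the bucket count.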