In [1]:
import numpy as np

# 01.5: Bucket Sort
### Assignment 1.5: Bucket Sort
## 01.5.1 Implementing Bucket Sort for integers
The first part of this assignment is implementing bucket sort. This implementation takes the following steps:

1. Split list into positives and negatives
2. Temporarily turn negatives positive (abs)
3. Do the following steps for both lists
    1. Map data to strings
    2. Check if exit condition has been calculated:
        1. If not, set the maximum to the length of the longest string
    3. Generate the buckets' rows. There are 10 rows in total, numbered 0 to 9, because negative numbers are handled separately (steps 1 and 2). In the given pseudo-code, one would also generate the columns. That would add extra overhead to the algorithm, and has not been implemented.
    4. Distribution pass
        1. Loop through the strings generated at **1.** and the original data simultaneously.
        2. Check if index_offset is greater than or equal to the length of the string
            1. If yes, append the integer to bucket 0 and move on to the next integer
            2. Else, continue with the steps below
        3. Get the character at the wanted index (in the first iteration, the rightmost position)
        4. Convert the character to an integer; this new integer is the bucket index
        5. Append the integer to the bucket with that index
    5. Gathering pass
        1. For each bucket in buckets, concatenate to a new list
    6. Increase index offset by 1
    7. Check if index offset is greater than or equal to the maximum (exit condition)
        1. If yes, return the array generated in **5.**
        2. If no, call bucket sort again with the array generated in **5.**
4. Turn the temporarily positive negatives back to true negatives, and reverse the list
5. Concatenate the negatives with the positives, and return
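
To make the distribution and gathering passes concrete, the sketch below runs them one pass at a time on a small list of non-negative integers. The helper name `one_pass` is hypothetical and exists only for this illustration; it follows the steps listed above but is not the notebook's implementation.

```python
def one_pass(data, index_offset):
    """Run a single distribution and gathering pass on non-negative integers."""
    buckets = [[] for _ in range(10)]
    for integer in data:
        string = str(integer)
        if index_offset >= len(string):
            # No digit at this position: treat it as a leading zero
            buckets[0].append(integer)
        else:
            # Read the digit counted from the right-hand side
            digit = int(string[-1 - index_offset])
            buckets[digit].append(integer)
    # Gathering pass: concatenate buckets 0..9 in order
    return [integer for bucket in buckets for integer in bucket]


data = [50, 1, 15, 1, 2, 31]
after_ones = one_pass(data, 0)        # ordered by the rightmost digit
after_tens = one_pass(after_ones, 1)  # ordered by the second digit from the right
print(after_ones)
print(after_tens)
```

After the first pass the list is ordered by its last digit only; the second pass finishes the job because each pass keeps the relative order of items that land in the same bucket.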

In [2]:
def bucket_sort_int(data):
    """Sort a list of integers (negative values included) using bucket sort."""
    # Split data into positive and negative values
    positives = list(filter(lambda x: x >= 0, data))
    negatives = list(filter(lambda x: x < 0, data))
    
    # Temporarily turn negatives positive
    abs_negatives = list(map(abs, negatives))
    
    # Run bucket_sort_positive_int on both abs_negatives and positives
    sorted_positives = bucket_sort_positive_int(positives)
    sorted_abs_negatives = bucket_sort_positive_int(abs_negatives)
    
    # Turn sorted_abs_negatives back into true negatives, and reverse
    sorted_negatives = list(map(lambda x: x * -1, sorted_abs_negatives))[::-1]
    
    # Concatenate and return the lists
    return sorted_negatives + sorted_positives
    

def bucket_sort_positive_int(data, index_offset=0, maximum=None):
    # Number of items in data
    n = len(data)
    
    if n == 0:
        return []
    
    # Map data to strings
    str_data = list(map(str, data))
    
    # Check if exit condition has been established (maximum)
    if maximum is None:
        # If not, generate maximum
        maximum = len(max(str_data, key=len))
   
    
    # Generate buckets
    buckets = [[] for __ in range(0, 10)]

    
    # Distribution pass
    for integer, string in zip(data, str_data):
        
        if (index_offset >= len(string)):
            buckets[0].append(integer)
        else:
            character = string[-1 - index_offset]            
            bucket_number = int(character)
            buckets[bucket_number].append(integer)
    
    # Gathering pass
    new_data = []
    for bucket in buckets:
        new_data += bucket
    
    # Set new index_offset
    index_offset += 1
    # Check if exit condition has been reached (index_offset >= maximum)
    if index_offset >= maximum:
        # If exit condition has been reached, return new_data
        return new_data
    else:
        # Else, recursively call bucket_sort again
        return bucket_sort_positive_int(new_data, index_offset=index_offset, maximum=maximum)
    
    

In [3]:
bucket_sort_int([50,1, 15,1,2,31, -1])

[-1, 1, 1, 2, 15, 31, 50]

So this implementation works on smaller lists. How does it perform on big lists?

In [4]:
thirtyK  = np.arange(-15_000, 15_000)
np.random.shuffle(thirtyK)
sorted_thirtyK = bucket_sort_int(thirtyK)
sorted_thirtyK == sorted(thirtyK)

True

If Python's `sorted` function is to be believed, this bucket sort implementation has properly sorted a randomly shuffled list with 30 thousand items!



## 01.5.2 Sorting, speed!
Next up is analysing bucket sort. How fast is it, and what type of time complexity does it have?  

Using the time testing functions from 01.1, it can be timed:

For a completely random set of lists:

In [5]:
def give_lists():
    """Taken from 01.1"""
    thirtyK  = np.arange(0, 30_000)
    tenK     = np.arange(0, 10_000)
    oneK     = np.arange(0, 1_000)
    
    np.random.shuffle(thirtyK)
    np.random.shuffle(tenK)
    np.random.shuffle(oneK)
    return thirtyK, tenK, oneK

def partial_sort_func(func):
    """Taken from 01.1"""
    # TODO redo with cProfile    
    thirtyK, tenK, oneK = give_lists()
    oneK_time    = %timeit -r 1 -n 1 -o -q func(oneK)
    tenK_time    = %timeit -r 1 -n 1 -o -q func(tenK)
    thirtyK_time = %timeit -r 1 -n 1 -o -q func(thirtyK)
    
    
    return {"1.000" : oneK_time, "10.000" : tenK_time, "30.000" : thirtyK_time}

In [6]:
partial_sort_func(bucket_sort_int)

{'1.000': <TimeitResult : 3.25 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>,
 '10.000': <TimeitResult : 42 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>,
 '30.000': <TimeitResult : 169 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>}
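
The `%timeit` magic only works inside IPython. As a rough alternative for plain Python, the sketch below uses the standard-library `timeit` module instead (not cProfile, which the TODO mentions). The function name `time_sort` and its parameters are assumptions made for this example, and it is demonstrated with the built-in `sorted` so that it runs on its own.

```python
import random
import timeit


def time_sort(func, sizes=(1_000, 10_000, 30_000), repeats=3):
    """Return the best wall-clock time in seconds for each list size."""
    results = {}
    for size in sizes:
        data = list(range(size))
        random.shuffle(data)
        # Copy the list on every run so each run sorts the same shuffled input
        timings = timeit.repeat(lambda: func(list(data)), repeat=repeats, number=1)
        results[size] = min(timings)
    return results


# Demonstrated with the built-in sorted; in the notebook one would pass bucket_sort_int
print(time_sort(sorted))
```

Taking the minimum of several repeats tends to reduce noise from other processes, which a single run with `-r 1 -n 1` cannot do.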

For a pre-sorted list of 30 thousand items:

In [7]:
def partial_presorted_func(func):
    thirtyK = np.arange(0, 30_000)
    %timeit -r 2 -n 2 func(thirtyK)


In [8]:
partial_presorted_func(bucket_sort_int)

158 ms ± 3.65 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)


For a reversed list of 30 thousand items:

In [9]:
def partial_reversed_func(func):
    thirtyK = np.arange(0, 30_000)[::-1]
    %timeit -r 1 -n 1 func(thirtyK)

In [10]:
partial_reversed_func(bucket_sort_int)

155 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


This is remarkably faster than the fastest sorting algorithm in 01.1, merge sort. It took merge sort approximately 1.17 seconds to sort 30.000 items. Bucket sort does this in roughly 155 to 170 milliseconds. Bucket sort is about 7 times faster in this scenario!
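
The ratio can be checked directly from the reported numbers. The bucket sort timings below are copied from the cell outputs above, and the merge sort figure is the 1.17 seconds quoted from 01.1; the variable names are made up for this check.

```python
merge_sort_seconds = 1.17  # merge sort on 30.000 items, as reported for 01.1

# Bucket sort timings for 30.000 items, taken from the cell outputs above
bucket_sort_seconds = {"random": 0.169, "pre-sorted": 0.158, "reversed": 0.155}

speedups = {}
for name, seconds in bucket_sort_seconds.items():
    # How many times faster bucket sort is than merge sort for this input
    speedups[name] = merge_sort_seconds / seconds
    print(f"{name}: bucket sort is about {speedups[name]:.1f}x faster")
```

In every measured case the ratio is above 1, so it is bucket sort that comes out faster, by a factor of roughly 7.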

The next question to answer is, what type of time complexity can be used to describe bucket sort?

Now, this is far from the regular implementation of bucket sort. Usually, bucket sort applies another sorting algorithm to each of its buckets. This implementation instead recursively calls bucket sort on the gathered list, moving one digit to the left each time, which makes it closer to a least-significant-digit radix sort. This version of bucket sort also works on negative integers, which adds some extra work to the analysis.
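
For comparison, a more conventional bucket sort could look roughly like the sketch below: values are spread over buckets by range, and each bucket is then sorted with another algorithm (here simply the built-in `sorted`). The function name `classic_bucket_sort` and the `bucket_count` parameter are assumptions for this illustration and do not appear in the assignment.

```python
def classic_bucket_sort(data, bucket_count=10):
    """Range-based bucket sort; each bucket is sorted with the built-in sorted."""
    if len(data) == 0:
        return []
    low, high = min(data), max(data)
    if low == high:
        # All values are equal, so the list is already sorted
        return list(data)
    # Width of the value range covered by each bucket
    width = (high - low) / bucket_count
    buckets = [[] for _ in range(bucket_count)]
    for value in data:
        # Clamp so that the maximum value lands in the last bucket
        index = min(int((value - low) / width), bucket_count - 1)
        buckets[index].append(value)
    result = []
    for bucket in buckets:
        result += sorted(bucket)  # any comparison sort could be used here
    return result


print(classic_bucket_sort([50, 1, 15, 1, 2, 31, -1]))
```

Because the buckets are based on the value range rather than on digits, this variant handles negative numbers without splitting the list first.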

This version of bucket sort can be described as such:

First of all, bucket sort loops through all of the items in the list twice to split positives and negatives. This adds $2n$ to our Big O.

Next, it loops through all of the negative numbers to take their absolute values, adding $a$ (the number of negative numbers) to our Big O.

The following steps are performed twice: once for the positive numbers and once for the negative numbers.

Bucket sort maps over all values in the list to turn them into strings. This adds another $n$ to our Big O.
Next, bucket sort loops through the strings and integers together. This adds another $n$ to our Big O. Bucket sort then concatenates the buckets, adding another $n$ to our Big O. This process is then repeated recursively $k$ times for the negatives (where $k$ is the length of the longest string in the negatives) and $p$ times for the positives (where $p$ is the length of the longest string in the positives).

After both the absolute negatives and the positives have been sorted, the absolute negatives get negated and reversed. This adds another $2a$ to our Big O. Then, finally, the sorted negatives and sorted positives get concatenated, adding the final $n$ to our Big O.

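This, finally, results in the following Big O:

$O(3(a+n)+k(3n)+p(3n))$

Each pass contributes at most $3n$, because the string mapping, the distribution pass and the gathering pass each touch every item once. This Big O is the worst-case scenario. The best-case scenario is when $a = 0$. In this scenario, there is no longest negative string, so $k = 0$. This results in the following Big O:

$O(3n + p(3n))$

Dropping the constant factors leaves $O(pn)$. Since $p$ is the number of digits in the largest value, which stays small for ordinary integers, this may be reduced to simply $O(n)$: the time complexity grows linearly with the number of items.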
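
The linear growth can also be checked by counting work instead of timing it. The sketch below is an iterative stand-in for the recursive implementation, under the assumption that counting bucket appends is a fair proxy for the work done per pass. The name `count_bucket_appends` is hypothetical, and the digit is extracted arithmetically rather than through strings, which differs from the notebook's code.

```python
def count_bucket_appends(data):
    """Sort non-negative integers digit by digit and count bucket appends."""
    passes = len(str(max(data)))  # p: number of digits in the largest value
    appends = 0
    for index_offset in range(passes):
        buckets = [[] for _ in range(10)]
        for integer in data:
            # Digit at the current position, counted from the right
            digit = (integer // 10 ** index_offset) % 10
            buckets[digit].append(integer)
            appends += 1
        # Gathering pass: concatenate buckets 0..9 in order
        data = [integer for bucket in buckets for integer in bucket]
    return data, passes, appends


for n in (1_000, 10_000, 30_000):
    reversed_input = list(range(n))[::-1]
    result, passes, appends = count_bucket_appends(reversed_input)
    assert result == sorted(reversed_input)
    print(n, passes, appends)
```

The number of appends is always the number of items times the number of passes, so for a fixed number of digits the work grows in proportion to the list size.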