# Sorting
Sorting is used to preprocess the collection to make searching faster, as well as identify items that are similar. Naive sorting algorithms run in $O(n^2)$ time. The best sorting algorithms run in $O(n\log n)$ time:
- **heapsort**: in-place but not stable
- **mergesort**: stable but not in place
- **quicksort**: $O(n^2)$ worst-case run time, but generally the best choice in practice

For short arrays, e.g., 10 or fewer elements, insertion sort is easier to code and faster than asymptotically superior algorithms. If every element is known to be at most $k$ places from its final location, a min-heap can be used to sort in $O(n\log k)$ time. If there are a small number of distinct keys, counting sort, which records for each element the number of elements less than it, works well.
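As a rough sketch of the $O(n\log k)$ idea, a min-heap of at most $k + 1$ elements can emit the output in order. The function name `sort_almost_sorted` and the sample input are illustrative assumptions, not part of the original notes.

```python
import heapq
from typing import Iterable, List


def sort_almost_sorted(values: Iterable[int], k: int) -> List[int]:
    """Sort values in which each element is at most k places from its sorted position."""
    min_heap: List[int] = []
    result: List[int] = []
    for value in values:
        heapq.heappush(min_heap, value)
        # once the heap holds k + 1 items, its minimum must be the next output element
        if len(min_heap) > k:
            result.append(heapq.heappop(min_heap))
    # drain whatever is left in the heap
    while min_heap:
        result.append(heapq.heappop(min_heap))
    return result


print(sort_almost_sorted([3, -1, 2, 6, 4, 5, 8], k=2))
```

Each push and pop works on a heap of at most $k + 1$ elements, which is where the $\log k$ factor comes from.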

In [13]:
from collections import Counter, namedtuple
from typing import Dict, List

from utils import run_tests

## Tips
- Sorting problems come in two flavors: 
    1. **Use sorting to make subsequent steps in an algorithm simpler** - fine to use a library sort function, possibly with a custom comparator
    1. Design a **custom sorting routine** - use a data structure like a heap, a BST, or an array indexed by values
- Certain problems become easier to understand, as well as solve, when the input is sorted. The most natural reason to sort is if the inputs have a **natural ordering**, and sorting can be used as a preprocessing step to **speed up searching**
- For **specialized input**, e.g., a very small range of values, or a small number of values, it's possible to sort in $O(n)$ time rather than $O(n\log n)$ time.
- It's often the case that sorting can be implemented in **less space** than required by a brute-force approach.
- Sometimes it is not obvious what to sort on, e.g., should a collection of intervals be sorted on starting points or endpoints?
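To illustrate the tip about specialized input, here is a minimal sketch of a histogram-style counting sort. It assumes the values are non-negative integers no larger than a known `max_value`; the function name and that parameter are hypothetical.

```python
from typing import List


def counting_sort(values: List[int], max_value: int) -> List[int]:
    """Sort non-negative integers no larger than max_value in O(n + max_value) time."""
    # counts[v] is the number of times v occurs in values
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1

    # emit each value as many times as it was seen, in ascending order
    result: List[int] = []
    for v, count in enumerate(counts):
        result.extend([v] * count)
    return result


print(counting_sort([3, 1, 0, 3, 2, 1], max_value=3))
```

This variant only works for bare integers. Sorting records by an integer key while keeping the sort stable needs the prefix-sum form of counting sort instead.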

## Libraries
- `sort()`: stable, in-place sort for list objects. 
    - Returns `None`
    - Takes two optional keyword arguments:
        1. `key=None`: a function that defines the sort order by mapping each list element to a comparable object
        1. `reverse=False`
- `sorted()`: takes an iterable and returns a new list containing all items from the iterable in ascending order
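A small illustration of the difference between the two calls, using a made-up list of words:

```python
words = ['pear', 'fig', 'banana', 'kiwi']

# sorted() returns a new list and leaves the original untouched
by_length = sorted(words, key=len)
print(by_length)   # ties keep their original order because the sort is stable
print(words)

# list.sort() reorders the list in place and returns None
result = words.sort(reverse=True)
print(result)
print(words)
```

Because `sort()` returns `None`, writing `words = words.sort()` is a common mistake that throws the list away.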

### Make a Class Sortable

In [3]:
class Student:

    def __init__(self, name: str, gpa: float) -> None:
        self.name = name 
        self.gpa = gpa 

    def __lt__(self, other: 'Student') -> bool:
        return self.name < other.name 

    def __str__(self):
        return f'{self.name} has a GPA of {self.gpa}'

students = [
    Student('Jack', 3.7),
    Student('Jill', 3.75),
    Student('Freya', 1.25),
    Student('Banana', 2.75)
]
# sort according to __lt__
students.sort()
list(map(print, students))
print()

# sort in place by gpa
students.sort(key=lambda student: student.gpa)
list(map(print, students))

Banana has a GPA of 2.75
Freya has a GPA of 1.25
Jack has a GPA of 3.7
Jill has a GPA of 3.75

Freya has a GPA of 1.25
Banana has a GPA of 2.75
Jack has a GPA of 3.7
Jill has a GPA of 3.75


[None, None, None, None]

### 13.1: Compute the Intersection of Two Sorted Arrays
e.g.: [2, 3, 3, 5, 5, 6, 7, 7, 8, 12] & [5, 5, 6, 8, 8, 9, 10, 10] -> [5, 6, 8]

In [4]:
def intersection_sorted_arrays_bf(A: List[int], B: List[int]) -> List[int]:
    return [a for i, a in enumerate(A) if (i == 0 or a != A[i - 1]) and a in B]  # prevent duplicates by checking with previous value

print(intersection_sorted_arrays_bf([2, 3, 3, 5, 5, 6, 7, 7, 8, 12], [5, 5, 6, 8, 8, 9, 10, 10]))


[5, 6, 8]


$O(nm)$ time complexity: for each of the $n$ elements of `A`, the `a in B` test scans up to $m$ elements of `B`
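One possible middle ground, sketched here as an assumption rather than as part of the original solution, is to replace the linear `a in B` scan with a binary search using the standard `bisect` module. This gives $O(n\log m)$ time and can pay off when one array is much shorter than the other. The function name is hypothetical.

```python
import bisect
from typing import List


def intersection_sorted_arrays_bsearch(A: List[int], B: List[int]) -> List[int]:
    def contains(sorted_values: List[int], target: int) -> bool:
        # bisect_left returns the first index at which target could be inserted
        i = bisect.bisect_left(sorted_values, target)
        return i < len(sorted_values) and sorted_values[i] == target

    # skip repeated values in A, then binary search for each remaining value in B
    return [a for i, a in enumerate(A)
            if (i == 0 or a != A[i - 1]) and contains(B, a)]


print(intersection_sorted_arrays_bsearch(
    [2, 3, 3, 5, 5, 6, 7, 7, 8, 12], [5, 5, 6, 8, 8, 9, 10, 10]))
```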

In [5]:
def intersection_sorted_arrays(A: List[int], B: List[int]) -> List[int]:
    ''' 
    take advantage of the fact that both arrays are sorted
    '''
    i, j = 0, 0
    intersection = []

    while i < len(A) and j < len(B):  # once get to end of one list, elements cannot intersect 
        if A[i] == B[j]:
            if (i == 0 or A[i] != A[i - 1]): 
                intersection.append(A[i])
            i += 1
            j += 1
        elif A[i] < B[j]:
            i += 1
        else:         # A[i] > B[j]
            j += 1

    return intersection

print(intersection_sorted_arrays([2, 3, 3, 5, 5, 6, 7, 7, 8, 12], [5, 5, 6, 8, 8, 9, 10, 10]))


[5, 6, 8]


$O(n + m)$ time complexity

### 13.2: Merge Two Sorted Arrays
Assume the first array has enough empty spaces at the end for the elements of the second array   
e.g.: [3, 13, 17, _, _, _, _] & [3, 7, 11, 19] -> [3, 3, 7, 11, 13, 17, 19]

In [6]:
def merge_two_sorted_arrays(A: List[int], n: int, B: List[int], m: int) -> None:
    ''' 
    n, m: number of elements in the respective arrays
    updates array A in place
    '''
    a, b, write_index = n-1, m-1, n+m-1

    while a >= 0 and b >= 0:
        if A[a] > B[b]:
            A[write_index] = A[a]
            a -= 1
        else:
            A[write_index] = B[b]
            b -= 1

        write_index -= 1 

    # copy any remaining entries of B; remaining entries of A are already in place
    while b >= 0:
        A[write_index] = B[b]
        b -= 1
        write_index -= 1

A = [3, 13, 17, None, None, None, None]
merge_two_sorted_arrays(A, 3, [3, 7, 11, 19], 4)
print(A)

A = [3, 13, 17, None, None, None, None, None]
merge_two_sorted_arrays(A, 3, [3, 7, 11, 19, 20], 5)
print(A)

A = [3, 13, 17, None, None, None, None]
merge_two_sorted_arrays(A, 3, [7, 11, 19, 20], 4)
print(A)

[3, 3, 7, 11, 13, 17, 19]
[3, 3, 7, 11, 13, 17, 19, 20]
[3, 7, 11, 13, 17, 19, 20]


### 13.3: Calculate H-Index
The h-index is a metric that measures both the productivity and citation impact of a researcher. A researcher's h-index is the largest number $h$ such that the researcher has published $h$ papers that have each been cited at least $h$ times.

In [7]:
def calc_h_index_bf(citations: List[int]) -> int:
    h = 0
    # the h-index can be as large as the number of papers, so test 1..n inclusive
    for h_test in range(1, len(citations) + 1):
        # count the papers cited at least h_test times
        num_papers = 0
        for c in citations:
            if c >= h_test:
                num_papers += 1
        if num_papers >= h_test:
            h = h_test
        else:
            break
    return h

print(calc_h_index_bf([1, 4, 1, 4, 2, 1, 3, 5, 6]))

4


$O(n^2)$ time complexity

In [8]:
def calc_h_index(citations: List[int]) -> int:

    citations.sort() 
    for i, c in enumerate(citations):
        if c >= len(citations) - i:
            return len(citations) - i
    return 0
print(calc_h_index([1, 4, 1, 4, 2, 1, 3, 5, 6]))

4


#### Variant: H-Index but can use additional space 

In [9]:
def calc_h_index_space(citations: List[int]) -> int:

    citation_paper_count: Dict[int, int] = {}    # number of citations, number of papers
    max_citations = 0
    for c in citations:
        if c in citation_paper_count:
            citation_paper_count[c] += 1
        else:
            citation_paper_count[c] = 1
        max_citations = max(max_citations, c)

    # walk h downward from the largest citation count, accumulating
    # the number of papers cited at least h times
    paper_count = 0
    for h in reversed(range(max_citations + 1)):
        paper_count += citation_paper_count.get(h, 0)
        if paper_count >= h:   # always true at h == 0, so the loop always returns
            return h
    return 0

print(calc_h_index_space([1, 4, 1, 4, 2, 1, 3, 5, 6]))

4


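$O(n + c_{\max})$ time, where $c_{\max}$ is the largest citation count, and $O(n)$ space complexity. Capping each count at $n$ (the h-index can never exceed the number of papers) would bring the time down to $O(n)$.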

### 13.4: Remove First Name Duplicates

In [10]:
class Name:
    def __init__(self, first: str, last: str) -> None:
        self.first = first 
        self.last = last 

    def __lt__(self, other: 'Name') -> bool:
        return self.first < other.first if self.first != other.first else self.last < other.last 

    def __eq__(self, other: 'Name') -> bool:
        return self.first == other.first

    def __str__(self) -> str:
        return f'{self.first} {self.last}'


def eliminate_duplicates(names: List[Name]) -> None:
    names.sort()
    write_index = 1

    for candidate in names[1:]:
        if candidate != names[write_index-1]:
            names[write_index] = candidate
            write_index += 1

    del names[write_index:]

names = [
    Name('Ian', 'Botham'),
    Name('David', 'Gower'),
    Name('Ian', 'Bell'),
    Name('Ian', 'Cambell')
]

eliminate_duplicates(names)
list(map(print, names))


David Gower
Ian Bell


[None, None]

$O(n\log n)$ time complexity

### *13.5: Smallest Nonconstructible Value
Given a set of coins, there are some amounts of change that you may not be able to make with them. Find the smallest such amount. E.g., with coins [1, 1, 1, 1, 1, 5, 10, 25], the smallest amount that cannot be made is 21.  

In [11]:
def smallest_nonconstructible_value(A: List[int]) -> int:

    max_constructible_value = 0
    # the greedy argument only holds when coins are visited in ascending order
    for a in sorted(A):
        if a > max_constructible_value + 1:
            break 
        max_constructible_value += a 
    return max_constructible_value + 1

smallest_nonconstructible_value([1, 1, 1, 1, 1, 5, 10, 25])

21

$O(n\log n)$ time complexity
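To sanity-check the greedy result on small inputs, one option is a brute-force version that enumerates every subset sum and scans upward for the first gap. This is a verification sketch with a hypothetical name, not the intended solution, and its cost grows with the number of distinct sums.

```python
from typing import List, Set


def smallest_nonconstructible_value_bf(coins: List[int]) -> int:
    """Enumerate every subset sum, then scan upward for the first missing amount."""
    reachable: Set[int] = {0}
    for coin in coins:
        # every existing sum can be extended by this coin
        reachable |= {s + coin for s in reachable}

    candidate = 1
    while candidate in reachable:
        candidate += 1
    return candidate


print(smallest_nonconstructible_value_bf([1, 1, 1, 1, 1, 5, 10, 25]))
```

With the example coins, the sums 0 through 20 are all reachable and the next reachable sum is 25, so the first gap is 21.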

### 13.6: Render a Calendar
Write a program that takes a set of events and determines the maximum number of events that take place concurrently

In [15]:
Event = namedtuple('Event', ('start', 'end'))

def max_simultaneous_events(E: List[Event]) -> int:
    ''' 
    sort events then keep a counter that increments at
    start times and decrements at end times 
    '''
    # an endpoint is a (time, is_start) pair; the sort key below
    # puts a start before an end when their times are equal
    Endpoint = namedtuple('Endpoint', ('time', 'is_start'))

    # build an array of endpoints
    Ends = [ 
        p for event in E for p in (Endpoint(event.start, True), Endpoint(event.end, False))
    ]
    # sort the endpoint array according to the time, breaking ties 
    # by putting start times before end time
    Ends.sort(key=lambda e: (e.time, not e.is_start))   # false comes first when sorting

    # track the number of simultaneous events
    max_num_events = num_events = 0
    for e in Ends:
        if e.is_start:
            num_events += 1
            max_num_events = max(max_num_events, num_events)
        else:
            num_events -= 1
    return max_num_events

events = [Event(1, 5), Event(2, 7), Event(4, 5), Event(6, 10), Event(8, 9), Event(9, 17), Event(11, 13), Event(14, 15), Event(12, 15)]
max_simultaneous_events(events)

3

$O(n\log n)$ time complexity, and $O(n)$ space complexity for the endpoints array

In [None]:
# false comes first when sorting
A = [True, False]
A.sort()
A

#### Variant 13.6.A
Users $1, 2, ..., n$ share an Internet connection. User $i$ uses $b_i$ bandwidth from time $s_i$ to $f_i$, inclusive. What is the peak bandwidth usage?

In [17]:
InternetUse = namedtuple('InternetUse', ('start', 'end', 'bandwidth'))

def peak_bandwidth(A: List[InternetUse]) -> int:
    # keep track of endpoints
    Endpoint = namedtuple('Endpoint', ('time', 'is_start', 'bandwidth'))

    # build endpoints array
    # add one to each end time because usage includes the end time (assumes integer times)
    E = [ 
        p for usage in A for p in (Endpoint(usage.start, True, usage.bandwidth), Endpoint(usage.end + 1, False, usage.bandwidth))
    ]

    # sort endpoints s.t. if tie, end time comes first 
    E.sort(key=lambda e: (e.time, e.is_start))

    # calculate bandwidth
    max_bandwidth = current_bandwidth = 0
    for endpoint in E:
        if endpoint.is_start:
            current_bandwidth += endpoint.bandwidth
            max_bandwidth = max(max_bandwidth, current_bandwidth)
        else:
            current_bandwidth -= endpoint.bandwidth
    return max_bandwidth


usages = [InternetUse(1, 3, 10), InternetUse(2, 3, 10), InternetUse(2, 4, 10), InternetUse(3, 5, 10)]
peak_bandwidth(usages)

40

$O(n\log n)$ time complexity, and $O(n)$ space complexity for the endpoints array

### 13.7: Merging Intervals
Write a program which takes as input an array of disjoint closed intervals with integer endpoints, sorted in increasing order of left endpoint, and an interval to be added, and returns the union of the intervals in the array and the added interval.    
e.g.: adding [1, 8] to ([-4, -1], [0, 2], [3, 6], [7, 9], [11, 12], [14, 17]) -> ([-4, -1], [0, 9], [11, 12], [14, 17])

In [19]:
Interval = namedtuple('Interval', ('left', 'right'))

def merge_intervals(disjoint_intervals: List[Interval], new_interval: Interval) -> List[Interval]:
    i, result = 0, []

    # process intervals that come before new interval
    while i < len(disjoint_intervals) and new_interval.left > disjoint_intervals[i].right:
        result.append(disjoint_intervals[i])
        i += 1

    # merge overlapping intervals
    merged_interval = new_interval
    while i < len(disjoint_intervals) and new_interval.right >= disjoint_intervals[i].left:
        merged_interval = Interval(
            min(merged_interval.left, disjoint_intervals[i].left),
            max(merged_interval.right, disjoint_intervals[i].right))
        i += 1

    return result + [merged_interval] + disjoint_intervals[i:]

merge_intervals([Interval(-4, -1), Interval(0, 2), Interval(3, 6), Interval(7, 9), Interval(11, 12), Interval(14, 17)], new_interval=Interval(1, 8))

[Interval(left=-4, right=-1),
 Interval(left=0, right=9),
 Interval(left=11, right=12),
 Interval(left=14, right=17)]

$O(n)$ time complexity since spends $O(1)$ time per Interval

### 13.8: Union of Intervals

In [22]:
Endpoint = namedtuple('Endpoint', ('is_closed', 'value'))
Interval = namedtuple('Interval', ('left', 'right'))

def union_of_intervals(intervals: List[Interval]) -> List[Interval]:
    ''' 
    when sorting, if two intervals have the same left-endpoint, 
    put intervals which are left closed first

    Cases:
        - The interval most recently added to the result does not 
            intersect the current interval, nor does its right endpoint
            equal the left endpoint of the current interval. In this case, 
            we simply add the current interval to the end of the result array
            as a new interval.
        - The interval most recently added to the result intersects the current interval.
            In this case, we update the most recently added interval to the union
            of it with the current interval.
        - The interval most recently added to the result has its right endpoint equal
            to the left endpoint of the current interval, and one (or both) of these
            endpoints are closed. In this case too, we update the most recently 
            added interval to the union of it with the current interval
    '''
    # sort intervals by left endpoint, putting closed left endpoints first on ties
    intervals.sort(key=lambda i: (i.left.value, not i.left.is_closed))
    result: List[Interval] = []
    for i in intervals:
        if result and (i.left.value < result[-1].right.value or
                            (i.left.value == result[-1].right.value and 
                            (i.left.is_closed or result[-1].right.is_closed))):
            if (i.right.value > result[-1].right.value or
                (i.right.value == result[-1].right.value and i.right.is_closed)):
                result[-1] = Interval(result[-1].left, i.right)
        else:
            result.append(i)

    return result

intervals = [
    Interval(Endpoint(True, 3), Endpoint(True, 4)),
    Interval(Endpoint(False, 0), Endpoint(False, 3)),
    Interval(Endpoint(True, 1), Endpoint(True, 1)),
    Interval(Endpoint(True, 2), Endpoint(True, 4)),
    Interval(Endpoint(True, 5), Endpoint(False, 7)),
    Interval(Endpoint(True, 7), Endpoint(False, 8)),
    Interval(Endpoint(True, 8), Endpoint(False, 11)),
    Interval(Endpoint(False, 9), Endpoint(True, 11)),
    Interval(Endpoint(True, 12), Endpoint(True, 14)),
    Interval(Endpoint(False, 12), Endpoint(True, 16)),
    Interval(Endpoint(False, 13), Endpoint(False, 15)),
    Interval(Endpoint(False, 16), Endpoint(False, 17))
]
union_of_intervals(intervals)

[Interval(left=Endpoint(is_closed=False, value=0), right=Endpoint(is_closed=True, value=4)),
 Interval(left=Endpoint(is_closed=True, value=5), right=Endpoint(is_closed=True, value=11)),
 Interval(left=Endpoint(is_closed=True, value=12), right=Endpoint(is_closed=False, value=17))]

### *13.9: Partitioning and Sorting an Array with Many Repeated Elements

In [None]:
Person = namedtuple('Person', ('age', 'name'))

def group_by_age(people: List[Person]) -> None:
    ''' 
    maintain a subarray for each of the different elements.
    use two hash tables to track subarrays
    '''
    age_to_count = Counter([person.age for person in people])
    age_to_offset, offset = {}, 0
    for age, count in age_to_count.items():
        age_to_offset[age] = offset   # starting index for that age 
        offset += count

    while age_to_offset:
        from_age = next(iter(age_to_offset))
        from_idx = age_to_offset[from_age]
        to_age = people[from_idx].age
        to_idx = age_to_offset[to_age]

        # switch
        people[from_idx], people[to_idx] = people[to_idx], people[from_idx]

        # use age_to_count to see when we are finished with a particular age
        age_to_count[to_age] -= 1
        if age_to_count[to_age]:
            age_to_offset[to_age] = to_idx + 1
        else:
            del age_to_offset[to_age]

$O(n)$ time and $O(m)$ space complexity, where $m$ is the number of distinct ages

#### Variant 13.9.A: Maintain Ages in Sorted Order
Use a BST
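Python's standard library has no balanced BST, so the sketch below swaps in a different technique: a dictionary of per-age buckets whose keys are sorted once at the end. It runs in $O(n + m\log m)$ time but, unlike `group_by_age`, it is not in place. The function name and the sample people are illustrative assumptions.

```python
from collections import defaultdict, namedtuple
from typing import Dict, List

Person = namedtuple('Person', ('age', 'name'))


def group_by_age_sorted(people: List[Person]) -> List[Person]:
    """Return people grouped by age, with the age groups in ascending order."""
    buckets: Dict[int, List[Person]] = defaultdict(list)
    for person in people:
        buckets[person.age].append(person)

    # sorting only the m distinct ages costs O(m log m)
    return [person for age in sorted(buckets) for person in buckets[age]]


people = [Person(14, 'Greg'), Person(12, 'John'), Person(11, 'Andy'),
          Person(13, 'Jim'), Person(12, 'Phil'), Person(13, 'Bob')]
print([p.age for p in group_by_age_sorted(people)])
```

A true BST keyed on age would keep the groups ordered as people are inserted, which matters if the collection changes over time.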

### 13.10: Team Photo Day
Two teams line up for a photo where team 0 is in the front row and team 1 is in the back row. A photo is possible if each person in the back row is taller than the person directly in front of them.

In [12]:
def valid_photo(team0: List[int], team1: List[int]) -> bool:
    return all(
        a < b for a, b in zip(sorted(team0), sorted(team1))
    )

valid_photo([1, 5, 2, 1, 0, 3, 1], [4, 2, 9, 4, 2, 8, 3])

True

### 13.11: Implement a Fast Sorting Algorithm for Lists
Unlike arrays, linked lists can be merged in place, so mergesort gives a fast, stable sort

In [25]:
from data_structures.linked_lists.single_node import Node
from data_structures.linked_lists import single_node

def merge_sorted_lists(L1: Node, L2: Node) -> Node:

    # base case
    if L1 is None:
        return L2 
    if L2 is None:
        return L1

    dummy_head = tail = Node(0)

    while L1 and L2:
        if L1.data <= L2.data:   # take from L1 on ties to keep the sort stable
            tail.next = L1
            L1 = L1.next
        else: 
            tail.next = L2 
            L2 = L2.next 
        tail = tail.next

    # append remaining nodes of L1 or L2
    tail.next = L1 or L2

    return dummy_head.next

x = single_node.push_list([7, 5, 2])
y = single_node.push_list([11, 3])
single_node.print_list(x)
single_node.print_list(y)
merge_ll = merge_sorted_lists(x, y)
single_node.print_list(merge_ll)

def stable_sort_list(L: Node):

    # base case
    if L is None or L.next is None:
        return L 

    # find midpoint of L using a slow and fast pointer 
    pre_slow, slow, fast = None, L, L 
    while fast and fast.next:
        pre_slow = slow 
        fast, slow = fast.next.next, slow.next 
    
    # split the list into two equal size lists
    if pre_slow:
        pre_slow.next = None

    return merge_sorted_lists(stable_sort_list(L), stable_sort_list(slow))

print()
x = single_node.push_list([7, 5, 2, 10, 1, 4, 2, 3, 4, 10, 11, -5])
single_node.print_list(x)
x = stable_sort_list(x)
single_node.print_list(x)


2 5 7 
3 11 
2 3 5 7 11 

-5 11 10 4 3 2 4 1 10 2 5 7 
-5 1 2 2 3 4 4 5 7 10 10 11 


$O(n\log n)$ time complexity. Though no memory is explicitly allocated, the space complexity is $O(\log n)$, which is the maximum function call stack depth, since each recursive call is made with an argument that is half as long.

### 13.12: Compute a Salary Threshold

In [27]:
def salary_threshold(salaries: List[int], target_payroll: int) -> float:
    salaries.sort()
    unadjusted_salary_sum = 0

    for i, salary in enumerate(salaries):
        num_people_adjusted = len(salaries) - i 
        adjusted_salary_sum = salary * num_people_adjusted
        if adjusted_salary_sum + unadjusted_salary_sum >= target_payroll:
            return (target_payroll - unadjusted_salary_sum) / num_people_adjusted
        else:
            unadjusted_salary_sum += salary
    return -1

salary_threshold([90, 30, 100, 40, 20], target_payroll=210)

60.0

$O(n\log n)$ time complexity
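Reading the code above, the threshold appears to be a cap such that, when every salary above it is reduced to the cap, the total payroll equals the target. Under that assumption, a small hypothetical helper can check the answer:

```python
from typing import List


def capped_payroll(salaries: List[int], cap: float) -> float:
    """Total payroll when every salary above cap is reduced to cap."""
    return sum(min(salary, cap) for salary in salaries)


# with a cap of 60, the salaries 90 and 100 are both reduced to 60
print(capped_payroll([90, 30, 100, 40, 20], cap=60))
```

For the example input this gives 60 + 30 + 60 + 40 + 20 = 210, which matches the target payroll.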

#### Variant