# Sorting
Sorting is often used to preprocess a collection so that searching is faster, and to identify items that are similar. Naive sorting algorithms run in $O(n^2)$ time. The best comparison-based sorting algorithms run in $O(n\log n)$ time:
- **heapsort**: in-place but not stable
- **mergesort**: stable but not in-place
- **quicksort**: $O(n^2)$ worst-case run time, but generally the best choice in practice

For short arrays, e.g., 10 or fewer elements, insertion sort is easier to code and faster than asymptotically superior algorithms. If every element is known to be at most $k$ places from its final location, a min-heap can be used to sort in $O(n\log k)$ time. If there are a small number of distinct keys, counting sort, which records for each element the number of elements less than it, works well.
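As a rough illustration of the min-heap idea above, the sketch below sorts an input in which every element is at most `k` places from its sorted position. The function name `sort_almost_sorted` is hypothetical and not part of the original notebook; it relies only on the standard `heapq` module.

```python
import heapq
from typing import Iterable, List


def sort_almost_sorted(values: Iterable[int], k: int) -> List[int]:
    """Sort values where each element is at most k places from its final position."""
    min_heap: List[int] = []
    result: List[int] = []
    for value in values:
        heapq.heappush(min_heap, value)
        # Once the heap holds k + 1 elements, its minimum must be the next
        # element of the sorted output.
        if len(min_heap) > k:
            result.append(heapq.heappop(min_heap))
    # Drain the remaining elements in ascending order.
    while min_heap:
        result.append(heapq.heappop(min_heap))
    return result


print(sort_almost_sorted([3, -1, 2, 6, 4, 5, 8], k=2))  # [-1, 2, 3, 4, 5, 6, 8]
```

The heap never holds more than `k + 1` elements, so each push and pop costs $O(\log k)$, which gives the $O(n\log k)$ bound.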

In [45]:
from collections import Counter
from typing import Dict, List

from utils import run_tests

## Tips
- Sorting problems come in two flavors: 
    1. **Use sorting to make subsequent steps in an algorithm simpler** - fine to use a library sort function, possibly with a custom comparator
    1. Design a **custom sorting routine** - use a data structure like a heap, a BST, or an array indexed by values
- Certain problems become easier to understand, as well as to solve, when the input is sorted. The most natural reason to sort is that the inputs have a **natural ordering**, in which case sorting can serve as a preprocessing step to **speed up searching**.
- For **specialized input**, e.g., a very small range of values, or a small number of values, it's possible to sort in $O(n)$ time rather than $O(n\log n)$ time.
- It's often the case that sorting can be implemented in **less space** than required by a brute-force approach.
- Sometimes it is not obvious what to sort on, e.g., should a collection of intervals be sorted on starting points or endpoints?
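To make the tip about specialized input concrete, here is a minimal sketch of a tally-based counting sort. It assumes the values are non-negative integers no larger than a known `max_value`; the function name and signature are illustrative, not taken from the original notebook.

```python
from typing import List


def counting_sort(values: List[int], max_value: int) -> List[int]:
    """Sort non-negative integers known to lie in the range 0..max_value."""
    # counts[v] = number of occurrences of v in the input
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1

    # Emit each value as many times as it occurred, in increasing order.
    result: List[int] = []
    for v, count in enumerate(counts):
        result.extend([v] * count)
    return result


print(counting_sort([3, 0, 2, 3, 1, 0, 2], max_value=3))  # [0, 0, 1, 2, 2, 3, 3]
```

This runs in $O(n + \text{max\_value})$ time, so it only beats a comparison sort when the range of values is small relative to $n$.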

## Libraries
- `sort()`: stable, in-place sort for list objects.
    - Returns `None`
    - Takes two optional keyword-only arguments:
        1. `key=None`: a function that maps each list element to a comparable value, which is then used to determine the sort order
        1. `reverse=False`: if `True`, sorts in descending order
- `sorted()`: takes an iterable and returns a new list containing all of its items in ascending order; accepts the same `key` and `reverse` arguments
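A short example may help contrast the two calls. The list `words` is made up for illustration.

```python
words = ['pear', 'fig', 'banana', 'kiwi']

# sorted() leaves the original list untouched and returns a new list.
# The sort is stable, so 'pear' stays ahead of 'kiwi' (both have length 4).
by_length = sorted(words, key=len)
print(by_length)  # ['fig', 'pear', 'kiwi', 'banana']
print(words)      # ['pear', 'fig', 'banana', 'kiwi']

# list.sort() reorders the list in place and returns None.
result = words.sort(reverse=True)
print(result)     # None
print(words)      # ['pear', 'kiwi', 'fig', 'banana']
```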

### Make a Class Sortable

In [13]:
class Student:

    def __init__(self, name: str, gpa: float) -> None:
        self.name = name 
        self.gpa = gpa 

    def __lt__(self, other: 'Student') -> bool:
        return self.name < other.name 

    def __str__(self):
        return f'{self.name} has a GPA of {self.gpa}'

students = [
    Student('Jack', 3.7),
    Student('Jill', 3.75),
    Student('Freya', 1.25),
    Student('Banana', 2.75)
]
# sort according to __lt__
students.sort()
list(map(print, students))
print()

# sort in place by gpa
students.sort(key=lambda student: student.gpa)
list(map(print, students))

Banana has a GPA of 2.75
Freya has a GPA of 1.25
Jack has a GPA of 3.7
Jill has a GPA of 3.75

Freya has a GPA of 1.25
Banana has a GPA of 2.75
Jack has a GPA of 3.7
Jill has a GPA of 3.75


[None, None, None, None]
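Defining `__lt__` alone is enough for `sort()` and `sorted()`. If the other comparison operators are also wanted, one option is `functools.total_ordering`, which derives them from `__eq__` plus one ordering method. The class `OrderedStudent` below is a hypothetical variant used only for this sketch.

```python
from functools import total_ordering


@total_ordering
class OrderedStudent:
    def __init__(self, name: str, gpa: float) -> None:
        self.name = name
        self.gpa = gpa

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, OrderedStudent):
            return NotImplemented
        return self.name == other.name

    def __lt__(self, other: 'OrderedStudent') -> bool:
        return self.name < other.name


jack, jill = OrderedStudent('Jack', 3.7), OrderedStudent('Jill', 3.75)
print(jack < jill, jack >= jill)  # True False (>= is supplied by total_ordering)
print(max([jack, jill]).name)     # Jill
```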

### 13.1: Compute the Intersection of Two Sorted Arrays
e.g.: [2, 3, 3, 5, 5, 6, 7, 7, 8, 12] & [5, 5, 6, 8, 8, 9, 10, 10] -> [5, 6, 8]

In [15]:
def intersection_sorted_arrays_bf(A: List[int], B: List[int]) -> List[int]:
    return [a for i, a in enumerate(A) if (i == 0 or a != A[i - 1]) and a in B]  # prevent duplicates by checking with previous value

print(intersection_sorted_arrays_bf([2, 3, 3, 5, 5, 6, 7, 7, 8, 12], [5, 5, 6, 8, 8, 9, 10, 10]))


[5, 6, 8]


$O(nm)$ time complexity, because `a in B` performs a linear scan of `B` for each element of `A`
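Because `B` is sorted, the linear membership test can be swapped for a binary search. The sketch below uses the standard `bisect` module; the function name `intersection_sorted_arrays_bsearch` and the helper `contains` are hypothetical names introduced here.

```python
import bisect
from typing import List


def intersection_sorted_arrays_bsearch(A: List[int], B: List[int]) -> List[int]:
    def contains(arr: List[int], target: int) -> bool:
        # bisect_left returns the leftmost position where target could be inserted
        i = bisect.bisect_left(arr, target)
        return i < len(arr) and arr[i] == target

    # Skip repeated values in A, then binary-search each remaining value in B.
    return [
        a for i, a in enumerate(A)
        if (i == 0 or a != A[i - 1]) and contains(B, a)
    ]


print(intersection_sorted_arrays_bsearch(
    [2, 3, 3, 5, 5, 6, 7, 7, 8, 12], [5, 5, 6, 8, 8, 9, 10, 10]))  # [5, 6, 8]
```

This takes $O(n\log m)$ time, which is most attractive when one array is much shorter than the other and the shorter one is passed as `A`.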

In [24]:
def intersection_sorted_arrays(A: List[int], B: List[int]) -> List[int]:
    ''' 
    take advantage of the fact that both arrays are sorted
    '''
    i, j = 0, 0
    intersection = []

    while i < len(A) and j < len(B):  # once get to end of one list, elements cannot intersect 
        if A[i] == B[j]:
            if (i == 0 or A[i] != A[i - 1]): 
                intersection.append(A[i])
            i += 1
            j += 1
        elif A[i] < B[j]:
            i += 1
        else:         # A[i] > B[j]
            j += 1

    return intersection

print(intersection_sorted_arrays([2, 3, 3, 5, 5, 6, 7, 7, 8, 12], [5, 5, 6, 8, 8, 9, 10, 10]))


[5, 6, 8]


$O(n + m)$ time complexity

### 13.2: Merge Two Sorted Arrays
Assume the first array has enough empty spaces at the end for the elements of the second array   
e.g.: [3, 13, 17, _, _, _, _] & [3, 7, 11, 19] -> [3, 3, 7, 11, 13, 17, 19]

In [35]:
def merge_two_sorted_arrays(A: List[int], n: int, B: List[int], m: int) -> None:
    ''' 
    n, m: number of elements in the respective arrays
    updates array A in place
    '''
    a, b, write_index = n-1, m-1, n+m-1

    while a >= 0 and b >= 0:
        if A[a] > B[b]:
            A[write_index] = A[a]
            a -= 1
        else:
            A[write_index] = B[b]
            b -= 1

        write_index -= 1 

    # copy any remaining entries of B; remaining entries of A are already in place
    while b >= 0:
        A[write_index] = B[b]
        b -= 1
        write_index -= 1

A = [3, 13, 17, None, None, None, None]
merge_two_sorted_arrays(A, 3, [3, 7, 11, 19], 4)
print(A)

A = [3, 13, 17, None, None, None, None, None]
merge_two_sorted_arrays(A, 3, [3, 7, 11, 19, 20], 5)
print(A)

A = [3, 13, 17, None, None, None, None]
merge_two_sorted_arrays(A, 3, [7, 11, 19, 20], 4)
print(A)

[3, 3, 7, 11, 13, 17, 19]
[3, 3, 7, 11, 13, 17, 19, 20]
[3, 7, 11, 13, 17, 19, 20]


### 13.3: Calculate H-Index
The h-index is a metric that measures both the productivity and citation impact of a researcher. A researcher's h-index is the largest number $h$ such that the researcher has published $h$ papers that have each been cited at least $h$ times.

In [41]:
def calc_h_index_bf(citations: List[int]) -> int:
    h = 0
    # the h-index can never exceed the number of papers, so test 1..n inclusive
    for h_test in range(1, len(citations) + 1):
        # count the papers with at least h_test citations
        num_papers = sum(1 for c in citations if c >= h_test)
        if num_papers < h_test:
            break
        h = h_test
    return h

print(calc_h_index_bf([1, 4, 1, 4, 2, 1, 3, 5, 6]))

4


$O(n^2)$ time complexity

In [42]:
def calc_h_index(citations: List[int]) -> int:
    citations.sort()  # note: sorts the caller's list in place
    n = len(citations)
    for i, c in enumerate(citations):
        # the n - i papers from index i onward all have at least c citations
        if c >= n - i:
            return n - i
    return 0

print(calc_h_index([1, 4, 1, 4, 2, 1, 3, 5, 6]))

4


#### Variant: H-Index but can use additional space 

In [51]:
def calc_h_index_space(citations: List[int]) -> int:
    # map each citation count to the number of papers with exactly that count
    citation_paper_count: Dict[int, int] = Counter(citations)
    max_citations = max(citations, default=0)

    # walk down from the largest citation count, accumulating the number of
    # papers that have at least h citations
    papers_with_at_least_h = 0
    for h in reversed(range(max_citations + 1)):
        papers_with_at_least_h += citation_paper_count.get(h, 0)
        # check every h, not only values that appear in the input
        if papers_with_at_least_h >= h:
            return h
    return 0

print(calc_h_index_space([1, 4, 1, 4, 2, 1, 3, 5, 6]))

4


$O(n + c_{max})$ time, where $c_{max}$ is the largest citation count, and $O(n)$ space complexity
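One way to keep the running time linear in the number of papers, even when individual citation counts are very large, is to clamp every count to $n$, since the h-index can never exceed the number of papers. The sketch below assumes non-negative citation counts; `calc_h_index_buckets` is a hypothetical name, not part of the original notebook.

```python
from typing import List


def calc_h_index_buckets(citations: List[int]) -> int:
    n = len(citations)
    # buckets[c] = number of papers with exactly c citations, where any
    # count above n is clamped to n
    buckets = [0] * (n + 1)
    for c in citations:
        buckets[min(c, n)] += 1

    # Walk down from n, accumulating the papers with at least h citations.
    papers = 0
    for h in range(n, -1, -1):
        papers += buckets[h]
        if papers >= h:
            return h
    return 0


print(calc_h_index_buckets([1, 4, 1, 4, 2, 1, 3, 5, 6]))  # 4
```

Both loops run at most $n + 1$ times, so this variant takes $O(n)$ time and $O(n)$ space regardless of how large the citation counts are.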

### 13.4: Remove First Name Duplicates

In [58]:
class Name:
    def __init__(self, first: str, last: str) -> None:
        self.first = first 
        self.last = last 

    def __lt__(self, other: 'Name') -> bool:
        return self.first < other.first if self.first != other.first else self.last < other.last 

    def __eq__(self, other: 'Name') -> bool:
        return self.first == other.first

    def __str__(self) -> str:
        return f'{self.first} {self.last}'


def eliminate_duplicates(names: List[Name]) -> None:
    names.sort()
    write_index = 1

    for candidate in names[1:]:
        if candidate != names[write_index-1]:
            names[write_index] = candidate
            write_index += 1

    del names[write_index:]

names = [
    Name('Ian', 'Botham'),
    Name('David', 'Gower'),
    Name('Ian', 'Bell'),
    Name('Ian', 'Cambell')
]

eliminate_duplicates(names)
list(map(print, names))


David Gower
Ian Bell


[None, None]

$O(n\log n)$ time complexity

### 13.10: Team Photo Day
Two teams line up for a photo where team 0 is in the front row and team 1 is in the back row. A photo is possible if each person in the back row is taller than the person directly in front of them.

In [59]:
def valid_photo(team0: List[int], team1: List[int]) -> bool:
    return all(
        a < b for a, b in zip(sorted(team0), sorted(team1))
    )

valid_photo([1, 5, 2, 1, 0, 3, 1], [4, 2, 9, 4, 2, 8, 3])

True