# Hashing  
A hash table is a data structure used to store keys, optionally with corresponding values. Inserts, deletes, and lookups run in $O(1)$ time on average.

The underlying idea is to store keys in an array. A key is stored in an array location ("slot") based on its "hash code". The hash code is an integer computed from the key by a hash function. If the hash function is chosen well, the objects are distributed uniformly across array locations.

If two keys map to the same location, a "collision" has occurred. The standard mechanism to deal with collisions is to maintain a linked list of objects at each array location. If the hash function does a good job of spreading objects across the underlying array and takes $O(1)$ time to compute, then, on average, lookups, insertions, and deletions have $O(1 + n/m)$ time complexity, where $n$ is the number of objects and $m$ is the length of the array.

If the "load" $n/m$ grows large, the table can be rehashed into a larger array. Rehashing is expensive, taking $O(n + m)$ time, but its amortized cost is low if it is done infrequently (for example, each time the array size doubles).
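
To make the chaining and rehashing description concrete, here is a minimal sketch of a chained hash table. The class name `ChainedHashTable`, the `max_load` threshold, and the doubling policy are assumptions chosen for illustration, and Python lists stand in for the linked lists at each slot.

```python
from typing import Any, Hashable, List, Optional, Tuple


class ChainedHashTable:
    """Toy hash table with separate chaining (illustrative, not production code)."""

    def __init__(self, num_slots: int = 8, max_load: float = 2.0) -> None:
        # one chain per slot; a Python list stands in for a linked list
        self._slots: List[List[Tuple[Hashable, Any]]] = [[] for _ in range(num_slots)]
        self._size = 0
        self._max_load = max_load  # assumed threshold on the load n/m

    def _chain(self, key: Hashable) -> List[Tuple[Hashable, Any]]:
        # the hash code picks the slot
        return self._slots[hash(key) % len(self._slots)]

    def insert(self, key: Hashable, value: Any) -> None:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:  # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self._size += 1
        if self._size / len(self._slots) > self._max_load:
            self._rehash()

    def lookup(self, key: Hashable) -> Optional[Any]:
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def _rehash(self) -> None:
        # O(n + m): visit every slot and re-insert every item into a doubled array
        old_items = [item for chain in self._slots for item in chain]
        self._slots = [[] for _ in range(2 * len(self._slots))]
        for k, v in old_items:
            self._chain(k).append((k, v))
```

Doubling the array on each rehash keeps the load bounded, so the occasional $O(n + m)$ rehash is spread over many cheap inserts.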


In [2]:
from collections import Counter, defaultdict, namedtuple, OrderedDict
import functools
from typing import DefaultDict, Dict, List, Set

from utils import run_tests

## Tips
- Hash tables have the **best theoretical and real-world performance** for lookup, insert and delete. Each of these operations has $O(1)$ time complexity on average. For insertions this is an average-case bound: a single insert can take $O(n)$ if the hash table has to be resized.
- Consider using a hash code as a **signature** to enhance performance, e.g., to filter out candidates.  
- Consider using a precomputed lookup table instead of boilerplate if-then code for mappings, e.g., from character to value or character to character.
- When defining your own type that will be put in a hash table, be sure you understand the relationship between **logical equality** and the fields the hash function must inspect. Specifically, whenever equality is implemented, it is imperative that a matching hash function is also implemented. Otherwise, when objects are placed in hash tables, logically equivalent objects may land in different buckets, so lookups can return false even when the searched item is present.
- Sometimes you'll need a **multimap**, i.e., a map that contains multiple values for a single key, or a bi-directional map.
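
Two of the tips above can be sketched briefly. The data and the names `fruit_to_colors`, `HEX_DIGIT_VALUE`, and `hex_to_int` are made up for illustration: a `defaultdict(list)` acts as a simple multimap, and a precomputed dictionary replaces an if-then chain for a character-to-value mapping.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List

# Multimap: each key maps to a list of values.
fruit_to_colors: DefaultDict[str, List[str]] = defaultdict(list)
for fruit, color in [('apple', 'red'), ('banana', 'yellow'), ('apple', 'green')]:
    fruit_to_colors[fruit].append(color)

# Precomputed lookup table from character to value, built once.
HEX_DIGIT_VALUE: Dict[str, int] = {c: i for i, c in enumerate('0123456789abcdef')}


def hex_to_int(s: str) -> int:
    value = 0
    for c in s.lower():
        # one dictionary lookup replaces a 16-way if/elif chain
        value = value * 16 + HEX_DIGIT_VALUE[c]
    return value
```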

## Libraries

### String Hash Function

In [7]:
def string_hash(s: str, modulus: int) -> int:
    mult = 997
    # reduce the running value modulo `modulus` so the result lies in [0, modulus)
    return functools.reduce(lambda v, c: (v * mult + ord(c)) % modulus, s, 0)

print(string_hash('cat', 10))
print(string_hash('cats', 10))

6
7


### Finding Anagrams
An anagram is a word formed by rearranging the letters of another word.   
Given a set of words, return groups of anagrams of these words.   

In [11]:
def find_anagrams(words: List[str]) -> List[List[str]]:
    '''
    The key idea is to map each string to a representative.
    The representative can be the sorted version of the string, since
    anagrams have the same sorted representation.
    '''
    sorted_string_to_anagram: DefaultDict[str, List[str]] = defaultdict(list)

    for w in words:
        w_sorted = ''.join(sorted(w))       # sorted returns a list of characters
        sorted_string_to_anagram[w_sorted].append(w)

    return [
        group for group in sorted_string_to_anagram.values() if len(group) >= 2
    ]

words = ['debitcard', 'elvis', 'silent', 'badcredit', 'lives', 'freedom', 'listen', 'levis', 'money']
find_anagrams(words)

[['debitcard', 'badcredit'], ['elvis', 'lives', 'levis'], ['silent', 'listen']]

$O(nm\log m)$ time complexity, where $n$ is the number of words and $m$ is the maximum word length

#### Variant: 
Design an $O(nm)$ algorithm
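
One possible approach to the variant, sketched under the assumption that every word consists only of lowercase letters `a` to `z`: use the tuple of letter counts as the representative instead of the sorted string. Building the counts takes $O(m)$ per word, so the total is $O(nm)$ for a fixed alphabet. The function name `find_anagrams_by_counts` is hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def find_anagrams_by_counts(words: List[str]) -> List[List[str]]:
    counts_to_anagrams: DefaultDict[Tuple[int, ...], List[str]] = defaultdict(list)

    for w in words:
        counts = [0] * 26           # assumption: only lowercase a-z occur
        for c in w:
            counts[ord(c) - ord('a')] += 1
        # anagrams share the same letter counts, hence the same key
        counts_to_anagrams[tuple(counts)].append(w)

    return [group for group in counts_to_anagrams.values() if len(group) >= 2]
```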

### Designing a Hashable Class

In [12]:
class ContactList:
    def __init__(self, names: List[str]):
        self.names = names 

    def __hash__(self) -> int:
        # conceptually we want to hash the set of names.
        # since the set type is mutable, it cannot be hashed;
        # therefore use a frozenset
        return hash(frozenset(self.names))

    def __eq__(self, other) -> bool:
        return set(self.names) == set(other.names)


def merge_contact_lists(contacts: List[ContactList]) -> List[ContactList]:
    # converting to a set removes duplicates, relying on __hash__ and __eq__ above
    return list(set(contacts))

Hash codes are often cached for performance, with the caveat that the cache must be cleared whenever a field referenced by the hash function is updated.   
The result of the equality check could also be cached.
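
A minimal sketch of hash caching with invalidation, under the assumption that all mutation goes through a single method. The class `CachedHashContactList` and its `add_name` method are hypothetical and exist only for this illustration.

```python
from typing import List, Optional


class CachedHashContactList:
    def __init__(self, names: List[str]) -> None:
        self._names = list(names)
        self._cached_hash: Optional[int] = None

    def add_name(self, name: str) -> None:
        self._names.append(name)
        # a field used by __hash__ changed, so the cached value is stale
        self._cached_hash = None

    def __hash__(self) -> int:
        if self._cached_hash is None:
            self._cached_hash = hash(frozenset(self._names))
        return self._cached_hash

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, CachedHashContactList):
            return NotImplemented
        return set(self._names) == set(other._names)
```

Clearing the cache keeps `__hash__` consistent with `__eq__`, but an object that is mutated while it is already stored in a set or dict will still sit in the bucket chosen by its old hash, so such objects should be removed before being changed.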

### 12.1: Test for Palindromic Permutations
Test whether letters in a word can be permuted to form a palindrome.   
e.g., "edified" can be permuted to form "deified"

In [13]:
def is_string_permutable_to_palindrome(s: str) -> bool:
    ''' 
    in a palindrome every character occurs an even number of times (characters match in pairs),
    except that at most one character may have an odd count (the middle one)
    '''
    character_counts = Counter(s)
    return sum(c % 2 for c in character_counts.values()) <= 1

print(is_string_permutable_to_palindrome('edified'))
print(is_string_permutable_to_palindrome('cat'))

True
False


$O(n)$ time complexity and $O(c)$ space complexity, where $c$ is the number of distinct characters in the string

### 12.2: Is an Anonymous Letter Constructible
Given the text for an anonymous letter and the text of a magazine, check whether the letter could be constructed from the characters in the magazine.

In [None]:
def is_letter_constructible_from_magazine(letter: str, magazine: str) -> bool:

    # compute frequencies for characters in letter
    char_letter_freq = Counter(letter)

    for c in magazine:
        if c in char_letter_freq:
            char_letter_freq[c] -= 1
            if char_letter_freq[c] == 0:
                del char_letter_freq[c]
                if not char_letter_freq:
                    # all characters in letter matched
                    return True
                    
    # empty dict means every character in letter can
    # be matched to a character in magazine
    return not char_letter_freq
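
As a compact alternative, the same check can be sketched with `Counter` subtraction, which keeps only positive counts. This version is an illustration rather than the approach above: it always scans the whole magazine, so it gives up the early exit. The function name is hypothetical.

```python
from collections import Counter


def is_constructible_with_counter_subtraction(letter: str, magazine: str) -> bool:
    # Counter subtraction drops zero and negative counts, so the difference
    # is empty exactly when the magazine supplies every character needed
    return not (Counter(letter) - Counter(magazine))
```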

### 12.3: Implement an ISBN Cache

In [None]:
class LruCache:
    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._isbn_price_table: OrderedDict[int, int] = OrderedDict() 

    def lookup(self, isbn: int) -> int:
        if isbn not in self._isbn_price_table:
            return -1
        
        # isbn was just accessed, so re-insert it at the end (most recently used position)
        price = self._isbn_price_table.pop(isbn)
        self._isbn_price_table[isbn] = price
        return price

    def insert(self, isbn: int, price: int) -> None:
        if isbn in self._isbn_price_table:
            # isbn already cached: keep its existing price, only refresh its recency
            price = self._isbn_price_table.pop(isbn)
        elif len(self._isbn_price_table) == self._capacity:
            # cache is full: evict the least recently used entry
            self._isbn_price_table.popitem(last=False)
        self._isbn_price_table[isbn] = price
    
    def erase(self, isbn: int) -> bool:
        return self._isbn_price_table.pop(isbn, None) is not None

$O(1)$ time complexity for each operation

### 12.4: Compute the LCA, Optimizing for Close Ancestors

### 12.5: Find the Nearest Repeated Entry in an Array

In [18]:
def nearest_repeated_entries(paragraph: str) -> int:
    words = paragraph.lower().split(' ')

    word_to_latest_index: Dict[str, int] = {}
    nearest_repeated_distance = float('inf')

    for i, word in enumerate(words):
        if word in word_to_latest_index:
            nearest_repeated_distance = min(nearest_repeated_distance, i - word_to_latest_index[word])
        word_to_latest_index[word] = i

    # -1 signals that no word is repeated
    return int(nearest_repeated_distance) if nearest_repeated_distance != float('inf') else -1

paragraph = 'All work and no play makes for no work no fun and no results'
print(nearest_repeated_entries(paragraph))
paragraph = 'cat in the hat'
print(nearest_repeated_entries(paragraph))
paragraph = 'cat in the hat is still a cat'
print(nearest_repeated_entries(paragraph))

2
-1
7


### 12.6: Find the Smallest Subarray Covering All Values
Given a list of strings and a set of keywords, find the shortest subarray of the list that contains every keyword.  
e.g. ['apple', 'banana', 'apple', 'apple', 'dog', 'cat', 'apple', 'dog', 'banana', 'apple', 'cat', 'dog'], ['banana', 'cat'] -> (8, 10)

In [4]:
Interval = namedtuple('Interval', ('start', 'end'))

def find_smallest_subarray_covering_set(paragraph: List[str], keywords: Set[str]) -> Interval:
    ''' 
    advance the right pointer until the window covers the set, then advance the left pointer to look for a smaller covering window
    '''
    keywords_to_cover = Counter(keywords)
    result = Interval(-1, -1)
    remaining_to_cover = len(keywords)
    left = 0

    for right, word in enumerate(paragraph):
        if word in keywords:
            keywords_to_cover[word] -= 1
            if keywords_to_cover[word] >= 0:      # a negative count means the keyword already appears more than once in the window
                remaining_to_cover -= 1

        # advance left until the window no longer
        # contains all keywords
        while remaining_to_cover == 0:
            if result == Interval(-1, -1) or right - left < result.end - result.start:
                result = Interval(start=left, end=right)
            word_left = paragraph[left]
            if word_left in keywords:
                keywords_to_cover[word_left] += 1
                if keywords_to_cover[word_left] > 0:    # positive count: the window no longer contains this keyword
                    remaining_to_cover += 1
            left += 1
    
    return result

find_smallest_subarray_covering_set(['apple', 'banana', 'apple', 'apple', 'dog', 'cat', 'apple', 'dog', 'banana', 'apple', 'cat', 'dog'], ['banana', 'cat'])

Interval(start=8, end=10)

Time complexity is $O(n)$, where $n$ is the length of the array, since we spend $O(1)$ time per advance of each of the two indices, and each index is advanced at most $n-1$ times.

#### Variant 12.6.A:

#### Variant 12.6.B:

#### Variant 12.6.C:

### 12.7: Find the Smallest Subarray Sequentially Covering All Values

Processing each entry of the paragraph array entails a constant number of lookups and updates, leading to an $O(n)$ time complexity, where $n$ is the length of the paragraph array. The additional space complexity is dominated by the three hash tables, i.e., $O(m)$, where $m$ is the number of keywords
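
No code accompanies this problem here, so the following is a sketch of one way to get the stated $O(n)$ bound, assuming the keywords are distinct and must appear in the given order. It uses one dictionary plus two lists indexed by keyword position in place of three hash tables. The function name and the `Interval` tuple defined in the block are assumptions for illustration.

```python
from collections import namedtuple
from typing import List

Interval = namedtuple('Interval', ('start', 'end'))


def find_smallest_sequentially_covering_subarray(paragraph: List[str],
                                                 keywords: List[str]) -> Interval:
    keyword_to_idx = {k: i for i, k in enumerate(keywords)}
    # latest_occurrence[j]: index of the most recent occurrence of keywords[j]
    latest_occurrence = [-1] * len(keywords)
    # shortest_len[j]: length of the shortest subarray ending at the latest
    # occurrence of keywords[j] that covers keywords[0..j] in order
    shortest_len = [float('inf')] * len(keywords)

    best = float('inf')
    result = Interval(-1, -1)
    for i, word in enumerate(paragraph):
        if word not in keyword_to_idx:
            continue
        j = keyword_to_idx[word]
        if j == 0:
            shortest_len[0] = 1
        elif shortest_len[j - 1] != float('inf'):
            # extend the best cover of keywords[0..j-1] up to position i
            shortest_len[j] = i - latest_occurrence[j - 1] + shortest_len[j - 1]
        latest_occurrence[j] = i

        if j == len(keywords) - 1 and shortest_len[j] < best:
            best = shortest_len[j]
            result = Interval(i - best + 1, i)
    return result
```

`Interval(-1, -1)` is returned when no subarray covers the keywords in order.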

### 12.8: Find the Longest Subarray with Distinct Entries

In [19]:
def longest_distinct_subarray(elements: List[str]) -> int:
    element_to_last_occurrence: Dict[str, int] = {}
    max_subarray_len = 0
    start = 0   # start of the current duplicate-free window

    for i, s in enumerate(elements):
        if s in element_to_last_occurrence:
            dup_idx = element_to_last_occurrence[s]
            # only a duplicate inside the current window ends it;
            # an older occurrence must not move start backwards
            if dup_idx >= start:
                max_subarray_len = max(max_subarray_len, i - start)
                start = dup_idx + 1
        element_to_last_occurrence[s] = i

    # check the window that runs to the end of the list (0 for an empty list)
    return max(max_subarray_len, len(elements) - start)


inputs = (['f', 's', 'f', 'e', 't', 'w', 'e', 'n', 'w', 'e'], ['f', 's', 'f', 'e', 't', 'w', 'x', 'n', 'w', 'e'], ['f', 'f'], ['f', 's'], ['a', 'b', 'c', 'd'])
outputs = [5, 7, 1, 2, 4]
run_tests(longest_distinct_subarray, inputs, outputs)

$O(n)$ time complexity, since a constant number of operations is performed per element

### 12.9: Find the Length of a Longest Contained Interval
Write a program which takes as input a set of integers represented by an array, and returns the size of a largest subset of integers in the array having the property that if two integers are in the subset, then so are all integers between them.   
e.g.: [3, -2, 7, 9, 8, 1, 2, 0, -1, 5, 8] -> 6, since the largest such subset is [-2, -1, 0, 1, 2, 3]

In [27]:
def longest_contained_interval(A: List[int]) -> int:

    unprocessed_entries = set(A)
    max_interval_len = 0

    while unprocessed_entries:
        a = unprocessed_entries.pop()

        # find the lower bound of the largest range containing a
        lower_bound = a - 1
        while lower_bound in unprocessed_entries:
            unprocessed_entries.remove(lower_bound)
            lower_bound -= 1

        # find upper bound of the largest range containing a
        upper_bound = a + 1
        while upper_bound in unprocessed_entries:
            unprocessed_entries.remove(upper_bound)
            upper_bound += 1

        max_interval_len = max([max_interval_len, upper_bound - lower_bound - 1])

    return max_interval_len

inputs = ([3, -2, 7, 9, 8, 1, 2, 0, -1, 5, 8], [3, -2, 7, 9, 8, 1, 2, 0, -1, 5, 8, 4], [1, 2, 3], [1, 3, 5, 10], [1, 8, 3, 9, 2, 7], [1, 8, 3, 9, 2, 7, 10], [1])
outputs = [6, 8, 3, 1, 3, 4, 1]
run_tests(longest_contained_interval, inputs, outputs)

$O(n)$ time complexity, where $n$ is the array length, since each array element is added to and removed from the hash table at most once

### 12.10: Compute All String Decompositions
Given a string (the sentence) and a list of equal-length strings (the words), find every substring of the sentence that is a concatenation of all the words. The list of words can contain duplicates. Each word must be used exactly as many times as it appears in the list, and the order of the words does not matter. Return a list of the start indices of all such substrings.   
e.g.: 'amanaplanacanal', ['can', 'apl', 'ana'] -> [4], since "aplanacan" starts at index 4

In [29]:
def find_all_substrings(sentence: str, words: List[str]) -> List[int]:
    def match_all_words_in_dict(start: int) -> bool:
        curr_string_freq = Counter()
        for i in range(start, start + len(words) * unit_size, unit_size):
            curr_word = sentence[i:(i + unit_size)]
            freq = word_to_freq[curr_word]
            if freq == 0:     # if not in words 
                return False 
            curr_string_freq[curr_word] += 1
            if curr_string_freq[curr_word] > freq:  # curr_word occurs too many times for a match to be possible
                return False 
        return True

    word_to_freq = Counter(words)
    unit_size = len(words[0])   # all words have the same length
    return [
        i for i in range(len(sentence) - unit_size * len(words) + 1) if match_all_words_in_dict(i)
    ]

find_all_substrings('amanaplanacanal', ['can', 'apl', 'ana'])

[4]

Let $m$ be the number of words and $n$ the length of each word. Let $N$ be the length of the sentence. For any fixed $i$, checking whether the string of length $nm$ starting at offset $i$ in the sentence is a concatenation of all the words takes $O(nm)$ time, assuming a hash table is used to store the set of words. This implies the overall time complexity is $O(Nnm)$. In practice, the checks are likely much faster, since each one stops as soon as a mismatch is detected.

### 12.11: Collatz Conjecture
Take any natural number. If it is odd, triple it and add one. If it is even, halve it. The Collatz conjecture states that repeating this process always reaches 1. Write a program to test the Collatz conjecture for the first $n$ integers.

In [8]:
def test_collatz_conjecture(n: int) -> bool:
    '''
    - Reuse computations by storing all the numbers already proved to converge to 1
    - To save time, skip even numbers (halving one immediately gives a smaller number)
    - If every number up to k has been tested, stop as soon as the sequence reaches a number less than k.
      Such numbers do not need to be saved in the hash table either
    '''
    verified_numbers = set()

    # 1 and 2 need no check; even starting values are skipped
    for i in range(3, n + 1, 2):
        sequence = set()
        num = i

        while num >= i:     # if num is less than i, it has already been verified

            # a repeated value means the sequence is stuck in a cycle
            if num in sequence:
                print(i)
                return False
            sequence.add(num)

            # update num
            if num % 2:     # odd
                if num in verified_numbers:
                    break
                verified_numbers.add(num)
                num = num * 3 + 1
            else:
                num //= 2   # integer division keeps num an int

    return True
            
test_collatz_conjecture(100000)

Not much can be said about the time complexity beyond the obvious, namely that it is at least proportional to $n$

### 12.12: Implement a Hash Function for Chess