# [CS Series] Lecture 6: Hashing & Priority Queues

## May 8, 2022

### Hosted and maintained by the [Student Association for Applied Statistics (SAAS)](https://saas.berkeley.edu).
Created by Akhil Vemuri

### Table of Contents
1. [Hashing](#hash)
    1. [Arrays](#hash-arr)
    2. [Dictionaries / Hash Tables](#hash-dict)
    3. [Sets](#hash-set)
2. [Priority Queues](#pq)
    1. [Heaps](#heap)
    2. [Queues](#q)
    3. [Heaps + Queues](#heap+q)
3. [Conclusion](#conc)
4. [References](#ref)

<a id='hash'></a>
## Hashing

<a id='hash-arr'></a>
### Arrays

One of the most common data structures used in Python is the array, which Python exposes as the built-in ```list```. Python abstracts away many of an array's details, but its operations still have time complexities, just like those of any other class or function. The two operations we'll look at are ```get()``` and ```pop()```.

```get(i)``` retrieves the element at index $i$ in the array (written ```arr[i]``` in Python).

```pop(i)``` deletes the element at index $i$ in the array (written ```arr.pop(i)``` in Python).

Digging deeper into the Python docs, we can find out that ```get(i)``` is a constant $O(1)$ time operation, whereas ```pop(i)``` is a linear $O(n)$ time operation. But why is there a discrepancy? We would think it's quite simple to delete an element; we simply take it out of the array. It has to do with how arrays are laid out in memory. An array is a contiguous block of memory: its elements are stored in adjacent slots, one after another, with no gaps allowed. This means that if an element is deleted, we have to close the gap by shifting every element after index $i$ one slot to the left. In the worst case (deleting the first element), that is $n - 1$ shifts, which is why ```pop(i)``` is $O(n)$.

![alt text](https://beginnersbook.com/wp-content/uploads/2018/10/array.jpg)
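To make the cost of closing the gap concrete, here is a minimal sketch of deletion from a contiguous array. The helper name `pop_at` is hypothetical and exists only for illustration; Python's built-in `list.pop(i)` does the equivalent shifting internally.

```python
def pop_at(arr, i):
    """Remove and return arr[i], shifting every later element left by one."""
    removed = arr[i]                  # O(1): jump straight to slot i
    for j in range(i, len(arr) - 1):
        arr[j] = arr[j + 1]           # O(n): close the gap one slot at a time
    arr.pop()                         # drop the duplicated last slot, O(1)
    return removed


animals = ["dog", "cat", "bird", "fish"]
removed = pop_at(animals, 1)
print(removed, animals)  # cat ['dog', 'bird', 'fish']
```

The loop runs once for every element after index $i$, so deleting near the front of a long array is the expensive case.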

Another important array operation, ```contains()```, also suffers from poor time complexity. ```contains(val)``` checks whether the array contains element $\text{val}$ (written ```val in arr``` in Python). This takes $O(n)$ time in the worst case, since we may have to scan through every element of the array to check whether $\text{val}$ is inside.
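The linear scan can be sketched as follows. The function name `contains` mirrors the operation named above and is written out only for illustration; in real Python code you would use the `in` operator, which performs the same scan on a list.

```python
def contains(arr, val):
    """Return True if val appears anywhere in arr, checking one slot at a time."""
    for element in arr:
        if element == val:
            return True   # found it, stop early
    return False          # looked at all n elements without a match


animals = ["dog", "cat", "bird"]
found_bird = contains(animals, "bird")   # scans all 3 elements
found_fish = contains(animals, "fish")   # scans all 3 elements, finds nothing
print(found_bird, found_fish)  # True False
```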

This is a gap in the design of arrays, and other data structures have since been created as alternatives to solve some of these issues. In this lecture, we'll be looking at alternatives that use hashing as a backbone.

<a id='hash-dict'></a>
### Dictionaries / Hash Tables

A dictionary, commonly implemented as a hash table, is a data structure that maintains a mapping from keys to values. For instance, I could store $\{(\text{dog}, 2), \, (\text{cat}, 5), \, (\text{bird}, 3) \}$ inside one dictionary. The main benefit of using a dictionary is its $O(1)$ ```contains()``` operation, as well as its $O(1)$ ```insert()``` and ```pop()``` operations (all on average). However, it does give up an array's positional ordering: you can no longer ask for the element at index $i$. All these benefits are due to hashing.
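As a quick illustration, here is the example dictionary above in Python's built-in `dict`. The variable names are made up for this sketch.

```python
# The example mapping from the text: animal name -> count.
pets = {"dog": 2, "cat": 5, "bird": 3}

has_dog = "dog" in pets        # contains(): average O(1)
has_fish = "fish" in pets

pets["fish"] = 7               # insert(): average O(1)
cat_count = pets.pop("cat")    # pop(): average O(1), returns the removed value

print(has_dog, has_fish, cat_count)  # True False 5
print(pets)                          # {'dog': 2, 'bird': 3, 'fish': 7}
```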

*Hashing* is the process of mapping data to a representative integer value using a hashing algorithm. In Java, every object has an integer hash code, returned by its ```hashCode()``` method. In Python, the built-in ```hash()``` function returns the hash value of an object, an integer that is used to quickly compare dictionary keys during a dictionary lookup. So hashing exists everywhere.
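A short sketch of Python's built-in `hash()` in action. The variable names are illustrative. Note that string hashes are randomized per process by default, so the sketch compares hashes rather than printing specific values.

```python
import sys

# The same key always hashes to the same value within one run of the program.
same_key_same_hash = hash("dog") == hash("dog")

# Keys that compare equal must also hash equal (1 == 1.0 in Python).
equal_keys_equal_hash = hash(1) == hash(1.0)

# Mutable built-ins such as lists are unhashable, so they cannot be dict keys.
try:
    hash(["dog", "cat"])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Hash values have a fixed bit width on a given platform.
hash_bits = sys.hash_info.width

print(same_key_same_hash, equal_keys_equal_hash, list_is_hashable)  # True True False
```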

There are two properties we would like to have for hashing to work:

**1. Each key should have a unique hashcode.**

**2. The same key should have the same hashcode.**

This is because a key's hashcode determines where that key lives in the dictionary, and we look up values based on a key's hash.

For instance, let $\text{hash(dog)} = 1$. Now, if we wanted to check whether our dictionary contained a $\text{dog}$ element, we would simply check if bucket 1 contains $\text{dog}$. Rather than linearly searching through all $n$ elements, as we would in an array, we already know where a $\text{dog}$ element would be if it existed, which is what makes the ```contains()``` operation for dictionaries so quick.

We do run into a problem with the first property, however. Hash values are fixed-size integers, so there are only finitely many of them, while there is no limit on the number of distinct objects we might want to hash. We can't possibly assign every single object a different hashcode. For example, if $\text{hash(dog)} = 1$ and $\text{hash(cat)} = 1$, then both keys map to the same bucket. This follows from *The Pigeonhole Principle*, and it essentially means that **collisions are inevitable.** We combat this by chaining keys with the same hashcode into the same bucket.

<div> <img src="hashing_pqs_images/hashing-chaining.png" width="450"/> </div>

Another observation is that the worst case runtime of ```contains()``` is proportional to the length of the longest chain, not to the number of buckets. Keeping one bucket per possible hashcode would waste an enormous amount of memory, since almost all of those buckets would sit empty. So if we can reduce the number of buckets and keep the length of the longest chain about the same, then we'll have saved a bunch of memory. We achieve this by placing each item in bucket $\text{hash(key)} \, \% \, m$, where $m$ is the number of buckets.

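The final result is a dictionary / hash table with *amortized* $O(1)^*$ runtime, provided the keys are spread evenly across the buckets. There are also other optimizations that can be made, such as resizing the number of buckets relative to the number of items, but this is the basic gist of how a hash table works. (CPython's built-in ```dict``` resolves collisions with open addressing rather than chaining, but the bucket idea is the same.)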
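Putting chaining and the modulus together, a hash table can be sketched as below. This is a teaching sketch under stated assumptions, not how Python's `dict` is implemented: the class name `ChainedHashTable` and its method names are hypothetical, the bucket count is fixed (no resizing), and small integer keys are used because in CPython they hash to themselves, which makes the collision predictable.

```python
class ChainedHashTable:
    """Toy hash table: m buckets, each bucket is a list (chain) of (key, value)."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # hash(key) % m picks the bucket; the result is always in [0, m).
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:       # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: add to the end of the chain

    def contains(self, key):
        # Only one chain is scanned, never the whole table.
        return any(existing_key == key for existing_key, _ in self._bucket(key))

    def get(self, key):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable(num_buckets=4)
table.insert(1, "dog")
table.insert(5, "cat")     # 5 % 4 == 1, so this collides with key 1
table.insert(2, "bird")

print(table.buckets[1])    # [(1, 'dog'), (5, 'cat')]
print(table.contains(5), table.contains(9), table.get(2))  # True False bird
```

Keys 1 and 5 share bucket 1, so looking up either one scans a chain of length two, while every other bucket stays short.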

<div> <img src="hashing_pqs_images/hashing-modulus.png" width="350"/> </div>

The final missing component is how we determine our ```hash()``` function. In Python (and most other languages), this is built-in. But creating a good hash function on your own is actually quite difficult. The challenge is that we are bound to have collisions, but we still want our items spread evenly enough that no single bucket ends up with a long chain. This all depends on the hash function used. Luckily, we don't have to write one ourselves, but there is a whole field of study behind the properties of good hash functions and how to ensure they work well in practice. For now, we'll just trust the ones given to us.
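To see why the choice of hash function matters, the sketch below compares two hand-written hash functions on the same words. Both functions and the helper `longest_chain` are hypothetical names for illustration: `bad_hash` uses only the word length, while `poly_hash` is a simple polynomial hash with multiplier 31 (the same general scheme Java uses for strings). Neither is Python's built-in string hash.

```python
from collections import Counter


def bad_hash(word):
    """Poor hash: every word of the same length collides."""
    return len(word)


def poly_hash(word):
    """Polynomial hash: mixes in every character and its position."""
    h = 0
    for ch in word:
        h = h * 31 + ord(ch)
    return h


def longest_chain(words, hash_fn, num_buckets):
    """Length of the longest chain when words are placed via hash % m."""
    counts = Counter(hash_fn(w) % num_buckets for w in words)
    return max(counts.values())


words = ["dog", "cat", "bat", "cow", "pig", "hen", "fox", "owl"]

bad_longest = longest_chain(words, bad_hash, 8)    # all 8 words share one bucket
poly_longest = longest_chain(words, poly_hash, 8)  # words spread over several buckets
print(bad_longest, poly_longest)
```

With `bad_hash`, every lookup degrades into a linear scan of one chain, which is exactly the array behavior hashing was meant to avoid.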

<a id='hash-set'></a>
### Sets

The other data structure closely related to dictionaries is the set. Sets also use hashing for fast look-up operations, but instead of holding a (key, value) mapping, they simply hold keys. In a mathematical sense, the set of the first 5 positive integers is $\{1, 2, 3, 4, 5\}$. A set in a data structures sense is actually quite similar, except its purpose is the same as that of dictionaries: $O(1)$ ```contains()```, ```insert()```, and ```pop()``` operations.
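Here is the same example with Python's built-in `set`. The variable names are illustrative. Note that Python spells the operations `in`, `add()`, and `remove()`; `set.pop()` also exists, but it removes an arbitrary element rather than a specific one.

```python
first_five = {1, 2, 3, 4, 5}

has_three = 3 in first_five    # contains(): average O(1)

first_five.add(6)              # insert(): average O(1)
first_five.add(3)              # adding a duplicate key changes nothing
size_after_adds = len(first_five)

first_five.remove(1)           # delete a specific key: average O(1)
has_one = 1 in first_five

print(has_three, size_after_adds, has_one)  # True 6 False
print(sorted(first_five))                   # [2, 3, 4, 5, 6]
```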

<hr \>

<a id='pq'></a>
## Priority Queues

<a id='heap'></a>
### Heaps

Before we can talk about priority queues, we have to talk about heaps. A heap is a specialized tree-based data structure in which the tree is a complete binary tree that satisfies the *heap property*. The heap property states that in a *min heap*, any parent node $P$ must be less than or equal to each of its child nodes $C$. Correspondingly, in a *max heap*, any parent node $P$ must be greater than or equal to each of its child nodes $C$. The node at the top of the heap is the *root* node, and as a result of the heap property, it must contain the min or max element of the heap. This means that heaps are great data structures for implementations that require quick access to the min or max element.

**Note:** Heaps are different from binary search trees (BSTs). A heap only constrains each parent relative to its own children and imposes no ordering between the left and right subtrees, whereas a BST requires the *entire* left subtree to be smaller than the root and the *entire* right subtree to be greater than the root.
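The heap property can be checked directly in code. The sketch below uses Python's standard `heapq` module, which stores a min heap in a plain list where the children of index $i$ sit at indices $2i + 1$ and $2i + 2$. The helper `is_min_heap` is a hypothetical name written for this illustration.

```python
import heapq


def is_min_heap(arr):
    """Check that every parent is <= each of its children in the list layout."""
    n = len(arr)
    for parent in range(n):
        for child in (2 * parent + 1, 2 * parent + 2):
            if child < n and arr[parent] > arr[child]:
                return False
    return True


values = [19, 3, 17, 1, 25, 7, 100, 2, 36]
was_heap_before = is_min_heap(values)   # 19 > 3, so the property is violated

heapq.heapify(values)                   # rearrange in place into a min heap, O(n)
is_heap_after = is_min_heap(values)
smallest = values[0]                    # the root always holds the minimum

print(was_heap_before, is_heap_after, smallest)  # False True 1
```

The list is not fully sorted after `heapify`; only the parent-child relationships are guaranteed, which is all that quick access to the minimum requires.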

<div> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Max-Heap.svg/800px-Max-Heap.svg.png" width="400"/> </div>

<a id='q'></a>
### Queues

The other major component of priority queues is, unsurprisingly, a queue. A *queue* is a data structure similar to an array, where elements are inserted (enqueued) at the back and popped (dequeued) from the front, just like a queue in real life. Usually we want to use queues when our data needs to support fast inserts and deletes in first-in, first-out order. Notice that a queue has $O(1)$ ```insert()``` and ```pop()``` operations due to the restrictions on its operations as well as how it's implemented in certain programming languages (we won't go into the details here).
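In Python, one common way to get a queue with $O(1)$ operations at both ends is `collections.deque`. The sketch below is one possible choice, with made-up names. A plain list would also work, but `list.pop(0)` has to shift every remaining element, which brings back the $O(n)$ array cost.

```python
from collections import deque

line = deque()

# Enqueue at the back.
line.append("alice")
line.append("bob")
line.append("carol")

# Dequeue from the front: first in, first out.
first_served = line.popleft()

print(first_served, list(line))  # alice ['bob', 'carol']
```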

<div> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/Data_Queue.svg/1200px-Data_Queue.svg.png" width="400"/> </div>

<a id='heap+q'></a>
### Heaps + Queues = 🔥

Combining the two ideas, we get the priority queue. A *priority queue (PQ)* is an abstract data type similar to a regular queue or stack in which each element additionally has a "priority" associated with it. In a priority queue, an element with high priority is served before an element with low priority. In most cases, we deal with min or max PQs.

For instance, ```add(item)``` would add $\text{item}$ to the PQ. In a min PQ, ```get_smallest()``` would return the smallest element in the PQ.

**Note:** though heaps and PQs are similar, they are not the same! A heap is a **data structure.** It is a name for a particular way of storing data that makes certain operations very efficient. A priority queue is an **abstract data type.** It is a shorthand way of describing a particular interface and behavior, and says nothing about the underlying implementation.

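At the end of the day, we can represent a priority queue using a heap, which happens to solve our problem of making a "weighted" queue: the highest-priority element always sits at the root, and adding or removing an element takes only $O(\log n)$ time.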
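A min PQ backed by a heap can be sketched in a few lines with the standard `heapq` module. The class name `MinPQ` is hypothetical; `add` and `get_smallest` follow the operation names used above, and `remove_smallest` is an extra method added here as an assumption about what a full interface would include.

```python
import heapq


class MinPQ:
    """Minimal min priority queue backed by a binary heap stored in a list."""

    def __init__(self):
        self._heap = []

    def add(self, item):
        heapq.heappush(self._heap, item)   # O(log n): bubble the item up

    def get_smallest(self):
        return self._heap[0]               # O(1): the root is the minimum

    def remove_smallest(self):
        return heapq.heappop(self._heap)   # O(log n): remove root, restore heap

    def __len__(self):
        return len(self._heap)


pq = MinPQ()
for priority in [5, 1, 3]:
    pq.add(priority)

peeked = pq.get_smallest()
served = [pq.remove_smallest() for _ in range(len(pq))]

print(peeked, served)  # 1 [1, 3, 5]
```

Because the interface hides the list entirely, the heap could later be swapped for another implementation without changing any code that uses `MinPQ`, which is the data structure versus abstract data type distinction in practice.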

<hr \>

<a id='conc'></a>
## Conclusion

Hashing and priority queues are fundamental algorithmic / data structure topics, especially if you're looking to become a software engineer. There's a whole lot more detail to cover on these topics, such as runtime analysis, hashing collision probabilities, and heap implementations, and many lectures could be devoted to each of them. I highly recommend practicing these ideas in either Python or Java (Java tends to be easier), given how frequently they show up in real-life "CS" work. I also highly recommend learning more about these topics if you'd like to pursue a career in software engineering or anything similar.

If you'd like to continue learning more on arrays, hash tables, heaps, priority queues, and other common data structures, CS 61B is a great class to take.

If you'd like to continue learning more on theoretical hashing, runtime analysis, and probabilistic analysis, CS 170 is also a great follow-up class.

<hr \>

<a id='ref'></a>
## References 

* CS 61B SP21: Lecture 20

<hr \>