# Comprehensive Hashing Study Notes

## Introduction to Hashing

Hashing is a technique for building data structures that can be searched in **O(1) (constant) time on average**. This improves on sequential search, which takes O(n) time, and on binary search, which takes O(log n) time.

The core idea is to use a mathematical function to directly compute where an item should be stored, eliminating the need to search through multiple locations.

## Hash Tables

### Theory and Structure
A hash table is a collection of items stored in a way that makes finding them later extremely efficient. The structure consists of:

- **Slots**: Individual positions in the table, each named by an integer (0, 1, 2, ..., m-1)
- **Size (m)**: Total number of slots available
- **Initial State**: All slots contain None (empty)

### Implementation Concept
Hash tables can be implemented using arrays or lists where each element is initialized to None. The index of each element serves as the slot name.

Example: Empty hash table with size m=11
```
[None, None, None, None, None, None, None, None, None, None, None]
  0     1     2     3     4     5     6     7     8     9    10
```

### Load Factor
The **load factor (λ)** measures how full the hash table is:
- Formula: λ = number of items / table size
- Important for performance analysis
- Higher load factors increase collision probability
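
As a minimal sketch of the formula, assume the table is a plain Python list in which `None` marks an empty slot. The helper name `load_factor` is illustrative and not part of these notes.

```python
def load_factor(table):
    """Return the fraction of slots in `table` that hold an item."""
    occupied = sum(1 for slot in table if slot is not None)
    return occupied / len(table)


table = [None] * 11   # empty table with m = 11
table[10] = 54
table[4] = 26
table[5] = 93

print(load_factor(table))  # 3 items / 11 slots
```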

## Hash Functions

### Purpose and Requirements
A hash function maps any item in a collection to an integer within the range of slot names (0 to m-1). Good hash functions should:

1. **Minimize collisions**: Reduce items mapping to same slot
2. **Be computationally efficient**: Fast to calculate
3. **Distribute evenly**: Spread items across all slots uniformly
4. **Be deterministic**: Same input always produces same output

### Perfect Hash Functions
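A **perfect hash function** maps each item to a unique slot with no collisions. One can sometimes be constructed when the set of items is fixed and known in advance, but for arbitrary or changing data it is generally not achievable, so practical hash tables must handle collisions.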

## Hash Function Methods

### 1. Remainder Method (Division Method)

**Theory**: Uses modular arithmetic to map items to slots.

**Formula**: `h(item) = item % table_size`

**Example with table size 11**:

| Item | Calculation | Hash Value | Final Slot |
|------|-------------|------------|------------|
| 54   | 54 % 11     | 10         | 10         |
| 26   | 26 % 11     | 4          | 4          |
| 93   | 93 % 11     | 5          | 5          |
| 17   | 17 % 11     | 6          | 6          |
| 77   | 77 % 11     | 0          | 0          |
| 31   | 31 % 11     | 9          | 9          |

- **Advantages**: Simple to implement and understand
- **Disadvantages**: The choice of table size affects distribution quality; a prime table size usually spreads keys more evenly
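
The table above can be reproduced with a few lines of Python. This is a minimal sketch for integer items; the function name `remainder_hash` is an illustrative choice.

```python
def remainder_hash(item, table_size):
    """Map an integer item to a slot in the range 0 .. table_size - 1."""
    return item % table_size


items = [54, 26, 93, 17, 77, 31]
slots = {item: remainder_hash(item, 11) for item in items}
print(slots)  # {54: 10, 26: 4, 93: 5, 17: 6, 77: 0, 31: 9}
```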

### 2. Folding Method

**Theory**: Divides the item into equal-sized pieces, then combines them mathematically.

**Process**:
1. Break item into equal parts (last part may be shorter)
2. Add all parts together
3. Apply modulo operation with table size

**Example**: Phone number 436-555-4601
- Break into pairs: 43, 65, 55, 46, 01
- Sum: 43 + 65 + 55 + 46 + 01 = 210
- Hash: 210 % 11 = 1

- **Advantages**: Every part of a long key contributes to the hash; also usable for non-numeric data once it has been converted to digits
- **Disadvantages**: May not distribute evenly, depending on patterns in the data
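
The three steps can be sketched as follows, assuming the item arrives as a string of digits that may contain separators such as hyphens. The name `folding_hash` and the `piece_size` parameter are illustrative.

```python
def folding_hash(digits, table_size, piece_size=2):
    """Hash a digit string by summing fixed-size pieces, then taking the remainder."""
    # Keep only the digits, so "436-555-4601" becomes "4365554601"
    digits = "".join(ch for ch in digits if ch in "0123456789")
    # Slice into pieces of piece_size digits; the last piece may be shorter
    pieces = [int(digits[i:i + piece_size])
              for i in range(0, len(digits), piece_size)]
    return sum(pieces) % table_size


# Pieces 43, 65, 55, 46, 01 sum to 210, and 210 % 11 = 1
print(folding_hash("436-555-4601", 11))  # 1
```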

### 3. Mid-Square Method

**Theory**: Squares the item and extracts middle digits for more random distribution.

**Process**:
1. Square the item value
2. Extract middle portion of resulting digits
3. Apply modulo operation

**Example**: Item = 44
- Square: 44² = 1,936
- Extract middle: 93 (middle two digits)
- Hash: 93 % 11 = 5

- **Advantages**: Often provides good distribution
- **Disadvantages**: Squaring large keys adds computational cost
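
A minimal sketch of the process follows. The notes do not say exactly how to pick the "middle portion", so the rule here is an assumption: take the middle two digits when the square has an even number of digits, and the single middle digit when it has an odd number. That rule agrees with the worked example and with the Mid-Square column of the comparison table that follows. The name `mid_square_hash` is illustrative.

```python
def mid_square_hash(item, table_size):
    """Square the item, take its middle digit(s), then take the remainder."""
    squared = str(item * item)
    n = len(squared)
    # Assumed convention: two middle digits for even length, one for odd length
    if n % 2 == 0:
        middle = squared[n // 2 - 1:n // 2 + 1]
    else:
        middle = squared[n // 2]
    return int(middle) % table_size


print(mid_square_hash(44, 11))  # 44 * 44 = 1936, middle "93", 93 % 11 = 5
print([mid_square_hash(i, 11) for i in (54, 26, 93, 17, 77, 31)])  # [3, 7, 9, 8, 4, 6]
```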

### Hash Function Comparison

| Item | Remainder Method | Mid-Square Method |
|------|------------------|-------------------|
| 54   | 10               | 3                 |
| 26   | 4                | 7                 |
| 93   | 5                | 9                 |
| 17   | 6                | 8                 |
| 77   | 0                | 4                 |
| 31   | 9                | 6                 |

## Hashing Non-Integer Elements

### String Hashing Theory
Strings can be hashed by converting characters to their ordinal (ASCII/Unicode) values and performing mathematical operations.

**Basic Method**:
1. Convert each character to its ordinal value using `ord()`
2. Sum all ordinal values
3. Apply modulo operation

**Example**: "cat"
- 'c' = ord('c') = 99
- 'a' = ord('a') = 97  
- 't' = ord('t') = 116
- Sum: 99 + 97 + 116 = 312
- Hash: 312 % 11 = 4
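
The basic method can be sketched in a couple of lines. The function name `string_hash` is an illustrative choice.

```python
def string_hash(text, table_size):
    """Sum the ordinal values of the characters, then take the remainder."""
    return sum(ord(ch) for ch in text) % table_size


print(string_hash("cat", 11))  # (99 + 97 + 116) % 11 = 312 % 11 = 4
print(string_hash("act", 11))  # same characters, same sum, so also 4
```

Because addition ignores order, anagrams always collide under this scheme.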

## Collision Resolution

### Understanding Collisions
A **collision** occurs when two different items hash to the same slot. Collisions are unavoidable whenever the number of possible items exceeds the number of slots (the pigeonhole principle), which is the case in most practical scenarios.

**Example**: Both 44 % 11 = 0 and 77 % 11 = 0, causing a collision at slot 0.

### Open Addressing Methods

#### Linear Probing
**Theory**: When a collision occurs, sequentially check the next slots until an empty one is found.

**Algorithm**:
1. Compute initial hash value
2. If slot is occupied, move to next slot
3. Continue until empty slot found
4. Wrap around to beginning if necessary

**Example**: Adding 44, 55, and 20, in that order, to a table of size 11 where only slot 0 is occupied
- 44 % 11 = 0 (occupied) → try slot 1 → place at slot 1
- 55 % 11 = 0 (occupied) → try slot 1 (occupied) → try slot 2 → place at slot 2
- 20 % 11 = 9 (empty) → place at slot 9

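**Problems**:
- **Clustering**: Colliding items end up in adjacent slots, creating long runs that a search must walk through one slot at a time
- **Primary clustering**: The name for this effect under linear probing; any item that hashes into a run extends it, which slows down future insertions and searches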
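
The algorithm can be sketched as below, assuming integer items, the remainder method, and a list that uses `None` for empty slots. The function name `linear_probe_insert` is illustrative, and slot 0 is pre-filled with 77 to match the example's starting state.

```python
def linear_probe_insert(table, item):
    """Insert an integer item using the remainder method and linear probing.

    Returns the slot used. Raises RuntimeError if every slot is occupied.
    """
    size = len(table)
    start = item % size
    for step in range(size):
        slot = (start + step) % size  # wrap around past the last slot
        if table[slot] is None:
            table[slot] = item
            return slot
    raise RuntimeError("table is full")


table = [None] * 11
table[0] = 77  # slot 0 is already occupied
placed = [linear_probe_insert(table, item) for item in (44, 55, 20)]
print(placed)  # [1, 2, 9]
```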

#### Quadratic Probing
**Theory**: Instead of checking consecutive slots, use quadratic increments to reduce clustering.

**Formula**: If the initial hash is h and the table size is m, try positions (h + 1²) % m, (h + 2²) % m, (h + 3²) % m, and so on.

**Sequence**: h+1, h+4, h+9, h+16, h+25, ... (each taken modulo m)

- **Advantages**: Reduces primary clustering
- **Disadvantages**: Can create secondary clustering and may not probe all slots
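
To make the probe sequence concrete, here is a small sketch that lists the slots tried after the initial one, with each position wrapped around the table size. The function name `quadratic_probe_sequence` and its parameters are illustrative.

```python
def quadratic_probe_sequence(h, table_size, attempts):
    """Return the first `attempts` slots tried after the initial slot h."""
    return [(h + i * i) % table_size for i in range(1, attempts + 1)]


print(quadratic_probe_sequence(0, 11, 5))             # [1, 4, 9, 5, 3]
print(sorted(set(quadratic_probe_sequence(0, 11, 10))))  # [1, 3, 4, 5, 9]
```

With a table of size 11 and h = 0, the sequence only ever reaches slots 1, 3, 4, 5, and 9, which illustrates why quadratic probing may not visit every slot.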

### Chaining Method

**Theory**: Each hash table slot contains a reference to a collection (linked list, array) that can hold multiple items.

**Process**:
1. Compute hash value normally
2. Add item to the collection at that slot
3. Search requires traversing the collection at the hashed slot

**Visual Representation**:
```
Slot 0: [77] → [44] → [55]
Slot 1: [None]
Slot 2: [None]
Slot 3: [None]
Slot 4: [26]
Slot 5: [93]
Slot 6: [17]
Slot 7: [None]
Slot 8: [None]
Slot 9: [31] → [20]
Slot 10: [54]
```

**Advantages**: 
- Simple to implement
- No limit on items per slot
- Good performance with good hash function

**Disadvantages**: 
- Extra memory for storing collections
- Performance degrades as chains grow longer
- Cache performance may suffer
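
A minimal sketch of chaining, assuming integer items, the remainder method, and a Python list as each slot's collection. The class name `ChainedHashTable` and its methods are illustrative. The items are the ones shown in the visual representation above.

```python
class ChainedHashTable:
    """Hash table of integer items where each slot holds a list (the chain)."""

    def __init__(self, size):
        # One independent empty list per slot
        self.slots = [[] for _ in range(size)]

    def add(self, item):
        chain = self.slots[item % len(self.slots)]
        if item not in chain:  # ignore duplicates
            chain.append(item)

    def contains(self, item):
        # Only the chain at the hashed slot needs to be searched
        return item in self.slots[item % len(self.slots)]


table = ChainedHashTable(11)
for item in (54, 26, 93, 17, 77, 31, 44, 55, 20):
    table.add(item)

print(table.slots[0])  # [77, 44, 55]
print(table.slots[9])  # [31, 20]
```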

## Performance Analysis

### Time Complexity
- **Best Case**: O(1) for search, insert, delete
- **Average Case**: O(1) with good hash function and reasonable load factor
- **Worst Case**: 
  - Open Addressing: O(n) when clustering occurs
  - Chaining: O(n) when all items hash to same slot

### Space Complexity
- **Open Addressing**: O(m) where m is table size
- **Chaining**: O(m + n) where n is number of items

### Load Factor Impact
- **Low load factor (λ < 0.5)**: Fewer collisions, faster operations
- **High load factor (λ > 0.7)**: More collisions, slower operations
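- **Critical threshold**: As a rule of thumb, open-addressing tables degrade sharply above λ ≈ 0.75; chained tables tolerate higher load factors, since λ can exceed 1 when slots hold several items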
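
To get a feel for these thresholds, the sketch below evaluates the classical approximations for linear probing, usually attributed to Knuth's analysis: about ½(1 + 1/(1 − λ)) comparisons for a successful search and ½(1 + 1/(1 − λ)²) for an unsuccessful one. These formulas are not derived in these notes and assume uniformly distributed hash values, so treat the numbers as rough estimates. The function name `linear_probing_estimates` is illustrative.

```python
def linear_probing_estimates(load):
    """Approximate average comparisons for linear probing, for 0 <= load < 1."""
    successful = 0.5 * (1 + 1 / (1 - load))
    unsuccessful = 0.5 * (1 + (1 / (1 - load)) ** 2)
    return successful, unsuccessful


for load in (0.25, 0.5, 0.75, 0.9):
    hit, miss = linear_probing_estimates(load)
    print(f"load={load:.2f}  successful={hit:.2f}  unsuccessful={miss:.2f}")
```

At λ = 0.5 the estimates are 1.5 and 2.5 comparisons; at λ = 0.75 they rise to 2.5 and 8.5.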



# Implementation of a Hash Table


Keep in mind that Python already has a built-in dictionary type that serves as a hash table, so in practice you would not need to implement your own hash table in Python.

___
## Map
The idea of a dictionary used as a hash table to store and retrieve items using **keys** is often referred to as a mapping. A map supports the following operations; the implementation in this section covers creation, `put`, and `get`, plus indexing syntax:


* **HashTable()** Create a new, empty map. It returns an empty map collection.
* **put(key,val)** Add a new key-value pair to the map. If the key is already in the map, replace the old value with the new value.
* **get(key)** Given a key, return the value stored in the map, or None if the key is not present.
* **del** Delete the key-value pair from the map using a statement of the form `del map[key]`.
* **len()** Return the number of key-value pairs stored in the map.
* **in** Return True for a statement of the form `key in map` if the given key is in the map, False otherwise.
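
For comparison, Python's built-in `dict` already provides each of these operations. The short sketch below exercises them; the keys and values are arbitrary examples.

```python
m = {}             # create a new, empty map
m[54] = "cat"      # put: add a key-value pair
m[54] = "dog"      # put again: the old value is replaced
print(m.get(54))   # dog
print(m.get(99))   # None, because 99 was never stored
print(len(m))      # 1
print(54 in m)     # True
del m[54]          # delete the pair
print(54 in m)     # False
```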

```python
class HashTable:
    """Map with integer keys, using open addressing with linear probing."""

    def __init__(self, size):
        # Parallel lists: slots holds the keys, data holds the values
        self.size = size
        self.slots = [None] * self.size
        self.data = [None] * self.size

    def put(self, key, data):
        # Note: only integer keys are supported, for ease of use with the hash function
        startslot = self.hashfunction(key, len(self.slots))
        position = startslot

        # Probe until we reach an empty slot or the slot already holding this key
        while self.slots[position] is not None and self.slots[position] != key:
            position = self.rehash(position, len(self.slots))
            if position == startslot:
                # Every slot was probed and none is free
                raise RuntimeError("hash table is full")

        # Store a new key, or replace the value of an existing one
        self.slots[position] = key
        self.data[position] = data

    def hashfunction(self, key, size):
        # Remainder method
        return key % size

    def rehash(self, oldhash, size):
        # Linear probing: step to the next slot, wrapping around at the end
        return (oldhash + 1) % size

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))
        position = startslot

        # Probe until we reach an empty slot or have checked every slot
        while self.slots[position] is not None:
            if self.slots[position] == key:
                return self.data[position]
            position = self.rehash(position, len(self.slots))
            if position == startslot:
                break
        return None

    # Special methods for use with Python indexing
    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)
```

Let's see it in action!

```python
h = HashTable(5)

# Put our first key in
h[1] = 'one'
h[2] = 'two'
h[3] = 'three'

h[1]         # 'one'

# Assigning to an existing key replaces its value
h[1] = 'new_one'
h[1]         # 'new_one'

# A key that was never stored returns None
print(h[4])  # None
```


### Great Job!

That's it for this rudimentary implementation. Try implementing a different hash function for practice!