# Hashing

## What is hashing?

By definition, hashing is a method of storing and indexing data. The idea behind hashing is to allow large amounts of data to be indexed using keys, which are commonly created by formulas.

Let's explain this definition with a real example. Assume we have 3 strings and we want to store them in an efficient way:

- Apple
- Application
- Appmillers

To store them using hashing:

1. The first step is to convert these strings into numbers using some magic function. The conversion through our magic function gives:

- Apple $\rarr$ 18
- Application $\rarr$ 20
- Appmillers $\rarr$ 22

The question here is: how do we convert them? What is the magic function used here, and how does it work?

2. The next step is to store these numbers in some data structure. Let's use each number as an index into an array or list:

|  0  |  1  | $\dots$ |  18   | 19  |     20      | 21  |     22     | 23  |
| :-: | :-: | :-----: | :---: | :-: | :---------: | :-: | :--------: | :-: |
|     |     | $\dots$ | Apple |     | Application |     | Appmillers |     |


### Why hashing?

Hashing is time efficient for the SEARCH operation. Search takes $\Omicron(1)$ time on average, and $\Omicron(n)$ time when there are a lot of collisions.

### Hashing terminology

* **Hash function**: A function that can be used to map data of arbitrary size to data of fixed size. (Our **magic function**)
* **Key**: Input data given by a user (**Apple, Application, Appmillers**)
* **Hash value**: A value that is returned by hash function (**18, 20, 22**)
* **Hash table**: A data structure which implements an associative array abstract data type, a structure that can *map keys to values.* (**the array in our example**)
* **Collision**: A collision occurs when two different keys passed to a hash function produce the same output (*hash value*)

### Hash functions

### $1^{st}$ case with an integer key

#### Mod function

```python
def mod(number, cell_number):
    return number % cell_number
```



Let's insert $400$ and $700$ into our hash table with $24$ cells.

We'll call:

`mod(400, 24)` $\rarr$ 16

`mod(700, 24)` $\rarr$ 4

|  0  |  1  | $\dots$ |   4   |  5  | $\dots$ | 16  | $\dots$ | 23  |
| :-: | :-: | :-----: | :---: | :-: | :-----: | :-: | :-----: | :-: |
|     |     | $\dots$ | $700$ |     | $\dots$ |  $400$   | $\dots$ |     |
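
The two insertions above can be sketched as follows. This is only a minimal illustration: it assumes the hash table is a plain Python list of 24 cells with `None` marking an empty cell, and the names `CELL_NUMBER` and `hash_table` are introduced here for the example.

```python
def mod(number, cell_number):
    return number % cell_number


CELL_NUMBER = 24                       # assumed table size, as in the calls above
hash_table = [None] * CELL_NUMBER      # None marks an empty cell

for key in (400, 700):
    # The hash value is used directly as the index of the cell
    hash_table[mod(key, CELL_NUMBER)] = key

print(hash_table[16])  # 400
print(hash_table[4])   # 700
```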


### $2^{nd}$ case with a string key

#### ASCII function

Here we use the ASCII values of a string's characters to convert it to a number.

```python
def mode_ascii(string, cell_number):
    # Sum the ASCII codes of all characters, then wrap into the table size
    return sum(ord(char) for char in string) % cell_number
```

Let's insert `"ABC"`:

`mode_ascii("ABC", 24)`

From the ASCII table we have:

A $\rarr$ 65

B $\rarr$ 66

C $\rarr$ 67

$65+66+67 = 198$

$198 = 8 \times 24 + 6$, so the quotient is $8$ and the remainder is $6$


|  0  |  1  | $\dots$ |   6   |  7  | $\dots$ | 16  | $\dots$ | 23  |
| :-: | :-: | :-----: | :---: | :-: | :-----: | :-: | :-----: | :-: |
|     |     | $\dots$ | *ABC* |     | $\dots$ |     | $\dots$ |     |
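
Once a key is stored, searching for it repeats the same computation. The sketch below assumes the same 24-cell list as the table above and ignores collisions entirely; `contains` is a hypothetical helper added for illustration.

```python
def mode_ascii(string, cell_number):
    return sum(ord(char) for char in string) % cell_number


CELL_NUMBER = 24
hash_table = [None] * CELL_NUMBER

# Insert: compute the index once and store the key there
index = mode_ascii("ABC", CELL_NUMBER)
hash_table[index] = "ABC"


def contains(table, key):
    # Hypothetical helper: recompute the index and compare the stored key.
    # No scan over the table is needed, which is where O(1) search comes from.
    return table[mode_ascii(key, len(table))] == key


print(index)                        # 6
print(contains(hash_table, "ABC"))  # True
print(contains(hash_table, "ABD"))  # False
```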

These are simple versions of hash functions; in the real world, hash functions are more complex.

To choose a good hash function, we first need to define what makes a hash function good. Let's see some characteristics of good hash functions:

* **It distributes hash values uniformly across the hash table**
* **It has to use all input data**
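
To make these two characteristics concrete, here is a small comparison. `first_char_hash` and `positional_hash` are hypothetical functions written for this example, and the base `31` is an arbitrary assumption, not a recommendation from the text above.

```python
def first_char_hash(string, cell_number):
    # Hypothetical weak hash: ignores everything after the first character
    return ord(string[0]) % cell_number


def mode_ascii(string, cell_number):
    # Uses all characters, but ignores their order
    return sum(ord(char) for char in string) % cell_number


def positional_hash(string, cell_number, base=31):
    # Hypothetical alternative: each character is weighted by its position
    value = 0
    for char in string:
        value = value * base + ord(char)
    return value % cell_number


words = ["Apple", "Application", "Appmillers"]
# All three words start with "A", so they all land in the same cell
print({first_char_hash(word, 24) for word in words})

# "AB" and "BA" have the same character sum, so mode_ascii collides
print(mode_ascii("AB", 24), mode_ascii("BA", 24))

# The positional version separates this particular pair
print(positional_hash("AB", 24), positional_hash("BA", 24))
```

No hash function avoids collisions completely; the positional version only spreads keys like these better than the plain sum does.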

### Collision resolution techniques

Let's say we have these strings: _ABCD, EFGH, IJKL_ and we want to insert them into the following hash table. Let's also say that every time we apply the hash function to these strings, we get the same index:

_ABCD_ $\rarr$ $2$

_EFGH_ $\rarr$ $2$

| Index |             Value              |
| :---: | :----------------------------: |
|   0   |                                |
|   1   |                                |
|   2   | _ABCD_ / _EFGH_ (collision here) |
|   3   |                                |
|   4   |                                |
|   5   |                                |
|   6   |                                |
|   7   |                                |
|   8   |                                |
|   9   |                                |
|  10   |                                |
|  11   |                                |
|  12   |                                |
|  13   |                                |
|  14   |                                |
|  15   |                                |



Resolution Techniques:
* Direct Chaining
* Open Addressing
  * Linear Probing
  * Quadratic Probing
  * Double Hashing

### Direct chaining

Implements the buckets as **linked lists**. Colliding elements are **stored in these lists**.

This means every cell in the hash table will **store a reference to a linked list**.
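
A minimal sketch of direct chaining might look like the following. The class names `Node` and `ChainedHashTable` are hypothetical, the hash is the ASCII-sum function from earlier, and duplicates are not checked, so this is an illustration rather than a complete implementation.

```python
class Node:
    """One element of a singly linked list."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, size=16):
        self.size = size
        # Each cell stores a reference to the head of a linked list (or None)
        self.cells = [None] * size

    def _index(self, key):
        return sum(ord(char) for char in key) % self.size

    def insert(self, key):
        index = self._index(key)
        # Prepend to the chain, so colliding keys share the same cell
        self.cells[index] = Node(key, self.cells[index])

    def search(self, key):
        node = self.cells[self._index(key)]
        while node is not None:      # walk the chain of colliding keys
            if node.key == key:
                return True
            node = node.next
        return False


table = ChainedHashTable(size=16)
table.insert("ABCD")
table.insert("DCBA")  # same characters, so the same index: a collision
print(table.search("DCBA"))  # True
print(table.search("IJKL"))  # False
```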

### Open addressing

Colliding elements are stored in **other vacant buckets**. During storage and lookup, these buckets are found through so-called probing.

#### Linear probing

It places the new key into the **closest following empty cell**.
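
As a rough sketch, linear probing on a plain Python list could look like this. The function names are hypothetical, `None` is assumed to mark an empty cell, and the hash is the ASCII-sum function from earlier.

```python
def hash_index(key, size):
    return sum(ord(char) for char in key) % size


def insert_linear(table, key):
    size = len(table)
    start = hash_index(key, size)
    for step in range(size):
        # Try the home cell first, then the following cells, wrapping around
        index = (start + step) % size
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("hash table is full")


table = [None] * 16
# The three keys contain the same characters, so they share a home cell
print(insert_linear(table, "ABCD"))  # 10
print(insert_linear(table, "DCBA"))  # 11
print(insert_linear(table, "BADC"))  # 12
```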

#### Quadratic probing

It adds successive values of an arbitrary quadratic polynomial to the index until an empty cell is found.
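
The sketch below shows one possible choice of polynomial, $\frac{i^2 + i}{2}$ (the triangular numbers $0, 1, 3, 6, \dots$). This choice is an assumption made for the example: it is convenient because it reaches every cell when the table size is a power of two. The function names are hypothetical.

```python
def hash_index(key, size):
    return sum(ord(char) for char in key) % size


def insert_quadratic(table, key):
    # Assumes len(table) is a power of two, so the offsets below
    # eventually visit every cell of the table.
    size = len(table)
    start = hash_index(key, size)
    for step in range(size):
        offset = step * (step + 1) // 2   # 0, 1, 3, 6, 10, ...
        index = (start + offset) % size
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("hash table is full")


table = [None] * 16
print(insert_quadratic(table, "ABCD"))  # 10 (home cell)
print(insert_quadratic(table, "DCBA"))  # 11 (home cell + 1)
print(insert_quadratic(table, "BADC"))  # 13 (home cell + 3)
```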

#### Double hashing

The interval between probes is computed by another hash function.
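
A possible sketch of double hashing follows. Both hash functions and the table size of 17 are assumptions chosen for the example: a prime size guarantees that any step between 1 and 16 eventually visits every cell.

```python
def primary_hash(key, size):
    # First hash: picks the home cell
    return sum(ord(char) for char in key) % size


def step_hash(key, size):
    # Second hash (hypothetical): picks the interval between probes.
    # The result is always between 1 and size - 1, never 0.
    value = 0
    for char in key:
        value = value * 31 + ord(char)
    return 1 + value % (size - 1)


def insert_double(table, key):
    size = len(table)                 # assumed to be a prime number
    start = primary_hash(key, size)
    step = step_hash(key, size)
    for attempt in range(size):
        index = (start + attempt * step) % size
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("hash table is full")


table = [None] * 17
first = insert_double(table, "ABCD")   # lands in its home cell
second = insert_double(table, "DCBA")  # same home cell, so it jumps by its own step
print(first, second)
```

Unlike linear probing, two keys with the same home cell usually follow different probe sequences, because their step sizes differ.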

### Hash table is full

#### Direct chaining
This situation will never happen, because each linked list can keep growing.

#### Open addressing

Create a new hash table with $2\times$ the size of the current one and **rehash the current keys** into it.

Keep in mind that creating a new hash table affects performance: we need to call the hash function for all the stored keys one more time to insert them into the new hash table, which takes $\Omicron(n)$ time.
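
The resizing step can be sketched as follows, assuming linear probing and a plain Python list with `None` for empty cells. The helper names `place` and `insert` are hypothetical.

```python
def hash_index(key, size):
    return sum(ord(char) for char in key) % size


def place(table, key):
    # Linear probing; the caller guarantees at least one empty cell
    size = len(table)
    index = hash_index(key, size)
    while table[index] is not None:
        index = (index + 1) % size
    table[index] = key


def insert(table, key):
    """Insert key and return the table, which may be a new, larger list."""
    if None not in table:
        # Table is full: create one twice as large and rehash every key.
        # Indices change because they depend on the table size: O(n) work.
        bigger = [None] * (2 * len(table))
        for old_key in table:
            place(bigger, old_key)
        table = bigger
    place(table, key)
    return table


table = [None] * 2
for key in ("AB", "CD", "EF"):
    table = insert(table, key)

print(len(table))  # 4
```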

### Pros and cons of collision resolution techniques

#### Direct chaining

* The hash table never gets full
* Huge linked lists cause performance problems (time complexity for the search operation becomes $\Omicron(n)$)

#### Open addressing
* Easy to implement
* When the hash table is full, creating a new hash table affects performance (the insertion that triggers the resize takes $\Omicron(n)$ time)

To summarize, the decision to use direct chaining or open addressing is based on the following factors:

* If the input size is known, we always use ***"Open addressing"***
* If we perform **deletion operations frequently**, we use ***"Direct chaining"***

### Pros and cons of hashing

On average, insertion/deletion/search operations take $\Omicron(1)$ time.

When the hash function is not good enough, insertion/deletion/search operations take $\Omicron(n)$ time.

| Operations | Array/Python list |  Linked list  |       Tree       |           Hashing           |
| :--------: | :---------------: | :-----------: | :--------------: | :-------------------------: |
| Insertion  |   $\Omicron(n)$   | $\Omicron(n)$ | $\Omicron(\log n)$ | $\Omicron(1)$/$\Omicron(n)$ |
|  Deletion  |   $\Omicron(n)$   | $\Omicron(n)$ | $\Omicron(\log n)$ | $\Omicron(1)$/$\Omicron(n)$ |
|   Search   |   $\Omicron(n)$   | $\Omicron(n)$ | $\Omicron(\log n)$ | $\Omicron(1)$/$\Omicron(n)$ |
