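## Hash functions
### Properties
1. The input domain is unbounded, while the output domain is finite
2. The same input always produces the same output
3. Different inputs produce the same output only with low probability (a collision)
4. Outputs are scattered and uniformly distributed over the output domain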
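
The properties above can be checked informally with a standard-library hash. This is a minimal sketch: the helper `bucket`, the bucket count of 16, and the `key-{i}` test strings are illustrative assumptions, not part of the original notes.

```python
import hashlib

def bucket(key: str, buckets: int = 16) -> int:
    """Map an arbitrary string to one of `buckets` slots via MD5 (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % buckets

# Property 2: the same input always gives the same output.
assert bucket("hello") == bucket("hello")

# Property 1: the output size is fixed (16 bytes for MD5), whatever the input length.
assert len(hashlib.md5(b"a" * 10_000).digest()) == 16

# Property 4: many keys spread roughly evenly over the buckets.
counts = [0] * 16
for i in range(16_000):
    counts[bucket(f"key-{i}")] += 1
print(min(counts), max(counts))
```

Taking the hash modulo a bucket count keeps the distribution roughly uniform, which is what the examples below rely on.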


### Example 1
A large file contains 4 billion unsigned integers, each in the range 0 to $2^{32}-1$.  
Using 1 GB of memory, find the number that occurs most often.  
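
One common approach, sketched here under assumptions, is to split the input into bucket files by hash. A single count table over up to $2^{32}$ distinct values could need roughly 32 GB at 8 bytes per entry in the worst case. Equal values always hash to the same bucket, so each bucket can be counted on its own in far less memory. The names `hash_bucket` and `most_frequent`, the choice of 100 buckets, and the text-file format are all illustrative.

```python
import hashlib
import os
import tempfile
from collections import Counter

def hash_bucket(x: int, parts: int) -> int:
    """Pick a bucket for a 32-bit unsigned integer (hypothetical helper)."""
    digest = hashlib.md5(x.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big") % parts

def most_frequent(numbers, parts=100):
    """Return (value, count) for the most frequent number, counting one bucket at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part-{i}.txt") for i in range(parts)]
        files = [open(p, "w") for p in paths]
        try:
            for x in numbers:
                # Equal values always land in the same bucket file.
                files[hash_bucket(x, parts)].write(f"{x}\n")
        finally:
            for f in files:
                f.close()

        best = (None, 0)
        for p in paths:
            with open(p) as f:
                counts = Counter(int(line) for line in f)
            if counts:
                value, count = counts.most_common(1)[0]
                if count > best[1]:
                    best = (value, count)
        return best

data = [7] * 5 + [3] * 3 + list(range(100))
print(most_frequent(data))
```

A Python `Counter` uses far more than 8 bytes per entry, so this only illustrates the partitioning idea; a real solution would use a compact table per bucket.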

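### Example 2: Design a RandomPool structure
   
Design a structure that supports the following three operations:  
`insert(key)`: add a key to the structure, without adding duplicates  
`delete(key)`: remove a key that is currently in the structure  
`getRandom()`: return any key in the structure with equal probability  
   
`insert`, `delete`, and `getRandom` must all run in O(1) time.  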

In [9]:
import random

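class RandomPool:
    def __init__(self):
        self.size = 0
        self.map1 = {}  # key -> index
        self.map2 = {}  # index -> key

    def insert(self, key):
        # Ignore keys that are already present so each key is stored once.
        if key in self.map1:
            return
        self.map1[key] = self.size
        self.map2[self.size] = key
        self.size += 1

    def delete(self, key):
        # Deleting a key that is not present is a no-op.
        if key not in self.map1:
            return
        index = self.map1[key]
        last_index = self.size - 1
        last_key = self.map2[last_index]

        # Move the last key into the freed slot so indices stay contiguous.
        self.map1[last_key] = index
        self.map2[index] = last_key
        del self.map1[key]
        del self.map2[last_index]

        self.size -= 1

    def getRandom(self):
        # Return None when the pool is empty.
        if self.size == 0:
            return None
        index = random.randint(0, self.size - 1)
        return self.map2[index]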

rp = RandomPool()
arr = ['A', 'B', 'C', 'D']

for key in arr:
    rp.insert(key)
print(rp.map1)
print(rp.map2)
print()

rp.delete('C')
print(rp.map1)
print(rp.map2)
print()

rp.getRandom()


{'A': 0, 'B': 1, 'C': 2, 'D': 3}
{0: 'A', 1: 'B', 2: 'C', 3: 'D'}

{'A': 0, 'B': 1, 'D': 2}
{0: 'A', 1: 'B', 2: 'D'}



'D'

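## Derived topic: Bloom filter
Used for blacklist-style membership queries. It always has some error rate, in the form of false positives: an item that was never added may be reported as present, but an added item is never reported as absent.
### BitMap
Represent the bitmap with an ordinary array of integers, each treated as 32 bits.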

In [19]:
# 10 integers x 32 bits = 320 bits
bitMap = [0 for _ in range(10)]

# Locate bit 283:
# which integer holds it, and which bit inside that integer
numIndex = 283 // 32
bitIndex = 283 % 32

# Read the state of bit 283
state = (bitMap[numIndex] >> bitIndex) & 1
print(state)

# Set bit 283 to 1
bitMap[numIndex] = bitMap[numIndex] | (1 << bitIndex)
state = (bitMap[numIndex] >> bitIndex) & 1
print(state)
print(bin(bitMap[numIndex]))

# Clear bit 283 back to 0
bitMap[numIndex] = bitMap[numIndex] & (~(1 << bitIndex))
state = (bitMap[numIndex] >> bitIndex) & 1
print(state)
print(bin(bitMap[numIndex]))



0
1
0b1000000000000000000000000000
0
0b0


## How a Bloom filter works
![filter](images/哈希函数与哈希表-布隆过滤器.png)
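
The figure is assumed to show the usual scheme: each item is hashed by $k$ functions, and the $k$ corresponding bits of an $m$-bit array are set. A query answers "present" only if all $k$ bits are set. The following is a minimal sketch of that scheme, not a tuned implementation. The class name `BloomFilter` and the way $k$ positions are derived (salting SHA-256 with an index) are assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # bitmap packed 8 bits per byte

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by salting one hash with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

blacklist = BloomFilter(m=1024, k=5)
blacklist.add("spam.example")
print("spam.example" in blacklist)
```

Items cannot be removed from this structure, because clearing a bit could also erase evidence of another item that shares it.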

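### Formulas
$n$ = number of samples, $p$ = false-positive rate, $m$ = required memory in bits, $k$ = number of hash functions

Formula 1: $m = -\dfrac{n \ln p}{(\ln 2)^2}$
    
For example, with a false-positive rate of 1 in 10,000 and 10 billion samples of 64 bytes each, the formula gives about $1.92 \times 10^{11}$ bits, roughly 24 GB. The sample size does not enter the formula, because the filter stores only bits and not the samples themselves.  
   
Formula 2: $k = \ln 2 \cdot \dfrac{m}{n}$
   
Formula 3: $p_{\text{real}} = \left(1 - e^{-n k_{\text{real}} / m_{\text{real}}}\right)^{k_{\text{real}}}$, where $m_{\text{real}}$ and $k_{\text{real}}$ are the values actually used.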
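
The three formulas can be evaluated directly. The sketch below plugs in the example figures (10 billion samples, a target rate of 1 in 10,000). The function name `bloom_parameters` and the rounding choices (round $m$ up, round $k$ to the nearest integer) are assumptions.

```python
import math

def bloom_parameters(n: int, p: float):
    """Return (m_real in bits, k_real, p_real) for n samples and target rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)        # Formula 1
    m_real = math.ceil(m)                            # whole number of bits
    k_real = max(1, round(math.log(2) * m_real / n)) # Formula 2, rounded
    p_real = (1 - math.exp(-n * k_real / m_real)) ** k_real  # Formula 3
    return m_real, k_real, p_real

m_real, k_real, p_real = bloom_parameters(10_000_000_000, 0.0001)
print(f"{m_real / 8 / 1e9:.2f} GB, k = {k_real}, p_real = {p_real:.6f}")
```

Because $k$ must be an integer, the achieved rate differs slightly from the target; allocating more bits than Formula 1 requires lowers it.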

## Consistent hashing

In [30]:
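import hashlib as hb
from queue import PriorityQueue

# Quick check: MD5 hex digest of a sample string (the result is discarded)
a = '101'
md5 = hb.md5()
md5.update(a.encode())
md5.hexdigest()

class Server:
    def __init__(self, name: str, ability: int):
        self.md5Codes = []  # ring positions (virtual nodes) of this server
        self.data = []
        self.name = name
        # One hasher is shared across iterations, so each digest covers all
        # name+str(i) strings fed so far, not name+str(i) alone.
        md5 = hb.md5()
        for i in range(ability):
            md5.update((name + str(i)).encode())
            self.md5Codes.append(md5.hexdigest())

class DataSeparater:
    def __init__(self):
        # Min-heap of (md5 code, server name), popped in ascending code order
        self.servers = PriorityQueue()

    def appendServer(self, s: Server):
        for code in s.md5Codes:
            self.servers.put((code, s.name))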
    


s1 = Server('s1', 8)
s2 = Server('s2', 5)
dp = DataSeparater()
dp.appendServer(s1)
dp.appendServer(s2)
while not dp.servers.empty():
    print(dp.servers.get())


('2a4918f8a1712a97fb3360d91191e286', 's2')
('2b3357d90b70edb1773bba13390166c1', 's1')
('3c55d34ff977905a713015283bc25e5e', 's2')
('583a3463d121548e24fb0df2da0e5490', 's2')
('6f1d80ff858025be4dac5f1f26f3edb7', 's1')
('70334e4bfe11208d9a975939943914bd', 's1')
('7b08126602036179da38b0059fdce34d', 's1')
('ca43bc17c500c2baad4ed0d6b9ce71f4', 's1')
('cef2393a345ef42beb74c0ae25ee99dc', 's1')
('cff850f16499e786b92cb825a79bf173', 's2')
('df33c554c56c1703b6f98841fe7df132', 's1')
('e988b9b66eb0ae4d5bed10584b8ee74c', 's1')
('ee48e6db45763bf71c23e178131f0c9b', 's2')


In [41]:
import random
import time
from queue import PriorityQueue

class SpeedTest:
    def __init__(self, n):
        self.randoms = []
        self.pq = PriorityQueue()
        self.arr = []
        for _ in range(n):
            self.randoms.append(random.randint(0, n))

        start = time.time()
        self.pqSort()
        end = time.time()
        print('PQ sort time cost: ', (end - start)*1000, 'ms')

        start = time.time()
        self.arrSort()
        end = time.time()
        print('Arr sort time cost: ', (end - start)*1000, 'ms')

    def pqSort(self):
        # Only pushes onto the thread-safe heap; nothing is popped, so this
        # times n insertions rather than a complete sort.
        for num in self.randoms:
            self.pq.put(num)

    def arrSort(self):
        # Append everything to a list, then sort once.
        for num in self.randoms:
            self.arr.append(num)
        self.arr.sort()

SpeedTest(1000)
print()
SpeedTest(10000)
print()
SpeedTest(100000)
        

PQ sort time cost:  2.122640609741211 ms
Arr sort time cost:  3.888845443725586 ms

PQ sort time cost:  16.91293716430664 ms
Arr sort time cost:  1.9147396087646484 ms

PQ sort time cost:  150.81191062927246 ms
Arr sort time cost:  21.432876586914062 ms


<__main__.SpeedTest at 0x7f7d89b85a60>