<a href="https://colab.research.google.com/github/ARU-Bioinformatics/ARU-Bioinf-CMA-2021/blob/main/demo_string_key_to_integer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hash map function choice

Most hash table keys are *strings*. A string cannot be used directly as a hash table address: it first has to be turned into an integer.

For example, the common example of a _modulus_ function to select the slot in the table requires an integer. 
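As a minimal sketch of why an integer is needed, the cell below applies the modulus operator to a few integer keys. The helper name `slot_for` and the sample keys are illustrative assumptions; the table size of 27 matches the default used later in this notebook.

```python
# Hypothetical helper (not part of the notebook): pick a slot for an integer key.
def slot_for(int_key, hash_table_size=27):
    # The remainder after division is always in the range 0..hash_table_size-1.
    return int_key % hash_table_size

for int_key in (7, 27, 100):
    print(f'key {int_key} -> slot {slot_for(int_key)}')
```

Passing a string such as `'AUG'` to this helper would raise a `TypeError`, which is why the key must be converted to an integer first.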

A hash map function must therefore be used to change a string key into an integer.

One possible starting point is the `ord()` function, which takes a single character and returns its Unicode code point as an integer.

In [None]:
# run this cell to see ord values for 'a', 'b' and 'z'
for letter in 'a', 'b', 'z':
    ord_letter = ord(letter)
    print(f'ord({letter})= {ord_letter}')

In [None]:
# now your turn - what are the ord values of 'A', 'Z', '$', '\n'?

So one commonly suggested way to produce an integer key from a string is to add together the ord() values of its characters. 

In [None]:
# run this cell to define the function str_hash_simple
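def str_hash_simple(mystring, hash_table_size=27):
    # add up the ord() value of every character in the string
    _sum = 0
    for _char in mystring:
        _sum += ord(_char)
    # reduce the sum to a slot number in the range 0..hash_table_size-1
    return _sum % hash_table_size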

In [None]:
# run this cell to see the str_hash_simple value of "AUG"
str_hash_simple('AUG')

One obvious issue with a simple sum is that it does not depend on the order of the characters, so any two strings built from the same characters get the same value. This creates a lot of collisions, especially with a limited alphabet.
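To get a feel for the scale of the problem, the sketch below counts how many distinct bins the simple sum uses for every three-letter string over the alphabet `ACGU`. The choice of alphabet and the names `codons` and `bins_used` are assumptions made for illustration, and the function is restated in compact form so the cell runs on its own.

```python
from itertools import product

# Compact restatement of the simple-sum hash, repeated so this cell is self-contained.
def str_hash_simple(mystring, hash_table_size=27):
    return sum(ord(char) for char in mystring) % hash_table_size

# All 64 three-letter strings over the four-letter alphabet.
codons = [''.join(letters) for letters in product('ACGU', repeat=3)]

# The set keeps only the distinct bin numbers that were actually used.
bins_used = {str_hash_simple(codon) for codon in codons}
print(f'{len(codons)} strings fall into {len(bins_used)} distinct bins')
```

Because the sum ignores order, only the combination of letters matters, and there are just 20 such combinations of three letters drawn from four. The 64 strings can therefore never occupy more than 20 bins.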

In [None]:
# now your turn - write code to get the str_hash_simple() values of 'GUA' and 'AGU' - how do they compare, and why?

A way to get around this is to weight each character by its position in the string. The position is available via the Python [enumerate function](https://www.programiz.com/python-programming/methods/built-in/enumerate).
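As a quick illustration of what `enumerate` returns, the cell below pairs each character of a sample string with its position. The sample string `'AUG'` and the variable name `pairs` are chosen here for illustration; `start=1` makes the count begin at 1 rather than the default 0.

```python
# Pair each character with its 1-based position in the string.
pairs = list(enumerate('AUG', start=1))
print(pairs)

# The same pairs can be unpacked directly in a for loop.
for index, char in enumerate('AUG', start=1):
    print(f'position {index} holds {char!r}')
```

Starting at 1 matters for weighting: with the default start of 0 the first character would be multiplied by zero and would not contribute at all.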

In [None]:
# run this cell to define str_hash_02
def str_hash_02(mystring, hash_table_size=27):
    _sum = 0
    for index, char in enumerate(mystring, start=1):
        _sum += ord(char) * index        
    return _sum%hash_table_size

In [None]:
# now your turn - write code to get the str_hash_02 values of 'GUA' and 'AGU' 

One other simple approach is to use binary numbers. These are numbers written in base 2, so the binary '10' is equivalent to the decimal number 2. If this is unfamiliar to you please see https://www.ducksters.com/kidsmath/binary_numbers_basics.php

To convert decimal numbers to binary strings in Python it is easiest to use the `format` function:

In [None]:
# run this cell to see how to convert decimals to binary in python
for value in range(17):
    binary_string = format(value, 'b')
    print(f'decimal integer {value} in binary is {binary_string}')

So an alternative hash map is to convert the ordinals into binary and then simply concatenate the digits rather than add them. The result can be converted back to a base 10 integer key using the `int()` function with `2` as its base argument.
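The round trip between an integer and its binary string can be sketched as follows. The variable names are illustrative; `format(value, 'b')` and `int(text, 2)` are the standard built-ins already mentioned above.

```python
# Integer -> binary string, using the ordinal of 'G' as the sample value.
binary_string = format(ord('G'), 'b')
print(f"ord('G') as a binary string is {binary_string}")

# Binary string -> integer: the second argument tells int() to read base 2.
round_trip = int(binary_string, 2)
print(f'converted back to decimal it is {round_trip}')
```

Note that `format` returns a string, so joining two results with `+` concatenates their digits instead of adding the numbers.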

So `GUA` would be 
* `71` `85` `65` in decimal or 
* `1000111` `1010101` `1000001` in binary. 
* Concatenating the 3 binary strings gives `100011110101011000001` 
* which is converted back to `1174209` by the [Python `int` function](https://www.geeksforgeeks.org/python-int-function/). 
* The final result would be `1174209%27` = `6` using the default `hash_table_size=27`

In [None]:
# run this cell to see 1174209%27 from worked example
1174209%27

In [None]:
# run this cell to define str_hash_03 function
def str_hash_03(mystring, hash_table_size=27):
    concat = ''.join(format(ord(c), 'b') for c in mystring)        
    return int(concat,2)%hash_table_size

In [None]:
# now your turn write Python to confirm that str_hash_03('GUA') is 6

In [None]:
# Write Python to find out what is the str_hash_03 for 'AGU' 

# does this differ from the value for 'GUA'

# Optional advanced question

Another (possibly neater) way of using conversion to binary is with [Python Bitwise Operators](https://www.tutorialspoint.com/python/bitwise_operators_example.htm). Can you come up with a function that uses them? Does your function lead to different hash bins for `GUA` and `AGU`?
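As a hint rather than a solution, the cell below shows the two operators most likely to be useful, applied to small sample numbers chosen purely for illustration. The left shift `<<` appends zero bits on the right, and the bitwise OR `|` can then fill those bits in.

```python
value = 5                # binary 101
shifted = value << 3     # append three zero bits: binary 101000
combined = shifted | 3   # fill the low bits with binary 011: binary 101011

for name, number in (('value', value), ('shifted', shifted), ('combined', combined)):
    print(f"{name:>8} = {number:>2} = binary {format(number, 'b')}")
```

The effect is the same as joining the binary strings `101` and `011`, but it is done with integers throughout and needs no string conversion.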