In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

In [None]:
# hashlib is part of the Python standard library, so nothing needs to be installed.

In [None]:
from hashlib import md5

# Lab 4: Rainbow Table Attack

Contributions from: Teo Honda-Scully and Ryan Cottone

Welcome to Lab 4! In this lab, you will learn about a common technique for recovering plaintext values from their hashes, even though hashing is a one-way function.

## Hash Functions and Collision Exploits

A **hash function**, denoted $H(x)$, is a deterministic function taking in some arbitrary amount of data, and outputting a fixed amount of data that appears random. Hash functions are used to condense a bunch of data down into a tag that _almost_ uniquely identifies it. As we will see in this lab, the _almost_ unique characteristic of a hash can be exploited.

First, let's go over a use case of hashing. Hash functions are often used to store passwords. Imagine a scenario in which you are signing up for a website using your typical password "_donut123_". The website, without knowing that a bad guy secretly has access to its database, stores "_donut123_" in their database of login information. At this point, your beloved "_donut123_" password is completely compromised.

Alternatively, imagine a scenario in which the website stores a hashed version of your password "_donut123_" instead. The adversary (who can see the website's database) will only have access to a bunch of character strings that appear random. In this case, your "_donut123_" password is safe, and the adversary will not be able to log in to your account.

Websites store a hashed version of your password rather than the actual plaintext version. During any given login request, the website will compare $H(submitted\_password)$ to their stored database entry hash for the given username to see if they are equivalent.
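The login check described above can be sketched as follows. This is a minimal illustration, not how any particular website does it: the `database` dictionary, the `login` function, and the choice of SHA-256 as $H$ are all assumptions made for the example.

```python
import hashlib
import hmac

def H(password: str) -> str:
    """Hash a password string (SHA-256 is chosen only for illustration)."""
    return hashlib.sha256(password.encode()).hexdigest()

# Hypothetical database: username -> stored password hash (never the plaintext).
database = {"alice": H("donut123")}

def login(username: str, submitted_password: str) -> bool:
    """Return True if the hash of the submitted password matches the stored hash."""
    stored = database.get(username)
    if stored is None:
        return False
    # compare_digest compares the two hex strings in constant time.
    return hmac.compare_digest(H(submitted_password), stored)

print(login("alice", "donut123"))  # True
print(login("alice", "donut124"))  # False
```

Note that the website never needs the plaintext after sign-up: it only ever compares hashes.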

By nature, hash functions will result in collisions. There are infinitely many possible inputs but only finitely many possible hashes (the output has a fixed length), so some distinct inputs must map to the same output hash.

![hash_function.jpg](https://user-images.githubusercontent.com/114739901/218287160-c01f5c31-3248-4cf1-aeef-324fa6c1a184.jpg)
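To see why collisions are unavoidable, here is a small sketch under an artificial assumption: we shrink MD5 down to its first byte, so there are only 256 possible outputs. The `tiny_hash` and `find_collision` names are hypothetical and exist only for this illustration. Hashing 257 different inputs must then produce at least one repeated output.

```python
import hashlib

def tiny_hash(data: str) -> str:
    """First 2 hex characters (1 byte) of MD5, so only 256 outputs are possible."""
    return hashlib.md5(data.encode()).hexdigest()[:2]

def find_collision():
    """Hash the strings '0', '1', ... until two of them share a tiny_hash value."""
    seen = {}  # maps hash output -> first input that produced it
    for i in range(257):  # 257 inputs, 256 outputs: a repeat is guaranteed
        m = str(i)
        h = tiny_hash(m)
        if h in seen:
            return seen[h], m, h
        seen[h] = m
    raise AssertionError("unreachable: 257 inputs cannot have 257 distinct 1-byte hashes")

m1, m2, h = find_collision()
print(f"tiny_hash({m1!r}) == tiny_hash({m2!r}) == {h!r}")
```

A real hash function has far more outputs than 256, so collisions are much harder to find, but the same counting argument shows they must exist.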

An "attack" on a hash function usually involves finding a *collision*, that is, two values $m_1$ and $m_2$ such that $m_1 \neq m_2$ and $H(m_1) = H(m_2)$. Say the adversary with access to the website database had the hash of your password, $H(p)$, and was able to find some value $k$ such that $H(k) = H(p)$. They could submit your username and the password $k$ to the server, which would then be accepted as the correct password in the eyes of the website!

We will quickly walk through finding a collision in a bad hash function.

In [None]:
# You don't need to do anything with this function besides analyze its functionality.
def hash_(x) -> str:
    """
    Returns the string version of argument `x`, padded with 'z' (or truncated) to exactly 6 characters.
    This is a terrible hash function (does not appear random). Do not use it in practice :)
    
    >>> hash_(5)
    '5zzzzz'
    
    >>> hash_("5z")
    '5zzzzz'
    
    """        
    return f'{str(x):z<6}'[:6]

In this question, pretend you are an adversary, Alice, trying to log into Bob's account on a given website. You have access to the website's password database, but all the passwords are hashed! You don't know anything about Bob except that his hashed password is "123zzz".

Your goal is to find an input `k_password` for the password hash function `hash_` that will return an output equivalent to Bob's hashed password "123zzz". **You cannot use Bob's real password of '123' as a `k_password` value**.

In [None]:
"""
Bob's password is 123, so hash_(123) will output '123zzz', which is what gets stored in the database.

What input (besides 123) to hash_ will serve as an equivalent password in the eyes 
of the website?
"""

website_password_hash_entry = "123zzz"
k_password = ...

In [None]:
hash_(k_password)

The server checks to see if the hash of the given password is equivalent to the stored hash.

In [None]:
assert hash_(k_password) == "123zzz"

## Rainbow Table Attacks
Recall that hashing is a one-way function, meaning you can get $H(m_1)$ from $m_1$, but you cannot get $m_1$ from $H(m_1)$.

A **rainbow table** is a table that maps every hash output to one of its inputs. Since the table is dependent on the hashing algorithm, every hash function will have a different rainbow table. You might be thinking, "Aren't there multiple inputs for every output? How will I map output to inputs?" When you are creating your rainbow table, you only need to know one of the many inputs for the reasons defined in **Question 0**.

Ultimately, a rainbow table is a workaround for getting $m_1$ from $H(m_1)$ by pre-computing the hash value $H(x)$ for every notable password $x$. Each pair is stored in a table with $H(x)$ as the key and $x$ as the value, so finding $x$ from $H(x)$ boils down to looking up the hash in the table.

In [None]:
# Helper function
def md5_hash(raw):
    return md5(raw.encode()).hexdigest()

In [None]:
password = "123"

# Run this cell
md5_hash(password)

For our **rainbow table** implementation, we will be using the [MD5](https://en.wikipedia.org/wiki/MD5) hashing algorithm. As you can see, the output string looks much more random than the output of our `hash_` function above. Feel free to test some different values for `password`. Notably, every input returns a fixed-length output that appears random.
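The fixed-length property is easy to check. The sketch below hashes a few inputs of very different lengths (the particular inputs are arbitrary choices for illustration) and prints the length of each MD5 hex digest, which is always 32 characters (128 bits).

```python
from hashlib import md5

# Inputs ranging from empty to 1000 characters long.
inputs = ["", "1", "123", "donut123", "a" * 1000]
digests = [md5(s.encode()).hexdigest() for s in inputs]

for s, d in zip(inputs, digests):
    print(f"{len(s):>4} chars -> {d} ({len(d)} hex chars)")
```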

While it is impossible to reverse the hash function and get _"123"_ from _'202cb962ac59075b964b07152d234b70'_, we want to make a table that contains millions of plaintext values and their corresponding hashes to see if we can find a match with our _'202cb962ac59075b964b07152d234b70'_ key. 

Let's try to do this by only accounting for numeric passwords **restricted to a maximum length of 3 digits**.

In [None]:
def gen_passwords(): # You don't need to do anything with this function besides analyze its functionality
    """
    Returns a list of every password (string) that can be formed using 1, 2, or 3 digits.
    '0', '022', and '31' are examples of passwords that satisfy these conditions.
    
    """
    base_case = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    combinations = list(base_case)  # copy, so the running total does not alias base_case
    for i in range(2):
        extended = []
        for j in range(10):
            extended.extend([str(j) + x for x in base_case])
        base_case = extended
        combinations += base_case
    return combinations

gen_passwords()

You can verify this on your own: counting the passwords that can be made (0 inclusive) using 1 digit, 2 digits, and 3 digits, we sum $10^1$, $10^2$, and $10^3$ and find that there are 1110 total possibilities for passwords given the password restrictions.
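One way to check the count independently of `gen_passwords` is to enumerate the same passwords with `itertools.product`. This is just a sketch; the `candidates` name is introduced here for illustration.

```python
from itertools import product

digits = "0123456789"

# Every string of length 1, 2, or 3 over the ten digits.
candidates = [
    "".join(p)
    for length in (1, 2, 3)
    for p in product(digits, repeat=length)
]

print(len(candidates))       # 10 + 100 + 1000 = 1110
print(len(set(candidates)))  # 1110, so there are no duplicates
```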

**Question 1.1**: Create an MD5 rainbow table for the possible 3-digit number passwords

Now that we know all of the possible passwords, let's compute the hash of each password and store the two together as a key-value pair inside of a dictionary.

In [None]:
def create_table_example():
    table = dict()
    
    # Create a rainbow table for the possible 3-digit number passwords where the key is the hash
    for pw in gen_passwords():
        ...
    
    return table

In [None]:
grader.check("q1_1")

Let's see what our table looks like. Run the following cell.

In [None]:
create_table_example()

Once again, hash functions are one-way functions, meaning we cannot compute the input given the output. Despite this, nothing stops us from storing as many common passwords as we like alongside their hashes in a lookup table, and then checking that table for a match when we want to get the input (the password) from the output (the password hash).

In the case above, we knew the constraints of the password and were able to generate a list of every possible hash. In a standard rainbow table attack, it is impossible for the adversary to know every possible password (if this were the case, they could brute force the login). Therefore, the adversary would need to compile another list of common passwords (available online) to create a rainbow table of those common passwords with the hash of the given hash function.

**Question 1.2**: Use the rainbow table to find Eve's password

Haha! You are a bad guy and you have access to the website's password database! As you are sifting through all the entries, this specific pair stands out to you for some reason:
```python
{... 
'63538fe6ef330c13a05a3ed7e599d5f7': 'Eve',
...}
```
Well, we are smart enough to know that _"63538fe6e..."_ is the hashed version of Eve's password, so we cannot use that to log into Eve's account. We can't hash _"63538fe6e..."_ since that will spit out another string of random characters... How can we find Eve's real password then? Good thing we've made a rainbow table for this function already!

**Question 1.2**: Implement `find_correct_password`, which takes in the target hash and a rainbow table to return the corresponding plaintext value.

In [None]:
eve_password_hashed = '63538fe6ef330c13a05a3ed7e599d5f7'
rainbow_table = create_table_example()

def find_correct_password(target_hash, rainbow_table):
    ...

In [None]:
find_correct_password(eve_password_hashed, rainbow_table)

In [None]:
grader.check("q1_2")

Nice! Knowing only the hashed version of the password, we were able to recover the password itself! Now we can log into Eve's account and transfer ourselves some cryptocurrency without any worry.

**Okay... So our last attack worked beautifully, but we cannot neglect the fact that real passwords will never be restricted to something as simple as 3-digit numbers**... Let's repeat the same exercise with a list of common passwords.

In [None]:
with open('passwords.txt') as file:
    passwords = file.read().splitlines()  # one password per line, no trailing empty entry

In [None]:
def create_table():
    table = dict()
        
    # Create an MD5 rainbow table for common passwords
    for pw in passwords:
        table[md5_hash(pw)] = pw
    
    return table

In [None]:
common_rainbow_table = create_table()

Once again, imagine a scenario where you are an adversary who has access to the website's database. As you are sifting through all the entries, this specific pair stands out to you for some reason:
```python
{... 
'bed128365216c019988915ed3add75fb': 'Bob',
...}
```
Let's use this to find Bob's password!

In [None]:
bob_password_hashed = 'bed128365216c019988915ed3add75fb'
find_correct_password(bob_password_hashed, common_rainbow_table)

## Salting

Imagine a scenario where you are the website responsible for storing your users' login information, but you are aware that an adversary with access to your database could deploy a rainbow table attack to gain access to your users' original passwords. What can you do to prevent this from happening? Is the rainbow table attack the downfall of your business? How can you change the hash so that it cannot be reverse-engineered into the original password?

**As the website, what if you were to concatenate a random string of text (only known to you) onto each user's password before you hash it and store the hashed version?** Using this approach, the hash keys in the rainbow table will almost never match a stored hash.

The user inputs $k$ as a password while signing up and we store $H(k||s)$ in our database, where $s$ denotes a random string of characters only known to us and not the adversary.

**Question 2.1**: Implement a function that finds the hash of a password using a given salt, as $H(k || s)$.

In [None]:
# Instead of returning H(k) given k, return H(k||s).
def get_password_hash(user_password, salt) -> str:
    ...

In [None]:
get_password_hash("password", "e66cf0fec6")

In [None]:
grader.check("q2_1")

In the case of Bob's original password of _"123"_, the value stored in the website database will be $H(\text{123e66cf0fec6})$, which is very different from the value of $H(\text{123})$.

An adversary deploying a rainbow table attack will not be able to recover the original password from the $H(\text{123e66cf0fec6})$ hash, even though their table easily compromised the unsalted $H(\text{123})$ hash.
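The effect can be illustrated with a small sketch. The `salted_hash` helper and the three-entry `unsalted_table` are hypothetical stand-ins chosen for this example; they are not the lab's graded functions. The table holds unsalted MD5 hashes, so it contains the hash of _"123"_ but not the hash of _"123"_ with the salt appended.

```python
from hashlib import md5

def salted_hash(password: str, salt: str) -> str:
    """H(k || s): MD5 of the password with the salt appended (illustrative helper)."""
    return md5((password + salt).encode()).hexdigest()

salt = "e66cf0fec6"  # the example salt used in the text

# Attacker's precomputed table of UNSALTED hashes for a few guessed passwords.
guesses = ["123", "password", "donut123"]
unsalted_table = {md5(pw.encode()).hexdigest(): pw for pw in guesses}

unsalted_entry = md5("123".encode()).hexdigest()  # what a site without salting stores
salted_entry = salted_hash("123", salt)            # what a site with salting stores

print(unsalted_table.get(unsalted_entry))  # '123'  (lookup succeeds)
print(unsalted_table.get(salted_entry))    # None   (lookup fails)
```

The attacker would have to rebuild the whole table for this specific salt, which is only possible if they learn the salt.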

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Once you have generated the zip file, go to the Gradescope page for this assignment to submit.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)