**[replace this text with your name(s) and your ID number(s)]**

# Homework 4

*Due date:* March 20, 2024 (Wednesday) at 8 PM on CodePost

Encrypt-and-MAC? MAC-then-encrypt? Or encrypt-*then*-MAC?
Each scheme may have its own merits, but one thing's for sure: [the wrong choice will somehow inevitably lead to doom](https://moxie.org/2011/12/13/the-cryptographic-doom-principle.html).
Such is the fate of FailHealth, a well-known local healthcare company that recently suffered a nasty data breach.
It has been reported all over the news that hackers were able to exfiltrate account details of its fifty thousand customers, all because the company was not up-to-date on its security practices.

The website, which is hosted at http://hw4.lunchtimeattack.wtf, provides an API that allows its client-side applications to download files by loading URLs such as:

```
http://hw4.lunchtimeattack.wtf/download?token=b5e6179eb44d634781740a7a61145ec95a4ff967&file_name=hw4_part1.ipynb
```

To validate the token, the web server parses the full query as:
```
query_string = token=token_string&rest_of_query_string
```
and then checks whether
```
token_string == hex(SHA-1(API_KEY || unquote_to_bytes(rest_of_query_string)))
```
where `API_KEY` is a 256-bit secret key only known to the server. 
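The check described above can be sketched as follows. This is only an illustration, not the actual FailHealth server code: the all-zero `API_KEY` is a hypothetical placeholder, and the helper names `make_token` and `is_valid` are made up for this example.

```python
import hashlib
from urllib.parse import unquote_to_bytes

# Hypothetical stand-in: the real API_KEY is known only to the server.
API_KEY = bytes(32)  # 256 bits


def make_token(rest_of_query_string: str) -> str:
    """Return hex(SHA-1(API_KEY || unquote_to_bytes(rest_of_query_string)))."""
    data = API_KEY + unquote_to_bytes(rest_of_query_string)
    return hashlib.sha1(data).hexdigest()


def is_valid(query_string: str) -> bool:
    """Split 'token=<hex>&<rest>' at the first '&' and recompute the token."""
    head, _, rest = query_string.partition("&")
    if not head.startswith("token="):
        return False
    return head[len("token="):] == make_token(rest)


rest = "file_name=hw4_part1.ipynb"
good_query = "token=" + make_token(rest) + "&" + rest
bad_query = "token=" + "0" * 40 + "&" + rest
print(is_valid(good_query), is_valid(bad_query))
```

Note that everything after the first `&` is hashed as raw bytes, which is what lets extra parameters and binary padding ride along in the query.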

Unfortunately for this insecure attempt at MAC construction, SHA-1 is vulnerable to length extension attacks.
Your task for this homework is to carry out a length extension attack that appends the string `&file_name=hw4_part2.txt` to the above API query and forges a corresponding token that will validate against the server, thus allowing you to download the rest of your homework.
Your attack will necessarily need to include some binary garbage in the API string; this is fine, since the server won't notice as long as it's appropriately URL-encoded.
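To see what "appropriately URL-encoded" means for binary data, here is a small round trip using the standard library. The byte string `raw` is an arbitrary illustrative value, not the padding your attack will actually need.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Arbitrary example bytes: printable text followed by non-printable bytes.
raw = b"hw4_part1.ipynb\x80\x00\x00\x01\xc8"

# Percent-encode every byte that is not safe to place in a URL.
encoded = quote_from_bytes(raw)
print(encoded)  # hw4_part1.ipynb%80%00%00%01%C8

# Decoding recovers the exact original bytes.
decoded = unquote_to_bytes(encoded)
print(decoded == raw)  # True
```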

**Part 1 is worth 22 points and Part 2 is worth 10 points,** for a total of 32 points.
Your score will be divided by 30 to compute the final percentage, which is capped at 100%.

Please review the policies regarding late submissions, regrading, and collaboration.
Direct any questions or requests for clarification about this homework to the `#hw4-help` channel on the Discord server.

## Some reminders

You are not allowed to use additional third-party libraries other than those explicitly used here, though libraries within the Python standard library are fair game.

Although there are Python libraries that automagically perform the length extension attack for you, for this homework you are expected to implement the logic of the attack yourself.

**Very important:** When dealing with hash functions, always work with raw bytes, never with encoded strings.

## Some background

In most applications, you should use MACs such as HMAC-SHA256 instead of plain cryptographic hash functions (e.g., MD5, SHA-1, or SHA-256) because hashes fail to match our intuitive security expectations.
What we really want is something that behaves like a pseudorandom function, which HMACs seem to approximate and hash functions do not.
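For comparison, a minimal sketch of computing and checking an HMAC-SHA256 tag with Python's standard `hmac` module is shown below. The all-zero key and the helper names `tag` and `verify` are assumptions made for illustration only.

```python
import hashlib
import hmac

# Hypothetical placeholder key; a real key must be random and kept secret.
key = bytes(32)


def tag(message: bytes) -> str:
    """Return the HMAC-SHA256 tag of the message as a hex string."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, candidate: str) -> bool:
    """Compare tags in constant time to avoid leaking timing information."""
    return hmac.compare_digest(tag(message), candidate)


t = tag(b"file_name=hw4_part1.ipynb")
print(verify(b"file_name=hw4_part1.ipynb", t))  # True
print(verify(b"file_name=hw4_part2.txt", t))    # False
```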

One difference between hash functions and pseudorandom functions is that many hashes are subject to *length extension*.
Many common hash functions use a design called the Merkle–Damgård construction.
Each is built around a compression function $h$ and maintains an internal state $t$, which is initialized to a fixed constant.
Messages are processed in fixed-size blocks by applying the compression function to the current state and current block to compute an updated internal state, i.e., $t_{i+1} := h(t_i, m_i)$.
The result of the final application of the compression function becomes the output of the hash function.

A consequence of this design is that if we know the hash of an $n$-block message, we can find the hash of longer messages by applying the compression function for each block $m_{n+1}, m_{n+2}, \dots$ that we want to add.
This process is called length extension, and it can be used to attack many applications of hash functions.
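The construction and its weakness can be illustrated with a toy Merkle–Damgård hash. Everything here is an assumption made for demonstration: the 16-byte block size, the 8-byte state, and the compression function `h` (truncated SHA-256) are not part of any real hash function.

```python
import hashlib

BLOCK = 16      # toy block size in bytes
IV = bytes(8)   # toy fixed initial state


def h(state: bytes, block: bytes) -> bytes:
    """Toy compression function: (8-byte state, 16-byte block) -> 8-byte state."""
    return hashlib.sha256(state + block).digest()[:8]


def pad(n_bytes: int) -> bytes:
    """0x80, zero bytes, then the 8-byte big-endian bit length."""
    zeros = (BLOCK - 9 - n_bytes) % BLOCK
    return b"\x80" + bytes(zeros) + (n_bytes * 8).to_bytes(8, "big")


def toy_hash(msg: bytes, state: bytes = IV, prior_bytes: int = 0) -> bytes:
    """Iterate t_{i+1} = h(t_i, m_i) over the padded message."""
    data = msg + pad(prior_bytes + len(msg))
    for i in range(0, len(data), BLOCK):
        state = h(state, data[i:i + BLOCK])
    return state


secret_msg = b"unknown to the attacker"
digest = toy_hash(secret_msg)      # the attacker sees only this...
glue = pad(len(secret_msg))        # ...and knows (or guesses) the length
suffix = b"appended"

# Resume from the published digest instead of the fixed IV.
forged = toy_hash(suffix, state=digest,
                  prior_bytes=len(secret_msg) + len(glue))
print(forged == toy_hash(secret_msg + glue + suffix))  # True
```

The forged value is computed without ever reading `secret_msg` itself, only its digest and length.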

Length extension attacks can cause serious vulnerabilities when people mistakenly try to construct something like an HMAC by using `H(secret || message)`.
In 2009, security researchers found that the API used by the photo-sharing site Flickr suffered from a length-extension vulnerability almost exactly like the one in this homework.

Here are some additional resources that might help:
* https://blog.skullsecurity.org/2012/everything-you-need-to-know-about-hash-length-extension-attacks
* Here's the original NIST publication (FIPS 180-4) that defines SHA-1 and other Secure Hash algorithms: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
* The source code of the FailHealth web server can be found here: https://gist.github.com/alltootechnical/a33723a808bca0106250a22944770fd2

## Getting started

Here are the libraries and functions we might need for later:

In [2]:
from binascii import hexlify, unhexlify
import requests

In [3]:
from urllib.parse import quote_from_bytes, unquote_to_bytes, urlparse, parse_qs

For this homework, we will be using a (pure) Python implementation of the SHA-1 hash function, instead of relying on the built-in `hashlib` library.
You can download the `sha1` module from the following link: https://gist.github.com/alltootechnical/f7b4c04f005412d4d7c00c543b925feb.
(Once downloaded, place the `sha1.py` file in the same directory as this notebook.)

Try to follow along so that you can familiarize yourself with this particular implementation.

Consider the string `hash functions are super secure`.
We can compute its SHA-1 hash by running:

In [4]:
from sha1 import SHA1, padding

In [5]:
m = b'hash functions are super secure'

In [6]:
sha1 = SHA1()
sha1.update(m)
sha1.hexdigest()

'3d3dc78f186e3815775ffc44104b348d9eb77c5c'

The output should be `3d3dc78f186e3815775ffc44104b348d9eb77c5c`.

SHA-1 processes messages in 512-bit blocks, so internally, the hash function pads `m` to a multiple of that length.
The padding consists of the bit `1`, followed by as many `0` bits as necessary, followed by a 64-bit count of the number of bits in the unpadded message.
(If the `1` and the count won't fit in the current block, an additional block is added.) 

For your convenience, you can use the function `padding(count)` in the `sha1` module to compute the padding that will be added to a `count`-bit message.
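If you want to see how such padding could be built by hand, here is a rough re-implementation. The name `pad_for_bits` is hypothetical and this is not the code of the course's `sha1` module; it is only assumed to mirror the same "bit count in, padding bytes out" behavior for whole-byte messages.

```python
def pad_for_bits(bit_count: int) -> bytes:
    """Padding appended to a message of `bit_count` bits (whole bytes only)."""
    if bit_count % 8:
        raise ValueError("this sketch only handles whole-byte messages")
    n_bytes = bit_count // 8
    # One 0x80 byte, then zeros until 8 bytes remain in the final block.
    zeros = (55 - n_bytes) % 64
    return b"\x80" + b"\x00" * zeros + bit_count.to_bytes(8, "big")


m = b'hash functions are super secure'   # 31 bytes, i.e. 248 bits
pad = pad_for_bits(len(m) * 8)
print(len(pad))                   # 33
print((len(m) + len(pad)) % 64)   # 0
print(pad[-8:].hex())             # 00000000000000f8
```

A 55-byte message still fits in one block (9 bytes of padding), while a 56-byte message spills into a second block (72 bytes of padding).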

Even if we didn't know `m`, we could compute the hash of longer messages of the general form `m || padding(len(m)*8) || suffix` by setting the initial internal state of our SHA-1 function to `SHA-1(m)`, instead of the default magic constants, and setting the function's message length counter to the size of `m` plus the padding (a multiple of the block size).

To find the padded message length (in bits), guess the length of `m` and run `bits = (len(m) + len(padding(len(m) * 8))) * 8`. If you want that in bytes, just divide by 8.
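As a worked example of this arithmetic, the padded length can also be computed in closed form without calling `padding` at all. The helper name `padded_length_bytes` is an assumption for this sketch.

```python
def padded_length_bytes(n_bytes: int) -> int:
    """Message length after SHA-1 padding, rounded up to whole 64-byte blocks.

    The padding always needs at least 9 bytes: one 0x80 byte plus the
    8-byte bit count.
    """
    return ((n_bytes + 8) // 64 + 1) * 64


for guess in (31, 55, 56, 57):
    total = padded_length_bytes(guess)
    print(guess, total, total * 8)
# 31 64 512
# 55 64 512
# 56 128 1024
# 57 128 1024
```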

## 4-1. Length extension warm-ups [5 pts]

The `SHA1` object constructor takes two optional parameters: `state`, which overrides the internal state, and `length`, which serves as a counter of the message bytes that have been processed so far.

**(a) [2 pts]** Construct a new `SHA1` object whose initial state is the SHA-1 hash of `m` (as previously defined) and whose message length counter is the size of `m` plus the padding.

In [7]:
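# Hash m normally to obtain the digest an attacker would observe.
sha1 = SHA1()
sha1.update(m)
init_state = sha1.hexdigest()

# Bytes processed so far: the message plus SHA-1's own padding.
m_length = len(m) + len(padding(len(m) * 8))
new_sha1 = SHA1(state=init_state, length=m_length)
print(new_sha1)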

<sha1.SHA1 object at 0x0000019AD57A3090>


**(b) [3 pts]** Using the new `SHA1` hash object you just constructed, hash a new message `, unless it is SHA-1`.
Afterwards, verify that it equals the SHA-1 hash of `m || padding(len(m)*8) || new_msg`.

In [8]:
new_msg = b', unless it is SHA-1'

In [9]:
new_sha1.update(new_msg)
hash_msg = new_sha1.hexdigest()

print(hash_msg)

7939c0d240c9e8fda736706524a6303fd3c5b5ad


In [10]:
msg_concat = m + padding(len(m) * 8) + new_msg
exp_sha1 = SHA1()
exp_sha1.update(msg_concat)
exp_msg = exp_sha1.hexdigest()

print(exp_msg)

7939c0d240c9e8fda736706524a6303fd3c5b5ad


In [11]:
assert hash_msg == exp_msg

Notice that, due to the length-extension property of SHA-1, we didn't need to know the value of `m` to compute the hash of the longer string; all we needed to know was `m`'s length and its SHA-1 hash.
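The same property can be checked independently of the course's `sha1` module. The sketch below is a compact SHA-1 written from the FIPS 180-4 description, with two extra parameters (`state` and `prior_bytes`, names chosen for this example) that let hashing resume from a known digest. The 32-byte `secret` is a hypothetical stand-in used only so the result can be verified against `hashlib`.

```python
import hashlib
import struct

IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def _rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def _compress(state, block):
    """Apply the SHA-1 compression function to one 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 80):
        w.append(_rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
    a, b, c, d, e = state
    for i in range(80):
        if i < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif i < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif i < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        temp = (_rotl(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF
        a, b, c, d, e = temp, a, _rotl(b, 30), c, d
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e)))


def md_pad(n_bytes):
    """SHA-1 padding for a message of n_bytes bytes."""
    zeros = (55 - n_bytes) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack(">Q", n_bytes * 8)


def sha1_hex(data, state=IV, prior_bytes=0):
    """Hash data, optionally resuming from `state` after `prior_bytes` bytes."""
    data += md_pad(prior_bytes + len(data))
    for i in range(0, len(data), 64):
        state = _compress(state, data[i:i + 64])
    return struct.pack(">5I", *state).hex()


secret = b"k" * 32                 # hypothetical key, unknown to an attacker
known = b"file_name=a.txt"
digest = hashlib.sha1(secret + known).digest()

# The attacker needs only the digest and the total length.
glue = md_pad(len(secret) + len(known))
suffix = b"&file_name=b.txt"
forged = sha1_hex(suffix,
                  state=struct.unpack(">5I", digest),
                  prior_bytes=len(secret) + len(known) + len(glue))
print(forged == hashlib.sha1(secret + known + glue + suffix).hexdigest())  # True
```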

## 4-2. HTTP responses for bad queries [3 pts]

For this item, explore how the server responds whenever it receives an ill-formed query.

In a separate Markdown cell provided below, describe and briefly explain what happens if:
- **(a)** The token is missing from the query string.
- **(b)** The file name is missing from the query string.
- **(c)** You try to download a file using a token for another file (i.e., the token doesn't correspond to the file you're trying to download).

**Answer for 4-2:** \
**(a)** 401 Unauthorized - No Token Provided \
**(b)** 400 Bad Request - No File Name Provided \
**(c)** 401 Unauthorized - Invalid Token
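One possible way to reproduce these observations is sketched below using only the standard library. The helper names `bad_queries` and `probe` are made up for this example, and pairing the `hw4_part1.ipynb` token with `solutions.pdf` is just one choice of mismatched file. Nothing is sent until `probe` is called.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://hw4.lunchtimeattack.wtf/download"
TOKEN = "b5e6179eb44d634781740a7a61145ec95a4ff967"  # token for hw4_part1.ipynb


def bad_queries() -> dict:
    """Build one ill-formed URL for each of cases (a), (b), and (c)."""
    return {
        "(a) missing token":
            BASE + "?" + urlencode({"file_name": "hw4_part1.ipynb"}),
        "(b) missing file name":
            BASE + "?" + urlencode({"token": TOKEN}),
        "(c) token for another file":
            BASE + "?" + urlencode({"token": TOKEN,
                                    "file_name": "solutions.pdf"}),
    }


def probe(url: str):
    """Return (status code, body text) without raising on 4xx/5xx responses."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode(errors="replace")
    except HTTPError as err:
        return err.code, err.read().decode(errors="replace")


# Example usage (requires network access to the homework server):
# for label, url in bad_queries().items():
#     print(label, probe(url))
```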

## 4-3. Doing the length extension attack [12 pts]

Write a function called `len_ext_attack` that, given a valid URL (as an ordinary string, not as a bytestring) in the same form as the one found at the start of the notebook, modifies the URL so that it will download the `hw4_part2.txt` file, then returns the new URL.

Use the `quote_from_bytes` function (using `raw_unicode_escape` encoding) to encode non-ASCII data within the URL.

*Pro-tip*: To more easily parse a URL, you may use the functions `urlparse` and `parse_qs` from the `urllib.parse` module.
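As a quick illustration of these two helpers, here is what they return for the example URL from the start of the notebook. The variable names are arbitrary.

```python
from urllib.parse import parse_qs, urlparse

url = ("http://hw4.lunchtimeattack.wtf/download"
       "?token=b5e6179eb44d634781740a7a61145ec95a4ff967"
       "&file_name=hw4_part1.ipynb")

parts = urlparse(url)
print(parts.scheme)   # http
print(parts.netloc)   # hw4.lunchtimeattack.wtf
print(parts.path)     # /download  (note the leading slash)

# parse_qs maps each parameter name to a list of percent-decoded values.
params = parse_qs(parts.query)
print(params["token"][0])      # b5e6179eb44d634781740a7a61145ec95a4ff967
print(params["file_name"][0])  # hw4_part1.ipynb

# Percent-escapes are decoded, so '%20' comes back as a space.
print(parse_qs("file_name=picture%20of%20hotdog.png"))
# {'file_name': ['picture of hotdog.png']}

# To mirror the server's "token=...&rest" split, cut at the first '&'.
token_field, _, rest = parts.query.partition("&")
print(rest)  # file_name=hw4_part1.ipynb
```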

In [188]:
def len_ext_attack(orig_url):
    parsed_url = urlparse(orig_url)
    query = parse_qs(parsed_url.query)

    # Original token and file name (parse_qs returns percent-decoded strings).
    TOKEN = query.get('token', [''])[0]
    FILE = query.get('file_name', [''])[0]
    REQUEST = '&file_name=hw4_part2.txt'

    # Length in bytes of what the server hashed: API_KEY || "file_name=<FILE>".
    query_length = len(unquote_to_bytes(f'file_name={FILE}'))
    API_length = 256 // 8
    secret_msg_length = query_length + API_length

    # Bytes already processed once SHA-1's own padding is counted.
    hash_length = secret_msg_length + len(padding(secret_msg_length * 8))

    # Resume hashing from the known digest and absorb the appended parameter.
    custom_sha1 = SHA1(state=TOKEN, length=hash_length)
    custom_sha1.update(unquote_to_bytes(REQUEST))
    new_token = custom_sha1.hexdigest()

    # parsed_url.path already begins with '/', so no extra separator is needed.
    file_part = quote_from_bytes(FILE.encode('raw_unicode_escape'), safe=':/?=&')
    glue = quote_from_bytes(padding(secret_msg_length * 8))
    url = (f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}"
           f"?token={new_token}&file_name={file_part}{glue}{REQUEST}")
    return url

In [189]:
len_ext_attack("http://hw4.lunchtimeattack.wtf/download?token=b5e6179eb44d634781740a7a61145ec95a4ff967&file_name=hw4_part1.ipynb")

'http://hw4.lunchtimeattack.wtf/download?token=09ad9db4f4b877e9e7c958979d64f69ca55a4f0b&file_name=hw4_part1.ipynb%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01%C8&file_name=hw4_part2.txt'

## 4-4. Making sure it's not a fluke... [2 pts]

Go to the FailHealth website and pick another valid download link (of your choice, but should be different from `hw4_part1.ipynb`), and carry out a length extension attack to download the `hw4_part2.txt` file using your `len_ext_attack` function.
Print out the new URL for this case, and verify that this new URL also works.

In [190]:
len_ext_attack("http://hw4.lunchtimeattack.wtf/download?token=3d77561ec4e40b8f1fe9752f555ecb56e842eb93&file_name=account_data.tar.gz")

'http://hw4.lunchtimeattack.wtf/download?token=529386757d238473e599a0e77f288d606b9a3f6d&file_name=account_data.tar.gz%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01%E8&file_name=hw4_part2.txt'

In [191]:
len_ext_attack("http://hw4.lunchtimeattack.wtf/download?token=7a68ef4b896191beae031a7d6d3ce4efb1e5fe8a&file_name=picture%20of%20hotdog.png")

'http://hw4.lunchtimeattack.wtf/download?token=33d874d732e5c7b52b90fa55730d5a979633663f&file_name=picture%20of%20hotdog.png%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01%F8&file_name=hw4_part2.txt'

In [192]:
len_ext_attack("http://hw4.lunchtimeattack.wtf/download?token=c41ca39b4a4cfa0b898a4376c3a6dc669cf37f6e&file_name=Tux-ECB.png")

'http://hw4.lunchtimeattack.wtf/download?token=b190468a5de4bac72c0fbbd191a75f25718a29ce&file_name=Tux-ECB.png%80%00%00%00%00%00%00%00%00%01%A8&file_name=hw4_part2.txt'

In [193]:
len_ext_attack("http://hw4.lunchtimeattack.wtf/download?token=a67425f21064d4bf21999ad945c109294faac2d2&file_name=solutions.pdf")

'http://hw4.lunchtimeattack.wtf/download?token=ac65c1c74a03c9dfc9678657ee043d67e76e9ca3&file_name=solutions.pdf%80%00%00%00%00%00%00%01%B8&file_name=hw4_part2.txt'

## 4-5. Part 2 [10 pts]

Go to the new URL that your function returned to download the `hw4_part2.txt` file.
Submit your answers for Part 2 on the same CodePost assignment with the filename `hw4_part2_answers.txt`.