In [1]:
import sys
sys.path.append('..')

from common.utility import show_implementation

While the [Authentication chapter](./authentication.ipynb) focused mainly on entity authentication, this chapter focuses on data-origin authentication.

# Public Key Cryptography (PKC)
A public-key scheme, also known as an asymmetric-key scheme, uses 2 different keys for encryption and decryption.
Contrast this with the ciphers seen in the [encryption chapter](./classical_ciphers.ipynb), which use the same key for both encryption and decryption.

In typical usage, one of the keys (the public key) is made known to everyone, while the other (the secret key) is kept secret.

This allows the following usage:
1. Alice posts her public key to everyone
2. Bob sees this public key, and encrypts and sends his message to Alice using the public key through an insecure medium
3. Eve, despite being able to see both the ciphertext and the public key, is unable to decrypt the message because she does not have the secret key
4. Alice, being the only person to know the secret key, can decrypt the ciphertext from Bob

Formally, it is defined as:
* $K_e$: set of all public keys
* $K_d$: set of all secret keys
* $M$: set of all plaintexts
* $C$: set of all ciphertexts
* $G$: key generation algorithm that generates a key pair $(k_e, k_d)$
* $E$: encryption algorithm: $M, K_e \rightarrow C$
* $D$: decryption algorithm: $C, K_d \rightarrow M$

### Requirements
* Correctness: For any $m\in M$ and $(k_e, k_d)$ generated by $G$, $D_{k_d}(E_{k_e}(m))  = m$
* Efficiency: $G,E,D$ are performed "relatively fast", usually polynomial time
* Security: $E$ is a one-way function, meaning it can be easily computed with $k_e$, but it is difficult to invert (that is, to compute $D$) without knowing $k_d$.


## Key Management
Suppose we have $n$ users, and we have the requirement that each user should be able to send an encrypted message that only their recipient can decrypt.

Using symmetric keys, we would require $C^n_2 = \frac{n(n-1)}{2}$ different (symmetric) keys, one shared between each pair of users in the system.

Contrast this with a public key system, where each user only needs to broadcast their public key and keep their private key secret.
This means there will be $n$ public keys and $n$ secret keys in the system, totaling $2n$ keys, which is far fewer than in the symmetric setting once $n$ is large.
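To get a feel for the difference, the two counts can be tabulated for a few system sizes. This is only an illustrative sketch; the function names are made up for this example.

```python
from math import comb

def symmetric_key_count(n_users: int) -> int:
    """One shared key per unordered pair of users: C(n, 2) = n(n-1)/2."""
    return comb(n_users, 2)

def public_key_count(n_users: int) -> int:
    """One public key plus one secret key per user."""
    return 2 * n_users

# Hypothetical system sizes, chosen only to show the growth rates.
for n_users in (5, 100, 10_000):
    print(n_users, symmetric_key_count(n_users), public_key_count(n_users))
```

The symmetric count grows quadratically while the public-key count grows linearly, so the gap only opens up as the number of users grows.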

## RSA
Disclaimer: We are discussing the textbook version of RSA, which is more basic than the implementations of RSA found in the wild.

PKC systems usually represent the plaintext, ciphertext and keys all as integers.
The algorithms are thus all arithmetic operations on integers.

### Setup
1. The key owner randomly chooses 2 large distinct primes, $p,q$.
2. They compute the **public composite modulus**, $n = pq$.
3. They compute $\phi(n) = (p-1)(q-1)$, which is the number of positive integers $<n$ that are relatively prime to $n$.
$\phi(n)$ conveniently equals $(p-1)(q-1)$ when $p$ and $q$ are distinct primes.
4. They randomly choose an **encryption exponent**, $e$, such that $\gcd(e, \phi(n)) = 1$, meaning $e$ is relatively prime to $\phi(n)$.
5. They now compute the **decryption exponent**, $d$, where $de \equiv 1 \mod \phi(n)$.
This is typically obtained via the [extended Euclid algorithm](../discrete_structures/integers.ipynb#extended-euclids-algorithm).
6. The owner now broadcasts $(n,e)$ as the public key and keeps $(n, d)$ as the private key.

### Encryption
We require that the message is smaller than the public modulus, that is $m < n$.

Given a message $m$, the sender can encrypt the message to obtain the ciphertext by $c \equiv m^e \mod n$

### Decryption
Given a ciphertext $c$, the receiver can decrypt it to obtain the plaintext by $m \equiv c^d \mod n$.

### Proof
We require Euler's theorem, which states that:

if $n > 0$ and $m$ is relatively prime to $n$, then $m^{\phi(n)} \equiv 1 \mod n$

We know that:
\begin{align}
de \equiv 1 \mod \phi(n) &\Rightarrow de - 1 \equiv 0 \mod \phi(n) \\
&\Rightarrow de - 1 = k \times \phi(n) \quad \text{for some integer k}\\
&\Rightarrow de = k \times \phi(n) + 1 \\
\end{align}

Thus, assuming $m$ is relatively prime to $n$

\begin{align}
(m^e)^d \mod n &\equiv m^{de}\\
&\equiv m^{k \times \phi(n) + 1}\\
&\equiv m^{k \times \phi(n)} \times m\\
&\equiv (m^{\phi(n)})^k \times m\\
&\equiv 1^k \times m \quad \text{by Euler's theorem}\\
&\equiv m \mod n\\ 
\end{align}

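And if $m$ is not relatively prime to $n$ (highly unlikely), then since $n$ is a product of 2 primes and $0 < m < n$, $m$ must be divisible by exactly one of $p,q$ and relatively prime to the other.
Say $p$ divides $m$: then $m^{de} \equiv 0 \equiv m \mod p$, while the same argument as above, applied modulo $q$ with $\phi(q) = q - 1$, gives $m^{de} \equiv (m^{q-1})^{k(p-1)} \times m \equiv m \mod q$.
Since $p$ and $q$ are distinct primes, agreeing modulo both implies $m^{de} \equiv m \mod n$ (and the case $m = 0$ is trivial).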
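The claim can also be checked exhaustively for a toy key. The sketch below assumes the small parameters $p=5$, $q=17$, $e=3$ (chosen only for illustration) and uses Python's built-in `pow`, including its modular-inverse form, which needs Python 3.8 or later.

```python
# Toy parameters, assumed for illustration only.
p, q, e = 5, 17, 3
n = p * q
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)  # modular inverse of e modulo phi (Python 3.8+)

# Round-trip every message 0 <= m < n, including multiples of p or q.
failures = [m for m in range(n) if pow(pow(m, e, n), d, n) != m]

# Messages that are NOT relatively prime to n (the "unlikely" case).
not_coprime = [m for m in range(1, n) if m % p == 0 or m % q == 0]

print("d =", d)
print("messages that fail to round-trip:", failures)
print("messages sharing a factor with n:", len(not_coprime))
```

With such a small modulus the non-coprime messages are a noticeable fraction of the message space, yet every one of them still decrypts correctly.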

In [2]:
import module.rsa as rsa

show_implementation(rsa)

from .euclids_algorithm import extended_euclid 
from .is_prime import is_prime

def modulo_inverse(e, n):
    gcd, p, _ = extended_euclid(e, n)
    if gcd != 1:
        raise ValueError(f"e ({e}) is not relatively prime to n ({n})")
    return (p + n) % n
        
def make_keys(p, q, e):
    assert is_prime(p), "p is not prime"
    assert is_prime(q), "q is not prime"
    n = p * q
    phi = (p - 1) * (q - 1)
    d = modulo_inverse(e, phi)
    return (e, n), (d, n)

def encrypt(m, public_key):
    e, n = public_key
    return pow(m, e, n)

def decrypt(m, private_key):
    d, n = private_key
    return pow(m, d, n)


In [3]:
public_key, private_key = rsa.make_keys(p=5, q=17, e=3)
plaintext = 42
ciphertext = rsa.encrypt(plaintext, public_key)
decrypted_ciphertext = rsa.decrypt(ciphertext, private_key)
print("public key:", public_key)
print("private key:", private_key)
print("plaintext:", plaintext)
print("ciphertext:", ciphertext)
print("decrypted ciphertext:", decrypted_ciphertext)

public key: (3, 85)
private key: (43, 85)
plaintext: 42
ciphertext: 53
decrypted ciphertext: 42


Note: When playing with RSA, you might encounter situations where `ciphertext == plaintext`, or where decrypting the ciphertext with the public key also yields the plaintext.
However, these are artifacts of using small primes $p,q$, which make such events more frequent.
When sufficiently large primes are used (such as in RSA in the wild), these events become extremely rare.

Also, notice that since $e$ and $d$ play symmetric roles in $de \equiv 1 \mod \phi(n)$, they are interchangeable.
This means we could have encrypted using $d$ instead, and decrypted using $e$.
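Both observations can be seen on a toy key. The following sketch assumes $p=5$, $q=17$, $e=3$ and the variable names are illustrative; it lists the messages that encrypt to themselves and checks that swapping the roles of $e$ and $d$ still round-trips.

```python
# Toy parameters, assumed for illustration only.
p, q, e = 5, 17, 3
n, phi = p * q, (p - 1) * (q - 1)
d = pow(e, -1, phi)  # Python 3.8+

# Messages for which ciphertext == plaintext under exponent e.
fixed_points = [m for m in range(n) if pow(m, e, n) == m]

# Swap the roles: "encrypt" with d, then "decrypt" with e.
swapped_ok = all(pow(pow(m, d, n), e, n) == m for m in range(n))

print("fixed points:", fixed_points)
print("roles of e and d interchangeable:", swapped_ok)
```

For this modulus a sizeable share of the 85 possible messages are fixed points, which is why such coincidences show up so readily with small primes.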

### Time complexity of RSA
#### Key generation
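For a small range, we can precompute all the primes within that range and store them as a set.
To obtain 2 random primes, we simply select 2 distinct random elements from this set.
(For the key sizes used in practice this set would be far too large to store, so implementations instead draw random candidates and test each one for primality.)

To test that $e$ is coprime to $\phi(n)$ and to find the modular inverse of $e$, we use the extended Euclid's algorithm, which runs in $O(\log n)$ steps.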
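As a rough sketch of this step, an iterative extended Euclid routine can return both the gcd and the coefficient needed for the inverse. The names `extended_gcd` and `mod_inverse` are hypothetical stand-ins and are not the notebook's own `extended_euclid` helper.

```python
def extended_gcd(a: int, b: int) -> tuple:
    """Return (g, x, y) with g = gcd(a, b) and a*x + b*y == g."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_x, x = x, old_x - quotient * x
        old_y, y = y, old_y - quotient * y
    return old_r, old_x, old_y

def mod_inverse(e: int, phi: int) -> int:
    """Return d with (d * e) % phi == 1, or raise if e and phi share a factor."""
    g, x, _ = extended_gcd(e, phi)
    if g != 1:
        raise ValueError("e is not coprime to phi, so no inverse exists")
    return x % phi  # normalise a possibly negative coefficient

print(mod_inverse(3, 64))
```

The same call performs the coprimality test and the inversion, so no separate gcd computation is needed.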

#### Encryption/Decryption
To perform encryption/decryption using modular exponentiation, we can use exponentiation by squaring, which needs $O(\log e)$ modular multiplications (or $O(\log d)$ for decryption).
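A minimal sketch of exponentiation by squaring is shown here, assuming a non-negative exponent. The name `mod_pow` is illustrative; Python's built-in three-argument `pow` already does this job.

```python
def mod_pow(base: int, exponent: int, modulus: int) -> int:
    """Compute base**exponent % modulus using O(log exponent) multiplications."""
    if exponent < 0:
        raise ValueError("negative exponents are not handled in this sketch")
    result = 1 % modulus  # also covers modulus == 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                 # current lowest bit is set
            result = (result * base) % modulus
        base = (base * base) % modulus   # square for the next bit
        exponent >>= 1
    return result

# Toy values assumed for illustration: n = 85, e = 3, d = 43.
print(mod_pow(42, 3, 85), mod_pow(53, 43, 85))
```

The loop runs once per bit of the exponent, which is where the logarithmic bound comes from.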

Thus, given the keys, the whole process of RSA takes a number of arithmetic operations that is logarithmic in $n$.

### Security of RSA
It can be proven that deriving the private key from the public key is as difficult as factorizing $n$ (*i.e.* prime factorization).

And since prime factorization is computationally difficult, we can say that trying to obtain the corresponding private key for a public key is also difficult.

However, it is still an open question whether obtaining the plaintext from the ciphertext and public key (which is the main goal of the attacker) is as difficult as prime factorization.

Also, note that a quantum computer can solve integer factorization in polynomial time (using Shor's algorithm), thus RSA will be broken once a sufficiently powerful quantum computer exists.

### Improper RSA usage

#### Efficiency issue
Suppose that we wish to encrypt some data in a symmetric key setting.
Instead of using a [block cipher](./modern_ciphers.ipynb#block-ciphers) like AES, we use RSA.

Note that RSA runs at least an order of magnitude slower than AES, so this is less efficient than using a symmetric cipher.

#### Key reuse
Suppose that our message is too long, such that it cannot be represented as an integer smaller than $n$.
We can still encrypt the message by segmenting it into blocks and encrypting each block using RSA.
However, since RSA is deterministic, blocks with identical information will result in identical ciphertext blocks, thus leaking information.
This is similar to [AES using ECB](./modern_ciphers.ipynb#ecb).
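The leak is easy to see with a toy key. The sketch below assumes $p=61$, $q=53$, $e=17$ and treats each byte as its own block; the function name `encrypt_bytewise` is made up for this example.

```python
# Toy key, assumed for illustration: n = 3233 can hold any byte value (0..255).
p, q, e = 61, 53, 17
n = p * q

def encrypt_bytewise(message: bytes) -> list:
    """Encrypt each byte as a separate textbook-RSA block (insecure on purpose)."""
    return [pow(block, e, n) for block in message]

plaintext = b"ATTACK AT DAWN"
ciphertext = encrypt_bytewise(plaintext)
print(ciphertext)

# Equal plaintext bytes always give equal ciphertext blocks, so the
# repetition pattern of the message is visible to an eavesdropper.
print(len(set(plaintext)), len(set(ciphertext)))
```

Because each block is encrypted independently and deterministically, the ciphertext has exactly as many distinct blocks as the plaintext has distinct bytes.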



## <div id='hash'> Hash </div>

### Motivation
Suppose that we have a long message, and we have both a secure channel and an insecure channel.
On the insecure channel, an attacker has the ability to change the content of the message sent through it.

How can we ensure that the receiver receives our intended message?

One way is to simply send the long message through the secure channel.
This might be inefficient because there is higher overhead in sending messages through a secure channel.

Another option is to send the long message through the insecure channel, and some sort of "checking mechanism" through the secure channel.
The recipient, upon reception of the message, will only accept the message if the checking mechanism passes.

Suppose that we send the message length through the secure channel. Now the attacker can no longer insert or delete content to modify the message, as the recipient will reject any message whose length has changed.

This idea serves as the motivation for using a hash function, which protects the **integrity** of the message.
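To see why a length check alone is too weak and a digest is a better checking mechanism, consider a substitution that keeps the length unchanged. The messages below are made up for illustration, and SHA-256 is just one possible choice of hash.

```python
import hashlib

# Hypothetical messages of identical length.
original = b"pay bob 100 dollars"
tampered = b"pay eve 900 dollars"

# Check 1: compare lengths (what the length-only scheme would do).
length_check_passes = len(tampered) == len(original)

# Check 2: compare digests (what a hash-based scheme would do).
digest_check_passes = (
    hashlib.sha256(tampered).digest() == hashlib.sha256(original).digest()
)

print("length check passes:", length_check_passes)
print("digest check passes:", digest_check_passes)
```

The tampered message slips past the length check but not the digest comparison.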

### Unkeyed Hash
When talking about hashes, most people are referring to unkeyed hashes.
An unkeyed hash takes a message of any length and produces a corresponding fixed-length output, called a **digest**.

Cryptographic hashes have to satisfy certain conditions:
1. Efficiency: Given $m$, it is computationally efficient to compute $y=h(m)$.
2. Pre-image resistance (**one-way**): Given $y$, it is computationally difficult to find a message $m$ such that $y = h(m)$.
3. Collision-resistance: It is computationally difficult to find two distinct messages $m_1 \neq m_2$ such that $h(m_1) = h(m_2)$.

Notice that collision-resistance implies second pre-image resistance (given $m_1$, it is difficult to find $m_2 \neq m_1$ with the same digest), and, for hashes that compress their input substantially, it is usually taken to imply pre-image resistance as well.

#### Examples
* MD5 **(Insecure)**
* SHA-0 **(Insecure)**
* SHA-1 **(Insecure)**
* SHA-2 Family
    * SHA-224
    * SHA-256
    * SHA-384
    * SHA-512
* SHA-3

Note that, in a well-designed hash function, a small change in the message causes a large and unpredictable change in the digest (the avalanche effect), as shown below. This is similar to the characteristic of [block ciphers](./modern_ciphers.ipynb#block-ciphers).

In [4]:
import hashlib

h1 = hashlib.sha224(b"a bit of tid and a bit of tad").hexdigest()
h2 = hashlib.sha224(b"a bit of tid and a bit of tid").hexdigest()

print('First digest :', h1)
print('Second digest:',h2)

First digest : 23ee4f591dde0a0f0d24f240b7dbea1df282675d8be027ef8e4f291a
Second digest: c89496f803f68cdd73956061b1701d9772d8032614383e83032c80b8


### Keyed Hash <span id="mac"/>
A keyed hash is similar to an unkeyed hash, except that it also takes a secret key as an input and outputs a **Message Authentication Code (MAC)**.
Thus, only people who know the secret key can produce the correct MAC for a given message.

#### Examples
* CBC MAC
    * Based on CBC AES
* HMAC
    * Based on SHA functions

### Remarks
* For an attacker, the obvious approach would be to find **another pre-image** $m'$ that holds their desired message and has the same hash as the original message $m$, that is $h(m') = h(m)$.

* Given the digest, we can be assured of the origin of the data, thus one may feel that "authenticity" is satisfied.
However, in the literature, when there is no secret key involved, we deem it as fulfilling **integrity** rather than authenticity.
    * Thus, to achieve authenticity, we need either **MAC** or **digital signatures**.

* Note that hashes **do not protect confidentiality**, because the messages are sent unencrypted.

## Data origin authentication
Suppose that we do not even have access to some secure channel to deliver the digest.
How can we have a system which can verify the integrity of the messages sent?

We simply use MAC (symmetric setting) or digital signatures (asymmetric setting).

### MAC
To ensure that the message is not forged, we perform the following:
1. Sender and receiver agree on the secret key beforehand.
2. Sender sends both their message and their MAC (produced using the message and the secret key)
3. Receiver receives the message and the MAC, then computes the corresponding MAC using the message and the secret key.
4. Receiver accepts the message only if the received MAC matches with the computed MAC
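The four steps above can be sketched with Python's standard `hmac` module. HMAC-SHA256 is assumed here as the MAC, and the key, messages and helper names are placeholders for illustration.

```python
import hashlib
import hmac

def make_mac(key: bytes, message: bytes) -> bytes:
    """Sender side: compute the MAC from the message and the shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify_mac(key: bytes, message: bytes, received_mac: bytes) -> bool:
    """Receiver side: recompute the MAC and compare in constant time."""
    expected = make_mac(key, message)
    return hmac.compare_digest(expected, received_mac)

shared_key = b"example key agreed beforehand"  # placeholder value
message = b"transfer 10 credits to bob"
mac = make_mac(shared_key, message)

print(verify_mac(shared_key, message, mac))
print(verify_mac(shared_key, b"transfer 99 credits to eve", mac))
```

`hmac.compare_digest` is used rather than `==` so that the comparison time does not reveal how many leading bytes matched.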

### Digital signature
A digital signature is essentially the asymmetric version of a MAC.
The sender signs the message using their secret key,
and **anyone** can verify that the message is indeed from the sender by using the public key.
Note that the signature sent to the receiver is computed using the **secret key**, unlike in RSA encryption, where the sent message is encrypted using the **public key**.

Using similar arguments as for RSA, the message can only be signed by someone who knows the private key corresponding to the public key.
Thus, a signature scheme also achieves **non-repudiation**, where the signing entity cannot deny their previous action/message.
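As a rough illustration of signing with the secret key and verifying with the public key, here is a toy hash-then-sign sketch on top of textbook RSA. It has no padding and uses a key built from two small Mersenne primes, so it is an assumption-laden demonstration only; real schemes such as RSASSA-PSS are considerably more involved.

```python
import hashlib

# Toy key from two known Mersenne primes; far too small for real use.
p, q = 2**61 - 1, 2**89 - 1
n, phi = p * q, (p - 1) * (q - 1)
e = 65537
d = pow(e, -1, phi)  # secret exponent (Python 3.8+)

def digest_as_int(message: bytes) -> int:
    """Hash the message and reduce it into the range [0, n)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(message: bytes) -> int:
    """Signer: apply the SECRET exponent d to the digest."""
    return pow(digest_as_int(message), d, n)

def verify(message: bytes, signature: int) -> bool:
    """Anyone: apply the PUBLIC exponent e and compare with the digest."""
    return pow(signature, e, n) == digest_as_int(message)

message = b"I owe Bob 10 credits"  # hypothetical message
signature = sign(message)
print(verify(message, signature))
print(verify(b"I owe Bob 99 credits", signature))
```

Only `sign` needs `d`, while `verify` uses nothing but the public values `(n, e)`, which is why any third party can check the signature.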

Note that non-repudiation is not achieved using a MAC alone.
Since a MAC is symmetric, both sender and receiver know the secret key.
Thus, the sender of the message can falsely attribute the message to the receiver, claiming that it was the receiver who produced it instead.

#### Examples
* RSASSA-PSS, RSASSA-PKCS1
    * Based on RSA
* DSA

### Attacks on Hashes
#### Birthday attacks
Note that if collisions can be easily generated, the hash function no longer guarantees integrity, because the digest cannot distinguish between the two colliding messages.
Thus, attackers will aim to find collisions as a method to violate the integrity of the system.

Consider the "Birthday Paradox", which states that only 23 people are needed for the probability that two of them share a birthday to exceed 50%.
This phenomenon appears because the number of pairwise comparisons between people's birthdays grows quadratically with the number of people.

Thus, suppose that our hash can only take 366 different values.
Then the attacker only needs to generate the hashes of 23 different messages to have a better than even chance of producing a collision.

In general, if the hash space has size $T$ and an attacker generates $M$ messages such that
$$
M > 1.17T^{0.5}
$$
then they have a probability of roughly 0.5 or more of finding a collision (the exact constant is $\sqrt{2 \ln 2} \approx 1.177$).

Thus, for a 64-bit digest, an attacker only needs to generate $M > 1.17 \times (2^{64})^{0.5} = 1.17 \times 2^{32} \approx 5.025 \times 10^9$ messages, which is easily brute-forceable.

The probability that a collision occurs is approximately $$1 - e ^ {-\frac{M^2}{2T}}$$
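These figures can be checked numerically. The sketch below compares the exact collision probability with the approximation above, and solves the approximation for the even-odds point; the function names are illustrative.

```python
import math

def exact_collision_probability(num_messages: int, space_size: int) -> float:
    """P(at least one collision) among uniformly random hash values."""
    p_no_collision = 1.0
    for i in range(num_messages):
        p_no_collision *= (space_size - i) / space_size
    return 1.0 - p_no_collision

def approx_collision_probability(num_messages: float, space_size: int) -> float:
    """The approximation 1 - exp(-M^2 / (2T))."""
    return 1.0 - math.exp(-num_messages**2 / (2 * space_size))

def messages_for_even_odds(space_size: int) -> float:
    """Solve 1 - exp(-M^2 / (2T)) = 0.5 for M, giving M = sqrt(2 ln 2 * T)."""
    return math.sqrt(2 * math.log(2) * space_size)

print(exact_collision_probability(23, 366))
print(approx_collision_probability(23, 366))
print(messages_for_even_odds(2**64) / 2**32)  # the constant in front of sqrt(T)
```

The last line recovers the constant in front of $T^{0.5}$, which the text quotes to two decimal places as 1.17.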

Hence, the digests of cryptographic hashes are rather long, to mitigate birthday attacks.

### <span id= "protection-of-passwords"/>Protection of password files 

Since the system needs to keep a file storing the user identity and their corresponding password, there is a need to protect this file, since the leaking of this file can lead to many credentials being compromised.

Hence, the hash of the password should be stored, rather than the raw value.
During authentication, the system compares the hash of the password response against the password hash in its storage.
This allows the system to check if the password is correct without actually storing the actual password.
Hence, in the event that the system is compromised and the password file is stolen, the attackers will not directly obtain the users' passwords, which could have been reused to authenticate to other services.

#### Salt

It is desirable for the same password to hash to different values when it is assigned to different user identities.
This prevents the attacker from seeing which users have identical passwords (and thus potentially weak passwords) if the password file is leaked.
This can be achieved using a **salt**, which is simply a random value appended to each user's password before hashing.
The salt is then stored in clear text in the password file, so that the system is able to recompute the salted hash from a provided password.
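A minimal sketch of salted password storage follows, assuming SHA-256 and a 16-byte random salt; the record layout and function names are made up for this example. Production systems typically prefer a deliberately slow function such as `hashlib.pbkdf2_hmac`, which is not shown here.

```python
import hashlib
import hmac
import secrets

def hash_password(password: str, salt: bytes) -> bytes:
    # The salt is appended to the password before hashing.
    return hashlib.sha256(password.encode("utf-8") + salt).digest()

def create_record(password: str) -> tuple:
    """Return (salt, salted_hash); both are stored, the password is not."""
    salt = secrets.token_bytes(16)  # fresh random salt per user
    return salt, hash_password(password, salt)

def check_password(candidate: str, record: tuple) -> bool:
    """Recompute the salted hash from the stored salt and compare."""
    salt, stored_hash = record
    return hmac.compare_digest(hash_password(candidate, salt), stored_hash)

# Two hypothetical users who happen to pick the same password.
alice = create_record("hunter2")
bob = create_record("hunter2")

print(alice[1] != bob[1])  # different salts give different stored hashes
print(check_password("hunter2", alice), check_password("wrong", alice))
```

Because each record carries its own salt, identical passwords no longer produce identical entries in the password file.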

Notice that this is similar to using [IV in encryption](./modern_ciphers.ipynb#iv).