In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab02.ipynb")

In [None]:
import math
import matplotlib.pyplot as plt
import numpy as np
import string
import itertools
import re
from functools import reduce
import hashlib

In [None]:
%%capture
# Install our custom library
import sys
!{sys.executable} -m pip install git+https://github.com/CodebreakingAtCal/codebreaking_at_cal.git

import codebreaking_at_cal

# Lab 2: Information Theory
Contributions From: Ryan Cottone, Imran Khaliq-Baporia

Welcome to Lab 2! In this lab, we will demonstrate the basics of information theory, and some techniques used to break Enigma.

As a quick review from lecture and Note 2, the *entropy* of a probability distribution is derived from the following formula:

$$ H(x) = -\sum_{i=0} x_i \log_2(x_i) $$

where $x_i$ represents the probability of event $i$ occurring. It is used to quantify how uncertain a random variable is.
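
As a quick worked example, take a hypothetical distribution (not from the lab data) with three outcomes of probability $\frac{1}{2}$, $\frac{1}{4}$ and $\frac{1}{4}$. Plugging into the formula gives:

$$ H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits} $$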

**Question 1.1**: Implement calculateEntropy, which takes in an array of probabilities and returns its entropy.

*HINT: math.log(x, base) returns $\log_{base}(x)$*

In [None]:
def calculateEntropy(probabilityDistribution):
    total = 0
        
    for probability in probabilityDistribution:
        if probability == 0:
            continue
        ...
    
    return -total
    

In [None]:
grader.check("q1_1")

It is a lot easier to compute the entropy of a uniform distribution (an array of equal probabilities). For a uniform distribution with $k$ entries, each with probability $x$, every term in the sum is identical, so our equation reduces to the following:

$$H(X_{uniform}) = -k \cdot x\log_2(x)$$
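
Since the probabilities of a uniform distribution must sum to 1, we have $x = \frac{1}{k}$, and the expression simplifies further. For instance, a single uniformly random letter ($k = 26$) works out to:

$$ H(X_{uniform}) = -k \cdot \tfrac{1}{k}\log_2\left(\tfrac{1}{k}\right) = \log_2(k), \qquad \log_2(26) \approx 4.70 \text{ bits} $$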

In [None]:
def calculateUniformEntropy(probability, k):
    ...

In [None]:
grader.check("q1_2")

Let's visualize what entropy actually looks like. Below is a graph with the x-axis representing the probability of tossing heads in an unfair coin. You'll notice that the entropy peaks at 0.5, or when the coin is fair. This is at the point where the outcome is most uncertain. Contrast this to when there's a 90% chance to hit heads -- we can be pretty certain about the outcome (lower entropy).
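
To check one point on the curve by hand, here is the 90% heads case worked out from the entropy formula (values rounded to three decimal places):

$$ H = -\left(0.9\log_2 0.9 + 0.1\log_2 0.1\right) \approx 0.137 + 0.332 = 0.469 \text{ bits} $$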

In [None]:
plt.plot(np.arange(0.1,.9001,0.001), [calculateEntropy([x, 1-x]) for x in np.arange(0.1,.9001,0.001)])
plt.show()

Specifically, we have one bit of entropy for a single fair coinflip. This corresponds to a sample space of size 2 with uniform probability. We can do no better than to guess the result each time.

In the context of cryptographic keys, we'd need to guess approximately $2^k$ keys for a system with $k$ bits of entropy.
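
To make the link between bits and guesses concrete, here is a small sketch. It assumes keys are chosen uniformly at random, and `keyspace_bits` is a hypothetical helper written for this illustration, not part of `codebreaking_at_cal`.

```python
import math

def keyspace_bits(alphabet_size: int, key_length: int) -> float:
    """Bits of entropy in a uniformly random key (illustrative helper)."""
    # A key of key_length symbols has alphabet_size ** key_length values,
    # so its entropy is key_length * log2(alphabet_size).
    return key_length * math.log2(alphabet_size)

caesar_bits = keyspace_bits(26, 1)     # one shift out of 26
vigenere5_bits = keyspace_bits(26, 5)  # five independent shifts

print(f"Caesar key: {caesar_bits:.2f} bits")
print(f"Vigenere key of length 5: {vigenere5_bits:.2f} bits")
print(f"Keys to guess for the length-5 key: about {2 ** vigenere5_bits:.3g}")
```

Each extra key letter adds the same number of bits, which is why the number of keys to guess multiplies by 26 every time.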

Let's demonstrate the entropy of some ciphers we tried before:

In [None]:
sample_text = ""
with open('sample.txt', 'r') as file:
    sample_text = file.read().replace('\n', '').replace(" ", "")
    
sample_text = codebreaking_at_cal.clean_text(sample_text)
sample_text = re.sub('[^a-z]*', '', sample_text.lower())

In [None]:
# Takes in two NumPy arrays of letter frequencies and returns their total variation distance (TVD).
def analyzeFrequency(freq1, freq2):
    diff = np.abs(freq1 - freq2)
    return np.sum(diff) / 2

In [None]:
# Takes in a Caesar ciphertext and returns the map of shift -> probability
def shiftProbabilityCaesar(ciphertext):
    ciphertext = re.sub('[^a-z]*', '', ciphertext.lower())
    
    arr = np.array([])
        
    for i in range(0,26):
        arr = np.append(arr, 2**(1/codebreaking_at_cal.analyze_frequency(codebreaking_at_cal.caesar_decrypt(ciphertext, i))))
    
    arr = arr/sum(arr)
    return arr

In [None]:
caesarEntropy = calculateEntropy(shiftProbabilityCaesar(codebreaking_at_cal.caesar_encrypt(sample_text, 1)))

k = 20
baseEntropy = calculateUniformEntropy(1/(26**k), 26**k)

print('A Caesar encrypted message of length', len(sample_text), "has", caesarEntropy, "bits of entropy.")
print('A random message of length', k, "has", baseEntropy, "bits of entropy.")

Read through those functions and try to understand what they do. In the last cell, we calculated the entropy of a Caesar cipher on a very long text and the entropy of a completely random alphabetic string of length 20 (far, far less than the Caesar text).

To put these numbers into perspective (note that `e+28` means $\times 10^{28}$):

In [None]:
print('There are ', 2**caesarEntropy, ' items in the "sample space" for a caesar encrypted message of length', len(sample_text))
print('There are ', 2**baseEntropy, ' items in the "sample space" for a random message of length', k)

Take a look at the graph of the entropy of a Caesar cipher on a $k$ length text (represented by the x-axis):

In [None]:
lengths = np.arange(1, 175)
plt.plot(lengths,
         [calculateEntropy(shiftProbabilityCaesar(codebreaking_at_cal.caesar_encrypt(sample_text[:x], 9))) for x in lengths])
plt.show()

Note the entropy actually decreasing as the text size gets larger! Can you think of why this is the case?

Now for the graph of a uniformly random string:

In [None]:
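y = 10
# A uniformly random string of length x has 26**x equally likely values,
# each with probability 1/26**x.
plt.plot(range(1, y),
         [calculateUniformEntropy(1/26**x, 26**x) for x in range(1, y)])
plt.xticks(range(1, y))
plt.show()

Note that the entropy keeps growing with length: each extra letter adds $\log_2(26) \approx 4.7$ bits, so the number of equally likely strings ($26^x$) grows exponentially!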

We can see that Vigenere is better, but not by much.

In [None]:
def calculateVigenereEntropy(text, keylen):
    texts = ['']*keylen
    for i in range(len(text)):
        texts[i%keylen] += text[i]
    probabilities = [shiftProbabilityCaesar(x) for x in texts]
        
    return sum([calculateEntropy(x) for x in probabilities])

key = "ab"    
calculateVigenereEntropy(codebreaking_at_cal.vignere_encrypt(sample_text[:50], key), len(key))

**Question 1.3**: Is the entropy of a Vigenere cipher with key length 1 equal to that of a Caesar cipher? Enter True or False.

In [None]:
concept_check = ...

In [None]:
grader.check("q1_3")

The following function generates a random key for use in later functions.

In [None]:
def genRandomKey(length):
    return ''.join(np.random.choice([x for x in string.ascii_lowercase], length))


What happens if we use a Vigenere cipher with a random key? Check out the graph of key length versus entropy below:

*NOTE: The orange line represents the entropy of a uniformly random string of length 102*

In [None]:
cap = 102
plt.plot(range(1,cap,10),
         [calculateVigenereEntropy(codebreaking_at_cal.vignere_encrypt(sample_text[:cap], genRandomKey(x)), x) for x in range(1,cap,10)])
plt.plot(range(1, cap,10),
         [calculateUniformEntropy(1/26**cap,26**cap) for x in range(1,cap,10)])
plt.show()

Do you see why an increasingly long keylength corresponds to more entropy? What do you think happens as the keylength approaches the message size?

# One-Time Pad

Let's introduce a quite surprising idea -- there exists a cipher that is **provably unbreakable** and not any more complicated than the Vigenere cipher!

We will define provably unbreakable, also known as perfectly secret, to mean a ciphertext with entropy equal to that of a random bitstring of the same length. This means that the ciphertext yields no more information than a string of truly random ones and zeroes. You have zero recourse but to try and brute force every possibility (we will see that even this is useless).

Enter the One-Time Pad. As its name suggests, we can only use it a single time before it becomes insecure again. A one-time pad operates by pairing each letter/bit of the plaintext with a truly random key letter/bit and shifting/XOR-ing. Basically, if we used a Vigenere cipher with a truly random key of the same length as the plaintext, we would have a perfectly secure cipher.
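
The bit-level (XOR) variant can be sketched as follows. This is only an illustration: `xor_bytes` is a hypothetical helper, the message is made up, and Python's standard `secrets` module stands in for a source of truly random key material.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the matching key byte (illustrative helper)."""
    if len(data) != len(key):
        raise ValueError("The key must be exactly as long as the message.")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"attack at dawn"
key = secrets.token_bytes(len(message))  # fresh random key, used only once

ciphertext = xor_bytes(message, key)
recovered = xor_bytes(ciphertext, key)   # XOR-ing twice with the same key undoes it

print(ciphertext.hex())
print(recovered)
```

Because XOR is its own inverse, the same function serves for both encryption and decryption.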

You may be wondering why cryptography is even an active field if such a perfect cipher exists. Well, there are considerable drawbacks to such a cipher, namely the onerous requirement of a very large amount of perfectly random key material.

**Question 2.1**: Which of the following message/key pairs are secure?

1: abcd, efgh

2: abcd, b

3: abcd, longkey

In [None]:
correct = ...

In [None]:
grader.check("q2_1")

In [None]:
def otp_encrypt(text, key):
    if len(text) != len(key):
        raise ValueError("The key must be exactly as long as the text.")

    return codebreaking_at_cal.vignere_encrypt(text, key)
    

In [None]:
def otp_decrypt(text, key):
    if len(text) != len(key):
        raise ValueError("The key must be exactly as long as the text.")
        
    return codebreaking_at_cal.vignere_decrypt(text, key)
    

Let's graph the letter frequencies of sample_text encrypted with an OTP. We know that each 

In [None]:
def count_letters(text):
    counts = {}
    text = text.lower()
    
    for letter in string.ascii_lowercase:
        counts[letter] = 0
    
    for letter in text:
        if (letter in string.ascii_lowercase): 
            counts[letter] += 1
    
    return counts

def calculate_proportions(text): # Coded for you
    counts = count_letters(text).values()
    nparr = np.fromiter(counts, dtype=float)
    return nparr / sum(counts)

plt.bar([string.ascii_lowercase[i] for i in range(26)], calculate_proportions(otp_encrypt(sample_text, genRandomKey(len(sample_text)))))
plt.show()

Looks pretty uniform, right? Assuming a random shift at each step, we have 26 possibilities at each index. For a string of length $k$, that means $26^k$ possible keys. Moreover, we will see that the properties of an OTP make it impossible to decrypt at all, not just hard.

The crux of having a separate shift for every letter is that you can decrypt the ciphertext into **every possible string** by changing the key. With traditional ciphers like Vigenere or Caesar, only a bounded number of different strings are possible with a key of a certain length. This is because the key eventually repeats against the same plaintext, and you can't map these two occurrences to different outcomes. With an OTP, that is very much possible.

Take for example the ciphertext "AUVDGP". 

Say we found our possible key "ABCDEF": decrypt("AUVDGP", "ABCDEF") == "ATTACK"

Instead, say we found a possible key "XQQZTM": decrypt("AUVDGP", "XQQZTM") == "DEFEND"

Which is correct? There's no way to tell.

Let's revisit why this is the case. The formula for encryption can be thought of as the following, where adding is pairwise by letter:

PLAINTEXT + KEY = CIPHERTEXT

To find the key for any arbitrary plaintext from the ciphertext, simply solve the following:

KEY = CIPHERTEXT - PLAINTEXT = vignere_decrypt(CIPHERTEXT, PLAINTEXT)
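
The letter-wise subtraction can be sketched without the course library. Here `subtract` is a hypothetical stand-in for a Vigenere-style decryption over uppercase letters (A = 0, ..., Z = 25), and "RETIRE" is an extra made-up target plaintext.

```python
import string

ALPHABET = string.ascii_uppercase

def subtract(a: str, b: str) -> str:
    """Letter-wise (a - b) mod 26 on equal-length uppercase strings."""
    return "".join(
        ALPHABET[(ALPHABET.index(x) - ALPHABET.index(y)) % 26]
        for x, y in zip(a, b)
    )

ciphertext = "AUVDGP"

# For any plaintext we would like to "find", there is a key that produces it.
for target in ("ATTACK", "DEFEND", "RETIRE"):
    key = subtract(ciphertext, target)      # KEY = CIPHERTEXT - PLAINTEXT
    decrypted = subtract(ciphertext, key)   # PLAINTEXT = CIPHERTEXT - KEY
    print(target, key, decrypted)
```

Every six-letter target gets its own equally plausible key, which is why brute force cannot single out the real message.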

# Enigma weaknesses

Unfortunately, we do not have enough computing power nor time to build a full Enigma decryption machine. However, we can examine some fatal weaknesses of the system to get a better idea of what a full decryption machine would involve.

## Weakness 1: No self-encryption
In an Enigma machine, a letter can never encrypt to itself. While seemingly benign, this has absolutely massive ramifications on the size of the potential sample space.

For example, decrypting the ciphertext "ODSXO" to "HELLO" is **not a valid decryption**, despite its high English frequencies: the final O would have encrypted to itself.

How much does this help us? For a message of length $n$, we are able to eliminate a candidate decryption if **any letter maps to itself**. As $n$ increases, the chances of this happening also skyrocket.

$$P(repeat) = 1 - \left(\frac{25}{26}\right)^n$$

In [None]:
plt.plot(range(1, 100), [(1 - (25/26)**x) for x in range(1,100)])
plt.show()

This means at higher ranges, we can eliminate the vast, vast majority of possible messages.
For example, at $n = 100$, we can discard about 98% of candidate decryptions without further analysis if we know the plaintext!
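
A quick numeric check of the formula, under the simplifying assumption that each letter of a wrong candidate matches the ciphertext letter independently with probability 1/26 (`p_self_match` is a hypothetical helper name):

```python
def p_self_match(n: int) -> float:
    """Chance that at least one of n letters lines up with itself."""
    return 1 - (25 / 26) ** n

for n in (1, 10, 50, 100):
    print(f"n = {n:3d}: {p_self_match(n):.3f}")
```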

The goal in this case would be to find the key to decrypt future messages, not to decrypt the one we currently have.

# Partial Known-Plaintext Attacks

You may be wondering what the point of that is if we already know the decrypted message. It turns out that we often know *parts* of the message, sometimes even where those parts are.

For example, take the following weather report, often sent daily by the German Army:

\-----

WEATHER REPORT 

There are clear skies today.

\----

If you know that the plaintext must start with 'WEATHER REPORT', you can immediately eliminate anything that doesn't decrypt to that.

If you don't know where it is, but know it must be somewhere, you can try it at every index. (See why breaking the Enigma was a hard task in the 1940s?)

In [None]:
# Takes in a ciphertext and known crib, and returns all possible indices of the crib
def verifyKPA(ciphertext, crib): # Crib was the term for a segment of known plaintext
    possible = []

    # The + 1 makes sure the last alignment (crib flush with the end) is checked too
    for i in range(0, len(ciphertext) - len(crib) + 1):
        segment = ciphertext[i:i+len(crib)]

        # Enigma never encrypts a letter to itself, so an alignment is only possible
        # if no crib letter equals the ciphertext letter at the same position
        valid = all(c != p for c, p in zip(segment, crib))

        if valid:
            possible.append(i)

    return possible

verifyKPA("abaacaad", 'bc')

At any given position, the chance that the next $k$ letters rule out the crib (because at least one letter lines up with itself) is $$1 - \left(\frac{25}{26}\right)^k$$ If you had a long crib, you would be able to eliminate many potential places and only need to check a few spots when decrypting. On the other hand, knowing there must be a "c" somewhere in the text does very little.
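
As a rough estimate of how much a crib narrows the search, assume every ciphertext letter is independent and uniformly random. An alignment then survives only if none of its $k$ letters coincides with the crib, which happens with probability $(25/26)^k$. The helper name below is hypothetical and the lengths are made up.

```python
def expected_surviving_positions(ciphertext_length: int, crib_length: int) -> float:
    """Expected number of crib alignments that pass the no-self-encryption check."""
    positions = ciphertext_length - crib_length + 1  # alignments to try
    if positions <= 0:
        return 0.0
    return positions * (25 / 26) ** crib_length

for k in (1, 13, 40):
    survivors = expected_surviving_positions(130, k)
    print(f"crib length {k:2d}: about {survivors:.1f} of {130 - k + 1} positions remain")
```

Longer cribs shrink both the number of alignments and the fraction that survive, so the remaining positions drop off quickly.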

**Question 3.1**: Given the following Enigma ciphertext, what is the plaintext?

HINT: What do we know about self-encryption?

HINT: Remember the plaintext must be as long as the ciphertext.

ehwbr ximqc moche cgksf ivyid xjnql qofhv myhjq opyoz nldnh cgbqb mzwht jtugy vmxkm hwdxz oncwt rjfim oiclj nqxpi tlsrr pdtif wtmpc zwlwy uzwjc rjsrl fkyqp yd

In [None]:
text = "ehwbrximqcmochecgksfivyidxjnqlqofhvmyhjqopyoznldnhcgbqbmzwhtjtugyvmxkmhwdxzoncwtrjfimoicljnqxpitlsrrpdtifwtmpczwlwyuzwjcrjsrlfkyqpyd"

In [None]:
letterCount = count_letters(text)
plt.bar(letterCount.keys(), letterCount.values())
plt.show()

In [None]:
plaintext = ...

In [None]:
grader.check("q3_1")

That's all for Lab 2. I would have loved to include more on the Enigma, but it is quite heavily involved both computing wise and coding wise. If you are interested in learning more, I highly recommend reading over https://en.wikipedia.org/wiki/Cryptanalysis_of_the_Enigma !

Based on previous lab results, we would like to collect feedback about the labs to see if they are too short/long/involved/etc. 

**Fill out the form and put the secret word below once you're done:** https://forms.gle/cFa5ihS8DH8d1XWT6


In [None]:
secret_word = ...

In [None]:
grader.check("q4_1")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Once you have generated the zip file, go to the Gradescope page for this assignment to submit.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)