# Breaking Caesar ciphers
Now that we can encipher and decipher with a Caesar cipher, the next step is to write a program to automatically break enciphered messages.

### Want to know more?
I've written more about [writing programs for ciphers](https://work.njae.me.uk/tag/codes-and-ciphers/) on [my blog](https://work.njae.me.uk/). You can also grab more [code from Github](https://github.com/NeilNjae/cipher-tools/). Feel free to dive in and take a look!

# The basic idea
The basic model is simple: we'll get the computer to try all the keys and see which is the best.

"Trying all the keys" is simple: for a Caesar cipher, we just run through the shifts 0–25.

Finding which is "best" is somewhat trickier. How do we know which is the best possible decryption? For complex ciphers, and long ciphertexts, we don't want to rely on a human to make the decision.

## Recap: Caesar ciphers
We're just repeating the Caesar cipher implementation from the previous notebook.

In [None]:
import string

In [None]:
def pos(letter): 
    """Return the position of a letter in the alphabet (0-25)"""
    if letter in string.ascii_lowercase:
        return ord(letter) - ord('a')
    elif letter in string.ascii_uppercase:
        return ord(letter) - ord('A')
    else:
        raise ValueError('pos requires input of {} to be an ascii letter'.format(letter))

In [None]:
def unpos(number): 
    """Return the letter in the given position in the alphabet (mod 26)"""
    return chr(number % 26 + ord('a'))

In [None]:
def caesar_encipher_letter(letter, shift):
    """Encipher a letter, given a shift amount"""
    if letter in string.ascii_letters:
        cipherletter = unpos(pos(letter) + shift)
        if letter in string.ascii_uppercase:
            return cipherletter.upper()
        else:
            return cipherletter
    else:
        return letter      

In [None]:
def caesar_encipher(message, shift):
    """Encipher a message with the Caesar cipher of given shift"""
    enciphered = ""
    for character in message:
        enciphered += caesar_encipher_letter(character, shift)
    return enciphered

In [None]:
def caesar_decipher(message, shift):
    """Decipher a message with the Caesar cipher of given shift"""
    return caesar_encipher(message, -shift)

In [None]:
caesar_encipher('This is a test message.', 4)

In [None]:
caesar_decipher('Xlmw mw e xiwx qiwweki.', 4)

# The New Stuff

## Trying all the keys

Remember, our approach to automatically breaking ciphers is simple:

1. Try all the keys
2. For each key, decrypt the message and "score" it
3. Pick the key with the best "score"
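
As a rough sketch of how those three steps fit together, here is a self-contained version that uses a deliberately naive scorer. The names `shift_text`, `toy_score` and `KNOWN_WORDS` are made up for this illustration, and counting known words is only a stand-in: the rest of this notebook builds a much more general way of scoring text.

```python
import string

# Hypothetical word list, only for this illustration
KNOWN_WORDS = {'this', 'is', 'a', 'test', 'message'}

def shift_text(text, shift):
    """Shift each ASCII letter by `shift` places, leaving other characters alone."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('a')))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
        else:
            result.append(ch)
    return ''.join(result)

def toy_score(text):
    """Naive stand-in score: how many words appear in KNOWN_WORDS."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    return sum(w in KNOWN_WORDS for w in words)

ciphertext = 'Xlmw mw e xiwx qiwweki.'

# 1. try all the keys, 2. score each decryption, 3. keep the best-scoring key
best_key = max(range(26), key=lambda k: toy_score(shift_text(ciphertext, -k)))
print(best_key, shift_text(ciphertext, -best_key))
```

A word-list scorer like this only works if we already know which words to expect, which is why we want something better.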

Trying all the keys is easy. Let's try it for a Caesar cipher. Here's an enciphered message:

In [None]:
ciphertext = caesar_encipher('This is a test message.', 4)
ciphertext

And let's try all the keys, seeing what they give.

In [None]:
for key in range(26):
    plaintext = caesar_decipher(ciphertext, key)
    print(plaintext, '<=', key)

From looking at the generated plaintexts, you can probably spot that key 4 gives a sensible text and the others don't. But if you didn't know that `4` was the correct key, it could take you a few moments to find it in the mass of text. It gets harder if the text is longer, and harder still if there are more keys. We want to get a machine to recognise text that looks like English.

# A language model

The idea is to have a _model_ of what English text looks like. When we're presented with a possible plaintext, we can see how well that text fits the model. The better it fits, the more like "ideal English" it is. When we run through all the keys, we pick the key that gives a plaintext that looks most like English.

![Monkey typing](Monkey-typing.jpg) 

My approach takes the idea from the apocryphal story that an infinite number of monkeys, with an infinite number of typewriters, will eventually create the complete works of Shakespeare. As the computer tries each key, generating a possible plaintext, it can score the possible plaintext by how likely it would be for a monkey, completely randomly, to generate that plaintext.

That idea isn't that helpful when all the letters are equally likely. But if the monkey is using a keyboard where the keys are sized in proportion to how often they appear in English, we have something. The diagram below gives an idea of what such a keyboard could look like. A monkey using that keyboard, hitting keys at random, will produce something like *treattlpis* more often than *nziuechjtk*.

![English letters by proportion](letter-treemap.png) 

That allows us to score how close a piece of text is to English. If we can work out the probability that a random monkey would produce our possible plaintext, we can choose the key which produces the most likely plaintext.
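
To get a feel for the difference the keyboard makes, here's a small sketch of the two monkeys. The `APPROX_FREQ` table and the `monkey_types` function are invented for this illustration, and the percentages are rough, commonly quoted values for English rather than anything measured in this notebook.

```python
import random

# Rough, illustrative letter frequencies for English (percentages).
# These are an assumption for this sketch, not values from our corpus.
APPROX_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.1, 'z': 0.07,
}

def monkey_types(n_letters, weighted=True, rng=random):
    """Return n_letters random letters, optionally weighted by APPROX_FREQ."""
    letters = list(APPROX_FREQ)
    weights = [APPROX_FREQ[l] for l in letters] if weighted else None
    return ''.join(rng.choices(letters, weights=weights, k=n_letters))

print(monkey_types(10, weighted=True))   # monkey on the proportional keyboard
print(monkey_types(10, weighted=False))  # monkey on an ordinary keyboard
```

Run it a few times: the weighted monkey's output should look noticeably more "English-like", even though it's still nonsense.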

(This model is also called a _bag of letters_ model, as it's the same as taking all the letters in the text and putting them in a bag, losing all idea of the order of letters. While we lose a lot of information with this, it's actually good enough for our purposes.)
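
One consequence of the bag of letters idea is that anagrams look identical to the model. Here's a small sketch, where `bag_of_letters` is a helper name made up for this illustration:

```python
import collections
import string

def bag_of_letters(text):
    """Count the letters in text, ignoring order, case, and non-letters."""
    return collections.Counter(
        ch for ch in text.lower() if ch in string.ascii_lowercase)

# Same letters in a different order give the same bag
print(bag_of_letters('listen') == bag_of_letters('silent'))

# An extra letter gives a different bag
print(bag_of_letters('listen') == bag_of_letters('listens'))
```

Any score based only on the bag can't tell *listen* from *silent*, which is the information we're choosing to throw away.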

## Finding letter probabilities
How do we find these probabilities of the letters of English? The easy answer is to simply read a lot of text, counting the letters. Let's use three large texts: the [complete works of Shakespeare](https://www.gutenberg.org/ebooks/100), [War and Peace](https://www.gutenberg.org/ebooks/2600), and [The Adventures of Sherlock Holmes](https://www.gutenberg.org/ebooks/48320), all from [Project Gutenberg](https://www.gutenberg.org/wiki/Main_Page).

The Python [`collections.Counter()`](https://docs.python.org/3/library/collections.html#collections.Counter) object from the standard library does a good job of counting letters for us. If we pass a `Counter` a sequence of characters, it will count them for us:

In [None]:
import collections
collections.Counter('this is some test text')

If we `update` the `Counter`, it will update the counts for us:

In [None]:
counts = collections.Counter()
counts.update('this is some text')
counts

In [None]:
counts.update('here is some more text')
counts

This allows us to easily combine the counts of letters in all the texts, then write them to a file. 

In [None]:
corpora = ['shakespeare.txt', 'sherlock-holmes.txt', 'war-and-peace.txt']
counts = collections.Counter()

# count all the characters
for corpus in corpora:
    # use a context manager so each file is closed after reading
    with open(corpus, 'r') as f:
        text = f.read().lower()
    counts.update(text)

    
# keep just the letters    
letter_counts = {}
for l in counts:
    if l in string.ascii_lowercase:
        letter_counts[l] = counts[l]
        
# sort the letters, most common first
sorted_letters = sorted(letter_counts, key=counts.get, reverse=True)

# write the counts to a file, in case we need them later
with open('count_1l.txt', 'w') as f:
    for l in sorted_letters:
        f.write("{0}\t{1}\n".format(l, letter_counts[l]))
        print("{0}\t{1}".format(l, letter_counts[l]))

For interest, how many letters were counted?

In [None]:
sum(letter_counts.values())

At five letters per word on average, how many words is that? (Pretty printing to show the commas as separators in the number.)

In [None]:
"{:,}".format(sum(letter_counts.values()) / 5)

## Using the language model
Now we have the letter counts, we need to use them to score possible plaintexts. There are two stages here:
1. Convert the counts into probabilities;
2. Score a piece of text for how probable it is (by the random monkey metric).

## Converting the counts to probabilities
This just requires working out what proportion each letter is of the total counts. 

The probability of a letter occurring by chance is just the number of times that letter appears, divided by the total number of letters. 

We have the number of each letter in `letter_counts`: `letter_counts[letter]` is how often a letter appeared in the corpus. 

We can find the total of all the counts with `sum(letter_counts.values())`.

Division in Python is done with the `/` operator, like this:

In [None]:
12 / 4

In [None]:
12 / 5

## Your turn: find `letter_probability`s

Complete the code below. If you're stuck, you can [see the solution](letter_probability-solution.ipynb).

In [None]:
letter_probability = {}
count_sum = # Write your code here
for letter in letter_counts:
    letter_probability[letter] = # Write your code here

letter_probability

## Score a piece of text
We can use these to find the probability of a sequence of letters by finding the probability of each letter, then multiplying them all together. What we end up with is the probability of this piece of text being typed out by the random monkey I introduced earlier in this notebook.

Note that the `letter_probability` dict only holds lowercase letters, so we make sure to convert the text to lowercase before trying to score it. 

In [None]:
def text_probability(text):
    prob = 1.0
    for letter in text.lower():
        if letter in letter_probability:
            prob *= letter_probability[letter]
    return prob

In [None]:
text_probability('hello')

In [None]:
text_probability('this is a longer piece of text, like what we might find in a message')

That's quite a small probability. To see how small, let's put it in a more human-readable format:

In [None]:
'{:0.70f}'.format(text_probability('this is a longer piece of text, like what we might find in a message'))

The final values will be even smaller for longer texts. There's a danger that with a long text, we'll end up with a number that's too small to represent. 

We can get around this by using the "trick" of taking logarithms of probabilities. As numbers get smaller, their logarithms get smaller, but much less quickly. This means we can still handle the probability of long texts, while still being able to see which is the most probable. 

To find the logarithm of a number, we use the `math.log10()` function.

In [None]:
import math

math.log10(1000)

In [None]:
math.log10(1)

In [None]:
math.log10(0.1)

In [None]:
math.log10(0.005)

When we multiply numbers, we _add_ their logarithms. Here's a multiplication:

In [None]:
0.002 * 5

Here's the logarithm of the result.

In [None]:
math.log10(0.002 * 5)

Here are the logarithms of the two numbers.

In [None]:
math.log10(0.002), math.log10(5)

To multiply them together, we add their logarithms, and get the same answer as above.

In [None]:
math.log10(0.002) + math.log10(5)

(If you're interested, we can get the original number back from the logarithm by raising 10 to a power:)

In [None]:
10 ** -2.0

Before, we multiplied the probabilities together to get the probability of the sentence. With logarithms, to get the logarithm of the probability, we add all the log probabilities of the letters. As we're only interested in which is the _most_ likely, we don't need to convert _from_ logarithms anywhere: the most likely plaintext is the one with the highest probability, and that is also the one with the highest log probability. 
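
To see why this matters for long texts, here's a small sketch comparing the two approaches. It pretends every letter has the same probability of 0.05, which is a made-up stand-in value just for this illustration, and "types" 300 letters.

```python
import math

letter_prob = 0.05   # made-up stand-in probability for every letter
n_letters = 300      # a fairly short message by cipher challenge standards

prob = 1.0
log_prob = 0.0
for _ in range(n_letters):
    prob *= letter_prob                  # multiply the probabilities
    log_prob += math.log10(letter_prob)  # add the log probabilities

print(prob)      # too small for a float, so it collapses to 0.0
print(log_prob)  # still a perfectly usable number, around -390
```

Once the plain probability has collapsed to zero, every candidate plaintext looks equally bad. The log probability keeps working however long the text gets.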

## Your turn: `log_letter_probability`
Knowing what you know from above, fill the `log_letter_probability` dict with the log probabilities of the letters. 

Follow the code for filling in `letter_probability`, but take the logarithm of the fraction of letters.

If you get stuck, you can [see the solution](log_letter_probability-solution.ipynb).

In [None]:
log_letter_probability = {}

# Write your code here

log_letter_probability

## Your turn: `log_text_probability()`
Using the `log_letter_probability` dict, write the `log_text_probability()` function to give the log probability of a piece of text. 

You should follow the code from `text_probability()` above, but start with a `log_probability` of zero, and _add_ the log probabilities of the letters.

If you get stuck, you can [see the solution](log_text_probability-solution.ipynb).

In [None]:
def log_text_probability(letters):
    # Write your code here

We can use that to find some log probabilities:

In [None]:
log_text_probability('hello')

In [None]:
log_text_probability('hello there')

In [None]:
log_text_probability('this is a longer piece of text, like what we might find in a message')

Much better values!

We've ended up with a simple _model_ of the English language, which we can use to judge how likely a given piece of text is actually English. 

## Breaking Caesar ciphers
Finally, we can use `log_text_probability()` to score possible plaintexts, and hence automatically break Caesar ciphers!

Back at the top of the notebook, we showed how to try all keys and generate the corresponding plaintext for each:

In [None]:
ciphertext = caesar_encipher('This is a test message.', 4)

for key in range(26):
    plaintext = caesar_decipher(ciphertext, key)
    print(plaintext, '<=', key)

## Your turn: show the log probability scores
Generate the same table as above, but this time give the log probability score for each generated plaintext.

If you get stuck, you can [see the solution](show-log-probs-solution.ipynb).

In [None]:
# Write your code here

Once you've done that, you should see that all the possible plaintexts score -25 or lower, apart from one. And that one comes from the correct key.

The next step is to keep track of the _best_ key. 

We work through all 26 keys, deciphering the text with each one. For each possible plaintext, we use the `log_text_probability()` function to score how good that plaintext is. As we go through, we keep track of the best probability and the key which generated it. 

At the end, we return the best key and probability.

We start by initialising the `best_key` and `best_prob` to extreme values which will be instantly overridden (`float('-inf')` is negative infinity, a _really_ bad score):

In [None]:
best_key = 0
best_prob = float('-inf')

We can use this code to update the best key and best probability:

In [None]:
# Don't run this cell: it's just a code fragment and will give an error
if log_prob_score > best_prob:
    best_prob = log_prob_score
    best_key = key

## Your turn: find the best key
In the code you wrote above, you showed the score for each candidate plaintext. 

Extend that to keep track of the best score over all the keys. Copy your code below and add in the couple of fragments above for keeping track of the best key.

If you get stuck, you can [see the solution](show-best-log-prob.ipynb).


In [None]:
# Write your code here

## Your turn: wrap it in a function

Let's make this codebreaking snippet more usable by wrapping it in a function definition. 

Fill in the definition of `caesar_break()` using the code you've just written.

If you're stuck, you can [see the solution](caesar_break-solution.ipynb).

In [None]:
def caesar_break(message):
    """Breaks a Caesar cipher using frequency analysis"""

    # Write your code here
    
    # Return the best we found
    return best_key, best_prob

Let's see if it works. We encipher a message, then see if `caesar_break` returns the correct key.

In [None]:
key, score = caesar_break(ciphertext)
key, score

Success!

Let's try some other messages.

In [None]:
ct = caesar_encipher("""Here's another test message to see if the caesar 
breaking code works well.""", 17)

key, score = caesar_break(ct)
key, score

Let's try it on something larger: the ciphertext from the [National Cipher Challenge](https://www.cipherchallenge.org/) [2016 challenge 1](https://2016.cipherchallenge.org/challenges/challenge-1/).

In [None]:
with open('2016-1a.ciphertext') as f:
    c1a = f.read()
print(c1a)

In [None]:
key, score = caesar_break(c1a)
key, score

In [None]:
print(caesar_decipher(c1a, key))

# Over to you
If you (or a friend) generated some Caesar-enciphered messages before, can `caesar_break()` crack the code and recover the message?