# Breaking Caesar ciphers
Now that we can encipher and decipher with a Caesar cipher, the next step is to write a program to automatically break enciphered messages.

## The basic idea
The basic model is simple: we'll get the computer to try all the keys and see which is the best.

"Trying all the keys" is simple: for a Caesar cipher, we just run through the shifts 0–25.

Finding which is "best" is somewhat trickier. How do we know which is the best possible decryption? For complex ciphers, and long ciphertexts, we don't want to rely on a human to make the decision.
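To see what a human would otherwise have to read through, the sketch below prints all 26 candidate decryptions of a short ciphertext. The helper `shift_text` is a hypothetical, lower-case-only stand-in for the `caesar_decipher` function recapped in the next section, and the sample ciphertext is just for illustration.

```python
import string

def shift_text(text, shift):
    """Shift each lower-case letter back by `shift` places; leave other characters alone."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord('a') - shift) % 26 + ord('a')))
        else:
            result.append(ch)
    return ''.join(result)

ciphertext = 'xlmw mw e xiwx'  # 'this is a test' enciphered with a shift of 4

# One candidate plaintext for each possible key
candidates = {shift: shift_text(ciphertext, shift) for shift in range(26)}
for shift, candidate in candidates.items():
    print(shift, candidate)
```

Only one of the 26 lines reads as English. Picking it out by eye is easy here, but it does not scale to longer texts or to ciphers with many more keys.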

## Recap: Caesar ciphers
We're just repeating the Caesar cipher implementation from the previous notebook.

In [12]:
import string

def pos(letter):
    """Return the position of a letter in the alphabet (0-25)"""
    if letter in string.ascii_lowercase:
        return ord(letter) - ord('a')
    elif letter in string.ascii_uppercase:
        return ord(letter) - ord('A')
    else:
        raise ValueError('pos requires input of {} to be an ascii letter'.format(letter))

In [13]:
def unpos(number): 
    """Return the letter in the given position in the alphabet (mod 26)"""
    return chr(number % 26 + ord('a'))

In [14]:
def caesar_encipher_letter(letter, shift):
    """Encipher a letter, given a shift amount"""
    if letter in string.ascii_letters:
        cipherletter = unpos(pos(letter) + shift)
        if letter in string.ascii_uppercase:
            return cipherletter.upper()
        else:
            return cipherletter
    else:
        return letter      

In [15]:
def caesar_encipher(message, shift):
    """Encipher a message with the Caesar cipher of given shift"""
    enciphered = ""
    for character in message:
        # Strings have no append(), so build the result by concatenation
        enciphered += caesar_encipher_letter(character, shift)
    return enciphered

In [16]:
def caesar_decipher(message, shift):
    """Decipher a message with the Caesar cipher of given shift"""
    return caesar_encipher(message, -shift)

In [18]:
cat = ''.join

In [19]:
caesar_encipher('This is a test message.', 4)

'Xlmw mw e xiwx qiwweki.'

In [20]:
caesar_decipher('Xlmw mw e xiwx qiwweki.', 4)

'This is a test message.'

## A language model
![Monkey typing](Monkey-typing.jpg) 

My approach takes the idea from the apocryphal story that an infinite number of monkeys, with an infinite number of typewriters, will eventually create the complete works of Shakespeare. As the computer tries each key, generating a possible plaintext, it can score the possible plaintext by how likely it would be for a monkey, typing completely at random, to generate that plaintext.

![English letters by proportion](letter-treemap.png) 

That idea isn't that helpful when all the letters are equally likely. But if the monkey is using a keyboard where the keys are sized in proportion to how often they appear in English, we have something. The diagram gives an idea of what such a keyboard could look like. A monkey using that keyboard, hitting keys at random, is more likely to produce something like *treattlpis* than *nziuechjtk*.
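As a rough sketch of the two keyboards, the code below lets a simulated monkey type with and without frequency weighting. The weights are the letter counts gathered later in this notebook. The function name `monkey_types` is made up for this illustration, and the output differs on every run.

```python
import random

# Letter counts from the three corpora, as computed later in this notebook
letter_counts = {
    'a': 490124, 'm': 172199, 'i': 421240, 'd': 267917, 's': 404473,
    'u': 190269, 'e': 758091, 'r': 373599, 'n': 419374, 'g': 117888,
    'h': 416369, 't': 560576, 'o': 504520, 'w': 154157, 'f': 135318,
    'p': 100690, 'l': 259023, 'y': 143040, 'c': 141094, 'b': 92919,
    'k': 54248, 'v': 65297, 'q': 5499, 'j': 6679, 'x': 7414, 'z': 3577}

letters = list(letter_counts)
weights = [letter_counts[letter] for letter in letters]

def monkey_types(n, weighted=True):
    """Return n randomly typed letters, optionally weighted by letter frequency."""
    if weighted:
        return ''.join(random.choices(letters, weights=weights, k=n))
    return ''.join(random.choices(letters, k=n))

print(monkey_types(10))                  # keys sized by English letter frequency
print(monkey_types(10, weighted=False))  # all keys the same size
```

Neither monkey writes real words, but over many keystrokes the weighted one hits common letters such as *e* and *t* far more often than rare ones such as *q* and *z*.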

That allows us to score how close a piece of text is to English. If we can work out how likely a random monkey would be to produce each possible plaintext, we can choose the key which produces the most likely plaintext.

(This model is also called a _bag of letters_ model, as it's the same as taking all the letters in the text and putting them in a bag, losing all idea of the order of letters. While we lose a lot of information with this, it's actually good enough for our purposes.)
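A small illustration of what "losing all idea of the order" means: two texts made of the same letters end up as the same bag, so this model cannot tell them apart. The helper name `bag_of_letters` is made up for this sketch.

```python
import collections

def bag_of_letters(text):
    """Count the letters in text, ignoring the order they appear in."""
    return collections.Counter(text)

# Same letters in a different order: identical bags
print(bag_of_letters('listen') == bag_of_letters('silent'))  # True

# Different letters: different bags
print(bag_of_letters('listen') == bag_of_letters('listed'))  # False
```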

## Finding letter probabilities
How do we find these probabilities? The easy answer is to simply read a lot of text, counting the letters. Let's use three large texts: the [complete works of Shakespeare](https://www.gutenberg.org/ebooks/100), [War and Peace](https://www.gutenberg.org/ebooks/2600), and [The Adventures of Sherlock Holmes](https://www.gutenberg.org/ebooks/48320), all from [Project Gutenberg](https://www.gutenberg.org/wiki/Main_Page).

The Python [`collections.Counter()`](https://docs.python.org/3/library/collections.html#collections.Counter) object from the standard library does a good job of counting letters. If we pass a `Counter` a sequence of characters, it will count them for us:

In [1]:
import collections
collections.Counter('this is some test text')

Counter({'t': 5,
         'h': 1,
         'i': 2,
         's': 4,
         ' ': 4,
         'o': 1,
         'm': 1,
         'e': 3,
         'x': 1})

If we `update` the `Counter`, it will update the counts for us:

In [2]:
counts = collections.Counter()
counts.update('this is some text')
counts

Counter({'t': 3,
         'h': 1,
         'i': 2,
         's': 3,
         ' ': 3,
         'o': 1,
         'm': 1,
         'e': 2,
         'x': 1})

In [3]:
counts.update('here is some more text')
counts

Counter({'t': 5,
         'h': 2,
         'i': 3,
         's': 5,
         ' ': 7,
         'o': 3,
         'm': 3,
         'e': 7,
         'x': 2,
         'r': 2})

This allows us to easily combine the counts of letters in all the texts, then write them to a file. 

In [4]:
import collections
import string

corpora = ['shakespeare.txt', 'sherlock-holmes.txt', 'war-and-peace.txt']
counts = collections.Counter()

for corpus in corpora:
    # Read each corpus and close the file when done
    with open(corpus, 'r') as f:
        text = f.read().lower()
    counts.update(text)

# Keep only the counts for the letters a-z
letter_counts = {l: counts[l] for l in counts if l in string.ascii_lowercase}
sorted_letters = sorted(letter_counts, key=counts.get, reverse=True)

# Write the counts to a file, most frequent letter first
with open('count_1l.txt', 'w') as f:
    for l in sorted_letters:
        f.write("{0}\t{1}\n".format(l, counts[l]))

letter_counts

{'a': 490124,
 'm': 172199,
 'i': 421240,
 'd': 267917,
 's': 404473,
 'u': 190269,
 'e': 758091,
 'r': 373599,
 'n': 419374,
 'g': 117888,
 'h': 416369,
 't': 560576,
 'o': 504520,
 'w': 154157,
 'f': 135318,
 'p': 100690,
 'l': 259023,
 'y': 143040,
 'c': 141094,
 'b': 92919,
 'k': 54248,
 'v': 65297,
 'q': 5499,
 'j': 6679,
 'x': 7414,
 'z': 3577}

## Using the language model
Now we have the letter counts, we need to use them to score possible plaintexts. There are two stages here:
1. Convert the counts into probabilities;
2. Score a piece of text for how probable it is (by the random monkey metric).

## Converting the counts to probabilities
This just requires working out what proportion each letter is of the total counts. This process is called _normalisation_. We can find the total of all the letter counts with `sum(letter_counts.values())`, and find the normalised version of the counts with:

In [5]:
normalised_counts = {letter: count / sum(letter_counts.values())
    for letter, count in letter_counts.items() }
normalised_counts

{'a': 0.07822466632852368,
 'm': 0.0274832681466434,
 'i': 0.06723065682200283,
 'd': 0.04276003200973443,
 's': 0.06455461365674188,
 'u': 0.03036727244056988,
 'e': 0.12099267842761596,
 'r': 0.059627068080057535,
 'n': 0.06693283988716792,
 'g': 0.0188151354843611,
 'h': 0.06645323651676122,
 't': 0.08946893143730666,
 'o': 0.08052229365643544,
 'w': 0.024603732702757314,
 'f': 0.02159699463450712,
 'p': 0.016070303948835497,
 'l': 0.041340533714760326,
 'y': 0.022829439634933255,
 'c': 0.022518854557125788,
 'b': 0.01483003846083867,
 'k': 0.00865807774969141,
 'v': 0.010421517895988792,
 'q': 0.0008776502275761883,
 'j': 0.001065980336421415,
 'x': 0.0011832876499817894,
 'z': 0.0005708955926604884}
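A quick way to check a normalisation is to confirm that the resulting probabilities add up to 1. The sketch below does this on a tiny, made-up set of counts (`toy_counts` is for illustration only, not real data).

```python
toy_counts = {'a': 5, 'b': 3, 'c': 2}  # made-up counts, for illustration only

# Work out the total once, then divide each count by it
total = sum(toy_counts.values())
toy_probabilities = {letter: count / total
                     for letter, count in toy_counts.items()}

print(toy_probabilities)                # {'a': 0.5, 'b': 0.3, 'c': 0.2}
print(sum(toy_probabilities.values()))  # 1, up to floating-point rounding
```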

## Score a piece of text
We can use these to find the probability of a sequence of letters by finding the probability of each letter, then multiplying them all together. 

But the final values will be very small for long texts. There's a danger that with a long text, we'll end up with a number that's too small to represent. 

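We can get around this by using the "trick" of taking logarithms of probabilities. As probabilities get smaller, their logarithms become more negative, but much more slowly, so they stay well within the range of numbers the computer can represent. Because the logarithm is an increasing function, the most probable text is still the one with the highest log probability, so we can handle long texts and still see which is the most probable.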

To find the log probability of a sequence of letters, we find the log probability of each letter, then just add them up, using Python's built-in `sum()`.
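The sketch below shows the problem and the fix side by side, using a made-up per-letter probability of 0.05 and a text of 1000 letters (both values are assumptions chosen for illustration).

```python
import math

p = 0.05   # a made-up per-letter probability
n = 1000   # number of letters in the text

# Multiplying the probabilities together underflows to zero
product = 1.0
for _ in range(n):
    product *= p

# Adding the log probabilities stays comfortably representable
log_probability = sum(math.log10(p) for _ in range(n))

print(product)          # 0.0
print(log_probability)  # roughly -1301.03
```

Once the product has collapsed to zero, every long candidate plaintext looks equally (im)probable, whereas the log probabilities can still be compared.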

We find the log probability of each letter in much the same way as we normalised the counts above. We define a convenience function `Pletters` which finds the log probability of a letter sequence. Note that `Pletters` assumes it is only passed lower-case letters: anything else will cause it to raise a `KeyError`.

In [30]:
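import math

# Log (base 10) probability of each letter.
# Note: the total used here is over every character counted (counts),
# not just the letters (letter_counts), so each value is lower by the
# same constant than the log of the normalised counts above.
Pl = {letter: math.log10(count / sum(counts.values()))
    for letter, count in letter_counts.items() }

def Pletters(letters):
    """Return the log probability of a sequence of lower-case letters"""
    return sum(Pl[letter] for letter in letters)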

We can get around that limitation by defining `sanitise`, which removes everything apart from letters from a piece of text, and converts all the letters to lowercase.

In [36]:
def sanitise(text):
    return cat(l for l in text.lower() 
               if l in string.ascii_lowercase)
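As a quick check of what sanitising does, here is a self-contained sketch. `sanitise_sketch` is a hypothetical copy of `sanitise` that uses `''.join` directly, so it runs without the `cat` helper defined earlier.

```python
import string

def sanitise_sketch(text):
    """Keep only the letters of text, converted to lower case."""
    return ''.join(ch for ch in text.lower()
                   if ch in string.ascii_lowercase)

print(sanitise_sketch('Hello, there!'))  # hellothere
```

Spaces, punctuation and digits are all dropped, so the result is safe to score letter by letter.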

We can use that to find some log probabilities:

In [31]:
Pletters('hello')

-6.578324107946823

In [34]:
Pletters('hellothere')

-12.485441714891119

Much better values!

We've ended up with a simple _model_ of the English language, which we can use to judge how likely a given piece of text is actually English. 

## Breaking Caesar ciphers
Finally, we can use `Pletters` to score possible plaintexts, and hence automatically break Caesar ciphers!

In [37]:
def caesar_break(dirty_message):
    """Breaks a Caesar cipher using frequency analysis"""
    message = sanitise(dirty_message)
    best_shift = 0
    best_fit = float('-inf')
    for shift in range(26):
        plaintext = caesar_decipher(message, shift)
        fit = Pletters(plaintext)
        if fit > best_fit:
            best_fit = fit
            best_shift = shift
    return best_shift, best_fit
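It can be instructive to look at the scores for every shift, not only the winner. The sketch below is a self-contained variant that ranks all 26 shifts. The names `score`, `decipher` and `rank_shifts` are made up for this illustration. It also normalises over letters only, so its scores need not match those returned by `caesar_break`.

```python
import math
import string

# Letter counts from the three corpora, as computed earlier in this notebook
letter_counts = {
    'a': 490124, 'm': 172199, 'i': 421240, 'd': 267917, 's': 404473,
    'u': 190269, 'e': 758091, 'r': 373599, 'n': 419374, 'g': 117888,
    'h': 416369, 't': 560576, 'o': 504520, 'w': 154157, 'f': 135318,
    'p': 100690, 'l': 259023, 'y': 143040, 'c': 141094, 'b': 92919,
    'k': 54248, 'v': 65297, 'q': 5499, 'j': 6679, 'x': 7414, 'z': 3577}

total = sum(letter_counts.values())
log_probs = {letter: math.log10(count / total)
             for letter, count in letter_counts.items()}

def score(text):
    """Log probability of text under the letter model, skipping non-letters."""
    return sum(log_probs[ch] for ch in text.lower() if ch in log_probs)

def decipher(text, shift):
    """Undo a Caesar shift on lower-case letters; leave other characters alone."""
    return ''.join(chr((ord(ch) - ord('a') - shift) % 26 + ord('a'))
                   if ch in string.ascii_lowercase else ch
                   for ch in text)

def rank_shifts(ciphertext):
    """Return (shift, score) pairs for all 26 shifts, best score first."""
    scored = [(shift, score(decipher(ciphertext, shift)))
              for shift in range(26)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_shifts('kyzj zj r jrdgcv dvjjrxv kf sv uvtipgkvu')
for shift, fit in ranking[:3]:
    print(shift, round(fit, 2))
```

The gap between the top score and the runner-up gives a rough feel for how confident the break is.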

Let's see if it works. We encipher a message, then see if `caesar_break` returns the correct key.

In [38]:
caesar_encipher('this is a sample message to be decrypted', 17)
# 'kyzj zj r jrdgcv dvjjrxv kf sv uvtipgkvu'
caesar_break('kyzj zj r jrdgcv dvjjrxv kf sv uvtipgkvu')

(17, -41.43440679319847)

Success!

Let's try it on something larger: the ciphertext from the [National Cipher Challenge](https://www.cipherchallenge.org/) [2016 challenge 1](https://2016.cipherchallenge.org/challenges/challenge-1/).

In [39]:
c1a = open('2016-1a.ciphertext').read()
print(c1a)

PIZZG,
Q PIDM AKIVVML BPM MVKZGXBML VWBM BPM XWTQKM NWCVL WV RIUMTQI'A LMAS IVL IBBIKPML QB NWZ GWC BW TWWS IB. BPM XWTQKM LMKZGXBML QB NWZ BPMUAMTDMA (QB QA DMZG ABZIQOPBNWZEIZL WVKM GWC ZMITQAM BPIB QB PIA JMMV EZQBBMV JIKSEIZLA - QB RCAB CAMA I KIMAIZ APQNB KQXPMZ). BPM WNNQKMZ QV KPIZOM WN BPM QVDMABQOIBQWV UILM QB KTMIZ BW UM BPIB PM BPQVSA BPQA XZWDMA RIUMTQI'A LMIBP QA "RCAB" I XMZAWVIT BZIOMLG. KIZMTMAA CAM WN BPM EWZL "RCAB" MDMV QN PM QA ZQOPB, JCB Q LWV'B BPQVS PM QA. Q PIDM AXWSMV BW PMZ KWTTMIOCMA, IVL RIUMTQI LWMAV'B ABZQSM UM IA I RCUXMZ. APM EIA XZMBBG LZQDMV IVL PMZ EWZS EIA OWQVO MFBZMUMTG EMTT. IXXIZMVBTG APM EIA CVPIXXG IJWCB PMZ JWGNZQMVL TMIDQVO, JCB I YCQKS AKIV WN PMZ AMIZKP PQABWZG ACOOMABA APM EIA XZMBBG IKBQDM QV BZGQVO BW BZIKS PQU LWEV. BPM XWTQKM BPQVS BPIB APWEA PWE LMAXMZIBM APM EIA. Q BPQVS QB APWEA BPIB APM EIAV'B BPM AWZB BW OQDM CX MIAQTG.
WV WVM BPQVO Q LW IOZMM EQBP BPM XWTQKM, QB LWMAV'B AMMU DMZG TQSMTG BPIB I XPGAQKQAB EWZSQVO WV OZIDQBG EIDMA Q

In [42]:
key, fitness = caesar_break(c1a)
key, fitness

(8, -1698.9474014903544)

In [41]:
print(caesar_decipher(c1a, key))

HARRY,
I HAVE SCANNED THE ENCRYPTED NOTE THE POLICE FOUND ON JAMELIA'S DESK AND ATTACHED IT FOR YOU TO LOOK AT. THE POLICE DECRYPTED IT FOR THEMSELVES (IT IS VERY STRAIGHTFORWARD ONCE YOU REALISE THAT IT HAS BEEN WRITTEN BACKWARDS - IT JUST USES A CAESAR SHIFT CIPHER). THE OFFICER IN CHARGE OF THE INVESTIGATION MADE IT CLEAR TO ME THAT HE THINKS THIS PROVES JAMELIA'S DEATH IS "JUST" A PERSONAL TRAGEDY. CARELESS USE OF THE WORD "JUST" EVEN IF HE IS RIGHT, BUT I DON'T THINK HE IS. I HAVE SPOKEN TO HER COLLEAGUES, AND JAMELIA DOESN'T STRIKE ME AS A JUMPER. SHE WAS PRETTY DRIVEN AND HER WORK WAS GOING EXTREMELY WELL. APPARENTLY SHE WAS UNHAPPY ABOUT HER BOYFRIEND LEAVING, BUT A QUICK SCAN OF HER SEARCH HISTORY SUGGESTS SHE WAS PRETTY ACTIVE IN TRYING TO TRACK HIM DOWN. THE POLICE THINK THAT SHOWS HOW DESPERATE SHE WAS. I THINK IT SHOWS THAT SHE WASN'T THE SORT TO GIVE UP EASILY.
ON ONE THING I DO AGREE WITH THE POLICE, IT DOESN'T SEEM VERY LIKELY THAT A PHYSICIST WORKING ON GRAVITY WAVES I