# Exercise: Attack on the monoalphabetic cipher

In this exercise you will implement an attack on the monoalphabetic cipher, using the text of the book Nineteen Eighty-Four by George Orwell as a corpus. Needless to say, it is a masterpiece worth reading for anyone interested in privacy.

Alice and Bob want to communicate secretly, so they meet in person and choose a key that nobody else knows. They agree to use the monoalphabetic cipher to encrypt and decrypt their messages. An attacker (Charlie) is eavesdropping on the communication between Alice and Bob, so he can see all the ciphertext they send to each other. Charlie only knows that Alice and Bob communicate in English and that they use the monoalphabetic cipher. Our question is: can Charlie crack the secret key with just this information? We will see how in this exercise.

Author: [Sebastià Agramunt Puig](https://github.com/sebastiaagramunt) for [OpenMined](https://www.openmined.org/) Privacy ML Series course.

## Alice and Bob's communication

As mentioned, Alice and Bob first meet and agree on a secret key. For simplicity, we copy here the code of the monoalphabetic cipher that we wrote in the Ciphers notebook.

In [None]:
from random import randrange, seed
from copy import deepcopy
import string


seed(3)

def mono_key_generator()-> str:
    # the key is a random permutation of the 26 lowercase letters
    chars = list(deepcopy(string.ascii_lowercase))
    chars_permutation = []
    
    # draw letters without replacement until none are left
    while len(chars)>0:
        letter = chars.pop(randrange(len(chars)))
        chars_permutation.append(letter)
        
    return ''.join(chars_permutation)
    
def mono_encrypt_decrypt(text: str, secret_key: str, encrypt: bool=True) -> str:
    assert len(secret_key)==len(string.ascii_lowercase), "secret key must be all ascii lowercase, 26 letters"
    
    if encrypt:
        # plaintext letter -> ciphertext letter
        convert_dict = {p:c for p, c in zip(string.ascii_lowercase, secret_key)}
    else:
        # ciphertext letter -> plaintext letter
        convert_dict = {c:p for p, c in zip(string.ascii_lowercase, secret_key)}
    # spaces pass through unchanged; any other character raises a KeyError
    convert_dict[" "] = " "
    
    return ''.join([convert_dict[c] for c in text])
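
To make the substitution idea concrete, here is a minimal, self-contained sketch of an encrypt/decrypt round trip. It does not use the functions above: it relies on Python's built-in `str.maketrans` and `str.translate` instead, and the key `toy_key` (the alphabet reversed) is a made-up example, not a key Alice and Bob would pick.

```python
import string

# Hypothetical toy key for illustration only: the alphabet reversed (a->z, b->y, ...).
toy_key = string.ascii_lowercase[::-1]

# Translation tables: plaintext letter -> ciphertext letter, and the inverse.
encrypt_table = str.maketrans(string.ascii_lowercase, toy_key)
decrypt_table = str.maketrans(toy_key, string.ascii_lowercase)

plaintext = "attack at dawn"
ciphertext = plaintext.translate(encrypt_table)  # characters not in the table (spaces) are kept
recovered = ciphertext.translate(decrypt_table)

print(ciphertext)  # zggzxp zg wzdm
print(recovered)   # attack at dawn
```

Decryption is just the same substitution read backwards, which is why a single 26-letter string is enough to serve as the shared key.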

In [None]:
seed(5)

# generate a secret key and print on screen
secret_key = None
print(f"Secret key shared between Alice and Bob: {secret_key}")

In [None]:
# encrypt the message using monoalphabetic cipher and decrypt the resulting ciphertext
# print both on screen

message = "this is a top secret message"
encrypted_message = None
decrypted_ciphertext = None

print(f"message:\n{message}\n\nciphertext:\n{encrypted_message}\n\ndecrypted_ciphertext:\n{decrypted_ciphertext}")

To get real English words we can download a corpus in this language. For instance, we can download a book and use its text as the messages Alice and Bob send to each other. The following code downloads Nineteen Eighty-Four by George Orwell from [Project Gutenberg Australia](http://gutenberg.net.au).

In [None]:
from utils import download_data, process_load_textfile
import string
import os

url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
filename = 'Nineteen-eighty-four_Orwell.txt'
download_path = '/'.join(os.getcwd().split('/')[:-1]) + '/data/'

#download data to specified path
download_data(url, filename, download_path)

#load data and process
data = process_load_textfile(filename, download_path)

Let's see how it looks after some processing:

In [None]:
data[10000:11000]

Suppose Alice wants to send Bob a very long message taken from Nineteen Eighty-Four. For our purposes this is the same as sending many messages of one word each. Let's code this part:

In [None]:
data_len = len(data)

init_letter = data_len//2
final_letter = init_letter + data_len//4

message = data[init_letter:final_letter]
encrypted_message = mono_encrypt_decrypt(message, secret_key)

## Charlie's side

As we mentioned, Charlie only knows that Alice and Bob communicate in English and that they use the monoalphabetic cipher. He's a smart guy and knows which letters are most frequent in English. His attack consists of comparing the most frequent letters of the ciphertext (encrypted data) that Alice sends to Bob with the most frequent letters in English.

First things first: we need the frequencies of the letters in English. Luckily, you can find them on [Wikipedia](https://en.wikipedia.org/wiki/Letter_frequency).

In [None]:
english_letter_counts = [("a", 0.082),
                         ("b", 0.015),
                         ("c", 0.028),
                         ("d", 0.043),
                         ("e", 0.13),
                         ("f", 0.022),
                         ("g", 0.02),
                         ("h", 0.061),
                         ("i", 0.07),
                         ("j", 0.0015),
                         ("k", 0.0077),
                         ("l", 0.04),
                         ("m", 0.024),
                         ("n", 0.067),
                         ("o", 0.075),
                         ("p", 0.019),
                         ("q", 0.00095),
                         ("r", 0.06),
                         ("s", 0.063),
                         ("t", 0.091),
                         ("u", 0.028),
                         ("v", 0.0098),
                         ("w", 0.024),
                         ("x", 0.0015),
                         ("y", 0.02),
                         ("z", 0.00074)
                        ]
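
Ranking a list of `(letter, frequency)` tuples is a one-liner with the built-in `sorted`. The sketch below is only an illustration on a four-letter subset of the table above; the names `toy_frequencies` and `ranked` are made up for this example.

```python
# Toy subset of the frequency table (values copied from the table above).
toy_frequencies = [("a", 0.082), ("e", 0.13), ("q", 0.00095), ("t", 0.091)]

# Sort by the second tuple element (the frequency), most frequent first.
ranked = sorted(toy_frequencies, key=lambda pair: pair[1], reverse=True)

print([letter for letter, _ in ranked])  # ['e', 't', 'a', 'q']
```

Note that `sorted` is stable: letters with exactly the same frequency keep their original relative order, which matters for ties in the full table such as `c` and `u`.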

In [None]:
# and sort them according to their frequency


In [None]:
from collections import Counter
from typing import List, Tuple

# Write a function that inputs a text and outputs a list of tuples with frequencies of letters,
# hint: use Counter from package collections
def letter_count(text: str) -> List[Tuple[str, int]]:
    # step 1: remove white spaces
    
    
    # step 2: create a list of characters in the text
    
    
    # step 3: count characters and sort 
    pass
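
If you have not used `Counter` before, the following sketch shows the two calls that are most useful here, applied to a short sample sentence rather than to the book. It is an illustration of the standard-library API, not the reference solution; the variable names are made up.

```python
from collections import Counter

sample = "the quick brown fox jumps over the lazy dog"

# Drop the spaces so that only letters are counted.
letters_only = sample.replace(" ", "")

# Counter maps each character to its number of occurrences;
# most_common() returns (character, count) pairs, highest count first.
counts = Counter(letters_only).most_common()

print(counts[:2])  # [('o', 4), ('e', 3)]
```

The result already has the `List[Tuple[str, int]]` shape that `letter_count` is expected to return.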

In [None]:
lc = letter_count(data)

assert lc[0][0]=="e", "letter_count not well implemented"
assert lc[1][0]=="t", "letter_count not well implemented"
assert lc[2][0]=="a", "letter_count not well implemented"
assert lc[3][0]=="o", "letter_count not well implemented"
assert lc[4][0]=="n", "letter_count not well implemented"
assert lc[5][0]=="i", "letter_count not well implemented"

lc

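### Exercise 3: Charlie's attack

Now Charlie has the ciphertext that Alice sent to Bob and the frequencies of the English letters. A simple attack is to calculate the letter frequencies of the ciphertext and compare the two ranked lists: the most frequent ciphertext letter is guessed to stand for the most frequent English letter, the second most frequent for the second, and so on. Let's code this: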
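
One possible way to express this rank-matching idea is sketched below. It is a minimal illustration under stated assumptions, not the reference solution: the helper name `rank_match_key` is hypothetical, and the toy check uses made-up frequencies in which `a` is the most common letter and `z` the rarest, so that the ranks are unambiguous.

```python
from collections import Counter
import string
from typing import List, Tuple


def rank_match_key(ciphertext: str, reference: List[Tuple[str, float]]) -> str:
    """Guess a monoalphabetic key by pairing letters of equal frequency rank.

    Hypothetical helper for illustration; `reference` must hold one
    (plaintext letter, frequency) pair for each of the 26 lowercase letters.
    """
    alphabet = string.ascii_lowercase

    # Plaintext letters, most frequent first.
    plain_ranked = [p for p, _ in sorted(reference, key=lambda x: x[1], reverse=True)]

    # Ciphertext letters, most frequent first (spaces and other symbols are ignored).
    counts = Counter(ch for ch in ciphertext if ch in alphabet)
    cipher_ranked = [c for c, _ in counts.most_common()]
    # Letters that never occur in the ciphertext go last, in alphabetical order.
    cipher_ranked += [c for c in alphabet if c not in counts]

    # Assume the i-th most frequent plaintext letter encrypts to the
    # i-th most frequent ciphertext letter.
    guess = dict(zip(plain_ranked, cipher_ranked))

    # Same layout as the secret key: position i holds the ciphertext letter
    # guessed for the i-th letter of the alphabet.
    return "".join(guess[p] for p in alphabet)


# Toy check with made-up frequencies and a made-up key (the alphabet reversed).
toy_reference = [(ch, float(26 - i)) for i, ch in enumerate(string.ascii_lowercase)]
toy_key = string.ascii_lowercase[::-1]
toy_plaintext = "".join(ch * (26 - i) for i, ch in enumerate(string.ascii_lowercase))
toy_ciphertext = toy_plaintext.translate(str.maketrans(string.ascii_lowercase, toy_key))

print(rank_match_key(toy_ciphertext, toy_reference) == toy_key)  # True
```

The toy text follows the reference frequencies exactly, so the key is recovered in full. Real text only approximates the reference table, so some letters with similar frequencies end up swapped.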

In [None]:
import string

def plaintext_attack(encrypted_message: str, english_letter_counts: List[Tuple[str, float]]) -> str:
    # encrypted_message is the ciphertext intercepted between Alice and Bob
    # english_letter_counts is the list of (letter, frequency) tuples for English
    characters = string.ascii_lowercase
    
    # first rank the letters by frequency in English (plaintext) and in the ciphertext
    
    
    # then build a dictionary that maps each plaintext letter to its guessed ciphertext letter
    
    
    # finally return the inferred key as a 26-letter string, ordered like string.ascii_lowercase
    pass


In [None]:
inferred_secret_key = plaintext_attack(encrypted_message, english_letter_counts)
print(f"secret_key:\n\t{secret_key}\ninferred_secret_key:\n\t{inferred_secret_key}")

correctly_guessed = 0
for sk, isk in zip(secret_key, inferred_secret_key):
    if sk==isk:
        correctly_guessed+=1
print(f"\nCorrectly guessed {correctly_guessed} out of {len(secret_key)}")

Not bad! We guessed 14 out of 26 characters (the exact count can vary with how your implementation orders letters that have the same frequency). Let's see how the text looks when decrypted with our inferred key and compare it to the original.

In [None]:
mono_encrypt_decrypt(encrypted_message, inferred_secret_key, encrypt=False)[0:500]

In [None]:
message[0:500]

## Conclusions

Charlie was able to correctly guess 14 out of 26 characters of the key with this very simple attack! The main takeaway from this exercise is that an attacker can extract information simply by looking at the ciphertext. Can we construct a perfectly secure cipher, so that the ciphertext carries no information about the original message? This is what we are going to see in the next section.