# Exercise: Attack on the monoalphabetic cipher

In this exercise you will implement an attack on the monoalphabetic cipher, using the text of the book Nineteen Eighty-Four by George Orwell as a corpus. Needless to say, it is a masterpiece worth reading for anyone interested in privacy.

Alice and Bob want to communicate secretly, so they meet in person and choose a key that nobody else knows. They agree to use the monoalphabetic cipher to encrypt and decrypt their messages. An attacker (Charlie) is eavesdropping on the communication between Alice and Bob, so he can see all the ciphertext they send to each other. Charlie only knows that Alice and Bob communicate in English and that they use the monoalphabetic cipher. The question is: can Charlie crack the secret key with just this information? We will see how in this exercise.

## Alice and Bob's communication

As mentioned, Alice and Bob first meet and agree on a secret key. For simplicity, we copy here the code of the monoalphabetic cipher we wrote in the Ciphers notebook.

In [1]:
from random import randrange, seed
from copy import deepcopy
import string


seed(3)
characters = string.ascii_lowercase

def MonoKeyGenerator(characters):
    old_chars = list(deepcopy(characters))
    permut_ = []
    
    while len(old_chars)>0:
        elem = old_chars.pop(randrange(len(old_chars)))
        permut_.append(elem)
    return ''.join(permut_)
    
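def MonoEncrypt(plaintext, characters, secret_key):
    # map each plaintext character to the key character at the same position
    convert_dict = {}
    for p, c in zip(characters, secret_key):
        convert_dict[p] = c
    convert_dict[' '] = ' '  # spaces are left unencrypted

    c = ''
    for p in plaintext:
        # anything that is neither a letter of the alphabet nor a space becomes "_"
        if p not in secret_key + " ":
            c += "_"
        else:
            c += convert_dict[p]
    return c


def MonoDecrypt(ciphertext, characters, secret_key):
    # inverse mapping: key character -> plaintext character
    convert_dict = {}
    for p, c in zip(characters, secret_key):
        convert_dict[c] = p
    convert_dict[' '] = ' '  # spaces are left unencrypted

    c = ''
    for p in ciphertext:
        # ciphertext characters that do not appear in the key become "_"
        if p not in secret_key + " ":
            c += "_"
        else:
            c += convert_dict[p]

    return c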

In [4]:
seed(5)
secret_key = MonoKeyGenerator(characters)
print(f"Secret key shared between Alice and Bob: {secret_key}")

Secret key shared between Alice and Bob: tizmxsarjchdlpgqwenbykvofu


In [10]:
message = "this is a top secret message"
encrypted_message = MonoEncrypt(message, characters, secret_key)
decrypted_ciphertext = MonoDecrypt(encrypted_message, characters, secret_key)

print(f"message:\n{message}\n\nciphertext:\n{encrypted_message}\n\ndecrypted_ciphertext:\n{decrypted_ciphertext}")

message:
this is a top secret message

ciphertext:
brjn jn t bgq nxzexb lxnntax

decrypted_ciphertext:
this is a top secret message


To get real English words we can download a corpus in this language. For instance, we can download a book and use it as the messages Alice and Bob send to each other. In the following chunk of code we download Nineteen Eighty-Four by George Orwell from [Project Gutenberg Australia](http://gutenberg.net.au).

In [11]:
from utils import download_data, process_load_textfile
import string
import os

url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
filename = 'Nineteen-eighty-four_Orwell.txt'
download_path = '/'.join(os.getcwd().split('/')[:-1]) + '/data/'

#download data to specified path
download_data(url, filename, download_path)

#load data and process
data = process_load_textfile(filename, download_path)

Let's see how it looks after some processing:

In [12]:
data[10000:11000]

'ook its smooth creamy paper a little yellowed by age was of a kind that had not been manufactured for at least forty years past he could guess however that the book was much older than that he had seen it lying in the window of a frowsy little junkshop in a slummy quarter of the town just what quarter he did not now remember and had been stricken immediately by an overwhelming desire to possess it party members were supposed not to go into ordinary shops dealing on the free market it was called but the rule was not strictly kept because there were various things such as shoelaces and razor blades which it was impossible to get hold of in any other way he had given a quick glance up and down the street and then had slipped inside and bought the book for two dollars fifty at the time he was not conscious of wanting it for any particular purpose he had carried it guiltily home in his briefcase even with nothing written in it it was a compromising possession the thing that he was about to

So Alice wants to send Bob a very long message taken from the book Nineteen Eighty-Four. Since the cipher acts on one character at a time and leaves spaces untouched, this is the same as sending many messages of one word each. Let's code this part:

In [13]:
data_len = len(data)

init_letter = data_len//2
final_letter = init_letter + data_len//4

message = data[init_letter:final_letter]
encrypted_message = MonoEncrypt(message, characters, secret_key)
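
The claim that one long message is equivalent to many one-word messages can be checked directly. The following is a minimal sketch, not the notebook's own code: `encrypt` is a hypothetical stand-in for `MonoEncrypt` built on `str.translate` (it leaves unknown characters unchanged instead of writing "_"), and the key is the one printed earlier.

```python
import string

# Key printed earlier in this notebook. `encrypt` is a hypothetical helper,
# used here only to keep the example self-contained.
alphabet = string.ascii_lowercase
key = "tizmxsarjchdlpgqwenbykvofu"
table = str.maketrans(alphabet, key)

def encrypt(plaintext: str) -> str:
    # Substitute each letter independently; spaces are not in the table,
    # so they pass through unchanged.
    return plaintext.translate(table)

text = "this is a top secret message"
whole = encrypt(text)
word_by_word = " ".join(encrypt(word) for word in text.split(" "))

print(whole)
print(whole == word_by_word)
```

Because word boundaries survive encryption, Charlie can count ciphertext words exactly as he would count plaintext words.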

## Charlie's side

As we mentioned, Charlie only knows that Alice and Bob communicate in English and that they use the monoalphabetic cipher. He's a smart guy and knows which words are the most frequent in English. His attack consists of comparing the most frequent words in the ciphertext (encrypted data) that Alice sends to Bob with the most frequent words in English.

First things first: we need to calculate the most frequent words in English. To get an estimate we will use the same book, Nineteen Eighty-Four. There are other sources for these frequencies, see for example [Wikipedia](https://en.wikipedia.org/wiki/Most_common_words_in_English).

### Exercise 1: Word counts

In [14]:
from collections import Counter
from typing import List, Tuple

## Write a function that inputs a text and outputs a list of tuples with frequencies of words, hint: use Counter from package collections
def word_count(text: str) -> List[Tuple[str, int]]:
    # step 1: split the string into words, words are separated by space
    
    # step 2: return the most common words sorted using Counter from collections
    pass

# solution for the exercise
def word_count(text: str) -> List[Tuple[str, int]]:
    words = text.split(" ")
    return Counter(words).most_common()

In [15]:
wc = word_count(data)

assert wc[0][0]=="the", "word_count not well implemented"
assert wc[1][0]=="of", "word_count not well implemented"
assert wc[2][0]=="a", "word_count not well implemented"
assert wc[3][0]=="and", "word_count not well implemented"
assert wc[4][0]=="to", "word_count not well implemented"
assert wc[5][0]=="was", "word_count not well implemented"

wc[:10]

[('the', 6507),
 ('of', 3497),
 ('a', 2552),
 ('and', 2423),
 ('to', 2336),
 ('was', 2306),
 ('he', 1953),
 ('in', 1854),
 ('it', 1853),
 ('that', 1456)]
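
One detail worth keeping in mind: `text.split(" ")` produces empty strings when the text contains consecutive spaces, whereas `text.split()` collapses any run of whitespace. Whether the processed book contains such runs depends on `process_load_textfile`, which is not shown here, so the following is only an illustrative sketch on a made-up string.

```python
from collections import Counter

sample = "the cat  and the dog"  # note the double space between "cat" and "and"

# Splitting on a single space keeps an empty "word" for the double space.
with_empty = Counter(sample.split(" "))
# Splitting on any whitespace does not.
clean = Counter(sample.split())

print(with_empty.most_common())
print(clean.most_common())
```

Empty strings have length 0, so they would not affect the per-length statistics used later, but they can show up in the raw word counts.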

### Exercise 2: Most common words by length

Once we have the most common words, we need to find the most common word for each length; this is what we are going to use to perform the attack. Write a function that takes the whole text as a string and outputs a list, sorted by word length, with the most common one-letter word, the most common two-letter word, and so on. An example of the output is:

```python
[('a', 2552),
 ('of', 3497),
 ('the', 6507),
 ('that', 1456),
 ('there', 542)]
```

That is, the most common one-letter word is "a" with a frequency of 2552, the most common two-letter word is "of" with a frequency of 3497, and so on.

In [16]:
## Write a function to get the most common words by length, i.e. the most common word with 1 letter, with 2 letters...
def most_common_words_lenght(text: str) -> List[Tuple[str, int]]:
    # Step 1: use function word_count to count all the words and have them sorted
    
    # Step 2: loop over the word_count list and find the first occurrence of a word of each length
    pass


def most_common_words_lenght(text: str) -> List[Tuple[str, int]]:
    # wc is sorted by decreasing frequency
    wc = word_count(text)
    max_word_size = max(len(word) for word, _ in wc)

    word_freq = []
    # sizes run from 1 up to, but not including, max_word_size
    for size in range(1, max_word_size):
        # the first word of this length in wc is the most common one
        for word, occurrence in wc:
            if len(word) == size:
                word_freq.append((word, occurrence))
                break
    return word_freq

In [17]:
common_words = most_common_words_lenght(data)

assert common_words[0][0]=="a", "common words not well implemented"
assert common_words[1][0]=="of", "common words not well implemented"
assert common_words[2][0]=="the", "common words not well implemented"
assert common_words[3][0]=="that", "common words not well implemented"
assert common_words[4][0]=="there", "common words not well implemented"
assert common_words[5][0]=="obrien", "common words not well implemented"

common_words

[('a', 2552),
 ('of', 3497),
 ('the', 6507),
 ('that', 1456),
 ('there', 542),
 ('obrien', 174),
 ('winston', 450),
 ('newspeak', 83),
 ('something', 97),
 ('telescreen', 91),
 ('doublethink', 30),
 ('intellectual', 15),
 ('consciousness', 23),
 ('simultaneously', 16),
 ('extraordinarily', 5),
 ('thoughtcriminals', 7),
 ('nineteenthcentury', 4),
 ('disproportionately', 2),
 ('insufficientnothing', 1),
 ('counterrevolutionary', 1),
 ('demoralizationcontrol', 1),
 ('counterrevolutionaries', 1),
 ('impedimentahockeysticks', 1),
 ('dirtymindednesseverything', 1)]
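
The solution above rescans the whole word list once per length. As an alternative sketch (not the notebook's solution), the same information can be gathered in a single pass by remembering the first word seen for each length, since `Counter.most_common()` already orders words by decreasing frequency. The name `most_common_by_length` is hypothetical, and unlike the solution above this version also reports the longest length.

```python
from collections import Counter
from typing import Dict, List, Tuple

def most_common_by_length(text: str) -> List[Tuple[str, int]]:
    best: Dict[int, Tuple[str, int]] = {}
    for word, count in Counter(text.split()).most_common():
        # Words arrive by decreasing frequency, so the first word seen
        # for a given length is the most frequent word of that length.
        best.setdefault(len(word), (word, count))
    # Return the winners ordered by word length.
    return [best[size] for size in sorted(best)]

sample = "a cat and a dog and a bird saw the cat"
result = most_common_by_length(sample)
print(result)
```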

### Exercise 3: Charlie's attack

Now Charlie can calculate the most common word for each word length in English and compare them to the most common words in the ciphertext. Let's have a look at the most common words in the ciphertext that Alice sends to Bob. Remember that Charlie is eavesdropping on all encrypted communications between Alice and Bob.

In [18]:
most_common_words_lenght(encrypted_message)

[('t', 535),
 ('gs', 952),
 ('brx', 1759),
 ('brtb', 359),
 ('vrjzr', 135),
 ('nxxlxm', 43),
 ('vjpnbgp', 79),
 ('qgnnjidx', 19),
 ('pxzxnntef', 35),
 ('bxdxnzexxp', 21),
 ('mgyidxbrjph', 18),
 ('rjxetezrjztd', 7),
 ('zgpnzjgynpxnn', 7),
 ('njlydbtpxgyndf', 6),
 ('mjnbjpayjnrtidx', 2),
 ('zgpnbeyzbjgpvjnx', 1),
 ('ltpbxdqjxzxtdvtfn', 1),
 ('mjnqegqgebjgptbxdf', 1)]

Looking at the above, we may be tempted to think that the letter "a" is substituted by the letter "t", because each is the most frequent one-letter word in its text. Likewise, the word "of" would be encrypted as "gs", so "o" would be substituted by "g" and "f" by "s". In this exercise you need to come up with a function that tries to guess the secret key from the word frequencies.

In [24]:
def plaintext_attack(data: str, encrypted_message: str) -> str:
    # data is the book nineteen eighty four to calculate the frequency of words
    # encrypted message is the message that Alice sends to Bob
    
    # first calculate the frequencies in plaintext and ciphertext
    common_words_plaintext = most_common_words_lenght(data)
    common_words_ciphertext = most_common_words_lenght(encrypted_message)
    
    # a dictionary that maps each plaintext letter to its guessed ciphertext letter
    key_dict = {}
    for (word_pt, _), (word_ctx, _) in zip(common_words_plaintext, common_words_ciphertext):
        for letter_pt, letter_ctx in zip(word_pt, word_ctx):
            # TODO: add the letter conversion if it hasn't been added
            pass
        
    # TODO: from key_dict calculate the secret_key putting the character "_" if the
    # conversion hasn't been found
    inferred_secret_key = ''
    
    return inferred_secret_key


def plaintext_attack(data: str, encrypted_message: str) -> str:
    
    common_words_plaintext = most_common_words_lenght(data)
    common_words_ciphertext = most_common_words_lenght(encrypted_message)
    
    key_dict = {}
    for (word_pt, _), (word_ctx, _) in zip(common_words_plaintext, common_words_ciphertext):
        for letter_pt, letter_ctx in zip(word_pt, word_ctx):
            if key_dict.get(letter_pt) is None:
                key_dict[letter_pt] = letter_ctx
    
    inferred_secret_key = ''
    for letter in characters:
        if key_dict.get(letter) is not None:
            inferred_secret_key+=key_dict[letter]
        else:
            inferred_secret_key+="_"
    return inferred_secret_key

And finally, let's test our algorithm:

In [30]:
inferred_secret_key = plaintext_attack(data, encrypted_message)
print(f"secret_key:\n\t{secret_key}\ninferred_secret_key:\n\t{inferred_secret_key}")

correctly_guessed = 0
for sk, isk in zip(secret_key, inferred_secret_key):
    if sk==isk:
        correctly_guessed+=1
print(f"\nCorrectly guessed {correctly_guessed} out of {len(secret_key)}")

secret_key:
	tizmxsarjchdlpgqwenbykvofu
inferred_secret_key:
	txzmxsfrl_xdzmgj_znby_vjf_

Correctly guessed 13 out of 26
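
Note that the inferred key repeats some letters (for example `x` and `z`), which cannot happen in a real key because the key is a permutation of the alphabet. One possible refinement, shown here only as a sketch, is to refuse any guess whose ciphertext letter has already been assigned. The helper `infer_key` is hypothetical and takes the word pairs directly; the pairs below are the first five entries of the two frequency lists shown earlier. No claim is made that this guesses more letters correctly on the book, only that the result is consistent with a permutation.

```python
import string
from typing import Dict, List, Set, Tuple

def infer_key(pairs: List[Tuple[str, str]],
              alphabet: str = string.ascii_lowercase) -> str:
    """pairs holds (plaintext_word, ciphertext_word) guesses of equal length."""
    mapping: Dict[str, str] = {}
    used: Set[str] = set()
    for word_pt, word_ct in pairs:
        for letter_pt, letter_ct in zip(word_pt, word_ct):
            # Keep the first guess for a plaintext letter and never
            # assign the same ciphertext letter twice.
            if letter_pt not in mapping and letter_ct not in used:
                mapping[letter_pt] = letter_ct
                used.add(letter_ct)
    # Unknown positions are marked with "_", as in the solution above.
    return "".join(mapping.get(letter, "_") for letter in alphabet)

pairs = [("a", "t"), ("of", "gs"), ("the", "brx"),
         ("that", "brtb"), ("there", "vrjzr")]
partial_key = infer_key(pairs)
print(partial_key)
```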


Not bad! We've guessed 13 out of 26 characters. Let's see what the text decrypted with our inferred key looks like and compare it to the original.

In [31]:
MonoDecrypt(encrypted_message, characters, inferred_secret_key)[0:500]

'x_ukn fo_wa_n x_ thk saik io_kik_t layx__ a f_xk_nly ha_n fo_ a ioik_t o_ wx_sto_s a_i so that thk two of thki wk_k wal_x__ sxnk _y sxnk hk _k_a_ s_ka_x__ wxth thk _krulxa_ __a_k rou_tksy that nxffk_k_txatkn hxi f_oi thk ia_o_xty of x__k_ _a_ty iki_k_s x han _kk_ ho_x__ fo_ a_ o__o_tu_xty of tal_x__ to you hk saxn x was _kanx__ o_k of you_ _kws_ka_ a_txrlks x_ thk txiks thk othk_ nay you ta_k a srhola_ly x_tk_kst x_ _kws_ka_ x _klxk_k wx_sto_ han _kro_k_kn _a_t of hxs sklf_osskssxo_ ha_nly srhol'

In [32]:
message[0:500]

'inued forward in the same movement laying a friendly hand for a moment on winstons arm so that the two of them were walking side by side he began speaking with the peculiar grave courtesy that differentiated him from the majority of inner party members i had been hoping for an opportunity of talking to you he said i was reading one of your newspeak articles in the times the other day you take a scholarly interest in newspeak i believe winston had recovered part of his selfpossession hardly schol'
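
To put a number on how readable the partial decryption is, one can measure the fraction of letters that agree with the original. A minimal sketch, where `letter_agreement` is a hypothetical helper (not part of the notebook) and the two strings are the first 25 characters of the outputs above:

```python
def letter_agreement(guess: str, truth: str) -> float:
    # Compare position by position, ignoring spaces, which are never encrypted.
    pairs = [(g, t) for g, t in zip(guess, truth) if t != " "]
    return sum(g == t for g, t in pairs) / len(pairs)

decrypted_start = "x_ukn fo_wa_n x_ thk saik"
original_start = "inued forward in the same"

score = letter_agreement(decrypted_start, original_start)
print(f"{score:.2f}")
```

The same function could be applied to the full `message` and its decryption to get a score over the whole text.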

# Conclusions

Charlie has been able to correctly guess 13 out of 26 characters of the key with this very simple attack! The main takeaway from this exercise is that one can extract information simply by looking at the ciphertext. Can we construct a perfectly secure cipher so that the ciphertext carries no information about the original message? This is what we are going to see in the next section.