## Mono-alphabetic substitution cipher

We've seen that the shift cipher is easy to break because it has only 26 possible keys. A better approach is to substitute each letter of the alphabet with another letter, chosen by a random permutation. This gives $26!$ possible keys (about $4 \times 10^{26}$), so an exhaustive search is no longer feasible: it would take far too long on any computer.
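
To get a feel for the size of this key space, the sketch below computes $26!$ exactly and estimates how long an exhaustive search would take. The rate of $10^9$ keys per second is an assumption chosen only for illustration, not a measurement.

```python
import math

# Number of possible keys: every permutation of the 26 letters
n_keys = math.factorial(26)

# Assumed (hypothetical) attacker speed: one billion keys tested per second
keys_per_second = 10**9
seconds_per_year = 60 * 60 * 24 * 365

years = n_keys / (keys_per_second * seconds_per_year)
print("Number of keys: {:.2e}".format(n_keys))
print("Years to try them all: {:.2e}".format(years))
```

Under that assumption the search would take on the order of $10^{10}$ years.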

Let's see an example of this encryption scheme:

In [1]:
import string
from copy import deepcopy
from random import randint, seed

seed(1) #fix seed so that we can reproduce the results
characters = string.ascii_lowercase

def random_permutation(characters):
    old_chars = list(deepcopy(characters))
    permut_ = []
    
    while len(old_chars)>0:
        elem = old_chars.pop(randint(0,len(old_chars)-1))
        permut_.append(elem)
    return ''.join(permut_)
    
print("Plaintext characters are: \n\t{}".format(characters))
print("Equivalent in one random mono-alphabetic order: \n\t{}".format(random_permutation(characters)))

Plaintext characters are: 
	abcdefghijklmnopqrstuvwxyz
Equivalent in one random mono-alphabetic order: 
	etckfuswqjgnaopybxmvhzdlir


Alice and Bob just have to meet once and exchange the key, which now has length 26 (the random substitution). In this particular case, $a$ in the plaintext is replaced by $e$ in the ciphertext, $b$ by $t$, and so on, according to the output above. Let's write the functions to encrypt and decrypt:

In [2]:
# The random key generator is simply the permutation of the original characters
key_generator = lambda c: random_permutation(c)

def mono_encrypt(plaintext, characters, k):
    convert_dict = {}
    for p, c in zip(characters, k):
        convert_dict[p] = c    
    convert_dict[' '] = ' '
    
    c = ''
    for p in plaintext:
        c += convert_dict[p]
        
    return c


def mono_decrypt(ciphertext, characters, k):
    convert_dict = {}
    for p, c in zip(characters, k):
        convert_dict[c] = p    
    convert_dict[' '] = ' '
    
    c = ''
    for p in  ciphertext:
        c += convert_dict[p]
        
    return c
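
As an alternative sketch (not the implementation used in this section), the same substitution can be written with Python's built-in `str.maketrans` and `str.translate`. The names `translate_encrypt`, `translate_decrypt` and `example_key` are hypothetical and introduced only for this example; the key is the permutation printed earlier.

```python
import string

def translate_encrypt(plaintext, characters, k):
    # Table mapping each plaintext letter to its key letter;
    # characters missing from the table (e.g. spaces) pass through unchanged
    table = str.maketrans(characters, k)
    return plaintext.translate(table)

def translate_decrypt(ciphertext, characters, k):
    # Inverse table: key letter back to plaintext letter
    table = str.maketrans(k, characters)
    return ciphertext.translate(table)

characters = string.ascii_lowercase
example_key = 'etckfuswqjgnaopybxmvhzdlir'  # permutation printed earlier in this section
c = translate_encrypt('attack at dawn', characters, example_key)
print(c)
print(translate_decrypt(c, characters, example_key))
```

One difference in behavior: characters outside the alphabet are left untouched here, whereas the dictionary lookup in `mono_encrypt` raises a `KeyError` for anything other than a lowercase letter or a space.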

Now let's encrypt and decrypt a message using a freshly generated key:

In [4]:
k = key_generator(characters)
sentence = 'it was a bright cold day in april and the clocks were striking thirteen winston smith his chin nuzzled into his breast in an effort to escape the vile wind slipped quickly through the glass doors of victory mansions though not quickly enough to prevent a swirl of gritty dust from entering along with him'

ciphertext = mono_encrypt(sentence, characters, k)
plaintext = mono_decrypt(ciphertext, characters, k)
print("The key is: {}\n\n".format(k))
print("THE SENTENCE:\n\n{}\n\nCIPHERTEXT:\n\n{}\n\nPLAINTEXT:\n\n{}".format(sentence, ciphertext, plaintext))

The key is: xkzyrotglmwujhqapesfvcindb


THE SENTENCE:

it was a bright cold day in april and the clocks were striking thirteen winston smith his chin nuzzled into his breast in an effort to escape the vile wind slipped quickly through the glass doors of victory mansions though not quickly enough to prevent a swirl of gritty dust from entering along with him

CIPHERTEXT:

lf ixs x keltgf zquy yxd lh xaelu xhy fgr zuqzws irer sfelwlht fglefrrh ilhsfqh sjlfg gls zglh hvbbury lhfq gls kerxsf lh xh rooqef fq rszxar fgr clur ilhy sulaary pvlzwud fgeqvtg fgr tuxss yqqes qo clzfqed jxhslqhs fgqvtg hqf pvlzwud rhqvtg fq aercrhf x sileu qo telffd yvsf oeqj rhfrelht xuqht ilfg glj

PLAINTEXT:

it was a bright cold day in april and the clocks were striking thirteen winston smith his chin nuzzled into his breast in an effort to escape the vile wind slipped quickly through the glass doors of victory mansions though not quickly enough to prevent a swirl of gritty dust from entering along with him


This seems to work well, but hold your horses... There is a plausible attack we can carry out. Suppose the attacker knows the language in which Alice and Bob are communicating. That alone gives him a lot of information: he knows how frequently each letter occurs in that language, and a mono-alphabetic substitution does not hide those frequencies, it only relabels the letters. Let's load George Orwell's book to estimate the letter probabilities of the English language:

In [5]:
from utils import download_data, process_load_textfile
import string
import os

url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
filename = 'Nineteen-eighty-four_Orwell.txt'
download_path = '/'.join(os.getcwd().split('/')[:-1]) + '/data/'

#download data to specified path
download_data(url, filename, download_path)
#load data and process
data = process_load_textfile(filename, download_path)#.replace(" ","")

In [6]:
print("The length of the book is {} characters".format(len(data)))

The length of the book is 569427 characters


In [7]:
# show the first 1000 characters to see what the text looks like
data[:1000]

'  table width border  tr td bgcolorffeefont color sizep styletextaligncenterba hrefhttpgutenbergnetau targetblankproject gutenberg australiaabr bfontfont color sizeia treasuretrove of literatureibr fonttreasure found hidden with no evidence of ownershipptd tr table  ad goes here  pre    title nineteen eightyfour author george orwell pseudonym of eric blair   a project gutenberg of australia ebook  ebook no  txt language   english date first posted august  date most recently updated november   project gutenberg of australia ebooks are created from printed editions which are in the public domain in australia unless a copyright notice is included we do not keep any ebooks in compliance with a particular paper edition  copyright laws are changing all over the world be sure to check the copyright laws for your country before downloading or redistributing this file  this ebook is made available at no cost and with almost no restrictions whatsoever you may copy it give it away or reuse it un

We assume that English letters occur with the same distribution as in Orwell's book, so we count the letters as follows:

In [24]:
def count_char_freqs(text, characters = string.ascii_lowercase):
    freqs = {}
    for letter in characters:
        f = text.count(letter)
        freqs[letter] = f
    return freqs

english_frequencies = count_char_freqs(data)
print(english_frequencies)

{'a': 36548, 'b': 7668, 'c': 11642, 'd': 19033, 'e': 59667, 'f': 10203, 'g': 9298, 'h': 29178, 'i': 31969, 'j': 464, 'k': 3612, 'l': 18673, 'm': 10830, 'n': 32004, 'o': 35073, 'p': 8627, 'q': 409, 'r': 26158, 's': 28987, 't': 43918, 'u': 13047, 'v': 4315, 'w': 12247, 'x': 793, 'y': 9425, 'z': 308}
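
The raw counts are easier to compare once they are normalized. The following sketch, which simply copies the counts printed above, converts them to relative frequencies and ranks the letters. The names `relative` and `ranked` are introduced here for illustration.

```python
# Counts copied from the output above
english_frequencies = {
    'a': 36548, 'b': 7668, 'c': 11642, 'd': 19033, 'e': 59667, 'f': 10203,
    'g': 9298, 'h': 29178, 'i': 31969, 'j': 464, 'k': 3612, 'l': 18673,
    'm': 10830, 'n': 32004, 'o': 35073, 'p': 8627, 'q': 409, 'r': 26158,
    's': 28987, 't': 43918, 'u': 13047, 'v': 4315, 'w': 12247, 'x': 793,
    'y': 9425, 'z': 308,
}

# Relative frequency of each letter (fractions that sum to 1)
total = sum(english_frequencies.values())
relative = {letter: count / total for letter, count in english_frequencies.items()}

# Letters ordered from most to least frequent
ranked = sorted(relative, key=relative.get, reverse=True)
print("Most common letters: {}".format(''.join(ranked[:6])))
for letter in ranked[:6]:
    print("{}: {:.3f}".format(letter, relative[letter]))
```

Note that some neighbors in this ranking are very close (for example `n` with 32004 and `i` with 31969), so a short sample can easily rank them in the opposite order.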


Let's take a random sample of 0.05 times the size of the original text (5%) and encrypt it. Then we will try to infer some information just by looking at the ciphertext letter frequencies and knowing the English letter distribution.

In [35]:
n = round(len(data)*0.05)
i = randint(0, len(data)-1)
sampled_data = data[i:i+n]
encrypted_sampled_data = mono_encrypt(sampled_data, characters, k)

print("We sample a chunk of {} characters from the book starting at position {}".format(n, i))
print("Using the private key k = {}\n\n".format(k))
print("Sampled Plaintext:")
print(sampled_data)

We sample a chunk of 28471 characters from the book starting at position 113174
Using the private key k = xkzyrotglmwujhqapesfvcindb


Sampled Plaintext:
uch as five minutes and it was possible that his features had not been perfectly under control it was terribly dangerous to let your thoughts wander when you were in any public place or within range of a telescreen the smallest thing could give you away a nervous tic an unconscious look of anxiety a habit of muttering to yourselfanything that carried with it the suggestion of abnormality of having something to hide in any case to wear an improper expression on your face to look incredulous when a victory was announced for example was itself a punishable offence there was even a word for it in newspeak facecrime it was called  the girl had turned her back on him again perhaps after all she was not really following him about perhaps it was coincidence that she had sat so close to him two days running his cigarette had gone out and he la

In [37]:
ciphertext_frequencies = count_char_freqs(encrypted_sampled_data)

In [38]:
def find_key_attack(ciphertext_frequencies, english_frequencies):
    """Takes two frequency dictionaries on letters and outputs a plausible
    key
    inputs like: {'a': 36548, 'b': 7668, 'c': 11642 ...
    outputs a key
    """
    cf = sorted(ciphertext_frequencies.items(), key=lambda item: item[1])
    ef = sorted(english_frequencies.items(), key=lambda item: item[1])
    
    # map each English letter to the ciphertext letter with the same frequency rank
    mapping = {}
    for e, c in zip(ef, cf):
        mapping[e[0]] = c[0]
    
    m = ''
    for letter in string.ascii_lowercase:
        m += mapping[letter]
        
    return m


inferred_key = find_key_attack(ciphertext_frequencies, english_frequencies)

print("The original key is: \n\t{}".format(k))
print("The inferred key: \n\t{}".format(inferred_key))

The original key is: 
	xkzyrotglmwujhqapesfvcindb
The inferred key: 
	xkzurodghmwyjlqapesfvcintb


In [40]:
count = 0
for a, b in zip(k, inferred_key):
    if a == b:
        count += 1
        
print("We have correctly guessed {} out of {} letters of the key".format(count, len(characters)))

We have correctly guessed 20 out of 26 letters of the key
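
It is instructive to see which letters were guessed wrongly. The sketch below copies the two keys from the output above and lists the plaintext letters whose substitutes differ. The variable name `wrong` is introduced here for illustration.

```python
import string

# Keys copied from the output above
k = 'xkzyrotglmwujhqapesfvcindb'
inferred_key = 'xkzurodghmwyjlqapesfvcintb'

# Plaintext letters whose ciphertext substitute was guessed incorrectly
wrong = [p for p, a, b in zip(string.ascii_lowercase, k, inferred_key) if a != b]
print(wrong)  # ['d', 'g', 'i', 'l', 'n', 'y']
```

The errors come in pairs of letters with nearly equal counts in the book: `d`/`l` (19033 vs 18673), `g`/`y` (9298 vs 9425) and `i`/`n` (31969 vs 32004). In a sample of this size their ranks are swapped, and the rank-matching attack swaps their substitutes accordingly.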


We see that most of the key matches. Let's try to decrypt the message using the inferred key:

In [41]:
mono_decrypt(encrypted_sampled_data, characters, inferred_key)

'uch as fnve mniutes ail nt was possnbde that hns features hal iot beei perfectdg uiler coitrod nt was terrnbdg laiyerous to det gour thouyhts wailer whei gou were ni aig pubdnc pdace or wnthni raiye of a tedescreei the smaddest thniy coudl ynve gou awag a iervous tnc ai uicoiscnous dook of aixnetg a habnt of mutterniy to goursedfaigthniy that carrnel wnth nt the suyyestnoi of abiormadntg of havniy somethniy to hnle ni aig case to wear ai nmproper expressnoi oi gour face to dook nicreludous whei a vnctorg was aiiouicel for exampde was ntsedf a puinshabde offeice there was evei a worl for nt ni iewspeak facecrnme nt was caddel  the ynrd hal turiel her back oi hnm ayani perhaps after add she was iot readdg foddowniy hnm about perhaps nt was conicnleice that she hal sat so cdose to hnm two lags ruiiniy hns cnyarette hal yoie out ail he danl nt carefuddg oi the elye of the tabde he woudl fninsh smokniy nt after work nf he coudl keep the tobacco ni nt qunte dnkedg the persoi at the iext tab

We can see that many words are easily readable to the human eye. In fact, we have recovered quite a lot of information just by looking at the ciphertext, and this is dangerous! If the attacker keeps gathering encrypted messages between Alice and Bob, he will collect better statistics and eventually be able to find the key.
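
The attacker does not even need more ciphertext here: the almost-readable words already suggest which letters are confused. The following is a minimal sketch of that manual refinement step, assuming the attacker guesses the swaps by eye. The helper `swap_key_letters` and the variable `refined_key` are hypothetical names introduced for this example; the original key is used only to check the result.

```python
import string

def swap_key_letters(key, p1, p2, characters=string.ascii_lowercase):
    """Return a copy of key in which the substitutes of the plaintext
    letters p1 and p2 are exchanged."""
    key = list(key)
    i, j = characters.index(p1), characters.index(p2)
    key[i], key[j] = key[j], key[i]
    return ''.join(key)

inferred_key = 'xkzurodghmwyjlqapesfvcintb'  # inferred key from the output above
refined_key = inferred_key

# Swaps guessed by reading the partially decrypted text, e.g.
# 'ail' -> 'and', 'perfectdg' -> 'perfectly', 'laiyerous' -> 'dangerous'
for p1, p2 in [('d', 'l'), ('g', 'y'), ('i', 'n')]:
    refined_key = swap_key_letters(refined_key, p1, p2)

# Check against the original key printed earlier
print(refined_key == 'xkzyrotglmwujhqapesfvcindb')
```

With these three swaps the refined key coincides with the original one, so frequency analysis plus a little human guesswork is enough to break the cipher for this sample.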