In [1]:
from solution import get_cipher_key, get_random_pairs, encode_output, get_model, get_alphabet

Using TensorFlow backend.


***Can we learn an AES cipher?***

I became curious one day whether you could learn a cipher from pairs of inputs and outputs. I thought to myself 'AES is a thing, right?' and decided to see if I could train a neural net to learn an AES cipher, given all the original and encrypted text ahead of time.

If you're interested, there is some additional code behind the notebook available in the repo.

I'll start off by generating some training data. Here we'll generate 250,000 examples of encrypted text and print a few of them. The generated text is just random characters. In reality, you don't stand a chance of learning an AES cipher with this little data, but here we go anyway.

In [2]:
# generate a key we can use to create a cipher and encode the text, and then print the key
key = get_cipher_key()
print('The key: ', key)

# get our training data
text, ciphertext = get_random_pairs(250000, 5, 10, key)

# this should just be the length we asked for
print('Generated {0} encrypted sequences.'.format(len(ciphertext)))

# print 3 examples
for i in range(1000,1003):
    print('Example of origin and encrypted text:')
    print('\t{0}'.format(text[i]))
    print('\t{0}'.format(ciphertext[i]))

The key:  b'n\xea\xb6\xcb\x13\x1c\x17,n\x15\xa2g\xe4\x9f\xeb}'
Generated 250000 encrypted sequences.
Example of origin and encrypted text:
	^Wn#SrSLs
	\xb5\xc1A\x8c\xa1{ B\xd
Example of origin and encrypted text:
	M hwgeZ@e
	\x8cV>sjE)\x1e
Example of origin and encrypted text:
	JSgW@Av
	M\xdbW\xbf\x0c\xab\x9


***Create the training data***

Now we need to compute a few things (the alphabets and the maximum sequence lengths) so we can encode the text for the model, and then actually encode it. The input will be integer-encoded because I'm going to try an embedding layer, and the output will be one-hot encoded so the model predicts one character per position in the output sequence. The model will be a basic seq2seq LSTM built in Keras.
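
As a rough illustration of the two encodings, here is a minimal pure-Python sketch. The toy alphabet and the helper names `int_encode` and `one_hot` are assumptions made for illustration; the notebook's own `encode_output` lives in the repo and may differ. Index 0 is reserved for padding, as in the cell below.

```python
# Toy alphabet (hypothetical); the notebook gets its real one from get_alphabet().
toy_alphabet = ['a', 'b', 'c']

# Reserve index 0 for padding, so real characters start at 1.
char_to_idx = {char: i + 1 for i, char in enumerate(toy_alphabet)}

def int_encode(line):
    """Map each character to its integer index (input for an embedding layer)."""
    return [char_to_idx[char] for char in line]

def one_hot(indices, num_classes):
    """Turn each index into a vector with a single 1 (target for a softmax output)."""
    return [[1 if k == idx else 0 for k in range(num_classes)] for idx in indices]

encoded = int_encode('cab')                          # [3, 1, 2]
targets = one_hot(encoded, len(toy_alphabet) + 1)    # 3 rows, each of width 4
```

The extra class in `len(toy_alphabet) + 1` is the padding slot, which is why the real alphabet sizes also get a `+1`.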

In [3]:
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences

# turn the lists into numpy arrays
df_text = pd.DataFrame(text, columns=['text'])
np_text = df_text.to_numpy()
df_ciphertext = pd.DataFrame(ciphertext, columns=['ciphertext'])
np_ciphertext = df_ciphertext.to_numpy()

# determine what unique characters are present in the ciphertext
ct_alphabet = set()
for line in ciphertext:
    ct_alphabet.update(line)

# determine the maximum length of any sequence in the text and ciphertext
max_ct_len = max([len(line) for line in ciphertext])
max_input_len = max([len(line) for line in text])

# create char<->index and index<->char dictionaries
alphabet = get_alphabet()
ctalph_to_idx = { char: i+1 for i, char in enumerate(ct_alphabet) }
idx_to_ctalph = { ctalph_to_idx[key]: key for key in ctalph_to_idx.keys() }
alph_to_idx = { char: i+1 for i, char in enumerate(alphabet) }
idx_to_alph = { alph_to_idx[key]: key for key in alph_to_idx.keys() }

# int encode all the text and ciphertext; the results stay plain Python lists
# because the sequences have different lengths (pad_sequences accepts ragged lists,
# while np.asarray on ragged rows fails on recent NumPy versions)
np_text = [[alph_to_idx[char] for char in line[0]] for line in np_text]
np_ciphertext = [[ctalph_to_idx[char] for char in line[0]] for line in np_ciphertext]

# ensure that all sequences are the same length by applying zero padding to increase the length of shorter sequences
np_text = pad_sequences(np_text, maxlen=max_input_len, padding='pre')
np_ciphertext = pad_sequences(np_ciphertext, maxlen=max_ct_len, padding='pre')

# determine how many unique characters exist in the input and output sequences
alphabet_len = len(alphabet) +1 # +1 to accommodate the padding char which was not in the original alphabet
ct_alphabet_len = len(ct_alphabet) +1 # +1 here too
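
For readers unfamiliar with `pad_sequences`, the pure-Python sketch below mimics what `padding='pre'` does when `maxlen` equals the longest sequence, so nothing gets truncated. The function name `pad_pre` is hypothetical, and this is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` until it has length `maxlen`.

    Assumes no sequence is longer than `maxlen`, which holds in the notebook
    because maxlen is computed as the maximum sequence length.
    """
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

toy = [[5, 2], [7, 1, 4, 9], [3]]
padded = pad_pre(toy, maxlen=max(len(seq) for seq in toy))
# padded == [[0, 0, 5, 2], [7, 1, 4, 9], [0, 0, 0, 3]]
```

Because 0 is the padding value, the char-to-index dictionaries above start at 1.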

***Prepare to train the model***

Ok now the fun part (jk I love it all) where we try to train the model. Here we'll finalize the training data and create the LSTM model.

In [4]:
# use the traditional X, y variable names
y = encode_output(np_text, alphabet_len, alph_to_idx)
X = np_ciphertext

# get the lengths of the input and output sequences
input_seq_len = X.shape[1]
output_seq_len = y.shape[1]

# get an instance of the model and use a small embedding dim
embedding_dim = 3
model = get_model(alphabet_len, ct_alphabet_len, input_seq_len, output_seq_len, embedding_dim)

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 35)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 35, 3)             288       
_________________________________________________________________
bidirectional_1 (Bidirection (None, 512)               532480    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 9, 512)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 9, 256)            787456    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 9, 78)             20046     
Total params: 1,340,270
Trainable params: 1,340,270
Non-trainable params: 0
_________________________________________________
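
The parameter counts in the summary can be reproduced by hand, which also hints at the layer sizes behind `get_model`. The sketch below assumes the standard Keras formulas (an LSTM layer has 4 * (units * (input_dim + units) + units) weights) and infers a ciphertext vocabulary of 96 and 256 LSTM units from the summary itself, not from the repo code.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weights of a fully connected layer: kernel plus bias."""
    return input_dim * units + units

embedding = 96 * 3                   # 96 ciphertext symbols (incl. padding) x embedding_dim 3
encoder = 2 * lstm_params(3, 256)    # bidirectional: two 256-unit LSTMs, 512 outputs
decoder = lstm_params(512, 256)      # decoder LSTM reads the repeated 512-dim encoding
output = dense_params(256, 78)       # per-timestep layer over 78 plaintext classes

total = embedding + encoder + decoder + output
print(embedding, encoder, decoder, output, total)
# 288 532480 787456 20046 1340270
```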

And then see how it goes...

In [5]:
epochs=50
batch_size = 512
model.fit(X, y, epochs=epochs, batch_size=batch_size, validation_split=0.2)

Train on 200000 samples, validate on 50000 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.callbacks.History at 0x1cbb667cf60>

**Hmm.. so that didn't work**

Ok so that did not work (as expected, actually). AES does (more or less) secure the whole internet, so thankfully we cannot use a simple seq2seq model to learn the AES cipher when we don't know the key. There is a little bit of accuracy here (22-23%), but that comes only from correctly guessing the padding char, not from actually decoding the sequence.
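
A back-of-the-envelope check supports the padding explanation. The sketch assumes plaintext lengths are uniform over 5 to 9 characters (reading the `5, 10` arguments as an exclusive upper bound, which fits the model's output length of 9) and that the 78 output classes are 77 characters plus padding. Neither assumption is confirmed by the repo code.

```python
from fractions import Fraction

output_len = 9            # decoder timesteps, from the model summary
lengths = range(5, 10)    # assumed plaintext lengths: 5..9, equally likely
num_chars = 77            # assumed: 78 output classes minus the padding class

# Average share of the 9 output positions that are padding.
pad_share = sum(Fraction(output_len - n, output_len) for n in lengths) / len(lengths)

# Accuracy if padding is always right and real characters are pure guesses.
baseline = pad_share + (1 - pad_share) * Fraction(1, num_chars)

print(f'padding share: {float(pad_share):.3f}')   # 0.222
print(f'baseline:      {float(baseline):.3f}')    # 0.232
```

Under those assumptions, getting the padding right and guessing everything else blindly already lands in the 22-23% range.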


**Round 2**

Now I'll try including the key with the encrypted text as input to the model, to see if it can learn an AES cipher when the decryption key is known ahead of time. I'll use the same seq2seq LSTM model as before; we just need to re-create the training data.

In [6]:
# the only difference is we're setting use_key=True when creating the training data
text, ciphertext = get_random_pairs(250000, 5, 10, key, use_key=True)

# turn the lists into numpy arrays
df_text = pd.DataFrame(text, columns=['text'])
np_text = df_text.to_numpy()
df_ciphertext = pd.DataFrame(ciphertext, columns=['ciphertext'])
np_ciphertext = df_ciphertext.to_numpy()

# determine what unique characters are present in the ciphertext
ct_alphabet = set()
for line in ciphertext:
    ct_alphabet.update(line)

# determine the maximum length of any sequence in the text and ciphertext
max_ct_len = max([len(line) for line in ciphertext])
max_input_len = max([len(line) for line in text])

# create char<->index and index<->char dictionaries
alphabet = get_alphabet()
ctalph_to_idx = { char: i+1 for i, char in enumerate(ct_alphabet) }
idx_to_ctalph = { ctalph_to_idx[key]: key for key in ctalph_to_idx.keys() }
alph_to_idx = { char: i+1 for i, char in enumerate(alphabet) }
idx_to_alph = { alph_to_idx[key]: key for key in alph_to_idx.keys() }

# int encode all the text and ciphertext; the results stay plain Python lists
# because the sequences have different lengths (pad_sequences accepts ragged lists,
# while np.asarray on ragged rows fails on recent NumPy versions)
np_text = [[alph_to_idx[char] for char in line[0]] for line in np_text]
np_ciphertext = [[ctalph_to_idx[char] for char in line[0]] for line in np_ciphertext]

# ensure that all sequences are the same length by applying zero padding to increase the length of shorter sequences
np_text = pad_sequences(np_text, maxlen=max_input_len, padding='pre')
np_ciphertext = pad_sequences(np_ciphertext, maxlen=max_ct_len, padding='pre')

# determine how many unique characters exist in the input and output sequences
alphabet_len = len(alphabet) +1 # +1 to accommodate the padding char which was not in the original alphabet
ct_alphabet_len = len(ct_alphabet) +1 # +1 here too

# use the traditional X, y variable names
y = encode_output(np_text, alphabet_len, alph_to_idx)
X = np_ciphertext

# get the lengths of the input and output sequences
input_seq_len = X.shape[1]
output_seq_len = y.shape[1]

# get an instance of the model and use a small embedding dim
embedding_dim = 3
model = get_model(alphabet_len, ct_alphabet_len, input_seq_len, output_seq_len, embedding_dim)

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 83)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 83, 3)             288       
_________________________________________________________________
bidirectional_2 (Bidirection (None, 512)               532480    
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 9, 512)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 9, 256)            787456    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 9, 78)             20046     
Total params: 1,340,270
Trainable params: 1,340,270
Non-trainable params: 0
_________________________________________________

**Peek at the training data**

If we remove the zero padding from each string and print it, we can see that each training example starts with the same sequence, which is the key to the cipher.

[94, 51, 7, 33, 27, 51, 7, 16, 38, 51...]

In [7]:
for i, row in enumerate(X[:3, :]):
    print(f'\nRow {i}: ', list(filter(lambda char: char != 0, row)))


Row 0:  [94, 51, 7, 33, 27, 51, 7, 16, 38, 51, 7, 50, 16, 51, 7, 60, 47, 51, 7, 60, 50, 51, 7, 60, 85, 78, 94, 51, 7, 60, 88, 51, 7, 27, 67, 9, 51, 7, 33, 17, 51, 7, 77, 26, 51, 7, 33, 16, 28, 19, 51, 7, 60, 15, 40, 51, 7, 27]

Row 1:  [94, 51, 7, 33, 27, 51, 7, 16, 38, 51, 7, 50, 16, 51, 7, 60, 47, 51, 7, 60, 50, 51, 7, 60, 85, 78, 94, 51, 7, 60, 88, 51, 7, 27, 67, 9, 51, 7, 33, 17, 51, 7, 77, 26, 51, 7, 33, 16, 51, 7, 60, 60, 9, 14, 51, 7, 93, 38, 63, 51, 7, 3]

Row 2:  [94, 51, 7, 33, 27, 51, 7, 16, 38, 51, 7, 50, 16, 51, 7, 60, 47, 51, 7, 60, 50, 51, 7, 60, 85, 78, 94, 51, 7, 60, 88, 51, 7, 27, 67, 9, 51, 7, 33, 17, 51, 7, 77, 26, 51, 7, 33, 16, 51, 7, 85, 26, 51, 7, 27, 16, 51, 7, 77, 85, 87, 51, 7, 16, 85, 51, 7, 33, 17, 51, 7, 33, 27, 51, 7, 3]
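
One way to read these rows, assuming the helper stores each byte string in its escaped printable form (the way the ciphertext was printed earlier) and tokenizes it character by character: the recurring pair `51, 7` would then be the `\` and `x` of every hex escape. The sketch checks that reading against the first 48 tokens of Row 0 and the key printed in the first code cell; the variable names are illustrative.

```python
key = b'n\xea\xb6\xcb\x13\x1c\x17,n\x15\xa2g\xe4\x9f\xeb}'   # key printed earlier

# Escaped text form of the key, without the b'...' wrapper.
escaped = repr(key)[2:-1]

# First 48 tokens of Row 0 (the part shared by all three rows).
prefix = [94, 51, 7, 33, 27, 51, 7, 16, 38, 51, 7, 50, 16, 51, 7, 60, 47,
          51, 7, 60, 50, 51, 7, 60, 85, 78, 94, 51, 7, 60, 88, 51, 7, 27,
          67, 9, 51, 7, 33, 17, 51, 7, 77, 26, 51, 7, 33, 16]

# Pair each character with its token and verify the pairing is one-to-one.
char_to_token = {}
for char, token in zip(escaped, prefix):
    assert char_to_token.setdefault(char, token) == token, 'inconsistent mapping'
assert len(set(char_to_token.values())) == len(char_to_token)

print(char_to_token['\\'], char_to_token['x'])   # 51 7
```

The pairing is consistent for all 48 tokens. The escaped key is 49 characters long, so the shared prefix seems to stop one character short of it; whether the helper trims that last character is a guess from the output, not something checked against the repo.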


**Now train!**

In [8]:
epochs=50
batch_size = 512
model.fit(X, y, epochs=epochs, batch_size=batch_size, validation_split=0.2)

Train on 200000 samples, validate on 50000 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.callbacks.History at 0x1cc33bd3f28>

**This concludes the test**

Well, if you needed any convincing that you cannot learn to decipher AES with a small LSTM and few training examples, there you have it.  :)

Once again we're seeing just 22-23% accuracy, and even that is only due to correctly guessing the padding character. No meaningful predictions were possible. 
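
To keep padding from inflating the score, accuracy can be computed over non-padding positions only. A minimal pure-Python sketch, assuming padding is index 0 and predictions have already been reduced to one class index per position; `masked_accuracy` is a hypothetical helper, not part of the repo.

```python
def masked_accuracy(y_true, y_pred, pad_idx=0):
    """Share of correct predictions, ignoring positions whose true label is padding."""
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for true_idx, pred_idx in zip(true_row, pred_row):
            if true_idx == pad_idx:
                continue          # padding positions do not count either way
            total += 1
            correct += int(true_idx == pred_idx)
    return correct / total if total else 0.0

y_true = [[0, 0, 4, 2], [0, 7, 1, 3]]
y_pred = [[0, 0, 4, 9], [0, 5, 5, 5]]
print(masked_accuracy(y_true, y_pred))   # 0.2
```

In this toy case plain accuracy would be 4 of 8 positions, while the masked version counts only the one real character that was predicted correctly out of five.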

Check the other notebook in this repo to see a simple cipher that can be learned with an LSTM!

**Thanks for following along!**