**Can we learn a simple cipher?**

In the AES Cipher notebook we saw that a small LSTM trained on a few samples could not learn an AES cipher. In this notebook we'll run the same test with a very simple cipher that uses a key and replacement. Fair warning: I know nothing about cryptography, so treat this as a toy example.

In [1]:
import numpy as np
from solution2 import get_random_pairs, encode, decode, get_model, get_alphabet, encode_output
from simple_cipher import AotWCipher

Using TensorFlow backend.


**Create the training data**

Here the cipher object is created and used to generate training data. The model input will be int encoded so I can try an embedding layer, and the output will be one-hot encoded. At this step we only have the raw text and the cipher-encoded text, and a few examples are printed.

In [2]:
key = 11
print('The key: ', key)

# create the cipher object, this is my toy example of a cipher
cipher = AotWCipher(key)

# get our training data
text, ciphertext = get_random_pairs(cipher, 250000, 5, 10, key)

# this should just be the length we asked for
print('Generated {0} encrypted sequences.'.format(len(ciphertext)))

# print 3 examples
for i in range(1000,1003):
    print('Example of original and encrypted text:')
    print('\t{0}'.format(text[i]))
    print('\t{0}'.format(ciphertext[i]))

The key:  11
Generated 250000 encrypted sequences.
Example of original and encrypted text:
	oD:W3+i
	[ 32  11  95  89  80 118  56]
Example of original and encrypted text:
	ja^B+7))s
	[ 51  59  63 109 128  77  94 150  73]
Example of original and encrypted text:
	$9agC
	[13 29 83 67 77]


**Let's test the cipher object**

Since this isn't a standard cipher, let's demonstrate what it does. Basically, it int encodes your text and can decode it again. I'll encode a few sample strings and check that decoding recovers the original message.

In [3]:
# create a few test strings
sample_1 = 'hi there'
sample_2 = 'metallica'
sample_3 = 'easter bunny'
# encode the strings with the cipher
encoded_1 = encode(cipher, sample_1)
encoded_2 = encode(cipher, sample_2)
encoded_3 = encode(cipher, sample_3)
# print what the original and encoded strings look like
print(f'Text: {sample_1}   Became: {encoded_1}')
print(f'Text: {sample_2}   Became: {encoded_2}')
print(f'Text: {sample_3}   Became: {encoded_3}\n')
# now decode the encoded strings
decoded_1 = ''.join(decode(cipher, encoded_1))
decoded_2 = ''.join(decode(cipher, encoded_2))
decoded_3 = ''.join(decode(cipher, encoded_3))
# and print what the transformation was
print(f'Text: {encoded_1}   Became: {decoded_1}')
print(f'Text: {encoded_2}   Became: {decoded_2}')
print(f'Text: {encoded_3}   Became: {decoded_3}')

Text: hi there   Became: [ 80  41  62  36 137  94  79 113]
Text: metallica   Became: [ 36  42  29  90 135 125  56  87  65]
Text: easter bunny   Became: [ 47  59  91  36 104 116  53 140  63  75  80  99]

Text: [ 80  41  62  36 137  94  79 113]   Became: hi there
Text: [ 36  42  29  90 135 125  56  87  65]   Became: metallica
Text: [ 47  59  91  36 104 116  53 140  63  75  80  99]   Became: easter bunny
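
The source of `AotWCipher` isn't shown in this notebook, so the following is only a minimal sketch of the general idea: int encode each character, then mix in the key. The class name `ToyShiftCipher`, its alphabet, and its position-dependent shift are all assumptions made for illustration and are not the actual implementation. The one thing it borrows from the output above is that the same character can map to different integers at different positions (the two l's in "metallica" became 135 and 125).

```python
import string

import numpy as np


class ToyShiftCipher:
    """Hypothetical stand-in for a keyed int-encoding cipher (NOT AotWCipher)."""

    def __init__(self, key):
        self.key = key
        # assumed alphabet: letters, digits and a space
        self.alphabet = string.ascii_letters + string.digits + ' '
        self.char_to_int = {c: i for i, c in enumerate(self.alphabet)}

    def encode(self, text):
        # int encode each char, then add a shift that depends on key and position
        return np.array([self.char_to_int[c] + self.key * (pos + 1)
                         for pos, c in enumerate(text)])

    def decode(self, codes):
        # undo the shift, then look the char back up in the alphabet
        return ''.join(self.alphabet[int(code) - self.key * (pos + 1)]
                       for pos, code in enumerate(codes))


toy = ToyShiftCipher(key=11)
codes = toy.encode('hi there')
print(codes)
print(toy.decode(codes))
```

Because the shift depends on the position, a model has to learn more than a fixed lookup table, which is roughly what makes even a toy like this a reasonable seq2seq exercise.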


**Ok, now to carry on generating the training data**

In [4]:
from keras.preprocessing.sequence import pad_sequences
alphabet = get_alphabet()
# determine what the unique chars are in the ciphertext
ct_alphabet = set()
for line in ciphertext:
    ct_alphabet.update(line)

# figure out the length of the longest sequences
max_ct_len = max([len(line) for line in ciphertext])
max_input_len = max([len(line) for line in text])

# int encode with a lookup dict - reserve zero to be used as the padding
ctalph_to_idx = { char: i+1 for i, char in enumerate(ct_alphabet) }
idx_to_ctalph = { idx: char for char, idx in ctalph_to_idx.items() }

alph_to_idx = { char: i+1 for i, char in enumerate(alphabet) }
idx_to_alph = { idx: char for char, idx in alph_to_idx.items() }

# int encode all the input chars
encoded_text_lines = []
for i, line in enumerate(text):
    new_line = np.zeros((len(line), ))
    for j, char in enumerate(line):
        new_line[j] = alph_to_idx[char]
    encoded_text_lines.append(new_line)

# apply zero padding to the input sequences 
np_text = pad_sequences(encoded_text_lines, maxlen=max_input_len, padding='pre')

# pad the ciphertext sequences as well
np_ciphertext = pad_sequences(ciphertext, maxlen=max_input_len, padding='pre')

# determine the number of characters used in our input and output sequences
alphabet_len = len(alphabet) +1 # +1 to accommodate the padding char
ct_alphabet_len = len(ct_alphabet) +1 # +1 here too

# now one-hot-encode the target, it's been int encoded until now
y = encode_output(np_text, alphabet_len, alph_to_idx)
X = np_ciphertext

input_seq_len = X.shape[1]
output_seq_len = y.shape[1]
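
To make the three preparation steps above concrete (int encoding, zero padding on the left, one-hot encoding the target), here is a small NumPy-only sketch on a made-up three-letter alphabet. The helpers `int_encode`, `pad_pre` and `one_hot` are hypothetical names for this illustration; they only approximate what `pad_sequences(..., padding='pre')` and `encode_output` are used for here, and the real `encode_output` from `solution2` may work differently.

```python
import numpy as np

# hypothetical toy alphabet; index 0 is reserved for padding
toy_alphabet = ['a', 'b', 'c']
char_to_idx = {ch: i + 1 for i, ch in enumerate(toy_alphabet)}


def int_encode(line):
    # map each char to its (1-based) integer index
    return [char_to_idx[ch] for ch in line]


def pad_pre(seqs, maxlen):
    # left-pad with zeros; assumes no sequence is longer than maxlen
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        if len(seq) > 0:
            out[row, -len(seq):] = seq
    return out


def one_hot(padded, num_classes):
    # (batch, time_steps) -> (batch, time_steps, num_classes)
    return np.eye(num_classes, dtype=int)[padded]


lines = ['ab', 'cab']
padded = pad_pre([int_encode(line) for line in lines], maxlen=3)
y_toy = one_hot(padded, num_classes=len(toy_alphabet) + 1)
print(padded)
print(y_toy.shape)
```

The `+ 1` on the class count plays the same role as in `alphabet_len` above: the padding index needs its own slot in the one-hot vector.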

**Train!**

With the data ready, we can train! This is a relatively small seq2seq model that starts with an embedding layer with a small embedding dimension (3 here). You could also one-hot encode the input and drop the embedding; it's just something I wanted to try.

In [9]:
model = get_model(alphabet_len, ct_alphabet_len, input_seq_len, output_seq_len, 3)
epochs = 65
batch_size = 512
model.fit(X, y, epochs=epochs, batch_size=batch_size, validation_split=0.2, verbose=2)
model.save('./decode_model.h5')

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 9)                 0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 9, 3)              447       
_________________________________________________________________
bidirectional_5 (Bidirection (None, 512)               532480    
_________________________________________________________________
repeat_vector_5 (RepeatVecto (None, 9, 512)            0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 9, 256)            787456    
_________________________________________________________________
time_distributed_5 (TimeDist (None, 9, 78)             20046     
Total params: 1,340,429
Trainable params: 1,340,429
Non-trainable params: 0
_________________________________________________

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 200000 samples, validate on 50000 samples
Epoch 1/65
 - 19s - loss: 3.4449 - accuracy: 0.2308 - val_loss: 3.3605 - val_accuracy: 0.2325
Epoch 2/65
 - 18s - loss: 3.2874 - accuracy: 0.2377 - val_loss: 3.2423 - val_accuracy: 0.2379
Epoch 3/65
 - 19s - loss: 3.2104 - accuracy: 0.2466 - val_loss: 3.1714 - val_accuracy: 0.2497
Epoch 4/65
 - 18s - loss: 3.1270 - accuracy: 0.2580 - val_loss: 3.0744 - val_accuracy: 0.2671
Epoch 5/65
 - 18s - loss: 3.0565 - accuracy: 0.2680 - val_loss: 3.0002 - val_accuracy: 0.2802
Epoch 6/65
 - 18s - loss: 2.9614 - accuracy: 0.2846 - val_loss: 2.9112 - val_accuracy: 0.2964
Epoch 7/65
 - 18s - loss: 2.8761 - accuracy: 0.2992 - val_loss: 3.0642 - val_accuracy: 0.2529
Epoch 8/65
 - 18s - loss: 2.7867 - accuracy: 0.3268 - val_loss: 2.7213 - val_accuracy: 0.3520
Epoch 9/65
 - 18s - loss: 2.6818 - accuracy: 0.3610 - val_loss: 2.6351 - val_accuracy: 0.3718
Epoch 10/65
 - 18s - loss: 2.6059 - accuracy: 0.3722 - val_loss: 2.5576 - val_accuracy: 0.3820
Epoch 11
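
The parameter counts in the model summary above can be sanity-checked with a little arithmetic. The code of `get_model` isn't shown, so the layer sizes below are inferred from the summary rather than read from the source: an input vocabulary of 149 (447 / 3), 256 units per direction in the bidirectional LSTM (512 / 2), and the standard LSTM parameterization with four gates. Treat it as a back-of-the-envelope check, not a description of the actual model code.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per input token
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # weights plus one bias per output unit
    return (input_dim + 1) * units


emb = embedding_params(149, 3)   # inferred: 149 input tokens incl. padding, dim 3
enc = 2 * lstm_params(3, 256)    # inferred: bidirectional = two LSTMs of 256 units
dec = lstm_params(512, 256)      # decoder LSTM fed the 512-wide repeated vector
out = dense_params(256, 78)      # softmax over 78 output classes per timestep

print(emb, enc, dec, out, emb + enc + dec + out)
```

Under those assumptions the four numbers match the `Param #` column, and their sum matches the reported total of 1,340,429.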

**Pretty cool, the LSTM had no problem learning the simple cipher**

Is it any surprise the LSTM could learn this simple cipher? Not really. I once read a blog article about a project that trained an LSTM to learn the Enigma cipher, so it's no surprise this simple cipher is learned easily. You can check out the Enigma article here if you're interested: https://greydanus.github.io/2017/01/07/enigma-rnn/

**Now as a test of the trained model**

Let's encode some text with the cipher and see if the neural net can decode it! We'll choose some sample text to encode and then decode it with the neural net.

In [10]:
from keras.models import load_model
# load up the trained model
model = load_model('./decode_model.h5')

# create the ciphertext we would like to decode
text = 'good day!'
ciphertext = encode(cipher, text)
print(f'"{text}" became {ciphertext}')

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


"good day!" became [ 41  27  51 108 100 129  74 119  78]


**Ok! This is getting exciting now. The text is now encrypted.**

Let's use the neural net to decrypt the sequence and see what happens! Before comparing cipher vs. neural net, let's get the prediction from the neural net.

In [11]:
# add a dimension by creating a list so padding works
ciphertext = [ciphertext]
# the source text is already the max length, so there is no padding actually applied
np_ciphertext = pad_sequences(ciphertext, maxlen=len(text), padding='pre')

prediction = model.predict(np_ciphertext)
# the shape of the prediction is (batch, time_steps, features)
print('Shape of the array containing the prediction\n', prediction.shape)

predicted_chars = np.argmax(prediction, axis=2)
print('\nThe decoded text is the index of the most probable character at each timestep\n',predicted_chars[0])

probabilities = np.take_along_axis(prediction, np.expand_dims(predicted_chars, axis=-1), axis=-1)
print('\nWe can check the probabilities of each predicted char\n', probabilities)

Shape of the array containing the prediction
 (1, 9, 78)

The decoded text is the index of the most probable character at each timestep
 [ 7 15 15  4 63  4  1 25 65]

We can check the probabilities of each predicted char
 [[[0.99757487]
  [0.993335  ]
  [0.9839482 ]
  [0.9900725 ]
  [0.9998447 ]
  [0.9561768 ]
  [0.9754199 ]
  [0.97674584]
  [0.979807  ]]]
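
If the `argmax` / `take_along_axis` combination above looks opaque, this tiny example with made-up probabilities shows the same two steps on an array small enough to read by eye. The names `toy_prediction`, `toy_chars` and `toy_probs` are just for this illustration.

```python
import numpy as np

# made-up softmax output: 1 sequence, 2 timesteps, 3 classes
toy_prediction = np.array([[[0.1, 0.7, 0.2],
                            [0.8, 0.1, 0.1]]])

# index of the most probable class at each timestep, shape (1, 2)
toy_chars = np.argmax(toy_prediction, axis=2)

# pick out the probability of each chosen class, shape (1, 2, 1)
toy_probs = np.take_along_axis(
    toy_prediction, np.expand_dims(toy_chars, axis=-1), axis=-1)

print(toy_chars)
print(toy_probs.squeeze(-1))
```

`take_along_axis` needs the index array to have the same number of dimensions as the data, which is why the indices get an extra trailing axis first.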


In [12]:
# subtract 1 to undo the offset that reserved index 0 for padding
neural_net_decoded = ''.join(cipher.int_decode(predicted_chars[0] - 1))
cipher_decoded = ''.join(decode(cipher, ciphertext[0]))

print(f'\nNeural net decoded text\n{neural_net_decoded}')
print(f'\nCipher decoded text\n{cipher_decoded}')


Neural net decoded text
good day!

Cipher decoded text
good day!
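
The `- 1` in the cell above is easy to miss, so here is a minimal sketch of the index-to-character step on its own. It assumes a lowercase-only alphabet and a hypothetical helper `indices_to_text`; the real alphabet comes from `get_alphabet()` and is larger, and the real decoding goes through `cipher.int_decode`. Unlike the cell above, this sketch also skips padding zeros explicitly.

```python
import string

import numpy as np

# assumed alphabet for illustration only
toy_alphabet = string.ascii_lowercase

# model-style output: 0 is padding, real chars start at index 1
toy_predicted = np.array([0, 0, 7, 15, 15, 4])


def indices_to_text(indices, alphabet):
    # drop padding (0) and shift the remaining indices back down by one
    return ''.join(alphabet[i - 1] for i in indices if i != 0)


print(indices_to_text(toy_predicted, toy_alphabet))
```

With this toy alphabet the indices 7, 15, 15, 4 map to "good", which lines up with the first four predicted indices shown earlier.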


**So cool!**

**We were able to demonstrate that our seq2seq LSTM, with a learned function, can produce the same output as the hand-coded function.**

This concludes the project. Thanks for following along! Be 1% better every day. Keep on learning friends. :)