# A bit of Visualizations

First, let's load the [competition](https://www.kaggle.com/c/ciphertext-challenge-iii) data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from itertools import cycle
from string import ascii_lowercase, ascii_uppercase

train = pd.read_csv("../input/train.csv")
test = pd.read_csv('../input/test.csv')
test1 = test[test["difficulty"] == 1].reset_index()

plaintext = train["text"]
ciphertext1 = test1["ciphertext"]

From the test data we keep only the rows with **difficulty 1**.

Let's get **letter** frequency statistics first.

In [None]:
def letter_frequency_stats(texts):
    memo = Counter("".join(texts))
    mstats = pd.DataFrame([[x[0], x[1]] for x in memo.items()], columns=["Letter", "Frequency"])
    return mstats.sort_values(by='Frequency', ascending=False)

plainStats = letter_frequency_stats(plaintext)
cipherStats = letter_frequency_stats(ciphertext1)

plt.figure(figsize=(24, 6))

plt.subplot(2, 1, 1)
print(len(plainStats))
plot_series = np.array(range(len(plainStats))) + 0.5
plt.bar(plot_series, plainStats['Frequency'].values)
plt.xticks(plot_series, plainStats['Letter'].values)

plt.subplot(2, 1, 2)
print(len(cipherStats))
plot_series = np.array(range(len(cipherStats))) + 0.5
plt.bar(plot_series, cipherStats['Frequency'].values)
plt.xticks(plot_series, cipherStats['Letter'].values)

# plt.savefig("count.png")
plt.show()
plainStatsL = plainStats  # keep the letter stats; plainStats is reused for word stats below

It seems that the cipher doesn't affect the frequency of spaces, and possibly **doesn't transform spaces at all**.

So, let's get **word** frequency statistics.

In [None]:
def word_frequency_stats(texts):
    memo = Counter(" ".join(texts).split(" "))
    mstats = pd.DataFrame([[x[0], x[1]] for x in memo.items() if len(x[0]) > 0], columns=["Word", "Frequency"])
    return mstats.sort_values(by='Frequency', ascending=False)

plainStats = word_frequency_stats(plaintext)
cipherStats = word_frequency_stats(ciphertext1)

plt.figure(figsize=(24, 6))

plt.subplot(2, 1, 1)
plot_series = np.array(range(10)) + 0.5
plt.bar(plot_series, plainStats['Frequency'].values[:10])
plt.xticks(plot_series, plainStats['Word'].values[:10])

plt.subplot(2, 1, 2)
plot_series = np.array(range(40)) + 0.5
plt.bar(plot_series, cipherStats['Frequency'].values[:40])
plt.xticks(plot_series, cipherStats['Word'].values[:40])

# plt.savefig("count.png")
plt.show()

There is a strong correlation between plain and encoded word frequencies:

In [None]:
{
    "the": ["flt", "ssi", "xwd", "jgp"],
    "and": ["pmo", "lrs", "yyh", "edc"],
    "I": ["M", "H", "T", "X"], # ?!
    "to": ["xe", "fs", "aj", "jn"], # might be a bit
    "of": ["su", "nq", "ee", "sa"], # messed here
    "a": ["l", "e", "p", "y"] # !!! 
}

Note that the first letters of the words corresponding to "**a**nd" are also **"p", "l", "y", "e"**!

Time to try a Caesar cipher (more precisely, a **Vigenère cipher**):

In [None]:
def caesar_shift(text, key):
    """Subtract the repeating `key` from letters only; other characters pass through."""
    def substitute(char, i):
        # The key index advances only when a letter is consumed.
        if char in ascii_lowercase:
            char = chr((ord(char) - 97 - ord(key[i]) + 97) % 26 + 97)
            i = (i + 1) % len(key)
        elif char in ascii_uppercase:
            char = chr((ord(char) - 65 - ord(key[i]) + 97) % 26 + 65)
            i = (i + 1) % len(key)
        return char, i
    i = 0
    result = list(text)
    for j in range(len(result)):
        result[j], i = substitute(result[j], i)
    return ''.join(result)

encodedtext = ciphertext1.map(lambda x: caesar_shift(x, 'pyle'))
encodedtext.values[:20]

I had to apply the transformation only to alphabetic characters, skipping spaces and all other characters, and I played around with the order of the letters in the cipher key (which can be guessed from the transformations of the words 'and' and 'the' **;)** ), but the result still looks messy.
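
The rule described above (shift only letters, pass everything else through, and advance the key only when a letter is consumed) can be tried on a toy string. This is a minimal sketch under the plain mod-26 Vigenère assumption being tested at this point; `vigenere` and `sample` are hypothetical names that are not part of the kernel.

```python
from string import ascii_lowercase, ascii_uppercase


def vigenere(text, key, decrypt=False):
    """Shift letters by a repeating lowercase key; the key advances only on letters."""
    sign = -1 if decrypt else 1
    out, i = [], 0
    for ch in text:
        if ch in ascii_lowercase or ch in ascii_uppercase:
            base = ord("a") if ch in ascii_lowercase else ord("A")
            shift = sign * (ord(key[i % len(key)]) - ord("a"))
            out.append(chr((ord(ch) - base + shift) % 26 + base))
            i += 1  # consume a key letter only for alphabetic characters
        else:
            out.append(ch)  # spaces, digits and punctuation pass through
    return "".join(out)


sample = "KING HENRY V: the and a"  # toy input, not taken from the data set
encrypted = vigenere(sample, "pyle")
print(encrypted)
print(vigenere(encrypted, "pyle", decrypt=True))
```

Because non-letters never consume a key position, word boundaries and punctuation survive encryption unchanged, which is consistent with the space frequencies seen earlier.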

Let's look at the word frequency chart, though:

In [None]:
encodedStats = word_frequency_stats(encodedtext)

plt.figure(figsize=(24, 6))

plt.subplot(2, 1, 1)
plot_series = np.array(range(10)) + 0.5
plt.bar(plot_series, plainStats['Frequency'].values[:10])
plt.xticks(plot_series, plainStats['Word'].values[:10])

plt.subplot(2, 1, 2)
plot_series = np.array(range(40)) + 0.5
plt.bar(plot_series, encodedStats['Frequency'].values[:40])
plt.xticks(plot_series, encodedStats['Word'].values[:40])

# plt.savefig("count.png")
plt.show()

Interestingly, 'a' was "merged" properly, but the other words still have variations.
Specifically:

In [None]:
{
    "a": ["a"], 
    "the": ["uhe", "thf", "uie"],
    "and": ["and", "aod", "aoe"],
    "I": ["I", "J"],
    "to": ["up", "tp", "uo"],
    "of": ["pf", "pg", "of"],
    "my": ["mz", "nz"],
    "in": ["io", "in", "jn"],
    "is/it": ["it", "iu", "jt"],
    "you": ["zpv", "zpu", "zov"]
}

I left out some values to make the pattern easier to see.
Sometimes the next letter of the alphabet is produced instead of the expected one.
But that never seems to happen with the word "a" on its own.
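
To make the "next letter" pattern explicit, here is a small sketch that prints the per-letter distance between each decoded variant and the word it should have been. The variants are copied from the table above; `letter_offsets` and `decoded_variants` are hypothetical names introduced only for this illustration.

```python
def letter_offsets(variant, word):
    """Alphabet distance between each decoded letter and the expected letter."""
    return [ord(v) - ord(w) for v, w in zip(variant, word)]


# A subset of the decoded variants listed in the table above.
decoded_variants = {
    "the": ["uhe", "thf", "uie"],
    "and": ["and", "aod", "aoe"],
    "to": ["up", "tp", "uo"],
    "you": ["zpv", "zpu", "zov"],
}

for word, variants in decoded_variants.items():
    for variant in variants:
        print(word, variant, letter_offsets(variant, word))
```

Every offset is either 0 or 1, so a decoded letter is either correct or exactly one position too far along the alphabet.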

Hmm...

# Example
I need to find a specific example. After looking at the encoded texts, I've chosen one. Let's try to decode it:

In [None]:
[x for x in ciphertext1.values if 'AHYK WDYVO U: Iltqpmd sssk ydx bdew wybto apmdf qipq\'o' in x]

In [None]:
print(caesar_shift("RUld.B]4:tV79 wTUXjHHxgAHYK WDYVO U: Iltqpmd sssk ydx bdew wybto apmdf qipq'oYVRwT5KnGazYrqKYOdBF4.5", 'pyle'))
print(caesar_shift("RUld.B]4:tV79 wTUXjHHxgAHYK WDYVO U: Iltqpmd sssk ydx bdew wybto apmdf qipq'oYVRwT5KnGazYrqKYOdBF4.5", 'qzmf'))

Interesting, I can see something like 'KING HENRY V: ', don't you see it?

Let's find a possible cipher key:

In [None]:
''.join((chr((ord(x) - 97) % 26 + 97) for x in caesar_shift('AHYKWDYVOU', 'KINGHENRYV')))
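
The one-liner above is a known-plaintext attack: subtracting the guessed plain letters from the cipher letters leaves the key. A more explicit sketch of the same idea, still assuming a mod-26 alphabet, could look like this; `recover_key` is a hypothetical helper, not something defined in the kernel.

```python
def recover_key(cipher, plain, modulus=26):
    """Key letters implied by aligned cipher/plain letters: (c - p) mod modulus."""
    return "".join(
        chr((ord(c.lower()) - ord(p.lower())) % modulus + ord("a"))
        for c, p in zip(cipher, plain)
    )


print(recover_key("AHYKWDYVOU", "KINGHENRYV"))  # -> qzlepzleqz
```

Both arguments must contain letters only and be aligned one to one, which is why the spaces are removed from the cipher fragment and the guess before comparing them.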

A bit of ~~voodoo~~ qzlepzle magic:

In [None]:
print(caesar_shift("RUld.B]4:tV79 wTUXjHHxgAHYK WDYVO U: Iltqpmd sssk ydx bdew wybto apmdf qipq'oYVRwT5KnGazYrqKYOdBF4.5", 'qzlepzle'))

What do we have in the plain data?

In [None]:
[x for x in plaintext.values if 'KING HENRY V: Wherein' in x]

Nice!

Now we can find the cipher key more precisely:

In [None]:
key01 = ''.join((chr((ord(x) - 97) % 26 + 97) for x in caesar_shift('AHYKWDYVOUIltqpmdssskydxbdewwybtoapmdfqipqo', 'KINGHENRYVWhereinthouartlesshappybeingfeard')))
key01

A bit of alignment...

In [None]:
np.array([x for x in (key01 + ' ')]).reshape((-1, 4))

Ok, I could play a lot more with it.

But it seems to be a road to nowhere.

Or...

# New Hypothesis

Based on the word frequency statistics above, let's try to guess the cipher key from other words, not only 'a':

In [None]:
print(''.join((chr((ord(x) - 97) % 26 + 97) for x in caesar_shift('MHTX', 'I')))) # Upper case to lower case transformation magic here
print(caesar_shift('flt ssi xwd jgp', 'the'))
print(caesar_shift('pmo lrs yyh edc', 'and'))
print(caesar_shift('xe fs sa jn', 'to'))
print(caesar_shift('su nq ee aj', 'of'))

Interesting! Let's see how each individual letter is transformed:

In [None]:
{
    "a": "pyle",
    "I": "pzle",
    "t": "qzme",
    "h": "pzle",
    "e": "pzle",
    "a": "pyle",
    "n": "qzle",
    "d": "pzle",
    "t": "qzme",
    "o": "qzme",
    "o": "qzme",
    "f": "pzle"
}

In [None]:
{
    "a": "pyle",
    "a": "pyle",
    "d": "pzle",
    "e": "pzle",
    "f": "pzle",
    "h": "pzle",
    "I": "pzle",
    "n": "qzle",
    "o": "qzme",
    "o": "qzme",
    "t": "qzme",
    "t": "qzme",
}

Really interesting! This looks like a pattern: each key letter changes to the next letter at some point.

- For the letter 'a' our cipher uses 'y', while for other letters 'z' is used instead.
- 'p' is used for letters up to 'i', but 'q' is used for letters starting from 'n'.
- The shift from 'l' to 'm' happens between the letters 'n' and 'o'!
- 'e' is replaced by 'f', if at all, only after the letter 't'.

Wait..

In [None]:
print(ord('a') - 97 + ord('y') - 97, ord('b') - 97 + ord('z') - 97)
print(ord('n') - 97 + ord('l') - 97, ord('o') - 97 + ord('m') - 97)
print(ord('j') - 97 + ord('p') - 97, ord('k') - 97 + ord('q') - 97)
print(ord('u') - 97 + ord('e') - 97, ord('v') - 97 + ord('f') - 97)

Nice! Do you know what it means?

It means the ciphered text never produces 'z'! (ord('z') - 97 == 25)
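
One way to read these sums, as an assumption rather than an established fact at this point: if encryption were `(p + k) % 25`, a mod-26 decoder would infer the true key letter whenever `p + k < 25` and the next letter whenever the sum wraps. The sketch below checks the boundary letters used in the prints above; `apparent_key` is a hypothetical helper.

```python
def apparent_key(plain_letter, key_letter):
    """Key letter a mod-26 decoder would infer if encryption were (p + k) % 25."""
    p = ord(plain_letter) - ord("a")  # plain letter index, expected in 0..24 ('a'..'y')
    k = ord(key_letter) - ord("a")    # true key letter index
    c = (p + k) % 25                  # assumed encryption over 25 letters (no 'z')
    return chr((c - p) % 26 + ord("a"))


# Boundary pairs taken from the sums printed above.
for key_letter, pair in zip("pyle", ["jk", "ab", "no", "uv"]):
    print(key_letter, [apparent_key(p, key_letter) for p in pair])
```

Under that assumption the key never actually changes; the drift from 'p' to 'q', 'y' to 'z', 'l' to 'm' and 'e' to 'f' is an artifact of decoding with the wrong modulus.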

Let's get back to letter frequency.
You might not have noticed, but now we can see that:
- 'Z' appears in neither the ciphered nor the plain text!
- The frequency of 'z' in the ciphered text is way lower than that of the other lowercase letters, and is close to that of digits and special characters.

It might be that **'z' only appears in randomly generated salt.**

**Well, so what?**

OK, first: where does 'z' come from in the ciphered text? If the cipher is applied only to the plain-text part, decoding will require finding the proper alignment of the plain text within the encoded string, which may be problematic.

Next, 26 letters are encoded into 25. If the shift is taken modulo 25, 'z' (25) behaves like 'a' (0), so both land on the key letter itself:
- when key letter is 'y/z': 'z' and 'a' will be encoded into 'y'
- when key letter is 'p/q': 'z' and 'a' will be encoded into 'p'
- when key letter is 'l/m': 'z' and 'a' will be encoded into 'l'
- when key letter is 'e/f': 'z' and 'a' will be encoded into 'e'

This would give higher overall frequencies for the letters 'pyle' that make up the cipher key, which we could also see in the letter frequency chart if the frequency of 'z' in the plain text weren't so extremely low.

Anyway, we need to adjust our decoding function:

In [None]:
def caesar_shift_ex(text, key):
    def substitute(char, i):
        if char in ascii_lowercase:
            char = chr((ord(char) - 97 - ord(key[i]) + 97) % 25 + 97)
            i = (i + 1) % len(key)
        if char in ascii_uppercase:
            char = chr((ord(char) - 65 - ord(key[i]) + 97) % 25 + 65)
            i = (i + 1) % len(key)
        return char, i
    i = 0
    result = [char for char in text]
    for j in range(len(result)):
        result[j], i = substitute(result[j], i)
    return ''.join(result)

encodedtext = ciphertext1.map(lambda x: caesar_shift_ex(x, 'pyle'))
encodedtext.values[:20]

Which is as simple as **changing modulus from 26 to 25**.

Some strings were decoded, while others still look weird, which supports the idea that only the plain-text part is encoded and that it starts after an unknown number of salted...

### Oh, wait..

# WHAT?!!

In [None]:
encodedtext.values[9]

This string broke in the middle. Probably the selection of the next character from the cipher key failed. But what could have gone wrong?

Maybe... did the original string contain 'z'?

In [None]:
[x for x in plaintext.values if 'DOMITIUS ENOBARBUS: Had gone to' in x]

It did! So it looks like 'z' was just skipped!

Let's try again:

In [None]:
def caesar_shift_ex2(text, key):
    def substitute(char, i):
        if char in ascii_lowercase and char != 'z':
            char = chr((ord(char) - 97 - ord(key[i]) + 97) % 25 + 97)
            i = (i + 1) % len(key)
        if char in ascii_uppercase:
            char = chr((ord(char) - 65 - ord(key[i]) + 97) % 25 + 65)
            i = (i + 1) % len(key)
        return char, i
    i = 0
    result = [char for char in text]
    for j in range(len(result)):
        result[j], i = substitute(result[j], i)
    return ''.join(result)

encodedtext = ciphertext1.map(lambda x: caesar_shift_ex2(x, 'pyle'))
encodedtext.values[:20]

It looks like **we finally did it!**
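
As a sanity check, the decoder can be paired with an encoder built on the same rules. This is only a sketch of what the cipher might look like, inferred from the decoding above and not confirmed by the competition: letters are shifted modulo 25, and 'z' passes through without consuming a key letter. `shift25`, `encrypt` and `decrypt` are hypothetical names, and skipping the uppercase 'Z' as well is my assumption, since 'Z' never occurs in the data.

```python
from string import ascii_lowercase, ascii_uppercase


def shift25(text, key, sign):
    """Shift letters by a repeating lowercase key modulo 25, leaving 'z'/'Z' untouched."""
    out, i = [], 0
    for ch in text:
        if ch in ascii_lowercase and ch != "z":
            base = ord("a")
        elif ch in ascii_uppercase and ch != "Z":  # assumption: 'Z' is skipped like 'z'
            base = ord("A")
        else:
            out.append(ch)  # non-letters and 'z' do not consume a key letter
            continue
        k = ord(key[i % len(key)]) - ord("a")
        out.append(chr((ord(ch) - base + sign * k) % 25 + base))
        i += 1
    return "".join(out)


def encrypt(text, key):
    return shift25(text, key, +1)


def decrypt(text, key):
    return shift25(text, key, -1)


# Fragment from the example above, with the key starting at 'p'.
print(encrypt("KING HENRY V: Wherein", "pyle"))  # -> AHYK WDYVO U: Iltqpmd
print(decrypt(encrypt("a lazy dozen", "pyle"), "pyle"))
```

The first print reproduces the ciphertext fragment found earlier, and the round trip keeps its key alignment even when the text contains 'z', because a mod-25 shift of 'a' to 'y' can never produce a 'z' of its own.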

Now the only step left is to figure out how to use it for a submission and perhaps for cracking the next cipher difficulty levels.


*See you in my next kernel, ladies and gentlemen!*

P.S.: What are numbers and special symbols doing in the plain data? Let's figure it out.

In [None]:
rare_symbols = plainStatsL.loc[plainStatsL["Frequency"] < 100]['Letter'].values
rare_occurances = [x for x in plaintext if np.any(np.isin(rare_symbols, [y for y in x]))]
print(rare_symbols)
rare_occurances

Looks like some clues left by the competition authors about further difficulty levels.