
# Cipher classifier (developing analyser.py)

Contents:
1. Overview
2. Data acquisition
3. Feature Extraction
4. Model training and Evaluation


# 1. Overview
For the National Cipher Challenge it would be helpful to identify ciphers as quickly as possible. After Moses suggested that I make a statistics extractor that displays information such as the index of coincidence, I am going above and beyond by attempting to let ML do the job.

# 2. Data acquisition

This data is generated from the British National Cipher Challenge archive by themaddoctor on GitHub. Initially I thought of using our own corpus of Agatha Christie novels, but the encryption process would take more time.
I gave up on hand-copying the desired details and let ChatGPT deal with the nitty-gritty. I might come back later to understand it.
This code can soon be used for the whole data set as well. Having 9 or 10 challenges, each with two parts, gives 22 * 9 * 2 = 396 data points. Not massive, but easy and useful. One concern at the moment is how many monoalphabetic substitution ciphers there are, which leaves the other cipher types underrepresented.


In [87]:
import pandas as pd

In [88]:
cipher_data = pd.read_csv("2023_ciphertext_data.csv")
print(cipher_data['CipherType'])

0                           Caesar shift cipher
1            monoalphabetic substitution cipher
2                                   Hill cipher
3                            four-square cipher
4                           Caesar shift cipher
5                           Caesar shift cipher
6                                 affine cipher
7            monoalphabetic substitution cipher
8            monoalphabetic substitution cipher
9            monoalphabetic substitution cipher
10           monoalphabetic substitution cipher
11                           permutation cipher
12           monoalphabetic substitution cipher
13                              Vigenere cipher
14                           permutation cipher
15                  Morse code, Vigenere cipher
16                              Vigenere cipher
17    Morse code, columnar transposition cipher
18                              Vigenere cipher
19                              Polybius square
20                                  Hill

The only problem with this is that the title of each challenge is sometimes, but not always, included in the first line of the ciphertext, so slicing off the first line might remove ciphertext. However, I am hoping this is a rare enough occurrence not to affect results. I am using the 2023 challenges as my small set of data. With this small range of ciphers, I think this will work well for identifying monoalphabetic substitution ciphers at first.

# 3. Feature extraction
The evaluate.py file by Moses includes counters for letter frequencies, IOC and so on. They were initially used in hill-climbing algorithms to measure how close a decrypt is to real text. I had been asked to write code that applies this analysis function to ciphertexts over periods, e.g. generating the IOC at each period: extract every nth letter and calculate the IOC of that subsequence. For a periodic cipher, letters encrypted with the same key letter should give a value resembling that of plain English.

I have attempted to refactor computeStatistics.py to work better with pandas by returning a dictionary.

In [None]:
import sys
import os

# Add the parent directory to the system path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# Now you can import the function
from statsAnalyser.computeStatistics import getStatistics
stats = getStatistics("1440 x 900  monitor looks so bad") 
stats


Factors of length:  [1, 2, 3, 6, 9, 18]


{'lengthStats': [1, 2, 3, 6, 9, 18],
 'monogramFitness': 0.6268193150739267,
 'IOC': 0.0718954248366013,
 'bigramIOC': 0.0,
 'trigramIOC': 0.0,
 'quadgramFrequenciesScore': -219.75729687081144}
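
As a sanity check, the IOC value above is consistent with the usual un-normalised index of coincidence, assuming getStatistics counts letters only (digits and spaces dropped). The sample then has N = 18 letters, of which o appears 5 times, s twice and every other letter once:

$$
\mathrm{IOC} = \frac{\sum_i n_i (n_i - 1)}{N (N - 1)} = \frac{5 \cdot 4 + 2 \cdot 1}{18 \cdot 17} = \frac{22}{306} \approx 0.0719
$$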

Now I need to fix all this ;p
I discovered that relative paths hard-coded in imported files are resolved from wherever the calling file is run, as if the imported code had been gathered into the caller. Such paths therefore only work if they are correct relative to the current file.
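
One common way around this, sketched here as a suggestion rather than what the project currently does, is to anchor data paths at the imported module's own location. The helper name `resolve_data_path` and the file name `quadgrams.txt` are made up for illustration.

```python
from pathlib import Path

def resolve_data_path(module_file, relative_name):
    """Return relative_name anchored at the folder that contains module_file."""
    return Path(module_file).resolve().parent / relative_name

# Inside an imported module you would pass that module's own __file__, e.g.
#   QUADGRAM_FILE = resolve_data_path(__file__, "quadgrams.txt")  # hypothetical file name
print(resolve_data_path("statsAnalyser/computeStatistics.py", "quadgrams.txt"))
```

With this, the path stays the same no matter which notebook or script does the importing.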

For lengthStats there is lots of non-linearity, which should mean that a neural network performs better. I don't think we need it yet.

Now we need to add the stats for each ciphertext row. It just occurred to me that I should strip the whitespace from the ciphertext as well.
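
A minimal sketch of that step, assuming the statistics function returns one dictionary per ciphertext. `toy_statistics` is a made-up stand-in for getStatistics and the two rows are invented examples.

```python
import pandas as pd

def toy_statistics(text):
    """Stand-in for getStatistics: any function returning a dict of features."""
    return {"length": len(text), "distinctLetters": len(set(text))}

cipher_data = pd.DataFrame({
    "Ciphertext": ["ROREP MRTFO", "AB\nCD  E"],
    "CipherType": ["permutation cipher", "Caesar shift cipher"],
})

# Remove every run of whitespace (spaces, tabs, newlines) before analysing
cleaned = cipher_data["Ciphertext"].str.replace(r"\s+", "", regex=True)

# One dict per row becomes one column per statistic
stats = pd.DataFrame(cleaned.map(toy_statistics).tolist(), index=cipher_data.index)
cipher_stats = pd.concat([cipher_data, stats], axis=1)
print(cipher_stats)
```

Building the stats frame from a list of dicts keeps the column names in sync with whatever keys the statistics function returns.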

In [90]:
cipher_stats = pd.read_csv("2023_ciphertext_data_stats.csv")

cipher_stats.loc[cipher_stats['CipherType'] == 'permutation cipher']

Unnamed: 0,Ciphertext,CipherType,lengthStats,monogramFitness,IOC,bigramIOC,trigramIOC,quadgramFrequenciesScore
10,ROREP MRTFO AMISS SEISI SETYL ILTHE YRBRA GAOU...,permutation cipher,"[1, 2, 5, 10, 193, 386, 965, 1930]",0.996043,0.064288,0.00467,0.000415,-36603.088546
13,JDEFO IOLWI LONUO NGPYU LAORS MSSTE AETHG IOGT...,permutation cipher,"[1, 2, 17, 34, 73, 146, 1241, 2482]",0.996918,0.066226,0.005101,0.000523,-50671.609546


In [91]:
cipher_stats.dropna(axis=0, inplace=True) # removes missing values. Didn't know that inplace would be something to think about.
cipher_stats.isnull().sum()


Ciphertext                  0
CipherType                  0
lengthStats                 0
monogramFitness             0
IOC                         0
bigramIOC                   0
trigramIOC                  0
quadgramFrequenciesScore    0
dtype: int64

Let's check that the scores are fairly normalised.
- Problem with codes and doubled ciphers: it's like MNIST with two overlapping digits. We can try to train on that later.
- Problem with numerical ciphers: the stats just don't apply! Polybius etc.
- For now, substitution ciphers look fine at around 0.068 for IOC regardless of length.
- Permutation ciphers have high monogram fitness scores and IOC, since they are just normal letters jumbled up.
- To identify periodic Vigenere, there has to be a measure that extracts the key insights from the graph that computeStatistics would give. Or just dump everything...
- For Hill and other n-gram substitutions like Playfair, where one digram maps to another, we will need the block IOC, i.e. the bigram and trigram IOC. The model might overfit for Hill by looking at the presence of 2 or 3 as a factor of the length, since that is probably all the data available.


- New issue: some solution.txt files do not simply provide the cipher name, e.g. 2005 7B gives a long solution. Anything longer than, say, 20 characters should be weeded out. Alternatively, just dictionary-search the whole text for cipher names. For now I think double encryptions should be ignored.
- I would love numbers to be recognised as well somehow in the future.
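
To make the block IOC idea for Hill and Playfair concrete, here is a small sketch that treats non-overlapping blocks of letters as single symbols. This is one reasonable reading of "bigram IOC"; the project's own bigramIOC may well be computed differently (for example with overlapping bigrams), and `block_ioc` is a name invented here.

```python
from collections import Counter

def block_ioc(text, block_size):
    """Index of coincidence over non-overlapping blocks of block_size letters."""
    letters = "".join(c for c in text.upper() if c.isalpha())
    usable = len(letters) - len(letters) % block_size  # ignore a trailing partial block
    blocks = [letters[i:i + block_size] for i in range(0, usable, block_size)]
    n = len(blocks)
    if n < 2:
        return 0.0
    counts = Counter(blocks)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))

print(block_ioc("ABABAB", 2))
print(block_ioc("1440 x 900  monitor looks so bad", 2))
```

A cipher that substitutes whole digrams should keep this value near that of English digrams, while a letter-by-letter polyalphabetic cipher should flatten it.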


Now, let's try to train on monoalphabetic-or-not.

In [92]:

# Getting our data split into X and y

y = cipher_stats.pop('CipherType')
X = cipher_stats.drop(['Ciphertext', 'lengthStats'], axis=1) # remove columns (axis 1)

X.head()
X.shape



(17, 5)

In [93]:
from sklearn.svm import LinearSVC
from sklearn import model_selection
def svc_train_predict(X, y):
    # Hold out 10% of the rows as a test set
    train_x, test_x, train_y, test_y = model_selection.train_test_split(
        X, y, train_size=0.9)
    svm = LinearSVC()
    svm.fit(train_x, train_y)
    print(svm.predict(train_x))

    print(f"Training accuracy: {svm.score(train_x, train_y)}")
    print(f"Test accuracy: {svm.score(test_x, test_y)}")

# Obsolete because train_test_split exists: manual split keeping the last third for testing
# index = X.shape[0] * 2 // 3
# train_x = X[:index]
# train_y = y[:index]
# test_x = X[index:]
# test_y = y[index:]

# train_x.shape

svc_train_predict(X, y)

['monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher']
Training accuracy: 0.4
Test accuracy: 0.0


Great... it thinks everything's monoalphabetic. Representation-wise, yes, that class has the most training data.
Let's actually make it multiclass...


A second run made me realise a problem: Caesar, atbash and affine are all monoalphabetic substitution ciphers, so we should probably relabel them as mono, since they are pretty much identical. This should boost our accuracy.
Another observation is that the four-square is a standout due to its monstrous quadgram fitness score.

In [94]:
# best to finally refactor as a whole data cleansing section in the end

y.replace("Caesar shift cipher", "monoalphabetic substitution cipher", inplace=True)
y.replace("affine cipher", "monoalphabetic substitution cipher", inplace=True)
y.replace("atbash cipher", "monoalphabetic substitution cipher", inplace=True)
y.head(20)

0     monoalphabetic substitution cipher
1                            Hill cipher
2                     four-square cipher
3     monoalphabetic substitution cipher
4     monoalphabetic substitution cipher
5     monoalphabetic substitution cipher
6     monoalphabetic substitution cipher
7     monoalphabetic substitution cipher
8     monoalphabetic substitution cipher
9     monoalphabetic substitution cipher
10                    permutation cipher
11    monoalphabetic substitution cipher
12                       Vigenere cipher
13                    permutation cipher
15                       Vigenere cipher
17                       Vigenere cipher
19                           Hill cipher
Name: CipherType, dtype: object

In [95]:
svc_train_predict(X, y)

['monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'four-square cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher']
Training accuracy: 0.6
Test accuracy: 0.5


Improved slightly, given that we still don't have the periodicity data for Vigenere, and I'm not sure how Hill would work...
Vigenere is Caesar, but applied separately to every nth letter. We need to account for a 'spike' of IOC at one period, perhaps by taking the difference between the maximum and the average IOC over all assumed key lengths.

For permutation, we expect a monogram fitness close to 1 and an (un-normalised) IOC close to 0.068. In an SVM, a linear classification method, the IOC as-is is pretty useless because the class is not linearly separable on it. If you had only the IOC as a feature, you'd have:
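
The spike idea can be sketched as follows. This is only an illustration of the max-minus-average feature described above; the function names and the `max_period=10` cut-off are assumptions, not part of computeStatistics.

```python
from collections import Counter

def ioc(text):
    """Un-normalised index of coincidence over the letters of text."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    return sum(k * (k - 1) for k in Counter(letters).values()) / (n * (n - 1))

def periodic_ioc(text, period):
    """Average IOC of the columns formed by taking every period-th letter."""
    letters = "".join(c for c in text.upper() if c.isalpha())
    columns = [letters[start::period] for start in range(period)]
    return sum(ioc(column) for column in columns) / period

def ioc_spike(text, max_period=10):
    """Largest periodic IOC minus the average over periods 1..max_period."""
    values = [periodic_ioc(text, p) for p in range(1, max_period + 1)]
    return max(values) - sum(values) / len(values)

# "AB" repeated behaves like a constant plaintext under a period-2 key
print(periodic_ioc("AB" * 50, 1), periodic_ioc("AB" * 50, 2), ioc_spike("AB" * 50))
```

A single number like this fits into the existing feature table more easily than the whole IOC-per-period graph, at the cost of throwing away which period caused the spike.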

-----x------\*-\*----x--

0__0.05__0.068__ 0.1

Along the number line, where * represents permutation cipher data points, you need two separating points to pick out the range of acceptable deviations around 0.068, and a linear classifier on a single feature only gets one.
So why not take the absolute deviation from 0.068?


In [96]:
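from math import log

def difference_IOC(ioc, expected_ioc=0.068):
    # Log of the absolute deviation from the expected English IOC.
    # Not sure whether to square it instead; log to the rescue!
    # The tiny floor avoids log(0) if a ciphertext lands exactly on expected_ioc.
    return log(max(abs(ioc - expected_ioc), 1e-12))

X['differenceIOC'] = X['IOC'].apply(difference_IOC)
X.head(20)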

Unnamed: 0,monogramFitness,IOC,bigramIOC,trigramIOC,quadgramFrequenciesScore,differenceIOC
0,0.621285,0.061796,0.006346,0.001269,-9876.094958,-5.082523
1,0.794419,0.041249,0.007262,0.000408,-30829.986645,-3.62118
2,0.786917,0.046987,0.007767,0.000513,-102883.093764,-3.862635
3,0.667307,0.067042,0.008231,0.001346,-38343.925193,-6.950488
4,0.526117,0.064788,0.007085,0.00193,-29829.060086,-5.740787
5,0.547506,0.068899,0.007545,0.00153,-32994.187883,-7.014537
6,0.515992,0.065243,0.007623,0.001529,-23974.702148,-5.893585
7,0.647638,0.068096,0.007995,0.001477,-47588.393022,-9.251265
8,0.579257,0.067529,0.007788,0.001681,-35252.356382,-7.660395
9,0.747297,0.068092,0.007877,0.00171,-32718.500586,-9.292394


In [97]:
svc_train_predict(X, y)

['Vigenere cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'four-square cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'Vigenere cipher' 'Vigenere cipher'
 'Vigenere cipher' 'monoalphabetic substitution cipher' 'Vigenere cipher'
 'monoalphabetic substitution cipher']
Training accuracy: 0.7333333333333333
Test accuracy: 0.5


This is insane! Permutation is sadly still not on the list, but Vigenere and Hill made some random appearances. I'll try to make a high monogram fitness stick out.

In [98]:
X['logMonogramFitness'] = X['monogramFitness'].apply(lambda x: log(x)*100)

In [99]:
X.head(20)


Unnamed: 0,monogramFitness,IOC,bigramIOC,trigramIOC,quadgramFrequenciesScore,differenceIOC,logMonogramFitness
0,0.621285,0.061796,0.006346,0.001269,-9876.094958,-5.082523,-47.596584
1,0.794419,0.041249,0.007262,0.000408,-30829.986645,-3.62118,-23.014463
2,0.786917,0.046987,0.007767,0.000513,-102883.093764,-3.862635,-23.963246
3,0.667307,0.067042,0.008231,0.001346,-38343.925193,-6.950488,-40.450441
4,0.526117,0.064788,0.007085,0.00193,-29829.060086,-5.740787,-64.223233
5,0.547506,0.068899,0.007545,0.00153,-32994.187883,-7.014537,-60.23827
6,0.515992,0.065243,0.007623,0.001529,-23974.702148,-5.893585,-66.1664
7,0.647638,0.068096,0.007995,0.001477,-47588.393022,-9.251265,-43.442353
8,0.579257,0.067529,0.007788,0.001681,-35252.356382,-7.660395,-54.600892
9,0.747297,0.068092,0.007877,0.00171,-32718.500586,-9.292394,-29.129238


In [100]:
svc_train_predict(X, y)


['monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'permutation cipher'
 'monoalphabetic substitution cipher' 'Vigenere cipher'
 'monoalphabetic substitution cipher' 'permutation cipher'
 'Vigenere cipher' 'monoalphabetic substitution cipher'
 'monoalphabetic substitution cipher' 'Hill cipher'
 'monoalphabetic substitution cipher']
Training accuracy: 0.8666666666666667
Test accuracy: 0.5


WE GOT OUR PERMUTATION CIPHER APPEARANCE LESGOOOO!! Although still not sure of how Hill got there. Let's see what it's getting right:


In [101]:
svm = LinearSVC()

def svc_train_predict(X, y, drop_diff=False, drop_logmono=False):
    # NOTE: DataFrame.drop returns a new frame, so without reassignment these
    # calls leave X unchanged and the model still trains on every column.
    if drop_diff:
        X.drop(columns='differenceIOC')
    else:
        X.drop(columns='IOC')
    if drop_logmono:
        X.drop(columns='logMonogramFitness')
    train_x, test_x, train_y, test_y = model_selection.train_test_split(
        X, y, train_size=0.9)
    # NOTE: fitting on all of X means the test rows are seen during training,
    # so the test accuracy printed below is optimistic.
    svm.fit(X, y)
    for predicted, real in zip(svm.predict(X), y):
        print("Actual:"+real, "Predicted:"+predicted)

    print(f"Training accuracy: {svm.score(train_x,train_y)}")

    # Predictions come first in the zip, so unpack them in that order
    for predicted, real in zip(svm.predict(test_x), test_y):
        print("Actual:"+real, "Predicted:"+predicted)
    print(f"Test accuracy: {svm.score(test_x,test_y)}")
svc_train_predict(X, y)

# X.drop(['differenceIOC', 'monogramFitness', 'logMonogramFitness'],axis=1)
# train_predict(X, y)


Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:Hill cipher Predicted:Vigenere cipher
Actual:four-square cipher Predicted:four-square cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:permutation cipher Predicted:permutation cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:Vigenere cipher Predicted:Vigenere cipher
Actual:permut

Here we can see:
- Vigenere is OK. Mono is also predicted for it, which is fair since Vigenere is Caesar shifts in nature... we REALLY need the periodic data!
- Four-square still going strong.
- Permutation is still predicted as mono, due to the strong IOC? I think IOC shouldn't be taken too seriously in this case. Perhaps it would improve with more training data, or I should get rid of the IOC difference.
- Hill is pretty much always predicted as Vigenere.

Tomorrow: get loads more data. Must sort out permutation.

Big issue! We just multiplied the log monogram fitness by 100 and the difference it made was huge! Why is that? Because we put more space between the values and amplified the tiny numbers that were 'close to 0 anyway'? Why does that work?
Still not happy with this Hill-to-Vigenere stuff, but we will sort that out.
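
One plausible explanation for the x100 effect, offered as an assumption rather than something verified on this data: LinearSVC penalises large weights, so a feature with tiny values needs a large (expensive) weight to matter, and multiplying it by 100 makes it cheap to use. Standardising every feature puts them on an equal footing without hand-picked multipliers. A minimal sketch on made-up data (`toy_X` and `toy_y` are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    # Huge magnitude but uninformative in this toy example
    "quadgramFrequenciesScore": rng.normal(-40000, 15000, 40),
    # Tiny magnitude but it separates the two classes
    "logMonogramFitness": np.concatenate(
        [rng.normal(-0.5, 0.05, 20), rng.normal(-0.01, 0.005, 20)]),
})
toy_y = ["monoalphabetic substitution"] * 20 + ["permutation"] * 20

# The scaler is fitted inside the pipeline, so it only ever sees training rows
scaled_svm = make_pipeline(StandardScaler(), LinearSVC())
scaled_svm.fit(toy_X, toy_y)

scaled = scaled_svm.named_steps["standardscaler"].transform(toy_X)
print(scaled.mean(axis=0), scaled.std(axis=0))
print(scaled_svm.score(toy_X, toy_y))
```

After scaling, the coefficients of different features also become roughly comparable with each other, which they are not on the raw values.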

In [102]:
X.head(20)

svc_train_predict(X, y)
# honestly not too sure about the impact of logMonogramFitness, since it would be linearly separable either way.

Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:Hill cipher Predicted:Vigenere cipher
Actual:four-square cipher Predicted:four-square cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:permutation cipher Predicted:permutation cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:Vigenere cipher Predicted:Vigenere cipher
Actual:permut

How about we move on to random forests, since they come with a feature importance report? Though for a linear SVC you could look at the coefficients.


In [103]:
feature_coefficients = svm.coef_
# Shorten the class and feature names so that the table fits on screen
classes = [name[:4] for name in svm.classes_]
feature_names = [name[:5] for name in svm.feature_names_in_]
print(feature_coefficients.shape)
# for i, coeff in enumerate(feature_coefficients):
#     print(classes[i])
#     for j, feature in enumerate(feature_names):
#         print(f" {feature} :", "%.6f" % coeff[j])

df = pd.DataFrame(feature_coefficients, index=classes, columns=feature_names)
print(df)

(5, 7)
         monog       IOC     bigra     trigr     quadg     diffe     logMo
Hill  0.000199 -0.000225  0.000042 -0.000002  0.000004  0.028635  0.014831
Vige -0.002751 -0.000731 -0.000129 -0.000021  0.000003  0.088622 -0.000056
four -0.055024 -0.003270 -0.000256 -0.000035 -0.000022  0.299296  0.020041
mono -0.063992  0.001569  0.000406  0.000232  0.000081 -0.423413 -0.023773
perm  0.005479  0.000567 -0.000042 -0.000003 -0.000006 -0.060087  0.137898


Unpacking this:
First let's look at monoalphabetic. Compared to every other class, it weights differenceIOC the most: a small change in differenceIOC causes a big change in the decision score, meaning it's very important. This makes sense. We also expected monogramFitness to be valued more.
One thing to keep in mind: since the orders of magnitude of the feature values vary, we should not compare the weight of one feature to that of another, but rather compare one class's weight for a feature to another class's weight for the same feature. But of course, the closer to 0 the weight is, the less impact it has.

For permutation, we expect a high value for monogramFitness, or the log of it, more so than for monoalphabetic. Still not entirely sure how the log would impact it.
Am thinking about the ranked monogramFitness since for monoalpha it would be high, but maybe more random for other ciphers. Currently monogramFitness hovers around 0.01-0.02

They're all so tiny...

Let's get more data and implement the random forest.
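
Before switching models, here is a small sketch of the feature importance report mentioned earlier, run on made-up data where one column is informative and the other is pure noise. `toy_X`, `toy_y` and the `noise` column are invented for illustration; only `feature_importances_` is the actual scikit-learn attribute.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "IOC": np.concatenate(
        [rng.normal(0.067, 0.002, 30), rng.normal(0.043, 0.003, 30)]),
    "noise": rng.normal(0.0, 1.0, 60),
})
toy_y = ["monoalphabetic substitution"] * 30 + ["Vigenere"] * 30

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(toy_X, toy_y)

# Importances sum to 1; larger means the trees relied on that column more
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

Unlike the SVC coefficients, these values do not depend on the scale of each feature, so they can be compared across columns directly.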

# Random forest 

In [104]:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

def rf_train_predict(X, y, print_predictions=False, drop_diff=False, drop_logmono=False):
    # NOTE: DataFrame.drop returns a new frame, so without reassignment these
    # calls leave X unchanged and the model still trains on every column.
    if drop_diff:
        X.drop(columns='differenceIOC')
    else:
        X.drop(columns='IOC')
    if drop_logmono:
        X.drop(columns='logMonogramFitness')

    train_x, test_x, train_y, test_y = model_selection.train_test_split(
        X, y, train_size=0.8)

    # NOTE: fitting on all of X means the test rows are seen during training,
    # so the test accuracy printed below is optimistic.
    rf.fit(X, y)
    if print_predictions:
        for predicted, real in zip(rf.predict(X), y):
            print("Actual:"+real, "Predicted:"+predicted)

    print(f"Training accuracy: {rf.score(train_x,train_y)}")

    if print_predictions:
        # Predictions come first in the zip, so unpack them in that order
        for predicted, real in zip(rf.predict(test_x), test_y):
            print("Actual:"+real, "Predicted:"+predicted)
    print(f"Test accuracy: {rf.score(test_x,test_y)}")

rf_train_predict(X, y, True, True)

Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:Hill cipher Predicted:Hill cipher
Actual:four-square cipher Predicted:four-square cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:permutation cipher Predicted:permutation cipher
Actual:monoalphabetic substitution cipher Predicted:monoalphabetic substitution cipher
Actual:Vigenere cipher Predicted:Vigenere cipher
Actual:permutatio

This is INCREDIBLY accurate compared to SVC! Perhaps this is because the classes can inherently be represented as a decision tree, much as a human would process each data point. Let's scale up the data.
To summarise what we've done to the data:
- extract ciphertext and cipher type
- calculate statistics
- further process statistics (but this might be unnecessary)
- group monoalphabetic ciphers with different names, and potentially merge variations on a cipher.

Upon expanding the data set, I found that 2021 has a format of "cipher: Caesar shift", which means it's best to clean up row by row, looping through keywords instead. This change is reflected in extractData_clean.py.

For now we'll have to have things like permutation, stream etc. all lumped together until I find an efficient way of extracting the essence of the cipher type. Perhaps manually. Or borrow an LLM.
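
The keyword loop could look roughly like this. It is a sketch of the idea only and not the contents of extractData_clean.py; the keyword table and the function name are assumptions, and anything naming zero or several ciphers is returned as `None` so that it can be skipped for now.

```python
# Hypothetical keyword table: lower-case keyword -> label used for training
CIPHER_KEYWORDS = {
    "caesar": "monoalphabetic substitution",
    "affine": "monoalphabetic substitution",
    "atbash": "monoalphabetic substitution",
    "monoalphabetic": "monoalphabetic substitution",
    "vigenere": "Vigenere",
    "hill": "Hill",
    "four-square": "four-square",
    "playfair": "Playfair",
    "columnar transposition": "columnar transposition",
    "permutation": "permutation",
    "railfence": "railfence",
    "morse": "Morse code",
}

def extract_cipher_type(solution_text):
    """Return the single cipher label named in the text, else None."""
    text = solution_text.strip().lower()
    labels = []
    for keyword, label in CIPHER_KEYWORDS.items():
        if keyword in text and label not in labels:
            labels.append(label)
    # No match, or a combination such as Morse + Vigenere: skip for now
    return labels[0] if len(labels) == 1 else None

print(extract_cipher_type("cipher: Caesar shift"))
print(extract_cipher_type("Morse code, Vigenere cipher"))
```

Plain substring matching is crude (for example "hill" would also match inside an unrelated word), so a long free-text solution may still need a manual check.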

In [105]:
cipher_data = pd.read_csv("2021-23_ciphertext_data_stats.csv")
print(cipher_data['CipherType'])
# at this point it would be convenient to have data reading as a function.


0                           monoalphabetic substitution
1                           monoalphabetic substitution
2                                columnar transposition
3                                     modified Playfair
4                           monoalphabetic substitution
5                           monoalphabetic substitution
6                           monoalphabetic substitution
7                           monoalphabetic substitution
8                           monoalphabetic substitution
9                           monoalphabetic substitution
10                          monoalphabetic substitution
11    monoalphabetic substitution (from Morse code w...
12                                            railfence
13    Vigenere (from Morse code with A = dot and D =...
14                               columnar transposition
15                             Vigenere (reversed text)
16                                             Vigenere
17                               columnar transp

In [106]:
filename = "2021-23_ciphertext_data_stats.csv"

def get_X_y(filename):
    cipher_stats = pd.read_csv(filename)

    y = cipher_stats.pop('CipherType')
    X = cipher_stats.drop(['Ciphertext', 'lengthStats'], axis=1) # remove columns (axis 1)
    print(y.shape)
    return X, y

rf_train_predict(*get_X_y(filename))


(57,)
Training accuracy: 1.0
Test accuracy: 1.0


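The astonishing accuracy makes me suspect that there might be overfitting: there are only 57 examples. It looks like it is performing well on test data, but rf_train_predict fits on all of X before scoring, so the test rows were already seen during training and the test accuracy is optimistic. If fixing is needed, "k-fold validation" is something to look up later.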
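
For reference, k-fold cross-validation in scikit-learn could be sketched like this. The data below is synthetic (`toy_X` and `toy_y` are invented), so the scores say nothing about the real cipher data; the point is that each fold is scored by a model that never saw it during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "IOC": np.concatenate(
        [rng.normal(0.067, 0.002, 30), rng.normal(0.043, 0.003, 30)]),
    "monogramFitness": rng.uniform(0.5, 1.0, 60),
})
toy_y = np.array(["monoalphabetic substitution"] * 30 + ["Vigenere"] * 30)

# Stratified folds keep the class proportions similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), toy_X, toy_y, cv=folds)
print(scores, scores.mean())
```

On the real data, classes with only one or two examples would probably need merging or fewer folds before stratified splitting makes sense.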

In [107]:
def test_predict():
    with open("test_ciphertext.txt", 'r') as f:
        ciphertext = f.read()
        c = pd.DataFrame([getStatistics(ciphertext)])
        print(c)
        c.drop(["lengthStats"], axis=1, inplace=True)
        label = rf.predict(c)
        print(c, label)


On second thought, it will only be some more complex training examples lost if our cipher type extraction technique fails on older data. It's time to give the whole set a try.

In [110]:
def rf_predict(filename):
    X, y = get_X_y(filename)
    # label = rf.predict(cipher_stats)
    for predicted, real in zip(rf.predict(X), y):
        print("Actual:"+real, "Predicted:"+predicted)
        print(predicted == real)
    print(rf.score(X, y))

rf_predict("2019_ciphertext_data_stats.csv")

(17,)
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution
True
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution 
False
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution
True
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution
True
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution 
False
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution
True
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution 
False
Actual:columnar transposition  Predicted:columnar transposition
False
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution
True
Actual:first, reverse the text Predicted:Vigenere 
False
Actual:Vigenere  Predicted:monoalphabetic substitution 
False
Actual:monoalphabetic substitution Predicted:monoalphabetic substitution
True
Actual:First, reverse the text; then it is a Predicted:monoalphabetic substitution
False
Actual:F

Here comes another problem: my non-centrally managed data cleansing has failed me.
1. FIXED with strip: any stray whitespace meant a label was being considered another class.
2. Any cipher name that was slightly contrived or hidden in a sentence was not in the set, so we ended up with all mono.
3. Permutation cipher (block transposition): the names for this type seem to vary more.
4. Ciphers with weird symbols, e.g. the 2020 7B cipher clock. I think things like # and + are used: are they standard? Treated as letters? The IOC would need to apply to them as well, or else the results are skewed.
5. ADFGVX: understandably, the model had never seen it.
6. Some errors in Vigenere?
7. 2005 was pretty much all Morse-encoded: a decoder would be nice, but perhaps we don't absolutely need that year's data?
8. Sometimes the archive is funky and gives PDFs or images instead of .txt for the ciphertext and solution.
9. Still problems with combinations, e.g. 2019 6B.
10. "firstly, reverse the text, then \n vigenere" and "First, do a matrix transposition...vigenere" were predicted as Vigenere but marked wrong. These are sort of transposition + Vigenere. From what I've seen so far, the transposition doesn't stop it being identified as Vigenere. Just a note: if it says it's Vigenere but the decrypt looks like rubbish, predict again.
Two ways to fix this combo problem: get a list of all the ciphers we've worked on and search the whole solution.txt for them, or include the second line for the mono we can actually detect right now.
11. Idea: if the length is prime, the factors-of-length list has length 2, meaning it cannot be Hill. But I think we're doing well enough.
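
Idea 11 could be turned into features along these lines. This is a sketch that assumes lengthStats always lists every factor including 1 and the length itself, and that a Hill ciphertext uses blocks of 2 or 3 letters with no odd padding. The column names are made up, and the CSV stores the list as a string, hence the `ast.literal_eval`.

```python
import ast
import pandas as pd

def length_features(length_stats):
    """Turn a factor list (or its CSV string form) into numeric flags."""
    if isinstance(length_stats, str):
        factors = ast.literal_eval(length_stats)
    else:
        factors = list(length_stats)
    return {
        "lengthIsPrime": int(len(factors) == 2),  # only 1 and the length itself
        "lengthDivisibleBy2": int(2 in factors),
        "lengthDivisibleBy3": int(3 in factors),
    }

rows = pd.Series([
    "[1, 2, 3, 6, 373, 746, 1119, 2238]",
    "[1, 2, 5, 10, 193, 386, 965, 1930]",
])
print(pd.DataFrame(rows.map(length_features).tolist()))
```

Flags like these would let lengthStats stay in X instead of being dropped, though with so few Hill examples the model could easily latch onto them too strongly.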



In [None]:
test_predict() # just a test for 2019 6B - gave vigenere but should be block transposition and monoalphabetic substitution. 

Factors of length:  [1, 2, 3, 6, 373, 746, 1119, 2238]
                          lengthStats  monogramFitness       IOC  bigramIOC  \
0  [1, 2, 3, 6, 373, 746, 1119, 2238]          0.78451  0.040864   0.002002   

   trigramIOC  quadgramFrequenciesScore  
0    0.000432             -53182.514438  
   monogramFitness       IOC  bigramIOC  trigramIOC  quadgramFrequenciesScore
0          0.78451  0.040864   0.002002    0.000432             -53182.514438 ['Vigenere ']


Right now I need to do this for a whole
# Process for someone who's just downloaded the code
- train the model
- convert the ciphertext to a one-row dataframe
- rf.predict(df)
- print result