<center><h1><b> Data Science Challenge @ Interdisciplinary Workshop on Machine Learning for Cryptology (ML4Crypto 2023) </b></h1></center>

<center><h4>Sruba Sarkar, Srijan Kundu</h4></center>
<center><h5>October '23</h5></center>
<hr>

A large set of plaintexts is generated as bitstreams, and these plaintexts are encrypted into ciphertexts, also represented as bitstreams. The dataset has three columns. The first column is the ID of the ciphertext. The second column contains either a ciphertext produced by one particular cipher or a random bitstream. Each bitstream (ciphertext or random) is 64 bits long. The third column holds the label (0 or 1), which denotes the bitstream type.

The dataset is split into two parts: training and test. The training dataset comprises 250,000 bitstreams and includes the labels. The test dataset comprises 62,500 bitstreams without labels, which are to be predicted.
The performance metric for the two-class classification on the test dataset will be the accuracy score.
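
As a minimal sketch of this metric, accuracy is the fraction of predicted labels that match the true labels. The helper name `accuracy` and the toy labels below are illustrative assumptions, not part of the challenge code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Toy labels (hypothetical): 3 of the 4 predictions are correct.
toy_true = [0, 1, 1, 0]
toy_pred = [0, 1, 0, 0]
print(accuracy(toy_true, toy_pred))
```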

In [None]:
# To set up and install "RAPIDS AI" libraries
!nvidia-smi
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git > /dev/null 2>&1
print("\nRepository Cloned Successfully;")
!python rapidsai-csp-utils/colab/pip-install.py > /dev/null 2>&1
print("\n\"RAPIDS AI\" libraries Installed Successfully;")

Mon Oct 30 18:10:27 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# To import the required libraries
import os
import time
import warnings; warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras import Input
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

In [None]:
# To import the "Rapids.AI" functions for 'GPU Acceleration'
import cudf; import cuml; import cupy
from cuml.ensemble import RandomForestClassifier as cu_RFC
from cuml.svm import SVC as cu_SVC
from cuml.model_selection import GridSearchCV as cu_GridSearchCV

In [None]:
# To mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# To load the data-set
data = pd.read_csv('/content/drive/MyDrive/test/TrainingData.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   CID        250000 non-null  int64 
 1   Bitstream  250000 non-null  object
 2   class      250000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 5.7+ MB


In [None]:
data.head()

Unnamed: 0,CID,Bitstream,class
0,293436,1110101100001100011010101011111111101000110111...,0
1,233249,1100111110000110111110100011111101111100110100...,0
2,32011,1010001000000111000010000001100100011010011110...,0
3,113131,0111001010111101010001011110010001011000000010...,1
4,297144,0100010101101101111000100111101100111110001110...,1


<center><h2><b>Data Pre-Processing</b></h2></center>

<h2><b>Process:</b></h2>

Each ciphertext is 64 bits long, so we first split it into 8 chunks of 8 bits (1 byte) each.

We then form 8 extractions from each ciphertext:

- 'Extraction 1' collects the first bit of every byte, 'Extraction 2' collects the second bit of every byte, and so on up to 'Extraction 8'.
- We repeat this step for every entry in the dataset.

For the frequency analysis, we count the number of `1`s in each 8-bit `Extraction i` string of every entry and use these eight counts as the final features.
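
The three steps above (byte split, bit extraction, counting ones) can be sketched in plain Python for a single bitstream. This is only an illustration: the helper names `split_bytes`, `extract` and `ones_counts` are hypothetical, and the input is the first training row reassembled from the byte table printed in this notebook.

```python
def split_bytes(bitstream):
    """Split a 64-character bit string into eight 8-character chunks."""
    return [bitstream[i:i + 8] for i in range(0, len(bitstream), 8)]

def extract(chunks):
    """Extraction j collects bit j of every chunk (a transpose of the 8x8 bit grid)."""
    return ["".join(chunk[j] for chunk in chunks) for j in range(8)]

def ones_counts(extractions):
    """Frequency features: the number of '1' characters in each extraction."""
    return [e.count("1") for e in extractions]

# First training row, written as its eight bytes (adjacent literals are concatenated).
row0 = ("11101011" "00001100" "01101010" "10111111"
        "11101000" "11011110" "10010110" "00011100")
chunks = split_bytes(row0)
extractions = extract(chunks)
features = ones_counts(extractions)
print(extractions)
print(features)
```

Since every bit lands in exactly one extraction, the eight counts always sum to the total number of `1`s in the bitstream.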

In [None]:
data_processed = data['Bitstream'].apply(lambda x: pd.Series(list(x)))

In [None]:
data_processed.columns = [f'Bitstream{i+1}' for i in range(64)]
data_processed['class'] = data['class']

In [None]:
# To split the string into chunks of 8 characters and convert to that to a DataFrame
expanded_data = pd.DataFrame((data['Bitstream'].apply(lambda x: [x[i:i+8] for i in range(0, len(x), 8)])).tolist())
expanded_data.columns = [f'Byte{i+1}' for i in range(8)]

In [None]:
expanded_data

Unnamed: 0,Byte1,Byte2,Byte3,Byte4,Byte5,Byte6,Byte7,Byte8
0,11101011,00001100,01101010,10111111,11101000,11011110,10010110,00011100
1,11001111,10000110,11111010,00111111,01111100,11010001,11100011,01000110
2,10100010,00000111,00001000,00011001,00011010,01111011,01000001,11010101
3,01110010,10111101,01000101,11100100,01011000,00001011,00101100,11011111
4,01000101,01101101,11100010,01111011,00111110,00111000,11100010,00110100
...,...,...,...,...,...,...,...,...
249995,00011101,00010000,11101000,11001100,00110110,01110100,11110000,00000110
249996,00010101,00111001,11110000,11101110,11010000,10000001,01111111,01001010
249997,10010111,10111001,01001110,00010010,11101100,11000011,11000001,01101010
249998,11101010,01011101,10010101,11011101,01001000,00110001,01010101,00000100


In [None]:
expanded_data.iloc[1,:]

Byte1    11001111
Byte2    10000110
Byte3    11111010
Byte4    00111111
Byte5    01111100
Byte6    11010001
Byte7    11100011
Byte8    01000110
Name: 1, dtype: object

In [None]:
extracted_data = pd.DataFrame(columns=[f'Extraction{i+1}' for i in range(8)])
for j in range(8):
    extracted_data[f'Extraction{j+1}'] = expanded_data.applymap(lambda x: x[j] if len(x) > j else "").agg(''.join, axis=1)

In [None]:
extracted_data

Unnamed: 0,Extraction1,Extraction2,Extraction3,Extraction4,Extraction5,Extraction6,Extraction7,Extraction8
0,10011110,10101100,10111000,00010111,11111101,01010111,10110110,10010000
1,11100110,10101111,00111010,00111100,10111000,11011001,11110011,10010110
2,10000001,00000111,10000100,00011101,00111100,01000001,11001100,01010111
3,01010001,10111001,11010010,11001001,01001111,01110011,10000101,01100101
4,00100010,11110010,01111111,00011101,01011100,11001001,00111010,11010000
...,...,...,...,...,...,...,...,...
249995,00110010,00110110,00101110,11001110,10110000,10011101,00001001,10000000
249996,00111100,00111011,01110010,11101010,01010011,10010010,00010011,11000110
249997,11001110,00101111,01001001,11010000,01101001,10101000,10110101,11000110
249998,10110000,11011010,10000100,01110110,11011000,01110011,10000000,01110110


In [None]:
## Frequency Analysis

def count_ones(string):
  return string.count('1')

extracted_data_freq = extracted_data.applymap(count_ones)
extracted_data_freq.columns = [f'Extraction{i+1}_1_Count' for i in range(8)]

In [None]:
classes = data['class']

<center><h2><b>Implementing Machine Learning Models</b></h2></center>

<h2><b>Training Set Size: 173250</b></h2>

<h2><b>Validation Set Size: 74250</b></h2>

<h2><b>Test Set Size: 2500</b></h2>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(extracted_data_freq, classes, test_size = 0.01, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.3, random_state=42)
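
The sizes produced by these two splits can be checked with simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the training part; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 250_000
n_rest, n_test = split_sizes(n_total, 0.01)  # first split: hold out 1% as the test set
n_train, n_val = split_sizes(n_rest, 0.3)    # second split: 30% of the rest for validation
print(n_train, n_val, n_test)
```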

In [None]:
X_train_cu = cudf.DataFrame(X_train)
X_test_cu = cudf.DataFrame(X_test)
X_val_cu = cudf.DataFrame(X_val)

y_train_cu = cudf.Series(y_train)
y_val_cu = cudf.Series(y_val)
y_test_cu = cudf.Series(y_test)

# Encoding and 'float32' conversion
X_train_cu_enc = cudf.get_dummies(X_train_cu).astype('float32')
X_val_cu_enc = cudf.get_dummies(X_val_cu).astype('float32')


<h2><b>
1.   Implementing Random Forest </b></h2>



In [None]:
random_forest_model = cu_RFC()

rf=random_forest_model.fit(cudf.DataFrame.to_cupy(X_train_cu_enc), cupy.asnumpy(y_train_cu))

pred = rf.predict(X_val_cu)

random_forest_accuracy = rf.score(X_val_cu, y_val_cu)
print("Random Forest Accuracy:", random_forest_accuracy)

Random Forest Accuracy: 0.5013198852539062


In [None]:
np.mean(pred.to_numpy() == y_val_cu.to_numpy())

0.5013198653198653

<h3><b>Hyper-parameter Tuning</b></h3>

In [None]:
params_dist_rf = {
    'max_features': ['sqrt', 'log2'],
    'max_samples': np.array([0.75, 1.]).astype('float32'),
    'n_bins': [64, 128, 256],
    'min_samples_split': np.array([4, 5]).astype('float32')
}

grid_search_rf = cu_GridSearchCV(rf, params_dist_rf, cv=5, scoring='accuracy')
tick = time.perf_counter()
grid_search_rf.fit(cudf.DataFrame.to_cupy(X_train_cu_enc), cupy.asnumpy(y_train_cu))
tock = time.perf_counter()

In [None]:
print(f"Finished model training in {round(tock - tick, 2)} second(s).")

best_params_rf = grid_search_rf.best_params_
print(best_params_rf)

Finished model training in 132.82 second(s).
{'max_features': 'sqrt', 'max_samples': 0.75, 'min_samples_split': 4.0, 'n_bins': 64}
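
To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below uses the same candidate values as `params_dist_rf` (written as plain Python numbers) and counts the cross-validation fits; `expand_grid` is a hypothetical helper, and the count leaves out any final refit on the full training data.

```python
from itertools import product

# Same candidate values as the Random Forest grid in this notebook.
param_grid = {
    "max_features": ["sqrt", "log2"],
    "max_samples": [0.75, 1.0],
    "n_bins": [64, 128, 256],
    "min_samples_split": [4, 5],
}

def expand_grid(grid):
    """Return every parameter combination as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

combinations = expand_grid(param_grid)
cv_folds = 5
n_fits = len(combinations) * cv_folds  # one fit per combination per fold
print(len(combinations), n_fits)
```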


In [None]:
y_pred_rf = grid_search_rf.predict(X_val_cu)
# Accuracy is measured on the validation split (X_val_cu), not on the held-out test split
val_accuracy_rf = (y_pred_rf.to_numpy() == y_val_cu.to_numpy()).mean()
print("Validation Set Accuracy for Random Forest Classifier:", val_accuracy_rf)

Validation Set Accuracy for Random Forest Classifier: 0.5015757575757576



<h2><b>
2.   Implementing Support Vector Machine </b></h2>

In [None]:
svc = cu_SVC()

svc_mod= svc.fit(cudf.DataFrame.to_cupy(X_train_cu_enc), cupy.asnumpy(y_train_cu))

pred_svc = svc_mod.predict(X_val_cu)

svc_accuracy = svc_mod.score(X_val_cu, y_val_cu)
print("Support vector Machine Accuracy:", svc_accuracy)

Support vector Machine Accuracy: 0.5018181800842285


<h3><b>Hyper-parameter Tuning</b></h3>

In [None]:
params_dist_svm = {
    'kernel': ['poly', 'rbf', 'sigmoid']
}

grid_search_svm = cu_GridSearchCV(svc, params_dist_svm, cv=5, scoring='accuracy')
tick = time.perf_counter()
grid_search_svm.fit(cudf.DataFrame.to_cupy(X_train_cu_enc), cupy.asnumpy(y_train_cu))
tock = time.perf_counter()
print(f"Finished model training in {round(tock - tick, 2)} second(s).")

best_params_svm = grid_search_svm.best_params_
print(best_params_svm)

grid_search_svm.best_estimator_.get_params()

y_pred_svm = grid_search_svm.predict(X_val_cu)
# Accuracy is measured on the validation split (X_val_cu), not on the held-out test split
val_accuracy_svm = (y_pred_svm.to_numpy() == y_val_cu.to_numpy()).mean()
print("Validation Set Accuracy for Support Vector Classifier:", val_accuracy_svm)

Finished model training in 252.58 second(s).
{'kernel': 'poly'}
Validation Set Accuracy for Support Vector Classifier: 0.5026531986531987
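
For reference, the three kernels compared in the grid search can be written out in their standard textbook form. This is a plain-Python sketch, not the cuML implementation; the `gamma`, `coef0` and `degree` values and the two feature rows are assumptions chosen for illustration.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <x, y> + coef0)."""
    return math.tanh(gamma * dot(x, y) + coef0)

# Two hypothetical feature rows (counts of ones per extraction).
u = [5, 4, 4, 4, 7, 5, 5, 2]
v = [4, 4, 4, 4, 4, 4, 4, 4]
print(poly_kernel(u, v, gamma=0.01))
print(rbf_kernel(u, v, gamma=0.01))
print(sigmoid_kernel(u, v, gamma=0.01))
```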



<h2><b>
3.   Implementing Neural Networks </b></h2>

In [None]:
tf.random.set_seed(25)
model = Sequential(
    [
        tf.keras.Input(shape = (8,)),
        Dense(32, activation = "relu", name = "L1"),
        Dropout(0.25),
        Dense(16, activation = "relu", name = "L2"),
        Dropout(0.15),
        Dense(10, activation = "relu", name = "L3"),
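        Dropout(0.1, name = "Sigmoid_Output"),  # despite its name, this is a Dropout layer
        Dense(1),  # output layer; emits a raw logit (no sigmoid activation)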
    ], name = "ANN_Model"
)

model.summary()

Model: "ANN_Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 L1 (Dense)                  (None, 32)                288       
                                                                 
 dropout_4 (Dropout)         (None, 32)                0         
                                                                 
 L2 (Dense)                  (None, 16)                528       
                                                                 
 dropout_5 (Dropout)         (None, 16)                0         
                                                                 
 L3 (Dense)                  (None, 10)                170       
                                                                 
 Sigmoid_Output (Dropout)    (None, 10)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 11
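
The parameter counts in this summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no parameters. A small sketch, where `dense_params` is a hypothetical helper:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths from the model definition: 8 inputs -> 32 -> 16 -> 10 -> 1.
widths = [8, 32, 16, 10, 1]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)
print(per_layer, total)
```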

In [None]:
model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
    metrics = ['accuracy']
)
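
With `from_logits=True`, the loss treats the output of the final `Dense(1)` layer as a raw logit and applies the sigmoid internally. The sketch below shows, in plain Python, a numerically stable way to compute binary cross-entropy from a logit and checks it against the textbook formula on probabilities. The function names and the sample logit are assumptions for illustration; this is not Keras code.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(y, z):
    """Numerically stable binary cross-entropy for label y in {0, 1} and logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_from_prob(y, p):
    """Textbook binary cross-entropy for label y and probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z = 0.8  # hypothetical logit
print(bce_from_logit(1, z), bce_from_prob(1, sigmoid(z)))
print(bce_from_logit(0, z), bce_from_prob(0, sigmoid(z)))
```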

In [None]:
history = model.fit(
    X_train, y_train,
    batch_size = 250,
    epochs = 25,
    verbose = 1,
    validation_data = (X_val, y_val),
)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [None]:
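# The model was compiled with from_logits=True, so predict() returns raw logits;
# astype(int) truncates them toward zero rather than thresholding them at 0.
y_pred = model.predict(X_test).astype(int).flatten()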



In [None]:
np.mean(np.array(y_test.tolist()) == y_pred)

0.5032
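
Because the network outputs logits, the way they are turned into class labels matters. A common choice is to predict class 1 when the logit is positive (equivalently, when the sigmoid exceeds 0.5), whereas an integer cast simply drops the fractional part. The sketch below contrasts the two on hypothetical logits; the function names are illustrative and `math.trunc` stands in for the integer cast.

```python
import math

def labels_by_truncation(logits):
    """Mimic an integer cast: drop the fractional part, rounding toward zero."""
    return [math.trunc(z) for z in logits]

def labels_by_threshold(logits, threshold=0.0):
    """Predict class 1 when the logit exceeds the threshold (sigmoid(z) > 0.5 at threshold 0)."""
    return [1 if z > threshold else 0 for z in logits]

logits = [-1.7, -0.4, 0.3, 0.9, 1.2]  # hypothetical model outputs
truncated = labels_by_truncation(logits)
thresholded = labels_by_threshold(logits)
print(truncated)
print(thresholded)
```

With truncation, logits between -1 and 1 all map to 0 and logits at or below -1 map to negative values that match neither label, so the two rules can disagree on many rows.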

<center><h2><b>Results and Analysis</b></h2></center>

<h5><b>-- Random Forest reached 50.132% validation accuracy, which rose slightly to 50.158% after hyperparameter tuning;</b></h5>

<h5><b>-- Support Vector Machine reached 50.182% validation accuracy, which rose slightly to 50.265% after hyperparameter tuning;</b></h5>

<h5><b>-- The ANN model reached 50.32% accuracy on the 2,500-sample held-out test split, the highest of the three algorithms implemented here;</b></h5>


Note: All three accuracies are close to 50%, so improved data pre-processing is required before these Machine Learning models can achieve a meaningfully better accuracy.
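
One way to judge how far these accuracies are from guessing is to express each as a number of standard errors above chance. The sketch below assumes the two classes are balanced, so that chance level is 50% (the notebook does not state the class balance), and uses the evaluation-set sizes implied by the splits above. The helper `z_score_vs_chance` is hypothetical.

```python
import math

def z_score_vs_chance(accuracy, n_samples, p0=0.5):
    """Number of standard errors between an observed accuracy and chance level p0."""
    standard_error = math.sqrt(p0 * (1.0 - p0) / n_samples)
    return (accuracy - p0) / standard_error

# (accuracy, evaluation-set size) as reported in this notebook.
reported = {
    "Random Forest (tuned)": (0.5015757575757576, 74_250),
    "SVM (tuned)": (0.5026531986531987, 74_250),
    "ANN": (0.5032, 2_500),
}
scores = {name: z_score_vs_chance(acc, n) for name, (acc, n) in reported.items()}
for name, z in scores.items():
    print(f"{name}: z = {z:.2f}")
```

Under that assumption, a z-score below roughly 1.96 means the result is not distinguishable from chance at the usual 5% level, which gives a concrete target for any improved pre-processing.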