# Descriptive summary: Binary experiment (Brute Force)

In this notebook we build a cGAN model that generates Brute Force and Benign observations. This includes:
                            

1.   Building a discriminator 
2.   Building a generator 
3.   Combining the generator and discriminator model to update the generator
4.   Running an initial quality test of the generated data

Then we generate data with our best cGAN model and use it to train a Random Forest model. We then compare its performance with that of a Random Forest trained on the real dataset.



# Import relevant libraries 

In [None]:
# connect to google drive for storing and retrieving data and model 
from google.colab import drive

#  modules for importing and manipulating data
import pandas as pd 
import re 

# modules used throughout the gans process (mainly feeding noise and real/fake labels to generator)
import numpy as np 
from numpy.random import randint,randn
from numpy import(
    expand_dims,
    zeros,
    ones,
    asarray
)

# modules used to build GANs (mainly different types of layers, regularisation and optimisers)
import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import plot_model
from tensorflow.keras import backend
from tensorflow.keras.initializers import RandomNormal
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (
      Input,
      BatchNormalization,
      Dense,
      Reshape,
      Flatten,
      LeakyReLU,
      Dropout,
      Lambda,
      Activation,
      Embedding,
      Concatenate,
      multiply
)



# modules for data transformation, sampling and modelling 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# modules for scoring and evaluating performance
from sklearn.metrics import (
    roc_curve,
    roc_auc_score,
    confusion_matrix,
    classification_report,
    accuracy_score,
    f1_score
)

# Import data from drive

Note: 


1. Data was previously downloaded from source: https://www.unb.ca/cic/datasets/ids-2017.html

2. Pre-processing: the data was pre-processed as described in the main text. 

3. Down-sampling: the data was down-sampled to 8k observations per class because of the storage and computation constraints experienced when working with Colab.

In [None]:
# mount drive 
drive.mount('/content/drive')

# import data from drive 
data_extension = '/content/drive/MyDrive/RESEARCH-PROJECT/Binary_experiments/Datasets/2017bruteForce.parquet'
df1 = pd.read_parquet(data_extension, engine='auto')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Split data into training and testing samples

In [None]:
# train, test split (75/25)
X_train, X_test, y_train, y_test = train_test_split(
    df1.drop(["attack_map","Label"], axis=1), df1['attack_map'], test_size=0.25, random_state=42,stratify=df1['attack_map']
)

# Standardise features


Standardisation rescales every feature to zero mean and unit variance. The values are therefore not confined to the 0-1 range, but all features end up on a comparable scale.

In doing so, it:


1. speeds up convergence 
2. creates a more stable training process 
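To make the effect of standardisation concrete, here is a minimal NumPy sketch of the computation that `StandardScaler` performs: the per-column mean and population standard deviation are taken from the training split only and then applied to both splits. The arrays `train` and `test` are hypothetical stand-ins for the real feature matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrices standing in for X_train and X_test
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(250, 3))

# "Fit" on the training data only: per-column mean and population standard deviation
mean = train.mean(axis=0)
std = train.std(axis=0)

# "Transform" both splits with the training statistics
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(3))       # close to 0 for every column
print(train_scaled.std(axis=0).round(3))        # close to 1 for every column
print(train_scaled.min(), train_scaled.max())   # values fall outside the 0-1 range
```

The test split is close to, but not exactly, zero mean and unit variance, because it reuses the statistics of the training split.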


In [None]:
# Standardise all features (zero mean, unit variance) so that no feature dominates because of its scale
scaler = StandardScaler()

# Make sure to fit the scaler on the training data only
X_train[X_train.columns] = scaler.fit_transform(X_train[X_train.columns])
X_test[X_train.columns] = scaler.transform(X_test[X_train.columns])

# Build discriminator
1.   We have two input layers, one for the features and one for the labels.

2. An embedding layer is used to encode the label input into a 67-dimensional vector, i.e. one dimension per feature.

3.  We then use the multiply layer to condition features on the label embedding. This is referred to as model input.

4.  This "model input" is fed through 3 hidden layers.

5.   We also use dropout after the second and third layer, to discourage overfitting.

6.  Finally, we have one output layer that distinguishes between real and generated observations.
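Steps 2 and 3 can be sketched in plain NumPy: an embedding is a lookup table with one row per class, and conditioning is an element-wise product between the features and the looked-up row. The random table and feature batch below are hypothetical; in the real model the table is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, num_classes, batch = 67, 2, 4

# Stand-in for the Embedding layer's weights: one 67-dimensional row per class
embedding_table = rng.normal(size=(num_classes, n_features))

features = rng.normal(size=(batch, n_features))  # hypothetical feature batch
labels = np.array([0, 1, 1, 0])                  # class label per observation

# Embedding lookup followed by element-wise multiplication
label_embedding = embedding_table[labels]        # shape (4, 67)
model_input = features * label_embedding         # shape (4, 67)

print(model_input.shape)
print(embedding_table.size)
```

The table holds 2 x 67 = 134 weights, which matches the parameter count reported for the embedding layer in the model summary further down.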





In [None]:
# define the supervised discriminator model
def define_discriminator(out_shape=67, num_classes=2):
    # Initialiser that generates tensors with a normal distribution
    init = RandomNormal(mean=0.0, stddev=0.02)
     
    # label input 
    label = Input(shape=(1,), dtype='int64', name="First_input_layer")

    # embed and flatten labels: we then get vector of shape 1 x output shape 
    label_embedding = Flatten()(Embedding(num_classes, out_shape)(label))
    
    # feed features through the second input layer 
    gen_sample = Input(shape=(out_shape,),name="Second_input_layer")
    
    # condition the discrimination of generated features
    model_input = multiply([gen_sample, label_embedding])
    
    # our first hidden layer and leaky relu activation
    fe = Dense(units=512, kernel_initializer=init, name="First_hidden_layer")(model_input)
    fe = LeakyReLU(alpha=0.2)(fe)
    
    # our second hidden layer and leaky relu activation
    fe = Dense(256, kernel_initializer=init, name="Second_hidden_layer")(fe)
    fe = LeakyReLU(alpha=0.2)(fe)

    # apply dropout
    fe = Dropout(0.4)(fe)

    # our third hidden layer and leaky relu activation
    fe = Dense(68, kernel_initializer=init, name="Third_hidden_layer")(fe)
    fe = LeakyReLU(alpha=0.2)(fe)
    
    # apply dropout
    fe = Dropout(0.4)(fe)
    
    # our output layer, with a single node (outputting the probability of an observation being real)
    fe = Dense(1,name="Output_layer")(fe)

    # sigmoid activation function 
    out_layer= Activation('sigmoid')(fe)
   
    # model layer groups layers into an object with training and inference features.
    model = Model([gen_sample, label], out_layer)
    
    # compile model 
    model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.02, beta_1=0.5),metrics=['accuracy'])

    # save model architecture plot 
    dot_img_file = '/content/drive/MyDrive/RESEARCH-PROJECT/Binary_experiments/Models/hey/Binary_discriminator.png'
    tf.keras.utils.plot_model(model, to_file=dot_img_file, show_shapes=True)
  
    return model
 

# Build generator

1.   We have two input layers, one for the noise z and one for the labels.

2. An embedding layer is used to encode the label input into a 100-dimensional vector, i.e. one per latent dimension.

3.  We then use the multiply layer to condition the noise on "label_embedding". This is referred to as model input.

4.  This "model input" is fed through 3 hidden layers.

5.   We also use batch normalisation after the first, second and third layer, mainly to stabilise training; it also has a mild regularising effect.

6.  Finally, we have one output layer with 67 nodes, each outputting one feature.

7.  Note: we intentionally do not compile the generator model as it is not trained directly.
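As a rough illustration of the batch normalisation in step 5, the sketch below implements one training-mode step in NumPy. It assumes a scale of 1 and an offset of 0, and the function name, the `eps` value and the random activations are chosen only for illustration. The `momentum` argument controls how quickly the moving statistics, which are used at prediction time, follow the batch statistics.

```python
import numpy as np

def batch_norm_train_step(x, moving_mean, moving_var, momentum=0.8, eps=1e-3):
    """One training-mode batch-normalisation step (scale=1, offset=0 assumed)."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    # normalise each unit with the statistics of the current batch
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
    # exponential moving averages, used instead of batch statistics at inference
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return x_hat, new_mean, new_var

rng = np.random.default_rng(0)
activations = rng.normal(loc=3.0, scale=2.0, size=(500, 68))  # hypothetical layer output
x_hat, moving_mean, moving_var = batch_norm_train_step(
    activations, np.zeros(68), np.ones(68), momentum=0.8
)
print(x_hat.mean(axis=0)[:3].round(3))   # close to 0
print(moving_mean[:3].round(3))          # 20% of the way towards the batch mean
```

A lower momentum, such as the 0.5 used after the third hidden layer, makes the moving statistics track recent batches more closely.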





In [None]:
def define_generator(latent_dim=100, out_shape=67,num_classes=2):
    # Initialiser that generates tensors with a normal distribution
    init = RandomNormal(mean=0.0, stddev=0.02)
    
    # label input layer 
    label = Input(shape=(1,), dtype='int64', name="Label_input_layer")

    # convert label into 100 dimensional vector
    label_embedding = Flatten()(Embedding(num_classes, latent_dim)(label))
    
    # Noise z input layer
    noise = Input(shape=(latent_dim,),name="Noise_input_layer")

    # We condition the generation of features
    model_input = multiply([noise, label_embedding])
   
    # our first hidden layer and leaky relu activation
    fe = Dense(68, kernel_initializer=init, name="First_hidden_layer")(model_input)
    fe = LeakyReLU(alpha=0.2)(fe)
    
    # apply batch normalisation
    fe = BatchNormalization(momentum=0.8)(fe)

    # our second hidden layer and leaky relu activation
    fe = Dense(256, kernel_initializer=init, name="Second_hidden_layer")(fe)
    fe = LeakyReLU(alpha=0.2)(fe)

    # apply batch normalisation
    fe = BatchNormalization(momentum=0.8)(fe)

    # our third hidden layer and leaky relu activation
    fe = Dense(units=512, kernel_initializer=init, name="Third_hidden_layer")(fe)
    fe = LeakyReLU(alpha=0.2)(fe)

    # apply batch normalisation
    fe = BatchNormalization(momentum=0.5)(fe)

    # our output layer, with 67 nodes (one node per feature)
    out_layer= Dense(out_shape, activation='tanh',name="Output_layer") (fe)

    # define the generator model
    model = Model([noise, label], out_layer) 
    return model


# Combine the generator and discriminator models to form cGAN

"define_cgan()" function combines our predefined generator and discriminator by taking them both as inputs.

In [None]:
# define the combined generator and discriminator model, for updating the generator
def define_cgan(g_model, d_model):
	# make weights in the discriminator not trainable
	d_model.trainable = False
	
	# get noise and label inputs from generator model
	gen_noise, gen_label = g_model.input
	
	# get output from the generator model
	gen_output = g_model.output
	
	# connect output and label input from generator as inputs to discriminator
	gan_output = d_model([gen_output, gen_label])
	
	# define cgan model as taking noise and labels and outputting a classification
	model = Model([gen_noise, gen_label], gan_output)
 
	# compile model
	opt = Adam(learning_rate=0.0002, beta_1=0.5)
	model.compile(loss='binary_crossentropy', optimizer=opt)

	# save model architecture plot
	plot_model(model, to_file='/content/drive/MyDrive/RESEARCH-PROJECT/Binary_experiments/Models/model_plotgan.png', show_shapes=True, show_layer_names=True)
	return model

# Load and select real data samples

Here we select a random sample of the real data and assign it the target y=1 for the discriminator, which marks every observation in the sample as real.

Note: y=1 only indicates that an observation is real; it does not say whether the observation is benign or brute force. The benign or brute force class is referred to as "labels".
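The distinction between the real/fake target and the class labels can be sketched with NumPy arrays. The helper `sample_real_batch` and the random data are hypothetical; the point is that one set of row positions is drawn and reused, so that features and class labels stay aligned, while the target `y` is 1 for every row.

```python
import numpy as np

def sample_real_batch(features, class_labels, n_samples, rng):
    # draw row positions once and reuse them for features and class labels
    idx = rng.choice(len(features), size=n_samples, replace=False)
    # target for the discriminator: 1 marks a real observation
    y_real = np.ones((n_samples, 1))
    return [features[idx], class_labels[idx]], y_real

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 67))         # hypothetical real observations
class_labels = rng.integers(0, 2, size=100)   # 0 = benign, 1 = brute force

[X, labels], y = sample_real_batch(features, class_labels, 10, rng)
print(X.shape, labels.shape, y.shape)  # (10, 67) (10,) (10, 1)
```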

In [None]:
# load the real data samples
def load_real_samples(X,y):
	print(X.shape, y.shape)
	return [X, y]

In [None]:
# select real samples
def generate_real_samples(dataset, n_samples):
    # split into observations  and labels
    features, labels = dataset
    # choose random instances
    rand = randint(0, 1000)
    # select observations and labels
    X = features.sample(n=n_samples,random_state=rand)
    labels = labels.sample(n=n_samples,random_state=rand)
    # target for the discriminator: 1 marks a real observation
    y = ones((n_samples, 1))
    return [X, labels], y

# Generate points in the latent space

"generate_latent_points()" function takes in as an argument:


1. the size of the latent space 
2. the number of points required 
3. the number of classes i.e. 0 benign and 1 brute force

Then, it returns a batch of input samples for the generator model: a noise matrix with one row per sample and a randomly drawn class label for each row.
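The shapes involved can be checked with a seeded variant of the same idea. `latent_points_seeded` is a hypothetical helper, not part of the notebook, that uses NumPy's `default_rng` so that the draw is reproducible.

```python
import numpy as np

def latent_points_seeded(latent_dim, n_samples, n_classes=2, seed=0):
    rng = np.random.default_rng(seed)
    # one row of standard-normal noise per requested sample
    z_input = rng.standard_normal((n_samples, latent_dim))
    # one random class label per sample (upper bound is exclusive)
    labels = rng.integers(0, n_classes, n_samples)
    return [z_input, labels]

z, labels = latent_points_seeded(100, 5)
print(z.shape, labels.shape)  # (5, 100) (5,)
```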




In [None]:
# generate points in latent space as input for the generator
def generate_latent_points(latent_dim, n_samples, n_classes=2):
	# generate points in the latent space
	x_input = randn(latent_dim * n_samples)
	# reshape into a batch of inputs for the network
	z_input = x_input.reshape(n_samples, latent_dim)
	# generate labels
	labels = randint(0, n_classes, n_samples)
	return [z_input, labels]

# Use points as inputs to generator 

Here, we utilise the points in the latent space as input to the generator.

In order to generate new observations we use the "generate_fake_samples()" function below. This takes as arguments:



1.   the generator model 
2.   the size of the latent space 
3.   the number of samples to generate


Then, it generates points in the latent space and uses them as input to the generator model.

The function then returns the generated observations together with their class labels, plus the target for the discriminator model: y=0, indicating that the observations are fake (generated).

In [None]:
# use the generator to generate fake examples, with labels
def generate_fake_samples(generator, latent_dim, n_samples):
	# generate points in latent space
	z_input, labels_input = generate_latent_points(latent_dim, n_samples)
	# predict outputs
	observations = generator.predict([z_input, labels_input])
	# create labels
	y = zeros((n_samples, 1))
	return [observations , labels_input], y

# Define training process

At this stage we are ready to fit the cGAN model.

The model is fit for 1000 training epochs with a batch size of 1000 samples. These sizes are selected arbitrarily.
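For orientation, the arithmetic behind these settings is shown below, using the 12000 training rows that `load_real_samples` prints further down.

```python
n_train = 12000   # rows in the training split, as printed by load_real_samples
n_batch = 1000
n_epochs = 1000

bat_per_epo = n_train // n_batch        # batches per epoch
half_batch = n_batch // 2               # real (or fake) samples per discriminator update
total_updates = bat_per_epo * n_epochs  # generator updates over a full run

print(bat_per_epo, half_batch, total_updates)  # 12 500 12000
```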

**The train() function takes in as arguments:**


1.   the defined generator, discriminator and cGAN models
2.   the dataset
3.   the size of the latent dimension
4.   the number of epochs and batch size 

The generator model is saved after each epoch, starting from the second one.








**The training process is performed as follows:**

1. the discriminator is updated using a half batch of real samples and then a half batch of fake samples. Thus, together they form one batch of weight updates

2.   the generator is then updated via the composite gan model

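3. note: when updating the discriminator, the targets are 1 for real and 0 for fake samples. When updating the generator via the composite model, the generated samples are instead given the target 1 (inverted labels), which pushes the generator towards synthesising samples that the discriminator classifies as real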
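The effect of the inverted labels can be seen from the binary cross-entropy itself. The sketch below is a plain NumPy version of the loss with hypothetical discriminator outputs: a discriminator that outputs 0.5 everywhere gives a loss of about 0.693 (the first d1 value in the training log), and with target 1 the generator's loss is large when its samples are rejected and small when they are accepted.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0), then average the per-sample losses
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

target_real = np.ones((4, 1))  # inverted labels: fakes are presented as "real"

# an undecided discriminator outputs 0.5 for every sample
print(round(binary_crossentropy(target_real, np.full((4, 1), 0.5)), 3))  # 0.693

# discriminator confidently rejects the generated samples: large generator loss
print(round(binary_crossentropy(target_real, np.full((4, 1), 0.1)), 3))  # 2.303

# discriminator is fooled by the generated samples: small generator loss
print(round(binary_crossentropy(target_real, np.full((4, 1), 0.9)), 3))  # 0.105
```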


In [None]:
# train the generator and discriminator
def train(g_model, d_model, cgan_model, dataset, latent_dim, n_epochs=1000, n_batch=1000):
	bat_per_epo = int(dataset[0].shape[0] / n_batch)
	half_batch = int(n_batch / 2)
	# manually enumerate epochs
	for i in range(n_epochs):
		# enumerate batches over the training set
		for j in range(bat_per_epo):
			# get randomly selected 'real' samples
			[X_real, labels_real], y_real = generate_real_samples(dataset, half_batch)
			# update discriminator model weights
			d_loss1, _ = d_model.train_on_batch([X_real, labels_real], y_real)
			# generate 'fake' examples
			[X_fake, labels], y_fake = generate_fake_samples(g_model, latent_dim, half_batch)
			# update discriminator model weights
			d_loss2, _ = d_model.train_on_batch([X_fake, labels], y_fake)
			# prepare points in latent space as input for the generator
			[z_input, labels_input] = generate_latent_points(latent_dim, n_batch)
			# create inverted labels for the fake samples
			y_gan = ones((n_batch, 1))
			# update the generator via the discriminator's error
			g_loss = cgan_model.train_on_batch([z_input, labels_input], y_gan)
			# summarise loss on this batch
			print('>%d, %d/%d, d1=%.3f, d2=%.3f g=%.3f' %
				(i+1, j+1, bat_per_epo, d_loss1, d_loss2, g_loss))
	 
		# save the generator at the end of every epoch after the first
		if i >= 1:
			g_model.save('/content/drive/MyDrive/RESEARCH-PROJECT/Binary_experiments/Generated models/bruteforce models/test_g_model_%04d.h5' % (i))

# Begin training process

In [None]:
# size of the latent space
latent_dim = 100
# create the discriminator
d_model = define_discriminator()
# create the generator
g_model = define_generator(latent_dim=latent_dim)
# create the cgan
cgan_model = define_cgan(g_model, d_model)
# load real observations data
dataset = load_real_samples(X_train,y_train)
# train model
train(g_model, d_model, cgan_model, dataset, latent_dim)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 First_input_layer (InputLayer)  [(None, 1)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 1, 67)        134         ['First_input_layer[0][0]']      
                                                                                                  
 Second_input_layer (InputLayer  [(None, 67)]        0           []                               
 )                                                                                                
                                                                                                  
 flatten (Flatten)              (None, 67)           0           ['embedding[0][0]']          



(12000, 67) (12000,)
>1, 1/12, d1=0.693, d2=1.301 g=0.850
>1, 2/12, d1=0.892, d2=1.066 g=0.697
>1, 3/12, d1=0.715, d2=0.702 g=0.710
>1, 4/12, d1=0.723, d2=0.694 g=0.717
>1, 5/12, d1=0.700, d2=0.676 g=0.697
>1, 6/12, d1=0.684, d2=0.667 g=0.581
>1, 7/12, d1=0.554, d2=0.566 g=0.701
>1, 8/12, d1=0.317, d2=0.138 g=1.230
>1, 9/12, d1=0.629, d2=0.041 g=0.493
>1, 10/12, d1=0.266, d2=0.032 g=0.272
>1, 11/12, d1=0.048, d2=0.011 g=0.244
>1, 12/12, d1=0.008, d2=0.004 g=0.221
>2, 1/12, d1=0.000, d2=0.002 g=0.104
>2, 2/12, d1=0.069, d2=0.005 g=0.084
>2, 3/12, d1=0.002, d2=0.010 g=0.128
>2, 4/12, d1=0.006, d2=0.001 g=0.098
>2, 5/12, d1=0.007, d2=0.002 g=0.132
>2, 6/12, d1=0.003, d2=0.002 g=0.111
>2, 7/12, d1=0.000, d2=0.002 g=0.102
>2, 8/12, d1=0.127, d2=10.084 g=1.309
>2, 9/12, d1=0.468, d2=10.656 g=4.360
>2, 10/12, d1=1.599, d2=0.000 g=3.511
>2, 11/12, d1=0.837, d2=0.000 g=2.046
>2, 12/12, d1=0.221, d2=0.613 g=1.088
>3, 1/12, d1=0.442, d2=3.272 g=2.537
>3, 2/12, d1=0.184, d2=2.257 g=9.697
>3, 3/12,

KeyboardInterrupt: ignored

# Find the best cGAN model post training 

Here we iterate over all saved models to assess the quality of the models. More precisely, we train a Random Forest on a generated sample and predict real data. The scores are then stored in a dictionary.
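The score stored for each model is the micro-averaged F1. For a single-label problem such as this one, every misclassification counts once as a false positive and once as a false negative, so micro-F1 coincides with accuracy. The sketch below checks this on hypothetical predictions with a hand-written `micro_f1`, which is a stand-in for illustration and not scikit-learn's implementation.

```python
import numpy as np

def micro_f1(y_true, y_pred, classes=(0, 1)):
    # pool true positives, false positives and false negatives over all classes
    tp = fp = fn = 0
    for c in classes:
        tp += int(np.sum((y_pred == c) & (y_true == c)))
        fp += int(np.sum((y_pred == c) & (y_true != c)))
        fn += int(np.sum((y_pred != c) & (y_true == c)))
    return 2 * tp / (2 * tp + fp + fn)

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # hypothetical true classes
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])  # hypothetical predictions

accuracy = float(np.mean(y_true == y_pred))
print(micro_f1(y_true, y_pred), accuracy)  # 0.75 0.75
```

Because the test set is balanced (2000 observations per class), accuracy is a reasonable selection criterion here.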

In [None]:
# define random forest classifier
cgan_rf=RandomForestClassifier(max_depth=8,n_estimators=100)

In [None]:
def generate_points_per_class(latentpoints, size):
    # generate latent points (the random labels are discarded and replaced by fixed ones below)
    latent_points0, _ = generate_latent_points(latentpoints, size)
    latent_points1, _ = generate_latent_points(latentpoints, size)
    
    # create labels
    labels_0 = asarray([0  for _ in range(size)])
    labels_1= asarray([1  for _ in range(size)])
    
    return latent_points0 ,labels_0 , latent_points1, labels_1


In [None]:
def test_initial_quality(steps,model_path,latentpoints,size):
    scores={}
    for x in range(1,steps,1):
      # report which saved generator is being evaluated
      print('generated dataset no: %d' % x)

      # load model
      model = load_model(model_path % x)

      # generate latent points and labels 
      latent_points0 ,labels_0 , latent_points1, labels_1 = generate_points_per_class(latentpoints, size)

      # generate data
      X_0  = model.predict([latent_points0, labels_0])
      X_1  = model.predict([latent_points1, labels_1])

      # convert generated data into dataframes
      gen_df_0 = pd.DataFrame(data = X_0,columns = X_train.columns)
      gen_df_1 = pd.DataFrame(data = X_1,
                          columns = X_train.columns)
      
      # add labels to frames
      gen_df_0['attack_map']=0
      gen_df_1['attack_map']=1

      # combine dataframes together 
      df_gan = pd.concat([gen_df_0,gen_df_1], ignore_index=True, sort=False)
   
      # train random forest model on generated data
      cgan_rf.fit(df_gan.drop('attack_map',axis=1), df_gan['attack_map'])
      
      # predict real 2017 data sample
      y_pred = cgan_rf.predict(X_test)

      # print classification results 
      print(classification_report(y_test,y_pred))
      
      # store results in dict
      scores.update({x: f1_score(y_test, y_pred,average='micro')})
    return scores 

generated dataset no: 1
              precision    recall  f1-score   support

           0       0.50      1.00      0.67      2000
           1       0.00      0.00      0.00      2000

    accuracy                           0.50      4000
   macro avg       0.25      0.50      0.33      4000
weighted avg       0.25      0.50      0.33      4000





generated dataset no: 2
              precision    recall  f1-score   support

           0       0.99      0.48      0.65      2000
           1       0.66      0.99      0.79      2000

    accuracy                           0.74      4000
   macro avg       0.82      0.74      0.72      4000
weighted avg       0.82      0.74      0.72      4000

generated dataset no: 3
              precision    recall  f1-score   support

           0       1.00      0.28      0.44      2000
           1       0.58      1.00      0.74      2000

    accuracy                           0.64      4000
   macro avg       0.79      0.64      0.59      4000
weighted avg       0.79      0.64      0.59      4000

generated dataset no: 4
              precision    recall  f1-score   support

           0       1.00      0.55      0.71      2000
           1       0.69      1.00      0.82      2000

    accuracy                           0.78      4000
   macro avg       0.85      0.78      0.76      4000
we

KeyboardInterrupt: ignored

# Save generated data from our best model

Finally, the best-performing model is retrieved, and data generated from it is saved for more rigorous quality testing in a different notebook.

In [None]:
# find the model with the highest f1 score within the scores dict
def keywithmaxval(dict1):
    """Return the key with the highest value (the first such key if there are ties)."""
    return int(max(dict1, key=dict1.get))

In [None]:
def save_best_data(scores,model_path,latentpoints, size,save_path):
    best_model = keywithmaxval(scores)
    model = load_model(model_path % best_model)

    # generate latent points and labels 
    latent_points0 ,labels_0 , latent_points1, labels_1 = generate_points_per_class(latentpoints, size)

    # generate data
    X_0  = model.predict([latent_points0, labels_0])
    X_1  = model.predict([latent_points1, labels_1])

    # convert generated data into dataframes
    gen_df_0 = pd.DataFrame(data = X_0,columns = X_train.columns)
    gen_df_1 = pd.DataFrame(data = X_1,
                        columns = X_train.columns)

    # add labels to frames
    gen_df_0['attack_map']=0
    gen_df_1['attack_map']=1

    # combine dataframes together 
    df_gan = pd.concat([gen_df_0,gen_df_1], ignore_index=True, sort=False)
    
    # save best generated data 
    df_gan.to_parquet(save_path, index=False)

In [None]:
model_path='/content/drive/MyDrive/RESEARCH-PROJECT/Binary_experiments/Generated models/bruteforce models/test_g_model_%04d.h5'
save_path='/content/drive/MyDrive/RESEARCH-PROJECT/Binary_experiments/Generated data/Bruteforce data/2017GENERATED_BRUTEFORCE.parquet'
scores = test_initial_quality(1000,model_path,100, 1000)
save_best_data(scores,model_path,100, 10000,save_path)