## Question 1

Use TensorFlow to build neural networks for a multi-class classification problem with 5 possible categories (normal, DoS,
R2L, U2R, probing).

### Q1.1

Prepare the data and convert them to appropriate Tensor formats needed
for TensorFlow. You may use KDDTrain+.txt as the training dataset, 50% of
KDDTest+.txt as the validation dataset, and the remaining 50% as the test dataset.

**Data Preprocessing**

In [2]:
import pyspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.ml import Pipeline,Transformer
from pyspark.ml.feature import Imputer,StandardScaler,StringIndexer,OneHotEncoder, VectorAssembler

from pyspark.sql.functions import *
from pyspark.sql.types import *
import numpy as np

col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","classes","difficulty"]

nominal_cols = ['protocol_type','service','flag']
binary_cols = ['land', 'logged_in', 'root_shell', 'su_attempted', 'is_host_login',
'is_guest_login']
continuous_cols = ['duration' ,'src_bytes', 'dst_bytes', 'wrong_fragment' ,'urgent', 'hot',
'num_failed_logins', 'num_compromised', 'num_root' ,'num_file_creations',
'num_shells', 'num_access_files', 'num_outbound_cmds', 'count' ,'srv_count',
'serror_rate', 'srv_serror_rate' ,'rerror_rate' ,'srv_rerror_rate',
'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate' ,'dst_host_count',
'dst_host_srv_count' ,'dst_host_same_srv_rate' ,'dst_host_diff_srv_rate',
'dst_host_same_src_port_rate' ,'dst_host_srv_diff_host_rate',
'dst_host_serror_rate' ,'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
'dst_host_srv_rerror_rate']

class OutcomeCreater(Transformer): # this defines a transformer that creates the outcome column
    
    def __init__(self):
        super().__init__()

    def _transform(self, dataset):
        def attack_category(attack_type):
            if attack_type == 'normal':
                return 0
            elif attack_type in ['portsweep', 'ipsweep', 'nmap', 'satan', 'saint', 'mscan']:
                return 1  # Probing
            elif attack_type in ['neptune', 'smurf', 'pod', 'teardrop', 'land', 'back', 'apache2',
                                'udpstorm', 'processtable', 'mailbomb']:
                return 2  # DoS
            elif attack_type in ['buffer_overflow', 'loadmodule', 'perl', 'rootkit', 'xterm',
                                'ps', 'sqlattack']:
                return 3  # U2R
            else:
                return 4  # R2L
          
        # Convert the function to a UDF, specifying IntegerType for output
        label_to_multiclasses = udf(attack_category, IntegerType())
        output_df = dataset.withColumn('outcomes', label_to_multiclasses(col('classes'))).drop("classes")  
        output_df = output_df.withColumn('outcomes', col('outcomes').cast(DoubleType()))
        output_df = output_df.drop('difficulty')  # column is named 'difficulty' in col_names
        return output_df

class FeatureTypeCaster(Transformer): # this transformer will cast the columns as appropriate types  
    def __init__(self):
        super().__init__()

    def _transform(self, dataset):
        output_df = dataset
        for col_name in binary_cols + continuous_cols:
            output_df = output_df.withColumn(col_name,col(col_name).cast(DoubleType()))

        return output_df

class ColumnDropper(Transformer): # this transformer drops unnecessary columns
    def __init__(self, columns_to_drop = None):
        super().__init__()
        self.columns_to_drop = columns_to_drop or []  # avoid iterating over None
    def _transform(self, dataset):
        output_df = dataset
        for col_name in self.columns_to_drop:
            output_df = output_df.drop(col_name)
        return output_df

def get_preprocess_pipeline():
    # Stage where columns are cast to appropriate types
    stage_typecaster = FeatureTypeCaster()

    # Stage where nominal columns are transformed to index columns using StringIndexer
    nominal_id_cols = [x+"_index" for x in nominal_cols]
    nominal_onehot_cols = [x+"_encoded" for x in nominal_cols]
    stage_nominal_indexer = StringIndexer(inputCols = nominal_cols, outputCols = nominal_id_cols )

    # Stage where the index columns are further transformed using OneHotEncoder
    stage_nominal_onehot_encoder = OneHotEncoder(inputCols=nominal_id_cols, outputCols=nominal_onehot_cols)

    # Stage where all relevant features are assembled into a vector (and dropping a few)
    feature_cols = continuous_cols+binary_cols+nominal_onehot_cols
    corelated_cols_to_remove = ["dst_host_serror_rate","srv_serror_rate","dst_host_srv_serror_rate",
                     "srv_rerror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate"]
    for col_name in corelated_cols_to_remove:
        feature_cols.remove(col_name)
    stage_vector_assembler = VectorAssembler(inputCols=feature_cols, outputCol="vectorized_features")

    # Stage where we scale the columns
    stage_scaler = StandardScaler(inputCol= 'vectorized_features', outputCol= 'features')
    
    # Stage for creating the outcome column holding the five-class attack label
    stage_outcome = OutcomeCreater()

    # Removing all unnecessary columns, only keeping the 'features' and 'outcomes' columns
    stage_column_dropper = ColumnDropper(columns_to_drop = nominal_cols+nominal_id_cols+
        nominal_onehot_cols+ binary_cols + continuous_cols + ['vectorized_features'])
    
    # Connect the stages into a pipeline
    pipeline = Pipeline(stages=[stage_typecaster,stage_nominal_indexer,stage_nominal_onehot_encoder,
        stage_vector_assembler,stage_scaler,stage_outcome,stage_column_dropper])
    return pipeline 
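
The category assignment inside `OutcomeCreater` can also be expressed as a dictionary lookup, which makes the label-to-class table easy to inspect and test outside Spark. The sketch below is illustrative only: `ATTACK_GROUPS`, `LABEL_TO_CLASS`, and the module-level `attack_category` are hypothetical names, and the label spellings assume the raw NSL-KDD strings (for example `portsweep` and `buffer_overflow`).

```python
# Hypothetical helper: maps raw NSL-KDD labels to the five class indices.
ATTACK_GROUPS = {
    1: ["portsweep", "ipsweep", "nmap", "satan", "saint", "mscan"],      # Probing
    2: ["neptune", "smurf", "pod", "teardrop", "land", "back", "apache2",
        "udpstorm", "processtable", "mailbomb"],                         # DoS
    3: ["buffer_overflow", "loadmodule", "perl", "rootkit", "xterm",
        "ps", "sqlattack"],                                              # U2R
}

# Invert the grouping into a flat label -> class index table.
LABEL_TO_CLASS = {name: idx for idx, names in ATTACK_GROUPS.items() for name in names}
LABEL_TO_CLASS["normal"] = 0


def attack_category(attack_type: str) -> int:
    """Return the class index; any label not listed falls back to 4 (R2L)."""
    return LABEL_TO_CLASS.get(attack_type, 4)


print(attack_category("normal"), attack_category("neptune"), attack_category("guess_passwd"))
```

Because unlisted labels fall through to R2L, a misspelled label in the table silently lands in class 4, so it can be worth checking the table against the distinct values of the `classes` column.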

**Prepare Dataset**

In [18]:
import tensorflow as tf
from tensorflow import keras
# Configure Spark
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Pytorch") \
    .getOrCreate()

# Load train and test data
nslkdd_raw = spark.read.csv('/Users/kitiya/Documents/CMU@2024/14763 Systems&ToolChain/hw6/hw6-pytorch-Kitiyaparnnn/NSL-KDD/KDDTrain+.txt',header=False).toDF(*col_names)
nslkdd_test_raw = spark.read.csv('/Users/kitiya/Documents/CMU@2024/14763 Systems&ToolChain/hw6/hw6-pytorch-Kitiyaparnnn/NSL-KDD/KDDTest+.txt',header=False).toDF(*col_names)
print("train dataset: ", nslkdd_raw.count())
print("test dataset: ", nslkdd_test_raw.count())

preprocess_pipeline = get_preprocess_pipeline()
preprocess_pipeline_model = preprocess_pipeline.fit(nslkdd_raw)

nslkdd_df = preprocess_pipeline_model.transform(nslkdd_raw)
nslkdd_df_test = preprocess_pipeline_model.transform(nslkdd_test_raw)


nslkdd_df.cache()
nslkdd_df_test.cache()

to_array = udf(lambda v: v.toArray().tolist(), ArrayType(FloatType()))

# Split test data into validation dataset and test dataset
val_df, test_df = nslkdd_df_test.randomSplit(weights=[0.5,0.5], seed=100)

# Convert Spark to Pandas
train_df_pandas = nslkdd_df.withColumn('features',to_array('features')).toPandas()
val_df_pandas = val_df.withColumn('features',to_array('features')).toPandas()
test_df_pandas = test_df.withColumn('features',to_array('features')).toPandas()

# Convert Pandas to Tensor
x_train = tf.constant(np.array(train_df_pandas['features'].values.tolist()))
y_train = tf.constant(np.array(train_df_pandas['outcomes'].values.tolist()))

x_val = tf.constant(np.array(val_df_pandas['features'].values.tolist()))
y_val = tf.constant(np.array(val_df_pandas['outcomes'].values.tolist()))

x_test = tf.constant(np.array(test_df_pandas['features'].values.tolist()))
y_test = tf.constant(np.array(test_df_pandas['outcomes'].values.tolist()))

train dataset:  125973
test dataset:  22544
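
Note that `randomSplit` samples each row independently, so the validation and test parts are close to, but generally not exactly, 50% each. If exact halves are wanted, one option is to shuffle and cut at the midpoint after collecting the rows. The sketch below is a minimal pure-Python illustration under that assumption; `split_in_half` is a hypothetical helper that works on any in-memory sequence, not on a Spark DataFrame.

```python
import random


def split_in_half(rows, seed=100):
    """Shuffle a copy of rows and return two halves whose sizes differ by at most one."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Stand-in for the 22544 test rows reported above.
val_rows, test_rows = split_in_half(range(22544))
print(len(val_rows), len(test_rows))
```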


### Q1.2

Build a Neural Network using tf.keras, and conduct training and validation.
Select the appropriate loss function and choose at least one metric. Set the appropriate
number of epochs. After training, evaluate your trained model on the test data.

In your submission, include a screenshot of the output of the fit function and the
evaluate function.

In [19]:
import datetime

# Create a tensorflow model
model = keras.Sequential(
    [
        keras.layers.Dense(5, activation='relu'),
        keras.layers.Dense(5)
    ]
)


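# Caution: SparseTopKCategoricalAccuracy defaults to k=5, so with 5 classes it always
# reports 1.0. keras.metrics.SparseCategoricalAccuracy would report top-1 accuracy.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.05), 
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[keras.metrics.SparseTopKCategoricalAccuracy(name='Accuracy')])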

# tensorboard
log_dir = "logs/multiClasses/"+ datetime.datetime.now().strftime("%d-%m-%Y-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(x=x_train, y=y_train, epochs=5, verbose=2,
          validation_data=(x_val,y_val),
          callbacks=[tensorboard_callback])




Epoch 1/5


2024-11-17 13:27:29.379510: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-17 13:27:53.007837: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


3937/3937 - 26s - loss: 0.1437 - Accuracy: 1.0000 - val_loss: 2.2816 - val_Accuracy: 1.0000 - 26s/epoch - 7ms/step
Epoch 2/5
3937/3937 - 25s - loss: 0.0993 - Accuracy: 1.0000 - val_loss: 2.3989 - val_Accuracy: 1.0000 - 25s/epoch - 6ms/step
Epoch 3/5
3937/3937 - 26s - loss: 0.0967 - Accuracy: 1.0000 - val_loss: 2.5542 - val_Accuracy: 1.0000 - 26s/epoch - 7ms/step
Epoch 4/5
3937/3937 - 26s - loss: 0.1007 - Accuracy: 1.0000 - val_loss: 2.5032 - val_Accuracy: 1.0000 - 26s/epoch - 7ms/step
Epoch 5/5
3937/3937 - 25s - loss: 0.1020 - Accuracy: 1.0000 - val_loss: 2.8751 - val_Accuracy: 1.0000 - 25s/epoch - 6ms/step


<keras.src.callbacks.History at 0x376fb3fa0>

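The `3937/3937` in the log is the number of batches per epoch. Since `fit` was called without a `batch_size`, Keras falls back to its default of 32 for tensor inputs, and the step count is the training-set size divided by 32, rounded up. A small sketch of that arithmetic (`steps_per_epoch` is a hypothetical helper, not a Keras function):

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int = 32) -> int:
    """Number of batches needed to cover num_samples, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)


print(steps_per_epoch(125973))  # training rows reported by the count above
```
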
In [20]:
# test
print("Evaluate on test set:")
model.evaluate(x_test,y_test, verbose=2)

Evaluate on test set:
351/351 - 2s - loss: 2.8747 - Accuracy: 1.0000 - 2s/epoch - 6ms/step


[2.8747165203094482, 1.0]
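
An accuracy of exactly 1.0 alongside a loss near 2.87 is explained by the metric rather than the model: `SparseTopKCategoricalAccuracy` counts a prediction as correct when the true class is among the top `k` scores, and `k` defaults to 5. With only 5 classes, every class is always in the top 5. The pure-Python sketch below reproduces that effect on made-up logits; `top_k_accuracy` is a hypothetical re-implementation for illustration, not the Keras code.

```python
def top_k_accuracy(labels, logits, k):
    """Fraction of samples whose true label is among the k largest logits."""
    hits = 0
    for label, row in zip(labels, logits):
        # Indices of the k highest-scoring classes for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += int(label) in top_k
    return hits / len(labels)


labels = [0, 2, 4]
logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],  # highest score is class 1, true class is 0
    [0.0, 0.1, 3.0, 0.2, 0.1],   # highest score is class 2, true class is 2
    [1.5, 0.0, 0.2, 0.1, 0.3],   # highest score is class 0, true class is 4
]

acc_k5 = top_k_accuracy(labels, logits, k=5)  # every class is in the top 5
acc_k1 = top_k_accuracy(labels, logits, k=1)  # ordinary top-1 accuracy
print(acc_k5, acc_k1)
```

In Keras, passing `keras.metrics.SparseCategoricalAccuracy()` (or `k=1`) in `metrics` would report the top-1 figure instead.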

In [21]:
model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_18 (Dense)            (None, 5)                 570       
                                                                 
 dense_19 (Dense)            (None, 5)                 30        
                                                                 
Total params: 600 (2.34 KB)
Trainable params: 600 (2.34 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
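
The parameter counts in the summary follow from each `Dense` layer holding one weight per (input, unit) pair plus one bias per unit. The sketch below works through that arithmetic; `dense_params` and `infer_inputs` are hypothetical helpers. Read backwards, the 570 parameters of the first layer imply that the model received 113 input features.

```python
def dense_params(inputs: int, units: int) -> int:
    """Weights (inputs * units) plus one bias per unit."""
    return (inputs + 1) * units


def infer_inputs(params: int, units: int) -> int:
    """Invert dense_params to recover the input width of a Dense layer."""
    return params // units - 1


n_features = infer_inputs(570, 5)          # width of the feature vector
first_layer = dense_params(n_features, 5)  # hidden layer with 5 units
second_layer = dense_params(5, 5)          # output layer with 5 logits
print(n_features, first_layer, second_layer, first_layer + second_layer)
```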


Refer to q1-2_1.png, q1-2_2.png

### Q1.3

Display the results in TensorBoard. In your submission, include screenshots
of the loss and the metrics for both the training and the validation run.

**Run in the notebook**

`%load_ext tensorboard`

`%tensorboard --logdir logs/multiClasses/`

Alternatively, run `tensorboard --logdir logs/multiClasses/` in a terminal.

Refer to q1-3_1.png, q1-3_2.png

*Reference*

- Guannan Qu. (2024). Lecture 17: TensorFlow slides.
- homework6.ipynb