<span style = "color:blue; font-size:24px">DecisionTree</span>

This notebook focuses only on Decision Tree model training.

ZeekData24 Attack Profiles 

Dataset 1: Multiple Attack Types 

Dataset 2: Multiple Attack Types 

Dataset 3: Multiple Attack Types 

Dataset 4: Multiple Attack Types 

Dataset 5: Multiple Attack Types 

Dataset 6: Benign Data

Dataset 7: Benign Data


The script combines all seven datasets, lists the unique attack tactics found in the merged DataFrame, and splits the data into one DataFrame per tactic. Each of these DataFrames contains the rows for a single attack tactic plus all benign rows (`label_tactic` of `none`) from the merged DataFrame. A decision tree model is then trained for each tactic on its attack rows and a benign sample sized for roughly a 70/30 benign-to-attack ratio, with an 80/20 train/test split.
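
As a rough illustration of the preparation described above, the following plain-Python sketch mimics the one-tactic-plus-benign split and the benign downsampling fraction on a toy list of rows. The helper names `split_by_tactic` and `benign_sampling_fraction` and the toy rows are hypothetical. The notebook itself performs these steps on Spark DataFrames, and the 70/30 benign-to-attack target is taken from the training cell below.

```python
from collections import defaultdict

BENIGN = "none"  # label_tactic value used for benign rows in this notebook


def split_by_tactic(rows):
    """Return {tactic: rows for that tactic followed by all benign rows}."""
    benign = [r for r in rows if r["label_tactic"] == BENIGN]
    by_tactic = defaultdict(list)
    for r in rows:
        if r["label_tactic"] != BENIGN:
            by_tactic[r["label_tactic"]].append(r)
    return {tactic: attack + benign for tactic, attack in by_tactic.items()}


def benign_sampling_fraction(attack_count, benign_count, benign_pct=70, attack_pct=30):
    """Fraction of benign rows to keep for a benign_pct/attack_pct mix, capped at 1.0."""
    if benign_count == 0:
        return 0.0
    target = attack_count * benign_pct // attack_pct  # desired number of benign rows
    return min(target / benign_count, 1.0)


# Toy rows, made up for illustration only
rows = (
    [{"label_tactic": "Reconnaissance", "duration": 0.1}] * 3
    + [{"label_tactic": "Exfiltration", "duration": 2.0}] * 2
    + [{"label_tactic": BENIGN, "duration": 0.5}] * 20
)

frames = split_by_tactic(rows)
for tactic, subset in frames.items():
    attacks = sum(r["label_tactic"] == tactic for r in subset)
    benign = len(subset) - attacks
    print(tactic, attacks, benign, benign_sampling_fraction(attacks, benign))
```

When attack rows are plentiful relative to benign rows, the fraction is capped at 1.0, so every benign row is kept and the resulting mix contains less than 70% benign data.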

In [3]:
# Spark imports
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Python imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MultiLabelBinarizer

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Pre-Preprocess Mission Log") \
    .master("spark://192.168.1.2:7077") \
    .config("spark.driver.cores", "2") \
    .config("spark.driver.memory", "10g") \
    .config("spark.executor.memory", "12g") \
    .config("spark.executor.cores", "3") \
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "5") \
    .config("spark.dynamicAllocation.maxExecutors", "8") \
    .config("spark.executor.instances", "5") \
    .getOrCreate()

# Paths containing network data
data_paths = [
    "hdfs://192.168.1.2:9000/datasets-uwf-edu/UWF-TestZeekData24/parquet/2024-02-25 - 2024-03-03/part-00000-8b838a85-76eb-4896-a0b6-2fc425e828c2-c000.snappy.parquet",
    "hdfs://192.168.1.2:9000/datasets-uwf-edu/UWF-TestZeekData24/parquet/2024-03-03 - 2024-03-10/part-00000-0955ed97-8460-41bd-872a-7375a7f0207e-c000.snappy.parquet",
    "hdfs://192.168.1.2:9000/datasets-uwf-edu/UWF-TestZeekData24/parquet/2024-03-10 - 2024-03-17/part-00000-071774ae-97f3-4f31-9700-8bfcdf41305a-c000.snappy.parquet",
    "hdfs://192.168.1.2:9000/datasets-uwf-edu/UWF-TestZeekData24/parquet/2024-03-17 - 2024-03-24/part-00000-5f556208-a1fc-40a1-9cc2-a4e24c76aeb3-c000.snappy.parquet",
    "hdfs://192.168.1.2:9000/datasets-uwf-edu/UWF-TestZeekData24/parquet/2024-03-24 - 2024-03-31/part-00000-ea3a47a3-0973-4d6b-a3a2-8dd441ee7901-c000.snappy.parquet",
    "hdfs://192.168.1.2:9000/datasets-uwf-edu/UWF-TestZeekData24/parquet/2024-10-27 - 2024-11-03/part-00000-69700ccb-c1c1-4763-beb7-cd0f1a61c268-c000.snappy.parquet",
    "hdfs://192.168.1.2:9000/datasets-uwf-edu/UWF-TestZeekData24/parquet/2024-11-03 - 2024-11-10/part-00000-f078acc1-ab56-40a6-a6e1-99d780645c57-c000.snappy.parquet"
]

# Container to hold the processed DataFrames
df_list = []

# Loop through each path, load and process the data
for i, path in enumerate(data_paths, start=1):
    # Load each dataset
    df = spark.read.parquet(path)

    # Select relevant columns
    df = df.select("ts", "duration", "orig_bytes", "resp_bytes", "orig_ip_bytes", "resp_ip_bytes", "label_tactic")

    # Show the distinct attack labels before any preprocessing
    print(f"Dataset {i}: All rows of 'label_tactic' before preprocessing:")
    all_label_tactics = df.select("label_tactic").distinct().collect()
    for row in all_label_tactics:
        print(row['label_tactic'])

    # Replace missing values in the numeric feature columns with 0
    df = df.fillna({
        "duration": 0,
        "orig_bytes": 0,
        "resp_bytes": 0,
        "orig_ip_bytes": 0,
        "resp_ip_bytes": 0
    })

    df_list.append(df)

# Combine all DataFrames into one
combined_df = df_list[0]
for df in df_list[1:]:
    combined_df = combined_df.union(df)

# Get distinct values of label_tactic
distinct_label_tactics = combined_df.select("label_tactic").distinct().collect()
distinct_label_tactics = [row['label_tactic'] for row in distinct_label_tactics]

# Print distinct label_tactic values
print("Distinct label_tactic values:")
for tactic in distinct_label_tactics:
    print(tactic)

# Create new DataFrames for each label_tactic
dataframes = {}
for tactic in distinct_label_tactics:
    if tactic != "none":
        tactic_df = combined_df.filter((F.col("label_tactic") == tactic) | (F.col("label_tactic") == "none"))
        dataframes[tactic] = tactic_df

# Train a decision tree model for each attack DataFrame
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
import json

models = {}
for tactic, df in dataframes.items():
    # Prepare the data for training
    feature_columns = ["duration", "orig_bytes", "resp_bytes", "orig_ip_bytes", "resp_ip_bytes"]
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features", handleInvalid="skip")
    df = assembler.transform(df)
    
    # Convert label_tactic to a numerical label
    df = df.withColumn("label", F.when(F.col("label_tactic") == tactic, 1).otherwise(0))
    
    # Split the data into attack and benign sets
    attack_df = df.filter(F.col("label") == 1)
    benign_df = df.filter(F.col("label") == 0)
    
    # Downsample benign rows toward a 70/30 benign-to-attack ratio.
    # sample() keeps each row independently, so the sampled count is approximate.
    attack_count = attack_df.count()
    benign_count = benign_df.count()
    benign_sample_size = int(attack_count * (70 / 30))
    sampling_fraction = min(benign_sample_size / benign_count, 1.0)
    benign_sample_df = benign_df.sample(withReplacement=False, fraction=sampling_fraction, seed=42)
    
    # Combine attack and sampled benign data
    # (named balanced_df so the merged combined_df above is not overwritten)
    balanced_df = attack_df.union(benign_sample_df)
    
    # Split the balanced data into training and test sets
    train_df, test_df = balanced_df.randomSplit([0.8, 0.2], seed=42)
    
    # Train the decision tree model
    dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
    model = dt.fit(train_df)
    models[tactic] = model
    
    # Evaluate the model
    predictions = model.transform(test_df)
    evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    
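    # Calculate additional metrics. Precision and recall here are class-weighted
    # averages, so the weighted recall is equal to the accuracy.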
    precision_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
    recall_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")
    f1_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
    
    precision = precision_evaluator.evaluate(predictions)
    recall = recall_evaluator.evaluate(predictions)
    f1_score = f1_evaluator.evaluate(predictions)
    
    print(f"\nModel for label_tactic: {tactic}")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1 Score: {f1_score}")


Dataset 1: All rows of 'label_tactic' before preprocessing:
Privilege Escalation
Reconnaissance
Credential Access
Persistence
Initial Access
Exfiltration
Defense Evasion
Dataset 2: All rows of 'label_tactic' before preprocessing:
Privilege Escalation
Reconnaissance
Credential Access
Persistence
Initial Access
Exfiltration
Defense Evasion
Dataset 3: All rows of 'label_tactic' before preprocessing:
Privilege Escalation
Reconnaissance
Credential Access
Persistence
Initial Access
Exfiltration
Defense Evasion
Dataset 4: All rows of 'label_tactic' before preprocessing:
Privilege Escalation
Reconnaissance
Credential Access
Persistence
Initial Access
Exfiltration
Defense Evasion
Dataset 5: All rows of 'label_tactic' before preprocessing:
Privilege Escalation
Reconnaissance
Credential Access
Persistence
Initial Access
Exfiltration
Defense Evasion
Dataset 6: All rows of 'label_tactic' before preprocessing:
none
Dataset 7: All rows of 'label_tactic' before preprocessing:
none
Distinct label_tacti

                                                                                


Model for label_tactic: Privilege Escalation
Accuracy: 0.9995077528919517
Precision: 0.9995080988140235
Recall: 0.9995077528919517
F1 Score: 0.9995076373309837


                                                                                


Model for label_tactic: Reconnaissance
Accuracy: 0.9975283213182287
Precision: 0.9975479068102127
Recall: 0.9975283213182287
F1 Score: 0.997531176440728


                                                                                


Model for label_tactic: Credential Access
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0


                                                                                


Model for label_tactic: Persistence
Accuracy: 0.9995077528919517
Precision: 0.9995080988140235
Recall: 0.9995077528919517
F1 Score: 0.9995076373309837


                                                                                


Model for label_tactic: Initial Access
Accuracy: 0.9993050729673384
Precision: 0.9993052114885043
Recall: 0.9993050729673384
F1 Score: 0.9993049356500434


                                                                                


Model for label_tactic: Exfiltration
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0





Model for label_tactic: Defense Evasion
Accuracy: 0.9997539975399754
Precision: 0.9997540839475834
Recall: 0.9997539975399754
F1 Score: 0.9997539686732755
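
The precision and recall printed above come from the `weightedPrecision` and `weightedRecall` metrics, which average the per-class values weighted by class support. Weighted recall is algebraically equal to accuracy, which is why the Accuracy and Recall lines match for every model. The sketch below checks this on a made-up confusion matrix in plain Python. The function name `weighted_metrics` and the counts are hypothetical and are not taken from the notebook's data.

```python
def weighted_metrics(confusion):
    """confusion[i][j] = number of rows with true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

    weighted_precision = 0.0
    weighted_recall = 0.0
    per_class_recall = []
    for c in range(n_classes):
        support = sum(confusion[c])  # rows whose true class is c
        predicted = sum(confusion[i][c] for i in range(n_classes))  # rows predicted as c
        tp = confusion[c][c]
        precision_c = tp / predicted if predicted else 0.0
        recall_c = tp / support if support else 0.0
        per_class_recall.append(recall_c)
        # Weight each class by its share of the true labels
        weighted_precision += precision_c * support / total
        weighted_recall += recall_c * support / total
    return accuracy, weighted_precision, weighted_recall, per_class_recall


# Made-up counts: rows are true classes [benign, attack], columns are predictions
confusion = [[690, 10],
             [20, 280]]
accuracy, w_precision, w_recall, recalls = weighted_metrics(confusion)
print(f"accuracy={accuracy:.4f} weighted_recall={w_recall:.4f}")
print(f"weighted_precision={w_precision:.4f} attack_recall={recalls[1]:.4f}")
```

Because benign rows make up most of each balanced set, a weighted average can mask missed attacks, so the attack-class recall may be worth reporting alongside the weighted figures.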


                                                                                