Trevor Snedden  
CYBV 471
competition 1

# <h1 style="text-align: center; ">Competition 1</h1>

### Imports and device management

The optional Intel Extension for Scikit-learn is used in this notebook. References can be found on [Intel's](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html#gs.gqalpy) website. Further download and install documentation using Conda, pip, and Intel pipelines can be found at [pypi.org](https://pypi.org/project/scikit-learn-intelex/).

In [None]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE #help balance the training set
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt
%matplotlib inline




## Section 1: Data Processing

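The `encoding` and `encoding_errors` arguments were added to resolve the error "'utf-8' codec can't decode byte 0x92 in position 1646: invalid start byte". With `encoding_errors="replace"`, undecodable bytes are replaced with a placeholder character instead of raising.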

In [None]:
test_set = pd.read_csv("UNSW_NB15_testing-set.csv")
training_set = pd.read_csv("UNSW_NB15_training-set.csv")
features = pd.read_csv("UNSW-NB15_features.csv", encoding='utf-8', encoding_errors="replace")
print(features.head(10))
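The decode error above can be reproduced without the dataset. This is a minimal sketch using made-up sample bytes; it assumes the offending byte 0x92 is a Windows-1252 right single quotation mark, which is a common cause but is not confirmed for this file.

```python
# Hypothetical sample bytes: 0x92 is not a valid UTF-8 start byte.
raw = b"attacker\x92s port"

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False  # same kind of failure that pandas reports

# errors="replace" swaps the bad byte for U+FFFD instead of raising
replaced = raw.decode("utf-8", errors="replace")

# Assumption: the file was saved as Windows-1252, where 0x92 is a curly apostrophe
as_cp1252 = raw.decode("cp1252")

print(strict_ok)
print(replaced)
print(as_cp1252)
```

If the Windows-1252 assumption holds for the features file, passing `encoding='cp1252'` to `pd.read_csv` might preserve the character rather than replace it.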


Verify that there are no null values in the datasets that could skew the results. **x.isnull().sum()** counts the number of null values in each column.

In [None]:
print(training_set['attack_cat'].value_counts())
print(f"null value: {training_set.isnull().sum()}") #none

In [None]:
print(test_set['attack_cat'].value_counts())
print(f"Null values: {test_set['attack_cat'].isnull().sum()}")

In [None]:
training_set.columns

In [None]:
#drop the unused
train_df = training_set.drop(columns=['id', 'label'])
test_df = test_set.drop(columns=['id', 'label'])

In [None]:
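# One-hot encode the categorical columns.
# Note: this encodes training_set/test_set rather than train_df/test_df,
# so the 'id' and 'label' columns are still present in the encoded frames.
train_df_encoded = pd.get_dummies(training_set, columns=['proto', 'service', 'state'])
test_df_encoded = pd.get_dummies(test_set, columns=['proto', 'service', 'state'])
print(train_df_encoded.head())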

Align the train and test sets after one-hot encoding so that both have the same columns in the same order.

This ensures that any categorical values that exist in the training set but not in the test set are still accounted for.

In [None]:
train_df_encoded, test_df_encoded = train_df_encoded.align(test_df_encoded, join='left', axis=1)
# align() adds training-only columns to the test set as NaN; fill them with 0
test_df_encoded.fillna(0, inplace=True)
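The effect of `align(join='left', axis=1)` is easier to see on a toy example. The frames and values below are hypothetical and only illustrate the column handling.

```python
import pandas as pd

# Toy frames (hypothetical values): each side has a 'proto' category the other lacks
train = pd.DataFrame({"dur": [0.1, 0.2, 0.3], "proto": ["tcp", "udp", "arp"]})
test = pd.DataFrame({"dur": [0.5, 0.6], "proto": ["tcp", "icmp"]})

train_enc = pd.get_dummies(train, columns=["proto"])
test_enc = pd.get_dummies(test, columns=["proto"])

# join='left' keeps exactly the training columns:
# columns missing from test are added as NaN, columns only in test are dropped
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1)
test_enc = test_enc.fillna(0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Note that a category seen only in the test set (`proto_icmp` here) is dropped, so the model never sees it.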

- Scaling and normalization are not always necessary for trees,
since trees split data by thresholds. 

- Encoding is still required, since trees can't operate on categorical (non-numeric) data. Such columns need to be converted with label encoding or one-hot encoding.
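The scaling point can be checked with a small experiment. This sketch uses synthetic data (not UNSW-NB15) and a single `DecisionTreeClassifier`, and compares predictions with and without `StandardScaler`. The two trees should agree on essentially every point, though exact agreement is not guaranteed because of floating-point effects.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic data: two features on very different scales
X = np.column_stack([rng.normal(0, 1, 400), rng.normal(0, 1000, 400)])
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)
X_fit, y_fit, X_new = X[:300], y[:300], X[300:]

scaler = StandardScaler().fit(X_fit)

raw_tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_fit), y_fit)

# Fraction of held-out points where both trees make the same prediction
agreement = np.mean(
    raw_tree.predict(X_new) == scaled_tree.predict(scaler.transform(X_new))
)
print(f"agreement: {agreement:.3f}")
```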

First thoughts on feature selection:  
 Because the trees select key features automatically, my feature selection comes later in the notebook, after the grid search is conducted.  


An advantage of Random Forests is their ability to calculate **feature importance**. These importances are accessible after training, which makes it possible to see which features drive the model's decisions. This allows for further feature selection that can improve model performance. Because the task is framed as fault detection, the importances can also reveal features that contribute little to detection.


Split the data into features and targets.

In [None]:
#define x and y(targ) for train and test
x_train = train_df_encoded.drop(columns=['attack_cat']) #features
y_train = train_df_encoded['attack_cat'] #targ attack cat

x_test = test_df_encoded.drop(columns=['attack_cat'])
y_test = test_df_encoded['attack_cat']#targs



Apply SMOTE to the training data to generate synthetic samples for the minority classes until each one matches the size of the majority class (Normal in this case).

In [None]:
smote = SMOTE(random_state=42)

#apply smote
x_train, y_train = smote.fit_resample(x_train, y_train)
print("Class distribution after SMOTE:", y_train.value_counts())
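The core idea behind SMOTE is to interpolate between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that idea in NumPy, not imblearn's implementation; the function name `smote_like_sample` and the sample points are hypothetical.

```python
import numpy as np

def smote_like_sample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        base = X_minority[rng.integers(len(X_minority))]
        # Euclidean distances from the base point to every minority point
        dists = np.linalg.norm(X_minority - base, axis=1)
        # Skip position 0 of the sort, which is the base point itself
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = X_minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(base + gap * (neighbour - base))
    return np.array(synthetic)

# Hypothetical minority-class points in 2-D
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_like_sample(X_min, n_new=6)
print(X_new.shape)
```

Every synthetic row lies on a line segment between two real minority points, so no values outside the observed minority range are produced.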

In [None]:
feature_names = x_train.columns
feature_names

Use `train_test_split` to further split the data into training and validation sets.

In [None]:
# 80 20 split
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42
    )
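For reference, the 80/20 split can be illustrated on a small hypothetical array. The `stratify` argument shown here is not used in the notebook's split. It is an optional extra that keeps class proportions equal in both parts, which matters most when the labels are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)  # hypothetical imbalanced labels

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_tr.shape, X_va.shape)
print(np.bincount(y_va))  # class counts in the validation part
```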

## Section 2: Model

After looking at the data and seeing the large number of 'Normal' records, I decided to think of attacks as faults in the system. With that in mind, I decided to use a Random Forest model via scikit-learn's RandomForestClassifier(). This also allows me to explore the functionality of Random Forests.  

Due to the nature of trees, minimal data processing was required, although one-hot encoding was used to convert categorical values to numeric ones.  

For tuning, I decided to use a grid search to find the best-fit parameters.

Metrics such as accuracy, precision, recall, and F1-score are used to understand performance across the attack categories.

A RandomForestGridSearch class was created for easier model tuning.


In [None]:
class RandomForestGridSearch:
    def __init__(self, grid_params:dict, cv:int=5, n_jobs:int=-1, verbose:int=2, random_state:int=42):
        """
        Initialize the RandomForestGridSearch class with hyperparameters, cross-validation, and other configs

        :param grid_params: Dictionary of hyperparameters for the grid search
        :param cv: Number of cross_validation folds. ex. cv=5; 4 train, 1 validation.
        :param n_jobs: number of parallel jobs (-1 means using all available cores).
        :param verbose: Verbosity level for progress output
        :param random_state: Random seed for reproducibility
        """
        # store the grid and the search configuration
        self.grid_params = grid_params
        self.cv = cv
        self.n_jobs = n_jobs
        self.verbose = verbose
        self.random_state = random_state
        self.best_rf_model = None # to store best model during grid search
        self.best_params = None

    def fit(self, X_train, Y_train):
        """
        Perform grid search with cross_val on the training data

        :param X_train: Training set (features)
        :param Y_train: Training labels
        """

        # initialize RF_model (not training)
        RF_model = RandomForestClassifier(random_state=self.random_state)

        # initialize grid search
        grid_search = GridSearchCV(estimator=RF_model,
                                   param_grid=self.grid_params,
                                   cv=self.cv,  # number of folds: 4 train, 1 validation when cv=5
                                   n_jobs=self.n_jobs,  # all available cores when -1
                                   verbose=self.verbose  # print progress
                                   )
        
        #perform the grid search
        grid_search.fit(X_train, Y_train)

        #extract best params and model
        self.best_rf_model = grid_search.best_estimator_ #best model
        self.best_params = grid_search.best_params_
        print(f"best score: {grid_search.best_score_}\nbest_params: {grid_search.best_params_}")

    def predict(self, X):
        """
        Use the best model found during grid search to make prediction on the new data

        :param X: Feature set for prediction (val or test set)
        :return: Predicted labels
        """
        if self.best_rf_model is None:
            raise ValueError("need to fit the model before making prediction")
        return self.best_rf_model.predict(X)
    
    def evaluate(self, X, y_true):
        """
        Evaluate model performance

        :param X: Feature set for evaluation
        :param y_true: True target labels for accuracy
        :return: Accuracy of model on data
        """
        y_pred = self.predict(X)
        accuracy = accuracy_score(y_true, y_pred)
        print(f"Accuracy: {accuracy}")
        return accuracy
    
    def feature_importances(self, feature_names ,plot:bool=False):
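        """
        Print the features ranked by importance and optionally plot them
        :param feature_names: list of feature names corresponding to the feature matrix
        :param plot: if True, also draw a horizontal bar chart of the rankings
        """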
        if self.best_rf_model is None:
            raise ValueError('Need to train model')
        
        #get feature importance
        feature_ranking = self.best_rf_model.feature_importances_
        #sort by importance
        idx = np.argsort(feature_ranking)[::-1]

        #print the features
        print("Feature Rankings")
        for f in range(len(feature_ranking)):
            print(f"{f+1}: Feature {feature_names[idx[f]]} ({feature_ranking[idx[f]]})")
        if plot:
            #plot feature_rankings
            plt.figure(figsize=(10,6))
            plt.title("Feature Rankings")
            plt.barh(range(len(feature_ranking)), feature_ranking[idx], align="center")
            plt.yticks(range(len(feature_ranking)), [feature_names[i] for i in idx])
            plt.xlabel('Relative Importance')
            plt.show()
        


In [None]:
grid_params = {
    #'n_estimators': [100,200,300], #trees
    'n_estimators': [290,300,310], #trees
    'max_depth': [9,10,11], #depth of trees
    #'max_depth': [None, 10,20], #depth of trees
    'min_samples_split': [9,10,11], #number of samples to split internal node
    #'min_samples_split': [2,5,10], #number of samples to split internal node
    'min_samples_leaf': [1,2,3], # min number of sample each leaf node
}

rf_grid_search = RandomForestGridSearch(grid_params=grid_params)

#fit the model
rf_grid_search.fit(x_train, y_train)

#evaluate the model
rf_grid_search.evaluate(x_val, y_val)
#make predictions on test set
y_test_pred = rf_grid_search.predict(x_test)

#get important features
rf_grid_search.feature_importances(feature_names=feature_names)
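The cost of this search can be estimated before running it. This sketch uses scikit-learn's `ParameterGrid` to count the candidate combinations in the grid above; the variable names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

grid_params = {
    "n_estimators": [290, 300, 310],
    "max_depth": [9, 10, 11],
    "min_samples_split": [9, 10, 11],
    "min_samples_leaf": [1, 2, 3],
}
cv_folds = 5  # default cv in RandomForestGridSearch

n_candidates = len(ParameterGrid(grid_params))  # 3 * 3 * 3 * 3 combinations
n_fits = n_candidates * cv_folds                # one fit per candidate per fold
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training data after the search.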

In [None]:
# Get the feature importances and their corresponding feature names
importances = rf_grid_search.best_rf_model.feature_importances_
indices = np.argsort(importances)[::-1]  # Sort in descending order

# Get the top features (those with importance >= 0.01)
top_features = []
feature_rank = []
for f in range(x_train.shape[1]):  # Iterate over the number of features
    if importances[indices[f]] >= 0.01:  # Only include features with importance >= 0.01
        top_features.append(feature_names[indices[f]])
        feature_rank.append(importances[indices[f]])
        print(f"{f + 1}. Feature {feature_names[indices[f]]} ({importances[indices[f]]})")

# Plot the feature importances (matplotlib is already imported at the top of the notebook)
plt.figure(figsize=(10, 6))
plt.title("Top Feature Importances (Importance >= 0.01)")
plt.barh(range(len(top_features)), feature_rank, align="center")
plt.yticks(range(len(top_features)), top_features)
plt.xlabel("Relative Importance")
plt.show()


## Section 3: Test and Results

### Test with selected top features

In [None]:
# Filter x_train and x_val down to the top features.
# top_features[2:] skips the two highest-ranked features and keeps the rest.
X_train_filtered = x_train[top_features[2:]]  # training set
X_val_filtered = x_val[top_features[2:]]      # validation set


Removal of 'id' and 'label' increased accuracy. 

In [None]:
# Initialize a new RandomForest model
new_rf_model = RandomForestClassifier(
    random_state=42,
    n_estimators=290,
    max_depth=11,
    min_samples_split=10,
    min_samples_leaf=2,
    class_weight='balanced',  # gives more weight to the underrepresented classes
)

# Train the new model on the filtered feature set
new_rf_model.fit(X_train_filtered, y_train)


In [None]:
# Predict on the filtered validation set
y_val_pred_filtered = new_rf_model.predict(X_val_filtered)

# Evaluate accuracy on the filtered validation set
val_accuracy_filtered = accuracy_score(y_val, y_val_pred_filtered)
print(f"Validation Accuracy with Top Features: {val_accuracy_filtered}")

print(classification_report(y_val, y_val_pred_filtered))


In [None]:
# Filter X_test to include only the top features
X_test_filtered = x_test[top_features[2:]]

# Predict on the filtered test set
y_test_pred_filtered = new_rf_model.predict(X_test_filtered)

# Evaluate accuracy on the test set
test_accuracy_filtered = accuracy_score(y_test, y_test_pred_filtered)
print(f"Test Accuracy with Top Features: {test_accuracy_filtered}")

print(classification_report(y_test, y_test_pred_filtered))


## Test Accuracy summary:
Overall accuracy of 78% on test data.  
Precision: The proportion of true positives (correctly predicted) out of all predicted positives.  
Recall: The proportion of true positives out of all actual positives.  
F1-Score: The harmonic mean of precision and recall, balancing both metrics.  
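The three definitions above can be written directly in terms of true positives (TP), false positives (FP), and false negatives (FN) for a single class. The helper name `precision_recall_f1` and the counts below are hypothetical and are not taken from the notebook's results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for a single attack class
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```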

Macro Average:
Precision = 0.48: On average, the model is 48% precise across all classes. The low precision for Analysis and Backdoor is dragging down the overall average.

Recall = 0.49: On average, the model is correctly identifying 49% of the true instances for each class. This shows that, overall, the model is missing many instances in the minority classes.

F1-Score = 0.46: The overall balance between precision and recall for all classes is still relatively low.

Weighted Average:  
Precision = 0.84: The overall model is precise for the dataset as a whole, heavily influenced by the large number of Normal and Exploits samples.

Recall = 0.78: This reflects the overall recall, again driven mostly by the majority classes (Normal and Exploits).

F1-Score = 0.79: A solid performance when weighted by the size of the classes, but still indicating that there’s room for improvement, particularly for the minority classes.
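The gap between the macro and weighted averages comes from how each class is weighted. The sketch below uses hypothetical per-class precision and support values (not the notebook's results) to show how a large, well-classified class lifts the weighted average while small, poorly classified classes pull the macro average down.

```python
import numpy as np

# Hypothetical per-class precision and support (number of true samples per class)
precision = np.array([0.95, 0.60, 0.10])
support = np.array([900, 80, 20])

macro = precision.mean()                           # every class counts equally
weighted = np.average(precision, weights=support)  # classes count by their size
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```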

#### Key Observations:  
Improvement in Minority Classes:

For the DoS class, recall has significantly improved from ~9% to 72%, indicating that SMOTE helped the model better identify DoS attacks.

The Backdoor class also saw a slight improvement in recall from 0% to 19%, but precision is still very low, which indicates the model is struggling to distinguish Backdoor attacks from other types of attacks.

The Analysis class still has very poor recall and precision, indicating that even with SMOTE, the model is having difficulty recognizing this class. This could be due to feature similarity with other classes or insufficient information in the features to differentiate it.

Exploits and Normal Classes:

The model performs well on the Exploits and Normal classes, though the recall for Exploits dropped compared to before. This could be due to the balancing caused by SMOTE, where more minority class samples were added, forcing the model to focus more on them and slightly reducing performance on the majority classes.


Calculate the ROC curves for interpretation: 

AUC (Area Under the Curve): The AUC value for each class gives a good indication of how well the model is distinguishing between that class and the others. A perfect model will have an AUC of 1.0, while a model that randomly guesses will have an AUC of 0.5.

Micro-average ROC: This curve gives an overall picture of the model's performance across all classes, treating each prediction equally.
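A small toy example may help make AUC and micro-averaging concrete. The labels and scores below are made up; the micro-average is computed the same way as in this notebook, by flattening the one-vs-rest label matrix and the probability matrix.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Binary toy case
y_true = np.array([0, 0, 1, 1])
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])     # every positive outranks every negative
imperfect = roc_auc_score(y_true, [0.1, 0.4, 0.35, 0.8])  # one positive ranked below a negative

# Micro-average for 3 classes: flatten the indicator matrix and the scores
y_multi = np.array([0, 1, 2, 2])
y_bin = label_binarize(y_multi, classes=[0, 1, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.2, 0.2, 0.6],
                  [0.4, 0.1, 0.5]])
micro = roc_auc_score(y_bin.ravel(), probs.ravel())
print(perfect, imperfect, micro)
```

AUC can be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which is why the second case scores below 1.0.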

In [None]:

# Use the model's own class order so the label columns line up with predict_proba
classes = new_rf_model.classes_
n_classes = len(classes)  # Number of classes in the dataset

# Binarize the labels for ROC computation
y_test_bin = label_binarize(y_test, classes=classes)  # One-vs-rest indicator matrix for the test labels

# Initialize dictionaries to hold False Positive Rate (FPR), True Positive Rate (TPR), and AUC for each class
fpr = dict()
tpr = dict()
roc_auc = dict()

y_test_prob = new_rf_model.predict_proba(X_test_filtered)

# Compute ROC curve and ROC area for each class
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_test_prob[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_test_prob.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])


Plot the curves using matplotlib.

In [None]:
# Plot all ROC curves
plt.figure(figsize=(10, 8))

# Plot ROC curve for each class
for i in range(n_classes):
    # Label each curve with the class name from the model instead of a bare index
    plt.plot(fpr[i], tpr[i], label=f'ROC curve for {new_rf_model.classes_[i]} (AUC = {roc_auc[i]:.2f})')

# Plot micro-average ROC curve
plt.plot(fpr["micro"], tpr["micro"],
         label=f'micro-average ROC curve (AUC = {roc_auc["micro"]:.2f})',
         linestyle=':', linewidth=4)

# Plot settings
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line for random guess
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) curves for multi-class classification')
plt.legend(loc="lower right")
plt.show()

## Conclusion  

This was a good insight into how Random Forests operate and how to incorporate a grid search to find the best parameters.  

After applying SMOTE to handle class imbalance and then focusing on the top features, a Random Forest model was trained and evaluated. ROC (Receiver Operating Characteristic) curves were used to visualize how well the model distinguishes between each class, providing further insight into the model's performance.

Key takeaways from this process:

Improved Performance on Minority Classes:

Applying SMOTE improved the recall for minority classes like DoS and Backdoor, though the model still struggles with precision for these classes.  
The ROC curves for the minority classes gave a more granular view of how well the model distinguishes these minority classes from others.

ROC Curves:

ROC curves were computed for each class, and the Area Under the Curve (AUC) values indicated how well the model distinguishes each class from the rest.
The micro-average ROC curve provided an aggregate performance measure across all classes, which is especially useful when dealing with imbalanced datasets.

Challenges:

Despite using SMOTE, some classes such as Analysis and Backdoor continued to show lower performance, highlighting that further improvements are necessary.
The ROC curve revealed that the model performs better on classes like Normal and Exploits, but struggles with minority classes, which are harder to classify even after balancing the data.

Potential Improvements:

Tuning Hyperparameters: Further hyperparameter tuning, especially with class weighting in Random Forest, could help further boost performance for minority classes.
Advanced Models: Using more advanced models like XGBoost or Gradient Boosting might yield better results for multi-class, imbalanced datasets.
Feature Engineering: Adding more meaningful features or refining existing ones could improve the model's ability to differentiate between the attack types, particularly for hard-to-classify groups.
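As a starting point for the class-weighting idea, it may help to see what `class_weight='balanced'` computes. The sketch below uses scikit-learn's `compute_class_weight` on a hypothetical label vector; the class counts are made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 90 'Normal' rows and 10 'Backdoor' rows
y = np.array(["Normal"] * 90 + ["Backdoor"] * 10)
classes = np.unique(y)  # sorted: ['Backdoor', 'Normal']

# 'balanced' uses n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_by_class = dict(zip(classes, weights))
print(weight_by_class)
```

`RandomForestClassifier` also accepts a dictionary for `class_weight`, so weights like these could be adjusted by hand for specific classes as part of further tuning.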

Overall Accuracy:

The overall accuracy was around 78%, and while the model performs well on majority classes (Normal, Exploits), it still requires fine-tuning to balance precision and recall for minority classes.