
**Introduction:**

Distributed Denial of Service (DDoS) attacks pose a significant threat to the availability of online services. These attacks involve attackers deliberately sending massive volumes of traffic to a target system, overwhelming its computational resources—such as CPU and memory—and rendering the services unavailable. Effectively identifying these malicious traffic flows is crucial to mitigating the impact of DDoS attacks.

In this project, we propose a machine learning-based approach to accurately identify attack flows, addressing one of the key challenges in the field: the classification of imbalanced data in a supervised learning context. Our methodology not only distinguishes between benign and malicious traffic but also offers insights into the effectiveness of different algorithms under conditions where class distributions are highly uneven. This research contributes to the development of more resilient online systems, capable of sustaining their operations under potential DDoS threats.

In [35]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib

**1. Data Collection:**

In this project, we use a dataset provided by Impact that captures traffic patterns associated with Mirai botnet behavior. The file, WIDE_Mirai_Undirection_features_without_IPAddress_120.csv, contains 117 extracted flow features plus a binary label, which support the identification and classification of DDoS attack flows.

In [36]:
data_path = '/content/WIDE_Mirai_Undirection_features_without_IPAddress_120.csv'
data = pd.read_csv(data_path)
data.head()

Unnamed: 0,Src port,Dst port,Protocol,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,Tot Len Fwd Pkts,Tot Len Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,...,Bwd max ICMP unreachable IAT,Fwd min ICMP unreachable IAT,Bwd min ICMP unreachable IAT,Fwd var ICMP unreachable IAT,Bwd var ICMP unreachable IAT,Fwd RST num,Bwd RST num,Fwd ICMP time out num,Bwd ICMP time out num,Label
0,57358,23,6,48098.59263,4,0.0,240,0.0,60.0,60.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,15995,23,6,0.0,1,0.0,60,0.0,60.0,60.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,54737,2323,6,0.0,1,0.0,60,0.0,60.0,60.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,34584,23,6,0.0,1,0.0,60,0.0,60.0,60.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,9092,23,6,11834.72724,10,0.0,600,0.0,60.0,60.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [37]:
# Print the shape of the DataFrame to see the dimensions of the data.
print(data.shape)

(230273, 118)


In [38]:
# Drop the 'Label' column from the dataset to separate the features from the target variable.
X = data.drop(['Label'], axis=1)

# Extract the 'Label' column to use as the target variable for model training.
y = data.Label

In [39]:
# Print the shapes of the features and target variable arrays to verify their dimensions.
# This step is crucial for ensuring that the feature matrix and target vector have consistent
# and expected number of entries for successful model training.
print(f'X shape: {X.shape}')

print(f'y shape: {y.shape}')

X shape: (230273, 117)
y shape: (230273,)


**Dataset Description:**

The dataset utilized in this project is characterized by a significant class imbalance, a common challenge in the domain of cybersecurity analytics, particularly when dealing with DDoS attacks. The distribution of the classes is as follows:

**Label 0 (Non-attack traffic): 200,001 instances**

**Label 1 (DDoS attack traffic): 30,272 instances**


This imbalance, roughly 6.6 benign flows for every attack flow (attack traffic makes up about 13% of the data), reflects real-world conditions, where normal traffic substantially outnumbers attack traffic. Such a disparity is a challenge for machine learning models because they tend to favor the majority class, which can produce many false negatives when predicting DDoS attacks. Addressing this imbalance is crucial for developing a classifier that detects attack flows reliably without sacrificing performance on either class.

In [40]:
class_distribution = y.value_counts()
print(class_distribution)

Label
0    200001
1     30272
Name: count, dtype: int64
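
To make the imbalance concrete, the following sketch derives the class proportions, the benign-to-attack ratio, and inverse-frequency class weights from the counts printed above. The weighting formula, n_samples / (n_classes * count), is the usual "balanced" heuristic; the models in this notebook do not use class weights, so treat it as an illustrative option rather than part of the pipeline.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 200_001, 1: 30_272}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the dataset.
proportions = {label: c / n_samples for label, c in counts.items()}

# Number of benign flows per attack flow.
imbalance_ratio = counts[0] / counts[1]

# Inverse-frequency ("balanced") weights: the rarer class gets the larger weight.
balanced_weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

print(f"proportions: {proportions}")
print(f"imbalance ratio (benign : attack): {imbalance_ratio:.2f} : 1")
print(f"balanced weights: {balanced_weights}")
```

A dictionary of this shape could be passed as `class_weight` to scikit-learn estimators that accept it, which is one possible way to counter the bias towards the majority class.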


**2. Feature Selection**

To improve the efficiency and performance of our machine learning model, we implemented a feature selection process aimed at removing redundant features from our dataset. Redundancy among features can often lead to overfitting and can negatively impact model performance, especially in datasets with a large number of features.

**Methodology:**
We used Pearson's correlation coefficient to evaluate the linear relationship between pairs of features. The rationale is that highly correlated features contribute similar information to the model, so one feature of each highly correlated pair can be removed without substantial loss of information.
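
As a small illustration of what the coefficient measures, the sketch below computes Pearson's r by hand for toy columns. The column names and values are hypothetical and chosen only so that one pair is perfectly linearly related and the other is only weakly related.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance numerator and the two sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical toy columns: total bytes is an exact multiple of the packet count.
packets = [1, 4, 10, 2, 7]
total_bytes = [60 * p for p in packets]  # perfectly linearly related to packets
other = [5, 1, 4, 2, 3]                  # only weakly related to packets

print(pearson_r(packets, total_bytes))
print(pearson_r(packets, other))
```

With a threshold of 0.7, the first pair would count as redundant and the second would not.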

In [41]:
from scipy import stats

def Redundant_features(X, threshold):
    """
    Evaluate redundant features in a dataset based on the Pearson correlation coefficient.

    Parameters:
    - X (pd.DataFrame or np.ndarray): The DataFrame or ndarray from which features should be evaluated.
    - threshold (float): The threshold for the Pearson correlation above which features are considered redundant.

    Returns:
    - list: The list of indices representing redundant features.

    Raises:
    - ValueError: If `X` is not a pandas DataFrame or numpy ndarray.
    - ValueError: If `threshold` is not a number (float or int).
    """
    # Validate input types
    if not isinstance(X, (pd.DataFrame, np.ndarray)):
        raise ValueError("X must be a pandas DataFrame or a numpy ndarray")
    if not isinstance(threshold, (float, int)):
        raise ValueError("threshold must be a float or an int")

    index_list = []
    num_features = X.shape[1]

    # Decide on the correct indexing based on the data type of X
    get_data_column = lambda i: X.iloc[:, i] if isinstance(X, pd.DataFrame) else X[:, i]

    # Evaluate the correlation of every pair (i, j) with i < j and flag the
    # later column j when the pair is correlated above the threshold
    for i in range(num_features):
        for j in range(i + 1, num_features):
            correlation, _ = stats.pearsonr(get_data_column(i), get_data_column(j))
            if abs(correlation) > threshold:
                index_list.append(j)

    # Remove duplicates and sort the list of indices
    redundant_indices = sorted(set(index_list))

    print(f'The index of redundant features: {redundant_indices}')
    print(f'Number of redundant features: {len(redundant_indices)}')

    return redundant_indices

Using a threshold of |correlation| > 0.7, we identified 69 redundant features.

In [43]:
redundant_indices = Redundant_features(X, 0.7)

The index of redundant features: [6, 7, 10, 11, 14, 15, 17, 18, 20, 21, 22, 23, 24, 25, 26, 30, 31, 33, 34, 35, 39, 40, 44, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 67, 68, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 88, 89, 90, 96, 97, 98, 99, 100, 101, 102, 106, 107, 110, 111, 112]
Number of redundant features: 69
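
The nested loop above calls `stats.pearsonr` once per column pair, which for 117 columns means 6,786 separate passes over the data. A minimal vectorized sketch of the same rule (flag column j if it correlates above the threshold with any earlier column i < j) is shown below. The function name `redundant_features_vectorized` is hypothetical, the sketch assumes at least two all-numeric columns, and it has not been checked against the notebook's output.

```python
import numpy as np

def redundant_features_vectorized(X, threshold):
    """Return sorted column indices j whose absolute correlation with at
    least one earlier column i < j exceeds `threshold`."""
    values = np.asarray(X, dtype=float)
    # Constant columns yield NaN correlations; silence the resulting warnings.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.abs(np.corrcoef(values, rowvar=False))
    # Keep only pairs with i < j (strict upper triangle); NaN > threshold is False.
    upper = np.triu(corr, k=1)
    flagged = np.where((upper > threshold).any(axis=0))[0]
    # Plain Python ints, so the result passes isinstance(i, int) checks.
    return [int(j) for j in flagged]

# Toy data: column 1 is exactly twice column 0, column 2 is weakly related.
demo = np.array([
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 2.0],
    [5.0, 10.0, 3.0],
])
print(redundant_features_vectorized(demo, 0.7))
```

Returning plain `int` values matters if the result is passed to a function that validates indices with `isinstance(i, int)`, because NumPy integer types do not pass that check.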


In [44]:
def drop_redundant_features(X, redundant_indices):

    """
    Removes redundant features from a DataFrame based on specified indices.

    Parameters:
    - X (pd.DataFrame): The DataFrame from which features should be dropped.
    - redundant_indices (list of int): The list of column indices to be removed.

    Returns:
    - pd.DataFrame: A new DataFrame with the specified features removed.

    Raises:
    - ValueError: If `X` is not a pandas DataFrame or if `redundant_indices` contains invalid indices.
    """
    # Check if X is a pandas DataFrame
    if not isinstance(X, pd.DataFrame):
        raise ValueError("X must be a pandas DataFrame")

    # Validate indices
    if not all(isinstance(i, int) for i in redundant_indices):
        raise ValueError("redundant_indices must contain only integers")

    # Ensure all indices are within the column range of X
    if max(redundant_indices, default=-1) >= len(X.columns) or min(redundant_indices, default=len(X.columns)) < 0:
        raise ValueError("redundant_indices contains out-of-bound indices")

    # Get column labels to drop
    column_labels_to_drop = X.columns[redundant_indices]

    # Drop the columns and return the filtered DataFrame
    X_filtered = X.drop(column_labels_to_drop, axis=1)
    return X_filtered

Drop the redundant features based on their indices:

In [45]:
# Remove redundant features from the DataFrame
X = drop_redundant_features(X, redundant_indices)

X.head()

Unnamed: 0,Src port,Dst port,Protocol,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Bwd Pkt Len Max,Bwd Pkt Len Min,...,Fwd SYN rate,Fwd ICMP unreachable num,Bwd ICMP unreachable num,Fwd ICMP unreachable rate,Bwd max ICMP unreachable IAT,Fwd min ICMP unreachable IAT,Fwd RST num,Bwd RST num,Fwd ICMP time out num,Bwd ICMP time out num
0,57358,23,6,48098.59263,4,0.0,60.0,60.0,0.0,0.0,...,8.3e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15995,23,6,0.0,1,0.0,60.0,60.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,54737,2323,6,0.0,1,0.0,60.0,60.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,34584,23,6,0.0,1,0.0,60.0,60.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,9092,23,6,11834.72724,10,0.0,60.0,60.0,0.0,0.0,...,0.000845,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
# Check and print the new shape of the DataFrame to verify the number of remaining features
print(X.shape)

(230273, 48)


Because the dataset is imbalanced, we use AUC to score each remaining feature individually and select the most informative ones.

**Justification for Using AUC:** The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) measures how well a single feature, used directly as a score, separates the two classes across all classification thresholds. It does not depend on the class distribution, which makes it well suited to imbalanced data:

**Robustness to Class Imbalance:** Unlike accuracy, AUC is not inflated by the high prevalence of one class over another. It evaluates how well a feature can distinguish between classes, which is crucial when the minority class is of greater interest.

**Feature Ranking Based on Discrimination Capability:** By calculating AUC scores for each feature, we can rank features based on their ability to differentiate between the two classes. Higher AUC scores indicate that a feature is more effective at distinguishing between the classes, thus more important for the model.
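
The two properties above can be checked on a toy example. The sketch below computes AUC from its pairwise definition (the probability that a randomly chosen attack flow has a larger feature value than a randomly chosen benign flow, with ties counted as one half) rather than through scikit-learn. The feature values and the helper name `pairwise_auc` are hypothetical.

```python
def pairwise_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy feature: larger values tend to indicate attack flows.
feature = [0.1, 0.2, 0.4, 0.35, 0.8, 0.7]
labels = [0, 0, 0, 1, 1, 1]
auc = pairwise_auc(feature, labels)

# Ten copies of every benign flow: the class ratio changes, the AUC does not.
neg_heavy_feature = feature[:3] * 10 + feature[3:]
neg_heavy_labels = labels[:3] * 10 + labels[3:]
auc_imbalanced = pairwise_auc(neg_heavy_feature, neg_heavy_labels)

# Negating the feature mirrors the AUC around 0.5, which is why a score
# below 0.5 can be replaced by 1 - score when only separability matters.
auc_negated = pairwise_auc([-v for v in feature], labels)

print(auc, auc_imbalanced, auc_negated)
```

The quadratic loop is only suitable for small examples; on the full dataset a rank-based routine such as `roc_auc_score` is the practical choice.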

In [49]:
from sklearn import metrics

def Calculate_AUC(X, y):
    """
    Calculate the AUC scores for each feature in the dataset.

    Parameters:
    - X (pd.DataFrame or np.ndarray): The input features data where each column represents a feature.
    - y (array-like): The target variable (binary) against which to evaluate the AUC score of each feature.

    Returns:
    - list: A list of AUC scores for each feature in the dataset.

    Raises:
    - ValueError: If X is neither a pandas DataFrame nor a numpy ndarray.
    """
    AUC = []
    # Iterate over each feature column in X
    for i in range(X.shape[1]):

        # Retrieve feature data based on the type of X
        if isinstance(X, pd.DataFrame):
            # Extract feature column from pandas DataFrame and convert to numpy array
            X_feature = X.iloc[:, i].values
        elif isinstance(X, np.ndarray):
            # Directly use the feature column from numpy array
            X_feature = X[:, i]
        else:
            # Raise error if X is not a DataFrame or ndarray
            raise ValueError("X must be a pandas DataFrame or a numpy ndarray.")

        # roc_auc_score expects a 1-D array of scores for a binary target,
        # so the raw feature values are passed directly as the score
        score = metrics.roc_auc_score(y, X_feature)

        # A score below 0.5 means the feature is inversely related to the label;
        # mirror it around 0.5 so it reflects separability in either direction
        if score < 0.5:
            score = 1 - score

        # Append the adjusted or correct AUC score to the list
        AUC.append(score)

    return AUC

In [52]:
# Calculate the AUC score of every feature in the DataFrame
AUC = Calculate_AUC(X, y)

print(AUC)

[0.9380570733906372, 0.8556746038448719, 0.8980480097599512, 0.5918913299857391, 0.5082382617784322, 0.5991120044399778, 0.587388052588014, 0.5802221699118777, 0.5991120044399778, 0.5991120044399778, 0.6156373747894739, 0.5017110777620002, 0.5043024784876076, 0.5043024784876076, 0.5030974845125775, 0.9506127469362653, 0.5043024784876076, 0.5018457634323878, 0.5030974845125775, 0.5000199999000006, 0.5000024999875001, 0.5000199999000006, 0.5000199999000006, 0.5000199999000006, 0.9186054069729651, 0.5933145334273329, 0.5166249168754156, 0.5007424962875185, 0.5166249168754156, 0.5007424962875185, 0.5071099644501778, 0.5001449992750036, 0.5149649251753741, 0.5149649251753741, 0.511797441012795, 0.5014274928625357, 0.9494735065469758, 0.5000299998500007, 0.5317016600038563, 0.5140374298128509, 0.5007549962250188, 0.5079974600127, 0.5000049999750001, 0.5079974600127, 0.5031874840625796, 0.5003399983000085, 0.5025274873625631, 0.5000049999750001]


In [54]:
def select_top_n_features(X, AUC, n):
    """
    Selects the top n features based on their AUC scores.

    Parameters:
    - X (pd.DataFrame): The DataFrame containing the features.
    - AUC (list): A list of AUC scores corresponding to each feature in X.
    - n (int): The number of top features to select based on AUC scores.

    Returns:
    - list: The names of the top n features with the highest AUC scores.
    """
    # Convert column names from the DataFrame to a list
    feature_names = X.columns.tolist()

    # Combine feature names with their corresponding AUC scores
    paired = list(zip(feature_names, AUC))

    # Sort the list of tuples by the AUC score in descending order
    paired_sorted = sorted(paired, key=lambda x: x[1], reverse=True)

    # Separate the sorted pairs back into two lists
    feature_names_sorted, AUC_sorted = zip(*paired_sorted)

    # Select the top 'n' feature names and their corresponding AUC scores
    top_n_feature_names = feature_names_sorted[:n]
    top_n_AUC_scores = AUC_sorted[:n]

    # Print the top n feature names and their AUC scores
    print("Top n feature names:", top_n_feature_names)
    print("Top n AUC scores:", top_n_AUC_scores)

    return top_n_feature_names

In [58]:
# Select top n features' names according to the AUC scores
top_n_feature_names = select_top_n_features(X, AUC, 10)

Top n feature names: ('Fwd PSH Flags', 'Fwd SYN num', 'Src port', 'Fwd ICMP num', 'Protocol', 'Dst port', 'Flow Bytes per second', 'Tot Bwd Pkts', 'Bwd Pkt Len Max', 'Bwd Pkt Len Min')
Top n AUC scores: (0.9506127469362653, 0.9494735065469758, 0.9380570733906372, 0.9186054069729651, 0.8980480097599512, 0.8556746038448719, 0.6156373747894739, 0.5991120044399778, 0.5991120044399778, 0.5991120044399778)


In [59]:
X_best_feature = X[list(top_n_feature_names)]
X_best_feature.head()

Unnamed: 0,Fwd PSH Flags,Fwd SYN num,Src port,Fwd ICMP num,Protocol,Dst port,Flow Bytes per second,Tot Bwd Pkts,Bwd Pkt Len Max,Bwd Pkt Len Min
0,0.0,4.0,57358,0.0,6,23,0.00499,0.0,0.0,0.0
1,0.0,1.0,15995,0.0,6,23,0.0,0.0,0.0,0.0
2,0.0,1.0,54737,0.0,6,2323,0.0,0.0,0.0,0.0
3,0.0,1.0,34584,0.0,6,23,0.0,0.0,0.0,0.0
4,0.0,10.0,9092,0.0,6,23,0.050698,0.0,0.0,0.0


We have removed redundant features and identified the ten most informative features in our imbalanced dataset. The refined feature set, stored in X_best_feature, is reassigned to X in the next cell for use in model development.

In [60]:
X = X_best_feature

**3. Model Development**

Before training, we apply L2 normalization to the dataset using sklearn.preprocessing.normalize. This step rescales each sample (row) so that its feature vector has a Euclidean length of one. Note that this is a per-sample operation: it does not by itself equalize the scales of individual features, for which a per-feature scaler such as StandardScaler would be needed.

In [22]:
from sklearn import preprocessing

X = preprocessing.normalize(X, norm = 'l2')
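
To show what row-wise L2 normalization does, here is a minimal NumPy sketch. The helper `l2_normalize_rows` is hypothetical and stands in for `preprocessing.normalize(X, norm='l2')`; leaving all-zero rows unchanged is an assumption of this illustration. The first two rows reuse the source port, destination port, and protocol of the first two flows shown earlier, and the third row is made up.

```python
import numpy as np

def l2_normalize_rows(X):
    """Scale every row of X to unit Euclidean norm; all-zero rows stay zero."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty rows
    return X / norms

# Columns: source port, destination port, protocol.
rows = np.array([
    [57358.0, 23.0, 6.0],
    [15995.0, 23.0, 6.0],
    [0.0, 0.0, 0.0],
])
normalized = l2_normalize_rows(rows)

print(np.linalg.norm(normalized, axis=1))  # row norms after scaling
print(normalized[:2])                      # the port column still dominates
```

Because the source port is orders of magnitude larger than the other values, it still accounts for almost the entire unit vector after normalization, which is worth keeping in mind for distance-based models such as KNN.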

In [23]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

KNN_clf = KNeighborsClassifier()

# Perform 10-fold cross-validation to evaluate the model
# 'f1_macro' scoring is used to get the F1 score for each class and average them
# This helps in evaluating the model fairly especially in imbalanced datasets
score = cross_val_score(KNN_clf, X, y, cv = 10, scoring = 'f1_macro')

# Output the average F1 score from the cross-validation
print(score.mean())

0.9640946577992183
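
The `f1_macro` score used here is the unweighted mean of the per-class F1 scores. The sketch below computes it by hand on hypothetical predictions to show why it is a stricter measure than accuracy when one class is rare. The helper names are illustrative and not part of scikit-learn.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical predictions: 8 benign flows, 2 attack flows, one attack missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

A single missed attack leaves accuracy high but pulls the attack-class F1 down sharply, and the macro average gives that class the same weight as the benign class.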


In [None]:
from sklearn.ensemble import RandomForestClassifier

RF_clf = RandomForestClassifier()

# Perform 10-fold cross-validation to evaluate the model
# 'f1_macro' scoring is used to get the F1 score for each class and average them
# This helps in evaluating the model fairly especially in imbalanced datasets
score = cross_val_score(RF_clf, X, y, cv = 10, scoring = 'f1_macro')

# Output the average F1 score from the cross-validation
print(score.mean())

0.9756989459932068


In [None]:
from sklearn.tree import DecisionTreeClassifier

DT_clf = DecisionTreeClassifier()

# Perform 10-fold cross-validation to evaluate the model
# 'f1_macro' scoring is used to get the F1 score for each class and average them
# This helps in evaluating the model fairly especially in imbalanced datasets
score = cross_val_score(DT_clf, X, y, cv = 10, scoring = 'f1_macro')

# Output the average F1 score from the cross-validation
print(score.mean())

0.9770745932094342


In [None]:
from sklearn.linear_model import LogisticRegression

LR_clf = LogisticRegression()

# Perform 10-fold cross-validation to evaluate the model
# 'f1_macro' scoring is used to get the F1 score for each class and average them
# This helps in evaluating the model fairly especially in imbalanced datasets
score = cross_val_score(LR_clf, X, y, cv = 10, scoring = 'f1_macro')

# Output the average F1 score from the cross-validation
print(score.mean())

0.9026263917378221


In [24]:
from sklearn.neural_network import MLPClassifier

MLP_clf = MLPClassifier()

# Perform 10-fold cross-validation to evaluate the model
# 'f1_macro' scoring is used to get the F1 score for each class and average them
# This helps in evaluating the model fairly especially in imbalanced datasets
score = cross_val_score(MLP_clf, X, y, cv = 10, scoring = 'f1_macro')

# Output the average F1 score from the cross-validation
print(score.mean())



0.936326278675697
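
The five evaluation cells above repeat the same pattern, so they could be collapsed into one loop. The sketch below is one possible way to do that. It runs on a small synthetic dataset (`X_demo`, `y_demo`, generated with about 13% positives) instead of the notebook's data, uses three of the five models, and uses 5 folds so that it finishes quickly; all of these are assumptions made for illustration. With an integer `cv`, scikit-learn uses stratified folds without shuffling for classifiers; passing a shuffled `StratifiedKFold` with a fixed `random_state` makes the splits explicit and identical for every model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's X and y (about 13% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.87, 0.13], random_state=0
)

# Fixed, stratified folds so that every model is scored on the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation across folds alongside the mean helps judge whether small gaps between models, such as the one between the decision tree and the random forest above, are meaningful.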


**Conclusion**

In this project, we developed a machine learning pipeline to identify DDoS attack flows as a basis for mitigation. Using a real-world dataset sourced from Impact, we removed 69 redundant features with a Pearson correlation filter, ranked the remaining 48 by per-feature AUC, and kept the top 10. Under 10-fold cross-validation with macro-averaged F1 as the metric and default hyperparameters, the decision tree (0.977) and random forest (0.976) performed best, followed by k-nearest neighbors (0.964), the multilayer perceptron (0.936), and logistic regression (0.903). These results suggest that a small set of flow-level features is enough to separate Mirai attack traffic from benign traffic in this dataset.