# AI-Driven DDoS Detection in Cloud Environments Using Machine Learning

## Project Overview

This project presents the design and implementation of an intelligent Intrusion Detection System (IDS) capable of detecting Distributed Denial-of-Service (DDoS) attacks in cloud computing environments using Machine Learning techniques.

Cloud infrastructures are highly vulnerable to DDoS attacks, in which malicious actors flood servers with massive traffic to exhaust bandwidth and computing resources and, as a result, degrade service availability. Traditional rule-based detection systems often fail to identify evolving and zero-day attack patterns.

To address this limitation, this project trains a supervised Machine Learning classifier (a Random Forest) on labeled network traffic flows so that traffic can be classified automatically as benign or malicious.

The system analyzes statistical flow features such as packet counts, byte rates, duration metrics, and protocol behavior to identify abnormal traffic patterns associated with DDoS attacks.
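
As a rough illustration of what such flow features are, the sketch below aggregates a handful of made-up packet records into per-flow statistics. The column names (`flow_id`, `timestamp`, `size`) and the values are assumptions for this example only; the real dataset ships with its features already extracted.

```python
import pandas as pd

# Hypothetical packet records: flow identifier, arrival time (s), size (bytes)
packets = pd.DataFrame({
    "flow_id":   ["a", "a", "a", "b", "b"],
    "timestamp": [0.0, 0.5, 2.0, 1.0, 1.1],
    "size":      [100, 300, 200, 1500, 1500],
})

# Collapse packets into one row per flow
flows = packets.groupby("flow_id").agg(
    packet_count=("size", "count"),
    total_bytes=("size", "sum"),
    start=("timestamp", "min"),
    end=("timestamp", "max"),
)
flows["duration"] = flows["end"] - flows["start"]
# A zero-duration flow would yield an infinite rate here
flows["bytes_per_second"] = flows["total_bytes"] / flows["duration"]

print(flows[["packet_count", "total_bytes", "duration", "bytes_per_second"]])
```

In this toy data, flow `b` sends far more bytes per second than flow `a`, which is the kind of contrast a classifier can pick up on.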


In [None]:
import pandas as pd
import numpy as np

## Dataset Loading

The dataset used in this project is derived from the CIC-DDoS2019 intrusion detection dataset, which contains labeled network traffic flows representing various DDoS attack types and benign traffic.

The dataset is stored in **Parquet format**, a columnar storage file type optimized for high-performance data analytics.

Each row in the dataset represents a network traffic flow, and each column represents a statistical feature extracted from packet capture data.


In [None]:
data = pd.read_parquet("/kaggle/input/cicddos2019/UDP-training.parquet")
data.head()

## Dataset Inspection and Exploratory Analysis

Before training Machine Learning models, it is essential to understand the structure and composition of the dataset.

In this step, we examine:

- Dataset dimensions  
- Feature names  
- Label distribution  

This helps identify class imbalance issues and validates dataset readiness for model training.


In [None]:
print("Dataset Shape:", data.shape)
print("\nColumns:\n", data.columns)

print("\nLabel Distribution:\n")
print(data['Label'].value_counts())
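
To make class imbalance concrete, the following sketch turns raw label counts into shares and a single imbalance ratio. The labels are toy values chosen for illustration, not figures from the actual dataset.

```python
import pandas as pd

# Toy labels for illustration only; not taken from the real dataset
labels = pd.Series(["UDP"] * 90 + ["BENIGN"] * 10, name="Label")

counts = labels.value_counts()                 # absolute counts per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(shares)
print("Imbalance ratio:", imbalance_ratio)
```

A ratio far above 1 suggests that accuracy alone may be misleading and that per-class precision and recall deserve attention.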

## Feature and Target Separation

Machine Learning models require input features (independent variables) and output labels (dependent variable).

- Features (X): Traffic characteristics  
- Target (y): Attack classification label  


In [None]:
X = data.drop('Label', axis=1)
y = data['Label']

print(X.shape)
print(y.shape)
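
Flow-feature tables can contain non-numeric columns or infinite rates (for example when a flow has zero duration). Whether this particular file does is not checked above, so the sketch below is only a defensive pattern on a toy frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "Flow Duration": [10, 0, 25],
    "Flow Bytes/s": [1200.0, np.inf, 800.0],
    "Source IP": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # non-numeric column
    "Label": ["UDP", "BENIGN", "UDP"],
})

# Keep numeric features only, then treat +/-inf as missing
X_toy = toy.drop(columns="Label").select_dtypes(include="number")
X_toy = X_toy.replace([np.inf, -np.inf], np.nan)

# Drop rows with any missing feature and keep the labels aligned
keep = X_toy.notna().all(axis=1)
X_toy, y_toy = X_toy[keep], toy.loc[keep, "Label"]

print(X_toy.shape, list(y_toy))
```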

## Label Encoding

Textual labels must be converted into numerical form so that Machine Learning models can process them mathematically.


In [None]:
from sklearn.preprocessing import LabelEncoder

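le = LabelEncoder()
y = le.fit_transform(y)  # classes are sorted, then numbered from 0

# Show which integer each original label maps to
print(dict(zip(le.classes_, range(len(le.classes_)))))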
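
The encoding is reversible, which matters when reporting predictions in human-readable form. A minimal sketch with illustrative label names:

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["UDP", "BENIGN", "UDP", "MSSQL"]  # illustrative label names

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_labels)

# classes_ is sorted, and each label's code is its index in classes_
mapping = {name: code for code, name in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(codes), list(decoded))
```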

## Train-Test Split

The dataset is divided into training (70%) and testing (30%) subsets to evaluate model performance on unseen data. The split is stratified by label so that both subsets keep the same class proportions.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

print("Train Shape:", X_train.shape)
print("Test Shape:", X_test.shape)
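
The effect of `stratify` is easiest to see on a small imbalanced example. The arrays below are synthetic and serve only to show that the minority share is preserved on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 10% belong to the minority class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("Minority share (train):", y_tr.mean())
print("Minority share (test): ", y_te.mean())
```

Without `stratify`, a random split of a heavily imbalanced dataset can leave the test set with very few minority samples.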

## Feature Scaling

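Feature scaling standardizes each feature to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training. Tree-based models such as Random Forest are largely insensitive to feature scale, so this step mainly keeps the pipeline reusable for scale-sensitive models such as logistic regression or SVMs.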


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
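
A small numeric sketch of what `StandardScaler` does, using toy arrays: the statistics come from the training rows, and the test row is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

toy_scaler = StandardScaler().fit(train)    # learns per-column mean and std
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)    # reuses the training statistics

print("Learned means:", toy_scaler.mean_)
print("Scaled train column means:", train_scaled.mean(axis=0))
print("Scaled test row:", test_scaled)
```

The test row lands outside the training range, which is expected: it is scaled relative to what the model saw during training.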

## Random Forest Model Training

Random Forest is an ensemble classifier that trains many decision trees on bootstrap samples of the training data and combines their predictions, classically by majority vote. Here the forest uses 100 trees.


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

rf_model.fit(X_train, y_train)
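
The voting idea can be inspected directly on synthetic data. The sketch below collects each tree's prediction and compares the hand-computed majority with the forest's own output. Note that scikit-learn combines trees by averaging their class probabilities rather than counting hard votes, so the two are expected to agree closely but are not guaranteed to be identical in general.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# One row of 0/1 predictions per tree: shape (n_trees, n_samples)
votes = np.stack([tree.predict(X_toy) for tree in forest.estimators_])

# Hand-computed majority vote across trees (odd tree count avoids ties)
majority = (votes.mean(axis=0) > 0.5).astype(int)
agreement = (majority == forest.predict(X_toy)).mean()

print("Votes shape:", votes.shape)
print("Agreement with forest.predict:", agreement)
```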

## Traffic Prediction

The trained model is used to classify unseen traffic samples.


In [None]:
y_pred = rf_model.predict(X_test)
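
Besides hard labels, a fitted `RandomForestClassifier` exposes per-class probabilities through `predict_proba`. The sketch below, on synthetic data, shows how a stricter alert threshold could be applied; the value `0.8` is a hypothetical choice, not a recommendation from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem: class 1 plays the role of "attack"
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=1)
clf = RandomForestClassifier(n_estimators=20, random_state=1)
clf.fit(X_toy[:150], y_toy[:150])

proba = clf.predict_proba(X_toy[150:])   # one column per class
attack_score = proba[:, 1]

threshold = 0.8                          # hypothetical alert threshold
alerts = attack_score >= threshold
default_pred = clf.predict(X_toy[150:])  # most probable class

print("Alerts at threshold:", int(alerts.sum()))
print("Default positive predictions:", int((default_pred == 1).sum()))
```

Raising the threshold trades recall for fewer false alarms, which can matter when every alert triggers mitigation.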

## Model Evaluation

Evaluation metrics include Accuracy, Precision, Recall, and F1-Score.


In [None]:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
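
To show how these metrics relate to raw prediction counts, the sketch below computes precision, recall, and F1 by hand for a toy binary example and checks them against scikit-learn. The label vectors are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy ground truth and predictions (1 = attack, 0 = benign)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # attacks caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # attacks missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```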

## Confusion Matrix Visualization

A confusion matrix visualizes classification performance and misclassification patterns.


In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

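# Fix the label order so rows and columns line up with the class names
cm = confusion_matrix(y_test, y_pred, labels=list(range(len(le.classes_))))

sns.heatmap(cm, annot=True, fmt='d', xticklabels=le.classes_, yticklabels=le.classes_)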
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
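
For a binary case, the four cells of the matrix can be unpacked and turned into rates that are often quoted for intrusion detection, such as the false positive rate. A minimal sketch on toy labels (not results from this project):

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = attack, 0 = benign)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes
cm_toy = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = cm_toy.ravel()

false_positive_rate = fp / (fp + tn)  # benign flows flagged as attacks
false_negative_rate = fn / (fn + tp)  # attacks that slipped through

print(cm_toy)
print("FPR:", false_positive_rate, "FNR:", false_negative_rate)
```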