# Capstone - Cyberattack Data Science

This notebook explores a synthetic but realistic dataset of cybersecurity attacks, curated by Incribo [on Kaggle](https://www.kaggle.com/datasets/teamincribo/cyber-security-attacks). It simulates 40,000 records across 25 fields, including IP addresses, protocols, packet types, malware indicators, anomaly scores, and attack signatures.

## Objective
The goal of this project is to build a predictive model that identifies Distributed Denial of Service (DDoS) attacks from network traffic data, using a labeled dataset with three attack types (DDoS, Malware, Intrusion).
This project blends exploratory data analysis (EDA), feature engineering, and machine learning to uncover insights that could inform real-world intrusion detection systems (IDS) and cybersecurity strategies.

In [1]:
# import necessary libraries
import pandas as pd        # Data manipulation
import numpy as np         # Numerical operations
import matplotlib.pyplot as plt   # Basic plotting
import seaborn as sns             # Statistical plots
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
# Read the CSV file into a DataFrame
df = pd.read_csv("cybersecurity_attacks.csv")

In [3]:
# Look at the shape (rows, columns)
df.shape

(40000, 25)

In [None]:
# Information about the Dataset
df.info()

In [None]:
df.head(3).T

In [None]:
#Understand data distribution
df.describe()

In [None]:
# Timestamp Parsing
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Hour'] = df['Timestamp'].dt.hour
df['Day'] = df['Timestamp'].dt.dayofweek
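
To make the derived columns concrete, here is a minimal sketch on made-up timestamps (the values below are illustrative, not taken from the dataset). `dt.dayofweek` counts from Monday = 0, so `Day` is a weekday index rather than a day of the month. Passing `errors='coerce'` is an optional assumption that turns unparseable strings into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample values; the second one is deliberately malformed
sample = pd.DataFrame({'Timestamp': ['2023-05-30 06:33:58', 'not a date']})

# errors='coerce' converts unparseable strings to NaT instead of raising
sample['Timestamp'] = pd.to_datetime(sample['Timestamp'], errors='coerce')
sample['Hour'] = sample['Timestamp'].dt.hour
sample['Day'] = sample['Timestamp'].dt.dayofweek  # Monday = 0, Sunday = 6

print(sample)
```

Rows that fail to parse end up with missing `Hour` and `Day` values, so they show up in the missing-value check.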


In [None]:
# Check for Missing Values
df.isnull().sum().sort_values(ascending=False)

In [None]:
# Fill in missing values for this column
df['Malware Indicators'] = df['Malware Indicators'].fillna('Ioc Detected')

In [None]:
# Fill in missing values for this column
df['Alerts/Warnings'] = df['Alerts/Warnings'].fillna('Alert Triggered')

In [None]:
# Fill in missing values for this column
df['Firewall Logs'] = df['Firewall Logs'].fillna('Log Data')

In [None]:
#check null values for Proxy Information
df['Proxy Information'].isnull().sum()

In [None]:
# Fill in missing values for this column
df['IDS/IPS Alerts'] = df['IDS/IPS Alerts'].fillna('Alert Data')

In [None]:
# Re-check missing values after filling
df.isnull().sum().sort_values(ascending=False)
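
One thing worth checking after these fills: if a column's only observed value is the same string used to fill it, the column becomes constant and carries no information for the model. The sketch below uses a made-up column to illustrate, and the `'No Alert'` placeholder is an assumption rather than a value from the dataset.

```python
import pandas as pd

# Hypothetical column where a missing value means "nothing was logged"
toy = pd.DataFrame(
    {'Alerts/Warnings': ['Alert Triggered', None, 'Alert Triggered', None]})

# Filling with the only observed value collapses the column to a constant
same_fill = toy['Alerts/Warnings'].fillna('Alert Triggered')

# Filling with a distinct placeholder (an assumed label) keeps the signal
distinct_fill = toy['Alerts/Warnings'].fillna('No Alert')

print(same_fill.nunique())      # number of distinct values after filling
print(distinct_fill.nunique())
```

Comparing `df[col].value_counts(dropna=False)` before and after a fill shows whether the filled rows merged into an existing category.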

In [None]:
# Count the unique attack types
df['Attack Type'].nunique()

In [None]:
#check numerical features
numeric_columns = df.select_dtypes(include='number').columns
numeric_columns

In [None]:
# check for outliers
sns.boxplot(x='Severity Level', y='Packet Length', data=df)
plt.show()

In [None]:
# Alerts & Attacks mapping
pd.crosstab(df['Attack Type'], df['Alerts/Warnings'])

In [None]:
# Pair plot to inspect pairwise relationships

# Select numerical features only (Protocol and Severity Level are
# categorical, so they are left out of this list)
numerical_features = [
    'Source Port', 'Destination Port',
    'Packet Length', 'Anomaly Scores',
    'Hour', 'Day'
]


# Create a subset of the DataFrame
pairplot_df = df[numerical_features + ['Attack Type']].copy()

# downsample because dataset is large
pairplot_df = pairplot_df.sample(n=1000, random_state=42)

# Plot
sns.pairplot(pairplot_df, hue='Attack Type', palette='Set1', diag_kind='kde')
plt.suptitle('Pair Plot of Numerical Features Colored by Attack Type', y=1.02)
plt.show()

In [None]:
# Check for correlation

# Create a subset of the DataFrame
corr_df = df.select_dtypes(include='number')

# Compute correlation matrix
corr_matrix = corr_df.corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

In [None]:
# Average packet length, hour, and day of week per attack type
df.groupby('Attack Type')[['Packet Length', 'Hour', 'Day']].mean()

In [None]:
df['Attack Type'].values

In [None]:
# Standardize text format
df['Attack Type'] = df['Attack Type'].str.strip().str.lower()

# One-hot encode the three attack types (is_ddos, is_intrusion, is_malware)
attack_dummies = pd.get_dummies(df['Attack Type'], prefix='is')

# Drop any existing one-hot columns before adding new ones,
# so the cell can be re-run safely
df.drop(columns=[col for col in attack_dummies.columns if col in df.columns], inplace=True)

# Concatenate the new columns to the original DataFrame
df = pd.concat([df, attack_dummies], axis=1)
df.head()

In [None]:
# Drop the dummy columns that are not needed; only is_ddos is kept as the target
df.drop(columns=[
    'is_intrusion',
    'is_malware',
], inplace=True)

In [None]:
df.head()

In [None]:
df.drop(columns=['Attack Type'], inplace=True)
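
For reference, the one-hot-then-drop steps above amount to a single comparison against the lowercased label. A minimal sketch on made-up labels (the `toy` frame is illustrative only):

```python
import pandas as pd

# Hypothetical labels with inconsistent casing and whitespace
toy = pd.DataFrame({'Attack Type': ['DDoS', ' Malware', 'intrusion ', 'ddos']})

# Same normalization as in the notebook
labels = toy['Attack Type'].str.strip().str.lower()

# Route 1: one-hot encode, then keep only the DDoS column
via_dummies = pd.get_dummies(labels, prefix='is')['is_ddos'].astype(int)

# Route 2: direct comparison against the normalized label
direct = (labels == 'ddos').astype(int)

print(direct.tolist())
```

Both routes give the same binary target; the direct comparison simply avoids creating columns that are dropped right away.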

In [None]:
# Select the independent numerical features
numerical_features = [
    'Source Port', 'Destination Port',
    'Packet Length', 'Anomaly Scores',
    'Hour', 'Day'
]

In [None]:
categorical_features = [
    'Protocol', 'Packet Type', 'Traffic Type',
    'Malware Indicators', 'Alerts/Warnings',
    'Severity Level', 'Firewall Logs', 'IDS/IPS Alerts'
]

## Objective
The goal of this project was to build a predictive model capable of identifying Distributed Denial of Service (DDoS) attacks from network traffic data. Using a labeled dataset with three attack types (DDoS, Malware, Intrusion), we engineered a binary target variable (is_ddos) and trained a logistic regression classifier to distinguish DDoS traffic from other forms of malicious activity.

### Model Overview
Model Type: Logistic Regression

Target Variable: is_ddos (1 = DDoS attack, 0 = other)

Features Used: A combination of numerical indicators (e.g., packet length, anomaly scores, port numbers) and categorical indicators (e.g., protocol, severity level, IDS/IPS alerts) relevant to DDoS behavior

Class Imbalance Handling: Applied class_weight='balanced' because DDoS is the minority class (about one third of the test set: 2,686 of 8,000 records)
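
As a rough illustration of what `class_weight='balanced'` does: scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below uses label counts matching the test-set confusion matrix reported later in this notebook (5,314 non-DDoS, 2,686 DDoS) purely as example numbers; the model itself derives its weights from the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Example labels: 5314 zeros (other) and 2686 ones (DDoS)
y_example = np.array([0] * 5314 + [1] * 2686)

# Library helper for the 'balanced' heuristic
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_example)

# The same values computed by hand: n_samples / (n_classes * class_count)
manual = len(y_example) / (2 * np.bincount(y_example))

print(dict(zip([0, 1], weights.round(3))))
```

The minority class receives the larger weight, so misclassifying a DDoS record costs more during training than misclassifying a non-DDoS record.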

In [None]:
# Select independent and dependent features (X and y)
X = df[numerical_features + categorical_features].copy()

# One-hot encode categorical features
X = pd.get_dummies(X, columns=categorical_features, drop_first=True)

# Define the target
y = df['is_ddos'].astype(int)

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Instantiate the model
model = LogisticRegression(class_weight='balanced', max_iter=10000)

In [None]:
# Fit the model on the training set
model.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

In [None]:
# Compute evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

This produces the following confusion matrix:

- True Negatives: 2621
- False Positives: 2693
- False Negatives: 1306
- True Positives: 1380
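
The headline metrics follow directly from these four counts. A small worked calculation (plain arithmetic, using only the counts listed above):

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 2621, 2693, 1306, 1380

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of DDoS records in the test set, and accuracy of always predicting 0
base_rate = (tp + fn) / total
null_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(f"base_rate={base_rate:.3f} null_accuracy={null_accuracy:.3f}")
```

Comparing `precision` with `base_rate`, and `accuracy` with `null_accuracy`, gives a quick sense of how much the model adds over guessing.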

In [None]:
# Compare to the null model, which always predicts 0 (not DDoS)
y_null_pred = [0] * len(y_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Null Accuracy:", accuracy_score(y_test, y_null_pred))
print("Null Precision:", precision_score(y_test, y_null_pred, zero_division=0))
print("Null Recall:", recall_score(y_test, y_null_pred, zero_division=0))
print("Null F1 Score:", f1_score(y_test, y_null_pred, zero_division=0))
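
scikit-learn also ships a `DummyClassifier` that can serve as the null model. A minimal sketch on made-up data (the arrays below are illustrative, not the notebook's split):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 negatives, 3 positives; the dummy ignores the features
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))

# 'most_frequent' always predicts the majority class seen during fit
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_toy, y_toy)
null_pred = null_model.predict(X_toy)

print(accuracy_score(y_toy, null_pred))
```

In the notebook, the same idea would mean fitting on `X_train, y_train` and scoring on `X_test, y_test`, which keeps the baseline on the same footing as the real model.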

## Conclusion
The logistic regression model provides a first baseline for DDoS detection, but the confusion matrix above shows limited predictive power. With class_weight='balanced', the model identifies about 51% of true DDoS cases (recall of 1380/2686), yet its precision is about 34% (1380/4073), close to the share of DDoS records in the test set, and its accuracy of about 50% is below the roughly 66% obtained by always predicting the majority class. Future improvements may include feature scaling, advanced classifiers (e.g., Random Forest, XGBoost), and deeper feature engineering to raise precision without sacrificing recall.
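
As one possible starting point for the feature-scaling idea, the sketch below wraps `StandardScaler` and `LogisticRegression` in a `Pipeline`. It runs on synthetic data from `make_classification` (an assumption for illustration), so the printed score says nothing about the attack dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly two thirds negative)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.66, 0.34], random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# The scaler is fit on the training fold only, then applied to the test fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

demo_f1 = f1_score(y_te, pipe.predict(X_te))
print("F1 on synthetic data:", demo_f1)
```

For the notebook's own features, scaling would apply to the numerical columns only; a `ColumnTransformer` is one way to combine that with the one-hot encoded categorical columns.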