# About Data

1. The dataset contains 25 columns, including:
  * Network Traffic Features
  * Security Analysis Features
  * User & Device information


2. Define the Goal:
  * Predict whether an attack will be logged, blocked, or ignored(Action Taken)
  * Predict 'Severity level'
  * Detect anomalies in network traffic

3. Preprocessing:
  * Convert categorical variables to numerical form
  * Handle missing values
  * Extract useful features

4. Choose a Model:
  * Classification Models(for 'Action Taken' or 'Severity Level')
    * Logistic Regression, Random Forest, XGBoost, Neural Networks
  * Anomaly Detection(for identifying suspicious traffic)
    * isolation Forest, One-Class SVM, autoencoders



In [2]:
! kaggle datasets download teamincribo/cyber-security-attacks
! unzip cyber-security-attacks.zip

Dataset URL: https://www.kaggle.com/datasets/teamincribo/cyber-security-attacks
License(s): apache-2.0
cyber-security-attacks.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  cyber-security-attacks.zip
  inflating: README.md               
  inflating: cybersecurity_attacks.csv  


In [3]:
import pandas as pd

df = pd.read_csv('cybersecurity_attacks.csv')
df.head()

Unnamed: 0,Timestamp,Source IP Address,Destination IP Address,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,...,Action Taken,Severity Level,User Information,Device Information,Network Segment,Geo-location Data,Proxy Information,Firewall Logs,IDS/IPS Alerts,Log Source
0,2023-05-30 06:33:58,103.216.15.12,84.9.164.252,31225,17616,ICMP,503,Data,HTTP,Qui natus odio asperiores nam. Optio nobis ius...,...,Logged,Low,Reyansh Dugal,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment A,"Jamshedpur, Sikkim",150.9.97.135,Log Data,,Server
1,2020-08-26 07:08:30,78.199.217.198,66.191.137.154,17245,48166,ICMP,1174,Data,HTTP,Aperiam quos modi officiis veritatis rem. Omni...,...,Blocked,Low,Sumer Rana,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment B,"Bilaspur, Nagaland",,Log Data,,Firewall
2,2022-11-13 08:23:25,63.79.210.48,198.219.82.17,16811,53600,UDP,306,Control,HTTP,Perferendis sapiente vitae soluta. Hic delectu...,...,Ignored,Low,Himmat Karpe,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,Segment C,"Bokaro, Rajasthan",114.133.48.179,Log Data,Alert Data,Firewall
3,2023-07-02 10:38:46,163.42.196.10,101.228.192.255,20018,32534,UDP,385,Data,HTTP,Totam maxime beatae expedita explicabo porro l...,...,Blocked,Medium,Fateh Kibe,Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ...,Segment B,"Jaunpur, Rajasthan",,,Alert Data,Firewall
4,2023-07-16 13:11:07,71.166.185.76,189.243.174.238,6131,26646,TCP,1462,Data,DNS,Odit nesciunt dolorem nisi iste iusto. Animi v...,...,Blocked,Low,Dhanush Chad,Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...,Segment C,"Anantapur, Tripura",149.6.110.119,,Alert Data,Firewall


# Preprocessing
1. Handling missing values
2. encode categorical variables
3. converting timestamp to features
4. feature selection & scaling



In [17]:
# 1. Handling missing values
# Check for missing values
missing_values = df.isnull().sum()
''' this stores the count of missing values for each column, where the index
represents column names and the value represent the count of null values. '''
print(missing_values[missing_values > 0]) # This filters the Series to show only those columns where the count of missing values is greater than zero.


Malware Indicators    20000
Proxy Information     19851
Firewall Logs         19961
IDS/IPS Alerts        20050
dtype: int64


In [20]:
missing_columns = missing_values[missing_values > 0].index

def fill_missing_values(df, missing_columns):
  for column in missing_columns:
    # If some categorical columns have missing values, fill them with 'unknown'
    if df[column].dtype == 'object':
      df[column].fillna('unknown', inplace=True)
    # If numerical columns have missing values, fill them with their median
    else:
      df[column].fillna(df[column].mean(), inplace=True)

fill_missing_values(df, missing_columns)

missing_values = df.isnull().sum()
''' this stores the count of missing values for each column, where the index
represents column names and the value represent the count of null values. '''
print(missing_values[missing_values > 0]) # This filters the Series to show only those columns where the count of missing values is greater than zero.


Series([], dtype: int64)


In [21]:
# 2. Encode categorical variables

# OHE(for columns with a few categories)
# drop_first=True removes one of the columns to prevent multicollinearity in regression model
# Multicollinearity: occurs when two or more independent variables in a regression model are highly correlated with each other.
df = pd.get_dummies(df, columns=['Protocol', 'Traffic Type', 'Severity Level'], drop_first=True)

# Label Encoding(for Action Taken since it's our target variable in classificaiton)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['Action Taken'] = encoder.fit_transform(df['Action Taken'])

In [24]:
# 3. Converting timestamp to features

# Convert the 'Timestamp' column from a string or integer format into a pandas datetime obeject
# This enables easy extraction of date-realated components.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df['Hour'] = df['Timestamp'].dt.hour
df['Day'] = df['Timestamp'].dt.day
df['Month'] = df['Timestamp'].dt.month
df['Year'] = df['Timestamp'].dt.year

df.drop(columns=['Timestamp'], inplace=True)  # Drop original timestamp

Unnamed: 0,Hour
0,6
1,7
2,8
3,10
4,13


In [25]:
# 4. Feature Selection & Scaling

# We should normalize numeric columns so that they are on a similar scale
from sklearn.preprocessing import StandardScaler

numeric_columns = []
# Select numeric columns
for column in df.columns:
  if df[column].dtype == 'int64' or df[column].dtype == 'float64':
    numeric_columns.append(column)

# Apply Standard Scaling
scaler = StandardScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Preprocessing Done!!


# Classification Models

1. Define the Classification Problem
  * build models to predict 1) Action Taken 2) Severity Level
2. Prepare the Data
  * Features(X): All relevant network traffic attributes(ports, protocol, packet length, etc.)
  * Target(y): Action Taken(Encoded as 0=Ignored, 1=Logged, 2=Blocked)
  * Split the dataset into training(80%) and testig(20%) sets.
  * Train multiple models and compare their accuracy
3. Train Classification models
  * Logistic Regression(Baseline)
  * Random Forest(Good for feature importance)
  * XGBoost(High performance)

4. Evaluation
  * Accuracy score
  * Confusion matrix
  * Feature importance

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt



# Selecting features (X) and target variable (y)
X = df.drop(columns=['Action Taken', 'Source IP Address', 'Destination IP Address', 'Packet Type', 'Payload Data', 'Malware Indicators', 'Alerts/Warnings'])
y = df['Action Taken']

# Dropping non-relevant categorical columns
columns_to_drop = ['User Information', 'Device Information', 'Geo-location Data', 'Proxy Information']
X = X.drop(columns=columns_to_drop)

# One-Hot Encoding relevant categorical columns
X = pd.get_dummies(X, columns=['Attack Type', 'Attack Signature', 'Network Segment', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source'], drop_first=True)


# Splitting data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Training Logistic Regression Model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)
log_acc = accuracy_score(y_test, y_pred_log)

# Training Random Forest Model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred_rf)

# Training XGBoost Model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
xgb_acc = accuracy_score(y_test, y_pred_xgb)

# Model Accuracy Comparison
model_performance = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "XGBoost"],
    "Accuracy": [log_acc, rf_acc, xgb_acc]
})

# Display results
tools.display_dataframe_to_user(name="Model Performance Comparison", dataframe=model_performance)

# Confusion Matrix for Best Model (XGBoost in most cases)
best_model = xgb_model if xgb_acc >= max(log_acc, rf_acc) else (rf_model if rf_acc >= log_acc else log_model)
y_pred_best = y_pred_xgb if best_model == xgb_model else (y_pred_rf if best_model == rf_model else y_pred_log)

plt.figure(figsize=(6, 5))
sns.heatmap(confusion_matrix(y_test, y_pred_best), annot=True, fmt="d", cmap="Blues", xticklabels=encoder.classes_, yticklabels=encoder.classes_)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix of Best Model")
plt.show()

# Classification Report
report = classification_report(y_test, y_pred_best, target_names=encoder.classes_)
print("Classification Report for Best Model:\n", report)


ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.