# Project 4: Clustering — Anomaly Detection in Network Traffic

### What’s the Problem We’re Solving?
In this project, I wanted to see if we could spot suspicious or potentially dangerous network traffic without relying on labels. Basically, can we detect attacks using clustering — without being told what’s an attack and what isn’t?

This idea comes from real-world challenges in cybersecurity. New types of attacks show up all the time, and if we rely only on labeled data, we might miss something critical. That’s where clustering comes in — grouping similar behavior to see if certain patterns stand out.

**Main questions:**
- Can we cluster network data in a way that separates normal from attack traffic?
- Are certain features better at helping us distinguish these patterns?
- Could this method uncover behaviors that traditional systems miss?

### What Is Clustering (and Why Does It Help Here?)
Clustering is all about grouping things that behave alike. Unlike classification (which needs labels), clustering just looks for natural groupings in the data. It’s an unsupervised technique — which is perfect for spotting patterns when we don’t know what to expect.

**K-Means Clustering** is one of the most common methods. It splits the data into a chosen number of groups (k) by minimizing the squared distance between each point and the center of its assigned cluster.
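To make the idea concrete, here is a minimal sketch of the loop behind K-Means (Lloyd's algorithm) in plain NumPy. The function name `kmeans_sketch` and the toy points are made up for illustration; the project itself uses scikit-learn's `KMeans`.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy groups (made-up numbers)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary. What matters is which points end up sharing a number.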

**Hierarchical Clustering**, in its agglomerative (bottom-up) form, starts with every point as its own cluster and merges the two most similar clusters step by step until only one remains.
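The merge-by-merge process can be sketched with SciPy's `linkage` and `fcluster`. The Ward linkage and the toy points are assumptions for illustration, not settings taken from the project.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy points (made-up numbers): two tight groups far apart
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster
Z = linkage(X, method="ward")

# Cut the merge tree so that at most two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With n points there are always n - 1 merges, so `Z` has one row fewer than `X`. Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the merge tree.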

### Dataset Overview
I used the [NSL-KDD dataset](https://www.kaggle.com/code/eneskosar19/intrusion-detection-system-nsl-kdd), which is popular for intrusion detection research. It’s an improved version of the original KDD Cup ’99 dataset, designed to remove redundancy and bias.

Each record in the dataset represents a connection and includes details like:
- Duration
- Protocol used (TCP, UDP, etc.)
- Service type (e.g., HTTP, FTP)
- Bytes sent/received
- Flags and connection status
- And whether it was an attack (which we ignore for clustering)

### Visualization: PCA with Clusters


```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Simulated sample data (a few hand-written rows, not the full NSL-KDD set)
data = pd.DataFrame({
    'duration': [0, 0, 2, 0, 3],
    'protocol_type': ['tcp', 'udp', 'tcp', 'icmp', 'tcp'],
    'service': ['http', 'domain_u', 'smtp', 'eco_i', 'ftp'],
    'src_bytes': [181, 239, 235, 0, 145],
    'dst_bytes': [5450, 486, 1337, 0, 324],
    'flag': ['SF', 'SF', 'SF', 'REJ', 'SF'],
    'count': [9, 19, 29, 0, 5],
    'srv_count': [9, 19, 5, 0, 5],
    'label': ['normal', 'normal', 'neptune', 'neptune', 'smurf']
})

# Drop the label, one-hot encode the categorical columns, then standardize
data_encoded = pd.get_dummies(data.drop('label', axis=1))
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_encoded)

# PCA: project to two components, used only for plotting
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)

# KMeans: fit on the full scaled feature set, not on the PCA projection
# (n_init is set explicitly so results do not depend on the library default)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters = kmeans.fit_predict(data_scaled)

# Plot the 2-D projection, colored by cluster assignment
plot_df = pd.DataFrame(pca_result, columns=['PCA1', 'PCA2'])
plot_df['Cluster'] = clusters

plt.figure(figsize=(8, 5))
sns.scatterplot(data=plot_df, x='PCA1', y='PCA2', hue='Cluster', palette='viridis')
plt.title('PCA Projection with KMeans Clusters')
plt.show()
```


### Preprocessing Steps
- Removed the label column to simulate unsupervised learning
- One-hot encoded categorical features
- Standardized the encoded features to zero mean and unit variance
- Used PCA to project the data onto two components for easier visualization
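The steps above can also be sketched as a single scikit-learn transformer. This is one possible arrangement, not the exact code used in the project: it scales only the numeric columns and leaves the one-hot columns as 0/1, and the rows in `df` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows in the same shape as the sample data
df = pd.DataFrame({
    "duration": [0, 2, 0, 3],
    "src_bytes": [181, 235, 0, 145],
    "protocol_type": ["tcp", "tcp", "icmp", "udp"],
    "label": ["normal", "neptune", "neptune", "smurf"],
})

# Hide the labels so nothing downstream can see them
features = df.drop(columns="label")
numeric_cols = ["duration", "src_bytes"]
categorical_cols = ["protocol_type"]

preprocess = ColumnTransformer(
    [
        # Standardize numeric columns to zero mean and unit variance
        ("num", StandardScaler(), numeric_cols),
        # One-hot encode categorical columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
X = preprocess.fit_transform(features)

# Project to two components for plotting
coords = PCA(n_components=2).fit_transform(X)
print(X.shape, coords.shape)
```

Keeping the steps in one transformer means the same encoding and scaling can be reapplied to new traffic with `preprocess.transform(...)`.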

### Clustering Models
I tried two clustering models:

- **K-Means** worked quickly and showed promising results.
- **Agglomerative Clustering** helped visualize cluster relationships but was slower on bigger sets.

I decided to move forward with K-Means because it gave good separation and scaled better.
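One way to put a number on "good separation" is the silhouette score, which ranges from -1 to 1 and is higher when clusters are tight and far apart. The sketch below compares the two models on synthetic blobs from `make_blobs`, which stand in for the scaled feature matrix. The data, the choice of three clusters, and the metric itself are assumptions for illustration, not results from this project.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.8, random_state=42)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # Mean silhouette over all points for this model's labels
    scores[name] = silhouette_score(X, labels)
    print(f"{name}: silhouette = {scores[name]:.3f}")
```

On real traffic data the two scores can differ noticeably, and the same loop can be repeated over several values of `n_clusters` to help pick k.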

### What Did We Learn?
The clusters revealed some useful insights. Certain types of attacks, especially DoS, grouped together due to their behavior — like high source bytes or zero destination response. This shows clustering might help detect suspicious behavior even without supervision.

That said, some clusters mixed normal and attack traffic, so this shouldn’t replace supervised models. But it’s a solid first step toward flagging possible anomalies automatically.
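Since the dataset does come with labels, a quick after-the-fact check is to cross-tabulate cluster assignments against them and see how mixed each cluster is. The sketch below uses made-up assignments and labels, not output from this project, and computes a simple purity score (the share of points that belong to their cluster's majority label).

```python
import pandas as pd

# Hypothetical cluster assignments and true labels, for illustration only
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 1, 0],
    "label": ["normal", "normal", "normal", "neptune",
              "neptune", "smurf", "normal", "neptune"],
})

# Rows are clusters, columns are labels, cells are counts
table = pd.crosstab(df["cluster"], df["label"])
print(table)

# Purity: fraction of points that match the majority label of their cluster
purity = table.max(axis=1).sum() / len(df)
print(f"purity = {purity:.3f}")
```

The labels are used here only to evaluate the clusters after fitting. They never enter the clustering itself.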

### Visuals

![Cluster Distribution](cluster_distribution.png)

![PCA Clusters](pca_scatter_clusters.png)