# 🧠 Network Anomaly Detection using Unsupervised Learning
In this project, we aim to detect **anomalies in network traffic** that could indicate potential **security breaches** or **system failures**, using unsupervised learning techniques.

We use the **KDD Cup 1999 dataset**, which includes a variety of simulated intrusions and normal network traffic. This project leverages:

- Isolation Forest (tree-based anomaly detection)
- Autoencoder (deep learning reconstruction-based anomaly detection)

---

## 📥 Step 1: Load and Inspect the Dataset

In [None]:
import pandas as pd
import numpy as np

# Load the 10% subset of the KDD Cup 1999 data (the file has no header row)
df = pd.read_csv("kddcup.data_10_percent.gz", header=None)

# Preview dataset
df.head()
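
Beyond `df.head()`, it helps to check the shape, the column dtypes, and the class balance of the label column before any modelling. The sketch below does this on a tiny in-memory sample rather than the real file. The three rows and the four-column layout are hypothetical and only mimic the headerless CSV format; the real file has 42 columns.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same headerless CSV layout
# (only 4 of the 42 columns are shown to keep the example short).
sample_csv = io.StringIO(
    "0,tcp,http,normal.\n"
    "0,icmp,ecr_i,smurf.\n"
    "0,tcp,private,neptune.\n"
)
sample = pd.read_csv(sample_csv, header=None)

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # string columns will need encoding later

# The label is the last column; value_counts shows the class balance
label_counts = sample[sample.columns[-1]].value_counts()
print(label_counts)
```

The same three calls applied to `df` should show how many records are labelled `normal.` versus each attack type.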

## 🔧 Step 2: Data Preprocessing
- Encode categorical variables
- Normalize numerical features

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

df_encoded = df.copy()

# Label-encode every string column (protocol, service, flag, and the label in column 41)
for col in df_encoded.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded.drop(columns=[41]))  # column 41 is the label, so drop it
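
Label encoding assigns arbitrary integer ranks to categories, and scaling then treats those ranks as distances. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is to one-hot encode the categorical columns and scale only the numeric ones with scikit-learn's `ColumnTransformer`. The toy frame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: two numeric features and one categorical feature
toy = pd.DataFrame({
    "duration": [0.0, 2.0, 0.0, 5.0, 1.0, 0.0],
    "protocol": ["tcp", "udp", "icmp", "tcp", "udp", "tcp"],
    "src_bytes": [181.0, 105.0, 1032.0, 0.0, 146.0, 239.0],
})

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["duration", "src_bytes"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

With high-cardinality columns such as the service field, one-hot encoding widens the feature matrix considerably, so whether it helps here is something to verify experimentally.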


## 🌲 Step 3: Anomaly Detection using Isolation Forest

In [None]:
from sklearn.ensemble import IsolationForest

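# contamination is the assumed fraction of anomalies; 0.05 flags roughly 5% of rows
iso_forest = IsolationForest(contamination=0.05, random_state=42)
# fit_predict returns -1 for anomalies and 1 for normal points
df_encoded['anomaly_iforest'] = iso_forest.fit_predict(X_scaled)

# Count predictions per class (-1 = anomaly, 1 = normal)
df_encoded['anomaly_iforest'].value_counts()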
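
The hard `-1`/`1` labels hide how strongly each point was flagged. As a minimal sketch on synthetic data (the points below are made up for illustration, not drawn from KDD), `score_samples` exposes the underlying continuous score, where lower values mean "more anomalous".

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical data: 190 "normal" points near the origin, 10 scattered far away
normal = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
outliers = rng.uniform(6.0, 10.0, size=(10, 2)) * rng.choice([-1, 1], size=(10, 2))
X_demo = np.vstack([normal, outliers])

forest = IsolationForest(contamination=0.05, random_state=42)
labels = forest.fit_predict(X_demo)     # -1 = anomaly, 1 = inlier
scores = forest.score_samples(X_demo)   # lower score = more anomalous

print("flagged:", int((labels == -1).sum()))
print("mean score, normal points:", scores[:190].mean())
print("mean score, outliers:", scores[190:].mean())
```

Keeping the continuous scores makes it possible to try different cut-offs later without refitting the model.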

## 🤖 Step 4: Anomaly Detection using Autoencoder

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim = X_scaled.shape[1]

autoencoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(input_dim, activation='linear')
])

autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_scaled, X_scaled, epochs=5, batch_size=256, shuffle=True)

# Per-sample reconstruction error (mean squared error over the feature axis)
reconstructions = autoencoder.predict(X_scaled)
mse = np.mean(np.square(X_scaled - reconstructions), axis=1)

# Threshold: mean + 3*std
threshold = np.mean(mse) + 3 * np.std(mse)
df_encoded['anomaly_autoencoder'] = (mse > threshold).astype(int)

df_encoded['anomaly_autoencoder'].value_counts()
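
To see how the mean-plus-three-standard-deviations rule behaves in isolation, here is a small NumPy-only sketch. The inputs and "reconstructions" are synthetic stand-ins (no autoencoder is trained), with five rows deliberately reconstructed badly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical inputs and reconstructions: small noise everywhere,
# plus 5 rows that are reconstructed badly on purpose
X_demo = rng.normal(size=(1000, 5))
recon = X_demo + rng.normal(scale=0.1, size=X_demo.shape)
recon[:5] += 5.0

# One reconstruction error per row (mean over features)
errors = np.mean(np.square(X_demo - recon), axis=1)

# Same rule as above: flag rows whose error exceeds mean + 3 * std
threshold = errors.mean() + 3 * errors.std()
flags = (errors > threshold).astype(int)
print("threshold:", threshold)
print("rows flagged:", int(flags.sum()))
```

Note that the badly reconstructed rows themselves inflate both the mean and the standard deviation, so when anomalies are frequent this rule can become too lenient; a percentile-based cut-off is one alternative worth comparing.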

## 📊 Step 5: Optional - Evaluation using Known Labels

In [None]:
from sklearn.metrics import classification_report

# Binary ground truth: 1 = attack, 0 = normal (labels in the file end with a period)
df_encoded['true_label'] = (df[41] != 'normal.').astype(int)

print("Isolation Forest:\n")
# fit_predict returns -1 for anomalies; map to 1 = anomaly, 0 = normal
iforest_pred = (df_encoded['anomaly_iforest'] == -1).astype(int)
print(classification_report(df_encoded['true_label'], iforest_pred))

print("Autoencoder:\n")
print(classification_report(df_encoded['true_label'], df_encoded['anomaly_autoencoder']))
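
The two detectors use different output conventions (Isolation Forest emits `-1`/`1`, the autoencoder flag is `0`/`1`), so predictions need a common encoding before they are scored. The following sketch shows the mapping and the resulting precision and recall on a tiny hand-made example; the arrays are hypothetical and chosen only so the counts are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth (1 = attack) and raw IsolationForest-style output
y_true = np.array([0, 0, 1, 1, 0, 1])
raw_pred = np.array([1, 1, -1, 1, -1, -1])  # -1 = anomaly, 1 = inlier

# Map -1/1 onto the same 0/1 encoding as the ground truth
y_pred = (raw_pred == -1).astype(int)

# Here: 2 true positives, 1 false positive, 1 false negative
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```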

## ✅ Conclusion
In this project, we applied **Isolation Forest** and **Autoencoder** to detect anomalies in the KDD Cup 1999 dataset.

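- Isolation Forest is simple and fast to train, but its random axis-aligned splits may miss anomalies that only appear in combinations of features.
- The autoencoder can learn nonlinear interactions between features, at the cost of longer training and a reconstruction-error threshold that has to be tuned.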

**Next steps**:
- Perform PCA or t-SNE for better visual analysis
- Fine-tune thresholds and evaluate ROC-AUC
- Explore other unsupervised techniques like One-Class SVM or DBSCAN
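
As a starting point for the ROC-AUC step, the sketch below scores an Isolation Forest with `roc_auc_score` using continuous anomaly scores instead of hard labels. The data is synthetic and the variable names are illustrative; on the real dataset, `X_scaled` and `true_label` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Hypothetical data: 300 normal points and 15 scattered outliers
normal = rng.normal(size=(300, 3))
outliers = rng.uniform(6.0, 10.0, size=(15, 3)) * rng.choice([-1, 1], size=(15, 3))
X_demo = np.vstack([normal, outliers])
y_demo = np.r_[np.zeros(300, dtype=int), np.ones(15, dtype=int)]

forest = IsolationForest(random_state=42).fit(X_demo)
# score_samples is higher for normal points, so negate it to get an anomaly score
anomaly_score = -forest.score_samples(X_demo)

auc = roc_auc_score(y_demo, anomaly_score)
print(f"ROC-AUC: {auc:.3f}")
```

Because ROC-AUC is threshold-free, it allows the two detectors to be compared without first agreeing on a contamination rate or an error cut-off.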