# 🔐 Unsupervised Network Intrusion Detection

## Overview

This notebook implements an unsupervised machine learning approach to detect anomalous network traffic using K-Means clustering on the NSL-KDD dataset. The model is trained without using attack labels, learning traffic behavior patterns directly from the data. Ground-truth labels are used only for evaluation to assess how well the discovered clusters separate normal and malicious activity.
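One way to read "labels are used only for evaluation" is to cross-tabulate cluster assignments against the held-out labels after clustering is done. The sketch below assumes toy cluster ids and labels (both hypothetical, not taken from NSL-KDD) and computes cluster purity, one of several possible evaluation measures.

```python
import pandas as pd

# Hypothetical toy data: cluster ids produced by some clustering step, and
# ground-truth labels that were never shown to the model.
clusters = pd.Series([0, 0, 0, 1, 1, 1, 1, 0], name="cluster")
labels = pd.Series(
    ["normal", "normal", "normal", "attack", "attack", "attack", "normal", "attack"],
    name="label",
)

# Contingency table: rows are clusters, columns are true classes.
table = pd.crosstab(clusters, labels)

# Each cluster is interpreted as the class that dominates it.
majority_class = table.idxmax(axis=1)

# Purity: fraction of samples that agree with their cluster's majority class.
purity = table.max(axis=1).sum() / table.to_numpy().sum()

print(table)
print(majority_class.to_dict())
print(f"purity = {purity:.2f}")
```

A purity close to 1.0 would suggest the clusters line up with the normal/attack split; a value near the majority-class share suggests they do not.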

## Workflow

The notebook walks through the following steps:

- Data preprocessing and feature encoding
- K-Means clustering for pattern discovery
- Performance analysis and evaluation metrics
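The first two steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic data, not the notebook's actual implementation: the column names, the choice of `OneHotEncoder` and `StandardScaler`, and `n_clusters=2` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for connection records (not NSL-KDD data):
# two groups of 100 rows with clearly different byte counts.
n = 200
toy = pd.DataFrame({
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n),
    "src_bytes": np.concatenate([rng.normal(200, 20, n // 2),
                                 rng.normal(5000, 300, n // 2)]),
    "dst_bytes": np.concatenate([rng.normal(400, 40, n // 2),
                                 rng.normal(50, 5, n // 2)]),
})

# Encode the categorical column and scale the numeric ones, so that no
# single feature dominates the Euclidean distances K-Means relies on.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol_type"]),
    ("num", StandardScaler(), ["src_bytes", "dst_bytes"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# No labels are passed: the model only sees the features.
cluster_ids = pipeline.fit_predict(toy)
print(pd.Series(cluster_ids).value_counts())
```

Keeping preprocessing and clustering in one pipeline means the same encoding and scaling are reapplied automatically when new records are assigned to clusters with `pipeline.predict`.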

## Dataset

The **NSL-KDD dataset** is a refined version of the KDD Cup 1999 dataset that removes its redundant records. Each network connection record has 41 features describing the connection, followed by an attack-type label (or `normal`) and a difficulty score.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

#### Load Data

In [4]:
import kagglehub

path = kagglehub.dataset_download("hassan06/nslkdd")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'nslkdd' dataset.
Path to dataset files: /kaggle/input/nslkdd


In [5]:
import os

os.listdir(path)

['KDDTest+.arff',
 'KDDTest-21.arff',
 'KDDTest1.jpg',
 'KDDTrain+.txt',
 'KDDTrain+_20Percent.txt',
 'KDDTest-21.txt',
 'KDDTest+.txt',
 'KDDTrain+.arff',
 'index.html',
 'nsl-kdd',
 'KDDTrain+_20Percent.arff',
 'KDDTrain1.jpg']

In [9]:
training_dataset = os.path.join(path, "KDDTrain+.txt")

train_df = pd.read_csv(training_dataset, header=None, sep=",")

In [10]:
display(train_df.head(10))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,33,34,35,36,37,38,39,40,41,42
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal,20
1,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal,15
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal,21
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal,21
5,0,tcp,private,REJ,0,0,0,0,0,0,...,0.07,0.07,0.0,0.0,0.0,0.0,1.0,1.0,neptune,21
6,0,tcp,private,S0,0,0,0,0,0,0,...,0.04,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,21
7,0,tcp,private,S0,0,0,0,0,0,0,...,0.06,0.07,0.0,0.0,1.0,1.0,0.0,0.0,neptune,21
8,0,tcp,remote_job,S0,0,0,0,0,0,0,...,0.09,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,21
9,0,tcp,private,S0,0,0,0,0,0,0,...,0.05,0.06,0.0,0.0,1.0,1.0,0.0,0.0,neptune,21
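Because the file is read with `header=None`, the columns above are identified only by integer position. The sketch below attaches the feature names commonly published for KDD Cup 1999 / NSL-KDD. The names `FEATURE_NAMES`, `COLUMN_NAMES`, and `placeholder_df` are introduced here for illustration, and since the file itself carries no header, the list should be checked against the dataset documentation before relying on it.

```python
import numpy as np
import pandas as pd

# The 41 feature names commonly documented for KDD Cup 1999 / NSL-KDD.
FEATURE_NAMES = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate",
]

# The last two columns hold the attack label and the difficulty score.
COLUMN_NAMES = FEATURE_NAMES + ["label", "difficulty"]

# Placeholder frame (all zeros, not real data) with the same 43-column
# layout as the header-less read, used only to show the renaming.
placeholder_df = pd.DataFrame(np.zeros((3, 43)))
placeholder_df.columns = COLUMN_NAMES

print(placeholder_df.columns[:4].tolist())
print(placeholder_df.columns[-2:].tolist())
```

With the real file, the same list could be passed directly at load time, for example `pd.read_csv(training_dataset, header=None, names=COLUMN_NAMES)`.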


In [11]:
print("Training Dataset shape: ", train_df.shape)
print("Training samples: ", train_df.shape[0])
# shape[1] counts every column, including the label and difficulty columns
print("Training columns: ", train_df.shape[1])

Training Dataset shape:  (125973, 43)
Training samples:  125973
Training columns:  43
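The count of 43 covers every column in the frame, and the last two (positions 41 and 42) hold the attack label and the difficulty score rather than connection features. A minimal sketch of setting those aside before clustering is shown below, using placeholder rows instead of the real data. The names `LABEL_COL`, `DIFFICULTY_COL`, `toy_df`, `X`, and `y_is_attack` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Column positions as they appear in the header-less frame.
LABEL_COL, DIFFICULTY_COL = 41, 42

# Placeholder rows (not real NSL-KDD records) with the same 43-column layout.
toy_df = pd.DataFrame(np.zeros((4, 43)))
toy_df[LABEL_COL] = ["normal", "neptune", "normal", "neptune"]
toy_df[DIFFICULTY_COL] = [20, 19, 21, 21]

# Features for clustering: everything except the label and difficulty columns.
X = toy_df.drop(columns=[LABEL_COL, DIFFICULTY_COL])

# Binary ground truth (1 = any attack type), kept aside for evaluation only.
y_is_attack = (toy_df[LABEL_COL] != "normal").astype(int)

print(X.shape)
print(y_is_attack.tolist())
```

Dropping the label before fitting keeps the approach unsupervised, and dropping the difficulty score avoids leaking information derived from how hard each record was to classify.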
