## Real-time Anomaly Detection in Network Traffic

Cybersecurity is a major concern for many organizations. Malicious activities, such as intrusions, denial-of-service (DoS) attacks, or data exfiltration, often show up as anomalies in network traffic patterns. Detecting these anomalies is crucial for responding quickly to security threats.

The KDD dataset was created in 1999 to see whether researchers could build an efficient way to detect cybersecurity attacks. The NSL-KDD dataset is an improved version of that 1999 dataset and is used to train newer ML models.

Goal:
* Create a model that can differentiate between normal and anomalous network traffic in real time.

Dataset used:
* NSL-KDD Data Set: Improved version of the classic KDD'99 dataset, specifically designed for evaluating intrusion detection systems.
* Link: https://www.unb.ca/cic/datasets/nsl.html

In [1]:
# Let's listify the contents
steps = [
    "0. An end-to-end workflow to try and solve this issue",
    "1. Getting the data ready (ETL; Extract, Transform and Load)",
    "2. Choose the right estimator/algorithm for our problems (scikit-learn??)",
    "3. Fit the model/algorithm and use it to make predictions on our data (70% training set, 15% validation set, 15% test set)",
    "4. Evaluating a model (Try out a few models and choose the one with the best prediction abilities, remember though, not too perfect)",
    "5. Improve a model (Try improving model by modifying hyperparameters)",
    "6. Save and load a trained model (Save, Export and load)"]

print("\n".join(steps))

0. An end-to-end workflow to try and solve this issue
1. Getting the data ready (ETL; Extract, Transform and Load)
2. Choose the right estimator/algorithm for our problems (scikit-learn??)
3. Fit the model/algorithm and use it to make predictions on our data (70% training set, 15% validation set, 15% test set)
4. Evaluating a model (Try out a few models and choose the one with the best prediction abilities, remember though, not too perfect)
5. Improve a model (Try improving model by modifying hyperparameters)
6. Save and load a trained model (Save, Export and load)
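
Step 3 above calls for a 70% / 15% / 15% split. The following is a minimal sketch of one way to do that with pandas alone, assuming a plain random shuffle is acceptable. The helper name `split_70_15_15` and the toy frame are hypothetical and are not part of this notebook's pipeline.

```python
import pandas as pd

def split_70_15_15(df, seed=42):
    """Shuffle the rows once, then cut them into 70% / 15% / 15% parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # whatever remains
    return train, val, test

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({"src_bytes": range(100), "class": ["normal", "anomaly"] * 50})
train, val, test = split_70_15_15(toy)
print(len(train), len(val), len(test))
```

A plain shuffle does not guarantee identical class proportions in each part. If that matters, a stratified split would be the usual alternative.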


In [2]:
# Import all the libraries to use
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Get the dataset
import kagglehub

# Download latest version
path = kagglehub.dataset_download("hassan06/nslkdd")

print("Path to dataset files: ", path)

Path to dataset files:  /Users/sanjaysubramanian/.cache/kagglehub/datasets/hassan06/nslkdd/versions/1


#### First we import the data from Kaggle and get the training data ready!

In [4]:
import os

path = '/Users/sanjaysubramanian/.cache/kagglehub/datasets/hassan06/nslkdd/versions/1'

# First, see what files are in the directory
files = os.listdir(path)
print("Files in directory:", files)

# Construct the full path to the training file
train_file_path = os.path.join(path, 'KDDTrain+.txt')

# This dataset has no header, so we specify header=None
df_train = pd.read_csv(train_file_path, header=None)

print(df_train.head())

Files in directory: ['KDDTest+.arff', 'index.html', 'KDDTest1.jpg', 'KDDTrain+_20Percent.txt', 'KDDTrain+.txt', 'KDDTrain+_20Percent.arff', 'KDDTest-21.txt', 'KDDTest+.txt', 'KDDTest-21.arff', 'nsl-kdd', 'KDDTrain1.jpg', 'KDDTrain+.arff']
   0    1         2   3    4     5   6   7   8   9   ...    33    34    35  \
0   0  tcp  ftp_data  SF  491     0   0   0   0   0  ...  0.17  0.03  0.17   
1   0  udp     other  SF  146     0   0   0   0   0  ...  0.00  0.60  0.88   
2   0  tcp   private  S0    0     0   0   0   0   0  ...  0.10  0.05  0.00   
3   0  tcp      http  SF  232  8153   0   0   0   0  ...  1.00  0.00  0.03   
4   0  tcp      http  SF  199   420   0   0   0   0  ...  1.00  0.00  0.00   

     36    37    38    39    40       41  42  
0  0.00  0.00  0.00  0.05  0.00   normal  20  
1  0.00  0.00  0.00  0.00  0.00   normal  15  
2  0.00  1.00  1.00  0.00  0.00  neptune  19  
3  0.04  0.03  0.01  0.00  0.01   normal  21  
4  0.00  0.00  0.00  0.00  0.00   normal  21  

[5 rows x
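
In the preview above, column 41 holds the label (`normal`, or an attack name such as `neptune`), and column 42 is, per the NSL-KDD description, a difficulty score. The sketch below shows one way to collapse the attack names into a binary `normal` / `anomaly` class. The rows are shortened to four feature columns for readability, and the column names `label` and `difficulty` are assumptions made for this example.

```python
import io
import pandas as pd

# Three rows in the same layout as the preview, trimmed to 4 of the
# 41 feature columns (hypothetical trimming for illustration only).
sample = io.StringIO(
    "0,tcp,ftp_data,SF,normal,20\n"
    "0,tcp,private,S0,neptune,19\n"
    "0,tcp,http,SF,normal,21\n"
)
cols = ["duration", "protocol_type", "service", "flag", "label", "difficulty"]
df = pd.read_csv(sample, header=None, names=cols)

# Keep "normal" as is; replace every other label with "anomaly".
df["class"] = df["label"].where(df["label"] == "normal", "anomaly")
print(df[["label", "class"]])
```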

In [5]:
columnTrainPath = path + "/KDDTrain+.arff"

df_train_column = pd.read_csv(columnTrainPath, comment="@", header=None)
df_train_column

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8,udp,private,SF,105,145,0,0,0,0,...,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,30,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
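
The cell above reads an ARFF file with `pd.read_csv` by passing `comment="@"`. This works because every ARFF header line begins with `@`, so pandas skips those lines and parses only the data rows. A small sketch with an abbreviated, made-up header illustrates the idea (the real file declares many more attributes):

```python
import io
import pandas as pd

# Abbreviated ARFF text; the attribute list here is illustrative only.
arff_text = """@relation 'KDDTrain+'
@attribute 'duration' real
@attribute 'protocol_type' {'tcp','udp','icmp'}
@attribute 'class' {'normal','anomaly'}
@data
0,tcp,normal
0,udp,anomaly
"""

# Lines that start with "@" are treated as comments and dropped entirely.
df = pd.read_csv(io.StringIO(arff_text), comment="@", header=None)
print(df.shape)
print(df)
```

One caveat of this shortcut: an `@` inside a data row would cut that row short, so it is only safe when the data itself never contains that character.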


In [6]:
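columnTrainTXTPath = path + "/KDDTrain+.txt"

dfTrainColumn = pd.read_csv(columnTrainTXTPath, header=None)
# This displays df_train_column (the ARFF-based frame from the previous cell),
# not the dfTrainColumn frame read just above.
df_train_column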

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8,udp,private,SF,105,145,0,0,0,0,...,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,30,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly


In [33]:
column_names = [
        'duration', 'protocol_type', 'service', 'flag', 'src_bytes',
        'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
        'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
        'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
        'num_access_files', 'num_outbound_cmds', 'is_host_login',
        'is_guest_login', 'count', 'srv_count', 'serror_rate',
        'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
        'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
        'dst_host_srv_count', 'dst_host_same_srv_rate',
        'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
        'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
        'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
        'dst_host_srv_rerror_rate', 'class'
    ]

# Define categorical columns and their possible values
categorical_columns = {
    'protocol_type': ['tcp', 'udp', 'icmp'],
    'service': ['aol', 'auth', 'bgp', 'courier', 'csnet_ns', 'ctf', 'daytime', 
                'discard', 'domain', 'domain_u', 'echo', 'eco_i', 'ecr_i', 'efs', 
                'exec', 'finger', 'ftp', 'ftp_data', 'gopher', 'harvest', 'hostnames', 
                'http', 'http_2784', 'http_443', 'http_8001', 'imap4', 'IRC', 'iso_tsap', 
                'klogin', 'kshell', 'ldap', 'link', 'login', 'mtp', 'name', 'netbios_dgm', 
                'netbios_ns', 'netbios_ssn', 'netstat', 'nnsp', 'nntp', 'ntp_u', 'other', 
                'pm_dump', 'pop_2', 'pop_3', 'printer', 'private', 'red_i', 'remote_job', 
                'rje', 'shell', 'smtp', 'sql_net', 'ssh', 'sunrpc', 'supdup', 'systat', 
                'telnet', 'tftp_u', 'tim_i', 'time', 'urh_i', 'urp_i', 'uucp', 'uucp_path', 
                'vmnet', 'whois', 'X11', 'Z39_50'],
    'flag': ['OTH', 'REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S0', 'S1', 'S2', 'S3', 'SF', 'SH'],
    'land': [0, 1],
    'logged_in': [0, 1],
    'is_host_login': [0, 1],
    'is_guest_login': [0, 1],
    'class': ['normal', 'anomaly']
}

# Define numeric columns (each is passed through pd.to_numeric in processData)
numeric_columns = [
    'duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot',
    'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 
    'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate',
    'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
    'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
    'dst_host_rerror_rate', 'dst_host_srv_rerror_rate'
]

In [34]:
def processData(filePath, colName=column_names, catCol=categorical_columns):
    """
    Read a comma-separated data file (relative to the dataset `path`) into a pandas DataFrame.

    Lines starting with "@" (the ARFF header) are skipped, the columns are named
    with `colName`, and the columns listed in `catCol` are converted to int or categorical.
    """

    df_val = pd.read_csv(path + filePath, comment="@", header=None, names=colName)

    # Convert categorical columns to proper data types
    for col, categories in catCol.items():
        if col in ['land', 'logged_in', 'is_host_login', 'is_guest_login']:
            # Convert binary columns to int
            df_val[col] = df_val[col].astype(int)
        else:
            # Convert to categorical
            df_val[col] = pd.Categorical(df_val[col], categories=categories)

    # Coerce each numeric column in place; unparseable values become NaN
    for col1 in numeric_columns:
        df_val[col1] = pd.to_numeric(df_val[col1], errors='coerce')

    return df_val
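
One behaviour of `pd.Categorical` worth keeping in mind when a fixed `categories` list is supplied, as in the function above: any value that is not in the list is silently turned into a missing value. The short sketch below uses a made-up protocol value (`sctp`) to show how such values can be found.

```python
import pandas as pd

# "sctp" is a hypothetical value that is absent from the category list.
protocols = pd.Series(["tcp", "udp", "sctp"])
as_cat = pd.Categorical(protocols, categories=["tcp", "udp", "icmp"])

# Values outside the declared categories become NaN in the categorical.
unseen = protocols[pd.isna(as_cat)]
print(unseen.tolist())
```

This is one reason to check for missing values after conversion: a non-zero count in a categorical column may point to an incomplete category list.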

In [28]:
class DataDescriptor:
    """
    A class to describe and analyze dataset information
    """
    
    def __init__(self, data):
        self.data = data
    
    def basic_info(self):
        """Display basic information about the dataset"""
        print("Dataset Shape:", self.data.shape)
        print("Column Data Types:")
        print(self.data.dtypes)
    
    def missing_values(self):
        """Check for missing values"""
        print("Missing values per column:")
        print(self.data.isnull().sum())
        print("Total missing values:", self.data.isnull().sum().sum())
    
    def preview_data(self):
        """Display first few rows"""
        print("First few rows:")
        print(self.data.head())
    
    def class_distribution(self):
        """Display class distribution"""
        print("Class distribution:")
        print(self.data['class'].value_counts())
    
    def summary_stats(self):
        """Display summary statistics for numeric columns"""
        print("Summary statistics for numeric columns:")
        print(self.data.describe())
    
    def memory_usage(self):
        """Display memory usage information"""
        print("Memory usage:")
        print(self.data.memory_usage(deep=True).sum(), "bytes")
    
    def describe_all(self):
        """Run all description methods"""
        self.basic_info()
        print("\n" + "="*50 + "\n")
        self.missing_values()
        print("\n" + "="*50 + "\n")
        self.preview_data()
        print("\n" + "="*50 + "\n")
        self.class_distribution()
        print("\n" + "="*50 + "\n")
        self.summary_stats()
        print("\n" + "="*50 + "\n")
        self.memory_usage()

In [37]:
df_train_column = processData("/KDDTrain+.arff")
descriptor = DataDescriptor(df_train_column)
df_train_column

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8,udp,private,SF,105,145,0,0,0,0,...,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,30,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly


In [39]:
# Display basic information about the dataset
descriptor.basic_info()

Dataset Shape: (125973, 42)
Column Data Types:
duration                          int64
protocol_type                  category
service                        category
flag                           category
src_bytes                         int64
dst_bytes                         int64
land                              int64
wrong_fragment                    int64
urgent                            int64
hot                               int64
num_failed_logins                 int64
logged_in                         int64
num_compromised                   int64
root_shell                        int64
su_attempted                      int64
num_root                          int64
num_file_creations                int64
num_shells                        int64
num_access_files                  int64
num_outbound_cmds                 int64
is_host_login                     int64
is_guest_login                    int64
count                             int64
srv_count                        

In [40]:
# Check for missing values
descriptor.missing_values()

Missing values per column:
duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate   

In [16]:
print("\nColumn Data Types:")
print(df_train_column.dtypes)


Column Data Types:
duration                          int64
protocol_type                  category
service                        category
flag                           category
src_bytes                         int64
dst_bytes                         int64
land                              int64
wrong_fragment                    int64
urgent                            int64
hot                               int64
num_failed_logins                 int64
logged_in                         int64
num_compromised                   int64
root_shell                        int64
su_attempted                      int64
num_root                          int64
num_file_creations                int64
num_shells                        int64
num_access_files                  int64
num_outbound_cmds                 int64
is_host_login                     int64
is_guest_login                    int64
count                             int64
srv_count                         int64
serror_rate         

In [17]:
print("\nFirst few rows:")
print(df_train_column.head())


First few rows:
   duration protocol_type   service flag  src_bytes  dst_bytes  land  \
0         0           tcp  ftp_data   SF        491          0     0   
1         0           udp     other   SF        146          0     0   
2         0           tcp   private   S0          0          0     0   
3         0           tcp      http   SF        232       8153     0   
4         0           tcp      http   SF        199        420     0   

   wrong_fragment  urgent  hot  ...  dst_host_srv_count  \
0               0       0    0  ...                  25   
1               0       0    0  ...                   1   
2               0       0    0  ...                  26   
3               0       0    0  ...                 255   
4               0       0    0  ...                 255   

   dst_host_same_srv_rate  dst_host_diff_srv_rate  \
0                    0.17                    0.03   
1                    0.00                    0.60   
2                    0.10           

In [12]:
# Display class distribution
print("\nClass distribution:")
print(df_train_column['class'].value_counts())


Class distribution:
class
normal     67343
anomaly    58630
Name: count, dtype: int64
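
The counts above can be turned into class shares to judge how balanced the training set is. The sketch below uses only the two numbers printed by `value_counts()`. The "majority-class baseline" it computes is the accuracy of a trivial model that always predicts the most common class, which is a useful floor when evaluating real models later.

```python
# Counts copied from the class distribution printed above.
counts = {"normal": 67343, "anomaly": 58630}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class.
baseline_accuracy = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

The total of the two counts matches the 125,973 rows reported in the dataset shape, and the classes are fairly close to balanced.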


In [13]:
# Display summary statistics for numeric columns
print("\nSummary statistics for numeric columns:")
print(df_train_column.describe())


Summary statistics for numeric columns:
           duration     src_bytes     dst_bytes           land  \
count  125973.00000  1.259730e+05  1.259730e+05  125973.000000   
mean      287.14465  4.556674e+04  1.977911e+04       0.000198   
std      2604.51531  5.870331e+06  4.021269e+06       0.014086   
min         0.00000  0.000000e+00  0.000000e+00       0.000000   
25%         0.00000  0.000000e+00  0.000000e+00       0.000000   
50%         0.00000  4.400000e+01  0.000000e+00       0.000000   
75%         0.00000  2.760000e+02  5.160000e+02       0.000000   
max     42908.00000  1.379964e+09  1.309937e+09       1.000000   

       wrong_fragment         urgent            hot  num_failed_logins  \
count   125973.000000  125973.000000  125973.000000      125973.000000   
mean         0.022687       0.000111       0.204409           0.001222   
std          0.253530       0.014366       2.149968           0.045239   
min          0.000000       0.000000       0.000000           0.0000
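
The summary for `src_bytes` shows a heavily skewed column: the median is 44 while the maximum is above one billion. A common way to tame such a range is a `log1p` transform, sketched here on the quartiles and maximum printed above. Whether this transform actually helps a given model is an assumption to verify, not something this notebook has tested.

```python
import math

# Quartiles and maximum of src_bytes as printed by describe() above.
src_bytes_summary = {"25%": 0.0, "50%": 44.0, "75%": 276.0, "max": 1.379964e9}

# log1p(x) = log(1 + x), so zero stays zero and large values are compressed.
logged = {name: math.log1p(value) for name, value in src_bytes_summary.items()}

for name, value in logged.items():
    print(f"{name}: {value:.2f}")
```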

In [18]:
df_train_column

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8,udp,private,SF,105,145,0,0,0,0,...,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,30,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly


#### Next we get the test data ready