## Real-time Anomaly Detection in Network Traffic

Cybersecurity is a major concern for many organizations. Malicious activities, such as intrusions, denial-of-service (DoS) attacks, or data exfiltration, often show up as anomalies in network traffic patterns. Detecting these anomalies is crucial for responding rapidly to security threats.

The KDD'99 dataset was created in 1999 to test whether researchers could build an efficient way to detect cybersecurity attacks. NSL-KDD is an improved version of that 1999 dataset, and it is used here to train new ML models.

Goal:
* Create a model that can differentiate between normal and anomalous network traffic in real time.

Test Data used:
* NSL-KDD Data Set: Improved version of the classic KDD'99 dataset, specifically designed for evaluating intrusion detection systems.
* Link: https://www.unb.ca/cic/datasets/nsl.html

In [73]:
# Let's listify the contents
steps = [
    "0. An end-to-end workflow to try and solve this issue",
    "1. Getting the data ready (ETL Extract, Transform and Load)",
    "2. Choose the right estimator/algorithm for our problems (scikit-learn??)",
    "3. Fit the model/algorithm and use it to make predictions on our data (70% training set, 15% validation set, 15% test set)",
    "4. Evaluating a model (Try out a few models and choose the one with the best prediction abilities, remember though, not too perfect)",
    "5. Improve a model (Try improving model by modifying hyperparameters)",
    "6. Save and load a trained model (Save, Export and load)"]

print("\n".join(steps))

0. An end-to-end workflow to try and solve this issue
1. Getting the data ready (ETL Extract, Transform and Load)
2. Choose the right estimator/algorithm for our problems (scikit-learn??)
3. Fit the model/algorithm and use it to make predictions on our data (70% training set, 15% validation set, 15% test set)
4. Evaluating a model (Try out a few models and choose the one with the best prediction abilities, remember though, not too perfect)
5. Improve a model (Try improving model by modifying hyperparameters)
6. Save and load a trained model (Save, Export and load)
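
Step 3 mentions a 70% / 15% / 15% split. A minimal sketch of one way to get those proportions, assuming scikit-learn's `train_test_split` applied twice to a synthetic stand-in dataset. The array names and the `stratify` choice are illustrative and not part of this notebook's pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5 features, binary labels (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First cut: 70% train, 30% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
# Second cut: split the held-out 30% in half -> 15% validation, 15% test
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))
```

NSL-KDD already ships separate training and test files (`KDDTrain+`, `KDDTest+`), so in this notebook a split like this would mainly serve to carve a validation set out of the training file.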


In [2]:
# Import all the libraries to use
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### 1. Data ETL

Extract the data and transform it into a usable form. We won't load the result anywhere; we simply keep it in variables for the rest of the notebook.

In [3]:
# Get the dataset
import kagglehub

# Download latest version
path = kagglehub.dataset_download("hassan06/nslkdd")

print("Path to dataset files: ", path)

Path to dataset files:  /Users/sanjaysubramanian/.cache/kagglehub/datasets/hassan06/nslkdd/versions/1


#### First we import the data from Kaggle and get the training data ready!

In [48]:
import os

path = '/Users/sanjaysubramanian/.cache/kagglehub/datasets/hassan06/nslkdd/versions/1'

# First, see what files are in the directory
files = os.listdir(path)
print("Files in directory:", files)

Files in directory: ['KDDTest+.arff', 'index.html', 'KDDTest1.jpg', 'KDDTrain+_20Percent.txt', 'KDDTrain+.txt', 'KDDTrain+_20Percent.arff', 'KDDTest-21.txt', 'KDDTest+.txt', 'KDDTest-21.arff', 'nsl-kdd', 'KDDTrain1.jpg', 'KDDTrain+.arff']


In [33]:
column_names = [
        'duration', 'protocol_type', 'service', 'flag', 'src_bytes',
        'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
        'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
        'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
        'num_access_files', 'num_outbound_cmds', 'is_host_login',
        'is_guest_login', 'count', 'srv_count', 'serror_rate',
        'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
        'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
        'dst_host_srv_count', 'dst_host_same_srv_rate',
        'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
        'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
        'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
        'dst_host_srv_rerror_rate', 'class'
    ]

# Define categorical columns and their possible values
categorical_columns = {
    'protocol_type': ['tcp', 'udp', 'icmp'],
    'service': ['aol', 'auth', 'bgp', 'courier', 'csnet_ns', 'ctf', 'daytime', 
                'discard', 'domain', 'domain_u', 'echo', 'eco_i', 'ecr_i', 'efs', 
                'exec', 'finger', 'ftp', 'ftp_data', 'gopher', 'harvest', 'hostnames', 
                'http', 'http_2784', 'http_443', 'http_8001', 'imap4', 'IRC', 'iso_tsap', 
                'klogin', 'kshell', 'ldap', 'link', 'login', 'mtp', 'name', 'netbios_dgm', 
                'netbios_ns', 'netbios_ssn', 'netstat', 'nnsp', 'nntp', 'ntp_u', 'other', 
                'pm_dump', 'pop_2', 'pop_3', 'printer', 'private', 'red_i', 'remote_job', 
                'rje', 'shell', 'smtp', 'sql_net', 'ssh', 'sunrpc', 'supdup', 'systat', 
                'telnet', 'tftp_u', 'tim_i', 'time', 'urh_i', 'urp_i', 'uucp', 'uucp_path', 
                'vmnet', 'whois', 'X11', 'Z39_50'],
    'flag': ['OTH', 'REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S0', 'S1', 'S2', 'S3', 'SF', 'SH'],
    'land': [0, 1],
    'logged_in': [0, 1],
    'is_host_login': [0, 1],
    'is_guest_login': [0, 1],
    'class': ['normal', 'anomaly']
}

# Define numeric columns (converted with pd.to_numeric in processData)
numeric_columns = [
    'duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot',
    'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 
    'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate',
    'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
    'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
    'dst_host_rerror_rate', 'dst_host_srv_rerror_rate'
]

In [106]:
def processData(filePath, colName=column_names, catCol=categorical_columns):
    """
    Read one NSL-KDD file into a pandas DataFrame.

    filePath is appended to the global `path`; colName supplies the column
    names and catCol maps each categorical column to its allowed values.
    """

    # ARFF header lines start with "@", so read_csv skips them as comments
    df_val = pd.read_csv(path + filePath, comment="@", header=None, names=colName)

    # Convert categorical columns to proper data types
    for col, categories in catCol.items():
        if col in ['land', 'logged_in', 'is_host_login', 'is_guest_login']:
            # Convert binary columns to int
            df_val[col] = df_val[col].astype(int)
        elif col != 'class':
            # Convert to categorical
            df_val[col] = pd.Categorical(df_val[col], categories=categories)

    # Coerce each numeric column in place; unparseable values become NaN
    for col in numeric_columns:
        df_val[col] = pd.to_numeric(df_val[col], errors='coerce')

    return df_val
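
The function above reads `.arff` files with `pd.read_csv` by treating `@` as a comment character. A small sketch of that trick on a made-up ARFF-style snippet (the rows and the three-column layout are invented for illustration and are not real NSL-KDD data):

```python
import io
import pandas as pd

# A tiny made-up ARFF-style file (not real NSL-KDD rows)
arff_text = """@relation 'demo'
@attribute 'duration' real
@attribute 'protocol_type' {'tcp','udp','icmp'}
@attribute 'class' {'normal','anomaly'}
@data
0,tcp,normal
5,udp,anomaly
"""

demo = pd.read_csv(
    io.StringIO(arff_text),
    comment="@",          # drop everything from "@" to the end of the line
    header=None,          # the data rows carry no header
    names=["duration", "protocol_type", "class"],
)
print(demo)
```

One caveat of this approach: a literal `@` inside a data value would also be treated as the start of a comment and truncate that row.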

In [103]:
class DataDescriptor:
    """
    A class to describe and analyze dataset information
    """
    
    def __init__(self, data):
        self.data = data
    
    def basic_info(self):
        """Display basic information about the dataset"""
        print("Dataset Shape:", self.data.shape)
        print("Column Data Types:")
        print(self.data.dtypes)
    
    def missing_values(self):
        """Check for missing values"""
        print("Missing values per column:")
        print(self.data.isnull().sum())
        print("Total missing values:", self.data.isnull().sum().sum())
    
    def preview_data(self):
        """Display first few rows"""
        print("First few rows:")
        print(self.data.head())
    
    def class_distribution(self):
        """Display class distribution"""
        print("Class distribution:")
        print(self.data['class'].value_counts())
    
    def summary_stats(self):
        """Display summary statistics for numeric columns"""
        print("Summary statistics for numeric columns:")
        print(self.data.describe())
    
    def memory_usage(self):
        """Display memory usage information"""
        print("Memory usage:")
        print(self.data.memory_usage(deep=True).sum(), "bytes")
    
    def describe_all(self):
        """Run all description methods"""
        self.basic_info()
        print("\n" + "="*50 + "\n")
        self.missing_values()
        print("\n" + "="*50 + "\n")
        self.preview_data()
        print("\n" + "="*50 + "\n")
        self.class_distribution()
        print("\n" + "="*50 + "\n")
        self.summary_stats()
        print("\n" + "="*50 + "\n")
        self.memory_usage()

#### Normal Training Dataset

In [107]:
df_train = processData("/KDDTrain+.arff")
descriptor = DataDescriptor(df_train)
df_train

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8,udp,private,SF,105,145,0,0,0,0,...,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,30,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly


In [108]:
# Display basic information about the dataset
descriptor.basic_info()

Dataset Shape: (125973, 42)
Column Data Types:
duration                          int64
protocol_type                  category
service                        category
flag                           category
src_bytes                         int64
dst_bytes                         int64
land                              int64
wrong_fragment                    int64
urgent                            int64
hot                               int64
num_failed_logins                 int64
logged_in                         int64
num_compromised                   int64
root_shell                        int64
su_attempted                      int64
num_root                          int64
num_file_creations                int64
num_shells                        int64
num_access_files                  int64
num_outbound_cmds                 int64
is_host_login                     int64
is_guest_login                    int64
count                             int64
srv_count                        

In [109]:
# Check for missing values
descriptor.missing_values()

Missing values per column:
duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate   

In [110]:
descriptor.preview_data()

First few rows:
   duration protocol_type   service flag  src_bytes  dst_bytes  land  \
0         0           tcp  ftp_data   SF        491          0     0   
1         0           udp     other   SF        146          0     0   
2         0           tcp   private   S0          0          0     0   
3         0           tcp      http   SF        232       8153     0   
4         0           tcp      http   SF        199        420     0   

   wrong_fragment  urgent  hot  ...  dst_host_srv_count  \
0               0       0    0  ...                  25   
1               0       0    0  ...                   1   
2               0       0    0  ...                  26   
3               0       0    0  ...                 255   
4               0       0    0  ...                 255   

   dst_host_same_srv_rate  dst_host_diff_srv_rate  \
0                    0.17                    0.03   
1                    0.00                    0.60   
2                    0.10            

In [111]:
# Display class distribution
descriptor.class_distribution()

Class distribution:
class
normal     67343
anomaly    58630
Name: count, dtype: int64
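
To put these counts in perspective, a short sketch that turns them into class shares and a majority-class baseline, which is the accuracy a model would get by always predicting the more common class. The counts are copied from the output above; the variable names are illustrative:

```python
import pandas as pd

# Counts copied from the class distribution printed above
counts = pd.Series({"normal": 67343, "anomaly": 58630}, name="count")

share = counts / counts.sum()        # fraction of rows per class
majority_baseline = share.max()      # accuracy of always predicting the majority class

print(share.round(3))
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model trained later should beat this baseline by a clear margin to be considered useful.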


In [112]:
# Display summary statistics for numeric columns
descriptor.summary_stats()

Summary statistics for numeric columns:
           duration     src_bytes     dst_bytes           land  \
count  125973.00000  1.259730e+05  1.259730e+05  125973.000000   
mean      287.14465  4.556674e+04  1.977911e+04       0.000198   
std      2604.51531  5.870331e+06  4.021269e+06       0.014086   
min         0.00000  0.000000e+00  0.000000e+00       0.000000   
25%         0.00000  0.000000e+00  0.000000e+00       0.000000   
50%         0.00000  4.400000e+01  0.000000e+00       0.000000   
75%         0.00000  2.760000e+02  5.160000e+02       0.000000   
max     42908.00000  1.379964e+09  1.309937e+09       1.000000   

       wrong_fragment         urgent            hot  num_failed_logins  \
count   125973.000000  125973.000000  125973.000000      125973.000000   
mean         0.022687       0.000111       0.204409           0.001222   
std          0.253530       0.014366       2.149968           0.045239   
min          0.000000       0.000000       0.000000           0.00000

In [113]:
X_train = df_train.drop("class", axis=1)
X_train.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,150,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0
1,0,udp,other,SF,146,0,0,0,0,0,...,255,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0
2,0,tcp,private,S0,0,0,0,0,0,0,...,255,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0
3,0,tcp,http,SF,232,8153,0,0,0,0,...,30,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
y_train = df_train["class"]
y_train.head()

0     normal
1     normal
2    anomaly
3     normal
4     normal
Name: class, dtype: object

#### Testing Dataset

In [124]:
# Testing Dataset
df_test = processData("/KDDTest+.arff")
descriptor2 = DataDescriptor(df_test)
df_test

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0,tcp,private,REJ,0,0,0,0,0,0,...,10,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00,anomaly
1,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00,anomaly
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00,normal
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00,anomaly
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.00,0.0,0.83,0.71,anomaly
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,0,tcp,smtp,SF,794,333,0,0,0,0,...,141,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00,normal
22540,0,tcp,http,SF,317,938,0,0,0,0,...,255,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00,normal
22541,0,tcp,http,SF,54540,8314,0,0,0,2,...,255,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07,anomaly
22542,0,udp,domain_u,SF,42,42,0,0,0,0,...,252,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00,normal


In [125]:
# Next, we separate the label column from the test set so the models can be evaluated on it
X_test = df_test.drop("class", axis=1)
X_test.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,tcp,private,REJ,0,0,0,0,0,0,...,255,10,0.04,0.06,0.0,0.0,0.0,0.0,1.0,1.0
1,0,tcp,private,REJ,0,0,0,0,0,0,...,255,1,0.0,0.06,0.0,0.0,0.0,0.0,1.0,1.0
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,134,86,0.61,0.04,0.61,0.02,0.0,0.0,0.0,0.0
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,3,57,1.0,0.0,1.0,0.28,0.0,0.0,0.0,0.0
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,29,86,0.31,0.17,0.03,0.02,0.0,0.0,0.83,0.71


In [126]:
y_test = df_test["class"]
y_test.head()

0    anomaly
1    anomaly
2     normal
3    anomaly
4    anomaly
Name: class, dtype: object

In [127]:
# Basic information like data types
descriptor2.basic_info()

Dataset Shape: (22544, 42)
Column Data Types:
duration                          int64
protocol_type                  category
service                        category
flag                           category
src_bytes                         int64
dst_bytes                         int64
land                              int64
wrong_fragment                    int64
urgent                            int64
hot                               int64
num_failed_logins                 int64
logged_in                         int64
num_compromised                   int64
root_shell                        int64
su_attempted                      int64
num_root                          int64
num_file_creations                int64
num_shells                        int64
num_access_files                  int64
num_outbound_cmds                 int64
is_host_login                     int64
is_guest_login                    int64
count                             int64
srv_count                         

In [128]:
# Are there null values in here?
descriptor2.missing_values()

Missing values per column:
duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate   

In [129]:
# Give the first few rows of this dataset
descriptor2.preview_data()

First few rows:
   duration protocol_type   service  flag  src_bytes  dst_bytes  land  \
0         0           tcp   private   REJ          0          0     0   
1         0           tcp   private   REJ          0          0     0   
2         2           tcp  ftp_data    SF      12983          0     0   
3         0          icmp     eco_i    SF         20          0     0   
4         1           tcp    telnet  RSTO          0         15     0   

   wrong_fragment  urgent  hot  ...  dst_host_srv_count  \
0               0       0    0  ...                  10   
1               0       0    0  ...                   1   
2               0       0    0  ...                  86   
3               0       0    0  ...                  57   
4               0       0    0  ...                  86   

   dst_host_same_srv_rate  dst_host_diff_srv_rate  \
0                    0.04                    0.06   
1                    0.00                    0.06   
2                    0.61      

In [130]:
# Distribution of classes
descriptor2.class_distribution()

Class distribution:
class
anomaly    12833
normal      9711
Name: count, dtype: int64


In [131]:
# Describe each of the stats
descriptor2.summary_stats()

Summary statistics for numeric columns:
           duration     src_bytes     dst_bytes          land  wrong_fragment  \
count  22544.000000  2.254400e+04  2.254400e+04  22544.000000    22544.000000   
mean     218.859076  1.039545e+04  2.056019e+03      0.000311        0.008428   
std     1407.176612  4.727864e+05  2.121930e+04      0.017619        0.142599   
min        0.000000  0.000000e+00  0.000000e+00      0.000000        0.000000   
25%        0.000000  0.000000e+00  0.000000e+00      0.000000        0.000000   
50%        0.000000  5.400000e+01  4.600000e+01      0.000000        0.000000   
75%        0.000000  2.870000e+02  6.010000e+02      0.000000        0.000000   
max    57715.000000  6.282565e+07  1.345927e+06      1.000000        3.000000   

             urgent           hot  num_failed_logins     logged_in  \
count  22544.000000  22544.000000       22544.000000  22544.000000   
mean       0.000710      0.105394           0.021647      0.442202   
std        0.036473 

In [115]:
X_train

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00
1,0,udp,other,SF,146,0,0,0,0,0,...,255,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00
2,0,tcp,private,S0,0,0,0,0,0,0,...,255,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00
3,0,tcp,http,SF,232,8153,0,0,0,0,...,30,255,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0,tcp,private,S0,0,0,0,0,0,0,...,255,25,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00
125969,8,udp,private,SF,105,145,0,0,0,0,...,255,244,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00
125970,0,tcp,smtp,SF,2231,384,0,0,0,0,...,255,30,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00
125971,0,tcp,klogin,S0,0,0,0,0,0,0,...,255,8,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00


In [120]:
# Convert categories to numbers.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categoricalFeatures = ["protocol_type", "service", "flag"]
oneHot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", oneHot, categoricalFeatures)], remainder="passthrough")

transformed_X_Train = transformer.fit_transform(X_train)
feature_names = transformer.get_feature_names_out()
visualize_transformed_X_train = pd.DataFrame(transformed_X_Train, columns=feature_names)
visualize_transformed_X_train

Unnamed: 0,one_hot__protocol_type_icmp,one_hot__protocol_type_tcp,one_hot__protocol_type_udp,one_hot__service_IRC,one_hot__service_X11,one_hot__service_Z39_50,one_hot__service_aol,one_hot__service_auth,one_hot__service_bgp,one_hot__service_courier,...,remainder__dst_host_count,remainder__dst_host_srv_count,remainder__dst_host_same_srv_rate,remainder__dst_host_diff_srv_rate,remainder__dst_host_same_src_port_rate,remainder__dst_host_srv_diff_host_rate,remainder__dst_host_serror_rate,remainder__dst_host_srv_serror_rate,remainder__dst_host_rerror_rate,remainder__dst_host_srv_rerror_rate
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,150.0,25.0,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,1.0,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,26.0,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,30.0,255.0,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,25.0,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00
125969,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,244.0,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00
125970,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,30.0,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00
125971,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,8.0,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00


In [119]:
# Processing y_train
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
transformed_y_Train = le.fit_transform(y_train)
transformed_y_Train[:5]

array([1, 1, 0, 1, 1])
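
`LabelEncoder` assigns codes in sorted order of the class names, which is why `normal` shows up as `1` above. A minimal sketch that makes the mapping explicit, using the same first five labels as `y_train.head()` (`le_demo` and `mapping` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

# Same first five labels as y_train.head() above
labels = ["normal", "normal", "anomaly", "normal", "normal"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, so index 0 is "anomaly" and index 1 is "normal"
mapping = {cls: int(code) for code, cls in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())
```

If the attack class should be the positive label (`1`), for example when reading precision and recall, one option is to build the target directly with `(y_train == "anomaly").astype(int)` instead.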

In [133]:
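# Convert categories to numbers.
# Caution: fit_transform re-fits the encoder on the test set, so its columns can
# differ from the training matrix; transformer.transform(X_test) reuses the training fit.
transformed_X_Test = transformer.fit_transform(X_test)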
feature_names = transformer.get_feature_names_out()
visualize_transformed_X_test = pd.DataFrame(transformed_X_Test, columns=feature_names)
visualize_transformed_X_test

Unnamed: 0,one_hot__protocol_type_icmp,one_hot__protocol_type_tcp,one_hot__protocol_type_udp,one_hot__service_IRC,one_hot__service_X11,one_hot__service_Z39_50,one_hot__service_auth,one_hot__service_bgp,one_hot__service_courier,one_hot__service_csnet_ns,...,remainder__dst_host_count,remainder__dst_host_srv_count,remainder__dst_host_same_srv_rate,remainder__dst_host_diff_srv_rate,remainder__dst_host_same_src_port_rate,remainder__dst_host_srv_diff_host_rate,remainder__dst_host_serror_rate,remainder__dst_host_srv_serror_rate,remainder__dst_host_rerror_rate,remainder__dst_host_srv_rerror_rate
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,10.0,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,1.0,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,134.0,86.0,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,57.0,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,29.0,86.0,0.31,0.17,0.03,0.02,0.00,0.0,0.83,0.71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,100.0,141.0,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00
22540,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,197.0,255.0,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00
22541,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07
22542,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,252.0,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00


In [134]:
# Processing y_test (reuse the encoder already fitted on y_train)
transformed_y_Test = le.transform(y_test)
transformed_y_Test[:5]

array([0, 0, 1, 0, 0])

### 2. Trying out the models

First, we will work with sklearn to try out three models that looked like the best fit for the dataset at hand.

1. Logistic Regression (a good starting model for data with 100k+ entries)
2. Random Forest Classifier (a powerful ensemble model that is highly accurate and resistant to overfitting; good for tabular data)
3. LightGBM (a gradient boosting model that works well with large datasets)
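
A minimal sketch of how candidates like these could be compared side by side, assuming 5-fold cross-validated accuracy on a synthetic dataset that stands in for the encoded NSL-KDD features. LightGBM is left out here because it is a separate package; `candidates` and `results` are illustrative names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary problem standing in for the encoded NSL-KDD features
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=40)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=40),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=40),
}

# 5-fold cross-validated accuracy for each candidate
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Scores on synthetic data say nothing about NSL-KDD itself; the point is only the shape of the comparison loop.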

In [85]:
# Import the Logistic Regression model from sklearn's linear_model
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(random_state=40)

In [122]:
# Inspect the model's parameters, then fit it on the training data
logReg.get_params()
logReg.fit(transformed_X_Train, transformed_y_Train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
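
The warning suggests two remedies: scale the data or raise `max_iter`. A minimal sketch combining both, assuming a `StandardScaler` + `LogisticRegression` pipeline on synthetic data with one deliberately large-valued feature (the data and the `max_iter=1000` value are illustrative choices, not tuned for NSL-KDD):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one badly scaled feature, loosely mimicking byte counts
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=40)
X_demo[:, 0] *= 1e6

# Standardize the features, then fit with a higher iteration cap
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=40),
)
clf.fit(X_demo, y_demo)

n_iter = int(clf[-1].n_iter_[0])
print("lbfgs iterations used:", n_iter)
```

If the one-hot encoded matrix is sparse, `StandardScaler(with_mean=False)` would be needed, since centering is not supported on sparse input.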


In [135]:
logReg.score(transformed_X_Test, transformed_y_Test)

ValueError: X has 116 features, but LogisticRegression is expecting 122 features as input.
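
The mismatch arises because the encoder was fitted a second time on the test set, which contains fewer distinct categories than the training set (for example, the test header above has no `one_hot__service_aol` column). A minimal sketch of the usual remedy, fitting on training data only and calling `transform` on the test data, shown on toy frames with made-up values. `handle_unknown="ignore"` is an added assumption that guards against categories seen only at test time:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frames: the test set lacks "aol" and contains an unseen "new_svc" (made-up values)
train = pd.DataFrame({"service": ["http", "smtp", "aol"], "src_bytes": [232, 2231, 0]})
test = pd.DataFrame({"service": ["http", "new_svc"], "src_bytes": [317, 42]})

enc = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["service"])],
    remainder="passthrough",
)

X_tr = enc.fit_transform(train)   # learn the category list from training data only
X_te = enc.transform(test)        # reuse it; unseen categories become all-zero columns

print(X_tr.shape, X_te.shape)
```

In this notebook the equivalent change would be `transformer.transform(X_test)` in place of `transformer.fit_transform(X_test)`, so that the test matrix has the same columns the model was trained on.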

In [98]:
y_train.dtype

dtype('float64')

In [99]:
y_train.unique()

array([0.  , 0.01, 1.  , 0.16, 0.57, 0.61, 0.68, 0.96, 0.88, 0.84, 0.91,
       0.12, 0.79, 0.02, 0.04, 0.29, 0.03, 0.99, 0.97, 0.07, 0.75, 0.2 ,
       0.44, 0.08, 0.06, 0.81, 0.32, 0.55, 0.33, 0.5 , 0.05, 0.15, 0.45,
       0.66, 0.93, 0.89, 0.62, 0.67, 0.27, 0.41, 0.98, 0.71, 0.38, 0.74,
       0.19, 0.39, 0.11, 0.92, 0.7 , 0.72, 0.73, 0.9 , 0.17, 0.53, 0.87,
       0.69, 0.77, 0.78, 0.52, 0.35, 0.34, 0.13, 0.28, 0.95, 0.47, 0.31,
       0.86, 0.65, 0.82, 0.94, 0.64, 0.26, 0.14, 0.6 , 0.22, 0.4 , 0.1 ,
       0.25, 0.09, 0.18, 0.54, 0.21, 0.59, 0.85, 0.8 , 0.37, 0.49, 0.56,
       0.76, 0.51, 0.24, 0.48, 0.83, 0.36, 0.58, 0.3 , 0.42, 0.63, 0.43,
       0.46, 0.23])

In [100]:
y_train.head()

0    0.00
1    0.00
2    0.00
3    0.01
4    0.00
Name: class, dtype: float64
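
The float values above do not match the `'normal'` / `'anomaly'` labels that `categorical_columns` declares for `class`, and that `y_train.head()` showed in cell 114, so this output appears to come from a state in which the column had been overwritten. A small guard along these lines would flag that before training. `check_labels` and `ALLOWED_LABELS` are hypothetical helper names, not part of the notebook:

```python
import pandas as pd

ALLOWED_LABELS = {"normal", "anomaly"}  # taken from categorical_columns['class']

def check_labels(series, allowed=ALLOWED_LABELS):
    """Raise ValueError if the label column holds anything outside `allowed`."""
    unexpected = set(series.unique()) - set(allowed)
    if unexpected:
        sample = sorted(map(str, unexpected))[:5]
        raise ValueError(f"unexpected label values, e.g. {sample}")
    return series

good = pd.Series(["normal", "anomaly", "normal"], name="class")
bad = pd.Series([0.0, 0.01, 1.0], name="class")

check_labels(good)                # passes silently
try:
    check_labels(bad)
except ValueError as err:
    print("caught:", err)
```

Calling such a check right after `processData` returns would surface a corrupted label column immediately instead of at model-fitting time.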