## Problem Statement
**Network anomaly detection** is the task of identifying patterns in network traffic that deviate from normal behavior. Such anomalies can signal a range of security threats, from compromised devices and malware infections to large-scale attacks such as **DDoS (Distributed Denial of Service)**.

The challenge lies in detecting these anomalies accurately and in real time within large, continuous streams of network data that are often **noisy and heterogeneous**.

Traditional methods of network anomaly detection often rely on predefined rules or signatures based on known attack patterns. These methods fail to detect **new or evolving threats** that do not match an existing signature. Furthermore, as network environments grow in complexity, maintaining and updating the rules becomes increasingly **cumbersome and less effective**.


## Dataset Location/Link

https://drive.google.com/file/d/1AlZak8gC27ntWFR0-ZJ0tMxVWFac-XPf/view?usp=drive_link 

# Network Anomaly Detection Dataset Features

This document outlines and explains the various features commonly used in network anomaly detection datasets such as KDD Cup 1999 or NSL-KDD. These features help in analyzing traffic behavior and detecting anomalies or potential attacks.

---

## 1. Basic Connection Features

- **Duration**:  
  Length of time (in seconds) that the connection lasted.

- **Protocol_type**:  
  The protocol used in the connection (e.g., TCP, UDP, ICMP).

- **Service**:  
  The destination network service accessed during the connection (e.g., HTTP, Telnet, FTP).

- **Flag**:  
  Status of the connection (indicates normal or error state). It shows the result of the connection attempt (e.g., SF, S0, REJ).

- **Src_bytes**:  
  Number of data bytes sent from the source to the destination during the connection.

- **Dst_bytes**:  
  Number of data bytes sent from the destination back to the source.

- **Land**:  
  A binary flag indicating whether the source and destination IP addresses and ports are the same (1 if same, 0 otherwise).

- **Wrong_fragment**:  
  Number of incorrect (incomplete or overlapping) IP packet fragments.

- **Urgent**:  
  Number of packets with the URG (urgent) flag set in the TCP header.

---

## 2. Content-Related Features

These features analyze the actual data within the connection for suspicious activity.

- **Hot**:  
  Number of 'hot' indicators in the content (e.g., system directory access, file creations, program executions).

- **Num_failed_logins**:  
  Number of failed login attempts during the connection.

- **Logged_in**:  
  Binary flag indicating if the user is successfully logged in (1) or not (0).

- **Num_compromised**:  
  Number of compromised conditions (such as a system call to gain unauthorized privileges).

- **Root_shell**:  
  Binary flag indicating if a root shell was obtained during the session (1 if yes, 0 otherwise).

- **Su_attempted**:  
  Binary flag indicating if the "su root" command was attempted (1 if yes, 0 otherwise).

- **Num_root**:  
  Number of root-level operations performed during the connection.

- **Num_file_creations**:  
  Number of file creation operations performed.

- **Num_shells**:  
  Number of shell prompts invoked.

- **Num_access_files**:  
  Number of attempts to access control files (e.g., `/etc/passwd`).

- **Num_outbound_cmds**:  
  Number of outbound commands issued in an FTP session (usually 0 in most datasets).

- **Is_host_login**:  
  Indicates whether the login belongs to the "hot" list of privileged accounts such as root or admin (1 if yes, 0 otherwise). This feature appears as `ishostlogin` in the dataset columns.

- **Is_guest_login**:  
  Indicates whether the login is a guest login (1 if yes, 0 otherwise).

---

## 3. Time-Related Traffic Features

These features describe the temporal behavior of connections in a short time window (usually 2 seconds). The rate features are stored as fractions between 0 and 1, not as percentages between 0 and 100.

- **Count**:  
  Number of connections to the same destination host as the current connection in the past two seconds.

- **Srv_count**:  
  Number of connections to the same service as the current connection in the past two seconds.

- **Serror_rate**:  
  Percentage of connections that had SYN errors (flags: S0, S1, S2, S3) among the `count` connections.

- **Srv_serror_rate**:  
  Percentage of connections with SYN errors among the `srv_count` connections.

- **Rerror_rate**:  
  Percentage of connections that were rejected (flag REJ) among the `count` connections.

- **Srv_rerror_rate**:  
  Percentage of connections with REJ flags among the `srv_count` connections.

- **Same_srv_rate**:  
  Percentage of connections to the same service among the `count` connections.

- **Diff_srv_rate**:  
  Percentage of connections to different services among the `count` connections.

- **Srv_diff_host_rate**:  
  Percentage of connections to different hosts (IP addresses) among the `srv_count` connections.
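To make the window definitions above concrete, here is a minimal sketch that derives `count`, `srv_count`, `serror_rate`, and `same_srv_rate` for one connection from a small hand-made history. The `Connection` record, its `timestamp` and `dst_host` fields, and the choice to include the current connection in its own window are assumptions made for illustration; the dataset ships only the precomputed feature values, not raw timestamps or addresses.

```python
from dataclasses import dataclass

SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}  # flags treated as SYN errors in this document
WINDOW_SECONDS = 2.0                        # the two-second window described above


@dataclass
class Connection:
    timestamp: float  # seconds since capture start (hypothetical field)
    dst_host: str     # destination address (hypothetical field)
    service: str
    flag: str


def time_window_features(history, current):
    """Compute a few time-related traffic features for `current`.

    Assumption: `history` contains `current` itself, so the current
    connection is counted inside its own window.
    """
    recent = [c for c in history
              if 0 <= current.timestamp - c.timestamp <= WINDOW_SECONDS]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]

    count = len(same_host)
    srv_count = len(same_srv)
    # Rates are fractions of the `count` connections (0.0 when the window is empty)
    serror_rate = (sum(c.flag in SYN_ERROR_FLAGS for c in same_host) / count
                   if count else 0.0)
    same_srv_rate = (sum(c.service == current.service for c in same_host) / count
                     if count else 0.0)
    return {
        "count": count,
        "srv_count": srv_count,
        "serror_rate": serror_rate,
        "same_srv_rate": same_srv_rate,
        "diff_srv_rate": 1.0 - same_srv_rate if count else 0.0,
    }


history = [
    Connection(0.0, "10.0.0.5", "http", "SF"),     # older than two seconds, ignored
    Connection(1.5, "10.0.0.5", "http", "S0"),
    Connection(2.0, "10.0.0.5", "private", "S0"),
    Connection(2.5, "10.0.0.9", "http", "SF"),
    Connection(3.0, "10.0.0.5", "http", "SF"),     # the current connection
]
features = time_window_features(history, history[-1])
print(features)
```

In this toy history, three connections reach the same host within the window and two of them carry an `S0` flag, which is why a burst of half-open connections pushes `serror_rate` toward 1.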

---

## 4. Host-Based Traffic Features

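These features are computed over a window of recent connections (typically the last 100) rather than a fixed time window, which helps to detect long-term or stealthy attacks. As with the time-related features, the rates are stored as fractions between 0 and 1.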

- **Dst_host_count**:  
  Number of connections having the same destination host.

- **Dst_host_srv_count**:  
  Number of connections having the same service (port) to the destination host.

- **Dst_host_same_srv_rate**:  
  Percentage of connections to the same service among the `dst_host_count` connections.

- **Dst_host_diff_srv_rate**:  
  Percentage of connections to different services among the `dst_host_count` connections.

- **Dst_host_same_src_port_rate**:  
  Percentage of connections to the same source port among the `dst_host_srv_count` connections.

- **Dst_host_srv_diff_host_rate**:  
  Percentage of connections to different destination hosts using the same service.

- **Dst_host_serror_rate**:  
  Percentage of SYN error connections among the `dst_host_count` connections.

- **Dst_host_srv_serror_rate**:  
  Percentage of SYN error connections among the `dst_host_srv_count` connections.

- **Dst_host_rerror_rate**:  
  Percentage of REJ error connections among the `dst_host_count` connections.

- **Dst_host_srv_rerror_rate**:  
  Percentage of REJ error connections among the `dst_host_srv_count` connections.
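The connection-based window can be sketched with a fixed-length queue. The example below is only an illustration under stated assumptions: connections are reduced to hypothetical `(dst_host, service)` tuples, the window length is a parameter (shortened to 4 here so the eviction is visible), and only three of the features above are computed.

```python
from collections import deque


def host_window_features(window, dst_host, service):
    """Return (dst_host_count, dst_host_srv_count, dst_host_same_srv_rate).

    `window` holds the most recent connections as (dst_host, service) tuples.
    """
    same_host = [conn for conn in window if conn[0] == dst_host]
    dst_host_count = len(same_host)
    dst_host_srv_count = sum(1 for _, srv in same_host if srv == service)
    same_srv_rate = dst_host_srv_count / dst_host_count if dst_host_count else 0.0
    return dst_host_count, dst_host_srv_count, same_srv_rate


# A deque with maxlen drops the oldest connection automatically once it is full.
window = deque(maxlen=4)  # the text above mentions 100; 4 keeps the demo small
stream = [
    ("10.0.0.5", "http"),   # evicted when the fifth connection arrives
    ("10.0.0.5", "ftp"),
    ("10.0.0.9", "http"),
    ("10.0.0.5", "http"),
    ("10.0.0.5", "http"),
]
for conn in stream:
    window.append(conn)

dst_host_count, dst_host_srv_count, same_srv_rate = host_window_features(
    window, dst_host="10.0.0.5", service="http"
)
print(dst_host_count, dst_host_srv_count, round(same_srv_rate, 2))
```

Because the window is defined by the number of connections and not by elapsed time, a slow scan that spreads its probes over minutes still accumulates inside it.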


In [2]:
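# Import the libraries used for data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Silence all warnings to keep cell output readable (this also hides deprecation notices)
warnings.filterwarnings("ignore")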

### 2.1: Data discovery and access

In [3]:
data = pd.read_csv('../data/Net.csv')
data.head()

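|   | duration | protocoltype | service | flag | srcbytes | dstbytes | land | wrongfragment | urgent | hot | ... | dsthostsamesrvrate | dsthostdiffsrvrate | dsthostsamesrcportrate | dsthostsrvdiffhostrate | dsthostserrorrate | dsthostsrvserrorrate | dsthostrerrorrate | dsthostsrvrerrorrate | attack | lastflag |
|---|----------|--------------|---------|------|----------|----------|------|---------------|--------|-----|-----|--------------------|--------------------|------------------------|------------------------|-------------------|----------------------|-------------------|----------------------|--------|----------|
| 0 | 0 | tcp | ftp_data | SF | 491 | 0 | 0 | 0 | 0 | 0 | ... | 0.17 | 0.03 | 0.17 | 0.0 | 0.0 | 0.0 | 0.05 | 0.0 | normal | 20 |
| 1 | 0 | udp | other | SF | 146 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.6 | 0.88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal | 15 |
| 2 | 0 | tcp | private | S0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.1 | 0.05 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | neptune | 19 |
| 3 | 0 | tcp | http | SF | 232 | 8153 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.03 | 0.04 | 0.03 | 0.01 | 0.0 | 0.01 | normal | 21 |
| 4 | 0 | tcp | http | SF | 199 | 420 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal | 21 |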


### Shallow Copy vs Deep Copy in Pandas

### **Shallow Copy**
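- Created with: `df.copy(deep=False)`
- Copies **only references** to the underlying data, not the data itself.
- Changes to data in one DataFrame **may affect** the other. With pandas Copy-on-Write enabled (the default from pandas 3.0), a write triggers a copy instead, so the two objects stop affecting each other.
- Faster, but risky if you plan to modify data.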

### **Deep Copy**
- Created with: `df.copy()` or `df.copy(deep=True)`
- Creates a **full, independent copy** of the data.
- Changes in one DataFrame **do not affect** the other.
- Safer for transformations, but uses more memory.

| Copy Type     | Independent Data? | Speed   | Memory Usage | Safe for Modifications? |
|---------------|-------------------|---------|--------------|-------------------------|
| Shallow Copy  | ❌ No              | ✅ Fast | ✅ Low       | ❌ No                    |
| Deep Copy     | ✅ Yes             | ⚠️ Slower | ⚠️ Higher    | ✅ Yes                   |
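
The difference summarized in the table can be checked on a tiny frame. This is a sketch with made-up values and hypothetical variable names (`original`, `deep`, `shallow`); it does not touch the notebook's `data` or `df`. The value seen through the shallow copy depends on whether Copy-on-Write is active in the installed pandas version, so only the deep-copy result is stated as fixed.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"srcbytes": [491, 146, 0]})  # toy values
deep = original.copy(deep=True)
shallow = original.copy(deep=False)

# Before any write: does each copy point at the same buffer as the original?
shares_shallow = np.shares_memory(
    original["srcbytes"].to_numpy(), shallow["srcbytes"].to_numpy()
)
shares_deep = np.shares_memory(
    original["srcbytes"].to_numpy(), deep["srcbytes"].to_numpy()
)

# Modify the original and look at the same cell through both copies
original.loc[0, "srcbytes"] = -1

deep_value = int(deep.loc[0, "srcbytes"])        # the deep copy keeps 491
shallow_value = int(shallow.loc[0, "srcbytes"])  # -1 without Copy-on-Write, 491 with it

print("shallow shares memory:", shares_shallow)
print("deep shares memory:", shares_deep)
print("deep copy value:", deep_value)
print("shallow copy value:", shallow_value)
```

A deep copy is the safer default before any transformation step, since its behavior does not depend on the pandas version or configuration.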

**Example:**
```python
df_copy = df_original.copy(deep=True)   # Deep copy (safe)
df_view = df_original.copy(deep=False)  # Shallow copy (linked to original)
```

In [4]:
# Create a deep copy
df = data.copy()

### 2.2: Inspect the data

In [5]:
print(df.shape)       # (rows, columns)
print(df.columns)     # Column names
print(df.dtypes)      # Data types
print(df.head())      # First rows


(125973, 43)
Index(['duration', 'protocoltype', 'service', 'flag', 'srcbytes', 'dstbytes',
       'land', 'wrongfragment', 'urgent', 'hot', 'numfailedlogins', 'loggedin',
       'numcompromised', 'rootshell', 'suattempted', 'numroot',
       'numfilecreations', 'numshells', 'numaccessfiles', 'numoutboundcmds',
       'ishostlogin', 'isguestlogin', 'count', 'srvcount', 'serrorrate',
       'srvserrorrate', 'rerrorrate', 'srvrerrorrate', 'samesrvrate',
       'diffsrvrate', 'srvdiffhostrate', 'dsthostcount', 'dsthostsrvcount',
       'dsthostsamesrvrate', 'dsthostdiffsrvrate', 'dsthostsamesrcportrate',
       'dsthostsrvdiffhostrate', 'dsthostserrorrate', 'dsthostsrvserrorrate',
       'dsthostrerrorrate', 'dsthostsrvrerrorrate', 'attack', 'lastflag'],
      dtype='object')
duration                    int64
protocoltype               object
service                    object
flag                       object
srcbytes                    int64
dstbytes                    int64
land         

In [6]:
missing_info = df.isnull().sum()
missing_percent = (missing_info / len(df)) * 100
print(pd.DataFrame({"Missing": missing_info, "% Missing": missing_percent}))


                        Missing  % Missing
duration                      0        0.0
protocoltype                  0        0.0
service                       0        0.0
flag                          0        0.0
srcbytes                      0        0.0
dstbytes                      0        0.0
land                          0        0.0
wrongfragment                 0        0.0
urgent                        0        0.0
hot                           0        0.0
numfailedlogins               0        0.0
loggedin                      0        0.0
numcompromised                0        0.0
rootshell                     0        0.0
suattempted                   0        0.0
numroot                       0        0.0
numfilecreations              0        0.0
numshells                     0        0.0
numaccessfiles                0        0.0
numoutboundcmds               0        0.0
ishostlogin                   0        0.0
isguestlogin                  0        0.0
count      

- No missing values in the dataset
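
With 43 columns, the full per-column report is long to scan. A possible shortcut, sketched here on a toy frame with deliberately inserted gaps (the values and the `toy` name are hypothetical), is to keep only the columns that actually contain missing values and to reduce the check to a single boolean.

```python
import numpy as np
import pandas as pd

# Toy frame with two gaps so the filter has something to show
toy = pd.DataFrame({
    "duration": [0, 0, 2],
    "service": ["http", None, "ftp_data"],
    "srcbytes": [491.0, np.nan, 0.0],
})

missing = toy.isna().sum()                 # missing count per column
columns_with_gaps = missing[missing > 0]   # keep only columns that have gaps
has_missing = bool(toy.isna().any().any())  # single yes/no answer for the frame

print(columns_with_gaps)
print("Any missing values:", has_missing)
```

Applied to the notebook's `df`, an empty `columns_with_gaps` and `has_missing == False` would express the same conclusion as the bullet above.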

In [7]:
df.describe(include='all')

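|   | duration | protocoltype | service | flag | srcbytes | dstbytes | land | wrongfragment | urgent | hot | ... | dsthostsamesrvrate | dsthostdiffsrvrate | dsthostsamesrcportrate | dsthostsrvdiffhostrate | dsthostserrorrate | dsthostsrvserrorrate | dsthostrerrorrate | dsthostsrvrerrorrate | attack | lastflag |
|---|----------|--------------|---------|------|----------|----------|------|---------------|--------|-----|-----|--------------------|--------------------|------------------------|------------------------|-------------------|----------------------|-------------------|----------------------|--------|----------|
| count | 125973.0 | 125973 | 125973 | 125973 | 125973.0 | 125973.0 | 125973.0 | 125973.0 | 125973.0 | 125973.0 | ... | 125973.0 | 125973.0 | 125973.0 | 125973.0 | 125973.0 | 125973.0 | 125973.0 | 125973.0 | 125973 | 125973.0 |
| unique |  | 3 | 70 | 11 |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  | 23 |  |
| top |  | tcp | http | SF |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  | normal |  |
| freq |  | 102689 | 40338 | 74945 |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  | 67343 |  |
| mean | 287.14465 |  |  |  | 45566.74 | 19779.11 | 0.000198 | 0.022687 | 0.000111 | 0.204409 | ... | 0.521242 | 0.082951 | 0.148379 | 0.032542 | 0.284452 | 0.278485 | 0.118832 | 0.12024 |  | 19.50406 |
| std | 2604.51531 |  |  |  | 5870331.0 | 4021269.0 | 0.014086 | 0.25353 | 0.014366 | 2.149968 | ... | 0.448949 | 0.188922 | 0.308997 | 0.112564 | 0.444784 | 0.445669 | 0.306557 | 0.319459 |  | 2.291503 |
| min | 0.0 |  |  |  | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  | 0.0 |
| 25% | 0.0 |  |  |  | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  | 18.0 |
| 50% | 0.0 |  |  |  | 44.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.51 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  | 20.0 |
| 75% | 0.0 |  |  |  | 276.0 | 516.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.07 | 0.06 | 0.02 | 1.0 | 1.0 | 0.0 | 0.0 |  | 21.0 |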


## 3: Exploratory data analysis

### 3.1 Univariate Analysis

Goal: understand each variable alone.

In [9]:
columns = df.columns

In [10]:
for col in columns:
    print(f"Column: {col}")
    if df[col].dtype == 'object':
        # Categorical column: frequency of each distinct value
        print(df[col].value_counts())
    else:
        # Numeric column: summary statistics
        print(df[col].describe())
    print("\n")

Column: duration
count    125973.00000
mean        287.14465
std        2604.51531
min           0.00000
25%           0.00000
50%           0.00000
75%           0.00000
max       42908.00000
Name: duration, dtype: float64


Column: protocoltype
protocoltype
tcp     102689
udp      14993
icmp      8291
Name: count, dtype: int64
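
Raw counts are easier to compare across columns once they are expressed as shares. The sketch below rebuilds the `protocoltype` counts printed above as a small Series (the name `protocol_counts` is hypothetical) and converts them to relative frequencies; on the full frame, `df['protocoltype'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the protocoltype output above
protocol_counts = pd.Series({"tcp": 102689, "udp": 14993, "icmp": 8291}, name="count")

total = int(protocol_counts.sum())
protocol_share = protocol_counts / total  # fraction of all connections per protocol

print(f"Total connections: {total}")
print(protocol_share.round(4))
```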


Column: service
service
http         40338
private      21853
domain_u      9043
smtp          7313
ftp_data      6860
             ...  
tftp_u           3
http_8001        2
aol              2
harvest          2
http_2784        1
Name: count, Length: 70, dtype: int64


Column: flag
flag
SF        74945
S0        34851
REJ       11233
RSTR       2421
RSTO       1562
S1          365
SH          271
S2          127
RSTOS0      103
S3           49
OTH          46
Name: count, dtype: int64


Column: srcbytes
count    1.259730e+05
mean     4.556674e+04
std      5.870331e+06
min      0.000000e+00
25%      0.000000e+00
50%      4.400000e+01
75%      2.760000e+02
m