## Hypothesis Validation 

### Objectives:

The objective of hypothesis validation in this project is to apply statistical testing to uncover meaningful behavioural differences between normal and malicious network traffic. This helps identify patterns that could improve early threat detection and support cybersecurity decision-making.


1. **Detect Statistical Differences**  
   Quantify whether key features (such as `src_bytes` or `duration`) show significant differences between normal and malicious traffic classes.

2. **Identify Risk-Associated Attributes**  
   Determine whether certain categorical features (such as `service` type) are disproportionately linked to malicious activity.

3. **Validate Hypotheses with Statistical Rigor**  
   Use appropriate hypothesis tests (e.g. Mann-Whitney U, chi-square, t-test) to check that observed differences are statistically significant rather than plausibly due to random chance.

4. **Support Explainable Insights**  
   Back statistical results with clear visualisations (e.g. boxplots, stacked bar charts) to help non-technical stakeholders understand threat patterns.

5. **Inform Detection Logic**  
   Use validated hypotheses to inform future detection rules, classification models, or security monitoring strategies.
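
As a minimal sketch of objectives 1 to 3, the snippet below runs a Mann-Whitney U test on a skewed numeric feature and a chi-square test of independence on a categorical one. The `toy` frame is synthetic and only borrows the column names `class`, `src_bytes` and `service`; the distributions, the seed and the `alpha = 0.05` threshold are illustrative assumptions, not values taken from the project data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(42)

# Synthetic stand-in for the cleaned training data (illustrative values only).
n = 500
toy = pd.DataFrame({
    "class": np.repeat(["normal", "anomaly"], n),
    "src_bytes": np.concatenate([
        rng.lognormal(mean=5.0, sigma=1.0, size=n),  # "normal" rows
        rng.lognormal(mean=2.0, sigma=1.0, size=n),  # "anomaly" rows
    ]),
    "service": np.concatenate([
        rng.choice(["http", "private"], size=n, p=[0.9, 0.1]),
        rng.choice(["http", "private"], size=n, p=[0.2, 0.8]),
    ]),
})

# Objective 1: two-sided Mann-Whitney U test on a skewed numeric feature.
normal = toy.loc[toy["class"] == "normal", "src_bytes"]
anomaly = toy.loc[toy["class"] == "anomaly", "src_bytes"]
u_stat, p_numeric = mannwhitneyu(normal, anomaly, alternative="two-sided")

# Objective 2: chi-square test of independence between service and class.
contingency = pd.crosstab(toy["service"], toy["class"])
chi2, p_categorical, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # assumed significance level
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_numeric:.3g}")
print(f"Chi-square = {chi2:.1f}, dof = {dof}, p = {p_categorical:.3g}")
```

The Mann-Whitney U test is used here instead of a t-test because it makes no normality assumption, which suits heavy-tailed byte counts. On the real data, `toy` would be replaced by the loaded DataFrame.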


<br>

 *Disclaimer: Some of the code snippets in this project were created or refined with the assistance of ChatGPT to support learning and exploration.*

 


In [2]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import mannwhitneyu


# Load Cleaned Train Data

df = pd.read_csv('../data/cleaned/cleaned_train.csv')  
df.head()


|   | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | ftp_data | SF | 491 | 0 | False | 0 | False | 0 | ... | 25 | 0.17 | 0.03 | 0.17 | 0.0 | 0.0 | 0.0 | 0.05 | 0.0 | normal |
| 1 | 0 | udp | other | SF | 146 | 0 | False | 0 | False | 0 | ... | 1 | 0.0 | 0.6 | 0.88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 2 | 0 | tcp | private | S0 | 0 | 0 | False | 0 | False | 0 | ... | 26 | 0.1 | 0.05 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | anomaly |
| 3 | 0 | tcp | http | SF | 232 | 8153 | False | 0 | False | 0 | ... | 255 | 1.0 | 0.0 | 0.03 | 0.04 | 0.03 | 0.01 | 0.0 | 0.01 | normal |
| 4 | 0 | tcp | http | SF | 199 | 420 | False | 0 | False | 0 | ... | 255 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |


In [3]:
# Check column names
df.columns.tolist()

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate',
 'class']

In [None]:
# shows descriptive statistics
df.describe()

|   | duration | src_bytes | dst_bytes | wrong_fragment | hot | num_failed_logins | num_compromised | su_attempted | num_root | num_file_creations | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | ... | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 | 25184.0 |
| mean | 305.151009 | 24338.34 | 3492.956 | 0.023745 | 0.198102 | 0.001191 | 0.227922 | 0.00135 | 0.249921 | 0.014732 | ... | 182.547133 | 115.094346 | 0.519925 | 0.082513 | 0.147469 | 0.031854 | 0.285886 | 0.279896 | 0.11779 | 0.118807 |
| std | 2686.976829 | 2411188.0 | 88844.81 | 0.260262 | 2.154541 | 0.045425 | 10.419006 | 0.048793 | 11.502668 | 0.529686 | ... | 99.000475 | 110.649559 | 0.448949 | 0.18713 | 0.30841 | 0.110591 | 0.445361 | 0.446099 | 0.305877 | 0.317377 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 84.0 | 10.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 44.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 61.0 | 0.51 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 279.0 | 531.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 255.0 | 255.0 | 1.0 | 0.07 | 0.06 | 0.02 | 1.0 | 1.0 | 0.0 | 0.0 |
| max | 42862.0 | 381709100.0 | 5151385.0 | 3.0 | 77.0 | 4.0 | 884.0 | 2.0 | 975.0 | 40.0 | ... | 255.0 | 255.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Interpreting Summary Statistics from `df.describe()`

| Statistic | What It Tells You | How to Interpret It in This Dataset |
|-----------|-------------------|-------------------------------------|
| `count`   | Number of non-null values in each column. | Every numeric column shows 25,184, so these columns have no missing values and need no imputation. |
| `mean`    | The average value of the column. | Helps understand the general magnitude, but can be misleading if data is skewed (e.g. `src_bytes` mean is much higher than the median). |
| `std`     | Standard deviation — how spread out the values are. | High `std` means values vary widely; useful for identifying columns with potential outliers (`src_bytes`, `duration`). |
| `min`     | The smallest observed value in the column. | Indicates the lower boundary. If it’s 0 for most features, the data might be sparse or binary-like. |
| `25%`     | First quartile — 25% of the data falls below this value. | Good for spotting skew and sparsity. If 25% = 0, the feature has many zero entries (common in intrusion datasets). |
| `50%`     | Median — middle value of the distribution. | More reliable than the mean when data is skewed. If 50% = 0 and mean is much higher, it confirms long-tailed or zero-inflated features. |
| `75%`     | Third quartile — 75% of the data falls below this value. | Helps identify where most values lie and contrast with max. If 75% is very low but max is very high, the feature has strong outliers. |
| `max`     | The highest observed value in the column. | Useful for spotting extreme cases or anomalies — especially in features like `num_root`, `src_bytes`, or `duration`. |
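
The checks in the table above (mean versus median, share of zeros, distance between the 75th percentile and the max) can be collected into one report. The helper below is a hypothetical sketch, not part of the original notebook: `skew_report` is an invented name and the `toy` values are made up to mimic a zero-inflated, long-tailed column.

```python
import pandas as pd


def skew_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise mean/median gap, skewness and zero share for numeric columns."""
    numeric = frame.select_dtypes(include="number")
    report = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),                # sample skewness per column
        "zero_share": (numeric == 0).mean(),   # fraction of exact zeros
        "max": numeric.max(),
    })
    # Most right-skewed columns first.
    return report.sort_values("skew", ascending=False)


# Made-up values: one long-tailed column, one roughly symmetric rate.
toy = pd.DataFrame({
    "duration": [0, 0, 0, 0, 0, 0, 0, 2, 5, 4000],
    "same_srv_rate": [0.1, 0.3, 0.5, 0.5, 0.6, 0.7, 0.9, 1.0, 1.0, 0.4],
    "service": ["http"] * 10,  # non-numeric, ignored by the report
})

report = skew_report(toy)
print(report)
```

A column with a median of 0, a high `zero_share` and a mean far above the median is a candidate for rank-based tests rather than mean-based ones.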


<br>

### Interpretation of Descriptive Statistics and Spread

| Feature                   | Interpretation |
|---------------------------|----------------|
| `duration`                | Most connections are very short (median and 75% = 0), but some last a long time (max = 42,862 sec). High variance and extreme skew indicate that outliers dominate this feature. |
| `src_bytes`               | Half of all connections send 44 bytes or fewer, yet the largest sends about 382 million. Standard deviation and max are very high, suggesting strong right skew and heavy-tailed behaviour. |
| `dst_bytes`               | Similar to `src_bytes`: most connections receive little or no data in return (median = 0). 75% are at or below 531 bytes, but outliers exceed 5 million. |
| `wrong_fragment`          | Almost always 0 (median and 75% = 0), with a rare max of 3. Low variance but may indicate rare fragmentation attacks. |
| `hot`                     | Majority are 0 (sparse feature), but some spike up to 77. High deviation shows rare but concentrated activity. |
| `num_failed_logins`       | Nearly always 0 with few values >0. Low variance; useful in spotting brute-force attempts. |
| `num_compromised`         | 75% of values are 0, but a few entries reach 884. High variance implies rare but critical compromise indicators. |
| `su_attempted`            | Rare escalation attempts (max = 2). Mostly zeros with low spread. |
| `num_root`                | 75% = 0, but max = 975. Extreme outliers may reflect root-level activity during attacks. |
| `num_file_creations`      | Rare file creation events; 75% = 0, but max = 40. Might signal malware behaviour. |
| `dst_host_count` / `dst_host_srv_count` | Median = 255 (the maximum) for `dst_host_count`, versus 61 for `dst_host_srv_count`. At least half of the records target hosts that are contacted very frequently, which can signal scanning or DDoS. |
| `dst_host_same_srv_rate`  | Median = 0.51, 75% = 1.0. Suggests some hosts serve the same service repeatedly — common in automated scanning or web server attacks. |
| `dst_host_diff_srv_rate`  | Some hosts receive highly varied traffic (max = 1.0), indicating exploratory or multi-vector behaviour. |
| `dst_host_serror_rate` / `dst_host_srv_serror_rate` | Median = 0, but 75% = 1.0, so at least a quarter of records have every request failing (rate = 1.0). Strong indicator of failed scans or blocked intrusions. |
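
The interpretations above describe the dataset as a whole. A quick way to see whether a pattern is tied to malicious traffic is to repeat the summary per `class`. The sketch below uses made-up toy records (only the column names match the dataset) and reports the median and max for each class.

```python
import pandas as pd

# Made-up toy records; column names follow the dataset, values do not.
toy = pd.DataFrame({
    "class": ["normal"] * 5 + ["anomaly"] * 5,
    "src_bytes": [491, 146, 232, 199, 300, 0, 0, 0, 18, 0],
    "dst_host_serror_rate": [0.0, 0.0, 0.03, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
})

features = ["src_bytes", "dst_host_serror_rate"]

# Median (robust centre) and max (extreme value) for each traffic class.
by_class = toy.groupby("class")[features].agg(["median", "max"])
print(by_class)
```

The median is used instead of the mean because the features discussed here are heavily skewed; a large gap between class medians is a hint worth testing formally, not a conclusion in itself.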

<br>


### Key Features to Focus On (Based on Descriptive Statistics)

| Feature | Why It’s Important |
|---------|--------------------|
| `src_bytes` | Indicates how much data the source sends. Extremely high or zero values can signal exfiltration or scanning. |
| `dst_bytes` | Measures how much data is returned from the destination. Often zero in failed attacks or scans; large spikes can indicate suspicious responses. |
| `duration` | Connection time. Most are very short, but rare long connections could signal persistence or data transfer. |
| `num_failed_logins` | Tracks failed authentication attempts. Useful for identifying brute-force or password guessing attacks. |
| `num_compromised` | Counts compromised conditions observed during the session. Even a few non-zero values are worth investigating. |
| `su_attempted` | Attempts to gain superuser privileges. Rare but critical indicator of intrusion intent. |
| `num_root` | Counts root-level accesses during the session. High values can indicate successful privilege escalation. |
| `hot` | Counts number of "hot" indicators (e.g. suspicious operations). Rare but concentrated in some attacks. |
| `num_file_creations` | Tracks new file creations during session — often associated with malware installation or persistence mechanisms. |
| `dst_host_same_srv_rate` | High values (close to 1.0) mean the same service is being accessed repeatedly, which can indicate scanning or bot behaviour. |
| `dst_host_diff_srv_rate` | Measures variation in services accessed — high values could indicate service enumeration or reconnaissance. |
| `dst_host_serror_rate` | High error rates suggest repeated failed connection attempts — common in port scanning or dropped sessions. |
| `dst_host_rerror_rate` | High rates of rejected connections (`REJ` errors) are strong indicators of probes or misconfigured/malicious attempts. |
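
One possible way to prioritise the features above is to rank them by a nonparametric effect size. The sketch below derives the rank-biserial correlation from the Mann-Whitney U statistic. `rank_biserial` and `rank_features` are hypothetical helpers, the data is synthetic, and the formula assumes SciPy 1.7 or later, where `mannwhitneyu` returns the U statistic of its first argument.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def rank_biserial(x, y) -> float:
    """Effect size in [-1, 1]; positive when x tends to exceed y."""
    u1, _ = mannwhitneyu(x, y, alternative="two-sided")
    return 2.0 * u1 / (len(x) * len(y)) - 1.0


def rank_features(frame, features, label_col="class", positive="anomaly"):
    """Order features by absolute effect size between the two classes."""
    is_pos = frame[label_col] == positive
    rows = []
    for name in features:
        x, y = frame.loc[is_pos, name], frame.loc[~is_pos, name]
        _, p_value = mannwhitneyu(x, y, alternative="two-sided")
        rows.append({"feature": name,
                     "effect": rank_biserial(x, y),
                     "p_value": p_value})
    out = pd.DataFrame(rows).set_index("feature")
    order = out["effect"].abs().sort_values(ascending=False).index
    return out.reindex(order)


# Synthetic data: two features differ by class, one does not.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "class": np.repeat(["anomaly", "normal"], n),
    "dst_host_serror_rate": np.concatenate([rng.uniform(0.8, 1.0, n),
                                            rng.uniform(0.0, 0.2, n)]),
    "src_bytes": np.concatenate([rng.lognormal(2.0, 1.0, n),
                                 rng.lognormal(5.0, 1.0, n)]),
    "duration": rng.exponential(10.0, 2 * n),
})

ranked = rank_features(toy, ["duration", "src_bytes", "dst_host_serror_rate"])
print(ranked)
```

With many features tested at once, the p-values would need a multiple-comparison correction before being read as evidence; the effect size column is the more useful guide for choosing which hypotheses to validate first.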


<br>