# Deep Security Threat Analysis

### Load Data

In [1]:
import glob
import pandas as pd

# Read every CSV in data/ (sorted for a reproducible row order) and stack them
csv_files = sorted(glob.glob("data/*.csv"))
df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

### (a) Detecting Complex Attack Patterns

#### Stealthy Port Scan Detection

A stealthy port scan probes many ports on the same destination IP but spreads the probes over a long period, so that no short time window shows enough activity to trigger detection.

**Approach:**
- Group network flows by source IP and destination IP
- For each pair, count how many unique destination ports were contacted
- Also count how many distinct time windows (e.g., per hour) the activity spans
- Flag (source, destination) pairs that contacted many ports (e.g., ≥ 10) across multiple hours

In [2]:
# Parse timestamps and floor each flow to the start of its hour
df['startDateTime'] = pd.to_datetime(df['startDateTime'])
df['hour'] = df['startDateTime'].dt.floor('h')
grouped = df.groupby(['source', 'destination'])

scan_candidates = []

for (src, dst), group in grouped:
    unique_ports = group['destinationPort'].nunique()  # distinct ports probed
    active_hours = group['hour'].nunique()             # distinct hourly windows with traffic
    # Many ports spread over more than one hour suggests a slow scan
    if unique_ports >= 10 and active_hours > 1:
        scan_candidates.append((src, dst, unique_ports, active_hours))

scan_df = pd.DataFrame(scan_candidates, columns=['source', 'destination', 'unique_ports', 'active_hours'])

print("Stealthy Port Scan Candidates:")
print(scan_df.to_string(index=False))

Stealthy Port Scan Candidates:
       source    destination  unique_ports  active_hours
192.168.1.101  192.168.5.122           222           145
192.168.1.102  192.168.5.122           124           126
192.168.1.103  192.168.5.122           195           145
192.168.1.104  192.168.5.122           387           145
192.168.1.105  192.168.1.101           562            38
192.168.1.105  192.168.1.103           560           110
192.168.1.105    192.168.1.2            82            13
192.168.1.105  192.168.2.112           824             4
192.168.1.105  192.168.5.122           606           145
192.168.2.106  192.168.5.122           102           145
192.168.2.107  192.168.5.122           328           139
192.168.2.107 201.114.230.31            24             8
192.168.2.107  213.17.121.14            15            10
192.168.2.107  221.12.138.82            10             7
192.168.2.107  93.46.200.235            17            12
192.168.2.108  192.168.5.122           200           145
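
The grouping steps above can also be written as a single vectorized aggregation instead of a Python loop. The following is a minimal sketch under stated assumptions: the function name `find_slow_scans`, its parameters, and the synthetic addresses are hypothetical and only illustrate the technique on toy data; the column names follow the dataset used in this notebook.

```python
import pandas as pd

def find_slow_scans(flows: pd.DataFrame, min_ports: int = 10, min_hours: int = 2) -> pd.DataFrame:
    """Return (source, destination) pairs that touched many ports across several hours."""
    flows = flows.copy()
    flows["hour"] = pd.to_datetime(flows["startDateTime"]).dt.floor("h")
    summary = (
        flows.groupby(["source", "destination"])
        .agg(unique_ports=("destinationPort", "nunique"),
             active_hours=("hour", "nunique"))
        .reset_index()
    )
    mask = (summary["unique_ports"] >= min_ports) & (summary["active_hours"] >= min_hours)
    return summary[mask].reset_index(drop=True)

# Synthetic flows (hypothetical hosts): one host probes 12 ports over 12 hours,
# another contacts port 80 once per hour.
start = pd.Timestamp("2010-06-13 00:00:00")
hours = [start + pd.Timedelta(hours=i) for i in range(12)]
scanner = pd.DataFrame({
    "source": "10.0.0.5",
    "destination": "10.0.0.9",
    "destinationPort": list(range(1000, 1012)),
    "startDateTime": hours,
})
normal = pd.DataFrame({
    "source": "10.0.0.6",
    "destination": "10.0.0.9",
    "destinationPort": [80] * 12,
    "startDateTime": hours,
})
result = find_slow_scans(pd.concat([scanner, normal], ignore_index=True))
print(result.to_string(index=False))
```

Only the first synthetic host is returned, since the second one reaches a single port. On large captures the aggregated form is usually faster than iterating over groups, but both express the same rule.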


#### Slow DDoS Detection

A slow DDoS attack involves sending a large number of small, seemingly benign requests over an extended period, often from many different source IPs. It does not create an immediate traffic spike, but can gradually exhaust server resources.

**Approach:**
- Floor flow timestamps to hourly windows
- For each destination IP and hour, count the total number of flows
- Count the number of unique source IPs in the same window
- Compute the average flow size using `totalSourceBytes`
- Flag destination-hour pairs whose flow count and source count both exceed their 95th percentiles and whose average flow size falls below the 25th percentile

In [3]:
df['hour'] = df['startDateTime'].dt.floor('h')

agg = df.groupby(['destination', 'hour']).agg(
    flow_count=('source', 'count'),
    unique_sources=('source', 'nunique'),
    avg_bytes=('totalSourceBytes', 'mean')
).reset_index()

flow_threshold = agg['flow_count'].quantile(0.95)
source_threshold = agg['unique_sources'].quantile(0.95)
avg_bytes_threshold = agg['avg_bytes'].quantile(0.25)

slow_ddos = agg[
    (agg['flow_count'] > flow_threshold) &
    (agg['unique_sources'] > source_threshold) &
    (agg['avg_bytes'] < avg_bytes_threshold)
]

print("Suspected slow DDoS targets:")
print(slow_ddos.to_string(index=False))

Suspected slow DDoS targets:
  destination                hour  flow_count  unique_sources  avg_bytes
192.168.2.107 2010-06-14 21:00:00          74              73 124.054054
192.168.2.107 2010-06-14 22:00:00         193             183 142.233161
192.168.2.107 2010-06-14 23:00:00         163             153 142.141104
192.168.2.113 2010-06-13 16:00:00        1672               7 116.430622
209.85.51.222 2010-06-13 10:00:00         172               4 124.627907
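
Because a slow attack may only become visible over longer windows, it can help to make the window length and the percentile cut-offs adjustable. The sketch below wraps the same three-condition rule in a function; the name `flag_slow_ddos`, its parameters, and the synthetic addresses are assumptions made for illustration, not part of the analysis above.

```python
import pandas as pd

def flag_slow_ddos(flows, freq="h", flow_q=0.95, source_q=0.95, bytes_q=0.25):
    """Flag destination/window pairs with many flows, many sources, and small flows."""
    flows = flows.copy()
    flows["window"] = pd.to_datetime(flows["startDateTime"]).dt.floor(freq)
    agg = flows.groupby(["destination", "window"]).agg(
        flow_count=("source", "count"),
        unique_sources=("source", "nunique"),
        avg_bytes=("totalSourceBytes", "mean"),
    ).reset_index()
    mask = (
        (agg["flow_count"] > agg["flow_count"].quantile(flow_q))
        & (agg["unique_sources"] > agg["unique_sources"].quantile(source_q))
        & (agg["avg_bytes"] < agg["avg_bytes"].quantile(bytes_q))
    )
    return agg[mask].reset_index(drop=True)

# Synthetic data: four quiet destinations and one that receives
# 50 small flows from 50 different sources within the same hour.
rows = []
for i in range(4):
    for j in range(2):
        rows.append({"source": f"10.0.{i}.{j}", "destination": f"10.1.0.{i}",
                     "startDateTime": "2010-06-14 21:05:00", "totalSourceBytes": 1000})
for j in range(50):
    rows.append({"source": f"10.2.0.{j}", "destination": "10.1.0.99",
                 "startDateTime": "2010-06-14 21:10:00", "totalSourceBytes": 100})

flagged = flag_slow_ddos(pd.DataFrame(rows))
print(flagged.to_string(index=False))
```

Note that percentile thresholds are relative to the data: some destination-window pairs will exceed the 95th percentile in any non-uniform capture, so flagged rows are candidates for review rather than confirmed attacks.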


#### IP Hopping Detection

IP hopping is a technique used by attackers to evade detection by switching between multiple source IPs while targeting the same destination, often on the same service port.

**Approach:**
- Floor timestamps to hourly windows
- Group flows by destination IP, hour, and destination port
- For each group, count the number of unique source IPs
- Flag destination-port-hour combinations contacted by an unusually high number of source IPs, which may indicate IP rotation. "Unusually high" means above Q3 + 1.5 × IQR of the per-group source counts, or above the 99.5th percentile when the IQR is zero
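
The outlier rule applied in the cell that follows can be looked at in isolation. This is a small sketch: the helper name `outlier_threshold` and the two sample series are hypothetical, chosen to show both branches of the rule (a non-zero IQR, and the percentile fallback when the IQR is zero).

```python
import pandas as pd

def outlier_threshold(values: pd.Series, k: float = 1.5, fallback_q: float = 0.995) -> float:
    """Upper Tukey fence (Q3 + k * IQR), or a high percentile when the IQR is zero."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # Most groups share the same value, so the fence would collapse onto Q3
        return float(values.quantile(fallback_q))
    return float(q3 + k * iqr)

spread = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 40])   # Q1 = 3, Q3 = 7, IQR = 4
flat = pd.Series([1] * 199 + [12])                 # Q1 = Q3 = 1, IQR = 0

print(outlier_threshold(spread))   # 13.0
print(outlier_threshold(flat))     # slightly above 1, so the value 12 stands out
```

The fallback matters for flow data because most destination-port-hour groups are contacted by a single source, which makes the IQR zero.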

In [4]:
df['hour'] = df['startDateTime'].dt.floor('h')

hopping_counts = df.groupby(['destination', 'hour', 'destinationPort'])['source'].nunique().reset_index(name='unique_sources')

q1 = hopping_counts['unique_sources'].quantile(0.25)
q3 = hopping_counts['unique_sources'].quantile(0.75)
iqr = q3 - q1

if iqr == 0:
    threshold = hopping_counts['unique_sources'].quantile(0.995)
else:
    threshold = q3 + 1.5 * iqr

ip_hopping = hopping_counts[hopping_counts['unique_sources'] > threshold]

print("Possible IP hopping behavior detected:")
print(ip_hopping.to_string(index=False))

Possible IP hopping behavior detected:
    destination                hour  destinationPort  unique_sources
  142.166.14.69 2010-06-14 06:00:00               80              10
  142.166.14.69 2010-06-14 07:00:00               80              11
  142.166.14.69 2010-06-15 10:00:00               80              12
  142.166.14.69 2010-06-15 11:00:00               80              14
  142.166.14.69 2010-06-15 12:00:00               80              14
  142.166.14.69 2010-06-15 13:00:00               80              11
  142.166.14.69 2010-06-15 14:00:00               80              10
  142.166.14.69 2010-06-16 01:00:00               80              10
  142.166.14.69 2010-06-16 11:00:00               80              13
  142.166.14.69 2010-06-16 12:00:00               80              11
  142.166.14.69 2010-06-16 13:00:00               80              10
  142.166.14.69 2010-06-17 06:00:00               80              10
  142.166.14.69 2010-06-17 07:00:00               80            

### (b) Malicious Payload Identification

#### Unusual Payload Pattern Detection

This analysis focuses on detecting payloads that may indicate suspicious or malicious behavior based on their structure or content.

**Approach:**
- Inspect the Base64 and UTF payload fields of both source and destination
- Flag flows with very long Base64-encoded payloads (above the 99.5th percentile of payload lengths), which may indicate obfuscated or embedded malicious content
- Identify payloads repeated across multiple flows (here, more than 10 occurrences); identical payloads are often sent by bots or malware
- Detect flows where a Base64 payload is present but the UTF payload is empty, suggesting hidden or non-human-readable data
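
A more direct way to test whether a payload is human-readable is to decode the Base64 string and check whether the bytes form valid UTF-8. The sketch below uses only the Python standard library; the helper name `decodes_to_text` and the sample payloads are hypothetical, and it assumes (the dataset does not confirm this) that the UTF column is simply the text form of the Base64 column.

```python
import base64
import binascii

def decodes_to_text(b64_payload: str) -> bool:
    """Return True if the payload is valid Base64 that decodes to UTF-8 text."""
    try:
        raw = base64.b64decode(b64_payload, validate=True)
        raw.decode("utf-8")
        return True
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return False

# Hypothetical payloads: an HTTP request line and four raw binary bytes
text_payload = base64.b64encode(b"GET /index.html HTTP/1.1").decode("ascii")
binary_payload = base64.b64encode(bytes([0xFF, 0xFE, 0x00, 0x80])).decode("ascii")

print(decodes_to_text(text_payload))    # True
print(decodes_to_text(binary_payload))  # False
```

Binary content is expected for encrypted or compressed traffic, so a `False` result on its own is a weak signal and is best combined with the other checks.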

In [5]:
df['sourcePayloadAsBase64'] = df['sourcePayloadAsBase64'].fillna('')
df['sourcePayloadAsUTF'] = df['sourcePayloadAsUTF'].fillna('')
df['destinationPayloadAsBase64'] = df['destinationPayloadAsBase64'].fillna('')
df['destinationPayloadAsUTF'] = df['destinationPayloadAsUTF'].fillna('')

df['source_b64_len'] = df['sourcePayloadAsBase64'].str.len()
df['dest_b64_len'] = df['destinationPayloadAsBase64'].str.len()

long_threshold = df[['source_b64_len', 'dest_b64_len']].stack().quantile(0.995)

long_payloads = df[(df['source_b64_len'] > long_threshold) | (df['dest_b64_len'] > long_threshold)]

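# Note: missing payloads were filled with '' above, so the empty payload is
# counted like any other value and can itself exceed the repeat threshold.
source_repeats = df['sourcePayloadAsBase64'].value_counts()
dest_repeats = df['destinationPayloadAsBase64'].value_counts()

# Payloads that occur in more than 10 flows
common_source_payloads = source_repeats[source_repeats > 10].index
common_dest_payloads = dest_repeats[dest_repeats > 10].index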

repeated_payloads = df[
    df['sourcePayloadAsBase64'].isin(common_source_payloads) |
    df['destinationPayloadAsBase64'].isin(common_dest_payloads)
]

obfuscated_flows = df[
    ((df['sourcePayloadAsBase64'] != '') & (df['sourcePayloadAsUTF'] == '')) |
    ((df['destinationPayloadAsBase64'] != '') & (df['destinationPayloadAsUTF'] == ''))
]

print(f"Flows with unusually long payloads: {len(long_payloads)}")
print(f"Flows with repeated payloads: {len(repeated_payloads)}")
print(f"Flows with obfuscated (non-UTF) Base64 content: {len(obfuscated_flows)}")

Flows with unusually long payloads: 8852
Flows with repeated payloads: 1573087
Flows with obfuscated (non-UTF) Base64 content: 84856
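
Since missing payloads are filled with the empty string before counting, flows that carry no payload at all may be included in the repeated-payload total. One possible variant, sketched here with a hypothetical helper `repeated_nonempty` and toy data, counts repeats only among non-empty payloads.

```python
import pandas as pd

def repeated_nonempty(payloads: pd.Series, min_repeats: int = 10) -> pd.Series:
    """Boolean mask: True where a non-empty payload occurs more than min_repeats times."""
    cleaned = payloads.fillna("")
    counts = cleaned[cleaned != ""].value_counts()
    common = counts[counts > min_repeats].index
    return cleaned.isin(common)

# Toy column: 50 empty payloads, one payload seen 11 times, one seen 3 times
toy = pd.Series([""] * 50 + ["QUJD"] * 11 + ["WFla"] * 3)
mask = repeated_nonempty(toy)
print(int(mask.sum()))  # 11
```

Applied to both payload columns, a mask like this would restrict the "repeated" category to flows that actually carry the recurring content.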


#### Suspicious Encrypted Traffic Detection

Encrypted protocols like HTTPS (port 443) and SSH (port 22) typically involve a handshake that generates moderate to large payload sizes. Flows that claim to use encrypted ports but contain missing or abnormally small payloads may indicate fake traffic, scanning activity, or evasion techniques.

**Approach:**
- Filter flows whose destination port is a known encrypted service port (e.g., 443, 22, 993)
- Check whether the Base64 payloads in both directions are empty or very small (fewer than 50 Base64 characters each)
- Flag such flows as suspicious, since they do not match normal encrypted behavior

In [6]:
# Keep only flows to ports commonly used by encrypted services (HTTPS, SSH, IMAPS)
encrypted_flows = df[df['destinationPort'].isin([443, 22, 993])].copy()

encrypted_flows['source_b64_len'] = encrypted_flows['sourcePayloadAsBase64'].str.len()
encrypted_flows['dest_b64_len'] = encrypted_flows['destinationPayloadAsBase64'].str.len()

# Suspicious: both directions carry fewer than 50 Base64 characters
suspicious_encrypted = encrypted_flows[
    (encrypted_flows['source_b64_len'] < 50) &
    (encrypted_flows['dest_b64_len'] < 50)
]

print(f"Suspicious encrypted flows (small or missing payloads): {len(suspicious_encrypted)}")

Suspicious encrypted flows (small or missing payloads): 63968
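
The 50-character cut-off is measured in Base64 characters, not bytes. Every 4 Base64 characters encode 3 raw bytes, so a padded Base64 string shorter than 50 characters has at most 48 characters, which corresponds to at most 36 bytes of payload. The helper below (the name `b64_decoded_len` is hypothetical) makes that conversion explicit, assuming standard padded Base64.

```python
import base64

def b64_decoded_len(b64_text: str) -> int:
    """Number of raw bytes encoded by a padded Base64 string."""
    padding = len(b64_text) - len(b64_text.rstrip("="))
    return (len(b64_text) // 4) * 3 - padding

hello_b64 = base64.b64encode(b"hello").decode("ascii")  # 'aGVsbG8='
print(b64_decoded_len(hello_b64))   # 5
print(b64_decoded_len("A" * 48))    # 36
```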


#### Detecting Consistent Command-and-Control (C2) Patterns

Malware often communicates with its command server by sending small, repeated messages at regular intervals. These C2 connections are persistent and automated.

**Approach:**
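- For each `(source, destination)` pair with at least 5 flows:
  - Count flows
  - Compute the intervals between consecutive flows and their standard deviation
  - Compute the average payload size (combined length of the source and destination Base64 payloads)
  - Measure the total communication duration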
- Dynamically calculate thresholds using percentiles:
  - Top 10% flow counts
  - Bottom 10% payload sizes
  - Bottom 10% timing variation
  - Top 50% duration
- Flag pairs that meet all conditions

In [7]:
df['payload_size'] = (
    df['sourcePayloadAsBase64'].fillna('').str.len() +
    df['destinationPayloadAsBase64'].fillna('').str.len()
)

pairs = []
flow_counts, payloads, variances, durations = [], [], [], []

for (src, dst), group in df.groupby(['source', 'destination']):
    if len(group) < 5:
        continue

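    # Flow start times in ascending order, at one-second resolution
    times = group['startDateTime'].sort_values().values.astype('datetime64[s]')
    # Gaps between consecutive flows, in seconds
    intervals = (times[1:] - times[:-1]).astype('timedelta64[s]').astype(int)
    if len(intervals) < 2:
        continue

    flow_count = len(group)
    avg_payload = group['payload_size'].mean()
    # Population standard deviation (NumPy default, ddof=0) of the gaps
    std_interval = intervals.std()
    # Casting to timedelta64[h] truncates the duration to whole hours
    duration = (times[-1] - times[0]).astype('timedelta64[h]').astype(float)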

    flow_counts.append(flow_count)
    payloads.append(avg_payload)
    variances.append(std_interval)
    durations.append(duration)

    pairs.append(((src, dst), (flow_count, avg_payload, std_interval, duration)))

fc_thresh = pd.Series(flow_counts).quantile(0.90)
pl_thresh = pd.Series(payloads).quantile(0.10)
var_thresh = pd.Series(variances).quantile(0.10)
dur_thresh = pd.Series(durations).quantile(0.50)

c2_final = []

for (src, dst), (fc, pl, var, dur) in pairs:
    if fc >= fc_thresh and pl <= pl_thresh and var <= var_thresh and dur >= dur_thresh:
        c2_final.append((src, dst, fc, round(pl, 2), round(var, 2), round(dur, 2)))

c2_df = pd.DataFrame(c2_final, columns=[
    'source', 'destination', 'flow_count', 'avg_payload', 'interval_std (s)', 'duration (hrs)'
])

print("Consistent C2-like IP pairs (using dynamic thresholds):")
print(c2_df.to_string(index=False))

Consistent C2-like IP pairs (using dynamic thresholds):
       source  destination  flow_count  avg_payload  interval_std (s)  duration (hrs)
192.168.2.109 95.211.98.12       25095          0.0              14.9            10.0
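
The per-pair timing statistics can also be computed without an explicit loop by taking grouped differences of the timestamps. The sketch below is illustrative only: the function name `beacon_stats` and the synthetic hosts are hypothetical, and, unlike the cell above, the duration is kept as fractional hours rather than truncated to whole hours.

```python
import pandas as pd

def beacon_stats(flows: pd.DataFrame) -> pd.DataFrame:
    """Per (source, destination): flow count, std of inter-flow gaps, duration in hours."""
    flows = flows.sort_values(["source", "destination", "startDateTime"]).copy()
    # Seconds since the previous flow of the same pair; NaN for each pair's first flow
    flows["gap_s"] = (
        flows.groupby(["source", "destination"])["startDateTime"].diff().dt.total_seconds()
    )
    stats = flows.groupby(["source", "destination"]).agg(
        flow_count=("startDateTime", "size"),
        first_seen=("startDateTime", "min"),
        last_seen=("startDateTime", "max"),
        # ddof=0 matches NumPy's default population standard deviation
        interval_std=("gap_s", lambda g: g.std(ddof=0)),
    ).reset_index()
    stats["duration_h"] = (stats["last_seen"] - stats["first_seen"]).dt.total_seconds() / 3600
    return stats

# Synthetic traffic: a beacon every 60 seconds versus irregular activity
t0 = pd.Timestamp("2010-06-14 00:00:00")
beacon = pd.DataFrame({
    "source": "10.0.0.7", "destination": "203.0.113.5",
    "startDateTime": [t0 + pd.Timedelta(seconds=60 * i) for i in range(10)],
})
human = pd.DataFrame({
    "source": "10.0.0.8", "destination": "203.0.113.5",
    "startDateTime": [t0 + pd.Timedelta(seconds=s) for s in (0, 5, 400, 410, 3000)],
})
stats = beacon_stats(pd.concat([beacon, human], ignore_index=True)).set_index("source")
print(stats[["flow_count", "interval_std", "duration_h"]])
```

The perfectly regular beacon has an interval standard deviation of zero, while the irregular host shows a large one. Real beacons often add jitter, so a low but non-zero threshold, such as the percentile used above, is more realistic than requiring zero.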
