# Project: Advanced Network Forensics & Statistical C2 Hunting

**Mission:** Beyond simple beacons. Modern malware uses "Jitter" (randomized timing) and "Protocol Tunneling" to evade detection. You are hunting for a sophisticated C2 (Command & Control) channel hidden within standard traffic.

**Objective:** Use statistical distribution analysis, TTL (Time-to-Live) fingerprinting, and packet size entropy to differentiate between human and machine activity.

---

## PHASE 1: Protocol Entropy & Distribution
**Context:** Our `network_flows` table contains byte-level telemetry.

**Expert Task:** Calculate the **Standard Deviation** and **Coefficient of Variation** (standard deviation divided by the mean) of packet sizes (`bytes`) per protocol. High-volume, high-variation protocols are likely file transfers; low-variation, near-fixed-size packets are often heartbeats.

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns

engine = create_engine("mysql+pymysql://app_user:appuserpassword456@localhost/security_lab")
df = pd.read_sql("SELECT * FROM network_flows", engine)

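# Statistical summary per source-destination pair (named aggregation keeps columns flat)
stats = df.groupby(['src_ip', 'dest_ip']).agg(
    bytes_mean=('bytes', 'mean'),
    bytes_std=('bytes', 'std'),
    flow_count=('bytes', 'count'),
    duration_mean=('duration_ms', 'mean'),
    duration_std=('duration_ms', 'std'),
).reset_index()

# Coefficient of variation = std / mean; low values suggest fixed-size heartbeats
stats['bytes_cv'] = stats['bytes_std'] / stats['bytes_mean']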
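
The per-protocol statistics named in the task can be sketched as follows. This is a minimal illustration on synthetic data, not the lab's reference solution: the `protocol` column name and the sample values are assumptions, since the actual `network_flows` schema is not shown here.

```python
import pandas as pd

# Synthetic stand-in for network_flows; the `protocol` column is an assumed name
flows = pd.DataFrame({
    "protocol": ["HTTPS"] * 4 + ["DNS"] * 4,
    "bytes": [500, 48000, 1200, 96000, 64, 64, 66, 64],
})

def protocol_variation(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean, std, count, and coefficient of variation of bytes per protocol."""
    out = frame.groupby("protocol")["bytes"].agg(["mean", "std", "count"])
    # CV = std / mean; it is unitless, so protocols of different scale are comparable
    out["cv"] = out["std"] / out["mean"]
    return out.sort_values("cv")

summary = protocol_variation(flows)
print(summary)
```

Sorting by `cv` puts the most uniform traffic first, which is where heartbeat-like protocols would be expected to appear.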

## PHASE 2: Jitter Detection (Advanced Time-Series)
**The Logic:** Basic beacons fire every X seconds. Advanced malware adds "Jitter" (e.g., 300 seconds +/- 10%). 

**Expert Task:** 
1. Sort traffic by `timestamp` for a specific `src_ip`.
2. Calculate the time difference (delta) between consecutive flows.
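3. Use a **Histogram** to visualize the deltas. A single sharp peak suggests a simple beacon. A narrow cluster around a central value suggests jittered C2 (bell-shaped or flat within the jitter band, depending on how the malware randomizes its timing). A broad, irregular spread with bursts and long gaps is more consistent with human-driven traffic.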

In [None]:
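# Filter for the top talker (the source IP with the most flows)
target_ip = df['src_ip'].value_counts().index[0]
df_target = df[df['src_ip'] == target_ip].copy()  # copy avoids SettingWithCopyWarning

# Parse timestamps before sorting so the order is chronological, not lexical
df_target['timestamp'] = pd.to_datetime(df_target['timestamp'])
df_target = df_target.sort_values('timestamp')
df_target['delta'] = df_target['timestamp'].diff().dt.total_seconds()

sns.histplot(df_target['delta'].dropna(), bins=50, kde=True)
plt.title(f"Inter-Arrival Time Distribution for {target_ip}")
plt.xlabel("Seconds between consecutive flows")
plt.show()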
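
The histogram can be complemented with a single number: the coefficient of variation of the deltas. The sketch below is illustrative only. It uses simulated timestamps (a 300-second beacon with uniform +/-10% jitter, and exponentially distributed gaps as a rough stand-in for bursty traffic), and the helper name `delta_profile` is hypothetical.

```python
import numpy as np
import pandas as pd

def delta_profile(timestamps: pd.Series) -> dict:
    """Summarize inter-arrival times (in seconds) for one source's flows."""
    deltas = pd.to_datetime(timestamps).sort_values().diff().dt.total_seconds().dropna()
    mean = deltas.mean()
    return {"mean_s": mean, "std_s": deltas.std(), "cv": deltas.std() / mean}

rng = np.random.default_rng(0)
start = pd.Timestamp("2024-01-01")

# Simulated beacon: 300 s interval with uniform +/-10% jitter
beacon_gaps = 300 * (1 + rng.uniform(-0.10, 0.10, size=200))
beacon_ts = pd.Series(start + pd.to_timedelta(np.cumsum(beacon_gaps), unit="s"))

# Simulated bursty traffic: exponentially distributed gaps with the same mean
bursty_gaps = rng.exponential(300, size=200)
bursty_ts = pd.Series(start + pd.to_timedelta(np.cumsum(bursty_gaps), unit="s"))

beacon = delta_profile(beacon_ts)
bursty = delta_profile(bursty_ts)
print(f"beacon cv={beacon['cv']:.3f}, bursty cv={bursty['cv']:.3f}")
```

For uniform +/-10% jitter the theoretical CV is 0.1/√3 ≈ 0.058, while exponential gaps have a CV near 1, so the two cases separate clearly. Real traffic is messier, and any cut-off between them should be tuned against known-good hosts.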

## PHASE 3: TTL & OS Fingerprinting
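**The Logic:** Most operating systems use a characteristic default initial TTL (Time-to-Live). Each router hop decrements the value by one, so the TTL observed on the wire is usually slightly below the default (for example, an observed 57 most likely started at 64):
- Linux/Unix: 64
- Windows: 128
- Network Devices: 255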

**Expert Task:** Analyze the `ttl` values (if available) or `bytes` patterns to identify "Proxying" or "Tunneling." If a single IP address shows traffic consistent with both Linux (TTL near 64) and Windows (TTL near 128), several hosts are probably sharing that address: it may be a NAT gateway, proxy, or VPN exit node, or a compromised host being used as a pivot point.

In [None]:
# YOUR CODE HERE
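
One possible starting point for the TTL task is sketched below. It assumes the table exposes a `ttl` column (the task itself marks this as "if available"), runs on synthetic rows, and uses a hypothetical helper, `guess_initial_ttl`, that rounds an observed TTL up to the nearest common default.

```python
import pandas as pd

# Common default initial TTLs listed above; real stacks can be configured differently
INITIAL_TTLS = {64: "Linux/Unix", 128: "Windows", 255: "Network device"}

def guess_initial_ttl(observed: int) -> int:
    """Round an observed TTL up to the nearest common initial value."""
    for initial in sorted(INITIAL_TTLS):
        if observed <= initial:
            return initial
    raise ValueError(f"TTL out of range: {observed}")

# Synthetic flows; the `ttl` column is an assumption about the schema
flows = pd.DataFrame({
    "src_ip": ["10.0.0.5"] * 4 + ["10.0.0.9"] * 3,
    "ttl": [61, 62, 125, 61, 126, 126, 125],
})
flows["os_guess"] = flows["ttl"].map(guess_initial_ttl).map(INITIAL_TTLS)

# Sources that present more than one OS family deserve a closer look
n_families = flows.groupby("src_ip")["os_guess"].nunique()
mixed = n_families[n_families > 1]
print(mixed)
```

Rounding up works because hops only ever decrease the TTL. It is a heuristic: a host with a non-default TTL setting would be misclassified.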

## PHASE 4: Hunting "Out-of-Hours" Anomalies
**Expert Task:** Create a 2D Heatmap (Hour of Day vs. Day of Week). Use the `bytes` column to identify massive data transfers that occur outside of standard working hours (e.g., 2 AM on a Sunday).

In [None]:
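ts = pd.to_datetime(df['timestamp'])
df['hour'] = ts.dt.hour
df['day'] = ts.dt.day_name()

# Order rows Monday..Sunday (the default is alphabetical) and show all 24 hours
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pivot = df.pivot_table(index='day', columns='hour', values='bytes', aggfunc='sum')
pivot = pivot.reindex(index=day_order, columns=range(24)).fillna(0)

sns.heatmap(pivot, cmap="YlGnBu")
plt.title("Network Traffic Heatmap (Byte Volume)")
plt.show()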
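
To turn the visual "out-of-hours" judgment into a column that later steps can use, a flag along these lines may help. The working-hours window (08:00 to 18:00, Monday to Friday), the function name `flag_out_of_hours`, and the sample rows are all assumptions; timestamps are also assumed to already be in the organization's local time zone.

```python
import pandas as pd

def flag_out_of_hours(timestamps: pd.Series, start_hour: int = 8, end_hour: int = 18) -> pd.Series:
    """True where a flow falls on a weekend or outside [start_hour, end_hour) on a weekday."""
    ts = pd.to_datetime(timestamps)
    weekend = ts.dt.dayofweek >= 5  # Monday=0 ... Sunday=6
    off_clock = (ts.dt.hour < start_hour) | (ts.dt.hour >= end_hour)
    return weekend | off_clock

# Synthetic flows: Sunday 02:00, Monday 10:30, Monday 23:15
flows = pd.DataFrame({
    "timestamp": ["2024-03-03 02:00:00", "2024-03-04 10:30:00", "2024-03-04 23:15:00"],
    "bytes": [900_000_000, 12_000, 45_000],
})
flows["out_of_hours"] = flag_out_of_hours(flows["timestamp"])
print(flows.groupby("out_of_hours")["bytes"].sum())
```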

## PHASE 5: C2 Confidence Scoring
**Expert Task:** Implement a simple scoring algorithm. 
- Consistent Interval (+20 points)
- Low Packet Size Variation (+20 points)
- Out-of-hours activity (+30 points)
- Connection to known malicious ASN (+30 points)

Generate a "Top 10 High-Confidence C2" list for the Incident Response team.

In [None]:
# YOUR CODE HERE
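
A minimal sketch of the additive scoring described above, for orientation only. It assumes a per-pair feature table has already been built; the column names (`interval_cv`, `bytes_cv`, `out_of_hours_share`, `asn`), the thresholds, and the ASN list are hypothetical placeholders, not values supplied by this lab.

```python
import pandas as pd

# Hypothetical thresholds and ASN list; tune them against known-good traffic
INTERVAL_CV_MAX = 0.20
BYTES_CV_MAX = 0.10
OUT_OF_HOURS_MIN_SHARE = 0.50
MALICIOUS_ASNS = {64512}  # placeholder taken from the private-use ASN range

def c2_score(features: pd.DataFrame) -> pd.DataFrame:
    """Add a 0-100 `score` column using the weights from the task list."""
    scored = features.copy()
    scored["score"] = (
        20 * (scored["interval_cv"] < INTERVAL_CV_MAX)                    # consistent interval
        + 20 * (scored["bytes_cv"] < BYTES_CV_MAX)                        # low size variation
        + 30 * (scored["out_of_hours_share"] > OUT_OF_HOURS_MIN_SHARE)    # out-of-hours activity
        + 30 * scored["asn"].isin(MALICIOUS_ASNS)                         # known malicious ASN
    )
    return scored.sort_values("score", ascending=False)

# Synthetic per-pair features; column names and values are assumptions
features = pd.DataFrame({
    "src_ip": ["10.0.0.5", "10.0.0.9", "10.0.0.12"],
    "dest_ip": ["203.0.113.7", "198.51.100.2", "192.0.2.44"],
    "interval_cv": [0.06, 1.10, 0.15],
    "bytes_cv": [0.02, 0.90, 0.40],
    "out_of_hours_share": [0.70, 0.05, 0.10],
    "asn": [64512, 64496, 64497],
})
top10 = c2_score(features).head(10)
print(top10[["src_ip", "dest_ip", "score"]])
```

Each criterion is a boolean multiplied by its weight, so a pair with missing (NaN) features simply earns no points for that criterion rather than raising an error.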