# Mini-Project: Cyber Czech 2019 Exercise Traffic and Log Analysis
# Writers name 
## date 22/04/2025
## Individual project
This notebook presents a story-driven analysis of the network traffic flows and event logs captured during the Cyber Czech 2019 Red Team/Blue Team exercise at Masaryk University, Brno, Czech Republic. We will:

1. Introduce the dataset and its context.

2. Load and clean each data source (IPFIX flows, Syslog, Winlog).

3. Tag events related to Red Team activity.

4. Explore and visualize trends and patterns over time.

5. Interpret findings and suggest next steps for deeper analysis.

## Dataset Overview
The data were downloaded from Zenodo: https://zenodo.org/record/3746129

The data were collected over two distinct intervals:

- Day 1: March 19, 2019, 11:00 – 18:00 (CET)

- Day 2: March 20, 2019, 08:00 – 15:30 (CET)

Three event types are provided in JSON format:

- cz.muni.csirt.IPFlowEntry: IP flows (IPFIX) with parsed application protocols.

- cz.muni.csirt.SyslogEntry: Linux Syslog entries.

- cz.muni.csirt.WinlogEntry: Windows Event Log entries.

Auxiliary files include network topology and Red Team IP/hostname schedules for reference.
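
Explanation: Before loading a whole file, it can help to look at the first few records and the fields they carry. The sketch below assumes each data file is gzip-compressed JSON lines (one record per line), which is how the loaders later in this notebook read them. The helper names peek_json_gz and field_names are illustrative and not part of the dataset tooling.

```python
import gzip
import json
from itertools import islice


def peek_json_gz(path, n=3):
    """Return the first n records of a gzip-compressed JSON-lines file as dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n) if line.strip()]


def field_names(records):
    """Sorted union of the top-level keys across the sampled records."""
    return sorted({key for rec in records for key in rec})
```

For example, field_names(peek_json_gz(path)) on one of the data.json.gz files lists the top-level fields without reading the full file into memory.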

In [54]:
# Imports & global config
import pathlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import seaborn as sns
import gzip
import json
import ipaddress
from datetime import datetime, timedelta

# Define paths
NB_ROOT   = pathlib.Path(".").resolve()
DATA_ROOT = NB_ROOT

IPFIX_GZ   = DATA_ROOT / "cz.muni.csirt.IPFlowEntry" / "data.json.gz"
SYSLOG_GZ  = DATA_ROOT / "cz.muni.csirt.SyslogEntry" / "data.json.gz"
WINLOG_GZ  = DATA_ROOT / "cz.muni.csirt.WinlogEntry" / "data.json.gz"
AUX_DIR    = DATA_ROOT / "auxiliary-material"

RED_IP_CSV = AUX_DIR / "redteam-reserved-ip-ranges.csv"
ATTACK_CSV = AUX_DIR / "redteam-attack-schedule.csv"

PRG_TZ = "Europe/Prague"

# Configure display and plotting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
sns.set(style='whitegrid')
%matplotlib inline

Explanation: We import libraries for data manipulation (pandas, numpy), visualization (matplotlib, seaborn), file I/O (gzip, json), IP range handling (ipaddress), and operating-system paths (pathlib). We define constants pointing to our data files and auxiliary CSVs using the same file structure as the original notebook. We also set the Prague timezone for consistency and configure display and plotting styles.

In [55]:
# Helper utilities
def load_json_gz(path):
    return pd.read_json(path, lines=True, compression="gzip")


def stream_json_gz(path, chunksize=100_000):
    """Stream large .json.gz line-by-line in chunks."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        rows = []
        for line in f:
            rows.append(json.loads(line))
            if len(rows) >= chunksize:
                yield pd.DataFrame(rows)
                rows.clear()
        if rows:
            yield pd.DataFrame(rows)


def to_prague(series):
    dt = pd.to_datetime(series, errors="coerce")
    if dt.dt.tz is None:
        return dt.dt.tz_localize(PRG_TZ, nonexistent="shift_forward")
    else:
        return dt.dt.tz_convert(PRG_TZ)


def cidr_dataframe(csv_path):
    txt = csv_path.read_text().strip()
    cidrs = [c.strip() for c in txt.split(",") if c.strip()]
    nets  = [ipaddress.ip_network(c) for c in cidrs]
    return pd.DataFrame({"network": nets})

def preview(df, n=5):
    """Display the first n rows of a DataFrame with context."""
    print(f"Previewing first {n} rows of DataFrame ({len(df)} total rows):")
    display(df.head(n))

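Explanation: The helper functions are load_json_gz, which reads gzip-compressed JSON lines into a DataFrame; stream_json_gz, which yields the same data in chunks; to_prague, which localizes naive timestamps to Europe/Prague and converts timezone-aware ones; cidr_dataframe, which loads CIDR ranges into a DataFrame; and preview, which displays the first rows of a DataFrame.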
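
Explanation: To make the chunking behaviour concrete, here is a minimal standard-library sketch of the same idea as stream_json_gz. It yields plain lists of dicts instead of DataFrames, and the demo file holds synthetic records, not exercise data. The name iter_chunks is illustrative.

```python
import gzip
import json
import os
import tempfile


def iter_chunks(path, chunksize=100_000):
    """Yield lists of parsed records, at most chunksize records per list."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            rows.append(json.loads(line))
            if len(rows) >= chunksize:
                yield rows
                rows = []  # start a fresh list so the yielded chunk is not mutated
    if rows:
        yield rows  # final, possibly shorter chunk


# Demonstration on a small synthetic file (hypothetical records)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"id": i}) + "\n")

chunk_sizes = [len(chunk) for chunk in iter_chunks(demo_path, chunksize=2)]
print(chunk_sizes)  # [2, 2, 1]
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize and not on the file size, as long as the caller does not keep every chunk.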

In [56]:
# Load and expand Red-Team IP ranges
df_red_ips = cidr_dataframe(RED_IP_CSV)
preview(df_red_ips.rename(columns={'network':'ip_range'}))

# Define known Red-Team hostnames manually (from topology or exercise documentation)
red_hosts = [
    # Example hostnames -- replace with actual names extracted from the topology
    'redteam-win01',
    'redteam-linux01',
    'redteam-gateway01',
]
print("Known Red-Team hostnames:", red_hosts)

Previewing first 5 rows of DataFrame (28 total rows):


Unnamed: 0,ip_range
0,27.3.0.0/19
1,212.5.0.0/19
2,213.5.0.0/21
3,202.2.96.0/19
4,110.5.80.0/20


Known Red-Team hostnames: ['redteam-win01', 'redteam-linux01', 'redteam-gateway01']


In [57]:
# Load & prepare Red-Team data 
red_ranges = cidr_dataframe(RED_IP_CSV)
print(f"Loaded {len(red_ranges)} Red-Team IP ranges")
print("\nRed-Team Networks:")
display(red_ranges)

# Load attack schedule
attack_raw = pd.read_csv(ATTACK_CSV, comment="#")
print("Raw attack schedule loaded")

# Normalize column names and preview
attack_raw.columns = attack_raw.columns.str.strip()
print("\nAttack Schedule Preview:")
display(attack_raw.head())

Loaded 28 Red-Team IP ranges

Red-Team Networks:


Unnamed: 0,network
0,27.3.0.0/19
1,212.5.0.0/19
2,213.5.0.0/21
3,202.2.96.0/19
4,110.5.80.0/20
5,200.110.240.0/20
6,66.231.64.0/20
7,181.118.144.0/20
8,37.32.0.0/19
9,193.151.128.0/19


Raw attack schedule loaded

Attack Schedule Preview:


Unnamed: 0,Exercise time[hh:mm],Attack type,Importance,Affected network segments
0,00:00 – 00:30,Network reconnaissance,Low,DMZ
1,00:30 – 01:00,Denial of service,Low,DMZ
2,01:00 – 02:00,Web attacks,Medium,DMZ
3,01:30 – 02:00,Phishing,Medium,DMZ
4,02:00 – 02:40,Web attacks,High,DMZ


Explanation: We load the CSV of reserved IP ranges into a DataFrame of ipaddress.ip_network objects and preview the attack schedule. Since the attack schedule CSV does not include hostnames, we define the list of Red-Team hostnames manually; the names above are placeholders to be replaced with the actual names from the network topology.
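
Explanation: The ranges are CIDR blocks, so deciding whether an address belongs to the Red Team is a membership test and not a string comparison. The sketch below uses two ranges from the preview above and a made-up log line. The helper names parse_cidrs and red_ips_in_text are illustrative.

```python
import ipaddress
import re


def parse_cidrs(text):
    """Parse a comma-separated string of CIDR blocks into ip_network objects."""
    return [ipaddress.ip_network(part.strip()) for part in text.split(",") if part.strip()]


IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def red_ips_in_text(text, networks):
    """Return the IPv4 literals in text that fall inside any of the networks."""
    hits = []
    for candidate in IPV4_RE.findall(text):
        try:
            ip = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. "999.1.1.1" looks like an address but is not one
        if any(ip in net for net in networks):
            hits.append(candidate)
    return hits


networks = parse_cidrs("27.3.0.0/19, 212.5.0.0/19")
# Hypothetical log line, not taken from the dataset
line = "Failed password for root from 27.3.12.7 port 22 (relay 10.1.2.3, bogus 999.1.1.1)"
print(red_ips_in_text(line, networks))  # ['27.3.12.7']
```

Searching log text for the CIDR string itself (for example "27.3.0.0/19") would not find hosts inside the range, which is why each address is extracted first and then tested.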

In [None]:
# Load and clean IPFIX flow data 
df_ipfix = load_json_gz(IPFIX_GZ)
print("IPFIX raw rows:", len(df_ipfix))

# Handle duplicate 'timestamp' columns if present
duplicate_columns = df_ipfix.columns[df_ipfix.columns.duplicated()].tolist()
if duplicate_columns:
    print("Dropping duplicate columns:", duplicate_columns)
    df_ipfix = df_ipfix.loc[:, ~df_ipfix.columns.duplicated()]

# Convert time (epoch milliseconds are UTC-based, so parse as UTC before converting to Prague time)
if "flowStartMilliseconds" in df_ipfix.columns:
    df_ipfix["time"] = pd.to_datetime(df_ipfix["flowStartMilliseconds"], unit="ms", errors="coerce", utc=True)
    df_ipfix["time"] = to_prague(df_ipfix["time"])
else:
    raise ValueError("Missing expected time column in IPFIX flows")

# Drop nulls and sort
df_ipfix_clean = (
    df_ipfix.dropna(subset=["time", "sourceIPv4Address", "destinationIPv4Address"])
            .sort_values("time", ignore_index=True)
)

print("Cleaned IPFIX rows:", len(df_ipfix_clean))
print("IPFIX Time Span:", df_ipfix_clean.time.min(), "→", df_ipfix_clean.time.max())

# Preview cleaned data
preview(df_ipfix_clean)

Explanation: We load the entire IPFIX dataset into memory with load_json_gz, drop any duplicate columns, convert flowStartMilliseconds to a timezone-aware time column in Prague time, and drop rows missing the timestamp or either IPv4 address. We then sort by time and preview the cleaned DataFrame.
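
Explanation: If flowStartMilliseconds holds epoch milliseconds, the values count from 1970-01-01 UTC, so the order of operations matters: interpret as UTC first, then convert. The standard-library sketch below contrasts that with merely labelling the UTC wall-clock time as local time. It assumes a fixed UTC+1 offset, which holds for both exercise days because they precede the switch to summer time on 31 March 2019. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed CET offset (UTC+1), valid for 19-20 March 2019
CET = timezone(timedelta(hours=1), "CET")


def epoch_ms_to_cet(ms):
    """Interpret epoch milliseconds as UTC, then convert to CET."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).astimezone(CET)


def naive_utc_mislabelled(ms):
    """The pitfall: take the UTC wall-clock reading and merely label it as CET."""
    utc_wall_clock = datetime.fromtimestamp(ms / 1000, tz=timezone.utc).replace(tzinfo=None)
    return utc_wall_clock.replace(tzinfo=CET)


ms = 1552989600000  # 2019-03-19 10:00:00 UTC
print(epoch_ms_to_cet(ms).isoformat())        # 2019-03-19T11:00:00+01:00
print(naive_utc_mislabelled(ms).isoformat())  # 2019-03-19T10:00:00+01:00
```

The second result is one hour early, which would shift every flow relative to the attack schedule.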

In [None]:
# Load and clean Syslog data 
# Large dataset – streamed in chunks to avoid memory errors
syslog_chunks = []
for chunk in stream_json_gz(SYSLOG_GZ):
    if "timegenerated" in chunk.columns:
        chunk["time"] = to_prague(chunk["timegenerated"])
    else:
        raise ValueError("Missing 'timegenerated' column in Syslog dataset")

    chunk = chunk.dropna(subset=["time"])
    syslog_chunks.append(chunk)

df_syslog_clean = pd.concat(syslog_chunks, ignore_index=True).sort_values("time", ignore_index=True)

print("Cleaned Syslog rows:", len(df_syslog_clean))
print("Syslog Time Span:", df_syslog_clean.time.min(), "→", df_syslog_clean.time.max())

# Preview cleaned Syslog DataFrame
preview(df_syslog_clean)

Explanation: We stream Syslog records in chunks, converting the timegenerated field to a timezone-aware time column localized to Prague. We then drop rows missing the timestamp, concatenate all chunks, sort by time, and preview the cleaned dataset. This chunked approach ensures memory efficiency for large logs.
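
Explanation: Concatenating every chunk still ends with the full dataset in memory. When only an aggregate is needed, it can be accumulated while streaming. The sketch below counts records per hour from an iterable of JSON lines. It assumes ISO 8601 timestamps with an explicit offset in the timegenerated field; the sample records are made up, and hourly_counts is an illustrative name.

```python
import json
from collections import Counter
from datetime import datetime


def hourly_counts(lines, time_field="timegenerated"):
    """Count records per hour from an iterable of JSON lines, one record at a time."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        stamp = record.get(time_field)
        if not stamp:
            continue  # mirror the notebook: skip rows without a timestamp
        hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
        counts[hour.isoformat()] += 1
    return counts


# Hypothetical records for illustration only
sample_lines = [
    '{"timegenerated": "2019-03-19T11:05:00+01:00", "message": "a"}',
    '{"timegenerated": "2019-03-19T11:55:30+01:00", "message": "b"}',
    '{"timegenerated": "2019-03-19T12:01:00+01:00", "message": "c"}',
    '{"message": "no timestamp"}',
]
counts = hourly_counts(sample_lines)
print(dict(counts))
```

The same function accepts an open gzip text handle in place of the list, since both are iterables of lines.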

In [None]:
# Load and clean Winlog data 
# Large dataset – streamed in chunks to avoid memory errors
winlog_chunks = []
for chunk in stream_json_gz(WINLOG_GZ):
    if "orig_timestamp" in chunk.columns:
        chunk["time"] = to_prague(chunk["orig_timestamp"])
    else:
        raise ValueError("Missing 'orig_timestamp' column in Winlog dataset")

    chunk = chunk.dropna(subset=["time"])
    winlog_chunks.append(chunk)

df_winlog_clean = pd.concat(winlog_chunks, ignore_index=True).sort_values("time", ignore_index=True)

print("Cleaned Winlog rows:", len(df_winlog_clean))
print("Winlog Time Span:", df_winlog_clean.time.min(), "→", df_winlog_clean.time.max())

print(df_winlog_clean.head(5))

Explanation: We iterate through the Winlog JSON lines in chunks, converting the orig_timestamp field to a timezone-aware time column localized to Prague. We drop entries with a missing timestamp, then concatenate and sort all chunks into a single DataFrame. Streaming keeps each parsing step small, although the concatenated result must still fit in memory.
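
Explanation: Later cells read fields such as Level and Computer from the XML payload of each Winlog record. A minimal sketch of that lookup is shown below on a hand-written sample event. The sample content (event ID, host name) is hypothetical, and the `{*}` namespace wildcard requires Python 3.8 or newer. The name extract_system_fields is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical event in the Windows Event Log XML layout
SAMPLE_EVENT = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>4625</EventID>
    <Level>0</Level>
    <Computer>HOST01.example.local</Computer>
  </System>
</Event>"""


def extract_system_fields(xml_str):
    """Return (level, computer) from an event's System block, or (None, None) on bad XML."""
    try:
        root = ET.fromstring(xml_str)
    except (ET.ParseError, TypeError):
        return None, None
    # {*} matches any namespace, so the lookup works with or without xmlns
    level = root.findtext("{*}System/{*}Level")
    computer = root.findtext("{*}System/{*}Computer")
    return (
        int(level) if level and level.strip().isdigit() else None,
        computer.lower() if computer else None,
    )


print(extract_system_fields(SAMPLE_EVENT))  # (0, 'host01.example.local')
```

Returning a pair of None values for unparseable input keeps the function safe to use with DataFrame.apply on columns that contain malformed rows.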

In [None]:
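# Apply tags on cleaned Winlog DataFrame
import re
winlog_ipv4_re = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def xml_mentions_red_range(text):
    """True if any IPv4 literal in text lies inside a Red-Team CIDR range."""
    for candidate in winlog_ipv4_re.findall(str(text)):
        try:
            if any(ipaddress.ip_address(candidate) in net for net in df_red_ips['network']):
                return True
        except ValueError:  # skip invalid literals such as 999.1.1.1
            continue
    return False

df_winlog_clean['ip_red'] = df_winlog_clean['xml'].apply(xml_mentions_red_range)
df_winlog_clean['host_red'] = df_winlog_clean['xml'].astype(str).apply(lambda text: any(host in text for host in red_hosts))
preview(df_winlog_clean[['time','xml','ip_red','host_red']])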

In [None]:

# Tag IPFIX flows involving Red-Team IPs 
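def is_red_ip(ip_str):
    """Return True if ip_str is a valid IP address inside any Red-Team CIDR range."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return False  # not a parseable IP address (e.g. 999.1.1.1)
    return any(ip in net for net in red_ranges["network"])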

df_ipfix_clean["red_source"] = df_ipfix_clean["sourceIPv4Address"].apply(is_red_ip)
df_ipfix_clean["red_dest"]   = df_ipfix_clean["destinationIPv4Address"].apply(is_red_ip)
df_ipfix_clean["involves_red"] = df_ipfix_clean["red_source"] | df_ipfix_clean["red_dest"]

print("Tagged IPFIX flows involving Red-Team IPs:", df_ipfix_clean["involves_red"].sum())



In [None]:
# Tag Syslog entries mentioning Red-Team IPs 
import re

# Precompile regex patterns for all red-team IPs
ip_patterns = [re.escape(str(net)) for net in red_ranges["network"]]
regex_pattern = re.compile(r"|".join(ip_patterns))

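# Flag Syslog messages containing any red-team IP.
# regex_pattern only matches literal CIDR strings such as "27.3.0.0/19", which are unlikely to occur in log text,
# so extract each IPv4 literal and test it for membership in the Red-Team ranges instead.
syslog_ipv4_re = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def message_mentions_red_range(msg):
    for candidate in syslog_ipv4_re.findall(msg):
        try:
            if any(ipaddress.ip_address(candidate) in net for net in red_ranges["network"]):
                return True
        except ValueError:  # skip invalid literals such as 999.1.1.1
            continue
    return False

df_syslog_clean["mentions_red_ip"] = df_syslog_clean["message"].astype(str).apply(message_mentions_red_range)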

print("Tagged Syslog entries mentioning Red-Team IPs:", df_syslog_clean["mentions_red_ip"].sum())
print("Sample tagged Syslog rows:")
display(df_syslog_clean[df_syslog_clean["mentions_red_ip"]].head())



In [None]:
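# Tag Syslog entries mentioning Red-Team IPs (exact-address variant)
import re

# Ranges such as 27.3.0.0/19 exceed a 1024-address expansion cap; if every range were skipped,
# the alternation would be empty and the pattern would match every message.
# Instead, collect the distinct IPv4 literals in the logs and keep those inside a Red-Team range.
ipv4_literal = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
syslog_messages = df_syslog_clean["message"].astype(str)

def in_red_ranges(ip_text):
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:  # not a valid IPv4 address
        return False
    return any(ip in net for net in red_ranges["network"])

seen_ips = {ip for msg in syslog_messages for ip in ipv4_literal.findall(msg)}
red_ips = {ip for ip in seen_ips if in_red_ranges(ip)}

# Tag messages mentioning Red-Team IPs
df_syslog_clean["mentions_red_ip"] = syslog_messages.apply(
    lambda msg: any(ip in red_ips for ip in ipv4_literal.findall(msg))
)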

# Summary
print("Tagged Syslog entries mentioning Red-Team IPs:", df_syslog_clean["mentions_red_ip"].sum())
print("Sample tagged Syslog rows:")
display(df_syslog_clean[df_syslog_clean["mentions_red_ip"]].head())


In [None]:

# Extract & count Red-Team IPs in Syslog messages 
# A more thorough scan of message fields for Red-Team IPs 
ip_regex = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

matched_red_ips = set()
redteam_count = 0

for msg in df_syslog_clean["message"]:
    found_ips = ip_regex.findall(str(msg))
    for ip in found_ips:
        if is_red_ip(ip):
            matched_red_ips.add(ip)
            redteam_count += 1

print(f"Red Team IPs found in message field: {redteam_count}")
print("Unique Red Team IPs matched:")
print(matched_red_ips)



In [None]:
# Tag Winlog entries mentioning Red-Team IPs 
# Check all string fields that could hold IPs (e.g., 'message' or 'xml')
winlog_text_field = "xml" if "xml" in df_winlog_clean.columns else "message"

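# Test each IPv4 literal for membership in the Red-Team ranges
# (regex_pattern only matches literal CIDR strings such as "27.3.0.0/19")
df_winlog_clean["mentions_red_ip"] = df_winlog_clean[winlog_text_field].astype(str).apply(
    lambda msg: any(is_red_ip(ip) for ip in ip_regex.findall(msg))
)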

print("Tagged Winlog entries mentioning Red-Team IPs:", df_winlog_clean["mentions_red_ip"].sum())
print(" Sample tagged Winlog rows:")
display(df_winlog_clean[df_winlog_clean["mentions_red_ip"]].head())



In [None]:
# Red-Team IPs in IPFIX 
df_red_in_ipfix = df_ipfix_clean[
    df_ipfix_clean["sourceIPv4Address"].isin(matched_red_ips) |
    df_ipfix_clean["destinationIPv4Address"].isin(matched_red_ips)
]

print("IPFIX rows with Red-Team IPs as src or dst:", len(df_red_in_ipfix))
display(df_red_in_ipfix.head())



In [None]:
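# Red-Team IPs in Winlog entries
# Note: this overwrites mentions_red_ip and counts only addresses already seen in Syslog (matched_red_ips)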
ip_regex = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # in case not already defined

df_winlog_clean["mentions_red_ip"] = df_winlog_clean["xml"].astype(str).apply(
    lambda text: any(ip in matched_red_ips for ip in ip_regex.findall(text))
)

print("Winlog entries mentioning Red-Team IPs:", df_winlog_clean["mentions_red_ip"].sum())
display(df_winlog_clean[df_winlog_clean["mentions_red_ip"]].head())


In [None]:
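# Syslog messages containing a known Red-Team IP as a whole address
# (a plain substring test would also match "1.2.3.4" inside "11.2.3.45")
df_syslog_matched_ips = df_syslog_clean[df_syslog_clean["message"].astype(str).apply(
    lambda msg: any(ip in matched_red_ips for ip in ip_regex.findall(msg))
)]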

print("Syslog rows containing known Red-Team IPs:", len(df_syslog_matched_ips))
display(df_syslog_matched_ips.head())


In [None]:
# Summary counts for Red-Team activity 
print("\nSummary of Red-Team tagged entries across datasets:")
print("IPFIX flows involving Red-Team IPs:", df_ipfix_clean["involves_red"].sum())
print("Syslog entries mentioning Red-Team IPs:", df_syslog_clean["mentions_red_ip"].sum())
print("Winlog entries mentioning Red-Team IPs:", df_winlog_clean["mentions_red_ip"].sum())
print("Unique Red-Team IPs found in logs:", len(matched_red_ips))


In [None]:
# Prepare data for visualization 
# Create hourly time bins for each log type tagged with Red-Team activity
ipfix_hourly = df_ipfix_clean[df_ipfix_clean["involves_red"]].set_index("time").resample("1h").size()
syslog_hourly = df_syslog_clean[df_syslog_clean["mentions_red_ip"]].set_index("time").resample("1h").size()
winlog_hourly = df_winlog_clean[df_winlog_clean["mentions_red_ip"]].set_index("time").resample("1h").size()

# Store for later visualization
activity_by_hour = pd.DataFrame({
    "IPFIX": ipfix_hourly,
    "Syslog": syslog_hourly,
    "Winlog": winlog_hourly
}).fillna(0).astype(int)

print("Hourly Red-Team activity prepared for plotting")
display(activity_by_hour.head())


In [None]:
# Line plot of Red-Team activity over time 
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 6))
activity_by_hour.plot(ax=plt.gca(), marker="o")

plt.title("Red-Team Activity Over Time (Hourly Resolution)")
plt.xlabel("Time")
plt.ylabel("Tagged Event Count")
plt.grid(True)
plt.legend(title="Dataset")
plt.tight_layout()
plt.show()


In [None]:
# Red-Team Syslog events by severity 
if "severity" in df_syslog_clean.columns:
    severity_summary = (
        df_syslog_clean[df_syslog_clean["mentions_red_ip"]]
        .groupby("severity")["message"]
        .count()
        .sort_values(ascending=False)
    )
    
    print("Syslog Red-Team Events by Severity:")
    display(severity_summary)
else:
    print("No 'severity' column found in Syslog dataset")



In [None]:
# Syslog Severity Levels Reference 
severity_levels = {
    0: "Emergency — system is unusable",
    1: "Alert — action must be taken immediately",
    2: "Critical — critical conditions",
    3: "Error — error conditions",
    4: "Warning — warning conditions",
    5: "Notice — normal but significant",
    6: "Informational — informational messages",
    7: "Debug — debug-level messages"
}

print("Syslog Severity Level Definitions (Defined by RFC 5424 / RFC 3164):")
for level, meaning in severity_levels.items():
    print(f"  {level}: {meaning}")



In [None]:
# Bar plot of Syslog Red-Team severity levels
if 'severity_summary' in globals():
    severity_summary.plot(kind="bar", figsize=(10, 4), color="tomato")
    plt.title("Syslog Red-Team Events by Severity Level")
    plt.xlabel("Severity Level")
    plt.ylabel("Message Count")
    plt.grid(True, axis="y", linestyle=":")
    plt.tight_layout()
    plt.show()
else:
    print("No severity summary available. Run previous severity summary cell first.")


In [None]:
# Syslog Red-Team severity over time 
if "severity" in df_syslog_clean.columns:
    severity_timeline = (
        df_syslog_clean[df_syslog_clean["mentions_red_ip"]]
        .set_index("time")
        .groupby([pd.Grouper(freq="1h"), "severity"])
        .size()
        .unstack(fill_value=0)
    )

    severity_timeline.plot(figsize=(14, 6), marker="o")
    plt.title("Red-Team Syslog Severity Over Time")
    plt.xlabel("Time")
    plt.ylabel("Log Count")
    plt.grid(True)
    plt.legend(title="Severity")
    plt.tight_layout()
    plt.show()
else:
    print("No 'severity' column found in Syslog dataset")


In [None]:
# Winlog Red-Team events by severity 
if "level" in df_winlog_clean.columns:
    winlog_severity_summary = (
        df_winlog_clean[df_winlog_clean["mentions_red_ip"]]
        .groupby("level")
        .size()  # count rows per level; does not depend on a 'message' column being present
        .sort_values(ascending=False)
    )

    print("Winlog Red-Team Events by Severity:")
    display(winlog_severity_summary)
else:
    print("No 'level' column found in Winlog dataset")



In [None]:
# Preview XML structure from Winlog 
import xml.etree.ElementTree as ET

sample_xml = df_winlog_clean["xml"].dropna().iloc[0]  # get first non-null XML
root = ET.fromstring(sample_xml)

# Pretty print structure and text
def walk_xml_tree(elem, level=0):
    indent = "  " * level
    print(f"{indent}<{elem.tag.split('}')[-1]}>", end="")
    if elem.text and elem.text.strip():
        print(":", elem.text.strip())
    else:
        print()
    for child in elem:
        walk_xml_tree(child, level + 1)

walk_xml_tree(root)


In [None]:
# Parse severity level from Winlog XML 
import xml.etree.ElementTree as ET

def extract_winlog_level(xml_str):
    try:
        root = ET.fromstring(xml_str)
        level = root.findtext(".//{*}System/{*}Level")  # catch namespace
        return int(level) if level is not None else None
    except (ET.ParseError, TypeError, ValueError):  # malformed XML, missing payload, or non-numeric level
        return None

# Create 'level' column from XML
if "xml" in df_winlog_clean.columns:
    df_winlog_clean["level"] = df_winlog_clean["xml"].apply(extract_winlog_level)
    print("Extracted Winlog severity levels from XML (refined parser)")
    display(df_winlog_clean[["time", "level", "mentions_red_ip"]].dropna().head())
else:
    print("No 'xml' column found in Winlog dataset")

In [None]:
# Extract numeric severity level from Winlog XML 
import xml.etree.ElementTree as ET

def extract_numeric_level(xml_str):
    try:
        root = ET.fromstring(xml_str)
        # Use namespace-agnostic tag access
        for elem in root.iter():
            if elem.tag.endswith("Level") and elem.text and elem.text.strip().isdigit():
                return int(elem.text.strip())
    except (ET.ParseError, TypeError):
        return None  # malformed or missing XML
    return None  # no numeric Level element found

df_winlog_clean["level"] = df_winlog_clean["xml"].apply(extract_numeric_level)

print("Extracted numeric Winlog severity levels from XML")
display(df_winlog_clean[["time", "level", "mentions_red_ip"]].dropna().head())



In [None]:
# List of known Red-Team hostnames from the topology image
red_hostnames = {"red-team1", "red-team2", "red-team3", "red-team4", "red-team5", "red-team6", "red-team7"}

# Extract <Computer> tag from XML
def extract_winlog_computer(xml_str):
    try:
        root = ET.fromstring(xml_str)
        computer = root.findtext(".//{*}System/{*}Computer")
        return computer.lower() if computer else None
    except (ET.ParseError, TypeError):  # malformed or missing XML
        return None

# Apply and tag Red-Team hosts
df_winlog_clean["computer"] = df_winlog_clean["xml"].apply(extract_winlog_computer)
df_winlog_clean["red_host"] = df_winlog_clean["computer"].isin(red_hostnames)


In [None]:
# Aggregate Red-Team Activity Across Datasets 

# IPFIX: select key columns for red-involved flows
ipfix_red = (
    df_ipfix_clean[df_ipfix_clean["involves_red"]]
    .loc[:, ["time", "sourceIPv4Address", "destinationIPv4Address"]]
    .assign(source="IPFIX", message="Flow involving Red-Team IP")
)

# Syslog: select tagged messages
syslog_red = (
    df_syslog_clean[df_syslog_clean["mentions_red_ip"]]
    .loc[:, ["time", "message"]]
    .assign(source="Syslog", sourceIPv4Address=None, destinationIPv4Address=None)
)

winlog_red = (
    df_winlog_clean[(df_winlog_clean["mentions_red_ip"]) | (df_winlog_clean["red_host"])]
    .loc[:, ["time", "xml"]]
    .rename(columns={"xml": "message"})
    .assign(source="Winlog", sourceIPv4Address=None, destinationIPv4Address=None)
)


# Combine all red-team activity into one timeline
redteam_events = pd.concat([ipfix_red, syslog_red, winlog_red], ignore_index=True).sort_values("time")

print("📦 Aggregated Red-Team Event Rows:", len(redteam_events))
display(redteam_events.head())



In [None]:
# Plot Red-Team Activity Timeline 
activity_timeline = (
    redteam_events.set_index("time")
    .groupby([pd.Grouper(freq="1h"), "source"])
    .size()
    .unstack(fill_value=0)
)

activity_timeline.plot(figsize=(14, 6), marker="o")
plt.title("Red-Team Activity Timeline by Source")
plt.xlabel("Time")
plt.ylabel("Event Count")
plt.grid(True)
plt.legend(title="Log Source")
plt.tight_layout()
plt.show()



In [None]:
# Plot Non-Red-Team Activity Timeline 

# Filter non-redteam rows in each dataset
ipfix_nonred = (
    df_ipfix_clean[~df_ipfix_clean["involves_red"]]
    .loc[:, ["time", "sourceIPv4Address", "destinationIPv4Address"]]
    .assign(source="IPFIX", message="Flow not involving Red-Team IP")
)

syslog_nonred = (
    df_syslog_clean[~df_syslog_clean["mentions_red_ip"]]
    .loc[:, ["time", "message"]]
    .assign(source="Syslog", sourceIPv4Address=None, destinationIPv4Address=None)
)

winlog_nonred = (
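    # exclude rows tagged by IP or by hostname, mirroring the Red-Team selection above
    df_winlog_clean[~(df_winlog_clean["mentions_red_ip"] | df_winlog_clean["red_host"])]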
    .loc[:, ["time", "xml"]]
    .rename(columns={"xml": "message"})
    .assign(source="Winlog", sourceIPv4Address=None, destinationIPv4Address=None)
)

# Combine
nonred_events = pd.concat([ipfix_nonred, syslog_nonred, winlog_nonred], ignore_index=True).sort_values("time")

# Plot
nonred_timeline = (
    nonred_events.set_index("time")
    .groupby([pd.Grouper(freq="1h"), "source"])
    .size()
    .unstack(fill_value=0)
)

nonred_timeline.plot(figsize=(14, 6), marker="o")
plt.title("Non-Red-Team Activity Timeline by Source")
plt.xlabel("Time")
plt.ylabel("Event Count")
plt.grid(True)
plt.legend(title="Log Source")
plt.tight_layout()
plt.show()



In [None]:
# Side-by-side Comparison of Activity Timelines 
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6), sharey=True)

activity_timeline.plot(ax=axes[0], marker="o", title="Red-Team Activity")
axes[0].set_xlabel("Time")
axes[0].set_ylabel("Event Count")
axes[0].grid(True)
axes[0].legend(title="Source")

nonred_timeline.plot(ax=axes[1], marker="o", title="Non-Red-Team Activity")
axes[1].set_xlabel("Time")
axes[1].grid(True)
axes[1].legend(title="Source")

plt.tight_layout()
plt.show()



In [None]:
# Overlay Red-Team and Non-Red-Team Activity Timeline 
combined = (
    pd.concat([
        redteam_events.assign(team="Red-Team"),
        nonred_events.assign(team="Non-Red-Team")
    ])
    .set_index("time")
    .groupby([pd.Grouper(freq="1h"), "team"])
    .size()
    .unstack(fill_value=0)
)

combined.plot(figsize=(14, 6), marker="o")
plt.title("Combined Red-Team and Non-Red-Team Activity Timeline")
plt.xlabel("Time")
plt.ylabel("Event Count")
plt.grid(True)
plt.legend(title="Team")
plt.tight_layout()
plt.show()



In [None]:
# Red-Team Activity Severity Overlay 
# Uses the Winlog severity parsed into the 'level' column
if "level" in df_winlog_clean.columns:
    df_red_severity = df_winlog_clean[
        (df_winlog_clean["mentions_red_ip"] | df_winlog_clean["red_host"]) & df_winlog_clean["level"].notna()
    ].copy()


    df_red_severity["level"] = pd.to_numeric(df_red_severity["level"], errors="coerce")

    if not df_red_severity.empty:
        severity_overlay = (
            df_red_severity.set_index("time")
            .groupby([pd.Grouper(freq="1h"), "level"])
            .size()
            .unstack(fill_value=0)
        )

        if not severity_overlay.empty:
            severity_overlay.plot(figsize=(14, 6), marker="o")
            plt.title("Red-Team Activity Severity Timeline (Winlog)")
            plt.xlabel("Time")
            plt.ylabel("Event Count by Severity Level")
            plt.grid(True)
            plt.legend(title="Severity Level")
            plt.tight_layout()
            plt.show()
        else:
            print("No severity data found to plot.")
    else:
        print("No red-team severity entries found in Winlog.")
else:
    print("No 'level' column available in Winlog data.")



In [None]:
# Global Distribution of Winlog Severity Levels 

# Ensure 'level' is numeric
df_winlog_clean["level"] = pd.to_numeric(df_winlog_clean["level"], errors="coerce")

# Drop NaNs and count frequency
severity_counts = df_winlog_clean["level"].dropna().value_counts().sort_index()

# Bar plot
severity_counts.plot(kind="bar", figsize=(10, 5), color="steelblue")
plt.title("Global Distribution of Winlog Event Severity Levels")
plt.xlabel("Severity Level")
plt.ylabel("Number of Events")
plt.grid(axis="y")
plt.tight_layout()
plt.show()

# Display raw counts
print("Event counts per severity level:")
display(severity_counts)



In [None]:
# Overlay Winlog Severity on Combined Timeline 

# Filter Winlog entries with valid level and time
df_severity_ts = df_winlog_clean[df_winlog_clean["level"].notna()].copy()
df_severity_ts["level"] = pd.to_numeric(df_severity_ts["level"], errors="coerce")

# Aggregate: severity level count per hour
severity_by_hour = (
    df_severity_ts.set_index("time")
    .groupby([pd.Grouper(freq="1h"), "level"])
    .size()
    .unstack(fill_value=0)
)

# Plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Plot combined red/non-red activity
combined.plot(ax=ax1, marker="o", linewidth=2)
ax1.set_ylabel("Event Count (Red vs. Non-Red)")
ax1.set_title("Combined Activity Timeline with Winlog Severity Overlay")
ax1.grid(True)

# Create secondary axis
ax2 = ax1.twinx()

# Overlay severity bars (stacked)
severity_by_hour.plot(
    ax=ax2,
    kind="bar",
    stacked=True,
    alpha=0.3,
    width=1.0,
    legend=False,
    color=plt.cm.Reds(np.linspace(0.4, 1, severity_by_hour.shape[1]))
)
ax2.set_ylabel("Winlog Events by Severity Level")
ax2.set_ylim(top=max(severity_by_hour.sum(axis=1).max() * 1.2, 10))  # give some headroom

# Legend fix
handles, labels = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles + handles2, labels + labels2, title="Legend", loc="upper left")

plt.tight_layout()
plt.show()



In [None]:
# Enhanced Overlay: Red-Team vs Non-Red + Winlog Severity 

# Recalculate severity overlay for Winlog 
df_severity_ts = df_winlog_clean[df_winlog_clean["level"].notna()].copy()
df_severity_ts["level"] = pd.to_numeric(df_severity_ts["level"], errors="coerce")

severity_by_hour = (
    df_severity_ts.set_index("time")
    .groupby([pd.Grouper(freq="1h"), "level"])
    .size()
    .unstack(fill_value=0)
)

# Prepare plot
fig, ax1 = plt.subplots(figsize=(14, 6))

# Plot Red-Team vs Non-Red-Team activity as lines
combined.plot(ax=ax1, marker='o', linewidth=2)
ax1.set_ylabel("Event Count (Red / Non-Red)")
ax1.set_title("Red-Team vs Non-Red-Team Activity with Winlog Severity Overlay")
ax1.grid(True)
ax1.set_xlabel("Time")

# Create second y-axis for severity levels
ax2 = ax1.twinx()

# Plot severity as stacked area plot (soft color, transparent)
severity_by_hour.plot.area(
    ax=ax2,
    stacked=True,
    alpha=0.3,
    linewidth=0,
    cmap="OrRd"
)

ax2.set_ylabel("Winlog Event Count by Severity Level")

# Custom legend with separation
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()

# Position legends clearly
first_legend = ax1.legend(h1, l1, title="Activity Type", loc="upper left")
second_legend = ax2.legend(h2, l2, title="Severity Level", loc="upper right")
ax1.add_artist(first_legend)

plt.tight_layout()
plt.show()




In [None]:
print("Unique severity levels in Winlog:")
print(df_winlog_clean["level"].dropna().unique())


In [None]:
# Map severity levels to human-friendly labels
severity_labels = {
    0: "LogAlways",
    1: "Critical",
    2: "Error",
    3: "Warning",
    4: "Information"
}

# Rename columns (rename leaves unmapped levels unchanged instead of turning them into NaN)
severity_by_hour = severity_by_hour.rename(columns=severity_labels)

# Plot with better colormap and legend title
fig, ax = plt.subplots(figsize=(14, 6))
combined.plot(ax=ax, marker="o", linewidth=2, legend=True)
severity_by_hour.plot.area(ax=ax, alpha=0.3, cmap="tab10")

plt.title("Combined Activity with Severity Overlay (Winlog)")
plt.xlabel("Time")
plt.ylabel("Event Count")
plt.grid(True)
plt.legend(title="Log Source & Severity", loc="upper left")
plt.tight_layout()
plt.show()


In [None]:
# Use numeric column if not already
df_winlog_clean["level"] = pd.to_numeric(df_winlog_clean["level"], errors="coerce")

# Drop rows with level 0 (LogAlways) and NaN
df_severity = df_winlog_clean[df_winlog_clean["level"].notna() & (df_winlog_clean["level"] != 0)].copy()

# Add hourly column
df_severity["hour"] = df_severity["time"].dt.floor("h")

# Find most common severity level per hour (excluding level 0)
top_severity_per_hour = (
    df_severity.groupby("hour")["level"]
    .agg(lambda x: x.value_counts().idxmax())
    .rename("most_frequent_level")
    .reset_index()
)

# Map level to labels
severity_labels = {
    1: "Critical",
    2: "Error",
    3: "Warning",
    4: "Information"
}
top_severity_per_hour["label"] = top_severity_per_hour["most_frequent_level"].map(severity_labels)

print("Most frequent severity per hour (excluding LogAlways):")
display(top_severity_per_hour.head())


In [None]:
# Reuse the combined timeline
fig, ax = plt.subplots(figsize=(14, 6))

combined.plot(ax=ax, marker="o", linewidth=2)

# Plot top severity level as scatter points
colors = {
    "Critical": "red",
    "Error": "orange",
    "Warning": "gold",
    "Information": "blue",
    "LogAlways": "gray"
}

for level_label, color in colors.items():
    points = top_severity_per_hour[top_severity_per_hour["label"] == level_label]
    ax.scatter(points["hour"], [0]*len(points), label=level_label, color=color, s=40, marker="s")

plt.title("Combined Activity with Dominant Severity per Hour")
plt.xlabel("Time")
plt.ylabel("Event Count")
plt.grid(True)
plt.legend(title="Team / Severity")
plt.tight_layout()
plt.show()
