[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Binkerton13/cyber-ml-training/blob/main/scenarios/scenario_02/notebook.ipynb)


In [None]:
import re

def detect_repo_url():
    import json
    from google.colab import output

    # Colab stores notebook metadata in a JS object
    meta = output.eval_js('JSON.stringify(IPython.notebook.metadata)')
    meta = json.loads(meta)

    # Try to extract the GitHub URL if present
    if 'colab' in meta and 'github_url' in meta['colab']:
        url = meta['colab']['github_url']
        # Convert GitHub URL → raw.githubusercontent URL
        raw = re.sub(r'^https://github\.com/',
                     'https://raw.githubusercontent.com/',
                     url)
        raw = raw.replace('/blob/', '/', 1)
        # Trim the notebook filename to get the notebook's directory
        # (the scenario folder that contains logs/), not the repo root
        return raw.rsplit('/', 1)[0]
    else:
        raise ValueError("Notebook not opened from GitHub in Colab.")

repo_base = detect_repo_url()
repo_base
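
The URL conversion inside `detect_repo_url()` can be checked outside Colab with a plain function. This is a minimal sketch, assuming the notebook metadata holds an ordinary `https://github.com/.../blob/...` URL. The helper name `blob_url_to_raw_dir` is hypothetical, and the sample URL is derived from the badge link above.

```python
def blob_url_to_raw_dir(url):
    """Map a GitHub blob URL to the raw URL of the directory holding the file."""
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        raise ValueError(f"Not a GitHub URL: {url}")
    # Swap the host, then drop the first '/blob/' path segment
    raw = "https://raw.githubusercontent.com/" + url[len(prefix):]
    raw = raw.replace("/blob/", "/", 1)
    # Strip the notebook filename, leaving the scenario directory
    return raw.rsplit("/", 1)[0]


# Sample URL assumed from the Colab badge; adjust to your own fork if needed
example_url = (
    "https://github.com/Binkerton13/cyber-ml-training/"
    "blob/main/scenarios/scenario_02/notebook.ipynb"
)
raw_dir = blob_url_to_raw_dir(example_url)
print(raw_dir + "/logs/auth.csv")
```

The result points at the scenario folder, which is why the later cells can append `/logs/` to it directly.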


# Scenario 02 — Lateral Movement in a Windows Domain (Intermediate)

In this scenario, you will investigate a multi-stage intrusion involving credential theft and lateral movement across a Windows domain.

You have three log sources:
- `auth.csv` — Authentication events
- `process.csv` — Windows process creation events
- `network.csv` — Network connections

Your goals:
1. Identify the compromised user (patient zero)
2. Identify the attacker IP
3. Identify malicious processes
4. Identify suspicious network exfiltration
5. Map activity to MITRE ATT&CK
6. Train an ML model to detect anomalous behavior
7. Generate structured outputs for automated grading

Cells marked **STUDENT TASK** require your input or interpretation.

## Step 1 — Setup

This cell loads required libraries and sets up paths.
You generally should not modify this.

In [None]:
import pandas as pd
import numpy as np
import json
import os
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

print("Libraries loaded.")

## Step 2 — Load Logs

We load the three log sources for this scenario.

**STUDENT TASK:** Review the dataframes and get familiar with the fields.

In [None]:
log_base = repo_base + "/logs/"

auth_df = pd.read_csv(log_base + "auth.csv")
proc_df = pd.read_csv(log_base + "process.csv")
net_df = pd.read_csv(log_base + "network.csv")


print("Auth logs:")
display(auth_df.head())
print("\nProcess logs:")
display(proc_df.head())
print("\nNetwork logs:")
display(net_df.head())

## Step 3 — SOC Analysis: Identify Suspicious Activity

In this section, you will:
- Look for unusual authentication patterns
- Look for suspicious processes
- Look for suspicious network connections

### 3A — Suspicious Authentication

**STUDENT TASK:** Identify logins from unusual IPs or to unusual hosts.

In [None]:
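# Simple heuristic: treat any source IP outside 10.0.0.0/8 as external-looking.
# na=False keeps a missing source_ip from breaking the boolean mask;
# such rows are flagged for review along with the external ones.
suspicious_auth = auth_df[~auth_df['source_ip'].str.startswith('10.', na=False)]
suspicious_auth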
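
The `10.` prefix check only covers one private range. If the domain also used `172.16.0.0/12` or `192.168.0.0/16`, those internal hosts would be flagged as external. A minimal sketch of a stricter check with the standard-library `ipaddress` module follows. The function name `is_rfc1918` and the sample addresses are illustrative and not taken from the scenario logs.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(ip):
    """Return True if ip is a string holding an RFC 1918 private address."""
    if not isinstance(ip, str):
        return False  # missing values (None, NaN) are not internal
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed address
    return any(addr in net for net in RFC1918_NETWORKS)


for sample in ["10.1.2.3", "172.16.0.9", "172.32.0.1", "192.168.1.1", "198.51.100.20"]:
    print(sample, is_rfc1918(sample))
```

With pandas, the same function could be applied as `auth_df[~auth_df['source_ip'].map(is_rfc1918)]`.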

### 3B — Suspicious Processes

**STUDENT TASK:** Identify processes that look like credential theft or hacking tools.

In [None]:
# Simple heuristic: processes with suspicious names
suspicious_proc = proc_df[proc_df['process'].str.contains('mimikatz', case=False, na=False)]
suspicious_proc
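
Matching a single tool name is easy to extend. The sketch below compiles a few patterns into one case-insensitive regular expression. Apart from `mimikatz`, the patterns are illustrative examples of commonly abused tooling and are not known to appear in this scenario's data; `is_suspicious_process` is a hypothetical helper.

```python
import re

# Illustrative patterns; only 'mimikatz' comes from the notebook itself
SUSPICIOUS_PATTERNS = [
    r"mimikatz",
    r"sekurlsa",          # mimikatz module name often seen in command lines
    r"procdump.*lsass",   # dumping LSASS memory with procdump
    r"psexec",            # remote execution tool often used for lateral movement
]
SUSPICIOUS_RE = re.compile("|".join(SUSPICIOUS_PATTERNS), re.IGNORECASE)


def is_suspicious_process(text):
    """Return True if a process name or command line matches any pattern."""
    if not isinstance(text, str):
        return False
    return SUSPICIOUS_RE.search(text) is not None


for sample in ["C:\\Temp\\Mimikatz.exe", "explorer.exe", "procdump.exe -ma lsass.exe"]:
    print(sample, is_suspicious_process(sample))
```

The joined pattern can also be passed straight to `proc_df['process'].str.contains(...)` in place of `'mimikatz'`.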

### 3C — Suspicious Network Connections

**STUDENT TASK:** Identify large outbound transfers to external IPs.

In [None]:
# Simple heuristic: more than 40,000 bytes sent to a destination outside 10.0.0.0/8.
# na=False keeps a missing dst_ip from breaking the boolean mask.
suspicious_net = net_df[
    (net_df['bytes_sent'] > 40000)
    & (~net_df['dst_ip'].str.startswith('10.', na=False))
]
suspicious_net
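
A per-row threshold misses exfiltration that is split into several smaller transfers. One way to catch that is to total the bytes sent to each external destination first. This is a sketch with made-up events; the 40,000-byte limit and the `10.0.0.0/8` internal range simply mirror the heuristic above, and `external_bytes_by_destination` is a hypothetical helper.

```python
import ipaddress
from collections import defaultdict

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # same assumption as the notebook
THRESHOLD = 40000


def external_bytes_by_destination(events):
    """Sum bytes_sent per destination IP, ignoring internal destinations."""
    totals = defaultdict(int)
    for event in events:
        if ipaddress.ip_address(event["dst_ip"]) not in INTERNAL_NET:
            totals[event["dst_ip"]] += event["bytes_sent"]
    return dict(totals)


# Hypothetical events: two transfers that are each below the threshold
events = [
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.20", "bytes_sent": 30000},
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.20", "bytes_sent": 25000},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "bytes_sent": 90000},
]
totals = external_bytes_by_destination(events)
flagged = [dst for dst, total in totals.items() if total > THRESHOLD]
print(totals, flagged)
```

In pandas the equivalent is a `groupby('dst_ip')['bytes_sent'].sum()` over the external rows.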

## Step 4 — Extract IOCs

You will now extract Indicators of Compromise (IOCs) from the suspicious activity.

**STUDENT TASK:** Build a list of IOCs (IPs, processes, hosts, users) that you believe are malicious or suspicious.

In [None]:
ioc_list = []

# Example: attacker IPs from suspicious auth
ioc_list.extend(list(suspicious_auth['source_ip'].unique()))

# Example: suspicious processes
ioc_list.extend(list(suspicious_proc['process'].unique()))

# Example: exfil destination IPs
ioc_list.extend(list(suspicious_net['dst_ip'].unique()))

# Deduplicate and sort so the saved output is stable across runs
ioc_list = sorted(set(ioc_list))
ioc_list
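
When several sources feed the IOC list, it helps to drop missing values and keep the result in a predictable order before it is written to JSON. A small sketch, assuming IOCs are plain strings; `build_ioc_list` and the sample values are illustrative.

```python
def build_ioc_list(*groups):
    """Merge iterables of IOC values into a sorted, de-duplicated list of strings."""
    unique = {
        value
        for group in groups
        for value in group
        if isinstance(value, str) and value  # skip NaN, None, and empty strings
    }
    return sorted(unique)


# Hypothetical inputs standing in for the three suspicious_* dataframes
iocs = build_ioc_list(
    ["198.51.100.20", "198.51.100.20"],
    ["mimikatz.exe"],
    [float("nan"), None, ""],
)
print(iocs)
```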

## Step 5 — MITRE ATT&CK Mapping

Based on the activity you observed, map relevant MITRE ATT&CK techniques.

Examples that may apply in this scenario:
- T1078 — Valid Accounts
- T1003 — OS Credential Dumping
- T1021 — Remote Services
- T1041 — Exfiltration Over C2 Channel

**STUDENT TASK:** Update the list below with the techniques you believe apply.

In [None]:
mitre_mapping = [
    "T1078",  # Valid Accounts
    "T1003",  # OS Credential Dumping
    "T1021",  # Remote Services
    "T1041"   # Exfiltration Over C2 Channel
]
mitre_mapping
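
Before submitting, it can be useful to check that every entry looks like a technique ID, since a typo such as a lowercase `t` is easy to miss. The sketch below uses the four techniques listed above plus a regular expression for the `T####` and `T####.###` formats. The helper names are hypothetical, and the check validates format only, not whether an ID exists in ATT&CK.

```python
import re

# Names taken from the list in this step
TECHNIQUE_NAMES = {
    "T1078": "Valid Accounts",
    "T1003": "OS Credential Dumping",
    "T1021": "Remote Services",
    "T1041": "Exfiltration Over C2 Channel",
}
# Technique IDs are 'T' + 4 digits, with an optional '.###' sub-technique suffix
TECHNIQUE_ID_RE = re.compile(r"^T\d{4}(\.\d{3})?$")


def malformed_ids(mapping):
    """Return the entries that do not look like ATT&CK technique IDs."""
    return [t for t in mapping if not TECHNIQUE_ID_RE.match(t)]


def describe(mapping):
    """Pair each ID with its name, or 'unlisted' if it is not in TECHNIQUE_NAMES."""
    return [(t, TECHNIQUE_NAMES.get(t, "unlisted")) for t in mapping]


candidate = ["T1078", "T1003.001", "t1021", "1041"]
print(malformed_ids(candidate))
print(describe(["T1078", "T1041"]))
```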

## Step 6 — Detection Rule

Write a simple detection rule that could be used to detect this activity in the future.

Examples:
- Match on attacker IP
- Match on suspicious process name
- Match on large outbound transfers to external IPs

**STUDENT TASK:** Update the detection rule string below to reflect your logic.

In [None]:
detection_rule = "source_ip == 'REPLACE_WITH_ATTACKER_IP' OR process == 'mimikatz.exe'"
detection_rule
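
The rule is stored as a string for grading, but the same logic can be tried out as a function before committing to it. This is a sketch under assumptions: events are plain dictionaries with `source_ip` and `process` keys, `rule_matches` is a hypothetical helper, and the process comparison is made case-insensitive here even though the rule string as written is not.

```python
ATTACKER_IP = "REPLACE_WITH_ATTACKER_IP"  # placeholder, as in the rule string


def rule_matches(event, attacker_ip=ATTACKER_IP):
    """Return True if the event matches the attacker IP or the mimikatz process."""
    process = (event.get("process") or "").lower()
    return event.get("source_ip") == attacker_ip or process == "mimikatz.exe"


# Hypothetical events for a quick check
events = [
    {"source_ip": "10.0.0.5", "process": "explorer.exe"},
    {"source_ip": "10.0.0.7", "process": "Mimikatz.exe"},
    {"source_ip": "198.51.100.20", "process": None},
]
hits = [rule_matches(e, attacker_ip="198.51.100.20") for e in events]
print(hits)
```

Running the function over the benign rows in the logs is a quick way to see whether the rule produces false positives.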

## Step 7 — ML Analysis

You will now build a simple anomaly detection model over combined features from the logs.

We will:
- Engineer features from the auth and network logs (the process logs are not used by the model)
- Train an Isolation Forest model
- Score suspicious events

**STUDENT TASK:** Follow the steps and interpret the results.

In [None]:
# Feature engineering: auth logs
auth_df['timestamp'] = pd.to_datetime(auth_df['timestamp'])
auth_df['hour'] = auth_df['timestamp'].dt.hour

auth_le_user = LabelEncoder()
auth_le_host = LabelEncoder()

auth_df['user_enc'] = auth_le_user.fit_transform(auth_df['username'])
auth_df['host_enc'] = auth_le_host.fit_transform(auth_df['destination_host'])

auth_features = auth_df[['hour', 'user_enc', 'host_enc']]
auth_features.head()

In [None]:
# Feature engineering: network logs
net_df['timestamp'] = pd.to_datetime(net_df['timestamp'])
net_df['hour'] = net_df['timestamp'].dt.hour

net_le_src = LabelEncoder()
net_le_dst = LabelEncoder()

net_df['src_enc'] = net_le_src.fit_transform(net_df['src_ip'])
net_df['dst_enc'] = net_le_dst.fit_transform(net_df['dst_ip'])

net_features = net_df[['hour', 'src_enc', 'dst_enc', 'bytes_sent']]
net_features.head()

In [None]:
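# Combine features (row-wise concatenation of auth + network).
# The two sources have different columns, so concat leaves NaN wherever
# a column does not apply to a row.
combined_features = pd.concat([
    auth_features.assign(source="auth"),
    net_features.assign(source="net")
], ignore_index=True)

# Drop the 'source' column for modeling and replace NaN with -1, a placeholder
# below every real value (hours, encoded labels, and byte counts are all >= 0).
# Without this, fitting can fail on missing values depending on the scikit-learn version.
X = combined_features.drop(columns=['source']).fillna(-1)
X.head()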
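
Concatenating frames with different columns is worth seeing on a tiny example, because the gaps it creates are easy to overlook. The sketch below uses made-up values with the same column names as the feature frames. Filling with `-1` is one simple option and an assumption of this sketch, not the only reasonable choice.

```python
import pandas as pd

# Made-up rows that mimic auth_features and net_features
auth = pd.DataFrame({"hour": [9, 23], "user_enc": [0, 1], "host_enc": [2, 0]})
net = pd.DataFrame(
    {"hour": [9, 23], "src_enc": [0, 1], "dst_enc": [1, 0], "bytes_sent": [1200, 52000]}
)

combined = pd.concat([auth, net], ignore_index=True)

# Only 'hour' exists in both sources; every other column has gaps
missing_per_column = combined.isna().sum()
print(missing_per_column)

# Replace the gaps with a sentinel outside the range of real values
filled = combined.fillna(-1)
print(filled)
```

An alternative design is to train one model per log source, which avoids placeholder values entirely.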

In [None]:
# Train Isolation Forest
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X)
print("Model trained.")

### Score Suspicious Events

We will score the suspicious network events identified earlier.

**STUDENT TASK:** Interpret whether the model agrees with your SOC findings.

In [None]:
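# Select the rows of X that correspond to suspicious_net, so the scored data has
# the same columns the model was trained on. The network rows were appended after
# the auth rows with a fresh index, so offset by len(auth_features).
# This assumes net_df still has its default 0..n-1 index from read_csv.
sus_rows = len(auth_features) + suspicious_net.index.to_numpy()
sus_net_features = X.iloc[sus_rows]

if not sus_net_features.empty:
    # decision_function: lower means more anomalous; negative values are outliers
    sus_score = model.decision_function(sus_net_features)
    anomaly_score = float(sus_score.mean())
else:
    anomaly_score = 0.0

anomaly_score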
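
To interpret `anomaly_score`, recall that `IsolationForest.decision_function` returns lower values for more anomalous samples, with negative values on the outlier side of the threshold set by `contamination`. The sketch below shows this on synthetic two-dimensional data. The data and the point at `(8, 8)` are invented for illustration and have no connection to the scenario logs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "normal" data clustered around the origin
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

model = IsolationForest(contamination=0.05, random_state=42).fit(normal)

# One typical point and one point far from the training data
points = np.array([[0.0, 0.0], [8.0, 8.0]])
scores = model.decision_function(points)  # lower = more anomalous
labels = model.predict(points)            # 1 = inlier, -1 = outlier

print(scores, labels)
```

A mean score near or below zero for the suspicious network rows therefore suggests the model agrees with the SOC findings, while a clearly positive mean suggests it does not.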

## Step 8 — Triage Summary and ML Explanation

**STUDENT TASK:** Fill in the summary and explanation strings below.

- `triage_summary` should describe what happened in plain language.
- `ml_explanation` should explain what the model saw and how it supports (or contradicts) your SOC findings.

In [None]:
triage_summary = "STUDENT TODO: Summarize the attack chain, compromised user, attacker IP, malicious process, and exfil attempt."
ml_explanation = "STUDENT TODO: Explain how the anomaly score relates to the suspicious network activity and overall intrusion."

triage_summary, ml_explanation

## Step 9 — Generate Required Outputs

This section creates the files required for automated grading:
- `soc_output.json`
- `ml_output.json`

Do not change the keys or structure of these outputs.

In [None]:
os.makedirs("student_output", exist_ok=True)

soc_output = {
    "ioc_list": ioc_list,
    "mitre_mapping": mitre_mapping,
    "triage_summary": triage_summary,
    "detection_rule": detection_rule
}

ml_output = {
    "anomaly_score": anomaly_score,
    "model_used": "IsolationForest",
    "features": list(X.columns),
    "explanation": ml_explanation
}

with open("student_output/soc_output.json", "w") as f:
    json.dump(soc_output, f, indent=4)

with open("student_output/ml_output.json", "w") as f:
    json.dump(ml_output, f, indent=4)

print("Outputs saved to student_output/")
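
Since the grader depends on the exact keys, a quick self-check of the saved files can catch accidental edits. This is a sketch that assumes the key sets shown in the cell above. The helper `missing_keys` is hypothetical, and the demonstration writes sample files to a temporary directory rather than reading `student_output/`.

```python
import json
import os
import tempfile

# Key sets copied from the soc_output and ml_output dictionaries above
REQUIRED_KEYS = {
    "soc_output.json": {"ioc_list", "mitre_mapping", "triage_summary", "detection_rule"},
    "ml_output.json": {"anomaly_score", "model_used", "features", "explanation"},
}


def missing_keys(directory):
    """Return {filename: [missing keys]} for each output file lacking required keys."""
    problems = {}
    for filename, required in REQUIRED_KEYS.items():
        with open(os.path.join(directory, filename)) as f:
            data = json.load(f)
        missing = required - set(data)
        if missing:
            problems[filename] = sorted(missing)
    return problems


# Demonstration with sample files; ml_output.json deliberately lacks 'features'
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "soc_output.json": {k: "" for k in REQUIRED_KEYS["soc_output.json"]},
        "ml_output.json": {"anomaly_score": 0.0, "model_used": "IsolationForest",
                           "explanation": ""},
    }
    for name, content in samples.items():
        with open(os.path.join(tmp, name), "w") as f:
            json.dump(content, f, indent=4)
    demo_result = missing_keys(tmp)

print(demo_result)
```

Calling `missing_keys("student_output")` after the cell above should return an empty dictionary if the structure is intact.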