[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Binkerton13/cyber-ml-training/blob/main/scenarios/scenario_02/notebook.ipynb)

# Scenario 02 — Multi‑Source Investigation (Auth, Process, Network)

In this scenario, you will investigate suspicious activity using three log sources:

- **Authentication logs** (auth)
- **Process execution logs** (process)
- **Network connection logs** (network)

Your goals:

1. Identify suspicious behavior across these data sources.
2. Extract Indicators of Compromise (IOCs).
3. Map activity to MITRE ATT&CK techniques.
4. Build a simple anomaly detection model on network activity.
5. Produce SOC and ML outputs in a structured format for automated grading.

This notebook is intentionally **not** a push‑button demo. You will be given structure, hints, and partial scaffolding, but you are expected to:

- Make decisions about features and models.
- Write and modify code.
- Justify your investigative choices.


## 1. Configure repository path

We will load logs directly from the GitHub repository using a **fixed repo root**.

- When you move this scenario to a team repo or organization, you only need to update **one line** below.
- The rest of the notebook will continue to work without modification.

**Instructor note:** Hard‑coding the repo root avoids relying on unstable Colab metadata APIs, so the notebook keeps working when it is copied or moved.

In [None]:
# TODO (Instructor when migrating repos):
# If this repo moves to a new GitHub org or user, update ONLY this line:
repo_root = "https://raw.githubusercontent.com/Binkerton13/cyber-ml-training/main"

# This notebook's folder inside the repo
scenario_path = "scenarios/scenario_02"

# Base path for logs
log_base = f"{repo_root}/{scenario_path}/logs/"
log_base

## 2. Load logs

We will load three log files:

- `auth.csv` — authentication events
- `process.csv` — process execution events
- `network.csv` — network connection events

Timestamps may include timezone information and a trailing `Z`. We will normalize them to a consistent, timezone‑aware format.

**Your focus here:**

- Understand the schema of each log.
- Get a feel for what “normal” vs “suspicious” might look like.
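
One way to get a quick overview of a log's schema is sketched here on a toy DataFrame. The helper name `summarize_schema` and the toy column names are illustrative assumptions, not the real schema of the scenario files.

```python
import pandas as pd

# Toy stand-in for one of the logs; these columns are assumptions for
# illustration and may not match the real CSV files.
toy_df = pd.DataFrame({
    "timestamp": ["2026-02-02T03:56:21+00:00", "2026-02-02T04:01:10+00:00", None],
    "user": ["alice", "bob", "alice"],
    "event_type": ["login", "failed_login", "failed_login"],
})

def summarize_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN/None is not counted as a distinct value
    })

schema = summarize_schema(toy_df)
print(schema)
```

Calling the same helper on `auth_df`, `proc_df`, and `net_df` after loading them should show which columns have gaps and which have only a handful of distinct values.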


In [None]:
import pandas as pd

auth_path = log_base + "auth.csv"
proc_path = log_base + "process.csv"
net_path = log_base + "network.csv"

auth_df = pd.read_csv(auth_path)
proc_df = pd.read_csv(proc_path)
net_df = pd.read_csv(net_path)

auth_df.head()

In [None]:
proc_df.head()

In [None]:
net_df.head()

## 3. Normalize timestamps and engineer basic time features

Timestamps in these logs include a timezone offset and sometimes also a trailing `Z` (e.g., `2026-02-02T03:56:21.906268+00:00Z`). An offset followed by `Z` is not valid ISO 8601, so we clean the string before parsing.

We will:

- Strip the trailing `Z` if present.
- Parse timestamps as timezone‑aware datetimes.
- Extract useful time‑based features (e.g., hour of day).

**Why this matters:**

- Time‑of‑day patterns are often useful for anomaly detection.
- Consistent timestamp parsing avoids subtle bugs in downstream analysis.
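
The normalization steps above can be tried in isolation on two made-up timestamp strings. This is a minimal sketch: the second value is invented for illustration, and the anchored pattern `Z$` is one possible way to remove only a trailing `Z`.

```python
import pandas as pd

# Two sample strings: one with the redundant trailing 'Z', one without.
raw = pd.Series([
    "2026-02-02T03:56:21.906268+00:00Z",
    "2026-02-02T14:05:00.000000+00:00",
])

# Remove a 'Z' only when it is the last character of the string.
cleaned = raw.str.replace(r"Z$", "", regex=True)

# Parse as timezone-aware UTC datetimes; unparseable values would become NaT.
parsed = pd.to_datetime(cleaned, utc=True, errors="coerce")

hours = parsed.dt.hour.tolist()
print(parsed)
print(hours)
```

If any value had failed to parse, `errors="coerce"` would yield `NaT` for it, so checking `parsed.isna().sum()` after parsing is a cheap sanity test.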


In [None]:
# Normalize timestamps in all three logs with one shared helper
def normalize_timestamps(df):
    """Strip a trailing 'Z', parse as UTC-aware datetimes, and add an 'hour' column."""
    df['timestamp'] = pd.to_datetime(
        # Anchored pattern: remove only a trailing 'Z', not every 'Z' in the string
        df['timestamp'].astype(str).str.replace(r'Z$', '', regex=True),
        utc=True,
        errors='coerce'  # unparseable values become NaT
    )
    df['hour'] = df['timestamp'].dt.hour
    return df

auth_df = normalize_timestamps(auth_df)
proc_df = normalize_timestamps(proc_df)
net_df = normalize_timestamps(net_df)

auth_df[['timestamp', 'hour']].head()

In [None]:
net_df[['timestamp', 'hour']].head()

## 4. Initial SOC investigation

In this phase, you will:

- Look for suspicious authentication patterns (e.g., repeated failures, unusual users, odd times).
- Look for suspicious processes (e.g., unusual binaries, odd parent/child relationships).
- Look for suspicious network connections (e.g., unusual ports, destinations, or volumes).

**Your job:**

- Use filtering, grouping, and basic analysis to identify suspicious activity.
- Start forming hypotheses about what is happening in the environment.
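
As an illustration of the filtering and grouping mentioned above, here is a sketch on a toy authentication table. The `user` column and all values are assumptions made up for the example; only `event_type == 'failed_login'` and `hour` appear in this notebook.

```python
import pandas as pd

# Toy authentication events; the 'user' column name is an assumption.
toy_auth = pd.DataFrame({
    "user": ["alice", "bob", "bob", "bob", "carol", "bob"],
    "event_type": ["login", "failed_login", "failed_login",
                   "failed_login", "failed_login", "login"],
    "hour": [9, 3, 3, 3, 14, 4],
})

# Keep only failed logins, then count them per user and per hour of day.
failures = toy_auth[toy_auth["event_type"] == "failed_login"]
failures_per_user = failures.groupby("user").size().sort_values(ascending=False)
failures_per_hour = failures.groupby("hour").size()

print(failures_per_user)
print(failures_per_hour)
```

A user with many failures clustered in a single off-hours bucket is the kind of pattern worth a closer look, though what counts as "many" depends on the baseline in the real data.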

Below are some starter cells you can modify or extend. They are intentionally minimal.

In [None]:
# Example: Look at failed logins
auth_df[auth_df['event_type'] == 'failed_login'].head()

In [None]:
# TODO: Explore authentication patterns
# Ideas:
# - Group by user and count failures
# - Look at failures by hour of day
# - Look for unusual source IPs

# Write your exploration code here.


In [None]:
# TODO: Explore process execution patterns
# Ideas:
# - Look at rare processes
# - Look at processes started by unusual parents
# - Look at processes started around the time of suspicious auth events

# Write your exploration code here.


In [None]:
# TODO: Explore network patterns
# Ideas:
# - Look at unusual destination ports
# - Look at connections from suspicious hosts
# - Look at bursts of connections in a short time window

# Write your exploration code here.


## 5. Extract Indicators of Compromise (IOCs)

Based on your investigation, extract IOCs such as:

- Suspicious IP addresses
- Suspicious user accounts
- Suspicious hostnames
- Suspicious process names or hashes

    
**Your job:**

- Build a list of IOCs that you believe are relevant to this scenario.
- You can store them as strings (e.g., IPs, usernames, process names).

This list will be part of your SOC output.
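
Before saving the list, it may help to tidy it. The sketch below strips whitespace, removes duplicates, and labels which entries parse as IP addresses. The helper names `normalize_iocs` and `classify_ioc` are hypothetical and not required by the grader.

```python
import ipaddress

def classify_ioc(value: str) -> str:
    """Label a string as 'ip' if it parses as an IPv4/IPv6 address, else 'other'."""
    try:
        ipaddress.ip_address(value)
        return "ip"
    except ValueError:
        return "other"

def normalize_iocs(raw_iocs):
    """Strip whitespace, drop empty entries, and de-duplicate in first-seen order."""
    seen = set()
    cleaned = []
    for item in raw_iocs:
        value = str(item).strip()
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned

# Placeholder values taken from the commented example in this notebook.
example_iocs = normalize_iocs(
    ["192.0.2.10", " 192.0.2.10 ", "evil_user", "", "malware.exe"]
)
labels = {ioc: classify_ioc(ioc) for ioc in example_iocs}
print(example_iocs)
print(labels)
```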

In [None]:
# TODO: Populate this list with your identified IOCs.
# Example:
# ioc_list = [
#     "192.0.2.10",  # suspicious IP
#     "evil_user",   # suspicious user
#     "malware.exe"  # suspicious process
# ]

ioc_list = []  # Replace with your IOCs
ioc_list

## 6. MITRE ATT&CK mapping

Map the observed behavior to MITRE ATT&CK techniques.

**Examples (not exhaustive):**

- Brute force logins → `T1110` (Brute Force)
- Use of valid accounts → `T1078` (Valid Accounts)
- Suspicious process execution → `T1059` (Command and Scripting Interpreter)
- Lateral movement via remote services → `T1021` (Remote Services)

**Your job:**

- Choose one or more ATT&CK technique IDs that best describe the activity.
- Store them as a list of strings.
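
A quick format check can catch typos in technique IDs before grading. This sketch assumes the usual ATT&CK shape of `T` plus four digits, optionally followed by a three-digit sub-technique suffix. It checks the format only, not whether the ID exists in ATT&CK. The helper name `invalid_technique_ids` is hypothetical.

```python
import re

# 'T' + four digits, with an optional '.NNN' sub-technique suffix.
TECHNIQUE_ID = re.compile(r"T\d{4}(\.\d{3})?")

def invalid_technique_ids(ids):
    """Return the entries that do not look like ATT&CK technique IDs."""
    return [t for t in ids if not TECHNIQUE_ID.fullmatch(str(t))]

bad = invalid_technique_ids(["T1110", "T1078", "T1059.001", "t1021", "1110"])
print(bad)
```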


In [None]:
# TODO: Add relevant MITRE ATT&CK technique IDs.
# Example:
# mitre_mapping = ["T1110", "T1078"]

mitre_mapping = []  # Replace with your chosen techniques
mitre_mapping

## 7. Build an anomaly detection model on network activity

In this scenario, you will build a simple anomaly detection model using **network logs**.

### Suggested approach

- Use features such as:
  - Hour of day (`hour`)
  - Source port (`src_port`)
  - Destination port (`dst_port`)
- Train an unsupervised anomaly detection model on **all network events**.
- Score a subset of **suspicious network events**.

### Model options (you choose)

- `IsolationForest` (unsupervised, tree‑based)
- `LocalOutlierFactor` (density‑based)
- `OneClassSVM` (boundary‑based)

**Your job:**

- Choose a model.
- Decide which features to use.
- Train the model.
- Score suspicious events.
- Produce a single `anomaly_score` summarizing how anomalous the suspicious activity is.
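
To compare the three model options side by side, the sketch below fits each on synthetic data shaped like `(hour, src_port, dst_port)` and scores a few new rows. All data and parameter values are illustrative assumptions, not recommendations. `LocalOutlierFactor` is created with `novelty=True` because it otherwise cannot score data it was not fitted on.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

def make_rows(n):
    """Synthetic stand-in for (hour, src_port, dst_port); not real log data."""
    return np.column_stack([
        rng.integers(8, 18, size=n),        # business-hours activity
        rng.integers(1024, 65536, size=n),  # ephemeral source ports
        rng.choice([80, 443], size=n),      # common web ports
    ]).astype(float)

X_train = make_rows(200)
X_new = make_rows(5)

candidates = {
    "IsolationForest": IsolationForest(n_estimators=100, random_state=42),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, novelty=True),
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train)
    # Lower decision_function values mean "more anomalous" for all three.
    scores[name] = estimator.decision_function(X_new)

for name, values in scores.items():
    print(name, values.round(3))
```

Distance-based models such as `LocalOutlierFactor` and `OneClassSVM` are typically sensitive to feature scale, so port numbers in the tens of thousands may dominate `hour` unless the features are scaled first.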


In [None]:
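# Feature selection for network logs
# You can modify this if you want to experiment with different features.
# Note: rows whose timestamp failed to parse have hour = NaN,
# so check for missing values before fitting a model.

net_features = net_df[['hour', 'src_port', 'dst_port']].copy()
net_features.head()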

### 7.1 Select suspicious network events

You may want to define a subset of network events that you consider "suspicious" based on your earlier investigation.

Examples:

- Connections from a suspicious host
- Connections to unusual ports
- Connections during unusual hours

**Your job:**

- Define `sus_net` as a subset of `net_df` containing events you consider suspicious.
- Then we will extract features from `sus_net` for scoring.
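
Several criteria can be combined with boolean masks. The sketch below uses a toy table with made-up values; the port list comes from the commented example in this notebook, and the off-hours window of 00:00 to 05:59 is an assumption chosen for illustration.

```python
import pandas as pd

# Toy network events with invented values.
toy_net = pd.DataFrame({
    "hour": [10, 3, 14, 2, 11],
    "src_port": [50000, 50001, 50002, 50003, 50004],
    "dst_port": [443, 4444, 443, 443, 8080],
})

# One mask per criterion, combined with | (logical OR).
unusual_port = toy_net["dst_port"].isin([4444, 8080])
off_hours = toy_net["hour"].between(0, 5)  # inclusive on both ends

toy_sus = toy_net[unusual_port | off_hours]
print(toy_sus)
```

Using `&` instead of `|` would keep only events that satisfy both criteria, which gives a smaller and stricter subset.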


In [None]:
# TODO: Define a subset of suspicious network events.
# Example (you should customize this):
# sus_net = net_df[net_df['dst_port'].isin([4444, 8080])]

sus_net = net_df.copy()  # Placeholder: currently treats all events as suspicious
sus_net.head()

In [None]:
# Extract features for suspicious network events
sus_net_features = sus_net[['hour', 'src_port', 'dst_port']].copy()
sus_net_features.head()

### 7.2 Initialize and train your anomaly detection model

**Your job:**

- Choose a model (e.g., `IsolationForest`).
- Initialize it with reasonable parameters.
- Fit it on `net_features` (the full network dataset).

Below is a scaffold with a commented example.

In [None]:
# TODO: Initialize your anomaly detection model.
# Example using IsolationForest:
# from sklearn.ensemble import IsolationForest
# model = IsolationForest(
#     n_estimators=100,
#     contamination=0.05,
#     random_state=42
# )

model = None  # Replace with your model instance


In [None]:
# TODO: Fit your model on the training features.
# Example:
# model.fit(net_features)

if model is not None:
    model.fit(net_features)
else:
    raise ValueError("You must initialize 'model' before fitting.")


### 7.3 Score suspicious events and compute an anomaly score

**Your job:**

- Use your trained model to score `sus_net_features`.
- Compute a single `anomaly_score` summarizing how anomalous the suspicious activity is.

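Common approaches:

- Use `decision_function` or `score_samples`. In scikit-learn, lower values from either method mean more anomalous; `decision_function` is shifted so that negative values indicate outliers.
- Take the mean, min, or max of the scores.
- If you choose `LocalOutlierFactor`, create it with `novelty=True`; otherwise these scoring methods are not available after fitting.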
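
To see how the score direction and the aggregation choice behave, here is a sketch on synthetic data with one deliberately odd row. The data, the name `demo_model`, and the parameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 synthetic "normal" rows: (hour, src_port, dst_port).
normal = np.column_stack([
    rng.integers(9, 18, size=200),
    rng.integers(1024, 65536, size=200),
    np.full(200, 443),
]).astype(float)

# One deliberately odd row: 03:00 connection to port 4444.
odd = np.array([[3.0, 50000.0, 4444.0]])
X = np.vstack([normal, odd])

demo_model = IsolationForest(n_estimators=100, random_state=42).fit(X)
scores = demo_model.decision_function(X)

normal_mean = float(scores[:-1].mean())
odd_score = float(scores[-1])

# Different aggregations answer different questions:
# mean = typical anomaly level, min = the single most anomalous event.
summary = {"mean": float(scores.mean()), "min": float(scores.min())}

print("normal mean:", round(normal_mean, 3))
print("odd row:", round(odd_score, 3))
print(summary)
```

The mean can hide a few strongly anomalous events inside a large, mostly benign subset, while the min reports only the worst case. Which one suits `anomaly_score` depends on how tightly `sus_net` was filtered.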

Below is a scaffold with a suggested pattern.

In [None]:
# TODO: Score suspicious events and compute an anomaly score.
# Example pattern:
# if not sus_net_features.empty:
#     sus_score = model.decision_function(sus_net_features)
#     anomaly_score = float(sus_score.mean())
# else:
#     anomaly_score = 0.0

if sus_net_features.empty:
    anomaly_score = 0.0
else:
    # Replace this with your scoring logic
    try:
        sus_score = model.decision_function(sus_net_features)
        anomaly_score = float(sus_score.mean())
    except Exception as e:
        # Chain the original exception so its traceback is preserved
        raise RuntimeError(f"Error scoring suspicious events: {e}") from e

anomaly_score

## 8. Triage summary

Write a short narrative summary of what you believe happened in this scenario.

Consider including:

- Initial access vector (if identifiable)
- Lateral movement or privilege escalation
- Data exfiltration or command‑and‑control behavior
- Key IOCs and MITRE techniques

**Your job:**

- Write a concise but clear triage summary.


In [None]:
# TODO: Write your triage summary here.

triage_summary = ""  # Replace with your summary
triage_summary

## 9. Save SOC and ML outputs

This cell saves your work in a structured format for automated grading.

**Do not change the keys or filenames.**

You may, however, change the *values* by updating your work in previous cells.
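
Once the save cell has run, the output files can be reloaded to confirm the expected keys are present. The sketch below demonstrates the idea on a temporary file. The key sets mirror the dictionaries in the save cell, but whether the grader requires every key is an assumption, and `missing_keys` is a hypothetical helper.

```python
import json
import os
import tempfile

# Keys copied from the soc_output and ml_output dictionaries in this notebook.
REQUIRED_SOC_KEYS = {"ioc_list", "mitre_mapping", "triage_summary", "detection_rule"}
REQUIRED_ML_KEYS = {"anomaly_score", "model_used", "features", "explanation"}

def missing_keys(path, required):
    """Load a JSON file and return the required keys it lacks, sorted."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return sorted(required - set(data))

# Demonstration on a deliberately incomplete file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "soc_output.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"ioc_list": [], "mitre_mapping": []}, f)
    missing = missing_keys(demo_path, REQUIRED_SOC_KEYS)

print(missing)
```

Against the real files, the calls would be `missing_keys("student_output/soc_output.json", REQUIRED_SOC_KEYS)` and the matching call for `ml_output.json`. An empty list means every expected key is present.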


In [None]:
import os
import json

os.makedirs("student_output", exist_ok=True)

soc_output = {
    "ioc_list": ioc_list,
    "mitre_mapping": mitre_mapping,
    "triage_summary": triage_summary,
    # Optional: you can describe a detection rule in plain language or pseudo‑query
    "detection_rule": "Describe your detection logic here (e.g., suspicious ports, hosts, or auth patterns)."
}

ml_output = {
    "anomaly_score": anomaly_score,
    "model_used": type(model).__name__ if model is not None else None,
    "features": list(net_features.columns),
    "explanation": "Briefly explain how your model and features capture anomalous network behavior."
}

with open("student_output/soc_output.json", "w") as f:
    json.dump(soc_output, f, indent=4)

with open("student_output/ml_output.json", "w") as f:
    json.dump(ml_output, f, indent=4)

print("Outputs saved to student_output/")