# Scenario 02 — Incremental Help Notebook (SOC + ML Hybrid)

This notebook provides **optional, step‑by‑step hints** for Scenario 02.

Each section includes:
- A conceptual hint
- A more concrete hint
- An optional reveal cell

Use this notebook only if you get stuck; the main scenario notebook is designed to be completed independently.

## 1. Understanding the logs

**Goal:** Understand what each log source represents.

### First Hint
Think of the logs as different windows into system behavior: authentication, processes, and network activity.

### Second Hint
Inspect columns like `timestamp`, `user`, `process`, `command`, `src_ip`, `dst_ip`, and `event_type`.

### Reveal (optional)

In [None]:
for df in (auth_df, process_df, network_df):
    df.info(); display(df.head())  # info() prints and returns None, so show head() separately

## 2. Timestamp normalization

**Goal:** Ensure timestamps are parsed correctly and usable for correlation.

### First Hint
Use `pd.to_datetime()` with `utc=True` to parse timestamps.

### Second Hint
Extract `hour` or `date` to help identify unusual activity times.

### Reveal (optional)

In [None]:
auth_df['timestamp'] = pd.to_datetime(auth_df['timestamp'], utc=True, errors='coerce')
auth_df['hour'] = auth_df['timestamp'].dt.hour

process_df['timestamp'] = pd.to_datetime(process_df['timestamp'], utc=True, errors='coerce')
process_df['hour'] = process_df['timestamp'].dt.hour

network_df['timestamp'] = pd.to_datetime(network_df['timestamp'], utc=True, errors='coerce')
network_df['hour'] = network_df['timestamp'].dt.hour
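
To see what `utc=True` and `errors='coerce'` actually do, the sketch below runs the same parsing on a tiny made-up frame. The sample rows and the 22:00 to 06:00 "off hours" window are assumptions chosen for illustration, not values taken from the scenario data.

```python
import pandas as pd

# Toy rows for illustration only; the real logs may use another format.
toy_auth = pd.DataFrame({
    "timestamp": [
        "2024-01-01 03:15:00+00:00",  # already UTC
        "2024-01-01 14:30:00+02:00",  # converted to 12:30 UTC
        "not-a-date",                 # becomes NaT under errors="coerce"
    ],
    "user": ["alice", "bob", "carol"],
})

toy_auth["timestamp"] = pd.to_datetime(
    toy_auth["timestamp"], utc=True, errors="coerce"
)
toy_auth["hour"] = toy_auth["timestamp"].dt.hour

# Count rows that failed to parse before relying on the hour column.
n_unparsed = toy_auth["timestamp"].isna().sum()

# Hypothetical off-hours window: before 06:00 or from 22:00 onward (UTC).
# Comparisons against NaN are False, so unparsed rows are never flagged.
toy_auth["off_hours"] = toy_auth["hour"].lt(6) | toy_auth["hour"].ge(22)

print(toy_auth)
print("unparsed rows:", n_unparsed)
```

Checking `n_unparsed` first matters because coerced rows silently drop out of any time-based filter.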

## 3. Initial SOC sweep

**Goal:** Identify suspicious authentication, process, or network events.

### First Hint
Look for failed logins, unusual users, odd process names, or rare IPs.

### Second Hint
Group by `user`, `process`, or `src_ip` to find anomalies.

### Reveal (optional)

In [None]:
auth_df['event'].value_counts().head(), process_df['process'].value_counts().head(), network_df['src_ip'].value_counts().head()
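
The grouping hint can be sketched on toy data as follows. The rows, the `failed_login` label, and the "seen exactly once" definition of a rare IP are assumptions for illustration; the real logs may need different thresholds.

```python
import pandas as pd

# Toy authentication rows, invented for illustration.
toy_auth = pd.DataFrame({
    "user": ["alice", "alice", "alice", "bob", "carol", "alice"],
    "src_ip": ["10.0.0.5", "10.0.0.5", "10.0.0.5",
               "10.0.0.7", "203.0.113.9", "10.0.0.5"],
    "event": ["failed_login", "failed_login", "failed_login",
              "login", "failed_login", "login"],
})

# Failed logins per user, highest count first.
failed = toy_auth[toy_auth["event"] == "failed_login"]
failed_per_user = failed.groupby("user").size().sort_values(ascending=False)

# Source IPs that appear only once in the whole log (assumed meaning of "rare").
ip_counts = toy_auth["src_ip"].value_counts()
rare_ips = sorted(ip_counts[ip_counts == 1].index)

print(failed_per_user)
print(rare_ips)
```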

## 4. Extracting IOCs

**Goal:** Identify concrete artifacts of malicious activity.

### First Hint
Look for suspicious users, processes, IPs, or file paths.

### Second Hint
Focus on values that appear in multiple suspicious events.

### Reveal (optional)

In [None]:
susp_users = auth_df[auth_df['event'] == 'failed_login']['user'].unique().tolist()
susp_users
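
One way to act on the second hint is to count how often each value recurs across the events already marked suspicious, then keep only the repeats. The frames below and the threshold of two occurrences are assumptions made up for this sketch.

```python
import pandas as pd

# Toy "already flagged" events from two log sources, invented for illustration.
susp_auth = pd.DataFrame({
    "user": ["svc_backup", "svc_backup", "alice"],
    "src_ip": ["203.0.113.9", "203.0.113.9", "10.0.0.5"],
})
susp_net = pd.DataFrame({
    "src_ip": ["203.0.113.9", "10.0.0.8"],
    "dst_ip": ["10.0.0.20", "10.0.0.21"],
})

MIN_HITS = 2  # hypothetical threshold: keep values seen in at least two events

# Pool source IPs from both logs and count occurrences.
all_ips = pd.concat([susp_auth["src_ip"], susp_net["src_ip"]], ignore_index=True)
ip_hits = all_ips.value_counts()
ioc_ips = sorted(ip_hits[ip_hits >= MIN_HITS].index)

# Same idea for user names in the authentication log.
user_hits = susp_auth["user"].value_counts()
ioc_users = sorted(user_hits[user_hits >= MIN_HITS].index)

iocs = {"ips": ioc_ips, "users": ioc_users}
print(iocs)
```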

## 5. MITRE ATT&CK mapping

**Goal:** Describe the behavior using ATT&CK techniques.

### First Hint
Think about credential access, execution, persistence, and lateral movement.

### Second Hint
Common techniques include:
- `T1110` (Brute Force)
- `T1059` (Command and Scripting Interpreter)
- `T1021` (Remote Services)
- `T1047` (Windows Management Instrumentation)

### Reveal (optional)

In [None]:
example_mitre = ["T1110", "T1059"]
example_mitre
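
A small lookup table can keep the mapping from observations to technique IDs explicit and repeatable. The observation labels and the `map_to_attack` helper below are hypothetical names invented for this sketch; only the technique IDs come from the list above.

```python
# Hypothetical observation labels mapped to ATT&CK technique IDs.
TECHNIQUE_BY_OBSERVATION = {
    "repeated_failed_logins": "T1110",     # Brute Force
    "script_interpreter_launch": "T1059",  # Command and Scripting Interpreter
    "remote_service_login": "T1021",       # Remote Services
    "wmi_process_creation": "T1047",       # Windows Management Instrumentation
}


def map_to_attack(observations):
    """Return sorted, de-duplicated technique IDs for the known observations."""
    return sorted({
        TECHNIQUE_BY_OBSERVATION[obs]
        for obs in observations
        if obs in TECHNIQUE_BY_OBSERVATION  # unknown labels are skipped
    })


observed = [
    "repeated_failed_logins",
    "script_interpreter_launch",
    "repeated_failed_logins",  # duplicates collapse to one ID
    "unlabelled_event",        # not in the table, so ignored
]
mitre_ids = map_to_attack(observed)
print(mitre_ids)
```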

## 6. Feature engineering (ML section)

**Goal:** Create meaningful numeric features for anomaly detection.

### First Hint
Think about counts, frequencies, unique values, and time‑based features.

### Second Hint
Examples: number of processes spawned, number of failed logins, number of unique destinations.

### Reveal (optional)

In [None]:
example_features = pd.DataFrame({
    'failed_logins': [auth_df[auth_df['event']=='failed_login'].shape[0]],
    'unique_processes': [process_df['process'].nunique()],
    'unique_dest_ips': [network_df['dst_ip'].nunique()]
})
example_features
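
The reveal above produces a single summary row for the whole dataset. Anomaly detection generally needs one row per entity so that entities can be compared with each other. The sketch below builds the same three features per user; it assumes every log carries a `user` column, which may not hold for the real network log.

```python
import pandas as pd

# Toy logs invented for illustration; a `user` column in each is an assumption.
toy_auth = pd.DataFrame({
    "user": ["alice", "alice", "bob", "bob", "bob"],
    "event": ["failed_login", "login", "failed_login", "failed_login", "login"],
})
toy_proc = pd.DataFrame({
    "user": ["alice", "bob", "bob", "bob"],
    "process": ["bash", "bash", "nc", "python"],
})
toy_net = pd.DataFrame({
    "user": ["alice", "bob", "bob"],
    "dst_ip": ["10.0.0.20", "10.0.0.20", "198.51.100.4"],
})

# Count failed logins per user by summing a boolean mask within each group.
failed_logins = (
    (toy_auth["event"] == "failed_login")
    .groupby(toy_auth["user"])
    .sum()
    .rename("failed_logins")
)
unique_processes = toy_proc.groupby("user")["process"].nunique().rename("unique_processes")
unique_dest_ips = toy_net.groupby("user")["dst_ip"].nunique().rename("unique_dest_ips")

# Align on the user index; users missing from a log get 0 for that feature.
features = (
    pd.concat([failed_logins, unique_processes, unique_dest_ips], axis=1)
    .fillna(0)
    .astype(int)
)
print(features)
```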

## 7. Choosing an anomaly detection model

**Goal:** Select a model appropriate for unsupervised anomaly detection.

### First Hint
IsolationForest is a strong default for tabular anomaly detection.

### Second Hint
LocalOutlierFactor works well for density‑based anomalies; OneClassSVM is more sensitive to feature scaling, to its `nu` and `gamma` settings, and to outliers in the training data.

### Reveal (optional)

In [None]:
from sklearn.ensemble import IsolationForest
example_model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
example_model
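
For comparison, the three models named in the hints can be set up side by side. This is a sketch with placeholder hyperparameters, not tuned values. The scaling step is added because distance-based and kernel-based models depend on feature scale, and `novelty=True` is what lets LocalOutlierFactor score data it was not fitted on.

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hyperparameters below are illustrative placeholders.
candidate_models = {
    # Tree-based, so no scaling step is needed.
    "isolation_forest": IsolationForest(
        n_estimators=100, contamination=0.05, random_state=42
    ),
    # Density-based; novelty=True enables predict() on unseen rows.
    "lof": make_pipeline(
        StandardScaler(),
        LocalOutlierFactor(n_neighbors=20, novelty=True),
    ),
    # Kernel-based; nu is an upper bound on the fraction of training errors.
    "one_class_svm": make_pipeline(
        StandardScaler(),
        OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"),
    ),
}

for name, model in candidate_models.items():
    print(name, "->", type(model).__name__)
```

All three expose `fit()` and `predict()`, with `predict()` returning `-1` for outliers and `1` for inliers.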

## 8. Training your model

**Goal:** Fit your model on your engineered features.

### First Hint
Ensure your feature matrix contains only numeric columns.

### Second Hint
Call `.fit()` on your model with the selected feature matrix.

### Reveal (optional)

In [None]:
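# example_features has a single row, so this only demonstrates the API;
# a useful model needs one row per entity (for example, per user or host)
X = example_features.select_dtypes(include='number')
example_model.fit(X)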
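
The sketch below shows the same fit on a multi-row feature table, with the non-numeric identifier column dropped first. The 50 synthetic users and the planted outlier are generated data for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic per-user features; values are random, not scenario data.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "user": [f"user{i}" for i in range(50)],
    "failed_logins": rng.poisson(1, size=50),
    "unique_processes": rng.poisson(5, size=50),
    "unique_dest_ips": rng.poisson(3, size=50),
})
# Plant one obviously unusual user so there is something to find.
numeric_cols = ["failed_logins", "unique_processes", "unique_dest_ips"]
features.loc[49, numeric_cols] = [40, 60, 80]

# Keep only numeric columns; the `user` identifier is not a feature.
X = features.select_dtypes(include="number")

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X)
print(X.shape)
```

Keeping `features` around alongside `X` makes it easy to map scores back to user names later.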

## 9. Scoring suspicious events

**Goal:** Compute anomaly scores.

### First Hint
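Use `decision_function()` or `score_samples()`. For both, lower values mean more anomalous, and `decision_function()` returns negative values for predicted outliers.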

### Second Hint
Aggregate scores (mean, min, max) to produce a single `anomaly_score`.

### Reveal (optional)

In [None]:
scores = example_model.decision_function(X)
float(scores.mean())
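
To make the relationship between the two scoring methods concrete, the sketch below fits an IsolationForest on synthetic data with one planted outlier. The data and the `summary` dictionary are assumptions for illustration; which aggregate to report as `anomaly_score` is left to the scenario.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic feature matrix: 100 ordinary rows plus one extreme row at index 0.
rng = np.random.RandomState(0)
X = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
X[0] = [8.0, 8.0, 8.0]

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X)

raw = model.score_samples(X)            # lower = more anomalous
decision = model.decision_function(X)   # raw shifted by model.offset_
labels = model.predict(X)               # -1 where decision < 0, else 1

# The row with the lowest score is the most anomalous one.
most_anomalous = int(np.argmin(decision))

# Aggregates that could feed a single anomaly_score.
summary = {
    "mean": float(decision.mean()),
    "min": float(decision.min()),
    "max": float(decision.max()),
}
print(most_anomalous, summary)
```

The mean tends to hide a single extreme row, so `min` is often the more informative aggregate when looking for the worst event.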

## 10. Writing your ML explanation

**Goal:** Explain your model choice and feature selection.

### First Hint
Describe why your model fits the structure of the data.

### Second Hint
Explain how your chosen features capture behavioral anomalies.

### Reveal (optional)

In [None]:
example_explanation = """
I chose IsolationForest because it handles high‑dimensional tabular data well
and does not assume a specific distribution. The features I selected capture
authentication failures, process diversity, and network behavior, which are
strong indicators of anomalous system activity.
"""
example_explanation