[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Binkerton13/cyber-ml-training/blob/main/scenarios/scenario_03/notebook.ipynb)

# Scenario 03 — AWS Cloud Compromise (Advanced)

In this scenario, you will investigate a multi-stage intrusion in an AWS-like cloud environment.

You have three log sources:
- `iam.csv` — IAM/CloudTrail-style events
- `api_calls.csv` — API calls (discovery, collection, exfiltration)
- `storage_access.csv` — S3-style access logs

Your goals:
1. Identify the compromised user
2. Identify the attacker IP and region
3. Identify privilege escalation events
4. Identify sensitive data access and exfiltration
5. Map activity to MITRE ATT&CK (cloud techniques)
6. Train an ML model to detect anomalous behavior
7. Generate structured outputs for automated grading

Cells marked **STUDENT TASK** require your input or interpretation.

## Step 1 — Setup

This cell loads required libraries and sets up paths.

In [None]:
import pandas as pd
import numpy as np
import json
import os
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

print("Libraries loaded.")
repo_root = "https://raw.githubusercontent.com/Binkerton13/cyber-ml-training/main"
scenario_path = "scenarios/scenario_03"   # this notebook’s folder
log_base = f"{repo_root}/{scenario_path}/logs/"

## Step 2 — Load Logs

We load the IAM, API, and storage access logs for this scenario.

**STUDENT TASK:** Review the dataframes and get familiar with the fields.

In [None]:
iam_df = pd.read_csv(log_base + "iam.csv")
api_df = pd.read_csv(log_base + "api_calls.csv")
storage_df = pd.read_csv(log_base + "storage_access.csv")


print("IAM logs:")
display(iam_df.head())
print("\nAPI logs:")
display(api_df.head())
print("\nStorage access logs:")
display(storage_df.head())
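
Before moving on, it can help to confirm that each dataframe has the columns the later cells rely on. The sketch below is illustrative only: the column sets are inferred from the code in this notebook rather than from a published schema, and `missing_columns` and `toy_iam` are hypothetical names.

```python
import pandas as pd

# Columns referenced by later cells in this notebook (inferred, not an official schema).
EXPECTED_COLUMNS = {
    "iam": {"user", "source_ip", "region", "action"},
    "api": {"timestamp", "user", "api_call", "latency_ms"},
    "storage": {"source_ip", "bucket", "bytes_read", "bytes_written"},
}

def missing_columns(df, required):
    """Return the required column names that are absent from df, sorted."""
    return sorted(set(required) - set(df.columns))

# Toy frame standing in for iam_df (made-up values, deliberately lacking "action").
toy_iam = pd.DataFrame(
    {"user": ["alice"], "source_ip": ["10.0.0.5"], "region": ["us-east-1"]}
)
print(missing_columns(toy_iam, EXPECTED_COLUMNS["iam"]))
```

With the real data, calling `missing_columns(iam_df, EXPECTED_COLUMNS["iam"])` should return an empty list if the file loaded as expected.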

## Step 3 — SOC Analysis: Identify Suspicious Activity

### 3A — Suspicious IAM Activity

**STUDENT TASK:** Identify logins from unusual regions and privilege escalation events.

In [None]:
# Suspicious regions (non-standard)
normal_regions = ["us-east-1", "us-west-2"]
suspicious_iam = iam_df[~iam_df["region"].isin(normal_regions)]
suspicious_iam

In [None]:
# Privilege escalation actions
priv_actions = ["CreateAccessKey", "AttachRolePolicy", "AssumeRole"]
priv_iam = iam_df[iam_df["action"].isin(priv_actions)]
priv_iam

### 3B — Suspicious API Calls

**STUDENT TASK:** Identify discovery, collection, and exfiltration API calls.

In [None]:
suspicious_api = api_df[api_df["api_call"].isin(["ListBuckets", "ListObjects", "GetObject", "PutObject"])]
suspicious_api
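
One way to approach this task is to label each call with an attack phase and count phases per user. The grouping below is an assumption made for illustration (for example, `PutObject` is also used by legitimate workloads), and the data is made up.

```python
import pandas as pd

# Assumed grouping of the four API calls into phases; adjust to your own reading.
PHASE_BY_CALL = {
    "ListBuckets": "discovery",
    "ListObjects": "discovery",
    "GetObject": "collection",
    "PutObject": "exfiltration",
}

# Hypothetical API events using the same column names as api_df.
toy_api = pd.DataFrame({
    "user": ["bob", "bob", "bob", "alice"],
    "api_call": ["ListBuckets", "GetObject", "PutObject", "DescribeInstances"],
})

# Calls that are not in the mapping are labelled "other".
toy_api["phase"] = toy_api["api_call"].map(PHASE_BY_CALL).fillna("other")

# Number of calls per (user, phase) pair.
phase_counts = toy_api.groupby(["user", "phase"]).size()
print(phase_counts)
```

A user who appears in all three phases within a short window is a stronger lead than one who appears in only one.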

### 3C — Suspicious Storage Access

**STUDENT TASK:** Identify large reads from sensitive buckets and large writes to external buckets.

In [None]:
suspicious_storage = storage_df[(storage_df["bytes_read"] > 40000) | (storage_df["bytes_written"] > 40000000)]
suspicious_storage
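
The filter above looks at one row at a time, so several medium-sized reads by the same source can stay under the threshold. A possible complement, sketched here on made-up data, is to total bytes per source IP and bucket before applying the same 40000-byte cutoff. The bucket names and values are hypothetical.

```python
import pandas as pd

# Hypothetical storage events using the same column names as storage_df.
toy_storage = pd.DataFrame({
    "source_ip": ["10.0.0.5", "203.0.113.7", "203.0.113.7", "203.0.113.7"],
    "bucket": ["app-logs", "finance-data", "finance-data", "external-drop"],
    "bytes_read": [1200, 30000, 35000, 0],
    "bytes_written": [0, 0, 0, 50000000],
})

# Total bytes read and written for each (source_ip, bucket) pair.
totals = (
    toy_storage.groupby(["source_ip", "bucket"])[["bytes_read", "bytes_written"]]
    .sum()
    .reset_index()
)

# Each individual read is below 40000 bytes, but the combined total is not.
heavy_readers = totals[totals["bytes_read"] > 40000]
print(heavy_readers)
```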

## Step 4 — Extract IOCs

**STUDENT TASK:** Build a list of IOCs (users, IPs, buckets, objects) that you believe are malicious or suspicious.

In [None]:
ioc_list = []

# Suspicious users
ioc_list.extend(list(suspicious_iam["user"].unique()))

# Suspicious source IPs
ioc_list.extend(list(suspicious_iam["source_ip"].unique()))
ioc_list.extend(list(suspicious_storage["source_ip"].unique()))

# Buckets
ioc_list.extend(list(suspicious_storage["bucket"].unique()))

# Deduplicate while keeping first-seen order, so the saved output is reproducible
ioc_list = list(dict.fromkeys(ioc_list))
ioc_list
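
Because `ioc_list` mixes users, IPs, and buckets in one flat list, it can be useful to separate the IP addresses for review. The following is a minimal sketch using the standard-library `ipaddress` module. `split_iocs` and the sample values are hypothetical, and the flat `ioc_list` should still be what goes into the graded output.

```python
import ipaddress

def split_iocs(iocs):
    """Separate strings that parse as IP addresses from all other indicators."""
    ips, others = [], []
    for item in iocs:
        try:
            ipaddress.ip_address(item)  # raises ValueError for non-IP strings
            ips.append(item)
        except ValueError:
            others.append(item)
    return {"ips": sorted(ips), "other": sorted(others)}

# Made-up indicators for illustration.
toy_iocs = ["bob", "203.0.113.7", "finance-data", "198.51.100.23"]
grouped = split_iocs(toy_iocs)
print(grouped)
```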

## Step 5 — MITRE ATT&CK Mapping

Based on the activity you observed, map relevant MITRE ATT&CK techniques.

Examples that may apply:
- T1078 — Valid Accounts
- T1098 — Account Manipulation
- T1087 — Account Discovery
- T1530 — Data from Cloud Storage
- T1567 — Exfiltration Over Web Service (sub-technique T1567.002: Exfiltration to Cloud Storage)

**STUDENT TASK:** Update the list below with the techniques you believe apply.

In [None]:
mitre_mapping = [
    "T1078",
    "T1098",
    "T1087",
    "T1530",
    "T1567"
]
mitre_mapping
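
A quick format check can catch typos in technique IDs before they reach the grader. This sketch assumes IDs follow the usual `T` plus four digits pattern, optionally followed by a three-digit sub-technique suffix. It checks only the format, not whether the ID exists in ATT&CK, and `invalid_ids` is a hypothetical helper.

```python
import re

# "T" + four digits, optionally ".": plus a three-digit sub-technique number.
TECHNIQUE_ID = re.compile(r"T\d{4}(\.\d{3})?")

def invalid_ids(ids):
    """Return the entries that do not look like ATT&CK technique IDs."""
    return [i for i in ids if not TECHNIQUE_ID.fullmatch(i)]

# Two well-formed IDs and two malformed ones (lowercase "t", extra digit).
candidate = ["T1078", "T1567.002", "t1530", "T15300"]
print(invalid_ids(candidate))
```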

## Step 6 — Detection Rule

**STUDENT TASK:** Write a detection rule that could be used to detect this activity.

You might match on:
- Suspicious region
- Attacker IP
- Sensitive bucket + large bytes_written
- Privilege escalation actions

In [None]:
detection_rule = "region NOT IN ['us-east-1','us-west-2'] AND action IN ['CreateAccessKey','AttachRolePolicy','AssumeRole']"
detection_rule
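
The rule string above is free text, so it is not executed anywhere in this notebook. To see what it would match, the same logic can be sketched as a pandas filter. The function name and the sample events are hypothetical.

```python
import pandas as pd

NORMAL_REGIONS = ["us-east-1", "us-west-2"]
PRIV_ACTIONS = ["CreateAccessKey", "AttachRolePolicy", "AssumeRole"]

def apply_rule(events):
    """Rows where region is outside the normal list AND action is a privilege action."""
    outside = ~events["region"].isin(NORMAL_REGIONS)
    escalation = events["action"].isin(PRIV_ACTIONS)
    return events[outside & escalation]

# Made-up IAM events with the same column names as iam_df.
toy_events = pd.DataFrame({
    "user": ["alice", "bob", "bob"],
    "region": ["us-east-1", "ap-southeast-1", "ap-southeast-1"],
    "action": ["AssumeRole", "ConsoleLogin", "CreateAccessKey"],
})

alerts = apply_rule(toy_events)
print(alerts)
```

Only the third row matches: the first is in a normal region and the second is not a privilege action. Running `apply_rule(iam_df)` is one way to check how many real events your rule would flag.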

## Step 7 — ML Analysis

We will build a simple anomaly detection model over API call behavior.

**STUDENT TASK:** Follow the steps and interpret the results.

In [None]:
# Feature engineering for API calls
api_df["timestamp"] = pd.to_datetime(api_df["timestamp"])
api_df["hour"] = api_df["timestamp"].dt.hour

le_user = LabelEncoder()
le_call = LabelEncoder()

api_df["user_enc"] = le_user.fit_transform(api_df["user"])
api_df["call_enc"] = le_call.fit_transform(api_df["api_call"])

X = api_df[["hour", "user_enc", "call_enc", "latency_ms"]]
X.head()
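
When interpreting `user_enc` and `call_enc`, keep in mind that `LabelEncoder` assigns integers by sorted class order, so the numbers carry no meaning beyond identity. The short sketch below, on made-up call names, shows the mapping and how to reverse it.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sequence of API call names.
calls = ["GetObject", "ListBuckets", "GetObject", "PutObject"]

le = LabelEncoder()
codes = le.fit_transform(calls)

# classes_ holds the sorted unique labels; the index of each label is its code.
print(list(le.classes_))
print(list(codes))

# inverse_transform maps codes back to the original labels.
print(le.inverse_transform([2])[0])
```

In the notebook itself, `le_call.classes_` and `le_user.classes_` give the same lookup for the real data.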

In [None]:
# Train Isolation Forest
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X)
print("Model trained.")

### Score Suspicious API Calls

**STUDENT TASK:** Interpret whether the model agrees with your SOC findings.

In [None]:
sus_api_features = X.loc[suspicious_api.index]

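if not sus_api_features.empty:
    # decision_function: negative values are anomalous, positive values are inliers
    sus_score = model.decision_function(sus_api_features)
    # Mean over the selected calls; lower means more anomalous
    anomaly_score = float(sus_score.mean())
else:
    # No suspicious calls were selected, so fall back to a neutral score
    anomaly_score = 0.0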

anomaly_score
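
To read the score, it helps to see the sign convention on data where the answer is obvious. The sketch below trains a separate toy model on synthetic points (not on the scenario logs) with one far-away outlier. In scikit-learn, `predict` returns -1 for anomalies and 1 for inliers, and `decision_function` is negative for anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points around the origin plus one distant outlier.
rng = np.random.RandomState(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[10.0, 10.0]])
train = np.vstack([normal_points, outlier])

toy_model = IsolationForest(contamination=0.05, random_state=42)
toy_model.fit(train)

center = np.array([[0.0, 0.0]])
outlier_score = toy_model.decision_function(outlier)[0]
center_score = toy_model.decision_function(center)[0]

print("outlier label:", toy_model.predict(outlier)[0])
print("outlier score is negative:", outlier_score < 0)
print("center scores higher than outlier:", center_score > outlier_score)
```

Applied to this scenario, a mean `anomaly_score` near or below zero suggests the model agrees that the selected API calls are unusual, while a clearly positive mean suggests it does not.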

## Step 8 — Triage Summary and ML Explanation

**STUDENT TASK:** Fill in the summary and explanation strings below.

In [None]:
triage_summary = "STUDENT TODO: Summarize the compromised user, attacker IP/region, privilege escalation, sensitive data access, and exfiltration."
ml_explanation = "STUDENT TODO: Explain how the anomaly score and features relate to the suspicious API behavior."

triage_summary, ml_explanation

## Step 9 — Generate Required Outputs

This section creates the files required for automated grading:
- `soc_output.json`
- `ml_output.json`

Do not change the keys or structure of these outputs.

In [None]:
os.makedirs("student_output", exist_ok=True)

soc_output = {
    "ioc_list": ioc_list,
    "mitre_mapping": mitre_mapping,
    "triage_summary": triage_summary,
    "detection_rule": detection_rule
}

ml_output = {
    "anomaly_score": anomaly_score,
    "model_used": "IsolationForest",
    "features": list(X.columns),
    "explanation": ml_explanation
}

with open("student_output/soc_output.json", "w") as f:
    json.dump(soc_output, f, indent=4)

with open("student_output/ml_output.json", "w") as f:
    json.dump(ml_output, f, indent=4)

print("Outputs saved to student_output/")
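
Since the grader depends on the keys, a quick self-check of a saved file can catch accidental edits. This is a sketch only: the required key sets are copied from the dictionaries above, the grader's actual checks are not known here, and `check_output` is a hypothetical helper demonstrated on a temporary file.

```python
import json
import os
import tempfile

# Key sets copied from the soc_output and ml_output dictionaries in this notebook.
REQUIRED_KEYS = {
    "soc_output.json": {"ioc_list", "mitre_mapping", "triage_summary", "detection_rule"},
    "ml_output.json": {"anomaly_score", "model_used", "features", "explanation"},
}

def check_output(path, required):
    """Load a JSON file and return the required keys it lacks, sorted."""
    with open(path) as f:
        data = json.load(f)
    return sorted(required - set(data))

# Demonstration on a deliberately incomplete file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "ml_output.json")
    with open(demo_path, "w") as f:
        json.dump({"anomaly_score": 0.0, "model_used": "IsolationForest"}, f)
    missing = check_output(demo_path, REQUIRED_KEYS["ml_output.json"])

print(missing)
```

For the real files, `check_output("student_output/ml_output.json", REQUIRED_KEYS["ml_output.json"])` should return an empty list.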