[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Binkerton13/cyber-ml-training/blob/main/scenarios/scenario_03/scenario_03C_ml/notebook.ipynb)

# Scenario 03C — AWS CloudTrail API Anomaly Detection (ML‑Heavy)

In this scenario, you will build an **anomaly detection model** to identify suspicious AWS CloudTrail API behavior.

You will use:
- `cloud_api.csv` — raw CloudTrail API events
- `cloud_api_features.csv` — engineered features for ML

Your goals:
1. Explore CloudTrail API behavior.
2. Select meaningful ML features.
3. Choose an anomaly detection model.
4. Train the model.
5. Score suspicious API sequences.
6. Justify your model choice.
7. Save your ML output for grading.

This is an **ML‑only** scenario — no SOC triage or MITRE mapping here.
Focus on feature engineering, model selection, and anomaly scoring.

## 1. Configure repository path

We load logs directly from GitHub using a fixed repo root.
Update only the `repo_root` line if the repo moves.

In [None]:
# TODO (Instructor when migrating repos):
# If this repo moves to a new GitHub org or user, update ONLY this line:
repo_root = "https://raw.githubusercontent.com/Binkerton13/cyber-ml-training/main"
scenario_path = "scenarios/scenario_03"
log_base = f"{repo_root}/{scenario_path}/logs/"
log_base

## 2. Load CloudTrail logs and ML feature dataset

We load:
- `cloud_api.csv` — raw CloudTrail events
- `cloud_api_features.csv` — ML‑ready features

Your job is to understand the feature space and decide which features matter.

In [None]:
import pandas as pd

api_path = log_base + "cloud_api.csv"
feat_path = log_base + "cloud_api_features.csv"

cloud_api_df = pd.read_csv(api_path)
features_df = pd.read_csv(feat_path)

cloud_api_df.head()

In [None]:
features_df.head()

## 3. Normalize timestamps

CloudTrail timestamps are ISO‑8601 strings in UTC. Parse them into timezone‑aware datetimes, then extract the hour of day as a first time‑based feature.
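
The parsing step can be checked on a few made-up strings before running it on the real column. This is a minimal sketch: the values in `sample` are hypothetical, and it relies on `pd.to_datetime(..., utc=True)` accepting a trailing `Z` directly.

```python
import pandas as pd

# Hypothetical sample values; the real data lives in cloud_api_df['timestamp'].
sample = pd.Series([
    "2024-01-15T03:27:45Z",
    "2024-01-15T14:05:10Z",
    "not-a-timestamp",
])

# utc=True yields timezone-aware UTC datetimes;
# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(sample, utc=True, errors="coerce")

hours = parsed.dt.hour              # NaN where parsing failed
n_bad = int(parsed.isna().sum())    # number of rows that did not parse

print(parsed)
print("unparseable rows:", n_bad)
```

Counting `NaT` values after parsing is a quick way to see how many rows silently failed to convert.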

In [None]:
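# Strip the 'Z' suffix and parse as timezone-aware UTC; values that fail to parse become NaT
cloud_api_df['timestamp'] = pd.to_datetime(
    cloud_api_df['timestamp'].astype(str).str.replace('Z', '', regex=False),
    utc=True,
    errors='coerce'
)
# Hour of day (0-23, UTC) as a simple time-based feature
cloud_api_df['hour'] = cloud_api_df['timestamp'].dt.hour
cloud_api_df[['timestamp', 'hour']].head()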

## 4. Explore CloudTrail API behavior

Look for patterns in:
- API call frequency
- Regions
- Resource types
- Error codes
- Latency

Use the starter cells below and extend them.
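
One possible way to approach these questions is with a few `groupby` and `value_counts` summaries. The sketch below runs on a small synthetic frame, not the real logs. Apart from `event_name` and `hour`, the column names (`user`, `region`, `error_code`) are assumptions and may not match `cloud_api.csv`; check `cloud_api_df.columns` first.

```python
import pandas as pd

# Synthetic stand-in for cloud_api_df. The 'user', 'region', and 'error_code'
# column names are hypothetical and may differ in the real dataset.
demo = pd.DataFrame({
    "user": ["alice", "alice", "bob", "bob", "bob", "carol"],
    "event_name": ["ListBuckets", "GetObject", "GetObject",
                   "DeleteBucket", "DeleteBucket", "ListBuckets"],
    "region": ["us-east-1", "us-east-1", "us-east-1",
               "ap-south-1", "ap-south-1", "us-east-1"],
    "error_code": [None, None, None, "AccessDenied", "AccessDenied", None],
    "hour": [9, 9, 10, 3, 3, 11],
})

# Call volume per user, busiest first
calls_per_user = demo.groupby("user").size().sort_values(ascending=False)

# Share of all calls per region; rarely used regions deserve a closer look
region_share = demo["region"].value_counts(normalize=True)

# Fraction of each user's calls that returned an error
error_rate = demo["error_code"].notna().groupby(demo["user"]).mean()

# Calls per hour of day, to spot off-hours bursts
calls_per_hour = demo.groupby("hour").size()

print(calls_per_user, region_share, error_rate, calls_per_hour, sep="\n\n")
```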

In [None]:
# Example: top API actions
cloud_api_df['event_name'].value_counts().head(20)

In [None]:
# TODO: Explore CloudTrail behavior
# Ideas:
# - Group by user or role
# - Look at unusual regions
# - Look at spikes in API activity
# - Look at error codes

# Write your exploration code here.


## 5. Select ML features

Use `features_df` to choose which columns to include in your model.

Examples:
- `api_count_last_1h`
- `unique_resources_accessed`
- `region_entropy`
- `error_rate`
- `time_of_day`

Your job: choose a subset of features and justify your choice.
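
Before picking a subset by hand, it can help to filter out columns that a model cannot use at all. The following is a sketch on synthetic data, not a prescribed method: the `user` and `always_zero` columns are hypothetical, and the other names are borrowed from the examples above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for features_df. 'user' and 'always_zero' are made up
# to illustrate an identifier column and a column with no variation.
demo_features = pd.DataFrame({
    "user": [f"user_{i}" for i in range(50)],
    "api_count_last_1h": rng.poisson(20, size=50),
    "unique_resources_accessed": rng.integers(1, 10, size=50),
    "error_rate": rng.uniform(0, 0.2, size=50),
    "always_zero": np.zeros(50),
})

# Keep numeric columns only; identifiers and strings cannot be fed to the model directly
numeric = demo_features.select_dtypes(include="number")

# Drop columns that hold a single value, since they carry no signal
informative = numeric.loc[:, numeric.nunique() > 1]

# Pairwise correlation helps spot redundant, near-duplicate features
corr = informative.corr()

print(list(informative.columns))
print(corr.round(2))
```

Highly correlated pairs in `corr` are candidates for dropping one of the two features, which is one argument you could use when justifying your selection.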

In [None]:
# TODO: Select your ML features
# Example (replace with your choices):
# selected_features = features_df[['api_count_last_1h', 'unique_resources_accessed', 'error_rate']]

selected_features = features_df.select_dtypes(include='number').copy()  # Placeholder: all numeric columns; replace with your chosen subset
selected_features.head()

## 6. Choose an anomaly detection model

Model options:
- IsolationForest
- LocalOutlierFactor
- OneClassSVM

Your job:
- Choose a model
- Initialize it
- Justify your choice in the explanation section later
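
To get a feel for how the three options behave, they can be fitted side by side on toy data. This is a rough sketch on synthetic points rather than CloudTrail features, and the hyperparameters (`contamination=0.05`, `nu=0.05`, `n_neighbors=20`) are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic data: 200 "normal" rows plus 10 clearly shifted rows
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
outliers = rng.normal(loc=6.0, scale=1.0, size=(10, 3))
X = np.vstack([normal, outliers])

# Distance- and kernel-based models are sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

def count_outliers(labels):
    """scikit-learn outlier detectors label outliers as -1 and inliers as +1."""
    return int((labels == -1).sum())

flagged = {}

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42).fit(X)
flagged["IsolationForest"] = count_outliers(iso.predict(X))

# In its default mode, LocalOutlierFactor only labels the data it was fitted on
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
flagged["LocalOutlierFactor"] = count_outliers(lof.fit_predict(X_scaled))

svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_scaled)
flagged["OneClassSVM"] = count_outliers(svm.predict(X_scaled))

for name, n in flagged.items():
    print(f"{name}: flagged {n} of {len(X)} rows")
```

Note that `LocalOutlierFactor` exposes `decision_function` only when it is created with `novelty=True`, which matters if you plan to score with `decision_function` later in this notebook.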

In [None]:
# TODO: Initialize your anomaly detection model
# Example:
# from sklearn.ensemble import IsolationForest
# model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

model = None  # Replace with your model instance

In [None]:
# TODO: Fit your model
# Example:
# model.fit(selected_features)

if model is not None:
    model.fit(selected_features)
else:
    raise ValueError("You must initialize 'model' before fitting.")

## 7. Score suspicious API sequences

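Use your model to compute anomaly scores. With scikit‑learn's `decision_function`, lower scores are more anomalous, and negative scores fall on the outlier side of the model's threshold.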

You may:
- Score all events
- Score a filtered subset
- Aggregate scores (mean, min, max)

Your job: produce a single `anomaly_score`.
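
The aggregation options above can be compared on toy data before committing to one. The sketch below assumes an `IsolationForest` fitted on synthetic points; `demo_model` and `summary` are illustrative names, not part of the grading contract.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Synthetic data: 300 typical rows plus 5 rows far from the rest
normal = rng.normal(0.0, 1.0, size=(300, 2))
outliers = rng.normal(8.0, 0.5, size=(5, 2))
X = np.vstack([normal, outliers])

demo_model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42).fit(X)

# decision_function: negative means flagged as an outlier, larger means more normal
scores = demo_model.decision_function(X)

summary = {
    "mean": float(scores.mean()),                  # overall level; dominated by normal rows
    "min": float(scores.min()),                    # score of the single most anomalous row
    "max": float(scores.max()),
    "share_flagged": float((scores < 0).mean()),   # fraction of rows below the threshold
}

# Row positions of the five lowest-scoring (most anomalous) rows
most_anomalous = np.argsort(scores)[:5]

print(summary)
print("most anomalous rows:", most_anomalous)
```

The mean is easy to report but can hide a handful of extreme events, while the minimum highlights the worst single event. Whichever you choose, wrapping it in `float(...)` keeps the value JSON‑serializable.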

In [None]:
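# TODO: Score suspicious events
# Example pattern:
# scores = model.decision_function(selected_features)
# anomaly_score = float(scores.mean())
# Note: LocalOutlierFactor only provides decision_function when created with novelty=True.
# If scoring fails, the fallback below sets anomaly_score to 0.0 and prints the error.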

try:
    scores = model.decision_function(selected_features)
    anomaly_score = float(scores.mean())
except Exception as e:
    anomaly_score = 0.0
    print("Error scoring events:", e)

anomaly_score

## 8. Explain your model choice

Write a short explanation of:
- Why you chose your model
- Why your features make sense
- How your model captures anomalous API behavior

In [None]:
# TODO: Write your explanation here

explanation = ""  # Replace with your explanation
explanation

## 9. Save ML output

This cell saves your ML results for grading.
Do not change the keys or filename.

In [None]:
import json
import os

os.makedirs("student_output", exist_ok=True)

ml_output = {
    "anomaly_score": anomaly_score,
    "model_used": type(model).__name__ if model is not None else None,
    "features": list(selected_features.columns),
    "explanation": explanation
}

with open("student_output/ml_output.json", "w") as f:
    json.dump(ml_output, f, indent=4)

print("ML output saved to student_output/ml_output.json")