# Scenario 03C — Incremental Help Notebook (AWS CloudTrail ML)

This notebook provides **optional, step‑by‑step hints** for Scenario 03C.

Each section includes:
- A conceptual hint
- A more concrete hint
- An optional reveal cell

Use this notebook only if you get stuck — the main notebook is designed to be completed independently.

## 1. Understanding CloudTrail logs

**Goal:** Understand what `cloud_api.csv` represents.

### First Hint
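CloudTrail logs record AWS API activity from users, roles, and services. Management events are logged by default; data events (such as S3 object reads) are recorded only when enabled.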

### Second Hint
Look at fields like `event_name`, `user`, `role`, `region`, `resource`, and `timestamp`.

### Reveal (optional)

In [None]:
cloud_api_df.info()  # prints dtypes and non-null counts; returns None, so call it on its own line
cloud_api_df.head()

## 2. Understanding the ML feature dataset

**Goal:** Understand what `cloud_api_features.csv` contains.

### First Hint
These features summarize behavioral patterns over time.

### Second Hint
Look for counts, unique values, entropy, error rates, and time‑based features.

### Reveal (optional)

In [None]:
display(features_df.head())  # display() is available in Jupyter and renders a table rather than a tuple repr
features_df.describe(include='all')
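
To make the feature types in the hints concrete, here is a minimal sketch of how counts, unique values, entropy, and error rates might be derived from raw events. The data is made up, the `error` flag is an assumed column, and `shannon_entropy` is a hypothetical helper; the real `cloud_api_features.csv` may have been built differently.

```python
import numpy as np
import pandas as pd

# Hypothetical raw events. `user` and `event_name` follow the hints above;
# `error` is an assumed boolean flag that the source does not confirm.
events = pd.DataFrame({
    "user": ["alice", "alice", "alice", "bob", "bob", "bob", "bob"],
    "event_name": ["GetObject", "GetObject", "PutObject",
                   "ListBuckets", "CreateUser", "DeleteTrail", "GetObject"],
    "error": [False, False, False, True, True, False, False],
})

def shannon_entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the observed value distribution."""
    p = values.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

# One row per user, one column per behavioral summary
per_user = events.groupby("user").agg(
    call_count=("event_name", "size"),
    unique_events=("event_name", "nunique"),
    event_entropy=("event_name", shannon_entropy),
    error_rate=("error", "mean"),
)
print(per_user)
```

A user who spreads calls evenly over many different event names gets a higher entropy than one who repeats the same call, which is why entropy can complement a plain unique count.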

## 3. Timestamp normalization

**Goal:** Ensure timestamps are parsed correctly and usable for correlation.

### First Hint
Use `pd.to_datetime()` with `utc=True` to parse ISO‑8601 timestamps.

### Second Hint
Extract `hour` or `day` to capture time‑of‑day patterns.

### Reveal (optional)

In [None]:
# utc=True handles the trailing 'Z' directly, so it does not need to be stripped;
# errors='coerce' turns unparseable values into NaT instead of raising.
cloud_api_df['timestamp'] = pd.to_datetime(
    cloud_api_df['timestamp'],
    utc=True,
    errors='coerce'
)
cloud_api_df['hour'] = cloud_api_df['timestamp'].dt.hour
cloud_api_df[['timestamp', 'hour']].head()
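
The parsing behavior can be checked on a few values before applying it to the full log. The sketch below uses made-up timestamps (not taken from `cloud_api.csv`) to show how a malformed entry becomes `NaT` and how time-of-day fields are extracted.

```python
import pandas as pd

# Made-up values for illustration: three ISO-8601 strings with a trailing "Z"
# and one malformed entry.
raw = pd.Series([
    "2024-03-01T23:15:00Z",
    "2024-03-02T04:05:30Z",
    "2024-03-02T12:00:00Z",
    "not-a-timestamp",
])

parsed = pd.to_datetime(raw, utc=True, errors="coerce")

time_features = pd.DataFrame({
    "timestamp": parsed,
    "hour": parsed.dt.hour,            # NaN where parsing failed
    "dayofweek": parsed.dt.dayofweek,  # Monday = 0
})
print(time_features)
print("unparseable rows:", int(parsed.isna().sum()))
```

Counting the `NaT` rows after parsing is a quick way to see how much of the log would be silently dropped from any time-based correlation.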

## 4. Exploring CloudTrail API behavior

**Goal:** Identify patterns in API usage.

### First Hint
Look for unusual API calls, regions, or spikes in activity.

### Second Hint
Group by `event_name`, `user`, or `region` to find anomalies.

### Reveal (optional)

In [None]:
cloud_api_df['event_name'].value_counts().head(20)
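
The grouping idea from the second hint can be sketched on a small, made-up log. The column names follow the hints above; the 15% rarity threshold is an arbitrary assumption chosen for this toy data.

```python
import pandas as pd

# Made-up events: alice stays in one region, bob makes a single call elsewhere.
events = pd.DataFrame({
    "user": ["alice"] * 5 + ["bob"] * 4,
    "region": ["us-east-1"] * 8 + ["ap-southeast-2"],
    "event_name": ["GetObject"] * 7 + ["ListBuckets", "DeleteTrail"],
})

# Calls per (user, region) pair; small counts point at regions a user rarely touches.
by_user_region = (
    events.groupby(["user", "region"]).size().rename("calls").reset_index()
)

# Share of each event name across the whole log; rare names deserve a closer look.
event_share = events["event_name"].value_counts(normalize=True)
rare_events = event_share[event_share < 0.15]  # assumed threshold

print(by_user_region.sort_values("calls"))
print(rare_events)
```

Rare does not mean malicious on its own, so these tables are best treated as a shortlist for manual review.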

## 5. Selecting ML features

**Goal:** Choose which columns from `cloud_api_features.csv` to use.

### First Hint
Pick features that capture behavior, not identifiers.

### Second Hint
Good candidates include counts, unique values, error rates, and time‑based features.

### Reveal (optional)

In [None]:
# Example feature selection pattern
candidate_cols = [c for c in features_df.columns if c not in ['user', 'role', 'resource']]
features_df[candidate_cols].head()

## 6. Choosing an anomaly detection model

**Goal:** Select a model appropriate for unsupervised anomaly detection.

### First Hint
IsolationForest is a strong default for tabular anomaly detection.

### Second Hint
LocalOutlierFactor works well for density‑based anomalies; OneClassSVM is more sensitive to feature scaling and to its `nu` and `gamma` settings.

### Reveal (optional)

In [None]:
from sklearn.ensemble import IsolationForest
example_model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
example_model
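
To compare the three candidates side by side, one option is to fit each on the same matrix and count how many rows it flags. This is a rough sketch on synthetic data (`X_demo` is a stand-in, not the scenario's features), and the hyperparameters are illustrative assumptions rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Synthetic stand-in for a feature matrix: 200 typical rows plus 5 extreme rows.
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 3)),
    rng.normal(8.0, 0.5, size=(5, 3)),
])

detectors = {
    "IsolationForest": IsolationForest(
        n_estimators=100, contamination=0.05, random_state=42
    ),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    # OneClassSVM is scale-sensitive, so it is wrapped with a scaler here.
    "OneClassSVM": make_pipeline(
        StandardScaler(), OneClassSVM(nu=0.05, gamma="scale")
    ),
}

flagged = {}
for name, detector in detectors.items():
    labels = detector.fit_predict(X_demo)  # -1 = outlier, 1 = inlier
    flagged[name] = int((labels == -1).sum())

print(flagged)
```

The counts alone do not say which model is right; checking whether the models agree on the same rows is usually more informative than the totals.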

## 7. Training your model

**Goal:** Fit your model on the selected features.

### First Hint
Ensure your feature matrix contains only numeric columns.

### Second Hint
Call `.fit()` on your model with the selected feature matrix.

### Reveal (optional)

In [None]:
# Example training pattern
# include='number' keeps every numeric dtype (int32, float32, ...), not only int64/float64
X = features_df.select_dtypes(include='number')
example_model.fit(X)
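
Real feature tables often contain identifier columns and missing values, and many scikit-learn estimators raise an error on `NaN`. The sketch below shows one way to prepare a matrix before fitting; the table and its column names are hypothetical, and median imputation is one simple choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical feature table; column names are illustrative only.
demo_features = pd.DataFrame({
    "user": ["alice", "bob", "carol", "dave"],
    "call_count": [120, 95, 110, 4000],
    "unique_events": [6, 5, np.nan, 42],
    "error_rate": [0.01, 0.02, 0.00, 0.35],
})

# Keep numeric columns only (this drops the `user` identifier),
# then fill gaps with each column's median.
X_demo = demo_features.select_dtypes(include="number")
X_demo = X_demo.fillna(X_demo.median())

model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
model.fit(X_demo)

print(X_demo)
```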

## 8. Scoring suspicious API sequences

**Goal:** Compute anomaly scores.

### First Hint
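Use `decision_function()` or `score_samples()`. In both, lower values are more anomalous; `decision_function()` is shifted so that negative values mark outliers.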

### Second Hint
Aggregate scores (mean, min, max) to produce a single `anomaly_score`.

### Reveal (optional)

In [None]:
# Example scoring pattern
scores = example_model.decision_function(X)
float(scores.mean())
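
The relationship between the two scoring methods, and the aggregation step from the second hint, can be sketched as follows. The data is synthetic, and reporting the sign-flipped minimum as `anomaly_score` is an assumed convention, not something the scenario prescribes.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(300, 4))  # synthetic "normal" behavior
X_seq = np.vstack([
    rng.normal(0.0, 1.0, size=(8, 4)),          # ordinary rows
    np.full((2, 4), 6.0),                       # two extreme rows
])

model = IsolationForest(
    n_estimators=100, contamination=0.05, random_state=42
).fit(X_train)

raw = model.score_samples(X_seq)          # lower = more anomalous
shifted = model.decision_function(X_seq)  # raw - model.offset_; negative = outlier

summary = {
    "mean": float(shifted.mean()),
    "min": float(shifted.min()),
    "max": float(shifted.max()),
}

# Assumed convention: report the worst row, sign-flipped so higher = more suspicious.
anomaly_score = -summary["min"]
print(summary, anomaly_score)
```

The mean can hide a single extreme row among many ordinary ones, so the minimum is often the more useful aggregate when one bad call is enough to matter.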

## 9. Writing your explanation

**Goal:** Explain your model choice and feature selection.

### First Hint
Describe why your model fits the structure of the data.

### Second Hint
Explain how your chosen features capture behavioral anomalies.

### Reveal (optional)

In [None]:
example_explanation = """
I chose IsolationForest because it handles high‑dimensional tabular data well
and does not assume a specific distribution. The features I selected capture
API call frequency, resource diversity, and error behavior, which are strong
signals of anomalous cloud activity.
"""
example_explanation