### Imports & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

project_root = Path.cwd().parents[0] if Path.cwd().name == "notebooks" else Path.cwd()
data_path = project_root / "data" / "discord_cleaned.csv"

df = pd.read_csv(data_path)
df.head()


# SafePlayAI – Discord Phishing EDA

## Results-driven Analytical Method

1. **Understand the problem:** Children receive phishing/scam messages on Discord.
2. **Start at the end:** Reduce exposure to scam messages by improving detection performance.
3. **Identify resources:** Discord chat history (`discord.csv`) including labels and message features.
4. **Obtain and prepare data:** Use `data_prep.py` to clean and engineer features.
5. **Do the work:** Explore data distributions, relationships, correlations and class balance.
6. **Present minimum viable answer:** Visual patterns + model experiment summary.
7. **Iterate if necessary:** Tune features, sampling strategies, and models.
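
To make step 4 concrete, here is a minimal sketch of how a few of the text-derived features could be computed from a single message. The contents of `data_prep.py` are not shown in this notebook, so the function name `text_features`, the two regular expressions, and the example message are all assumptions for illustration, not the project's actual implementation.

```python
import re

# Hypothetical patterns; the real data_prep.py may detect links and mentions differently.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
MENTION_PATTERN = re.compile(r"<@!?\d+>|@everyone|@here")


def text_features(msg_content: str) -> dict:
    """Derive simple numeric features from one message's text."""
    return {
        "message_length": len(msg_content),          # characters, including spaces
        "word_count": len(msg_content.split()),      # whitespace-separated tokens
        "has_link": int(bool(URL_PATTERN.search(msg_content))),
        "has_mention": int(bool(MENTION_PATTERN.search(msg_content))),
    }


# Made-up message, used only to show the shape of the output.
example = text_features("@everyone free nitro at http://example.test/claim")
print(example)
```

Applied to a whole dataset, a function like this could be mapped over the `msg_content` column and the resulting dictionaries expanded into columns.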


### Class balance and basic stats

In [None]:
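# sort_index() keeps the bars in label order (0, then 1) regardless of class frequency.
df["label"].value_counts(normalize=True).sort_index().plot(kind="bar")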
plt.title("Class Balance (0 = non-phishing, 1 = phishing)")
plt.xlabel("Label")
plt.ylabel("Proportion")
plt.show()

df.describe()


### Feature distributions

In [None]:
numeric_cols = ["message_length", "word_count", "time_since_join", "has_link", "has_mention", "num_roles"]

fig, axes = plt.subplots(len(numeric_cols), 1, figsize=(8, 3 * len(numeric_cols)))
for ax, col in zip(axes, numeric_cols):
    # common_norm=False normalizes each class separately, so the minority class stays visible.
    sns.histplot(data=df, x=col, hue="label", kde=True, ax=ax, stat="density", common_norm=False)
    ax.set_title(f"Distribution of {col} by label")
plt.tight_layout()
plt.show()


### Correlation heatmap

In [None]:
plt.figure(figsize=(10, 6))
corr = df[numeric_cols + ["label"]].corr()
sns.heatmap(corr, annot=True, fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


### Link to experiment results

In [None]:
# Path and project_root are already defined in the first cell.
results_path = project_root / "outputs" / "ab_test_results.csv"
results = pd.read_csv(results_path)
results


### Conclusion

## Experiment Summary & Hypothesis Evaluation

- **Model A** used only numeric features.
- **Model B** used numeric + TF-IDF text features from `msg_content`.
- We compared Accuracy, Precision, Recall, and F1.
- We ran a **t-test** on cross-validated F1 scores:

**Null Hypothesis (H₀):** Model B's mean cross-validated F1 is no higher than Model A's.  
**Alternative (H₁):** Model B's mean cross-validated F1 is higher than Model A's.

Based on the p-value printed by `experiment_ab_test.py`:

- If `p < 0.05` → Reject H₀ → the additional text features significantly improve phishing detection.
- If `p ≥ 0.05` → Fail to reject H₀ → there is no statistically significant improvement.
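
The decision rule above can be sketched with a paired t-test computed by hand. This is a sketch under assumptions: the notebook only says "t-test", so the paired, one-sided form (testing whether Model B's F1 is higher) is a guess at what `experiment_ab_test.py` does. The fold scores are made-up numbers and the helper `paired_t_statistic` is hypothetical. Instead of a p-value, the sketch compares the statistic with the tabulated one-sided 5% critical value for 4 degrees of freedom, which gives the same decision at that significance level.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for the per-fold differences b - a."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per fold")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two folds")
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-fold F1 scores for illustration only (5-fold cross-validation assumed).
f1_model_a = [0.80, 0.82, 0.78, 0.81, 0.79]
f1_model_b = [0.84, 0.85, 0.83, 0.86, 0.82]

t_stat, dof = paired_t_statistic(f1_model_a, f1_model_b)

# One-sided critical value of Student's t at alpha = 0.05 with 4 degrees of freedom.
T_CRIT_ONE_SIDED_DF4 = 2.132
reject_h0 = t_stat > T_CRIT_ONE_SIDED_DF4

print(f"t = {t_stat:.3f}, dof = {dof}, reject H0: {reject_h0}")
```

If SciPy is available, `scipy.stats.ttest_rel` returns the p-value directly, so the comparison against 0.05 can be made without a lookup table.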

If H₀ is rejected, the results support designing a SafePlayAI tool that flags risky Discord messages
using both numeric metadata and message text features.
