# 🧩 MVP in Phase 1: Minimum Viable Product

⚙️ **Phase 1: Robust, Interpretable Traditional Models (MVP)**

This is your first working version of the spear-phishing detection system.

## 🎯 Key Objectives

| Objective | Why |
|----------|-----|
| 🛠️ Build working pipeline | From raw email → clean text → features → prediction |
| ⏱️ Do it quickly | Get feedback before investing in complex deep learning models |
| 👨‍💼 Show stakeholders or users | Validate the idea and its usefulness |
| 📈 Create a benchmark | Compare with advanced models later (BERT, etc.) |

## 🧠 MVP Workflow Overview

| Step | Tool |
|------|------|
| Load + clean emails | `pandas`, `preprocess_text()` |
| Convert text to vectors | `TfidfVectorizer` |
| Train a basic model | `LogisticRegression` or XGBoost (`XGBClassifier`) |
| Predict & evaluate | `classification_report`, `confusion_matrix` |
| Optional: Wrap in CLI/UI | Flask, a command-line script, or Gradio |
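
The workflow table names `preprocess_text()` without showing it. The following is a minimal sketch of what such a function might look like, assuming that cleaning means lowercasing, replacing URLs and email addresses with placeholder tokens, and stripping punctuation. The regular expressions and the `urltoken` / `emailtoken` placeholders are illustrative choices, not the project's actual rules.

```python
import re

# Hypothetical cleaning rules; the project's real preprocess_text() may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def preprocess_text(text: str) -> str:
    """Lowercase, replace URLs and addresses with tokens, drop punctuation."""
    text = text.lower()
    text = URL_RE.sub(" urltoken ", text)
    text = EMAIL_RE.sub(" emailtoken ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    # Collapse the extra spaces left behind by the substitutions above
    return WHITESPACE_RE.sub(" ", text).strip()


print(preprocess_text("URGENT: verify at http://bad.example/login or mail it@corp.example!"))
# urgent verify at urltoken or mail emailtoken
```

Keeping a placeholder token rather than deleting links outright lets TF-IDF still register that a link was present, which is often a useful phishing signal.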

## 🧪 Code: Train and Evaluate MVP Model

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

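# Load cleaned dataset and drop rows with missing text or labels
df = pd.read_csv("data/phishing_email_clean.csv").dropna(subset=['clean_text', 'label'])
y = df['label']

# Split first so the vectorizer never sees the test emails
text_train, text_test, y_train, y_test = train_test_split(
    df['clean_text'], y, test_size=0.2, random_state=42, stratify=y
)

# Vectorize: fit on the training text only, then transform both splits
vectorizer = TfidfVectorizer(max_features=3000)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)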

# Evaluate
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## 🧱 Architecture Diagram (Conceptual)

```
+------------------+     +----------------+     +----------------------+     +-------------------+
|  Raw Emails      | --> | Preprocessing  | --> | TF-IDF Vectorization | --> | ML Classifier     |
| (text_combined)  |     | (clean_text)   |     | (3000 features)      |     | (LogReg/XGBoost)  |
+------------------+     +----------------+     +----------------------+     +-------------------+
                                                                                     |
                                                                                     v
                                                                         +----------------------+
                                                                         | Prediction: Phish?   |
                                                                         +----------------------+
```
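
The vectorization and classifier stages of the diagram can be bundled into a single scikit-learn `Pipeline`, so that raw cleaned text goes in and a prediction comes out. This is a sketch under assumptions rather than the project's actual wiring: the stage names `"tfidf"` and `"clf"` are arbitrary, and the training examples are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stage names are arbitrary labels chosen for this sketch.
phish_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=3000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Made-up examples standing in for the clean_text column; 1 = phishing.
toy_texts = [
    "verify your account password now",
    "urgent wire transfer request password",
    "click link to verify account",
    "meeting agenda for monday",
    "lunch menu and meeting notes",
    "quarterly report attached for review",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# fit() learns the TF-IDF vocabulary and the classifier weights in one call
phish_pipeline.fit(toy_texts, toy_labels)

demo_pred = phish_pipeline.predict(["please verify your password"])
demo_proba = phish_pipeline.predict_proba(["please verify your password"])[0, 1]
print(demo_pred, round(float(demo_proba), 3))
```

One benefit of this design is that the pipeline can be passed to `cross_val_score` or saved as one object, and the vectorizer is then always fitted only on the training portion of the data.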

## 🚀 Phase 2 and Beyond

- **Phase 2** → Fine-tune BERT/DistilBERT for deep NLP understanding
- **Phase 3** → Real-time feedback + active learning (e.g., retrain on flagged emails)
- **Phase 4** → Add retaliation/intelligence module: quarantine, alert, or auto-response

These later phases would also include monitoring, confidence scoring, and optional honeypot-style traps.

In [None]:
# 📈 ROC & Precision-Recall Curves
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# Compute ROC curve and AUC
y_probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

# Compute Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_probs)
plt.figure()
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
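
The precision-recall curve can also be used to choose an operating threshold instead of relying on the default 0.5. The sketch below picks the threshold with the highest recall among those that reach a target precision. The helper `pick_threshold`, the 0.95 target, and the toy scores are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, min_precision=0.95):
    """Return (threshold, precision, recall) with the best recall that
    still meets min_precision, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more entry than thresholds;
    # the final (precision=1, recall=0) point has no threshold, so skip it.
    candidates = np.flatnonzero(precision[:-1] >= min_precision)
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(recall[:-1][candidates])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Made-up labels and scores; 1 = phishing.
toy_true = [0, 0, 1, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.3, 0.35, 0.6, 0.65, 0.8, 0.9, 0.95]

print(pick_threshold(toy_true, toy_scores))
```

On real data the call would be `pick_threshold(y_test, y_probs)`. A threshold tuned this way is best chosen on a separate validation split, since tuning it on the test set makes the reported metrics optimistic.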

In [None]:
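# 🔄 Automated Data Ingestion Example
import os
import pandas as pd

def ingest_data_from_folder(folder_path):
    """Read every .txt file in folder_path into a one-column DataFrame."""
    emails = []
    # Sort so the row order is reproducible; os.listdir() order is arbitrary
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            # errors='replace' keeps ingestion going when a file is not valid UTF-8
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8', errors='replace') as f:
                emails.append(f.read())
    return pd.DataFrame({'text_combined': emails})

# Example: df_new = ingest_data_from_folder('new_emails/')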
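
Freshly ingested emails arrive in a `text_combined` column, while the model was trained on vectorized `clean_text`. The sketch below shows one way to bridge that gap: clean the text, reuse the already fitted vectorizer with `transform()`, and attach a phishing probability. The helper `score_new_emails` is hypothetical, `str.lower` stands in for the real cleaning function, and the toy training data exists only to make the example runnable.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def score_new_emails(df_new, fitted_vectorizer, fitted_model, clean_fn=str.lower):
    """Return a copy of df_new with clean_text and phish_probability columns."""
    scored = df_new.copy()
    scored["clean_text"] = scored["text_combined"].map(clean_fn)
    # transform(), not fit_transform(): reuse the vocabulary learned in training
    features = fitted_vectorizer.transform(scored["clean_text"])
    # Column 1 is the probability of the positive class (assumed: 1 = phishing)
    scored["phish_probability"] = fitted_model.predict_proba(features)[:, 1]
    return scored


# Toy training data for illustration only; 1 = phishing.
toy_texts = ["verify your password now", "urgent wire transfer",
             "meeting notes for monday", "lunch menu attached"]
toy_labels = [1, 1, 0, 0]
toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

demo_new = pd.DataFrame({"text_combined": ["VERIFY your PASSWORD now",
                                           "Meeting notes for Monday"]})
scored_demo = score_new_emails(demo_new, toy_vectorizer, toy_model)
print(scored_demo[["clean_text", "phish_probability"]])
```

In the notebook itself, the call would look like `score_new_emails(df_new, vectorizer, model, clean_fn=preprocess_text)`, assuming `preprocess_text` is the same function that produced the training data.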

## 🔁 Retaliation Logic Prototype (Concept)

- If model prediction = phishing **AND** confidence > threshold:
  - 🚨 Alert security team
  - 🧪 Log IP/domain to threat database
  - 🔒 Move email to quarantine folder
  - ⚠️ Optional: Notify sender if internal

**Rule Engine Example:**
```python
if prediction == 'phishing' and confidence_score > 0.95:
    alert_security_team(email_id)
    quarantine_email(email_id)
    log_threat(sender_ip, domain)
```
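
The rule above calls helpers that are not defined anywhere in this notebook. A runnable sketch of the same logic follows, assuming stand-in implementations that only record each action in a list. The `ActionLog` class, the `apply_rules` function, and the sample IDs, IPs, and domains are all hypothetical; a real system would call mail-server and alerting integrations instead.

```python
from dataclasses import dataclass, field


@dataclass
class ActionLog:
    """Stand-in for real integrations: records actions instead of performing them."""
    actions: list = field(default_factory=list)

    def alert_security_team(self, email_id):
        self.actions.append(("alert", email_id))

    def quarantine_email(self, email_id):
        self.actions.append(("quarantine", email_id))

    def log_threat(self, sender_ip, domain):
        self.actions.append(("log_threat", sender_ip, domain))


def apply_rules(log, email_id, prediction, confidence_score,
                sender_ip, domain, threshold=0.95):
    """Trigger the response actions only for high-confidence phishing predictions."""
    if prediction == "phishing" and confidence_score > threshold:
        log.alert_security_team(email_id)
        log.quarantine_email(email_id)
        log.log_threat(sender_ip, domain)
        return True
    return False


demo_log = ActionLog()
fired = apply_rules(demo_log, "msg-001", "phishing", 0.98, "203.0.113.7", "bad.example")
skipped = apply_rules(demo_log, "msg-002", "phishing", 0.80, "198.51.100.4", "odd.example")
print(fired, skipped)
print(demo_log.actions)
```

Recording actions in a log object rather than calling live systems makes the rule easy to unit test, and it leaves an audit trail of why each email was quarantined.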

Long-term goal: Use reinforcement learning to optimize responses based on outcomes.