# AI-Powered Phishing Email Detection System

## Introduction
Phishing attacks are among the most prevalent cyber threats. They typically rely on deceptive emails to trick recipients into revealing sensitive information or clicking malicious links. Traditional rule-based detection systems struggle to adapt to the rapidly evolving language and structure of phishing emails. As a result, artificial intelligence (AI) methods, particularly machine learning, have become essential tools for building more flexible and accurate detection systems.

In this project, we develop an AI-powered phishing email detection system that classifies emails as phishing or legitimate using natural language features and metadata. Our focus is on building a lightweight, interpretable prototype using the XGBoost classifier, a gradient boosting algorithm known for its performance and efficiency.

We use a publicly available Kaggle dataset of labeled phishing and legitimate emails. The dataset is available here:

[🔗 Phishing Email Dataset on Kaggle](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset)

The project involves the following core components:
*   Data cleaning and feature extraction from email content.
*   Training and evaluation of an XGBoost classification model.
*   Applying explainability techniques (e.g., SHAP) to interpret model predictions.
*   Testing the model on realistic examples and documenting its strengths and limitations.

The goal is to create a simple, explainable, and effective prototype that could form the basis of a real-world email threat detection tool.

## Step 1: Environment setup

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import xgboost as xgb

## Step 2: Dataset Exploration

In [3]:
# Load the dataset: 'text_combined' holds the email text; 'label' is 1 for phishing, 0 for legitimate

df = pd.read_csv('../dataset/phishing_email.csv')

# Check the dataset
print(df.head())
print(f"Dataset shape: {df.shape}")
print(f"Class distribution:\n{df['label'].value_counts()}")

                                       text_combined  label
0  hpl nom may 25 2001 see attached file hplno 52...      0
1  nom actual vols 24 th forwarded sabrae zajac h...      0
2  enron actuals march 30 april 1 201 estimated a...      0
3  hpl nom may 30 2001 see attached file hplno 53...      0
4  hpl nom june 1 2001 see attached file hplno 60...      0
Dataset shape: (82486, 2)
Class distribution:
label
1    42891
0    39595
Name: count, dtype: int64
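Before splitting, it can be worth checking for missing and duplicate texts, since an email that appears on both sides of the split would inflate the test scores. The following is a minimal sketch on a hypothetical toy frame: the rows are invented for illustration, and only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the Kaggle CSV
toy_df = pd.DataFrame({
    "text_combined": [
        "verify your account now",
        "meeting notes attached",
        "verify your account now",   # exact duplicate of row 0
        None,                        # missing text
    ],
    "label": [1, 0, 1, 0],
})

# Count missing texts and repeated texts (the first occurrence is not counted)
n_missing = int(toy_df["text_combined"].isna().sum())
n_duplicates = int(toy_df.duplicated(subset="text_combined").sum())
print(f"Missing texts: {n_missing}, duplicate texts: {n_duplicates}")

# Drop missing rows, then duplicates, and rebuild a clean index
clean_df = (
    toy_df.dropna(subset=["text_combined"])
          .drop_duplicates(subset="text_combined")
          .reset_index(drop=True)
)
print(clean_df)
```

The same two checks could be run on `df` right after loading; whether any rows actually need to be dropped depends on the data.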


## Step 3: Split the data into training and testing sets

In [4]:
X = df['text_combined']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

Training set size: 65988
Testing set size: 16498
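The `stratify=y` argument keeps the class ratio (about 52% phishing in this dataset) nearly identical in both partitions. The sketch below illustrates this on synthetic labels and also shows one way a validation set could be carved out of the training portion, for example to monitor early stopping without touching the test set. The 10% validation share and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the dataset's class balance (52% phishing)
y_all = np.array([1] * 520 + [0] * 480)
X_all = np.arange(len(y_all)).reshape(-1, 1)  # placeholder features

# 80/20 stratified train/test split, mirroring the call above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Hypothetical 10% validation holdout taken from the training portion only
X_fit, X_val, y_fit, y_val = train_test_split(
    X_tr, y_tr, test_size=0.1, random_state=42, stratify=y_tr
)

# Compare the phishing share across all partitions
for name, labels in [("full", y_all), ("fit", y_fit), ("val", y_val), ("test", y_te)]:
    print(f"{name:5s} n={len(labels):4d} phishing share={labels.mean():.3f}")
```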


## Step 4: Create TF-IDF features

In [5]:
# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,  # Limit features to avoid dimensionality issues
    min_df=5,           # Ignore terms that appear in fewer than 5 documents
    max_df=0.7,         # Ignore terms that appear in more than 70% of documents
    stop_words='english',
    ngram_range=(1, 2)  # Use both unigrams and bigrams
)

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data using the fitted vectorizer
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"TF-IDF feature matrix shape: {X_train_tfidf.shape}")

TF-IDF feature matrix shape: (65988, 5000)


## Step 5: Train the XGBoost model

In [9]:
# Initialize the XGBoost classifier
# (use_label_encoder is omitted: recent XGBoost versions ignore it and emit a warning)
xgb_model = xgb.XGBClassifier(
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    objective='binary:logistic',
    random_state=42,
    early_stopping_rounds=10
)

# Train the model.
# Note: eval_set drives early stopping, and here it is the test set, so the
# test metrics in Step 6 can be optimistic. A validation split taken from
# the training data would avoid this.
xgb_model.fit(
    X_train_tfidf,
    y_train,
    eval_set=[(X_test_tfidf, y_test)]
)

print("Model training completed!")

[0]	validation_0-logloss:0.64268
[1]	validation_0-logloss:0.60204
[2]	validation_0-logloss:0.56781
[3]	validation_0-logloss:0.53883
[4]	validation_0-logloss:0.51243
[5]	validation_0-logloss:0.48954
[6]	validation_0-logloss:0.46951
[7]	validation_0-logloss:0.44971
[8]	validation_0-logloss:0.43335
[9]	validation_0-logloss:0.41757
[10]	validation_0-logloss:0.40399
[11]	validation_0-logloss:0.39093
[12]	validation_0-logloss:0.37736
[13]	validation_0-logloss:0.36701
[14]	validation_0-logloss:0.35537
[15]	validation_0-logloss:0.34548
[16]	validation_0-logloss:0.33601
[17]	validation_0-logloss:0.32847
[18]	validation_0-logloss:0.32165
[19]	validation_0-logloss:0.31340
[20]	validation_0-logloss:0.30582
[21]	validation_0-logloss:0.29925
[22]	validation_0-logloss:0.29289
[23]	validation_0-logloss:0.28735
[24]	validation_0-logloss:0.28266
[25]	validation_0-logloss:0.27542
[26]	validation_0-logloss:0.27094
[27]	validation_0-logloss:0.26418
[28]	validation_0-logloss:0.25935
... [remaining output truncated]
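The `validation_0-logloss` column is the binary log loss on the evaluation set. As a reference point, a classifier that always predicts a probability of 0.5 scores ln 2 ≈ 0.693, which is why the first round (0.64268) starts just below that value and the loss falls as trees are added. The function below is a plain-Python sketch of the metric with invented inputs, not XGBoost's internal implementation.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]

# Always predicting 0.5 gives exactly ln 2
uninformed = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])

# Confident and correct predictions give a much lower loss
confident = binary_log_loss(labels, [0.9, 0.1, 0.8, 0.2])

print(f"uninformed={uninformed:.5f}  confident={confident:.5f}")
```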

## Step 6: Evaluate the model

In [10]:
# Make predictions
y_pred = xgb_model.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Print confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9599

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96      7919
           1       0.94      0.98      0.96      8579

    accuracy                           0.96     16498
   macro avg       0.96      0.96      0.96     16498
weighted avg       0.96      0.96      0.96     16498


Confusion Matrix:
[[7413  506]
 [ 155 8424]]
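The figures in the classification report can be recomputed directly from the confusion matrix, which also exposes the two error rates that matter most in practice: legitimate emails flagged as phishing and phishing emails that slip through. The worked arithmetic below uses the four counts printed above; the variable names are chosen for illustration.

```python
# Counts copied from the confusion matrix (rows = actual class, columns = predicted class)
tn, fp = 7413, 506   # actual legitimate (label 0)
fn, tp = 155, 8424   # actual phishing (label 1)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_phish = tp / (tp + fp)        # share of flagged emails that are really phishing
recall_phish = tp / (tp + fn)           # share of phishing emails that are caught
false_positive_rate = fp / (fp + tn)    # legitimate emails wrongly flagged
false_negative_rate = fn / (fn + tp)    # phishing emails missed

print(f"Accuracy:            {accuracy:.4f}")
print(f"Precision (phish):   {precision_phish:.4f}")
print(f"Recall (phish):      {recall_phish:.4f}")
print(f"False positive rate: {false_positive_rate:.4f}")
print(f"False negative rate: {false_negative_rate:.4f}")
```

On these numbers the model misses about 1.8% of phishing emails and flags about 6.4% of legitimate ones. The second figure may be the more important one to reduce if the prototype were used on a real inbox.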


## Step 7: Check feature importance (optional)


In [11]:
# Get feature importance
feature_importance = xgb_model.feature_importances_

# Create a DataFrame to better visualize feature importance
feature_names = tfidf_vectorizer.get_feature_names_out()
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
})

# Sort by importance
importance_df = importance_df.sort_values('Importance', ascending=False)

# Display the top 20 most important features
print("\nTop 20 most important features:")
print(importance_df.head(20))


Top 20 most important features:
                                              Feature  Importance
4959                                            wrote    0.035769
1571                                            enron    0.026712
820                                                cc    0.026674
242   _______________________________________________    0.017291
1222                                             date    0.016642
2619                                              lar    0.016331
1798                                             file    0.016046
1000                                          company    0.013923
560                                          aug 2008    0.013609
2703                                             list    0.013142
4692                                       university    0.012088
4847                                          watches    0.011482
3447                                         pleasure    0.011041
... [remaining output truncated]

## Step 8: Save the model and vectorizer for future use


In [None]:
import pickle

# Save the model (the context manager ensures the file is closed)
with open("phishing_xgboost_model.pkl", "wb") as f:
    pickle.dump(xgb_model, f)

# Save the TF-IDF vectorizer
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer, f)

print("Model and vectorizer saved successfully!")
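Classifying a new email later requires both artifacts: the text must pass through the same fitted vectorizer before it reaches the model. The sketch below shows the save and load round trip for a vectorizer only, using a temporary directory and a hypothetical toy vectorizer as a stand-in for the one fitted in Step 4. The same pattern would apply to the model file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the fitted vectorizer from Step 4
toy_vectorizer = TfidfVectorizer().fit(
    ["verify your account", "meeting notes attached"]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_vectorizer, f)
    # Only unpickle files you created or trust: pickle can execute arbitrary code
    with open(path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

# The reloaded vectorizer should produce the same features for unseen text
new_email = ["please verify your account"]
original_vec = toy_vectorizer.transform(new_email)
reloaded_vec = loaded_vectorizer.transform(new_email)
same_output = bool(np.allclose(original_vec.toarray(), reloaded_vec.toarray()))
print(f"Round trip preserved the transform: {same_output}")
```

Pickled objects are generally tied to the library versions that created them, so it may help to record the scikit-learn and XGBoost versions alongside the files.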

## Step 9: Fine-tune the model (optional)


In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Initialize grid search
# (use_label_encoder is omitted: recent XGBoost versions ignore it and emit a warning)
grid_search = GridSearchCV(
    estimator=xgb.XGBClassifier(
        objective='binary:logistic',
        eval_metric='logloss'
    ),
    param_grid=param_grid,
    scoring='f1',
    cv=3,
    n_jobs=-1,
    verbose=2
)

# Fit grid search (this may take time)
grid_search.fit(X_train_tfidf, y_train)

# Best parameters
print(f"Best parameters: {grid_search.best_params_}")

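# Train final model with best parameters.
# Note: grid_search.best_estimator_ already holds a model refit on the full
# training set (refit=True is the default); it is retrained here for clarity.
best_xgb_model = xgb.XGBClassifier(
    **grid_search.best_params_,
    objective='binary:logistic',
    eval_metric='logloss'
)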

best_xgb_model.fit(X_train_tfidf, y_train)

# Evaluate the tuned model
y_pred_tuned = best_xgb_model.predict(X_test_tfidf)
print("\nTuned Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_tuned):.4f}")
print(classification_report(y_test, y_pred_tuned))
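Before launching the search, it helps to estimate its cost. The sketch below counts the candidate combinations in the grid above with scikit-learn's `ParameterGrid` and multiplies by the number of folds. The variable names are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 2 * 2 combinations
cv_folds = 3
n_fits = n_candidates * cv_folds               # one model per candidate per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With 108 candidates and 3 folds, the search trains 324 models on roughly 44,000 emails each, plus one final refit. If that is too slow, a smaller grid or `RandomizedSearchCV` may be a more practical starting point.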