Problem Statement: Email Phishing Detection

Goal: Predict whether an email is phishing (fraudulent) or legitimate based on its content and metadata.

Dataset features might include:
<ol>
<li>email_length → number of characters in the email</li>
<li>num_links → number of hyperlinks in the email</li>
<li>num_special_chars → number of suspicious characters (e.g., $, %, @)</li>
<li>contains_login_request → whether the email asks for login details (Yes/No)</li>
<li>sender_domain → domain of the sender (categorical)</li>
</ol>

Target: is_phishing → 1 for phishing, 0 for legitimate

<h2>Sample Prediction Input</h2>

After fitting your Naive Bayes model, predict the label for the following email:

| Subject                            | Contains\_Link | Contains\_Attachment | Urgent\_Words | From\_Trusted\_Domain |
| ---------------------------------- | -------------- | -------------------- | ------------- | --------------------- |
| "Update your password to continue" | 1              | 0                    | 1             | 0                     |


In [None]:
# Import Data
import pandas as pd
import string

df = pd.read_csv("/content/sample_data/email_phishing.csv")
df.head()

Unnamed: 0,Email_ID,Subject,Contains_Link,Contains_Attachment,Urgent_Words,From_Trusted_Domain,Label
0,1,Urgent: Verify your bank account,Yes,No,Yes,No,Phishing
1,2,Meeting schedule for next week,No,Yes,No,Yes,Legit
2,3,Claim your lottery prize now,Yes,No,Yes,No,Phishing
3,4,Invoice attached for your recent purchase,No,Yes,No,Yes,Legit
4,5,Security alert: Unusual login detected,Yes,No,Yes,No,Phishing


In [None]:
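# Encode the label column as integers.
# LabelEncoder sorts classes alphabetically, so Legit -> 0 and Phishing -> 1.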
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Label"] = le.fit_transform(df["Label"])
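
The encoding step above relies on `LabelEncoder` assigning integer codes in sorted class order. The following is a minimal sketch on a tiny hypothetical label list (the names `le_demo` and `mapping` are illustrative, not part of the notebook) that makes the mapping explicit before it is relied on for `precision_score` and `recall_score`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, mirroring the two classes in the dataset.
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["Phishing", "Legit", "Phishing"])

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)           # {'Legit': 0, 'Phishing': 1}
print(encoded.tolist())  # [1, 0, 1]
```

Because `Phishing` maps to 1, it is the positive class for the default binary precision and recall metrics.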

In [None]:
# Preprocess text
def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

df["Subject"] = df["Subject"].apply(clean_text)

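# Normalize the Yes/No flag columns to 0/1.
# Note: any value not listed in the mapping becomes NaN and is then
# filled with 0, i.e. it is silently treated as "no".
cols = ["Contains_Link", "Contains_Attachment", "Urgent_Words", "From_Trusted_Domain"]
for col in cols:
    df[col] = df[col].astype(str).str.strip().str.lower()
    df[col] = df[col].map({
        "yes": 1, "no": 0,
        "true": 1, "false": 0,
        "1": 1, "0": 0
    }).fillna(0).astype(int)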

df.head()

Unnamed: 0,Email_ID,Subject,Contains_Link,Contains_Attachment,Urgent_Words,From_Trusted_Domain,Label
0,1,urgent verify your bank account,1,0,1,0,1
1,2,meeting schedule for next week,0,1,0,1,0
2,3,claim your lottery prize now,1,0,1,0,1
3,4,invoice attached for your recent purchase,0,1,0,1,0
4,5,security alert unusual login detected,1,0,1,0,1
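
The flag-encoding loop above falls back to 0 for any value it does not recognize. As one possible alternative, the sketch below raises an error instead, so unexpected values are noticed rather than counted as "no". The helper name `encode_flag` and the demo series are hypothetical and not part of the notebook:

```python
import pandas as pd

def encode_flag(series: pd.Series) -> pd.Series:
    """Map yes/no style values to 1/0 and fail loudly on anything else."""
    mapping = {"yes": 1, "no": 0, "true": 1, "false": 0, "1": 1, "0": 0}
    normalized = series.astype(str).str.strip().str.lower()
    encoded = normalized.map(mapping)
    # Values missing from the mapping come back as NaN; collect them for the error.
    unknown = sorted(normalized[encoded.isna()].unique())
    if unknown:
        raise ValueError(f"Unmapped flag values: {unknown}")
    return encoded.astype(int)

demo = pd.Series(["Yes", " no ", "TRUE", "0"])
print(encode_flag(demo).tolist())  # [1, 0, 1, 0]
```

Whether to fail or to fall back to 0 depends on how much the input file is trusted; the notebook's choice is reasonable for a clean sample file.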


In [None]:
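# TF-IDF vectorization: convert the subject text to numerical features.
# CountVectorizer is imported here for the pipeline further down.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

v = TfidfVectorizer(stop_words="english", max_features=100)
subject_features = v.fit_transform(df["Subject"])  # sparse matrix, one row per email
# Note: subject_features is not used by the flag-based model below.

# print(subject_features.toarray())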
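
The TF-IDF matrix computed above is never fed to a model. If the goal were to train on the subject text and the binary flags together, one option is to stack the two feature blocks column-wise. The following is only a sketch under that assumption, using a two-row hypothetical frame (`toy`) rather than the real dataset:

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-row frame with the same kind of columns as the notebook.
toy = pd.DataFrame({
    "Subject": ["urgent verify your bank account", "meeting schedule for next week"],
    "Contains_Link": [1, 0],
    "Urgent_Words": [1, 0],
})

tfidf = TfidfVectorizer(stop_words="english")
text_matrix = tfidf.fit_transform(toy["Subject"])            # sparse TF-IDF block
flag_matrix = csr_matrix(toy[["Contains_Link", "Urgent_Words"]].values)

# Column-wise stack: text features first, then the two flag columns.
combined = hstack([text_matrix, flag_matrix]).tocsr()
print(combined.shape)
```

In a real workflow the vectorizer would normally be fitted on the training split only, so that vocabulary and IDF weights from test emails do not leak into training.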

In [None]:
# Train Test Split
from sklearn.model_selection import train_test_split
X = df[["Contains_Link", "Contains_Attachment", "Urgent_Words", "From_Trusted_Domain"]]
y = df["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Fit Multinomial Naive Bayes on the four 0/1 flag features
model = MultinomialNB()
model.fit(X_train, y_train)
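
To show what the fitted model computes, the sketch below rebuilds the `MultinomialNB` posterior by hand from the fitted attributes `class_log_prior_` and `feature_log_prob_` and compares it with `predict_proba`. The training rows are hypothetical toy data in the same column order as the notebook (link, attachment, urgent words, trusted domain), not the real dataset:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: two phishing-like rows (label 1), two legit-like rows (label 0).
X_toy = np.array([[1, 0, 1, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 1, 0, 1]])
y_toy = np.array([1, 1, 0, 0])

nb = MultinomialNB().fit(X_toy, y_toy)

# Same flag values as the sample row in the problem statement.
x_new = np.array([[1, 0, 1, 0]])

# Joint log-likelihood per class: log prior + sum of x_j * log P(feature j | class).
joint = nb.class_log_prior_ + x_new @ nb.feature_log_prob_.T

# Normalize to probabilities (shift by the max for numerical stability).
joint = joint - joint.max(axis=1, keepdims=True)
manual = np.exp(joint)
manual /= manual.sum(axis=1, keepdims=True)

print(manual)                    # approximately [[0.1 0.9]]
print(nb.predict_proba(x_new))   # same values
```

With the default Laplace smoothing (`alpha=1.0`), the phishing class gives the two active features probability 3/8 each and the legit class 1/8 each, which yields the 0.9 versus 0.1 split.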

In [None]:
y_pred = model.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

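# Predict a new email.
# Note: the model was trained on 0/1 flags, and these values differ from the
# sample row in the problem statement (1, 0, 1, 0). MultinomialNB treats
# values above 1 as counts, which weights those features more heavily.
new_email = pd.DataFrame([{
    "Contains_Link": 2,
    "Contains_Attachment": 1,
    "Urgent_Words": 5,
    "From_Trusted_Domain": 0
}])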

print("\nPrediction for new email:", "Phishing" if model.predict(new_email)[0] == 1 else "Legit")

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
Confusion Matrix:
 [[2 0]
 [0 3]]

Prediction for new email: Phishing
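
The perfect scores above come from a test set of only five emails, so they say little about how stable the model is. One common way to get a less noisy estimate is stratified k-fold cross-validation. The sketch below uses hypothetical toy data (twelve rows that repeat two patterns) purely to show the call pattern; it is not a result for the real dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: alternating phishing-like and legit-like rows.
X_toy = np.array([[1, 0, 1, 0], [0, 1, 0, 1]] * 6)
y_toy = np.array([1, 0] * 6)

# Three stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(MultinomialNB(), X_toy, y_toy, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

On the real data, `X` and `y` from the train/test split cell would take the place of `X_toy` and `y_toy`.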


In [None]:
# Text-only model: bag-of-words counts on the subject line, then Naive Bayes
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),   # Step 1: convert text → token counts
    ('nb', MultinomialNB())              # Step 2: train Naive Bayes
])

X_train, X_test, y_train, y_test = train_test_split(df["Subject"], df["Label"], test_size=0.2, random_state=42)

clf.fit(X_train, y_train)

In [None]:
emails = ['Update your password to continue']
print("Prediction:", le.inverse_transform(clf.predict(emails)))

Prediction: ['Phishing']
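
A hard label hides how confident the text model is. The sketch below fits the same kind of pipeline on four cleaned subject lines taken from the sample rows shown earlier and prints class probabilities for the new subject. The names `text_clf`, `subjects`, and `labels` are illustrative, and the four-row training set is far too small for the probabilities to mean much:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Four cleaned subjects from the sample rows above, with string labels.
subjects = [
    "urgent verify your bank account",
    "claim your lottery prize now",
    "meeting schedule for next week",
    "invoice attached for your recent purchase",
]
labels = ["Phishing", "Phishing", "Legit", "Legit"]

text_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
text_clf.fit(subjects, labels)

# predict_proba returns one column per class, ordered as in classes_.
classes = text_clf.named_steps["nb"].classes_
probs = text_clf.predict_proba(["Update your password to continue"])[0]
for label, p in zip(classes, probs):
    print(f"{label}: {p:.3f}")
```

`CountVectorizer` ignores words that were not seen during fitting, so a subject made up mostly of new words is classified from the few tokens it shares with the training vocabulary plus the class priors.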
