# Phishing Email Detection using NLP

This Jupyter Notebook demonstrates the process of detecting phishing emails using Natural Language Processing (NLP) techniques in Python.

## Introduction

Phishing emails are fraudulent attempts to obtain sensitive information by posing as trustworthy entities. Detecting them automatically strengthens cybersecurity defenses, and NLP provides the tools to analyze and classify email content.

## Setup

Import necessary libraries for the project.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
import random

## Data Collection

In a real-world scenario, collecting a large and diverse set of emails (both phishing and legitimate) is crucial. For demonstration, I use a function to generate mock data.

In [3]:
def generate_mock_emails(num_emails=100):
    phishing_templates = [
        "Urgent: Your account will be locked unless you update your information now.",
        "Congratulations! You've won a free gift card. Click here to claim it.",
        "Security Notice: Unusual activity detected. Confirm your identity.",
        "You are eligible for a tax refund. Submit your application immediately.",
        "Warning: Your subscription is about to expire. Renew now to avoid disruption."
    ]
    
    legitimate_templates = [
        "Don't forget our meeting scheduled for tomorrow at 3 PM.",
        "Please review the attached report and provide your feedback.",
        "Invitation to the company's annual dinner next weekend.",
        "Reminder: Your project deadline is approaching next Friday.",
        "Weekly newsletter: Updates and insights from the industry."
    ]
    
    # 0 for phishing, 1 for legitimate
    data = {'email': [], 'label': []} 

    for _ in range(num_emails):
        if random.random() < 0.5:
            data['email'].append(random.choice(phishing_templates))
            data['label'].append(0)
        else:
            data['email'].append(random.choice(legitimate_templates))
            data['label'].append(1)
    return pd.DataFrame(data)
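
Because the generator only draws from a fixed set of templates, a large sample still contains very few distinct strings. The sketch below illustrates this with a seeded random generator so the sample is reproducible. The helper `sample_templates` and the shortened template lists are assumptions made for this illustration, not part of the notebook's function.

```python
import random

# Shortened stand-ins for the notebook's template lists (assumption: two per class)
phishing_templates = [
    "Urgent: Your account will be locked unless you update your information now.",
    "Congratulations! You've won a free gift card. Click here to claim it.",
]
legitimate_templates = [
    "Don't forget our meeting scheduled for tomorrow at 3 PM.",
    "Please review the attached report and provide your feedback.",
]

def sample_templates(num_emails, seed=0):
    """Hypothetical helper: draw emails the way generate_mock_emails does, with a seeded RNG."""
    rng = random.Random(seed)  # a private generator, so the global random state is untouched
    emails, labels = [], []
    for _ in range(num_emails):
        if rng.random() < 0.5:
            emails.append(rng.choice(phishing_templates))
            labels.append(0)  # 0 = phishing
        else:
            emails.append(rng.choice(legitimate_templates))
            labels.append(1)  # 1 = legitimate
    return emails, labels

emails, labels = sample_templates(1000, seed=0)
unique_emails = set(emails)
print(len(emails), "emails drawn,", len(unique_emails), "distinct strings")
```

The same seed always yields the same sample, which makes later results comparable between runs.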

## Data Preparation and Splitting

This step prepares the dataset used to train and test my phishing email detection model.
I break the process into two parts: generating a mock dataset and splitting it into training and testing sets.

In [4]:
df = generate_mock_emails(1000)

X_train, X_test, y_train, y_test = train_test_split(df['email'], df['label'], test_size=0.3, random_state=42)
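
A plain random split can leave the two classes slightly unbalanced between the training and testing sets. One option is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses a small toy frame (`toy_df` is an assumption for illustration) rather than the notebook's `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df (assumption: 0 = phishing, 1 = legitimate)
toy_df = pd.DataFrame({
    "email": [f"message {i}" for i in range(20)],
    "label": [0] * 10 + [1] * 10,
})

# stratify= keeps the 50/50 class ratio in both the training and testing parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["email"],
    toy_df["label"],
    test_size=0.3,
    random_state=42,
    stratify=toy_df["label"],
)

print("train:", y_tr.value_counts().to_dict())
print("test: ", y_te.value_counts().to_dict())
```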

## Building the Machine Learning Pipeline

Here I create a pipeline that combines text vectorization with a machine learning classifier. This pipeline serves as the core of the model for detecting phishing emails.

In [5]:
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
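
To see what the two stages do, it can help to inspect them separately. The sketch below lists the step names that `make_pipeline` assigns automatically and shows the shape of the TF-IDF matrix for two short example sentences. The names `demo_pipeline` and `docs` are assumptions for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# make_pipeline names each step after its lowercased class name
demo_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
step_names = list(demo_pipeline.named_steps)
print(step_names)

# The vectorizer alone: one row per document, one column per vocabulary term
docs = ["Confirm your identity now", "Please review the attached report"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)
vocabulary = sorted(vectorizer.vocabulary_)
print(matrix.shape, vocabulary)
```

The classifier never sees raw text, only these weighted term columns.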

## Model Training

Here I train the machine learning model using the pipeline I have created. Training fits the pipeline to the training data: the vectorizer learns the vocabulary and TF-IDF weights, and the classifier learns per-class term probabilities.

In [6]:
pipeline.fit(X_train, y_train)
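
One way to check what the classifier has learned is to compare the per-class log probabilities stored in the `feature_log_prob_` attribute of `MultinomialNB`. The sketch below trains on four of the mock templates and lists the terms that lean most strongly toward the phishing class. The tiny training set and the names `demo_pipeline` and `top_phishing` are assumptions for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny training set (assumption: 0 = phishing, 1 = legitimate, as in the notebook)
phishing = [
    "Urgent: Your account will be locked unless you update your information now.",
    "Warning: Your subscription is about to expire. Renew now to avoid disruption.",
]
legitimate = [
    "Reminder: Your project deadline is approaching next Friday.",
    "Weekly newsletter: Updates and insights from the industry.",
]

demo_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
demo_pipeline.fit(phishing + legitimate, [0, 0, 1, 1])

vectorizer = demo_pipeline.named_steps["tfidfvectorizer"]
classifier = demo_pipeline.named_steps["multinomialnb"]

# Vocabulary terms ordered by their column index in the TF-IDF matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Positive values mean the term is more probable under class 0 (phishing)
log_ratio = classifier.feature_log_prob_[0] - classifier.feature_log_prob_[1]
top_phishing = [vocab[i] for i in np.argsort(log_ratio)[::-1][:3]]
print("Terms leaning most toward phishing:", top_phishing)
```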

## Predictions and Model Evaluation

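After training the model, the next step is to make predictions on the test dataset and evaluate its performance. Normally this shows how well the model generalizes to new, unseen data. With this mock dataset, however, the test emails are copies of the same few templates seen during training, so the perfect scores below reflect memorization rather than real generalization.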

In [7]:
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
accuracy = accuracy_score(y_test, y_pred)
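# Label 0 marks phishing, so treat it as the positive class (the default is pos_label=1)
precision = precision_score(y_test, y_pred, pos_label=0)
recall = recall_score(y_test, y_pred, pos_label=0)
f1 = f1_score(y_test, y_pred, pos_label=0)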

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       154
           1       1.00      1.00      1.00       146

    accuracy                           1.00       300
   macro avg       1.00      1.00      1.00       300
weighted avg       1.00      1.00      1.00       300

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
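
Perfect scores on every metric are usually a sign that the test set overlaps with the training set. A quick check is to measure how many test emails also appear verbatim in the training data. The sketch below does this on a toy dataset of three repeated templates. `toy_df` and `overlap_fraction` are assumptions for this illustration, standing in for the notebook's `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: three templates repeated many times (assumption: mimics the mock generator)
templates = [
    ("Security Notice: Unusual activity detected. Confirm your identity.", 0),
    ("You are eligible for a tax refund. Submit your application immediately.", 0),
    ("Invitation to the company's annual dinner next weekend.", 1),
]
toy_df = pd.DataFrame(templates * 20, columns=["email", "label"])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["email"], toy_df["label"], test_size=0.3, random_state=42
)

# Fraction of test emails whose exact text also occurs in the training set
overlap_fraction = X_te.isin(X_tr.unique()).mean()
distinct_emails = toy_df["email"].nunique()

print(f"Distinct emails in the dataset: {distinct_emails}")
print(f"Test emails also present in training: {overlap_fraction:.0%}")
```

When the overlap is this high, the metrics measure recall of memorized strings. Removing duplicates before splitting, or evaluating on emails from a different source, gives a more honest estimate.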


## Applying the Model to New Examples

Once the model is trained and evaluated, I can use it to predict the category of new, unseen emails. This step demonstrates how my phishing detection model can be applied in a real-world scenario.


In [8]:
test_emails = [
    "Congratulations, you have been selected for a prize!",
    "Please confirm your attendance for tomorrow's meeting."
]

predicted_labels = pipeline.predict(test_emails)
for email, label in zip(test_emails, predicted_labels):
    print(f'Email: {email}\nPredicted Label: {"Phishing" if label == 0 else "Legitimate"}\n')


Email: Congratulations, you have been selected for a prize!
Predicted Label: Phishing

Email: Please confirm your attendance for tomorrow's meeting.
Predicted Label: Legitimate
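
A hard label hides how confident the model is. Since `MultinomialNB` supports `predict_proba`, the pipeline can also report a probability per class. The sketch below trains a small stand-in pipeline on four of the mock templates and prints the class probabilities for one new email. The training set and the name `demo_pipeline` are assumptions for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny training set (assumption: 0 = phishing, 1 = legitimate, as in the notebook)
train_emails = [
    "Congratulations! You've won a free gift card. Click here to claim it.",
    "Security Notice: Unusual activity detected. Confirm your identity.",
    "Invitation to the company's annual dinner next weekend.",
    "Please review the attached report and provide your feedback.",
]
train_labels = [0, 0, 1, 1]

demo_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
demo_pipeline.fit(train_emails, train_labels)

new_email = ["Congratulations, you have been selected for a prize!"]

# predict_proba returns one row per email, with columns ordered like classes_
probabilities = demo_pipeline.predict_proba(new_email)[0]
for cls, prob in zip(demo_pipeline.classes_, probabilities):
    name = "Phishing" if cls == 0 else "Legitimate"
    print(f"{name}: {prob:.2f}")
```

Probabilities close to 0.5 indicate that the email shares few terms with the training data, which is a useful signal for routing borderline messages to manual review.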



## Advanced Techniques

More advanced techniques, such as word embeddings (e.g., Word2Vec) or deep learning models (e.g., LSTMs or Transformers), can also be used. These methods can capture semantic relationships between words but require more data and computational resources.


## Conclusion

This notebook provided a basic framework for detecting phishing emails using NLP in Python. The mock dataset used for demonstration is simplistic; in a real-world application, the model would be trained on a much larger and more diverse dataset. The effectiveness of phishing detection relies heavily on the quality of the dataset and the sophistication of the NLP and machine learning techniques used.
