## Detecting Phishing Emails Using Support Vector Machine (SVM) and Random Forest Classifier (RFC)

### PROBLEM STATEMENT

Phishing emails pose a major threat to individuals and organizations globally, as they aim to deceive recipients into revealing sensitive information or taking harmful actions. It is essential to detect and prevent phishing emails to protect personal and financial security. 
Recently, machine learning techniques have shown promise in addressing this increasing threat. The dataset includes the text body of emails and their classifications, enabling the detection of phishing emails through detailed analysis and machine learning-based classification.

### OBJECTIVE
To develop and implement machine learning models for detecting phishing emails by analyzing and classifying the email text body. This approach aims to enhance the ability to identify and prevent phishing attempts, thereby improving the security of individuals and organizations against such threats. The dataset, which includes various types of emails, will be utilized to train and validate the models for accurate phishing detection.

### Importing libraries and loading data

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings("ignore")

In [16]:
df = pd.read_csv("Phishing_Email.csv")

### Sanity Checks

In [17]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Email Text | Email Type |
|---|---|---|---|
| 0 | 0 | re : 6 . 1100 , disc : uniformitarianism , re ... | Safe Email |
| 1 | 1 | the other side of * galicismos * * galicismo *... | Safe Email |
| 2 | 2 | re : equistar deal tickets are you still avail... | Safe Email |
| 3 | 3 | \nHello I am your hot lil horny toy.\n I am... | Phishing Email |
| 4 | 4 | software at incredibly low prices ( 86 % lower... | Phishing Email |


In [18]:
# Check for NAN/ missing values
df.isna().sum()

Unnamed: 0     0
Email Text    16
Email Type     0
dtype: int64

In [19]:
#Drop the NaN/Missing values
df = df.dropna()
print(df.isna().sum())

Unnamed: 0    0
Email Text    0
Email Type    0
dtype: int64


In [20]:
# Check the shape of the DataFrame after dropping missing rows
df.shape

(18634, 3)

In [21]:
# Count the occurrences of every E-mail type
email_type_counts = df['Email Type'].value_counts()
print(email_type_counts)

Email Type
Safe Email        11322
Phishing Email     7312
Name: count, dtype: int64


## Handling Imbalance
To address the class imbalance issue, we will consider the following strategies:

### 1. The Resampling Techniques:

a. Oversampling: Increase the number of instances in the minority class(es) by generating synthetic samples (e.g., using SMOTE) to balance the class distribution.

b. Undersampling: Reduce the number of instances in the majority class to match the minority class, effectively balancing the dataset.
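
The two resampling strategies above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the `toy` DataFrame and the fixed `random_state` are hypothetical, and the oversampling shown is plain random duplication with replacement, a simpler stand-in for SMOTE rather than SMOTE itself.

```python
import pandas as pd

# Hypothetical toy data: 7 safe emails and 3 phishing emails.
toy = pd.DataFrame({
    "Email Text": [f"message {i}" for i in range(10)],
    "Email Type": ["Safe Email"] * 7 + ["Phishing Email"] * 3,
})

counts = toy["Email Type"].value_counts()
n_min, n_max = int(counts.min()), int(counts.max())

# Undersampling: keep n_min rows from every class.
undersampled = toy.groupby("Email Type").sample(n=n_min, random_state=0)

# Random oversampling: draw n_max rows per class with replacement,
# so minority rows are duplicated (no synthetic samples are created).
oversampled = pd.concat(
    [grp.sample(n=n_max, replace=True, random_state=0)
     for _, grp in toy.groupby("Email Type")],
    ignore_index=True,
)

print(undersampled["Email Type"].value_counts())
print(oversampled["Email Type"].value_counts())
```

Fixing `random_state` makes the sampled subset reproducible across runs, which an unseeded `sample` call does not guarantee.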

In [22]:
# Using the undersampling technique
Safe_Email = df[df["Email Type"]== "Safe Email"]
Phishing_Email = df[df["Email Type"]== "Phishing Email"]
Safe_Email = Safe_Email.sample(Phishing_Email.shape[0])

In [23]:
# checking the shape
Safe_Email.shape,Phishing_Email.shape

((7312, 3), (7312, 3))

## Create new datasets

In [24]:
# create a new dataset with the balanced E-mail types
Data= pd.concat([Safe_Email, Phishing_Email], ignore_index = True)
Data.head()

| Unnamed: 0.1 | Unnamed: 0 | Email Text | Email Type |
|---|---|---|---|
| 0 | 8726 | work in progress update dale started preparing... | Safe Email |
| 1 | 17195 | logical aspects of computational linguistics (... | Safe Email |
| 2 | 2324 | california update 3 / 16 / 01 ? a source withi... | Safe Email |
| 3 | 1635 | Because of this:\nhttp://hrw.org/press/2002/08... | Safe Email |
| 4 | 2373 | --- Matthias Saou wrote: > Once\nupon a time... | Safe Email |


In [25]:
# Split the data into the feature array X (email text) and the dependent variable y (email type)
X = Data["Email Text"].values
y = Data["Email Type"].values

In [26]:
# Split the data into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
X_train,x_test,y_train,y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
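
The split above does not stratify by label. A minimal sketch of a stratified variant follows, assuming a hypothetical toy dataset in place of `X` and `y`; passing `stratify` asks `train_test_split` to keep the class proportions the same in both partitions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y.
texts = [f"email {i}" for i in range(20)]
labels = ["Safe Email"] * 10 + ["Phishing Email"] * 10

# stratify=labels preserves the 50/50 class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

On a dataset that is already balanced, as here, the effect is small, but it removes one source of run-to-run variation in the reported metrics.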

## Building a Random Forest Classifier Model

In [27]:
# defining the Classifier
classifier = Pipeline([("tfidf", TfidfVectorizer()), ("classifier", RandomForestClassifier(n_estimators=10))])  # add other hyperparameters as preferred

## Train model

In [28]:
# Train the model
classifier.fit(X_train,y_train)

### Prediction

In [29]:
# Prediction
y_pred = classifier.predict(x_test)

### Check for Accuracy

In [30]:
# Importing classification_report,accuracy_score,confusion_matrix
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

In [31]:
#accuracy_score
accuracy_score(y_test,y_pred)

0.9241112123974475

In [32]:
#confusion_matrix
confusion_matrix(y_test,y_pred)

array([[2106,   92],
       [ 241, 1949]], dtype=int64)

In [33]:
# classification_report (printed so that the line breaks render)
print(classification_report(y_test, y_pred))

                precision    recall  f1-score   support

Phishing Email       0.90      0.96      0.93      2198
    Safe Email       0.95      0.89      0.92      2190

      accuracy                           0.92      4388
     macro avg       0.93      0.92      0.92      4388
  weighted avg       0.93      0.92      0.92      4388
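
To make the link between the confusion matrix and the report explicit, the per-class figures can be recomputed by hand. The sketch below uses only the four counts shown above and assumes scikit-learn's usual convention: rows are true classes, columns are predicted classes, and labels are in sorted order (`Phishing Email`, then `Safe Email`). The helper name `per_class_metrics` is made up for this illustration.

```python
# Confusion matrix from the Random Forest run above.
# Row = true class, column = predicted class,
# label order = ["Phishing Email", "Safe Email"].
cm = [[2106, 92],
      [241, 1949]]

def per_class_metrics(cm, idx):
    """Return (precision, recall, f1) for the class at position idx."""
    tp = cm[idx][idx]
    fn = sum(cm[idx]) - tp                   # true idx, predicted as another class
    fp = sum(row[idx] for row in cm) - tp    # other class, predicted as idx
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
phishing = per_class_metrics(cm, 0)
safe = per_class_metrics(cm, 1)

print(f"accuracy: {accuracy:.4f}")
print("phishing (precision, recall, f1):", [round(v, 2) for v in phishing])
print("safe     (precision, recall, f1):", [round(v, 2) for v in safe])
```

Read this way, 92 phishing emails were classified as safe and 241 safe emails were flagged as phishing.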

## Building Support Vector Machine (SVM)

In [34]:
# Importing SVM
from sklearn.svm import SVC

#Create the Pipeline
SVM = Pipeline([("tfidf", TfidfVectorizer()),("SVM", SVC(C = 100, gamma = "auto"))])

In [35]:
# Train the SVM model
SVM.fit(X_train,y_train)

In [37]:
# Predictions from the SVM model
s_ypred = SVM.predict(x_test)

In [38]:
# check the SVM model accuracy
accuracy_score(y_test, s_ypred)

0.4990884229717411
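
An accuracy this close to 0.5 on a balanced test set suggests the model is not separating the classes at all. One plausible explanation, offered here as a hypothesis rather than a verified diagnosis, is the `gamma = "auto"` setting: scikit-learn then uses gamma = 1 / n_features, and a TF-IDF vocabulary can easily have tens of thousands of features. Since `TfidfVectorizer` L2-normalises each row by default, the squared distance between two emails lies in [0, 2], so the RBF kernel value exp(-gamma * d^2) barely moves. The sketch below illustrates this with an assumed vocabulary size of 100,000 (the real size is not reported in this notebook).

```python
import math

def rbf_kernel_value(squared_distance, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance)

# Assumption for illustration only: vocabulary size of the TF-IDF matrix.
n_features = 100_000
gamma_auto = 1.0 / n_features  # what gamma="auto" resolves to

# For unit-norm, non-negative TF-IDF rows, the squared distance ranges
# from 0 (identical emails) to 2 (no shared terms).
k_identical = rbf_kernel_value(0.0, gamma_auto)
k_disjoint = rbf_kernel_value(2.0, gamma_auto)

print(k_identical, k_disjoint, k_identical - k_disjoint)
```

With kernel values that are nearly identical for every pair of emails, the classifier has almost nothing to discriminate on. Using `gamma = "scale"` (the scikit-learn default, which also divides by the feature variance) or a linear kernel would be natural things to try; no results for those settings are claimed here.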

## Conclusion
This notebook addressed phishing email detection using machine learning. The objective was to build a model that classifies emails as either safe or phishing from TF-IDF features of the email text body.

### Model Performance

We experimented with two machine learning models, the Random Forest Classifier and the Support Vector Machine (SVM), which produced contrasting results:

**Random Forest Classifier:**

- **Accuracy:** 0.924
- The Random Forest Classifier, trained with only 10 trees, reached an accuracy of 0.924 on the held-out test set. According to the classification report, recall was 0.96 for phishing emails and 0.89 for safe emails, with precision of 0.90 and 0.95 respectively. The model therefore misses few phishing emails, at the cost of flagging some safe emails as phishing.

**Support Vector Machine (SVM):**

- **Accuracy:** 0.499
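- In contrast, the SVM reached an accuracy of only 0.499. Because the test set is balanced, this is chance level: the model failed to separate safe and phishing emails at all. A likely cause is the `gamma = "auto"` setting, which sets gamma to 1 / n_features; with a large TF-IDF vocabulary this makes the RBF kernel nearly constant across all pairs of emails. The result therefore probably reflects this configuration rather than a limitation of SVMs in general.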

Further exploration could include data augmentation, deep learning, and hyperparameter tuning.
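
As an illustration of the hyperparameter tuning mentioned above, a grid search over the SVM pipeline could look like the following. This is a sketch under assumptions: the tiny corpus is hypothetical and stands in for `X_train` and `y_train`, the grid values are arbitrary starting points, and `cv=2` is used only because the toy data is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for X_train / y_train.
texts = [
    "meeting agenda attached for monday",
    "please review the quarterly report",
    "lunch on friday with the team",
    "notes from the project call",
    "verify your account password now",
    "you won a free prize click here",
    "urgent confirm your bank details",
    "cheap software click to claim offer",
]
labels = ["Safe Email"] * 4 + ["Phishing Email"] * 4

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("SVM", SVC())])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "SVM__C": [1, 10, 100],
    "SVM__kernel": ["linear", "rbf"],
    "SVM__gamma": ["scale", "auto"],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

The same pattern applies to the Random Forest pipeline, for example with a grid over `classifier__n_estimators`. On the full dataset a larger `cv` value would be more appropriate.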

To conclude, the Random Forest Classifier showed strong potential for phishing email detection, while the SVM as configured here did no better than chance. This project serves as a starting point for further investigation and improvement in the ongoing effort to combat email phishing.