Feature Selection for Spam and Phishing Detection

The ease of communicating through email has led to a huge increase in Unsolicited Bulk Email (UBE). Unsolicited emails are broadly divided into 2 categories: spam (a mass-mailing approach to marketing) and phishing (impersonation for the purpose of stealing data). This project takes a machine learning approach to classifying such emails into spam and phishing categories. It considers a total of 40 features, broadly categorised into URL-based, body-based, sender-based, subject-based and script-based features.
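To give a feel for what the feature categories can look like in practice, here is a minimal sketch written for Python 3 using only the standard library. The feature names (`url_count`, `url_has_ip`, `subject_is_reply` and so on) and the `extract_features` helper are hypothetical illustrations, not the project's actual 40 features, and the sketch assumes a simple non-multipart message.

```python
import re
from email import message_from_string

# Simple patterns for this sketch; real URL parsing is more involved.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
IP_URL_RE = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}", re.IGNORECASE)


def extract_features(raw_email):
    """Return a small dict of illustrative features for one raw email string."""
    msg = message_from_string(raw_email)
    # Assumption: non-multipart message; multipart bodies are ignored here.
    body = "" if msg.is_multipart() else msg.get_payload()
    subject = msg.get("Subject", "")
    sender = msg.get("From", "")
    urls = URL_RE.findall(body)
    return {
        # URL-based
        "url_count": len(urls),
        "url_has_ip": int(any(IP_URL_RE.match(u) for u in urls)),
        # Body-based
        "body_length": len(body),
        # Subject-based
        "subject_is_reply": int(subject.lower().startswith("re:")),
        # Sender-based
        "sender_has_digits": int(bool(re.search(r"\d", sender))),
        # Script-based
        "script_tag_count": len(re.findall(r"<script\b", body, re.IGNORECASE)),
    }


sample = (
    "From: support123@example.com\n"
    "Subject: Re: verify your account\n"
    "\n"
    "Click http://192.168.0.1/login or https://example.com/help\n"
    "<script>alert(1)</script>\n"
)
features = extract_features(sample)
print(features)
```

Each email becomes one row of numeric values, which is the shape the later selection and classification steps expect.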

In addition to feature extraction, feature selection was performed using:

Low Variance filter
High correlation filter
Feature importances
mRMR
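The first two filters in the list above can be sketched in a few lines. This is a minimal Python 3 illustration using only the standard library; the thresholds (`0.01` and `0.9`), the function names and the toy columns are assumptions chosen for the example, not values taken from the project.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def low_variance_filter(columns, threshold=0.01):
    """Keep features whose population variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items() if pvariance(vals) > threshold}


def high_correlation_filter(columns, threshold=0.9):
    """Drop any feature that is highly correlated with a feature already kept."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < threshold for other in kept.values()):
            kept[name] = vals
    return kept


# Toy data: column name -> list of values, one value per email.
data = {
    "url_count": [0, 3, 1, 5, 2, 4],
    "url_count_x2": [0, 6, 2, 10, 4, 8],  # perfectly correlated with url_count
    "constant": [1, 1, 1, 1, 1, 1],       # zero variance
    "body_length": [120, 80, 400, 90, 300, 60],
}
selected = high_correlation_filter(low_variance_filter(data))
print(sorted(selected))
```

In this toy run the constant column is removed by the variance filter and the duplicated column by the correlation filter. Features on very different scales would normally be normalised before applying a single variance threshold.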

Getting Started

The project is divided into 3 modules for convenience. The first module extracts intricate features from the emails and prepares the dataset for use by the subsequent modules. The entire procedure is illustrated below.

Requirements

Python 2
Jupyter Notebook
Necessary Python libraries (see the first cells of Feature Extraction and Feature Selection)

Running

Run Feature Extraction sequentially to obtain 3 datasets in CSV format
Once the CSVs with 40 features have been generated, run Feature Selection sequentially
In Feature Selection, the reference file is dataset_HSP.csv; to use a different file, change the first argument in the call mRMR_CSV('dataset_HSP', 'label')
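The internals of mRMR_CSV are not reproduced here. As a rough illustration of the mRMR (minimum redundancy, maximum relevance) idea, the Python 3 sketch below greedily picks the feature with the best relevance-minus-redundancy score. Note the simplification: mRMR is usually defined with mutual information, whereas this sketch substitutes absolute Pearson correlation to stay within the standard library. The `mrmr_select` function, its parameters and the toy data are all hypothetical.

```python
from math import sqrt


def abs_corr(xs, ys):
    """Absolute Pearson correlation (0.0 if either sequence is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else abs(cov / (sx * sy))


def mrmr_select(features, label, k):
    """Greedily select k feature names by relevance minus mean redundancy."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:

        def score(name):
            # Relevance: association between the feature and the label.
            relevance = abs_corr(features[name], label)
            # Redundancy: mean association with the features already chosen.
            redundancy = (
                sum(abs_corr(features[name], features[s]) for s in selected)
                / len(selected)
                if selected
                else 0.0
            )
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: f_a_copy duplicates f_a, so it adds redundancy but no new information.
label = [0, 0, 0, 1, 1, 1]
features = {
    "f_a": [0, 1, 0, 5, 4, 5],
    "f_a_copy": [0, 2, 0, 10, 8, 10],
    "f_b": [2, 1, 3, 3, 2, 4],
}
chosen = mrmr_select(features, label, k=2)
print(chosen)
```

Although the duplicated feature is as relevant to the label as the original, its redundancy penalty means the less relevant but more independent feature is chosen second.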

Testing and Results

A Voting Ensemble classifier with SVM, Naive Bayes, LDA, AdaBoost, Random Forest and CART was used in the implementation of mRMR feature selection. The accuracy obtained with the selected features was tested against that of the original (40 features) using the following algorithms:

Voting Ensemble classifier (same as above)
SVM
Stochastic Gradient Boosting
Extra Trees classifier
Adaboost
Random Forest
Bagged Decision Tree classifier (CART)
Naive Bayes
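A comparison of this kind can be sketched with scikit-learn as follows. This is a Python 3 illustration under stated assumptions, not the project's notebook code: the data is synthetic (generated with `make_classification`), `selected_columns` is a hypothetical stand-in for the output of the feature selection step, and hard voting, 5-fold cross-validation and the estimator settings are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 40-feature email dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8, random_state=0
)


def make_ensemble():
    """Build a hard-voting ensemble of the six base classifiers."""
    return VotingClassifier(
        estimators=[
            ("svm", SVC(random_state=0)),
            ("nb", GaussianNB()),
            ("lda", LinearDiscriminantAnalysis()),
            ("ada", AdaBoostClassifier(random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("cart", DecisionTreeClassifier(random_state=0)),
        ],
        voting="hard",
    )


# Hypothetical subset of column indices produced by feature selection.
selected_columns = list(range(10))

acc_all = cross_val_score(make_ensemble(), X, y, cv=5, scoring="accuracy").mean()
acc_sel = cross_val_score(
    make_ensemble(), X[:, selected_columns], y, cv=5, scoring="accuracy"
).mean()
print("all 40 features: %.3f" % acc_all)
print("selected subset: %.3f" % acc_sel)
```

The same pattern applies to any single classifier from the list: replace `make_ensemble()` with the estimator of interest and compare the two cross-validated accuracies.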
