Many users purchase products online and pay through e-banking. Some e-banking websites ask users for sensitive data such as usernames, passwords, and credit card details, often for malicious reasons; a site of this type is known as a phishing website. The web is one of the key communication services of the Internet, and web phishing is one of the main security threats to web services. Web phishing aims to steal private information, such as usernames, passwords, and credit card details, by impersonating a legitimate entity. It leads to information disclosure and property damage, and even large organizations can fall victim to such scams.

This guided project focuses on applying machine-learning algorithms to detect phishing websites. To detect and predict e-banking phishing websites, we propose an intelligent, flexible, and effective system based on classification algorithms. We implement classification algorithms and techniques to extract criteria from phishing datasets and classify the legitimacy of websites. An e-banking phishing website can be detected from important characteristics such as URL and domain identity, along with security and encryption criteria, which feed into the final phishing detection rate. When a user makes a payment through an e-banking website, our system uses a data-mining algorithm to determine whether that website is a phishing website.
Phishing is a type of attack in which an attacker tricks the victim into giving up sensitive information, such as login credentials, by posing as a trustworthy entity. In this application we will try to detect phishing websites using the features that differentiate these domains from legitimate ones. We will create our own dataset, train and test various machine learning models using Jupyter Notebooks on IBM Watson Studio, and deploy the best model to be used by the application for detection.
The features used to create the dataset are as follows:
- IP address - check whether the URL contains an IP address instead of a domain name
- HTTPS - check the existence of 'https', a trusted certificate authority, and the age of the certificate
- Shortened URL - check whether the URL has been shortened
- '@' symbol - it leads the browser to ignore everything preceding the '@' symbol
- Double slash - a '//' in the path means the user will be redirected (e.g. http://www.legitimate.com//http://www.phishing.com)
- Domain registration length - trustworthy domains are regularly paid for several years in advance
- Favicon - whether the favicon is loaded from the same domain or not
- 'https' token - existence of the 'https' token in the domain part of the URL
- Request URL - examines whether the external objects contained within a webpage are loaded from another domain
- URL of anchor - whether the anchor tags and the website have different domain names
- Links in tags - Meta, Script, and Link tags are expected to link to the same domain as the webpage
- Server Form Handler - whether it is blank or contains another domain name
- Submitting information to email - whether a form submits user data to an email address
- Abnormal URL - whether the domain name (from WHOIS) is absent from the URL
- Redirect count - number of redirects
- Invisible iframe - presence of a hidden iframe
- Age of domain - from WHOIS records
- Web traffic - Google rank for the page
- Statistical report - match against the top 10 phishing domains and top 10 IPs from PhishTank
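A few of the URL-string checks above can be sketched in Python. This is an illustrative sketch, not the actual code in feature_extraction.py; the function names and the shortener list are assumptions.

```python
import re
from urllib.parse import urlparse

# Illustrative subset of known URL-shortening services.
SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly"}

def has_ip_address(url: str) -> bool:
    """True if the host is a raw IPv4 address rather than a domain name."""
    host = urlparse(url).netloc.split(":")[0]
    return re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) is not None

def has_at_symbol(url: str) -> bool:
    """Browsers ignore everything before '@', hiding the real target."""
    return "@" in url

def is_shortened(url: str) -> bool:
    return urlparse(url).netloc.lower() in SHORTENERS

def https_token_in_domain(url: str) -> bool:
    """'https' embedded in the domain itself, e.g. http://https-paypal.com."""
    return "https" in urlparse(url).netloc.lower()

def has_double_slash_redirect(url: str) -> bool:
    """A '//' appearing after the scheme suggests an embedded redirect."""
    return url.rfind("//") > 7  # the scheme's own '//' sits at index 5 or 6
```

Each check would map to one 0/1 feature column in the dataset.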
- Sign up for an IBM Cloud account
- Log in to IBM Watson Studio
- Install Python 3.7
- Install the dependencies:

```shell
pip install -r packages.txt
```
The dataset created for this application uses around 250 legitimate and 250 phishing URLs, with the 20 features each mentioned above. You can add more data and features (see feature_extraction.py) to the project to create your own dataset as shown below.

The URLs for the phishing websites were retrieved from here (verified_online.csv) and the URLs for the legitimate websites were retrieved from here (top1m.csv).
- Create the dataset for the phishing websites:

```shell
python create_dataset.py <file_with_phishing_url> <number_of_urls_to_use> <output_file> <target_value>
python create_dataset.py verified_online.csv 500 dataset2.csv 1
```

- Create the dataset for the legitimate websites:

```shell
python create_dataset.py <file_with_legitimate_url> <number_of_urls_to_use> <output_file> <target_value>
python create_dataset.py top1m.csv 500 dataset2.csv 0
```
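The dataset-creation step can be imagined roughly as follows. This is a sketch, not the repository's actual create_dataset.py; `extract_features` is a stub standing in for the real feature_extraction.py, which computes the ~20 checks listed earlier.

```python
import csv
import sys

def extract_features(url):
    # Stub: the real feature_extraction.py computes the ~20 checks
    # listed above (IP address, '@' symbol, HTTPS, etc.).
    return [int(c in url) for c in ("@", "-")]

def create_dataset(in_path, n_urls, out_path, target):
    """Read up to n_urls URLs (first column of each row) from in_path,
    extract features for each, and append rows labelled `target`."""
    with open(in_path, newline="") as fin, open(out_path, "a", newline="") as fout:
        writer = csv.writer(fout)
        for i, row in enumerate(csv.reader(fin)):
            if i >= n_urls:
                break
            writer.writerow(extract_features(row[0]) + [target])

if __name__ == "__main__":
    create_dataset(sys.argv[1], int(sys.argv[2]), sys.argv[3], int(sys.argv[4]))
```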
Sign up for IBM's Watson Studio.

Note: By creating a project in Watson Studio, a free tier `Object Storage` service will be created in your IBM Cloud account. Take note of your service names as you will need to select them in the following steps.

- On Watson Studio's Welcome Page select `New Project`.
- Choose the `Data Science` option and click `Create Project`.
- Name your project, select the Cloud Object Storage service instance and click `Create`.
- Drag and drop the dataset (`csv`) file you just created onto Watson Studio's dashboard to upload it to Cloud Object Storage.
- Create a New Notebook.
- Import the notebook found in this repository.
- Give a name to the notebook and select a `Python 3.5` runtime environment, then click `Create`.
To make the dataset available in the notebook, we need to refer to where it lives. Watson Studio automatically generates a connection to your Cloud Object Storage instance and gives access to your data.
- Go to the Files section to the right of the notebook and click `Insert to code` for the data you have uploaded. Choose `Insert pandas DataFrame`.
These steps should allow you to understand the dataset, then analyze and visualize it. You will then go through preprocessing and feature engineering to make the data suitable for modeling. Finally, you will build several machine learning models and test them to compare their performance.
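The modeling step can be condensed into a few lines: split the feature matrix, fit a classifier, and report test accuracy. The sketch below uses scikit-learn logistic regression on a tiny synthetic dataset that stands in for the real phishing dataset (the toy labeling rule is an assumption for illustration only).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# 200 samples x 5 binary features, standing in for the phishing features.
X = rng.integers(0, 2, size=(200, 5))
# Toy rule: label a sample "phishing" when 3+ suspicious checks fire.
y = (X.sum(axis=1) >= 3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = LogisticRegression()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

In the notebook you would repeat the fit/score loop for each candidate model and keep the best performer for deployment.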
- Navigate to your project and add a new machine learning model.
- Give it a name and choose a machine learning service. Select `Model Builder` as the model type, since logistic regression, one of the best-performing models for our dataset, is available in the builder. Select the default runtime and select `Manual`.
- Add the reduced dataset to the model.
- Add a deployment.
- Get the deployment URL and the machine learning model instance tokens.
- Replace the deployment URL and tokens in the check_url.py file.
```shell
python check_url.py <url>
```
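The rough shape of check_url.py is: extract the features for the given URL, then send them to the deployed model's scoring endpoint. The endpoint URL, token, and payload schema below are placeholders, not the actual values or the repository's exact code; the real values come from your Watson ML deployment.

```python
import json
import sys
import urllib.request

# Placeholders: fill in from your Watson ML deployment.
DEPLOYMENT_URL = "https://<region>.ml.cloud.ibm.com/v4/deployments/<id>/predictions"
TOKEN = "<your-ml-instance-token>"

def build_payload(features):
    """Wrap a feature vector in a scoring-request body (schema assumed)."""
    return {"input_data": [{"values": [features]}]}

def score(features):
    """POST the feature vector to the deployment and return the response."""
    req = urllib.request.Request(
        DEPLOYMENT_URL,
        data=json.dumps(build_payload(features)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return json.load(resp)

if __name__ == "__main__":
    # In the real script the features are computed from sys.argv[1].
    print(build_payload([0, 1, 0]))
```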
Reference: https://www.researchgate.net/publication/277476345_Phishing_Websites_Features