
# Phishing Detection System

Problem Statement: Website and URL Phishing Detection System with Risk Scoring (Classification and Regression)
Background:
Phishing attacks targeting users through fraudulent websites and URLs are growing increasingly sophisticated. Attackers use fake websites that impersonate legitimate ones, fooling users into providing sensitive information like usernames, passwords, and financial details. Unlike email phishing, website and URL phishing relies on disguising web addresses and creating convincing, yet fraudulent, web pages. Detecting these phishing websites and URLs is crucial in mitigating data breaches and other cybercrimes.

While traditional methods rely on manual inspection of URLs and websites, these methods are insufficient due to the volume and variety of attacks. The objective is to develop an automated phishing detection system that can classify phishing URLs and websites and assign a risk score to each potential phishing threat. This system will help identify high-risk threats and allow users to make quick, informed decisions.

Objective:
The objective of this project is to build a Phishing Detection System capable of:

- Classifying phishing URLs and websites: Identifying if a given URL or website is phishing or legitimate.
- Assigning a risk score: Calculating a risk score that quantifies the likelihood of a URL or website being a phishing attempt. The higher the score, the more suspicious the URL or website is.

This system will focus on URL-based and website-based phishing without considering email content or features.

Specific Problem:
- Classification: Detect phishing websites and URLs by classifying them as either phishing or legitimate based on features like domain name, SSL certificate status, URL structure, and other key characteristics.
- Risk Scoring (Regression): Assign a risk score to each phishing attempt, indicating the likelihood that the URL or website is phishing. The risk score can be used for prioritizing phishing attempts, especially when faced with large-scale attacks.

Key Challenges:
Obfuscated URLs: Phishing URLs may use techniques like domain name spoofing, URL shortening, or encoding to disguise their intent. Identifying these patterns can be difficult.

Behavioral Mimicry: Fraudulent websites often mimic the appearance and behavior of legitimate sites (e.g., fake login pages), which requires advanced feature extraction and detection mechanisms.

Dynamic Phishing Techniques: Phishing tactics continuously evolve, requiring the system to be adaptable to new strategies such as using HTTPS (SSL certificates) to look legitimate, or using look-alike domains.

Data Imbalance: Phishing URLs and websites are generally less common than legitimate ones, leading to potential class imbalance, which can affect model performance.

Feature Extraction: Efficiently extracting the most relevant features (e.g., domain age, URL length, SSL certificate presence, presence of suspicious keywords) is critical for building an effective detection system.

Approach:
1. Classification Task (Phishing Detection):
The classification model will predict whether a given URL or website is phishing or legitimate.

Key features for phishing classification could include:

- URL Features:
  - Length of the URL: Phishing URLs often have unusually long or short lengths.
  - Domain name: Look-alike domains (e.g., "google.com" vs. "goggle.com").
  - Special characters: Presence of unusual characters or query strings in the URL.
  - Use of HTTPS: Phishing websites often fail to implement HTTPS or use self-signed certificates.
  - Encoding: Phishing URLs may contain base64 or URL encoding to obfuscate their intent.
- Website Features:
  - SSL certificate status: Phishing websites may lack a valid SSL certificate or show warnings in the browser.
  - Suspicious content: Fake login forms, misleading user interfaces, and lack of security features like two-factor authentication.
  - Domain age: Newly created domains are more likely to be phishing sites.

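The URL features listed above can be derived from the URL string alone. The following is a minimal sketch, not the pipeline that produced the dataset used later: the helper names, the shortener list, the keyword list, and the `Has_Percent_Encoding` key are all hypothetical, and the remaining keys simply mirror column names that appear in the data.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Hypothetical, illustrative lists; a real system would maintain larger ones.
SHORTENER_DOMAINS = {"bit.ly", "tinyurl.com", "t.co"}
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account", "update")


def host_is_ip(host: str) -> bool:
    """Return True when the host part is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Derive simple lexical features from a URL string (no network access)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "URL_Length": len(url),
        "Has_IP_Address": int(host_is_ip(host)),
        "HTTPS_Usage": int(parsed.scheme == "https"),
        "Number_of_Dots": url.count("."),
        "Shortened_URL": int(host in SHORTENER_DOMAINS),
        "Suspicious_Words": int(any(w in url.lower() for w in SUSPICIOUS_WORDS)),
        # Hypothetical extra feature: presence of %XX escapes in the URL.
        "Has_Percent_Encoding": int(bool(re.search(r"%[0-9a-fA-F]{2}", url))),
    }


example_url = "http://192.168.0.1/secure-login?user=a%40b.com"
features = extract_url_features(example_url)
print(features)
```

Features such as domain age or SSL certificate validity need external lookups (WHOIS, a TLS handshake) and are deliberately left out of this sketch.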
Possible classification algorithms:

- Logistic Regression
- Decision Trees / Random Forests
- Support Vector Machines (SVM)
- Neural Networks (Deep Learning)
- Gradient Boosting Methods (e.g., XGBoost, LightGBM)

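One way to compare candidate classifiers from this list is stratified cross-validation, which keeps the phishing/legitimate ratio similar in every fold. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for the real features), and the two candidates and their hyperparameters are illustrative choices rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced data: roughly 25% of samples in the positive class.
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.75, 0.25], random_state=0
)

# Two illustrative candidates; scaling is placed inside the pipeline so it is
# fitted on the training folds only.
candidates = {
    "logistic_regression": make_pipeline(
        MinMaxScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean ROC-AUC = {score:.3f}")
```

ROC-AUC is used as the scoring metric here because plain accuracy can look high on imbalanced data even when the minority class is poorly detected.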
2. Regression Task (Risk Scoring):
The regression model will predict a risk score that quantifies the likelihood of a phishing attempt for a given URL or website.

Features influencing the risk score:

- Suspicious URL Features: Unusual URL length, special characters, or obscure domain names.
- SSL Certificate Presence: Websites with self-signed or invalid certificates are more likely to be phishing.
- Domain Reputation: New or low-reputation domains are more likely to be associated with phishing.
- Suspicious Website Behavior: Fake login forms, requests for sensitive information, or abnormal user interactions.

Possible regression algorithms:

- Linear Regression
- Decision Trees / Random Forests
- Gradient Boosting Machines (GBM)
- Neural Networks (Deep Learning)

3. Model Evaluation:
- For Classification: Evaluate using metrics like accuracy, precision, recall, F1-score, and ROC-AUC.
- For Regression: Evaluate using mean squared error (MSE), mean absolute error (MAE), R-squared, and root mean squared error (RMSE).

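The metrics named above are all available in `sklearn.metrics`. The sketch below computes them on stand-in data: the classification part uses a synthetic dataset, and the regression part uses a few illustrative risk scores with invented predictions (a 0 to 100 scale is assumed). None of the numbers are results from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, mean_absolute_error, mean_squared_error,
    precision_score, r2_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the phishing data (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)              # hard class labels
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

classification_metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_prob),  # needs scores, not labels
}

# Illustrative risk scores and invented predictions (not model output).
risk_true = np.array([24.8, 9.2, 81.9, 8.8, 46.1])
risk_pred = np.array([30.0, 12.0, 75.0, 10.0, 40.0])
mse = mean_squared_error(risk_true, risk_pred)
regression_metrics = {
    "mse": mse,
    "rmse": float(np.sqrt(mse)),  # RMSE is the square root of MSE
    "mae": mean_absolute_error(risk_true, risk_pred),
    "r2": r2_score(risk_true, risk_pred),
}

print(classification_metrics)
print(regression_metrics)
```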
4. Dataset:
To train and test the models, we will use datasets containing labeled phishing and legitimate URLs/websites, such as:

- Phishing Websites Data (contains labeled phishing URLs).
- UCI Phishing Websites Data (contains features of phishing websites).
- Phishing URL Dataset (a collection of phishing and legitimate URLs).

Expected Outcomes:

- Phishing URL and Website Classification: The system will classify a URL or website as phishing or legitimate with high accuracy.
- Risk Scoring: The system will provide a risk score for each potential phishing attempt, allowing for prioritized handling of higher-risk threats.
- Real-Time Detection: The system will be capable of detecting phishing websites and URLs in real-time.
- User-Friendly Interface: The system will have an interface where users can input URLs or websites and receive classification results (phishing or legitimate) and a risk score.

Deliverables:

- Phishing Detection Model: A machine learning model capable of classifying phishing URLs and websites.
- Risk Scoring Model: A regression model that assigns a risk score to each phishing attempt.
- User Interface (UI): A front-end application where users can input URLs/websites to check for phishing and view the classification and risk score.
- Documentation: Detailed documentation on the methodologies used for both classification and regression tasks, along with performance evaluation results.

Conclusion:
This Phishing Detection and Risk Scoring System will provide an automated solution to identify phishing websites and URLs, with the added functionality of risk scoring to assess the likelihood of a phishing attempt. By utilizing both classification and regression techniques, the system will help users make informed decisions and prioritize the most dangerous phishing threats.

In [24]:
# Import all required packages and dependencies
import pandas as pd
from pyspark.sql.session import SparkSession
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

from scipy import stats  #statistical testing
from sklearn.preprocessing import OneHotEncoder , MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split , GridSearchCV , StratifiedKFold , cross_val_score

#Classification Module packages for baseline model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import plot_tree , DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier , StackingClassifier , VotingClassifier , GradientBoostingClassifier
from xgboost import XGBClassifier

# Metrics and plotting helpers for evaluating the baseline models
from sklearn.metrics import confusion_matrix , classification_report , roc_auc_score , accuracy_score
from mlxtend.plotting import plot_decision_regions

Setting up baseline models to establish reference results that later models can improve on.

# Loading the CSV files (exported from a MySQL database) into the Colab environment

In [25]:
# Helper that reads a (potentially large) CSV file through an existing Spark session
def load_csv_file(sparksession, file_path):
  """Read a CSV file with a header row, letting Spark infer the column types."""
  return sparksession.read.csv(file_path, header=True, inferSchema=True)


# Create a Spark session and load the customer details
spark = SparkSession.builder.appName("Loading customer details").getOrCreate()
customer_details = load_csv_file(spark, "/content/customer_details.csv")
customer_details.show(5)  # display the first 5 rows

# getOrCreate() returns the session that is already running, so this reuses the one above
phishing_spark = SparkSession.builder.appName("Loading phishing details").getOrCreate()
phishing_details = load_csv_file(phishing_spark, "/content/phishing_details.csv")
phishing_details.show(5)  # display the first 5 rows

+----------+--------------+-----------+----------+-----------------+--------------------+----------+--------------+-------------+----------------+-----------+
|URL_Length|Has_IP_Address|HTTPS_Usage|Domain_Age|Domain_Expiration|Number_of_Subdomains|Alexa_Rank|Number_of_Dots|Shortened_URL|Suspicious_Words|Customer_ID|
+----------+--------------+-----------+----------+-----------------+--------------------+----------+--------------+-------------+----------------+-----------+
|       112|             0|          0|         1|              256|                   1|    558200|             4|            1|               0|          1|
|       102|             0|          1|         9|             1474|                   1|    612421|             1|            1|               1|          2|
|        24|             0|          0|        18|              721|                   1|      6206|             3|            0|               1|          3|
|       116|             0|          1|       

# Converting the Spark DataFrames into pandas DataFrames

In [26]:
# Convert both Spark DataFrames into pandas DataFrames
customer_details = customer_details.toPandas()  # now a pandas DataFrame
phishing_details = phishing_details.toPandas()  # now a pandas DataFrame
print(f" top 5 rows of customer details {customer_details.head()}")
print(f" top 5 rows of phishing details {phishing_details.head()}")

 top 5 rows of customer details    URL_Length  Has_IP_Address  HTTPS_Usage  Domain_Age  Domain_Expiration  \
0         112               0            0           1                256   
1         102               0            1           9               1474   
2          24               0            0          18                721   
3         116               0            1          14               2001   
4          81               0            1           6               2452   

   Number_of_Subdomains  Alexa_Rank  Number_of_Dots  Shortened_URL  \
0                     1      558200               4              1   
1                     1      612421               1              1   
2                     1        6206               3              0   
3                     0      206480               1              1   
4                     1      212706               4              0   

   Suspicious_Words  Customer_ID  
0                 0            1  
1             

# Structure of each table

In [27]:
""" STructure for customer details """
customer_details.info()
print(f" memory used by each columns : {customer_details.memory_usage()}")
print(f" total number of elements (rows x columns) : {customer_details.size}")
print(f" total rows and columns : {customer_details.shape}")
print(f" descriptive statistical summary : {customer_details.describe()}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   URL_Length            15000 non-null  int32
 1   Has_IP_Address        15000 non-null  int32
 2   HTTPS_Usage           15000 non-null  int32
 3   Domain_Age            15000 non-null  int32
 4   Domain_Expiration     15000 non-null  int32
 5   Number_of_Subdomains  15000 non-null  int32
 6   Alexa_Rank            15000 non-null  int32
 7   Number_of_Dots        15000 non-null  int32
 8   Shortened_URL         15000 non-null  int32
 9   Suspicious_Words      15000 non-null  int32
 10  Customer_ID           15000 non-null  int32
dtypes: int32(11)
memory usage: 644.7 KB
 memory used by each columns : Index                     132
URL_Length              60000
Has_IP_Address          60000
HTTPS_Usage             60000
Domain_Age              60000
Domain_Expiration       60000
Nu

In [28]:
""" structure for phishing details """
phishing_details.info()
print(f" memory used by each columns : {phishing_details.memory_usage()}")
print(f" total number of elements (rows x columns) : {phishing_details.size}")
print(f" total rows and columns : {phishing_details.shape}")
print(f" descriptive statistical summary : {phishing_details.describe()}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Iframe_Usage            15000 non-null  int32  
 1   Mouse_Over_Behavior     15000 non-null  int32  
 2   Page_Redirects          15000 non-null  int32  
 3   DNS_Record_Validity     15000 non-null  int32  
 4   SSL_Certificate_Issuer  15000 non-null  int32  
 5   Loading_Time            15000 non-null  float64
 6   Number_of_Links         15000 non-null  int32  
 7   Favicon_Match           15000 non-null  int32  
 8   Is_Phishing             15000 non-null  int32  
 9   Risk_Score              15000 non-null  float64
 10  Customer_ID             15000 non-null  int32  
dtypes: float64(2), int32(9)
memory usage: 761.8 KB
 memory used by each columns : Index                        132
Iframe_Usage               60000
Mouse_Over_Behavior        60000
Page_Redirects      

From the output above, the customer details and phishing details tables share the Customer_ID column,
so the two tables can be merged on that key.
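Before merging on `Customer_ID`, it can help to confirm that the key is unique in both tables and that every key has a match. The sketch below uses two tiny hypothetical frames (`customer_toy`, `phishing_toy`) rather than the real data. `validate="one_to_one"` makes pandas raise an error if either side has duplicate keys, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the two tables.
customer_toy = pd.DataFrame(
    {"Customer_ID": [1, 2, 3], "URL_Length": [112, 102, 24]}
)
phishing_toy = pd.DataFrame(
    {"Customer_ID": [1, 2, 3], "Is_Phishing": [0, 0, 1]}
)

# An outer merge with an indicator column reveals keys missing on either side.
key_check = pd.merge(
    customer_toy, phishing_toy, on="Customer_ID", how="outer", indicator=True
)
all_keys_matched = bool((key_check["_merge"] == "both").all())

# validate="one_to_one" raises pandas.errors.MergeError on duplicate keys.
merged_toy = pd.merge(
    customer_toy, phishing_toy, on="Customer_ID", how="inner",
    validate="one_to_one",
)

print(all_keys_matched)
print(merged_toy)
```

If the keys were not unique, a plain merge would silently multiply rows, so the explicit validation is a cheap safeguard.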

# Merging both DataFrames

In [29]:
""" Merging both the DataFrame """
phishing_data = pd.merge(customer_details,phishing_details,on="Customer_ID")
print(phishing_data.head(5))  #top 5 rows
print(phishing_data.tail(5))  #bottom 5 rows

   URL_Length  Has_IP_Address  HTTPS_Usage  Domain_Age  Domain_Expiration  \
0         112               0            0           1                256   
1         102               0            1           9               1474   
2          24               0            0          18                721   
3         116               0            1          14               2001   
4          81               0            1           6               2452   

   Number_of_Subdomains  Alexa_Rank  Number_of_Dots  Shortened_URL  \
0                     1      558200               4              1   
1                     1      612421               1              1   
2                     1        6206               3              0   
3                     0      206480               1              1   
4                     1      212706               4              0   

   Suspicious_Words  ...  Iframe_Usage  Mouse_Over_Behavior  Page_Redirects  \
0                 0  ...             

In [14]:
phishing_data

Unnamed: 0,URL_Length,Has_IP_Address,HTTPS_Usage,Domain_Age,Domain_Expiration,Number_of_Subdomains,Alexa_Rank,Number_of_Dots,Shortened_URL,Suspicious_Words,...,Iframe_Usage,Mouse_Over_Behavior,Page_Redirects,DNS_Record_Validity,SSL_Certificate_Issuer,Loading_Time,Number_of_Links,Favicon_Match,Is_Phishing,Risk_Score
0,112,0,0,1,256,1,558200,4,1,0,...,1,1,3,1,0,4.19,22,0,0,24.819042
1,102,0,1,9,1474,1,612421,1,1,1,...,0,1,3,0,1,2.71,52,1,0,9.198524
2,24,0,0,18,721,1,6206,3,0,1,...,1,1,1,1,1,8.24,73,1,1,81.925751
3,116,0,1,14,2001,0,206480,1,1,1,...,0,1,3,1,1,8.77,12,1,0,8.777932
4,81,0,1,6,2452,1,212706,4,0,1,...,0,1,4,1,1,9.55,76,0,0,46.095709
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,82,1,0,12,2907,4,111304,1,1,1,...,1,1,4,1,1,7.03,65,0,0,17.856419
14996,109,1,0,0,885,4,855885,4,0,1,...,0,0,4,0,0,5.26,73,0,1,80.292622
14997,38,1,1,11,905,0,54183,2,0,0,...,0,1,4,1,1,4.96,73,1,0,25.445359
14998,67,0,0,10,503,0,753712,3,1,1,...,1,0,1,1,1,8.86,15,0,0,46.912222


# Identifying the Null values and duplicate values in Dataset

In [37]:
def identify_null_values(phishing_data):
  """Return the overall percentage of null cells and the null count per column."""
  null_values = phishing_data.isnull().sum()  # null count for each column
  # share of null cells over all cells in the DataFrame, as a percentage
  percentage_null_values = null_values.sum() / phishing_data.size * 100
  return percentage_null_values, null_values

# the function returns a tuple: (overall null percentage, per-column null counts)
percentage_null_values = identify_null_values(phishing_data)
print(f" percentage of null values accross each columns : {percentage_null_values}")
print(f" duplicate values accross each columns : {phishing_data.duplicated().sum()}")

 percentage of null values accross each columns : (0.0, URL_Length                0
Has_IP_Address            0
HTTPS_Usage               0
Domain_Age                0
Domain_Expiration         0
Number_of_Subdomains      0
Alexa_Rank                0
Number_of_Dots            0
Shortened_URL             0
Suspicious_Words          0
Customer_ID               0
Iframe_Usage              0
Mouse_Over_Behavior       0
Page_Redirects            0
DNS_Record_Validity       0
SSL_Certificate_Issuer    0
Loading_Time              0
Number_of_Links           0
Favicon_Match             0
Is_Phishing               0
Risk_Score                0
dtype: int64)
 duplicate values accross each columns : 0


Since there are no duplicate rows and no null values, the dataset is treated as clean.
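When a dataset does contain missing values, a per-column percentage is usually more informative than a single overall figure. The following sketch shows one way to compute both on a small hypothetical frame (`toy`); it is illustrative only and is not run on the phishing data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in URL_Length.
toy = pd.DataFrame(
    {"URL_Length": [112, np.nan, 24, 116], "HTTPS_Usage": [0, 1, 1, 1]}
)

# The mean of a boolean mask is the fraction of True values, per column.
null_pct_per_column = toy.isnull().mean() * 100

# Fraction of null cells over the whole frame.
overall_null_pct = toy.isnull().to_numpy().mean() * 100

# Number of fully duplicated rows.
duplicate_rows = int(toy.duplicated().sum())

print(null_pct_per_column)
print(overall_null_pct, duplicate_rows)
```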


Creating a helper class that groups the columns according to their data type

In [42]:
from pandas.api.types import is_float_dtype, is_integer_dtype

class Column_Segregation:
  def __init__(self, dataframe):
    self.dataframe = dataframe     # DataFrame whose columns are grouped by dtype
    self.integer_columns = []      # columns with an integer dtype
    self.decimal_columns = []      # columns with a floating-point dtype
    self.categorical_columns = []  # all remaining columns

  def store_columns(self):
    """Group the column names by data type."""
    for column in self.dataframe.columns:
      dtype = self.dataframe[column].dtype
      if is_integer_dtype(dtype):
        self.integer_columns.append(column)
      elif is_float_dtype(dtype):
        self.decimal_columns.append(column)
      else:
        self.categorical_columns.append(column)
    return self.integer_columns , self.decimal_columns , self.categorical_columns


# instantiate the class and group the columns of the merged DataFrame
column_segregation = Column_Segregation(phishing_data)
integer_columns , decimal_columns , categorical_columns = column_segregation.store_columns()
print(f" integer columns : {integer_columns}")
print(f" decimal columns : {decimal_columns}")
print(f" categorical columns : {categorical_columns}")

 integer columns : ['URL_Length', 'Has_IP_Address', 'HTTPS_Usage', 'Domain_Age', 'Domain_Expiration', 'Number_of_Subdomains', 'Alexa_Rank', 'Number_of_Dots', 'Shortened_URL', 'Suspicious_Words', 'Customer_ID', 'Iframe_Usage', 'Mouse_Over_Behavior', 'Page_Redirects', 'DNS_Record_Validity', 'SSL_Certificate_Issuer', 'Number_of_Links', 'Favicon_Match', 'Is_Phishing']
 decimal columns : ['Loading_Time', 'Risk_Score']
 categorical columns : []


Casting the Risk_Score column to an integer type (note that astype truncates the fractional part rather than rounding)
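Casting floats to integers with `astype` drops the fractional part, so a score such as 81.93 becomes 81, not 82. The sketch below contrasts truncation with rounding on a few illustrative values. Whether truncation or rounding is preferable depends on how the score is used downstream, which is an open choice here.

```python
import pandas as pd

# Illustrative risk scores with fractional parts.
scores = pd.Series([24.819042, 9.198524, 81.925751, 46.912222], name="Risk_Score")

truncated = scores.astype("int64")        # drops the fractional part
rounded = scores.round().astype("int64")  # rounds to the nearest integer first

print(pd.DataFrame({"original": scores, "truncated": truncated, "rounded": rounded}))
```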

In [43]:
phishing_data['Risk_Score'] = phishing_data['Risk_Score'].astype('int64')
phishing_data.dtypes

Unnamed: 0,0
URL_Length,int32
Has_IP_Address,int32
HTTPS_Usage,int32
Domain_Age,int32
Domain_Expiration,int32
Number_of_Subdomains,int32
Alexa_Rank,int32
Number_of_Dots,int32
Shortened_URL,int32
Suspicious_Words,int32
