# Phishing URL (Uniform Resource Locator) Detection 

Phishers try to deceive their victims through social engineering or by creating mock-up websites in order to steal information such as account IDs, usernames, and passwords from individuals and organizations. Although many methods have been proposed to detect phishing websites, phishers have evolved their techniques to evade them. Machine learning is one of the most successful approaches to detecting these malicious activities, because most phishing attacks share common characteristics that machine learning methods can identify.

## Loading / Cleaning Data
* Checking information on data set
* Checking for nulls
* Checking for duplicates

In [1]:
# Importing Modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
#Loading dataset into URL_DF
url_df = pd.read_csv(r'C:\Users\FAWAZ\Documents\AI-ML\new_data_urls.csv')

## Tags 
* 0 represents phishing links
* 1 represents legitimate links

In [3]:
#First 5 rows of the dataset
url_df.head()

Unnamed: 0,url,status
0,0000111servicehelpdesk.godaddysites.com,0
1,000011accesswebform.godaddysites.com,0
2,00003.online,0
3,0009servicedeskowa.godaddysites.com,0
4,000n38p.wcomhost.com,0


In [4]:
#information about the dataset
url_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822010 entries, 0 to 822009
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     822010 non-null  object
 1   status  822010 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 12.5+ MB


In [5]:
#checking for nulls
url_df.isna().sum()

url       0
status    0
dtype: int64

In [6]:
#Checking for duplicates
url_df.duplicated().sum()

np.int64(13968)

In [7]:
#Dropping duplicates
url_df.drop_duplicates(inplace= True)

## Extracting Features From URL

### URLs have features in them which can be used to detect whether they are legitimate links or phishing links. Features like:
* IP (Internet Protocol) address
* URL length
* @ symbol
* Double slash
* Prefix and suffix separated by a hyphen ( - )
* Sub Domain
* Presence of http/https in the URL
* URL shortening services
* Depth of the URL ( / )
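
As an illustration of the first feature in the list, the check for an IP address used as the host can be sketched with the standard library's `ipaddress` module. This is a stricter alternative shown for comparison only; `host_is_ip` and its handling of scheme-less input are assumptions, not part of this notebook's pipeline.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(url):
    """Return 1 if the URL's host is a literal IPv4/IPv6 address, else 0."""
    # Assumption: scheme-less input such as '1.2.3.4/login' is treated
    # as '//1.2.3.4/login' so that urlparse recognises the host.
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    try:
        host = urlparse(url).hostname  # drops any port and IPv6 brackets
        ipaddress.ip_address(host or '')  # raises ValueError if not an IP
        return 1
    except ValueError:
        return 0


print(host_is_ip('http://192.168.0.1:8080/login'))  # 1
print(host_is_ip('www.google.com'))                 # 0
```

Unlike a check on the raw `netloc`, this variant still recognises the address when a port is attached.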


### Functions used to extract features and compile them into a DataFrame (table)

In [8]:
#Functions for extracting features
import re
from urllib.parse import urlparse
import socket

# Note: urlparse only fills in netloc (the host) when the URL has a scheme
# such as 'http://' or starts with '//'. For scheme-less URLs like
# 'www.google.com', netloc is empty and the whole string is treated as the
# path, so every netloc-based check below returns 0 for them.

# Check if the URL uses an IP address instead of a domain name
def using_ip(url):
    try:
        ip = urlparse(url).netloc
        socket.inet_aton(ip)  # raises OSError if ip is not a valid IPv4 address
        return 1
    except (OSError, ValueError):
        return 0

# Check if URL length is long
def long_url(url):
    return 1 if len(url) > 75 else 0

# Check if URL is very short
def short_url(url):
    return 1 if len(url) < 10 else 0

# Check for '@' symbol in URL
def has_at_symbol(url):
    return 1 if '@' in url else 0

# Check for '//' redirect in the path (not part of protocol).
# Searching from index 6 skips the '//' of 'http://' (indices 5-6), but the
# '//' of 'https://' starts at index 6, so every https URL is flagged as well.
def redirecting_double_slash(url):
    pos = url.find('//', 6)
    return 1 if pos != -1 else 0

# Check for prefix-suffix pattern in domain (e.g., phishing-site.com)
def prefix_suffix(url):
    domain = urlparse(url).netloc
    return 1 if '-' in domain else 0

# Flag domains with two or more dots (phishing sites often use many subdomains)
def subdomains(url):
    domain = urlparse(url).netloc
    return 1 if domain.count('.') >= 2 else 0

# Flag URLs that do NOT use HTTPS (returns 0 for https, 1 otherwise)
def uses_https(url):
    return 0 if urlparse(url).scheme == 'https' else 1

# Check for unusual ports (anything other than 80 or 443)
def non_standard_port(url):
    port = urlparse(url).port
    return 1 if port and port not in [80, 443] else 0

# Check if HTTPS appears in domain
def https_in_domain(url):
    return 1 if 'https' in urlparse(url).netloc else 0

#listing shortening services
shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

# Checking for Shortening Services in URL (Tiny_URL).
# The pattern is searched anywhere in the URL, so it also matches inside
# longer names (e.g. 't.co' inside 'wcomhost.com').
def tinyURL(url):
    return 1 if re.search(shortening_services, url) else 0

# Counts the non-empty '/'-separated segments of the path (URL_Depth).
# For scheme-less URLs the host is part of the path and counts as one segment.
def getDepth(url):
    segments = urlparse(url).path.split('/')
    return sum(1 for segment in segments if segment)


In [9]:
# compiling the extracted features
def extracted_features(url):
    features= []
    features.append(using_ip(url))
    features.append(long_url(url))
    features.append(short_url(url))
    features.append(has_at_symbol(url))
    features.append(redirecting_double_slash(url))
    features.append(prefix_suffix(url))
    features.append(subdomains(url))
    features.append(uses_https(url))
    features.append(non_standard_port(url))
    features.append(https_in_domain(url))
    features.append(tinyURL(url))
    features.append(getDepth(url))
    return features
    

In [10]:
#testing the function
extracted_features('www.google.com')

[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
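
The host-based flags in the output above (IP, hyphen, subdomains, port, https in domain) are computed from `urlparse(url).netloc`, which is empty for scheme-less input such as `www.google.com`. The whole string is treated as the path, which is also why the depth is 1. The sketch below shows the split; `split_url` and `split_url_with_host` are hypothetical helpers, and prepending `//` is only one possible way to make the host land in `netloc`.

```python
from urllib.parse import urlparse


def split_url(url):
    """Return (scheme, netloc, path) as urlparse sees them."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc, parts.path


def split_url_with_host(url):
    """Hypothetical helper: prepend '//' to scheme-less input before parsing."""
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    return split_url(url)


print(split_url('www.google.com'))                 # ('', '', 'www.google.com')
print(split_url('https://www.google.com/search'))  # ('https', 'www.google.com', '/search')
print(split_url_with_host('www.google.com'))       # ('', 'www.google.com', '')
```

Applying such a normalisation would change the extracted features, so the models would need to be retrained on the new values.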

In [11]:
#extracting the features from each URL and putting them in tabular form (DataFrame)
new_df= url_df['url'].apply(extracted_features).apply(pd.Series)

In [12]:
#checking the first 5 rows of the new table
new_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0,0,0,0,0,0,0,1,0,0,0,1
1,0,0,0,0,0,0,0,1,0,0,0,1
2,0,0,0,0,0,0,0,1,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,0,1
4,0,0,0,0,0,0,0,1,0,0,1,1
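
Row 4 above (`000n38p.wcomhost.com`) has a 1 in column 10, the shortening-service flag, although its host is not one of the listed shorteners: the unanchored pattern finds `t.co` inside `wcomhost.com`. The sketch below reproduces that match with a small subset of the pattern and contrasts it with a comparison against the host name only. `SHORTENER_HOSTS`, `host_of`, and `is_shortener` are hypothetical names, and the host extraction is deliberately simplified.

```python
import re

# Small illustrative subset of the notebook's shortener list
unanchored = re.compile(r"bit\.ly|goo\.gl|t\.co|tinyurl|ow\.ly|is\.gd")
SHORTENER_HOSTS = {'bit.ly', 'goo.gl', 't.co', 'tinyurl.com', 'ow.ly', 'is.gd'}


def host_of(url):
    """Simplified host extraction: drop the scheme, the path, and any port."""
    without_scheme = url.split('://', 1)[-1]
    host = without_scheme.split('/', 1)[0]
    return host.split(':', 1)[0].lower()


def is_shortener(url):
    """Return 1 only if the whole host equals a known shortener domain."""
    host = host_of(url)
    if host.startswith('www.'):
        host = host[4:]
    return 1 if host in SHORTENER_HOSTS else 0


url = '000n38p.wcomhost.com'
print(unanchored.search(url).group())      # t.co
print(is_shortener(url))                   # 0
print(is_shortener('https://bit.ly/abc'))  # 1
```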


In [13]:
# renaming the columns 
new_df.rename(columns= {
    0 : 'IP',
    1: 'Long_URL',
    2: 'Short_URL',
    3: 'At_Symbol',
    4: 'Double_Slash',
    5: 'Hypen',
    6: 'Sub_Domains',
    7: 'https',           # 1 means the URL does NOT use https
    8: 'Standard_Port',   # 1 means a non-standard port is used
    9: 'https_in_domain',
    10: 'shortening_service',
    11: 'Depth'
    }, inplace= True)

In [14]:
# checking if columns have been renamed
new_df.columns

Index(['IP', 'Long_URL', 'Short_URL', 'At_Symbol', 'Double_Slash', 'Hypen',
       'Sub_Domains', 'https', 'Standard_Port', 'https_in_domain',
       'shortening_service', 'Depth'],
      dtype='object')

In [15]:
#combining the original dataset and the new one of extracted features 
url_final_df= pd.concat([url_df, new_df], axis= 1)

In [16]:
#checking the first 5 rows of the combined data set
url_final_df.head()

Unnamed: 0,url,status,IP,Long_URL,Short_URL,At_Symbol,Double_Slash,Hypen,Sub_Domains,https,Standard_Port,https_in_domain,shortening_service,Depth
0,0000111servicehelpdesk.godaddysites.com,0,0,0,0,0,0,0,0,1,0,0,0,1
1,000011accesswebform.godaddysites.com,0,0,0,0,0,0,0,0,1,0,0,0,1
2,00003.online,0,0,0,0,0,0,0,0,1,0,0,0,1
3,0009servicedeskowa.godaddysites.com,0,0,0,0,0,0,0,0,1,0,0,0,1
4,000n38p.wcomhost.com,0,0,0,0,0,0,0,0,1,0,0,1,1


In [17]:
#removing the url column
url_final_df.drop(['url'], axis=1, inplace= True)

In [18]:
#checking the columns left
url_final_df.columns

Index(['status', 'IP', 'Long_URL', 'Short_URL', 'At_Symbol', 'Double_Slash',
       'Hypen', 'Sub_Domains', 'https', 'Standard_Port', 'https_in_domain',
       'shortening_service', 'Depth'],
      dtype='object')

In [19]:
#checking how balanced the target column is
url_final_df['status'].value_counts(normalize= True)

status
1    0.528473
0    0.471527
Name: proportion, dtype: float64

## Model Building 
The data set is now split into training and test sets: 80% is used for training and the remaining 20% for testing the model's performance.
scikit-learn and XGBoost are the libraries used.
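
What a stratified 80/20 split is meant to achieve can be illustrated with a small pure-Python sketch: each class is shuffled and split separately, so both parts keep roughly the same class mix. This is not scikit-learn's implementation; `stratified_split` is a hypothetical helper, and the 53/47 toy labels only mimic the class balance reported above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 53 + [0] * 47  # toy labels: 1 = legitimate, 0 = phishing
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                     # 80 20
print(sum(labels[i] for i in test_idx) / len(test_idx))  # 0.55
```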

## Models used:
* Logistic Regression
* Random Forest Classifier
* XGBoost(Extreme Gradient Boosting) Classifier


In [20]:
#importing the modules for splitting the data, building the models and evaluating performance
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import joblib

In [21]:
#splitting the data into independent(X) and dependent (y) variables
y= url_final_df['status']
X= url_final_df.drop(['status'], axis= 1)

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state= 42, stratify= y)

In [22]:
X.shape

(808042, 12)

### Logistic Regression Model
A model used for classification, mostly on data sets with dichotomous (two-class) targets.

In [23]:
#Fitting the Logistic Regression model
lr= LogisticRegression(random_state= 42)
lr.fit(X_train, y_train)

## Evaluation
* Classification_Report: Shows the precision, recall and f1-score for both classes. It also shows the overall accuracy.


In [24]:
#Evaluating the model's performance
pred= lr.predict(X_test)
cla = classification_report(y_test, pred)
print(cla)

              precision    recall  f1-score   support

           0       0.80      0.44      0.57     76203
           1       0.64      0.90      0.75     85406

    accuracy                           0.68    161609
   macro avg       0.72      0.67      0.66    161609
weighted avg       0.72      0.68      0.66    161609
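
The f1-score column in the report above can be reproduced from the precision and recall columns, since F1 is their harmonic mean. The sketch below is plain Python rather than scikit-learn, and `f1_score_from` and `precision_recall_f1` are hypothetical helper names. Because the report's inputs are already rounded to two decimals, the results only match after rounding.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute all three metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return precision, recall, f1_score_from(precision, recall)


# Values taken from the Logistic Regression report
print(round(f1_score_from(0.80, 0.44), 2))  # 0.57 (class 0)
print(round(f1_score_from(0.64, 0.90), 2))  # 0.75 (class 1)
```

The low recall for class 0 (0.44) means that more than half of the phishing links in the test set are predicted as legitimate by this model.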



## Function to test the model on real URLs

In [25]:
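#function used to test a new url (returns 0 for phishing, 1 for legit)
import warnings
# suppress warnings, e.g. about predicting on a plain list without column names
warnings.filterwarnings('ignore')
def test(url, model):
    features = extracted_features(url)
    # predict expects a 2D input: one row of features per URL
    return model.predict([features])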

In [26]:
#test1
test('0000111servicehelpdesk.godaddysites.com', lr)

array([1])

In [27]:
#test2
test('www.google.com', lr)

array([1])

## Random Forest
A model that can be used for classification problems as well as regression problems. Like a forest, it grows numerous decision trees, each trained on a different random subset of the data and features, and combines their predictions.

In [28]:
# Fitting the Random Forest classifier model
rfc= RandomForestClassifier(random_state= 42)
rfc.fit(X_train, y_train)

In [29]:
#Evaluating the model's performance
pred= rfc.predict(X_test)
cla = classification_report(y_test, pred)
print(cla)

              precision    recall  f1-score   support

           0       0.78      0.71      0.74     76203
           1       0.76      0.82      0.79     85406

    accuracy                           0.77    161609
   macro avg       0.77      0.77      0.77    161609
weighted avg       0.77      0.77      0.77    161609



In [30]:
#test 3
test('0000111servicehelpdesk.godaddysites.com', rfc)

array([0])

In [31]:
#test 4
test('deepchecks.com/open-source/', rfc)


array([1])

In [32]:
#test 5
test('https://datasciencedojo.com/blog/machine-learning-models-testing-methods/', rfc)

array([0])

### XGBoost 
Extreme Gradient Boosting, also known as XGBoost, is a tree-based model that grows numerous trees just like the random forest. The key difference is that the trees are built one after another, and each new tree is trained to correct the errors left by the trees before it.
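
The idea of each tree correcting the previous ones can be sketched in plain Python for squared-error loss on a one-dimensional toy data set. This is a generic gradient-boosting illustration with decision stumps, not XGBoost's actual algorithm (which adds regularization and second-order gradient information); `fit_stump`, `boost`, and `mse` are hypothetical names.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
print(mse(boost(xs, ys, n_trees=1), xs, ys))   # 0.0625
print(mse(boost(xs, ys, n_trees=20), xs, ys))  # close to 0
```

Each added stump halves the remaining error on this toy data, which is the sequential improvement described above.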

In [33]:
#fitting the xgboost model
xgb = XGBClassifier(random_state= 42)
xgb.fit(X_train.values, y_train.values)

In [34]:
#evaluating the model's performance
predc= xgb.predict(X_test.values)
clas= classification_report(y_test, predc)
print(clas)

              precision    recall  f1-score   support

           0       0.78      0.71      0.74     76203
           1       0.76      0.82      0.79     85406

    accuracy                           0.77    161609
   macro avg       0.77      0.77      0.77    161609
weighted avg       0.77      0.77      0.77    161609



In [36]:
test('0000111servicehelpdesk.godaddysites.com', xgb)

array([0])

In [37]:
test('https://www.google.com', xgb)

array([0])
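
For this URL the double-slash feature is set even though there is no redirect: `url.find('//', 6)` starts exactly on the `//` of `https://`. A possible variant, sketched below under the assumption that only a leading scheme should be ignored, strips the scheme first. `double_slash_after_scheme` is a hypothetical name, and swapping it in would require re-extracting the features and retraining the models.

```python
import re

# Matches a leading scheme such as 'http://' or 'https://'
_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://')


def redirecting_double_slash_original(url):
    """Behaviour of the notebook's function, repeated here for comparison."""
    pos = url.find('//', 6)
    return 1 if pos != -1 else 0


def double_slash_after_scheme(url):
    """Hypothetical variant: ignore the '//' that belongs to the scheme."""
    rest = _SCHEME.sub('', url, count=1)
    return 1 if '//' in rest else 0


for url in ['http://www.google.com',
            'https://www.google.com',
            'https://a.com//evil.com']:
    print(redirecting_double_slash_original(url), double_slash_after_scheme(url))
# 0 0
# 1 0
# 1 1
```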

In [38]:
test('https://www.w3schools.com/python/python_ml_getting_started.asp', xgb)

array([1])

In [39]:
test('https://en.wikipedia.org/wiki/Machine_learning', xgb)

array([1])

In [40]:
joblib.dump(xgb ,'PHISING LINK MODEL.joblib')

['PHISING LINK MODEL.joblib']

## Conclusion
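The model's performance might be improved by adding online features, so that the model can better learn the patterns that separate malicious links from legitimate ones.
So far, using only features that can be read from the URL string itself, Random Forest and XGBoost both reach 0.77 accuracy on the test set, compared with 0.68 for Logistic Regression. The saved XGBoost model can therefore pick up patterns in a URL that hint at whether it is malicious, although the spot checks above show that some links are still misclassified.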