## [1] Feature Engineering
- At this section, we conduct feature engineering, to generate features based on URL
- All original features from the dataset were removed due to the lack of documentation on how they were created. Without clear definitions, these original features could not guarantee reproducibility, reliability, or interpretability, which are essential for a transparent and trustworthy model.
- Although many phishing datasets include additional features such as HTML-based attributes or traffic statistics, we intentionally restrict our feature set to URL-only features. This design choice ensures that the detector can be applied directly to raw URLs in real-world scenarios, where HTML content, domain metadata, or server-side information may not be easily accessible.
- Additionally, we choose to only focus on the protocol and hostname of the URL for 3 reasons
    - 'Ground Truth' for routing. The registered domain is the only part of the URL that reliably determines the true destination server an end-user will connect to.
    - Reduced feature space and better generalisation. Analysing the full URL which can include paths and queries can introduce greater dimensionality and variance into the feature space. These components are highly variable and easily manipulated by attackers to create millions of unique URLs that all point to the same malicious page. Because of our limited data size, it is important that we only focus on the most important portion of URL to prevent overfitting on the dfferent path variations.
    -  Dataset Consistency. Many data source (such as blacklisted domain) may only record the hostname and protocol without the specific path observed

#### Breakdown of an URL (example)
https://subdomain.example.com:8080/path/to/file.html?key1=value1&key2=value2#section
- Scheme / Protocol: https
- Host / Domain: subdomain.example.com
    - Subdomain: subdomain
    - Second-Level-Domain (SLD): example
    - Top-Level-Domain (TLD): com
- Port: 8080
- Path: path/to/file.html
    - Directory-Path: path/to
    - Filename: file.html
    - File-Extension: html
- Query: key1=value1&key2=value2
    - Params: key1=value1 and key2=value2
- Fragment/Anchor: section

In [1]:
# import packages for data processing
import pandas as pd
import numpy as np

# URL parsing
import re
from urllib.parse import urlparse
from collections import Counter
import tldextract
import ipaddress

# set display options
#pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

In [2]:
# load data
df = pd.read_csv('dataset/kaggle_phishing_dataset.csv')

In [3]:
def preprocess(df):
    # fix mapping of status column
    status_mapping = {'legitimate': 0, 'phishing': 1}
    df['target'] = df['status'].map(status_mapping)
    
    # drop all cols other than 'url' and 'target'
    df = df[['url', 'target']]
    return df

In [4]:
def decompose_single_url(url):
    parsed = urlparse(url)
    hostname = parsed.hostname or None
    ext = tldextract.extract(hostname) if hostname else None
    
    return {
        'url': url,
        'protocol': parsed.scheme or None,
        'hostname': hostname,
        'subdomains': ext.subdomain if ext else None,
        'sld': ext.domain if ext else None,
        'tld': ext.suffix if ext else None,
    }

In [5]:
def decompose_url(df):
    df_decomposed = df['url'].apply(lambda x: pd.Series(decompose_single_url(x)))
    df_decomposed = pd.concat([df_decomposed, df['target']], axis=1)
    # convert all empty strings to None
    df_decomposed.replace('', None, inplace=True)
    # transform url such that we truncate off at hostname level
    def reconstruct_base_url(row):
        if not row['hostname']:
            return row['url']  # return original if parsing failed
        protocol = row['protocol'] if row['protocol'] else 'http'
        return f"{protocol}://{row['hostname']}"
    df_decomposed['url'] = df_decomposed.apply(reconstruct_base_url, axis=1)
    return df_decomposed

In [None]:
df_processed = preprocess(df)
df_decomposed = decompose_url(df_processed)

In [7]:
df_decomposed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   url         11430 non-null  object
 1   protocol    11430 non-null  object
 2   hostname    11430 non-null  object
 3   subdomains  8814 non-null   object
 4   sld         11430 non-null  object
 5   tld         11333 non-null  object
 6   target      11430 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 625.2+ KB


In [8]:
df_decomposed.head(20)

Unnamed: 0,url,protocol,hostname,subdomains,sld,tld,target
0,http://www.crestonwood.com,http,www.crestonwood.com,www,crestonwood,com,0
1,http://shadetreetechnology.com,http,shadetreetechnology.com,,shadetreetechnology,com,1
2,https://support-appleld.com.secureupdate.duila...,https,support-appleld.com.secureupdate.duilawyeryork...,support-appleld.com.secureupdate,duilawyeryork,com,1
3,http://rgipt.ac.in,http,rgipt.ac.in,,rgipt,ac.in,0
4,http://www.iracing.com,http,www.iracing.com,www,iracing,com,0
5,http://appleid.apple.com-app.es,http,appleid.apple.com-app.es,appleid.apple,com-app,es,1
6,http://www.mutuo.it,http,www.mutuo.it,www,mutuo,it,0
7,http://www.shadetreetechnology.com,http,www.shadetreetechnology.com,www,shadetreetechnology,com,1
8,http://vamoaestudiarmedicina.blogspot.com,http,vamoaestudiarmedicina.blogspot.com,vamoaestudiarmedicina,blogspot,com,0
9,https://parade.com,https,parade.com,,parade,com,0


In [9]:
def extract_url_features(df):

    final_df = df.copy()

    ## PROTOCOL FEATURES
    final_df['is_https'] = (final_df['protocol'] == 'https').astype(int)
    final_df['is_http'] = (final_df['protocol'] == 'http').astype(int)

    ## DOMAIN FEATURES
    final_df['has_subdomain'] = final_df['subdomains'].notna().astype(int)
    final_df['has_tld'] = final_df['tld'].notna().astype(int)
    final_df['num_subdomain'] = final_df['subdomains'].apply(lambda x: len(x.split('.')) if x else 0)
    # check if is IP address
    def is_ip_address(hostname):
        try:
            ipaddress.ip_address(hostname)
            return 1  
        except:
            return 0
    final_df['is_domain_ip'] = final_df['hostname'].apply(is_ip_address)
    # detect punycode
    final_df['is_punycode'] = final_df['hostname'].str.contains('xn--', regex=False, na=False).astype(int)

    ## LENGTH FEATURES
    final_df['length_url'] = final_df['url'].str.len()
    final_df['length_hostname'] = final_df['hostname'].str.len()
    final_df['length_subdomains'] = final_df['subdomains'].str.len()
    final_df['length_tld'] = final_df['tld'].str.len()
    final_df['length_sld'] = final_df['sld'].str.len()

    ## PUNCTUATION FEATURES
    final_df['num_dots'] = final_df['url'].str.count(r'\.')
    final_df['num_hyphens'] = final_df['url'].str.count('-')
    final_df['num_at'] = final_df['url'].str.count('@')
    final_df['num_question_marks'] = final_df['url'].str.count(r'\?')
    final_df['num_and'] = final_df['url'].str.count('&')
    final_df['num_equal'] = final_df['url'].str.count('=')
    final_df['num_underscores'] = final_df['url'].str.count('_')    
    final_df['num_slashes'] = final_df['url'].str.count('/')
    final_df['num_percent'] = final_df['url'].str.count('%')
    final_df['num_dollars'] = final_df['url'].str.count(r'\$')
    final_df['num_colon'] = final_df['url'].str.count(':')
    final_df['num_semicolon'] = final_df['url'].str.count(';')
    final_df['num_comma'] = final_df['url'].str.count(',')
    final_df['num_hashtag'] = final_df['url'].str.count('#')
    final_df['num_tilde'] = final_df['url'].str.count('~')

    ## SUSPICIOUS PATTERNS FEATURES
    final_df['tld_in_subdomain'] = final_df['subdomains'].apply(lambda x: 1 if x and any(ext in x for ext in ['.com', '.net', '.org']) else 0)
    final_df['subdomain_longer_sld'] = (final_df['length_subdomains'] > final_df['length_sld']).astype(int)

    ## RATIO FEATURES
    final_df['ratio_digits_hostname'] = final_df['hostname'].apply(lambda x: sum(c.isdigit() for c in x) / len(x) if len(x) > 0 else 0)
    final_df['ratio_letter_hostname'] = final_df['hostname'].apply(lambda x: sum(c.isalpha() for c in x) / len(x) if len(x) > 0 else 0)
    final_df['ratio_special_char_hostname'] = final_df['hostname'].apply(lambda x: sum(not c.isalnum() and c not in ['/', ':', '.'] for c in x) / len(x) if len(x) > 0 else 0)
    
    # WORD-BASED FEATURES 
    words_host = final_df['hostname'].apply(lambda x: re.findall(r'\w+', x) if x else [])
    final_df['length_words_hostname'] = words_host.apply(len)
    final_df['avg_word_hostname'] = words_host.apply(lambda x: np.mean([len(w) for w in x]) if x else 0)

    ## CHARACTER BASED FEATURES
    final_df['num_unique_chars_hostname'] = final_df['hostname'].apply(lambda x: len(set(x)) if x else 0)
    final_df['num_unique_chars_subdomains'] = final_df['subdomains'].apply(lambda x: len(set(x)) if x else 0)
    final_df['num_unique_chars_sld'] = final_df['sld'].apply(lambda x: len(set(x)) if x else 0)
    final_df['num_non_ascii_hostname'] = final_df['hostname'].apply(lambda x: sum(1 for c in x if ord(c) > 127) if x else 0)
    final_df['longest_repeated_char_hostname'] = final_df['hostname'].apply(lambda x: max([len(list(g)) for k, g in re.findall(r'((.)\2*)', x)]) if x else 0)
    final_df['longest_repeated_char_subdomains'] = final_df['subdomains'].apply(lambda x: max([len(list(g)) for k, g in re.findall(r'((.)\2*)', x)]) if x else 0)
    final_df['longest_repeated_char_sld'] = final_df['sld'].apply(lambda x: max([len(list(g)) for k, g in re.findall(r'((.)\2*)', x)]) if x else 0)

    # URL SHORTENING FEATURES
    shortening_services = ['bit.ly', 'goo.gl', 'tinyurl', 't.co']
    final_df['has_shortened_hostname'] = final_df['hostname'].str.lower().apply(lambda x: 1 if any(service in x for service in shortening_services) else 0)
    
    # ENTROPY FEATURES
    def calculate_entropy(domain):
        if not domain or len(domain) == 0:
            return 0
        domain_clean = re.sub(r'[^a-z]', '', domain.lower())
        if len(domain_clean) == 0:
            return 0
        char_freq = Counter(domain_clean)
        entropy = -sum((count/len(domain_clean)) * np.log2(count/len(domain_clean)) 
                      for count in char_freq.values())
        return entropy
    final_df['entropy_hostname'] = final_df['hostname'].apply(calculate_entropy)
    final_df['entropy_subdomains'] = final_df['subdomains'].apply(calculate_entropy)
    final_df['entropy_sld'] = final_df['sld'].apply(calculate_entropy)

    
    return final_df

In [10]:
df_extracted = extract_url_features(df_decomposed)

In [11]:
df_extracted

Unnamed: 0,url,protocol,hostname,subdomains,sld,tld,target,is_https,is_http,has_subdomain,has_tld,num_subdomain,is_domain_ip,is_punycode,length_url,length_hostname,length_subdomains,length_tld,length_sld,num_dots,num_hyphens,num_at,num_question_marks,num_and,num_equal,num_underscores,num_slashes,num_percent,num_dollars,num_colon,num_semicolon,num_comma,num_hashtag,num_tilde,tld_in_subdomain,subdomain_longer_sld,ratio_digits_hostname,ratio_letter_hostname,ratio_special_char_hostname,length_words_hostname,avg_word_hostname,num_unique_chars_hostname,num_unique_chars_subdomains,num_unique_chars_sld,num_non_ascii_hostname,longest_repeated_char_hostname,longest_repeated_char_subdomains,longest_repeated_char_sld,has_shortened_hostname,entropy_hostname,entropy_subdomains,entropy_sld
0,http://www.crestonwood.com,http,www.crestonwood.com,www,crestonwood,com,0,0,1,1,1,1,0,0,26,19,3.0,3.0,11,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.894737,0.00,3,5.666667,11,1,9,0,1,1,1,0,3.028639,-0.000000,3.027169
1,http://shadetreetechnology.com,http,shadetreetechnology.com,,shadetreetechnology,com,1,0,1,0,1,0,0,0,30,23,,3.0,19,1,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.956522,0.00,2,11.000000,15,0,13,0,1,0,1,0,3.606937,0.000000,3.511085
2,https://support-appleld.com.secureupdate.duila...,https,support-appleld.com.secureupdate.duilawyeryork...,support-appleld.com.secureupdate,duilawyeryork,com,1,1,0,1,1,3,0,0,58,50,32.0,3.0,13,4,1,0,0,0,0,0,2,0,0,1,0,0,0,0,1,1,0.000000,0.900000,0.02,6,7.500000,18,14,11,0,1,1,1,0,3.842101,3.466101,3.392747
3,http://rgipt.ac.in,http,rgipt.ac.in,,rgipt,ac.in,0,0,1,0,1,0,0,0,18,11,,5.0,5,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.818182,0.00,3,3.000000,9,0,5,0,1,0,1,0,2.947703,0.000000,2.321928
4,http://www.iracing.com,http,www.iracing.com,www,iracing,com,0,0,1,1,1,1,0,0,22,15,3.0,3.0,7,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.866667,0.00,3,4.333333,10,1,6,0,1,1,1,0,3.026987,-0.000000,2.521641
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11425,http://www.fontspace.com,http,www.fontspace.com,www,fontspace,com,0,0,1,1,1,1,0,0,24,17,3.0,3.0,9,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.882353,0.00,3,5.000000,12,1,9,0,1,1,1,0,3.323231,-0.000000,3.169925
11426,http://www.budgetbots.com,http,www.budgetbots.com,www,budgetbots,com,1,0,1,1,1,1,0,0,25,18,3.0,3.0,10,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.888889,0.00,3,5.333333,12,1,8,0,1,1,1,0,3.327820,-0.000000,2.921928
11427,https://www.facebook.com,https,www.facebook.com,www,facebook,com,0,1,0,1,1,1,0,0,24,16,3.0,3.0,8,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.875000,0.00,3,4.666667,10,1,7,0,1,1,1,0,2.985228,-0.000000,2.750000
11428,http://www.mypublicdomainpictures.com,http,www.mypublicdomainpictures.com,www,mypublicdomainpictures,com,0,0,1,1,1,1,0,0,37,30,3.0,3.0,22,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0.000000,0.933333,0.00,3,9.333333,18,1,16,0,1,1,1,0,3.913800,-0.000000,3.879664


In [12]:
df_extracted.columns

Index(['url', 'protocol', 'hostname', 'subdomains', 'sld', 'tld', 'target',
       'is_https', 'is_http', 'has_subdomain', 'has_tld', 'num_subdomain',
       'is_domain_ip', 'is_punycode', 'length_url', 'length_hostname',
       'length_subdomains', 'length_tld', 'length_sld', 'num_dots',
       'num_hyphens', 'num_at', 'num_question_marks', 'num_and', 'num_equal',
       'num_underscores', 'num_slashes', 'num_percent', 'num_dollars',
       'num_colon', 'num_semicolon', 'num_comma', 'num_hashtag', 'num_tilde',
       'tld_in_subdomain', 'subdomain_longer_sld', 'ratio_digits_hostname',
       'ratio_letter_hostname', 'ratio_special_char_hostname',
       'length_words_hostname', 'avg_word_hostname',
       'num_unique_chars_hostname', 'num_unique_chars_subdomains',
       'num_unique_chars_sld', 'num_non_ascii_hostname',
       'longest_repeated_char_hostname', 'longest_repeated_char_subdomains',
       'longest_repeated_char_sld', 'has_shortened_hostname',
       'entropy_hostname', '

In [13]:
# create tree based model for random forest, do train test split
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
X = df_extracted.drop(columns=['url', 'protocol', 'hostname', 'sld', 'target'])
y = df_extracted['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)   

# for subdomain, tld, file_extension, keep the top 5 based on value counts, then set rest to others, then do one hot encoding
for col in ['subdomains', 'tld']:
    top_5 = df_decomposed[col].value_counts().nlargest(5).index
    X_train[col] = X_train[col].apply(lambda x: x if x in top_5 else 'other')
    X_test[col] = X_test[col].apply(lambda x: x if x in top_5 else 'other')
    X_train = pd.get_dummies(X_train, columns=[col], prefix=col)
    X_test = pd.get_dummies(X_test, columns=[col], prefix=col)

# fit randm forest and evaluate the rocauc
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]    

from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC AUC: {roc_auc:.4f}')

ROC AUC: 0.9076
