## [1] Feature Engineering
- At this section, we conduct feature engineering, to generate features based on URL
- All original features from the dataset were removed due to the lack of documentation on how they were created. Without clear definitions, these original features could not guarantee reproducibility, reliability, or interpretability, which are essential for a transparent and trustworthy model.
- Although many phishing datasets include additional features such as HTML-based attributes or traffic statistics, we intentionally restrict our feature set to URL-only features. This design choice ensures that the detector can be applied directly to raw URLs in real-world scenarios, where HTML content, domain metadata, or server-side information may not be easily accessible.

#### Breakdown of an URL (example)
https://subdomain.example.com:8080/path/to/file.html?key1=value1&key2=value2#section
- Scheme / Protocol: https
- Host / Domain: subdomain.example.com
    - Subdomain: subdomain
    - Second-Level-Domain (SLD): example
    - Top-Level-Domain (TLD): com
- Port: 8080
- Path: path/to/file.html
    - Directory-Path: path/to
    - Filename: file.html
    - File-Extension: html
- Query: key1=value1&key2=value2
    - Params: key1=value1 and key2=value2
- Fragment/Anchor: section

In [1]:
# import packages for data processing
import pandas as pd
import numpy as np

# URL parsing
import re
from urllib.parse import urlparse
from collections import Counter
import tldextract
import ipaddress

# set display options
#pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

In [2]:
# load data
df = pd.read_csv('dataset/kaggle_phishing_dataset.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 89 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   url                         11430 non-null  object 
 1   length_url                  11430 non-null  int64  
 2   length_hostname             11430 non-null  int64  
 3   ip                          11430 non-null  int64  
 4   nb_dots                     11430 non-null  int64  
 5   nb_hyphens                  11430 non-null  int64  
 6   nb_at                       11430 non-null  int64  
 7   nb_qm                       11430 non-null  int64  
 8   nb_and                      11430 non-null  int64  
 9   nb_or                       11430 non-null  int64  
 10  nb_eq                       11430 non-null  int64  
 11  nb_underscore               11430 non-null  int64  
 12  nb_tilde                    11430 non-null  int64  
 13  nb_percent                  114

In [4]:
def preprocess(df):
    # fix mapping of status column
    status_mapping = {'legitimate': 0, 'phishing': 1}
    df['target'] = df['status'].map(status_mapping)
    
    # drop all cols other than 'url' and 'target'
    df = df[['url', 'target']]
    return df

In [5]:
df_processed = preprocess(df)

In [6]:
def decompose_single_url(url):
    parsed = urlparse(url)
    hostname = parsed.hostname or None
    ext = tldextract.extract(hostname) if hostname else None
    
    # path decomposition
    path_parts = [p for p in parsed.path.split('/') if p] if parsed.path else []
    filename = path_parts[-1] if path_parts else None
    file_extension = filename.split('.')[-1] if filename and '.' in filename else None
    directory_path = '/'.join(path_parts[:-1]) if len(path_parts) > 1 else None
    
    # query parameters
    query_params = parsed.query.split('&') if parsed.query else None
    
    return {
        'url': url,
        'protocol': parsed.scheme or None,
        'hostname': hostname,
        'port': parsed.port if parsed.port is not None else None,
        'path': parsed.path or None,
        'query': parsed.query or None,
        'fragment': parsed.fragment or None,
        'subdomains': ext.subdomain if ext else None,
        'sld': ext.domain if ext else None,
        'tld': ext.suffix if ext else None,
        'filename': filename,
        'file_extension': file_extension,
        'directory_path': directory_path,
        'query_params': query_params
    }

In [7]:
def decompose_url(df):
    df_decomposed = df['url'].apply(lambda x: pd.Series(decompose_single_url(x)))
    df_decomposed = pd.concat([df_decomposed, df['target']], axis=1)
    # convert all empty strings to None
    df_decomposed.replace('', None, inplace=True)
    return df_decomposed

In [17]:
df_decomposed = decompose_url(df_processed)

In [18]:
df_decomposed.head()

Unnamed: 0,url,protocol,hostname,port,path,query,fragment,subdomains,sld,tld,filename,file_extension,directory_path,query_params,target
0,http://www.crestonwood.com/router.php,http,www.crestonwood.com,,/router.php,,,www,crestonwood,com,router.php,php,,,0
1,http://shadetreetechnology.com/V4/validation/a...,http,shadetreetechnology.com,,/V4/validation/a111aedc8ae390eabcfa130e041a10a4,,,,shadetreetechnology,com,a111aedc8ae390eabcfa130e041a10a4,,V4/validation,,1
2,https://support-appleld.com.secureupdate.duila...,https,support-appleld.com.secureupdate.duilawyeryork...,,/ap/89e6a3b4b063b8d/,cmd=_update&dispatch=89e6a3b4b063b8d1b&locale=_,,support-appleld.com.secureupdate,duilawyeryork,com,89e6a3b4b063b8d,,ap,"[cmd=_update, dispatch=89e6a3b4b063b8d1b, loca...",1
3,http://rgipt.ac.in,http,rgipt.ac.in,,,,,,rgipt,ac.in,,,,,0
4,http://www.iracing.com/tracks/gateway-motorspo...,http,www.iracing.com,,/tracks/gateway-motorsports-park/,,,www,iracing,com,gateway-motorsports-park,,tracks,,0


In [19]:
df_decomposed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   url             11430 non-null  object 
 1   protocol        11430 non-null  object 
 2   hostname        11430 non-null  object 
 3   port            27 non-null     float64
 4   path            10791 non-null  object 
 5   query           1544 non-null   object 
 6   fragment        50 non-null     object 
 7   subdomains      8814 non-null   object 
 8   sld             11430 non-null  object 
 9   tld             11333 non-null  object 
 10  filename        8083 non-null   object 
 11  file_extension  3047 non-null   object 
 12  directory_path  5770 non-null   object 
 13  query_params    1544 non-null   object 
 14  target          11430 non-null  int64  
dtypes: float64(1), int64(1), object(13)
memory usage: 1.3+ MB


In [29]:
def extract_url_features(df):

    final_df = df.copy()

    ## PROTOCOL FEATURES
    final_df['is_https'] = (final_df['protocol'] == 'https').astype(int)
    final_df['is_http'] = (final_df['protocol'] == 'http').astype(int)

    ## DOMAIN FEATURES
    final_df['has_subdomain'] = final_df['subdomains'].notna().astype(int)
    final_df['has_tld'] = final_df['tld'].notna().astype(int)
    final_df['num_subdomain'] = final_df['subdomains'].apply(lambda x: len(x.split('.')) if x else 0)
    # check if is IP address
    def is_ip_address(hostname):
        try:
            ipaddress.ip_address(hostname)
            return 1  
        except:
            return 0
    final_df['is_domain_ip'] = final_df['hostname'].apply(is_ip_address)
    # suspicious punctuation in domain
    final_df['num_hyphens_domain'] = final_df['hostname'].str.count('-')
    final_df['num_dots_domain'] = final_df['hostname'].str.count(r'\.')
    # detect punycode
    final_df['is_punycode'] = final_df['hostname'].str.contains('xn--', regex=False, na=False).astype(int)
    
    ## PORT FEATURES
    final_df['has_port'] = final_df['port'].notna().astype(int)

    ## PATH FEATURES
    final_df['has_path'] = final_df['path'].apply(lambda x: 1 if x and x != '/' else 0)
    final_df['path_depth'] = final_df['path'].apply(lambda x: len([p for p in x.split('/') if p]) if x else 0)
    final_df['has_directory_path'] = final_df['directory_path'].apply(lambda x: 1 if x and x != '/' else 0)

    ## FILE FEATURES
    final_df['has_filename'] = final_df['filename'].notna().astype(int)
    final_df['has_file_extension'] = final_df['file_extension'].notna().astype(int)

    ## QUERY FEATURES
    final_df['has_query'] = final_df['query'].notna().astype(int) 
    final_df['num_query_params'] = final_df['query_params'].apply(lambda x: len([p for p in x if p]) if x else 0)
    
    ## FRAGMENT FEATURES
    final_df['has_fragment'] = final_df['fragment'].notna().astype(int)

    ## LENGTH FEATURES
    final_df['length_url'] = final_df['url'].str.len()
    final_df['length_hostname'] = final_df['hostname'].str.len()
    final_df['length_tld'] = final_df['tld'].str.len()
    final_df['length_sld'] = final_df['sld'].str.len()
    final_df['length_subdomains'] = final_df['subdomains'].str.len()
    final_df['length_path'] = final_df['path'].str.len()
    final_df['length_query'] = final_df['query'].str.len()
    final_df['length_fragment'] = final_df['fragment'].str.len()

    ## PUNCTUATION FEATURES
    final_df['num_dots'] = final_df['url'].str.count(r'\.')
    final_df['num_hyphens'] = final_df['url'].str.count('-')
    final_df['num_at'] = final_df['url'].str.count('@')
    final_df['num_question_marks'] = final_df['url'].str.count(r'\?')
    final_df['num_and'] = final_df['url'].str.count('&')
    final_df['num_equal'] = final_df['url'].str.count('=')
    final_df['num_underscores'] = final_df['url'].str.count('_')    
    final_df['num_slashes'] = final_df['url'].str.count('/')
    final_df['num_percent'] = final_df['url'].str.count('%')
    final_df['num_dollars'] = final_df['url'].str.count(r'\$')
    final_df['num_colon'] = final_df['url'].str.count(':')
    final_df['num_semicolon'] = final_df['url'].str.count(';')
    final_df['num_comma'] = final_df['url'].str.count(',')
    final_df['num_hashtag'] = final_df['url'].str.count('#')
    final_df['num_tilde'] = final_df['url'].str.count('~')

    ## SUSPICIOUS PATTERNS FEATURES
    final_df['http_in_path'] = final_df['path'].str.lower().str.contains('http', regex=False, na=False).astype(int)
    final_df['tld_in_path'] = final_df['path'].apply(lambda x: 1 if x and any(ext in x.lower() for ext in ['.com', '.net', '.org']) else 0)
    final_df['tld_in_subdomain'] = final_df['subdomains'].apply(lambda x: 1 if x and any(ext in x for ext in ['.com', '.net', '.org']) else 0)
    final_df['subdomain_longer_sld'] = (final_df['length_subdomains'] > final_df['length_sld']).astype(int)
    final_df['double_slash_in_path'] = final_df['path'].str.contains('//', na=False).astype(int)

    ## RATIO FEATURES
    final_df['ratio_digits_url'] = final_df['url'].apply(lambda x: sum(c.isdigit() for c in x) / len(x) if len(x) > 0 else 0)
    final_df['ratio_digits_hostname'] = final_df['hostname'].apply(lambda x: sum(c.isdigit() for c in x) / len(x) if len(x) > 0 else 0)
    final_df['ratio_letter_url'] = final_df['url'].apply(lambda x: sum(c.isalpha() for c in x) / len(x) if len(x) > 0 else 0)
    final_df['ratio_special_char_url'] = final_df['url'].apply(lambda x: sum(not c.isalnum() and c not in ['/', ':', '.'] for c in x) / len(x) if len(x) > 0 else 0)
    # proportion of components
    final_df['ratio_path_url'] = final_df['length_path'] / final_df['length_url']
    final_df['ratio_hostname_url'] = final_df['length_hostname'] / final_df['length_url']
    
    # WORD-BASED FEATURES 
    words_raw = final_df['url'].apply(lambda x: re.findall(r'\w+', x) if x else [])
    words_host = final_df['hostname'].apply(lambda x: re.findall(r'\w+', x) if x else [])
    words_path = final_df['path'].apply(lambda x: re.findall(r'\w+', x) if x else [])
    final_df['length_words_url'] = words_raw.apply(len)
    final_df['avg_words_url'] = words_raw.apply(lambda x: np.mean([len(w) for w in x]) if x else 0)
    final_df['avg_word_hostname'] = words_host.apply(lambda x: np.mean([len(w) for w in x]) if x else 0)
    final_df['avg_word_path'] = words_path.apply(lambda x: np.mean([len(w) for w in x]) if x else 0)

    ## CHARACTER BASED FEATURES
    final_df['num_unique_chars_host'] = final_df['hostname'].apply(lambda x: len(set(x)) if x else 0)
    final_df['num_non_ascii_url'] = final_df['url'].apply(lambda x: sum(1 for c in x if ord(c) > 127) if x else 0)
    final_df['longest_repeated_char_url'] = final_df['url'].apply(lambda x: max([len(list(g)) for k, g in re.findall(r'((.)\2*)', x)]) if x else 0)
    final_df['longest_repeated_char_host'] = final_df['hostname'].apply(lambda x: max([len(list(g)) for k, g in re.findall(r'((.)\2*)', x)]) if x else 0)

    # URL SHORTENING FEATURES
    shortening_services = ['bit.ly', 'goo.gl', 'tinyurl', 't.co', 'ow.ly', 'is.gd', 'buff.ly']
    final_df['has_shortened_hostname'] = final_df['hostname'].str.lower().apply(lambda x: 1 if any(service in x for service in shortening_services) else 0)
    
    # ENTROPY FEATURES
    def calculate_entropy(domain):
        if not domain or len(domain) == 0:
            return 0
        domain_clean = re.sub(r'[^a-z]', '', domain.lower())
        if len(domain_clean) == 0:
            return 0
        char_freq = Counter(domain_clean)
        entropy = -sum((count/len(domain_clean)) * np.log2(count/len(domain_clean)) 
                      for count in char_freq.values())
        return entropy
    final_df['entropy_hostname'] = final_df['hostname'].apply(calculate_entropy)

    
    return final_df

In [30]:
df_feature_engineered = extract_url_features(df_decomposed)

In [31]:
df_feature_engineered

Unnamed: 0,url,protocol,hostname,port,path,query,fragment,subdomains,sld,tld,filename,file_extension,directory_path,query_params,target,is_https,is_http,has_subdomain,has_tld,num_subdomain,is_domain_ip,num_hyphens_domain,num_dots_domain,is_punycode,has_port,has_path,path_depth,has_directory_path,has_filename,has_file_extension,has_query,num_query_params,has_fragment,length_url,length_hostname,length_tld,length_sld,length_subdomains,length_path,length_query,length_fragment,num_dots,num_hyphens,num_at,num_question_marks,num_and,num_equal,num_underscores,num_slashes,num_percent,num_dollars,num_colon,num_semicolon,num_comma,num_hashtag,num_tilde,http_in_path,tld_in_path,tld_in_subdomain,subdomain_longer_sld,double_slash_in_path,ratio_digits_url,ratio_digits_hostname,ratio_letter_url,ratio_special_char_url,ratio_path_url,ratio_hostname_url,length_words_url,avg_words_url,avg_word_hostname,avg_word_path,num_unique_chars_host,num_non_ascii_url,longest_repeated_char_url,longest_repeated_char_host,has_shortened_hostname,entropy_hostname
0,http://www.crestonwood.com/router.php,http,www.crestonwood.com,,/router.php,,,www,crestonwood,com,router.php,php,,,0,0,1,1,1,1,0,0,2,0,0,1,1,0,1,1,0,0,0,37,19,3.0,11,3.0,11.0,,,3,0,0,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,0,0,0.000000,0.000000,0.810811,0.000000,0.297297,0.513514,6,5.000000,5.666667,4.500000,11,0,1,1,0,3.028639
1,http://shadetreetechnology.com/V4/validation/a...,http,shadetreetechnology.com,,/V4/validation/a111aedc8ae390eabcfa130e041a10a4,,,,shadetreetechnology,com,a111aedc8ae390eabcfa130e041a10a4,,V4/validation,,1,0,1,0,1,0,0,0,1,0,0,1,3,1,1,0,0,0,0,77,23,3.0,19,,47.0,,,1,0,0,0,0,0,0,5,0,0,1,0,0,0,0,0,0,0,0,0,0.220779,0.000000,0.688312,0.000000,0.610390,0.298701,6,11.666667,11.000000,14.666667,15,0,1,1,0,3.606937
2,https://support-appleld.com.secureupdate.duila...,https,support-appleld.com.secureupdate.duilawyeryork...,,/ap/89e6a3b4b063b8d/,cmd=_update&dispatch=89e6a3b4b063b8d1b&locale=_,,support-appleld.com.secureupdate,duilawyeryork,com,89e6a3b4b063b8d,,ap,"[cmd=_update, dispatch=89e6a3b4b063b8d1b, loca...",1,1,0,1,1,3,0,1,4,0,0,1,2,1,1,0,1,3,0,126,50,3.0,13,32.0,20.0,47.0,,4,1,0,1,2,3,2,5,0,0,1,0,0,0,0,0,0,1,1,0,0.150794,0.000000,0.698413,0.071429,0.158730,0.396825,15,7.266667,7.500000,8.500000,18,0,1,1,0,3.842101
3,http://rgipt.ac.in,http,rgipt.ac.in,,,,,,rgipt,ac.in,,,,,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,18,11,5.0,5,,,,,2,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0.000000,0.000000,0.722222,0.000000,,0.611111,4,3.250000,3.000000,0.000000,9,0,1,1,0,2.947703
4,http://www.iracing.com/tracks/gateway-motorspo...,http,www.iracing.com,,/tracks/gateway-motorsports-park/,,,www,iracing,com,gateway-motorsports-park,,tracks,,0,0,1,1,1,1,0,0,2,0,0,1,2,1,1,0,0,0,0,55,15,3.0,7,3.0,33.0,,,2,2,0,0,0,0,0,5,0,0,1,0,0,0,0,0,0,0,0,0,0.000000,0.000000,0.818182,0.036364,0.600000,0.272727,8,5.625000,4.333333,7.000000,10,0,1,1,0,3.026987
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11425,http://www.fontspace.com/category/blackletter,http,www.fontspace.com,,/category/blackletter,,,www,fontspace,com,blackletter,,category,,0,0,1,1,1,1,0,0,2,0,0,1,2,1,1,0,0,0,0,45,17,3.0,9,3.0,21.0,,,2,0,0,0,0,0,0,4,0,0,1,0,0,0,0,0,0,0,0,0,0.000000,0.000000,0.844444,0.000000,0.466667,0.377778,6,6.333333,5.000000,9.500000,12,0,1,1,0,3.323231
11426,http://www.budgetbots.com/server.php/Server%20...,http,www.budgetbots.com,,/server.php/Server%20update/index.php,email=USER@DOMAIN.com,,www,budgetbots,com,index.php,php,server.php/Server%20update,[email=USER@DOMAIN.com],1,0,1,1,1,1,0,0,2,0,0,1,3,1,1,1,1,1,0,84,18,3.0,10,3.0,37.0,21.0,,5,0,1,1,0,1,0,5,1,0,1,0,0,0,0,0,0,0,0,0,0.023810,0.000000,0.797619,0.047619,0.440476,0.214286,14,4.928571,5.333333,5.166667,12,0,1,1,0,3.327820
11427,https://www.facebook.com/Interactive-Televisio...,https,www.facebook.com,,/Interactive-Television-Pvt-Ltd-Group-M-100230...,ref=page_internal,,www,facebook,com,photos,,Interactive-Television-Pvt-Ltd-Group-M-1002305...,[ref=page_internal],0,1,0,1,1,1,0,0,2,0,0,1,2,1,1,0,1,1,0,105,16,3.0,8,3.0,63.0,17.0,,2,6,0,1,0,1,1,5,0,0,1,0,0,0,0,0,0,0,0,0,0.142857,0.000000,0.695238,0.085714,0.600000,0.152381,14,6.357143,4.666667,6.750000,10,0,1,1,0,2.985228
11428,http://www.mypublicdomainpictures.com/,http,www.mypublicdomainpictures.com,,/,,,www,mypublicdomainpictures,com,,,,,0,0,1,1,1,1,0,0,2,0,0,0,0,0,0,0,0,0,0,38,30,3.0,22,3.0,1.0,,,2,0,0,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,0,0,0.000000,0.000000,0.842105,0.000000,0.026316,0.789474,4,8.000000,9.333333,0.000000,18,0,1,1,0,3.913800


In [37]:
# create tree based model for random forest, do train test split
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
X = df_feature_engineered.drop(columns=['url', 'protocol', 'hostname', 'port', 'path', 'query', 'fragment', 'sld', 'filename', 'directory_path', 'query_params', 'target'])
y = df_feature_engineered['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)   

# for subdomain, tld, file_extension, keep the top 5 based on value counts, then set rest to others, then do one hot encoding
for col in ['subdomains', 'tld', 'file_extension']:
    top_5 = df_decomposed[col].value_counts().nlargest(5).index
    X_train[col] = X_train[col].apply(lambda x: x if x in top_5 else 'other')
    X_test[col] = X_test[col].apply(lambda x: x if x in top_5 else 'other')
    X_train = pd.get_dummies(X_train, columns=[col], prefix=col)
    X_test = pd.get_dummies(X_test, columns=[col], prefix=col)

# fit randm forest and evaluate the rocauc
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]    

from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC AUC: {roc_auc:.4f}')

ROC AUC: 0.9702
