# Hypothesis

Phishing websites have distinctive features and patterns that allow a model to distinguish them from legitimate websites. My goal is to demonstrate that a supervised model can be trained to do this with relative ease.

## Data Source

- **Phishing Websites Dataset**  
    [Phishing Websites](https://archive.ics.uci.edu/dataset/327/phishing+websites)  
    Mohammad, R. & McCluskey, L. (2012). *Phishing Websites [Dataset]*. UCI Machine Learning Repository. [https://doi.org/10.24432/C51W2X](https://doi.org/10.24432/C51W2X). Related paper: [Semantic Scholar entry](https://www.semanticscholar.org/paper/An-assessment-of-features-related-to-phishing-using-Mohammad-Thabtah/0c0ff58063f4e078714ea74f112bc709ba9fed06).

- **PhiUSIIL Phishing URL (Website) Dataset**  
    [PhiUSIIL Phishing URL Dataset](https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset)  
    Prasad, A. & Chandra, S. (2024). *PhiUSIIL Phishing URL (Website) [Dataset]*. UCI Machine Learning Repository. [https://doi.org/10.1016/j.cose.2023.103545](https://doi.org/10.1016/j.cose.2023.103545).

# Process Overview

1. **Model Creation**  
    I will design and implement my own supervised machine learning model to classify phishing websites. The model will be trained using the datasets mentioned above.

2. **Outcome Analysis**  
    After training the model, I will analyze its performance using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score. Visualizations will also be used to better understand the model's predictions.

3. **Comparison with Published Results**  
    The performance of my model will be compared with the predictions and results reported in the two referenced papers:
    - *Phishing Websites Dataset* by Mohammad & McCluskey (2012)
    - *PhiUSIIL Phishing URL Dataset* by Prasad & Chandra (2024)

    This comparison will help assess the effectiveness of my model and identify areas for improvement.
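
The metrics named in step 2 can all be computed from confusion-matrix counts. The sketch below is a minimal, library-free illustration rather than the evaluation code used later; the function name `classification_metrics`, the choice of positive class, and the toy labels are assumptions made for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (not taken from either dataset).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred, positive=1)
print(metrics)
```

Which class counts as "positive" changes precision and recall, so it is worth stating that choice explicitly when comparing against published numbers.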

In [None]:

# Install the ucimlrepo package if it is not already available (uncomment to run)

# !pip install ucimlrepo




In [None]:
from ucimlrepo import fetch_ucirepo

# Fetch the Phishing Websites dataset (UCI id 327)
phishing_websites = fetch_ucirepo(id=327)

# Fetch the PhiUSIIL Phishing URL (Website) dataset (UCI id 967)
phiusiil_phishing_url_website = fetch_ucirepo(id=967)
    


## Data Exploration

In [27]:
# data (as pandas dataframes) 
X1 = phiusiil_phishing_url_website.data.features 
y1 = phiusiil_phishing_url_website.data.targets 
  
# metadata 
print(phiusiil_phishing_url_website.metadata) 
  
# variable information 
print(X1.columns)

print(f"Number of rows in X1: {len(X1)}")
print(f"Number of rows in y1: {len(y1)}")

{'uci_id': 967, 'name': 'PhiUSIIL Phishing URL (Website)', 'repository_url': 'https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/967/data.csv', 'abstract': 'PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed, while constructing the dataset, are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 235795, 'num_features': 54, 'feature_types': ['Real', 'Categorical', 'Integer'], 'demographics': [], 'target_col': ['label'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2024, 'last_updated': 'Sun May 12 

In [37]:
print(y1['label'].unique())

[1 0]


## Targets 
    1 = legitimate URL
    0 = phishing URL

In [36]:
X1.head()

| | URL | URLLength | Domain | DomainLength | IsDomainIP | TLD | URLSimilarityIndex | CharContinuationRate | TLDLegitimateProb | URLCharProb | ... | Bank | Pay | Crypto | HasCopyrightInfo | NoOfImage | NoOfCSS | NoOfJS | NoOfSelfRef | NoOfEmptyRef | NoOfExternalRef |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.southbankmosaics.com | 31 | www.southbankmosaics.com | 24 | 0 | com | 100.0 | 1.0 | 0.522907 | 0.061933 | ... | 1 | 0 | 0 | 1 | 34 | 20 | 28 | 119 | 0 | 124 |
| 1 | https://www.uni-mainz.de | 23 | www.uni-mainz.de | 16 | 0 | de | 100.0 | 0.666667 | 0.03265 | 0.050207 | ... | 0 | 0 | 0 | 1 | 50 | 9 | 8 | 39 | 0 | 217 |
| 2 | https://www.voicefmradio.co.uk | 29 | www.voicefmradio.co.uk | 22 | 0 | uk | 100.0 | 0.866667 | 0.028555 | 0.064129 | ... | 0 | 0 | 0 | 1 | 10 | 2 | 7 | 42 | 2 | 5 |
| 3 | https://www.sfnmjournal.com | 26 | www.sfnmjournal.com | 19 | 0 | com | 100.0 | 1.0 | 0.522907 | 0.057606 | ... | 0 | 1 | 1 | 1 | 3 | 27 | 15 | 22 | 1 | 31 |
| 4 | https://www.rewildingargentina.org | 33 | www.rewildingargentina.org | 26 | 0 | org | 100.0 | 1.0 | 0.079963 | 0.059441 | ... | 1 | 1 | 0 | 1 | 244 | 15 | 34 | 72 | 1 | 85 |


## From dataset (page 11 of paper):
5.3. Dataset validation

- URL verification: All the URLs in the dataset are collected from valid sources and included in the dataset for verification.
- Null value: There is no null value in the dataset.
- Missing value: There is no missing value in the dataset.
- Duplicate records: All the records in the dataset are unique.
- Zero variance feature: There is no feature in the dataset with identical data values.
- Infinite values: There is no positive or negative infinite value in the dataset.
- Class imbalance: The dataset has 57% legitimate URLs and 43% phishing URLs that do not indicate a disproportionate distribution of class labels.


### Verifying the data 

In [40]:
# Check for null values
null_values_X1 = X1.isnull().sum().sum()

# Check for duplicate records
duplicates_X1 = X1.duplicated().sum()

# Check for zero variance features
zero_variance_X1 = (X1.nunique() == 1).sum()

# Check for class imbalance
class_distribution_y1 = y1['label'].value_counts(normalize=True)

# Print results
print(f"Null values in X1: {null_values_X1}")
print(f"Duplicate records in X1: {duplicates_X1}")
print(f"Zero variance features in X1: {zero_variance_X1}")
print(f"Class distribution in y1:\n{class_distribution_y1}")


Null values in X1: 0
Duplicate records in X1: 0
Zero variance features in X1: 0 
Class distribution in y1:
label
1    0.571895
0    0.428105
Name: proportion, dtype: float64
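
The paper's checklist also mentions infinite values, which the verification cell above does not cover. The following is a minimal sketch of such a check, assuming `numpy` and `pandas` are available; the helper name `count_infinite` and the toy frame are hypothetical, and in this notebook the function would be called on `X1`.

```python
import numpy as np
import pandas as pd


def count_infinite(df):
    """Count +inf/-inf entries across the numeric columns of a DataFrame."""
    # Non-numeric columns (URL, Domain, TLD, ...) cannot hold infinities.
    numeric = df.select_dtypes(include="number")
    return int(np.isinf(numeric.to_numpy(dtype=float)).sum())


# Toy frame for illustration only; not data from the dataset.
toy = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [0, 1, 2],
    "c": ["x", "y", "z"],
})
n_infinite = count_infinite(toy)
print(f"Infinite values in toy frame: {n_infinite}")
```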


## *Note:*
The dataset has undergone significant preprocessing to extract meaningful features from raw data, such as the `Domain` column. Instead of using the domain as a single raw string, it has been split into various derived features to provide a more granular representation. For example:

1. **`NoOfSubDomain`**: Represents the number of subdomains in the URL, derived from the structure of the domain.
2. **`NoOfObfuscatedChar`**: Counts the number of obfuscated characters in the domain, which can indicate phishing attempts.
3. **`IsHTTPS`**: Indicates whether the URL uses HTTPS, a feature derived from the protocol in the URL.
4. **`NoOfDegitsInURL`**: Counts the number of digits in the URL, which can be a sign of obfuscation or phishing.
5. **`NoOfEqualsInURL`, `NoOfQMarkInURL`, `NoOfAmpersandInURL`**: Count the occurrences of specific special characters (`=`, `?`, `&`) in the URL, which are often used in query strings or obfuscation.

These derived features provide a structured and standardized representation of the domain, making it easier for machine learning models to identify patterns and detect phishing attempts.  
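
To make the idea concrete, the sketch below derives a few features of this kind from a raw URL using only the standard library. It is an illustration under stated assumptions, not the dataset authors' extraction code: the function name `url_features` is hypothetical, and the subdomain count uses a naive rule (everything left of the last two host labels) that miscounts multi-part suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def url_features(url):
    """Derive a few URL-level features similar in spirit to the dataset's."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = [part for part in host.split(".") if part]
    return {
        # Assumption: labels left of the last two are subdomains.
        "NoOfSubDomain": max(len(labels) - 2, 0),
        "IsHTTPS": int(parsed.scheme == "https"),
        # Column name spelled as in the dataset.
        "NoOfDegitsInURL": sum(ch.isdigit() for ch in url),
        "NoOfEqualsInURL": url.count("="),
        "NoOfQMarkInURL": url.count("?"),
        "NoOfAmpersandInURL": url.count("&"),
    }


# Made-up URL for illustration only.
example = url_features("https://login.example.com/verify?id=123&user=abc")
print(example)
```

A production version would likely resolve the registrable domain with a public-suffix list before counting subdomains.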

Similarly, **`TLD`** is used to derive `TLDLegitimateProb`, the relative frequency of the TLD across 10 million websites. A higher `TLDLegitimateProb` may indicate a legitimate URL, and a lower value may help identify phishing URLs.
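
A frequency ratio of this kind can be sketched as follows. This is only an approximation of the idea: the reference list here is a handful of made-up domains standing in for a large site ranking, the function name `tld_probabilities` is hypothetical, and the authors' exact normalization is not reproduced.

```python
from collections import Counter


def tld_probabilities(domains):
    """Map each TLD to its share of the given reference domains."""
    tlds = [d.rsplit(".", 1)[-1].lower() for d in domains if "." in d]
    counts = Counter(tlds)
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}


# Hypothetical reference list standing in for a large site ranking.
reference = ["a.com", "b.com", "c.org", "d.com", "e.de"]
probs = tld_probabilities(reference)
print(probs)

# A TLD never seen in the reference list gets probability 0.
print(probs.get("xyz", 0.0))
```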

------------------

However, for **`Title`**, I believe more useful information can be extracted.

Currently the dataset derives *URLTitleMatchScore* from **`Title`** to measure the discrepancy between the URL and the webpage title. A lower score means the webpage title does not match the URL well, which can be a sign of a phishing website, and vice versa.
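
As a rough illustration of what such a score captures, the sketch below measures the share of title words that appear in the URL's host name. This is a simplified stand-in and not the algorithm from the paper; the function name `title_match_score` and both example titles are made up for this example.

```python
import re
from urllib.parse import urlparse


def title_match_score(url, title):
    """Percentage of title words that occur in the URL's host name."""
    host = (urlparse(url).hostname or "").lower()
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in host)
    return 100.0 * hits / len(words)


# Hypothetical titles; the real values come from the Title column.
score_match = title_match_score(
    "https://www.southbankmosaics.com", "Southbank Mosaics")
score_mismatch = title_match_score(
    "http://secure-login.example.net", "PayPal Account Update")
print(score_match, score_mismatch)
```

Substring matching is deliberately crude here: short title words can match by accident, so a real implementation would need tokenization of the host name as well.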



**However, the dataset authors do not assess other characteristics of the title.**




## Extra Encoding/Pre-processing Steps for Feature **`Title`**

#### The code below adds new derived features using Sentiment Analysis, POS Tag Counts and Word Embedding Averages

# Implementation 1 #
### Using All Features with a Random Forest Classifier

In [42]:
# Checking number of unique Domain and TLD values  
unique_domains = X1['Domain'].nunique()
unique_tlds = X1['TLD'].nunique()
print(f"Number of unique domains: {unique_domains}")
print(f"Number of unique TLDs: {unique_tlds}")


Number of unique domains: 220086
Number of unique TLDs: 695
