# Dataset Generation: SAFE / WARNING / ALERT URLs

Python 3.11.14 compatible

This notebook:
- Loads phishing URLs from a CSV file (ALERT)
- Builds a list of benign URLs from well-known domains
- Labels benign URLs as SAFE or WARNING based on missing HTTP security headers
- Generates a final CSV dataset

## Install Dependencies

In [1]:

!pip install requests pandas python-whois




## Imports

In [2]:

import csv
import requests
import pandas as pd
from urllib.parse import urlparse


## Load Phishing URLs (ALERT Class)
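Place a CSV of phishing URLs named `phishing_urls.csv` in the notebook's working directory. The URL column must be named `url`, `phish_url`, or `phishing_url` (case-insensitive). Only the first 300 non-null entries are used.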

In [5]:
from pathlib import Path
import pandas as pd

print("Notebook cwd:", Path.cwd())

PHISHING_INPUT = Path("phishing_urls.csv")  # exact file in your workspace
if not PHISHING_INPUT.exists():
    raise FileNotFoundError(f"Could not find: {PHISHING_INPUT.resolve()}")

df_phish = pd.read_csv(PHISHING_INPUT)

# Locate the URL column (case-insensitive match on common column names)
url_col = next(
    (c for c in df_phish.columns if c.strip().lower() in {"url", "phish_url", "phishing_url"}),
    None
)
if url_col is None:
    raise ValueError(f"No URL column found. Columns: {list(df_phish.columns)}")

phish_urls = df_phish[url_col].dropna().head(300).tolist()
print(f"Loaded {len(phish_urls)} phishing URLs from {PHISHING_INPUT}")


Notebook cwd: c:\Project_Research\bearl1
Loaded 300 phishing URLs from phishing_urls.csv
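
The loading cell takes the first 300 non-null values as they appear in the CSV. If the feed contains duplicates or malformed rows, a cleaning step along the following lines could run before the `head(300)` cut. This is a minimal sketch, not part of the original notebook: `clean_urls` is a hypothetical helper and the `example.test` URLs are placeholders.

```python
from urllib.parse import urlparse

def clean_urls(raw_urls):
    """Strip whitespace, keep only http(s) URLs with a host, de-duplicate in order."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        url = str(raw).strip()
        try:
            parts = urlparse(url)
        except ValueError:
            # urlparse rejects some malformed inputs (e.g. unbalanced IPv6 brackets)
            continue
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned

sample = [
    " http://example.test/login ",
    "http://example.test/login",   # duplicate once whitespace is stripped
    "not a url",
    "https://example.test/",
]
print(clean_urls(sample))
```

In the notebook this would be applied as `clean_urls(df_phish[url_col].dropna())[:300]`, assuming exact string matching is an acceptable definition of a duplicate.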


## Load Benign URLs (SAFE / WARNING Class)
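Benign URLs can come from a ranked list such as Tranco or from a manual list (the Alexa ranking has been discontinued). This notebook uses a manual list of 30 well-known domains combined with 10 common paths, giving 300 URLs. The paths are synthetic, so some of them may not exist on a given site.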

In [8]:
import pandas as pd

# 30 base domains x 10 paths = 300 benign URLs (expandable)
base_domains = [
    "https://www.google.com", "https://www.microsoft.com", "https://www.apple.com",
    "https://www.amazon.com", "https://www.wikipedia.org", "https://www.github.com",
    "https://www.stackoverflow.com", "https://www.nytimes.com", "https://www.bbc.com",
    "https://www.cnn.com", "https://www.reddit.com", "https://www.linkedin.com",
    "https://www.yahoo.com", "https://www.weather.com", "https://www.imdb.com",
    "https://www.netflix.com", "https://www.paypal.com", "https://www.adobe.com",
    "https://www.dropbox.com", "https://www.zoom.us", "https://www.salesforce.com",
    "https://www.walmart.com", "https://www.target.com", "https://www.costco.com",
    "https://www.irs.gov", "https://www.usa.gov", "https://www.nih.gov",
    "https://www.nasa.gov", "https://www.cdc.gov", "https://www.un.org"
]

paths = ["", "/about", "/contact", "/help", "/support", "/privacy", "/terms", "/login", "/products", "/services"]
safe_urls = [d + p for d in base_domains for p in paths][:300]

df_safe = pd.DataFrame({"url": safe_urls})
df_safe.to_csv("safe_urls.csv", index=False)

print(f"Created safe_urls.csv with {len(df_safe)} URLs")


Created safe_urls.csv with 300 URLs
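
As an alternative to the manual list, benign URLs could be built from a ranked domain list. The sketch below assumes a headerless two-column `rank,domain` CSV, which is the layout Tranco list downloads use. `urls_from_ranked_csv` is a hypothetical helper, and the in-memory demo data stands in for a real downloaded file.

```python
import csv
import io

def urls_from_ranked_csv(fileobj, limit=300, scheme="https"):
    """Read rank,domain rows and return up to `limit` URLs in rank order."""
    urls = []
    for row in csv.reader(fileobj):
        # Skip blank or malformed rows
        if len(row) < 2 or not row[1].strip():
            continue
        urls.append(f"{scheme}://{row[1].strip()}")
        if len(urls) >= limit:
            break
    return urls

# Demo with in-memory data instead of a downloaded list
demo = io.StringIO("1,example.com\n2,example.org\n3,example.net\n")
demo_urls = urls_from_ranked_csv(demo, limit=2)
print(demo_urls)
```

With a real file, the call would look like `urls_from_ranked_csv(open(path, newline="", encoding="utf-8"))`, where `path` points to the downloaded list.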


## HTTP Header Check Logic

In [9]:

REQUIRED_HEADERS = [
    "Content-Security-Policy",
    "X-Frame-Options",
    "X-Content-Type-Options",
    "Strict-Transport-Security"
]

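def missing_headers(url):
    """Return the REQUIRED_HEADERS absent from the final response for url."""
    try:
        r = requests.get(url, timeout=10, allow_redirects=True)
        # r.headers is case-insensitive, so lower-case header names also match
        return [h for h in REQUIRED_HEADERS if h not in r.headers]
    except Exception:
        # A failed request counts as missing every header (labelled WARNING).
        # Return a copy so callers cannot mutate REQUIRED_HEADERS.
        return list(REQUIRED_HEADERS)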
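
The labelling rule can be tried without network access by applying it to a plain dictionary of response headers. The sketch below is illustrative only: `missing_from` and `label_for` are hypothetical names, and the lower-casing step stands in for the case-insensitive lookup that `requests` provides on `r.headers`.

```python
REQUIRED_HEADERS = [
    "Content-Security-Policy",
    "X-Frame-Options",
    "X-Content-Type-Options",
    "Strict-Transport-Security",
]

def missing_from(headers):
    """Return required headers absent from a header mapping, ignoring case."""
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]

def label_for(headers):
    """SAFE only if every required header is present, otherwise WARNING."""
    return "SAFE" if not missing_from(headers) else "WARNING"

complete = {
    "content-security-policy": "default-src 'self'",
    "x-frame-options": "DENY",
    "x-content-type-options": "nosniff",
    "strict-transport-security": "max-age=31536000",
}
partial = {"X-Frame-Options": "DENY"}

print(label_for(complete), label_for(partial))
print(missing_from(partial))
```

The rule checks only for presence. A header with a weak value, such as a very permissive Content-Security-Policy, still counts as present.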


## Dataset Generation

In [10]:

dataset = []

# ALERT → phishing
for u in phish_urls:
    dataset.append({
        "url": u,
        "label": "ALERT"
    })

# SAFE / WARNING → benign
for u in safe_urls:
    missing = missing_headers(u)
    label = "SAFE" if len(missing) == 0 else "WARNING"

    dataset.append({
        "url": u,
        "label": label
    })

print("Total dataset size:", len(dataset))


Total dataset size: 600
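
The loop above checks the benign URLs one at a time, so slow or unreachable hosts can each add up to the 10-second timeout. One possible way to shorten the run is a thread pool, sketched below. `label_urls` and `fake_check` are hypothetical names. In the notebook, the `check` argument would be `missing_headers`, and the worker count of 16 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def label_urls(urls, check, max_workers=16):
    """Run check(url) -> list of missing headers concurrently; keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        results = list(pool.map(check, urls))
    return [
        {"url": u, "label": "SAFE" if not missing else "WARNING"}
        for u, missing in zip(urls, results)
    ]

def fake_check(url):
    """Stand-in for missing_headers so the sketch runs offline."""
    return [] if url.endswith("/ok") else ["Content-Security-Policy"]

rows = label_urls(["https://example.test/ok", "https://example.test/bad"], fake_check)
print(rows)
```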


## Save Final Dataset

In [11]:

OUTPUT_FILE = "url_dataset_safe_warning_alert.csv"

with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "label"])
    writer.writeheader()
    writer.writerows(dataset)

print("Dataset saved as:", OUTPUT_FILE)




## Label Distribution

In [12]:

df_final = pd.DataFrame(dataset)
print(df_final["label"].value_counts())


label
ALERT      300
WARNING    206
SAFE        94
Name: count, dtype: int64


## Notes (For Thesis / Review)

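- ALERT labels mark URLs taken from the input phishing list (`phishing_urls.csv`); the notebook does not re-verify them
- WARNING labels mark benign URLs whose response lacks at least one of the four OWASP-recommended security headers checked, or whose request failed
- SAFE labels mark benign URLs whose response includes all four headers
- Labels follow fixed rules and are therefore explainable; SAFE / WARNING labels depend on live HTTP responses, so a re-run at a later date may produce different counts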
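
To make each WARNING label traceable to the headers that triggered it, the dataset could carry one extra column. The sketch below assumes a hypothetical `missing_headers` column and a hypothetical `build_row` helper. The original notebook writes only `url` and `label`.

```python
import csv
import io

def build_row(url, label, missing=()):
    """Dataset row with the missing headers joined into one ';'-separated field."""
    return {"url": url, "label": label, "missing_headers": ";".join(missing)}

rows = [
    build_row("http://phish.example.test/", "ALERT"),
    build_row("https://example.test/", "WARNING",
              ["X-Frame-Options", "Content-Security-Policy"]),
    build_row("https://example.org/", "SAFE"),
]

# Write to an in-memory buffer; the notebook would write to OUTPUT_FILE instead
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["url", "label", "missing_headers"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```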
