# QML for Cybersecurity: URL Classification with QSVM (Qiskit)

This notebook walks through a tiny, end-to-end experiment:

1. **Setup**: Create a Python virtual environment and install Qiskit, scikit-learn, and the other dependencies.
2. **Data**: Load a small URL dataset (malicious vs benign). A sample `urls_sample.csv` is provided.
3. **Features**: Extract simple lexical features from URLs.
4. **Baseline**: Train a classical SVM.
5. **QML**: Train a QSVM using a quantum kernel (simulated backend).
6. **Compare**: Evaluate and discuss.

> **Note**: This notebook runs on a **simulator** (no quantum hardware). The goal is pedagogy: showing how quantum kernels slot into a familiar ML workflow.

## 0) Environment Setup

Run these commands in your shell **once** (outside the notebook) to prepare a virtual environment:

```bash
python3 -m venv qml_env
source qml_env/bin/activate
pip install --upgrade pip
pip install qiskit qiskit-aer qiskit-machine-learning scikit-learn pandas numpy matplotlib
```

If you're in WSL, do the above inside Ubuntu. Then start Jupyter (e.g., `pip install notebook` and `jupyter notebook`).

In [None]:
# 1) Imports
import os, re, math
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

import inspect

try:
    from qiskit_aer import AerSimulator
except ImportError:
    AerSimulator = None

if 'AerSimulator' not in globals() or AerSimulator is None:
    try:
        from qiskit.providers.aer import AerSimulator  # pragma: no cover
    except ImportError:
        AerSimulator = None

try:
    from qiskit import Aer  # legacy fallback
except ImportError:
    Aer = None

from qiskit.circuit.library import ZZFeatureMap

try:
    from qiskit_machine_learning.kernels import FidelityQuantumKernel
except ImportError:
    from qiskit_machine_learning.kernels import QuantumKernel as FidelityQuantumKernel

print("Ready.")


## 1) Load Data
If you have your own CSV of URLs, set `CSV_PATH` to that file. Otherwise, use the provided sample `urls_sample.csv` (two columns: `url`, `label`).

In [None]:
CSV_PATH = 'urls_sample.csv'  # change to your path if needed
if not Path(CSV_PATH).exists():
    raise FileNotFoundError("CSV not found. Place your dataset or the provided urls_sample.csv next to this notebook.")
df = pd.read_csv(CSV_PATH)
df.head()
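
Before extracting features, it can help to confirm that the file really has the two expected columns and to look at the class balance. The following is a minimal sketch, not part of the original notebook: `check_url_frame` is a hypothetical helper, the three demo rows are made up for illustration, and the `0`/`1` label encoding is an assumption (your CSV may use strings instead).

```python
import io
import pandas as pd

def check_url_frame(frame: pd.DataFrame) -> pd.Series:
    """Validate the expected columns and return the per-label row counts."""
    missing = {"url", "label"} - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if frame["url"].isna().any() or frame["label"].isna().any():
        raise ValueError("found rows with a missing url or label")
    return frame["label"].value_counts()

# Made-up demo rows (assumption: 0 = benign, 1 = malicious).
demo_csv = io.StringIO(
    "url,label\n"
    "https://example.com/docs,0\n"
    "http://example.org/login-verify,1\n"
    "https://example.net/,0\n"
)
demo_df = pd.read_csv(demo_csv)
counts = check_url_frame(demo_df)
print(counts)
```

In the notebook itself you would call `check_url_frame(df)` on the loaded frame. A heavily skewed count is worth noting before reading any accuracy figures later on.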

## 2) Feature Extraction
We'll compute eight simple lexical features from each URL. The classical SVM uses all eight. The quantum kernel later uses only a subset of `d` of them (4 to 6), because the feature dimension equals the number of qubits and simulation cost grows quickly with the qubit count.

In [None]:
SUSPICIOUS_TLDS = {'.ru','.tk','.top','.xyz','.zip','.click','.gq','.cn','.pw','.work','.cf','.ga','.ml'}
SUSPICIOUS_WORDS = {'login','verify','update','billing','gift','account','reset','secure','wallet','airdrop','claim','prize','invoice','urgent','paypal','bank','office365','security'}

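def is_ip_domain(url: str) -> int:
    # Crude check: does the host part look like a dotted-quad IPv4 address?
    m = re.search(r"://([^/?#]+)", url)
    if not m:
        return 0
    # Drop any userinfo ("user@") and port (":8080") so only the host is tested.
    host = m.group(1).rsplit('@', 1)[-1].split(':', 1)[0]
    return int(bool(re.match(r"^(?:\d{1,3}\.){3}\d{1,3}$", host)))

def tld_flag(url: str) -> int:
    m = re.search(r"://([^/?#]+)", url)
    if not m:
        return 0
    # Drop any userinfo ("user@") and port (":8080") before checking the suffix.
    host = m.group(1).lower().rsplit('@', 1)[-1].split(':', 1)[0]
    return int(any(host.endswith(t) for t in SUSPICIOUS_TLDS))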

def count_chars(url: str, ch: str) -> int:
    return url.count(ch)

def contains_words(url: str, words: set) -> int:
    lower = url.lower()
    return int(any(w in lower for w in words))

def has_https(url: str) -> int:
    return int(url.lower().startswith('https://'))

def url_features(url: str):
    return [
        len(url),                              # 0 length
        sum(c.isdigit() for c in url),        # 1 digits
        count_chars(url, '.'),                # 2 dots
        count_chars(url, '-'),                # 3 hyphens
        has_https(url),                       # 4 https flag
        is_ip_domain(url),                    # 5 ip-as-domain
        tld_flag(url),                        # 6 suspicious tld
        contains_words(url, SUSPICIOUS_WORDS) # 7 suspicious tokens
    ]

X = np.array([url_features(u) for u in df['url']])
y = df['label'].values
X.shape, y.shape
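
The regex-based host extraction in the cell above is deliberately crude. As a sketch of an alternative, the standard library's `urllib.parse` and `ipaddress` modules can do the parsing. The helper names `host_of` and `is_ipv4_host` are hypothetical, and treating scheme-less inputs as bare hosts is an assumption about how such rows should be handled.

```python
from urllib.parse import urlsplit
import ipaddress

def host_of(url: str) -> str:
    """Return the lowercase hostname without port or userinfo ('' if none)."""
    # urlsplit only recognises the host when a scheme or a leading '//' is
    # present, so prepend '//' for bare inputs such as 'example.com/path'
    # (assumption: scheme-less rows start with the host).
    if "://" not in url:
        url = "//" + url
    try:
        return (urlsplit(url).hostname or "").lower()
    except ValueError:
        return ""  # malformed URL, e.g. an unbalanced IPv6 bracket

def is_ipv4_host(url: str) -> int:
    """1 if the host parses as an IPv4 address, else 0."""
    try:
        ipaddress.IPv4Address(host_of(url))
        return 1
    except ValueError:
        return 0

print(host_of("http://user@evil.ru:8080/x"))
print(is_ipv4_host("http://192.168.0.1:8080/login"))
print(is_ipv4_host("http://1.2.3.45abc.com/"))
```

Swapping these in for the regex helpers would change some feature values (for example, URLs without a scheme would no longer score 0 automatically), so compare results before and after if you try it.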

## 3) Train/Test Split and Scaling
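We'll standardize the features (zero mean, unit variance) for the classical SVM. The scaler is fit on the training split only, so no statistics from the test split leak into training. The QSVM uses the same scaled features so both models see inputs in comparable ranges.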

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)
X_train_s.shape
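
One thing to keep in mind for the quantum section: `ZZFeatureMap` uses feature values as gate angles, so its output is periodic in each input, and standardized values far from zero can wrap around. A common alternative is to rescale each feature into a fixed angle range. The sketch below does this with NumPy; `fit_angle_scaler` is a hypothetical helper, the `[0, π]` range is an assumption rather than a recommendation from the original notebook, and the random arrays stand in for real features.

```python
import numpy as np

def fit_angle_scaler(X_train, low=0.0, high=np.pi):
    """Learn per-column min/max on the training data and return a transform."""
    X_train = np.asarray(X_train, dtype=float)
    x_min = X_train.min(axis=0)
    span = X_train.max(axis=0) - x_min
    span[span == 0] = 1.0  # constant column: avoid division by zero

    def transform(X):
        unit = (np.asarray(X, dtype=float) - x_min) / span
        # Values outside the training range are clipped into [low, high].
        return low + np.clip(unit, 0.0, 1.0) * (high - low)

    return transform

# Stand-in data; in the notebook you would pass X_train and X_test instead.
rng = np.random.default_rng(0)
demo_train = rng.normal(size=(20, 4))
demo_test = rng.normal(size=(5, 4)) * 3

to_angles = fit_angle_scaler(demo_train)
A_train = to_angles(demo_train)
A_test = to_angles(demo_test)
print(A_train.min(), A_train.max(), A_test.min(), A_test.max())
```

Whether this helps on a given dataset is an empirical question; it is worth trying both scalings and comparing.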

## 4) Classical Baseline (SVM)
We start with a strong classical baseline: RBF-kernel SVM.

In [None]:
svm = SVC(kernel='rbf', gamma='scale')
svm.fit(X_train_s, y_train)
pred = svm.predict(X_test_s)
print("Classical SVM accuracy:", (pred == y_test).mean())
print(classification_report(y_test, pred, digits=4))
print(confusion_matrix(y_test, pred))

## 5) Quantum Kernel + QSVM (Simulated)

We restrict the quantum kernel to a small number of features to keep the simulator fast. The cell below takes the first `d` columns of the scaled feature matrix (no feature ranking is applied). Try `d = 4` or `d = 6` and observe runtime vs. performance.
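
For reference, a fidelity-based quantum kernel is usually written as the squared overlap between two encoded states. The notation below is introduced here for illustration: $U_\phi(x)$ stands for the feature-map circuit on $d$ qubits, and $x_i$, $x_j$ are feature vectors.

$$
k(x, x') = \left| \langle \phi(x) \mid \phi(x') \rangle \right|^2,
\qquad
|\phi(x)\rangle = U_\phi(x)\,|0\rangle^{\otimes d}
$$

The kernel matrix passed to the SVM then has entries $K_{ij} = k(x_i, x_j)$. Since $k(x, x) = 1$ for any normalized state, its diagonal is all ones in the noiseless case.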

In [None]:
d = 4  # number of features/qubits for the quantum feature map; try 4 or 6
cols = list(range(d))  # take first d features; try other subsets for curiosity
Xtr_d = X_train_s[:, cols]
Xte_d = X_test_s[:, cols]

if AerSimulator is not None:
    backend = AerSimulator()
elif Aer is not None:
    backend = Aer.get_backend('aer_simulator')
else:
    raise ImportError('Qiskit Aer is not available. Install qiskit-aer to run the quantum section.')

feature_map = ZZFeatureMap(feature_dimension=d, reps=2)  # try reps=1..3 for depth tradeoffs
# The kernel constructor differs across qiskit-machine-learning versions, so inspect it.
kernel_params = inspect.signature(FidelityQuantumKernel.__init__).parameters
if 'backend' in kernel_params:
    qkernel = FidelityQuantumKernel(feature_map=feature_map, backend=backend)
elif 'quantum_instance' in kernel_params:
    # Legacy QuantumKernel: pass the backend itself. A bare wrapper object lacks the
    # methods a real QuantumInstance provides and would fail during evaluate().
    qkernel = FidelityQuantumKernel(feature_map=feature_map, quantum_instance=backend)
else:
    # Current FidelityQuantumKernel takes no backend argument; it uses its default
    # fidelity computation, so `backend` above goes unused on this path.
    qkernel = FidelityQuantumKernel(feature_map=feature_map)

Ktr = qkernel.evaluate(Xtr_d)
Kte = qkernel.evaluate(Xte_d, Xtr_d)

qsvm = SVC(kernel='precomputed')
qsvm.fit(Ktr, y_train)
qpred = qsvm.predict(Kte)
print('QSVM accuracy:', (qpred == y_test).mean())
print(classification_report(y_test, qpred, digits=4))
print(confusion_matrix(y_test, qpred))
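
To make the shapes of `Ktr` and `Kte` concrete without running Qiskit, here is a small NumPy sketch of a fidelity kernel. It deliberately swaps in a much simpler feature map than `ZZFeatureMap`: one `RY(x_i)` rotation per qubit with no entangling gates, so the numbers will not match the cell above. The function names and random toy data are hypothetical.

```python
import numpy as np

def product_state(x):
    """Statevector of RY(x_1)|0>, ..., RY(x_d)|0> combined by tensor product."""
    state = np.array([1.0])
    for angle in x:
        qubit = np.array([np.cos(angle / 2), np.sin(angle / 2)])  # RY(angle)|0>
        state = np.kron(state, qubit)
    return state

def fidelity_kernel(XA, XB):
    """K[i, j] = |<phi(XA[i]) | phi(XB[j])>|^2."""
    SA = np.array([product_state(x) for x in XA])
    SB = np.array([product_state(x) for x in XB])
    return np.abs(SA @ SB.conj().T) ** 2

rng = np.random.default_rng(1)
toy_train = rng.uniform(0, np.pi, size=(6, 3))  # 6 samples, 3 features/qubits
toy_test = rng.uniform(0, np.pi, size=(2, 3))   # 2 samples

K_train = fidelity_kernel(toy_train, toy_train)  # shape (6, 6), symmetric
K_test = fidelity_kernel(toy_test, toy_train)    # shape (2, 6): rows = test, cols = train
print(K_train.shape, K_test.shape)
```

This mirrors what `SVC(kernel='precomputed')` expects: a square train-by-train matrix for `fit`, and a test-by-train matrix for `predict`. For this toy map each entry reduces to a product of `cos²((x_i - x'_i) / 2)` terms, which is one reason a map without entanglement is easy to compute classically.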


## 6) Compare & Reflect

- How do accuracies compare?
- How does runtime scale as you increase `d` (qubits) or `reps` (circuit depth)?
- Are there particular features that help QSVM more than classical SVM?
- What would change if we used a real noisy backend instead of a simulator?
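
For the runtime question, a rough way to reason about scaling is to count kernel entries. The sketch below assumes the training matrix is symmetric with a known diagonal of ones, so only the entries above the diagonal need evaluating, while every test-by-train entry is needed. `kernel_entry_counts` is a hypothetical helper, and a given library version may evaluate more or fewer entries than this.

```python
def kernel_entry_counts(n_train: int, n_test: int) -> dict:
    """Count pairwise kernel entries under the symmetry assumption above."""
    train_pairs = n_train * (n_train - 1) // 2  # upper triangle, no diagonal
    test_pairs = n_test * n_train               # full rectangular block
    return {"train": train_pairs, "test": test_pairs, "total": train_pairs + test_pairs}

for n in (40, 80, 160):
    print(n, kernel_entry_counts(n_train=n, n_test=n // 4))
```

Doubling the training set roughly quadruples the number of training entries, independently of how expensive each circuit is. Increasing `d` or `reps` raises the cost per entry instead.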

## 7) Extensions / Homework

1. Try different feature subsets instead of the first `d` columns (e.g., choose `[0,2,3,7]`).
2. Add n-gram or character-based features (but keep `d` small for QSVM runs).
3. Swap the feature map (e.g., `PauliFeatureMap`).
4. Try a variational classifier (VQC) with a small ansatz (may be slower).
5. Replace the dataset with a real malicious URL corpus and compare trends.
6. Profile timing for classical vs QSVM to understand scaling effects.
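
For the timing exercise, a small context manager keeps the measurements tidy. This is a sketch using only the standard library; `timed` and the `timings` dictionary are hypothetical names, and the summation is just a stand-in for calls such as `svm.fit(...)` or `qkernel.evaluate(...)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("demo_workload", timings):
    sum(i * i for i in range(100_000))  # stand-in for the code being profiled

print(timings)
```

Wrapping the classical fit, the two kernel evaluations, and the QSVM fit in separate `timed` blocks makes it easier to see which step dominates as `d` or `reps` grows.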