# Phishing Site URLs Detection

This notebook demonstrates how to train a logistic regression classifier on character-level TF-IDF features, using the **Phishing Site URLs** dataset from Kaggle, to separate phishing URLs from legitimate ones.

Dataset: [taruntiwarihp/phishing-site-urls](https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls)

## 1. Setup
First, install `kagglehub`, which downloads the dataset automatically, along with `pandas`, `scikit-learn`, and `joblib`, which the later cells use.

In [None]:
!pip install kagglehub pandas scikit-learn joblib

## 2. Global Imports

In [None]:
import kagglehub
import pandas as pd
import numpy as np
import glob
import os
import joblib
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## 3. Download and Load Dataset
We use `kagglehub` to fetch the latest version of the dataset.

In [None]:
print("Downloading dataset...")
path = kagglehub.dataset_download("taruntiwarihp/phishing-site-urls")
print("Path to dataset:", path)

# Find the CSV file in the downloaded directory
csv_files = glob.glob(os.path.join(path, "*.csv"))
if not csv_files:
    # Stop here: every later cell depends on `df` being defined
    raise FileNotFoundError(f"No CSV file found in {path}")

csv_path = csv_files[0]
print(f"Loading data from: {csv_path}")
df = pd.read_csv(csv_path)
print("\nDataset loaded successfully!")
print(f"Shape: {df.shape}")
print(df.head())
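If the download directory nests the CSV inside subfolders, a top-level `*.csv` pattern will miss it. The sketch below shows one way to search recursively. The helper name `find_csv_files` and the temporary directory layout are hypothetical and only serve the demonstration.

```python
import glob
import os
import tempfile


def find_csv_files(root):
    """Return all CSV paths under root, including subdirectories (hypothetical helper)."""
    pattern = os.path.join(root, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))


# Demonstrate on a throwaway directory rather than the real download.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "versions", "1")
    os.makedirs(nested)
    with open(os.path.join(nested, "example.csv"), "w") as fh:
        fh.write("URL,Label\nexample.com,good\n")

    found = find_csv_files(tmp)
    found_names = [os.path.basename(p) for p in found]
    print(found_names)
```

In the notebook, `find_csv_files(path)` could stand in for the `glob.glob(...)` call if the flat pattern returns nothing.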

## 4. Preprocessing
The dataset is expected to contain a 'URL' column and a 'Label' column; check the printed column list and adjust the names below if they differ. We will use the URLs as features and the labels as the target.

In [None]:
# Inspect columns to find label and url
print("Columns:", df.columns)

# Assuming standard column names, but you can adjust if needed
url_col = 'URL'
label_col = 'Label'

# Select features (X) and target (y)
X = df[url_col]
y = df[label_col]

print(f"\nTotal samples: {len(X)}")
print("Class distribution:")
print(y.value_counts())
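Before vectorizing, it can help to drop rows with missing values and duplicate URLs, since a missing URL would make the text vectorizer fail and duplicates can leak between the train and test sets. The following is a minimal sketch on a toy frame. The column names and the 'good'/'bad' label values are assumptions, so check them against the real data first.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "URL": ["a.example/login", "a.example/login", None, "b.example/home "],
    "Label": ["bad", "bad", "good", "good"],
})

clean = (
    toy.dropna(subset=["URL", "Label"])   # remove rows with a missing URL or label
       .drop_duplicates(subset="URL")     # keep the first copy of each URL
       .reset_index(drop=True)
)
# Normalize to plain strings without surrounding whitespace.
clean["URL"] = clean["URL"].astype(str).str.strip()

print(clean)
```

Applied to the notebook, the same chain would run on `df` before `X` and `y` are selected.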

In [None]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
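The `stratify=y` argument keeps the class proportions the same in both splits, which matters when one class is rarer than the other. A small sketch with synthetic labels (an 80/20 class mix chosen only for illustration) makes the effect visible:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 80 'good' and 20 'bad' samples (illustrative only).
urls = pd.Series([f"site{i}.example" for i in range(100)])
labels = pd.Series(["good"] * 80 + ["bad"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    urls, labels, test_size=0.2, random_state=42, stratify=labels
)

# Share of the 'bad' class in each split.
train_bad_share = (y_tr == "bad").mean()
test_bad_share = (y_te == "bad").mean()
print(f"train: {train_bad_share:.2f}, test: {test_bad_share:.2f}")
```

Both shares stay at the overall 20% here; without stratification they could drift apart by chance.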

## 5. Feature Extraction (TF-IDF)
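We use TF-IDF with a character analyzer to capture substring patterns in the URL (e.g., 'http', '.com', 'login'). With n-grams of 3 to 5 characters, longer tokens such as 'secure' are represented by overlapping fragments ('secur', 'ecure') rather than as a single feature.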

In [None]:
print("Vectorizing URLs...")
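# Character n-grams of 3 to 5 characters. min_df=5 drops n-grams seen in fewer than 5 URLs,
# max_df=0.9 drops those seen in more than 90% of URLs, max_features keeps the 5000 most frequent.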
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 5), max_features=5000, min_df=5, max_df=0.9)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_tfidf.shape}")
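To see what the vectorizer actually counts, the analyzer can be called directly on a short string. This sketch uses `build_analyzer()` on an unfitted `TfidfVectorizer` with the same `analyzer` and `ngram_range` settings; the input strings are arbitrary examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as the notebook; no fitting is needed to tokenize.
analyzer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).build_analyzer()

login_grams = analyzer("login")
secure_grams = analyzer("secure")

print(login_grams)
print(secure_grams)
```

A five-character string like `login` appears whole among its n-grams, while a six-character string like `secure` only appears as fragments up to five characters long.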

## 6. Model Training (Logistic Regression)
Logistic Regression is chosen for its efficiency and effectiveness on high-dimensional sparse data like text/TF-IDF.

In [None]:
print("Training model...")
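# saga is suited to large datasets; random_state makes its data shuffling reproducible
model = LogisticRegression(max_iter=1000, n_jobs=-1, solver='saga', random_state=42)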
model.fit(X_train_tfidf, y_train)
print("Training completed.")

## 7. Evaluation

In [None]:
y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
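Accuracy alone can hide how many phishing URLs slip through, so it is worth reading precision and recall for the phishing class off the confusion matrix. This sketch uses hand-written toy labels and assumes the label strings are 'good' and 'bad'; passing `labels=` fixes the row and column order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (illustrative only).
y_true = ["bad", "bad", "bad", "good", "good", "good", "good", "good"]
y_hat = ["bad", "bad", "good", "good", "good", "good", "good", "bad"]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_hat, labels=["good", "bad"])
tn, fp, fn, tp = cm.ravel()  # treating 'bad' as the positive class

precision_bad = tp / (tp + fp)  # of URLs flagged as phishing, how many were
recall_bad = tp / (tp + fn)     # of actual phishing URLs, how many were caught

print(cm)
print(f"precision={precision_bad:.3f}, recall={recall_bad:.3f}")
```

For the notebook's results, `y_test` and `y_pred` would replace the toy lists.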

## 8. Save Model
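Save the trained model and the fitted vectorizer for future use. Both are needed at inference time, because new URLs must be transformed with the same vocabulary the model was trained on.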

In [None]:
joblib.dump(model, 'kaggle_phishing_model.joblib')
joblib.dump(vectorizer, 'kaggle_vectorizer.joblib')
print("Model saved to 'kaggle_phishing_model.joblib'")
print("Vectorizer saved to 'kaggle_vectorizer.joblib'")
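An alternative to saving two files is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so a single artifact takes raw URL strings as input. This is a sketch of that option on toy data, not the approach used above; the toy URLs, labels, and the file name `toy_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training examples (illustrative only).
toy_urls = [
    "paypal-secure-login.example/verify",
    "bank-login.example/secure/confirm",
    "docs.python.org/3/library",
    "en.wikipedia.org/wiki/URL",
]
toy_labels = ["bad", "bad", "good", "good"]

pipe = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
pipe.fit(toy_urls, toy_labels)

# Round-trip through joblib in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    artifact = os.path.join(tmp, "toy_pipeline.joblib")
    joblib.dump(pipe, artifact)
    restored = joblib.load(artifact)

same = list(restored.predict(toy_urls)) == list(pipe.predict(toy_urls))
print("Predictions match after reload:", same)
```

The trade-off is convenience versus flexibility: one file is harder to get out of sync, while separate files let the vectorizer be reused with a different model.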

## 9. Test with New URLs

In [None]:
samples = [
    "https://www.google.com",
    "http://phishing-bank-login.com/secure",
    "https://www.kaggle.com",
    "http://192.168.1.1/login"
]

transformed_samples = vectorizer.transform(samples)
predictions = model.predict(transformed_samples)

for url, pred in zip(samples, predictions):
    print(f"URL: {url} -> Prediction: {pred}")
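A hard label hides how confident the model is. `predict_proba` returns one probability per class, with columns ordered as in `classes_`. The sketch below trains a small stand-in model on made-up URLs (the 'good'/'bad' labels are an assumption) and prints the estimated phishing probability for two new strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training examples (illustrative only).
toy_urls = [
    "paypal-secure-login.example/verify",
    "bank-login.example/secure/confirm",
    "docs.python.org/3/library",
    "en.wikipedia.org/wiki/URL",
]
toy_labels = ["bad", "bad", "good", "good"]

vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(toy_urls), toy_labels)

new_urls = ["secure-login.example/account", "python.org/downloads"]
probs = clf.predict_proba(vec.transform(new_urls))

# Look up the column for 'bad' instead of assuming its position.
bad_col = list(clf.classes_).index("bad")
for url, row in zip(new_urls, probs):
    print(f"URL: {url} -> P(bad) = {row[bad_col]:.3f}")
```

With the notebook's objects, `model.predict_proba(transformed_samples)` gives the same kind of output, and a threshold other than 0.5 can then be chosen to trade precision against recall.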