# Project: Web Page Phishing Detection using Machine Learning

Goal: The goal of this project is to develop a machine learning-based system that can automatically detect phishing websites by analyzing their features, thereby enhancing cybersecurity and protecting users from fraudulent online activities.

### Importing Libraries
* NumPy (np): Used for numerical computing, mainly fast mathematical operations on arrays.

* Pandas (pd): Used to handle and analyze data in table form (DataFrame).

* Matplotlib & Seaborn: Used to create graphs, charts, and visualizations.

* Warnings: Used to ignore unnecessary warning messages so the output looks clean.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Data Loading & Preview
* Load the phishing dataset into a DataFrame called data. 
* A DataFrame is like an Excel sheet with rows and columns.
* data.head() shows the first 5 rows so we can preview the dataset.

In [None]:
data = pd.read_csv('web-page-phishing.csv')
data.head()

### Data Dimensions
* data.shape shows the dataset size as a tuple (rows, columns).

In [None]:
data.shape

### Data Summary
* The .info() method shows the number of entries (rows), the total number of columns, and, for each column, its name, the number of non-null values, and the data type (dtype).
* This helps in checking for missing data and ensuring the data types are correct for analysis.

In [None]:
data.info()

### Checking for Missing Values
* The .isna() method checks every single cell in the DataFrame and returns True if a value is missing (like NaN) and False if it is not.
* The .sum() gives a clear and simple overview of how many missing values exist in each feature.

In [None]:
data.isna().sum()

In [None]:
data.describe()

### Understanding the Target Variable
* The .map() method is used to replace the numerical values in the 'phishing' column with more descriptive labels. In this case, 0 is changed to "non-phishing" and 1 is changed to "phishing".
* The .value_counts() method gives a clear overview of the number of phishing and non-phishing entries in the dataset.

In [None]:
data['phishing'].map({0: "non-phishing", 1: "phishing"}).value_counts()
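
Raw counts are easier to compare when shown next to class proportions. The following is a minimal sketch on a hypothetical toy Series (the values are made up for illustration and are not taken from the dataset); `normalize=True` makes `value_counts()` return fractions instead of counts.

```python
import pandas as pd

# Hypothetical toy labels standing in for data['phishing']
toy_labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

toy_named = toy_labels.map({0: "non-phishing", 1: "phishing"})
toy_counts = toy_named.value_counts()                 # absolute counts per class
toy_shares = toy_named.value_counts(normalize=True)   # fraction of rows per class

print(toy_counts)
print(toy_shares.round(3))
```

The same `normalize=True` argument could be applied to the real `data['phishing']` column to quantify how imbalanced the classes are.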

### Data Visualization
* Creating a Count Plot: The sns.countplot() function from the Seaborn library is used to create a bar chart of the number of rows in each class.
* The bar chart shows that the number of websites classified as 0 (non-phishing) is significantly greater than the number of websites classified as 1 (phishing).

In [None]:
sns.countplot(x='phishing', data=data, palette="Set2")
plt.title("Distribution of Phishing vs Legitimate Websites")
plt.show()

### Correlation Heatmap
The heatmap visualizes the correlation matrix of the features, highlighting the relationships between different variables. This helps in identifying highly correlated features, which might be considered for dimensionality reduction or feature selection.

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(data.corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap of Features")
plt.show()
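
A heatmap makes strong correlations visible, but a short list of the strongest pairs is often easier to act on. The sketch below uses a hypothetical toy DataFrame (columns `a`, `b`, `c` are invented for illustration) and an arbitrary threshold of 0.9; both are assumptions, not values from this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=200),  # almost a linear copy of "a"
    "c": rng.normal(size=200),                         # independent noise
})

toy_corr = toy_df.corr().abs()

# Keep only the strict upper triangle so each pair appears exactly once
upper_mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pair_corr = toy_corr.where(upper_mask).stack()

# 0.9 is an illustrative threshold, not a recommendation
high_pairs = pair_corr[pair_corr > 0.9]
print(high_pairs)
```

Applied to the real dataset, the same pattern would start from `data.corr().abs()`; which threshold is appropriate depends on the model and is a judgment call.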

### Feature Distributions by Class
* This code efficiently generates multiple plots at once by using a for loop to go through a list of features. For each feature, sns.histplot() creates a histogram to show the distribution of its values.
* The plots reveal that for features like n_dots, n_hypens, n_slash, and n_redirection, the distributions for both classes are very similar and concentrated at low values.

In [None]:
features = ['url_length','n_dots','n_hypens','n_slash','n_redirection']
for col in features:
    plt.figure(figsize=(6,4))
    sns.histplot(data=data, x=col, hue="phishing", kde=True, bins=40, palette="husl")
    plt.title(f"Distribution of {col} by Class")
    plt.show()

### Box Plots by Class
* Similar to the previous step, this code uses a for loop to create multiple plots efficiently. This time, it uses sns.boxplot() to create a box plot for each feature.
* Each plot shows two boxes, one for non-phishing (class 0) and one for phishing (class 1).
* The box itself represents the middle 50% of the data (from the 25th to the 75th percentile).
* The line inside the box is the median (the 50th percentile).
* The "whiskers" (lines extending from the box) show the general range of the data.
* The individual dots or circles outside the whiskers are considered outliers.

In [None]:
for col in features:
    plt.figure(figsize=(6,4))
    sns.boxplot(x='phishing', y=col, data=data, palette="Set3")
    plt.title(f"{col} vs Class")
    plt.show()
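
The quantities a box plot draws can also be computed directly. This is a minimal sketch on a hypothetical array of values, assuming the common 1.5 × IQR rule for whiskers (the default in Matplotlib and Seaborn); points beyond those limits are the ones drawn as outliers.

```python
import numpy as np

# Hypothetical feature values; 20 is deliberately far from the rest
toy_values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

q1, median, q3 = np.percentile(toy_values, [25, 50, 75])
iqr = q3 - q1                       # height of the box

# Whisker limits under the 1.5 * IQR rule
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

toy_outliers = toy_values[(toy_values < lower_limit) | (toy_values > upper_limit)]
print("Q1:", q1, "median:", median, "Q3:", q3)
print("Outliers:", toy_outliers)
```

Running the same calculation per class on a real column (for example `url_length`) would give the numbers behind each box in the plots above.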

### Visualizing Data with PCA
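* This code cell prepares the data for Principal Component Analysis (PCA). It separates the features (X) from the target variable (y), fills any missing feature values with the column mean, standardizes the features, and projects them onto the first two principal components.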
* The scatter plot shows the two principal components. Each point represents a website, with its color indicating whether it's a phishing (1) or non-phishing (0) website.
* From the plot, we can see that the two classes are not clearly separable, with phishing and non-phishing websites largely mixed together.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

X = data.drop('phishing', axis=1)
y = data['phishing']

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_scaled)

plt.figure(figsize=(8,6))
sns.scatterplot(x=pca_result[:,0], y=pca_result[:,1], hue=y, palette="Set1", alpha=0.7)
plt.title("PCA: Phishing vs Legitimate Websites")
plt.show()
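
A two-component plot can only be as informative as the variance those two components retain. The sketch below, on synthetic data generated purely for illustration, shows how `explained_variance_ratio_` reports that share; the array `X_pca_toy` and its shape are assumptions, not the project's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_pca_toy = rng.normal(size=(300, 6))
# Make the second column nearly a copy of the first so one direction dominates
X_pca_toy[:, 1] = X_pca_toy[:, 0] + 0.1 * rng.normal(size=300)

X_pca_toy_scaled = StandardScaler().fit_transform(X_pca_toy)
pca_toy = PCA(n_components=2).fit(X_pca_toy_scaled)

# Fraction of total variance captured by each retained component
toy_ratios = pca_toy.explained_variance_ratio_
print("Per component:", toy_ratios.round(3))
print("Total retained:", toy_ratios.sum().round(3))
```

If the total retained variance is low on the real data, overlapping classes in the 2D plot do not by themselves imply that the classes overlap in the full feature space.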

### Separating Features and Target
* The first line data_cleaned = data.dropna(subset=['phishing']) creates a new DataFrame by dropping any rows where the 'phishing' column has a missing value.
* Features (X): The line X = data_cleaned.drop('phishing', axis=1) creates a new DataFrame X that contains all the columns except the 'phishing' column.
* Target (y): The line y = data_cleaned['phishing'] creates a Series y that contains only the 'phishing' column.

In [None]:
data_cleaned = data.dropna(subset=['phishing'])
X = data_cleaned.drop('phishing', axis=1)
y = data_cleaned['phishing']

### Splitting Data for Modeling
* This is a critical step in machine learning. The train_test_split() function divides the dataset into two parts: a training set (80% of the data in this case) and a testing set (20%).
* test_size=0.2: Specifies that 20% of the data will be used for the testing set.
* random_state=42: This ensures that the split is the same every time you run the code, making the results reproducible.
* stratify=y: It ensures that the proportion of phishing vs. non-phishing entries is the same in both the training and testing sets, providing a fair evaluation of the model's performance.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
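
The effect of `stratify` is easiest to see on a small imbalanced example. This sketch uses hypothetical toy arrays with an 80/20 class ratio and checks that both splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives, 20 positives
y_strat_toy = np.array([0] * 80 + [1] * 20)
X_strat_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr_toy, y_te_toy = train_test_split(
    X_strat_toy, y_strat_toy, test_size=0.2, random_state=42, stratify=y_strat_toy
)

# The mean of a 0/1 array is the share of the positive class
print("Train positive share:", y_tr_toy.mean())
print("Test positive share:", y_te_toy.mean())
```

Without `stratify`, the positive share in a small test set can drift from the overall ratio purely by chance.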

### Training and Evaluating Machine Learning Models
* Importing Models: This code first imports three different machine learning models: Logistic Regression, Random Forest Classifier, and XGBoost Classifier. These models will be used to classify websites as either phishing or non-phishing.
* Creating a Dictionary of Models: A dictionary named models is created to store the different model objects. Using a dictionary allows us to easily manage and iterate through each model.
* Training and Evaluation Loop: The for loop automates the training and evaluation process for each model.
* model.fit(): This line trains the model using the training data (X_train, y_train). The model learns patterns from this data.
* model.predict(): This line uses the trained model to make predictions on the unseen test data (X_test).
* accuracy_score(): This function compares the model's predictions (y_pred) to the actual values (y_test) to calculate the accuracy. Accuracy is a metric that tells us the percentage of correct predictions made by the model.
* The output shows the accuracy for each model.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(eval_metric='logloss')

}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))

### Advanced Model Evaluation
* Confusion Matrix: This table shows exactly where the model made mistakes. For example, in the Logistic Regression output, the top row shows that 11,935 non-phishing sites were correctly predicted, but 808 were incorrectly predicted as phishing.
* Classification Report: This provides three key metrics for each class:
* Precision: The model's accuracy when it predicts a certain class.
* Recall: The model's ability to find all actual instances of a class.
* F1-score: The harmonic mean of precision and recall, useful for imbalanced datasets.
* The Random Forest model has slightly higher scores across all metrics, especially for the phishing class (1), which is the minority class. Its ROC-AUC score of 0.96 is also higher than Logistic Regression's 0.93.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

for name, model in models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:,1] if hasattr(model, "predict_proba") else None

    print(f"\n{name}")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    if y_proba is not None:
        print("ROC-AUC:", roc_auc_score(y_test, y_proba))
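
To connect the confusion matrix with the metrics in the classification report, the sketch below derives precision, recall, and F1 for the positive class by hand. The labels are hypothetical toy values chosen for illustration, not model output from this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)   # of predicted phishing, how many were phishing
recall_manual = tp / (tp + fn)      # of actual phishing, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Manual: ", precision_manual, recall_manual, round(f1_manual, 4))
print("sklearn:", precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      round(f1_score(y_true_toy, y_pred_toy), 4))
```

In a phishing setting, false negatives (missed phishing pages) and false positives (blocked legitimate pages) have different costs, which is why recall and precision are worth reading separately.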

### Cross-Validation for Robust Evaluation
* This code uses k-fold cross-validation to get a more reliable and robust estimate of the models' performance. Instead of a single train-test split, the data is divided into cv=5 folds (or groups).
* The output provides two numbers for each model:
* Mean Accuracy: The average accuracy across all 5 folds. This is a more stable and reliable measure of the model's performance than a single test score.
* Standard Deviation: This shows the variation in the accuracy scores across the folds. A low standard deviation indicates that the model's performance is consistent, while a high standard deviation suggests that its performance can vary depending on the data it's trained on.

In [None]:
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name} CV Mean Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
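
What `cross_val_score` does internally can be written out as an explicit loop. The sketch below assumes a synthetic dataset from `make_classification` and a Logistic Regression model; for a classifier, an integer `cv=5` corresponds to `StratifiedKFold(n_splits=5)` without shuffling, so both approaches should produce the same fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data used only for illustration
X_cv_toy, y_cv_toy = make_classification(n_samples=200, n_features=5, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_cv_toy, y_cv_toy):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_cv_toy[train_idx], y_cv_toy[train_idx])                      # train on 4 folds
    manual_scores.append(fold_model.score(X_cv_toy[test_idx], y_cv_toy[test_idx]))  # accuracy on the held-out fold

auto_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_cv_toy, y_cv_toy, cv=5, scoring="accuracy"
)

print("Manual:", np.round(manual_scores, 4))
print("Auto:  ", np.round(auto_scores, 4))
```

Each sample is used for testing exactly once, which is why the mean over folds is less dependent on one particular split.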

### Hyperparameter Tuning with Grid Search
* Hyperparameters are settings that are not learned from the data but are set by the data scientist (e.g., the number of trees in a Random Forest). We use GridSearchCV to automatically find the best combination of these settings.
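* The output shows the Best RF Params, which is the best combination of hyperparameters found by the search ('n_estimators': 200, 'max_depth': 20, etc.). It also shows the Best RF Score, which is the mean cross-validated accuracy achieved with this combination on the training data. Because this score comes from 3-fold cross-validation on the training set, it is not directly comparable to the single test-set accuracy reported earlier; evaluating grid_rf.best_estimator_ on X_test would confirm whether tuning actually improved the model.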

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Random Forest tuning
rf_params = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4]
}

grid_rf = GridSearchCV(RandomForestClassifier(), rf_params, cv=3, scoring="accuracy", n_jobs=-1, verbose=2)
grid_rf.fit(X_train, y_train)

print("Best RF Params:", grid_rf.best_params_)
print("Best RF Score:", grid_rf.best_score_)
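
Grid search cost grows multiplicatively with each added hyperparameter, so it helps to count the candidates before launching a search. The sketch below copies the grid from the cell above into a separate, illustratively named dictionary (`rf_params_demo`) and uses `ParameterGrid` to enumerate it.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, copied under a different name for illustration
rf_params_demo = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(rf_params_demo))  # 3 * 4 * 3 * 3 combinations
n_cv_fits = n_candidates * 3                       # each candidate is fit once per fold (cv=3)

print("Candidate combinations:", n_candidates)
print("Cross-validation fits:", n_cv_fits)
```

With the default `refit=True`, GridSearchCV also fits the best combination once more on all of the training data. If the runtime is too long, a randomized search over the same ranges is a common alternative.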

### Final Model Evaluation

Evaluate the best-performing model by test accuracy (XGBoost) using a confusion matrix, classification report, and ROC-AUC score to provide a comprehensive understanding of its performance.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

best_model = models["XGBoost"]

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("XGBoost Model Evaluation:")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
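
ROC-AUC can be read as the probability that a randomly chosen phishing page receives a higher score than a randomly chosen legitimate one. The sketch below checks that interpretation on four hypothetical scores by counting correctly ordered pairs; it ignores ties, which would normally count as half a win.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class
auc_true_toy = [0, 0, 1, 1]
auc_scores_toy = [0.1, 0.4, 0.35, 0.8]

pos_scores = [s for s, t in zip(auc_scores_toy, auc_true_toy) if t == 1]
neg_scores = [s for s, t in zip(auc_scores_toy, auc_true_toy) if t == 0]

# Count (positive, negative) pairs where the positive is ranked higher
wins = sum(p > n for p in pos_scores for n in neg_scores)
pairwise_auc = wins / (len(pos_scores) * len(neg_scores))

print("Pairwise estimate:", pairwise_auc)
print("roc_auc_score:    ", roc_auc_score(auc_true_toy, auc_scores_toy))
```

Because it depends only on the ranking of scores, ROC-AUC is unaffected by the 0.5 decision threshold that `predict()` applies.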

### Project Summary
This project aimed to develop a machine learning model to detect phishing websites based on URL features.
We performed data exploration, including examining the distribution of features and the target variable, checking for missing values, and visualizing correlations.
We trained and evaluated three different models: Logistic Regression, Random Forest, and XGBoost.
Based on the initial evaluation using accuracy on the test set, the XGBoost model achieved the highest accuracy of approximately 89.18%.
Potential next steps include:
- Further hyperparameter tuning for the best-performing models.
- Exploring feature engineering techniques to create new relevant features.
- Investigating more advanced methods for handling outliers and potential class imbalance.
- Trying other advanced models such as neural networks.
- Deploying the best model for real-time phishing detection.

### Data Analysis Key Findings

*   The dataset contains information about websites, and the target variable 'phishing' indicates whether a website is phishing (1) or not (0).
*   Initial model training showed that the XGBoost model achieved the highest accuracy (approximately 89.18%) on the test set compared to Logistic Regression (approximately 85.63%) and Random Forest (approximately 89.03%).
*   The final evaluation of the XGBoost model using a confusion matrix, classification report, and ROC-AUC score provided a comprehensive understanding of its performance. The confusion matrix showed the counts of correct and incorrect predictions, the classification report provided precision, recall, and f1-score for each class, and the ROC-AUC score indicated the model's ability to distinguish between phishing and legitimate websites.

### Insights or Next Steps

*   While XGBoost performed best among the initial models, further hyperparameter tuning could potentially improve its performance.
*   Exploring feature engineering or incorporating additional features could potentially lead to a more robust model for phishing detection.