In [1]:
import pandas as pd

In [3]:
# Load the dataset; the file name is kept exactly as provided ('phising.csv')
PWTDF = pd.read_csv('phising.csv')

## Phishing Website Detection Dataset - Summary

This dataset is used to identify whether a website is **legitimate or phishing** based on URL structure, domain properties, web traffic, and HTML/JavaScript behavior.

### Dataset Overview
- **Total Records:** about 11,000
- **Total Features:** 30 (inputs)
- **Target Columns:** 1 (`Result`)
- **Total Columns:** 31 (30 features + 1 target)
- **Problem Type:** Binary Classification
- **Domain:** Cybersecurity / Web Threat Detection

### Target Variable
- **Column Name:** `Result`
- **Values:**
  - `1` → Legitimate Website
  - `-1` → Phishing Website
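
The target codes can be turned into readable labels before plotting or reporting. The following is a minimal sketch on a small hand-made frame rather than the real CSV; the `label_names` mapping and the `Label` column are illustrative names and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; only the `Result` column matters here.
toy = pd.DataFrame({"Result": [1, -1, -1, 1, 1]})

# Hypothetical mapping from the numeric codes to readable class names.
label_names = {1: "Legitimate", -1: "Phishing"}
toy["Label"] = toy["Result"].map(label_names)

# Any code outside {1, -1} maps to NaN, so this doubles as a sanity check.
unexpected = int(toy["Label"].isna().sum())
label_counts = toy["Label"].value_counts()
```

On the real data the same two lines would apply to `PWTDF["Result"]`, assuming the column only contains `1` and `-1` as described above.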

### Feature Encoding
Most input features follow this encoding scheme:
- `1`  → Legitimate
- `0`  → Suspicious
- `-1` → Phishing
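
Since the scheme above is stated to hold for most features rather than all of them, it may be worth checking which columns actually stay within `{-1, 0, 1}`. The sketch below runs that check on made-up values; the `toy` frame and the deliberately out-of-range `5` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for a few feature columns; values are made up for illustration.
toy = pd.DataFrame({
    "SSLfinal_State": [1, 0, -1, 1],
    "having_IP_Address": [1, -1, 1, 1],
    "URL_Length": [1, 0, -1, 5],  # 5 is deliberately out of range
})

allowed = {-1, 0, 1}

# For each column, collect the values that fall outside the expected codes.
out_of_range = {}
for col in toy.columns:
    extra = set(toy[col].tolist()) - allowed
    if extra:
        out_of_range[col] = sorted(extra)
```

An empty `out_of_range` dictionary would indicate that every checked column follows the encoding.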

### Feature Categories
- **URL-Based Features:** Detect suspicious patterns in URLs (IP address usage, long URLs, shortening services, special symbols).
- **Security Features:** SSL certificate validity, HTTPS misuse, port numbers.
- **HTML & JavaScript Behavior:** Form handling, redirects, popups, iframe usage, abnormal anchors.
- **Domain & Traffic Features:** Domain age, DNS record, Google indexing, page rank, backlinks, blacklist reports.

### Machine Learning Usage
- **Input:** 30 website-related features  
- **Output:** Website class (Phishing or Legitimate)  
- **Common Algorithms Used:**
  - Logistic Regression
  - Random Forest
  - XGBoost
  - Support Vector Machine (SVM)
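
As a rough illustration of the input/output setup described above, the sketch below trains one of the listed algorithms (Random Forest, via scikit-learn) on synthetic data. Everything about the data is an assumption: the five `f0`..`f4` columns, the 200 rows, and the rule that generates the target are invented so the example runs without the CSV, and the hyperparameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 5 features coded as -1/0/1.
X = pd.DataFrame(
    rng.choice([-1, 0, 1], size=(200, 5)),
    columns=[f"f{i}" for i in range(5)],
)
# Made-up rule so the toy target is learnable; not related to the real data.
y = np.where(X["f0"] + X["f1"] >= 0, 1, -1)

# Hold out 25% of the rows, keeping the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of correct test predictions
```

With the real data, `X` would presumably be `PWTDF.drop(columns=["Result"])` and `y` would be `PWTDF["Result"]`. Some libraries may expect class labels starting at 0 (recent XGBoost versions are one example), in which case `-1` would need remapping first.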

### Application Areas
- Web security systems  
- Fraud and phishing detection  
- Browser security extensions  
- AI-based threat monitoring tools  

This dataset is widely used as a benchmark for **phishing website detection using machine learning**.


| No | Column Name                     | Description                              | Legit Website (Value = 1)     | Phishing Website (Value = -1 / 0) | Example (Legit ✅ vs Phishing ❌) |
|----|----------------------------------|------------------------------------------|-------------------------------|-----------------------------------|----------------------------------|
| 1  | having_IP_Address               | Whether URL uses an IP address            | Uses domain name              | Uses direct IP address            | https://bank.com ✅ / http://192.168.1.1/login ❌ |
| 2  | URL_Length                      | Length of the URL                         | Short / Normal                | Very Long                         | bank.com/login ✅ / bank.com/secure/verify/account/update ❌ |
| 3  | Shortining_Service              | Use of URL shortener                      | Not used                      | Used                              | https://bank.com ✅ / bit.ly/xyz ❌ |
| 4  | having_At_Symbol                | Presence of @ in URL                     | Not present                   | Present                           | bank.com/login ✅ / bank.com@evil.com ❌ |
| 5  | double_slash_redirecting        | Abnormal // in URL                       | Normal slashes                | Extra slashes                     | https://site.com/login ✅ / https://site.com//login ❌ |
| 6  | Prefix_Suffix                  | Dash (-) in domain                        | No dash                       | Dash present                      | securebank.com ✅ / secure-bank.com ❌ |
| 7  | having_Sub_Domain               | Number of subdomains                     | Normal                         | Too many                          | mail.google.com ✅ / login.verify.bank.secure.com ❌ |
| 8  | SSLfinal_State                  | Valid HTTPS and SSL                      | Trusted SSL                   | No / Fake SSL                     | https://bank.com ✅ / http://bank-login.com ❌ |
| 9  | Domain_registeration_length    | Domain registration period                | Long term                     | Short term                        | 10 years ✅ / 6 months ❌ |
| 10 | Favicon                         | Source of favicon                        | Same domain                   | External domain                   | Bank logo ✅ / Logo from another site ❌ |
| 11 | port                            | Use of non-standard port                 | Default port (80/443)         | Custom port                       | https://site.com ✅ / https://site.com:8080 ❌ |
| 12 | HTTPS_token                    | Fake “https” in domain name              | Not used                      | Used                              | https://bank.com ✅ / https://https-bank.com ❌ |
| 13 | Request_URL                    | Loading of media files                   | Same domain                   | External domains                  | Images from bank.com ✅ / External images ❌ |
| 14 | URL_of_Anchor                  | Safety of anchor links                   | Trusted links                 | Malicious links                   | Internal links ✅ / Random redirects ❌ |
| 15 | Links_in_tags                  | External links in tags                   | Few                            | Many                              | Minimal scripts ✅ / Multiple external scripts ❌ |
| 16 | SFH                            | Form Submission Handler                  | Same domain                   | Email / Blank                     | Submit to bank ✅ / Submit to email ❌ |
| 17 | Submitting_to_email            | Data sent via email                      | No                             | Yes                               | Secure server ✅ / mailto:action ❌ |
| 18 | Abnormal_URL                   | URL vs WHOIS mismatch                   | Match                          | Mismatch                          | Owner matches ✅ / Fake registration ❌ |
| 19 | Redirect                       | Number of redirects                     | Few (1–2)                     | Many                              | Normal navigation ✅ / Continuous redirection ❌ |
| 20 | on_mouseover                   | Mouse hover link change                  | No fake change                | Fake link on hover                | Same URL ✅ / Different URL ❌ |
| 21 | RightClick                    | Right-click restriction                 | Allowed                        | Disabled                          | Normal site ✅ / Right-click blocked ❌ |
| 22 | popUpWidnow                   | Presence of popups                      | No popups                      | Suspicious popups                 | No popup ✅ / Fake login popup ❌ |
| 23 | Iframe                         | Use of hidden iframe                    | Not used                      | Used                              | No hidden content ✅ / Hidden iframe ❌ |
| 24 | age_of_domain                 | Age of domain                           | Old domain                    | New domain                        | 8–10 years ✅ / Few days ❌ |
| 25 | DNSRecord                     | DNS record existence                    | Present                        | Missing                           | Proper DNS ✅ / No DNS record ❌ |
| 26 | web_traffic                   | Website popularity                      | High                           | Low                               | Google traffic ✅ / Unknown traffic ❌ |
| 27 | Page_Rank                     | Google PageRank                         | High                           | Low                               | Ranked site ✅ / Zero rank ❌ |
| 28 | Google_Index                  | Indexed in Google                       | Yes                            | No                                | Found on Google ✅ / Not found ❌ |
| 29 | Links_pointing_to_page        | Backlinks count                         | Many                           | Few                               | Many backlinks ✅ / No backlinks ❌ |
| 30 | Statistical_report           | Presence in phishing blacklist          | Not listed                     | Listed                            | Safe report ✅ / Blacklisted ❌ |
| 31 | Result                        | Final classification                    | Legitimate (1)                | Phishing (-1)                     | Legit ✅ / Phishing ❌ |
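
To make a few of the URL-based rows in the table concrete, here is a hedged sketch that derives four flags from a raw URL using only the Python standard library. The function name `url_flags` and the exact rules are assumptions chosen to mirror the table's examples; they are not the rules used to build the dataset, which may apply different thresholds and a `0` (suspicious) level.

```python
import ipaddress
from urllib.parse import urlsplit


def url_flags(url: str) -> dict:
    """Return a few URL flags coded as 1 (legitimate-looking) or -1 (phishing-looking)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # having_IP_Address: the host parses as a literal IP address.
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False

    return {
        "having_IP_Address": -1 if has_ip else 1,
        "having_At_Symbol": -1 if "@" in url else 1,
        "Prefix_Suffix": -1 if "-" in host else 1,
        # port: anything other than the defaults (none given, 80, 443) is flagged.
        "port": -1 if parts.port not in (None, 80, 443) else 1,
    }


legit = url_flags("https://bank.com/login")
ip_based = url_flags("http://192.168.1.1/login")
dashed = url_flags("https://secure-bank.com:8080/")
```

The three sample URLs are taken from, or modeled on, the examples in the table. Features such as `age_of_domain` or `Page_Rank` need external lookups and cannot be derived from the URL string alone.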


In [4]:
# Basic info and stats
df_shape = PWTDF.shape
df_info = PWTDF.dtypes
df_desc = PWTDF.describe()

In [8]:
# Value counts of target
target_counts = PWTDF['Result'].value_counts()

In [9]:
# Missing values
missing_values = PWTDF.isnull().sum()

In [12]:
import os
os.makedirs("plots", exist_ok=True)

In [None]:
import matplotlib.pyplot as plt

# Plot 1: Target distribution
plt.figure()
target_counts.plot(kind='bar')
plt.title("Target Variable Distribution (Legit vs Phishing)")
plt.xlabel("Class")
plt.ylabel("Count")
plt.savefig("plots/target_distribution.png", dpi=300, bbox_inches="tight")
plt.close()

# Plot 2: Distribution of a few important features
features_to_plot = ['URL_Length', 'having_IP_Address', 'SSLfinal_State', 'age_of_domain', 'web_traffic']

for feature in features_to_plot:
    plt.figure()
    PWTDF[feature].value_counts().plot(kind='bar')
    plt.title(f"Distribution of {feature}")
    plt.xlabel("Feature Value")
    plt.ylabel("Count")
    plt.savefig(f"plots/{feature}_distribution.png", dpi=300, bbox_inches="tight")
    plt.close()

df_shape, df_info, df_desc, target_counts, missing_values

