<a href="https://colab.research.google.com/github/Kirtisable/cybersecurity-projects/blob/main/phishing_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 🙏🏻 Welcome to the Phishing URL Detector


In [1]:
print("Hello")


Hello


## Install Required Libraries

In [None]:
pip install pandas scikit-learn matplotlib


Note: you may need to restart the kernel to use updated packages.


# Install and Load the UCI Dataset

I’ll use the PhiUSIIL Phishing URL Dataset, which is publicly available and easy to import directly into Python.

In [None]:
pip install ucimlrepo


Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Note: you may need to restart the kernel to use updated packages.


# Load Dataset

In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

# Download and load the dataset
data = fetch_ucirepo(id=967)
X = data.data.features
y = data.data.targets

# Combine features and labels
df = pd.concat([X, y], axis=1)

# Check the first few rows
df.head()


| | URL | URLLength | Domain | DomainLength | IsDomainIP | TLD | URLSimilarityIndex | CharContinuationRate | TLDLegitimateProb | URLCharProb | ... | Pay | Crypto | HasCopyrightInfo | NoOfImage | NoOfCSS | NoOfJS | NoOfSelfRef | NoOfEmptyRef | NoOfExternalRef | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.southbankmosaics.com | 31 | www.southbankmosaics.com | 24 | 0 | com | 100.0 | 1.0 | 0.522907 | 0.061933 | ... | 0 | 0 | 1 | 34 | 20 | 28 | 119 | 0 | 124 | 1 |
| 1 | https://www.uni-mainz.de | 23 | www.uni-mainz.de | 16 | 0 | de | 100.0 | 0.666667 | 0.03265 | 0.050207 | ... | 0 | 0 | 1 | 50 | 9 | 8 | 39 | 0 | 217 | 1 |
| 2 | https://www.voicefmradio.co.uk | 29 | www.voicefmradio.co.uk | 22 | 0 | uk | 100.0 | 0.866667 | 0.028555 | 0.064129 | ... | 0 | 0 | 1 | 10 | 2 | 7 | 42 | 2 | 5 | 1 |
| 3 | https://www.sfnmjournal.com | 26 | www.sfnmjournal.com | 19 | 0 | com | 100.0 | 1.0 | 0.522907 | 0.057606 | ... | 1 | 1 | 1 | 3 | 27 | 15 | 22 | 1 | 31 | 1 |
| 4 | https://www.rewildingargentina.org | 33 | www.rewildingargentina.org | 26 | 0 | org | 100.0 | 1.0 | 0.079963 | 0.059441 | ... | 1 | 0 | 1 | 244 | 15 | 34 | 72 | 1 | 85 | 1 |


# Explore the Data


In [None]:
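# Note: a notebook cell only displays its last expression automatically.
# Wrap the other lines in print() or run them in separate cells to see them.
df.shape           # Rows, columns
df.columns         # Feature list + label
df['label'].value_counts()  # Ratio of phishing vs legit
df.isnull().sum()  # Check for missing values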


URL                           0
URLLength                     0
Domain                        0
DomainLength                  0
IsDomainIP                    0
TLD                           0
URLSimilarityIndex            0
CharContinuationRate          0
TLDLegitimateProb             0
URLCharProb                   0
TLDLength                     0
NoOfSubDomain                 0
HasObfuscation                0
NoOfObfuscatedChar            0
ObfuscationRatio              0
NoOfLettersInURL              0
LetterRatioInURL              0
NoOfDegitsInURL               0
DegitRatioInURL               0
NoOfEqualsInURL               0
NoOfQMarkInURL                0
NoOfAmpersandInURL            0
NoOfOtherSpecialCharsInURL    0
SpacialCharRatioInURL         0
IsHTTPS                       0
LineOfCode                    0
LargestLineLength             0
HasTitle                      0
Title                         0
DomainTitleMatchScore         0
URLTitleMatchScore            0
HasFavic

# Feature Extraction
Now we’ll extract simple, meaningful features from each URL to help our machine learning model detect phishing websites.

## 🎯 What are Features?
Features are patterns or clues we extract from URLs. For example:

Is the URL too long?

Does it contain an "@" symbol?

Does it have “https”?

Is there an IP address in the URL?

These features help the model learn what a phishing URL looks like.

# Step 1: Add New Columns (Features)

In [None]:
import re

# Function to check if IP address is present
def has_ip(url):
    return 1 if re.match(r"http[s]?://\d+\.\d+\.\d+\.\d+", url) else 0

# Function to check if '@' is in the URL
def has_at_symbol(url):
    return 1 if "@" in url else 0

# Function to get length of the URL
def url_length(url):
    return len(url)

# Function to check if '-' appears anywhere in the URL (domain or path)
def has_dash(url):
    return 1 if "-" in url else 0

# Function to check if the URL uses the https scheme
def has_https(url):
    return 1 if url.lower().startswith("https://") else 0
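
Checks on the whole URL string can be loose. For example, `has_dash` also fires when the dash sits in the path rather than the domain. A stricter variant, sketched below on the assumption that Python's standard `urllib.parse` and `ipaddress` modules are acceptable here, inspects the parsed scheme and hostname instead. The names `host_of`, `host_is_ip`, `host_has_dash` and `uses_https` are hypothetical and not part of the notebook.

```python
import ipaddress
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercase hostname of a URL, or '' if none can be parsed."""
    return (urlparse(url).hostname or "").lower()

def host_is_ip(url):
    """1 if the hostname is a literal IPv4 or IPv6 address, else 0."""
    try:
        ipaddress.ip_address(host_of(url))
        return 1
    except ValueError:
        return 0

def host_has_dash(url):
    """1 if the hostname (not the path) contains a dash, else 0."""
    return 1 if "-" in host_of(url) else 0

def uses_https(url):
    """1 if the URL scheme is exactly https, else 0."""
    return 1 if urlparse(url).scheme.lower() == "https" else 0

# Print the three flags for a few sample URLs
for url in ["https://www.uni-mainz.de",
            "http://198.51.100.1/login",
            "http://example.com/my-page"]:
    print(url, host_is_ip(url), host_has_dash(url), uses_https(url))
```

Swapping these in for the simpler helpers would change the feature values, so the model would need to be retrained and its scores may differ from the ones shown in this notebook.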


# Step 2: Apply Functions to Dataset
The PhiUSIIL dataset stores each address in the 'URL' column. If your dataset names it differently (for example 'url'), change the column name in the code below.

In [None]:
df['have_ip'] = df['URL'].apply(has_ip)
df['have_at'] = df['URL'].apply(has_at_symbol)
df['url_length'] = df['URL'].apply(url_length)
df['have_dash'] = df['URL'].apply(has_dash)
df['have_https'] = df['URL'].apply(has_https)

# 🔎 Step 3: View New Features

In [None]:
df[['URL', 'have_ip', 'have_at', 'url_length', 'have_dash', 'have_https']].head()

| | URL | have_ip | have_at | url_length | have_dash | have_https |
|---|---|---|---|---|---|---|
| 0 | https://www.southbankmosaics.com | 0 | 0 | 32 | 0 | 1 |
| 1 | https://www.uni-mainz.de | 0 | 0 | 24 | 1 | 1 |
| 2 | https://www.voicefmradio.co.uk | 0 | 0 | 30 | 0 | 1 |
| 3 | https://www.sfnmjournal.com | 0 | 0 | 27 | 0 | 1 |
| 4 | https://www.rewildingargentina.org | 0 | 0 | 34 | 0 | 1 |


# Train Your First Machine Learning Model

## ✅ Step 1: Select Features and Target
Pick the features you just created and the target column. In this dataset the target is 'label'; other datasets may call it 'Label', 'Result', or something else.

In [None]:
# Features
X = df[['have_ip', 'have_at', 'url_length', 'have_dash', 'have_https']]

# Target
y = df['label']  # Replace with actual label column name if different

## ✅ Step 2: Split Data (Train/Test)

In [None]:
from sklearn.model_selection import train_test_split

# Split 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
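
The split above is purely random, so the phishing/legitimate ratio can differ slightly between the training and test sets. One common option is to pass `stratify=y` to `train_test_split`, which keeps the class ratio the same in both sets. The sketch below uses made-up toy labels rather than the real dataset, and the names `X_toy` and `y_toy` are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in data (an assumption for illustration):
# 60 legitimate samples (label 1) and 40 phishing samples (label 0)
X_toy = [[i] for i in range(100)]
y_toy = [1] * 60 + [0] * 40

# stratify=y_toy asks scikit-learn to preserve the 60/40 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

On the notebook's data, the equivalent change would be adding `stratify=y` to the existing call. The reported scores would then likely shift slightly.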

## ✅ Step 3: Train a Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize model
model = RandomForestClassifier()

# Train it
model.fit(X_train, y_train)
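
Once a random forest is fitted, scikit-learn exposes `feature_importances_`, which hints at which features the trees rely on most. The sketch below uses randomly generated toy data (an assumption, not the PhiUSIIL dataset) in which the label simply copies `have_https`. The names `X_toy`, `y_toy` and `toy_model` are hypothetical. On the notebook's own model, the same loop would use `model.feature_importances_` and `X.columns`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ['have_ip', 'have_at', 'url_length', 'have_dash', 'have_https']

# Toy stand-in data (an assumption for illustration, not the real dataset):
# random 0/1 flags, with column 2 replaced by a random URL length.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 5))
X_toy[:, 2] = rng.integers(10, 120, size=200)
y_toy = X_toy[:, 4].copy()  # toy label simply copies have_https

toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(X_toy, y_toy)

# Pair each feature with its importance and sort from most to least important
ranked = sorted(zip(feature_names, toy_model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Passing `random_state` to `RandomForestClassifier`, as this sketch does, also makes the training run repeatable from one execution to the next.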

## ✅ Step 4: Test the Model

In [None]:
# Predict on test set
y_pred = model.predict(X_test)

## ✅ Step 5: Check Accuracy

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Detailed report
print(classification_report(y_test, y_pred))

Accuracy: 0.9202273161008503
              precision    recall  f1-score   support

           0       0.96      0.85      0.90     20124
           1       0.90      0.97      0.93     27035

    accuracy                           0.92     47159
   macro avg       0.93      0.91      0.92     47159
weighted avg       0.92      0.92      0.92     47159



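🔥 Fantastic! You just built your first phishing URL detection model, and it reaches about 92% accuracy on the held-out test set using only five simple URL features. That is a strong result for a beginner project, though recall for class 0 (phishing) is 0.85, so roughly 15% of the phishing URLs in the test set were still labeled legitimate.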
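
The classification report can be cross-checked by hand. Recall is the fraction of each class that was predicted correctly, so multiplying each class's recall by its support approximates the number of correct predictions, and dividing by the total support should land near the reported accuracy. The figures below are copied from the report above. Because the recalls are rounded to two decimals, the result is only approximate.

```python
# Values copied from the classification report (recalls are rounded)
recall = {0: 0.85, 1: 0.97}        # 0 = phishing, 1 = legitimate
support = {0: 20124, 1: 27035}     # number of test samples per class

# Approximate number of correct predictions across both classes
correct = sum(recall[c] * support[c] for c in recall)
total = sum(support.values())
approx_accuracy = correct / total

# Phishing URLs (label 0) that the model let through as legitimate
missed_phishing = (1 - recall[0]) * support[0]

print(f"approximate accuracy: {approx_accuracy:.3f}")
print(f"approximately missed phishing URLs: {missed_phishing:.0f}")
```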

# Test with Custom URLs
Now let’s put the model to work on new, unseen URLs, including ones you type in yourself.

## ✅ Step 1: Define a URL Feature Extractor

In [None]:
#Define a URL Feature Extractor

def extract_features_from_url(url):
    return {
        'have_ip': has_ip(url),
        'have_at': has_at_symbol(url),
        'url_length': url_length(url),
        'have_dash': has_dash(url),
        'have_https': has_https(url)
    }

## ✅ Step 2: Try With a Custom URL

In [None]:
#Try With a Custom URL

# Example URL (you can change this!)
new_url = "http://198.51.100.1/login"

# Extract features
features = extract_features_from_url(new_url)

# Convert to DataFrame for prediction
import pandas as pd
test_df = pd.DataFrame([features])

# Predict
prediction = model.predict(test_df)

# Result
if prediction[0] == 1:
    print("🔒 Legitimate website")
else:
    print("⚠️ Phishing website")


⚠️ Phishing website


## More examples to try, like:

https://google.com

http://paypal.login.verify-now.co

https://yourbank.com/login
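
Before asking the model, it can help to look at the raw feature values for these URLs. The sketch below is a compact, self-contained restatement of the five checks (the name `quick_features` is hypothetical). For these three URLs it gives the same values as the Step 1 helpers.

```python
import re

def quick_features(url):
    """Compact, self-contained restatement of the five URL checks."""
    return {
        'have_ip': int(bool(re.match(r"https?://\d+\.\d+\.\d+\.\d+", url))),
        'have_at': int("@" in url),
        'url_length': len(url),
        'have_dash': int("-" in url),
        'have_https': int(url.startswith("https://")),
    }

examples = [
    "https://google.com",
    "http://paypal.login.verify-now.co",
    "https://yourbank.com/login",
]

# Show the feature dictionary the model would receive for each URL
for url in examples:
    print(url, quick_features(url))
```

The second URL is the only one of the three that contains a dash and does not use https.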

In [None]:
import pandas as pd

# 🔎 Example 1
url1 = "https://google.com"
features1 = extract_features_from_url(url1)
df1 = pd.DataFrame([features1])
pred1 = model.predict(df1)

print(f"URL: {url1}")
print("Result:", "🔒 Legitimate" if pred1[0] == 1 else "⚠️ Phishing")

# 🔎 Example 2
url2 = "http://198.51.100.1/login"
features2 = extract_features_from_url(url2)
df2 = pd.DataFrame([features2])
pred2 = model.predict(df2)

print(f"\nURL: {url2}")
print("Result:", "🔒 Legitimate" if pred2[0] == 1 else "⚠️ Phishing")


URL: https://google.com
Result: 🔒 Legitimate

URL: http://198.51.100.1/login
Result: ⚠️ Phishing


Project successfully completed 🎉