🛡️ PhishGuard — Phishing Email Analyzer

A hybrid phishing detection engine combining an 11-rule heuristic scoring system with live VirusTotal threat intelligence from 70+ security engines.

Cybersecurity Portfolio Project — Tony Doumit — March 2026


🔍 Overview

Phishing remains one of the most damaging forms of cybercrime worldwide. Attackers craft deceptive emails that impersonate trusted brands, create artificial urgency, and manipulate recipients into revealing sensitive credentials or transferring funds.

PhishGuard is a web-based phishing email analyser built with Python (Flask) and integrated with the VirusTotal API v3. It accepts raw email text (headers + body), runs it through 11 independent detection checks, queries VirusTotal for live threat intelligence, and returns a final verdict with a confidence level — all in real time.

The detection strategy is deliberately multi-layered. No single check is relied upon to produce a verdict. Instead, each check contributes an independent risk score, and the final verdict emerges from the accumulated weight of evidence. This mirrors how professional threat analysts think:

One suspicious signal may be a coincidence. Five simultaneous signals are a pattern.

Key Stats

| Metric | Value |
| --- | --- |
| Heuristic Checks | 11 |
| Security Engines (VirusTotal) | 70+ |
| Attack Vectors Covered | 6 |
| Test Emails Passed | 5 / 5 |
| False Positives | 0 |
| Max Possible Heuristic Score | 460 pts |

⚙️ How It Works

Raw Email Input (Headers + Body)
          │
          ▼
  Flask Route receives HTTP POST
          │
          ▼
  extract_email_parts()  ──►  From, Reply-To, Subject, Body
          │
          ▼
  11 Heuristic Checks run sequentially
  Each check returns: score contribution + triggered boolean
          │
          ▼
  VirusTotal API v3
  ├── scan_domain_virustotal()   → sender domain reputation
  └── scan_urls_virustotal()     → up to 3 URLs from body
          │
          ▼
  Score Aggregator
  Total Score + Triggered Count → Verdict + Confidence Level
          │
          ▼
  JSON Response → Browser renders result instantly

Helper Functions

| Function | Purpose |
| --- | --- |
| extract_email_parts() | Splits raw email into headers, subject, and body. Handles both user@domain.com and Name <user@domain.com> formats |
| extract_domain() | Extracts the domain from an email address and strips angle brackets |
| get_base_domain() | Strips subdomains, so mail.grammarly.com becomes grammarly.com. Used in Check 6 to prevent false positives |
| normalize_unicode() | Converts Unicode lookalike characters to ASCII equivalents, defeating obfuscation attacks such as "Verify" written in bold Unicode lookalike letters |
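As a rough illustration of two of these helpers, the sketch below shows one way extract_domain() and normalize_unicode() could behave. The bodies are assumptions, not the code in app.py: the regex, the choice of NFKC normalisation, and the sample address are all hypothetical.

```python
import re
import unicodedata


def extract_domain(address: str) -> str:
    """Return the lowercased domain of an address, or '' if none is found."""
    # Handles both 'user@domain.com' and 'Name <user@domain.com>'.
    match = re.search(r"@([A-Za-z0-9.-]+)", address)
    return match.group(1).lower() if match else ""


def normalize_unicode(text: str) -> str:
    """Fold compatibility characters (such as mathematical bold letters) to plain forms."""
    # Assumption: NFKC is used; the real helper may use a custom lookup table.
    return unicodedata.normalize("NFKC", text)


# Mathematical bold letters spelling "Verify", written as escapes for clarity.
styled = "\U0001D415\U0001D41E\U0001D42B\U0001D422\U0001D41F\U0001D432"
print(normalize_unicode(styled))
print(extract_domain("Grammarly <hello@mail.grammarly.com>"))
```

NFKC folds styled Latin letters but does not map cross-script homoglyphs (for example Cyrillic "а"), so a production helper would likely need an extra lookup table for those.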

🔎 Detection Engine: 11 Checks

Quick Reference

| # | Check | Target | Max Score |
| --- | --- | --- | --- |
| 1 | Typosquatting Detection | Sender domain | +40 |
| 2 | Urgency & Threat Language | Subject + Body | +45 |
| 3 | Sensitive Info Requests | Body only | +60 |
| 4 | Suspicious Links & Attachments | Body (URLs) | +50 |
| 5 | Trusted Domain Validation | Sender domain | +20 |
| 6 | Reply-To Mismatch | Email headers | +35 |
| 7 | Brand Impersonation | Body + sender domain | +40 |
| 8 | Too Good to Be True | Body only | +50 |
| 9 | Psychological Manipulation | Body only | +40 |
| 10 | Poor Formatting | Body only | +30 |
| 11 | Money Scam Indicators | Body only | +50 |
| | Maximum possible heuristic score | | +460 |

✅ Check 1 — Typosquatting Detection +40 pts

Detects domains where letters are replaced with visually similar numbers or character pairs to impersonate trusted brands.

Examples: paypa1.com · g00gle.com · micros0ft.com

Substitution map applied during normalisation:

| Character | Normalised To |
| --- | --- |
| 0 | o |
| 1 | l |
| 3 | e |
| rn | m |
| vv | w |

Method: The sender domain is normalised using the substitution map, then compared against the trusted domain whitelist. A match on the normalised domain (but not the original) confirms typosquatting.
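The method above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the whitelist is a three-domain subset, and multi-character substitutions are applied first.

```python
# Substitution map from the table above; multi-character pairs are applied first.
SUBSTITUTIONS = [("rn", "m"), ("vv", "w"), ("0", "o"), ("1", "l"), ("3", "e")]

# Subset of the trusted whitelist, for illustration only.
TRUSTED_DOMAINS = {"paypal.com", "google.com", "microsoft.com"}


def normalise_domain(domain: str) -> str:
    """Apply every lookalike substitution to a lowercased domain."""
    normalised = domain.lower()
    for lookalike, real in SUBSTITUTIONS:
        normalised = normalised.replace(lookalike, real)
    return normalised


def typosquatting_score(domain: str) -> int:
    """Return 40 when only the normalised form matches a trusted domain."""
    original = domain.lower()
    if original in TRUSTED_DOMAINS:
        return 0  # the genuine domain is not a typosquat
    return 40 if normalise_domain(original) in TRUSTED_DOMAINS else 0


for candidate in ("paypa1.com", "g00gle.com", "micros0ft.com", "paypal.com"):
    print(candidate, typosquatting_score(candidate))
```

Checking the original domain first is what keeps the genuine paypal.com from being flagged by its own normalised form.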


✅ Check 2 — Urgency & Threat Language +15 pts/keyword, cap +45

Scans the subject line and body for keywords that create fear or artificial time pressure — a core social engineering tactic.

20 trigger keywords: act now · immediately · urgent · 24 hours · suspended · locked · compromised · legal action · account will be closed · verify your identity · your account has been · confirm your information · unusual activity · unauthorized access · security alert · immediate action required · failure to respond · final notice · last warning · your account is at risk

Method: Unicode normalisation → keyword matching on subject + body.
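A capped keyword scorer along these lines could serve Check 2, and with different keyword lists and point values the other keyword-based checks as well. The sketch assumes each distinct keyword counts once and that matching is plain substring search; keyword_score and the shortened keyword list are illustrative names, not taken from app.py.

```python
import unicodedata

# A few of the Check 2 keywords, for illustration only.
URGENCY_KEYWORDS = ["act now", "urgent", "suspended", "your account has been"]


def keyword_score(text, keywords, points_per_hit, cap):
    """Return (score, matched_keywords) with the score limited to cap."""
    # Normalise first so styled Unicode letters cannot hide a keyword.
    lowered = unicodedata.normalize("NFKC", text).lower()
    hits = [keyword for keyword in keywords if keyword in lowered]
    return min(len(hits) * points_per_hit, cap), hits


email_text = "URGENT: your account has been suspended. Act now!"
score, hits = keyword_score(email_text, URGENCY_KEYWORDS, 15, 45)
print(score, hits)
```

Substring matching is simple but also fires on words that merely contain a keyword (for example "insurgent" contains "urgent"); word-boundary regexes would be a stricter alternative.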


✅ Check 3 — Sensitive Information Requests +30 pts/keyword, cap +60

Scans for explicit requests for private, financial, or authentication data — the primary objective of most phishing attacks.

20 trigger keywords: password · credit card · ssn · cvv · bank account · billing information · social security · date of birth · mother's maiden name · pin number · card number · account number · routing number · login credentials · username and password · security question · verify your account · update your payment · confirm your details · enter your information


✅ Check 4 — Suspicious Links & Attachments +25 pts/finding, cap +50

Scans for malicious URL patterns and dangerous attachment indicators.

| Pattern | Example |
| --- | --- |
| Raw IP addresses in links | http://192.168.1.1/login |
| URL shorteners | bit.ly, tinyurl.com, t.co, goo.gl, ow.ly |
| Suspicious TLDs | .xyz, .tk, .ml, .ga, .cf, .gq |
| Multiple hyphens in URLs | paypal-secure-login-verify.com |
| Attachment keywords | open attached, download file, see attachment |
| Brand lookalike patterns | apple-, paypal-, secure-login, account-verify |
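The URL-based rows of this table could be checked roughly as below. The sketch covers only four of the six patterns, and the regexes, the "two or more hyphens" rule, and the one-finding-per-pattern-per-URL counting are assumptions rather than the project's actual logic.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+")
IP_HOST_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "ow.ly"}
SUSPICIOUS_TLDS = (".xyz", ".tk", ".ml", ".ga", ".cf", ".gq")


def link_findings(body: str) -> list:
    """Return (pattern_name, url) pairs for every suspicious URL trait found."""
    findings = []
    for url in URL_RE.findall(body):
        host = (urlparse(url).hostname or "").lower()
        if IP_HOST_RE.match(host):
            findings.append(("raw_ip", url))
        if host in SHORTENERS:
            findings.append(("shortener", url))
        if host.endswith(SUSPICIOUS_TLDS):
            findings.append(("suspicious_tld", url))
        if host.count("-") >= 2:  # assumed meaning of "multiple hyphens"
            findings.append(("multiple_hyphens", url))
    return findings


def link_score(body: str) -> int:
    """+25 per finding, capped at +50 as in the Check 4 heading."""
    return min(len(link_findings(body)) * 25, 50)


print(link_score("Log in at http://192.168.1.1/login or https://bit.ly/abc"))
```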

✅ Check 5 — Trusted Domain Validation +20 pts if NOT in list

Checks whether the sender domain is in a curated whitelist of 20 trusted domains.

Whitelist: gmail.com · outlook.com · hotmail.com · yahoo.com · icloud.com · apple.com · paypal.com · amazon.com · microsoft.com · google.com · github.com · linkedin.com · twitter.com · facebook.com · netflix.com · dropbox.com · slack.com · zoom.us · shopify.com · stripe.com


✅ Check 6 — Reply-To Mismatch +35 pts

Detects when the Reply-To domain differs from the From domain — a classic phishing technique where the email appears to come from a trusted sender, but replies go to the attacker.

Method: Base-domain comparison to prevent false positives. mail.grammarly.com and grammarly.com share the base domain grammarly.com and are treated as matching.
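The base-domain comparison might look like the sketch below. It assumes the base domain is simply the last two labels, which is wrong for suffixes such as .co.uk; the function bodies and the example attacker domain are hypothetical.

```python
def get_base_domain(domain: str) -> str:
    """Keep only the last two labels: mail.grammarly.com -> grammarly.com."""
    labels = domain.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def reply_to_mismatch_score(from_domain: str, reply_to_domain: str) -> int:
    """Return 35 when Reply-To points at a different base domain than From."""
    if not reply_to_domain:
        return 0  # no Reply-To header, nothing to compare
    same = get_base_domain(from_domain) == get_base_domain(reply_to_domain)
    return 0 if same else 35


print(reply_to_mismatch_score("mail.grammarly.com", "grammarly.com"))
print(reply_to_mismatch_score("paypal.com", "secure-helpdesk.xyz"))
```

A public-suffix list would be needed to treat multi-part suffixes correctly.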


✅ Check 7 — Brand Impersonation +40 pts

Detects when an email body mentions a known brand but the sender domain is not that brand's official domain. Exact domain matching enforced — appleid-support.com ≠ apple.com.

Monitored brands: PayPal · Apple · Amazon · Microsoft · Netflix · Google · Facebook · Instagram · DHL · FedEx · Bank of America · Chase
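One possible shape for this check is sketched below. The brand-to-domain mapping is an assumption (the README lists brand names but not their official domains), only three brands are included, and whole-word matching is an illustrative choice.

```python
import re

# Hypothetical mapping from monitored brand name to its official domain.
BRAND_DOMAINS = {
    "paypal": "paypal.com",
    "apple": "apple.com",
    "netflix": "netflix.com",
}


def brand_impersonation_score(body: str, sender_domain: str) -> int:
    """Return 40 if the body names a brand the sender domain does not belong to."""
    text = body.lower()
    sender = sender_domain.lower()
    for brand, official_domain in BRAND_DOMAINS.items():
        mentioned = re.search(rf"\b{re.escape(brand)}\b", text) is not None
        if mentioned and sender != official_domain:  # exact match only
            return 40
    return 0


print(brand_impersonation_score("Your Apple ID was locked", "appleid-support.com"))
print(brand_impersonation_score("Your Apple ID was locked", "apple.com"))
```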


✅ Check 8 — Too Good to Be True +25 pts/keyword, cap +50

Scans for lottery, prize, and inheritance scam language.

20 trigger keywords: you have won · you've been selected · free gift · lucky winner · congratulations you · inheritance · lottery · million dollars · unclaimed prize · cash reward · you are the winner · selected for reward · claim your prize · gift card · exclusive offer · you have been chosen · jackpot · sweepstakes · won a competition · awarded to you


✅ Check 9 — Psychological Manipulation +20 pts/keyword, cap +40

Detects secrecy and isolation tactics that prevent victims from consulting trusted people before complying with fraudulent requests.

15 trigger keywords: don't tell anyone · keep this confidential · this is between us · private offer · do not share · tell no one · strictly confidential · between you and me · do not discuss · keep this secret · do not contact your bank · do not inform · only you can see this · exclusive to you · do not reply to anyone else


✅ Check 10 — Poor Formatting cap +30 pts

Detects aggressive formatting patterns disproportionately common in phishing emails.

| Pattern | Score |
| --- | --- |
| 3+ consecutive exclamation marks (!!!) | +10 pts |
| 3+ consecutive question marks (???) | +10 pts |
| 3+ ALL CAPS words in body | +10 pts |
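These three patterns map naturally onto regular expressions. The sketch below assumes that "ALL CAPS words" means words of at least two uppercase letters counted anywhere in the body, not necessarily adjacent; the README does not spell this out.

```python
import re


def formatting_score(body: str) -> int:
    """+10 for each aggressive formatting pattern present (maximum 30)."""
    score = 0
    if re.search(r"!{3,}", body):   # three or more consecutive '!'
        score += 10
    if re.search(r"\?{3,}", body):  # three or more consecutive '?'
        score += 10
    # Assumption: words of 2+ capital letters, so 'I' and 'A' are ignored.
    if len(re.findall(r"\b[A-Z]{2,}\b", body)) >= 3:
        score += 10
    return score


print(formatting_score("ACT NOW OR LOSE ACCESS!!! Why???"))
```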

✅ Check 11 — Money Scam Indicators +25 pts/keyword, cap +50

Detects financial scam language and suspicious dollar amounts targeting advance-fee fraud and government impersonation scams.

20 trigger keywords: stimulus · unclaimed funds · pending approval · financial benefit · verify eligibility · government grant · tax refund · wire transfer · western union · money gram · advance fee · processing fee · release fee · inheritance funds · beneficiary · next of kin · transfer of funds · bank transfer · financial assistance · claim your funds

Regex detection: Also detects irregular dollar amounts like $7,452 or $84,500 commonly used to add false credibility to scam emails.
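The README does not give the regex itself, so the pattern below is only a guess at what "irregular dollar amounts" could mean in practice: a dollar sign followed by a comma-grouped figure, with optional cents. Both DOLLAR_RE and find_dollar_amounts are hypothetical names.

```python
import re

# Assumed pattern: '$' plus comma-grouped digits, e.g. $7,452 or $84,500.00.
DOLLAR_RE = re.compile(r"\$\d{1,3}(?:,\d{3})+(?:\.\d{2})?")


def find_dollar_amounts(body: str) -> list:
    """Return every comma-grouped dollar amount found in the body."""
    return DOLLAR_RE.findall(body)


print(find_dollar_amounts("You are owed $7,452 and a further $84,500 today."))
```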


Attack Vectors Coverage

| Category | Checks |
| --- | --- |
| Social Engineering | Checks 2, 8, 9 |
| Domain Spoofing & Header Forgery | Checks 1, 5, 6 |
| Credential & Financial Fraud | Checks 3, 7, 11 |
| Technical Deception | Check 4 |
| Quality Signal | Check 10 |

🔬 VirusTotal Integration

Two dedicated functions query the VirusTotal API v3 on top of the 11 heuristic checks.

Domain Scan — scan_domain_virustotal(domain)

Sends the sender domain to VirusTotal and checks reputation against 70+ security engines.

| Result | Score | Condition |
| --- | --- | --- |
| Malicious | +50 pts | One or more engines flag as malicious |
| Suspicious | +25 pts | One or more engines flag as suspicious |
| Clean | 0 pts | No engines flag the domain |

URL Scan — scan_urls_virustotal(body)

Extracts all URLs from the body and scans up to 3 URLs per email. Each URL is encoded as unpadded URL-safe base64 to form its identifier, as the VirusTotal API v3 requires.

| Result | Score | Condition |
| --- | --- | --- |
| Malicious URL | +50 pts | One or more engines flag as malicious |
| Suspicious URL | +25 pts | One or more engines flag as suspicious |
| Clean | 0 pts | No engines flag the URL |

Both functions use try/except blocks for graceful error handling. The API key is stored in .env and never committed to version control.
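The sketch below illustrates the three pieces described in this section: building a URL identifier, turning an engine tally into a score, and failing gracefully. It is not the project's code. It uses urllib from the standard library instead of requests so that it has no dependencies, and the names vt_url_id, vt_score and fetch_stats are hypothetical. The endpoint paths, the x-apikey header and the last_analysis_stats field follow the public VirusTotal API v3 documentation.

```python
import base64
import json
import urllib.request

VT_BASE = "https://www.virustotal.com/api/v3"


def vt_url_id(url: str) -> str:
    """Unpadded URL-safe base64 of the URL, used as its v3 identifier."""
    return base64.urlsafe_b64encode(url.encode()).decode().strip("=")


def vt_score(stats: dict) -> int:
    """Map an engine tally to the scores in the tables above."""
    if stats.get("malicious", 0) >= 1:
        return 50
    if stats.get("suspicious", 0) >= 1:
        return 25
    return 0


def fetch_stats(path: str, api_key: str, timeout: float = 10.0):
    """GET e.g. 'domains/example.com' or 'urls/<id>'; return None on any failure."""
    request = urllib.request.Request(f"{VT_BASE}/{path}", headers={"x-apikey": api_key})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
        return payload["data"]["attributes"]["last_analysis_stats"]
    except (OSError, KeyError, ValueError):
        return None  # network error, bad key, or unexpected payload


print(vt_url_id("http://example.com/login?x=1"))
print(vt_score({"malicious": 3, "suspicious": 1, "harmless": 60}))
```

A caller would treat a None result from fetch_stats as "no VirusTotal data" and add 0 points, which matches the heuristics-only fallback described in the setup notes.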


📊 Scoring & Verdict System

Verdict Thresholds

Score:    0 ──────────── 30 ──────────── 70 ──────────────► ∞
                  │               │
          ✅ SAFE          ⚠️ SUSPICIOUS        🚨 SCAM
         (0 – 30)          (31 – 70)            (71+)

| Score Range | Verdict | Interpretation |
| --- | --- | --- |
| 0 – 30 | ✅ SAFE | No significant risk signals detected |
| 31 – 70 | ⚠️ SUSPICIOUS | Some risk signals present; treat with caution |
| 71+ | 🚨 SCAM | Strong indicators of a phishing or scam email |

Confidence Levels

Confidence is determined by the number of checks triggered, independently of the total score.

| Checks Triggered | Confidence | Notes |
| --- | --- | --- |
| 5 or more | 🔴 High | Multiple independent signals; verdict is reliable |
| 3 – 4 | 🟠 Medium | Corroborating evidence present |
| 0 – 2 | 🟢 Low | Limited signals; manual review advisable |
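The verdict thresholds and confidence levels can be combined in a small aggregator. The sketch below assumes each check reports a (score, triggered) pair, as the flow diagram describes; the function names and the result dictionary layout are illustrative.

```python
from typing import Iterable, Tuple


def verdict_for(score: int) -> str:
    """0-30 is SAFE, 31-70 is SUSPICIOUS, 71 and above is SCAM."""
    if score <= 30:
        return "SAFE"
    if score <= 70:
        return "SUSPICIOUS"
    return "SCAM"


def confidence_for(triggered: int) -> str:
    """Confidence depends only on how many checks fired, not on the score."""
    if triggered >= 5:
        return "High"
    if triggered >= 3:
        return "Medium"
    return "Low"


def aggregate(results: Iterable[Tuple[int, bool]]) -> dict:
    """Sum per-check scores and count the checks that triggered."""
    results = list(results)
    total = sum(score for score, _ in results)
    triggered = sum(1 for _, fired in results if fired)
    return {
        "score": total,
        "triggered": triggered,
        "verdict": verdict_for(total),
        "confidence": confidence_for(triggered),
    }


print(aggregate([(50, True), (45, True), (20, True), (0, False)]))
```

Because the two scales are independent, a high score from few checks yields a SCAM verdict with only Medium or Low confidence.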

🧪 Test Results

All 5 test emails were classified correctly with zero false positives.

| Email Tested | Verdict | Score | Confidence | Result |
| --- | --- | --- | --- | --- |
| PayPal phishing | 🚨 SCAM | 345 | High | ✅ Pass |
| Apple phishing (subtle) | 🚨 SCAM | 150 | High | ✅ Pass |
| Grammarly legitimate | ✅ SAFE | 20 | Low | ✅ Pass |
| GitHub security notice | ✅ SAFE | 0 | Low | ✅ Pass |
| Stimulus / financial scam | 🚨 SCAM | 110 | Medium | ✅ Pass |

Score Visualization

PayPal phishing        ████████████████████████████████████████ 345
Apple phishing         ████████████████████ 150
Stimulus scam          ████████████████ 110
Grammarly legit        ███ 20
GitHub notice          0

                       │    │
                      30   70   ← Verdict thresholds

Test Design Rationale

  • PayPal phishing (345, High): Designed to trigger as many checks as possible simultaneously. Confirms that scores accumulate correctly across independent checks and that the VirusTotal score is added on top of the heuristic total.
  • Apple phishing, subtle (150, High): Simulates a careful attacker who avoids aggressive language. Detection relies entirely on structural signals — typosquatted domain, brand impersonation, and VirusTotal flag. Confirms the engine catches sophisticated attacks without keyword triggers.
  • Grammarly legitimate (20, Low): Real-world marketing email. Tests false-positive resistance. Only Check 5 fires (+20) because grammarly.com is not in the whitelist — score stays well below the SUSPICIOUS threshold of 31.
  • GitHub security notice (0, Low): Clean-room baseline. A well-structured email from a whitelisted domain with clean URLs scores exactly 0. Nothing triggers, nothing fires.
  • Stimulus scam (110, Medium): Written without brand names or URL shorteners to test whether Check 11 and Check 2 can carry the score into SCAM territory (their combined cap of 95 pts already exceeds the 71-point threshold). They do, confirming that financial scams are caught without domain or brand signals.

📁 Project Structure

phishguard/
├── app.py                        # Main backend: Flask + full detection engine
├── .env                          # VirusTotal API key (VT_API_KEY); excluded from version control
├── .gitignore                    # Excludes .env and venv from version control
├── Complete Detection Rules.md   # Full detection rules reference document
├── Project Progress.txt          # Development progress log
├── Tony Doumit Phishguard.pdf    # 19-page professional technical report (LaTeX)
└── templates/
    └── index.html                # Frontend web interface

🚀 How to Run

Prerequisites

Setup

# 1. Clone the repository
git clone https://github.com/Doumit04/PhishGuard.git
cd PhishGuard

# 2. Create and activate virtual environment
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

# 3. Install dependencies
pip install flask requests python-dotenv

# 4. Create your .env file
# Create a file named .env in the project root and add:
# VT_API_KEY=your_virustotal_api_key_here

# 5. Start the Flask server
python app.py

# 6. Open in browser
# http://127.0.0.1:5000

⚠️ Important: The .env file must exist and contain a valid VT_API_KEY before starting the server. Without it, VirusTotal integration will fail silently and only the 11 heuristic checks will run.
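To make a missing key visible rather than silent, a startup check along the following lines could be used. This is a sketch under assumptions: get_vt_api_key is a hypothetical helper, and load_dotenv() from python-dotenv is called only if the package is installed so that the snippet also runs without it.

```python
import os
from typing import Mapping, Optional

try:
    # python-dotenv copies values from a local .env file into os.environ.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # fall back to variables already present in the environment


def get_vt_api_key(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the VirusTotal key, or None if it is unset or blank."""
    key = environ.get("VT_API_KEY", "").strip()
    return key or None


if get_vt_api_key() is None:
    print("VT_API_KEY not set: only the heuristic checks will run.")
```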

Usage

Paste any raw email (headers + body) into the interface and click Analyze Email. Results are displayed instantly including:

  • Total risk score
  • Triggered checks with individual scores
  • VirusTotal domain and URL findings
  • Final verdict badge (SAFE / SUSPICIOUS / SCAM)
  • Confidence level

📦 Dependencies

| Package | Source | Purpose |
| --- | --- | --- |
| flask | pip | Web framework and HTTP routing |
| requests | pip | HTTP calls to VirusTotal API |
| python-dotenv | pip | Secure loading of .env variables |
| re | stdlib | Regex for URL extraction and header parsing |
| base64 | stdlib | URL encoding required by VirusTotal API v3 |
| unicodedata | stdlib | Unicode normalisation to defeat obfuscation |

pip install flask requests python-dotenv

⚠️ Known Limitations

  1. Static trusted domain list: The 20-domain whitelist cannot cover every legitimate sender. Emails from unlisted legitimate companies receive +20 pts, potentially pushing borderline emails to SUSPICIOUS. Partially mitigated by VirusTotal domain reputation.

  2. Spear-phishing blind spot: A perfectly crafted targeted email that avoids all keyword triggers, uses a clean domain, and contains no suspicious links may score low despite being malicious. This is a known limitation of all rule-based detection systems.

  3. VirusTotal free tier rate limits: The free API tier caps at 4 requests/min, so only the first 3 URLs per email are scanned. Malicious links beyond the third are missed.

  4. Newly registered domains: A malicious domain created within the past 24–48 hours may not yet appear in VirusTotal's crowd-sourced reports, resulting in a clean score.


🔮 Planned Improvements (Version 2)

| Improvement | Purpose | Limitation Addressed |
| --- | --- | --- |
| Live DNS / WHOIS domain age detection | Flag domains registered in the past 30 days | Newly registered malicious domains |
| DKIM / SPF header authentication | Verify sender is authorised to send on behalf of domain | New signal class, not in v1 |
| Paid VirusTotal tier | Scan all URLs, no rate limit | Free tier URL scanning cap |
| Machine learning classifier | Detect zero-day phishing patterns via supervised learning | Spear-phishing keyword bypass |
| PDF report export | Download full analysis report for incident response | Documentation workflow |

📄 Documentation

A full 19-page professional technical report was produced in LaTeX alongside this project, covering:

  • System architecture and data flow diagrams
  • Full detection engine specification for all 11 checks
  • VirusTotal integration design
  • Complete test methodology and results
  • Known limitations and Version 2 roadmap

The report is included in this repository as Tony Doumit Phishguard.pdf.


👤 Author

Tony Doumit

PhishGuard · Cybersecurity Portfolio Project · March 2026
