🛡️ PhishGuard — Phishing Email Analyzer

A hybrid phishing detection engine combining an 11-rule heuristic scoring system with live VirusTotal threat intelligence from 70+ security engines.

Cybersecurity Portfolio Project — Tony Doumit — March 2026

📌 Table of Contents

Overview
How It Works
Detection Engine: 11 Checks
VirusTotal Integration
Scoring & Verdict System
Test Results
Project Structure
How to Run
Dependencies
Known Limitations
Planned Improvements (Version 2)

🔍 Overview

Phishing remains one of the most damaging forms of cybercrime worldwide. Attackers craft deceptive emails that impersonate trusted brands, create artificial urgency, and manipulate recipients into revealing sensitive credentials or transferring funds.

PhishGuard is a web-based phishing email analyser built with Python (Flask) and integrated with the VirusTotal API v3. It accepts raw email text (headers + body), runs it through 11 independent detection checks, queries VirusTotal for live threat intelligence, and returns a final verdict with a confidence level — all in real time.

The detection strategy is deliberately multi-layered. No single check is relied upon to produce a verdict. Instead, each check contributes an independent risk score, and the final verdict emerges from the accumulated weight of evidence. This mirrors how professional threat analysts think:

One suspicious signal may be a coincidence. Five simultaneous signals are a pattern.

Key Stats

Metric	Value
Heuristic Checks	11
Security Engines (VirusTotal)	70+
Attack Vectors Covered	6
Test Emails	5 / 5 Passed
False Positives	0
Max Possible Heuristic Score	460 pts

⚙️ How It Works

Raw Email Input (Headers + Body)
          │
          ▼
  Flask Route receives HTTP POST
          │
          ▼
  extract_email_parts()  ──►  From, Reply-To, Subject, Body
          │
          ▼
  11 Heuristic Checks run sequentially
  Each check returns: score contribution + triggered boolean
          │
          ▼
  VirusTotal API v3
  ├── scan_domain_virustotal()   → sender domain reputation
  └── scan_urls_virustotal()     → up to 3 URLs from body
          │
          ▼
  Score Aggregator
  Total Score + Triggered Count → Verdict + Confidence Level
          │
          ▼
  JSON Response → Browser renders result instantly
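The aggregation step in the flow above can be sketched as follows. This is a minimal illustration, assuming each check is a function that takes the parsed email as a dict and returns a score plus a triggered flag. `CheckResult`, `run_checks`, and the two stand-in checks are hypothetical names and are not taken from `app.py`.

```python
from typing import Callable, NamedTuple

class CheckResult(NamedTuple):
    score: int        # points this check contributes
    triggered: bool   # whether the check fired at all

Check = Callable[[dict], CheckResult]

def run_checks(email: dict, checks: list[Check]) -> tuple[int, int]:
    """Run checks in order; return (total score, number of checks triggered)."""
    total, triggered = 0, 0
    for check in checks:
        result = check(email)
        total += result.score
        triggered += int(result.triggered)
    return total, triggered

# Stand-in checks for illustration only (hypothetical, simplified rules).
def urgency_check(email: dict) -> CheckResult:
    hit = "urgent" in email.get("subject", "").lower()
    return CheckResult(15 if hit else 0, hit)

def secrecy_check(email: dict) -> CheckResult:
    hit = "tell no one" in email.get("body", "").lower()
    return CheckResult(20 if hit else 0, hit)

total, fired = run_checks(
    {"subject": "URGENT notice", "body": "Please tell no one."},
    [urgency_check, secrecy_check],
)
```

Keeping the score and the triggered count separate is what lets the verdict and the confidence level be derived independently later on.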

Helper Functions

Function	Purpose
`extract_email_parts()`	Splits raw email into headers, subject, and body. Handles both `user@domain.com` and `Name <user@domain.com>` formats
`extract_domain()`	Extracts domain from email address, strips angle brackets
`get_base_domain()`	Strips subdomains → `mail.grammarly.com` becomes `grammarly.com`. Used in Check 6 to prevent false positives
`normalize_unicode()`	Converts Unicode lookalike characters to ASCII equivalents, defeating obfuscation attacks such as styled (mathematical bold) Unicode letters spelling "Verify" → `Verify`
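A minimal sketch of how three of these helpers might behave. The function bodies in `app.py` are not shown in this README, so the implementations below are assumptions: NFKC folding for `normalize_unicode()` and a naive "last two labels" rule for `get_base_domain()`.

```python
import unicodedata

def normalize_unicode(text: str) -> str:
    """Fold compatibility characters (e.g. mathematical bold letters) to ASCII."""
    folded = unicodedata.normalize("NFKC", text)
    # Drop anything that is still non-ASCII after folding.
    return folded.encode("ascii", "ignore").decode("ascii")

def extract_domain(address: str) -> str:
    """Return the lowercased domain, tolerating the 'Name <user@host>' form."""
    address = address.strip()
    if "<" in address and ">" in address:
        address = address[address.index("<") + 1 : address.index(">")]
    return address.rsplit("@", 1)[-1].strip().lower()

def get_base_domain(domain: str) -> str:
    """Keep the last two labels (assumption: ignores multi-part TLDs like co.uk)."""
    labels = domain.lower().strip(".").split(".")
    return ".".join(labels[-2:])
```

For example, `extract_domain("Support <help@mail.grammarly.com>")` yields `mail.grammarly.com`, which `get_base_domain()` then reduces to `grammarly.com`. A production version would likely need a public-suffix list to handle domains such as `example.co.uk`.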

🔎 Detection Engine: 11 Checks

Quick Reference

#	Check	Target	Max Score
1	Typosquatting Detection	Sender domain	+40
2	Urgency & Threat Language	Subject + Body	+45
3	Sensitive Info Requests	Body only	+60
4	Suspicious Links & Attachments	Body (URLs)	+50
5	Trusted Domain Validation	Sender domain	+20
6	Reply-To Mismatch	Email headers	+35
7	Brand Impersonation	Body + sender domain	+40
8	Too Good to Be True	Body only	+50
9	Psychological Manipulation	Body only	+40
10	Poor Formatting	Body only	+30
11	Money Scam Indicators	Body only	+50
	Maximum possible heuristic score		+460

✅ Check 1 — Typosquatting Detection `+40 pts`

Detects domains where letters are replaced with visually similar numbers or character pairs to impersonate trusted brands.

Examples: paypa1.com · g00gle.com · micros0ft.com

Substitution map applied during normalisation:

Character	Normalised To
`0`	`o`
`1`	`l`
`3`	`e`
`rn`	`m`
`vv`	`w`

Method: The sender domain is normalised using the substitution map, then compared against the trusted domain whitelist. A match on the normalised domain (but not the original) confirms typosquatting.
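The method above could look roughly like this. It is a sketch, not the project's code: `TRUSTED` is a three-entry stand-in for the full whitelist, and `normalise_domain` and `is_typosquat` are hypothetical names.

```python
# Substitution map from the table above: lookalike -> intended letter(s).
SUBSTITUTIONS = {"0": "o", "1": "l", "3": "e", "rn": "m", "vv": "w"}

# Assumption: a small subset of the trusted whitelist, for illustration.
TRUSTED = {"paypal.com", "google.com", "microsoft.com"}

def normalise_domain(domain: str) -> str:
    """Apply every substitution to the lowercased domain."""
    out = domain.lower()
    for lookalike, real in SUBSTITUTIONS.items():
        out = out.replace(lookalike, real)
    return out

def is_typosquat(domain: str) -> bool:
    """True when only the normalised form matches a trusted domain."""
    domain = domain.lower()
    if domain in TRUSTED:
        return False  # the genuine domain is not a typosquat
    return normalise_domain(domain) in TRUSTED

def typosquat_score(domain: str) -> int:
    return 40 if is_typosquat(domain) else 0
```

Checking the original domain first matters: without that guard, a trusted domain that is unchanged by normalisation would be flagged as impersonating itself.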

✅ Check 2 — Urgency & Threat Language `+15 pts/keyword, cap +45`

Scans the subject line and body for keywords that create fear or artificial time pressure — a core social engineering tactic.

20 trigger keywords: act now · immediately · urgent · 24 hours · suspended · locked · compromised · legal action · account will be closed · verify your identity · your account has been · confirm your information · unusual activity · unauthorized access · security alert · immediate action required · failure to respond · final notice · last warning · your account is at risk

Method: Unicode normalisation → keyword matching on subject + body.
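The per-keyword scoring with a cap can be sketched as below. Two assumptions are made: each distinct keyword counts once, and matching is a case-insensitive substring test. `keyword_score` is a hypothetical helper, and the keyword tuple is a short excerpt of the list above.

```python
# Excerpt of the Check 2 keyword list, for illustration.
URGENCY_KEYWORDS = ("act now", "immediately", "urgent", "suspended", "final notice")

def keyword_score(text: str, keywords, points_per_hit: int, cap: int):
    """Return (capped score, list of matched keywords)."""
    lowered = text.lower()
    hits = [kw for kw in keywords if kw in lowered]
    return min(len(hits) * points_per_hit, cap), hits

score, hits = keyword_score(
    "URGENT: your account is suspended. Act now!",
    URGENCY_KEYWORDS, points_per_hit=15, cap=45,
)
```

The same helper shape would fit the other keyword checks by changing the keyword list, the points per hit, and the cap (for example 30 and 60 for Check 3).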

✅ Check 3 — Sensitive Information Requests `+30 pts/keyword, cap +60`

Scans for explicit requests for private, financial, or authentication data — the primary objective of most phishing attacks.

20 trigger keywords: password · credit card · ssn · cvv · bank account · billing information · social security · date of birth · mother's maiden name · pin number · card number · account number · routing number · login credentials · username and password · security question · verify your account · update your payment · confirm your details · enter your information

✅ Check 4 — Suspicious Links & Attachments `+25 pts/finding, cap +50`

Scans for malicious URL patterns and dangerous attachment indicators.

Pattern	Example
Raw IP addresses in links	`http://192.168.1.1/login`
URL shorteners	`bit.ly`, `tinyurl.com`, `t.co`, `goo.gl`, `ow.ly`
Suspicious TLDs	`.xyz`, `.tk`, `.ml`, `.ga`, `.cf`, `.gq`
Multiple hyphens in URLs	`paypal-secure-login-verify.com`
Attachment keywords	`open attached`, `download file`, `see attachment`
Brand lookalike patterns	`apple-`, `paypal-`, `secure-login`, `account-verify`
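Four of the URL patterns in the table above can be sketched with the standard library. This is an illustrative approximation: the regexes and the "two or more hyphens in the hostname" threshold are assumptions, and the attachment and brand-lookalike rows are omitted for brevity.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
IP_HOST_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "ow.ly"}
SUSPICIOUS_TLDS = (".xyz", ".tk", ".ml", ".ga", ".cf", ".gq")

def link_findings(body: str) -> list[str]:
    """Return one human-readable finding per suspicious URL pattern."""
    findings = []
    for url in URL_RE.findall(body):
        host = (urlparse(url).hostname or "").lower()
        if IP_HOST_RE.match(host):
            findings.append(f"raw IP address: {url}")
        if host in SHORTENERS:
            findings.append(f"URL shortener: {url}")
        if host.endswith(SUSPICIOUS_TLDS):
            findings.append(f"suspicious TLD: {url}")
        if host.count("-") >= 2:  # assumed threshold for "multiple hyphens"
            findings.append(f"multiple hyphens: {url}")
    return findings

def link_score(body: str) -> int:
    """+25 per finding, capped at +50."""
    return min(25 * len(link_findings(body)), 50)
```

Parsing the hostname with `urlparse` rather than matching on the whole URL avoids flagging hyphens or digits that appear only in the path or query string.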

✅ Check 5 — Trusted Domain Validation `+20 pts if NOT in list`

Checks whether the sender domain is in a curated whitelist of 20 trusted domains.

Whitelist: gmail.com · outlook.com · hotmail.com · yahoo.com · icloud.com · apple.com · paypal.com · amazon.com · microsoft.com · google.com · github.com · linkedin.com · twitter.com · facebook.com · netflix.com · dropbox.com · slack.com · zoom.us · shopify.com · stripe.com

✅ Check 6 — Reply-To Mismatch `+35 pts`

Detects when the Reply-To domain differs from the From domain — a classic phishing technique where the email appears to come from a trusted sender, but replies go to the attacker.

Method: Base-domain comparison to prevent false positives. mail.grammarly.com and grammarly.com share the base domain grammarly.com and are treated as matching.
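A short sketch of the base-domain comparison, assuming a missing Reply-To header scores 0 and using a naive "last two labels" rule for the base domain. `base_domain` and `reply_to_mismatch_score` are hypothetical names.

```python
from typing import Optional

def base_domain(domain: str) -> str:
    """Naive base domain: the last two dot-separated labels."""
    return ".".join(domain.lower().strip(".").split(".")[-2:])

def reply_to_mismatch_score(from_domain: str, reply_to_domain: Optional[str]) -> int:
    """+35 when Reply-To points at a different base domain than From."""
    if not reply_to_domain:
        return 0  # assumption: no Reply-To header means nothing to compare
    mismatch = base_domain(from_domain) != base_domain(reply_to_domain)
    return 35 if mismatch else 0
```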

✅ Check 7 — Brand Impersonation `+40 pts`

Detects when an email body mentions a known brand but the sender domain is not that brand's official domain. Exact domain matching enforced — appleid-support.com ≠ apple.com.

Monitored brands: PayPal · Apple · Amazon · Microsoft · Netflix · Google · Facebook · Instagram · DHL · FedEx · Bank of America · Chase
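One way to express this check is a brand-to-domain map plus a whole-word search. The mapping below is an assumption covering three of the monitored brands, and `brand_impersonation` is a hypothetical name.

```python
import re

# Assumption: official domain per brand, shown for three brands only.
BRAND_DOMAINS = {
    "paypal": "paypal.com",
    "apple": "apple.com",
    "netflix": "netflix.com",
}

def brand_impersonation(body: str, sender_domain: str):
    """Return (score, brand) for the first brand named by a non-official sender."""
    lowered = body.lower()
    sender = sender_domain.lower()
    for brand, official in BRAND_DOMAINS.items():
        mentioned = re.search(rf"\b{re.escape(brand)}\b", lowered)
        if mentioned and sender != official:  # exact domain match required
            return 40, brand
    return 0, None
```

The word-boundary match keeps unrelated words such as "pineapple" from counting as a mention of Apple.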

✅ Check 8 — Too Good to Be True `+25 pts/keyword, cap +50`

Scans for lottery, prize, and inheritance scam language.

20 trigger keywords: you have won · you've been selected · free gift · lucky winner · congratulations you · inheritance · lottery · million dollars · unclaimed prize · cash reward · you are the winner · selected for reward · claim your prize · gift card · exclusive offer · you have been chosen · jackpot · sweepstakes · won a competition · awarded to you

✅ Check 9 — Psychological Manipulation `+20 pts/keyword, cap +40`

Detects secrecy and isolation tactics that prevent victims from consulting trusted people before complying with fraudulent requests.

15 trigger keywords: don't tell anyone · keep this confidential · this is between us · private offer · do not share · tell no one · strictly confidential · between you and me · do not discuss · keep this secret · do not contact your bank · do not inform · only you can see this · exclusive to you · do not reply to anyone else

✅ Check 10 — Poor Formatting `cap +30 pts`

Detects aggressive formatting patterns disproportionately common in phishing emails.

Pattern	Score
3+ consecutive exclamation marks `!!!`	+10 pts
3+ consecutive question marks `???`	+10 pts
3+ ALL CAPS words in body	+10 pts
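The three patterns above map naturally onto regular expressions. In this sketch an "ALL CAPS word" is assumed to mean two or more consecutive uppercase letters, which the README does not specify.

```python
import re

def formatting_score(body: str) -> int:
    """+10 per pattern present, capped at +30."""
    score = 0
    if re.search(r"!{3,}", body):       # 3+ consecutive exclamation marks
        score += 10
    if re.search(r"\?{3,}", body):      # 3+ consecutive question marks
        score += 10
    caps_words = re.findall(r"\b[A-Z]{2,}\b", body)
    if len(caps_words) >= 3:            # 3+ ALL CAPS words
        score += 10
    return min(score, 30)
```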

✅ Check 11 — Money Scam Indicators `+25 pts/keyword, cap +50`

Detects financial scam language and suspicious dollar amounts targeting advance-fee fraud and government impersonation scams.

20 trigger keywords: stimulus · unclaimed funds · pending approval · financial benefit · verify eligibility · government grant · tax refund · wire transfer · western union · money gram · advance fee · processing fee · release fee · inheritance funds · beneficiary · next of kin · transfer of funds · bank transfer · financial assistance · claim your funds

Regex detection: Also detects irregular dollar amounts like $7,452 or $84,500 commonly used to add false credibility to scam emails.
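The README does not define "irregular", so the sketch below simply assumes that any dollar amount written with thousands separators qualifies. The pattern and the name `find_dollar_amounts` are illustrative, not the project's actual regex.

```python
import re

# "$" + 1-3 digits + one or more ",ddd" groups + optional cents.
DOLLAR_RE = re.compile(r"\$\d{1,3}(?:,\d{3})+(?:\.\d{2})?")

def find_dollar_amounts(body: str) -> list[str]:
    """Return every comma-grouped dollar amount found in the body."""
    return DOLLAR_RE.findall(body)
```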

Attack Vectors Coverage

Category	Checks
Social Engineering	Checks 2, 8, 9
Domain Spoofing & Header Forgery	Checks 1, 5, 6
Credential & Financial Fraud	Checks 3, 7, 11
Technical Deception	Check 4
Quality Signal	Check 10

🔬 VirusTotal Integration

Two dedicated functions query the VirusTotal API v3 on top of the 11 heuristic checks.

Domain Scan — `scan_domain_virustotal(domain)`

Sends the sender domain to VirusTotal and checks reputation against 70+ security engines.

Result	Score	Condition
Malicious	+50 pts	One or more engines flag as malicious
Suspicious	+25 pts	One or more engines flag as suspicious
Clean	0 pts	No engines flag the domain

URL Scan — `scan_urls_virustotal(body)`

Extracts all URLs from the body and scans up to 3 URLs per email. Each URL is converted to an identifier the VirusTotal API v3 accepts: the URL-safe base64 encoding of the URL with the trailing `=` padding removed.

Result	Score	Condition
Malicious URL	+50 pts	One or more engines flag as malicious
Suspicious URL	+25 pts	One or more engines flag as suspicious
Clean	0 pts	No engines flag the URL

Both functions use try/except blocks for graceful error handling. The API key is stored in .env and never committed to version control.
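The URL identifier and the scoring tables above can be sketched without touching the network. The base64 form follows the public VirusTotal v3 documentation. `vt_url_id` and `score_from_stats` are hypothetical names, and the `stats` dict is assumed to be the `last_analysis_stats` object from a VirusTotal response.

```python
import base64

def vt_url_id(url: str) -> str:
    """URL-safe base64 of the URL with '=' padding stripped."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")

def score_from_stats(stats: dict) -> int:
    """Map engine counts to points: malicious +50, else suspicious +25, else 0."""
    if stats.get("malicious", 0) > 0:
        return 50
    if stats.get("suspicious", 0) > 0:
        return 25
    return 0

# The identifier would be used in a request such as:
#   GET https://www.virustotal.com/api/v3/urls/<vt_url_id(url)>
# with the API key sent in the "x-apikey" header.
```

Treating a missing key as 0 means an incomplete or failed response degrades to a clean score rather than raising, which matches the graceful-failure behaviour described above.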

📊 Scoring & Verdict System

Verdict Thresholds

Score:    0 ──────────── 30 ──────────── 70 ──────────────► ∞
                  │               │
          ✅ SAFE          ⚠️ SUSPICIOUS        🚨 SCAM
         (0 – 30)          (31 – 70)            (71+)

Score Range	Verdict	Interpretation
0 – 30	✅ SAFE	No significant risk signals detected
31 – 70	⚠️ SUSPICIOUS	Some risk signals present; treat with caution
71+	🚨 SCAM	Strong indicators of a phishing or scam email
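The threshold table translates directly into a small function. The name `verdict` is hypothetical, and the boundaries are taken from the table above.

```python
def verdict(score: int) -> str:
    """Map a total risk score to SAFE, SUSPICIOUS, or SCAM."""
    if score <= 30:
        return "SAFE"
    if score <= 70:
        return "SUSPICIOUS"
    return "SCAM"
```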

Confidence Levels

Confidence is determined by the number of checks triggered, independently of the total score.

Checks Triggered	Confidence	Notes
5 or more	🔴 High	Multiple independent signals; verdict is reliable
3 – 4	🟠 Medium	Corroborating evidence present
0 – 2	🟢 Low	Limited signals; manual review advisable
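As a sketch, confidence depends only on how many checks fired, not on the score. The name `confidence` is hypothetical.

```python
def confidence(checks_triggered: int) -> str:
    """Map the number of triggered checks to a confidence level."""
    if checks_triggered >= 5:
        return "High"
    if checks_triggered >= 3:
        return "Medium"
    return "Low"
```

Because the two measures are independent, an email can reach the SCAM threshold on a few heavily weighted checks while still reporting only Medium or Low confidence.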

🧪 Test Results

All 5 test emails were classified correctly with zero false positives.

Email Tested	Verdict	Score	Confidence	Result
PayPal phishing	🚨 SCAM	345	High	✅ Pass
Apple phishing (subtle)	🚨 SCAM	150	High	✅ Pass
Grammarly legitimate	✅ SAFE	20	Low	✅ Pass
GitHub security notice	✅ SAFE	0	Low	✅ Pass
Stimulus / financial scam	🚨 SCAM	110	Medium	✅ Pass

Score Visualization

PayPal phishing        ████████████████████████████████████████ 345
Apple phishing         ████████████████████ 150
Stimulus scam          ████████████████ 110
Grammarly legit        ███ 20
GitHub notice          0

                       │    │
                      30   70   ← Verdict thresholds

Test Design Rationale

PayPal phishing (345, High): Designed to trigger as many checks as possible simultaneously. Confirms that scores accumulate correctly across many triggered checks and that VirusTotal points are added on top of the heuristic total.
Apple phishing, subtle (150, High): Simulates a careful attacker who avoids aggressive language. Detection relies entirely on structural signals — typosquatted domain, brand impersonation, and VirusTotal flag. Confirms the engine catches sophisticated attacks without keyword triggers.
Grammarly legitimate (20, Low): Real-world marketing email. Tests false-positive resistance. Only Check 5 fires (+20) because grammarly.com is not in the whitelist — score stays well below the SUSPICIOUS threshold of 31.
GitHub security notice (0, Low): Clean-room baseline. A well-structured email from a whitelisted domain with clean URLs scores exactly 0 because no check triggers.
Stimulus scam (110, Medium): Written without brand names or URL shorteners to test whether Check 11 and Check 2 alone push the score into SCAM territory. They do — confirming financial scams are caught without domain or brand signals.

📁 Project Structure

phishguard/
├── app.py                        # Main backend: Flask + full detection engine
├── .env                          # VirusTotal API key (VT_API_KEY); never committed
├── .gitignore                    # Excludes .env and venv from version control
├── Complete Detection Rules.md   # Full detection rules reference document
├── Project Progress.txt          # Development progress log
├── Tony Doumit Phishguard.pdf    # 19-page professional technical report (LaTeX)
└── templates/
    └── index.html                # Frontend web interface

🚀 How to Run

Prerequisites

Python 3.10+
A free VirusTotal API key (issued to any registered VirusTotal account)

Setup

# 1. Clone the repository
git clone https://github.com/Doumit04/PhishGuard.git
cd PhishGuard

# 2. Create and activate virtual environment
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

# 3. Install dependencies
pip install flask requests python-dotenv

# 4. Create your .env file
# Create a file named .env in the project root and add:
# VT_API_KEY=your_virustotal_api_key_here

# 5. Start the Flask server
python app.py

# 6. Open in browser
# http://127.0.0.1:5000

⚠️ Important: The .env file must exist and contain a valid VT_API_KEY before starting the server. Without it, VirusTotal integration will fail silently and only the 11 heuristic checks will run.

Usage

Paste any raw email (headers + body) into the interface and click Analyze Email. Results are displayed instantly including:

Total risk score
Triggered checks with individual scores
VirusTotal domain and URL findings
Final verdict badge (SAFE / SUSPICIOUS / SCAM)
Confidence level

📦 Dependencies

Package	Source	Purpose
`flask`	pip	Web framework and HTTP routing
`requests`	pip	HTTP calls to VirusTotal API
`python-dotenv`	pip	Secure loading of `.env` variables
`re`	stdlib	Regex for URL extraction and header parsing
`base64`	stdlib	URL encoding required by VirusTotal API v3
`unicodedata`	stdlib	Unicode normalisation to defeat obfuscation

pip install flask requests python-dotenv

⚠️ Known Limitations

Static trusted domain list: The 20-domain whitelist cannot cover every legitimate sender. Emails from unlisted legitimate companies receive +20 pts, potentially pushing borderline emails to SUSPICIOUS. Partially mitigated by VirusTotal domain reputation.
Spear-phishing blind spot: A perfectly crafted targeted email that avoids all keyword triggers, uses a clean domain, and contains no suspicious links may score low despite being malicious. This is a known limitation of all rule-based detection systems.
VirusTotal free tier rate limits: The free API tier caps at 4 requests/min. One domain lookup plus three URL lookups already uses that quota, so only the first 3 URLs per email are scanned. Malicious links beyond the third are missed.
Newly registered domains: A malicious domain created within the past 24–48 hours may not yet appear in VirusTotal's crowd-sourced reports, resulting in a clean score.

🔮 Planned Improvements (Version 2)

Improvement	Purpose	Limitation Addressed
Live DNS / WHOIS domain age detection	Flag domains registered in the past 30 days	Newly registered malicious domains
DKIM / SPF header authentication	Verify sender is authorised to send on behalf of domain	New signal class — not in v1
Paid VirusTotal tier	Scan all URLs, no rate limit	Free tier URL scanning cap
Machine learning classifier	Detect zero-day phishing patterns via supervised learning	Spear-phishing keyword bypass
PDF report export	Download full analysis report for incident response	Documentation workflow

📄 Documentation

A full 19-page professional technical report was produced in LaTeX alongside this project, covering:

System architecture and data flow diagrams
Full detection engine specification for all 11 checks
VirusTotal integration design
Complete test methodology and results
Known limitations and Version 2 roadmap

The report is included in this repository as Tony Doumit Phishguard.pdf.

👤 Author

PhishGuard · Cybersecurity Portfolio Project · March 2026
