A hybrid phishing detection engine combining an 11-rule heuristic scoring system with live VirusTotal threat intelligence from 70+ security engines.
Cybersecurity Portfolio Project — Tony Doumit — March 2026
- Overview
- How It Works
- Detection Engine: 11 Checks
- VirusTotal Integration
- Scoring & Verdict System
- Test Results
- Project Structure
- How to Run
- Dependencies
- Known Limitations
- Planned Improvements (Version 2)
Phishing remains one of the most damaging forms of cybercrime worldwide. Attackers craft deceptive emails that impersonate trusted brands, create artificial urgency, and manipulate recipients into revealing sensitive credentials or transferring funds.
PhishGuard is a web-based phishing email analyser built with Python (Flask) and integrated with the VirusTotal API v3. It accepts raw email text (headers + body), runs it through 11 independent detection checks, queries VirusTotal for live threat intelligence, and returns a final verdict with a confidence level — all in real time.
The detection strategy is deliberately multi-layered. No single check is relied upon to produce a verdict. Instead, each check contributes an independent risk score, and the final verdict emerges from the accumulated weight of evidence. This mirrors how professional threat analysts think:
One suspicious signal may be a coincidence. Five simultaneous signals are a pattern.
| Metric | Value |
|---|---|
| Heuristic Checks | 11 |
| Security Engines (VirusTotal) | 70+ |
| Attack Vectors Covered | 6 |
| Test Emails | 5 / 5 Passed |
| False Positives | 0 |
| Max Possible Score | 460 pts |
Raw Email Input (Headers + Body)
│
▼
Flask Route receives HTTP POST
│
▼
extract_email_parts() ──► From, Reply-To, Subject, Body
│
▼
11 Heuristic Checks run sequentially
Each check returns: score contribution + triggered boolean
│
▼
VirusTotal API v3
├── scan_domain_virustotal() → sender domain reputation
└── scan_urls_virustotal() → up to 3 URLs from body
│
▼
Score Aggregator
Total Score + Triggered Count → Verdict + Confidence Level
│
▼
JSON Response → Browser renders result instantly
| Function | Purpose |
|---|---|
| `extract_email_parts()` | Splits the raw email into headers, subject, and body. Handles both `user@domain.com` and `Name <user@domain.com>` formats |
| `extract_domain()` | Extracts the domain from an email address and strips angle brackets |
| `get_base_domain()` | Strips subdomains, so `mail.grammarly.com` becomes `grammarly.com`. Used in Check 6 to prevent false positives |
| `normalize_unicode()` | Converts Unicode lookalike characters to ASCII equivalents, defeating obfuscation attacks such as Unicode bold 𝐕𝐞𝐫𝐢𝐟𝐲 → Verify |
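As a rough illustration of two of these helpers, the sketch below uses the same function names as the table, but the bodies are assumptions and not the code in app.py. It assumes that NFKD normalisation followed by dropping non-ASCII characters is an acceptable way to fold lookalikes, and that a simple regex is enough to pull the domain out of either address format.

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    """Fold compatibility characters (e.g. mathematical bold letters) to ASCII.

    Assumption: NFKD decomposition, then discard whatever is still non-ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def extract_domain(address: str) -> str:
    """Return the lower-cased domain from 'user@domain' or 'Name <user@domain>'."""
    match = re.search(r"@([A-Za-z0-9.-]+)", address)
    return match.group(1).lower().rstrip(".") if match else ""


# Unicode bold "Verify" written with escapes so the source stays ASCII.
bold_verify = "\U0001D415\U0001D41E\U0001D42B\U0001D422\U0001D41F\U0001D432"
print(normalize_unicode(bold_verify + " now"))
print(extract_domain("PayPal Support <service@paypa1.com>"))
```

Folding to ASCII before keyword matching means a styled keyword and its plain form hit the same rule, at the cost of discarding characters that have no ASCII equivalent.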
| # | Check | Target | Max Score |
|---|---|---|---|
| 1 | Typosquatting Detection | Sender domain | +40 |
| 2 | Urgency & Threat Language | Subject + Body | +45 |
| 3 | Sensitive Info Requests | Body only | +60 |
| 4 | Suspicious Links & Attachments | Body (URLs) | +50 |
| 5 | Trusted Domain Validation | Sender domain | +20 |
| 6 | Reply-To Mismatch | Email headers | +35 |
| 7 | Brand Impersonation | Body + sender domain | +40 |
| 8 | Too Good to Be True | Body only | +50 |
| 9 | Psychological Manipulation | Body only | +40 |
| 10 | Poor Formatting | Body only | +30 |
| 11 | Money Scam Indicators | Body only | +50 |
| | Maximum possible heuristic score | | +460 |
Detects domains where letters are replaced with visually similar numbers or character pairs to impersonate trusted brands.
Examples: paypa1.com · g00gle.com · micros0ft.com
Substitution map applied during normalisation:
| Character | Normalised To |
|---|---|
| `0` | `o` |
| `1` | `l` |
| `3` | `e` |
| `rn` | `m` |
| `vv` | `w` |
Method: The sender domain is normalised using the substitution map, then compared against the trusted domain whitelist. A match on the normalised domain (but not the original) confirms typosquatting.
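The method above can be sketched as follows. This is an illustrative version only: the helper names `normalise_domain` and `is_typosquat` are hypothetical, the trusted set is a three-domain subset of the whitelist, and the order in which substitutions are applied is an assumption.

```python
# Substitution map from the table above; multi-character pairs are applied first.
SUBSTITUTIONS = [("rn", "m"), ("vv", "w"), ("0", "o"), ("1", "l"), ("3", "e")]

# Small subset of the trusted whitelist, for illustration only.
TRUSTED_DOMAINS = {"paypal.com", "google.com", "microsoft.com"}


def normalise_domain(domain: str) -> str:
    """Apply every lookalike substitution to the lower-cased domain."""
    result = domain.lower()
    for fake, real in SUBSTITUTIONS:
        result = result.replace(fake, real)
    return result


def is_typosquat(domain: str, trusted=TRUSTED_DOMAINS) -> bool:
    """True when the normalised domain is trusted but the original is not."""
    original = domain.lower()
    if original in trusted:
        return False  # genuinely trusted, not a lookalike
    return normalise_domain(original) in trusted


for candidate in ("paypa1.com", "g00gle.com", "micros0ft.com", "paypal.com"):
    print(candidate, is_typosquat(candidate))
```

Comparing against the original first is what keeps a real trusted domain from being flagged as its own lookalike.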
Scans the subject line and body for keywords that create fear or artificial time pressure — a core social engineering tactic.
20 trigger keywords:
act now · immediately · urgent · 24 hours · suspended · locked · compromised · legal action · account will be closed · verify your identity · your account has been · confirm your information · unusual activity · unauthorized access · security alert · immediate action required · failure to respond · final notice · last warning · your account is at risk
Method: Unicode normalisation → keyword matching on subject + body.
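A minimal sketch of this keyword check is shown below. The cap of 45 comes from the checks table; the per-keyword value of 15 points, the function name `check_urgency`, and the use of plain substring matching are assumptions made for illustration, and only five of the 20 keywords are listed.

```python
import unicodedata

# Subset of the 20 trigger keywords listed above.
URGENCY_KEYWORDS = ["act now", "immediately", "urgent", "final notice", "security alert"]

POINTS_PER_HIT = 15  # assumption: the per-keyword weight is not stated in this README
MAX_SCORE = 45       # maximum for Check 2, from the checks table


def check_urgency(subject: str, body: str):
    """Return (score, triggered, matched_keywords) for subject + body."""
    text = f"{subject} {body}"
    # Fold Unicode lookalikes to ASCII before matching, then lower-case.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii").lower()
    hits = [kw for kw in URGENCY_KEYWORDS if kw in text]
    score = min(len(hits) * POINTS_PER_HIT, MAX_SCORE)
    return score, bool(hits), hits


print(check_urgency("URGENT: final notice", "Act now or lose access immediately"))
print(check_urgency("Hello", "Lunch tomorrow?"))
```

The other keyword-based checks (3, 8, 9, 11) could follow the same shape with their own keyword lists and caps.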
Scans for explicit requests for private, financial, or authentication data — the primary objective of most phishing attacks.
20 trigger keywords:
password · credit card · ssn · cvv · bank account · billing information · social security · date of birth · mother's maiden name · pin number · card number · account number · routing number · login credentials · username and password · security question · verify your account · update your payment · confirm your details · enter your information
Scans for malicious URL patterns and dangerous attachment indicators.
| Pattern | Example |
|---|---|
| Raw IP addresses in links | http://192.168.1.1/login |
| URL shorteners | bit.ly, tinyurl.com, t.co, goo.gl, ow.ly |
| Suspicious TLDs | .xyz, .tk, .ml, .ga, .cf, .gq |
| Multiple hyphens in URLs | paypal-secure-login-verify.com |
| Attachment keywords | open attached, download file, see attachment |
| Brand lookalike patterns | apple-, paypal-, secure-login, account-verify |
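Four of the URL patterns in the table can be expressed as in the sketch below. The function names, the URL regex, and the threshold of two or more hyphens for "multiple hyphens" are assumptions; attachment keywords and brand lookalike patterns are left out to keep the example short.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
IP_HOST_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "ow.ly"}
SUSPICIOUS_TLDS = (".xyz", ".tk", ".ml", ".ga", ".cf", ".gq")


def classify_url(url: str) -> list:
    """Return the list of suspicious patterns the URL's host matches."""
    host = (urlparse(url).hostname or "").lower()
    reasons = []
    if IP_HOST_RE.match(host):
        reasons.append("raw_ip")
    if host in SHORTENERS:
        reasons.append("shortener")
    if host.endswith(SUSPICIOUS_TLDS):
        reasons.append("suspicious_tld")
    if host.count("-") >= 2:  # assumption: "multiple" means two or more
        reasons.append("multiple_hyphens")
    return reasons


def find_suspicious_links(body: str) -> dict:
    """Map each suspicious URL found in the body to its matched patterns."""
    return {url: reasons for url in URL_RE.findall(body) if (reasons := classify_url(url))}


print(find_suspicious_links("Log in at http://192.168.1.1/login or https://github.com today"))
```

Inspecting the parsed hostname rather than the whole URL string avoids false matches on paths and query strings.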
Checks whether the sender domain is in a curated whitelist of 20 trusted domains.
Whitelist: gmail.com · outlook.com · hotmail.com · yahoo.com · icloud.com · apple.com · paypal.com · amazon.com · microsoft.com · google.com · github.com · linkedin.com · twitter.com · facebook.com · netflix.com · dropbox.com · slack.com · zoom.us · shopify.com · stripe.com
Detects when the Reply-To domain differs from the From domain — a classic phishing technique where the email appears to come from a trusted sender, but replies go to the attacker.
Method: Base-domain comparison to prevent false positives. mail.grammarly.com and grammarly.com share the base domain grammarly.com and are treated as matching.
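The base-domain comparison might look like the sketch below. It assumes the simplest possible definition of a base domain (the last two labels), which is an illustration and would mis-handle multi-part suffixes such as `co.uk`; `reply_to_mismatch` is a hypothetical name.

```python
def get_base_domain(domain: str) -> str:
    """Naive base domain: keep only the last two dot-separated labels."""
    labels = domain.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def reply_to_mismatch(from_domain: str, reply_to_domain: str) -> bool:
    """True when a Reply-To header exists and its base domain differs from From."""
    if not reply_to_domain:
        return False  # no Reply-To header, nothing to compare
    return get_base_domain(from_domain) != get_base_domain(reply_to_domain)


print(reply_to_mismatch("mail.grammarly.com", "grammarly.com"))
print(reply_to_mismatch("paypal.com", "attacker-mail.example"))
```

A production version would more likely consult the Public Suffix List so that `shop.example.co.uk` and `example.co.uk` compare equal.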
Detects when an email body mentions a known brand but the sender domain is not that brand's official domain. Exact domain matching enforced — appleid-support.com ≠ apple.com.
Monitored brands: PayPal · Apple · Amazon · Microsoft · Netflix · Google · Facebook · Instagram · DHL · FedEx · Bank of America · Chase
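One way to express this check is sketched below. The brand-to-domain mapping is hypothetical (the README lists the brands but not their official domains), only three brands are shown, and whole-word matching is an assumption.

```python
import re

# Hypothetical mapping from brand keyword to its official sender domains.
BRAND_DOMAINS = {
    "paypal": {"paypal.com"},
    "apple": {"apple.com"},
    "netflix": {"netflix.com"},
}


def impersonated_brands(body: str, sender_domain: str) -> list:
    """Return brands mentioned in the body whose official domain is not the sender."""
    text = body.lower()
    sender = sender_domain.lower()
    flagged = []
    for brand, official in BRAND_DOMAINS.items():
        mentioned = re.search(rf"\b{re.escape(brand)}\b", text) is not None
        if mentioned and sender not in official:  # exact match, no substring checks
            flagged.append(brand)
    return flagged


print(impersonated_brands("Your Apple ID has been locked.", "appleid-support.com"))
print(impersonated_brands("Your Apple ID has been locked.", "apple.com"))
```

Because the sender must match the official domain exactly, a lookalike that merely contains the brand name is still flagged.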
Scans for lottery, prize, and inheritance scam language.
20 trigger keywords:
you have won · you've been selected · free gift · lucky winner · congratulations you · inheritance · lottery · million dollars · unclaimed prize · cash reward · you are the winner · selected for reward · claim your prize · gift card · exclusive offer · you have been chosen · jackpot · sweepstakes · won a competition · awarded to you
Detects secrecy and isolation tactics that prevent victims from consulting trusted people before complying with fraudulent requests.
15 trigger keywords:
don't tell anyone · keep this confidential · this is between us · private offer · do not share · tell no one · strictly confidential · between you and me · do not discuss · keep this secret · do not contact your bank · do not inform · only you can see this · exclusive to you · do not reply to anyone else
Detects aggressive formatting patterns disproportionately common in phishing emails.
| Pattern | Score |
|---|---|
| 3+ consecutive exclamation marks `!!!` | +10 pts |
| 3+ consecutive question marks `???` | +10 pts |
| 3+ ALL CAPS words in body | +10 pts |
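The three patterns in the table translate naturally into regular expressions. In the sketch below, the definition of an ALL CAPS word as two or more consecutive uppercase letters is an assumption, as is the function name `check_formatting`.

```python
import re


def check_formatting(body: str):
    """Return (score, triggered) using the three +10 patterns from the table."""
    score = 0
    if re.search(r"!{3,}", body):   # 3+ consecutive exclamation marks
        score += 10
    if re.search(r"\?{3,}", body):  # 3+ consecutive question marks
        score += 10
    # Assumption: an "ALL CAPS word" is 2+ uppercase letters with word boundaries.
    caps_words = re.findall(r"\b[A-Z]{2,}\b", body)
    if len(caps_words) >= 3:
        score += 10
    return score, score > 0


print(check_formatting("ACT NOW!!! Your ACCOUNT is locked???"))
print(check_formatting("Hi there, see you at 5?"))
```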
Detects financial scam language and suspicious dollar amounts targeting advance-fee fraud and government impersonation scams.
20 trigger keywords:
stimulus · unclaimed funds · pending approval · financial benefit · verify eligibility · government grant · tax refund · wire transfer · western union · money gram · advance fee · processing fee · release fee · inheritance funds · beneficiary · next of kin · transfer of funds · bank transfer · financial assistance · claim your funds
Regex detection: Also detects irregular dollar amounts like $7,452 or $84,500 commonly used to add false credibility to scam emails.
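The README does not define "irregular", so the sketch below assumes one plausible reading: a dollar amount of at least $1,000 that is not a whole multiple of $1,000. Both the regex and the function name `irregular_amounts` are illustrative.

```python
import re

# Matches "$7,452", "$84,500", "$7452", with optional cents.
DOLLAR_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")


def irregular_amounts(body: str) -> list:
    """Return dollar amounts >= 1000 that are not round thousands (assumed rule)."""
    found = []
    for match in DOLLAR_RE.finditer(body):
        value = int(match.group(1).replace(",", ""))
        if value >= 1000 and value % 1000 != 0:
            found.append(match.group(0))
    return found


print(irregular_amounts("You are owed $7,452 and $84,500, not $5,000 or $20."))
```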
| Category | Checks |
|---|---|
| Social Engineering | Checks 2, 8, 9 |
| Domain Spoofing & Header Forgery | Checks 1, 5, 6 |
| Credential & Financial Fraud | Checks 3, 7, 11 |
| Technical Deception | Check 4 |
| Quality Signal | Check 10 |
Two dedicated functions query the VirusTotal API v3 on top of the 11 heuristic checks.
Sends the sender domain to VirusTotal and checks reputation against 70+ security engines.
| Result | Score | Condition |
|---|---|---|
| Malicious | +50 pts | One or more engines flag as malicious |
| Suspicious | +25 pts | One or more engines flag as suspicious |
| Clean | 0 pts | No engines flag the domain |
Extracts all URLs from the body and scans up to 3 URLs per email. Each URL is converted to the identifier that VirusTotal API v3 expects: the URL-safe base64 encoding of the URL with the trailing `=` padding removed.
| Result | Score | Condition |
|---|---|---|
| Malicious URL | +50 pts | One or more engines flag as malicious |
| Suspicious URL | +25 pts | One or more engines flag as suspicious |
| Clean | 0 pts | No engines flag the URL |
Both functions use `try/except` blocks for graceful error handling. The API key is stored in `.env` and never committed to version control.
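The sketch below shows how a single URL lookup and the scoring tables above could fit together. It relies on the documented VirusTotal v3 conventions (the `x-apikey` header, the `/api/v3/urls/{id}` endpoint, and the `last_analysis_stats` field); the function names are hypothetical, and checking malicious before suspicious is an assumption for the case where both counts are non-zero.

```python
import base64


def vt_url_id(url: str) -> str:
    """VirusTotal v3 URL identifier: URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


def score_from_stats(stats: dict):
    """Map last_analysis_stats counts to (points, label) per the tables above."""
    if stats.get("malicious", 0) >= 1:
        return 50, "malicious"
    if stats.get("suspicious", 0) >= 1:
        return 25, "suspicious"
    return 0, "clean"


def scan_one_url(url: str, api_key: str, timeout: float = 10.0):
    """Look up one URL report; any failure degrades to a zero-point result."""
    import requests  # third-party; imported here so the helpers above need no install

    endpoint = f"https://www.virustotal.com/api/v3/urls/{vt_url_id(url)}"
    try:
        response = requests.get(endpoint, headers={"x-apikey": api_key}, timeout=timeout)
        response.raise_for_status()
        stats = response.json()["data"]["attributes"]["last_analysis_stats"]
    except (requests.RequestException, KeyError, ValueError):
        return 0, "unavailable"
    return score_from_stats(stats)


print(score_from_stats({"malicious": 2, "suspicious": 1, "harmless": 60}))
```

Returning a distinct `"unavailable"` label instead of `"clean"` lets the caller tell a failed lookup apart from a genuinely clean result.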
Score: 0 ──────────── 30 ──────────── 70 ──────────────► ∞
│ │
✅ SAFE ⚠️ SUSPICIOUS 🚨 SCAM
(0 – 30) (31 – 70) (71+)
| Score Range | Verdict | Interpretation |
|---|---|---|
| 0 – 30 | ✅ SAFE | No significant risk signals detected |
| 31 – 70 | ⚠️ SUSPICIOUS | Some risk signals present; treat with caution |
| 71+ | 🚨 SCAM | Strong indicators of a phishing or scam email |
Confidence is determined by the number of checks triggered, independently of the total score.
| Checks Triggered | Confidence | Notes |
|---|---|---|
| 5 or more | 🔴 High | Multiple independent signals; verdict is reliable |
| 3 – 4 | 🟠 Medium | Corroborating evidence present |
| 0 – 2 | 🟢 Low | Limited signals; manual review advisable |
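The two tables above reduce to a pair of threshold functions. The sketch below uses hypothetical names and assumes each check reports a `(score, triggered)` pair, as described in the data-flow diagram.

```python
def verdict_for(score: int) -> str:
    """Map a total score to a verdict using the 30 / 70 thresholds."""
    if score <= 30:
        return "SAFE"
    if score <= 70:
        return "SUSPICIOUS"
    return "SCAM"


def confidence_for(triggered_count: int) -> str:
    """Map the number of triggered checks to a confidence level."""
    if triggered_count >= 5:
        return "High"
    if triggered_count >= 3:
        return "Medium"
    return "Low"


def aggregate(results: list) -> dict:
    """Combine (score, triggered) pairs from all checks into a final result."""
    total = sum(score for score, _ in results)
    triggered = sum(1 for _, fired in results if fired)
    return {
        "score": total,
        "triggered": triggered,
        "verdict": verdict_for(total),
        "confidence": confidence_for(triggered),
    }


print(aggregate([(40, True), (45, True), (0, False), (50, True)]))
```

Keeping verdict and confidence as separate functions mirrors the point made above: a high score from one or two checks is still reported with low confidence.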
All 5 test emails were classified correctly with zero false positives.
| Email Tested | Verdict | Score | Confidence | Result |
|---|---|---|---|---|
| PayPal phishing | 🚨 SCAM | 345 | High | ✅ Pass |
| Apple phishing (subtle) | 🚨 SCAM | 150 | High | ✅ Pass |
| Grammarly legitimate | ✅ SAFE | 20 | Low | ✅ Pass |
| GitHub security notice | ✅ SAFE | 0 | Low | ✅ Pass |
| Stimulus / financial scam | 🚨 SCAM | 110 | Medium | ✅ Pass |
PayPal phishing ████████████████████████████████████████ 345
Apple phishing ████████████████████ 150
Stimulus scam ████████████████ 110
Grammarly legit ███ 20
GitHub notice 0
│ │
30 70 ← Verdict thresholds
- PayPal phishing (345, High): Designed to trigger as many checks as possible simultaneously. Confirms score compounds correctly across parallel checks and VirusTotal adds on top of the heuristic total.
- Apple phishing, subtle (150, High): Simulates a careful attacker who avoids aggressive language. Detection relies entirely on structural signals — typosquatted domain, brand impersonation, and VirusTotal flag. Confirms the engine catches sophisticated attacks without keyword triggers.
- Grammarly legitimate (20, Low): Real-world marketing email that tests false-positive resistance. Only Check 5 fires (+20) because `grammarly.com` is not in the whitelist, so the score stays well below the SUSPICIOUS threshold of 31.
- GitHub security notice (0, Low): Clean-room baseline. A well-structured email from a whitelisted domain with clean URLs scores exactly 0. No check triggers.
- Stimulus scam (110, Medium): Written without brand names or URL shorteners to test whether Check 11 and Check 2 alone push the score into SCAM territory. They do — confirming financial scams are caught without domain or brand signals.
phishguard/
├── app.py # Main backend: Flask + full detection engine
├── .env
├── .gitignore # Excludes .env and venv from version control
├── Complete Detection Rules.md # Full detection rules reference document
├── Project Progress.txt # Development progress log
├── Tony Doumit Phishguard.pdf # 19-page professional technical report (LaTeX)
└── templates/
└── index.html # Frontend web interface
- Python 3.10+
- A free VirusTotal API key (available by creating an account on the VirusTotal website)
# 1. Clone the repository
git clone https://github.com/Doumit04/PhishGuard.git
cd PhishGuard
# 2. Create and activate virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
# 3. Install dependencies
pip install flask requests python-dotenv
# 4. Create your .env file
# Create a file named .env in the project root and add:
# VT_API_KEY=your_virustotal_api_key_here
# 5. Start the Flask server
python app.py
# 6. Open in browser
# http://127.0.0.1:5000
⚠️ Important: The `.env` file must exist and contain a valid `VT_API_KEY` before starting the server. Without it, VirusTotal integration will fail silently and only the 11 heuristic checks will run.
Paste any raw email (headers + body) into the interface and click Analyze Email. Results are displayed instantly including:
- Total risk score
- Triggered checks with individual scores
- VirusTotal domain and URL findings
- Final verdict badge (SAFE / SUSPICIOUS / SCAM)
- Confidence level
| Package | Source | Purpose |
|---|---|---|
| `flask` | pip | Web framework and HTTP routing |
| `requests` | pip | HTTP calls to VirusTotal API |
| `python-dotenv` | pip | Secure loading of `.env` variables |
| `re` | stdlib | Regex for URL extraction and header parsing |
| `base64` | stdlib | URL encoding required by VirusTotal API v3 |
| `unicodedata` | stdlib | Unicode normalisation to defeat obfuscation |
pip install flask requests python-dotenv
- Static trusted domain list: The 20-domain whitelist cannot cover every legitimate sender. Emails from unlisted legitimate companies receive +20 pts, potentially pushing borderline emails to SUSPICIOUS. Partially mitigated by VirusTotal domain reputation.
- Spear-phishing blind spot: A perfectly crafted targeted email that avoids all keyword triggers, uses a clean domain, and contains no suspicious links may score low despite being malicious. This is a known limitation of all rule-based detection systems.
- VirusTotal free tier rate limits: The free API tier caps at 4 requests/min, so only the first 3 URLs per email are scanned. Malicious links beyond the third are missed.
- Newly registered domains: A malicious domain created within the past 24–48 hours may not yet appear in VirusTotal's crowd-sourced reports, resulting in a clean score.
| Improvement | Purpose | Limitation Addressed |
|---|---|---|
| Live DNS / WHOIS domain age detection | Flag domains registered in the past 30 days | Newly registered malicious domains |
| DKIM / SPF header authentication | Verify sender is authorised to send on behalf of domain | New signal class — not in v1 |
| Paid VirusTotal tier | Scan all URLs, no rate limit | Free tier URL scanning cap |
| Machine learning classifier | Detect zero-day phishing patterns via supervised learning | Spear-phishing keyword bypass |
| PDF report export | Download full analysis report for incident response | Documentation workflow |
A full 19-page professional technical report was produced in LaTeX alongside this project, covering:
- System architecture and data flow diagrams
- Full detection engine specification for all 11 checks
- VirusTotal integration design
- Complete test methodology and results
- Known limitations and Version 2 roadmap
The report is included in this repository as Tony Doumit Phishguard.pdf.