DNS Threat Detector

A production-ready machine learning system for detecting malicious DNS domains including DGA domains, typosquatting attempts, malware C&C domains, and phishing sites.

Features

99.68% F1-Score on comprehensive test dataset
100% Typosquatting Detection with zero false positives
Sub-millisecond Latency (0.439ms average inference time)
Multi-tier Safelist with O(1) lookup for instant benign classification
99 Protected Brands including Google, Microsoft, PayPal, Amazon, etc.
Hybrid Architecture combining LightGBM, LSTM, and meta-learning
Easy Integration with Python API and CLI tool

Performance Metrics

Metric	Value
F1-Score	99.68%
Accuracy	99.38%
Precision	97.15%
Recall	99.95%
Typosquatting Detection	100%
False Positive Rate	28.5%
False Negative Rate	0.05%
Avg Latency	0.439 ms
Throughput	~2,275 domains/sec

Installation

pip install dns-threat-detector

Quick Start

Python API

from dns_threat_detector import DNS_ThreatDetector

# Initialize detector with safelist enabled
detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()

# Predict a single domain
result = detector.predict('gooogle.com')
print(result)
# Output:
# {
#   'prediction': 'MALICIOUS',
#   'confidence': 0.9000,
#   'reason': 'Typosquatting (dist=1 to google)',
#   'method': 'typosquatting_rule',
#   'latency_ms': 0.234
# }

# Batch predictions
domains = ['google.com', 'gooogle.com', 'example.com']
results = detector.predict_batch(domains)

Command-Line Interface

# Predict a single domain
dns-detect predict gooogle.com

# Get JSON output
dns-detect predict gooogle.com --json

# Batch process domains from file
dns-detect batch domains.txt --output results.json

# Show model information
dns-detect info

# Run self-tests
dns-detect test

Architecture

The DNS Threat Detector uses a sophisticated hybrid ensemble approach:

1. LightGBM Classifier

Gradient-boosted decision trees
11 features (4 FQDN + 7 typosquatting-specific)
200 trees with max depth 7
Handles structured feature patterns

2. Bidirectional LSTM

Character-level neural network
41-character vocabulary
159K parameters
Embedding(41→32) → Bi-LSTM(32→64×2) → FC(128→64→2)
Captures sequential patterns

3. Meta-Learner

Logistic regression stacking ensemble
Combines LightGBM and LSTM predictions
Learned weights: LSTM=7.04, LightGBM=2.53
Final classification decision

4. Typosquatting Detector

Rule-based + ML hybrid approach
Edit distance (Levenshtein) to 99 top brands
Distance 1-3 → Malicious (typosquatting)
Exact brand match → Benign (whitelist)

5. Multi-Tier Safelist (Optional)

Tier 1: 30K critical domains (government, finance)
Tier 2: 29K high-trust domains (tech, education)
Tier 3: 85K general trusted domains
O(1) in-memory lookup
322× speedup for safelisted domains

Prediction Flow

Domain Input
    ↓
┌─────────────────────┐
│  Safelist Check     │ → BENIGN (if listed)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Brand Whitelist    │ → BENIGN (exact match)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Typosquatting Rule │ → MALICIOUS (edit dist 1-3)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  ML Ensemble        │ → MALICIOUS/BENIGN
│  (LightGBM + LSTM)  │
└─────────────────────┘

Features Extracted

FQDN Features (4)

domain_length - Length of domain name excluding TLD
subdomain_count - Number of subdomains
numeric_chars - Count of numeric characters
entropy - Shannon entropy of character distribution

Typosquatting Features (7)

min_edit_distance - Minimum Levenshtein distance to top brands
edit_distance_ratio - Normalized edit distance by brand length
length_diff_to_closest - Length difference to closest brand
has_extra_char - Binary: domain has 1 extra character
has_missing_char - Binary: domain missing 1 character
has_swapped_char - Binary: adjacent characters swapped
digit_substitution - Binary: contains digit substitution

API Reference

DNS_ThreatDetector

class DNS_ThreatDetector(
    models_dir: Optional[str] = None,
    use_safelist: bool = False,
    safelist_dir: Optional[str] = None,
    safelist_tiers: List[int] = [1, 2, 3]
)

Methods

load_models() Load all model components (LightGBM, LSTM, meta-learner, safelist)

predict(domain: str) -> Dict Predict if a domain is malicious or benign

Returns:

{
    'prediction': 'MALICIOUS' | 'BENIGN',
    'confidence': float,  # 0.0 to 1.0
    'reason': str,  # Human-readable explanation
    'method': str,  # 'safelist' | 'brand_whitelist' | 'typosquatting_rule' | 'ensemble'
    'latency_ms': float  # Inference time in milliseconds
}

predict_batch(domains: List[str]) -> List[Dict] Predict multiple domains

get_model_info() -> Dict Get comprehensive model information and statistics

save_metadata(output_path: str) Save model metadata to JSON file

Advanced Usage

Custom Model Paths

detector = DNS_ThreatDetector(
    models_dir='/path/to/models',
    use_safelist=True,
    safelist_dir='/path/to/safelists',
    safelist_tiers=[1, 2, 3]
)
detector.load_models()

Disable Safelist

# Faster initialization, no safelist loading
detector = DNS_ThreatDetector(use_safelist=False)
detector.load_models()

Batch Processing with Progress

from tqdm import tqdm

domains = ['example1.com', 'example2.com', ...]
results = []

for domain in tqdm(domains):
    result = detector.predict(domain)
    results.append(result)

Model Statistics

info = detector.get_model_info()
print(f"Total predictions: {info['usage_statistics']['total_predictions']}")
print(f"Safelist hits: {info['usage_statistics']['safelist_hits']}")
print(f"Typosquatting detections: {info['usage_statistics']['typosquatting_detections']}")

CLI Reference

Commands

dns-detect predict <domain> Predict a single domain

--json: Output as JSON
--no-safelist: Disable safelist checking

dns-detect batch <file> Batch process domains from file (one domain per line)

--output <file>: Output file path (default: results.json)
--no-safelist: Disable safelist checking

dns-detect info Show model information and statistics

--no-safelist: Show info without loading safelist

dns-detect test Run built-in self-tests

Examples

Detecting Typosquatting

detector = DNS_ThreatDetector()
detector.load_models()

# Legitimate brand
result = detector.predict('google.com')
# → BENIGN (brand_whitelist)

# Typosquatting attempts
result = detector.predict('gooogle.com')  # Extra 'o'
# → MALICIOUS (typosquatting_rule, dist=1)

result = detector.predict('g00gle.com')  # Digit substitution
# → MALICIOUS (typosquatting_rule, dist=2)

Processing Domain Lists

import pandas as pd

detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()

# Read domains from CSV
df = pd.read_csv('domains.csv')

# Add predictions
df['prediction'] = df['domain'].apply(
    lambda d: detector.predict(d)['prediction']
)
df['confidence'] = df['domain'].apply(
    lambda d: detector.predict(d)['confidence']
)

# Filter malicious domains
malicious = df[df['prediction'] == 'MALICIOUS']
print(malicious)

Requirements

Python ≥ 3.8
PyTorch ≥ 2.0.0
LightGBM ≥ 4.0.0
scikit-learn ≥ 1.3.0
pandas ≥ 2.0.0
numpy ≥ 1.24.0

Model Size

Total package size: ~60 MB
LightGBM models: ~10 MB
LSTM model: ~5 MB
Safelist files (tiers 1-3): ~20 MB
Tokenizer: ~1 MB

Dataset

Trained on 51,000 domains:

50% benign (legitimate domains)
50% malicious (DGA, typosquatting, malware C&C)
80/20 train/test split with stratification

Citation

If you use this tool in your research or project, please cite:

@software{dns_threat_detector,
  title = {DNS Threat Detector},
  author = {UMUDGA Project},
  year = {2025},
  version = {1.0.0},
  url = {https://github.com/umudga/dns-threat-detector}
}

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

Support

GitHub Issues: https://github.com/umudga/dns-threat-detector/issues
Documentation: https://github.com/umudga/dns-threat-detector/wiki

Changelog

Version 1.0.0 (2025-10-30)

Initial release
Hybrid ensemble architecture (LightGBM + LSTM + Meta-learner)
99.68% F1-score on test data
100% typosquatting detection
Multi-tier safelist integration
CLI tool with batch processing
Comprehensive API documentation

Acknowledgments

Developed by the UMUDGA Project team as part of a final-year academic research project on DNS threat detection using machine learning.

Disclaimer

This tool is provided for educational and research purposes. While it achieves high accuracy, no detection system is perfect. Always use multiple layers of security in production environments.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
dns_threat_detector		dns_threat_detector
examples		examples
tests		tests
.gitignore		.gitignore
DEPLOYMENT_GUIDE.md		DEPLOYMENT_GUIDE.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

DNS Threat Detector

Features

Performance Metrics

Installation

Quick Start

Python API

Command-Line Interface

Architecture

1. LightGBM Classifier

2. Bidirectional LSTM

3. Meta-Learner

4. Typosquatting Detector

5. Multi-Tier Safelist (Optional)

Prediction Flow

Features Extracted

FQDN Features (4)

Typosquatting Features (7)

API Reference

DNS_ThreatDetector

Methods

Advanced Usage

Custom Model Paths

Disable Safelist

Batch Processing with Progress

Model Statistics

CLI Reference

Commands

Examples

Detecting Typosquatting

Processing Domain Lists

Requirements

Model Size

Dataset

Citation

License

Contributing

Support

Changelog

Version 1.0.0 (2025-10-30)

Acknowledgments

Disclaimer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages