Domain Classifier

Classifies domains and assigns each a grade from A to F based on DNS records, WHOIS registration data, and website content analysis. Designed for B2B data quality workflows that need to quickly separate active business domains from parked pages, redirects, and junk.

Grades

Grade	Classification	Meaning
A	`business` (score > 0.6) or `government`	Confirmed business or institutional domain — high confidence
B	`business` (score ≤ 0.6) or DNS-upgraded	Likely business — moderate confidence, or strong DNS signals behind a block
C	`inactive`, `parked`, `ads`, `adult`, `gambling`, `personal`, `under_construction`, `gated`, `blocked`, `timeout`, `error`	Non-business or unclassifiable content
D	`redirect`	Domain redirects to a different domain
E	`ns_only`	Nameserver exists but no A record — no website
F	`unregistered`	No nameserver — domain is not registered
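
The table above can be read as a lookup from classification and score to grade. A minimal sketch, assuming a hypothetical grade_for helper (the name and signature are illustrative, not part of the package) and ignoring the Grade B upgrade path:

```python
# Hypothetical helper: the name and signature are illustrative only.
# The mapping itself follows the Grades table above.
C_CLASSES = {
    "inactive", "parked", "ads", "adult", "gambling", "personal",
    "under_construction", "gated", "blocked", "timeout", "error",
}


def grade_for(classification: str, score: float = 1.0) -> str:
    """Return the A-F grade for a classification, before any B upgrade."""
    if classification == "government":
        return "A"
    if classification == "business":
        # Strictly above 0.6 is grade A; 0.6 or below is grade B.
        return "A" if score > 0.6 else "B"
    if classification in C_CLASSES:
        return "C"
    if classification == "redirect":
        return "D"
    if classification == "ns_only":
        return "E"
    if classification == "unregistered":
        return "F"
    raise ValueError(f"unknown classification: {classification!r}")
```

Note that the boundary value 0.6 falls in grade B, since grade A requires a score strictly greater than 0.6.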

Classifications

Value	Description
`business`	Website content indicates an active business
`government`	Institutional TLD (.gov, .mil, .edu, .gov.uk, .gc.ca, etc.) — pre-pipeline fast path
`inactive`	Page loaded but has no meaningful content, or served inside a frameset
`parked`	Parked domain — ad lander, for-sale page, or content farm redirect
`ads`	Ad-only page with no substantive content
`under_construction`	Explicit "coming soon" / under construction page
`gated`	Login or CAPTCHA wall — no indexable content visible
`blocked`	WAF / Cloudflare challenge page
`adult`	Adult content
`gambling`	Gambling content
`personal`	Personal site, blog, or portfolio with no business signals
`redirect`	Redirects to a different domain
`ns_only`	Nameserver exists but there is no A record, so no web server
`unregistered`	No nameserver record; the domain is not registered
`timeout`	No response within time limit
`error`	Pipeline or HTTP error

Grade B upgrades — a domain that would otherwise receive C is promoted to B when strong DNS signals indicate a real organisation behind a block or blank page (enterprise MX provider, strict DMARC policy, Microsoft 365 tenant verification), or when the DBI API confirms a real web presence via domainrank.
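
The upgrade rule might be sketched as follows. Everything here is an assumption made for illustration: the UpgradeSignals fields, the maybe_upgrade function, the choice to treat any single signal as sufficient, and the restriction to inactive, blocked, and error results. The actual weighting of signals in the package may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only these C-grade results are candidates for promotion.
UPGRADABLE = {"inactive", "blocked", "error"}


@dataclass
class UpgradeSignals:
    # Hypothetical field names mirroring the signals named in the text.
    enterprise_mx: bool = False
    strict_dmarc: bool = False
    m365_verified: bool = False
    dbi_domainrank: Optional[float] = None


def maybe_upgrade(classification: str, signals: UpgradeSignals) -> str:
    """Return "B" if a C-grade result should be promoted, else "C"."""
    if classification not in UPGRADABLE:
        return "C"
    strong_dns = (
        signals.enterprise_mx or signals.strict_dmarc or signals.m365_verified
    )
    has_rank = signals.dbi_domainrank is not None
    # Assumed rule: any one strong signal is enough to promote.
    return "B" if (strong_dns or has_rank) else "C"
```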

Language is reported as a separate language field (ISO 639-1 code, e.g. de, zh, fr) and does not affect the classification or grade.

Installation

Requirements: Python 3.9+

# 1. Install the package and dependencies
pip install -e ".[dev]"

# 2. Install Playwright's headless Chromium (needed for JS-rendered sites)
playwright install chromium

CLI Usage

Classify a single domain

domain-classify stripe.com

Classify multiple domains from a file

domain-classify --input domains.txt

domains.txt contains one domain per line; lines starting with # are ignored:

# Known businesses
stripe.com
dnb.com
salesforce.com

# Suspected junk
17work.cn
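
If you generate or pre-filter this file yourself, the format is simple to parse. A minimal sketch, assuming blank lines are skipped and surrounding whitespace is trimmed (parse_domains is an illustrative name, not the package's own loader):

```python
from typing import Iterable, List


def parse_domains(lines: Iterable[str]) -> List[str]:
    """Return domains, skipping blank lines and lines starting with '#'."""
    domains = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        domains.append(entry)
    return domains


sample = """# Known businesses
stripe.com
dnb.com

# Suspected junk
17work.cn
"""
print(parse_domains(sample.splitlines()))
# prints ['stripe.com', 'dnb.com', '17work.cn']
```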

Save results to CSV

domain-classify --input domains.txt --output results.csv

Output as JSON

domain-classify stripe.com --format json

Classify from a delimited file

Use --delimiter and --field to parse structured input files (CSV, TSV, pipe-delimited, etc.). The classifier appends its output columns to each original row and automatically writes a <stem>_output<ext> file; for example, leads.csv produces leads_output.csv.

# CSV with a header row — domain is in field 1
domain-classify --input leads.csv --delimiter , --field 1

# Tab-delimited export — domain is in field 3
domain-classify --input export.tsv --delimiter $'\t' --field 3

# Pipe-delimited, no header
domain-classify --input data.txt --delimiter '|' --field 2
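
To make the 1-based --field index and the <stem>_output<ext> naming concrete, here is a small sketch. The helper names are hypothetical, and the sketch does not skip a header row; how the CLI itself treats headers is not covered here.

```python
import csv
import io
from pathlib import Path


def output_path_for(input_path: str) -> Path:
    """Build the <stem>_output<ext> name from the input file name."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}_output{p.suffix}")


def extract_domains(text: str, delimiter: str, field: int) -> list:
    """Return the value at the 1-based field index from each row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Rows with too few fields are skipped rather than raising IndexError.
    return [row[field - 1].strip() for row in reader if len(row) >= field]


rows = "1|stripe.com|US\n2|dnb.com|US\n"
print(extract_domains(rows, "|", 2))
print(output_path_for("data.txt").name)
```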

All options

Usage: domain-classify [OPTIONS] [DOMAIN]

Arguments:
  DOMAIN                    Single domain to classify  [optional]

Options:
  -i, --input PATH          File with one domain per line
  -o, --output PATH         Output file path (auto-generated in delimited mode)
      --format TEXT         Output format: table|json|csv  [default: table]
  -d, --delimiter TEXT      Field delimiter for structured input (e.g. ',' or '\t')
  -f, --field INT           1-based index of the domain field in delimited input
  -w, --workers INT         Concurrent domains  [default: 5]
  -t, --timeout INT         Per-domain timeout in seconds  [default: 30]
      --no-cache            Disable the SQLite result cache
  -v, --verbose             Enable debug logging to stderr
      --log-file PATH       Write logs to this file instead of stderr
      --ai-compare          Compare static results with Claude AI classifier
      --ai-all              Run AI comparison on every domain (default: low-confidence only)
      --help                Show this message and exit

Example table output

              Domain Classification Results
┌──────────────────┬───────┬──────────────────┬──────┬───────┬──────────┐
│ Domain           │ Grade │ Classification   │ Lang │ Score │ Time (ms)│
├──────────────────┼───────┼──────────────────┼──────┼───────┼──────────┤
│ stripe.com       │ A     │ business         │ en   │ 0.923 │ 4821     │
│ salesforce.com   │ A     │ business         │ en   │ 0.881 │ 3204     │
│ cdc.gov          │ A     │ government       │ en   │ 1.000 │ 12       │
│ example.de       │ B     │ business         │ de   │ 0.541 │ 2109     │
│ 17work.cn        │ C     │ timeout          │      │ 1.000 │ 30012    │
│ parked.biz       │ C     │ parked           │ en   │ 1.000 │ 1832     │
│ old-co.com       │ D     │ redirect         │ en   │ 1.000 │ 521      │
│ empty.io         │ E     │ ns_only          │      │ 1.000 │ 412      │
│ ghost-xyz.net    │ F     │ unregistered     │      │ 1.000 │ 203      │
└──────────────────┴───────┴──────────────────┴──────┴───────┴──────────┘

CSV columns

domain, grade, classification, language, classification_score,
has_dns_a_record, is_registered, is_reachable, final_url, redirects_to,
copyright_found, has_phone, has_email, domain_age_days, registrar,
ssl_cert_error, processing_time_ms, error,
ai_classification, ai_confidence, ai_match, ai_reasoning,
dbi_company_name, dbi_domainrank, dbi_classification, dbi_grade, dbi_category
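
A results file with these columns can be post-processed with the standard library. The sketch below assumes the CSV starts with a header row using the column names listed above; domains_with_grades is an illustrative helper, not part of the package.

```python
import csv
import io


def domains_with_grades(fileobj, grades=("A", "B")):
    """Return the domain column for rows whose grade is in `grades`."""
    reader = csv.DictReader(fileobj)
    return [row["domain"] for row in reader if row["grade"] in grades]


# Inline sample using only the first three columns, for illustration.
sample = io.StringIO(
    "domain,grade,classification\n"
    "stripe.com,A,business\n"
    "parked.biz,C,parked\n"
    "example.de,B,business\n"
)
print(domains_with_grades(sample))
# prints ['stripe.com', 'example.de']
```

For a real file, pass open("results.csv", newline="") instead of the in-memory sample.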

REST API

The classifier is also available as a FastAPI server.

Start the server

uvicorn domain_classifier.api:app --reload

API docs are available at http://localhost:8000/docs.

Classify via API

# Single domain
curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"domain": "stripe.com"}'

# Batch (≤10 domains — synchronous response)
curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"domains": ["stripe.com", "parked.biz", "ghost-xyz.net"]}'

# Batch (>10 domains — returns a job_id)
curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"domains": ["a.com", "b.com", "..."]}'

# Poll job status
curl http://localhost:8000/result/<job_id>
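
The same calls can be made from Python with the standard library. This is a sketch under stated assumptions: the helper names are hypothetical, and the response bodies are returned as decoded JSON without further interpretation, because the response schema is not described here. Only the request shapes and the 10-domain threshold come from the curl examples above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"
SYNC_BATCH_LIMIT = 10  # larger batches return a job_id instead of results


def build_payload(domains):
    """One domain -> {"domain": ...}; several -> {"domains": [...]}."""
    if len(domains) == 1:
        return {"domain": domains[0]}
    return {"domains": list(domains)}


def returns_job(domains):
    """True when the batch is large enough to be processed as a job."""
    return len(domains) > SYNC_BATCH_LIMIT


def post_classify(domains, base_url=BASE_URL):
    """POST to /classify and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{base_url}/classify",
        data=json.dumps(build_payload(domains)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_result(job_id, base_url=BASE_URL):
    """GET /result/<job_id> and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/result/{job_id}") as resp:
        return json.load(resp)
```

A caller would check returns_job(domains) first, then either use the body of post_classify directly or poll get_result with the returned job_id.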

ML Model

The classifier ships with a heuristic scorer that works without any training data. To improve accuracy for your specific domain set, train an SGD model on labelled examples.

Train from a CSV

python -m domain_classifier.ml.train \
  --data labelled_domains.csv \
  --output domain_classifier/ml/models/classifier.joblib

labelled_domains.csv format:

domain,classification
stripe.com,business
parked-example.com,parked
under-construction.net,under_construction
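
Before training, it can help to check the labels for typos. A minimal sketch, assuming every value from the Classifications table is an acceptable label (the trainer may accept only a subset); unknown_labels is a hypothetical helper.

```python
import csv
import io

# Assumption: the label set is the full Classifications table.
KNOWN_LABELS = {
    "business", "government", "inactive", "parked", "ads",
    "under_construction", "gated", "blocked", "adult", "gambling",
    "personal", "redirect", "ns_only", "unregistered", "timeout", "error",
}


def unknown_labels(fileobj):
    """Return (line_number, label) pairs for labels not in KNOWN_LABELS."""
    reader = csv.DictReader(fileobj)
    return [
        (reader.line_num, row["classification"])
        for row in reader
        if row["classification"] not in KNOWN_LABELS
    ]


sample = io.StringIO(
    "domain,classification\n"
    "stripe.com,business\n"
    "parked-example.com,parkd\n"
)
print(unknown_labels(sample))
# prints [(3, 'parkd')]
```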

Active learning via the Chrome Extension

The browser extension (see chrome_extension/README.md) provides a labelling loop: browse → classify → label → retrain — without leaving Chrome.

Configuration

All settings can be overridden with environment variables prefixed DC_, or placed in a .env file in the project root.

Variable	Default	Description
`DC_WORKERS`	`5`	Concurrent domains in batch
`DC_HTTP_TIMEOUT`	`10`	HTTP request timeout (seconds)
`DC_PLAYWRIGHT_TIMEOUT`	`20`	Playwright render timeout (seconds)
`DC_DOMAIN_TIMEOUT`	`30`	Per-domain pipeline timeout (seconds)
`DC_CACHE_ENABLED`	`true`	Enable SQLite result cache
`DC_CACHE_PATH`	`domain_cache.db`	Cache database path
`DC_CACHE_TTL_HOURS`	`24`	Cache entry lifetime
`DC_MODEL_PATH`	`domain_classifier/ml/models/classifier.joblib`	Trained model path
`DC_AI_API_KEY`	(unset)	Anthropic API key for AI comparison mode
`DC_AI_MODEL`	`claude-haiku-4-5-20251001`	Claude model used for AI comparison
`DC_AI_COMPARE_THRESHOLD`	`0.6`	Only run AI comparison when static score is below this
`DC_DBI_ENABLED`	`true`	Enable DBI API lookups for inactive/blocked/error domains
`DC_DBI_AUTH_HEADER_PATH`	`~/.config/dbiapi/auth_header`	Path to DBI API auth header file
`DC_DBI_TIMEOUT`	`8`	DBI API request timeout (seconds)
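
To illustrate how an environment override interacts with a default, here is a standard-library sketch. It is not the package's own settings loader: read_setting is a hypothetical helper, only a few variables are shown, the accepted boolean spellings are an assumption, and .env files are not handled.

```python
import os

# A subset of the defaults from the table above.
DEFAULTS = {
    "DC_WORKERS": 5,
    "DC_HTTP_TIMEOUT": 10,
    "DC_PLAYWRIGHT_TIMEOUT": 20,
    "DC_DOMAIN_TIMEOUT": 30,
    "DC_CACHE_ENABLED": True,
    "DC_CACHE_TTL_HOURS": 24,
    "DC_AI_COMPARE_THRESHOLD": 0.6,
}


def read_setting(name, env=None):
    """Return the override from `env` if set, else the default."""
    env = os.environ if env is None else env
    default = DEFAULTS[name]
    raw = env.get(name)
    if raw is None:
        return default
    # bool must be checked first because bool is a subclass of int.
    if isinstance(default, bool):
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    # Coerce to the default's type (int or float).
    return type(default)(raw)
```

For example, running with DC_WORKERS=12 in the environment makes read_setting("DC_WORKERS") return 12 instead of 5.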

Run Tests

python3 -m pytest tests/ -v

Pipeline Stages

DNS check
    │
    ├─ No NS record ──────────────────────────────────► unregistered (F)
    │
    ├─ Institutional TLD (.gov/.mil/.edu/etc.) ───────► government (A)
    │
    ├─ NS only, no A record ──► WHOIS ───────────────► ns_only (E) or B if enterprise DNS
    │
    ▼
Fetch (httpx → Playwright fallback for JS-rendered sites)
    │
    ├─ Redirect to parking provider ─────────────────► parked (C)
    ├─ Redirect to other domain ──────────────────────► redirect (D)
    ├─ Blocked / error ───────────────────────────────► blocked/error → DNS upgrade → B or C
    │
    ▼
Content analysis (BeautifulSoup, langdetect)
    │
    ├─ Rule-based patterns (parked, gated, adult, etc.) ──► early exit C
    │
    ▼
ML classification (SGDClassifier or heuristic fallback)
    │
    ▼
Grade assignment (A–F)
    │
    ▼
DBI lookup (inactive / blocked / error at C only)
    │
    └─ domainrank or business classification ─────────► upgrade to B
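
The early exits of the DNS stage can be sketched as a small decision function. This follows the order shown in the diagram; dns_stage is a hypothetical name, the suffix list contains only the TLDs named in this README (the rest of the list is not reproduced), and the WHOIS step and the enterprise DNS upgrade for ns_only domains are left out.

```python
# Only the institutional suffixes named in this README; the full list
# used by the package is not shown here.
INSTITUTIONAL_SUFFIXES = (".gov", ".mil", ".edu", ".gov.uk", ".gc.ca")


def dns_stage(domain: str, has_ns: bool, has_a: bool):
    """Return an early classification, or None to continue to the fetch stage."""
    if not has_ns:
        return "unregistered"  # grade F
    if domain.lower().endswith(INSTITUTIONAL_SUFFIXES):
        return "government"  # grade A
    if not has_a:
        return "ns_only"  # grade E, before any upgrade
    return None
```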

Playwright is used automatically when the httpx response has fewer than 500 raw bytes or fewer than 50 visible words. The byte check catches near-empty responses; the word check catches JS-heavy SPAs that return a large skeleton HTML with no rendered text.
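
The fallback condition can be approximated with the standard library. This is a sketch, not the package's implementation: the pipeline uses BeautifulSoup for content analysis, whereas this version uses html.parser to stay dependency-free, and its notion of "visible" text (anything outside script, style, noscript, and template) is an assumption.

```python
from html.parser import HTMLParser

MIN_BYTES = 500  # thresholds taken from the paragraph above
MIN_WORDS = 50


class _VisibleText(HTMLParser):
    """Counts words that appear outside non-visible elements."""

    HIDDEN = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.words = 0
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.words += len(data.split())


def needs_playwright(body: bytes) -> bool:
    """True when the response is too small or has too little visible text."""
    if len(body) < MIN_BYTES:
        return True
    parser = _VisibleText()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return parser.words < MIN_WORDS
```

A large script-only skeleton passes the byte check but fails the word check, which is the SPA case described above.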
