Modular web scraping and data cleaning lab. Cleaners for addresses, phone numbers, emails, business names, and URLs. Jupyter notebooks with real output. Built as a live reference for production pipelines.
Live output → jays.website/BizIntel
personal-scraper-lab/
├── 01_foundations/ # Standalone cleaning functions — easy to copy into any project
│ ├── clean_addresses.py # Regex parser: street / unit / city / state / zip
│ ├── clean_business_names.py # Strips legal suffixes, keeps meaningful words
│ ├── clean_phone_numbers.py # Normalizes to (XXX) XXX-XXXX
│ ├── clean_urls.py # Adds https, strips UTM params
│ ├── clean_emails.py # Validates format
│ └── clean_text.py # General text normalization
│
├── 02_modular_scrapers/ # Pluggable scraper components
│ ├── base_scraper.py # BaseScraper with retry logic + exponential backoff
│ └── parallel_scraper.py # Multi-city scraper using ThreadPoolExecutor
│
├── 03_cleaning_playground/ # Core pipeline
│ ├── cleaning_engine.py # Orchestrates all cleaners + export
│ ├── quality_scorer.py # Scores each record 0–100 by field completeness
│ ├── deduplicator.py # Fuzzy dedup via rapidfuzz
│ └── schema_validator.py # pandera schema validation
│
├── notebooks/ # Jupyter notebooks with real scraped output
├── data/ # raw/ and cleaned/ for comparison
└── run_lab.py # Main runner
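The retry behavior noted for `base_scraper.py` can be pictured with a small sketch. This is not the repository's code: `retry_with_backoff`, its parameters, and the injectable `sleep` argument are hypothetical names chosen for illustration, and the jitter is an added assumption rather than something the README states.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is reached.

    The wait doubles after each failure: base_delay, 2x, 4x, ...
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Assumption: small random jitter so parallel workers
            # do not all retry at the same instant.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; in normal use the default `time.sleep` applies.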
git clone https://github.com/JaysWebDev/personal-scraper-lab.git
cd personal-scraper-lab
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the full pipeline: scrape → clean → export
python3 run_lab.py
If the target site blocks requests, the pipeline automatically falls back to the most recent cached raw CSV.
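One way such a fallback could work is sketched below. It assumes cached files are `.csv` files in a single directory and that "most recent" means newest modification time; `latest_cached_csv` and `scrape_or_fallback` are hypothetical names, not functions from `run_lab.py`.

```python
from pathlib import Path
from typing import Callable, Optional


def latest_cached_csv(raw_dir: Path) -> Optional[Path]:
    """Return the most recently modified .csv in raw_dir, or None."""
    candidates = list(raw_dir.glob("*.csv"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def scrape_or_fallback(scrape: Callable[[Path], Path], raw_dir: Path) -> Path:
    """Run scrape(raw_dir); on failure, return the newest cached CSV."""
    try:
        return scrape(raw_dir)
    except Exception as exc:
        cached = latest_cached_csv(raw_dir)
        if cached is None:
            # Nothing to fall back to, so report the original failure.
            raise RuntimeError("scrape failed and no cached CSV exists") from exc
        return cached
```

Here `scrape` stands in for whatever callable performs the live scrape and returns the path of the CSV it wrote.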
Each cleaner in 01_foundations/ is standalone — copy any one into your own project.
from foundations.clean_phone_numbers import clean_phone
clean_phone("555.867.5309") # → "(555) 867-5309"
clean_phone("(555) 867-5309") # → "(555) 867-5309"
clean_phone("18005551234") # → "(800) 555-1234"
from foundations.clean_addresses import parse_address
parse_address("123 Main St Apt 4, Austin, TX 78701")
# → {"street": "123 Main St", "unit": "Apt 4", "city": "Austin", "state": "TX", "zip": "78701"}
from foundations.clean_business_names import clean_name
clean_name("Acme Flowers LLC") # → "Acme Flowers" (strips legal suffix)
clean_name("Bob's Florist Inc.") # → "Bob's Florist"
Each record gets a score from 0–100 based on field completeness:
| Condition | Points |
|---|---|
| Base | 100 |
| Missing name | −15 |
| Missing phone | −20 |
| Missing address | −20 |
| Missing website | −10 |
| Missing email | −10 |
| All three core fields present | +10 bonus |
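The table can be read as a simple additive rule. The sketch below is one interpretation, not the code in `quality_scorer.py`: it assumes the three core fields are name, phone, and address (the table does not name them), treats empty or whitespace-only strings as missing, and clamps the result so the bonus cannot push a score past 100. The names `quality_score`, `PENALTIES`, and `CORE_FIELDS` are hypothetical.

```python
from typing import Mapping, Optional

# Penalties from the table above, keyed by field name.
PENALTIES = {"name": 15, "phone": 20, "address": 20, "website": 10, "email": 10}
# Assumption: the README does not say which three fields are "core".
CORE_FIELDS = ("name", "phone", "address")
BONUS = 10


def is_present(value: Optional[str]) -> bool:
    """Treat None, empty, and whitespace-only strings as missing."""
    return value is not None and value.strip() != ""


def quality_score(record: Mapping[str, Optional[str]]) -> int:
    score = 100  # base
    for field, penalty in PENALTIES.items():
        if not is_present(record.get(field)):
            score -= penalty
    if all(is_present(record.get(f)) for f in CORE_FIELDS):
        score += BONUS
    # Assumption: clamp so the result stays in the documented 0-100 range.
    return max(0, min(100, score))
```

Under these assumptions, a record with name, phone, and address but no website or email scores 100 − 10 − 10 + 10 = 90, and a record with no fields at all scores 100 − 75 = 25.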
python3 -m pytest tests/ -v
# 42 tests covering all cleaners and the quality scorer
Open the Jupyter notebook for interactive exploration:
jupyter notebook notebooks/cleaning_playground.ipynb
The notebook shows real before/after comparisons across all cleaning steps.
Part of the JaysWebDev toolchain.