JaysWebDev/personal-scraper-lab

Personal Scraper Lab 🕷️

Modular web scraping and data cleaning lab. Cleaners for addresses, phone numbers, emails, business names, and URLs. Jupyter notebooks with real output. Built as a live reference for production pipelines.

Live output → jays.website/BizIntel


1. What's Inside

personal-scraper-lab/
├── 01_foundations/          # Standalone cleaning functions — easy to copy into any project
│   ├── clean_addresses.py   # Regex parser: street / unit / city / state / zip
│   ├── clean_business_names.py  # Strips legal suffixes, keeps meaningful words
│   ├── clean_phone_numbers.py   # Normalizes to (XXX) XXX-XXXX
│   ├── clean_urls.py        # Adds https, strips UTM params
│   ├── clean_emails.py      # Validates format
│   └── clean_text.py        # General text normalization
│
├── 02_modular_scrapers/     # Pluggable scraper components
│   ├── base_scraper.py      # BaseScraper with retry logic + exponential backoff
│   └── parallel_scraper.py  # Multi-city scraper using ThreadPoolExecutor
│
├── 03_cleaning_playground/  # Core pipeline
│   ├── cleaning_engine.py   # Orchestrates all cleaners + export
│   ├── quality_scorer.py    # Scores each record 0–100 by field completeness
│   ├── deduplicator.py      # Fuzzy dedup via rapidfuzz
│   └── schema_validator.py  # pandera schema validation
│
├── notebooks/               # Jupyter notebooks with real scraped output
├── data/                    # raw/ and cleaned/ for comparison
└── run_lab.py               # Main runner
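
The tree notes that base_scraper.py provides retry logic with exponential backoff. The README does not show that code, so the following is only a minimal sketch of the general pattern. The name `fetch_with_retry` and its parameters are hypothetical and are not taken from the repository's BaseScraper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration, not the repository's BaseScraper.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets tests record the delays instead of actually waiting.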

2. Quick Start 🚀

git clone https://github.com/JaysWebDev/personal-scraper-lab.git
cd personal-scraper-lab

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the full pipeline: scrape → clean → export
python3 run_lab.py

If the target site blocks requests, the pipeline automatically falls back to the most recent cached raw CSV.
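
One way such a fallback could work is sketched below. It assumes that cached files are CSVs in a single directory (for example data/raw/) and that "most recent" means latest modification time. Both function names are hypothetical, and run_lab.py may do this differently.

```python
from pathlib import Path
from typing import Callable, Optional


def latest_cached_csv(raw_dir: Path) -> Optional[Path]:
    """Return the most recently modified CSV in raw_dir, or None if there is none."""
    candidates = list(raw_dir.glob("*.csv"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def scrape_or_fallback(scrape: Callable[[], Path], raw_dir: Path) -> Path:
    """Run scrape(); if it raises, fall back to the newest cached raw CSV."""
    try:
        return scrape()
    except Exception as exc:
        cached = latest_cached_csv(raw_dir)
        if cached is None:
            # Nothing to fall back to: report both failures together.
            raise RuntimeError("scrape failed and no cached CSV exists") from exc
        return cached
```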

3. Cleaning Modules 🧹

Each cleaner in 01_foundations/ is standalone — copy any one into your own project.

Phone Numbers

from foundations.clean_phone_numbers import clean_phone

clean_phone("555.867.5309")      # → "(555) 867-5309"
clean_phone("(555) 867-5309")    # → "(555) 867-5309"
clean_phone("18005551234")       # → "(800) 555-1234"
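
For readers who want to see the idea without opening the module, here is a minimal sketch that reproduces the three outputs above. It is not the repository's `clean_phone`. The name `normalize_us_phone` is hypothetical, and returning None for input that cannot be normalized is an assumption.

```python
import re
from typing import Optional


def normalize_us_phone(raw: str) -> Optional[str]:
    """Normalize a US phone number to (XXX) XXX-XXXX.

    Returns None when the input does not contain exactly 10 digits
    (after dropping a leading country code 1). That behavior is an assumption.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```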

Addresses

from foundations.clean_addresses import parse_address

parse_address("123 Main St Apt 4, Austin, TX 78701")
# → {"street": "123 Main St", "unit": "Apt 4", "city": "Austin", "state": "TX", "zip": "78701"}
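
A regex-based parser of this kind might look like the sketch below. It only handles the "street [unit], city, ST zip" layout of the example above. The pattern, the unit keywords, and the name `split_address` are assumptions for illustration and are not the contents of clean_addresses.py.

```python
import re
from typing import Dict, Optional

# Verbose pattern: street, optional unit, city, two-letter state, ZIP or ZIP+4.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<street>.+?)                                    # street, as short as possible
    (?:\s+(?P<unit>(?:Apt|Suite|Ste|Unit|\#)\s*\w+))?  # optional unit designator
    \s*,\s*(?P<city>[^,]+?)
    \s*,\s*(?P<state>[A-Za-z]{2})
    \s+(?P<zip>\d{5}(?:-\d{4})?)
    \s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def split_address(raw: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a one-line US address into parts, or return None if it does not match."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    parts = match.groupdict()  # "unit" is None when the address has no unit
    parts["state"] = parts["state"].upper()
    return parts
```

Real-world addresses vary far more than this pattern allows, so a sketch like this is best treated as a starting point.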

Business Names

from foundations.clean_business_names import clean_name

clean_name("Acme Flowers LLC")   # → "Acme Flowers"  (strips legal suffix)
clean_name("Bob's Florist Inc.") # → "Bob's Florist"
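
Suffix stripping can be sketched with a single anchored regex. The suffix list below is illustrative only, since the README does not show the list the repository uses, and `strip_legal_suffix` is a hypothetical name.

```python
import re

# Illustrative suffix list (an assumption, not the repository's list).
LEGAL_SUFFIX_RE = re.compile(
    r"[\s,]+(?:LLC|Inc|Incorporated|Corp|Corporation|Co|Ltd|LLP)\.?\s*$",
    re.IGNORECASE,
)


def strip_legal_suffix(name: str) -> str:
    """Remove one trailing legal suffix and collapse repeated whitespace."""
    cleaned = LEGAL_SUFFIX_RE.sub("", name.strip())
    return re.sub(r"\s+", " ", cleaned)
```

Requiring whitespace or a comma before the suffix keeps names such as "Costco" intact, because the suffix must be a separate trailing word.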

4. Quality Scoring

Each record gets a score from 0–100 based on field completeness:

| Condition | Points |
|---|---|
| Base | 100 |
| Missing name | −15 |
| Missing phone | −20 |
| Missing address | −20 |
| Missing website | −10 |
| Missing email | −10 |
| All three core fields present | +10 bonus |
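
The scoring rules can be sketched as a small function. Two points are assumptions, because the README does not spell them out: that the three core fields are name, phone, and address, and that the result is clamped so the bonus cannot push a complete record above 100. The name `score_record` is hypothetical and this is not the code in quality_scorer.py.

```python
from typing import Mapping, Optional

# Deductions copied from the scoring table.
PENALTIES = {"name": 15, "phone": 20, "address": 20, "website": 10, "email": 10}
# Assumption: the "three core fields" are name, phone, and address.
CORE_FIELDS = ("name", "phone", "address")
BONUS = 10


def score_record(record: Mapping[str, Optional[str]]) -> int:
    """Score a record by field completeness."""

    def present(field: str) -> bool:
        value = record.get(field)
        return value is not None and str(value).strip() != ""

    score = 100 - sum(p for f, p in PENALTIES.items() if not present(f))
    if all(present(f) for f in CORE_FIELDS):
        score += BONUS
    return max(0, min(100, score))  # assumption: clamp to the stated 0–100 range
```

Under these assumptions, a record with name, phone, and address but no website or email scores 100 − 10 − 10 + 10 = 90, and a completely empty record scores 100 − 75 = 25.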

5. Run Tests

python3 -m pytest tests/ -v
# 42 tests covering all cleaners and the quality scorer

6. Notebooks 📓

Open the Jupyter notebook for interactive exploration:

jupyter notebook notebooks/cleaning_playground.ipynb

The notebook shows real before/after comparisons across all cleaning steps.


Part of the JaysWebDev toolchain.
