Personal Scraper Lab 🕷️

Modular web scraping and data cleaning lab. Cleaners for addresses, phone numbers, emails, business names, and URLs. Jupyter notebooks with real output. Built as a live reference for production pipelines.

Live output → jays.website/BizIntel

1. What's Inside

personal-scraper-lab/
├── 01_foundations/          # Standalone cleaning functions — easy to copy into any project
│   ├── clean_addresses.py   # Regex parser: street / unit / city / state / zip
│   ├── clean_business_names.py  # Strips legal suffixes, keeps meaningful words
│   ├── clean_phone_numbers.py   # Normalizes to (XXX) XXX-XXXX
│   ├── clean_urls.py        # Adds https, strips UTM params
│   ├── clean_emails.py      # Validates format
│   └── clean_text.py        # General text normalization
│
├── 02_modular_scrapers/     # Pluggable scraper components
│   ├── base_scraper.py      # BaseScraper with retry logic + exponential backoff
│   └── parallel_scraper.py  # Multi-city scraper using ThreadPoolExecutor
│
├── 03_cleaning_playground/  # Core pipeline
│   ├── cleaning_engine.py   # Orchestrates all cleaners + export
│   ├── quality_scorer.py    # Scores each record 0–100 by field completeness
│   ├── deduplicator.py      # Fuzzy dedup via rapidfuzz
│   └── schema_validator.py  # pandera schema validation
│
├── notebooks/               # Jupyter notebooks with real scraped output
├── data/                    # raw/ and cleaned/ for comparison
└── run_lab.py               # Main runner
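The retry behaviour noted for base_scraper.py can be sketched as follows. This is a minimal illustration of retry with exponential backoff, not the code in this repo: the function name `retry_with_backoff`, its parameters, and the default delays are all assumptions.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the delays can be checked in a test without actually waiting.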

2. Quick Start 🚀

git clone https://github.com/JaysWebDev/personal-scraper-lab.git
cd personal-scraper-lab

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the full pipeline: scrape → clean → export
python3 run_lab.py

If the target site blocks requests, the pipeline automatically falls back to the most recent cached raw CSV.
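One way such a fallback could work is sketched below. The helper names (`latest_cached_csv`, `scrape_with_fallback`), the `data/raw` default, and the choice of "most recent" by file modification time are assumptions made for illustration; run_lab.py may do this differently.

```python
from pathlib import Path
from typing import Optional


def latest_cached_csv(raw_dir: str = "data/raw") -> Optional[Path]:
    """Return the most recently modified CSV in raw_dir, or None if there is none."""
    candidates = list(Path(raw_dir).glob("*.csv"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def scrape_with_fallback(scrape_fn, raw_dir: str = "data/raw"):
    """Try a live scrape; if it fails, fall back to the newest cached CSV.

    Returns a (source, payload) tuple where source is "live" or "cache".
    """
    try:
        return "live", scrape_fn()
    except Exception as exc:  # real code would catch specific request errors
        cached = latest_cached_csv(raw_dir)
        if cached is None:
            raise RuntimeError("scrape failed and no cached CSV was found") from exc
        return "cache", cached
```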

3. Cleaning Modules 🧹

Each cleaner in 01_foundations/ is standalone — copy any one into your own project.

Phone Numbers

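# Run from the repo root; 01_foundations/ starts with a digit, so add it to sys.path rather than importing it as a package
import sys
sys.path.insert(0, "01_foundations")
from clean_phone_numbers import clean_phone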

clean_phone("555.867.5309")      # → "(555) 867-5309"
clean_phone("(555) 867-5309")    # → "(555) 867-5309"
clean_phone("18005551234")       # → "(800) 555-1234"
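For reference, the normalization shown above can be reproduced with a few lines of standard-library Python. This is a sketch, not the contents of clean_phone_numbers.py: the name `clean_phone_sketch` is hypothetical, and returning `None` for input that is not a 10-digit US number is an assumption.

```python
import re
from typing import Optional


def clean_phone_sketch(raw: str) -> Optional[str]:
    """Normalize a US phone number to (XXX) XXX-XXXX, or return None."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading US country code
    if len(digits) != 10:
        return None  # assumption: unusable input yields None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```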

Addresses

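# Run from the repo root; 01_foundations/ starts with a digit, so add it to sys.path rather than importing it as a package
import sys
sys.path.insert(0, "01_foundations")
from clean_addresses import parse_address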

parse_address("123 Main St Apt 4, Austin, TX 78701")
# → {"street": "123 Main St", "unit": "Apt 4", "city": "Austin", "state": "TX", "zip": "78701"}
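A regex along these lines can split a single-line US address into the same five fields. It is only a sketch under stated assumptions: the pattern, the short list of unit designators, the name `parse_address_sketch`, and the choice to return `None` for unparseable input (and `None` for a missing unit) are illustrative, not taken from clean_addresses.py.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<street> [unit], <city>, <ST> <zip>"
ADDRESS_RE = re.compile(
    r"""
    ^\s*
    (?P<street>.+?)                                     # house number and street name
    (?:\s+(?P<unit>(?:Apt|Suite|Ste|Unit)\.?\s*\w+))?   # optional unit designator
    \s*,\s*
    (?P<city>[^,]+?)
    \s*,\s*
    (?P<state>[A-Za-z]{2})
    \s+
    (?P<zip>\d{5}(?:-\d{4})?)
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_address_sketch(raw: str) -> Optional[Dict[str, Optional[str]]]:
    """Split an address into street / unit / city / state / zip, or return None."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    parts = match.groupdict()  # unit is None when no designator is present
    parts["state"] = parts["state"].upper()
    return parts
```

Real-world addresses vary far more than this pattern allows (PO boxes, missing commas, spelled-out states), so a production parser needs either more patterns or a dedicated library.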

Business Names

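# Run from the repo root; 01_foundations/ starts with a digit, so add it to sys.path rather than importing it as a package
import sys
sys.path.insert(0, "01_foundations")
from clean_business_names import clean_name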

clean_name("Acme Flowers LLC")   # → "Acme Flowers"  (strips legal suffix)
clean_name("Bob's Florist Inc.") # → "Bob's Florist"
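Suffix stripping of this kind can be approximated with one anchored regex. The sketch below is illustrative only: `clean_name_sketch` and the `LEGAL_SUFFIXES` list are assumptions, and the real clean_business_names.py may recognise a different set of suffixes.

```python
import re

# Illustrative list; extend it for the jurisdictions you scrape.
LEGAL_SUFFIXES = ("llc", "inc", "corp", "co", "ltd", "lp", "llp")

# A trailing suffix, preceded by whitespace or a comma, with an optional period.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?\s*$",
    re.IGNORECASE,
)


def clean_name_sketch(name: str) -> str:
    """Collapse whitespace and strip one trailing legal suffix."""
    collapsed = " ".join(name.split())
    return _SUFFIX_RE.sub("", collapsed).strip()
```

Because the pattern requires whitespace or a comma before the suffix, a name that merely ends in those letters (or consists of the suffix alone) is left untouched.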

4. Quality Scoring

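Each record gets a score from 0–100 based on field completeness. It starts at the base value, loses points for each missing field, and earns a bonus when the core fields are all present: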

Condition	Points
Base	100
Missing name	−15
Missing phone	−20
Missing address	−20
Missing website	−10
Missing email	−10
All three core fields present	+10 bonus
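The table translates into a short scoring function. This sketch makes three assumptions that the table does not spell out: the core fields are name, phone, and address; the result is clamped to 0–100 so the bonus cannot push a complete record above 100; and a field counts as missing when it is absent, `None`, or blank. The names `score_record_sketch`, `PENALTIES`, and `CORE_FIELDS` are hypothetical.

```python
# Points deducted per missing field, taken from the table above.
PENALTIES = {"name": 15, "phone": 20, "address": 20, "website": 10, "email": 10}

# Assumption: these are the "three core fields".
CORE_FIELDS = ("name", "phone", "address")


def score_record_sketch(record: dict) -> int:
    """Score a record by field completeness, clamped to the 0-100 range."""

    def present(field: str) -> bool:
        value = record.get(field)
        return value is not None and str(value).strip() != ""

    score = 100
    for field, penalty in PENALTIES.items():
        if not present(field):
            score -= penalty
    if all(present(field) for field in CORE_FIELDS):
        score += 10  # bonus for a complete core
    return max(0, min(100, score))  # assumption: clamp to 0-100
```

Under these assumptions a record with only name, phone, and address scores 100 − 10 − 10 + 10 = 90, and an empty record scores 100 − 75 = 25.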

5. Run Tests

python3 -m pytest tests/ -v
# 42 tests covering all cleaners and the quality scorer

6. Notebooks 📓

Open the Jupyter notebook for interactive exploration:

jupyter notebook notebooks/cleaning_playground.ipynb

Shows real before/after comparisons across all cleaning steps.

Part of the JaysWebDev toolchain.
