A production-grade, headless cricket database system with comprehensive ETL pipeline for cricket data collection, processing, and analysis.
- Comprehensive Data Collection: Scrapes cricket data from multiple sources (ESPN Cricinfo, Cricket API)
- Robust ETL Pipeline: Extract, Transform, Load with idempotent upserts and data quality checks
- Production-Ready Database: MySQL 8 with optimized schema, constraints, and indexes
- Advanced Analytics: Rich SQL query library for cricket statistics and insights
- CLI Interface: Command-line tools with dry-run, rate limiting, and scheduling
- Data Quality: Automated validation, deduplication, and quality monitoring
- Scalable Architecture: Modular design with proper separation of concerns
IMPORTANT: This system is designed for educational and research purposes. Users must:
- Respect robots.txt: The system includes rate limiting and respects website crawling policies
- Follow ToS: Always review and comply with the Terms of Service of data sources
- Rate Limiting: Built-in delays and request limits to avoid overwhelming servers
- Ethical Usage: Use responsibly and consider the impact on source websites
- Attribution: Properly attribute data sources when using scraped information
The system includes configurable rate limiting, retry mechanisms, and dry-run modes to ensure respectful data collection practices.
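As an illustration of the rate limiting and backoff-with-jitter behaviour described above, here is a minimal, hypothetical sketch (the function name and defaults are invented, not taken from the codebase):

```python
import random
import time

def fetch_with_backoff(fetch, url, rate_limit_rps=1.0, max_retries=3):
    """Call fetch(url) no faster than rate_limit_rps, retrying failures
    with exponential backoff plus random jitter."""
    time.sleep(1.0 / rate_limit_rps)  # fixed delay between requests
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise
            # 1s, 2s, 4s, ... plus up to 0.5s of jitter to avoid thundering herds
            time.sleep(2 ** attempt + random.uniform(0, 0.5))
```

The jitter term matters when several workers retry at once: without it, they all come back at the same instant.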
CRITICAL NOTES:
- CricketArchive may require logged-in sessions and restrict automated access
- Keep credentials external (never hardcode), stop immediately if blocked
- Consider Cricsheet CSV as a lawful, robust supplementary source
- Schema supports multi-source ingest via SOURCE_ID for different data providers
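To illustrate how `SOURCE_ID`-keyed idempotent upserts can work, here is a minimal sketch using SQLite for portability (the real system targets MySQL, and the table and column names here are invented for the example):

```python
import sqlite3

# The unique key combines the source with the provider's own match identifier,
# so re-running the loader, or loading the same match again, never duplicates rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE matches (
        source_id  TEXT NOT NULL,   -- e.g. 'espn', 'cricsheet'
        source_key TEXT NOT NULL,   -- provider's match identifier
        start_date TEXT,
        result     TEXT,
        UNIQUE (source_id, source_key)
    )
""")

def upsert_match(source_id, source_key, start_date, result):
    # ON CONFLICT makes the load idempotent: repeats update in place.
    conn.execute(
        """INSERT INTO matches (source_id, source_key, start_date, result)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (source_id, source_key)
           DO UPDATE SET start_date = excluded.start_date,
                         result = excluded.result""",
        (source_id, source_key, start_date, result),
    )

upsert_match("espn", "m1001", "2024-01-05", "draw")
upsert_match("espn", "m1001", "2024-01-05", "home win")  # same match, updated
```

On MySQL the same effect is achieved with `INSERT ... ON DUPLICATE KEY UPDATE`.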
The following commands bootstrap the project, set up local MySQL, install Playwright, export environment variables, run a dry-run refresh, apply migrations, run the ETL, and run QA checks.
make bootstrap
make setup-mysql
playwright install --with-deps
export $(grep -v '^#' .env | xargs) && \
python -m etl.cli discover-latest --since 2024-01-01 && \
python -m etl.cli refresh --since 2024-01-01 --dry-run
make migrate
make etl
make qa

Notes:
- `make setup-mysql` provides instructions for local MySQL setup.
- On Windows PowerShell, replace the `export $(...)` line with manually setting the env vars, or run the commands inside Git Bash.
- Install MySQL 8.0+ locally on your system
- Run the setup script:
# Windows
scripts\setup_mysql.bat

# Linux/macOS
./scripts/setup_mysql.sh

# Or manually
mysql -u root -p < scripts/setup_mysql.sql
- Update `.env` with your MySQL credentials:

DB_HOST=127.0.0.1
DB_PORT=3306
DB_NAME=cricket_db
DB_USER=cricket_user
DB_PASSWORD=cricket_password
Use any MySQL client (MySQL Workbench, phpMyAdmin, or command line):
- Host: 127.0.0.1
- Port: 3306
- Username: `DB_USER` from `.env`
- Password: `DB_PASSWORD` from `.env`
- Database: `DB_NAME` from `.env`
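For example, a small helper can assemble these `.env` values into connection settings (a sketch; the project's own config loader may differ):

```python
import os

def db_settings(env=os.environ):
    """Build MySQL connection settings from the .env variables documented
    above (DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD)."""
    return {
        "host": env.get("DB_HOST", "127.0.0.1"),
        "port": int(env.get("DB_PORT", "3306")),
        "database": env.get("DB_NAME", "cricket_db"),
        "user": env.get("DB_USER", "cricket_user"),
        "password": env.get("DB_PASSWORD", ""),
    }

# The resulting dict can be passed straight to a client, e.g.:
#   mysql.connector.connect(**db_settings())
```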
Run these in your MySQL client to verify data presence after ETL:
-- Counts
SELECT COUNT(*) AS matches FROM matches;
SELECT COUNT(*) AS innings FROM innings;
SELECT COUNT(*) AS deliveries FROM deliveries;
-- Recent matches
SELECT id, format, start_date, venue_id, series_id, result_type, winner_team_id
FROM matches
ORDER BY start_date DESC
LIMIT 10;
-- Top batters by runs (sample)
SELECT p.full_name, SUM(bi.runs) AS runs
FROM batting_innings bi
JOIN players p ON p.id = bi.player_id
GROUP BY p.id
ORDER BY runs DESC
LIMIT 10;
-- Bowler wickets (sample)
SELECT p.full_name, SUM(bo.wickets) AS wkts
FROM bowling_innings bo
JOIN players p ON p.id = bo.player_id
GROUP BY p.id
ORDER BY wkts DESC
LIMIT 10;

- Respect Terms of Service and robots.txt. Only fetch content that is permitted and for which you have a legitimate use. When in doubt, obtain written consent.
- Conservative defaults to minimize load:
  - Rate limit: 1 request/second by default (configurable via `RATE_LIMIT_RPS`).
  - Exponential backoff with jitter on retries.
  - Safety cap for new work per run via `--max-new-matches` (and `MAX_NEW_MATCHES` in `.env`).
- Allowlist/Blocklist controls:
  - Configure `ETL_ALLOWLIST` and `ETL_BLOCKLIST` (comma-separated regex patterns) to explicitly permit or deny URL patterns.
  - Blocklist rules take precedence; the allowlist must match when present.
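The precedence rule can be sketched as follows (a hypothetical helper, not the project's actual implementation):

```python
import re

def url_permitted(url, allowlist=(), blocklist=()):
    """Apply ETL_BLOCKLIST/ETL_ALLOWLIST semantics: blocklist rules win,
    and when an allowlist is present the URL must match one of its patterns."""
    if any(re.search(p, url) for p in blocklist):
        return False                 # blocklist takes precedence
    if allowlist:
        return any(re.search(p, url) for p in allowlist)
    return True                      # no allowlist: permitted by default

# Example patterns (illustrative only)
allow = [r"cricinfo\.com/series/"]
block = [r"/live-", r"\?print=1"]
```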
- Credentials and secrets:
  - Use `.env` for local development. Never commit real secrets to source control.
  - Environment variables are loaded at runtime; rotate credentials periodically.
- Safe fetch modes:
  - `--dry-run`: execute without persisting responses to the database.
  - `--headers-only`: perform HTTP HEAD requests only, avoiding download of response bodies when you only need metadata/validation.
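The two flags can be illustrated with a small sketch in which the HTTP call and the persistence step are injected stand-ins (`http_get` and `save` are hypothetical, not the project's API):

```python
def fetch_page(url, http_get, save, dry_run=False, headers_only=False):
    """Sketch of the safe-fetch flags: headers_only issues a HEAD-style
    request (metadata only, no body), and dry_run skips persistence."""
    if headers_only:
        return http_get(url, method="HEAD")
    response = http_get(url, method="GET")
    if not dry_run:
        save(url, response)          # persist only outside dry-run
    return response
```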
The system includes comprehensive tables for:
- Teams: International and domestic cricket teams
- Players: Player profiles, roles, and career information
- Matches: Match details, venues, series, and results
- Innings: Inning-by-inning scorecards and statistics
- Ball-by-Ball: Detailed ball-by-ball data for analysis
- Player Statistics: Match-wise and career statistics
- Python 3.11+
- MySQL 8.0+
- Node.js (for Playwright)
- Clone the repository:
  git clone <repository-url>
  cd cricket-database
- Bootstrap the project:
  make bootstrap
- Configure environment:
  cp env.example .env
  # Edit .env with your database credentials
- Setup database:
  make migrate
- Run ETL pipeline:
  make etl
- Check data quality:
  make qa
- Run tests:
  make test
# Format code
make format
# Run linting
make lint
# Run tests
make test
# Run ETL pipeline
make etl
# Run quality checks
make qa
# Clean up
make clean

Use your system scheduler to run ETL and refresh season summaries nightly. Example crontab (Linux):
# Run nightly at 02:30. Ensure virtualenv and env vars are loaded accordingly.
30 2 * * * cd /path/to/repo && SEASON_ID=123 make nightly >> logs/nightly.log 2>&1

On Windows Task Scheduler, create a task running:
cmd /c "cd /d C:\path\to\repo && set SEASON_ID=123 && make nightly"

The top-level ETL CLI is in src/etl/cli.py:
# Discover latest series/competitions since a date
python -m etl.cli discover-latest --since 2024-01-01
# Fetch queued pages
python -m etl.cli fetch
# Parse recent raw_html to cached models (data/cache/parsed)
python -m etl.cli parse
# Load cached models into DB (idempotent)
python -m etl.cli load
# Run the full incremental refresh
python -m etl.cli refresh --since 2024-01-01

To schedule a nightly refresh via cron without Docker:
0 3 * * * cd /path/to/repo && /path/to/venv/bin/python -m etl.cli refresh --since $(date -d 'yesterday' +\%Y-\%m-\%d) >> logs/etl_refresh.log 2>&1

# Setup database schema
python -m src.cli setup-db
# Setup with force recreation
python -m src.cli setup-db --force

# Scrape all data
python -m src.cli scrape
# Scrape specific data type
python -m src.cli scrape --data-type matches
# Scrape from specific source
python -m src.cli scrape --source espn
# Limit number of records
python -m src.cli scrape --limit 100

# Incremental update (last 7 days)
python -m src.cli update --incremental
# Incremental update with custom days
python -m src.cli update --incremental --days-back 14
# Full data refresh
python -m src.cli update --full

# Run data quality checks
python -m src.cli quality-check
# Validate data sources
python -m src.cli validate-sources
# Show system status
python -m src.cli status

# Schedule daily updates at 2 AM
python -m src.cli schedule
# Schedule every 6 hours
python -m src.cli schedule --schedule "0 */6 * * *"
# Schedule incremental updates
python -m src.cli schedule --incremental

# Test without making changes
python -m src.cli scrape --dry-run
python -m src.cli update --incremental --dry-run

The system includes a comprehensive library of SQL queries for cricket analytics:
- Top batsmen by runs, average, strike rate
- Centuries and fifties analysis
- Boundary hitting patterns
- Venue and opposition performance
- Form analysis and consistency metrics
- Top bowlers by wickets, average, economy
- Five-wicket hauls and best figures
- Death overs and powerplay analysis
- Venue and opposition performance
- Consistency and impact analysis
- Win-loss records and rankings
- Home vs away performance
- Series and tournament analysis
- Head-to-head records
- Toss impact analysis
- Recent and upcoming matches
- High/low scoring matches
- Close matches analysis
- Venue characteristics
- Match momentum analysis
- Powerplay and death overs analysis
- Run rate patterns by over
- Boundary and wicket analysis
- Player impact on match outcomes
- Venue scoring characteristics
- Scrapers (`src/cricket_database/scrapers/`)
  - ESPN Cricinfo scraper with Playwright
  - Cricket API scraper with HTTPX
  - Base scraper with rate limiting and retries
- ETL Pipeline (`src/cricket_database/etl/`)
  - Data transformation and validation
  - Database loading with upserts
  - Quality checks and monitoring
- Database Models (`src/cricket_database/models/`)
  - SQLAlchemy ORM models
  - Pydantic validation schemas
  - Comprehensive relationships
- CLI Interface (`src/cricket_database/cli/`)
  - Rich command-line interface
  - Progress tracking and logging
  - Configuration management
- SQL Queries (`sql/queries/`)
  - Organized query library
  - Batting, bowling, team, match analytics
  - Advanced statistical queries
Data Sources   →  Scrapers      →  ETL Pipeline  →  Database  →  Analytics
      ↓               ↓                 ↓              ↓            ↓
ESPN Cricinfo  →  ESPN Scraper  →  Transform     →  MySQL     →  SQL Queries
Cricket API    →  API Scraper   →  Validate      →  Tables    →  Statistics
# Database Configuration
DB_HOST=localhost
DB_PORT=3306
DB_NAME=cricket_db
DB_USER=cricket_user
DB_PASSWORD=your_secure_password
# Scraping Configuration
SCRAPER_RATE_LIMIT=1.0
SCRAPER_RETRY_ATTEMPTS=3
SCRAPER_TIMEOUT=30
SCRAPER_USER_AGENT=CricketDataBot/1.0
# Rate Limiting
MAX_REQUESTS_PER_MINUTE=60
MAX_REQUESTS_PER_HOUR=1000
# Data Quality
ENABLE_DATA_VALIDATION=true
ENABLE_DUPLICATE_CHECK=true
BATCH_SIZE=1000

# Development
make dev # Full development setup
make install # Install dependencies
make dev-install # Install with dev tools
make setup-db # Initialize database
# Data Operations
make run-scraper # Run data scraper
make dry-run # Test scraper in dry-run mode
make update-incremental # Incremental update
make update-full # Full data refresh
# Quality & Testing
make test # Run tests
make test-cov # Run tests with coverage
make lint # Run linting
make format # Format code
make quality-check # Run data quality checks
# Maintenance
make clean # Clean temporary files
make ci            # Run CI pipeline locally

# Run all tests
make test
# Run with coverage
make test-cov
# Run specific test categories
pytest tests/unit/ # Unit tests
pytest tests/integration/ # Integration tests
pytest -m "not slow" # Skip slow tests
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/unit/test_models.py

The system includes comprehensive data quality checks:
- Validation: Pydantic schema validation
- Deduplication: Automatic duplicate detection
- Referential Integrity: Foreign key validation
- Data Freshness: Stale data identification
- Consistency Checks: Cross-table validation
- Quality Scoring: Overall data quality metrics
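Two of these checks, deduplication and referential integrity, can be illustrated with a toy sketch over plain dicts (field names are illustrative, not the real schema):

```python
def quality_report(matches, deliveries):
    """Toy versions of two checks: duplicate detection on a natural key,
    and referential integrity between deliveries and matches."""
    seen, duplicate_keys = set(), []
    for m in matches:
        key = (m["source_id"], m["source_key"])
        if key in seen:
            duplicate_keys.append(key)   # same natural key seen twice
        else:
            seen.add(key)
    match_ids = {m["id"] for m in matches}
    # Deliveries whose match_id has no parent row would violate a foreign key
    orphans = [d for d in deliveries if d["match_id"] not in match_ids]
    return {"duplicates": duplicate_keys, "orphan_deliveries": orphans}
```

In the real system, checks like these run in SQL against the loaded tables rather than over in-memory rows.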
cricket-database/
├── src/
│   ├── cricket_database/
│   │   ├── models/          # Database models
│   │   ├── schemas/         # Pydantic schemas
│   │   ├── scrapers/        # Data scrapers
│   │   ├── etl/             # ETL pipeline
│   │   ├── cli/             # CLI interface
│   │   └── utils/           # Utilities
│   └── cli.py               # CLI entry point
├── sql/
│   └── queries/             # SQL query library
├── tests/                   # Test suite
├── logs/                    # Log files
├── data/                    # Data files
├── requirements.txt         # Dependencies
├── pyproject.toml           # Project configuration
├── Makefile                 # Build commands
└── README.md                # This file
- Create a new scraper class inheriting from `BaseScraper`
- Implement the required methods: `scrape_teams()`, `scrape_players()`, `scrape_matches()`
- Add the scraper to the ETL pipeline configuration
- Update tests and documentation
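The steps above might look like this in outline (the stub `BaseScraper` here only mirrors the interface; the real base class lives in `src/cricket_database/scrapers/` and adds rate limiting and retries):

```python
class BaseScraper:
    """Stand-in for the project's base class, interface only."""
    def scrape_teams(self): raise NotImplementedError
    def scrape_players(self): raise NotImplementedError
    def scrape_matches(self): raise NotImplementedError

class MySourceScraper(BaseScraper):
    """Hypothetical new scraper implementing the three required methods."""
    def scrape_teams(self):
        return [{"name": "Example XI"}]        # fetch + parse in practice
    def scrape_players(self):
        return [{"full_name": "A. Batter"}]
    def scrape_matches(self):
        return [{"source_key": "m1"}]
```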
- Create new query classes in `sql/queries/`
- Add methods for specific analytics
- Update the query library exports
- Add tests and examples
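A new query class might follow this shape (hypothetical; the actual query-library layout and method names may differ). Each method returns parameterized SQL plus its parameters, so callers never interpolate values into SQL strings:

```python
class BattingQueries:
    """Sketch of a query class in the style of sql/queries/."""

    def top_run_scorers(self, limit=10):
        # Mirrors the sample verification query shown earlier in this README
        sql = (
            "SELECT p.full_name, SUM(bi.runs) AS runs "
            "FROM batting_innings bi "
            "JOIN players p ON p.id = bi.player_id "
            "GROUP BY p.id ORDER BY runs DESC LIMIT %s"
        )
        return sql, (limit,)
```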
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- ESPN Cricinfo for cricket data
- Playwright for web scraping capabilities
- SQLAlchemy for database ORM
- Pydantic for data validation
- Rich for beautiful CLI interfaces
For questions, issues, or contributions:
- Create an issue on GitHub
- Check the documentation
- Review the test suite for examples
- Join the community discussions
Note: This system is designed for educational and research purposes. Please respect the terms of service of data sources and implement appropriate rate limiting and ethical scraping practices.