
Production-Ready Job Scraper

A robust data pipeline that scrapes job postings from a mock API, transforming a proof-of-concept script into a production-ready application with a focus on performance, reliability, and data integrity.

Candidate: Baran Batı
Company: xxxxx

Key Architectural Decisions

1. Modular Architecture with Separation of Concerns

Decision: Refactored monolithic script into 8 specialized modules while maintaining scraper.py as the entry point.

Reasoning: The original script had grown bloated and violated the Single Responsibility Principle. Separation of concerns was the priority: each module has a single, well-defined responsibility, which makes testing, maintenance, and error isolation easier.
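
For orientation, a plausible layout consistent with the responsibilities described in this README. Apart from scraper.py, config.yaml, tests.py, and jobs_data.json (all named elsewhere in this document), the module names below are illustrative, not the actual file names:

src/swe/
    scraper.py       # entry point: orchestrates fetch -> process -> save (actual name)
    api_client.py    # HTTP calls and pagination (illustrative name)
    retry.py         # retry decorator with backoff and jitter (illustrative name)
    models.py        # Pydantic job model (illustrative name)
    ...              # remaining modules: normalization, storage, config, logging
config.yaml          # externalized settings
tests.py             # unit test suite
jobs_data.json       # scraper output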

2. Concurrent Processing

Decision: Implemented ThreadPoolExecutor with 500 concurrent workers (configurable).

Reasoning: Sequential processing of 5000+ companies would take 12-14 hours. Concurrent processing reduces execution time to under 15 minutes, depending on the worker configuration.

Trade-off: Increased memory usage and complexity vs 50x performance improvement.
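
A minimal sketch of the pattern, assuming a process_company worker function (the name is illustrative, not the actual implementation):

from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(companies, worker_count=500):
    """Process companies concurrently, one future per company."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # process_company is the per-company worker defined elsewhere (hypothetical name)
        futures = {pool.submit(process_company, company): company for company in companies}
        for future in as_completed(futures):
            company = futures[future]
            try:
                results.extend(future.result())
            except Exception as exc:  # a failed company must not stop the whole run
                failures.append((company, exc))
    return results, failures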

3. Robust Error Handling with Retry Decorator

Decision: Comprehensive retry mechanism with exponential backoff and jitter implemented as a reusable decorator.

Reasoning: Production APIs are unreliable. Transient failures (5xx errors, timeouts, rate limiting) shouldn't crash the entire process.

Implementation:

  • Exponential backoff (1s → 2s → 4s) with jitter to prevent thundering herd effects
  • Decorator pattern for easy application across multiple functions without code repetition
  • Selective retry logic (retries 5xx/429, not 4xx), graceful degradation
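
A minimal sketch of such a decorator, assuming requests-style exceptions and the backoff schedule above; the names and defaults here are illustrative:

import functools
import random
import time

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def with_retries(max_attempts=3, base_delay=1.0):
    """Retry transient failures with exponential backoff plus jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except requests.HTTPError as exc:
                    status = exc.response.status_code if exc.response is not None else None
                    if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                        raise  # non-retryable 4xx, or out of attempts
                except (requests.Timeout, requests.ConnectionError):
                    if attempt == max_attempts - 1:
                        raise
                # exponential backoff: 1s -> 2s -> 4s, plus jitter against thundering herds
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
        return wrapper
    return decorator

Applying it is then a one-line change: decorate each API-calling function with @with_retries().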

4. Data Validation & Normalization

Decision: Streamlined 6-field Pydantic model with comprehensive normalization and extended data capture.

Reasoning: API responses have inconsistent formats (dates, locations) and field mappings that need standardization for reliable storage. Additionally, capturing extra fields ensures future-proofing for API evolution.

Trade-off: Prioritized data quality and processing efficiency while maintaining flexibility for future API changes through the extra_fields JSON object.
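
A sketch of what such a model might look like, assuming Pydantic v2 and the six documented core fields plus extra_fields; the validator shown is illustrative:

from typing import Any, Dict, Optional

from pydantic import BaseModel, Field, field_validator

class JobPosting(BaseModel):
    jobId: str
    title: str
    company: str
    location: Optional[str] = None
    applicants: int = 0
    postedDate: Optional[str] = None  # normalized to YYYY-MM-DD
    extra_fields: Dict[str, Any] = Field(default_factory=dict)

    @field_validator("applicants", mode="before")
    @classmethod
    def non_negative(cls, value):
        # coerce missing or malformed counts to 0 instead of rejecting the record
        try:
            return max(int(value), 0)
        except (TypeError, ValueError):
            return 0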

5. Infinite Loop Protection

Problem: During testing, certain companies caused infinite loops with repeated pagination tokens.

Solution: Token-based loop detection using a seen_tokens set.

Benefits: Addresses the root cause directly, detects the repeat immediately, and produces no false positives.
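
A sketch of the guard inside a pagination loop; fetch_page and the nextPageToken field name are assumptions, not the actual API contract:

def fetch_all_jobs(company_id):
    """Walk pagination until there is no token, or until a token repeats."""
    jobs, token, seen_tokens = [], None, set()
    while True:
        page = fetch_page(company_id, token)  # hypothetical API call
        jobs.extend(page["jobs"])
        token = page.get("nextPageToken")
        if not token:
            break  # normal end of pagination
        if token in seen_tokens:
            break  # repeated token: bail out instead of looping forever
        seen_tokens.add(token)
    return jobs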

6. Extended Data Capture

Decision: Automatically capture all API fields beyond the core 6 fields in an extra_fields JSON object.

Reasoning: APIs evolve over time, adding new fields like salary ranges, remote work options, skill requirements, etc. Capturing these automatically ensures the scraper remains future-proof without requiring code changes when the API schema changes.

Implementation:

  • Core fields (jobId, title, company, location, applicants, postedDate) are normalized and validated
  • Any additional fields from the API response are automatically stored in extra_fields
  • Backward compatible - existing functionality unchanged
  • No configuration needed - always captures extra fields

Trade-off: Slightly increased storage size vs complete future-proofing and data preservation.
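
A sketch of how the core/extra split can be made, using the six core field names documented in the Data Structure section below:

CORE_FIELDS = {"jobId", "title", "company", "location", "applicants", "postedDate"}

def split_fields(raw_job: dict) -> dict:
    """Keep the six core fields at the top level; park everything else in extra_fields."""
    record = {key: raw_job.get(key) for key in CORE_FIELDS}
    record["extra_fields"] = {k: v for k, v in raw_job.items() if k not in CORE_FIELDS}
    return record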

Production-Ready Features Implemented

To transform the proof-of-concept into a production-ready application, I implemented improvements systematically across three phases (Core Reliability, Production-Readiness, and Polish). Key achievements include:

  • Fixed Critical Pagination Bug: The original script fetched only the first page per company - the scraper now correctly retrieves all job pages
  • Externalized Configuration: Moved all hardcoded values to config.yaml with fallback defaults for environment-specific deployments
  • Professional User Interface: Added real-time progress bars with clear status indicators showing completion percentage, processing rate, etc.
  • Structured Logging: Replaced all print() statements with proper logging infrastructure and reduced threading noise
  • Comprehensive Test Coverage: 27 unit tests covering data normalization (date/location formats), API error handling (HTTP errors, network failures, timeouts), retry logic, and model validation - all using mocking with no external dependencies
  • Thread-Safe Operations: All database operations protected with locks to prevent race conditions during concurrent processing (see the sketch after this list)
  • Graceful Error Handling: Failed companies don't crash the entire process - execution continues with detailed error reporting and isolation
  • Data Validation: Pydantic models ensure data integrity and type safety throughout the pipeline with comprehensive field mapping
  • Extended Data Capture: Automatically captures additional API metadata beyond core fields in extra_fields JSON object for future analysis
  • Configurable Performance: Worker count adjustable via configuration for different system capabilities (tested from 1 to 800+ workers)
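
As referenced in the thread-safety bullet above, a minimal sketch of lock-protected shared state; the class and method names are illustrative:

import threading

class JobStore:
    """In-memory store shared across worker threads; one lock guards every mutation."""

    def __init__(self):
        self._jobs = []
        self._lock = threading.Lock()

    def add_jobs(self, jobs):
        with self._lock:  # serialize concurrent appends from hundreds of workers
            self._jobs.extend(jobs)

    def snapshot(self):
        with self._lock:
            return list(self._jobs)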

Possible Future Implementations

Several more complex features were considered but deferred; they could further harden the project in the future.

  • Full Async/Await Refactor: Deferred because ThreadPoolExecutor already provides sufficient performance without the complexity of asyncio
  • Resume Capability: A checkpointing system for resuming interrupted scrapes, deemed too complex for this scope
  • Advanced Monitoring: Circuit breakers, detailed metrics, and alerting systems beyond basic logging
  • Graceful Shutdown: Signal handling for clean Ctrl+C interruption (deferred to avoid additional complexity)

Trade-offs Considered

  • ThreadPoolExecutor vs asyncio: Chose threading for simplicity and reliability with blocking I/O operations
  • JSON vs Database: JSON file sufficient for project scale; simpler than database setup
  • Comprehensive vs Focused Data: Prioritized essential 6 fields while capturing additional metadata in extra_fields for future flexibility
  • Testing Complexity vs Reliability: 27 comprehensive tests add development time but ensure production quality
  • Modular Separation vs Simplicity: Chose 8 specialized modules over monolithic script - adds initial complexity and more files to manage but provides better maintainability, testability, and long-term code organization

Installation

Prerequisites

  • Python 3.9+
  • Docker and Docker Compose

Setup

# Install dependencies
pip install -e .

# Start mock API server
docker compose up

# Verify API connection
python test_api_connection.py

Note: For Intel/AMD machines, edit docker-compose.yaml to use the xxxxxapp/mock-api-server:0.1-amd64 image.

Usage

Configuration

Adjust settings in config.yaml:

  • scraper.worker_count: Concurrent workers (default: 500)
  • api.timeout: Request timeout (default: 30s)
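
The dotted keys above imply a nested layout; a sketch of loading it with the documented fallback defaults, assuming PyYAML (the loader name is illustrative):

import yaml

DEFAULTS = {"scraper": {"worker_count": 500}, "api": {"timeout": 30}}

def load_config(path="config.yaml"):
    """Merge config.yaml values over the built-in defaults so missing keys still work."""
    try:
        with open(path) as fh:
            user_cfg = yaml.safe_load(fh) or {}
    except FileNotFoundError:
        user_cfg = {}
    return {
        section: {**values, **user_cfg.get(section, {})}
        for section, values in DEFAULTS.items()
    }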

Running the Scraper

python -m src.swe.scraper

The script will:

  1. Fetch company list from API
  2. Process companies concurrently with real-time progress bars showing completion rate and ETA
  3. Normalize and validate all job data with clean logging output
  4. Save results to jobs_data.json
  5. Display comprehensive completion statistics including performance metrics

Data Structure

Each job posting is stored with the following structure:

{
  "jobId": "company000000-000",
  "title": "Software Engineer",
  "company": "company000000",
  "location": "San Francisco, CA",
  "applicants": 25,
  "postedDate": "2025-01-01",
  "extra_fields": {
    // Any additional API fields beyond the core 6 fields
    // Automatically captured for future analysis
  }
}

Core Fields: The 6 essential fields (jobId, title, company, location, applicants, postedDate) are normalized and validated.

Extra Fields: Any additional metadata from the API is automatically captured in the extra_fields JSON object, ensuring future-proofing for API evolution.

Running Tests

# Run all tests
python -m pytest tests.py -v

# Run specific test categories
python -m pytest tests.py::test_normalize_date_unix_timestamps -v

Tests provide comprehensive coverage of:

  • Data Normalization: Date format handling (Unix timestamps, ISO strings, invalid values), location format standardization (objects to strings)
  • API Error Handling: HTTP errors (404, 429, 500), network failures, connection timeouts, and pagination edge cases
  • Retry Logic: Exponential backoff behavior, retry exhaustion scenarios, and selective retry logic validation
  • Model Validation: Pydantic field constraints, type safety, and fallback value generation
  • End-to-End Workflows: Complete company processing cycles including success and failure scenarios

All tests use mocking for complete isolation and avoid external dependencies.
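
For illustration, a test in that style: the HTTP layer is mocked, so nothing touches the network. The fetch_company_page helper and the URL are stand-ins defined here only to keep the example self-contained, not the project's actual client code:

from unittest.mock import MagicMock, patch

import requests

def fetch_company_page(url):
    """Hypothetical stand-in for the scraper's API call, included so the test runs as-is."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

def test_fetch_company_page_returns_parsed_json():
    """The HTTP layer is mocked, so the test passes without the mock API server running."""
    fake_response = MagicMock(status_code=200)
    fake_response.json.return_value = {"jobs": [], "nextPageToken": None}
    fake_response.raise_for_status.return_value = None

    with patch("requests.get", return_value=fake_response) as mock_get:
        data = fetch_company_page("http://example.invalid/jobs")  # placeholder URL

    assert data == {"jobs": [], "nextPageToken": None}
    mock_get.assert_called_once()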

Final Run Summary

  • Total number of companies to process: 5,000
  • Total number of companies processed: 4,997
  • Total jobs stored: 82,406
  • Total applicants counted: 5,901,469

