
Production-Ready Job Scraper

A robust data pipeline that scrapes job postings from a mock API, transforming a proof-of-concept script into a production-ready application with a focus on performance, reliability, and data integrity.

Candidate: Baran Batı
Company: xxxxx

Key Architectural Decisions

1. Modular Architecture with Separation of Concerns

Decision: Refactored monolithic script into 8 specialized modules while maintaining scraper.py as the entry point.

Reasoning: The original script had grown bloated and violated the Single Responsibility Principle. Separation of concerns was the priority: each module has a single, well-defined responsibility, which makes testing, maintenance, and error isolation easier.
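
For orientation, a plausible layout consistent with the responsibilities described in this README. Apart from scraper.py, config.yaml, tests.py, and jobs_data.json (all named elsewhere in this document), the module names below are illustrative, not the actual file names:

src/swe/
    scraper.py       # entry point: orchestrates fetch -> process -> save (actual name)
    api_client.py    # HTTP calls and pagination (illustrative name)
    retry.py         # retry decorator with backoff and jitter (illustrative name)
    models.py        # Pydantic job model (illustrative name)
    ...              # remaining modules: normalization, storage, config, logging
config.yaml          # externalized settings
tests.py             # unit test suite
jobs_data.json       # scraper output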

2. Concurrent Processing

Decision: Implemented ThreadPoolExecutor with 500 concurrent workers (configurable).

Reasoning: Sequential processing of 5000+ companies would take 12-14 hours. Concurrent processing reduces execution time to under 15 minutes, depending on the worker configuration.

Trade-off: Increased memory usage and complexity vs 50x performance improvement.
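
A minimal sketch of the pattern, assuming a process_company worker function (the name is illustrative, not the actual implementation):

from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(companies, worker_count=500):
    """Process companies concurrently, one future per company."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # process_company is the per-company worker defined elsewhere (hypothetical name)
        futures = {pool.submit(process_company, company): company for company in companies}
        for future in as_completed(futures):
            company = futures[future]
            try:
                results.extend(future.result())
            except Exception as exc:  # a failed company must not stop the whole run
                failures.append((company, exc))
    return results, failures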

3. Robust Error Handling with Retry Decorator

Decision: Comprehensive retry mechanism with exponential backoff and jitter implemented as a reusable decorator.

Reasoning: Production APIs are unreliable. Transient failures (5xx errors, timeouts, rate limiting) shouldn't crash the entire process.

Implementation:

  • Exponential backoff (1s → 2s → 4s) with jitter to prevent thundering herd effects
  • Decorator pattern for easy application across multiple functions without code repetition
  • Selective retry logic (retries 5xx/429, not 4xx), graceful degradation
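
A minimal sketch of such a decorator, assuming requests-style exceptions and the backoff schedule above; the names and defaults here are illustrative:

import functools
import random
import time

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def with_retries(max_attempts=3, base_delay=1.0):
    """Retry transient failures with exponential backoff plus jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except requests.HTTPError as exc:
                    status = exc.response.status_code if exc.response is not None else None
                    if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                        raise  # non-retryable 4xx, or out of attempts
                except (requests.Timeout, requests.ConnectionError):
                    if attempt == max_attempts - 1:
                        raise
                # exponential backoff: 1s -> 2s -> 4s, plus jitter against thundering herds
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
        return wrapper
    return decorator

Applying it is then a one-line change: decorate each API-calling function with @with_retries().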

4. Data Validation & Normalization

Decision: Streamlined 6-field Pydantic model with comprehensive normalization and extended data capture.

Reasoning: API responses have inconsistent formats (dates, locations) and field mappings that need standardization for reliable storage. Additionally, capturing extra fields ensures future-proofing for API evolution.

Trade-off: Prioritized data quality and processing efficiency while maintaining flexibility for future API changes through the extra_fields JSON object.
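
A sketch of what such a model might look like, assuming Pydantic v2 and the six documented core fields plus extra_fields; the validator shown is illustrative:

from typing import Any, Dict, Optional

from pydantic import BaseModel, Field, field_validator

class JobPosting(BaseModel):
    jobId: str
    title: str
    company: str
    location: Optional[str] = None
    applicants: int = 0
    postedDate: Optional[str] = None  # normalized to YYYY-MM-DD
    extra_fields: Dict[str, Any] = Field(default_factory=dict)

    @field_validator("applicants", mode="before")
    @classmethod
    def non_negative(cls, value):
        # coerce missing or malformed counts to 0 instead of rejecting the record
        try:
            return max(int(value), 0)
        except (TypeError, ValueError):
            return 0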

5. Infinite Loop Protection

Problem: During testing, certain companies caused infinite loops with repeated pagination tokens.

Solution: Token-based loop detection using a seen_tokens set.

Benefits: Addresses the root cause directly, detects the repeat immediately, and produces no false positives.
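
A sketch of the guard inside a pagination loop; fetch_page and the nextPageToken field name are assumptions, not the actual API contract:

def fetch_all_jobs(company_id):
    """Walk pagination until there is no token, or until a token repeats."""
    jobs, token, seen_tokens = [], None, set()
    while True:
        page = fetch_page(company_id, token)  # hypothetical API call
        jobs.extend(page["jobs"])
        token = page.get("nextPageToken")
        if not token:
            break  # normal end of pagination
        if token in seen_tokens:
            break  # repeated token: bail out instead of looping forever
        seen_tokens.add(token)
    return jobs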

6. Extended Data Capture

Decision: Automatically capture all API fields beyond the core 6 fields in an extra_fields JSON object.

Reasoning: APIs evolve over time, adding new fields like salary ranges, remote work options, skill requirements, etc. Capturing these automatically ensures the scraper remains future-proof without requiring code changes when the API schema changes.

Implementation:

  • Core fields (jobId, title, company, location, applicants, postedDate) are normalized and validated
  • Any additional fields from the API response are automatically stored in extra_fields
  • Backward compatible - existing functionality unchanged
  • No configuration needed - always captures extra fields

Trade-off: Slightly increased storage size vs complete future-proofing and data preservation.
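
A sketch of how the core/extra split can be made, using the six core field names documented in the Data Structure section below:

CORE_FIELDS = {"jobId", "title", "company", "location", "applicants", "postedDate"}

def split_fields(raw_job: dict) -> dict:
    """Keep the six core fields at the top level; park everything else in extra_fields."""
    record = {key: raw_job.get(key) for key in CORE_FIELDS}
    record["extra_fields"] = {k: v for k, v in raw_job.items() if k not in CORE_FIELDS}
    return record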

Production-Ready Features Implemented

To transform the proof-of-concept into a production-ready application, I implemented improvements systematically across three phases (Core Reliability, Production-Readiness, and Polish). Key achievements include:

  • Fixed Critical Pagination Bug: The original script fetched only the first page per company - the scraper now correctly retrieves all job pages
  • Externalized Configuration: Moved all hardcoded values to config.yaml with fallback defaults for environment-specific deployments
  • Professional User Interface: Added real-time progress bars with clear status indicators showing completion percentage, processing rate, etc.
  • Structured Logging: Replaced all print() statements with proper logging infrastructure and reduced threading noise
  • Comprehensive Test Coverage: 27 unit tests covering data normalization (date/location formats), API error handling (HTTP errors, network failures, timeouts), retry logic, and model validation - all using mocking with no external dependencies
  • Thread-Safe Operations: All database operations protected with locks to prevent race conditions during concurrent processing (see the sketch after this list)
  • Graceful Error Handling: Failed companies don't crash the entire process - execution continues with detailed error reporting and isolation
  • Data Validation: Pydantic models ensure data integrity and type safety throughout the pipeline with comprehensive field mapping
  • Extended Data Capture: Automatically captures additional API metadata beyond core fields in extra_fields JSON object for future analysis
  • Configurable Performance: Worker count adjustable via configuration for different system capabilities (tested from 1 to 800+ workers)
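
As referenced in the thread-safety bullet above, a minimal sketch of lock-protected shared state; the class and method names are illustrative:

import threading

class JobStore:
    """In-memory store shared across worker threads; one lock guards every mutation."""

    def __init__(self):
        self._jobs = []
        self._lock = threading.Lock()

    def add_jobs(self, jobs):
        with self._lock:  # serialize concurrent appends from hundreds of workers
            self._jobs.extend(jobs)

    def snapshot(self):
        with self._lock:
            return list(self._jobs)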

Possible Future Implementations

Several more complex features were considered but deferred; they could further harden the project in the future.

  • Full Async/Await Refactor: Deferred because ThreadPoolExecutor already provides sufficient performance without the complexity of asyncio
  • Resume Capability: A checkpointing system for resuming interrupted scrapes, deemed too complex for this scope
  • Advanced Monitoring: Circuit breakers, detailed metrics, and alerting systems beyond basic logging
  • Graceful Shutdown: Signal handling for clean Ctrl+C interruption (deferred to avoid additional complexity)

Trade-offs Considered

  • ThreadPoolExecutor vs asyncio: Chose threading for simplicity and reliability with blocking I/O operations
  • JSON vs Database: JSON file sufficient for project scale; simpler than database setup
  • Comprehensive vs Focused Data: Prioritized essential 6 fields while capturing additional metadata in extra_fields for future flexibility
  • Testing Complexity vs Reliability: 27 comprehensive tests add development time but ensure production quality
  • Modular Separation vs Simplicity: Chose 8 specialized modules over monolithic script - adds initial complexity and more files to manage but provides better maintainability, testability, and long-term code organization

Installation

Prerequisites

  • Python 3.9+
  • Docker and Docker Compose

Setup

# Install dependencies
pip install -e .

# Start mock API server
docker compose up

# Verify API connection
python test_api_connection.py

Note: For Intel/AMD machines, edit docker-compose.yaml to use the xxxxxapp/mock-api-server:0.1-amd64 image.

Usage

Configuration

Adjust settings in config.yaml:

  • scraper.worker_count: Concurrent workers (default: 500)
  • api.timeout: Request timeout (default: 30s)
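
The dotted keys above imply a nested layout; a sketch of loading it with the documented fallback defaults, assuming PyYAML (the loader name is illustrative):

import yaml

DEFAULTS = {"scraper": {"worker_count": 500}, "api": {"timeout": 30}}

def load_config(path="config.yaml"):
    """Merge config.yaml values over the built-in defaults so missing keys still work."""
    try:
        with open(path) as fh:
            user_cfg = yaml.safe_load(fh) or {}
    except FileNotFoundError:
        user_cfg = {}
    return {
        section: {**values, **user_cfg.get(section, {})}
        for section, values in DEFAULTS.items()
    }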

Running the Scraper

python -m src.swe.scraper

The script will:

  1. Fetch company list from API
  2. Process companies concurrently with real-time progress bars showing completion rate and ETA
  3. Normalize and validate all job data with clean logging output
  4. Save results to jobs_data.json
  5. Display comprehensive completion statistics including performance metrics

Data Structure

Each job posting is stored with the following structure:

{
  "jobId": "company000000-000",
  "title": "Software Engineer",
  "company": "company000000",
  "location": "San Francisco, CA",
  "applicants": 25,
  "postedDate": "2025-01-01",
  "extra_fields": {
    // Any additional API fields beyond the core 6 fields
    // Automatically captured for future analysis
  }
}

Core Fields: The 6 essential fields (jobId, title, company, location, applicants, postedDate) are normalized and validated.

Extra Fields: Any additional metadata from the API is automatically captured in the extra_fields JSON object, ensuring future-proofing for API evolution.

Running Tests

# Run all tests
python -m pytest tests.py -v

# Run specific test categories
python -m pytest tests.py::test_normalize_date_unix_timestamps -v

Tests provide comprehensive coverage of:

  • Data Normalization: Date format handling (Unix timestamps, ISO strings, invalid values), location format standardization (objects to strings)
  • API Error Handling: HTTP errors (404, 429, 500), network failures, connection timeouts, and pagination edge cases
  • Retry Logic: Exponential backoff behavior, retry exhaustion scenarios, and selective retry logic validation
  • Model Validation: Pydantic field constraints, type safety, and fallback value generation
  • End-to-End Workflows: Complete company processing cycles including success and failure scenarios

All tests use mocking for complete isolation and avoid external dependencies.
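
For illustration, a test in that style: the HTTP layer is mocked, so nothing touches the network. The fetch_company_page helper and the URL are stand-ins defined here only to keep the example self-contained, not the project's actual client code:

from unittest.mock import MagicMock, patch

import requests

def fetch_company_page(url):
    """Hypothetical stand-in for the scraper's API call, included so the test runs as-is."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

def test_fetch_company_page_returns_parsed_json():
    """The HTTP layer is mocked, so the test passes without the mock API server running."""
    fake_response = MagicMock(status_code=200)
    fake_response.json.return_value = {"jobs": [], "nextPageToken": None}
    fake_response.raise_for_status.return_value = None

    with patch("requests.get", return_value=fake_response) as mock_get:
        data = fetch_company_page("http://example.invalid/jobs")  # placeholder URL

    assert data == {"jobs": [], "nextPageToken": None}
    mock_get.assert_called_once()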

Final Run Summary

  • Total number of companies to process: 5,000
  • Total number of companies processed: 4,997
  • Total jobs stored: 82,406
  • Total applicants counted: 5,901,469

