A robust data pipeline that scrapes job postings from a mock API, transforming a proof-of-concept script into a production-ready application with a focus on performance, reliability, and data integrity.
Candidate: Baran Batı
Company: xxxxx
Decision: Refactored the monolithic script into 8 specialized modules while maintaining `scraper.py` as the entry point.
Reasoning: The original script had grown bloated and violated the Single Responsibility Principle. Separation of concerns was prioritized: each module has a single, well-defined responsibility, which improves testing, maintenance, and error isolation.
Decision: Implemented `ThreadPoolExecutor` with 500 concurrent workers (configurable).
Reasoning: Sequential processing of 5,000+ companies would take an estimated 12-14 hours; concurrent processing brings the run down to under 15 minutes, depending on the worker configuration.
Trade-off: Increased memory usage and complexity in exchange for a roughly 50x performance improvement.
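A rough sketch of the concurrency model (the `scrape_all` name and the `process_company` callable are illustrative stand-ins, not the exact code in the scraper):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(companies, process_company, worker_count=500):
    """Fan company processing out across a thread pool and collect results.

    process_company(company) is expected to return that company's job list.
    """
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        futures = {pool.submit(process_company, company): company for company in companies}
        for future in as_completed(futures):
            company = futures[future]
            try:
                results.extend(future.result())
            except Exception as exc:  # one failed company must not stop the run
                failures.append((company, exc))
    return results, failures
```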
Decision: Comprehensive retry mechanism with exponential backoff and jitter implemented as a reusable decorator.
Reasoning: Production APIs are unreliable. Transient failures (5xx errors, timeouts, rate limiting) shouldn't crash the entire process.
Implementation:
- Exponential backoff (1s → 2s → 4s) with jitter to prevent thundering herd effects
- Decorator pattern for easy application across multiple functions without code repetition
- Selective retry logic (retries 5xx and 429, fails fast on other 4xx errors) with graceful degradation
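A minimal sketch of such a decorator, assuming the `requests` library; names and defaults are illustrative rather than the actual implementation:

```python
import random
import time
from functools import wraps

import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def with_retries(max_attempts=4, base_delay=1.0):
    """Retry transient failures with exponential backoff (1s, 2s, 4s, ...) plus jitter."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except requests.HTTPError as exc:
                    status = exc.response.status_code if exc.response is not None else None
                    # Only 5xx and 429 are retried; other 4xx errors fail fast.
                    if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                        raise
                except (requests.Timeout, requests.ConnectionError):
                    if attempt == max_attempts - 1:
                        raise
                # Exponential backoff plus jitter to avoid a thundering herd.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
        return wrapper
    return decorator
```

Applying `@with_retries()` to each API-calling function keeps the retry policy in one place instead of scattering try/except blocks.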
Decision: Streamlined 6-field Pydantic model with comprehensive normalization and extended data capture.
Reasoning: API responses have inconsistent formats (dates, locations) and field mappings that need standardization for reliable storage. Additionally, capturing extra fields ensures future-proofing for API evolution.
Trade-off: Prioritized data quality and processing efficiency while maintaining flexibility for future API changes through the `extra_fields` JSON object.
Problem: During testing, certain companies caused infinite loops with repeated pagination tokens.
Solution: Token-based loop detection using a `seen_tokens` set.
Benefits: Addresses the root cause precisely, produces no false positives, and detects loops immediately.
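A minimal sketch of the guard; the `fetch_page` callable returning `(jobs, next_token)` is a hypothetical stand-in for the real API client:

```python
def fetch_all_jobs(company_id, fetch_page):
    """Paginate through a company's jobs, stopping if a token ever repeats.

    fetch_page(company_id, token) is expected to return (jobs, next_token).
    """
    jobs, token, seen_tokens = [], None, set()
    while True:
        page_jobs, token = fetch_page(company_id, token)
        jobs.extend(page_jobs)
        if token is None:
            break  # normal end of pagination
        if token in seen_tokens:
            break  # repeated token: the API is looping, stop immediately
        seen_tokens.add(token)
    return jobs
```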
Decision: Automatically capture all API fields beyond the core 6 fields in an `extra_fields` JSON object.
Reasoning: APIs evolve over time, adding new fields like salary ranges, remote work options, skill requirements, etc. Capturing these automatically ensures the scraper remains future-proof without requiring code changes when the API schema changes.
Implementation:
- Core fields (`jobId`, `title`, `company`, `location`, `applicants`, `postedDate`) are normalized and validated
- Any additional fields from the API response are automatically stored in `extra_fields`
- Backward compatible - existing functionality unchanged
- No configuration needed - always captures extra fields
Trade-off: Slightly increased storage size vs complete future-proofing and data preservation.
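A minimal sketch of what such a model and mapping step could look like, assuming Pydantic v2; the class, validator, and helper names are illustrative rather than the project's actual code:

```python
from datetime import datetime, timezone
from pydantic import BaseModel, Field, field_validator

CORE_FIELDS = {"jobId", "title", "company", "location", "applicants", "postedDate"}

class JobPosting(BaseModel):
    jobId: str
    title: str
    company: str
    location: str
    applicants: int = 0
    postedDate: str
    extra_fields: dict = Field(default_factory=dict)

    @field_validator("postedDate", mode="before")
    @classmethod
    def normalize_date(cls, value):
        # Accept Unix timestamps or ISO strings; always emit "YYYY-MM-DD".
        if isinstance(value, (int, float)):
            return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")
        return str(value)[:10]

    @field_validator("location", mode="before")
    @classmethod
    def normalize_location(cls, value):
        # Collapse {"city": ..., "state": ...} objects into "City, State".
        if isinstance(value, dict):
            return ", ".join(str(v) for v in (value.get("city"), value.get("state")) if v)
        return str(value)

def to_job_posting(raw: dict) -> JobPosting:
    """Route unknown API fields into extra_fields instead of dropping them."""
    core = {k: v for k, v in raw.items() if k in CORE_FIELDS}
    extras = {k: v for k, v in raw.items() if k not in CORE_FIELDS}
    return JobPosting(**core, extra_fields=extras)
```

With this split, new API fields show up in `extra_fields` without any code change.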
To transform the proof-of-concept into a production-ready application, I implemented improvements systematically across three phases (Core Reliability, Production-Readiness, and Polish). Key achievements include:
- Fixed Critical Pagination Bug: The original script only fetched the first page per company; the scraper now correctly retrieves all job pages
- Externalized Configuration: Moved all hardcoded values to `config.yaml` with fallback defaults for environment-specific deployments
- Professional User Interface: Added real-time progress bars with clear status indicators showing completion percentage, processing rate, etc.
- Structured Logging: Replaced all `print()` statements with proper logging infrastructure and reduced threading noise
- Comprehensive Test Coverage: 27 unit tests covering data normalization (date/location formats), API error handling (HTTP errors, network failures, timeouts), retry logic, and model validation, all using mocking with no external dependencies
- Thread-Safe Operations: All database operations protected with locks to prevent race conditions during concurrent processing (see the sketch after this list)
- Graceful Error Handling: Failed companies don't crash the entire process - execution continues with detailed error reporting and isolation
- Data Validation: Pydantic models ensure data integrity and type safety throughout the pipeline with comprehensive field mapping
- Extended Data Capture: Automatically captures additional API metadata beyond the core fields in the `extra_fields` JSON object for future analysis
- Configurable Performance: Worker count adjustable via configuration for different system capabilities (tested from 1 to 800+ workers)
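As referenced above, a minimal sketch of the locking approach around the save step, assuming the results are Pydantic models written to `jobs_data.json`; the lock and function names are illustrative:

```python
import json
import threading

_write_lock = threading.Lock()

def save_jobs(jobs, path="jobs_data.json"):
    """Serialize results to disk; the lock keeps concurrent workers from interleaving writes."""
    with _write_lock:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump([job.model_dump() for job in jobs], fh, ensure_ascii=False, indent=2)
```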
Some more complex features were deliberately left out of scope; they could be considered in the future to make the project even more production-ready:
- Full Async/Await Refactor: ThreadPoolExecutor provides sufficient performance without the complexity of asyncio
- Resume Capability: Checkpointing system for interrupted scrapes deemed too complex for this scope
- Advanced Monitoring: Circuit breakers, detailed metrics, and alerting systems beyond basic logging
- Graceful Shutdown: Signal handling for clean Ctrl+C interruption (would require additional complexity)
- ThreadPoolExecutor vs asyncio: Chose threading for simplicity and reliability with blocking I/O operations
- JSON vs Database: JSON file sufficient for project scale; simpler than database setup
- Comprehensive vs Focused Data: Prioritized the essential 6 fields while capturing additional metadata in `extra_fields` for future flexibility
- Testing Complexity vs Reliability: 27 comprehensive tests add development time but ensure production quality
- Modular Separation vs Simplicity: Chose 8 specialized modules over monolithic script - adds initial complexity and more files to manage but provides better maintainability, testability, and long-term code organization
- Python 3.9+
- Docker and Docker Compose
```bash
# Install dependencies
pip install -e .

# Start mock API server
docker compose up

# Verify API connection
python test_api_connection.py
```
Note: For Intel/AMD machines, edit `docker-compose.yaml` to use the `xxxxxapp/mock-api-server:0.1-amd64` image.
Adjust settings in `config.yaml`:
- `scraper.worker_count`: Concurrent workers (default: 500)
- `api.timeout`: Request timeout (default: 30s)
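A minimal sketch of how these settings could be loaded with fallback defaults (assuming PyYAML; the merge logic and names are illustrative, not the project's actual loader):

```python
import yaml  # PyYAML

DEFAULTS = {"scraper": {"worker_count": 500}, "api": {"timeout": 30}}

def load_config(path="config.yaml"):
    """Merge config.yaml over built-in defaults, so a missing file or key still works."""
    try:
        with open(path, encoding="utf-8") as fh:
            loaded = yaml.safe_load(fh) or {}
    except FileNotFoundError:
        loaded = {}
    return {
        section: {**values, **loaded.get(section, {})}
        for section, values in DEFAULTS.items()
    }
```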
```bash
python -m src.swe.scraper
```
The script will:
- Fetch company list from API
- Process companies concurrently with real-time progress bars showing completion rate and ETA
- Normalize and validate all job data with clean logging output
- Save results to `jobs_data.json`
- Display comprehensive completion statistics including performance metrics
Each job posting is stored with the following structure:
```json
{
  "jobId": "company000000-000",
  "title": "Software Engineer",
  "company": "company000000",
  "location": "San Francisco, CA",
  "applicants": 25,
  "postedDate": "2025-01-01",
  "extra_fields": {
    // Any additional API fields beyond the core 6 fields
    // Automatically captured for future analysis
  }
}
```
Core Fields: The 6 essential fields (`jobId`, `title`, `company`, `location`, `applicants`, `postedDate`) are normalized and validated.
Extra Fields: Any additional metadata from the API is automatically captured in the `extra_fields` JSON object, ensuring future-proofing as the API evolves.
```bash
# Run all tests
python -m pytest tests.py -v

# Run a specific test
python -m pytest tests.py::test_normalize_date_unix_timestamps -v
```
Tests provide comprehensive coverage of:
- Data Normalization: Date format handling (Unix timestamps, ISO strings, invalid values), location format standardization (objects to strings)
- API Error Handling: HTTP errors (404, 429, 500), network failures, connection timeouts, and pagination edge cases
- Retry Logic: Exponential backoff behavior, retry exhaustion scenarios, and selective retry logic validation
- Model Validation: Pydantic field constraints, type safety, and fallback value generation
- End-to-End Workflows: Complete company processing cycles including success and failure scenarios
All tests use mocking for complete isolation and avoid external dependencies.
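For flavor, here is an illustrative test in the same mocking style; the local `fetch_company_jobs` helper and its URL are hypothetical stand-ins, not code from `tests.py`:

```python
from unittest.mock import MagicMock, patch

import pytest
import requests

# Hypothetical helper used only to demonstrate the mocking style; the real
# tests exercise the scraper's own modules instead.
def fetch_company_jobs(company_id):
    response = requests.get(f"http://localhost:8080/companies/{company_id}/jobs")
    response.raise_for_status()
    return response.json()["jobs"]

def test_fetch_company_jobs_raises_on_server_error():
    fake_response = MagicMock(status_code=500)
    fake_response.raise_for_status.side_effect = requests.HTTPError("500 Server Error")
    # Patch requests.get so the test never touches the network.
    with patch("requests.get", return_value=fake_response):
        with pytest.raises(requests.HTTPError):
            fetch_company_jobs("company000000")
```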
- Total number of companies to process: 5000
- Total number of companies processed: 4997
- Total jobs stored: 82406
- Total applicants counted: 5901469