$stuff
This project follows the Reproducible Analytical Pipeline (RAP) methodology, providing a modular ETL (Extract, Transform, Load) framework for data processing workflows.
IMPORTANT: This README was generated from a template. Please update it with specific information about your project, including:
- Detailed project description and purpose
- Specific data sources and outputs
- Usage examples and API documentation
- Contributing guidelines specific to your project
- Any project-specific compliance requirements
This project implements RAP principles:
- Reproducibility: All processes are automated and version-controlled
- Auditability: Clear data lineage and transformation logic
- Quality Assurance: Automated testing and validation
- Efficiency: Reusable components and standardised workflows
To get a local copy up and running, follow these simple steps.
Ensure you have the following installed:
- Python: Version specified in .python-version.
- Poetry: This is used to manage package dependencies and virtual environments.
- Operating System: macOS
- Clone the repository:

```shell
git clone https://github.com/$ONSdigital/$joejoe.git
```

- Install dependencies.
Poetry is used to manage dependencies in this project. For more information, read the Poetry documentation.
To install all dependencies, including development dependencies, run:
```shell
make install-dev
```

To install only production dependencies, run:

```shell
make install
```

The template includes several ways to run the ETL pipeline:
- Using the convenience function:
```shell
make run
```

- Using individual components:
```shell
# Extract data
poetry run python -c "from $joejoe.extract import extract_from_source; print(extract_from_source('example_data.csv'))"

# Run full pipeline with custom parameters
poetry run python run_etl.py
```

- Programmatic usage:
```python
from $joejoe import ETLPipeline

pipeline = ETLPipeline()
success = pipeline.run_pipeline(
    source_path="data/input.csv",
    output_path="data/output.csv",
    apply_transforms=True,
)
```

Get started with development by running the following commands.
Before proceeding, make sure you have the development dependencies installed using the make install-dev command.
A Makefile is provided to simplify common development tasks. To view all available commands, run:
```shell
make
```

The unit tests are written using the pytest framework. To run the tests and check coverage, run:
```shell
make test
```

To run only the unit tests, run:

```shell
make test-unit
```

To run only the end-to-end tests, run:

```shell
make test-e2e
```

Ruff is used for both linting and formatting of the Python code in this project. It is a fast tool that replaces multiple linters and formatters with a single, efficient solution.
The tool is configured using the pyproject.toml file.
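The exact rule set is project-specific, but a minimal Ruff section in pyproject.toml might look like the following sketch (the values shown are illustrative, not this template's actual configuration):

```toml
[tool.ruff]
line-length = 88
target-version = "py311"

[tool.ruff.lint]
# E = pycodestyle errors, F = pyflakes, I = isort-style import sorting
select = ["E", "F", "I"]
```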
To lint the Python code, run:
```shell
make lint
```

To auto-format the Python code and correct fixable linting issues, run:

```shell
make format
```

Bandit is used for security scanning of the Python code. It helps identify common security issues in Python applications.
To run the security scan, run:
```shell
make security-scan
```

MyPy is used for static type checking to catch type-related errors before runtime.
To run type checking, run:
```shell
poetry run mypy $joejoe
```

Linting/formatting and Security Scanning GitHub Actions are enabled by default on template repositories. If you go to the Actions tab of your repository, you can view all its workflows. If an action has failed, it will show a red circle with a cross in it.
To find out more details about why it failed:
- Click on the name of the action
- Click the job in the Jobs section in the left sidebar
- Find the dropdown with the red circle with a cross in it to view more information about the failed action
Please note that the GitHub Actions will not automatically fix the errors; you must resolve them locally.
The project includes pre-commit hooks to automatically run linting, formatting, and security checks before each commit.
- Install pre-commit using Poetry:
```shell
poetry add --group dev pre-commit
```

- Activate the git hooks:

```shell
pre-commit install
```

From now on, Ruff and Bandit will run automatically on the files you stage before every commit.
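A .pre-commit-config.yaml along these lines wires up the two tools; the `rev` values are placeholders and should be pinned to the latest tagged releases of each hook repository:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4  # placeholder: pin to the latest release
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.8  # placeholder: pin to the latest release
    hooks:
      - id: bandit
```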
This template provides a modular ETL framework with the following components:
The extract.py module contains the DataExtractor class and helper functions for reading data from various sources:
- CSV files: Primary data source with configurable parameters
- File validation: Ensures data sources exist before processing
- Metadata extraction: Provides file information and statistics
- Error handling: Comprehensive error management and logging
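The public API of extract.py is project-specific; as an illustration only, a minimal stdlib extractor covering validation, metadata, and logging (all names here are hypothetical, not the template's actual functions) could look like:

```python
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def extract_from_csv(source_path: str) -> tuple[list[dict], dict]:
    """Read a CSV file, returning its rows and basic file metadata."""
    path = Path(source_path)
    # File validation: fail fast if the data source does not exist
    if not path.is_file():
        raise FileNotFoundError(f"Data source not found: {path}")
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    # Metadata extraction: file information and simple statistics
    metadata = {
        "source": str(path),
        "size_bytes": path.stat().st_size,
        "row_count": len(rows),
        "columns": list(rows[0].keys()) if rows else [],
    }
    logger.info("Extracted %d rows from %s", len(rows), path)
    return rows, metadata
```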
The transform.py module contains the DataTransformer class for data cleaning and enhancement:
- Data cleaning: Removes duplicates and handles missing values
- Calculated columns: Adds derived fields based on business logic
- Data filtering: Applies configurable filters and business rules
- Transformation logging: Tracks all applied transformations
- Column normalisation: Standardises column names and formats
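To make the transform responsibilities concrete, here is a hedged sketch of cleaning, deduplication, normalisation, and transformation logging over plain row dicts (hypothetical function, not the template's DataTransformer API):

```python
def transform_rows(rows: list[dict]) -> tuple[list[dict], list[str]]:
    """Clean a list of row dicts, returning the rows and a transformation log."""
    log: list[str] = []

    # Column normalisation: lower-case, underscore-separated names
    rows = [
        {k.strip().lower().replace(" ", "_"): v for k, v in row.items()}
        for row in rows
    ]
    log.append("normalised column names")

    # Data cleaning: drop exact duplicates, preserving first occurrence
    seen, deduped = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(row)
    log.append(f"removed {len(rows) - len(deduped)} duplicate rows")

    # Missing values: replace empty strings/None with a sentinel
    cleaned = [
        {k: (v if v not in (None, "") else "N/A") for k, v in row.items()}
        for row in deduped
    ]
    log.append("filled missing values")
    return cleaned, log
```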
The load.py module contains the DataLoader class for outputting processed data:
- Multiple formats: CSV, Parquet, and JSON output support
- Data summaries: Automatic generation of processing metadata
- Directory management: Creates output directories as needed
- Load validation: Ensures successful data persistence
- Performance tracking: Monitors load operations and file sizes
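A minimal loader along these lines illustrates format dispatch, directory creation, and a load summary; the function name and summary keys are assumptions, not the template's DataLoader interface:

```python
import csv
import json
from pathlib import Path


def load_rows(rows: list[dict], output_path: str) -> dict:
    """Write rows to CSV or JSON (chosen by extension) and return a load summary."""
    path = Path(output_path)
    # Directory management: create output directories as needed
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.suffix == ".json":
        path.write_text(json.dumps(rows, indent=2))
    elif path.suffix == ".csv":
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()) if rows else [])
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"Unsupported output format: {path.suffix}")
    # Load validation and performance tracking: confirm persistence, record size
    if not path.is_file():
        raise OSError(f"Failed to write {path}")
    return {
        "output": str(path),
        "rows_written": len(rows),
        "size_bytes": path.stat().st_size,
    }
```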
The `__init__.py` module provides the ETLPipeline class that orchestrates the entire process:
- End-to-end execution: Manages extract, transform, and load phases
- Configuration management: Handles pipeline parameters and options
- Error recovery: Provides graceful error handling and rollback
- Progress tracking: Monitors pipeline execution and performance
- Flexible execution: Supports various execution patterns and customisation
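The orchestration pattern above can be sketched end to end with the stdlib; this self-contained toy pipeline (its transform is just column normalisation, and it is not the template's ETLPipeline implementation) shows the phase ordering and graceful error handling:

```python
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def run_pipeline(source_path: str, output_path: str, apply_transforms: bool = True) -> bool:
    """Run a minimal extract -> transform -> load sequence; True on success."""
    try:
        # Extract: read the source CSV into row dicts
        with Path(source_path).open(newline="") as f:
            rows = list(csv.DictReader(f))
        # Transform: optionally normalise column names
        if apply_transforms:
            rows = [
                {k.strip().lower().replace(" ", "_"): v for k, v in r.items()}
                for r in rows
            ]
        # Load: write the result, creating output directories as needed
        out = Path(output_path)
        out.parent.mkdir(parents=True, exist_ok=True)
        with out.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()) if rows else [])
            writer.writeheader()
            writer.writerows(rows)
        logger.info("Pipeline wrote %d rows to %s", len(rows), out)
        return True
    except OSError as exc:
        # Error recovery: log the failure and report it to the caller
        logger.error("Pipeline failed: %s", exc)
        return False
```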
See CONTRIBUTING.md for details.
See LICENSE for details.