$joejoe


$stuff

This project follows the Reproducible Analytical Pipeline (RAP) methodology, providing a modular ETL (Extract, Transform, Load) framework for data processing workflows.

IMPORTANT: This README was generated from a template. Please update it with specific information about your project, including:

  • Detailed project description and purpose
  • Specific data sources and outputs
  • Usage examples and API documentation
  • Contributing guidelines specific to your project
  • Any project-specific compliance requirements

RAP Methodology

This project implements RAP principles:

  • Reproducibility: All processes are automated and version-controlled
  • Auditability: Clear data lineage and transformation logic
  • Quality Assurance: Automated testing and validation
  • Efficiency: Reusable components and standardised workflows


Getting Started

To get a local copy up and running, follow these simple steps.

Pre-requisites

Ensure you have the following installed:

  1. Python: Version specified in .python-version.
  2. Poetry: This is used to manage package dependencies and virtual environments.
  3. Operating System: macOS

Installation

  1. Clone the repository:
git clone https://github.com/$ONSdigital/$joejoe.git
  2. Install the dependencies.

Poetry is used to manage dependencies in this project. For more information, read the Poetry documentation.

To install all dependencies, including development dependencies, run:

make install-dev

To install only production dependencies, run:

make install

Running the RAP Pipeline

The template includes several ways to run the ETL pipeline:

  1. Using the convenience function:
make run
  2. Using individual components:
# Extract data
poetry run python -c "from $joejoe.extract import extract_from_source; print(extract_from_source('example_data.csv'))"

# Run full pipeline with custom parameters
poetry run python run_etl.py
  3. Programmatic usage:
from $joejoe import ETLPipeline

pipeline = ETLPipeline()
success = pipeline.run_pipeline(
    source_path="data/input.csv",
    output_path="data/output.csv",
    apply_transforms=True
)

Development

Get started with development by running the following commands. Before proceeding, make sure you have the development dependencies installed using the make install-dev command.

A Makefile is provided to simplify common development tasks. To view all available commands, run:

make

Run Tests with Coverage

The unit tests are written using the pytest framework. To run the tests and check coverage, run:

make test

To run only the unit tests, run:

make test-unit

To run only the end-to-end tests, run:

make test-e2e

Linting and Formatting

Ruff is used for both linting and formatting of the Python code in this project. Ruff is a fast Python linter and formatter that replaces multiple tools with a single, efficient solution.

The tool is configured using the pyproject.toml file.

To lint the Python code, run:

make lint

To auto-format the Python code and correct fixable linting issues, run:

make format

Security Scanning

Bandit is used for security scanning of the Python code. It helps identify common security issues in Python applications.

To run the security scan, run:

make security-scan
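
As a quick illustration (not code from this template), the kind of issue Bandit reports includes use of weak hash algorithms — its B324 check would flag the first call below, while the SHA-256 alternative passes:

```python
import hashlib

# Bandit's B324 check flags insecure hash functions such as MD5;
# `make security-scan` would report the first line.
weak = hashlib.md5(b"payload").hexdigest()       # flagged: weak hash
strong = hashlib.sha256(b"payload").hexdigest()  # passes: SHA-256
```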

Type Checking

MyPy is used for static type checking to catch type-related errors before runtime.

To run type checking, run:

poetry run mypy $joejoe
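
As a hypothetical illustration of what static type checking buys you (the function below is not part of the template), the commented-out call is the kind of mismatch MyPy catches before runtime:

```python
def total_cost(unit_price: float, quantity: int) -> float:
    """Annotated function; MyPy verifies that call sites match these types."""
    return unit_price * quantity

cost = total_cost(2.5, 4)
# total_cost("2.5", 4)  # MyPy would flag: argument 1 has incompatible type "str"
```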

GitHub Actions

Linting/formatting and security scanning GitHub Actions are enabled by default on template repositories. The Actions tab of your repository lists all of its workflow runs; a failed run is marked with a red circle containing a cross.

To find out more details about why it failed:

  1. Click on the name of the action
  2. Click the job in the Jobs section in the left sidebar
  3. Expand the step marked with a red circle and cross to view more information about the failure

Please note that GitHub Actions will not automatically fix the errors; you must resolve them locally.

Pre-commit Hooks

The project includes pre-commit hooks to automatically run linting, formatting, and security checks before each commit.

  1. Install pre-commit using Poetry:
poetry add --group dev pre-commit
  2. Activate the git hooks:
pre-commit install

From now on, Ruff and Bandit will run automatically on the files you stage before every commit.

RAP Components

This template provides a modular ETL framework with the following components:

Extract Module

The extract.py module contains the DataExtractor class and helper functions for reading data from various sources:

  • CSV files: Primary data source with configurable parameters
  • File validation: Ensures data sources exist before processing
  • Metadata extraction: Provides file information and statistics
  • Error handling: Comprehensive error management and logging
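
The real implementation lives in the DataExtractor class in extract.py; the sketch below is a hypothetical, stdlib-only stand-in for the behaviours listed above (CSV reading, file validation, metadata). The names `extract_csv` and `source_metadata` are illustrative, not the template's API:

```python
import csv
import os
import tempfile
from pathlib import Path

def extract_csv(path: str) -> list[dict[str, str]]:
    """Validate that the source exists, then read it into a list of row dicts."""
    source = Path(path)
    if not source.exists():                # file validation before processing
        raise FileNotFoundError(f"Data source not found: {source}")
    with source.open(newline="") as f:
        return list(csv.DictReader(f))     # CSV rows keyed by header

def source_metadata(path: str) -> dict:
    """Basic file information for audit logging."""
    source = Path(path)
    return {"name": source.name, "size_bytes": source.stat().st_size}

# Demo against a throwaway CSV file
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,value\n1,a\n2,b\n")
rows = extract_csv(tmp.name)
meta = source_metadata(tmp.name)
os.unlink(tmp.name)
```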

Transform Module

The transform.py module contains the DataTransformer class for data cleaning and enhancement:

  • Data cleaning: Removes duplicates and handles missing values
  • Calculated columns: Adds derived fields based on business logic
  • Data filtering: Applies configurable filters and business rules
  • Transformation logging: Tracks all applied transformations
  • Column normalisation: Standardises column names and formats
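
The DataTransformer class in transform.py provides the real implementation; as a rough, self-contained sketch of the cleaning, calculated-column, and column-normalisation steps above (all function names here are illustrative):

```python
def normalise_columns(rows: list[dict]) -> list[dict]:
    """Standardise column names to lowercase snake_case."""
    return [
        {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
        for row in rows
    ]

def clean(rows: list[dict]) -> list[dict]:
    """Drop exact duplicates and rows with missing values."""
    seen: set = set()
    out = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen or any(value in (None, "") for value in row.values()):
            continue
        seen.add(key)
        out.append(row)
    return out

def add_calculated(rows: list[dict], name: str, fn) -> list[dict]:
    """Append a derived column computed from each row."""
    return [{**row, name: fn(row)} for row in rows]

raw = [
    {"Unit Price": "2", "Qty": "3"},
    {"Unit Price": "2", "Qty": "3"},   # duplicate: removed
    {"Unit Price": "", "Qty": "1"},    # missing value: removed
]
tidy = add_calculated(
    clean(normalise_columns(raw)),
    "total",
    lambda r: int(r["unit_price"]) * int(r["qty"]),
)
```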

Load Module

The load.py module contains the DataLoader class for outputting processed data:

  • Multiple formats: CSV, Parquet, and JSON output support
  • Data summaries: Automatic generation of processing metadata
  • Directory management: Creates output directories as needed
  • Load validation: Ensures successful data persistence
  • Performance tracking: Monitors load operations and file sizes
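
The DataLoader class in load.py is the real implementation; a minimal sketch of the CSV path, assuming only the behaviours listed above (directory creation, a returned load summary — `load_to_csv` is an illustrative name, not the template's API):

```python
import csv
import tempfile
from pathlib import Path

def load_to_csv(rows: list[dict], output_path: str) -> dict:
    """Write rows to CSV, creating output directories as needed; return a summary."""
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)   # directory management
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return {                                        # processing metadata
        "path": str(out),
        "rows_written": len(rows),
        "size_bytes": out.stat().st_size,
    }

# Demo: write into a nested directory that does not exist yet
base = Path(tempfile.mkdtemp())
summary = load_to_csv(
    [{"id": "1", "value": "a"}, {"id": "2", "value": "b"}],
    str(base / "outputs" / "result.csv"),
)
```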

Pipeline Orchestration

The __init__.py module provides the ETLPipeline class that orchestrates the entire process:

  • End-to-end execution: Manages extract, transform, and load phases
  • Configuration management: Handles pipeline parameters and options
  • Error recovery: Provides graceful error handling and rollback
  • Progress tracking: Monitors pipeline execution and performance
  • Flexible execution: Supports various execution patterns and customisation
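
The actual orchestration is the ETLPipeline class in __init__.py; the sketch below only illustrates the run-phases-with-graceful-failure pattern described above, using stand-in callables rather than the template's real modules:

```python
class PipelineSketch:
    """Runs extract -> transform -> load, returning False instead of raising."""

    def __init__(self, extract, transform, load):
        self.extract, self.transform, self.load = extract, transform, load
        self.log: list[str] = []                 # progress tracking

    def run_pipeline(self, source_path, output_path, apply_transforms=True):
        try:
            self.log.append("extract")
            data = self.extract(source_path)
            if apply_transforms:                 # configurable execution
                self.log.append("transform")
                data = self.transform(data)
            self.log.append("load")
            self.load(data, output_path)
            return True
        except Exception as exc:                 # error recovery: fail gracefully
            self.log.append(f"failed: {exc}")
            return False

store = {}
pipeline = PipelineSketch(
    extract=lambda path: [1, 2, 3],
    transform=lambda data: [x * 2 for x in data],
    load=lambda data, path: store.update({path: data}),
)
ok = pipeline.run_pipeline("in.csv", "out.csv")
```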

Contributing

See CONTRIBUTING.md for details.

License

See LICENSE for details.
