$joejoe


$stuff

This project follows the Reproducible Analytical Pipeline (RAP) methodology, providing a modular ETL (Extract, Transform, Load) framework for data processing workflows.

IMPORTANT: This README was generated from a template. Please update it with specific information about your project, including:

  • Detailed project description and purpose
  • Specific data sources and outputs
  • Usage examples and API documentation
  • Contributing guidelines specific to your project
  • Any project-specific compliance requirements

RAP Methodology

This project implements RAP principles:

  • Reproducibility: All processes are automated and version-controlled
  • Auditability: Clear data lineage and transformation logic
  • Quality Assurance: Automated testing and validation
  • Efficiency: Reusable components and standardised workflows


Getting Started

To get a local copy up and running, follow these simple steps.

Pre-requisites

Ensure you have the following installed:

  1. Python: Version specified in .python-version.
  2. Poetry: This is used to manage package dependencies and virtual environments.
  3. Operating System: macOS

Installation

  1. Clone the repository:
git clone https://github.com/$ONSdigital/$joejoe.git
  2. Install the dependencies.

Poetry is used to manage dependencies in this project. For more information, read the Poetry documentation.

To install all dependencies, including development dependencies, run:

make install-dev

To install only production dependencies, run:

make install

Running the RAP Pipeline

The template includes several ways to run the ETL pipeline:

  1. Using the convenience function:
make run
  2. Using individual components:
# Extract data
poetry run python -c "from $joejoe.extract import extract_from_source; print(extract_from_source('example_data.csv'))"

# Run full pipeline with custom parameters
poetry run python run_etl.py
  3. Programmatic usage:
from $joejoe import ETLPipeline

pipeline = ETLPipeline()
success = pipeline.run_pipeline(
    source_path="data/input.csv",
    output_path="data/output.csv",
    apply_transforms=True
)

Development

Get started with development by running the following commands. Before proceeding, make sure you have the development dependencies installed using the make install-dev command.

A Makefile is provided to simplify common development tasks. To view all available commands, run:

make

Run Tests with Coverage

The unit tests are written using the pytest framework. To run the tests and check coverage, run:

make test

To run only the unit tests, run:

make test-unit

To run only the end-to-end tests, run:

make test-e2e

Linting and Formatting

Ruff is used for both linting and formatting of the Python code in this project. Ruff is a fast Python linter and formatter that replaces multiple tools with a single, efficient solution.

The tool is configured using the pyproject.toml file.

To lint the Python code, run:

make lint

To auto-format the Python code and correct fixable linting issues, run:

make format

Security Scanning

Bandit is used for security scanning of the Python code. It helps identify common security issues in Python applications.

To run the security scan, run:

make security-scan
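
As a quick illustration (not code from this template), the kind of issue Bandit reports includes use of weak hash algorithms — its B324 check would flag the first call below, while the SHA-256 alternative passes:

```python
import hashlib

# Bandit's B324 check flags insecure hash functions such as MD5;
# `make security-scan` would report the first line.
weak = hashlib.md5(b"payload").hexdigest()       # flagged: weak hash
strong = hashlib.sha256(b"payload").hexdigest()  # passes: SHA-256
```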

Type Checking

MyPy is used for static type checking to catch type-related errors before runtime.

To run type checking, run:

poetry run mypy $joejoe
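
As a hypothetical illustration of what static type checking buys you (the function below is not part of the template), the commented-out call is the kind of mismatch MyPy catches before runtime:

```python
def total_cost(unit_price: float, quantity: int) -> float:
    """Annotated function; MyPy verifies that call sites match these types."""
    return unit_price * quantity

cost = total_cost(2.5, 4)
# total_cost("2.5", 4)  # MyPy would flag: argument 1 has incompatible type "str"
```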

GitHub Actions

Linting/formatting and security scanning GitHub Actions are enabled by default on template repositories. The Actions tab of your repository lists all of its workflow runs; a failed run is marked with a red circle containing a cross.

To find out more details about why it failed:

  1. Click on the name of the action
  2. Click the job in the Jobs section in the left sidebar
  3. Expand the step marked with a red circle and cross to view more information about the failure

Please note that GitHub Actions will not automatically fix the errors; you must resolve them locally.

Pre-commit Hooks

The project includes pre-commit hooks to automatically run linting, formatting, and security checks before each commit.

  1. Install pre-commit using Poetry:
poetry add --group dev pre-commit
  2. Activate the git hooks:
pre-commit install

From now on, Ruff and Bandit will run automatically on the files you stage before every commit.

RAP Components

This template provides a modular ETL framework with the following components:

Extract Module

The extract.py module contains the DataExtractor class and helper functions for reading data from various sources:

  • CSV files: Primary data source with configurable parameters
  • File validation: Ensures data sources exist before processing
  • Metadata extraction: Provides file information and statistics
  • Error handling: Comprehensive error management and logging
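
The real implementation lives in the DataExtractor class in extract.py; the sketch below is a hypothetical, stdlib-only stand-in for the behaviours listed above (CSV reading, file validation, metadata). The names `extract_csv` and `source_metadata` are illustrative, not the template's API:

```python
import csv
import os
import tempfile
from pathlib import Path

def extract_csv(path: str) -> list[dict[str, str]]:
    """Validate that the source exists, then read it into a list of row dicts."""
    source = Path(path)
    if not source.exists():                # file validation before processing
        raise FileNotFoundError(f"Data source not found: {source}")
    with source.open(newline="") as f:
        return list(csv.DictReader(f))     # CSV rows keyed by header

def source_metadata(path: str) -> dict:
    """Basic file information for audit logging."""
    source = Path(path)
    return {"name": source.name, "size_bytes": source.stat().st_size}

# Demo against a throwaway CSV file
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,value\n1,a\n2,b\n")
rows = extract_csv(tmp.name)
meta = source_metadata(tmp.name)
os.unlink(tmp.name)
```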

Transform Module

The transform.py module contains the DataTransformer class for data cleaning and enhancement:

  • Data cleaning: Removes duplicates and handles missing values
  • Calculated columns: Adds derived fields based on business logic
  • Data filtering: Applies configurable filters and business rules
  • Transformation logging: Tracks all applied transformations
  • Column normalisation: Standardises column names and formats
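
The DataTransformer class in transform.py provides the real implementation; as a rough, self-contained sketch of the cleaning, calculated-column, and column-normalisation steps above (all function names here are illustrative):

```python
def normalise_columns(rows: list[dict]) -> list[dict]:
    """Standardise column names to lowercase snake_case."""
    return [
        {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
        for row in rows
    ]

def clean(rows: list[dict]) -> list[dict]:
    """Drop exact duplicates and rows with missing values."""
    seen: set = set()
    out = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen or any(value in (None, "") for value in row.values()):
            continue
        seen.add(key)
        out.append(row)
    return out

def add_calculated(rows: list[dict], name: str, fn) -> list[dict]:
    """Append a derived column computed from each row."""
    return [{**row, name: fn(row)} for row in rows]

raw = [
    {"Unit Price": "2", "Qty": "3"},
    {"Unit Price": "2", "Qty": "3"},   # duplicate: removed
    {"Unit Price": "", "Qty": "1"},    # missing value: removed
]
tidy = add_calculated(
    clean(normalise_columns(raw)),
    "total",
    lambda r: int(r["unit_price"]) * int(r["qty"]),
)
```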

Load Module

The load.py module contains the DataLoader class for outputting processed data:

  • Multiple formats: CSV, Parquet, and JSON output support
  • Data summaries: Automatic generation of processing metadata
  • Directory management: Creates output directories as needed
  • Load validation: Ensures successful data persistence
  • Performance tracking: Monitors load operations and file sizes
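
The DataLoader class in load.py is the real implementation; a minimal sketch of the CSV path, assuming only the behaviours listed above (directory creation, a returned load summary — `load_to_csv` is an illustrative name, not the template's API):

```python
import csv
import tempfile
from pathlib import Path

def load_to_csv(rows: list[dict], output_path: str) -> dict:
    """Write rows to CSV, creating output directories as needed; return a summary."""
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)   # directory management
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return {                                        # processing metadata
        "path": str(out),
        "rows_written": len(rows),
        "size_bytes": out.stat().st_size,
    }

# Demo: write into a nested directory that does not exist yet
base = Path(tempfile.mkdtemp())
summary = load_to_csv(
    [{"id": "1", "value": "a"}, {"id": "2", "value": "b"}],
    str(base / "outputs" / "result.csv"),
)
```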

Pipeline Orchestration

The __init__.py module provides the ETLPipeline class that orchestrates the entire process:

  • End-to-end execution: Manages extract, transform, and load phases
  • Configuration management: Handles pipeline parameters and options
  • Error recovery: Provides graceful error handling and rollback
  • Progress tracking: Monitors pipeline execution and performance
  • Flexible execution: Supports various execution patterns and customisation
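
The actual orchestration is the ETLPipeline class in __init__.py; the sketch below only illustrates the run-phases-with-graceful-failure pattern described above, using stand-in callables rather than the template's real modules:

```python
class PipelineSketch:
    """Runs extract -> transform -> load, returning False instead of raising."""

    def __init__(self, extract, transform, load):
        self.extract, self.transform, self.load = extract, transform, load
        self.log: list[str] = []                 # progress tracking

    def run_pipeline(self, source_path, output_path, apply_transforms=True):
        try:
            self.log.append("extract")
            data = self.extract(source_path)
            if apply_transforms:                 # configurable execution
                self.log.append("transform")
                data = self.transform(data)
            self.log.append("load")
            self.load(data, output_path)
            return True
        except Exception as exc:                 # error recovery: fail gracefully
            self.log.append(f"failed: {exc}")
            return False

store = {}
pipeline = PipelineSketch(
    extract=lambda path: [1, 2, 3],
    transform=lambda data: [x * 2 for x in data],
    load=lambda data, path: store.update({path: data}),
)
ok = pipeline.run_pipeline("in.csv", "out.csv")
```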

Contributing

See CONTRIBUTING.md for details.

License

See LICENSE for details.
