Olympus-R-D/Athena

ATHENA - Advanced NGS Data Processing Pipeline

ATHENA (Advanced Tool for High-throughput Experimental NGS Analysis) is a command-line bioinformatics pipeline that takes NGS data from raw reads to assembled contigs. It wraps FastQC, Trimmomatic, SPAdes, and QUAST in an automated, multi-threaded workflow, and adds execution history tracking, resume support, and production deployment options.

🧬 What is ATHENA?

ATHENA provides:

Core Workflow

  • 🔍 Quality Assessment: FastQC analysis of raw sequencing data
  • ✂️ Read Trimming: Trimmomatic processing to remove low-quality bases and adapters
  • 🔄 Quality Validation: Post-trimming FastQC analysis to confirm the improvement
  • 🧬 Genome Assembly: De novo genome assembly using SPAdes
  • 📊 Assembly Validation: QUAST assembly quality assessment
  • 📈 Automated Reporting: Quality control reports throughout the pipeline

Advanced Features

  • 📝 Execution History: Complete tracking of pipeline runs with resume capabilities
  • 🔄 Smart Resume: Continue from failed steps without re-running completed stages
  • 🚀 Production Ready: AWS deployment with multiple scaling strategies
  • ⚡ High Performance: Multi-threaded execution with configurable resource allocation
  • 🛡️ Error Recovery: Robust error handling with detailed diagnostics
  • 📊 Rich Reporting: Terminal-based summaries with detailed metrics

🚀 Key Features

Core Commands

| Command | Description | Use Case |
| --- | --- | --- |
| start | Complete automated pipeline (FastQC → Trimmomatic → FastQC → SPAdes → QUAST) | Full genome analysis |
| continue | Resume pipeline from specific step | Recovery from failures |
| fastqc | Standalone FastQC quality control analysis | Quality assessment only |
| trim | Standalone Trimmomatic read trimming | Read preprocessing |
| spades | Standalone SPAdes genome assembly | Assembly only |
| quast | Standalone QUAST assembly quality assessment | Assembly validation |
| clean | Clean up temporary files and resources | Maintenance |
| help | Comprehensive usage information | Documentation |

Advanced Capabilities

🔄 Execution Management

  • History Tracking: Complete JSON-based execution history with detailed metadata
  • Smart Resume: Automatic detection of completed steps and intelligent restart
  • Session Management: Named sessions for organized project management
  • Error Recovery: Detailed error diagnostics with suggested solutions
  • Automated Configuration: Fully automated pipeline execution with config files
  • Remote Server Support: Execute pipelines on external servers with user credentials

⚡ Performance Features

  • Multi-threaded Processing: Configurable thread allocation for optimal performance
  • Memory Management: Intelligent memory allocation with user-defined limits
  • Resource Monitoring: Real-time tracking of CPU, memory, and disk usage
  • Parallel Execution: Concurrent processing of multiple samples

🛡️ Production Features

  • Docker Support: Containerized deployment for consistent environments
  • AWS Integration: Multiple cloud deployment strategies without credential sharing
  • Configuration Management: YAML/JSON configuration files for reproducible runs
  • Comprehensive Testing: Automated test suites with performance benchmarking

📊 Quality & Reporting

  • Rich Terminal Output: Color-coded progress indicators and detailed status
  • Comprehensive Reports: HTML and text-based quality summaries
  • Metrics Tracking: Performance and quality metrics throughout pipeline
  • Flexible I/O: Support for compressed/uncompressed files and directory inputs

🛠️ Building ATHENA

Prerequisites

System Requirements

  • Operating System: Linux (Ubuntu 20.04+, CentOS 8+) or macOS (10.15+)
  • CPU: Multi-core processor (4+ cores recommended)
  • Memory: 8GB RAM minimum (16GB+ recommended for large assemblies)
  • Storage: 50GB+ free space (varies with dataset size)

Development Dependencies

  • C++17 compatible compiler (GCC 7+, Clang 5+, or MSVC 2017+)
  • CMake 3.12 or higher
  • Git (for version control)
  • Python 3.6+ (for test suites and utilities)

Bioinformatics Tools

  • FastQC v0.11.9+ (quality control analysis)
  • Trimmomatic v0.39+ (read trimming)
  • SPAdes v3.13+ (genome assembly)
  • QUAST v5.0+ (assembly quality assessment)
  • Java Runtime Environment 8+ (for FastQC and Trimmomatic)
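
Before building, it can help to confirm that the external tools are reachable. The following is a minimal sketch and not part of ATHENA: it assumes the tools are exposed on PATH under the executable names `fastqc`, `trimmomatic`, `spades.py`, `quast.py`, and `java`. Those names vary by installation (Trimmomatic, for instance, is often shipped as a JAR without a wrapper script), and the helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = {
    "FastQC": "fastqc",
    "Trimmomatic": "trimmomatic",
    "SPAdes": "spades.py",
    "QUAST": "quast.py",
    "Java": "java",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the display names of tools whose executable is not on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.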

Header Libraries (Included)

  • CLI11 (command-line parsing) - external/CLI11.hpp
  • nlohmann/json (JSON processing) - external/json.hpp

Quick Start Build

# Clone the repository
git clone https://github.com/1337-R-D/Athena.git
cd Athena

# Navigate to the main build directory
cd Athena

# Create and enter build directory
mkdir -p build && cd build

# Configure and build (on macOS, replace $(nproc) with $(sysctl -n hw.ncpu))
cmake ..
make -j$(nproc)

# Test the build
./athena --version
./athena help

Advanced Build Options

# Debug build with additional information
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j$(nproc)

# Release build with optimizations
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)

# Install system-wide (optional)
sudo make install

# Clean build
make clean

Build Targets

# Build main executable only
make athena

# Build and run application with test data
make run

# Run test suite
make test-commands

# Clean all build artifacts
make clean

Automated Installation Script

For convenience, use the provided installation script:

# Make installation script executable
chmod +x installation.sh

# Run automated installation (installs dependencies and builds ATHENA)
./installation.sh

# Verify installation
./Athena/build/athena --version

💻 Usage

Complete Pipeline (Recommended)

Run the full ATHENA pipeline for comprehensive NGS analysis from raw reads to assembled genome:

# Complete pipeline with paired-end reads
./build/athena start -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o results/

# Complete pipeline with directory input (auto-detects paired files)
./build/athena start -d input_data/ -o results/

# With quality reports and verbose output
./build/athena start -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o results/ -r -v
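
Directory input depends on matching each R1 file with its R2 mate. ATHENA's actual matching rule is not documented here; the sketch below shows one common convention, assuming file names that differ only in an `_R1`/`_R2` token. The names `find_read_pairs` and `find_read_pairs_in_dir` are hypothetical.

```python
import re
from pathlib import Path

# Matches names like sample_R1.fastq.gz or sample_R1_001.fq (assumed convention).
R1_PATTERN = re.compile(r"^(?P<prefix>.+)_R1(?P<suffix>(_\d+)?\.(fastq|fq)(\.gz)?)$")


def find_read_pairs(filenames):
    """Pair each *_R1* FASTQ name with its *_R2* mate, if the mate is present."""
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        match = R1_PATTERN.match(name)
        if not match:
            continue
        mate = f"{match['prefix']}_R2{match['suffix']}"
        if mate in names:
            pairs.append((name, mate))
    return pairs


def find_read_pairs_in_dir(directory):
    """Apply find_read_pairs to the regular files in a directory."""
    return find_read_pairs(p.name for p in Path(directory).iterdir() if p.is_file())
```

Files without a mate are skipped silently in this sketch; a real pipeline would more likely report them.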

Automated Pipeline with Configuration Files

ATHENA can run unattended from a configuration file, which makes it easier to integrate into larger workflows and to run on a remote server:

# Automated pipeline using config file
./build/athena start --config config.json

# Automated pipeline with custom output directory
./build/athena start --config config.json -o custom_results/

Configuration File Format (config.json):

{
    "remote": {
        "ip": "12.34.56.78",
        "user": "ubuntu", 
        "key_path": "~/.ssh/private-key",
        "remote_dir": "/home/user1/job"
    },
    "pipeline": {
        "input1": "sample_R1.fastq.gz",
        "input2": "sample_R2.fastq.gz",
        "output_dir": "results/",
        "generate_reports": true,
        "verbose": false
    }
}
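
A config file like the one above can be checked before a run is launched. This is a minimal sketch, not ATHENA's own validation: the key names come from the example, while the rules (a mandatory `pipeline` section, an optional but complete `remote` section) and the function names are assumptions.

```python
import json

# section name -> (expected keys, whether the section must be present)
# Keys are taken from the example config.json; the rules are assumptions.
SECTIONS = {
    "remote": (("ip", "user", "key_path", "remote_dir"), False),
    "pipeline": (("input1", "input2", "output_dir"), True),
}


def validate_config(config):
    """Return a list of problems; an empty list means the config looks complete."""
    problems = []
    for section, (keys, required) in SECTIONS.items():
        block = config.get(section)
        if block is None:
            if required:
                problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in block:
                problems.append(f"missing key: {section}.{key}")
    return problems


def load_config(path):
    """Read a JSON config file and raise ValueError if expected fields are absent."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    problems = validate_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    return config
```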

Remote Server Execution

ATHENA can run pipelines on external high-performance servers over SSH:

Features:

  • 🔐 Secure Authentication: SSH key-based authentication for secure connections
  • 🚀 High-Performance Computing: Leverage powerful remote servers for large-scale analysis
  • 📁 Automatic File Management: Seamless file transfer and remote directory management
  • 👥 Multi-User Support: Individual user credentials and isolated workspaces
  • 🔄 Remote Resume: Continue interrupted analyses on remote servers

Getting Remote Access: To use remote server capabilities, contact the ATHENA team to obtain your personal credentials:

  • Email: [olympus-]
  • Request: Include your intended use case and computational requirements
  • Credentials: You'll receive a personalized config.json file with your server access details

Remote Execution Example:

# Execute complete pipeline on remote server
./build/athena start --config your_credentials.json

# Resume interrupted remote job
./build/athena continue --config your_credentials.json -p remote_results/

# Run specific analysis step remotely
./build/athena spades --config your_credentials.json -1 R1.fastq -2 R2.fastq -o assembly/
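
To make the `remote` fields of the config file concrete, the sketch below maps them onto standard `ssh` and `scp` argument lists. It illustrates what the fields mean and is not a description of how ATHENA connects internally; the function names are hypothetical.

```python
import os
import shlex


def build_ssh_command(remote, remote_command):
    """Build an ssh argument list from the 'remote' section of config.json."""
    key_path = os.path.expanduser(remote["key_path"])
    target = f"{remote['user']}@{remote['ip']}"
    # Change into the job directory before running the command on the server.
    script = f"cd {shlex.quote(remote['remote_dir'])} && {remote_command}"
    return ["ssh", "-i", key_path, target, script]


def build_upload_command(remote, local_path):
    """Build an scp argument list that copies one file into remote_dir."""
    key_path = os.path.expanduser(remote["key_path"])
    destination = f"{remote['user']}@{remote['ip']}:{remote['remote_dir']}/"
    return ["scp", "-i", key_path, local_path, destination]
```

Both functions return argument lists rather than shell strings, so they can be passed to `subprocess.run` without extra quoting on the local side.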

Pipeline Execution Flow

ATHENA follows a structured 7-stage pipeline:

| Stage | Tool | Purpose | Outputs |
| --- | --- | --- | --- |
| 1. Initialization | Internal | Setup and validation | Session metadata, directory structure |
| 2. FastQC (Raw) | FastQC | Quality assessment of raw reads | HTML reports, quality metrics |
| 3. Trimmomatic | Trimmomatic | Remove low-quality bases and adapters | Trimmed paired/unpaired reads |
| 4. FastQC (Trimmed) | FastQC | Quality assessment post-trimming | Quality improvement metrics |
| 5. SPAdes Assembly | SPAdes | De novo genome assembly | Contigs, scaffolds, assembly graph |
| 6. QUAST Analysis | QUAST | Assembly quality assessment | Assembly statistics, reports |
| 7. Finalization | Internal | Cleanup and summary | Final reports, session completion |

Smart Resume and Continue

ATHENA tracks execution history and allows intelligent resumption:

# Resume from last failed step automatically
./build/athena continue -p results/

# Continue specific session
./build/athena continue -s session_name -p results/

# Resume from specific step (if previous steps completed)
./build/athena continue -p results/ --from spades

Resume capabilities:

  • ✅ Automatic detection of completed steps
  • ✅ Validation of intermediate files
  • ✅ Error diagnostics and suggestions
  • ✅ Session-based progress tracking
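
The resume behavior described above can be pictured as a lookup over per-step statuses. The sketch below is an illustration under stated assumptions: the history is modeled as a plain mapping from step name to status, and the step names are invented for the example (only `spades` appears in the documented `--from` usage). ATHENA's real history format is not shown in this README.

```python
# Step names are assumptions modeled on the tool stages of the pipeline.
PIPELINE_STEPS = ["fastqc_raw", "trim", "fastqc_trimmed", "spades", "quast"]


def next_step_to_run(history, steps=PIPELINE_STEPS):
    """Return the first step not marked 'completed', or None if all are done."""
    for step in steps:
        if history.get(step) != "completed":
            return step
    return None


def can_resume_from(history, requested, steps=PIPELINE_STEPS):
    """A step may be resumed only if every earlier step has completed."""
    if requested not in steps:
        raise ValueError(f"unknown step: {requested}")
    earlier = steps[: steps.index(requested)]
    return all(history.get(step) == "completed" for step in earlier)
```

With a history in which `fastqc_trimmed` failed, `next_step_to_run` returns `fastqc_trimmed`, and `can_resume_from(history, "spades")` is false because an earlier step is incomplete.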

Individual Commands

📊 FastQC Quality Control

# Analyze paired-end reads
./build/athena fastqc -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o qc_results/

# With terminal quality report
./build/athena fastqc -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o qc_results/ -r

# Verbose mode with detailed output
./build/athena fastqc -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o qc_results/ -v

# Process directory of FASTQ files
./build/athena fastqc -d input_reads/ -o qc_results/ -r

✂️ Trimmomatic Read Trimming

# Trim paired-end reads with default settings
./build/athena trim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o trimmed_results/

# Trim with verbose output
./build/athena trim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o trimmed_results/ -v

# Process directory input
./build/athena trim -d raw_reads/ -o trimmed_results/

🧬 SPAdes Genome Assembly

# Assemble trimmed paired-end reads
./build/athena spades -1 trimmed_R1_paired.fq.gz -2 trimmed_R2_paired.fq.gz -o assembly/

# Assembly with verbose logging
./build/athena spades -1 trimmed_R1_paired.fq.gz -2 trimmed_R2_paired.fq.gz -o assembly/ -v

# Assembly with quality report generation
./build/athena spades -1 trimmed_R1_paired.fq.gz -2 trimmed_R2_paired.fq.gz -o assembly/ -r

📈 QUAST Assembly Quality Assessment

# Basic assembly quality evaluation
./build/athena quast -c contigs.fasta -o quast_results/

# With reference genome for comparison
./build/athena quast -c contigs.fasta -r reference_genome.fasta -o quast_results/

# Verbose analysis with detailed metrics
./build/athena quast -c contigs.fasta -o quast_results/ -v

🛠️ Utility Commands

# Display comprehensive help
./build/athena help

# Show version information
./build/athena --version

# Clean temporary files and directories
./build/athena clean

# Clean with confirmation prompts
./build/athena clean --interactive

Command Line Options Reference

Global Options

  • -v, --version: Display ATHENA version information (inside a subcommand, -v means --verbose)
  • -h, --help: Show command-specific help

Input/Output Options

  • -1, --input1 FILE: First input FASTQ file (R1)
  • -2, --input2 FILE: Second input FASTQ file (R2)
  • -d, --directory DIR: Input directory containing FASTQ files
  • -o, --output DIR: Output directory for results (required)
  • -c, --contig FILE: Input contig/assembly file (QUAST only)
  • -r, --reference FILE: Reference genome file (QUAST only, optional; other commands use -r for --report)

Execution Options

  • -r, --report: Generate detailed terminal reports (not available in quast, where -r selects the reference genome)
  • -v, --verbose: Enable verbose output and logging
  • -p, --project DIR: Project directory for resume operations
  • -s, --session NAME: Specific session name for operations
  • --from STEP: Resume from specific pipeline step
  • --config FILE: Configuration file for automated execution (JSON format)

Configuration File Options:

  • Remote server credentials and connection details
  • Pipeline parameters and input/output specifications
  • Execution preferences (verbose, reports, threading)
  • User-specific workspace and authentication settings

Clean Command Features:

  • Removes test output directories (test_output, fastqc_test_output, etc.)
  • Cleans temporary files (.tmp, .log, *_fastqc.html, *_trimmed*.fq.gz)
  • Cleans cache directories containing ATHENA-related files
  • Optional build directory cleanup with user confirmation
  • Color-coded output for clear feedback
  • Detailed summary of cleaned items
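
The file patterns listed above can be previewed before anything is deleted. This is a minimal sketch rather than the `clean` command's implementation: it reads `.tmp` and `.log` as the globs `*.tmp` and `*.log`, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns from the list above; ".tmp" and ".log" are read here as "*.tmp" and "*.log".
CLEAN_PATTERNS = ["*.tmp", "*.log", "*_fastqc.html", "*_trimmed*.fq.gz"]


def select_files_to_clean(filenames, patterns=CLEAN_PATTERNS):
    """Return the names matching at least one cleanup pattern, in input order."""
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every platform.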

Options by Command

Start Command Options

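  • -1, --input1 FILE: First input FASTQ file (required unless -d or --config is used)
  • -2, --input2 FILE: Second input FASTQ file (required unless -d or --config is used)
  • -d, --directory DIR: Input directory; paired files are auto-detected
  • -o, --output DIR: Output directory (required unless set in the --config file)
  • -r, --report: Generate terminal quality reports
  • -v, --verbose: Enable verbose output
  • --config FILE: JSON configuration file for automated execution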

FastQC Command Options

  • -1, --input1 FILE: First input FASTQ file (required unless -d is used)
  • -2, --input2 FILE: Second input FASTQ file (required unless -d is used)
  • -o, --output DIR: Output directory (required)
  • -r, --report: Generate terminal quality reports
  • -v, --verbose: Show detailed FastQC output

Trim Command Options

  • -1, --input1 FILE: First input FASTQ file (required unless -d is used)
  • -2, --input2 FILE: Second input FASTQ file (required unless -d is used)
  • -o, --output DIR: Output directory (required)
  • -r, --report: Generate terminal reports (placeholder)
  • -v, --verbose: Show detailed Trimmomatic output

SPAdes Command Options

  • -1, --input1 FILE: First input FASTQ file (required)
  • -2, --input2 FILE: Second input FASTQ file (required)
  • -o, --output DIR: Output directory (required)
  • -r, --report: Generate terminal reports
  • -v, --verbose: Show detailed SPAdes output

QUAST Command Options

  • -c, --contig FILE: Input contig/assembly file (required)
  • -o, --output DIR: Output directory (required)
  • -r, --reference FILE: Reference genome file (optional)
  • -v, --verbose: Show detailed QUAST output
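
When ATHENA is driven from a script, the options listed above can be assembled programmatically. The wrapper below is a sketch under stated assumptions: it uses only flags that appear in this README, the binary path `./build/athena` follows the usage examples, and the function names are hypothetical.

```python
import subprocess

# Flags are the ones listed in this README; nothing else is assumed about the CLI.
FLAG_FOR = {
    "input1": "-1",
    "input2": "-2",
    "directory": "-d",
    "output": "-o",
    "contig": "-c",
    "config": "--config",
}


def build_athena_command(subcommand, binary="./build/athena", verbose=False, **options):
    """Translate keyword options into an argument list for one ATHENA subcommand."""
    args = [binary, subcommand]
    for name, value in options.items():
        if name not in FLAG_FOR:
            raise ValueError(f"unsupported option: {name}")
        args += [FLAG_FOR[name], str(value)]
    if verbose:
        args.append("-v")
    return args


def run_athena(subcommand, **options):
    """Run the command; subprocess raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_athena_command(subcommand, **options), check=True)
```

For example, `build_athena_command("quast", contig="contigs.fasta", output="quast_results/")` yields the same arguments as the basic QUAST invocation shown earlier.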

Coming Soon

The following features are currently in development:

📄 Advanced Reporting & Analytics

  • 📊 Comprehensive PDF Reports: Professional-grade analysis reports with publication-ready figures and tables
  • 📈 Interactive Quality Dashboards: Web-based interactive visualizations for quality metrics and assembly statistics
  • 🎯 Comparative Analysis Reports: Multi-sample comparison reports with statistical analysis
  • 📋 Executive Summaries: High-level summary reports for project managers and stakeholders

Stay tuned for updates! Follow our development progress and contribute to the roadmap on our GitHub repository.

📚 Additional Resources

License

This project is distributed under the terms specified in the repository license. Please refer to the LICENSE file for detailed information.

Contributing

Contributions are welcome! Please read CONTRIBUTORS.md for:

  • Development environment setup
  • Code style guidelines
  • Testing procedures
  • Submission process

ATHENA - Streamlining NGS data preprocessing with automated quality control and read trimming.
