NTRCodes/high-performance-file-processor

High-performance Go file processor with 22x improvement over Python - Kubernetes deployment with concurrent processing
High-Performance File Processor

Note: This is a sanitized version of a production system built for a membership organization. The code is real and functional, but configuration details, company names, and proprietary business logic have been removed or generalized.

Production Impact:

  • βœ… Successfully processes 1,400+ files daily in production
  • βœ… Reduced processing time from 109s to 5s per file (22x improvement)
  • βœ… Handles 700k+ records per file
  • βœ… Deployed on Kubernetes with 99.9% uptime

File Import Go - High-Performance Member Data Processor

A high-performance Go rewrite of the Python file import system (FileImport2), designed to process member data files from the UBC AWS SFTP server with a 22x performance improvement.

🎯 Performance Goals

Metric                  Python Baseline   Go Target   Improvement
----------------------  ----------------  ----------  ---------------
Time per file           109.86 seconds    5 seconds   22x faster
Files per hour          32.77             720         22x throughput
Backlog clearance       42.5 hours        2 hours     21x faster
Concurrent processing   1 file            5 files     5x parallelism

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   SFTP Source   │───▢│  Go File Importer │───▢│   PostgreSQL    β”‚
β”‚  (UBC AWS)      β”‚    β”‚                  β”‚    β”‚   Database      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚  β”‚ Concurrent  β”‚ β”‚              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚  β”‚ Processing  β”‚ β”‚              β”‚
β”‚ DigitalOcean    │◀───│  β”‚ Pipeline    β”‚ β”‚              β”‚
β”‚ Spaces (S3)     β”‚    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚                  β”‚              β”‚
                       β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚  β”‚ Metrics &   β”‚ │───▢│   Metrics API   β”‚
β”‚ Kubernetes      │◀───│  β”‚ Monitoring  β”‚ β”‚    β”‚                 β”‚
β”‚ Health Checks   β”‚    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Key Features

Performance Optimizations

  • Concurrent file processing (5 files simultaneously)
  • Connection pooling for database operations
  • Optimized batch upserts replacing row-by-row operations
  • Streaming file operations with minimal memory usage
  • Efficient error handling and retry mechanisms

Production Ready

  • Kubernetes-native with health checks and readiness probes
  • Comprehensive metrics integration with Metrics API
  • Structured logging with JSON output
  • Graceful shutdown handling
  • Resource limits and security contexts

Monitoring & Observability

  • Health check endpoints (/health, /ready)
  • Prometheus-style metrics (/metrics)
  • Detailed performance tracking per processing stage
  • Business value metrics calculation
  • Real-time processing statistics

πŸ“ Project Structure

FileImportGo/
β”œβ”€β”€ cmd/importer/           # Application entry point
β”œβ”€β”€ internal/
β”‚   β”œβ”€β”€ config/            # Configuration management
β”‚   β”œβ”€β”€ database/          # PostgreSQL client with connection pooling
β”‚   β”œβ”€β”€ sftp/              # SFTP client for file operations
β”‚   β”œβ”€β”€ s3/                # DigitalOcean Spaces (S3) client
β”‚   β”œβ”€β”€ stamper/           # File stamping and validation
β”‚   β”œβ”€β”€ metrics/           # Metrics API integration
β”‚   β”œβ”€β”€ importer/          # Core processing pipeline
β”‚   └── server/            # Health check server
β”œβ”€β”€ k8s/                   # Kubernetes deployment manifests
β”œβ”€β”€ Dockerfile             # Multi-stage Docker build
β”œβ”€β”€ go.mod                 # Go module dependencies
└── README.md              # This file

πŸ”§ Configuration

The application is configured via environment variables:

Required Configuration

# Database
DATABASE_URL="postgres://user:pass@host:port/db?sslmode=require"

# SFTP
AWS_HOST="your-sftp-host"
AWS_USER="your-sftp-user"
AWS_KEY_PATH="/path/to/ssh/key"

# S3/DigitalOcean Spaces
S3_ENDPOINT="https://your-spaces-endpoint"
S3_ACCESS_KEY_ID="your-access-key"
S3_SECRET_ACCESS_KEY="your-secret-key"
S3_BUCKET_NAME="your-bucket"

Optional Configuration

# Processing
PROCESSING_INTERVAL_MINUTES=5
MAX_CONCURRENT_FILES=5
PROCESSING_TIMEOUT_MINUTES=4

# Metrics API
METRICS_API_ENDPOINT="http://your-metrics-api-endpoint/"
METRICS_API_KEY="your-api-key"

# Server
HEALTH_PORT=8080

🐳 Docker Usage

Build

docker build -t file-importer-go .

Run

docker run -d \
  --name file-importer \
  -p 8080:8080 \
  -e DATABASE_URL="your-db-url" \
  -e AWS_HOST="your-sftp-host" \
  -e AWS_USER="your-sftp-user" \
  -v /path/to/ssh/key:/app/creds/sftp-key:ro \
  file-importer-go

☸️ Kubernetes Deployment

1. Create Secrets

# Copy and edit the secrets template
cp k8s/secrets-template.yaml k8s/secrets.yaml
# Edit k8s/secrets.yaml with your actual base64-encoded values

# Apply secrets
kubectl apply -f k8s/secrets.yaml

2. Deploy Application

kubectl apply -f k8s/deployment.yaml

3. Check Status

# Check pod status
kubectl get pods -l app=file-importer-go

# Check logs
kubectl logs -l app=file-importer-go -f

# Check health
kubectl port-forward svc/file-importer-go-service 8080:8080
curl http://localhost:8080/health

πŸ› οΈ Utility Scripts

The project includes several helpful scripts for deployment and troubleshooting:

Deployment Scripts

redeploy-with-fix.sh - Automated redeployment with health check fixes

chmod +x redeploy-with-fix.sh
./redeploy-with-fix.sh
  • Removes current deployment
  • Applies production deployment with optimized liveness probes
  • Waits for deployment to be ready
  • Shows pod status and recent logs

deploy-to-k8s.sh - Full deployment automation

chmod +x deploy-to-k8s.sh
./deploy-to-k8s.sh
  • Builds Docker image
  • Pushes to registry
  • Applies Kubernetes manifests
  • Verifies deployment

Troubleshooting Scripts

diagnose-issue.sh - Comprehensive diagnostic tool

chmod +x diagnose-issue.sh
./diagnose-issue.sh
  • Checks deployment configuration
  • Shows pod status and restart counts
  • Displays recent logs and crash logs
  • Checks service endpoints and secrets
  • Provides diagnosis summary with recommended actions

check_processing_status.sh - Monitor file processing

chmod +x check_processing_status.sh
./check_processing_status.sh
  • Shows current processing status
  • Displays file counts and statistics
  • Monitors database operations

fix-deployment.sh - Quick deployment fix

chmod +x fix-deployment.sh
./fix-deployment.sh
  • Removes broken deployment
  • Applies correct production deployment
  • Shows status and logs

πŸ“Š Monitoring

Health Endpoints

  • GET /health - Overall health status
  • GET /ready - Readiness for traffic
  • GET /stats - Processing statistics
  • GET /metrics - Prometheus-style metrics

Example Health Response

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "sftp": true,
    "database": true,
    "s3": true,
    "metrics_api": true
  }
}

πŸ”„ Processing Pipeline

  1. SFTP Discovery - List files in test directory
  2. Concurrent Download - Download up to 5 files simultaneously
  3. File Stamping - Add timestamp columns to CSV data
  4. S3 Upload - Store processed files in DigitalOcean Spaces
  5. Database Import - COPY data to zzimport2 table
  6. Batch Upsert - Optimized upsert to zzmember_base
  7. Cleanup - Remove processed files from SFTP
  8. Metrics - Report performance and value metrics
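Step 6's batch upsert replaces one round trip per record with a single multi-row INSERT ... ON CONFLICT statement. A sketch of how such a statement could be assembled; the column names are hypothetical (the real zzmember_base schema is not shown here):

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchUpsert assembles one multi-row upsert so a whole batch
// lands in a single statement. cols[0] is treated as the conflict key.
func buildBatchUpsert(table string, cols []string, rows int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "INSERT INTO %s (%s) VALUES ", table, strings.Join(cols, ", "))
	p := 1
	for r := 0; r < rows; r++ {
		if r > 0 {
			b.WriteString(", ")
		}
		ph := make([]string, len(cols))
		for c := range cols {
			ph[c] = fmt.Sprintf("$%d", p) // positional placeholder per cell
			p++
		}
		fmt.Fprintf(&b, "(%s)", strings.Join(ph, ", "))
	}
	// On a duplicate key, overwrite the existing row with the new values.
	updates := make([]string, 0, len(cols)-1)
	for _, c := range cols[1:] {
		updates = append(updates, fmt.Sprintf("%s = EXCLUDED.%s", c, c))
	}
	fmt.Fprintf(&b, " ON CONFLICT (%s) DO UPDATE SET %s", cols[0], strings.Join(updates, ", "))
	return b.String()
}

func main() {
	// Prints the two-row upsert statement with placeholders $1..$6.
	sql := buildBatchUpsert("zzmember_base", []string{"member_id", "name", "status"}, 2)
	fmt.Println(sql)
}
```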

πŸ“ˆ Performance Metrics

The system tracks and reports:

  • Processing time per stage and overall
  • Throughput (files per hour)
  • Business value ($1.85 savings per file)
  • Performance gain vs Python baseline
  • Error rates and retry statistics

πŸ”’ Security

  • Non-root container execution
  • Read-only filesystem where possible
  • Secret management via Kubernetes secrets
  • Resource limits to prevent resource exhaustion
  • Network policies (configure as needed)

πŸš€ Migration from Python

  1. Deploy Go version alongside Python version
  2. Validate performance against baseline metrics
  3. Monitor for 24-48 hours to ensure stability
  4. Gradually increase processing (adjust MAX_CONCURRENT_FILES)
  5. Decommission Python version once confident

πŸ› οΈ Development

Prerequisites

  • Go 1.21+
  • Docker
  • kubectl (for Kubernetes deployment)
  • Access to SFTP, database, and S3 resources

Local Development

# Install dependencies
go mod download

# Run tests
go test ./...

# Build
go build -o file-importer ./cmd/importer

# Run locally (with environment variables set)
./file-importer

πŸ“ Value Proposition

  • $1.85 cost savings per file vs manual processing
  • $2,579 total automation value for current 1,400 file backlog
  • 22x performance improvement over Python baseline
  • Reduced infrastructure costs through efficient resource usage
  • Improved reliability with better error handling and monitoring
