ReliabilityOS

Production-grade SRE platform demonstrating observability, SLOs, incident response, and reliability engineering on Kubernetes.

What This Project Demonstrates

Area	Implementation
Observability	Prometheus metrics, Loki logs, Tempo traces — correlated via trace_id
SLOs	99.9% availability, p95 < 200ms, error budget tracking with burn-rate alerts
Alerting	Multi-window burn-rate (P1/P2/P3), Slack notifications, runbooks linked
Dashboards	SLO Overview, Engineering (drill-down), Executive (business), Capacity
Progressive Delivery	Argo Rollouts canary (10% → 30% → 60% → 100%) with Prometheus analysis
Chaos Engineering	Chaos Mesh: pod-kill, network latency, CPU stress — with documented hypotheses
Autoscaling	HPA (CPU) for API, KEDA (RabbitMQ queue depth) for worker
IaC	Terraform modules (VPC, EKS, ECR), GitOps with ArgoCD app-of-apps
CI/CD	GitHub Actions: build, push to ECR, auto-update Helm values
Incident Response	IMAG model, severity P0-P3, postmortem templates, escalation ladder

Tech Stack

Layer	Technology
Application	Python 3.12, FastAPI, Celery, SQLAlchemy
Data	PostgreSQL 16, RabbitMQ 3.13, Redis 7
Orchestration	AWS EKS (K8s 1.34), Helm, ArgoCD
Observability	Prometheus, Grafana, Loki, Tempo, Alloy, Blackbox Exporter
Reliability	Argo Rollouts, Chaos Mesh, KEDA, HPA
Infrastructure	Terraform, AWS (VPC, EKS, ECR, S3, IAM)
CI/CD	GitHub Actions

Quick Start

Prerequisites

AWS account with CLI configured
Terraform >= 1.5
kubectl
Helm 3

Deploy Infrastructure

cd terraform/01-vpc && terraform init && terraform apply
cd ../02-cluster && terraform init && terraform apply
cd ../03-ecr && terraform init && terraform apply

Configure Cluster Access

aws eks update-kubeconfig --name studying-cluster --region us-east-1

Create Required Secrets

kubectl create secret generic slack-webhook -n monitoring \
  --from-literal=url='https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

Verify Deployment

ArgoCD automatically deploys all applications via the app-of-apps pattern. Check status:

kubectl get applications -n argocd

Access Services

Service	URL
Orders API	https://orders.gabrielstudying.click
Grafana	https://grafana.gabrielstudying.click
ArgoCD	https://argocd.gabrielstudying.click

Documentation

Full documentation available via Backstage TechDocs including SLIs/SLOs, error budgets, incident response, on-call procedures, runbooks, postmortem templates, root cause analysis, and architecture details.

Project Structure

terraform/          # IaC (VPC, EKS, ECR, IRSA)
apps/               # Application source (orders-api, worker)
helm/               # Helm values for all services
argocd/             # ArgoCD Application manifests
chaos/              # Chaos Mesh experiment definitions
docs/               # SRE documentation (TechDocs)
.github/workflows/  # CI/CD pipelines

Name		Name	Last commit message	Last commit date
Latest commit History 369 Commits
.github/workflows		.github/workflows
apps		apps
argocd		argocd
chaos		chaos
docs		docs
helm		helm
images		images
terraform		terraform
.gitignore		.gitignore
README.md		README.md
catalog-info.yaml		catalog-info.yaml
docker-compose.yml		docker-compose.yml
mkdocs.yml		mkdocs.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ReliabilityOS

What This Project Demonstrates

Tech Stack

Quick Start

Prerequisites

Deploy Infrastructure

Configure Cluster Access

Create Required Secrets

Verify Deployment

Access Services

Documentation

Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ReliabilityOS

What This Project Demonstrates

Tech Stack

Quick Start

Prerequisites

Deploy Infrastructure

Configure Cluster Access

Create Required Secrets

Verify Deployment

Access Services

Documentation

Project Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages