At Facthory, we only build systems that are worth running in production. That means resilient, observable, deterministic, scalable, and brutally cost-efficient. We operate close to real manufacturing floors where noise, latency, and unpredictable environments are the norm. Our architecture must reflect reality—not fantasy.
We aim to build applications that are:
- resilient under load, failure, and bad networks
- maintainable by real engineers, not archaeologists
- deterministic in behavior so debugging is predictable
- designed for scale before scale arrives
- cost-efficient at every layer (storage, GPUs, compute, network)
- compliant by default (GDPR, EU AI Act, auditability)
There is no room for fragile software in an industrial context. We engineer like everything is mission-critical—because in many factories, it is.
We prefer loosely coupled, event-driven microservices. Each service owns its data, its lifecycle, and its failure modes. No shared state. No hidden dependencies. No mystery glue code.
This philosophy draws from the foundational microservices patterns documented on Microservices.io and in the work of Martin Fowler and Sam Newman: autonomous services, bounded contexts, and domain-driven design. We've also learned from real-world implementations at scale, including Uber's engineering blog, which details lessons from running 1000+ microservices in production, and Netflix's tech blog, which demonstrates the resilience patterns, circuit breakers, and async systems that survive at global scale.
We design autonomous services that:
- can deploy independently
- survive dependency failures
- remain responsive under partial degradations
- communicate through async events—not blocking chains
- reflect the business domain (video processing, embedding, retrieval, etc.)
A tightly coupled system is a time bomb. A loosely coupled one is an organism.
Factories are chaotic. Uploads spike. Transcription may take seconds or minutes. Model inference may stall. Synchronous chains collapse under real-world conditions.
We avoid that by defaulting to event-driven workflows. Our event broker is NATS, a cloud-native messaging system that provides always-on messaging with interest-based propagation, persistence with streaming replays, and flexible authentication. NATS is a CNCF incubating project, designed for edge and cloud-native environments—exactly the kind of distributed, resilient infrastructure that industrial systems require.
NATS gives us the foundation for building truly asynchronous systems. Unlike traditional message brokers, NATS provides simple, secure, and performant communications that scale from edge devices to cloud clusters. We leverage NATS for pub/sub messaging, request/reply patterns, and streaming where we need persistence and replay capabilities.
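As a rough illustration of what this looks like in code, here is a minimal sketch using the nats-py client with JetStream for the persistent, replayable case; the connection URL, stream, subject, and durable names are made up for the example.

```python
import asyncio
import nats  # nats-py client

async def main():
    nc = await nats.connect("nats://nats.internal:4222")
    js = nc.jetstream()

    # Persistent stream so consumers can replay events after downtime.
    await js.add_stream(name="VIDEO_EVENTS", subjects=["video.>"])

    # Publisher: a service emits an event and moves on; nothing blocks on the consumer.
    await js.publish("video.uploaded", b'{"video_id": "v-123", "plant": "plant-7"}')

    # Consumer: a durable subscription survives restarts and resumes where it left off.
    async def on_uploaded(msg):
        print("processing", msg.data)
        await msg.ack()  # explicit ack so failed work is redelivered

    await js.subscribe("video.uploaded", durable="transcriber", cb=on_uploaded)
    await asyncio.sleep(1)
    await nc.drain()

asyncio.run(main())
```

Because the event sits in the stream, a consumer that comes online later still receives it, which is exactly the replay behavior we rely on when workers restart.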
We use asynchronous pipelines because:
- sync calls block threads and create cascading failures
- retries, DLQs, and event logs make failures recoverable
- load can be smoothed out naturally
- workflows become observable end-to-end
- partial failures don't freeze the system
The patterns we follow are informed by the Google SRE Book's distributed systems patterns, which document reliability and failure patterns that microservices must follow. These patterns—retries with exponential backoff, circuit breakers, graceful degradation—are not optional in production systems.
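To make the retry pattern concrete, here is a minimal sketch of retries with exponential backoff and jitter; the defaults are illustrative, not prescribed values.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with exponential backoff and full jitter.

    `operation` is any zero-argument callable that raises on failure,
    e.g. a transcription or embedding request to a downstream service.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # let the caller (or the DLQ) handle a persistent failure
            # Exponential backoff capped at max_delay, with jitter so many
            # workers don't retry in lockstep and create a thundering herd.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```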
Async is not an optimization. It's survival.
When a dependency fails, we do not break. We degrade gracefully and keep delivering value. This principle is central to building resilient systems, as detailed in the Google SRE Book and demonstrated in Netflix's production systems.
For example:
- If transcription fails → return audio + request reprocess
- If translation fails → keep source language + fallback
- If embedding fails → store metadata and retry later
- If retrieval confidence is low → warn, don't hallucinate
- If a model is slow → serve cached guidance
A degraded output is better than a frozen system. Factories can't wait.
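A minimal sketch of the degradation pattern, assuming a retrieval layer and a guidance cache exist behind the `retrieve` and `cached_guidance` callables (both hypothetical names):

```python
import logging

logger = logging.getLogger("retrieval")

CONFIDENCE_THRESHOLD = 0.35  # illustrative cutoff, tuned per deployment

def answer_with_degradation(query, retrieve, cached_guidance):
    """Return the best answer we can, degrading instead of failing."""
    try:
        hits = retrieve(query)
    except TimeoutError:
        logger.warning("retrieval timed out, serving cached guidance")
        return {"answer": cached_guidance(query), "degraded": True}

    if not hits or max(h["score"] for h in hits) < CONFIDENCE_THRESHOLD:
        # Low confidence: warn instead of hallucinating.
        return {"answer": None, "degraded": True,
                "warning": "No confident match found; please rephrase or escalate."}

    return {"answer": hits[0]["text"], "degraded": False, "evidence": hits}
```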
We deliberately choose technologies that age well. This means avoiding proprietary protocols, vendor lock-in, and overly clever abstractions that become technical debt.
Instead of clever coupling, we rely on:
- DNS for service discovery
- HTTP/JSON for universal compatibility
- simple contracts that don't break when teams iterate
- clean, explicit boundaries
Factories don't need fancy. They need reliable.
We prefer REST-based APIs with JSON payloads. Systems built in the REST style keep client and server implementations loosely coupled, with contracts flexible enough that either side can change without breaking the other. That makes it easier to build interoperating distributed systems that different teams can evolve in parallel while everything keeps working.
REST-like APIs with JSON payloads are the most widely accepted and used service interfacing style in the web service industry. They're simple, debuggable, and work everywhere.
APIs are our public contract. They must be designed before code—reviewed by peers, validated against real use cases, and treated as long-lived assets.
Our API culture is informed by industry-leading practices. We follow the Microsoft REST API Guidelines, which are used internally across Azure and Microsoft services, providing battle-tested patterns for versioning, error handling, and resource design. We study Stripe's API Design Principles, widely considered "the best designed APIs on the internet," to understand how to build APIs that developers love to use.
We design APIs using the OpenAPI (Swagger) Specification as our standard. API-first means designing the API contract before writing backend code. This approach, detailed in Red Hat's "API First" guide, shows how to structure engineering organizations around API contracts. We also reference the Google API Product Management Guide for defining versioning strategies, SLAs, adoption metrics, and product thinking around APIs.
Our API culture includes:
- OpenAPI specs defined before implementation
- early peer reviews
- stable backward-compatible evolution
- documentation with examples and behavior notes
- predictable patterns across services
APIs must be boring in the best way: obvious, consistent, and trustworthy.
Our APIs should obey Postel's Law, also known as the Robustness Principle: be conservative in what you send, be liberal in what you accept. APIs must be able to evolve without breaking existing consumers.
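One way to express that tolerance in code, assuming a Python service using Pydantic models (our actual stack may differ), is to ignore unknown inbound fields while keeping outbound payloads strictly to documented fields:

```python
from pydantic import BaseModel, ConfigDict

class IssueReport(BaseModel):
    """Inbound payload for a hypothetical issue-reporting endpoint.

    Liberal in what we accept: unknown fields from newer clients are ignored
    rather than rejected, so either side can evolve independently.
    """
    model_config = ConfigDict(extra="ignore")

    machine_id: str
    description: str
    severity: str = "medium"   # sensible default keeps older clients working

class IssueCreated(BaseModel):
    """Outbound payload: conservative in what we send, only documented fields."""
    issue_id: str
    machine_id: str
    severity: str
```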
We size services around business capabilities, not arbitrary functions. This follows the domain-driven design principles that underpin successful microservice architectures.
Examples:
- Video Processing
- Document Intelligence
- Embedding Engine
- Retrieval Engine
- Learning Path Service
- Issue Management
- Vector Index Service
- Policy Evaluation
A service should be big enough to offer a valid business capability, but small enough to be handled by a team that can be fed by two pizzas (Amazon's Two-Pizza Team rule), roughly two to twelve people. In practice, a Two-Pizza Team may be able to own and run a large number of small services, or a smaller number of larger services.
Services must be small enough to understand, large enough to matter.
A service must be able to stand alone. That means:
- owning its own data
- isolating its failures
- deploying independently
- avoiding shared libraries with hidden logic
- starting up whether dependencies are healthy or not
- never leaking domain concepts into another service's code
A service should run in its own process and be independently deployable. It should start up and be resilient when its dependencies are not available. It should not share its data storage or code repository with any other service, so that changes do not affect other systems. It should not share libraries with other services, unless those libraries are open-source or inner-source and are actively maintained by a community. Shared dependencies may lead to large-scale complexity over time.
A service should not provide a client library containing business logic. The core API and its data model are expressed as REST and JSON.
Autonomy protects velocity. Dependency protects nothing.
We build applications following the 12-Factor App methodology, the canonical reference for building modern, scalable applications. The 12 factors provide a methodology for building software-as-a-service apps that are portable, scalable, and maintainable.
We apply these principles with modern containerization in mind, following the Cloud Native Computing Foundation's guidance on 12-Factor for Containers, which shows how to apply 12-factor principles to Kubernetes environments. We also reference DigitalOcean's guide to modernizing to 12-Factor for practical examples and clear breakdowns.
The 12 factors guide our approach to:
- codebase management (one codebase, many deploys)
- dependencies (explicitly declare and isolate)
- configuration (store in environment, not code)
- backing services (treat as attached resources)
- build, release, run (strictly separate stages)
- processes (execute as stateless processes)
- port binding (export services via port binding)
- concurrency (scale out via the process model)
- disposability (maximize robustness with fast startup and graceful shutdown)
- dev/prod parity (keep development, staging, and production as similar as possible)
- logs (treat logs as event streams)
- admin processes (run admin/management tasks as one-off processes)
These principles ensure our applications are cloud-native, container-ready, and production-hardened from day one.
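As a small sketch of factor III (configuration in the environment), a service might load everything environment-specific like this; the variable names are illustrative:

```python
import os

class Settings:
    """Factor III: everything environment-specific comes from env vars.

    The container image is identical across dev, staging, and production;
    only the environment it runs in differs.
    """
    def __init__(self):
        self.database_url = os.environ["DATABASE_URL"]   # required: fail fast if missing
        self.nats_url = os.environ.get("NATS_URL", "nats://localhost:4222")
        self.blob_container = os.environ.get("BLOB_CONTAINER", "raw-uploads")
        self.log_level = os.environ.get("LOG_LEVEL", "INFO")

settings = Settings()
```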
This section is where Facthory is fundamentally different from classic SaaS.
We operate on heavy, multimodal data (video, frames, documents, audio). Handling this data well is not an implementation detail—it's our competitive edge.
We minimize data movement:
- direct-to-blob uploads
- no routing video through backend
- processing close to storage
- region-aware compute scheduling
Data that moves slowly creates products that feel slow.
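A sketch of the direct-to-blob pattern using the Azure SDK for Python, assuming Azure Blob Storage with Managed Identity; the account and container names are placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobSasPermissions, BlobServiceClient, generate_blob_sas

ACCOUNT = "facthoryvideos"  # illustrative storage account name

def direct_upload_url(blob_name: str) -> str:
    """Issue a short-lived, write-only SAS URL so clients upload straight to blob storage.

    The video never transits our backend; we only hand out the capability to write it.
    A user delegation key means there is no shared account key to manage.
    """
    now = datetime.now(timezone.utc)
    service = BlobServiceClient(
        f"https://{ACCOUNT}.blob.core.windows.net",
        credential=DefaultAzureCredential(),  # Managed Identity in AKS, keyless
    )
    delegation_key = service.get_user_delegation_key(now, now + timedelta(hours=1))
    sas = generate_blob_sas(
        account_name=ACCOUNT,
        container_name="raw-uploads",
        blob_name=blob_name,
        user_delegation_key=delegation_key,
        permission=BlobSasPermissions(create=True, write=True),
        expiry=now + timedelta(minutes=15),
    )
    return f"https://{ACCOUNT}.blob.core.windows.net/raw-uploads/{blob_name}?{sas}"
```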
We don't treat metadata as decoration. It is the backbone of retrieval quality and explainability.
We maintain metadata for:
- provenance
- timestamps
- machine ID
- operator ID (RBAC-safe)
- validation status
- frame-level evidence
Better metadata = better answers.
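A sketch of what that metadata might look like as a typed structure; the field names illustrate the categories above rather than our exact schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # frozen: metadata records are immutable once written
class ChunkMetadata:
    chunk_id: str
    source_uri: str                   # provenance: where the raw material came from
    created_at: datetime              # timestamps for recency and audit
    machine_id: str                   # which machine this knowledge is about
    operator_id: str | None           # RBAC-safe reference, never raw personal data
    validation_status: str            # e.g. "pending", "expert_validated"
    frame_refs: tuple[str, ...] = ()  # frame-level evidence for video-derived chunks
```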
We enforce deterministic output for:
- chunking
- speech-to-text
- document parsing
- embedding generation
Same input → same chunks → same retrieval context.
Determinism reduces hallucination risk and makes debugging sane.
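A minimal sketch of deterministic chunking with content-addressed IDs; real chunking is semantic and modality-aware, but the invariant is the same:

```python
import hashlib

def chunk_text(text: str, max_chars: int = 800) -> list[dict]:
    """Deterministic chunking sketch: the same input always yields the same chunks.

    Boundaries and IDs are pure functions of the input, never of run order,
    randomness, or wall-clock time.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer = [], ""
    for para in paragraphs:
        if buffer and len(buffer) + len(para) > max_chars:
            chunks.append(buffer)
            buffer = para
        else:
            buffer = f"{buffer}\n\n{para}" if buffer else para
    if buffer:
        chunks.append(buffer)

    # Content-addressed IDs: identical content re-ingested later maps to the same ID,
    # which keeps retrieval context reproducible and debugging sane.
    return [
        {"chunk_id": hashlib.sha256(c.encode("utf-8")).hexdigest()[:16], "text": c}
        for c in chunks
    ]
```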
Nothing is overwritten. Everything is versioned.
We ingest raw knowledge → transform → enrich → index.
The lineage is immutable.
This ensures:
- auditability
- explainability
- compliance
- reproducibility
Factories can't operate on "mystery outputs."
While others play with "chatbots," we treat RAG/KAG as knowledge architecture.
Chunk size, boundaries, and modality matter. We engineer chunking based on:
- semantics
- modality
- task type
- retrieval target
- hallucination risk
Bad chunking = bad product.
Vector similarity alone is not enough. We incorporate:
- metadata filters
- recency bias
- expert validation boosting
- machine context
- source type weighting
RAG is search → ranking → reasoning.
Not "shove text into an LLM."
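A sketch of what the ranking stage can look like once vector similarity is combined with the signals above; the weights, field names, and source priors are illustrative:

```python
from datetime import datetime, timezone

# Illustrative weights; in practice these are tuned per task type and retrieval target.
WEIGHTS = {"similarity": 0.6, "recency": 0.15, "validated": 0.15, "source": 0.1}
SOURCE_PRIORS = {"expert_note": 1.0, "manual": 0.8, "video_transcript": 0.6}

def rank(candidates, machine_id, now=None):
    """Re-rank vector-search candidates with metadata signals.

    Each candidate is assumed to carry `similarity` and `metadata` (machine_id,
    validation_status, source_type, timezone-aware created_at) from the retrieval layer.
    """
    now = now or datetime.now(timezone.utc)

    def score(c):
        meta = c["metadata"]
        if meta["machine_id"] != machine_id:      # hard metadata filter: machine context
            return float("-inf")
        age_days = (now - meta["created_at"]).days
        recency = 1.0 / (1.0 + age_days / 180.0)  # gentle recency bias
        validated = 1.0 if meta["validation_status"] == "expert_validated" else 0.0
        source = SOURCE_PRIORS.get(meta["source_type"], 0.5)
        return (WEIGHTS["similarity"] * c["similarity"]
                + WEIGHTS["recency"] * recency
                + WEIGHTS["validated"] * validated
                + WEIGHTS["source"] * source)

    return sorted(candidates, key=score, reverse=True)
```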
Long-term, we build towards graph-enhanced reasoning:
- relationships between machines
- recurring failure modes
- parts shared across machines
- shift-level patterns
- domain constraints
This is where Facthory becomes the "brain of the shop floor."
We don't jump on hype. Agentic systems are powerful but dangerous if misapplied.
Agents are useful for:
- orchestrating multi-step workflows
- validating outcomes
- generating structured summaries
- coordinating humans + services
Agents are not permitted to:
- autonomously call production systems
- alter knowledge without human validation
- emit instructions with uncertain accuracy
- rewrite databases
- run unbounded loops
We wrap all agents in:
- strict policy envelopes
- sandboxed toolsets
- determinism guards
- audit logging
- diff validation
AI does not get to "improvise" on real shop floors.
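A minimal sketch of a policy envelope around agent tool calls; the tool names, step budget, and logging format are illustrative:

```python
import json
import logging
import time

logger = logging.getLogger("agent.audit")

# Explicit allow-list: the agent can only call tools we registered, with bounded steps.
ALLOWED_TOOLS = {"summarize_issue", "draft_learning_path"}  # illustrative tool names
MAX_STEPS = 8

def run_agent_step(step_index, tool_name, arguments, tools):
    """Execute one agent-proposed tool call inside a policy envelope.

    `tools` maps tool names to plain Python callables we wrote and sandboxed;
    the agent never gets direct access to databases or production systems.
    """
    if step_index >= MAX_STEPS:
        raise RuntimeError("agent exceeded step budget; aborting instead of looping")
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{tool_name}' is not in the policy envelope")

    result = tools[tool_name](**arguments)

    # Every call is audit-logged with inputs and a result preview so behavior is reviewable.
    logger.info(json.dumps({
        "ts": time.time(), "step": step_index, "tool": tool_name,
        "arguments": arguments, "result_preview": str(result)[:200],
    }))
    return result
```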
Industrial environments demand trust.
We enforce:
- TLS everywhere (HTTPS, not HTTP)
- keyless workloads via Managed Identity
- RBAC + ABAC access control
- encrypted blobs + encrypted Postgres
- audit logs for every interaction
- prompt-level redaction
- content validation
- adherence to GDPR and early EU AI Act alignment
Always use TLS, and make sure the caller of your service is authenticated and authorized. In practice, TLS means HTTPS everywhere, never plain HTTP.
Security is not a feature. It's a foundation.
We engineer predictable failure. This draws from the Google SRE Book's approach to reliability engineering and Netflix's production-tested resilience patterns.
We implement:
- retries with exponential backoff
- idempotent processors
- circuit breakers
- DLQ inspection pipelines
- rollback-first mentality
- blameless (but brutally honest) post-mortems
Whenever possible and reasonable, we make service endpoints idempotent, so that an operation produces the same result even when it's executed multiple times. This allows clients to safely retry operations in case of timeouts due to service processing or network failures.
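A sketch of an idempotent processor; `db` and `handle` are stand-ins for our persistence layer and business logic:

```python
def process_event(event, db):
    """Idempotent event processor: replaying the same event is a no-op.

    The essential parts are a stable event ID, the dedup check before doing work,
    and recording the result and the ID together.
    """
    event_id = event["id"]  # producers assign a stable ID per logical operation

    if db.exists("processed_events", event_id):
        return "duplicate-ignored"   # safe to ack: the work was already done

    result = handle(event)           # the actual business logic

    # Record the result and the event ID in one transaction so a crash between
    # the two writes cannot leave us half-done.
    with db.transaction():
        db.insert("results", result)
        db.insert("processed_events", {"id": event_id})

    return "processed"
```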
If it breaks twice, the engineering team failed.
When possible, be stateless. If you can't, persist state outside the address space of the application, for example in a database or through NATS streaming for event-sourced state.
Strive for immutability whenever possible. An object is immutable if its state cannot be modified. Immutable things are automatically thread-safe, without requiring synchronization. Overall, immutability tends to result in fewer bugs and makes it easier to prove a program correct.
This is where our engineering philosophy comes through strongest.
We build, test, break, learn, and rebuild.
We don't hide from complexity—we dismantle it.
Failure is part of engineering. Avoiding failure is cowardice.
We fail fast, recover decisively, and document so it never happens twice.
Silence kills teams. Collaboration builds momentum.
No useless meetings. No PowerPoints.
Documentation and execution beat theater.
If it matters, it must be written.
No tribal knowledge. No hero bottlenecks.
Document the architecture of your APIs and applications. Make it clear, concise, and current. Use inline documentation for more complex code fragments.
We move fast, and we clean up after ourselves.
Speed does not justify sloppiness.
Don't wait until you're done to ask for code review: It's the best way to catch defects early. Create a pull request at the start of your work, not at the end. This pulls people into an ongoing conversation about your code, from Day One.
Code review is expensive in some ways, so get the most out of it. Reviewing code is a great way to learn about style, get help with idioms, and grow as a programmer and reviewer.
Code review can be hard when the culture around it isn't supportive and constructive. It takes practice to learn how to accept code reviews without getting defensive, and to review code without focusing on trivial things. Don't bike shed.
Peer review gets easier when you have a good attitude about it. Everybody around you is smart, and you are smart. We're all smart in different ways.
Depending on the team and its codebases, it might be required that at least one person reviews code before it goes live. This is especially true for systems that touch customer or financial data. In general, though, we don't want to focus on when code review is or isn't required: the system works best when people decide on their own that code review is valuable, and seek it out.
Architectural decisions should be made as a team, and the team should ask for help if it's unsure. Embrace open discussions and alternate opinions.
Quality is related to mindset, and it's part of engineering. Systems that support industrial manufacturing must be engineered for high quality. Usually this means:
- writing unit tests early on
- mocking external systems so you can test against them while they're not running, and so you can simulate failure scenarios in both the dependency itself and the network between you and it
- striving for automation
Automate testing whenever possible. It's not always possible, but life is almost always better if you invest in automated tests of your code. See Martin Fowler's Testing Strategies in a Microservice Architecture.
We're not going to require you to test your code, but expect your peers to challenge you if you don't. For the most part, a dedicated QA team is a thing of the past. You and your team are responsible for your code's behavior: There's no other safety net.
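As an example of the mocking point above, here is a small, self-contained pytest sketch that simulates a flaky network around a hypothetical transcription wrapper:

```python
from unittest import mock

import pytest

def transcribe_with_retry(client, path, max_attempts=3):
    """Minimal stand-in for a real wrapper around our transcription service."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.transcribe(path)
        except ConnectionError:
            if attempt == max_attempts:
                raise

def test_transcription_survives_one_network_failure():
    fake_client = mock.Mock()
    # First call fails like a flaky network, second call succeeds.
    fake_client.transcribe.side_effect = [
        ConnectionError("reset by peer"),
        {"text": "replace the belt tensioner"},
    ]
    result = transcribe_with_retry(fake_client, "clip.mp4")
    assert result == {"text": "replace the belt tensioner"}
    assert fake_client.transcribe.call_count == 2

def test_transcription_fails_loudly_when_dependency_stays_down():
    fake_client = mock.Mock()
    fake_client.transcribe.side_effect = ConnectionError("service unavailable")
    with pytest.raises(ConnectionError):
        transcribe_with_retry(fake_client, "clip.mp4", max_attempts=3)
```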
Years ago, we didn't build systems this way. Now we must. Fortunately, the tooling is pretty amazing.
Strive for very short release cycles, optimally deploying daily; automating the delivery pipeline makes this possible. Small releases tend to have fewer bugs. Use canary testing for your new deployments to identify problems early.
We use GitHub as our SCM for checking in code. You might want to use local git hooks to verify that commit messages reference the relevant specifications, or to run other pre-commit checks.
We deploy to AKS (Azure Kubernetes Service) with:
- Flux GitOps for declarative infrastructure
- container signing for supply chain security
- progressive rollouts with canary deployments
- isolated namespaces for service boundaries
- region-aware clusters for data locality
We favor containerized application development. Containers provide the isolation, portability, and consistency that 12-Factor applications require. Kubernetes gives us the orchestration layer to manage these containers at scale.
The system must run globally, but operate locally.
We require:
- structured logs (treat logs as event streams, per 12-Factor)
- distributed tracing across service boundaries
- metrics with SLOs (Service Level Objectives)
- model drift monitoring for AI/ML systems
- cost visibility at every layer
We use Prometheus for metrics collection and alerting, and Grafana for visualization and dashboards. Prometheus provides the time-series database and query language we need to track system health, performance, and business metrics. Grafana gives us the visualization layer to turn those metrics into actionable insights.
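A minimal sketch of instrumenting a service with prometheus_client; the metric names and port are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Labels are kept low-cardinality on purpose.
REQUESTS = Counter("retrieval_requests_total", "Retrieval requests", ["status"])
LATENCY = Histogram("retrieval_latency_seconds", "End-to-end retrieval latency")

@LATENCY.time()
def handle_query(query: str) -> str:
    # Placeholder for the real retrieval path.
    time.sleep(random.uniform(0.01, 0.05))
    return f"answer for {query}"

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    while True:
        try:
            handle_query("why does station 4 jam?")
            REQUESTS.labels(status="ok").inc()
        except Exception:
            REQUESTS.labels(status="error").inc()
        time.sleep(1)
```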
If we can't see the system clearly, we can't trust it.
Build your services so that it's possible to offer them as a SaaS solution to third parties. In fact, consider any other system a third party with regard to API structure, resilience, and service level. This is easier to do than it was a few years ago: cloud platforms push us this way, the Internet model scales, and our security model is geared toward allowing our services to be on the open Internet.
We want to offer services in ways we never imagined or expected. This is part of being a platform. In some cases, this means being multi-tenant from the start.
At the end of the day, we do this because engineering is joy.
Solving hard, real industrial problems with elegant systems is meaningful work.
Building the brain of the factory is not hype. It's our craft.
We engineer because we love it.
We ship because factories need it.
We write principles because good teams deserve clarity.
Building software systems can produce substantial existential pleasure. When the conditions are just right, programming is a reliable path to Flow: a state almost beyond pleasure. We want to get there, and stay there, and we want you to join us there. We hope these principles help.
- Microsoft REST API Guidelines - Used internally across Azure and Microsoft services
- Stripe API Design Principles - Widely considered "the best designed APIs on the internet"
- OpenAPI (Swagger) Specification - Official specification for API-first design
- "API First" by Red Hat - How to structure engineering orgs around API contracts
- Google — API Product Management Guide - Defining versioning, SLAs, adoption, monetization
- Microservices.io (Chris Richardson) - The foundational patterns for modern microservice design
- Google SRE Book — Distributed Systems Patterns - The reliability and failure patterns microservices must follow
- Uber Engineering Blog — Microservice Migration & Lessons - Real cases from 1000+ microservices in production
- Netflix Tech Blog — Microservices & Resilience Patterns - Circuit breakers, service meshes, async systems
- The Original 12-Factor App Manifesto (Heroku) - The canonical reference
- DigitalOcean — Modernizing to 12-Factor - Clear breakdown with practical examples
- Cloud Native Computing Foundation (CNCF): 12-Factor for Containers - How to apply 12-factor to Kubernetes
- NATS - The edge & cloud native messaging system we use for asynchronous communication
- Prometheus - Time-series database and monitoring system for metrics collection and alerting
- Grafana - Open-source analytics and visualization platform for metrics, logs, and traces
- Architectural Styles and the Design of Network-based Software Architectures - Roy Fielding's dissertation on REST
- Postel's Law (Robustness Principle) - Be conservative in what you send, be liberal in what you accept
- Idempotence - Making operations safe to retry
- Testing Strategies in a Microservice Architecture - Martin Fowler's guide to testing microservices
- Flow (Psychology) - The state of optimal experience
We have published these guidelines under the Bandpey license.