Observability Platform Demo

An end-to-end observability demo built around a Python FastAPI service and an OpenTelemetry pipeline.

This repository is designed to be easy to run locally, easy to explain in interviews, and concrete enough to show backend + SRE thinking in one project.

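Summary: this is a small observability platform that runs locally. It shows how I instrumented a Python FastAPI service with OpenTelemetry, routed traces, metrics, and logs through a single OTEL Collector, and connected Grafana, Prometheus, Tempo, and Loki for monitoring and troubleshooting.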

This project demonstrates:

a containerized backend service with FastAPI
distributed tracing with OpenTelemetry and Tempo
structured JSON logging with Loki
metrics collection with Prometheus
centralized OTLP ingestion through the OpenTelemetry Collector
Grafana provisioning for datasources and dashboards

Architecture

FastAPI app
  -> OTLP traces  -> OTEL Collector -> Tempo      -> Grafana
  -> OTLP metrics -> OTEL Collector -> Prometheus -> Grafana
  -> OTLP logs    -> OTEL Collector -> Loki       -> Grafana

Screenshots

Service overview dashboard

Tempo trace exploration

Resume-Friendly Summary

Built a containerized observability demo platform around a FastAPI service and OpenTelemetry instrumentation.
Implemented request tracing, structured JSON logging, and custom metrics with a unified OTLP pipeline through the OpenTelemetry Collector.
Provisioned Grafana, Prometheus, Tempo, and Loki automatically with Docker Compose so the full stack can be started locally with one command.
Added trace-to-log correlation through trace_id, making it possible to pivot from a failing request trace to the exact related logs.

Tech Stack

FastAPI
OpenTelemetry
OpenTelemetry Collector
Prometheus
Grafana
Tempo
Loki
Docker Compose
micromamba for local development

What the App Does

The backend exposes two endpoints:

GET /ok
- lightweight health-style endpoint
- returns 200
- includes trace_id in both the JSON body and the X-Trace-Id response header
GET /slow
- simulates random latency, bounded by the min_ms and max_ms query parameters
- supports a configurable failure probability through the fail_rate query parameter
- creates custom spans named fake-db and external-call
- returns trace_id in both the JSON body and the X-Trace-Id response header
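
The /slow behaviour described above can be sketched without the web framework. This is a minimal illustration, not the repository's actual handler: `plan_slow_request`, the default parameter values, the 500 status on failure, and the `status` field in the body are all assumptions. Only the query parameter names, the two span names, and the X-Trace-Id header come from this README.

```python
import random


def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as 32 lowercase hex characters."""
    return format(trace_id, "032x")


def plan_slow_request(trace_id, min_ms=100, max_ms=900, fail_rate=0.0, rng=None):
    """Decide latency and outcome for one simulated /slow request.

    Hypothetical helper: it only computes what the handler would do and
    does not sleep or create real spans.
    """
    rng = rng or random.Random()
    latency_ms = rng.uniform(min_ms, max_ms)   # simulated random latency
    failed = rng.random() < fail_rate          # fail with the given probability
    tid = format_trace_id(trace_id)
    return {
        "status": 500 if failed else 200,      # assumed error status
        "latency_ms": latency_ms,
        "spans": ["fake-db", "external-call"],  # custom spans named in the README
        "headers": {"X-Trace-Id": tid},
        "body": {"trace_id": tid, "status": "error" if failed else "ok"},
    }
```

In a real handler the two span names would wrap the simulated work, and the trace ID would come from the active span context rather than a function argument.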

The application also emits:

JSON logs with trace_id and span_id
requests_total{route,method,status}
request_duration_seconds_bucket{route,method,status,le}
inflight_requests
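
One way to produce JSON logs that carry trace_id and span_id is a custom formatter on the standard logging module. The sketch below is an assumption about how this could be done, not the app's actual logging setup: `JsonTraceFormatter`, `build_logger`, and the field names other than trace_id and span_id are hypothetical.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Format each record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set per call through `extra=`; None when the caller omits them.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "demo-api") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonTraceFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A call such as `build_logger().info("request finished", extra={"trace_id": tid, "span_id": sid})` would then emit a line that Loki's `json` parser can filter by trace_id.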

Repository Structure

observability-platform/
  app/                 # FastAPI service, Dockerfile, Python dependencies
  otel-collector/      # OTEL Collector pipeline config
  prometheus/          # Prometheus scrape config
  grafana/             # Grafana provisioning and dashboard JSON
  tempo/               # Tempo trace backend config
  loki/                # Loki log backend config
  docker-compose.yml   # one-command local stack startup
  environment.yml      # micromamba development environment
  .env.example         # app and Grafana runtime settings

Quick Start

Option 1: Run the full stack with Docker Compose

docker compose up --build

Default local endpoints:

App: http://localhost:8000
Grafana: http://localhost:3000
Prometheus: http://localhost:9090
Loki: http://localhost:3100
Tempo: http://localhost:3200
Collector metrics exporter: http://localhost:9464/metrics

Grafana credentials:

username: admin
password: admin

Option 2: Create a local micromamba environment

micromamba env create -f environment.yml
micromamba activate obs-platform

If you want to run the FastAPI app on your host while the rest of the stack stays in Docker:

export OTEL_EXPORTER_OTLP_BASE_ENDPOINT=http://localhost:4318
python app/main.py

Verify the Demo

Basic requests

curl -s http://localhost:8000/ok | jq .
curl -s "http://localhost:8000/slow" | jq .
curl -s "http://localhost:8000/slow?min_ms=200&max_ms=1200&fail_rate=0.35" | jq .
curl -i "http://localhost:8000/slow?fail_rate=1"

Generate load

seq 200 | xargs -I{} -P 20 curl -s "http://localhost:8000/slow?min_ms=100&max_ms=900&fail_rate=0.2" >/dev/null
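
If xargs is not available, roughly the same load can be generated from Python with the standard library. This is a sketch under assumptions: `fetch_status` and `generate_load` are hypothetical helpers, and the totals mirror the shell command above (200 requests, 20 in parallel).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request

URL = "http://localhost:8000/slow?min_ms=100&max_ms=900&fail_rate=0.2"


def fetch_status(url: str) -> int:
    """Return the HTTP status code for one GET request."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return exc.code


def generate_load(url=URL, total=200, concurrency=20, fetch=fetch_status):
    """Send `total` requests with `concurrency` worker threads.

    Returns a Counter mapping status code to the number of responses.
    Connection errors (for example, the app not running) propagate.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, [url] * total))
```

Calling `generate_load()` while the stack is running returns the status code distribution, which gives a quick check that roughly the configured share of requests failed.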

Explore in Grafana

Open Grafana and go to:

Dashboards
Observability Demo / Service Overview

The dashboard includes:

request rate
error rate
P95 latency
inflight requests
requests by route and status
recent request logs

View P95 latency

The dashboard uses:

histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[$__rate_interval])))
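
To make the query less opaque, here is a simplified Python sketch of the linear interpolation that histogram_quantile applies to cumulative buckets. It assumes positive bucket bounds and classic (non-native) histograms, and the bucket values in the usage note are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and must
    include the +Inf bucket. Simplified: assumes 0 <= q <= 1 and that all
    finite upper bounds are positive.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The quantile falls into +Inf: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == prev:
        return upper  # empty bucket with rank 0: avoid dividing by zero
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]`, the 0.95 quantile lands in the 0.5 to 1.0 bucket and interpolates to 0.8125 seconds. The result is only as precise as the bucket boundaries, which is why bucket layout matters for latency histograms.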

View traces

Open Explore
Select the Tempo datasource
Query recent traces
Open a /slow trace to inspect:
- the FastAPI HTTP server span
- the fake-db custom span
- the external-call custom span

Jump from a trace to related logs

Use the trace_id from the response header or from a Tempo trace and query Loki with:

{service_name="demo-api"} | json | trace_id="PUT_TRACE_ID_HERE"

Grafana is provisioned so Tempo can link directly to Loki logs for the same trace.
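
The same lookup can be scripted against Loki's HTTP API. The sketch below only builds the LogQL string and a query_range URL; `logql_for_trace` and `loki_query_url` are hypothetical helpers, and the base URL and service name are the local defaults listed in this README.

```python
import re
from urllib.parse import urlencode

# Trace IDs are rendered as 32 lowercase hex characters.
TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def logql_for_trace(trace_id: str, service_name: str = "demo-api") -> str:
    """Build the LogQL query that selects logs for one trace."""
    if not TRACE_ID_RE.fullmatch(trace_id):
        raise ValueError(f"not a 32-character hex trace id: {trace_id!r}")
    return f'{{service_name="{service_name}"}} | json | trace_id="{trace_id}"'


def loki_query_url(trace_id: str, base_url: str = "http://localhost:3100",
                   limit: int = 100) -> str:
    """Return a Loki query_range URL for the trace's logs."""
    params = urlencode({"query": logql_for_trace(trace_id), "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"
```

Validating the ID before interpolating it keeps malformed input out of the query string, which matters if the value is copied from a response header by a script.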

Key Design Choices

Why P95 instead of average latency

Average latency hides tail behavior. A service can report a healthy average while a meaningful share of requests is still very slow. P95 is the latency that 95% of requests stay at or below, so it surfaces that slow tail directly.
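
A small worked example makes the gap concrete. The latency values are made up for illustration, and `nearest_rank_percentile` is a hypothetical helper that uses the nearest-rank definition on raw samples rather than Prometheus bucket interpolation.

```python
import math
import statistics


def nearest_rank_percentile(values, pct):
    """Return the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# 94 fast requests at 40 ms and 6 slow requests at 1500 ms (illustrative data).
latencies_ms = [40] * 94 + [1500] * 6

mean_ms = statistics.mean(latencies_ms)               # 127.6
p50_ms = nearest_rank_percentile(latencies_ms, 50)    # 40
p95_ms = nearest_rank_percentile(latencies_ms, 95)    # 1500
```

The mean of 127.6 ms looks acceptable, while P95 shows that at least one request in twenty takes 1.5 seconds.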

Why traces default to 100% sampling

This repository is a local demo and interview project, so the default is optimized for visibility and learning. In production, you would usually reduce the sampling ratio based on traffic volume and cost.

The sampling ratio is controlled by:

TRACE_SAMPLE_RATIO=1.0
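
For intuition, this is a sketch of how a trace-ID ratio sampler can turn that setting into a per-trace decision. It follows the approach used by ratio-based samplers such as the OpenTelemetry SDK's TraceIdRatioBased, but whether the app wires the variable to that sampler is an assumption, and `read_sample_ratio` and `should_sample` are hypothetical names.

```python
import os

# Only the low 64 bits of the trace ID take part in the decision.
TRACE_ID_LIMIT = (1 << 64) - 1


def read_sample_ratio(env=os.environ) -> float:
    """Read TRACE_SAMPLE_RATIO, defaulting to 1.0 (sample everything)."""
    ratio = float(env.get("TRACE_SAMPLE_RATIO", "1.0"))
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("TRACE_SAMPLE_RATIO must be between 0.0 and 1.0")
    return ratio


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when its low 64 bits fall below ratio * 2**64."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound
```

Because the decision depends only on the trace ID, every service that sees the same trace and uses the same ratio reaches the same answer, so sampled traces stay complete.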

Why metrics are not sampled

Metrics are already aggregated and relatively cheap compared to full-fidelity traces. Request rate, error rate, and latency SLOs need complete counts to stay reliable.

What This Project Shows in an Interview

backend API implementation in Python with FastAPI
observability-first service design
OpenTelemetry instrumentation for traces, metrics, and logs
containerized local platform setup with Docker Compose
operational thinking around latency, error rate, structured logs, and trace correlation
Grafana provisioning instead of manual dashboard setup

What I Learned

How to use the OpenTelemetry Collector as a central ingestion layer instead of wiring each backend directly into the application.
Why P95 latency is usually a more operationally useful signal than average latency.
How structured logs become much more valuable when they share the same trace_id as traces.
How to package a multi-service local platform so other engineers can run it with one command instead of manual setup.

Common Troubleshooting

Metrics are missing

Check whether the Collector metrics exporter has data:

curl -s http://localhost:9464/metrics | grep requests_total
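
The same check can be scripted. This sketch fetches the exposition text and keeps only sample lines whose metric name matches exactly, which is stricter than grep because comment lines are skipped. `fetch_metrics` and `find_samples` are hypothetical helpers, and the exact metric names the Collector exports may differ from the app-side names depending on its configuration.

```python
import urllib.request


def fetch_metrics(url: str = "http://localhost:9464/metrics") -> str:
    """Download the Prometheus text exposition from the Collector exporter."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def find_samples(exposition_text: str, metric_name: str) -> list:
    """Return sample lines for `metric_name`, ignoring # HELP and # TYPE lines."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the label block or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric_name:
            samples.append(line)
    return samples
```

An empty result from `find_samples(fetch_metrics(), "requests_total")` after sending a few requests suggests the problem is between the app and the Collector rather than in Prometheus.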

Traces are missing

Make sure you have called /ok or /slow, then inspect recent traces in Grafana Explore with the Tempo datasource.

Logs are missing in Loki

First confirm the app is writing JSON logs:

docker compose logs app

If logs appear in stdout but not in Loki, inspect the OTEL Collector and Loki container logs next.
