ops: structured application log aggregation #131

## Tech Story

As a solo engineer, I want application logs aggregated into a searchable web UI so that I can investigate silent failures, slow queries, and unexpected behaviour without SSH-ing into the server and running `docker logs` — especially when I am not at my desk.

## ELI5 Context

**What is log aggregation?**
Your NestJS app writes log lines to stdout. By default, those logs live only inside the Docker container — to read them you must SSH into the VPS and run `docker logs station-backend-1`. Log aggregation means shipping those log lines to a hosted service where they are stored, indexed, and searchable from a web browser. You can filter by log level, search for a specific request ID, or set up alerts for error spikes — all without SSH.

**What is Vector?**
Vector is an open-source log and metrics pipeline. It runs as a lightweight sidecar container, watches the Docker socket for new log lines from your backend container, and forwards them to Logtail. No changes to your NestJS app are required for basic log shipping — Vector handles it.

**What is nestjs-pino?**
Pino is a fast Node.js JSON logger. `nestjs-pino` wraps it as a NestJS module, replacing the default NestJS logger with one that outputs structured JSON (one JSON object per log line). JSON logs are far easier for Logtail to parse, search, and alert on than free-form text.

**What is a request ID?**
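An identifier attached to each HTTP request and included in every log line produced while handling that request. By default pino-http assigns an incrementing number per process; to get a UUID you supply a `genReqId` function. If a user reports an error at 2:14 PM, you can search Logtail for that timestamp, find the request ID, then see every shipped log line from that specific request. In production only `warn` and above are shipped, so routine `info`/`debug` lines (such as successful database queries) will not be there. Without request IDs, correlating log lines to a specific request is guesswork.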
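pino-http accepts a `genReqId(req, res)` option for choosing the ID. A minimal sketch, assuming you want to reuse an incoming `X-Request-Id` header (for example one set by a reverse proxy) and otherwise generate a UUID; the function name, header name, and 128-character cap are assumptions, not existing project code:

```typescript
import { randomUUID } from 'node:crypto';
import type { IncomingMessage, ServerResponse } from 'node:http';

/**
 * Hypothetical request-ID generator for pino-http's `genReqId` option.
 * Reuses an upstream X-Request-Id header when present, otherwise makes a UUID,
 * and echoes the ID back so clients can quote it in bug reports.
 */
export function genReqId(req: IncomingMessage, res: ServerResponse): string {
  const incoming = req.headers['x-request-id'];
  // Ignore missing, repeated (array) or oversized client-supplied values.
  const id =
    typeof incoming === 'string' && incoming.length > 0 && incoming.length <= 128
      ? incoming
      : randomUUID();
  res.setHeader('X-Request-Id', id);
  return id;
}
```

It would be passed as `genReqId` inside the `pinoHttp` options of `LoggerModule.forRoot()`. Only trust the incoming header if your reverse proxy sets or overwrites it; otherwise clients can choose their own IDs.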

**What is the 1 GB/month Logtail limit?**
Logtail's free tier ingests 1 GB of log data per month. A JSON log line is roughly 200–500 bytes. To stay under the limit, only ship `warn` level and above (`warn`, `error`, `fatal`) from production, not `info`/`debug` (nestjs-pino maps NestJS's `log` method to Pino's `info`). At low traffic this keeps you well under 1 GB. You can temporarily lower the threshold when you need more detail for debugging.
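To put the 1 GB figure in lines per day, a small TypeScript sketch. It assumes 1 GB means 10^9 bytes and a 30-day month (both assumptions; check how Logtail meters usage), and `maxLinesPerDay` is a hypothetical helper. Vector also attaches Docker metadata to each event, so a shipped event is larger than the raw Pino line:

```typescript
// Assumption: the free tier's 1 GB is treated as 10^9 bytes.
const ONE_GB = 1_000_000_000;

/** Largest whole number of log lines per day that fits a monthly ingest budget. */
export function maxLinesPerDay(
  bytesPerLine: number,
  budgetBytes: number = ONE_GB,
  daysPerMonth: number = 30,
): number {
  if (bytesPerLine <= 0 || daysPerMonth <= 0) {
    throw new RangeError('bytesPerLine and daysPerMonth must be positive');
  }
  // Spread the monthly budget evenly across the days, rounding down.
  return Math.floor(budgetBytes / (bytesPerLine * daysPerMonth));
}

console.log(maxLinesPerDay(500)); // 66666
console.log(maxLinesPerDay(200)); // 166666
```

So at the pessimistic 500 bytes per line, the budget allows roughly 66,000 shipped lines a day, which warn-and-above logging at low traffic should not approach.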

## Technical Elaboration

### Part 1: Structured logging with nestjs-pino

**Install:**
```bash
cd backend
pnpm add nestjs-pino pino-http
pnpm add -D pino-pretty   # pretty-print for local development only
```

**Update `backend/src/app.module.ts`:**
```typescript
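import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';

const isProduction = process.env['NODE_ENV'] === 'production';

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        level: isProduction ? 'warn' : 'debug',
        transport: isProduction
          ? undefined                   // raw JSON in production (Vector reads this)
          : { target: 'pino-pretty' },  // human-readable in development
        autoLogging: true,              // logs every HTTP request automatically
        // By default pino-http logs every finished request (even 5xx) at `info`,
        // which the production `warn` level drops. Map status codes to levels instead:
        customLogLevel: (_req, res, err) => {
          if (err || res.statusCode >= 500) return 'error';
          if (res.statusCode >= 400) return 'warn';
          return 'info';
        },
        redact: ['req.headers.authorization', 'req.body.password'],  // never log these
      },
    }),
    // ... other modules
  ],
})
export class AppModule {}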
```
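nestjs-pino's documented setup also attaches the logger in `main.ts`, so Nest's own framework messages (startup, errors reported by the default exception filter, and any remaining `Logger` from `@nestjs/common`) are written as Pino JSON instead of plain text that the Vector level filter would drop. A sketch, assuming the bootstrap lives in `backend/src/main.ts` and the port comes from a `PORT` variable defaulting to 3000 (both assumptions about this project):

```typescript
// backend/src/main.ts (sketch; the real bootstrap may configure more)
import { NestFactory } from '@nestjs/core';
import { Logger } from 'nestjs-pino';
import { AppModule } from './app.module';

async function bootstrap(): Promise<void> {
  // Buffer logs written during startup until the Pino logger is attached,
  // so early framework messages are not printed as plain text.
  const app = await NestFactory.create(AppModule, { bufferLogs: true });
  // Route Nest's own logs (and any remaining @nestjs/common Logger) through Pino.
  app.useLogger(app.get(Logger));
  // Assumption: the port comes from a PORT env var, defaulting to 3000.
  await app.listen(process.env['PORT'] ?? 3000);
}

void bootstrap();
```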

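**Update all services** to inject `PinoLogger` instead of creating the default NestJS `Logger`. `@InjectPinoLogger(MyService.name)` keeps the service name as the log `context`:
```typescript
// Before:
private readonly logger = new Logger(MyService.name);  // Logger from '@nestjs/common'

// After:
import { InjectPinoLogger, PinoLogger } from 'nestjs-pino';
constructor(
  @InjectPinoLogger(MyService.name) private readonly logger: PinoLogger,
) {}

// PinoLogger follows Pino's argument order: optional merge object first, then message.
this.logger.warn({ url: '/auth/login' }, 'Failed login attempt for unknown user');
```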

**A warn-level log line in production will look roughly like this** (Pino writes levels as numbers, 40 = warn; inside Docker, `hostname` defaults to the container ID):
```json
{
  "level": 40,
  "time": 1715000000000,
  "pid": 1,
  "hostname": "<container-id>",
  "req": { "id": "abc-123", "method": "POST", "url": "/auth/login" },
  "msg": "Failed login attempt for unknown user"
}
```

### Part 2: Vector sidecar container

**New file: `infra/vector.toml`**
```toml
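[sources.docker_logs]
type = "docker_logs"
# Only ship backend logs. Matches by container-name prefix; the Compose-generated
# name "station-backend-1" depends on the project name, so adjust if it differs.
include_containers = ["station-backend-1"]

[transforms.parse_pino]
type = "remap"
inputs = ["docker_logs"]
# Lift Pino's JSON fields (level, msg, req, ...) to top-level event fields so the
# filter below and Logtail search can use them. Non-JSON lines pass through unchanged.
source = '''
parsed = object(parse_json(.message) ?? {}) ?? {}
. = merge(., parsed)
'''

[transforms.filter_level]
type = "filter"
inputs = ["parse_pino"]
# Only forward warn and above; keeps volume under the free tier limit.
# Pino level numbers: 10=trace, 20=debug, 30=info, 40=warn, 50=error, 60=fatal
condition = '(to_int(.level) ?? 0) >= 40'

[sinks.logtail]
type = "http"
inputs = ["filter_level"]
# Use the ingesting host shown in your Logtail source settings if it differs.
uri = "https://in.logtail.com"
encoding.codec = "json"
auth.strategy = "bearer"
auth.token = "${LOGTAIL_SOURCE_TOKEN}"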
```
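To check locally which container lines the Vector filter would forward, its logic can be mirrored in TypeScript. A sketch that assumes Pino's default numeric levels and one JSON object per line; `PINO_LEVELS` and `wouldShip` are hypothetical names, not part of Vector or Pino:

```typescript
// Pino's default numeric levels.
export const PINO_LEVELS = {
  trace: 10,
  debug: 20,
  info: 30,
  warn: 40,
  error: 50,
  fatal: 60,
} as const;

export type PinoLevelName = keyof typeof PINO_LEVELS;

/**
 * Mirrors the Vector pipeline: parse the container line as JSON and keep it
 * only if its numeric `level` is at or above the threshold. Lines that are not
 * JSON objects (e.g. a stray console.log) have no level and are dropped.
 */
export function wouldShip(line: string, threshold: PinoLevelName = 'warn'): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return false; // not JSON: the Vector condition falls back to level 0
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const level = (parsed as Record<string, unknown>)['level'];
  return typeof level === 'number' && level >= PINO_LEVELS[threshold];
}
```

For example, `wouldShip('{"level":30,"msg":"request completed"}')` is false while a `"level":40` line is true, which is why `info` request logs never reach Logtail in production.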

**Add Vector to `docker-compose.prod.yml`:**
```yaml
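vector:
  image: timberio/vector:latest-alpine   # consider pinning a specific version tag
  restart: unless-stopped
  command: ["--config", "/etc/vector/vector.toml"]   # load the TOML file explicitly
  volumes:
    # `:ro` only stops the socket file being replaced; it does NOT make the Docker API
    # read-only. Anything that can reach the socket can control the Docker daemon.
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - ./infra/vector.toml:/etc/vector/vector.toml:ro
  environment:
    LOGTAIL_SOURCE_TOKEN: ${LOGTAIL_SOURCE_TOKEN}
  depends_on:
    - backend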
```

**Add to `.env.production.example`:**
```env
LOGTAIL_SOURCE_TOKEN=   # from Logtail source settings; leave blank to skip shipping (Vector still starts, but Logtail rejects its requests)
```

**Add to `env.validation.ts`:**
```typescript
LOGTAIL_SOURCE_TOKEN: Joi.string().optional().allow(''),
```

**Logtail setup (manual, one-time):**
1. Sign up at betterstack.com/logtail
2. Create a new Source: type "HTTP" (Vector will push to it)
3. Copy the Source Token into GitHub environment secrets as `LOGTAIL_SOURCE_TOKEN`
4. Verify logs appear in the Logtail UI after first deploy

### Part 3: Logtail alert

In the Logtail UI:
- Create alert: query for lines with `level` 50 or higher (Pino's numeric `error` and `fatal`), threshold more than 10 occurrences in 5 minutes
- Notification: email to your address
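The rule "more than 10 error lines in 5 minutes" can be read as a sliding window. A hypothetical model for reasoning about when it fires; Logtail evaluates the real rule server-side, and its windowing may differ (for example fixed buckets rather than a sliding window):

```typescript
/**
 * Hypothetical model of the alert: true if more than `threshold` error
 * timestamps fall inside any window shorter than `windowMs`.
 */
export function wouldAlert(
  errorTimestampsMs: number[],
  threshold = 10,
  windowMs = 5 * 60 * 1000,
): boolean {
  const sorted = [...errorTimestampsMs].sort((a, b) => a - b);
  let start = 0;
  for (let end = 0; end < sorted.length; end++) {
    // Advance the left edge until the window spans less than windowMs.
    while (sorted[end]! - sorted[start]! >= windowMs) start++;
    // end - start + 1 errors fall inside the current window.
    if (end - start + 1 > threshold) return true;
  }
  return false;
}
```

Note the strict "more than": exactly 10 errors in a burst stay silent, the 11th fires the alert.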

### New file: `infra/docs/logging.md`

Document:
1. **Architecture** — NestJS (pino-http) -> Docker stdout -> Vector -> Logtail
2. **Log levels** — what the app emits and Vector ships in production (warn+) vs what the app emits locally in development (debug+, pretty-printed, never shipped)
3. **How to search logs** — Logtail query syntax examples: filter by level, by URL, by time range
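4. **How to temporarily increase verbosity** — lower the app's Pino `level` (currently fixed to `warn` when `NODE_ENV=production`) and loosen the level condition in the `filter_level` transform in `vector.toml`, then redeploy both the backend and Vector; changing Vector alone cannot ship lines the app never writes
5. **What to do if Logtail is full** — free tier is 1 GB/month; if approaching the limit, tighten the `filter_level` condition to error and fatal only (Pino levels 50 and 60)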
6. **Retention** — free tier keeps 3 days; anything older is gone
7. **Alert setup** — current alert configuration
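Item 4 currently needs a code change because the app's level is derived only from `NODE_ENV`. One option is a hypothetical `LOG_LEVEL` override (not an existing variable in this project; it would also need adding to `.env.production.example` and `env.validation.ts`), sketched here:

```typescript
// Hypothetical: LOG_LEVEL is not yet defined anywhere in this project.
const VALID_LEVELS = ['trace', 'debug', 'info', 'warn', 'error', 'fatal', 'silent'] as const;
export type AppLogLevel = (typeof VALID_LEVELS)[number];

/** Picks the Pino level: a valid LOG_LEVEL wins, else the NODE_ENV default. */
export function resolveLogLevel(
  env: Record<string, string | undefined> = process.env,
): AppLogLevel {
  const requested = env['LOG_LEVEL']?.trim().toLowerCase();
  if (requested && (VALID_LEVELS as readonly string[]).includes(requested)) {
    return requested as AppLogLevel;
  }
  // Fall back to the defaults described in this issue.
  return env['NODE_ENV'] === 'production' ? 'warn' : 'debug';
}
```

It would be used as `level: resolveLogLevel()` in the `pinoHttp` options; the Vector `filter_level` condition still has to be loosened separately.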

## Definition of Done

- [ ] `nestjs-pino` and `pino-http` installed; `LoggerModule` registered in `AppModule`
- [ ] Log level set to `warn` in production, `debug` in development
- [ ] `req.headers.authorization` and `req.body.password` redacted from all logs
- [ ] All services use the injected Pino logger instead of `new Logger(ServiceName.name)`
- [ ] `infra/vector.toml` written filtering to warn+ level and forwarding to Logtail
- [ ] `vector` service added to `docker-compose.prod.yml` with read-only Docker socket mount
- [ ] `LOGTAIL_SOURCE_TOKEN` is optional — Vector container starts without error when token is absent (Logtail will reject its requests, but the app is unaffected)
- [ ] `LOGTAIL_SOURCE_TOKEN` added to GitHub environment secrets for production
- [ ] Logs verified appearing in Logtail UI after deploy
- [ ] Error spike alert configured in Logtail UI (>10 errors in 5 minutes -> email)
- [ ] `infra/docs/logging.md` written
- [ ] `pnpm test` passes — nestjs-pino must not break unit tests (use `LoggerModule.forRoot({ pinoHttp: { level: 'silent' } })` in test module setup if needed)

## Dependencies

- Depends on: #108 (Docker Compose — adding Vector as a service)
- Depends on: #128 (secrets management — LOGTAIL_SOURCE_TOKEN stored in GitHub environment secrets)
- Blocks: nothing