Skip to content

EhFa1999/TechAssignment

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

░██████╗███╗░░██╗░█████╗░██████╗░██████╗░ ██╔════╝████╗░██║██╔══██╗██╔══██╗██╔══██╗ ╚█████╗░██╔██╗██║███████║██████╔╝██████╔╝ ░╚═══██╗██║╚████║██╔══██║██╔═══╝░██╔═══╝░ ██████╔╝██║░╚███║██║░░██║██║░░░░░██║░░░░░ ╚═════╝░╚═╝░░╚══╝╚═╝░░╚═╝╚═╝░░░░░╚═╝░░░░░

Traffic HTTP Service

Spring Boot service for ingesting and querying traffic data using ClickHouse.

Run

docker-compose up --build

Swagger

http://localhost:8080/swagger-ui.html

Endpoints

POST /api/traffic/{request_id} GET /api/traffic/daily/{user_id}?page=0&size=10

Notes

  • Async ingestion
  • Idempotent writes
  • ClickHouse ReplacingMergeTree
  • Metrics via Actuator

Architecture Highlights

  • Layered architecture (Controller / Service / Repository)
  • Append-only writes optimized for ClickHouse
  • Logical updates via ReplacingMergeTree
  • Idempotent write API using request_id
  • Asynchronous ingestion for low-latency writes
  • Centralized exception handling
  • Built-in observability (metrics & health checks)

Database Design

  • Engine: ReplacingMergeTree
  • Partitioning: Monthly (toYYYYMM(created_at))
  • Ordering: (user_id, created_at, request_id)
  • Chosen to balance:
  1. High write throughput
  2. Efficient time-based aggregation
  3. Update capability without mutable storage

Other

  • I used eclipse-temurin:17-jre because it provides a secure, vendor-neutral OpenJDK distribution. The JRE variant is sufficient for running a Spring Boot application and keeps the container image minimal and production-ready.

  • The init.sql file is mounted into the ClickHouse container under /docker-entrypoint-initdb.d/ and is executed automatically only once, during the first startup of the container when the data directory is empty. Its responsibility is to bootstrap the database schema by creating the required database and tables, including defining the storage engine, partitioning, and ordering strategy. In this project, the table uses the ReplacingMergeTree(version) engine because ClickHouse does not support traditional row-level updates; instead, updates are modeled as new inserts with a higher version value, and during background merge operations ClickHouse keeps the latest version of each record (based on the ORDER BY key) while discarding older ones. This approach enables an update endpoint at the application level while remaining fully aligned with ClickHouse’s OLAP-oriented, append-only design principles.

  • I exposed application metrics via Micrometer and Actuator, scraped them using Prometheus, and visualized ingestion throughput and API behavior in Grafana. This allows observing system health and data ingestion performance in real time.

Service URL
Traffic API http://localhost:8080
Swagger http://localhost:8080/swagger-ui.html
Actuator http://localhost:8080/actuator
Prometheus http://localhost:9090
Grafana http://localhost:3000
Component Version Notes
Java 17 (LTS) Long-term support, production standard
Spring Boot 3.2.5 Modern Spring stack, Jakarta EE
ClickHouse Server latest (Docker) stable + HTTP & native protocol support
ClickHouse JDBC Driver 0.6.3 Official driver
Springdoc OpenAPI 2.5.0 Swagger UI for API documentation
Micrometer Spring Boot managed Metrics instrumentation
Prometheus Registry Spring Boot managed Metrics export
Spring Boot Actuator 2.5.0 Health & metrics endpoints
Docker latest Containerization
Docker Compose v3.8 Local orchestration
JUnit Jupiter Spring Boot managed Testing framework
Testcontainers 1.19.8 Integration with real ClickHouse,not mocks

example_curl

  • write_Request

curl -X POST http://localhost:8080/api/traffic/11111111-1111-1111-1111-111111111111
-H "Content-Type: application/json"
-d '{ "userId": 42, "createdAt": "2025-01-01T12:00:00", "pageUrl": "https://example.com/home" }'

  • idempotency_test

curl -X POST http://localhost:8080/api/traffic/11111111-1111-1111-1111-111111111111
-H "Content-Type: application/json"
-d '{ ... }'

  • Update_Request

curl -X PUT http://localhost:8080/api/traffic/11111111-1111-1111-1111-111111111111
-H "Content-Type: application/json"
-d '{ "userId": 42, "createdAt": "2025-01-01T12:05:00", "pageUrl": "https://example.com/profile" }'

  • ReadAPI_Request(Daily Report)

curl http://localhost:8080/api/traffic/daily/42

docker commands

Category Purpose Command
Run Build and run docker compose up --build
Stop Stop and remove volumes docker compose down -v
Logs View app logs docker compose logs traffic-app
Logs View ClickHouse logs docker compose logs clickhouse
Logs View Prometheus logs docker compose logs prometheus
Shell Enter traffic app container docker exec -it traffic-app /bin/sh
Shell Enter ClickHouse client docker exec -it clickhouse clickhouse-client
Shell Enter Prometheus container docker exec -it prometheus /bin/sh
API test Health check curl http://localhost:8080/actuator/health
API test Prometheus metrics curl http://localhost:8080/actuator/prometheus
API test Insert traffic event curl -X POST http://localhost:8080/api/traffic/{uuid}
API test Update traffic event curl -X PUT http://localhost:8080/api/traffic/{uuid}
API test Daily report curl http://localhost:8080/api/traffic/daily/{userId}
Database Show tables SHOW TABLES FROM traffic;
Database Inspect data SELECT * FROM traffic.traffic_events;
Database Validate daily aggregation SELECT page_url, COUNT() FROM traffic.traffic_events WHERE user_id = 42 AND created_at >= now() - INTERVAL 1 DAY GROUP BY page_url;
Prometheus Access UI http://localhost:9090

maven commands

Category Purpose Command
Build Clean and compile mvn clean compile
Test Run all unit tests mvn test
Test Run a specific test class mvn -Dtest=TrafficServiceTest test
Test Run a specific test method mvn -Dtest=TrafficServiceTest#dailyReport_shouldReturnCounts test
Package Create executable JAR mvn clean package
Package Skip tests during build mvn clean package -DskipTests
Package Package without running tests mvn clean package -Dmaven.test.skip=true
Docker Build Docker image via Spring Boot mvn spring-boot:build-image
Docker Build image with custom name mvn spring-boot:build-image -Dspring-boot.build-image.imageName=traffic-ingestion:latest
Dependency Show dependency tree mvn dependency:tree
Debug Run with debug output mvn clean test -X
Debug Resume failed build mvn -rf :traffic-ingestion

systemDesign question

Identifying the Bottleneck in the Current System In the current system, ClickHouse is used directly as the primary storage for ingesting traffic events. While ClickHouse is an excellent OLAP database, this design makes it the main bottleneck under heavy write and large-scale read workloads.

  • Why ClickHouse Becomes the Bottleneck: ClickHouse is optimized for analytical queries and batch-oriented data ingestion, not for high-frequency transactional writes.

The main bottlenecks are:

  • High-frequency small writes:
  1. ClickHouse performs best with large batch inserts.
  2. Continuous single-row inserts create many small data parts.
  3. This significantly increases background merge operations.
  • Expensive background merges:
  1. MergeTree engines continuously merge parts to optimize query performance.
  2. Under heavy write load, merge operations consume CPU and disk I/O.
  3. This directly impacts query latency.
  • Read–write contention:
  1. OLAP queries and ingestion share the same resources.
  2. Analytical queries suffer from increased latency during peak writes.
  • Lack of write buffering:
  1. The system writes directly to ClickHouse synchronously.
  2. Sudden traffic spikes cannot be absorbed gracefully.

As a result, ClickHouse becomes the primary bottleneck and limits system scalability beyond a certain write throughput.

The redesigned system must satisfy the following requirements:

  • Heavy write tolerant
  • Low-latency OLAP queries
  • High availability
  • Scalable analytics on very large datasets
  • Support both real-time and batch processing
  • Use only open-source technologies

Apache Flink (Real-Time Processing)

  • DLH queryEngine
  • Windowed aggregations
  • Enrichment
  • stateful processing
  • use flink_CDC(ingestion)& flink_SQL(processing like batch with Table/Stream)

ClickHouse (RTOLAP Serving Layer)

  • used strictly for analytical queries
  • Optimized schema (e.g., AggregatingMergeTree)
  • No direct user writes

Apache Fluss (ingestion & hot layer)

  • prevent deduplication ETL
  • Storage unification for all layers
  • for deltaJoin

Iceberg or Paimon(Analytics Layer)

  • stores large-scale & cold-data datasets
  • bring ACID & TimeTravels ability or data_Versioning
  • use snapshot and metadataLayer

About

snapp TechAssignment for DataEngineering position

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors