░██████╗███╗░░██╗░█████╗░██████╗░██████╗░
██╔════╝████╗░██║██╔══██╗██╔══██╗██╔══██╗
╚█████╗░██╔██╗██║███████║██████╔╝██████╔╝
░╚═══██╗██║╚████║██╔══██║██╔═══╝░██╔═══╝░
██████╔╝██║░╚███║██║░░██║██║░░░░░██║░░░░░
╚═════╝░╚═╝░░╚══╝╚═╝░░╚═╝╚═╝░░░░░╚═╝░░░░░
Spring Boot service for ingesting and querying traffic data using ClickHouse.
docker compose up --build
http://localhost:8080/swagger-ui.html
POST /api/traffic/{request_id}
GET /api/traffic/daily/{user_id}?page=0&size=10
- Async ingestion
- Idempotent writes
- ClickHouse ReplacingMergeTree
- Metrics via Actuator
- Layered architecture (Controller / Service / Repository)
- Append-only writes optimized for ClickHouse
- Logical updates via ReplacingMergeTree
- Idempotent write API using request_id
- Asynchronous ingestion for low-latency writes
- Centralized exception handling
- Built-in observability (metrics & health checks)
- Engine: ReplacingMergeTree
- Partitioning: Monthly (toYYYYMM(created_at))
- Ordering: (user_id, created_at, request_id)
- Chosen to balance:
- High write throughput
- Efficient time-based aggregation
- Update capability without mutable storage
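The schema choices above could look like the following DDL sketch; the column names beyond request_id, user_id, and created_at are assumptions inferred from the API examples, not taken verbatim from the project's init.sql:

```sql
-- Sketch of the events table (assumed columns).
-- ReplacingMergeTree(version) keeps, after merges, the row with the
-- highest version for each distinct ORDER BY key.
CREATE TABLE IF NOT EXISTS traffic.traffic_events
(
    request_id UUID,
    user_id    UInt64,
    created_at DateTime,
    page_url   String,
    version    UInt64
)
ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(created_at)
ORDER BY (user_id, created_at, request_id);
```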
- Why eclipse-temurin:17-jre?
I used eclipse-temurin:17-jre because it provides a secure, vendor-neutral OpenJDK distribution. The JRE variant is sufficient for running a Spring Boot application and keeps the container image minimal and production-ready.
- What does init.sql do?
The init.sql file is mounted into the ClickHouse container under /docker-entrypoint-initdb.d/ and is executed automatically only once, during the first startup of the container when the data directory is empty. Its responsibility is to bootstrap the database schema by creating the required database and tables, including defining the storage engine, partitioning, and ordering strategy. In this project, the table uses the ReplacingMergeTree(version) engine because ClickHouse does not support traditional row-level updates; instead, updates are modeled as new inserts with a higher version value, and during background merge operations ClickHouse keeps the latest version of each record (based on the ORDER BY key) while discarding older ones. This approach enables an update endpoint at the application level while remaining fully aligned with ClickHouse’s OLAP-oriented, append-only design principles.
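A minimal sketch of the update-as-insert pattern described above, assuming the column layout (request_id, user_id, created_at, page_url, version); the values are illustrative:

```sql
-- Initial write (version 1)
INSERT INTO traffic.traffic_events VALUES
    ('11111111-1111-1111-1111-111111111111', 42,
     '2025-01-01 12:00:00', 'https://example.com/home', 1);

-- Logical update: same ORDER BY key, higher version
INSERT INTO traffic.traffic_events VALUES
    ('11111111-1111-1111-1111-111111111111', 42,
     '2025-01-01 12:00:00', 'https://example.com/profile', 2);

-- FINAL forces deduplication at query time, so only the
-- highest-version row per key is returned even before a
-- background merge has run.
SELECT * FROM traffic.traffic_events FINAL
WHERE user_id = 42;
```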
- How is observability handled?
I exposed application metrics via Micrometer and Actuator, scraped them using Prometheus, and visualized ingestion throughput and API behavior in Grafana. This allows observing system health and data ingestion performance in real time.
| Service | URL |
|---|---|
| Traffic API | http://localhost:8080 |
| Swagger | http://localhost:8080/swagger-ui.html |
| Actuator | http://localhost:8080/actuator |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3000 |
| Component | Version | Notes |
|---|---|---|
| Java | 17 (LTS) | Long-term support, production standard |
| Spring Boot | 3.2.5 | Modern Spring stack, Jakarta EE |
| ClickHouse Server | latest (Docker) | Stable; HTTP & native protocol support |
| ClickHouse JDBC Driver | 0.6.3 | Official driver |
| Springdoc OpenAPI | 2.5.0 | Swagger UI for API documentation |
| Micrometer | Spring Boot managed | Metrics instrumentation |
| Prometheus Registry | Spring Boot managed | Metrics export |
| Spring Boot Actuator | Spring Boot managed | Health & metrics endpoints |
| Docker | latest | Containerization |
| Docker Compose | v3.8 | Local orchestration |
| JUnit Jupiter | Spring Boot managed | Testing framework |
| Testcontainers | 1.19.8 | Integration tests against real ClickHouse, not mocks |
- Write request
curl -X POST http://localhost:8080/api/traffic/11111111-1111-1111-1111-111111111111 \
  -H "Content-Type: application/json" \
  -d '{
    "userId": 42,
    "createdAt": "2025-01-01T12:00:00",
    "pageUrl": "https://example.com/home"
  }'
- Idempotency test (re-sending the same request_id)
curl -X POST http://localhost:8080/api/traffic/11111111-1111-1111-1111-111111111111 \
  -H "Content-Type: application/json" \
  -d '{ ... }'
- Update request
curl -X PUT http://localhost:8080/api/traffic/11111111-1111-1111-1111-111111111111 \
  -H "Content-Type: application/json" \
  -d '{
    "userId": 42,
    "createdAt": "2025-01-01T12:05:00",
    "pageUrl": "https://example.com/profile"
  }'
- Read request (daily report)
curl "http://localhost:8080/api/traffic/daily/42?page=0&size=10"
| Category | Purpose | Command |
|---|---|---|
| Run | Build and run | docker compose up --build |
| Stop | Stop and remove volumes | docker compose down -v |
| Logs | View app logs | docker compose logs traffic-app |
| Logs | View ClickHouse logs | docker compose logs clickhouse |
| Logs | View Prometheus logs | docker compose logs prometheus |
| Shell | Enter traffic app container | docker exec -it traffic-app /bin/sh |
| Shell | Enter ClickHouse client | docker exec -it clickhouse clickhouse-client |
| Shell | Enter Prometheus container | docker exec -it prometheus /bin/sh |
| API test | Health check | curl http://localhost:8080/actuator/health |
| API test | Prometheus metrics | curl http://localhost:8080/actuator/prometheus |
| API test | Insert traffic event | curl -X POST http://localhost:8080/api/traffic/{uuid} |
| API test | Update traffic event | curl -X PUT http://localhost:8080/api/traffic/{uuid} |
| API test | Daily report | curl http://localhost:8080/api/traffic/daily/{userId} |
| Database | Show tables | SHOW TABLES FROM traffic; |
| Database | Inspect data | SELECT * FROM traffic.traffic_events; |
| Database | Validate daily aggregation | SELECT page_url, COUNT() FROM traffic.traffic_events WHERE user_id = 42 AND created_at >= now() - INTERVAL 1 DAY GROUP BY page_url; |
| Prometheus | Access UI | http://localhost:9090 |
| Category | Purpose | Command |
|---|---|---|
| Build | Clean and compile | mvn clean compile |
| Test | Run all unit tests | mvn test |
| Test | Run a specific test class | mvn -Dtest=TrafficServiceTest test |
| Test | Run a specific test method | mvn -Dtest=TrafficServiceTest#dailyReport_shouldReturnCounts test |
| Package | Create executable JAR | mvn clean package |
| Package | Skip tests during build | mvn clean package -DskipTests |
| Package | Package without running tests | mvn clean package -Dmaven.test.skip=true |
| Docker | Build Docker image via Spring Boot | mvn spring-boot:build-image |
| Docker | Build image with custom name | mvn spring-boot:build-image -Dspring-boot.build-image.imageName=traffic-ingestion:latest |
| Dependency | Show dependency tree | mvn dependency:tree |
| Debug | Run with debug output | mvn clean test -X |
| Debug | Resume failed build from a module | mvn install -rf :traffic-ingestion |
Identifying the Bottleneck in the Current System

In the current system, ClickHouse is used directly as the primary storage for ingesting traffic events. While ClickHouse is an excellent OLAP database, this design makes it the main bottleneck under heavy write and large-scale read workloads.
- Why ClickHouse Becomes the Bottleneck: ClickHouse is optimized for analytical queries and batch-oriented data ingestion, not for high-frequency transactional writes.
The main bottlenecks are:
- High-frequency small writes:
- ClickHouse performs best with large batch inserts.
- Continuous single-row inserts create many small data parts.
- This significantly increases background merge operations.
- Expensive background merges:
- MergeTree engines continuously merge parts to optimize query performance.
- Under heavy write load, merge operations consume CPU and disk I/O.
- This directly impacts query latency.
- Read–write contention:
- OLAP queries and ingestion share the same resources.
- Analytical queries suffer from increased latency during peak writes.
- Lack of write buffering:
- The system writes directly to ClickHouse synchronously.
- Sudden traffic spikes cannot be absorbed gracefully.
As a result, ClickHouse becomes the primary bottleneck and limits system scalability beyond a certain write throughput.
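Within the current design, ClickHouse's server-side asynchronous inserts can partially absorb small-write pressure by buffering rows into larger parts before flushing. This is a mitigation, not a substitute for the redesign that follows; the values below are illustrative:

```sql
-- Buffer small inserts server-side and flush them as larger parts.
-- wait_for_async_insert = 0 acknowledges the write before the flush,
-- trading durability guarantees for lower write latency.
INSERT INTO traffic.traffic_events
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES ('11111111-1111-1111-1111-111111111111', 42,
        '2025-01-01 12:00:00', 'https://example.com/home', 1);
```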
The redesigned system must satisfy the following requirements:
- Tolerance for heavy write loads
- Low-latency OLAP queries
- High availability
- Scalable analytics on very large datasets
- Support both real-time and batch processing
- Use only open-source technologies
Apache Flink (Real-Time Processing)
- Query engine over the data lakehouse (DLH)
- Windowed aggregations
- Enrichment
- Stateful processing
- Flink CDC for ingestion; Flink SQL for unified batch/stream processing (Table & DataStream APIs)
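As an illustration of the windowed aggregations mentioned above, a hypothetical Flink SQL job could look like the following; the table and connector names are assumptions, and created_at is assumed to be declared as the event-time attribute of the source table:

```sql
-- Hypothetical Flink SQL: per-minute page-view counts via a tumbling-window TVF,
-- written continuously into a ClickHouse-backed sink table.
INSERT INTO clickhouse_daily_counts
SELECT user_id,
       page_url,
       window_start,
       COUNT(*) AS views
FROM TABLE(
    TUMBLE(TABLE traffic_events_stream, DESCRIPTOR(created_at), INTERVAL '1' MINUTE))
GROUP BY user_id, page_url, window_start, window_end;
```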
ClickHouse (RTOLAP Serving Layer)
- Used strictly for analytical queries
- Optimized schema (e.g., AggregatingMergeTree)
- No direct user writes
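A sketch of what the optimized serving schema might look like, using AggregatingMergeTree as suggested above; all object names here are assumptions for illustration:

```sql
-- Pre-aggregated daily counts; AggregateFunction(count) stores partial
-- aggregation states that are combined during background merges.
CREATE TABLE traffic.daily_page_counts
(
    user_id  UInt64,
    day      Date,
    page_url String,
    views    AggregateFunction(count)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (user_id, day, page_url);

-- Populate the aggregate table as events arrive.
CREATE MATERIALIZED VIEW traffic.daily_page_counts_mv
TO traffic.daily_page_counts AS
SELECT user_id, toDate(created_at) AS day, page_url, countState() AS views
FROM traffic.traffic_events
GROUP BY user_id, day, page_url;

-- Read side: finalize partial states with the -Merge combinator.
SELECT page_url, countMerge(views) AS views
FROM traffic.daily_page_counts
WHERE user_id = 42 AND day = today()
GROUP BY page_url;
```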
Apache Fluss (Ingestion & Hot Layer)
- Removes the need for separate deduplication ETL jobs
- Storage unification across all layers
- Enables delta joins
Iceberg or Paimon (Analytics Layer)
- Stores large-scale, cold datasets
- Provides ACID transactions, time travel, and data versioning
- Uses snapshots and a metadata layer for consistent reads