Materialize JSON Schema Attacher

A Kafka Streams application that attaches inline JSON schemas to messages from Materialize sinks, enabling seamless integration with the Confluent JDBC Sink Connector.

Architecture

Materialize (exactly-once)
    → Redpanda (JSON without schema)
    → Kafka Streams App (attaches schema, exactly-once)
    → Redpanda (JSON with inline schema)
    → JDBC Sink Connector (upsert mode)
    → Postgres
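
Conceptually, the app consumes each schemaless JSON value from Materialize and republishes it wrapped in the Kafka Connect envelope ({"schema": ..., "payload": ...}) that the JDBC sink's JsonConverter expects. A minimal Kafka Streams sketch of the idea (illustrative topic names and schema, not the actual SchemaAttacherApp code):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class EnvelopeSketch {
    public static void main(String[] args) {
        // Illustrative schema; the real ones come from application.properties.
        String schemaJson = "{\"type\":\"struct\",\"fields\":[{\"type\":\"int64\","
                + "\"optional\":false,\"field\":\"id\"}],\"optional\":false,\"name\":\"users\"}";

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("users-json-noschema", Consumed.with(Serdes.String(), Serdes.String()))
               // Wrap each raw JSON value in the Connect envelope.
               .mapValues(v -> "{\"schema\":" + schemaJson + ",\"payload\":" + v + "}")
               .to("users-json-withschema", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "schema-attacher-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:19092");
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        new KafkaStreams(builder.build(), props).start();
    }
}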

Features

  • Exactly-once semantics throughout the pipeline
  • Simple configuration - no code changes needed to add new topics
  • Upsert support - handles updates correctly using primary keys
  • Multiple topic support - transform multiple streams simultaneously
  • Fully tested - comprehensive unit tests included
  • Docker-based - entire pipeline runs in Docker Compose

Prerequisites

  • Java 17+
  • Maven 3.6+
  • Docker & Docker Compose
  • Python 3.8+ (for test data generation)

Quick Start

1. Build the application

mvn clean package

This creates a fat JAR: target/json-schema-attacher-1.0.0-SNAPSHOT.jar

2. Start the infrastructure

docker compose up -d

This starts:

  • Redpanda (Kafka-compatible broker)
  • Redpanda Console (Web UI on http://localhost:8080)
  • Postgres (data destination)
  • Materialize (streaming database)
  • Kafka Connect (JDBC sink connector)
  • Schema Attacher App (this application)

Wait for all services to be healthy (~30-60 seconds):

docker compose ps

3. Deploy JDBC sink connectors

./scripts/deploy_connectors.sh

This deploys three JDBC sink connectors with upsert mode enabled.
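
The deployed configs live in docker/connect/ (users-sink.json, orders-sink.json, events-sink.json). A representative upsert-mode config, sketched with the standard Confluent JDBC sink property names (values here are assumptions about this stack, not the exact shipped files):

{
  "name": "users-postgres-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "users-json-withschema",
    "connection.url": "jdbc:postgresql://postgres:5432/sink_db",
    "connection.user": "postgres",
    "insert.mode": "upsert",
    "pk.mode": "record_value",
    "pk.fields": "id",
    "auto.create": "true",
    "table.name.format": "users",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true"
  }
}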

4. Generate test data

Install Python dependencies:

pip install -r scripts/requirements.txt

Run the data generator:

python3 scripts/generate_test_data.py

This generates:

  • 10 user records
  • 20 order records
  • 30 event records
  • Updates to test upsert behavior

5. Verify the pipeline

./scripts/verify_pipeline.sh

Or manually check Postgres:

docker exec -it postgres psql -U postgres -d sink_db
SELECT * FROM users;
SELECT * FROM orders;
SELECT * FROM events;

Configuration

Adding New Topics

Edit config/application.properties:

# Pattern: schema.topic.<input-topic>=<schema-json>
#          output.topic.<input-topic>=<output-topic>

schema.topic.my-new-topic={"type": "struct", "fields": [...], "name": "..."}
output.topic.my-new-topic=my-new-topic-with-schema

Need help creating schemas? See the Schema Configuration Guide for detailed instructions on:

  • Mapping Materialize types to Kafka Connect schema types (quick reference below)
  • Step-by-step schema creation from materialized views
  • Timestamp and logical type handling
  • Common patterns and troubleshooting
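
As a quick reference, plausible mappings follow standard Kafka Connect conventions (the guide is authoritative for this project):

  • bigint → int64
  • integer → int32
  • real → float32, double precision → float64
  • text → string
  • boolean → boolean
  • bytea → bytes
  • timestamp/timestamptz → int64 with a Timestamp logical type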

Restart the schema attacher:

docker compose restart schema-attacher

Exactly-Once Semantics

The application is configured for exactly-once processing:

processing.guarantee=exactly_once_v2

Combined with:

  • Materialize's exactly-once sinks
  • JDBC connector's upsert mode

This provides end-to-end exactly-once delivery.

Schema Format

Schemas must be in Kafka Connect JSON format:

{
  "type": "struct",
  "fields": [
    {
      "type": "int64",
      "optional": false,
      "field": "id"
    },
    {
      "type": "string",
      "optional": true,
      "field": "name"
    }
  ],
  "optional": false,
  "name": "schema.name"
}

Supported types:

  • int32, int64
  • float32, float64
  • string
  • boolean
  • bytes
  • struct (nested; see the examples below)
  • array
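
For illustration, here is how a nested struct, an array, and a timestamp column can be expressed inside the fields array (field names are hypothetical; the timestamp uses the standard Connect logical-type convention):

{
  "type": "struct",
  "fields": [
    {"type": "string", "optional": true, "field": "city"}
  ],
  "optional": true,
  "field": "address"
},
{
  "type": "array",
  "items": {"type": "string", "optional": true},
  "optional": true,
  "field": "tags"
},
{
  "type": "int64",
  "name": "org.apache.kafka.connect.data.Timestamp",
  "version": 1,
  "optional": false,
  "field": "created_at"
}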

Schema Generator Tool

For complex schemas or multiple sinks, use the included schema generator tool to automatically create schema configurations from your Materialize sinks:

# Install dependencies
pip install -r tools/requirements.txt

# Generate schemas for all Kafka sinks
python3 tools/generate_schemas.py \
  --host localhost \
  --port 6875 \
  --output config/generated.properties

# Generate for specific sinks
python3 tools/generate_schemas.py \
  --host mz.example.com \
  --sink users_json_sink \
  --sink orders_json_sink

The tool:

  • Queries Materialize's catalog to discover all Kafka sinks
  • Extracts column definitions with types and nullability
  • Automatically maps Materialize types to Kafka Connect types
  • Handles temporal types with proper logical type annotations
  • Supports properties and JSON output formats

See tools/README.md for complete documentation.

Running Standalone Container

If you want to run only the schema-attacher container (not the full Docker Compose stack), use the production Docker image.

1. Prepare Your Configuration

Create a custom application.properties file for your environment:

# Kafka/Redpanda connection settings
bootstrap.servers=your-kafka-broker:9092
application.id=schema-attacher

# Kafka Streams configuration
default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde=org.apache.kafka.common.serialization.Serdes$StringSerde

# Processing guarantees - exactly-once semantics
processing.guarantee=exactly_once_v2

# Topic mappings
schema.topic.your-input-topic={"type": "struct", "fields": [...], "name": "your.schema"}
output.topic.your-input-topic=your-output-topic

Key Settings to Customize:

  • bootstrap.servers: Your Kafka/Redpanda broker addresses
  • application.id: Unique identifier for this Kafka Streams app
  • Topic mappings: Define your input topics, schemas, and output topics

2. Run the Container

Pull the production image:

docker pull ghcr.io/materializeinclabs/materialize-json-inline-schema:latest

Run with your custom configuration:

docker run -d \
  --name schema-attacher \
  --restart unless-stopped \
  --network your-network \
  -v /path/to/your/application.properties:/app/config/application.properties \
  -e JAVA_OPTS="-Xmx2g -Xms2g" \
  ghcr.io/materializeinclabs/materialize-json-inline-schema:latest

Important:

  • Replace /path/to/your/application.properties with your actual config file path
  • Adjust --network to match your Kafka cluster's network
  • Tune JAVA_OPTS memory settings based on your throughput needs

3. Update Configuration

To update the configuration:

  1. Edit your application.properties file
  2. Restart the container:
    docker restart schema-attacher

The application will reload with the new configuration.

4. Monitor the Container

View logs:

docker logs -f schema-attacher

Check health:

docker ps --filter name=schema-attacher

Verify it reached RUNNING state:

docker logs schema-attacher | grep "State transition.*RUNNING"

Example: Connecting to Confluent Cloud

bootstrap.servers=pkc-xxxxx.us-east-1.aws.confluent.cloud:9092
application.id=schema-attacher-prod

# SASL/SSL configuration for Confluent Cloud
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="YOUR_API_KEY" \
  password="YOUR_API_SECRET";

# Processing guarantee
processing.guarantee=exactly_once_v2

# Your topic mappings
schema.topic.users-from-materialize={"type": "struct", ...}
output.topic.users-from-materialize=users-with-schema

Example: Connecting to AWS MSK

bootstrap.servers=b-1.mycluster.xxxxx.kafka.us-east-1.amazonaws.com:9092,b-2.mycluster.xxxxx.kafka.us-east-1.amazonaws.com:9092
application.id=schema-attacher-msk

# IAM authentication (if using MSK IAM)
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
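# NOTE: AWS_MSK_IAM requires the aws-msk-iam-auth library on the
# application classpath.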

# Processing guarantee
processing.guarantee=exactly_once_v2

# Your topic mappings
schema.topic.users-from-materialize={"type": "struct", ...}
output.topic.users-from-materialize=users-with-schema

Project Structure

.
├── src/
│   ├── main/java/com/materialize/schema/
│   │   ├── SchemaAttacherApp.java      # Main application
│   │   ├── SchemaWrapper.java          # Schema wrapping logic
│   │   ├── ConfigParser.java           # Configuration parser
│   │   └── TopicConfig.java            # Topic configuration model
│   └── test/java/com/materialize/schema/
│       ├── SchemaWrapperTest.java
│       ├── ConfigParserTest.java
│       └── TopicConfigTest.java
├── docker/
│   ├── materialize/init.sql            # Materialize setup
│   ├── postgres/init.sql               # Postgres schema
│   └── connect/                        # JDBC connector configs
│       ├── users-sink.json
│       ├── orders-sink.json
│       └── events-sink.json
├── scripts/
│   ├── generate_test_data.py           # Test data generator
│   ├── deploy_connectors.sh            # Deploy JDBC connectors
│   └── verify_pipeline.sh              # Verify data flow
├── config/
│   └── application.properties          # Kafka Streams config
├── docker-compose.yml
├── Dockerfile
└── pom.xml

Testing

Unit Tests

mvn test

Integration Testing

The entire pipeline serves as an integration test:

  1. Start the infrastructure
  2. Generate test data
  3. Verify data landed in Postgres correctly
  4. Check for duplicates (should be none with exactly-once; see the query below)
  5. Generate updates and verify upserts work
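
For step 4, a quick duplicate check in Postgres (assuming id is the primary key of users):

docker exec -it postgres psql -U postgres -d sink_db
SELECT id, COUNT(*) FROM users GROUP BY id HAVING COUNT(*) > 1;
-- An empty result means no duplicates.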

Manual Testing

Monitor topics in the Redpanda Console at http://localhost:8080.

Check Materialize views:

docker exec -it materialize psql -h localhost -p 6875 -U materialize
SHOW SOURCES;
SHOW SINKS;
SELECT * FROM users_processed LIMIT 5;

View connector status:

curl http://localhost:8083/connectors/users-postgres-sink/status | jq '.'

Troubleshooting

Schema Attacher not starting

Check logs:

docker compose logs schema-attacher

Common issues:

  • Redpanda not ready yet (wait and restart)
  • Invalid schema JSON in configuration

Data not appearing in Postgres

  1. Check if data is in Materialize:

    docker exec -it materialize psql -h localhost -p 6875 -U materialize
    SELECT * FROM users_processed;
  2. Check if schema attacher is processing:

    docker compose logs schema-attacher
  3. Check JDBC connector status:

    curl http://localhost:8083/connectors/users-postgres-sink/status
  4. Check connector logs:

    docker compose logs kafka-connect

Topics not created

Topics are auto-created by Redpanda. If needed, create them manually:

docker exec redpanda rpk topic create users-json-noschema
docker exec redpanda rpk topic create users-json-withschema

Performance Tuning

For production deployments:

  1. Kafka Streams (example after this list):

    • Increase num.stream.threads for parallelism
    • Tune cache.max.bytes.buffering for batching
  2. Redpanda:

    • Increase resources (memory, CPU)
    • Adjust --smp and --memory flags
  3. Postgres:

    • Add appropriate indexes
    • Tune connection pool size in JDBC connector
    • Consider partitioning for large tables
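
For item 1, the Kafka Streams settings go in config/application.properties; the values below are illustrative starting points, not tested recommendations:

# Parallelism: up to one stream thread per input partition is useful
num.stream.threads=4
# Record cache used for batching before forwarding downstream (bytes)
cache.max.bytes.buffering=10485760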

Development

Running Locally (without Docker)

  1. Start only infrastructure:

    docker compose up -d redpanda postgres materialize kafka-connect
  2. Update config/application.properties:

    bootstrap.servers=localhost:19092
  3. Run the application:

    java -jar target/json-schema-attacher-1.0.0-SNAPSHOT.jar config/application.properties

Adding New Features

The codebase is structured for easy extension:

  • SchemaWrapper: Modify to support different schema formats
  • ConfigParser: Add support for external schema files
  • SchemaAttacherApp: Add custom transformations or filtering (sketch below)
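
For example, a hypothetical filter that drops tombstone (null-value) records before the schema is attached, building on the sketch in the Architecture section (names are illustrative):

// Hypothetical: skip tombstones before wrapping, since the envelope
// concatenation assumes a non-null JSON payload.
builder.stream("users-json-noschema", Consumed.with(Serdes.String(), Serdes.String()))
       .filter((key, value) -> value != null)
       .mapValues(v -> "{\"schema\":" + schemaJson + ",\"payload\":" + v + "}")
       .to("users-json-withschema", Produced.with(Serdes.String(), Serdes.String()));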

License

[Your License Here]

Contributing

[Your Contributing Guidelines Here]
