Lookup-based token embeddings for the JVM, in pure Kotlin.
Tessera is the piece. Mosaic is the whole.
Site: hectorifc.github.io/mosaic
See ARCHITECTURE.md for design details.
Mosaic is a Kotlin library that provides a trainable EmbeddingTable: a [vocabSize × embeddingDim] matrix mapping token IDs to dense Float vectors. Built as a sister project to Tessera (BPE tokenizer), it completes the text → tokens → vectors pipeline in pure Kotlin.
What Mosaic is:
- A lookup table modeled on PyTorch's nn.Embedding
- Efficient flat 1D FloatArray storage (cache-friendly, ~1 % overhead)
- Essential vector operations (cosine similarity, top-K nearest)
- 6 pluggable initializers (uniformDefault, uniform, xavier, he, zeros, constant)
- Compact binary persistence with SHA-256 checksum
- Native Tessera integration via TesseraEmbeddings
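To give a feel for what a pluggable initializer might compute, here is a minimal sketch of seeded uniform, Xavier and He initialization over a flat array. The function names, the choice of fanIn = vocabSize and fanOut = embeddingDim, and the uniform variants of the Xavier and He bounds are assumptions made for illustration; they are not taken from Mosaic's source.

```kotlin
import kotlin.math.sqrt
import kotlin.random.Random

// Hypothetical helpers, not Mosaic's actual API. Each returns a flat,
// row-major FloatArray of size vocabSize * embeddingDim.

/** Uniform values in [-bound, bound), reproducible for a given seed. */
fun uniformInit(vocabSize: Int, embeddingDim: Int, bound: Float, seed: Long): FloatArray {
    val rng = Random(seed)
    return FloatArray(vocabSize * embeddingDim) { (rng.nextFloat() * 2f - 1f) * bound }
}

/** Xavier/Glorot uniform: bound = sqrt(6 / (fanIn + fanOut)). */
fun xavierInit(vocabSize: Int, embeddingDim: Int, seed: Long): FloatArray {
    // ASSUMPTION: fanIn = vocabSize, fanOut = embeddingDim.
    val bound = sqrt(6.0 / (vocabSize + embeddingDim)).toFloat()
    return uniformInit(vocabSize, embeddingDim, bound, seed)
}

/** He/Kaiming uniform: bound = sqrt(6 / fanIn). */
fun heInit(vocabSize: Int, embeddingDim: Int, seed: Long): FloatArray {
    // ASSUMPTION: fanIn = vocabSize.
    val bound = sqrt(6.0 / vocabSize).toFloat()
    return uniformInit(vocabSize, embeddingDim, bound, seed)
}
```

Passing the same seed twice yields identical weights, which is the property the `seed` argument in the quick-start example presumably relies on.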
What Mosaic is NOT:
- It does not train embeddings (no Word2Vec, no backprop, no SGD)
- No dense matrix-by-matrix operations
- No GPU acceleration
- No quantization
- Text only (no image/audio embeddings)
The intent is to be a solid Lego block: other projects (yours or future ones) that want to actually train embeddings can use Mosaic as the storage layer and update weights via the public API.
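As a rough illustration of that division of labor, the sketch below shows the kind of update an external trainer might compute before writing a row back. `sgdStep` is a hypothetical helper, and the gradient would come from the trainer, not from Mosaic.

```kotlin
/**
 * One plain SGD step on a single embedding row: w' = w - lr * grad.
 * Returns a new array and leaves both inputs untouched.
 */
fun sgdStep(row: FloatArray, grad: FloatArray, lr: Float): FloatArray {
    require(row.size == grad.size) { "row and gradient must have the same length" }
    return FloatArray(row.size) { j -> row[j] - lr * grad[j] }
}

// With the get/set calls used elsewhere in this README, a training loop
// could presumably apply the step like this (untested against Mosaic):
//   table.set(id = 42, vector = sgdStep(table.get(id = 42), grad, lr))
```

Keeping the update rule outside the table means the storage layer stays free of any optimizer logic.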
- Library, not application: meant to be consumed by other Kotlin projects (but ships with a CLI module for debugging)
- Pure Kotlin: no ML libraries, no math libraries (unless proven necessary for performance)
- FloatArray everywhere: no Double, no boxing, no List<Float>
- Flat 1D storage: cache locality matters
- Minimal public API: only what's necessary, all marked public explicitly via explicitApi()
// Option 1: JitPack
// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
// build.gradle.kts
dependencies {
    implementation("com.github.HectorIFC:mosaic:mosaic-core-v0.0.4")
    // Tessera comes as a transitive dependency, so there is no need to declare it explicitly
}

// Option 2: GitHub Packages
// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        mavenCentral()
        maven {
            url = uri("https://maven.pkg.github.com/HectorIFC/mosaic")
            credentials {
                username = System.getenv("GITHUB_ACTOR")
                password = System.getenv("GITHUB_TOKEN")
            }
        }
    }
}
// build.gradle.kts
dependencies {
    implementation("dev.mosaic:mosaic-core:0.0.4")
}

import dev.mosaic.EmbeddingTable
import dev.mosaic.TesseraEmbeddings
import dev.mosaic.Initializer
import dev.tessera.BpeTokenizer

fun main() {
    // 1. Load a previously-trained Tessera tokenizer
    val tokenizer = BpeTokenizer.load("tessera.json")
    // 2. Create an embedding table with a matching vocab size
    val embeddings = EmbeddingTable.create(
        vocabSize = tokenizer.vocabSize,
        embeddingDim = 128,
        initializer = Initializer.uniformDefault(seed = 42L),
    )
    // 3. Wire them into a pipeline
    val pipeline = TesseraEmbeddings(tokenizer, embeddings)
    val vectors = pipeline.encode("Hello, mosaic!")
    println("Got ${vectors.size} vectors of dim ${vectors[0].size}")
    // 4. Persist for later
    embeddings.save("mosaic.bin")
}

val table = EmbeddingTable.create(vocabSize = 1000, embeddingDim = 64)
// Lookup
val v = table.get(id = 42)
// Write (for external training loops)
val newVec = FloatArray(64) { 0.1f * it }
table.set(id = 42, vector = newVec)
// Similarity
val similar = table.mostSimilar(id = 42, topK = 5)
similar.forEach { (id, score) -> println("Token $id → score $score") }

More examples in the mosaic-samples module.
This is a Gradle multi-module project with 3 modules:
mosaic/
├── mosaic-core/      # the library (the published JAR)
├── mosaic-cli/       # CLI application built on top of the lib
└── mosaic-samples/   # runnable usage examples
- mosaic-core: the consumable JAR. Minimal public API. Published to JitPack and GitHub Packages.
- mosaic-cli: runnable application (./gradlew :mosaic-cli:run) with commands create, inspect, stats, similar, encode. Useful for interactive debugging.
- mosaic-samples: small Kotlin programs with main() demonstrating usage patterns.
# Build everything
./gradlew build
# Run tests
./gradlew test
# Full quality pipeline
./gradlew test koverVerify ktlintCheck detekt
# Install the library into Maven Local for testing in other projects
./gradlew publishToMavenLocal
# Run the CLI
./gradlew :mosaic-cli:run --args="--help"
./gradlew :mosaic-cli:run --args="create --vocab-size 1000 --dim 64 --output embeddings.bin"
./gradlew :mosaic-cli:run --args="inspect --input embeddings.bin"
# Run a sample
./gradlew :mosaic-samples:run -PmainClass=dev.mosaic.samples.QuickStartSampleKt

In a nutshell:
- Storage: the matrix is held in a single contiguous FloatArray of size vocabSize × embeddingDim. Row i lives at offset i × embeddingDim.
- Lookup: get(id) returns a copy of the slice. Mutating the result never touches the table.
- Vector ops: all live in the stateless VectorOps. Sums are accumulated in Double and narrowed back to Float on return, avoiding precision drift.
- mostSimilar: implemented with a fixed-size min-heap for O(N log K).
- Persistence: compact binary .bin (16-byte header + raw float32 LE) plus a JSON .meta.json sidecar. A SHA-256 checksum verifies integrity.
- Tessera integration: TesseraEmbeddings combines tokenizer and table into one class; the vocab-size match is checked at construction.
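The first four points can be condensed into one small, self-contained sketch. `FlatTable` is a hypothetical stand-in written for this README, not Mosaic's actual class. It also assumes that the query row is excluded from its own neighbor list and that a zero vector has similarity 0, neither of which is confirmed by the text above.

```kotlin
import java.util.PriorityQueue
import kotlin.math.sqrt

// Hypothetical illustration only; not Mosaic's real implementation.
class FlatTable(val vocabSize: Int, val dim: Int) {
    // One contiguous array; row i occupies [i * dim, (i + 1) * dim).
    private val data = FloatArray(vocabSize * dim)

    /** Returns a copy, so callers cannot mutate the table through the result. */
    fun get(id: Int): FloatArray {
        require(id in 0 until vocabSize) { "id $id out of range" }
        return data.copyOfRange(id * dim, (id + 1) * dim)
    }

    fun set(id: Int, vector: FloatArray) {
        require(id in 0 until vocabSize) { "id $id out of range" }
        require(vector.size == dim) { "expected $dim values, got ${vector.size}" }
        vector.copyInto(data, destinationOffset = id * dim)
    }

    /** Cosine similarity of rows a and b: accumulate in Double, narrow to Float. */
    fun cosine(a: Int, b: Int): Float {
        var dot = 0.0
        var normA = 0.0
        var normB = 0.0
        for (j in 0 until dim) {
            val x = data[a * dim + j].toDouble()
            val y = data[b * dim + j].toDouble()
            dot += x * y
            normA += x * x
            normB += y * y
        }
        val denom = sqrt(normA) * sqrt(normB)
        return if (denom == 0.0) 0f else (dot / denom).toFloat() // ASSUMPTION: zero vector -> 0
    }

    /** Top-K neighbors by cosine using a min-heap capped at K entries: O(N log K). */
    fun mostSimilar(id: Int, topK: Int): List<Pair<Int, Float>> {
        require(topK > 0) { "topK must be positive" }
        // The weakest candidate kept so far sits at the head of the heap.
        val heap = PriorityQueue<Pair<Int, Float>>(compareBy<Pair<Int, Float>> { it.second })
        for (other in 0 until vocabSize) {
            if (other == id) continue // ASSUMPTION: skip the query row itself
            val score = cosine(id, other)
            if (heap.size < topK) {
                heap.add(other to score)
            } else if (score > heap.peek().second) {
                heap.poll()
                heap.add(other to score)
            }
        }
        return heap.sortedByDescending { it.second }
    }
}
```

The heap never holds more than K entries, so each of the N candidate rows costs at most one O(log K) insertion, which is where the O(N log K) bound comes from.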
See ARCHITECTURE.md for the deeper rationale.
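For the persistence point, the sketch below shows one way a 16-byte header followed by raw little-endian float32 values could be written, read back and checksummed. The header fields (magic, version, vocabSize, dim), the magic value and all function names are invented for this example; the real layout is whatever ARCHITECTURE.md specifies.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.security.MessageDigest

// ASSUMPTION: a header of four little-endian Int32 fields, made up for illustration.
const val MAGIC = 0x4D4F5341 // arbitrary made-up magic number
const val VERSION = 1
const val HEADER_BYTES = 16

fun encode(vocabSize: Int, dim: Int, weights: FloatArray): ByteArray {
    require(weights.size == vocabSize * dim) { "weights do not match the declared shape" }
    val buf = ByteBuffer.allocate(HEADER_BYTES + weights.size * 4).order(ByteOrder.LITTLE_ENDIAN)
    buf.putInt(MAGIC).putInt(VERSION).putInt(vocabSize).putInt(dim)
    buf.asFloatBuffer().put(weights) // raw float32 payload, little-endian
    return buf.array()
}

fun decode(bytes: ByteArray): Triple<Int, Int, FloatArray> {
    require(bytes.size >= HEADER_BYTES) { "truncated header" }
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    require(buf.getInt() == MAGIC) { "bad magic number" }
    require(buf.getInt() == VERSION) { "unsupported version" }
    val vocabSize = buf.getInt()
    val dim = buf.getInt()
    require(buf.remaining() == vocabSize * dim * 4) { "payload size mismatch" }
    val weights = FloatArray(vocabSize * dim)
    buf.asFloatBuffer().get(weights)
    return Triple(vocabSize, dim, weights)
}

/** Lowercase hex SHA-256 digest, e.g. for storing next to the file and re-checking on load. */
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).joinToString("") { "%02x".format(it) }
```

A loader would recompute `sha256Hex` over the bytes it read and refuse to continue if the value differs from the stored one.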
On Apple M1 / JVM 21, at dim = 128:
| vocabSize | mostSimilar(topK=10) | save | load |
|---|---|---|---|
| 10 000 | 3.09 ms | 11.10 ms | 4.69 ms |
| 50 000 | 11.67 ms | 50.48 ms | 22.14 ms |
| 100 000 | 23.16 ms | 108.06 ms | 80.62 ms |
The acceptance criterion for mostSimilar (< 100 ms at 10 K vocab × 128 dim) is met with a ~32× margin (3.09 ms measured). Full details in BENCHMARKS.md.
- Phase 0: Gradle multi-module setup + GitHub infrastructure (workflows, dependabot, PR template, detekt) + Tessera dependency
- Phase 1: Core lib (EmbeddingTable, Initializer, VectorOps, mostSimilar)
- Phase 2: Binary persistence + TesseraEmbeddings integration
- Phase 3: Samples + benchmarks + ≥ 80 % coverage
- Phase 4: CLI (create, inspect, stats, similar, encode)
- Phase 5: GitHub Pages live (logo + MP4 animations + orange/black/white palette)
- Phase 6: Publication on JitPack + polish (v0.0.1)
- Tessera: byte-level BPE tokenizer (Mosaic depends on it via JitPack)
- A future project: possibly a Word2Vec/Skip-gram trainer that consumes Mosaic as its storage, or a small transformer
MIT. See LICENSE.
"Mosaic": an artwork composed of many individual pieces (tesserae) arranged to form a complete image.
In Tessera, each token is an isolated piece: a chunk of bytes with an ID. In Mosaic, those pieces find their place in a vector space where, together, they begin to form meaning. A token alone is just an index; surrounded by its neighbors, it is part of a larger semantic structure.
