
GgufModelMetadata silently drops UInt/ULong-typed numeric fields #585

@michalharakal

Summary

GgufModelMetadata.from(), and any other code that reads reader.fields with the idiom (value as? Number)?.toInt(), silently gets null for GGUF metadata stored as uint32 / uint64. In Kotlin, the unsigned types (UInt, ULong, UShort, UByte) do not extend kotlin.Number, so the as? Number cast yields null and an is Number check does not match.
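
The cast behaviour can be checked in isolation. The snippet below is a minimal sketch, not code from the library; viaNumberCast is a hypothetical name that only wraps the idiom quoted above.

```kotlin
// Hypothetical wrapper around the idiom from the summary: (value as? Number)?.toInt()
fun viaNumberCast(value: Any?): Int? = (value as? Number)?.toInt()

fun main() {
    val signed: Any = 8192      // boxed Int, a subtype of kotlin.Number
    val unsigned: Any = 8192u   // boxed UInt, not a subtype of kotlin.Number

    println(signed is Number)                // true
    println(unsigned is Number)              // false
    println(viaNumberCast(signed))           // 8192
    println(viaNumberCast(unsigned))         // null
    println((unsigned as? UInt)?.toInt())    // 8192
}
```

The last line shows that the value is recoverable once the unsigned type is matched explicitly.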

Modern GGUF files (anything produced by recent llama.cpp converters) store dimensions and counts as uint32. The result: contextLength, embeddingLength, headCount, layerCount, vocabSize (fallback), bosTokenId, eosTokenId, etc. are all populated as null instead of the real values, and the model loader falls back to defaults (e.g. blockCount = 0 → a transformer with zero layers).

Where

skainet-io/skainet-io-gguf/src/commonMain/kotlin/sk/ainet/io/gguf/GgufModelMetadata.kt:179

private fun Map<String, Any?>.getInt(vararg keys: String): Int? {
    for (key in keys) {
        val value = this[key]
        when (value) {
            is Number -> return value.toInt()      // ← UInt/ULong fall through
            is String -> value.toIntOrNull()?.let { return it }
        }
    }
    return null
}

getIntList (line 190) has the same bug for the list-of-numbers case.

Reproduction

val md = GgufModelMetadata.from(mapOf(
    "general.architecture" to "llama",
    "llama.context_length" to 8192u,        // UInt — what the reader actually emits
    "llama.embedding_length" to 4096u,
    "llama.block_count" to 32u
))
md.contextLength       // null — expected 8192
md.embeddingLength     // null — expected 4096
md.layerCount          // null — expected 32

The existing GgufModelMetadataTokenizerTest only uses Int literals, which is why this never tripped a test.

Impact

  • Anyone calling GgufModelMetadata.from(reader) on a real-world GGUF gets a GgufModelMetadata with most numeric fields null.
  • Same idiom is repeated downstream — e.g. SKaiNET-transformers UnifiedModelLoader.peek had to introduce a local workaround. Every consumer of reader.fields is exposed to the same trap.

Proposed fix (target: hotfix/0.22.2)

  1. Add public top-level extensions on Map<String, Any?> in sk.ainet.io.gguf (new file, e.g. GgufFieldAccessors.kt):
    • getInt(vararg keys: String): Int?
    • getLong(vararg keys: String): Long?
    • getString(vararg keys: String): String?
    • getIntList(vararg keys: String): List<Int>?
    • getStringList(vararg keys: String): List<String>?
      The numeric ones handle Int/UInt/Long/ULong/Short/UShort/Byte/UByte/String.
  2. Delete the buggy private helpers in GgufModelMetadata.kt and route through the new public ones.
  3. Add a regression test that drives GgufModelMetadata.from with UInt and ULong values (both list and scalar).
  4. Bump VERSION_NAME to 0.22.2.

This change is non-breaking: it only adds new public API and fixes existing methods so that they return correct values.
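
One possible shape for the numeric accessors, as a sketch rather than the final implementation. The names getInt and getIntList come from the proposal; fieldToInt and the key "example.list" are hypothetical. Two assumptions are made here that the issue does not specify: values that do not fit in Int yield null instead of wrapping, and list-valued fields arrive as a List. Only the types listed in step 1 are handled.

```kotlin
// Hypothetical helper: converts one raw field value to Int.
// Assumption: out-of-range values yield null rather than wrapping around.
fun fieldToInt(value: Any?): Int? = when (value) {
    is Int -> value
    is Short -> value.toInt()
    is Byte -> value.toInt()
    is Long -> if (value in Int.MIN_VALUE.toLong()..Int.MAX_VALUE.toLong()) value.toInt() else null
    is UByte -> value.toInt()
    is UShort -> value.toInt()
    is UInt -> if (value <= Int.MAX_VALUE.toUInt()) value.toInt() else null
    is ULong -> if (value <= Int.MAX_VALUE.toULong()) value.toInt() else null
    is String -> value.toIntOrNull()
    else -> null
}

// Returns the first key whose value converts to Int, or null if none does.
fun Map<String, Any?>.getInt(vararg keys: String): Int? {
    for (key in keys) {
        fieldToInt(this[key])?.let { return it }
    }
    return null
}

// Assumption: a list field is a List<*>; a key is skipped if any element fails to convert.
fun Map<String, Any?>.getIntList(vararg keys: String): List<Int>? {
    for (key in keys) {
        val raw = this[key] as? List<*> ?: continue
        val converted = raw.map { fieldToInt(it) }
        if (converted.none { it == null }) return converted.filterNotNull()
    }
    return null
}

fun main() {
    val fields: Map<String, Any?> = mapOf(
        "llama.context_length" to 8192u,   // UInt
        "llama.block_count" to 32uL,       // ULong
        "example.list" to listOf(1u, 2u)   // hypothetical list-valued key
    )
    println(fields.getInt("llama.context_length"))         // 8192
    println(fields.getInt("missing", "llama.block_count")) // 32
    println(fields.getIntList("example.list"))             // [1, 2]
    println(fieldToInt(UInt.MAX_VALUE))                    // null (does not fit in Int)
}
```

Returning null for out-of-range values is a design choice made for this sketch; calling toInt() directly on a UInt or ULong would instead wrap or truncate silently.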

Notes

  • A downstream stopgap already exists in SKaiNET-transformers develop (UnifiedModelLoader.toIntValue); it can be removed once consumers can adopt 0.22.2.
