🚀 Advanced JSON tokenizer with multiple encoding strategies for optimal compression and performance.
Lightweight and symmetric JSON tokenizer for compression and optimization. Generates consistent dictionaries and supports alphabetic, numeric, base64, UUID-based, and custom tokenization methods with symmetric encoding/decoding. Perfect for data compression, API optimization, and storage efficiency. Can be used standalone or with MessagePack or Gzip for enhanced compression.
- Multiple Tokenization Methods: Alphabetic, numeric, base64, UUID-short, and custom
- Symmetric Encoding: Perfect reconstruction of original data
- 🔒 Security First: Built-in prototype pollution protection
- High Performance: Optimized algorithms with minimal overhead
- TypeScript Support: Full type safety
```bash
npm install @docamz/json-tokenizer
```

```js
import { generateDictionary, tokenize, detokenize, TokenizationMethod } from "@docamz/json-tokenizer";

const data = { name: "Alice", age: 30, city: "Paris" };
const keys = ["name", "age", "city"];

// Generate dictionary and tokenize
const dict = generateDictionary(keys);
const encoded = tokenize(data, dict.forward);
const decoded = detokenize(encoded, dict.reverse);

console.log(encoded); // { a: "Alice", b: 30, c: "Paris" }
console.log(decoded); // { name: "Alice", age: 30, city: "Paris" }
```

Perfect for maximum compression with readable tokens.
```js
const dict = generateDictionary(keys, { method: TokenizationMethod.ALPHABETIC });
// Result: { name: "a", age: "b", city: "c" }
```

Simple numeric tokens for databases and APIs.
```js
const dict = generateDictionary(keys, { method: TokenizationMethod.NUMERIC });
// Result: { name: "0", age: "1", city: "2" }
```

Fixed-width numeric tokens for consistent formatting.
```js
const dict = generateDictionary(keys, {
  method: TokenizationMethod.PADDED_NUMERIC,
  paddingLength: 3
});
// Result: { name: "000", age: "001", city: "002" }
```

High-density encoding using alphanumeric characters plus symbols.
```js
const dict = generateDictionary(keys, { method: TokenizationMethod.BASE64 });
// Supports 64 characters: a-z, A-Z, 0-9, _, $
// Result: { name: "a", age: "b", city: "c", ... key63: "$", key64: "ba" }
```

Distributed-system friendly, based on a timestamp plus counter.
```js
const dict = generateDictionary(keys, { method: TokenizationMethod.UUID_SHORT });
// Result: { name: "1a2b00", age: "1a2b01", city: "1a2b02" }
// Format: 4-char timestamp + 2-char counter (6 chars total)
```

Define your own tokenization logic.
```js
const dict = generateDictionary(keys, {
  method: TokenizationMethod.CUSTOM,
  customGenerator: (index) => `custom_${index}`
});
// Result: { name: "custom_0", age: "custom_1", city: "custom_2" }
```

Add prefixes to any tokenization method.
```js
const dict = generateDictionary(keys, {
  method: TokenizationMethod.NUMERIC,
  prefix: "api_"
});
// Result: { name: "api_0", age: "api_1", city: "api_2" }
```

```js
const complexData = {
  user: {
    profile: { firstName: "John", lastName: "Doe", email: "john@example.com" },
    settings: { theme: "dark", language: "en", notifications: true }
  },
  metadata: { version: "2.0", createdAt: "2023-01-01T00:00:00Z" }
};
const keys = [
  "user", "profile", "firstName", "lastName", "email",
  "settings", "theme", "language", "notifications",
  "metadata", "version", "createdAt"
];

const dict = generateDictionary(keys, { method: TokenizationMethod.ALPHABETIC });
const encoded = tokenize(complexData, dict.forward);
const decoded = detokenize(encoded, dict.reverse);

// Perfect reconstruction guaranteed (decoded is a new object, so compare by value)
console.log(JSON.stringify(decoded) === JSON.stringify(complexData)); // true
```

```js
const arrayData = {
  users: [
    { name: "Alice", age: 30, role: "admin" },
    { name: "Bob", age: 25, role: "user" },
    { name: "Charlie", age: 35, role: "moderator" }
  ]
};

const keys = ["users", "name", "age", "role"];
const dict = generateDictionary(keys, { method: TokenizationMethod.BASE64 });
const encoded = tokenize(arrayData, dict.forward);
// Result: { a: [{ b: "Alice", c: 30, d: "admin" }, ...] }
```

```js
import fs from "node:fs";

// Save dictionary for later use
const dict = generateDictionary(keys, { method: TokenizationMethod.ALPHABETIC });
const serialized = JSON.stringify(dict);
fs.writeFileSync('dictionary.json', serialized);

// Load the dictionary and use it to decode previously tokenized data (encodedData)
const loaded = JSON.parse(fs.readFileSync('dictionary.json', 'utf-8'));
const decoded = detokenize(encodedData, loaded.reverse);
```

Built-in protection against prototype pollution and other security vulnerabilities:
```js
import { tokenize, sanitizeObject, isSafeKey } from "@docamz/json-tokenizer";

// Automatic protection against dangerous keys
const maliciousData = { name: "Alice", "__proto__": { isAdmin: true } };
tokenize(maliciousData, dict.forward); // Throws: "Dangerous key detected"

// Sanitize untrusted input
const cleanData = sanitizeObject(untrustedInput, { throwOnUnsafeKeys: true });

// Validate keys manually
if (isSafeKey(keyName)) {
  // Safe to use
}
```

Protected against:
- `__proto__` pollution
- `constructor` manipulation
- Dangerous property access
- Control character injection
📖 See SECURITY.md for the complete security guide.
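For context on why these keys are blocked, here is a minimal, library-independent sketch of prototype pollution; `naiveMerge` is a hypothetical vulnerable helper used only for illustration, not part of this package.

```js
// Hypothetical vulnerable helper (NOT part of this library): a naive deep merge
// that copies keys without validating them.
function naiveMerge(target, source) {
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (value && typeof value === "object") {
      // Recursing through "__proto__" reaches Object.prototype...
      target[key] = naiveMerge(target[key] ?? {}, value);
    } else {
      // ...and this assignment then pollutes it for every object in the process.
      target[key] = value;
    }
  }
  return target;
}

const payload = JSON.parse('{"name":"Alice","__proto__":{"isAdmin":true}}');
naiveMerge({}, payload);

console.log(({}).isAdmin); // true: unrelated objects now look like admins
delete Object.prototype.isAdmin; // undo the pollution

// Rejecting such keys up front (as tokenize, sanitizeObject, and isSafeKey do) avoids this.
```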
| Function | Parameters | Description |
|---|---|---|
| `generateDictionary(keys, options?)` | `keys: string[]`, `options?: TokenizationOptions` | Generate a tokenization dictionary |
| `tokenize(obj, dict)` | `obj: any`, `dict: Record<string, string>` | Replace keys with tokens |
| `detokenize(obj, reverse)` | `obj: any`, `reverse: Record<string, string>` | Restore original keys |
| Method | Description | Use Case |
|---|---|---|
| `ALPHABETIC` | a, b, c, ..., z, aa, ab | Maximum compression, readable |
| `NUMERIC` | 0, 1, 2, 3, ... | Simple, database-friendly |
| `PADDED_NUMERIC` | 000, 001, 002, ... | Fixed-width, sortable |
| `BASE64` | a-z, A-Z, 0-9, _, $ | High-density encoding |
| `UUID_SHORT` | Timestamp + counter | Distributed systems |
| `CUSTOM` | User-defined function | Custom requirements |
```ts
interface TokenizationOptions {
  method?: TokenizationMethod;                  // Default: ALPHABETIC
  customGenerator?: (index: number) => string;  // For the CUSTOM method
  paddingLength?: number;                       // Default: 4 (for PADDED_NUMERIC)
  prefix?: string;                              // Default: "" (empty)
}
```

Access individual generators directly:
```js
import {
  generateAlphabeticSequence,
  generateNumericSequence,
  generatePaddedNumericSequence,
  generateBase64Sequence,
  generateUuidShortSequence
} from "@docamz/json-tokenizer";

// Use specific generators
const token1 = generateAlphabeticSequence(0); // "a"
const token2 = generateBase64Sequence(63);    // "$"
const token3 = generateUuidShortSequence(0);  // "1a2b00"
```
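To illustrate how an index maps to a token such as `a`, `z`, `aa`, `ab`, here is a minimal conceptual sketch of an alphabetic sequence generator. It is an illustration only, not the library's implementation, and it ignores the `prefix` option.

```js
// Conceptual sketch (not the library's actual code): map an index to an
// alphabetic token in the sequence a, b, ..., z, aa, ab, ...
function alphabeticToken(index) {
  let token = "";
  let i = index;
  do {
    token = String.fromCharCode(97 + (i % 26)) + token; // 97 = "a"
    i = Math.floor(i / 26) - 1;                         // carry for the bijective base-26 sequence
  } while (i >= 0);
  return token;
}

console.log(alphabeticToken(0));  // "a"
console.log(alphabeticToken(25)); // "z"
console.log(alphabeticToken(26)); // "aa"
console.log(alphabeticToken(27)); // "ab"
```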
Benchmarks were run on four sample files:

- model1.json (83.8 KB file), 2679 rows, 216 unique keys in the dictionary
- model2.json (134.4 KB file), 4069 rows, 216 unique keys in the dictionary
- model3.json (148.7 KB file), 4424 rows, 216 unique keys in the dictionary
- model4.json (33.1 KB file), 1056 rows, 216 unique keys in the dictionary
These files contain complex nested structures and arrays with mixed value types (booleans, URLs, text, numbers, etc.) to simulate real-world JSON data.
Compression benchmarks for the different tokenization methods on model3.json (148.7 KB file, 4424 rows, 216 unique keys); a reproduction sketch follows the table:
| Method | Dict Gen | Tokenize | Total | Original | Tokenized | Compression | Saved |
|---|---|---|---|---|---|---|---|
| alphabetic | 0.00 ms | 112.28 ms | 112.28 ms | 72.14 KB | 49.26 KB | 31.71% | 22.87 KB |
| base64 | 0.00 ms | 111.24 ms | 111.24 ms | 72.14 KB | 48.70 KB | 32.49% | 23.44 KB |
| numeric | 0.00 ms | 113.88 ms | 113.88 ms | 72.14 KB | 51.52 KB | 28.58% | 20.62 KB |
| padded_numeric | 0.00 ms | 127.31 ms | 127.31 ms | 72.14 KB | 56.87 KB | 21.17% | 15.27 KB |
| uuid_short | 0.00 ms | 113.00 ms | 113.00 ms | 72.14 KB | 63.82 KB | 11.53% | 8.31 KB |
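A minimal sketch of how figures like these could be reproduced. The file path, the `collectKeys` helper, and the choice to measure sizes from the minified JSON string before and after tokenization are assumptions, not part of the published benchmark harness.

```js
// Hypothetical reproduction sketch (file path and key-collection helper are assumptions).
import fs from "node:fs";
import { generateDictionary, tokenize, TokenizationMethod } from "@docamz/json-tokenizer";

// Collect every distinct key in a nested JSON value.
function collectKeys(value, keys = new Set()) {
  if (Array.isArray(value)) value.forEach((v) => collectKeys(v, keys));
  else if (value && typeof value === "object") {
    for (const [k, v] of Object.entries(value)) {
      keys.add(k);
      collectKeys(v, keys);
    }
  }
  return keys;
}

const data = JSON.parse(fs.readFileSync("model3.json", "utf-8"));
const keys = [...collectKeys(data)];

const t0 = performance.now();
const dict = generateDictionary(keys, { method: TokenizationMethod.BASE64 });
const t1 = performance.now();
const encoded = tokenize(data, dict.forward);
const t2 = performance.now();

const originalKB = Buffer.byteLength(JSON.stringify(data)) / 1024;
const tokenizedKB = Buffer.byteLength(JSON.stringify(encoded)) / 1024;
const compression = (1 - tokenizedKB / originalKB) * 100; // compression %, cf. the table above
const efficiency = compression / (t2 - t0);               // compression % per ms of total time

console.log({ dictGenMs: t1 - t0, tokenizeMs: t2 - t1, originalKB, tokenizedKB, compression, efficiency });
```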
FASTEST TOKENIZATION:
- base64: 111.24 ms
- alphabetic: 112.28 ms
- uuid_short: 113.00 ms
- numeric: 113.88 ms
- padded_numeric: 127.31 ms
BEST COMPRESSION:
- base64: 32.49% (23.44 KB saved)
- alphabetic: 31.71% (22.87 KB saved)
- numeric: 28.58% (20.62 KB saved)
- padded_numeric: 21.17% (15.27 KB saved)
- uuid_short: 11.53% (8.31 KB saved)
MOST SPACE SAVED:
- base64: 23.44 KB
- alphabetic: 22.87 KB
- numeric: 20.62 KB
- padded_numeric: 15.27 KB
- uuid_short: 8.31 KB
EFFICIENCY SCORE (Compression/Time):
- base64: 0.2921 (32.49% in 111.24 ms)
- alphabetic: 0.2824 (31.71% in 112.28 ms)
- numeric: 0.2510 (28.58% in 113.88 ms)
- padded_numeric: 0.1663 (21.17% in 127.31 ms)
- uuid_short: 0.1020 (11.53% in 113.00 ms)
| Model | Raw Size | Raw→Tok | Tok+Gzip | MsgPack | Tok+Msg | Tok+Msg+Gzip | Tok Enc/Dec | Msg Enc/Dec | Tok+Msg Enc/Dec |
|---|---|---|---|---|---|---|---|---|---|
| model1.json | 83.8 KB | 64.6% | 55.5% | 60.3% | 76.3% | 55.5% | 86.5/74.6 ms | 1.3/0.8 ms | 87.1/72.3 ms |
| model2.json | 134.4 KB | 65.7% | 51.8% | 61.3% | 77.2% | 55.7% | 103.9/105.9 ms | 0.3/0.4 ms | 104.2/106.5 ms |
| model3.json | 148.7 KB | 66.9% | 56.5% | 62.7% | 78.0% | 57.4% | 113.4/116.5 ms | 0.3/0.3 ms | 113.7/115.0 ms |
| model4.json | 33.1 KB | 69.9% | 45.6% | 64.1% | 82.2% | 46.2% | 28.1/28.6 ms | 0.2/0.1 ms | 28.1/27.5 ms |
| Average | - | 66.8% | 52.4% | 62.1% | 78.4% | 53.7% | 82.9/81.4 ms | 0.53/0.40 ms | 83.28/80.3 ms |
Key:
- Raw→Tok: Tokenization compression ratio
- Tok+Gzip: Tokenized with Gzip compression
- MsgPack: MessagePack compression ratio
- Tok+Msg: Combined tokenization + MessagePack
- Tok+Msg+Gzip: Best compression (tokenization + MessagePack + Gzip); see the pipeline sketch below
- Enc/Dec: Encoding/Decoding performance in milliseconds
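The combined pipelines above can be assembled from the tokenizer plus standard tooling. The sketch below shows a Tok+Gzip round trip using Node's built-in `zlib`; swapping `JSON.stringify` for a MessagePack encoder (for example the `@msgpack/msgpack` package) gives the Tok+Msg and Tok+Msg+Gzip variants. The sample payload and the dictionary-sharing strategy are assumptions, not part of this library.

```js
// Sketch of a Tok+Gzip pipeline (assumed setup; a MessagePack encoder could replace
// JSON.stringify for the Tok+Msg and Tok+Msg+Gzip variants).
import zlib from "node:zlib";
import { generateDictionary, tokenize, detokenize, TokenizationMethod } from "@docamz/json-tokenizer";

const keys = ["users", "name", "age", "role"];
const dict = generateDictionary(keys, { method: TokenizationMethod.BASE64 });

const payload = {
  users: [
    { name: "Alice", age: 30, role: "admin" },
    { name: "Bob", age: 25, role: "user" }
  ]
};

// Encode: tokenize keys, serialize, then gzip.
const compressed = zlib.gzipSync(JSON.stringify(tokenize(payload, dict.forward)));

// Decode: gunzip, parse, then restore the original keys.
// The receiving side needs the same dictionary (or at least dict.reverse).
const restored = detokenize(JSON.parse(zlib.gunzipSync(compressed).toString("utf-8")), dict.reverse);

console.log(Buffer.byteLength(JSON.stringify(payload)), compressed.length); // raw vs compressed bytes
console.log(JSON.stringify(restored) === JSON.stringify(payload));          // true
```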
MIT License © 2025 DocAmz