
GenesisL1 Forest (GL1F)

Link: https://GL1F.com

A browser-only studio for:

  • training GBDT models in-browser: regression, binary, multiclass, and multilabel classification
  • deploying models on GenesisL1 as ERC-721 Model NFTs
  • on-chain inference (view call + optional paid tx)
  • listing and buying models in the AI store
  • on-chain Terms + Creative Commons license enforcement
  • on-chain title-word search index

Run locally

You must serve this folder over HTTP (not file://) so module imports work.

Option A: Python

python3 -m http.server 8080

Open http://localhost:8080 in your browser.

Network

GenesisL1:

Notes

  • Model bytes are stored as chunk contracts; the pointer table and chunks use the runtime magic "GL1C".
  • Model format "GL1F" v1 is used for regression + binary classification.
  • Model format "GL1F" v2 is used for multiclass + multilabel classification (vector-output logits).
  • Model metadata (title/description/icon/features) is stored on-chain in the NFT.

GenesisL1 Forest Model (GL1F) — Architecture & Design

This document describes the model itself (not the UI or the trainer servers): the GL1F formats, inference rules, and the training procedure used to produce deterministic, on-chain-friendly Gradient Boosted Decision Trees (GBDT).


1) Design goals

Deterministic execution

The model must produce identical predictions across:

  • browser (WebWorker + local inference),
  • Python trainer,
  • C++ trainer,
  • EVM on-chain runtime.

Determinism is achieved by:

  • seeded RNG (xorshift32; sketched below) for all randomized choices,
  • fixed split/search procedure,
  • explicit rounding rules compatible with JavaScript Math.round,
  • fixed model layout (no pointer-heavy structures).
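
A minimal Python sketch of the xorshift32 step (Marsaglia's classic 13/17/5 variant; the shift constants and how the trainers map the raw stream onto shuffles and samples are assumptions here, not taken from the GL1F sources):

import math

class XorShift32:
    """Deterministic 32-bit xorshift PRNG (illustrative sketch)."""

    def __init__(self, seed: int):
        # State must be nonzero; mask to 32 bits.
        self.state = (seed & 0xFFFFFFFF) or 1

    def next_u32(self) -> int:
        x = self.state
        x ^= (x << 13) & 0xFFFFFFFF
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
        self.state = x
        return x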

On-chain friendliness

The on-chain runtime must be:

  • bounded-time (no unbounded recursion, no dynamic tree growth at inference),
  • integer-only operations (no floats on-chain),
  • compact enough to store as EVM bytecode chunks.

Solutions:

  • fixed-depth complete binary trees,
  • int32 quantization (Q units) for thresholds and leaf values,
  • simple xQ > thrQ branching,
  • model bytes chunking for on-chain storage (GL1C chunks + pointer table).

2) What the model is

GL1F is a GBDT ensemble of fixed-depth, axis-aligned decision trees.

There are two model versions:

v1 (scalar output)

Used for:

  • regression
  • binary classification (outputs a single logit)

Output: a single integer value scoreQ in Q-units.

v2 (vector output)

Used for:

  • multiclass classification (outputs logits per class)
  • multilabel classification (outputs independent logits per label)

Output: an array of integers logitsQ[k] in Q-units (one per class/label).


3) Numeric representation (Q-units)

The model stores all thresholds and leaf values as signed 32-bit integers (int32) representing:

valueQ = round(valueFloat * scaleQ)
valueFloat ≈ valueQ / scaleQ

Where:

  • scaleQ is stored in the model header (uint32).
  • Features at inference are quantized the same way:
    • xQ = round(x * scaleQ) clamped into int32 range.

Rounding rule (important)

All implementations (JS/Python/C++) match:

Math.round(x)  <=>  floor(x + 0.5)   for finite x

This matches JavaScript Math.round, including negative half cases (e.g. -1.5 → -1).
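
In Python this rule can be reproduced with floor(x + 0.5) (a minimal sketch; note that Python's built-in round uses banker's rounding and would not match):

import math

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def js_round(x: float) -> int:
    # Matches JavaScript Math.round for finite x: floor(x + 0.5),
    # so -1.5 rounds to -1, not -2.
    return math.floor(x + 0.5)

def quantize(x: float, scale_q: int) -> int:
    # xQ = round(x * scaleQ), clamped into int32 range as in section 3.
    return max(INT32_MIN, min(INT32_MAX, js_round(x * scale_q)))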

Choosing scaleQ

Higher scaleQ increases precision but risks overflow in:

  • quantized inputs xQ
  • accumulated logits during inference

Practical guidance:

  • keep abs(x * scaleQ) safely within ≈ 2.147e9 (int32 range),
  • keep accumulated logits within JS safe integer range (≈ 2^53) if doing off-chain inference in JS numbers,
  • on-chain inference uses int256 accumulators, so it is typically safe for larger totals.

4) Tree structure

All trees are complete binary trees of fixed depth.

Let:

  • L = 2^depth leaves
  • I = 2^depth - 1 internal nodes

The tree is stored in arrays:

  • internal nodes indexed idx = 0 .. I-1
  • leaves indexed idx = I .. I+L-1

Traversal starts at idx=0 (root). At each level:

goRight = (xQ[feature] > thresholdQ)
idx = goRight ? (2*idx + 2) : (2*idx + 1)

After depth decisions, idx points to a leaf node:

leafIndex = idx - I
leafValueQ = leaf[leafIndex]
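
A Python sketch of this traversal, assuming the tree has already been decoded into flat arrays (the array representation is illustrative, not the wire format):

def eval_tree(features_q, depth, feature_idx, threshold_q, leaf_q):
    # feature_idx[i], threshold_q[i]: internal nodes, i = 0 .. I-1
    # leaf_q[j]: leaf values, j = 0 .. L-1
    n_internal = (1 << depth) - 1  # I
    idx = 0
    for _ in range(depth):
        go_right = features_q[feature_idx[idx]] > threshold_q[idx]
        idx = 2 * idx + 2 if go_right else 2 * idx + 1
    return leaf_q[idx - n_internal]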

“Forced leaf” nodes

If the trainer cannot find a valid split (or leaf constraints prevent splitting), it fills the remaining subtree with:

  • feature = 0
  • threshold = INT32_MAX
  • identical leaf values copied downwards

This preserves fixed-depth layout and deterministic inference.


5) GL1F binary formats

All fields are little-endian.

5.1) GL1F v1 header (24 bytes)

Offset  Size  Type   Meaning
0       4     bytes  magic "GL1F"
4       1     u8     version = 1
5       1     u8     reserved
6       2     u16    nFeatures
8       2     u16    depth
10      4     u32    nTrees
14      4     i32    baseQ (base prediction)
18      4     u32    scaleQ
22      2     u16    reserved

After the header come nTrees trees.

5.2) GL1F v1 tree layout

For each tree:

  1. Internal nodes: I * 8 bytes. Each internal node is 8 bytes:
     • u16 featureIndex at offset +0
     • i32 thresholdQ at offset +2
     • 2 bytes padding/reserved at offset +6
  2. Leaves: L * 4 bytes. Each leaf is:
     • i32 leafValueQ

So each tree is:

perTreeBytes = I*8 + L*4
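
A minimal Python sketch that decodes this layout with the struct module (field names are illustrative; error handling omitted):

import struct

def parse_gl1f_v1(buf: bytes):
    # 24-byte little-endian header, per the table in 5.1.
    magic, version, _r0, n_features, depth, n_trees, base_q, scale_q, _r1 = \
        struct.unpack_from("<4sBBHHIiIH", buf, 0)
    assert magic == b"GL1F" and version == 1
    n_internal, n_leaves = (1 << depth) - 1, 1 << depth
    trees, off = [], 24
    for _ in range(n_trees):
        # Internal nodes: u16 featureIndex, i32 thresholdQ, 2 bytes padding.
        internal = [struct.unpack_from("<Hi2x", buf, off + 8 * i)
                    for i in range(n_internal)]
        off += 8 * n_internal
        # Leaves: i32 leafValueQ each.
        leaves = list(struct.unpack_from("<%di" % n_leaves, buf, off))
        off += 4 * n_leaves
        trees.append((internal, leaves))
    return {"depth": depth, "base_q": base_q, "scale_q": scale_q,
            "n_features": n_features, "trees": trees}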

5.3) GL1F v2 header

v2 is similar but adds:

  • vector output size (nClasses)
  • base logits per class/label
  • trees arranged class-major

Header:

Offset  Size        Type   Meaning
0       4           bytes  magic "GL1F"
4       1           u8     version = 2
5       1           u8     reserved
6       2           u16    nFeatures
8       2           u16    depth
10      4           u32    treesPerClass
14      4           i32    reserved
18      4           u32    scaleQ
22      2           u16    nClasses (>= 2)
24..    4*nClasses  i32[]  baseLogitsQ[k]

After base logits come treesPerClass * nClasses trees, stored class-major:

for class k in 0..nClasses-1:
  for t in 0..treesPerClass-1:
    write tree(k,t)

Tree layout is identical to v1 (internal nodes + leaves).
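
The byte offset of tree(k, t) in a v2 blob follows directly from this layout; a sketch under the sizes given above:

def tree_offset_v2(depth, n_classes, trees_per_class, k, t):
    # 24-byte fixed header, then 4 * nClasses bytes of baseLogitsQ.
    header = 24 + 4 * n_classes
    n_internal, n_leaves = (1 << depth) - 1, 1 << depth
    per_tree = 8 * n_internal + 4 * n_leaves
    # Class-major: all of class k's trees precede class k+1's.
    return header + (k * trees_per_class + t) * per_tree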


6) Inference rules

6.1) v1 inference (regression / binary logit)

Compute:

accQ = baseQ
for each tree:
  accQ += leafValueQ(tree, featuresQ)
return accQ

Interpretation:

  • Regression: y ≈ accQ / scaleQ
  • Binary classification:
    • logit = accQ / scaleQ
    • probability = sigmoid(logit)
    • class = prob >= 0.5 (equivalently logit >= 0)
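
Putting the pieces together, a v1 inference sketch in Python (reusing the decoded structure from the parsing sketch in 5.2):

import math

def infer_v1(model, features_q):
    acc_q = model["base_q"]
    n_internal = (1 << model["depth"]) - 1
    for internal, leaves in model["trees"]:
        idx = 0
        for _ in range(model["depth"]):
            feat, thr = internal[idx]
            idx = 2 * idx + 2 if features_q[feat] > thr else 2 * idx + 1
        acc_q += leaves[idx - n_internal]
    return acc_q

def binary_probability(acc_q, scale_q):
    # sigmoid(accQ / scaleQ); predicted class is (acc_q >= 0).
    return 1.0 / (1.0 + math.exp(-acc_q / scale_q))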

6.2) v2 inference (multiclass / multilabel)

For each class/label k:

logitsQ[k] = baseLogitsQ[k] + sum_t leafValueQ(tree(k,t), featuresQ)

Interpretation:

  • Multiclass: argmax_k logitsQ[k]
    • optional probabilities: softmax(logitsQ/scaleQ)
  • Multilabel: independent per label:
    • p_k = sigmoid(logitsQ[k] / scaleQ)
    • label active if p_k >= threshold (often 0.5 => logit >= 0)
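
Both decision rules in a short Python sketch (given the integer logitsQ array produced by v2 inference):

import math

def multiclass_predict(logits_q):
    # argmax over integer logits; no float conversion needed.
    return max(range(len(logits_q)), key=lambda k: logits_q[k])

def multilabel_predict(logits_q, scale_q, threshold=0.5):
    # Independent sigmoid per label; threshold 0.5 <=> logitQ >= 0.
    probs = [1.0 / (1.0 + math.exp(-q / scale_q)) for q in logits_q]
    return [p >= threshold for p in probs]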

6.3) Overflow considerations

  • Stored values are int32.
  • Accumulation:
    • Browser local inference uses JS numbers for v2 logits to avoid int32 wrap.
    • On-chain runtime should use int256 accumulators.

7) Training procedure (GBDT)

Training produces GL1F bytes by boosting fixed-depth trees.

7.1) Dataset split

Rows are shuffled by a seeded RNG (xorshift32) and split into:

  • train
  • validation
  • test

A stratified split is used for single-label classification when enabled.

7.2) Feature sub-sampling

At each node split, a random subset of features is sampled:

  • colsample = round(sqrt(nFeatures)) (at least 1)

This reduces correlation and speeds up search while keeping determinism via seeded RNG.

7.3) Candidate thresholds via histogram binning

For each candidate feature, a bin index is computed per row using one of two schemes:

Linear binning

b = floor( ((x - min) / range) * BINS )

Quantile binning

  • precompute (BINS-1) thresholds per feature from a deterministic sample of the train split,
  • bin is found by binary-searching thresholds.

For each bin boundary b = 0..BINS-2, the threshold used is:

  • linear: min + range * ((b+1)/BINS)
  • quantile: thr[b]

The stored thresholdQ is round(threshold * scaleQ).
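
A Python sketch of both schemes (the clamping at the edges and the tie handling in the binary search are assumptions, chosen to stay consistent with the xQ > thrQ branch rule):

import bisect
import math

def linear_bin(x, lo, rng, bins):
    # b = floor(((x - min) / range) * BINS), clamped to 0 .. BINS-1
    # (the clamp is an assumption for x at the extremes).
    b = math.floor(((x - lo) / rng) * bins)
    return max(0, min(bins - 1, b))

def quantile_bin(x, thresholds):
    # thresholds: (BINS-1) ascending cut points for this feature.
    # Count of thresholds strictly below x, so rows with x > thr[b]
    # land to the right of boundary b.
    return bisect.bisect_left(thresholds, x)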

7.4) Regression objective (squared loss)

  • Base prediction: mean of y on training rows:
    baseQ = round(mean(y_train) * scaleQ)
    
  • Residual: r = y - pred
  • Leaf weight for a node: mean residual scaled by learning rate:
    leaf = lr * mean(residuals_in_leaf)
    leafQ = round(leaf * scaleQ)
    
  • Split score: minimize SSE (sum of squared errors) of residuals (see the sketch after this list):
    SSE = SSE_left + SSE_right
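
The node SSE can be evaluated in O(1) from running sums via sum(r^2) - (sum r)^2 / n; whether the trainers use exactly this decomposition is an assumption, but the quantity scored is the one above:

def sse_split_score(sum_l, sumsq_l, n_l, sum_r, sumsq_r, n_r):
    # SSE of each side from residual sums: sum(r^2) - (sum r)^2 / n.
    sse_l = sumsq_l - (sum_l * sum_l) / n_l
    sse_r = sumsq_r - (sum_r * sum_r) / n_r
    # Lower combined SSE is a better split.
    return sse_l + sse_r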
    

7.5) Binary classification objective (log loss)

  • Base logit: log-odds of (optionally weighted) positive rate:
    p0 = clamp(mean(y_train), [1e-6, 1-1e-6])
    baseLogit = log(p0/(1-p0))
    baseQ = round(baseLogit * scaleQ)
    
  • Uses Newton-style updates:
    grad = (y - p)
    hess = p*(1-p)
    
    (optionally multiplied by per-row weights)
  • Leaf weight:
    w = lr * sum(grad) / (sum(hess) + lambda)
    
    with lambda = 1.0.
  • Split gain:
    gain = G_L^2/(H_L+λ) + G_R^2/(H_R+λ) - G_P^2/(H_P+λ)
    
    where G is sum of gradients and H sum of hessians for the node.

A split is accepted only if gain > 0 and both sides have at least minLeaf rows.
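
The leaf-weight and gain formulas above, transcribed into a Python sketch:

import math

LAMBDA = 1.0

def leaf_weight_q(lr, grad_sum, hess_sum, scale_q):
    # w = lr * sum(grad) / (sum(hess) + lambda), quantized with the
    # JS-compatible rounding rule from section 3.
    w = lr * grad_sum / (hess_sum + LAMBDA)
    return math.floor(w * scale_q + 0.5)

def split_gain(g_l, h_l, g_r, h_r):
    # gain = G_L^2/(H_L+λ) + G_R^2/(H_R+λ) - G_P^2/(H_P+λ)
    g_p, h_p = g_l + g_r, h_l + h_r
    return (g_l * g_l / (h_l + LAMBDA)
            + g_r * g_r / (h_r + LAMBDA)
            - g_p * g_p / (h_p + LAMBDA))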

7.6) Multiclass classification

Vector-output boosting (v2).

  • Base logits: log of class priors (with numerical safeguards).
  • Each boosting round trains one tree per class using class-specific gradients/hessians computed from current softmax probabilities (sketched below).

Trees are stored class-major:

  • treesPerClass rounds,
  • total trees = treesPerClass * nClasses.
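
A sketch of per-class gradients/hessians from softmax probabilities, using the same (y - p) sign convention as the binary objective in 7.5 (the exact form used by the trainers is an assumption):

import math

def softmax_grad_hess(logits, y_class):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    p = [e / total for e in exps]
    # grad_k = y_k - p_k, hess_k = p_k * (1 - p_k), per class k.
    grad = [(1.0 if k == y_class else 0.0) - p[k] for k in range(len(p))]
    hess = [pk * (1.0 - pk) for pk in p]
    return grad, hess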

7.7) Multilabel classification

Also v2.

  • Each label is treated as an independent logistic head.
  • Base logits per label from label-wise positive rates.
  • Each boosting round trains one tree per label using the same binary Newton scheme.

7.8) Early stopping and LR schedules

Training can optionally:

  • early-stop based on validation metric (patience),
  • adjust learning rate (e.g., plateau schedule).

Important: because early stopping decisions depend on floating-point metric values, absolute bit-identical results across machines require:

  • the same metric computations,
  • and no ties/near-ties where extremely tiny float differences flip the “best” iteration.

In practice, the implementation is designed so that JS/Python/C++ produce bit-identical model bytes given identical inputs and parameters (validated in parity tests).


8) Reproducibility & cross-implementation parity

If you want byte-identical GL1F outputs across:

  • WebWorker (JS),
  • Python,
  • C++,

ensure:

  1. same feature order and same selected columns,
  2. same class/label ordering (binary/multiclass),
  3. same seed, splitTrain/splitVal, refitTrainVal, etc.,
  4. same binning mode and bins/quantileSamples,
  5. compare core model bytes (exclude any optional GL1X footer).

9) Optional packaging: GL1X footer

A .gl1f file may contain:

  • core GL1F model bytes
  • followed by an optional GL1X footer with JSON metadata (mint info, feature names, etc.)

This footer is not used for inference, and should be stripped before computing modelId (so deployment matches the on-chain runtime expectations).


10) Limitations (model-level)

  • Numeric features only (no native categorical handling).
  • Missing values are typically dropped at preprocessing time.
  • Fixed depth (no leaf-wise growth like LightGBM).
  • No monotonic constraints (can be added later if desired).
  • No per-feature normalization stored in-model (must be handled by the data pipeline).

11) Why this architecture works well on-chain

  • Fixed-depth trees => predictable gas and runtime complexity.
  • Integer thresholds/leaf values => deterministic behavior across EVM nodes.
  • Compact contiguous layout => efficient bytecode storage and fast decoding.
  • Vector-output v2 => multiclass/multilabel without separate model assets.

License

This project is released under the MIT License. See the LICENSE file for details.
