### Minimal ensemble code

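This code runs a simple bagging-style experiment: for each of `N_REPEATS` repeats it draws a random train/val/test split (via `train_test_split`, not K-fold), lets an LLM-driven agent build and edit a decision tree on the train/val parts, and records the test score. Use it for quick idea testing with LLM-assisted tree generation.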

In [None]:
# Local dataset settings
DATA_PATH = "mydata/data.csv"
TARGET_COL = "PM2.5"
TASK_TYPE = "regression"  # regression | binary | multiclass
RANDOM_STATE = 42
TEST_SIZE = 0.2
VAL_SIZE = 0.2
N_REPEATS = 1

dataset_name = "pm25_local"
results_path = "data/tree_scores.pm25_local.json"

In [None]:
import os
import re
import json
import numpy as np
import pandas as pd
import smolagents
from huggingface_hub import login
import proxy_api_model
import prompting
import tree_agent
from sklearn.model_selection import train_test_split
from task import metric_func_by_task

login(token="")
# V-- this calls an LLM over an OpenAI-like API (deepseek-reasoner here). To use another model, see https://smolagents.org/docs/agents-guided-tour/
model = proxy_api_model.ProxyAPIModel(
    model_id="deepseek-reasoner",
    api_base="https://api.deepseek.com/v1",  # <-- https://your/openai-like/api/v1/chat/completions
    api_key=os.environ.get("DEEPSEEK_API_KEY", ""),  # <-- use your token; the env var name is an assumption, never hard-code the key
    max_new_tokens=1024 * 8,
    callback=lambda msg, **etc: print(  # print model thoughts before code
        re.sub(r"<code>.*?</code>", "<code omitted>", msg.content, flags=re.DOTALL)
    ),
)

# Load local dataset
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"DATA_PATH not found: {DATA_PATH}")

df = pd.read_csv(DATA_PATH)

if TARGET_COL not in df.columns:
    matches = [c for c in df.columns if c.lower() == TARGET_COL.lower()]
    if matches:
        TARGET_COL = matches[0]
    else:
        raise ValueError(
            f"Target column '{TARGET_COL}' not found in data columns: {list(df.columns)}"
        )

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL].to_numpy()

task_type = TASK_TYPE
metric_func = metric_func_by_task[task_type]
metric_name = prompting.metrics_by_task[task_type]

cat_cols = [c for c in X.columns if X[c].dtype == "object"]
num_cols = [c for c in X.columns if c not in cat_cols]

# Dataset description to help the LLM form hypotheses
feature_list = ", ".join(X.columns)
cat_list = ", ".join(cat_cols) if cat_cols else "none"
num_list = ", ".join(num_cols) if num_cols else "none"

dataset_desc = f"""
<dataset>
Your task is to predict {TARGET_COL}.
Size: {len(X)} total (full dataset)
Feature columns: {feature_list}
Categorical columns: {cat_list}
Numeric columns: {num_list}
Targets: {TARGET_COL} (float)
Metric: {metric_name}
</dataset>
""".strip()

| Dataset | Rows | ID |
|---|---|---|
| airfoil_self_noise | 1503 | 363612 |
| anneal | 898 | 363614 |
| Another-Dataset-on-used-Fiat-500 | 1538 | 363615 |
| blood-transfusion-service-center | 748 | 363621 |
| concrete_compressive_strength | 1030 | 363625 |
| credit-g | 1000 | 363626 |
| diabetes | 768 | 363629 |
| Fitness_Club | 1500 | 363671 |
| hazelnut-spread-contaminant-detection | 2400 | 363674 |
| healthcare_insurance_expenses | 1338 | 363675 |
| Is-this-a-good-customer | 1723 | 363682 |
| Marketing_Campaign | 2240 | 363684 |
| maternal_health_risk | 1014 | 363685 |
| qsar-biodeg | 1054 | 363696 |
| QSAR_fish_toxicity | 907 | 363698 |
| website_phishing | 1353 | 363707 |
| MIC | 1699 | 363711 |


In [None]:
test_scores = []
for repeat_index in range(N_REPEATS):
    print("Beginning repeat", repeat_index)
    split_seed = RANDOM_STATE if N_REPEATS == 1 else RANDOM_STATE + repeat_index

    X_train_full, X_test, y_train_full, y_test = train_test_split(
        X,
        y,
        test_size=TEST_SIZE,
        random_state=split_seed,
        stratify=None if task_type == "regression" else y,
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_full,
        y_train_full,
        test_size=VAL_SIZE,
        random_state=split_seed,
        stratify=None if task_type == "regression" else y_train_full,
    )

    result = tree_agent.TreeAgent(model=model).run(
        task=f"""
Build the optimal decision tree for the '{dataset_name}' dataset.
You are given access to 4 data variables in your python environment:
 - X_train, X_val are pandas dataframes with named feature columns (see below) that may need preprocessing;
 - y_train, y_val are numpy arrays (1d) with targets, also described below;

Dataset description (use it to form hypotheses):
{dataset_desc}

Here's one way you could construct a baseline tree before you begin editing it manually:
{prompting.starter_snippets_by_task[task_type]}

Now begin: view the data variables, preprocess as necessary, train a baseline tree, then propose the first hypothesis and start improving.
Focus on drawing conclusions from data, looking at the tree (e.g. via print) and using your own intuition about the problem for manual tree edits.
Quality is more important than speed: take as many steps as you need to get the best tree.
""".strip(),
        additional_args=dict(
            X_train=X_train.copy(),
            y_train=y_train.copy(),
            X_val=X_val.copy(),
            y_val=y_val.copy(),
        ),
    )

    y_pred_i = result["model"].predict(result["preprocess_features"](X_test.copy()))
    if task_type == "multiclass":  # normalize for logloss
        y_pred_i = y_pred_i / y_pred_i.sum(axis=-1, keepdims=True)
    test_score = metric_func(y_test, y_pred_i)
    print(f"Test {metric_name} score #{repeat_index}: {test_score:.5f}")
    test_scores.append(test_score)
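    # Make sure the output directory exists before saving the scores so far
    os.makedirs(os.path.dirname(results_path) or ".", exist_ok=True)
    with open(results_path, "w") as f:
        json.dump(test_scores, f, ensure_ascii=False, indent=2)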

Beginning repeat 0
Inferred task type: multiclass


Thought: I will inspect the raw data to understand shapes, class balance, feature types, and missing/categorical values. Hypothesis: Many categorical features with "not_applicable" need one-hot encoding; numeric features seem already numeric. I will print basic summaries to guide preprocessing choices and a baseline model setup.

<code omitted>


Thought: The earlier dtype check missed pandas 'category' dtype. Hypothesis: Many columns are categorical with dtype 'category' (not 'object'), so I should detect non-numeric dtypes and one-hot encode them. I will:
- Identify numeric vs non-numeric columns using pandas is_numeric_dtype.
- Build a preprocessing function that one-hot encodes non-numeric columns based on training data, aligns validation columns, and preserves numeric columns as-is.
- Train a baseline DecisionTreeClassifier and evaluate log loss. Then print the editable tree to inspect structure for targeted improvements.

<code omitted>
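
The agent's code is omitted from this log. As an illustration only, the preprocessing plan above (detect non-numeric columns with `is_numeric_dtype`, one-hot encode them, align validation columns to the training schema) could be sketched as follows. The helper names `fit_dummy_columns` and `preprocess_features` are hypothetical and are not taken from the agent's actual code.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fit_dummy_columns(X_train: pd.DataFrame) -> list:
    """Learn the one-hot column schema from the training frame (hypothetical helper)."""
    # Anything that is not numeric (object, category, ...) is treated as categorical.
    cat_cols = [c for c in X_train.columns if not is_numeric_dtype(X_train[c])]
    return list(pd.get_dummies(X_train, columns=cat_cols, dtype=float).columns)


def preprocess_features(X: pd.DataFrame, dummy_columns: list) -> pd.DataFrame:
    """One-hot encode X and align it to the training schema (hypothetical helper)."""
    cat_cols = [c for c in X.columns if not is_numeric_dtype(X[c])]
    encoded = pd.get_dummies(X, columns=cat_cols, dtype=float)
    # Unseen categories are dropped; categories missing from X become all-zero columns.
    return encoded.reindex(columns=dummy_columns, fill_value=0.0)
```

Because the encoding is done in a single `pd.get_dummies` call and aligned only along columns, the row count and index of the input frame are preserved.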


Thought: The preprocessor produced an unexpected row count (702 vs 478). Hypothesis: Concatenation or index alignment across per-column dummy DataFrames caused an index union (likely due to non-unique indices or misalignment), inflating rows. I will debug indices for consistency, then switch to a simpler, robust one-hot encoding that builds all categorical dummies at once using pd.get_dummies, fits on train to capture dummy columns, and reindexes validation to match. This should keep row counts correct and aligned with y. I will then retrain a baseline tree and inspect it.

<code omitted>


Thought: I will inspect the trained tree’s leaves: how many train/val samples route to each leaf and what their class distributions are. Hypothesis: Some leaves are overly confident (0/1 probabilities), which can hurt log-loss on validation. If so, I will apply Laplace smoothing to leaf probability vectors using the training counts per leaf to improve calibration. I’ll also check per-leaf validation log-loss to target problematic leaves for potential pruning/grafting next.

<code omitted>
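
For reference, Laplace smoothing replaces a leaf's class probability `n_k / n` with `(n_k + alpha) / (n + alpha * K)`, where `n_k` is the number of training samples of class `k` in the leaf and `K` is the number of classes. The sketch below applies this to a plain scikit-learn tree rather than to the agent's editable tree, whose API is not shown in this log. The function name `smoothed_leaf_proba` is hypothetical, and the tree is assumed to have been fit on `X_train`, `y_train` without sample weights.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def smoothed_leaf_proba(clf, X_train, y_train, X, alpha=1.0):
    """Predict probabilities from Laplace-smoothed per-leaf training counts."""
    classes = clf.classes_
    train_leaves = clf.apply(X_train)  # leaf id reached by every training row
    counts = {}
    for leaf in np.unique(train_leaves):
        y_leaf = y_train[train_leaves == leaf]
        counts[leaf] = np.array([(y_leaf == c).sum() for c in classes], dtype=float)
    # Look up the training counts of the leaf each row of X falls into.
    leaf_counts = np.vstack([counts[leaf] for leaf in clf.apply(X)])
    smoothed = leaf_counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

With `alpha > 0` no class ever gets probability exactly 0, so a single unexpected validation sample in a pure leaf no longer produces an extreme log-loss penalty.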


Thought: Smoothing clearly improved validation log-loss drastically. Next, I will (a) fine-tune the global Laplace smoothing alpha, (b) identify problematic leaves after smoothing, and (c) try a targeted local retraining on the mixed leaf id=12 with a very shallow subtree trained only on its training data. I’ll keep snapshots and accept changes only if validation log-loss improves. I’ll also briefly test 2-3 sensible tree hyperparameter variants combined with the best smoothing to ensure we’re not leaving easy gains on the table.

<code omitted>


Thought: The largest remaining contributor to validation log-loss is leaf 15, which is a pure training leaf (all class 4) but contains one class-2 sample in validation. Since local retraining at node 15 won’t help (no training variability), I’ll try adaptive Laplace smoothing: use higher alpha for pure leaves to reduce overconfidence, and base alpha for mixed leaves. Hypothesis: This will reduce the penalty on unexpected classes at pure leaves without harming well-supported mixed leaves. I’ll:
- Recompute per-leaf validation losses for the current edited tree.
- Try adaptive smoothing with alpha_base=0.15 (best global) and alpha_pure in {0.5, 1.0, 2.0}.
- Keep the best if validation log-loss improves.

<code omitted>



Thought: Leaf 15 contributes a large chunk of the remaining validation log-loss due to being a pure class-4 training leaf with one class-2 sample in validation. Hypothesis: Increasing smoothing only for this leaf (id=15) to alpha=1.0 will reduce overconfidence there and improve overall validation log-loss, while keeping the globally optimal alpha=0.15 for all other leaves to avoid harming large, well-calibrated leaves (e.g., id=5 with many samples). I will implement per-leaf smoothing with a custom alpha map (id 15 -> 1.0; others -> 0.15), evaluate, and accept if it improves validation. Then I will finalize, packaging the preprocessing function and the final editable tree.

<code omitted>


Test Log Loss (sklearn.metrics.log_loss) score #0: 0.08062
Beginning repeat 1
Inferred task type: multiclass


Thought:
I will start by inspecting the raw training/validation data: shapes, dtypes, unique values for categorical features, and class distribution for y. Hypothesis: the dataset has many categorical columns with "not_applicable" markers and mixed types; I will need to one-hot encode categoricals and keep numeric columns as-is. I will also check that train/val have consistent categories to ensure aligned one-hot columns.

I will print:
- Shapes and dtypes
- Head of X_train
- Unique classes in y_train and counts
- For a subset of categorical columns, show unique values to plan preprocessing

<code omitted>


Thought:
Most features are pandas "category" dtype, not "object" — my earlier object-based detection missed them. I will:
- Preprocess by one-hot encoding all categorical (category dtype) features; keep numeric as-is.
- Align validation columns to training dummy columns.
- Train a small set of baseline DecisionTreeClassifier models (a few sensible max_depth/min_samples settings) to get an initial, not overfit model.
- Pick the best by validation log_loss, convert to editable Tree, and print its structure and diagnostics to guide manual improvements.

I expect features like condition, temper_rolling, and surface-quality to be predictive. I will also print feature importances to see which dummies matter most.

<code omitted>


Thought:
Our baseline tree already achieves strong validation log loss (0.0839). Before attempting growth, I’ll analyze the current tree for overconfident or weak branches. Hypothesis: some leaves are too pure on train (e.g., predict [0,0,0,0,1]) and may harm calibration/generalization. I’ll compute, for each internal node, the validation log loss for samples under that node with:
- current subtree predictions, vs
- a hypothetical prune (replace the subtree with a leaf using the node’s training distribution).
If pruning improves validation loss, I’ll prune that node. I’ll print per-node diagnostics (train/val counts, current vs pruned val loss) and then iteratively apply beneficial prunes, keeping the best tree snapshot.

<code omitted>


Thought:
The previous pruning analysis failed because sklearn’s log_loss requires an explicit labels argument when y_true subsets don’t contain all classes. I will:
- Re-run evaluation and node-level pruning diagnostics with labels set to the global classes.
- If any internal nodes show better validation loss when pruned, I’ll greedily apply those prunes.
- Additionally, I’ll inspect leaf-level supports and consider simple Laplace smoothing (alpha=0.5) for leaf probabilities if pruning yields no gains, to reduce overconfidence on small leaves.

I expect few or no beneficial prunes given the already low validation loss, but smoothing might yield a small robustness gain.

<code omitted>
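
The `labels` issue mentioned above is easy to reproduce: when `y_true` for a subset contains fewer classes than the probability matrix has columns, `sklearn.metrics.log_loss` needs the full label set passed explicitly. A minimal sketch with made-up numbers (the class set `[0, 1, 2, 3, 4]` and the probabilities are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss

global_classes = np.array([0, 1, 2, 3, 4])  # assumed full label set

# Samples routed to one node: only classes 2 and 4 are present here.
y_subset = np.array([2, 4, 4])
proba_subset = np.array([
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.80],
    [0.05, 0.05, 0.30, 0.05, 0.55],
])

# Passing labels tells log_loss which column corresponds to which class.
subset_loss = log_loss(y_subset, proba_subset, labels=global_classes)
print(f"node-level log loss: {subset_loss:.4f}")
```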


Thought:
Targeted local growth: I’ll try improving specific internal nodes by retraining subtrees on the data routed to those nodes. Hypothesis: nodes on the right branch (ids 11, 12, 13, 18, 17) might benefit from small depth-2/3 subtrees using log-loss criterion and min_samples_leaf to avoid overfitting. I’ll:
- For each candidate node id, extract routed train/val indices.
- Train a small DecisionTreeClassifier on that subset and graft it with grow_subtree.
- Evaluate overall validation log loss; accept only if it improves over the current best. I’ll test a few sensible hyperparameters per node and apply the single best change greedily if it helps.

<code omitted>


Thought:
The previous local growth failed because the grafted subtree had leaf probability vectors with fewer classes than the global 5, causing inconsistent vector sizes. Fix: after growing any subtree, I’ll recompute leaf probabilities using global class counts over the samples routed to each leaf. This enforces consistent length/order and avoids relying on sklearn’s local classes_. I’ll retry targeted local growth on a few promising nodes (11, 12, 13, 18, 17) with small depth and min_samples_leaf. I will accept only if overall validation log_loss improves.

<code omitted>


Thought:
The previous attempt failed because node ids changed after grafting. I will:
- First, explicitly re-apply the beneficial graft at node 11 (depth=3, min_samples_leaf=5) on the baseline best_tree and update best scores.
- Then run a safe iterative local-growth loop that, at each iteration, recomputes the current tree’s internal node ids, tries small subtrees at nodes with sufficient samples, enforces global-length leaf probabilities after each graft, and accepts only changes that improve validation log loss.
- Finally, I’ll print the final tree and prepare the preprocess_features callable and final_answer payload.

<code omitted>


Test Log Loss (sklearn.metrics.log_loss) score #1: 0.32442
Beginning repeat 2
Inferred task type: multiclass


Thought:
I will start by exploring the data schema, target distribution, and basic missing patterns to form preprocessing hypotheses. Then I will design a preprocessing pipeline:
- Treat all object columns as categorical and one-hot encode them.
- Keep numeric columns as-is.
- Build a stable one-hot schema by computing dummy columns from the union of train and val categories (to ensure preprocess_features(X_val) matches training columns).
I will also compute class distribution to anticipate baseline performance and then prepare helper functions and the metric (log_loss). I will not train a model yet; this step is diagnostics and preprocessing setup.

<code omitted>


Thought:
The earlier preprocessing mistakenly treated only object dtype as categorical, but here most categoricals are pandas Categorical dtype. Hypothesis: properly one-hot encoding category dtype columns will expand features and allow the tree to learn. I will:
- Rebuild the dummy schema using pd.get_dummies on the combined train+val, which handles category dtype automatically.
- Update preprocess_features to reindex to that schema.
Then I will train a baseline DecisionTreeClassifier with conservative settings (max_depth=5, min_samples_leaf=10) to get a stable probability estimation for log-loss, evaluate, convert to editable_tree.Tree, and print the structure to plan improvements.

<code omitted>


Thought:
I will evaluate a naive baseline (global class prior) and analyze the current tree’s leaves on validation: how many samples route to each leaf, their class distributions, and per-leaf log-loss. Hypothesis: a few leaves are poorly calibrated or mis-splitting many validation samples; locally retraining/growing targeted subtrees at those leaves should improve validation log-loss. I will also try a few conservative baseline trees (varying max_depth and min_samples_leaf) to pick a stronger starting point before manual edits.

<code omitted>


Thought:
Two leaves (ids 8 and 4) show relatively high validation log-loss. Hypothesis: within each of these leaves there exist simple, stable splits (e.g., on condition, len, etc.) that separate classes better and reduce calibration error. I will:
- Grow small local subtrees at node 8 and node 4 using only the samples routed to them (train subset), with a tiny grid over max_depth and min_samples_leaf.
- For each grafted subtree, evaluate the full-model validation log-loss to choose improvements conservatively.
- Keep a snapshot of the best tree and revert if a change harms validation.

<code omitted>


Thought:
The error likely occurred because the locally grown subtree only saw a subset of classes, producing leaves with shorter probability vectors than the global tree expects. Hypothesis: after grafting, calling fill_values with the full training data will recompute leaf probabilities with the full class set and fix vector-size inconsistencies. I will:
- Implement a safe local growth routine that, after grow_subtree, calls repair() and fill_values(..., recompute_all=True, prune_empty=True).
- Try local growth at node 8 and node 4 separately, always starting from the same baseline tree to avoid node-id drift.
- Select the best improvement (if any) by validation log-loss and accept it.

<code omitted>


Thought:
The last attempt failed due to using repr(e). I’ll retry with a safer local-growth routine that avoids repr, and I’ll apply edits sequentially:
1) Optimize node 8 (the worst leaf by val log-loss among those with notable mass), accept if it improves validation.
2) Recompute diagnostics, then optimize node 4 on top of that improved tree, again accepting only if it improves validation.
I’ll ensure class-probability vector sizes remain consistent by calling repair and fill_values after grafting.

<code omitted>


Thought:
We improved validation log-loss markedly by growing a local subtree at node 8. The remaining weak spot is node 4 (7 val samples, mixed classes 1 and 2, val_ll≈0.604). Hypothesis: a single, interpretable split inside node 4 can separate the two classes better. I will:
- Extract train/val indices routed to node 4.
- Try candidate 1-split rules on the local features (binary dummies and a few numeric thresholds).
- For each candidate, compute leaf probabilities from the local train subset (with Laplace smoothing) and evaluate the resulting per-node val log-loss.
- Pick the best split and manually graft a subtree at node 4 using the chosen feature/threshold and the computed 5-class probability vectors.
- Accept only if overall validation log-loss improves.

<code omitted>


Thought:
We achieved a strong validation log-loss by improving nodes 8 and 4. I’ll now inspect per-leaf validation diagnostics on the current best tree to spot any remaining leaves with notable validation loss. Hypothesis: there may be one more leaf with mixed classes that can be improved via a small local graft. I will only accept changes that improve overall validation log-loss to avoid overfitting.

<code omitted>


Thought:
We have a strong tree after targeted local improvements:
- Proper one-hot encoding of pandas categorical dtypes increased feature space (84 columns).
- Baseline selection favored a compact tree (depth=3, min_samples_leaf=20) with Val log-loss ≈0.116.
- Local growth at node 8 (formability/steel/surface-finish) improved Val log-loss to ≈0.0709.
- Manual graft at node 4 (splitting by strength ≤ 525 within the mixed leaf) further reduced Val log-loss to ≈0.0508.
Diagnostics show no remaining leaf with sufficient validation mass and high loss to justify further edits without overfitting. I will finalize this model and provide the preprocessing function.

<code omitted>


Test Log Loss (sklearn.metrics.log_loss) score #2: 0.07000
Beginning repeat 3
Inferred task type: multiclass


Thought:
I will start by inspecting the raw training/validation data: shapes, dtypes, categorical vs numeric features, unique values counts and class distribution for y. Hypothesis: many features are categorical with a special "not_applicable" token. I need to confirm there are no NaNs and understand class imbalance to choose a baseline tree setup. I will print summary stats to guide preprocessing.

<code omitted>


Thought:
Our earlier dtype check missed that features are pandas 'category', not 'object'. Hypothesis: We should one-hot encode all categorical columns (category dtype) using a union of train+val categories to ensure consistent columns and fit a baseline decision tree with modest depth and class_weight='balanced'. I’ll try a small set of interpretable hyperparameters to get a good baseline, then convert to an editable Tree and inspect it.

<code omitted>


Thought:
I will verify the baseline tree’s log loss using the editable tree and the preprocessed (one-hot) data. Then, I’ll inspect leaf distributions (train/val counts and probabilities) to identify overconfident leaves. Hypothesis: Laplace smoothing on leaf probabilities will improve calibration and reduce log loss. I’ll apply alpha=1 smoothing to each leaf’s class distribution (computed from training samples routed to that leaf), update the tree, and re-evaluate log loss. I’ll also check for redundant splits where both children make identical predictions and prune them.

<code omitted>


Thought:
The smoothed-and-pruned baseline looks strong (val log loss 0.1231). Next hypothesis: the worst-validation-loss leaves are small and heterogeneous (e.g., leaves 16, 9, 6). Fitting tiny local subtrees at those leaves (depth 1–2, min_samples_leaf=3) using only the samples that route there can better separate classes and produce more calibrated probabilities. After each graft, I will re-apply Laplace smoothing (alpha=1) across all leaves and prune any redundant equal-children splits, then re-evaluate validation log loss. I will accept only improvements.

<code omitted>


Thought:
We have a strong smoothed-and-pruned tree (val log loss ~0.1231). I’ll pursue three targeted improvements:
1) Tune Laplace smoothing alpha globally to see if calibration improves further.
2) Carefully try local growth on small problematic leaves (ids 16, 9, 21, 6) with a lenient min_samples_leaf=2 and depth=1, accepting only if validation improves.
3) Try selective pruning of internal nodes (turn subtrees into a single smoothed leaf) to reduce overfitting, accepting only if validation improves. I’ll evaluate each change and keep the best.

<code omitted>


Thought:
Alpha tuning gave a large gain (val log loss 0.07394). The earlier error was from using locals(), which is forbidden. I will:
- Use the already selected alpha (alpha_best_a) directly.
- Try local growth on top problematic leaves with enough train samples (>=8) using depth=1 and min_samples_leaf=2.
- After each graft, re-smooth with alpha_best_a and prune redundant splits.
- Accept only if validation log loss improves.
Finally, I’ll report the resulting best scores and keep the best tree for the final answer.

<code omitted>


Thought:
We achieved a strong baseline with a depth-6 tree on one-hot encoded features. I hypothesized calibration and small-leaf overconfidence were limiting validation performance. I applied Laplace smoothing across leaves and found alpha=0.3 gave large log-loss gains. I then targeted the worst-validation-loss leaves and locally grew shallow subtrees (depth=1, min_samples_leaf=2) at nodes with enough training data (notably leaves 21, 16, 9), resmoothed (alpha=0.3) and pruned redundant splits. Each change was accepted only if it improved validation log loss. The final model achieves a substantial improvement while remaining interpretable.

<code omitted>


Test Log Loss (sklearn.metrics.log_loss) score #3: 0.07290
Beginning repeat 4
Inferred task type: multiclass


Thought:
I will inspect the raw data to understand dtypes, class balance, and categorical cardinalities. Hypothesis: many columns are categorical with a dominant “not_applicable” category; we should one-hot encode categoricals and keep numeric as-is. I’ll print basic stats to guide preprocessing choices.

<code omitted>


Thought:
The earlier dtype check misidentified categorical columns because they are pandas "category" dtype, not object. I will:
- Correctly detect categorical columns (category or object).
- Inspect constant columns in both train and val to decide drops.
- Build a preprocessing function: one-hot encode categoricals, keep numerics, drop constants, and align columns.
- Train a baseline DecisionTreeClassifier with conservative depth and class_weight='balanced' to handle imbalance.
- Evaluate log loss and print the editable tree to start manual analysis.

<code omitted>


Thought:
Our baseline tree is overconfident and underperforms the prior. Hypothesis: shallower/pruned trees with larger leaves and possibly entropy criterion (more suitable for probability estimation) and no class_weight will yield better-calibrated probabilities and lower validation log loss. I will try a small, targeted parameter sweep over depth, min_samples_leaf, criterion, class_weight, and minimal pruning (ccp_alpha). I’ll select the best by val log loss, convert to editable_tree, and inspect structure and leaf sizes to plan manual smoothing if still needed.

<code omitted>


Thought:
I will try targeted probability smoothing at small leaves to improve log loss calibration without changing the tree structure. Hypothesis: small leaves (n ≤ 25) are overconfident; applying Laplace smoothing (alpha=1) to their class proportions will reduce overconfidence and lower validation log loss. I’ll:
- Compute per-leaf train/val sizes and counts.
- Apply Laplace smoothing to leaves with n ≤ 25, keeping larger leaves unchanged.
- Update leaf node.value vectors accordingly.
- Recompute log loss and compare.
- If improved, keep; otherwise revert. I’ll also print per-leaf val log loss contributions to see which leaves dominate.

<code omitted>


Thought:
Leaf 4 contributes the largest portion of validation loss and mixes classes 1 and 2. Hypothesis: a small, local split within node 4 can better separate class 1 from class 2, reducing its log loss and overall validation loss. I will:
- Extract train/val indices routed to node 4.
- Train a tiny subtree (max_depth=2, min_samples_leaf=3) on those train samples.
- Evaluate partial val log loss on those routed val samples vs current leaf predictions.
- If improved, graft the subtree at node 4 and recompute overall log loss.

<code omitted>


Thought:
The previous graft failed because the local subtree’s class dimension (trained on only classes {1,2}) did not match the global class set (5 classes). Hypothesis: if I expand the local subtree’s leaf probability vectors to length 5 in the global class order [0,1,2,3,4], then grafting will work and reduce overall log loss significantly (since node 4’s partial val loss can go to ~0). I will:
- Rebuild the best local subtree for node 4 (we already identified it).
- Convert it to editable_tree and expand all node.value vectors to length 5 using the mapping from local classes to global classes.
- Graft at node 4, repair, and evaluate full train/val log loss.
- If improved, keep it.

<code omitted>
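
The expansion step described above can be illustrated on plain probability matrices. This is a sketch only: it works on `predict_proba`-style output and a `classes_`-style list, not on the editable tree's node values, and the helper name `expand_proba` is hypothetical.

```python
import numpy as np


def expand_proba(local_proba, local_classes, global_classes):
    """Scatter the columns of a local probability matrix into the global class order."""
    local_proba = np.asarray(local_proba, dtype=float)
    global_classes = list(global_classes)
    out = np.zeros((local_proba.shape[0], len(global_classes)))
    for j, cls in enumerate(local_classes):
        # Column j of the local matrix belongs to class `cls` in the global order.
        out[:, global_classes.index(cls)] = local_proba[:, j]
    return out


# A local model that only saw classes 1 and 2, expanded to 5 global classes.
expanded = expand_proba([[0.9, 0.1], [0.2, 0.8]], local_classes=[1, 2], global_classes=range(5))
```

Classes the local model never saw end up with probability 0, so some smoothing is usually needed afterwards if those classes can occur at that node.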


Thought:
There’s a discrepancy: the sklearn local model at node 4 achieved near-zero val loss on its subset earlier, but the editable subtree evaluation gave huge loss. Hypothesis: probability vector ordering/mapping is mismatched between sklearn.predict_proba and our expanded editable subtree. I will:
- Compute node-4 val probabilities from sklearn clf (with its own class order) and from the expanded editable subtree, and print both along with y labels.
- Print class orders, and the feature names used in the local clf and in the editable subtree to ensure alignment.
- If mapping is wrong, fix the mapping and re-graft. Otherwise, I’ll abandon this graft and move to improving other problematic leaves (5, 9, 11).

<code omitted>


Thought:
The local graft failed due to binary-class representation quirks: Tree.from_sklearn for binary classification may store leaf values as a scalar (probability of class “1”) or in an order that doesn’t match local_clf.classes_, causing mis-mapped probabilities after expansion. Hypothesis: recomputing each leaf’s distribution directly from the data routed to that leaf (using subtree routing on the node-4 training subset) and mapping to the global 5-class order will fix probability mapping and yield the near-zero partial val loss we saw with sklearn. I will:
- Build the local subtree for node 4 again.
- For each leaf, recompute class counts from the node-4 training subset and set node.value as normalized 5-length vector in global class order [0..4].
- Verify partial val loss at node 4, graft if improved, and evaluate overall log loss.
- If successful, I’ll inspect remaining high-loss leaves (5, 9, 11) and consider similar targeted fixes.

<code omitted>


Thought:
We improved validation log loss substantially by grafting a local subtree at node 4. Next, I will:
- Print the current tree to confirm node IDs.
- Target remaining high-loss leaves (ids 5, 9, 11 from the per-leaf report) with the same local refinement approach:
  - Train a tiny local DecisionTree on the node’s routed training subset.
  - Set each new leaf’s probability as the normalized class counts over the local training subset mapped to the global 5-class order.
  - Graft if it reduces that node’s partial validation log loss and improves overall val loss.
I’ll implement a reusable helper to try (max_depth in {1,2}, min_samples_leaf in {2,3,5}) and keep the best.

<code omitted>



Thought:
The remaining validation loss is dominated by leaf 3 (n=75, avg loss 0.0797), which is highly overconfident for class 2 and penalized by one misclassified sample. Hypothesis: lightly smoothing leaf 3’s probabilities using its training counts (Laplace smoothing) will reduce overconfidence and overall val log loss. I’ll:
- Inspect train/val class counts at leaf 3.
- Try a few alphas (0.25, 0.5, 1.0, 2.0) only for leaf 3, compute new probs, graft the adjusted leaf value, and evaluate train/val log loss.
- Keep the alpha with best val loss if it improves over current best.
I’ll also optionally test a tiny tweak on leaf 17 (n=1 on val) only if leaf 3 smoothing stagnates.

<code omitted>


Thought:
We have a strong baseline with careful preprocessing (dropping constant features and one-hot encoding categoricals). We then trained a compact tree and iteratively improved it by targeted local grafts where validation loss was concentrated. Specifically, we:
- Grafted a local subtree at node 4 (splitting on strength for surface-quality_not_applicable and hardness<=75 branch), perfectly separating classes 1 and 2 on its validation subset.
- Refined node 5 by splitting on width, improving class-4 vs class-2.
- Refined node 9 (within the condition_S branch) to split on surface-finish_P and strength, carving out classes 0,1,2 appropriately.
Smoothing large confident leaf (node 3) did not help; node 11 had only one validation sample and local refinements did not improve its partial loss, so we left it as-is.

The resulting model achieves a much lower validation log loss.

<code omitted>


Test Log Loss (sklearn.metrics.log_loss) score #4: 0.14293


In [None]:
test_scores

[0.08061947848102688,
 0.32442495917804665,
 0.0700032991596908,
 0.07289752745568814,
 0.14292990538497083]

In [None]:
np.mean(test_scores), np.std(test_scores, ddof=1)

(0.13817503393188466, 0.10831992394475876)
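
The loop above scores each tree separately and never combines them. If the goal is an actual bagging ensemble, one option is to average the members' predictions. The sketch below assumes every member is evaluated on the same held-out rows (which the loop above does not guarantee, since the test split changes with `split_seed`), and `bag_predictions` is a hypothetical helper.

```python
import numpy as np


def bag_predictions(member_predictions):
    """Average predictions from ensemble members evaluated on the same rows."""
    stacked = np.stack([np.asarray(p, dtype=float) for p in member_predictions])
    return stacked.mean(axis=0)


# Two members, two samples, three classes (made-up probabilities).
member_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
member_b = [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]]
bagged = bag_predictions([member_a, member_b])
```

For classification, averaging probability vectors keeps each row summing to 1; for regression the same function averages the predicted values.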