# Semi-supervised approach


We use **semi-supervised learning**: training a regressor with a small labeled set plus extra unlabeled data to improve generalization. Typical families include:
- **Graph-based** methods (label propagation, Laplacian-regularized regression).
- **Consistency regularization** (teacher–student / Mean-Teacher, perturbation invariance).
- **Co-/Tri-training** (multiple views/models teach each other).
- **Generative** approaches (e.g., VAE-based feature learning + supervised head).



## The approach we use
We use **self-training**.
We fit a **RandomForestRegressor** on the **labeled train** rows (the baseline), take the **unlabeled train** rows as the pool, and iterate.
In each round we predict on the pool, compute the per-tree standard deviation **$\sigma$** as an uncertainty proxy, and keep the samples in the **lowest-$\sigma$ quantile (e.g., 20%)**. These samples are added to the training set with the per-tree mean $\mu$ as pseudo-label and a **uniform weight of 0.5** (true labels keep weight 1.0). We then refit with sample weights and report RMSE/MAE/R² on the **labeled test** rows.
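
The uncertainty proxy described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's pipeline: the dataset from `make_regression`, the 100/200 labeled/pool split, and the 50-tree forest are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the project's dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_lab, y_lab = X[:100], y[:100]   # pretend only these rows are labeled
X_pool = X[100:]                  # unlabeled pool

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_lab, y_lab)

# One column per tree: shape (n_pool, n_trees)
tree_preds = np.stack([t.predict(X_pool) for t in rf.estimators_], axis=1)
mu = tree_preds.mean(axis=1)      # pseudo-label candidate (equals rf.predict)
sigma = tree_preds.std(axis=1)    # disagreement between trees

# Keep the pool rows on which the trees agree most (lowest 20% of sigma)
thresh = np.quantile(sigma, 0.20)
selected = np.where(sigma <= thresh)[0]
print(len(selected), "of", len(X_pool), "pool rows selected")
```

Note that a low $\sigma$ only means the trees agree with each other; it does not guarantee that the pseudo-label is close to the true value.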





## Pseudocode
```pseudo
for i in range(nb_iterations):
  if U_train is empty: break
  μ, σ = per-tree mean/std of predictions on U_train
  S = indices with σ in the lowest quantile (e.g., 20%)
  add (U_train[S], label=μ[S], weight=0.5) to the training set
  remove S from U_train
  refit RandomForest with sample_weight; report metrics on labeled test
```


# Import the modules

In [160]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import KFold, GridSearchCV

# Import the data

**Note:** to prevent leakage, we drop the post-process/mechanical fields up front:
`ultimate_tensile_strength_mpa`, `elongation_percent`, `reduction_of_area_percent`, `charpy_temperature_c`, `charpy_impact_toughness_j`, `haz_hardness`.
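
A small sketch of how the drop could be made defensive, so that a missing column does not raise and a surviving leakage column is caught early. The helper name `drop_leakage` and the demo column `carbon_pct` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Columns listed in the note above
LEAK_COLS = [
    "ultimate_tensile_strength_mpa",
    "elongation_percent",
    "reduction_of_area_percent",
    "charpy_temperature_c",
    "charpy_impact_toughness_j",
    "haz_hardness",
]

def drop_leakage(df: pd.DataFrame, leak_cols=LEAK_COLS) -> pd.DataFrame:
    """Return a copy of df without the leakage columns, tolerating absent ones."""
    out = df.drop(columns=leak_cols, errors="ignore")
    remaining = set(out.columns) & set(leak_cols)
    assert not remaining, f"leakage columns still present: {remaining}"
    return out

# Tiny hypothetical frame (column "carbon_pct" is made up for the demo)
demo = pd.DataFrame({
    "carbon_pct": [0.1, 0.2],
    "haz_hardness": [200, 210],
    "yield_strength_mpa": [400.0, None],
})
clean = drop_leakage(demo)
print(list(clean.columns))
```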

In [161]:

# load
df_train = pd.read_csv("preprocess_data/train_processed.csv")
df_test  = pd.read_csv("preprocess_data/test_processed.csv")

# drop leakage columns
leak_cols = [
    "ultimate_tensile_strength_mpa",
    "elongation_percent",
    "reduction_of_area_percent",
    "charpy_temperature_c",
    "charpy_impact_toughness_j",
    "haz_hardness",
]
df_train = df_train.drop(columns=leak_cols)
df_test  = df_test.drop(columns=leak_cols)

# Split labeled / unlabeled
y_col = "yield_strength_mpa"
train_labeled      = df_train[df_train[y_col].notna()].copy()
train_unlabeled_X  = df_train[df_train[y_col].isna()].drop(columns=[y_col]).copy()

test_labeled       = df_test[df_test[y_col].notna()].copy()   # only for evaluation
# (ignore test unlabeled)


In [162]:
# Consistent feature set
feature_cols = [c for c in df_train.columns if c != y_col]
X_train_L = train_labeled[feature_cols]
y_train_L = train_labeled[y_col].values
X_test_L  = test_labeled[feature_cols]
y_test_L  = test_labeled[y_col].values

# Baseline: fit on the labeled train rows only
model = RandomForestRegressor(
    n_estimators=400, max_depth=None, min_samples_split=3,
    max_features=0.5, random_state=42, n_jobs=-1, bootstrap=True
)
model.fit(X_train_L, y_train_L)

pred_base = model.predict(X_test_L)
print({
    "stage": "baseline",
    "rmse": float(np.sqrt(mean_squared_error(y_test_L, pred_base))),
    "mae":  float(mean_absolute_error(y_test_L, pred_base)),
    "r2":   float(r2_score(y_test_L, pred_base))
})

{'stage': 'baseline', 'rmse': 44.94810315018374, 'mae': 29.07128949336176, 'r2': 0.7971385235889683}
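
The same three metrics are printed at several points in this notebook. One possible way to avoid repeating the dictionary is a small helper; `regression_report` is a hypothetical name, and the three-value arrays below are toy numbers rather than project data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def regression_report(y_true, y_pred, **extra):
    """Return RMSE, MAE and R² as plain floats, plus any extra key/value tags."""
    return {
        **extra,
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }

# Toy check: errors are +10, -10 and 0
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])
report = regression_report(y_true, y_pred, stage="demo")
print(report)
```

With such a helper, the baseline print would reduce to something like `print(regression_report(y_test_L, pred_base, stage="baseline"))`.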


In [163]:
# Self-training on the unlabeled train pool only, with uniform pseudo-label weight = 0.5
X_unlabeled = train_unlabeled_X[feature_cols].copy()
confidence_quantile = 0.20   # keep lowest 20% std each round
iterations = 5

X_lab = X_train_L.copy()
y_lab = y_train_L.copy()
sw_lab = np.ones(len(y_train_L), dtype=float)   # 1.0 for true labels

for it in range(iterations):
    if len(X_unlabeled) == 0:
        break
    print(f"Iteration {it+1}: labeled={len(y_lab)}, unlabeled={len(X_unlabeled)}")

    # Tree-wise predictions -> mean and std per sample (uncertainty proxy).
    # Pass the raw array: the individual trees are fitted without feature names.
    X_unl_np = X_unlabeled.values
    tree_preds = np.stack([t.predict(X_unl_np) for t in model.estimators_], axis=1)
    mu = tree_preds.mean(axis=1)
    sigma = tree_preds.std(axis=1)

    # Select most confident (lowest std)
    thresh = np.quantile(sigma, confidence_quantile)
    confident_idx = np.where(sigma <= thresh)[0]

    # Uniform weights for pseudo-labels (all equal to 0.5)
    w = np.full(len(confident_idx), 0.5)

    # Add pseudo-labels + weights; remove from pool
    X_lab = pd.concat([X_lab, X_unlabeled.iloc[confident_idx]], axis=0)
    y_lab = np.concatenate([y_lab, mu[confident_idx]])
    sw_lab = np.concatenate([sw_lab, w])
    X_unlabeled = X_unlabeled.drop(index=X_unlabeled.index[confident_idx])

    # Retrain RF with sample_weight
    model = RandomForestRegressor(
        n_estimators=400, max_depth=None, min_samples_split=3,
        max_features=0.5, random_state=42, n_jobs=-1, bootstrap=True
    )
    model.fit(X_lab, y_lab, sample_weight=sw_lab)

    # Quick eval on labeled test
    y_hat = model.predict(X_test_L)
    print({
        "iter": it+1,
        "rmse": float(np.sqrt(mean_squared_error(y_test_L, y_hat))),
        "mae":  float(mean_absolute_error(y_test_L, y_hat)),
        "r2":   float(r2_score(y_test_L, y_hat)),
        "added_total": int(len(y_lab) - len(y_train_L))
    })



Iteration 1: labeled=613, unlabeled=708
{'iter': 1, 'rmse': 44.84354299652713, 'mae': 29.444082898577804, 'r2': 0.7980812354917506, 'added_total': 142}
Iteration 2: labeled=755, unlabeled=566
{'iter': 2, 'rmse': 45.06221936435447, 'mae': 29.595129506424954, 'r2': 0.7961071487852592, 'added_total': 259}
Iteration 3: labeled=872, unlabeled=449
{'iter': 3, 'rmse': 44.35792843986998, 'mae': 29.329649912802267, 'r2': 0.802430747750497, 'added_total': 352}
Iteration 4: labeled=965, unlabeled=356
{'iter': 4, 'rmse': 45.14678570517127, 'mae': 29.770361786535197, 'r2': 0.7953411567105333, 'added_total': 424}
Iteration 5: labeled=1037, unlabeled=284
{'iter': 5, 'rmse': 44.694882703359475, 'mae': 29.693964047412898, 'r2': 0.7994177734066739, 'added_total': 481}
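
Because the selection rule is `sigma <= thresh`, every sample tied at the threshold is included, so a round can add more than the nominal fraction of the pool. In the log above, rounds 2 and 3 add 117 of 566 and 93 of 449 rows, which may point to tied $\sigma$ values. If an exact count per round is preferred, one alternative (not what the notebook does) is to take the k smallest values with `np.argsort`. The helper name `select_k_most_confident` and the toy `sigma` array are assumptions for illustration.

```python
import numpy as np

def select_k_most_confident(sigma, fraction=0.20):
    """Return indices of the ceil(fraction * n) smallest sigma values."""
    sigma = np.asarray(sigma, dtype=float)
    k = int(np.ceil(fraction * len(sigma)))
    # A stable sort keeps the original order among tied values
    return np.argsort(sigma, kind="stable")[:k]

# Toy sigma with three tied minima
sigma = np.array([0.5, 0.1, 0.1, 0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6])

quantile_idx = np.where(sigma <= np.quantile(sigma, 0.20))[0]  # notebook's rule
exact_idx = select_k_most_confident(sigma, 0.20)               # alternative

print(len(quantile_idx), len(exact_idx))
```

On this toy array the quantile rule keeps all three tied rows, while the argsort variant keeps exactly two.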


In [164]:
# Print the baseline again for comparison
print({
    "stage": "baseline",
    "rmse": float(np.sqrt(mean_squared_error(y_test_L, pred_base))),
    "mae":  float(mean_absolute_error(y_test_L, pred_base)),
    "r2":   float(r2_score(y_test_L, pred_base))
})
# Final eval on labeled test
y_pred = model.predict(X_test_L)
print({
    "stage": "final",
    "rmse": float(np.sqrt(mean_squared_error(y_test_L, y_pred))),
    "mae":  float(mean_absolute_error(y_test_L, y_pred)),
    "r2":   float(r2_score(y_test_L, y_pred))
})

{'stage': 'baseline', 'rmse': 44.94810315018374, 'mae': 29.07128949336176, 'r2': 0.7971385235889683}
{'stage': 'final', 'rmse': 44.69488270335948, 'mae': 29.69396404741289, 'r2': 0.7994177734066739}
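
The two result lines above can be compared with a few lines of arithmetic. The values are copied from the printed output; `relative_change_pct` is a hypothetical helper name.

```python
# Metrics copied from the printed baseline and final results
baseline = {"rmse": 44.94810315018374, "mae": 29.07128949336176, "r2": 0.7971385235889683}
final = {"rmse": 44.69488270335948, "mae": 29.69396404741289, "r2": 0.7994177734066739}

def relative_change_pct(new, old):
    """Signed change of `new` relative to `old`, in percent."""
    return 100.0 * (new - old) / old

rmse_change = relative_change_pct(final["rmse"], baseline["rmse"])
mae_change = relative_change_pct(final["mae"], baseline["mae"])
r2_delta = final["r2"] - baseline["r2"]   # absolute difference for R²

print(f"RMSE {rmse_change:+.2f}%  MAE {mae_change:+.2f}%  R2 {r2_delta:+.4f}")
```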


## Comments 
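Compared to the supervised baseline (RMSE 44.95, MAE 29.07, $R^2$ 0.7971), the final self-trained model reaches RMSE 44.69 (≈ −0.56%), MAE 29.69 (≈ +2.1%), and $R^2$ 0.7994 (+0.0023).
Verdict: the change is too small and inconsistent (RMSE down but MAE up) to be meaningful. Across the five rounds the test RMSE also moves non-monotonically between 44.36 and 45.15, so the difference is very likely within random variability.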
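
One way to put a number on "random variability" is to refit the same forest with different seeds and look at the spread of the test RMSE. The sketch below does this on synthetic data; the dataset, the 70/30 split, the 100-tree forest, and the ten seeds are all assumptions for the example and not results from this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the project's dataset)
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def rmse_for_seed(seed):
    """Fit a forest with the given seed and return its test RMSE."""
    rf = RandomForestRegressor(n_estimators=100, max_features=0.5, random_state=seed)
    rf.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, rf.predict(X_te))))

# Same data, same hyperparameters, only the seed changes
rmses = np.array([rmse_for_seed(s) for s in range(10)])
print(f"RMSE mean={rmses.mean():.2f}, std={rmses.std(ddof=1):.2f}, "
      f"range=[{rmses.min():.2f}, {rmses.max():.2f}]")
```

On the real data, the same loop could be run with `X_train_L`, `y_train_L`, `X_test_L`, `y_test_L` and the notebook's hyperparameters. If the seed-to-seed standard deviation is comparable to the roughly 0.25 RMSE gap between baseline and final model, that would support the verdict above.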