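Skip GridSearchCV for now and first establish: which model performs best out of the box, which target has the most explanatory and predictive power, and whether there is obvious underfitting or overfitting.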

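In this stage we built a systematic set of baseline models on NBA on-court statistics from the 2020–21 through 2024–25 seasons.  
The data are split by time (seasons labelled 2020–2023 for training; seasons 2024–2025 are configured for testing, of which only 2024 is present in the current data) to avoid leaking future information.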

We tested 5 model families (Ridge, RF, XGB, LGBM, MLP) against 5 target variables (`salary_usd`, `log_salary`, `salary_cap_ratio`, `log_salary_cap_ratio`, `salary_cap_equiv`), each with two feature sets: `core` (10 features) and `full` (40 features).
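
As a quick sanity check on the size of this experiment grid, the combinations can be enumerated directly. This is a minimal sketch; the names `MODELS` and `FEATURE_SETS` are illustrative and do not appear in the notebook code.

```python
from itertools import product

MODELS = ["Ridge", "RF", "XGB", "LGBM", "MLP"]
TARGETS = ["salary_usd", "log_salary", "salary_cap_ratio",
           "log_salary_cap_ratio", "salary_cap_equiv"]
FEATURE_SETS = {"core": 10, "full": 40}  # feature-set name -> number of features

# One run per (feature set, target, model) combination.
runs = list(product(FEATURE_SETS, TARGETS, MODELS))
print(len(runs))  # 2 feature sets * 5 targets * 5 models
print(runs[0])
```

The 50 combinations correspond to the 50 metric lines printed by the run-all cell further down.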

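The results show:
- The `log_salary_cap_ratio` target is the most stable (R²≈0.74, RMSE≈0.04 on the target's own scale);
- `salary_cap_ratio` and `salary_cap_equiv` perform about the same (best R²≈0.74 for both);
- `log_salary` (best R²≈0.55) and the raw `salary_usd` target (best R²≈0.70) perform worse.
- RF, XGB, and LGBM clearly outperform Ridge and MLP on the cap-based and raw-salary targets (on `log_salary` the gap over Ridge is small), which suggests a markedly nonlinear relationship between on-court statistics and salary.
- `full` features are slightly better than `core` overall for the tree models (the largest gain is for LGBM, about +0.03 R² on the cap-based targets; RF gains 0.005 or less there), so keeping more useful features helps a little. MLP, by contrast, degrades sharply on `full`.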
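
The `core` vs `full` comparison is easier to read side by side. Below is a small sketch that pivots a metrics table into one column per feature set; the six rows are copied from this notebook's run log, and the column name `full_minus_core` is an illustrative choice.

```python
import pandas as pd

# A few R2 values copied from the run log (test season 2024).
rows = [
    ("core", "RF",   "salary_cap_ratio", 0.7422),
    ("full", "RF",   "salary_cap_ratio", 0.7443),
    ("core", "LGBM", "salary_cap_ratio", 0.7151),
    ("full", "LGBM", "salary_cap_ratio", 0.7420),
    ("core", "MLP",  "salary_cap_ratio", 0.6080),
    ("full", "MLP",  "salary_cap_ratio", 0.2108),
]
metrics = pd.DataFrame(rows, columns=["features", "model", "target", "R2"])

# One row per (target, model), one column per feature set.
wide = metrics.pivot_table(index=["target", "model"], columns="features", values="R2")
wide["full_minus_core"] = wide["full"] - wide["core"]
print(wide.round(4))
```

In the notebook itself the same pivot could be applied to `all_metrics` once the run-all cell has finished.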

---

### ✅ Conclusions

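- **Best baseline combination**: `RF + full features + log_salary_cap_ratio`  
- XGB and LGBM are the runners-up and deserve priority in the later tuning and extension stages.  
- Ridge and MLP serve only as reference points.  
- With R²≈0.74, the baseline gives a clear reference for the later work that adds off-court data, the knowledge graph, and GNN modeling.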

---

### 🧪 Next steps

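- The EDA_5 stage will run cross-validation and hyperparameter tuning on the 2–4 strongest combinations (different targets + RF/XGB/LGBM).  
- Once the final best combination is fixed, it will be compared against results obtained after adding off-court data and the knowledge graph, to verify the gain.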
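
For the cross-validation planned in EDA_5, the folds probably need to respect season order just as the train/test split does. The following is one possible sketch of an expanding-window splitter keyed on the season label; the function name `season_forward_splits` and the `min_train_seasons` parameter are hypothetical, not part of the project code.

```python
import numpy as np

def season_forward_splits(seasons, min_train_seasons=2):
    """Yield (train_idx, val_idx) pairs: each fold validates on one season
    and trains on all strictly earlier seasons."""
    seasons = np.asarray(seasons)
    ordered = np.unique(seasons)  # sorted unique season labels
    for i in range(min_train_seasons, len(ordered)):
        train_idx = np.flatnonzero(seasons < ordered[i])
        val_idx = np.flatnonzero(seasons == ordered[i])
        yield train_idx, val_idx

# Toy season column: three rows per season for the four training seasons.
season_col = np.repeat([2020, 2021, 2022, 2023], 3)
folds = list(season_forward_splits(season_col))
for tr, va in folds:
    print("train rows:", len(tr), "validation season:", int(season_col[va][0]))
```

scikit-learn's `GridSearchCV` and `cross_val_score` accept an iterable of `(train, test)` index pairs through their `cv` argument, so a list like `folds` could be passed in directly.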

In [7]:
# %% imports & config
from pathlib import Path
import json, time
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# xgboost / lightgbm are optional dependencies
try:
    import xgboost as xgb
    HAS_XGB = True
except Exception:
    HAS_XGB = False

try:
    from lightgbm import LGBMRegressor
    HAS_LGBM = True
except Exception:
    HAS_LGBM = False

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; avoids Tk errors in some environments
import matplotlib.pyplot as plt

RANDOM_STATE = 42

# === paths ===
DATA_FEAT_PATH = Path("../data/processed/training_oncourt_features.parquet")
REPORT_DIR = Path("reports/EDA4_time_split")
STAMP = time.strftime("%Y%m%d_%H%M%S")
OUT_DIR = REPORT_DIR / STAMP
OUT_DIR.mkdir(parents=True, exist_ok=True)

# feature lists (produced in EDA_3)
CORE_PATHS = [
    Path("reports/features/selected_features_core.json"),
    Path("reports/features/selected_features_core.txt"),
    Path("reports/features/selected_features_core"),  # fallback: file without an extension
]
FULL_PATHS = [
    Path("reports/features/selected_features_full.json"),
    Path("reports/features/selected_features_full.txt"),
    Path("reports/features/selected_features_full"),
]

TARGETS = [
    "salary_cap_ratio",
    "salary_cap_equiv",
    "log_salary_cap_ratio",
    "salary_usd",
    "log_salary",
]

TRAIN_SEASONS = [2020, 2021, 2022, 2023]  # seasons 2020-21 through 2023-24
TEST_SEASONS  = [2024, 2025]              # 2024-25 onward (extendable as new seasons arrive)

print(f"Outputs will be saved to: {OUT_DIR.resolve()}")

Outputs will be saved to: C:\nba-salary-kg-project_newest\notebooks\reports\EDA4_time_split\20251028_162508


In [8]:
# %% helper: load feature list (supports json / txt / one-name-per-line text)
def load_feature_list_try(paths):
    for p in paths:
        if p.exists():
            # try json
            try:
                obj = json.loads(p.read_text(encoding="utf-8"))
                if isinstance(obj, dict) and "features" in obj:
                    feats = list(obj["features"])
                    print(f"[OK] loaded features from JSON: {p}")
                    return feats
                elif isinstance(obj, list):
                    print(f"[OK] loaded features(list) from JSON: {p}")
                    return list(obj)
            except Exception:
                pass
            # try lines
            txt = p.read_text(encoding="utf-8").strip()
            feats = [line.strip() for line in txt.splitlines() if line.strip()]
            print(f"[OK] loaded features from text lines: {p}")
            return feats
    raise FileNotFoundError(f"Cannot find feature list in: {paths}")

core_feats = load_feature_list_try(CORE_PATHS)
full_feats = load_feature_list_try(FULL_PATHS)
print("Core feats (len={}):".format(len(core_feats)), core_feats[:10], "...")
print("Full feats (len={}):".format(len(full_feats)), full_feats[:10], "...")

[OK] loaded features from JSON: reports\features\selected_features_core.json
[OK] loaded features from JSON: reports\features\selected_features_full.json
Core feats (len=10): ['FP', 'Age', 'Min', 'FTA', 'TOV', 'PTS_per_gp', 'log1p_PF', 'log1p_3PM', 'PTS_per_min', 'OREB_per_min'] ...
Full feats (len=40): ['FP', 'PTS', 'Age', 'Min', 'FTA', 'TOV', 'PTS_per_gp', 'log1p_PF', 'log1p_3PM', 'PTS_per_min'] ...


In [9]:
# %% load data & time-based split
df = pd.read_parquet(DATA_FEAT_PATH)

# basic sanity check: required columns must exist
must_have = set(["season", "salary_usd", "salary_cap_ratio", "salary_cap_equiv", "log_salary", "log_salary_cap_ratio"])
missing = list(must_have - set(df.columns))
if missing:
    raise KeyError(f"Missing required columns: {missing}")

# keep only the seasons of interest
df = df[df["season"].isin(TRAIN_SEASONS + TEST_SEASONS)].copy()

# time-based split
df_train = df[df["season"].isin(TRAIN_SEASONS)].copy()
df_test  = df[df["season"].isin(TEST_SEASONS)].copy()

print("Train seasons:", sorted(df_train["season"].unique()))
print("Test  seasons:", sorted(df_test["season"].unique()))
print("Train shape:", df_train.shape, "Test shape:", df_test.shape)

Train seasons: [np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023)]
Test  seasons: [np.int64(2024)]
Train shape: (1659, 90) Test shape: (423, 90)
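
Note that `TEST_SEASONS` lists 2024 and 2025, but only 2024 shows up in the test frame. A small guard like the one below can make that explicit and also assert that no training season is later than a test season. This is a sketch on toy frames; `check_time_split` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd

def check_time_split(df_train, df_test, expected_test_seasons, col="season"):
    """Assert that every train season precedes every test season, and
    return the configured test seasons that are absent from the data."""
    last_train = df_train[col].max()
    first_test = df_test[col].min()
    assert last_train < first_test, (
        f"leak: train season {last_train} >= test season {first_test}"
    )
    present = set(df_test[col].unique().tolist())
    return sorted(set(expected_test_seasons) - present)

# Toy frames that mirror the season labels printed above.
toy_train = pd.DataFrame({"season": [2020, 2021, 2022, 2023]})
toy_test = pd.DataFrame({"season": [2024, 2024]})
missing_seasons = check_time_split(toy_train, toy_test, expected_test_seasons=[2024, 2025])
print("configured but absent test seasons:", missing_seasons)
```

In the notebook the call would take `df_train`, `df_test`, and `TEST_SEASONS` instead of the toy frames.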


In [11]:
# %% build model helpers (wrappers for the five model families)
def fit_ridge(Xtr, ytr, Xte):
    # linear models need standardized inputs
    pipe = Pipeline([
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("model", Ridge(alpha=1.0, random_state=RANDOM_STATE))
    ])
    pipe.fit(Xtr, ytr)
    pred = pipe.predict(Xte)
    return "Ridge", pipe, pred, None  # coefficients can be exported separately later if needed

def fit_rf(Xtr, ytr, Xte):
    rf = RandomForestRegressor(
        n_estimators=700, max_depth=None,
        random_state=RANDOM_STATE, n_jobs=-1
    )
    rf.fit(Xtr, ytr)
    pred = rf.predict(Xte)
    imp = pd.Series(rf.feature_importances_, index=Xtr.columns)
    return "RF", rf, pred, imp

def fit_xgb(Xtr, ytr, Xte):
    if not HAS_XGB:
        return None, None, None, None
    dtr = xgb.DMatrix(Xtr, label=ytr)
    dte = xgb.DMatrix(Xte)
    params = {
        "objective": "reg:squarederror",
        "max_depth": 6,
        "eta": 0.08,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "seed": RANDOM_STATE
    }
    booster = xgb.train(params, dtr, num_boost_round=600)
    pred = booster.predict(dte)
    gain = booster.get_score(importance_type="gain")
    imp = pd.Series({k: gain.get(k, 0.0) for k in Xtr.columns})
    return "XGB", booster, pred, imp

def fit_lgbm(Xtr, ytr, Xte):
    if not HAS_LGBM:
        return None, None, None, None
    lgbm = LGBMRegressor(
        n_estimators=800, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8,
        random_state=RANDOM_STATE
    )
    # plain fit(): the verbose / early-stopping arguments differ across LightGBM versions
    lgbm.fit(Xtr, ytr)
    pred = lgbm.predict(Xte)
    imp = pd.Series(lgbm.feature_importances_, index=Xtr.columns)
    return "LGBM", lgbm, pred, imp

def fit_mlp(Xtr, ytr, Xte):
    # the MLP also needs standardized inputs
    pipe = Pipeline([
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(128,64), activation="relu",
                             solver="adam", learning_rate_init=1e-3,
                             max_iter=800, random_state=RANDOM_STATE))
    ])
    pipe.fit(Xtr, ytr)
    pred = pipe.predict(Xte)
    return "MLP", pipe, pred, None

In [12]:
# %% utilities: eval + plot + save
def evaluate_and_log(split_tag, target, feature_tag, Xtr, ytr, Xte, yte, fitters, out_dir):
    rows = []
    for fit_func in fitters:
        name, model, pred, imp = fit_func(Xtr, ytr, Xte)
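        if name is None:  # skip when xgboost / lightgbm is unavailable
            continue
        r2  = r2_score(yte, pred)
        # take the square root explicitly: the `squared=False` argument is deprecated in recent scikit-learn releases
        rmse = float(np.sqrt(mean_squared_error(yte, pred)))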
        print(f"[{feature_tag}/{name}] target={target}  R^2={r2:.4f}  RMSE={rmse:.6f}")

        rows.append({
            "split": split_tag,
            "features": feature_tag,
            "model": name,
            "target": target,
            "R2": r2,
            "RMSE": rmse
        })

        # save scatter plot (predicted vs. true)
        fig_path = out_dir / f"scatter_{feature_tag}_{name}_{target}.png"
        plt.figure(figsize=(5,5))
        plt.scatter(yte, pred, s=10)
        plt.xlabel(f"True {target}")
        plt.ylabel(f"Pred {target}")
        lims = [min(yte.min(), pred.min()), max(yte.max(), pred.max())]
        plt.plot(lims, lims, linewidth=1)
        plt.title(f"{feature_tag}-{name}: {target} (R2={r2:.3f})")
        plt.tight_layout()
        plt.savefig(fig_path, dpi=150)
        plt.close()

        # save feature importances (tree models only)
        if imp is not None:
            imp = imp.sort_values(ascending=False)
            imp_path = out_dir / f"importance_{feature_tag}_{name}_{target}.csv"
            imp.to_csv(imp_path, header=["importance"])

    return pd.DataFrame(rows)
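
`evaluate_and_log` scores the test season only, so it cannot by itself answer the overfitting question raised at the top of this notebook. One way to check is to score the training rows as well and look at the gap. The sketch below does this on synthetic data (not the salary data), with a smaller forest than the notebook's baseline to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data: a noisy nonlinear target on five features.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=400)
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(Xtr, ytr)
r2_train = r2_score(ytr, model.predict(Xtr))
r2_test = r2_score(yte, model.predict(Xte))
print(f"train R2={r2_train:.3f}  test R2={r2_test:.3f}  gap={r2_train - r2_test:.3f}")
```

A large train-test gap points toward overfitting, while low scores on both sides point toward underfitting. A fully grown random forest nearly always shows some gap, so the size matters more than the sign.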


In [13]:
# %% run all: for core + full, for all TARGETS
all_metrics = []

for feature_tag, feats in [("core", core_feats), ("full", full_feats)]:
    # keep only the feature columns that actually exist in df
    feats_exist = [c for c in feats if c in df.columns]
    missing_feats = list(set(feats) - set(feats_exist))
    if missing_feats:
        print(f"[WARN] {feature_tag} missing {len(missing_feats)} cols (ignored): {missing_feats[:5]} ...")

    # build X (the targets y are selected per target below)
    Xtr_base = df_train[feats_exist].copy()
    Xte_base = df_test[feats_exist].copy()

    for tgt in TARGETS:
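        # Some targets are log_xxx and some seasons may contain 0 values; this was handled
        # during feature engineering. Any remaining NaNs are filled with 0 here as a fallback.
        ytr = df_train[tgt].astype(float).fillna(0.0)
        yte = df_test[tgt].astype(float).fillna(0.0)

        # run all five models (tree-based, MLP, Ridge); Ridge goes last
        fitters = [fit_rf, fit_xgb, fit_lgbm, fit_mlp, fit_ridge]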

        out_dir_sub = OUT_DIR / f"{feature_tag}_{tgt}"
        out_dir_sub.mkdir(parents=True, exist_ok=True)

        metrics_df = evaluate_and_log(
            split_tag="time-split(2020-23→2024-25)",
            target=tgt,
            feature_tag=feature_tag,
            Xtr=Xtr_base, ytr=ytr, Xte=Xte_base, yte=yte,
            fitters=fitters,
            out_dir=out_dir_sub
        )
        metrics_df.to_csv(out_dir_sub / "metrics.csv", index=False)
        all_metrics.append(metrics_df)

# concatenate all metrics and save
all_metrics = pd.concat(all_metrics, ignore_index=True)
all_metrics.to_csv(OUT_DIR / "metrics_all.csv", index=False)

print(f"[Saved] {OUT_DIR/'metrics_all.csv'}")



[core/RF] target=salary_cap_ratio  R^2=0.7422  RMSE=0.046125
[core/XGB] target=salary_cap_ratio  R^2=0.7273  RMSE=0.047442
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000558 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1493
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 10
[LightGBM] [Info] Start training from score 0.077255
[core/LGBM] target=salary_cap_ratio  R^2=0.7151  RMSE=0.048491
[core/MLP] target=salary_cap_ratio  R^2=0.6080  RMSE=0.056881
[core/Ridge] target=salary_cap_ratio  R^2=0.6300  RMSE=0.055263
[core/RF] target=salary_cap_equiv  R^2=0.7414  RMSE=6495033.291534
[core/XGB] target=salary_cap_equiv  R^2=0.7265  RMSE=6679876.469451
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000151 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1493
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 10
[LightGBM] [Info] Start training from score 10861147.057087
[core/LGBM] target=salary_cap_equiv  R^2=0.7151  RMSE=6817269.786467
[core/MLP] target=salary_cap_equiv  R^2=0.5416  RMSE=8648121.461364
[core/Ridge] target=salary_cap_equiv  R^2=0.6300  RMSE=7769300.473269
[core/RF] target=log_salary_cap_ratio  R^2=0.7360  RMSE=0.040734
[core/XGB] target=log_salary_cap_ratio  R^2=0.7233  RMSE=0.041707
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000176 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1493
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 10
[LightGBM] [Info] Start training from score 0.071419
[core/LGBM] target=log_salary_cap_ratio  R^2=0.7066  RMSE=0.042942
[core/MLP] target=log_salary_cap_ratio  R^2=0.5657  RMSE=0.052247
[core/Ridge] target=log_salary_cap_ratio  R^2=0.6334  RMSE=0.048006
[core/RF] target=salary_usd  R^2=0.7019  RMSE=6973368.942916
[core/XGB] target=salary_usd  R^2=0.6869  RMSE=7147460.945609
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000140 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1493
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 10
[LightGBM] [Info] Start training from score 9307824.515973
[core/LGBM] target=salary_usd  R^2=0.6772  RMSE=7256699.152064
[core/MLP] target=salary_usd  R^2=0.5031  RMSE=9003196.921358
[core/Ridge] target=salary_usd  R^2=0.5921  RMSE=8157298.463107
[core/RF] target=log_salary  R^2=0.5248  RMSE=0.800851
[core/XGB] target=log_salary  R^2=0.5149  RMSE=0.809181
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000191 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1493
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 10
[LightGBM] [Info] Start training from score 15.414125
[core/LGBM] target=log_salary  R^2=0.4630  RMSE=0.851359
[core/MLP] target=log_salary  R^2=0.5037  RMSE=0.818460
[core/Ridge] target=log_salary  R^2=0.5109  RMSE=0.812441
[full/RF] target=salary_cap_ratio  R^2=0.7443  RMSE=0.045943
[full/XGB] target=salary_cap_ratio  R^2=0.7278  RMSE=0.047398
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000385 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7215
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 40
[LightGBM] [Info] Start training from score 0.077255
[full/LGBM] target=salary_cap_ratio  R^2=0.7420  RMSE=0.046151
[full/MLP] target=salary_cap_ratio  R^2=0.2108  RMSE=0.080713
[full/Ridge] target=salary_cap_ratio  R^2=0.6377  RMSE=0.054686
[full/RF] target=salary_cap_equiv  R^2=0.7444  RMSE=6457951.869355
[full/XGB] target=salary_cap_equiv  R^2=0.7261  RMSE=6684377.726534
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000846 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7215
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 40
[LightGBM] [Info] Start training from score 10861147.057087
[full/LGBM] target=salary_cap_equiv  R^2=0.7420  RMSE=6488343.568363
[full/MLP] target=salary_cap_equiv  R^2=0.5083  RMSE=8955955.705748
[full/Ridge] target=salary_cap_equiv  R^2=0.6377  RMSE=7688226.447107
[full/RF] target=log_salary_cap_ratio  R^2=0.7410  RMSE=0.040351
[full/XGB] target=log_salary_cap_ratio  R^2=0.7290  RMSE=0.041274
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001243 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7215
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 40
[LightGBM] [Info] Start training from score 0.071419
[full/LGBM] target=log_salary_cap_ratio  R^2=0.7403  RMSE=0.040400
[full/MLP] target=log_salary_cap_ratio  R^2=0.1076  RMSE=0.074896
[full/Ridge] target=log_salary_cap_ratio  R^2=0.6395  RMSE=0.047605
[full/RF] target=salary_usd  R^2=0.6992  RMSE=7005608.007570
[full/XGB] target=salary_usd  R^2=0.6815  RMSE=7208189.645891
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000614 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7215
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 40
[LightGBM] [Info] Start training from score 9307824.515973
[full/LGBM] target=salary_usd  R^2=0.6934  RMSE=7072126.892947
[full/MLP] target=salary_usd  R^2=0.4862  RMSE=9155185.804067
[full/Ridge] target=salary_usd  R^2=0.5987  RMSE=8091256.738129
[full/RF] target=log_salary  R^2=0.5477  RMSE=0.781340
[full/XGB] target=log_salary  R^2=0.5332  RMSE=0.793743
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000825 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7215
[LightGBM] [Info] Number of data points in the train set: 1659, number of used features: 40
[LightGBM] [Info] Start training from score 15.414125
[full/LGBM] target=log_salary  R^2=0.5126  RMSE=0.811062
[full/MLP] target=log_salary  R^2=0.1655  RMSE=1.061237
[full/Ridge] target=log_salary  R^2=0.5022  RMSE=0.819696
[Saved] reports\EDA4_time_split\20251028_162508\metrics_all.csv




In [14]:
# %% summary: best model per target
best_rows = []
for tgt, g in all_metrics.groupby("target"):
    # alternatively, take the best within core and full separately
    best = g.sort_values("R2", ascending=False).head(1).copy()
    best_rows.append(best)
best_table = pd.concat(best_rows, ignore_index=True)
best_table

| | split | features | model | target | R2 | RMSE |
|---|---|---|---|---|---|---|
| 0 | time-split(2020-23→2024-25) | full | RF | log_salary | 0.547662 | 0.7813396 |
| 1 | time-split(2020-23→2024-25) | full | RF | log_salary_cap_ratio | 0.740957 | 0.04035119 |
| 2 | time-split(2020-23→2024-25) | full | RF | salary_cap_equiv | 0.744363 | 6457952.0 |
| 3 | time-split(2020-23→2024-25) | full | RF | salary_cap_ratio | 0.744282 | 0.04594259 |
| 4 | time-split(2020-23→2024-25) | core | RF | salary_usd | 0.701929 | 6973369.0 |
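
Each RMSE above is on its own target's scale, so the value for `log_salary_cap_ratio` is not directly comparable with the one for `salary_cap_ratio`. One way to put them on the same footing is to back-transform log-scale predictions before scoring. The sketch below uses made-up numbers and ASSUMES the log target was built with `np.log1p`; the feature-engineering step is not shown in this notebook, so that should be verified first.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error on whatever scale the inputs are given in."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up cap ratios and log-scale predictions, for illustration only.
ratio_true = np.array([0.02, 0.05, 0.10, 0.30])
log_true = np.log1p(ratio_true)  # ASSUMPTION: log target = log1p(ratio)
log_pred = log_true + np.array([0.01, -0.01, 0.02, -0.03])

rmse_log = rmse(log_true, log_pred)                # error on the log scale
rmse_ratio = rmse(ratio_true, np.expm1(log_pred))  # same predictions, ratio scale
print(f"log-scale RMSE={rmse_log:.4f}  ratio-scale RMSE={rmse_ratio:.4f}")
```

If the log target was built with a different transform, the inverse has to be changed to match.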


In [16]:
# %% save best_table
best_table.to_csv(OUT_DIR / "best_by_target.csv", index=False)
print(f"[Saved] {OUT_DIR/'best_by_target.csv'}")


[Saved] reports\EDA4_time_split\20251028_162508\best_by_target.csv
