
# AutoGluon 분류 파이프라인 (Stratified K-Fold, 셀별 실행 · 함수형태 X)

**목표**: `train.csv`(타깃: `target`)로 학습 → `test.csv` 분류 결과 생성  
**요구사항 반영**
- Stratified K-Fold (OOF & 폴드 성능)
- **각 기능별 셀**로 구현(함수 정의 없이 바로 실행)
- 데이터 전처리:
  - **G1 그룹**: `["X_05","X_09","X_20","X_22","X_25","X_51"]` → 표준화(평균0, 분산1) → **PCA(설명분산 0.95)** → `PC_G1_1, PC_G1_2, ...`만 남기고 G1 원본 6개 변수 **삭제**
  - **PAIR_GROUPS(쌍 그룹)**: 각 쌍에서 대표 1개만 남김. **대표 선택 기준**: 결측률↓, mutual information(↑), ANOVA-F(↑)
- AutoGluon으로 학습/검증/추론


# 🔧 Custom Feature Engineering Block (자동 추가)

요구사항에 따라 아래 로직을 추가했습니다.

- **중요 피처**: `X_08, X_19, X_29, X_46, X_40, X_41, X_49`
  - 이 중 **상관관계가 높은 쌍(기본 임계값 |corr| ≥ 0.95)**이 1개 이상 존재하면 → 해당 쌍들에 대해 **곱(product) 피처** 생성 + 각 피처 **제곱(sq) 피처** 생성
  - 상관관계가 높은 쌍이 **하나도 없으면** → **제곱(sq) 피처만** 생성
- **제거할 피처**: `X_17`, `X_09`, `X_40`, `X_21_sq`
  - `X_40`은 중요 피처 목록에도 있었으나, **최종적으로 삭제 요청**에 따라 파생에 사용하지 않고 제거합니다.
- **나머지 전처리 로직은 그대로 유지**합니다.
- **Idempotent(중복 안전)**: 동일 셀을 여러 번 실행해도 중복 생성되지 않도록 구현했습니다.

In [1]:

# === 1. Imports & Config ===
import os
from pathlib import Path
import random
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif, f_classif

try:
    from autogluon.tabular import TabularDataset, TabularPredictor
    AUTOGluon_OK = True
except Exception as e:
    AUTOGluon_OK = False
    print('[WARN] AutoGluon import 실패:', e)

# 경로/설정
DATA_DIR = Path('.')
TRAIN_PATH = DATA_DIR / 'train.csv'
TEST_PATH  = DATA_DIR / 'test.csv'
TARGET_COL = 'target'
ID_COL = None  # 예: 'id' (없으면 None)

N_FOLDS = 5
RANDOM_STATE = 42
PRESETS = 'medium_quality'
TIME_LIMIT = 1800  # 초(폴드당)

SAVE_ROOT = Path('./ag_cv_models')
OOF_PATH  = Path('./oof_predictions.csv')
FULL_DIR  = Path('./ag_full_model')
SUB_PATH  = Path('./submission.csv')

# 그룹 정의
G1 = ["X_05","X_09","X_20","X_22","X_25","X_51"]
VAR_RATIO = 0.95  # PCA 설명분산 임계

# 일단 여기선 별로 사용이 안 됨.
# 쌍 변수 그룹 (필요시 더 추가)
PAIR_GROUPS = [
    ["X_04","X_39"],
    ["X_06","X_45"],
    ####################
    ["X_10","X_17"],
    ["X_07","X_33"],
    ["X_12","X_21"],
    ["X_26","X_30"],
    ["X_38","X_47"],
    
    # ["X_05","X_25"], ...  # 예시로 더 넣을 수 있음
]

# 재현성
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

print('[INFO] Config loaded.')

  from .autonotebook import tqdm as notebook_tqdm


[INFO] Config loaded.


# 📊 Feature Importance (after training)
- AutoGluon `TabularPredictor`가 있으면 `predictor.feature_importance(...)` 사용
- 그렇지 않으면, `feature_importances_` 속성이 있는 트리계열 모델에서 중요도 추출
- 상위 30개를 가로바(barh)로 시각화

In [2]:
# === Feature Importance (Generic) ===
import inspect
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 후보 데이터/라벨명
label_candidates = ["target", "TARGET", "label", "Label", "y"]
df_candidates = ["train_pair", "train", "train_df", "df_train"]
X_candidates  = ["X_train", "X_tr", "X"]
model_candidates = ["predictor", "model", "clf", "estimator"]

def _detect_label_name(df):
    for col in df.columns:
        for lab in label_candidates:
            if col.lower() == lab.lower():
                return col
    return None

def _get_first_existing(names):
    g = globals()
    for nm in names:
        if nm in g:
            return nm, g[nm]
    return None, None

imp_df = None
used_route = None

# 1) Try AutoGluon TabularPredictor
name_pred, predictor = _get_first_existing(["predictor"])
if predictor is not None:
    try:
        # Check class name to avoid false positives
        clsname = predictor.__class__.__name__.lower()
        if "tabularpredictor" in clsname and hasattr(predictor, "feature_importance"):
            # find a labeled df
            df_name, df_obj = _get_first_existing(df_candidates)
            if isinstance(df_obj, pd.DataFrame):
                label_col = _detect_label_name(df_obj)
                if label_col is not None:
                    fi = predictor.feature_importance(df_obj, silent=True)
                    # AutoGluon returns a DataFrame with 'importance' column (usually 'importance' or 'importance_abs')
                    # Normalize to a standard format
                    if isinstance(fi, pd.DataFrame):
                        # try common column names
                        value_col = None
                        for c in ["importance", "importance_value", "importance_mean", "importance_abs"]:
                            if c in fi.columns:
                                value_col = c
                                break
                        if value_col is None and fi.shape[1] >= 1:
                            value_col = fi.columns[0]
                        imp_df = fi[[value_col]].copy()
                        imp_df.columns = ["importance"]
                    else:
                        # Unexpected; coerce
                        imp_df = pd.DataFrame({"importance": fi})
                    used_route = f"AutoGluon({name_pred}) on {df_name}"
    except Exception as e:
        print("[FI] AutoGluon route skipped:", repr(e))

# 2) Try any model with feature_importances_
if imp_df is None:
    # Pick a model object exposing feature_importances_
    model_obj = None
    for k, v in list(globals().items()):
        try:
            if hasattr(v, "feature_importances_"):
                model_obj = v
                model_name = k
                break
        except Exception:
            continue
    if model_obj is not None:
        # Determine feature names
        feat_names = None
        # Prefer X_train with columns
        Xn, Xobj = _get_first_existing(["X_train", "X_tr"])
        if isinstance(Xobj, pd.DataFrame):
            feat_names = list(Xobj.columns)
        if feat_names is None:
            # fallback: from train_pair excluding label
            df_name, df_obj = _get_first_existing(df_candidates)
            if isinstance(df_obj, pd.DataFrame):
                label_col = _detect_label_name(df_obj)
                if label_col is not None:
                    feat_names = [c for c in df_obj.columns if c != label_col]
        # Build df
        try:
            vals = np.array(model_obj.feature_importances_).ravel()
            if feat_names is None:
                feat_names = [f"f{i}" for i in range(len(vals))]
            imp_df = pd.DataFrame({"feature": feat_names, "importance": vals}).set_index("feature")
            used_route = f"{model_name}.feature_importances_"
        except Exception as e:
            print("[FI] feature_importances_ route skipped:", repr(e))

# 3) If nothing worked, exit gracefully
if imp_df is None or imp_df.empty:
    print("[FI] 중요도를 계산할 수 있는 학습 객체를 찾지 못했습니다.")
else:
    # Sort and take top 30
    imp_df = imp_df.sort_values("importance", ascending=False).head(30)
    display(imp_df)
    # Plot
    plt.figure(figsize=(8, max(4, 0.3*len(imp_df))))
    imp_df.sort_values("importance").plot(kind="barh", legend=False)
    plt.title(f"Feature Importance (top {len(imp_df)})\n{used_route if used_route else ''}")
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()

[FI] 중요도를 계산할 수 있는 학습 객체를 찾지 못했습니다.


In [3]:

# === 2. Load train/test, 타깃 점검 ===
assert TRAIN_PATH.exists(), f'train.csv가 {TRAIN_PATH} 에 없습니다.'
train = pd.read_csv(TRAIN_PATH)
print('[INFO] train.shape =', train.shape)
assert TARGET_COL in train.columns, f'{TARGET_COL} 컬럼이 train에 없습니다.'

if TEST_PATH.exists():
    test = pd.read_csv(TEST_PATH)
    print('[INFO] test.shape  =', test.shape)
else:
    test = None
    print('[WARN] test.csv 가 없어 추론 셀은 스킵될 수 있습니다.')

# 타깃 클린업(문자열 통일 + 결측 제거)
y = train[TARGET_COL].astype(str)
na_mask = y.isna() | (y.str.lower()=='nan') | (y=='None') | (y=='')
if na_mask.any():
    print(f'[WARN] 타깃 결측/유사결측 {na_mask.sum()}건 발견 → 해당 행 제거')
    train = train.loc[~na_mask].reset_index(drop=True)
    y = train[TARGET_COL].astype(str)

ALL_CLASSES = sorted(y.unique())
print('[INFO] 클래스 수:', len(ALL_CLASSES))
print('[INFO] 예시:', ALL_CLASSES[:20])

[INFO] train.shape = (21693, 54)
[INFO] test.shape  = (15004, 53)
[INFO] 클래스 수: 21
[INFO] 예시: ['0', '1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '3', '4', '5', '6', '7', '8']


In [4]:

# === 3. PAIR_GROUPS 전역 대표 변수 선택 (train 기준) ===
# - 결측률 낮을수록 좋음
# - mutual information 높을수록 좋음
# - ANOVA-F 높을수록 좋음
# 점수 = (-결측률) 정규화 + (MI) 정규화 + (F) 정규화  (각 1/3 가중)

# 고상관 쌍(PAIR_GROUPS)의 각 쌍에서 대표 변수 1개만 남기기 위해, 두 변수의 “품질+유용성”을 점수로 계산해 비교하는 로직
# -> 39, 45 삭제

rep_map = {}         # {('X_04','X_39'): 'X_04', ...}
to_drop_pairs = []   # 대표가 아닌 변수 목록

# 스코어 계산을 위한 임시 데이터(결측 채움: 수치 median) 구성
num_cols = train.drop(columns=[TARGET_COL]).select_dtypes(include=[np.number]).columns.tolist()
tmp = train.copy()

if len(num_cols) > 0:
    imputer_num = SimpleImputer(strategy='median')
    tmp[num_cols] = imputer_num.fit_transform(tmp[num_cols])

# 각 쌍에 대해 점수 산출
for pair in PAIR_GROUPS:
    pair = [c for c in pair if c in train.columns and c != TARGET_COL]
    if len(pair) != 2:
        print(f'[WARN] 쌍 {pair} 중 유효 컬럼이 2개가 아닙니다. 스킵.')
        continue

    cols_ok = []
    scores = []
    for col in pair:
        # 결측률 (원본 기준)
        miss_rate = train[col].isna().mean()

        # MI/F는 수치형에서만 바로 계산. 수치가 아닐 경우 임시 변환(범주→순번 인코딩)
        s = tmp[col]
        if not np.issubdtype(s.dtype, np.number):
            # 간단 라벨 인코딩
            s = s.astype('category').cat.codes

        X_ = s.values.reshape(-1, 1)
        y_ = train[TARGET_COL].astype(str).values

        # MI (discrete target)
        try:
            mi = mutual_info_classif(X_, y_, discrete_features=True, random_state=RANDOM_STATE)[0]
        except Exception:
            mi = 0.0

        # ANOVA-F
        try:
            f, _ = f_classif(X_, pd.factorize(y_)[0])
            f = float(f[0])
        except Exception:
            f = 0.0

        cols_ok.append(col)
        scores.append({'col': col, 'miss': miss_rate, 'mi': mi, 'f': f})

    if len(scores) != 2:
        print(f'[WARN] {pair} 점수 계산 실패(스킵)')
        continue

    # 정규화
    miss_vals = np.array([s['miss'] for s in scores])
    mi_vals   = np.array([s['mi']   for s in scores])
    f_vals    = np.array([s['f']    for s in scores])

    def norm(v):
        v = v.astype(float)
        if np.allclose(v.max(), v.min()):
            return np.zeros_like(v)
        return (v - v.min()) / (v.max() - v.min())

    miss_n = norm(1 - miss_vals)  # 결측률 낮을수록↑ → (1 - miss)
    mi_n   = norm(mi_vals)
    f_n    = norm(f_vals)

    final  = (miss_n + mi_n + f_n) / 3.0
    best_idx = int(np.argmax(final))
    rep = scores[best_idx]['col']
    rep_map[tuple(pair)] = rep

    drop_cols = [c for c in pair if c != rep]
    to_drop_pairs.extend(drop_cols)

print('[INFO] 대표 선택 결과:', rep_map)
print('[INFO] 삭제 대상(대표 아닌 변수):', to_drop_pairs)



[INFO] 대표 선택 결과: {('X_04', 'X_39'): 'X_04', ('X_06', 'X_45'): 'X_06', ('X_10', 'X_17'): 'X_17', ('X_07', 'X_33'): 'X_07', ('X_12', 'X_21'): 'X_12', ('X_26', 'X_30'): 'X_26', ('X_38', 'X_47'): 'X_38'}
[INFO] 삭제 대상(대표 아닌 변수): ['X_39', 'X_45', 'X_10', 'X_33', 'X_21', 'X_30', 'X_47']




In [5]:

# === 4. 대표 아닌 변수 삭제(전역) ===
train_pair = train.drop(columns=[c for c in to_drop_pairs if c in train.columns], errors='ignore').copy()
if test is not None:
    test_pair = test.drop(columns=[c for c in to_drop_pairs if c in test.columns], errors='ignore').copy()
else:
    test_pair = None

print('[INFO] train_pair.shape =', train_pair.shape)
if test_pair is not None:
    print('[INFO] test_pair.shape  =', test_pair.shape)

[INFO] train_pair.shape = (21693, 47)
[INFO] test_pair.shape  = (15004, 46)


In [6]:
train_pair.columns

Index(['ID', 'X_01', 'X_02', 'X_03', 'X_04', 'X_05', 'X_06', 'X_07', 'X_08',
       'X_09', 'X_11', 'X_12', 'X_13', 'X_14', 'X_15', 'X_16', 'X_17', 'X_18',
       'X_19', 'X_20', 'X_22', 'X_23', 'X_24', 'X_25', 'X_26', 'X_27', 'X_28',
       'X_29', 'X_31', 'X_32', 'X_34', 'X_35', 'X_36', 'X_37', 'X_38', 'X_40',
       'X_41', 'X_42', 'X_43', 'X_44', 'X_46', 'X_48', 'X_49', 'X_50', 'X_51',
       'X_52', 'target'],
      dtype='object')

In [7]:

# === 파생 피처 추가 (함수 없이 바로 적용) ===
_base_cols = [c for c in train_pair.columns if c != TARGET_COL]
_num_cols  = train_pair[_base_cols].select_dtypes(include=[float, int]).columns

# 1) 행 단위 요약
train_pair['row_na_cnt'] = train_pair[_base_cols].isna().sum(axis=1)
if test_pair is not None:
    test_pair['row_na_cnt'] = test_pair[_base_cols].isna().sum(axis=1)

train_pair['row_num_mean'] = train_pair[_num_cols].mean(axis=1)
train_pair['row_num_std']  = train_pair[_num_cols].std(axis=1)
if test_pair is not None:
    test_pair['row_num_mean'] = test_pair[_num_cols].mean(axis=1)
    test_pair['row_num_std']  = test_pair[_num_cols].std(axis=1)

if len(_num_cols) > 0:
    _Q1  = train_pair[_num_cols].quantile(0.25)
    _Q3  = train_pair[_num_cols].quantile(0.75)
    _IQR = (_Q3 - _Q1).replace(0, 1e-12)

    _below_tr = (train_pair[_num_cols] < (_Q1 - 1.5 * _IQR))
    _above_tr = (train_pair[_num_cols] > (_Q3 + 1.5 * _IQR))
    train_pair['row_outlier_cnt'] = (_below_tr | _above_tr).sum(axis=1)

    if test_pair is not None:
        _below_te = (test_pair[_num_cols] < (_Q1 - 1.5 * _IQR))
        _above_te = (test_pair[_num_cols] > (_Q3 + 1.5 * _IQR))
        test_pair['row_outlier_cnt'] = (_below_te | _above_te).sum(axis=1)

# 2) G1 그룹 통계
_g1_cols = [c for c in G1 if c in train_pair.columns]
if len(_g1_cols) > 0:
    train_pair['G1_mean']  = train_pair[_g1_cols].mean(axis=1)
    train_pair['G1_std']   = train_pair[_g1_cols].std(axis=1)
    train_pair['G1_min']   = train_pair[_g1_cols].min(axis=1)
    train_pair['G1_max']   = train_pair[_g1_cols].max(axis=1)
    train_pair['G1_range'] = train_pair['G1_max'] - train_pair['G1_min']

    if test_pair is not None:
        test_pair['G1_mean']  = test_pair[_g1_cols].mean(axis=1)
        test_pair['G1_std']   = test_pair[_g1_cols].std(axis=1)
        test_pair['G1_min']   = test_pair[_g1_cols].min(axis=1)
        test_pair['G1_max']   = test_pair[_g1_cols].max(axis=1)
        test_pair['G1_range'] = test_pair['G1_max'] - test_pair['G1_min']


# 3) 비선형 파생 (예: X_05, X_09)
if 'X_05' in train_pair.columns and 'X_09' in train_pair.columns:
    train_pair['X05_div_X09'] = train_pair['X_05'] / (train_pair['X_09'].replace(0, 1e-12) + 1e-12)
    train_pair['X05_mul_X09'] = train_pair['X_05'] * train_pair['X_09']
    if test_pair is not None:
        test_pair['X05_div_X09'] = test_pair['X_05'] / (test_pair['X_09'].replace(0, 1e-12) + 1e-12)
        test_pair['X05_mul_X09'] = test_pair['X_05'] * test_pair['X_09']

if 'X_05' in train_pair.columns:
    train_pair['X05_sq'] = train_pair['X_05'] ** 2
    if test_pair is not None:
        test_pair['X05_sq'] = test_pair['X_05'] ** 2

if 'X_09' in train_pair.columns:
    train_pair['X09_sq'] = train_pair['X_09'] ** 2
    if test_pair is not None:
        test_pair['X09_sq'] = test_pair['X_09'] ** 2


if 'X_20' in train_pair.columns and 'X_22' in train_pair.columns:
    train_pair['X20_div_X22'] = train_pair['X_20'] / (train_pair['X_22'].replace(0, 1e-12) + 1e-12)
    train_pair['X20_mul_X22'] = train_pair['X_20'] * train_pair['X_22']
    if test_pair is not None:
        test_pair['X20_div_X22'] = test_pair['X_20'] / (test_pair['X_22'].replace(0, 1e-12) + 1e-12)
        test_pair['X20_mul_X22'] = test_pair['X_20'] * test_pair['X_22']

if 'X_20' in train_pair.columns:
    train_pair['X20_sq'] = train_pair['X_20'] ** 2
    if test_pair is not None:
        test_pair['X20_sq'] = test_pair['X_20'] ** 2

if 'X_22' in train_pair.columns:
    train_pair['X22_sq'] = train_pair['X_22'] ** 2
    if test_pair is not None:
        test_pair['X22_sq'] = test_pair['X_22'] ** 2



if 'X_25' in train_pair.columns and 'X_51' in train_pair.columns:
    train_pair['X25_div_X51'] = train_pair['X_25'] / (train_pair['X_51'].replace(0, 1e-12) + 1e-12)
    train_pair['X25_mul_X51'] = train_pair['X_25'] * train_pair['X_51']
    if test_pair is not None:
        test_pair['X25_div_X51'] = test_pair['X_25'] / (test_pair['X_51'].replace(0, 1e-12) + 1e-12)
        test_pair['X25_mul_X51'] = test_pair['X_25'] * test_pair['X_51']

if 'X_25' in train_pair.columns:
    train_pair['X25_sq'] = train_pair['X_25'] ** 2
    if test_pair is not None:
        test_pair['X25_sq'] = test_pair['X_25'] ** 2

if 'X_51' in train_pair.columns:
    train_pair['X51_sq'] = train_pair['X_51'] ** 2
    if test_pair is not None:
        test_pair['X51_sq'] = test_pair['X_51'] ** 2



# import itertools

# # 1) pairwise 조합 (곱/나눗셈)
# for f1, f2 in itertools.combinations(G1, 2):
#     if f1 in train_pair.columns and f2 in train_pair.columns:
#         # 나눗셈 (f1 / f2)
#         train_pair[f"{f1}_div_{f2}"] = train_pair[f1] / (train_pair[f2].replace(0, 1e-12) + 1e-12)
#         if test_pair is not None:
#             test_pair[f"{f1}_div_{f2}"] = test_pair[f1] / (test_pair[f2].replace(0, 1e-12) + 1e-12)

#         # 곱 (f1 * f2)
#         train_pair[f"{f1}_mul_{f2}"] = train_pair[f1] * train_pair[f2]
#         if test_pair is not None:
#             test_pair[f"{f1}_mul_{f2}"] = test_pair[f1] * test_pair[f2]

# # 2) 각 feature에 대해 제곱
# for f in G1:
#     if f in train_pair.columns:
#         train_pair[f"{f}_sq"] = train_pair[f] ** 2
#         if test_pair is not None:
#             test_pair[f"{f}_sq"] = test_pair[f] ** 2

print('[INFO] 파생 피처 추가 완료. train_pair:', train_pair.shape, '| test_pair:', (None if test_pair is None else test_pair.shape))


[INFO] 파생 피처 추가 완료. train_pair: (21693, 68) | test_pair: (15004, 67)


In [8]:
# # === Custom Feature Engineering (요구사항 반영) ===
# from itertools import combinations

# # 중요 피처
# important_cols = ["X_08", "X_19", "X_29", "X_46", "X_40", "X_41", "X_49"]
# # 제거할 피처
# # drop_cols = ["X_17", "X_09", "X_40", "X_21_sq"]

# # drop_cols 제외 후 실제 존재하는 중요 피처만 필터링
# exist_importants = [c for c in important_cols if c in train_pair.columns and c not in drop_cols]

# # 상관관계 확인 (임계값: 0.95)
# high_corr_pairs = []
# if len(exist_importants) >= 2:
#     corr = train_pair[exist_importants].corr().abs()
#     for a, b in combinations(exist_importants, 2):
#         if corr.loc[a, b] >= 0.7:
#             high_corr_pairs.append((a, b))

# # 제곱 피처 생성
# for c in exist_importants:
#     new_col = f"{c}_sq"
#     if new_col not in train_pair.columns:
#         train_pair[new_col] = train_pair[c] ** 2
#         test_pair[new_col] = test_pair[c] ** 2

# # 곱 피처 생성 (상관관계 높은 경우만)
# if high_corr_pairs:
#     for a, b in high_corr_pairs:
#         new_col = f"{a}x{b}"
#         if new_col not in train_pair.columns:
#             train_pair[new_col] = train_pair[a] * train_pair[b]
#             test_pair[new_col] = test_pair[a] * test_pair[b]

# if high_corr_pairs:
#     for a, b in high_corr_pairs:
#         new_col = f"{a}x{b}"
#         if new_col not in train_pair.columns:
#             train_pair[new_col] = train_pair[a] * train_pair[b]
#             test_pair[new_col] = test_pair[a] * test_pair[b]
            
# # # 불필요 피처 제거
# # for c in drop_cols:
# #     if c in train_pair.columns:
# #         train_pair.drop(columns=[c], inplace=True)
# #     if c in test_pair.columns:
# #         test_pair.drop(columns=[c], inplace=True)

# print("[Custom FE] 제곱 추가:", [f"{c}_sq" for c in exist_importants])
# print("[Custom FE] 곱 추가:", [f"{a}x{b}" for a, b in high_corr_pairs])
# print("[Custom FE] 제거:", drop_cols)


In [9]:
# === Custom FE (inline, k-fold 직전) ===
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# ---- 설정 ----
important_cols = ["X_08", "X_19", "X_29", "X_46", "X_40", "X_41", "X_49"]
drop_cols = ["X_17", "X_09", "X_40", "X_21_sq"]
corr_threshold = 0.7
scale_cols = ["X_11", "X_19", "X_37", "X_40"]
label_candidates = ["target", "TARGET", "label", "Label", "y"]

# ---- 유틸 ----
def _safe_ratio(a: pd.Series, b: pd.Series, eps: float = 1e-9) -> pd.Series:
    # 0/0 또는 분모 0 대응
    return a / (b.replace(0, np.nan))
    # 필요시: return a / (b + eps)

def _numeric_feature_cols(df: pd.DataFrame):
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    to_exclude = []
    for c in num_cols:
        for lab in label_candidates:
            if c.lower() == lab.lower():
                to_exclude.append(c)
                break
    # id 류 컬럼 제외(있다면)
    for c in num_cols:
        if c.lower() in ("id", "idx"):
            to_exclude.append(c)
    to_exclude = list(dict.fromkeys(to_exclude))
    return [c for c in num_cols if c not in to_exclude]

# ---- 음수 개수 피처 ----
_numeric_cols_train = _numeric_feature_cols(train_pair)
train_pair["row_neg_count"] = (train_pair[_numeric_cols_train] < 0).sum(axis=1)

_numeric_cols_test = [c for c in _numeric_cols_train if c in test_pair.columns]
test_pair["row_neg_count"] = (test_pair[_numeric_cols_test] < 0).sum(axis=1)

# ---- 스케일링 ----
for c in scale_cols:
    if c in train_pair.columns:
        scaler = StandardScaler()
        train_pair[f"{c}_std"] = scaler.fit_transform(train_pair[[c]]).ravel()
        if c in test_pair.columns:
            test_pair[f"{c}_std"] = scaler.transform(test_pair[[c]]).ravel()

# ---- 중요 피처 파생 ----
exist_importants = [c for c in important_cols if c in train_pair.columns and c not in drop_cols]

# 상관관계 계산
high_corr_pairs = []
if len(exist_importants) >= 2:
    corr = train_pair[exist_importants].corr().abs()
    for a, b in combinations(exist_importants, 2):
        if corr.loc[a, b] >= corr_threshold:
            high_corr_pairs.append((a, b))

# 제곱
for c in exist_importants:
    sq = f"{c}_sq"
    if sq not in train_pair.columns and np.issubdtype(train_pair[c].dtype, np.number):
        train_pair[sq] = train_pair[c] ** 2
        if c in test_pair.columns:
            test_pair[sq] = test_pair[c] ** 2

# 곱/나눗셈 (고상관일 때)
if high_corr_pairs:
    for a, b in high_corr_pairs:
        prod = f"{a}x{b}"
        if prod not in train_pair.columns:
            train_pair[prod] = train_pair[a] * train_pair[b]
            if (a in test_pair.columns) and (b in test_pair.columns):
                test_pair[prod] = test_pair[a] * test_pair[b]
        # ratio a/b, b/a
        div1 = f"{a}_div_{b}"
        div2 = f"{b}_div_{a}"
        if div1 not in train_pair.columns:
            train_pair[div1] = _safe_ratio(train_pair[a], train_pair[b])
            if (a in test_pair.columns) and (b in test_pair.columns):
                test_pair[div1] = _safe_ratio(test_pair[a], test_pair[b])
        if div2 not in train_pair.columns:
            train_pair[div2] = _safe_ratio(train_pair[b], train_pair[a])
            if (a in test_pair.columns) and (b in test_pair.columns):
                test_pair[div2] = _safe_ratio(test_pair[b], test_pair[a])

# ---- 컬럼 제거 ----
for c in drop_cols:
    if c in train_pair.columns:
        train_pair.drop(columns=[c], inplace=True)
    if c in test_pair.columns:
        test_pair.drop(columns=[c], inplace=True)

print("[Custom FE] row_neg_count 추가 완료")
print("[Custom FE] scaling 추가:", [c for c in scale_cols if f"{c}_std" in train_pair.columns])
print("[Custom FE] squares:", [f"{c}_sq" for c in exist_importants])
print("[Custom FE] products:", [f"{a}x{b}" for a,b in high_corr_pairs])
print("[Custom FE] ratios:", [f"{a}_div_{b}" for a,b in high_corr_pairs] + [f"{b}_div_{a}" for a,b in high_corr_pairs])
print("[Custom FE] 제거:", [c for c in drop_cols])

[Custom FE] row_neg_count 추가 완료
[Custom FE] scaling 추가: ['X_11', 'X_19', 'X_37', 'X_40']
[Custom FE] squares: ['X_08_sq', 'X_19_sq', 'X_29_sq', 'X_46_sq', 'X_41_sq', 'X_49_sq']
[Custom FE] products: ['X_08xX_19', 'X_08xX_29']
[Custom FE] ratios: ['X_08_div_X_19', 'X_08_div_X_29', 'X_19_div_X_08', 'X_29_div_X_08']
[Custom FE] 제거: ['X_17', 'X_09', 'X_40', 'X_21_sq']


In [10]:
train_pair.columns

Index(['ID', 'X_01', 'X_02', 'X_03', 'X_04', 'X_05', 'X_06', 'X_07', 'X_08',
       'X_11', 'X_12', 'X_13', 'X_14', 'X_15', 'X_16', 'X_18', 'X_19', 'X_20',
       'X_22', 'X_23', 'X_24', 'X_25', 'X_26', 'X_27', 'X_28', 'X_29', 'X_31',
       'X_32', 'X_34', 'X_35', 'X_36', 'X_37', 'X_38', 'X_41', 'X_42', 'X_43',
       'X_44', 'X_46', 'X_48', 'X_49', 'X_50', 'X_51', 'X_52', 'target',
       'row_na_cnt', 'row_num_mean', 'row_num_std', 'row_outlier_cnt',
       'G1_mean', 'G1_std', 'G1_min', 'G1_max', 'G1_range', 'X05_div_X09',
       'X05_mul_X09', 'X05_sq', 'X09_sq', 'X20_div_X22', 'X20_mul_X22',
       'X20_sq', 'X22_sq', 'X25_div_X51', 'X25_mul_X51', 'X25_sq', 'X51_sq',
       'row_neg_count', 'X_11_std', 'X_19_std', 'X_37_std', 'X_40_std',
       'X_08_sq', 'X_19_sq', 'X_29_sq', 'X_46_sq', 'X_41_sq', 'X_49_sq',
       'X_08xX_19', 'X_08_div_X_19', 'X_19_div_X_08', 'X_08xX_29',
       'X_08_div_X_29', 'X_29_div_X_08'],
      dtype='object')

In [11]:

# === 5. Stratified K-Fold CV (fold별로 G1 표준화+PCA 적용 후 AutoGluon 학습) ===
if not AUTOGluon_OK:
    raise ImportError('AutoGluon 불러오기 실패. 설치 필요: pip install autogluon.tabular')

SAVE_ROOT.mkdir(parents=True, exist_ok=True)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=RANDOM_STATE)
fold_indices = list(skf.split(train_pair.drop(columns=[TARGET_COL]), train_pair[TARGET_COL].astype(str)))

oof_pred = pd.Series(index=np.arange(len(train_pair)), dtype=object)
fold_scores = []

for fold, (trn_idx, val_idx) in enumerate(fold_indices, start=1):
    print(f'\n===== [Fold {fold}/{N_FOLDS}] =====')
    trn_df = train_pair.iloc[trn_idx].reset_index(drop=True).copy()
    val_df = train_pair.iloc[val_idx].reset_index(drop=True).copy()

    # # --- G1 처리: 표준화 → PCA(0.95) → PC_G1_* 생성, G1 원본 삭제 ---
    # g1_cols = [c for c in G1 if c in trn_df.columns and c != TARGET_COL]
    # if len(g1_cols) > 0:
    #     # 결측 채우기(수치 median) → 스케일러/ PCA 학습은 trn 기준
    #     imputer_g1 = SimpleImputer(strategy='median')
    #     scaler_g1  = StandardScaler()
    #     pca_g1     = PCA(n_components=VAR_RATIO, svd_solver='full')

    #     trn_g1 = imputer_g1.fit_transform(trn_df[g1_cols])
    #     trn_g1 = scaler_g1.fit_transform(trn_g1)
    #     trn_g1_pc = pca_g1.fit_transform(trn_g1)

    #     # 검증 변환
    #     val_g1 = imputer_g1.transform(val_df[g1_cols])
    #     val_g1 = scaler_g1.transform(val_g1)
    #     val_g1_pc = pca_g1.transform(val_g1)

    #     # PC 열명 만들기
    #     n_pc = trn_g1_pc.shape[1]
    #     pc_cols = [f'PC_G1_{i+1}' for i in range(n_pc)]

    #     # 데이터프레임에 붙이고 원본 G1 삭제
    #     trn_pc_df = pd.DataFrame(trn_g1_pc, columns=pc_cols)
    #     val_pc_df = pd.DataFrame(val_g1_pc, columns=pc_cols)

    #     trn_df = pd.concat([trn_df.drop(columns=g1_cols), trn_pc_df], axis=1)
    #     val_df = pd.concat([val_df.drop(columns=g1_cols), val_pc_df], axis=1)
    # else:
    #     print('[INFO] 이번 fold에서 G1 컬럼이 존재하지 않아 PCA 스킵')



    # --- G1 처리: 표준화 → PCA(0.95) → PC_G1_* 생성, G1 원본 유지 ---
    g1_cols = [c for c in G1 if c in trn_df.columns and c != TARGET_COL]
    if len(g1_cols) > 0:
        # 결측 채우기(수치 median) → 스케일러/ PCA 학습은 trn 기준
        imputer_g1 = SimpleImputer(strategy='median')
        scaler_g1  = StandardScaler()
        pca_g1     = PCA(n_components=VAR_RATIO, svd_solver='full')

        trn_g1 = imputer_g1.fit_transform(trn_df[g1_cols])
        trn_g1 = scaler_g1.fit_transform(trn_g1)
        trn_g1_pc = pca_g1.fit_transform(trn_g1)

        # 검증 변환
        val_g1 = imputer_g1.transform(val_df[g1_cols])
        val_g1 = scaler_g1.transform(val_g1)
        val_g1_pc = pca_g1.transform(val_g1)

        # PC 열명 만들기
        n_pc = trn_g1_pc.shape[1]
        pc_cols = [f'PC_G1_{i+1}' for i in range(n_pc)]

        # 데이터프레임에 붙이고 원본 G1 유지
        trn_pc_df = pd.DataFrame(trn_g1_pc, columns=pc_cols, index=trn_df.index)
        val_pc_df = pd.DataFrame(val_g1_pc, columns=pc_cols, index=val_df.index)

        trn_df = pd.concat([trn_df, trn_pc_df], axis=1)
        val_df = pd.concat([val_df, val_pc_df], axis=1)
    else:
        print('[INFO] 이번 fold에서 G1 컬럼이 존재하지 않아 PCA 스킵')



    # AutoGluon 학습
    fold_dir = SAVE_ROOT / f'fold_{fold}'
    if fold_dir.exists():
        print(f'[INFO] {fold_dir} 재사용/덮어쓰기')
    fold_dir.mkdir(parents=True, exist_ok=True)

    ag_trn = TabularDataset(trn_df)
    ag_val = TabularDataset(val_df)

    predictor = TabularPredictor(label=TARGET_COL, path=str(fold_dir), eval_metric='f1_macro')
    predictor.fit(train_data=ag_trn, tuning_data=ag_val, presets=PRESETS, time_limit=TIME_LIMIT, verbosity=2)

    # 검증 예측
    y_true = val_df[TARGET_COL].astype(str).reset_index(drop=True)
    y_pred = predictor.predict(ag_val).astype(str).reset_index(drop=True)

    score = f1_score(y_true, y_pred, average='macro', labels=ALL_CLASSES, zero_division=0)
    print(f'[Fold {fold}] macro-F1: {score:.4f}')
    fold_scores.append(score)

    oof_pred.iloc[val_idx] = y_pred.values

# OOF 성능
y_all = train_pair[TARGET_COL].astype(str).reset_index(drop=True)
assert oof_pred.isna().sum() == 0, 'OOF 예측에 결측이 있습니다.'

oof_score = f1_score(y_all, oof_pred.astype(str), average='macro', labels=ALL_CLASSES, zero_division=0)
print('\n===== CV 결과 =====')
print('Fold scores:', np.round(fold_scores, 4).tolist())
print(f'OOF macro-F1: {oof_score:.4f}')

oof_df = pd.DataFrame({'y_true': y_all, 'y_pred': oof_pred.astype(str)})
oof_df.to_csv(OOF_PATH, index=False, encoding='utf-8')
print('[INFO] OOF 저장:', OOF_PATH.resolve())


===== [Fold 1/5] =====


Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.11.9
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26100
CPU Count:          28
Memory Avail:       21.84 GB / 31.69 GB (68.9%)
Disk Space Avail:   837.35 GB / 953.01 GB (87.9%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 1800s
AutoGluon will save models to "c:\Users\SSAFY\Desktop\STUDY\Autogluon\ag_cv_models\fold_1"
Train Data Rows:    17354
Train Data Columns: 77
Tuning Data Rows:    4339
Tuning Data Columns: 77
Label Column:       target
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 21) unique label values:  [np.int64(0), np.int64(20), np.int64(1), np.int64(19), np.int64(15), np.int64(8), np.int64(12), np.int64(4), np.int64(5), np.int64(11)]
	If 'multiclass' is not the correc

[1000]	valid_set's multi_logloss: 0.546444	valid_set's f1_macro: 0.823349


	0.8263	 = Validation score   (f1_macro)
	24.22s	 = Training   runtime
	0.44s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 1752.43s of the 1752.43s of remaining time.
	Fitting with cpus=20, gpus=0, mem=0.2/21.7 GB
	0.8286	 = Validation score   (f1_macro)
	13.32s	 = Training   runtime
	0.16s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 1738.14s of the 1738.14s of remaining time.
	Fitting with cpus=28, gpus=0, mem=0.2/21.7 GB
	0.8014	 = Validation score   (f1_macro)
	4.19s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 1733.70s of the 1733.70s of remaining time.
	Fitting with cpus=28, gpus=0, mem=0.2/21.4 GB
	0.7985	 = Validation score   (f1_macro)
	9.42s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 1724.03s of the 1724.03s of remaining time.
	Fitting with cpus=20, gpus=0
	0.8156	 = Validation score

[1000]	valid_set's multi_logloss: 0.706388	valid_set's f1_macro: 0.824661


	0.8251	 = Validation score   (f1_macro)
	34.82s	 = Training   runtime
	0.41s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.00s of the 1272.30s of remaining time.
	Ensemble Weights: {'NeuralNetFastAI': 0.5, 'NeuralNetTorch': 0.25, 'ExtraTreesGini': 0.083, 'ExtraTreesEntr': 0.083, 'RandomForestEntr': 0.042, 'XGBoost': 0.042}
	0.8746	 = Validation score   (f1_macro)
	0.45s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 528.18s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 10932.8 rows/s (4339 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("c:\Users\SSAFY\Desktop\STUDY\Autogluon\ag_cv_models\fold_1")
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.11.9
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26100
CPU Count:          28
Memory Avail:       20.82 GB / 31.69 GB (65.7%)
D

[Fold 1] macro-F1: 0.8746

===== [Fold 2/5] =====


Beginning AutoGluon training ... Time limit = 1800s
AutoGluon will save models to "c:\Users\SSAFY\Desktop\STUDY\Autogluon\ag_cv_models\fold_2"
Train Data Rows:    17354
Train Data Columns: 77
Tuning Data Rows:    4339
Tuning Data Columns: 77
Label Column:       target
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 21) unique label values:  [np.int64(20), np.int64(1), np.int64(15), np.int64(0), np.int64(8), np.int64(16), np.int64(14), np.int64(18), np.int64(3), np.int64(5)]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       multiclass
Preprocessing data ...
Train Data Class Count: 21
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:          

[1000]	valid_set's multi_logloss: 0.553579	valid_set's f1_macro: 0.831979


	0.8329	 = Validation score   (f1_macro)
	24.96s	 = Training   runtime
	0.43s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 1756.53s of the 1756.53s of remaining time.
	Fitting with cpus=20, gpus=0, mem=0.2/20.8 GB
	0.8294	 = Validation score   (f1_macro)
	10.83s	 = Training   runtime
	0.11s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 1745.04s of the 1745.04s of remaining time.
	Fitting with cpus=28, gpus=0, mem=0.2/20.8 GB
	0.8013	 = Validation score   (f1_macro)
	4.53s	 = Training   runtime
	0.08s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 1740.24s of the 1740.24s of remaining time.
	Fitting with cpus=28, gpus=0, mem=0.2/20.5 GB
	0.8052	 = Validation score   (f1_macro)
	9.99s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 1729.97s of the 1729.97s of remaining time.
	Fitting with cpus=20, gpus=0
	0.8201	 = Validation score 

[Fold 2] macro-F1: 0.8722

===== [Fold 3/5] =====


Train Data Class Count: 21
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    21003.62 MB
	Train Data (Original)  Memory Usage: 13.99 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Useless Original Features (Count: 1): ['row_na_cnt']
		These features carry no predictive signal and should be manually investigated.
		This is typically a feature which has the same value for all rows.
		These features do not 

[Fold 3] macro-F1: 0.8766

===== [Fold 4/5] =====


AutoGluon will save models to "c:\Users\SSAFY\Desktop\STUDY\Autogluon\ag_cv_models\fold_4"
Train Data Rows:    17355
Train Data Columns: 77
Tuning Data Rows:    4338
Tuning Data Columns: 77
Label Column:       target
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 21) unique label values:  [np.int64(0), np.int64(20), np.int64(1), np.int64(19), np.int64(15), np.int64(8), np.int64(16), np.int64(12), np.int64(14), np.int64(18)]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       multiclass
Preprocessing data ...
Train Data Class Count: 21
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    20809.96 MB
	Train Data (Original)  Memo

[Fold 4] macro-F1: 0.8772

===== [Fold 5/5] =====


Beginning AutoGluon training ... Time limit = 1800s
AutoGluon will save models to "c:\Users\SSAFY\Desktop\STUDY\Autogluon\ag_cv_models\fold_5"
Train Data Rows:    17355
Train Data Columns: 77
Tuning Data Rows:    4338
Tuning Data Columns: 77
Label Column:       target
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 21) unique label values:  [np.int64(0), np.int64(19), np.int64(15), np.int64(1), np.int64(16), np.int64(12), np.int64(14), np.int64(18), np.int64(3), np.int64(20)]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       multiclass
Preprocessing data ...
Train Data Class Count: 21
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:        

[1000]	valid_set's multi_logloss: 0.564171	valid_set's f1_macro: 0.824979


	0.827	 = Validation score   (f1_macro)
	21.81s	 = Training   runtime
	0.35s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 1757.45s of the 1757.45s of remaining time.
	Fitting with cpus=20, gpus=0, mem=0.2/20.2 GB
	0.8296	 = Validation score   (f1_macro)
	10.1s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 1746.82s of the 1746.82s of remaining time.
	Fitting with cpus=28, gpus=0, mem=0.2/20.3 GB
	0.804	 = Validation score   (f1_macro)
	4.56s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 1742.00s of the 1742.00s of remaining time.
	Fitting with cpus=28, gpus=0, mem=0.2/20.0 GB
	0.8052	 = Validation score   (f1_macro)
	9.95s	 = Training   runtime
	0.08s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 1731.81s of the 1731.81s of remaining time.
	Fitting with cpus=20, gpus=0
	0.8174	 = Validation score   (

[Fold 5] macro-F1: 0.8764

===== CV 결과 =====
Fold scores: [0.8746, 0.8722, 0.8766, 0.8772, 0.8764]
OOF macro-F1: 0.8755
[INFO] OOF 저장: C:\Users\SSAFY\Desktop\STUDY\Autogluon\oof_predictions.csv


In [12]:

# === 6. Full-train + Test 예측 (전역 대표/드롭 적용 기준, G1은 full-train 기준으로 PCA) ===
if test is None:
    print('[WARN] test.csv 없음 → 스킵')
else:
    # 전역 pair-drop 반영된 사본 사용
    trn_full = train_pair.copy()
    tst_full = test_pair.copy()

    # # G1 처리: 결측→표준화→PCA(0.95) → PC_G1_* 남기고 원본 삭제 (full-train 기준으로 적합)
    # g1_cols_full = [c for c in G1 if c in trn_full.columns and c != TARGET_COL]
    # if len(g1_cols_full) > 0:
    #     imputer_g1_full = SimpleImputer(strategy='median')
    #     scaler_g1_full  = StandardScaler()
    #     pca_g1_full     = PCA(n_components=VAR_RATIO, svd_solver='full')

    #     trn_g1f = imputer_g1_full.fit_transform(trn_full[g1_cols_full])
    #     trn_g1f = scaler_g1_full.fit_transform(trn_g1f)
    #     trn_g1f_pc = pca_g1_full.fit_transform(trn_g1f)

    #     tst_g1f = imputer_g1_full.transform(tst_full[g1_cols_full])
    #     tst_g1f = scaler_g1_full.transform(tst_g1f)
    #     tst_g1f_pc = pca_g1_full.transform(tst_g1f)

    #     n_pc_full = trn_g1f_pc.shape[1]
    #     pc_cols_full = [f'PC_G1_{i+1}' for i in range(n_pc_full)]

    #     trn_pc_full = pd.DataFrame(trn_g1f_pc, columns=pc_cols_full)
    #     tst_pc_full = pd.DataFrame(tst_g1f_pc, columns=pc_cols_full)

    #     trn_full = pd.concat([trn_full.drop(columns=g1_cols_full), trn_pc_full], axis=1)
    #     tst_full = pd.concat([tst_full.drop(columns=g1_cols_full), tst_pc_full], axis=1)
    # else:
    #     print('[INFO] full-train 기준 G1 컬럼 없음 → PCA 스킵')


    # G1 처리: 결측→표준화→PCA(0.95) → PC_G1_* 추가 (원본은 유지)
    g1_cols_full = [c for c in G1 if c in trn_full.columns and c != TARGET_COL]
    if len(g1_cols_full) > 0:
        imputer_g1_full = SimpleImputer(strategy='median')
        scaler_g1_full  = StandardScaler()
        pca_g1_full     = PCA(n_components=VAR_RATIO, svd_solver='full')

        # train
        trn_g1f = imputer_g1_full.fit_transform(trn_full[g1_cols_full])
        trn_g1f = scaler_g1_full.fit_transform(trn_g1f)
        trn_g1f_pc = pca_g1_full.fit_transform(trn_g1f)

        # test
        tst_g1f = imputer_g1_full.transform(tst_full[g1_cols_full])
        tst_g1f = scaler_g1_full.transform(tst_g1f)
        tst_g1f_pc = pca_g1_full.transform(tst_g1f)

        # PC 컬럼 생성
        n_pc_full = trn_g1f_pc.shape[1]
        pc_cols_full = [f'PC_G1_{i+1}' for i in range(n_pc_full)]

        trn_pc_full = pd.DataFrame(trn_g1f_pc, columns=pc_cols_full, index=trn_full.index)
        tst_pc_full = pd.DataFrame(tst_g1f_pc, columns=pc_cols_full, index=tst_full.index)

        # 원본은 유지하고 PC만 추가
        trn_full = pd.concat([trn_full, trn_pc_full], axis=1)
        tst_full = pd.concat([tst_full, tst_pc_full], axis=1)
    else:
        print('[INFO] full-train 기준 G1 컬럼 없음 → PCA 스킵')




    # AutoGluon 전체 학습
    FULL_DIR.mkdir(parents=True, exist_ok=True)
    ag_full = TabularDataset(trn_full)
    full_predictor = TabularPredictor(label=TARGET_COL, path=str(FULL_DIR), eval_metric='f1_macro')
    full_predictor.fit(train_data=ag_full, presets=PRESETS, time_limit=TIME_LIMIT, verbosity=2)

    # Test 예측 & 제출 저장
    ag_test = TabularDataset(tst_full)
    test_pred = full_predictor.predict(ag_test).astype(str)

    if ID_COL is not None and ID_COL in test.columns:
        sub = pd.DataFrame({ID_COL: test[ID_COL], TARGET_COL: test_pred})
    else:
        sub = pd.DataFrame({'row_id': np.arange(len(test_pred)), TARGET_COL: test_pred})

    sub.to_csv(SUB_PATH, index=False, encoding='utf-8')
    print('[INFO] submission 저장:', SUB_PATH.resolve())

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.11.9
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26100
CPU Count:          28
Memory Avail:       19.18 GB / 31.69 GB (60.5%)
Disk Space Avail:   828.62 GB / 953.01 GB (86.9%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 1800s
AutoGluon will save models to "c:\Users\SSAFY\Desktop\STUDY\Autogluon\ag_full_model"
Train Data Rows:    21693
Train Data Columns: 82
Label Column:       target
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 21) unique label values:  [np.int64(0), np.int64(20), np.int64(1), np.int64(19), np.int64(15), np.int64(8), np.int64(16), np.int64(12), np.int64(14), np.int64(18)]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_ty

[INFO] submission 저장: C:\Users\SSAFY\Desktop\STUDY\Autogluon\submission.csv


In [13]:
# === Feature Importance (AutoGluon: 전체) ===
import matplotlib.pyplot as plt

# 라벨 자동 탐지
label_candidates = ["target", "TARGET", "label", "Label", "y"]
label_col = None
for col in train_pair.columns:
    if col in label_candidates:
        label_col = col
        break

# 중요도 계산
fi = predictor.feature_importance(train_pair, silent=True)

# 전체 출력
display(fi)

# 시각화 (전체 피처)
plt.figure(figsize=(8, max(6, 0.3*len(fi))))
fi.sort_values("importance").plot(kind="barh", y="importance", legend=False)
plt.title("Feature Importance (All Features)")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()


These features in provided data are not utilized by the predictor and will be ignored: ['ID', 'X_05', 'X_20', 'X_22', 'X_25', 'X_51', 'row_na_cnt']


KeyError: "1 required columns are missing from the provided dataset to transform using AutoMLPipelineFeatureGenerator. 1 missing columns: ['PC_G1_1'] | 74 available columns: ['X_01', 'X_02', 'X_03', 'X_04', 'X_06', 'X_07', 'X_08', 'X_11', 'X_12', 'X_13', 'X_14', 'X_15', 'X_16', 'X_18', 'X_19', 'X_23', 'X_24', 'X_26', 'X_27', 'X_28', 'X_29', 'X_31', 'X_32', 'X_34', 'X_35', 'X_36', 'X_37', 'X_38', 'X_41', 'X_42', 'X_43', 'X_44', 'X_46', 'X_48', 'X_49', 'X_50', 'X_52', 'row_num_mean', 'row_num_std', 'row_outlier_cnt', 'G1_mean', 'G1_std', 'G1_min', 'G1_max', 'G1_range', 'X05_div_X09', 'X05_mul_X09', 'X05_sq', 'X09_sq', 'X20_div_X22', 'X20_mul_X22', 'X20_sq', 'X22_sq', 'X25_div_X51', 'X25_mul_X51', 'X25_sq', 'X51_sq', 'row_neg_count', 'X_11_std', 'X_19_std', 'X_37_std', 'X_40_std', 'X_08_sq', 'X_19_sq', 'X_29_sq', 'X_46_sq', 'X_41_sq', 'X_49_sq', 'X_08xX_19', 'X_08_div_X_19', 'X_19_div_X_08', 'X_08xX_29', 'X_08_div_X_29', 'X_29_div_X_08']"


## 전처리 메모 (요약)
- **PAIR_GROUPS 대표 선택**: `(-결측률) 정규화 + MI 정규화 + F 정규화` 평균으로 대표 1개 선택(전역 고정). 대표 외 변수는 **삭제**.
- **G1 그룹**: 수치 결측 median 대체 → 표준화(`StandardScaler`) → `PCA(n_components=0.95)` → `PC_G1_i`만 유지, G1 원본 **삭제**.
- **K-Fold 내 누수 방지**: G1의 `imputer/scaler/PCA`는 **각 폴드의 train 부분으로만 적합** 후 val에 적용. (전역 pair-drop은 기준 통일을 위해 train 전체에서 1회 산정)
- AutoGluon은 나머지 결측/범주형 처리를 자체적으로 수행.
