## Problem Definition
- A binary classification problem: given a patient's data, predict whether the patient has thyroid disease
    - Evaluation metric: F1-score
- Competition URL
    - https://www.kaggle.com/competitions/scu-ai-competition-202401
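
For reference, the F1-score is the harmonic mean of precision and recall for the positive class (here, label 1). The sketch below computes it by hand on made-up labels; the function name `f1_from_labels` and the toy lists are illustrative assumptions and are not part of the competition code.

```python
def f1_from_labels(y_true, y_pred):
    """Binary F1-score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (assumption): 3 positives out of 10
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
all_negative = [0] * len(y_true)

print(f1_from_labels(y_true, y_pred))        # 2 TP, 1 FP, 1 FN
print(f1_from_labels(y_true, all_negative))  # predicting only 0 scores nothing
```

A model that always predicts the majority class can reach high accuracy on imbalanced data, but its F1 for the positive class is 0, which is one reason F1 suits this kind of task.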



# Download and Extract the Competition Data

- Data path variable


In [None]:
DATA_PATH = "C:/Users/ltk65/Downloads/scu-ai-competition-202401/"
DATA_PATH

'C:/Users/ltk65/Downloads/scu-ai-competition-202401/'

- Seed value

In [None]:
SEED = 42

- Load the data

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv(f"{DATA_PATH}train.csv")  # training data
test = pd.read_csv(f"{DATA_PATH}test.csv")  # test data
train.shape , test.shape

((4223, 18), (3456, 17))

# Target Column
- 0: healthy patient
- 1: patient with thyroid disease

In [None]:
train.head()

Unnamed: 0,ID,나이,성별,티록신_복용_여부,항갑상선제_복용_여부,지병_여부,임신_여부,갑상선_수술_이력,I131_치료_여부,갑상선저하_인지_여부,갑상선항진증_인지_여부,리튬_치료_여부,갑상선종_여부,종양_여부,TSH,FreeT3,FreeT4,target
0,train_0,59.0,남,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,,,0.77,0
1,train_1,63.0,남,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,33.0,1.5,,1
2,train_2,65.0,여,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,1.7,2.3,0.95,0
3,train_3,33.0,남,아니오,아니오,예,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,6.2,,0.66,0
4,train_4,64.0,여,예,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,아니오,1.2,,0.95,0


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4223 entries, 0 to 4222
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            4223 non-null   object 
 1   나이            4220 non-null   float64
 2   성별            4086 non-null   object 
 3   티록신_복용_여부     4223 non-null   object 
 4   항갑상선제_복용_여부   4223 non-null   object 
 5   지병_여부         4223 non-null   object 
 6   임신_여부         4223 non-null   object 
 7   갑상선_수술_이력     4223 non-null   object 
 8   I131_치료_여부    4223 non-null   object 
 9   갑상선저하_인지_여부   4223 non-null   object 
 10  갑상선항진증_인지_여부  4223 non-null   object 
 11  리튬_치료_여부      4223 non-null   object 
 12  갑상선종_여부       4223 non-null   object 
 13  종양_여부         4223 non-null   object 
 14  TSH           3832 non-null   float64
 15  FreeT3        2989 non-null   float64
 16  FreeT4        3854 non-null   float64
 17  target        4223 non-null   int64  
dtypes: float64(4), int64(1), object(13)

# Check the Target Values
- Check the class ratio

In [None]:
train["target"].mean()

0.11816244376035993
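
Because the target is coded as 0/1, its mean equals the share of positive cases, so roughly 11.8% of the training rows are thyroid-disease patients and the classes are imbalanced. A minimal sketch of the same check on toy labels (the Series below is made up for illustration):

```python
import pandas as pd

# Toy labels (assumption): 2 positives out of 10
toy_target = pd.Series([0, 0, 1, 0, 0, 0, 0, 1, 0, 0], name="target")

positive_ratio = toy_target.mean()                      # fraction of 1s
class_ratio = toy_target.value_counts(normalize=True)   # share of each class

print(positive_ratio)
print(class_ratio)
```

`value_counts(normalize=True)` gives the same information per class and also works when labels are not 0/1.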

# Handling Missing Values

In [None]:
train.isnull().sum()

ID                 0
나이                 3
성별               137
티록신_복용_여부          0
항갑상선제_복용_여부        0
지병_여부              0
임신_여부              0
갑상선_수술_이력          0
I131_치료_여부         0
갑상선저하_인지_여부        0
갑상선항진증_인지_여부       0
리튬_치료_여부           0
갑상선종_여부            0
종양_여부              0
TSH              391
FreeT3          1234
FreeT4           369
target             0
dtype: int64

In [None]:
test.isnull().sum()

ID                0
나이                1
성별              117
티록신_복용_여부         0
항갑상선제_복용_여부       0
지병_여부             0
임신_여부             0
갑상선_수술_이력         0
I131_치료_여부        0
갑상선저하_인지_여부       0
갑상선항진증_인지_여부      0
리튬_치료_여부          0
갑상선종_여부           0
종양_여부             0
TSH             333
FreeT3          975
FreeT4          312
dtype: int64

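- To avoid data leakage, fill missing values with either an arbitrarily chosen constant or a statistic computed from the training data only. The same fill values are then applied to the test data.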

In [None]:
fill_age = train["나이"].median()
fill_tsh = train["TSH"].median()
fill_free_t3 = train["FreeT3"].median()
fill_free_t4 = train["FreeT4"].median()

In [None]:
train["나이"] = train["나이"].fillna(fill_age)
train["성별"] = train["성별"].fillna("UNK")
train["TSH"] = train["TSH"].fillna(fill_tsh)
train["FreeT3"] = train["FreeT3"].fillna(fill_free_t3)
train["FreeT4"] = train["FreeT4"].fillna(fill_free_t4)

In [None]:
test["나이"] = test["나이"].fillna(fill_age)
test["성별"] = test["성별"].fillna("UNK")
test["TSH"] = test["TSH"].fillna(fill_tsh)
test["FreeT3"] = test["FreeT3"].fillna(fill_free_t3)
test["FreeT4"] = test["FreeT4"].fillna(fill_free_t4)

In [None]:
train.isnull().sum().sum() , test.isnull().sum().sum()

(0, 0)
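
The fill steps above can be condensed into one helper that computes statistics on the training frame only and reuses them for both frames. This is a sketch on toy data; the helper name `fill_with_train_median` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_train_median(train_df, test_df, num_cols):
    """Fill NaNs in both frames using medians computed on train_df only."""
    medians = train_df[num_cols].median()  # statistics from training data only
    train_out = train_df.copy()
    test_out = test_df.copy()
    train_out[num_cols] = train_out[num_cols].fillna(medians)
    test_out[num_cols] = test_out[num_cols].fillna(medians)
    return train_out, test_out, medians

# Toy frames (assumption): column names mirror the notebook, values are made up
toy_train = pd.DataFrame({"TSH": [1.0, 2.0, np.nan, 9.0],
                          "FreeT4": [0.8, np.nan, 1.0, 1.2]})
toy_test = pd.DataFrame({"TSH": [np.nan, 100.0],
                         "FreeT4": [np.nan, 0.5]})

toy_train_filled, toy_test_filled, medians = fill_with_train_median(
    toy_train, toy_test, ["TSH", "FreeT4"]
)
print(medians)
print(toy_test_filled)
```

Note that the test value 100.0 does not influence the fill value: the missing test TSH is filled with the training median, not a statistic of the test set.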

In [None]:
train.columns

Index(['ID', '나이', '성별', '티록신_복용_여부', '항갑상선제_복용_여부', '지병_여부', '임신_여부',
       '갑상선_수술_이력', 'I131_치료_여부', '갑상선저하_인지_여부', '갑상선항진증_인지_여부', '리튬_치료_여부',
       '갑상선종_여부', '종양_여부', 'TSH', 'FreeT3', 'FreeT4', 'target'],
      dtype='object')

# Feature Engineering
- Because predictions must be made on the test set for evaluation, every feature engineering step applied to the training set must be applied to the test set in the same way.


## Feature Extraction

- Select the variables to use as features

In [None]:
train_ft = train.iloc[:, 1:-1].copy()  # exclude the ID column and the target column
test_ft = test.iloc[:, 1:].copy()  # exclude the ID column
train_ft.shape, test_ft.shape

((4223, 16), (3456, 16))
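
Positional slicing with `iloc` depends on the column order of the CSV. As an alternative sketch, the same selection can be written by column name; the toy frames below are assumptions that only mimic the notebook's layout.

```python
import pandas as pd

# Toy frames (assumption): a few columns with the same naming as the notebook
toy_train = pd.DataFrame({"ID": ["train_0", "train_1"],
                          "나이": [59.0, 63.0],
                          "TSH": [1.7, 33.0],
                          "target": [0, 1]})
toy_test = pd.DataFrame({"ID": ["test_0"], "나이": [37.0], "TSH": [0.05]})

# errors="ignore" lets the same call work on the test frame, which has no target column
drop_cols = ["ID", "target"]
toy_train_ft = toy_train.drop(columns=drop_cols, errors="ignore")
toy_test_ft = toy_test.drop(columns=drop_cols, errors="ignore")

print(list(toy_train_ft.columns), list(toy_test_ft.columns))
```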

## Feature Encoding

- One-Hot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder
cols = train_ft.select_dtypes("object").columns.tolist()
enc = OneHotEncoder(handle_unknown = 'ignore')

In [None]:
# training data: fit the encoder and transform
tmp = pd.DataFrame(
    enc.fit_transform(train_ft[cols]).toarray(),
    columns = enc.get_feature_names_out()
)
train_ft = pd.concat([train_ft,tmp],axis=1).drop(columns=cols) # drop the original categorical columns
train_ft.head()

Unnamed: 0,나이,TSH,FreeT3,FreeT4,성별_UNK,성별_남,성별_여,티록신_복용_여부_아니오,티록신_복용_여부_예,항갑상선제_복용_여부_아니오,...,갑상선저하_인지_여부_아니오,갑상선저하_인지_여부_예,갑상선항진증_인지_여부_아니오,갑상선항진증_인지_여부_예,리튬_치료_여부_아니오,리튬_치료_여부_예,갑상선종_여부_아니오,갑상선종_여부_예,종양_여부_아니오,종양_여부_예
0,59.0,4.953486,2.021445,0.77,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,63.0,33.0,1.5,0.965153,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,65.0,1.7,2.3,0.95,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,33.0,6.2,2.021445,0.66,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,64.0,1.2,2.021445,0.95,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [None]:
# test data
tmp = pd.DataFrame(
    enc.transform(test_ft[cols]).toarray(), # the test data must only be transformed, never fit
    columns = enc.get_feature_names_out()
)
test_ft = pd.concat([test_ft,tmp],axis=1).drop(columns=cols) # drop the original categorical columns
test_ft.head()

Unnamed: 0,나이,TSH,FreeT3,FreeT4,성별_UNK,성별_남,성별_여,티록신_복용_여부_아니오,티록신_복용_여부_예,항갑상선제_복용_여부_아니오,...,갑상선저하_인지_여부_아니오,갑상선저하_인지_여부_예,갑상선항진증_인지_여부_아니오,갑상선항진증_인지_여부_예,리튬_치료_여부_아니오,리튬_치료_여부_예,갑상선종_여부_아니오,갑상선종_여부_예,종양_여부_아니오,종양_여부_예
0,37.0,4.953486,2.021445,0.83,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,55.0,0.05,1.5,0.72,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,71.0,0.1,1.9,0.97,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,35.0,0.97,2.021445,0.97,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,16.0,0.3,2.021445,1.06,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


## Feature Scaling
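- Min-Max Scaling
    - A normalization technique that rescales each numeric feature to a fixed range.
    - Values are mapped to the range 0 to 1 using the minimum and maximum of the data the scaler was fit on (here, the training data).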
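
Concretely, each value is transformed as x' = (x - min) / (max - min), with min and max taken per feature from the fitted data. The sketch below uses made-up numbers to show one consequence: a test value outside the training range lands outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature arrays (assumption)
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[15.0], [40.0]])   # 40 exceeds the training maximum

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns min=10, max=30
test_scaled = toy_scaler.transform(toy_test)        # reuses the training min/max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

This is expected behavior when the scaler is fit on training data only, and tree-based models such as LightGBM are generally insensitive to this kind of monotonic rescaling.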

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# training data: fit the scaler and transform
train_ft[train_ft.columns] = scaler.fit_transform(train_ft)
train_ft.head()

Unnamed: 0,나이,TSH,FreeT3,FreeT4,성별_UNK,성별_남,성별_여,티록신_복용_여부_아니오,티록신_복용_여부_예,항갑상선제_복용_여부_아니오,...,갑상선저하_인지_여부_아니오,갑상선저하_인지_여부_예,갑상선항진증_인지_여부_아니오,갑상선항진증_인지_여부_예,리튬_치료_여부_아니오,리튬_치료_여부_예,갑상선종_여부_아니오,갑상선종_여부_예,종양_여부_아니오,종양_여부_예
0,0.608247,0.010484,0.10983,0.278075,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,0.649485,0.069905,0.08078,0.382435,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0.670103,0.003591,0.125348,0.374332,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,0.340206,0.013125,0.10983,0.219251,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,0.659794,0.002532,0.10983,0.374332,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [None]:
# test data
test_ft[test_ft.columns] = scaler.transform(test_ft)  # the test data must only be transformed, never fit
test_ft.head()

Unnamed: 0,나이,TSH,FreeT3,FreeT4,성별_UNK,성별_남,성별_여,티록신_복용_여부_아니오,티록신_복용_여부_예,항갑상선제_복용_여부_아니오,...,갑상선저하_인지_여부_아니오,갑상선저하_인지_여부_예,갑상선항진증_인지_여부_아니오,갑상선항진증_인지_여부_예,리튬_치료_여부_아니오,리튬_치료_여부_예,갑상선종_여부_아니오,갑상선종_여부_예,종양_여부_아니오,종양_여부_예
0,0.381443,0.010484,0.10983,0.31016,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,0.56701,9.5e-05,0.08078,0.251337,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0.731959,0.000201,0.103064,0.385027,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,0.360825,0.002045,0.10983,0.385027,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,0.164948,0.000625,0.10983,0.433155,0.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


- Target labels

In [None]:
target = train["target"]
target

0       0
1       1
2       0
3       0
4       0
       ..
4218    0
4219    0
4220    0
4221    0
4222    0
Name: target, Length: 4223, dtype: int64

# Hyperparameter Tuning

In [None]:
!pip install optuna

Collecting optuna
  Downloading optuna-3.6.1-py3-none-any.whl (380 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.1-py3-none-any.whl (233 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.5-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.5 alembic-1.13.1 colorlog-6.8.2 optuna-3.6.1


In [None]:
import optuna

from lightgbm import LGBMClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

In [None]:
class Objective:
    def __init__(self, x_train, y_train, seed):
        self.x_train = x_train  # features
        self.y_train = y_train  # target
        self.seed = seed  # random seed
        # CV splitter; uses the seed passed in rather than the global SEED
        self.cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    def __call__(self, trial):  # magic method that lets the instance be called like a function
        hp = {
            'n_estimators': trial.suggest_int('n_estimators', 80, 300),
            'num_leaves': trial.suggest_int('num_leaves', 2, 1024),
            'max_depth': trial.suggest_int('max_depth', 1, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.0001, 0.1),
            'class_weight': trial.suggest_categorical('class_weight', ['balanced', None]),
            'min_child_samples': trial.suggest_int('min_child_samples', 10, 50),
            # subsample only takes effect in LightGBM when subsample_freq > 0 (default 0)
            'subsample': trial.suggest_float('subsample', 0.7, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.7, 1.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
            'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 10.0),
        }

        model = LGBMClassifier(random_state=self.seed, verbosity=-1, **hp)

        # mean F1 over the 5 stratified folds is the value Optuna maximizes
        scores = cross_val_score(model, self.x_train, self.y_train, cv=self.cv, scoring='f1')
        return np.mean(scores)
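
The `Objective` class works as an objective function because `study.optimize` only needs something it can call with a trial; defining `__call__` makes an instance callable while `__init__` holds the extra state (data, seed, CV splitter). A minimal, library-free sketch of the same pattern, using a hypothetical `Multiplier` class:

```python
class Multiplier:
    """Toy callable: configuration is stored in __init__, work happens in __call__."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return self.factor * x

triple = Multiplier(3)
print(triple(5))         # the instance is invoked like a function
print(callable(triple))  # instances of classes defining __call__ are callable
```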

In [None]:
sampler = optuna.samplers.TPESampler(seed=SEED)

study = optuna.create_study(
    direction="maximize",
    sampler=sampler,
)
objective = Objective(train_ft, target, SEED)  # callable Objective instance used as the objective function
study.optimize(objective, n_trials=200)
print("Best Score:", study.best_value)  # best mean cross-validated F1
print("Best trial:", study.best_trial.params)  # hyperparameter combination of the best trial

[I 2024-06-24 11:03:42,628] A new study created in memory with name: no-name-2e1d4fa2-f1b6-43db-a0e2-09f8752f2515
[I 2024-06-24 11:03:44,342] Trial 0 finished with value: 0.8588353704523554 and parameters: {'n_estimators': 162, 'num_leaves': 974, 'max_depth': 8, 'learning_rate': 0.05990598257128396, 'class_weight': 'balanced', 'min_child_samples': 12, 'subsample': 0.9598528437324805, 'colsample_bytree': 0.8803345035229626, 'reg_alpha': 0.7080725777960455, 'reg_lambda': 0.20584494295802447}. Best is trial 0 with value: 0.8588353704523554.
[I 2024-06-24 11:03:46,710] Trial 1 finished with value: 0.8709075316260085 and parameters: {'n_estimators': 294, 'num_leaves': 853, 'max_depth': 3, 'learning_rate': 0.01826431422398935, 'class_weight': None, 'min_child_samples': 31, 'subsample': 0.8295835055926347, 'colsample_bytree': 0.7873687420594125, 'reg_alpha': 0.6118528947223795, 'reg_lambda': 1.3949386065204183}. Best is trial 1 with value: 0.8709075316260085.
[I 2024-06-24 11:03:48,232] Trial

Best Score: 0.8831775041492229
Best trial: {'n_estimators': 293, 'num_leaves': 331, 'max_depth': 4, 'learning_rate': 0.07137798627394543, 'class_weight': None, 'min_child_samples': 10, 'subsample': 0.8587153655510363, 'colsample_bytree': 0.9721729495558943, 'reg_alpha': 0.12597322665453978, 'reg_lambda': 5.351820797261611}
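
The scores above come from `StratifiedKFold`, which keeps the class ratio roughly the same in every fold; with only about 12% positives, this keeps the per-fold F1 from being distorted by folds that have very few positive cases. A sketch on toy labels (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumption): 10 positives out of 100
toy_y = np.array([1] * 10 + [0] * 90)
toy_X = np.zeros((len(toy_y), 1))  # feature values do not affect the split itself

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_ratios = [
    toy_y[valid_idx].mean()  # share of positives in each validation fold
    for _, valid_idx in toy_cv.split(toy_X, toy_y)
]
print(fold_ratios)
```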


# Model Training
- If the validation results from the previous step show meaningful performance, train on the full training set and then predict on the test data for evaluation.

In [None]:
model = LGBMClassifier(random_state=SEED,verbosity=-1 ,**study.best_trial.params)
model.fit(train_ft,target)

# Predict on the Test Data


In [None]:
pred = model.predict(test_ft)
pred

array([0, 0, 0, ..., 1, 0, 0])

# Create the Submission File for Evaluation
- Load sample_submission.csv, write the predictions into its target column, save the result as a CSV file, and submit it on the competition page.

In [None]:
submit = pd.read_csv(f"{DATA_PATH}sample_submission.csv")
submit

In [None]:
submit["target"] = pred
submit

- Save the predictions to a CSV file and submit it on the competition page to check the score


In [None]:
# submit.to_csv("submit_임태균_0625_04.csv", index=False)  # index=False keeps the row index out of the file
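
As a small sketch of the submission step, the example below builds a toy submission frame, checks that the number of predictions matches the number of rows before assigning, and renders the CSV as text. The `test_0`-style IDs and the frame contents are assumptions; the real layout comes from sample_submission.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumption) for sample_submission.csv and the model predictions
toy_submit = pd.DataFrame({"ID": ["test_0", "test_1", "test_2"],
                           "target": [0, 0, 0]})
toy_pred = np.array([0, 1, 0])

# Guard against a silent mismatch between predictions and submission rows
assert len(toy_pred) == len(toy_submit)
toy_submit["target"] = toy_pred

csv_text = toy_submit.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```

Passing a file path as the first argument of `to_csv` writes the same content to disk instead of returning it.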