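## Structuring Code for a Machine Learning Project

- This notebook builds template code to reuse across ML projects.

1. **Load the required libraries and data.**


2. **Perform EDA.** The EDA should be driven by the problem you need to solve.


3. **Preprocess the data.** The key question here is how to do **feature engineering**.


4. **Split the data.** Check that there is no distribution shift between the train data and the test data.


5. **Train a model.** Decide which model to use. GBM-family models, which usually perform well on tabular data, are a good default.


6. **Tune the hyper-parameters.** Continue until the target performance is reached. Use the validation step to keep checking that the model **does not overfit**.


7. **Run the final test.** Create a submission file in the competition's format and check the score.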

## 1. Loading Libraries and Data

In [None]:
# Download the data through the Kaggle API.
import os

os.environ['KAGGLE_USERNAME'] = 'emphymachine'
# Never publish a real API key; replace the placeholder with your own key.
os.environ['KAGGLE_KEY'] = '<your-kaggle-api-key>'

In [None]:
!kaggle competitions download -c spaceship-titanic

Downloading spaceship-titanic.zip to /content
100% 299k/299k [00:00<00:00, 868kB/s]
100% 299k/299k [00:00<00:00, 867kB/s]


In [None]:
!unzip spaceship-titanic.zip

Archive:  spaceship-titanic.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [None]:
# List every library that needs installing here, i.e. anything that does not ship with Anaconda by default.
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.2.0-py3-none-any.whl (390 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.11.1-py3-none-any.whl (224 kB)
Collecting cmaes>=0.9.1 (from optuna)
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully 

In [None]:
# The four standard data-analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Models and evaluation
# (For tabular data I usually just run these models as a first pass. RF in particular is convenient for quick tests.)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# XGBoost and LightGBM (scikit-learn API)
from xgboost.sklearn import XGBClassifier, XGBRegressor
from lightgbm.sklearn import LGBMClassifier, LGBMRegressor

# Correlation analysis; VIF is used to detect and remove multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

# KFold (CV) and partial, both needed for the optuna objective
from sklearn.model_selection import KFold
from functools import partial

# optuna, the library used for hyper-parameter tuning
import optuna

In [None]:
# Load the data.
train = pd.read_csv("/content/train.csv")
test = pd.read_csv('/content/test.csv')

## 2. EDA

- Check the basic facts that need to be found in the data.


- Check class imbalance, the target distribution, outliers, and correlations.

In [None]:
## On your Own
train.Transported.value_counts()

True     4378
False    4315
Name: Transported, dtype: int64

Continue in this way, drawing various plots and summaries, to build insight into the data.
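The four checks listed for this section (class imbalance, target distribution, outliers, correlation) could be sketched roughly as below. This is a minimal illustration on a hypothetical toy frame, not the notebook's actual EDA; the column names only mirror the dataset, and the 1.5 x IQR rule is one common outlier heuristic among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; values are synthetic.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(30, 10, 200),
    "Spa": rng.exponential(300, 200),
    "Transported": rng.integers(0, 2, 200).astype(bool),
})
df.loc[::25, "Age"] = np.nan  # inject a few missing values

# 1) Class balance of the target
class_ratio = df["Transported"].value_counts(normalize=True)

# 2) Share of missing values per column
missing_ratio = df.isnull().mean()

# 3) Outliers flagged with the 1.5 * IQR rule
q1, q3 = df["Spa"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["Spa"] < q1 - 1.5 * iqr) | (df["Spa"] > q3 + 1.5 * iqr)

# 4) Pairwise correlation, with the target cast to 0/1
corr = df.assign(Transported=df["Transported"].astype(int)).corr()

print(class_ratio, missing_ratio, outlier_mask.sum(), corr, sep="\n\n")
```

On the real data, the same expressions would be applied to `train` directly, and `sns.heatmap(corr)` is a convenient way to view the correlation matrix.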

### 3. Preprocessing

#### Handling Missing Values

In [None]:
# Handle the columns that contain missing values.
# Encode object-dtype columns as numbers.
from sklearn.impute import KNNImputer

drop_cols = ["PassengerId", "Cabin", "Name"]
numeric_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']

# For convenience, drop the columns that are not used in the analysis.
train = train.drop(columns=drop_cols)

# categorical feature mapping
## 1. LabelEncoder() -> Ordinal Encoding
## 2. get_dummies() / OneHotEncoder() -> One-Hot Encoding
## 3. Build the mapping by hand. (used here)

cat_map = {}
for cat_col in cat_cols:
    _map = {}
    # unique() keeps order of appearance, so NaN is not guaranteed to be last; drop it explicitly.
    for i, col in enumerate(train[cat_col].dropna().unique()):
        _map[col] = i  # {"Europa" : 0, "Earth" : 1, "Mars" : 2}

    # Apply the ordinal encoding. (NaN matches no key in the mapping, so it stays NaN.)
    train[cat_col] = train[cat_col].map(_map)
    cat_map[cat_col] = _map  # {"HomePlanet" : {"Europa" : 0, "Earth" : 1, "Mars" : 2}}

imp = KNNImputer(n_neighbors=5)
data = imp.fit_transform(train[numeric_cols]) # KNN Imputation for numeric features
# The imputation result is an np.array, so convert it back to a DataFrame.
_train = pd.DataFrame(data=data, columns=numeric_cols, index=train.index)

# Overwrite the matching columns of the train data with the imputed ones.
for num_col in numeric_cols:
    train[num_col] = _train[num_col] # overwrite with imputed column.

train

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,0.0,0.0,0.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,False
1,1.0,0.0,0.0,24.0,0.0,109.0,9.0,25.0,549.0,44.0,True
2,0.0,0.0,0.0,58.0,1.0,43.0,3576.0,0.0,6715.0,49.0,False
3,0.0,0.0,0.0,33.0,0.0,0.0,1283.0,371.0,3329.0,193.0,False
4,1.0,0.0,0.0,16.0,0.0,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...
8688,0.0,0.0,2.0,41.0,1.0,0.0,6819.0,0.0,1643.0,74.0,False
8689,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,False
8690,1.0,0.0,0.0,26.0,0.0,0.0,0.0,1872.0,1.0,0.0,True
8691,0.0,0.0,2.0,32.0,0.0,0.0,1049.0,0.0,353.0,3235.0,False
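The cell above lists three encoding options in its comments and uses a hand-built mapping. As a rough comparison, the sketch below applies an ordinal mapping and one-hot encoding to a hypothetical toy column (the values mirror `HomePlanet`; the series itself is made up for illustration).

```python
import pandas as pd

# Hypothetical toy column with one missing entry.
s = pd.Series(["Europa", "Earth", "Mars", None, "Earth"], name="HomePlanet")

# Ordinal encoding through an explicit mapping: NaN stays NaN for later imputation.
mapping = {cat: i for i, cat in enumerate(s.dropna().unique())}
ordinal = s.map(mapping)

# One-hot encoding: by default a missing value becomes an all-zero row.
one_hot = pd.get_dummies(s, prefix="HomePlanet")

print(mapping)
print(ordinal.tolist())
print(one_hot.columns.tolist())
```

An ordinal code imposes an arbitrary order on the categories, which tree-based models usually tolerate; one-hot encoding avoids that order at the cost of extra columns.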


In [None]:
#train.mode()  # pd.DataFrame
#train.mean() # pd.Series
train.mode()["HomePlanet"].values[0]

1.0

In [None]:
# Imputation for cat features

for cat_col in cat_cols:
    # Fill missing values of each categorical feature with its mode.
    train[cat_col] = train[cat_col].fillna(train.mode()[cat_col].values[0])

train[train.isnull().any(axis=1)]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
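Note that the imputers above are fitted on the whole training frame before the validation split in the next section, so validation rows influence the imputed values. One possible way to keep the "fit on train, apply elsewhere" step in a single object is scikit-learn's `ColumnTransformer`, sketched here on hypothetical toy frames (the shortened column lists and the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

numeric_cols = ["Age", "Spa"]
cat_cols = ["HomePlanet", "VIP"]  # already ordinal-encoded, NaN kept

# Hypothetical toy frames standing in for the encoded train / test data.
train_df = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0, 16.0, 44.0],
    "Spa": [0.0, 549.0, 6715.0, np.nan, 565.0, 291.0],
    "HomePlanet": [0.0, 1.0, 0.0, np.nan, 1.0, 1.0],
    "VIP": [0.0, 0.0, 1.0, 0.0, np.nan, 0.0],
})
test_df = pd.DataFrame({
    "Age": [27.0, np.nan],
    "Spa": [np.nan, 2823.0],
    "HomePlanet": [np.nan, 0.0],
    "VIP": [0.0, np.nan],
})

imputer = ColumnTransformer([
    ("num", KNNImputer(n_neighbors=2), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# Fit on train only, then reuse the fitted imputers on test.
out_cols = numeric_cols + cat_cols
train_imp = pd.DataFrame(imputer.fit_transform(train_df), columns=out_cols)
test_imp = pd.DataFrame(imputer.transform(test_df), columns=out_cols)
print(test_imp)
```

The same transformer could be placed in a `Pipeline` in front of the model so that each cross-validation fold refits the imputation on its own training part.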


In [None]:
# Leaving the target as True / False also works.
train.Transported = train.Transported.astype('int') # 0 / 1
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8693 non-null   float64
 1   CryoSleep     8693 non-null   float64
 2   Destination   8693 non-null   float64
 3   Age           8693 non-null   float64
 4   VIP           8693 non-null   float64
 5   RoomService   8693 non-null   float64
 6   FoodCourt     8693 non-null   float64
 7   ShoppingMall  8693 non-null   float64
 8   Spa           8693 non-null   float64
 9   VRDeck        8693 non-null   float64
 10  Transported   8693 non-null   int64  
dtypes: float64(10), int64(1)
memory usage: 747.2 KB
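`variance_inflation_factor` is imported in section 1 for multicollinearity checks but is not used in this notebook. As a rough idea of what such a check looks like once every column is numeric, the sketch below computes VIFs with NumPy only, using the identity that the VIFs are the diagonal of the inverse correlation matrix. The helper `vif_table` and the toy data are hypothetical; the statsmodels function would be the library route and should give comparable values when an intercept column is included.

```python
import numpy as np
import pandas as pd

def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF").sort_values(ascending=False)

# Hypothetical toy features: `c` is almost a linear combination of `a` and `b`.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + rng.normal(scale=0.1, size=500),
    "d": rng.normal(size=500),
})

vif = vif_table(toy)
print(vif)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of strong multicollinearity; on the real data the call would be `vif_table(train.drop(columns="Transported"))`.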


### 4. Train/Validation Split

In [None]:
# This hold-out split is for a first quick test; the actual tuning uses K-Fold CV.
from sklearn.model_selection import train_test_split

X = train.drop(columns="Transported")
y = train.Transported

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

(6954, 10) (6954,) (1739, 10) (1739,)
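The overview asks to check that the train and test data do not differ in distribution, but the notebook does not show how. One possible check for numeric columns is a two-sample Kolmogorov-Smirnov test per column, sketched below with `scipy.stats.ks_2samp`. The helper `compare_distributions` and the toy frames are hypothetical, and the KS test is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_distributions(a: pd.DataFrame, b: pd.DataFrame, cols) -> pd.DataFrame:
    """Two-sample KS statistic and p-value for each numeric column."""
    rows = []
    for col in cols:
        result = ks_2samp(a[col].dropna(), b[col].dropna())
        rows.append({"column": col, "ks_stat": result.statistic, "p_value": result.pvalue})
    return pd.DataFrame(rows).set_index("column")

# Hypothetical toy data: `Age` matches across the two parts, `Spa` is shifted on purpose.
rng = np.random.default_rng(42)
part_a = pd.DataFrame({"Age": rng.normal(30, 10, 1000), "Spa": rng.exponential(300, 1000)})
part_b = pd.DataFrame({"Age": rng.normal(30, 10, 1000), "Spa": rng.exponential(600, 1000)})

report = compare_distributions(part_a, part_b, ["Age", "Spa"])
print(report)
```

A large statistic with a small p-value (here for `Spa`) suggests the column is distributed differently in the two parts. The same helper could compare `X_train` with `X_val`, or the preprocessed train and test frames.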


### 5. Training and Evaluation

In [None]:
# (XGBoost VS LightGBM) VS Random Forest
## Rough rule of thumb: above about 10,000 rows LightGBM tends to be more efficient; below that, XGBoost.
## Very small data (< 1,000 rows), or little hyper-parameter tuning planned: Random Forest.
## Otherwise, XGBoost.
model = XGBClassifier()

In [None]:
print("\nFitting XGBoost...")
model.fit(X_train, y_train)


Fitting XGBoost...


In [None]:
# Change the metric to match the task at hand.
from sklearn.metrics import accuracy_score
evaluation_metric = accuracy_score

In [None]:
print("Prediction")
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)

train_score = evaluation_metric(y_train, pred_train)
val_score = evaluation_metric(y_val, pred_val)

print("Train Score : %.4f" % train_score)
print("Validation Score : %.4f" % val_score)

Prediction
Train Score : 0.8798
Validation Score : 0.7780
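The gap between the train score (0.8798) and the validation score (0.7780) above comes from a single hold-out split. To get a steadier estimate of that gap, the scores can be averaged over several folds. The sketch below uses scikit-learn's `cross_validate` with a random forest on synthetic data; the data, the model settings, and the names `X_toy` / `y_toy` are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical synthetic data standing in for (X, y).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
result = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_toy, y_toy, cv=cv, scoring="accuracy", return_train_score=True,
)

train_mean = result["train_score"].mean()
val_mean = result["test_score"].mean()
print("Train ACC : %.4f" % train_mean)
print("CV ACC    : %.4f (+/- %.4f)" % (val_mean, result["test_score"].std()))
print("Gap       : %.4f" % (train_mean - val_mean))
```

A persistently large gap across folds is a stronger overfitting signal than a gap seen on one split.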


### 6. Hyper-parameter Tuning

> GridSearchCV

**LightGBM hyperparameters**

[Official Documentation] https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

[Blog 1] https://smecsm.tistory.com/133

[Blog 2] https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5

[Blog 3] https://nurilee.com/2020/04/03/lightgbm-definition-parameter-tuning/

In [None]:
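# Use GridSearchCV to look for the best-performing model. (Optional on a first pass.)
# LightGBM is sensitive to hyper-parameters, so I usually run one light grid search at the very start.
# If the gain is small, treat the result as roughly what this kind of model achieves on the data.
# If the gain is large, this data is likely to benefit a lot from more thorough hyper-parameter tuning.
# Note: the estimator searched here is the XGBClassifier defined above.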
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth" : [3, 4, 5],
    "n_estimators" : [100, 300],
    "learning_rate" : [0.1, 0.01],
    "colsample_bynode" : [0.7, 0.8],
    "tree_method" : ['gpu_hist'],  # needs a GPU; XGBoost 2.0+ deprecates this in favor of tree_method='hist' with device='cuda'
    'random_state' : [42]
} # 3 x 2 x 2 x 2 = 24

gcv = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                  n_jobs=-1, verbose=2)

gcv.fit(X_train, y_train)
print("Best Estimator : ", gcv.best_estimator_)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best Estimator :  XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=0.7,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=42, ...)


In [None]:
print("Prediction with Best Estimator")
gcv_pred_train = gcv.predict(X_train)
gcv_pred_val = gcv.predict(X_val)

gcv_train_score = evaluation_metric(y_train, gcv_pred_train)
gcv_val_score = evaluation_metric(y_val, gcv_pred_val)

print("Train ACC Score : %.4f" % gcv_train_score)
print("Validation ACC Score : %.4f" % gcv_val_score)

Prediction with Best Estimator
Train ACC Score : 0.8164
Validation ACC Score : 0.7821
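Beyond `best_estimator_`, a fitted `GridSearchCV` keeps the score of every candidate in `cv_results_`, which helps to see whether the best setting is clearly ahead or within noise of the others. A minimal sketch on synthetic data follows; the small grid, the random forest, and the `toy_` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for (X_train, y_train).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_grid = {"max_depth": [3, 5], "n_estimators": [20, 50]}  # 2 x 2 = 4 candidates
toy_gcv = GridSearchCV(
    RandomForestClassifier(random_state=0), toy_grid, cv=3, scoring="accuracy"
)
toy_gcv.fit(X_toy, y_toy)

# One row per candidate, best first.
results = (
    pd.DataFrame(toy_gcv.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(toy_gcv.best_params_)
print(results)
```

If the top few rows differ by less than their `std_test_score`, the grid has probably not found a meaningfully better setting.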


> Let's try optuna!

In [None]:
def optimizer(trial, X, y, K):
    # Define the search space for the hyper-parameters to tune.
    n_estimators = trial.suggest_int('n_estimators', 100, 500)
    max_depth = trial.suggest_int('max_depth', 4, 8)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3)
    colsample_bynode = trial.suggest_categorical('colsample_bynode', [0.5, 0.7])
    reg_lambda = trial.suggest_categorical('reg_lambda', [0.1, 1, 2])
    gamma = trial.suggest_categorical('gamma', [1, 2, 5, 10])


    # Specify the model. Optuna runs take a long time, so I usually test with RF first and then switch to LGBM. (XGBoost is used here.)
    model = XGBClassifier(n_estimators=n_estimators,
                          max_depth=max_depth,
                          learning_rate=learning_rate,
                          colsample_bynode=colsample_bynode,
                          reg_lambda=reg_lambda,
                          gamma=gamma,
                          tree_method='gpu_hist',
                          random_state=42,
                          sampling_method='gradient_based')


    # Implement K-Fold cross validation.
    folds = KFold(n_splits=K)
    scores = []

    for train_idx, val_idx in folds.split(X, y):
        X_train = X.iloc[train_idx, :]
        y_train = y.iloc[train_idx]

        X_val = X.iloc[val_idx, :]
        y_val = y.iloc[val_idx]

        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        score = evaluation_metric(y_val, preds)
        scores.append(score)


    # Return the mean validation score (accuracy here) across the K folds.
    return np.mean(scores)

In [None]:
K = 5 # number of folds
opt_func = partial(optimizer, X=X_train, y=y_train, K=K)

study = optuna.create_study(direction="maximize") # whether to minimize or maximize the objective
study.optimize(opt_func, n_trials=30)

[I 2023-06-13 08:51:41,993] A new study created in memory with name: no-name-225f5483-8de0-4181-90f1-225f2a3788cf
[I 2023-06-13 08:51:43,310] Trial 0 finished with value: 0.7981030157900998 and parameters: {'n_estimators': 396, 'max_depth': 7, 'learning_rate': 0.14266740936797134, 'colsample_bynode': 0.5, 'reg_lambda': 2, 'gamma': 2}. Best is trial 0 with value: 0.7981030157900998.
[I 2023-06-13 08:51:45,183] Trial 1 finished with value: 0.795514846210738 and parameters: {'n_estimators': 216, 'max_depth': 8, 'learning_rate': 0.03380984611916527, 'colsample_bynode': 0.7, 'reg_lambda': 1, 'gamma': 10}. Best is trial 0 with value: 0.7981030157900998.
[I 2023-06-13 08:51:46,271] Trial 2 finished with value: 0.7968087758405785 and parameters: {'n_estimators': 119, 'max_depth': 8, 'learning_rate': 0.27469990355431745, 'colsample_bynode': 0.7, 'reg_lambda': 2, 'gamma': 5}. Best is trial 0 with value: 0.7981030157900998.
[I 2023-06-13 08:51:48,480] Trial 3 finished with value: 0.79839088901416

In [None]:
# Data for every trial that optuna has run
study.trials_dataframe()

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_colsample_bynode,params_gamma,params_learning_rate,params_max_depth,params_n_estimators,params_reg_lambda,state
0,0,0.798103,2023-06-13 08:51:41.997992,2023-06-13 08:51:43.310297,0 days 00:00:01.312305,0.5,2,0.142667,7,396,2.0,COMPLETE
1,1,0.795515,2023-06-13 08:51:43.315955,2023-06-13 08:51:45.182627,0 days 00:00:01.866672,0.7,10,0.03381,8,216,1.0,COMPLETE
2,2,0.796809,2023-06-13 08:51:45.190627,2023-06-13 08:51:46.271132,0 days 00:00:01.080505,0.7,5,0.2747,8,119,2.0,COMPLETE
3,3,0.798391,2023-06-13 08:51:46.273083,2023-06-13 08:51:48.480430,0 days 00:00:02.207347,0.7,1,0.031004,4,408,0.1,COMPLETE
4,4,0.797528,2023-06-13 08:51:48.482877,2023-06-13 08:51:49.032056,0 days 00:00:00.549179,0.5,2,0.175418,6,100,2.0,COMPLETE
5,5,0.795371,2023-06-13 08:51:49.033935,2023-06-13 08:51:50.233909,0 days 00:00:01.199974,0.7,5,0.263275,6,446,2.0,COMPLETE
6,6,0.797958,2023-06-13 08:51:50.239680,2023-06-13 08:51:51.038213,0 days 00:00:00.798533,0.7,1,0.286218,8,177,2.0,COMPLETE
7,7,0.794652,2023-06-13 08:51:51.040267,2023-06-13 08:51:52.151096,0 days 00:00:01.110829,0.5,10,0.296232,6,368,1.0,COMPLETE
8,8,0.795371,2023-06-13 08:51:52.153052,2023-06-13 08:51:53.523984,0 days 00:00:01.370932,0.7,10,0.147029,6,435,1.0,COMPLETE
9,9,0.796521,2023-06-13 08:51:53.526959,2023-06-13 08:51:55.166180,0 days 00:00:01.639221,0.7,2,0.199999,8,489,1.0,COMPLETE


In [None]:
print("Best Score: %.4f" % study.best_value) # print the best score
print("Best params: ", study.best_trial.params) # hyper-parameters of the best trial

Best Score: 0.8000
Best params:  {'n_estimators': 292, 'max_depth': 6, 'learning_rate': 0.13695327999193407, 'colsample_bynode': 0.5, 'reg_lambda': 1, 'gamma': 2}


In [None]:
# Visualize the optimization history
optuna.visualization.plot_optimization_history(study)

In [None]:
# Importance of each hyper-parameter
optuna.visualization.plot_param_importances(study)

In [None]:
print("Validation ACC")
best_params = study.best_params
best_model = XGBClassifier(**best_params)
best_model.fit(X_train, y_train)
print("Validation Score : %.3f" % evaluation_metric(y_val, best_model.predict(X_val)))

Validation ACC
Validation Score : 0.787


### 7. Testing and Creating the Submission File

In [None]:
test2 = test.copy() # temporary backup

In [None]:
test = test2.copy() # restore from the backup

In [None]:
## Build X_test

# For convenience, drop the columns that are not used in the analysis.
test = test.drop(columns=drop_cols)

for cat_col in cat_cols:
    test[cat_col] = test[cat_col].map(cat_map[cat_col])

# Transform with the imputer that was fitted on the train data.
data = imp.transform(test[numeric_cols]) # KNN Imputation for numeric features
_test = pd.DataFrame(data=data, columns=numeric_cols)

for num_col in numeric_cols:
    test[num_col] = _test[num_col] # overwrite with imputed column.

test

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,1.0,1.0,0.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,19.0,0.0,0.0,9.0,0.0,2823.0,0.0
2,0.0,1.0,2.0,31.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,38.0,0.0,0.0,6652.0,0.0,181.0,585.0
4,1.0,0.0,0.0,20.0,0.0,10.0,0.0,635.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
4272,1.0,1.0,0.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0
4273,1.0,0.0,0.0,42.0,0.0,0.0,847.0,17.0,10.0,144.0
4274,2.0,1.0,2.0,34.2,0.0,0.0,0.0,0.0,0.0,0.0
4275,0.0,0.0,,31.8,0.0,0.0,2680.0,0.0,0.0,523.0


In [None]:
# Imputation for cat features

for cat_col in cat_cols:
    test[cat_col] = test[cat_col].fillna(train.mode()[cat_col].values[0])

test[test.isnull().any(axis=1)]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck


In [None]:
best_params = study.best_params

best_model = XGBClassifier(**best_params)
best_model.fit(X, y)

X_test = test[X.columns]  # same columns, in the same order, as the training features

preds = best_model.predict(X_test)
preds

array([1, 0, 1, ..., 1, 1, 1])

In [None]:
submission = pd.read_csv('./sample_submission.csv')
submission

Unnamed: 0,PassengerId,Transported
0,0013_01,False
1,0018_01,False
2,0019_01,False
3,0021_01,False
4,0023_01,False
...,...,...
4272,9266_02,False
4273,9269_01,False
4274,9271_01,False
4275,9273_01,False


In [None]:
submission['Transported'] = preds

# Convert the 0 / 1 predictions back to the original True / False labels
pred_map = {0 : False, 1 : True}
submission.Transported = submission.Transported.map(pred_map)

In [None]:
submission.to_csv("submission.csv", index=False)
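Before uploading, it can be worth checking that the file matches the sample submission's layout. The sketch below shows one possible set of checks on hypothetical toy frames; the helper `check_submission` is not part of any library, and the exact requirements should be taken from the competition page.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise AssertionError if `sub` does not match the sample file's layout."""
    assert list(sub.columns) == list(sample.columns), "column names or order differ"
    assert len(sub) == len(sample), "row count differs"
    assert sub["PassengerId"].equals(sample["PassengerId"]), "PassengerId mismatch"
    assert sub["Transported"].notna().all(), "missing prediction"
    assert sub["Transported"].isin([True, False]).all(), "unmapped prediction"

# Hypothetical toy frames standing in for sample_submission.csv and the predictions.
sample = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": False,
})
toy_preds = [1, 0, 1]

sub = sample.copy()
sub["Transported"] = pd.Series(toy_preds).astype(bool)  # 0 / 1 -> False / True
check_submission(sub, sample)
print(sub)
```

With the notebook's variables, the call would be `check_submission(submission, pd.read_csv('./sample_submission.csv'))` just before `to_csv`.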