<a href="https://github.com/Bae-ChangHyun"><img src="https://img.shields.io/badge/Github-000000?style=flat&logo=github&logoColor=ffffff&labelColor=000000&link=https%3A%2F%2Fgithub.com%2FBae-ChangHyun"/></a> <br>
<a href="https://changsroad.tistory.com/"><img src="https://img.shields.io/badge/Tistory-f44336?style=flat&logo=tistory&logoColor=ffffff&link=https%3A%2F%2Fchangsroad.tistory.com%2F"/></a> <br>
<a href="mailto:matthew624@naver.com"><img src="https://img.shields.io/badge/Naver-03C75A?style=flat&logo=naver&logoColor=ffffff&link=mailto%3Amatthew624%40naver.com"/></a><br>
<a href="https://www.kaggle.com/competitions/spaceship-titanic" target="_blank"><img align="left" alt="Kaggle" title="Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

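## Structuring Code for a Machine Learning Project

- This notebook builds a template to reuse across ML projects.

1. **Load the required libraries and data.**


2. **Run EDA.** EDA should be driven by the problem you are trying to solve.


3. **Preprocess the data.** What matters most here is how you do **feature engineering**.


4. **Split the data.** Check that the train and test data do not differ in distribution.


5. **Train a model.** Decide which model to use. A GBM (gradient boosting machine) is a good default because it usually performs well on tabular data.


6. **Tune the hyper-parameters.** Continue until you reach the target performance. Use validation throughout and **take care not to overfit**.


7. **Run the final test.** Create a submission file in the competition's format and check the score.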

## 1. Loading Libraries and Data

In [1]:
# The four standard data-analysis libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import os
import random

# Models and evaluation metrics
# (For tabular data I usually run a couple of baseline models first; RF in particular is convenient for quick tests.)
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# KFold (CV) and partial: intended for use with optuna
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from functools import partial
# Libraries for hyper-parameter tuning
from sklearn.model_selection import GridSearchCV
import optuna
from optuna.samplers import TPESampler

# Correlation analysis; VIF for detecting multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
# Load the data.
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

## 2. EDA

- Check the basic facts you need to know about the data.


- Check class imbalance, target distribution, outliers, and correlation.
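
The cells below cover missing values and class balance. As a rough illustration of the outlier and correlation checks, here is a minimal sketch on a small made-up frame. The values, the helper name `iqr_outlier_mask`, and the 1.5×IQR rule are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for a few numeric columns (values are made up).
df = pd.DataFrame({
    "RoomService": [0.0, 10.0, 20.0, 30.0, 5000.0],
    "Spa":         [0.0, 12.0, 18.0, 33.0, 4000.0],
    "Age":         [20.0, 30.0, 40.0, 50.0, 60.0],
})

def iqr_outlier_mask(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)

outlier_counts = df.apply(iqr_outlier_mask).sum()  # number of flagged values per column
corr = df.corr()                                   # Pearson correlation matrix

print(outlier_counts)
print(corr.round(2))
```

On the real data the same two calls could be applied to `train[num_cols]`; heavily skewed spending columns will probably flag many rows, so the threshold is a judgment call.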

In [3]:

# 1. Missing-value check --> does any particular column contain many missing values?
null_row=train.isnull().any(axis=1) # boolean mask
train[null_row] # rows that contain at least one missing value
# 2. Check which columns have dtype object (str)
cat_cols=train.columns[train.dtypes == 'object'] # categorical variables
num_cols=train.columns[~(train.dtypes == 'object')] # numeric variables
# 3. Class-imbalance check
train.Transported.value_counts()
train.HomePlanet.value_counts()
train.Destination.value_counts()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,,Candra Jacostaffey,True
10,0008_02,Europa,True,B/1/P,TRAPPIST-1e,34.0,False,0.0,0.0,,0.0,0.0,Altardr Flatic,True
15,0012_01,Earth,False,,TRAPPIST-1e,31.0,False,32.0,0.0,876.0,0.0,0.0,Justie Pooles,False
16,0014_01,Mars,False,F/3/P,55 Cancri e,27.0,False,1286.0,122.0,,0.0,0.0,Flats Eccle,False
23,0020_03,Earth,True,E/0/S,55 Cancri e,29.0,False,0.0,0.0,,0.0,0.0,Mollen Mcfaddennon,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8667,9250_01,Europa,False,E/597/P,TRAPPIST-1e,29.0,False,0.0,2972.0,,28.0,188.0,Chain Reedectied,True
8674,9257_01,,False,F/1892/P,TRAPPIST-1e,13.0,False,39.0,0.0,1085.0,24.0,0.0,Ties Apple,False
8675,9259_01,Earth,,F/1893/P,TRAPPIST-1e,44.0,False,1030.0,1015.0,0.0,11.0,,Annah Gilleyons,True
8684,9274_01,,True,G/1508/P,TRAPPIST-1e,23.0,False,0.0,0.0,0.0,0.0,0.0,Chelsa Bullisey,True


Transported
True     4378
False    4315
Name: count, dtype: int64

HomePlanet
Earth     4602
Europa    2131
Mars      1759
Name: count, dtype: int64

Destination
TRAPPIST-1e      5915
55 Cancri e      1800
PSO J318.5-22     796
Name: count, dtype: int64
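
The cell above lists the rows that contain a missing value, but the question in its first comment is about columns. A per-column missing ratio answers that more directly. This is a minimal sketch on a made-up frame; `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (values are made up).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age":        [28.0, np.nan, np.nan, 31.0],
    "VIP":        [False, False, True, False],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

With the real data, `train.isnull().mean()` gives the same summary.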

In [4]:
train
# PassengerId : gggg_pp = group number (4 digits; usually a family, but not always) + 2-digit number within the group
# HomePlanet, Destination : categorical features
# Cabin : deck/num/side, where side is P (Port) or S (Starboard)
# CryoSleep, VIP : bool
# Age : passenger age
# RoomService, FoodCourt, ShoppingMall, Spa, VRDeck : amount billed at each amenity
# Name : passenger name
# Transported (y) : True / False

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False
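
Given the `PassengerId` and `Cabin` formats described in the comments of the cell above, both columns can be split into separate features instead of being dropped. This is a minimal sketch on made-up rows; the new column names (`group`, `member`, `deck`, `cabin_num`, `side`, `group_size`) are hypothetical.

```python
import pandas as pd

# Toy rows in the same format as the real columns (values are made up).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin":       ["B/0/P", "A/0/S", None],
})

# PassengerId: group number, then the passenger's number within the group.
toy[["group", "member"]] = toy["PassengerId"].str.split("_", expand=True)
# Cabin: deck/num/side; a missing cabin stays missing in all three parts.
toy[["deck", "cabin_num", "side"]] = toy["Cabin"].str.split("/", expand=True)
# Group size: number of passengers sharing the same group number.
toy["group_size"] = toy.groupby("group")["PassengerId"].transform("count")

print(toy)
```

Whether `deck` or `side` actually helps the model would need to be checked with a pivot table or a validation score, as done for `VIP` below.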


In [5]:
# Distribution of Transported by VIP
pd.pivot_table(train,values='Transported' ,index='VIP',aggfunc=['mean'])
train.groupby(['VIP','Transported']).size()

# Distribution of Transported by HomePlanet and VIP
pd.pivot_table(train,values='Transported' ,index=['HomePlanet','VIP'],aggfunc=['mean'])

VIP,mean(Transported)
False,0.506332
True,0.38191


VIP    Transported
False  False          4093
       True           4198
True   False           123
       True             76
dtype: int64

HomePlanet,VIP,mean(Transported)
Earth,False,0.424337
Europa,False,0.670072
Europa,True,0.48855
Mars,False,0.53418
Mars,True,0.15873


## 3. Preprocessing

### Feature Engineering and Missing-Value Handling

In [6]:
## TO-DO ##
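## Extract the group from PassengerId and create a feature in_large_group: 1 for passengers whose group_size is 4 or more, 0 otherwise
group = train.PassengerId.apply(lambda x:x[:4])
temp = group.value_counts()
large_group_num = temp[temp >= 4].index
# isin returns a boolean mask; multiplying by 1 converts it to 0/1
train['in_large_group'] = group.isin(large_group_num) * 1
# At first glance this did not look meaningful, although the pivot shows a gap in transported rate (0.49 vs. 0.59)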
pd.pivot_table(train,values='Transported' ,index=['in_large_group'],aggfunc=['mean'])
train

in_large_group,mean(Transported)
0,0.490742
1,0.58516


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,in_large_group
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False,0
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False,0
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True,0
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False,0


In [7]:
train['HomePlanet'].value_counts()

HomePlanet
Earth     4602
Europa    2131
Mars      1759
Name: count, dtype: int64

In [8]:
# Columns with missing values
# Ways to handle missing values
# 1) Fill cat_cols with the mode and num_cols with the mean
train[cat_cols] = train[cat_cols].apply(lambda col: col.fillna(col.mode()[0]), axis=0) # mode() returns a Series, so take its first element with [0]
train[num_cols] = train[num_cols].apply(lambda col: col.fillna(col.mean()), axis=0)
train.info()
# 2)
# 3)

# categorical feature encoding
train = pd.get_dummies(data=train, columns=['HomePlanet','Destination'])
drop_cols=['PassengerId', 'Cabin', 'Name']
train.drop(drop_cols, axis=1, inplace=True)
train

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PassengerId     8693 non-null   object 
 1   HomePlanet      8693 non-null   object 
 2   CryoSleep       8693 non-null   bool   
 3   Cabin           8693 non-null   object 
 4   Destination     8693 non-null   object 
 5   Age             8693 non-null   float64
 6   VIP             8693 non-null   bool   
 7   RoomService     8693 non-null   float64
 8   FoodCourt       8693 non-null   float64
 9   ShoppingMall    8693 non-null   float64
 10  Spa             8693 non-null   float64
 11  VRDeck          8693 non-null   float64
 12  Name            8693 non-null   object 
 13  Transported     8693 non-null   bool   
 14  in_large_group  8693 non-null   int32  
dtypes: bool(3), float64(6), int32(1), object(5)
memory usage: 806.6+ KB


Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,in_large_group,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,0,False,True,False,False,False,True
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,0,True,False,False,False,False,True
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,0,False,True,False,False,False,True
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,0,False,True,False,False,False,True
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,0,True,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,0,False,True,False,True,False,False
8689,True,18.0,False,0.0,0.0,0.0,0.0,0.0,False,0,True,False,False,False,True,False
8690,False,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,0,True,False,False,False,False,True
8691,False,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,0,False,True,False,True,False,False


## 4. Splitting the Training Data

In [9]:
# Used for a first quick test; for the actual training, use K-fold CV.
from sklearn.model_selection import train_test_split

X =  train.drop(['Transported'],axis=1)
y =  train['Transported']

X_train, X_val, y_train, y_val = train_test_split(X,y,train_size=0.8,random_state=624)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

(6954, 15) (6954,) (1739, 15) (1739,)
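
The overview asks to check that the split parts do not differ in distribution. One cheap way to look at this is to compare the target rate and the feature means of the two parts, and to pass `stratify=y` so the target rate is preserved. This is a minimal sketch on synthetic data; `X_toy`, `y_toy`, and the 30% positive rate are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (values are random, not the competition data).
rng = np.random.default_rng(624)
X_toy = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
y_toy = pd.Series(rng.random(1000) < 0.3, name="Transported")  # roughly 30% positives

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=624, stratify=y_toy
)

# With stratify=y the positive rate is nearly identical in both parts.
print("train positive rate: %.3f" % y_tr.mean())
print("valid positive rate: %.3f" % y_va.mean())
# Differences in feature means are a quick first look at covariate shift.
print((X_tr.mean() - X_va.mean()).abs())
```

In this notebook the target is close to balanced (4378 vs. 4315), so stratification changes little, but it costs nothing to add.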


## 5. Training and Evaluation

In [10]:
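params = {
    'n_estimators': 500,  # number of trees
    'max_depth': 7,    # maximum depth of each tree (None means no depth limit, which tends to overfit)
    'min_samples_split': 10,  # minimum number of samples required to split a node
    'min_samples_leaf': 5,   # minimum number of samples required at a leaf node
    'max_features': 0.7,  # features considered per split (a float is a fraction of the features; also accepts 'sqrt', 'log2', an int, or None)
    'random_state': 624,   # random seed (fixed for reproducibility)
    'n_jobs':-1
}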

Manual search
- Increasing n_estimators does not seem to change the score much.
- As max_depth grows, the train score goes up but the validation score stays about the same.
- Lowering min_samples_leaf leaves the score unchanged.
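
The manual observations above can be checked more systematically with a small sweep that prints the train and validation scores side by side. This is a minimal sketch on synthetic data from `make_classification`; the dataset, the depth values, and `n_estimators=50` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy classification data standing in for X and y.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=624
)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.8, random_state=624)

results = []
for depth in [2, 4, 8, 16]:
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=624)
    rf.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, rf.predict(X_tr))
    val_acc = accuracy_score(y_va, rf.predict(X_va))
    results.append((depth, train_acc, val_acc))
    # A growing train/validation gap is the usual sign of overfitting.
    print("max_depth=%2d  train=%.4f  val=%.4f  gap=%.4f"
          % (depth, train_acc, val_acc, train_acc - val_acc))
```

The same loop can be pointed at `X_train`/`X_val` and at other parameters such as `n_estimators` or `min_samples_leaf`.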

In [11]:
model = RandomForestClassifier(**params)
model.fit(X_train, y_train)

print("Prediction")
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)

train_score = accuracy_score(y_train, pred_train)
val_score = accuracy_score(y_val, pred_val)

print("Train Score : %.4f" % train_score)
print("Validation Score : %.4f" % val_score)

Prediction
Train Score : 0.8108
Validation Score : 0.7953


## 6. Hyper-parameter Tuning

In [18]:
params_grid = {
    'n_estimators': [50,100,200,500],  # number of trees
    'max_depth': [6,7,8,15],    # maximum depth of each tree (None means no depth limit, which tends to overfit)
    'min_samples_split': [2,10,25],  # minimum number of samples required to split a node
    'min_samples_leaf': [1,5,10],   # minimum number of samples required at a leaf node
    'max_features': [0.5,0.7,0.9],  # features considered per split (a float is a fraction of the features; also accepts 'sqrt', 'log2', an int, or None)
}

gcv = GridSearchCV(estimator=RandomForestClassifier(random_state=624), param_grid=params_grid, cv=5, n_jobs=-1, verbose=2)

gcv.fit(X_train, y_train)
print("Best Estimator : ", gcv.best_estimator_)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits


Best Estimator :  RandomForestClassifier(max_depth=8, max_features=0.5, min_samples_leaf=5,
                       min_samples_split=25, random_state=624)
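
The "432 candidates, totalling 2160 fits" in the log above follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A short check, using only the lengths of the lists in `params_grid`:

```python
from math import prod

# Number of values tried for each parameter in params_grid.
grid_sizes = {
    "n_estimators": 4,
    "max_depth": 4,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "max_features": 3,
}

n_candidates = prod(grid_sizes.values())  # 4 * 4 * 3 * 3 * 3
n_fits = n_candidates * 5                 # cv=5 folds per candidate
print(n_candidates, n_fits)               # 432 2160
```

Adding one more value to any list multiplies the cost, which is why a sampler-based search such as optuna scales better for larger spaces.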


In [19]:
print("Prediction with Best Estimator")
gcv_pred_train = gcv.predict(X_train)
gcv_pred_val = gcv.predict(X_val)

gcv_train_score = accuracy_score(y_train, gcv_pred_train)
gcv_val_score = accuracy_score(y_val, gcv_pred_val)

print("Train ACC Score : %.4f" % gcv_train_score)
print("Validation ACC Score : %.4f" % gcv_val_score)

Prediction with Best Estimator
Train ACC Score : 0.8159
Validation ACC Score : 0.7987


> Let's try optuna!

In [14]:
# To view the study in a browser, run this in a shell: optuna-dashboard sqlite:///db.sqlite3

In [20]:
def objective(trial):
    # hyper-parameter
    params_optuna={
        "n_estimators" : trial.suggest_int('n_estimators', 50, 200),
        "max_depth" : trial.suggest_int('max_depth', 5, 10),
        "min_samples_split" : trial.suggest_categorical('min_samples_split',[2,10,25]),
        "min_samples_leaf" : trial.suggest_int('min_samples_leaf', 1, 10),
        "max_features" : trial.suggest_float('max_features',0.5, 0.8),
        }
    
    model = RandomForestClassifier(**params_optuna, random_state=624)  # fixed seed so repeated trials are comparable
    
    model.fit(X_train,y_train)
    preds=model.predict(X_val)
    scores=accuracy_score(y_val,preds)

    return scores

In [22]:
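K = 5 # number of folds (not used yet: objective() scores each trial on the single hold-out split)
sampler = TPESampler(seed=624)
study = optuna.create_study(direction="maximize", # whether to search for the minimum or the maximum of the objective
                            sampler=sampler,
                            storage="sqlite:///db.sqlite3",
                            study_name="test") 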
study.optimize(objective, n_trials=100)

[I 2023-11-28 11:46:38,835] A new study created in RDB with name: test
[I 2023-11-28 11:46:39,051] Trial 0 finished with value: 0.8027602070155262 and parameters: {'n_estimators': 65, 'max_depth': 10, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_features': 0.5063128300410934}. Best is trial 0 with value: 0.8027602070155262.
[I 2023-11-28 11:46:39,342] Trial 1 finished with value: 0.7935595169637722 and parameters: {'n_estimators': 120, 'max_depth': 6, 'min_samples_split': 25, 'min_samples_leaf': 4, 'max_features': 0.5442442579447845}. Best is trial 0 with value: 0.8027602070155262.
[I 2023-11-28 11:46:39,620] Trial 2 finished with value: 0.7929844738355377 and parameters: {'n_estimators': 98, 'max_depth': 6, 'min_samples_split': 25, 'min_samples_leaf': 9, 'max_features': 0.7594693664193721}. Best is trial 0 with value: 0.8027602070155262.
[I 2023-11-28 11:46:39,870] Trial 3 finished with value: 0.7964347326049454 and parameters: {'n_estimators': 71, 'max_depth': 8, 'min_samples

In [24]:
print("Best Score: %.4f" % study.best_value) # print the best score
print("Best params: ", study.best_trial.params) # hyper-parameters that produced the best score

Best Score: 0.8062
Best params:  {'n_estimators': 141, 'max_depth': 10, 'min_samples_split': 2, 'min_samples_leaf': 7, 'max_features': 0.5885449854405155}
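
The objective above scores every trial on the same hold-out split, so 100 trials can end up tuned to that particular validation set; `K`, `KFold`, and `partial` are set up in the notebook but not used. A K-fold variant might look like the following sketch. The function name `cv_objective` and its signature are hypothetical; the search space is copied from the objective above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

def cv_objective(trial, X, y, n_splits=5, seed=624):
    """Return the mean K-fold accuracy for one set of suggested hyper-parameters."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "max_depth": trial.suggest_int("max_depth", 5, 10),
        "min_samples_split": trial.suggest_categorical("min_samples_split", [2, 10, 25]),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "max_features": trial.suggest_float("max_features", 0.5, 0.8),
    }
    model = RandomForestClassifier(**params, random_state=seed)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return float(np.mean(scores))

# Usage with the notebook's names (needs optuna, functools.partial and the data above):
# study.optimize(partial(cv_objective, X=X_train, y=y_train, n_splits=K), n_trials=100)
```

Each trial then costs K fits instead of one, so `n_trials` may need to be reduced. The hold-out `X_val` stays untouched for a final check.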


In [None]:
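print("Validation ACC")
best_params = study.best_params  # best hyper-parameters found by optuna
best_model = RandomForestClassifier(**best_params, random_state=624)
best_model.fit(X_train, y_train)
print("Validation Score : %.3f" % accuracy_score(y_val, best_model.predict(X_val)))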

## 7. Testing and Creating the Submission File

In [None]:
## Build X_test -> apply exactly the same preprocessing that was used on the train data!
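
One way to reuse the train preprocessing is to learn the fill values on the training frame once and apply them to both frames, then align the test columns to the training columns. This is a minimal sketch on made-up rows. The helper names `fit_fill_values` and `preprocess` are hypothetical, and engineered features such as `in_large_group` would have to be added to both frames in the same way.

```python
import pandas as pd

def fit_fill_values(train_df, cat_cols, num_cols):
    """Learn imputation values (mode / mean) from the training frame only."""
    fill = {c: train_df[c].mode()[0] for c in cat_cols}
    fill.update({c: train_df[c].mean() for c in num_cols})
    return fill

def preprocess(df, fill, dummy_cols, drop_cols, feature_cols=None):
    """Impute, one-hot encode and drop columns; optionally align to feature_cols."""
    out = df.fillna(fill)
    out = pd.get_dummies(out, columns=dummy_cols)
    out = out.drop(columns=drop_cols)
    if feature_cols is not None:
        # Add dummy columns missing from this frame (as 0) and fix the column order.
        out = out.reindex(columns=feature_cols, fill_value=0)
    return out

# Toy frames standing in for train and test (values are made up).
toy_train = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01"],
    "HomePlanet": ["Earth", "Mars", None],
    "Age": [20.0, None, 40.0],
})
toy_test = pd.DataFrame({
    "PassengerId": ["0004_01", "0005_01"],
    "HomePlanet": [None, "Earth"],
    "Age": [None, 50.0],
})

fill = fit_fill_values(toy_train, cat_cols=["HomePlanet"], num_cols=["Age"])
X_toy = preprocess(toy_train, fill, ["HomePlanet"], ["PassengerId"])
X_toy_test = preprocess(toy_test, fill, ["HomePlanet"], ["PassengerId"],
                        feature_cols=X_toy.columns)
print(X_toy_test)
```

Filling the test frame with train statistics keeps both frames on the same scale, and the `reindex` step guards against a category that appears in only one of the two files.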


In [None]:
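best_params = study.best_params

# Refit on all labelled data with the tuned hyper-parameters.
best_model = RandomForestClassifier(**best_params, random_state=624)
best_model.fit(X, y)

# `test` must already have gone through the same preprocessing as `train`
# (in_large_group, imputation, one-hot encoding, dropped columns).
# Selecting X.columns fixes the column order and raises a KeyError if a feature is missing.
X_test = test[X.columns]

preds = best_model.predict(X_test)
preds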

In [None]:
submission = pd.read_csv('./sample_submission.csv')
submission

In [None]:
submission['Transported'] = preds
submission.to_csv("submission.csv", index=False)
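
Before uploading, it can help to sanity-check the submission frame. The sketch below uses made-up IDs and predictions, and it assumes the sample submission stores `Transported` as True/False, which should be confirmed against `sample_submission.csv`.

```python
import numpy as np
import pandas as pd

# Made-up IDs and predictions standing in for the test IDs and `preds`.
toy_ids = pd.Series(["0013_01", "0018_01", "0019_01"], name="PassengerId")
toy_preds = np.array([1, 0, 1])  # e.g. a model that returns 0/1 labels

toy_submission = pd.DataFrame({
    "PassengerId": toy_ids,
    "Transported": toy_preds.astype(bool),  # assumed target format: True/False
})

# Basic checks before writing the file.
assert len(toy_submission) == len(toy_ids)          # one row per test passenger
assert toy_submission["Transported"].dtype == bool  # boolean target column
assert toy_submission["PassengerId"].is_unique      # no duplicated IDs
print(toy_submission)
```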