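1. train.csv : training data
- id : sample ID
- Species : string giving the penguin's species
- Island : name of the island near Palmer Station where the sample was collected
- Clutch Completion : Yes if the observed nest held two eggs (a full clutch)
- Culmen Length (mm) : horizontal length of the bill, seen from the penguin's side
- Culmen Depth (mm) : vertical length (depth) of the bill, seen from the penguin's side
- Flipper Length (mm) : length of the penguin's flipper (its arm or wing)
- Sex : the penguin's sex
- Delta 15 N (o/oo) : ratio of the stable isotopes 15N:14N, which varies with the soil
- Delta 13 C (o/oo) : ratio of the stable isotopes 13C:12C, which varies with diet
- Body Mass (g) : the penguin's body mass in grams (numeric)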


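2. test.csv : test data
- id : sample ID
- Species : string giving the penguin's species
- Island : name of the island near Palmer Station where the sample was collected
- Clutch Completion : Yes if the observed nest held two eggs (a full clutch)
- Culmen Length (mm) : horizontal length of the bill, seen from the penguin's side
- Culmen Depth (mm) : vertical length (depth) of the bill, seen from the penguin's side
- Flipper Length (mm) : length of the penguin's flipper (its arm or wing)
- Sex : the penguin's sex
- Delta 15 N (o/oo) : ratio of the stable isotopes 15N:14N, which varies with the soil
- Delta 13 C (o/oo) : ratio of the stable isotopes 13C:12C, which varies with diet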


3. sample_submission.csv : submission template
- id : sample ID
- Body Mass (g) : the penguin's body mass in grams (numeric)
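
According to the file descriptions above, test.csv carries every column of train.csv except the target. A minimal sketch of how that could be checked after loading; the two one-row frames are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Hypothetical one-row stand-ins for train.csv and test.csv
train_cols = pd.DataFrame({"id": [0], "Sex": ["MALE"], "Body Mass (g)": [5550]})
test_cols = pd.DataFrame({"id": [0], "Sex": ["MALE"]})

# Columns present in train but absent from test should be the target only
target_only = train_cols.columns.difference(test_cols.columns).tolist()
print(target_only)
```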

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

In [2]:
path = "./dataset/"
train = "train.csv"
test="test.csv"
submit = "sample_submission.csv"

# 1. EDA

In [3]:
# Only 114 training samples are available
df = pd.read_csv(path+train)
df

Unnamed: 0,id,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Body Mass (g)
0,0,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,50.0,15.3,220,MALE,8.30515,-25.19017,5550
1,1,Chinstrap penguin (Pygoscelis antarctica),Dream,No,49.5,19.0,200,MALE,9.63074,-24.34684,3800
2,2,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,45.1,14.4,210,FEMALE,8.51951,-27.01854,4400
3,3,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,44.5,14.7,214,FEMALE,8.20106,-26.16524,4850
4,4,Gentoo penguin (Pygoscelis papua),Biscoe,No,49.6,16.0,225,MALE,8.38324,-26.84272,5700
...,...,...,...,...,...,...,...,...,...,...,...
109,109,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.6,17.8,185,FEMALE,,,3700
110,110,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,39.2,18.6,190,MALE,9.11006,-25.79549,4250
111,111,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,43.2,18.5,192,MALE,8.97025,-26.03679,4100
112,112,Chinstrap penguin (Pygoscelis antarctica),Dream,No,46.9,16.6,192,FEMALE,9.80589,-24.73735,2700


In [4]:
df["Sex"].value_counts()

MALE      56
FEMALE    55
Name: Sex, dtype: int64

In [5]:
# Show every row that contains a null -> 5 rows in total; for Sex, NaN is hard to replace with a specific value, so those rows need to be dropped.
df[df.isnull().sum(axis=1) > 0]

Unnamed: 0,id,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Body Mass (g)
6,6,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,42.0,20.2,190,,9.13362,-25.09368,4250
8,8,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,34.1,18.1,193,,,,3475
18,18,Adelie Penguin (Pygoscelis adeliae),Dream,No,39.8,19.1,184,MALE,,,4650
70,70,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.2,14.4,214,,8.24253,-26.8154,4650
109,109,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.6,17.8,185,FEMALE,,,3700
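
The comment above suggests dropping the rows whose Sex is missing. A minimal sketch of that step, assuming a small hypothetical frame (`toy`) in place of the real training data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame, not the real data
toy = pd.DataFrame({
    "Sex": ["MALE", np.nan, "FEMALE", np.nan],
    "Delta 15 N (o/oo)": [8.3, 9.1, np.nan, 8.2],
    "Body Mass (g)": [5550, 4250, 3700, 4650],
})

# Rows with at least one missing value (same result as isnull().sum(axis=1) > 0)
rows_with_na = toy[toy.isna().any(axis=1)]

# Drop only the rows whose Sex is missing; numeric gaps stay for later imputation
toy_clean = toy.dropna(subset=["Sex"]).reset_index(drop=True)

print(len(rows_with_na), len(toy_clean))
```

Resetting the index after the drop keeps later positional assignments (for example, writing scaled columns back) aligned with the remaining rows.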


In [6]:
# Check the species -> 3 species
df["Species"].value_counts()

Gentoo penguin (Pygoscelis papua)            48
Adelie Penguin (Pygoscelis adeliae)          41
Chinstrap penguin (Pygoscelis antarctica)    25
Name: Species, dtype: int64

In [7]:
df["Island"].value_counts()

Biscoe       57
Dream        44
Torgersen    13
Name: Island, dtype: int64

In [8]:
# Inspect missing values
def check_missing_col(dataframe):
    """Print each column that has missing values and return [column, dtype] pairs."""
    missing_col = []
    for col in dataframe.columns:
        missing_values = dataframe[col].isna().sum()
        if missing_values >= 1:
            print(f'Column with missing values: {col}')
            print(f'This column has {missing_values} missing values in total.')
            missing_col.append([col, dataframe[col].dtype])
    if not missing_col:
        print('There are no missing values.')
    return missing_col

missing_col = check_missing_col(df)

Column with missing values: Sex
This column has 3 missing values in total.
Column with missing values: Delta 15 N (o/oo)
This column has 3 missing values in total.
Column with missing values: Delta 13 C (o/oo)
This column has 3 missing values in total.


In [9]:
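# Function that handles missing values.
def handle_na(data, missing_col):
    temp = data.copy()
    for col, dtype in missing_col:
        if dtype == 'O':
            # A categorical feature has missing values: one-hot encode every
            # categorical column. No rows are dropped; a row whose value is NaN
            # simply ends up with all-zero dummy columns for that feature.
            temp = pd.get_dummies(temp)
        elif pd.api.types.is_numeric_dtype(dtype):
            # A numeric feature has missing values: fill them with the column mean.
            temp[col] = temp[col].fillna(temp[col].mean())
    return temp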

data = handle_na(df, missing_col)

data.head()

Unnamed: 0,id,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Delta 15 N (o/oo),Delta 13 C (o/oo),Body Mass (g),Species_Adelie Penguin (Pygoscelis adeliae),Species_Chinstrap penguin (Pygoscelis antarctica),Species_Gentoo penguin (Pygoscelis papua),Island_Biscoe,Island_Dream,Island_Torgersen,Clutch Completion_No,Clutch Completion_Yes,Sex_FEMALE,Sex_MALE
0,0,50.0,15.3,220,8.30515,-25.19017,5550,0,0,1,1,0,0,0,1,0,1
1,1,49.5,19.0,200,9.63074,-24.34684,3800,0,1,0,0,1,0,1,0,0,1
2,2,45.1,14.4,210,8.51951,-27.01854,4400,0,0,1,1,0,0,0,1,1,0
3,3,44.5,14.7,214,8.20106,-26.16524,4850,0,0,1,1,0,0,0,1,1,0
4,4,49.6,16.0,225,8.38324,-26.84272,5700,0,0,1,1,0,0,1,0,0,1
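
Because `handle_na` relies on `pd.get_dummies`, it helps to see how that function treats a missing category. The sketch below uses a hypothetical three-row frame; `dummy_na=True` is shown only as an option, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"Sex": ["MALE", "FEMALE", np.nan]})

# Default behaviour: the NaN row is kept and is 0 in every dummy column
default_dummies = pd.get_dummies(toy, dtype=int)

# dummy_na=True adds an explicit indicator column for the missing value
with_na_flag = pd.get_dummies(toy, dummy_na=True, dtype=int)

print(default_dummies)
print(with_na_flag.columns.tolist())
```

With the default, a penguin of unknown sex is encoded as neither male nor female, which the model sees as a third implicit category.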


In [10]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer, QuantileTransformer, PowerTransformer

continuous_names = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)']

# MinMaxScaler -> second-best performance
# scaler = MinMaxScaler()

# scaler = StandardScaler()

# scaler = RobustScaler()

# Normalizer -> simple normalization alone is unlikely to bring a large performance gain
# scaler = Normalizer()

# scaler = QuantileTransformer()

# scaler = QuantileTransformer(output_distribution = 'normal')


# PowerTransformer -> best performance
scaler = PowerTransformer()

def scale(df, columns):
    # Fits the global scaler on df[columns] (every call refits it) and
    # overwrites those columns with the transformed values.
    df[columns] = scaler.fit_transform(df[columns])
    return df
data = scale(data, continuous_names)
data[continuous_names].head()

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Delta 15 N (o/oo),Delta 13 C (o/oo)
0,1.024468,-0.88635,1.147245,-0.75632,0.711786
1,0.922733,1.026663,-0.13912,1.526093,1.506057
2,0.05537,-1.355427,0.535764,-0.342208,-1.747236
3,-0.05897,-1.19889,0.787456,-0.964676,-0.456605
4,0.943029,-0.522589,1.431906,-0.603175,-1.458288


In [11]:
from lightgbm import LGBMRegressor
# lgbm1 -> LGBMRegressor with the custom parameters below

lgb_params = {'n_estimators'     : 10000,   # Number of boosting iterations.
              'random_state'     : 42,      # Random seed for the model; makes experiments reproducible.
              'learning_rate'    : 0.0015,  # The model learning rate.
              'subsample'        : 0.8,     # Row subsampling: like feature_fraction, but randomly selects part of the data without resampling.
              'subsample_freq'   : 1,       # Bagging frequency: re-draw the row subsample every iteration.
              'colsample_bytree' : 0.8,     # Randomly select a subset of features for each tree.
              'min_child_weight' : 1e-3,    # Minimal sum of Hessians in one leaf; can be used to deal with overfitting.
              'min_child_samples': 32,      # Minimal number of samples in one leaf; can be used to deal with overfitting.
              'device_type'      : 'gpu',   # Needs a GPU-enabled LightGBM build; use 'cpu' otherwise.
             }

lgbm1 = LGBMRegressor(**lgb_params)

In [12]:
X = data.drop(["id", "Body Mass (g)"], axis=1)
Y = data["Body Mass (g)"]

x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=42)
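
With roughly a hundred rows, a single 80/20 split leaves only a couple of dozen validation samples, so the validation score can swing with the random seed. One common alternative is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in model, not the notebook's LightGBM setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical), roughly the size of the training set
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(114, 5))
y_toy = 4200 + 600 * X_toy[:, 0] + rng.normal(scale=150, size=114)

model = GradientBoostingRegressor(n_estimators=50, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated RMSE so that larger is better; flip the sign back
neg_rmse = cross_val_score(model, X_toy, y_toy, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

The spread across folds gives a rough idea of how much a single hold-out score could vary.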

In [13]:
lgbm1.fit(x_train, y_train,
          early_stopping_rounds=200,
          eval_set=[(x_val, y_val)],
          eval_metric='rmse',
          verbose=1)

[1]	valid_0's rmse: 851.891	valid_0's l2: 725718
Training until validation scores don't improve for 200 rounds
[2]	valid_0's rmse: 851.345	valid_0's l2: 724787
[3]	valid_0's rmse: 850.657	valid_0's l2: 723618
[4]	valid_0's rmse: 849.905	valid_0's l2: 722339
[5]	valid_0's rmse: 849.175	valid_0's l2: 721099
[6]	valid_0's rmse: 848.49	valid_0's l2: 719936
[7]	valid_0's rmse: 847.705	valid_0's l2: 718604
[8]	valid_0's rmse: 847.066	valid_0's l2: 717521
[9]	valid_0's rmse: 846.484	valid_0's l2: 716535
[10]	valid_0's rmse: 845.684	valid_0's l2: 715182
[11]	valid_0's rmse: 844.862	valid_0's l2: 713792
[12]	valid_0's rmse: 844.095	valid_0's l2: 712496
[13]	valid_0's rmse: 843.388	valid_0's l2: 711303
[14]	valid_0's rmse: 842.893	valid_0's l2: 710468
[15]	valid_0's rmse: 842.122	valid_0's l2: 709170
[16]	valid_0's rmse: 841.376	valid_0's l2: 707914
[17]	valid_0's rmse: 840.851	valid_0's l2: 707030
[18]	valid_0's rmse: 840.012	valid_0's l2: 705619
[19]	valid_0's rmse: 839.26	valid_0's l2: 704358

LGBMRegressor(colsample_bytree=0.8, device_type='gpu', learning_rate=0.0015,
              min_child_samples=32, n_estimators=10000, random_state=42,
              subsample=0.8, subsample_freq=1)

In [14]:
lgbm1.score(x_val, y_val)

0.8113261508580973
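
`score` on a scikit-learn-style regressor returns the coefficient of determination (R²), so the value above is not on the same scale as the RMSE printed in the training log. A small sketch of computing both from predictions; the four values are hypothetical, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted body masses in grams
y_true = np.array([3800.0, 4400.0, 4850.0, 5550.0])
y_pred = np.array([3900.0, 4300.0, 4900.0, 5400.0])

r2 = r2_score(y_true, y_pred)                        # unitless, 1.0 is a perfect fit
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # same unit as the target (g)
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.1f} g")
```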

In [15]:
df_test = pd.read_csv(path+test)
df_test.head()

Unnamed: 0,id,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo)
0,0,Chinstrap penguin (Pygoscelis antarctica),Dream,Yes,52.0,20.7,210.0,MALE,9.43146,-24.6844
1,1,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,55.9,17.0,228.0,MALE,8.3118,-26.35425
2,2,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,38.9,18.8,190.0,FEMALE,8.36936,-26.11199
3,3,Chinstrap penguin (Pygoscelis antarctica),Dream,Yes,45.2,16.6,191.0,FEMALE,9.62357,-24.78984
4,4,Adelie Penguin (Pygoscelis adeliae),Biscoe,No,37.9,18.6,172.0,FEMALE,8.38404,-25.19837


In [16]:
test_missing_col = check_missing_col(df_test) 

Column with missing values: Sex
This column has 6 missing values in total.
Column with missing values: Delta 15 N (o/oo)
This column has 9 missing values in total.
Column with missing values: Delta 13 C (o/oo)
This column has 8 missing values in total.


In [17]:
test_data = handle_na(df_test, test_missing_col)

In [18]:
test_data = scale(test_data, continuous_names)

test_data[continuous_names]

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Delta 15 N (o/oo),Delta 13 C (o/oo)
0,1.473395,1.852118,0.812797,1.284128,1.284180
1,2.066785,-0.153476,1.796145,-0.770564,-0.930653
2,-0.838347,0.785995,-0.684713,-0.651907,-0.548701
3,0.343289,-0.352777,-0.596715,1.589638,1.170618
4,-1.040353,0.678193,-2.586222,-0.621897,0.699957
...,...,...,...,...,...
223,1.040370,1.394004,0.346078,1.991197,1.377900
224,0.569839,-1.206289,1.228780,-0.221357,-1.329888
225,0.569839,-1.777897,0.812797,-1.452512,0.539566
226,1.235213,1.225659,0.202294,1.868711,1.236097
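
The `scale` helper calls `fit_transform`, so applying it to the test frame re-estimates the transformation from the test data rather than reusing what was learned on the training data. The usual convention is to fit on train and only transform test. A minimal sketch under that assumption, using hypothetical toy frames and a local transformer `pt`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

cols = ["Culmen Length (mm)", "Culmen Depth (mm)"]
rng = np.random.default_rng(0)

# Hypothetical toy frames standing in for the train and test data
train_toy = pd.DataFrame({cols[0]: rng.normal(44, 5, 50), cols[1]: rng.normal(17, 2, 50)})
test_toy = pd.DataFrame({cols[0]: rng.normal(44, 5, 20), cols[1]: rng.normal(17, 2, 20)})

pt = PowerTransformer()
train_scaled = train_toy.copy()
test_scaled = test_toy.copy()
train_scaled[cols] = pt.fit_transform(train_toy[cols])  # learn the parameters on train only
test_scaled[cols] = pt.transform(test_toy[cols])        # reuse them unchanged on test
```

This way a given measurement maps to the same scaled value in both frames, which is what a model trained on the scaled training data expects.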


In [19]:
test_data = test_data.drop(["id"], axis=1)
preds = lgbm1.predict(test_data)
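
`predict` expects the same feature columns, in the same order, as the frame used for fitting. Since `pd.get_dummies` is run separately on train and test, a category missing from one of them would silently change the column set. A minimal sketch of aligning the columns first, using hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames: the test split happens to contain no "Torgersen" rows
train_toy = pd.DataFrame({"Island": ["Biscoe", "Dream", "Torgersen"],
                          "Flipper Length (mm)": [220, 200, 185]})
test_toy = pd.DataFrame({"Island": ["Dream", "Biscoe"],
                         "Flipper Length (mm)": [210, 228]})

X_train_toy = pd.get_dummies(train_toy, dtype=int)
X_test_toy = pd.get_dummies(test_toy, dtype=int)

# Give the test frame the training columns in the same order; absent dummies become 0
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(X_test_aligned.columns.tolist())
```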

In [20]:
submission = pd.read_csv(path+submit)
submission

Unnamed: 0,id,Body Mass (g)
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0
...,...,...
223,223,0
224,224,0
225,225,0
226,226,0


In [21]:
submission["Body Mass (g)"] = preds

In [22]:
submission.to_csv("./dataset/submission.csv", index=False)

In [23]:
submission

Unnamed: 0,id,Body Mass (g)
0,0,4616.166327
1,1,5322.764813
2,2,3564.738569
3,3,3481.228468
4,4,3801.954404
...,...,...
223,223,4511.416073
224,224,4756.422928
225,225,4775.953752
226,226,4511.416073
