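1. train.csv : training data
- id : sample ID
- Species : string giving the penguin's species
- Island : name of the island near Palmer Station where the sample was collected
- Clutch Completion : "Yes" if the observed nest had a full clutch, i.e. 2 eggs
- Culmen Length (mm) : horizontal length of the bill, seen in profile
- Culmen Depth (mm) : vertical depth of the bill, seen in profile
- Flipper Length (mm) : length of the penguin's flipper (wing)
- Sex : sex of the penguin
- Delta 15 N (o/oo) : ratio of the stable isotopes 15N:14N, which varies with the soil
- Delta 13 C (o/oo) : ratio of the stable isotopes 13C:12C, which varies with diet
- Body Mass (g) : the penguin's body mass in grams (numeric)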


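2. test.csv : test data
- id : sample ID
- Species : string giving the penguin's species
- Island : name of the island near Palmer Station where the sample was collected
- Clutch Completion : "Yes" if the observed nest had a full clutch, i.e. 2 eggs
- Culmen Length (mm) : horizontal length of the bill, seen in profile
- Culmen Depth (mm) : vertical depth of the bill, seen in profile
- Flipper Length (mm) : length of the penguin's flipper (wing)
- Sex : sex of the penguin
- Delta 15 N (o/oo) : ratio of the stable isotopes 15N:14N, which varies with the soil
- Delta 13 C (o/oo) : ratio of the stable isotopes 13C:12C, which varies with diet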


3. sample_submission.csv : submission format
- id : sample ID
- Body Mass (g) : the penguin's body mass in grams (numeric)

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

In [2]:
path = "./dataset/"
train = "train.csv"
test = "test.csv"
submit = "sample_submission.csv"

# 1. EDA

In [3]:
# The training set has only 114 samples
df = pd.read_csv(path+train)
df

Unnamed: 0,id,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Body Mass (g)
0,0,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,50.0,15.3,220,MALE,8.30515,-25.19017,5550
1,1,Chinstrap penguin (Pygoscelis antarctica),Dream,No,49.5,19.0,200,MALE,9.63074,-24.34684,3800
2,2,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,45.1,14.4,210,FEMALE,8.51951,-27.01854,4400
3,3,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,44.5,14.7,214,FEMALE,8.20106,-26.16524,4850
4,4,Gentoo penguin (Pygoscelis papua),Biscoe,No,49.6,16.0,225,MALE,8.38324,-26.84272,5700
...,...,...,...,...,...,...,...,...,...,...,...
109,109,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.6,17.8,185,FEMALE,,,3700
110,110,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,39.2,18.6,190,MALE,9.11006,-25.79549,4250
111,111,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,43.2,18.5,192,MALE,8.97025,-26.03679,4100
112,112,Chinstrap penguin (Pygoscelis antarctica),Dream,No,46.9,16.6,192,FEMALE,9.80589,-24.73735,2700


In [4]:
df["Sex"].value_counts()

MALE      56
FEMALE    55
Name: Sex, dtype: int64

In [5]:
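# Show every row that contains a null -> 5 rows in total. NaN in Sex is hard to replace with a
# meaningful value, so dropping those rows was considered (handle_na below fills them with a "No Gender" placeholder instead).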
df[df.isnull().sum(axis=1) > 0]

Unnamed: 0,id,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Body Mass (g)
6,6,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,42.0,20.2,190,,9.13362,-25.09368,4250
8,8,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,34.1,18.1,193,,,,3475
18,18,Adelie Penguin (Pygoscelis adeliae),Dream,No,39.8,19.1,184,MALE,,,4650
70,70,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.2,14.4,214,,8.24253,-26.8154,4650
109,109,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.6,17.8,185,FEMALE,,,3700
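
The row filter above can also be written with `any(axis=1)`, which reads closer to the intent ("rows with at least one missing value"), and the same mask gives per-column counts. A minimal sketch on a hypothetical three-row frame; `toy` and its values are made up for illustration and are not taken from train.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["MALE", None, "FEMALE"],
    "Delta 15 N (o/oo)": [8.3, 9.1, np.nan],
    "Body Mass (g)": [5550, 4250, 3700],
})

# Boolean mask: True for rows that contain at least one missing value.
has_na = toy.isna().any(axis=1)
rows_with_na = toy[has_na]

# Per-column missing counts, keeping only columns that have any.
na_per_col = toy.isna().sum()
na_per_col = na_per_col[na_per_col > 0]
```

On the real data, `len(rows_with_na)` would be the "5 rows" figure noted in the cell comment.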


In [6]:
# Check the species values -> 3 species
df["Species"].value_counts()

Gentoo penguin (Pygoscelis papua)            48
Adelie Penguin (Pygoscelis adeliae)          41
Chinstrap penguin (Pygoscelis antarctica)    25
Name: Species, dtype: int64

In [7]:
df["Island"].value_counts()

Biscoe       57
Dream        44
Torgersen    13
Name: Island, dtype: int64

In [8]:
# Check for missing values
def check_missing_col(dataframe):
    missing_col = []
    for col in dataframe.columns:
        missing_values = dataframe[col].isna().sum()
        if missing_values >= 1:
            print(f'Column with missing values: {col}')
            print(f'This column has {missing_values} missing values in total.')
            # Keep the dtype so handle_na can pick a fill strategy per column.
            missing_col.append([col, dataframe[col].dtype])
    if not missing_col:
        print('No missing values found')
    return missing_col

missing_col = check_missing_col(df)

Column with missing values: Sex
This column has 3 missing values in total.
Column with missing values: Delta 15 N (o/oo)
This column has 3 missing values in total.
Column with missing values: Delta 13 C (o/oo)
This column has 3 missing values in total.


In [9]:
# Function that handles missing values.
def handle_na(data, missing_col):
    temp = data.copy()
    for col, dtype in missing_col:
        if dtype == 'O':
            # Categorical feature: fill missing entries with the placeholder "No Gender"
            # (Sex is the only categorical column with missing values here).
            temp[col] = temp[col].fillna("No Gender")
        elif dtype == int or dtype == float:
            # Numeric feature: fill missing entries with the column mean.
            temp.loc[:,col] = temp[col].fillna(temp[col].mean())
    return temp

data = handle_na(df, missing_col)

data

Unnamed: 0,id,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Body Mass (g)
0,0,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,50.0,15.3,220,MALE,8.305150,-25.190170,5550
1,1,Chinstrap penguin (Pygoscelis antarctica),Dream,No,49.5,19.0,200,MALE,9.630740,-24.346840,3800
2,2,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,45.1,14.4,210,FEMALE,8.519510,-27.018540,4400
3,3,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,44.5,14.7,214,FEMALE,8.201060,-26.165240,4850
4,4,Gentoo penguin (Pygoscelis papua),Biscoe,No,49.6,16.0,225,MALE,8.383240,-26.842720,5700
...,...,...,...,...,...,...,...,...,...,...,...
109,109,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.6,17.8,185,FEMALE,8.737634,-25.723051,3700
110,110,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,39.2,18.6,190,MALE,9.110060,-25.795490,4250
111,111,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,43.2,18.5,192,MALE,8.970250,-26.036790,4100
112,112,Chinstrap penguin (Pygoscelis antarctica),Dream,No,46.9,16.6,192,FEMALE,9.805890,-24.737350,2700
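
`handle_na` fills numeric gaps with the mean of whatever frame it receives, so calling it on the test set would use test-set means. A common alternative is to compute the fill values once on the training data and reuse them everywhere. The sketch below assumes that approach; `fit_fill_values` and `apply_fill_values` are hypothetical helpers that are not part of this notebook, and the toy frames are made up:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fit_fill_values(train_df, placeholder="No Gender"):
    """Learn one fill value per column from the training frame only."""
    fill = {}
    for col in train_df.columns:
        if is_numeric_dtype(train_df[col]):
            fill[col] = train_df[col].mean()  # mean() skips NaN by default
        else:
            fill[col] = placeholder
    return fill


def apply_fill_values(df, fill):
    """Fill missing values with statistics learned on another frame."""
    return df.fillna(value=fill)


# Hypothetical toy data, for illustration only.
toy_train = pd.DataFrame({"Sex": ["MALE", "FEMALE", None],
                          "Delta 15 N (o/oo)": [8.0, 9.0, np.nan]})
toy_test = pd.DataFrame({"Sex": [None, "MALE"],
                         "Delta 15 N (o/oo)": [np.nan, 10.0]})

fill = fit_fill_values(toy_train)
toy_test_filled = apply_fill_values(toy_test, fill)
```

Here the missing test value is filled with the training mean (8.5) rather than with a statistic computed from the test rows.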


In [10]:
encoder = LabelEncoder()

# Replace each categorical column with integer codes (the encoder is refit per column).
for col in ["Species", "Island", "Clutch Completion", "Sex"]:
    data[col] = encoder.fit_transform(data[col])

In [11]:
data["Sex"].value_counts()

1    56
0    55
2     3
Name: Sex, dtype: int64
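
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why the `No Gender` placeholder ends up as code 2 in the counts above. A small sketch for recovering the mapping from a fitted encoder; the sample values and the `sex_encoder` name are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real column comes from the training frame.
sex_values = ["MALE", "FEMALE", "No Gender", "MALE"]

sex_encoder = LabelEncoder()
codes = sex_encoder.fit_transform(sex_values)

# classes_ holds the labels in code order: position i is the label for code i.
mapping = {str(label): code for code, label in enumerate(sex_encoder.classes_)}
```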

In [12]:
from lightgbm import LGBMRegressor
# lgbm1 -> LGBM with hand-set parameters (small learning rate, many boosting rounds)

lgb_params = {'n_estimators'     : 10000,   # Maximum number of boosting iterations.
              'random_state'     : 42,      # Random seed, so experiments can be reproduced.
              'learning_rate'    : 0.0015,  # Shrinkage rate applied to each tree.
              'subsample'        : 0.8,     # Fraction of rows sampled, without replacement, in each bagging round.
              'subsample_freq'   : 1,       # Re-sample rows every k iterations; 1 = every iteration, 0 = bagging disabled.
              'colsample_bytree' : 0.8,     # Fraction of features randomly selected for each tree.
              'min_child_weight' : 1e-3,    # Minimum sum of Hessians in one leaf; helps limit over-fitting.
              'min_child_samples': 32,      # Minimum number of samples in one leaf; helps limit over-fitting.
              'device_type'      : 'gpu',   # Train on GPU; requires a GPU-enabled LightGBM build.
             }

lgbm1 = LGBMRegressor(**lgb_params)

In [13]:
X = data.drop(["id", "Body Mass (g)"], axis=1)
Y = data["Body Mass (g)"]

x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=42)
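
With only 114 rows, this 80/20 split leaves 23 rows for validation, so the validation score can move noticeably with the random seed. K-fold cross-validation is one common way to get a steadier estimate. The sketch below only shows how the folds would be formed; the choice of 5 folds and the placeholder array are assumptions, and the model fit is left out:

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 114                           # size of the training set in this notebook
X_placeholder = np.zeros((n_samples, 1))  # stand-in for the real feature matrix

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for train_idx, val_idx in kfold.split(X_placeholder):
    # Every row lands in the validation part of exactly one fold.
    fold_sizes.append((len(train_idx), len(val_idx)))
```

In practice the regressor would be fit once per fold and the per-fold scores averaged.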

In [14]:
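# Stop training once validation RMSE has not improved for 200 rounds.
# Note: LightGBM 4.0 removed the early_stopping_rounds and verbose arguments of fit();
# on those versions, pass callbacks=[lightgbm.early_stopping(200), lightgbm.log_evaluation(0)] instead.
lgbm1.fit(x_train, y_train,
          early_stopping_rounds=200,
          eval_set=[(x_val, y_val)],
          eval_metric='rmse',
          verbose=0)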

LGBMRegressor(colsample_bytree=0.8, device_type='gpu', learning_rate=0.0015,
              min_child_samples=32, n_estimators=10000, random_state=42,
              subsample=0.8, subsample_freq=1)

In [15]:
lgbm1.score(x_val, y_val)

0.7898972986785392
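
`score` on a scikit-learn style regressor returns the coefficient of determination R², while the `eval_metric` used during fitting was RMSE. Reporting RMSE in grams next to R² can make the validation result easier to interpret. A minimal NumPy sketch with made-up numbers (these are not the notebook's predictions):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Illustrative body masses in grams.
y_true_toy = [4000, 5000, 3000, 4000]
y_pred_toy = [4100, 4900, 3100, 3900]

toy_rmse = rmse(y_true_toy, y_pred_toy)
toy_r2 = r2(y_true_toy, y_pred_toy)
```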

In [16]:
df_test = pd.read_csv(path+test)
df_test

Unnamed: 0,id,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo)
0,0,Chinstrap penguin (Pygoscelis antarctica),Dream,Yes,52.0,20.7,210.0,MALE,9.43146,-24.68440
1,1,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,55.9,17.0,228.0,MALE,8.31180,-26.35425
2,2,Adelie Penguin (Pygoscelis adeliae),Dream,Yes,38.9,18.8,190.0,FEMALE,8.36936,-26.11199
3,3,Chinstrap penguin (Pygoscelis antarctica),Dream,Yes,45.2,16.6,191.0,FEMALE,9.62357,-24.78984
4,4,Adelie Penguin (Pygoscelis adeliae),Biscoe,No,37.9,18.6,172.0,FEMALE,8.38404,-25.19837
...,...,...,...,...,...,...,...,...,...,...
223,223,Chinstrap penguin (Pygoscelis antarctica),Dream,Yes,49.3,19.9,203.0,MALE,9.88809,-24.59513
224,224,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.5,14.8,217.0,FEMALE,8.58487,-26.59290
225,225,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.5,13.5,210.0,FEMALE,7.99530,-25.32829
226,226,Chinstrap penguin (Pygoscelis antarctica),Dream,Yes,50.5,19.6,201.0,MALE,9.80590,-24.72940


In [17]:
test_missing_col = check_missing_col(df_test) 

Column with missing values: Sex
This column has 6 missing values in total.
Column with missing values: Delta 15 N (o/oo)
This column has 9 missing values in total.
Column with missing values: Delta 13 C (o/oo)
This column has 8 missing values in total.


In [18]:
test_data = handle_na(df_test, test_missing_col)

In [19]:
encoder = LabelEncoder()

# Encode the same categorical columns in the test set (the encoder is refit per column).
for col in ["Species", "Island", "Clutch Completion", "Sex"]:
    test_data[col] = encoder.fit_transform(test_data[col])
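
Fitting a fresh `LabelEncoder` on the test set yields the same integer codes as in training only when both sets contain exactly the same categories in each column. One way to remove that dependency is to fit one encoder per column on the training data and reuse it on the test data. The sketch below assumes that approach; `fit_encoders` and `apply_encoders` are hypothetical helpers, and the toy frames are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

CATEGORICAL_COLS = ["Island", "Sex"]  # subset of the notebook's columns, for brevity


def fit_encoders(train_df, cols):
    """Fit one LabelEncoder per categorical column on the training frame."""
    return {col: LabelEncoder().fit(train_df[col]) for col in cols}


def apply_encoders(df, encoders):
    """Encode a frame with already-fitted encoders."""
    out = df.copy()
    for col, enc in encoders.items():
        # transform raises ValueError for labels that were not seen during fit.
        out[col] = enc.transform(out[col])
    return out


toy_train = pd.DataFrame({"Island": ["Biscoe", "Dream", "Torgersen"],
                          "Sex": ["MALE", "FEMALE", "No Gender"]})
# This toy test frame has no "Biscoe" rows, so a refit encoder shifts the codes.
toy_test = pd.DataFrame({"Island": ["Dream", "Torgersen"],
                         "Sex": ["FEMALE", "MALE"]})

encoders = fit_encoders(toy_train, CATEGORICAL_COLS)
toy_test_encoded = apply_encoders(toy_test, encoders)

# For comparison: codes produced by refitting on the test frame alone.
refit_codes = LabelEncoder().fit_transform(toy_test["Island"])
```

With the shared encoders, "Dream" keeps code 1 as in training; the refit encoder would assign it 0, which the model would read as "Biscoe".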

In [20]:
test_data = test_data.drop(["id"], axis=1)
preds = lgbm1.predict(test_data)


In [21]:
len(preds)

228

In [22]:
submission = pd.read_csv(path+submit)
submission

Unnamed: 0,id,Body Mass (g)
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0
...,...,...
223,223,0
224,224,0
225,225,0
226,226,0


In [23]:
submission["Body Mass (g)"] = preds
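
Before writing the file, a few cheap checks can catch misaligned or missing predictions. The sketch below assumes the predictions are in the same row order as the submission template; `check_submission` and the toy values are hypothetical:

```python
import numpy as np
import pandas as pd


def check_submission(submission_df, preds, target_col="Body Mass (g)"):
    """Return a copy of the template with predictions filled in, after basic checks."""
    preds = np.asarray(preds, dtype=float)
    if len(preds) != len(submission_df):
        raise ValueError(
            f"expected {len(submission_df)} predictions, got {len(preds)}")
    if not np.isfinite(preds).all():
        raise ValueError("predictions contain NaN or infinite values")
    out = submission_df.copy()
    out[target_col] = preds
    return out


# Illustrative template and predictions.
toy_submission = pd.DataFrame({"id": [0, 1, 2], "Body Mass (g)": [0, 0, 0]})
toy_preds = [4634.8, 5299.8, 3591.5]

checked = check_submission(toy_submission, toy_preds)
```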

In [24]:
submission.to_csv("./dataset/submission.csv", index=False)

In [25]:
submission

Unnamed: 0,id,Body Mass (g)
0,0,4634.758950
1,1,5299.791063
2,2,3591.536891
3,3,3472.633471
4,4,3792.983043
...,...,...
223,223,4416.424320
224,224,4766.071200
225,225,4765.216323
226,226,4416.424320
