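## Task Type 2: Final Summary
- Task Type 1: 3 questions (30 points), data preprocessing
- `Task Type 2: 1 question (40 points), classification/regression predictive modeling`
- Task Type 3: 2 questions (30 points), hypothesis testing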

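## Main libraries
- palmerpenguins: provides the Palmer Penguins dataset, intended as an alternative to iris for data exploration and visualization.
- scikit-learn: a machine learning library.
- lightgbm: LightGBM is an open-source gradient boosting library developed by Microsoft, known for fast training and strong performance on large datasets.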

## Note
- The code blocks are not explained individually.

## Loading the data file

In [24]:
import pandas as pd
from palmerpenguins import load_penguins

df = load_penguins()
# Add a 1-based row ID column.
df['ID'] = df.reset_index().index+1

## Inspecting the data

In [25]:
print(df.head())
print()
print(df.info())
print()
print(df.isnull().sum())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  year  ID  
0       3750.0    male  2007   1  
1       3800.0  female  2007   2  
2       3250.0  female  2007   3  
3          NaN     NaN  2007   4  
4       3450.0  female  2007   5  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64


In [26]:
# Drop every row that contains at least one missing value.
df = df.dropna()
print(df.isnull().sum())

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
year                 0
ID                   0
dtype: int64


## Splitting the dataset

In [27]:
from sklearn.model_selection import train_test_split

# 'sex' is the target; all other columns are features.
X = df.drop(columns=['sex'])
y = df['sex']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(266, 8) (67, 8) (266,) (67,)
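
The shapes above follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the remaining rows go to training. A minimal sketch, using a plain list of 333 integers as a stand-in for the 333 rows (266 + 67) left after `dropna()`; the stand-in data is an assumption for illustration:

```python
import math
from sklearn.model_selection import train_test_split

n_rows = 333  # 266 + 67, the row count implied by the shapes above
rows = list(range(n_rows))  # stand-in data, not the penguins frame

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

# The test count is ceil(0.2 * 333); the remaining rows go to training.
expected_test = math.ceil(0.2 * n_rows)
print(len(train_rows), len(test_rows), expected_test)
```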


## Separating the ID column

In [28]:
# pop() removes 'ID' from the features in place and returns it for later use.
X_train_id = X_train.pop('ID')
X_test_id = X_test.pop('ID')

print(X_train_id.shape, X_test_id.shape)

(266,) (67,)


## Splitting by data type

In [29]:
import numpy as np

# Categorical (object) columns and numeric columns are handled separately.
object_df = X_train.select_dtypes(include=object)
number_df = X_train.select_dtypes(include=np.number)

In [30]:
for column in object_df.columns:
    print(object_df[column].value_counts())

Adelie       115
Gentoo       101
Chinstrap     50
Name: species, dtype: int64
Biscoe       135
Dream         98
Torgersen     33
Name: island, dtype: int64


In [31]:
print(number_df.describe())

       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  \
count      266.000000     266.000000         266.000000   266.000000   
mean        44.079323      17.072556         201.409774  4232.612782   
std          5.420164       1.969091          14.269156   811.099787   
min         32.100000      13.100000         172.000000  2700.000000   
25%         39.600000      15.400000         190.000000  3600.000000   
50%         44.900000      17.200000         197.000000  4100.000000   
75%         48.500000      18.600000         214.000000  4800.000000   
max         59.600000      21.500000         231.000000  6050.000000   

              year  
count   266.000000  
mean   2008.041353  
std       0.816216  
min    2007.000000  
25%    2007.000000  
50%    2008.000000  
75%    2009.000000  
max    2009.000000  


## Building the model

In [61]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from lightgbm import LGBMClassifier

transformer = ColumnTransformer([
    ('scaler', MinMaxScaler(), number_df.columns),
    ('encoder', OneHotEncoder(), object_df.columns)
], remainder='passthrough')

pipeline = Pipeline([
    ('preprocessor', transformer),
    ('model', LGBMClassifier(random_state=42, learning_rate=0.05, max_depth=3))
])

pipeline.fit(X_train, y_train)
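
To see what the preprocessing step hands to the model, the same `ColumnTransformer` layout can be run on a tiny frame. This is only a sketch: the three-row frame and the names `toy` and `toy_transformer` are hypothetical, chosen to mirror one numeric and one categorical column of the penguins data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame (hypothetical values) with one numeric and one categorical column.
toy = pd.DataFrame({
    'body_mass_g': [3000.0, 4000.0, 5000.0],
    'island': ['Biscoe', 'Dream', 'Torgersen'],
})

toy_transformer = ColumnTransformer([
    ('scaler', MinMaxScaler(), ['body_mass_g']),
    ('encoder', OneHotEncoder(), ['island']),
])

encoded = toy_transformer.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert it for inspection.
dense = encoded.toarray() if hasattr(encoded, 'toarray') else encoded

print(dense.shape)   # one scaled column plus one column per island category
print(dense[:, 0])   # body mass rescaled to the [0, 1] range
```

The numeric column is rescaled to [0, 1] and each category becomes its own indicator column, so the model receives only numeric input.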

## Evaluating the model

In [62]:
from sklearn.metrics import roc_auc_score

def get_scores(model, X_train, X_test, y_train, y_test):
    train_predictions = model.predict_proba(X_train)
    test_predictions = model.predict_proba(X_test)
    train_score = roc_auc_score(y_train, train_predictions[:, 1])
    test_score = roc_auc_score(y_test, test_predictions[:, 1])
    return f"train : {train_score}, test : {test_score}"

print(get_scores(pipeline, X_train, X_test, y_train, y_test))

train : 0.9969727833418208, test : 0.9560931899641577
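
The `[:, 1]` index in `get_scores` matters because the labels are strings. A scikit-learn classifier sorts its `classes_`, so column 1 of `predict_proba` is the probability of `'male'`, and `roc_auc_score` treats the greater of the two labels as the positive class. The sketch below illustrates this on toy data; `LogisticRegression` is swapped in for LightGBM purely to keep the example small, and all values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy data (hypothetical values): one feature, string labels as in the 'sex' column.
X_toy = np.array([[3.0], [3.2], [3.4], [3.6], [4.2], [4.4], [4.6], [4.8]])
y_toy = np.array(['female'] * 4 + ['male'] * 4)

toy_model = LogisticRegression().fit(X_toy, y_toy)
proba = toy_model.predict_proba(X_toy)

# classes_ is sorted, so column 0 is P('female') and column 1 is P('male').
print(toy_model.classes_)

# roc_auc_score takes the greater label ('male') as the positive class,
# so it should be given the matching column of predict_proba.
auc_matching = roc_auc_score(y_toy, proba[:, 1])
auc_mismatched = roc_auc_score(y_toy, proba[:, 0])
print(auc_matching, auc_mismatched)
```

Passing column 0 instead reverses the ranking, which yields one minus the intended AUC.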


## Submitting the results

In [63]:
final_predictions = pipeline.predict(X_test)
result = pd.DataFrame({
    'ID' : X_test_id,
    'preds' : final_predictions
})
print(result)

      ID   preds
30    31  female
320  321    male
79    80    male
202  203  female
63    64    male
..   ...     ...
291  292    male
4      5  female
83    84    male
322  323  female
66    67  female

[67 rows x 2 columns]
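
The cell above only prints the result. If the predictions have to be handed in as a file, one option is `DataFrame.to_csv` with `index=False`, so that the original row index (30, 320, ...) is not written as an extra column. A minimal sketch; the file name `result.csv`, the temporary directory, and the three-row frame are assumptions for illustration, not a required submission format:

```python
import os
import tempfile
import pandas as pd

# Three-row stand-in for the result frame printed above.
result = pd.DataFrame(
    {'ID': [31, 321, 80], 'preds': ['female', 'male', 'male']},
    index=[30, 320, 79],
)

# Hypothetical output location; replace with the path the task specifies.
out_path = os.path.join(tempfile.gettempdir(), 'result.csv')
result.to_csv(out_path, index=False)  # index=False omits the row index

# Read the file back to confirm that only the two intended columns were written.
check = pd.read_csv(out_path)
print(check.columns.tolist(), check.shape)
```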
