# Big Data Analysis Engineer Practical Exam: Past Exam Round 3, Task Type 2
[Kaggle competition link](https://www.kaggle.com/competitions/big-data-analytics-certification/overview)  
[Kaggle notebook share link]() 


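## [Classification] Predict the probability that a customer buys the travel insurance package
- Target (y): TravelInsurance (whether the travel insurance package was purchased; 0: not purchased, 1: purchased)
- Evaluation metric: ROC-AUC
- Data: t2-1-train.csv, t2-1-test.csv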
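
Because the metric is ROC-AUC, the submission should contain class-1 probabilities rather than 0/1 labels. The following is a minimal sketch on made-up numbers (not the competition data) showing how thresholding can lower the score even when the ranking is perfect.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and class-1 probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.2, 0.3, 0.9]

# ROC-AUC depends only on how the scores rank positives against negatives
auc_from_proba = roc_auc_score(y_true, y_proba)

# Thresholding at 0.5 collapses the scores to 0/1 and creates ties
y_label = [int(p >= 0.5) for p in y_proba]
auc_from_label = roc_auc_score(y_true, y_label)

print(auc_from_proba, auc_from_label)
```

Here every positive is ranked above every negative, so the probability-based score is perfect, while the thresholded labels tie one positive with both negatives and lose part of that credit.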

Submission format
```
id,TravelInsurance
0,0.3
1,0.48
2,0.3
3,0.83
```
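
A quick check before writing the file can catch a wrong column name or a value outside [0, 1]. This is only a sketch: `check_submission` is a hypothetical helper, not part of the exam environment, and the sample rows are copied from the format shown above.

```python
import pandas as pd

def check_submission(df: pd.DataFrame) -> bool:
    """Return True when df has the expected columns and probabilities in [0, 1]."""
    has_columns = list(df.columns) == ["id", "TravelInsurance"]
    # Only look at the values when the expected column actually exists
    in_range = has_columns and df["TravelInsurance"].between(0, 1).all()
    return bool(has_columns and in_range)

sample = pd.DataFrame({"id": [0, 1, 2, 3],
                       "TravelInsurance": [0.3, 0.48, 0.3, 0.83]})
print(check_submission(sample))
```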

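## Working as in the exam environment

Output of `describe(include="O")` for train (top, 1490 rows) and test (bottom, 497 rows). `Employment Type` has 2 unique values in train but 3 in test.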

```
                     Employment Type GraduateOrNot FrequentFlyer  \
count                           1490          1490          1490   
unique                             2             2             2   
top     Private Sector/Self Employed           Yes            No   
freq                            1056          1270          1175   

       EverTravelledAbroad  
count                 1490  
unique                   2  
top                     No  
freq                  1209   

                     Employment Type GraduateOrNot FrequentFlyer  \
count                            497           497           497   
unique                             3             2             2   
top     Private Sector/Self Employed           Yes            No   
freq                             360           422           395   

       EverTravelledAbroad  
count                  497  
unique                   2  
top                     No  
freq                   398   
```
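
To see which category values cause a `unique` mismatch like this one, the values of each non-numeric column can be compared as sets. The sketch below uses toy frames with hypothetical values (`"A"`, `"B"`, `"C"`); `unseen_categories` is an illustrative helper name, not something from the original notebook.

```python
import pandas as pd

# Toy frames; the category values are hypothetical, not the real data
train_toy = pd.DataFrame({"job": ["A", "B", "A"], "flag": ["Yes", "No", "Yes"]})
test_toy = pd.DataFrame({"job": ["A", "B", "C"], "flag": ["No", "Yes", "No"]})

def unseen_categories(train_df, test_df):
    """Map each non-numeric column to the values found in test but not in train."""
    result = {}
    for col in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            continue  # skip numeric columns
        extra = set(test_df[col].dropna()) - set(train_df[col].dropna())
        if extra:
            result[col] = sorted(extra)
    return result

print(unseen_categories(train_toy, test_toy))
```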

### What if nunique of a categorical column differs between train and test?
#### 1) Concatenate train and test, label-encode, then split again
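
A minimal sketch of option 1 on toy data, assuming a single categorical column named `job` (a hypothetical name). Fitting the encoder on the combined frame means a value that appears only in test still receives a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: "C" appears only in the test split (hypothetical values)
x_train_toy = pd.DataFrame({"job": ["A", "B", "A"], "age": [28, 31, 29]})
x_test_toy = pd.DataFrame({"job": ["C", "A"], "age": [33, 25]})

# Remember the train length so the combined frame can be split again
n_train = len(x_train_toy)
combined = pd.concat([x_train_toy, x_test_toy], ignore_index=True)

# Fit one encoder per categorical column on train + test together
for col in ["job"]:
    combined[col] = LabelEncoder().fit_transform(combined[col])

x_train_le = combined.iloc[:n_train].copy()
x_test_le = combined.iloc[n_train:].copy()
print(x_train_le["job"].tolist(), x_test_le["job"].tolist())
```

`LabelEncoder` assigns codes in sorted order of the values it was fitted on, so the same value gets the same code in both splits.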

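#### 2) Concatenate train and test, create dummy variables, then split again
```python
import pandas as pd

# One-hot encode train and test together so both get the same dummy columns
n_train = x_train.shape[0]
x_dummies = pd.get_dummies(pd.concat([x_train, x_test]))
x_train_dummies = x_dummies.iloc[:n_train]
x_test_dummies = x_dummies.iloc[n_train:]
```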

### Model performance log
```
- Missing values filled with 0
0.794 : rf = RandomForestClassifier(random_state=2023)
0.792 : lgbm = LGBMClassifier(random_state=2023)

- Missing values filled with the median
0.796 : rf = RandomForestClassifier(random_state=2023)
0.794 : lgbm = LGBMClassifier(random_state=2023)
```
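
A comparison like the one logged above can be run by wrapping the split, fit, and scoring steps in one function and changing only the fill value. The sketch below runs on synthetic stand-in data, not the competition data; `validation_auc` and the `income` column are hypothetical names, and its scores say nothing about the numbers recorded above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the competition data)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({"income": rng.normal(900_000, 300_000, n),
                  "age": rng.integers(25, 36, n)})
y = (X["income"] + rng.normal(0, 300_000, n) > 900_000).astype(int)
# Knock out 20 income values to imitate missing data
X.loc[rng.choice(n, 20, replace=False), "income"] = np.nan

def validation_auc(features, target, fill_value):
    """Fill missing income with fill_value and return the hold-out ROC-AUC."""
    filled = features.fillna({"income": fill_value})
    X_tr, X_val, y_tr, y_val = train_test_split(
        filled, target, test_size=0.2, random_state=1, stratify=target)
    model = RandomForestClassifier(random_state=2023)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

scores = {
    "zero": validation_auc(X, y, 0),
    "median": validation_auc(X, y, X["income"].median()),
}
print(scores)
```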

```python
import pandas as pd

# 0. Load the data
train = pd.read_csv("./Data/t2-1-train.csv")
test = pd.read_csv("./Data/t2-1-test.csv")

print(train.shape)
print(train.head(), "\n")
print(test.shape)
print(test.head(), "\n")


# 1. EDA
# 1-1) train EDA
print("="*20, "train EDA", "="*20)
print(train.info(), "\n")
print(train.isnull().sum(), "\n")      # missing values: AnnualIncome  4
print(train.describe(), "\n")

# 1-2) test EDA
print("="*20, "test EDA", "="*20)
print(test.info(), "\n")
print(test.isnull().sum(), "\n")      # missing values: AnnualIncome  3
print(test.describe(), "\n")

# 1-3) Compare nunique between train and test
print(train.describe(include="O"), "\n")     # nunique differs between train and test, so label-encode the combined data
print(test.describe(include="O"), "\n")


# 2. Preprocessing
# 2-1) Fill missing values, drop unneeded columns such as id, split X and y
# Both train and test are filled with the train median
train["AnnualIncome"] = train["AnnualIncome"].fillna(train["AnnualIncome"].median())
test["AnnualIncome"] = test["AnnualIncome"].fillna(train["AnnualIncome"].median())

print(train.isnull().sum(), "\n")
print(test.isnull().sum(), "\n")

# Drop the id column from train
train = train.drop("id", axis=1)

# Keep the id column of test for the submission file
test_id = test.pop("id")

# Split X and y
y = train.pop("TravelInsurance")
print(train.head(1), "\n")
print(y.head(3), "\n")



# 2-2) Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Select the numeric columns
con_cols = train.select_dtypes(exclude="object").copy().columns
print("numeric columns:", con_cols)

# Fit on train only, then apply the same transform to test
train[con_cols] = scaler.fit_transform(train[con_cols])
test[con_cols] = scaler.transform(test[con_cols])

print(train.head(), "\n")
print(test.head(), "\n")



# 2-3) Encoding: label encoder
from sklearn.preprocessing import LabelEncoder

# Select only the categorical columns
cat_cols = train.select_dtypes(include="object").copy().columns
print("categorical columns:", cat_cols)

# A categorical column in test has a value that never appears in train,
# so concatenate train and test before encoding
n_train = train.shape[0]
df = pd.concat([train, test])

for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

# Split back into train and test by position
train = df.iloc[:n_train].copy()
test = df.iloc[n_train:].copy()

print(train.head(), "\n")
print(test.head(), "\n")



# 3. Hold out validation data
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train, y, random_state=1, test_size=0.2, stratify=y)
print(X_train.shape, X_val.shape)



# 4. Modeling
# 4-1) Classification metric
from sklearn.metrics import roc_auc_score

# 4-2) Random forest classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=2023)
rf.fit(X_train, y_train)

# We need the probability of purchase, so take the predicted probability of class 1
rf_pred_proba = rf.predict_proba(X_val)[:, 1]

# roc_auc_score
rf_roc = roc_auc_score(y_val, rf_pred_proba)
print("RandomForest roc : ", rf_roc) # 0.79



# 4-3) LightGBM classifier
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(random_state=2023, max_depth=3, n_estimators=300)
lgbm.fit(X_train, y_train)

# We need the probability of purchase, so take the predicted probability of class 1
lgbm_pred_proba = lgbm.predict_proba(X_val)[:, 1]

# roc_auc_score
lgbm_roc = roc_auc_score(y_val, lgbm_pred_proba)
print("lgbm roc : ", lgbm_roc) # 0.80



# # Hyperparameter tuning (kept commented out)
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import GridSearchCV
# from lightgbm import LGBMClassifier

# rf = RandomForestClassifier(random_state=2023)
# lgbm = LGBMClassifier(random_state=2023)

# models = [rf, lgbm]

# params = {"n_estimators":[100,200,300], "max_depth":[1,2,3]}

# for model in models :
#     gs = GridSearchCV(model, param_grid=params, cv=5, scoring="roc_auc", n_jobs=4)
#     gs.fit(X_train, y_train)
#
#     print("="*20)
#     print(f"model : {model}")
#     print(f"params : {gs.best_params_}")
#     print(f"score : {gs.best_score_}")


# Final model: LightGBM classifier
pred = lgbm.predict_proba(test)[:, 1]

# 5. Submission: build the DataFrame and write the csv
submit = pd.DataFrame({"id" : test_id, "TravelInsurance" : pred})
submit.to_csv("3_submission", index=False)

# Read the file back to check its contents
check = pd.read_csv("3_submission")
print(check.head())
```

Output (truncated):
```
(1490, 10)
      id  Age               Employment Type GraduateOrNot  AnnualIncome  \
0  10000   28  Private Sector/Self Employed           Yes     1250000.0   
1  10001   31  Private Sector/Self Employed           Yes     1250000.0   
2  10002   29  Private Sector/Self Employed           Yes     1200000.0   
3  10003   33             Government Sector           Yes      650000.0   
4  10004   28  Private Sector/Self Employed           Yes      800000.0   

   FamilyMembers  ChronicDiseases FrequentFlyer EverTravelledAbroad  \
0              6                1            No                  No   
1              7                1            No                  No   
2              7                0            No                  No   
3              6                1            No                  No   
4              6                0            No                 Yes   

   TravelInsurance  
0                0  
1                0  
2                1  
3                1  
4     
```
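
The modeling step takes column 1 of `predict_proba` as the probability of purchase. That holds when the classes are 0 and 1, because the columns follow the order of `classes_`. The sketch below, on a tiny synthetic data set that is purely illustrative, looks the column up explicitly instead of hard-coding it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic data set (hypothetical, for illustration only)
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]] * 5)
y_toy = np.array([0, 0, 1, 1] * 5)

clf = RandomForestClassifier(random_state=2023, n_estimators=10).fit(X_toy, y_toy)

# Columns of predict_proba are ordered like clf.classes_
pos_col = list(clf.classes_).index(1)
proba_pos = clf.predict_proba(X_toy)[:, pos_col]
print(clf.classes_, pos_col, proba_pos[:4])
```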

### Grid search
- Tuning does not seem to change the score much
```
====================
model : RandomForestClassifier(random_state=2023)
params : {'max_depth': 3, 'n_estimators': 300}
score : 0.793811369122429
====================
model : LGBMClassifier(random_state=2023)
params : {'max_depth': 3, 'n_estimators': 300}
score : 0.8097657948810022
```
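
For reference, a self-contained sketch of the same kind of search. It assumes synthetic data from `make_classification` and uses a smaller grid and fewer folds than the run above so that it finishes quickly. Only the random forest is included, and the scores it prints are unrelated to the log above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the competition data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Reduced grid (assumption: chosen here only to keep the run short)
params = {"n_estimators": [50, 100], "max_depth": [1, 2, 3]}

gs = GridSearchCV(
    RandomForestClassifier(random_state=2023),
    param_grid=params,
    cv=3,
    scoring="roc_auc",  # same metric as the competition
    n_jobs=1,
)
gs.fit(X_demo, y_demo)

print(f"params : {gs.best_params_}")
print(f"score : {gs.best_score_}")
```

`best_score_` is the mean cross-validated ROC-AUC of the best parameter set, so it is not directly comparable to a single hold-out score.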