<img align="right" src="https://ds-cs-images.s3.ap-northeast-2.amazonaws.com/Codestates_Fulllogo_Color.png" width=100>

## *AIB / SECTION 2 / SPRINT 2 / NOTE 4*

# 📝 Assignment
---

# Model Selection

### 1) Continue with the Kaggle competition. Use RandomizedSearchCV to tune hyperparameters.

- Use [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
- Use a [scoring parameter](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) metric that is appropriate for a classification problem.
- Using [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) is recommended.
- Tune hyperparameters with RandomizedSearchCV, predict with the best-performing model, and submit the predictions to Kaggle.
- **(Urclass Quiz) Submit your improved Kaggle Leaderboard score through the assignment submission form.**
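
The choice of scoring metric matters most when the classes are imbalanced. The following is a minimal sketch in plain Python, independent of scikit-learn, that compares accuracy and F1 on a small made-up label set. The labels, the two prediction lists, and the helper names `accuracy` and `f1_binary` are all hypothetical and exist only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that never predicts the positive class.
always_negative = [0] * 10

# A model that finds one positive, misses one, and raises one false alarm.
partial = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy(y_true, always_negative), f1_binary(y_true, always_negative))
print(accuracy(y_true, partial), f1_binary(y_true, partial))
```

Both prediction lists reach the same accuracy, yet only the second has a non-zero F1. This is one reason to prefer `scoring="f1"` over accuracy when the positive class is rare.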

In [None]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.3.0-py2.py3-none-any.whl (82 kB)
     |████████████████████████████████| 82 kB 119 kB/s             
Installing collected packages: category-encoders
Successfully installed category-encoders-2.3.0
You should consider upgrading via the '/root/.pyenv/versions/3.8.1/bin/python3.8 -m pip install --upgrade pip' command.


In [None]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay



In [None]:
# Load data
target = "vacc_h1n1_f"

train = pd.merge(pd.read_csv("train.csv"),
                 pd.read_csv("train_labels.csv", usecols = [0])[target], left_index = True, right_index = True)
X_test = pd.read_csv("test.csv")
y_test = pd.read_csv("submission.csv", index_col = [0])

In [None]:
# Inspect the training data
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42154 entries, 0 to 42153
Data columns (total 39 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 33734 non-null  float64
 1   h1n1_knowledge               33734 non-null  float64
 2   behavioral_antiviral_meds    33635 non-null  float64
 3   behavioral_avoidance         33482 non-null  float64
 4   behavioral_face_mask         33710 non-null  float64
 5   behavioral_wash_hands        33683 non-null  float64
 6   behavioral_large_gatherings  33640 non-null  float64
 7   behavioral_outside_home      33633 non-null  float64
 8   behavioral_touch_face        33571 non-null  float64
 9   doctor_recc_h1n1             40269 non-null  float64
 10  doctor_recc_seasonal         40269 non-null  float64
 11  chronic_med_condition        40837 non-null  float64
 12  child_under_6_months         32705 non-null  float64
 13  health_insurance

In [None]:
# Inspect the test data
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28104 entries, 0 to 28103
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 22492 non-null  float64
 1   h1n1_knowledge               22492 non-null  float64
 2   behavioral_antiviral_meds    22432 non-null  float64
 3   behavioral_avoidance         22294 non-null  float64
 4   behavioral_face_mask         22478 non-null  float64
 5   behavioral_wash_hands        22456 non-null  float64
 6   behavioral_large_gatherings  22421 non-null  float64
 7   behavioral_outside_home      22422 non-null  float64
 8   behavioral_touch_face        22384 non-null  float64
 9   doctor_recc_h1n1             26897 non-null  float64
 10  doctor_recc_seasonal         26897 non-null  float64
 11  chronic_med_condition        27250 non-null  float64
 12  child_under_6_months         21813 non-null  float64
 13  health_insurance

In [None]:
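# Feature engineering: add a behavioral total column and drop the seasonal columns
def FeatureEngineering(df):
  df = df.copy()  # work on a copy so the caller's DataFrame is not modified
  label_b = [col for col in df.columns if "behavioral" in col]
  df["behavioral_total"] = df[label_b].sum(axis = 1)

  label_s = [col for col in df.columns if "sea" in col]
  df = df.drop(columns = label_s)
  return df

# The function returns a new DataFrame, so reassign the results
train = FeatureEngineering(train)
X_test = FeatureEngineering(X_test)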

In [None]:
# Split the training data into features and target
X_train = train.drop(columns = target)
y_train = train[target]

In [None]:
# Build the model pipeline
pipe = Pipeline([
    ("ordinalencoder", OrdinalEncoder()),
    ("simpleimputer", SimpleImputer(strategy = "constant", fill_value = 0)),
    ("randomforestclassifier", RandomForestClassifier(n_estimators = 100, random_state = 29, n_jobs = -1))
    ])

In [None]:
# Check k-fold cross-validation scores
k = 3
scores = cross_val_score(pipe, X_train, y_train, cv = k, scoring = "f1")
print(f'f1 score mean for {k} folds:', scores.mean())
print(f'f1 score std for {k} folds:', scores.std())

f1 score mean for 3 folds: 0.5540963759960126
f1 score std for 3 folds: 0.001075314902438931
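
To make the fold mechanics concrete, here is a simplified sketch of how a dataset can be partitioned into k validation folds. It is plain, unshuffled k-fold written in pure Python. The helper name `kfold_indices` is hypothetical, and the sketch is not the splitter that `cross_val_score` uses (for a classifier with an integer `cv`, scikit-learn uses a stratified variant that also balances class proportions per fold).

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Hypothetical toy size: 10 samples split into 3 folds.
folds = list(kfold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in exactly one validation fold, so every row is used for validation once and for training k - 1 times.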


In [None]:
# Specify the hyperparameter ranges to tune with Grid Search
dists = {
    "simpleimputer__strategy": ["mean", "median", "most_frequent", "constant"], 
    "randomforestclassifier__n_estimators" : range(100,400, 100),
    "randomforestclassifier__max_depth" : range(10,30, 5)
    }

k = 3

# Hyperparameter tuning
gscv = GridSearchCV(
    pipe,
    param_grid = dists, 
    cv = k,
    scoring = 'f1',
    verbose = 1,
    n_jobs = -1
    )

gscv.fit(X_train, y_train)

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:  1.5min finished


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('ordinalencoder', OrdinalEncoder()),
                                       ('simpleimputer',
                                        SimpleImputer(fill_value=0,
                                                      strategy='constant')),
                                       ('randomforestclassifier',
                                        RandomForestClassifier(n_jobs=-1,
                                                               random_state=29))]),
             n_jobs=-1,
             param_grid={'randomforestclassifier__max_depth': range(10, 30, 5),
                         'randomforestclassifier__n_estimators': range(100, 400, 100),
                         'simpleimputer__strategy': ['mean', 'median',
                                                     'most_frequent',
                                                     'constant']},
             scoring='f1', verbose=1)
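
The "48 candidates, totalling 144 fits" reported in the log follows directly from the grid definition. As a rough check, the sketch below enumerates the same grid with `itertools.product` in plain Python. The variable names are illustrative and nothing here calls scikit-learn.

```python
from itertools import product

# Same grid as the GridSearchCV cell above.
param_grid = {
    "simpleimputer__strategy": ["mean", "median", "most_frequent", "constant"],
    "randomforestclassifier__n_estimators": range(100, 400, 100),  # 100, 200, 300
    "randomforestclassifier__max_depth": range(10, 30, 5),         # 10, 15, 20, 25
}
k = 3  # number of cross-validation folds

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)
n_fits = n_candidates * k  # each candidate is fitted once per fold

print(n_candidates, n_fits)  # 48 144
```

Because the count is a product of the list lengths, adding one more parameter with a few values multiplies the cost of the whole search.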

In [None]:
# Check the tuned hyperparameters
print('Best hyperparameters: ', gscv.best_params_)
print('f1: ', gscv.best_score_)

Best hyperparameters:  {'randomforestclassifier__max_depth': 15, 'randomforestclassifier__n_estimators': 100, 'simpleimputer__strategy': 'mean'}
f1:  0.5615310771339748


In [None]:
# Specify the hyperparameter ranges to tune with Random Search
dists = {
    "simpleimputer__strategy": ["mean", "median", "most_frequent", "constant"], 
    "randomforestclassifier__n_estimators" : range(50,150),
    "randomforestclassifier__max_depth" : range(10,20)
    }
    
k = 3

# Hyperparameter tuning
clf = RandomizedSearchCV(
    pipe,
    param_distributions = dists, 
    n_iter = 50,
    cv = k,
    scoring = 'f1',
    verbose = 2,
    n_jobs = -1
    )

clf.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    4.4s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:   43.5s finished


RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder()),
                                             ('simpleimputer',
                                              SimpleImputer(fill_value=0,
                                                            strategy='constant')),
                                             ('randomforestclassifier',
                                              RandomForestClassifier(n_jobs=-1,
                                                                     random_state=29))]),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'randomforestclassifier__max_depth': range(10, 20),
                                        'randomforestclassifier__n_estimators': range(50, 150),
                                        'simpleimputer__strategy': ['mean',
                                                                    'median',
 

In [None]:
# Check the tuned hyperparameters
print('Best hyperparameters: ', clf.best_params_)
print('f1: ', clf.best_score_)

Best hyperparameters:  {'simpleimputer__strategy': 'constant', 'randomforestclassifier__n_estimators': 94, 'randomforestclassifier__max_depth': 18}
f1:  0.5602403446253788
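
The random search above evaluates only a small part of its search space. The sketch below, in plain Python, counts the full space and draws `n_iter` distinct combinations from it. Sampling without replacement and the fixed seed are choices made for this illustration, not a statement about how RandomizedSearchCV samples internally.

```python
import random
from itertools import product

# Same search space as the RandomizedSearchCV cell above.
space = {
    "simpleimputer__strategy": ["mean", "median", "most_frequent", "constant"],
    "randomforestclassifier__n_estimators": range(50, 150),
    "randomforestclassifier__max_depth": range(10, 20),
}
n_iter = 50

# Enumerate every possible combination, then draw n_iter distinct ones.
all_combinations = [
    dict(zip(space, values)) for values in product(*space.values())
]
total = len(all_combinations)

rng = random.Random(0)  # fixed seed so the illustration is repeatable
samples = rng.sample(all_combinations, n_iter)

coverage = n_iter / total
print(f"{n_iter} of {total} combinations sampled ({coverage:.2%})")
```

With 4 x 100 x 10 combinations, 50 iterations cover 1.25% of the space, which is consistent with random search finishing faster than the grid search while reaching a similar F1 score.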


In [None]:
# Load the best-performing model found by the search.
pipe = clf.best_estimator_

# Predict on the test data and save the result
y_pred = pipe.predict(X_test)
y_test["vacc_h1n1_f"] = y_pred 
y_test.to_csv("submission_4.csv")

## 🔥 Challenge (Github - Discussion)


### 2) Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to tune hyperparameters.
- Try everything you can to improve model performance.
- Explain which feature engineering step or hyperparameter tuning choice had the largest effect on model performance and why, then share and discuss your results with one another.



In [None]:
### Complete the assignment here ###