<a href="https://colab.research.google.com/github/DAVID-hub02/ai_12_project/blob/main/n224a_model_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" src="https://ds-cs-images.s3.ap-northeast-2.amazonaws.com/Codestates_Fulllogo_Color.png" width=100>

## *AIB / SECTION 2 / SPRINT 2 / NOTE 4*

# 📝 Assignment
---

# Model Selection

### 1) Continue with the Kaggle competition. Use RandomizedSearchCV to tune hyperparameters.

- Use [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
- Use a [scoring parameter](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) metric that is appropriate for a classification problem.
- Using [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) is recommended.
- Tune hyperparameters with RandomizedSearchCV, predict with the best-performing model, and submit the result to Kaggle.
- **(Urclass Quiz) Submit your improved score from the Kaggle Leaderboard through the assignment submission form.**

In [None]:
import pandas as pd
target = 'vacc_h1n1_f'
train = pd.merge(pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/vacc_flu/train.csv'), 
                 pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/vacc_flu/train_labels.csv')[target], left_index=True, right_index=True)
test = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/vacc_flu/test.csv')
sample_submission = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/vacc_flu/submission.csv')

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

#Data preprocessing
from sklearn.model_selection import train_test_split

#Pipeline
from sklearn.pipeline import make_pipeline

#Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

#Encoder
from category_encoders import OrdinalEncoder

#Model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#Imputer
from sklearn.impute import SimpleImputer

In [None]:
train, val = train_test_split(train, train_size=0.80, test_size=0.20, stratify=train[target], random_state=2)

train.shape, val.shape, test.shape

((33723, 39), (8431, 39), (28104, 38))

In [None]:
def engineer(df):
  # Drop high-cardinality features
  df = df.drop(['state', 'employment_industry', 'employment_occupation'], axis = 1)

  # Create a new feature by summing the 'behavioral' columns
  behaviorals = [col for col in df.columns if 'behavioral' in col]
  df['behavior_tot'] = df[behaviorals].sum(axis=1)
  df = df.drop(behaviorals, axis = 1)

  # Create a new 'people' feature from the household-size columns
  df['people'] = df['n_adult_r']+df['household_children']+df['n_people_r']
  df = df.drop(['n_adult_r', 'household_children', 'n_people_r'], axis=1)

  # Remove the seasonal-vaccine variables
  dels = [col for col in df.columns if 'seas' in col]
  df = df.drop(columns=dels)

  # Treat code 4 in 'inc_pov' as missing (assign back; avoids chained inplace replace)
  df['inc_pov'] = df['inc_pov'].replace(4, np.nan)

  return df

In [None]:
train = engineer(train)
val = engineer(val)
test = engineer(test)

In [None]:
features = train.drop([target], axis = 1).columns

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

In [None]:
# Use a pipeline
pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(strategy = "most_frequent"),
    RandomForestClassifier(random_state=10, n_jobs=-1, oob_score=True)
)

pipe.fit(X_train, y_train)

print('Validation accuracy : ', pipe.score(X_val, y_val))
print('Training accuracy : ', pipe.score(X_train, y_train))

y_pred_train = pipe.predict(X_train)
y_pred_val = pipe.predict(X_val)

print('Validation F1-score : ', f1_score(y_val, y_pred_val))
print('Training F1-score : ', f1_score(y_train, y_pred_train))

Validation accuracy :  0.804293678092753
Training accuracy :  0.9807549743498503
Validation F1-score :  0.5036101083032491
Training F1-score :  0.95922598479613
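
The gap between a validation accuracy near 0.80 and a validation F1 near 0.50 is what one would expect when the positive class is the minority: accuracy is dominated by the majority class, while F1 only rewards correct positive predictions. The following is a minimal standard-library sketch of that effect. The toy labels are made up for illustration, and `accuracy` and `f1_binary` are hypothetical helpers, not scikit-learn functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): 8 negatives and 2 positives; the model finds one positive
# and raises one false alarm.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("f1:", f1_binary(y_true, y_pred))
```

On this toy data the classifier is right 8 times out of 10, yet it recovers only half of the positives with half of its positive calls being wrong, so F1 is much lower than accuracy.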


In [None]:
### Do the assignment here ###
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

In [None]:
pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'ordinalencoder', 'simpleimputer', 'randomforestclassifier', 'ordinalencoder__cols', 'ordinalencoder__drop_invariant', 'ordinalencoder__handle_missing', 'ordinalencoder__handle_unknown', 'ordinalencoder__mapping', 'ordinalencoder__return_df', 'ordinalencoder__verbose', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'randomforestclassifier__bootstrap', 'randomforestclassifier__ccp_alpha', 'randomforestclassifier__class_weight', 'randomforestclassifier__criterion', 'randomforestclassifier__max_depth', 'randomforestclassifier__max_features', 'randomforestclassifier__max_leaf_nodes', 'randomforestclassifier__max_samples', 'randomforestclassifier__min_impurity_decrease', 'randomforestclassifier__min_samples_leaf', 'randomforestclassifier__min_samples_split', 'randomforestclassifier__min_weight_fraction_leaf', 'randomforestclassif

In [None]:
dists = {
    #'ordinalEncoder__smoothing': [2.,20.,50.,60.,100.,500.,1000.], # passing ints raises an error (bug)
    'simpleimputer__strategy': ['most_frequent'], 
    'randomforestclassifier__n_estimators': randint(50, 100), 
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None], 
    'randomforestclassifier__min_samples_leaf': randint(1, 10),  
    'randomforestclassifier__max_features': uniform(0, 1) # fraction of features considered at each split
}

RD_cv = RandomizedSearchCV(pipe, param_distributions = dists, n_iter = 10, cv = 3, n_jobs=-1, scoring='f1', verbose=1)

RD_cv.fit(X_train, y_train);

Fitting 3 folds for each of 10 candidates, totalling 30 fits
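
The log line above (10 candidates, 3 folds, 30 fits) reflects how a randomized search works: it draws `n_iter` settings from the given distributions, scores each one, and keeps the best. A rough standard-library sketch of that sample-and-select loop follows. `sample_candidate`, `toy_score`, and `random_search` are hypothetical names, and `toy_score` is a made-up stand-in for a mean cross-validated score, not a real model fit.

```python
import random


def sample_candidate(rng):
    """Draw one hyperparameter setting, mirroring the shape of `dists`."""
    return {
        "n_estimators": rng.randrange(50, 100),      # integers 50..99
        "max_depth": rng.choice([5, 10, 15, 20, None]),
        "min_samples_leaf": rng.randrange(1, 10),    # integers 1..9
        "max_features": rng.uniform(0.0, 1.0),       # float between 0 and 1
    }


def toy_score(params):
    """Stand-in for a mean cross-validated score (assumption, not a model)."""
    depth = params["max_depth"] if params["max_depth"] is not None else 30
    return 1.0 - abs(depth - 10) / 30 - abs(params["max_features"] - 0.8)


def random_search(n_iter, seed=0):
    """Score n_iter random candidates and return the best (score, params)."""
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(n_iter)]
    scored = [(toy_score(p), p) for p in candidates]
    return max(scored, key=lambda pair: pair[0])


best_score, best_params = random_search(n_iter=10)
print(best_score, best_params)
```

Fixing the seed makes the draw repeatable, which is one reason to consider passing `random_state` to the real search as well.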


In [None]:
print('Best hyperparameters: ', RD_cv.best_params_)
print('Best CV F1: ', RD_cv.best_score_)

Best hyperparameters:  {'randomforestclassifier__max_depth': 10, 'randomforestclassifier__max_features': 0.878004223034204, 'randomforestclassifier__min_samples_leaf': 9, 'randomforestclassifier__n_estimators': 81, 'simpleimputer__strategy': 'most_frequent'}
Best CV F1:  0.5367266003109649


In [None]:
### Do the assignment here ###
pd.DataFrame(RD_cv.cv_results_).sort_values(by='rank_test_score').T

| | 4 | 8 | 6 | 3 | 1 | 7 | 2 | 9 | 0 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|
| `mean_fit_time` | 2.056969 | 1.238646 | 1.927202 | 1.277174 | 1.559824 | 0.919838 | 1.464326 | 0.827321 | 1.474916 | 1.817179 |
| `std_fit_time` | 0.004991 | 0.029711 | 0.20818 | 0.151506 | 0.043933 | 0.01998 | 0.013721 | 0.010288 | 0.019028 | 0.047561 |
| `mean_score_time` | 0.243723 | 0.03037 | 0.22862 | 0.160312 | 0.065672 | 0.299441 | 0.126219 | 0.067318 | 0.112345 | 0.22802 |
| `std_score_time` | 0.015099 | 0.004847 | 0.057133 | 0.08595 | 0.01616 | 0.05645 | 0.00718 | 0.048806 | 0.010708 | 0.049128 |
| `param_randomforestclassifier__max_depth` | 10 | 10 | 15 | 20 | 15 | None | 10 | None | None | 20 |
| `param_randomforestclassifier__max_features` | 0.878004 | 0.571467 | 0.710158 | 0.341154 | 0.805686 | 0.242493 | 0.259261 | 0.215593 | 0.203571 | 0.056754 |
| `param_randomforestclassifier__min_samples_leaf` | 9 | 6 | 4 | 9 | 4 | 7 | 9 | 8 | 2 | 2 |
| `param_randomforestclassifier__n_estimators` | 81 | 70 | 79 | 62 | 56 | 64 | 91 | 57 | 67 | 96 |
| `param_simpleimputer__strategy` | most_frequent | most_frequent | most_frequent | most_frequent | most_frequent | most_frequent | most_frequent | most_frequent | most_frequent | most_frequent |
| `params` | `{'randomforestclassifier__max_depth': 10, 'ran...` | `{'randomforestclassifier__max_depth': 10, 'ran...` | `{'randomforestclassifier__max_depth': 15, 'ran...` | `{'randomforestclassifier__max_depth': 20, 'ran...` | `{'randomforestclassifier__max_depth': 15, 'ran...` | `{'randomforestclassifier__max_depth': None, 'r...` | `{'randomforestclassifier__max_depth': 10, 'ran...` | `{'randomforestclassifier__max_depth': None, 'r...` | `{'randomforestclassifier__max_depth': None, 'r...` | `{'randomforestclassifier__max_depth': 20, 'ran...` |


In [None]:
pipe = RD_cv.best_estimator_

In [None]:
RD_cv.best_estimator_

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['opinion_h1n1_vacc_effective',
                                      'opinion_h1n1_risk',
                                      'opinion_h1n1_sick_from_vacc', 'agegrp',
                                      'employment_status', 'census_msa'],
                                mapping=[{'col': 'opinion_h1n1_vacc_effective',
                                          'data_type': dtype('O'),
                                          'mapping': Somewhat Effective      1
Not Very Effective      2
Very Effective          3
Not At All Effective    4
Dont Know               5
NaN                     6
Refused                 7
dtype...
                                         {'col': 'census_msa',
                                          'data_type': dtype('O'),
                                          'mapping': MSA, Not Principle City    1
Non-MSA                    2
MSA, Principle City        3
NaN                    

In [None]:
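# NOTE: the tuned pipeline already includes an OrdinalEncoder, so this extra
# LabelEncoder pass is redundant; fitted per split, its codes can also differ.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
X_train = X_train.apply(le.fit_transform)

In [None]:
# NOTE: this refits on the test split instead of reusing the training mapping.
X_test = X_test.apply(le.fit_transform)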
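
Calling `fit_transform` separately on the training and test frames learns a fresh mapping each time, so the same category can end up with a different integer in each split. The sketch below illustrates the issue with plain Python. `fit_codes` and `transform` are hypothetical helpers that assign codes in sorted order of the unique values (the ordering is an assumption made for this example), and the category strings are made up.

```python
def fit_codes(values):
    """Learn a category -> integer mapping from the sorted unique values."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform(values, mapping, unknown=-1):
    """Apply a learned mapping; unseen categories get the `unknown` code."""
    return [mapping.get(v, unknown) for v in values]


# Toy columns (made up): the test split happens to lack the "Employed" category.
train_col = ["Employed", "Not in Labor Force", "Unemployed"]
test_col = ["Not in Labor Force", "Unemployed"]

train_mapping = fit_codes(train_col)

# Refitting on the test split shifts the codes...
refit = transform(test_col, fit_codes(test_col))
# ...while reusing the training mapping keeps them aligned with training.
reused = transform(test_col, train_mapping)

print("refit on test:      ", refit)
print("training mapping:   ", reused)
```

Fitting the encoder once on the training data and only transforming the other splits, which is what a fitted pipeline does internally, avoids this mismatch.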

In [None]:
test_predict = pipe.predict(X_test)
test_predict = pd.DataFrame(test_predict, columns=['vacc_h1n1_f'])
test_predict

| | vacc_h1n1_f |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 28099 | 0 |
| 28100 | 0 |
| 28101 | 0 |
| 28102 | 0 |


In [None]:
# Save predictions from the model tuned with RandomizedSearchCV
test_predict.to_csv('/Users/jsshin/Desktop/Code state/Section2/N22x/N224/RDcv_test_predict3_3.csv', index_label = 'id', header = True)

In [None]:
y_pred_proba = pipe.predict_proba(X_val)[:, 1]
y_pred_proba

array([0.19706684, 0.20582328, 0.17812013, ..., 0.21454719, 0.02367017,
       0.30183857])

In [None]:
# Find the optimal threshold
from sklearn.metrics import roc_curve

# roc_curve(true labels, predicted probability of class 1)
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)

roc = pd.DataFrame({
    'FPR(Fall-out)': fpr, 
    'TPRate(Recall)': tpr, 
    'Threshold': thresholds
})


# Index of the largest TPR - FPR value, found with np.argmax()
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

print('idx:', optimal_idx, ', threshold:', optimal_threshold)

roc.head()

idx: 1223 , threshold: 0.24248271714416153


| | FPR(Fall-out) | TPRate(Recall) | Threshold |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.930643 |
| 1 | 0.0 | 0.000496 | 0.930643 |
| 2 | 0.0 | 0.004963 | 0.877783 |
| 3 | 0.000156 | 0.004963 | 0.876183 |
| 4 | 0.000156 | 0.012407 | 0.853592 |
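
Picking the threshold where `tpr - fpr` is largest is commonly known as maximizing Youden's J statistic. The sketch below reproduces that selection with plain lists so the logic is visible without NumPy. The ROC points are made up for illustration, and `youden_threshold` is a hypothetical helper name.

```python
def youden_threshold(fpr, tpr, thresholds):
    """Return (index, threshold) that maximizes J = TPR - FPR."""
    j_scores = [t - f for t, f in zip(tpr, fpr)]
    # max() returns the first maximal index, like np.argmax does.
    best_idx = max(range(len(j_scores)), key=j_scores.__getitem__)
    return best_idx, thresholds[best_idx]


# Toy ROC points (made up), ordered by decreasing threshold.
fpr = [0.0, 0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.2, 0.5, 0.8, 0.9, 1.0]
thresholds = [1.9, 0.9, 0.6, 0.25, 0.1, 0.0]

idx, thr = youden_threshold(fpr, tpr, thresholds)
print("idx:", idx, ", threshold:", thr)

# Turn probabilities into labels with the chosen cut-off (strictly greater than).
probs = [0.10, 0.30, 0.25]
labels = [int(p > thr) for p in probs]
print(labels)
```

Note that a threshold chosen this way trades precision for recall; it maximizes TPR minus FPR, which is not necessarily the threshold that maximizes F1.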


In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

threshold = 0.242482717
y_pred_proba = pipe.predict_proba(X_train)[:,1]
train_predict = y_pred_proba > threshold

print(classification_report(y_train, train_predict))
print('Training accuracy (default threshold):', pipe.score(X_train, y_train))
print('Training f1_score:',f1_score(y_train, train_predict))

              precision    recall  f1-score   support

           0       0.85      0.78      0.82     25661
           1       0.45      0.57      0.50      8062

    accuracy                           0.73     33723
   macro avg       0.65      0.68      0.66     33723
weighted avg       0.76      0.73      0.74     33723

Training accuracy (default threshold): 0.757228004625923
Training f1_score: 0.5041563446187722


In [None]:
y_pred_proba = pipe.predict_proba(X_test)[:, 1]
test_predict = 1*(y_pred_proba > threshold)

In [None]:
test_predict = pd.DataFrame(test_predict, columns=['vacc_h1n1_f'])
test_predict

| | vacc_h1n1_f |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 28099 | 0 |
| 28100 | 0 |
| 28101 | 0 |
| 28102 | 0 |


In [None]:
# Save predictions from the model tuned with RandomizedSearchCV (custom threshold applied)
test_predict.to_csv('/Users/jsshin/Desktop/Code state/Section2/N22x/N224/RDcv_test_predict3_4.csv', index_label = 'id', header = True)

#0.5004

## 🔥 Challenge (Github - Discussion)


### 2) Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to tune hyperparameters.
- Try everything you can to improve model performance.
- Explain which feature engineering step or hyperparameter setting improved performance the most and why it had such a large effect, then share your results and discuss them with each other.
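
Unlike a randomized search, a grid search evaluates every combination of the listed values, so the number of fits grows multiplicatively with each added parameter. The following sketch enumerates such a grid with the standard library only. The grid values are chosen purely for illustration, and `expand_grid` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

# Hypothetical grid (values chosen for illustration only).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}


def expand_grid(grid):
    """Yield every combination of the listed values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(param_grid))
cv = 3  # number of folds, as in the randomized search above
print(len(candidates), "candidates,", len(candidates) * cv, "fits with cv =", cv)
```

Counting the fits up front this way can help decide whether a full grid is affordable or whether a coarse randomized search followed by a narrow grid is the better plan.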



In [None]:
### Do the assignment here ###