### Note
**Models used**
- Logistic Regression
- Decision Tree
- Multi-layer Perceptron classifier (multi-layer neural network)

### import modules

In [32]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd

# Model evaluation metric => F1 score
from sklearn.metrics import f1_score

# Hyper parameter tuning tool: Grid Search => Parameter Grid
from sklearn.model_selection import ParameterGrid
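
Since every model below is compared by F1 score, here is a minimal hand-rolled sketch of what that number measures. The helper `f1_by_hand` and the toy labels are made up for illustration; `f1_score` itself defaults to `pos_label=1`, so with the encoding used later in this notebook the positive class is `R` (rock).

```python
def f1_by_hand(y_true, y_pred, pos_label=1):
    """Compute F1 for one positive class from raw counts (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        # No true positives: F1 is reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up): 3 positives, 3 negatives.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1]
score = f1_by_hand(y_true, y_pred)
print(score)
```

A model that never predicts the positive class gets an F1 of 0 no matter how many negatives it classifies correctly.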

### Load the data
Data from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks))

In [10]:
path_data = 'http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
df = pd.read_csv(path_data, header = None)

# set columns name
df.columns = ['Band'+str(i) for i in range(1, 61)] + ['Y']

# View the data
df.tail()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 203 | 0.0187 | 0.0346 | 0.0168 | 0.0177 | 0.0393 | 0.163 | 0.2028 | 0.1694 | 0.2328 | 0.2684 | ... | 0.0116 | 0.0098 | 0.0199 | 0.0033 | 0.0101 | 0.0065 | 0.0115 | 0.0193 | 0.0157 | M |
| 204 | 0.0323 | 0.0101 | 0.0298 | 0.0564 | 0.076 | 0.0958 | 0.099 | 0.1018 | 0.103 | 0.2154 | ... | 0.0061 | 0.0093 | 0.0135 | 0.0063 | 0.0063 | 0.0034 | 0.0032 | 0.0062 | 0.0067 | M |
| 205 | 0.0522 | 0.0437 | 0.018 | 0.0292 | 0.0351 | 0.1171 | 0.1257 | 0.1178 | 0.1258 | 0.2529 | ... | 0.016 | 0.0029 | 0.0051 | 0.0062 | 0.0089 | 0.014 | 0.0138 | 0.0077 | 0.0031 | M |
| 206 | 0.0303 | 0.0353 | 0.049 | 0.0608 | 0.0167 | 0.1354 | 0.1465 | 0.1123 | 0.1945 | 0.2354 | ... | 0.0086 | 0.0046 | 0.0126 | 0.0036 | 0.0035 | 0.0034 | 0.0079 | 0.0036 | 0.0048 | M |
| 207 | 0.026 | 0.0363 | 0.0136 | 0.0272 | 0.0214 | 0.0338 | 0.0655 | 0.14 | 0.1843 | 0.2354 | ... | 0.0146 | 0.0129 | 0.0047 | 0.0039 | 0.0061 | 0.004 | 0.0036 | 0.0061 | 0.0115 | M |


### Simple data cleaning 

In [23]:
# Separate the feature columns from the label
X = df.drop('Y', axis = 1)
Y = df['Y']

In [24]:
# Split into training data and evaluation data
from sklearn.model_selection import train_test_split
Train_X, Test_X, Train_Y, Test_Y = train_test_split(X, Y, random_state = 42)

In [28]:
print(Train_X.shape) # 156 samples, 60 features => a simple model is needed (in general)
print(Test_X.shape)

(156, 60)
(52, 60)
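
The 156 / 52 split follows from the default `test_size` of `train_test_split`, which is 0.25 when neither `test_size` nor `train_size` is given. A minimal sketch with a placeholder array of the same shape (the zeros and the `*_demo` names are stand-ins, not the sonar data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the sonar feature matrix (208 x 60).
X_demo = np.zeros((208, 60))
y_demo = np.arange(208)

# No test_size given, so 25% of the rows go to the evaluation set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)
```

With only 52 evaluation samples, a single misclassified sample moves the score noticeably, which is worth keeping in mind when comparing the F1 values below.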


In [30]:
# Check the frequency of each label category => the classes are reasonably balanced
print(Train_Y.value_counts())
print(Test_Y.value_counts())

-1    81
 1    75
Name: Y, dtype: int64
-1    30
 1    22
Name: Y, dtype: int64


In [29]:
# Convert the labels to int
# so they can be fed to the models
Train_Y.replace({"M":-1, "R":1}, inplace = True)
Test_Y.replace({"M":-1, "R":1}, inplace = True)
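
As an alternative to the in-place `replace`, the same encoding can be sketched with `Series.map`, which returns a new Series instead of mutating the split result. The toy `labels` Series below is made up for illustration.

```python
import pandas as pd

# Toy label column (made up); the real notebook uses Train_Y / Test_Y.
labels = pd.Series(["M", "R", "R", "M"], name="Y")

# Map mine -> -1 and rock -> 1, the same encoding as in the cell above.
encoded = labels.map({"M": -1, "R": 1})
print(encoded.tolist())
```

One difference to be aware of: `map` turns any value missing from the dictionary into `NaN`, while `replace` leaves it unchanged.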

## Modeling with Parameter tuning

### [Case 1] Logistic Regression
- One complexity parameter
- Simple
- Involves randomness


In [35]:
from sklearn.linear_model import LogisticRegression

**Introduction to Hyper parameter of Logistic Reg.**
- **C** : float, default=1.0
    Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.  
    
    Proportional to model complexity: a larger C means weaker regularization, which lets the regression coefficients grow larger.
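
To make the role of `C` concrete, the sketch below fits the same model with three values of `C` on synthetic data and compares the size of the learned coefficients. The data from `make_classification` and the chosen `C` values are assumptions for illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the sonar set).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=10000).fit(X_demo, y_demo)
    # L2 norm of the coefficient vector: a rough measure of how "free" the model is.
    norms[C] = float(np.linalg.norm(model.coef_))
    print("C = {}: ||coef|| = {:.3f}".format(C, norms[C]))
```

Smaller `C` means stronger regularization and therefore smaller coefficients, so the printed norm should grow as `C` grows.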
    



In [44]:
def LR_model_test(C):
    
    # random_state should also be tried with a few different values
    model = LogisticRegression(
        C = C, # complexity parameter
        max_iter = 100000, # simple model => a large max_iter is fine
        random_state = 10) # ideally repeat with several seeds
    model.fit(Train_X, Train_Y) 
    pred_Y = model.predict(Test_X)
    
    return f1_score(Test_Y, pred_Y)

In [45]:
# Search over a wide range first
C_list = [0.1, 0.3, 0.5, 1, 5, 10, 30, 50]

for C in C_list : 
    print('C = {}:  \t{}'.format(C, LR_model_test(C)))

C = 0.1:  	0.7727272727272727
C = 0.3:  	0.8085106382978724
C = 0.5:  	0.7916666666666666
C = 1:  	0.7916666666666666
C = 5:  	0.8260869565217391
C = 10:  	0.8333333333333333
C = 30:  	0.8333333333333333
C = 50:  	0.8333333333333333


The range 0.1 < C < 50 is too wide.  
So we search again locally within 0.1 < C < 1.

Grid search using ParameterGrid
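
`ParameterGrid` only enumerates the Cartesian product of the listed values as dictionaries; it does not fit or score anything by itself. A small sketch with made-up values:

```python
from sklearn.model_selection import ParameterGrid

# Made-up values, chosen only to show how the combinations are generated.
grid = ParameterGrid({"C": [0.1, 0.5, 1.0], "random_state": [10, 11]})

print(len(grid))  # number of combinations: 3 * 2
for params in grid:
    # Each item is a plain dict that can be passed to a model as **params.
    print(params)
```

This is why the loops in this notebook unpack each item with `**parameter` when constructing the model.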

In [46]:
# Set up the parameter grid
LR_parameter_grid = ParameterGrid({"C":np.linspace(0.1, 1, 50),
                                  "max_iter":[100000],
                                  "random_state":[10]})

# Run the parameter tuning
best_score = -1
for parameter in LR_parameter_grid:
    model = LogisticRegression(**parameter).fit(Train_X, Train_Y)
    pred_Y = model.predict(Test_X)
    score = f1_score(Test_Y, pred_Y)
    
    if score > best_score:
        best_score = score
        best_parameter = parameter

print(best_parameter, best_score)

{'C': 0.17346938775510207, 'max_iter': 100000, 'random_state': 10} 0.8260869565217391
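
The same fit / predict / score / keep-best loop is repeated for every model in this notebook, so it can be wrapped in one helper. The sketch below is one possible shape for it; `manual_grid_search` and the synthetic data are hypothetical names and stand-ins, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split


def manual_grid_search(model_class, param_grid, X_fit, y_fit, X_eval, y_eval):
    """Fit one model per parameter combination and return the best (params, score)."""
    best_score, best_params = -1.0, None
    for params in ParameterGrid(param_grid):
        model = model_class(**params).fit(X_fit, y_fit)
        score = f1_score(y_eval, model.predict(X_eval))
        if score > best_score:  # strict '>' keeps the first of any tied combinations
            best_score, best_params = score, params
    return best_params, best_score


# Synthetic stand-in data (an assumption; not the sonar set).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(X_demo, y_demo, random_state=42)

best_params, best_score = manual_grid_search(
    LogisticRegression,
    {"C": [0.1, 1, 10], "max_iter": [10000]},
    X_fit, y_fit, X_eval, y_eval,
)
print(best_params, best_score)
```

Because the same evaluation data both selects the parameters and reports the score, the best score found this way is likely to be optimistic; a separate validation split is one common remedy.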


### [Case 2] Decision Tree
- Two complexity parameters
- Simple
- Almost no randomness

In [47]:
from sklearn.tree import DecisionTreeClassifier

**Introduction to Hyper parameter of <u>Decision Tree</u>**
- **max_depth** : int, default=None  
    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
    
    **Proportional** to complexity; this is the depth of the tree. To avoid overfitting it is usually set to **4** or less.
    
    
- **min_samples_leaf** : int or float, default=1  
    The minimum number of samples required to be at a leaf node.

    If int, then consider min_samples_leaf as the minimum number.
    If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
    
    **Inversely** related to complexity: larger values give fewer, larger leaves.
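
The effect of the two parameters on tree size can be checked directly with `get_depth()` and `get_n_leaves()`. The sketch below uses synthetic, deliberately noisy data (an assumption, not the sonar set) and fixes `random_state` so the trees are reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, flip_y=0.1, random_state=0)

shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
big_leaf = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X_demo, y_demo)
full = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

for name, tree in [("max_depth=2", shallow),
                   ("min_samples_leaf=20", big_leaf),
                   ("defaults", full)]:
    print("{}: depth={}, leaves={}".format(name, tree.get_depth(), tree.get_n_leaves()))
```

With 200 samples, `min_samples_leaf=20` caps the tree at 10 leaves, and `max_depth=2` caps it at 4 leaves, while the default settings let the tree grow until the leaves are pure.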

In [48]:
def DTC_model_test(max_depth, min_samples_leaf):
    
    model = DecisionTreeClassifier(
        max_depth = max_depth, 
        min_samples_leaf = min_samples_leaf)
    
    model.fit(Train_X, Train_Y) 
    pred_Y = model.predict(Test_X)
    
    return f1_score(Test_Y, pred_Y)

In [50]:
# Search over a wide range first
for max_depth in [3, 6, 9, 12]:
    for min_samples_leaf in [1, 2, 3]:
        score = DTC_model_test(max_depth = max_depth, min_samples_leaf = min_samples_leaf)
        print("{}-{}:{}".format(max_depth, min_samples_leaf, score))

3-1:0.6976744186046512
3-2:0.7727272727272727
3-3:0.7555555555555555
6-1:0.6938775510204083
6-2:0.7500000000000001
6-3:0.7058823529411765
9-1:0.7083333333333333
9-2:0.7083333333333333
9-3:0.7200000000000001
12-1:0.7346938775510203
12-2:0.6938775510204083
12-3:0.6666666666666667


Observation: good scores appear when max_depth is large (higher complexity) and min_samples_leaf is large (lower complexity).

In [52]:
# Set up the parameter grid
DTC_parameter_grid = ParameterGrid({"max_depth": np.arange(6, 15),
                                  "min_samples_leaf": np.arange(2, 5)})

# Run the parameter tuning
best_score = -1
for parameter in DTC_parameter_grid:
    model = DecisionTreeClassifier(**parameter).fit(Train_X, Train_Y)
    pred_Y = model.predict(Test_X)
    score = f1_score(Test_Y, pred_Y)
    
    if score > best_score:
        best_score = score
        best_parameter = parameter

print(best_parameter, best_score)

{'max_depth': 10, 'min_samples_leaf': 2} 0.7500000000000001


### [Case 3] Multi-layer Perceptron classifier (multi-layer <u>neural network</u> model)
- One complexity parameter -> the hidden layers are treated as a single tuple-valued parameter
- Complex
- Involves randomness
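
How the `hidden_layer_sizes` tuple translates into layers can be seen from the shapes of the fitted weight matrices in `coefs_`. The sketch below uses synthetic data with 60 features (an assumption, chosen to match the sonar feature count) and a short `max_iter`, since only the shapes matter here.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # the short max_iter triggers ConvergenceWarning

# Synthetic stand-in data with 60 features (an assumption; not the sonar set).
X_demo, y_demo = make_classification(n_samples=100, n_features=60, random_state=0)

shapes_by_hls = {}
for hls in [(5,), (5, 5), (10, 10, 10)]:
    model = MLPClassifier(hidden_layer_sizes=hls, max_iter=50, random_state=12)
    model.fit(X_demo, y_demo)
    # One weight matrix per connection between consecutive layers.
    shapes_by_hls[hls] = [w.shape for w in model.coefs_]
    print(hls, "->", shapes_by_hls[hls])
```

Each tuple entry adds one hidden layer of that width, and for a binary problem the final matrix maps to a single output unit.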

In [53]:
from sklearn.neural_network import MLPClassifier

In [54]:
def MLP_model_test(hidden_layer_sizes):
    
    model = MLPClassifier(
        hidden_layer_sizes = hidden_layer_sizes, 
        random_state = 12)
    
    model.fit(Train_X, Train_Y) 
    pred_Y = model.predict(Test_X)
    
    return f1_score(Test_Y, pred_Y)

In [59]:
# hidden layer sizes list => hls_lst
hls_lst = [(5, ), (10, ), (3, 3), (5, 5), (10, 10)]  # (5, ) => 1 layer with 5 nodes

# Search over a wide range first
for hidden_layer_sizes in hls_lst :
    score = MLP_model_test(hidden_layer_sizes= hidden_layer_sizes)
    print('hidden layer sizes : {},\t  score : {}'.format(hidden_layer_sizes, score))

hidden layer sizes : (5,),	  score : 0.5945945945945945
hidden layer sizes : (10,),	  score : 0.8444444444444444
hidden layer sizes : (3, 3),	  score : 0.4571428571428572
hidden layer sizes : (5, 5),	  score : 0.0
hidden layer sizes : (10, 10),	  score : 0.8372093023255814


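max_iter warnings occurred.  
With hidden layer sizes (5, 5) the F1 score is 0  =>  possibly an effect of the initial weights? (Both a simpler and a more complex model scored well, so model size alone does not explain it.)
  
Ranked by F1 score from best: (10, ) > (10, 10) > ...   => more hidden nodes per layer gave better results.   
=> More complex models should be considered.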
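
One way to check whether an outlier like the 0 score comes from the initial weights is to refit the same architecture with several seeds and look at the spread. The sketch below does this on synthetic data; the data, the seed list, and the `(5, 5)` architecture are assumptions for illustration, so the numbers will not match the sonar results.

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # silence ConvergenceWarning for this demo

# Synthetic stand-in data (an assumption; not the sonar set).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(X_demo, y_demo, random_state=42)

seeds = [0, 1, 2, 3, 4]
scores = []
for seed in seeds:
    # Only the seed changes, so any spread in the scores comes from initialization
    # and the shuffling done by the solver.
    model = MLPClassifier(hidden_layer_sizes=(5, 5), max_iter=200, random_state=seed)
    model.fit(X_fit, y_fit)
    scores.append(f1_score(y_eval, model.predict(X_eval)))

print("per-seed F1:", np.round(scores, 3))
print("mean = {:.3f}, std = {:.3f}".format(np.mean(scores), np.std(scores)))
```

A large standard deviation would suggest that a single-seed comparison between architectures is not reliable.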


In [63]:
# Set up the parameter grid
MLP_parameter_grid = ParameterGrid({"random_state": [41, 102, 15],
                                  "hidden_layer_sizes": [(14, ), (5, 5), (10, 10), (11, 13), (5, 5, 5), (10, 10, 10)],
                                   "max_iter":[200, 2000, 20000]})

# Run the parameter tuning
best_score = -1
for parameter in MLP_parameter_grid:
    model = MLPClassifier(**parameter).fit(Train_X, Train_Y)
    pred_Y = model.predict(Test_X)
    score = f1_score(Test_Y, pred_Y)
    
    print('parameter : {},'.format(parameter))
    print('\t  score : {} '.format(score))
    
    if score > best_score:
        best_score = score
        best_parameter = parameter


print('-'*60)
print('best_parameter : {},\t  best_score : {}'.format(best_parameter, best_score))

parameter : {'hidden_layer_sizes': (14,), 'max_iter': 200, 'random_state': 41},
	  score : 0.8695652173913043 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 200, 'random_state': 102},
	  score : 0.8444444444444444 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 200, 'random_state': 15},
	  score : 0.8260869565217391 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 2000, 'random_state': 41},
	  score : 0.888888888888889 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 2000, 'random_state': 102},
	  score : 0.8695652173913043 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 2000, 'random_state': 15},
	  score : 0.888888888888889 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 20000, 'random_state': 41},
	  score : 0.888888888888889 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 20000, 'random_state': 102},
	  score : 0.8695652173913043 
parameter : {'hidden_layer_sizes': (14,), 'max_iter': 20000, 'random_state': 15},
	  score : 0.8888888888

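## Tip
The seed value does not have a large effect on the results.  

In the end, **EDA and Feature Engineering** are what matter.  

Another tuning tip (a competition-only trick): once three submissions have been made, use their parameter values as reference points to set the search range for the next submissions (applies near the end of a competition).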
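
A minimal sketch of that last tip, assuming three earlier submissions used the `C` values below (the numbers and the names `submitted_C` and `next_candidates` are made up for illustration):

```python
import numpy as np

# Hypothetical C values from three earlier submissions (made-up numbers).
submitted_C = [0.17, 5, 10]

# Use the smallest and largest submitted values as the bounds of the next search.
low, high = min(submitted_C), max(submitted_C)
next_candidates = np.linspace(low, high, 5)
print(next_candidates)
```

The candidates could then be passed to `ParameterGrid` in the same way as the grids above.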
