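## **Lab 4. AI Modeling Optimization**
## This lab file is the <u><b>learner version</b></u>.
* In this course, we work through example code to solve a machine-learning classification problem: detecting malicious sites based on features extracted from web pages.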
---


### **[Lab Process]**
### 0. Load the data
### 1. Preprocess the data
### 2. Split the data into x_train, x_test, y_train, y_test with train_test_split
### 3. AI modeling with GridSearch



# <b>Step 0. Import libraries and load the data</b>
### **A. Import libraries**

* Libraries for data frames

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

### **B. Load the training data**

In [2]:
df = pd.read_csv("df_train.csv", sep = ",")
x = df.drop("label", axis = 1)
y = df.loc[:, "label"]

### **C. Preprocess the data**

### **D. Split train/test data with train_test_split**

In [3]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 2021)
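
For reference, here is a minimal sketch of how `test_size` affects the split sizes. The synthetic data from `make_classification`, the `_demo` variable names, and the `stratify` option are assumptions for illustration only; the notebook itself splits the real feature table without stratification.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table (assumption: 1000 rows, 10 features)
x_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=2021)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=2021
)
print(x_tr.shape, x_te.shape)

# stratify=y_demo keeps the class ratio nearly identical in both parts
x_tr_s, x_te_s, y_tr_s, y_te_s = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=2021, stratify=y_demo
)
print(y_tr_s.mean(), y_te_s.mean())
```

Fixing `random_state` makes the split reproducible, which is why the same value is reused throughout this lab.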

### **E. Define the Confusion Matrix function**
#### A confusion matrix is a table that compares predicted values against actual values to measure how well a trained model predicts.
#### The function below renders confusion matrix results in an easy-to-read form for this assignment. Refer to the usage example and use it to check your model results.

**<span style="color:green">[Reference] Official documentation</span>**
 
* confusion matrix (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
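
To make the layout of the table concrete, here is a small sketch with made-up labels (the `y_true` and `y_hat` lists are assumptions, not data from this lab). In scikit-learn, rows are the actual classes and columns are the predicted classes, both in sorted label order.

```python
from sklearn.metrics import confusion_matrix

# Toy labels made up for illustration: 0 and 1 are the two classes
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# For a binary problem, ravel() yields the four cells row by row
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```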

In [4]:
from sklearn.metrics import classification_report as creport
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

In [8]:
def plot_confusion_matrix(ax, matrix, labels = ['malicious','benign'], title='Confusion matrix', fontsize=9):
    ax.set_xticks([x for x in range(len(labels))])
    ax.set_yticks([y for y in range(len(labels))])

    # Place labels on minor ticks
    ax.set_xticks([x + 0.5 for x in range(len(labels))], minor=True)
    ax.set_xticklabels(labels, rotation=90, fontsize=fontsize, minor=True)
    ax.set_yticks([y + 0.5 for y in range(len(labels))], minor=True)
    ax.set_yticklabels(labels[::-1], fontsize=fontsize, minor=True)

    # Hide major tick labels (tick_params expects booleans here)
    ax.tick_params(which='major', labelbottom=False, labelleft=False)

    # Finally, hide minor tick marks
    ax.tick_params(which='minor', width=0)

    # Plot heat map
    proportions = [1. * row / sum(row) for row in matrix]
    ax.pcolor(np.array(proportions[::-1]), cmap=plt.cm.Blues)

    # Plot counts as text
    for row in range(len(matrix)):
        for col in range(len(matrix[row])):
            confusion = matrix[::-1][row][col]
            if confusion != 0:
                ax.text(col + 0.5, row + 0.5, int(confusion),
                        fontsize=fontsize,
                        horizontalalignment='center',
                        verticalalignment='center')

    # Add finishing touches
    ax.grid(True, linestyle=':')
    ax.set_title(title, fontsize=fontsize)
    ax.set_xlabel('prediction', fontsize=fontsize)
    ax.set_ylabel('actual', fontsize=fontsize)

    plt.show()

### <span style="color:blue">[Example] How to use the Confusion Matrix function</span>

- Sample
#### > confusion = confusion_matrix(y_test, y_pred)
#### > fig, ax = plt.subplots(figsize=(10,3))
#### > plot_confusion_matrix(ax, confusion, fontsize=30)

---

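# <b>RandomForest GridSearchCV</b>
### A simple way to find a satisfactory hyperparameter combination is to adjust the hyperparameters by hand.
### GridSearchCV automates this: it builds one internal model per candidate setting, runs them all, and searches for the best hyperparameters.
### Once you specify the hyperparameters to explore, every possible combination is evaluated with cross-validation.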
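
The idea described above can be sketched as an explicit loop. This is only an illustration of what an exhaustive search does conceptually, not the internal implementation of GridSearchCV; the synthetic data and the small `grid` dictionary are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2021)

# Hypothetical grid: 3 x 2 = 6 combinations
grid = {"max_depth": [2, 4, 6], "min_samples_split": [2, 10]}

results = []
for params in ParameterGrid(grid):
    clf = DecisionTreeClassifier(random_state=2021, **params)
    # Score each combination by its mean 5-fold cross-validation accuracy
    scores = cross_val_score(clf, x_demo, y_demo, cv=5, scoring="accuracy")
    results.append((scores.mean(), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(len(results), best_params, round(best_score, 4))
```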


* Key parameters<br>
<table align="left">
    <tr>
        <td align="center">Parameter</td><td align="center">Description</td>
    </tr>
     <tr>
        <td align="center">param_grid</td><td>Dictionary of parameters and candidate values to search</td>
    </tr>
    <tr>
        <td align="center">scoring</td><td>Evaluation metric used to measure prediction performance</td>
    </tr>
    <tr>
        <td align="center">cv</td><td>Number of folds used for cross-validation</td>
    </tr>
</table>

**<span style="color:green">[Reference] Official documentation</span>**
 
* GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* model evaluation (https://scikit-learn.org/stable/modules/model_evaluation.html)
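
A minimal sketch of how the three parameters in the table fit together. The synthetic data, the candidate depths, and the choice of `"f1"` as the metric are assumptions for illustration; the cells in this lab use `"accuracy"`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2021)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=2021),
    param_grid={"max_depth": [2, 4, 6, 8]},  # candidate values to try
    scoring="f1",  # a scorer name from the model evaluation guide
    cv=5,  # 5-fold cross-validation per candidate
)
search.fit(x_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```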

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV

### test_size = 0.2, random_state = 2021

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 2021)

In [7]:
# Declare the model
model_dt = DecisionTreeClassifier(random_state = 2021)

# Estimate performance with 10-fold cross-validation
cv_score = cross_val_score(model_dt, x_train, y_train, cv = 10)

# Check the results
print(cv_score)
print(cv_score.mean())

[0.93495935 0.92276423 0.92682927 0.93902439 0.93061224 0.93061224
 0.93061224 0.92653061 0.95102041 0.94693878]
0.9339903766384603
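
As background for the scores above: when the estimator is a classifier and `cv` is an integer, `cross_val_score` uses stratified folds without shuffling. The sketch below checks that equivalence on synthetic data (an assumption; the real `x_train` is not used here) and also reports the spread across folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2021)
clf = DecisionTreeClassifier(random_state=2021)

# cv=10 on a classifier is shorthand for StratifiedKFold(n_splits=10)
scores_int = cross_val_score(clf, x_demo, y_demo, cv=10)
scores_skf = cross_val_score(clf, x_demo, y_demo, cv=StratifiedKFold(n_splits=10))

print(np.allclose(scores_int, scores_skf))
print(f"{scores_int.mean():.4f} +/- {scores_int.std():.4f}")
```

Looking at the standard deviation next to the mean helps judge whether a difference of a few thousandths between two settings is meaningful.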


In [8]:
# Declare the parameter grid
param = {"max_depth" : range(1, 21)}

# Declare the grid search
model = GridSearchCV(model_dt, param, cv = 10, scoring = "accuracy")

In [9]:
# Fit the search
model.fit(x_train, y_train)

In [91]:
# Check the best parameters and the best mean cross-validation score
print("test_size = 0.2, random_state = 2021")
print('=' * 80)
print('Best parameters:', model.best_params_)
print('-' * 80)
print('Best score:', model.best_score_)
print('=' * 80)

test_size = 0.2, random_state = 2021
Best parameters: {'max_depth': 12}
--------------------------------------------------------------------------------
Best score: 0.9334784425715679


In [10]:
# Check the best parameters and the best mean cross-validation score
print("test_size = 0.2, random_state = 2021")
print('=' * 80)
print('Best parameters:', model.best_params_)
print('-' * 80)
print('Best score:', model.best_score_)
print('=' * 80)

test_size = 0.2, random_state = 2021
Best parameters: {'max_depth': 14}
--------------------------------------------------------------------------------
Best score: 0.935209888833582
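
Beyond the single best value, the fitted search also stores the mean score of every candidate in `cv_results_`. The sketch below, which assumes synthetic data and a shorter depth range than the lab, shows one way to list the candidates ranked by score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2021)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=2021),
    {"max_depth": range(1, 11)},
    cv=5,
    scoring="accuracy",
)
search.fit(x_demo, y_demo)

# One row per candidate; rank 1 is the setting reported by best_params_
table = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(table.head())
```

When several depths score within one standard deviation of each other, the shallower tree is often a reasonable choice because it is simpler.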


In [92]:
#test_size = 0.2, random_state = 2021, max_depth = 12

# 2. Declare the model
model = DecisionTreeClassifier(max_depth = 12, random_state = 2021)

# 3. fit(): train
model.fit(x_train, y_train)

# 4. predict(): predict
y_pred = model.predict(x_test)

# Check the confusion matrix on the test data
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[340  18]
 [ 14 361]]
              precision    recall  f1-score   support

           0       0.96      0.95      0.96       358
           1       0.95      0.96      0.96       375

    accuracy                           0.96       733
   macro avg       0.96      0.96      0.96       733
weighted avg       0.96      0.96      0.96       733



In [11]:
#test_size = 0.2, random_state = 2021, max_depth = 14

# 2. Declare the model
model = DecisionTreeClassifier(max_depth = 14, random_state = 2021)

# 3. fit(): train
model.fit(x_train, y_train)

# 4. predict(): predict
y_pred = model.predict(x_test)

# Check the confusion matrix on the test data
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[293  26]
 [ 18 277]]
              precision    recall  f1-score   support

           0       0.94      0.92      0.93       319
           1       0.91      0.94      0.93       295

    accuracy                           0.93       614
   macro avg       0.93      0.93      0.93       614
weighted avg       0.93      0.93      0.93       614
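
The numbers in the report above can be reproduced by hand from the confusion matrix printed just before it. This sketch only reuses those four counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[293, 26],
               [18, 277]])

# Precision: correct predictions / everything predicted as that class
precision = cm.diagonal() / cm.sum(axis=0)
# Recall: correct predictions / everything actually in that class
recall = cm.diagonal() / cm.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)
accuracy = cm.diagonal().sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(accuracy, 2))
```

For class 0, for example, precision is 293 / (293 + 18) and recall is 293 / (293 + 26), which round to the 0.94 and 0.92 shown in the report.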



In [12]:
# Declare the model
model_dt = DecisionTreeClassifier(max_depth = 14, random_state = 2021)

# Estimate performance with 10-fold cross-validation
cv_score = cross_val_score(model_dt, x_train, y_train, cv = 10)

# Check the results
print(cv_score)
print(cv_score.mean())

[0.93495935 0.93495935 0.92682927 0.93902439 0.93061224 0.93469388
 0.93061224 0.93061224 0.95102041 0.93877551]
0.935209888833582


In [13]:
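# Declare the parameter grid
# Note: an integer min_samples_split must be at least 2; recent scikit-learn versions reject 1
param = {"min_samples_split" : range(1, 21)}

# Declare the grid search
model = GridSearchCV(model_dt, param, cv = 10, scoring = "accuracy")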

In [15]:
# Fit the search
model.fit(x_train, y_train) # model_dt uses max_depth = 14

In [16]:
# Check the best parameters and the best mean cross-validation score
print("test_size = 0.2, max_depth = 14, random_state = 2021")
print('=' * 80)
print('Best parameters:', model.best_params_)
print('-' * 80)
print('Best score:', model.best_score_)
print('=' * 80)

test_size = 0.2, max_depth = 14, random_state = 2021
Best parameters: {'min_samples_split': 5}
--------------------------------------------------------------------------------
Best score: 0.9352181848349096


In [17]:
# Declare the model
model_dt = DecisionTreeClassifier(max_depth = 14, min_samples_split = 5, random_state = 2021)

# Estimate performance with 10-fold cross-validation
cv_score = cross_val_score(model_dt, x_train, y_train, cv = 10)

# Check the results
print(cv_score)
print(cv_score.mean())

[0.93495935 0.93495935 0.91869919 0.92682927 0.94285714 0.93877551
 0.94285714 0.93061224 0.94693878 0.93469388]
0.9352181848349096


In [18]:
#test_size = 0.2, random_state = 2021, max_depth = 14, min_samples_split = 5

# 2. Declare the model
model = DecisionTreeClassifier(max_depth = 14, min_samples_split = 5, random_state = 2021)

# 3. fit(): train
model.fit(x_train, y_train)

# 4. predict(): predict
y_pred = model.predict(x_test)

# Check the confusion matrix on the test data
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[297  22]
 [ 23 272]]
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       319
           1       0.93      0.92      0.92       295

    accuracy                           0.93       614
   macro avg       0.93      0.93      0.93       614
weighted avg       0.93      0.93      0.93       614



In [None]:
# Is the accuracy higher when min_samples_split is 1 or 2? Yet the performance is worse than with 3??
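
One possible reading of the question above is that the cross-validation score and the held-out test score do not always rank settings the same way. The sketch below compares the two side by side for several `min_samples_split` values. It runs on synthetic data (an assumption), so its numbers say nothing about the lab dataset; it only shows how such a comparison could be set up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=2021)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=2021)

rows = []
for mss in range(2, 8):  # an integer min_samples_split must be at least 2
    clf = DecisionTreeClassifier(max_depth=14, min_samples_split=mss, random_state=2021)
    # Mean accuracy over 10 folds of the training part
    cv_acc = cross_val_score(clf, x_tr, y_tr, cv=10, scoring="accuracy").mean()
    # Accuracy on the held-out test part after fitting on the full training part
    test_acc = clf.fit(x_tr, y_tr).score(x_te, y_te)
    rows.append((mss, cv_acc, test_acc))

for mss, cv_acc, test_acc in rows:
    print(f"min_samples_split={mss}: cv={cv_acc:.4f}  test={test_acc:.4f}")
```

Small differences between neighbouring settings are often within the fold-to-fold noise, so the two columns need not agree on a single winner.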

### test_size = 0.3, random_state = 2021

In [19]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 2021)

In [20]:
# Declare the model
model_dt = DecisionTreeClassifier(random_state = 2021)

# Estimate performance with 10-fold cross-validation
cv_score = cross_val_score(model_dt, x_train, y_train, cv = 10)

# Check the results
print(cv_score)
print(cv_score.mean())

[0.94883721 0.93023256 0.93488372 0.91627907 0.93953488 0.92093023
 0.93023256 0.95794393 0.94392523 0.92990654]
0.935270593349272


In [21]:
# Declare the parameter grid
param = {"max_depth" : range(1, 21)}

# Declare the grid search
model = GridSearchCV(model_dt, param, cv = 10, scoring = "accuracy")

In [22]:
# Fit the search
model.fit(x_train, y_train)

In [23]:
# Check the best parameters and the best mean cross-validation score
print("test_size = 0.3, random_state = 2021")
print('=' * 80)
print('Best parameters:', model.best_params_)
print('-' * 80)
print('Best score:', model.best_score_)
print('=' * 80)

test_size = 0.3, random_state = 2021
Best parameters: {'max_depth': 11}
--------------------------------------------------------------------------------
Best score: 0.9357270158661162


In [24]:
#test_size = 0.3, random_state = 2021, max_depth = 11

# 2. Declare the model
model = DecisionTreeClassifier(max_depth = 11, random_state = 2021)

# 3. fit(): train
model.fit(x_train, y_train)

# 4. predict(): predict
y_pred = model.predict(x_test)

# Check the confusion matrix on the test data
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[453  35]
 [ 25 408]]
              precision    recall  f1-score   support

           0       0.95      0.93      0.94       488
           1       0.92      0.94      0.93       433

    accuracy                           0.93       921
   macro avg       0.93      0.94      0.93       921
weighted avg       0.94      0.93      0.93       921



In [25]:
# Declare the model
model_dt = DecisionTreeClassifier(max_depth = 11, random_state = 2021)

# Estimate performance with 10-fold cross-validation
cv_score = cross_val_score(model_dt, x_train, y_train, cv = 10)

# Check the results
print(cv_score)
print(cv_score.mean())

[0.94418605 0.93488372 0.93953488 0.93488372 0.94418605 0.90232558
 0.94418605 0.95327103 0.92990654 0.92990654]
0.9357270158661162


In [26]:
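# Declare the parameter grid
# param = {"max_depth" : range(1, 21)}
# Note: an integer min_samples_split must be at least 2; recent scikit-learn versions reject 1
param = {"min_samples_split" : range(1, 21)}

# Declare the grid search
model = GridSearchCV(model_dt, param, cv = 10, scoring = "accuracy")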

In [27]:
# Fit the search
model.fit(x_train, y_train) # model_dt uses max_depth = 11

In [28]:
# Check the best parameters and the best mean cross-validation score
print("test_size = 0.3, max_depth = 11, random_state = 2021")
print('=' * 80)
print('Best parameters:', model.best_params_)
print('-' * 80)
print('Best score:', model.best_score_)
print('=' * 80)

test_size = 0.3, max_depth = 11, random_state = 2021
Best parameters: {'min_samples_split': 1}
--------------------------------------------------------------------------------
Best score: 0.9357270158661162


In [29]:
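#test_size = 0.3, random_state = 2021, max_depth = 11, min_samples_split = 2

# 2. Declare the model
# The search reported min_samples_split = 1, but an integer value must be at least 2
# in recent scikit-learn versions, so the smallest valid value (the default, 2) is used here
model = DecisionTreeClassifier(max_depth = 11, min_samples_split = 2, random_state = 2021)

# 3. fit(): train
model.fit(x_train, y_train)

# 4. predict(): predict
y_pred = model.predict(x_test)

# Check the confusion matrix on the test data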
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[453  35]
 [ 25 408]]
              precision    recall  f1-score   support

           0       0.95      0.93      0.94       488
           1       0.92      0.94      0.93       433

    accuracy                           0.93       921
   macro avg       0.93      0.94      0.93       921
weighted avg       0.94      0.93      0.93       921
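
The cells above tune `max_depth` first and `min_samples_split` second, then rebuild the tree by hand. A hedged alternative, not the approach used in this lab, is to search both hyperparameters jointly and reuse the refitted model through `best_estimator_`. The synthetic data and the small grid below are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
x_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=2021)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.3, random_state=2021)

# Search both hyperparameters jointly instead of one after the other
search = GridSearchCV(
    DecisionTreeClassifier(random_state=2021),
    {"max_depth": [4, 8, 12], "min_samples_split": [2, 5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(x_tr, y_tr)

# With the default refit=True, the best setting is retrained on all of x_tr
best_model = search.best_estimator_
y_hat = best_model.predict(x_te)
print(search.best_params_, accuracy_score(y_te, y_hat))
```

A joint grid costs more fits (9 candidates times 5 folds here) but avoids fixing one hyperparameter before the other has been explored.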

