# Research results for 7/18 to 7/25

## Improving accuracy on an imbalanced dataset using an autoencoder trained only on DoS


### Background
- An autoencoder trained on all of the data could not learn u2r at all.
- The u2r samples may be swamped by the much larger DoS class.

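### Purpose
- Improve accuracy on u2r, which showed no prospect of improving with an autoencoder trained on all of the data.
- Verify whether u2r samples have a larger reconstruction error under an autoencoder trained only on DoS data.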
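
The second check can be sketched as follows. This is a minimal NumPy sketch rather than code from the notebook: `per_sample_mse` and `mean_error_by_class` are hypothetical helpers, `x_hat` stands for the reconstruction produced by the trained autoencoder, and the toy numbers are made up.

```python
import numpy as np

def per_sample_mse(x, x_hat):
    """Mean squared reconstruction error of each row."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=1)

def mean_error_by_class(errors, labels):
    """Average the per-sample errors within each class label."""
    labels = np.asarray(labels)
    return {str(c): float(errors[labels == c].mean()) for c in np.unique(labels)}

# Toy data (made up): the reconstruction is close for 'dos' rows and poor for the 'u2r' row.
x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 2.0]])
x_hat = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
labels = np.array(["dos", "dos", "u2r"])

errors = per_sample_mse(x, x_hat)
by_class = mean_error_by_class(errors, labels)
print(by_class)
```

If the hypothesis holds, the u2r entry should be clearly larger than the dos entry when the same computation is run on real reconstructions.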

### Datasets used
- KDD99
- KDD99 10% (the subset loaded in this notebook via `use_full_dataset=False`)

### Method
### Results
### Discussion

### 1. Preparation (loading, preprocessing)

Load the libraries and print their versions.

In [1]:
from utils_kdd99 import *
print_version()

python:      3.10.11
sklearn:     1.2.2
tensorflow:  2.12.0
keras:       2.12.0
numpy:       1.23.5
pandas:      1.5.3


Load the data and split it into explanatory variables and the target variable.

In [2]:
data_x, data_y = load_data(use_full_dataset=False, standard_scale=True, verbose=0)
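
`load_data` lives in `utils_kdd99` and is not shown here. Assuming `standard_scale=True` applies per-column z-score standardization (an assumption, not confirmed by the source), the transform would look roughly like this NumPy sketch; `standardize` is a hypothetical helper and the input values are made up.

```python
import numpy as np

def standardize(x):
    """Z-score each column: subtract the column mean, divide by the column std."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by zero
    return (x - mean) / std

raw = np.array([[0.0, 10.0, 5.0],
                [2.0, 30.0, 5.0],
                [4.0, 50.0, 5.0]])
scaled = standardize(raw)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After such a transform roughly half of the values in a column are negative, which matters for the choice of output activation in the autoencoder.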

Distinct values of the target variable and the number of samples for each.

In [3]:
data_y.value_counts()

smurf              280790
neptune            107201
normal              97278
back                 2203
satan                1589
ipsweep              1247
portsweep            1040
warezclient          1020
teardrop              979
pod                   264
nmap                  231
guess_passwd           53
buffer_overflow        30
land                   21
warezmaster            20
imap                   12
rootkit                10
loadmodule              9
ftp_write               8
multihop                7
phf                     4
perl                    3
spy                     2
Name: true_label, dtype: int64

Map each label to one of five class labels (four attack categories plus normal).

In [4]:
data_y = data_y.map(lambda x: attack_label_class[x])

Distinct values of the target variable after the mapping, and the number of samples for each.

In [5]:
data_y.value_counts()

dos       391458
normal     97278
probe       4107
r2l         1126
u2r           52
Name: true_label, dtype: int64
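
The `attack_label_class` mapping is defined in `utils_kdd99` and is not shown. The sketch below is a hypothetical reconstruction (named `attack_label_class_sketch` to make that clear) based on the usual KDD99 grouping; it reproduces the five class counts above from the per-label counts, and also shows how small the u2r share is.

```python
# Hypothetical reconstruction of the label-to-class mapping (the real one is in utils_kdd99).
attack_label_class_sketch = {
    "normal": "normal",
    **dict.fromkeys(["back", "land", "neptune", "pod", "smurf", "teardrop"], "dos"),
    **dict.fromkeys(["ipsweep", "nmap", "portsweep", "satan"], "probe"),
    **dict.fromkeys(["ftp_write", "guess_passwd", "imap", "multihop", "phf",
                     "spy", "warezclient", "warezmaster"], "r2l"),
    **dict.fromkeys(["buffer_overflow", "loadmodule", "perl", "rootkit"], "u2r"),
}

# Per-label counts copied from the first value_counts() output.
label_counts = {
    "smurf": 280790, "neptune": 107201, "normal": 97278, "back": 2203,
    "satan": 1589, "ipsweep": 1247, "portsweep": 1040, "warezclient": 1020,
    "teardrop": 979, "pod": 264, "nmap": 231, "guess_passwd": 53,
    "buffer_overflow": 30, "land": 21, "warezmaster": 20, "imap": 12,
    "rootkit": 10, "loadmodule": 9, "ftp_write": 8, "multihop": 7,
    "phf": 4, "perl": 3, "spy": 2,
}

class_counts = {}
for label, n in label_counts.items():
    cls = attack_label_class_sketch[label]
    class_counts[cls] = class_counts.get(cls, 0) + n

total = sum(class_counts.values())
for cls, n in sorted(class_counts.items(), key=lambda kv: -kv[1]):
    print(f"{cls:<7}{n:>8}  {n / total:.4%}")
```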

Convert the class labels to integers.

In [6]:
data_y = data_y.map(lambda x: correspondences[x])

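Hold-out split (the code uses `train_test_split`, not k-fold cross-validation)
- The ratio of training data to validation data is roughly 2:1 (`test_size=0.33`).
- The class proportions of the labels are kept the same in both sets (`stratify=data_y`).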

In [7]:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=RANDOM_SEED, stratify=data_y)

Check the sizes of the training and validation data.

In [8]:
print(f"x_train: {x_train.shape}, x_test: {x_test.shape}")


x_train: (330994, 38), x_test: (163027, 38)
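
To illustrate what stratification does for a tiny class, here is a minimal NumPy sketch of a stratified hold-out split. It is not scikit-learn's algorithm; `stratified_holdout` is a hypothetical helper and the 300 majority samples are made up. Only the count of 52 is taken from the u2r class above.

```python
import numpy as np

def stratified_holdout(labels, test_frac, seed=0):
    """Return (train_idx, test_idx) with about test_frac of every class in the test set."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_test = int(round(len(idx) * test_frac))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels: 300 majority samples (class 0) and 52 minority samples (class 4).
labels = np.array([0] * 300 + [4] * 52)
train_idx, test_idx = stratified_holdout(labels, test_frac=0.33)
print(np.bincount(labels[train_idx]), np.bincount(labels[test_idx]))
```

With 52 minority samples and a 0.33 test fraction this sketch puts 35 in training and 17 in test, which matches the u2r support values in the classification reports of this notebook.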


### Building an autoencoder trained only on DoS
- Layer dimensions: 38 -> 10 -> 5 -> 10 -> 38
- Activation function: ReLU
- Optimizer: Adam
- Loss function: mean squared error
- Epochs: 1
- Batch size: 32
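
One thing worth checking about this configuration: the inputs are standardized, so many values are negative, while a ReLU output layer can only emit values of 0 or more. The NumPy sketch below uses illustrative numbers to show the resulting floor on the squared error. A linear output layer is a common alternative, though the notebook does not test it.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# A standardized target with negative entries (values are illustrative).
target = np.array([-0.067792, -0.417192, 2.39698])

# Whatever pre-activation the last layer produces, relu(z) >= 0, so the closest
# reachable output to a negative target is 0.
best_reachable = np.maximum(target, 0.0)
floor_mse = np.mean((target - best_reachable) ** 2)

# Sanity check against many random pre-activations.
rng = np.random.default_rng(0)
z = rng.normal(size=(10_000, target.size)) * 3
mse = np.mean((relu(z) - target) ** 2, axis=1)
print(floor_mse, mse.min())
```

No pre-activation gets below the floor, so part of the reconstruction error is caused by the output activation rather than by the data.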

In [9]:
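# Symmetric autoencoder: 38 -> 10 -> 5 (bottleneck) -> 10 -> 38
ae_model = keras.Sequential([
    Dense(units=10, activation='relu', input_dim=38, name='encoder1'),
    Dense(units=5, activation='relu', name='encoder2'),  # 5-dimensional bottleneck
    Dense(units=10, activation='relu'),
    Dense(units=38, activation='relu'),
])
# 'accuracy' is not a meaningful metric for reconstruction; the MSE loss is the quantity to watch
ae_model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
ae_model.summary()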

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder1 (Dense)            (None, 10)                390       
                                                                 
 encoder2 (Dense)            (None, 5)                 55        
                                                                 
 dense (Dense)               (None, 10)                60        
                                                                 
 dense_1 (Dense)             (None, 38)                418       
                                                                 
Total params: 923
Trainable params: 923
Non-trainable params: 0
_________________________________________________________________
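
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. A small sketch (`dense_params` is a helper introduced here for illustration):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [38, 10, 5, 10, 38]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [390, 55, 60, 418] 923
```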


Create a dataset containing only DoS samples.

In [10]:
dos_x_train = x_train[y_train == correspondences['dos']]

Train the autoencoder.

In [11]:
ae_model.fit(
    dos_x_train, dos_x_train,  # input and target are the same (reconstruction)
    epochs=1,  # number of passes over the training set
    batch_size=32,
    shuffle=True,
    verbose=1,
    use_multiprocessing=True,
)

<keras.callbacks.History at 0x29dc4ab30>

Extract the encoder part.

In [12]:
encoder = keras.Sequential([ae_model.get_layer('encoder1'),
                            ae_model.get_layer('encoder2')])


Name the features output by the encoder trained on DoS.

In [13]:
dos_columns = [f'dos{i}' for i in range(5)]
dos_columns

['dos0', 'dos1', 'dos2', 'dos3', 'dos4']

Feature extraction.

In [14]:
x_train_encoded = pd.DataFrame(data=encoder.predict(x_train), index=x_train.index, columns=dos_columns)
x_test_encoded = pd.DataFrame(data=encoder.predict(x_test), index=x_test.index, columns=dos_columns)
x_train_encoded.head()




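| index | dos0 | dos1 | dos2 | dos3 | dos4 |
|---|---|---|---|---|---|
| 212221 | 0.0 | 0.0 | 2.938297 | 0.0 | 0.0 |
| 30903 | 0.598978 | 0.0 | 0.0 | 0.331348 | 0.389877 |
| 9739 | 0.0 | 0.0 | 2.94486 | 0.0 | 0.0 |
| 37540 | 0.62231 | 0.0 | 0.0 | 0.424348 | 0.410116 |
| 418638 | 0.0 | 0.0 | 2.944614 | 0.0 | 0.0 |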
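
The exact zeros in the encoded features are what ReLU activations produce whenever a unit's pre-activation is negative. A minimal NumPy sketch of the encoder's forward pass follows; the weights are random placeholders with the encoder's shapes, not the trained values, and `dense_relu` is a hypothetical helper.

```python
import numpy as np

def dense_relu(x, w, b):
    """One Dense layer with ReLU activation: max(x @ w + b, 0)."""
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
# Placeholder weights with the encoder's shapes (38 -> 10 -> 5); not the trained values.
w1, b1 = rng.normal(size=(38, 10)), np.zeros(10)
w2, b2 = rng.normal(size=(10, 5)), np.zeros(5)

x = rng.normal(size=(4, 38))  # four standardized rows (random stand-ins)
encoded = dense_relu(dense_relu(x, w1, b1), w2, b2)
print(encoded.shape)          # (4, 5)
print((encoded == 0).mean())  # fraction of exact zeros
```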


Merge the features learned on DoS into the original features.

In [15]:
x_train_new_feature = x_train.merge(x_train_encoded, right_index=True, left_index=True)
x_test_new_feature = x_test.merge(x_test_encoded, right_index=True, left_index=True)
x_train_new_feature.head()


| index | duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | dos0 | dos1 | dos2 | dos3 | dos4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 212221 | -0.067792 | -0.002017 | -0.026287 | -0.006673 | -0.04772 | -0.002571 | -0.044136 | -0.009782 | -0.417192 | -0.005679 | ... | -0.158629 | -0.464418 | -0.463202 | -0.25204 | -0.249464 | 0.0 | 0.0 | 2.938297 | 0.0 | 0.0 |
| 30903 | -0.067792 | -0.002774 | 0.472896 | -0.006673 | -0.04772 | -0.002571 | -0.044136 | -0.009782 | 2.39698 | -0.005679 | ... | 0.553404 | -0.464418 | -0.463202 | -0.25204 | -0.249464 | 0.598978 | 0.0 | 0.0 | 0.331348 | 0.389877 |
| 9739 | -0.067792 | -0.002017 | -0.026287 | -0.006673 | -0.04772 | -0.002571 | -0.044136 | -0.009782 | -0.417192 | -0.005679 | ... | -0.158629 | -0.464418 | -0.463202 | -0.25204 | -0.249464 | 0.0 | 0.0 | 2.94486 | 0.0 | 0.0 |
| 37540 | -0.067792 | -0.002776 | -0.01412 | -0.006673 | -0.04772 | -0.002571 | -0.044136 | -0.009782 | 2.39698 | -0.005679 | ... | 0.790749 | -0.464418 | -0.463202 | -0.25204 | -0.249464 | 0.62231 | 0.0 | 0.0 | 0.424348 | 0.410116 |
| 418638 | -0.067792 | -0.002535 | -0.026287 | -0.006673 | -0.04772 | -0.002571 | -0.044136 | -0.009782 | -0.417192 | -0.005679 | ... | -0.158629 | -0.464418 | -0.463202 | -0.25204 | -0.249464 | 0.0 | 0.0 | 2.944614 | 0.0 | 0.0 |


### Training with LightGBM

In [16]:
lgb_train = lgb.Dataset(x_train_new_feature, y_train)
lgb_eval = lgb.Dataset(x_test_new_feature, y_test, reference=lgb_train)

# LightGBM parameters
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 5,
    'metric': {'multi_error'},  # evaluation metric: error rate (= 1 - accuracy); multi_logloss is an alternative
    'learning_rate': 0.1,
    'num_leaves': 23,
    'min_data_in_leaf': 1,
    'num_iteration': 1000,  # at most 1000 boosting rounds
    'verbose': 0
}

# Train the model
model = lgb.train(
    params,  # parameters
    train_set=lgb_train,  # training data
    valid_sets=lgb_eval,  # validation data (here the held-out test split)
    callbacks=[lgb.early_stopping(100)]
)

# Predict on the test data (returns one probability per class for each row)
y_pred_prob = model.predict(x_test_new_feature)
# Convert to predicted class ids (0, 1, ...)
y_pred = np.argmax(y_pred_prob, axis=1)  # pick the class with the highest predicted probability
y_pred = pd.Series(y_pred)
y_pred.value_counts()




You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[5]	valid_0's multi_error: 0.00104277


0    129189
1     32052
2      1348
3       380
4        58
dtype: int64
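
The log reports a best iteration of 5 with a patience of 100 rounds, meaning the validation error did not improve for 100 rounds after round 5. A pure-Python sketch of that stopping rule follows. It is a simplification, not LightGBM's implementation: `best_iteration_with_patience` is a hypothetical helper, the scores are made up, and lower scores are taken to be better.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_iteration, stopped_at), both 1-based, for a lower-is-better metric."""
    best_score = float("inf")
    best_iter = 0
    for i, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            return best_iter, i  # no improvement for `patience` rounds
    return best_iter, len(scores)

# Made-up validation errors: improve until round 3, then plateau.
scores = [0.010, 0.005, 0.002, 0.002, 0.003, 0.002, 0.004]
print(best_iteration_with_patience(scores, patience=3))  # (3, 6)
```

A model that stops improving after only a handful of rounds has had very little opportunity to fit a class with a few dozen samples.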

In [17]:
print(classification_report(y_test, y_pred, target_names=correspondences.keys()))

              precision    recall  f1-score   support

         dos       1.00      1.00      1.00    129181
      normal       1.00      1.00      1.00     32102
       probe       0.99      0.98      0.99      1355
         r2l       0.92      0.94      0.93       372
         u2r       0.03      0.12      0.05        17

    accuracy                           1.00    163027
   macro avg       0.79      0.81      0.79    163027
weighted avg       1.00      1.00      1.00    163027
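
The gap between the macro and weighted averages in this report comes from how each class is weighted. The sketch below recomputes both from the rounded per-class values in the report, so results agree only to two decimals.

```python
# Per-class precision, recall, f1 and support copied from the report above.
report = {
    "dos":    (1.00, 1.00, 1.00, 129181),
    "normal": (1.00, 1.00, 1.00, 32102),
    "probe":  (0.99, 0.98, 0.99, 1355),
    "r2l":    (0.92, 0.94, 0.93, 372),
    "u2r":    (0.03, 0.12, 0.05, 17),
}

n_total = sum(row[3] for row in report.values())

def macro_avg(col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[col] for row in report.values()) / len(report)

def weighted_avg(col):
    """Support-weighted mean: dominated by the large classes."""
    return sum(row[col] * row[3] for row in report.values()) / n_total

for name, col in [("precision", 0), ("recall", 1), ("f1", 2)]:
    print(f"{name:<10} macro={macro_avg(col):.2f} weighted={weighted_avg(col):.2f}")
```

Because u2r contributes only 17 of 163027 samples, accuracy and the weighted averages stay at 1.00 even though u2r is almost never classified correctly; the macro average is the figure that reflects it.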



In [18]:
for key, cm in zip(correspondences.keys(), multilabel_confusion_matrix(y_test, y_pred)):
    # cm is a 2x2 one-vs-rest matrix laid out as [[TN, FP], [FN, TP]]
    print(f"{key}    TP: {cm[1][1]}, TN: {cm[0][0]}, FP: {cm[0][1]}, FN: {cm[1][0]}")


dos    TP: 129169, TN: 33826, FP: 20, FN: 12
normal    TP: 32005, TN: 130878, FP: 47, FN: 97
probe    TP: 1333, TN: 161657, FP: 15, FN: 22
r2l    TP: 348, TN: 162623, FP: 32, FN: 24
u2r    TP: 2, TN: 162954, FP: 56, FN: 15
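
These counts tie back to the classification report: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is 2TP / (2TP + FP + FN). A short sketch using the counts printed above (`prf` is a helper introduced here for illustration):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from one-vs-rest confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# (TP, FP, FN) per class, copied from the output above.
counts = {
    "dos": (129169, 20, 12),
    "normal": (32005, 47, 97),
    "probe": (1333, 15, 22),
    "r2l": (348, 32, 24),
    "u2r": (2, 56, 15),
}
for name, (tp, fp, fn) in counts.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{name:<7} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For u2r, only 2 of the 58 predicted samples are correct and 15 of the 17 true samples are missed, which gives the 0.03 precision and 0.12 recall in the report.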


Have the LightGBM model predict the training data to check whether it has learned the data sufficiently.

In [19]:
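# Predict on the training data (returns one probability per class for each row)
y_pred_prob = model.predict(x_train_new_feature)
# Convert to predicted class ids (0, 1, ...)
y_pred = np.argmax(y_pred_prob, axis=1)  # pick the class with the highest predicted probability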
y_pred = pd.Series(y_pred)
y_pred.value_counts()


0    262289
1     65088
2      2754
3       740
4       123
dtype: int64

In [20]:
y_pred_prob

array([[9.07070221e-01, 8.77961389e-02, 3.98045508e-03, 1.10245473e-03,
        5.07304006e-05],
       [3.19483032e-01, 6.72210774e-01, 6.44034884e-03, 1.78376415e-03,
        8.20810776e-05],
       [9.07070221e-01, 8.77961389e-02, 3.98045508e-03, 1.10245473e-03,
        5.07304006e-05],
       ...,
       [9.07070221e-01, 8.77961389e-02, 3.98045508e-03, 1.10245473e-03,
        5.07304006e-05],
       [9.07070221e-01, 8.77961389e-02, 3.98045508e-03, 1.10245473e-03,
        5.07304006e-05],
       [3.19484351e-01, 6.72213550e-01, 6.44037544e-03, 1.77964144e-03,
        8.20814166e-05]])
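
Each row of this array is a probability distribution over the five classes, and the predicted class is the column with the largest value. The sketch below uses three rows copied from the output; the column order (dos, normal, probe, r2l, u2r) is assumed from the order of `target_names` in the reports.

```python
import numpy as np

# Three rows copied from the y_pred_prob output above.
prob = np.array([
    [9.07070221e-01, 8.77961389e-02, 3.98045508e-03, 1.10245473e-03, 5.07304006e-05],
    [3.19483032e-01, 6.72210774e-01, 6.44034884e-03, 1.78376415e-03, 8.20810776e-05],
    [3.19484351e-01, 6.72213550e-01, 6.44037544e-03, 1.77964144e-03, 8.20814166e-05],
])

print(prob.sum(axis=1))         # each row sums to about 1
pred = np.argmax(prob, axis=1)  # index of the largest probability per row
print(pred)                     # [0 1 1]
```

In the rows shown, the last column (u2r under the assumed order) stays below 1e-4.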

In [21]:
print(classification_report(y_train, y_pred, target_names=correspondences.keys()))


              precision    recall  f1-score   support

         dos       1.00      1.00      1.00    262277
      normal       1.00      1.00      1.00     65176
       probe       1.00      1.00      1.00      2752
         r2l       0.98      0.96      0.97       754
         u2r       0.15      0.51      0.23        35

    accuracy                           1.00    330994
   macro avg       0.82      0.89      0.84    330994
weighted avg       1.00      1.00      1.00    330994



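Even on the training data, u2r is barely learned (precision 0.15, recall 0.51). The next run retrains with `is_unbalance` enabled.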

In [22]:
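# Note: LightGBM documents is_unbalance as used only in the binary and multiclassova
# applications, so it likely has no effect with objective='multiclass'
params['is_unbalance'] = True
# Train the model
model = lgb.train(
    params,  # parameters
    train_set=lgb_train,  # training data
    valid_sets=lgb_eval,  # validation data (here the held-out test split)
    callbacks=[lgb.early_stopping(100)]
)

# Predict on the training data (returns one probability per class for each row)
y_pred_prob = model.predict(x_train_new_feature)
# Convert to predicted class ids (0, 1, ...)
y_pred = np.argmax(y_pred_prob, axis=1)  # pick the class with the highest predicted probability
y_pred = pd.Series(y_pred)
y_pred.value_counts()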

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
Training until validation scores don't improve for 100 rounds




Early stopping, best iteration is:
[5]	valid_0's multi_error: 0.00104277


0    262289
1     65088
2      2754
3       740
4       123
dtype: int64

In [23]:
print(classification_report(y_train, y_pred, target_names=correspondences.keys()))

              precision    recall  f1-score   support

         dos       1.00      1.00      1.00    262277
      normal       1.00      1.00      1.00     65176
       probe       1.00      1.00      1.00      2752
         r2l       0.98      0.96      0.97       754
         u2r       0.15      0.51      0.23        35

    accuracy                           1.00    330994
   macro avg       0.82      0.89      0.84    330994
weighted avg       1.00      1.00      1.00    330994
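
The two training reports are identical, so `is_unbalance` did not change the model here. One option the notebook does not try is per-sample weighting. The sketch below computes inverse-frequency class weights from the training supports reported above, using the common `n_samples / (n_classes * count)` formula. Passing the resulting per-row weights through the `weight` argument of `lgb.Dataset` is an untested suggestion, not part of the original experiment.

```python
# Training-set support per class, copied from the report above.
train_counts = {"dos": 262277, "normal": 65176, "probe": 2752, "r2l": 754, "u2r": 35}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "Balanced" weighting: rare classes get proportionally larger weights.
class_weight = {c: n_samples / (n_classes * n) for c, n in train_counts.items()}

for c, w in class_weight.items():
    print(f"{c:<7} weight={w:.3f}")

# A per-sample weight vector would then be built by looking up each row's class,
# e.g. [class_weight[c] for c in labels], where `labels` holds class names (hypothetical).
```

With this formula every class contributes the same total weight, so the 35 u2r rows carry as much total weight as the 262277 dos rows.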

