reference: 

* https://www.kaggle.com/datasets/hassan06/nslkdd/code
* https://www.kaggle.com/code/nadagamal3/network-security-attack-classification

In [1]:
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset, random_split, ConcatDataset
from torch import optim

In [2]:
# project_home = '/content/drive/MyDrive/torch/example/basic/NSL_KDD'
project_home = '.'

In [3]:
# !unzip {project_home}/data/NSL-KDD.zip -d {project_home}/data/

Basic configuration parameters for the model

In [3]:
config = {
    "train": f'{project_home}/data/KDDTrain+.txt',
    "test": f'{project_home}/data/KDDTest+.txt',
    "batch_size": 128,
}
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Data preprocessing

In [4]:
columns = (['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot',
            'num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations',
            'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count',
            'serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate',
            'dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate',
            'dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate',
            'dst_host_srv_rerror_rate','attack','level'])

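Merge `KDDTrain+.txt` and `KDDTest+.txt` so that both go through the same preprocessing.

Keeping them separate has two drawbacks:
1. (Most important) After `get_dummies`, the training and test sets can end up with different feature dimensions, which then have to be aligned by hand before the model can be trained.

2. Preprocessing the training and test sets separately means more code.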
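The dimension mismatch in point 1 can be reproduced on toy data. The sketch below uses two made-up frames (not NSL-KDD rows) in which the test split contains a `service` value that the training split lacks. `DataFrame.reindex` with `fill_value=0` is shown as one possible way to align the columns by hand; the names `train_toy`, `test_toy` and so on are placeholders.

```python
import pandas as pd

# Toy stand-ins for the two splits; the values are illustrative only.
train_toy = pd.DataFrame({"service": ["http", "ftp"], "src_bytes": [10, 20]})
test_toy = pd.DataFrame({"service": ["http", "smtp"], "src_bytes": [30, 40]})

# Encoding each split on its own yields different column sets.
train_enc = pd.get_dummies(train_toy, columns=["service"], prefix="", prefix_sep="")
test_enc = pd.get_dummies(test_toy, columns=["service"], prefix="", prefix_sep="")
print(sorted(train_enc.columns))
print(sorted(test_enc.columns))

# Manual alignment: force the test split onto the training columns.
# Categories unseen in training (here 'smtp') are dropped, missing ones are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

# Encoding the concatenated frame gives every row the same columns in one step.
both_enc = pd.get_dummies(pd.concat([train_toy, test_toy]), columns=["service"],
                          prefix="", prefix_sep="")
print(sorted(both_enc.columns))
```

Merging first avoids the alignment step entirely, which is why the notebook concatenates the two files before encoding.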

In [5]:
# Merge the training and test sets
train = pd.read_csv(config['train'], names=columns, header=None)
test = pd.read_csv(config['test'], names=columns, header=None)
data = pd.concat([train, test], axis=0)
data.to_csv(f'{project_home}/data/train_test.csv', index=False)

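One-hot encode the multi-category attributes

After `get_dummies` processes 'protocol_type', 'service' and 'flag', each of these attribute columns is split into several new columns whose values are all 0 or 1 (shown as `False`/`True` in newer pandas versions). The original columns are removed from the dataset.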

In [6]:
data_df = pd.get_dummies(data,
                              columns=['protocol_type', 'service', 'flag'],
                              prefix="",
                              prefix_sep="")

In [7]:
data_df.shape

(148517, 124)

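Because the training and test sets are merged before encoding and only split afterwards, both get the same one-hot columns and therefore the same dimension.

The `level` column reportedly carries information about the label; to prevent label leakage, we drop it.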

In [8]:
data_df.drop('level', axis=1, inplace=True)

In [11]:
data_df.columns

Index(['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment',
       'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised',
       ...
       'REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S0', 'S1', 'S2', 'S3', 'SF', 'SH'],
      dtype='object', length=123)

## Dataset && DataLoader

In [12]:
len(data_df.iloc[0])

123

In [25]:
data_df.attack.unique()

array(['normal', 'neptune', 'warezclient', 'ipsweep', 'portsweep',
       'teardrop', 'nmap', 'satan', 'smurf', 'pod', 'back',
       'guess_passwd', 'ftp_write', 'multihop', 'rootkit',
       'buffer_overflow', 'imap', 'warezmaster', 'phf', 'land',
       'loadmodule', 'spy', 'perl', 'saint', 'mscan', 'apache2',
       'snmpgetattack', 'processtable', 'httptunnel', 'ps', 'snmpguess',
       'mailbomb', 'named', 'sendmail', 'xterm', 'worm', 'xlock',
       'xsnoop', 'sqlattack', 'udpstorm'], dtype=object)

Move the `attack` column to the last position

In [34]:
data_df.columns

Index(['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment',
       'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised',
       ...
       'REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S0', 'S1', 'S2', 'S3', 'SF', 'SH'],
      dtype='object', length=123)

In [35]:
# The index of attack is 38
list(data_df.columns).index('attack')

38

In [38]:
columns = list(data_df.columns)

In [41]:
# Remove attack from the column list
columns.pop(columns.index('attack'))
len(columns)

122

In [42]:
# Append attack as the last column, so that the last column can serve as the label later
columns.append('attack')

In [43]:
data_df = data_df[columns]

`attack` is now the last column

In [44]:
data_df.columns[-1]

'attack'
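The pop/append steps above can also be written as a single expression. A sketch on a toy frame (the columns `a` and `b` are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1], "attack": ["normal"], "b": [2]})
# Keep every other column in its existing order, then put 'attack' last.
toy = toy[[c for c in toy.columns if c != "attack"] + ["attack"]]
print(list(toy.columns))  # ['a', 'b', 'attack']
```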

In [51]:
data_df.attack = data_df.attack.map(lambda x: 0 if x == 'normal' else 1)

The `attack` column has been converted from strings to the numbers 0 (normal) or 1 (attack)

In [52]:
data_df.attack

0        0
1        0
2        1
3        0
4        0
        ..
22539    0
22540    0
22541    1
22542    0
22543    1
Name: attack, Length: 148517, dtype: int64
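The `map` call with a lambda works row by row. An equivalent vectorized form compares the whole column at once; a sketch on a toy Series (the values are illustrative):

```python
import pandas as pd

attack = pd.Series(["normal", "neptune", "normal", "smurf"])
# True where the record is an attack, then cast the booleans to 0/1.
binary = (attack != "normal").astype(int)
print(binary.tolist())  # [0, 1, 0, 1]
```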

`MyDataset` inherits from `Dataset` and overrides `__getitem__` and `__len__` so that it can be used with `DataLoader`.

In [53]:
class MyDataset(Dataset):

  def __init__(self, df):
    self.df = df

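  def __getitem__(self,idx):
    # Convert the row to float32 explicitly; with mixed column dtypes the row may be an object Series
    data = self.df.iloc[idx,:-1].to_numpy(dtype='float32')
    label  = self.df.iloc[idx, -1] # the last column (attack) is the label
    return torch.tensor(data), label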

  def __len__(self):
    return self.df.shape[0]

data_dataset = MyDataset(data_df)
train_dataset, val_dataset, test_dataset = random_split(data_dataset, [0.8, 0.1, 0.1])

train_loader = DataLoader(train_dataset, batch_size=config["batch_size"], shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config["batch_size"], shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=config["batch_size"], shuffle=False)

In [54]:
len(train_loader), len(val_loader), len(test_loader)

(929, 117, 117)
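The batch counts above follow from the split sizes. The sketch below reproduces the arithmetic in plain Python, assuming that `random_split` floors each fractional length and then distributes the leftover samples to the splits one at a time in round-robin order (as the PyTorch documentation describes), and that `DataLoader` keeps the final partial batch (`drop_last=False`, the default).

```python
import math

total = 148517            # rows in data_df
fractions = [0.8, 0.1, 0.1]
batch_size = 128

# Floor each fractional length, then distribute the remainder round-robin.
sizes = [math.floor(total * f) for f in fractions]
for i in range(total - sum(sizes)):
    sizes[i % len(sizes)] += 1

# Number of batches per loader, counting the final partial batch.
batches = [math.ceil(n / batch_size) for n in sizes]
print(sizes)    # [118814, 14852, 14851]
print(batches)  # [929, 117, 117]
```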

## Model

`x = self.relu(x + self.cc4(x))`

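is a residual connection. It helps against vanishing gradients and makes the model easier to train.
Residual connections let us add depth, and with it capacity, without the optimization difficulties that usually come with deeper plain networks. They are not a regularizer, though, so overfitting still has to be checked on the validation set.

Deeper is not always better; a model needs an appropriate depth. With residual connections we can add layers with less concern about degrading training.
If an added layer turns out not to help, training can push its weights toward 0, so that `self.cc4(x)` is close to 0 and the block reduces to (approximately) the identity, leaving the rest of the model unaffected.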
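The last point can be checked directly: if a residual branch outputs zero, the block passes its input through unchanged. The sketch below zeroes a stand-alone `nn.Linear` by hand (the name `branch` and the sizes are placeholders, not part of the model in this notebook) and relies on the input already being non-negative, as it is after a preceding ReLU.

```python
import torch
from torch import nn

relu = nn.ReLU()
branch = nn.Linear(4, 4)
# Force the branch to output zero, as if training had driven it there.
with torch.no_grad():
    branch.weight.zero_()
    branch.bias.zero_()

x = relu(torch.randn(3, 4))   # non-negative, like the output of an earlier ReLU
out = relu(x + branch(x))     # same form as relu(x + self.cc4(x))
print(torch.equal(out, x))    # True
```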

In [55]:
class Net(nn.Module):

  def __init__(self):
    super().__init__()
    self.bn = nn.BatchNorm1d(122)
    mid = 64
    self.linear1 = nn.Linear(122, mid)
    self.cc1 = nn.Linear(mid, mid)
    self.cc2 = nn.Linear(mid, mid)
    self.cc3 = nn.Linear(mid, mid)
    self.cc4 = nn.Linear(mid, mid)
    self.cc5 = nn.Linear(mid, mid)
    self.cc6 = nn.Linear(mid, mid)
    self.linear2 = nn.Linear(mid, 2)
    self.relu = nn.ReLU()
  
  # Residual network
  def forward(self, x):
    x = self.bn(x)
    x = self.relu(self.linear1(x))
    x = self.relu(x + self.cc1(x))
    x = self.relu(x + self.cc2(x))
    x = self.relu(x + self.cc3(x))
    x = self.relu(x + self.cc4(x))
    x = self.relu(x + self.cc5(x))
    x = self.relu(x + self.cc6(x))
    x = self.linear2(x)
    return x

Net()

Net(
  (bn): BatchNorm1d(122, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (linear1): Linear(in_features=122, out_features=64, bias=True)
  (cc1): Linear(in_features=64, out_features=64, bias=True)
  (cc2): Linear(in_features=64, out_features=64, bias=True)
  (cc3): Linear(in_features=64, out_features=64, bias=True)
  (cc4): Linear(in_features=64, out_features=64, bias=True)
  (cc5): Linear(in_features=64, out_features=64, bias=True)
  (cc6): Linear(in_features=64, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=2, bias=True)
  (relu): ReLU()
)

## train

In [56]:
# Compute the accuracy of the model on the given data loader (e.g. the test set)
@torch.no_grad()
def predict(model, data_loader):
  model.eval()
  num = 0
  right = 0
  for loader in data_loader:
    features = loader[0].to(device)
    labels = loader[1].to(device)
    num += len(labels)
    y_hat = model(features).argmax(dim=-1)
    batch_right = torch.sum(y_hat == labels).item()
    right += batch_right
  return right / num

In [57]:
def train(train_loader, val_loader, epochs):
  model = Net().to(device)
  loss_fn = nn.CrossEntropyLoss()
  optimizer = optim.SGD(model.parameters(), lr=3e-4, momentum=0.9)
  cur_val_loss = float('inf')
  for i in range(epochs):
    epoch_loss = 0
    epoch_right = 0
    nums = 0
    model.train()
    for loader in train_loader:
      features = loader[0].to(device)
      labels = loader[1].to(device)
      optimizer.zero_grad()
      y = model(features)
      loss = loss_fn(y, labels)
      loss.backward()
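      # Gradient clipping guards against exploding gradients: it caps the total gradient norm at 1.0
      torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
      # CrossEntropyLoss already averages over the batch, so this value is divided by the batch size a second time
      epoch_loss += loss.item() / features.size(0)
      # Each batch holds config["batch_size"] (128) samples; count how many of them are predicted correctly
      # print(torch.sum(y.argmax(dim=-1) == labels).item(), end=' ')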
      epoch_right += torch.sum(y.argmax(dim=-1) == labels).item()
      nums += features.size(0)
      optimizer.step()
    print(f"epoch {i}\nacc: {epoch_right / nums} loss: {epoch_loss / len(train_loader)}")
    
    # Evaluate on the validation set
    val_loss = 0
    val_right = 0
    val_nums = 0
    
    model.eval()
    with torch.no_grad():
      for loader in val_loader:
        features = loader[0].to(device)
        labels = loader[1].to(device)
        y = model(features)
        loss = loss_fn(y, labels)
        val_loss += loss.item() / features.size(0)
        val_right += torch.sum(y.argmax(dim=-1) == labels).item()
        val_nums += features.size(0)
    print(f"val acc: {val_right / val_nums} val loss: {val_loss / len(val_loader)}")
    print(f"test acc: {predict(model, test_loader)}")
    print()

    # Save the model whenever the validation loss reaches a new minimum
    if val_loss < cur_val_loss:
      cur_val_loss = val_loss
      torch.save(model.state_dict(), f"model_{i}.pth")
  return model
model = train(train_loader, val_loader, 10)

epoch 0
acc: 0.8958035248371404 loss: 0.0020531468053963564
val acc: 0.9525989765688123 val loss: 0.0038347863510119705
test acc: 0.9512490741364218

epoch 1
acc: 0.9598868820172707 loss: 0.0008200583640026272
val acc: 0.9602073794775114 val loss: 0.003411474112325754
test acc: 0.9606087132179651

epoch 2
acc: 0.9682949820728197 loss: 0.00066381181662272
val acc: 0.9679504443845947 val loss: 0.0035253631245319005
test acc: 0.9684196350414114

epoch 3
acc: 0.9734206406652415 loss: 0.000571790750098654
val acc: 0.9684217613789389 val loss: 0.002991055831552929
test acc: 0.9698336812335869

epoch 4
acc: 0.975465854192267 loss: 0.0005237178929710836
val acc: 0.9698357123619714 val loss: 0.003115694736688732
test acc: 0.9709110497609589

epoch 5
acc: 0.9766609995455081 loss: 0.00048812380793412436
val acc: 0.9709130083490439 val loss: 0.0030689245150459367
test acc: 0.9718537472224092

epoch 6
acc: 0.9777383136667396 loss: 0.000458844711245173
val acc: 0.9704416913546997 val loss: 0.0025036
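`clip_grad_norm_` in the training loop bounds the gradients, not the parameters: it rescales all gradients by one common factor so that their combined L2 norm is at most `max_norm`. A plain-Python sketch of that rule follows. The helper `clip_by_global_norm` is hypothetical and only mirrors the idea, including a small epsilon in the denominator; it is not PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by the same factor so the global L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in group] for group in grads], total_norm

# Two "parameters" whose gradients have a combined norm of 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for group in clipped for g in group))
print(norm_before)            # 5.0
print(round(norm_after, 3))   # 1.0

# Gradients that are already small enough are left untouched.
small, _ = clip_by_global_norm([[0.3], [0.4]], max_norm=1.0)
print(small)                  # [[0.3], [0.4]]
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.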

In [28]:
model

Net(
  (bn): BatchNorm1d(122, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (linear1): Linear(in_features=122, out_features=64, bias=True)
  (cc1): Linear(in_features=64, out_features=64, bias=True)
  (cc2): Linear(in_features=64, out_features=64, bias=True)
  (cc3): Linear(in_features=64, out_features=64, bias=True)
  (cc4): Linear(in_features=64, out_features=64, bias=True)
  (cc5): Linear(in_features=64, out_features=64, bias=True)
  (cc6): Linear(in_features=64, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=2, bias=True)
  (relu): ReLU()
)

## predict

```
epoch 6
acc: 0.9777383136667396 loss: 0.000458844711245173
val acc: 0.9704416913546997 val loss: 0.002503662280097572
test acc: 0.9720557538212915
```
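From epoch 6 onward the validation loss no longer decreases, so to avoid overfitting we train for only 6 epochs.

In some cases I use `0.2 * train_loss + 0.8 * val_loss` to decide at which epoch the loss has stopped decreasing, which can guard against overfitting better. Looking at `val_loss` alone may yield a model whose loss is very small on the validation set but large on the test set, i.e. one that has overfitted the validation set.

However, I have not seen anyone else use this approach, so I cannot claim that it reliably improves the model.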
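As a sketch of how such a combined criterion would be applied, the snippet below scores each epoch from the training and validation losses printed in the log (only the seven epochs, 0 to 6, that appear there) and picks the epoch with the lowest score. The 0.2/0.8 weights come from the text; the variable names and the choice of simply taking the minimum are illustrative assumptions.

```python
# Losses copied from the training log, epochs 0 to 6.
train_loss = [0.0020531468053963564, 0.0008200583640026272, 0.00066381181662272,
              0.000571790750098654, 0.0005237178929710836, 0.00048812380793412436,
              0.000458844711245173]
val_loss = [0.0038347863510119705, 0.003411474112325754, 0.0035253631245319005,
            0.002991055831552929, 0.003115694736688732, 0.0030689245150459367,
            0.002503662280097572]

# Weighted score per epoch: mostly validation loss, with a small share of training loss.
score = [0.2 * t + 0.8 * v for t, v in zip(train_loss, val_loss)]
best_epoch = min(range(len(score)), key=score.__getitem__)
print(best_epoch)  # 6
```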

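Model weights must not be selected based on the loss or accuracy on the test set: that is cheating, and the resulting test score no longer reflects how well the model generalizes.

Next, merge the training and validation sets, set the number of epochs to 6, retrain the model, and then predict.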

In [59]:
# Merge the training and validation sets so that more data takes part in training, which should help generalization
train_val_dataset = ConcatDataset([train_dataset, val_dataset])
train_val_loader = DataLoader(train_val_dataset, batch_size=config["batch_size"], shuffle=True)

In [60]:
def final_train(train_loader, epochs):
  model = Net().to(device)
  loss_fn = nn.CrossEntropyLoss()
  optimizer = optim.SGD(model.parameters(), lr=3e-4, momentum=0.9)
  for i in range(epochs):
    for loader in train_loader:
      features = loader[0].to(device)
      labels = loader[1].to(device)
      optimizer.zero_grad()
      y = model(features)
      loss = loss_fn(y, labels)
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
      optimizer.step()
  return model
model = final_train(train_val_loader, 6)

Save the final model to a file so that it does not have to be retrained and can be reused for later predictions.

In [62]:
# The save call is commented out by default to avoid overwriting the model that has already been saved
# torch.save(model, 'best_model.pth')

In [63]:
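# The file holds the whole pickled model, so weights_only=False is needed on PyTorch versions where it defaults to True
best_model = torch.load('best_model.pth', map_location=device, weights_only=False)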
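Pickling the whole model ties the file to the exact class definition, and recent PyTorch releases default `torch.load` to `weights_only=True`, which rejects such files unless that default is overridden. A commonly used alternative is to save only the `state_dict`. The sketch below uses a small stand-in `nn.Linear` and an in-memory buffer instead of `Net` and a file on disk; both are assumptions made to keep it self-contained.

```python
import io
import torch
from torch import nn

model = nn.Linear(4, 2)              # stand-in for the trained Net
buffer = io.BytesIO()                # stand-in for a file such as 'best_model.pth'
torch.save(model.state_dict(), buffer)

buffer.seek(0)
restored = nn.Linear(4, 2)           # rebuild the architecture first
restored.load_state_dict(torch.load(buffer, weights_only=True))
restored.eval()

x = torch.randn(3, 4)
print(torch.equal(model(x), restored(x)))  # True
```

With this approach the loading code has to construct the model itself (here `nn.Linear(4, 2)`, in the notebook it would be `Net()`) before calling `load_state_dict`.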

In [64]:
predict(best_model, test_loader)

0.9703050299643121

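We therefore take the model's final accuracy on the test set to be 0.97.

Whatever the result on the test set, do not tune any further at this point; otherwise the model ends up fitted to the test set and its ability to generalize suffers.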