<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_13_2_anomaly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 13: Other Neural Network Techniques**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 13 Video Material

* Part 13.1: Using Denoising AutoEncoders [[Video]](https://www.youtube.com/watch?v=BBrRD89sTk8&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_13_1_auto_encode.ipynb)
* **Part 13.2: Anomaly Detection** [[Video]](https://www.youtube.com/watch?v=wubZ516TkI8&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_13_2_anomaly.ipynb)
* Part 13.3: Model Drift and Retraining [[Video]](https://www.youtube.com/watch?v=F4395B1ySpg&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_13_3_retrain.ipynb)
* Part 13.4: Tensor Processing Units (TPUs) [[Video]](https://www.youtube.com/watch?v=Cp3xOyxOZNo&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_13_4_tpu.ipynb)
* Part 13.5: Future Directions in Artificial Intelligence [[Video]](https://www.youtube.com/watch?v=RjxvEZh73Yc&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_13_5_new_tech.ipynb)



# Google CoLab Instructions

The following code checks whether Google CoLab is in use and sets up the correct hardware settings for PyTorch.


In [1]:
import torch

try:
    import google.colab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
has_mps = torch.backends.mps.is_available()
device = "mps" if has_mps else "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Note: not using Google CoLab
Using device: mps


# Part 13.2: Anomaly Detection

Anomaly detection is an unsupervised training technique that analyzes the degree to which incoming data differs from the data you used to train the neural network. Traditionally, cybersecurity experts have used anomaly detection to ensure network security. However, you can also use anomaly detection in data science to detect input for which you have not trained your neural network.

Several data sets are commonly used to demonstrate anomaly detection, including those listed below. In this part, we will look at the KDD-99 dataset.


* [Stratosphere IPS Dataset](https://www.stratosphereips.org/category/dataset.html)
* [The ADFA Intrusion Detection Datasets (2013) - for HIDS](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-IDS-Datasets/)
* [ITOC CDX (2009)](https://westpoint.edu/centers-and-research/cyber-research-center/data-sets)
* [KDD-99 Dataset](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)

## Read in KDD99 Data Set

Although the KDD99 dataset is over 20 years old, it is still widely used to demonstrate intrusion detection systems (IDS) and anomaly detection. KDD99 is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector: a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, including various intrusions simulated in a military network environment.

The following code reads the KDD99 CSV dataset into a Pandas data frame. The standard format of KDD99 does not include column names, so the program downloads a copy of the file (`kdd-with-columns.csv`) that already has them.

In [2]:
import pandas as pd
import urllib.request
import os

# Set Pandas display options
pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 5)

# Download the file using urllib
url = 'https://github.com/jeffheaton/jheaton-ds2/raw/main/kdd-with-columns.csv'
filename = 'kdd-with-columns.csv'

if not os.path.isfile(filename):
    try:
        urllib.request.urlretrieve(url, filename)
    except:
        print('Error downloading')
        raise

print(filename)

# Original file: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
df = pd.read_csv(filename)

print("Read {} rows.".format(len(df)))
# df = df.sample(frac=0.1, replace=False) # Uncomment this line to sample only 10% of the dataset
# For now, just drop any columns that contain missing values (axis=1)
df.dropna(inplace=True, axis=1)

# Display 5 rows
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 5)
print(df)

kdd-with-columns.csv
Read 494021 rows.
        duration protocol_type  ... dst_host_srv_rerror_rate  outcome
0              0           tcp  ...                      0.0  normal.
1              0           tcp  ...                      0.0  normal.
...          ...           ...  ...                      ...      ...
494019         0           tcp  ...                      0.0  normal.
494020         0           tcp  ...                      0.0  normal.

[494021 rows x 42 columns]


The KDD99 dataset contains many columns that define the network state over time intervals during which a cyber attack might have taken place. The "outcome" column specifies either "normal," indicating no attack, or the type of attack performed. The following code displays the counts for each type of attack and for "normal".

In [3]:
df.groupby('outcome')['outcome'].count()

outcome
back.               2203
buffer_overflow.      30
                    ... 
warezclient.        1020
warezmaster.          20
Name: outcome, Length: 23, dtype: int64

## Preprocessing 

We must perform some preprocessing before we can feed the KDD99 data into the neural network. We provide the following functions to assist with preprocessing. The first function converts a numeric column into Z-Scores. The second function replaces a categorical column with dummy variables. The third function, `process_dataframe`, applies the appropriate encoding to every column except "outcome".

In [4]:
import pandas as pd

def encode_numeric_zscore(df, name):
    """
    Apply z-score normalization to a specified numeric column.

    Parameters:
    df (DataFrame): The pandas DataFrame containing the column.
    name (str): The name of the column to normalize.
    """
    mean = df[name].mean()
    sd = df[name].std()
    df[name] = (df[name] - mean) / sd

def encode_text_dummy(df, name):
    """
    Convert a categorical column to dummy variables.

    Parameters:
    df (DataFrame): The pandas DataFrame containing the column.
    name (str): The name of the categorical column.
    """
    dummies = pd.get_dummies(df[name], prefix=name, dtype=float)
    df = pd.concat([df, dummies], axis=1)
    df.drop(name, axis=1, inplace=True)
    return df

def process_dataframe(df):
    """
    Process a DataFrame by encoding its features.

    Parameters:
    df (DataFrame): The pandas DataFrame to process.
    """
    for name in df.columns:
        if name == 'outcome':
            continue
        #elif df[name].dtype == bool:
        #    print("**", name)
        #    df[name] = df[name].astype(float)
        elif name in ['protocol_type', 'service', 'flag', 'land', 'logged_in',
                      'is_host_login', 'is_guest_login']:
            df = encode_text_dummy(df, name)
        else:
            encode_numeric_zscore(df, name)
    return df
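
As a quick illustration of what these two encodings do, the sketch below applies the same formulas to a tiny made-up frame. The column names `bytes`, `constant`, and `proto` are hypothetical and are not KDD99 columns. It also shows a side effect worth knowing: a column whose values never vary has a standard deviation of zero, so its Z-Score is undefined and comes out as NaN.

```python
import pandas as pd

# Toy data; the column names are illustrative only, not part of KDD99.
toy = pd.DataFrame({
    "bytes": [10.0, 20.0, 30.0],
    "constant": [5.0, 5.0, 5.0],
    "proto": ["tcp", "udp", "tcp"],
})

# Z-Score: subtract the column mean, then divide by the sample
# standard deviation (the pandas default, ddof=1).
for name in ["bytes", "constant"]:
    toy[name] = (toy[name] - toy[name].mean()) / toy[name].std()

# Dummy variables: one 0/1 column per distinct category value.
dummies = pd.get_dummies(toy["proto"], prefix="proto", dtype=float)
toy = pd.concat([toy.drop(columns="proto"), dummies], axis=1)

print(toy)
```

Calling `dropna(axis=1)` after processing, as the next cell does, removes any column that ended up as NaN in this way.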


This code converts all numeric columns to Z-Scores and all textual columns to dummy variables. We now use these functions to preprocess each of the columns. Once the program preprocesses the data, we display the results.

In [5]:
pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 5)

df = process_dataframe(df)
df.dropna(inplace=True, axis=1)
print(df.head())


   duration  src_bytes  dst_bytes  ...  is_host_login_0  is_guest_login_0  \
0 -0.067792  -0.002879   0.138664  ...              1.0               1.0   
1 -0.067792  -0.002820  -0.011578  ...              1.0               1.0   
2 -0.067792  -0.002824   0.014179  ...              1.0               1.0   
3 -0.067792  -0.002840   0.014179  ...              1.0               1.0   
4 -0.067792  -0.002842   0.035214  ...              1.0               1.0   

   is_guest_login_1  
0               0.0  
1               0.0  
2               0.0  
3               0.0  
4               0.0  

[5 rows x 121 columns]


To perform anomaly detection, we divide the data into two groups: "normal" and the various attacks. The following code divides the data into two data frames and displays the size of each group.

In [6]:
normal_mask = df['outcome']=='normal.'
attack_mask = df['outcome']!='normal.'

df.drop('outcome',axis=1,inplace=True)

df_normal = df[normal_mask]
df_attack = df[attack_mask]

print(f"Normal count: {len(df_normal)}")
print(f"Attack count: {len(df_attack)}")

Normal count: 97278
Attack count: 396743


Next, we convert these two data frames into Numpy arrays. The program later converts these arrays into PyTorch tensors.

In [7]:
# This is the numeric feature vector, as it goes to the neural net
x_normal = df_normal.values
x_attack = df_attack.values

## Training the Autoencoder

It is important to note that we are not using the outcome column as a label to predict. We will train an autoencoder on the normal data and see how well it can detect that the data not flagged as "normal" represents an anomaly. This anomaly detection is unsupervised; there is no target (y) value to predict. 

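Next, we split the normal data into a 25% test set and a 75% train set. The test set provides held-out normal data for measuring out-of-sample error. It could also be used to facilitate early stopping, although the training loop below simply runs for a fixed number of epochs.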

In [8]:
from sklearn.model_selection import train_test_split

x_normal_train, x_normal_test = train_test_split(
    x_normal, test_size=0.25, random_state=42)


In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Convert numpy arrays to PyTorch tensors and move them to the appropriate device
x_normal_train_tensor = torch.tensor(x_normal_train).float().to(device)
x_normal_tensor = torch.tensor(x_normal).float().to(device)
x_attack_tensor = torch.tensor(x_attack).float().to(device)

# Create DataLoader for batch processing
train_data = TensorDataset(x_normal_train_tensor, x_normal_train_tensor)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

# Define the model using Sequential
model = nn.Sequential(
    nn.Linear(x_normal.shape[1], 25),
    nn.ReLU(),
    nn.Linear(25, 3),
    nn.ReLU(),
    nn.Linear(3, 25),
    nn.ReLU(),
    nn.Linear(25, x_normal.shape[1])
).to(device)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10
# Training loop
for epoch in range(num_epochs):
    running_loss = 0
    den = 0
    for data in train_loader:
        inputs, targets = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        den += 1

    print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {running_loss/den}')


Epoch [1/10], Loss: 0.32732871251074563
Epoch [2/10], Loss: 0.2666971355484668
Epoch [3/10], Loss: 0.25296189310355927
Epoch [4/10], Loss: 0.2429972947542474
Epoch [5/10], Loss: 0.22758804882525285
Epoch [6/10], Loss: 0.2041895669215081
Epoch [7/10], Loss: 0.18810490698149232
Epoch [8/10], Loss: 0.1733971669941144
Epoch [9/10], Loss: 0.16838515430255874
Epoch [10/10], Loss: 0.1590980681635668
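
The training loop above runs for a fixed ten epochs. The following is a minimal sketch of how a held-out loss could drive early stopping instead. The `EarlyStopper` class, its `patience` and `min_delta` parameters, and the list of validation losses are all hypothetical; they are not part of this notebook or of PyTorch. In practice the per-epoch loss would come from evaluating the model on held-out normal data.

```python
class EarlyStopper:
    """Hypothetical helper: signal a stop once validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # New best loss: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, standing in for a per-epoch held-out loss.
val_losses = [0.30, 0.25, 0.24, 0.26, 0.25, 0.27, 0.23]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(f"Stopped after epoch {stopped_at}, best loss {stopper.best}")
```

A real implementation would usually also save the model weights at the best epoch so they can be restored after stopping.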


We display the size of the train and test sets.

In [10]:
print(f"Normal train count: {len(x_normal_train)}")
print(f"Normal test count: {len(x_normal_test)}")

Normal train count: 72958
Normal test count: 24320


The training loop above fits the autoencoder to the normal data only. The autoencoder learns to compress the data to a vector of just three numbers and should also be able to decompress that vector with reasonable accuracy. As is typical for autoencoders, we are merely training the neural network to produce the same output values as were fed to the input layer.
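
To make the size of this bottleneck concrete, the short sketch below tallies the weights and biases of each `nn.Linear` layer in the model. The input width of 120 is an assumption derived from the frame displayed earlier (121 columns, minus the dropped "outcome" column); the actual width is whatever `x_normal.shape[1]` reports.

```python
# Layer widths of the autoencoder: input -> 25 -> 3 -> 25 -> output.
# Assumption: 120 input features (121 displayed columns minus "outcome").
n_features = 120
widths = [n_features, 25, 3, 25, n_features]

total = 0
for fan_in, fan_out in zip(widths, widths[1:]):
    params = fan_in * fan_out + fan_out  # weight matrix plus bias vector
    total += params
    print(f"Linear({fan_in}, {fan_out}): {params} parameters")

print(f"Total trainable parameters: {total}")
print(f"Features per bottleneck value: {n_features // min(widths)}")
```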

## Detecting an Anomaly

We are now ready to see if the abnormal data is an anomaly. The first two scores show the out-of-sample and in-sample RMSE errors (the latter is computed over all of the normal data). These two scores are relatively low, at around 0.39 and 0.38, because they resulted from normal data. The higher 0.51 error occurred from the abnormal data. The autoencoder is not as capable of encoding data that represents an attack. This higher error indicates an anomaly.

In [11]:
model.eval()  # Set the model to evaluation mode

# Function to calculate RMSE
def calculate_rmse(model, data):
    with torch.no_grad():
        predictions = model(data)
        mse_loss = nn.MSELoss()(predictions, data)
    return torch.sqrt(mse_loss).item()

# Evaluating the model
score1 = calculate_rmse(model, torch.tensor(x_normal_test).float().to(device))
score2 = calculate_rmse(model, x_normal_tensor)
score3 = calculate_rmse(model, x_attack_tensor)

print(f"Out of Sample Normal Score (RMSE): {score1}")
print(f"Insample Normal Score (RMSE): {score2}")
print(f"Attack Underway Score (RMSE): {score3}")

Out of Sample Normal Score (RMSE): 0.391261488199234
Insample Normal Score (RMSE): 0.3845501244068146
Attack Underway Score (RMSE): 0.5126159191131592
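
The scores above are aggregate RMSE values over entire data sets. To flag individual rows, one option is to compute the reconstruction error per row and compare it with a threshold. The sketch below is one way this might look; it uses synthetic NumPy arrays in place of the real inputs and model reconstructions, and the choice of the 99th percentile of the normal errors as the threshold is an assumption, not something this notebook prescribes. The function name `row_rmse` is likewise hypothetical.

```python
import numpy as np

def row_rmse(x, x_hat):
    """Per-row RMSE between inputs and their reconstructions."""
    return np.sqrt(np.mean((x - x_hat) ** 2, axis=1))

rng = np.random.default_rng(0)

# Synthetic stand-ins: "normal" rows are reconstructed closely,
# "odd" rows are reconstructed poorly. With the real model, the
# reconstructions would come from the trained autoencoder instead.
x_norm = rng.normal(size=(1000, 8))
x_norm_hat = x_norm + rng.normal(scale=0.1, size=x_norm.shape)
x_odd = rng.normal(size=(50, 8))
x_odd_hat = x_odd + rng.normal(scale=1.0, size=x_odd.shape)

normal_err = row_rmse(x_norm, x_norm_hat)
odd_err = row_rmse(x_odd, x_odd_hat)

# Assumed rule: flag any row whose error exceeds the 99th percentile
# of the errors observed on normal data.
threshold = np.percentile(normal_err, 99)
flagged_normal = np.mean(normal_err > threshold)
flagged_odd = np.mean(odd_err > threshold)

print(f"Threshold: {threshold:.3f}")
print(f"Fraction of normal rows flagged: {flagged_normal:.3f}")
print(f"Fraction of odd rows flagged: {flagged_odd:.3f}")
```

The percentile sets the false-positive rate on normal data directly, so a lower percentile catches more anomalies at the cost of more false alarms.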
