# Network Traffic Analysis *Project*

In class we have covered binary classification (e.g., spam vs. ham, fake vs. real). In this assignment, you will perform multi-class classification on network traffic data.

We want you to do this in three ways:

**Direct Multi-Class Classification [KNN [Done], DT [Done], NN, etc]**

Directly use our previous methods for binary classification (Decision Trees, KNN, Perceptron, Neural Networks) to predict multiple classes.

**Direct Multi-Class Classification with Resampling [Done]**

Resample the large, unbalanced dataset to obtain a smaller, more balanced dataset for the classifier.

**Today: Hierarchical Multi-Class Classification**

Perform binary classification first (benign vs. malicious). Once a sample has been identified as malicious, perform multi-class classification to identify what kind of malicious activity is occurring.
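
The two-stage prediction described above can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: the function name `hierarchical_predict` and both model arguments are hypothetical, and the models are only assumed to expose a scikit-learn-style `predict` method. The first-stage model is assumed to output the string `'BENIGN'` for benign traffic.

```python
def hierarchical_predict(binary_model, attack_model, X):
    """Two-stage prediction: benign vs. malicious, then attack type.

    binary_model : hypothetical first-layer model; predicts 'BENIGN' or
                   'MALICIOUS' for each row of X
    attack_model : hypothetical second-layer model; predicts a specific
                   attack label for rows flagged as malicious
    X            : sequence of feature rows (a list of lists here)
    """
    stage1 = list(binary_model.predict(X))

    # Indices of the rows that the first layer flagged as malicious.
    malicious_idx = [i for i, label in enumerate(stage1) if label != 'BENIGN']

    final = list(stage1)
    if malicious_idx:
        # Only the flagged rows are passed on to the second layer.
        stage2 = attack_model.predict([X[i] for i in malicious_idx])
        for i, label in zip(malicious_idx, stage2):
            final[i] = label
    return final
```

If `X` is a pandas DataFrame rather than a list, the row selection would be `X.iloc[malicious_idx]` instead of the list comprehension. Note that a malicious sample misclassified as benign by the first layer never reaches the second layer, so first-layer recall bounds the overall performance.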


In [7]:
# 10/25 Cla
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

folder_path = '/content/drive/MyDrive/CS345/MachineLearningCSV/MachineLearningCVE/'
df = pd.read_csv(folder_path + 'clean_traffic_data.csv')
fig = plt.figure(figsize=(25,6))
sns.countplot(x=' Label', data=df)
plt.show()

FileNotFoundError: ignored

## Hierarchical Multi-Class Classification

## 1st layer of training: classify BENIGN and MALICIOUS activities

In [5]:
import numpy as np
import pandas as pd
np.random.seed(1)  # seed NumPy's global RNG

label_set = set(df[' Label'])
train_df_list = []
test_df_list = []
print('-'*60)


for label in label_set:
  # draw an ~80/20 train/test mask separately for each label
  mask = np.random.rand(len(df[df[' Label'] == label])) < 0.8
  train_df_list.append(df[df[' Label'] == label][mask])
  test_df_list.append(df[df[' Label'] == label][~mask])

df_train = pd.concat(train_df_list)
df_test = pd.concat(test_df_list)

print('-'*60)
print('check if testing set contains all the categories:', set(df_train[' Label']) == set(df_test[' Label']))
print('-'*60)
for label in label_set:
  print('num training samples for "{}": {}'.format(label, len(df_train[df_train[' Label'] == label])))
  print('num testing samples for "{}": {}'.format(label, len(df_test[df_test[' Label'] == label])))

NameError: ignored

In [None]:
# train a binary classifier

df_binary = df_train.copy()
# relabel every non-BENIGN sample as MALICIOUS
df_binary.loc[df_binary[' Label'] != 'BENIGN', ' Label'] = 'MALICIOUS'

fig = plt.figure(figsize = (10,6))
sns.countplot(x=' Label', data=df_binary)
plt.show()

In [9]:
# train a binary model
# reduce and balance binary dataset using resampling/downsampling
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy={'BENIGN':50000, "MALICIOUS": 50000}, random_state=1)

# features are all columns except the last; the label is the last column
X = df_binary[df_binary.columns[:-1]]
y = df_binary[df_binary.columns[-1]]
X_resampled, y_resampled = rus.fit_resample(X, y)
resampled_df_binary = pd.DataFrame(columns=df_binary.columns)
resampled_df_binary[resampled_df_binary.columns[:-1]] = X_resampled
resampled_df_binary[resampled_df_binary.columns[-1]] = y_resampled

# check histogram of resample data
fig = plt.figure(figsize = (10,6))
sns.countplot(x=' Label', data=resampled_df_binary)
plt.show()

NameError: ignored
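
The undersampling step above relies on `RandomUnderSampler` from imbalanced-learn. To make the idea concrete, here is a rough pure-Python sketch of random undersampling to a target count per class. The function `random_undersample` and its `targets` argument are hypothetical names chosen for illustration (loosely mirroring the `sampling_strategy` dictionary), and the toy data is made up; this is not how the library is implemented.

```python
import random
from collections import defaultdict

def random_undersample(rows, labels, targets, seed=1):
    """Randomly keep `targets[label]` rows of each class, without replacement.

    Classes that do not appear in `targets` are kept in full.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    keep = []
    for label, idx in by_label.items():
        n = targets.get(label, len(idx))
        if n > len(idx):
            raise ValueError(f"class {label!r} has only {len(idx)} rows")
        keep.extend(rng.sample(idx, n))

    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 8 BENIGN rows and 3 MALICIOUS rows, balanced down to 3 each.
rows = list(range(11))
labels = ['BENIGN'] * 8 + ['MALICIOUS'] * 3
X_small, y_small = random_undersample(rows, labels, {'BENIGN': 3, 'MALICIOUS': 3})
```

Because rows are dropped rather than duplicated, the result is both smaller and more balanced, at the cost of discarding some majority-class information.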

In [None]:
# split the resampled training data into train (90%) and validation (10%)
import numpy as np
np.random.seed(0)  # seed NumPy's global RNG

# gen train and validation set
mask = np.random.rand(len(resampled_df_binary)) < 0.9
train_set = resampled_df_binary[mask]
val_set = resampled_df_binary[~mask]

X_train = train_set[train_set.columns[:-1]]
y_train = train_set[train_set.columns[-1]]
X_val = val_set[val_set.columns[:-1]]
y_val = val_set[val_set.columns[-1]]

# model training
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

mlp = MLPClassifier(hidden_layer_sizes=(40,), random_state=1, max_iter=300).fit(X_train, y_train)
pred_val = mlp.predict(X_val)

print('Validation Accuracy:', accuracy_score(y_val, pred_val))
df_binary_test = df_test.copy()
df_binary_test.loc[df_binary_test[' Label'] != 'BENIGN', ' Label'] = 'MALICIOUS'

X_test = df_binary_test[df_binary_test.columns[:-1]]
y_test = df_binary_test[df_binary_test.columns[-1]]

pred_test = mlp.predict(X_test)
print('Testing Accuracy:', accuracy_score(y_test, pred_test))
print(classification_report(y_test, pred_test))

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier().fit(X_train, y_train)

pred_val = rfc.predict(X_val)
print('Validation Accuracy:', accuracy_score(y_val, pred_val))

pred_test = rfc.predict(X_test)
print('Testing Accuracy:', accuracy_score(y_test, pred_test))
print(classification_report(y_test, pred_test))


## 2nd layer of training: classify different types of malicious activities