Author: Shangyuan Liu

Username: acp21sl

UCard: 001768913

Module: COM6013 - Cybersecurity and Artificial Intelligence Dissertation Project

Project Name: Malicious Endpoint Detection and Response

Step 01: Data pre-processing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter("ignore", UserWarning)

Load the dataset

In [3]:
data_path = "../Dataset/IoT-DS2.csv"  # the original dataset
dataset = pd.read_csv(data_path)  # read the raw dataset into data-frame

df_raw = dataset.copy()
df_raw = df_raw.drop('Bwd_IAT_Mean.1', axis=1)  # remove the last column - all NAN

df_raw

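Run a statistical analysis of the DataFrame to check the ratio of normal to anomalous data

Count the attack types and the number of records for each type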

In [None]:
print('The number of rows in raw dataset', df_raw.shape[0])  # rows  -- 1438157
print('The number of columns in raw dataset', df_raw.shape[1])  # columns  --  86

label_statistics = df_raw["Label"].value_counts()  # the statistics of Label features
cat_statistics = df_raw["Cat"].value_counts()  # the statistics of Cat features

print('\nThe statistics of Label feature \n', label_statistics)
print('\nThe statistics of Cat feature \n', cat_statistics)
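The counts above give absolute class sizes. As a minimal sketch of how the normal-to-anomaly ratio could be read off directly, `value_counts(normalize=True)` returns each class's share of the rows. The `toy_labels` series below is made-up illustration data, not the real `Label` column.

```python
import pandas as pd

# toy stand-in for df_raw["Label"]; the values are invented for illustration
toy_labels = pd.Series(["Anomaly", "Anomaly", "Anomaly", "Normal"], name="Label")

# normalize=True returns the fraction of rows per class instead of raw counts
label_ratio = toy_labels.value_counts(normalize=True)
print(label_ratio)
```

On the real data the same call would be applied to `df_raw["Label"]` and `df_raw["Cat"]`.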

Clean up positive and negative infinity values

In [None]:
df_delInf = df_raw.replace([np.inf, -np.inf], np.nan).dropna()  # replace inf and -inf with NaN, then drop every row that contains a NaN
df_delInf   
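A small sketch of what the cleaning step does, using a toy data-frame with hypothetical column names (`rate_a`, `rate_b`). It also shows a side effect worth keeping in mind: `dropna()` removes every row that contains a NaN, including rows that were already NaN before the infinities were replaced.

```python
import numpy as np
import pandas as pd

# toy data-frame; column names and values are invented for illustration
toy = pd.DataFrame({
    "rate_a": [10.0, np.inf, 3.5, -np.inf],
    "rate_b": [1.0, 2.0, np.nan, 4.0],
})

# number of rows that contain at least one infinite value
n_inf_rows = int(np.isinf(toy.to_numpy()).any(axis=1).sum())

# same two-step cleaning as in the cell above
toy_clean = toy.replace([np.inf, -np.inf], np.nan).dropna()

print("rows with inf:", n_inf_rows)
print("rows kept:", len(toy_clean), "of", len(toy))
```

Here two rows hold infinities, but three rows are dropped because the third row already had a NaN in `rate_b`.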

In [None]:
df_repZero = df_delInf.replace(0, np.nan)   # replace all 0 values with NaN so that they count as missing

def missing_rate(df):
    """
    Calculate the percentage of missing values (NaN) in each feature.
    Args:
        df (pd.DataFrame): data-frame in which missing values are marked as NaN
    Returns:
        pd.Series: percentage of missing values per feature, sorted in ascending
                   order and restricted to features with at least one missing value
    """
    # percentage of missing values in each column
    nan_percent = (df.isnull().sum() / len(df)) * 100
    # keep only the columns that have missing values (> 0), sorted in ascending order
    nan_percent = nan_percent[nan_percent > 0].sort_values()
    return nan_percent

missingVal_feature = missing_rate(df_repZero)

# print the percentage of NaN values per feature
print("The percentage of each feature's missing value\n", missingVal_feature)
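To make the zero-as-missing calculation concrete, here is a toy sketch with three hypothetical columns. It uses `isnull().mean()`, which is equivalent to dividing the NaN count by the number of rows.

```python
import numpy as np
import pandas as pd

# toy data-frame; column names and values are invented for illustration
toy = pd.DataFrame({
    "mostly_zero": [0, 0, 0, 5],
    "half_zero": [0, 1, 0, 2],
    "no_zero": [1, 2, 3, 4],
})

# treat zeros as missing, then compute the percentage of NaN per column
toy_nan = toy.replace(0, np.nan)
rate = toy_nan.isnull().mean() * 100

# keep only columns with at least one missing value, sorted ascending
rate = rate[rate > 0].sort_values()
print(rate)
```

The column without zeros disappears from the result, and the remaining columns are ordered from the least to the most affected.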

Set a threshold and remove every feature whose percentage of missing values exceeds it

In [None]:
threshold = 90
# features whose percentage of missing values exceeds 90%
missingVal_90 = missingVal_feature[missingVal_feature > threshold]

# list of the features that should be removed
delete_list = missingVal_90.index.tolist()

# drop the features in delete_list to create a new data-frame
# (dropped from df_delInf, so the remaining columns keep their original 0 values)
df_delMissVal = df_delInf.drop(delete_list, axis=1)

df_delMissVal
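The thresholding step can be sketched on toy data as follows. The threshold of 60 and the column names are hypothetical and chosen only so that exactly one column is removed. Note that the rate is computed on the zero-replaced copy while the columns are dropped from the original frame, so the surviving columns keep their zeros.

```python
import numpy as np
import pandas as pd

# toy data-frame; column names and values are invented for illustration
toy = pd.DataFrame({
    "mostly_zero": [0, 0, 0, 5],
    "half_zero": [0, 1, 0, 2],
    "no_zero": [1, 2, 3, 4],
})

# percentage of zero-as-missing values per column
rate = toy.replace(0, np.nan).isnull().mean() * 100

toy_threshold = 60  # hypothetical value for this toy example only
to_drop = rate[rate > toy_threshold].index.tolist()

# drop from the original frame, not from the zero-replaced copy
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
print(toy_reduced)
```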

Remove all meaningless features (this step is currently commented out)

In [None]:
# # remove all meaningless features
# df_dropFeats = df_delMissVal.drop(["Flow_ID", "Src_IP", "Src_Port", "Dst_IP", "Dst_Port", "Protocol", "Timestamp"], axis=1)
# df_dropFeats

Data visualisation

In [None]:
df_delMissVal['Label'].value_counts()

In [None]:
df_delMissVal['Label'].value_counts().index.to_list()

In [None]:
plt.figure(figsize=(7, 8)) 

labels = df_delMissVal['Label'].value_counts().index.to_list()
data = df_delMissVal['Label'].value_counts().to_list()
plt.bar(labels, data, color=['salmon','turquoise'])
plt.title('Number of normal and anomaly data in dataset')

for a, b in zip(labels, data):
    plt.text(a, b, '%.0f' % b, ha='center', va='bottom', fontname='Times New Roman', fontsize=11)

# a high dpi and a tight bounding box keep the saved figure sharp and uncropped
plt.savefig("../images/labels_batChart.png", dpi=500, bbox_inches='tight')
plt.show()

In [None]:
plt.figure(figsize=(9, 9))  # adjusting the size of graphics

data = df_delMissVal['Label'].value_counts()  # the number of each label

plt.pie(data,
        labels=df_delMissVal['Label'].value_counts().index, # set pie chart labels
        colors=["#d5695d", "#5d8ca8"], # set colours
        explode=(0, 0.2),  # pull the second wedge away from the centre
        autopct='%.2f%%',  # formatted output percentages
       )
plt.title("Distribution of normal and anomaly data in dataset")
plt.legend()

# a high dpi and a tight bounding box keep the saved figure sharp and uncropped
plt.savefig("../images/labels_pieChart.png", dpi=500, bbox_inches='tight')
plt.show()

In [None]:
df_delMissVal['Cat'].value_counts()

In [None]:
plt.figure(figsize=(22, 10))

categories = df_delMissVal['Cat'].value_counts().index
y_pos = np.arange(len(categories)) 
amount = df_delMissVal['Cat'].value_counts()
plt.barh(y_pos, amount, align='center', color='skyblue')
plt.yticks(y_pos, categories)
plt.title('Distribution of different types of categories in the dataset')
plt.xlabel('Number of occurrences')
plt.ylabel('Categories')
for i, v in enumerate(amount):
    plt.text(v + 5, i - 0.3 , str(v))

# a high dpi and a tight bounding box keep the saved figure sharp and uncropped
plt.savefig("../images/cat_barChart.png", dpi=500, bbox_inches='tight')
plt.show()

Mapping and feature conversion

In [None]:
# convert the text labels to numeric codes: first Label (Normal/Anomaly -> 0/1), then Cat (category -> 0..16)
df_convert = df_delMissVal.replace(['Normal', 'Anomaly'], [0, 1]).replace(
    ['Normal', 'DDoS', 'PortScan', 'Okiru', 'Reconnaissance', 'Mirai', 'Sparta', 'MQQT_bruteforce', 'Torii', 'C&C', 'DoS', 'Attack', 'Flood', 'HeartBeat', 'MITM ARP Spoofing', 'FileDownload', 'Theft'], 
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])

df_convert
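The cell above applies `replace` to the whole data-frame. As an alternative sketch (not the method used in this notebook), the mapping can be scoped to the two label columns with `Series.map` and explicit dictionaries. A value missing from the dictionary then becomes NaN, which makes unexpected labels easy to spot. The toy frame and the reduced `cat_map` below are illustrative; the codes match the list in the cell above.

```python
import pandas as pd

# toy data-frame with the two label columns; rows are invented for illustration
toy = pd.DataFrame({
    "Label": ["Normal", "Anomaly", "Anomaly"],
    "Cat": ["Normal", "DDoS", "Mirai"],
})

label_map = {"Normal": 0, "Anomaly": 1}
cat_map = {"Normal": 0, "DDoS": 1, "Mirai": 5}  # subset of the 17 codes used above

# map each column with its own dictionary; unmapped values would become NaN
toy_encoded = toy.assign(
    Label=toy["Label"].map(label_map),
    Cat=toy["Cat"].map(cat_map),
)

# count labels that were not covered by the dictionaries
unmapped = int(toy_encoded[["Label", "Cat"]].isna().sum().sum())
print(toy_encoded)
print("unmapped values:", unmapped)
```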

In [None]:
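# Compute the Pearson correlation coefficient (PCC)
# and visualise the coefficients as a heatmap
# get the data-frame of all features (every column except the last two)
df_feature = df_convert.iloc[:, : -2]
# restrict to numeric columns so that any remaining text columns do not break corr()
df_pcc = df_feature.select_dtypes(include='number').corr('pearson')  # calculate pearson correlation coefficient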

plt.subplots(figsize=(len(df_pcc), len(df_pcc)))
sns.heatmap(df_pcc, annot=True, vmax=1, square=True, cmap="Blues")
plt.savefig("../images/feature_pcc.png", dpi=500, bbox_inches='tight')  # high dpi and tight bounding box keep the saved figure sharp and uncropped
plt.show()
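For reference, a minimal sketch of what each heatmap cell represents: the Pearson coefficient is the covariance of two features divided by the product of their standard deviations. The toy columns `x`, `y` and `z` are invented so that the expected values are easy to check by hand (`y` is a positive multiple of `x`, `z` decreases as `x` increases).

```python
import numpy as np
import pandas as pd

# toy features: y = 2x (perfect positive), z = 5 - x (perfect negative)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
})

# Pearson r computed by hand for the pair (x, y)
x, y = toy["x"], toy["y"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# the same quantity for every pair of columns, as used for the heatmap
pcc = toy.corr("pearson")
print(r_manual)
print(pcc)
```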


Save the dataframe as a .csv file

In [None]:
df_convert.to_csv("../Dataset/dataset_cleaned.csv", index = False)