# Anomaly Detection In Cybersecurity

#### Anomaly detection in cybersecurity uses machine learning to identify patterns that deviate from a system's normal behavior. By analyzing large volumes of data such as network traffic, user activity, or system logs, these algorithms can flag anomalies that may indicate intrusions, malware, or other suspicious activity. The goal is to recognize and respond to such anomalies quickly, strengthening overall defenses against attacks.

#### In this project I use network traffic data to detect threats in real time with a feedforward neural network (FNN), a deep learning model.

### Import relevant libraries

In [306]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

### Perform preprocessing steps

In [307]:
column_names = [
    'Duration', 'Protocol_type', 'Service', 'Flag', 'Src_bytes', 'Dst_bytes', 'Land', 'Wrong_fragment',
    'Urgent', 'Hot', 'Num_failed_logins', 'Logged_in', 'Num_compromised', 'Root_shell', 'Su_attempted',
    'Num_root', 'Num_file_creations', 'Num_shells', 'Num_access_files', 'Num_outbound_cmds', 'Is_host_login',
    'Is_guest_login', 'Count', 'Srv_count', 'Serror_rate', 'Srv_serror_rate', 'Rerror_rate', 'Srv_rerror_rate',
    'Same_srv_rate', 'Diff_srv_rate', 'Srv_diff_host_rate', 'Dst_host_count', 'Dst_host_srv_count',
    'Dst_host_same_srv_rate', 'Dst_host_diff_srv_rate', 'Dst_host_same_src_port_rate', 'Dst_host_srv_diff_host_rate',
    'Dst_host_serror_rate', 'Dst_host_srv_serror_rate', 'Dst_host_rerror_rate', 'Dst_host_srv_rerror_rate',
    'Attack_Type', 'Difficulty_Level'
]

#### Load the dataset

In [308]:
raw_data = pd.read_csv('KDDTrain+.txt', names=column_names)
df = raw_data.copy()
df.head()

,Duration,Protocol_type,Service,Flag,Src_bytes,Dst_bytes,Land,Wrong_fragment,Urgent,Hot,...,Dst_host_same_srv_rate,Dst_host_diff_srv_rate,Dst_host_same_src_port_rate,Dst_host_srv_diff_host_rate,Dst_host_serror_rate,Dst_host_srv_serror_rate,Dst_host_rerror_rate,Dst_host_srv_rerror_rate,Attack_Type,Difficulty_Level
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal,20
1,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal,15
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal,21
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal,21


In [309]:
# Selecting only categorical columns for encoding
categorical_cols = ['Protocol_type', 'Service', 'Flag']


In [310]:

# Applying one-hot encoding using Pandas get_dummies
encoded_df = pd.get_dummies(df[categorical_cols], drop_first=True)

# Dropping the original categorical columns from the original DataFrame
df = df.drop(columns=categorical_cols)

# Concatenating the original DataFrame with the encoded categorical DataFrame
data_encoded = pd.concat([df, encoded_df], axis=1)
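
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame (the three rows are made up for illustration and are not taken from the dataset). With k categories the encoding keeps k-1 indicator columns, and the dropped first category is represented by a row of zeros.

```python
import pandas as pd

# Toy frame for illustration only; values are not from the dataset
toy = pd.DataFrame({'Protocol_type': ['tcp', 'udp', 'icmp']})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(toy, drop_first=False)

# drop_first=True removes the first category in sorted order ('icmp')
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

This is why the encoded data later contains `Protocol_type_tcp` and `Protocol_type_udp` but no `Protocol_type_icmp` column.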


In [311]:
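# Note: 'Difficulty_Level' is NSL-KDD metadata (it reflects how many benchmark classifiers
# labeled the record correctly), not a traffic feature. It stays in X here, but dropping it
# avoids leaking label-related information into the model.
X = data_encoded.drop(columns=['Attack_Type'])  # Features
y = data_encoded['Attack_Type']  # Target variable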

#### First, preprocess the target for feature selection

#### Multilabel binarization converts label sets into a binary indicator matrix with one column per class, which is the format most models expect. It is designed for tasks where a sample can carry several labels at once, such as text classification or image tagging. Here each record has exactly one attack type, so the result is equivalent to one-hot encoding of the target.

In [312]:
# Convert the strings in 'y' to lists of attack types
y_list = [labels.split(',') for labels in y]  # Assuming labels are comma-separated

# Initialize and fit the MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_dum = mlb.fit_transform(y_list)

# Display the encoded shape
print("Shape after one-hot encoding:", y_dum.shape)


Shape after one-hot encoding: (125973, 23)
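
Because every record carries a single label, the binarized target should match ordinary one-hot encoding. A minimal sketch on toy labels (the four labels below are illustrative, and the `_demo` names are not part of the notebook) checks this against scikit-learn's `LabelBinarizer`:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Toy labels for illustration; each record has exactly one label
labels = ['normal', 'neptune', 'normal', 'smurf']

# MultiLabelBinarizer expects an iterable of label collections
mlb_demo = MultiLabelBinarizer()
multi = mlb_demo.fit_transform([[label] for label in labels])

# LabelBinarizer one-hot encodes single labels directly
# (with three or more classes it returns one column per class)
lb_demo = LabelBinarizer()
onehot = lb_demo.fit_transform(labels)

print(mlb_demo.classes_)
print(np.array_equal(multi, onehot))
```

Each row sums to 1, which is what the softmax output layer and categorical cross-entropy loss used later assume.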


### Perform feature selection using random forest

#### Feature selection identifies the features that actually influence the result and avoids unnecessary complexity when training a model on high-dimensional data.

In [313]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)

In [314]:
clf.fit(X, y_dum)

In [315]:
feature_importances = clf.feature_importances_

In [316]:
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})


In [317]:

# Sort the features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)


In [318]:
# Select the top N features
top_n = 29  # Number of top-ranked features to keep

# Show the top N features and their importance scores
top_features = feature_importance_df.head(top_n)
print(top_features)


                         Feature  Importance
1                      Src_bytes    0.075999
26                 Diff_srv_rate    0.068413
21                   Serror_rate    0.064305
25                 Same_srv_rate    0.059084
19                         Count    0.058654
114                      Flag_S0    0.054364
118                      Flag_SF    0.050242
30        Dst_host_same_srv_rate    0.050080
35      Dst_host_srv_serror_rate    0.048872
38              Difficulty_Level    0.036705
2                      Dst_bytes    0.036338
32   Dst_host_same_src_port_rate    0.034446
31        Dst_host_diff_srv_rate    0.032307
34          Dst_host_serror_rate    0.029878
29            Dst_host_srv_count    0.023698
33   Dst_host_srv_diff_host_rate    0.023055
22               Srv_serror_rate    0.022901
20                     Srv_count    0.022397
36          Dst_host_rerror_rate    0.019238
28                Dst_host_count    0.018357
23                   Rerror_rate    0.016278
54        
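
The ranking-and-selection pattern used here can be tried in isolation. The sketch below runs it on synthetic data from `make_classification`; the column names `f0`..`f7`, the cut-off `top_k`, and the other `_demo` names are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only (3 informative features out of 8)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    random_state=42, shuffle=False,
)
cols = [f'f{i}' for i in range(X_demo.shape[1])]

# Fit a forest and read its impurity-based importances
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Rank features from most to least important
ranking = (
    pd.DataFrame({'Feature': cols, 'Importance': forest.feature_importances_})
    .sort_values(by='Importance', ascending=False)
    .reset_index(drop=True)
)

# Keep only the top-ranked columns
top_k = 3  # hypothetical cut-off
selected = ranking.head(top_k)['Feature'].tolist()
X_selected = pd.DataFrame(X_demo, columns=cols)[selected]

print(ranking)
```

The importances of a fitted forest sum to 1, so each score can be read as a share of the total. The cut-off itself remains a judgment call.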

In [319]:
top_feature_names = top_features['Feature'].tolist()

# Extract the top features from the original DataFrame 'X'
X_top_features = X[top_feature_names]

# Verify the extracted DataFrame
X_top_features.head()

,Src_bytes,Diff_srv_rate,Serror_rate,Same_srv_rate,Count,Flag_S0,Flag_SF,Dst_host_same_srv_rate,Dst_host_srv_serror_rate,Difficulty_Level,...,Dst_host_count,Rerror_rate,Service_eco_i,Service_ecr_i,Logged_in,Service_private,Dst_host_srv_rerror_rate,Protocol_type_tcp,Protocol_type_udp,Service_http
0,491,0.0,0.0,1.0,2,0,1,0.17,0.0,20,...,150,0.0,0,0,0,0,0.0,1,0,0
1,146,0.15,0.0,0.08,13,0,1,0.0,0.0,15,...,255,0.0,0,0,0,0,0.0,0,1,0
2,0,0.07,1.0,0.05,123,1,0,0.1,1.0,19,...,255,0.0,0,0,0,1,0.0,1,0,0
3,232,0.0,0.2,1.0,5,0,1,1.0,0.01,21,...,30,0.0,0,0,1,0,0.01,1,0,1
4,199,0.0,0.0,1.0,30,0,1,1.0,0.0,21,...,255,0.0,0,0,1,0,0.0,1,0,1


In [320]:
X_top_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Data columns (total 29 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Src_bytes                    125973 non-null  int64  
 1   Diff_srv_rate                125973 non-null  float64
 2   Serror_rate                  125973 non-null  float64
 3   Same_srv_rate                125973 non-null  float64
 4   Count                        125973 non-null  int64  
 5   Flag_S0                      125973 non-null  uint8  
 6   Flag_SF                      125973 non-null  uint8  
 7   Dst_host_same_srv_rate       125973 non-null  float64
 8   Dst_host_srv_serror_rate     125973 non-null  float64
 9   Difficulty_Level             125973 non-null  int64  
 10  Dst_bytes                    125973 non-null  int64  
 11  Dst_host_same_src_port_rate  125973 non-null  float64
 12  Dst_host_diff_srv_rate       125973 non-null  float64
 13 

In [321]:
X_top_features.describe()

,Src_bytes,Diff_srv_rate,Serror_rate,Same_srv_rate,Count,Flag_S0,Flag_SF,Dst_host_same_srv_rate,Dst_host_srv_serror_rate,Difficulty_Level,...,Dst_host_count,Rerror_rate,Service_eco_i,Service_ecr_i,Logged_in,Service_private,Dst_host_srv_rerror_rate,Protocol_type_tcp,Protocol_type_udp,Service_http
count,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,...,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0
mean,45566.74,0.063053,0.284485,0.660928,84.107555,0.276655,0.594929,0.521242,0.278485,19.50406,...,182.148945,0.119958,0.036405,0.024426,0.395736,0.173474,0.12024,0.815167,0.119018,0.320211
std,5870331.0,0.180314,0.446456,0.439623,114.508607,0.447346,0.490908,0.448949,0.445669,2.291503,...,99.206213,0.320436,0.187296,0.154368,0.48901,0.378658,0.319459,0.388164,0.32381,0.46656
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.09,2.0,0.0,0.0,0.05,0.0,18.0,...,82.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,44.0,0.0,0.0,1.0,14.0,0.0,1.0,0.51,0.0,20.0,...,255.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,276.0,0.06,1.0,1.0,143.0,1.0,1.0,1.0,1.0,21.0,...,255.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
max,1379964000.0,1.0,1.0,1.0,511.0,1.0,1.0,1.0,1.0,21.0,...,255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [322]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(X_top_features)

In [323]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y_dum, test_size=0.2, random_state=42)
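
Fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. A common alternative is to split first and fit the scaler on the training portion only. The sketch below shows that order on random data; the array shapes and the `_demo` names are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
y_demo = rng.integers(0, 2, size=1000)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3))
```

With this order the training features have zero mean and unit variance exactly, while the test features are only approximately standardized, which mirrors what the model would see on new traffic.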

### Prepare deep learning model and train it with dataset

In [329]:
model = Sequential()

# Add input layer and hidden layers
model.add(Dense(64, input_shape=(x_train.shape[1],), activation='relu'))
model.add(Dropout(0.2))  # Dropout layer to prevent overfitting
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
# Add output layer
model.add(Dense(y_dum.shape[1], activation='softmax'))  # Softmax for multiclass classification


In [330]:

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
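
The output layer uses softmax and the model is compiled with categorical cross-entropy. To show what these two pieces compute, here is a small NumPy sketch on made-up logits. The helper functions are written for illustration and are not the Keras implementations, although they follow the same definitions.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean negative log-probability assigned to the true class
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Made-up scores for two samples and three classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])  # one-hot targets

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)

print(probs.sum(axis=1))  # each row is a probability distribution
print(loss)
```

The loss shrinks as the probability assigned to the correct class approaches 1, which is why this pairing suits a one-hot encoded target with mutually exclusive classes.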


In [331]:

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=16, validation_data=(x_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [332]:

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {accuracy*100:.4f}%")

Test Accuracy: 99.4959%
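
A single accuracy figure can hide weak performance on rare attack types, since some attack classes in NSL-KDD have very few records. A per-class breakdown is a useful complement. The sketch below uses toy one-hot targets and toy predicted probabilities. In the notebook, `model.predict(x_test)` would take the place of `y_prob` and `mlb.classes_` the place of `class_names`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

class_names = ['neptune', 'normal', 'smurf']  # toy subset for illustration

# Toy one-hot targets and predicted probabilities (not real model output)
y_true_onehot = np.array([
    [0, 1, 0], [0, 1, 0], [1, 0, 0],
    [0, 0, 1], [0, 1, 0], [1, 0, 0],
])
y_prob = np.array([
    [0.10, 0.80, 0.10], [0.20, 0.70, 0.10], [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30], [0.30, 0.60, 0.10], [0.70, 0.20, 0.10],
])

# Convert one-hot rows and probability rows to class indices
y_true_idx = y_true_onehot.argmax(axis=1)
y_pred_idx = y_prob.argmax(axis=1)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_idx, y_pred_idx, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 per class
print(classification_report(
    y_true_idx, y_pred_idx, labels=[0, 1, 2],
    target_names=class_names, zero_division=0,
))
```

In this toy case five of six predictions are correct, yet the only `smurf` record is missed entirely, which is the kind of failure that overall accuracy does not reveal.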


### Save the trained model

In [333]:
model.save('trained_model.h5')
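
Saving the network alone is not enough to score new traffic, because inference also needs the fitted scaler and the list of selected feature columns. One option, sketched here with `joblib` on random data, is to store those preprocessing objects next to the model. The file name `preprocessing.joblib` and the `_demo` names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random data and hypothetical column names for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
feature_names_demo = ['f0', 'f1', 'f2']

scaler_demo = StandardScaler().fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessing.joblib')  # hypothetical file name
    # Bundle everything needed to reproduce the preprocessing
    joblib.dump({'scaler': scaler_demo, 'features': feature_names_demo}, path)
    restored = joblib.load(path)

# The restored scaler should transform data exactly like the original
same = np.allclose(
    restored['scaler'].transform(X_demo), scaler_demo.transform(X_demo)
)
print(restored['features'], same)
```

In the notebook, `scaler` and `top_feature_names` would be the objects to persist this way.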

#### This notebook is entirely the individual work of Anubhav Natani; only the dataset, NSL-KDD, comes from an external source, and its references are listed below. The notebook is available on my GitHub, [click here](https://github.com/Anubhavnatani04).

#### Thank you!

### <u>References</u>
#### [1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.

#### [2] J. McHugh, “Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory,” ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262–294, 2000.
