# REPORT FOR ANALYZING AND PREDICTING TYPES OF CAT USING MACHINE LEARNING




**Author: Đoàn Quốc Kiên**

In this Colab, I describe in detail the process of building the best-performing model for analyzing observations and predicting potential cyber-attack types (Cat for short).



Structure of my report:

Large heading text (aside from the capitalized title above) describes the major steps of the data analysis using Machine Learning.

**Bold** text with a dash (-) indicates the smaller but still key tasks needed to complete the steps mentioned above.

*Italic* text with a dash (-) describes the smaller, broken-down actions within each task, to better demonstrate the steps.

Other text is purely description or explanation of the surrounding steps, tasks, or actions.

## Initialize Code + Import Data

In [None]:
!pip install imblearn


In [None]:
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix, classification_report

from imblearn.over_sampling import RandomOverSampler



df = pd.read_csv('/kaggle/input/iotdata/IoT Network Intrusion Dataset.csv')


## Preprocess Data

In this step, I combine the basic techniques I have learnt (handling null values, duplicates, encoding data, etc.) with my own encoding scheme for some features that cannot be encoded with the usual methods.

**- Drop unnecessary columns**



Timestamp and Sub_Cat are clearly not meaningful for this task, so I drop them in advance.

In [None]:
df = df.drop(['Timestamp','Sub_Cat'], axis = 1)

df

**- Handle NaN values**

Null or NaN values are represented in this dataset by the number -1 or by the value inf, so I first convert all of them to NaN and then handle them explicitly in several steps.

*- Set all inf and -1 values to NaN*


In [None]:
df.replace([np.inf, -np.inf, -1], np.nan, inplace=True)
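To make the effect of this replacement concrete, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration and are not taken from the dataset; the sketch also assumes, as the step above does, that -1 never occurs as a legitimate measurement.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns and values) using inf and -1 as missing-value sentinels
toy = pd.DataFrame({
    "rate": [10.0, np.inf, 3.5],
    "window": [-1, 255, -1],
})

# Same call as in the notebook, but returning a new frame instead of modifying in place
cleaned = toy.replace([np.inf, -np.inf, -1], np.nan)

# Number of NaN values per column after the replacement
nan_per_column = cleaned.isna().sum()
print(nan_per_column)
```

Counting NaN values right after the replacement is a quick way to see which columns relied on the sentinel values.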

*- Drop features with too many (over 10%) NaN values*

In [None]:
for column in df.columns:

  # Fraction of missing values in this column
  if df[column].isna().mean() > 0.1:

    df = df.drop(column, axis = 1)

    # Report the dropped feature
    print(column)
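The same 10% rule can also be written without an explicit loop. The sketch below is an equivalent formulation under the assumption that only the missing-value ratio matters; the toy columns `a`, `b`, and `c` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' is 25% missing, 'b' is complete, 'c' is 50% missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, 3.0, 4.0],
})

# isna().mean() gives the fraction of NaN values per column
nan_ratio = toy.isna().mean()

# Columns whose missing ratio exceeds the 10% threshold
too_sparse = nan_ratio[nan_ratio > 0.1].index.tolist()

reduced = toy.drop(columns=too_sparse)
print(too_sparse)
```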


*- Drop observations that still contain NaN values.*

Since this dataset is very large, there is still a reasonable amount of data to work with after this step.

In [None]:
print(df.isna().sum().to_string())

df.dropna(inplace = True)



**- Remove duplicates**

In [None]:
# keep=False removes every copy of a duplicated row, including its first occurrence
df.drop_duplicates(keep=False, inplace=True)
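The `keep` argument changes how many rows survive, so a small comparison may help. This sketch uses a made-up toy frame and only contrasts the two settings; it does not suggest which one is right for this dataset.

```python
import pandas as pd

# Toy frame with one row repeated twice and another repeated three times
toy = pd.DataFrame({
    "x": [1, 1, 2, 3, 3, 3],
    "y": ["a", "a", "b", "c", "c", "c"],
})

# keep="first": one representative of each duplicated row is retained
keep_first = toy.drop_duplicates(keep="first")

# keep=False: all rows that have any duplicate are removed entirely
keep_none = toy.drop_duplicates(keep=False)

print(len(keep_first), len(keep_none))
```

With `keep=False`, only rows that were unique from the start remain.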

**- Count and list all types of CAT**



This shows how imbalanced the target is across observations. It will be useful later when oversampling.

In [None]:
plt.hist(df['Cat'])

plt.show()
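Alongside the histogram, the imbalance can be put into numbers. The following is a minimal sketch on a hypothetical label series; the class counts are invented and do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical target column with uneven class sizes
toy_cat = pd.Series(["Mirai"] * 6 + ["DoS"] * 2 + ["Scan"] + ["Normal"], name="Cat")

# Absolute count and relative share of each class
counts = toy_cat.value_counts()
shares = toy_cat.value_counts(normalize=True)

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
```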

**- Encode CATs and labels**

I use simple label encoding here, since Cat is the target and the 'Label' column only contains 2 values.

In [None]:
CAT_to_num = {'Mirai' : 1, 'DoS' : 2, 'Scan' : 3, 'Normal' : 4, 'MITM ARP Spoofing' : 5}

df['Cat'] = df['Cat'].map(CAT_to_num)

df['Cat'].unique()



df['Label'] = (df['Label'] == 'Anomaly').astype(int)

df['Label'].unique()

In [None]:
df.head()

**- Check the relevance of each feature to the target we need to classify**

This step is crucial: it tells us which data is useful and which is not. Using histograms to check how each feature relates to the target, we can remove some of the useless features to optimize our code.

*- Create an alt_df in which all of the data is encoded using simple label encoding.*

Since no calculation is performed on this version of our dataframe, it is safe to use this type of encoding to simplify our data.

In [None]:
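# Create an alternate version in which each value is replaced by a unique integer id, to facilitate data visualization

alt_df = df.copy()

# Skip the last two columns (expected to be Label and Cat), which are already numeric
for label in alt_df.columns[:-2]:

  # Map each distinct value to its order of first appearance
  mp = {val: i for i, val in enumerate(alt_df[label].unique())}

  alt_df[label] = alt_df[label].map(mp)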
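The dictionary-based mapping above numbers values in order of first appearance. As a side note, `pd.factorize` produces the same codes for columns without missing values; the sketch below checks this on a hypothetical column.

```python
import pandas as pd

# Hypothetical categorical column
col = pd.Series(["tcp", "udp", "tcp", "icmp", "udp"])

# Mapping as built in the notebook: value -> index of first appearance
mp = {val: i for i, val in enumerate(col.unique())}
mapped = col.map(mp)

# pd.factorize returns integer codes in order of first appearance, plus the unique values
codes, uniques = pd.factorize(col)

print(mapped.tolist())
print(codes.tolist())
```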

In [None]:
alt_df.head()

*- Simple Oversampling for the alternate dataframe alt_df.*



This will be used to display correlations between the features and the target

In [None]:
ros = RandomOverSampler()

# Resample every class up to the size of the largest one, using Cat as the target
alt_df, y_resampled = ros.fit_resample(alt_df, alt_df['Cat'])

alt_df = pd.DataFrame(alt_df, columns = alt_df.columns)

plt.hist(alt_df['Cat'])

plt.show()
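To illustrate what random oversampling does, here is a hand-rolled sketch in plain pandas. It is not the imblearn implementation; it only mimics the idea of duplicating minority-class rows at random until every class matches the largest one. The toy data is made up.

```python
import pandas as pd

# Toy frame: class 1 has 6 rows, class 2 has 3 rows, class 3 has 1 row
toy = pd.DataFrame({"feature": range(10), "Cat": [1] * 6 + [2] * 3 + [3] * 1})

# Every class is grown to the size of the largest class
target_size = toy["Cat"].value_counts().max()

parts = []
for cat_value, group in toy.groupby("Cat"):
    extra = target_size - len(group)
    if extra > 0:
        # Draw the missing rows at random, with replacement, from the same class
        resampled = group.sample(n=extra, replace=True, random_state=0)
        group = pd.concat([group, resampled])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
balanced_counts = balanced["Cat"].value_counts()
print(balanced_counts)
```

Because the added rows are exact copies, oversampling changes class proportions without adding new information.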

*- Plot a histogram of each feature, split by Cat.*

In [None]:
for label in df.columns:

  plt.hist(alt_df[alt_df['Cat'] == 1][label], color = 'Red', label = 'Mirai', alpha = 0.1, density = True, hatch='..')

  plt.hist(alt_df[alt_df['Cat'] == 2][label], color = 'Blue', label = 'DoS', alpha = 0.1, density = True, hatch='--')

  plt.hist(alt_df[alt_df['Cat'] == 3][label], color = 'Green', label = 'Scan', alpha = 0.1, density = True, hatch='//')

  plt.hist(alt_df[alt_df['Cat'] == 4][label], color = 'Yellow', label = 'Normal', alpha = 0.1, density = True, hatch='**')

  plt.hist(alt_df[alt_df['Cat'] == 5][label], color = 'Black', label = 'MITM ARP Spoofing', alpha = 0.1, density = True, hatch='||')

  plt.title(label)

  plt.ylabel("Probability")

  plt.xlabel(label)

  plt.legend()

  plt.show()

Based on the histograms above, many features are irrelevant to the target, including features with no values whatsoever and features with the same value for every observation. We do not want those, so we have to drop all of them. However, there are many other features with little relation to the target, so I check and verify this with correlation coefficients.

*- Compute the correlation of each feature with the target and show it.*


In [None]:
correlation_matrix = alt_df.corr()

target_correlation = correlation_matrix['Cat']

plt.figure(figsize=(10, 20))
sns.heatmap(target_correlation.to_frame(), annot=True, cmap='coolwarm', fmt=".3f")
plt.title('Correlation between Features and Target (Cat)')
plt.show()

*- Remove unnecessary data.*

This includes features whose absolute correlation with the target is below 0.03, or whose correlation is null (NaN).




In [None]:
# Identify features with low correlation (absolute value < 0.03)
features_to_remove = target_correlation[abs(target_correlation) < 0.03].index.tolist()
features_to_remove.extend(target_correlation[target_correlation.isnull()].index.tolist())
# Remove these features from the original dataframe (df)
df = df.drop(columns=features_to_remove)

# Print the list of removed features
print("Removed Features:", features_to_remove)
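The filtering rule can be seen on a small deterministic example. In this sketch the column names are hypothetical: `informative` is perfectly correlated with the target, `weak` is built to have zero linear correlation, and `constant` has no variance, so its correlation is undefined (NaN).

```python
import numpy as np
import pandas as pd

# Target cycling through the five classes
cat = np.tile([1, 2, 3, 4, 5], 20)

toy = pd.DataFrame({
    "informative": cat * 2.0,               # perfectly correlated with Cat
    "weak": np.tile([1, 0, 0, 0, 1], 20),   # symmetric pattern, zero linear correlation
    "constant": np.ones(len(cat)),          # zero variance, correlation is NaN
    "Cat": cat,
})

corr = toy.corr()["Cat"]

# Same two criteria as in the notebook: tiny absolute correlation, or undefined correlation
low = corr[corr.abs() < 0.03].index.tolist()
undefined = corr[corr.isna()].index.tolist()
to_remove = low + undefined
print(to_remove)
```

Note that Pearson correlation only measures linear association, so a feature such as `weak` can still carry information about the classes even though it is filtered out here.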

In [None]:
df.head()

*- Encode categorical data with different methods.*

**For Flow_ID, Src_IP, Src_Port, Dst_IP and Dst_Port**, judging by the histograms, the number of unique values is too large to use one-hot encoding. Instead, I replace each feature with 5 columns, one per type of Cat, holding the percentage of observations with that feature value that belong to each Cat. I call this percentage-based encoding. I also standardize these new columns in advance, as a single group, since their values are related to each other.

**For Protocol**, the number of unique values is small enough to use one-hot encoding.

In [None]:
#Percentage-based Encoding

def encode_and_scale(df, column_to_encode):

    # Step 1: Calculate the percentage for each target value per unique value in column_to_encode

    df_counts = df.groupby([column_to_encode, 'Cat']).size().unstack(fill_value=0)

    df_percentage = df_counts.div(df_counts.sum(axis=1), axis=0)



    # Step 2: Rename columns to reflect the target values as percentages

    df_percentage.columns = [f'{column_to_encode}_{int(col)}_perc' for col in df_percentage.columns]



    # Step 3: Merge the new columns with the original dataframe

    df = df.merge(df_percentage, on=column_to_encode, how='left').drop(columns=column_to_encode)



    # Step 4: Identify the new columns and move them to the front, so the target stays in the last column

    new_cols = list(df_percentage.columns)



    # Safely reordering columns in case any new columns are missing

    all_cols = new_cols + [col for col in df.columns if col not in new_cols]

    df = df.reindex(columns=all_cols)



    # Step 5: Standard scale the new percentage columns as a single group

    scaler = StandardScaler()

    df[new_cols] = scaler.fit_transform(df[new_cols].values.reshape(-1, 1)).reshape(df[new_cols].shape)



    return df

for column in ['Flow_ID', 'Src_IP', 'Src_Port', 'Dst_IP', 'Dst_Port']:

  df = encode_and_scale(df, column)

df.head()
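A tiny worked example may make the percentage-based encoding easier to follow. This sketch repeats steps 1 to 3 of `encode_and_scale` on a made-up frame with two IP addresses and two classes, and leaves out the reordering and scaling steps.

```python
import pandas as pd

# Toy data: 10.0.0.1 appears twice with Cat 1 and once with Cat 2; 10.0.0.2 only with Cat 2
toy = pd.DataFrame({
    "Src_IP": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2"],
    "Cat": [1, 1, 2, 2, 2],
})

# Count observations per (Src_IP, Cat) pair, one column per Cat value
counts = toy.groupby(["Src_IP", "Cat"]).size().unstack(fill_value=0)

# Turn counts into row-wise shares so that each row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
shares.columns = [f"Src_IP_{int(c)}_perc" for c in shares.columns]

# Attach the shares to every observation of the corresponding Src_IP
encoded = toy.merge(shares.reset_index(), on="Src_IP", how="left")
print(encoded)
```

Each row of `10.0.0.1` receives the shares 2/3 and 1/3, while each row of `10.0.0.2` receives 0 and 1.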

In [None]:
#One-hot Encoding

one_hot = pd.get_dummies(df['Protocol'], prefix='Protocol')

df.drop(columns=['Protocol'], inplace=True)

for column in one_hot.columns:

    df.insert(0, column, one_hot[column])
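An alternative way to place the one-hot columns in front of the remaining columns is `pd.concat`. The sketch below is only an illustration on a hypothetical frame; unlike the insert loop above, it keeps the dummy columns in ascending order of the protocol value.

```python
import pandas as pd

# Hypothetical frame with a numeric protocol code and the target
toy = pd.DataFrame({"Protocol": [6, 17, 6, 0], "Cat": [1, 2, 1, 4]})

# One indicator column per distinct protocol value, stored as integers
one_hot = pd.get_dummies(toy["Protocol"], prefix="Protocol", dtype=int)

# Put the indicator columns first and keep the remaining columns (target last)
encoded = pd.concat([one_hot, toy.drop(columns=["Protocol"])], axis=1)
print(encoded)
```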


In [None]:
df.head()

**- Split data for training and testing**

*- Scale data (skipping already-scaled columns) and oversample underrepresented classes*

In [None]:
def scale_dataset(dataframe, oversample=False):

  feature_cols = dataframe.columns[:-1]

  # Work on a float copy of the features; the last column is the target
  X_df = dataframe[feature_cols].astype(float)

  y = dataframe[dataframe.columns[-1]].values

  # Get columns that haven't been standardized in advance
  cols_to_standardize = [col for col in feature_cols if not col.endswith('perc')]

  # Note: the scaler is fitted on whichever dataframe is passed in
  scaler = StandardScaler()

  # Replace the original columns with the standardized ones, matched by column name
  X_df[cols_to_standardize] = scaler.fit_transform(X_df[cols_to_standardize])

  X = X_df.values

  if oversample:

    # Balance the classes by randomly duplicating minority-class rows
    ros = RandomOverSampler()

    X, y = ros.fit_resample(X, y)

  # Features and target stacked side by side, target in the last column
  data = np.hstack((X, np.reshape(y, (-1, 1))))

  return data, X, y

*- Split train and test data*

In [None]:
train, test = train_test_split(df, test_size = 0.2)



train, X_train, y_train = scale_dataset(train, oversample = True)

test, X_test, y_test = scale_dataset(test, oversample = False)
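Since `scale_dataset` fits a new scaler on each frame it receives, the training and test sets above are standardized with different statistics. A common alternative, shown here only as a sketch on synthetic data and not as what this notebook does, is to fit the scaler on the training split and reuse it for the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a five-class target, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(1, 6, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Learn mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_tr)

# Apply the same transformation to both splits
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training features end up with zero mean exactly, while the test features are only approximately centered, which is expected when the test set is treated as unseen data.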

In [None]:
print(len(y_train))

print(sum(y_train == 1))

print(sum(y_train == 2))

print(sum(y_train == 3))

print(sum(y_train == 4))

print(sum(y_train == 5))

## Use different models to analyze and make predictions

**- K-nearest Neighbors**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import precision_score, make_scorer, f1_score





k_range = list(range(1, 16, 2))

weight_options = ['uniform', 'distance']

param_grid = dict(n_neighbors=k_range, weights=weight_options)

print(param_grid)



scorer = make_scorer(f1_score, average='weighted')



knn = KNeighborsClassifier()

grid = GridSearchCV(knn, param_grid, cv = 3, scoring=scorer)

grid.fit(X_train, y_train)

print(grid.best_score_)

print(grid.best_params_)
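The grid search above ranks parameter combinations by weighted F1. As a small check of what that score means, the sketch below recomputes it by hand from per-class F1 scores and class supports, using made-up labels.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted labels for three classes
y_true = np.array([1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 2, 2, 2, 3])
labels = [1, 2, 3]

# F1 score of each class separately
per_class = f1_score(y_true, y_pred, average=None, labels=labels)

# Support = number of true samples in each class
support = np.array([(y_true == c).sum() for c in labels])

# Weighted F1 = support-weighted mean of the per-class F1 scores
manual_weighted = float((per_class * support).sum() / support.sum())
library_weighted = f1_score(y_true, y_pred, average="weighted")

print(manual_weighted, library_weighted)
```

Because larger classes get larger weights, a weighted F1 on an oversampled training set treats all classes equally, while on the original test set it is dominated by the most frequent classes.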








In [None]:
y_pred = grid.predict(X_test)



# Generate the classification report

report = classification_report(y_test, y_pred)

print(report)

**- Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier



param_grid_dt = {

    'criterion': ['gini', 'entropy'],

    'max_depth': [None, 5, 10],

    'min_samples_split': [2, 5],

    'min_samples_leaf': [1, 2]

}



dt = DecisionTreeClassifier()



grid_dt = GridSearchCV(estimator=dt, param_grid=param_grid_dt, cv=3, scoring=scorer)



grid_dt.fit(X_train, y_train)



print(grid_dt.best_score_)

grid_dt.best_params_

In [None]:
y_pred = grid_dt.predict(X_test)



# Generate the classification report

report_dt = classification_report(y_test, y_pred)

print(report_dt)

**- Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier



param_grid_rf = {

    'n_estimators': [5, 9, 15],

    'max_depth': [None, 5, 10],

    'min_samples_split': [2, 5],

    'min_samples_leaf': [1, 2],

    'bootstrap': [True, False]

}



rf = RandomForestClassifier()



grid_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=3, scoring=scorer)



grid_rf.fit(X_train, y_train)



print(grid_rf.best_score_)

grid_rf.best_params_

In [None]:
y_pred = grid_rf.predict(X_test)



# Generate the classification report

report_rf = classification_report(y_test, y_pred)

print(report_rf)

**- Logistic Regression (scrapped)**

In [None]:
'''from sklearn.linear_model import LogisticRegression



param_grid_lr = {

    'penalty': ['l1', 'l2'],

    'C': [0.01, 1, 10],

    'solver': ['liblinear'],  # 'liblinear' handles L1 penalty well

    'max_iter': [15, 25, 50]

}



lr = LogisticRegression()



grid_lr = GridSearchCV(estimator=lr, param_grid=param_grid_lr, cv=3, scoring=scorer)



grid_lr.fit(X_train, y_train)



print(grid_lr.best_score_)

grid_lr.best_params_'''

In [None]:
'''y_pred = grid_lr.predict(X_test)



# Generate the classification report

report_lr = classification_report(y_test, y_pred)

print(report_lr)'''

**- Naive Bayes**

In [None]:
from sklearn.naive_bayes import GaussianNB



# Naive Bayes

param_grid_nb = {}  # No hyperparameters to tune for GaussianNB



nb = GaussianNB()



grid_nb = GridSearchCV(estimator=nb, param_grid=param_grid_nb, cv=3, scoring=scorer)



grid_nb.fit(X_train, y_train)



print(grid_nb.best_score_)

grid_nb.best_params_

In [None]:
y_pred_nb = grid_nb.predict(X_test)

report_nb = classification_report(y_test, y_pred_nb)

print(report_nb)

**- Support Vector Machine**

In [None]:
from sklearn.svm import SVC



param_grid_svm = {

                'C': [0.1, 1, 10 ],

              'gamma': [1, 0.01, 0.001],

              'kernel': ['rbf']

}



svm = SVC()



grid_svm = GridSearchCV(estimator=svm, param_grid=param_grid_svm, cv=3, scoring=scorer)



grid_svm.fit(X_train, y_train)



print(grid_svm.best_score_)

grid_svm.best_params_

In [None]:
y_pred = grid_svm.predict(X_test)



# Generate the classification report

report_svm = classification_report(y_test, y_pred)

print(report_svm)

**- Neural Network (scrapped)**

*- Define the structure of the network*

In [None]:
'''import tensorflow as tf

def train_model(X_train, y_train, num_nodes, dropout_prob, lr, batch_size, epochs):

  nn_model = tf.keras.Sequential([

      tf.keras.layers.Dense(num_nodes, activation = 'relu', input_shape = (10,)),

      tf.keras.layers.Dropout(dropout_prob),

      tf.keras.layers.Dense(num_nodes, activation = 'relu'),

      tf.keras.layers.Dropout(dropout_prob),

      tf.keras.layers.Dense(1, activation = 'sigmoid')

  ])



  nn_model.compile(optimizer = tf.keras.optimizers.Adam(lr), loss = 'binary_crossentropy', metrics = ['accuracy'])

  history = nn_model.fit(

    X_train, y_train, epochs = epochs, batch_size = batch_size, validation_split = 0.2, verbose = 0

  )

  return nn_model, history'''

*- Print the history to compare performance*

In [None]:
'''def plot_history(history):

  fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 4))

  ax1.plot(history.history['loss'], label = 'loss')

  ax1.plot(history.history['val_loss'], label = 'val_loss')

  ax1.set_xlabel('Epoch')

  ax1.set_ylabel('Binary crossentropy')

  ax1.grid(True)



  ax2.plot(history.history['accuracy'], label = 'accuracy')

  ax2.plot(history.history['val_accuracy'], label = 'val_accuracy')

  ax2.set_xlabel('Epoch')

  ax2.set_ylabel('Accuracy')

  ax2.grid(True)



  plt.show()'''

*- Train neural network*

In [None]:
'''least_val_loss = float('inf')

least_loss_model = None

epochs = 10

for num_nodes in [4, 8, 16]:

  for dropout_prob in [0, 0.2]:

    for lr in [0.01, 0.005, 0.001]:

      for batch_size in [4, 8, 16]:

        print(f"{num_nodes} nodes, dropout {dropout_prob}, lr {lr}, batch size {batch_size}")

        model, history = train_model(X_train, y_train, num_nodes, dropout_prob, lr, batch_size, epochs)

        plot_history(history)

        val_loss = model.evaluate(X_valid, y_valid)[0]

        if val_loss < least_val_loss:

          least_val_loss = val_loss

          least_loss_model = model

'''

In [None]:
'''y_pred = least_loss_model.predict(X_test)

y_pred = (y_pred > 0.5).astype(int).reshape(-1,)



report_nn = classification_report(y_test, y_pred)

print(report_nn)'''

**- Compare the performance of all models**

In [None]:
from sklearn.metrics import accuracy_score

# Collects one row of test metrics per model
model_results = []

def evaluate_model(model, X_test, y_test, model_name):
  y_pred = model.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)
  report = classification_report(y_test, y_pred, output_dict=True)
  model_results.append({
      'Model': model_name,
      'Accuracy': accuracy,
      'Precision': report['weighted avg']['precision'],
      'Recall': report['weighted avg']['recall'],
      'F1-score': report['weighted avg']['f1-score']
  })


evaluate_model(grid, X_test, y_test, 'K-Nearest Neighbors')

evaluate_model(grid_dt, X_test, y_test, 'Decision Tree')

evaluate_model(grid_rf, X_test, y_test, 'Random Forest')

evaluate_model(grid_nb, X_test, y_test, 'Naive Bayes')

evaluate_model(grid_svm, X_test, y_test, 'Support Vector Machine')


# Create a Pandas DataFrame from the model_results list
results_df = pd.DataFrame(model_results)

# Display the table
print(results_df)

plt.figure(figsize=(10, 6))
# Plot the test accuracy of each model
sns.barplot(x='Model', y='Accuracy', data=results_df)
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.show()
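To compare all four metrics in one chart rather than a single one, the results table can be reshaped to long format first. The sketch below shows the reshaping step only; the model names and scores are placeholders invented for illustration and are not results from this notebook.

```python
import pandas as pd

# Placeholder scores, NOT results from this notebook
results = pd.DataFrame([
    {"Model": "Model A", "Accuracy": 0.90, "Precision": 0.80, "Recall": 0.70, "F1-score": 0.75},
    {"Model": "Model B", "Accuracy": 0.60, "Precision": 0.50, "Recall": 0.40, "F1-score": 0.45},
])

# One row per (model, metric) pair instead of one column per metric
long_results = results.melt(id_vars="Model", var_name="Metric", value_name="Score")
print(long_results)

# The long table can then be plotted as grouped bars, for example with
# sns.barplot(x="Model", y="Score", hue="Metric", data=long_results)
```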

In conclusion, based on the chart and table above depicting the performance of the models, it is safe to say that Random Forest has the highest accuracy, while also being the most time-consuming model to train.