<a href="https://colab.research.google.com/github/Basel-byte/Network-Anomaly-Detection/blob/main/Pr_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practice Lab: Clustering  

In this exercise, we will see how the K-Means and Normalized Cut algorithms can be used for network anomaly detection.

# Outline
- [ 1 - Extracting Dataset](#1)
- [ 2 - Packages ](#2)
- [ 3 - Reading data and Preprocessing](#3)
  - [3.1 Reading Columns' Names](#3.1)
  - [3.2 Reading dataset](#3.2)
  - [3.3 Removing class column](#3.3)
  - [3.4 Changing Categorical Features to Numerical](#3.4)
  
   
- [ 4 - K-means](#4)
- [ 5 - Spectral Clustering](#5)
  - [5.1 Getting Laplacian and Degree Matrices](#5.1)
  - [5.2 Spectral Clustering Algorithm](#5.2)
- [ 6 - New Clustering Algorithm](#6)
- [ 7 - Clustering Evaluation](#7)
  - [ 7.1 - Getting Predicted labels](#7.1)
  - [ 7.2 - Recall Measure](#7.2)
  - [ 7.3 - Precision Measure](#7.3)
  - [ 7.4 - F1 Measure](#7.4)
  - [ 7.5 - Entropy Measure](#7.5)
  

  

<a name="1"></a>
## 1 - Extracting Dataset

In [37]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Skip this cell if the dataset has already been unzipped
!gzip -d /content/drive/MyDrive/kddcup.data.gz
!gzip -d /content/drive/MyDrive/corrected.gz

gzip: /content/drive/MyDrive/kddcup.data.gz: No such file or directory
gzip: /content/drive/MyDrive/corrected.gz: No such file or directory


<a name="2"></a>
## 2 - Packages 

First, let's run the cell below to import all the packages that you will need during this assignment.
- [numpy](https://numpy.org/) is the fundamental package for scientific computing with Python.
<!-- - [matplotlib](http://matplotlib.org) is a popular library to plot graphs in Python. -->
<!-- - [tensorflow](https://www.tensorflow.org/) a popular platform for machine learning. -->
- [pandas](https://pandas.pydata.org/) is an open-source data analysis and manipulation tool.

In [38]:
import numpy as np
import pandas as pd
import os
import requests
from enum import Enum
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import rbf_kernel
from scipy.linalg import eigh
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import mutual_info_score
from scipy.stats import entropy
from scipy.stats import mode

<a name="3"></a>
## 3 - Reading data and Preprocessing

<a name="3.1"></a>
### 3.1 Reading Columns' Names

In [39]:
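DATASET_COLUMNS_FILE = "/content/drive/MyDrive/kddcup1999_columns.txt"
column_types = []

with open(DATASET_COLUMNS_FILE, 'r') as file:
    column_labels: str = file.read()

# Each line is expected to look like "<column_name>: <data_type>."
column_regex: re.Pattern = re.compile(r"^(?P<column_name>\w+): (?P<data_type>\w+)\.$")
for column_type in column_labels.splitlines():
    match = column_regex.match(column_type)
    if match is None:
        raise ValueError(f"Unexpected line in columns file: {column_type!r}")
    # Only the column name is kept; the data type is not used later.
    column_types.append(match.group("column_name"))
del column_labels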
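
The regular expression above expects every line to have the form `name: type.`. As a minimal sketch, the snippet below applies it to a few sample lines. The `sample_lines` list is hypothetical and only assumed to mimic the layout of the columns file. It also skips non-matching lines, which is one possible way to handle them.

```python
import re

column_regex = re.compile(r"^(?P<column_name>\w+): (?P<data_type>\w+)\.$")

# Hypothetical sample lines, assumed to mimic the columns file format.
sample_lines = [
    "duration: continuous.",
    "protocol_type: symbolic.",
    "not a column line",
]

parsed = []
for line in sample_lines:
    match = column_regex.match(line)
    if match is None:
        # Skip lines that do not follow the "name: type." pattern.
        continue
    parsed.append((match.group("column_name"), match.group("data_type")))

print(parsed)
```

Without the `None` check, a non-matching line would raise an `AttributeError` on `match.group(...)`.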

<a name="3.2"></a>
### 3.2 Reading the Training Dataset

In [40]:
data = pd.read_csv("/content/drive/MyDrive/kddcup.data", header=None)
data.columns = column_types

<a name="3.3"></a>
### 3.3 Removing Class Column

In [41]:
data_without_labels = data.drop(columns=["class"])
labels = data["class"]
del data
del column_types

<a name="3.4"></a>
### 3.4 Changing Categorical Features to Numerical

In [42]:
def convert_string_to_numeric(data_frame):
  # Map each distinct string to an integer code (order of first appearance).
  # Note: this modifies data_frame in place.
  for col in data_frame:
    if data_frame[col].dtypes == object:
      my_dict = {elem: index for index, elem in enumerate(data_frame[col].unique())}
      data_frame[col] = data_frame[col].map(my_dict)
  return data_frame

In [43]:
def convert_string_to_binary_numeric(data_frame):
  # One-hot encode every string column: one indicator column per distinct value.
  for col in data_frame:
    if data_frame[col].dtypes == object:
      df1 = pd.get_dummies(data_frame[col])
      data_frame = pd.concat([data_frame.drop(columns=col), df1], axis=1)
  return data_frame

In [44]:
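# Use exactly one of the two encodings.
# Option 1 (integer codes):
# data_without_labels = convert_string_to_numeric(data_without_labels).values.astype(np.float32)

# Option 2 (one-hot encoding, used here):
data_without_labels = convert_string_to_binary_numeric(data_without_labels).values.astype(np.float32)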
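
To make the difference between the two encodings concrete, here is a small sketch on a hypothetical toy frame (the `toy` data and its values are invented for illustration). It uses `Series.map` and `pd.get_dummies` directly rather than the helper functions above.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({"duration": [0, 2, 5], "protocol_type": ["tcp", "udp", "tcp"]})

# Integer coding: each category is mapped to its order of first appearance.
codes = {value: index for index, value in enumerate(toy["protocol_type"].unique())}
integer_coded = toy.assign(protocol_type=toy["protocol_type"].map(codes))

# One-hot coding: one indicator column per category.
one_hot = pd.get_dummies(toy, columns=["protocol_type"], dtype=float)

print(integer_coded)
print(one_hot)
```

Integer codes impose an artificial order and distance between categories, which distance-based clustering will pick up. One-hot encoding avoids that at the cost of more columns.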

<a name="4"></a>
## 4 - K-means



In [45]:
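def Kmeans(data_frame_no_label, k = 3):
  # Placeholder: K-means is not implemented yet. This returns the global
  # ground-truth `labels`, not cluster assignments.
  return labels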
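
For orientation, a minimal NumPy sketch of Lloyd's algorithm (the standard K-means iteration) is given below. It is not the notebook's implementation. The name `kmeans_sketch`, its parameters, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def kmeans_sketch(points, k=3, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (cluster index per row, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        new_centroids = np.array([
            points[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assignments, centroids

# Two well-separated toy blobs.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
assignments, centroids = kmeans_sketch(toy, k=2)
print(assignments)
```

The pairwise distance array has shape `(n, k, d)`, so on the full dataset this sketch would need to process the points in batches.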

<a name="5"></a>
## 5 - Spectral clustering


<a name="5.1"> </a>
### 5.1 Getting Laplacian and Degree Matrices

In [46]:
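def get_laplacian_degree(data_no_label, gamma=5e-12):
  # Similarity (adjacency) matrix A from the RBF kernel, without self-loops.
  sim_matrix = rbf_kernel(data_no_label, gamma=gamma)
  del data_no_label
  np.fill_diagonal(sim_matrix, 0)
  deg_matrix = np.sum(sim_matrix, axis=1)
  # Build the Laplacian L = D - A in place: negate A, then put the degrees on the diagonal.
  np.negative(sim_matrix, out=sim_matrix)
  np.fill_diagonal(sim_matrix, deg_matrix)
  # Sanity check: number of isolated points (zero degree) and the mean degree.
  print(deg_matrix[deg_matrix == 0].shape, np.mean(deg_matrix))
  return sim_matrix, np.diag(deg_matrix)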
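
For reference, the unnormalized graph Laplacian is L = D - A, where A is the similarity matrix with a zero diagonal and D is the diagonal matrix of row sums of A. The sketch below builds it on hypothetical toy data. The RBF kernel is written out in NumPy instead of calling `rbf_kernel`, and the `gamma` value is purely illustrative.

```python
import numpy as np

# Hypothetical toy data: four points on a line.
points = np.array([[0.0], [1.0], [5.0], [6.0]])
gamma = 0.5  # illustrative value, far larger than the one used in this lab

# RBF similarity exp(-gamma * ||x - y||^2), written out with NumPy.
squared_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
similarity = np.exp(-gamma * squared_dist)
np.fill_diagonal(similarity, 0)  # no self-loops

degree = similarity.sum(axis=1)
laplacian = np.diag(degree) - similarity  # L = D - A

print(laplacian.sum(axis=1))  # each row of L sums to (numerically) zero
```

Two quick checks for a correctly built Laplacian: it is symmetric with rows summing to zero, and its eigenvalues are non-negative.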

<a name="5.2"></a>
### 5.2 Spectral Clustering Algorithm

In [47]:
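def spectral_clustering(data_no_label, k = 3):
  # Work on a stratified 0.25% sample of the data (stratified by the global `labels`).
  data_no_label, dummy = train_test_split(data_no_label, train_size=0.0025, random_state=42, stratify=labels)
  del dummy
  laplacian_mat, degree_mat = get_laplacian_degree(data_no_label) # degree_mat is a diagonal matrix
  # Generalized eigenproblem L v = lambda D v; eigh returns eigenvalues in ascending order.
  eigenvalues, eigenvectors = eigh(laplacian_mat, degree_mat)
  del eigenvalues
  # Keep the k eigenvectors with the smallest eigenvalues and normalize each row to unit length.
  eigenvectors = eigenvectors[:, :k] / np.linalg.norm(eigenvectors[:, :k], ord=2, axis=1, keepdims=True)
  return Kmeans(eigenvectors, k)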
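
The core of the normalized cut is the generalized eigenproblem L v = λ D v. The toy sketch below shows how the eigenvector with the second-smallest eigenvalue separates two groups. As a labeled substitution, it solves the equivalent symmetric problem with `np.linalg.eigh` rather than calling `scipy.linalg.eigh(L, D)`. The data and the kernel width are hypothetical.

```python
import numpy as np

# Two hypothetical toy groups on a line.
points = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
squared_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
similarity = np.exp(-0.5 * squared_dist)
np.fill_diagonal(similarity, 0)

degree = similarity.sum(axis=1)
laplacian = np.diag(degree) - similarity

# L v = lambda D v is equivalent to the symmetric problem
# D^{-1/2} L D^{-1/2} u = lambda u with v = D^{-1/2} u (all degrees must be positive).
inv_sqrt_degree = 1.0 / np.sqrt(degree)
normalized = inv_sqrt_degree[:, None] * laplacian * inv_sqrt_degree[None, :]
eigenvalues, u = np.linalg.eigh(normalized)  # eigenvalues in ascending order
v = inv_sqrt_degree[:, None] * u

# With two clusters, the sign of the second-smallest eigenvector splits the graph.
split = (v[:, 1] > 0).astype(int)
print(split)
```

For more than two clusters, the rows of the first k eigenvectors are clustered with K-means instead of thresholding a single vector, as `spectral_clustering` does.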

<a name="6"></a> 
## 6 - New Clustering Algorithm

In [None]:
# def new_clustering_algorithm(data_frame_no_label, k = 3):
  

<a name="7"></a>
## 7 - Clustering Evaluation

<a name="7.1"></a>
### 7.1 - Getting Predicted Labels

In [49]:
c_labels = spectral_clustering(data_without_labels)

(0,) 12233.534


In [None]:
# Subsample the labels with the same split used inside spectral_clustering,
# so that they line up with the clustered rows.
labels, _ = train_test_split(labels, train_size=0.0025, random_state=42, stratify=labels)

counts = np.unique(labels, return_counts=True)[1]

In [50]:
# Dummy data for testing the measures: v holds the true classes, c the cluster assignments
v = np.array([1, 2, 1, 0, 2, 1, 0, 1, 2, 2, 1, 0])
c = np.array([0, 1, 1, 0, 2, 0, 0, 1, 2, 2, 1, 0])
counts = np.unique(v, return_counts=True)[1]
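
The cluster-level measures in the following sections can all be read off a contingency table of clusters against true classes. As a sketch on the dummy data, the snippet below builds that table with NumPy. The variable names `contingency`, `majority`, `precision` and `recall` are introduced here for illustration.

```python
import numpy as np

v = np.array([1, 2, 1, 0, 2, 1, 0, 1, 2, 2, 1, 0])  # true classes (dummy data)
c = np.array([0, 1, 1, 0, 2, 0, 0, 1, 2, 2, 1, 0])  # cluster assignments (dummy data)

# contingency[i, j] = number of points in cluster i whose true class is j
contingency = np.zeros((c.max() + 1, v.max() + 1), dtype=int)
np.add.at(contingency, (c, v), 1)
print(contingency)

majority = contingency.max(axis=1)                # majority-class count per cluster
precision = majority / contingency.sum(axis=1)    # divided by the cluster size
# divided by the total size of each cluster's majority class
recall = majority / contingency.sum(axis=0)[contingency.argmax(axis=1)]
print(precision, recall)
```

Each row of the table describes one cluster, so per-cluster precision divides by a row sum and per-cluster recall divides by a column sum.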

<a name="7.2"></a>
### 7.2 - Recall Measure


In [51]:
def measure_recall(real_labels, predicted_labels):
  # Uses the global `counts`: counts[i] is the number of samples whose true label is i.
  total = 0
  unique_predic = np.unique(predicted_labels)
  for c_label in unique_predic:
    cluster = real_labels[predicted_labels == c_label]
    major_class = mode(cluster, axis=None, keepdims=True)
    # Recall of a cluster: majority-class count divided by the total size of that class.
    recall = major_class.count[0] / counts[major_class.mode[0]]
    print(recall)
    total += recall
  return total / unique_predic.size

print(measure_recall(v, c), recall_score(v, c, average='macro'))

1.0
0.6
0.75
0.7833333333333333 0.7833333333333333


<a name="7.3"></a>
### 7.3 - Precision Measure

In [52]:
def measure_precision(real_labels, predicted_labels):
  total = 0
  unique_predic = np.unique(predicted_labels)
  for c_label in unique_predic:
    cluster = real_labels[predicted_labels == c_label]
    major_class = mode(cluster, axis=None, keepdims=True)
    # Precision of a cluster: majority-class count divided by the cluster size.
    precis = major_class.count[0] / cluster.size
    print(precis)
    total += precis
  return total / unique_predic.size
print(measure_precision(v, c), precision_score(v, c, average='macro'))

0.6
0.75
1.0
0.7833333333333333 0.7833333333333333


<a name="7.4"></a>
### 7.4 - F1 Measure

In [84]:
def measure_f1(real_labels, predicted_labels):
  # Uses the global `counts`: counts[i] is the number of samples whose true label is i.
  total = 0
  unique_predic = np.unique(predicted_labels)
  for c_label in unique_predic:
    cluster = real_labels[predicted_labels == c_label]
    major_class = mode(cluster, axis=None, keepdims=True)
    recall = major_class.count[0] / counts[major_class.mode[0]]
    precis = major_class.count[0] / cluster.size
    print(precis, ", ", recall)
    # F1 of a cluster: harmonic mean of its precision and recall.
    total += (2 * recall * precis) / (recall + precis)
  return total / unique_predic.size
print(measure_f1(v, c), f1_score(v, c, average='macro'))

0.6 ,  1.0
0.75 ,  0.6
1.0 ,  0.75
0.7579365079365079 0.7579365079365079
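
The three cluster-level measures above can be summarized as follows. The notation is introduced here for illustration: $n_{ij}$ is the number of points in cluster $i$ with true class $j$, $n_i$ is the size of cluster $i$, $m_j$ is the size of class $j$, $j_i$ is the majority class of cluster $i$, and $r$ is the number of clusters.

$$
\text{prec}_i = \frac{n_{i j_i}}{n_i}, \qquad
\text{recall}_i = \frac{n_{i j_i}}{m_{j_i}}, \qquad
F_i = \frac{2 \, \text{prec}_i \, \text{recall}_i}{\text{prec}_i + \text{recall}_i}, \qquad
F = \frac{1}{r} \sum_{i=1}^{r} F_i
$$

On the dummy data, the three clusters give $F_i = 0.75$, $0.6667$ and $0.8571$, whose mean is the $0.7579$ printed above.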


<a name="7.5"></a>
### 7.5 - Conditional Entropy

In [53]:
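def measure_entropy(real_labels, predicted_labels):
  # scipy's entropy expects counts or probabilities, not raw label values
  class_counts = np.unique(real_labels, return_counts=True)[1]
  class_entropy = entropy(class_counts, base=2)
  # mutual_info_score is in nats; divide by ln 2 so H(T|C) = H(T) - I(T;C) is in bits
  mutual_info = mutual_info_score(real_labels, predicted_labels) / np.log(2)
  return class_entropy - mutual_info
print(round(measure_entropy(v, c), 3))

0.675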
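
Conditional entropy can also be computed directly from its definition, as the weighted average over clusters of the entropy of the class mix inside each cluster. This gives an independent check that needs only NumPy. The snippet below is a sketch on the dummy labels, and the variable names are introduced here for illustration.

```python
import numpy as np

v = np.array([1, 2, 1, 0, 2, 1, 0, 1, 2, 2, 1, 0])  # true classes (dummy data)
c = np.array([0, 1, 1, 0, 2, 0, 0, 1, 2, 2, 1, 0])  # cluster assignments (dummy data)

n = v.size
conditional_entropy = 0.0
for cluster_id in np.unique(c):
    members = v[c == cluster_id]
    # Class proportions inside this cluster (only classes that occur, so no log(0)).
    p = np.unique(members, return_counts=True)[1] / members.size
    # Cluster weight times the entropy (in bits) of its class mix.
    conditional_entropy += (members.size / n) * -(p * np.log2(p)).sum()

print(round(conditional_entropy, 3))  # 0.675
```

A pure cluster contributes zero, so lower values indicate clusters that are more homogeneous with respect to the true classes.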