## Anomaly Detection in Network Traffic
This notebook uses unsupervised learning techniques, specifically an isolation forest and an autoencoder, to detect unusual patterns in network traffic data that could indicate security breaches or system malfunctions.

Resources: https://www.kaggle.com/datasets/galaxyh/kdd-cup-1999-data

Steps:
1. Data Preprocessing: Load and preprocess the dataset, handling missing values and scaling features.
2. Feature Engineering: Select relevant features and create new ones if necessary.
3. Model Training:
   - Isolation Forest: Detect outliers by isolating observations.
   - Autoencoder: Use neural networks to reconstruct data and identify anomalies based on reconstruction errors.
4. Evaluation: Assess model performance using metrics like precision, recall, and F1-score.
5. Deployment: Implement the model to monitor live network traffic for anomalies.

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Define the column names for the dataset
column_names = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent",
    "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
    "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
    "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label"
]

# Load the dataset
df = pd.read_csv("C:/Users/Admin/Desktop/data/kddcup.data_10_percent.gz", compression='gzip', header=None, names=column_names)


In [12]:
df

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH,anomaly,anomaly_autoencoder
0,0,181,5450,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
1,0,239,486,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
2,0,235,1337,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
3,0,219,1337,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
4,0,217,2032,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494016,0,310,1881,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
494017,0,282,2286,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
494018,0,203,1200,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0
494019,0,291,1200,0,0,0,0,0,1,0,...,False,False,False,False,False,False,True,False,0,0


In [None]:
df.head()

In [2]:
# Data Preprocessing
# Convert categorical data to numeric
protocol_type = pd.get_dummies(df['protocol_type'], prefix='protocol_type')
service = pd.get_dummies(df['service'], prefix='service')
flag = pd.get_dummies(df['flag'], prefix='flag')

df = pd.concat([df, protocol_type, service, flag], axis=1)
df.drop(['protocol_type', 'service', 'flag'], axis=1, inplace=True)

# Standardize the data
scaler = StandardScaler()
X = df.drop(['label'], axis=1)
X_scaled = scaler.fit_transform(X)
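
Step 1 also mentions missing values, which the cell above does not check. The following is a minimal sketch on a toy frame, assuming median imputation is acceptable for numeric columns. The `toy` frame, its rows, and the choice of median are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical four-row stand-in for the KDD frame (values are made up)
toy = pd.DataFrame({
    "duration": [0, 2, None, 5],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "src_bytes": [181, 239, 235, 0],
    "label": ["normal.", "normal.", "smurf.", "neptune."],
})

# Count missing values per column before doing anything else
missing = toy.isna().sum()

# Fill numeric gaps with the column median (one simple choice among many)
numeric_cols = ["duration", "src_bytes"]
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# One-hot encode the categorical column and keep the label out of the features
features = pd.get_dummies(toy.drop(columns=["label"]), columns=["protocol_type"], dtype=float)

# Standardize every feature column
X_toy = StandardScaler().fit_transform(features)

# Share of rows that are not labelled normal
attack_share = (toy["label"] != "normal.").mean()
print(missing["duration"], X_toy.shape, attack_share)
```

On the real frame, the same `isna().sum()` call and attack-share calculation can be run on `df` before encoding.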

In [3]:
# Isolation Forest
from sklearn.ensemble import IsolationForest

# Train Isolation Forest; contamination=0.1 assumes about 10% of rows are anomalous
model = IsolationForest(contamination=0.1)
model.fit(X_scaled)

# Predict anomalies: predict() returns -1 for outliers and 1 for inliers; map to 1/0
df['anomaly'] = (model.predict(X_scaled) == -1).astype(int)
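
The classification report later in this notebook lists 396,743 attack rows out of 494,021, so attacks are the majority class and may not look like outliers to a forest fitted on all rows. One possible alternative, sketched here on synthetic data, fits the forest only on rows known to be normal and thresholds `score_samples` at a percentile of the normal scores. This is a semi-supervised variant, not the method used above; the 5th-percentile cutoff and the synthetic clusters are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic data with attacks in the majority, loosely mimicking the class balance
X_norm = rng.normal(0.0, 1.0, size=(200, 3))
X_att = rng.normal(6.0, 1.0, size=(800, 3))
X_all = np.vstack([X_norm, X_att])
y_all = np.array([0] * 200 + [1] * 800)

# Fit on normal rows only, so "normal" defines the reference distribution
forest = IsolationForest(n_estimators=100, random_state=0).fit(X_norm)

# score_samples: the lower the score, the more abnormal the row
scores_norm = forest.score_samples(X_norm)
threshold = np.percentile(scores_norm, 5)  # accept roughly 5% false alarms on normal rows

# Flag every row that scores below the threshold
flags = (forest.score_samples(X_all) < threshold).astype(int)

recall_attack = flags[y_all == 1].mean()
false_alarm = flags[y_all == 0].mean()
print(f"attack recall={recall_attack:.2f}, false-alarm rate={false_alarm:.2f}")
```

The percentile sets the false-alarm rate on normal traffic directly, which avoids guessing the anomaly fraction through `contamination`.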

In [7]:
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [10]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Define the autoencoder model
input_dim = X_scaled.shape[1]  # Number of input features
encoding_dim = 14  # Dimension of the encoded representation

# Input layer
input_layer = Input(shape=(input_dim,))

# Encoder layers
encoder = Dense(encoding_dim, activation="tanh")(input_layer)
encoder = Dense(int(encoding_dim / 2), activation="relu")(encoder)

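# Decoder layers
# Note: the ReLU output layer cannot produce negative values, while standardized inputs contain them
decoder = Dense(int(encoding_dim / 2), activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)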

# Autoencoder model
autoencoder = Model(inputs=input_layer, outputs=decoder)

# Compile the model
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
autoencoder.fit(X_scaled, X_scaled, epochs=100, batch_size=32, shuffle=True, validation_split=0.1)

# Reconstruct the inputs and compute the per-row reconstruction error
reconstructions = autoencoder.predict(X_scaled)
mse = tf.keras.losses.mse(X_scaled, reconstructions)
mse_mean = tf.reduce_mean(mse)
mse_std = tf.math.reduce_std(mse)

# Threshold: two standard deviations above the mean error over all rows
threshold = mse_mean + 2 * mse_std

# Flag rows whose reconstruction error exceeds the threshold
df['anomaly_autoencoder'] = (mse > threshold).numpy().astype(int)


Epoch 1/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 15s 1ms/step - loss: 0.8013 - val_loss: 1.5403
Epoch 2/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 13s 937us/step - loss: 0.7184 - val_loss: 1.5159
Epoch 3/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 14s 983us/step - loss: 0.6841 - val_loss: 1.3158
Epoch 4/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 14s 973us/step - loss: 0.6717 - val_loss: 1.6664
Epoch 5/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 13s 956us/step - loss: 0.6770 - val_loss: 1.3616
Epoch 6/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 16s 1ms/step - loss: 0.6411 - val_loss: 1.4185
Epoch 7/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 14s 1ms/step - loss: 0.6185 - val_loss: 1.4864
Epoch 8/100
13895/13895 ━━━━━━━━━━━━━━━━━━━━ 15s 1ms/step - loss: 0.7229 - v

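The thresholding step above derives its cutoff from the errors of all rows, attacks included. A hedged alternative is to derive the cutoff from rows known to be normal and flag anything whose per-row reconstruction error exceeds it. In the sketch below, PCA is swapped in as a linear stand-in for the autoencoder so the example runs without TensorFlow; the synthetic data, the 99th-percentile cutoff, and the `reconstruction_error` helper are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Normal rows lie close to a 2-D plane inside 5-D space; odd rows do not
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(500, 5))
X_odd = rng.normal(scale=3.0, size=(20, 5))

# PCA with 2 components acts as a linear encoder/decoder pair (autoencoder stand-in)
pca = PCA(n_components=2).fit(X_normal)

def reconstruction_error(model, X):
    """Mean squared reconstruction error for each row of X."""
    X_hat = model.inverse_transform(model.transform(X))
    return np.mean((X - X_hat) ** 2, axis=1)

# Cutoff taken from normal rows only: about 1% of them will exceed it
err_normal = reconstruction_error(pca, X_normal)
threshold = np.percentile(err_normal, 99)

# Rows that reconstruct worse than the cutoff are flagged
is_anomaly = reconstruction_error(pca, X_odd) > threshold
print(f"flagged {is_anomaly.sum()} of {len(X_odd)} odd rows")
```

With a Keras model, the same per-row error can be computed from `autoencoder.predict(X)` in place of the PCA round trip.
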
In [11]:
from sklearn.metrics import classification_report

# True labels (normal = 0, anomaly = 1)
y_true = (df['label'] != 'normal.').astype(int)

# Isolation Forest results
print("Isolation Forest:")
print(classification_report(y_true, df['anomaly']))

# Autoencoder results
print("Autoencoder:")
print(classification_report(y_true, df['anomaly_autoencoder']))

Isolation Forest:
              precision    recall  f1-score   support

           0       0.17      0.79      0.28     97278
           1       0.58      0.07      0.13    396743

    accuracy                           0.21    494021
   macro avg       0.38      0.43      0.21    494021
weighted avg       0.50      0.21      0.16    494021

Autoencoder:
              precision    recall  f1-score   support

           0       0.20      1.00      0.33     97278
           1       0.93      0.01      0.02    396743

    accuracy                           0.20    494021
   macro avg       0.56      0.50      0.17    494021
weighted avg       0.79      0.20      0.08    494021
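
The reports above depend on the thresholds chosen earlier. Ranking metrics such as ROC AUC and average precision evaluate the continuous anomaly score instead, so they separate "the score ranks attacks poorly" from "the threshold is in the wrong place". The sketch below uses hand-made scores where higher means more anomalous; the arrays are illustrative, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical labels (1 = attack) and anomaly scores (higher = more anomalous)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
anomaly_score = np.array([0.1, 0.2, 0.6, 0.5, 0.7, 0.8, 0.9, 0.95])

# Probability that a random attack row outranks a random normal row
auc = roc_auc_score(y_true, anomaly_score)

# Area under the precision-recall curve for the attack class
ap = average_precision_score(y_true, anomaly_score)

print(f"ROC AUC={auc:.3f}, average precision={ap:.3f}")
```

For an Isolation Forest, `-model.score_samples(X)` gives a score in this orientation; for an autoencoder, the per-row reconstruction error can be used directly.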



#### Conclusion
This notebook applied two unsupervised techniques, Isolation Forest and an autoencoder, to the 10% subset of the KDD Cup 1999 dataset. As configured here, neither detected attacks reliably: recall on the attack class was 0.07 for Isolation Forest and 0.01 for the autoencoder, with overall accuracy of 0.21 and 0.20 respectively. A likely contributor is class balance: 396,743 of the 494,021 rows (about 80%) are attacks, whereas `contamination=0.1` and a threshold of the mean plus two standard deviations both assume that anomalies are rare.
- Isolation Forest: Simple to implement and interpret, but its results depend heavily on the `contamination` setting.
- Autoencoder: Capable of capturing complex patterns in data. Requires more computational resources and tuning.

With thresholds and training data chosen to match the actual class balance, both techniques may still be useful tools for enhancing network security through real-time anomaly detection.
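
Step 5 (deployment) is not implemented in this notebook. As a rough sketch of what scoring live traffic could look like, the example below bundles scaling and the forest into one scikit-learn pipeline and applies it to incoming batches. The `detector` and `score_batch` names, the synthetic training data, and the batch contents are hypothetical; a real deployment would also need the same feature encoding as training.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic stand-in for historical traffic features
X_train = rng.normal(size=(500, 4))

# One object holds both the fitted scaler and the fitted model
detector = make_pipeline(StandardScaler(), IsolationForest(random_state=0))
detector.fit(X_train)

def score_batch(batch):
    """Return 1 for rows flagged as anomalous and 0 otherwise."""
    return (detector.predict(batch) == -1).astype(int)

# A new batch: five ordinary rows followed by one extreme row
incoming = np.vstack([rng.normal(size=(5, 4)), np.full((1, 4), 10.0)])
flags = score_batch(incoming)
print(flags)
```

Keeping the scaler inside the pipeline ensures new batches are transformed with the statistics learned at training time rather than re-fitted on each batch.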