# Anomaly Detection in Network Traffic

This notebook walks through detecting anomalies in network traffic using the 10% subset of the KDD Cup 1999 dataset. It covers the following steps:

1. Data Collection and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Clustering and Anomaly Detection
4. Visualization of Results

Let's begin by importing the necessary libraries and loading our data.


# 1. Data Collection and Preprocessing

In this section, we'll load the KDD Cup 1999 dataset, preprocess it, and prepare it for our analysis.
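
One preprocessing choice worth noting: the categorical columns (`protocol_type`, `service`, `flag`) are nominal, so integer codes give them an arbitrary order that distance-based methods such as DBSCAN will treat as meaningful. One alternative, sketched here on a hypothetical toy frame (`toy` is not the real data), is one-hot encoding with `pandas.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame with one nominal column, mirroring 'protocol_type'
toy = pd.DataFrame({
    "duration": [0, 2, 5],
    "protocol_type": ["tcp", "udp", "icmp"],
})

# One indicator column per category, so no artificial ordering is introduced
encoded = pd.get_dummies(toy, columns=["protocol_type"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is width: a column with many distinct values, such as `service`, turns into many indicator columns.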


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Load the KDD Cup 1999 dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz"
columns = ["duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", 
           "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", 
           "num_compromised", "root_shell", "su_attempted", "num_root", "num_file_creations", 
           "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login", 
           "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", 
           "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", 
           "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", 
           "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate", "dst_host_serror_rate", 
           "dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "label"]

df = pd.read_csv(url, names=columns)

# Preprocess the data
df['label'] = df['label'].apply(lambda x: 0 if x == 'normal.' else 1)  # Binary classification (normal vs anomaly)

# Encode categorical variables
le = LabelEncoder()
df['protocol_type'] = le.fit_transform(df['protocol_type'])
df['service'] = le.fit_transform(df['service'])
df['flag'] = le.fit_transform(df['flag'])

# Split the dataset into train and test before scaling,
# so the scaler is fitted on training data only (avoids test-set leakage)
X = df.drop(['label'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data preprocessing completed. Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)


# 2. Exploratory Data Analysis (EDA)

Now that we have preprocessed our data, let's visualize it to gain some insights.
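
Alongside the plots, it can help to quantify the class balance directly, since a skewed split affects how detector settings such as `contamination` should be chosen. The sketch below uses a small made-up label series (`toy_labels`, a hypothetical stand-in); on the notebook's data, `df['label']` would take its place.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['label'] (0 = normal, 1 = anomaly)
toy_labels = pd.Series([0, 1, 1, 1, 0, 1, 1, 1, 1, 0], name="label")

# Absolute counts and relative shares per class
counts = toy_labels.value_counts().sort_index()
shares = toy_labels.value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(shares.round(2).to_dict())
```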


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the label distribution
plt.figure(figsize=(10, 6))
sns.countplot(x='label', data=df)  # pass the column by keyword; recent seaborn versions require it
plt.title('Label Distribution: Normal vs Anomalies')
plt.show()

# Visualizing the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


# 3. Clustering and Anomaly Detection

In this section, we'll apply three different methods for anomaly detection:
1. DBSCAN
2. Isolation Forest
3. Autoencoder

We'll evaluate each method using classification reports and confusion matrices.
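
One detail matters before any report is computed: scikit-learn's `IsolationForest` predicts `-1` for outliers and `1` for inliers, and `DBSCAN` marks noise points with the cluster id `-1`, whereas the labels prepared earlier use 0 for normal and 1 for anomaly. A minimal sketch of the mapping on synthetic data follows; `X_toy`, `y_toy`, and the helper `to_binary` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 "normal" points near the origin, 10 far-away "anomalies"
X_toy = np.vstack([rng.normal(0, 1, size=(200, 2)),
                   rng.normal(8, 0.5, size=(10, 2))])
y_toy = np.array([0] * 200 + [1] * 10)

def to_binary(raw_pred):
    """Map detector output (-1 = outlier/noise) to 1 = anomaly, everything else to 0."""
    return (np.asarray(raw_pred) == -1).astype(int)

raw = IsolationForest(contamination=0.05, random_state=0).fit_predict(X_toy)
y_pred = to_binary(raw)

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
print(confusion_matrix(y_toy, y_pred, labels=[0, 1]))
```

Without this mapping, the reports would compare incompatible label sets and the per-class metrics would be meaningless.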


In [None]:
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
from tensorflow.keras import layers, models

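# DBSCAN: fit_predict returns cluster ids, with -1 marking noise points
dbscan = DBSCAN(eps=0.5, min_samples=10)
dbscan_labels = dbscan.fit_predict(X_test)
# Treat noise points as anomalies (1) and clustered points as normal (0) to match y_test
y_pred_dbscan = (dbscan_labels == -1).astype(int)

print("DBSCAN - Classification Report:")
print(classification_report(y_test, y_pred_dbscan))
print("DBSCAN - Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dbscan))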

# Isolation Forest: predictions are -1 for outliers and 1 for inliers
iso_forest = IsolationForest(contamination=0.1, random_state=42)
# Map to the label convention used in y_test: 1 = anomaly, 0 = normal
y_pred_iso_forest = (iso_forest.fit_predict(X_test) == -1).astype(int)

print("\nIsolation Forest - Classification Report:")
print(classification_report(y_test, y_pred_iso_forest))
print("Isolation Forest - Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_iso_forest))

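# Autoencoder
input_dim = X_train.shape[1]
autoencoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(32, activation='relu'),
    # Linear output: standardized features are not confined to [0, 1],
    # so a sigmoid output could not reconstruct them
    layers.Dense(input_dim, activation='linear')
])

autoencoder.compile(optimizer='adam', loss='mse')

# Train on normal traffic only, so that attacks reconstruct poorly
X_train_normal = X_train[(y_train == 0).to_numpy()]
autoencoder.fit(X_train_normal, X_train_normal, epochs=10, batch_size=64, validation_split=0.1, verbose=0)

# Threshold: mean + 2 standard deviations of the reconstruction error on normal training traffic
train_reconstruction = autoencoder.predict(X_train_normal, verbose=0)
train_error = tf.reduce_mean(tf.square(X_train_normal - train_reconstruction), axis=1)
threshold = tf.reduce_mean(train_error) + 2 * tf.math.reduce_std(train_error)

# Flag test samples whose reconstruction error exceeds the threshold (1 = anomaly, 0 = normal)
reconstruction = autoencoder.predict(X_test, verbose=0)
reconstruction_error = tf.reduce_mean(tf.square(X_test - reconstruction), axis=1)
y_pred_autoencoder = tf.cast(reconstruction_error > threshold, tf.int32).numpy()

print("\nAutoencoder - Classification Report:")
print(classification_report(y_test, y_pred_autoencoder))
print("Autoencoder - Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_autoencoder))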


# 4. Visualization of Results

Finally, let's visualize our results using PCA to reduce the dimensionality of our data to 2D.
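
A 2D projection shows only part of the structure, so it is worth checking how much variance the two components retain. The sketch below does this on synthetic data (`X_toy` is a hypothetical stand-in for `X_test`) using PCA's `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for X_test: 500 samples, 10 features, two of them with large variance
X_toy = rng.normal(size=(500, 10))
X_toy[:, 0] *= 5.0
X_toy[:, 1] *= 3.0

pca = PCA(n_components=2)
projected = pca.fit_transform(X_toy)

# Share of total variance captured by each component and by both together
retained = pca.explained_variance_ratio_
print("Per component:", retained.round(3))
print("Total retained:", retained.sum().round(3))
```

If the retained share is low, points that overlap in the plot may still be well separated in the full feature space.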


In [None]:
from sklearn.decomposition import PCA

# PCA for visualization
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_test)

plt.figure(figsize=(10, 6))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=y_pred_dbscan, cmap='coolwarm', alpha=0.7)
plt.title('DBSCAN - PCA Visualization of Clusters and Anomalies')
plt.colorbar(label='DBSCAN label')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()


# Conclusion

In this notebook, we've demonstrated the process of anomaly detection in network traffic using the KDD Cup 1999 dataset. We've applied three different methods: DBSCAN, Isolation Forest, and Autoencoder.

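Each method has its strengths and weaknesses:
- DBSCAN groups dense regions into clusters and labels points outside any cluster as noise, which can be read as anomalies. Its results are sensitive to `eps` and `min_samples`.
- Isolation Forest scales well to large, high-dimensional datasets, but its `contamination` parameter should roughly match the expected share of anomalies.
- An autoencoder can capture complex, non-linear patterns and is flexible in terms of architecture design, at the cost of more tuning and training time.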

The choice of method depends on the specific requirements of your use case, such as interpretability, scalability, and the nature of your data.

To improve this analysis, you could:
1. Try different hyperparameters for each method
2. Combine multiple methods for ensemble learning
3. Use more advanced deep learning techniques like LSTM autoencoders for sequence data
4. Incorporate domain knowledge to engineer more relevant features
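
As an illustration of the second suggestion, a simple way to combine detectors is a majority vote over their binary predictions. The sketch below assumes each detector's output has already been mapped to 0 (normal) and 1 (anomaly); the three prediction arrays are made-up values, and `majority_vote` is a hypothetical helper, not part of any library.

```python
import numpy as np

def majority_vote(*predictions):
    """Flag a sample as anomalous (1) when more than half of the detectors flag it."""
    stacked = np.vstack(predictions)   # shape: (n_detectors, n_samples)
    votes = stacked.sum(axis=0)        # number of detectors flagging each sample
    return (votes * 2 > stacked.shape[0]).astype(int)

# Made-up binary predictions from three detectors for six samples
pred_dbscan = np.array([1, 0, 0, 1, 0, 1])
pred_iso = np.array([1, 0, 1, 0, 0, 1])
pred_ae = np.array([0, 0, 1, 1, 0, 1])

combined = majority_vote(pred_dbscan, pred_iso, pred_ae)
print(combined)  # [1 0 1 1 0 1]
```

With an even number of detectors, ties resolve to normal under this rule; a stricter or looser vote threshold is an equally valid design choice.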

Remember to always validate your results and consider the practical implications of your anomaly detection system in a real-world network security context.
