# Hierarchical Clustering Analysis: Global Cybersecurity Threats

This notebook performs hierarchical clustering analysis on the Global Cybersecurity Threats dataset (2015-2024) to identify hierarchical patterns and relationships in cybersecurity incidents.

## 1. Import Required Libraries

First, let's import all the necessary libraries for our hierarchical clustering analysis.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster import hierarchy
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
import joblib

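# Configure plotting
%matplotlib inline
# plt.style.use('seaborn') fails on recent Matplotlib releases (the style was
# renamed to 'seaborn-v0_8'), so the seaborn theme is applied directly instead
sns.set_theme(style="whitegrid")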

# Set random seed for reproducibility
np.random.seed(42)

# Set Pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## 2. Load and Prepare Data

Let's load our Global Cybersecurity Threats dataset and prepare it for hierarchical clustering.

In [None]:
# Load the dataset
data_path = '../Sheets/Global_Cybersecurity_Threats_2015-2024.csv'
df = pd.read_csv(data_path)

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFeature Information:")
print(df.info())
print("\nFirst few rows:")
display(df.head())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

## 3. Preprocess Features

Let's prepare our data for hierarchical clustering by:
1. Encoding categorical variables
2. Scaling numerical features
3. Creating the feature matrix

In [None]:
# Create a copy of the dataframe for preprocessing
df_processed = df.copy()

# Initialize label encoders for categorical columns
categorical_columns = ['Country', 'Attack Type', 'Target Industry', 
                      'Attack Source', 'Security Vulnerability Type', 
                      'Defense Mechanism Used']
label_encoders = {}

# Encode categorical variables
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    df_processed[column] = label_encoders[column].fit_transform(df_processed[column])

# Select features for clustering
features_for_clustering = [
    'Year',
    'Country',
    'Attack Type',
    'Target Industry',
    'Financial Loss (in Million $)',
    'Number of Affected Users',
    'Attack Source',
    'Security Vulnerability Type',
    'Defense Mechanism Used',
    'Incident Resolution Time (in Hours)'
]

# Create feature matrix X
X = df_processed[features_for_clustering].values

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Scaled feature matrix shape:", X_scaled.shape)
print("\nFeature names:", features_for_clustering)
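
One caveat about the encoding step: `LabelEncoder` maps each category to an arbitrary integer, so the distances used by the clustering treat, for example, code 0 and code 5 as "further apart" than code 0 and code 1, even though nominal categories have no such order. A common alternative is one-hot encoding. The sketch below illustrates the idea on a small made-up frame (the rows are invented for illustration and are not taken from the dataset); it is not the approach used in the rest of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'Attack Type': ['Phishing', 'DDoS', 'Phishing', 'Malware'],
    'Financial Loss (in Million $)': [10.0, 55.5, 23.1, 80.2],
})

# One-hot encode the categorical column so no artificial order is implied
toy_encoded = pd.get_dummies(toy, columns=['Attack Type'], dtype=float)

# Scale every column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy_encoded.values)

print(toy_encoded.columns.tolist())
print("Scaled toy matrix shape:", toy_scaled.shape)
```

With many high-cardinality columns such as `Country`, one-hot encoding widens the feature matrix considerably, so whether it is worth the cost depends on the data.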

## 4. Compute Linkage Matrix and Create Dendrogram

Now we'll perform hierarchical clustering using different linkage methods and visualize the results using dendrograms.

In [None]:
# Compute the linkage matrix using different methods
methods = ['ward', 'complete', 'average', 'single']
linkage_matrices = {}

for method in methods:
    linkage_matrices[method] = hierarchy.linkage(X_scaled, method=method)

# Create a figure with subplots for different linkage methods
plt.figure(figsize=(20, 15))

for i, method in enumerate(methods, 1):
    plt.subplot(2, 2, i)
    
    # Create dendrogram
    hierarchy.dendrogram(
        linkage_matrices[method],
        truncate_mode='lastp',  # show only the last p merged clusters
        p=30,  # show only the last 30 merged clusters
        show_leaf_counts=True,
        leaf_rotation=90.,
        leaf_font_size=8.,
    )
    
    plt.title(f'Hierarchical Clustering Dendrogram\n(method: {method})')
    plt.xlabel('Sample index or (cluster size)')
    plt.ylabel('Distance')

plt.tight_layout()
plt.show()

# Compute and print the cophenetic correlation coefficient for each method
print("\nCophenetic Correlation Coefficient for each method:")
for method in methods:
    c, _ = hierarchy.cophenet(linkage_matrices[method], distance.pdist(X_scaled))
    print(f"{method}: {c:.3f}")
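
To make the linkage matrix and the cophenetic correlation easier to read, here is a minimal sketch on five hand-picked toy points (not taken from the dataset). Each row of the matrix records one merge, and the cophenetic correlation measures how closely the dendrogram heights track the original pairwise distances.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

# Five toy points on a line: two tight pairs and one distant point
points = np.array([[0.0], [0.1], [5.0], [5.2], [20.0]])

# Each row of Z_toy is one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
Z_toy = hierarchy.linkage(points, method='average')
print(Z_toy)

# Cophenetic correlation: agreement between dendrogram heights and pairwise distances
c_toy, _ = hierarchy.cophenet(Z_toy, distance.pdist(points))
print(f"Cophenetic correlation: {c_toy:.3f}")
```

A value close to 1 suggests the dendrogram preserves the pairwise distances well; it says nothing about whether the resulting clusters are useful, so it is best read alongside the silhouette scores.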

## 5. Determine Optimal Number of Clusters

Let's determine the optimal number of clusters using the elbow method and silhouette scores.

In [None]:
# Let's use the 'ward' method for determining optimal clusters
# since it tends to create more balanced clusters
linkage_matrix = linkage_matrices['ward']

# Calculate distances for different numbers of clusters
max_clusters = 10
distances = []
silhouette_scores = []
k_values = range(2, max_clusters + 1)

for k in k_values:
    # Get cluster labels
    labels = hierarchy.fcluster(linkage_matrix, k, criterion='maxclust')
    
    # Record the height of the merge that reduces k clusters to k-1
    distances.append(linkage_matrix[-(k-1), 2])
    
    # Calculate silhouette score
    silhouette_scores.append(silhouette_score(X_scaled, labels))

# Plot elbow curve and silhouette scores
plt.figure(figsize=(12, 5))

# Elbow curve
plt.subplot(1, 2, 1)
plt.plot(k_values, distances, 'bx-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Distance')
plt.title('Elbow Method')

# Silhouette scores
plt.subplot(1, 2, 2)
plt.plot(k_values, silhouette_scores, 'rx-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')

plt.tight_layout()
plt.show()

# Print silhouette scores
print("\nSilhouette Scores for different numbers of clusters:")
for k, score in zip(k_values, silhouette_scores):
    print(f"k={k}: {score:.3f}")
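
Reading the elbow off a plot is subjective, so it can help to compute a candidate k programmatically. The sketch below uses synthetic blobs (illustrative data only) and two simple heuristics: the largest jump between consecutive merge heights, and the k with the highest silhouette score. Both are rules of thumb and not a definitive selection procedure.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Three synthetic, well-separated blobs (illustrative data only)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

Z_toy = hierarchy.linkage(X_toy, method='ward')

# Heuristic 1: the largest jump between consecutive merge heights
heights = Z_toy[:, 2]
gaps = np.diff(heights)
# Cutting between merge i and merge i+1 leaves n - (i + 1) clusters
k_gap = len(X_toy) - (int(np.argmax(gaps)) + 1)

# Heuristic 2: the k with the highest silhouette score
candidates = list(range(2, 8))
scores = [
    silhouette_score(X_toy, hierarchy.fcluster(Z_toy, k, criterion='maxclust'))
    for k in candidates
]
k_sil = candidates[int(np.argmax(scores))]

print("k from largest height gap:", k_gap)
print("k from best silhouette:", k_sil)
```

When the two heuristics disagree on real data, that is usually a sign the cluster structure is weak and the choice of k should be treated as a modeling decision.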

## 6. Create Final Clusters and Analyze Results

Based on the elbow method and silhouette analysis, let's create the final clusters and analyze their characteristics.

In [None]:
# Create final clusters (k=4 is the value chosen from the elbow and silhouette results above; adjust it if your results differ)
optimal_k = 4
cluster_labels = hierarchy.fcluster(linkage_matrix, optimal_k, criterion='maxclust')

# Add cluster labels to the processed dataframe
df_processed['Cluster'] = cluster_labels

# Analyze cluster characteristics
cluster_stats = []

for cluster in range(1, optimal_k + 1):
    cluster_data = df_processed[df_processed['Cluster'] == cluster]
    
    # Calculate statistics for numerical features
    financial_loss_mean = cluster_data['Financial Loss (in Million $)'].mean()
    affected_users_mean = cluster_data['Number of Affected Users'].mean()
    resolution_time_mean = cluster_data['Incident Resolution Time (in Hours)'].mean()
    
    # Get most common values for categorical features
    most_common_attack = label_encoders['Attack Type'].inverse_transform([cluster_data['Attack Type'].mode()[0]])[0]
    most_common_industry = label_encoders['Target Industry'].inverse_transform([cluster_data['Target Industry'].mode()[0]])[0]
    most_common_source = label_encoders['Attack Source'].inverse_transform([cluster_data['Attack Source'].mode()[0]])[0]
    
    cluster_stats.append({
        'Cluster': cluster,
        'Size': len(cluster_data),
        'Avg Financial Loss': financial_loss_mean,
        'Avg Affected Users': affected_users_mean,
        'Avg Resolution Time': resolution_time_mean,
        'Most Common Attack': most_common_attack,
        'Most Common Industry': most_common_industry,
        'Most Common Source': most_common_source
    })

# Create a DataFrame with cluster statistics
cluster_summary = pd.DataFrame(cluster_stats)
display(cluster_summary)
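
The same kind of per-cluster summary can also be written with a single `groupby(...).agg(...)` call on the unencoded columns, which avoids the round trip through `inverse_transform`. The sketch below uses a made-up frame and a hypothetical helper named `most_common`; it is an alternative formulation, not a replacement for the cell above.

```python
import pandas as pd

# Toy frame with cluster labels already attached (values are made up for illustration)
toy = pd.DataFrame({
    'Cluster': [1, 1, 2, 2, 2],
    'Attack Type': ['Phishing', 'Phishing', 'DDoS', 'Malware', 'DDoS'],
    'Financial Loss (in Million $)': [10.0, 20.0, 50.0, 70.0, 60.0],
})

def most_common(series):
    """Return the most frequent value (the first in sorted order on ties)."""
    return series.mode().iloc[0]

# Named aggregation: one output column per (input column, function) pair
summary = toy.groupby('Cluster').agg(
    size=('Attack Type', 'size'),
    avg_financial_loss=('Financial Loss (in Million $)', 'mean'),
    most_common_attack=('Attack Type', most_common),
)
print(summary)
```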

In [None]:
# Visualize clusters using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis')
plt.title('Hierarchical Clustering Results (PCA Visualization)')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.colorbar(scatter, label='Cluster')
plt.show()

# Calculate explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(f"Explained variance ratio: {explained_variance}")
print(f"Total variance explained: {sum(explained_variance):.2%}")

## 7. Save Results

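Let's save the clustering results and the preprocessing objects for future use.

In [None]:
import joblib

# SciPy's hierarchical clustering produces a linkage matrix rather than a
# fitted model object, so we store the linkage matrix together with the
# labels and the preprocessing objects needed to rebuild the feature matrix
clustering_results = {
    'linkage_matrix': linkage_matrix,
    'labels': cluster_labels,
    'scaled_data': X_scaled,
    'feature_names': features_for_clustering,
    'scaler': scaler,
    'label_encoders': label_encoders
}

# Save the results
joblib.dump(clustering_results, 'hierarchical_model.joblib')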
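
Agglomerative clustering has no built-in way to label observations that arrive later. One simple workaround, sketched below on synthetic data, is to reload the saved results and assign a new (already scaled) point to the cluster with the nearest centroid. The helper `assign_to_nearest_centroid` and the toy file name are hypothetical, and nearest-centroid assignment is an approximation that may not match what re-running the clustering would produce.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)

# Two synthetic blobs standing in for the scaled feature matrix (illustrative data only)
X_toy = np.vstack([
    rng.normal(0, 0.3, size=(20, 2)),
    rng.normal(5, 0.3, size=(20, 2)),
])
Z_toy = hierarchy.linkage(X_toy, method='ward')
toy_labels = hierarchy.fcluster(Z_toy, 2, criterion='maxclust')

# Round trip through joblib, using a temporary file so nothing is overwritten
path = os.path.join(tempfile.mkdtemp(), 'toy_hierarchical.joblib')
joblib.dump({'labels': toy_labels, 'scaled_data': X_toy, 'linkage_matrix': Z_toy}, path)
loaded = joblib.load(path)

# Compute one centroid per cluster from the stored data and labels
cluster_ids = np.unique(loaded['labels'])
centroids = np.array([
    loaded['scaled_data'][loaded['labels'] == c].mean(axis=0) for c in cluster_ids
])

def assign_to_nearest_centroid(point):
    """Return the label of the cluster whose centroid is closest to `point`."""
    dists = np.linalg.norm(centroids - point, axis=1)
    return int(cluster_ids[np.argmin(dists)])

new_label = assign_to_nearest_centroid(np.array([4.8, 5.1]))
print("Assigned cluster:", new_label)
```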

## 8. Cluster Analysis and Interpretation

Let's analyze the characteristics of each cluster to understand the patterns in cybersecurity threats.

In [None]:
# Add the final cluster labels to the original dataframe
df['Cluster'] = cluster_labels

# Numerical features to summarize (categorical columns have no meaningful mean)
numeric_features = [
    'Financial Loss (in Million $)',
    'Number of Affected Users',
    'Incident Resolution Time (in Hours)'
]

# Calculate mean values for each cluster
cluster_means = df.groupby('Cluster')[numeric_features].mean()

# Print cluster characteristics
print("Cluster Characteristics:")
print("-----------------------")
for cluster in cluster_means.index:  # fcluster labels start at 1, not 0
    cluster_rows = df[df['Cluster'] == cluster]
    print(f"\nCluster {cluster}:")
    print(f"Number of instances: {len(cluster_rows)}")
    print("\nMean values:")
    for feature in numeric_features:
        print(f"{feature}: {cluster_means.loc[cluster, feature]:.2f}")

    print("\nMost common attack types:")
    print(cluster_rows['Attack Type'].value_counts().head())

In [None]:
# Visualize cluster characteristics using a heatmap
plt.figure(figsize=(12, 8))
cluster_means_normalized = (cluster_means - cluster_means.mean()) / cluster_means.std()
sns.heatmap(cluster_means_normalized, annot=True, cmap='coolwarm', center=0)
plt.title('Cluster Characteristics Heatmap')
plt.ylabel('Cluster')
plt.xlabel('Features')
plt.tight_layout()
plt.show()

## 9. Temporal Analysis

Let's analyze how the clusters are distributed over time to identify any temporal patterns in cybersecurity threats.

In [None]:
# Calculate cluster distribution over time (year/cluster pairs with no incidents count as 0)
temporal_distribution = df.groupby(['Year', 'Cluster']).size().unstack(fill_value=0)

# Create a stacked area plot; DataFrame.plot opens its own figure, so the size is set via figsize
temporal_distribution.plot(kind='area', stacked=True, figsize=(15, 8))
plt.title('Temporal Distribution of Clusters')
plt.xlabel('Year')
plt.ylabel('Number of Incidents')
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Calculate relative proportions
proportions = temporal_distribution.div(temporal_distribution.sum(axis=1), axis=0)

# Create a stacked percentage plot
proportions.plot(kind='area', stacked=True, figsize=(15, 8))
plt.title('Relative Proportions of Clusters Over Time')
plt.xlabel('Year')
plt.ylabel('Proportion of Incidents')
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
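
The counts and proportions behind these plots can also be produced with `pd.crosstab`, which fills missing year/cluster combinations with zero and can normalize each row in one step. A minimal sketch on a made-up frame (the rows are invented for illustration):

```python
import pandas as pd

# Toy frame: one row per incident, with its year and assigned cluster (made-up values)
toy = pd.DataFrame({
    'Year': [2015, 2015, 2015, 2016, 2016],
    'Cluster': [1, 2, 2, 1, 1],
})

# Incident counts per year and cluster; absent combinations become 0
counts = pd.crosstab(toy['Year'], toy['Cluster'])

# Share of each cluster within a year; every row sums to 1
shares = pd.crosstab(toy['Year'], toy['Cluster'], normalize='index')

print(counts)
print(shares)
```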

## 10. Conclusions

The hierarchical clustering analysis groups the cybersecurity incidents into clusters based on their scaled features. The dendrograms show how incidents merge under each linkage method, the cluster summaries describe the typical financial loss, number of affected users, and resolution time in each group, and the temporal analysis shows how the mix of clusters changes from year to year.

Key findings:
1. The number of clusters was chosen using the elbow method and silhouette analysis, with the dendrograms as a visual check
2. Each cluster groups incidents with similar feature profiles; the silhouette scores indicate how well separated those groups are
3. The temporal analysis reveals how threat patterns have evolved from 2015 to 2024
4. The heatmap shows the key characteristics that distinguish each cluster

This analysis can be used to:
- Better understand the relationships between different types of cybersecurity threats
- Identify patterns in attack characteristics
- Track the evolution of threat patterns over time
- Inform security strategies based on cluster characteristics