# Cluster Analysis and Description

This notebook analyzes and describes the clusters produced by the rule clustering step.

We will:
1. Load clustering results
2. Recreate the feature extractor to recover the TF-IDF vocabulary
3. Generate cluster descriptions with automatic labeling
4. Display formatted descriptions
5. Visualize key characteristics and patterns
6. Explore individual clusters interactively
7. Export descriptions to files (Markdown, CSV, JSON)
8. Create a cluster comparison matrix

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from suricata_rule_clustering import clustering, features
from suricata_rule_clustering.cluster_analysis import ClusterDescriptor, format_cluster_description

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

## 1. Load Clustering Results

In [None]:
# Load feature matrix, DataFrame, and labels
X = np.load('../data/feature_matrix.npy')
df = pd.read_pickle('../data/clustered_rules.pkl')
labels = np.load('../data/cluster_labels.npy')

print(f"Feature matrix shape: {X.shape}")
print(f"DataFrame shape: {df.shape}")
print(f"Number of clusters: {len(set(labels))}")
print("Cluster sizes:")
unique, counts = np.unique(labels, return_counts=True)
for cluster_id, count in zip(unique, counts):
    print(f"  Cluster {cluster_id}: {count} rules ({count/len(labels)*100:.2f}%)")
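
If the labels come from a density-based algorithm such as scikit-learn's DBSCAN, points labeled `-1` are noise rather than a cluster, and `len(set(labels))` would then overcount by one. Whether that applies here depends on the clustering step, which this notebook does not show. The sketch below counts clusters and noise points separately; `example_labels` and `count_clusters` are hypothetical names used only for illustration.

```python
import numpy as np

# Hypothetical labels for illustration; -1 marks noise in scikit-learn's DBSCAN.
example_labels = np.array([0, 0, 1, -1, 1, 1, 2, -1])

def count_clusters(labels, noise_label=-1):
    """Return (number of real clusters, number of noise points)."""
    labels = np.asarray(labels)
    unique = np.unique(labels)
    n_clusters = int(np.sum(unique != noise_label))
    n_noise = int(np.sum(labels == noise_label))
    return n_clusters, n_noise

n_clusters, n_noise = count_clusters(example_labels)
print(f"Clusters: {n_clusters}, noise points: {n_noise}")
```

If the labels contain no `-1` entries, the result matches `len(set(labels))` and the noise count is zero.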

## 2. Initialize Feature Extractor

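The saved feature matrix is a plain NumPy array and does not carry the TF-IDF vocabulary. We therefore recreate the feature extractor and refit it on the rules to recover the term names used for top-term extraction. The refit matrix (`X_temp`) is not used afterwards; only the fitted vectorizer matters.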

In [None]:
# Recreate feature extractor to get TF-IDF vocabulary
extractor = features.RuleFeatureExtractor()

# Re-extract features to populate TF-IDF vectorizer
print("Re-extracting features to populate TF-IDF vectorizer...")
X_temp = extractor.create_feature_matrix(df, include_tfidf=True, tfidf_max_features=100)
print(f"Feature extractor ready with {len(extractor.tfidf_vectorizer.get_feature_names_out())} TF-IDF features")

## 3. Generate Cluster Descriptions

Each cluster description includes:
- Automatic labels
- Statistical summaries
- Dominant characteristics (classtypes, protocols, actions)
- Top TF-IDF terms
- Representative rules

In [None]:
# Create cluster descriptor
descriptor = ClusterDescriptor(feature_extractor=extractor)

# Generate descriptions for all clusters
print("Generating cluster descriptions...")
descriptions = descriptor.describe_all_clusters(X, labels, df)
print(f"Generated descriptions for {len(descriptions)} clusters")
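
One common way to pick representative rules is to take the cluster members closest to the cluster centroid. The sketch below shows that idea on a tiny made-up feature matrix. It is an assumption about the general technique, not a description of how `ClusterDescriptor` selects its representatives, and `X_example`, `labels_example`, and `representative_indices` are hypothetical names.

```python
import numpy as np

# Hypothetical 2-D feature matrix and labels, for illustration only.
X_example = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.4, 0.1],
    [5.0, 5.0],
    [6.0, 5.0],
])
labels_example = np.array([0, 0, 0, 1, 1])

def representative_indices(X, labels, cluster_id, n=2):
    """Return row indices of the n members closest to the cluster centroid."""
    member_idx = np.flatnonzero(labels == cluster_id)
    centroid = X[member_idx].mean(axis=0)
    dists = np.linalg.norm(X[member_idx] - centroid, axis=1)
    # argsort orders members by distance; map back to original row indices
    return member_idx[np.argsort(dists)[:n]]

print(representative_indices(X_example, labels_example, cluster_id=0, n=2))
```

The returned indices refer to rows of the full matrix, so they can be used with `df.iloc[...]` to look up the corresponding rules.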

## 4. Display Cluster Descriptions

### Summary Table

In [None]:
# Create summary DataFrame
summary_data = []
for cluster_id in sorted(descriptions.keys()):
    desc = descriptions[cluster_id]
    summary = desc['summary']
    chars = desc['characteristics']
    
    # Safely get top classtype
    classtypes = chars.get('classtypes', [])
    top_classtype = classtypes[0].get('name', 'N/A') if classtypes else 'N/A'
    
    # Safely get top protocol
    protocols = chars.get('protocols', [])
    top_protocol = protocols[0].get('name', 'N/A') if protocols else 'N/A'
    
    # Get top terms
    top_terms = ', '.join([t['term'] for t in desc['top_terms'][:5]]) if desc['top_terms'] else 'N/A'
    
    summary_data.append({
        'Cluster': cluster_id,
        'Label': desc['label'],
        'Size': summary['size'],
        'Percentage': f"{summary['percentage']:.2f}%",
        'Top Classtype': top_classtype,
        'Top Protocol': top_protocol,
        'Key Terms': top_terms
    })

summary_df = pd.DataFrame(summary_data)
print("\n=== CLUSTER SUMMARY ===")
display(summary_df)

### Detailed Descriptions

Display the detailed description for each cluster:

In [None]:
# Display detailed descriptions
for cluster_id in sorted(descriptions.keys()):
    print(format_cluster_description(descriptions[cluster_id]))
    print("\n")

## 5. Visualize Cluster Characteristics

In [None]:
# Plot cluster sizes
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart of cluster sizes
cluster_sizes = summary_df.sort_values('Size', ascending=False)
axes[0].barh(cluster_sizes['Label'], cluster_sizes['Size'])
axes[0].set_xlabel('Number of Rules')
axes[0].set_title('Cluster Sizes')
axes[0].invert_yaxis()

# Pie chart of top 10 clusters
top10 = cluster_sizes.head(10)
other_size = summary_df['Size'].sum() - top10['Size'].sum()
sizes = list(top10['Size'])
labels_pie = list(top10['Label'])
if other_size > 0:
    # Group all remaining clusters into a single slice
    sizes.append(other_size)
    labels_pie.append('Others')
axes[1].pie(sizes, labels=labels_pie, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Distribution of Top 10 Clusters')

plt.tight_layout()
plt.show()

In [None]:
# Plot classtype distribution across clusters
classtype_data = []
for cluster_id in sorted(descriptions.keys()):
    desc = descriptions[cluster_id]
    chars = desc['characteristics']
    if 'classtypes' in chars:
        for ct in chars['classtypes'][:3]:  # Top 3 classtypes
            classtype_data.append({
                'Cluster': desc['label'],
                'Classtype': ct['name'],
                'Count': ct['count']
            })

# Only create plot if we have classtype data
if classtype_data:
    classtype_df = pd.DataFrame(classtype_data)
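    # Use aggfunc='sum' so counts are added (not averaged) if two clusters share a label
    classtype_pivot = classtype_df.pivot_table(
        index='Cluster', columns='Classtype', values='Count',
        aggfunc='sum', fill_value=0
    )

    # Plot top classtypes; DataFrame.plot creates its own figure,
    # so a separate plt.figure() call would only leave an empty figure behind
    classtype_pivot.plot(kind='barh', stacked=True, figsize=(14, 10), colormap='tab20')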
    plt.xlabel('Number of Rules')
    plt.ylabel('Cluster')
    plt.title('Classtype Distribution Across Clusters')
    plt.legend(title='Classtype', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()
else:
    print("No classtype data available for visualization (all classtypes are None/NaN)")

## 6. Interactive Cluster Explorer

Select a cluster to explore in detail:

In [None]:
# Interactive cluster selection
from ipywidgets import interact, Dropdown

def explore_cluster(cluster_label):
    # Find cluster ID from label
    cluster_id = None
    for cid, desc in descriptions.items():
        if desc['label'] == cluster_label:
            cluster_id = cid
            break
    
    if cluster_id is None:
        print("Cluster not found")
        return
    
    # Display detailed description
    print(format_cluster_description(descriptions[cluster_id]))
    
    # Show sample rules from this cluster
    cluster_rules = df[labels == cluster_id]
    print("\n\nSample Rules from this Cluster:")
    print("=" * 80)
    for i, (_, rule) in enumerate(cluster_rules.head(10).iterrows(), 1):
        print(f"\n{i}. {rule.get('msg', 'N/A')}")
        print(f"   Protocol: {rule.get('protocol', 'N/A')}, Classtype: {rule.get('classtype', 'N/A')}, Priority: {rule.get('priority', 'N/A')}")
        print(f"   SID: {rule.get('sid', 'N/A')}")

# Create dropdown
cluster_labels = [desc['label'] for desc in descriptions.values()]
interact(explore_cluster, cluster_label=Dropdown(options=sorted(cluster_labels), description='Cluster:'))

## 7. Export Descriptions to Files

In [None]:
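# Make sure the output directory exists before exporting
from pathlib import Path
Path('../outputs').mkdir(parents=True, exist_ok=True)

# Export to Markdown
descriptor.export_to_markdown(descriptions, '../outputs/cluster_descriptions.md')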

In [None]:
# Export to CSV
descriptor.export_to_csv(descriptions, '../outputs/cluster_summary.csv')

In [None]:
# Export to JSON
descriptor.export_to_json(descriptions, '../outputs/cluster_descriptions.json')

## 8. Cluster Comparison Matrix

Compare clusters by the Euclidean distance between their centroids (the mean feature vector of each cluster). A smaller distance means the two clusters are more similar.

In [None]:
# Calculate cluster centroids
cluster_centroids = []
cluster_ids_sorted = sorted(descriptions.keys())

for cluster_id in cluster_ids_sorted:
    cluster_mask = labels == cluster_id
    cluster_features = X[cluster_mask]
    centroid = cluster_features.mean(axis=0)
    cluster_centroids.append(centroid)

cluster_centroids = np.array(cluster_centroids)

# Calculate pairwise distances between centroids
from sklearn.metrics import pairwise_distances
centroid_distances = pairwise_distances(cluster_centroids, metric='euclidean')

# Create distance matrix DataFrame
cluster_labels_short = [descriptions[cid]['label'][:30] for cid in cluster_ids_sorted]
distance_df = pd.DataFrame(
    centroid_distances,
    index=cluster_labels_short,
    columns=cluster_labels_short
)

# Plot heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(distance_df, annot=True, fmt='.2f', cmap='YlOrRd', square=True, cbar_kws={'label': 'Distance'})
plt.title('Cluster Centroid Distance Matrix (Lower Distance = More Similar)')
plt.xlabel('Cluster')
plt.ylabel('Cluster')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Find most similar cluster pairs
print("\nMost Similar Cluster Pairs:")
print("=" * 80)

# Get upper triangle indices (avoid diagonal and duplicates)
similarity_scores = []
for i in range(len(cluster_ids_sorted)):
    for j in range(i+1, len(cluster_ids_sorted)):
        similarity_scores.append({
            'Cluster 1': descriptions[cluster_ids_sorted[i]]['label'],
            'Cluster 2': descriptions[cluster_ids_sorted[j]]['label'],
            'Distance': centroid_distances[i, j]
        })

similarity_df = pd.DataFrame(similarity_scores).sort_values('Distance').head(10)
display(similarity_df)
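
The ranking above is global, so a cluster far from everything else may never appear in the top 10. A per-cluster view can complement it: for each cluster, find its single nearest neighbor by masking the diagonal of the distance matrix. The sketch below uses a small made-up matrix; `example_distances`, `example_names`, and `nearest_neighbors` are hypothetical names, and the same function could be applied to `centroid_distances`.

```python
import numpy as np

# Hypothetical symmetric centroid-distance matrix for three clusters.
example_distances = np.array([
    [0.0, 1.5, 4.0],
    [1.5, 0.0, 3.0],
    [4.0, 3.0, 0.0],
])
example_names = ["cluster A", "cluster B", "cluster C"]

def nearest_neighbors(distances):
    """For each row, return the index of the closest other cluster."""
    masked = np.array(distances, dtype=float)  # copy, so the input is not modified
    np.fill_diagonal(masked, np.inf)           # ignore each cluster's distance to itself
    return masked.argmin(axis=1)

for i, j in enumerate(nearest_neighbors(example_distances)):
    print(f"{example_names[i]} -> {example_names[j]} ({example_distances[i, j]:.2f})")
```

Note that the nearest-neighbor relation is not symmetric: in this example, cluster C's nearest neighbor is B, while B's nearest neighbor is A.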

## Summary

We have successfully:
- Generated comprehensive descriptions for all clusters
- Created automatic labels based on cluster characteristics
- Identified key terms and patterns for each cluster
- Found representative rules for each cluster
- Exported descriptions to multiple formats (Markdown, CSV, JSON)
- Analyzed cluster relationships and similarities

## Next Steps

- Review the exported descriptions in `outputs/cluster_descriptions.md`
- Use cluster labels and descriptions for rule management and organization
- Investigate similar clusters to understand rule patterns
- Fine-tune clustering parameters based on description quality