# Workload Analysis

This notebook identifies representative workload classes (e.g., short CPU-bound jobs, GPU-heavy jobs, data-intensive analytics, and multi-node MPI jobs) from Perlmutter job traces.


We'll implement K-means clustering and use the elbow method to choose the number of clusters, with tqdm progress bars while the candidate values of k are evaluated.

**Imports**

In [None]:
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import eda_utils
import clustering_utils
import visualization_utils

ModuleNotFoundError: No module named 'plotly'

**Load and Explore Dataset**

In [None]:
# Load the iris dataset
print("Loading iris dataset...")
iris = load_iris()
features = iris.feature_names
target = iris.target_names

# Convert to polars DataFrame
df = pl.DataFrame({
    features[0]: iris.data[:, 0],
    features[1]: iris.data[:, 1],
    features[2]: iris.data[:, 2],
    features[3]: iris.data[:, 3],
    'species': [target[i] for i in iris.target]
})

# Perform exploratory data analysis
eda_utils.describe_dataset(df, target_col='species')

# Create a pairplot to visualize relationships between features
eda_utils.create_pairplot(df, hue='species', title='Pairplot of Iris Dataset Features')

# Standardize the features (excluding the target column)
scaled_df, scaler = eda_utils.standardize_features(df, exclude_cols=['species'])
print("Standardized Features:")
display(scaled_df.head())

# Get standardized data for clustering
X = scaled_df.to_numpy()

# Find optimal number of clusters using the elbow method
k_range = range(1, 11)
inertias, times = clustering_utils.find_optimal_k(X, k_range=k_range, verbose=True)

# Plot the elbow curve
clustering_utils.plot_elbow_method(k_range, inertias, annotate=True)

# Apply KMeans clustering with the optimal number of clusters
optimal_k = 3  # Based on the elbow method
labels, centers, n_clusters = clustering_utils.apply_clustering(
    X, algorithm="kmeans", params={"n_clusters": optimal_k}
)

# Add cluster labels to the original dataframe
result_df = df.with_columns(pl.Series(name="cluster", values=labels))

# Display results
print("Clustering Results:")
display(result_df.head())

# Compare clusters with original species
print("\nCluster vs. Species Distribution:")
# pl.len() replaces the deprecated pl.count() for counting rows per group
display(result_df.group_by(['species', 'cluster']).agg(pl.len().alias('count')))

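# Reduce dimensions for visualization
# NOTE: points and centers are reduced in separate calls. If reduce_dimensions
# fits a fresh PCA on each call, the centers end up on different axes than X_pca;
# the same applies to the 3D projection further down.
X_pca = visualization_utils.reduce_dimensions(X, method="pca", n_components=2)
centers_pca = visualization_utils.reduce_dimensions(centers, method="pca", n_components=2)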

# Visualize the clusters using matplotlib
visualization_utils.plot_clusters_mpl(
    X_pca, labels, centers_pca,
    title='Iris Clusters (PCA)'
)

# Visualize the clusters using plotly for interactivity
fig = visualization_utils.plot_clusters_plotly(
    X_pca, labels, centers_pca,
    point_labels=result_df['species'].to_list(),
    title='Interactive Iris Clusters (PCA)'
)
fig.show()

# For 3D visualization
X_pca_3d = visualization_utils.reduce_dimensions(X, method="pca", n_components=3)
centers_pca_3d = visualization_utils.reduce_dimensions(centers, method="pca", n_components=3)

fig_3d = visualization_utils.plot_clusters_plotly(
    X_pca_3d, labels, centers_pca_3d,
    point_labels=result_df['species'].to_list(),
    title='Interactive Iris Clusters (PCA 3D)'
)
fig_3d.show()

Loading iris dataset...


NameError: name 'eda_utils' is not defined

**Data Visualization**

**Data Preprocessing**
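
The preprocessing step standardizes each feature before clustering, so that no single column dominates the Euclidean distances K-means relies on. The implementation of `eda_utils.standardize_features` is not shown in this notebook, so the following is only a minimal NumPy sketch of z-score standardization. The `standardize` helper and the `X_demo` matrix are hypothetical names with made-up values.

```python
import numpy as np

def standardize(X):
    """Return z-scored columns plus the per-column mean and std that were used."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)        # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0      # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std, mean, std

# Made-up matrix: 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
Z, mean, std = standardize(X_demo)
print(Z.mean(axis=0), Z.std(axis=0))
```

Returning `mean` and `std` alongside the scaled data makes it possible to apply the same transformation to new samples later.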

**Elbow Method**
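
The elbow method runs K-means for a range of k values and records the inertia (within-cluster sum of squared distances) for each; the "elbow" is the k after which the inertia stops dropping sharply. Since `clustering_utils.find_optimal_k` is not shown here, the sketch below substitutes a plain Lloyd-iteration K-means written in NumPy. The function `kmeans_inertia` and the synthetic blobs are assumptions made for illustration, not the notebook's actual helper or data.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run plain Lloyd iterations and return the final within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k distinct random samples
    for _ in range(n_iter):
        # squared distance from every point to every center, shape (n_samples, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its points; keep it in place if it has none
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Synthetic data: three 2-D blobs (made up for illustration)
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

k_values = list(range(1, 7))
# keep the best of several random restarts for each k
inertias = [min(kmeans_inertia(X_demo, k, seed=s) for s in range(5)) for k in k_values]
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Lloyd iterations can settle in a local minimum, which is why the sketch keeps the lowest inertia over a few seeds. The elbow itself is still read off the curve by eye, so the chosen k remains a judgment call.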

**Clustering with Optimal K**
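
Once labels are assigned, the notebook compares clusters against the known species by counting each (species, cluster) pair. The sketch below shows that comparison with only the standard library, plus an optional purity score as a one-number summary. The helpers `contingency` and `purity` are hypothetical, and the label lists are made-up values, not results from this notebook.

```python
from collections import Counter

def contingency(true_labels, cluster_ids):
    """Count how many samples fall into each (true label, cluster id) pair."""
    return Counter(zip(true_labels, cluster_ids))

def purity(true_labels, cluster_ids):
    """Fraction of samples that carry the majority true label of their cluster."""
    counts = contingency(true_labels, cluster_ids)
    best_per_cluster = {}
    for (label, cluster), n in counts.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), n)
    return sum(best_per_cluster.values()) / len(true_labels)

# Made-up labels for illustration only
species = ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 1]

for (label, cluster), n in sorted(contingency(species, clusters).items()):
    print(f"{label:<12} cluster={cluster} count={n}")
print(f"purity = {purity(species, clusters):.3f}")
```

Cluster ids are arbitrary, so the table should be read by which species dominates each cluster rather than by matching id numbers to species order.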

**Visualization**
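
For the cluster centers to appear in the right place on a PCA scatter plot, they need to be projected with the same axes that were fitted on the data points. How `visualization_utils.reduce_dimensions` handles this is not shown, so the following is a hedged NumPy sketch of one way to do it: fit PCA once via SVD, then reuse the fitted mean and components for both the points and the centers. The names `fit_pca`, `project`, `X_demo`, and `centers_demo` are assumptions for illustration.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the column means and the top principal axes of X (via SVD)."""
    mean = X.mean(axis=0)
    # rows of Vt are the principal axes, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(points, mean, components):
    """Map points into the PCA space defined by (mean, components)."""
    return (points - mean) @ components.T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))   # stand-in for the standardized features
# stand-in "centers": the means of two arbitrary halves of the data
centers_demo = np.array([X_demo[:50].mean(axis=0), X_demo[50:].mean(axis=0)])

mean, components = fit_pca(X_demo, n_components=2)
X_2d = project(X_demo, mean, components)
centers_2d = project(centers_demo, mean, components)  # same axes as X_2d
print(X_2d.shape, centers_2d.shape)
```

Because the projection is affine, each projected center coincides with the mean of its projected points, which is what a cluster plot is expected to show.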