import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# --- load ---
df = pd.read_csv("path", encoding="unicode_escape")

# --- numeric features only (drop obvious non-numeric columns) ---
drop_if_present = ['OrderDate','OrderDateTime','Order ID','Address','ADDRESSLINE1','ADDRESSLINE2','STATE','POSTALCODE','PHONE','CustomerID']
df = df.drop([c for c in drop_if_present if c in df.columns], axis=1, errors='ignore')
X = df.select_dtypes(include=[np.number]).copy()
X = X.dropna(axis=1, how='all')  # drop empty numeric cols
X = X.fillna(X.median())

# --- scale ---
scaler = StandardScaler()
Xs = scaler.fit_transform(X)

# --- elbow: inertia for k=1..10 ---
ks = list(range(1, 11))
inertias = []
for k in ks:
    inertias.append(KMeans(n_clusters=k, random_state=42, n_init=10).fit(Xs).inertia_)

# --- automatic elbow detection: distance from line (first-last) ---
# points (k, inertia)
pts = np.column_stack((ks, inertias))
p1, p2 = pts[0], pts[-1]
# line vector
v = p2 - p1
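# perpendicular distances; the 2-D cross product is written out explicitly
# because np.cross on 2-D vectors is deprecated in NumPy 2.0
d = pts - p1
distances = np.abs(v[0] * d[:, 1] - v[1] * d[:, 0]) / np.linalg.norm(v)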
optimal_k = int(ks[np.argmax(distances)])
if optimal_k < 2:
    optimal_k = 2

print("Inertia by k:", dict(zip(ks, inertias)))
print("Detected elbow (optimal k):", optimal_k)

# --- final KMeans ---
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10).fit(Xs)
df['kmeans_cluster'] = kmeans.labels_

# --- hierarchical clustering (Agglomerative) ---
agg = AgglomerativeClustering(n_clusters=optimal_k).fit(Xs)
df['hier_cluster'] = agg.labels_

# --- outputs ---
print("\nKMeans cluster counts:\n", df['kmeans_cluster'].value_counts().sort_index().to_dict())
print("\nHierarchical cluster counts:\n", df['hier_cluster'].value_counts().sort_index().to_dict())

# compact cluster summary (means of numeric features)
print("\nKMeans cluster centers (in original scale):")
centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=X.columns).round(3)
print(centers_df)

# # save labelled data if needed
# df.to_csv("sales_clusters_labeled.csv", index=False)
# print("\nSaved labeled data to sales_clusters_labeled.csv")


Inertia by k: {1: 25407.000000000004, 2: 20090.887012173356, 3: 16909.3272126169, 4: 14818.012273279455, 5: 13541.406778644374, 6: 12547.114818612226, 7: 11743.686897235177, 8: 11067.285484133725, 9: 10529.8644594963, 10: 10085.260683115195}
Detected elbow (optimal k): 4

KMeans cluster counts:
 {0: 478, 1: 948, 2: 663, 3: 734}

Hierarchical cluster counts:
 {0: 1094, 1: 690, 2: 478, 3: 561}

KMeans cluster centers (in original scale):
   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER     SALES  QTR_ID  \
0    10392.297           36.885     83.445            5.973  3747.880   1.368   
1    10251.919           35.880     98.117            6.524  4667.526   3.629   
2    10181.014           34.884     85.941            6.423  3467.217   1.485   
3    10250.628           33.097     63.045            6.752  2065.889   3.530   

   MONTH_ID   YEAR_ID     MSRP  
0     3.002  2005.000  100.167  
1     9.831  2003.567  127.541  
2     3.364  2003.565  100.665  
3     9.578  2003.590 

1. Import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering


Explanation:

pandas → For loading and handling data.

numpy → For numerical operations.

StandardScaler → Standardizes numeric features (mean=0, std=1).

KMeans → KMeans clustering algorithm.

AgglomerativeClustering → Hierarchical clustering algorithm.

2. Load Dataset
df = pd.read_csv("path", encoding="unicode_escape")


Explanation:

Reads a CSV file with sales data.

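encoding="unicode_escape" → Decodes the bytes as Latin-1 and interprets backslash escape sequences. It is a common workaround when a CSV is not valid UTF-8, but it can alter fields that contain literal backslashes; encoding="latin-1" avoids that side effect.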
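
To make the effect of the encoding choice concrete, the sketch below decodes two hand-written byte strings with Python's built-in codecs. The byte strings are illustrative assumptions, not rows from the sales file. It suggests why unicode_escape works as a fallback for non-UTF-8 data, and why it can also change values that contain backslashes.

```python
# Illustrative byte strings; these values are assumptions, not data from the CSV.
accented = b"caf\xe9"        # "café" stored as Latin-1 bytes
windows_path = b"C:\\new"    # contains a literal backslash followed by "n"

# The accented bytes are not valid UTF-8, so a plain UTF-8 decode fails.
try:
    accented.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Both codecs map the byte 0xE9 to "é".
latin1_text = accented.decode("latin-1")
escape_text = accented.decode("unicode_escape")

# unicode_escape also interprets backslash sequences, so "\n" becomes a newline.
mangled_path = windows_path.decode("unicode_escape")
intact_path = windows_path.decode("latin-1")

print(utf8_ok, latin1_text, escape_text, repr(mangled_path), repr(intact_path))
```

If the file is known to be Latin-1 or Windows-1252, passing that encoding to pd.read_csv is likely the safer choice.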

3. Keep Only Numeric Columns
drop_if_present = ['OrderDate','OrderDateTime','Order ID','Address','ADDRESSLINE1','ADDRESSLINE2','STATE','POSTALCODE','PHONE','CustomerID']
df = df.drop([c for c in drop_if_present if c in df.columns], axis=1, errors='ignore')
X = df.select_dtypes(include=[np.number]).copy()
X = X.dropna(axis=1, how='all')  # drop empty numeric columns
X = X.fillna(X.median())          # replace missing values with median


Explanation:

Drop columns that are not useful as numeric features (dates, IDs, addresses, phone numbers), skipping any that are not present in the file.

select_dtypes(include=[np.number]) → Keep only numeric features.

Drop numeric columns that are completely empty.

Fill the remaining missing values with each column's median.
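
The three preprocessing calls can be traced on a tiny frame. This is a minimal sketch; the column names and values are made up for illustration and are not taken from the sales data.

```python
import numpy as np
import pandas as pd

# Toy frame; names and values are assumptions for illustration only.
toy = pd.DataFrame({
    "CITY": ["Paris", "Lyon", "Nice", "Lille"],   # non-numeric, excluded by select_dtypes
    "SALES": [100.0, np.nan, 300.0, 200.0],       # one missing value
    "EMPTY": [np.nan, np.nan, np.nan, np.nan],    # numeric dtype but entirely missing
})

X_toy = toy.select_dtypes(include=[np.number]).copy()   # keeps SALES and EMPTY
X_toy = X_toy.dropna(axis=1, how="all")                  # drops EMPTY
X_toy = X_toy.fillna(X_toy.median())                     # NaN -> median of 100, 300, 200

print(X_toy)
```

Here the missing SALES entry becomes 200.0, the median of the three observed values.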

4. Scale Data
scaler = StandardScaler()
Xs = scaler.fit_transform(X)


Explanation:

Standardizes features → mean=0, std=1.

Important for clustering because KMeans and the default (Ward) agglomerative linkage both use Euclidean distance, and scaling prevents features with larger numbers (such as SALES) from dominating.
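
As a rough check of what the scaler does, the sketch below standardizes a small made-up array by hand and compares it with StandardScaler. The array X_demo is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 3000.0],
                   [3.0, 5000.0]])

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contribute on the same footing to any Euclidean distance.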

5. Find Optimal k using Elbow Method
ks = list(range(1, 11))
inertias = []
for k in ks:
    inertias.append(KMeans(n_clusters=k, random_state=42, n_init=10).fit(Xs).inertia_)


Explanation:

Test k from 1 to 10.

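inertia_ → Sum of squared distances from each point to its assigned cluster center. Lower inertia means tighter clusters, but inertia keeps falling as k grows, so it cannot simply be minimized to choose k.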

Plotting inertia vs k usually gives an “elbow” where adding more clusters doesn’t improve much.
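
The definition of inertia can be checked by recomputing it from the fitted labels and centers. The following is a minimal sketch on synthetic data; the two blobs and the names X_demo, km, and manual_inertia are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)

# Sum of squared distances from each point to the center of its own cluster.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centers) ** 2).sum())

# Inertia for k=1 is the total squared distance to the overall mean.
inertia_k1 = KMeans(n_clusters=1, random_state=42, n_init=10).fit(X_demo).inertia_

print(manual_inertia, km.inertia_, inertia_k1)
```

The drop from k=1 to k=2 is large here because the data really has two groups; on real data the curve usually bends more gently.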

6. Automatic Elbow Detection
pts = np.column_stack((ks, inertias))
p1, p2 = pts[0], pts[-1]
v = p2 - p1
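d = pts - p1  # 2-D cross product written out; np.cross on 2-D vectors is deprecated in NumPy 2.0
distances = np.abs(v[0] * d[:, 1] - v[1] * d[:, 0]) / np.linalg.norm(v)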
optimal_k = int(ks[np.argmax(distances)])
if optimal_k < 2:
    optimal_k = 2


Explanation:

Uses a geometric method to find elbow automatically:

Consider the line from the first (k, inertia) point to the last.

Compute the perpendicular distance of each point to this line.

The point with the maximum distance is the “elbow” → optimal k.

Ensures k ≥ 2, since a single cluster would not segment the data at all.
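
The steps above can be sketched as a small standalone function. The name elbow_by_max_distance is hypothetical, and the inertia values are the ones printed in the output above, rounded to three decimals.

```python
import numpy as np

def elbow_by_max_distance(ks, inertias):
    """Return (k, distances) where k is the point farthest from the first-to-last chord."""
    pts = np.column_stack((ks, inertias)).astype(float)
    p1, p2 = pts[0], pts[-1]
    v = p2 - p1                      # direction of the chord
    d = pts - p1                     # vectors from the first point to every point
    # In 2-D, |v x d| / |v| is the perpendicular distance from each point to the chord.
    distances = np.abs(v[0] * d[:, 1] - v[1] * d[:, 0]) / np.linalg.norm(v)
    best_k = max(int(ks[int(np.argmax(distances))]), 2)   # keep k >= 2
    return best_k, distances

# Inertia values taken from the printed output, rounded.
ks = list(range(1, 11))
inertias = [25407.0, 20090.887, 16909.327, 14818.012, 13541.407,
            12547.115, 11743.687, 11067.285, 10529.864, 10085.261]

best_k, distances = elbow_by_max_distance(ks, inertias)
print(best_k)
```

The two endpoints always have distance zero, so the selected k lies strictly between the smallest and largest values tested whenever the curve bends at all.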

7. Print Inertia and Optimal k
print("Inertia by k:", dict(zip(ks, inertias)))
print("Detected elbow (optimal k):", optimal_k)


Explanation:

Shows inertia for each k.

Displays the optimal number of clusters determined automatically.

8. Run Final KMeans
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10).fit(Xs)
df['kmeans_cluster'] = kmeans.labels_


Explanation:

Fit KMeans with the chosen number of clusters.

labels_ → Cluster assignments for each row.

9. Run Agglomerative (Hierarchical) Clustering
agg = AgglomerativeClustering(n_clusters=optimal_k).fit(Xs)
df['hier_cluster'] = agg.labels_


Explanation:

Fit hierarchical clustering with the same number of clusters.

Assign cluster labels to the dataset.
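
Cluster ids are arbitrary, so KMeans cluster 0 need not correspond to hierarchical cluster 0. One way to compare the two labelings is a cross-tabulation plus the adjusted Rand index, sketched below with made-up label vectors standing in for df['kmeans_cluster'] and df['hier_cluster'].

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Made-up labels for illustration; not the labels from the sales data.
kmeans_labels = np.array([0, 0, 1, 1, 2, 2, 2])
hier_labels = np.array([2, 2, 0, 0, 1, 1, 1])    # same grouping, different ids
other_labels = np.array([0, 1, 0, 1, 0, 1, 0])   # a different grouping

# Rows: KMeans ids, columns: hierarchical ids, cells: number of shared rows.
table = pd.crosstab(kmeans_labels, hier_labels, rownames=["kmeans"], colnames=["hier"])
print(table)

# The adjusted Rand index ignores label ids and scores only the grouping.
ari_same = adjusted_rand_score(kmeans_labels, hier_labels)
ari_other = adjusted_rand_score(kmeans_labels, other_labels)
print(ari_same, ari_other)
```

A score near 1 suggests the two methods found essentially the same segments; a score near 0 suggests agreement no better than chance.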

10. Print Cluster Counts
print("\nKMeans cluster counts:\n", df['kmeans_cluster'].value_counts().sort_index().to_dict())
print("\nHierarchical cluster counts:\n", df['hier_cluster'].value_counts().sort_index().to_dict())


Explanation:

Count how many data points are in each cluster for KMeans and hierarchical clustering.

11. Show KMeans Cluster Centers (Original Scale)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=X.columns).round(3)
print(centers_df)


Explanation:

Cluster centers in scaled data → convert back to original units using inverse_transform.

Creates a DataFrame of cluster centers for interpretation (average values per cluster).
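
The conversion back to original units is a per-feature affine map. The sketch below, on made-up data with a hypothetical cluster assignment, shows that inverse_transform applied to scaled cluster means gives the cluster means in original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 rows, 2 features.
X_demo = np.array([[10.0, 100.0],
                   [20.0, 300.0],
                   [30.0, 500.0],
                   [40.0, 700.0]])
labels = np.array([0, 0, 1, 1])   # hypothetical cluster assignment

scaler_demo = StandardScaler().fit(X_demo)
Xs_demo = scaler_demo.transform(X_demo)

# Per-cluster means in scaled space, playing the role of kmeans.cluster_centers_.
centers_scaled = np.vstack([Xs_demo[labels == c].mean(axis=0) for c in (0, 1)])

# inverse_transform computes center * scale_ + mean_ for each feature.
centers_orig = scaler_demo.inverse_transform(centers_scaled)
manual = centers_scaled * scaler_demo.scale_ + scaler_demo.mean_

print(centers_orig)
```

Because the map is linear per feature, each recovered center can be read directly as the average feature values of that cluster.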

12. Optional: Save Clustered Data
# df.to_csv("sales_clusters_labeled.csv", index=False)


Explanation:

Uncomment this line to save the dataset with its cluster labels to a CSV for later use.