# IS4487 Week 13 - Practice Code

This notebook accompanies the **Week 13 Reading** on unsupervised learning. It provides hands-on Python code to follow along with the clustering techniques covered in the reading:

- **K-Means Clustering** – For customer segmentation based on spending and visit frequency  
- **Gaussian Mixture Models (GMM)** – A probabilistic approach for soft clustering  
- **Evaluation Metrics** – To assess the performance and quality of clustering models  
- **Text Clustering** – K-Means applied to TF-IDF vectors of customer reviews

Each section includes a clear description of what the code does and its business context.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Reading-PracticeScripts/week13_practice_kmeans_gmm_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### 🛍️ K-Means Clustering: Customer Segmentation

In this example, we simulate a shopping mall dataset with 200 customers. Each customer has two features:
- **Annual Spending ($)**
- **Monthly Visit Frequency**

The goal is to use K-Means clustering to group these customers into 3 segments based on their behavior.
This helps marketers target different groups with personalized strategies.
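
On real data, spending (in dollars) and visits (a small count) sit on very different scales, and K-Means relies on Euclidean distance, so the larger-valued feature tends to dominate. The sketch below is not part of the original notebook: it uses hypothetical shoppers (`df_demo`, with made-up means and spreads) to show one common option, standardizing features with `StandardScaler` before clustering.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical shoppers (assumed values, for illustration only)
rng = np.random.default_rng(42)
df_demo = pd.DataFrame({
    'Spending': rng.normal(loc=1200, scale=100, size=200),
    'Visits': np.clip(rng.normal(loc=4, scale=1.5, size=200), 0, None),
})

# Standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_demo[['Spending', 'Visits']])

# Cluster in the scaled space
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42)
df_demo['Cluster'] = kmeans_scaled.fit_predict(X_scaled)

# Convert centroids back to dollars and visits for interpretation
centroids_original = scaler.inverse_transform(kmeans_scaled.cluster_centers_)
print(pd.DataFrame(centroids_original, columns=['Spending', 'Visits']).round(2))
```

Whether scaling is appropriate depends on the business question; if dollar differences really should count for more than visit differences, leaving the features unscaled can be a deliberate choice.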


In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

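# Generate synthetic data: 200 shoppers with annual spending and monthly visits.
# Note: cluster_std=70 applies to both features, so the simulated Visits values
# are spread far wider than real visit counts (and can be negative); this is toy data.
X, y = make_blobs(n_samples=200, centers=[(1100, 5), (1300, 2), (1200, 4)],
                  cluster_std=70, random_state=42)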
df_large = pd.DataFrame(X, columns=['Spending', 'Visits'])

# Apply K-Means clustering with 3 clusters
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans_large = KMeans(n_clusters=3, n_init=10, random_state=42)
df_large['Cluster'] = kmeans_large.fit_predict(df_large[['Spending', 'Visits']])

# Plot the clusters and centroids
plt.figure(figsize=(10, 7))
sns.scatterplot(data=df_large, x='Spending', y='Visits', hue='Cluster', palette='viridis', s=60)
centroids_large = kmeans_large.cluster_centers_
plt.scatter(centroids_large[:, 0], centroids_large[:, 1], s=250, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering of 200 Shoppers')
plt.xlabel('Annual Spending ($)')
plt.ylabel('Visits per Month')
plt.legend()
plt.grid(True)
plt.show()
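
Once segments exist, a typical marketing use is to place a new customer into one of them. The following is a minimal sketch, assuming the same synthetic data as above; the two `new_shoppers` rows are hypothetical values chosen only for illustration. `predict` returns the nearest centroid's label, and `transform` returns the distance to every centroid.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Same synthetic shoppers as in the cell above
X, _ = make_blobs(n_samples=200, centers=[(1100, 5), (1300, 2), (1200, 4)],
                  cluster_std=70, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Hypothetical new shoppers: [annual spending, monthly visits]
new_shoppers = np.array([[1050.0, 6.0], [1320.0, 1.0]])

segments = kmeans.predict(new_shoppers)     # nearest-centroid label
distances = kmeans.transform(new_shoppers)  # distance to each centroid

for shopper, segment, dist in zip(new_shoppers, segments, distances):
    print(f"Shopper {shopper} -> segment {segment}, distances {dist.round(1)}")
```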


### 🎯 Gaussian Mixture Models (GMM): Probabilistic Clustering

This example applies GMM to the same shopper dataset. Unlike K-Means, GMM:
- Allows **soft assignments** (each customer gets a probability of belonging to each cluster, rather than a single hard label)
- Handles **elliptical clusters** of varying shapes and sizes (with `covariance_type='full'`, each component has its own covariance matrix)

This is useful when customer behaviors overlap and aren't clearly separable.


In [None]:
from sklearn.mixture import GaussianMixture

# Apply Gaussian Mixture Model with 3 components
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
df_large['GMM_Cluster'] = gmm.fit_predict(df_large[['Spending', 'Visits']])

# Plot the GMM clusters and their means
plt.figure(figsize=(10, 7))
sns.scatterplot(data=df_large, x='Spending', y='Visits', hue='GMM_Cluster', palette='viridis', s=60)
gmm_means = gmm.means_
plt.scatter(gmm_means[:, 0], gmm_means[:, 1], s=250, c='red', marker='X', label='GMM Centroids')
plt.title('Gaussian Mixture Model Clustering of 200 Shoppers')
plt.xlabel('Annual Spending ($)')
plt.ylabel('Visits per Month')
plt.legend()
plt.grid(True)
plt.show()
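
The plot above shows only each customer's most likely cluster. To see the soft assignments themselves, `predict_proba` returns one probability per component for every customer. This is a minimal sketch on the same synthetic data; the 0.6 cutoff for flagging "ambiguous" customers is an arbitrary assumption, not a recommended threshold.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Same synthetic shoppers as in the cell above
X, _ = make_blobs(n_samples=200, centers=[(1100, 5), (1300, 2), (1200, 4)],
                  cluster_std=70, random_state=42)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42).fit(X)

# One row per customer, one column per component; each row sums to 1
probs = gmm.predict_proba(X)
hard_labels = gmm.predict(X)  # the most probable component for each customer

prob_df = pd.DataFrame(probs, columns=[f'P(cluster {i})' for i in range(3)])
print(prob_df.head().round(3))

# Customers with no clearly dominant cluster (assumed cutoff of 0.6)
n_ambiguous = int((probs.max(axis=1) < 0.6).sum())
print(f"Customers with top probability below 0.6: {n_ambiguous}")
```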


### 📊 Clustering Evaluation Metrics

To evaluate clustering quality, we use:
- **Inertia (WCSS)** – Sum of squared distances from each point to its assigned centroid (lower means tighter clusters; K-Means only)
- **Silhouette Score** – Measures cohesion and separation on a scale from -1 to 1 (higher is better)
- **Davies-Bouldin Index** – Average similarity between each cluster and its most similar cluster (lower is better; the minimum is 0)

These help us understand how well our clusters are defined.
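
To make the inertia definition concrete, the sketch below recomputes it by hand as the sum of squared distances from each point to its assigned centroid, then compares the result with scikit-learn's `inertia_` attribute. It assumes the same synthetic data used throughout this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=[(1100, 5), (1300, 2), (1200, 4)],
                  cluster_std=70, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(f"Manual inertia:       {manual_inertia:,.2f}")
print(f"scikit-learn inertia: {kmeans.inertia_:,.2f}")
```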


In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
# Re-import everything so this cell can run on its own
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import silhouette_score, davies_bouldin_score


# Generate a dataset of 200 shoppers with Spending and Visits
X, y = make_blobs(n_samples=200, centers=[(1100, 5), (1300, 2), (1200, 4)], 
                  cluster_std=70, random_state=42)
df_large = pd.DataFrame(X, columns=['Spending', 'Visits'])

# Apply K-Means clustering
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans_large = KMeans(n_clusters=3, n_init=10, random_state=42)
df_large['Cluster'] = kmeans_large.fit_predict(df_large[['Spending', 'Visits']])

# Apply Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
df_large['GMM_Cluster'] = gmm.fit_predict(df_large[['Spending', 'Visits']])

# Evaluate K-Means Clustering
inertia = kmeans_large.inertia_
silhouette = silhouette_score(df_large[['Spending', 'Visits']], df_large['Cluster'])
db_index = davies_bouldin_score(df_large[['Spending', 'Visits']], df_large['Cluster'])

# Evaluate GMM Clustering (no inertia)
silhouette_gmm = silhouette_score(df_large[['Spending', 'Visits']], df_large['GMM_Cluster'])
db_index_gmm = davies_bouldin_score(df_large[['Spending', 'Visits']], df_large['GMM_Cluster'])

# Display results
print("K-Means Evaluation:")
print(f"Inertia (WCSS): {inertia}")
print(f"Silhouette Score: {silhouette}")
print(f"Davies-Bouldin Index: {db_index}")

print("\nGMM Evaluation:")
print(f"Silhouette Score (GMM): {silhouette_gmm}")
print(f"Davies-Bouldin Index (GMM): {db_index_gmm}")
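
These metrics are often used to compare candidate values of k rather than to judge a single model. The following sketch, which is an illustration and not part of the original notebook, fits K-Means for k = 2 to 6 on the same synthetic data and reports the Silhouette Score and Davies-Bouldin Index for each. The range of k values is an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=200, centers=[(1100, 5), (1300, 2), (1200, 4)],
                  cluster_std=70, random_state=42)

results = []
for k in range(2, 7):  # both metrics need at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    results.append({
        'k': k,
        'silhouette': silhouette_score(X, labels),
        'davies_bouldin': davies_bouldin_score(X, labels),
    })

for row in results:
    print(f"k={row['k']}: silhouette={row['silhouette']:.3f}, "
          f"davies_bouldin={row['davies_bouldin']:.3f}")

# k with the highest silhouette score among the candidates tried
best_k = max(results, key=lambda r: r['silhouette'])['k']
print(f"Highest silhouette at k={best_k}")
```

The two metrics do not always agree, so the result is best read as one input alongside business judgment about how many segments are actionable.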


### 📊 Text Clustering: K-Means

To group customer reviews by topic, we will:
- Vectorize the review text with TF-IDF
- Use K-Means to cluster the reviews
- Create a word cloud for each cluster

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sample customer review dataset
reviews = [
    "The product quality is excellent and delivery was fast.",
    "Customer service was friendly and helpful.",
    "Terrible experience. The item broke within a week.",
    "Affordable and good value for the price.",
    "The support team resolved my issue quickly.",
    "Product is overpriced and not worth it.",
    "Very satisfied with the customer support.",
    "Quick shipping but the packaging was poor.",
    "The item is decent but expected better quality.",
    "Amazing customer experience, highly recommend.",
    "Poor quality materials used in the product.",
    "Fast delivery and easy to order."
]

# Step 1: Convert text to TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(reviews)

# Step 2: Apply K-Means clustering
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Step 3: Create a DataFrame to display results
df_reviews = pd.DataFrame({'Review': reviews, 'Cluster': labels})

# Step 4: Generate WordClouds for each cluster
for cluster in range(3):
    text = " ".join(df_reviews[df_reviews['Cluster'] == cluster]['Review'])
    wordcloud = WordCloud(width=600, height=400, background_color='white').generate(text)
    plt.figure(figsize=(6, 4))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Cluster {cluster}: Word Cloud')
    plt.show()

df_reviews
