> ### Note on Labs and Assignments:
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> These sections are graded and are not optional.
>

# IS 4487 Lab 13: Customer Segmentation with K-Means, GMM, and Text Clustering

## Outline

- Apply K-Means clustering on demographic and geographic data  
- Use Gaussian Mixture Models (GMM) for soft clustering  
- Perform text clustering with TF-IDF and K-Means  
- Explore both structured and unstructured customer data  
- Evaluate the quality and meaning of clusters  

In this lab, you will explore **unsupervised learning** techniques to uncover natural groupings in customer data using both numerical and text-based features.

We’ll practice segmentation using the **K-Means algorithm**, **Gaussian Mixture Models**, and **TF-IDF-based text clustering** on customer feedback.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab_13_kmeans_gmm_evaluation.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


### Dataset 1 Description: Online Food Orders

The dataset comes from an online food ordering platform and includes structured demographic, geographic, and behavioral data, along with feedback sentiment.

The following table outlines the main attributes:

| Column                        | Data Type       | Description                                                  |
|------------------------------|------------------|--------------------------------------------------------------|
| `Age`                        | Integer           | Customer's age                                               |
| `Gender`                     | Categorical       | Gender of the customer                                       |
| `Marital Status`             | Categorical       | Marital status (e.g., Single, Married)                       |
| `Occupation`                 | Categorical       | Job category (e.g., Student, Employee, Self-employed)        |
| `Monthly Income`             | Categorical       | Income group (e.g., No Income, Lower, Middle, Upper)         |
| `Educational Qualifications` | Categorical       | Education level (e.g., High School, Graduate)                |
| `Family size`                | Integer           | Number of individuals in the household                       |
| `latitude`                   | Float             | Latitude of customer location                                |
| `longitude`                  | Float             | Longitude of customer location                               |
| `Output`                     | Categorical       | Order status (e.g., Confirmed, Delivered, Pending)           |
| `Feedback`                   | Categorical       | Sentiment (Positive, Negative, Neutral)                      |
| `Pin code`                   | Integer           | Postal code of customer address (not used in this lab)       |

### Dataset 2 Description: Restaurant Reviews

This dataset contains short text reviews from customers about their restaurant experiences. It is used for unsupervised text clustering with K-Means and TF-IDF.

| Column   | Data Type | Description                                                  |
|----------|-----------|--------------------------------------------------------------|
| Review   | Text      | Review written by the customer about their food experience   |
| Liked    | Integer   | Indicates whether the customer liked the food (1 = Yes, 0 = No) |

- Only the `Review` column is used in Part 4 for clustering.
- The `Liked` column is **not used**, as the focus is unsupervised learning.

**Source**: [Restaurant Reviews Dataset – Kaggle](https://www.kaggle.com/datasets/d4rklucif3r/restaurant-reviews?select=Restaurant_Reviews.tsv)

**Note**: In Dataset 1, categorical columns are encoded numerically for modeling (Part 1). In Dataset 2, the review text is vectorized using TF-IDF in Part 4.
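
As a small illustration of what numeric encoding does, the sketch below applies `LabelEncoder` to a made-up income column and prints the category-to-code mapping. The values and the names `toy`, `encoder`, and `mapping` are invented for this example and are not read from the lab data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only; these values are not read from the lab dataset
toy = pd.DataFrame({"Monthly Income": ["No Income", "Upper", "Lower", "Middle", "Lower"]})

encoder = LabelEncoder()
toy["Income code"] = encoder.fit_transform(toy["Monthly Income"])

# classes_ holds the original labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

The codes follow sorted label order, so `No Income` lands between `Middle` and `Upper`. Keep this in mind when interpreting distances or averages computed on encoded columns.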


## Part 1: Load and Prepare the Data

What you are going to do:
- Load the dataset
- Encode categorical variables
- Standardize numeric features for clustering

Why this matters:
Clustering models are sensitive to scale and format. Encoding and standardization ensure fair distance calculations.

Things to notice:
- Which variables need encoding?
- How does the data look before and after scaling?
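
To see why scale matters for distance-based clustering, here is a minimal sketch on three invented customers (age and income). The array `X_toy` and the helper `dist` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy customers: [age in years, income in dollars]; values invented for illustration
X_toy = np.array([
    [25, 30000.0],
    [55, 31000.0],
    [26, 60000.0],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Before scaling, income dominates: the 30-year age gap barely registers
print("raw    d(0,1):", dist(X_toy[0], X_toy[1]))
print("raw    d(0,2):", dist(X_toy[0], X_toy[2]))

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, each column has mean 0 and unit variance, so both features count
print("scaled d(0,1):", dist(X_toy_scaled[0], X_toy_scaled[1]))
print("scaled d(0,2):", dist(X_toy_scaled[0], X_toy_scaled[2]))
```

On the raw values, customer 0 looks far closer to customer 1 than to customer 2 simply because income is measured in larger units. After standardization, the age gap and the income gap carry comparable weight.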


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

url = 'https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/87a54cf1ee5fefcb5d58db405f58f3628810ae51/DataSets/onlinefoods.csv'
df = pd.read_csv(url)

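# View the first 5 rows, then a summary of columns and data types.
# print() is needed because only the last expression in a cell is displayed automatically.
print(df.head())
df.info()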

In [None]:
# Drop unused or irrelevant columns
df.drop(columns=['Unnamed: 12', 'Pin code'], inplace=True)

# Encode categorical columns
categorical_cols = ['Gender', 'Marital Status', 'Occupation',
                    'Monthly Income', 'Educational Qualifications',
                    'Output', 'Feedback']

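# LabelEncoder assigns integer codes in sorted (alphabetical) label order.
# The same encoder is refit for each column, so only the last
# column's mapping remains in `le` afterward.
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])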

# Select features for clustering
features = ['Age', 'Gender', 'Marital Status', 'Occupation',
            'Monthly Income', 'Educational Qualifications',
            'Family size', 'latitude', 'longitude']

X = df[features]

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


## Part 2: K-Means Clustering

What you are going to do:
- Apply K-Means with different numbers of clusters
- Use the elbow method to select the best value of K
- Visualize clusters using PCA

Why this matters:
K-Means groups customers by feature similarity. Comparing several values of K helps you avoid splitting real groups into too many clusters or merging distinct groups into too few.
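
The quantity plotted in the elbow method, inertia, is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes it by hand on synthetic data. It assumes `make_blobs` as a stand-in for the lab data, and the names `X_demo`, `km_demo`, and `inertia_by_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups (illustration only, not the lab data)
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from every point to the centroid of its assigned cluster
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = float(((X_demo - assigned_centroids) ** 2).sum())

print("inertia_ from scikit-learn:", km_demo.inertia_)
print("inertia computed by hand:  ", manual_inertia)

# Inertia tends to keep shrinking as k grows, so we look for an elbow, not a minimum
inertia_by_k = {}
for k in range(1, 7):
    inertia_by_k[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
    print(k, round(inertia_by_k[k], 1))
```

Because adding clusters almost always lowers inertia, the useful signal is the point where the drop levels off rather than the smallest value.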

Things to notice:
- Where is the "elbow" in the inertia plot?
- Are the clusters clearly separated?
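
When judging separation in a PCA plot, it helps to know how much of the data's variance the two plotted components retain. A minimal sketch, using random 9-feature data as an assumed stand-in for the scaled customer features (`X_rand` and `pca_demo` are hypothetical names):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random 9-feature data standing in for the scaled customer features (illustration only)
rng = np.random.default_rng(0)
X_rand = StandardScaler().fit_transform(rng.normal(size=(200, 9)))

pca_demo = PCA(n_components=2).fit(X_rand)

# Fraction of total variance captured by each of the two plotted components
ratios = pca_demo.explained_variance_ratio_
print("Variance explained per component:", ratios.round(3))
print("Total variance shown in the 2D plot:", round(float(ratios.sum()), 3))
```

The same attribute is available on any fitted `PCA` object. If the total is low, clusters that overlap in the 2D plot may still be separated in the full feature space.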


In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Elbow method
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, marker='o')
plt.title("Elbow Method for K")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.grid(True)
plt.show()


In [None]:
# Final KMeans model
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# PCA for 2D projection
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='Set2', s=50)
plt.title("K-Means Clusters (PCA Projection)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()


### 🔧 Try It Yourself - Part 2

1. Try different values for `n_clusters` (e.g., 3 or 5) and rerun the PCA visualization.
2. Analyze the cluster characteristics:

```
# Monthly Income was label-encoded in Part 1, so its mean is an average of category codes
df['Cluster'] = kmeans_labels
df.groupby('Cluster')[['Age', 'Monthly Income', 'Family size']].mean()
```

3. What types of customers are grouped together? Are the clusters interpretable?

In [None]:
# 🔧 ADD CODE HERE

🔧 Add comments here

## Part 3: Gaussian Mixture Model (GMM)

What you are going to do:
- Fit a GMM using the same features
- Compare it with K-Means using silhouette score

Why this matters:

GMM supports soft clustering and overlapping groups, which can reveal nuanced patterns.
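
To make "soft clustering" concrete, the sketch below fits a GMM to two overlapping synthetic groups and compares hard labels with membership probabilities. The data and the names `X_overlap`, `gmm_demo`, `hard`, and `soft` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping 1-D groups (synthetic, for illustration only)
rng = np.random.default_rng(42)
X_overlap = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=200),
    rng.normal(loc=3.0, scale=1.0, size=200),
]).reshape(-1, 1)

gmm_demo = GaussianMixture(n_components=2, random_state=42).fit(X_overlap)

# Hard labels: one cluster per point, as K-Means would give
hard = gmm_demo.predict(X_overlap)

# Soft labels: one probability per component, each row sums to 1
soft = gmm_demo.predict_proba(X_overlap)

# A point between the two groups should receive a split membership
midpoint = np.array([[1.5]])
print("P(component | x=1.5):", gmm_demo.predict_proba(midpoint).round(2))
print("Share of points with max probability below 0.9:",
      float((soft.max(axis=1) < 0.9).mean()))
```

The hard label is simply the component with the highest probability, so `predict_proba` carries strictly more information than `predict`.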

Things to notice:
- How does the cluster structure differ from K-Means?
- Which model performs better on internal metrics?
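
The silhouette score used here as an internal metric compares, for each point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The following sketch computes it for several values of k on synthetic data. It assumes `make_blobs` as a stand-in dataset, and `X_blobs`, `scores`, and `per_point` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with 4 groups (illustration only, not the lab data)
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=1)

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_blobs)
    # Mean silhouette over all points; ranges from -1 (poor) to 1 (well separated)
    scores[k] = silhouette_score(X_blobs, labels_k)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("k with the highest silhouette:", best_k)

# Per-point values reveal which individual points sit near a cluster boundary
labels_best = KMeans(n_clusters=best_k, n_init=10, random_state=1).fit_predict(X_blobs)
per_point = silhouette_samples(X_blobs, labels_best)
print("Points with negative silhouette:", int((per_point < 0).sum()))
```

A higher mean silhouette suggests tighter, better-separated clusters, but it is only one signal. Interpretability of the groups still matters.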

In [None]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# GMM clustering
gmm = GaussianMixture(n_components=4, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)

# Evaluate performance
print("Silhouette Score (KMeans):", silhouette_score(X_scaled, kmeans_labels))
print("Silhouette Score (GMM):", silhouette_score(X_scaled, gmm_labels))


In [None]:
# GMM cluster visualization
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=gmm_labels, cmap='Set1', s=50)
plt.title("GMM Clusters (PCA Projection)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()

### 🔧 Try It Yourself - Part 3

1. Try a different number of components (e.g., 3 or 5) in the GMM and rerun the model.
2. Group by GMM cluster and summarize:


```
df['GMM_Cluster'] = gmm_labels
df.groupby('GMM_Cluster')[['Age', 'Family size']].mean()
```


3. Which clustering approach gave more distinct groupings?

In [None]:
# 🔧 Add code here

🔧 Add comment here

## Part 4: K-Means Text Clustering on Restaurant Reviews

What you are going to do:
- Load the Restaurant Reviews dataset
- Extract and preprocess the review text
- Convert text to numeric form using TF-IDF
- Apply K-Means to find review clusters
- Interpret each cluster by identifying key terms

Why this matters:
Customer feedback contains rich information that is often underutilized. Text clustering lets us uncover themes (such as satisfaction, complaints, or food quality) that can guide business decisions.

Things to notice:
- Are some clusters clearly positive or negative?
- What recurring words or issues appear in the review groups?


In [None]:
# Load the restaurant reviews dataset
url = 'https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/87a54cf1ee5fefcb5d58db405f58f3628810ae51/DataSets/Restaurant_Reviews.tsv'
restaurant_reviews = pd.read_csv(url, sep='\t')


# Inspect the data
restaurant_reviews.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Extract text column
text_data = restaurant_reviews['Review'].astype(str)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9)
X_text = vectorizer.fit_transform(text_data)

# K-Means Clustering
kmeans_text = KMeans(n_clusters=3, random_state=42)
text_labels = kmeans_text.fit_predict(X_text)

# Evaluate clustering performance
print("Silhouette Score (Review Clustering):", silhouette_score(X_text, text_labels))



In [None]:
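# Top terms in each cluster
terms = vectorizer.get_feature_names_out()
# For each centroid, term indices sorted from highest to lowest TF-IDF weight
centroids = kmeans_text.cluster_centers_.argsort()[:, ::-1]

# Loop over however many clusters the model was fit with
for i in range(kmeans_text.n_clusters):
    print(f"\nTop words in Cluster {i}:")
    print(", ".join([terms[ind] for ind in centroids[i, :10]]))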


### 🔧 Try It Yourself - Part 4

1. Change the number of clusters to 4 or 5 and rerun the model.
2. Use the following code to generate a word cloud for one of the clusters:

```
from wordcloud import WordCloud
import matplotlib.pyplot as plt

cluster_num = 0  # Choose from 0 to N-1
cluster_text = restaurant_reviews[text_labels == cluster_num]['Review']
combined_text = " ".join(cluster_text)

wc = WordCloud(background_color='white', max_words=100).generate(combined_text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title(f"Word Cloud for Cluster {cluster_num}")
plt.show()
```

3. What topics or phrases stand out in the word cloud?
4. Could these insights inform marketing or support messaging?



In [None]:
# 🔧 add code here

🔧 Add comment here:

## 🔧 Part 5: Reflection (100 words or fewer)

In this lab, you applied clustering techniques to both structured customer data and unstructured restaurant reviews. You prepared each data type for analysis, applied appropriate clustering methods, and interpreted the results using visualizations and top features.

In Part 4, you explored how customer feedback can be grouped by theme or sentiment, without labeled outcomes, using TF-IDF and word clouds.

Use the cell below to answer the following questions:

1. What insights did the text clusters reveal about customer priorities or concerns?
2. How could a restaurant or food platform use these insights to improve service?


🔧 Add comment here:

## Export Your Notebook to Submit in Canvas
- Use the instructions from Lab 1

In [None]:
!jupyter nbconvert --to html "lab_13_LastnameFirstname.ipynb"