# 📗 P2.2.3.2 – Unsupervised Learning

## Topic: Customer Segmentation Example (Clustering)
---


**🧭 Where this fits:** This is a **hands-on example** of **unsupervised learning**, specifically **clustering**. You saw the idea in the Unsupervised Overview; here we build a **customer segmentation** with K-Means.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

- Say what **clustering** is and why we use it when we have **no labels**
- Follow the steps in a clustering pipeline (prepare features → vectorize → cluster → interpret)
- Build a simple customer segmentation with **K-Means** and interpret **cluster labels** and **centers**


## 📌 What is clustering?

**Clustering** means grouping similar things together **without** being given any labels: the model discovers the groups from the data.

- We have **only input (X)**, with no "correct group" for each point.
- The algorithm finds **similarity** (e.g. distance) and puts similar points in the same **cluster**.
- We then **interpret** the clusters (e.g. "this cluster buys more electronics", "this one is from Delhi").
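
To make "similarity (e.g. distance)" concrete, here is a minimal sketch that measures the Euclidean distance between customers encoded as 0/1 vectors. The column order and the hand-written vectors are assumptions for illustration only; they are not produced by the notebook's code.

```python
import numpy as np

# Hypothetical column order: [delhi, electronics, groceries, mumbai]
electronics_mumbai = np.array([0, 1, 0, 1])
electronics_delhi = np.array([1, 1, 0, 0])
groceries_delhi = np.array([1, 0, 1, 0])

# Euclidean distance: a smaller value means the two customers are more similar
d_shared_category = np.linalg.norm(electronics_mumbai - electronics_delhi)
d_nothing_shared = np.linalg.norm(electronics_mumbai - groceries_delhi)

print(f"Electronics Mumbai vs Electronics Delhi: {d_shared_category:.3f}")
print(f"Electronics Mumbai vs Groceries Delhi:   {d_nothing_shared:.3f}")
```

Customers that share a word (here, the category) end up closer together than customers that share nothing, and that difference is the signal a distance-based clustering algorithm works with.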

**Some real-world examples:**

| Problem | Input | What we get |
|--------|--------|-------------|
| Customer segmentation | Purchase history, location, behavior | Groups of similar customers |
| Document grouping | Text of documents | Thematic clusters |
| Image segmentation | Pixels or features | Regions or groups |
| Anomaly detection | Many features | "Normal" clusters + outliers |

In this notebook we focus on **one** of these: **customer segmentation**, that is, grouping customers by purchase category and city so we can tailor offers.

## 📝 Problem Statement

We want to **group customers** based on their purchase category (e.g. Electronics, Groceries, Clothing) and city (e.g. Mumbai, Delhi, Bangalore), with **no pre-defined labels**. The goal is to discover natural segments so we can personalize marketing and improve sales.

**Why is this important?**
- Target promotions to the right groups
- Improve customer experience
- Increase sales and loyalty

## 🔍 Why unsupervised learning?

We **do not have labels**: there are no predefined groups like "Electronics Lovers" or "Groceries Shoppers". We only have customer features (purchase category + city). The goal is to **discover** natural groupings in the data.

**Why not supervised?** Supervised learning needs labeled data (each customer already assigned to a segment). Here we don't have that; we want the model to find the segments.

## 🤖 Choosing the Model & Why

We use **K-Means Clustering** because:
- It is simple and widely used
- It groups customers based on similarity in their purchase category and city
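- It works directly on the numeric vectors we get after vectorizing the text features

**Why not other models?**
- Hierarchical clustering, DBSCAN, etc. can also be used; K-Means is a classic first choice for customer segmentation because it is simple and only needs the number of clusters up front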
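
As a rough point of comparison, the sketch below runs hierarchical clustering on the same kind of data using scikit-learn's `AgglomerativeClustering`. It is an illustration only, not part of this notebook's pipeline; the choice of 3 clusters and the default Ward linkage are assumptions made to mirror the K-Means example.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer

customers = [
    "Electronics Mumbai", "Groceries Delhi", "Clothing Mumbai",
    "Electronics Delhi", "Groceries Mumbai", "Clothing Delhi",
    "Electronics Bangalore", "Groceries Bangalore",
]
X = CountVectorizer().fit_transform(customers).toarray()

# Hierarchical (agglomerative) clustering with the default Ward linkage
hier = AgglomerativeClustering(n_clusters=3)
hier_labels = hier.fit_predict(X)

for features, cluster in zip(customers, hier_labels):
    print(f"  {features:25} -> {cluster}")
```

Unlike K-Means, this estimator exposes no cluster centers, so interpretation relies on the labels alone, and the groups it finds may differ from the K-Means result.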

## 🛠️ Example: Customer Segmentation Pipeline

This example shows the steps:
1. Prepare customer features (text: purchase category + city)
2. Vectorize text so the model can use it (text → numbers)
3. Apply K-Means clustering (no train/test split; we fit on all data)
4. Interpret clusters (labels per customer, cluster centers)

*We use a **tiny dataset** (8 customers) so the flow is easy to follow; in practice you would use thousands of customer records.*

**When you run the code below**, you'll see **cluster labels** (which customer belongs to which group) and **cluster centers**; we explain what they mean in the section right after the code.

```python
"""
Customer Segmentation using K-Means (Unsupervised)
--------------------------------------------------
Groups customers by purchase category + city with no labels.
Uses CountVectorizer (text → numbers) and K-Means.
"""

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Step 1: Customer features (purchase category + city) as text
customers = [
    "Electronics Mumbai",
    "Groceries Delhi",
    "Clothing Mumbai",
    "Electronics Delhi",
    "Groceries Mumbai",
    "Clothing Delhi",
    "Electronics Bangalore",
    "Groceries Bangalore",
]

# Step 2: Vectorize text so we can run K-Means (text → numbers)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(customers).toarray()  # dense array, fine for this tiny dataset

# Step 3: K-Means clustering (we choose 3 segments)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Step 4: Interpret: who is in which cluster?
print("CUSTOMER SEGMENTATION (K-Means)")
print("---------------------------------")
print("Customer features          -> Cluster")
print("-" * 40)
for features, cluster in zip(customers, labels):
    print(f"  {features:25} -> {cluster}")
print()
print("Cluster centers (average feature weights per segment):")
print(kmeans.cluster_centers_)
print("\nFeature names (words used):", list(vectorizer.get_feature_names_out()))
```

### 👀 At a glance: what the output means

**1. The table (Customer → Cluster)**  
Same **cluster number** = same **group**. So all "→ 0" are one segment, all "→ 1" another, all "→ 2" another. The numbers themselves are arbitrary IDs; they carry no ranking.

**2. Who is in which group?**

| Cluster | Who landed here | In simple words |
|---------|------------------|------------------|
| **0** | Electronics Mumbai, Clothing Mumbai, Groceries Mumbai, Clothing Delhi | Mostly **Mumbai** + one Delhi (Clothing) |
| **1** | Electronics Bangalore, Groceries Bangalore | **Bangalore** only |
| **2** | Groceries Delhi, Electronics Delhi | **Delhi** only (Electronics & Groceries) |

So by looking: **Cluster 0 = Mumbai (and similar)**, **Cluster 1 = Bangalore**, **Cluster 2 = Delhi**. The model grouped by **city** (and a bit by category).
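
A "who landed here" table like the one above can be built from the labels with a few lines of code. This is a minimal sketch that refits the model so it runs on its own; `n_init=10` is set explicitly here as an assumption, so the exact membership it prints may differ from the table depending on your scikit-learn version.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

customers = [
    "Electronics Mumbai", "Groceries Delhi", "Clothing Mumbai",
    "Electronics Delhi", "Groceries Mumbai", "Clothing Delhi",
    "Electronics Bangalore", "Groceries Bangalore",
]
X = CountVectorizer().fit_transform(customers).toarray()

# n_init=10 is an explicit choice for this sketch, not taken from the notebook
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Collect the customers that share each cluster label
segments = defaultdict(list)
for features, cluster in zip(customers, labels):
    segments[int(cluster)].append(features)

for cluster in sorted(segments):
    print(f"Cluster {cluster}: {', '.join(segments[cluster])}")
```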

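**3. The numbers (cluster centers)**  
Each center is the **average** of the feature vectors in that group. Because every feature here is 0 or 1, a value in the "mumbai" column is the share of that cluster's customers who are from Mumbai (1.0 would mean all of them, 0.5 half of them). You don't need to add them up; just think: *same group = similar city + category*.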

---
## 📝 Key Takeaways

- **Customer segmentation** = unsupervised **clustering**: no labels, only features (category + city) → vectorize → K-Means → interpret clusters.
- **Vectorization** (text → numbers) lets us use text with K-Means; no train/test split (we're not predicting a label, we're finding groups).
- We interpret using **cluster labels** (who is in which group) and **cluster centers** (what defines each segment).