# IS 4487 Assignment 13: Customer Segmentation with K-Means and GMM

In this assignment, you will:
- Segment customers using real bank data (credit, behavior, and usage)
- Apply K-Means and Gaussian Mixture Model (GMM) clustering
- Evaluate clusters using silhouette and Davies-Bouldin scores
- Recommend strategies for retention, loyalty, or upselling

## Why This Matters

Banks and financial institutions serve thousands of customers with varying behaviors and risk profiles. Clustering allows analysts to group similar customers for tailored marketing, credit offerings, and service strategies. This hands-on assignment prepares you to identify patterns in customer usage and translate them into actionable business insights.

<a href="https://colab.research.google.com/github/leemr0903/IS4487-Datasets/blob/main/assignment_13_kmeans_gmm_evaluation.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dataset Description: Bank Credit Card Customer Data

From: [Kaggle - Credit Card Customers Dataset](https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers)  
Access File From Class GitHub:  
📂 https://github.com/leemr0903/IS4487-Datasets/blob/main/BankChurners.csv

---

This dataset contains information on over **10,000 credit card customers**, including demographics, account activity, and spending behavior. It was originally used for churn prediction, but in this assignment, you’ll cluster customers **without labels** to uncover natural segments.

### Key Numeric Columns:
- `Customer_Age`: Age of the account holder
- `Credit_Limit`: Customer’s assigned credit limit
- `Total_Trans_Amt`: Total transaction amount in the past year
- `Total_Trans_Ct`: Total number of transactions in the past year
- `Avg_Utilization_Ratio`: Utilization of credit (0 to 1)
- `Avg_Open_To_Buy`: Available amount on the credit line
- `Total_Amt_Chng_Q4_Q1`: Change in transaction amount, Q4 relative to Q1

---

| Column Name               | Type     | Description |
|---------------------------|----------|-------------|
| `CLIENTNUM`               | int64    | Unique customer ID *(drop this column)* |
| `Attrition_Flag`          | object   | Whether the customer churned *(not used in clustering)* |
| `Customer_Age`            | int64    | Age of the customer |
| `Gender`                  | object   | M or F |
| `Dependent_count`         | int64    | Number of dependents |
| `Education_Level`         | object   | Education category |
| `Marital_Status`          | object   | Marital status |
| `Income_Category`         | object   | Income bracket |
| `Card_Category`           | object   | Card type (e.g., Blue, Gold) |
| `Months_on_book`          | int64    | How long they've had the account |
| `Total_Relationship_Count`| int64    | Number of products used |
| `Months_Inactive_12_mon`  | int64    | Months inactive in last year |
| `Contacts_Count_12_mon`   | int64    | Customer service interactions |
| `Credit_Limit`            | float64  | Assigned credit limit |
| `Total_Revolving_Bal`     | int64    | Balance that rolls over month to month |
| `Avg_Open_To_Buy`         | float64  | Available credit left to spend |
| `Total_Amt_Chng_Q4_Q1`    | float64  | Change in transaction volume (Q4 vs Q1) |
| `Total_Trans_Amt`         | int64    | Total amount spent |
| `Total_Trans_Ct`          | int64    | Number of transactions |
| `Total_Ct_Chng_Q4_Q1`     | float64  | Change in transaction count |
| `Avg_Utilization_Ratio`   | float64  | Credit usage rate |
| `Naive_Bayes_..._1`       | float64  | Auto-generated churn score *(drop this)* |
| `Naive_Bayes_..._2`       | float64  | Auto-generated churn score *(drop this)* |

You’ll focus your clustering on **numeric features** related to spending, credit usage, and engagement. Drop any ID or model output columns before starting.

---


## 1. Load and Explore the Data

Business framing:  
Always start by understanding your data. Let’s load the dataset and take a quick look at the structure and variables.

Follow these steps:
- Load the dataset directly from GitHub
  - `url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/main/DataSets/BankChurners.csv"`
- Display `.info()`, `.describe()`, and `.head()`
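
As a rough illustration of the three inspection calls, here is a minimal sketch. It reads a tiny inline CSV (made-up rows, not real data) so it runs without a network connection; with the actual dataset you would pass the `url` string to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Tiny inline stand-in for BankChurners.csv (hypothetical rows, not real data).
csv_text = """CLIENTNUM,Customer_Age,Gender,Credit_Limit,Total_Trans_Amt
1001,45,M,12691.0,1144
1002,49,F,8256.0,1291
1003,51,M,3418.0,1887
"""

# With the real file, pass the URL string to pd.read_csv instead of the buffer.
df = pd.read_csv(io.StringIO(csv_text))

df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # summary statistics for the numeric columns
print(df.head())      # first few rows
```

`.info()` is the quickest way to tell numeric columns from categorical ones, which helps with the first question below.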

### In your markdown:
1. What types of features are included? (numeric, categorical)
2. What might you be able to learn by clustering these customers?


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

## 2. Drop Unnecessary Columns and Choose Features

Business framing:  
Not every column adds value. Drop ID columns and model outputs that don’t help with unsupervised clustering.

Tasks:
- Drop: `CLIENTNUM`, and all columns starting with `Naive_Bayes_`
- Choose 5–7 numeric features that describe usage or spending behavior
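
One possible way to handle the drop step is sketched below on a miniature frame. The two `Naive_Bayes_example_*` column names are placeholders (the real names are much longer), and the three-item feature list is only an example, not a recommended selection.

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as the dataset.
df = pd.DataFrame({
    "CLIENTNUM": [1001, 1002, 1003],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
    "Total_Trans_Amt": [1144, 1291, 1887],
    "Total_Trans_Ct": [42, 33, 20],
    "Naive_Bayes_example_1": [0.1, 0.2, 0.3],  # placeholder name
    "Naive_Bayes_example_2": [0.9, 0.8, 0.7],  # placeholder name
})

# Collect the ID column plus every column whose name starts with "Naive_Bayes_".
drop_cols = ["CLIENTNUM"] + [c for c in df.columns if c.startswith("Naive_Bayes_")]
df = df.drop(columns=drop_cols)

# Example feature list only; choose your own and justify it in your response.
features = ["Credit_Limit", "Total_Trans_Amt", "Total_Trans_Ct"]
X = df[features]
print(X.columns.tolist())
```

Matching on the prefix avoids typing the long auto-generated column names by hand.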

### In your markdown:
1. Which features did you choose and why?


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.


## 3. Scale Your Features

Business framing:  
Clustering depends on distance, so features measured in large units (dollars) can dominate features measured in small ones (ratios between 0 and 1). Scaling puts every feature on a comparable footing.

Tasks:
- Use `StandardScaler` to standardize your selected features
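
To see what standardizing does, here is a small sketch on made-up numbers: one column in dollars and one as a 0-to-1 ratio. The array values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 is in dollars, column 1 is a 0-to-1 ratio.
X = np.array([
    [12000.0, 0.05],
    [3000.0, 0.60],
    [8000.0, 0.30],
    [1500.0, 0.90],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

After scaling, a one-unit difference means "one standard deviation" in every column, so neither feature dominates the distance calculation.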




In [None]:
# Add code here 🔧

## 4. Determine the Optimal Number of Clusters (k)

Business framing:  
We don’t want to guess how many clusters there are. The elbow method shows where adding more clusters stops producing a meaningful drop in inertia.

Tasks:
- Use the **Elbow Method** to find the best `k` between 2 and 10
- Plot inertia (SSE) vs. k
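
The loop behind an elbow plot might look like the sketch below. It uses synthetic blobs from `make_blobs` as a stand-in for your scaled features, and the `n_init` and `random_state` values are arbitrary choices, not requirements.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled feature matrix (not the bank data).
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

ks = list(range(2, 11))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia (SSE)")
plt.title("Elbow Method")
plt.show()
```

Inertia tends to keep falling as `k` grows, so look for the point where the curve bends and the drops become small, not for the minimum.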

### In your markdown:
1. What value of `k` did you choose and why?


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.


## 5. Apply K-Means Clustering

Business framing:  
K-Means helps us assign each customer to the closest cluster center. Then we can analyze the common behaviors in each group.

Tasks:
- Fit a KMeans model with your selected `k`
- Assign cluster labels to the dataframe
- Evaluate with **Silhouette Score** and **Davies-Bouldin Index**
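
A minimal sketch of these three tasks follows, again on synthetic data. The column names `feature_a` and `feature_b` and the value `k = 3` are placeholders; substitute your own scaled features and the `k` you chose.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in data; column names are placeholders.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])

k = 3  # assumed here; use the k you chose from the elbow plot
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
df["kmeans_cluster"] = kmeans.fit_predict(X)

sil = silhouette_score(X, df["kmeans_cluster"])      # closer to 1 is better
dbi = davies_bouldin_score(X, df["kmeans_cluster"])  # closer to 0 is better
print(f"Silhouette: {sil:.3f}  Davies-Bouldin: {dbi:.3f}")

# Average feature values per cluster, a starting point for describing each group.
profile = df.groupby("kmeans_cluster")[["feature_a", "feature_b"]].mean()
print(profile)
```

If you fit on scaled features, consider computing the per-cluster averages on the original (unscaled) columns so the numbers read in dollars and counts.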

### In your markdown:
1. How well did the clusters perform?
2. What are the average behaviors for each group?


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

## 6. Apply Gaussian Mixture Model (GMM)

Business framing:  
GMM allows for softer boundaries: instead of a single hard label, each customer gets a probability of belonging to each cluster. This makes it more flexible than K-Means.

Tasks:
- Fit a `GaussianMixture` model with the same `k`
- Assign GMM cluster labels to the dataframe
- Evaluate GMM using silhouette and Davies-Bouldin scores
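
The GMM version can be sketched the same way. The data is synthetic and `k = 3` is a placeholder; `predict_proba` is included to show the soft memberships that K-Means does not provide.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled feature matrix (not the bank data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

k = 3  # assumed here; reuse the k from your K-Means model
gmm = GaussianMixture(n_components=k, random_state=42)
gmm_labels = gmm.fit_predict(X)  # hard label: the most likely component
probs = gmm.predict_proba(X)     # soft membership: one probability per component

gmm_sil = silhouette_score(X, gmm_labels)
gmm_dbi = davies_bouldin_score(X, gmm_labels)
print(f"Silhouette: {gmm_sil:.3f}  Davies-Bouldin: {gmm_dbi:.3f}")
print(np.round(probs[:5], 3))    # membership probabilities for the first 5 rows
```

Rows whose highest probability is well below 1 sit between segments, which may be worth noting when you compare the two methods.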

### In your markdown:
1. How does GMM compare to K-Means?
2. Which method produced more useful clusters?


In [None]:
# Add code here 🔧

### ✍️ Your Response: 🔧
1.

2.

## Final Reflection: Turning Clusters into Strategy

Clustering is only useful if it leads to better decisions. Now that you've grouped customers based on their credit behavior and engagement, your job is to translate those segments into insights that a marketing or product team could act on.

Reflect on what your clustering revealed:
- What patterns or behaviors stand out in each group?
- Which groups show signs of disengagement or risk?
- Who are your most valuable or loyal customers?

In your markdown, answer the following:
1. Briefly describe the behavioral traits of each cluster in plain language  
   (e.g., “High Spenders with Low Credit Utilization” or “Low Activity, High Limit”)
2. Assign each cluster a **persona label** (e.g., “Loyal High Rollers”, “At-Risk Minimalists”)
3. Recommend at least one **business action** per group  
   (e.g., retention offers, an increased credit line, or outreach campaigns)
4. How does this relate to the customized learning outcome you created in Canvas?  


### ✍️ Your Response: 🔧
1.

2.

3.

4.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [None]:
!jupyter nbconvert --to html "assignment_13_LastnameFirstname.ipynb"