# IS4487 Week 9 - Practice Code

This notebook is designed to help you follow along with the **Week 9 Lecture and Reading**, introducing you to segmentation through clustering.

The practice code demos give you a chance to see working code and can serve as a source for your lab and assignment work. Each section contains short explanations and annotated code that reflect the steps in the reading.

### Topics for this demo:
- Use K-Means for Segmentation
- Plot the results

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Demos/demo_09_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### Context: Customer Segmentation
We will use a simple dataset with the following retail shopper characteristics:

| Feature              | Description                              | Type    |
| -------------------- | ---------------------------------------- | ------- |
| `Visits`             | Number of visits in the last 2 years     | Numeric |
| `Spending`           | Total amount spent in the last 2 years   | Numeric |
| `Annual Income (k$)` | Annual income in thousands of dollars    | Numeric |
| `CustomerID`         | Customer ID number (identifier only)     | Numeric |

There is no target variable! You will be grouping the customers into segments based on these characteristics.

### K-Means Segmentation

K-Means is an unsupervised learning algorithm that groups data into k clusters by minimizing the distance between points and their cluster's center. It iteratively assigns points to the nearest centroid and updates centroids until the solution stabilizes.
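
The assign-and-update loop described above can be sketched in a few lines of NumPy. This is only an illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy points are made up for this example, and the initial centroids are simply data points picked at random.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (a centroid with no assigned points is left where it is)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up numbers for illustration)
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which group is labeled 0 and which is labeled 1 depends on the random starting centroids.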

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample customer data
data = {
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    'Spending': [300, 450, 290, 1600, 1700, 1720, 700, 500, 900, 1000, 705, 790, 410, 520, 850],
    'Visits': [3,4,3,16,17,17,7,5,8,9,7,8,4,5,9],
    'Annual Income (k$)': [60, 82, 85, 220, 200, 210, 150, 142, 180, 205, 144, 160, 80, 104, 145]
}


# Create DataFrame
df = pd.DataFrame(data)

#### Prepare Data

In [None]:

# Features for clustering
X = df[['Annual Income (k$)', 'Spending', 'Visits']]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
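
K-Means relies on distances, so a feature with large values (such as `Spending`) would otherwise dominate a feature with small values (such as `Visits`). The sketch below uses three rows from the sample data (customers 1, 4 and 7) to show what `StandardScaler` does to each column. The variable `X_manual` is introduced here only for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Three of the sample customers from this demo (customers 1, 4 and 7)
X = pd.DataFrame({
    'Annual Income (k$)': [60, 220, 150],
    'Spending': [300, 1600, 700],
    'Visits': [3, 16, 7],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: subtract each column's mean, then divide by its
# population standard deviation (ddof=0)
X_manual = (X - X.mean()) / X.std(ddof=0)

print(np.allclose(X_scaled, X_manual.to_numpy()))  # do the two approaches agree?
print(X_scaled.mean(axis=0).round(2))  # column means after scaling
print(X_scaled.std(axis=0).round(2))   # column standard deviations after scaling
```

After scaling, every column has a mean of 0 and a standard deviation of 1, so each feature contributes on the same footing to the distance calculation.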

#### Create Model

In [None]:
# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)
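
Cluster labels are just numbers, so one way to interpret them is to profile each segment in the original units. The sketch below rebuilds the demo data so that it runs on its own. The `features` list, the `profile` table, and the explicit `n_init=10` argument are additions for this example, with `n_init=10` chosen to make the result less sensitive to the starting centroids.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same sample customer data as the demo
df = pd.DataFrame({
    'CustomerID': list(range(1, 16)),
    'Spending': [300, 450, 290, 1600, 1700, 1720, 700, 500, 900, 1000, 705, 790, 410, 520, 850],
    'Visits': [3, 4, 3, 16, 17, 17, 7, 5, 8, 9, 7, 8, 4, 5, 9],
    'Annual Income (k$)': [60, 82, 85, 220, 200, 210, 150, 142, 180, 205, 144, 160, 80, 104, 145],
})
features = ['Annual Income (k$)', 'Spending', 'Visits']

X_scaled = StandardScaler().fit_transform(df[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)

# Mean feature values per segment (original units) plus the segment size
profile = df.groupby('Cluster')[features].mean().round(1)
profile['Customers'] = df.groupby('Cluster').size()
print(profile)
```

Reading across each row of the table gives a quick description of a segment, for example whether it combines high income with frequent visits.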

#### Visualize Model

In [None]:
# Plot clusters using Annual Income and Visits
plt.figure(figsize=(8, 5))
for cluster in df['Cluster'].unique():
    cluster_data = df[df['Cluster'] == cluster]
    plt.scatter(cluster_data['Annual Income (k$)'], cluster_data['Visits'], label=f'Cluster {cluster}')

plt.xlabel('Annual Income (k$)')
plt.ylabel('Visits')
plt.title('Customer Segmentation (by Visits, Income, Spending)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

#### Visualize Model Again with Centroids

In [None]:
import numpy as np
from matplotlib.patches import Ellipse

# Plot clusters using Annual Income and Visits
plt.figure(figsize=(8, 5))
ax = plt.gca()  # Get the current axes for adding patches

for cluster in df['Cluster'].unique():
    cluster_data = df[df['Cluster'] == cluster]
    plt.scatter(cluster_data['Annual Income (k$)'], cluster_data['Visits'], label=f'Cluster {cluster}')

    # Calculate the center and standard deviation for each cluster
    center_x = cluster_data['Annual Income (k$)'].mean()
    center_y = cluster_data['Visits'].mean()
    radius_x = cluster_data['Annual Income (k$)'].std()
    radius_y = cluster_data['Visits'].std()

    # The two axes use very different units, so a single-radius circle would be
    # distorted; draw an ellipse spanning one standard deviation along each axis
    ellipse = Ellipse((center_x, center_y), width=2 * radius_x, height=2 * radius_y,
                      edgecolor='black', fc='none', lw=2)
    ax.add_patch(ellipse)


# Plot centroids (converted from scaled units back to the original units)
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 2], marker='x', s=100, color='red', label='Centroids')

plt.xlabel('Annual Income (k$)')
plt.ylabel('Visits')
plt.title('Customer Segmentation (by Visits, Income, Spending)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
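
Once the segments exist, a fitted model can also place a new shopper into one of them. The sketch below rebuilds the demo data so that it runs on its own. The new customer's values are made up, and `features`, `new_customer`, `segment`, and the explicit `n_init=10` argument are additions for this example. The key point is that the new row goes through the already fitted scaler (`transform`, not `fit_transform`) before prediction.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same sample customer data as the demo
df = pd.DataFrame({
    'CustomerID': list(range(1, 16)),
    'Spending': [300, 450, 290, 1600, 1700, 1720, 700, 500, 900, 1000, 705, 790, 410, 520, 850],
    'Visits': [3, 4, 3, 16, 17, 17, 7, 5, 8, 9, 7, 8, 4, 5, 9],
    'Annual Income (k$)': [60, 82, 85, 220, 200, 210, 150, 142, 180, 205, 144, 160, 80, 104, 145],
})
features = ['Annual Income (k$)', 'Spending', 'Visits']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Hypothetical new shopper (made-up values), columns in the same order as the training features
new_customer = pd.DataFrame([[210, 1650, 16]], columns=features)

# Reuse the fitted scaler so the new row is on the same scale as the training data
new_scaled = scaler.transform(new_customer)
segment = kmeans.predict(new_scaled)[0]
print(f'New customer assigned to cluster {segment}')
```

The predicted label is the index of the nearest centroid in scaled space, so this shopper lands in the same segment as the existing customers with similar income, spending, and visits.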