# Clustering

- What is clustering? Identifying groups in our data.
- Why might we do clustering?
    - Exploration
    - Labeling
    - Features for Supervised Learning
- KMeans Algorithm
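    1. Start with `k` random points as the initial centroids
    1. Assign every observation to its closest centroid
    1. Recalculate each centroid as the mean of its assigned observations
    1. Repeat until the assignments stop changing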
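
The four steps above can be sketched in plain NumPy. This is a toy illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy k-means; the name and defaults are illustrative only."""
    rng = np.random.default_rng(seed)
    # 1. start with k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign every observation to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. recalculate each centroid as the mean of its assigned points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# synthetic data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans` does the same job with better initialization and multiple restarts, so the sketch is only meant to make the loop concrete.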

## Example 1: Mall Customers

### Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans

import wrangle_mall

In [None]:
customers = wrangle_mall.acquire()
train, validate, test = wrangle_mall.split(customers)
train_scaled, _, _ = wrangle_mall.scale(train, validate, test)

### Cluster

1. choose features to cluster on
1. choose k
1. create and fit the model

In [None]:
cols = ['spending_score', 'annual_income']
X = train_scaled[cols]
kmeans = KMeans(n_clusters=5).fit(X)

1. Look at the model's output
1. Interpret the clusters
1. Visualize

In [None]:
# NB. clusters were created on scaled data, but can be used to analyze the unscaled data
train['cluster'] = kmeans.predict(X)
train.cluster = train.cluster.astype('category')
train.head()

In [None]:
sns.relplot(data=train, y='spending_score', x='annual_income', hue='cluster')

In [None]:
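# numeric_only=True skips any non-numeric columns (recent pandas raises on them otherwise)
train.groupby('cluster').mean(numeric_only=True)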

### Example: Clusters for Supervised Model

1. Do clustering
1. Build a model with *just* the produced clusters
1. Look at the coefficients of the resulting model to determine which clusters are most important (most impactful on the model's predictions)
1. Use the most impactful clusters in combination with other features

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
cols = ['age', 'annual_income']
X = train_scaled[cols]
kmeans = KMeans(n_clusters=5).fit(X)

In [None]:
train['cluster'] = kmeans.predict(X)

cluster_df = pd.get_dummies(train.cluster)
cluster_df.columns = [f'cluster_{n}' for n in cluster_df]
cluster_df.head()

In [None]:
X = cluster_df
y = train.spending_score

lr = LinearRegression()
lr.fit(X, y)
lr.score(X, y)

In [None]:
lr.intercept_

In [None]:
pd.options.display.float_format = '{:15.2f}'.format

In [None]:
pd.Series(lr.coef_, index=X.columns).sort_values()

In [None]:
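# NB. cluster numbering is arbitrary and can change between runs
# unless KMeans is given a random_state
train['is_cluster_4'] = train['cluster'] == 4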

In [None]:
X = train[['age', 'annual_income', 'is_cluster_4']]
y = train.spending_score

lr = LinearRegression()
lr.fit(X, y)
lr.score(X, y)

Takeaway:

- Adding cluster 4 was not helpful in our model
- When using clustering for supervised learning, we could use clusters based on independent variables in place of those variables
    - Ex. cluster on x1, x2, analyze clusters, build a model with clusters and x3 to predict y
    - Ex. cluster on bedroom count and bathroom count, use those clusters in combination with square footage to predict home value
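
A minimal sketch of the "cluster on x1, x2, then model with clusters and x3" idea, using made-up data. The column names `x1`, `x2`, `x3`, `y`, the synthetic relationship between them, and the choice of `k=4` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
# hypothetical data: x1, x2, x3 are independent variables, y is the target
df_demo = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
df_demo['y'] = (
    2 * df_demo.x3 + df_demo.x1 * df_demo.x2 + rng.normal(scale=0.1, size=n)
)

# cluster on x1 and x2 only
kmeans = KMeans(n_clusters=4, n_init=10, random_state=123)
df_demo['cluster'] = kmeans.fit_predict(df_demo[['x1', 'x2']])

# one indicator column per cluster; these stand in for x1 and x2 in the model
cluster_dummies = pd.get_dummies(df_demo.cluster, prefix='cluster', dtype=int)

# model y with the cluster indicators plus x3
X_model = pd.concat([cluster_dummies, df_demo[['x3']]], axis=1)
lr = LinearRegression().fit(X_model, df_demo.y)
r2 = lr.score(X_model, df_demo.y)
print(r2)
```

Whether the clusters carry more signal than the raw variables depends on the data, so it is worth comparing this score against a model fit on `x1`, `x2`, and `x3` directly.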

### How do we choose a value for k?

It's a judgement call

- domain knowledge
- educated guesses
- the elbow method
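    - **inertia**: the sum of squared distances from each point to its assigned centroid; it always decreases as k grows, so we look for the "elbow" where the improvement levels off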
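
To make the inertia definition concrete, the sketch below recomputes it by hand from a fitted model and compares it with scikit-learn's `inertia_` attribute. The synthetic data and the choice of `k=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))  # synthetic data, for illustration only

kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(X_demo)

# look up the centroid assigned to each observation
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# sum of squared distances from each point to its assigned centroid
manual_inertia = ((X_demo - assigned_centroids) ** 2).sum()

print(manual_inertia, kmeans.inertia_)
```

The two numbers should agree up to floating-point error.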

Elbow Method Demo

1. Choose a range of k values
1. Create a model for each k and record **inertia**
1. Visualize results (k vs inertia)

In [None]:
cols = ['spending_score', 'annual_income']
X = train_scaled[cols]

inertias = {}

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias[k] = kmeans.inertia_

pd.Series(inertias).plot(xlabel='k', ylabel='Inertia', figsize=(13, 7))
plt.grid()

## Example 2: Insurance Data

### Setup

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('https://gist.githubusercontent.com/zgulde/ad9305acb30b00f768d4541a41f5ba19/raw/01f4ac8f158e68b0d293ff726c0c1dd08cdd501d/insurance.csv')
df.head()

In [None]:
# data split
train_and_validate, test = train_test_split(df, test_size=.1, random_state=123)
train, validate = train_test_split(train_and_validate, test_size=.1, random_state=123)

# scale
scaler = MinMaxScaler()
cols = ['age', 'bmi', 'children', 'charges']
train_scaled = train.copy()
train_scaled[cols] = scaler.fit_transform(train[cols])

### Cluster

1. Choose a k
1. Create the model and produce clusters
1. Interpret results

In [None]:
X = train_scaled[cols]
kmeans = KMeans(n_clusters=4).fit(X)
train['cluster'] = kmeans.predict(X)
train.cluster = train.cluster.astype('category')

In [None]:
train.head()

In [None]:
sns.relplot(data=train, y='bmi', x='age', hue='cluster', col='smoker')

In [None]:
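# clustering based on 'age', 'bmi', 'children', and 'charges'
# numeric_only=True skips the string columns (recent pandas raises on them otherwise)
train.groupby('cluster').mean(numeric_only=True)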

In [None]:
sns.catplot(data=train, hue='children', y='charges', x='cluster', kind='bar')

In [None]:
# NB. `inertias` still holds the results of the mall customers elbow demo above
s = pd.Series(inertias).rename('Inertia')

pd.concat([
    s, s.pct_change(periods=1)
], axis=1)

In [None]:
from sklearn.cluster import k_means

In [None]:
X = sns.load_dataset('iris').drop(columns='species')

inertias = {}

for k in range(2, 11):
    centroids, labels, inertia = k_means(X, k)
    inertias[k] = inertia
    

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 7))
s = pd.Series(inertias)
s.plot(ax=ax1, title='Inertia Vs K')
s.pct_change().plot(title='% Change in Inertia vs K', ax=ax2)

In [None]:
customers.head()

In [None]:
X = customers[['annual_income', 'spending_score']]

inertias = {}

for k in range(2, 11):
    centroids, labels, inertia = k_means(X, k)
    inertias[k] = inertia
    

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 7))
s = pd.Series(inertias)
s.plot(ax=ax1, title='Inertia Vs K')
s.pct_change().plot(title='% Change in Inertia vs K', ax=ax2)

In [None]:
# reads whatever table is on the clipboard; the cells below expect columns x and y
bonus_df = pd.read_clipboard()

In [None]:
df = bonus_df.copy()

In [None]:
plt.scatter(df.x, df.y)

In [None]:
centroids, labels, inertia = k_means(df[['x', 'y']], 2)
df['cluster'] = labels
df.cluster = df.cluster.astype('category')
sns.relplot(data=df, y='y', x='x', hue='cluster')

In [None]:
from sklearn.preprocessing import scale

In [None]:
centroids, labels, inertia = k_means(scale(df[['x', 'y']]), 2)
df['cluster'] = labels
df.cluster = df.cluster.astype('category')
sns.relplot(data=df, y='y', x='x', hue='cluster')