# KMeans Clustering
Core


*Christina Brockway*

## Business Understanding

**Task:**
- Perform customer segmentation using KMeans.
- Help the company allocate marketing resources effectively.
- Use customer age, education, years of employment, income, debt, default status, and debt-to-income ratio to group customers into segments.

**Stakeholder:**
A credit card company that is trying to target marketing material at the most relevant people at the lowest cost.



## Data Understanding:

Using the following dataset: 
https://assets.codingdojo.com/boomyeah2015/codingdojo/curriculum/content/chapter/cust_seg.csv

From: 
https://github.com/Nikhil-Adithyan/Customer-Segmentation-with-K-Means

### Load Data and Imports:

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import plotly.express as px

In [2]:
# load dataset
credit =  'data/cust_seg.csv'
df = pd.read_csv(credit)
df.head()

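| | Unnamed: 0 | Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | DebtIncomeRatio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 41 | 2 | 6 | 19 | 0.124 | 1.073 | 0.0 | 6.3 |
| 1 | 1 | 2 | 47 | 1 | 26 | 100 | 4.582 | 8.218 | 0.0 | 12.8 |
| 2 | 2 | 3 | 33 | 2 | 10 | 57 | 6.111 | 5.802 | 1.0 | 20.9 |
| 3 | 3 | 4 | 29 | 2 | 4 | 19 | 0.681 | 0.516 | 0.0 | 6.3 |
| 4 | 4 | 5 | 47 | 1 | 31 | 253 | 9.308 | 8.908 | 0.0 | 7.2 |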


### Clean Data

In [3]:
df= df.drop(columns=['Unnamed: 0', 'Customer Id'])


In [4]:
df.isna().sum()

Age                  0
Edu                  0
Years Employed       0
Income               0
Card Debt            0
Other Debt           0
Defaulted          150
DebtIncomeRatio      0
dtype: int64

In [5]:
df=df.dropna()

In [6]:
df.isna().sum()

Age                0
Edu                0
Years Employed     0
Income             0
Card Debt          0
Other Debt         0
Defaulted          0
DebtIncomeRatio    0
dtype: int64

In [7]:
# Combine 'Card Debt' and 'Other Debt' into a new column 'Combined Debt'
df['Combined Debt'] = df['Card Debt'].astype(float) + df['Other Debt']

In [8]:
df= df.drop(columns=['Card Debt', 'Other Debt'])

In [9]:
df.head(2)

| | Age | Edu | Years Employed | Income | Defaulted | DebtIncomeRatio | Combined Debt |
|---|---|---|---|---|---|---|---|
| 0 | 41 | 2 | 6 | 19 | 0.0 | 6.3 | 1.197 |
| 1 | 47 | 1 | 26 | 100 | 0.0 | 12.8 | 12.8 |
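
As a quick sanity check on the combined column: in the rows shown by `df.head()`, `DebtIncomeRatio` appears to equal total debt divided by income, expressed as a percentage. The sketch below checks this on those five rows only. Whether it holds for the whole dataset is an assumption, and `implied_ratio` is a hypothetical helper. If the relationship does hold, `Combined Debt` is largely determined by `Income` and `DebtIncomeRatio`, which is worth keeping in mind when interpreting the clusters.

```python
# Rows copied from the df.head() output above:
# (Income, Card Debt, Other Debt, DebtIncomeRatio)
head_rows = [
    (19, 0.124, 1.073, 6.3),
    (100, 4.582, 8.218, 12.8),
    (57, 6.111, 5.802, 20.9),
    (19, 0.681, 0.516, 6.3),
    (253, 9.308, 8.908, 7.2),
]


def implied_ratio(income, card_debt, other_debt):
    """Total debt as a percentage of income (assumed definition)."""
    return (card_debt + other_debt) / income * 100


# Compare the reported ratio with the one implied by debt and income
for income, card, other, reported in head_rows:
    implied = implied_ratio(income, card, other)
    print(f"reported={reported:5.1f}  implied={implied:5.1f}")
```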


In [10]:
# Scale the data
# Instantiate Standard Scaler
scaler = StandardScaler()
# Fit & transform data.
scaled_df = scaler.fit_transform(df)
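
`StandardScaler` standardizes each column to zero mean and unit variance, using the population standard deviation. A minimal sketch of the same arithmetic in NumPy, using the `Age` and `Income` values from the `df.head()` rows as a stand-in for the full data (the variable names here are illustrative):

```python
import numpy as np

# Age and Income from the five rows shown by df.head() (illustration only)
sample = np.array([
    [41, 19],
    [47, 100],
    [33, 57],
    [29, 19],
    [47, 253],
], dtype=float)

# z = (x - column mean) / column standard deviation (ddof=0, as StandardScaler uses)
col_mean = sample.mean(axis=0)
col_std = sample.std(axis=0)
z_scores = (sample - col_mean) / col_std

print(z_scores.round(2))
```

Without this step, wide-ranging columns such as `Income` would dominate the Euclidean distances that KMeans relies on.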

### Use KMeans to create various customer segments

In [None]:
# Candidate values of k to evaluate
ks = range(2, 11)

#create empty list for inertias and sils
inertias=[]
sils = []

#Loop through k values for range
for k in ks:
    kmeans= KMeans(n_clusters=k, n_init='auto', random_state=42)
    kmeans.fit(scaled_df)
    inertias.append(kmeans.inertia_)
    sils.append(silhouette_score(scaled_df, kmeans.labels_))

# Visualize the scores
plt.plot(ks, inertias, marker='.')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

plt.plot(ks, sils, marker='.')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()

Choosing K based on the results: K=3 is used, based on the elbow plot.
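
Reading an elbow off a plot is subjective. One rough heuristic, sketched below, is to pick the k where the inertia curve bends most sharply, measured by the largest second difference. The inertia values in the example are made-up placeholders, not this notebook's results, and `elbow_k` is a hypothetical helper.

```python
def elbow_k(ks, inertias):
    """Return the k with the largest second difference in inertia.

    A simple heuristic, not a definitive rule: it needs at least three
    k values and can never select the first or last k.
    """
    ks = list(ks)
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Discrete second difference: how sharply the curve bends at ks[i]
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Placeholder values for illustration only
example_ks = range(2, 11)
example_inertias = [1000, 600, 500, 430, 380, 340, 310, 285, 265]
print(elbow_k(example_ks, example_inertias))
```

In the notebook this could be called as `elbow_k(ks, inertias)` and cross-checked against the k with the highest silhouette score, `ks[int(np.argmax(sils))]`.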

In [None]:
kmeans_model = KMeans(n_clusters=3, n_init='auto', random_state=42)
kmeans_model.fit(scaled_df)
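
Because the model is fit on scaled data, `kmeans_model.cluster_centers_` is expressed in standard-deviation units. `scaler.inverse_transform` maps the centers back to the original units, which is often easier to interpret. A minimal sketch on synthetic two-column data follows. The data, the column meanings, and the names `toy_scaler` and `toy_model` are assumptions for illustration, and `n_init=10` is used instead of `'auto'` only so the sketch also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic (age, income) groups, for illustration only
rng = np.random.default_rng(0)
true_centers = np.array([[30.0, 20.0], [50.0, 120.0], [40.0, 250.0]])
toy_X = np.vstack([
    center + rng.normal(scale=[2.0, 5.0], size=(50, 2))
    for center in true_centers
])

# Scale, then cluster in the scaled space (same order as the notebook)
toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_X)

toy_model = KMeans(n_clusters=3, n_init=10, random_state=42)
toy_model.fit(toy_scaled)

# Centers in original units (age, income), sorted by income for readability
centers_original = toy_scaler.inverse_transform(toy_model.cluster_centers_)
centers_original = centers_original[np.argsort(centers_original[:, 1])]
print(centers_original.round(1))
```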

### Analyze the clusters 

Create analytical visualizations that explore statistics for each feature for each cluster.

In [None]:
# Add the clusters as a column in the dataframe
df['cluster'] = kmeans_model.labels_
df.head()
#group the dataframe by cluster and aggregate by mean values
cluster_groups = df.groupby('cluster', as_index=False).mean()
cluster_groups

In [None]:
# Visualize means 
fig, axes = plt.subplots(2,4, figsize = (20, 10))
# axes is a 2x4 array of Axes objects; axes.ravel() flattens it to a 1-D array of 8
axes = axes.ravel()
# Loop over columns and plot each in a separate subplot, skipping the 'cluster' column
for i, col in enumerate(cluster_groups.columns[1:]):
  axes[i].bar(cluster_groups['cluster'], cluster_groups[col])
  axes[i].set_title(f'Mean {col}')
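
Cluster means are easier to interpret alongside cluster sizes. Since `Defaulted` is coded 0/1, its mean within a cluster is that cluster's default rate. A small sketch using pandas named aggregation on a toy frame (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical cluster labels and default flags, for illustration only
toy = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1, 1, 1, 2, 2],
    "Defaulted": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
})

# One row per cluster: how many customers, and what share defaulted
cluster_summary = toy.groupby("cluster").agg(
    n_customers=("Defaulted", "size"),
    default_rate=("Defaulted", "mean"),
)
print(cluster_summary)
```

The same aggregation could be applied to `df` once the `cluster` column has been added.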

### Descriptions of each cluster based on the visualizations 

##### Cluster 0:

*  Customers in Cluster 0 are generally in their mid-thirties. They have a moderate level of education, though the lowest of the three clusters. They have also been employed for the fewest years, averaging 7.5. Accordingly, their income is low, but not as low as Cluster 2's. They have the lowest debt and the lowest debt-to-income ratio, and no customers in this group have defaulted. This cluster appears to manage money well and keeps debt low, possibly because income is also low. They appear to be cautious borrowers.


##### Cluster 1:

*  Cluster 1 is the oldest group of customers. They have the highest education, the longest employment, and the highest income. They also have the highest debt and a high debt-to-income ratio, and a small number of them have defaulted. These customers are well educated and earn a large income, and higher income tends to come with more spending; perhaps they have a tendency to spend more than they make. Still, the majority of this cluster pays its debts.


##### Cluster 2:

*  Cluster 2 is the youngest group and has slightly less education than Cluster 1. These customers have the lowest income, though only slightly lower than Cluster 0's. Combining high debt with low income gives this cluster the highest debt-to-income ratio, and the majority of these customers have defaulted. Being younger, this group seems to spend more impulsively even though they do not make a lot of money.


### Stakeholder Recommendations

1.  Ideally, the company should direct its marketing toward Clusters 0 and 1. These customers are more likely to pay back the credit they use.
2.  Cluster 2 is more likely not to pay back the credit they use, so a high-interest card with a low credit limit is the best type of card to market to these customers. This way, even if a card is defaulted on, the high interest on whatever amount is paid back should help cover the credit that was spent.
3.  Marketing a card with a large limit to Cluster 0 is likely to be profitable for the company. These customers have not defaulted on their cards and should be rewarded for their spending habits. A low interest rate and a high limit should build loyalty to this card over other options.
4.  Cluster 1 should be marketed a card with a low interest rate and a lower spending limit. These customers are likely to pay back the money they spend but are not good at regulating their spending. A low limit will help prevent them from spending more than they can pay back.