# Clustering for Customer Segmentation with KMeans


## Key Takeaways
In this lab, you will gain a comprehensive understanding of KMeans clustering and its applications in data segmentation. 

_You will learn how to:_
- choose the optimal number of clusters, 
- visualize cluster results, and 
- apply clustering techniques to real-world datasets. 

Through practical exercises and projects, you will develop the skills necessary to leverage clustering for data-driven decision-making in various domains.

## Applications in Market Segmentation

__How data science helps:__

- Businesses analyze customer data to create targeted marketing strategies that cater to specific groups (segments), improving engagement and loyalty.

__Types of customer segmentation features:__

- Demographic, Geographic, Psychographic, and Behavioural.

__Demographic__ - grouping is based on demographic variables such as age, gender, income, occupation, and education level. 

__Geographic__ - customers are grouped by location, which could be as broad as a country or as specific as a neighborhood. This helps in tailoring marketing campaigns that are culturally and regionally relevant. 

__Psychographic__ - includes lifestyle, values, attitudes, and personal traits. 

__Behavioural__ - customers are divided based on their behaviour patterns related to the business, such as purchase history, product usage frequency, brand loyalty, and user status (new, potential, or loyal customers).

__What data do I regularly segment on for Email Marketing Segmentation?__ 

Recency, Frequency, Monetary (RFM) features, time on list, time since last purchase, spend in the last 30 days, products purchased, interests (what they are clicking on), events attended, email scoring, product pages clicked (and which ones), geographic region, number of tags, number of events, and many more.
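As a rough illustration of the RFM features mentioned above, the sketch below derives recency, frequency, and monetary value from a small transaction log. The `transactions` table, its column names, and the `snapshot` reference date are hypothetical and are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical transaction log (not from the lab dataset): one row per purchase
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-15", "2024-03-10",
    ]),
    "amount": [50.0, 30.0, 120.0, 20.0, 25.0, 40.0],
})

# Assumed reference date for recency: the day after the last recorded purchase
snapshot = transactions["order_date"].max() + pd.Timedelta(days=1)

rfm = transactions.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=("order_date", "count"),                            # number of purchases
    monetary=("amount", "sum"),                                   # total amount spent
)
print(rfm)
```

Each row of `rfm` describes one customer, so the table can be fed to a clustering model in the same way as the prepared features later in this lab.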

__Algorithms used:__

- KMeans - great tool for finding similar customers.


## Prepare Data

In [2]:
# Libraries

# Data manipulation
import pandas as pd
# from pandas_profiling import ProfileReport

# Data Visualization
import matplotlib.pyplot as plt
import plotly.express as px

# Date manipulation
from datetime import date, datetime, timedelta

# Clustering algorithm
from sklearn.cluster import KMeans

# For Cat features
from category_encoders import OneHotEncoder

# For Scaling features
from sklearn.preprocessing import StandardScaler

# Model pipeline
from sklearn.pipeline import make_pipeline

# Reduce dimensionality
from sklearn.decomposition import PCA

# Evaluation metric
from sklearn.metrics import silhouette_score

# Warning
import warnings
warnings.simplefilter('ignore', category=Warning, lineno=0, append=False)

### Import

Let's look at the first few rows of data.

In [3]:
# import data
raw_df = pd.read_csv('./data/marketing_campaign.csv')
raw_df.head()

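|   | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |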


It's always good to get a little more information about the data, such as data types and missing values. The `info` method is very helpful for this.

In [4]:
# Understand data structure
print(raw_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

We see that the majority of the features are integers, with only a few float and object columns. The output also shows that one column, `Income`, has missing values (2216 non-null out of 2240 entries). The next step counts exactly how many values are missing in each feature.

#### Missing Values

In [None]:
# Checking missing values
raw_df.isnull().sum()

In [None]:
# Fill in missing `Income` values with the column mean
raw_df['Income'] = raw_df['Income'].fillna(raw_df['Income'].mean())

In [None]:
# Checking missing values
raw_df.isnull().sum()

After the fill, the check above should report no missing values in any column.

In [None]:
raw_df.Z_CostContact.unique()

Before moving any further, let's try to understand what each feature means.
- **ID**: Customer's unique identifier,
- **Year_Birth**: Customer's year of birth,
- **Education**: Customer's education level,
- **Marital_Status**: Customer's marital status ('Single', 'Together', 'Married', 'Divorced', 'Widow', etc.),
- **Income**: Customer's yearly household income,
- **Kidhome**: Number of kids in the customer's household,
- **Teenhome**: Number of teenagers in the customer's household,
- **Dt_Customer**: Date the customer enrolled with the company,
- **Recency**: Number of days since the customer's last purchase,
- **[MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds]**: Amount the customer spent on each product category,
- **[NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases]**: Number of purchases made through each channel (with a discount, on the website, from the catalog, in store),
- **NumWebVisitsMonth**: Number of visits to the company's website in the last month,
- **[AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, AcceptedCmp1, AcceptedCmp2]**: Whether the customer accepted the offer in the corresponding campaign (0 - no, 1 - yes),
- **Complain**: Customer complaints (0 - no complaint, 1 - complaint),
- **Z_CostContact**: Customer contact cost; it holds the same value (3) in every row shown above,
- **Z_Revenue**: Revenue column; it holds the same value (11) in every row shown above,
- **Response**: Whether the customer accepted the offer in the last campaign (0 - no, 1 - yes).

### Explore

In [None]:
# Preview Dt_Customer parsed as dates (the result is not assigned here; the cast is applied in `prepare_data`)
pd.to_datetime(raw_df['Dt_Customer'], format="%d-%m-%Y")

In [None]:
# Cat features: check the unique categories in each feature
cat_features = ['Education', 'Marital_Status']
for i in cat_features:
    print(f'Feature {i}:\n {raw_df[i].unique()}')
# raw_df

In [None]:
raw_df.describe(include='number')
# Look at variability: a column with zero spread (std of 0, so min equals max) carries no information and may not be needed
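The variability check above can also be done programmatically. The sketch below flags columns that hold a single distinct value. The toy `df` stands in for `raw_df`, with column names borrowed from the lab data and made-up rows.

```python
import pandas as pd

# Toy frame standing in for raw_df (values are illustrative only)
df = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# A column with only one distinct value has zero variance
# and cannot help separate customers into clusters
constant_cols = df.columns[df.nunique() == 1].tolist()
print(constant_cols)
```

Columns found this way are natural candidates for the `drop` step in a data preparation function.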

#### Create `prepare_data` function

In [None]:
def prepare_data(data):

    data = (
        data
        # Remove NA values
        .dropna()
        
        # Convert Dt_Customer datatype to Date
        .assign(
            Dt_Customer = lambda x: pd.to_datetime(x['Dt_Customer'], format="%d-%m-%Y")
            )
        
        # Feature: customer tenure in days, relative to the latest `Dt_Customer` in the data
        .assign(
            Cust_Age = lambda x: (x['Dt_Customer'].max() - x['Dt_Customer'])/timedelta(days=1)
            )
        
        # Spent = row-wise sum of all `Mnt...` columns
        .assign(
            Spent = lambda x: x.loc[:, x.columns.str.contains('Mnt')].sum(axis=1)
            )
        
        # Remove unnecessary features
        .drop(
            columns = ['ID', 'Z_CostContact', 'Z_Revenue', 'Response', 'Dt_Customer']
            )
        )
    
    # Transform Cat features
    prepared_df = OneHotEncoder(use_cat_names=True).fit_transform(data)
    
    # Output: cleaned dataframe
    return prepared_df
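To make the one-hot encoding step concrete, here is a minimal sketch on a toy frame. It uses `pandas.get_dummies` as a stand-in for `category_encoders.OneHotEncoder`, so the exact column names and order may differ from what `prepare_data` produces. The `toy` data is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one categorical and one numeric feature
toy = pd.DataFrame({
    "Education": ["PhD", "Graduation", "PhD"],
    "Income": [58293.0, 46344.0, 71613.0],
})

# Each category becomes its own 0/1 indicator column;
# numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Education"], dtype=int)
print(encoded)
```

KMeans works on numeric distances, which is why text categories such as `Education` have to be turned into indicator columns before clustering.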

### Split

In [None]:
# Features to use
X = prepare_data(raw_df)
X.head()

## Build Model

### Iterate: Optimizing the Number of Clusters

In [None]:
n_clusters = range(2,8)
inertia_errors = []
silhouette_scores = []

# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
    # Build model
    model = make_pipeline(
        StandardScaler(), # KMeans is distance-based, so features are scaled first
        KMeans(n_clusters=k, random_state=42)
    )
    # Train model
    model.fit(X)
    # Calculate inertia
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    # Calculate silhouette score in the same scaled space the clusters were fit in
    X_scaled = model.named_steps["standardscaler"].transform(X)
    silhouette_scores.append(
        silhouette_score(
            X_scaled,
            model.named_steps["kmeans"].labels_)
    )

print("Inertia:", inertia_errors[:3])
print()
print("Silhouette Scores:", silhouette_scores[:3])
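The two metrics collected in the loop can be checked by hand on a tiny example. The sketch below, on hypothetical toy points, recomputes inertia as the sum of squared distances from each point to its assigned centroid and compares it with `inertia_`. `n_init=10` is set explicitly as an assumption to keep results stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (hypothetical): two tight, well-separated pairs of points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_toy)

# Inertia: sum of squared distances from each point to its assigned centroid
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_toy - assigned_centers) ** 2).sum()

# Silhouette: compares within-cluster cohesion with separation
# from the nearest other cluster (values close to 1 are better)
sil = silhouette_score(X_toy, km.labels_)

print(manual_inertia, km.inertia_, sil)
```

Inertia always decreases as more clusters are added, so it is read by looking for an elbow in the curve, while the silhouette score can be compared directly across values of `k`.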

In [None]:
import plotly.express as px
# Create line plot of `inertia_errors` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Inertia")
fig.show()

In [None]:
# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(
    xaxis_title="Number of Clusters", yaxis_title="Silhouette Score"
)
fig.show()
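Besides reading the plots, one simple heuristic is to pick the `k` with the highest silhouette score. The sketch below shows this on synthetic blobs with three known centers. The data, `candidate_k`, and `best_k` are illustrative names, and this is only one possible selection rule, not necessarily the one behind the choice made in this lab.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumed for illustration)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

candidate_k = list(range(2, 8))
scores = []
for k in candidate_k:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores.append(silhouette_score(X_toy, labels_k))

# Choose the number of clusters with the highest silhouette score
best_k = candidate_k[int(np.argmax(scores))]
print(best_k)
```

On real customer data the peak is usually less clear-cut, so it helps to weigh the score against how interpretable and actionable the resulting segments are.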

### Evaluate

In [None]:
# final_model labels
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=42)
)
# Predict class labels
labels = final_model.fit_predict(X)
print(labels[:10])

In [None]:
# Attach the cluster labels to the feature table
X['labels'] = final_model.named_steps["kmeans"].labels_
print(X['labels'].head(10))
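Once labels are attached to the feature table, a per-cluster summary helps interpret the segments. The sketch below uses a hypothetical `clustered` frame with made-up values. With the lab data, the same `groupby` could be applied to `X`.

```python
import pandas as pd

# Hypothetical labelled customers (values are illustrative only)
clustered = pd.DataFrame({
    "Income": [20000, 30000, 80000, 90000],
    "Spent": [50, 150, 1200, 1800],
    "labels": [0, 0, 1, 1],
})

# Size and average profile of each cluster
profile = clustered.groupby("labels").agg(
    customers=("Income", "size"),
    mean_income=("Income", "mean"),
    mean_spent=("Spent", "mean"),
)
print(profile)
```

Comparing the rows of `profile` is often the quickest way to give each segment a descriptive name, such as "high income, high spend".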

## Communicate

### 1. Using `plotnine`, a Python implementation of R's `ggplot2` grammar of graphics

In [None]:
import plotnine as pn
from plotnine import *
p = (
    # Create a plot area
    ggplot(X, aes(x='Spent', y='Income')) +
    # Add points into the plot area
    geom_point(aes(fill = X['labels'].astype(str)))
    )
# Add a blue smooth line
p = p + geom_smooth(
    color = "blue",
    se = False
    )
# Format x-axis scale to display `$000,000`
p = p + scale_x_continuous(
    name='Spent',
    labels = lambda x: [f'${y:,.0f}' for y in x]
    )
# Format y-axis scale to display `$000,000`
p = p + scale_y_continuous(
    name='Income',
    labels = lambda x: [f'${y:,.0f}' for y in x],
    limits=(0,200_000)
    )
# Add title to the plot
p = p + labs(
    title = "Customer Clusters: Spent vs Income"
    ) + theme_classic() # Add plot theme

p.show()

### 2. Visualization using PCA

In [None]:
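# Instantiate transformer
pca = PCA(n_components=2, random_state=42)

# Scale the features (without the cluster labels) so that large-valued
# columns such as `Income` do not dominate the components
X_scaled = StandardScaler().fit_transform(X.drop(columns="labels", errors="ignore"))

# Transform the scaled features
X_t = pca.fit_transform(X_scaled)

# Put `X_t` into DataFrame
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])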

print("X_pca shape:", X_pca.shape)
X_pca.head()

In [None]:
# Create scatter plot
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels.astype(str),
    title="Customer Clusters: PCA Projection"
)
fig.update_layout(xaxis_title="PC1", 
                  yaxis_title="PC2")
fig.show()
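A two-component projection only shows part of the data, so it is worth checking how much variance the components retain. The sketch below does this on synthetic data. `X_toy` and its three columns are assumptions for illustration, and with the lab data the same attribute is available on the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two strongly correlated columns plus one independent column
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Scale first so that no column dominates because of its units
X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=2, random_state=42).fit(X_scaled)

# Share of the total variance captured by each component
ratios = pca_toy.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the two ratios sum to a small fraction, clusters that overlap in the scatter plot may still be well separated in the full feature space.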