# Introduction
---
This project groups customers into clusters based on their features.

# What is clustering and how does it work?
---
Clustering is an unsupervised learning technique that groups similar data points together into clusters. 
It works by finding patterns and structure in unlabeled data, so that points within a cluster are more similar to each other than to points in other clusters. Some common algorithms are K-means, Agglomerative clustering, and DBSCAN.

# Data Introduction
---
Each row describes one customer, with the following columns:

- Customer ID
- Gender
- Age
- Annual Income
- Spending Score - Score assigned by the shop, based on customer behavior and spending nature
- Profession
- Work Experience - in years
- Family Size


# Decision Making for Modeling
---
![Image](images/AlgorithmCheatsheet.png)

# Data Info
---

In [171]:
import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

In [172]:
# load data
df = pl.read_csv("data/Customers.csv")
df.head()

CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
i64,str,i64,i64,i64,str,i64,i64
1,"""Male""",19,15000,39,"""Healthcare""",1,4
2,"""Male""",21,35000,81,"""Engineer""",3,3
3,"""Female""",20,86000,6,"""Engineer""",1,1
4,"""Female""",23,59000,77,"""Lawyer""",0,2
5,"""Female""",31,38000,40,"""Entertainment""",2,6


In [173]:
# check if there's more than 50 samples, there should be 2000
df.shape

(2000, 8)

The question I want to answer is which profession someone might be in based on their income, family size, and other factors, so the target, profession, is a category.

Because this is a clustering project, let's ignore the profession labels and treat the data as unlabeled. That gives a "NO" on the labeled-data question, which brings us to the clustering section of the sklearn algorithm cheat sheet.

The next question on the cheat sheet is whether we know how many categories there are, which we can find out from the data. We'll also check the null count, since some people might not have a profession, and fill those nulls with a new category called "No Profession".

In [174]:
df.null_count()

CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,35,0,0


In [175]:
df = df.with_columns(
    pl.col("Profession").fill_null("No Profession")
)

In [176]:
df.null_count()

CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0


# Data Visualization
---

Now let's check how many categories we have.

In [177]:
df["Profession"].value_counts(sort=True)

Profession,count
str,u32
"""Artist""",612
"""Healthcare""",339
"""Entertainment""",234
"""Engineer""",179
"""Doctor""",161
"""Executive""",153
"""Lawyer""",142
"""Marketing""",85
"""Homemaker""",60
"""No Profession""",35


We have 10 categories for Profession, including the "No Profession" category we added. To visualize the distribution of professions:

In [178]:
chart = df["Profession"].value_counts(sort=True).plot.bar(x="Profession", y="count")
chart = chart.properties(width=700, height=400, title="Profession Count")
chart

- Artist is the most common profession in this dataset (612 of 2,000 people), and Healthcare is the second most common (339).

Let's look at the gender distribution.

In [179]:
alt.Chart(df).mark_bar().encode(
    x="Gender",
    y="count(Gender)"
).properties(width=700, height=400)

- The majority are female, so the dataset is skewed towards women. We should check whether the measures of central tendency for the other variables differ noticeably between genders.

In [180]:
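def vconcat_bar(focus_var: str, title: str, color: str=None, width: int=700, height: int=400):
    """Stack one boxplot per remaining column, split by `focus_var`.

    Despite the name, each panel is a boxplot (mark_boxplot), not a bar chart.
    """
    # Fall back to coloring by the focus variable when no color is given
    color = color if color is not None else focus_var

    base = alt.Chart(df).mark_boxplot().encode(
        x=alt.X("Age"),
        y=alt.Y(focus_var),
        color=color
    ).properties(
        width=width, 
        height=height
    )

    for col in df.columns:
        # Age is already plotted above; the categorical columns are skipped
        if col in ("Age", "Gender", "Profession"):
            continue
        temp_chart = alt.Chart(df).mark_boxplot().encode(
            x=alt.X(f"{col}"),
            y=alt.Y(focus_var),
            color=color
        ).properties(
            width=width,
            height=height
        )
        # `&` is Altair's vertical concatenation operator
        base = base & temp_chart

    return base.properties(title=alt.Title(title))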


In [181]:
vconcat_bar("Gender", "Checking Class Balance")

Everything looks fairly even: none of these variables differ much by gender. Gender does not appear to be strongly related to the other variables, since the distributions look much the same for males and females. We can also check profession to see whether anything stands out.

So, we can move on to the next step of the sklearn algorithm cheat sheet.

Since we have fewer than 10k samples, we can go straight to modeling with KMeans, and then possibly try Spectral Clustering and a Gaussian mixture model (GMM).

In [182]:
vconcat_bar("Profession", "Checking Class Balance - Profession")

Now, we can move on to preprocessing the data.

# Preprocessing
---
Our plan is to standardize the numeric variables so they are on comparable scales, and to one-hot encode the categorical variables, both inside a pipeline. After that, we can move on to our KMeans model.
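
As a rough illustration of what these two steps do, the sketch below standardizes a small numeric column by hand and one-hot encodes a small categorical column. The values are toy data chosen for the example, and the variable names (`income`, `gender_onehot`, and so on) are hypothetical; the notebook itself relies on `StandardScaler` and `OneHotEncoder` instead.

```python
import numpy as np

# Toy values (hypothetical, chosen only for illustration)
income = np.array([15000.0, 35000.0, 86000.0, 59000.0])

# Standardization: subtract the mean and divide by the population standard
# deviation (ddof=0), which is what StandardScaler does by default
income_z = (income - income.mean()) / income.std()

# One-hot encoding: one 0/1 column per distinct category
gender = np.array(["Male", "Male", "Female", "Female"])
categories = np.unique(gender)  # sorted: ['Female', 'Male']
gender_onehot = (gender[:, None] == categories[None, :]).astype(float)

print(income_z.round(3))
print(categories)
print(gender_onehot)
```

After standardization the column has mean 0 and standard deviation 1, so a large-valued variable such as income no longer dominates the distance calculations that KMeans depends on.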

In [183]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [184]:
df = df.drop("CustomerID")

In [185]:
numeric_cols     = [col for col in df.columns if df[col].dtype in (pl.Float64, pl.Int64)]
categorical_cols = [col for col in df.columns if df[col].dtype in (pl.Categorical, pl.String)]

In [186]:
numeric_cols, categorical_cols

(['Age',
  'Annual Income ($)',
  'Spending Score (1-100)',
  'Work Experience',
  'Family Size'],
 ['Gender', 'Profession'])

In [187]:
numeric_transformer = Pipeline(
    steps=[("standardizer", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("OneHot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
        
    ],
    remainder="passthrough"
)

preprocessor.set_output(transform="polars")

pipe = Pipeline(steps=[("preprocessor", preprocessor)])

In [188]:
pipe

Pipeline
- steps: [('preprocessor', ...)]
- transform_input: None
- memory: None
- verbose: False

ColumnTransformer (preprocessor)
- transformers: [('num', ...), ('cat', ...)]
- remainder: 'passthrough'
- sparse_threshold: 0.3
- n_jobs: None
- transformer_weights: None
- verbose: False
- verbose_feature_names_out: True
- force_int_remainder_cols: 'deprecated'

StandardScaler (standardizer)
- copy: True
- with_mean: True
- with_std: True

OneHotEncoder (OneHot)
- categories: 'auto'
- drop: None
- sparse_output: False
- dtype: <class 'numpy.float64'>
- handle_unknown: 'ignore'
- min_frequency: None
- max_categories: None
- feature_name_combiner: 'concat'


Now that our preprocessor is ready, we can test our KMeans model as a baseline.

# Modeling
---
KMeans clustering partitions a dataset into K distinct, non-overlapping clusters. We must first specify the desired number of clusters K; the algorithm then assigns each observation to exactly one of them. Two properties hold: every observation belongs to at least one of the K clusters, and no observation belongs to more than one cluster, which is what makes the clusters non-overlapping. To judge how "good" a clustering is, KMeans uses the within-cluster variation, which we want to be as small as possible. So we want to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
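
To make the within-cluster variation concrete, here is a minimal sketch that computes it by hand as the sum of squared Euclidean distances from each point to its cluster's centroid, and compares it with the `inertia_` value that sklearn reports. It assumes synthetic data from `make_blobs` rather than the customer data, and `n_init=10` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration, not the customer data)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Within-cluster variation: squared Euclidean distance from each point
# to the centroid of its assigned cluster, summed over all clusters
wcss = 0.0
for k in range(km.n_clusters):
    members = X[km.labels_ == k]
    wcss += ((members - km.cluster_centers_[k]) ** 2).sum()

print(wcss, km.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why `inertia_` can be used as the WCSS in an elbow plot.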

In [189]:
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture

In [190]:
df_scaled = pipe.fit_transform(df)
df_scaled[:3]

num__Age,num__Annual Income ($),num__Spending Score (1-100),num__Work Experience,num__Family Size,cat__Gender_Female,cat__Gender_Male,cat__Profession_Artist,cat__Profession_Doctor,cat__Profession_Engineer,cat__Profession_Entertainment,cat__Profession_Executive,cat__Profession_Healthcare,cat__Profession_Homemaker,cat__Profession_Lawyer,cat__Profession_Marketing,cat__Profession_No Profession
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
-1.054089,-2.093501,-0.428339,-0.791207,0.117497,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
-0.983723,-1.656133,1.075546,-0.281162,-0.390051,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-1.018906,-0.540845,-1.609962,-0.791207,-1.405148,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [191]:
new_num_cols = list(pipe.named_steps["preprocessor"].named_transformers_["num"].get_feature_names_out())
new_cat_cols = list(pipe.named_steps["preprocessor"].named_transformers_["cat"].get_feature_names_out())
all_cols     = new_num_cols + new_cat_cols

In [192]:
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_clusters = kmeans.fit_predict(df_scaled)

This is hard to visualize directly since we have many variables. We can use Principal Component Analysis (PCA) for dimensionality reduction and plot the first two components. The first principal component is the direction along which the observations vary the most.

In [193]:
from sklearn.decomposition import PCA

In [194]:
pca = PCA(n_components=2)
pca.fit(df_scaled)
scores = pca.transform(df_scaled)
pca_components = pca.components_
pca_explained_variance = pca.explained_variance_
pca_mean = pca.mean_
scores[:, 0]

array([-1.88619617, -1.68839776, -1.6426034 , ...,  0.1581532 ,
        1.20200114, -0.06268867], shape=(2000,))
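
Before reading too much into a 2D PCA plot, it can help to check how much of the total variance the two components actually retain. The sketch below shows one way to do that with `explained_variance_ratio_`. It uses synthetic data with hypothetical per-feature scales, so the numbers it prints say nothing about the customer data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data (hypothetical): 5 features, with most of the variance
# deliberately placed in the first two
X = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0])

pca_demo = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each retained component
ratios = pca_demo.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the summed ratio is small, clusters that look overlapping in the 2D plot may still be separated along directions the plot does not show.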

In [266]:
def plot2d(df: pl.DataFrame, clusters, title):
    """Scatter the first two PCA scores, colored by cluster label.

    Relies on the global `scores` array computed in the PCA cell.
    """
    # with_columns returns a new frame, so the input df is not modified
    model_pca_df = df.with_columns([
        pl.Series("PC1", scores[:, 0]),
        pl.Series("PC2", scores[:, 1]),
        pl.Series('Cluster', clusters.astype(str))
    ])

    chart = alt.Chart(model_pca_df).mark_circle(size=200).encode(
        x=alt.X('PC1:Q', title="PC1"),
        y=alt.Y('PC2:Q', title="PC2"),
        color=alt.Color("Cluster:N"),
        tooltip=["PC1", "PC2"]
    ).properties(
        width=700,
        height=400,
        title=title
    ).interactive()
    return chart

In [195]:
df_copy = df.__copy__()
kmeans_pca_df = df_copy.with_columns([
    pl.Series("PC1", scores[:, 0]),
    pl.Series("PC2", scores[:, 1]),
    pl.Series('Cluster', kmeans_clusters.astype(str))
])

alt.Chart(kmeans_pca_df).mark_circle(size=200).encode(
    x=alt.X('PC1:Q', title="PC1"),
    y=alt.Y('PC2:Q', title="PC2"),
    color=alt.Color("Cluster:N"),
    tooltip=["PC1", "PC2"]
).properties(
    width=700,
    height=400,
    title="PCA"
).interactive()

In [197]:
kmeans_pca_df

Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size,PC1,PC2,Cluster
str,i64,i64,i64,str,i64,i64,f64,f64,str
"""Male""",19,15000,39,"""Healthcare""",1,4,-1.886196,0.1402,"""1"""
"""Male""",21,35000,81,"""Engineer""",3,3,-1.688398,-1.012543,"""1"""
"""Female""",20,86000,6,"""Engineer""",1,1,-1.642603,0.503401,"""1"""
"""Female""",23,59000,77,"""Lawyer""",0,2,-1.967027,-0.89113,"""1"""
"""Female""",31,38000,40,"""Entertainment""",2,6,-0.826954,0.361305,"""1"""
…,…,…,…,…,…,…,…,…,…
"""Female""",71,184387,40,"""Artist""",8,7,2.581793,0.417899,"""0"""
"""Female""",91,73158,32,"""Doctor""",7,7,1.009244,1.645262,"""0"""
"""Male""",87,90961,14,"""Healthcare""",9,2,0.158153,1.561624,"""0"""
"""Male""",77,182109,4,"""Executive""",7,2,1.202001,1.244837,"""0"""


In [206]:
stats = kmeans_pca_df.group_by("Cluster").agg([
    pl.col("Age").mean().alias("Avg_Age"),
    pl.col("Annual Income ($)").mean().alias("Avg_Income"),
])
stats

Cluster,Avg_Age,Avg_Income
str,f64,f64
"""0""",73.567227,120205.218487
"""1""",26.60687,102126.216603


In [207]:
from sklearn.metrics import silhouette_score

In [208]:
sil_score = silhouette_score(df_scaled, kmeans_clusters)
sil_score

0.11703237208125648

The silhouette score is low, about 0.12 on a scale from -1 to 1, where values near 0 indicate overlapping clusters. The PCA plot tells the same story: the two clusters overlap heavily in the first two principal components. You can even see it in the shape of the data: no matter how many clusters we choose, everything is packed into one area for now.
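
For a sense of how the silhouette score behaves, the sketch below compares two synthetic datasets: one with tight, well-separated blobs and one where the blobs are smeared together. The helper `kmeans_silhouette`, the blob centers, and the `cluster_std` values are all assumptions made for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def kmeans_silhouette(cluster_std: float) -> float:
    """Fit 2-cluster KMeans on synthetic blobs and return the silhouette score."""
    X, _ = make_blobs(n_samples=400, centers=[[-3, 0], [3, 0]],
                      cluster_std=cluster_std, random_state=42)
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)

separated = kmeans_silhouette(cluster_std=0.5)    # tight, far-apart blobs
overlapping = kmeans_silhouette(cluster_std=4.0)  # blobs smeared together
print(separated, overlapping)
```

The well-separated case should score much closer to 1 than the overlapping case, which is the pattern to keep in mind when reading the score obtained on the customer data.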

In [215]:
spec = SpectralClustering(n_clusters=2, random_state=42)
spec_clusters = spec.fit_predict(df_scaled)

In [267]:
plot2d(df, spec_clusters, "Spectral Clustering PCA")

In [217]:
gmm = GaussianMixture(n_components=2, random_state=42)
gmm_clusters = gmm.fit_predict(df_scaled)

In [None]:
plot2d(df, gmm_clusters, "Gaussian Mixture PCA")

In [251]:
def elbow_plot(df: pl.DataFrame, max_clusters: int=10, min_clusters: int=1):
    wcss = []
    for i in range(min_clusters, max_clusters+1):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
        kmeans.fit(df)
        wcss.append(kmeans.inertia_)
    df_elbow = pl.DataFrame({
        'Number of clusters': range(min_clusters, max_clusters+1),
        "WCSS": wcss 
    })

    chart = alt.Chart(df_elbow).mark_line(point=True).encode(
        x=alt.X("Number of clusters:Q", title="Number of Clusters (K)"),
        y=alt.Y("WCSS:Q", title="Within-Cluster Sum of Squares/Within-Cluster Variation")
    ).properties(
        width=700, 
        height=400,
        title='Elbow Method for Optimal K Clusters'
    )
    return chart.interactive()

In [240]:
elbow_plot(df_scaled)

shape: (10, 2)
┌────────────────────┬──────────────┐
│ Number of clusters ┆ WCSS         │
│ ---                ┆ ---          │
│ i64                ┆ f64          │
╞════════════════════╪══════════════╡
│ 1                  ┆ 12636.501    │
│ 2                  ┆ 11105.262725 │
│ 3                  ┆ 10040.773681 │
│ 4                  ┆ 9329.103123  │
│ 5                  ┆ 8681.596812  │
│ 6                  ┆ 8200.122053  │
│ 7                  ┆ 7845.873965  │
│ 8                  ┆ 7552.725581  │
│ 9                  ┆ 7305.401698  │
│ 10                 ┆ 6995.815573  │
└────────────────────┴──────────────┘


- There is no clear elbow, so the plot does not point to an obvious optimal number of clusters.
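
One way to double-check this reading is to look at the relative drop in WCSS for each additional cluster, using the values from the table above. This is only a rough sketch of the idea, not a formal criterion; the variable name `rel_drops` is made up for the example.

```python
# WCSS values copied from the elbow table above (K = 1..10)
wcss = [12636.501, 11105.262725, 10040.773681, 9329.103123, 8681.596812,
        8200.122053, 7845.873965, 7552.725581, 7305.401698, 6995.815573]

# Relative drop in WCSS when going from K to K+1
rel_drops = [(wcss[i] - wcss[i + 1]) / wcss[i] for i in range(len(wcss) - 1)]

for k, drop in enumerate(rel_drops, start=1):
    print(f"K={k} -> K={k + 1}: {drop:.1%} lower WCSS")
```

The drops shrink gradually, from roughly 12% for the first split to around 3 to 4% for the last few, with no single K after which the improvement suddenly collapses. That is consistent with the lack of a visible elbow.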

In [220]:
import plotly.express as px 

In [222]:
pca = PCA(n_components=3, random_state=42)
pca.fit(df_scaled)
scores = pca.transform(df_scaled)
pca_components = pca.components_
pca_explained_variance = pca.explained_variance_
pca_mean = pca.mean_
scores[:, 0]

array([-1.88619617, -1.68839776, -1.6426034 , ...,  0.1581532 ,
        1.20200114, -0.06268867], shape=(2000,))

In [241]:
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_clusters2 = kmeans.fit_predict(df_scaled)

In [244]:
kmeans_df = df.__copy__()
kmeans_pca3d_df = kmeans_df.with_columns([
    pl.Series("PC1", scores[:, 0]),
    pl.Series("PC2", scores[:, 1]),
    pl.Series("PC3", scores[:, 2]),
    pl.Series('Cluster', kmeans_clusters2.astype(str))
])

In [245]:
fig = px.scatter_3d(kmeans_pca3d_df, x="PC1", y="PC2", z="PC3", color="Cluster")
fig.show()

In [252]:
spec = SpectralClustering(n_clusters=3, random_state=42)
spec_clusters2 = spec.fit_predict(df_scaled)

In [254]:
spec_df = df.__copy__()
spec_pca3d_df = spec_df.with_columns([
    pl.Series("PC1", scores[:, 0]),
    pl.Series("PC2", scores[:, 1]),
    pl.Series("PC3", scores[:, 2]),
    pl.Series('Cluster', spec_clusters2.astype(str))
])

In [255]:
fig = px.scatter_3d(spec_pca3d_df, x="PC1", y="PC2", z="PC3", color="Cluster")
fig.show()

In [248]:
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_clusters2 = gmm.fit_predict(df_scaled)

In [249]:
gmm_df = df.__copy__()
gmm_pca3d_df = gmm_df.with_columns([
    pl.Series("PC1", scores[:, 0]),
    pl.Series("PC2", scores[:, 1]),
    pl.Series("PC3", scores[:, 2]),
    pl.Series('Cluster', gmm_clusters2.astype(str))
])

In [250]:
fig = px.scatter_3d(gmm_pca3d_df, x="PC1", y="PC2", z="PC3", color="Cluster")
fig.show()

# Impact
---

# References
---