# Introduction
---
This project groups customers into clusters based on their features.

# What is clustering and how does it work?
---
Clustering is an unsupervised learning technique that groups similar data points together into clusters. 
It works by finding patterns and structure in unlabeled data, so that points within a cluster are more similar to each other than to points in other clusters. Common algorithms include K-means, agglomerative clustering, and DBSCAN. 
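
To make the idea concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not how scikit-learn implements it; the `kmeans` function, the toy points, and the hand-picked starting centroids are all assumptions made for illustration.

```python
from math import dist
from statistics import fmean

def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = list(centroids)  # work on a copy
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(fmean(axis) for axis in zip(*members))
    return labels, centroids

# Toy (hypothetical) points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The two steps repeat until the assignments stop changing; a fixed number of iterations is used here only to keep the sketch short.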

# Data Introduction
---

- Customer ID
- Gender
- Age
- Annual Income
- Spending Score - Score assigned by the shop, based on customer behavior and spending nature
- Profession
- Work Experience - in years
- Family Size


# Decision Making for Modeling
---
![Image](images/AlgorithmCheatsheet.png)

# Data Info
---

In [174]:
import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

In [175]:
# load data
df = pl.read_csv("data/Customers.csv")
df.head()

CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
i64,str,i64,i64,i64,str,i64,i64
1,"""Male""",19,15000,39,"""Healthcare""",1,4
2,"""Male""",21,35000,81,"""Engineer""",3,3
3,"""Female""",20,86000,6,"""Engineer""",1,1
4,"""Female""",23,59000,77,"""Lawyer""",0,2
5,"""Female""",31,38000,40,"""Entertainment""",2,6


In [176]:
# check if there's more than 50 samples, there should be 2000
df.shape

(2000, 8)

I want to predict which profession someone might be in based on their income, family size, and other factors, so the target, Profession, is a category. 

Even though Profession is a label, let's set it aside and answer "NO" to the labeled-data question, which brings us to the clustering section of the sklearn algorithm cheat sheet. 

The next question is whether we know how many categories there are, which we can find out. We'll also check the null count, since some people might not have a profession, and fill those nulls with a new category called "No Profession".

In [177]:
df.null_count()

CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,35,0,0


In [178]:
df = df.with_columns(
    pl.col("Profession").fill_null("No Profession")
)

In [179]:
df.null_count()

CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0


# Data Visualization
---

Now let's check how many categories we have.

In [180]:
df["Profession"].value_counts(sort=True)

Profession,count
str,u32
"""Artist""",612
"""Healthcare""",339
"""Entertainment""",234
"""Engineer""",179
"""Doctor""",161
"""Executive""",153
"""Lawyer""",142
"""Marketing""",85
"""Homemaker""",60
"""No Profession""",35


We have 10 categories for Profession, including the "No Profession" category we added. Let's visualize their distribution:

In [181]:
chart = df["Profession"].value_counts(sort=True).plot.bar(x="Profession", y="count")
chart = chart.properties(width=700, height=400, title="Profession Count")
chart

- Artist is the most common profession in this dataset (612 of 2,000 customers), followed by Healthcare (339). 
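
To put the counts in proportion, the short sketch below converts them to percentages. The numbers are copied from the `value_counts` output above; the variable names are just for illustration.

```python
# Counts copied from the value_counts() output above.
profession_counts = {
    "Artist": 612, "Healthcare": 339, "Entertainment": 234, "Engineer": 179,
    "Doctor": 161, "Executive": 153, "Lawyer": 142, "Marketing": 85,
    "Homemaker": 60, "No Profession": 35,
}

total = sum(profession_counts.values())

# Share of each profession as a percentage of all customers.
shares = {name: round(100 * n / total, 2) for name, n in profession_counts.items()}

for name, pct in shares.items():
    print(f"{name:<15}{pct:>6.2f}%")
```

Artist comes out at roughly 31% of customers, so it is the largest group without being a majority.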

Let's look at the gender distribution.

In [182]:
alt.Chart(df).mark_bar().encode(
    x="Gender",
    y="count(Gender)"
).properties(width=700, height=400)

- The majority are female, so the dataset is skewed towards women. We should check whether the measures of central tendency differ significantly between genders.
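
As a rough illustration of that check, the sketch below compares the mean and median income per gender using only the five rows shown by `df.head()`. Five rows are far too few to conclude anything; the `summarize` helper is hypothetical and only shows the shape of the comparison.

```python
from statistics import fmean, median

# The five rows shown by df.head(): (gender, age, annual income, spending score).
rows = [
    ("Male", 19, 15000, 39),
    ("Male", 21, 35000, 81),
    ("Female", 20, 86000, 6),
    ("Female", 23, 59000, 77),
    ("Female", 31, 38000, 40),
]

def summarize(rows, gender, col):
    """Mean and median of one column for one gender."""
    values = [r[col] for r in rows if r[0] == gender]
    return {"n": len(values), "mean": fmean(values), "median": median(values)}

for gender in ("Female", "Male"):
    print(gender, summarize(rows, gender, col=2))  # col=2 is annual income
```

On the full dataset the same comparison could be done with something like `df.group_by("Gender").agg(pl.col("Annual Income ($)").median())`.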

In [183]:
def vconcat_bar(focus_var: str, title: str, color: str | None = None, width: int = 700, height: int = 400):
    """Stack one boxplot per column, each split by `focus_var` (despite the name, these are boxplots)."""
    color = focus_var if color is None else color

    def boxplot(col: str) -> alt.Chart:
        # one boxplot of `col`, grouped by the focus variable
        return alt.Chart(df).mark_boxplot().encode(
            x=alt.X(col),
            y=alt.Y(focus_var),
            color=color
        ).properties(width=width, height=height)

    # Age comes first; the categorical columns are skipped
    base = boxplot("Age")
    for col in df.columns:
        if col in ("Age", "Gender", "Profession"):
            continue
        base = base & boxplot(col)

    return base.properties(title=alt.Title(title))


In [184]:
vconcat_bar("Gender", "Checking Class Balance")

The distributions look nearly identical for both genders, so gender does not appear to be related to any of the other variables. We can also check Profession for anything unusual.

Next, we move on to the following step of the sklearn algorithm cheat sheet.

Since we have fewer than 10k samples, we can go straight to KMeans and, if that does not work well, try Spectral Clustering and GMM.
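
The path taken through the cheat sheet can be written down as a small function. This is only a sketch: `pick_clustering_algorithm` is a hypothetical helper, and the branches not discussed in this notebook (MiniBatchKMeans, MeanShift, and BayesianGaussianMixture standing in for VBGMM) come from the commonly published version of the cheat sheet and should be treated as assumptions.

```python
def pick_clustering_algorithm(n_samples: int, knows_n_clusters: bool) -> list[str]:
    """Hypothetical helper mirroring the clustering branch of the cheat sheet.

    Returns candidate estimators in the order they would be tried.
    """
    if n_samples <= 50:
        # The cheat sheet's first gate: too little data to model.
        return []
    if knows_n_clusters:
        if n_samples < 10_000:
            # The branch followed in this notebook.
            return ["KMeans", "SpectralClustering", "GaussianMixture"]
        return ["MiniBatchKMeans"]  # assumption: large-data branch
    if n_samples < 10_000:
        return ["MeanShift", "BayesianGaussianMixture"]  # assumption
    return []  # assumption: no recommendation on this branch

# This dataset: 2000 rows, and the number of categories can be counted.
print(pick_clustering_algorithm(n_samples=2000, knows_n_clusters=True))
```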

In [185]:
vconcat_bar("Profession", "Checking Class Balance - Profession")

Now, we can move on to preprocessing the data.

# Preprocessing
---
Our plan is to standardize the numeric variables so they are on comparable scales and to one-hot encode the categorical variables, all inside a pipeline. After that, we can move on to our KMeans model. 
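
Before building the pipeline, here is what the two transformations do, sketched in plain Python on the five rows shown by `df.head()`. The variable names are illustrative, and the resulting z-scores will not match the notebook's output, because the real scaler is fitted on all 2,000 rows.

```python
from statistics import fmean, pstdev

# Age and Gender columns of the five head() rows.
ages = [19, 21, 20, 23, 31]
genders = ["Male", "Male", "Female", "Female", "Female"]

# Standardization: subtract the mean, divide by the population standard
# deviation (StandardScaler also uses the population form, ddof=0).
mu, sigma = fmean(ages), pstdev(ages)
ages_z = [(a - mu) / sigma for a in ages]

# One-hot encoding: one 0/1 column per category, in sorted category order.
categories = sorted(set(genders))  # ['Female', 'Male']
genders_onehot = [[1.0 if g == c else 0.0 for c in categories] for g in genders]

print([round(z, 3) for z in ages_z])
print(categories, genders_onehot[0])  # first row is Male, so [0.0, 1.0]
```

After standardization every numeric column has mean 0 and standard deviation 1, so a column measured in dollars no longer dominates the distance calculation over one measured in years.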

In [186]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [187]:
df = df.drop("CustomerID")

In [188]:
numeric_cols     = [col for col in df.columns if df[col].dtype in (pl.Float64, pl.Int64)]
categorical_cols = [col for col in df.columns if df[col].dtype in (pl.Categorical, pl.String)]

In [189]:
numeric_cols, categorical_cols

(['Age',
  'Annual Income ($)',
  'Spending Score (1-100)',
  'Work Experience',
  'Family Size'],
 ['Gender', 'Profession'])

In [190]:
numeric_transformer = Pipeline(
    steps=[("standardizer", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("OneHot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
        
    ],
    remainder="passthrough"
)

preprocessor.set_output(transform="polars")

pipe = Pipeline(steps=[("preprocessor", preprocessor)])

In [191]:
pipe

Pipeline
- steps: [('preprocessor', ...)]
- transform_input: None
- memory: None
- verbose: False

ColumnTransformer
- transformers: [('num', ...), ('cat', ...)]
- remainder: 'passthrough'
- sparse_threshold: 0.3
- n_jobs: None
- transformer_weights: None
- verbose: False
- verbose_feature_names_out: True
- force_int_remainder_cols: 'deprecated'

StandardScaler
- copy: True
- with_mean: True
- with_std: True

OneHotEncoder
- categories: 'auto'
- drop: None
- sparse_output: False
- dtype: <class 'numpy.float64'>
- handle_unknown: 'ignore'
- min_frequency: None
- max_categories: None
- feature_name_combiner: 'concat'


Now that our preprocessor is ready, we can test a KMeans model as a baseline.  

# Modeling
---

In [192]:
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture

In [193]:
df_scaled = pipe.fit_transform(df)
df_scaled[:3]

num__Age,num__Annual Income ($),num__Spending Score (1-100),num__Work Experience,num__Family Size,cat__Gender_Female,cat__Gender_Male,cat__Profession_Artist,cat__Profession_Doctor,cat__Profession_Engineer,cat__Profession_Entertainment,cat__Profession_Executive,cat__Profession_Healthcare,cat__Profession_Homemaker,cat__Profession_Lawyer,cat__Profession_Marketing,cat__Profession_No Profession
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
-1.054089,-2.093501,-0.428339,-0.791207,0.117497,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
-0.983723,-1.656133,1.075546,-0.281162,-0.390051,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-1.018906,-0.540845,-1.609962,-0.791207,-1.405148,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [194]:
new_num_cols = list(pipe.named_steps["preprocessor"].named_transformers_["num"].get_feature_names_out())
new_cat_cols = list(pipe.named_steps["preprocessor"].named_transformers_["cat"].get_feature_names_out())
all_cols     = new_num_cols + new_cat_cols

In [195]:
kmeans = KMeans(n_clusters=3)
kmeans_labels = kmeans.fit_predict(df_scaled)

In [196]:
kmeans_labels

array([1, 1, 1, ..., 2, 2, 0], shape=(2000,), dtype=int32)

In [197]:
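# elbow method: record the within-cluster sum of squares (inertia) for k = 1..10
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(df_scaled)
    wcss.append(model.inertia_)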
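
The `wcss` list refers to the within-cluster sum of squares: the sum of squared distances from each point to the centroid of its cluster. scikit-learn exposes this value on a fitted `KMeans` model as `inertia_`. The sketch below computes it by hand on toy data; the `wcss_score` function and the points are assumptions for illustration only.

```python
def wcss_score(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Toy (hypothetical) data: two pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: a single centroid at the overall mean.
one = wcss_score(points, [0, 0, 0, 0], [(5.0, 1.0)])
# k = 2: one centroid per visible group.
two = wcss_score(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(one, two)  # 104.0 4.0
```

The sharp drop from k = 1 to k = 2 is the "elbow": once k matches the real number of groups, adding more clusters reduces the score far more slowly.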

In [211]:
# the ColumnTransformer prefixes column names with a double underscore
alt.Chart(df_scaled).mark_boxplot().encode(
    x="num__Annual Income ($):Q",
    y="cat__Gender_Female:N"
)

In [202]:
df_scaled.plot.scatter(x="num__Annual Income ($):Q", y="num__Spending Score (1-100):Q")

In [199]:
# Altair encodes by column name, so attach the labels as a column first.
# The axes and the point mark are an assumed choice; the original cell left the fields blank.
clustered = df_scaled.with_columns(pl.Series("cluster", kmeans_labels))
alt.Chart(clustered).mark_point().encode(
    x=alt.X("num__Annual Income ($):Q"),
    y=alt.Y("num__Spending Score (1-100):Q"),
    color="cluster:N"
)

# Impact
---

# References
---