# EDA notebook

This notebook is used to explore the data iteratively. The actual transformations used in the production workflow are found in the dlt pipeline and the ML training workflow. Those are also more optimized (using Spark etc.), whereas other tools are used here as well.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyspark.sql.functions as F

from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import KNNImputer

In [0]:
train_data = spark.read.table("customer_segmentation.train_raw")

In [0]:
display(train_data)

In [0]:
display(train_data.describe())

In [0]:
display(train_data.summary())

In [0]:
display(dbutils.data.summarize(train_data))

## Proposed Bronze to Silver Cleaning Steps for training data

- Schema changes:
  - standardize column names (no whitespace, no caps, underscores for spaces)
  - convert Yes/No cols to binary 0/1 (Ever_Married, Graduated)
  - Age, Family Size should be Int not Str
  - Work Experience should be Int/Float not Str

- Quality checks:
  - report any missings
  - remove missings in ID (we don't know who this is, so we don't segment them) or Segmentation (target variable, data without it should not be in the train batch)
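
A minimal pandas sketch of the steps above, for exploration only (the production version belongs in the dlt pipeline). The helper names `standardize_name` and `bronze_to_silver` are hypothetical, and the standardized column names (`ever_married`, `family_size`, and so on) assume the raw names seen elsewhere in this notebook. Work Experience is kept as float here, which is one of the two options listed.

```python
import re

import pandas as pd


def standardize_name(name: str) -> str:
    """Lowercase a column name and replace runs of non-alphanumerics with one underscore."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def bronze_to_silver(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the proposed schema changes and quality checks to a pandas frame."""
    out = df.rename(columns=standardize_name)

    # Yes/No columns become nullable 0/1 integers; missing values stay missing.
    for col in ["ever_married", "graduated"]:
        out[col] = out[col].map({"Yes": 1, "No": 0}).astype("Int64")

    # Numeric columns stored as strings: unparseable values become missing.
    for col in ["age", "family_size"]:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    out["work_experience"] = pd.to_numeric(out["work_experience"], errors="coerce")

    # Report missing values per column before dropping anything.
    print(out.isnull().sum())

    # Rows without an ID or a target value are not usable for training.
    return out.dropna(subset=["id", "segmentation"]).reset_index(drop=True)
```

Calling `bronze_to_silver(train_data.toPandas())` would return the cleaned frame and print the per-column missing counts as a side effect.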
  

In [0]:
sns.pairplot(train_data.toPandas(), hue="Segmentation")

In [0]:
df_pd = train_data.toPandas()
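# Restrict to numeric columns; newer pandas versions raise on string columns in corr()
df_pd.select_dtypes(include="number").corr()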

transform categorical features

In [0]:
df_pd

In [0]:
# look into cols: cardinality, outliers, NAs
cat_cols = ["Gender", "Ever_Married", "Graduated", "Profession", "Spending_Score", "Var_1"]

for col in cat_cols:
    print(f"col: {col}, unique values: {df_pd[col].nunique()}")
    print(df_pd[col].unique())

No high cardinality. Let's look at NAs:
- Ever_Married and Graduated should be a clear yes or no: either impute, drop, or default to No
- None in Profession and Var_1 could be converted to "Other"
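
The two fill options above could be tried side by side with a small helper. This is only a sketch: `fill_categorical_nas` is a hypothetical name, and using the most frequent value is just one possible way to "impute" the yes/no columns.

```python
import pandas as pd


def fill_categorical_nas(df: pd.DataFrame, other_cols, mode_cols) -> pd.DataFrame:
    """Return a copy with NAs replaced by "Other" or by the column's most frequent value."""
    out = df.copy()
    for col in other_cols:
        out[col] = out[col].fillna("Other")
    for col in mode_cols:
        # mode() ignores NAs; take the first value if there are ties.
        # Assumes the column has at least one non-missing value.
        most_frequent = out[col].mode().iloc[0]
        out[col] = out[col].fillna(most_frequent)
    return out
```

For example, `fill_categorical_nas(df_pd, ["Profession", "Var_1"], ["Ever_Married", "Graduated"])` leaves `df_pd` itself untouched, so the result can be compared against the drop or default-to-No variants.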

In [0]:
na_percentage = df_pd[cat_cols].isnull().mean() * 100
print("Percentage of NAs in categorical columns:")
print(na_percentage)

# Display rows from the dataframe where any of the categorical columns have NAs
sample_nas = df_pd[df_pd[cat_cols].isnull().any(axis=1)]
display(sample_nas.head(10))

There seem to be some odd values, such as a lawyer who has not graduated and has no work experience. These could be students, but some are too old to be students. -> make Student col? CHECK AGAIN WHAT DEFINES GRADUATED

In [0]:
# Create a dataframe that contains rows with None values in more than one column
df_with_multiple_nones = df_pd[df_pd[cat_cols].isnull().sum(axis=1) > 1]
display(df_with_multiple_nones)

There aren't too many of these, so either drop or impute. The rest of the data in these rows seems to be present, so imputing is feasible.

In [0]:
# check overall missings
na_percentage = df_pd.isnull().mean() * 100
print("Percentage of NAs in all columns:")
print(na_percentage)

In [0]:
# Create a dataframe that contains rows with None values in more than one column
df_with_multiple_nones = df_pd[df_pd.isnull().sum(axis=1) > 1]
display(df_with_multiple_nones)

In [0]:
# handle missings
# Other category for Profession NAs
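# Assign the result back: inplace=True on a selected column is deprecated in recent pandas
df_pd['Profession'] = df_pd['Profession'].fillna('Other')
# Assume "No" for Ever_Married NAs
df_pd['Ever_Married'] = df_pd['Ever_Married'].fillna('No')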

In [0]:
display(df_pd)

In [0]:
target = df_pd["Segmentation"]
df_pd.drop(["ID","inserted_at","Segmentation"],axis=1,inplace=True)


# Encode to look for correlation etc.

- Label encode binary cols: Gender, Ever_Married, Graduated
- One-hot encode: Profession, Var_1
- Ordinal encode: Spending_Score


In [0]:
# Define categorical columns to be encoded differently
categorical_columns_onehot = ['Profession', 'Var_1']  # For OneHotEncoder
categorical_columns_ordinal = ['Spending_Score']
categorical_columns_custom = ["Ever_Married", "Graduated", "Gender"]  # For custom encoding

In [0]:
for col in categorical_columns_custom:
    print(df_pd[col].unique())

In [0]:
# Custom mapping for categorical columns
custom_mapping = [
    ({'Yes': 1, 'No': 0}, ["Ever_Married", "Graduated"]),  # Mapping for Ever_Married and Graduated
    ({'Male': 1, 'Female': 0}, ["Gender"])  # Mapping for Gender
]

# Custom encoding for Ever_Married, Graduated, and Gender
for mapping, columns in custom_mapping:
    for column in columns:
        df_pd[column] = df_pd[column].map(mapping)




In [0]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline



# Define transformers for the column transformer excluding 'custom' as it's already handled
transformers = [
    ('onehot', OneHotEncoder(), categorical_columns_onehot),
    ('ordinal', OrdinalEncoder(categories=[['Low', 'Average', 'High']]), categorical_columns_ordinal)
]

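# Initialize ColumnTransformer with remainder columns passed through;
# sparse_threshold=0 forces a dense array so it can go straight into pd.DataFrame
column_transformer = ColumnTransformer(transformers, remainder='passthrough', sparse_threshold=0)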

# Apply ColumnTransformer
df_encoded_array = column_transformer.fit_transform(df_pd)

# Extract the feature names for onehot encoded columns
onehot_columns = column_transformer.named_transformers_['onehot'].get_feature_names_out(categorical_columns_onehot)

# Combine all column names, ensuring the order matches the original DataFrame as closely as possible
# Start with onehot encoded columns, then ordinal, and finally the remainder
new_columns = list(onehot_columns) + categorical_columns_ordinal + [col for col in df_pd.columns if col not in categorical_columns_onehot and col not in categorical_columns_ordinal]

# Convert the array back to a DataFrame with the correct column names
df_encoded = pd.DataFrame(df_encoded_array, columns=new_columns)

display(df_encoded)

In [0]:
# Impute rest of cols with KNN for now
imputer = KNNImputer(n_neighbors=5)

# Impute missing values in the DataFrame
df_pd_imputed = pd.DataFrame(imputer.fit_transform(df_encoded), columns=df_encoded.columns)

display(df_pd_imputed)

In [0]:
features = df_pd_imputed.copy()

complete = features.copy()


In [0]:

# Add the target column back for the rest of the EDA, encoding Segmentation as simple integers
segmentation_mapping = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
complete['Segmentation'] = target.map(segmentation_mapping)

complete_no_ohe = complete.drop(columns=onehot_columns)
# Recalculating correlation
corr_matrix = complete_no_ohe.corr()
display(corr_matrix)

In [0]:
sns.heatmap(corr_matrix, annot=True)

In [0]:
sns.pairplot(complete_no_ohe, diag_kind='kde')

Correlations make sense and none are very large, so collinearity does not look like a concern.
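
To make this check less dependent on reading the heatmap by eye, the correlation matrix can be scanned for pairs above a cutoff. This is a sketch only: `high_correlation_pairs` is a hypothetical helper, and the default cutoff of 0.8 is an arbitrary assumption, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(corr: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List column pairs whose absolute correlation is at or above `threshold`."""
    # Keep only the strict upper triangle so each pair appears once, without the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "corr"]
    # NaN entries (masked cells) never pass this comparison, so they are dropped here.
    pairs = pairs[pairs["corr"].abs() >= threshold]
    return pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)
```

An empty result from `high_correlation_pairs(corr_matrix)` would support the reading above; lowering `threshold` shows the strongest pairs that do exist.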

Figure out for ML

- how to deal with NA in Work Experience and Family Size (this seems different than "zero")
- remove ID?
- how many unique professions are there? do we need to group, same with "Var_1"
- what are the relationships between the categorical columns?
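
For the last question, one common measure of association between two categorical columns is Cramér's V, computed from the chi-squared statistic of their contingency table. The sketch below uses only pandas and NumPy. `cramers_v` is a hypothetical helper, it applies no bias correction, and rows with a missing value in either column are ignored by `pd.crosstab`.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    if n == 0 or min(observed.shape) < 2:
        # Not defined when one of the columns has fewer than two categories.
        return float("nan")
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))
```

Looping this over all pairs in `cat_cols` (on a frame that still contains the original categorical values) would give a matrix comparable to the numeric correlation heatmap.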

steps for silver to gold(ml-features)
- remove ID, remove target col (Segmentation)
- encode and impute using spark

steps for silver to gold(target)
- take target 

steps for silver to gold(analytics)
- impute NAs (look for other methods than KNN)
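
One simple alternative to KNN is median imputation combined with a flag column, which also addresses the earlier point that a missing Work Experience or Family Size is different from zero. This is a sketch under assumptions: `impute_with_indicator` and the `_was_missing` suffix are hypothetical, and the median is only one possible fill value.

```python
import pandas as pd


def impute_with_indicator(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Fill NAs in numeric `cols` with the column median and flag which rows were filled."""
    out = df.copy()
    for col in cols:
        # The flag keeps "was missing" distinguishable from a genuine value such as zero.
        out[f"{col}_was_missing"] = out[col].isnull().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out
```

The columns must already be numeric, so this would run after the Str to Int/Float conversion proposed for the bronze to silver step.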