# K-means Clustering
---

### Objectives:

- Use scikit-learn's k-means clustering to cluster data

- Apply k-means clustering to real-world data for customer segmentation

### Installs:

In [0]:
%%capture
%pip install numpy==2.4.0
%pip install pandas==2.3.3
%pip install scikit-learn==1.8.0
%pip install matplotlib==3.10.8
%pip install seaborn==0.13.0
%pip install plotly==6.5.0

In [0]:
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data analysis and visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import plotly.express as px

# Data modeling / clustering / metrics
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Warnings
import warnings
warnings.filterwarnings('ignore')

### Load the data

In [0]:
df = pd.read_csv('./data/Cust_Segmentation.csv')

### Verify successful load with some randomly selected records


In [0]:
df.sample(9)


### Understand the data

---

This dataset contains historical records of a bank's customer base, including demographic information and financial behavior patterns. The data provides a comprehensive view of the customers' credit profiles, employment stability, and income levels.

The goal is to apply **Customer Segmentation** (Clustering) to partition this customer base into distinct groups with similar characteristics. This segmentation will allow the business to define targeted marketing strategies, identify high-risk profiles, and allocate resources effectively (e.g., offering premium cards to high-income/low-debt clusters or credit recovery plans to high-risk groups).

* **Customer Id** *Categorical* - Unique identifier for each customer.

* **Age** *Continuous* - Customer's age in years.

* **Edu** *Categorical* - Education level (encoded numerically, likely representing degrees like High School, Bachelor, Master, etc.).

* **Years Employed** *Continuous* - Number of years the customer has been employed in their current or relevant jobs.

* **Income** *Continuous* - Annual household income of the customer.

* **Card Debt** *Continuous* - Amount of debt currently held on credit cards.

* **Other Debt** *Continuous* - Amount of debt held in other forms (loans, mortgages, etc.).

* **Defaulted** *Categorical* - Binary indicator of past default history (0 = Never defaulted, 1 = Has defaulted).

* **Address** *Categorical* - Anonymized or coded geographic location of the customer.

* **DebtIncomeRatio** *Continuous* - The ratio of the customer's total monthly debt payments to their gross monthly income (a key indicator of financial health).
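The segmentation goal described above can be sketched on synthetic data before touching the customer file. This is a minimal sketch, not the notebook's final model: the blob data from `make_blobs`, the choice of `n_clusters = 3`, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features (assumption: 3 well-separated groups)
X_demo, _ = make_blobs(
    n_samples = 300,
    centers = 3,
    n_features = 4,
    cluster_std = 1.0,
    random_state = 33
)

# Standardize so every feature contributes on the same scale to Euclidean distance
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit k-means; n_init = 10 runs 10 random initializations and keeps the best one
kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 33)
labels = kmeans.fit_predict(X_scaled)

# Internal validation: silhouette (higher is better), Davies-Bouldin (lower is better)
sil = silhouette_score(X_scaled, labels)
dbi = davies_bouldin_score(X_scaled, labels)

print(f'Cluster sizes: {np.bincount(labels)}')
print(f'Silhouette score: {sil:.2f}')
print(f'Davies-Bouldin score: {dbi:.2f}')
```

On the real data, `X_demo` would be replaced by the numeric customer columns, and the number of clusters would be chosen by comparing these scores across several values of k.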

### Explore the data
First, consider a statistical summary of the data.

In [0]:
df.describe()

In [0]:
df.info()

In [0]:
df.isnull().sum()

The dataset includes the variable `Address`, a coded customer location. It is categorical, and KMeans cannot use it directly: the algorithm assigns points to clusters by Euclidean distance, which is not meaningful for unordered categories. The column is therefore dropped.
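To see why integer-coding a nominal variable misleads a distance-based method, the sketch below uses three made-up address codes (hypothetical values, not taken from the dataset) and maps them to integers. The resulting distances reflect only the arbitrary order of the codes.

```python
import numpy as np

# Hypothetical address codes mapped to integers in arbitrary order
codes = {'AREA_A': 0, 'AREA_B': 1, 'AREA_C': 2}

def euclidean(a, b):
    """Euclidean distance between two points given as sequences of numbers."""
    return float(np.linalg.norm(np.asarray(a, dtype = float) - np.asarray(b, dtype = float)))

d_ab = euclidean([codes['AREA_A']], [codes['AREA_B']])
d_ac = euclidean([codes['AREA_A']], [codes['AREA_C']])

# AREA_C looks twice as far from AREA_A as AREA_B does, purely because of the encoding
print(d_ab, d_ac)
```

Relabeling the codes in a different order would change these distances, and with them the cluster assignments, even though nothing about the customers changed.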

In [0]:
df = df.drop(columns = ['Address'])
df.head()

There are some null values in the `Defaulted` variable. I drop those rows to keep only valid, complete records, since imputing them with the median, mean, or zero could bias the distances that KMeans relies on.

In [0]:
df = df.dropna()
df.info()
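The concern about imputation can be illustrated on a toy column. This sketch uses made-up values rather than the customer data; `toy` is a hypothetical Series standing in for a binary flag with missing entries, and it compares dropping those entries with filling them with zero or the mean.

```python
import numpy as np
import pandas as pd

# Toy binary flag with missing entries (made-up values)
toy = pd.Series([0, 1, 1, 0, np.nan, np.nan, 1, np.nan])

dropped = toy.dropna()                 # keep only observed values
zero_filled = toy.fillna(0)            # treats every missing value as "never defaulted"
mean_filled = toy.fillna(toy.mean())   # introduces a value that no observed record has

print(f'Observed default rate: {dropped.mean():.3f}')
print(f'Rate after zero fill: {zero_filled.mean():.3f}')
print(f'Distinct values after mean fill: {sorted(mean_filled.unique())}')
```

In this toy case, zero fill lowers the default rate from 0.600 to 0.375, while mean fill adds a third value between the two real groups, which is the kind of distortion that dropping the rows avoids.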

### Visualize features

In [0]:
sns.set_style('whitegrid')

# Define Figure
plt.figure(figsize = (15, 10))

for i, col in enumerate(df.columns):
    plt.subplot(3, 3, i + 1) # Position in a 3x3 grid (one panel per column)

    # Histogram
    sns.histplot(
        data = df[col],
        kde = True,
        color = 'teal'
    )

    plt.title(f'Distribution of: {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


The California Housing data used later in this notebook contains extreme anomalies, such as block groups with an average occupancy above 1200 people per household or an average of more than 140 rooms, which likely indicate data errors or the presence of non-residential institutions.

### Checking the correlations between the variables

In [0]:
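from sklearn.datasets import fetch_california_housing  # 'MedHouseVal' exists only in the housing data, not in the customer df
fetch_california_housing(as_frame = True).frame.corr()['MedHouseVal'].abs().sort_values(ascending = False)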

In [0]:
plt.rc('font', size = 10)
fig, ax = plt.subplots(figsize = (8, 5))
sns.heatmap(df.corr(), annot = True, cmap = 'coolwarm', linewidths = 0.5, ax = ax)
ax.set_title('Correlation Features Matrix')
plt.tight_layout()
plt.show()

In [0]:
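# Collecting data ('MedHouseVal' is the California Housing target, so use the housing frame)
from sklearn.datasets import fetch_california_housing
correlation_values = fetch_california_housing(as_frame = True).frame.corr()['MedHouseVal']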

if 'MedHouseVal' in  correlation_values.index:
    plot_data = correlation_values.drop('MedHouseVal').sort_values()

else:
    plot_data = correlation_values.sort_values()


# Colors 
colors = ['#2196f3' if x > 0 else '#f44336' for x in plot_data]

# Figure
plt.figure(figsize = (10, 6))

# Plot
plot_data.plot(
    kind = 'barh', 
    color = colors, 
    edgecolor = 'black'
)

plt.title('Correlation of Features with MedHouseVal', fontsize = 10)
plt.xlabel('Correlation Coefficient (Pearson)', fontsize = 10)
plt.axvline(x = 0, color = 'black', linestyle = '--', linewidth = 1)
plt.grid(axis = 'x', linestyle = '--', alpha = 0.7)

sns.despine(left = True, bottom = True) 

plt.tight_layout()
plt.show()

### Selected features and train-test split

In [0]:
# Load the California Housing dataset
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X, y = data.data, data.target
print(f'Data shape: {X.shape}')
print(f'Target shape: {y.shape}')

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True, random_state = 33)
print(f'The shape X_train: {X_train.shape}')
print(f'The shape y_train: {y_train.shape}')
print(f'The shape X_test: {X_test.shape}')
print(f'The shape y_test: {y_test.shape}')

#### Train data

In [0]:
X_train[0: 5]

#### Test data

In [0]:
X_test[0: 5]

### XGBoost Regressor Model

In [0]:
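# Create a model object (requires the xgboost package, which the Installs cell above does not include)
from xgboost import XGBRegressor

XGB = XGBRegressor(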
    n_estimators = 1000,
    learning_rate = 0.05,
    max_depth = 6,
    min_child_weight = 3,
    gamma = 0.1,
    subsample = 0.8,
    colsample_bytree = 0.8,
    objective = 'reg:squarederror',
    eval_metric = 'rmse',
    random_state = 33,    
)

# Train the model in the training data
XGB.fit(X_train, y_train)

In [0]:
# Predict the target variable in the test data
y_pred = XGB.predict(X_test)

### Model Evaluation

#### Metrics

In [0]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score

print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred):.2f}')
print(f'Mean squared error: {mean_squared_error(y_test, y_pred):.2f}')
print(f'Root mean squared error: {root_mean_squared_error(y_test, y_pred):.2f}')
print(f'R2-score: {r2_score(y_test, y_pred):.2f}')

####  Visualize model outputs

In [0]:
# XGBoost plot

# Standard deviation of y_test
std_y = np.std(y_test)

# Figure
plt.figure(figsize=(14, 6))

plt.scatter(
  y_test, 
  y_pred, 
  alpha = 0.5, 
  color = 'orange', 
  ec = 'k'
)

plt.plot(
  [y_test.min(), 
   y_test.max()], 
  [y_test.min(), 
   y_test.max()], 
  'k--', 
  lw = 2,
  label = 'perfect model'
)
plt.plot(
  [y_test.min(), 
   y_test.max()], 
  [y_test.min() + std_y, 
   y_test.max() + std_y], 
  'r--', 
  lw = 1, 
  label = '+/-1 Std Dev'
)

plt.plot(
  [y_test.min(), 
   y_test.max()], 
  [y_test.min() - std_y, y_test.max() - std_y], 
  'r--', 
  lw = 1
)

plt.ylim(0, 6)
plt.title('XGBoost Predictions vs Actual')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.tight_layout()
plt.show()

In [0]:
# Use the housing feature names; the customer df columns do not match the model's inputs
importances = pd.Series(XGB.feature_importances_, index = data.feature_names)
importances

In [0]:
# Data collect
data_ax = importances.sort_values().reset_index()
data_ax.columns = ['Feature', 'Importances']

# Figure
plt.figure(figsize = (8, 4))
sns.set_style('whitegrid')

# Barplot
sns.barplot(
    data = data_ax,
    y = 'Feature',
    x = 'Importances',
    edgecolor = 'white',
    hue = 'Importances',
    dodge = False,
    palette = 'crest'
)

plt.axvline(x = 0, color = 'black', linestyle = '--', linewidth = 1)
plt.title('Features Importance', fontsize = 15)
plt.xlabel('Importances', fontsize = 12)
plt.ylabel('Features', fontsize = 12)
plt.legend(title = 'Importance', loc = 'upper right', fontsize = 12)

plt.tight_layout()
plt.show()


### Conclusion

---

- The developed **XGBoost Regressor** shows high predictive accuracy on the held-out test set. With an **R² Score of 0.87**, the model explains **87% of the variance** in median house values, leaving the remaining 13% to unobserved variables or random noise. This indicates a robust fit that captures the complex, non-linear dynamics of the California real estate market.

- The analysis of **Feature Importance** reveals that economic capacity is the single most critical determinant of housing value, far outweighing structural characteristics:

  - **`MedInc` (Median Income)** is the undisputed dominant predictor, accounting for **45.7%** of the model's decision power. This confirms that the purchasing power of the neighborhood is nearly half of the equation for determining price.
  - **Location and density features** (`AveOccup` at 12.6%, `Longitude` at 12.1%, and `Latitude` at 10.2%) form the second tier of influence. Combined, they contribute about **34.9%**, consistent with the real estate principle that location and neighborhood profile are critical value drivers.
  - Structural features like **`HouseAge` (6.9%)** and **`AveRooms` (6.5%)** play a tertiary role, suggesting that the physical attributes of the house are less important than *where* it is located and *who* lives there.


- Detailed error metrics indicate precise estimation capabilities given the scale of the target variable:

  - The **Root Mean Squared Error (RMSE) of 0.42** implies that the model's predictions typically deviate by approximately **$42,000** (0.42 × $100k) from the actual values. Considering the variance in real estate prices, this is a competitive margin of error.
  - The **Mean Absolute Error (MAE) of 0.28** is lower, meaning the average absolute prediction error is about **$28,000**. Because MAE weights all errors equally while RMSE penalizes large errors more heavily, the gap between the two suggests that a minority of large errors inflates the RMSE.

- In summary, the model prioritized **socioeconomic factors (`MedInc`)** and **location** over the remaining features (`AveBedrms`, `Population`), indicating that XGBoost captured the non-linear relationship between wealth, geography, and real estate value.
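
The unit conversions and the combined importance share quoted in the conclusion can be checked with a few lines of arithmetic. The sketch assumes, as the text does, that the target is expressed in units of $100,000; the numbers are the ones reported above, not recomputed from the model, and the variable names are illustrative.

```python
# Reported metrics, in units of $100,000 (values taken from the conclusion above)
UNIT_DOLLARS = 100_000
rmse, mae, r2 = 0.42, 0.28, 0.87

rmse_dollars = rmse * UNIT_DOLLARS
mae_dollars = mae * UNIT_DOLLARS
unexplained_share = 1 - r2

# Combined share of the second-tier features (AveOccup, Longitude, Latitude), in percent
second_tier = round(12.6 + 12.1 + 10.2, 1)

print(f'RMSE: ${rmse_dollars:,.0f}')
print(f'MAE: ${mae_dollars:,.0f}')
print(f'Unexplained variance: {unexplained_share:.0%}')
print(f'Second-tier importance: {second_tier}%')
```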