## Member 3: Clustering & Risk Interpretation
This dataset does not contain outcome labels such as defaults, losses, fraud, or failures.
Risk in this analysis is therefore neither predictive nor financial risk in the strict sense.

In this analysis, risk is defined as elevated uncertainty arising from a company’s:

	- operating profile
	- organisational structure
	- information transparency

Risk reflects how difficult a company is to assess, monitor, or compare, rather than whether it is “bad”.

### Types of risk that are in scope

Based on the available variables, this dataset supports structural and operational risk analysis, specifically:

**1. Transparency & disclosure risk**

These capture how much information a company makes available and how easy it is to assess.

	- Missing or incomplete public-facing information
	- Lack of financial or size indicators
	- Minimal descriptive detail about operations

**2. Operational complexity risk**

These capture the scale and coordination burden of a company’s operations.

	- Large number of operating sites
	- Large workforce size
	- Broad geographic footprint
	- Extensive IT infrastructure

**3. Organisational & ownership complexity risk**

These capture how layered or decentralised control and accountability may be.

	- Presence of parent entities
	- Global or domestic ultimate owners
	- Large corporate family structures
	- Franchise or decentralised operating models

**4. Data quality risk**

These capture the reliability and completeness of reported information.

	- Missing or inconsistent addresses
	- Missing registration or legal identifiers
	- Incomplete geographic information
	- Conflicting records across fields
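
One way to make these four dimensions concrete is to group feature columns by dimension and average the scaled values per firm. The sketch below is illustrative only: the column names are taken from the cluster profile outputs later in this notebook, but the grouping, the sign convention, and the helper name `dimension_scores` are assumptions rather than the method used in the analysis, and the toy values are made up.

```python
import pandas as pd

# Hypothetical grouping of feature columns into risk dimensions (an assumption).
RISK_DIMENSIONS = {
    "transparency": ["has_website", "has_phone", "has_company_description"],
    "operational_complexity": ["log_employees_total", "log_it_budget"],
    "ownership_complexity": ["Entity Type__Subsidiary"],
    "data_quality": ["has_address", "has_city", "has_state"],
}
# For these dimensions a HIGHER feature value means LOWER uncertainty,
# so the sign is flipped to keep "higher score = more uncertainty".
INVERTED = {"transparency", "data_quality"}


def dimension_scores(features: pd.DataFrame) -> pd.DataFrame:
    """Average scaled features per risk dimension for each firm."""
    scores = {}
    for name, cols in RISK_DIMENSIONS.items():
        present = [c for c in cols if c in features.columns]
        if not present:
            continue  # skip dimensions with no matching columns
        mean = features[present].mean(axis=1)
        scores[name] = -mean if name in INVERTED else mean
    return pd.DataFrame(scores, index=features.index)


# Toy example: two firms with made-up scaled values.
toy = pd.DataFrame(
    {
        "has_website": [1.0, -1.0],
        "has_phone": [1.0, -1.0],
        "has_company_description": [1.0, -1.0],
        "log_employees_total": [2.0, -0.5],
        "log_it_budget": [1.0, -0.5],
        "Entity Type__Subsidiary": [1.0, -1.0],
        "has_address": [0.2, -4.0],
        "has_city": [0.2, -4.0],
        "has_state": [0.2, -4.0],
    }
)
scores = dimension_scores(toy)
print(scores)
```

In this toy example the first firm scores high on operational and ownership complexity but low on transparency and data quality risk, while the second firm shows the opposite pattern.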


In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import hdbscan

## 1. Load features


In [None]:
df = pd.read_csv("../data/processed/features_for_clustering_scaled.csv")
X = df

## 2. Sanity check (zero NaNs and no obvious garbage columns)

- Drop features with a standard deviation of 0, as they do not contribute to clustering.


In [15]:
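X.shape                  # (rows, columns)
X.isna().sum().sum()     # total NaN count across all features
X.describe().T           # per-feature summary statistics
(X.std() == 0).sum()     # number of zero-variance features

# Drop features with standard deviation = 0, as they do not contribute to clustering
zero_val_cols = X.columns[X.std() == 0]
X_final = X.drop(columns=zero_val_cols)
X_final.shape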


(8559, 104)
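
The `std() == 0` filter can miss a column that is entirely missing, because its standard deviation is NaN rather than 0. The sketch below contrasts it with a `nunique()`-based check on a made-up frame; the column names and the frame `toy` are hypothetical, and this is a possible hardening rather than a step the analysis performed.

```python
import numpy as np
import pandas as pd

# Made-up frame: one informative column, one constant, one entirely missing.
toy = pd.DataFrame(
    {
        "varying": [0.1, -1.2, 0.7],
        "constant": [1.0, 1.0, 1.0],
        "all_nan": [np.nan, np.nan, np.nan],
    }
)

# std() == 0 flags "constant" but not "all_nan", whose std is NaN.
by_std = toy.columns[toy.std() == 0]

# nunique() counts distinct non-null values, so <= 1 catches both cases.
by_nunique = toy.columns[toy.nunique() <= 1]

toy_final = toy.drop(columns=by_nunique)
print(list(by_std), list(by_nunique), list(toy_final.columns))
```

When the NaN count is already zero, as checked in this section, both filters select the same columns.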

## 3. Clustering (HDBSCAN) and Interpretation

In [20]:
# Run HDBSCAN
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,  # smallest group that is treated as a cluster
    min_samples=25,       # larger values label more points as noise
)

labels = clusterer.fit_predict(X_final)  # -1 marks noise points

pd.Series(labels).value_counts()

# Attach labels and compute the mean of each scaled feature per cluster
df_clusters = df.copy()
df_clusters["cluster"] = labels
cluster_profiles = df_clusters.groupby("cluster").mean()

# Separate real clusters from noise (label -1)
cluster_profiles_no_noise = cluster_profiles.loc[cluster_profiles.index != -1]
cluster_profiles_no_noise.T




cluster,0,1,2,3,4,5,6,7,8,9,...,35,36,37,38,39,40,41,42,43,44
has_website,0.029924,0.240496,0.180562,1.256767,-0.213532,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,...,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669
has_phone,0.118091,-0.279435,0.767470,1.091614,-0.416025,-0.493089,-0.493089,-0.493089,0.595575,-0.475939,...,2.028029,-0.493089,-0.493089,2.028029,-0.493089,2.028029,-0.493089,-0.493089,-0.493089,-0.493089
has_address,0.204323,0.204323,0.204323,0.204323,0.204323,-1.088265,-4.894216,-4.894216,0.204323,0.204323,...,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323
has_city,0.210251,0.252766,0.252766,0.252766,0.234386,-3.956229,-3.956229,-3.956229,0.252766,0.252766,...,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766
has_state,0.210551,0.253027,0.253027,0.253027,0.234664,-3.952141,-3.952141,-3.952141,0.253027,0.253027,...,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Franchise Status__<NA>,-0.240210,-1.547605,0.251282,-0.638760,0.368347,0.646160,0.646160,0.646160,-1.547605,0.646160,...,0.646160,0.646160,-1.547605,0.646160,0.646160,0.646160,0.646160,0.646160,0.646160,0.646160
Manufacturing Status__Yes,-0.001112,-0.177351,0.055285,5.638537,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,...,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351
Manufacturing Status__<NA>,0.001112,0.177351,-0.055285,-5.638537,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,...,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351
Registration Number Type__indonesia legalization number,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,...,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587


In [21]:
n_companies, n_features = X_final.shape
n_noise = (labels == -1).sum()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

n_companies, n_features, n_clusters, n_noise, n_noise / n_companies

(8559, 104, 45, np.int64(2350), np.float64(0.27456478560579506))

HDBSCAN identified **45** clusters among 8,559 companies, with 2,350 firms (approximately **27.46%**) classified as **noise** (label `-1`). Noise firms do not fall into any dense group; they are interpreted here as having sparse or inconsistent records and are treated as high-uncertainty cases rather than forced into clusters.
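
Whether noise firms really have sparser records can be checked by comparing missingness features between noise and clustered rows. The following is a minimal sketch on made-up values: `missing_ratio_overall` is a column that appears in this notebook's outputs, but the frame `toy` and its numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up labels and one scaled missingness feature; -1 is the noise label.
toy = pd.DataFrame(
    {
        "cluster": [-1, -1, 0, 0, 1, 1],
        "missing_ratio_overall": [1.2, 0.8, -0.5, -0.3, -0.6, -0.6],
    }
)

# Tag each row as noise or clustered, then compare feature means per group.
group = np.where(toy["cluster"] == -1, "noise", "clustered")
comparison = toy.drop(columns="cluster").groupby(group).mean()

# Positive gap: noise rows have more missing data than clustered rows.
gap = comparison.loc["noise"] - comparison.loc["clustered"]
print(comparison)
print(gap)
```

On the real data, the same comparison could be run with `df_clusters` in place of `toy`, which would show which features most separate noise from clustered firms.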

In [27]:
pd.Series(labels).value_counts().drop(-1).head(5)

43    438
44    353
32    317
28    283
41    265
Name: count, dtype: int64

## Interpretation:

- These are the 5 largest non-noise clusters.
- Together they contain 1,656 firms: about 19% of all 8,559 companies and about 27% of the 6,209 firms assigned to a cluster.
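
The coverage of these five clusters can be worked out directly from the counts reported above. The sketch below only re-uses the printed cluster sizes, the total of 8,559 companies, and the 2,350 noise points; the variable names are illustrative.

```python
import pandas as pd

# Cluster sizes and totals as reported in the outputs above.
top5_counts = pd.Series({43: 438, 44: 353, 32: 317, 28: 283, 41: 265})
n_companies, n_noise = 8559, 2350

n_top5 = int(top5_counts.sum())
share_of_all = n_top5 / n_companies                    # share of every firm
share_of_clustered = n_top5 / (n_companies - n_noise)  # share of non-noise firms
print(n_top5, round(share_of_all, 4), round(share_of_clustered, 4))

# With the full label array available, the same counts come from:
# pd.Series(labels).value_counts().drop(-1).head(5)
```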



### Cluster 43 – Operationally complex, IT-intensive enterprises

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| log_revenue_usd | High | Large-scale commercial operations |
| log_it_budget | High | Heavy reliance on IT infrastructure |
| Entity Type__Subsidiary | High | Layered organisational structure |
| log_employees_total | High | Large workforce size |
| missing_ratio_it | Low | Strong completeness of IT-related data |
| servers_midpoint_missing | Low | Consistent infrastructure reporting |
| has_company_description | Low | Company descriptions are present less often than average |
| Entity Type__Branch | Low | Less likely to operate as small branch entities |



This cluster consists of large organisations with high revenue, workforce scale, and substantial IT investment.
The predominance of subsidiary entities indicates layered ownership structures and increased coordination requirements.
At the same time, low missingness across IT and infrastructure fields suggests strong reporting quality in those areas, although company descriptions are present less often than average (`has_company_description` is about -0.75 on the scaled feature).
Overall, risk exposure for this cluster arises primarily from **operational and governance complexity rather than gaps in IT data**, making these firms harder to monitor due to scale rather than lack of infrastructure information.

In [None]:
cid_1 = 43
cluster_profiles.loc[cid_1].sort_values(ascending=False).head(10)


Entity Type__Subsidiary    1.170580
log_market_value_usd       0.846979
log_revenue_usd            0.834959
log_it_budget              0.798029
log_it_spend               0.797383
it_spend_rate              0.796159
log_abs_it_budget_gap      0.796029
employee_concentration     0.759680
missing_ratio_codes        0.687262
log_employees_total        0.684868
Name: 43, dtype: float64

In [None]:
cluster_profiles.loc[cid_1].sort_values(ascending=True).head(10)

missing_ratio_it                   -0.872091
storage_devices_midpoint_missing   -0.818366
servers_midpoint_missing           -0.801750
Entity Type__Branch                -0.797430
routers_midpoint_missing           -0.791951
has_company_description            -0.750715
Franchise Status__FALSE            -0.630749
log_company_age                    -0.556354
has_phone                          -0.493089
company_age                        -0.489136
Name: 43, dtype: float64

### Cluster 44 – Workforce-concentrated, device-intensive enterprises

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| log_pc_midpoint | High | Large number of endpoint devices (PCs) per organisation |
| log_employees_single_site | High | Workforce concentrated at large operating sites |
| log_employees_total | High | Large overall workforce size |
| log_it_assets_total | High | Extensive IT asset base |
| Entity Type__Subsidiary | High | Layered organisational structure |
| missing_ratio_it | Low | Strong completeness of IT-related data |
| servers_midpoint_missing | Low | Consistent infrastructure reporting |
| Entity Type__Branch | Low | Less likely to operate as small branch entities |

This cluster comprises **large organisations with highly concentrated workforces** and **unusually high endpoint device counts**.
The combination of workforce concentration and extensive IT assets suggests elevated operational and endpoint management complexity, particularly around security, maintenance, and governance.
Low levels of missing IT and infrastructure data indicate strong reporting quality, implying that risk arises primarily from **operational scale and device intensity** rather than **data opacity**.

In [33]:
cid_2 = 44
cluster_profiles.loc[cid_2].sort_values(ascending=False).head(10)

log_pc_midpoint              1.535694
Entity Type__Subsidiary      1.170580
log_employees_total          1.092517
log_employees_single_site    1.092176
log_it_assets_total          0.933898
log_market_value_usd         0.931285
log_abs_it_budget_gap        0.878238
log_it_spend                 0.876064
log_it_budget                0.873523
log_revenue_usd              0.823324
Name: 44, dtype: float64

In [34]:
cluster_profiles.loc[cid_2].sort_values(ascending=True).head(10)

missing_ratio_it                   -0.872091
storage_devices_midpoint_missing   -0.818366
servers_midpoint_missing           -0.801750
Entity Type__Branch                -0.797430
routers_midpoint_missing           -0.791951
has_company_description            -0.750715
Franchise Status__FALSE            -0.630749
log_company_age                    -0.617226
company_age                        -0.544412
has_phone                          -0.493089
Name: 44, dtype: float64
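
Clusters 43 and 44 share many of the same top features, so it helps to look at where their profiles differ. The sketch below copies the scaled cluster means from the top-10 outputs above, restricted to features that appear in both lists, and ranks the differences; the frame name `shared` is illustrative.

```python
import pandas as pd

# Scaled cluster means copied from the outputs above (features in both top-10 lists).
shared = pd.DataFrame(
    {
        43: {
            "Entity Type__Subsidiary": 1.170580,
            "log_market_value_usd": 0.846979,
            "log_revenue_usd": 0.834959,
            "log_it_budget": 0.798029,
            "log_it_spend": 0.797383,
            "log_abs_it_budget_gap": 0.796029,
            "log_employees_total": 0.684868,
        },
        44: {
            "Entity Type__Subsidiary": 1.170580,
            "log_market_value_usd": 0.931285,
            "log_revenue_usd": 0.823324,
            "log_it_budget": 0.873523,
            "log_it_spend": 0.876064,
            "log_abs_it_budget_gap": 0.878238,
            "log_employees_total": 1.092517,
        },
    }
)

# Positive values: cluster 44 sits further above the dataset mean than cluster 43.
diff = (shared[44] - shared[43]).sort_values(ascending=False)
print(diff.round(3))
```

Among these shared features, the largest gap is in `log_employees_total`, which is consistent with describing cluster 44 as the more workforce-heavy of the two.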

In [36]:
cid_3 = 32
cluster_profiles.loc[cid_3].sort_values(ascending=False).head(10)

routers_midpoint_missing            1.262704
Entity Type__Branch                 1.254029
servers_midpoint_missing            1.247272
storage_devices_midpoint_missing    1.221947
missing_ratio_it                    1.102224
missing_ratio_overall               1.057819
missing_ratio_codes                 0.687262
Franchise Status__<NA>              0.646160
longitude_missing                   0.536329
latitude_missing                    0.535968
Name: 32, dtype: float64

In [38]:
cluster_profiles.loc[cid_3].sort_values(ascending=True).head(10)

employee_concentration      -1.316343
it_spend_rate               -1.243620
log_revenue_usd             -1.228507
log_it_budget               -1.218387
log_it_spend                -1.215501
log_abs_it_budget_gap       -1.212850
log_market_value_usd        -1.050259
log_employees_single_site   -1.045327
log_employees_total         -1.044719
Entity Type__Subsidiary     -0.854277
Name: 32, dtype: float64
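
The outputs for cluster 32 could be summarised in the same Key Signal / Direction layout used for clusters 43 and 44. The helper `key_signals` below is a hypothetical sketch, not part of the original notebook; it is applied to a subset of cluster 32's values copied from the outputs above and leaves the Meaning column to the analyst.

```python
import pandas as pd


def key_signals(profile: pd.Series, k: int = 3) -> pd.DataFrame:
    """Return the k highest and k lowest scaled features with a direction label."""
    ordered = profile.sort_values(ascending=False)
    high = ordered.head(k).to_frame("scaled_mean").assign(direction="High")
    # Reverse the tail so the most negative feature is listed first.
    low = ordered.tail(k).iloc[::-1].to_frame("scaled_mean").assign(direction="Low")
    return pd.concat([high, low]).rename_axis("key_signal").reset_index()


# A subset of cluster 32's profile, copied from the outputs above.
profile_32 = pd.Series(
    {
        "routers_midpoint_missing": 1.262704,
        "Entity Type__Branch": 1.254029,
        "servers_midpoint_missing": 1.247272,
        "missing_ratio_overall": 1.057819,
        "employee_concentration": -1.316343,
        "it_spend_rate": -1.243620,
        "log_revenue_usd": -1.228507,
        "Entity Type__Subsidiary": -0.854277,
    },
    name=32,
)

table_32 = key_signals(profile_32, k=3)
print(table_32)
```

With the full notebook state, the same call could be made as `key_signals(cluster_profiles.loc[32])`.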