## Member 3: Clustering & Risk Interpretation
This dataset does not contain outcome labels such as defaults, losses, fraud, or failures.
Risk in this analysis is therefore descriptive rather than predictive, and it is not financial risk in the strict sense.

In this analysis, risk is defined as elevated uncertainty arising from a company’s:

- operating profile
- organisational structure
- information transparency

Risk reflects how difficult a company is to assess, monitor, or compare, rather than whether it is “bad”.

### Types of risk in scope

Based on the available variables, this dataset supports structural and operational risk analysis, specifically:

**1. Transparency & disclosure risk**

These capture how much information a company makes available and how easy it is to assess.

- Missing or incomplete public-facing information
- Lack of financial or size indicators
- Minimal descriptive detail about operations

**2. Operational complexity risk**

These capture the scale and coordination burden of a company’s operations.

- Large number of operating sites
- Large workforce size
- Broad geographic footprint
- Extensive IT infrastructure

**3. Organisational & ownership complexity risk**

These capture how layered or decentralised control and accountability may be.

- Presence of parent entities
- Global or domestic ultimate owners
- Large corporate family structures
- Franchise or decentralised operating models

**4. Data quality risk**

These capture the reliability and completeness of reported information.

- Missing or inconsistent addresses
- Missing registration or legal identifiers
- Incomplete geographic information
- Conflicting records across fields
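
One way to make the four risk families concrete is to average the standardised columns that belong to each family into a rough per-company score. The sketch below is illustrative only: the column names appear in the cluster profiles later in this notebook, but the grouping in `RISK_FAMILIES` and the helper `risk_family_scores` are assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical mapping from risk family to feature columns. The column names
# appear in the clustering features, but this grouping is an assumption.
RISK_FAMILIES = {
    "transparency": ["missing_ratio_overall", "missing_ratio_contact"],
    "operational_complexity": ["log_employees_total", "log_it_assets_total"],
    "ownership_complexity": ["log_corporate_family_members", "Entity Type__Subsidiary"],
    "data_quality": ["latitude_missing", "longitude_missing", "missing_ratio_codes"],
}


def risk_family_scores(features: pd.DataFrame, families: dict) -> pd.DataFrame:
    """Average the standardised columns of each family, skipping absent columns."""
    scores = {}
    for name, cols in families.items():
        present = [c for c in cols if c in features.columns]
        if present:
            scores[name] = features[present].mean(axis=1)
    return pd.DataFrame(scores, index=features.index)


# Tiny made-up example (values are illustrative, not taken from the dataset).
demo = pd.DataFrame({
    "missing_ratio_overall": [1.0, -1.0],
    "missing_ratio_contact": [0.5, -0.5],
    "log_employees_total": [-1.0, 2.0],
})
demo_scores = risk_family_scores(demo, RISK_FAMILIES)
print(demo_scores)
```

Because the features are standardised, a positive score means a company sits above the dataset average on that family. Families with no matching columns are simply omitted from the result.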


In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import hdbscan

## 1. Load features


In [None]:
df_pca = pd.read_csv("../data/processed/pca_data.csv")
X = df_pca.select_dtypes(include="number").to_numpy()

df_feature = pd.read_csv("../data/processed/features_for_clustering_scaled_dropped.csv")
Y = df_feature

assert len(df_pca) == len(df_feature), "Row count mismatch: PCA rows and feature rows not aligned"



<!-- ## 2.Sanity Check (For zero NaNs and no obvious garbage columns)

- drop features with standard deviation = 0 as they did not contribute to clustering -->


In [15]:
# X.shape
# X.isna().sum().sum()
# X.describe().T
# (X.std() == 0).sum()

# ##drop features with standard deviation = 0 as they did not contribute to clustering

# zero_val_cols = X.columns[X.std() ==0]
# X_final = X.drop(columns=zero_val_cols)
# X_final.shape


(8559, 104)
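
The commented-out check above calls DataFrame methods, so it would need a DataFrame rather than the NumPy array `X` created in the loading step. A minimal sketch of the same idea on a DataFrame is shown below; `sanity_report` is a hypothetical helper and the demo values are made up.

```python
import numpy as np
import pandas as pd


def sanity_report(frame: pd.DataFrame) -> dict:
    """Summarise shape, missing values, and zero-variance numeric columns."""
    numeric = frame.select_dtypes(include="number")
    stds = numeric.std()
    zero_var_cols = stds.index[stds == 0].tolist()
    return {
        "shape": frame.shape,
        "n_missing": int(frame.isna().sum().sum()),
        "zero_variance_columns": zero_var_cols,
    }


demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [5.0, 5.0, 5.0],      # constant column
    "c": [0.1, np.nan, 0.3],   # one missing value
})
report = sanity_report(demo)
# Constant columns carry no information for clustering, so drop them.
demo_clean = demo.drop(columns=report["zero_variance_columns"])
print(report)
```

In this notebook the same call could be applied to `df_pca` or `df_feature` before converting to NumPy.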

## 2. Clustering (HDBSCAN) and Interpretation

In [55]:
## Run HDBSCAN ##

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,
    min_samples=25
)

labels = clusterer.fit_predict(X)

# Cluster sizes (label -1 marks noise points)
cluster_sizes = pd.Series(labels).value_counts()

# Attach labels to the scaled features and average each feature per cluster
df_clusters = Y.copy()
df_clusters["cluster"] = labels
cluster_profiles = df_clusters.groupby("cluster").mean()

# Separate real clusters from noise (label -1)
cluster_profiles_no_noise = cluster_profiles.loc[cluster_profiles.index != -1]
cluster_profiles_no_noise.T




cluster,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
has_website,0.014992,0.310653,0.288752,1.353961,-0.262791,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,...,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669,-0.280669
has_phone,0.088707,-0.105225,0.767470,1.158678,-0.411007,-0.493089,-0.493089,-0.183478,-0.493089,-0.493089,...,-0.493089,0.255758,-0.493089,-0.493089,-0.493089,-0.493089,2.028029,2.028029,-0.493089,-0.493089
has_address,-0.040799,0.204323,0.204323,0.204323,0.204323,-4.814552,-4.266704,0.204323,0.204323,0.204323,...,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323,0.204323
has_city,0.009939,0.188012,0.252766,0.252766,0.233189,-3.956229,-3.956229,0.252766,0.252766,0.252766,...,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766,0.252766
has_state,0.010422,0.188333,0.253027,0.253027,0.233468,-3.952141,-3.952141,0.253027,0.253027,0.253027,...,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027,0.253027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Franchise Status__<NA>,-0.197596,-1.547605,0.199282,-0.677664,0.411478,0.646160,0.646160,0.646160,0.646160,0.646160,...,0.620043,0.646160,0.646160,0.646160,0.646160,0.646160,0.646160,0.646160,0.646160,0.646160
Manufacturing Status__Yes,-0.009585,0.001599,0.145754,5.638537,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,...,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351,-0.177351
Manufacturing Status__<NA>,0.009585,-0.001599,-0.145754,-5.638537,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,...,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351,0.177351
Registration Number Type__indonesia legalization number,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,...,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587,-0.030587
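
Some rows in the profile table, such as `Registration Number Type__indonesia legalization number`, show the same value in every displayed cluster, so they do not help tell clusters apart. A rough way to list such features is to look at the range of each feature's cluster means. The helper `cluster_spread`, the tolerance, and the demo values below are assumptions for illustration.

```python
import pandas as pd


def cluster_spread(profiles: pd.DataFrame) -> pd.Series:
    """Range (max - min) of each feature's cluster means, smallest first."""
    spread = profiles.max(axis=0) - profiles.min(axis=0)
    return spread.sort_values()


# Made-up profile table: rows are clusters, columns are features.
demo_profiles = pd.DataFrame(
    {
        "has_address": [0.20, -4.81, 0.20],
        "constant_flag": [-0.03, -0.03, -0.03],  # hypothetical column
    },
    index=pd.Index([0, 1, 2], name="cluster"),
)
spread = cluster_spread(demo_profiles)
flat_features = spread[spread < 1e-9].index.tolist()
print(flat_features)
```

In the notebook this would be called as `cluster_spread(cluster_profiles_no_noise)`, where rows are clusters and columns are features.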


In [60]:
n_companies, n_features = X.shape
n_noise = (labels == -1).sum()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

n_companies, n_features, n_clusters, n_noise, n_noise / n_companies

(8559, 38, 46, np.int64(2070), np.float64(0.24185068349106204))

HDBSCAN identified **46** clusters among 8,559 companies and labelled 2,070 firms (approximately **24.19%**) as **noise**. Noise points are firms that do not fall into any sufficiently dense region of the feature space, which may reflect sparse or inconsistent records. They are treated as high-uncertainty cases rather than forced into clusters.
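
Whether noise firms really have sparser records can be checked by comparing missingness features between noise and clustered rows. The sketch below assumes the feature table and `labels` are row-aligned, as asserted when loading. `compare_noise` is a hypothetical helper and the demo values are made up.

```python
import numpy as np
import pandas as pd


def compare_noise(features: pd.DataFrame, labels, columns) -> pd.DataFrame:
    """Mean of selected columns for noise (-1) versus clustered rows."""
    is_noise = pd.Series(np.asarray(labels) == -1, index=features.index)
    group = is_noise.map({True: "noise", False: "clustered"})
    return features[columns].groupby(group).mean()


demo_features = pd.DataFrame({
    "missing_ratio_overall": [0.9, 0.7, -0.2, -0.4],
})
demo_labels = np.array([-1, -1, 0, 1])
summary = compare_noise(demo_features, demo_labels, ["missing_ratio_overall"])
print(summary)
```

In the notebook this could be run as `compare_noise(Y, labels, ["missing_ratio_overall", "missing_ratio_it"])`. A clearly higher mean for the noise group would support the interpretation above.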

In [56]:
pd.Series(labels).value_counts().drop(-1).head(5)

45    451
44    364
27    339
34    336
16    276
Name: count, dtype: int64

## Interpretation:

- These are the five largest non-noise clusters.
- Together they contain 1,766 firms, about 20.6% of all 8,559 companies and about 27.2% of the 6,489 firms assigned to a cluster.
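
The coverage of the largest clusters can be computed directly from the label array. This is a small sketch with a hypothetical helper, `top_cluster_coverage`, run on made-up labels.

```python
import pandas as pd


def top_cluster_coverage(labels, k: int = 5) -> dict:
    """Share of all rows, and of non-noise rows, held by the k largest clusters."""
    counts = pd.Series(labels).value_counts()
    clustered = counts.drop(-1, errors="ignore")  # remove the noise label if present
    top_total = int(clustered.head(k).sum())
    return {
        "top_total": top_total,
        "share_of_all": top_total / int(counts.sum()),
        "share_of_clustered": top_total / int(clustered.sum()),
    }


# Made-up labels: 4 noise rows, clusters of size 3, 2 and 1.
demo_labels = [-1, -1, -1, -1, 0, 0, 0, 1, 1, 2]
coverage = top_cluster_coverage(demo_labels, k=2)
print(coverage)
```

Calling `top_cluster_coverage(labels)` in the notebook would report both shares for the five clusters listed above.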



### Cluster 45 - Large, IT-intensive subsidiary enterprises

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `Entity Type__Subsidiary` | High | Firms are predominantly subsidiaries, indicating layered organisational structures |
| `log_revenue_usd` | High | High operating revenue and large-scale business activity |
| `log_market_value_usd` | High | Strong firm valuation and economic significance |
| `log_it_budget` | High | Substantial allocation of resources to IT infrastructure |
| `it_spend_rate` | High | IT spending scales aggressively with firm size |
| `log_it_spend` | High | High absolute IT expenditure |
| `log_abs_it_budget_gap` | High | Significant IT budget adjustments, suggesting active IT planning or expansion |
| `log_employees_total` | High | Large workforce and organisational scale |
| `missing_ratio_it` | Low | IT-related data is largely complete and well-reported |
| `servers_midpoint_missing` | Low | Consistent reporting of core IT infrastructure assets |
| `Entity Type__Branch` | Low | Unlikely to operate as small or peripheral branch entities |

Cluster 45 captures large, economically significant subsidiary enterprises with layered organisational structures and strong reliance on IT infrastructure. These firms exhibit high revenue, valuation, headcount, and IT spending, reflecting operations at substantial scale. The low level of missingness across IT and infrastructure-related variables suggests well-documented systems, although classification codes are missing more often than average (`missing_ratio_codes` is high) and company descriptions are less common (`has_company_description` is low). Risk exposure in this cluster is therefore driven primarily by **operational scale and governance complexity**, rather than by opacity of IT data.

In [65]:
cid_1 = 45
cluster_profiles_no_noise.loc[cid_1].sort_values(ascending=False).head(10)


Entity Type__Subsidiary    1.166090
log_revenue_usd            0.833412
log_market_value_usd       0.831414
log_it_budget              0.796774
it_spend_rate              0.796159
log_it_spend               0.796075
log_abs_it_budget_gap      0.794662
employee_concentration     0.759680
missing_ratio_codes        0.687262
log_employees_total        0.679370
Name: 45, dtype: float64

In [64]:
cluster_profiles_no_noise.loc[cid_1].sort_values(ascending=True).head(10)

missing_ratio_it                   -0.872091
storage_devices_midpoint_missing   -0.818366
servers_midpoint_missing           -0.801750
Entity Type__Branch                -0.792881
routers_midpoint_missing           -0.791951
has_company_description            -0.750715
Franchise Status__FALSE            -0.630749
log_company_age                    -0.514108
has_phone                          -0.493089
log_pc_midpoint                    -0.484601
Name: 45, dtype: float64
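
The top-10 and bottom-10 lookups are repeated for every cluster, so they can be wrapped in one helper that returns a single table of key signals. This is a sketch only: `key_signals` is a hypothetical name and the demo profile values are made up.

```python
import pandas as pd


def key_signals(profiles: pd.DataFrame, cluster_id: int, k: int = 10) -> pd.DataFrame:
    """Return the k highest and k lowest standardised means for one cluster."""
    row = profiles.loc[cluster_id].sort_values(ascending=False)
    high = row.head(k).rename("value").to_frame().assign(direction="High")
    # Reverse the tail so the most negative value comes first.
    low = row.tail(k).iloc[::-1].rename("value").to_frame().assign(direction="Low")
    out = pd.concat([high, low])
    out.index.name = "signal"
    return out


demo_profiles = pd.DataFrame(
    {
        "log_revenue_usd": [0.83, -1.23],
        "missing_ratio_it": [-0.87, 1.10],
        "has_phone": [-0.49, 0.10],
    },
    index=pd.Index([45, 27], name="cluster"),
)
signals = key_signals(demo_profiles, cluster_id=45, k=1)
print(signals)
```

With the real data, `key_signals(cluster_profiles_no_noise, 45)` would reproduce the two listings above in one frame. Note that if `2 * k` exceeds the number of features, some signals appear in both halves.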

### Cluster 44 - High-infrastructure, IT-heavy subsidiaries

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `log_pc_midpoint` | High | Very high density of endpoint devices across the organisation |
| `log_employees_total` | High | Large overall workforce size |
| `log_it_assets_total` | High | Extensive IT infrastructure and asset base |
| `log_market_value_usd` | High | Strong firm valuation and economic significance |
| `log_it_budget` | High | Substantial allocation of resources to IT operations |
| `log_it_spend` | High | High absolute expenditure on IT systems |
| `it_spend_rate` | High | IT spending scales aggressively with organisational size |
| `log_abs_it_budget_gap` | High | Active adjustment and expansion of IT budgets |
| `Entity Type__Subsidiary` | High | Predominantly subsidiary entities within larger corporate groups |
| `missing_ratio_it`| Low | Strong completeness and reliability of IT-related data |
| `servers_midpoint_missing` | Low | Consistent reporting of server infrastructure |
| `storage_devices_midpoint_missing` | Low | Consistent reporting of storage infrastructure |
| `routers_midpoint_missing` | Low | Consistent reporting of network infrastructure |
| `Entity Type__Branch` | Low | Less likely to operate as small or peripheral branch entities |

Cluster 44 represents large subsidiary organisations characterised by exceptionally high endpoint density and extensive IT infrastructure. Elevated workforce size, market valuation, and IT expenditure indicate enterprises operating at substantial scale with strong reliance on internal digital systems to support daily operations. The presence of significant IT budget gaps suggests active investment, expansion, or ongoing digital transformation rather than static IT environments. Low levels of IT and infrastructure data missingness reflect **mature governance and well-established reporting practices**. Overall, risk exposure within this cluster is driven primarily by **infrastructure scale, asset sprawl, and change-management complexity**, rather than **data opacity or informational gaps**.

In [69]:
cid_2 = 44
cluster_profiles_no_noise.loc[cid_2].sort_values(ascending=False).head(10)

log_pc_midpoint            1.540776
Entity Type__Subsidiary    1.170580
log_employees_total        1.103940
log_it_assets_total        0.939931
log_market_value_usd       0.915978
log_abs_it_budget_gap      0.876260
log_it_spend               0.874172
log_it_budget              0.871707
log_revenue_usd            0.823631
it_spend_rate              0.796227
Name: 44, dtype: float64

In [72]:
cluster_profiles_no_noise.loc[cid_2].sort_values(ascending=True).head(10)

missing_ratio_it                   -0.866667
servers_midpoint_missing           -0.801750
storage_devices_midpoint_missing   -0.801550
Entity Type__Branch                -0.797430
routers_midpoint_missing           -0.791951
has_company_description            -0.750715
Franchise Status__FALSE            -0.630749
log_company_age                    -0.589069
has_phone                          -0.493089
Entity Type__Parent                -0.483364
Name: 44, dtype: float64

### Cluster 27 - Data-poor, low-IT branch organisations

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `routers_midpoint_missing` | High | Network infrastructure details are largely unreported |
| `servers_midpoint_missing` | High | Server infrastructure information is frequently missing |
| `storage_devices_midpoint_missing` | High | Storage infrastructure is poorly documented |
| `missing_ratio_it` | High | IT-related data is largely incomplete |
| `missing_ratio_overall` | High | High overall data missingness across features |
| `missing_ratio_codes` | High | Operational or classification codes are often absent |
| `Entity Type__Branch` | High | Firms are predominantly small branch-level entities |
| `Franchise Status__<NA>` | High | Franchise status is frequently unreported or unclear |
| `longitude_missing` | High | Geographic location data is incomplete |
| `latitude_missing` | High | Geographic location data is incomplete |
| `employee_concentration` | Low | Workforce is not concentrated at large operating sites |
| `log_employees_total` | Low | Small overall workforce size |
| `log_revenue_usd` | Low | Low operating revenue |
| `log_market_value_usd` | Low | Low firm valuation |
| `log_it_budget` | Low | Limited allocation of resources to IT |
| `log_it_spend` | Low | Low absolute IT expenditure |
| `it_spend_rate` | Low | IT spending does not scale with operations |
| `log_abs_it_budget_gap` | Low | Minimal IT budget adjustment or investment activity |
| `Entity Type__Subsidiary` | Low | Unlikely to operate as subsidiary entities |

Cluster 27 represents small, branch-level organisations characterised by extensive data incompleteness and limited IT investment. High missingness across IT infrastructure, operational, and geographic variables suggests weak internal reporting practices and low system maturity. These firms operate at relatively small scale, with low revenue, valuation, workforce size, and minimal IT budgets, indicating limited reliance on digital infrastructure. Risk exposure in this cluster is driven primarily by **data opacity and informational gaps** rather than **operational complexity**, making these entities difficult to assess due to lack of visibility rather than scale or technological sophistication.

In [73]:
cid_3 = 27
cluster_profiles_no_noise.loc[cid_3].sort_values(ascending=False).head(10)

routers_midpoint_missing            1.262704
Entity Type__Branch                 1.254029
servers_midpoint_missing            1.247272
storage_devices_midpoint_missing    1.221947
missing_ratio_it                    1.102224
missing_ratio_overall               1.080159
missing_ratio_codes                 0.687262
Franchise Status__<NA>              0.646160
longitude_missing                   0.536329
latitude_missing                    0.535968
Name: 27, dtype: float64

In [74]:
cluster_profiles_no_noise.loc[cid_3].sort_values(ascending=True).head(10)

employee_concentration    -1.316343
it_spend_rate             -1.250437
log_revenue_usd           -1.232485
log_it_budget             -1.221005
log_it_spend              -1.217928
log_abs_it_budget_gap     -1.215094
log_market_value_usd      -1.054372
log_employees_total       -1.044719
Entity Type__Subsidiary   -0.854277
has_company_description   -0.750715
Name: 27, dtype: float64

### Cluster 34 - Mature, high-credibility and transparent firms

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `has_company_description` | High | Public-facing company information is consistently available |
| `credibility_score_norm` | High | High overall credibility and trustworthiness score |
| `log_company_age` | High | Older, more established organisations |
| `it_spend_rate` | High | IT spending scales reliably with operations |
| `employee_concentration` | High | Workforce is concentrated at main operating sites |
| `log_revenue_usd` | High | Stable and relatively high operating revenue |
| `has_phone` | High | Contact information is consistently available |
| `log_it_budget` | High | Sustained allocation of resources to IT |
| `log_it_spend` | High | Consistent IT expenditure |
| `log_market_value_usd` | High | Solid firm valuation and economic standing |
| `latitude_missing` | Low | Geographic location data is consistently available |
| `longitude_missing` | Low | Geographic location data is consistently available |
| `missing_ratio_overall` | Low | Low overall data missingness |
| `missing_ratio_it` | Low | IT-related data is largely complete |
| `servers_midpoint_missing` | Low | Server infrastructure is well-documented |
| `storage_devices_midpoint_missing` | Low | Storage infrastructure is well-documented |
| `routers_midpoint_missing` | Low | Network infrastructure is well-documented |
| `Entity Type__Branch` | Low | Less likely to operate as small branch entities |
| `missing_ratio_codes` | Low | Operational and classification codes are consistently reported |

Cluster 34 represents mature, well-established organisations characterised by high credibility and strong data transparency. These firms exhibit consistent availability of public-facing information, reliable contact details, and comprehensive reporting across operational, geographic, and IT-related dimensions. Elevated company age, stable revenue, and sustained IT spending suggest organisations with settled operating models rather than rapid expansion or restructuring. Low levels of missingness across infrastructure and descriptive variables indicate strong governance and disciplined internal processes. Overall, risk exposure for this cluster is comparatively low and driven more by **standard operational considerations** than by **data opacity, infrastructure sprawl, or governance complexity**.

In [78]:
cid_4 = 34
cluster_profiles_no_noise.loc[cid_4].sort_values(ascending=False).head(10)

has_company_description    1.232883
credibility_score_norm     1.083316
log_company_age            1.036860
it_spend_rate              0.795913
employee_concentration     0.759680
log_revenue_usd            0.733079
has_phone                  0.722450
log_it_budget              0.708149
log_it_spend               0.703706
log_market_value_usd       0.702375
Name: 34, dtype: float64

In [77]:
cluster_profiles_no_noise.loc[cid_4].sort_values(ascending=True).head(10)

latitude_missing                   -1.865785
longitude_missing                  -1.864528
missing_ratio_overall              -0.995793
missing_ratio_it                   -0.860339
storage_devices_midpoint_missing   -0.812294
servers_midpoint_missing           -0.801750
routers_midpoint_missing           -0.791951
Entity Type__Branch                -0.681425
missing_ratio_codes                -0.402485
missing_ratio_contact              -0.395565
Name: 34, dtype: float64

### Cluster 16 - Small branch firms with limited IT visibility despite formal registration

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `Franchise Status__FALSE` | High | Firms are predominantly non-franchise entities |
| `has_company_description` | High | Public-facing company information is available |
| `routers_midpoint_missing` | High | Network infrastructure details are largely unreported |
| `servers_midpoint_missing` | High | Server infrastructure information is frequently missing |
| `storage_devices_midpoint_missing` | High | Storage infrastructure is poorly documented |
| `missing_ratio_it` | High | IT-related data is largely incomplete |
| `Entity Type__Branch` | High | Firms are predominantly branch-level entities |
| `credibility_score_norm` | High | Relatively high credibility despite operational simplicity |
| `has_registration_number` | High | Formal registration information is available |
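| `log_corporate_family_members` | High | Firms belong to larger-than-average corporate families |
| `missing_ratio_codes` | Low | Operational and classification codes are consistently reported |
| `Franchise Status__<NA>` | Low | Franchise status is almost always reported |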
| `employee_concentration` | Low | Workforce is not concentrated at large operating sites |
| `log_employees_total` | Low | Small overall workforce size |
| `log_revenue_usd` | Low | Low operating revenue |
| `log_market_value_usd` | Low | Low firm valuation |
| `log_it_budget` | Low | Limited allocation of resources to IT |
| `log_it_spend` | Low | Low absolute IT expenditure |
| `it_spend_rate` | Low | IT spending does not scale with operations |
| `log_abs_it_budget_gap` | Low | Minimal IT budget adjustment or expansion activity |

Cluster 16 represents small, branch-level organisations with limited operational scale and minimal reliance on IT infrastructure. These firms show relatively strong formal indicators, including company descriptions, registration numbers, consistently reported classification codes, and above-average credibility scores, yet they display substantial gaps in IT and infrastructure reporting. Low workforce size, revenue, valuation, and IT spending suggest simple operating models with limited technological dependence. Risk exposure in this cluster is driven primarily by **weak IT visibility and infrastructure transparency** rather than **governance or organisational complexity**, making oversight challenging due to incomplete technical information despite basic corporate legitimacy.

In [81]:
cid_5 = 16
cluster_profiles_no_noise.loc[cid_5].sort_values(ascending=False).head(10)

Franchise Status__FALSE             1.585417
has_company_description             1.332063
routers_midpoint_missing            1.262704
Entity Type__Branch                 1.254029
servers_midpoint_missing            1.247272
storage_devices_midpoint_missing    1.221947
missing_ratio_it                    1.102224
credibility_score_norm              0.677902
has_registration_number             0.656348
log_corporate_family_members        0.550856
Name: 16, dtype: float64

In [80]:
cluster_profiles_no_noise.loc[cid_5].sort_values(ascending=True).head(10)

missing_ratio_codes      -2.016983
Franchise Status__<NA>   -1.547605
employee_concentration   -1.316343
it_spend_rate            -1.256483
log_revenue_usd          -1.235520
log_it_budget            -1.222990
log_it_spend             -1.219741
log_abs_it_budget_gap    -1.216723
log_market_value_usd     -1.057564
log_employees_total      -1.044719
Name: 16, dtype: float64

### Summary

| Cluster | Dominant Risk Type | Key Driver |
|--------|------------------|-----------|
| 27 | Transparency & data quality risk | High missingness across operational and IT variables |
| 16 | IT visibility risk | Poor infrastructure reporting despite formal registration |
| 34 | Low risk (high transparency) | Mature operations with strong disclosure quality |
| 44 | Operational complexity risk | Dense IT assets and infrastructure sprawl |
| 45 | Governance & coordination risk | Large-scale subsidiary structures with heavy IT reliance |
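
To carry these interpretations back to individual companies, the cluster labels can be mapped to the risk types in the table. The sketch below copies the five labels from the table. The noise label follows the notebook's choice to treat noise as high-uncertainty cases, while the "Not yet profiled" default and the helper `label_risk` are assumptions.

```python
import pandas as pd

# Labels copied from the summary table above.
CLUSTER_RISK = {
    27: "Transparency & data quality risk",
    16: "IT visibility risk",
    34: "Low risk (high transparency)",
    44: "Operational complexity risk",
    45: "Governance & coordination risk",
}


def label_risk(cluster_labels, mapping: dict) -> pd.Series:
    """Map cluster ids to risk labels; noise and unprofiled clusters get defaults."""
    labels = pd.Series(cluster_labels, name="cluster")
    risk = labels.map(mapping)
    # Noise points (-1) get their own label (assumed wording).
    risk = risk.where(labels != -1, "High uncertainty (noise)")
    # Clusters without a written profile get a placeholder (assumed wording).
    return risk.fillna("Not yet profiled").rename("risk_type")


demo_risk = label_risk([45, -1, 3, 27], CLUSTER_RISK)
print(demo_risk.tolist())
```

In the notebook, `df_clusters["risk_type"] = label_risk(labels, CLUSTER_RISK).to_numpy()` would attach the label to each company row.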