This dataset contains no outcome labels such as defaults, losses, fraud, or failures.
Risk as used here is therefore descriptive rather than predictive, and it is not financial risk in the strict sense.

In this analysis, risk is defined as elevated uncertainty arising from a company’s:
	•	operating profile
	•	organisational structure
	•	information transparency

Risk reflects how difficult a company is to assess, monitor, or compare, rather than whether it is “bad”.

Types of risk that are in scope

Based on the available variables, this dataset supports structural and operational risk analysis, specifically:

- Transparency & disclosure risk

Companies with incomplete or missing core information (e.g. website, phone number, company description, revenue, employee counts) are harder to evaluate and require greater due diligence effort.

- Operational complexity risk

Companies with many sites, large workforces, broad geographic footprints, or extensive IT infrastructure tend to have higher coordination, governance, and execution complexity.

- Organisational & ownership complexity risk

Companies with parent entities, global or domestic ultimate owners, or large corporate family structures may have layered control and accountability, increasing governance and compliance complexity.

- Data quality risk

Inconsistent or sparse records (e.g. missing addresses, registration details, or geographic identifiers) may signal weak reporting systems or fragmented data governance.

1. Transparency & Disclosure Signals
These capture how much information a company makes available and how easy it is to assess.
	•	Missing or incomplete public-facing information
	•	Lack of financial or size indicators
	•	Minimal descriptive detail about operations

Why it matters:
Lower disclosure increases uncertainty and due diligence effort.
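
One way to turn these signals into a single feature is the share of core fields that are missing for each company. The sketch below is illustrative only: the column names in `CORE_FIELDS` are assumptions, not the dataset's actual schema, and blank strings are treated as missing by choice.

```python
import pandas as pd

# Hypothetical column names; replace with the dataset's actual core fields.
CORE_FIELDS = ["website", "phone_number", "company_description",
               "revenue", "employee_count"]


def transparency_gap(frame: pd.DataFrame, fields=CORE_FIELDS) -> pd.Series:
    """Share of core fields missing per company (0 = complete, 1 = all missing)."""
    subset = frame[fields]
    # Blank or whitespace-only strings count as missing, alongside NaN/None.
    blank = subset.apply(lambda col: col.astype(str).str.strip().eq(""))
    missing = subset.isna() | blank
    return missing.mean(axis=1)


# Toy data for illustration only.
example = pd.DataFrame({
    "website": ["https://example.com", None, ""],
    "phone_number": ["555-0100", None, "555-0101"],
    "company_description": ["Widgets", None, None],
    "revenue": [1.2e6, None, 3.0e5],
    "employee_count": [40, None, 12],
})
example["transparency_gap"] = transparency_gap(example)
```

A higher value means fewer disclosed fields and, under the definition above, more due diligence effort.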

⸻

2. Operational Complexity Signals
These capture the scale and coordination burden of a company’s operations.
	•	Large number of operating sites
	•	Large workforce size
	•	Broad geographic footprint
	•	Extensive IT infrastructure

Why it matters:
More complex operations require stronger governance and increase execution risk.
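
Size variables such as site counts and headcounts are often heavily right-skewed, so a few very large companies can dominate distance-based clustering. A minimal sketch of one common remedy, a `log1p` transform, is shown below. The column names in `SCALE_FIELDS` are hypothetical, and whether a log transform suits this dataset is an assumption to check against the actual distributions.

```python
import numpy as np
import pandas as pd

# Hypothetical column names for the scale-related variables.
SCALE_FIELDS = ["site_count", "employee_count", "country_count"]


def log_scale_counts(frame: pd.DataFrame, fields=SCALE_FIELDS) -> pd.DataFrame:
    """Return log(1 + x) versions of count columns; missing values stay missing."""
    counts = frame[fields].clip(lower=0)  # guard against negative entries
    return np.log1p(counts).add_suffix("_log")


# Toy data for illustration only.
example_ops = pd.DataFrame({
    "site_count": [0, 9],
    "employee_count": [99, 999],
    "country_count": [1, np.nan],
})
logged = log_scale_counts(example_ops)
```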

⸻

3. Organisational & Ownership Complexity Signals
These capture how layered or decentralised control and accountability may be.
	•	Presence of parent entities
	•	Global or domestic ultimate owners
	•	Large corporate family structures
	•	Franchise or decentralised operating models

Why it matters:
Layered ownership structures can increase governance and compliance complexity.
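
These ownership signals can be expressed as simple boolean flags. The sketch below assumes hypothetical columns `company_id`, `parent_id`, `global_ultimate_id`, and `family_size`, and the threshold of 10 for a "large" family is arbitrary and chosen purely for illustration.

```python
import pandas as pd


def ownership_flags(frame: pd.DataFrame, large_family_threshold: int = 10) -> pd.DataFrame:
    """Boolean ownership-complexity flags (column names are assumptions)."""
    flags = pd.DataFrame(index=frame.index)
    flags["has_parent"] = frame["parent_id"].notna()
    # A company listed as its own global ultimate sits at the top of its family,
    # so it only counts as externally controlled when the IDs differ.
    flags["has_external_ultimate"] = (
        frame["global_ultimate_id"].notna()
        & (frame["global_ultimate_id"] != frame["company_id"])
    )
    flags["large_family"] = frame["family_size"] >= large_family_threshold
    return flags


# Toy data for illustration only.
example_own = pd.DataFrame({
    "company_id": ["A", "B", "C"],
    "parent_id": [None, "A", None],
    "global_ultimate_id": ["A", "A", None],
    "family_size": [12, 12, 1],
})
own_flags = ownership_flags(example_own)
```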

⸻

4. Data Quality & Consistency Signals
These capture the reliability and completeness of reported information.
	•	Missing or inconsistent addresses
	•	Missing registration or legal identifiers
	•	Incomplete geographic information
	•	Conflicting records across fields

Why it matters:
Poor data quality may reflect weak reporting systems or fragmented data governance.
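
As a rough illustration, data quality signals can be counted per record. The column names below (`address`, `registration_number`, `employees_site`, `employees_total`) are hypothetical, and the consistency rule (site headcount exceeding total headcount) is just one example of a cross-field conflict.

```python
import pandas as pd


def data_quality_flags(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-record data quality flags plus a simple issue count."""
    flags = pd.DataFrame(index=frame.index)
    flags["missing_address"] = frame["address"].isna()
    flags["missing_registration"] = frame["registration_number"].isna()
    # Cross-field conflict: a single site cannot employ more people than the
    # whole company. Comparisons involving NaN evaluate to False.
    flags["site_exceeds_total"] = frame["employees_site"] > frame["employees_total"]
    flags["issue_count"] = flags.sum(axis=1)
    return flags


# Toy data for illustration only.
example_dq = pd.DataFrame({
    "address": ["1 Main St", None],
    "registration_number": ["R1", None],
    "employees_site": [10, 50],
    "employees_total": [100, 20],
})
dq_flags = data_quality_flags(example_dq)
```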

In [None]:
## Step 4: Clustering Methodology
## Step 4.1 Choice of algorithm
## Step 4.2 Parameter selection
## Step 5: Cluster Profiling
## Step 6: Risk & Anomaly Analysis

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import hdbscan

In [None]:
## Load Features ##

df = pd.read_csv("data/processed/features_for_clustering.csv")

company_ids = df["company_id"]
X = df.drop(columns=["company_id"])

In [None]:
## Scale Features ##

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
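
Distances are undefined for missing values, so any gaps left in the feature table need an explicit decision before clustering. The sketch below shows one option, median imputation followed by standardisation in a single scikit-learn pipeline. Median imputation is an assumption here, not a choice made in this notebook, and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature table with one missing value (illustration only).
X_demo = pd.DataFrame({
    "site_count_log": [0.0, 1.0, np.nan, 3.0],
    "transparency_gap": [0.2, 0.4, 0.6, 0.8],
})

# Fill gaps with the column median, then scale to zero mean and unit variance.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_demo_scaled = prep.fit_transform(X_demo)
```

If missingness is itself a risk signal, as in the transparency features, it may be better to encode it as its own feature than to impute it away.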

In [None]:
## Run HDBSCAN ##

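clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,  # smallest group of companies treated as a cluster
    min_samples=25        # higher values are more conservative: more points end up as noise
)

# fit_predict returns one integer label per row of X_scaled.
labels = clusterer.fit_predict(X_scaled)

df["cluster"] = labels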
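
HDBSCAN assigns the label `-1` to points it treats as noise, so the share of noise is a quick sanity check on the parameter choice. The helper below is a small sketch using only pandas. `summarise_labels` is a hypothetical name, and the label list is toy data.

```python
import pandas as pd


def summarise_labels(labels):
    """Return cluster sizes, the share of noise points, and the cluster count."""
    labels = pd.Series(labels, name="cluster")
    sizes = labels.value_counts().sort_index()
    noise_share = float((labels == -1).mean())     # -1 marks noise
    n_clusters = int((sizes.index != -1).sum())    # exclude the noise label
    return sizes, noise_share, n_clusters


# Toy labels for illustration only.
sizes, noise_share, n_clusters = summarise_labels([0, 0, 1, -1, 1, 0, -1, 1])
```

On the real output, `summarise_labels(df["cluster"])` would give the same three summaries. A very high noise share can suggest that `min_cluster_size` or `min_samples` is too strict for the data.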

In [None]:
## Read Outputs ##

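# Cluster sizes. Only the last expression in a cell is displayed, so print this one.
print(df["cluster"].value_counts())

# Per-cluster feature means. company_id is an identifier, not a feature, so drop it first.
df.drop(columns=["company_id"]).groupby("cluster").mean()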
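
Raw cluster means are hard to compare across features with different units. One possible way to profile clusters is to express each cluster mean as a deviation from the overall mean, in units of the overall standard deviation. The sketch below assumes a frame with an identifier column and a `cluster` column, as above. `profile_clusters` is a hypothetical helper and the data is a toy example.

```python
import numpy as np
import pandas as pd


def profile_clusters(frame, cluster_col="cluster", id_col="company_id"):
    """Cluster means as z-scores relative to the whole dataset."""
    numeric = frame.drop(columns=[id_col]).select_dtypes("number")
    features = numeric.drop(columns=[cluster_col])
    overall_mean = features.mean()
    # Constant features have zero spread; mark them NaN instead of dividing by 0.
    overall_std = features.std(ddof=0).replace(0, np.nan)
    cluster_means = numeric.groupby(cluster_col).mean()
    return (cluster_means - overall_mean) / overall_std


# Toy data for illustration only.
demo = pd.DataFrame({
    "company_id": ["A", "B", "C", "D"],
    "a": [1.0, 1.0, 3.0, 3.0],
    "cluster": [0, 0, 1, 1],
})
profile = profile_clusters(demo)
```

Values far from zero indicate the features that most distinguish a cluster from the dataset as a whole.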
