# Group A -- Champions Group

This notebook provides a production-ready, reproducible pipeline for:
- Cleaning and auditing the Champions Group dataset
- Engineering interpretable features for clustering
- Running dimensionality reduction and clustering (PCA, HDBSCAN)
- Interpreting and visualizing cluster profiles and risk types


### Import Packages

In [59]:
# Core libraries
import pandas as pd
import numpy as np
import os
import re

# Scikit-learn
from sklearn.metrics import mutual_info_score, silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


# HDBSCAN (for advanced clustering)
import hdbscan

# Scipy (optional, for entropy and sparse matrix check)
from scipy.stats import entropy
try:
    from scipy import sparse
except ImportError:
    sparse = None

pd.set_option('display.max_columns', 250)
from IPython.display import display


### Load Raw Dataset

This cell loads the original Champions Group dataset into memory.  
We also create a full raw copy (`df_raw_copy`) to ensure:
- Full reproducibility  
- Auditability of all later cleaning steps  
- Ability to compare "before vs after" during cleaning  


In [60]:
# Load raw dataset and create a backup for audit
df_raw = pd.read_csv("../raw_data/champions_group_data.csv")
df_raw_copy = df_raw.copy()
df_raw.shape

  df_raw = pd.read_csv("../raw_data/champions_group_data.csv")


(8559, 72)

### Define abbreviation detection logic

`is_abbreviation()` helper function used later to preserve uppercase abbreviations during cleaning.


In [61]:
def is_abbreviation(text: str) -> bool:
    if not isinstance(text, str):
        return False
    t = text.strip()
    if not t:
        return False
    if t.isupper() and len(t) <= 12:
        return True
    uppercase_ratio = sum(c.isupper() for c in t) / max(1, len(t))
    if uppercase_ratio > 0.6 and len(t) <= 18:
        return True
    if re.match(r"^[A-Z0-9][A-Z0-9\-\./]*$", t) and len(t) <= 20:
        return True
    return False

### Main Data Cleaning Function

This cell defines the core cleaning function used for preprocessing the dataset.  
It handles whitespace cleanup, pseudo-missing value replacement, safe numeric conversion,  
and controlled case normalization with abbreviation protection.


In [62]:
# Standardize pseudo-missing values and clean data
PSEUDO_MISSING = {"": pd.NA, "na": pd.NA, "n/a": pd.NA, "none": pd.NA, "null": pd.NA, "-": pd.NA}
GEO_COLUMNS = {"country", "country name", "country/region", "region", "state", "state/province", "parent state", "parent state/province", "global ultimate state", "global ultimate state/province", "domestic ultimate state", "domestic ultimate state/province"}
BUSINESS_CATEGORY_COLUMNS = {"entity type", "ownership type", "company status", "company status (active/inactive)", "manufacturing status", "parent company status", "global ultimate company status", "domestic ultimate company status"}

def clean_preserve_abbrev(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    obj_cols = df.select_dtypes(include="object").columns.tolist()
    for col in obj_cols:
        s = df[col].astype("string").str.strip().str.replace(r"\s+", " ", regex=True).replace(PSEUDO_MISSING, regex=False)
        df[col] = s
    for col in obj_cols:
        sample = df[col].dropna().astype(str).head(50)
        numeric_ratio = sample.str.match(r"^-?\d+(\.\d+)?$").mean()
        if numeric_ratio > 0.85:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    case_cols = df.select_dtypes(include=["object", "string"]).columns.tolist()
    for col in case_cols:
        s = df[col]
        col_lc = col.lower().strip()
        if col_lc in GEO_COLUMNS or col_lc in BUSINESS_CATEGORY_COLUMNS:
            df[col] = s.apply(lambda v: v.title() if isinstance(v, str) else v)
            continue
        if col_lc == "franchise status":
            df[col] = s
            continue
        non_null = s.dropna()
        str_values = non_null[non_null.apply(lambda x: isinstance(x, str))]
        if str_values.empty:
            continue
        avg_len = str_values.map(len).mean()
        max_words = (str_values.str.count(" ").max() or 0) + 1
        if avg_len <= 30 and max_words <= 3:
            df[col] = s.apply(lambda v: v if not isinstance(v, str) else (v if is_abbreviation(v) else v.lower()))
    return df

df_clean = clean_preserve_abbrev(df_raw_copy)


### Feature Engineering



### Helper functions

We standardize "presence" checks so empty strings don't count as filled.
We also parse range strings like `"1 to 10"` into numeric midpoints.


In [63]:
def has_value(series: pd.Series) -> pd.Series:
    s = series.astype("string").str.strip()
    return s.notna() & s.ne("")

def normalize_text(series: pd.Series) -> pd.Series:
    s = series.astype("string").str.strip().str.upper()
    return s.replace({"": pd.NA})

def range_to_midpoint(value):
    if isinstance(value, str):
        if "to" in value:
            parts = value.split("to")
        elif "-" in value:
            parts = value.split("-")
        else:
            return np.nan
        if len(parts) >= 2:
            try:
                low = float(parts[0].strip())
                high = float(parts[1].strip())
                return (low + high) / 2
            except ValueError:
                return np.nan
    return np.nan

_split_re = re.compile(r"[;,|]+")

def count_delimited_items(value) -> int:
    if value is None:
        return 0
    if not isinstance(value, str):
        value = str(value)
    s = value.strip()
    if s == "":
        return 0
    parts = [p.strip() for p in _split_re.split(s) if p.strip()]
    return len(parts) if parts else 0

def safe_to_numeric(series: pd.Series) -> pd.Series:
    return pd.to_numeric(series, errors="coerce")

def add_numeric_feature(features: pd.DataFrame, series: pd.Series, name: str, log: bool = True):
    series = safe_to_numeric(series)
    missing = series.isna()
    features[f"{name}_missing"] = missing.astype(int)
    if series.notna().any():
        fill_value = series.median()
        series_filled = series.fillna(fill_value)
    else:
        series_filled = series.fillna(0)
    features[name] = series_filled
    if log:
        features[f"log_{name}"] = np.log1p(series_filled.clip(lower=0))

def missing_ratio_for(cols):
    cols = [c for c in cols if c in df.columns]
    if not cols:
        return pd.Series(0, index=df.index)
    present = {}
    for c in cols:
        s = df[c]
        if pd.api.types.is_numeric_dtype(s):
            present[c] = s.notna()
        else:
            present[c] = has_value(s)
    present_df = pd.DataFrame(present)
    return 1 - present_df.mean(axis=1)


### Core transparency + structure signals (curated short names)

In [64]:
df = df_clean.copy()

features = pd.DataFrame(index=df.index)

def add_has(col, feature_name):
    if col in df.columns:
        features[feature_name] = has_value(df[col]).astype(int)

# Transparency / traceability
add_has("Website", "has_website")
add_has("Phone Number", "has_phone")
add_has("Address Line 1", "has_address")
add_has("City", "has_city")
add_has("State", "has_state")
add_has("State Or Province Abbreviation", "has_state_abbrev")
add_has("Postal Code", "has_postal_code")
add_has("Country", "has_country")
add_has("Region", "has_region")

# Company name (column appears labeled as Company Sites in this dataset)
add_has("Company Sites", "has_company_name")

# Ownership / structure
add_has("Parent Company", "has_parent")
add_has("Global Ultimate Company", "has_global_ultimate")
add_has("Domestic Ultimate Company", "has_domestic_ultimate")

# Verifiability extras
add_has("Ticker", "has_ticker")
add_has("Registration Number", "has_registration_number")
add_has("Company Description", "has_company_description")
add_has("Legal Status", "has_legal_status")
add_has("Ownership Type", "has_ownership_type")
add_has("Entity Type", "has_entity_type")

# Company status (value-coded)
status_col = "Company Status (Active/Inactive)"
if status_col in df.columns:
    status = df[status_col].astype("string").str.strip().str.lower()
    status_map = {"active": 1, "inactive": 0}
    features["company_status_binary"] = status.map(status_map)
    features["has_company_status"] = features["company_status_binary"].notna().astype(int)
    features["company_status_binary"] = features["company_status_binary"].fillna(0)


### Credibility / completeness score (and missing_ratio)


In [65]:
credibility_flag_cols = [c for c in [
    "has_website", "has_address", "has_phone",
    "has_ticker", "has_parent", "has_global_ultimate", "has_domestic_ultimate",
    "has_registration_number", "has_company_description",
    "has_company_status"
] if c in features.columns]

if credibility_flag_cols:
    features["credibility_score"] = features[credibility_flag_cols].sum(axis=1)
    features["credibility_score_norm"] = features["credibility_score"] / len(credibility_flag_cols)
    features["missing_ratio_credibility"] = 1 - features["credibility_score_norm"]

# Missingness ratios by group
contact_cols = [
    "Website", "Phone Number", "Address Line 1", "City", "State",
    "State Or Province Abbreviation", "Postal Code", "Country", "Region"
]
ownership_cols = [
    "Parent Company", "Parent Country/Region",
    "Global Ultimate Company", "Global Ultimate Country Name",
    "Domestic Ultimate Company"
]
financial_cols = [
    "Employees Single Site", "Employees Total", "Revenue (USD)", "Market Value (USD)",
    "Corporate Family Members", "Year Found"
]
it_cols = [
    "No. of PC", "No. of Desktops", "No. of Laptops", "No. of Routers",
    "No. of Servers", "No. of Storage Devices", "IT Budget", "IT spend"
]
code_cols = [
    "SIC Code", "8-Digit SIC Code", "NAICS Code", "NACE Rev 2 Code",
    "ANZSIC Code", "ISIC Rev 4 Code"
]

features["missing_ratio_contact"] = missing_ratio_for(contact_cols)
features["missing_ratio_ownership"] = missing_ratio_for(ownership_cols)
features["missing_ratio_financial"] = missing_ratio_for(financial_cols)
features["missing_ratio_it"] = missing_ratio_for(it_cols)
features["missing_ratio_codes"] = missing_ratio_for(code_cols)

all_missing_cols = contact_cols + ownership_cols + financial_cols + it_cols + code_cols
features["missing_ratio_overall"] = missing_ratio_for(all_missing_cols)


### Organisational complexity (group size by global ultimate)

Companies sharing the same global ultimate are treated as belonging to the same group.


In [66]:
if "Global Ultimate Company" in df.columns:
    key = normalize_text(df["Global Ultimate Company"])
    group_sizes = key.groupby(key).transform("size")
    features["org_complexity_count"] = group_sizes.fillna(0).astype(int)
    features["log_org_complexity_count"] = np.log1p(features["org_complexity_count"])


### Scale + market signals



In [67]:
# Coerce numeric fields and add log versions
numeric_cols = {
    "Employees Total": "employees_total",
    "Employees Single Site": "employees_single_site",
    "Revenue (USD)": "revenue_usd",
    "Market Value (USD)": "market_value_usd",
    "Corporate Family Members": "corporate_family_members",
}

for col, name in numeric_cols.items():
    if col in df.columns:
        add_numeric_feature(features, df[col], name, log=True)

# Company age
CURRENT_YEAR = 2026
if "Year Found" in df.columns:
    year_found = safe_to_numeric(df["Year Found"])
    company_age = CURRENT_YEAR - year_found
    company_age = company_age.where((company_age >= 0) & (company_age <= 300))
    features["company_age_missing"] = company_age.isna().astype(int)
    fill_value = company_age.median()
    if pd.isna(fill_value):
        fill_value = 0
    company_age_filled = company_age.fillna(fill_value)
    features["company_age"] = company_age_filled
    features["log_company_age"] = np.log1p(company_age_filled.clip(lower=0))

# Ratios
if "employees_total" in features.columns and "employees_single_site" in features.columns:
    denom = features["employees_total"].replace(0, np.nan)
    features["employee_concentration"] = (features["employees_single_site"] / denom).fillna(0)

if "revenue_usd" in features.columns and "employees_total" in features.columns:
    denom = features["employees_total"].replace(0, np.nan)
    features["revenue_per_employee"] = (features["revenue_usd"] / denom).fillna(0)

if "market_value_usd" in features.columns and "employees_total" in features.columns:
    denom = features["employees_total"].replace(0, np.nan)
    features["market_value_per_employee"] = (features["market_value_usd"] / denom).fillna(0)


### Geography + multinational heuristics

We avoid one-hot encoding high-cardinality city names. Use country/region + parent/ultimate country comparisons.


In [68]:
# Coordinates
if "Lattitude" in df.columns:
    add_numeric_feature(features, df["Lattitude"], "latitude", log=False)

if "Longitude" in df.columns:
    add_numeric_feature(features, df["Longitude"], "longitude", log=False)

entity_country = normalize_text(df["Country"]) if "Country" in df.columns else pd.Series(pd.NA, index=df.index)
parent_country = normalize_text(df["Parent Country/Region"]) if "Parent Country/Region" in df.columns else pd.Series(pd.NA, index=df.index)
global_country = normalize_text(df["Global Ultimate Country Name"]) if "Global Ultimate Country Name" in df.columns else pd.Series(pd.NA, index=df.index)

parent_present = has_value(df["Parent Company"]) if "Parent Company" in df.columns else pd.Series(False, index=df.index)
global_present = has_value(df["Global Ultimate Company"]) if "Global Ultimate Company" in df.columns else pd.Series(False, index=df.index)

features["parent_foreign_flag"] = (parent_present & parent_country.notna() & (parent_country != entity_country)).astype(int)
features["global_ultimate_foreign_flag"] = (global_present & global_country.notna() & (global_country != entity_country)).astype(int)
features["multinational_flag"] = ((features["parent_foreign_flag"] == 1) | (features["global_ultimate_foreign_flag"] == 1)).astype(int)

countries_df = pd.concat([entity_country, parent_country, global_country], axis=1)
features["num_countries_reported"] = countries_df.nunique(axis=1, dropna=True)


### IT / operational footprint signals

We parse IT asset ranges into midpoints, then build intensity & composition signals.


In [69]:
def parse_range_or_numeric(value):
    if isinstance(value, (int, float)) and not pd.isna(value):
        return float(value)
    if isinstance(value, str):
        mid = range_to_midpoint(value)
        if not pd.isna(mid):
            return mid
        try:
            return float(value.strip())
        except ValueError:
            return np.nan
    return np.nan

asset_cols_map = {
    "No. of PC": "pc_midpoint",
    "No. of Desktops": "desktops_midpoint",
    "No. of Laptops": "laptops_midpoint",
    "No. of Routers": "routers_midpoint",
    "No. of Servers": "servers_midpoint",
    "No. of Storage Devices": "storage_devices_midpoint",
}

for col, name in asset_cols_map.items():
    if col in df.columns:
        series = df[col].apply(parse_range_or_numeric)
        add_numeric_feature(features, series, name, log=True)

asset_feature_cols = [name for name in asset_cols_map.values() if name in features.columns]
if asset_feature_cols:
    features["it_assets_total"] = features[asset_feature_cols].sum(axis=1)
    features["log_it_assets_total"] = np.log1p(features["it_assets_total"])

if "IT Budget" in df.columns:
    add_numeric_feature(features, df["IT Budget"], "it_budget", log=True)
if "IT spend" in df.columns:
    add_numeric_feature(features, df["IT spend"], "it_spend", log=True)

if "it_budget" in features.columns and "it_spend" in features.columns:
    denom = features["it_budget"].replace(0, np.nan)
    features["it_spend_rate"] = (features["it_spend"] / denom).fillna(0).clip(lower=0, upper=3)
    features["it_budget_gap"] = features["it_budget"] - features["it_spend"]
    features["log_abs_it_budget_gap"] = np.log1p(features["it_budget_gap"].abs())

if "it_assets_total" in features.columns and "employees_total" in features.columns:
    denom = features["employees_total"].replace(0, np.nan)
    features["it_assets_per_employee"] = (features["it_assets_total"] / denom).fillna(0)

if "it_spend" in features.columns and "employees_total" in features.columns:
    denom = features["employees_total"].replace(0, np.nan)
    features["it_spend_per_employee"] = (features["it_spend"] / denom).fillna(0)


### Industry code features (low-cardinality sector buckets)

Full industry codes can be high-cardinality. For clustering/PCA, we use **2-digit buckets** (industry sectors).


In [70]:
code_cols = {
    "SIC Code": "sic_code",
    "8-Digit SIC Code": "sic8_code",
    "NAICS Code": "naics_code",
    "NACE Rev 2 Code": "nace2_code",
    "ANZSIC Code": "anzsic_code",
    "ISIC Rev 4 Code": "isic4_code",
}

for col, prefix in code_cols.items():
    if col in df.columns:
        counts = df[col].astype("string").apply(count_delimited_items)
        features[f"{prefix}_count"] = counts
        features[f"has_{prefix}"] = (counts > 0).astype(int)

if "Registration Number" in df.columns:
    reg_counts = df["Registration Number"].astype("string").apply(count_delimited_items)
    features["registration_number_count"] = reg_counts


### Categorical encoding (PCA/clustering ready)

We one-hot encode selected low-cardinality categoricals + sector buckets.


In [71]:
bool_map = {"true": 1, "false": 0, "yes": 1, "no": 0, "y": 1, "n": 0, "1": 1, "0": 0}

def add_boolean_feature(col, name):
    if col not in df.columns:
        return
    s = df[col].astype("string").str.strip().str.lower()
    mapped = s.map(bool_map)
    features[name] = mapped.fillna(0).astype(int)
    features[f"{name}_missing"] = mapped.isna().astype(int)

add_boolean_feature("Is Headquarters", "is_headquarters")
add_boolean_feature("Is Domestic Ultimate", "is_domestic_ultimate")

candidate_categoricals = [
    "Region",
    "Entity Type",
    "Ownership Type",
    "Legal Status",
    "Franchise Status",
    "Manufacturing Status",
    "Registration Number Type",
]

categorical_cols = []
for col in candidate_categoricals:
    if col in df.columns and df[col].nunique(dropna=True) <= 20:
        categorical_cols.append(col)

if categorical_cols:
    df_cats = df[categorical_cols].copy()
    for col in categorical_cols:
        df_cats[col] = df_cats[col].astype("string").str.strip()
    df_dummies = pd.get_dummies(df_cats, prefix=categorical_cols, prefix_sep="__", dummy_na=True)
else:
    df_dummies = pd.DataFrame(index=df.index)



  features[name] = mapped.fillna(0).astype(int)
  features[f"{name}_missing"] = mapped.isna().astype(int)
  features[name] = mapped.fillna(0).astype(int)
  features[f"{name}_missing"] = mapped.isna().astype(int)


### Build final feature matrices (raw + scaled)

- Drop obvious identifiers & free-text
- Keep numeric + engineered features + one-hot columns
- Create scaled matrix for PCA/clustering


In [72]:
df_features_raw = pd.concat([features, df_dummies], axis=1)

# Drop columns that are fully missing
df_features_raw = df_features_raw.dropna(axis=1, how="all")

# Ensure numeric-only
df_features_raw = df_features_raw.apply(pd.to_numeric, errors="coerce")


In [73]:
# Fix any remaining NaNs
if df_features_raw.isna().any().any():
    nan_cols = df_features_raw.columns[df_features_raw.isna().any()].tolist()
    print("NaNs found in raw features. Top columns:", nan_cols[:20])
    for col in nan_cols:
        series = df_features_raw[col]
        fill_value = series.median()
        if pd.isna(fill_value):
            fill_value = 0
        df_features_raw[col] = series.fillna(fill_value)

assert not df_features_raw.isna().any().any(), "NaNs remain in raw features"

# Scale for PCA/clustering
scaler = StandardScaler()
df_features_scaled = pd.DataFrame(
    scaler.fit_transform(df_features_raw),
    columns=df_features_raw.columns,
    index=df_features_raw.index
)

assert not df_features_scaled.isna().any().any(), "NaNs remain in scaled features"


In [74]:
def find_first_column(candidates):
    for col in candidates:
        if col in df.columns:
            return col
    return None

duns_col = None
for col in df.columns:
    if col.strip().lower() == "duns number":
        duns_col = col
        break

name_col = find_first_column(["Company Name", "Company", "Company Sites"])

row_id_map = pd.DataFrame({"row_index": df.index})
if duns_col:
    row_id_map["DUNS Number"] = df[duns_col]
if name_col:
    row_id_map["Company Name"] = df[name_col]



In [75]:
feature_descriptions = {
    "has_website": "1 if Website is present",
    "has_phone": "1 if Phone Number is present",
    "has_address": "1 if Address Line 1 is present",
    "has_city": "1 if City is present",
    "has_state": "1 if State is present",
    "has_state_abbrev": "1 if State Or Province Abbreviation is present",
    "has_postal_code": "1 if Postal Code is present",
    "has_country": "1 if Country is present",
    "has_region": "1 if Region is present",
    "has_company_name": "1 if Company Sites (company name) is present",
    "has_parent": "1 if Parent Company is present",
    "has_global_ultimate": "1 if Global Ultimate Company is present",
    "has_domestic_ultimate": "1 if Domestic Ultimate Company is present",
    "has_ticker": "1 if Ticker is present",
    "has_registration_number": "1 if Registration Number is present",
    "has_company_description": "1 if Company Description is present",
    "has_legal_status": "1 if Legal Status is present",
    "has_ownership_type": "1 if Ownership Type is present",
    "has_entity_type": "1 if Entity Type is present",
    "company_status_binary": "1 if Company Status is active, 0 if inactive or missing",
    "has_company_status": "1 if Company Status is present",
    "credibility_score": "Count of core transparency/ownership flags present",
    "credibility_score_norm": "Credibility score normalized by number of flags",
    "missing_ratio_credibility": "1 - credibility_score_norm",
    "missing_ratio_contact": "Missing ratio across contact fields",
    "missing_ratio_ownership": "Missing ratio across ownership fields",
    "missing_ratio_financial": "Missing ratio across financial fields",
    "missing_ratio_it": "Missing ratio across IT fields",
    "missing_ratio_codes": "Missing ratio across industry code fields",
    "missing_ratio_overall": "Missing ratio across key groups",
    "org_complexity_count": "Count of entities sharing the same global ultimate",
    "log_org_complexity_count": "log1p of org_complexity_count",
    "employees_total": "Employees total (imputed)",
    "employees_single_site": "Employees single site (imputed)",
    "revenue_usd": "Revenue USD (imputed)",
    "market_value_usd": "Market value USD (imputed)",
    "corporate_family_members": "Corporate family members (imputed)",
    "company_age": "Company age in years (imputed)",
    "employee_concentration": "Employees single site divided by employees total",
    "revenue_per_employee": "Revenue per employee (USD)",
    "market_value_per_employee": "Market value per employee (USD)",
    "latitude": "Latitude (imputed)",
    "longitude": "Longitude (imputed)",
    "parent_foreign_flag": "1 if parent country differs from entity country",
    "global_ultimate_foreign_flag": "1 if global ultimate country differs from entity country",
    "multinational_flag": "1 if parent or global ultimate is foreign",
    "num_countries_reported": "Count of unique countries reported (entity/parent/global)",
    "pc_midpoint": "Midpoint of PC count range (imputed)",
    "desktops_midpoint": "Midpoint of desktop count range (imputed)",
    "laptops_midpoint": "Midpoint of laptop count range (imputed)",
    "routers_midpoint": "Midpoint of router count range (imputed)",
    "servers_midpoint": "Midpoint of server count range (imputed)",
    "storage_devices_midpoint": "Midpoint of storage device count range (imputed)",
    "it_assets_total": "Sum of IT asset midpoints",
    "log_it_assets_total": "log1p of it_assets_total",
    "it_budget": "IT budget (imputed)",
    "it_spend": "IT spend (imputed)",
    "it_spend_rate": "IT spend divided by IT budget (clipped)",
    "it_budget_gap": "IT budget minus IT spend",
    "log_abs_it_budget_gap": "log1p of absolute IT budget gap",
    "it_assets_per_employee": "IT assets per employee",
    "it_spend_per_employee": "IT spend per employee",
    "sic_code_count": "Number of SIC codes (split on delimiters)",
    "sic8_code_count": "Number of 8-digit SIC codes (split on delimiters)",
    "naics_code_count": "Number of NAICS codes (split on delimiters)",
    "nace2_code_count": "Number of NACE Rev 2 codes (split on delimiters)",
    "anzsic_code_count": "Number of ANZSIC codes (split on delimiters)",
    "isic4_code_count": "Number of ISIC Rev 4 codes (split on delimiters)",
    "has_sic_code": "1 if SIC Code is present",
    "has_sic8_code": "1 if 8-digit SIC Code is present",
    "has_naics_code": "1 if NAICS Code is present",
    "has_nace2_code": "1 if NACE Rev 2 Code is present",
    "has_anzsic_code": "1 if ANZSIC Code is present",
    "has_isic4_code": "1 if ISIC Rev 4 Code is present",
    "registration_number_count": "Number of registration numbers (split on delimiters)",
}

def describe_feature(name):
    if name in feature_descriptions:
        return feature_descriptions[name]
    if name.endswith("_missing"):
        base = name[:-8]
        return f"Missing indicator for {base}"
    if name.startswith("log_"):
        base = name[4:]
        return f"log1p of {base}"
    if "__" in name:
        base, val = name.split("__", 1)
        val_norm = val.strip().lower()
        if val_norm in ("nan", "<na>"):
            return f"Missing indicator for {base}"
        return f"One-hot: {base} = {val}"
    return "Derived numeric feature"

feature_dict = pd.DataFrame({
    "feature": df_features_raw.columns,
    "description": [describe_feature(c) for c in df_features_raw.columns],
})


In [76]:
# Drop Similar columns
cols_to_drop = [
    "has_state_abbrev",
    "company_status_binary",
    "credibility_score",
    "missing_ratio_credibility",
    "org_complexity_count",
    "employees_total",
    "log_employees_single_site",
    "revenue_usd",
    "market_value_usd",
    "corporate_family_members",
    "company_age",
    "pc_midpoint",
    "desktops_midpoint",
    "laptops_midpoint",
    "routers_midpoint",
    "servers_midpoint",
    "storage_devices_midpoint",
    "it_assets_total",
    "it_budget",
    "it_spend",
    "it_budget_gap",]

bool_list = [col in df_features_scaled.columns for col in cols_to_drop]
df_features_scaled_drop = df_features_scaled.drop(columns=cols_to_drop)

In [77]:
df_features_scaled_drop.to_csv(
    "../processed_data/features_for_clustering_scaled_dropped.csv",
    index=False
)


### PCA Dimensionality Reduction

We reduce dimensionality to make clustering faster and more stable.


In [78]:
# C) Config / File Paths (centralized)
SCALED_PATH = "../processed_data/features_for_clustering_scaled_dropped.csv"

# PCA-transformed (reduced) feature space output
OUT_PCA_DATA = "../processed_data/pca_data.csv"

RANDOM_STATE = 42

In [79]:
# E2) Compute PCA (dense) or TruncatedSVD (sparse) -> df_pca
df_scaled = pd.read_csv(SCALED_PATH)

reducer = PCA(n_components=0.95, random_state=RANDOM_STATE)
X_reduced = reducer.fit_transform(df_scaled)


df_pca = pd.DataFrame(X_reduced, columns=[f"PC{i+1}" for i in range(X_reduced.shape[1])])

# Save reduced feature space for downstream modeling/analysis
df_pca.to_csv(OUT_PCA_DATA, index=False)


## Clustering & Risk Interpretion
This dataset does not contain outcome labels such as defaults, losses, fraud, or failures.
Therefore, risk is not predictive and not financial risk in the strict sense.

In this analysis, risk is defined as elevated uncertainty arising from a company’s:

	- operating profile
	- organisational structure
	- information transparency

Risk reflects how difficult a company is to assess, monitor, or compare, rather than whether it is “bad”.

# Types of risk that are in scope

Based on the available variables, this dataset supports structural and operational risk analysis, specifically:

**1. Transparency & disclosure risk**

These capture how much information a company makes available and how easy it is to assess.

	- Missing or incomplete public-facing information
	- Lack of financial or size indicators
	- Minimal descriptive detail about operations

**2. Operational complexity risk**

These capture the scale and coordination burden of a company’s operations.

	- Large number of operating sites
	- Large workforce size
	- Broad geographic footprint
	- Extensive IT infrastructure

**3. Organisational & ownership complexity risk**

These capture how layered or decentralised control and accountability may be.

	- Presence of parent entities
	- Global or domestic ultimate owners
	- Large corporate family structures
	- Franchise or decentralised operating models

**4. Data quality risk**

These capture the reliability and completeness of reported information.

	- Missing or inconsistent addresses
	- Missing registration or legal identifiers
	- Incomplete geographic information
	- Conflicting records across fields


In [80]:
df_pca = pd.read_csv("../processed_data/pca_data.csv")
X = df_pca.select_dtypes(include="number").to_numpy()

df_feature = pd.read_csv("../processed_data/features_for_clustering_scaled_dropped.csv")
Y = df_feature

assert len(df_pca) == len(df_feature), "Row count mismatch: PCA rows and feature rows not aligned" 

## 2. Clustering (HDBSCAN) and Interpretation

In [81]:
## Run HDBSCAN ##
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,
    min_samples=25
)

labels = clusterer.fit_predict(X)

pd.Series(labels).value_counts()

df_clusters = Y.copy()
df_clusters["cluster"] = labels
cluster_profiles = df_clusters.groupby("cluster").mean()

###############
# Separate Real Clusters VS Noise
cluster_profiles_no_noise = cluster_profiles.loc[cluster_profiles.index != -1]

# Save cluster profile tables
out_dir = "../processed_data"
os.makedirs(out_dir, exist_ok=True)


cluster_profiles_no_noise.T.to_csv(
    os.path.join(out_dir, "cluster_profiles_no_noise_transposed.csv"),
    index=True,
)


In [82]:
# Attach cluster labels to raw champions group data
# use for LLM analysis later
if len(df_raw) != len(labels):
    raise ValueError(
        f"Row count mismatch: raw_data has {len(df_raw)} rows, labels has {len(labels)} rows"
    )

df_raw_with_cluster = df_raw.copy()
df_raw_with_cluster["cluster_id"] = labels

out_path = "../raw_data/champions_group_data_with_cluster.csv"
df_raw_with_cluster.to_csv(out_path, index=False)



In [83]:
n_companies, n_features = X.shape
n_noise = (labels == -1).sum()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

### HDBSCAN identified **46** clusters among 8,559 companies, with approximately **24.19%** of firms classified as **noise**. Noise firms exhibit sparse or inconsistent records and are treated as high-uncertainty cases rather than forced into clusters.

In [84]:
pd.Series(labels).value_counts().drop(-1).head(5)

46    451
45    364
28    340
35    336
16    276
Name: count, dtype: int64

## Interpretation:

- These are the 5 largest real clusters
- They cover a substantial portion of the dataset



### Cluster 45 – Large, IT-intensive subsidiary enterprises

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `Entity_Type__Subsidiary` | High | Firms are predominantly subsidiaries, indicating layered organisational structures |
| `log_revenue_usd` | High | High operating revenue and large-scale business activity |
| `log_market_value_usd` | High | Strong firm valuation and economic significance |
| `log_it_budget` | High | Substantial allocation of resources to IT infrastructure |
| `it_spend_rate` | High | IT spending scales aggressively with firm size |
| `log_it_spend` | High | High absolute IT expenditure |
| `log_abs_it_budget_gap` | High | Significant IT budget adjustments, suggesting active IT planning or expansion |
| `log_employees_total` | High | Large workforce and organisational scale |
| `missing_ratio_it` | Low | IT-related data is largely complete and well-reported |
| `servers_midpoint_missing` | Low | Consistent reporting of core IT infrastructure assets |
| `Entity_Type__Branch` | Low | Unlikely to operate as small or peripheral branch entities |

Cluster 45 captures large, economically significant subsidiary enterprises with complex organisational structures and strong reliance on IT infrastructure. These firms exhibit high revenue, valuation, headcount, and IT spending, reflecting operations at substantial scale. The low level of missingness across IT and infrastructure-related variables indicates mature internal reporting practices and well-documented systems. Risk exposure in this cluster is therefore driven primarily by **operational scale and governance complexity**, rather than data opacity or lack of transparency.

In [85]:
cid_1 = 45
cluster_profiles_no_noise.loc[cid_1].sort_values(ascending=False).head(10)


log_pc_midpoint            1.540776
Entity Type__Subsidiary    1.170580
log_employees_total        1.103940
log_it_assets_total        0.939931
log_market_value_usd       0.915978
log_abs_it_budget_gap      0.876260
log_it_spend               0.874172
log_it_budget              0.871707
log_revenue_usd            0.823631
it_spend_rate              0.796227
Name: 45, dtype: float64

In [86]:
cluster_profiles_no_noise.loc[cid_1].sort_values(ascending=True).head(10)

missing_ratio_it                   -0.866667
servers_midpoint_missing           -0.801750
storage_devices_midpoint_missing   -0.801550
Entity Type__Branch                -0.797430
routers_midpoint_missing           -0.791951
has_company_description            -0.750715
Franchise Status__FALSE            -0.630749
log_company_age                    -0.589069
has_phone                          -0.493089
Entity Type__Parent                -0.483364
Name: 45, dtype: float64

### Cluster 44 - High-infrastructure, IT-heavy subsidiaries

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `log_pc_midpoint` | High | Very high density of endpoint devices across the organisation |
| `log_employees_total` | High | Large overall workforce size |
| `log_it_assets_total` | High | Extensive IT infrastructure and asset base |
| `log_market_value_usd` | High | Strong firm valuation and economic significance |
| `log_it_budget` | High | Substantial allocation of resources to IT operations |
| `log_it_spend` | High | High absolute expenditure on IT systems |
| `it_spend_rate` | High | IT spending scales aggressively with organisational size |
| `log_abs_it_budget_gap` | High | Active adjustment and expansion of IT budgets |
| `Entity Type_Subsidiary` | High | Predominantly subsidiary entities within larger corporate groups |
| `missing_ratio_it`| Low | Strong completeness and reliability of IT-related data |
| `servers_midpoint_missing` | Low | Consistent reporting of server infrastructure |
| `storage_devices_midpoint_missing` | Low | Consistent reporting of storage infrastructure |
| `routers_midpoint_missing` | Low | Consistent reporting of network infrastructure |
| `Entity Type_Branch` | Low | Less likely to operate as small or peripheral branch entities |

Cluster 44 represents large subsidiary organisations characterised by exceptionally high endpoint density and extensive IT infrastructure. Elevated workforce size, market valuation, and IT expenditure indicate enterprises operating at substantial scale with strong reliance on internal digital systems to support daily operations. The presence of significant IT budget gaps suggests active investment, expansion, or ongoing digital transformation rather than static IT environments. Low levels of IT and infrastructure data missingness reflect **mature governance and well-established reporting practices**. Overall, risk exposure within this cluster is driven primarily by **infrastructure scale, asset sprawl, and change-management complexity**, rather than **data opacity or informational gaps**.

In [87]:
cid_2 = 44
cluster_profiles_no_noise.loc[cid_2].sort_values(ascending=False).head(10)

has_phone                  2.028029
log_company_age            1.444366
Entity Type__Subsidiary    0.991916
it_spend_rate              0.795564
employee_concentration     0.759680
missing_ratio_codes        0.687262
log_employees_total        0.683026
credibility_score_norm     0.677902
has_registration_number    0.656348
Franchise Status__<NA>     0.646160
Name: 44, dtype: float64

In [88]:
cluster_profiles_no_noise.loc[cid_2].sort_values(ascending=True).head(10)

missing_ratio_it                   -0.856606
missing_ratio_contact              -0.809810
servers_midpoint_missing           -0.801750
routers_midpoint_missing           -0.791951
storage_devices_midpoint_missing   -0.770359
has_company_description            -0.750715
Franchise Status__FALSE            -0.630749
Entity Type__Branch                -0.616419
Entity Type__Parent                -0.483364
missing_ratio_overall              -0.435175
Name: 44, dtype: float64

### Cluster 27 – Data-poor, low-IT branch organisations

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `routers_midpoint_missing` | High | Network infrastructure details are largely unreported |
| `servers_midpoint_missing` | High | Server infrastructure information is frequently missing |
| `storage_devices_midpoint_missing` | High | Storage infrastructure is poorly documented |
| `missing_ratio_it` | High | IT-related data is largely incomplete |
| `missing_ratio_overall` | High | High overall data missingness across features |
| `missing_ratio_codes` | High | Operational or classification codes are often absent |
| `Entity Type_Branch` | High | Firms are predominantly small branch-level entities |
| `Franchise Status__NA` | High | Franchise status is frequently unreported or unclear |
| `longitude_missing` | High | Geographic location data is incomplete |
| `latitude_missing` | High | Geographic location data is incomplete |
| `employee_concentration` | Low | Workforce is not concentrated at large operating sites |
| `log_employees_total` | Low | Small overall workforce size |
| `log_revenue_usd` | Low | Low operating revenue |
| `log_market_value_usd` | Low | Low firm valuation |
| `log_it_budget` | Low | Limited allocation of resources to IT |
| `log_it_spend` | Low | Low absolute IT expenditure |
| `it_spend_rate` | Low | IT spending does not scale with operations |
| `log_abs_it_budget_gap` | Low | Minimal IT budget adjustment or investment activity |
| `Entity Type_Subsidiary` | Low | Unlikely to operate as subsidiary entities |

Cluster 27 represents small, branch-level organisations characterised by extensive data incompleteness and limited IT investment. High missingness across IT infrastructure, operational, and geographic variables suggests weak internal reporting practices and low system maturity. These firms operate at relatively small scale, with low revenue, valuation, workforce size, and minimal IT budgets, indicating limited reliance on digital infrastructure. Risk exposure in this cluster is driven primarily by **data opacity and informational gaps** rather than **operational complexity**, making these entities difficult to assess due to lack of visibility rather than scale or technological sophistication.

In [89]:
cid_3 = 27
cluster_profiles_no_noise.loc[cid_3].sort_values(ascending=False).head(10)

log_org_complexity_count            5.521514
log_corporate_family_members        3.259565
Franchise Status__FALSE             1.585417
Entity Type__Branch                 1.254029
servers_midpoint_missing            1.247272
storage_devices_midpoint_missing    1.221947
has_company_description             1.089072
credibility_score_norm              0.947800
it_assets_per_employee              0.917558
employee_concentration              0.759680
Name: 27, dtype: float64

In [90]:
cluster_profiles_no_noise.loc[cid_3].sort_values(ascending=True).head(10)

latitude_missing          -1.865785
longitude_missing         -1.864528
Franchise Status__<NA>    -1.547605
it_spend_rate             -1.256483
log_revenue_usd           -1.235520
log_it_budget             -1.222990
log_it_spend              -1.219741
log_abs_it_budget_gap     -1.216723
log_market_value_usd      -1.057564
Entity Type__Subsidiary   -0.854277
Name: 27, dtype: float64

### Cluster 34 - Mature, high-credibility and transparent firms

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `has_company_description` | High | Public-facing company information is consistently available |
| `credibility_score_norm` | High | High overall credibility and trustworthiness score |
| `log_company_age` | High | Older, more established organisations |
| `it_spend_rate` | High | IT spending scales reliably with operations |
| `employee_concentration` | High | Workforce is concentrated at main operating sites |
| `log_revenue_usd` | High | Stable and relatively high operating revenue |
| `has_phone` | High | Contact information is consistently available |
| `log_it_budget` | High | Sustained allocation of resources to IT |
| `log_it_spend` | High | Consistent IT expenditure |
| `log_market_value_usd` | High | Solid firm valuation and economic standing |
| `latitude_missing` | Low | Geographic location data is consistently available |
| `longitude_missing` | Low | Geographic location data is consistently available |
| `missing_ratio_overall` | Low | Low overall data missingness |
| `missing_ratio_it` | Low | IT-related data is largely complete |
| `servers_midpoint_missing` | Low | Server infrastructure is well-documented |
| `storage_devices_midpoint_missing` | Low | Storage infrastructure is well-documented |
| `routers_midpoint_missing` | Low | Network infrastructure is well-documented |
| `Entity Type_Branch` | Low | Less likely to operate as small branch entities |
| `missing_ratio_codes` | Low | Operational and classification codes are consistently reported |

Cluster 34 represents mature, well-established organisations characterised by high credibility and strong data transparency. These firms exhibit consistent availability of public-facing information, reliable contact details, and comprehensive reporting across operational, geographic, and IT-related dimensions. Elevated company age, stable revenue, and sustained IT spending suggest organisations with settled operating models rather than rapid expansion or restructuring. Low levels of missingness across infrastructure and descriptive variables indicate strong governance and disciplined internal processes. Overall, risk exposure for this cluster is comparatively low and driven more by **standard operational considerations** than by **data opacity, infrastructure sprawl, or governance complexity**.

In [91]:
cid_4 = 34
cluster_profiles.loc[cid_4].sort_values(ascending=False).head(10)

Entity Type__Branch        1.229894
has_company_description    1.111534
Franchise Status__FALSE    0.672878
has_registration_number    0.630702
credibility_score_norm     0.577041
log_company_age            0.393159
Legal Status__3.0          0.385223
Ownership Type__Private    0.384422
has_ownership_type         0.380408
has_state                  0.253027
Name: 34, dtype: float64

In [92]:
cluster_profiles.loc[cid_4].sort_values(ascending=True).head(10)

latitude_missing                   -1.865785
longitude_missing                  -1.864528
it_spend_rate                      -1.256483
log_revenue_usd                    -1.235520
log_it_budget                      -1.222990
log_it_spend                       -1.219741
log_abs_it_budget_gap              -1.216723
log_market_value_usd               -1.057564
Entity Type__Subsidiary            -0.830456
storage_devices_midpoint_missing   -0.818366
Name: 34, dtype: float64

### Cluster 16 - Small branch firms with limited IT visibility despite formal registration

| Key Signal | Direction | Meaning |
|-----------|----------|--------|
| `Franchise Status_FALSE` | High | Firms are predominantly non-franchise entities |
| `has_company_description` | High | Public-facing company information is available |
| `routers_midpoint_missing` | High | Network infrastructure details are largely unreported |
| `servers_midpoint_missing` | High | Server infrastructure information is frequently missing |
| `storage_devices_midpoint_missing` | High | Storage infrastructure is poorly documented |
| `missing_ratio_it` | High | IT-related data is largely incomplete |
| `Entity Type_Branch` | High | Firms are predominantly branch-level entities |
| `credibility_score_norm` | High | Relatively high credibility despite operational simplicity |
| `has_registration_number` | High | Formal registration information is available |
| `log_corporate_family_members` | High | Firms belong to small corporate groups |
| `employee_concentration` | Low | Workforce is not concentrated at large operating sites |
| `log_employees_total` | Low | Small overall workforce size |
| `log_revenue_usd` | Low | Low operating revenue |
| `log_market_value_usd` | Low | Low firm valuation |
| `log_it_budget` | Low | Limited allocation of resources to IT |
| `log_it_spend` | Low | Low absolute IT expenditure |
| `it_spend_rate` | Low | IT spending does not scale with operations |
| `log_abs_it_budget_gap` | Low | Minimal IT budget adjustment or expansion activity |

Cluster 16 represents small, branch-level organisations with limited operational scale and minimal reliance on IT infrastructure. While these firms exhibit relatively strong formal indicators such as company descriptions, registration numbers, and moderate credibility scores, they display substantial gaps in IT and infrastructure reporting. Low workforce size, revenue, valuation, and IT spending suggest simple operating models with limited technological dependence. Risk exposure in this cluster is driven primarily by **weak IT visibility and infrastructure transparency** rather than **governance or organisational complexity**, making oversight challenging due to incomplete technical information despite basic corporate legitimacy.

In [93]:
cid_5 = 16
cluster_profiles.loc[cid_5].sort_values(ascending=False).head(10)

Franchise Status__FALSE             1.585417
has_company_description             1.332063
routers_midpoint_missing            1.262704
Entity Type__Branch                 1.254029
servers_midpoint_missing            1.247272
storage_devices_midpoint_missing    1.221947
missing_ratio_it                    1.102224
credibility_score_norm              0.677902
has_registration_number             0.656348
log_corporate_family_members        0.550856
Name: 16, dtype: float64

In [94]:
cluster_profiles.loc[cid_5].sort_values(ascending=True).head(10)

missing_ratio_codes      -2.016983
Franchise Status__<NA>   -1.547605
employee_concentration   -1.316343
it_spend_rate            -1.256483
log_revenue_usd          -1.235520
log_it_budget            -1.222990
log_it_spend             -1.219741
log_abs_it_budget_gap    -1.216723
log_market_value_usd     -1.057564
log_employees_total      -1.044719
Name: 16, dtype: float64

### Summarise

| Cluster | Dominant Risk Type | Key Driver |
|--------|------------------|-----------|
| 27 | Transparency & data quality risk | High missingness across operational and IT variables |
| 16 | IT visibility risk | Poor infrastructure reporting despite formal registration |
| 34 | Low risk (high transparency) | Mature operations with strong disclosure quality |
| 44 | Operational complexity risk | Dense IT assets and infrastructure sprawl |
| 45 | Governance & coordination risk | Large-scale subsidiary structures with heavy IT reliance |