# Member 2 â€” Feature Engineering (Signal Construction)

**Goal:** Turn company attributes into **numeric, comparable, explainable signals** for **PCA + clustering**.

**Scope (Member 2):**
- Start from **Member 1's cleaned & imputed dataset** (recommended: `data/processed/clean_base.csv`).
- Do **not** make global cleaning decisions (dropping rows/columns, global imputation strategies, etc.).
- You *may* create **presence indicators** and derived features that treat missingness as signal.

**Outputs (for handoff):**
- `df_features_raw`: numeric feature matrix (not scaled)
- `df_features_scaled`: scaled matrix for PCA/clustering
- `feature_dict`: short description of engineered features

---


In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_columns', 250)


In [4]:
# -------------------------------
# Load data
# -------------------------------
# Prefer Member 1's output (cleaned & imputed). Fallback to raw for sandboxing.

import os

CLEAN_PATH = '../data/processed/clean_base.csv'
RAW_FALLBACK = '../data/champions_group_data.csv'

if os.path.exists(CLEAN_PATH):
    df = pd.read_csv(CLEAN_PATH)
    data_source = CLEAN_PATH
else:
    df = pd.read_csv(RAW_FALLBACK)
    data_source = RAW_FALLBACK

print('Loaded:', data_source)
print('Shape:', df.shape)


Loaded: ../data/champions_group_data.csv
Shape: (8559, 72)


  df = pd.read_csv(RAW_FALLBACK)


## 1) Helper functions

We standardize "presence" checks so empty strings don't count as filled.
We also parse range strings like `"1 to 10"` into numeric midpoints.


In [5]:
def has_value(series: pd.Series) -> pd.Series:
    # True if non-missing AND not empty/whitespace
    s = series.astype('string').str.strip()
    return s.notna() & s.ne('')


def range_to_midpoint(value):
    # Convert 'x to y' -> midpoint; return NaN otherwise
    if isinstance(value, str) and 'to' in value:
        low, high = value.split('to')
        try:
            return (float(low.strip()) + float(high.strip())) / 2
        except ValueError:
            return np.nan
    return np.nan


## 2) Core transparency + structure signals (curated short names)

These are your **interpretable, low-risk** binary signals.


In [6]:
# Transparency / traceability

df['has_website'] = has_value(df['Website']).astype(int)
df['has_phone']   = has_value(df['Phone Number']).astype(int)
df['has_address'] = has_value(df['Address Line 1']).astype(int)

# Ownership / structure

df['has_parent']           = has_value(df['Parent Company']).astype(int)
df['has_global_ultimate']  = has_value(df['Global Ultimate Company']).astype(int)
df['has_domestic_ultimate']= has_value(df['Domestic Ultimate Company']).astype(int)

# Verifiability extras

df['has_ticker'] = has_value(df['Ticker']).astype(int)
df['has_registration_number'] = has_value(df['Registration Number']).astype(int)
df['has_company_description'] = has_value(df['Company Description']).astype(int)

# Company status (value-coded)
status_col = 'Company Status (Active/Inactive)'
df['company_status_binary'] = (
    df[status_col].astype('string').str.strip().str.lower().map({'active': 1, 'inactive': 0})
)
# Presence of status (for completeness scoring)
df['has_company_status'] = df['company_status_binary'].notna().astype(int)


## 3) Credibility / completeness score (and missing_ratio)

We include **status-known** (not status value) so inactive firms aren't penalized.


In [7]:
credibility_flag_cols = [
    'has_website', 'has_address', 'has_phone',
    'has_ticker', 'has_parent', 'has_global_ultimate', 'has_domestic_ultimate',
    'has_registration_number', 'has_company_description',
    'has_company_status'
]

# Completeness score: how many key fields are filled

df['credibility_score'] = df[credibility_flag_cols].sum(axis=1)
df['credibility_score_norm'] = df['credibility_score'] / len(credibility_flag_cols)

# Missingness proxy: higher = more opaque record

df['missing_ratio'] = 1 - df['credibility_score_norm']


## 4) Organisational complexity (group size by global ultimate)

Companies sharing the same global ultimate are treated as belonging to the same group.


In [8]:
df['global_ultimate_key'] = (
    df['Global Ultimate Company'].astype('string').str.strip().str.upper()
)

# Group size (aligned to rows)
group_sizes = df.groupby('global_ultimate_key')['global_ultimate_key'].transform('size')

# Avoid treating missing global ultimate as one mega-group

df['org_complexity_count'] = np.where(df['global_ultimate_key'].notna(), group_sizes, 0)
df['log_org_complexity_count'] = np.log1p(df['org_complexity_count'])


## 5) Scale + market signals

These control for size so clustering isn't just "big vs small".


In [9]:
# Coerce likely numeric fields

for col in ['Employees Single Site', 'Employees Total', 'Revenue (USD)', 'Market Value (USD)', 'Company Sites', 'Corporate Family Members', 'Year Found']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Log transforms

df['log_employees_total'] = np.log1p(df['Employees Total'])
df['log_employees_single_site'] = np.log1p(df['Employees Single Site'])
df['log_revenue_usd'] = np.log1p(df['Revenue (USD)'])

df['log_market_value_usd'] = np.log1p(df['Market Value (USD)'])
df['log_company_sites'] = np.log1p(df['Company Sites'])
df['log_corporate_family_members'] = np.log1p(df['Corporate Family Members'])

# Company age
CURRENT_YEAR = 2026

df['company_age'] = CURRENT_YEAR - df['Year Found']
# Clamp impossible ages to NaN (Member 1 should have cleaned, but this is a safety net)
df.loc[(df['company_age'] < 0) | (df['company_age'] > 300), 'company_age'] = np.nan


## 6) Geography + multinational heuristics

We avoid one-hot encoding high-cardinality city names. Use country/region + parent/ultimate country comparisons.


In [10]:
# Coordinates are useful for PCA/clustering if available
for col in ['Lattitude', 'Longitude']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Multinational flags: compare entity country to parent/ultimate country

def normalized_country(series):
    return series.astype('string').str.strip().str.upper()

entity_country = normalized_country(df['Country'])
parent_country = normalized_country(df['Parent Country/Region'])
global_country = normalized_country(df['Global Ultimate Country Name'])

df['parent_foreign_flag'] = ((df['has_parent'] == 1) & (parent_country.ne('') ) & (parent_country != entity_country)).astype(int)
df['global_ultimate_foreign_flag'] = ((df['has_global_ultimate'] == 1) & (global_country.ne('')) & (global_country != entity_country)).astype(int)

df['multinational_flag'] = ((df['parent_foreign_flag'] == 1) | (df['global_ultimate_foreign_flag'] == 1)).astype(int)


## 7) IT / operational footprint signals

We parse IT asset ranges into midpoints, then build intensity & composition signals.


In [11]:
# Midpoints for range-like IT asset fields

df['pc_midpoint'] = df['No. of PC'].apply(range_to_midpoint)
df['desktops_midpoint'] = df['No. of Desktops'].apply(range_to_midpoint)
df['laptops_midpoint'] = df['No. of Laptops'].apply(range_to_midpoint)
df['routers_midpoint'] = df['No. of Routers'].apply(range_to_midpoint)
df['servers_midpoint'] = df['No. of Servers'].apply(range_to_midpoint)
df['storage_devices_midpoint'] = df['No. of Storage Devices'].apply(range_to_midpoint)

it_assets = ['pc_midpoint', 'desktops_midpoint', 'laptops_midpoint', 'routers_midpoint', 'servers_midpoint', 'storage_devices_midpoint']

df['it_assets_total'] = df[it_assets].fillna(0).sum(axis=1)
df['log_it_assets_total'] = np.log1p(df['it_assets_total'])

# IT Budget / Spend

df['it_budget'] = pd.to_numeric(df['IT Budget'], errors='coerce')
df['it_spend']  = pd.to_numeric(df['IT spend'], errors='coerce')

df['log_it_budget'] = np.log1p(df['it_budget'])
df['log_it_spend']  = np.log1p(df['it_spend'])

df['it_spend_rate'] = (df['it_spend'] / df['it_budget']).clip(lower=0, upper=3)
df['it_budget_gap'] = df['it_budget'] - df['it_spend']
df['log_abs_it_budget_gap'] = np.log1p(df['it_budget_gap'].abs())

# Intensity per employee (size-normalized)
emp = df['Employees Total']
mask_emp = emp > 0

df['it_assets_per_employee'] = np.nan
df.loc[mask_emp, 'it_assets_per_employee'] = df.loc[mask_emp, 'it_assets_total'] / emp[mask_emp]

df['it_spend_per_employee'] = np.nan
df.loc[mask_emp, 'it_spend_per_employee'] = df.loc[mask_emp, 'it_spend'] / emp[mask_emp]

df['log_it_assets_per_employee'] = np.log1p(df['it_assets_per_employee'])
df['log_it_spend_per_employee'] = np.log1p(df['it_spend_per_employee'])

# Composition: infrastructure vs endpoints
endpoint_total = df[['desktops_midpoint', 'laptops_midpoint', 'pc_midpoint']].fillna(0).sum(axis=1)
infra_total = df[['servers_midpoint', 'storage_devices_midpoint', 'routers_midpoint']].fillna(0).sum(axis=1)

eps = 1e-9

df['infra_to_endpoint_ratio'] = infra_total / (endpoint_total + eps)
df['log_infra_to_endpoint_ratio'] = np.log1p(df['infra_to_endpoint_ratio'])

# IT reporting completeness (how many IT fields are present)
it_raw_cols = [
    'No. of Desktops', 'No. of Laptops', 'No. of Routers', 'No. of Servers',
    'No. of Storage Devices', 'No. of PC', 'IT Budget', 'IT spend'
]

df_it_trimmed = df[it_raw_cols].apply(lambda s: s.astype('string').str.strip())
df['it_reporting_score'] = (df_it_trimmed.notna() & df_it_trimmed.ne('')).sum(axis=1)
df['it_reporting_score_norm'] = df['it_reporting_score'] / len(it_raw_cols)
df['it_missing_ratio'] = 1 - df['it_reporting_score_norm']


## 8) Industry code features (low-cardinality sector buckets)

Full industry codes can be high-cardinality. For clustering/PCA, we use **2-digit buckets** (industry sectors).


In [12]:
def code_prefix(series, n=2):
    # Convert codes to strings and take first n digits/characters
    s = series.astype('string').str.strip()
    # Keep digits only (common for codes)
    s = s.str.replace(r'\D+', '', regex=True)
    return s.str[:n]

# 2-digit buckets (safe for PCA/clustering)

df['sic2'] = code_prefix(df['SIC Code'], n=2)
df['naics2'] = code_prefix(df['NAICS Code'], n=2)
df['nace2'] = code_prefix(df['NACE Rev 2 Code'], n=2)
df['anzsic2'] = code_prefix(df['ANZSIC Code'], n=2)
df['isic2'] = code_prefix(df['ISIC Rev 4 Code'], n=2)


## 9) Categorical encoding (PCA/clustering ready)

We one-hot encode selected low-cardinality categoricals + sector buckets.


In [13]:
categorical_cols = [
    'Region', 'Entity Type', 'Ownership Type',
    'Legal Status', 'Franchise Status', 'Manufacturing Status',
    'Is Headquarters', 'Is Domestic Ultimate',
    'Registration Number Type',
    'sic2', 'naics2', 'nace2', 'anzsic2', 'isic2'
]

# Convert booleans/flags that may be stored as strings
for col in ['Is Headquarters', 'Is Domestic Ultimate']:
    if col in df.columns:
        df[col] = df[col].astype('string').str.strip().str.lower().map({'true': 1, 'false': 0, 'yes': 1, 'no': 0}).fillna(df[col])

# One-hot encode

df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=False, dummy_na=False)
print('After encoding:', df_encoded.shape)


After encoding: (8559, 422)


## 10) Build final feature matrices (raw + scaled)

- Drop obvious identifiers & free-text
- Keep numeric + engineered features + one-hot columns
- Create scaled matrix for PCA/clustering


In [22]:
# Drop identifier and free-text columns (keep only derived numeric/one-hot)
id_cols = ['DUNS Number ', 'DUNS Number']
text_cols = [
    'Website', 'Address Line 1', 'City', 'State', 'State Or Province Abbreviation', 'Postal Code', 'Country', 'Phone Number',
    'SIC Description', '8-Digit SIC Description', 'NAICS Description', 'NACE Rev 2 Description',
    'ANZSIC Description', 'ISIC Rev 4 Description',
    'Company Description',
    'Parent Company', 'Parent Street Address', 'Parents City', 'Parent State/Province', 'Parent State/Province Abbreviation', 'Parent Postal Code', 'Parent Country/Region',
    'Global Ultimate Company', 'Global Ultimate Street Address', 'Global Ultimate City Name', 'Global Ultimate State/Province', 'Ultimate State/Province Abbreviation', 'Global Ultimate Postal Code', 'Global Ultimate Country Name',
    'Domestic Ultimate Company', 'Domestic Ultimate Street Address', 'Domestic Ultimate City Name', 'Domestic Ultimate State/Province Name', 'Domestic Ultimate State Abbreviation', 'Domestic Ultimate Postal Code',
    'Registration Number',
    status_col,
]

cols_to_drop = [c for c in (id_cols + text_cols) if c in df_encoded.columns]

# Keep features

df_features_raw = df_encoded.drop(columns=cols_to_drop)

# Remove any remaining non-numeric columns (safety)
df_features_raw = df_features_raw.select_dtypes(include=[np.number])

print('Feature matrix shape:', df_features_raw.shape)

# Scale for PCA/clustering
scaler = StandardScaler()
df_features_scaled = pd.DataFrame(
    scaler.fit_transform(df_features_raw),
    columns=df_features_raw.columns,
    index=df_features_raw.index
)

# Quick check
print('Any NaNs in features?', df_features_raw.isna().any().any())
df_features_raw.to_csv("df_features_raw_member2.csv", index=False)


Feature matrix shape: (8559, 68)
Any NaNs in features? True


  updated_mean = (last_sum + new_sum) / updated_sample_count
  T = new_sum / new_sample_count
  new_unnormalized_variance -= correction**2 / new_sample_count


In [19]:
# 1) Count missing values (NaNs) in each feature column
nan_counts = df_features_raw.isna().sum()

# 2) Keep only columns that actually have NaNs
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)

# 3) Convert into a clean table for export
nan_report = nan_counts.reset_index()
nan_report.columns = ["feature_name", "nan_count"]

# 4) Add percentage of rows missing in each feature (easier to interpret)
nan_report["nan_pct"] = nan_report["nan_count"] / len(df_features_raw)

# 5) Save to CSV so I can inspect it
nan_report.to_csv("nan_report_member2.csv", index=False)

print("Saved:", "nan_report_member2.csv")
nan_report.head(10)

Saved: nan_report_member2.csv


Unnamed: 0,feature_name,nan_count,nan_pct
0,Company Sites,8559,1.0
1,log_company_sites,8559,1.0
2,Ticker,8555,0.999533
3,ANZSIC Code,7133,0.833392
4,NACE Rev 2 Code,7045,0.82311
5,ISIC Rev 4 Code,7045,0.82311
6,Lattitude,6649,0.776843
7,Longitude,6647,0.776609
8,NAICS Code,5387,0.629396
9,8-Digit SIC Code,5309,0.620283


In [20]:
# 1) Identify which rows have ANY NaNs across the feature matrix
rows_with_nan_mask = df_features_raw.isna().any(axis=1)

# 2) Pull out the rows that have NaNs
rows_with_nan = df_features_raw.loc[rows_with_nan_mask]

print("Rows with at least one NaN:", rows_with_nan.shape[0])

# 3) Find the top 20 columns with the most NaNs (to focus the sample)
top_nan_features = (
    df_features_raw.isna().sum()
    .sort_values(ascending=False)
    .head(20)
    .index
)

# 4) Export a small sample (e.g., first 200 rows) of ONLY the top NaN columns
nan_sample = rows_with_nan.loc[:, top_nan_features].head(200)

# 5) Save to CSV for inspection
nan_sample.to_csv("nan_rows_sample_member2.csv", index=False)

print("Saved:", "nan_rows_sample_member2.csv")
nan_sample.head()

Rows with at least one NaN: 8559
Saved: nan_rows_sample_member2.csv


Unnamed: 0,Company Sites,log_company_sites,Ticker,ANZSIC Code,NACE Rev 2 Code,ISIC Rev 4 Code,Lattitude,Longitude,NAICS Code,8-Digit SIC Code,storage_devices_midpoint,servers_midpoint,it_spend_rate,routers_midpoint,it_spend_per_employee,log_it_assets_per_employee,log_it_spend_per_employee,it_assets_per_employee,company_age,Year Found
0,,,,3322.0,4672.0,4662.0,,,423510.0,50510000.0,,,,5.5,0.0,3.135494,0.0,22.0,3.0,2023.0
1,,,,,,,,,,,5.5,5.5,0.619889,5.5,173.6,0.97456,5.162498,1.65,18.0,2008.0
2,,,,,,,47.34088,123.96045,311411.0,20370000.0,5.5,5.5,0.619999,5.5,605.404494,0.501796,6.407547,0.651685,13.0,2013.0
3,,,,,,,,,,,,,0.619948,,,,,,14.0,2012.0
4,,,,,,,,,,,5.5,5.5,0.619993,5.5,29314.5,2.862201,10.285872,16.5,15.0,2011.0


In [21]:
# 1) Capture dtypes for every feature column
feature_dtypes = df_features_raw.dtypes.astype(str).reset_index()
feature_dtypes.columns = ["feature_name", "dtype"]

# 2) Save
feature_dtypes.to_csv("feature_dtypes_member2.csv", index=False)

print("Saved:", "feature_dtypes_member2.csv")
feature_dtypes.head()

Saved: feature_dtypes_member2.csv


Unnamed: 0,feature_name,dtype
0,Company Sites,float64
1,Employees Single Site,float64
2,Employees Total,int64
3,Revenue (USD),int64
4,SIC Code,int64


## 11) Export for Member 3 (optional)

Uncomment to export once Member 1's clean_base.csv is ready.


In [15]:
# df_features_raw.to_csv('data/processed/features_for_clustering_raw.csv', index=False)
# df_features_scaled.to_csv('data/processed/features_for_clustering_scaled.csv', index=False)

# Save feature list for handoff
# pd.DataFrame({'feature': df_features_raw.columns}).to_csv('docs/feature_list_member2.csv', index=False)
